Data Science Tutorial: Age vs. Height

Objective: Apply your newly acquired Data Science knowledge to explore a simple real-world relationship and build a basic predictive model. This is a quick hands-on exercise to get you comfortable with the tools. This tutorial requires basic knowledge of Python and Data Science.

We will use sklearn (the import name of the scikit-learn library), which is built on NumPy and SciPy and focuses specifically on machine learning algorithms. The functionality that scikit-learn provides includes regression (such as Linear Regression) and classification (such as Logistic Regression and K-Nearest Neighbors).

Before you start coding, consider creating a virtual environment first. That has been explained in this tutorial.

Refresher & Tools

Briefly revisit the “Linear Regression” and “R-squared” sections on W3Schools Data Science if you need a quick reminder.
Ensure your Python environment is ready and you can import libraries like pandas, matplotlib, and scikit-learn (for sklearn).

To install a library, run:

pip install scikit-learn

In Thonny, you can install libraries via Tools > Manage packages.

Make sure to install all necessary libraries.

Get Your Data

For this tutorial we selected a small dataset from Kaggle that contains Age, Height, Weight and Bmi data.

Download this small example dataset (it’s a common dataset from Kaggle).
Place the .csv file in the same directory as your Python script.

Next, create a new Python script and import the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

1. Load and Sample Your Data

Load the .csv file into a Pandas DataFrame (called df_full in the code below) using pd.read_csv.
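
A minimal sketch of the loading step. The real file name is not given here, so the rows below are made-up stand-ins read from an in-memory string; with the downloaded dataset you would pass the path of the .csv file to pd.read_csv instead. The column names follow the ones mentioned above (Age, Height, Weight, Bmi) and are an assumption about the exact spelling in the file.

```python
import io

import pandas as pd

# Made-up stand-in rows so the sketch runs without the real file.
# With the real dataset, pass the path of the downloaded file instead,
# e.g. pd.read_csv("your_dataset.csv") (hypothetical file name).
csv_text = """Age,Height,Weight,Bmi
5,110.0,19.0,15.7
10,138.0,32.0,16.8
15,165.0,55.0,20.2
30,172.0,70.0,23.7
50,170.0,75.0,26.0
"""

df_full = pd.read_csv(io.StringIO(csv_text))

# Quick sanity checks: size, column names and missing values per column
print(df_full.shape)
print(df_full.columns.tolist())
print(df_full.isna().sum())
```

If the printed column names differ from 'Age' and 'Height', adjust the names used in the rest of the script accordingly.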

Beware that datasets might be large. To ease development and testing, consider working with a small random sample*:

# df_full is the full DataFrame loaded from the .csv file
df = df_full.sample(n=50, random_state=42)

However, if you use the given dataset, you can skip this and work with the full DataFrame directly (df = df_full).

* Splitting the data into training and testing sets also leaves a smaller set to train on, although the main purpose of such a split is to evaluate the model on data it has not seen.

Check that the dataset loads properly by printing the first rows:

print(df.head())

Run the script.

2. Explore and Visualize

Print a bit more information about the dataset. For example, what is the age range (the minimum and maximum of Age)? What is the average height?
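
The questions above could be answered along these lines. This is only a sketch: the DataFrame is filled with made-up placeholder values so the code runs on its own, and the column names 'Age' and 'Height' are assumed to match the dataset.

```python
import pandas as pd

# Made-up placeholder values; replace this with the DataFrame loaded from the .csv file
df = pd.DataFrame({
    "Age": [5, 10, 15, 30, 50],
    "Height": [110.0, 138.0, 165.0, 172.0, 170.0],
})

age_min = df["Age"].min()
age_max = df["Age"].max()
mean_height = df["Height"].mean()

print(f"Age range: {age_min} to {age_max}")
print(f"Average height: {mean_height:.1f}")

# describe() reports count, mean, std, min, quartiles and max per numeric column
print(df.describe())

# Pearson correlation between Age and Height (a value between -1 and 1)
print(df["Age"].corr(df["Height"]))
```

The correlation gives a first numeric hint about the relationship before you look at the scatter plot.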

Create a scatter plot with ‘Age’ on the X-axis and ‘Height’ on the Y-axis.

Does there appear to be a relationship?

Example code:

plt.scatter(df['Age'], df['Height'])
plt.xlabel('Age')
plt.ylabel('Height')
plt.title('Scatter Plot of Age vs. Height')
plt.show()

3. Build Your Model

Using sklearn.linear_model.LinearRegression, create a simple linear regression model.

Train your model (if needed, using a training set) to predict Height based on Age.

Example code:

# Assuming your columns are named 'Age' and 'Height'
# Reshape 'Age' for sklearn if it's a single feature
X = df[['Age']] # Input feature must be 2D
y = df['Height'] # Target variable

model = LinearRegression()
model.fit(X, y)
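
If you want to train on a separate training set, train_test_split (imported at the top of the script) can be used. The sketch below runs on synthetic placeholder data, not on the Kaggle dataset, and the 80/20 split ratio is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (not the Kaggle dataset):
# height grows roughly linearly with age, plus some random noise
rng = np.random.default_rng(42)
age = rng.integers(2, 18, size=100)
height = 80 + 5.5 * age + rng.normal(0, 4, size=100)
df = pd.DataFrame({"Age": age, "Height": height})

X = df[["Age"]]   # 2D input feature
y = df["Height"]  # target variable

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only
model = LinearRegression()
model.fit(X_train, y_train)

print(f"Training rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"Slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")
```

The slope is the predicted change in height per additional year of age. Calling model.score(X_test, y_test) afterwards would give R-squared on rows the model has not seen.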

Evaluate Your Model

Calculate the R-squared value for your model.

What is your R-squared value? What does this number generally tell you about how well age explains height in this dataset?

Example code:

r_squared = model.score(X, y) # For this exercise, we evaluate on training data
print(f"R-squared value: {r_squared:.8f}")
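
To see what the score means, R-squared can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on synthetic placeholder data (not the Kaggle dataset) and also uses mean_squared_error, which is imported at the top of the script.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic placeholder data (not the Kaggle dataset)
rng = np.random.default_rng(0)
age = rng.integers(2, 18, size=60)
height = 80 + 5.5 * age + rng.normal(0, 4, size=60)
X = pd.DataFrame({"Age": age})
y = pd.Series(height, name="Height")

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared_manual = 1 - ss_res / ss_tot

# Mean squared error and its square root (same unit as the target)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)

print(f"R-squared (manual): {r_squared_manual:.8f}")
print(f"R-squared (score):  {model.score(X, y):.8f}")
print(f"RMSE: {rmse:.2f}")
```

Both R-squared values should agree. The RMSE complements R-squared by expressing the typical prediction error in the unit of Height.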

Make a Prediction

Use your trained model to predict the height for a new, arbitrary age (e.g., predict height for someone who is 7 years old, or 25 years old). Remember to reshape the input age for the model.

Print out the predicted height.

Example code:

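# Example prediction for a 12-year-old
# Input must be 2D; a DataFrame with the same column name as the training data
# avoids scikit-learn's feature-name warning
new_age = pd.DataFrame({'Age': [12]})
predicted_height = model.predict(new_age)
print(f"Predicted height for age 12: {predicted_height[0]:.2f} cm")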
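
Several ages can be predicted in one call. The sketch below uses made-up points that lie exactly on the line Height = 80 + 5 * Age, purely so the results are easy to check; these numbers are not from the dataset and the unit is left unspecified.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up points on an exact line (Height = 80 + 5 * Age), not real measurements
df = pd.DataFrame({
    "Age": [2, 4, 6, 8, 10],
    "Height": [90.0, 100.0, 110.0, 120.0, 130.0],
})
model = LinearRegression().fit(df[["Age"]], df["Height"])

# One row per age to predict; the column name matches the training data
new_ages = pd.DataFrame({"Age": [7, 12, 25]})
predictions = model.predict(new_ages)

for age, height in zip(new_ages["Age"], predictions):
    # Ages above 10 lie outside the range the model was trained on
    note = " (outside training range)" if age > df["Age"].max() else ""
    print(f"Age {age}: predicted height {height:.1f}{note}")
```

With these placeholder numbers the model predicts 205 for age 25, because a straight line keeps rising beyond the ages it was trained on. Keep this in mind when you judge predictions for ages outside the range of your data.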

Visualize the Outcome

Visualize the regression line in the scatter plot for better interpretation. The adjusted code below adds the regression line to the scatter plot:

# Visualize the regression line in a scatter plot for better interpretation.
plt.scatter(df['Age'], df['Height'])
plt.plot(df['Age'], model.predict(X), color='red', linewidth=2)
plt.xlabel('Age')
plt.ylabel('Height')
plt.title('Scatter Plot of Age vs. Height with Regression Line')
plt.show()

Reflection

What do you think of this prediction?
What conclusions can you draw from this exercise?
Is this a useful approach?
What would you do differently?

Learn from other findings and examples

Take a look at the dataset page on Kaggle. Apart from the extensive description of the dataset on the first tab, it has a ‘Code’ tab, where you can find code examples (mostly Jupyter notebooks) from others using that dataset. For example, the notebook “Basic Linear Regression to predict height from age” looks very similar to what we have done.

Apart from nice examples, the ‘Code’ tab also contains some analysis, such as “Translate data into Insights – XGBoost”, which presents conclusions in between the code samples.

You can learn a lot from these Code examples!