Data Science Tutorial: Age vs. Height

Objective: Apply your newly acquired Data Science knowledge to explore a simple real-world relationship and build a basic predictive model. This is a quick hands-on exercise to get you comfortable with the tools. This tutorial requires basic knowledge of Python and Data Science.

We will use scikit-learn, which is imported in Python as sklearn. It builds on NumPy and SciPy and focuses on machine learning algorithms. Its functionality includes regression (such as Linear Regression) and classification (such as Logistic Regression and K-Nearest Neighbors).

Before you start coding, consider creating a virtual environment. That is explained in this tutorial.

Refresher & Tools

  • Briefly revisit the “Linear Regression” and “R-squared” sections on W3Schools Data Science if you need a quick reminder.
  • Ensure your Python environment is ready and you can import libraries like pandas, matplotlib, and scikit-learn (for sklearn).

To install a library, run pip from a terminal, for example:

pip install scikit-learn

In Thonny, you can install libraries via Tools > Manage packages.

Make sure to install all necessary libraries.
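
A quick way to see which libraries are still missing is to ask Python whether it can find them. The following is a minimal sketch using only the standard library; the helper name missing_packages and the list of required libraries are assumptions based on the imports used in this tutorial.

```python
import importlib.util

# Maps the name used in "import ..." to the name used in "pip install ...".
# The two differ for scikit-learn, which is imported as sklearn.
required = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "sklearn": "scikit-learn",
}

def missing_packages(modules):
    """Return the pip names of all modules that Python cannot find."""
    return [
        pip_name
        for module_name, pip_name in modules.items()
        if importlib.util.find_spec(module_name) is None
    ]

missing = missing_packages(required)
if missing:
    print("Missing. Install with: pip install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```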

Get Your Data

Next, create a new Python script and import the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

1. Load and Sample Your Data

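Load your dataset into a DataFrame named df_full, for example with pd.read_csv. If it is large, keep a random sample of 50 rows:

df = df_full.sample(n=50, random_state=42)  # random_state makes the sample repeatable

If you use the given dataset, you can skip the sampling step and work with the full data as df.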

* Splitting the data into training and testing sets also leaves fewer rows to train on, but its main purpose is to check the model on data it has not seen.
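
To see what the sampling step does before running it on real data, it can be tried on a small table. The sketch below uses made-up toy values, not the tutorial's dataset, and the names sample_a and sample_b are only illustrative.

```python
import pandas as pd

# Toy values made up for illustration; this is not the tutorial's dataset.
df_full = pd.DataFrame({
    "Age": list(range(1, 21)),
    "Height": [75 + 5 * age for age in range(1, 21)],
})

# Draw 5 rows at random; a fixed random_state makes the draw repeatable.
sample_a = df_full.sample(n=5, random_state=42)
sample_b = df_full.sample(n=5, random_state=42)

print(sample_a)
print(sample_a.equals(sample_b))  # same seed, same rows
```

Note that the sampled rows keep their original index labels, so the index of the sample is usually not 0, 1, 2, and so on.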

Check that it works by printing the first few rows:

print(df.head())

Run the script.

2. Explore and Visualize

  • Create a scatter plot with ‘Age’ on the X-axis and ‘Height’ (or your height column name) on the Y-axis.
  • Does there appear to be a relationship?

Example code:

plt.scatter(df['Age'], df['Height'])
plt.xlabel('Age')
plt.ylabel('Height')
plt.title('Scatter Plot of Age vs. Height')
plt.show()
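
To put a number on the visual impression, the Pearson correlation coefficient can be computed alongside the plot. This is a minimal sketch with made-up toy values; with your own data you would pass df['Age'] and df['Height'] instead.

```python
import numpy as np

# Toy values made up for illustration; replace with df['Age'] and df['Height'].
ages = np.array([2, 4, 6, 8, 10, 12], dtype=float)
heights = np.array([86, 102, 115, 128, 138, 149], dtype=float)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(ages, heights)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```

A value close to 1 or -1 points to a strong linear relationship, and a value near 0 to a weak one.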

3. Build Your Model

  • Using sklearn.linear_model.LinearRegression, create a simple linear regression model.
  • Train your model (if needed, using a training set) to predict Height based on Age.

Example code:

from sklearn.linear_model import LinearRegression
import numpy as np

# Assuming your columns are named 'Age' and 'Height'
# Double brackets select 'Age' as a one-column DataFrame, the 2D shape sklearn expects
X = df[['Age']]  # Input features (2D)
y = df['Height']  # Target variable (1D)

model = LinearRegression()
model.fit(X, y)
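
After fitting, the model is simply a straight line, and its two numbers can be read from the coef_ and intercept_ attributes. The sketch below uses made-up toy data that follows a known line exactly, so the result is easy to check; the names toy and toy_model are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data that follows Height = 80 + 6 * Age exactly (made up for illustration).
toy = pd.DataFrame({"Age": [1, 2, 3, 4, 5]})
toy["Height"] = 80 + 6 * toy["Age"]

toy_model = LinearRegression()
toy_model.fit(toy[["Age"]], toy["Height"])

slope = toy_model.coef_[0]        # change in height per extra year of age
intercept = toy_model.intercept_  # predicted height at age 0
print(f"Height = {intercept:.2f} + {slope:.2f} * Age")
```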

Evaluate Your Model

  • Calculate the R-squared value for your model.
  • What is your R-squared value? What does this number generally tell you about how well age explains height in this dataset?
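
As background for interpreting the number, R-squared compares the model's squared errors with those of always predicting the mean: 1 minus the ratio of the two. The following sketch computes it by hand in plain Python; the function name r_squared and the toy numbers are made up for illustration.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    # Squared errors of the predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared errors of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy numbers made up for illustration.
actual = [100.0, 110.0, 120.0, 130.0]
perfect = [100.0, 110.0, 120.0, 130.0]
mean_only = [115.0, 115.0, 115.0, 115.0]

print(r_squared(actual, perfect))    # predictions match exactly -> 1.0
print(r_squared(actual, mean_only))  # no better than the mean -> 0.0
```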

Example code:

r_squared = model.score(X, y) # For this exercise, we evaluate on training data
print(f"R-squared value: {r_squared:.2f}")
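
Scoring on the training data tends to give an optimistic picture. A common alternative is to hold back part of the data and score on that instead, using train_test_split and mean_squared_error, which are already imported at the top of the script. The sketch below runs on synthetic stand-in data (a made-up linear trend plus noise), so its numbers say nothing about real heights.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up): a linear trend plus random noise.
rng = np.random.default_rng(0)
ages = rng.uniform(2, 16, size=60)
heights = 80 + 6 * ages + rng.normal(0, 4, size=60)
data = pd.DataFrame({"Age": ages, "Height": heights})

# Hold back 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    data[["Age"]], data["Height"], test_size=0.2, random_state=42
)

held_out_model = LinearRegression().fit(X_train, y_train)

# score() returns R-squared; compare training data with unseen test data.
print(f"Train R-squared: {held_out_model.score(X_train, y_train):.2f}")
print(f"Test R-squared:  {held_out_model.score(X_test, y_test):.2f}")

# Mean squared error is in squared units; its square root is in height units.
mse = mean_squared_error(y_test, held_out_model.predict(X_test))
print(f"Test RMSE: {mse ** 0.5:.2f}")
```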

Make a Prediction

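  • Use your trained model to predict the height for a new, arbitrary age (e.g., predict height for someone who is 7 years old, or 25 years old). Remember to pass the age as 2D input, with the same column name that was used for training.
  • Print out the predicted height.

Example code:

# Example prediction for a 12-year-old
# A one-row DataFrame keeps the 'Age' column name the model was trained with;
# a bare NumPy array also works but makes sklearn warn about missing feature names
new_age = pd.DataFrame({'Age': [12]})  # Input must be 2D
predicted_height = model.predict(new_age)
print(f"Predicted height for age 12: {predicted_height[0]:.2f} cm")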
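
Under the hood, a prediction from a simple linear model is just intercept + slope * age. The sketch below predicts several ages at once and repeats the calculation by hand; the toy data is made up for illustration and the names toy_model and new_ages are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data made up for illustration.
toy = pd.DataFrame({"Age": [2, 4, 6, 8, 10], "Height": [88, 101, 116, 127, 139]})
toy_model = LinearRegression().fit(toy[["Age"]], toy["Height"])

# Predict several ages at once; the column name matches the training data.
new_ages = pd.DataFrame({"Age": [7, 12, 25]})
predictions = toy_model.predict(new_ages)

for age, height in zip(new_ages["Age"], predictions):
    # The same number by hand: intercept + slope * age.
    by_hand = toy_model.intercept_ + toy_model.coef_[0] * age
    print(f"Age {age}: predict() gives {height:.1f}, by hand {by_hand:.1f}")
```

Age 25 lies far outside the ages in the toy data. A straight line keeps rising with age even though people stop growing, so predictions outside the range of the training data deserve extra caution.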

Visualize the Outcome

Draw the regression line on top of the scatter plot to make the fit easier to interpret. The code below extends the earlier scatter plot with one plt.plot call that adds the line:

# Scatter plot of the data with the fitted regression line in red
plt.scatter(df['Age'], df['Height'])
plt.plot(df['Age'], model.predict(X), color='red', linewidth=2)
plt.xlabel('Age')
plt.ylabel('Height')
plt.title('Scatter Plot of Age vs. Height with Regression Line')
plt.show()
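
A variation on this plot is to draw the line over evenly spaced ages instead of the data points themselves, and to save the figure to a file. The sketch below assumes made-up toy data and a hypothetical output file name, age_vs_height_fit.png.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Toy data made up for illustration, deliberately not sorted by age.
toy = pd.DataFrame({"Age": [10, 2, 8, 4, 6], "Height": [139, 88, 127, 101, 116]})
toy_model = LinearRegression().fit(toy[["Age"]], toy["Height"])

# Evenly spaced ages across the observed range give a clean line to draw.
line_x = pd.DataFrame({"Age": np.linspace(toy["Age"].min(), toy["Age"].max(), 50)})
line_y = toy_model.predict(line_x)

fig, ax = plt.subplots()
ax.scatter(toy["Age"], toy["Height"], label="Data")
ax.plot(line_x["Age"], line_y, color="red", linewidth=2, label="Fitted line")
ax.set_xlabel("Age")
ax.set_ylabel("Height")
ax.legend()
fig.savefig("age_vs_height_fit.png")  # hypothetical file name
```

Saving with fig.savefig keeps a copy of the plot for a report; call plt.show() as well if you also want the interactive window.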

Reflection

  • What do you think of this prediction?
  • What conclusions can you draw from this exercise?
  • Is this a useful approach?
  • What would you do differently?