Objective: Apply your newly acquired Data Science knowledge to explore a simple real-world relationship and build a basic predictive model. This is a quick hands-on exercise to get you comfortable with the tools. This tutorial requires basic knowledge of Python and Data Science.
We will use sklearn (from the scikit-learn distribution), a library built on SciPy that focuses specifically on machine learning algorithms. Among the functionality scikit-learn provides are regression (including linear and logistic regression) and classification (including k-nearest neighbors).
Before you start coding, consider creating a virtual environment first; how to do that has been explained in this tutorial.
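If you have not set one up yet, creating and activating a virtual environment from the terminal typically looks like this (the folder name .venv is just a common choice; on Windows, run .venv\Scripts\activate instead of the source command):
python -m venv .venv
source .venv/bin/activate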
Refresher & Tools
- Briefly revisit the “Linear Regression” and “R-squared” sections on W3Schools Data Science if you need a quick reminder
- Ensure your Python environment is ready and you can import libraries like pandas, matplotlib, and scikit-learn (imported as sklearn).
To install a library, run:
pip install scikit-learn
In Thonny, you can install libraries via Tools > Manage packages.
Make sure to install all necessary libraries.
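If you use pip from the terminal, one command should cover everything this exercise needs (numpy is installed automatically as a dependency of pandas):
pip install pandas matplotlib scikit-learn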
Get Your Data
For this tutorial we selected a small dataset from Kaggle that contains Age, Height, Weight and Bmi data.
- Download this small example dataset (it’s a common dataset from Kaggle).
- Place the .csv file in the same directory as your Python script.
Next, create a new Python script and import the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
1. Load and Sample Your Data
Load the .csv file into a pandas DataFrame.
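A minimal way to do this with pandas, assuming the downloaded file is called something like age_height_bmi.csv (replace the filename with the name of your own file):
df_full = pd.read_csv('age_height_bmi.csv')  # filename is an assumption -- use your own file's name
df = df_full  # if you skip the sampling step below, simply keep working with the full dataset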
Be aware that datasets can be large. To ease development and testing, consider working with a small random sample*:
df = df_full.sample(n=50, random_state=42)
However, if you use the given dataset, you can skip this.
* Another way to reduce the size is to split the data into training and testing sets.
Check that the dataset loads properly by printing some info:
print(df.head())
Run the script.
2. Explore and Visualize
Print a bit more information about the dataset. For example, what is the age range (the minimum and maximum of Age)? What is the average height?
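One possible way to answer these questions, assuming the columns are named 'Age' and 'Height':
print(f"Age range: {df['Age'].min()} to {df['Age'].max()}")
print(f"Average height: {df['Height'].mean():.1f}")
print(df.describe())  # summary statistics for all numeric columns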
Create a scatter plot with ‘Age’ on the X-axis and ‘Height’ on the Y-axis.
Does there appear to be a relationship?
Example code:
plt.scatter(df['Age'], df['Height'])
plt.xlabel('Age')
plt.ylabel('Height')
plt.title('Scatter Plot of Age vs. Height')
plt.show()
3. Build Your Model
Using sklearn.linear_model.LinearRegression, create a simple linear regression model.
Train your model (if needed, using a training set) to predict Height based on Age.
Example code:
# Assuming your columns are named 'Age' and 'Height'
# Reshape 'Age' for sklearn if it's a single feature
X = df[['Age']] # Input feature must be 2D
y = df['Height'] # Target variable
model = LinearRegression()
model.fit(X, y)
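If you do want to work with a separate training set (this is what the train_test_split import at the top of the script is for), a minimal sketch could look like this:
# Optional: hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)  # fit only on the training portion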
4. Evaluate Your Model
Calculate the R-squared value for your model.
What is your R-squared value? What does this number generally tell you about how well age explains height in this dataset?
Example code:
r_squared = model.score(X, y) # For this exercise, we evaluate on training data
print(f"R-squared value: {r_squared:.8f}")
5. Make a Prediction
Use your trained model to predict the height for a new, arbitrary age (e.g., predict height for someone who is 7 years old, or 25 years old). Remember to reshape the input age for the model.
Print out the predicted height.
Example code:
# Example prediction for a 12-year-old
new_age = np.array([[12]]) # Input must be 2D
predicted_height = model.predict(new_age)
print(f"Predicted height for age 12: {predicted_height[0]:.2f} cm")
6. Visualize the Outcome
Visualize the regression line in the scatter plot for easier interpretation. The adjusted part of the code below adds the regression line to the scatter plot:
# Visualize the regression line in a scatter plot for better interpretation.
plt.scatter(df['Age'], df['Height'])
plt.plot(df['Age'], model.predict(X), color='red', linewidth=2)
plt.xlabel('Age')
plt.ylabel('Height')
plt.title('Scatter Plot of Age vs. Height with Regression Line')
plt.show()
Reflection
- What do you think of this prediction?
- What conclusions can you draw from this exercise?
- Is this a useful approach?
- What would you do differently?
Learn from other findings and examples
Take a look at the dataset page on Kaggle. Apart from the extensive description of the dataset on the first tab, it has a ‘Code’ tab, where you can find code examples (mostly Jupyter notebooks) from others using the same dataset. For example, the notebook “Basic Linear Regression to predict height from age” looks very similar to what we have done here.
Apart from nice examples, it also contains some analysis, such as the notebook “Translate data into Insights – XGBoost”, where you will find some conclusions in between the code samples.

You can learn a lot from these code examples!