Objective: Apply your newly acquired Data Science knowledge to explore a simple real-world relationship and build a basic predictive model. This is a quick hands-on exercise to get you comfortable with the tools. This tutorial requires basic knowledge of Python and Data Science.
We will use sklearn (from the scikit-learn distribution), which is built on SciPy and focuses specifically on machine learning algorithms. The functionality that scikit-learn provides includes regression (for example Linear and Logistic Regression) and classification (for example K-Nearest Neighbors).
Before you start coding, consider creating a virtual environment first. That has been explained in this tutorial.
Refresher & Tools
- Briefly revisit the “Linear Regression” and “R-squared” sections on W3Schools Data Science if you need a quick reminder
- Ensure your Python environment is ready and you can import libraries like pandas, matplotlib, and scikit-learn (for sklearn).
To install a library, run pip from a terminal, for example:
pip install scikit-learn
In Thonny, you can install libraries via Tools > Manage packages.
Make sure to install all necessary libraries.
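A quick way to confirm that everything is installed is to try importing each library and print its version. The sketch below is one possible check, not part of the original exercise; the helper name check_libraries is made up for illustration.

```python
import importlib

def check_libraries(names):
    """Try to import each library; return its version string, or None if it is missing."""
    status = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            status[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            status[name] = None
    return status

# Note: scikit-learn is imported under the name "sklearn"
required = ["pandas", "numpy", "matplotlib", "sklearn"]
status = check_libraries(required)
for name, version in status.items():
    print(f"{name}: {version if version is not None else 'MISSING, install it first'}")
```

Any library reported as missing can then be installed with pip or via Thonny's package manager.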
Get Your Data
- Download this small example dataset (it’s a common dataset from Kaggle).
- Place the .csv file in the same directory as your Python script.
Next, create a new Python script and import the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
1. Load and Sample Your Data
- Load the .csv file into a Pandas DataFrame (for example with pd.read_csv), here called df_full.
- Beware that datasets might be large. To ease development and testing, consider working with a small random sample*:
df = df_full.sample(n=50, random_state=42)
However, if you use the given dataset, you can skip the sampling and use the full DataFrame as df.
* Another way to reduce the size is to split the data into training and testing sets.
Check that it works by printing the first few rows:
print(df.head())
Run the script.
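The loading and sampling steps can be sketched together as follows. Because the dataset's file name is not given here, the sketch reads a tiny made-up CSV from memory; the values and the n_rows threshold are assumptions for illustration only. With the real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Stand-in for the downloaded file: made-up rows so the sketch runs anywhere.
# With the real dataset, replace io.StringIO(csv_text) with the path to your .csv file.
csv_text = "Age,Height\n2,86\n5,109\n8,128\n11,144\n14,163\n17,174\n"
df_full = pd.read_csv(io.StringIO(csv_text))

# Sample only when the dataset has more rows than you want to work with
n_rows = 4  # hypothetical sample size
if len(df_full) > n_rows:
    df = df_full.sample(n=n_rows, random_state=42)  # random_state makes the sample repeatable
else:
    df = df_full

print(df.shape)
print(df.head())
```

Checking df.shape and the column names early helps catch a wrong file or unexpected column names before you start modelling.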
2. Explore and Visualize
- Create a scatter plot with ‘Age’ on the X-axis and ‘Height’ (or your height column name) on the Y-axis.
- Does there appear to be a relationship?
Example code:
plt.scatter(df['Age'], df['Height'])
plt.xlabel('Age')
plt.ylabel('Height')
plt.title('Scatter Plot of Age vs. Height')
plt.show()
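To back up the visual impression with a number, you could also compute the Pearson correlation between the two columns. This is an optional addition to the exercise; the small DataFrame below contains made-up values and only stands in for your own df.

```python
import pandas as pd

# Made-up illustrative data; use your own DataFrame from step 1 instead
df = pd.DataFrame({
    "Age": [2, 5, 8, 11, 14, 17],
    "Height": [86, 109, 128, 144, 163, 174],
})

# Pearson correlation: close to +1 or -1 suggests a strong linear relationship,
# close to 0 suggests little or no linear relationship
r = df["Age"].corr(df["Height"])
print(f"Correlation between Age and Height: {r:.3f}")
```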
3. Build Your Model
- Using sklearn.linear_model.LinearRegression, create a simple linear regression model.
- Train your model (if needed, using a training set) to predict Height based on Age.
Example code:
from sklearn.linear_model import LinearRegression
import numpy as np
# Assuming your columns are named 'Age' and 'Height'
# Select 'Age' with double brackets so it stays a 2D DataFrame, as sklearn expects
X = df[['Age']] # Input feature must be 2D
y = df['Height'] # Target variable
model = LinearRegression()
model.fit(X, y)
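After fitting, the model stores the line it found in the attributes coef_ (slope) and intercept_. The sketch below uses made-up data that lies exactly on the line Height = 80 + 6 * Age, so you can see that the fitted values match; with real data the numbers will of course differ.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on Height = 80 + 6 * Age (for illustration only)
df = pd.DataFrame({"Age": [2, 5, 8, 11, 14, 17]})
df["Height"] = 80 + 6 * df["Age"]

X = df[["Age"]]   # 2D feature table
y = df["Height"]  # 1D target

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]       # change in height per additional year of age
intercept = model.intercept_ # predicted height at age 0
print(f"Height = {intercept:.2f} + {slope:.2f} * Age")
```

Reading the slope as "units of height per year of age" is a useful sanity check on whether the fitted model makes sense.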
Evaluate Your Model
- Calculate the R-squared value for your model.
- What is your R-squared value? What does this number generally tell you about how well age explains height in this dataset?
Example code:
r_squared = model.score(X, y) # For this exercise, we evaluate on training data
print(f"R-squared value: {r_squared:.2f}")
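Evaluating on the training data tends to give an optimistic score. As an optional extension, the sketch below uses the train_test_split and mean_squared_error imports from the start of the script to score the model on rows it has not seen. The data is synthetic (a straight line plus random noise) and the 25% test size is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data for illustration: a linear trend plus random noise
rng = np.random.default_rng(0)
ages = np.arange(2, 22)
heights = 80 + 5 * ages + rng.normal(0, 3, size=len(ages))
df = pd.DataFrame({"Age": ages, "Height": heights})

X = df[["Age"]]
y = df["Height"]

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# Score on data the model did not see during training
r2_test = model.score(X_test, y_test)
mse_test = mean_squared_error(y_test, model.predict(X_test))
rmse_test = mse_test ** 0.5  # same unit as Height

print(f"Test R-squared: {r2_test:.2f}")
print(f"Test RMSE: {rmse_test:.2f}")
```

The RMSE is often easier to interpret than R-squared because it is expressed in the same unit as the target column.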
Make a Prediction
- Use your trained model to predict the height for a new, arbitrary age (e.g., someone who is 7 or 25 years old). Remember that the model expects a 2D input, just like the training data.
- Print out the predicted height.
# Example prediction for a 12-year-old
new_age = pd.DataFrame({'Age': [12]}) # 2D input with the same column name as in training
predicted_height = model.predict(new_age)
print(f"Predicted height for age 12: {predicted_height[0]:.2f} cm")
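You can also predict several ages in one call. The sketch below again uses made-up data lying exactly on Height = 80 + 6 * Age, and it includes age 25, which is outside the range the model was trained on, to show that a linear model simply continues the line.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data on the line Height = 80 + 6 * Age (ages 2 to 17)
df = pd.DataFrame({"Age": [2, 5, 8, 11, 14, 17]})
df["Height"] = 80 + 6 * df["Age"]

model = LinearRegression()
model.fit(df[["Age"]], df["Height"])

# One row per age to predict; the column name matches the training data
new_ages = pd.DataFrame({"Age": [7, 12, 25]})
predictions = model.predict(new_ages)

for age, height in zip(new_ages["Age"], predictions):
    note = "" if df["Age"].min() <= age <= df["Age"].max() else " (outside the training range)"
    print(f"Predicted height for age {age}: {height:.2f}{note}")
```

Predictions outside the observed age range deserve extra scepticism, since the model has no data to support the trend there.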
Visualize the Outcome
Visualize the regression line in the scatter plot for better interpretation. The code below extends the earlier scatter plot with the regression line:
# Visualize the regression line in a scatter plot for better interpretation.
plt.scatter(df['Age'], df['Height'])
plt.plot(df['Age'], model.predict(X), color='red', linewidth=2)
plt.xlabel('Age')
plt.ylabel('Height')
plt.title('Scatter Plot of Age vs. Height with Regression Line')
plt.show()
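A possible variation is to draw the line over an evenly spaced grid of ages and to look at the residuals (actual minus predicted height). The sketch below uses made-up data and saves the figure to a file; the file name age_height_fit.png is an arbitrary choice, and you can call plt.show() instead if you prefer a window.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Made-up illustrative data, deliberately not sorted by age
df = pd.DataFrame({
    "Age": [14, 2, 11, 5, 17, 8],
    "Height": [163, 86, 144, 109, 174, 128],
})
X = df[["Age"]]
y = df["Height"]
model = LinearRegression().fit(X, y)

# Evenly spaced ages spanning the observed range
age_grid = pd.DataFrame({"Age": np.linspace(df["Age"].min(), df["Age"].max(), 50)})
line = model.predict(age_grid)

# Residuals: how far each actual height is from the fitted line
residuals = y - model.predict(X)
print(residuals.round(2).tolist())

fig, ax = plt.subplots()
ax.scatter(df["Age"], df["Height"], label="Data")
ax.plot(age_grid["Age"], line, color="red", linewidth=2, label="Regression line")
ax.set_xlabel("Age")
ax.set_ylabel("Height")
ax.set_title("Age vs. Height with Regression Line")
ax.legend()
fig.savefig("age_height_fit.png")
```

Large residuals, or residuals that follow a pattern, hint that a straight line may not describe the relationship well.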
Reflection
- What do you think of this prediction?
- What conclusions can you draw from this exercise?
- Is this a useful approach?
- What would you do differently?