
Exploring the Boston Housing Dataset with TensorFlow in Python

  • Aug 15, 2024
  • 6 min read


Housing price prediction is a common example used to demonstrate regression in machine learning. The Boston Housing dataset contains information about housing conditions and neighborhood characteristics, making it a useful dataset for learning how predictive models work.


In this tutorial, we’ll use TensorFlow to build a simple regression model that predicts housing prices. Along the way, we’ll cover data preprocessing, building the neural network, training the model, and evaluating its performance.



What is the Boston Housing Dataset?

The Boston Housing dataset is one of the most well-known datasets used in machine learning for demonstrating regression techniques. It contains information collected by the U.S. Census Service about housing in different neighborhoods of Boston, Massachusetts. Researchers and practitioners often use this dataset to explore how various factors influence housing prices.


The dataset includes 506 records, with each record representing a suburb or neighborhood in Boston. For every instance, there are 13 input features that describe aspects of the area such as crime rate, housing conditions, accessibility to highways, and environmental factors. The goal of the dataset is to predict the median value of owner-occupied homes, which serves as the target variable.


The target variable, MEDV, represents the median house price in units of $1000. By analyzing the relationships between the input features and this target variable, machine learning models can learn to estimate housing prices based on the characteristics of a neighborhood.

Below is a brief description of the 13 input features, along with the target variable (MEDV):


  1. CRIM: Per capita crime rate by town

  2. ZN: Proportion of residential land zoned for lots over 25,000 sq. ft.

  3. INDUS: Proportion of non-retail business acres per town

  4. CHAS: Charles River dummy variable (1 if tract bounds river, 0 otherwise)

  5. NOX: Nitric oxide concentration (parts per 10 million)

  6. RM: Average number of rooms per dwelling

  7. AGE: Proportion of owner-occupied units built before 1940

  8. DIS: Weighted distances to five Boston employment centers

  9. RAD: Index of accessibility to radial highways

  10. TAX: Full-value property tax rate per $10,000

  11. PTRATIO: Pupil-teacher ratio by town

  12. B: 1000(Bk − 0.63)² where Bk is the proportion of Black residents by town

  13. LSTAT: Percentage of lower status of the population

  14. MEDV: Median value of owner-occupied homes in $1000s

Exploring the Boston Housing Dataset with TensorFlow in Python

Exploring this dataset with TensorFlow provides a practical way to understand regression using neural networks. The dataset contains several socio-economic and geographical features related to housing in Boston, Massachusetts, and the objective is to use these features to predict the median value of homes in different neighborhoods.

By using TensorFlow, we can load the dataset, preprocess the data, build a neural network model, and evaluate its performance. This process also highlights the importance of preparing data properly, especially through techniques such as feature scaling, which helps improve the efficiency and stability of neural network training.


1. Loading and Preprocessing the Boston Housing Dataset

The first step is to load the dataset and perform basic preprocessing. TensorFlow provides the Boston Housing dataset through its keras.datasets module, which allows us to quickly access the training and testing data.

Since neural networks perform better when features are on a similar scale, we standardize the dataset using StandardScaler. This transformation ensures that each feature has a mean of 0 and a standard deviation of 1, which helps the model train more effectively.

import tensorflow as tf
from tensorflow.keras.datasets import boston_housing
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load the dataset
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

# Standardize the data
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)

# Print the shape of the data
print(f'Training data shape: {train_data.shape}')
print(f'Test data shape: {test_data.shape}')

Running this code produces the following output:

Training data shape: (404, 13)
Test data shape: (102, 13)

This indicates that the training dataset contains 404 samples with 13 features each, while the test dataset contains 102 samples. These features will be used by the neural network to learn patterns that help predict housing prices.
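A quick way to confirm the scaling worked is to check the column means and standard deviations after the transform. The sketch below uses randomly generated stand-in data of the same shape as the training set, so it runs without downloading the dataset; with the real arrays you would pass train_data instead:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in data with the same shape as the training set (404, 13).
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=5.0, size=(404, 13))

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# Each column should now have mean ~0 and standard deviation ~1.
print(np.allclose(scaled.mean(axis=0), 0.0, atol=1e-9))  # True
print(np.allclose(scaled.std(axis=0), 1.0, atol=1e-9))   # True
```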


2. Building the Sequential Model

Next, we build a simple feedforward neural network using TensorFlow’s Keras API. The model consists of fully connected layers that learn patterns in the dataset and estimate housing prices based on the input features.

The network starts with a dense layer containing 64 neurons and a ReLU activation function. ReLU helps the model learn complex relationships in the data by introducing non-linearity. A dropout layer is added after the first dense layer to reduce overfitting by randomly disabling some neurons during training. Another dense layer is then added to further refine the learned features.

Finally, the model ends with a single neuron in the output layer. Since this is a regression problem, the output layer produces a continuous value representing the predicted housing price.

# Build the model
model = Sequential([
    Dense(64, activation='relu',
          input_shape=(train_data.shape[1],)),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(1)  # Output layer for regression
])

# Compile the model
model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae'])

The model is compiled using the Adam optimizer, while mean squared error (MSE) is used as the loss function. We also track mean absolute error (MAE) as an additional performance metric.
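This architecture implies a fixed parameter count, which is worth sanity-checking against what model.summary() reports. A Dense layer has inputs × units weights plus units biases, and Dropout adds no parameters; a quick hand computation (no TensorFlow required):

```python
# Parameter count for the Dense stack above, computed by hand.
# Dense layer params = inputs * units + units (weights + biases);
# Dropout contributes no parameters.
def dense_params(n_in, n_units):
    return n_in * n_units + n_units

layer1 = dense_params(13, 64)   # first layer: 13 features -> 64 units
layer2 = dense_params(64, 64)   # hidden layer: 64 -> 64 units
layer3 = dense_params(64, 1)    # regression output: 64 -> 1 unit
total = layer1 + layer2 + layer3

print(layer1, layer2, layer3, total)  # 896 4160 65 5121
```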


3. Training the Model

Once the model architecture is defined, the next step is to train it using the training dataset. During training, the model learns patterns in the data by adjusting its internal weights to minimize prediction error.

In this example, the model is trained for 100 epochs, meaning the dataset is passed through the network 100 times. We also reserve 20% of the training data for validation, which helps monitor how well the model performs on unseen data during training.

# Train the model
history = model.fit(
    train_data,
    train_targets,
    epochs=100,
    validation_split=0.2,
    batch_size=32,
    verbose=1)

The training output displays the loss and mean absolute error for both the training and validation datasets as the model learns.

Output from the final training epochs:

Epoch 97/100
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 19.4921 - mae: 3.3381 - val_loss: 13.5337 - val_mae: 2.7524
...
Epoch 100/100
11/11 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 19.2941 - mae: 3.2998 - val_loss: 14.3015 - val_mae: 2.8350

These metrics show how the model’s prediction error changes as training progresses and help determine whether the model is learning effectively.
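fit() returns a History object whose .history attribute is a dict of per-epoch metric lists, which makes it easy to spot where validation loss stops improving. A sketch using a stand-in dict (illustrative values, not the actual run) so it executes without training:

```python
# history.history is a dict of per-epoch metric lists; a stand-in
# is used here so the snippet runs without training a model.
history_dict = {
    "loss": [30.0, 24.0, 21.0, 19.5, 19.3],
    "val_loss": [25.0, 18.0, 15.0, 13.9, 14.3],
}

# Epoch where validation loss bottoms out -- a rough early-stopping point.
best_epoch = min(range(len(history_dict["val_loss"])),
                 key=history_dict["val_loss"].__getitem__) + 1
print(best_epoch)  # 4

# To visualize the curves (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.plot(history_dict["loss"], label="train")
# plt.plot(history_dict["val_loss"], label="validation")
# plt.xlabel("epoch"); plt.ylabel("MSE loss"); plt.legend(); plt.show()
```

When the real training history shows validation loss rising while training loss keeps falling, that is a sign of overfitting.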


4. Evaluating the Model

After training the neural network, the next step is to evaluate its performance on the test dataset. The test data contains samples that were not used during training, making it useful for measuring how well the model generalizes to unseen data.

Using the evaluate() function, TensorFlow calculates the loss and the mean absolute error (MAE) based on the model’s predictions.

# Evaluate the model
model.evaluate(test_data, test_targets)

Running this code produces the following output:

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 20.8195 - mae: 3.3060
[25.54439926147461, 3.5264275074005127]

The mean absolute error (MAE) indicates the average difference between the predicted and actual housing prices. In this case, the model’s predictions differ from the actual values by roughly 3.5 units on average, where each unit represents $1,000 in housing price.
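MAE is simply the average absolute gap between predicted and actual values. Computing it by hand on a few illustrative predicted/actual pairs (in $1000s) makes the figure concrete:

```python
import numpy as np

# Illustrative predicted/actual pairs in $1000s (a worked example
# of the metric, not the model's full test-set output).
predicted = np.array([8.02, 16.52, 20.18, 31.13, 24.20])
actual = np.array([7.20, 18.80, 19.00, 27.00, 22.20])

# Mean absolute error: average of |prediction - target|.
mae = float(np.mean(np.abs(predicted - actual)))
print(round(mae, 2))  # 2.08
```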


5. Making Predictions

Once the model has been evaluated, we can use it to make predictions on new data. In this step, we generate predictions for the test dataset and compare them with the actual housing prices.

# Make predictions
predictions = model.predict(test_data)

# Print some predictions
for i in range(5):
    print(f'Predicted value: {predictions[i][0]:.2f}, '
          f'Actual value: {test_targets[i]:.2f}')

Example output from the model:

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
Predicted value: 8.02, Actual value: 7.20
Predicted value: 16.52, Actual value: 18.80
Predicted value: 20.18, Actual value: 19.00
Predicted value: 31.13, Actual value: 27.00
Predicted value: 24.20, Actual value: 22.20

These results show how the model’s predictions compare with the actual housing prices. While the predictions are not perfectly accurate, they are generally close to the real values, indicating that the neural network has successfully learned patterns from the dataset.


Conclusion

We've walked through the process of building a simple regression model using TensorFlow to predict housing prices from the Boston Housing dataset. We started by loading and preprocessing the data, then built and trained a neural network, and finally evaluated its performance.

This example demonstrates how easy it is to get started with TensorFlow for regression tasks. The Boston Housing dataset is just one of many datasets available for experimentation, and TensorFlow's powerful yet intuitive API makes it a great tool for both beginners and experts alike.


Whether you're interested in building more complex models or experimenting with different datasets, TensorFlow provides the flexibility and performance to help you achieve your goals in machine learning.
