Implementing VGG on CIFAR-10 Dataset in Python

Samul Black
Aug 27, 2024
8 min read

Updated: Aug 22

The Visual Geometry Group (VGG) network is a Convolutional Neural Network (CNN) architecture that has been widely used in computer vision tasks. Known for its simple, uniform design and its depth, VGG was among the top entries in the ImageNet ILSVRC 2014 challenge. In this blog, we'll explore how to implement a VGG network on the CIFAR-10 dataset in Python.

Introduction to VGG
Understanding the CIFAR-10 Dataset
Setting Up the Environment
Implementing VGG on CIFAR-10 Dataset in Python with Keras
Training the Model
Evaluating the Model
Conclusion

1. Introduction to VGG

VGG was introduced by the Visual Geometry Group at Oxford and has become one of the most influential CNN architectures. The architecture is characterized by its use of small (3x3) convolution filters, depth, and simplicity. The most commonly used variants are VGG16 and VGG19, named for their number of weight layers. VGG16 consists of 13 convolutional layers and three fully connected layers, while VGG19 has 16 convolutional layers and the same three fully connected layers. Let’s take a brief look at the architecture of VGG:

Input: The VGGNet takes in an image input size of 224×224. For the ImageNet competition, the creators of the model cropped out the center 224×224 patch in each image to keep the input size of the image consistent.
Convolutional Layers: VGG’s convolutional layers use a minimal receptive field of 3×3, the smallest size that still captures up/down and left/right. Some configurations also use 1×1 convolution filters, which act as a linear transformation of the input channels. Each convolution is followed by a ReLU unit, an activation popularized by AlexNet that reduces training time. ReLU stands for rectified linear unit; it is a piecewise linear function that outputs the input if it is positive and zero otherwise. The convolution stride is fixed at 1 pixel, which together with padding preserves the spatial resolution after convolution (stride is the number of pixels the filter shifts over the input at each step).
Hidden Layers: All the hidden layers in the VGG network use ReLU. VGG does not usually leverage Local Response Normalization (LRN) as it increases memory consumption and training time. Moreover, it makes no improvements to overall accuracy.
Fully-Connected Layers: The VGGNet has three fully connected layers. The first two have 4096 channels each, and the third has 1000 channels, one for each ImageNet class. The CIFAR-10 dataset has only 10 classes, so the final layer will have 10 channels.
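
To make the layer counts above concrete, here is a minimal sketch that writes the VGG16 layout as a plain Python list and counts its weight layers, alongside a one-line ReLU. The names VGG16_CONFIG, count_weight_layers, and relu are illustrative choices for this post, not part of Keras or the original VGG code.

```python
import numpy as np

# VGG16 layout: integers are conv output channels, "M" marks a 2x2 max pool.
# (Illustrative constant, not taken from any library.)
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
NUM_DENSE_LAYERS = 3  # the three fully connected layers


def count_weight_layers(config, num_dense):
    """Return (number of conv layers, total weight layers)."""
    num_conv = sum(1 for item in config if item != "M")
    return num_conv, num_conv + num_dense


def relu(x):
    """Rectified linear unit: keep positive values, zero out the rest."""
    return np.maximum(x, 0)


num_conv, total = count_weight_layers(VGG16_CONFIG, NUM_DENSE_LAYERS)
print(num_conv, total)                     # 13 16
print(relu(np.array([-2.0, 0.0, 3.5])))    # negatives become 0
```

Pooling layers carry no weights, which is why they do not count toward the "16" in VGG16.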

2. Understanding the CIFAR-10 Dataset

CIFAR-10 consists of 60,000 color images, each sized 32x32 pixels, distributed across 10 distinct classes. These classes represent a diverse range of everyday objects and animals, making the dataset both challenging and representative of real-world scenarios.

The 10 Classes in CIFAR-10

The images are evenly distributed across the following 10 classes:

Airplane: Images of various types of aircraft.
Automobile: Includes sedans, SUVs, and similar passenger cars (trucks have their own class).
Bird: Various species of birds in different poses.
Cat: Domestic and wild cats in various settings.
Deer: Images of deer in different environments.
Dog: Various dog breeds, often in natural poses.
Frog: Frogs in different postures and backgrounds.
Horse: Images of horses, often in motion or standing.
Ship: Includes various watercraft like boats and ships.
Truck: Heavy vehicles such as lorries and trucks.
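
In code, the labels arrive as integers from 0 to 9 in the same order as the list above. A small lookup table, sketched here with the hypothetical names CIFAR10_CLASSES and label_to_name, turns those integers back into readable class names.

```python
# Class names in label order (index 0 = airplane, ..., index 9 = truck).
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]


def label_to_name(label):
    """Map an integer label (0-9) to its class name."""
    return CIFAR10_CLASSES[int(label)]


print(label_to_name(0))  # airplane
print(label_to_name(9))  # truck
```

When the data is loaded through Keras, each label is stored in an array of shape (N, 1), so a single label would be read as y_train[i][0] before being passed to a helper like this.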

Each class has 6,000 images, making it a well-balanced dataset. The data is divided into 50,000 training images and 10,000 testing images, allowing for a robust evaluation of model performance. The CIFAR-10 dataset is structured as follows:

Training Set: 50,000 images (5,000 images per class)
Test Set: 10,000 images (1,000 images per class)

Each image in the dataset is a 32x32 pixel color image, represented by three channels (Red, Green, and Blue) with pixel values ranging from 0 to 255.
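
The numbers above are easy to sanity-check with a little arithmetic. The sketch below uses only the figures stated in this section; the variable names are illustrative.

```python
NUM_CLASSES = 10
TRAIN_PER_CLASS = 5_000
TEST_PER_CLASS = 1_000
HEIGHT, WIDTH, CHANNELS = 32, 32, 3

train_images = NUM_CLASSES * TRAIN_PER_CLASS        # 50,000
test_images = NUM_CLASSES * TEST_PER_CLASS          # 10,000
values_per_image = HEIGHT * WIDTH * CHANNELS        # 3,072 pixel values

# Each pixel value fits in one byte (0-255), so the raw size in bytes is:
raw_bytes = (train_images + test_images) * values_per_image

print(train_images, test_images, values_per_image)
print(f"{raw_bytes / 2**20:.1f} MiB of raw pixel data")
```

At well under 200 MiB of raw pixels, the whole dataset fits comfortably in memory, which is part of why it is so convenient for experiments.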

CIFAR-10 is often chosen for educational purposes and research due to its simplicity and manageable size. It allows researchers and students to quickly train and test models without the need for extensive computational resources. Moreover, it provides a clear benchmark for comparing different algorithms and approaches.

3. Setting Up the Environment

Before we start implementing VGG, let's ensure that our environment is set up correctly. We’ll be using the following libraries:

TensorFlow/Keras: For building and training the neural network.
NumPy: For numerical computations.
Matplotlib: For visualization.

Install these dependencies using pip:

pip install tensorflow numpy matplotlib

4. Implementing VGG on CIFAR-10 Dataset in Python with Keras

Now, let’s dive into the implementation. We'll use Keras (part of TensorFlow) to build our VGG-like model. Rather than the larger VGG19, we'll follow the VGG16 layer layout (13 convolutional and 3 fully connected layers), adapted to 32x32 inputs and 10 output classes, so the core ideas remain the same.

Import necessary libraries

These import statements bring in the necessary libraries for building and training a deep learning model with TensorFlow. We import tensorflow as tf, which is the standard alias. From TensorFlow’s high-level Keras API, we import layers for building different neural network layers, and models to create model architectures. cifar10 is a built-in dataset consisting of 60,000 small color images in 10 classes, and to_categorical is a utility that helps convert class labels into a one-hot encoded format suitable for classification tasks.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical

Load and Preprocess the CIFAR-10 Dataset

This block loads the CIFAR-10 dataset, which is split into training and testing sets. Each image has pixel values ranging from 0 to 255, so we normalize them by dividing by 255.0, bringing all pixel values into the [0, 1] range, which makes training more stable and faster. The labels are originally integers (e.g., 0 for airplane, 1 for automobile, etc.), and we convert them into one-hot encoded vectors using to_categorical. This is essential when using a softmax activation function in the final layer and a categorical cross-entropy loss function.

# Load and preprocess the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize the data
y_train, y_test = to_categorical(y_train), to_categorical(y_test)
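
To see what these two preprocessing steps do without downloading anything, here is a NumPy-only sketch on a handful of synthetic images. The random data is a stand-in, not real CIFAR-10, and the np.eye indexing trick is just one way to mimic the effect of to_categorical, not how Keras implements it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for four 32x32 RGB images with pixel values 0-255.
images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
# Integer labels with shape (N, 1), the layout cifar10.load_data() returns.
labels = np.array([[3], [0], [9], [3]])

# Normalization: scale pixel values into the [0, 1] range.
scaled = images.astype("float32") / 255.0

# One-hot encoding: row i of the identity matrix is the vector for class i.
one_hot = np.eye(10, dtype="float32")[labels.reshape(-1)]

print(scaled.min() >= 0.0, scaled.max() <= 1.0)  # True True
print(one_hot.shape)                             # (4, 10)
print(one_hot[0])                                # 1.0 at index 3 (cat)
```

Casting to float32 before dividing is an optional choice made in this sketch; dividing the uint8 arrays directly, as in the code above, yields float64 arrays that use twice the memory.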

Define the VGG-like Model

Here, we define a function called build_vgg_model() that returns a VGG-style CNN. We use the Sequential() model from Keras, which allows us to stack layers one after another in a linear manner. This is ideal for models where each layer flows directly into the next without complex branching.

# Define the VGG-like model
def build_vgg_model():
    model = models.Sequential()

This is the first convolutional block. It consists of two convolutional layers with 64 filters each, a kernel size of 3x3, and ReLU activations. The padding='same' ensures the output has the same width and height as the input. The input_shape=(32, 32, 3) specifies the input image size: 32x32 pixels with 3 color channels. After the convolutional layers, a max pooling layer with a 2x2 window downsamples the spatial dimensions by a factor of 2, reducing computation and capturing dominant features.

    model.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(32, 32, 3)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

This is the second block, which follows the same structure as the first. However, the number of filters has increased to 128, allowing the model to capture more complex patterns and features. This increase in depth enables the model to handle more abstract information as the network gets deeper. Again, a max pooling layer reduces the spatial dimensions.

    model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

The third convolutional block consists of three convolutional layers with 256 filters each. As we go deeper, the network becomes more expressive and capable of learning more detailed patterns. Using three convolutions before the pooling layer matches blocks three to five of the original VGG16 architecture. After the convolutions, a max pooling operation again reduces the size of the feature maps.

    model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

This is the fourth block, increasing the depth to 512 filters per convolutional layer. These layers can extract very high-level features like shapes or object parts. By this point, the spatial size of the data is much smaller, but the depth (number of channels) is large, capturing rich information.

    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

The fifth and final convolutional block is structurally identical to the fourth. Keeping the same depth helps the network consolidate the learned high-level features. Another max pooling layer at the end reduces the feature maps to 1x1 with 512 channels (the 32x32 input has now been halved five times) before they pass into the fully connected layers.

    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

After the last pooling layer, the feature maps are flattened into a one-dimensional vector to be fed into dense layers. Two fully connected layers follow, each with 4096 neurons and ReLU activations. These dense layers are responsible for high-level reasoning based on the features extracted by the convolutional layers. The final output layer is also a dense layer with 10 neurons and a softmax activation, which produces a probability distribution across the 10 CIFAR-10 classes, suitable for multi-class classification.

    model.add(layers.Flatten())
    model.add(layers.Dense(4096, activation='relu'))
    model.add(layers.Dense(4096, activation='relu'))
    model.add(layers.Dense(10, activation='softmax'))
    return model
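
It can help to trace how the feature map shape changes through the five blocks described above. The following sketch does the bookkeeping in plain Python, assuming 'same' padding with stride 1 (which keeps height and width) and 2x2 pooling with stride 2 (which halves them); trace_shapes is a hypothetical helper, not a Keras function.

```python
def trace_shapes(input_size=32, block_channels=(64, 128, 256, 512, 512)):
    """Return the (height, width, channels) shape after each block's pooling."""
    shapes = []
    size = input_size
    for channels in block_channels:
        # Convolutions keep the spatial size; the pooling layer halves it.
        size = size // 2
        shapes.append((size, size, channels))
    return shapes


shapes = trace_shapes()
for block, shape in enumerate(shapes, start=1):
    print(f"after block {block}: {shape}")

height, width, channels = shapes[-1]
flattened = height * width * channels
print("flattened vector length:", flattened)  # 512
```

With a 32x32 input, the last pooling layer leaves a single spatial position, so Flatten produces a 512-element vector rather than the much larger vector VGG produces from 224×224 ImageNet images.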

Compile and Summarize the Model

Here, the build_vgg_model() function is called to instantiate the model. We compile it using the Adam optimizer, which is efficient and adaptive for deep networks. The loss function used is categorical_crossentropy, which is standard for multi-class classification tasks with one-hot encoded labels. The model is set to track accuracy during training and evaluation. Finally, model.summary() prints a detailed summary of the architecture, showing each layer's name, output shape, and the number of parameters, providing insight into the structure and size of the network.

# Instantiate and compile the model
model = build_vgg_model()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Display the model's architecture
model.summary()

Output for the above code:
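
The parameter totals that model.summary() reports can be cross-checked by hand. The sketch below assumes the five-block model defined step by step above (13 convolutions, then dense layers of 4096, 4096, and 10 units on a 512-element flattened input); conv_params and dense_params are illustrative helpers.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus one bias per filter for a kernel x kernel convolution."""
    return (kernel * kernel * in_channels + 1) * out_channels


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit for a dense layer."""
    return (in_units + 1) * out_units


CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256,
                 512, 512, 512, 512, 512, 512]
DENSE_UNITS = [4096, 4096, 10]

in_channels = 3  # RGB input
conv_total = 0
for out_channels in CONV_CHANNELS:
    conv_total += conv_params(in_channels, out_channels)
    in_channels = out_channels

in_units = 512  # 1 x 1 x 512 after the final pooling layer and Flatten
dense_total = 0
for out_units in DENSE_UNITS:
    dense_total += dense_params(in_units, out_units)
    in_units = out_units

print(f"conv:  {conv_total:,}")                  # 14,714,688
print(f"dense: {dense_total:,}")                 # 18,923,530
print(f"total: {conv_total + dense_total:,}")    # 33,638,218
```

More than half of the parameters sit in the dense layers, even though the convolutional stack does most of the feature extraction.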

5. Training the Model

After defining the model, it's time to train it. We'll train the model for 5 epochs with a batch size of 64, using the Adam optimizer set at compile time and the test set as validation data.

# Train the VGG model
history = model.fit(x_train, y_train, epochs=5, batch_size=64, validation_data=(x_test, y_test))
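
The number of steps Keras reports per epoch follows directly from the dataset size and batch size. A quick check, using only the values from the call above:

```python
import math

train_samples = 50_000
batch_size = 64

# One step processes one batch; the final batch may be smaller.
steps_per_epoch = math.ceil(train_samples / batch_size)
last_batch_size = train_samples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)   # 782
print(last_batch_size)   # 16
```

So each epoch consists of 781 full batches plus one final batch of 16 images.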

6. Evaluating the Model

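Once the model is trained, we can evaluate its performance on the test dataset. The output below is the training log from the run above, with the fifth epoch still in progress. Validation accuracy stays at 0.1000 and the loss at about 2.3026, which is chance level for 10 classes, so this particular run did not learn anything useful.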

Epoch 1/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 5617s 7s/step - accuracy: 0.0999 - loss: 2.3030 - val_accuracy: 0.1000 - val_loss: 2.3026
Epoch 2/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 5689s 7s/step - accuracy: 0.0993 - loss: 2.3027 - val_accuracy: 0.1000 - val_loss: 2.3026
Epoch 3/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 5645s 7s/step - accuracy: 0.0999 - loss: 2.3026 - val_accuracy: 0.1000 - val_loss: 2.3026
Epoch 4/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 5667s 7s/step - accuracy: 0.0991 - loss: 2.3027 - val_accuracy: 0.1000 - val_loss: 2.3026
Epoch 5/5
195/782 ━━━━━━━━━━━━━━━━━━━━ 1:09:10 7s/step - accuracy: 0.0978 - loss: 2.3028
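
To interpret these numbers, it helps to compute what the two metrics look like for a model that simply assigns equal probability to every class. The sketch below reimplements categorical cross-entropy and accuracy in NumPy on synthetic, balanced labels; it is an illustration of the metrics, not the Keras implementation. With a trained Keras model, the equivalent call is model.evaluate(x_test, y_test), which returns the loss followed by the compiled metrics.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))


def accuracy(y_true, y_pred):
    """Fraction of samples whose highest-probability class matches the label."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))


num_classes = 10
labels = np.arange(1000) % num_classes           # balanced synthetic labels
y_true = np.eye(num_classes)[labels]             # one-hot encoding
y_uniform = np.full((1000, num_classes), 1.0 / num_classes)

print(round(categorical_crossentropy(y_true, y_uniform), 4))  # 2.3026
print(accuracy(y_true, y_uniform))                            # 0.1
```

A loss of ln(10) ≈ 2.3026 with 10% accuracy is exactly what uniform guessing produces, which matches the log above.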

Full Code For Implementing VGG on CIFAR-10 Dataset in Python

This Python script demonstrates how to implement a VGG-style Convolutional Neural Network (CNN) using TensorFlow and Keras to classify images from the CIFAR-10 dataset. The CIFAR-10 dataset consists of 60,000 32x32 color images across 10 different classes. The model architecture follows the classic VGG pattern with stacked convolutional layers, max pooling, and fully connected dense layers at the end. This full implementation includes data loading, preprocessing, model construction, compilation, a summary of the architecture, and the training call.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical

# Load and preprocess the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize the data
y_train, y_test = to_categorical(y_train), to_categorical(y_test)

# Define the VGG-like model
def build_vgg_model():
    model = models.Sequential()
    model.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(32, 32, 3)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))
    model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))
    model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))
    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))
    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(4096, activation='relu'))
    model.add(layers.Dense(4096, activation='relu'))
    model.add(layers.Dense(10, activation='softmax'))
    return model

# Instantiate and compile the model
model = build_vgg_model()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Display the model's architecture
model.summary()

# Train the VGG model
history = model.fit(x_train, y_train, epochs=5, batch_size=64, validation_data=(x_test, y_test))

Conclusion

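In this blog, we implemented a VGG-style model on the CIFAR-10 dataset using Keras. The VGG architecture is deep but structurally simple, which makes it a good vehicle for learning how image classifiers are built. In the run logged above, accuracy stayed at chance level (about 10%), so treat the code as a starting point: training a network this deep from scratch typically needs further tuning, for example a lower learning rate or added normalization, before it reaches useful accuracy. Through this exercise, you should now have a solid understanding of how to build and train deep convolutional networks in Python.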

The implementation provided here is a scaled-down version of VGG16: it keeps the layer layout but works on 32x32 inputs with a 10-class output, which avoids the cost of training a full-scale VGG16 or VGG19 on ImageNet. It serves as a starting point for more complex projects.