Implementing AlexNet with PyTorch’s torchvision in Python Using the CIFAR-10 Dataset
- Aug 26, 2024
Updated: Mar 10
Deep learning has revolutionized image classification, and convolutional neural networks (CNNs) lie at the heart of this progress. AlexNet, one of the pioneering CNN architectures, demonstrated the power of deep learning on large-scale image datasets.
In this tutorial, we’ll implement AlexNet using PyTorch’s torchvision library, walk through its architecture, and train it on the CIFAR-10 dataset. You’ll learn how to load and preprocess data, adapt a classic architecture to a dataset of smaller images, train the model, and evaluate its performance, all while gaining hands-on experience with PyTorch’s high-level APIs and GPU acceleration.
By the end, you’ll have a working AlexNet model trained on CIFAR-10 and a deeper understanding of CNN architecture and workflows, from convolutional layers to fully connected outputs.

What Is AlexNet?
AlexNet was a milestone in the evolution of deep convolutional neural networks (CNNs) for image classification. Its architecture consists of multiple convolutional layers that learn spatial hierarchies of features from input images, max pooling layers that reduce feature map dimensionality and add robustness to small translations, and fully connected layers that perform the final classification.
The model made a breakthrough in 2012 by winning the ImageNet competition, showing the power of deep CNNs in visual recognition tasks. It popularized the use of the ReLU (Rectified Linear Unit) activation function, which mitigated the vanishing gradient problem and enabled training of deeper networks. AlexNet also demonstrated the advantages of GPU acceleration for training large-scale models, which drastically reduced training times.
Its architectural principles, such as stacked convolutional layers and max-pooling, laid the foundation for subsequent models like VGG, ResNet, and Inception. Beyond architecture, AlexNet brought deep learning into mainstream applications, influencing advancements in autonomous vehicles, facial recognition, and medical imaging.
It also set a benchmark for future models, establishing a new standard for performance in image recognition.
Understanding the AlexNet Architecture
AlexNet is a deep convolutional neural network designed for large-scale image classification. Its architecture set a benchmark for modern CNNs and introduced several key innovations that are still widely used today. At a high level, AlexNet consists of stacked convolutional layers for feature extraction, max-pooling layers for downsampling, and fully connected layers for classification.
The network starts with an input layer that accepts images resized to 224×224 pixels with three color channels (RGB). The first few layers are convolutional layers, which learn spatial hierarchies of features such as edges, textures, and shapes. AlexNet applies the ReLU activation function after each convolution to introduce non-linearity, enabling the network to model complex patterns.
Max-pooling layers follow certain convolutional layers to reduce the spatial dimensions of feature maps. This not only decreases computational cost but also makes the network more robust to small translations or distortions in the input.
After the convolutional and pooling stages, the feature maps are flattened and passed through fully connected layers, which act as a classifier by combining the learned features into class probabilities.

AlexNet also incorporates dropout layers in the fully connected section to reduce overfitting, improving generalization on unseen data.
Some key highlights of the AlexNet architecture include:
- Five convolutional layers with varying kernel sizes, designed to capture features at different scales.
- Three max-pooling layers to progressively downsample feature maps.
- Three fully connected layers at the end, with the final layer outputting predictions for 1,000 ImageNet classes (or adapted for a smaller dataset like CIFAR-10).
- ReLU activations for faster convergence and better gradient flow.
- Dropout regularization in the classifier to prevent overfitting.
This combination of convolution, pooling, and fully connected layers allows AlexNet to learn hierarchical feature representations, making it a powerful and flexible model for image classification tasks.
Implementing AlexNet on CIFAR-10 with PyTorch
Having explored the architecture and significance of AlexNet, we’re ready to put theory into practice. In this hands-on section, we’ll walk through implementing AlexNet using PyTorch and applying it to the CIFAR-10 dataset, a standard benchmark for image classification. We’ll start by loading and preprocessing the dataset, resizing the images to match AlexNet’s expected input, and setting up data loaders for efficient training.
Next, we’ll define the model, leveraging pre-trained weights for feature extraction or fine-tuning for better performance on CIFAR-10. Finally, we’ll train the network, monitor its learning through loss metrics, and evaluate its accuracy on the test set.
This practical exercise will give you a clear view of how convolutional and fully connected layers interact in a deep CNN and provide hands-on experience with one of the most influential architectures in computer vision.
1. Preparing the Environment
Before training AlexNet on CIFAR-10, make sure PyTorch and torchvision are installed. PyTorch provides the core deep learning framework, while torchvision includes datasets, model architectures, and useful image transformations. Install them using:
pip install torch torchvision
2. Loading AlexNet from torchvision
PyTorch’s torchvision makes it simple to access a pre-trained AlexNet model. Passing weights=models.AlexNet_Weights.IMAGENET1K_V1 loads weights trained on the ImageNet dataset, which contains 1.2 million images across 1,000 classes. (Older tutorials use pretrained=True for the same purpose; that argument has been deprecated since torchvision 0.13.) You can load and inspect the model as follows:
import torch
import torchvision.models as models

# Load AlexNet with its ImageNet weights
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
# Display the model's architecture
print(alexnet)
The output shows the architecture divided into two main components: features, which contains convolutional and pooling layers for feature extraction, and classifier, which includes fully connected layers for classification. Between them, an adaptive average-pooling layer (avgpool) fixes the feature map at 6×6 before it is flattened. ReLU activations introduce non-linearity, and dropout in the classifier improves generalization. This pre-trained model provides a strong starting point for transfer learning or fine-tuning on datasets like CIFAR-10.
AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
    (2): ReLU(inplace=True)
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=4096, out_features=4096, bias=True)
    (5): ReLU(inplace=True)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)
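The 9216 input features of the first Linear layer follow directly from the kernel, stride, and padding values in the printout. The sketch below traces the side length of the feature map layer by layer with the usual floor-mode formula, floor((n + 2p - k) / s) + 1. The names FEATURE_LAYERS, out_size, and trace are illustrative helpers, not part of torchvision.

```python
# Layer settings copied from the printed features block: (name, kernel, stride, padding)
FEATURE_LAYERS = [
    ("conv1", 11, 4, 2),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]


def out_size(n, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1


def trace(n):
    """Return (layer name, side length) after every layer for an n x n input."""
    sizes = []
    for name, kernel, stride, padding in FEATURE_LAYERS:
        n = out_size(n, kernel, stride, padding)
        sizes.append((name, n))
    return sizes


for name, size in trace(224):
    print(f"{name}: {size}x{size}")

final_side = trace(224)[-1][1]
flattened = 256 * final_side * final_side  # 256 channels leave the last conv layer
print("flattened features:", flattened)
```

For a 224×224 input the trace ends at 6×6, and 256 × 6 × 6 gives the 9216 features the classifier expects. Calling trace(32) on a native CIFAR-10 resolution ends with a side length of 0 at the last pooling layer, which is one way to see why the images are resized before being fed to this network.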
3. Using AlexNet for Feature Extraction
AlexNet’s convolutional layers can serve as a powerful feature extractor for new datasets, particularly smaller ones where training from scratch would be impractical. By freezing the convolutional layers, you retain the pre-learned ImageNet features while replacing the classifier to adapt to your dataset. For CIFAR-10, which has 10 classes, you can replace the original fully connected layers with a custom classifier. This approach allows the network to leverage learned visual features while fine-tuning the final layers for the specific task.
from torch import nn

# Freeze the convolutional layers
for param in alexnet.features.parameters():
    param.requires_grad = False

# Replace the classifier with a custom classifier for CIFAR-10
alexnet.classifier = nn.Sequential(
    nn.Dropout(),
    nn.Linear(256 * 6 * 6, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 10),
)

# Move the model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
alexnet.to(device)
This setup ensures that only the classifier layers are updated during training, making fine-tuning faster and more efficient while still leveraging AlexNet’s robust feature extraction capabilities.
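To get a feel for how the frozen and trainable parts compare in size, the following back-of-the-envelope sketch counts weights and biases from the layer shapes shown in the model printout and in the replacement classifier. It is plain arithmetic rather than a torchvision call, and the helper names are hypothetical.

```python
# (in_channels, out_channels, kernel_size) for the five convolutional layers
CONV_SHAPES = [(3, 64, 11), (64, 192, 5), (192, 384, 3), (384, 256, 3), (256, 256, 3)]
# (in_features, out_features) for the replacement CIFAR-10 classifier
CLASSIFIER_SHAPES = [(256 * 6 * 6, 4096), (4096, 4096), (4096, 10)]


def conv_params(c_in, c_out, k):
    """Weights plus biases of a square-kernel Conv2d layer."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a Linear layer."""
    return n_in * n_out + n_out


frozen = sum(conv_params(*shape) for shape in CONV_SHAPES)
trainable = sum(linear_params(*shape) for shape in CLASSIFIER_SHAPES)

print(f"frozen (features):      {frozen:,}")
print(f"trainable (classifier): {trainable:,}")
print(f"trainable share:        {trainable / (frozen + trainable):.1%}")
```

The convolutional stack holds roughly 2.5 million parameters, while the new classifier holds roughly 54.6 million. Freezing features therefore still leaves most of the parameters trainable; the savings come mainly from not having to backpropagate through the convolutional layers.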
4. Fine-Tuning AlexNet for Custom Datasets
When working with a new dataset, fine-tuning a pre-trained model like AlexNet can save a lot of time and improve performance. The idea is simple: keep the early convolutional layers frozen so they continue to extract general features like edges and textures, and allow the deeper layers to adapt to the specifics of your data.
# Unfreeze the tail of the feature extractor. The slice features[-3:] covers
# modules 10-12 (Conv2d, ReLU, MaxPool2d), so only the last convolutional
# layer actually gains trainable weights.
for param in alexnet.features[-3:].parameters():
    param.requires_grad = True

# Define the loss function and optimizer (only trainable parameters are passed in)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    (p for p in alexnet.parameters() if p.requires_grad), lr=1e-4
)
With this approach, AlexNet’s powerful feature extraction is preserved, while the top layers learn patterns unique to your dataset. It’s a practical way to get high accuracy without training the entire network from scratch.
5. Training AlexNet on the CIFAR-10 Dataset
With the model configured for fine-tuning, we can now train AlexNet on the CIFAR-10 dataset. Since CIFAR-10 images are only 32×32 pixels, we resize them to 224×224 so they match AlexNet’s expected input size. The images are then converted to tensors and normalized per channel so PyTorch can handle them efficiently.
The training loop runs over multiple epochs, performing forward passes, computing the loss, backpropagating gradients, and updating the weights. With each epoch, the model gradually learns to classify the images more accurately.
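import torchvision
import torchvision.transforms as transforms
# Resize to AlexNet's input size, convert to tensors, and normalize.
# Assumption: ImageNet channel statistics; the original values are not stated.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# root and batch_size are illustrative choices
trainset = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Training loop
num_epochs = 10
alexnet.train()  # make sure dropout is active
for epoch in range(num_epochs):
    running_loss = 0.0
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = alexnet(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(trainloader):.4f}")
print("Finished Training")
The output shows a steady decrease in training loss, indicating that AlexNet is successfully adapting its learned features to the CIFAR-10 dataset while leveraging the pre-trained convolutional layers for faster and more accurate learning.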
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz...
Files already downloaded and verified
Epoch [1/10], Loss: 0.6980
Epoch [2/10], Loss: 0.4352
...
Epoch [10/10], Loss: 0.0701
Finished Training
This script fine-tunes AlexNet on the CIFAR-10 dataset, allowing the pre-trained model to adapt to this new classification task.
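Training loss alone does not show how well the model generalizes, so a test-set accuracy check is a natural next step. A minimal sketch of such a check might look like the following. The function name evaluate_accuracy and the toy model used for the smoke test are hypothetical and only stand in for the real network and data loader.

```python
import torch
from torch import nn


def evaluate_accuracy(model, loader, device):
    """Return top-1 accuracy of `model` over all (inputs, labels) batches in `loader`."""
    model.eval()  # switch dropout off for evaluation
    correct, total = 0, 0
    with torch.no_grad():  # no gradients needed when only predicting
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            preds = model(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total


# Smoke test with a stand-in model that always predicts class 1
cpu = torch.device("cpu")
toy_model = nn.Linear(4, 3)
with torch.no_grad():
    toy_model.weight.zero_()
    toy_model.bias.copy_(torch.tensor([0.0, 1.0, 0.0]))
toy_batches = [(torch.randn(5, 4), torch.tensor([1, 1, 0, 2, 1]))]
toy_acc = evaluate_accuracy(toy_model, toy_batches, cpu)
print(f"toy accuracy: {toy_acc:.2f}")
```

For the actual model, one would build a test loader the same way as the training loader but with train=False and shuffle=False (called testloader here purely for illustration) and then call evaluate_accuracy(alexnet, testloader, device).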
Conclusion
Fine-tuning AlexNet on the CIFAR-10 dataset demonstrates the power of transfer learning in deep learning. By using a pre-trained model, we leveraged rich feature representations learned from ImageNet, allowing the network to adapt quickly to a new dataset without training from scratch.
We walked through loading the pre-trained AlexNet, freezing its convolutional layers for feature extraction, customizing the classifier for our 10-class task, and training it on CIFAR-10. This hands-on approach highlights how modern CNN architectures can be efficiently applied to new image classification problems, providing a strong foundation for exploring more advanced models and datasets in computer vision.





