Vision Transformer in Python: Working, Architecture, and Code
Vision Transformers (ViTs) have rapidly transformed the field of computer vision by bringing transformer-based architectures into image processing tasks. Originally designed for natural language processing, transformers introduced the concept of self-attention, allowing models to capture global relationships more effectively than traditional convolution-based approaches. Today, Vision Transformers are widely used in image classification, object detection, and advanced generative AI systems.
Unlike Convolutional Neural Networks (CNNs), which process images through localized filters, Vision Transformers divide images into smaller patches and treat them as sequential input tokens. This shift in architecture enables the model to learn long-range visual dependencies and contextual relationships across an entire image. As transformer-based models continue to evolve, Vision Transformers are becoming a key component of modern deep learning and computer vision research.
In this blog, we will explore how Vision Transformers work, understand their architecture step by step, and examine the role of patch embeddings, positional encoding, and self-attention mechanisms. We will also build a practical Vision Transformer model in Python, discuss its implementation using deep learning frameworks, and explore how Vision Transformers are used across modern computer vision tasks such as image recognition, image segmentation, object detection, video understanding, and image generation.

What Are Vision Transformers?
Vision Transformers (ViTs) are deep learning architectures that apply transformer-based modeling techniques to computer vision tasks. Inspired by the success of transformers in natural language processing (NLP), Vision Transformers process images using self-attention mechanisms instead of relying entirely on convolution operations like traditional Convolutional Neural Networks (CNNs).
Rather than analyzing an image through sliding convolutional filters, a Vision Transformer divides an image into smaller fixed-size patches and treats each patch as an input token. These image patches are then converted into embeddings and passed through transformer encoder layers, allowing the model to learn relationships between different regions of the image. This approach enables Vision Transformers to capture both local and global contextual information more effectively across the entire visual input.
The core idea behind Vision Transformers is borrowed directly from transformer architectures used in NLP models such as BERT and GPT. In NLP, transformers process sequences of words, while in Vision Transformers, the sequence is formed using image patches. By leveraging self-attention, the model can determine which parts of an image are more relevant during feature extraction and decision-making.
Vision Transformers gained widespread attention after demonstrating competitive performance on large-scale image recognition benchmarks. Since then, they have evolved into an important architecture for modern computer vision research and are now used in tasks such as image classification, object detection, image segmentation, video understanding, and generative modeling.
Because apparently the machine learning community saw transformers succeeding in text processing and collectively decided, What if pixels were just spicy words? Disturbingly effective conclusion, to be honest.
How Vision Transformers Work
Vision Transformers process images differently from traditional Convolutional Neural Networks (CNNs). Instead of extracting features using convolutional filters, Vision Transformers transform an image into a sequence of smaller image patches and process them using transformer encoder layers powered by self-attention mechanisms. This allows the model to learn relationships between different regions of an image while capturing broader contextual information.
At a high level of abstraction, the workflow of a Vision Transformer consists of the following stages:
Image Splitting into Patches
Patch Embedding Generation
Positional Encoding Addition
Transformer Encoder Processing
Self-Attention-Based Feature Learning
Final Prediction Layer
Each component plays an important role in enabling the model to interpret visual data effectively.
Step 1: Splitting the Image into Patches
The first step in a Vision Transformer is dividing the input image into smaller fixed-size patches. Instead of processing the image as one large grid of pixels, the model breaks it into multiple smaller regions of equal dimensions. For example, a standard 224×224 RGB image may be divided into patches of 16×16 pixels each, generating a sequence of smaller visual blocks extracted from the original image.
Each image patch contains a portion of the visual information present in the image, including textures, edges, colors, shapes, and local spatial patterns. These patches act similarly to tokens in Natural Language Processing (NLP), where each token represents a word or subword within a sentence. In Vision Transformers, image patches become the fundamental units of representation that the transformer learns from.
The total number of patches generated depends on both the image size and the selected patch size. For a 224×224 image with 16×16 patches, the Vision Transformer processes the image as a sequence of 196 visual tokens.
Unlike Convolutional Neural Networks (CNNs), which analyze images using localized convolution operations, Vision Transformers treat image understanding as a sequence modeling problem. After patch extraction, the transformer processes these patches sequentially in a manner similar to how transformer models process word embeddings in NLP tasks.
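As a quick illustration of this tokenization, the snippet below slices a dummy 224×224 image into 16×16 patches using plain tensor operations (torch.Tensor.unfold is one of several ways to do this; the model built later in this post uses a strided convolution instead):
import torch

image = torch.randn(3, 224, 224)  # one RGB image: (channels, height, width)
patch_size = 16

# Carve out non-overlapping 16x16 windows along height, then width
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
print(patches.shape)  # torch.Size([3, 14, 14, 16, 16])
print(14 * 14)        # 196 visual tokens, matching the count above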
Step 2: Patch Embeddings
After splitting the image into smaller patches, each patch is flattened into a one-dimensional vector. However, raw pixel values alone are not sufficient for transformer-based learning because the transformer encoder expects structured numerical representations with meaningful feature relationships. To make these image patches suitable for learning, they are transformed into dense vector representations known as patch embeddings.
Each flattened patch is passed through a learnable linear projection layer that converts the high-dimensional pixel vector into a lower-dimensional embedding space. This process is conceptually similar to word embeddings in Natural Language Processing (NLP), where words are mapped into dense semantic vectors before being processed by transformer models.
The patch embedding operation can be represented as:
Patch Embedding = Flattened Patch × Learnable Projection Matrix
The purpose of patch embeddings is not simply dimensionality reduction. These embeddings learn feature-rich representations of visual regions during training. Over time, the embedding layer learns to encode important visual characteristics such as:
edges,
textures,
shapes,
color distributions,
and local structural patterns.
Unlike handcrafted feature extraction methods used in classical computer vision, patch embeddings are learned automatically through backpropagation and optimization during model training.
An important property of patch embeddings is that all image patches are projected into the same embedding space using shared learnable parameters. This allows the transformer to process every patch consistently while learning relationships between different image regions through self-attention mechanisms.
Once all image patches are converted into embeddings, the model combines them into a sequential representation: [z1,z2,z3,...,zN]. This sequence of patch embeddings becomes the primary input to the transformer encoder layers.
Patch embeddings form the foundation of Vision Transformers because they bridge the gap between raw visual data and transformer-based sequence learning.
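As a minimal sketch of this projection, assuming 224×224 RGB images and 16×16 patches (here a plain nn.Linear over flattened patches; the implementation later in this post achieves the same effect with a strided convolution):
import torch
import torch.nn as nn

patch_size, channels, embed_dim = 16, 3, 128
patch_dim = patch_size * patch_size * channels  # 768 raw pixel values per patch

projection = nn.Linear(patch_dim, embed_dim)    # the learnable projection matrix

flat_patches = torch.randn(196, patch_dim)      # 196 flattened patches, one image
embeddings = projection(flat_patches)
print(embeddings.shape)  # torch.Size([196, 128])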
Step 3: Positional Encoding
Unlike CNNs, transformers do not naturally understand spatial structure or positional relationships between image regions. Since the input is treated as a sequence of patches, positional information must be added explicitly.
This is achieved using positional encoding, where a positional vector is added to each patch embedding. These positional embeddings help the model understand:
The location of each patch,
Spatial relationships between regions,
And the overall arrangement of visual features.
Without positional encoding, the transformer would treat image patches as unordered tokens, which would severely limit spatial understanding. Humans already struggle with spatial awareness while parallel parking; as it turns out, machines need a little assistance too.
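A minimal sketch of the learnable-positional-embedding variant (the approach used by the original ViT and by the model built later in this post), assuming a sequence of 196 patch embeddings of dimension 128:
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 128

patch_embeddings = torch.randn(1, num_patches, embed_dim)  # (batch, tokens, dim)
pos_embedding = nn.Parameter(torch.randn(1, num_patches, embed_dim))

# Element-wise addition tags every token with its position in the grid
tokens = patch_embeddings + pos_embedding
print(tokens.shape)  # torch.Size([1, 196, 128])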
Step 4: Transformer Encoder Layers
After positional information is added to the patch embeddings, the resulting sequence is passed through multiple Transformer Encoder Layers, which form the computational core of the Vision Transformer architecture. These encoder layers are responsible for learning complex relationships between different image regions and progressively refining the visual feature representations extracted from the input image.
A Vision Transformer typically stacks several encoder blocks sequentially, allowing the model to learn increasingly abstract and high-level visual patterns as information flows deeper through the network. Each encoder processes the entire sequence of image patch embeddings simultaneously rather than focusing only on localized regions.
A standard transformer encoder layer usually consists of four major components:
Multi-Head Self-Attention (MHSA)
Layer Normalization
Feed-Forward Neural Networks (FFN)
Residual Connections
Together, these components allow the model to capture both local visual structures and global contextual dependencies across the image. Unlike Convolutional Neural Networks (CNNs), which primarily learn through localized receptive fields, transformer encoders model relationships globally across the entire image from the very beginning. This ability to capture long-range contextual dependencies is one of the major reasons Vision Transformers have become highly influential in modern computer vision research.
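As a sketch of how these four components fit together, the block below implements one pre-norm encoder layer with standard PyTorch modules. Note the original ViT uses the pre-norm arrangement shown here, while the built-in nn.TransformerEncoderLayer used later in this post defaults to post-norm; both are common:
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm transformer encoder layer: MHSA + FFN with residuals."""
    def __init__(self, embed_dim=128, num_heads=4, mlp_dim=256, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # multi-head self-attention
        x = x + attn_out                  # residual connection 1
        x = x + self.ffn(self.norm2(x))   # residual connection 2
        return x

block = EncoderBlock()
print(block(torch.randn(1, 196, 128)).shape)  # torch.Size([1, 196, 128])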
Step 5: Self-Attention Mechanism
The self-attention mechanism is the core component of Vision Transformers. Self-attention allows the model to determine how strongly different image patches are related to one another. For every patch, the model computes:
Query (Q)
Key (K)
Value (V)
These representations are used to calculate attention scores that determine which patches should receive greater focus during feature extraction.
The self-attention operation can be expressed as:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
where dₖ is the dimensionality of the key vectors; scaling by √dₖ keeps the dot products in a range where the softmax produces useful gradients.
This mechanism enables the model to capture both local and global feature dependencies dynamically.
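To make the formula concrete, here is a minimal sketch of single-head scaled dot-product attention over a sequence of 196 patch tokens; the random projection matrices stand in for learned ones (in a trained model, Q, K, and V come from trained linear layers):
import torch
import torch.nn.functional as F

tokens = torch.randn(1, 196, 128)  # (batch, num_patches, embed_dim)

# Stand-in projections; in a real ViT these are learned nn.Linear layers
W_q, W_k, W_v = (torch.randn(128, 128) for _ in range(3))
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

d_k = Q.shape[-1]
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (1, 196, 196) patch-to-patch scores
weights = F.softmax(scores, dim=-1)            # each row sums to 1
output = weights @ V                           # attention-weighted mix of values
print(output.shape)  # torch.Size([1, 196, 128])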
For example, when analyzing an image containing a person riding a bicycle, the transformer can learn relationships between distant visual regions such as the rider, wheels, and handlebars simultaneously. CNNs usually require multiple convolutional layers to gradually capture such broader relationships. To read more about how attention works, refer to this blog: The Attention Mechanism: Foundations, Evolution, and Transformer Architecture
Step 6: Classification or Task-Specific Output
After passing through multiple transformer encoder layers, the Vision Transformer produces refined feature representations that capture both local visual details and global contextual relationships. These learned representations are then used for downstream computer vision tasks.
Depending on the application, the output may be used for:
Image classification,
Object detection,
Image segmentation,
Video understanding,
Image generation.
For image classification, the representation of a special classification token is typically passed through a classification head, while for tasks such as object detection or segmentation, the transformer outputs are passed through specialized task-specific heads that generate structured predictions. One of the key advantages of Vision Transformers is that the same encoder architecture can be adapted across multiple computer vision tasks with only minor architectural modifications.
Vision Transformer in Python Using PyTorch
After understanding the architecture and working principles of Vision Transformers, we can now build a practical Vision Transformer model in Python using the PyTorch framework. In this section, we will implement the core components of a Vision Transformer step by step, including patch embeddings, positional encoding, transformer encoder layers, and the final prediction head.
Using PyTorch allows us to efficiently construct and train transformer-based architectures while leveraging GPU acceleration and deep learning utilities. The implementation will demonstrate how images are converted into patch sequences, processed through self-attention mechanisms, and transformed into meaningful visual predictions using transformer encoders.
By the end of this section, we will have a working Vision Transformer model that provides a practical understanding of how modern transformer-based computer vision architectures are implemented in real-world deep learning workflows.
Importing Libraries and Configuring the Training Device
The first step in building a Vision Transformer in Python is importing the required libraries for deep learning, dataset handling, numerical computation, and visualization. In this implementation, PyTorch is used as the primary deep learning framework for constructing and training the Vision Transformer architecture, while Torchvision provides utilities for working with computer vision datasets and image transformations.
The code imports modules for neural network construction (torch.nn), optimization (torch.optim), dataset preprocessing (torchvision.transforms), and visualization using Matplotlib. NumPy is also included for numerical operations and array manipulation during visualization and attention map processing.
In addition to importing libraries, the implementation configures the computational device to automatically use GPU acceleration if CUDA-enabled hardware is available; otherwise, computation falls back to the CPU.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
from torch.utils.data import DataLoader, random_split
# Device Configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using Device:", device)Loading and Preprocessing the EuroSAT Dataset
Before training the Vision Transformer in Python, the input images must be preprocessed into a format suitable for deep learning. In this implementation, the EuroSAT dataset is used, which contains satellite imagery across multiple land-use and geographical categories. Using EuroSAT makes the project more aligned with modern computer vision research tasks involving remote sensing and geospatial image understanding.
The preprocessing pipeline is created using transforms.Compose(), where multiple image transformations are applied sequentially. First, all images are resized to 64×64 dimensions to ensure consistent input size across the dataset. The images are then converted into PyTorch tensors using ToTensor(), allowing them to be processed by neural network layers.
Normalization is also applied, scaling pixel values from [0, 1] to [-1, 1] using a mean and standard deviation of 0.5 per channel, which helps stabilize gradient updates during training and improves convergence behavior in transformer architectures.
After preprocessing, the EuroSAT dataset is downloaded and split into training and testing subsets using an 80-20 ratio. DataLoaders are then created to efficiently batch and shuffle the data during model training and evaluation.
Batch processing is particularly important in Vision Transformers because transformer operations are computationally expensive and benefit heavily from parallel GPU computation.
# Data Preprocessing
transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5),
                         (0.5, 0.5, 0.5))
])

# EuroSAT Dataset
dataset = torchvision.datasets.EuroSAT(
    root='./data',
    download=True,
    transform=transform
)

train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = random_split(
    dataset,
    [train_size, test_size]
)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
classes = dataset.classes
Once all that is done, we finally visualize a few sample images from the dataset. This helps verify that preprocessing has been applied correctly and provides an initial understanding of the different image categories present in the EuroSAT dataset.
Deep learning practitioners routinely spend hours building architectures only to later discover their preprocessing pipeline accidentally transformed every image into numerical soup, so visualization checks are less optional than people tend to think.
# Visualizing Sample Images
images, labels = next(iter(train_loader))

fig, axes = plt.subplots(2, 4, figsize=(10, 5))
axes = axes.flatten()

for i in range(8):
    img = images[i] / 2 + 0.5  # undo normalization for display
    npimg = img.numpy()
    axes[i].imshow(np.transpose(npimg, (1, 2, 0)))
    axes[i].set_title(classes[labels[i]])
    axes[i].axis('off')

plt.suptitle("Sample EuroSAT Images")
plt.tight_layout()
plt.show()
A few samples from the EuroSAT dataset are shown below as output from the code above:

Visualizing Image Patches in Vision Transformers
One of the key differences between Vision Transformers and traditional Convolutional Neural Networks (CNNs) is the way images are processed. Instead of analyzing the entire image using convolutional filters, Vision Transformers divide the image into smaller fixed-size patches that act as sequential input tokens for the transformer encoder.
To better understand this process, the implementation includes a patch visualization function that splits a EuroSAT image into multiple smaller image regions and displays them individually. The function first rearranges the tensor dimensions into a format compatible with Matplotlib visualization and then iteratively extracts smaller patches based on the specified patch size.
The visualization helps illustrate one of the core concepts behind Vision Transformers: transforming images into sequences. Instead of treating the image as a continuous spatial grid, the model interprets it as an ordered collection of visual tokens, similar to how transformer models process sequences of words in Natural Language Processing (NLP).
Patch size also plays an important role in Vision Transformer performance.
# Patch Visualization Function
def visualize_patches(image, patch_size=8):
    image = image.permute(1, 2, 0).numpy()  # (C, H, W) -> (H, W, C) for Matplotlib
    h, w, _ = image.shape
    fig, axes = plt.subplots(h // patch_size,
                             w // patch_size,
                             figsize=(7, 7))
    idx = 0
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patch = image[i:i+patch_size, j:j+patch_size]
            ax = axes[idx // (w // patch_size)][idx % (w // patch_size)]
            ax.imshow((patch * 0.5) + 0.5)  # undo normalization for display
            ax.axis('off')
            idx += 1
    plt.suptitle("EuroSAT Image Patches")
    plt.tight_layout()
    plt.show()

visualize_patches(images[0].cpu())
Smaller patches preserve more fine-grained visual detail but increase the sequence length and computational cost of self-attention operations. Larger patches reduce computation but may lose important local information. So naturally, deep learning engineers spend considerable amounts of time negotiating with square sizes while GPUs quietly prepare for thermal transcendence.
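The quadratic cost of self-attention makes this trade-off easy to quantify; a quick back-of-the-envelope calculation for the 64×64 EuroSAT images used here:
# Sequence length (and pairwise attention cost) for different patch sizes
img_size = 64
for patch_size in (4, 8, 16):
    num_patches = (img_size // patch_size) ** 2
    print(f"patch {patch_size:>2}: {num_patches:>4} tokens, "
          f"{num_patches ** 2:>6} attention pairs per head")
# patch  4:  256 tokens,  65536 attention pairs per head
# patch  8:   64 tokens,   4096 attention pairs per head
# patch 16:   16 tokens,    256 attention pairs per head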

After splitting the image into smaller patches, the Vision Transformer converts each patch into a dense numerical representation called a patch embedding. In this implementation, the PatchEmbedding class uses a convolutional layer with the kernel size and stride equal to the patch size, allowing the model to efficiently extract non-overlapping image patches while simultaneously projecting them into a higher-dimensional embedding space.
# Patch Embedding Layer
class PatchEmbedding(nn.Module):
    def __init__(self, img_size=64, patch_size=8,
                 in_channels=3, embed_dim=128):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel_size == stride == patch_size extracts
        # non-overlapping patches and projects them in a single operation
        self.projection = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size
        )

    def forward(self, x):
        x = self.projection(x)  # (B, embed_dim, H/P, W/P)
        x = x.flatten(2)        # (B, embed_dim, num_patches)
        x = x.transpose(1, 2)   # (B, num_patches, embed_dim)
        return x
The generated feature maps are then flattened and rearranged into a sequential format compatible with transformer encoder layers. This transforms the input image into a sequence of visual tokens with the shape (Batch Size, Number of Patches, Embedding Dimension), enabling the transformer to process image patches similarly to how transformer models process word embeddings in Natural Language Processing (NLP). A surprisingly bold idea, honestly: convert satellite imagery into token sequences and hope attention mechanisms develop opinions about farmland.
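A quick shape check confirms the tokenization with the defaults used here (64×64 images, 8×8 patches, 128-dimensional embeddings, hence 64 tokens per image):
patch_embed = PatchEmbedding()      # defaults: img 64, patch 8, dim 128
dummy = torch.randn(2, 3, 64, 64)   # a dummy batch of two RGB images
print(patch_embed(dummy).shape)     # torch.Size([2, 64, 128])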
Implementing the Vision Transformer Architecture
The VisionTransformer class defines the complete Vision Transformer (ViT) architecture by combining patch embeddings, positional encoding, transformer encoder layers, and a classification head into a unified deep learning model. After generating patch embeddings, the model introduces a learnable CLS token that aggregates global contextual information from all image patches during self-attention operations. Since transformers do not inherently understand spatial structure, learnable positional embeddings are added to preserve the spatial arrangement of image patches.
The embedded patch sequence is then processed through multiple transformer encoder layers containing Multi-Head Self-Attention (MHSA), feed-forward neural networks, layer normalization, and residual connections, enabling the model to learn complex relationships between different image regions. Finally, the output corresponding to the CLS token is passed through a classification head to generate predictions for the target classes.
# Vision Transformer Model
class VisionTransformer(nn.Module):
    def __init__(
        self,
        img_size=64,
        patch_size=8,
        in_channels=3,
        num_classes=10,
        embed_dim=128,
        depth=4,
        num_heads=4,
        mlp_dim=256,
        dropout=0.1
    ):
        super().__init__()
        self.patch_embed = PatchEmbedding(
            img_size,
            patch_size,
            in_channels,
            embed_dim
        )
        num_patches = self.patch_embed.num_patches

        # Learnable CLS token aggregates global information for classification
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        # Learnable positional embeddings (one per patch, plus the CLS token)
        self.pos_embedding = nn.Parameter(
            torch.randn(1, num_patches + 1, embed_dim)
        )
        self.dropout = nn.Dropout(dropout)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=mlp_dim,
            dropout=dropout,
            activation='gelu',
            batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layer,
            num_layers=depth
        )
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, num_classes)
        )

    def forward(self, x):
        batch_size = x.shape[0]
        x = self.patch_embed(x)                          # (B, N, D)
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)            # prepend CLS token
        x = x + self.pos_embedding                       # add positional info
        x = self.dropout(x)
        x = self.transformer_encoder(x)
        cls_output = x[:, 0]                             # CLS token output
        output = self.mlp_head(cls_output)
        return output
In essence, the entire architecture transforms satellite images into token sequences, learns spatial relationships using attention mechanisms, and produces semantic predictions without relying on traditional convolution-heavy pipelines. Which remains one of the more audacious ideas in modern AI: treat pieces of an image like words and trust the transformer to figure out the rest.
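Before training, a forward pass on random input is a cheap way to verify that the whole pipeline is wired correctly:
vit = VisionTransformer(num_classes=10)
logits = vit(torch.randn(2, 3, 64, 64))  # dummy batch of two 64x64 RGB images
print(logits.shape)                      # torch.Size([2, 10]) -> class scores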
Training the Vision Transformer Model
After defining the Vision Transformer architecture, the model is initialized and moved to the configured computation device for training. The implementation uses the CrossEntropyLoss function, which is commonly applied to multi-class classification problems, while the Adam optimizer is used to update the model parameters during backpropagation.
# Model Initialization
model = VisionTransformer(
    num_classes=len(classes)
).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
print(model)
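Printing the model lists every submodule; a one-line parameter count is also a useful gauge of the model's footprint before committing to training:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params:,}")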
During training, batches of EuroSAT images are passed through the Vision Transformer, where the model generates predictions based on patch embeddings and self-attention-based feature learning.
The computed loss measures the difference between predicted outputs and ground-truth labels, and gradients are propagated backward through the network to optimize the learnable parameters.
# Training Loop
num_epochs = 5
train_losses = []

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    epoch_loss = running_loss / len(train_loader)
    train_losses.append(epoch_loss)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}")
The training loop runs for multiple epochs while tracking the average training loss after each epoch.

Finally, the loss values are visualized using a training loss curve, providing insight into how effectively the model is learning and converging during optimization.
# Training Loss Visualization
plt.figure(figsize=(7, 4))
plt.plot(range(1, num_epochs + 1), train_losses, marker='o')
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training Loss Curve")
plt.grid(True)
plt.show()
As the number of epochs increases, the loss curve trends downward, signifying that the model is improving during training.

Evaluating the Vision Transformer Model
After training is completed, the Vision Transformer is evaluated on the test dataset to measure its classification performance on unseen satellite images. The model is switched to evaluation mode using model.eval(), which disables behaviors such as dropout and ensures consistent inference behavior during testing. Gradient computation is also disabled using torch.no_grad() to reduce memory usage and improve computational efficiency during evaluation.
For each batch of test images, the model generates predictions, and the class with the highest probability is selected using torch.max(). The predicted labels are then compared with the ground-truth labels to calculate the overall classification accuracy of the Vision Transformer on the EuroSAT dataset. This evaluation step helps determine how effectively the transformer architecture generalizes beyond the training data instead of merely memorizing patches like an overachieving statistical parrot.
# Model Evaluation
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")
To better understand the model's performance, we also visualize sample predictions generated by the Vision Transformer on unseen test images. A batch of satellite images is passed through the trained model, and the predicted class labels are displayed alongside the actual ground-truth labels for comparison. The visualization provides an intuitive way to inspect the model's behavior and identify cases where the transformer successfully captures meaningful spatial and semantic patterns within satellite imagery.

Through this implementation, we demonstrated how Vision Transformers can be built in Python using PyTorch for practical computer vision tasks. From patch embeddings and transformer encoder layers to patch visualization and prediction analysis, the project provided a complete workflow for understanding how transformer-based architectures process visual information. More importantly, it illustrated how self-attention mechanisms enable Vision Transformers to learn rich contextual relationships directly from image patches without relying entirely on traditional convolutional operations.
Applications of Vision Transformers in Modern Computer Vision
Vision Transformers have rapidly become one of the most influential architectures in modern computer vision due to their ability to model global contextual relationships through self-attention mechanisms. Unlike traditional Convolutional Neural Networks (CNNs), which primarily focus on local spatial patterns, Vision Transformers can learn interactions between distant image regions more effectively, making them highly adaptable across a wide range of visual learning tasks.
1. Image Recognition
One of the earliest and most widely adopted applications of Vision Transformers is image recognition and image classification. In these tasks, the model learns to identify and categorize visual objects or scene patterns from input images. By processing image patches as sequential tokens, Vision Transformers can capture both local visual structures and long-range contextual dependencies, leading to strong performance on large-scale image recognition benchmarks.
2. Image Segmentation
Vision Transformers are also widely used in image segmentation tasks, where the objective is to assign semantic labels to individual pixels or image regions. Transformer-based segmentation architectures leverage self-attention to better capture spatial relationships across the entire image, enabling more accurate boundary detection and region understanding. This has made Vision Transformers increasingly important in medical imaging, satellite image analysis, and semantic scene understanding.
3. Object Detection
In object detection tasks, Vision Transformers are used to identify and localize multiple objects within an image simultaneously. Transformer architectures help model complex relationships between objects and surrounding visual context, improving detection performance in cluttered or visually complex scenes. Attention mechanisms allow the model to dynamically focus on relevant image regions instead of relying solely on fixed receptive fields.
4. Video Understanding
Vision Transformers have also expanded into video understanding tasks, where models process both spatial and temporal information across sequences of video frames. By learning relationships between frames using self-attention mechanisms, transformer-based architectures can capture motion dynamics, temporal dependencies, and activity patterns more effectively. These models are commonly applied in action recognition, event understanding, and video classification systems.
5. Image Generation
In generative computer vision tasks, Vision Transformers are used in image synthesis and generation frameworks, including transformer-based generative models and diffusion architectures. Self-attention mechanisms help capture complex structural and semantic relationships within images, enabling the generation of high-quality and contextually coherent visual outputs. Transformer-based image generation models have become increasingly important in modern generative AI research, where apparently the long-term plan is to ensure no pixel remains unmodeled by a neural network.
Conclusion
Vision Transformers have significantly reshaped modern computer vision by introducing transformer-based sequence modeling into visual learning tasks. By dividing images into patches and processing them through self-attention mechanisms, Vision Transformers are able to capture both local visual patterns and long-range contextual relationships more effectively than many traditional convolution-based approaches.
As deep learning continues to evolve, Vision Transformers are becoming increasingly important in both research and real-world AI systems due to their scalability, flexibility, and strong representation learning capabilities. While they often require substantial computational resources and large-scale training data, their ability to model complex visual relationships has made them one of the foundational architectures driving the next generation of computer vision and multimodal AI applications.
Which is a very elegant way of saying the future of visual AI now involves feeding enormous quantities of image patches into attention layers until the GPUs begin sounding emotionally distressed.