Understanding VGG Neural Networks: Architecture and Implementation

Anna Alexandra Grigoryan
6 min read · Aug 14, 2023

VGG (Visual Geometry Group) is a convolutional neural network architecture proposed in 2014 by Simonyan and Zisserman of the University of Oxford's Visual Geometry Group. It gained popularity and recognition for its simplicity and effectiveness in image classification tasks.


The Need for Deep Architectures

At its core, the challenge in image recognition revolves around capturing intricate features and patterns in images. Traditional approaches faltered when trying to identify complex relationships within data, which prompted the exploration of deeper architectures.

VGG tackled this challenge head-on by advocating for depth. By stacking numerous layers, VGG built a hierarchy of features that allowed it to grasp the essence of images. One of the primary motivations for deep architectures was the realization that visual data contains features at various levels of abstraction: edges and textures at the bottom, object parts and whole objects near the top. A shallower network, such as LeNet or AlexNet, might capture only the simpler features and miss the higher-level structure that distinguishes one object from another.

VGG’s significance comes into sharper focus when examining the limitations of shallow architectures. LeNet and AlexNet, while groundbreaking in their own right, encountered bottlenecks when confronted with complex images. Their inability to capture high-level features hindered their performance on intricate tasks like object detection and recognition in cluttered scenes.

VGG Explained

The VGG neural network architecture, as described in the original paper ("Very Deep Convolutional Networks for Large-Scale Image Recognition"), aims to address the challenge of image classification using deep convolutional neural networks. Its primary focus is on increasing the depth of the network while keeping the convolutional layers simple and uniform.

The paper introduces several VGG variants, including VGG11, VGG13, VGG16, and VGG19, where the number denotes the count of weight layers (convolutional plus fully connected) in each variant; VGG16, for instance, has 13 convolutional and 3 fully connected layers. These networks consist primarily of convolutional layers interspersed with max-pooling layers, followed by a few fully connected layers; the per-block configurations are sketched just after the list below.

  • Convolutional Layers: VGG networks use small 3x3 convolutional filters throughout the architecture. Stacking small filters is what makes this work: two 3x3 convolutions have an effective receptive field of 5x5, and three have 7x7, but with fewer parameters and more interleaved non-linearities than a single larger filter.
  • Pooling Layers: After each group of two to four convolutional layers (depending on the variant), a 2x2 max-pooling layer with stride 2 performs spatial down-sampling, halving the spatial dimensions while retaining the strongest activations.
  • Fully Connected Layers: The convolutional and pooling layers are followed by a few fully connected (dense) layers, which serve as the final classifier. These layers consolidate the learned features and produce class predictions.
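
To make the variants concrete, here is a sketch of the per-block configurations from the paper, expressed as plain Python data; the dictionary name is just for illustration.

VGG_CONFIGS = {
    # (filters, conv layers per block); each block ends in 2x2 max pooling
    "VGG11": [(64, 1), (128, 1), (256, 2), (512, 2), (512, 2)],
    "VGG13": [(64, 2), (128, 2), (256, 2), (512, 2), (512, 2)],
    "VGG16": [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)],
    "VGG19": [(64, 2), (128, 2), (256, 4), (512, 4), (512, 4)],
}

# Every variant shares the same classifier head in the paper:
# two fully connected layers of 4096 units, then a softmax over 1000 classes.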

Rectified linear units (ReLU), defined as ReLU(x) = max(0, x), are used as the activation function throughout the network. ReLU introduces non-linearity and helps in modeling complex relationships within the data.

The authors observe that deeper networks tend to perform better, but they also acknowledge the increased computational cost. The VGG19 model, with 19 weight layers, performs well on various image classification tasks, but even the shallower variants like VGG16 and VGG13 provide competitive results.

The networks are trained using stochastic gradient descent (SGD) with momentum. Dropout, a regularization technique, is also applied to reduce overfitting.
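
In the paper, SGD uses momentum 0.9 with an initial learning rate of 0.01, and dropout with ratio 0.5 is applied to the first two fully connected layers. A rough Keras sketch of these settings (not the authors' exact training pipeline) might look like this:

import tensorflow as tf

# SGD with momentum, as in the paper (initial learning rate 0.01, momentum 0.9)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Dropout with ratio 0.5 after each of the first two fully connected layers
head = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dropout(0.5),
])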

The VGG architectures outperformed previous state-of-the-art models on benchmark datasets, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) dataset, demonstrating the effectiveness of deeper architectures for image classification.

VGG Blocks: Building Depth

A VGG block is a modular unit within the VGG architecture that encapsulates a set of convolutional and pooling layers. These layers, operating in tandem, capture features of varying complexity, from basic edges and shapes to more elaborate textures and patterns. VGG recognized that by stacking these blocks together, the network could learn to discern high-level features that define the identity of objects in images.

The pivotal role of VGG blocks lies in their ability to add depth to the neural network. Depth, in this context, refers to the number of layers within the network. Deeper networks can capture more intricate and abstract features, enabling them to distinguish subtle differences in images.

By stacking multiple VGG blocks, the network evolves from a relatively shallow structure to a deep, hierarchical architecture.

VGG takes a systematic approach to configuring the blocks, ensuring a consistent and scalable architecture.

Implementation

Let’s use TensorFlow to define a custom VGG block.

We can create a class named Block that inherits from tf.keras.Model, using TensorFlow's high-level Keras API. One subtlety: layers must be assigned through normal attribute assignment (or setattr) so that Keras tracks their weights; writing directly into the instance dictionary would bypass this tracking.

import tensorflow as tf

class Block(tf.keras.Model):
    def __init__(self, filters, kernel_size, repetitions, pool_size=2, strides=2):
        super(Block, self).__init__()
        self.filters = filters
        self.kernel_size = kernel_size
        self.repetitions = repetitions

        # Define conv2D_0, conv2D_1, etc. based on the number of repetitions.
        # setattr is used (rather than writing into the instance dictionary)
        # so that Keras tracks each layer's weights.
        for i in range(repetitions):
            # A Conv2D layer with the given filters and kernel size,
            # ReLU activation and 'same' padding
            setattr(self, f'conv2D_{i}',
                    tf.keras.layers.Conv2D(filters, kernel_size,
                                           activation='relu', padding='same'))

        # The max-pooling layer applied after the Conv2D layers
        self.max_pool = tf.keras.layers.MaxPooling2D(pool_size=pool_size, strides=strides)

In the initialiser, a loop iterates repetitions times. Each iteration creates a convolutional layer (Conv2D) and assigns it to a dynamically named attribute (conv2D_0, conv2D_1, and so on) via setattr, which triggers Keras' layer tracking.

After the convolutional layers are defined in the loop, a max pooling layer (MaxPooling2D) is defined. This layer will be applied after the convolutional layers in the block.

We add a call method to the class above to define the forward pass. It takes an input tensor (inputs) and passes it through the convolutional layers and then max pooling.

    def call(self, inputs):
        # Pass the input through the first convolutional layer
        x = getattr(self, 'conv2D_0')(inputs)

        # Chain the remaining convolutional layers
        for i in range(1, self.repetitions):
            x = getattr(self, f'conv2D_{i}')(x)

        # Down-sample spatially with max pooling
        return self.max_pool(x)

This custom block can be used within a larger neural network architecture to create a repeated pattern of convolutional layers followed by max pooling layers, which is a common structure in CNNs. It provides a flexible way to construct networks with varying depths and complexities.
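
As a quick sanity check (assuming a standard 224x224 RGB input), a block with two 64-filter convolutions should halve the spatial dimensions and output 64 channels:

block = Block(filters=64, kernel_size=3, repetitions=2)
x = tf.random.normal((1, 224, 224, 3))  # dummy batch of one RGB image
print(block(x).shape)  # (1, 112, 112, 64)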

class MyVGG(tf.keras.Model):

    def __init__(self, num_classes):
        super(MyVGG, self).__init__()

        # Creating blocks of VGG with the following
        # (filters, kernel_size, repetitions) configurations
        self.block_a = Block(filters=64, kernel_size=3, repetitions=2)
        self.block_b = Block(filters=128, kernel_size=3, repetitions=2)
        self.block_c = Block(filters=256, kernel_size=3, repetitions=3)
        self.block_d = Block(filters=512, kernel_size=3, repetitions=3)
        self.block_e = Block(filters=512, kernel_size=3, repetitions=3)

        # Classification head
        # Flatten the feature maps into a vector
        self.flatten = tf.keras.layers.Flatten()
        # Dense layer with 256 units and ReLU as the activation function
        self.fc = tf.keras.layers.Dense(256, activation='relu')
        # Finally, the softmax classifier as a Dense layer
        self.classifier = tf.keras.layers.Dense(num_classes, activation='softmax')

    def call(self, inputs):
        # Chain all the layers one after the other
        x = self.block_a(inputs)
        x = self.block_b(x)
        x = self.block_c(x)
        x = self.block_d(x)
        x = self.block_e(x)
        x = self.flatten(x)
        x = self.fc(x)
        x = self.classifier(x)
        return x

The classification head is defined with a Flatten layer. Then a fully connected (Dense) layer with 256 units and ReLU activation is added, followed by the final classifier layer with num_classes units and a softmax activation function.

In the call method, the input tensor is passed through each block in turn, followed by the flatten, fully connected, and classifier layers, defining the forward pass of the network.
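
Because the model is defined by subclassing, its weights are only created on the first call; running a dummy batch through it is a quick way to build the model and inspect the architecture:

vgg = MyVGG(num_classes=2)
vgg(tf.random.normal((1, 224, 224, 3)))  # first call builds the weights
vgg.summary()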

Training

Here’s a brief overview of the steps involved in training the custom VGG model.

# Initialize VGG with the number of classes 
vgg = MyVGG(num_classes=2)

# Compile with losses and metrics
vgg.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the custom VGG model on a batched dataset of (image, label) pairs
vgg.fit(dataset, epochs=10)

The model is compiled with the optimizer, loss function, and metrics. In this case, the Adam optimiser, sparse categorical cross-entropy loss, and accuracy metric are used.

Make sure to preprocess the dataset of images by resizing and normalising them, and then create batches for training.
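
A minimal preprocessing sketch with tf.data, assuming raw_dataset is a dataset of (image, label) pairs; the name raw_dataset and the 224x224 target size are illustrative:

def preprocess(image, label):
    image = tf.image.resize(image, (224, 224))  # resize to the network's input size
    image = tf.cast(image, tf.float32) / 255.0  # normalise pixel values to [0, 1]
    return image, label

dataset = raw_dataset.map(preprocess).shuffle(1000).batch(32)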

Final Notes

VGG’s deep architecture enabled it to surpass earlier models by accurately classifying diverse objects within images. The systematic layering of convolutional blocks paved the way for capturing intricate features, contributing to its superior accuracy.

However, VGG models are resource-intensive: VGG16 alone has roughly 138 million parameters, and the resulting demand for memory and compute limits scalability, making the architecture less practical for deployment on resource-constrained devices.

To work around these limitations, transfer learning gained relevance. Developers can take VGG models pre-trained on large datasets such as ImageNet and fine-tune them on domain-specific data, achieving impressive results with significantly less training time.
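
A minimal transfer-learning sketch using the VGG16 weights bundled with Keras; the frozen base is reused as a feature extractor, and the small head (256 units, 2 classes) is illustrative:

base = tf.keras.applications.VGG16(weights='imagenet', include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained convolutional base

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])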
