Exploring Wide and Deep Networks

Anna Alexandra Grigoryan
7 min read · Aug 14, 2023


In machine learning, neural networks have taken center stage as powerful tools for learning from data and making intelligent decisions. As we strive to extract valuable insights from vast amounts of data, an architecture that merges the best of two worlds comes into play: wide and deep networks. In this blog post, we’ll be discussing wide and deep networks.

Artificial intelligence, if it were painted by Salvador Dalí (generated with DALL·E)

Understanding Wide and Deep Networks

Wide and deep networks, often referred to as “wide and deep learning,” represent an architecture that elegantly combines the strengths of both shallow and deep learning techniques. At its core, this approach aims to strike a harmonious balance between memorization and generalization — two essential elements in the pursuit of accurate predictions and intelligent decision-making.

Imagine a scenario where you’re building a recommendation system for an online streaming platform. On one hand, you want to capture explicit user preferences, such as their age group, genre preferences, and viewing history. On the other hand, you also need to uncover hidden patterns and latent interests that might not be immediately apparent from the raw data.

By combining the wide and deep components, the wide and deep model is capable of handling both known and unknown relationships in the data.

Wide Component Explained

The wide component of the wide and deep model focuses on capturing explicit, known patterns or feature interactions in the data. It involves creating cross-product features that combine different categorical variables, allowing the model to learn relationships that might not be readily apparent from individual features.

In the wide component, feature crosses are created by taking combinations of categorical features and then applying one-hot encoding to these combinations. This essentially results in a set of binary features indicating the presence or absence of specific feature interactions. These binary features are then input to a linear model, which can weigh the importance of different feature interactions.
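
To make this concrete, here is a minimal plain-Python sketch (the feature values are hypothetical) of crossing two categorical features and one-hot encoding the result; these binary indicators are exactly what the linear model in the wide component weighs.

# Two categorical features (hypothetical values)
genders = ['Male', 'Female', 'Male', 'Female']
devices = ['Mobile', 'Mobile', 'Desktop', 'Mobile']

# Cross-product feature: one combined category per example
crossed = [f"{g} AND {d}" for g, d in zip(genders, devices)]

# One-hot encode the crossed feature
vocabulary = sorted(set(crossed))
one_hot = [[1 if value == v else 0 for v in vocabulary] for value in crossed]

print(vocabulary)  # ['Female AND Mobile', 'Male AND Desktop', 'Male AND Mobile']
print(one_hot)     # binary indicators fed to the linear (wide) model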

Examples of Wide Component Feature Crosses

1. Ad Recommendation: Imagine you’re building a recommendation system for online ads. You might have categorical features such as user demographics (age group, gender), ad category, and device type. The wide component can capture interactions like “Age Group: 25–34 AND Ad Category: Electronics” or “Gender: Female AND Device Type: Mobile.” These cross-product features would help the model explicitly learn which combinations of user characteristics and ad categories are more likely to result in clicks or conversions.

2. E-commerce Product Recommendations: In a product recommendation system for an e-commerce platform, features could include user behavior (viewed products, purchased categories) and product attributes (brand, price range). The wide component can learn feature interactions like “Viewed Product: Sneakers AND Price Range: $50-$100,” allowing the model to recommend products that align with users’ preferences based on these interactions.

3. Credit Card Fraud Detection: For fraud detection in credit card transactions, features might include transaction amount, merchant category, and location. The wide component can capture interactions like “Transaction Amount: $500+ AND Merchant Category: Electronics,” which could indicate potentially suspicious activities.

Benefits and Challenges of Wide Component

Benefits:

  • Interpretability: The wide component’s feature crosses are interpretable, as they directly relate to specific combinations of input features.
  • Handling Known Patterns: The wide component is effective at capturing known, explicit relationships that domain experts or business knowledge might already hypothesise.

Challenges:

  • Manual Feature Engineering: Creating effective feature crosses requires domain knowledge and expertise, and the process is largely manual.
  • Limited to Known Patterns: The wide component relies on predefined interactions, so it may miss subtle or complex relationships the user is not aware of and that require deeper learning.

Example

Here’s a code example for a basic recommendation system:

import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras.models import Model

# Sample data
data = {
    'user_age': [25, 30, 22, 28],
    'ad_category': ['Electronics', 'Fashion', 'Electronics', 'Beauty'],
    'gender': ['Male', 'Female', 'Male', 'Female'],
}
clicks = [1, 0, 1, 1]

# Feature columns
user_age = feature_column.numeric_column('user_age')
ad_category = feature_column.categorical_column_with_vocabulary_list(
    'ad_category', ['Electronics', 'Fashion', 'Beauty'])
gender = feature_column.categorical_column_with_vocabulary_list(
    'gender', ['Male', 'Female'])

# Feature cross of ad_category x gender, one-hot encoded via an indicator column
crossed = feature_column.crossed_column([ad_category, gender], hash_bucket_size=10)
wide_columns = [user_age, feature_column.indicator_column(crossed)]

# One Keras input per raw feature
wide_inputs = {
    'user_age': tf.keras.Input(shape=(1,), name='user_age', dtype=tf.float32),
    'ad_category': tf.keras.Input(shape=(1,), name='ad_category', dtype=tf.string),
    'gender': tf.keras.Input(shape=(1,), name='gender', dtype=tf.string),
}

# Wide component: a linear model over the crossed (one-hot) features
wide_features = tf.keras.layers.DenseFeatures(wide_columns)(wide_inputs)
wide_outputs = tf.keras.layers.Dense(1, activation='sigmoid')(wide_features)

# Define and compile the wide model
wide_model = Model(inputs=wide_inputs, outputs=wide_outputs)
wide_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the wide model on the raw feature dict
x_wide = {name: tf.constant(values) for name, values in data.items()}
y = tf.constant(clicks)
wide_model.fit(x=x_wide, y=y, epochs=10)

Deep Component Explained

The deep component of the wide and deep model refers to the deep neural network layers that are used to capture complex and hierarchical patterns in the data. These layers are responsible for learning intricate features and representations that might not be immediately apparent from the raw input features. The deep component enhances the model’s ability to generalize well and make predictions on unseen data by extracting higher-level abstractions.

The deep component typically consists of multiple layers of neurons (also known as hidden layers) stacked on top of each other. Each neuron in a layer takes inputs from the neurons in the previous layer and applies a weighted sum, followed by an activation function. This process of combining and transforming inputs across layers allows the network to progressively learn more abstract features.
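
In other words, each layer computes a weighted sum of its inputs plus a bias for every neuron, then applies the activation function. A minimal NumPy sketch with made-up shapes (4 inputs, 3 neurons) looks like this:

import numpy as np

# Inputs coming from the previous layer (4 features)
x = np.array([0.5, -1.2, 3.0, 0.7])

# One row of weights and one bias per neuron in this layer (3 neurons)
W = np.random.randn(3, 4)
b = np.zeros(3)

z = W @ x + b            # weighted sums
h = np.maximum(z, 0.0)   # ReLU activation -> this layer's output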

Deep Component Architecture

1. Input Layer: The input layer directly receives the raw features of your data. Each feature corresponds to a neuron in this layer.

2. Hidden Layers: These are the intermediate layers between the input and output layers. The number of hidden layers and the number of neurons in each layer are adjustable hyperparameters. As you move deeper into the network, neurons start to learn complex combinations of features.

3. Activation Functions: Activation functions introduce non-linearity to the network, enabling it to learn complex relationships. Common activation functions include ReLU (Rectified Linear Activation), sigmoid, and tanh.

4. Output Layer: The output layer produces the final predictions of the model. The number of neurons in this layer corresponds to the number of output classes or the nature of the prediction task (regression, binary classification, multi-class classification).

Benefits and Challenges of Deep Component

Benefits:

  • Learning Complex Patterns: The deep component is capable of learning intricate and non-linear patterns in the data that might be essential for accurate predictions.
  • Hierarchical Representations: The network can automatically create hierarchical representations of the input data, which can lead to better generalization.
  • End-to-End Learning: The deep component enables the model to automatically learn feature representations from the raw data, reducing the need for manual feature engineering.

Challenges:

  • Overfitting: Deeper networks have a higher risk of overfitting, where the model memorizes the training data instead of learning general patterns. Regularization techniques like dropout and L2 regularization are often used to mitigate this issue (see the sketch after this list).
  • Training Complexity: Deeper networks can be computationally intensive to train and may require careful tuning of hyperparameters like learning rate and batch size.
  • Vanishing and Exploding Gradients: In very deep networks, gradients can become too small (vanishing gradients) or too large (exploding gradients), making training difficult. Techniques like batch normalisation and skip connections (residual networks) address these issues.
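
The snippet below is a minimal sketch (with hypothetical layer sizes) of how the mitigations mentioned above, namely L2 regularization, dropout, and batch normalization, are typically wired into a Keras model.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

input_dim, output_dim = 20, 3  # hypothetical dimensions

model = tf.keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 penalty on weights
    layers.BatchNormalization(),                             # stabilises activations and gradients
    layers.Dropout(0.3),                                     # randomly drops 30% of units during training
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(output_dim, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])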

Example

Here is a simplified example of building and training the deep component of a neural network using TensorFlow (with placeholder dimensions and random data standing in for a real dataset):


import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

# Placeholder dimensions and dummy data (replace with your own dataset)
input_dim, output_dim = 20, 3
data = np.random.rand(100, input_dim)
labels = tf.keras.utils.to_categorical(np.random.randint(output_dim, size=100))

# Create the deep component
inputs = Input(shape=(input_dim,))
hidden1 = Dense(32, activation='relu')(inputs)
hidden2 = Dense(64, activation='relu')(hidden1)
hidden3 = Dense(128, activation='relu')(hidden2)
output = Dense(output_dim, activation='softmax')(hidden3)

# Define and compile the deep model
deep_model = Model(inputs=inputs, outputs=output)
deep_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the deep model
deep_model.fit(x=data, y=labels, epochs=10, batch_size=32)

In this example, the deep component consists of three hidden layers with increasing numbers of neurons.

Building a Wide and Deep Model

In the wide and deep model, the outputs of both the wide and deep components are combined to make final predictions, leveraging the complementary strengths of both components.

We will be defining two input layers: input_a for the wide component and input_b for the deep component. Each input takes a single feature. These inputs will later be used to feed the data into the model.

# Import necessary libraries
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model

# Define inputs for the wide and deep components
input_a = Input(shape=[1], name="Wide_Input")
input_b = Input(shape=[1], name="Deep_Input")

We will pass input_b through two hidden layers, each with 30 neurons and ReLU activation. This section represents the deep neural network component that captures complex patterns and features in the data.

# Define the deep path
hidden_1 = Dense(30, activation="relu")(input_b)
hidden_2 = Dense(30, activation="relu")(hidden_1)

Now we need to combine the outputs of the wide component (input_a) and the final hidden layer of the deep component (hidden_2). The concatenate function is used to concatenate these two layers. Then, a dense layer with a single neuron and a linear activation function (Dense(1)) is added to produce the main output of the model. This architecture allows the model to leverage both the explicitly defined feature interactions from the wide component and the learned representations from the deep component.

# Define the merged path
concat = concatenate([input_a, hidden_2])
output = Dense(1, name="Output")(concat)

The final hidden layer hidden_2 from the deep component is connected to a dense layer with a single neuron to produce the auxiliary output. This auxiliary output can be used for tasks that might benefit from intermediate representations learned by the deep component.

# Define another output for the deep path (auxiliary output)
aux_output = Dense(1, name="aux_Output")(hidden_2)

To build the wide and deep model we specify the input and output layers. The model takes two inputs (input_a and input_b) and produces two outputs (output and aux_output). This creates a multi-output model.

# Build the model
model = Model(inputs=[input_a, input_b], outputs=[output, aux_output])

Let’s visualise the model using the plot_model function.

# Visualize the architecture
plot_model(model)
Plot 1: Wide Deep Model Architecture
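
To train this multi-output model, we compile it with one loss per output, typically giving the auxiliary output a smaller weight, and pass both inputs and both targets to fit. The snippet below is a minimal sketch using synthetic data in place of real features; the output names match the layer names defined above.

import numpy as np

# Synthetic data standing in for real features (hypothetical values)
n = 256
wide_feature = np.random.rand(n, 1)   # fed to Wide_Input
deep_feature = np.random.rand(n, 1)   # fed to Deep_Input
target = np.random.rand(n, 1)

# One loss per output; the auxiliary output gets a smaller weight
model.compile(optimizer='adam',
              loss={'Output': 'mse', 'aux_Output': 'mse'},
              loss_weights={'Output': 0.9, 'aux_Output': 0.1})

# Both outputs are trained against the same target here
model.fit([wide_feature, deep_feature],
          {'Output': target, 'aux_Output': target},
          epochs=10, batch_size=32)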

Google’s Use Case

Transforming Recommendation Systems

Google’s Wide & Deep Learning architecture revolutionised recommendation systems by addressing the limitations of traditional methods. This innovation led to a significant enhancement in user engagement and recommendation quality. Users started receiving personalized recommendations that aligned with both their explicit preferences and hidden interests.

Google Research Blog, “Wide & Deep Learning: Better Together with TensorFlow”: https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html

Final Thoughts

As we wrap up this discussion on wide and deep networks, it’s clear that the wide and deep architecture offers a versatile toolset for a range of machine learning tasks. The ability to capture both explicit and abstract patterns makes it a valuable addition to any machine learning practitioner’s toolkit.
