Accelerating in the deep

Anna Alexandra Grigoryan
3 min read · Aug 7, 2019


Wherever deep learning is involved, acceleration comes first.

Deep learning is cool and very applicable to some problems, but it takes a lot of time to train a deep net.

To speed up this process, accelerated hardware such as Google’s Tensor Processing Unit (TPU) or an Nvidia GPU can be used.

Why did deep learning become so popular?

  • Increase in computer processing capabilities
  • Massive amounts of data
  • Advances in machine learning algorithms and a dramatic increase in research

Why is deep learning so slow?

A deep learning pipeline consists of three components: (1) preprocessing the data, (2) training the model and (3) deploying the model.

The pipeline is slow, because:

  • The process of training a deep learning model is slow itself
  • Building a deep neural network is an iterative process, needing tuning and optimization
  • The models need to be updated as new data arrives over time.

The training phase is the most computationally intensive task in the pipeline and consists of two main operations: (1) the forward pass and (2) backpropagation. In the forward pass, we feed the data through the network and an output is generated. In the backward pass, the weights get updated to minimize the error measured after the forward pass. Since training is an iterative process and many weights get updated on every iteration, it involves extensive computation, mostly matrix multiplication, and therefore requires a lot of computing power.
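As a rough illustration (a minimal NumPy sketch of a single linear layer trained with mean-squared error, not the exact computation TensorFlow performs), both passes boil down to matrix multiplications:

import numpy as np

# Toy regression data: 1,000 samples with 64 features each.
X = np.random.randn(1000, 64)
y = np.random.randn(1000, 1)

W = np.zeros((64, 1))   # weights of a single linear layer
lr = 0.01               # learning rate

for step in range(100):
    # Forward pass: a matrix multiplication produces the predictions.
    y_pred = X.dot(W)
    error = y_pred - y

    # Backward pass: the gradient of the mean-squared error is again
    # a matrix multiplication.
    grad_W = 2.0 / len(X) * X.T.dot(error)

    # Update the weights to reduce the error from the forward pass.
    W -= lr * grad_W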

Sequential approach vs parallelism

It would take much longer if we did those multiplications sequentially. Parallelism, i.e. running the operations in parallel, speeds the process up considerably.
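As a crude illustration (a toy micro-benchmark; the numbers depend entirely on your hardware and the BLAS library NumPy is built against), compare multiplying two matrices with pure Python loops against a single vectorized call that the backend can parallelize:

import time
import numpy as np

n = 200
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# Sequential: a triple Python loop, one scalar multiply-add at a time.
start = time.perf_counter()
C = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            C[i, j] += A[i, k] * B[k, j]
print("loops:     ", time.perf_counter() - start)

# Vectorized: one matmul call that the underlying library can parallelize.
start = time.perf_counter()
D = A.dot(B)
print("vectorized:", time.perf_counter() - start)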

CPU vs GPU

The Central Processing Unit (CPU) is the general-purpose processor of a computer system. The CPU is responsible for executing a sequence of stored instructions, in our case matrix multiplications.

Although the CPU is good at working with small amounts of memory quickly, it is not well suited to the big matrices used in deep learning. To do the processing in parallel, clusters of CPUs would have to be used.

The Graphics Processing Unit (GPU) is a processor traditionally designed for rendering images, animations and videos. GPUs have many cores, often numbering in the thousands, so they can handle many computations at once.

GPUs are good for deep learning because:

  • They perform well on big chunks of data, handling large, high-dimensional matrices efficiently.
  • They are built for parallel execution, i.e. the concurrent processing deep learning requires.

GPUs do not have direct access to host memory: to run an operation on the GPU, the data first needs to be copied to GPU memory, and the results are then copied back.

If a TensorFlow operation has both CPU and GPU implementations, the GPU device is given priority.

Using GPU for deep learning

We can either explicitly specify that an operation should run on the GPU, or leave it to TensorFlow to use a GPU device when one is available. GPUs can outperform CPUs here because they are designed to handle this kind of matrix operation in parallel, while a single CPU core processes it one element at a time.

TensorFlow supports three types of devices:

  • “/cpu:0”: the CPU
  • “/gpu:0”: the GPU (if the machine has one)
  • “/device:XLA_CPU:0” or “/device:XLA_GPU:0”: devices backed by XLA, the optimized domain-specific compiler for linear algebra

What is XLA?

XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra. It can improve speed, memory usage, and portability.
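For TF 1.x-style code like the snippets below, XLA’s just-in-time compilation can, as far as I know, be switched on through the session config (in TF 2.x the equivalent is tf.config.optimizer.set_jit); a rough sketch:

import tensorflow as tf

# Ask TensorFlow to JIT-compile eligible parts of the graph with XLA.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
# This config is then passed to tf.Session, as shown further below.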

Before we start using the GPU in TensorFlow, make sure you have an up-to-date version of TensorFlow (or the tf-nightly build) installed (https://www.tensorflow.org/install).

import sys
import numpy as np
import tensorflow as tf
from datetime import datetime

# Let each process use at most half of the GPU memory and keep
# device-placement logging off for now.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
config.log_device_placement = False
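In TF 1.x this config only takes effect once it is passed to a session, for example:

sess = tf.Session(config=config)
# ... build the graph and execute it with sess.run(...) ...
sess.close()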

Then we list the GPUs and CPUs available on our machine:

from tensorflow.python.client import device_lib

def get_available_gpus():
    # List every device TensorFlow can see and return their names.
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

get_available_gpus()
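On a machine with a single GPU this typically returns names such as “/device:CPU:0” and “/device:GPU:0”, although the exact strings depend on the TensorFlow version and the hardware available.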

By default, the GPU gets priority for operations that support running on it, but we can manually place an operation on a specific device with the tf.device command.
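For instance, a small sketch (the matrices here are arbitrary examples) that pins a matrix multiplication to the first GPU and logs where each operation actually ran:

import tensorflow as tf

# Explicitly place the operations on the first GPU ("/cpu:0" would pin them to the CPU).
with tf.device("/gpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]], name="a")
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]], name="b")
    c = tf.matmul(a, b)

# log_device_placement=True prints the device each operation was assigned to.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))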
