Companies planning to launch a computer vision project and integrate convolutional neural networks (CNNs) into their applications need to approach the initiative carefully and answer a number of questions up front:

  • Can we obtain enough good quality, domain-specific data to train the model(s) on?
  • Do we know which model architecture suits our needs best?
  • Is it necessary to opt for the latest, ultra-deep neural networks, with millions of trainable parameters, or will something less computationally intensive serve our purposes just as well?
  • Is it feasible to train our CNN in-house?
  • Once deployed, where is the network going to live? How do we ensure its end-to-end responsiveness stays high and make it accessible across applications?

These are just some of the questions firms need to address prior to taking on modeling tasks.

The article below describes how CNNs work, the fundamental differences between training and inference, the challenges firms tend to experience when deploying neural networks into real-world applications, and how to overcome them.

Convolutional Neural Networks: Training vs Inference

Before convolutional neural network models (or any deep learning algorithms) can produce predictions, they must be thoroughly trained. CNNs, the technology of choice for computer vision platforms, are particularly sample-inefficient, i.e., they require large amounts of labeled visual data to be passed through their layers to reach high accuracy in regression and classification tasks. Their training phase is, therefore, computationally intensive.

On top of that, most companies have trouble acquiring large enough domain-specific datasets, so they are likely to seek external assistance, draw on outside data sources, and so on.

It is common to use GPUs to accelerate the process of building up the parameters on the connections in convolutional neural networks. With their many cores and support for parallel computation, these processors are well suited to the simple matrix math that dominates training; they are much faster at it than CPUs, which, in machine learning, are typically used for complex calculations that have to run sequentially.
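To make the difference concrete, here is a minimal sketch, assuming PyTorch and a CUDA-capable GPU (neither is prescribed above), that times the same large matrix multiplication on both kinds of processor:

```python
# Minimal sketch: compare CPU vs GPU time for one large matrix multiplication.
# Assumes PyTorch is installed and a CUDA GPU is available.
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

start = time.time()
a @ b                                  # runs on general-purpose CPU cores
cpu_time = time.time() - start

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()           # make sure timing isn't skewed by async kernel launches
    start = time.time()
    a_gpu @ b_gpu
    torch.cuda.synchronize()
    gpu_time = time.time() - start
    print(f"CPU: {cpu_time:.3f}s, GPU: {gpu_time:.3f}s")
```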

During training, the model runs forward passes through the labeled dataset (one complete pass through all data objects is referred to as an epoch) multiple times over. Input images are operated on by the layers of the network, which, at this point, all have random weights; the CNN produces its first attempts at predicting class labels and scores.

Then, based on the values it receives, the model computes the error via a loss function: it compares its first outputs, the results of random initialization, to the actual labels to see how close to (or far from) the desired values it is.

Afterward, the error is backpropagated from the end of the network to its start so that the CNN can update its weights accordingly. This is typically done with one of the gradient descent optimization (weight update) algorithms.
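Put together, a single training step looks roughly like the sketch below. It assumes PyTorch and uses a small stand-in CNN and dummy data purely for illustration; the article itself does not prescribe a framework or architecture.

```python
# One training step: forward pass, loss, backpropagation, weight update.
import torch
import torch.nn as nn

model = nn.Sequential(                 # stand-in CNN for 32x32 RGB images
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),                 # 10 output classes
)
loss_fn = nn.CrossEntropyLoss()        # compares predicted scores to the true labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient-descent weight update

images = torch.randn(8, 3, 32, 32)     # a batch of labeled samples (dummy data here)
labels = torch.randint(0, 10, (8,))

scores = model(images)                 # forward pass: class scores from current weights
loss = loss_fn(scores, labels)         # error between predictions and labels
loss.backward()                        # backpropagate the error through the network
optimizer.step()                       # update the weights
optimizer.zero_grad()                  # clear gradients before the next step
```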

Convolutional Neural Networks: Training

With full-batch training, the weights change only once per epoch, which is why training a CNN model can turn out to be computationally very expensive. Given that datasets for modern CNNs tend to be large, it takes a lot of time and resources to get through every sample.

So, to speed up the process and cut down on the costs, at least to a degree, data scientists often split the data into batches and have their models update weights after processing each batch. This approach requires far less time to converge and thus reduces the overall number of epochs needed for training.

Having the model operate on a batch of input samples (images, in the case of CNNs) at once also helps prevent overfitting and improves efficiency by amortizing the cost of loading weights from GPU memory across multiple inputs.
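As a rough illustration, a batched epoch loop might look like the following sketch, assuming PyTorch's DataLoader and reusing the model, loss_fn, and optimizer objects from the previous sketch:

```python
# Mini-batch training: the weights update after every batch, not after every epoch.
# Reuses `model`, `loss_fn`, and `optimizer` from the previous sketch.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy labeled dataset standing in for a real domain-specific one
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for epoch in range(5):                    # one epoch = one full pass over the dataset
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()                  # weight update once per batch
```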

The inference stage, on the other hand, refers to the point when the sufficiently trained CNN is applied to data outside the labeled dataset, i.e., when it makes predictions on raw inputs. It, too, involves forward-propagation calculations, but this time without the backward pass: a model used for inference no longer modifies its parameters.
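In PyTorch terms (again, just one possible framework), inference with the model trained above might look like this:

```python
# Inference: forward propagation only, no gradients, no weight updates.
import torch

model.eval()                     # switch layers such as dropout/batch norm to inference mode
with torch.no_grad():            # no backward pass, so no gradients are tracked
    raw_input = torch.randn(1, 3, 32, 32)    # a single unlabeled sample
    scores = model(raw_input)                # forward pass with frozen weights
    predicted_class = scores.argmax(dim=1)   # the model's prediction
```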

The numbers of data objects fed into CNNs during this stage are deliberately smaller: although big batches of data let us use GPU capacity more efficiently, in applications such as real-time image-processing pipelines, latency matters too.

It can take a long time to stack up images into a sizeable batch, while firms always strive to reduce their model's overall end-to-end response time. So the key here is finding the right balance between throughput and latency: companies should aim to maximize usable batch sizes without letting latency step over a given application-specific threshold.
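One hedged way to express that balance in code is a simple batching helper that stops collecting requests once either the batch is full or the latency budget runs out; max_batch_size and max_wait_ms below are illustrative names and values, not figures from this article:

```python
# Sketch of throughput-vs-latency batching for an inference service.
import time

def collect_batch(request_queue, max_batch_size=32, max_wait_ms=20):
    """Gather requests until the batch is full or the latency budget expires."""
    batch = []
    deadline = time.time() + max_wait_ms / 1000.0
    while len(batch) < max_batch_size and time.time() < deadline:
        if request_queue:
            batch.append(request_queue.pop(0))   # take the oldest waiting request
        else:
            time.sleep(0.001)                    # briefly wait for more requests to arrive
    return batch                                 # run one forward pass over whatever arrived
```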

It’s important to understand that training and inference stages are vastly different in how firms handle their execution. If a company has a real-time streaming system, its convolutional neural network isn’t likely to be trained inside it. Rather, the training process will happen offline (with saved labeled input-output pairs) and then the CNN will be set up to draw inferences from real data, in smaller batches, as it flows into the system.


What’s A Firm to Do When There’s a Lack of Data for a Computer Vision Project?

As we’ve mentioned, it’s a huge challenge to collect enough data to train a specific deep learning algorithm on. Especially when you’re working on some sort of a niche problem.

One way to tackle this is transfer learning.

In a nutshell, we don't necessarily need to train our CNNs from scratch; we can repurpose a model that has already been trained and has built up a core feature set. This works because CNNs tend to pick up very general features when processing images (edges, textures, facial features, text, etc.), and these features can then be leveraged for other use cases. Transfer learning helps companies reduce the computational burden and avoid having to tie up swaths of GPUs.

When we apply this method, we don't initialize the weights randomly at the beginning; we use the layer parameters the model has already learned on other data. Then we proceed to train the model further on our domain-specific dataset, which, typically, isn't exhaustive.
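As a sketch, assuming torchvision's pretrained ResNet-18 (any comparable pretrained network would do, and the five-class head is just an example), the setup might look like this:

```python
# Transfer learning: start from weights learned on other data, retrain on our own.
import torch.nn as nn
from torchvision import models

# Load a network whose layer parameters were learned on large, general-purpose image data
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the classification head to match, say, 5 domain-specific classes
model.fc = nn.Linear(model.fc.in_features, 5)
# training then continues on the (typically small) domain-specific dataset
```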

There are a few ways to go about this:

We can continue performing backpropagation and updating weights on the pre-trained model, or fine-tune some of its layers selectively. It all begins with adjusting the higher layers; then we can gradually work our way back to the earlier layers, all the while assessing the network's performance to see where we should stop.
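A sketch of that selective fine-tuning, continuing with the pretrained ResNet-18 from the previous sketch:

```python
# Selective fine-tuning: freeze everything, then unfreeze from the top of the network down.
for param in model.parameters():
    param.requires_grad = False           # freeze all layers first
for param in model.layer4.parameters():   # highest convolutional block of ResNet-18
    param.requires_grad = True
for param in model.fc.parameters():       # the new classification head
    param.requires_grad = True
# unfreezing layer3, layer2, ... works gradually back toward the earlier layers,
# checking validation performance after each step to decide where to stop
```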

Alternatively, we can use the pre-trained network as a feature extractor and train a support-vector machine, or another linear classifier, on those features. This method works well when there isn't a whole lot of data available and we want to avoid overfitting.
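A sketch of the feature-extractor approach, assuming torchvision for the backbone and scikit-learn's LinearSVC for the classifier; the data tensors below are dummy stand-ins for a small domain-specific dataset:

```python
# Feature extraction: a frozen pretrained CNN produces features, a linear SVM classifies them.
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import LinearSVC

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()               # drop the classification head, keep the 512-d features
backbone.eval()

images = torch.randn(32, 3, 224, 224)     # stand-in for a small domain-specific dataset
labels = torch.randint(0, 5, (32,))       # stand-in class labels

with torch.no_grad():
    features = backbone(images).numpy()   # extract features with the frozen CNN
clf = LinearSVC().fit(features, labels.numpy())   # train a linear classifier on top
```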

Summing up

When developing a computer vision system, it's crucial to be pragmatic about the choice of CNN architecture. It is often best to use a reliable, proven CNN (AlexNet, ResNet, etc.) as a starting point for the layer architecture. The goal here is finding a network that's capable of modeling data related to your domain.
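For instance, assuming torchvision (just one possible source of reference implementations), a proven architecture can be pulled in with a couple of lines:

```python
# Start from a reference implementation rather than designing layers from scratch.
from torchvision import models

starting_point = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
# or, for a deeper residual network:
# starting_point = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
```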

Afterward, we should think about how to deploy the model under specific data storage and security constraints, and how to ensure there's always access to the right version of the CNN (if many ML workflows are running simultaneously) across the firm's applications.

Want to launch a computer vision project? Need more information on how to integrate CNNs into your applications? Contact our expert for a free consultation.