In the previous blog in this series, we discussed how to choose between deploying your AI to the edge, the extreme edge or the cloud. In this blog, we’ll dive deeper into the hybrid approach: distributing a deep neural network between device and cloud.
Dividing AI into pieces
To divide the AI model over different levels, from extreme edge to cloud, you need to cut the model into pieces. You can achieve this by splitting the model into two parts: data reduction and prediction.
Let’s imagine you want to build your own Alexa-style home assistant and integrate it into a coffee machine. You want to call out and order your favorite coffee. Unfortunately, the coffee maker does not have enough computational power to host the entire neural network at the edge. But you don’t want to send the whole voice segment to the cloud for processing either. So you decide to go for a hybrid approach, putting part of the model in the coffee machine and part in the cloud. To understand how you can cut the model into pieces, you need to imagine what the AI algorithm has to do to convert your words into a creamy cappuccino.
Data reduction & prediction process © Verhaert
Step 1: Data reduction, removing the noise
When you talk to the device, the microphone picks up a mixture of the actual voice signal, measurement noise and uninteresting background sounds. First, the algorithm should reduce this noisy signal to its most meaningful part. This is the data reduction step. The new signal is stripped of unnecessary information and background sounds, leaving nothing but the essence of your message: “Get my espresso pronto!”. We call this reduced data an ‘embedding’ or ‘encoding’ of the signal. This compressed, encoded signal is cheaper to send over your Wi-Fi connection and more secure, since the raw recording never leaves the device.
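To make the idea concrete, here’s a minimal sketch of such an encoder in PyTorch. Everything in it is an illustrative assumption (the framework, the layer sizes, the 64-dimensional embedding), not an actual implementation:

```python
import torch
import torch.nn as nn

# Hypothetical encoder: compresses one second of 16 kHz audio
# into a small 64-dimensional embedding.
encoder = nn.Sequential(
    nn.Linear(16000, 512),  # raw samples -> intermediate features
    nn.ReLU(),
    nn.Linear(512, 64),     # intermediate features -> embedding
)

audio = torch.randn(1, 16000)   # stand-in for a recorded voice snippet
embedding = encoder(audio)      # shape: (1, 64)

# 16,000 values reduced to 64: far cheaper to send over Wi-Fi,
# and the raw recording never has to leave the device.
print(audio.numel(), "->", embedding.numel())
```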
In our daily lives, we perform data reduction all the time. Imagine driving your car in the pouring rain. Your vision provides the essential information for driving. To make it easier to drive, you use the wipers to remove water from the windshield, removing the visual noise. But even then, you still perceive a lot of information you don’t need: billboards, buildings, people on the sidewalks, farmland and so much more. While you’re driving, your brain automatically filters out this distracting information, making it easier for the second part of your brain’s algorithm to predict your next action.
Step 2: Prediction, deciding the outcome
The prediction step follows the reduction step and extracts the intended action, transforming the encoded output from step 1 into a useful task. In our coffee example, the algorithm takes the encoded signal and extracts the command to start brewing the type of coffee you want: espresso. In the case of driving, your brain predicts the right action: steer sideways, hit the gas or hit the brakes.
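A prediction head for the coffee example could then look like the sketch below, mapping the embedding to one of a handful of commands. The command list and layer sizes are again assumptions for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical prediction head: maps a 64-dim embedding (the output
# of the encoder sketched earlier) to one of a few coffee commands.
commands = ["espresso", "cappuccino", "latte", "stop"]

head = nn.Sequential(
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, len(commands)),   # one score per command
)

embedding = torch.randn(1, 64)       # stand-in for the encoder output
scores = head(embedding)             # shape: (1, 4)
intent = commands[scores.argmax(dim=1).item()]
print("Predicted command:", intent)  # e.g. "espresso"
```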
Teaching deep neural networks to separate
Now that you know you can divide the model into one part to process the measurements and another to perform the actual task, you need to find a way to separate the steps. A standard neural network typically wouldn’t have two perfectly separable parts for data reduction and prediction. Both steps are present, but the data reduction step gradually blends into the prediction step as the data flows through the layers of the network.
Splitting data reduction & prediction © Verhaert
Nevertheless, it is possible to force deep neural networks to split into distinguishable parts during the learning process. In such a network, earlier layers focus on extracting interesting information from the measurements, while later layers focus on making a decision. A handy method for forcing this behavior is to use a bottleneck in the architecture: a deliberately narrow layer that all information must pass through, which forces the network to compress the signal at that point. For instance, in a neural network that classifies images, the first layers learn how to find lines and basic shapes in the picture, whereas later layers focus on giving meaning to the combination of these shapes, classifying the image.
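As a rough sketch of this idea, the toy network below contains a deliberate bottleneck and is afterwards cut in two at that point. The layer sizes and split index are assumptions, but the pattern of slicing one trained network into an edge piece and a cloud piece carries over to real architectures:

```python
import torch
import torch.nn as nn

# One network with a deliberate bottleneck (the narrow 64-unit layer).
model = nn.Sequential(
    nn.Linear(16000, 512), nn.ReLU(),
    nn.Linear(512, 64),    nn.ReLU(),  # <- bottleneck: forces compression
    nn.Linear(64, 32),     nn.ReLU(),
    nn.Linear(32, 4),                  # e.g. 4 possible commands
)

# After training, cut the network at the bottleneck.
encoder = model[:4]  # layers up to and including the bottleneck -> edge
head    = model[4:]  # remaining layers                          -> cloud

x = torch.randn(1, 16000)
assert torch.allclose(model(x), head(encoder(x)))  # same result, two pieces
```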
Deploying the model to edge and cloud
Deploying the model to edge and cloud © Verhaert
Since you can train the neural network to split into two separable pieces, you can now deploy the first piece to the edge device and the second one to the cloud.
In the case of our smart coffee machine, you’d want the first part of the model to compress the data maximally with minimal computational power. This way, you send only the bare minimum required for intent detection. AI engineers achieve this by cleverly selecting the model architecture and loss function while training the model. During training, the second piece of the model is forced to do the most challenging task: extracting the intent. This part requires significantly more computational power, which is why it is deployed to the cloud.
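Put together, the deployment might look like the hypothetical sketch below: the coffee machine runs only the encoder and sends the tiny embedding to a cloud endpoint that hosts the prediction piece. The endpoint URL and JSON format are placeholders, not a real API:

```python
import json
import urllib.request

import torch
import torch.nn as nn

# Edge side: the lightweight encoder is the only model on the device.
encoder = nn.Sequential(nn.Linear(16000, 512), nn.ReLU(), nn.Linear(512, 64))

audio = torch.randn(1, 16000)                   # the recorded voice command
embedding = encoder(audio).squeeze(0).tolist()  # 64 floats instead of 16,000

# Ship only the embedding to the cloud, where the prediction piece runs.
# "https://example.com/intent" is a placeholder endpoint, not a real API.
payload = json.dumps({"embedding": embedding}).encode()
request = urllib.request.Request(
    "https://example.com/intent",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(request)  # cloud replies e.g. "espresso"
```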
Hopefully, you now know how to think about the hybrid approach for distributing deep neural nets across the edge and the cloud. However, the hybrid approach isn’t always the best one. If your coffee maker only needs to recognize a limited set of commands, the AI model could reside entirely on the coffee machine. If your coffee maker needs intelligence comparable to Alexa, you would probably want to go for a hybrid or a full cloud implementation. Want to know more about how to decide between edge, cloud and hybrid? Read the first part of our blog series!