In my previous blog I talked about how demand for AI-based interfaces has become almost unavoidable, and that while adding an AI-based interface like face-id to authorize access to a machine may at first seem like a huge leap, it doesn't have to be as difficult as you might think. There's a wealth of network platforms available, lots of training options and even open-source applications, such as that face-id example. You can be up and testing pretty quickly with a prototype you can run on your PC.
(Source: CEVA)
Constraints
Moving a trained network to your embedded app may seem like another huge hurdle. PC- or cloud-trained networks aren't optimized much for memory usage or power. They may use floating point or double words for network calculations, and they'll lean heavily on off-chip memory accesses as they process sliding windows over an image. That's not a concern for a prototype running on a high-performance PC plugged into wall power, but you need to be a lot thriftier in your end application, with no compromise in performance.
The essentials of optimizing
One key step in optimizing is quantization. Switching the weights from floating point to fixed point and reducing precision, say from 32-bit floating point to 8-bit integers, shrinks not only the weights but also the intermediate compute values. This alone can reduce memory footprint significantly, with little noticeable impact on recognition quality in most cases.
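As an illustration, here is a minimal sketch of post-training quantization using the TensorFlow Lite converter (more on TensorFlow Lite below). The model path, input shape and calibration loop are placeholder assumptions; in a real flow the representative dataset would come from your own training or validation images.

```python
import numpy as np
import tensorflow as tf

# Assumes a trained Keras/TensorFlow model exported to "face_id_model/";
# the path and the 112x112x3 input shape are illustrative placeholders.
converter = tf.lite.TFLiteConverter.from_saved_model("face_id_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # A few hundred typical inputs let the converter choose scaling
    # factors that map each tensor's range onto 8-bit integers.
    for _ in range(100):
        yield [np.random.rand(1, 112, 112, 3).astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("face_id_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

With weights and activations stored as 8-bit integers instead of 32-bit floats, the model's memory footprint drops to roughly a quarter before any other optimization is applied.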
A second manipulation is to exploit sparsity in the weights with minimal accuracy impact. The idea is to take the weights that are already close to zero and round them to zero, while keeping close track of the effect on accuracy. Weights are used to multiply values into partial sums, a pointless exercise when one of the factors is zero, so there's no need to perform those operations at all.
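A minimal sketch of this magnitude-based pruning, assuming a simple global threshold (the value 0.01 and the toy weight matrix are illustrative; in practice the threshold is tuned against a validation set so accuracy stays within budget):

```python
import numpy as np

def prune_weights(weights, threshold=0.01):
    """Zero out weights whose magnitude falls below the threshold,
    returning the pruned array and the resulting sparsity ratio."""
    mask = np.abs(weights) >= threshold
    return weights * mask, 1.0 - mask.mean()

# Toy 4x4 layer: the small weights are forced to zero.
w = np.array([[ 0.40,  0.003, -0.25,  0.008],
              [ 0.002, 0.31,   0.006, -0.52],
              [-0.009, 0.004,  0.18,   0.001],
              [ 0.27, -0.005,  0.007,  0.44]])
pruned, sparsity = prune_weights(w)
print(f"sparsity: {sparsity:.0%}")   # 56% of the weights are now zero
```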
In practical implementations, images are processed incrementally, so weights must be updated as the calculation window moves across the image. That can make for a lot of updates and a lot of traffic. By forcing a large percentage of the weights to zero, the weight array can be compressed, making it possible to store all or much of the array in on-chip SRAM and decompress it on demand. That in turn minimizes trips to main memory, which increases performance and reduces power. It also incidentally reduces on-chip traffic when loading weights, and less traffic contention means higher throughput.
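A toy sketch of the idea, assuming a simple coordinate-list encoding (production toolchains use denser, hardware-specific formats): only the non-zero weights and their positions are stored, and the dense array is rebuilt on demand.

```python
import numpy as np

def compress(weights):
    """Keep only the non-zero entries plus their flat indices."""
    idx = np.flatnonzero(weights)
    return idx.astype(np.uint32), weights.flat[idx]

def decompress(idx, values, shape):
    """Rebuild the dense weight array just before a layer runs."""
    dense = np.zeros(np.prod(shape), dtype=values.dtype)
    dense[idx] = values
    return dense.reshape(shape)

# With roughly 80% of the weights forced to zero, the stored data
# shrinks in proportion to the remaining non-zero fraction.
w = np.random.randn(256, 256).astype(np.float32)
w[np.abs(w) < 1.3] = 0.0           # crude threshold, just for the demo
idx, vals = compress(w)
print(f"non-zero fraction: {idx.size / w.size:.0%}")
```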
One more factor should be considered. Like most complex applications, neural nets depend on sophisticated libraries. You'll need a library designed for microcontroller environments and compiled for your platform of choice. A good starting point might be an open-source library such as TensorFlow Lite, but to get the most out of the microcontroller, a dedicated, tailored solution will be required.
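For a feel of the runtime side, here is how the quantized model from the earlier sketch could be exercised with the TensorFlow Lite interpreter in Python. On a microcontroller the equivalent would be TensorFlow Lite for Microcontrollers or a vendor runtime; the file name and the dummy input frame here are placeholders.

```python
import numpy as np
import tensorflow as tf

# Load the quantized model produced in the earlier quantization sketch.
interpreter = tf.lite.Interpreter(model_path="face_id_int8.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# A zero frame stands in for a real camera image.
frame = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()

embedding = interpreter.get_tensor(out["index"])
print(embedding.shape)
```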
Of course, knowing what you have to do doesn't make it easy. You now need to find a platform that will streamline these operations and provide hardware-optimized libraries.
How do I make this an easy-to-use flow?
What you want is a flow where you can take the network you trained on your framework of choice, TensorFlow for example, and compile it directly onto your embedded solution, with no intervention other than dialing in a few basic requirements. Of course, you also want the option to hand-optimize further, maybe setting different levels of quantization in different layers, or experimenting with weight thresholds versus on-chip memory sizes. And you want libraries optimized to the hardware and hardware optimized to the libraries.
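To make that concrete, here is a purely hypothetical configuration sketch, not any real tool's interface, showing the kind of knobs such a flow might expose: per-layer bit widths, a pruning threshold and an on-chip memory budget.

```python
# Hypothetical deployment configuration; every key and value below is
# illustrative only and does not correspond to an actual tool's API.
deploy_config = {
    "model": "face_id_model",          # trained TensorFlow network
    "default_quantization_bits": 8,    # global weight/activation precision
    "per_layer_bits": {                # override where accuracy is sensitive
        "conv_1": 16,
        "final_dense": 16,
    },
    "prune_threshold": 0.01,           # weights below this magnitude become zero
    "on_chip_sram_kb": 512,            # budget for the compressed weight array
    "target": "embedded_dsp",          # hardware-optimized library backend
}
```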
Proven AI platforms like CEVA's CDNN are designed to provide this type of flow. CDNN offers an offline processor toolset for quantization and runtime task generation, as well as tailored runtime libraries for CEVA DSPs and customers' hardware accelerators. CEVA's solution supports all popular AI model formats, including TensorFlow Lite, ONNX, Caffe and others.
Published on Embedded.com.