Have we settled for what we think is possible in AI at the edge, rather than what we really want? Performance through novel architectures for cloud datacenter AI seems unbounded, so why can't edge AI also make quantum leaps in performance without compromising power? Now it can. With the CEVA NeuPro-M heterogeneous and secure AI/ML processor architecture you can build an AI solution with a significant boost in real performance, at even lower power, while retaining tremendous flexibility. This new embedded AI processor combines an unmatched range of dedicated coprocessors with a high level of traffic optimization and parallel processing, all with optional secure access to protect network weights and biases against theft.
Source: CEVA
Heterogeneous Coprocessor and Accelerator Options
Edge devices have a diverse range of needs. They all want more performance, less power and smaller area, within a spectrum of possible tradeoffs. We all want to stretch to the very best that is possible within a product envelope, but without needing to switch platforms (i.e. hardware, firmware and software). This starts with a range of options from which you can build ML solutions that best meet your needs.
If you want an out-of-the-box Transform, it can be part of your solution, providing up to a 2X performance gain for convolutional neural networks while also reducing power, with no precision degradation even with restricted data types like 4-bit and 8-bit, and all without the need for a training phase. Add the sparsity engine to the inference flow for up to 4X acceleration by exploiting zero-value data or weights. NeuPro-M can support a wide range of data types in a mixed-precision inferencing process: fixed point from 2×2 to 16×16 bits, and floating point (including Bfloat) from 16×16 to 32×32 bits.
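To make the sparsity figure concrete, here is a minimal software sketch of why skipping zero-valued weights translates directly into fewer multiply-accumulates. This is not CEVA code; the int8 quantization scales, the 75% sparsity level and the function name are assumptions chosen only to illustrate the idea.

```python
# Illustrative sketch only: a software model of why a sparsity engine helps.
import numpy as np

def sparse_int8_dot(weights_q, acts_q, w_scale, a_scale):
    """Dot product over int8 operands that skips zero-valued weights.

    Hardware that gates multiply-accumulates on zero weights spends no cycles
    (or energy) on them; here we simply count the MACs that were avoided.
    """
    nonzero = weights_q != 0
    macs_done = int(nonzero.sum())
    macs_skipped = int(weights_q.size - macs_done)
    # Accumulate in int32, then dequantize with the two fixed-point scales.
    acc = np.sum(weights_q[nonzero].astype(np.int32) *
                 acts_q[nonzero].astype(np.int32))
    return float(acc) * w_scale * a_scale, macs_done, macs_skipped

# Toy example: a ~75%-sparse int8 weight vector against int8 activations.
rng = np.random.default_rng(0)
w = rng.integers(-127, 128, size=1024, dtype=np.int8)
w[rng.random(1024) < 0.75] = 0            # prune to roughly 75% zeros
a = rng.integers(-127, 128, size=1024, dtype=np.int8)
y, done, skipped = sparse_int8_dot(w, a, w_scale=0.02, a_scale=0.05)
print(f"result={y:.2f}, MACs executed={done}, MACs skipped={skipped}")
# Roughly 4x fewer MACs at 75% sparsity, which is the intuition behind the
# "up to 4X acceleration" figure quoted above.
```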
The complementary streaming logic unit handles on-the-fly fixed-point scaling, activation and pooling. You can also use a vector processor to support future layer architectures in your AI processor. NeuPro-M also offers a set of next-generation AI features, including transformers for vision and NLP, fully connected layers, 3D convolution, RNNs, and matrix decomposition options (with potential for further significant acceleration and power reduction).
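As a rough illustration of what a fused, streaming post-processing chain does, the sketch below rescales an int32 accumulator tile to int8, applies an activation and pools it in one pass, with no intermediate buffer written back to external memory. The power-of-two scaling scheme, the 2×2 max pooling and the function names are assumptions for the example, not NeuPro-M internals.

```python
# Minimal sketch of an on-the-fly post-processing chain: rescale, activate,
# pool, all on the same in-flight tile.
import numpy as np

def stream_postprocess(acc_tile_i32, shift, relu=True, pool=2):
    """acc_tile_i32: int32 convolution accumulators, shape (H, W)."""
    # Fixed-point rescale: arithmetic shift back down to the int8 range.
    x = np.clip(acc_tile_i32 >> shift, -128, 127).astype(np.int8)
    if relu:
        x = np.maximum(x, 0)                      # activation
    h, w = x.shape
    h2, w2 = h - h % pool, w - w % pool           # drop any ragged edge
    x = x[:h2, :w2]
    # pool x pool max pooling without leaving the tile.
    return x.reshape(h2 // pool, pool, w2 // pool, pool).max(axis=(1, 3))

tile = np.random.default_rng(1).integers(-2**20, 2**20, size=(8, 8),
                                         dtype=np.int32)
print(stream_postprocess(tile, shift=13).shape)   # (4, 4)
```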
Higher Throughput
In most AI systems, some of these functions might be handled in specialized engines, requiring data to be offloaded and the results to be loaded back when completed. That's a lot of added latency, completely undermining performance in an otherwise strong model. NeuPro-M eliminates that issue by connecting all these coprocessors directly to a shared L1 memory, sustaining much higher bandwidth than you'll find in alternative solutions.
The vector processing unit sits at the same level as the other coprocessors within each NPM engine, so your custom layer algorithms implemented in the VPU benefit from the same acceleration as the rest of the model. Here too, no offload and reload is needed to accelerate a custom layer. Also important, you can have up to 8 of these NPM engines (all the coprocessors, plus the NPM L1 memory) in a single NeuPro-M core, offering complete scalability per designated use case. NeuPro-M also offers a significant level of software-controlled bandwidth optimization between the L2 memory and the L1 memory, optimizing frame handling and minimizing the need for DDR accesses.
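As a sketch of how that scalability might be exploited in software, the toy splitter below divides a frame's rows into bands so several NPM engines can work in parallel. The row-band strategy, the engine count and the frame size are assumptions for illustration, not the actual NeuPro-M programming model.

```python
# Illustrative only: split work across parallel engines in equal row bands.
def partition_rows(frame_rows: int, n_engines: int) -> list[range]:
    """Assign contiguous row bands of a frame to engines for parallel work."""
    base, extra = divmod(frame_rows, n_engines)
    bands, start = [], 0
    for i in range(n_engines):
        rows = base + (1 if i < extra else 0)   # spread any remainder evenly
        bands.append(range(start, start + rows))
        start += rows
    return bands

# A NeuPro-M core scales up to 8 NPM engines; here we model 4 of them
# working on a 1080-row frame.
for engine_id, band in enumerate(partition_rows(1080, 4)):
    print(f"engine {engine_id}: rows {band.start}-{band.stop - 1}")
```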
Data and weight traffic are also minimized. For data, the coprocessors and accelerators connect directly to each other for a true on-the-fly, head-to-tail fused operation pipeline, and can also work simultaneously on the same local L1 memory data, as already mentioned. In some cases, the host can communicate data directly with the NeuPro-M L2, again reducing the need for DDR transfers. Weights are stored compressed in DDR memory and are decompressed on-chip in real time. Similarly, to reduce bandwidth constraints, data can be compressed for transfer over the external interface. The unique memory hierarchy implementation in NeuPro-M will hide latency and cycle penalties in cases where DDR access is unavoidable.
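The bandwidth benefit of compressed weights is easy to see with a toy codec. The sketch below stores only the nonzero int8 weights of a sparse layer plus their 16-bit positions, then rebuilds the dense tensor on the consumer side; the actual on-chip compression scheme is not public, so this format and the 80% sparsity figure are purely illustrative.

```python
# Toy weight codec: illustrates why compressed weights cut DDR traffic.
import numpy as np

def compress_sparse(weights: np.ndarray):
    """Keep only nonzero int8 values plus their uint16 positions."""
    idx = np.flatnonzero(weights).astype(np.uint16)
    return weights.size, idx, weights[idx].astype(np.int8)

def decompress_sparse(size, idx, vals):
    """Rebuild the dense tensor, as on-chip decompression would before use."""
    out = np.zeros(size, dtype=np.int8)
    out[idx] = vals
    return out

rng = np.random.default_rng(2)
w = rng.integers(-127, 128, size=4096, dtype=np.int8)
w[rng.random(w.size) < 0.8] = 0                       # ~80% sparse layer
size, idx, vals = compress_sparse(w)
assert np.array_equal(decompress_sparse(size, idx, vals), w)
print(f"DDR traffic: {w.nbytes} B raw vs {idx.nbytes + vals.nbytes} B packed")
# A real codec would pack positions far more tightly (bitmaps or run lengths),
# but even this naive format trims weight traffic by roughly 40% here.
```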
Data tunneling is also optimized through a decentralized control architecture. Through local data sequencers on each engine, as well as on the common subsystem, NeuPro-M optimizes tunneling for bandwidth, performance and utilization by orchestrating the best parallel-processing scheme per use case.
Model Optimization & CDNN SDK
With so many options, how do you optimize a trained model from one of the standard networks into your specialized model? The CDNN (CEVA Deep Neural Network) framework is a comprehensive software toolchain, compatible with common open-source frameworks, combining a network inferencing graph compiler and a dedicated system architecture planner tool. This tool provides a simple and fast way to assess model performance on NeuPro-M, as well as precision and cycle-count statistics. It also suggests the optimal processor configuration for a given use case.
These model and memory offline optimizations help take maximum advantage of load balancing between the various components and engines within the NeuPro-M. By dividing the network into subgraphs and the image into optimally sized tiles, CDNN can ensure your algorithm fully utilizes all parts of the processor. Similarly, mechanisms like dynamic bandwidth reduction and scale-per-channel help you fully exploit the architecture for high performance at low energy consumption. CDNN provides you with the tools to tap the full potential of your model on NeuPro-M, for maximum performance at minimum power.
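To give a flavor of the kind of offline tiling decision involved, the sketch below picks the largest square int8 feature-map tile whose input (with halo), weights and output all fit in a per-engine L1 budget. The 512 KB budget, the 3×3 convolution shape and the simple footprint model are assumptions for illustration; they are not CDNN internals.

```python
# Illustrative tiling search: grow the tile until it no longer fits in L1.
def max_tile_side(l1_bytes, in_ch, out_ch, k, bytes_per_elem=1):
    """Largest square output tile whose working set fits the L1 budget."""
    weight_bytes = in_ch * out_ch * k * k * bytes_per_elem
    side = 1
    while True:
        in_bytes = (side + k - 1) ** 2 * in_ch * bytes_per_elem   # halo included
        out_bytes = side ** 2 * out_ch * bytes_per_elem
        if weight_bytes + in_bytes + out_bytes > l1_bytes:
            return side - 1
        side += 1

# Example: 3x3 convolution, 64 -> 64 channels, 512 KB of L1 per engine.
print(max_tile_side(l1_bytes=512 * 1024, in_ch=64, out_ch=64, k=3))
```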
NeuPro-M Delivers
We have run benchmarks over several designs, comparing models built on combinations of NeuPro-M features. We see a 1.5 order-of-magnitude (roughly 30X) improvement in frames per second, outstanding power efficiency of 24 TOPS/Watt, and up to 6X bandwidth reduction, compared to CEVA's previous-generation NeuPro architecture.
You really can have more for less, building on a heterogeneous embedded processor with multiple engines to support the full range of AI acceleration options you want. To learn more, read our more detailed write-up.