Resistive memories enable bio-inspired architectures for edge AI




Research activities in the field of brain-inspired computing have gained significant momentum in recent years. The main driver is the attempt to go beyond the limitations of the conventional Von Neumann architecture, which is increasingly constrained by the bandwidth and latency of the memory-logic communication. In neuromorphic architectures, the memory is distributed and can be co-localised with the logic. New resistive memory technologies can easily provide this possibility, given their capability to be integrated within the interconnect layers of a CMOS process.

While most of the current attention in AI deployment has been directed to the implementation of deep learning algorithms on large conventional computing systems, the impact on device and circuit technology has been mixed. While advanced standard CMOS technology has been used to develop GPUs and application-specific accelerators, there has been no real push to use any “bio-inspired” hardware. The avenues opened by emerging resistive memory devices (RRAMs), which can emulate a biologically plausible synaptic behavior at the nanometer scale by modulating conductance with the application of a relatively low bias voltage, have been restricted to research groups due to a (perceived) insufficient maturity of the technology.

However, these new devices can provide the solution to one of the major problems facing a large-scale deployment of AI into consumer and industrial products: energy efficiency. The energy overhead of transmitting all data to cloud/server systems for analysis will quickly hit the limit of economic viability of AI if its use becomes more widespread. Moreover, for real-time systems, such as autonomous vehicles and industrial controls, latency would remain an issue if the servers connected to the 5G infrastructure that process the data were concentrated in a few well-defined areas rather than distributed across the infrastructure. For these reasons, and in Europe for privacy concerns as well, it will become increasingly important to have highly energy-efficient edge/point-of-use AI-capable systems, possibly with progressively improved local learning capabilities.

Embedded AI systems are ideally suited to process data that require a real-time response and to situations where energy is a main concern. Interest in such systems is growing, as testified by the success of the tinyML initiative [1]. Bio-inspired approaches (i.e., where the memory element acts also as interconnect and computing element) have an added advantage in this field when treating sparse, time-domain data streams generated by sensors such as microphones, lidars, and ultrasound transducers. Such systems can conduct most of their operations in the analog domain, streamlining the data flow by avoiding unnecessary, power-hungry analog-to-digital conversions and by using non-clocked, data-driven architectures. The absence of a clock, together with the fact that the memory elements dissipate power only during the signal pulses, results in extremely low power consumption in the absence of input (hence the suitability for sparse signals) and may remove the need for specific sleep modes to reach battery-powered operating regimes. The non-volatility, moreover, means that parameters need to be set only at first power-on or at eventual updates of the system, rather than being transferred from an external source at each power-up.

The use of new resistive memories, however, is not restricted to such “edge” or “bio-inspired” applications; they can also benefit conventional, fully digital, clocked systems by acting as an intermediate memory level (slow non-volatile cache / fast mass storage) in neural accelerators. In this case, the benefit will be a reduction of the fast DRAM and SRAM cache areas while still reducing the latency of accessing mass storage.

Hardware Platforms for bio-inspired computing

From a technological perspective, RRAMs are a good candidate for neuromorphic applications because of their CMOS compatibility, high scalability, strong endurance, and good retention characteristics. However, defining practical implementation strategies and useful applications of large-scale co-integrated hybrid neuromorphic systems (CMOS neurons with resistive memory synapses) remains a difficult challenge.

Resistive RAM (RRAM) devices such as Phase Change Memories (PCM), Conductive Bridge RAM (CBRAM) and Oxide RAM (OxRAM) have been proposed to emulate biologically inspired features of synaptic functionality that are essential for realizing neuromorphic hardware. Among the different types of emulated synaptic features, spike-timing-dependent plasticity (STDP) is one of the most widely used, but it is certainly not the only possibility, and other mechanisms may prove more useful in implementations for real applications.
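For readers unfamiliar with the mechanism, a pair-based exponential STDP rule of the kind often emulated with RRAM devices can be summarized in a few lines of Python; the amplitudes and time constants below are illustrative placeholders, not parameters of any specific device discussed here.

import math

def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012,
                 tau_plus=20e-3, tau_minus=20e-3):
    """Pair-based STDP: potentiate when the pre-synaptic spike precedes
    the post-synaptic one, depress otherwise (illustrative parameters)."""
    dt = t_post - t_pre
    if dt > 0:      # pre before post -> long-term potentiation
        return a_plus * math.exp(-dt / tau_plus)
    elif dt < 0:    # post before pre -> long-term depression
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0

# Example: pre-synaptic spike at t=0, post-synaptic spike 5 ms later
# -> positive weight change (potentiation)
print(stdp_delta_w(0.0, 5e-3))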

An example of a circuit implementing these ideas and validating the approach is SPIRIT, presented at IEDM 2019 [2]. The implemented SNN topology is a single-layer, fully connected network whose objective is to perform inference on the MNIST database; there are 10 output neurons, one per class. To reduce the number of synapses, the images were down-scaled to 12×12 pixels (144 synapses per neuron). Synapses are implemented using Single Level Cell (SLC) RRAMs, i.e., only considering the low and high resistance states. The structure is of the 1T-1R type, with one access transistor per cell, and multiple cells are connected in parallel to implement the various weights. Synaptic quantization experiments done on the learning framework have shown that integer values, ranging from -4 to +4, are a good compromise between classification accuracy and RRAM count. Since the aim is to obtain weighted currents, 4 RRAMs must be used for the positive weights. For the negative weights, the sign bit could have been encoded using RRAMs as well; however, since a fault-tolerant triple redundancy would have been needed, it was preferred to use 4 additional RRAMs to implement the negative weights.
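To make the weight encoding more concrete, the sketch below shows one plausible way an integer weight in the range -4 to +4 could be mapped onto 4 + 4 parallel SLC cells (one group contributing a positive read current, one a negative read current), with the summed currents forming the weighted synaptic current. The resistance values and read voltage are assumptions for illustration only, not the actual SPIRIT design parameters.

# Illustrative mapping of an integer weight w in [-4, +4] onto
# 4 "positive" + 4 "negative" SLC (binary) RRAM cells, as in a
# 1T-1R array where parallel cells sum their read currents.
R_LRS = 10e3    # assumed low-resistance state (ohms)
R_HRS = 1e6     # assumed high-resistance state (ohms)
V_READ = 0.1    # read voltage limited to ~100 mV

def encode_weight(w):
    """Number of cells programmed to LRS in the positive and negative groups."""
    assert -4 <= w <= 4
    return (w, 0) if w >= 0 else (0, -w)

def synaptic_current(w):
    """Net read current (positive branch minus negative branch) for weight w."""
    n_pos, n_neg = encode_weight(w)
    i_lrs, i_hrs = V_READ / R_LRS, V_READ / R_HRS
    i_pos = n_pos * i_lrs + (4 - n_pos) * i_hrs
    i_neg = n_neg * i_lrs + (4 - n_neg) * i_hrs
    return i_pos - i_neg

for w in range(-4, 5):
    print(w, f"{synaptic_current(w) * 1e6:.2f} uA")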

The design of the “Integrate and Fire” (IF) analog neurons was guided by the need for mathematical equivalence with the tanh activation function used in supervised offline learning. The specifications were the following: (1) a stimulation with a synaptic weight equal to ±4 must generate a spike; (2) neurons have to generate positive and negative spikes; (3) they must have a refractory period, during which they cannot emit spikes but must continue to integrate. Neurons are architected around a 200 fF MOM capacitor, and two comparators are used to compare its voltage level to positive and negative thresholds. Since RRAMs must be read with a voltage drop limited to 100 mV across their terminals, to prevent inadvertently setting the devices to the low-resistance state (LRS), the resulting currents cannot be integrated directly by the neurons; instead, they are copied by current injectors. The impact of programming conditions was assessed, and adequate programming conditions are used to ensure a large enough memory window. Relaxation mechanisms do appear, but only on a very short time scale (less than one hour) after programming; beyond that, classification accuracy does not degrade over time. Read stability was also verified, up to 800M spikes sent to the circuit.
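A minimal behavioral model of such an integrate-and-fire neuron, with symmetric positive and negative thresholds and a refractory period during which integration continues but spiking is inhibited, might look as follows; the capacitance matches the 200 fF mentioned above, while the threshold, time step and refractory duration are illustrative assumptions rather than the actual circuit values.

class IFNeuron:
    """Behavioral integrate-and-fire neuron with bipolar thresholds and a
    refractory period (illustrative parameters, not the SPIRIT circuit values)."""
    def __init__(self, c=200e-15, v_th=0.5, refractory_steps=5):
        self.c = c                        # membrane (MOM) capacitance, farads
        self.v_th = v_th                  # symmetric +/- threshold, volts
        self.refractory_steps = refractory_steps
        self.v = 0.0                      # capacitor voltage
        self.refractory = 0               # remaining refractory time steps

    def step(self, i_syn, dt=1e-6):
        """Integrate synaptic current i_syn for dt seconds; return +1, -1 or 0."""
        self.v += i_syn * dt / self.c     # integration continues even when refractory
        if self.refractory > 0:
            self.refractory -= 1
            return 0
        if self.v >= self.v_th:
            self.v = 0.0
            self.refractory = self.refractory_steps
            return +1                     # positive output spike
        if self.v <= -self.v_th:
            self.v = 0.0
            self.refractory = self.refractory_steps
            return -1                     # negative output spike
        return 0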

Classification accuracy on the 10K test images of the MNIST database is measured at 84%. This value must be compared to the 88% accuracy obtained from ideal simulations, which is limited by the simple network topology (1 layer with 10 output neurons). The energy dissipation per synaptic event is equal to 3.6 pJ. When accounting for the circuit logic and SPI interface, it amounts to 180 pJ (this could be reduced by optimizing the communication protocol). Measurements show that an image classification needs 136 input spikes on average (for ΔS=10): this is less than one spike accumulated per input, leading to a 5x energy gain compared to equivalent formal-coding MAC operations in a 130nm node. The energy gain comes from (1) the lightness of the base operation (accumulation, instead of multiplication-accumulation as in classical coding) and (2) the activity sparsity due to the spike coding. The sparsity benefit will increase with the number of layers.
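As a rough, purely illustrative back-of-the-envelope check on these figures (assuming, as a simplification not stated in the original paper, that each input spike generates one synaptic event on each of the 10 output neurons of the fully connected layer):

# Illustrative per-image energy estimate from the published per-event figures.
E_SYNAPSE = 3.6e-12      # J per synaptic event (synapse + neuron only)
E_TOTAL   = 180e-12      # J per synaptic event including logic and SPI interface
SPIKES_PER_IMAGE = 136   # average input spikes per MNIST image
OUTPUT_NEURONS = 10      # fan-out assumption: one synaptic event per output neuron

events = SPIKES_PER_IMAGE * OUTPUT_NEURONS
print(f"~{events * E_SYNAPSE * 1e9:.1f} nJ/image (synaptic events only)")
print(f"~{events * E_TOTAL * 1e9:.1f} nJ/image (including logic and SPI)")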

This small demonstrator showed how it is possible to achieve performance levels on a par with conventional embedded approaches but with dramatically reduced power consumption. In fact, the rate code used in the SNN demonstration makes this implementation equivalent to a classically coded one: transcoding from the classical domain to the spike domain does not induce any loss in accuracy. The simple topology used in this proof of concept, a single-layer perceptron, explains the slightly lower classification accuracy compared with state-of-the-art deep learning models, which use much larger networks with more layers. To overcome this difference, a much more complex topology (MobileNet class) is currently being implemented, and the classification accuracy will scale up accordingly, with the same energy benefits.

The same approach will be extended to circuits embedded with microphones or lidars to analyse, locally and in real time, the data stream, avoiding the need for transmission over the network. Both rate-coded and time-coded strategies can be used to optimize the network depending on the information content of the signal. Initially, the learning would be performed centrally and only the inference integrated into the system, but some degree of incremental learning will be introduced in future generations.

Another way of exploiting the properties of RRAM that are beneficial for embedded AI products is the use of analog architectures based on crossbar arrays of RRAM. They can provide a much denser implementation of the multiply-accumulate (MAC) function, central to both inference and learning circuits, when compared to a conventional digital implementation. If the further step of moving into the time domain and eliminating clocking is taken, then compact low-power systems beyond the current state of the art are attainable. While very promising and widely investigated by academia, this approach is still not broadly accepted by industry, which points to the difficulty of designing, verifying, characterizing and certifying analog asynchronous designs, and to the difficulty of scaling analog solutions. In our opinion, all these obstacles can be overcome in the interest of extremely power-efficient solutions.
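Conceptually, the crossbar performs the matrix-vector product "in place" through Ohm's law and Kirchhoff's current law: each column current is the sum of the input voltages weighted by the cell conductances. A minimal numerical sketch of that behavior, with arbitrary conductance values, is shown below.

import numpy as np

# Analog crossbar MAC: column current I_j = sum_i V_i * G_ij
# (Ohm's law per cell, Kirchhoff's current law per column).
rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(144, 10))  # cell conductances (siemens), arbitrary
V = rng.choice([0.0, 0.1], size=144)         # input read voltages: 0 or 100 mV

I = V @ G   # one "analog" matrix-vector multiply: 10 column currents (amperes)
print(I)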

Part of the perceived difficulty with these memories arises from the observed variability, but that is largely a reflection of the experimental conditions. We observe much tighter distributions when operating on 300mm wafers and when the integration process is more mature, so we assume variability issues can be solved during industrialization. Design tools are also coming, and progressively more accurate models are becoming available. Temperature variations certainly have an impact, but the statistical nature of this type of computation and its intrinsic robustness to some degree of parameter variation in the inference phase make their final impact much less relevant than in the conventional analog designs the community is used to. One of the advantages of the analog crossbar approach is that no current flows when a “zero” input is applied. However, there is a leakage current contribution from the “zero” values stored in the array when a “one” input is applied; this can limit the practical sizes of the crossbars and drives the research towards the best values for the resistance levels.
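The impact of this stored-zero leakage on the practical column height can be illustrated with a simple worst-case ratio: a single selected low-resistance ("one") cell must remain distinguishable from the summed leakage of all the other high-resistance ("zero") cells in the column, which ties the usable array size directly to the on/off resistance ratio. The toy calculation below uses assumed resistance values for illustration only.

# Worst case for one crossbar column: a single "one" (LRS) weight must remain
# distinguishable from the summed leakage of N-1 "zero" (HRS) weights.
R_LRS = 10e3   # assumed on-state resistance (ohms)
R_HRS = 1e6    # assumed off-state resistance (ohms)
V_READ = 0.1   # read voltage (volts)

for n_rows in (64, 128, 256, 512, 1024):
    i_signal = V_READ / R_LRS
    i_leak = (n_rows - 1) * V_READ / R_HRS
    print(f"{n_rows:5d} rows: signal/leakage = {i_signal / i_leak:.2f}")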

Some issues are more fundamental. The first is that the power efficiency and the high degree of parallelism come from trading off time multiplexing (frequency of operation) against area: what is the limit of network size (problem size or number of classes) up to which this trade-off is advantageous, and how does it depend on the implementation node? Another is the cyclability of these memories. While it is sufficient for the inference phase, and the programming of the crossbar can be done at the initialization phase with an acceptable overhead, on-chip learning using classical backpropagation schemes, with their typical number of iterations, is out of the question due to the excessive writing load. However, very promising avenues using other learning approaches are being pursued and can hopefully provide valid solutions in the next few years.

Before the introduction of such circuits, technologies like RRAMs and 3D integration can already be used in conventional implementations to provide solutions with a smaller power budget and a smaller form factor. FPGA implementations for highly customized applications, pure software implementations running on MCUs or CPUs, and dedicated highly parallel multicore accelerators similar to GPUs for more general-purpose applications are the mainstream today. All of them can benefit from the availability of local non-volatile memory, which could lead to more compact FPGAs and to better-optimized memory hierarchies for MCU/CPU and multicore/accelerator chips. In particular, the use of dedicated versions of monolithic 3D integration, with RRAM planes intercalated among analog neuron planes, can produce much more compact and less power-hungry systems.

We investigated this approach, leading a prominent multidisciplinary group of EU R&D institutes, within the European H2020 program NeuRAM3, dedicated to researching the best matches between advanced device technologies, circuit architectures and algorithms for the fabrication of neuromorphic chips. Among the many results of the project, the figure below shows an example of an OxRAM fabricated in the CoolCube 3D monolithic process and connected to both the top and the bottom CMOS layers. Moving forward, this type of technology can be used to integrate very dense arrays in the fabric of complex CMOS circuits dedicated to AI.


Figure. CoolCube 3D monolithic integration of OxRAM within the interconnection between top and bottom CMOS tiers, opening the way to dense multilayer neural networks. (Source: Author)

3D TSV and 3D Cu-Cu bonding are also promising candidates for building compact neuromorphic systems that combine the various elements in a highly integrated architecture, where partitioning is optimized according to the application, or where the embedded AI element is closely coupled to imagers or other sensing or actuating elements.

Conclusion

A review of the impact RRAM can have on bio-inspired computing systems has been presented, and some promising results and concepts have been discussed.

Acknowledgements

The author wishes to thank his colleagues Elisa Vianello, Alexandre Valentian, Olivier Bichler and Marc Duranton for their contributions to this paper, as this work is an overview of the projects carried out at CEA Tech in recent years by multiple groups. The support of the French programs ARCH2NEU and NEMESIS, and of the H2020 European Program NeuRAM3 (Grant Agreement 687299), is also acknowledged.

References

[1] tinyML initiative, see www.tinymlsummit.org and https://tinyml.org/emea/

[2] A. Valentian et al., “Fully Integrated Spiking Neural Network with Analog Neurons and RRAM Synapses,” 2019 IEEE International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 2019, pp. 14.3.1-14.3.4.


Dr. Carlo Reita earned the Laurea Degree of Dottore in Fisica from the University of Rome “La Sapienza”. He has held R&D and management posts with CNR (Italy), GEC-Marconi (UK), Cambridge University (UK), Thomson-CSF (now Thales, France) and Photronics (UK). He is currently Director of Strategic Partnerships and Planning in the CTO office of CEA-Leti.

 
