Optimization of Multi-Level Operation in RRAM Arrays for In-Memory Computing

Accomplishing multi-level programming in resistive random access memory (RRAM) arrays with truly discrete and linearly spaced conductive levels is crucial in order to implement synaptic weights in hardware-based neuromorphic systems. In this paper, we implemented this feature on 4-kbit 1T1R RRAM arrays by tuning the programming parameters of the multi-level incremental step pulse with verify algorithm (M-ISPVA). The optimized set of parameters was assessed by comparing its results with a non-optimized one. The optimized set of parameters proved to be an effective way to define non-overlapped conductive levels thanks to the strong reduction of both the device-to-device variability and the cycle-to-cycle variability, assessed by inter-levels switching tests and during 1k reset-set cycles. In order to evaluate this improvement in realistic scenarios, the experimental characteristics of the RRAM devices were captured by means of a behavioral model, which was used to simulate two different neuromorphic systems: an 8×8 vector-matrix-multiplication (VMM) accelerator and a 4-layer feedforward neural network for MNIST database recognition. The results clearly show that the optimization of the programming parameters improved both the precision of the VMM results and the recognition accuracy of the neural network by about 6% compared with the use of non-optimized parameters.


Introduction
Since the IBM supercomputer Deep Blue defeated the world chess champion Garry Kasparov in 1997 [1], the artificial intelligence (AI) field has experienced a dramatic boost. Such momentum can be attributed to two main reasons. First, the exponential increase over the last decades of the computational power of computer systems, in accordance with Moore's Law. Second, the vast amount of data available thanks to the internet and social media as well as, more recently, the Internet of Things (IoT). This ideal situation paved the way toward training deep neural networks (DNNs), nowadays the dominant paradigm in the AI field [2], in reasonable time with high accuracies. In fact, one of the most important accomplishments of DNN systems took place in 2016, when Google's AlphaGo defeated the Go world champion Lee Sedol [3]. However, pure software DNNs executed on supercomputers with thousands of CPUs/GPUs suffer from one major handicap: energy consumption [4]. In particular, the frequent transfers of data between the memory unit and the processing unit drastically limit the energy efficiency due to the memory bottleneck inherent to the von Neumann architecture [5]. In order to overcome this issue, companies like IBM and Intel have developed brain-inspired computing chips such as TrueNorth and Loihi [6,7], respectively, which dramatically reduce the energy consumption by using non-von Neumann architectures. Nowadays, the most promising of the novel non-von Neumann computing paradigms relies on the idea of in-memory computing, in which calculations are carried out in situ where the data are stored [8], thereby suppressing the extremely energy- and time-consuming memory-processor communications. Nevertheless, chips like the two mentioned above store the synaptic weights of the DNN in standard static random access memories (SRAM) as digital floating-point numbers, which is not optimal in terms of area and energy overhead [9-11].
In this context, memristive devices have emerged as one of the most promising candidates to implement analogue artificial synapses in this computing paradigm [12,13]. By using memristor crossbar arrays, the vector-matrix-multiplication (VMM) operations, which constitute the greatest workload during the inference and backpropagation phases of DNN operation [14,15], can be easily performed by taking advantage of the physical laws, such as Ohm's and Kirchhoff's laws, that govern electrical circuits [15,16]. In particular, resistive switching RAM (RRAM) devices have shown great potential due to their high-density integration, full CMOS compatibility, high-speed and low-power switching, excellent endurance, long data retention, and multi-level operation [17,18]. Among these characteristics, the latter is crucial in order to implement the analogue switching behavior required to mimic the plasticity of biological synapses. Several studies have already assessed this feature experimentally in RRAM devices. A few of them just check this capability as a proof of concept through DC measurements on single devices [19,20]. Other works explore this feature by employing devices with dielectric films based on multiple layers of different materials [21,22]. Complex algorithms have also been proposed to achieve reliable and accurate multi-level operation [21,23-25]. In a previous paper [26], we contributed in this regard by defining several reliable conductive levels in monolayer Al-doped HfO2 RRAM devices integrated in 4-kbit arrays by tuning two parameters of the relatively simple programming algorithm known as the multi-level incremental step pulse with verify algorithm (M-ISPVA). This multi-level approach was successfully employed in implementing a 2-layer feedforward neural network for the classification of the MNIST dataset [27,28]. Although similar studies can be found elsewhere [15,29-31], none of them evaluate how the programming parameters used to define the synaptic weights in the RRAM array impact the results obtained by the VMM operations and, hence, the accuracy of the corresponding DNN.
In this work, the accuracy of VMM operations intended to implement the inference operation in a DNN for MNIST dataset classification was significantly improved by optimizing the relatively simple M-ISPVA, which was used to program the corresponding synaptic weights on RRAM devices based on a monolayer Al:HfO2 dielectric. First, the values of two key parameters of the M-ISPVA, namely, the target read-out current (I_trg) and the gate voltage (V_g), were tuned in order to define four discrete conductive levels with no overlap caused by the device-to-device (DTD) variability. Afterwards, the lack of overlap in terms of cycle-to-cycle (CTC) variability was additionally proved by means of inter-levels switching and endurance tests. The experimental behavior of the RRAM devices was captured by means of an empirical model, which let us simulate first an 8×8 VMM hardware accelerator based on this technology and then the 4-layer feedforward DNN for MNIST dataset classification. Finally, the accuracy of both neural-based systems was tested by assessing the role of the RRAM variability, directly linked to the specific set of programming parameters utilized: optimized (this work) and non-optimized (in [26]).
The characteristics of the RRAM arrays are detailed in Section 2, as well as the programming algorithm (M-ISPVA) and the whole electrical characterization carried out. All mathematical tools involved in the simulations, namely, the behavioral model, the quantization technique of the weights, and the addition of variability, are explained in Section 3. Afterwards, in Section 4, the electrical characterization of the RRAM samples and the modeling of the current distributions are shown and discussed. In Section 5, the VMM architecture is described and its operation tested. Then, the results obtained with the DNN on the MNIST database are discussed in Section 6. Finally, we wrap up in Section 7 with the main conclusions.

Experimental Methodology
The whole experimental characterization was carried out by switching the RRAM devices integrated in a 4-kbit memory chip (Figure 1, right) in order to assess the DTD variability within an architecture comparable to a real hardware accelerator. Each RRAM cell in the memory chip follows the 1-transistor-1-resistor (1T1R) structure. The select transistor is an NMOS manufactured in a 250 nm CMOS technology, which controls the upper current limit of the different low resistive states (LRSs) by tuning the V_g value through the word line (WL) terminal (Figure 1) during set operations with the M-ISPVA [26]. The drain of the transistor is connected in series to a metal-insulator-metal (MIM) variable resistor, as shown in Figure 1 (left). This MIM cell is located on metal line 2 of the CMOS process and consists of a planar TiN/Al:HfO2/Ti/TiN stack. Metal layers were deposited by magnetron sputtering with a thickness of 150 nm for the top and bottom TiN electrodes and 7 nm for the scavenging Ti layer (under the TiN top electrode). The Al-doped HfO2 dielectric layer was grown with a thickness of 6 nm and an Al content of about 10% by using the atomic layer deposition (ALD) technique. MIM stacks were patterned with an area of about 0.4 µm². The strategy followed by the M-ISPVA to program the RRAM devices is to apply a sequence of voltage pulses featuring a constant increment in amplitude [32]. This sequence is applied on the bit line (BL) terminal during forming and set operations, whereas it is applied on the source line (SL) terminal during reset operations (Figure 1). Since this is a write-verify algorithm, a read-out operation of the current flowing through the RRAM device is performed after every programming pulse. Voltage pulses are applied until either the measured read-out current overcomes the I_trg value or a specific maximum voltage amplitude is reached. The I_trg parameter in the M-ISPVA plays a complementary role to V_g by controlling the lower current limit of each of the LRSs defined during set operations. During reset operations, the role of both the I_trg and V_g parameters is slightly different: the former limits the maximum current value of the high resistive state (HRS), whereas the latter is set to a value that minimizes the series resistance of the transistor.
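The write-verify loop of the M-ISPVA described above can be sketched as follows. The `cell` handle with `apply_pulse()` and `read_current()` methods is a hypothetical device interface introduced for illustration (not the actual array driver); the parameter names mirror those of Table 1, and the two stopping criteria (target current reached, or maximum amplitude exhausted) follow the text.

```python
def m_ispva(cell, i_trg, v_gate, v_start, v_stop, v_step, pulse_width, v_read=0.2):
    """Sketch of the M-ISPVA write-verify loop.

    `cell` is a hypothetical device handle; the interface is an assumption
    made for illustration purposes only.
    """
    v = v_start
    while v <= v_stop:
        # Programming pulse with constant amplitude increment per iteration.
        cell.apply_pulse(amplitude=v, width=pulse_width, v_gate=v_gate)
        # Verify step: read-out after every programming pulse.
        i_read = cell.read_current(v_read)
        if i_read >= i_trg:
            return True, v          # target read-out current overcome
        v += v_step
    return False, v_stop            # maximum amplitude reached without success
```

The same loop serves forming, set, and reset operations; only the terminal the sweep is applied to (BL or SL) and the parameter values change.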
In order to activate the multi-level resistive switching behavior while ensuring the best performance, the creation of the conductive filament (CF) in the Al:HfO2 dielectric layer was carried out by following the three-step forming operation reported in [33]. First, a soft breakdown was performed by using the M-ISPVA with <I_trg, V_g> = <42 µA, 1.5 V>, a voltage amplitude sweep (VAS) from 2.5 to 5.0 V in steps of 0.01 V, and a pulse width (PW) of 10 µs. Afterwards, a reset operation was carried out to stabilize the CF [33] by using the M-ISPVA with <I_trg, V_g> = <6 µA, 2.7 V>, a VAS from 0.5 to 2.5 V in steps of 0.1 V, and a PW of 1 µs. Finally, each of the three LRSs was defined in one batch of 128 RRAM devices by means of the set operation. During the set operation, the VAS and the PW were the same as during the reset operation. Depending on the LRS defined, namely, LRS1, LRS2, or LRS3, the <I_trg, V_g> values used by the M-ISPVA were <16 µA, 1.0 V>, <29 µA, 1.2 V>, or <42 µA, 1.5 V>, respectively. All subsequent reset-set cycles featured the same combination of M-ISPVA parameters. The read-out operation carried out after every programming pulse consisted of a voltage pulse with an amplitude of 0.2 V. All the programming parameters previously mentioned are summarized in Table 1.
Table 1. Summary of the M-ISPVA programming parameters used for the non-optimized and the optimized approaches during forming, reset, set, and read-out operations: target read-out current (I_trg), gate voltage (V_g), voltage amplitude sweep (VAS), and pulse width (PW).

In addition, the CTC variability associated with the RRAM devices under study was evaluated by means of two types of tests: a collection of switching cycles between all possible LRS-LRS combinations (passing through the HRS) in the same batch of 128 RRAM devices (inter-levels switching) and an endurance test of one thousand (1k) reset-set cycles between all three HRS-LRS combinations.

Modeling Methodology
The model developed to emulate the current distributions associated with the multiple conductive levels defined on the RRAM devices in the previous section is a behavioral model with no physical meaning behind it, but it is cost-effective in terms of computational requirements for the simulation of in-memory computing circuits. Additional details of this model can be found in [34]. In order to obtain an analytical model which includes the DTD variability and can be extrapolated to different reading voltages (which is crucial to implement the input vector in the VMM operation), the experimental cumulative distribution functions (CDFs) of currents are fitted by using a Gaussian distribution, whose CDF expression is:

CDF(I) = (1/2) · [1 + erf((I − µ) / (σ·√2))]    (1)

where I is the current value, erf the error function, µ the mean value, and σ the standard deviation. The dependence of µ and σ on the applied input voltage is assumed to be linear [35]. The specific implementation of this model is a SPICE sub-circuit consisting of two terminals for the electrical connections and a parameter for the selection of the device-state (see Figure 1 in Reference [34]). Basically, the conductance of the corresponding device-state is implemented by means of behavioral current sources taking the conductance variability into account. That is, the actual conductance is obtained randomly following the experimentally measured CDFs after fitting them to Gaussian distributions. The multi-level approach used in RRAM devices to implement the synaptic plasticity demands the use of a quantization procedure for the continuous weight range of software-based DNNs. The strategy used in our study is known as Uniform-ASYMM, which is suggested in [36]. This strategy consists of a linear and range-based approach that quantizes a given input datum x_f into its quantized version x_q using n bits, that is, 2^n levels. The mathematical expression of Uniform-ASYMM is an asymmetric equation that maps the minimum (min x_f) and maximum (max x_f) of the continuous range onto the quantized range, with the offset by min x_f acting as the bias term:

x_q = round[(x_f − min x_f) / Δ] · Δ + min x_f,  with  Δ = (max x_f − min x_f) / (2^n − 1)    (2)

In order to add the (experimentally characterized) variability to a given quantized level within the synaptic weight range of a DNN, a simple linear formula was employed:

w = x_q · rand(µ_k, σ_k) / µ_k    (3)

where k corresponds to the one of the four conductive levels in which a given quantized value x_q is located and rand is a random number generator following a Gaussian distribution according to Equation (1).
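A minimal numerical sketch of these two steps is given below. `uniform_asymm` follows the spirit of Equation (2); `add_variability` is one plausible reading of the simple linear formula of Equation (3), scaling each quantized weight by a Gaussian draw with the per-level mean µ_k and spread σ_k (the exact mapping between weights and levels is an assumption made for illustration).

```python
import numpy as np

def uniform_asymm(x, n_bits):
    """Uniform-ASYMM sketch: map the continuous range [min(x), max(x)]
    onto 2**n_bits evenly spaced levels; the offset by min(x) plays the
    role of the bias (zero-point) term."""
    lo, hi = float(x.min()), float(x.max())
    step = (hi - lo) / (2 ** n_bits - 1)
    return np.round((x - lo) / step) * step + lo

def add_variability(x_q, mu_k, sigma_k, rng=None):
    """Perturb quantized weights with the Gaussian spread of their
    conductive level: w = x_q * N(mu_k, sigma_k) / mu_k (one plausible
    reading of Equation (3); mu_k/sigma_k mirror the per-level fits)."""
    rng = rng or np.random.default_rng(0)
    return x_q * rng.normal(mu_k, sigma_k, size=np.shape(x_q)) / mu_k
```

With n = 3 bits, a weight range such as [−10, 10] is mapped onto 8 evenly spaced values, which matches the order of magnitude of levels used later for the VMM matrix.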

Experimental and Modeling Results
After applying the three-step forming operation, the experimental read-out current distributions obtained for the four conductive levels, namely, HRS, LRS1, LRS2, and LRS3, by using the optimized combination of M-ISPVA parameters are shown in Figure 2a,b as histograms. The distributions illustrated in Figure 2a correspond to the read-out current values measured just after I_trg is overcome. No overlap between conductive levels is observed. Since the read-verify operations are continuously performed on the whole batch of 128 RRAM devices (even on RRAM cells already successfully switched) until the M-ISPVA finishes, post-algorithm instabilities can lead to perturbations of the originally programmed conductive state. The distributions measured at the end of the M-ISPVA are shown in Figure 2b. Only a slight reorganization of the read-out current distributions is observed, while the lack of overlap between contiguous conductive levels is still maintained. On the other hand, the read-out current distributions illustrated in Figure 2c,d regarding the use of non-optimized M-ISPVA parameters [26] show a small but evident overlap between the three LRS distributions in both cases, namely, just after the switching (Figure 2c) and at the end of the M-ISPVA execution (Figure 2d). Therefore, a careful selection of the <I_trg, V_g> values for each conductive level indeed leads to a reliable definition of discrete conductive levels in RRAM arrays.
Once the optimized combination of programming parameters was experimentally proved to successfully define non-overlapped conductive levels despite the inherent DTD variability present in HfO2-based RRAM arrays [37,38], the stability of the four conductive levels during subsequent switching cycles was tested. An endurance test consisting of 1k reset-set cycles between the HRS and the corresponding LRS was carried out on each batch of 128 RRAM devices in order to assess the CTC variability. As shown in Figure 3, the experimentally measured read-out current CDFs associated with each of the four conductive levels remain essentially unchanged along the whole endurance test. The existing CTC variability has no significant impact on the definition of the four conductive levels as the switching cycles go by. In order to further assess the CTC variability, the switching between conductive levels was experimentally tested in a way rarely shown in the literature, namely, the inter-levels switching. All possible LRS-LRS transitions (by using the HRS as an intermediate step) were carried out on the same batch of 128 RRAM devices. In Figure 4, the CDFs of the experimentally measured read-out currents for the three LRS conductive levels are shown before and after performing each of the six possible LRS-LRS transitions. All transitions were successfully accomplished independently of the particular history of conductive levels previously programmed on the RRAM batch. Finally, the experimental CDFs corresponding to the read-out current distributions depicted in Figure 2b,d, that is, the values measured at the end of the M-ISPVA for the optimized and non-optimized combinations of programming parameters, respectively, were fitted by using the behavioral model presented previously. As Figure 5 illustrates, the Gaussian curves fit the experimental data with very high accuracy for all four conductive levels in both programming approaches. The average (µ) and standard deviation (σ) values of the fitted Gaussian CDFs are summarized in Table 2.
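Under the Gaussian assumption of Equation (1), the fitting step reduces to estimating µ and σ from the measured currents and evaluating the closed-form CDF. The sketch below uses only the standard library; the sample values are illustrative placeholders, not the measured data of the 128-device batches.

```python
import math
import random
import statistics

# Hypothetical read-out current samples (in µA) for one conductive level;
# illustrative only, standing in for the measured batch of 128 devices.
random.seed(0)
samples = [random.gauss(29.0, 1.2) for _ in range(128)]

mu = statistics.fmean(samples)      # Gaussian mean of the fit
sigma = statistics.pstdev(samples)  # Gaussian standard deviation of the fit

def gaussian_cdf(current, mu, sigma):
    """CDF of Equation (1): 0.5 * (1 + erf((I - mu) / (sigma * sqrt(2))))."""
    return 0.5 * (1.0 + math.erf((current - mu) / (sigma * math.sqrt(2.0))))
```

Plotting `gaussian_cdf` over the sorted samples reproduces the kind of fitted-curve-over-symbols comparison shown in Figure 5.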

VMM Architecture and Operational Results
In order to evaluate, in a realistic scenario, the impact of the variability featured by the conductive levels defined on the RRAM arrays, in particular, the variability reduction achieved by optimizing the programming parameters, an 8×8 VMM hardware accelerator has been simulated. In Figure 6 (left), a schematic of the architecture used to implement the VMM is depicted. Each synaptic weight in the matrix is implemented in a differential fashion by means of two 1T1R RRAM devices, each of which defines its conductive value according to the behavioral model (Figure 5) described in [34]. The differential approach lets us effectively increase the number of discrete weight values per synapse (up to seven) as well as deal with positive and negative coefficients without any additional computation step. Hence, the output value of each column is calculated at the bottom by a weighted adder, as depicted schematically in Figure 6 (right). The contribution of the input voltages v_in,i to each output voltage v_out,j matches the following term:

v_out,j = Σ_i R_norm · (1/R_ij+ − 1/R_ij−) · v_in,i = Σ_i w_ij · v_in,i

where R_ij+ (R_ij−) is the resistance of the RRAM device connected to the positive (negative) branch of the differential pair, w_ij ≡ R_norm · (1/R_ij+ − 1/R_ij−) are the weight values, and R_norm is a scale factor in the adder. For positive weights, the negative RRAM device is programmed in the HRS, while the positive RRAM device is programmed in one of the four conductive levels. On the other hand, negative weights require that the positive RRAM device be in the HRS, while the negative RRAM device is in any of the four possible conductive levels. The synaptic weights w_ij defined in the 8×8 matrix are quantized values (in seven levels) calculated as the x_q variable in Equation (2). The original continuous range quantized, x_f, spans from −10 to 10. Although the experimental CDFs fitted in Figure 5 were obtained at a read-out voltage of 0.2 V, the behavioral model already proved to reproduce well the read-out current CDFs measured during the M-ISPVA programming at different read-out voltages in the range of 0.1-0.5 V (the top limit defined to avoid any possible read-out perturbation of the RRAM conductive state) [34]. This feature turns out to be crucial to implement the coefficients of the input vector (applied at in_i), which is multiplied with the 8×8 matrix.
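The differential weighting scheme just described can be sketched numerically as follows. The conductance matrices stand in for the programmed RRAM states (the HRS conductance would simply be a very small entry), and the sign convention of the physical weighted adder is abstracted away, so this is a behavioral sketch rather than a circuit simulation.

```python
import numpy as np

def vmm_differential(v_in, g_pos, g_neg, r_norm):
    """Differential VMM sketch: each synaptic weight is
    w_ij = r_norm * (G_ij+ - G_ij-), with G = 1/R the conductances of the
    positive and negative RRAM devices of the pair; each output is the
    weighted sum of the input voltages along its column."""
    weights = r_norm * (np.asarray(g_pos) - np.asarray(g_neg))  # (rows, cols)
    return weights.T @ np.asarray(v_in)   # one weighted sum per column/output
```

For a positive weight, `g_neg` holds the (near-zero) HRS conductance and `g_pos` one of the four programmed levels; for a negative weight, the roles are swapped, exactly as described above.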
Several error sources cause the circuit outputs to differ from the targeted ones. First, the quantization error, which has been reduced by using differential synapses to implement each matrix weight, is still a hurdle for the hardware implementation of accurate VMM accelerators. Second, a normalization error comes up because uniform quantization intervals are used while the conductive levels defined at each RRAM device are not perfectly linearly spaced. Finally, the device variability plays the most crucial role, since it additionally impacts the quantization and normalization errors. Therefore, the results collected from the VMM simulations will be assessed in terms of the device variability achieved by using one of the two programming approaches, namely, the M-ISPVA with non-optimized and optimized <I_trg, V_g> pairs of parameters.
The accuracy of the VMM operation was assessed by multiplying an 8×8 matrix with a 1×8 vector, both randomly generated in the range from −10 to 10. The vector was mapped into input voltages in the range between 0.0 and 0.5 V, whereas the synaptic weights of the matrix were quantized in seven levels by using Equation (2) and then mapped into the differential RRAM pairs as the corresponding conductive levels. Several Monte Carlo simulations were launched in order to consider the variability featured by the RRAM devices. Figure 7a,b summarizes the results of the simulation study (solid lines) by using the non-optimized and the optimized sets of M-ISPVA parameters, respectively. In each simulation step, the conductances of the RRAM devices were randomly modified according to the statistical distributions for each level depicted in Figure 5. For a better evaluation of the results, the VMM outputs were also calculated without taking the variability into account, setting the conductance of the RRAM devices at the mean value of the Gaussian distribution of each conductive level. Here, two scenarios were taken into account, namely, non-quantized weights in the matrix (square symbols and dashed lines in Figure 7), which produce in fact the exact output values, and ideally quantized weights in the matrix (circle symbols in Figure 7), which omit the errors linked to the fact that the experimental RRAM conductive levels are not evenly spaced. The results illustrated in Figure 7 clearly show that the output values (v_out,j) are quite close to the ideal ones. Moreover, the use of the optimized set of programming parameters (Figure 7b) improves the accuracy of the operation significantly compared to the accuracy achieved when using the non-optimized one (Figure 7a). This is because the former reduces the variability within each conductive level (avoiding any overlap between them) and also yields levels that are more linearly spaced.
For the sake of a better comparison between these two programming approaches, an average error between the exact result and the real VMM output has been defined as follows:

Error = (1/n) · Σ_j |V_out,j − V_out,j0|

where V_out,j is the voltage value provided by circuit output j, V_out,j0 is its corresponding exact value, and n = 8 for our specific VMM circuit. In order to distinguish the quantization error from other error sources, two kinds of errors were defined, namely, the Total error and the Implementation error. The former takes into account the overall error, that is, the exact output V_out,j0 is the result obtained with non-quantized weights, whereas the latter calculates the error considering V_out,j0 as the result obtained with ideally quantized weights. Therefore, the Implementation error is the contribution to the Total error which is not caused by the quantization of the matrix of weights. Figure 8a,b plots both kinds of errors for each simulation step when using the non-optimized and optimized programming parameters, respectively. The reduction of the error achieved by optimizing the M-ISPVA parameters is clearly illustrated.
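The two error figures of merit differ only in the reference vector they use. A sketch of the metric, under the assumption that the average is a mean absolute deviation over the n = 8 outputs:

```python
import numpy as np

def average_error(v_out, v_ref):
    """Mean absolute deviation over the circuit outputs; one plausible
    reading of the paper's per-step average-error definition."""
    v_out = np.asarray(v_out, dtype=float)
    v_ref = np.asarray(v_ref, dtype=float)
    return float(np.mean(np.abs(v_out - v_ref)))

# Total error:          v_ref = exact outputs with non-quantized weights.
# Implementation error: v_ref = outputs with ideally quantized weights.
```

Evaluating `average_error` once per Monte Carlo step against each of the two references yields the per-step Total and Implementation error traces.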

DNN Implementation
Finally, the 8×8 VMM architecture was extended and integrated in a 4-layer feedforward DNN with the aim of evaluating the impact of the RRAM variability on the accuracy in the classification task of the MNIST database [39]. This classifier is a multi-layer perceptron (MLP) that uses the backpropagation algorithm to optimize the log-loss function by means of stochastic gradient descent and ReLU activation functions [40]. The DNN was trained for 500 epochs or until the loss (a prediction error) did not improve by at least 10⁻⁴. The architecture employed to implement the DNN is depicted in Figure 9 and consists of 784 neurons (x_i) in the input layer (one for each of the 28×28 pixels of the MNIST database images), 10 neurons (y_j) in the output layer (one for each different class), and three hidden layers of 256 neurons (hl_ij) each. Overall, 334,336 synapses are involved, whose weights were quantized by means of the Uniform-ASYMM scheme. The variability was introduced to the synaptic weights according to Equation (3) after the training and the quantization process. In order to analyze the influence of the RRAM variability on the performance of the quantized DNN, the balanced accuracy was calculated, that is, the average of the recall obtained on each class label [41]. The recall can be estimated as the number of true positives divided by the total number of elements that actually belong to a specific class, namely, a digit in the MNIST database. The average (µ_a) and standard deviation (σ_a) values obtained by employing 20 different configurations of synaptic weights, randomly generated according to the experimental variability, are shown in Table 3. An analogous study was carried out for comparison by considering no variability and an ideal quantization of the synaptic weights, which provides an accuracy value of about 75.5%.
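The balanced-accuracy metric described above can be sketched as follows: the recall of each class (true positives over the actual members of that class) is computed and then averaged over all class labels.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, n_classes=10):
    """Balanced accuracy: per-class recall averaged over all classes.
    Assumes every class appears at least once in y_true (a class with no
    members would make its recall undefined)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [float(np.mean(y_pred[y_true == c] == c)) for c in range(n_classes)]
    return float(np.mean(recalls))
```

Unlike plain accuracy, this metric is insensitive to class imbalance, which is why it is the appropriate figure of merit for the per-digit evaluation reported in Table 3.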
In both approaches, the device variability decreased the balanced accuracy, i.e., it made the DNN work less accurately. Nevertheless, the optimized set of programming parameters achieves better results than its non-optimized counterpart: by about 6%. Therefore, it is clear that a significant improvement in the classification results of the DNN can be achieved with a careful selection of the programming parameters. This is not an obvious result, since it is known that noise and random variability do not affect the outcome of DNNs in the same way as they do other hardware systems, due to the particular features of neural networks [42]. It is worth highlighting that a few of the 20 attempts performed to implement variability in the synaptic weights produced a better balanced accuracy than the ideally quantized DNN without variability (data not shown). This could be considered a counterintuitive result. However, it is caused by the random nature of the variability introduced, which, in some cases, shifts the values of the synaptic weights closer to those of the optimal DNN without quantization, while, at other times, it shifts them in the opposite direction, producing poorer results.

Conclusions
In this study, the efficacy of optimizing the selection of the parameters employed by the multi-level programming algorithm, namely the M-ISPVA, is evaluated. The gate voltage (V_g) and the target current (I_trg) were used in the definition of the conductive levels, i.e., the synaptic weights, in Al:HfO2-based RRAM arrays intended to be implemented in in-memory computing accelerators. The goodness of the optimization was confirmed on the basis of four different features: (i) effective suppression of the overlap between the read-out current distributions of adjacent conductive levels, as well as improved linearity of their allocation within the read-out current range; (ii) reliable endurance cycling for at least one thousand cycles, together with successful switching between all four possible conductive levels; (iii) improved precision of the VMM operations in a simulated 8×8 hardware accelerator; and (iv) an improvement of about 6% in the recognition rates achieved by a simulated 4-layer feedforward DNN dedicated to the classification of the MNIST database.

Figure 1. Cross-sectional transmission electron microscopy (TEM) image and schematic of the 1T1R memory cell (left), and block diagram and micrograph of the 4-kbit memory device (right).

Figure 3. CDFs of the read-out currents experimentally measured just after the switching transition (solid symbols) and at the end of the M-ISPVA (open symbols), sampled logarithmically (1, 10, 100, and 1k cycles) during the 1k-cycle endurance test for the four optimized conductive levels.

Figure 5. CDFs of the read-out currents measured at the end of the M-ISPVA (open symbols) and calculated using Gaussian distributions (solid lines) for the four conductive levels for the non-optimized (a) and the optimized (b) programming approaches.

Figure 6. Schematic of a VMM accelerator implementation using two RRAM devices for each synaptic weight in a differential fashion (left) and of the weighted adder placed at each column (right).

Figure 7. Results of 100 Monte Carlo simulations of a VMM operation for a given input vector and a given weight matrix (solid lines), when using the non-optimized (a) and the optimized (b) combination of M-ISPVA parameters. The exact results calculated with non-quantized weights (square symbols and dashed lines) and the results calculated with ideally quantized weights (circle symbols) are included for comparison.

Figure 8. Total error (red lines) and Implementation error (black lines) for the 100 Monte Carlo simulations in Figure 7 when using the non-optimized (a) and the optimized (b) sets of M-ISPVA programming parameters.

Figure 9. Schematic of the fully connected DNN architecture with 784 nodes in the input layer (x_i), 10 nodes in the output layer (y_j), and three hidden layers (in gray) with 256 perceptron units (hl_ij) each.

Table 2. Average (µ) and standard deviation (σ) values (in µA) of the Gaussian fits for each conductive level for the non-optimized and the optimized operation schemes, respectively.

Table 3. Balanced accuracy (average µ_a and standard deviation σ_a, in %) obtained by working with the training and test datasets of the MNIST database for the non-optimized and the optimized programming approaches.