# Search acceleration and Learning at the Edge with Crossbar ReRAM

#### Sylvain Dubois

Vice President Business Development & Marketing sylvain.dubois@crossbar-inc.com

Aug 29th, 2019



#### Search can be similar to finding a needle in a haystack







## It's getting even more difficult with machine-generated data growth



#### >35ZB of data generated in 2023

#### Find the needle in a greater and greater haystack



## **Problem: Object (vector) Classification in AI**

There is a computing-intensive task required after every Neural Network





#### The memory bottleneck

## "Memory is the key to enable true intelligence"

BERNSTEIN ARTIFICIAL INTELLIGENCE





## Solution: XPU is a near-memory computing accelerator

- Host interface: xSPI/FIFO @ 66 MHz
- Targeted for massive search/lookups: kNN, RBF, CBIR, Softmax

## Deterministic perf & persistent memory

- Simultaneous processing
- 3 Billion OLUPS and 53 Billion OLU/Watt

#### Configurable

- 8-bit signed integer to binary objects
- Object length of 16 to 1K
- 1024 to 64K objects per macro
- Manhattan or Cosine distance
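As a software reference for what such a lookup computes, the following is a minimal sketch of a nearest-neighbour search over signed 8-bit objects using either Manhattan or cosine distance. It is not Crossbar's implementation or API: the function names (`manhattan`, `cosine_distance`, `k_nearest`) and the sample data are assumptions chosen for illustration.

```python
import math

def manhattan(a, b):
    """Sum of absolute component differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; defined here as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def k_nearest(query, objects, k=1, metric=manhattan):
    """Return the k (distance, index) pairs closest to the query (hypothetical helper)."""
    for vec in [query, *objects]:
        if len(vec) != len(query):
            raise ValueError("all objects must share the query's length")
        if any(not -128 <= v <= 127 for v in vec):
            raise ValueError("components must fit in a signed 8-bit integer")
    # Brute-force scan: score every stored object, then keep the k smallest.
    scored = sorted((metric(query, obj), i) for i, obj in enumerate(objects))
    return scored[:k]

# Three stored objects of length 16 (the shortest length listed above).
stored = [
    [10, -3, 7, 0] * 4,
    [-120, 5, 5, 90] * 4,
    [11, -2, 7, 1] * 4,
]
query = [10, -2, 7, 1] * 4

nearest = k_nearest(query, stored, k=2)
nearest_by_cosine = k_nearest(query, stored, k=1, metric=cosine_distance)
print(nearest)               # (distance, index) pairs under Manhattan distance
print(nearest_by_cosine[0])  # best match under cosine distance
```

A software scan like this touches every stored object for every query, which is the memory traffic a near-memory accelerator is meant to avoid.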

#### Scalable

- Multiple instances of macros/chips can be cascaded to increase the number of instances
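One way to picture cascading is as sharding: each macro searches only its own objects and reports a local winner, and the host keeps the best of those. The sketch below models that idea in software. The merge strategy, the function names (`macro_lookup`, `cascaded_lookup`) and the tiny capacity are assumptions for illustration; the slides do not describe how cascaded results are actually combined.

```python
def manhattan(a, b):
    """Sum of absolute component differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def macro_lookup(query, objects, base_index):
    """Model one macro: return (distance, global index) of its best local match."""
    return min((manhattan(query, obj), base_index + i) for i, obj in enumerate(objects))

def cascaded_lookup(query, objects, macro_capacity):
    """Split the objects across macros, then merge the per-macro winners."""
    if macro_capacity < 1 or not objects:
        raise ValueError("need at least one object and a positive capacity")
    winners = []
    for start in range(0, len(objects), macro_capacity):
        shard = objects[start:start + macro_capacity]
        winners.append(macro_lookup(query, shard, start))
    # The host only compares one candidate per macro.
    return min(winners), len(winners)

# Ten toy objects spread over macros that hold four objects each (assumed capacity).
objects = [[i, i, i, i] for i in range(0, 100, 10)]
best, macros_used = cascaded_lookup([42, 42, 42, 42], objects, macro_capacity=4)
print(best, macros_used)
```

Because each shard is searched independently, adding macros raises capacity without lengthening the per-macro scan.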

#### **Enabling Learning at the Edge**



#### 3+ Billion Objects LookUp Per Second (OLUPS)



| Object length | OLUPS         | OLU/Watt       |
|---------------|---------------|----------------|
| 1024          | 50,000,000    | 833,333,333    |
| 512           | 100,000,000   | 1,666,666,667  |
| 256           | 200,000,000   | 3,333,333,333  |
| 128           | 400,000,000   | 6,666,666,667  |
| 64            | 800,000,000   | 13,333,333,333 |
| 32            | 1,600,000,000 | 26,666,666,667 |
| 16            | 3,200,000,000 | 53,333,333,333 |

Scalable to 16 Billion OLUPS per stick
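The table rows follow a simple pattern: throughput halves each time the object length doubles, and OLUPS divided by OLU/Watt is the same in every row. The sketch below reproduces the table from two constants inferred from those rows, a fixed rate of 51.2 billion components per second and a power of about 60 mW. Neither figure is stated on the slides; both are derived here and should be read as assumptions.

```python
# Inferred from the table: 3.2 G lookups/s on 16-component objects.
COMPONENTS_PER_SECOND = 3_200_000_000 * 16
# Inferred from OLUPS / (OLU/Watt), which is 0.06 W in every row.
POWER_MILLIWATTS = 60

def olups(object_length):
    """Object lookups per second for a given object length."""
    return COMPONENTS_PER_SECOND // object_length

def olu_per_watt(object_length):
    """Lookups per second per watt, rounded to the nearest integer."""
    return (olups(object_length) * 1000 + POWER_MILLIWATTS // 2) // POWER_MILLIWATTS

table = {n: (olups(n), olu_per_watt(n)) for n in (1024, 512, 256, 128, 64, 32, 16)}
for length, (rate, efficiency) in table.items():
    print(f"{length:>5} {rate:>14,} {efficiency:>15,}")

# Instances needed for 16 billion OLUPS at length 16, assuming linear scaling.
instances_for_16g = 16_000_000_000 // olups(16)
print(instances_for_16g)
```

Under the linear-scaling assumption, the 16 billion OLUPS per stick figure corresponds to five instances running at the length-16 rate.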





## **Enabled by Crossbar ReRAM technology**



#### Programming: Positive Voltage on TE

- 1. Creation of Metal ions from TE oxidation
- 2. Electro-migration of the ions through the switching layer
- 3. Reduction of the ions and formation of the filament
- → ON state is reached when a complete filament is created between both electrodes

#### Erasing: Positive Voltage on BE

- 1. Oxidation of the filament atoms through electric field and temperature (Joule Heating)
- 2. Electro-migration of the ions through the switching layer
- 3. Reduction of the ions and reformation of the TE
- → OFF state is reached when the conductive path is broken



#### Status: from lab to fab



"In a lab, you can certainly create architectures that work with certain characteristics, but then when you go from the lab to high-volume manufacturing and you want to make billions of those devices at high yield, that's a whole different kettle of fish."

> Gary Dickerson, president and CEO Applied Materials



### Latest silicon results





### **Crossbar ReRAM Advantages**

|                                      | Target Commercial<br>Crossbar ReRAM<br>28/22nm | Commercial<br>Embedded Flash<br>40nm  | Anticipated<br>Oxygen ions based<br>RRAM 40nm | Anticipated<br>Embedded MRAM<br>22nm            | Crossbar ReRAM<br>advantage  |
|--------------------------------------|------------------------------------------------|---------------------------------------|-----------------------------------------------|-------------------------------------------------|------------------------------|
| Physical mechanism<br>& on/off ratio | Metal atoms storage<br>80~120X on/off ratio    | Electron storage<br>3~6X on/off ratio | Oxygen ions storage                           | Spin-polarized current<br>1.3~1.7X on/off ratio | Scales below 2xnm            |
| Stack complexity                     | Simple                                         | Complex<br>dedicated CMOS lines       | Simple                                        | Super complex<br>10+ layers stack               | 10X simpler than MRAM        |
| Materials involved                   | 3 films<br>Existing materials                  | Existing materials                    | 3 films<br>Existing materials                 | >25 materials                                   | 10X fewer materials vs. MRAM |
| Mask layer adder                     | 2 masks                                        | 6+ masks                              | 2 masks                                       | 5 masks                                         | 2X fewer masks vs. MRAM      |
| Read speed                           | 15 ns                                          | 25 ns                                 | 25 ns                                         | 20 ns                                           | Faster read                  |
| Write speed                          | 10 us                                          | 12 us                                 | 30 us                                         | 300 ns                                          |                              |
| Read energy                          | Low<br>0.2 uA/MHz/bit                          | Low<br>0.77 uA/MHz/bit                | Medium<br>1.2 uA/MHz/bit                      | High<br>2 uA/MHz/bit                            | 3X-10X lower energy          |
| Write current                        | Low<br>~60 uA/bit                              | Complex access<br>block erase only    | High<br>> 250 uA/bit                          | High<br>300 uA/bit                              |                              |
| Standby current                      | Low<br>2 uA                                    | Super high<br>> 150 uA                | Medium<br>> 4 uA                              | Super high<br>200 uA                            |                              |
| Data retention                       | > 10 yr                                        | > 10 yr                               | > 10 yr                                       | > 10 yr                                         |                              |
| Endurance                            | > 1M                                           | 10K / 100K                            | 10K                                           | 1M                                              | High reliability             |
| Operating temp                       | 125 °C                                         | 150 °C                                | 125 °C                                        | 150 °C                                          |                              |
| Magnetic immunity                    | YES                                            | YES                                   | YES                                           | NO                                              | Magnetic immunity            |
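One reading of the "3X-10X lower energy" callout is that it summarizes the read-energy row. The sketch below computes each technology's ratio to the Crossbar ReRAM target from the table values. The dictionary keys are shorthand labels, and the standby entries for embedded flash and oxygen-ion RRAM are lower bounds in the table (given as "> 150" and "> 4"), so their ratios are lower bounds too.

```python
# Values copied from the comparison table above; keys are shorthand labels.
read_energy_ua_per_mhz_bit = {
    "Crossbar ReRAM": 0.2,
    "Embedded Flash": 0.77,
    "Oxygen-ion RRAM": 1.2,
    "Embedded MRAM": 2.0,
}
standby_current_ua = {
    "Crossbar ReRAM": 2,
    "Embedded Flash": 150,   # table gives "> 150", so this is a lower bound
    "Oxygen-ion RRAM": 4,    # table gives "> 4", so this is a lower bound
    "Embedded MRAM": 200,
}

def ratios_vs_crossbar(values):
    """Ratio of each technology's figure to the Crossbar ReRAM figure."""
    base = values["Crossbar ReRAM"]
    return {
        name: round(value / base, 2)
        for name, value in values.items()
        if name != "Crossbar ReRAM"
    }

read_ratios = ratios_vs_crossbar(read_energy_ua_per_mhz_bit)
standby_ratios = ratios_vs_crossbar(standby_current_ua)
print(read_ratios)
print(standby_ratios)
```

The read-energy ratios fall between roughly 3.9X and 10X, which is consistent with the callout, while the standby-current gap against embedded flash and MRAM is considerably larger.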



### **Crossbar: Make an impact on Edge and Cloud computing**

#### Intelligence & Learning at the Edge

- Multi-modal event detection
- People re-identification

#### Reduce TCO and power for hyperscale players

- 3X lower cost than DRAM & 8X lower energy
- \$1K reduction per server







#### **Research programs for ReRAM-based monolithic computers**



Monolithic ReRAM + CPUs die

- 1TB/s access to ReRAM tiles
- 500X denser than SRAM
- 10X energy efficiency over stacked DRAM

University of Maryland Technical Report UMIACS-TR-2019-01, July 2019.

#### Design and Evaluation of Monolithic Computers Implemented Using Crossbar ReRAM

Meenatchi Jagasivamani, Candace Walden, Devesh Singh, Shang Li, Luyi Kang, Mehdi Asnaashari, Sylvain Dubois, Bruce Jacob, and Donald Yeung

University of Maryland and Crossbar Incorporated

ABSTRACT

A monolithic computer is an emerging architecture in which a multicore CPU and a high-capacity main memory system are all integrated in a single die. We believe such architectures will be possible in the near future due to nonvolatile memory technology, such as the resistive random access memory, or *ReRAM*, from Crossbar Incorporated. Crossbar's ReRAM can be fabricated in a standard CMOS logic process, allowing it to be integrated into a CPU's die. The ReRAM cells are manufactured in between metal wires and do not employ per-cell access transistors, leaving the bulk of the base silicon area vacant. This means that a CPU can be monolithically integrated directly underneath the ReRAM memory, allowing the cores to have massively parallel access to the main memory.

This paper presents the characteristics of Crossbar's ReRAM technology, informing architects on how ReRAM can enable monolithic computers. Then, it develops a CPU and memory system architecture around those characteristics, especially to exploit the unprecedented memory-level parallelism. The architecture employs a tiled CPU, and incorporates memory controllers into every compute tile that support a variable access granularity to enable high scalability. Lastly, the paper conducts an experimental evaluation of monolithic computers on graph kernels and streaming computations. Our results show that compared to a DRAM-based tiled CPU, a monolithic computer achieves 4.7x higher performance on the graph kernels, and achieves roughly parity on the streaming computations. Given a future 7nm technology node, a monolithic computer could outperform the conventional system by 66% for the streaming computations.

#### 1. INTRODUCTION

In the post-Moore era, computer architects will no longer be able to rely on technologies to continue driving architectural innovation. This paper explores one such possibility: monolithic computers. A monolithic computer relies on new logic-memory integration technology to fabricate a CPU and a high-capacity main memory system all on a single die. Compared to conventional package-level integration involving multiple dies (e.g., stacking DRAM dies on top of a logic layer or integrating DRAM and CPU dies over a silicon interposer), a monolithic computer achieves much higher integration of the CPU and memory system. Whereas package-level integration can support thousands […] integration will be able to support millions of wires, providing a much wider main memory interface than is currently possible. This will enable architects to deliver greater memory parallelism and bandwidth to data-intensive computations, and achieve superior bandwidth per watt. In addition, monolithic computers will reduce the physical distance that memory requests will need to travel. Because all memory requests can stay on the CPU die, there won't be a need to cross the silicon interposer, or worse, to traverse the system motherboard. This locality benefit can provide significant additional improvements in power efficiency.

Monolithic computers do not exist yet, but some researchers believe emerging non-volatile memories will change that in the future [1,2]. Aly et al. [1] argue that spin-transfer torque magnetic RAM (STT-MRAM) or resistive RAM (ReRAM) are such enabling memory technologies. Unlike conventional DRAM, which requires special memory fabrication processes, STT-MRAM and ReRAM can be fabricated in a standard CMOS logic process. Thus, they have the potential to be integrated into a CPU's die.

The main hurdle, though, is identifying a suitable integration technology. Fine-grained 3D monolithic integration [15] [2] has been proposed as a possible solution, one that assumes a process technology in which multiple planar layers of silicon can be fabricated monolithically in 3D. This would allow compute logic and non-volatile memory to be integrated in alternating planar layers. While this results in extremely high logic and memory densities, unfortunately, it requires advanced process technology that is still at the developmental stages in research labs.

In our work, we assume a much simpler integration approach that exploits non-volatile memories with 3D crosspoint architectures. In crosspoint memories (examples include Intel's Optane [3] as well as the ReRAM technology from Crossbar Incorporated [4]), the memory cells are fabricated in between metal wires of a CMOS logic process, i.e., at the intersection of wires laid out perpendicularly in adjacent metal layers. Rather than isolate individual cells using access transistors, crosspoint arrays provide inter-cell isolation via selector devices integrated with the memory cells. So, there are no transistors within the core of the crosspoint arrays. Instead, the bulk of the silicon area underneath the memory arrays is free for implementing non-memory circuits.

It is well known that such crosspoint memories can be layered on top of compute logic during back-end of line (BEOL) processing, for example in the top metal layers of a CPU's die. Although there are no per-cell transistors, some peripheral logic is still needed at each crosspoint array for […]

- http://maggini.eng.umd.edu/pub/UMIACS-TR-2019-01.pdf
- http://maggini.eng.umd.edu/pub/monolithic-memsys18.pdf



#### Summary

- Huge demand for more efficient memory access in AI
- Solution is to bring data closer to computing
- Crossbar XPU delivers billions of OLUPS
- Enabled by ReRAM
  - Forming Free with DC voltage below 2V
  - Compatible with CMOS integration
  - Compatible with pre-soldering reflow programming
  - Extremely good Endurance and Retention beyond 100kC

#### **Crossbar moving the needle in Edge and Cloud Computing**







© Crossbar, Inc. All rights reserved.