



## Head in the Clouds – Building a chip for scale out computing

Bryan Chin & the Cavium Team, Cavium, Inc.

## Agenda

- Cavium Background
- Network Processor
- Core characteristics
- Data center processor
- Potential for performance

#### **Cavium SoCs for a Range of Target Markets**



Highly Integrated SoCs enable Lower Real-Estate, Cost & Power

#### The Road from Here to There





#### Cavium All Rights Reserved - 2014

## **OCTEON III CN78XX**





#### **Relative Sizes**





Sandy Bridge: 4 decode, 6 issue, OoO

Apple A7: 6 issue, OoO

Cavium MIPS core: 2 issue, in order

It's hard to get interesting die photos because of all the metal...

### **Angels on a Pinhead**



- Can fit about 4 simple cores in the area of one complex core
- But what is the cost?
  - Simpler micro-architecture (less ILP, MLP)
  - Shallow memory hierarchy
  - Global shared L2 instead of private or semiprivate L2
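
The area trade-off above can be put into a rough back-of-the-envelope model. The sketch below is only an illustration: the relative per-core performance (`simple_core_perf`) and the parallel fraction are hypothetical inputs, not figures from the talk, and the model is plain Amdahl's law applied to four simple cores occupying the area of one complex core.

```python
def relative_speed(parallel_fraction, simple_core_perf, n_cores=4):
    """Speed of n simple cores relative to one complex core (= 1.0).

    parallel_fraction: share of the work that spreads across cores (0..1).
    simple_core_perf:  hypothetical per-core speed of a simple core,
                       relative to the complex core (e.g. 0.5 = half as fast).
    n_cores:           simple cores fitting in the complex core's area
                       (the slide says about 4).
    """
    serial_time = (1.0 - parallel_fraction) / simple_core_perf
    parallel_time = parallel_fraction / (n_cores * simple_core_perf)
    return 1.0 / (serial_time + parallel_time)


# Fully parallel work: four half-speed cores give twice the throughput.
fully_parallel = relative_speed(1.0, 0.5)
# Fully serial work: a single half-speed core is simply half as fast.
fully_serial = relative_speed(0.0, 0.5)
```

Under these assumptions the simple cores win whenever the work is parallel enough and each simple core delivers more than a quarter of the complex core's performance.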

#### **OCTEON II L1d (DCACHE) MPKI Generally Less Than Nehalem L2 (MLC) MPKI**

Nehalem – 32KB 4-way I, 32KB 8-way D, 256KB L2, 4-24 MB L3

OCTEON II – 37KB 37-way I, 32KB 32-way D, 4MB 16-way L2

MPKI = Misses Per Kilo Instructions

MLC = Mid Level Cache

| SPEC2006 Integer | 32KB L1d MPKI<br>Cavium Octeon II | 256KB L2 MPKI<br>Intel Nehalem |
|------------------|-----------------------------------|--------------------------------|
| 401.bzip2        | 6.21                              | 8.34                           |
| 429.mcf          | 84.59                             | 108.1                          |
| 445.gobmk        | 2.11                              | 3.03                           |
| 456.hmmer        | 1.08                              | 3.02                           |
| 458.sjeng        | 1.14                              | 0.89                           |
| 462.libquantum   | 12.91                             | 38.6                           |
| 464.h264ref      | 3.99                              | 2.25                           |
| 473.astar        | 9.49                              | 10.6                           |

# *Highly associative* first-level cache achieves the same or better hit rate than a two-level private cache
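
As a quick check of the "generally less than" claim, the sketch below encodes the MPKI definition and the table values above, then lists the benchmarks where the OCTEON II L1d MPKI is lower than the Nehalem L2 MPKI. The function and variable names are illustrative only.

```python
def mpki(misses, instructions):
    """Misses per kilo (thousand) instructions."""
    return misses / (instructions / 1000)


# (OCTEON II 32KB L1d MPKI, Nehalem 256KB L2 MPKI), copied from the table.
TABLE = {
    "401.bzip2": (6.21, 8.34),
    "429.mcf": (84.59, 108.1),
    "445.gobmk": (2.11, 3.03),
    "456.hmmer": (1.08, 3.02),
    "458.sjeng": (1.14, 0.89),
    "462.libquantum": (12.91, 38.6),
    "464.h264ref": (3.99, 2.25),
    "473.astar": (9.49, 10.6),
}

# Benchmarks where the single-level L1d misses less often than Nehalem's L2.
octeon_lower = [name for name, (l1d, l2) in TABLE.items() if l1d < l2]
# The exceptions that make the claim "generally" rather than "always".
octeon_higher = [name for name, (l1d, l2) in TABLE.items() if l1d >= l2]

print(f"{len(octeon_lower)} of {len(TABLE)} benchmarks favor the OCTEON II L1d")
print("exceptions:", octeon_higher)
```

By this count the OCTEON II L1d comes out ahead on six of the eight benchmarks, with 458.sjeng and 464.h264ref as the exceptions.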

### **Characteristics of the Workloads**

|                                  | Networking   | Scaleout     | Enterprise              |
|----------------------------------|--------------|--------------|-------------------------|
| Highly parallel                  | $\checkmark$ | $\checkmark$ |                         |
| Benefit from ILP/MLP<br>(OoO) *  |              |              | $\checkmark$            |
| Repetitive task on lots of data  | $\checkmark$ | $\checkmark$ |                         |
| Hardware accelerator<br>Friendly | $\checkmark$ | $\checkmark$ | Sometimes<br>(e.g. GPU) |
| Compile once, run many           | $\checkmark$ | $\checkmark$ |                         |

\* Ferdman et al., "Clearing the Clouds: A Study of Emerging Scale-Out Workloads on Modern Hardware," ASPLOS 2012

## In this case, simpler is "better"

- Cloud workloads -> Not a lot of ILP or MLP
- What can you do to improve performance?
- OoO depends on
  - Available ILP
  - Overlapping independent memory operations (MLP)
- Build an in order machine
  - Reduce the load to use latency in the L1-
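
To see why load-to-use latency matters so much for an in-order machine, here is a toy model. It is a sketch under stated assumptions, not the cnMIPS pipeline: single issue, ALU results ready one cycle after issue, load results ready `load_to_use` cycles after issue, and an instruction stalls until all of its sources are ready. The program and register names are made up for illustration.

```python
def run_in_order(program, load_to_use):
    """Return the number of cycles needed to issue every instruction in order.

    program: list of (op, dst, srcs) tuples, where op is "load" or "alu".
    """
    ready = {}   # register -> cycle at which its value becomes available
    cycle = 0    # earliest cycle at which the next instruction may issue
    for op, dst, srcs in program:
        # In-order issue: wait for the slowest source operand.
        issue = max([cycle] + [ready.get(src, 0) for src in srcs])
        latency = load_to_use if op == "load" else 1
        ready[dst] = issue + latency
        cycle = issue + 1
    return cycle


# A dependent chain (pointer-chasing style): each load feeds the next op.
chain = [
    ("load", "r1", ["r0"]),
    ("alu", "r2", ["r1"]),
    ("load", "r3", ["r2"]),
    ("alu", "r4", ["r3"]),
]

for latency in (3, 4):
    print(latency, "cycle load-to-use ->", run_in_order(chain, latency), "cycles")
```

In this model every extra cycle of load-to-use latency is paid in full on each dependent load, because an in-order core has no other work to overlap with the stall.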



#### cnMIPS II Core 8+ Stage Pipeline





- Thread-dedicated resources = very deterministic CPU performance
- Highly-associative L1 caches = equivalent miss rate to much larger caches

## Achieving a 3 cycle load to use latency



- The higher the dispatch width, the more instruction slots there are to fill
- Custom Circuit techniques
  - Not everything has to happen on a single cycle boundary
  - Need to do special timing analysis
  - Deterministic delay allows for optimal placement of registers/latches
  - Ability to build efficient, high speed fully associative structures
  - Usual other tricks (...)
- Optimize wires, logic for speed
- Have a simple load instruction
  - Alignment, simple address calculation, fewer exceptional conditions

#### **Scale Out Data Center Processor**

|                                  | Networking<br>Processor | Data Center<br>Processor | Remarks                                                   |
|----------------------------------|-------------------------|--------------------------|-----------------------------------------------------------|
| Highly Threaded                  | $\checkmark$            | $\checkmark$             |                                                           |
| SSO (Scheduling Sync Unit)       | $\checkmark$            |                          |                                                           |
| Compression Accelerator          | $\checkmark$            | $\checkmark$             |                                                           |
| Crypto Acceleration              | $\checkmark$            | 1/2                      | Focus is not on packet processing                         |
| Input Packet Parsing             | $\checkmark$            | 1/2                      | Data center more<br>homogeneous                           |
| Output queuing                   | $\checkmark$            | 1/2                      | Data center more<br>homogeneous                           |
| High Bandwidth Networking        | $\checkmark$            | $\checkmark$             |                                                           |
| Low Latency Networking           |                         | $\checkmark$             |                                                           |
| Regular Expression Engine        | $\checkmark$            | $\checkmark$             | Can be repurposed                                         |
| Integrated High Speed<br>Network | $\checkmark$            | $\checkmark$             | Reduce components,<br>improve reliability, lower<br>power |
| Integrated Storage               |                         | $\checkmark$             | Rotating Media, SSD                                       |

### **Example: RegEx acceleration**

- In Octeon, RegEx hardware is used for packet sniffing
  - Intrusion detection
  - Virus detection
  - Packet classification
- In the data center, RegEx hardware can be used
  - To parse unstructured and semi-structured text data
    - Find ZIP codes, phone numbers, names, addresses
    - Search machine logs (error detection, site visit statistics)
  - Works well when setup time is small compared to run time: streaming bulk data!
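
For reference, this is what the software-only version of such a parsing task looks like. The sketch uses Python's standard `re` module; the two patterns (US-style ZIP codes and `NNN-NNN-NNNN` phone numbers) and the sample text are simplified assumptions for illustration and say nothing about how the hardware engine is programmed.

```python
import re

# Patterns are compiled once (the "setup") and then reused on bulk data.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")       # 12345 or 12345-6789
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")    # 408-555-0100

text = (
    "Ship to 95134, call 408-555-0100 or 650-555-0199; "
    "billing ZIP 94043-1351."
)

zip_codes = ZIP_RE.findall(text)
phones = PHONE_RE.findall(text)

print(zip_codes)  # ['95134', '94043-1351']
print(phones)     # ['408-555-0100', '650-555-0199']
```

Compiling the patterns is the setup cost; it pays off when the same patterns are then streamed over a large volume of text.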

#### **Example: Big Data Text Search (2)**

- Text search
  - Submit precompiled regex pattern to regex engine
  - Search for 1 of 4 different patterns (e.g. url, email, date)
- Next generation is 4.5x to 150x faster than a software-only solution
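
A software baseline for the "one of several patterns" search might look like the sketch below. It is an assumption-laden illustration: the slide mentions four patterns but names only url, email, and date, so only those three appear here, and the regular expressions themselves are deliberately simplistic. The patterns are combined into one precompiled alternation with named groups, so each match reports which pattern fired.

```python
import re

# Simplified, hypothetical patterns for the categories named on the slide.
PATTERNS = {
    "url": r"https?://\S+",
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "date": r"\d{4}-\d{2}-\d{2}",
}

# One precompiled pattern: (?P<url>...)|(?P<email>...)|(?P<date>...)
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS.items())
)


def count_matches(lines):
    """Stream over lines and count matches per pattern name."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for match in COMBINED.finditer(line):
            counts[match.lastgroup] += 1  # name of the alternative that matched
    return counts


sample_log = [
    "2014-08-10 GET http://example.com/index.html",
    "contact admin@example.org on 2014-08-11",
    "nothing to see here",
]
counts = count_matches(sample_log)
print(counts)  # {'url': 1, 'email': 1, 'date': 2}
```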



#### **Memcached Latency Profile**

Memcached latency variation with TPS Rates (TCP protocol - x86 vs Octeon CN68XX)



- Single-socket OCTEON II compares well with dual-socket Xeon E5-2690
- 2.9 GHz Xeon versus 1.5 GHz Octeon
- More cores is goodness

#### **Cavium Processors**

- Networking Market
  - OCTEON Family
  - Well suited to area efficient cores (lots of aggregate processing)
  - Well suited to purpose built accelerators
- Scale Out Data Center
  - Well suited to area efficient cores
  - Well suited to purpose built accelerators
- SHAMELESS PLUG #1: It takes a lot of talented people – we are hiring! (bchin@cavium.com)

## Benchmarking

- How do we measure performance on these new kinds of applications?
  - Need to develop better metrics
  - System performance
    - Data center network
    - Disk I/O
    - OS software
- How do we measure agility?
  - Configuration, maintainability
  - Elasticity
- Challenging problem
  - Very dynamic space
    - YARN, HIVE, PIG, Hadoop, Storm, Spark, Mahout, R, Presto, Drill, Scalding, Summingbird, Thrift, Impala, Parquet, SCUBA, Kafka, Cobbler, Chef
- SHAMELESS PLUG #2
  - Talk to me about it; eembc.org
  - markus.levy@eembc.org; bchin@cavium.com

#### **Questions?**

# Poll – how many engineers does it take?

- (Core) Microprocessor design team (simple core)
  - Logic designers
  - Circuit and implementation
  - Verification
  - Physical design
  - Validation
  - Subtotal:
- Rest of chip
  - Logic designers
  - Circuit and implementation
  - Verification
  - Physical design
  - Validation
  - Subtotal:
- Total: ???

# **Poll – how many engineers does it take?**

IMHO [rough answers – based on experience at Sun, MIPS, QED, PMC, Cavium, etc.]:

- (Core) Microprocessor design team (simple core)
  - Logic designers (~6) FP Unit, Integer Unit, Load/Store, Instruction Fetch
  - Circuit and implementation (~6) custom circuits
  - Verification (~6) rule of thumb: 1 to 2 verification engineers for each RTL designer
  - Physical design (~10)
  - Validation (~5)
  - Subtotal: ~30-40
- Rest of chip
  - Logic designers (~20)
  - Circuit and implementation (~10)
  - Verification (~20)
  - Physical design (~10)
  - Validation (~5)
  - Subtotal: 60-70
- Total: >100
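
The subtotals can be sanity-checked by adding up the point estimates from the list above. The dictionary keys are just labels chosen for this sketch. The sums land inside the quoted ranges, with the overall figure at the low end of the slide's rough "more than 100".

```python
# Point estimates (the "~N" figures) taken from the list above.
core_team = {
    "logic": 6, "circuit": 6, "verification": 6, "physical": 10, "validation": 5,
}
rest_of_chip = {
    "logic": 20, "circuit": 10, "verification": 20, "physical": 10, "validation": 5,
}

core_subtotal = sum(core_team.values())     # slide says ~30-40
rest_subtotal = sum(rest_of_chip.values())  # slide says 60-70
total = core_subtotal + rest_subtotal       # slide says >100 (rough)

print(core_subtotal, rest_subtotal, total)  # 33 65 98
```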