02. Pipeline and branch prediction
CPU performance [instructions/sec] = frequency / CPI = (cycles/sec) / (cycles/instruction)
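(A worked example with made-up numbers: a 2 GHz CPU averaging 4 cycles per instruction executes 2·10⁹ / 4 = 5·10⁸ instructions per second; halving CPI to 2 doubles performance at the same frequency.)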
- Higher frequency
- ⇒ more energy cost
- ⇒ more heat emitted
- ⇒ limited by hardware (e. g. by the speed of electrons ⇒ smaller components)
- More efficient instructions
- Logic (e. g. the «memory wall» effect)
- Microprogram design of the CPU's internal devices (see below)
- + More efficient memory access
Simultaneous execution: dependency
Problem: simultaneity is impossible when the next instruction cannot be executed until the previous one is done
- Data dependence
  - instruction 2 uses data prepared by instruction 1
    - register calculation flow and conditional operations
    - memory access (both memory calculation flow and $at register usage)
- Indirect addressing
  - instruction 1 calculates the address of instruction 2
There's no universal dependency eliminator.
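A minimal MIPS sketch of these dependence kinds (register choices and the label are arbitrary, for illustration only):

    .text
    main:   add  $t0, $t1, $t2     # instruction 1 produces $t0
            sub  $t3, $t0, $t4     # instruction 2 consumes $t0: register data dependence
            sw   $t3, 0($sp)       # store to memory
            lw   $t5, 0($sp)       # load from the same address: memory dependence
            la   $t6, next         # pseudo-instruction: expands through $at (hidden $at dependence)
            jr   $t6               # indirect jump: the target address was computed by the previous instruction
    next:   nop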
Scaling
Leaving task parallelism aside, what can be done to scale up the execution of a single program?
- Vector operations
- scalable data in one instruction (e. g. vector addition, mesh tracing etc.)
- ⇒ Full vector load is needed for efficiency
- ⇒ Highly sensitive to data dependence
  ⇒ Reordering of calculations as a possible solution (see the sketch after this list)
- Multimedia (MMX/SSE etc.)
- GPU/APU
- DAC/ADC
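The «reordering of calculations» item above can be sketched even in plain MIPS (MARS has no real vector instructions; $a0 = current element address and $a1 = end address are assumptions): one accumulator chains every iteration to the previous one, while two independent partial sums break the chain — exactly the shape a vector unit needs to stay fully loaded.

    # serial dependence chain: each add waits for the previous value of $t0
    loop1:  lw   $t1, 0($a0)
            add  $t0, $t0, $t1
            addi $a0, $a0, 4
            bne  $a0, $a1, loop1

    # reordered: two independent chains (array length assumed even)
    loop2:  lw   $t1, 0($a0)
            lw   $t2, 4($a0)
            add  $t3, $t3, $t1       # chain A
            add  $t4, $t4, $t2       # chain B, independent of A
            addi $a0, $a0, 8
            bne  $a0, $a1, loop2
            add  $t0, $t3, $t4       # combine the partial sums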
- Assembler optimization
- Hardware code reordering
  - multiple instructions at a time (or rather microinstructions, e. g. decoding, memory access, calculation etc.)
  - ⇒ Reordering of instructions as a possible solution (see the sketch below)
Full loading problem: any dependency turns parallel execution into sequential
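A sketch of the reordering mentioned above (whether done by an optimizing assembler or on the fly by hardware): an independent instruction is moved into the gap between a producer and its consumer, so the gap holds useful work instead of a stall.

    # before: sub must wait for the loaded $t0
    lw   $t0, 0($a0)
    sub  $t2, $t0, $t1      # stalls until $t0 arrives
    addi $a1, $a1, 1        # independent of both lines above

    # after reordering: the independent addi covers the load latency
    lw   $t0, 0($a0)
    addi $a1, $a1, 1
    sub  $t2, $t0, $t1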
- VLIW (very long instruction word) — loosely speaking, a CPU capable of performing almost any set of operations in parallel, with manual planning of which operations run on which data
- Easy to implement and to scale further
- Complex compilation / programming in the common case
- Effective on fixnum algorithms
Examples: «Эльбрус» (Elbrus), old supercomputers, GPUs
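An illustration only (hypothetical bundle notation with MIPS-like mnemonics, not Elbrus or any real VLIW syntax):

    # one very long instruction word = one cycle; slots are planned by hand / by the compiler:
    #   bundle 1:  add $t1,$t2,$t3  ||  mul $t4,$t5,$t6  ||  lw $t7,0($t8)
    #   bundle 2:  sub $t9,$t1,$t4  ||  nop              ||  nop
    # $t1 and $t4 are produced in bundle 1, so the sub cannot share their bundle;
    # the slots it cannot fill become nops — the full loading problem above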
Hardware implementations
- Predictive calculations: use multiple instances of CPU components to execute instructions simultaneously
- in parallel, if no dependency
- in advance, if there might be a path dependency
- dropped if the result becomes irrelevant
- require spare memory (e. g. register duplication) and instant sync
- predict jumps to reduce drops of incorrectly chosen paths
- voodoo
- On-the-fly code optimisation
- Dependency tracing
- Register renaming for data dependency elimination (see the sketch after this list)
- Independent instructions reordering
- Dependent instructions reordering for predictive calculations
- stronger voodoo
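A sketch of the renaming item from the list above (the physical registers p5…p10 are invented for illustration; real hardware renames invisibly):

    # reuse of $t0 creates a false, name-only dependence:
    lw   $t0, 0($a0)
    add  $t1, $t0, $t1      # true dependence: must wait for the first lw
    lw   $t0, 4($a0)        # WAR/WAW conflict on $t0 — blocks reordering as written
    add  $t2, $t0, $t2
    # after renaming (hypothetical):
    #   first  $t0 -> p7:   lw p7, 0($a0);  add p8,  p7, p5
    #   second $t0 -> p9:   lw p9, 4($a0);  add p10, p9, p6
    # the name conflict disappears and both pairs can execute in parallel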
Example: MARS BHT simulator
Example code:
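The original snippet is not reproduced here; the following hypothetical minimal loop is enough to watch the predictor learn in the MARS BHT simulator tool — its backward branch is taken on every iteration except the last:

            li   $t0, 0            # iteration counter
            li   $t1, 10           # number of iterations
    loop:   addi $t0, $t0, 1
            bne  $t0, $t1, loop    # taken 9 times, not taken once
            nop                    # reached after the final, mispredicted iteration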
Pipeline
Pipeline is a microprogram-based instruction design in which a single instruction is divided into stages, each implemented by a separate CPU part. Under favorable circumstances, each stage can act in parallel with the others (each taken from a different instruction).
Rough scheme of the MIPS pipeline (other architectures provide different ones):
Instruction fetch
- Read from text memory
Instruction Decode
- Determine arguments and read them from registers
including fresh data calculated on the E stage but not yet passed through the W stage
- Determine the next address (including non-pseudo conditional branches, because they do not require arithmetic)
Execution
- ALU/CP1 calculates
Memory fetch
- Data memory access
Writeback
- Write result to registers
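How three independent instructions overlap in these stages (F/D/E/M/W; an illustrative diagram, one column per cycle):

                     cycle:  1   2   3   4   5   6   7
    add  $t0, $t1, $t2       F   D   E   M   W
    ori  $t3, $t4, 1             F   D   E   M   W
    lw   $t5, 0($sp)                 F   D   E   M   W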
Example:
a simple HTML MIPS pipeline emulator (with no register back-access) (GitHub)
bubble (nop) auto-insertion
Data conflict
Pipeline slip/stall
MIPS1/MIPS2 (ancient): two bubbles (nops) are needed, because the W stage of lw has not executed yet
All other MIPSes: back-access ($t2 is ready right after the M stage), so only one stall
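The two cases above, as code (a sketch; real hardware stalls by itself, the nops only mark where the bubbles sit):

    lw   $t2, 0($a0)        # result ready after M (back-access) or only after W (MIPS1/2)
    nop                     # bubble 1: needed in both cases
    nop                     # bubble 2: needed only on MIPS1/MIPS2, which lack back-access
    add  $t3, $t2, $t2      # the dependent instruction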
Path conflict
On the D stage the next address is calculated, but the next instruction has already been fetched, so a «delay slot» appears.
- Delay slot
- fetching and execution of the instruction located right after a conditional branch, irrespective of the condition value
  addi $ra, $ra, 4   |   addi $t2, $t2, 4
  jr   $ra           |   jr   $ra
                     |   nop
- or
  addi $ra, $ra, 4   |   jr   $ra
  jr   $ra           |   addi $t2, $t2, 4
The code at the bottom right is correct