02. Pipeline and branch prediction
CPU performance [instructions/sec] = frequency / CPI = (cycles/sec) / (cycles/instruction)
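(A worked example with made-up numbers: a 2 GHz CPU averaging 4 cycles per instruction executes 2·10⁹ / 4 = 5·10⁸ instructions per second; halving CPI to 2 doubles performance at the same frequency.)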
- Higher frequency
- ⇒ more energy cost
- ⇒ more heat emitted
- ⇒ limited by hardware (e. g. by the speed of electrons ⇒ smaller components)
- More efficient instructions
- Logic (e. g. the «memory wall» effect)
- Microprogram design of the CPU's internal devices (see below)
- + More efficient memory access
Simultaneous execution: dependency
Problem: simultaneity is impossible when the next instruction cannot be executed until the previous one is done
- Data dependence
  - instruction 2 uses data prepared by instruction 1
    - register calculation flow and conditional operations
    - memory access (both memory calculation flow and $at register usage)
- Indirect addressing
  - instruction 1 calculates the address of instruction 2
There's no universal dependency eliminator.
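A minimal MIPS sketch of these dependence kinds (register choices and the label are arbitrary, for illustration only):

    .text
    main:   add  $t0, $t1, $t2     # instruction 1 produces $t0
            sub  $t3, $t0, $t4     # instruction 2 consumes $t0: register data dependence
            sw   $t3, 0($sp)       # store to memory
            lw   $t5, 0($sp)       # load from the same address: memory dependence
            la   $t6, next         # pseudo-instruction: expands through $at (hidden $at dependence)
            jr   $t6               # indirect jump: the target address was computed by the previous instruction
    next:   nop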
Scaling
Leaving task parallelism aside, what can be done to scale up the execution of a single program?
- Vector operations
- scalable data in one instruction (e. g. vector addition, mesh tracing etc.)
- ⇒ Full vector load is needed for efficiency
- ⇒ Highly sensitive to data dependence
  ⇒ Reordering of calculations as a possible solution (see the sketch after this list)
- Multimedia (MMX/SSE etc.)
- GPU/APU
- DAC/ADC
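The «reordering of calculations» item above can be sketched even in plain MIPS (MARS has no real vector instructions; $a0 = current element address and $a1 = end address are assumptions): one accumulator chains every iteration to the previous one, while two independent partial sums break the chain — exactly the shape a vector unit needs to stay fully loaded.

    # serial dependence chain: each add waits for the previous value of $t0
    loop1:  lw   $t1, 0($a0)
            add  $t0, $t0, $t1
            addi $a0, $a0, 4
            bne  $a0, $a1, loop1

    # reordered: two independent chains (array length assumed even)
    loop2:  lw   $t1, 0($a0)
            lw   $t2, 4($a0)
            add  $t3, $t3, $t1       # chain A
            add  $t4, $t4, $t2       # chain B, independent of A
            addi $a0, $a0, 8
            bne  $a0, $a1, loop2
            add  $t0, $t3, $t4       # combine the partial sums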
- Assembler optimization
- Hardware code reordering
  - multiple instructions at a time (or rather microinstructions, e. g. decoding, memory access, calculation etc.)
  - ⇒ Reordering of instructions as a possible solution (see the sketch below)
Full loading problem: any dependency turns parallel execution into sequential
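A sketch of the reordering mentioned above (whether done by an optimizing assembler or on the fly by hardware): an independent instruction is moved into the gap between a producer and its consumer, so the gap holds useful work instead of a stall.

    # before: sub must wait for the loaded $t0
    lw   $t0, 0($a0)
    sub  $t2, $t0, $t1      # stalls until $t0 arrives
    addi $a1, $a1, 1        # independent of both lines above

    # after reordering: the independent addi covers the load latency
    lw   $t0, 0($a0)
    addi $a1, $a1, 1
    sub  $t2, $t0, $t1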
- VLIW (very long instruction word) — loosely speaking, a CPU capable of performing almost any set of operations in parallel, with manual planning of which operations run on which data
- Easy to implement and to scale further
- Complex compilation / programming in the common case
- Effective on fixnum algorithms
Examples: «Эльбрус» (Elbrus), old supercomputers, GPUs
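An illustration only (hypothetical bundle notation with MIPS-like mnemonics, not Elbrus or any real VLIW syntax):

    # one very long instruction word = one cycle; slots are planned by hand / by the compiler:
    #   bundle 1:  add $t1,$t2,$t3  ||  mul $t4,$t5,$t6  ||  lw $t7,0($t8)
    #   bundle 2:  sub $t9,$t1,$t4  ||  nop              ||  nop
    # $t1 and $t4 are produced in bundle 1, so the sub cannot share their bundle;
    # the slots it cannot fill become nops — the full loading problem above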
Hardware implementations
- Predictive calculations: use multiple instances of CPU components to execute instructions simultaneously
- in parallel, if no dependency
- in advance, if there might be a path dependency
- dropped if the result becomes irrelevant
- require spare memory (e. g. register duplication) and instant sync
- predict jumps to reduce drops of incorrectly chosen paths
- voodoo
- On-the-fly code optimisation
- Dependency tracing
- Register renaming for data dependency elimination (see the sketch after this list)
- Independent instructions reordering
- Dependent instructions reordering for predictive calculations
- stronger voodoo
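A sketch of the renaming item from the list above (the physical registers p5…p10 are invented for illustration; real hardware renames invisibly):

    # reuse of $t0 creates a false, name-only dependence:
    lw   $t0, 0($a0)
    add  $t1, $t0, $t1      # true dependence: must wait for the first lw
    lw   $t0, 4($a0)        # WAR/WAW conflict on $t0 — blocks reordering as written
    add  $t2, $t0, $t2
    # after renaming (hypothetical):
    #   first  $t0 -> p7:   lw p7, 0($a0);  add p8,  p7, p5
    #   second $t0 -> p9:   lw p9, 4($a0);  add p10, p9, p6
    # the name conflict disappears and both pairs can execute in parallel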
Example: MARS BHT simulator
Example code:
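The original snippet is not reproduced here; the following hypothetical minimal loop is enough to watch the predictor learn in the MARS BHT simulator tool — its backward branch is taken on every iteration except the last:

            li   $t0, 0            # iteration counter
            li   $t1, 10           # number of iterations
    loop:   addi $t0, $t0, 1
            bne  $t0, $t1, loop    # taken 9 times, not taken once
            nop                    # reached after the final, mispredicted iteration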
Pipeline
Pipeline is a microprogram-based instruction design in which a single instruction is divided into stages, each implemented by a separate CPU part. Under favorable circumstances, each stage can act in parallel with the others (each taken from a different instruction).
Rough scheme of the MIPS pipeline (other architectures provide different ones):
Instruction fetch
- Read from text memory
Instruction Decode
- Determine arguments and read them from registers
including fresh data calculated on the E stage but not yet passed through the W stage
- Determine the next address (including non-pseudo conditional branches, because they do not require arithmetic)
Execution
- ALU/CP1 calculates
Memory fetch
- Data memory access
Writeback
- Write result to registers
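How three independent instructions overlap in these stages (F/D/E/M/W; an illustrative diagram, one column per cycle):

                     cycle:  1   2   3   4   5   6   7
    add  $t0, $t1, $t2       F   D   E   M   W
    ori  $t3, $t4, 1             F   D   E   M   W
    lw   $t5, 0($sp)                 F   D   E   M   W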
Example:
a simple HTML MIPS pipeline emulator (with no register back-access) (GitHub)
bubble (nop) auto-insertion
Data conflict
Pipeline slip/stall
MIPS1/MIPS2 (ancient): two bubbles (nops) are needed, because the W stage of lw has not executed yet
All other MIPSes: back-access ($t2 is ready right after the M stage), so only one stall
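The two cases above, as code (a sketch; real hardware stalls by itself, the nops only mark where the bubbles sit):

    lw   $t2, 0($a0)        # result ready after M (back-access) or only after W (MIPS1/2)
    nop                     # bubble 1: needed in both cases
    nop                     # bubble 2: needed only on MIPS1/MIPS2, which lack back-access
    add  $t3, $t2, $t2      # the dependent instruction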
Path conflict
On the D stage the next address is calculated, but the next instruction has already been fetched, so a «delay slot» appears.
- Delay slot
- fetching and execution of the instruction located right after a conditional branch, irrespective of the condition value
  addi $ra, $ra, 4   |   addi $t2, $t2, 4
  jr   $ra           |   jr   $ra
                     |   nop
- or
  addi $ra, $ra, 4   |   jr   $ra
  jr   $ra           |   addi $t2, $t2, 4
The code at the bottom right is correct