When IO-bound Hides Inside CPU. Revealing IO bottleneck in pure… | by Gleb Sakhnov | Feb, 2022

Revealing an IO bottleneck in a pure CPU-bound application using Go

Picture by Christian Wiediger on Unsplash
// incrementManyTimes increments the value behind val the given number of times
func incrementManyTimes(val *int64, times int) {
	for i := 0; i < times; i++ {
		*val++
	}
}

// define a struct type to hold the values
type IntVars struct {
	i1 int64
	i2 int64
}

// create the actual values
vars := IntVars{i1: 0, i2: 0}
incrementManyTimes(&vars.i1, 1000)
incrementParallel(&vars.i1, &vars.i2, 1000)
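The snippet above calls incrementParallel, but its body is not shown. A minimal sketch of what it could look like, assuming it simply runs incrementManyTimes on each counter in its own goroutine and waits for both to finish (the body and the use of sync.WaitGroup are assumptions, not the author's code):

```go
package main

import (
	"fmt"
	"sync"
)

// incrementManyTimes increments the value behind val the given number of times.
func incrementManyTimes(val *int64, times int) {
	for i := 0; i < times; i++ {
		*val++
	}
}

// incrementParallel is a hypothetical reconstruction: one goroutine per
// counter, each doing the same amount of work, joined with a WaitGroup.
func incrementParallel(val1, val2 *int64, times int) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		incrementManyTimes(val1, times)
	}()
	go func() {
		defer wg.Done()
		incrementManyTimes(val2, times)
	}()
	wg.Wait()
}

// IntVars holds the two counters next to each other in memory.
type IntVars struct {
	i1 int64
	i2 int64
}

func main() {
	vars := IntVars{}
	incrementParallel(&vars.i1, &vars.i2, 1000)
	fmt.Println(vars.i1, vars.i2)
}
```

Each goroutine writes only to its own counter, so there is no data race in the Go sense. The two goroutines share no variables, yet the counters still sit side by side in memory.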
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkIncrement1Value              1.408 ns/op
BenchmarkIncrement2ValuesInParallel   2.172 ns/op
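For reference, here is one way benchmarks with these names could be written. The benchmark names come from the output above; the bodies, and the incrementParallel helper they call, are assumptions. The sketch drives them through testing.Benchmark so it runs as a plain program:

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

func incrementManyTimes(val *int64, times int) {
	for i := 0; i < times; i++ {
		*val++
	}
}

// incrementParallel is an assumed helper: one goroutine per counter.
func incrementParallel(val1, val2 *int64, times int) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); incrementManyTimes(val1, times) }()
	go func() { defer wg.Done(); incrementManyTimes(val2, times) }()
	wg.Wait()
}

type IntVars struct {
	i1 int64
	i2 int64
}

// One goroutine, one counter: b.N increments in total.
func BenchmarkIncrement1Value(b *testing.B) {
	vars := IntVars{}
	incrementManyTimes(&vars.i1, b.N)
}

// Two goroutines, two counters: b.N increments per counter.
func BenchmarkIncrement2ValuesInParallel(b *testing.B) {
	vars := IntVars{}
	incrementParallel(&vars.i1, &vars.i2, b.N)
}

func main() {
	// testing.Benchmark picks b.N itself and reports ns/op.
	fmt.Println("1 value:           ", testing.Benchmark(BenchmarkIncrement1Value))
	fmt.Println("2 values, parallel:", testing.Benchmark(BenchmarkIncrement2ValuesInParallel))
}
```

Placed in a _test.go file instead, the two Benchmark functions can be run with go test -bench=. as usual. Absolute numbers will vary with the CPU and Go version.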
Core i7 Xeon 5500 Series Data Source Latency (approximate)

L1 CACHE hit 1-2 ns
L2 CACHE hit 3-5 ns
L3 CACHE hit 12-40 ns

local DRAM ~60 ns
remote DRAM ~100 ns


Mitigation

type IntVars struct {
	i1 int64
	_  [56]byte // padding
	i2 int64
}
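To see what the padding does to the memory layout, the field offsets can be inspected with the unsafe package. This is a small sketch, assuming 64-byte cache lines; the names IntVarsPadded and i2Offsets are ours, introduced only to compare both layouts side by side:

```go
package main

import (
	"fmt"
	"unsafe"
)

// IntVars is the original layout: i2 directly follows i1.
type IntVars struct {
	i1 int64
	i2 int64
}

// IntVarsPadded is the padded layout (hypothetical name for comparison).
type IntVarsPadded struct {
	i1 int64
	_  [56]byte // 8 bytes of i1 + 56 bytes of padding = 64 bytes
	i2 int64
}

// i2Offsets returns the byte offset of i2 in the plain and padded structs.
func i2Offsets() (plain, padded uintptr) {
	var a IntVars
	var p IntVarsPadded
	return unsafe.Offsetof(a.i2), unsafe.Offsetof(p.i2)
}

func main() {
	plain, padded := i2Offsets()
	fmt.Println("plain:  i2 offset", plain, "size", unsafe.Sizeof(IntVars{}))        // 8, 16
	fmt.Println("padded: i2 offset", padded, "size", unsafe.Sizeof(IntVarsPadded{})) // 64, 72
}
```

With i2 starting exactly 64 bytes after i1, the two counters cannot share a 64-byte cache line, wherever the struct itself ends up in memory.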
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkIncrement1Value              1.367 ns/op
BenchmarkIncrement2ValuesInParallel   1.374 ns/op

Cross-architecture support

import "golang.org/x/sys/cpu"

type IntVars struct {
	i1 int64
	_  cpu.CacheLinePad // padding sized for the target architecture's cache line
	i2 int64
}

Measuring CPU cache efficiency

func main() {
	a := IntVars{}
	incrementParallel(&a.i1, &a.i2, 100000000)
}
▶ perf stat -B -e L1-dcache-load-misses ./test
8,650,268      L1-dcache-load-misses
▶ perf stat -B -e L1-dcache-load-misses ./test-padded
205,526      L1-dcache-load-misses
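To put the two counters in perspective, a quick bit of arithmetic on the reported numbers helps. The helper name missRatio is ours; the inputs are the L1-dcache-load-misses values from the two perf stat runs:

```go
package main

import "fmt"

// missRatio returns how many times more cache misses the baseline run
// had compared to the padded run.
func missRatio(baseline, padded float64) float64 {
	return baseline / padded
}

func main() {
	const unpaddedMisses = 8650268 // ./test
	const paddedMisses = 205526    // ./test-padded
	fmt.Printf("unpadded run had %.1fx more L1 data-cache load misses\n",
		missRatio(unpaddedMisses, paddedMisses))
}
```

That works out to roughly 42 times fewer L1 data-cache load misses once the two counters live on separate cache lines.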
