Saturday, February 6, 2016

Playing with cpu cache: funny and simple experiments

Typically when we are talking about cpu caches  and taking advantage of it, most programmers feel that it is something beyond their reach. In a lot of cases it is indeed so.
I want here to show 2 simple examples that get you a feeling of the large impact caches can have on performance.
Caches are special types memory that can give you faster access to the information you need by making use of 2 properties.

  • Locality: being temporal or spatial.  Close in space or close in time.
  • Predictability: The information you require now, is probably somehow related to the information you required just before.
Examples. If the cpu process a machine instruction located at memory location X, there is a pretty good chance that the next instruction will be located in the vicinity of X. If your application requires now information X, there is a pretty good chance that it will require to access that information again in the near future.

Typically accessing local memory can be made much faster than accessing remote memory 


To get a feeling of the importance of locality. I have taken the following data from Greg Brendan book  on system performance. All times are also scaled to "human" scale
1 CPU cycle: 0.3 ns => to human scale 1s
L1 cache access: 0.9ns => 3s
L2 cache access: 2.8ns => 9s
L3 cache access: 12.9ns => 43s
Main memory access: 120ns => 6 min
SSD Disk: 50-150 us => 2-6 days
Rotational Disk: 1-10ms => 1-12months
Internet: San Francisco to New york: 40ms  => 4 years
...
We see the huge differences when accessing information that is less and less local.

Anybody going serious about system performance should read Greg Brendan book

Example 1

This example initializes a 2-dimensional array 4096 by 4096 with the value 5.
In this example we do it in 2 ways: row by row, or column by column.
In "C/C++",  the relative memory location of an 2 dimensional array is as follows: mem_loc(arr[x][y]) = x*(#columns) + y

Meaning that 2 adjacent cells on the same row, will be located in adjacent memory locations.
e.g. cell at array[x][y] is located near the cell at array[x][y + 1]
While cells located on adjacent rows, sharing the same column, are separated by the number of columns
e.g. distance(array[x + 1][y], array[x][y]) = number of columns
In other words the 2 cells are located far apart, well at least if the number of columns is at bit large.

In the example here below, the distance in memory between 2 cells on 2 adjacent rows is 4096, while distance of 2 cells located in adjacent columns is 1.



 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
#include <stdio.h>

#define size (1 << 12)
int arr[size][size];



void loop_local()
{
  int i, j;
  for (i = 0; i < size; i++) {
    for (j = 0; j < size; j++)
      arr[i][j] = 5;
  }
}


void loop_remote()
{
  int i, j;
  for (i = 0; i < size; i++) {
    for (j = 0; j < size; j++)
      arr[j][i] = 5;
  }
}



int main(int argc, char *argv)
{
  if (argc > 1)
    loop_local();
  else
    loop_remote();
  return 0;
}


You will find the code and compile instructions at https://github.com/ywahl/PlayingWithCache.
Executing the code when accessing the cells column by column.
$ time ./datalocality 1

real    0m0.324s
user    0m0.150s
sys     0m0.174s

Executing the code when accessing the cells row by row.
$ time ./datalocality

real    0m1.570s
user    0m0.017s
sys     0m1.551s

We see a huge difference in the results, although executing almost exactly the same code. I will not enter in a detailed analysis of the results, this is beyond the scope of this article.
You can play also with the size of the array in line 3 and discover jumps in time(local)/time(remote) when crossing certain values.

Example 2

The second example introduces threading. We'll see that the time needed to execute 2 apparently independent threads depends greatly on which cpu cores those threads are executed and the location of the memory they access.

CPU cores located on the same chip die share typically share part of the L2/L3 caches.
CPU cores located on different chip dies (on different sockets), typically do not share cache.

When a CPU core writes to a specific location, it writes it first to cache. When the cache is flushed, it is written back to memory if modified. (Note that this can vary greatly depending on the cpu and system architecture).
When a CPU core writes to a specific memory location, and another CPU core shares that area of memory, then special synchronization actions are required to synchronize the CPU caches and main memory. There are various mechanisms to achieve that, but this is beyond the scope of this article. The important point here, is, that it takes more time due to synchronization overhead to write memory locations that are shared by the various CPU caches.

In the following example we have 2 threads: one thread that writes to a specific memory location, the other one reads from another specific location. The example lets you play with the memory offset between those 2 memory locations as well as with the CPU core on which those threads are running.



 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
void *readThread(void *arg)
{
  struct data *pdata = arg;
  int a = 0;
  int i;
  cpu_set_t c_set;
  volatile int *ptr = pdata->data_ptr;
  *ptr = 1;
  CPU_ZERO(&c_set);
  CPU_SET(pdata->cpu, &c_set);
  i = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &c_set);
  printf("readThread started on cpu=%d ret_affinity=%d\n", pdata->cpu, i);
  for(i = 0; i < pdata->num_iter; i++)
    a += *ptr;
  *ptr = a;
}


void *writeThread(void *arg)
{
  struct data *pdata = arg;
  int a = 0;
  int i;
  cpu_set_t c_set;
  volatile int *ptr = pdata->data_ptr;
  a = 1;
  CPU_ZERO(&c_set);
  CPU_SET(pdata->cpu, &c_set);
  i = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &c_set);
  printf("writeThread started on cpu=%d ret_affinity=%d\n", pdata->cpu, i);
  for(i = 0; i < pdata->num_iter; i++)
    *ptr += a;
}

The examples were executed on a system with 2 CPU sockets (2 NUMA nodes)
Each  NUMA node has 8 CPU cores. Each core has 2 HyperThreads.
The Layer 1 Cache line size is 64 bytes. Or 16 4 bytes integers.

The application is launched as follows. For example:
Write Thread on CPU core 8(Numa node 1), Read Thread on CPU core 0 (Numa node 0) and 100*sizeof(int) offset between read and write data.

$ time ./cachecoherency 1000000000 0 100 0 8 

Level 1 Data cache line size=64
rd_offset=0 wr_offset=100 num_iter=1000000000
writeThread started on cpu=8 ret_affinity=0
readThread started on cpu=0 ret_affinity=0
finished wr_data=1000000000 rd_data=1000000000


real    0m3.512s
user    0m6.930s
sys     0m0.038s

We obtain the following results

CPU core WrCPU core RdNuma Node WrNuma Node RdOffset Wr/RdRunTime
80101003.512s
801019.686s
500018.952s
1600017.055s
000016.821s

We see a dramatic difference in the results when the offset between read and write is less than the L1 cache size, versus, larger than the cache size. There is almost 300% difference the 2 cases.

What is surprising also is that running the 2 threads on the same CPU core ( the last run) results actually more efficient that running this code on 2 different CPU's, both in time and cpu resources. All this because of cache effects.

Those were 2 small examples, that demonstrates the effect of cache.

Making use of this in real world application, is however not an easy task at all. And depending on the programming language you are using, you might not be even able to take advantage of it consciously.





Remaining relevant: Abstraction layers and API's for cloud native applications

Separation of concerns, abstraction layers and API's for Cloud native environments Those 3 terms are closely related to each other. T...