Saturday, February 6, 2016

Playing with the CPU cache: fun and simple experiments

Typically, when we talk about CPU caches and taking advantage of them, most programmers feel it is something beyond their reach. In many cases that is indeed so.
Here I want to show 2 simple examples that give you a feeling for the large impact caches can have on performance.
Caches are special types of memory that give you faster access to the information you need by exploiting 2 properties:

  • Locality: temporal or spatial. Close in time or close in space.
  • Predictability: the information you require now is probably related to the information you required just before.
For example: if the CPU executes a machine instruction located at memory location X, there is a pretty good chance that the next instruction is located in the vicinity of X. And if your application accesses a piece of information now, there is a pretty good chance it will access it again in the near future.

Typically, accessing local memory can be made much faster than accessing remote memory.


To get a feeling for the importance of locality, I have taken the following data from Brendan Gregg's book on systems performance. All times are also scaled to a "human" scale:
1 CPU cycle: 0.3 ns => on a human scale: 1 s
L1 cache access: 0.9 ns => 3 s
L2 cache access: 2.8 ns => 9 s
L3 cache access: 12.9 ns => 43 s
Main memory access: 120 ns => 6 min
SSD disk: 50-150 us => 2-6 days
Rotational disk: 1-10 ms => 1-12 months
Internet, San Francisco to New York: 40 ms => 4 years
...
We see huge differences as the information we access becomes less and less local.

Anybody serious about system performance should read Brendan Gregg's book.

Example 1

This example initializes a 2-dimensional 4096 by 4096 array with the value 5.
We do it in 2 ways: row by row, or column by column.
In C/C++, a 2-dimensional array is laid out row by row in memory, so the relative memory location of an element is: mem_loc(arr[x][y]) = x*(#columns) + y

This means that 2 adjacent cells on the same row are located in adjacent memory locations:
e.g. the cell at array[x][y] is right next to the cell at array[x][y + 1].
Cells on adjacent rows, sharing the same column, are separated by the number of columns:
e.g. distance(array[x + 1][y], array[x][y]) = number of columns
In other words, those 2 cells are located far apart, at least if the number of columns is a bit large.

In the example below, the distance in memory between 2 cells on adjacent rows is 4096, while the distance between 2 cells in adjacent columns is 1.
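As a quick sanity check of this layout, here is a minimal sketch (the array name and indices are just for illustration); the pointer distances are computed in bytes and divided by the element size:

#include <stdio.h>
#include <stdint.h>

#define COLS 4096

int arr[4096][COLS];

int main(void)
{
  /* Distances between neighbouring cells, in units of ints. */
  long same_row = ((uintptr_t)&arr[1][3] - (uintptr_t)&arr[1][2]) / sizeof(int);
  long next_row = ((uintptr_t)&arr[2][2] - (uintptr_t)&arr[1][2]) / sizeof(int);
  printf("adjacent columns: %ld\n", same_row); /* prints 1 */
  printf("adjacent rows:    %ld\n", next_row); /* prints 4096, i.e. COLS */
  return 0;
}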



#include <stdio.h>

#define size (1 << 12)

int arr[size][size];

/* Row by row: consecutive iterations touch adjacent memory locations. */
void loop_local()
{
  int i, j;
  for (i = 0; i < size; i++) {
    for (j = 0; j < size; j++)
      arr[i][j] = 5;
  }
}

/* Column by column: consecutive iterations jump "size" ints apart. */
void loop_remote()
{
  int i, j;
  for (i = 0; i < size; i++) {
    for (j = 0; j < size; j++)
      arr[j][i] = 5;
  }
}

int main(int argc, char *argv[])
{
  if (argc > 1)
    loop_local();
  else
    loop_remote();
  return 0;
}


You will find the code and compile instructions at https://github.com/ywahl/PlayingWithCache.
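(The exact build steps are in the repository; assuming the source file is named datalocality.c, something like "$ gcc -o datalocality datalocality.c" should work.)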
Executing the code when accessing the cells row by row (loop_local):
$ time ./datalocality 1

real    0m0.324s
user    0m0.150s
sys     0m0.174s

Executing the code when accessing the cells column by column (loop_remote):
$ time ./datalocality

real    0m1.570s
user    0m0.017s
sys     0m1.551s

We see a huge difference in the results, although we execute almost exactly the same code. I will not go into a detailed analysis of the results; that is beyond the scope of this article.
You can also play with the size of the array (the "size" define) and discover jumps in time(local)/time(remote) when crossing certain values.

Example 2

The second example introduces threading. We'll see that the time needed to execute 2 apparently independent threads depends greatly on which CPU cores those threads are executed and on the location of the memory they access.

CPU cores located on the same chip die typically share part of the L2/L3 caches.
CPU cores located on different chip dies (on different sockets) typically do not share any cache.

When a CPU core writes to a specific location, it writes first to its cache; when the cache line is evicted, it is written back to main memory if modified. (Note that this can vary greatly depending on the CPU and system architecture.)
When a CPU core writes to a specific memory location and another CPU core shares that area of memory, special synchronization actions are required to keep the CPU caches and main memory coherent. There are various mechanisms to achieve that, but they are beyond the scope of this article. The important point here is that, due to this synchronization overhead, it takes more time to write to memory locations that are shared by several CPU caches.
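This is why, in multithreaded code, per-thread data is often padded or aligned to the cache line size, so that 2 threads never write to the same line. A minimal C11 sketch, assuming a 64-byte line (the struct and names are just for illustration):

#include <stdalign.h>

#define CACHE_LINE 64 /* assumed L1 cache line size */

/* Each counter occupies its own cache line, so a writer on one core
   does not invalidate the line that a writer on another core is using. */
struct padded_counter {
  alignas(CACHE_LINE) long value;
};

struct padded_counter counters[2]; /* one per thread */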

In the following example we have 2 threads: one thread writes to a specific memory location, the other one reads from another specific location. The example lets you play with the memory offset between those 2 memory locations, as well as with the CPU cores on which those threads run.



/* Shared context passed to each thread. The struct definition and headers
   below are reconstructed for readability; the full program (argument
   parsing, thread creation, timing) is in the GitHub repository. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

struct data {
  volatile int *data_ptr; /* memory location this thread reads or writes */
  int cpu;                /* CPU core to pin the thread to */
  int num_iter;           /* number of loop iterations */
};

void *readThread(void *arg)
{
  struct data *pdata = arg;
  int a = 0;
  int i;
  cpu_set_t c_set;
  volatile int *ptr = pdata->data_ptr;
  *ptr = 1;
  /* Pin this thread to the requested CPU core. */
  CPU_ZERO(&c_set);
  CPU_SET(pdata->cpu, &c_set);
  i = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &c_set);
  printf("readThread started on cpu=%d ret_affinity=%d\n", pdata->cpu, i);
  for (i = 0; i < pdata->num_iter; i++)
    a += *ptr;
  *ptr = a;
  return NULL;
}

void *writeThread(void *arg)
{
  struct data *pdata = arg;
  int a = 0;
  int i;
  cpu_set_t c_set;
  volatile int *ptr = pdata->data_ptr;
  a = 1;
  /* Pin this thread to the requested CPU core. */
  CPU_ZERO(&c_set);
  CPU_SET(pdata->cpu, &c_set);
  i = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &c_set);
  printf("writeThread started on cpu=%d ret_affinity=%d\n", pdata->cpu, i);
  for (i = 0; i < pdata->num_iter; i++)
    *ptr += a;
  return NULL;
}
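The full program is in the repository; purely as an illustration of how the pieces above could be wired together (the buffer, offsets and CPU numbers here mirror the example run below, while the real argument parsing differs), a minimal main could look like:

int buf[1024]; /* the read and write offsets index into this shared buffer */

int main(void)
{
  pthread_t rd, wr;
  struct data rd_data = { .data_ptr = &buf[0],   .cpu = 0, .num_iter = 1000000000 };
  struct data wr_data = { .data_ptr = &buf[100], .cpu = 8, .num_iter = 1000000000 };

  pthread_create(&wr, NULL, writeThread, &wr_data);
  pthread_create(&rd, NULL, readThread, &rd_data);
  pthread_join(wr, NULL);
  pthread_join(rd, NULL);
  printf("finished wr_data=%d rd_data=%d\n", buf[100], buf[0]);
  return 0;
}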

The examples were executed on a system with 2 CPU sockets (2 NUMA nodes).
Each NUMA node has 8 CPU cores, and each core has 2 hyperthreads.
The L1 cache line size is 64 bytes, i.e. 16 4-byte integers.

The application is launched as follows, for example with the write thread on CPU core 8 (NUMA node 1), the read thread on CPU core 0 (NUMA node 0), and an offset of 100*sizeof(int) between the read and write data:

$ time ./cachecoherency 1000000000 0 100 0 8 

Level 1 Data cache line size=64
rd_offset=0 wr_offset=100 num_iter=1000000000
writeThread started on cpu=8 ret_affinity=0
readThread started on cpu=0 ret_affinity=0
finished wr_data=1000000000 rd_data=1000000000


real    0m3.512s
user    0m6.930s
sys     0m0.038s

We obtain the following results:

Wr CPU core  Rd CPU core  Wr NUMA node  Rd NUMA node  Wr/Rd offset (ints)  Run time
          8            0             1             0                  100    3.512s
          8            0             1             0                    1    9.686s
          5            0             0             0                    1    8.952s
         16            0             0             0                    1    7.055s
          0            0             0             0                    1    6.821s

We see a dramatic difference between the runs where the offset between the read and the write location is smaller than the L1 cache line size (offset 1: both threads touch the same 64-byte line) and the run where it is larger (offset 100): almost a 300% difference between the 2 cases.

What is also surprising is that running the 2 threads on the same CPU core (the last run) is actually more efficient than running this code on 2 different CPU cores, both in time and in CPU resources. All this because of cache effects.

Those were 2 small examples that demonstrate the effect of caches.

Making use of this in a real-world application is, however, not an easy task at all. And depending on the programming language you use, you might not even be able to take advantage of it consciously.





Saturday, January 2, 2016

Survival of the fittest, evolution and the descriptive/prescriptive laws of nature

When you hear about Darwin and the theory of evolution for the first time, you are almost always told it is about the survival of the fittest, the law of the jungle.
I always felt uncomfortable understanding evolution in these terms.
Although historically, evolution theory radically contradicted the world view of monotheistic religions, it still somehow managed to imply the hand of god hiding somewhere, controlling everything.
Each generation will produce offspring, the fittest among them will survive, the weakest won't, and so on, and so on.
You might feel equally uncomfortable when you try to understand the laws of nature as prescriptive forces.
Does an electron move the way it moves because it has to comply with the various physical laws of motion and conservation of energy, OR does the electron simply move the way it moves, the physical laws being merely a description of its movement?
If we understand evolution in this light, then the theory transforms itself from survival of the fittest to the fittest being the one who survives.
We actually don't know who is the fittest, and the fittest, the way we understand it, is not always the one who survives.
What we can say for sure is that the one who survives will probably reproduce. The ones who do not survive will not reproduce, not because they are weaker in our definition of weak, but because they are simply not THERE anymore, whether due to chance or other reasons.

Saturday, December 12, 2015

Towards non-deterministic computing

We have always linked computers to logic and determinism.
A computer or a software application is mostly considered a black box where we expect a specific input to always produce the same exact output. When this is not the case, we're annoyed and frustrated.
Non-deterministic results can have dramatic outcomes; take the example of nuclear plants or airplanes. The funny thing is that we sometimes forget that the ones who have the final say on the behavior of nuclear plants and airplanes are humans, with totally non-deterministic and unreliable behavior. We rely on them solely based on their sense of "responsibility" and will to survive.

For decades, software and hardware have been designed to achieve exactly that: reliability and determinism, based on Boolean logic among other things.


In several fields of software and hardware design, we are discovering that in order to progress further we need to let go of determinism. Some examples are:
- Concurrent programming
- Distributed programming
- Artificial Intelligence and Machine Learning
- Quantum computing

For example, in concurrent and distributed systems, we used to obtain increased performance by throwing more CPUs or threads at a problem.
But this has limits, as stated by Amdahl's law: beyond some point, adding CPUs makes them spend more time synchronizing their tasks than doing actual work.
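To make this concrete: if a fraction p of the work can be parallelized, Amdahl's law bounds the speedup on n CPUs by speedup(n) = 1 / ((1 - p) + p/n). A small sketch of the diminishing returns (p = 0.95 is just an illustrative value):

#include <stdio.h>

int main(void)
{
  double p = 0.95; /* assumed parallelizable fraction of the work */
  int n;
  for (n = 1; n <= 1024; n *= 2)
    printf("n=%4d  speedup=%6.2f\n", n, 1.0 / ((1.0 - p) + p / n));
  /* The speedup never exceeds 1/(1-p) = 20, no matter how many CPUs we add. */
  return 0;
}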

Those who deal with project management know the issues of having several parallel tasks going on. At some point, the people involved in the different tasks need to synchronize and exchange information. The larger the number of people or parallel tasks, the more time you will spend synchronizing afterwards. I am sure you have all been through that, when you spend more time in meetings than actually doing work.

A lot of work has been done in hardware and software to relax the synchronization requirements between CPUs and memory, or between servers in distributed systems.
People working on system software and trying to achieve maximum performance know the strange phenomena you can observe on CPUs or systems with non-sequentially-consistent memory models (the majority nowadays) if you don't take specific measures to synchronize memory accesses, using so-called memory barriers or memory fences.
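To illustrate, a minimal C11 sketch of one such measure: an acquire/release pair that guarantees the consumer sees the payload once it sees the flag (the names and the value 42 are just for illustration):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int data = 0;         /* plain, non-atomic payload */
atomic_int ready = 0; /* flag that publishes the payload */

void *producer(void *arg)
{
  data = 42;                                              /* 1: write the payload */
  atomic_store_explicit(&ready, 1, memory_order_release); /* 2: publish it */
  return NULL;
}

void *consumer(void *arg)
{
  while (!atomic_load_explicit(&ready, memory_order_acquire))
    ; /* spin until the release store becomes visible */
  printf("data = %d\n", data); /* guaranteed to print 42 */
  return NULL;
}

int main(void)
{
  pthread_t p, c;
  pthread_create(&c, NULL, consumer, NULL);
  pthread_create(&p, NULL, producer, NULL);
  pthread_join(p, NULL);
  pthread_join(c, NULL);
  return 0;
}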

In distributed systems we have the same issues, but we give them different names, and sometimes different solutions, such as conflict-resolution algorithms.

Pushing this to the extreme, we get research projects like IBM's Renaissance Virtual Machine, where non-determinism is accepted as a premise and ways to deal with it are explored.

I could go on and on. I guess what I am trying to say is that we are observing a paradigm shift: we are starting to accept uncertainty/non-determinism, either out of necessity or out of maturity. We are accepting it and, mostly, learning to deal with it.
