Saturday, April 11, 2020

Remaining relevant: Abstraction layers and APIs for cloud native applications

Separation of concerns, abstraction layers and APIs for cloud native environments

These three terms are closely related. They are at the foundation of how software is written today: they enable software and system modularity, which allows different individuals, teams and organisations to work on common projects and systems separately and in parallel.

Those familiar with the premise of the book Sapiens: A Brief History of Humankind know that the ability of Homo sapiens to rally, in great numbers, around common abstractions and concepts is what made it so incredibly successful and differentiated it from the other primates and species of the Homo genus.

An API (Application Programming Interface) is such a common abstraction. It is a convention, a concept agreed upon by independent parties, on how to work and collaborate around common abstractions. It is the lingua franca between different software entities.
Each party in this dialogue is viewed by the other side as a simplified abstraction. The intrinsic complexity of each party is hidden from the other. This is what is meant by an abstraction layer.

A good and well-known example in the Unix world is "everything is a file". Applications pretend they are talking to files. No matter what complex hardware or concept is hidden behind them, they are all accessed with the simple and well-known open, read, write and close API functions.

Evolution of the underlying architectures

The more the underlying architectures and concepts evolve underneath those APIs/abstraction layers, the more inefficient those layers become and the more difficult it is to make full use of the newer capabilities.

It is as if you tried to describe the modern world with the same Latin used by the Romans 2000 years ago. You wouldn't be able to describe and express most of today's objects and concepts in an efficient and clear manner.

Nowadays, it is very difficult to write high performance networked applications with the simple read, write and socket abstractions. To do it, you need to be creative, find loopholes or bypass the abstraction altogether.

To fully exploit new features and new concepts, while keeping the benefits of separation of concerns, the APIs and the abstracted concepts need to evolve too.

Modern cloud native applications

Modern cloud native applications are often scalable and distributed systems based on a microservice architecture.

The technology stack typically used to implement micro-services is something like this:

  • An application or micro-service running in a container 
  • The container is abstracting/containerizing OS services
  • The container runs on top of a virtual machine
  • The virtual machine is running on top of a physical machine
  • Those containers are also part of Kubernetes clusters
  • The Kubernetes cluster abstracts other OS resources
  • These clusters create overlay networks which run on top of real networks.



Let's say a microservice wants to share some data stored in its container with one of its peer services:

  • The application/microservice wants to read data from storage
  • The container pretends to offer storage access to the application
  • In turn, it calls the container storage layer
  • The container storage layer, after doing some work of its own, calls the virtual machine storage layer
  • In turn, that calls the physical machine/server storage layer, which ultimately accesses the physical storage device
  • Once the application has the stored data in its memory, it tries to connect to its peer service
  • This connection again goes through various layers of APIs and indirections to reach its destination

There is a large overhead due to the various abstraction layers we have to go through to get the task done: lots of CPU/$ and performance lost because of it.

One of the reasons for it is that the APIs and abstraction layers we are using have not evolved.

There is no real reason anymore to use a virtual machine that abstracts a real physical machine. The concept of a server as we used to think of it is actually no longer necessary.

We are trying to use yesterday's data-center architectures, based on servers, operating systems and virtual machines, to support today's workloads and applications.

As mentioned above, it is as if you used old Latin to describe today's world. It would be incredibly inefficient.

Note, I am not pretending that today's microservice-based applications are better than the old monolithic, server-based ones. It is just that developing new applications using old concepts holds you back, is complicated and is expensive.

It is time to rethink the data center bottom up and rejuvenate the whole stack of abstraction layers and APIs we are using.






Wednesday, December 20, 2017

Using Kafka Streams for network analysis part 1

Introduction


In the following series of articles, I want to explore the use of streaming engines for network analysis. To that end, I decided to use Kafka Streams for its simplicity: it doesn't require the setup of clusters or complex configurations. It is a matter of including the Kafka Streams jar dependency in your application and you are ready to go. When the load increases, just spawn another instance of your application.

Where this small toy project will lead me is unknown.

Note that there are already some heavyweight projects, such as Metron, which use streaming engines to enable similar analyses. Other projects of the same kind, not using streaming platforms, are Bro, YAF or Moloch.

Traditionally, protocol state handling and packet reassembly are performed by monolithic blocks of code, usually written in C/C++, sometimes in Go. We find this, for example, in operating systems, or in applications such as Wireshark, Tcpdump or PacketBeat.

After the initial packet capture and parsing is done, a packet summary is forwarded to some framework for indexing, analysis, etc.

One of the drawbacks of sending only packet summaries is that low-level analysis is no longer possible further down the line.

Below is a diagram of the setup I am going to experiment with.


Setup

The Network probe

The network probe implemented here is very simple. It is written in C/C++ and uses the standard pcap library for packet capture and librdkafka for transmitting records to Kafka for further processing.

The probe itself is single threaded. librdkafka, however, uses multiple threads internally.

We use the standard network 5-tuple (source and destination IP address, protocol, source and destination port) as the Kafka key for the configured topic. By using the network 5-tuple as the key, we guarantee that packets belonging to the same stream are handled by the same Kafka consumer down the line.
We also need to arrange the 5-tuple in such a way that both directions of the same stream produce the same key.
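As an illustration, normalizing the two directions of a flow to a single key could look like the following (a Java sketch mirroring what the probe does in C; the Net5Tuple accessors and constructor shown here are assumptions, not the actual class):

// Order the two (IP, port) endpoints so that A->B and B->A produce the same key.
// IPv4 addresses are compared as unsigned values.
static Net5Tuple canonicalize(Net5Tuple t) {
    long src = Integer.toUnsignedLong(t.getSrcIp());
    long dst = Integer.toUnsignedLong(t.getDstIp());
    boolean swap = src > dst
            || (src == dst && t.getSrcPort() > t.getDstPort());
    if (!swap) {
        return t;
    }
    // Swap source and destination so both directions share one key.
    return new Net5Tuple(t.getDstIp(), t.getSrcIp(), t.getProtocol(),
                         t.getDstPort(), t.getSrcPort());
}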

In this toy network probe, only basic Ethernet II packet parsing was implemented; 802.1Q VLANs and other encapsulations were not handled.

Modern Ethernet cards offload the CPU by automatically reassembling TCP segments. Artificially large packets are then received by the network driver and all the layers above it.

In order to capture the TCP/UDP packets as they are received by the network card, you need to disable the network card's Generic Receive Offload (GRO), if present (sometimes also LRO, ...).
On Linux you can do that with:

sudo ethtool -K ethX gro off


For high performance software-based probes, frameworks such as DPDK, PF_RING or NETMAP are recommended.

Streaming part

In this article, I decided to use Kafka Streams, but any other streaming platform would have been equally valid.

The first step is to ingest the Kafka records produced by the network probe discussed above.

The binary-encoded Kafka key/value records produced by the network probes are deserialized into Net5Tuple POJOs for the key and a byte array for the value.


 final KStream<Net5Tuple, byte []> packetStream = builder.stream(net5TupleSerde, byteArraySerde, "packets");

We configure the default Serializers/Deserializers as follows:

 streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, net5TupleSerde.getClass().getName());
 streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass().getName());

The Net5Tuple class uses NIO ByteBuffers to map the binary fields into their Java counterparts. NIO uses big-endian (network order) byte ordering by default, which is the ordering used to encode IP addresses and ports.
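As an illustration of that mapping, deserializing the key might look roughly like this (the exact field layout is an assumption, not the probe's actual wire format):

import java.nio.ByteBuffer;

// Assumed layout: srcIp(4) | dstIp(4) | srcPort(2) | dstPort(2) | protocol(1).
// ByteBuffer reads big endian (network order) by default, so no byte swapping is needed.
static Net5Tuple fromBytes(byte[] raw) {
    ByteBuffer buf = ByteBuffer.wrap(raw);
    int srcIp = buf.getInt();
    int dstIp = buf.getInt();
    int srcPort = Short.toUnsignedInt(buf.getShort());
    int dstPort = Short.toUnsignedInt(buf.getShort());
    int protocol = Byte.toUnsignedInt(buf.get());
    return new Net5Tuple(srcIp, dstIp, protocol, srcPort, dstPort);
}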

From that point on, we can start using Streaming primitives to perform packet reassembly and decoding, collect statistics and perform analyses.

Examples

I will show here two small examples of what can be done. More complex examples, such as TCP packet reassembly and HTTP decoding, will be handled in future articles.

Packet statistics

Here we print per-stream packet statistics to the console. Note: this is only meaningful for TCP and UDP.


 packetStream.groupByKey(net5TupleSerde, byteArraySerde)
  .count()
  .toStream()
  .map((k,v) -> KeyValue.pair(k.toString(), v))
  .print();

We could also use a KTable here and expose the statistics externally via a REST API.
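For example, the count can be materialized in a named state store and read through Kafka Streams' interactive queries from whatever REST layer you expose (a sketch against the same Kafka Streams API version used above; the store name is arbitrary and "streams" is assumed to be the running KafkaStreams instance):

// Materialize the per-stream packet counts in a queryable state store.
KTable<Net5Tuple, Long> packetCounts =
        packetStream.groupByKey(net5TupleSerde, byteArraySerde)
                    .count("packet-counts");

// Elsewhere, e.g. in a REST handler, read the store via interactive queries.
ReadOnlyKeyValueStore<Net5Tuple, Long> store =
        streams.store("packet-counts", QueryableStoreTypes.keyValueStore());
Long packetsSeen = store.get(someNet5Tuple);  // someNet5Tuple: hypothetical lookup key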

SubStreams according to protocol


  KStream<Net5Tuple, byte []> [] perProtoStream = packetStream.branch(
                (k,v) -> k.getProtocol() == TCP,
                (k,v) -> k.getProtocol() == UDP,
                (k,v) -> true
        );



References:

  • Kafka: a distributed streaming platform
  • Kafka Streams: streaming engine on top of Kafka
  • libpcap: packet capture library
  • Code for the network probe is available upon request.
  • DPDK: user-mode data plane libraries



Wednesday, December 6, 2017

Evaluating Spring Reactive & WebFlux framework as base for streaming Microservices

Introduction


There are already many excellent introductory articles on reactive programming, about the reactor project and its integration within the Spring Framework.

This post is about comparing the "standard" Spring framework to the reactive framework as a base for streaming microservices.

There is an excellent example project on GitHub from Mark Paluch illustrating the difference between the two approaches.

For the moment there is a lack of performance benchmarks, and it is difficult to know whether the reactive frameworks, with all their overhead, outperform the standard way of doing things.

If you really think it through, it actually comes down to letting the application handle the scheduling of the CPU resources (asynchronous/reactive) versus letting the OS handle them.


Setup

I wanted to perform a quick and basic performance/load test with a basic streaming application building block that does some data enrichment.


Reactive Kafka Consumer⇒Reactive REST towards external service for data enrichment⇒Reactive Kafka Producer

vs

Kafka Consumer⇒Blocking REST⇒Kafka Producer

The reactive way

The reactive setup will keep pulling Kafka records and buffering up REST requests while waiting for their answers. Nothing blocks. Once a response is received from the data enrichment service and the original data is enriched, the data is passed to the Kafka producer.
Since nothing in this pipeline is supposed to block, the only thing limiting the processing is the amount we are willing to buffer; in this particular case, outstanding REST requests with their associated network sockets.

The standard way

The Kafka consumer polls the Kafka broker for records. The records are then farmed out to different worker threads: each record is submitted to an ExecutorService with a pool of worker threads doing the actual work. Each worker thread, once scheduled, issues a REST request and stalls until it receives a response. It then sends the final result to Kafka, accepts the next record, and so on.

Since the REST request is blocking, a thread is blocked and cannot do any further processing until a response is received. That means that every outstanding REST request consumes a thread.
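Before looking at the reactive code, here is a minimal sketch of that blocking variant (topic names, pool size and the use of Spring's RestTemplate are illustrative assumptions, not the exact benchmarked code):

ExecutorService pool = Executors.newFixedThreadPool(200);
KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
RestTemplate restTemplate = new RestTemplate();

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
    consumer.subscribe(Collections.singleton("input-topic"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            pool.submit(() -> {
                // Blocking REST call: this worker thread stalls until the
                // enrichment service answers.
                String enriched = restTemplate.getForObject(
                        "http://enrichmentservice/", String.class) + " " + record.key();
                producer.send(new ProducerRecord<>("output-topic", enriched));
            });
        }
    }
}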

The code


        ReceiverOptions<String, String> receiverOptions =
                ReceiverOptions.<String, String>create(consumerProps)
                        .subscription(Collections.singleton("input-topic"));

        Flux<ReceiverRecord<String, String>> inboundFlux =
                KafkaReceiver.create(receiverOptions)
                        .receive();
Creating the Reactive Kafka consumer


        WebClient webClient = WebClient.builder()
                .baseUrl("http://enrichmentservice/")
                .build();
Creating the Reactive web client



        SenderOptions<String, String> senderOptions =
                SenderOptions.<String, String>create(producerProps);

        KafkaSender<String, String> kafkaSender = KafkaSender.create(senderOptions);
Creating the Reactive Kafka producer



The heart of the reactive chain


    public static Mono<String> enrichData(WebClient client, ReceiverRecord<String, String > r) {
        WebClient.ResponseSpec responseSpec = client.get().retrieve();
        Mono<String> restResp = responseSpec.bodyToMono(String.class);
        return restResp.map(m -> m + " " + r.key());
    }

The data enrichment method


        kafkaSender.createOutbound()
              .send(inboundFlux.flatMap(
                        r -> {
                            return enrichData(webClient, r);
                        })
                        .map(r -> new ProducerRecord<String, String>("output-topic", r)))
               .then()
               .subscribe();
The reactive chain



One of the issues encountered in this part was how to map the result of the web client, which is a single REST response, back to a stream of records; in Reactor terminology, from a Mono<String> back to a Flux<String>. This is done here with the flatMap operator, which accepts a Function that returns a Publisher. Publisher is the base interface of both Mono and Flux.

The results

Note that the actual results depend on the machines and Kafka brokers you are running on, but the conclusion should be fairly independent of your setup.
  • At low loads, the standard solution roughly compares to the reactive solution. Sometimes the standard solution slightly outperforms the reactive one from a system resource point of view (%CPU, %memory).
  • At higher loads, the standard solution starts using a huge number of threads to cope with the stalling REST requests, which consumes system resources, with CPUs spending time on context switching.
On the other hand, the reactive solution doesn't need to create additional threads to cope with the load.
In my low-performance setup, with a load of 4000 Kafka records/s and a 100 ms delay in the enrichment REST service, I measured:
  • Standard: 520 threads, 520 sockets, 32% CPU usage, 8.3% memory
  • Reactive: 37 threads, 502 sockets, 16% CPU usage, 4.3% memory

With my setup, I could not really get higher than 4000 records/s, but I think we can safely extrapolate the results to higher loads.

Conclusion and final notes

We see that in this particular setup the Spring reactive framework does a good job, especially when the standard synchronous framework starts using a huge amount of resources to cope with latencies and delays.

I don't pretend reactive programming is the answer to all issues. The Spring reactive framework needs much more benchmarking and performance characterization.

There are other reactive frameworks and other asynchronous/event-driven frameworks that were not considered here.
On a final note, we can see that both solutions create a large number of HTTP connections to cope with the delays. This is due to the nature of the HTTP protocol: you cannot issue a new request on the same connection before receiving the response. This raises the question whether HTTP is a good candidate for high-performance REST-based services.

Most probably HTTP/2 would perform better here.

NB:
I had to increase "reactor.bufferSize.small" to raise the number of pending reactive REST requests.
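For reference, this is a JVM system property that Reactor reads when its queue classes are first loaded, so it has to be set very early, or passed as a -D flag on the command line. A minimal illustration, with an arbitrary value:

public static void main(String[] args) {
    // Must run before any Reactor class is initialized; the default is 256 in recent Reactor 3 versions.
    System.setProperty("reactor.bufferSize.small", "1024");
    // ... bootstrap the reactive pipeline ...
}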

Monday, December 4, 2017

The world is not ours anymore - Minecon Earth 2017

Last month, I went with my 9-year-old son to a local Minecraft event where the Minecon conference was screened, followed by a panel of young Minecraft YouTubers.

At the beginning I was not so eager to go to another of those gaming events for kids, full of chaos and screaming, but this time it was different.

It was an epiphany: I suddenly realized the fundamental gap there is between the digital natives and the representatives of my Generation X, and to some extent the Millennials (Generation Y).

Their heroes are young YouTubers between 14 and 16 years old. They are the stars of this generation. To some extent, it is much more democratic and easier to reach stardom nowadays than when we were young. You make some cool videos, publish and promote them on YouTube, and then hope for "Likes" and "Subscriptions", the measures of your success.

At the end of the screening, there were presentations by the members of the panel. One of them was about setting up a Minecraft server: hiring developers, promoting and marketing the server, and strategies on how to retain the online players as long as possible. Full of good common sense, without the professional bullshit. All this from a 16-year-old kid. It was amazing.
























Thursday, March 30, 2017

Stream processing of time series and the uncertainty principle of Heisenberg

Recently at work, we worked on processing time-based event series.

The collected events could enter the system out of order, and we needed to sort and process them based on their timestamps.
This was done using time windows. Time windows are buckets of events: events arriving in the system are placed in the corresponding time bucket. If an event arrives later than a predefined maximum out-of-order delay, it is discarded.
Once the system decides a time window is ready for processing, it releases the events and they are processed (a minimal sketch of this bucketing logic follows the example below).
E.g. consider the following:
  • Time window 1: 0sec-10sec
  • Time window 2: 10sec-20sec
  • Time window 3: 20sec-30sec

And consider a maximum out-of-order delay of 2 seconds.
  • Event 1 arrives at t=1 sec and is placed in time window 1
  • Event 2 arrives at t=5 sec and is placed in time window 1
  • Event 3 arrives at t=11 sec and is placed in time window 2
  • Event 4, with timestamp t=9 sec, arrives after event 3 and is still placed in time window 1 (it is no more than 2 seconds out of order)
  • Event 5 arrives at t=15 sec and is placed in time window 2; since 15 sec minus the end of time window 1 is greater than the out-of-order delay, window 1, with events 1, 2 and 4, is released for processing
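Here is the minimal sketch of this bucketing logic mentioned above (my own Java illustration, not the production code; a real implementation would also deal with persistence, timers and watermarks):

import java.util.*;

// Buckets events by their timestamp and releases a window once the current
// time has passed the window end plus the allowed out-of-order delay.
class TimeWindowBuffer {
    static class Event {
        final long timestamp;                 // event time in ms
        Event(long timestamp) { this.timestamp = timestamp; }
    }

    private final long windowSizeMs;
    private final long maxOutOfOrderMs;
    private final SortedMap<Long, List<Event>> windows = new TreeMap<>(); // window start -> events

    TimeWindowBuffer(long windowSizeMs, long maxOutOfOrderMs) {
        this.windowSizeMs = windowSizeMs;
        this.maxOutOfOrderMs = maxOutOfOrderMs;
    }

    // 'now' is the arrival time; returns the windows that became ready for processing.
    List<List<Event>> onEvent(Event event, long now) {
        long windowStart = (event.timestamp / windowSizeMs) * windowSizeMs;
        long windowEnd = windowStart + windowSizeMs;
        if (now > windowEnd + maxOutOfOrderMs) {
            // Too late: the event's window has already been released, so discard it.
        } else {
            windows.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(event);
        }
        return releaseExpired(now);
    }

    private List<List<Event>> releaseExpired(long now) {
        List<List<Event>> ready = new ArrayList<>();
        Iterator<Map.Entry<Long, List<Event>>> it = windows.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, List<Event>> entry = it.next();
            if (now > entry.getKey() + windowSizeMs + maxOutOfOrderMs) {
                ready.add(entry.getValue());  // window end plus allowed lateness has passed
                it.remove();
            } else {
                break;                        // windows are sorted by start time
            }
        }
        return ready;
    }
}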


The question arises: what is the optimal size of the window?
If you have small time windows, you fill them fast and release them fast for further processing, but your ability to sort the events is limited.
On the other hand, using large time windows gives you the ability to arrange a large number of events in order, but the delay before processing the ordered events increases proportionally.
If you want to order an infinite stream of time-based events perfectly, you will have to wait an eternity before processing them.

There is a duality here, an inherent impossibility to reconcile both aspects: reducing latency to a minimum (releasing events as fast as possible for processing: small time windows) and having an ordered set of events (large time windows). The same kind of duality exists in computer science, to a certain extent, with latency versus performance.

This is exactly what the uncertainty principle in physics is about.

The uncertainty principle of Heisenberg is an important concept in the field of quantum mechanics.
It was formulated by Werner Heisenberg in the 1920s, when the basics of quantum physics were being conceived.

It basically states that there is a limit to the precision with which we can measure the velocity and position of an object at the same time. Either we measure the speed with extreme precision but then compromise on the exactitude of the object's position, or the other way around.
For objects at human scale the lack of precision is negligible; however, for atomic particles the limit becomes really apparent. This is a fundamental limitation. It does not result from our lack of technological ingenuity; it is a limit on the way we represent physical reality.

This fundamental uncertainty is what Einstein's famous quote, "God does not play dice", is about.
There is another, equivalent formulation of the uncertainty principle which uses energy and time, or frequency and time, instead of velocity and position:
it is impossible to measure the exact frequency of a phenomenon quickly. If you want zero uncertainty on the frequency, you have to measure it for an infinite time.

People working on signal processing always have to compromise between the precision of the frequency measurements and the length of the time windows.

Let's say you want to investigate the migration pattern of birds; you want to know the frequency of their migrations.
You go outside once in October and you see birds flying southwards. Can you deduce anything regarding the frequency of their migration? No, you cannot.
Let's say you go outside a couple of times during the year and observe that once during this time frame they were flying southwards and once northwards. Can you extrapolate with certainty that this pattern applies every year and at all times? No, you cannot.

To be sure you will have to observe for eternity.

What I wanted to illustrate here is that sometimes seemingly unrelated ideas are somehow connected,
and that the way we perceive and interpret reality sometimes limits us in seeing the underlying concepts.






Tuesday, March 7, 2017

Not politically correct

Fed up with the supreme value of "not being politically correct".
"Not being politically correct" is nowadays considered an act of rebellion.
Politically incorrect people are considered freedom fighters.
Those who are not rude or obnoxious, and who don't think that everyone who thinks differently is wrong, are considered morons, liars and merely part of a flock of sheep.
Those who are not racist, egocentric bigots are supposedly just lying to everyone, including themselves.
"Deep inside, everyone is a rotten bastard" is what they tell us.
Well, I don't think so.
Not everyone who doesn't speak out is a rotten bastard.
Not being an all-loving altruist and being "politically correct" are mutually compatible.

Saturday, January 28, 2017

Nihil sub sole novum

And here we go again for another round
I hope I am wrong
Fear and obscurantism are back
Like waves, with ups and downs. You can try to break them, to stop them, but they'll keep on coming.
History has its own dynamics; like the Greek tragedy, the Roman fatum, it cannot be stopped.
Unfortunately it is stronger than us



Sunday, July 31, 2016

Rediscovering the pleasure of single threading

Besides some very basic applications, most business-class applications today are multi-threaded.
They are multi-threaded either:

  • to exploit the multiprocessor nature of current systems
  • because the programming language they are written in, implicitly uses multiple threads
  • to avoid complex asynchronous or event based programming paradigms

or a combination of the 3 above.
The best-performing applications today are multi-threaded, but each thread is seen as completely independent.

They share only a bare minimum of state information, with no synchronization between them. The application is actually a set of single-threaded applications which can scale almost linearly, since they share almost no information and avoid expensive context switching when multiple threads run on a single CPU.

I recently had the pleasure of implementing a small single-threaded, message-driven framework.
What a pleasure!!!! A model I had used years ago as a microkernel on embedded systems for networking.

What a pleasure not to have to think about:

  • Synchronizing threads
  • Protect access to shared data
  • All kinds of locks, deadlocks
  • Priority inversions
Every time you use a data structure in a multithreaded application, you have to worry about all kinds of potential issues you might encounter.

In this case, no; I could permit myself extravagances...
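Just to illustrate the model (a made-up Java sketch, not the actual framework): one thread owns all the state and drains a mailbox of messages; other threads may only post messages to it.

import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// One thread owns all state; other threads may only enqueue messages.
final class MessageLoop {
    private final BlockingQueue<Object> mailbox = new LinkedBlockingQueue<>();
    private final Map<Class<?>, Consumer<Object>> handlers;

    MessageLoop(Map<Class<?>, Consumer<Object>> handlers) {
        this.handlers = handlers;
    }

    void post(Object message) {               // callable from any thread
        mailbox.add(message);
    }

    void run() throws InterruptedException {  // the single processing thread
        while (!Thread.currentThread().isInterrupted()) {
            Object message = mailbox.take();
            Consumer<Object> handler = handlers.get(message.getClass());
            if (handler != null) {
                handler.accept(message);      // no locks needed: a single writer owns all state
            }
        }
    }
}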




Saturday, February 6, 2016

Playing with cpu cache: funny and simple experiments

Typically, when we talk about CPU caches and taking advantage of them, most programmers feel it is something beyond their reach. In a lot of cases it is indeed so.
I want to show here 2 simple examples that give you a feeling of the large impact caches can have on performance.
Caches are special types of memory that can give you faster access to the information you need by making use of 2 properties:

  • Locality: temporal or spatial; close in space or close in time.
  • Predictability: the information you require now is probably somehow related to the information you required just before.
Examples: if the CPU processes a machine instruction located at memory location X, there is a pretty good chance that the next instruction will be located in the vicinity of X. If your application needs information X now, there is a pretty good chance that it will need to access that information again in the near future.

Typically, accessing local memory can be made much faster than accessing remote memory.


To get a feeling for the importance of locality, I have taken the following data from Brendan Gregg's book on systems performance. All times are also scaled to a "human" scale:
1 CPU cycle: 0.3 ns => at human scale, 1 s
L1 cache access: 0.9 ns => 3 s
L2 cache access: 2.8 ns => 9 s
L3 cache access: 12.9 ns => 43 s
Main memory access: 120 ns => 6 min
SSD disk: 50-150 us => 2-6 days
Rotational disk: 1-10 ms => 1-12 months
Internet, San Francisco to New York: 40 ms => 4 years
...
We see the huge differences when accessing information that is less and less local.

Anybody serious about system performance should read Brendan Gregg's book.

Example 1

This example initializes a 2-dimensional 4096 by 4096 array with the value 5.
We do it in 2 ways: row by row, or column by column.
In C/C++, the relative memory location of a cell in a 2-dimensional array is as follows: mem_loc(arr[x][y]) = x*(#columns) + y

This means that 2 adjacent cells on the same row are located in adjacent memory locations,
e.g. the cell at array[x][y] is located next to the cell at array[x][y + 1],
while cells located on adjacent rows, sharing the same column, are separated by the number of columns,
e.g. distance(array[x + 1][y], array[x][y]) = number of columns.
In other words, those 2 cells are located far apart, at least if the number of columns is a bit large.

In the example below, the distance in memory between 2 cells on 2 adjacent rows is 4096 elements, while the distance between 2 cells in adjacent columns is 1 element.



#include <stdio.h>

#define size (1 << 12)
int arr[size][size];



void loop_local()
{
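  /* Row-wise fill: the inner loop varies the column index j, so consecutive
     writes hit adjacent memory locations and stay within the same cache lines. */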
  int i, j;
  for (i = 0; i < size; i++) {
    for (j = 0; j < size; j++)
      arr[i][j] = 5;
  }
}


void loop_remote()
{
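  /* Column-wise fill: the inner loop varies the row index j, so consecutive
     writes are 'size' integers apart and almost every write touches a different cache line. */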
  int i, j;
  for (i = 0; i < size; i++) {
    for (j = 0; j < size; j++)
      arr[j][i] = 5;
  }
}



int main(int argc, char *argv[])
{
  if (argc > 1)
    loop_local();
  else
    loop_remote();
  return 0;
}


You will find the code and compile instructions at https://github.com/ywahl/PlayingWithCache.
Executing the code when the inner loop walks along a row, across adjacent columns (loop_local):
$ time ./datalocality 1

real    0m0.324s
user    0m0.150s
sys     0m0.174s

Executing the code when the inner loop walks down a column, jumping between rows (loop_remote):
$ time ./datalocality

real    0m1.570s
user    0m0.017s
sys     0m1.551s

We see a huge difference in the results, although almost exactly the same code is executed. I will not go into a detailed analysis of the results; that is beyond the scope of this article.
You can also play with the size of the array (the size define near the top of the listing) and discover jumps in time(local)/time(remote) when crossing certain values.

Example 2

The second example introduces threading. We will see that the time needed to execute 2 apparently independent threads depends greatly on which CPU cores those threads run and on the location of the memory they access.

CPU cores located on the same chip die typically share part of the L2/L3 caches.
CPU cores located on different chip dies (in different sockets) typically do not share caches.

When a CPU core writes to a specific location, it writes first to its cache. When the cache line is flushed, it is written back to main memory if modified. (Note that this can vary greatly depending on the CPU and system architecture.)
When a CPU core writes to a specific memory location, and another CPU core shares that area of memory, special synchronization actions are required to keep the CPU caches and main memory coherent. There are various mechanisms to achieve that, but they are beyond the scope of this article. The important point here is that, due to the synchronization overhead, it takes more time to write to memory locations that are shared by the various CPU caches.

In the following example we have 2 threads: one thread writes to a specific memory location, the other reads from another specific location. The example lets you play with the memory offset between those 2 memory locations, as well as with the CPU cores on which those threads run.



void *readThread(void *arg)
{
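  /* Reader thread: pins itself to the requested CPU core, then repeatedly reads the shared location. */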
  struct data *pdata = arg;
  int a = 0;
  int i;
  cpu_set_t c_set;
  volatile int *ptr = pdata->data_ptr;
  *ptr = 1;
  CPU_ZERO(&c_set);
  CPU_SET(pdata->cpu, &c_set);
  i = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &c_set);
  printf("readThread started on cpu=%d ret_affinity=%d\n", pdata->cpu, i);
  for(i = 0; i < pdata->num_iter; i++)
    a += *ptr;
  *ptr = a;
  return NULL;
}


void *writeThread(void *arg)
{
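  /* Writer thread: pins itself to the requested CPU core, then repeatedly increments the shared location. */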
  struct data *pdata = arg;
  int a = 0;
  int i;
  cpu_set_t c_set;
  volatile int *ptr = pdata->data_ptr;
  a = 1;
  CPU_ZERO(&c_set);
  CPU_SET(pdata->cpu, &c_set);
  i = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &c_set);
  printf("writeThread started on cpu=%d ret_affinity=%d\n", pdata->cpu, i);
  for(i = 0; i < pdata->num_iter; i++)
    *ptr += a;
  return NULL;
}

The examples were executed on a system with 2 CPU sockets (2 NUMA nodes).
Each NUMA node has 8 CPU cores, and each core has 2 hyperthreads.
The L1 data cache line size is 64 bytes, or sixteen 4-byte integers.

The application is launched as follows. For example: write thread on CPU core 8 (NUMA node 1), read thread on CPU core 0 (NUMA node 0), and an offset of 100*sizeof(int) between the read and write data.

$ time ./cachecoherency 1000000000 0 100 0 8 

Level 1 Data cache line size=64
rd_offset=0 wr_offset=100 num_iter=1000000000
writeThread started on cpu=8 ret_affinity=0
readThread started on cpu=0 ret_affinity=0
finished wr_data=1000000000 rd_data=1000000000


real    0m3.512s
user    0m6.930s
sys     0m0.038s

We obtain the following results:

CPU core Wr   CPU core Rd   NUMA node Wr   NUMA node Rd   Offset Wr/Rd   Run time
     8             0              1              0            100         3.512s
     8             0              1              0              1         9.686s
     5             0              0              0              1         8.952s
    16             0              0              0              1         7.055s
     0             0              0              0              1         6.821s

We see a dramatic difference in the results when the offset between the read and write locations is smaller than the L1 cache line size versus larger than the cache line size: there is almost a 300% difference between the 2 cases.

What is also surprising is that running the 2 threads on the same CPU core (the last run) is actually more efficient than running this code on 2 different CPUs, both in time and in CPU resources. All this because of cache effects.

Those were 2 small examples that demonstrate the effect of caches.

Making use of this in real-world applications is, however, not an easy task at all. And depending on the programming language you are using, you might not even be able to take advantage of it consciously.





Saturday, January 2, 2016

Survival of the fittest, evolution and the descriptive/prescriptive laws of nature

When you hear about Darwin and the theory of evolution for the first time, you are almost always told it is about the survival of the fittest, the law of the jungle.
I always felt uncomfortable understanding evolution in these terms.
Although historically evolution theory radically contradicted the world view of the monotheistic religions, it still somehow managed to imply the hand of a god hiding somewhere, controlling everything.
Each generation produces offspring, the fittest among them survive, the weakest don't, and so on, and so on.
You might feel equally uncomfortable when you try to understand the laws of nature as prescriptive forces.
Do we understand that an electron moves the way it moves because it has to comply with the various physical laws of motion and conservation of energy, OR does the electron move the way it moves, and the physical law is merely a description of its movement?
If we understand evolution in this light, then the theory transforms itself from the survival of the fittest into the fittest being the one who survives.
We actually don't know who is the fittest, and the fittest, the way we understand it, is not always the one who survives.
What we can say for sure is that the one who survives will probably reproduce. The ones who do not survive will not reproduce, not because they are weaker, by our definition of weak, but because they are simply not THERE anymore, whether due to hazard or other reasons.

Saturday, December 12, 2015

Towards non deterministic computing

We have always linked computers to logic and determinism.
A computer or a software application is mostly considered a black box where we expect a specific input to always result in the same exact output. When this is not the case, we're annoyed and frustrated.
Having non-deterministic results can have dramatic outcomes; take the example of nuclear plants or airplanes. It is a funny thing, because sometimes we forget that the ones who have the final say on the behavior of nuclear plants and airplanes are humans, with totally non-deterministic and unreliable behavior. We rely on them solely based on their sense of "responsibility" and their will to survive.

For decades, software and hardware have been designed to achieve exactly that: reliability and determinism, based on Boolean logic and the like.


In several fields of software and hardware design, we are discovering that in order to progress further we need to let go of determinism. Some examples are:
-Concurrent programming
-Distributed programming
-Artificial Intelligence and Machine Learning
-Quantum computing

For example, in concurrent and distributed systems, we used to obtain increased performance by throwing more CPUs or threads at a problem.
But this has limits, as stated by Amdahl's law: at some point, as you increase the number of CPUs, they spend more time synchronizing their tasks than doing actual work.
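As a rough illustration: Amdahl's law says that if a fraction p of the work can be parallelized over N CPUs, the speedup is bounded by 1 / ((1 - p) + p / N). Even with 95% of the work parallelizable, 32 CPUs give at most 1 / (0.05 + 0.95/32), roughly 12.5x, and no number of CPUs can ever push it past 20x.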

Those who deal with project management know the issues of having several parallel tasks going on. At some point, the people involved in the different tasks need to synchronize and exchange information. The larger the number of people or parallel tasks, the more time you spend synchronizing afterwards. I am sure you have all been through that, when you spend more time in meetings than actually doing work.

A lot of work has been done in hardware and software to relax the synchronization requirements between CPUs and memory, or, in distributed systems, between servers.
People working on system software and trying to achieve maximum performance know the strange phenomena you can observe if you don't take specific measures to synchronize memory accesses on CPUs or systems with non-sequentially-consistent memory models (which are the majority nowadays), using so-called memory barriers or memory fences.

In distributed systems we have the same issues, but we give them different names, and sometimes different solutions, with conflict resolution algorithms.

Pushing this to the extreme, we get research projects like the Renaissance Virtual Machine from IBM where non-determinism is accepted as a premise and they are trying to find ways to deal with it.

I could go on and on. I guess what I am trying to say here is that we are observing a paradigm shift: we are starting to accept uncertainty/non-determinism, either out of necessity or due to maturity. We are accepting it and, mostly, learning to deal with it.
