The architecture benchmark consists of two applications that are built with the architecture. The benchmark illustrates the performance capabilities of the architecture in the context of high-frequency trading. It focuses on the non-blocking I/O event handling mechanisms poll() and epoll().
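As a rough illustration of the two mechanisms being compared, the sketch below waits for a socket to become readable once with poll() and once with epoll(). It uses Python's standard select module on Linux rather than the architecture's own event handling code, and the helper names and the socket pair standing in for a feed connection are assumptions made for this example.

```python
import select
import socket


def wait_readable_poll(sock, timeout_ms):
    """Return True if sock becomes readable within timeout_ms, using poll()."""
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    events = poller.poll(timeout_ms)  # poll() takes milliseconds
    return any(fd == sock.fileno() and mask & select.POLLIN for fd, mask in events)


def wait_readable_epoll(sock, timeout_ms):
    """Same readiness check using epoll() (Linux only)."""
    ep = select.epoll()
    try:
        ep.register(sock.fileno(), select.EPOLLIN)
        events = ep.poll(timeout_ms / 1000.0)  # epoll.poll() takes seconds
        return any(fd == sock.fileno() and mask & select.EPOLLIN for fd, mask in events)
    finally:
        ep.close()


# A connected socket pair stands in for the feed connection (illustration only).
feed_tx, feed_rx = socket.socketpair()
feed_rx.setblocking(False)

# Nothing has been sent yet, so neither mechanism reports readiness.
idle = (wait_readable_poll(feed_rx, 0), wait_readable_epoll(feed_rx, 0))

# After a message is sent, both mechanisms report the socket as readable.
feed_tx.send(b"feed")
ready = (wait_readable_poll(feed_rx, 1000), wait_readable_epoll(feed_rx, 1000))
message = feed_rx.recv(16)

feed_tx.close()
feed_rx.close()
```

Both helpers answer the same question. The difference lies in how the kernel tracks the set of watched descriptors: poll() receives the full set on every call, while epoll() keeps it registered between calls.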
A trader application that receives feed messages, either via a TCP/IP connection or via IP multicast. When a feed message is received, the trader immediately sends a TCP/IP order message back to the exchange.
An exchange application that broadcasts feed messages at a configured interval. The feed can be broadcast either via a TCP/IP server or via IP multicast. The exchange also has a TCP/IP order server to receive order messages. When an order message is received, the exchange determines the RTT by subtracting the send timestamp of the earlier broadcast feed message from the receive timestamp of the order message. The RTT is stored in a data set. When the data set reaches the configured sample size, the exchange reports a statistical analysis of the RTT.
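The round trip between the two applications can be sketched on a single machine as follows. This is a simplified model and not the benchmark's code: the one-field message layout, the function names, and the use of local socket pairs in place of the feed and order connections are all assumptions made for illustration.

```python
import socket
import struct
import time

# Hypothetical wire format: a single 64-bit sequence number.
SEQ = struct.Struct("!Q")

# Local socket pairs stand in for the feed and order connections.
feed_exchange, feed_trader = socket.socketpair()
order_trader, order_exchange = socket.socketpair()

# Exchange side: send timestamp of every feed message still awaiting an order.
sent_ns = {}


def broadcast_feed(seq):
    """Exchange: record the send timestamp and send the feed message."""
    sent_ns[seq] = time.monotonic_ns()
    feed_exchange.sendall(SEQ.pack(seq))


def trader_react():
    """Trader: on a feed message, immediately send an order back."""
    feed = feed_trader.recv(SEQ.size)
    order_trader.sendall(feed)


def receive_order():
    """Exchange: RTT = order receive timestamp - feed send timestamp."""
    (seq,) = SEQ.unpack(order_exchange.recv(SEQ.size))
    received = time.monotonic_ns()
    return seq, received - sent_ns.pop(seq)


samples = []
for seq in range(3):
    broadcast_feed(seq)
    trader_react()
    samples.append(receive_order())

for s in (feed_exchange, feed_trader, order_trader, order_exchange):
    s.close()
```

Each entry in samples is a (sequence number, RTT in nanoseconds) pair. In the real benchmark the two sides run on separate machines, so the RTT includes the network path in both directions.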
Note: latency is defined as 0.5 * RTT, that is, the time it takes for a message to travel one way, from the trader to the exchange or vice versa.
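The step from RTT samples to latency figures can be sketched as below. The timestamps are made-up illustration values, not measurements from the benchmark, and the choice of statistics (minimum, maximum, mean, standard deviation) is an assumption about what a typical report contains.

```python
import statistics


def rtt_us(feed_sent_us, order_received_us):
    """RTT: order receive timestamp minus the feed send timestamp."""
    return order_received_us - feed_sent_us


def latency_report(rtts_us):
    """Convert RTT samples to one-way latencies (0.5 * RTT) and summarize them."""
    latencies = [0.5 * rtt for rtt in rtts_us]
    return {
        "samples": len(latencies),
        "min": min(latencies),
        "max": max(latencies),
        "mean": statistics.mean(latencies),
        "stdev": statistics.stdev(latencies),  # sample standard deviation
    }


# Hypothetical (feed sent, order received) timestamps in microseconds.
timestamps = [(1000, 1020), (6000, 6024), (11000, 11022), (16000, 16018)]
rtts = [rtt_us(sent, received) for sent, received in timestamps]
report = latency_report(rtts)
```

Halving the RTT assumes that both directions take the same time, which is reasonable for two directly connected machines but is still an approximation.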
A specification of the messages that are communicated between the exchange and trader applications can be found here.
The benchmark has to run on two Linux x86_64 machines, one machine for each application. The two machines must be directly interconnected via two Solarflare 10 Gb network adapters (one per machine) that support the OpenOnLoad user space network stack.
Fedora was installed on both machines: Fedora 18 on one and Fedora 19 on the other.
Each machine is equipped with one SFN6122F network adapter. Both machines are directly connected with a 2 meter SFP+ cable.
The machine running Fedora 18 has an Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz, with hyper-threading turned off.
The machine running Fedora 19 has an Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz, with hyper-threading turned off.
On both machines, the instructions in the Low Latency Quickstart Guide from Solarflare were followed. To get OpenOnLoad working on kernels above 3.9 (i.e. 3.10+), a patch from Solarflare was required.
The exchange application uses gnuplot to produce graphs of the results measured, so make sure both machines have gnuplot installed (sudo yum install gnuplot).
The following commands were executed on the Fedora 19 machine to run the exchange application. A feed message is broadcast every 5000us, and the RTT statistics are reported after every 20000 samples:
[jevi@quiff exchange]$ export EF_POLL_USEC=100000000
[jevi@quiff exchange]$ export Q_THREAD_CPUBIND="MainThread,0;Timer,1;Reporter,2"
[jevi@quiff exchange]$ onload ./exchange -I<ip-of-local-sfn-adapter> -G224.0.0.1 -f --interval=5000 --samples=20000
The following commands were executed on the Fedora 18 machine to run the trader application:
[jevi@qintar exchange]$ export EF_POLL_USEC=1000000000
[jevi@qintar exchange]$ export Q_THREAD_CPUBIND="MainThread,0;Timer,2"
[jevi@qintar exchange]$ onload ./trader -h<ip-of-exchange-via-sfn> -I<ip-of-local-sfn-adapter> -G224.0.0.1 --id=sfn
The benchmark collects 20000 samples for each of the intervals 500us, 1000us, 5000us, and 10000us.
The benchmark is tested with poll() and epoll().
The results for poll() can be found here.
The results for epoll() can be found here.
The following table shows a summary of the differences between poll() and epoll() in terms of latency (0.5 * RTT).
poll() performs better than epoll(), especially as the broadcast interval increases (10000us). Note that epoll() performs better than poll() at the 5000us interval, but the standard deviation for poll() is lower, which suggests more stable performance.
The benchmark is executed with the trader and exchange threads bound to fixed CPU cores in order to reduce CPU cache misses. However, other (operating system) processes can still use these CPU cores, which causes cache pollution. This effect becomes more pronounced as the test interval of the benchmark increases. A possible improvement would be to confine the other processes to a separate, otherwise unused CPU core (i.e. one that does not share its cache with the benchmark CPU cores).
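For readers unfamiliar with CPU binding, the sketch below shows the general idea on Linux. It assumes that a binding string such as the Q_THREAD_CPUBIND values used above maps thread names to core numbers. That reading is inferred from how the values look, and the parser and helper are hypothetical, not the architecture's implementation.

```python
import os


def parse_cpubind(spec):
    """Parse 'Name,core;Name,core' into {name: core} (assumed format)."""
    binding = {}
    for entry in spec.split(";"):
        if not entry.strip():
            continue
        name, core = entry.split(",")
        binding[name.strip()] = int(core)
    return binding


def pin_caller(core):
    """Restrict the caller to a single core (Linux only) and return the new CPU set."""
    os.sched_setaffinity(0, {core})  # 0 refers to the caller
    return os.sched_getaffinity(0)


binding = parse_cpubind("MainThread,0;Timer,1;Reporter,2")
```

A thread would then call pin_caller(binding["Timer"]) at startup, for example. Binding alone does not keep other processes off those cores, which is why the cache pollution described above can still occur.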
You can download the benchmark here.