Lab notes on 10 Gbit/s network tests
(C) 2007 Jan Wagner, Guifré Molera

These lab notes document our 10 Gbit/s network tests and the results we achieved. The screen logs and notes are available for download. Everything here should be considered "work in progress". Tests are run with plain iperf, with Tsunami UDP data transfer, FTP, and other common tools.

Brief status summary (12Dec2007): jumbo frames are essential, offloaded TCP does 9.5 Gbit/s with nearly out-of-the-box settings, and UDP requires some more tweaking to reach 9.9 Gbit/s.


08Nov2007 - 'crossover' test

Network cards: Chelsio evaluation kit - two N320E-CX (dual-port 10G CX4) cards and one 3m CX4 cable for USD 1990
Computers: abidal and juliano
Setup: Each PC contained one dual-port N320E; one port was used to connect the PCs together directly over the 3 m CX4 cable. I set up a 192.168.1.* private net for 10G. Abidal had swap off, Juliano swap on.
Tests: TCP and UDP iperf with 1500 and 9000 MTU interface setting, attempts at disk-to-disk Tsunami
Screen logs: logs for abidal and juliano

Results and notes: the cxgb3 1.0.113 T3 driver wouldn't compile on kernels 2.6.17.4 and 2.6.22. I gave up after five minutes (TODO: try again to compile and install) and just used the cxgb3 driver that is already included in the mainstream kernel. But cxgb3 complained (dmesg, at ifup eth2) that firmware 4.7.0.0 was too new for the driver. Had to delete all t3fw-4.7.0.bin files (in /lib/firmware/ etc.) and copy in Chelsio's t3fw-4.0.0.bin instead. After that the card came up and pinging between the two PCs worked.

Initial iperf TCP and UDP transfers went at 2-3 Gbit/s. With 32 kB packets UDP went at 3.5 Gbit/s. Iperf always hit 100% CPU. After changing the MTU from 1500 to 9000, UDP throughput increased to 8 Gbit/s. Iperf still hit 100% CPU. Suspecting that a newer driver version may offload better.

Tsunami did not work too well; only 1.5 Gbit/s from Abidal RAID to Juliano RAID was possible. The opposite direction didn't quite work: the Abidal Tsunami client data all landed in memory and was flushed to disk by the OS only after ~10 s pauses (watching the disk cage LEDs flicker). Tried increasing the ring buffer size in Abidal's ./tsunami/client/client.h to 512 MB; that didn't help much, it just takes longer for the client diskiothread blocking messages to appear, with no throughput improvement.

A possible reason is Xorg on da monkey: it ate 2 x 450 MB of memory and 100% CPU of one core (launchpad.net Ubuntu Gutsy bug #51991). Switching xorg.conf from 'fglrx' back to 'vesa' made the Xorg memory and CPU hogging go away, a good thing. Further Tsunami tests not yet done after this. TODO!


09Nov2007 - more tests

Test setup is the same as 08Nov2007. No new cxgb3 driver or firmware yet.

Setup: Tweaked the TCP settings on both PCs (tcp-tweakUp.sh) after running the first couple of tests with Ubuntu-default settings.
Tests: TCP and UDP iperf with 1500, 9000 and 9600 MTU interface setting, disk-to-disk and disk-to-memory Tsunami
Screen logs: logs for abidal and juliano

Tsunami shortcoming: the target rate (e.g. 'set rate 4g') integer overflows above 4g/4096m, so one can't set, for example, a 5g target rate. Fixing this needs a Tsunami protocol version change (v1.1 -> v1.2) and breaks compatibility with older Tsunami software...
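
To illustrate the limit (a minimal standalone sketch, not Tsunami source): a 32-bit rate field tops out at 2^32 - 1, about 4.29e9 bit/s, so a 5g request wraps around, while a 64-bit field holds it fine.

  /* overflow_demo.c - sketch only, not Tsunami source */
  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      uint64_t requested = 5000000000ULL;       /* 'set rate 5g' in bit/s           */
      uint32_t wire_v11  = (uint32_t)requested; /* 32-bit protocol field (pre-v1.2) */
      uint64_t wire_v12  = requested;           /* 64-bit protocol field (v1.2)     */

      printf("requested      : %llu bit/s\n", (unsigned long long)requested);
      printf("as 32-bit field: %u bit/s (wrapped)\n", wire_v11);
      printf("as 64-bit field: %llu bit/s\n", (unsigned long long)wire_v12);
      return 0;
  }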

Progress of the cxgb3 driver update: On Abidal the first compile attempt of cxgb3 1.0.113a on Ubuntu Gutsy resulted in "/bin/sh: Syntax error: Bad fd number". The googled reason is that Ubuntu uses the softlink sh->dash instead of bash. Fixed with: sudo ln -s /bin/bash /bin/sh. Now just 3 compile errors in cxgb3_main.c about struct net_dev. Well, actually more, it turns out. This is because the struct sk_buff members have been renamed to something more descriptive (e.g. nh to network_header, and so on), so find&replace is a fast fix.
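
For reference, the general pattern of the rename (an illustration of the 2.6.22 sk_buff accessor change only, not the actual cxgb3 patch):

  /* Illustration only: kernel 2.6.22 replaced the old sk_buff header
   * unions with offsets plus accessor helpers. */
  #include <linux/skbuff.h>
  #include <linux/ip.h>

  static struct iphdr *get_ip_header(struct sk_buff *skb)
  {
      /* pre-2.6.22 style:  return skb->nh.iph;       */
      /* 2.6.22 and later:  use the ip_hdr() accessor */
      return ip_hdr(skb);
  }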

Results and notes: TCP iperf worked at nearly 9 Gbit/s. UDP iperf performs worse. Tsunami reached around 3.5 Gbit/s from Abidal RAID to Juliano memory.

Abidal power consumption was checked out of general curiosity with 11 disks switched on, 6 of them in RAID0, and the 'vesa' driver in xorg.conf: at PC power-up ~500 W, running idle ~310 W, reading from the 6-disk RAID ~330 W, tsunami'ing from the 6-disk RAID to 10G CX4 ???W.


13Nov2007 - petabit Tsunami

Setup: Booted Juliano into old 2.6.20 kernel and installed Chelsio's cxgb3 1.0.113 driver including 4.7.0.0 firmware.
Tests: Tsunami from Abidal disk to Juliano memory at 9600 MTU setting
Computers: abidal and juliano
Screen logs: logs for abidal and juliano

I've now created a new Tsunami branch for protocol v1.2, see the Tsunami homepage. Installed it on Juliano and Abidal. Several changes to Tsunami were needed to get it working with 64-bit protocol variables. For an x64 platform compile (e.g. abidal) the source needed casts; now both x86 and x64 compile without warnings.
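
As an aside, one common way to put 64-bit protocol fields on the wire in network byte order is sketched below. This assumes the fields travel big-endian like the old 32-bit ones; it is not necessarily what the v1.2 branch does, and htonll_sketch is just an illustrative name since libc has no htonll().

  #include <stdio.h>
  #include <stdint.h>
  #include <arpa/inet.h>

  /* Sketch: 64-bit host-to-network conversion built from htonl().
   * The same function converts back, so it doubles as ntohll. */
  static uint64_t htonll_sketch(uint64_t value)
  {
      if (htonl(1) == 1)
          return value;   /* big-endian host: already in network order */
      return ((uint64_t)htonl((uint32_t)(value & 0xFFFFFFFFULL)) << 32)
           |  (uint64_t)htonl((uint32_t)(value >> 32));
  }

  int main(void)
  {
      uint64_t rate = 9900000000ULL;   /* e.g. a 9.9 Gbit/s target rate */
      printf("host order: %llu, network order (raw): 0x%016llx\n",
             (unsigned long long)rate,
             (unsigned long long)htonll_sketch(rate));
      return 0;
  }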

Results and notes: the final transfer rate was 3.4 Gbit/s when transferring out from Abidal's RAID, while 'hdparm' says the disks can do a raw 4.4 Gbit/s. The usleep_that_works() timing doesn't seem to work fully; for example, at a 2G setting the actual IPD is 108 us vs the desired 130 us. The server/main.c line

ipd_time = ((ipd_time + 50) < xfer->ipd_current) ? ((u_int64_t) (xfer->ipd_current - ipd_time - 50)) : 0;
is the culprit; with +-5 instead of +-50 the rate is closer to 2G. Juliano receiving: OS rx errors start already at a 4000m target setting (3..11% loss). Abidal receiving: OS rx errors start at a 4300m rate setting (3..11% loss). Around 6 Gbit/s with UDP iperf was possible now, TCP iperf 5.8 Gbit/s. Is the new driver worse?
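
For reference, the ~130 us figure follows from the block size; a back-of-envelope check, assuming the 32 kB blocks used elsewhere in these tests:

  #include <stdio.h>

  /* Ideal inter-packet delay at a 2 Gbit/s target rate with 32 kB blocks. */
  int main(void)
  {
      double block_bits = 32768.0 * 8.0;   /* one 32 kB block in bits */
      double rate_bps   = 2e9;             /* 2G target in bit/s      */
      printf("ideal IPD = %.0f us\n", 1e6 * block_bits / rate_bps);   /* ~131 us */
      return 0;
  }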


15Nov2007 - Myrinet board tests

Setup: Juliano + Chelsio's cxgb3 1.0.113 driver including 4.7.0.0 firmware and Abidal + Myrinet myri10gb driver
Tests: Test using one Chelsio board and one Myrinet board connected by CX4 10 m cable
Computers: abidal and juliano
Screen logs: logs for abidal and juliano

The Myrinet board allows only a jumbo frame size (MTU) of 9000 (caused by chipset specs or non-dual-port mode?), so I had to decrease the MTU on the Juliano side as well, so that both sides have the same MTU. TX/RX is around 5.60 Gbps and 6.50 Gbps one way. Both computers are at 100% CPU. In TCP mode the transfer rate varies between 7.03 and 5.77 Gbps (so as one goes up the other decreases). The CPU load is 80 to 100%. No such changes if the packet size is 16K, 24K or 32K.

Tsunami tests -> Using diskless mode on both sides gives an average rate of 3800 Mbps for both interfaces, with a constant small error rate of about 10%. Both CPUs are fully loaded (100%).


15Nov2007 - Myrinet board tests (continued)

Setup: Juliano + Chelsio's cxgb3 1.0.113 driver including 4.7.0.0 firmware and Abidal + Myrinet myri10gb driver
Tests: Tsunami transfers between the Myrinet and Chelsio boards.
Computers: abidal and juliano
Screen logs: logs for abidal and juliano

Tests using Tsunami with lossy transfers allow a maximum transfer rate of 7000 Mbps (64K blocks) and 6750 Mbps (32K blocks). The data loss in each case is around 45%. Reducing the rate in order to get to 0% loss, we end up at 4200 Mbps with no (or really low) data loss. Even 4700 Mbps gave an error rate of only 7%, so not too bad in that sense. The data loss seems to improve a bit when using 64K rather than 32K blocks.

Finally, using disk writing -> Transfers using Tsunami work fine up to 3500 Mbps, but not at higher speeds; then data is lost in huge amounts and Tsunami usually crashes... Also, using netcat the maximum rate is 420 MBps, so around 3300 Mbps. (Remember the speed achieved by hdparm from the RAID disk was 470 MBps.)

  # receiver, writes to RAID:
  nc -u -l -p 64224 > /raid/test1
  # sender:
  dd bs=32768 count=1000000 if=/dev/zero | nc -u 192.168.1.105 64224


03/04Dec2007 - HP 6400cl and jumboframe tests

Juliano 2.6.20/.22 and Abidal 2.6.23, Chelsio CX4's.

Setup: Enabled 9000-byte jumbo frames on the HP 6400cl, Summit X450, Juliano and Abidal. Legacy interrupt mode disabled using kernel boot parameter 'pci=msi'.
Computers: abidal and juliano
Tests: TCP and UDP iperf
Logs: logs for juliano

Chelsio CX4 to HP6400cl to Chelsio CX4: TCP 9.5 Gbit/s, UDP still only around 5 Gbit/s. Using the Myrinet SR instead of Chelsio CX4 in Abidal made TCP and UDP throughputs only worse.

UDP throughput is limited by the test computer CPUs. 'iperf' doesn't use threads, so it loads only one core of the multi-core CPUs.

In the previous UDP tests the sender and receiver hit 100% on one CPU core; the other 1 or 3 cores remained unused. With two 'iperf' client and two server instances running in parallel at both ends, 3.9 Gbit/s UDP per iperf works fine with 0% loss but near-100% CPU core load. This gives a total throughput of 7.8 Gbit/s UDP through the same network card (4470 MTU). Increasing to 9000 MTU allowed two parallel 4.9 Gbit/s UDP transfers with 0% loss, i.e. a total of 9.8 Gbit/s UDP (9000 MTU).

Conclusions: for ≥4 Gbit/s UDP to work efficiently, it looks like the UDP receiver program should be multithreaded - at least for the 3 GHz Core 2 Duos or 2 GHz AMD Opterons tested here.
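
A minimal sketch of what such a multithreaded receiver could look like (not iperf or Tsunami code; port numbers, thread count and buffer size are arbitrary illustration values): one socket and one receive thread per port, so the per-packet work can spread across cores much like the two parallel iperf instances above.

  /* mt_udp_rx.c - sketch of a multithreaded UDP receiver, compile with -lpthread */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>
  #include <sys/socket.h>

  #define NUM_THREADS 2
  #define BUF_SIZE    (32 * 1024)     /* 32 kB receive buffer, as in the tests */

  static void *rx_thread(void *arg)
  {
      int port = *(int *)arg;
      int sock = socket(AF_INET, SOCK_DGRAM, 0);
      struct sockaddr_in addr;
      char *buf = malloc(BUF_SIZE);
      long long bytes = 0;

      memset(&addr, 0, sizeof(addr));
      addr.sin_family      = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port        = htons(port);
      if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
          perror("bind");
          return NULL;
      }
      for (;;) {
          /* a zero-length datagram (or an error) ends the loop */
          ssize_t n = recv(sock, buf, BUF_SIZE, 0);
          if (n <= 0)
              break;
          bytes += n;
      }
      printf("port %d: received %lld bytes\n", port, bytes);
      free(buf);
      close(sock);
      return NULL;
  }

  int main(void)
  {
      pthread_t threads[NUM_THREADS];
      int ports[NUM_THREADS] = { 64224, 64225 };   /* hypothetical ports */
      int i;

      for (i = 0; i < NUM_THREADS; i++)
          pthread_create(&threads[i], NULL, rx_thread, &ports[i]);
      for (i = 0; i < NUM_THREADS; i++)
          pthread_join(threads[i], NULL);
      return 0;
  }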

TODO: play nice and test VLANs with jumbo frames. See if we can reproduce the jumbo frame routing problem that JIVE sees on their HP 5412zl switch with certain new, totally buggy firmware.


12Dec2007 - UDP checksumming tests

Trying to find out why UDP iperf has such a high CPU load. One suspicion is that TCP checksum offloading is in hardware while UDP checksums are calculated in software.

Setup: Enabled 9000-byte jumbo frames on the HP 6400cl, Summit X450, Juliano and Abidal. Kernel has boot parameter 'pci=msi'.
Computers: abidal and juliano
Tests: UDP iperf with iperf modified to use the SO_NO_CHECK socket option, Tsunami v1.2 with SO_NO_CHECK, different MTU sizes
Logs: Abidal iperf client, Juliano iperf server, summarized log of Tsunami transfers

I modified iperf v2.0.2 PerfSocket.cpp such that SetSocketOptions() now attempts to disable UDP checksums. This is the additional code:

    if ( isUDP( inSettings ) ) {
        int yes = 1;
        setsockopt ( inSettings->mSock, SOL_SOCKET, SO_NO_CHECK, &yes, sizeof(yes));
    }

In Tsunami ./server/network.c I added the same SO_NO_CHECK setting. Later I updated ./server/main.c for slightly better IPD timing code:

  /* delay for the next packet */
  ipd_time = get_usec_since(&delay);
  // ipd_time = ((ipd_time + 50) < xfer->ipd_current) ? ((u_int64_t) (xfer->ipd_current - ipd_time - 50)) : 0;
  // usleep_that_works(ipd_time);
  if (ipd_time < xfer->ipd_current) {
      usleep_that_works(xfer->ipd_current - ipd_time);
  }

Iperf results: when the client (sender) sets SO_NO_CHECK, UDP rates up to 9.90 Gbit/s work with 0% packet loss! The CPU load hits 100% on one core, the rest are idle. Testing again without SO_NO_CHECK on the sender side results in poor rates and high loss again.

MTU          UDP size  Target  Result (rate, jitter, lost/total)            One core 'top' on Abidal/sender
1500 bytes   32 kB     9000m   4.88 Gbit/s, 0.007 ms, 12097/570979 (2.1%)   "3.0%us, 54.0%sy", iperf 100%
4470 bytes   32 kB     9000m   8.55 Gbit/s, 0.009 ms, 50053/1028255 (4.9%)  "6.6%us, 58.8%sy", iperf 100%
9000 bytes   32 kB     9000m   9.04 Gbit/s, 0.010 ms, 0/1034452 (0%)        "37.9%us, 49.8%sy", iperf 100%
9000 bytes   32 kB     9900m   -                                            "36.0%us, 56.0%sy", iperf 100%
Table: effect of the MTU setting (on both PCs) on the CPU load and thus on the maximum sending rate. On Abidal only one of the four CPU cores ever had the full load; the other cores were idle.

After reducing MTU from 9000 to 1500, still with checksums disabled, the client iperf on Abidal could not send faster than about 5 Gbit/s. Further results are in the table above.

Iperf is a bit inconsistent: after trying MTU 9000, 1500, 4470 and then reverting to 9000 again, the same UDP iperf test from the first run now works at "only" 9.04 Gbit/s instead of 9.90 Gbit/s. Perhaps a reboot is needed after tampering with the MTU.

Iperf conclusions: A possible conclusion is that with the current driver or hardware, UDP checksumming indeed runs in software, not in hardware. Another point is that Chelsio has revealed that, unlike for offloaded TCP streams, IP fragment reassembly for UDP packets is done in software. The two-fold difference in CPU-limited (100% load) throughput between the 1500 and 9000 MTU settings with the same 32 kB UDP block size seems to confirm this. Thus, to get fast UDP we need both jumbo frames and SO_NO_CHECK!

Tsunami results: disabling UDP checksums helped a little, but in addition the IPD packet spacing code had to be improved slightly; before, it had an intentional +-50 us inaccuracy. Transmit rates now easily reach 7 Gbit/s, but the (diskless) receiving side is still too slow. Memory-to-memory 4.5 Gbit/s works with 0% loss, but anything faster has increased loss. An 8192-byte block size at 9000 MTU works dismally. And with memory-to-disk the situation is still bad: only ~3.5 Gbit/s works before the disk becomes too slow, same as before.

Tsunami conclusions: Something in the client slows things down even in diskless mode. The ring buffer uses mutexes, and 'gprof' profiling suggests quite a lot of CPU time is spent in the bit-fiddling 'session->transfer.received[block / 8] & (1 << (block % 8))' code (the bitmap of received blocks). And what about that zero-copy idea - mmap() the output file?
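
For context, the bitmap pattern gprof points at looks roughly like this (a sketch with illustrative names, not the actual Tsunami source):

  #include <stdint.h>

  /* One bit per block: test and set whether a given block has arrived. */
  typedef struct {
      uint8_t  *received;       /* bitmap, block_count/8 bytes (rounded up) */
      uint32_t  block_count;
  } block_map_t;

  int got_block(const block_map_t *m, uint32_t block)
  {
      return (m->received[block / 8] & (1 << (block % 8))) != 0;
  }

  void mark_block(block_map_t *m, uint32_t block)
  {
      m->received[block / 8] |= (1 << (block % 8));
  }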


13Dec2007 - Some Tsunami v1.2 testing

The results of Abidal's 12-disk RAID0 performance test are below. Baseline performance measured with 'hdparm' showed that the slowest disk is /dev/sdi with 66 MB/s throughput, hence ideally we should get 12 * 66 MB/s = 792 MB/s = 6.3 Gbit/s. The best write speed we can get is 4.5 Gbit/s, using the ext2 file system.

Results of 'bonnie++ -d /raid -f -s 20G:8k' per file system (summary line, then the raw bonnie++ output):

  XFS       482 MB/s write, 209 MB/s re-write, 572 MB/s read
            abidal,20G,,,482387,73,208637,41,,,572292,63,253.6,1,16,11033,60,+++++,+++,8389,40,10951,58,+++++,+++,7524,38
  ReiserFS  265 MB/s write, 313 MB/s re-write, 613 MB/s read
            abidal,20G,,,264824,95,312503,94,,,613376,98,435.9,2,16,23639,99,+++++,+++,17679,84,22932,100,+++++,+++,19246,100
  ext2      563 MB/s write, 180 MB/s re-write, 293 MB/s read
            abidal,20G,,,563299,72,180216,42,,,292603,34,297.5,1,16,*,+++++,+++,*,*,*,*,+++++,+++,*,*
  JFS       257 MB/s write, 158 MB/s re-write, 344 MB/s read
            abidal,10G,,,422608,67,276320,48,,,664353,69,409.5,1,16,11541,32,+++++,+++,16046,41,8430,68,+++++,+++,4595,38
            abidal,20G,,,414490,67,291779,53,,,674383,74,323.0,1,16,9353,29,+++++,+++,15385,43,6617,56,+++++,+++,3513,26
  FAT       320 MB/s write, 306 MB/s re-write, 766 MB/s read
            abidal,10G,,,319583,71,306267,97,,,765734,94,609.4,2,16,73,99,171,100,772,100,117,99,176,99,295,100
            raw FAT, /raid unmounted : hdparm -tT /dev/md0 : 594.61 MB/sec
            FAT, /raid mounted       : hdparm -tT /dev/md0 : 173.58 MB/sec

Raw wr-nexgen writes to /dev/md0 (no file system):

  80 GB of data in 32 kB blocks, 1.2 GB RAM buffer: took 126.376016 seconds, 5437.699243 Mbits(dec)/s
    (NB: kflushd 35% CPU, pdflush 75% CPU, wr-nexgen 100% CPU, entire 4-core PC 55% busy)
  0.8 TB of data in 32 kB blocks, 2 GB RAM buffer: took 1284.366311 seconds, 5350.457744 Mbits(dec)/s
    (NB: never drops below 4.9 Gbps)

Results of 'bonnie++ -d /raid -f -s 30G:32k':

  XFS       489 MB/s write, 217 MB/s re-write, 593 MB/s read
            abidal,30G:32k,,,489413,71,217053,41,,,593493,63,231.6,1,16,12044,69,+++++,+++,7552,37,8302,51,+++++,+++,8991,48

Table: File system performance on the same 12-disk SATA RAID0 set. The fastest writes are on ext2 and the fastest reads on FAT32, but with read and write combined XFS is fastest.

A quick test using LVM2 instead of MD to build the RAID0 gave really poor performance. The best raw LVM2 read rate of 275 MB/s was found for 8 stripes and a 64 kB stripe size. Using more or fewer disks and larger or smaller stripes only resulted in even lower throughput. This could be because of the vintage LVM bug #129488.

Tsunami v1.2 performance was tested in some more detail. First, the client's UDP recvfrom() was replaced by a block-number incrementer instead of actually receiving the UDP data, and the ring buffer and disk I/O were disabled. In this mode the client "throughput" reaches many Terabit/s on Abidal. With the ring buffer access (mutexes etc.) added back, "throughput" is 11 Gbit/s. When the recvfrom() code is restored, that is, when testing memory-to-memory transfers without disks, the maximum throughput is ~7 Gbit/s as before. With disk I/O added, the throughput crawls down to the same ~3.5 Gbit/s we achieved earlier, because of the slow RAID0.

The bonnie++ throughput figures for ext2 looked promising. Tsunami to an ext2- instead of xfs-formatted RAID0, however, did not work that much faster; the limit was 3.7 Gbit/s. According to 'top', the Tsunami client took 200% CPU.

With the UDT4 protocol (appserver, appclient, 8 Gbps target rate) the maximum memory-to-memory rate is only 5.2 Gbps for now.


23Jan2008 - raw2raid and raid2udp tests

todo ...

Using mmap() to receive network data directly into the file buffers seemed initially promising because of the Linux kernel zero-copy behaviour that can be exploited by mmap()ing a file into the address space of the process. This way one or more memory copy operations are skipped, which reduces the CPU load.

A major problem turned out to be the kernel flushing behaviour. Without explicit flushing in the source code, the kernel keeps the written mmap()ed file pages in main memory for several or even tens of seconds, until nearly all free memory is gone (4 GB on Abidal). Only then are the pages flushed to disk, and everything stalls for several seconds until the flush completes -- or the kernel produces a crash dump.
It was necessary to add manual ASYNC or even SYNC msync() calls after every couple of megabytes. This helped, and the Abidal RAID0 could sustain around 4.2 Gbit/s. The performance of a similar program, gulp, was comparable.
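
A minimal sketch of the idea (not the raw2raid/raid2udp code; the file name, file size and flush interval are arbitrary illustration values): write into an mmap()ed output file and force periodic msync() flushes so the kernel does not hoard gigabytes of dirty pages.

  /* mmap_write_sketch.c */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/mman.h>

  #define FILE_SIZE   (256UL * 1024 * 1024)   /* 256 MB output file        */
  #define CHUNK       (32 * 1024)             /* 32 kB writes, as in tests */
  #define FLUSH_EVERY (4UL * 1024 * 1024)     /* msync() every 4 MB        */

  int main(void)
  {
      const char *path = "/raid/mmap_test.out";   /* hypothetical path */
      int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
      if (fd < 0 || ftruncate(fd, FILE_SIZE) != 0) {
          perror("open/ftruncate");
          return 1;
      }

      char *map = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (map == MAP_FAILED) {
          perror("mmap");
          return 1;
      }

      size_t written = 0, since_flush = 0;
      while (written + CHUNK <= FILE_SIZE) {
          /* In a real receiver this would be recvfrom() straight into map+written. */
          memset(map + written, 0xAB, CHUNK);
          written     += CHUNK;
          since_flush += CHUNK;

          if (since_flush >= FLUSH_EVERY) {
              /* MS_ASYNC queues the dirty pages for writeback; MS_SYNC would block. */
              msync(map + written - since_flush, since_flush, MS_ASYNC);
              since_flush = 0;
          }
      }
      msync(map, FILE_SIZE, MS_SYNC);          /* final blocking flush */
      munmap(map, FILE_SIZE);
      close(fd);
      printf("wrote %zu bytes via mmap to %s\n", written, path);
      return 0;
  }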