These lab notes contain our 10 Gbit/s network tests and the results we achieved. The screen logs and notes are available for download. Everything here should be considered "work in progress". Tests are run with plain iperf, with the Tsunami UDP data transfer tool, FTP, and other common tools.
Brief status summary (12Dec2007): jumbo frames are essential; offloaded TCP reaches 9.5 Gbit/s with nearly out-of-the-box settings, while UDP needs some more tweaking to achieve 9.9 Gbit/s.
Network cards: | Chelsio evaluation kit - two N320E-CX (dual-port 10G CX4) cards and one 3m CX4 cable for USD 1990 |
Computers: | abidal and juliano |
Setup: | Each PC contained one dual-port N320E, one port was used to connect PCs together directly over the 3m CX4 cable. I set up a 192.168.1.* private net for 10G. Abidal swap off, Juliano swap on. |
Tests: | TCP and UDP iperf with 1500 and 9000 MTU interface setting, attempts at disk-to-disk Tsunami |
Screen logs: | logs for abidal and juliano |
Results and notes: the cxgb3 1.0.113 T3 driver wouldn't compile under
2.6.17.4 or 2.6.22; I gave up after five minutes (TODO: try again to
compile and install) and just used the cxgb3 driver that is already
included in the mainline kernel. However, cxgb3 complained (dmesg, at ifup
eth2) that firmware 4.7.0.0 was too new for the driver. I had to delete
all copies of t3fw-4.7.0.bin (in /lib/firmware/ etc.) and copy Chelsio's
t3fw-4.0.0.bin there instead. After that the card worked and pinging
between the two PCs succeeded.
Initial iperf TCP and UDP transfers ran at 2-3 Gbit/s. With 32 kB packets
UDP reached 3.5 Gbit/s. Iperf always hit 100% CPU. After changing the MTU
from 1500 to 9000, UDP throughput increased to 8 Gbit/s, with iperf
still at 100% CPU. I suspect a newer driver version may offload
better.
Tsunami did not work too well: only 1.5 Gbit/s from the Abidal
RAID to the Juliano RAID was possible, and the opposite direction didn't
really work. On Abidal the Tsunami client data all landed in memory and was
flushed to disk by the OS only after ~10 s pauses (watching the disk cage
LEDs flicker). I tried increasing the ring buffer size in
./tsunami/client/client.h on Abidal to 512 MB; it didn't help much, it just
takes longer for the client's disk I/O thread blocking messages to appear,
with no throughput improvement.
A possible culprit is Xorg: it ate 2 x 450 MB of memory and 100%
CPU of one core (launchpad.net Ubuntu Gutsy bug
#51991). Switching xorg.conf from 'fglrx' back to 'vesa' made the Xorg
memory and CPU hogging go away, which is a good thing. Further Tsunami
tests have not yet been done after this. TODO!
Test setup is the same as 08Nov2007. No new cxgb3 driver or firmware yet.
Setup: | Tweaked the TCP settings on both PCs (tcp-tweakUp.sh) after running the first couple of tests with Ubuntu-default settings. |
Tests: | TCP and UDP iperf with 1500, 9000 and 9600 MTU interface setting, disk-to-disk and disk-to-memory Tsunami |
Screen logs: | logs for abidal and juliano |
Tsunami shortcoming: the target rate (e.g. 'set rate 4g') integer-overflows above 4g/4096m, so one can't set for example a 5g target rate. Fixing this would need a Tsunami protocol version change v1.1->v1.2 and would break compatibility with older Tsunami software...
Progress of the cxgb3 driver update: on Abidal the first compile attempt of cxgb3 1.0.113a on Ubuntu Gutsy resulted in
Results and notes: TCP iperf ran at near 9 Gbit/s. UDP iperf performs worse. Tsunami reached around 3.5 Gbit/s from Abidal RAID to Juliano memory.
Abidal power consumption, checked out of general curiosity with 11 disks switched on, 6 of them in RAID0, and the 'vesa' driver in xorg.conf: at PC power-up ~500 W, running idle ~310 W, reading from the 6-disk RAID ~330 W, tsunami'ing from the 6-disk RAID to 10G CX4 ???W.
Setup: | Booted Juliano into old 2.6.20 kernel and installed Chelsio's cxgb3 1.0.113 driver including 4.7.0.0 firmware. |
Tests: | Tsunami from Abidal disk to Juliano memory at 9600 MTU setting |
Computers: | abidal and juliano |
Screen logs: | logs for abidal and juliano |
I've now created a new Tsunami branch for protocol v1.2, see the Tsunami homepage, and installed it on Juliano and Abidal. Several changes to Tsunami were needed to get it working with 64-bit protocol variables. For an x64 platform compile (e.g. Abidal) the source needed casts; now both x86 and x64 compile without warnings.
Results and notes: the final transfer rate was 3.4 Gbit/s
when transferring out from Abidal's RAID, while 'hdparm' says the disks
can do a raw 4.4 Gbit/s. The usleep_that_works() doesn't seem to work
fully; for example at a 2G setting the actual IPD is 108 us vs the desired
130 us. The relevant server/main.c line is:
    ipd_time = ((ipd_time + 50) < xfer->ipd_current) ? ((u_int64_t) (xfer->ipd_current - ipd_time - 50)) : 0;
Setup: | Juliano + Chelsio's cxgb3 1.0.113 driver including 4.7.0.0 firmware and Abidal + Myrinet myri10gb driver |
Tests: | Test using one Chelsio board and one Myrinet board connected by CX4 10 m cable |
Computers: | abidal and juliano |
Screen logs: | logs for abidal and juliano |
The Myrinet board allows a jumbo frame size (MTU) of only 9000 (caused by chipset specs or non-dual-port mode?), so the MTU on the Juliano side had to be decreased to match. UDP TX/RX is around 5.60 Gbps and 6.50 Gbps one way, with both computers at 100% CPU. In TCP mode the transfer rates fluctuate between 7.03 and 5.77 Gbps (as one goes up the other decreases), with CPU load at 80 to 100%. No such changes whether the packet size is 16K, 24K or 32K.
Tsunami tests: running diskless at both ends gives an average rate of 3800 Mbps on both interfaces, with a constant small error rate of about 10%. Both CPUs are fully loaded (100%).
Setup: | Juliano + Chelsio's cxgb3 1.0.113 driver including 4.7.0.0 firmware and Abidal + Myrinet myri10gb driver |
Tests: | Tests using Tsunami and Myrinet and Chelsio boards. |
Computers: | abidal and juliano |
Screen logs: | logs for abidal and juliano |
Tests using Tsunami with lossy transfers allow a maximum transfer rate of 7000 Mbps (64K blocks) and 6750 Mbps (32K blocks); the data loss in each case is around 45%. Reducing the rate to reach 0% loss, we end up at 4200 Mbps with no or really low data loss. Even 4700 Mbps gave an error rate of only 7%, so not too bad in that sense. Using 64K instead of 32K blocks seems to improve the data loss a bit.
Finally, with disk writing: transfers using Tsunami work fine up to 3500 Mbps, but not at higher speeds; then data is lost in huge amounts and Tsunami usually crashes... Also, using netcat the maximum rate is 420 MBps, i.e. around 3300 Mbps. (Remember the speed achieved by hdparm from the RAID disk was 470 MBps.)
Receiver:  nc -u -l -p 64224 > /raid/test1
Sender:    dd bs=32768 count=1000000 if=/dev/zero | nc -u 192.168.1.105 64224
Juliano 2.6.20/.22 and Abidal 2.6.23, Chelsio CX4's.
Setup: | Enabled 9000-byte jumbo frames on the HP 6400cl, Summit X450, Juliano and Abidal. Legacy interrupt mode disabled using kernel boot parameter 'pci=msi'. |
Computers: | abidal and juliano |
Tests: | TCP and UDP iperf |
Logs: | logs for juliano |
Chelsio CX4 to HP6400cl to Chelsio CX4: TCP 9.5 Gbit/s, UDP still only around 5 Gbit/s. Using the Myrinet SR instead of Chelsio CX4 in Abidal made TCP and UDP throughputs only worse.
UDP throughput is limited by the test computers' CPUs. 'iperf' doesn't use threads, so it loads only one core of a multi-core CPU.
In previous UDP tests the sender and receiver hit 100% on one CPU core; the other 1 or 3 cores remained unused. With two parallel 'iperf' client and server instances running at both ends, 3.9 Gbit/s UDP per iperf works fine with 0% loss but near 100% CPU core load. This gives a total throughput of 7.8 Gbit/s UDP through the same network card (4470 MTU). Increasing to 9000 MTU allowed two parallel 4.9 Gbit/s UDP transfers with 0% loss, i.e. a total of 9.8 Gbit/s UDP (9000 MTU).
Conclusions: for ≥4 Gbit/s UDP to work efficiently, it looks like the UDP receiver program should be multithreaded - at least for the 3 GHz Core 2 Duo's or 2 GHz AMD Opterons tested here.
TODO: play nice and test VLANs with jumbo frames. See if we can reproduce the jumbo-frame routing problem that JIVE sees on their HP 5412zl switch with certain new, totally buggy firmware.
Trying to find out why UDP iperf has such a high CPU load. One suspicion is that TCP checksum offloading is in hardware while UDP checksums are calculated in software.
Setup: | Enabled 9000-byte jumbo frames on the HP 6400cl, Summit X450, Juliano and Abidal. Kernel has boot parameter 'pci=msi'. |
Computers: | abidal and juliano |
Tests: | UDP iperf with iperf modified to use the SO_NO_CHECK socket option, Tsunami v1.2 with SO_NO_CHECK, different MTU sizes |
Logs: | Abidal iperf client, Juliano iperf server, summarized log of Tsunami transfers |
I modified iperf v2.0.2 PerfSocket.cpp such that SetSocketOptions() now attempts to disable UDP checksums. This is the additional code:
    if ( isUDP( inSettings ) ) {
        int yes = 1;
        setsockopt( inSettings->mSock, SOL_SOCKET, SO_NO_CHECK, &yes, sizeof(yes) );
    }
In Tsunami ./server/network.c I added the same SO_NO_CHECK setting. Later I updated ./server/main.c for slightly better IPD timing code:
    /* delay for the next packet */
    ipd_time = get_usec_since(&delay);
    // ipd_time = ((ipd_time + 50) < xfer->ipd_current) ? ((u_int64_t) (xfer->ipd_current - ipd_time - 50)) : 0;
    // usleep_that_works(ipd_time);
    if (ipd_time < xfer->ipd_current) {
        usleep_that_works(xfer->ipd_current - ipd_time);
    }
Iperf results: when the client (sender) sets SO_NO_CHECK, UDP rates up to 9.90 Gbit/s work with 0% packet loss! The CPU load hits 100% on one core, the rest are idle. Testing again without SO_NO_CHECK on the sender side results in poor rates and high loss again.
MTU | UDP size | target | result | one core 'top' on Abidal/sender |
---|---|---|---|---|
1500 bytes | 32 kB | 9000m | 4.88 Gbits/sec 0.007 ms 12097/570979 (2.1%) | "3.0%us, 54.0%sy", iperf 100% |
4470 bytes | 32 kB | 9000m | 8.55 Gbits/sec 0.009 ms 50053/1028255 (4.9%) | "6.6%us, 58.8%sy", iperf 100% |
9000 bytes | 32 kB | 9000m | 9.04 Gbits/sec 0.010 ms 0/1034452 (0%) | "37.9%us, 49.8%sy" iperf 100% |
9000 bytes | 32 kB | 9900m | | "36.0%us, 56.0%sy" iperf 100% |
Table: effect of both PCs' MTU setting on the CPU load and thus the maximum sending rate. On Abidal only one of the four CPU cores ever carried the full load; the other cores were idle.
After reducing the MTU from 9000 to 1500, still with checksums disabled, the client iperf on Abidal could not send faster than about 5 Gbit/s. Further results are in the table above.
Iperf is a bit inconsistent: after trying MTU 9000, 1500, 4470 and then reverting to 9000 again, the same UDP iperf of the first test now runs at "only" 9.04 Gbit/s instead of 9.90 Gbit/s. Perhaps a reboot is needed after tampering with the MTU.
Iperf conclusions: a possible conclusion is that with the current driver or hardware, UDP checksumming indeed runs in software, not in hardware. Also, Chelsio had revealed that, unlike for offloaded TCP streams, IP fragment reassembly for UDP packets is done in software. The two-fold difference we see in the CPU-limited (100% load) throughput between the 1500 and 9000 MTU settings with the same 32 kB UDP block size seems to confirm this. Thus, to get fast UDP we need both jumbo frames and SO_NO_CHECK!
Tsunami results: disabling UDP checksums did help a little, but in addition the IPD packet spacing code had to be improved slightly; before, it had an intentional +-50 us inaccuracy. Now transmit rates easily reach 7 Gbit/s, but the (diskless) receiving side is still too slow. Memory-to-memory 4.5 Gbit/s works with 0% loss, but faster rates show increasing loss. An 8192-byte block size at 9000 MTU works dismally. And with memory-to-disk the situation is still bad: only ~3.5 Gbit/s works until the disk becomes too slow, same as before.
Tsunami conclusions: something in the client slows things down
even in diskless mode. The ring buffer uses mutexes, and 'gprof'
profiling suggests that quite a lot of CPU time is spent in the bit-fiddling.
The results of Abidal's 12-disk RAID0 performance test are below. Baseline performance measured with 'hdparm' showed the slowest disk to be /dev/sdi at 66 MB/s, hence ideally we should get 12 * 66 MB/s = 792 MB/s = 6.3 Gbit/s. The best write speed we can get is 4.5 Gbit/s, using the ext2 file system.
Filesys | Result of bonnie++ -d /raid -f -s 20G:8k |
---|---|
XFS | 482MB/s write, 209MB/s re-write, 572MB/s read abidal,20G,,,482387,73,208637,41,,,572292,63,253.6,1,16,11033,60,+++++,+++,8389,40,10951,58,+++++,+++,7524,38 |
ReiserFS | 265MB/s write, 313MB/s re-write, 613MB/s read abidal,20G,,,264824,95,312503,94,,,613376,98,435.9,2,16,23639,99,+++++,+++,17679,84,22932,100,+++++,+++,19246,100 |
ext2 | 563MB/s write, 180MB/s re-write, 293MB/s read abidal,20G,,,563299,72,180216,42,,,292603,34,297.5,1,16,*,+++++,+++,*,*,*,*,+++++,+++,*,* |
JFS | 257MB/s write, 158MB/s re-write, 344MB/s read abidal,10G,,,422608,67,276320,48,,,664353,69,409.5,1,16,11541,32,+++++,+++,16046,41,8430,68,+++++,+++,4595,38 abidal,20G,,,414490,67,291779,53,,,674383,74,323.0,1,16,9353,29,+++++,+++,15385,43,6617,56,+++++,+++,3513,26 |
FAT | 320MB/s write, 306MB/s re-write, 766MB/s read abidal,10G,,,319583,71,306267,97,,,765734,94,609.4,2,16,73,99,171,100,772,100,117,99,176,99,295,100 |
raw | FAT /raid unmounted : hdparm -tT /dev/md0 : 594.61 MB/sec FAT /raid mounted : hdparm -tT /dev/md0 : 173.58 MB/sec |
raw | wr-nexgen /dev/md0, 80GB of data in
32kB blocks, 1.2GB RAM buffer Took 126.376016 seconds, 5437.699243 Mbits(dec)/s NB: kflushd 35%CPU, pdflush 75%CPU, wr-nexgen 100%CPU, entire 4-core PC 55% busy |
raw | wr-nexgen /dev/md0, 0.8TB of data in 32kB blocks, 2GB RAM
buffer Took 1284.366311 seconds, 5350.457744 Mbits(dec)/s NB: never drops below 4.9 Gbps |
Filesys | Result of bonnie++ -d /raid -f -s 30G:32k |
---|---|
XFS | 489MB/s write, 217MB/s re-write, 593MB/s read abidal,30G:32k,,,489413,71,217053,41,,,593493,63,231.6,1,16,12044,69,+++++,+++,7552,37,8302,51,+++++,+++,8991,48 |
Table: file system performance on the same 12-disk SATA RAID0 set. The fastest write is on ext2 and the fastest read on FAT32, but with read and write combined XFS is fastest.
A quick test using LVM2 instead of MD to build the RAID0 gave really poor performance. The best raw LVM2 read rate, 275 MB/s, was found with 8 stripes and a 64 kB stripe size; using more or fewer disks and larger or smaller stripes only lowered the throughput further. This could be because of the vintage LVM bug #129488.
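For reference, a striped LV like the one tested could be recreated roughly as follows. The device names are illustrative, not from the logs; '-i' is the number of stripes and '-I' the stripe size in kB:

```shell
# Hypothetical re-creation of the 8-stripe, 64 kB stripe-size LV
# (device names are placeholders, not the actual Abidal disks)
pvcreate /dev/sd[b-i]
vgcreate vg_raid /dev/sd[b-i]
lvcreate -i 8 -I 64 -l 100%FREE -n lv_raid vg_raid
# compare raw read rate against the MD RAID0:
hdparm -tT /dev/vg_raid/lv_raid
```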
Tsunami v1.2 performance was tested in some more detail. First, the client's UDP recvfrom() was replaced by a block number incrementer instead of actually receiving the UDP data, and the ring buffer and disk I/O were disabled; here the client's "throughput" reaches many terabit/s on Abidal. With the ring buffer access (mutexes etc.) added back, "throughput" is 11 Gbit/s. When the recvfrom() code is restored, i.e. when testing memory-to-memory transfers without disks, the maximum throughput is ~7 Gbit/s as before. With disk I/O added, the slow RAID0 drags throughput down to the same ~3.5 Gbit/s we achieved earlier.
The bonnie++ throughput figures for ext2 looked promising. Tsunami to an ext2- instead of xfs-formatted RAID0, however, was not that much faster; the limit was 3.7 Gbit/s. According to 'top', the Tsunami client took 200% CPU.
With the UDT4 protocol (appserver, appclient, 8 Gbps target rate) the maximum memory-to-memory rate is only 5.2 Gbps for now.
todo ...
Using mmap() to receive network data directly into the file buffers seemed initially promising because of the Linux kernel zero-copy feature: by mmap():ing a file into the address space of the process, one or more memory copy operations are skipped, which reduces the CPU load.
A major problem turned out to be the kernel's flushing behaviour. Without
explicit flushing in the source code, the kernel keeps the written
mmap():ed file pages in main memory for several or even tens of
seconds, until nearly all free memory is gone (4 GB on Abidal).
Only then are the pages flushed to disk, and everything stalls for several
seconds until the flush completes -- or the kernel crashes with a kernel
dump. It was necessary to add manual ASYNC or even SYNC msync() calls after
every couple of megabytes. This helped, and the Abidal RAID0 could sustain
around 4.2 Gbit/s. The performance of the similar program 'gulp' was comparable.