
TCP Stack Measurements on SC03 10Gbits/s Links

Les Cottrell. Created 16 Dec '03

Introduction | Setup | Methodology | Results - Measurement set 1 | Results - Measurement set 2

Introduction

During SuperComputing 2003 (SC03) we made some tentative TCP performance measurements on 10Gbits/s links between hosts at the SLAC booth at the Phoenix convention center and a host at the SLAC/Stanford Point of Presence at the Palo Alto Internet eXchange (PAIX), as well as hosts at StarLight in Chicago. Because we had access to these links for only a short time (~3 days), and because the emphasis was on demonstrating maximum throughput for the SC03 Bandwidth Challenge, these measurements are necessarily incomplete; however, some of the tentative results are felt to be worth reporting.

Setup

All hosts used Intel PRO/10GbE LR NICs plugged into 133MHz, 64-bit PCI-X slots, and ran Linux 2.4.19 or more recent (the most recent was 2.4.22). At PAIX there was a Dell PowerEdge 2650 with two 3.06GHz Xeon CPUs, connected directly to a Cisco 15540 DWDM multiplexer at 10Gbits/s. The 10Gbits/s wavelength was carried to Los Angeles by CENIC, where it was plugged into a 10GE interface on a Cisco HPR router. On the other side of the router an OC192 POS interface sent the signal over a Level(3) circuit via San Diego to Phoenix, where it was plugged into a Juniper router at SCinet, and thence through a Force10 E1200 to a Cisco 6506 at the SLAC booth. Three dual-CPU Dell 2650s with Intel 10GE NICs were plugged into the SLAC booth Cisco 6506; two had 3.06GHz CPUs and the third had 2.4GHz CPUs. The PAIX/LA/Phoenix link was dedicated to the SLAC and Caltech booth traffic. The route appeared as:
[root@antonia ~]# traceroute 137.164.27.130
traceroute to 137.164.27.130 (137.164.27.130), 30 hops max, 38 byte packets
 1  B1.211.sc03.org (140.221.198.1)  0.203 ms  0.129 ms  0.119 ms
 2  scinet-211-C.sc03.org (140.221.240.61)  2.016 ms  1.405 ms *
 3  core-rtr-2-bwc-rtr-1.sc03.org (140.221.255.206)  0.306 ms  0.258 ms  0.253 ms
 4  slac-core-rtr-2.sc03.org (140.221.255.178)  9.851 ms  9.635 ms  9.644 ms
 5  hpr-slac-sc03--lax-hpr.cenic.net (137.164.27.130)  17.268 ms  17.252 ms  17.248 ms

A second 10Gbits/s link was also used, via the shared Abilene backbone, to StarLight in Chicago. At StarLight there was an HP Integrity rx2600 system with dual 1.5GHz Itanium CPUs and 8GB RAM. The route appeared as:
[root@iphicles root]# /usr/sbin/traceroute 192.150.94.252
traceroute to 192.150.94.252 (192.150.94.252), 30 hops max, 38 byte packets
 1  140.221.198.1 (140.221.198.1)  0.455 ms  0.200 ms  0.191 ms
 2  140.221.240.69 (140.221.240.69)  2.075 ms  1.394 ms *
 3  140.221.255.77 (140.221.255.77)  0.383 ms  0.385 ms  0.334 ms
 4  140.221.255.170 (140.221.255.170)  9.659 ms  9.629 ms  9.621 ms
 5  198.32.8.95 (198.32.8.95)  17.138 ms  17.112 ms  17.111 ms
 6  198.32.8.103 (198.32.8.103)  52.442 ms  52.425 ms  58.185 ms
 7  198.32.8.80 (198.32.8.80)  325.548 ms  344.866 ms  329.918 ms
 8  198.32.8.76 (198.32.8.76)  65.599 ms  65.422 ms  65.476 ms
 9  192.150.94.252 (192.150.94.252)  65.526 ms  65.478 ms  65.468 ms

Even though the Abilene backbone was shared, the typical cross-traffic was a few hundred Mbits/s, so to first order these 10Gbits/s links were dedicated to our use, unlike the production network results reported in TCP Stacks Testbed.

We also had access, via Abilene and Chicago, to an HP Integrity rx2600 system with dual 1.5GHz CPUs and 4GB RAM at NIKHEF in Amsterdam. The route to this host was as follows:
[root@antonia ~]# /usr/sbin/traceroute 192.150.94.34
traceroute to 192.150.94.34 (192.150.94.34), 30 hops max, 38 byte packets
 1  B1.211.sc03.org (140.221.198.1)  0.198 ms  0.130 ms  0.118 ms
 2  scinet-211-B.sc03.org (140.221.240.69)  0.928 ms  1.781 ms  1.459 ms
 3  core-rtr-1-a-bwc-rtr-1-a.sc03.org (140.221.255.77)  0.303 ms  0.294 ms  0.253 ms
 4  abilene-core-rtr-1.sc03.org (140.221.255.170)  10.727 ms  9.677 ms  9.615 ms
 5  snvang-losang.abilene.ucaid.edu (198.32.8.95)  24.372 ms  17.069 ms  21.460 ms
 6  kscyng-snvang.abilene.ucaid.edu (198.32.8.103)  52.696 ms  57.096 ms  59.529 ms
 7  * iplsng-kscyng.abilene.ucaid.edu (198.32.8.80)  251.159 ms *
 8  * * chinng-iplsng.abilene.ucaid.edu (198.32.8.76)  314.288 ms
 9  rc-lab-11.nc3a.nato.int (192.150.94.34)  174.666 ms  174.583 ms  174.578 ms

Methodology

We set up the sending hosts at SC2003 with the Caltech FAST TCP stack and with the DataTAG multi-TCP stack, which allowed dynamic selection (without a reboot) among the standard Linux TCP stack (New Reno with fast retransmit), the Manchester University implementation of High Speed TCP (HS-TCP) and the Cambridge University Scalable TCP stack. By default we set the Maximum Transmission Unit (MTU) to 9000Bytes and the transmit queue length (txqueuelen) to 2000 packets.
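As an illustration, the per-host defaults described above amount to something like the following (a minimal sketch: the interface name eth1 and the socket-buffer limits are assumptions, and the runtime switch among Reno, HS-TCP and Scalable is provided by the DataTAG patch itself and is not shown):

# Jumbo frames and a deeper transmit queue on the 10GE interface
/sbin/ifconfig eth1 mtu 9000 txqueuelen 2000
# Raise the kernel socket-buffer limits so that the TCP windows of up to
# 32MBytes used in the tests below are actually attainable
/sbin/sysctl -w net.core.rmem_max=33554432
/sbin/sysctl -w net.core.wmem_max=33554432
/sbin/sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
/sbin/sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"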

We made two sets of multi-stack measurements.

  1. The first, chronologically, was made during our Bandwidth Challenge demonstration time slot (17:00-18:30 on November 19, 2003). The measurements were made between the SLAC/FNAL booth in Phoenix and PAIX, Chicago and Amsterdam. During the demonstration, hosts in the Caltech and SLAC/FNAL booths were using the LA/PAIX and Abilene links simultaneously, with little attempt at coordination.
  2. The second set of measurements started just before midnight on Wednesday 19th November. This was a more controlled set of measurements with no attempt at a demonstration. Because of the late hour and the greater care given to scheduling, there was negligible non-test cross-traffic. In this set each test ran for 1200 seconds with a given stack and a fixed maximum window size (see the sketch after this list).
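The source does not name the tool used to generate the test traffic; as a minimal sketch, a single 1200-second test with a fixed maximum window might look like the following iperf invocation (the host name is a placeholder and the tool choice is an assumption, not taken from the measurements):

# RECEIVER is a placeholder for the far-end host (PAIX, Chicago or Amsterdam)
RECEIVER=receiver.example.net
# On the receiving host, a matching server would run: iperf -s -w 16M
# On the sending host at the SLAC booth: one stream, 1200 seconds,
# a 16MByte maximum window and interim reports every 5 seconds
iperf -c $RECEIVER -t 1200 -i 5 -w 16M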

Results - Measurement set 1

The effects of the Caltech/SLAC/FNAL bandwidth demonstration on the external router 10Gbits/s interfaces to LA/PAIX, Abilene and TeraGrid can be seen below:

Similar results are seen by looking at the Force10 interfaces to the SLAC booth. More details are available on the measurements with Scalable, HS-TCP and stock TCP.

Results - Measurement set 2

The effects of the measurements on the SCinet router external interfaces (facing LA/PAIX and Abilene) are shown below. The stacks used and the window sizes are labeled. In the case of HS-TCP with an 8MByte window and Scalable with a 16MByte window, we changed the MTU from 9000Bytes to 1500Bytes half-way through each test. To start the FAST tests we had to reboot the sending host, and due to an oversight the FAST TCP measurements were all made with an MTU of only 1500Bytes. Also, following the reboot the txqueuelen was set to 100 packets.
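For concreteness, the mid-test MTU change and the transmit queue adjustment described above correspond to commands like these (the interface name eth1 is an assumption):

# Drop the sending interface from jumbo to standard frames half-way through a test
/sbin/ifconfig eth1 mtu 1500
# Following the reboot the transmit queue was back at only 100 packets;
# it could have been restored with:
/sbin/ifconfig eth1 txqueuelen 2000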

PAIX (17 ms RTT)

On the Phoenix to PAIX link we used maximum window sizes of 8MBytes, 16MBytes and 32MBytes. This bracketed the nominal optimum window size calculated from the bandwidth-delay product (17ms * 10Gbits/s) of ~20MBytes. For the PAIX link, all the tests were made with a single TCP stream. For Reno, HS-TCP and Scalable there was little observable difference in behavior between the stacks:
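The bandwidth-delay products quoted here and in the Chicago section below are easy to check with simple shell arithmetic (nothing beyond the RTTs and the 10Gbits/s line rate given in the text):

# window ~= bandwidth * RTT, with 10Gbits/s = 1.25e9 Bytes/s
echo $(( 10000000000 / 8 * 17 / 1000 / 1000000 )) MBytes   # PAIX, 17ms RTT -> ~21 MBytes
echo $(( 10000000000 / 8 * 65 / 1000 / 1000000 )) MBytes   # Chicago, 65ms RTT -> ~81 MBytes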

The 4.3Gbits/s limit was slightly less than the ~5.0Gbits/s achieved with UDP transfers in the lab between back-to-back 3.06GHz Dell PowerEdge 2650 hosts. The limitation in throughput is believed to be due to CPU factors (CPU speed, memory speed or the I/O chipset). The relative decrease in throughput in going from a 9000Byte MTU to a 1500Byte MTU was roughly proportional to the reduction in MTU size. This may be related to the extra CPU cycles required to process six times as many, but six times smaller, packets. Back-to-back UDP transfers in the lab between 3.06GHz Dell PowerEdge 2650 hosts achieved about 1.5Gbits/s, or about twice the 700Mbits/s achieved with the SC03 long-distance TCP transfers.

Chicago (65 ms RTT)

The BDP indicates a window size of about 80MBytes is needed (65ms * 10Gbits/s). With a single Reno stream we only had time to try windows of 8MBytes and 16MBytes, and achieved stable average throughputs of only 767Mbits/s and 1530Mbits/s respectively. With 10 Reno streams the throughput was much less stable (see the 16384KByte and 32768KByte graphs). With the 32768KByte window we sustained stable peaks of over 3.9Gbits/s for several minutes (the Stability Index for the peak periods was about 12%) and an average throughput of about 3Gbits/s (Stability Index 39%). For the 16384KByte window there were short peaks of ~2.5Gbits/s and an average throughput of ~970Mbits/s (Stability Index 72%). From the SCinet router utilization data there was no evidence of congestion within SCinet or from SCinet to Abilene; however, there may have been other sources of congestion elsewhere on the path (though the RTTs also show no evidence of congestion) or in the host at Chicago.
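For the multi-stream Chicago runs, the hypothetical iperf sketch above extends naturally with parallel streams (again purely illustrative, not the actual command used):

# 10 parallel streams, 32MByte socket buffers, 1200 seconds
# ($RECEIVER is again a placeholder for the Chicago host)
iperf -c $RECEIVER -t 1200 -i 5 -w 32M -P 10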

The configurations and results are summarized in the table below; the Stability column appears to be the standard deviation expressed as a percentage of the average throughput. Clicking on the stack name in the first column will display the time-series result of the measurement.
Destination (RTT)  Start time (PST)  Stack  Streams  Window  MTU  txqueuelen  Throughput (Mbits/s)  Stdev (Mbits/s)  Stability
PAIX (17ms) 0:08 Reno 1 8192KB 9000B 2000 2892 76 2.63%
PAIX (17ms) 0:35 Reno 1 16384KB 9000B 2000 4328 8.4 0.19%
PAIX (17ms) 1:00 Reno 1 32768KB 9000B 2000 4293 8.6 0.2%
PAIX (17ms) 1:11 hs-tcp 1 8192KB 9000B 2000 2899 6.1 0.21%
PAIX (17ms) 1:22 hs-tcp 1 8192KB 1500B 2000 335 49 14.67%
PAIX (17ms) 1:58 hs-tcp 1 16384KB 9000B 2000 4287 342 0.09%
PAIX (17ms) 2:20 hs-tcp 1 32768KB 9000B 2000 4343 15.7 0.36%
PAIX (17ms) 2:45 scalable 1 8192KB 9000B 2000 2900 5.3 0.18%
PAIX (17ms) 3:08 scalable 1 16384KB 9000B 2000 4304 2.7 0.06%
PAIX (17ms) 3:35 scalable 1 16384KB 1500B 2000 676 70 10.36%
PAIX (17ms) 3:35 scalable 1 32768KB 9000B 2000 4314 7.8 0.18%
PAIX (17ms) 4:42 FAST 1 8192KB 1500B 100 349 26 7.45%
PAIX (17ms) 5:03 FAST 1 16384KB 1500B 100 693 45 6.49%
Chicago (65ms) 5:35 Reno 1 8192KB 9000B 2000 767 3.8 0.49%
Chicago (65ms) 6:01 Reno 1 16384KB 9000B 2000 1530 11.5 0.75%
Chicago (65ms) 6:25 Reno 10 16384KB 9000B 2000 2500 max / 972 avg 699 72%
Chicago (65ms) 6:55 Reno 10 32768KB 9000B 2000 4000 max / 3063 avg 449/1198 12% / 39%


Comments to iepm-l@slac.stanford.edu