
10GE end-to-end TCP tests

Contents: Setup | Results (2/26/03) | David's Measurements 2/23/03 | Fabrizio's Measurements | Cheng's Measurements 2/27/03 | Sylvain's Measurements 2/28/03 | Next Steps | Cheng's Results 2/27/03 | Sylvain's Results 2/28/03 | SNV to CHI | Fabrizio's Results SNV to GVA 3/3/03

Setup

The following machines have Intel 10GE interfaces.
 "haste" machines (Dell PowerEdge 2650s from LANL)
 198.51.111.94: Sunnyvale in haste3 (removed 2/28/03) -> gw 198.51.111.93 (Cisco GSR 12406)
 192.5.175.133: Chicago in haste2 (removed 2/28/03) -> Juniper T640 

 Disk array machines from Caltech/CERN
 198.51.111.90: Sunnyvale in cit-slac-19 eth1 -> gw 198.51.111.89 (Cisco GSR 12406)
 192.91.236.245: Chicago in v12chi -> gw 192.91.236.246 (Cisco 7609)
 192.91.239.213: Geneva in w02gva -> gw 192.91.239.214
The buffer in the bottleneck router (the Cisco 7609 at Geneva) was increased from 2048 packets to the maximum of 4096 packets early on 2/27/03.
10GE router monitoring at GVA
CERN POP at StarLight, Map of StarLight
Intel 10GE card front, back, Sample notice, Cisco 10GE GSR card, Cisco GSR at Sunnyvale, Les with GSR, note bootees for cleanliness
Hardware at Sunnyvale: specifications, SuperMicro Documentation, SuperMicros at Sunnyvale, Fabrizio booting SuperMicro
Configurations: SNV19, GVA2, haste3.
People: SLAC LANL team Left to Right: Jerrod Williams, Fabrizio Coccetti, Les Cottrell, Connie Logg, Eric Weigle (LANL), I-Heng Mei, Jiri Navratil.

Results, we have (2/26/03):

 1.2Gbps with 1500B MTU & FAST back to back between 198.51.111.90 and haste3 (198.51.111.94) [Eric]
 1.7Gbps with 8160B MTU & FAST back to back between 198.51.111.90 and haste3 (198.51.111.94) [Eric]

 Can send UDP data at 1.9 Gbps from GVA (the maximum sending rate we could reach with iperf) [Sylvain]. 
 Sylvain reported "I have very poor performance from CERN. I can only send
 UDP traffic at 30 Mbps from the 10 GbE interface."

 TCP (Reno) performance is very poor, around 320 Mbps, using jumbo frames 
 (from w02gva to a station at Chicago which has a GbE interface). Packets are lost
 when the throughput reaches 550 Mbps. [Sylvain]

 Can send TCP traffic at 2.2 Gbps from Chicago to the 10 GE Intel card at CERN using 
 3 TCP (Reno) streams and jumbo frames. We are close to saturating the 
 transatlantic link, which means that the Intel card installed at CERN can receive more 
 than 2.2 Gbps of TCP traffic. [Sylvain] 

 Stock TCP from SNV to GVA max was 70Mbps [Eric]

 From SNV to CHI, on the first day we installed the card (Monday), Fabrizio & Eric
 got 1.3Gbits/s with HSTCP from SNV (.90) to CHI.

 From SNV to CHI with FAST David got 1.35Gbps

 Between SNV (.90) and CHI somebody reported "I think the CPU is the
 bottleneck finally. UDP can reach only 1.5Gbps from .90 to chicago and 
 CPU was nearly fully used"

 It appears we have major problems between SNV and GVA; SNV to CHI is at
 least an order of magnitude better. We need to understand this in more 
 detail. How does it look with UDP, and how does it look between CHI & SNV?

David's measurements 2/26/03:

1. UDP tests of receiving rate (sending >= 1.5Gbps) between CHI/SNV/GVA with MTU 1500:
From \ To     CHI          SNV          GVA
CHI           n/a          840Mbps      1.8Gbps
SNV           1.5Gbps      n/a          1.5Gbps
GVA           95Mbps       736Mbps      n/a
The route from GVA (192.91.239.213 or .2) to CHI is through the 100Mbps port. (Raw logs are in: http://www.cs.caltech.edu/~weixl/feb26testing/udp/)
Something is wrong with the receiving path of the SNV 10GE port (198.51.111.90). (This does not contradict our observed 1.3Gbps from SNV to CHI, since in that test the receiving path of SNV90 carries only acks, which are very light traffic and not sensitive to loss.)
2. Stock TCP tests
             To: 198.51.111.82 (SNV19 1GE)        198.51.111.90  (SNV19 10GE)
FROM (GVA)
192.91.239.213 10GE       n/a                     <100Mbps & unstable (seem to be many losses)
192.91.239.2   1GE       >780Mbps for peak rate   n/a

    This further supports the suspicion of the problem in receiving 
path of 198.51.111.90...

-----------------------------------------------------
3. FAST TCP tests 
                To: 192.91.239.213 GVA (10GE)         192.91.239.2 (1GE)
FROM (SNV19)
198.51.111.82 (1G)              n/a                          940Mbps (plot)
198.51.111.90 (10G)            123Mbps                       n/a (plot)

Note: 
    192.91.239.213 is a 10GE and 192.91.239.2 is a 1GE, both on 
the same machine in Geneva.
    198.51.111.90 is a 10GE and 198.51.111.82 is a 1GE, both on 
the same machine in Sunnyvale.
David surmised that the bottleneck was the CPU, since UDP could only reach
1.5Gbps from .90 to Chicago and the CPU was almost fully used.
Chicago's 10GE host has only 1 CPU. Although SNV's .90 has 2 (seen as 4 with 
hyperthreading), iperf can use only 1 at a time, and that CPU was fully used.
-----------------------------------------------------
4. High-Speed TCP (web100) tests (nearly the same as FAST)
                 To: 192.91.239.213 (10GE) 192.91.239.2 (1GE)
FROM
198.51.111.82 (1G)   n/a                   940Mbps (alpha=400) (plot)
198.51.111.90 (10G)  121~126Mbps           n/a (alpha=2000) (plot)

    I tried multiple (3) flows with HSTCP from SNV 10GE to GVA 10GE. 
Each flow got 123Mbps.
    Hence, we suspect that some problem other than the congestion 
control algorithm prevents the single-flow rate from going higher.

-----------------------------------------------------
5. Tests from 198.51.111.66 to Geneva .213/.2 (198.51.111.66 is a 1GE 
port on another machine in Sunnyvale)

                        To: 192.91.239.213 (10GE)         192.91.239.2 (1GE)
FROM
198.51.111.66 (1G)          124Mbps                848 Mbps

(Yet, the UDP achieved 957Mbps from 198.51.111.66 to 192.91.239.213.)
(Console).
-----------------------------------------------------
6. TCP dump:

    Part of the tests in 5 was recorded with "tcpdump -i eth3" to capture the 
return path: 

For 198.51.111.66 -> 192.91.239.213 
(tcpdump)

For 198.51.111.66 -> 192.91.239.2 
(tcpdump)
We can see that the advertized receiving window of SNV66-GVA2 is 43906, but the advertized receiving window of connection SNV66-GVA213 is about 2746, which prevents the sender (SNV) sending faster.

The same difference appears between the connections SNV82-GVA2 and SNV90-GVA213 with the web100 kernel. (http://www.cs.caltech.edu/~weixl/feb26testing/tcpdumpfrom90/)

Anyway, I still don't know why the advertized window of connections to the 10GE card is so small -- All the other things are the same except the receiving card. Any idea?
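
A minimal sketch of why a small advertised window caps the sending rate: a sender can keep at most one receive window of data in flight per round trip, so throughput is bounded by window / RTT. The round-trip time and the window-scale shift below are assumptions chosen only for illustration (neither is given in these notes), and the tcpdump values are treated as raw, unscaled window fields.

```python
def effective_window_bytes(advertised, scale_shift):
    """Window in bytes after applying the TCP window-scale shift."""
    return advertised << scale_shift


def window_limited_rate_bps(window_bytes, rtt_s):
    """Upper bound on throughput when the receive window is the limit."""
    return window_bytes * 8 / rtt_s


ASSUMED_RTT_S = 0.180      # hypothetical SNV-GVA round-trip time
ASSUMED_SCALE_SHIFT = 10   # hypothetical window-scale option value

# Advertised windows reported from the tcpdump traces above.
for label, advertised in (("SNV66-GVA2", 43906), ("SNV66-GVA213", 2746)):
    window = effective_window_bytes(advertised, ASSUMED_SCALE_SHIFT)
    limit = window_limited_rate_bps(window, ASSUMED_RTT_S)
    print(f"{label}: window {window} bytes, limit about {limit / 1e6:.0f} Mbps")
```

Whatever the true RTT and scale factor are, the two connections share them, so the roughly 16x difference in advertised window translates directly into a 16x difference in the window-limited ceiling.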

Fabrizio's Measurements

Using HS TCP with mtu=1500, txq=100 between SNV (198.51.111.90) & GVA (192.91.239.213)
Using the window/buffer sizes on GVA listed in Les' email (almost all values = 30000000) SNV(10GE) -> GVA(10GE) : 124 Mbps
Then I changed the window/buffer values on GVA to the same values as on SNV. I created a file on GVA to store this configuration; to load it:
    user@w02gva ~# sysctl -p /etc/sysctl-slac.conf
I then got a small increase in performance: SNV(10GE) -> GVA(10GE) : 140 Mbps
Thus it appears that the unusual window/buffer configuration on the GVA host has little effect.
I confirm a throughput > 900 Mbps 
for HS(mtu1500,txq100) from SNV(1GE) -> GVA(1GE)

----
HS TCP (mtu=1500 txq=10000)
Same as above, but with a much bigger txq.
SNV(10GE) -> GVA(10GE) :  138 Mbps

---
HS TCP (mtu=8192, txq=100)
SNV(10GE) -> GVA(10GE) : IPERF HUNG, RETURNING NO RESULT
I got the same behavior for MTU 4096, 2000, 3000 (I did not try other
values)

If I run the same test (MTU:8192) from SNV to CHI(192.5.175.133)
SNV(10GE) -> CHI(10GE) : 1.3Gbps after 10 sec

----
from CHI(192.5.175.133) using 2.4.19-16mdk (I believe it is Stock TCP)
CHI(10GE) -> GVA(10GE) : 177Mbps

Sylvain's Measurements 2/27/03:

This afternoon I reached 2.15 Gbps (according to the iperf output) between Sunnyvale and CERN using TCP Reno and jumbo frames. Here are some results:

UDP test

Sunnyvale : haste3   ->  GVA w02gva  : UDP transfer at 1.8 Gbps (sending
rate = 2.18 Gbps  - loss rate = 17 %, MTU = 1500 byte - Sender CPU  load = 100%)

Sunnyvale : haste3   ->  CHI v12chi : UDP transfer at 1.9 Gbps (sending
rate = 2.21 Gbps  - loss rate = 13 %, MTU = 1500 byte - Sender CPU  load = 100%)


Sunnyvale : Cit-slac19   ->  GVA w02gva : UDP transfer at 1.86 Gbps
(sending rate = 1.86 Gbps  - loss rate = 0.4 %, MTU = 1500 byte - Sender CPU0 load = 30% CPU2 load = 100%)

Sunnyvale : Cit-slac19   ->  GVA w02chi : UDP transfer at 1.8 Gbps
(sending rate = 1.8 Gbps  - loss rate = 1.5 %, MTU = 1500 byte - Sender CPU0 load = 30% CPU2 load = 100%)
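
As a rough cross-check of the UDP figures above, the loss rate can be estimated from the sending and receiving rates as 1 - received / sent. This is only a sketch: the rates are the rounded values quoted above, so the result will not match the reported loss exactly, and where the two rates round to the same number the loss is hidden entirely.

```python
def udp_loss_fraction(sent_bps, received_bps):
    """Fraction of traffic lost, estimated from sender and receiver rates."""
    if sent_bps <= 0:
        raise ValueError("sending rate must be positive")
    return 1.0 - received_bps / sent_bps


# (label, sending rate, receiving rate, reported loss in percent), from the list above
runs = [
    ("haste3 -> GVA", 2.18e9, 1.8e9, 17.0),
    ("haste3 -> CHI", 2.21e9, 1.9e9, 13.0),
    ("cit-slac19 -> GVA", 1.86e9, 1.86e9, 0.4),
]

for label, sent, received, reported in runs:
    estimate = 100 * udp_loss_fraction(sent, received)
    print(f"{label}: estimated {estimate:.1f}% loss, reported {reported}%")
```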

TCP (RENO) test (MTU 9000) single stream

With the standard MTU, I lose too many packets to reach high throughput, and performance is very unstable (around 300 Mbps between CERN and Sunnyvale).
Sunnyvale : haste3   ->  GVA v12gva: 1.9 Gbps 

Sunnyvale : Cit-slac19-> GVA v12gva:  2.15 Gbps  (30 Gbytes in 120
seconds; sender CPU0 load = 25%, CPU2 load = 65%) 

Sunnyvale: GVA 2.37 Gbps according to the iperf output using TCP Reno and Jumbo frames
and 128MByte window (requested, 256MB allocated) for 180s.  
Consoles

Sunnyvale: GVA for 600s jumbo, 128MB (requested) got 2.37Gbps,
               for 120s jumbo, 128MB (requested) got 2.34Gbps,
               for 120s jumbo, 64MB (requested)  got 2.15Gbps 
          Consoles

Sunnyvale: GVA for 3700s transferred > 1 TByte in < 1 hour with jumbo, 1 stream, 128MB window
(requested). Console and plot.
Note that TCP performance is better than UDP performance because I am using jumbo frames.
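
One way to see why jumbo frames help when the CPU is the limit: per-packet work dominates, and larger frames mean far fewer packets per second at the same bit rate. The sketch below ignores header overhead and simply divides the bit rate by the frame size, so the numbers are approximate.

```python
def packets_per_second(rate_bps, frame_bytes):
    """Approximate packet rate at a given bit rate, ignoring header overhead."""
    return rate_bps / (8 * frame_bytes)


# Rates taken from the UDP (MTU 1500) and TCP (MTU 9000) results above.
for rate_bps, frame_bytes in ((1.86e9, 1500), (2.37e9, 9000)):
    pps = packets_per_second(rate_bps, frame_bytes)
    print(f"{rate_bps / 1e9:.2f} Gbps with {frame_bytes}-byte frames: "
          f"about {pps:,.0f} packets/s")
```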

Next Steps

We need to consider what else we need to measure/understand while we still have the SNV link. We appear to have understood/measured:
1. Jumbos with stock between SNV & GVA

2. MTU 1500 with stock between SNV & GVA
Other possibilities (not prioritized, letters are just to help with later referencing) are below. Please add others that come to your mind. Then we will need to organize who does what and make sure we do not collide during the US time slot today.
A. SNV-GVA FAST TCP optimization with & without jumbos [Cheng]

B. SNV-CHI with jumbo & stock since it is a 10G path (not 2.5G) [Eric]

C. Multi-stream tests between SNV & GVA

D. SNV-GVA HS vs Scalable vs FAST [Cheng]

E. SNV-CHI disk to disk with optimum TCP  [Julian?]

F. I am unsure we can apply for the LSR (no production routers in path, 
   hardware not generally available), but if we can then we need to study the 
   rules and make an effort (can't use iperf since data must not replicate each packet etc.)
   [Fabrizio]
We can make measurements from SNV to CHI in parallel with measurements from SNV to GVA. If we try parallel measurements then Caltech should take the disk servers (.90 at SNV), and SLAC/LANL the hastes (Dell 2550s).

Cheng's Results 2/27/03:

From SNV slac19 (198.51.111.90) to CHI haste2 (192.5.175.133) with txq=100, mainly short 120s tests.
Peak throughput     1500 MTU        4000 MTU        6000 MTU        8000-9000 MTU
2.4.19 Stock TCP    273 Mbps        1.1 Gbps        1.0 Gbps        2.2 Gbps
FAST                268 Mbps        1.1 Gbps        400 Mbps        2.2 Gbps
HSTCP               221 Mbps        1.1 Gbps        1.1 Gbps        1.4 Gbps
Scalable TCP        230 Mbps        1.1 Gbps        1.0 Gbps        2.2 Gbps
The FAST output was stable, and FAST was able to reach a stable maximum fairly quickly. The only problem I have seen with FAST was at MTUs of 5000 - 8000 between SNV and CERN. The Reno output was less stable. If one calculates the loss using the Mathis formula, loss = ((0.75 * MTU / RTT) / rate)^2, then one gets the plot below:
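
The Mathis calculation mentioned above can be sketched as follows. The constant 0.75 and the form of the formula are taken from the text; the round-trip time is a hypothetical value (the notes do not give the SNV-CHI RTT), and the MTU is converted to bits so that it is in the same units as the rate.

```python
def mathis_loss(rate_bps, mtu_bytes, rtt_s, c=0.75):
    """Loss probability implied by loss = ((c * MTU / RTT) / rate)^2."""
    return ((c * mtu_bytes * 8 / rtt_s) / rate_bps) ** 2


ASSUMED_RTT_S = 0.050  # hypothetical round-trip time, for illustration only

# (MTU in bytes, peak stock-TCP rate in bits/s) from the table above
for mtu_bytes, rate_bps in ((1500, 273e6), (4000, 1.1e9), (6000, 1.0e9)):
    loss = mathis_loss(rate_bps, mtu_bytes, ASSUMED_RTT_S)
    print(f"MTU {mtu_bytes}: implied loss about {loss:.2e}")
```

Because the rate appears squared in the denominator, doubling the achieved rate at a fixed MTU and RTT implies a loss rate four times lower.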

Sylvain's Results 2/28/03

From SNV to GVA. Single stream - Jumbo frames - TCP Reno:
----------------------------------------
Iperf : 2.38 Gbps (duration 1 Hour)  (Console)
        (couldn't record headers of the transfer because the maximum 
         file size of our linux system is too small)

        2.35 Gbps (duration 3 minutes) (We have the TCPdump file => 990 MB of headers!!!)

Rapid: 2.189 Gbps (Console) 
       (No TCPdump file, because running TCPdump in parallel affects performance)

Rapid: 2.079 Gbps (Console) (I have the TCPdump file) 

Multi stream - Jumbo frames - TCP Reno:
--------------------------------------
Iperf (3 streams): 2.35 Gbps (I have the TCP dump file) Console
I haven't any results with rapid.
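
For reference, the reported rates and transfer sizes in this log can be related with simple arithmetic: volume = rate x duration / 8. The sketch below applies this to the one-hour run above and to the earlier "30 Gbytes in 120 seconds" report. Whether "Gbytes" there means 10^9 or 2^30 bytes is not stated, so both readings are shown as assumptions.

```python
def volume_bytes(rate_bps, duration_s):
    """Bytes transferred at a constant bit rate over a duration."""
    return rate_bps * duration_s / 8


def average_rate_bps(n_bytes, duration_s):
    """Average bit rate needed to move n_bytes in duration_s."""
    return n_bytes * 8 / duration_s


one_hour = volume_bytes(2.38e9, 3600)
print(f"2.38 Gbps for 1 hour: about {one_hour / 1e12:.2f} TB")

# Two possible readings of "30 Gbytes in 120 seconds" (unit is an assumption)
for label, n_bytes in (("30 x 10^9 bytes", 30e9), ("30 x 2^30 bytes", 30 * 2**30)):
    rate = average_rate_bps(n_bytes, 120)
    print(f"{label} in 120 s: about {rate / 1e9:.2f} Gbps")
```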

SNV to CHI

The bottleneck is 10Gbits/s (cf. 2.5 Gbits/s for the GVA-CHI link), so we want to try to beat the 2.5Gbits/s barrier. We (Fabrizio & Les) tried to optimize the throughput with stock TCP from SNV (198.51.111.90) to CHI (192.91.236.245 aka v12chi). The best result was 1.3G with txq=1000 and MTU=3000 in SNV, and MTU=9000, txq=1000 in CHI, before we managed to crash the Cisco 10G interface card.

Before the crash, we tried to use values bigger than 4000 for the MTU in SNV, but we could not get any report from iperf, and judging by the CPU utilization iperf was idle. (On 2/27/03 Eric reported "1500 byte packets, haste-haste: maximum 150Mbps, average about 70. 8000+ byte packets: don't work.", also from Eric "Just ran tests from SV->Chicago between the haste machines, with traceroute -F; I start seeing losses on the last hop (192.5.175.129->192.5.175.133) at about 6000 bytes and at around 7465 bytes I get nothing. In the reverse direction (Chicago-SV), the first hop after the gateway doesn't respond, but I can get 9000 byte packets through to the other side no problem.", and Cheng reported on 2/27/03 "I tried 1500, 4000, and 6000 bytes, but I couldn't get > 8000 byte MTU to work. traceroute to 198.5.175.133 only works for the first two hops.")

When we tried to use mtu=5000 in CHI and SNV, the Cisco 10G card crashed. Adam had to pull the card out of the router and then push it in again. He did not reboot the router. The connection came up, but iperf gives no answer for mtu values equal to or bigger than 2000. mtu=1500 works (we did not try values between 1500 and 2000), but performance is quite poor (Reno, throughput=222Mbps). We wonder whether the Cisco interface needs some reconfiguration. We do not have access (password) to that router.

It looks like there is a problem with jumbo frames between SNV and CHI which has been getting worse, and is now at the stage where jumbos do not work at all on this path.

Sylvain rebooted the router and checked its configuration. Everything seems to be OK. He did not know the origin of the problem. He also checked v12chi but couldn't solve the problem.