FAQ on Internet2 Land Speed Record

FAQ put together by: Les Cottrell, SLAC



The FAQ was created out of responses to the NANOG mailing list.

Who did this?

Who did this work anyway?
The record-setting team consisted of members from the Nationaal Instituut voor Kernfysica en Hoge-Energiefysica (NIKHEF), the Stanford Linear Accelerator Center (SLAC), the California Institute of Technology (Caltech) and the Faculty of Science of the Universiteit van Amsterdam (UvA). In setting the new record, the team used the advanced networking capabilities of TeraGrid, StarLight, SURFnet and NetherLight, and the wide area optical networking links provided by Level 3 Communications (Nasdaq: LVLT) for the SC2002 event and by Cisco Systems to SLAC and Caltech. The team also received indispensable support from the CERN staff. For more details see the Internet2 publicity release.
Antony Antony of NIKHEF made the submitted measurement and led the writing of the submission.

What is different about this?

Give 'em a million dollars, plus fiber from here to anywhere and let me muck with the TCP algorithm, and I can move a GigE worth of traffic too
You are modest in your budgetary request. Just the Cisco router (GSR 12406) we had on free loan listed at close to a million dollars, and the OC192 links from Sunnyvale to Chicago alone would have cost what was left of the million per month.
We used a stock TCP (the Linux kernel TCP). We did, however, use jumbo frames (9000-byte MTUs).
I can only hope that the researchers actually spent $2000-$4000 on moving the gigabit of data, pocketed the rest, and are now living in a tropical foreign country with lots and lots of drugs and women, because anything else would just be too sad for me to contemplate.
We are still living in California and Amsterdam, and at least the weather in California is great today.
What am I missing here, there's OC48=2.4Gb, OC192=10Gb ...
We were running host to host (end-to-end) with a single stream using common off-the-shelf equipment. There are not many (I think none) >1GE host NICs available today that are in production (i.e. available without signing a non-disclosure agreement). There is one advertised by Alacritech; however, when we asked for prices, availability and any benchmarks, we received the answer "Alacritech's 10Gig initiative is not far along to provide you with the data you have requested" on 3/10/2003.
Production commercial networks ... Blow away these speeds on a regular basis.
See the above remark about end-to-end application to application, single stream.
So, you turn down/off all the parts of TCP that allow you to share bandwidth ...
We did not mess with the TCP stack, it was stock off the shelf.
... Mention that "Internet speed records" are measured in terabit-meters/sec.
You are correct, this is important, but reporters want a sound bite and typically only focus on one thing at a time. I will make sure next time I talk to a reporter to emphasize this. Maybe we can get some mileage out of Petabmps (Peta bit metres per second) sounding like "petty bumps".
I'm going to launch a couple DAT tapes across the parking lot with a spud gun and see if I can achieve 923 Mb/s!
The spud gun is interesting; given the distances, a 747 freighter packed with DST tapes or disks is probably a better idea. Assuming we fill the 747 with, say, 50 GByte tapes (disks would probably be better), then if it takes 10 hours to fly from San Francisco (BTW Sunnyvale is near San Francisco, not near LA as one person talking about retiring to better weather might lead one to believe) the bandwidth is about 2-4 Tbits/s. However, this ignores the reality of labelling, writing the tapes, removing them from the silo robot, packing, getting to the airport, loading, unloading, getting through customs etc. In reality the latency is closer to 2 weeks. Even worse, if there is an error (heads not aligned etc.) then the retry latency is long and the effort involved considerable. Also the network solution lends itself much better to automation; in our case it saved a couple of full-time-equivalent people at the sending site who would otherwise distribute the data on a regular basis to our collaborator sites in France, the UK and Italy. A rough sketch of this arithmetic follows below.
Just the fact that you need a ~20 megabyte TCP window size to achieve this (feel free to correct me if I'm wrong here) seems kind of unusual to me.
It is true that large windows are needed. To approach 1 Gbits/s we require 40 MByte windows. If this is going to be a problem, then we need to raise questions like this soon and figure out how to address them (add more memory, use other protocols etc.). In practice, to approach 2.5 Gbits/s requires 120 MByte windows.
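As a back-of-the-envelope sketch of where the 2-4 Tbits/s figure comes from (an illustration only: the 747 payload and per-cartridge mass used below are assumptions, not measured figures):

# Rough "747 full of tapes" bandwidth estimate (Python). The payload and
# per-cartridge mass are illustrative assumptions.
payload_kg = 100000        # assumed usable cargo capacity of a 747 freighter
tape_kg = 0.25             # assumed mass of one cartridge
tape_gbytes = 50           # 50 GByte tapes, as in the text above
flight_hours = 10          # flight time, as in the text above

tapes = payload_kg / tape_kg
total_bits = tapes * tape_gbytes * 1e9 * 8
print(f"{total_bits / (flight_hours * 3600) / 1e12:.1f} Tbits/s")
# ~4 Tbits/s while in flight, but this ignores the ~2 weeks of labelling,
# packing and customs latency described above.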
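For reference, a minimal sketch of the bandwidth-delay product arithmetic behind these window sizes, assuming the ~180 ms round-trip time quoted elsewhere in this FAQ (in practice roughly twice the raw product was needed, since kernel buffer accounting and headroom eat into the nominal window):

# Bandwidth-delay product: the window needed to keep a long, fat pipe full.
def bdp_mbytes(rate_bps, rtt_s):
    return rate_bps * rtt_s / 8 / 1e6

rtt = 0.18                                  # ~180 ms round trip
for rate in (1e9, 2.5e9):
    print(f"{rate / 1e9:.1f} Gbits/s -> {bdp_mbytes(rate, rtt):.0f} MBytes")
# ~22 MBytes at 1 Gbits/s and ~56 MBytes at 2.5 Gbits/s of raw BDP; the
# 40/120 MByte figures above include roughly a factor of two of headroom.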
Spending millions of (probably taxpayer) dollars to win a meaningless record is unethical, IMHO.
The links were not simply set up to make a record attempt. The links and equipment are part of a large peer-reviewed European research effort (see DataTag); a demonstration set up by SURFnet and NetherLight (in the Netherlands) for iGrid2002, and then reused for SC2002; plus loans from Cisco and Level(3). So the impact on the US taxpayer was minimal. The record was incidental to the rest of the work done. However, what makes the press is not the detailed research work, but rather something unique and easier to explain. From another field, the capturing of the air speed record by the SR-71 Blackbird was incidental to the purpose for which it was built.
There is no need to waste funding buying uber-fast routers or GigE links around the globe just to learn how to tune stacks or apps. If high-speed TCP research is what you're doing, rig up a latency generator in your laboratory and do your tests that way, just like the TCPSAT folks.
Good point. Following up on and driven by the work leading up to and following the Land Speed Record, some of the Caltech people collaborating on this record, together with collaborators from SLAC and elsewhere, are proposing a "WAN in the Lab" that can be used for just such testing. This saves on leasing fibers, but there are still considerable expenses to run at 10 Gbit/s rates (cpus, NICs, optical multiplexing equipment etc.). It is also a much more controlled environment, which simplifies things. On the other hand it misses out on the real world experience, and so eventually has to be tested first on lightly used real world testbeds, then on advanced research networks and finally on production networks, to understand how issues such as fairness, congestion avoidance, robustness to poor implementations or configurations etc. really work.

Who needs it?

What kind of production environment needs a single TCP stream of data at 1Gbits/s over a 150ms latency link?
Today High Energy Particle Physics needs hundreds of Mbits/s between California and Europe (Lyon, Padova and Oxford) to deliver data on a timely basis from an experiment site at SLAC to regional computer sites in Europe. Today on production academic networks (with sustainable rates of 100 to a few hundred Mbits/s) it takes about a day to transmit just over a TByte of data, which just about keeps up with the data rates. The data generation rates are doubling per year, so within 1-3 years we will need speeds like those in the record on a production basis. We needed to ensure we can achieve the needed rates, and to learn whether we can do it with off-the-shelf hardware, how the hosts and OS need configuring, how to tune the TCP stack and how newer stacks perform, what the requirements for jumbo frames are, etc. Besides High Energy Physics, other sciences are beginning to grapple with how to replicate large databases across the globe; such sciences include radio-astronomy, human genome, global weather, seismic ... (See the sketch after this answer for the sustained rate a TByte per day implies.)
Just trying to see if you can give some input on how this will change the entire computing world.
Initially the main changes will be for data intensive sciences such as Elementary Particle Physics, Nuclear Physics, Global Weather Prediction, Fusion, astrophysics, biology (in particular the human Genome), and seismology, where there is a critical need to share large amounts of data across worldwide collaborations. In many cases this is done today by shipping truck or plane loads of data. While a 747 full of high density tapes or disks has a high bandwidth (Tbits/second assuming 10 hours to fly San Francisco to Geneva), it also has high latency (typically 2 weeks to write the data, dismount, label, package, get to the shipper, ship, get through customs and reverse the process, and even longer if there are errors). This puts the scientists not at the data source at a 2 week disadvantage, and is also people-power expensive (a cost that is not decreasing with time, unlike networking). Using the network, the latency can be reduced to a day or less (today we are shipping about a TByte per day by network from California to France, the UK and Italy), utilizing 100 Mbits/s links. Also the processes can be easily automated, resulting in much less grunt work. With the new performance we will be able to transmit the same data in a few hours or keep up with the increasing data rates (roughly doubling per year).
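For scale, a quick sketch of the sustained rate that "a TByte per day" implies:

# Sustained rate needed to move about 1 TByte per day.
tbyte_bits = 1e12 * 8
seconds_per_day = 24 * 3600
rate_mbps = tbyte_bits / seconds_per_day / 1e6
print(f"{rate_mbps:.0f} Mbits/s around the clock")   # ~93 Mbits/s
# With volumes doubling each year, this passes 1 Gbits/s within 3-4 years.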

With Internet performance historically improving at about a factor of 2 per year, Universities and companies with 45 or 155 or 622 Mbits/s links to the Internet today will have 100 times that by 2010, so whole new avenues for sharing data and collaborating will be opened up. This may be expected to have quite dramatic effects on industries like aerospace, medicine, media distribution etc. (e.g. think of hotels able to download movies in less than a minute, movie production shops in New York sharing movies with Hollywood sites etc., let alone movies being as easy to share on the network as music is today).

High speeds are not important. High speeds at a *reasonable* cost are important. What you are describing is a high speed at an *unreasonable* cost.
(Response from David G. Andersen [dga@lcs.mit.edu]):) The bleeding edge of performance in computers and networks is always stupidly expensive. But once you've achieved it, the things you did to get there start to percolate back into the consumer stream, and within a few years, the previous bleeding edge is available in the current O(cheap) hardware.
A Cisco 7000 used to provide the latest and greatest performance in its day, for a rather considerable cost. Today, you can get a box from Juniper for the same price you paid for your 7000 that provides a few orders of magnitude more performance.
But to get there, you have to be willing to see what happens when you push the envelope. That's the point of the LSR, and a lot of other research efforts.
(From Les Cottrell:) High speed at reasonable cost is the end-goal. However, it is important to be able to plan for when one will need such links, to know what one will be able to achieve, and for regular users to be ready to use them when they are commonly available. This takes some effort up front to achieve and demonstrate.
(From Joe St Sauver:) And that's the key point that I think folks have been missing so far about all this. Internet2 provides excellent connectivity to folks who generally have a minimum of switched 10Mbps ethernet connectivity, and routinely switched 100Mbps connectivity. However, if you look at the weekly Abilene netflow summary reports (see http://netflow.internet2.edu/ , or jump directly to a particular report such as http://netflow.internet2.edu/weekly/20030224/ ) you will see that for bulk TCP flows, the median throughput is still only 2.3Mbps. 95th%-ile is only ~9Mbps. That's really not all that great, throughput wise, IMHO.
Add one further element to that: user expectations. Users hear, "Wow, we now have an OC12 to Abilene [or gigabit ethernet, or an OC48, or an OC192], I'm going to be able to *smoke* my fast ethernet connection ftping files from !" ... but then they find out that no, in fact, if they are seeing 100Mbps for bulk TCP transfers, then they are in the true throughput elite, the upper 1/10th of 1% of all I2 traffic.
SO! The I2 Land Speed Record is not necessarily about making everyone be able to do gigabit-class traffic across the pond, it is about making LOTS of faculty be able to do 100Mbps at least across the US.
Empirically, it is clear to me that this "trivial" accomplishment, e.g., getting 100Mbps across the wide area, is actually quite hard, and it is only by folks pushing really hard (as Cottrell and his colleagues have) that the more mundane throughput targets (say, 100Mbps) will routinely be accomplished.
> On Sat, 8 Mar 2003, Richard A Steenbergen wrote:
> > The amount of arrogance it takes to declare a land speed "record" when
> > there are people out there doing way more than this on a regular
> > basis.
> > Single stream at 900mbs over that distance? Where?
Talk to folks that deal with radio telescopes.
We have been talking to the radio astronomy people. We are aware they have such needs, however, I am unclear whether they have succeeded in transmitting single stream TCP application to application throughput of 900Mbits/s over 10,000km on a regular basis. Perhaps you could point me to whom to talk to. I am aware of the work of Richard Hughes-Jones of Manchester University and others and the Radio Astronomy VLBI Data Transmission (see for example http://www.hep.man.ac.uk/~rich/VLBI_web/) since we have shared notes and talked together a lot on the high performance issues. My understanding is that for today they use special high performance tapes to ship the data around, and are actively looking at using the network.
Could we use this new technology of transmission in our business?
The record was achieved with commercial off-the-shelf technology. The PCs (Intel P4), the operating system (Linux), the network interface cards (SysKonnect), the routers (Cisco & Juniper), switches, fibers, and long haul links (Level(3)) are all available today. The work we had to do was to select the right components, configure everything for high performance and make it all work.
Internet capacities and performance are increasing by roughly a factor of 2 per year, thus in 3.5 years you can expect a factor of 10 increase in performance and a factor of 100 in ~7 years. In our case the bottleneck was 1 Gbits/s, so if you have a bottleneck today of 100 Mbits/s then in 3.5 years you may have performance similar to what we were using. If you only have a 10 Mbits/s bottleneck then it may be expected to take twice as long.
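A one-liner behind the 3.5 and ~7 year figures, assuming the doubling-per-year trend holds:

# Years needed for a given speed-up factor if capacity doubles every year.
import math
for factor in (10, 100):
    print(f"x{factor}: ~{math.log2(factor):.1f} years")
# ~3.3 years for a factor of 10 and ~6.6 years for a factor of 100.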
The question will be how great is your need to ship large volumes of data quickly, can you make a business case for the expense, and how soon is it worth the cost?
Is this new technology available soon for private enterprise in Europe?
The technology was all commercial off the shelf hardware and software that is available in Europe.

We need more details

What would be helpful is a paper detailing costs of and results from the LSR.
What would be even more helpful is tailoring your results from an operational perspective and presenting it at the next NANOG meeting
You are right that we need to put together a paper explaining what we did, the dirty details, why we believe it is important etc. So far we have been too busy just utilizing the test bed while it was still in place and following up with higher speeds and newer TCP stacks. Prompted by the NANOG discussions I have put together a FAQ (see http://www-iepm.slac.stanford.edu/lsr/faq.html). We will be making presentations at a meeting in San Diego (see http://www-conf.slac.stanford.edu/chep03/) later this month (and this requires a paper following the meeting); we also expect to be making invited presentations at some interested local companies in the Bay Area in the coming month, and next month related information will be presented at the PAM 2003 workshop in San Diego (see http://www.pam2003.org/).
I have not attended a NANOG before, but I am sure one of us would be more than happy to present at a NANOG Meeting. I am unclear as to how one goes about getting an invitation.
Editors note: we gave a presentation at the Salt Lake City NANOG meeting in June 2003.
By the way, how did you deal with slow start? (and loss - was there any? forgive me for not having followed the details ...)
The main thing we did was to use jumbo frames. As for slow start, it was over in about 6 seconds; see the iperf output below (not from the submission, but from a run made a little while before the actual attempt; the attempt itself was made using the rapid application rather than iperf):
hp1:~#  /usr/local/Iperf/bin/iperf  -c 198.51.111.10  -i 2 -w 32M  -l 8M -p 5009
 -t 300     
about to getaddrinfo on '198.51.111.10'
done with gai, ai_fam=2 ai_alen=16 addr=0x00000002...
------------------------------------------------------------
Client connecting to 198.51.111.10, TCP port 5009
TCP window size: 48.0 MByte (WARNING: requested 32.0 MByte)
------------------------------------------------------------
[  3] local 145.146.96.26 port 34839 connected with 198.51.111.10 port 5009
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 2.0 sec  56.0 MBytes   235 Mbits/sec
[  3]  2.0- 4.0 sec   184 MBytes   770 Mbits/sec
[  3]  4.0- 6.1 sec   240 MBytes   955 Mbits/sec
[  3]  6.1- 8.1 sec   232 MBytes   963 Mbits/sec
[  3]  8.1-10.1 sec   224 MBytes   973 Mbits/sec
[  3] 10.1-12.1 sec   240 MBytes   977 Mbits/sec
[  3] 12.1-14.0 sec   216 MBytes   949 Mbits/sec
[  3] 14.0-16.1 sec   240 MBytes   976 Mbits/sec
[  3] 16.1-18.0 sec   232 MBytes   1.0 Gbits/sec
[  3] 18.0-20.0 sec   224 MBytes   930 Mbits/sec
[  3] 20.0-22.0 sec   232 MBytes   969 Mbits/sec
[  3] 22.0-24.0 sec   232 MBytes   975 Mbits/sec
[  3] 24.0-26.1 sec   232 MBytes   961 Mbits/sec
[  3] 26.1-28.0 sec   232 MBytes   1.0 Gbits/sec
[  3]  0.0-28.6 sec   3.0 GBytes   886 Mbits/sec
I think we were lucky not to see any congestion. More recently, running with FAST TCP from the Caltech group, and with Sally Floyd's proposed High Speed TCP as implemented by the web100/net100 project, we can get over 900 Mbits/s consistently without jumbo frames (see for example http://www-iepm.slac.stanford.edu/monitoring/bulk/fast/stacks.png)
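As a rough sanity check on the ~6 second slow start seen above (a sketch under stated assumptions only: the window grows by roughly 1.5x per RTT because of delayed ACKs, starting from one jumbo segment, with host and setup overheads ignored):

# Rough slow-start duration estimate for this path; not a model of the
# actual kernel behaviour.
import math
rtt = 0.17                 # ~170 ms round trip, Sunnyvale-Amsterdam
mss = 9000                 # jumbo frame segment size used in the run
target_window = 40e6       # ~40 MBytes to fill the pipe at ~1 Gbits/s
growth_per_rtt = 1.5       # assumed window growth per RTT with delayed ACKs
rtts = math.log(target_window / mss) / math.log(growth_per_rtt)
print(f"~{rtts:.0f} RTTs, ~{rtts * rtt:.1f} s")   # ~21 RTTs, ~3.5 s
# The observed ~6 s is the same order of magnitude; connection setup and
# host-side limits plausibly account for the difference.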
What kind of difference did you see with jumbo frames as opposed to standard 1500 byte packets? I did some testing once and things actually ran slightly faster with 1500 byte packets, completely contrary to my expectations... (This was UDP and just 0.003 km rather than 10,000, though.)
Jumbo frames effectively increase the additive increase of TCP's congestion avoidance phase by a factor of 6. Thus after a congestion event, which reduces the window by a factor of 2, one can recover 6 times as fast. This is very important on large RTT, fast links where the recovery rate (for TCP/Reno) goes as MTU/RTT^2. This can be seen in some of the graphs at:
http://www-iepm.slac.stanford.edu/monitoring/bulk/fast/stacks.png or more fully at:
http://www-iepm.slac.stanford.edu/monitoring/bulk/fast/
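A small sketch of the factor of 6, using the MTU/RTT^2 recovery rate quoted above:

# TCP Reno recovery rate after a congestion event: roughly MTU/RTT^2.
def recovery_rate_mbps_per_s(mtu_bytes, rtt_s):
    return mtu_bytes * 8 / rtt_s ** 2 / 1e6

rtt = 0.18                                   # ~180 ms round trip
for mtu in (1500, 9000):
    print(f"MTU {mtu}: {recovery_rate_mbps_per_s(mtu, rtt):.2f} Mbits/s per second")
# 9000-byte frames recover 6 times faster than 1500-byte frames.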
So how much packet loss did you see? Even with a few packets in a million lost this would bring your transfer way down and/or you'd need even bigger windows.
However, bigger windows mean more congestion. When two of those boxes start pushing traffic at 1 Gbps with a 40 MB window, you'll see 20 MB worth of lost packets due to congestion in a single RTT.
A test where the high-bandwidth session or several high-bandwidth sessions have to live side by side with other traffic would be very interesting. If this works well it opens up possibilities of doing this type of application over real networks rather than (virtual) point-to-point links where congestion management isn't an issue.
We saw little congestion-related packet loss on the testbed. With big windows SACK becomes increasingly important, so one does not have to recover a large fraction of the window for a single lost packet.
Once one gets onto networks where one is really sharing the bandwidth with others, performance drops off rapidly (see for example the measurements at http://www-iepm.slac.stanford.edu/monitoring/bulk/fast/#Measurements%20from%20Sunnyvale%20to%20Amsterdam and compare them with those at http://www-iepm.slac.stanford.edu/monitoring/bulk/fast/#TCP%20Stack%20Comparisons%20with%20Single%20Streams).
One of the next things we want to look at is how the various new TCP stacks work on production Academic & Research Networks (e.g. Internet2, ESnet, GEANT, ...) with lots of other competing traffic.
I was wondering if there was any technical information regarding how and on what networking & host hardware you were able to achieve this world record?
The record was achieved using off-the-shelf high speed PCs (the one at Sunnyvale was a SuperMicro 2.4 GHz Intel Pentium 4 Xeon; see http://www-iepm.slac.stanford.edu/monitoring/bulk/sc2002/hiperf.htm#tech for more gory details) and software (OS, TCP stack, application), with both ends running Linux with the standard TCP stack. The NIC at Sunnyvale was a SysKonnect 1GE. The only non-standard thing was that we used jumbo frames (9000 Bytes). We also had to do some tuning to set large windows (e.g. we requested 32 MByte windows and were allocated 48 MByte windows by Linux). A traceroute can be found at: http://www-iepm.slac.stanford.edu/lsr/submit. See the Overall network layout for hardware details of the StarLight and Sunnyvale connections.
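As an illustration of why a requested 32 MByte window can come back larger, here is a sketch of the general Linux socket-buffer mechanism (the kernel doubles the value passed to setsockopt to cover bookkeeping overhead, capped by net.core.rmem_max, which must itself be raised first); this is not the exact code path iperf used:

# Request a 32 MByte receive buffer and see what the kernel actually grants.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
requested = 32 * 1024 * 1024
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"requested {requested >> 20} MBytes, kernel granted {granted >> 20} MBytes")
s.close()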
I'm looking for information on the hardware which was used for this record.
The computer servers at Sunnyvale (each 1 rack unit, 2.5 A at 120 V):
Model: ACME Server 6012PE
Motherboard: Supermicro P4DPR-I
CPU : Intel 2.4 GHz
Memory : 1 GB PC2100 DDR ECC Registered
Hard Drive : 80GB IDE, Maxtor, 7200 RPM
They ran Red Hat Linux with a 2.4.19 kernel.
At Sunnyvale, a Cisco GSR 12406 with an OC192/POS interface. Parts list from Cisco for the GSR 12406:
10X1GE-SFP-LC-B                             2
CAB-GSR6-US                                 1
CSS5-GBIC-SX=                               20
GLC-SX-MM                                   2
GLC-SX-MM=                                  2
GRP-B                                       1
GSR6/120-AC                                 1
MEM-DFT-GRP/LC-128                          1
MEM-GRP-FL20                                1
OC192/POS-SR-SC                             1
S120K5Z-12.0.21S                            1
At Amsterdam we had a similar setup, but the CPUs were a more mixed bunch:
keeshond Dual PIII 1.X GHz. 2GB RAM, Dual GigE  3C985 SK943   2.4.19
stier    Dual Xeon 2.0 GHz  2GB RAM  Dual GigE  EE1000 3C966, 2.4.19
haan     PIII      700 MHz, .5GB RAM GigE       3C985         2.4.19
HP3      Dual Xeon 2.4 GHz  1GB RAM  GigE       3c985         2.4.18
HP4      Dual Xeon 2.4 GHz  1GB RAM  GigE       3c985         2.4.18
All systems at Amsterdam were Linux Gnu/Debian Woody Kernel 2.4.18 or 2.4.19.

Figures: the Amsterdam-Chicago setup and the Sunnyvale-Chicago setup.

Why is it difficult to get high performance (> 500-600 Mbits/s) with 1500-byte MTUs and the standard (Reno) TCP?
The problem on long links is that the congestion window (with Reno/Tahoe TCP, the standard for most stacks today) only opens up by 1 MTU per RTT, so after a congestion event (when the congestion window is halved) it can take a long time to get back up to the optimal throughput (the throughput goes as y = y0 + 0.5*t*MTU/RTT^2, where t is the time, MTU = 1500 Bytes and y0 = the starting throughput). For example, for an RTT of ~180 ms it can take about 4500 seconds (over an hour!) to increase from 200 Mbits/s to 1000 Mbits/s, and there may be a congestion event (e.g. packet loss) in this time, in which case one has to start again from roughly half the current throughput. With jumbo frames it increases 6 times as fast, since the slope is proportional to the MTU, which for jumbos is 9000 Bytes.
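A worked version of the ~4500 second figure, using the same y = y0 + 0.5*t*MTU/RTT^2 relation (throughputs in bits/s):

# Time for Reno to climb from 200 Mbits/s back to 1000 Mbits/s.
def recovery_time_s(start_bps, target_bps, mtu_bytes, rtt_s):
    slope = 0.5 * mtu_bytes * 8 / rtt_s ** 2      # bits/s gained per second
    return (target_bps - start_bps) / slope

rtt = 0.18                                        # ~180 ms round trip
print(f"1500-byte MTU: {recovery_time_s(200e6, 1e9, 1500, rtt):.0f} s")  # ~4300 s
print(f"9000-byte MTU: {recovery_time_s(200e6, 1e9, 9000, rtt):.0f} s")  # ~720 s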
"I suspect that the high performance might be due to a large buffer space some place on DropTail router network. In the droptail router with buffer space K, the congestion window drops down to (B*D + T)/2. If T is as much as B*D, then TCP drops its window down to B*D which is the capacity of the link. If RED is used, the result will be quite different, I believe because packet drop will happen well before the buffer fills up. Also DropTail with a large buffer adds a lot of congestion delays.
Could somebody confirm the amount of router buffer space at the bottleneck of this path and whether RED is being used? whether the buffer space is pooled or dedicated to a port?" Injong Rhee
"There is very little buffer space on Cisco 7600 router. By default the size of the buffer is 40 packets and the maximum value is 4096. (40 packets at 10 Gbps => 0.048ms of queuing delay!!!). Next generation of 10 GE Cisco modules will have 150 ms of buffer space. I think that Juniper router have large buffer spaces too.
Ijong is right; the high performance is due to a large buffer at the bottleneck. In the LSR-IPv6 record, the bottleneck was the end-host and its NIC. That's why we need a large txqueuelen on the end host. (ifconfig eth0 txqueuelen 1000000). In the case of LSR-IPv4, we have limited the TCP buffer size.
Please refer to the presentation. On slide #9 and #10 it is explained how tune TCP buffer and txqueuelen. You will see that large buffer spaces considerably affect queuing delay. The only new stack which doesn't affect delay is FAST. FAST maintains low queuing delay. That's one reason why I am a strong supporter of Steven's work."
Other stacks (HSTCP, scalable and GridDT) fill the buffer up before loosing a packet and reducing the congestion window. If you have small buffer the end-to-end delay is not too much affected, if you have large buffer you need QoS mechanisms in order to maintain low queuing delay for real time traffic.
The effect on queuing delay is definitely something to take into account in the evaluation of TCP stack." Sylvain Ravot.
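For reference, a small sketch of the queuing delay implied by a given bottleneck buffer, matching the 0.048 ms figure quoted above:

# Queuing delay added by a full bottleneck buffer.
def queue_delay_ms(packets, packet_bytes, link_bps):
    return packets * packet_bytes * 8 / link_bps * 1e3

print(f"{queue_delay_ms(40, 1500, 10e9):.3f} ms")    # default 40-packet buffer: 0.048 ms
print(f"{queue_delay_ms(4096, 1500, 10e9):.1f} ms")  # 4096-packet maximum: ~4.9 ms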

Conclusion

I agree this should not be about some jocks beating a record. In fact the record was incidental to the other work that was done to set the stage for the record-making attempt. I do think it is important to catch the public's attention as to why high speeds are important and that they are achievable today application to application (it would also be useful to estimate when such speeds will be available to universities, large companies, small companies, the home etc.). For techies it is important to start to understand the challenges the high speeds raise, e.g. cpu and bus speeds, cpu and router memories, bugs in TCP, OS, application etc., new TCP stacks, new (possibly UDP based) protocols such as tsunami or iSCSI or FCS over TCP or SCP ..., the need for 64 bit counters in monitoring, effects of the NIC card, jumbo frame requirements etc., and what is needed to address them. It is also important to try and put it in meaningful terms (such as 2 full length DVD movies in a minute, which could also increase the "cease and desist" legal messages shipped ;-)).

Hope that helps, and thanks to you guys in the NANOG for providing today's high speed networks.