FAQ on Internet2 Land Speed Record
FAQ put together by: Les Cottrell, SLAC
The FAQ was created out of responses to the
NANOG mailing list.
Who did this?
Who did this work anyway?
The record setting team consisted of members from the Nationaal Instituut
voor Kernfysica en Hoge-Energiefysica (NIKHEF), the Stanford Linear
Accelerator Center (SLAC), the California Institute of Technology (Caltech)
and the Faculty of Science of the Universiteit van Amsterdam (UvA). In
setting the new record, the team used the advanced networking capabilities of
TeraGrid, StarLight, SURFnet, NetherLight, and the wide area optical
networking links provided by Level 3 Communications (Nasdaq:LVLT) for the
SC2002 event and by Cisco Systems to SLAC and Caltech. The team also
received indispensable support from the CERN staff.
For more details see the
Internet2 publicity release.
Antony Antony of NIKHEF made the submitted measurement and led the
writing of the submission.
What is different about this?
Give 'em a million dollars, plus fiber from here to anywhere and let me
muck with the TCP algorithm, and I can move a GigE worth of traffic
You are modest in your budgetary request. Just the Cisco router
(GSR 12406) we had on free loan listed at close to a million dollars, and
the OC192 links just from Sunnyvale to Chicago would have cost what was
left of the million, per month.
We used a stock TCP (the standard Linux kernel TCP). We did, however, use jumbo frames.
I can only hope that the researchers actually spent $2000-$4000 on moving
the gigabit of data, pocketed the rest, and are now living in a tropical
foreign country with lots and lots of drugs and women, because anything
else would just be too sad for me to contemplate.
We are still living in California and Amsterdam, and at least the
weather in California is great today.
What am I missing here, there's OC48=2.4Gb, OC192=10Gb ...
We were running host to host (end-to-end) with a single stream using common
off the shelf equipment. There are not many (I believe none) >1GE host
NICs in production today (i.e. available without signing a
non-disclosure agreement). There is one advertised by
Alacritech. However, when we asked for prices, availability and
any benchmarks, the answer we received on 3/10/2003 was
"Alacritech's 10Gig initiative
is not far enough along to provide you with the data you have requested".
Production commercial networks ... blow away these speeds on a regular basis.
See the above remark about end-to-end, application to application, single
stream transfers.
So, you turn down/off all the parts of TCP that allow you to
share bandwidth ...
We did not mess with the TCP stack, it was stock off the shelf.
... Mention that "Internet speed records" are measured in bit metres per second.
You are correct, this is important, but reporters want a sound bite and
typically only focus on one thing at a time. I will make sure next time
I talk to a reporter to emphasize this. Maybe we can get some mileage out of
Petabmps (Peta bit metres per second) sounding like "petty bumps".
I'm going to launch a couple DAT tapes across the parking lot with a spud gun
and see if I can achieve 923 Mb/s!
The spud gun is interesting; given the distances, a 747
freighter packed with DST tapes or disks is probably a better idea. Assuming
we fill the 747 with, say, 50 GByte tapes (disks would probably be better),
and it takes 10 hours to fly from San Francisco (BTW Sunnyvale is near
San Francisco, not near LA as one person talking about retiring to better
weather might lead one to believe), the bandwidth is about 2-4 Tbits/s.
However, this ignores the reality of labelling, writing the tapes,
removing from silo robot, packing, getting to airport, loading,
unloading, getting through customs etc. In reality the latency is really
closer to 2 weeks. Even worse if there is an error (heads not aligned etc.)
then the retry latency is long and the effort involved considerable.
Also the network solution lends itself much better to automation, in
our case we saved a couple of full time equivalent people at the
sending site to distribute the data on a regular basis to our
collaborator sites in France, UK and Italy.
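The 747 arithmetic above can be sketched numerically. The tape count below is a hypothetical assumption chosen to land in the quoted range, not a figure from the record:

```python
# Back-of-envelope: "a 747 full of tapes" vs. the network, using the
# rough figures quoted above. Tape capacity and count are assumptions
# for illustration only.

TAPE_CAPACITY_BYTES = 50e9   # 50 GByte tapes, as assumed in the text
NUM_TAPES = 200_000          # hypothetical payload for a 747 freighter
FLIGHT_HOURS = 10            # San Francisco to Europe
HANDLING_DAYS = 14           # label, write, pack, ship, customs, unpack ...

payload_bits = TAPE_CAPACITY_BYTES * NUM_TAPES * 8

# Bandwidth counting only the flight time (the headline Tbit/s number):
bw_flight_only = payload_bits / (FLIGHT_HOURS * 3600)
# Effective bandwidth once the ~2 weeks of handling latency is included:
bw_with_handling = payload_bits / (HANDLING_DAYS * 86400)

print(f"flight-only bandwidth : {bw_flight_only / 1e12:.1f} Tbit/s")
print(f"with handling latency : {bw_with_handling / 1e9:.1f} Gbit/s")
```

With these assumptions the flight-only figure lands in the 2 Tbit/s range quoted above, while the two weeks of handling latency cuts the effective rate by a factor of roughly 30.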
Just the fact that you need a ~20 megabyte TCP window size to achieve this
(feel free to correct me if I'm wrong here) seems kind of unusual to me.
It is true that large windows are needed. To approach 1 Gbit/s we require
40 MByte windows. If this is going to be a problem, then we need to
raise questions like this soon and figure out how to address them (add more
memory, use other protocols etc.). In practice, to approach 2.5 Gbits/s
requires 120 MByte windows.
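The window sizes quoted are consistent with the usual bandwidth-delay-product rule of thumb. The sketch below assumes a ~170 ms RTT and a 2x headroom factor; both are illustrative assumptions, not measured values from the record:

```python
# Window sizing from the bandwidth-delay product (BDP).
# A common rule of thumb is to allow ~2x the raw BDP so the pipe
# stays full even after a congestion-window halving.

def window_bytes(bandwidth_bps: float, rtt_s: float, headroom: float = 2.0) -> float:
    """Suggested TCP window in bytes: headroom * bandwidth * RTT."""
    return headroom * bandwidth_bps * rtt_s / 8

RTT = 0.170  # ~170 ms California <-> Amsterdam (assumed for illustration)
for gbps in (1.0, 2.5):
    w = window_bytes(gbps * 1e9, RTT)
    print(f"{gbps} Gbit/s over {RTT * 1000:.0f} ms RTT -> ~{w / 1e6:.0f} MByte window")
```

This reproduces the ~40 MByte and ~120 MByte figures to within the precision of the assumptions.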
Spending millions of (probably taxpayer) dollars to win a meaningless
record is unethical, IMHO.
The links were not simply set up to make a record attempt. The links
are part of a large peer-reviewed European research effort (see
DataTag); a demonstration
set up by SURFnet, NetherLight (in the Netherlands) for
iGrid2002, and then reused for
SC2002; plus loans from Cisco
and Level(3). So the impact on the US taxpayer was minimal. The record was
incidental to the rest of the work done. However, what makes the press is
not the detailed research work, but rather something unique and
easier to explain. From another field, the capture of the air speed record
by the SR-71 Blackbird was incidental to the purpose for which it was built.
There is no need to waste funding buying uber-fast routers or GigE
links around the globe just to learn how to tune stacks or apps.
If high-speed TCP research is what you're doing, rig up a latency generator
in your laboratory and do your tests that way, just like the TCPSAT folks.
Good point. Driven by the work leading up to and following the Land
Speed Record, some of the Caltech people collaborating on this record,
together with collaborators from SLAC and elsewhere, are
proposing a "WAN in the Lab" that can be used for just such testing. This
saves on leasing fibers but there are still considerable expenses
to run at 10Gbit/s rates (cpus, NICs, optical multiplexing equipment etc.).
It is also a much more controlled environment that simplifies things. On
the other hand it misses out on the real world experience, and so eventually has
to be tested first on real world lightly used testbeds, and then on advanced
research networks and finally on production networks, to understand how issues
such as fairness, congestion avoidance, robustness to poor implementations
or configurations etc. really work.
Who needs it?
What kind of production environment needs a single TCP stream of data at
1Gbits/s over a 150ms latency link?
Today High Energy Particle Physics needs hundreds of Megabits/s between
California and Europe (Lyon, Padova and Oxford) to deliver data on a
timely basis from an experiment site at SLAC to regional computer sites in
Europe. Today on production academic networks (with sustainable rates of
100 to a few hundred Mbits/s) it takes about a day to transmit just over a
Tbyte of data, which just about keeps up with the data rates. The data
generation rates are doubling per year, so within 1-3 years we will be
needing speeds like those in the record on a production basis. We needed to
ensure we can achieve the needed rates, and whether we can do it with
off the shelf hardware, how the hosts and OS' need configuring, how to
tune the TCP stack or how newer stacks perform, what are the requirements
for jumbo frames etc. Besides High Energy Physics, other sciences are
beginning to grapple with how to replicate large databases across the
globe; such sciences include radio-astronomy, human genome, global
weather, seismic ...
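The arithmetic behind "a Tbyte a day" is easy to check, and so is the doubling estimate; the few-hundred-Mbit/s starting point below is taken from the text, not measured:

```python
# 1 TByte/day sustained corresponds to roughly 100 Mbit/s of
# continuous throughput:
def tbytes_per_day_to_mbps(tbytes: float) -> float:
    return tbytes * 1e12 * 8 / 86400 / 1e6

print(f"1 TByte/day ~= {tbytes_per_day_to_mbps(1):.0f} Mbit/s sustained")

# With data rates doubling each year, starting from the few hundred
# Mbit/s sustainable today, the record's ~923 Mbit/s single-stream
# rate is needed within a couple of years:
rate_mbps, years = 300.0, 0
while rate_mbps < 923:
    rate_mbps *= 2
    years += 1
print(f"at 2x/year growth, ~1 Gbit/s is needed in about {years} years")
```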
- Just trying to see if you can give some
inputs on how this will change the entire computing world.
- Initially the main changes will be for data intensive sciences such as
Elementary Particle Physics, Nuclear Physics, Global Weather Prediction,
Fusion, astrophysics, biology (in particular human Genome), and seismology,
where there is a critical need to share large amounts of data across
worldwide collaborations. In many cases this is done today by shipping
truck or plane loads of data. While a 747 full of high density tapes or
disks has a high bandwidth (Tbits/second assuming 10 hours to fly San
Francisco to Geneva), it also has high latency (typically 2 weeks to write
the data, dismount, label, package, get to shipper, ship, get through
customs and reverse the process, and even longer if there are errors).
This puts the scientists not at the data source at a 2 week disadvantage,
and is also people-power expensive (a cost that is not decreasing with time,
unlike networking). Using the network, the latency can be reduced to a day
or less (today we are shipping about a Tbyte per day by the network from
California to France, the UK and Italy), utilizing 100 Mbits/s links.
Also the processes can be easily automated resulting in much less grunt
type work. With the new performance we will be able to transmit the same
data in a few hours or keep up with the increasing data rates.
With Internet performance historically improving at about a factor of 2
per year, Universities and companies with 45 or 155 or 622 Mbits/s
links to the Internet today will have 100 times that by 2010, so whole
new avenues for sharing data and collaborating will be opened up. This
may be expected to have quite dramatic effects on industries like
aerospace, medicine, media distribution etc. (e.g. think of hotels
able to download movies in less than a minute, movie production shops in
New York sharing movies with Hollywood sites etc., let alone movies being
as easy to share on the network as music is today).
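The "100 times that by 2010" projection follows directly from compounding a factor of 2 per year from 2003, when this FAQ was written:

```python
# Compound growth at a factor of 2 per year, 2003 -> 2010.
start_year, end_year = 2003, 2010
growth = 2 ** (end_year - start_year)  # 7 doublings
print(f"{end_year - start_year} doublings -> {growth}x (~100x)")

# What common access links of the day would scale to:
for mbps in (45, 155, 622):
    print(f"{mbps} Mbit/s in {start_year} -> ~{mbps * growth / 1000:.1f} Gbit/s in {end_year}")
```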
High speeds are not important. High speeds at a *reasonable* cost are
important. What you are describing is a high speed at an unreasonable cost.
(Response from David G. Andersen:)
The bleeding edge of performance in computers and networks is always
stupidly expensive. But once you've achieved it, the things you did to get
there start to percolate back into the consumer stream, and within a
few years the previous bleeding edge is available at O(cheap) prices.
A Cisco 7000 used to provide the latest and greatest performance in its
day, for a rather considerable cost. Today, you can get a box from
Juniper for the same price you paid for your 7000 that provides a few
orders of magnitude more performance.
But to get there, you have to be willing to see what happens when you
push the envelope. That's the point of the LSR, and a lot of other
bleeding-edge research.
(From Les Cottrell:)
High speed at reasonable cost is the end-goal. However, it is important
to be able to plan for when one will need such links, to know what one
will be able to achieve, and for regular users to be ready to use them
when they are commonly available. This takes some effort up front to achieve.
(From Joe St Sauver:)
And that's the key point that I think folks have been missing so far about all this. Internet2 provides excellent connectivity to folks who generally have a minimum of switched 10Mbps ethernet connectivity, and routinely switched 100Mbps connectivity. However, if you look at the weekly Abilene netflow
summary reports (see http://netflow.internet2.edu/ , or jump directly to a
particular report such as http://netflow.internet2.edu/weekly/20030224/ )
you will see that for bulk TCP flows, the median throughput is still only 2.3Mbps. 95th%-ile is only ~9Mbps. That's really not all that great, throughput wise, IMHO.
Add one further element to that: user expectations. Users hear, "Wow,
we now have an OC12 to Abilene [or gigabit ethernet, or an OC48, or an
OC192], I'm going to be able to *smoke* my fast ethernet connection ftping
files from !" ... but then they find out that no, in
fact, if they are seeing 100Mbps for bulk TCP transfers, then they are in
the true throughput elite, the upper 1/10th of 1% of all I2 traffic.
SO! The I2 Land Speed Record is not necessarily about making everyone
be able to do gigabit-class traffic across the pond, it is about making
LOTS of faculty be able to do 100Mbps at least across the US.
Empirically, it is clear to me that this "trivial" accomplishment, e.g.,
getting 100Mbps across the wide area, is actually quite hard, and it is only
by folks pushing really hard (as Cottrell and his colleagues have) that the
more mundane throughput targets (say, 100Mbps) will routinely be achieved.
> On Sat, 8 Mar 2003, Richard A Steenbergen wrote:
> > The amount of arrogance it takes to declare a land speed "record" when
> > there are people out there doing way more than this on a regular
> > basis.
> Single stream at 900mbs over that distance? Where?
Talk to folks that deal with radio telescopes.
We have been talking to the radio astronomy people.
We are aware they have such needs, however, I am unclear whether they
have succeeded in transmitting single stream TCP application to
application throughput of 900Mbits/s over 10,000km on a regular basis.
Perhaps you could point me to whom to talk to. I am aware of the work of
Richard Hughes-Jones of Manchester University and others and the
Radio Astronomy VLBI Data Transmission (see for example
http://www.hep.man.ac.uk/~rich/VLBI_web/) since we have shared notes and
talked together a lot on the high performance issues.
My understanding is that for today they use special high performance
tapes to ship the data around, and are actively looking at using the network.
- Could we use this new technology of transmission in our business?
- The record was achieved with commercial off the shelf
technology. The PCs (Intel P4), the operating systems (Linux),
the network interface cards Syskonnect, the routers
(Cisco & Juniper), switches, fibers, and long haul links
(Level(3)) are all available today. The work we had to do was
to select the right components, configure everything for high
performance and make it all work.
Internet capacities and performance are increasing by roughly a
factor of 2 per year; thus in 3.5 years you can expect a factor of
10 increase in performance, and 100 in ~7 years. In our case the
bottleneck was 1Gbits/s so if you have a bottleneck today of
100Mbits/s then in 3.5 years you may have similar performance to
what we were using. If you only have a 10Mbits/s bottleneck
then it may be expected to take twice as long.
The question will be how great is your need to ship large
volumes of data quickly, can you make a business case for
the expense, and how soon is it worth the cost?
- Is this new technology available soon for private
enterprise in Europe?
- The technology was all commercial off the shelf hardware and software
that is available in Europe.
We need more details
What would be helpful is a paper detailing costs of and results from the LSR.
What would be even more helpful is tailoring your results from an operational
perspective and presenting it at the next NANOG meeting
You are right that we need to put together a paper explaining what we
did, the dirty details,
why we believe it is important etc. So far we have been too busy just
utilizing the test bed while it was still in place and following up with
higher speeds and newer TCP stacks. Prompted by the NANOG discussions
I have put together a FAQ (see http://www-iepm.slac.stanford.edu/lsr/faq.html).
We will be making presentations at a meeting in San Diego
(see http://www-conf.slac.stanford.edu/chep03/) later this month
(and this requires a paper following the meeting); we also expect to be
making invited presentations at some interested local companies in the
Bay Area in the coming month, and next month related information will
be presented at the PAM 2003 workshop in San Diego.
I have not attended a NANOG before, but I am sure one of us would be more
than happy to present at a NANOG Meeting. I am unclear as to how one goes
about getting an invitation.
Editors note: we gave a
presentation at the Salt Lake City NANOG meeting in June 2003.
By the way, how did you deal with slow start? (And loss - was there any?
Forgive me for not having followed the details ...)
The main thing we did was to use jumbo frames. As for slow start,
it was over in about 6 seconds; see the iperf output
below (not from the submission, but run a little while before we made the
actual attempt; the attempt itself was made using the rapid application rather
than iperf):
hp1:~# /usr/local/Iperf/bin/iperf -c 220.127.116.11 -i 2 -w 32M -l 8M -p 5009
about to getaddrinfo on '18.104.22.168'
done with gai, ai_fam=2 ai_alen=16 addr=0x00000002...
Client connecting to 22.214.171.124, TCP port 5009
TCP window size: 48.0 MByte (WARNING: requested 32.0 MByte)
[ 3] local 126.96.36.199 port 34839 connected with 188.8.131.52 port 5009
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 2.0 sec 56.0 MBytes 235 Mbits/sec
[ 3] 2.0- 4.0 sec 184 MBytes 770 Mbits/sec
[ 3] 4.0- 6.1 sec 240 MBytes 955 Mbits/sec
[ 3] 6.1- 8.1 sec 232 MBytes 963 Mbits/sec
[ 3] 8.1-10.1 sec 224 MBytes 973 Mbits/sec
[ 3] 10.1-12.1 sec 240 MBytes 977 Mbits/sec
[ 3] 12.1-14.0 sec 216 MBytes 949 Mbits/sec
[ 3] 14.0-16.1 sec 240 MBytes 976 Mbits/sec
[ 3] 16.1-18.0 sec 232 MBytes 1.0 Gbits/sec
[ 3] 18.0-20.0 sec 224 MBytes 930 Mbits/sec
[ 3] 20.0-22.0 sec 232 MBytes 969 Mbits/sec
[ 3] 22.0-24.0 sec 232 MBytes 975 Mbits/sec
[ 3] 24.0-26.1 sec 232 MBytes 961 Mbits/sec
[ 3] 26.1-28.0 sec 232 MBytes 1.0 Gbits/sec
[ 3] 0.0-28.6 sec 3.0 GBytes 886 Mbits/sec
I think we were lucky not to see any congestion. More recently, running
the proposed High Speed TCP stack implemented by the Caltech group,
we can get over 900 Mbits/s consistently without jumbo frames.
What kind of difference did you see with jumbo frames as opposed to standard
1500 byte packets? I did some testing once and things actually ran slightly
faster with 1500 byte packets, completely contrary to my expectations...
(This was UDP and just 0.003 km rather than 10,000, though.)
The jumbo frames effectively increase the additive increase of the
congestion avoidance phase of TCP by a factor of 6.
Thus after a congestion event, which reduces the window by a factor of 2,
one can recover 6 times as fast. This is very important on fast, large-RTT
links, where the recovery rate (for TCP/Reno) goes as MTU/RTT^2.
This can be seen in some of the graphs at:
http://www-iepm.slac.stanford.edu/monitoring/bulk/fast/stacks.png
So how much packet loss did you see? Even with a few packets in a
million lost this would bring your transfer way down and/or you'd need
even bigger windows.
However, bigger windows mean more congestion. When two of those boxes
start pushing traffic at 1 Gbps with a 40 MB window, you'll see 20 MB
worth of lost packets due to congestion in a single RTT.
A test where the high-bandwidth session or several high-bandwidth
sessions have to live side by side with other traffic would be very
interesting. If this works well it opens up possibilities of doing this
type of application over real networks rather than (virtual)
point-to-point links where congestion management isn't an issue.
We saw little congestion related packet loss on the testbed. With big
windows SACK becomes increasingly important, so that one does not have to
recover a large fraction of the window for a single packet loss.
Once one gets onto networks where one is really sharing the bandwidth
with others, performance drops off rapidly.
One of the next things we want to look at is how the various new
TCP stacks work on production Academic & Research Networks (e.g. from
Internet2, ESnet, GEANT, ...) with lots of other competing traffic.
I was wondering if
there was any technical information regarding how and on what networking and
hardware you were able to achieve this world record?
The record was achieved using off the shelf high speed PCs (the one at
Sunnyvale was a SuperMicro 2.4GHz Intel Pentium 4 Xeon) and software
(OS, TCP stack, application), with both
ends running Linux with the standard TCP stack. The NIC at Sunnyvale was a
SysKonnect 1GE. The only non-standard thing was that we used jumbo frames
(9000 Bytes). We also had to do some tuning to set large windows
(e.g. we requested 32 MByte windows and were allocated 48 MByte windows
by Linux). A traceroute can be found at:
http://www-iepm.slac.stanford.edu/lsr/submit. See the
overall network layout for hardware details of the StarLight and Sunnyvale
setups.
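As a generic sketch of the window tuning mentioned above (not the exact configuration used for the record): an application requests large socket buffers much as iperf's -w flag does, and on Linux the kernel adjusts the request (doubling it for bookkeeping overhead and capping it at net.core.rmem_max / wmem_max), which is why iperf reports being granted 48 MByte after requesting 32 MByte.

```python
# Sketch: requesting large TCP windows from an application.
# The buffers must be set before connecting so that TCP window
# scaling is negotiated with an appropriately large window.
import socket

REQUESTED = 32 * 1024 * 1024  # 32 MByte, as in the iperf run above

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, REQUESTED)

# What the kernel actually granted; on a stock system this is capped
# by net.core.rmem_max and may be far below the request.
granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"requested {REQUESTED // 2**20} MByte, granted {granted / 2**20:.2f} MByte")
s.close()
```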
- I'm looking for information on the hardware
which was used for this record.
The computer servers at Sunnyvale
(each 1 rack unit, 2.5 Amps at 120 V):
Model: ACME Server 6012PE
Motherboard: Supermicro P4DPR-I
CPU : Intel 2.4 GHz
Memory : 1 GB PC2100 DDR ECC Registered
Hard Drive : 80GB IDE, Maxtor, 7200 RPM
They ran Red Hat Linux (kernel 2.4.19).
At Sunnyvale we had a Cisco GSR 12406.
At Amsterdam we had a similar setup, but the cpus were a more mixed bunch:
keeshond Dual PIII 1.X GHz. 2GB RAM, Dual GigE 3C985 SK943 2.4.19
stier Dual Xeon 2.0 GHz 2GB RAM Dual GigE EE1000 3C966, 2.4.19
haan PIII 700 MHz, .5GB RAM GigE 3C985 2.4.19
HP3 Dual Xeon 2.4 GHz 1GB RAM GigE 3c985 2.4.18
HP4 Dual Xeon 2.4 GHz 1GB RAM GigE 3c985 2.4.18
All systems at Amsterdam were Linux Gnu/Debian Woody Kernel 2.4.18 or 2.4.19.
The Amsterdam-Chicago setup.
The Sunnyvale-Chicago setup.
- Why is it difficult to get high performance (over
500-600 Mbits/s) with 1500 Byte MTUs and the standard (Reno) TCP?
- The problem on long links is that the congestion window
(with Reno/Tahoe TCP - the standard for most stacks today) only opens
up by 1 MTU per RTT, so after a congestion event (when the congestion
window is halved) it can take a long time to get back up to the
optimal throughput (the throughput goes as y = y0 + 0.5*t*MTU/RTT^2,
where t is the time, MTU = 1500 Bytes and y0 = the starting throughput).
For example, for an RTT of ~180 ms it can take about 4500 seconds
(over an hour!) to increase from 200 Mbits/s to 1000 Mbits/s, and
there may be a congestion event (e.g. packet loss) in this time,
in which case one has to start again from roughly half the current
throughput. With jumbo frames it increases 6 times as fast, since the
slope is proportional to the MTU, which for jumbos is 9000 Bytes.
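The recovery-time formula above can be evaluated directly, using the same illustrative numbers:

```python
# Reno recovery time from the formula y = y0 + 0.5 * t * MTU / RTT^2:
# solve for t to climb from y0 back to the target throughput.

def recovery_seconds(y0_bps: float, target_bps: float,
                     mtu_bytes: int, rtt_s: float) -> float:
    slope_bps_per_s = 0.5 * mtu_bytes * 8 / rtt_s ** 2
    return (target_bps - y0_bps) / slope_bps_per_s

RTT = 0.180  # ~180 ms, as in the example above
t_std = recovery_seconds(200e6, 1000e6, 1500, RTT)
t_jumbo = recovery_seconds(200e6, 1000e6, 9000, RTT)
print(f"1500 B MTU : ~{t_std:.0f} s ({t_std / 3600:.1f} hours)")
print(f"9000 B MTU : ~{t_jumbo:.0f} s ({t_std / t_jumbo:.0f}x faster)")
```

This gives about 4300 s for the standard MTU, in line with the ~4500 s quoted, and the factor of 6 for jumbo frames.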
"I suspect that the high
performance might be due to a large buffer space some place on the
router network. In a droptail router with buffer space T, the
window drops down to (B*D + T)/2. If T is as much as B*D, then TCP
drops its window down to B*D, which is the capacity of the link. If RED is
used, the result will be quite different, I believe, because packet drop will
happen well before the buffer fills up. Also DropTail with a large buffer
adds a lot of congestion delays.
Could somebody confirm the amount of router buffer space at the
bottleneck of this path and whether RED is being used? Whether the buffer
space is pooled or dedicated to a port?" Injong Rhee
"There is very little buffer space on the Cisco 7600 router.
By default the size of the buffer is 40 packets, and the maximum
value is 4096 (40 packets at 10 Gbps => 0.048 ms of queuing delay!).
The next generation of 10 GE Cisco modules will have 150 ms of buffer space.
I think that Juniper routers have large buffer spaces too.
Injong is right; the high performance is due to a large buffer at the
bottleneck. In the LSR-IPv6 record, the bottleneck was the end-host and
its NIC. That's why we needed a large txqueuelen on the end host
(ifconfig eth0 txqueuelen 1000000). In the case of LSR-IPv4, we
limited the TCP buffer size.
Please refer to the
presentation. On slides #9 and #10 it is
explained how to tune the TCP buffer and txqueuelen.
You will see that large buffer spaces considerably affect queuing delay.
The only new stack which doesn't affect delay is FAST, which
maintains low queuing delay. That's one reason why I am a strong
supporter of Steven's work.
Other stacks (HSTCP, Scalable and GridDT) fill the buffer up before
losing a packet and reducing the congestion window. If you have a small
buffer the end-to-end delay is not much affected; if you have a large
buffer you need QoS mechanisms in order to maintain low queuing delay
for real time traffic.
The effect on queuing delay is definitely something to take
into account in the evaluation of a TCP stack." Sylvain Ravot.
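The buffer arithmetic in the exchange above is straightforward to verify:

```python
# Queuing delay added by a full droptail buffer: buffered bits / line rate.

def queuing_delay_ms(packets: int, pkt_bytes: int, line_rate_bps: float) -> float:
    return packets * pkt_bytes * 8 / line_rate_bps * 1e3

# Default vs. maximum buffer sizes quoted above, at 10 Gbit/s
# (assuming 1500-byte packets):
print(f"  40 packets : {queuing_delay_ms(40, 1500, 10e9):.3f} ms")
print(f"4096 packets : {queuing_delay_ms(4096, 1500, 10e9):.1f} ms")
```

With the 40-packet default the delay is negligible, but a full 4096-packet buffer already adds ~5 ms, which matters for real-time traffic sharing the link.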
I agree this should not be about some
jocks beating a record. In fact that was incidental to the other
work that was done to set the stage for the record attempt.
I do think it is important to catch the public's
attention to why high speeds are important, that they are achievable
today application to application (it would also be useful to estimate when
such speeds will be available to universities, large companies, small companies,
the home etc.), and for techies it is important to start to understand
the challenges the high speeds raise, e.g. cpu and bus speeds,
cpu and router memories,
bugs in TCP, OS, application etc., new TCP stacks, new (possibly UDP based)
protocols such as tsunami or iSCSI or FCS over TCP or SCP ..., need for 64
bit counters in monitoring,
effects of the NIC card, jumbo requirements etc., and what is needed to
address them. Also to try and put it in meaningful terms (such as 2
full length DVD movies in a minute, that could also increase the "cease
and desist" legal messages shipped ;-)) is important.
Hope that helps, and thanks to you guys in the NANOG for providing
today's high speed networks.