High Speed Terabyte Data
Transfers for Physics
SC2004 Bandwidth Challenge Proposal
We achieved 101.13 Gbps.
100 Gbps is roughly equivalent to sending, in real time, about a quarter of all the
content (print, audio and video) produced on Earth during the period of the test.
At 100 Gbps one could transfer the
contents of all the books and other print collections of the Library of
Congress in under 14 minutes, or three full-length DVD movies in about one second.
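These analogies follow from simple arithmetic; a rough check is sketched below. The collection sizes used are common public estimates (assumptions), not measured values.

```python
# Back-of-the-envelope check of the analogies above; sizes are rough assumptions.
LINK_GBPS = 100            # nominal rate used in the comparison
LOC_PRINT_TBYTES = 10      # Library of Congress print collections, ~10 TB (rough estimate)
DVD_GBYTES = 4.7           # single-layer DVD capacity

loc_seconds = LOC_PRINT_TBYTES * 1e12 * 8 / (LINK_GBPS * 1e9)
dvd_seconds = 3 * DVD_GBYTES * 1e9 * 8 / (LINK_GBPS * 1e9)

print(f"Library of Congress print collections: ~{loc_seconds / 60:.1f} minutes")  # ~13 min
print(f"Three DVD movies: ~{dvd_seconds:.1f} seconds")                            # ~1 s
```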
The Caltech-SLAC-FNAL entry will demonstrate high speed
transfers of physics data between host labs and collaborating institutes.
Caltech and FNAL are major participants in the CMS collaboration at CERN's Large
Hadron Collider (LHC). SLAC is the host accelerator site for the BaBar
collaboration. We are using state-of-the-art WAN infrastructure and Grid-based
Web Services built on the LHC Tiered Architecture. Our demonstration will show a
typical real-time event analysis application that requires the transfer of large
physics datasets. For this we will use NLR 10GE waves, monitoring the WAN
performance using the MonALISA agent-based system. The analysis software will
use a suite of Grid-enabled Analysis tools developed at Caltech and Univ. of
Florida. We intend to saturate three NLR 10GE waves: Sunnyvale to Pittsburgh, LA
to Pittsburgh and Chicago to Pittsburgh. These links carry traffic between SLAC,
Caltech and other partner Grid Service sites including UKLight, UERJ and FNAL.
- Caltech/HEP/CACR/NetLab: Harvey Newman, Julian Bunn, Sylvain Ravot, Conrad
Steenberg, Yang Xia,
- SLAC/IEPM: Les Cottrell, Gary Buhrmaster,
- University of Manchester: Richard Hughes-Jones
- Sun: Larry McIntosh, Frank Leers
- Chelsio: Michael Chen
- S2io: Jimmy VanLandingham, Leonid Grossman
- Network access:
- Space, connections:
SC04 HPC BWC
Caltech press release
Press coverage: 11/24/04: RNP; 11/25/04: PhysOrg; Tom's Hardware Guide; 11/29/04:
This is a joint Caltech, SLAC, University of Manchester project with
Cisco, Level 3, QWest, CENIC, DataTAG, National Lambda Rail (NLR), StarLight, TeraGrid, SurfNet, HP and Sun as
sponsors. It responds to the SC2004 Bandwidth Challenge
Call for participation (August 16, 2004). We will demonstrate high network
and application throughput on trans-continental
(10 Gbits/s) and
trans-Atlantic (10 Gbits/s) links between Caltech/LA, SLAC/Sunnyvale, FNAL/Chicago,
CERN/Geneva, StarLight/Chicago and SC2004/Pittsburgh.
The High Energy Physics (HEP) community is conducting a new round of experiments to
probe the fundamental nature of matter and space-time, and to understand the
early history of the universe. These experiments face
unprecedented challenges due to the volumes and complexity of the
data, and the need for collaboration among scientists working around the
world. The massive, globally distributed datasets are
expected to grow to over 100 Petabytes by 2010, and will require Gbits/s throughputs between sites located around the
globe. In response to these challenges, major HEP centers in the
U.S., including Caltech, SLAC and FNAL, have been designing and
building state of the art WAN infrastructures that support a Grid-based system
of physics Web Services. During SC04 we will use NLR 10Gbits/s waves to
demonstrate these Web Services and the MonALISA monitoring service used in the
Large Hadron Collider (LHC) and BaBar experiments, and also improve network and
disk-to-disk transfer performance over 10 Gbits/s WAN links.
We hope to demonstrate high (> 1 Gbyte/s) disk-to-disk throughput and even
higher (a few tens of Gbits/s sustained) memory-to-memory throughputs between
the above sites. The showcase will be in the FNAL/SLAC booth (2418), which can be
found in the floor layout, in Pittsburgh, Pennsylvania, November 6-12, 2004.
We will also have demonstrations of:
- Measuring the Digital Divide (PingER)
- Bandwidth Monitoring (IEPM-BW)
- Available Bandwidth Monitoring (ABwE)
- Internet Traffic Characterization (NetFlow)
- Worldwide Sharing of Internet Performance Information (MonALISA)
- Comparing TCP stacks
In addition, SLAC plans to have:
- Pinging sites worldwide from the show floor and visualizing the Round Trip
Times (RTT) by means of a Java WebStart application. This shows a map of the world
with the monitored countries identified, and time series plots, ordered by region,
showing the RTTs. In case you do not have Java WebStart installed, a screenshot
gives the idea.
- Making quick (1 second), low impact available bandwidth measurements to
various high-performance sites worldwide using ABwE. The visualization is an
interactive map with various time series (mock-up)
showing the available bandwidth, the capacity and the cross-traffic to the remote
sites from the show floor. (A sketch of the packet-pair idea behind such
measurements follows this list.)
- Showing the "PingER
Internet Congestion Wave" a visualization by the University of New
England, Armidale, Australia, of how congestion (measured by packet loss)
moves around the world with time of day.
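For readers unfamiliar with how such quick bandwidth estimates work, here is a minimal sketch of the packet-pair dispersion idea that underlies tools like ABwE. This is not the ABwE code; real tools send many pairs, filter the timings, and distinguish available bandwidth from capacity. The port and packet size are illustrative assumptions.

```python
# Minimal packet-pair dispersion sketch: two back-to-back UDP packets spread
# out at the bottleneck link, and the arrival spacing at the receiver gives a
# rough capacity estimate of size / dispersion.
import socket
import time

PKT_SIZE = 1400   # UDP payload bytes per probe packet (assumed)
PORT = 9000       # arbitrary demo port (assumed)

def send_pair(dest_host: str) -> None:
    """Send two back-to-back probe packets to the receiver."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * PKT_SIZE
    sock.sendto(payload, (dest_host, PORT))
    sock.sendto(payload, (dest_host, PORT))  # immediately after the first
    sock.close()

def receive_pair() -> float:
    """Wait for one pair, time the inter-arrival gap, return a bits/s estimate."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    sock.recvfrom(PKT_SIZE + 64)
    t1 = time.perf_counter()
    sock.recvfrom(PKT_SIZE + 64)
    t2 = time.perf_counter()
    sock.close()
    return (PKT_SIZE * 8) / (t2 - t1)

# Usage: run receive_pair() on the far host, then call send_pair("far-host")
# locally; repeat over many pairs and take a robust statistic of the results.
```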
The overall network layout may include CERN, Caltech, FNAL and SLAC as well as collaborators in
Australia, Korea, Japan, Brazil and the UK.
We are looking at 2 racks for an artifact in one corner of the SLAC/FNAL booth.
We have a few Dell PowerEdge 2650 dual-CPU servers.
From Sun we have on loan 11 AMD Opteron (V20Z) compute servers,
each with two 2.4 GHz 64-bit AMD Opteron processors and 4 GB of RAM
(part number A55-NXB2-1-2GGB5, 1U high).
These compute servers will run Linux 2.6.6 and will make memory-to-memory data
transfers using iperf 1.7.0 (a minimal sketch of driving such a transfer is
given below). In addition there will be three V20Z file servers with Sun 3510
arrays, again running Linux 2.6.6.
We will be using Intel,
Chelsio (T110 with a TCP Offload Engine) and
S2io 10GE NICs. Five V20Z compute servers will
be located at Sunnyvale, and six at SC2004. One file server will be located at
Sunnyvale and two at SC2004.
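For reference, here is a minimal sketch (not our production scripts) of how such a memory-to-memory iperf test might be driven and parsed from Python. The host name, window size, stream count and duration are illustrative assumptions, and an iperf server must already be running on the far end with a matching window.

```python
# Sketch of launching an iperf 1.7.0 TCP client and extracting the last
# reported bandwidth (the [SUM] line when multiple streams are used).
import re
import subprocess

def run_iperf(server: str, window: str = "2M", streams: int = 16,
              seconds: int = 60) -> float:
    """Run an iperf TCP client against `server`; return the final rate in Gbits/s."""
    cmd = ["iperf", "-c", server, "-w", window,
           "-P", str(streams), "-t", str(seconds)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    rates = re.findall(r"([\d.]+)\s+([MG])bits/sec", out)
    if not rates:
        return 0.0
    value, unit = rates[-1]
    return float(value) / (1000.0 if unit == "M" else 1.0)

if __name__ == "__main__":
    # Hypothetical host name, for illustration only.
    print(run_iperf("v20z-sunnyvale.example.net"))
```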
Setup details and photos:
- Caltech: overall setup, Caltech CENIC PoP, Caltech Optical Exchange
- SLAC: hosts at the SLAC booth and Sunnyvale and their purposes, Sunnyvale power,
people with access at Sunnyvale, SLAC booth equipment, booth power requirements,
booth power layout, booth power form, layout of the network bunker
- NICs: Chelsio 10GE T110 card with a 1982 10 Mbps 3COM card, Chelsio TOE NIC and S2io NIC
- Data flows for SC2004 prepared by Caltech
- UK: UKLight network connection, UK MB-NG network, UK host configurations
- Sun: Sun 1000-38 rack, Sun 3510 disk array
We achieved a maximum bandwidth of over 0.1 Tbits/s (101Gbits/s) between the
Caltech and SLAC SC2004 booths and collaborator sites.
- Final network configuration (ppt,
gif). We used 5 NLR waves, and one of
the 3 TeraGrid waves, each of which has a theoretical capacity of 10 Gbps in
each direction. We also used two 10G links to Abilene and one wave over ESnet.
- Weathermap showing 8.7Gbps from SLAC
booth to ESnet/QWest PoP.
- SC2004 plots of throughput during the challenge.
The 101 comes from adding the aggregate in and out bandwidths, i.e. the cyan
and magenta. The individual links are shown at the bottom. We do not currently
know which color is associated with which link. The measurements were made with
- MonALISA plots of throughput during the challenge:
summary of bandwidths etc. for all the challengers;
time series of the aggregate for the HEP SC04
bandwidth challenge; histogram of
the aggregate bandwidth for the HEP SC04 bandwidth challenge;
components of the HEP SC04 bandwidth challenge
- Movie clips of MonALISA animated bandwidth display at
Caltech booth and SLAC booth. MonALISA read the MIBs for the
various router interfaces.
- For the SLAC booth dedicated paths from SC2004 to Sunnyvale:
- We had two pairs of hosts on path A between SC2004 and the CENIC/NLR/Level(3)
PoP in Sunnyvale (two hosts on each end of the connection), and one pair
on path B from SC2004 to the ESnet/QWest PoP in Sunnyvale (one host on each end of
the connection); each host had a Chelsio TOE NIC and ran iperf/TCP:
- On link A, 9.43G in one direction (9.07G goodput) and 5.65G in the reverse
direction (5.44G goodput), for a total of over 15G on the wire. We purposely
pushed one direction close to the 10G limit. (The relation between on-wire rate
and goodput is sketched below, after the UDT result.)
- On link B, 7.72G in one direction (7.43G goodput).
- Essentially we saturated the PCI buses of all six machines.
- For UDT, we achieved 4.45 Gbits/s between two V20Zs with 10 Gbits/s
Chelsio NICs in the SLAC booth. According to Yunhong Gu, the UIC UDT developer
(private communication): "Currently on Opteron we can only reach about 5.2 Gb/s
at most. This is partly due to the implementation efficiency, as the CPU gets
used up. We found that even with iperf/UDP the performance is still less than
iperf/TCP (6 Gb/s vs 7.x Gb/s). We are still investigating this problem, but I
suspect it is because many TCP functions have been offloaded to hardware (e.g.
the NIC), and since TCP is used much more often than UDP, its implementation is
highly optimized while UDP's is not."
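To illustrate the relation between the on-wire rates and the goodput figures quoted above for 1500 Byte frames, here is a rough sketch. The per-frame overheads assumed (no TCP options, Ethernet preamble and inter-frame gap ignored) are our assumptions for illustration, not how the quoted numbers were derived.

```python
# Rough relation between on-wire Ethernet rate and TCP goodput for 1500-byte MTU.
ETH_OVERHEAD = 18      # Ethernet header (14) + FCS (4), bytes per frame (assumed)
IP_TCP_HEADERS = 40    # IPv4 (20) + TCP (20) headers, no options (assumed)
MTU = 1500             # IP packet size on a standard Ethernet frame

def goodput(wire_gbps: float) -> float:
    """TCP payload rate implied by a given on-wire rate."""
    payload = MTU - IP_TCP_HEADERS   # 1460 bytes of data per frame
    frame = MTU + ETH_OVERHEAD       # 1518 bytes on the wire
    return wire_gbps * payload / frame

print(round(goodput(9.43), 2))   # ~9.07 Gbits/s of application data
print(round(goodput(5.65), 2))   # ~5.4 Gbits/s
```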
Example certificate awarded at the SC2004
Bandwidth Challenge Ceremony on November 11, 2004.
Richard, Les and Harvey at the award ceremony,
Phil DeMar (FNAL), Richard Hughes-Jones (Manchester University), Dave Nae
(Caltech), Les Cottrell and Richard Mount (SLAC) with awards.
- S2io NICs with Solaris 10 in a 4*2.2 GHz Opteron V40Z to one or
more S2io or Chelsio NICs with Linux 2.6.5 or 2.6.6 in 2*2.4 GHz V20Zs
- Chelsio TOE tests between Linux 2.6.6 hosts, 1500 Byte MTU:
- On LAN through the switch in the S2io booth, one NIC:
7.46 +- 0.07 Gbits/s
- On LAN through the switch in the S2io booth, both NICs: each NIC ~6 Gbps,
2 NICs 12.08 Gbps, ~160% CPU utilization
- On LAN through the switch in the SLAC booth, V40Z to two V20Zs (scsl-4, ...)
with Chelsio NICs simultaneously
- On WAN from the booth (Pittsburgh) to the Sunnyvale/ESnet PoP with a 2*2.4 GHz
V20Z Opteron at the booth and a 2*1.6 GHz V20Z Opteron at Sunnyvale, Chelsios
at both ends, 2 MByte window, 16 streams, 1500 Byte MTU, both ends
running Linux 2.6.6:
- Test 1: 7.42 Gbits/s, 120 mins (6.6 TBytes shipped), 148% CPU utilization
- Test 2: 7.412 +- 0.009 Gbits/s (stream average 463.3 +- 0.8 Mbits/s), 30 mins
(1.7 TBytes shipped), 128% CPU utilization
- Test 4: 7.39 Gbits/s, 30 mins, 168% CPU utilization
- 11.4 Gbits/s from a single V40Z host (spreadsheet)
- CPU utilization ~0.2 GHz/Gbps; 6.6 TBytes in 2 hours, one host to one host,
with 1500 Byte MTU; effect of parallel streams on GHz/Gbps (spreadsheet)
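The GHz/Gbps figure above is a CPU-efficiency metric. A small sketch of how it is computed follows; the sample numbers are hypothetical, chosen only to show the arithmetic, and the ~0.2 GHz/Gbps figure itself comes from the spreadsheet measurements.

```python
# "GHz per Gbps": measured CPU utilization (100% = one fully busy core, as
# reported by tools such as top) scaled by the core clock and divided by the
# achieved throughput.
def ghz_per_gbps(cpu_util_percent: float, core_clock_ghz: float,
                 throughput_gbps: float) -> float:
    """CPU cycles consumed per unit of throughput (GHz per Gbit/s)."""
    return (cpu_util_percent / 100.0) * core_clock_ghz / throughput_gbps

# Hypothetical example: ~104% CPU on 2.2 GHz cores while moving 11.4 Gbits/s
# works out to roughly the 0.2 GHz/Gbps ballpark quoted above.
print(round(ghz_per_gbps(104, 2.2, 11.4), 2))   # 0.2
```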
The SLAC equipment loans alone from companies such as Sun, Cisco and Chelsio
totaled about $400,000. This does not include equipment loaned to NLR to
provision the link from Sunnyvale to Pittsburgh.
- We were unable to connect our loaned 10 Gbits/s links to SLAC. Thus, in the
last week we had to move the California-located equipment to Sunnyvale (about
20 miles from SLAC). Further, in the last few days we learnt that it would
be necessary to position the equipment at two separate locations in Sunnyvale.
This absorbed much valuable time at the critical late stage before we
travelled to Pittsburgh. Further, due to the short time available we did not
have facilities for remotely power-cycling the remote hosts, so we had to be
careful not to make dramatic changes to them (e.g. frequent reboots). We had
to manually reboot one system at Sunnyvale during SC2004.
- Most of the loaned hosts and disk arrays were shipped directly from the
provider to the show in Pittsburgh. Thus we were unable to complete setting up
hosts until we reached Pittsburgh three days before the show. To assist in
setting up at Pittsburgh, we created master disks at SLAC that we carried by
hand to Pittsburgh.
- Keeping host configurations updated. This was done by hand; we did write a
script to propagate files but did not have time to fully test it. Having NFS
might have helped for a single site. However, NFS would have increased
complexity and possibly had some security concerns. We deemed using AFS too
complex for such a short term demonstration.
- Lack of a name service meant we had to remember IP addresses. We used
/etc/hosts on the hosts to assist with this. We used VLANs to associate traffic
for off-site hosts with chosen routes. The use of VLANs and multiple remote
sites meant we had many address ranges.
- Security: we did not use ACLs in the router; rather, we used
iptables on the hosts.
- Enabling extra ports at the last minute for UDT led to confusion.
- Also, traceroute with UDP probes failed to work properly for a large part
of the setup period. This led to delays in discovering incorrect MTU settings.
- Jumbo frames: we set the booth router interfaces to support 9000 Byte MTUs.
However, we missed setting the VLANs to have 9000 Byte MTUs until it was too
late. Thus we were unable to make off-site measurements with large MTUs.
- The large mix of hardware (Opterons (V20Zs and a V40Z with various speeds
and system disk sizes) and Xeons), operating systems (Linux 2.6.5 and 2.6.6
and Solaris 10) and NICs (Chelsio and S2io), while increasing our ability to test
more features, also increased the complexity of setting things up.
- Coordination between the SLAC and Caltech booths was difficult due to
their physical separation (about 100 yards). This made balancing link loads
and sharing of resources more difficult. Next year we may share a booth
dedicated to HENP network measurement and performance.
- It was hard to get SR XENPAKs to connect the S2io interfaces to the Cisco
switch/routers.
- Even though we had several file servers in the SLAC booth, we had
insufficient time to focus on file transfer performance. We were able to
achieve about 110 MBytes/s sustained file transfer using the Fast-SCSI system
disks with a 250 GByte file.
The Chelsio TOE NIC (T110) works very stably on uncongested paths. Most of
the measurements used 1500Byte frames.
2.4 GHz Opteron throughput is limited by the PCI-X 133 MHz bus bandwidth (see the
sketch below). We were able to get almost identical performance on the WAN as on the LAN.
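As a back-of-the-envelope check of the PCI-X limit (an assumption about where the bottleneck lies, not a measurement):

```python
# PCI-X 133 MHz is a 64-bit bus, giving a theoretical peak of roughly 8.5 Gbits/s.
PCI_X_WIDTH_BITS = 64
PCI_X_CLOCK_HZ = 133e6

raw_gbps = PCI_X_WIDTH_BITS * PCI_X_CLOCK_HZ / 1e9
print(raw_gbps)   # ~8.5 Gbits/s theoretical peak

# After bus protocol and DMA overheads the usable rate is lower, consistent
# with the ~7.4-7.5 Gbits/s per NIC seen in the measurements above.
```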
We were able to fill over 99% of a 10 Gbits/s wavelength across the country with
two hosts with 10 Gbits/s NICs talking to two similar hosts.
For 10 streams or fewer, the CPU utilization with the Chelsio TOE NIC is a factor of
2.5 to 3.5 less
than that with the S2io NIC. The CPU utilization appears to be a function of both
the achievable throughput and the number of parallel streams.
We were able to get 11.4 Gbits/s from a single V40Z host (using two S2io 10GE
NICs) to two V20Z hosts on the LAN.
We were able to demonstrate the smooth inter-working of Chelsio and S2io 10GE
NICs, as well as Cisco 650x switch/routers and a Juniper T320.
Using UDT, we were able to achieve about 4.45 Gbits/s (transferring 200 GBytes)
with 9000 Byte MTUs between two V20Zs in the SLAC booth; each V20Z had 2*2.4 GHz
Opterons and a Chelsio 10GE card. The CPU utilization was 114%. This is about
2.6 times the CPU utilization/GHz for a similar transfer rate using the TCP TOE
NIC, but similar to that used by the S2io NIC.
"The smooth interworking of 10GE interfaces from multiple vendors, the
ability to successfully fill 10 gigabit-per-second paths both on local area
networks (LANs), cross-country and intercontinentally, the ability to transmit
greater than 10Gbits/second from a single host, and the ability of TCP offload
engines (TOE) to reduce CPU utilization, all illustrate the emerging maturity of
the 10Gigabit/second Ethernet market. The current limitations are not in the
network but rather in the servers at the ends of the links, and their buses,"
said Dr. R. Les Cottrell, assistant director of Stanford Linear Accelerator
Center Computer Services and head of the SLAC-led part of the team.
Harvey Newman, professor of physics at Caltech and head of the team, said,
"This is a breakthrough for the development of global networks and grids, as
well as inter-regional cooperation in science projects at the high-energy
frontier. We demonstrated that multiple links of various bandwidths, up to the
10 gigabit-per-second range, can be used effectively over long distances.
"This is a common theme that will drive many fields of data-intensive
science, where the network needs are foreseen to rise from tens of gigabits per
second to the terabit-per-second range within the next five to 10 years," Newman
continued. "In a broader sense, this demonstration paves the way for more
flexible, efficient sharing of data and collaborative work by scientists in many
countries, which could be a key factor enabling the next round of physics
discoveries at the high energy frontier. There are also profound implications
for how we could integrate information sharing and on-demand audiovisual
collaboration in our daily lives, with a scale and quality previously unimaginable."
- 10GE Intel cards: card,
Intel LR Xenpak,
- S2IO NICs 2003 and
2004, Chelsio TOE NIC and S2io NIC,
S2IO, Cisco and SLAC folks working on the S2IO demonstration,
- Alex Aizman (Lead Software Architect, S2io),
Larry McIntosh (Sun), Dimitry Yusupov (S2io), Les Cottrell (SLAC), Richard H-J at S2io booth with FedEx
package for replacement S2io cards,
- Mike Chan of Chelsio; Les
and Mike Chen (Chelsio) with Bandwidth Challenge shirts;
Les, Mike, Greg Gangitano (Chelsio) and Gary Buhrmaster;
Les, Mike and Gary with BWC certificates;
Larry McIntosh (Sun), Mike Chen
(Chelsio) and Frank Leers (Sun) working on the V40Z
Gary Buhrmaster charming Linda Winkler (ANL/StarLight/SciNet),
(NIKHEF, ex SLAC) & Richard H-J,
- Collaborators: U Manchester,
S2io, Sun, Chelsio, FNAL
- Bandwidth Challenge bunker:
- Sun rack with V20Z opterons,
- crate for Sun rack,
- bunker front,
- bunker with Jerrod, Richard H-J & Phil,
- bunker with Richard H-J & Les,
- bunker with Mike, Richard H-J, Phil
- Northern California end:
View from Convention Center
Jerrod Williams' photos,
page of pictures from bandwidth challenge.
More on bulk throughput:
- Bulk throughput measurements
- Bulk throughput simulation
- Windows vs. streams
- Effect of load on RTT and loss
- Bulk file transfer measurements
- FAST TCP Stack Measurements
- SC2002 SLAC/FNAL
- SC2002 bandwidth challenge
- SC2003 bandwidth challenge
- Internet2 Land Speed Record
Created August 18, 2004: Les Cottrell,