High Speed Terabyte Data Transfers for Physics

SC2004 Bandwidth Challenge Proposal

We achieved 101.13 Gbps (>0.1 Tbits/s)!

100 Gbps is equivalent to sending and receiving 1/4 of all the content (printed, audio, video) produced on Earth during the test.

At 100 Gbps one could transfer the contents of all the books and other print collections of the Library of Congress in under 14 minutes, or three full-length DVD movies in about one second.
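
As a rough check on those figures, the arithmetic is straightforward. The short Python sketch below assumes about 10 Terabytes for the Library of Congress print collections and 4.7 GBytes per single-layer DVD; both sizes are commonly quoted estimates, not figures from this announcement.

    # Back-of-the-envelope transfer times at 100 Gbits/s.
    # The data-set sizes are rough, commonly quoted estimates, not figures
    # from this page.

    RATE_BPS = 100e9                 # 100 Gbits/s

    def transfer_time_seconds(size_bytes, rate_bps=RATE_BPS):
        """Time to move size_bytes at rate_bps, ignoring protocol overhead."""
        return size_bytes * 8 / rate_bps

    loc_print_bytes = 10e12          # ~10 TBytes: rough estimate of the LoC print holdings
    dvd_bytes = 4.7e9                # single-layer DVD capacity

    print("Library of Congress: %.1f minutes" % (transfer_time_seconds(loc_print_bytes) / 60))
    print("Three DVD movies:    %.2f seconds" % transfer_time_seconds(3 * dvd_bytes))
    # Roughly 13 minutes and about 1.1 seconds, consistent with the claims above.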

The Caltech-SLAC-FNAL entry will demonstrate high-speed transfers of physics data between host labs and collaborating institutes. Caltech and FNAL are major participants in the CMS collaboration at CERN's Large Hadron Collider (LHC), and SLAC is the host accelerator site for the BaBar collaboration. We are using state-of-the-art WAN infrastructure and Grid-based Web Services built on the LHC Tiered Architecture. Our demonstration will show a typical real-time event analysis application that requires the transfer of large physics datasets. For this we will use NLR 10GE waves, monitoring the WAN performance with the MonALISA agent-based system. The analysis software will use a suite of Grid-enabled Analysis tools developed at Caltech and the University of Florida. We intend to saturate three NLR 10GE waves: Sunnyvale to Pittsburgh, LA to Pittsburgh and Chicago to Pittsburgh. These links carry traffic between SLAC, Caltech and other partner Grid Service sites including UKLight, UERJ, FNAL and AARNet.

Participants

  • Caltech/HEP/CACR/NetLab: Harvey Newman, Julian Bunn, Sylvain Ravot, Conrad Steenberg, Yang Xia
  • SLAC/IEPM: Les Cottrell, Gary Buhrmaster
  • University of Manchester: Richard Hughes-Jones
  • Sun: Larry McIntosh, Frank Leers
  • Chelsio: Michael Chen
  • S2io: Jimmy VanLandingham, Leonid Grossman

Contributors

Publicity Releases

  • NLR proposal
  • SC04 HPC BWC entry
  • 11/23/04: S2io
  • 11/23/04: BusinessWire
  • 11/23/04: CCNMatthews
  • 11/23/04: TMCnet
  • 11/24/04 Caltech press release
  • 11/24/04: RNP
  • 11/25/04: PhysOrg
  • 11/28/04: Interactions News Wire
  • 11/29/04: Chelsio
  • 11/29/04: SlashDot
  • 11/29/04: Tom's Hardware Guide
  • 11/29/04: GRIDToday
  • 11/29/04: Yahoo
  • 11/29/04: ScienceBlog
  • 11/29/04: CCNews
  • 11/30/04: Sun
  • 11/30/04: SuperComputing
  • 11/30/04: TMCnet
  • 11/30/04: FindLaw
  • 11/30/04: OrangeCrate
  • 11/30/04: RedNova
  • 11/30/04: PRNewsWire
  • 11/30/04: mySaN.de
  • 11/30/04: ShortMedia
  • 11/30/04: CodeNewbie
  • 12/1/04: AMPATH, Brazil
  • 12/1/04: ComputerWorld
  • 12/1/04: TechSpot
  • 12/1/04: TechWorld
  • 12/1/04: DigitalSilence
  • 12/3/04: SLAC
  • 01/17/05: DSstar

    [Logos: Caltech, Cisco Systems, StarLight, SURFnet]


    Project Description | Technical Details | Results | Challenges | Conclusions | Quotes | Photos | FAQ

    Project description

    This is a joint Caltech, SLAC and University of Manchester project, with Cisco, Level 3, Qwest, CENIC, DataTAG, National Lambda Rail (NLR), StarLight, TeraGrid, SURFnet, HP and Sun as sponsors. It responds to the SC2004 Bandwidth Challenge Call for Participation (August 16, 2004). We will demonstrate high network and application throughput on trans-continental (10 Gbits/s) and trans-Atlantic (10 Gbits/s) links between Caltech/LA, SLAC/Sunnyvale, FNAL/Chicago, CERN/Geneva, StarLight/Chicago and SC2004/Pittsburgh.

    The High Energy Physics (HEP) community is conducting a new round of experiments to probe the fundamental nature of matter and space-time, and to understand the early history of the universe. These experiments face unprecedented challenges due to the volume and complexity of the data, and the need for collaboration among scientists working around the world. The massive, globally distributed datasets are expected to grow to over 100 Petabytes by 2010, and will require Gbits/s throughputs between sites located around the globe. In response to these challenges, major HEP centers in the U.S., including Caltech, SLAC and FNAL, have been designing and building state-of-the-art WAN infrastructures that support a Grid-based system of physics Web Services. During SC04 we will use NLR 10 Gbits/s waves to demonstrate these Web Services and the MonALISA monitoring service used in the Large Hadron Collider (LHC) and BaBar experiments, and to improve network and disk-to-disk transfer performance over 10 Gbits/s WAN links.

    We hope to demonstrate high (>1 GByte/s) disk-to-disk throughput and even higher (a few tens of Gbits/s sustained) memory-to-memory throughput between the above sites. The showcase will be in the FNAL/SLAC booth (2418), which can be found in the floor layout for SC2004 in Pittsburgh, Pennsylvania, November 6-12, 2004.

    We will also have several demos and slide shows illustrating:

    • Measuring the Digital Divide (PingER)
    • Bandwidth Monitoring (IEPM-BW)
    • Available Bandwidth Monitoring (ABwE)
    • Internet Traffic Characterization (NetFlow)
    • Worldwide Sharing of Internet Performance Information (MonALISA)
    • Comparing TCP stacks (long version)

    In addition we will be:

    Bandwidth Challenge Technical Details

    SLAC plans to have:

    The overall network layout may include CERN, Caltech, FNAL and SLAC as well as collaborators in Australia, Korea, Japan, Brazil and the UK.

    We are planning on two racks of equipment as a display in one corner of the SLAC/FNAL booth.

    We have a few Dell PowerEdge 2650 dual-CPU servers (specifications). Sun has loaned us 11 AMD Opteron compute servers (V20z), each with two 2.4 GHz AMD 64-bit Opteron processors and 4 GB of RAM (config; part number A55-NXB2-1-2GGB5; 1U high; physical specs). These compute servers will run Linux 2.6.6 and will make memory-to-memory data transfers using iperf 1.7.0. In addition there will be three V20Z file servers with Sun 3510 Storage Arrays, again running Linux 2.6.6. We will be using Intel, Chelsio (T110 with a TCP Offload Engine) and S2io 10GE NICs. Five V20Z compute servers will be located at Sunnyvale, and six at SC2004. One file server will be located at Sunnyvale and two at SC2004.
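
    As an illustration of how such memory-to-memory tests are typically driven, the short Python sketch below wraps an iperf 1.7.0 client with parallel streams and an enlarged TCP window. It is a sketch only: the server host name is a placeholder, the stream count and window size are illustrative, and it assumes an iperf server ("iperf -s") is already running at the far end.

        # Sketch of driving iperf 1.7.0 for memory-to-memory TCP throughput tests.
        # The host name, stream count and window size are illustrative placeholders.
        import subprocess

        def run_iperf_client(server, streams=16, window="2M", secs=1800, interval=60):
            """Run an iperf TCP client with parallel streams and return its text report.

            Assumes an iperf server (e.g. "iperf -s -w 2M") is already running on 'server'.
            """
            cmd = [
                "iperf",
                "-c", server,        # client mode, connect to this host
                "-P", str(streams),  # number of parallel TCP streams
                "-w", window,        # requested TCP window (socket buffer) size
                "-t", str(secs),     # test duration in seconds
                "-i", str(interval), # interim reports every <interval> seconds
            ]
            return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

        if __name__ == "__main__":
            print(run_iperf_client("v20z-sunnyvale.example.net"))  # placeholder host name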

    Setup documentation and photos:
    • Caltech overall setup
    • Caltech CENIC PoP
    • Caltech Optical Exchange Point
    • Hosts at the SLAC booth and Sunnyvale and their purposes
    • SLAC/Sunnyvale equipment
    • Sunnyvale power and people with access at Sunnyvale
    • SLAC booth equipment, booth power requirements, booth power layout and booth power form
    • Layout of the network bunker
    • Chelsio 10GE T110 card alongside a 1982 10 Mbps 3COM card
    • Chelsio TOE NIC and S2io NIC
    • Data flows for SC2004, prepared by Caltech
    • UKLight network connection, UK MB-NG network and UK host configurations
    • Sun 1000-38 rack, Sun 900-38 rack, Sun 3510 disk array, Sun V20Z and Sun V40Z

    Operational Information

    Results

    We achieved a maximum bandwidth of over 0.1 Tbits/s (101Gbits/s) between the Caltech and SLAC SC2004 booths and collaborator sites.

    Example certificate awarded at the SC2004 Bandwidth Challenge Ceremony on November 11, 2004. Richard, Les and Harvey at the award ceremony, Phil DeMar (FNAL), Richard Hughes-Jones (Manchester University), Dave Nae (Caltech), Les Cottrell and Richard Mount (SLAC) with awards.

    S2io NICs with Solaris 10 in a 4*2.2 GHz Opteron V40Z, to one or more S2io or Chelsio NICs with Linux 2.6.5 or 2.6.6 in 2*2.4 GHz V20Zs.

    Chelsio TOE tests between Linux 2.6.6 hosts, 1500 Byte MTU:
    • On the LAN through the switch in the S2io booth, one NIC (table, plot): 7.46 ± 0.07 Gbits/s
    • On the LAN through the switch in the S2io booth, both NICs: each NIC ~6 Gbits/s, 2 NICs together 12.08 Gbits/s, ~160% CPU utilization
    • On the LAN through the switch in the SLAC booth, V40Z to two V20Zs (scsl-4 and scsl-6) with Chelsio NICs simultaneously
    • On the WAN from the booth (Pittsburgh) to the Sunnyvale/ESnet PoP, with a 2*2.4 GHz Opteron V20Z at the booth and a 2*1.6 GHz Opteron V20Z at Sunnyvale, Chelsio NICs at both ends, 2 MByte window, 16 streams, 1500 Byte MTU, both ends running Linux 2.6.6:
      • Test1: 7.42 Gbits/s over 120 mins (6.6 Tbits shipped), 148% CPU utilization
      • Test2: 7.412 ± 0.009 Gbits/s (stream average 463.3 ± 0.8 Mbits/s) over 30 mins (1.7 Tbits shipped), 128% CPU utilization
      • Test4: 7.39 Gbits/s over 30 mins, 168% CPU utilization
     
    • 11.4 Gbits/s from a single V40Z host (spreadsheet); CPU utilization ~0.2 GHz/Gbps
    • 6.6 Tbits in 2 hours, one host to one host, with a 1500 Byte MTU
    • Effect of parallel streams on GHz/Gbps (spreadsheet)
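
    The GHz/Gbps figure quoted above is simply the CPU cycles consumed per unit of achieved throughput. The small Python helper below shows how that metric is derived from a reported utilization and clock speed; the numbers in the example are hypothetical, not measurements from these tests.

        # Sketch of the GHz/Gbps CPU-cost metric: CPU cycles consumed per unit of
        # achieved throughput.  The example inputs are hypothetical placeholders.

        def ghz_per_gbps(cpu_util_percent, cpu_clock_ghz, throughput_gbps):
            """CPU cost of a transfer in GHz per Gbit/s.

            cpu_util_percent is relative to a single CPU, so 150 means one and a
            half CPUs' worth of cycles were busy during the transfer.
            """
            ghz_consumed = (cpu_util_percent / 100.0) * cpu_clock_ghz
            return ghz_consumed / throughput_gbps

        # Hypothetical example: 120% utilization on 2.4 GHz CPUs at 7.5 Gbits/s.
        print("%.2f GHz/Gbps" % ghz_per_gbps(120, 2.4, 7.5))   # ~0.38 GHz/Gbps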

    The SLAC equipment loans alone from companies such as Sun, Cisco and Chelsio totaled about $400,000. This does not include equipment loaned to NLR to provision the link from Sunnyvale to Pittsburgh.

    Challenges

    1. We were unable to connect our loaned 10 Gbits/s links to SLAC. Thus, in the last week we had to move the California-located equipment to Sunnyvale (about 20 miles from SLAC). Further, in the last few days we learned that it would be necessary to position the equipment at two separate locations in Sunnyvale. This absorbed much valuable time at the critical late stage before we travelled to Pittsburgh. Also, due to the short time available we did not have facilities for remotely power-cycling the remote hosts, so we had to be careful not to make dramatic changes to them (e.g. frequent reboots). We had to manually reboot one system at Sunnyvale during SC2004.
    2. Most of the loaned hosts and disk arrays were shipped directly from the provider to the show in Pittsburgh. Thus we were unable to complete setting up hosts until we reached Pittsburgh three days before the show. To assist in setting up at Pittsburgh, we created master disks at SLAC that we carried by hand to Pittsburgh.  
    3. Keeping host configurations updated. This was done by hand; we wrote a script to propagate files but did not have time to fully test it (a minimal sketch of such a propagation script appears after this list). Having NFS might have helped for a single site; however, NFS would have increased complexity and possibly raised some security concerns. We deemed AFS too complex for such a short-term demonstration.
    4. The lack of a name service meant we had to remember IP addresses. We used /etc/hosts on the hosts to assist with this. We used VLANs to associate traffic to and from off-site hosts with chosen routes. The use of VLANs and multiple remote sites meant we had many address ranges.
    5. Security: we did not use ACLs in the router; rather, we used iptables on the hosts.
      1. Enabling extra ports at the last minute for UDT led to confusion and delays.
      2. Also traceroute with UDP probes failed to work properly for a large part of the setup period. This led to delays in discovering incorrect MTU configurations.
    6. Jumbo frames: we set the booth router interfaces to support 9000 Byte MTUs. However, we missed setting the VLANs to have 9000 Byte MTUs until it was too late. Thus we were unable to make off-site measurements with large MTUs.
    7. The large mix of hardware (Opterons (V20Zs and a V40Z with various speeds and system disk sizes) and Xeons), operating systems (Linux 2.6.5 and 2.6.6, and Solaris 10) and NICs (Chelsio and S2io), while increasing our ability to test more features, also increased the complexity of setting things up.
    8. Coordination between the SLAC and Caltech booths was difficult due to their physical separation (about 100 yards). This made balancing link loads and sharing of resources more difficult. Next year we may share a booth dedicated to HENP network measurement and performance.
    9. It was hard to get SR XENPAKS to connect the S2io interfaces to the Cisco router.
    10. Even though we had several file servers in the SLAC booth, we had insufficient time to focus on file transfer performance. We were able to achieve about 110 MBytes/s sustained in file transfers using the Fast-SCSI system disks with a 250 GByte file.
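
    As mentioned in item 3 above, host configurations were propagated by an only partially tested script. The Python sketch below shows the general shape such a propagation script might take; the host list and file paths are illustrative, and this is not the script that was actually used.

        # Sketch of a configuration-propagation script of the kind mentioned in
        # item 3 above.  Host names and file paths are illustrative; this is not
        # the script that was actually used during the challenge.
        import subprocess

        HOSTS = ["scsl-4", "scsl-6", "v20z-sunnyvale-1"]     # illustrative host list
        FILES = ["/etc/hosts", "/etc/sysctl.conf"]           # files to keep in sync

        def push_file(host, path):
            """Copy one local file to the same path on a remote host via scp."""
            subprocess.run(["scp", "-p", path, "%s:%s" % (host, path)], check=True)

        if __name__ == "__main__":
            for host in HOSTS:
                for path in FILES:
                    try:
                        push_file(host, path)
                        print("updated %s on %s" % (path, host))
                    except subprocess.CalledProcessError as err:
                        # Keep going so one unreachable host does not block the rest.
                        print("FAILED %s on %s: %s" % (path, host, err))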

    Conclusions

    The Chelsio TOE NIC (T110) works very stably on uncongested paths. Most of the measurements used 1500Byte frames.

    On a 2.4 GHz Opteron, throughput is limited by the bandwidth of the PCI-X 133 MHz bus. We were able to get almost identical performance on the WAN as on the LAN.
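
    For reference, a 64-bit PCI-X slot clocked at 133 MHz has a raw bandwidth of about 8.5 Gbits/s, so a single NIC on that bus cannot reach the full 10 Gbits/s line rate. The short Python calculation below is ours (it assumes the 10GE NICs sat in 64-bit, 133 MHz slots):

        # Peak bandwidth of a 64-bit, 133 MHz PCI-X slot (our calculation; it
        # assumes the 10GE NICs sat in 64-bit / 133 MHz slots).
        bus_width_bits = 64
        bus_clock_hz = 133e6

        peak_gbps = bus_width_bits * bus_clock_hz / 1e9
        print("PCI-X peak: %.1f Gbits/s" % peak_gbps)   # ~8.5 Gbits/s
        # Bus-protocol and DMA overheads reduce what is usable in practice,
        # which is consistent with the ~7.4-7.5 Gbits/s single-NIC results above.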

    We were able to saturate over 99% of a 10 Gbits/s wavelength across the country, with two hosts with 10 Gbits/s NICs talking to two other such hosts.

    For 10 streams or fewer, the CPU utilization of the Chelsio TOE NIC is a factor of 2.5 to 3.5 less than that of the S2io NIC. The CPU utilization appears to be a function of both the achievable throughput and the number of parallel streams.

    We were able to get 11.4 Gbits/s from a single V40Z host (using two S2io 10GE NICs) to two V20Z hosts on the LAN.

    We were able to demonstrate the smooth inter-working of Chelsio and S2io 10GE NICs, as well as Cisco 650x switch/routers and a Juniper T320.

    With UDT we were able to achieve about 4.45 Gbits/s (transferring 200 GBytes) with 9000 Byte MTUs between two V20Zs in the SLAC booth; each V20Z had 2*2.4 GHz Opterons and a Chelsio 10GE card. The CPU utilization was 114%. This is about 2.6 times the CPU utilization (GHz/Gbps) of a similar transfer rate using the TCP TOE NIC, but similar to that used by the S2io NIC.

    Quotes

    "The smooth interworking of 10GE interfaces from multiple vendors, the ability to successfully fill 10 gigabit-per-second paths both on local area networks (LANs), cross-country and intercontinentally, the ability to transmit greater than 10Gbits/second from a single host, and the ability of TCP offload engines (TOE) to reduce CPU utilization, all illustrate the emerging maturity of the 10Gigabit/second Ethernet market. The current limitations are not in the network but rather in the servers at the ends of the links, and their buses" Dr. R. Les Cottrell assistant dirrector, Stanford Linear Accelerator Center Computer Services, and head of the SLAC led part of the team.

    Harvey Newman, professor of physics at Caltech and head of the team, said, "This is a breakthrough for the development of global networks and grids, as well as inter-regional cooperation in science projects at the high-energy frontier. We demonstrated that multiple links of various bandwidths, up to the 10 gigabit-per-second range, can be used effectively over long distances.

    "This is a common theme that will drive many fields of data-intensive science, where the network needs are foreseen to rise from tens of gigabits per second to the terabit-per-second range within the next five to 10 years," Newman continued. "In a broader sense, this demonstration paves the way for more flexible, efficient sharing of data and collaborative work by scientists in many countries, which could be a key factor enabling the next round of physics discoveries at the high energy frontier. There are also profound implications for how we could integrate information sharing and on-demand audiovisual collaboration in our daily lives, with a scale and quality previously unimaginable."

    Photos


    More on bulk throughput
    Bulk throughput measurements | Bulk throughput simulation | Windows vs. streams | Effect of load on RTT and loss | Bulk file transfer measurements | FAST TCP Stack Measurements | QBSS measurements

    Demonstrations
    SC2001 challenge | iGrid2002 demonstration | SC2002 SLAC/FNAL SC2002 bandwidth challenge | SC2003 bandwidth challenge | Internet2 Land Speed Record

    Created August 18, 2004: Les Cottrell, SLAC