Fermilab October 23-25, 2006, https://indico.fnal.gov/conferenceDisplay.py?confId=417
Rough Notes by Les Cottrell
CMS magnet test completed. It still has to be lowered into pit.
MONARC predicted tier 0 & 1, with 622Mbps connections and air freight for backup. Tier 2 was introduced in 1999. Tier 0-1 is 10-40Gbps, tier 1-2 up to 10Gbps. Tier 2s are for universities and small labs to manage; have 100 identified, also federated tier 2s. The main job is Monte Carlo simulations plus large disk caches, and serving tens of users for end-user analysis. CERN/Outside ~ 1.4, T0/sum(T1)/sum(T2) ~ 1:2:2. Tier 2 sizing for 2007: 1M+ SI2K, disk ~ 200TB, WAN 10Gbps or 2.5Gbps (2.5 is not economically favorable), evolving with technology from 2008 onwards.
Learning how to use 10Gbps effectively. New buses (e.g. PCI-X 2.0, PCI-Express) enable 10Gbps.
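The bus claim can be checked with back-of-envelope arithmetic (a sketch; the 266 MHz clock for PCI-X 2.0 and 133 MHz for classic PCI-X are standard figures, not from these notes):

```python
def bus_gbps(clock_mhz, width_bits=64):
    """Raw bus bandwidth: clock rate times data-path width, in Gbps."""
    return clock_mhz * 1e6 * width_bits / 1e9

# PCI-X 2.0 at 266 MHz on a 64-bit path gives ~17 Gbps raw, comfortably
# above a 10 Gbps NIC; classic 133 MHz PCI-X (~8.5 Gbps) is not enough.
print(round(bus_gbps(266), 1), round(bus_gbps(133), 1))
```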
ESnet transferred 1PB in April 2006; FNAL went from 0.2PB to 1PB in the space of a month. Amsterdam sees 80-100% per year growth.
LHCnet is now 10Gbps, going to 30, 40, 60, 80G. CERN-BNL: 2005: 0.5Gbps, 2006: 5Gbps, 2007: 15Gbps, 2008: 20Gbps, 2009: 30Gbps, 2010: 40Gbps. Network speed growth for LHCnet is slowing down, limited by technology.
Service challenges reached 7Gbits/s to disk, more recently 1.6GBytes/s. SC4 did a PByte/month. PhEDEx for CMS achieved 1GByte/s to disk.
10^7 event samples (size; transfer time @0.9Gbps; @8Gbps):
  Analysis Object Data (AOD)  0.5-1TB     1.2-2.5 hours
  RECO                        2.5-5TB     6-12           0.14
  RAW+RECO                    17.5-21TB   43-86
  MC                          200TB       98
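The transfer times in these samples are just size over bandwidth; a sketch of the arithmetic:

```python
def transfer_hours(size_tb, rate_gbps):
    """Hours to move size_tb terabytes over a rate_gbps link."""
    bits = size_tb * 8e12            # 1 TB = 8e12 bits
    return bits / (rate_gbps * 1e9) / 3600.0

# A 1 TB AOD sample at 0.9 Gbps takes roughly 2.5 hours,
# matching the 1.2-2.5 hour range for 0.5-1 TB above.
print(round(transfer_hours(1.0, 0.9), 1))
```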
ESnet plans to go to 160-400Gbps in 2011-12.
SC05 sustained rate of 1PB/day. SC06 10 * 10Gbps lambdas. Emphasis on disk-to-disk i.e. applications that run at disk speed.
Circuit set-up scheduling: Lambda stations, OSCARS, Terapaths.
Tier 2s are increasing in performance but are not really fully addressed yet. There are opportunistic tier 1-1 links being put in place.
LHCOPN meetings are held 4 times/year. Working Groups: operations (DANTE), monitoring (USA), routing (CERN), security (UK).
Not clear where Ultralight fits in.
PerfSONAR to be deployed as the network monitoring infrastructure, see (lhcopn.cern.ch Twiki)
Creating a managed operational network out of a set of L2 circuits is not trivial. T2 traffic is not carried by LHCOPN or USLHCnet; it is carried by country NRENs, which vary from country to country (e.g. Lithuania pays the same for 100Mbps as LHCnet pays for 10Gbps). Global links are in most cases opportunistic and not integrated, and will require bi- and tri-lateral etc. agreements. Are the general purpose ESnet/I2 networks sufficient? Are new T2-T? e2e circuits needed; if so: which ones, how do we determine them, who negotiates and with whom? CERN is the location of a T1/CAF; access by T2s should be anticipated, but if not in the megatable then not real. CERN will accept such requests if justified and agreed by the LCG MB. Be prepared to adjust the model and funding to meet evolving requirements. This is hard to predict and will take time depending on the scale of the changes. Important to get the starting position right.
ESnet & Internet2 are partnering to build an optical infrastructure. LHCnet is funded by the DOE OS/HEP program. Use of LHCOPN is not restricted to LHC. The LHCnet budget is fixed, so count on Moore's law. Needs 24x7 operations; support is an issue. US T1/T2 connections to EU T1/T2 are not doable over LHCnet. US T1 to T2 is connected by an NSF-funded combination of ESnet and Abilene direct connections. DOE funds the T1 and SLAC T2 via ESnet.
LHCnet procurements: market-price exposure makes costs difficult to predict. US connectivity to EU T1/T2s is unclear. Dynamic bandwidth reservation is being worked on. US T2 connections to StarLight/MANLAN are being negotiated - the ESnet AUP allows direct connections; Abilene connects to universities. T2s do not connect to LHCOPN.
CMS is MonALISA based, LHCOPN is perfSONAR based; integration may be an issue.
Have 6*10 Gbps waves to Chicago. ESnet using two for SDN and production.
A T1 CMS site ingests data from CERN via LHCOPN (lightpath) carried by LHCnet, and exchanges data with other T1s. The LHCOPN lightpath includes a security plan, an operations plan, and perfSONAR requirements. FNAL is a large T1. T1 <-> T1 traffic is bursty, a few times a year in the CMS model; a scavenger service on the OPN is available for these transfers and is supported in the OPN operation and security plan. FNAL to UCSD T2 got 240MBytes/s. An experiment's full data is distributed among the T1s. FNAL has to serve T2s in S. America, in the US (well provisioned) and in Europe & Asia. Most of the T2s are in Eurasia, so there is a lot of traffic to Eurasia. CSA06 is not fully exercising the network. This is partly because many non-US T2s are new; there are some e2e issues (windows etc.) and provisioning still needs work. US LHCnet has no ability to route onto GEANT; all US T2s use I2-GEANT, and it is unclear if it is working right. Have 2Gbps to DESY; have worked that link and are happy. Estonia and Spain have also started working with it and get 200-600Mbps. I.e. one needs to tune up links, etc. and get the same performance from FNAL as one gets to the site's favorite tier 1. Planning needs to understand what will actually happen when trying to do physics.
Current frontier: the OPN must establish interfaces to upper levels for the T0-T1 OPN. The accepted technology is perfSONAR, provided by most OPN nets; the others are gaps.
Emerging models of circuit-like services for high-impact data movement. FNAL has been implementing circuit-based services for 2 years. Some circuits (e.g. CMS) are heavily used in support of service challenges. The complexity of "circuits" is higher than IP services; they are more complex and difficult to debug. Use alternate path technology. Use minimal-size source/dest netblock pairs, e.g. US-CMS T1-T1 uses LHCOPN, the rest uses ESnet. Aggregating 10Gbps links adds to the complexity. Configuring routers/switches is complex and error prone. One has to simultaneously look at counters on multiple links along a path; this takes coordination of people from multiple administrative domains (e.g. end sites, ESnet, GEANT, RENATER, MANLAN NY, HOPI etc.). Monitoring must cover fail-over testing. Need automated monitoring.
50% of ESnet traffic flows to about 100 sites internationally. Currently most is BaBar, D0, CDF, not yet LHC. Total traffic is approaching 1.5PB/month; about 40% is the top 1000 host-host flows. In 5 years large scale science will be 95% of traffic. Much of this will be moved to dedicated circuits.
72% of the top 1000 flows are parallel data movers. Most are circuit-like (high bandwidth, stable, go on for a long time); circuits can last for months. Traffic is no longer statistical, it is deterministic.
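The "circuit-like" observation could be made operational by classifying flow records on duration and rate stability. A hypothetical sketch (not ESnet's actual method; the thresholds and the `is_circuit_like` helper are illustrative):

```python
def is_circuit_like(samples_mbps, min_hours=1.0, sample_period_s=300, max_cv=0.2):
    """Flag a flow as circuit-like: long-lived with a stable rate.
    samples_mbps: per-interval average rates from flow records."""
    duration_h = len(samples_mbps) * sample_period_s / 3600
    if duration_h < min_hours or not any(samples_mbps):
        return False
    mean = sum(samples_mbps) / len(samples_mbps)
    var = sum((x - mean) ** 2 for x in samples_mbps) / len(samples_mbps)
    cv = var ** 0.5 / mean          # coefficient of variation
    return cv <= max_cv

# A steady 900 Mbps data mover over 2 hours vs. a bursty web-like flow.
steady = [900] * 24
bursty = [5, 900, 0, 3, 700, 1] * 4
print(is_circuit_like(steady), is_circuit_like(bursty))
```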
ESnet 2006-07: a routed IP network with sites dually connected on metro area rings or dually connected directly to the core ring, plus a switched network providing virtual private networks for data intensive science. Uses Cisco switches and Juniper routers.
The ESnet budget increases by a factor of 1.7 over the next 5 years to improve ESnet infrastructure. Will build an MPLS/OSCARS network; 2008: finish the southern link. Every large DoE Lab will have redundant connections: rings in the Bay Area, Chicago and BNL, loops for ORNL, JLab. All updates will be adiabatic, i.e. things will continue to run through an upgrade.
ESnet has a NOC shared with NERSC. Engineering staff is on-call 7x24 for core IP network. An engineer is on call for a week at a time. Typical is about two engineer calls a week. Alerts are raised by Spectrum. Video conferencing is 8x7 coverage.
14 members (ISPs, Cisco & Internet2) and 150 universities, to provide the infrastructure necessary for big science. Developed by researchers for researchers; owned by US researchers; AUP-free. In place & operational. Connected to 15+ regional optical networks. It is cost-effective. Serves network researchers, science researchers with big applications, clinicians, ... Own fiber: 20yr IRUs (Level 3, WilTel, AT&T). Core services: WaveNet, FrameNet, PacketNet. Provides colo, remote hands and fibre IRUs.
To support production research science, i.e. a stable infrastructure. Funded out of the NSF International Research Networking Connections (IRNC) program. Data intensive science is driving the technology. It funds two trans-Atlantic networks and also provides connectivity to Pacific Wave. IRNC is part of GLIF, a consortium of institutions, organizations, NRENs and consortia. IRNC has a peering in Amsterdam that peers with GEANT.
KeySpan provides two paths to BNL from MANLAN. Three networks at BNL have access to the OPN. Security limits things due to the firewall; run a 6500 with one ACL. Will use the generic Internet for tier 2 connections; look forward to firewalls being able to keep up. Monitoring tools: a mon utility to look at external services, also Cacti for SNMP router stats, NDT from Rich Carlson, and NPAD. New Terapaths testbed: e2e controls the BNL border router and interworks with OSCARS to UMich. Local issues with T1-T0 connectivity.
40% of resources are at CERN; the majority of computing resources are located away from the host lab. Event size is ~ 250KB, data rate ~ 250MB/s, 150Hz DAQ target event rate. A nominal CMS T1 has a PB of dedicated disk space by 2008. Strict hierarchies do not exist for CMS; T2s have to be able to connect to any T1. T1 & T0 are CMS experiment resources, and their activities are nearly entirely specified; T2s are places where more flexible user oriented activities occur. T2s do Monte Carlo and analysis. Data is not called safe until there is a copy at a T1. There is an issue of synchronizing the data at T1s; this happens as a burst of activity and can include T1-T1 transfers. MC simulation is statistical across multiple T2-T1 links and fairly constant. Since the data is shared across multiple T1s with none having a complete set, T2s will need to access multiple T1s (i.e. not strictly hierarchical). T1 transfers are very reliable, i.e. the first try succeeds in getting the data through. Not so good for T2 transfers, which are kind of like T1s were a year ago.
Proposed national net for LHC T2 and T3 connectivity in the US. The name T23 (Tier 2 and Tier 3 connectivity) is also the name of a US tank (see http://afvdb.50megs.com/usa/pics/mediumtankt23.html). ATLAS: 1 tier 1 at BNL, 5 T2 (mostly multi-site), 30 T3 sites (2 labs and universities); CMS is in a similar situation (1/7/42). 68 different institutions at the T3 level. CMS is a huge mesh. LHCOPN does not provide support for T2/T3 (no peering or transit). Need to focus on T2/T3s and come up with an architecture. It will be multi-network, not just I2.
Dynamic circuit service. Call setup model requires telco type metrics. Propose creating a working group to write a document for the architecture for T23
Window sizes 4096 87380 1048576. The swapper tries vainly to free memory for more network buffers. Linux grants memory reservation requests optimistically; when it runs out, it kills processes.
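The three window sizes above look like Linux's tcp_rmem triple (min/default/max, in bytes); whether the maximum is big enough depends on the bandwidth-delay product. A sketch of the arithmetic (the 120 ms RTT is an illustrative transatlantic figure, an assumption not stated in the notes):

```python
def bdp_bytes(rate_gbps, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return rate_gbps * 1e9 / 8 * rtt_ms / 1e3

tcp_rmem_max = 1048576  # the third window-size value above
# A 1 Gbps path at 120 ms RTT needs ~15 MB of window,
# far more than the ~1 MB maximum shown above.
print(bdp_bytes(1.0, 120) > tcp_rmem_max)
```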
Interrupt handling: the NIC driver's "NAPI" mode handles input packets per interrupt; most work happens at software interrupt level. Transmit has higher priority than receive, which is not necessarily right. Packets are dropped inside the host due to lack of servicing of the receive buffer (this looks like a packet drop). If the process is suspended while receiving, the socket data structures are locked; packets arriving during that time go on the backlog queue and are not processed by TCP. The delay can be Nx100ms. The Linux scheduler rewards interactive processes in two ways: a priority boost & extra time slices. Interactivity is measured by accumulation of interruptible sleep time > run time. A process receiving a relatively slow TCP stream fits that criterion all too well and can starve other processes; in fact a slow stream will get better service than a fast stream. So set thresholds so that sleeps shorter than a certain time do not count; then one gets a much better fair share. FNAL has a solution that it will try to get incorporated in a future Linux distribution. Linux memory management has become very/too complex and will be rewritten.
Networks are becoming more complex, especially as we add dedicated circuits, QoS and scheduling; it is more challenging to know where data is going & why. Backbone interfaces will stay at 10Gbps, but will have multiple connections. Clusters and even a single computer can saturate a link. Increasing numbers of high bandwidth users. Big flows are not predictable. Users want to know if there is enough available capacity. Look at the effects of load spreading. Need to differentiate the cause of a problem: is it the application, middleware, Grid, OS, file system, LAN, WAN etc.? Applications are more dependent on the network. Need to see whole e2e paths. Also track whether the network is up or down. Understand how applications affect the network. Traditional alarming, diagnostics.
PerfSONAR moves away from an integrated analysis & visualization, measurement infrastructure, and performance tools, to standard interfaces so one can mix and match. Big collaboration. Not a product yet.
Where do we go next with monitoring for LHCnet?
Need automatic tools to analyse networks and find and report problems. Ideally monitoring should also fix problems. MonALISA has some very advanced graphics and analysis. It is not open source.
Need reliable, predictable service levels; using opportunistic bits of available links is not sufficient. Need to be scalable, with predictable funding and redundancy with multiple landing points; need coherence, well defined processes for negotiating and follow-up, and defined roles for ESnet, I2, USLHCnet; we don't want to build the network ourselves but to have vendors catch up.
T1-T1 across LHCOPN has created some political issues (it competes with NRENs). Need a Global Connectivity Committee (GCC) to propose policy for many different cases: E2E inside the GN2 (GEANT 2) cloud (in principle agreed to by NRENs), E2E with one end in the GN2 cloud, E2E with neither end in the GN2 cloud, E2E inside or across the cloud in multiple hops, end points well defined or aggregated users. The marginal cost of the last mile is not involved in this; it is mainly focused on trans-Atlantic links. Most networks have a simple AUP, i.e. one end of a conversation has to be on one of its sites. So far the only real concept is "reciprocity": transporting across an outside network at similar marginal cost.
GN2 is an EU project, with funding from NRENs & EU (50:50). DANTE is an implementation organization; policy is decided at the NREN policy committee meeting (NREN PC); subgroups work on different aspects, e.g. the Global Connectivity Committee for LHC.
One of the problems is that NRENs do not know how to classify LHCnet. If LHCnet were to become part of ESnet this might help. This was discussed including Craig Tull from DoE, Bill Johnson, Harvey Newman and Don Petravick.
The highest T0-T1, T1-T1 or T1-T2 rate is ~ 2Gbits/s storage to storage; aggregate rates are higher. DAQ rates are fairly constant over the life of the experiment, within a factor of 2. T1-T2 is part of the physics mission; the rate goes up with accumulated data collected. ATLAS still has a lot of work to do to get T1-T2s working; I do not think T2s have been involved in the ATLAS data challenges. T2s vary in size by a factor of 10 in terms of disk storage. Typically a T2 will have about 40 people (physicists), while tier 3s will have about 8. Tests have been made with combined ATLAS/CMS service challenges across the Atlantic.
Connectivity to offshore T2s must be addressed: it needs coherence; we need at least a tentative definition of the ESnet, I2 & US LHCnet roles; responsibilities in the US are not symmetric; CMS requires substantial connectivity; ATLAS testimony is not sufficient; BNL's ACL-based security system must be accommodated -- need an inventory of end sites.
Mission and organization: need to claim only the mission we can execute. US LHCnet is currently a network to CERN (FNAL, BNL); US LHCnet needs policy level documents and a defendable set of plans. The accepted TA network forecast should distinguish CERN connectivity from all others; (FNAL, BNL) LHCnet feels this may be difficult. Ultralight is an overlay network; US LHCnet is a network for transatlantic connectivity for LHC. LHCOPN is an overlay network to do something simple, i.e. get data from the T0 to T1s and some amount of T1 to T1s.
Clear statements for the mid-November GCC GEANT meeting will help develop a constructive relationship with GEANT. FNAL/BNL => Euro T2 is a prime candidate; expect input from the US experiments' computing projects. Expect commissioning activities to ramp up on the same time scales. ATLAS/BNL need to clearly state their model, and participate if the computing model indicates.
Offshore T2s: FNAL/BNL expect production services to be in place and maintained. Interface to "upper levels" of systems must be implemented as they emerge. Accepted interface mechanisms must be used.
Production: there is a requirement to monitor links as they traverse diverse networks. We see perfSONAR in USLHCnet plans; USLHCnet says nothing exists to interface to except MonALISA. There is an evident rift in the accepted infrastructures. FNAL/BNL trouble tickets & such -- we do not understand the level of integration.
Comments on US LHCnet key milestones: FNAL/BNL note that the LHCnet plans contain a 2007 milestone for peering in Europe.
A T2 has 200TB of disk, 1-10Gbps connectivity, does simulation and analysis, has the ability to rapidly refresh local "cache" disks, and must be able to move simulation results back to the tier 1. Showed the networks for the 5 ATLAS tier 2s. All sites plan to upgrade to 10Gbits/s.
Tier 3: some will have significant computing resources. No official manpower is supported; they need help with complicated issues. Need to document performance between sites. Deploy initial tier-2 network monitoring including MonALISA, IEPM-BW and PerfSONAR. Need to demonstrate WAN disk-to-disk transfers utilizing 90% of the sites' bottleneck bandwidth between two tier 2 sites, and to document the tuning and optimization at both sites necessary to achieve this. Deploy a "beta" end-host agent (LISA or a descendent) on selected edge servers. Update/tune network monitoring system(s), review effectiveness.
A CMS AOD event is 50KB. When going back to raw data and complete simulation, analysis selections operate on a complete trigger stream. A 1% selection of data and MC would be 4TB, 10% would be 40TB. There are ~ 40 people working at a tier 2. If half the people access a small selection at the level of twice a month, this is already 50MB/s on average, & everyone is working asynchronously. (In the original MONARC analysis, selections were once a week.) 10% selections will happen. 100MB/s x 7 T2s would be 6Gbps from a T1; 2/3 of that is trans-Atlantic for FNAL. The size of selections, the number of active people and the frequency of selections all have a significant impact on the total network requirements; one can easily arrive at 500MB/s for bursts. Data is processed by a batch slot at 1 MB/s, so with 100 quad-core hosts we have 400MB/s.
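These averages can be checked with back-of-envelope arithmetic (a sketch; the ~2.6e6 seconds-per-month figure is an assumption):

```python
SECONDS_PER_MONTH = 2.6e6  # ~30 days

def avg_rate_mb_s(users, selections_per_month, selection_tb):
    """Average rate if each user pulls a selection_tb dataset
    selections_per_month times."""
    bytes_per_month = users * selections_per_month * selection_tb * 1e12
    return bytes_per_month / SECONDS_PER_MONTH / 1e6

# 20 people (half of ~40) each fetching the 4TB (1%) selection twice a
# month averages ~60 MB/s, the order of the 50MB/s quoted above.
print(round(avg_rate_mb_s(20, 2, 4)))
# 100MB/s to each of 7 T2s is 5.6Gbps, i.e. the ~6Gbps cited from a T1.
print(100 * 7 * 8 / 1000)
```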
T2s are a resource for the physics community. Use of T3s is foreseen but not mandatory; a typical T3 is 4-8 people, with smaller sustained use than a T2 but similar turn-around requirements. All 7 CMS tier 2s are to be connected at 10Gbps by end of year (only one to go).
Synchronous Payload Envelopes are 90 columns by 9 rows; 4 columns are for control and monitoring, leaving 86 columns. An OC48 is 48 SPEs. In classic SONET these are all contiguous SPEs starting at boundaries 48, 96, 144, so one can get holes as circuits are added/dropped. LAN-PHY is a way to stick Ethernet frames into OC192c. Sticking 1GE in would require an OC48c, which is very inefficient. GFP (Generic Framing Procedure) is an improved way to encapsulate. It has two encapsulation techniques: GFP-F (frame mapped: maps frames (e.g. Ethernet) to SONET frames) and GFP-T (transparent mapped: bytes from the frame to SONET frames). VCAT (Virtual Concatenation) gives scalability - SONET pipes can be sized to match payloads, eliminating holes - and compatibility (only the ends need to support VCAT). It is achieved by creating VCGs (Virtual Concatenation Groups), eliminating the boundary and contiguity requirements, even down to the fibre layer.
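The inefficiency claim follows from the frame geometry above plus the standard SONET frame rate of 8000 frames/s (an assumed figure, not stated in the notes):

```python
FRAMES_PER_SEC = 8000  # standard SONET frame rate

def spe_rate_mbps(columns, rows=9):
    """Rate carried by a payload of rows x columns bytes per frame."""
    return rows * columns * 8 * FRAMES_PER_SEC / 1e6

# 90 columns gives the 51.84 Mbps STS-1 line rate; the 86 usable
# columns leave ~49.5 Mbps of payload per SPE.
payload = spe_rate_mbps(86)
# An OC-48c concatenates 48 SPEs (~2.4 Gbps); parking a 1 GE stream
# in it wastes well over half the capacity.
waste = 1 - 1000 / (48 * payload)
print(round(spe_rate_mbps(90), 2), round(waste, 2))
```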
Optical control planes: IETF developed GMPLS, ITU developed ASON. Both do routing, path computation etc. but different approaches - overall vs. layered. GMPLS doesn't support inter-domain capabilities (DRAGON is an NSF funded project to investigate). OIF has influential recommendations to use E-NNI from ASON, and has provided direction. There is much work left to be done in this space!
Interest exists because SONET prices have come down and are now competitive with 10GE, it has better debugging, and it has common Ethernet support.
LHCnet is funded by DoE-HEP. US LHC is funded by DoE-HEP and NSF-MPS. LHCnet mission: "Provide reliable, dedicated, high-bandwidth connectivity for US DoE-HEP between the US and CERN, specifically targeted at the US LHC experiments' T1 & T2 facilities, but not restricted to the LHC experiments alone." How do we define a "US DoE-HEP facility" without excluding facilities which are funded by partners? Connectivity for US DoE-HEP between the US and CERN ... not restricted to LHC. LHCnet as a transit network (e.g. to Asia) is not appropriate. US LHCnet's non-CERN domain of responsibility will end at the peering point.
Look at both ESnet and Internet2 AUPs. Typical network AUP authorizes usage must include at least one end point that is a formal client of the network, e.g. for Abilene this means Internet2 members, for ESnet it is DoE facilities (e.g. Labs).
About 40 people were present representing ATLAS, CMS, CERN, Internet2, ESnet, NLR, DoE, and T0, T1 and T2 sites. It was a very successful meeting in exposing many issues. Action items include: getting an AUP for LHCnet; getting traffic estimates for T1-T2s; and, to help ISPs understand LHCnet, making it closely associated with ESnet. The immaturity of PerfSONAR was exposed, as was the need to figure out where MonALISA and PerfSONAR fit together. Also, the need to better understand the T1-T2 traffic requirements became apparent. There will be another meeting next year.