
IEPM-BW project and PPDG - meeting, Toronto Feb '02



There was a session at the PPDG meeting in Toronto from 1:30pm to 2:30pm on Wednesday 20th February 2002, on the "IEPM-PPDG Collaboration". The IEPM-PPDG collaboration was kicked off Dec 20 2001. The purpose of the IEPM-BW project is to develop and use an infrastructure to make active end-to-end application and network performance measurements for high performance network links such as are used worldwide by Grid applications. As such we believe an active collaboration between the IEPM and the PPDG is of mutual benefit.

The initial goals of this collaboration are:

This might result in changes in our directions and priorities. It would also give us higher visibility and a larger audience for the results.


This was the agenda: 

Executive summary of presentation

IEPM-BW is an outgrowth of the IEPM-PingER project, focused on high performance PPDG, HENP and ESnet sites. The first pilot implementation was for SC2001. It is funded under the DoE/MICS base program. It will make application (bulk throughput) and network throughput measurements on a regular basis, and provide archiving, data, analysis, reporting, validation and forecasting. Uses include: validating tools, troubleshooting, forecasting, and application steering.

We are measuring from SLAC to all PPDG sites, and have measurements to 32 sites going back to the end of last year. We want to deploy the measurement tools to other sites so they can make measurements relevant to them (e.g. to their collaborators). The infrastructure tools are not yet portable; the intent is to make them portable. A possible first replica site is FNAL; there is particular interest from D0.

Initial focus is for a small number of monitoring sites, e.g. PPDG sites, each monitoring collaborator sites of interest, i.e. hierarchical, mirroring the needs of HENP tiered sites for replicating data.

There are questions of scalability.

We will work with Paul Avery to identify GriPhyN & iVDGL sites and get monitoring to them. We are collaborating with NWS/UCSB on forecasting, and will work with FNAL on a replica monitoring site, NLANR/CAIDA on bandwidth estimation validation, ANL, LBNL & BNL on publication, and LBNL, Rice and UDelaware on bandwidth prediction.

Questions and answers (notes taken by Lee Lueking of FNAL):

(ML= Miron Livny, LC = Les Cottrell, LL = Lee Lueking, HN = Harvey Newman, RP = Ruth Pordes, TN = Thomas Ndousse, PA = Paul Avery).

ML: Is an application a file transfer? What can I do if my ping is lousy? What do I do with the information?

LC: The only application being measured at the moment is file copy/transfer. We can do end-to-end measurement and attempt to do forecasting. We are working with Rich Wolski and Martin Swany to provide information to NWS to help with forecasting.

HN: One can also look for problems on one's own network segment. As an end user it is not clear what can be done to fix this; when there are problems, human intervention is required. The application does not use this information, but it is an important part of feeding problems back to the ISP or other suppliers.

RP: This is not a new proposal and charter; it is funded as PingER. The collaboration with the sites outside the US is evolving. The main drive is currently at SLAC; we would like to get Fermilab involved.

ML: How scalable are these measurements, and how frequently can applications probe data sources before it becomes a problem? How many resources are you going to put behind it? 

LC: This works now because the resources are not heavily used; it may become a problem in the future. Typically a single forecast takes a couple of seconds.

ML: What can I as a user do with this?

HN: There are a few people who need to be able to look at this and diagnose problems. 

ML: Not clear this scales to hundreds of production sites.

RP: An alternative is to instrument every application. Do you discuss alternatives?

HN: Could you take this data, how long it takes to get started, and how the transfer is proceeding to predict how long a transfer will take? 

LC: That is the intention.

RP: Is this the kind of thing that iGOP at Indiana would do as part of iVDGL? Jim Williams would be interested in this. Who would incorporate the data into a standard information service? Jenny may work to make it available as part of the information working group.

RP: How will FNAL become involved? 

LC: Have been working with Al Thomas to get this started.

LL: We would like to work within this framework to benchmark the networks from FNAL to D0 test sites in the coming weeks. 

TN: Is there only a centralized way of doing these tests, or could anyone do these tests? 

LC: It is currently centralized, but may soon become easier and better documented, so that we can deploy the measurements to other sites, who can then add their own tests. Currently we do not have an honest answer as to how much it would take to replicate the testing resources SLAC has set up at a second site.

LC: Is there anyone who would like access to the raw data? (silence). What other sites should we measure? GriPhyN, IU, etc. 

ML: One thing one can do, if Condor is running at a node, is use it to schedule and launch these kinds of tests. LC: There is a problem with UW because of their firewalls. Also, there are interesting issues related to testing throughput through firewalls.

PA: Would like to identify interesting sites from a Florida perspective; Florida would be a good site to make measurements from.

LC: The load on the network is about 6 Gbits every 90 minutes.
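As a rough sanity check of this figure, the quoted 6 Gbits per 90-minute measurement cycle can be converted to an average load (the function name below is purely illustrative):

```python
def average_load_mbps(bits: float, interval_minutes: float) -> float:
    """Average load in Mbit/s for `bits` of traffic injected
    once every `interval_minutes`."""
    seconds = interval_minutes * 60
    return bits / seconds / 1e6  # bits/s -> Mbit/s

# 6 Gbits every 90 minutes, as quoted above
print(f"{average_load_mbps(6e9, 90):.2f} Mbit/s")  # ~1.11 Mbit/s
```

Averaged over the cycle, the measurement traffic amounts to only about 1.1 Mbit/s, which is consistent with the remark above that the load is not currently a problem.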

HN: Question of firewalls, not just the issue of security, but the bandwidth limitations.

Created February 18, 2002, last update February 20, 2002.