Meeting between SLAC & FNAL PingER/IEPM developers, at FNAL Chicago

Author: Les Cottrell. Created: October 17

Attendees: Phil DeMar, Al Thomas, Maxim Grigoriev, Donna LaMoore - FNAL; Warren Matthews, Les Cottrell - SLAC.


Page Contents

SAS to MySQL conversion
Current operation
Other Issues
Timeping
Iperf
Metrics
PPDG monitoring

SAS to MySQL migration

The new MySQL software is running on the new machine. We now need to copy the data from SAS to MySQL. The new machine (a Sun) will have an Apache server with FrontPage extensions, Perl graphics, etc. There is a new DoE directive to review the sensitivity of information. The old PingER software will stay on the old machine. The new machine, with all the new software, will be ready "next week" (Maxim) or at least before Thanksgiving (Frank). The disks still need to be organized. The new plotting will be part of the new machine/package. After cutover the old machine will become a test machine. One question is whether to burn the old data onto ROM in text form. There have been a few requests for bulk archive data. The alternative is to make the data available read-only via anonymous FTP. It was agreed the latter is preferable.

It was agreed to provide offsite access to the FNAL data for the purposes of distributed analysis. The main initial customer would be Warren; however, people at other sites may also be interested. For the moment it was agreed to use TCP wrappers to open access to Warren or SLAC only. This may need review in future, for example if we need the capability to access the data for PPDG sites.

We need schema fields that enable storing and retrieving the individual RTTs and sequence numbers (see later under Metrics for more on this). These were not in the SAS database. There can be a variable number of pings in a set. A time stamp is also needed for each ping. It was agreed to save all the RTTs and sequence numbers. Problems are more likely to arise with performance than with disk space. Maxim will review how best to implement this.
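One possible shape for such a schema is sketched below, using Python's built-in sqlite3 module as a stand-in for MySQL. The table and column names (ping_set, ping_sample, rtt_ms and so on) and the helper functions are illustrative assumptions only, not the schema Maxim will implement.

```python
import sqlite3

# Hypothetical schema: one row per set of pings, one row per individual ping,
# so a set can hold a variable number of pings.
SCHEMA = """
CREATE TABLE ping_set (
    set_id      INTEGER PRIMARY KEY,
    src_host    TEXT NOT NULL,
    dst_host    TEXT NOT NULL,
    packet_size INTEGER NOT NULL
);
CREATE TABLE ping_sample (
    set_id   INTEGER NOT NULL REFERENCES ping_set(set_id),
    seq      INTEGER NOT NULL,   -- sequence number reported by ping
    sent_at  REAL    NOT NULL,   -- time stamp of this ping (epoch seconds)
    rtt_ms   REAL,               -- NULL means the ping was lost
    PRIMARY KEY (set_id, seq)
);
"""

def store_set(db, src, dst, size, samples):
    """Store one set; samples is a list of (seq, sent_at, rtt_ms or None)."""
    cur = db.execute(
        "INSERT INTO ping_set (src_host, dst_host, packet_size) VALUES (?, ?, ?)",
        (src, dst, size),
    )
    set_id = cur.lastrowid
    db.executemany(
        "INSERT INTO ping_sample (set_id, seq, sent_at, rtt_ms) VALUES (?, ?, ?, ?)",
        [(set_id, seq, sent_at, rtt) for seq, sent_at, rtt in samples],
    )
    return set_id

def load_rtts(db, set_id):
    """Return [(seq, rtt_ms)] for one set, ordered by sequence number."""
    return db.execute(
        "SELECT seq, rtt_ms FROM ping_sample WHERE set_id = ? ORDER BY seq",
        (set_id,),
    ).fetchall()

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
```

Keeping one row per ping makes the per-packet data easy to query, at the cost of many more rows than one summary row per set, which is where the performance concern noted above would show up.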

Current Operation

DNS is a big stumbling block and a cause of lost data. Today the configuration file provides both the IP address and the name, and the IP address often gets out of date. Frank proposed that timeping maintain a small DB of names and addresses. If a host has a DNS name and it resolves successfully, update the DB. If the name does not resolve, use the last IP address. This could cause a problem if another machine at another site has been given the old address (this has actually happened in the past). A tool will be run monthly to compare the configuration file with the database, so that new IPs can be verified manually.
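The proposed lookup logic could be sketched roughly as follows. This is a minimal illustration in Python rather than timeping's own code; the function names, the use of a plain dictionary as the "small DB", and the injectable resolver are all assumptions made for the example.

```python
import socket

def resolve_with_fallback(name, cache, resolver=socket.gethostbyname):
    """Return the address to ping for name, updating cache on success.

    cache is a dict of name -> last known good IP address (the "small DB").
    Returns None if the name does not resolve and no address is cached.
    """
    try:
        address = resolver(name)
    except OSError:
        # DNS lookup failed: fall back to the last address that worked, if any.
        return cache.get(name)
    cache[name] = address
    return address

def config_mismatches(config, cache):
    """List (name, configured_ip, cached_ip) entries that disagree.

    config is a dict of name -> IP address taken from the configuration file.
    The result is meant for the manual monthly check.
    """
    return [
        (name, ip, cache[name])
        for name, ip in sorted(config.items())
        if name in cache and cache[name] != ip
    ]
```

Note that the fallback cannot detect the case mentioned above where the old address has since been reassigned to a different machine; that is what the manual comparison is for.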

Les will review the Beacons list to see which hosts have been lost, look for new ones and come up with a new list. Warren will provide a list of Beacons that look suspicious (e.g. not responding). Frank will email recent gather logs to show which sites are having problems.

Warren has spent a lot of effort filling in blanks where data was not gathered. We need to do this for the FNAL data. To facilitate this it will help for SLAC to have access to the FNAL data for comparison purposes. Warren will also give FNAL read access to the SLAC data.

Other Issues

CVS is in place and Maxim is populating it. The FNAL CVS repository is Kerberized. FNAL Kerberos has been modified to handle cryptocards; the modifications will be folded into a future MIT Kerberos 5 release. All the FNAL products that we care about are available as tarballs. There is a FNAL Kerberized ssh. Read the FNAL strong authentication documentation; it is extensive. We will need to authenticate to use accounts, and can either get a cryptocard or run Kerberos at SLAC. There may be a conflict between installing FNAL Kerberos 5 at SLAC and the existing SLAC Kerberos, which is probably version 4. Kerberos is not needed for read-only access to data, which can be provided via the web server. Matt Crawford is the architect of the FNAL strong authentication plan. We talked to him later on and got copies of the Strong Authentication at FNAL manual.

Port to NT: we have had 2 enquiries in 5 years. Timeping makes a lot of system calls. Given the low interest, we will offer to help anybody who is interested in doing the port, and potentially to help with integration later.

PingER developers: we have an email group for this. As we make the data more publicly available, Warren will talk with the UCL developer and Robin Tasker to see how great the divergence is. It would be good to try to make personal contact, maybe at the GGF in February.

Directives allow more customized configuration files. We agreed to discuss possible directives by email, with a view to what would be useful and how to implement them.

Payload message: this is a good idea, but Frank has been unable to find any support for it in the standard ping. Implementing our own ping would require suid, which might make deployment difficult at many sites.

Poisson scheduling of the half-hourly intervals: Frank agreed on Maxim's behalf to look at random (Poisson) scheduling of the half-hourly jobs. Care is needed to make sure the scheduler does not itself die.
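As a rough illustration of what Poisson scheduling means here: the gaps between successive runs are drawn from an exponential distribution whose mean is the half-hour interval. The Python sketch below is only an example; the function name and the idea of precomputing a list of run times are assumptions, not an agreed design.

```python
import random

MEAN_INTERVAL_S = 30 * 60  # average gap between runs: half an hour

def poisson_schedule(start, end, mean=MEAN_INTERVAL_S, rng=random):
    """Return run times in [start, end) whose gaps are exponentially distributed.

    start and end are in seconds; rng is anything with an expovariate method,
    such as the random module itself or a seeded random.Random instance.
    """
    times = []
    t = start + rng.expovariate(1.0 / mean)
    while t < end:
        times.append(t)
        t += rng.expovariate(1.0 / mean)
    return times
```

For example, poisson_schedule(0, 24 * 3600) gives one day's worth of run times, about 48 on average. Precomputing the times in a short-lived job launched from cron might be one way to limit the damage if the scheduler dies.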

XML: the new NIMI/GIMI servers will use XML, so we need to learn about it.

LDAP appears to be the tool of choice for grid-related tools. ESnet will issue OIDs (Warren has one). LDAP is optimized for reading and for tree-like structured data, so after some consideration we concluded that LDAP is inappropriate as a replacement for MySQL. It might, however, be useful for providing access to resource brokers that want to look at recent network information, and it was agreed that this is desirable. Such a service would talk to resource brokers via LDAP and translate those calls into MySQL queries.

Timeping

This has been ported from VMS to Unix. It can optionally measure RTTs and losses along the route. The route histories by themselves are very useful for diagnostics and could also be useful for Internet tomography.

Iperf

We now have regular Iperf measurements going to about 20 sites for SC2001. We would like to install a server at FNAL on the DMZ. There are problems installing it on the FNALU cluster (lack of pthreads, among others). The server will probably be a Linux box, and Frank has some hardware available. Les will send a list of things that he needs on the Linux server, such as C, C++, gunzip, tar, make, cron, pthreads, pico, ssh, perl and large window buffers. Ports 5000 - 5011 should be accessible, i.e. not blocked by a firewall. The server should run a production version of Linux (i.e. Linux 2.2 or, preferably, 2.4) with the appropriate glibc. Frank will put the machine together, get an address, and install Linux and the critical system tools. Les will install bbcp, iperf etc. after he gets access. The machine will be available by the end of next week. It may be rebuilt and/or upgraded after SC2001. Following a later discussion with Matt Crawford, it appears that an acceptable solution will be to run the iperf server all the time. This removes the need to start and stop it remotely from the monitoring site (SLAC). We may need to monitor the availability of the server and restart it automatically, or alternatively put it into inetd.
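A simple availability check of the kind mentioned above could look like the following Python sketch, which only tests whether a TCP connection to the server port can be opened. The function name is hypothetical, and the choice of port (for instance 5001, within the 5000 - 5011 range above) and what to do on failure are assumptions left to the caller.

```python
import socket

def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A cron job on the server could call this against its own address and restart iperf when it returns False; running it from SLAC instead would also catch firewall or routing problems.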

Metrics

Warren has derived new metrics from the PingER data. To do this he needs the packet-by-packet information. Some sites already gather this information. It would be good to make the new metrics available in the FNAL plots. Maxim will look at this. Warren also added MTBF and MTTR, but they do not appear to be very useful.

PPDG monitoring

We need to raise our profile in the PPDG community. This is to ensure our measurements are of use to the PPDG, and to keep them informed of what we are planning, what we have available etc. This might result in changes in our directions and priorities. It would also give us higher visibility and a larger audience for the results. One immediate interest is to continue the iperf monitoring beyond SC2001, where the approval and interest of PPDG members would be valuable to ensure that the infrastructure put in place at sites for SC2001 stays in place and is supported. Another interest is to tie our measurements into providing predictive capabilities for resource brokers etc., which would be in line with our current interests in the Network Weather Service (NWS). This type of work is also being pursued in the European Datagrid by the Daresbury folks, who are also involved in PingER. Further, it could also leverage efforts to assist with developing network-aware applications such as bbcp.

In the long term we should probably aim to install the equivalent of NIMI/GIMI dedicated measurement machines/boxes at critical sites, somewhat akin to the AIME proposal. This will provide more flexibility, consistency, more granular security capabilities, and improved management of installations and upgrades. This phase would be more ambitious and would probably require modest additional funding, at the least to acquire, configure and install the probes.

The group agreed that we should pursue tighter collaboration with the PPDG. Les will discuss the issue with Richard Mount (a PPDG PI) to ascertain interest, the best approach, where it might fit in the PPDG projects, and then look into composing an email to some key PPDG members proposing closer collaboration between the IEPM and PPDG. 

