There are three separate places where code is run in an individual
transaction. First, there is the persistent Clarens web service[2],
generally located at the execution site. Second, the job runs a jobmond
process, which handles the communication and executes the commands on the
worker node; the jobmond persists for as long as the job is executing.
Finally, there is the client code, which a user executes when they
want to interact with the remote job; it persists only until the
transaction has completed. This configuration is diagrammed in Figure .
The central server has the primary responsibility of authenticating the user and implementing the access control that limits users to interacting with jobs for which they have permission.
While the jobmond persists for as long as the job is executing, it cannot be used as a TCP or other kind of network server, since it is assumed that many sites will disallow user jobs from listening on worker node ports. To accommodate this, a wake-up/call-back mechanism has been implemented. When the central server needs to send a command to a worker node, the server calls a wake-up module, which can have different implementations depending on the site. The currently existing implementations are UDP broadcast and a central TCP server with open connections to all jobs. Each of these has different limitations. Once awake, jobmond communicates using secure and authenticated Clarens calls.
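The pluggable wake-up module described above can be sketched as follows. This is a minimal illustration, not the JobMon implementation: the `WakeUp` interface, the `UdpBroadcastWakeUp` class, and the `WAKE <job id>` datagram format are all hypothetical names chosen for the example. The datagram carries no command; it only prompts the named job to call back over an authenticated Clarens connection.

```python
import socket
from abc import ABC, abstractmethod


class WakeUp(ABC):
    """Site-specific way of asking a job's jobmond to call back (hypothetical interface)."""

    @abstractmethod
    def wake(self, job_id: str) -> None:
        ...


def encode_wakeup(job_id: str) -> bytes:
    # Hypothetical wire format: the datagram names the job and nothing else.
    return b"WAKE " + job_id.encode("utf-8")


def decode_wakeup(datagram: bytes):
    # Return the job id, or None if the datagram is not a wake-up request.
    prefix = b"WAKE "
    if not datagram.startswith(prefix):
        return None
    return datagram[len(prefix):].decode("utf-8")


class UdpBroadcastWakeUp(WakeUp):
    """One possible implementation: broadcast the wake-up on the local network."""

    def __init__(self, broadcast_addr: str, port: int):
        self.addr = (broadcast_addr, port)

    def wake(self, job_id: str) -> None:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            # Broadcasting must be enabled explicitly on the socket.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            sock.sendto(encode_wakeup(job_id), self.addr)
```

A site that cannot use broadcast would supply a different `WakeUp` subclass, for example one that writes to an already open TCP connection held by the job, without any change to the calling code.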
Inside the Clarens server, the user side communicates with the job side using interprocess communication (FIFOs).
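As a rough illustration of FIFO-based communication between the two sides, the sketch below passes one line of text through a named pipe. It assumes a POSIX system, and the function names and the use of threads in place of the two server-side handlers are simplifications made for this example only.

```python
import os
import tempfile
import threading


def job_side(fifo_path: str, reply: str) -> None:
    # Opening for write blocks until the user side opens the FIFO for reading.
    with open(fifo_path, "w") as fifo:
        fifo.write(reply + "\n")


def user_side(fifo_path: str) -> str:
    # Opening for read blocks until the job side opens the FIFO for writing.
    with open(fifo_path, "r") as fifo:
        return fifo.readline().rstrip("\n")


def relay_once(reply: str) -> str:
    """Create a FIFO, send one reply through it, and clean up afterwards."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "job.fifo")
    os.mkfifo(fifo_path)
    try:
        writer = threading.Thread(target=job_side, args=(fifo_path, reply))
        writer.start()
        received = user_side(fifo_path)
        writer.join()
        return received
    finally:
        os.remove(fifo_path)
        os.rmdir(workdir)
```

Because both ends block until the other is present, the FIFO also acts as a simple rendezvous between the waiting user request and the job's call-back.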
The Clarens infrastructure is responsible for authenticating the user and the job using the X509 certificates they provide. Once that has been completed, the JobMon code matches the user and job GRID subjects using regular-expression-based criteria. A system administrator configuring their JobMon service can set the match criteria. Generally this simply means matching the user name and authority extracted from the user subject to the user name and authority extracted from the job subject.
For example, with kx509-generated certificates from Fermilab, we require that the certificate name and the user ID match. This is specified with a configuration entry of the form:
/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=<name> /UID=<user>
/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cdf /CN=<name>/0.9.2342.19200300.100.1.1=<user>
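The matching step can be sketched with Python regular expressions. This is an illustration under stated assumptions rather than the JobMon code: the translation of `<name>` and `<user>` placeholders into named groups, the helper names, and the sample subjects (a made-up user "Jane Doe" with ID "jdoe") are all hypothetical. The templates are adapted from the entries above with the internal whitespace removed, which is itself an assumption about the intended subject strings.

```python
import re

PLACEHOLDER = re.compile(r"<(\w+)>")

# Adapted from the configuration entries above (whitespace removed; an assumption).
USER_TEMPLATE = "/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=<name>/UID=<user>"
JOB_TEMPLATE = ("/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=cdf"
                "/CN=<name>/0.9.2342.19200300.100.1.1=<user>")


def template_to_regex(template):
    # Escape the literal text and turn each <field> into a named group that
    # matches everything up to the next "/" separating subject components.
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))
        parts.append("(?P<%s>[^/]+)" % m.group(1))
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts))


def extract(template, subject):
    # Return the extracted fields, or None if the subject does not fit the template.
    m = template_to_regex(template).fullmatch(subject)
    return m.groupdict() if m else None


def subjects_match(user_subject, job_subject):
    # Access is granted only if both subjects fit and every extracted field agrees.
    user = extract(USER_TEMPLATE, user_subject)
    job = extract(JOB_TEMPLATE, job_subject)
    return user is not None and user == job
```

With these templates, a user subject ending in `CN=Jane Doe/UID=jdoe` matches a job subject ending in `CN=Jane Doe/0.9.2342.19200300.100.1.1=jdoe`, while a job submitted under a different user ID does not.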
In order to remove completed jobs from the server registry, jobs are
required to periodically reregister. The load these reregistrations
place on the Clarens server limits the overall number of jobs a single
server can support. Tests have shown that a dual 3.4 GHz Xeon machine can
support 4 Hz of registration calls while maintaining a 1 second latency. This
corresponds to supporting a 3000 job farm with a 12 minute reregistration
time. For a 10 Hz rate the latency increases to 10 seconds.
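The relation between farm size, reregistration time, and call rate is simple arithmetic, sketched here under the assumption that registrations are spread evenly in time. The function names are illustrative only.

```python
def registration_rate_hz(n_jobs: int, period_s: float) -> float:
    # Each job registers once per period, so the average rate is jobs / period.
    return n_jobs / period_s


def max_jobs(rate_hz: float, period_s: float) -> int:
    # Largest farm a server can sustain at a given call rate and period.
    return int(rate_hz * period_s)


# 3000 jobs reregistering every 12 minutes give a little over 4 Hz.
rate = registration_rate_hz(3000, 12 * 60)
```

Read the other way, a server sustaining 4 Hz with a 12 minute period supports `max_jobs(4.0, 720)`, that is 2880 jobs, which is the roughly 3000 job farm quoted above.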
Since reregistration serves only to identify jobs that are no longer
running, the reregistration time can be set fairly long. The only cost
associated with this is that jobs that do not terminate cleanly will be
left in the list of known jobs for longer.
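A registry that forgets jobs which stop reregistering might look like the sketch below. The `JobRegistry` class and its `grace_factor` parameter are hypothetical; in particular, allowing a multiple of the reregistration time before dropping a job is an assumption made here to tolerate a late call, not something stated in the design.

```python
class JobRegistry:
    """Tracks the last registration time of each job (illustrative sketch)."""

    def __init__(self, reregistration_period_s, grace_factor=2.0):
        # A job is considered gone once it is this long overdue.
        self.timeout_s = reregistration_period_s * grace_factor
        self._last_seen = {}

    def register(self, job_id, now):
        # Used for both the first registration and every reregistration.
        self._last_seen[job_id] = now

    def unregister(self, job_id):
        # Clean termination: the job removes itself immediately.
        self._last_seen.pop(job_id, None)

    def prune(self, now):
        # Drop jobs that have not reregistered in time and report which ones.
        stale = [j for j, t in self._last_seen.items() if now - t > self.timeout_s]
        for j in stale:
            del self._last_seen[j]
        return stale

    def known_jobs(self):
        return sorted(self._last_seen)
```

The trade-off described above is visible in `timeout_s`: a longer reregistration time lowers the call rate on the server, but a job that dies without calling `unregister` stays in `known_jobs()` for correspondingly longer.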
[1] http://cdfcaf.fnal.gov/
[2] http://clarens.sourceforge.net/
[3] http://physics.ucsd.edu/~schsu/project/JobMon/
The translation was initiated by Elliot Lipeles on 2005-08-29