Authors: Shih-Chieh Hsu, Elliot Lipeles, Conrad Steenberg, Frank Würthwein
Installation
Overview
JobMon consists of three parts that run in three separate places:
- the central server is a continuously running service,
- the daemon is started by each job on its worker node and lasts for the duration of the job,
- the client tools are run interactively by the user
on their desktop and terminate after each request.
The central server is a Clarens service which is configured by
a site administrator and is responsible for all the authentication. The daemon
and the client each have an X509 proxy which is used for authentication with
the central Clarens server. The daemon needs to run a wake up service which
is used to notify it that there is a user request.
The following sequence of events gives a quick orientation to the
role each of these components plays:
- The central server is started by the site administrator and remains running indefinitely
- The user submits a job configured to use JobMon
- Once running, the job registers with the central server
- The user makes an interactive request from the command line,
which contacts the central server and pends until the result is ready
- The central server wakes up the corresponding job
- The job contacts the central server, gets and executes the user request,
and returns the result to the central server
- The central server gives the result to the pending request and the
user's command line returns the result
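The sequence above can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions: the class ToyCentralServer, the helper functions and the use of threads and queues are hypothetical and are not JobMon code. The method names loosely follow the service names listed under LogFiles. The real system uses Clarens services, X509 authentication and a network wake-up; here, waking the job and delivering the request are merged into a single queue operation.

```python
import queue
import threading


class ToyCentralServer:
    """In-memory stand-in for the central server (hypothetical, not JobMon code)."""

    def __init__(self):
        self.requests = {}  # job name -> queue of pending user requests
        self.results = {}   # job name -> queue of results going back to users

    def register_job(self, job_name):
        # A running job registers with the central server.
        self.requests[job_name] = queue.Queue()
        self.results[job_name] = queue.Queue()

    def query(self, job_name, command, timeout=5.0):
        # The user request wakes the job, then pends until the result is ready.
        self.requests[job_name].put(command)
        return self.results[job_name].get(timeout=timeout)

    def get_job_to_do(self, job_name, timeout=5.0):
        # The woken job fetches the user request.
        return self.requests[job_name].get(timeout=timeout)

    def output_job_result(self, job_name, result):
        # The job returns the result to the central server.
        self.results[job_name].put(result)


def run_job(server, job_name, handlers):
    """Register the job, then serve exactly one request in a background thread."""
    server.register_job(job_name)

    def serve_once():
        command = server.get_job_to_do(job_name)
        handler = handlers.get(command)
        if handler is None:
            result = "unknown request: " + command
        else:
            result = handler()
        server.output_job_result(job_name, result)

    worker = threading.Thread(target=serve_once, daemon=True)
    worker.start()
    return worker


def demo_flow(command):
    """Play through the whole sequence once and return what the user sees."""
    server = ToyCentralServer()
    worker = run_job(server, "job_923.section_1", {"status": lambda: "running"})
    answer = server.query("job_923.section_1", command)
    worker.join(timeout=5.0)
    return answer
```

Calling demo_flow("status") plays the user side of the exchange: the call blocks in query() until the job thread has fetched the request and posted its result.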
Configuration Choices
There are currently two implemented wake-up mechanisms (TCP and UDP broadcast),
suited to different site configurations. In addition, the central manager
can run either as an edge service at a site or external to the site. For a small
site, the TCP method is preferred because the connection state ensures that a
job's exit is recorded immediately. TCP can also go through most firewalls,
so the central manager can be outside the site. The disadvantage of TCP
is that it requires one open port on the central manager for each active job.
For a large system this can be a problem. The UDP mechanism is very lightweight,
but requires that the central manager be on the same subnet as the worker nodes
and that the grid jobs be allowed to listen on ports. The most restrictive networking
environment considered in the design is one where the worker nodes can only make outgoing
TCP connections within a site. In this case, the TCP method is used with the central
manager located at the edge of the site.
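The trade-offs above can be summarised as a small decision helper. This is a sketch of the reasoning only and is not part of JobMon: the function name, the parameter names and the max_open_ports limit are assumptions chosen for illustration.

```python
def choose_wakeup(active_jobs, max_open_ports, manager_on_worker_subnet,
                  jobs_may_listen, workers_reach_outside_site):
    """Return (method, manager location) following the trade-offs in the text.

    All parameters are hypothetical site properties, not JobMon settings.
    """
    # UDP broadcast needs the manager on the worker subnet and jobs that
    # are allowed to listen on ports.
    udp_possible = manager_on_worker_subnet and jobs_may_listen

    # TCP needs one open port on the central manager per active job, so a
    # large system may prefer UDP when the site allows it.
    if active_jobs > max_open_ports and udp_possible:
        return ("UDP", "same subnet as the worker nodes")

    # TCP passes most firewalls, so the manager may sit outside the site
    # when worker nodes can connect out of it.
    if workers_reach_outside_site:
        return ("TCP", "edge of the site or external")

    # Most restrictive case: outgoing TCP only within the site. This is
    # also the fallback when ports are scarce but UDP is not possible,
    # since the text offers no third mechanism.
    return ("TCP", "edge of the site")
```

For example, a small site whose worker nodes can connect out gets TCP with a free choice of manager location, while a site that only allows outgoing TCP within the site gets TCP with the manager at the edge.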
Download
- cvs -d:pserver:anonymous@cvs.sf.net:/cvsroot/jobmon checkout JobMon
Prerequisites
- Clarens + Apache + Python 2.2.2
- fixed IP connection to WAN
Procedure
- Installing Clarens server
Ref
- wget -q -O - http://hepgrid1.caltech.edu/clarens/setup_clump.sh |sh
- export opkg_root=/opt/openpkg
- $opkg_root/etc/rc apache2 start|stop|restart
- Point a browser at the server to see the Clarens welcome message
- Non-SSL=> http://some.host.name:8080/clarens
- SSL=> https://some.host.name:8443/clarens
- A valid PKCS#12 certificate is required for the SSL connection
- For example, you can get a PKCS#12 file from an X509 certificate by doing
$ openssl pkcs12 -export -out exported.pfx -inkey /tmp/x509up_u$UID -in /tmp/x509up_u$UID
- Load the PKCS#12 file into your browser:
[Edit]->[Preference]->[Privacy & Security]
->[Certificates]->[Manage Certificates]->[Import]
- Test the Clarens server
- $ clarens-proxy-init
Enter pass phrase for /home/dir/.globus/userkey.pem:
- $ clarens-ping http://some.host.name:8080/clarens/
Contacting http://some.host.name:8080/clarens/...
- Installing JobMon service
- Download package
cvs -d:pserver:anonymous@cvs.sf.net:/cvsroot/jobmon checkout JobMon
- Copy the whole JobMon directory into Clarens
cp -r JobMon $opkg_root/share/apache2/clarens
- Append the JobMon service configuration to the Clarens config file and edit it
cat clarens_config_local.py >> $opkg_root/share/apache2/clarens/clarens_config_local.py
- Restart the Clarens Apache server
$opkg_root/etc/rc apache2 restart
- Start JobMonTcpServer if using the TCP wakeup method.
The TCP wakeup method is recommended because it provides real-time bookkeeping of the
jobs. With the UDP wakeup method, exited jobs are only removed at the next cleanup cycle.
/usr/bin/python JobMonTcpServer.py &
- Test the JobMon service [Usage]
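Why the TCP wakeup method can record a job exit immediately is easiest to see in a small sketch. This is not the code in JobMonTcpServer.py: it uses socket.socketpair() as a stand-in for the connection between a job daemon and the central server, and the one-byte wake-up message is an assumption made for illustration.

```python
import socket


def wake_and_detect_exit():
    """Show a wake-up over an open connection, then detection of the job exit."""
    # One end plays the central server, the other the job daemon.
    server_end, job_end = socket.socketpair()
    events = []

    # The server wakes the job by writing on the already open connection.
    server_end.sendall(b"W")
    if job_end.recv(1) == b"W":
        events.append("job woken")

    # The job exits, which closes its end of the connection.
    job_end.close()

    # The server reads end-of-file (an empty result) right away, so it can
    # drop the job from its table without waiting for a cleanup cycle.
    if server_end.recv(1) == b"":
        events.append("job exit detected")

    server_end.close()
    return events
```

With a connectionless mechanism such as UDP broadcast there is no open connection to close, which is why a job that has gone away is only noticed when it fails to re-register in time.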
Configuration
LogFiles
- JobMon services
- registerJob service: /tmp/JobMon_register.log
- query/getJobtoDo/outputJobResult services: /tmp/JobMon_query.log
- JobMonTCPServer: JobMon_TCPServer.log
- InterProcessCalls
- Contains query requests: /tmp/JobMon_$PID_$TIMESTAMP.fi
e.g. /tmp/JobMon_1554_1114138762.fi
- Wait for query results: /tmp/JobMon_$PID_$TIMESTAMP.fo
e.g. /tmp/JobMon_1557_1114210134.fo
- Register jobs: /tmp/JobMon_register_$JOBNAME.fifo
e.g. /tmp/JobMon_register_job_923.section_1.fifo
- Cleanup of jobs with lost connections (jobs that did not re-register within the time limit): /tmp/JobMon_LastTimeCleanJob.log
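When inspecting /tmp on the server, the file name patterns listed above can be built and parsed with a few helpers. This is a hedged sketch, not JobMon code: the function names are hypothetical, and it assumes that $PID and $TIMESTAMP are plain decimal integers, as they are in the examples.

```python
import re

# Matches /tmp/JobMon_$PID_$TIMESTAMP.fi and /tmp/JobMon_$PID_$TIMESTAMP.fo
QUERY_FILE = re.compile(r"^/tmp/JobMon_(\d+)_(\d+)\.(fi|fo)$")


def query_file_path(pid, timestamp, suffix):
    """Build a query request (.fi) or query result (.fo) file path."""
    if suffix not in ("fi", "fo"):
        raise ValueError("suffix must be 'fi' or 'fo'")
    return "/tmp/JobMon_%d_%d.%s" % (pid, timestamp, suffix)


def parse_query_file(path):
    """Split a query file path into (pid, timestamp, suffix), or return None."""
    match = QUERY_FILE.match(path)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


def register_fifo_path(job_name):
    """Build the registration FIFO path for a job name."""
    return "/tmp/JobMon_register_%s.fifo" % job_name
```

parse_query_file() returns None for anything that is not a query file, so it can be used to filter a directory listing that also contains registration FIFOs and log files.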
FAQ
- Client connection error message:
xmlrpclib.Fault: Fault 401: "Not Authorized to access method 'echo.echo' on this server"
Check the server log /opt/openpkg/var/apache2/log/error_log:
[Mon Jun 20 10:29:34 2005] [notice] Traceback (most recent call last): \n
File "/opt/openpkg/share/apache2/clarens/system/__init__.py", line 2149, in update_cert_db \n
cidict=pickle.loads(cidb[user_dn])\nEOFError\n
Solution: Remove the whole openpkg installation ($opkg_root, i.e. /opt/openpkg) and reinstall Clarens
- SSL connection via browser failed:
Error message: Could not establish an encrypted connection
with fcdfcaf019.fnal.gov because your certificate is expired.
Solution (thanks to Francesco Delli Paoli):
The Fermilab KCA certificate e1fce4e9.0 might have expired. Check for the latest one at http://security.fnal.gov/pki/
and update it in the directory /opt/openpkg/etc/grid-security/certificates.
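A client script can catch the "Fault 401" from the first FAQ entry and turn it into a hint instead of a traceback. The sketch below is an illustration only: it uses Python 3's xmlrpc.client (the successor of the xmlrpclib module named in the error message), and the local server is a hypothetical stand-in that merely reproduces the fault. It is not Clarens and performs no certificate checks.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def start_stand_in_server(authorized):
    """Start a local XML-RPC server that mimics the fault in the FAQ (not Clarens)."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)

    def echo(message):
        if not authorized:
            raise xmlrpc.client.Fault(
                401, "Not Authorized to access method 'echo.echo' on this server")
        return message

    # Register under the dotted method name used in the error message.
    server.register_function(echo, "echo.echo")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def try_echo(url, message):
    """Call echo.echo and translate a 401 fault into a readable hint."""
    proxy = xmlrpc.client.ServerProxy(url)
    try:
        return "ok: " + proxy.echo.echo(message)
    except xmlrpc.client.Fault as fault:
        if fault.faultCode == 401:
            return "not authorized: check the server error_log"
        raise  # any other fault is unexpected here


def demo_fault(authorized):
    """Run one call against the stand-in server and return the outcome."""
    server = start_stand_in_server(authorized)
    host, port = server.server_address
    try:
        return try_echo("http://%s:%d/" % (host, port), "hello")
    finally:
        server.shutdown()
        server.server_close()
```

Only fault code 401 is handled; other faults are re-raised so that unrelated server errors are not hidden.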
News
Aug 2nd, 2005
Now hosted on SourceForge