JobMon
Interactive GRID Job Monitoring
Authors:
  Shih-Chieh Hsu
  Elliot Lipeles
  Conrad Steenberg
  Frank Würthwein

Based on: Clarens

Hosted by: SourceForge

Installation

Overview

JobMon consists of three parts that run in three separate places:

  1. the central server is a continuously running service,
  2. the daemon is started by each job on its worker node and runs for the duration of the job,
  3. the client tools are run interactively by the user on their desktop and terminate after each request.

The central server is a Clarens service which is configured by a site administrator and is responsible for all the authentication. The daemon and the client each have an X509 proxy which is used for authentication with the central Clarens server. The daemon also runs a wake-up service, which the central server uses to notify it that there is a user request.

The following sequence of events for JobMon provides a quick orientation of the role each of these components plays:
  1. The central server is started by the site administrator and remains running indefinitely
  2. The user submits a job configured to use JobMon
  3. Once running, the job registers with the central server
  4. The user makes an interactive request from the command line, which contacts the central server and blocks until the result is ready
  5. The central server wakes up the corresponding job
  6. The job contacts the central server, gets and executes the user request, and returns the result to the central server
  7. The central server gives the result to the pending request and the user's command line returns the result
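
The sequence above can be illustrated with a small in-memory simulation. This is only a sketch: the class and method names are hypothetical (loosely modelled on the registerJob, query, getJobtoDo and outputJobResult services listed under LogFiles), and threads stand in for the real network calls and wake-up mechanism.

```python
import queue
import threading


class CentralServer:
    """Toy stand-in for the central server (not the real Clarens service)."""

    def __init__(self):
        self.wakeups = {}    # job name -> event used to wake that job
        self.requests = {}   # job name -> pending user command
        self.results = {}    # job name -> queue carrying the result back

    def register_job(self, job_name):
        # Step 3: a running job registers with the central server.
        self.wakeups[job_name] = threading.Event()
        self.results[job_name] = queue.Queue()
        return self.wakeups[job_name]

    def query(self, job_name, command, timeout=10):
        # Steps 4-5: store the request, wake the job, block until the result arrives.
        self.requests[job_name] = command
        self.wakeups[job_name].set()
        return self.results[job_name].get(timeout=timeout)

    def get_job_to_do(self, job_name):
        # Step 6 (first half): the woken job fetches the user request.
        return self.requests.pop(job_name)

    def output_job_result(self, job_name, result):
        # Step 6 (second half) and step 7: the result is handed to the pending query.
        self.results[job_name].put(result)


def job_daemon(server, job_name, wakeup):
    """Toy daemon: sleep until woken, then serve exactly one request."""
    wakeup.wait()
    command = server.get_job_to_do(job_name)
    server.output_job_result(job_name, f"{job_name} ran: {command}")


server = CentralServer()
wakeup = server.register_job("job_923")
worker = threading.Thread(target=job_daemon, args=(server, "job_923", wakeup))
worker.start()
answer = server.query("job_923", "ls")
worker.join()
print(answer)  # job_923 ran: ls
```

The user-side call blocks inside query until the job has delivered its result, which mirrors steps 4 to 7.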

Configuration Choices

There are currently two implemented wake-up mechanisms (TCP and UDP broadcast), suited to different site configurations. In addition, the central server can run either as an edge service at a site or external to the site. For a small site, the TCP method is preferred because the connection state ensures that a job's exit is recorded immediately. TCP can also pass through most firewalls, so the central server can be outside the site. The disadvantage of TCP is that it requires one open port on the central server for each active job, which can be a problem for a large system. The UDP mechanism is very lightweight, but requires that the central server be on the same subnet as the worker nodes and that the grid jobs are allowed to listen on ports. The most restrictive networking environment considered in the design is one where the worker nodes can only make outgoing TCP connections within a site. In this case, the TCP method is used with the central server located at the edge of the site.
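
The trade-offs above can be condensed into a small decision helper. This is a sketch of one reading of the paragraph, not part of JobMon: the function name and its three boolean inputs are hypothetical, and only the returned strings correspond to the JobMon_wakeup_method values shown in the Configuration section.

```python
def choose_wakeup_method(many_jobs, same_subnet, jobs_may_listen):
    """Suggest a wake-up method from three yes/no facts about the site.

    many_jobs:       the site runs enough jobs that one open port per job is a problem
    same_subnet:     the central server is on the same subnet as the worker nodes
    jobs_may_listen: grid jobs are allowed to listen on ports
    """
    udp_possible = same_subnet and jobs_may_listen
    if many_jobs and udp_possible:
        # UDP is lightweight, but only works under both conditions above.
        return "UDPbroadcast"
    # TCP records job exits immediately and passes through most firewalls.
    return "TCPbroadcast"


print(choose_wakeup_method(many_jobs=False, same_subnet=True, jobs_may_listen=True))
print(choose_wakeup_method(many_jobs=True, same_subnet=True, jobs_may_listen=True))
```

A large site that cannot satisfy the UDP requirements still falls back to TCP here; whether the per-job port cost is acceptable in that case is a site decision.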

Download

  • cvs -d:pserver:anonymous@cvs.sf.net:/cvsroot/jobmon checkout JobMon

Prerequisites

  • Clarens + Apache + Python 2.2.2
  • fixed IP connection to WAN

Procedure

  • Installing Clarens server Ref
    1. wget -q -O - http://hepgrid1.caltech.edu/clarens/setup_clump.sh |sh
    2. export opkg_root=/opt/openpkg
    3. $opkg_root/etc/rc apache2 start|stop|restart
    4. Point a browser at the server to see the Clarens welcome message
      • Non-SSL=> http://some.host.name:8080/clarens
      • SSL=> https://some.host.name:8443/clarens
        • A valid PKCS#12 file is required for the SSL connection
        • For example, you can create a PKCS#12 file from an X509 certificate with
          $ openssl pkcs12 -export -out exported.pfx -inkey /tmp/x509up_u$UID -in /tmp/x509up_u$UID
        • Load the PKCS#12 file into your browser:

          [Edit]->[Preference]->[Privacy & Security] ->[Certificates]->[Manage Certificates]->[Import]

    5. Test the Clarens server
      • $ clarens-proxy-init
        Enter pass phrase for /home/dir/.globus/userkey.pem:
      • $ clarens-ping http://some.host.name:8080/clarens/
        Contacting http://some.host.name:8080/clarens/...
  • Installing JobMon service
    1. Download package

      cvs -d:pserver:anonymous@cvs.sf.net:/cvsroot/jobmon checkout JobMon

    2. Copy the whole JobMon directory into the Clarens directory

      cp -r JobMon $opkg_root/share/apache2/clarens

    3. Append the JobMon service configuration to the Clarens config file, then edit it as needed

      cat clarens_config_local.py >> $opkg_root/share/apache2/clarens/clarens_config_local.py

    4. Restart Clarens apache server

      $opkg_root/etc/rc apache2 restart

    5. Start JobMonTcpServer if using the TCP wake-up method. The TCP wake-up method is recommended because it provides real-time bookkeeping of the jobs; with the UDP wake-up method, a finished job is only removed at the next clean-up cycle.

      /usr/bin/python JobMonTcpServer.py &

    6. Test the JobMon service [Usage]

Configuration

  • apache server parameters

    The following Apache parameters have been optimized for a large site. For a small site the default (or other) configuration should be adequate. The load on the central server can be reduced by making jobs reconnect less often (a larger JobMon_reconnect_time), but this increases the chance that a user has to wait before finding out that their job has already completed.
        /opt/openpkg/etc/apache2/httpd.conf
        # prefork MPM
        # StartServers: number of server processes to start
        # MinSpareServers: minimum number of server processes which are kept spare
        # MaxSpareServers: maximum number of server processes which are kept spare
        # MaxClients: maximum number of server processes allowed to start
        # MaxRequestsPerChild: maximum number of requests a server process serves
        
        StartServers 15
        MinSpareServers 15
        MaxSpareServers 100
        MaxClients 150
        MaxRequestsPerChild 0
  • JobMon service parameters
    • Example config file clarens_config_local.py. This must be in the directory $opkg_root/share/apache2/clarens.
          clarens_config_local.py
          "JobMon_query_timeout":   "10",
          "JobMon_query_allow": "[<*>,<*>],
               [/DC=fnal,/CN=hello],
               [/DC=<*>/DC=< ORG>/O=Fermilab/OU=People/CN=<*>/UID=< user>,
                /DC=gov/DC=< ORG>/O=Fermilab/OU=People/CN=<*>/UID=< user>]",
          "JobMon_reconnect_time":  "600",
          "JobMon_cleanjob_time":   "900",
          "JobMon_fifofile_path":   "/tmp/",
          #two wakeup methods  UDPbroadcast or TCPbroadcast
          "JobMon_wakeup_method":   "TCPbroadcast",
          "JobMon_tcpserver_port":  "2000",
          "JobMon_tcpserver_hostname":  "fcdfcaf019.fnal.gov",
                 
    • Detailed explanations:
      • JobMon_query_timeout: The timeout for a user query
      • JobMon_query_allow: The X509 subject DN matching between the user and the jobs. Multiple rules can be defined, each a [user DN, job DN] pair. If any one of the rules matches, the user gets permission to talk to the jobs. For example:
        1. The wildcard character <*> matches anything
        2. [<*>,<*>]: allow any user DN to talk to any job DN
        3. [/DC=fnal,/CN=hello]: the user DN equals /DC=fnal and the job DN equals /CN=hello
        4. Regular expression matching:
             [/DC=<*>/DC=< ORG>/O=Fermilab/OU=People/CN=<*>/UID=< user>,
              /DC=gov/DC=< ORG>/O=Fermilab/OU=People/CN=<*>/UID=< user>]
                            
          1. The user DN must be in the format /DC=<*>/DC=< ORG>/O=Fermilab/OU=People/CN=<*>/UID=< user>
          2. The job DN must be in the format /DC=gov/DC=< ORG>/O=Fermilab/OU=People/CN=<*>/UID=< user>
          3. The items in angle brackets must match between the two DNs, i.e.,
                /DC=<*> == /DC=gov,
                /DC=< ORG> == /DC=< ORG>,
                /CN=<*> == /CN=<*>,
                /UID=< user> == /UID=< user>
                                  
      • JobMon_reconnect_time: The period at which JobMonDaemon updates its status with the JobMon Clarens server
      • JobMon_cleanjob_time: The period at which the JobMon server checks and cleans up the registration lists. It removes a registered job if the time since its last update exceeds JobMon_reconnect_time.
      • JobMon_fifofile_path: The location for interprocess communication files. "/tmp/" is the default.
      • JobMon_wakeup_method: There are only two options, either UDPbroadcast (the UDP/IP wake-up method) or TCPbroadcast (the TCP/IP wake-up method)
      • JobMon_tcpserver_hostname: The JobMonTcpServer hostname. It must be resolvable from behind the NAT.
      • JobMon_tcpserver_port: The JobMonTcpServer port number. Required if using the TCP wake-up method
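
The JobMon_query_allow matching described above can be sketched with regular expressions. This is an illustration under stated assumptions, not JobMon's own implementation: the function names are hypothetical, <*> is taken to match any text, a named placeholder such as < ORG> is taken to match one DN component, and placeholders that appear in both patterns must carry the same value.

```python
import re

# Matches <*>, <ORG>, < ORG>, < user>, ... and captures the name inside.
PLACEHOLDER = re.compile(r"<\s*([^<>]*?)\s*>")


def pattern_to_regex(pattern):
    """Compile a DN pattern: <*> is a wildcard, <name> becomes a named group."""
    parts = []
    seen = set()
    pos = 0
    for match in PLACEHOLDER.finditer(pattern):
        parts.append(re.escape(pattern[pos:match.start()]))
        name = match.group(1)
        if name == "*":
            parts.append(".*")                      # anonymous wildcard
        elif name in seen:
            parts.append(f"(?P={name})")            # same name again: same value
        else:
            seen.add(name)
            parts.append(f"(?P<{name}>[^/]+)")      # one DN component
        pos = match.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


def dn_pair_allowed(user_dn, job_dn, rules):
    """Return True if any (user pattern, job pattern) rule accepts the pair."""
    for user_pattern, job_pattern in rules:
        user_match = pattern_to_regex(user_pattern).fullmatch(user_dn)
        job_match = pattern_to_regex(job_pattern).fullmatch(job_dn)
        if user_match and job_match:
            shared = set(user_match.groupdict()) & set(job_match.groupdict())
            if all(user_match.group(k) == job_match.group(k) for k in shared):
                return True
    return False


rules = [
    ("/DC=<*>/DC=< ORG>/O=Fermilab/OU=People/CN=<*>/UID=< user>",
     "/DC=gov/DC=< ORG>/O=Fermilab/OU=People/CN=<*>/UID=< user>"),
]
user_dn = "/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Some User/UID=someuser"
same_user_job = "/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Some User/UID=someuser"
other_user_job = "/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Other User/UID=other"

print(dn_pair_allowed(user_dn, same_user_job, rules))   # True
print(dn_pair_allowed(user_dn, other_user_job, rules))  # False
```

The DNs used here are made-up examples; only the rule text is taken from the configuration above.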
    • Notes:
      • UDPbroadcast only works if the Clarens server is at the edge of the NAT. UDP/IP is not transparent across the NAT.
      • TCPbroadcast requires JobMonTcpServer to be running on the Clarens server.
        $ python JobMonTcpServer.py &
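
The constraints between the parameters above can be checked before restarting the server. The following is a minimal sketch, assuming the JobMon_* settings are available as a plain dictionary of strings; the function name is hypothetical, and the check that the time values are numeric strings is an assumption based on the example config.

```python
VALID_WAKEUP_METHODS = ("UDPbroadcast", "TCPbroadcast")
TIME_KEYS = ("JobMon_query_timeout", "JobMon_reconnect_time", "JobMon_cleanjob_time")


def check_jobmon_config(config):
    """Return a list of human-readable problems found in the settings."""
    problems = []
    method = config.get("JobMon_wakeup_method")
    if method not in VALID_WAKEUP_METHODS:
        problems.append(f"JobMon_wakeup_method must be one of {VALID_WAKEUP_METHODS}")
    if method == "TCPbroadcast" and not config.get("JobMon_tcpserver_port"):
        # The port is required only for the TCP wake-up method.
        problems.append("JobMon_tcpserver_port is required for TCPbroadcast")
    for key in TIME_KEYS:
        # Assumption: time values are written as strings of digits.
        if not str(config.get(key, "")).isdigit():
            problems.append(f"{key} should be a number")
    return problems


example_config = {
    "JobMon_query_timeout": "10",
    "JobMon_reconnect_time": "600",
    "JobMon_cleanjob_time": "900",
    "JobMon_fifofile_path": "/tmp/",
    "JobMon_wakeup_method": "TCPbroadcast",
    "JobMon_tcpserver_port": "2000",
    "JobMon_tcpserver_hostname": "fcdfcaf019.fnal.gov",
}
print(check_jobmon_config(example_config))  # []
```

A misspelled method name such as "TDPbroadcast" would be reported as a problem rather than silently accepted.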

LogFiles

  • JobMon services
    1. registerJob service: /tmp/JobMon_register.log
    2. query/getJobtoDo/outputJobResult services: /tmp/JobMon_query.log
  • JobMonTCPServer: JobMon_TCPServer.log
  • InterProcessCalls
    1. Contains query requests: /tmp/JobMon_$PID_$TIMESTAMP.fi
      e.g. /tmp/JobMon_1554_1114138762.fi
    2. Wait for query results: /tmp/JobMon_$PID_$TIMESTAMP.fo
      e.g. /tmp/JobMon_1557_1114210134.fo
    3. Register jobs: /tmp/JobMon_register_$JOBNAME.fifo
      e.g. /tmp/JobMon_register_job_923.section_1.fifo
    4. Clean-up of jobs with lost connections (not re-registered within the time limit): /tmp/JobMon_LastTimeCleanJob.log
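
The clean-up rule behind item 4 (a job is dropped when it has not updated its registration within JobMon_reconnect_time) can be sketched as follows. The function name and the dictionary of last-update timestamps are hypothetical; the real server keeps its own registration lists.

```python
def find_stale_jobs(last_update, now, reconnect_time):
    """Return the names of jobs whose last update is older than reconnect_time.

    last_update:    dict mapping job name -> time of last update (seconds)
    now:            current time (seconds)
    reconnect_time: allowed gap between updates, e.g. JobMon_reconnect_time
    """
    return sorted(
        name for name, seen in last_update.items()
        if now - seen > reconnect_time
    )


last_update = {
    "job_923.section_1": 1000,   # last seen 700 s ago
    "job_924.section_1": 1500,   # last seen 200 s ago
}
stale = find_stale_jobs(last_update, now=1700, reconnect_time=600)
print(stale)  # ['job_923.section_1']
```

In this picture, the clean-up cycle would call such a check once every JobMon_cleanjob_time seconds.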

FAQ

  • Client connection error message: xmlrpclib.Fault: Fault 401: "Not Authorized to access method 'echo.echo' on this server"
    Check the server log /opt/openpkg/var/apache2/log/error_log:
    [Mon Jun 20 10:29:34 2005] [notice] Traceback (most recent call last): \n
    File "/opt/openpkg/share/apache2/clarens/system/__init__.py", line 2149, in update_cert_db \n
    cidict=pickle.loads(cidb[user_dn])\nEOFError\n
    Solution: Remove the whole /opt/openpkg/ directory and reinstall Clarens
  • SSL connection via browser failed:
    Error message: Could not establish an encrypted connections with fcdfcaf019.fnal.gov because your certificate is expired.
    Solution (thanks to Francesco Delli Paoli):
    The Fermilab KCA certificate e1fce4e9.0 might have expired. Get the latest one from http://security.fnal.gov/pki/
    and update it in the directory /opt/openpkg/etc/grid-security/certificates.
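
For the first FAQ entry, the EOFError in the traceback is what Python's pickle module raises when it is asked to load empty input, which suggests (as an assumption, not a confirmed diagnosis) that the stored certificate entry was empty. The sketch below reproduces that failure mode in isolation; the helper name and the sample entry are hypothetical and unrelated to the Clarens code.

```python
import pickle


def load_entry(raw_bytes):
    """Unpickle a stored entry, returning None if the data is empty."""
    try:
        return pickle.loads(raw_bytes)
    except EOFError:
        # Same exception type as in the error_log traceback above.
        return None


good = pickle.dumps({"dn": "/DC=fnal/CN=hello"})
print(load_entry(good))  # {'dn': '/DC=fnal/CN=hello'}
print(load_entry(b""))   # None
```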

News

Aug 2nd, 2005

Now hosted on SourceForge

Modified on Wed Nov 30 10:37:17 CST 2005 by Elliot Lipeles