Matteo Risoldi - Files & Docs

Managing jobs in condor

Submitting a job

To submit a job, you first need to create a command file for it. This is a text file containing information about the job, such as the name of the program to run, its arguments, its input/output files, and some options about how condor should behave with respect to this job.

There are many possible commands for this file, all described in the manual on the condor website (http://cs.wisc.edu/condor).
A typical template looks like this:

###############################
# Condor command file example

universe = standard
executable = myprogram
arguments = arg1 arg2
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = outfile.dat
error = errfile.dat
log = logfile.dat
Initialdir = /home/control/condorfiles/

queue

A few notes on the parameters. The universe parameter should be "standard" if your program has been relinked with condor_compile; this enables advanced features such as checkpointing and remote system calls. If you could not relink the program, use "vanilla" instead. For a Java program, use "java".
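As a sketch, assuming the same hypothetical program and file names as the template above, a command file for a program that was not relinked differs only in the universe line. Recent HTCondor releases (9.0 and later) no longer provide the standard universe at all, so vanilla is the usual choice there. The snippet below writes such a file to mycmdfile-vanilla.condor, a hypothetical name:

```shell
#!/bin/sh
# Write a vanilla-universe command file (the file name is hypothetical).
# Only the universe line differs from the standard-universe template.
cat > mycmdfile-vanilla.condor <<'EOF'
universe = vanilla
executable = myprogram
arguments = arg1 arg2
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = outfile.dat
error = errfile.dat
log = logfile.dat
queue
EOF
```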

The executable parameter is the name of the program. For a Java program, include the .class extension.

Arguments holds the arguments you would normally pass to your program on the command line; omit it if the program takes none. For a Java program, the first argument must be the program name, that is, the main class name without the .class extension.
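Putting the two Java-specific rules together, a Java command file might look like the sketch below. MyProgram.class and the arguments are hypothetical; note that the first entry in arguments repeats the class name without its extension:

```shell
#!/bin/sh
# Write a java-universe command file; MyProgram is a hypothetical class.
cat > mycmdfile-java.condor <<'EOF'
universe = java
# The compiled class file is the executable.
executable = MyProgram.class
# First argument: the main class name; the program's own arguments follow.
arguments = MyProgram arg1 arg2
output = outfile.dat
error = errfile.dat
log = logfile.dat
queue
EOF
```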

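Should_transfer_files tells condor whether to use its file transfer mechanism, i.e. to copy the executable and the job's files to the executing machine before execution instead of relying on a shared file system. When_to_transfer_output = ON_EXIT makes condor copy the output back when the job exits.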
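If the program also reads input files, they can be listed with the transfer_input_files command so that condor copies them to the executing machine together with the executable. A minimal sketch, assuming two hypothetical input files, input1.dat and input2.dat, sitting in the submit directory:

```shell
#!/bin/sh
# Command file that ships extra input files with the job.
# input1.dat and input2.dat are hypothetical names.
cat > mycmdfile-inputs.condor <<'EOF'
universe = vanilla
executable = myprogram
arguments = input1.dat input2.dat
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# Comma-separated list of files copied to the executing machine.
transfer_input_files = input1.dat, input2.dat
output = outfile.dat
error = errfile.dat
log = logfile.dat
queue
EOF
```

By default, files that the program creates or modifies in its working directory on the executing machine are copied back when it exits.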

Output names the file that receives everything your program writes to standard output; Error names the file that receives its standard error.

Log names a file in which condor records what happens to the job, for example when it is submitted, when it starts executing and when it terminates.

Paths for these three files are relative to the directory from which you run condor_submit, unless you set the Initialdir parameter, in which case they are relative to the directory it specifies.
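Because the template sets Initialdir, its log ends up in /home/control/condorfiles/logfile.dat. That log is also what the condor_wait tool reads: given a log file, it blocks until the jobs recorded there have finished. The sketch below (wait_for_job is a hypothetical helper, not a condor command) resolves a relative log path against a directory before handing it to condor_wait:

```shell
#!/bin/sh
# wait_for_job LOGFILE [INITIALDIR]
# Hypothetical helper: resolves a relative log path against INITIALDIR,
# or against the current directory (which matches condor's default only
# if called from the directory you submitted from), then waits for the
# jobs recorded in that log to finish.
wait_for_job() {
    log=$1
    dir=${2:-$(pwd)}
    case $log in
        /*) path=$log ;;          # absolute path: used as-is
        *)  path=$dir/$log ;;     # relative path: resolved against dir
    esac
    condor_wait "$path"
}
```

For the template above: wait_for_job logfile.dat /home/control/condorfiles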

The final queue clause tells condor that the job definition is over, and that it can be queued for execution in the pool.
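The queue command can also take a count, which queues several instances of the same job in one cluster. Each instance receives its own process number through the $(Process) macro, which is handy for keeping output files apart. A sketch, assuming the (hypothetical) program treats its argument as a run index; the heredoc delimiter is quoted so that the shell leaves $(Process) alone:

```shell
#!/bin/sh
# Command file that queues 5 instances of the same program.
# condor expands $(Process) to 0, 1, ..., 4, one value per instance.
cat > mycmdfile-multi.condor <<'EOF'
universe = vanilla
executable = myprogram
arguments = $(Process)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = outfile.$(Process).dat
error = errfile.$(Process).dat
log = logfile.dat
queue 5
EOF
```

condor_q then lists the instances with job IDs of the form cluster.0 through cluster.4.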

Once you have created the command file and saved it (for example, as mycmdfile.condor), send the job to the pool by running:

condor_submit mycmdfile.condor
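In a script, it can be useful to check that the command file exists before submitting and to stop if condor_submit reports an error. A minimal sketch; submit_cmdfile is a hypothetical helper, not a condor command:

```shell
#!/bin/sh
# submit_cmdfile FILE
# Hypothetical helper: refuses to submit a missing command file and
# passes condor_submit's exit status back to the caller.
submit_cmdfile() {
    if [ ! -f "$1" ]; then
        echo "submit_cmdfile: no such command file: $1" >&2
        return 1
    fi
    condor_submit "$1"
}
```

Usage: submit_cmdfile mycmdfile.condor || exit 1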

Checking the job status

To see how many jobs are in the queue, how many are running, idle or suspended, how long they have been running, and so on, use the condor_q command. The condor website gives a more detailed description of its output.
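When a job stays idle for a long time, condor_q with the -analyze option and the job ID explains why it is not being matched to a machine. A small sketch of a hypothetical helper that lists the whole queue, or analyzes a single job when an ID such as 42.0 is given:

```shell
#!/bin/sh
# show_queue [JOB_ID]
# Hypothetical helper: without arguments, lists the queue;
# with a job ID (cluster.process), asks condor why the job is not running.
show_queue() {
    if [ $# -eq 0 ]; then
        condor_q
    else
        condor_q -analyze "$1"
    fi
}
```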

Removing a job

To remove a job from the queue, run condor_rm followed by the job ID, as reported by condor_q. To remove all of your jobs, use condor_rm -all.
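Job IDs have the form cluster.process: condor_rm 42.0 removes a single job, while condor_rm 42 removes every job in cluster 42. The hypothetical helper below accepts only arguments made of digits and dots, so a stray option such as -all can never slip through by accident:

```shell
#!/bin/sh
# remove_jobs ID...
# Hypothetical helper around condor_rm. Each ID may be a cluster (42)
# or a single job (42.0); anything else is rejected before condor_rm runs.
remove_jobs() {
    if [ $# -eq 0 ]; then
        echo "usage: remove_jobs CLUSTER[.PROC]..." >&2
        return 1
    fi
    for id in "$@"; do
        case $id in
            ''|*[!0-9.]*)
                echo "remove_jobs: bad job ID: $id" >&2
                return 1 ;;
        esac
    done
    condor_rm "$@"
}
```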

Checking the status of the pool

To see how the machines in the pool are doing, run the condor_status command. It lists every machine together with its status; machines with multiple CPUs appear once per CPU. The status "Owner" means the machine is not available to condor (someone is using the console, or is logged in remotely). "Unclaimed" means the machine is not being used and can take a job. "Claimed" means the machine is currently matched to a job and is running it.
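For a quick count of how many entries are in each state, condor_status can print only the State attribute of every entry, one per line, which standard tools can then tally. A sketch using the -format option; count_states is a hypothetical name:

```shell
#!/bin/sh
# count_states
# Hypothetical helper: prints how many entries are Owner, Unclaimed, Claimed, ...
# -format prints the State attribute of each entry followed by a newline;
# sort groups identical states so that uniq -c can count them.
count_states() {
    condor_status -format '%s\n' State | sort | uniq -c
}
```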
Note that the information shown by condor_status is not updated instantaneously; for example, several minutes may pass between a machine starting a job and condor_status reporting it as "Claimed". The output is useful, but should not be taken as an exact, real-time picture of the pool.