On Computerome, there are two types of machines:
- 27 systems, each with 32 CPU cores and 1 TB of memory
- Approximately 500 systems, each with 28 CPU cores and 128 GB of memory
We have a new environment that makes use of the modules environment command. To get msub into your PATH variable, execute:
Now you can submit jobs via the command msub.
We strongly encourage you to take advantage of modules in your pipelines, as it gives you better control of your environment.
To submit a job that will run on one node only, you only have to specify the following resources:
- How long you expect the job to run ⇒ '-l walltime=<time>'
- How much memory your job requires ⇒ '-l mem=xxxgb'
- How many CPUs ⇒ '-l nodes=1:ppn=<number of CPUs>'; on the 1 TB nodes the number of CPUs can be from 1 to 32, on the other nodes from 1 to 28
- The <group_NAME> for your current project ⇒ '-W group_list=<group_NAME> -A <group_NAME>'
To run a job with 23 CPUs and 100 GB of memory that lasts one hour, you can use the command:
or using msub:
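The flags listed above can be combined into a single submission line. The following is a minimal sketch rather than an authoritative command: pr_12345 stands in for your own project name, my_job.sh is a hypothetical script name, and it is an assumption that msub accepts the same flags as qsub. The snippet only prints the two commands so they can be inspected before use.

```shell
#!/bin/sh
# Assumptions: pr_12345 is a placeholder project, my_job.sh a hypothetical job script.
GROUP="pr_12345"
SCRIPT="my_job.sh"

# 1 node, 23 cores, 100 GB of memory, 1 hour of walltime
RESOURCES="nodes=1:ppn=23,mem=100gb,walltime=01:00:00"

QSUB_CMD="qsub -W group_list=$GROUP -A $GROUP -l $RESOURCES $SCRIPT"
# Assumption: msub takes the same flags as qsub here
MSUB_CMD="msub -W group_list=$GROUP -A $GROUP -l $RESOURCES $SCRIPT"

# Print the commands for inspection; paste one into the shell to submit
printf '%s\n' "$QSUB_CMD"
printf '%s\n' "$MSUB_CMD"
```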
The values for nodes, ppn and mem are just an example; you should change them to suit your specific job.
When you want to test something in the batch system, it is strongly recommended to run in an interactive job, by using the following:
This will give you access to a single compute node, where you can perform your testing without affecting other users.
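An interactive session of this kind is typically requested with the -I flag of Torque's qsub. The sketch below assumes the same project flags as for batch jobs; pr_12345 is a placeholder project, and the core count, memory and walltime are illustrative values only. The snippet prints the command rather than running it.

```shell
#!/bin/sh
# Assumption: pr_12345 is a placeholder project name.
GROUP="pr_12345"

# -I asks Torque for an interactive session instead of running a script.
# ppn, mem and walltime below are illustrative, not recommended values.
INTERACTIVE_CMD="qsub -I -W group_list=$GROUP -A $GROUP -l nodes=1:ppn=4,mem=16gb,walltime=01:00:00"

printf '%s\n' "$INTERACTIVE_CMD"
```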
Computerome now offers an even more straightforward way to work interactively, the way you do on your own computer or a local Linux server, instead of having to submit everything through the queuing system.
Just log in and type iqsub. The system will ask you 3 simple questions, after which you will be redirected to a full, private node.
Script file example
A script file to be submitted with qsub might begin with lines like:
$PBS... variables are set for the batch job by Torque.
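As an illustration, a job script header could look like the sketch below. The job name and all resource values are hypothetical, and pr_12345 is a placeholder project. The #PBS lines are read by Torque but are ordinary comments to the shell, and the fallbacks on the $PBS variables are an addition so that the script can also be tried outside the batch system.

```shell
#!/bin/sh
### Account information (pr_12345 is a placeholder project name)
#PBS -W group_list=pr_12345 -A pr_12345
### Job name (hypothetical)
#PBS -N example_job
### 1 node, 8 cores, 16 GB of memory, 1 hour of walltime (illustrative values)
#PBS -l nodes=1:ppn=8,mem=16gb,walltime=01:00:00

# Torque sets PBS_O_WORKDIR to the directory the job was submitted from.
# Fall back to the current directory when run outside the batch system.
WORKDIR="${PBS_O_WORKDIR:-$PWD}"
cd "$WORKDIR" || exit 1

# PBS_NODEFILE has one line per allocated core; default to 1 without it.
if [ -n "${PBS_NODEFILE:-}" ] && [ -f "$PBS_NODEFILE" ]; then
    NPROCS=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
else
    NPROCS=1
fi

echo "Working directory: $WORKDIR, cores: $NPROCS"
```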
Specifying a different project account
If you run jobs under different projects, for instance pr_12345 and pr_54321, you must make sure that each project gets accounted for separately in the system's accounting statistics.
You specify the relevant project account (for example, pr_54321) for each individual job by using these flags to the qsub command:
or, in the job script file, add a line like this near the top:
Please use project names only by agreement with your project owner.
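Using the group flags from the resource list earlier on this page, the two variants might look like the sketch below. The script name my_job.sh is hypothetical; the snippet only prints the command line and the matching script directive.

```shell
#!/bin/sh
PROJECT="pr_54321"     # example project account from the text
SCRIPT="my_job.sh"     # hypothetical job script

ACCOUNT_FLAGS="-W group_list=$PROJECT -A $PROJECT"

# Variant 1: flags given on the qsub command line
printf '%s\n' "qsub $ACCOUNT_FLAGS $SCRIPT"

# Variant 2: directive placed near the top of the job script
printf '%s\n' "#PBS $ACCOUNT_FLAGS"
```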
Estimating job resource requirements
The first time you run your script, you may not have a clear picture of what kind of resource requirements it has.
To get a rough estimate, you could submit a job to a full node, with a large walltime:
Regular compute node (aka. 'thinnode'):
To see the actual resource usage, see the output from the command
As a result of recent performance and stability improvements to the queuing system, the 'tracejob' command is currently not available to regular users, but must be run as a privileged account on a particular set of servers.
We are working on providing a solution to this, but in the meantime, please contact Computerome support if you need to get results from 'tracejob'.
Alternatively, you can add these lines to the bottom of your script:
module load shared moab
checkjob -v $PBS_JOBID
They will generate something like the following:
To calculate the value for the "-l mem=" parameter, multiply the number of tasks by the "MEM:" value listed under "Max Util Resource Per Task". Here it would be 20 * 12 GB = 240 GB.
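The same arithmetic can be done in the shell. This is a small sketch using the numbers from the text (20 tasks, 12 GB per task); the variable names are made up for illustration.

```shell
#!/bin/sh
TASKS=20              # number of tasks reported for the job
MEM_PER_TASK_GB=12    # "MEM:" value under "Max Util Resource Per Task", in GB

# Total memory request = tasks * memory per task
TOTAL_MEM_GB=$((TASKS * MEM_PER_TASK_GB))

printf '%s\n' "-l mem=${TOTAL_MEM_GB}gb"
```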
Look at resources_used.xyz for hints.
Requesting a minimum memory size
A number of node features can be requested, see the Torque Job Submission page. For example, you may require a minimum physical memory size by requesting:
That is: 2 entire nodes with 16 CPU cores on each, and a total memory across all nodes of at least 120 GB RAM.
To see the available RAM sizes on the different node types, see the Hardware page.
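A request matching that description could be written as in the sketch below. It assumes that mem= is counted as the total across all nodes, as the description states; my_job.sh is a hypothetical script name and the project flags are left out for brevity. The snippet prints the command instead of submitting it.

```shell
#!/bin/sh
SCRIPT="my_job.sh"   # hypothetical job script

# 2 nodes, 16 cores on each, 120 GB of memory in total
MIN_MEM_CMD="qsub -l nodes=2:ppn=16,mem=120gb $SCRIPT"

printf '%s\n' "$MIN_MEM_CMD"
```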
Waiting for specific jobs
It is possible to specify that a job should only run after another job has completed successfully; please see the -W flags in the qsub page.
To run <your script> after job 12345 has completed successfully:
Be sure that the exit status of job 12345 is meaningful: if it exits with status 0, your second job will run. If it exits with any other status, your second job will be cancelled.
It is also possible to run a job if another job fails ('afternotok') or after another job completes, regardless of status ('afterany'). Be aware that the keyword 'after' (as in '-W depend=after:12345') means run after job 12345 has started.
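In Torque, the dependency for "run only after successful completion" is the 'afterok' keyword. The sketch below builds such a command for job 12345; my_job.sh is a hypothetical script name, and the command is printed rather than executed.

```shell
#!/bin/sh
PREV_JOB=12345        # job ID from the text
SCRIPT="my_job.sh"    # hypothetical job script

# afterok: start only if the previous job exits with status 0.
# Swap in afternotok, afterany or after for the other behaviours.
AFTEROK_CMD="qsub -W depend=afterok:$PREV_JOB $SCRIPT"

printf '%s\n' "$AFTEROK_CMD"
```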
Submitting jobs to 32-CPU nodes
The quad-processor, 8-core Intel Ivy Bridge Xeon E5-4610 v2 nodes (32 CPU cores in total) are defined with the node property fatnode (nodes f001-f027).
You could submit a batch job as in these examples:
2 entire fatnodes, 32 CPUs each, 64 CPU cores in total:
Explicitly the f015 node, 32 CPU cores:
2 entire fatnodes, 32 CPUs each, total memory of all nodes >= 1500 GB RAM:
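The three fatnode cases might be expressed as in the sketch below, which appends the node property to the nodes= request in the usual Torque way. The script name my_job.sh is hypothetical, and the project flags described earlier are omitted for brevity. The commands are printed, not submitted.

```shell
#!/bin/sh
SCRIPT="my_job.sh"   # hypothetical job script

# 2 entire fatnodes, 32 cores each
FAT_TWO="qsub -l nodes=2:ppn=32:fatnode $SCRIPT"
# The f015 node explicitly, all 32 cores
FAT_ONE="qsub -l nodes=f015:ppn=32 $SCRIPT"
# 2 entire fatnodes with at least 1500 GB of memory in total
FAT_MEM="qsub -l nodes=2:ppn=32:fatnode,mem=1500gb $SCRIPT"

printf '%s\n' "$FAT_TWO" "$FAT_ONE" "$FAT_MEM"
```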
Submitting jobs to 28-CPU nodes
The dual-processor, 14-core Intel E5-2683 v3 nodes (28 CPU cores in total) are defined with the node property thinnode (nodes cn001-cn540).
You could submit a batch job as in these examples:
2 entire thinnodes, 28 CPUs each, 56 CPU cores in total:
Explicitly the cn038 node, 28 CPU cores:
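The two thinnode cases could be written as in this sketch, again attaching the node property to the nodes= request. The script name my_job.sh is hypothetical and the project flags are left out for brevity; the commands are only printed.

```shell
#!/bin/sh
SCRIPT="my_job.sh"   # hypothetical job script

# 2 entire thinnodes, 28 cores each
THIN_TWO="qsub -l nodes=2:ppn=28:thinnode $SCRIPT"
# The cn038 node explicitly, all 28 cores
THIN_ONE="qsub -l nodes=cn038:ppn=28 $SCRIPT"

printf '%s\n' "$THIN_TWO" "$THIN_ONE"
```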
Submitting 1-CPU jobs
You could submit a batch job like in this example:
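A single-core request follows the same pattern as the other examples on this page. In this sketch the memory and walltime values are illustrative, pr_12345 is a placeholder project and my_job.sh is a hypothetical script name; the command is printed rather than run.

```shell
#!/bin/sh
GROUP="pr_12345"     # placeholder project name
SCRIPT="my_job.sh"   # hypothetical job script

# 1 node, 1 core; mem and walltime are illustrative values
ONE_CPU_CMD="qsub -W group_list=$GROUP -A $GROUP -l nodes=1:ppn=1,mem=4gb,walltime=01:00:00 $SCRIPT"

printf '%s\n' "$ONE_CPU_CMD"
```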
Running parallel jobs using MPI
In order to optimize performance, the queuing system is configured to place jobs on nodes connected to the same InfiniBand switch (30 nodes per switch) if possible.
To get nodes close to each other, use procs=<number_of_procs> and leave out ppn=.
To avoid interference with other jobs, procs= should be a multiple of the number of cores per node (i.e. 28 for mpinode).
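Rounding a desired process count up to a whole number of nodes is simple integer arithmetic. The sketch below assumes 28 cores per node, as stated in the text; the wanted count of 100 is a made-up example.

```shell
#!/bin/sh
CORES_PER_NODE=28   # cores per node, from the text
WANTED=100          # hypothetical number of MPI processes you need

# Round up to the next multiple of CORES_PER_NODE so only whole nodes are used
PROCS=$(( (WANTED + CORES_PER_NODE - 1) / CORES_PER_NODE * CORES_PER_NODE ))

printf '%s\n' "-l procs=$PROCS"
```

With these numbers the request becomes procs=112, i.e. four full nodes.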
Submitting multiple identical jobs can be done using job arrays. Job arrays are created with the -t option in the qsub submission script, which allows many copies of the same script to be submitted at once. Additional information about the -t option can be found in the qsub command reference. The PBS_ARRAYID environment variable lets you distinguish between the jobs in the array. The resources requested in the qsub submission script are the resources that each job in the array will get.
For instance, adding the line:
to the qsub script will run the job 15 times, with no more than 5 active jobs at any given time.
PBS_ARRAYID values will run from 0 to 14, as shown below:
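In Torque, an array of 15 jobs with at most 5 running at once is normally requested as '-t 0-14%5'. The sketch below shows that directive in a job script and uses PBS_ARRAYID to pick a per-job input file. The file naming scheme is hypothetical, and the default of 0 and the loop over the IDs are only there so the script can be tried outside the batch system.

```shell
#!/bin/sh
### 15 array jobs (IDs 0 to 14), at most 5 active at a time
#PBS -t 0-14%5

# Torque sets PBS_ARRAYID in each array job; default to 0 outside the queue.
TASK_ID="${PBS_ARRAYID:-0}"
INPUT="input_${TASK_ID}.txt"   # hypothetical per-job input file
echo "Array job $TASK_ID would process $INPUT"

# List the IDs the array jobs receive, for illustration only
i=0
COUNT=0
while [ "$i" -le 14 ]; do
    echo "PBS_ARRAYID=$i"
    COUNT=$((COUNT + 1))
    i=$((i + 1))
done
```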