Nandadevi Job Scheduler (Slurm)

         The Nandadevi cluster has been partitioned into the following queues using the latest open-source version of the Slurm job scheduler. The details are given below.

SlNo  Queue Name   Resources                                                                                    Purpose
1     nandaq       20 nodes, 28 cores per node, 560 cores total (23 TF)                                         Parallel jobs using MPI
2     nandaknlq    4 nodes with Intel Xeon Phi 7230, 64 cores and 64 GB RAM per node, 256 cores total (10 TF)   OpenMP, many-core and parallel jobs
3     nandasq      34 nodes, 16 cores per node with 64 GB memory, 544 cores total (14 TF)                       Multiple serial jobs and OpenMP applications
4     nandagpuq    8 nodes with 16 Nvidia K20 GPUs                                                              CUDA and OpenACC applications
5     nandaitraq   2 nodes, 40 cores                                                                            Used by the ITRA project members
6     nandaphenoq  2 nodes, 40 cores                                                                            Used by the PHENO project members
7     nandaifcq    3 nodes with 10 Nvidia K80 GPUs                                                              Used by the IFC project members
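
To check the current availability of these queues before submitting a job, the standard Slurm sinfo command can be used; the partition name below is only an example.

            sinfo                # list all partitions with their state and node counts
            sinfo -p nandaq      # show a single partition, e.g. nandaq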

New sample Slurm files for submitting jobs to any one of the above queues. Download for MPI jobs, OpenMP jobs, serial jobs, and check-pointing.

Make the necessary changes in these sample files and use them.

Updated Slurm job script (Feb 2021).

#!/bin/bash
#SBATCH -N 1
#SBATCH --job-name=<Your_jobname>
##Available Partitions
##nandaq(OR)nandaknlq(OR)nandasq(OR)nandagpuq(OR)nandaphenoq(OR)nandaitraq(OR)nandaifcq
#SBATCH --partition=nandaq
#SBATCH --output=Job.%j.out
#SBATCH --error=Job.%j.err
#SBATCH --export=all
#SBATCH --mail-user=<username>@imsc.res.in
#SBATCH --mail-type=ALL
#SBATCH -D </Working_dir_path usually /lustre/username/...>
 
# Load your modules in the script.
module load module_name 
module unload module_name
 
MACHINE_FILE=nodes.$SLURM_JOBID                                           
scontrol show hostname $SLURM_JOB_NODELIST > $MACHINE_FILE
 
export OMP_NUM_THREADS=$SLURM_NTASKS
srun <your executable>  >& out_$SLURM_JOBID
##MPI Case
###   Only one executable allowed
#mpirun -np <npvalue> <your/executable/with/path>  >& out_$SLURM_JOBID
wait
 
 
 
Note: Please give the full path of the executable file if it is located in a different directory.
Ex: srun /home/username/proj1/a.out >& out_$SLURM_JOBID
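
For a multi-node MPI run, the mpirun branch of the script above can be combined with an explicit node and task count. The sketch below is only an illustrative variant of the Feb 2021 script, assuming 2 nodes on nandaq with 28 tasks each; the module name intel/2017 and the executable name my_mpi_prog are placeholders to be replaced with your own.

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=28
#SBATCH --job-name=mpi_test
#SBATCH --partition=nandaq
#SBATCH --output=Job.%j.out
#SBATCH --error=Job.%j.err
#SBATCH --export=all

# Load the MPI environment that was used to build the executable
module load intel/2017

# Launch one MPI rank per allocated task (2 x 28 = 56 ranks here)
mpirun -np $SLURM_NTASKS ./my_mpi_prog >& out_$SLURM_JOBID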
 

Sample Slurm files for submitting jobs to any one of the above queues. Download for MPI jobs, OpenMP jobs, serial jobs, and check-pointing.

OLD JOBSCRIPT

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=28
#SBATCH -J <testrun>
# Available partition/queue names are: 1) nandaq, 2) nandaknlq, 3) nandasq, 4) nandagpuq, 5) nandaphiq
#SBATCH -p <partitionname>
#SBATCH --export=all
#SBATCH --mail-user=<username>@imsc.res.in
#SBATCH --mail-type=ALL

cd $SLURM_SUBMIT_DIR
echo $SLURM_JOB_NODELIST > hostfile_$SLURM_JOBID

module load intel/2017
##OpenMP Case
## Make sure the ntasks-per-node value and the OMP_NUM_THREADS value are the same,
## or leave OMP_NUM_THREADS set from the SLURM_NTASKS environment variable
###   Only one executable allowed
export OMP_NUM_THREADS=$SLURM_NTASKS
<your executable> >& out_$SLURM_JOBID

##MPI Case
###   Only one executable allowed
#mpirun -np <npvalue> <your/executable/with/path>  >& out_$SLURM_JOBID
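
As a concrete illustration of the OpenMP case, the sketch below requests a single node on nandasq and ties the thread count to the CPUs allocated via --cpus-per-task; the executable name omp_prog and the core count of 16 are placeholders.

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --job-name=omp_test
#SBATCH --partition=nandasq
#SBATCH --output=Job.%j.out
#SBATCH --error=Job.%j.err

# Use as many OpenMP threads as CPUs allocated to this task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./omp_prog >& out_$SLURM_JOBID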


 

Job Submission

      Create a Slurm script using the above sample with suitable modifications. Submit the job using the following command

           sbatch slurm_script.sh

Once the job is accepted by Slurm, it will return a JOBID that can be used for checking the status of the job or deleting it.
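
For example, submitting the script returns a message like the following (the job ID 12345 is only a sample value):

            sbatch slurm_script.sh
            Submitted batch job 12345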

 

Job Status

The following command will display the status of the job.

            squeue -j <jobid>

Running squeue without a job ID will display the status of all jobs. The output columns include JOBID, PARTITION, NAME, USER, ST, TIME, NODES and NODELIST(REASON).

The code under the column 'ST' gives the state of the job in the queue. Details of the common codes are given below:

 PD - job is pending, waiting for resources or priority.

 R - job is running.

 CG - job is completing after having run.

 CD - job has completed.

 CA - job has been cancelled.

 F - job has failed.

 S - job is suspended.
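
A typical squeue listing looks like the following; the job ID, user name and node name are placeholders:

            squeue -j 12345
              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              12345    nandaq  testrun username  R      10:25      1 node01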

 

Job Deletion

The following command will delete a job from the queue

              scancel <jobid>
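
scancel can also act on all of your jobs at once; replace <username> with your own login:

              scancel -u <username>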

PBS commands like qstat, qsub and qdel also work.
