The Njal Supercomputing Cluster
Software Documentation
Njal is a Beowulf cluster running Gentoo Linux and Xen. Job scheduling and batch processing are handled by Torque, and message passing is supported by MPICH2.
The following software is available on Njal.

Cluster Software

Torque

Torque lets users queue their jobs and schedules them for execution. Each command has its own man page. The main commands are outlined below.

Command   Description
qsub      Submit job to the queue
qdel      Remove job from the queue
qhold     Place job on hold
qrls      Release hold on the job
qstat     Show the status of the queued jobs
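
The commands in the table can be combined into small helper functions. The following is a minimal sketch, not part of the Njal installation: the function names run and hold_and_release and the DRY_RUN variable are made up for illustration, and the job ID 1234.njal in the usage note is hypothetical (use the ID that qsub prints for your job).

```shell
# run: in dry-run mode print the command, otherwise execute it.
# DRY_RUN is a hypothetical variable used only by this sketch.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# hold_and_release JOBID
# Place a queued job on hold, release it again, then show its status.
hold_and_release() {
    run qhold "$1" &&
    run qrls "$1" &&
    run qstat "$1"
}
```

Calling DRY_RUN=1 hold_and_release 1234.njal only prints the three Torque commands, which is a safe way to check the sequence before running it against a real job.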

To run a batch job on Njal you need to submit a PBS script using qsub. For example, the PBS script for a simple serial (non-MPI) job might look like:

#PBS -l nodes=2:ppn=2
#PBS -N myjob
#PBS -j oe
# Send me mail on job start, job end and if job aborts
#PBS -M mymail.account@sunysb.edu
#PBS -m bea

cd ~/mycode
./myprogram

This will start a job called myjob whose executable is named myprogram and located in ~/mycode. The job requests 2 nodes with 2 processors per node (ppn). For instructions on starting parallel (MPI) batch jobs, see the Mpiexec section.
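
As a rough sketch of the full workflow, the script above can be saved to a file and handed to qsub. The file name myjob.pbs is an assumption chosen for illustration, and the submission step is guarded so that nothing is submitted on a machine without Torque.

```shell
# Save the serial-job script shown above (myjob.pbs is a hypothetical file name).
cat > myjob.pbs <<'EOF'
#PBS -l nodes=2:ppn=2
#PBS -N myjob
#PBS -j oe
# Send me mail on job start, job end and if job aborts
#PBS -M mymail.account@sunysb.edu
#PBS -m bea

cd ~/mycode
./myprogram
EOF

# Submit only where Torque is installed; qsub prints the new job ID on success.
if command -v qsub >/dev/null 2>&1; then
    jobid=$(qsub myjob.pbs)
    echo "Submitted job: $jobid"
    qstat "$jobid"
else
    echo "qsub not found; run this on a machine with Torque installed"
fi
```

Keeping the job ID that qsub prints is useful because qdel, qhold, qrls and qstat all accept it as an argument.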

qsub -I starts an interactive job, which is useful for debugging your code. For example, to start an interactive session for a small job using 1 node and 1 processor, use

qsub -q small -I -l nodes=1:ppn=1

Torque will give you a shell on one of the compute nodes without using rsh or rlogin, which saves resources and, most importantly, obeys Torque's system policies for node allocation.

Njal has both Myrinet and Ethernet nodes. To request only nodes with Myrinet, use the Myrinet attribute (myri) on the command line. For example,

qsub -I -l nodes=40:myri:ppn=2

will allocate 80 CPUs on Myrinet nodes.
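
The CPU count of a request is simply nodes multiplied by ppn. The sketch below makes that arithmetic explicit. The helper name total_cpus is made up for illustration, and it assumes that an omitted nodes or ppn value counts as 1.

```shell
# total_cpus SPEC
# Print nodes * ppn for a resource string such as "nodes=40:myri:ppn=2".
# Assumption: a missing nodes or ppn value counts as 1.
total_cpus() {
    spec=$1
    nodes=$(printf '%s\n' "$spec" | sed -n 's/.*nodes=\([0-9][0-9]*\).*/\1/p')
    ppn=$(printf '%s\n' "$spec" | sed -n 's/.*ppn=\([0-9][0-9]*\).*/\1/p')
    echo $(( ${nodes:-1} * ${ppn:-1} ))
}

total_cpus "nodes=40:myri:ppn=2"   # prints 80
total_cpus "nodes=78:ppn=2"        # prints 156
```

Attributes such as myri restrict which nodes are eligible but do not change the count.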

For more information on Torque see the Torque website.


Job Monitoring

Monitoring tools have been installed for obtaining concise cluster status information, such as a list of free nodes or CPU usage. The three main commands are:

Command       Description
lspbs         Shows the availability of compute-nodes.
gstat -al1    Shows CPU utilization for running jobs.
qstat -an1    Shows what is happening on the queues.

Web Monitoring: visit our Ganglia web interface for detailed compute-node information and health status info.
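
For scripted monitoring, one option is a polling loop around qstat. This is a sketch under assumptions, not an installed tool: the function name wait_for_job and the QSTAT override variable are hypothetical, and the loop assumes that qstat returns a non-zero exit status once the job has left the queue. If the server is configured to keep completed jobs listed for a while, the loop will keep waiting until they are purged.

```shell
# wait_for_job JOBID [INTERVAL_SECONDS]
# Poll qstat until it no longer reports the job, then print a message.
# QSTAT is a hypothetical override, used here so the loop can be tried without Torque.
wait_for_job() {
    jobid=$1
    interval=${2:-30}
    while ${QSTAT:-qstat} "$jobid" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "Job $jobid is no longer in the queue"
}
```

A polling interval of 30 seconds or more keeps the load on the PBS server low.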


MPI

There are different MPI versions available. Njal uses MPICH2, specifically mpich2-1.0.3. For extended information on MPI see www.mcs.anl.gov/mpi.

Mpiexec

When running parallel code on Njal you need to use mpiexec. Mpiexec talks to PBS to find out which compute-nodes were assigned to your job, executes your program on those nodes, and forms a bridge between the PBS server and your processes. Mpiexec also helps clean up the slave MPI processes when the job is aborted on the master node. In addition, resources used by the spawned processes are accounted for correctly and reported in the PBS logs.

When running an MPI program you will need to use mpiexec in your PBS script. For example,

#PBS -l nodes=78:ppn=2
#PBS -N myjob
#PBS -j oe
cd ~/mycode
mpiexec -comm pmi ./myprogram

will ask PBS to allocate 156 CPUs for running the MPI program called myprogram.
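
To check what was actually allocated, a job script can inspect the file named by the PBS_NODEFILE environment variable, which Torque sets inside a job and which contains one line per allocated processor. The helper names count_slots and count_nodes below are made up for illustration. This is a sketch for logging purposes, since mpiexec obtains the node list from PBS on its own.

```shell
# count_slots FILE: number of processor slots (one line per slot).
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# count_nodes FILE: number of distinct hosts in the node file.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside a Torque job PBS_NODEFILE is set; elsewhere this block does nothing.
if [ -n "${PBS_NODEFILE:-}" ]; then
    echo "Allocated $(count_slots "$PBS_NODEFILE") slots on $(count_nodes "$PBS_NODEFILE") nodes"
fi
```

For the request nodes=78:ppn=2 the file would be expected to list 156 lines covering 78 distinct hosts.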

You can get a poor man's parallel debugger by using mpiexec within an interactive PBS job. Please take a look at the mpiexec website to see how to do this and more.


BLAS

Njal provides the following BLAS libraries:

Library     Description
ATLAS       ATLAS provides C and Fortran77 interfaces to a portably efficient BLAS implementation.
LIBGOTO     A high-performance BLAS by Kazushige Goto (should kick the pants off ATLAS).
REFERENCE   Basic Linear Algebra Subprograms F77 reference implementations

CBLAS

Njal provides the following CBLAS libraries:

Library     Description
ATLAS       ATLAS provides C and Fortran77 interfaces to a portably efficient BLAS implementation.
GSL         GSL CBLAS implementation.
REFERENCE   C wrapper interface to the F77 reference BLAS implementation

LAPACK

Njal provides the following LAPACK libraries:

Library     Description
ATLAS       Full LAPACK implementation using available ATLAS routines
REFERENCE   Fortran reference implementation of LAPACK (Linear Algebra PACKage)


Compilers

Njal provides compilers for C, C++ and Fortran with the GNU Compiler Collection (GCC) 4.1.2. For detailed documentation visit the GCC website. If your work requires another compiler please contact the system administrators.

Copyright (C) 2002-2007 Konstantin Likharev and Joseph Spadavecchia.