Chalawan HPC Documentation

Documentation Status

Chalawan is NARIT's high-performance computing (HPC) cluster. It supports parallel programming and is designed for compute-intensive workloads and massive datasets across a wide range of applications.

Getting an account

To start using the Chalawan cluster, users must first submit an online application to obtain computing time credit and storage space. The merit of your proposal will be considered, taking into account the amount of requested resources. The application also gives our cluster administrators a better understanding of what users need and how best to prepare the environment and applications for you.

Online application

Sign Up

Sign up for an Online Application account. If you are eligible, your account should be activated within one working day. After the account is activated, please sign in to the system with the username (email address) and password you provided at sign-up and start filling in the online application form.

Submit your application

Fill in the online application form and submit it. Once completed, you should receive notification of our decision within a week, along with any further queries regarding required software and setup.

Account setup notification

Once your account and the required setup are ready, our system admin will send you an email with your account details, allocated Slurm credit, storage quota and instructions on how to log in to the Chalawan cluster.

Accessing the Chalawan

The command-line interface

Our operating system is based on GNU/Linux. Thus, the command-line interface (CLI), or command language interpreter, is the primary means of interacting with our HPC. If you are not familiar with the command-line interface, the free online course at Codecademy is a good place to start.

Accessing the Chalawan

For Microsoft Windows users, see Connect to the Remote Server from Microsoft Windows.

The Chalawan cluster is an isolated system that resides in NARIT's internal network. At present, we have two systems, Castor and Pollux (hereafter the computing systems).

Castor is the older system, assigned the IP address 192.168.5.100. It contains 16 traditional Compute nodes suited for CPU-intensive tasks. Pollux is the newer system, assigned the IP address 192.168.5.105. It contains 3 GPU nodes and 3 traditional Compute nodes refurbished from Castor. If you are connecting from inside the NARIT network, you can connect to these systems directly via the secure shell (ssh) command.

Connection from outside NARIT network

However, if you are connecting from outside the NARIT network, you need to log in to the gateway machine, a.k.a. stargate, first. The gateway machine's IP address and other details are given to you once you are granted permission to access the Chalawan cluster.

Secure shell (ssh) through an intermediate host (the gateway)

This is the easiest method, using the ProxyJump directive. If this method does not work for you because you are using a very old version of ssh, please read the next section.

To use ProxyJump, simply add the flag -J followed by user@gateway.ip:port. The example below shows how to connect to Castor (remember to replace gateway.ip and port with the information from the email).

[user@local ~]$ ssh -J user@gateway.ip:port user@192.168.5.100
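
For convenience, you can also configure the jump permanently in your ~/.ssh/config. The snippet below is only a sketch: the Host aliases stargate and castor are our suggestions, and gateway.ip, port and user must be replaced with the details from your account email.

# ~/.ssh/config
Host stargate
    HostName gateway.ip        # gateway address from the email
    Port port                  # gateway port from the email
    User user

Host castor
    HostName 192.168.5.100
    User user
    ProxyJump stargate

With this in place, ssh castor connects through the gateway automatically, and the same alias works for scp and rsync.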

Connection from outside NARIT network

topics

wording…

Transferring Files & Data

rsync & scp (secure copy)

Jump host

Multi-stage
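
As a minimal sketch, files can be copied through the gateway in one step by telling scp or rsync to jump via the gateway host; the local and remote paths below are placeholders, and gateway.ip and port come from your account email.

[user@local ~]$ scp -o ProxyJump=user@gateway.ip:port ./data.tar.gz user@192.168.5.100:/home/user/
[user@local ~]$ rsync -avP -e "ssh -J user@gateway.ip:port" ./results/ user@192.168.5.100:/home/user/results/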

Cloud Storage

It is possible to mount your cloud storage drive using Rclone.
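
As a hedged sketch (the remote name mycloud and the mount point are placeholders, and mounting requires that FUSE is available to your account), the workflow looks roughly like this:

[user@pollux ~]$ rclone config                      # interactively create a remote, e.g. "mycloud"
[user@pollux ~]$ mkdir -p ~/cloud
[user@pollux ~]$ rclone mount mycloud: ~/cloud --daemon
[user@pollux ~]$ ls ~/cloud                         # browse the cloud drive like a local folder

Alternatively, rclone copy and rclone sync transfer files without mounting at all.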

JupyterLab Interface

While accessing the cluster via Chalawan JupyterLab, users may use the interface to upload local files to, or download them from, the cluster (see Using JupyterLab).

Note

The upload file size limit via the JupyterLab interface is set to 100 MB. Users will be asked to confirm when trying to upload a file larger than 15 MB. Please consider using an alternative method to transfer many large files, as doing so through JupyterLab would negatively affect other users' connections.

Using Module Environments

Topic

wording…

Job Submission

Topic

wording…

Compute

Castor & Pollux

Node Configuration

Storage

Lustre

wording…

Policy & Queue

topic

wording…

Slurm Credit Allocation & Application

topic

wording…

Acknowledgement & Publication

topic

wording…

Slurm Workload Manager

Nodes and partitions

Before submitting any job to Pollux, you should learn about the available resources. To do that, use sinfo to view information about Compute nodes and partitions. When run, the command prints information like the output below.

[user@pollux]$ sinfo
HOSTNAMES PARTITION     AVAIL CPUS(A/I/O/T) CPU_LOAD ALLOCMEM FREE_MEM GRES    STATE TIMELIMIT
pollux1   chalawan_gpu  up    0/24/0/24     3.68     0        54028    gpu:4   idle  infinite
pollux2   chalawan_gpu  up    0/28/0/28     3.71     0        246330   gpu:4   idle  infinite
pollux3   chalawan_gpu  up    0/28/0/28     3.60     0        246343   gpu:4   idle  infinite
castor1   chalawan_cpu* up    0/16/0/16     0.01     0        55444    (null)  idle  infinite
castor2   chalawan_cpu* up    0/16/0/16     0.01     0        55434    (null)  idle  infinite
castor3   chalawan_cpu* up    0/16/0/16     0.01     0        55455    (null)  idle  infinite
castor4   chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor5   chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor6   chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor7   chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor8   chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor9   chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor10  chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor11  chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor12  chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor13  chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor14  chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor15  chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor16  chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor17  chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite
castor18  chalawan_cpu* up    0/16/0/28     0.01     0        92160    (null)  idle  infinite

Here we introduce a new field, PARTITION. A partition is a specific group of Compute nodes. Note that the suffix "*" identifies the default partition. AVAIL shows a partition's state (up or down), while CPUS(A/I/O/T) shows the CPU counts on each node by state, in the form "allocated/idle/other/total".
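
sinfo also accepts filters and alternative layouts; for example, to show only the GPU partition or a long per-node report:

[user@pollux]$ sinfo -p chalawan_gpu
[user@pollux]$ sinfo -N -l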

Basic job submission

Slurm environment variables

Upon startup, sbatch will read and handle the options set in environment variables. Note that environment variables will override any options set in a batch script, and command-line options will override any environment variables. The full details are in the sbatch manual (man sbatch), section "INPUT ENVIRONMENT VARIABLES". For example, suppose we have a script named task1.slurm:

#!/bin/bash

#SBATCH -J task1              # Job name
#SBATCH -t 00:01:00           # Run time (hh:mm:ss)

echo "Hello World!"

The default partition is chalawan_cpu. If we want to submit the job to chalawan_gpu instead, we can do either

[user@pollux]$ sbatch -p chalawan_gpu ./task1.slurm
or

[user@pollux]$ export SBATCH_PARTITION="chalawan_gpu"
[user@pollux]$ sbatch ./task1.slurm

There are also output environment variables, set by the Slurm controller inside the batch script, e.g., SLURM_JOB_ID and SLURM_CPUS_ON_NODE. For the full details, see "OUTPUT ENVIRONMENT VARIABLES" in the sbatch manual (man sbatch). You may use them in your script for convenience. The example below prints some of these values.

[user@pollux]$ cat ./echo.slurm
#!/bin/bash

#SBATCH -J echo                 # Job name
#SBATCH -o %x-%j.out            # Name of stdout output file

echo "Job name: $SLURM_JOB_NAME"
echo "Job ID: $SLURM_JOB_ID"
[user@pollux]$ sbatch ./echo.slurm
[user@pollux]$ cat ./echo-130.out
Job name: echo
Job ID: 130

Batch vs Interactive jobs

We use the command sbatch followed by a batch script to submit a job to Slurm. sbatch exits immediately after the script is successfully transferred to the Slurm controller and assigned a Slurm job ID. The batch script is not necessarily granted resources immediately; it may sit in the queue of pending jobs for some time before its required resources become available.

[user@pollux]$ sbatch [OPTIONS...] executable [args...]

The batch script may contain options preceded by #SBATCH before any executable commands. For example, we create a simple batch script called task1.slurm that prints the string "Hello World!". The file looks like this:

#!/bin/bash

#SBATCH -J task1             # Job name
#SBATCH -t 00:01:00          # Run time (hh:mm:ss)

echo "Hello World!"

After submission with the command sbatch task1.slurm, if there is an empty slot, your task will run and exit almost instantly. You will find the output file, slurm-%j.out, in the current working directory, where %j is replaced with the job allocation number. The words "Hello World!" appear inside that output file. By default, both standard output and standard error are directed to the same file.

[user@pollux]$ sbatch ./task1.slurm
Submitted batch job 128
[user@pollux]$ cat ./slurm-128.out
Hello World!

Frequently used sbatch options

There are many options you can add to a script file. The frequently used options are listed below. Each option must be preceded by #SBATCH. For other available options, see the Slurm website or use sbatch -h or man sbatch.

Option                          Description
-J, --job-name=<name>           name of the job
-N, --nodes=<N>                 number of nodes on which to run (N = min[-max])
-n, --ntasks=<count>            number of tasks to run
-c, --cpus-per-task=<ncpus>     number of CPUs required per task
-e, --error=<err>               file for the batch script's standard error
-o, --output=<out>              file for the batch script's standard output
-p, --partition=<partition>     partition requested
-t, --time=<time>               time limit
--mem=<MB>                      minimum amount of real memory
--gres=<list>                   required generic resources
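
Putting several of these options together, a batch script header might look like the sketch below (the job name, resource amounts and executable are placeholders):

#!/bin/bash

#SBATCH -J myjob              # Job name
#SBATCH -p chalawan_gpu       # Partition
#SBATCH -N 1                  # Number of nodes
#SBATCH -n 4                  # Number of tasks
#SBATCH -t 02:00:00           # Run time (hh:mm:ss)
#SBATCH --mem=16000           # Minimum real memory in MB
#SBATCH --gres=gpu:1          # One GPU
#SBATCH -o %x-%j.out          # Standard output file
#SBATCH -e %x-%j.err          # Standard error file

./myprogram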

Running Parallel Jobs

Shared memory

A shared-memory job runs multiple threads or processes that share memory on one machine. Your script should request a single Compute node, with at most the following number of threads per machine:

  • 16 on the partition chalawan_cpu

  • 24 on the node pollux1

  • 28 on the nodes pollux2 and pollux3

This mode is recommended for programs written with OpenMP directives or C/C++ multi-threading. An example script is shown here:

#!/bin/bash

#SBATCH -J shared     # Job name
#SBATCH -N 1          # Total number of nodes requested
#SBATCH -n 16         # Total number of mpi tasks
#SBATCH -t 120:00:00  # Run time (hh:mm:ss)

mpirun -np 16 [ options ] <program> [ <args> ]
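
If your program is purely multi-threaded (OpenMP) rather than MPI, the following sketch may fit better; the executable name is a placeholder:

#!/bin/bash

#SBATCH -J shared_omp   # Job name
#SBATCH -N 1            # One node
#SBATCH -n 1            # One task
#SBATCH -c 16           # 16 CPUs (threads) for that task
#SBATCH -t 120:00:00    # Run time (hh:mm:ss)

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match the thread count to the allocation
./my_openmp_program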

Distributed memory

For distributed memory, each process has its own memory and does not share it with any other. A distributed-memory job can run across multiple Compute nodes. It requires a program written with a specific parallel interface, e.g. the Message Passing Interface (MPI). Moreover, it requires additional setup to scatter the processes over Compute nodes. Suppose we want to run a job with 16 processes, spawning 4 processes on each Compute node; we may write:

#!/bin/bash

#SBATCH -J distributed # Job name
#SBATCH -N 4           # Total number of nodes requested
#SBATCH -n 16          # Total number of mpi tasks
#SBATCH --ntasks-per-node=4 # Total number of tasks per one node
#SBATCH -t 120:00:00   # Run time (hh:mm:ss)

mpirun -np 16 -ppn 4 [ options ] <program> [ <args> ]

Job Management

Job status

If there is no empty slot, the job will be listed in the pending (PD) state. You can view it using the command squeue. The output column ST stands for state; a running job is displayed with the state R. The Compute node the job is running on is shown in the last column.

[user@pollux]$ squeue
JOBID PARTITION     NAME    USER ST    TIME NODES NODELIST(REASON)
1587  chalawan_    task2    goku PD    0:00     1  (Priority)
1585  chalawan_    task1    user PD    0:00     1  (Resources)
1584  chalawan_    task0  vegeta R  3:21:49     1  pollux3
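
To narrow the list to your own jobs, or to a specific job ID, you can filter the output, for example:

[user@pollux]$ squeue -u $USER
[user@pollux]$ squeue -j 1584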

Job deletion

To remove a running or pending job from the queue, use the command scancel followed by the job ID. To cancel all your jobs (running and pending), run scancel -u <username>.
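
For example, using the job IDs from the squeue output above:

[user@pollux]$ scancel 1585          # cancel a single job
[user@pollux]$ scancel -u $USER      # cancel all of your jobs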

Job history

sacct displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database.
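
For example, to summarise a finished job (the job ID and the chosen fields are only an illustration):

[user@pollux]$ sacct -j 128 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS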

Slurm Cheatsheet

All Slurm commands start with 's' followed by an abbreviation of the action. Here we list the basic commands for submitting or deleting a job and querying information about it.

sacct is used to report job or job step accounting information about active or completed jobs.

salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.

sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.

sinfo reports the state of partitions and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options.

squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
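
As a sketch of an interactive session with salloc and srun (partition, resources and program are placeholders):

[user@pollux]$ salloc -p chalawan_cpu -N 1 -n 4 -t 01:00:00
[user@pollux]$ srun ./myprogram      # runs inside the allocation
[user@pollux]$ exit                  # release the allocation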

Which software is installed on Chalawan

The Chalawan HPC cluster comes with a set of preconfigured software for all users. The software on Chalawan may be divided into two categories:

  • General software

  • Shared licensed software

Users can manage the preinstalled software and compilation tools available on the Chalawan HPC system using the "module" tool.

To list the currently loaded software, type:

module list

If your program is not available on our HPC system, you can try to install it yourself using the compilation tools, or contact our HPC staff.

Installing Software on the Chalawan HPC

If the software you need is not installed on our HPC system, you can use the following instructions to install it.

Before installing software, first find out which compilation tools and system requirements it needs. Then you can follow these general steps to install it.

1. Preparing the software

Upload or download your code to your desired location. For example, you can download a software package with wget by typing:

wget -P foldername internet/sourcecode.tar.gz

We suggest that you place it in your home folder (e.g., "/home/username/program/"). If the software comes as an archive, you can extract it into a specific folder by typing:

tar -zxvf sourcecode.tar.gz -C softwarefolder

2. Installing the software

Before starting the installation, check the system requirements. If you downloaded the source code from the developer, you can find the installation instructions in the package (e.g., a README file). You can load a specific compiler version (for example, GNU 9.3.0) by typing:

module load gnu9/9.3.0

Please note that loading a different software module may change the compiler version accordingly. Therefore, make sure that all prerequisite modules and compilers match.

To install the software, first change into the software folder and then configure it by typing:

cd softwarefolder
./configure

Then compile and install the source code by typing:

make
make install

If all sources compile correctly, you will find the newly installed software in that folder.
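
Since regular users cannot write to system locations such as /usr/local, software built with the usual configure/make toolchain is normally installed into the home directory instead. As a sketch (the install path is a placeholder):

cd softwarefolder
./configure --prefix=$HOME/program/softwarename
make
make install

The executables then end up under $HOME/program/softwarename/bin.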

3. Loading the software

To use your installed software conveniently, you can create an alias for it in your shell startup script.

First, open the shell startup file (.bashrc) located in your home folder ("/home/username/.bashrc").

Then create an alias for your software by adding this line to the ".bashrc" file:

alias myprogram='softwarefolder/softwarename'

where "myprogram" is the alias name (which you can change) and "softwarefolder/softwarename" is the path to your software's executable. Note that before setting an alias, you should check that the name is not already taken (e.g., by typing it in the terminal).

Finally, reload the startup file by typing (once):

source ~/.bashrc

After finishing all these steps, you can run your software via the alias by typing:

myprogram
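
Alternatively, instead of an alias you can add the software's bin directory to your PATH in ".bashrc" (the path below is a placeholder); unlike an alias, this also works inside non-interactive batch jobs:

export PATH=$HOME/program/softwarename/bin:$PATH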

Software Module Scheme & Module environment

The Chalawan HPC comes with software packages pre-installed for all users. To list the available software packages, type:

module av

To list the software packages that are currently loaded, type:

module list

Load/Unload module

If you would like to change modules, you can load or unload a module by typing:

module load modulename
module unload modulename
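
You can also swap one loaded module for another in a single step; the module names here are placeholders, so check module av for what is actually installed:

module swap oldmodulename newmodulename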

Python (Version, Environment and Packages)

Default Python version on the Chalawan HPC

Python is preinstalled on the Chalawan HPC cluster. You can start the Python interpreter by typing:

  • Python version 2.x

    python
    
  • Python version 3.x

    python3
    

These are the default interpreters, and some packages are pre-installed, for example astropy, matplotlib, numpy, pandas and scipy. You can run your code by typing:

python myprogram.py
python3 myprogram.py

Create your environment

Users may need to manage packages themselves, which is not allowed with the default installation. Therefore, if you want to manage the installed packages, you should create your own environment. Note that, due to the limited free space in your home folder, you should check the free space and clear cache files when installing new packages.

You can use the following steps to create your own environment.

To load the Anaconda module to manage the environment, type,

module load anaconda

To create your own environment, for example one named "myenv", type:

conda create -n myenv

If you want to set a specific version of Python, type:

conda create -n myenv python=3.7

When you need to activate your environment for coding or running your program, type:

module load anaconda
conda activate myenv

Managing Packages

We can use Anaconda to manage the Python packages.

To install the new package, type,

conda install packagename

Or, via pip:

pip install packagename

To uninstall the package, type,

pip uninstall packagename

Installed packages may have updates available that fix bugs and improve performance.

To update all packages of your environment, type,

conda update -n myenv --update-all

Or, with a specific package, type,

pip install packagename --upgrade
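
To check what is installed in an environment, or to list all of your environments, type:

conda list -n myenv
conda env list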

Basic Linux commands

On this page you can learn the basic commands of Linux systems, which will help you work effectively in Linux.

File Management

pwd     Find out what directory you are currently in.
cd      Move to another working directory.
ls      List all files and directories.
cp      Copy a file.
scp     Copy a file across systems.
mv      Move a file.
rm      Remove a file.
tar     Archive and extract files.
wget    Download a file from the internet.

Using JupyterLab

Getting Started with Chalawan HPC Lab

Chalawan HPC Lab is a next-generation web-based user interface powered by JupyterHub. It enables you to work with documents and activities such as terminals, Jupyter notebooks, text editors and custom components in a flexible, integrated and extensible manner. It also allows you to access Pollux from outside the NARIT network, so you do not have to go through stargate or a VPN.

Login to Chalawan HPC Lab

1.1 Go to the Chalawan HPC Lab website https://lab.narit.or.th

1.2 Enter your username and password (the same ones you normally use to log in to the Chalawan headnode).

1.3 Click login.

[Image: JupyterLab login page]

Start Jupyter server

Once logged in, if you do not have a currently running Jupyter server, you will be asked to spawn a new one. Spawners control how JupyterHub starts the individual server for each user. To work with JupyterLab, you should specify the resources for your job: CPU core(s), RAM, time limit, etc. Please note that JupyterLab uses the time credit from your HPC account, so please stop the server when you have finished your work. If you already have a running Jupyter server, this step is skipped and you are brought straight to the JupyterLab interface.

Running on Headnode (pollux)

For developing, testing, submitting Slurm jobs or just accessing your files, you may spawn the Jupyter server on the Chalawan headnode. Please bear in mind that heavy usage or long-running jobs on the headnode are not allowed: batch jobs must be submitted with Slurm sbatch (using the "Terminal" from the "Launcher"), and for interactive jobs that require heavy usage you must spawn a Jupyter server on Compute node(s).

[Image: JupyterLab front page]

Running on Compute node(s)

For heavy usage, you must select a spawn option other than "Chalawan headnode". A few template profiles are provided. These options allocate the computing resources listed in the option via the Slurm job scheduler. For more flexibility, use the "Advance Slurm job config" option. If the requested resources are available, your Jupyter server and JupyterLab (see section 3) will be launched. If the remaining resources on the cluster do not meet your request, the spawner waits in the queue for 1 minute before it aborts and gives you a failure message.

[Image: JupyterLab spawn options]

“Advance Slurm Job config”

[Image: "Advance Slurm job config" form]

Managing your Jupyter servers

Each user is allowed to simultaneously run 5 extra servers apart from the main “Default Server”. The server management page can be accessed via the “Home” tab or “File > Hub Control Panel” menu. You may use this interface to start, stop or add a server.

[Image: Jupyter server management page]

Stop and manage your Jupyterlab server

Go to the Hub homepage https://lab.narit.or.th/hub/home, or open it from the File menu.

[Image: File menu showing the Hub Control Panel entry]

On this page you will see your default server and any additional server(s). You can add up to 5 additional servers by entering a name for the server and clicking "Add New Server".

To stop your server, click “Stop Server”.

Contact us

E-mail: hpc@narit.or.th

Slack: join the Chalawan Slack workspace for support.

Frequently asked questions

I cannot log in to the Pollux system. What can I do?

I cannot submit jobs to the Slurm queue on the Pollux system. What can I do?