Body
This page documents answers to questions that we've been asked or that we're surprised we haven't been asked yet. These questions could be about working in Unity's environment, a difference between Unity and OSC (or other HPC environments), or unexpected behavior of Unity (without judgement of why this is unexpected or who finds it unexpected).
Access and file transfer
File transfer
Unity runs an sftp service; the simplest way to move files to or from Unity is with an sftp or scp command on your local computer.
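For example (the file names here are placeholders; substitute your own name.# and files), copying files between your local machine and your Unity home directory might look like this, run on your own computer:

```
$ scp results.csv name.#@unity.asc.ohio-state.edu:~/
$ scp -r name.#@unity.asc.ohio-state.edu:~/project ./
```

The first command copies a local file to your home directory on Unity; the second copies a directory from Unity into the current local directory.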
Windows OpenSSH
Recent versions of Windows 10 include an implementation of the OpenSSH client. This allows Windows users to work with commands such as ssh and scp at the command line (PowerShell and Windows Terminal can make this a productive environment). However, the message authentication code (MAC) that the Windows OpenSSH client uses by default is not compatible with recent security enhancements on Unity login nodes. To ssh into a Unity login node from a Windows client on campus or through the ASC VPN, you need to specify a different MAC. Something like this works:
> ssh -m hmac-sha2-512 name.#@unity.asc.ohio-state.edu
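Rather than typing the -m flag every time, you can set the MAC once in your OpenSSH client configuration. A sketch of the relevant stanza in ~/.ssh/config (the Host alias "unity" is just an example):

```
Host unity
    HostName unity.asc.ohio-state.edu
    User name.#
    MACs hmac-sha2-512
```

With this in place, plain ssh unity (and scp/sftp to the unity alias) will use the compatible MAC automatically.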
General issues
Don't request all of a node's memory
While the list of compute nodes shows the total amount of memory for each node, not all of that memory is available for jobs; some is required by the node's operating system. This means that if you request 192 GB, for example, the job cannot run on nodes with 192 GB (or less) of memory. Instead, the job will try to run on a node with more than 192 GB (with the current nodes on Unity, at least 256 GB). You may have to wait for one of the larger nodes to become available, or the job may never run if the nodes with the most memory are reserved for exclusive use. Instead, ask for a little less than the full amount of memory available on a node (say, 184 GB to run on a 192-GB node).
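In a job script, that request might look like this (the 184 GB figure follows the example above; adjust it to the node size you're targeting):

```
#SBATCH --nodes=1
#SBATCH --mem=184G
```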
How do I change my default shell?
You interact with Unity through a program called a shell; over the years, many shells have been developed. Probably the most commonly used shell is bash, which is the default on Unity. On a stand-alone Linux system, you can change your default shell with the chsh command. However, Unity gets its user information from ASCTech's directory service; to change your default shell on Unity, submit a request or send email to asctech@osu.edu.
R and Python
R and Python are installed locally on some (but not all) of the compute nodes. This can lead to confusion: you don't know whether R or Python is present on a given node or what version it might be. Use the module system instead of relying on a local installation.
To use Python, first load an appropriate Python module. Even if you're going to use Python from a conda environment, you'll need to load a Python module first in order to access the conda command.
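A minimal sketch of the order of operations (the module version shown is an example; run module avail python to see what's installed, and "my-env" is a placeholder for your own environment name):

```
$ module load python/3.7-2020.02
$ conda --version
$ conda activate my-env
```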
When using R (whether in batch mode or interactive mode), first load an appropriate compiler module and then load R.
module load intel
module load R
You can also specify a particular version of R if you don't want the default (after loading the compiler module, run module avail to see what versions of R are available):
module load gnu/9.1.0
module load R/4.0.2
Install R, Python and Perl packages
Like most HPC centers, we ask users to install their own packages and modules for some environments.
OSC has instructions for installing R, Python and Perl packages.
You can also compile and install your own software.
Common issues when installing packages in R
Makevars
Using install.packages() in R typically involves downloading source code, then compiling and installing the package in your home directory. Usually this is C, C++ or Fortran code. If you were compiling these things outside of R, you could include options on the command line or in a makefile. It's less convenient to include options with install.packages(), but for some packages you have to.
Sometimes you see fleeting lines of output during compilation like this:
lmrob.c(2709): error: identifier "i" is undefined
for(int i=0; i < n; i++) {
Before the C99 standard, the C language did not allow declaring a variable inside a for() loop (the "int i=0" part), so this code fails when the compiler defaults to an older standard. The fix is to tell the compiler to use the newer standard. If you were typing the compile command yourself, you'd just add a flag. To get R to do this, you use a file called Makevars in the .R directory in your home directory (so ~/.R/Makevars). You'll probably have to create both .R and Makevars. So at a Linux command prompt, do this:
$ mkdir ~/.R
$ cd .R
$ vi Makevars
There's a dot before R, the R has to be upper-case, and you can use whatever text editor you want to edit Makevars. Makevars just needs one line:
CFLAGS= -std=c99
That gives the C compiler a flag telling it to use the C99 standard. Then start R and run install.packages(). After you're done installing packages that need C99, remember to get rid of the Makevars file, or it will be used the next time you install a package, when you may not need it. You can just rename Makevars to Makevars.c99 so that R won't find it and you won't have to remember all this the next time you do need it (it can be convenient to keep a set of Makevars.XXX files for different cases: you can give the C and Fortran compilers different flags or change which compilers get used, for example). A similar flag is -std=gnu99.
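The renaming step can be sketched like this. A scratch directory stands in for ~/.R here so the example is safe to try anywhere; on Unity you would operate on ~/.R itself:

```shell
# Scratch directory standing in for ~/.R
RDIR=$(mktemp -d)
printf 'CFLAGS= -std=c99\n' > "$RDIR/Makevars"
# Park the file under a descriptive name; R no longer finds a Makevars
mv "$RDIR/Makevars" "$RDIR/Makevars.c99"
ls "$RDIR"
```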
There's more information on Makevars in the R Installation and Administration manual. A Google search on the initial error message during compilation can help you identify the pathology and the fix.
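As noted above, a Makevars variant can set flags for more than one compiler. A sketch (the FFLAGS line is illustrative, not required for the C99 fix):

```
CFLAGS= -std=c99
FFLAGS= -O2
```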
C++ extensions
Some packages fail with a different problem. A line like this
/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found
is telling you that the C++ runtime on the node is missing extensions the package needs. The fix is to load the cxx17 module before starting R. So altogether:
$ ml intel
$ ml cxx17
$ ml R/3.5.1
Then start R and run install.packages()
.
Failed dependencies
Occasionally a package installation can fail because a previous installation of a dependent package failed but left enough of the dependent package behind to let subsequent installations think that the dependent package is okay to use. An example might make this clearer. We recently saw this error during installation of the vctrs package.
Error: package or namespace load failed for ‘vctrs’:
.onLoad failed in loadNamespace() for 'vctrs', details:
call: library.dynam(lib, package, package.lib)
error: shared object ‘backports.so’ not found
In the appropriate local library path (~/R/x86_64-pc-linux-gnu-library/3.5), there was a directory named backports, but there was also a directory named 00LOCK-backports. The install.packages() function creates the corresponding 00LOCK- directory while it's installing a package and removes it when the package is successfully installed. If the package installation fails, the 00LOCK- directory doesn't get deleted. It's apparently not much use in telling you what went wrong, but running remove.packages("backports") and then install.packages("vctrs") allowed vctrs to be installed successfully.
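What remove.packages() accomplished here amounts to clearing out both the half-installed package directory and its leftover lock directory, so the next install starts clean. A sketch of the same cleanup, simulated in a scratch directory (on Unity the real path would be your local R library, as above):

```shell
# Scratch directory standing in for the local R library path
LIB=$(mktemp -d)
# Simulate a failed install: package directory plus leftover 00LOCK- directory
mkdir -p "$LIB/backports" "$LIB/00LOCK-backports"
# Remove both so a subsequent install.packages() starts from scratch
rm -rf "$LIB/backports" "$LIB/00LOCK-backports"
```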
Adding conda environments to Jupyter Notebook kernels
This only needs to be done once per environment: add ipykernel to your conda environment and register the environment as a Jupyter kernel.
When building new conda environment
$ ml python/3.7-2020.02
$ conda create --name tf25 tensorflow=2.5 ipykernel
$ conda activate tf25
$ python -m ipykernel install --user --name envs-tf25 --display-name "Python (tf25)"
$ conda deactivate
When using existing conda environment
$ ml python/3.7-2020.02
$ conda activate tf24
$ conda install ipykernel
$ python -m ipykernel install --user --name envs-tf24 --display-name "Python (tf24)"
$ conda deactivate
To use such an environment from Jupyter Notebook in OnDemand, start Jupyter Notebook and select the name of your environment's kernel from the New button in the upper right.
Slurm notes and issues
Lifetime of a simple job
Submit a job
You submit a job with the sbatch command.
$ sbatch sleep.sh
A Slurm job script is similar to the Torque/Moab scripts used previously on Unity. Here's a simple one; it just writes a message to a file, sleeps for 90 seconds, and writes another message.
#!/usr/bin/env bash
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=sbatch-sleep-test
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<name.#>@osu.edu
cd $SLURM_SUBMIT_DIR
echo "Going to sleep: `date`" >> sbatch-sleep-test.txt
sleep 90
echo "Awake: `date`" >> sbatch-sleep-test.txt
Is it running?
You can see all of the jobs that are running:
$ squeue
You can see the status of your jobs (replace <name.#> with your own name.#):
$ squeue -u <name.#>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1003 batch sbatch-s shew.1 R 0:07 1 u066
You can get more information on using squeue with squeue --help or man squeue.
Done
By default, a Slurm job writes its standard output and standard error to a file named slurm-<job_id>.out in the directory from which the job was submitted (in the example above, we redirected standard output to another file, so an empty slurm-<job_id>.out file was created).
As with PBS jobs, you can ask Slurm to send you email when a job begins and ends. However, email from Slurm is not as informative; all the useful information is in the email's subject and the body is empty. For example, completion of the above job resulted in email from "SLURM User" with the subject line "Slurm Job_id=1003 Name=sbatch-sleep-test Ended, Run time 00:01:31, COMPLETED, ExitCode 0". From the value of ExitCode you can infer whether the job succeeded (usually ExitCode 0) or failed (usually ExitCode not 0), but not much else.
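If you need more detail than the subject line provides, Slurm's accounting command can report on a completed job (the job id here is the one from the example above; the format fields are a sampling of what sacct can show):

```
$ sacct -j 1003 --format=JobID,JobName,State,ExitCode,Elapsed
```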
Interactive sessions
Using salloc and srun
To run an interactive session, you first use the salloc command to allocate the desired resources on the cluster, then connect a shell to those resources by using the srun command. Here's the simplest example of this:
[shew.1@unity-login1 ~] $ salloc
salloc: Pending job allocation 2811719
salloc: job 2811719 queued and waiting for resources
salloc: job 2811719 has been allocated resources
salloc: Granted job allocation 2811719
salloc: Waiting for resource configuration
salloc: Nodes u123 are ready for job
[shew.1@unity-login1 ~] $ srun --jobid=2811719 --pty /bin/bash
[shew.1@u123 ~] $
This allocates the default of one node, one core, and 3 GB of memory for one hour on a compute node that you have permission to run on. If you want more resources or a specific node, salloc has many options; type salloc --help or man salloc to see them. For example, to ask for eight cores and 120 GB of memory on node u123, you can say this:
$ salloc --nodelist=u123 --ntasks=8 --mem=120g --time=04:00:00
Many of the arguments have short versions; the previous command could have been written:
$ salloc -w u123 -n 8 --mem=120g -t 04:00:00
Apparently the memory flag does not have a short version. Note also that the srun command requires the jobid from the salloc command.
A further bit of manual work is required here: you type exit to quit the shell you started with the srun command, but the allocation remains. To free resources for other users, you need to remember to cancel the allocation after you're done with it:
[shew.1@u123 ~] $ exit
[shew.1@unity-login1 ~] $ scancel 2811719
Using sinteractive
A simpler way to get an interactive session is to use sinteractive, a Perl script that combines salloc and srun at the expense of a smaller set of options. For example, you can get a similar interactive session, including a shell connected to the allocated resources, with just one command:
[shew.1@unity-login1 ~] $ sinteractive -w u123 -M 120000 -t 04:00:00
An oddity of sinteractive is that the memory request has to be entered as an integer number of megabytes; it doesn't understand 120g or 120gb. A further simplification is that when you type exit to quit the shell, the allocated resources are freed automatically.
In Progress...