Major acknowledgment to Ilia Kulikov for creating part of this doc.
Do NOT run compute-heavy jobs on log-in nodes.
· HPC admins (and other HPC users) will be very upset and you might get into trouble.
Do NOT email NYU HPC unless absolutely necessary.
· Check the tutorials written by NYU HPC staff
· Check slurm documentation
· Ask on Campuswire
Part 1: Logging in to Greene
If you’re not on the NYU network, then you have three options (choose one):
1. Use NYU VPN (please figure this out on your own: link) and directly ssh to Greene
2. HPC gateway -> Greene (ssh to the gateway first, then ssh from there to Greene)
3. CIMS cluster (if you have access) -> Greene (if you don’t have a CIMS account, don’t worry—you’re not at any disadvantage)
If you’re on the NYU network (including through NYU VPN), ssh directly to Greene:
(base) Eugenes-MacBook-Pro-2:~ eugenechoi$ ssh
[ASCII-art banner: NYU HPC Greene]
Last login: Tue Mar 1 07:20:19 2022 from 10.27.6.80
Part 2: Looking around the filesystem using bash
Please learn these commands on your own if you’re not familiar with them: cd, pwd, rm (and flags like -r or -f), ls (and flags like -l, -a, or -h), du (and flags like -h or -s), cp, scp. Use man ls to check the documentation for ls, for example. Some quick examples (see the snippet after this list):
· touch
· rm
· Warning: you cannot undo rm, so please be careful when using this command.
· cp
· mv
· cd
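For example (the file and directory names below are made up purely for illustration):
touch notes.txt          # create an empty file (or update its timestamp)
cp notes.txt backup.txt  # copy a file
mkdir old                # make a directory
mv backup.txt old/       # move (or rename) a file
ls -lah                  # long listing with human-readable sizes, hidden files included
du -sh old/              # total size of a directory
rm -rf old/              # delete a directory and its contents -- irreversible!
cd ~                     # go back to your home directory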
The login node should not be used to run anything related to your computations; use it for file management (git, rsync) and job management (srun, salloc, sbatch).
Bash holds a set of environment variables that help other software find libraries or helper scripts. A few examples from a Greene login shell (commands to inspect them yourself follow the listing):
LD_LIBRARY_PATH=:/share/apps/centos/8/usr/lib:/share/apps/centos/8/lib64
SSH_CONNECTION=216.165.22.148 32920 216.165.13.138 22
ARCHIVE=/archive/ik1147
LANG=en_US.UTF-8
HISTCONTROL=ignoredups
HOSTNAME=log-2.nyu.cluster
SCRATCH=/scratch/ik1147
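You can inspect these variables yourself from the login shell, for example:
echo $SCRATCH              # print a single variable
printenv HOME              # another way to print one variable
env | grep -i scratch      # search the full environment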
Important: different filesystems
OK. So far we have been on Greene (corresponding to filesystem B below). There are other filesystems as well.
Very important: please make sure you understand this figure here!
 -(A)-     ------------------(B)------------------
 Local --> Greene login node --> Greene compute node
                             --> Burst login node --> GCP compute node
                                                       ------(C)------
Filesystem A: Local
For example, your laptop.
Filesystem B: Greene
HPC cluster overview (per-node highlights):
· large-memory nodes with 3014 GB (~3 TB) of RAM
· GPU nodes with 4 x V100 32 GB (PCIe) GPUs
· GPU nodes with 4 x RTX8000 48 GB GPUs
HPC quotas
filesystem    what to store here     flushed?        quota
/archive      long-term storage      no              2TB / 20K inodes
/home         probably nothing       no              50GB / 30K inodes
/scratch      experiments/stuff      YES (60 days)   5TB / 1M inodes
Check your quota with the myquota command.
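For example (only the commands are shown here; the exact output format depends on the cluster):
myquota               # per-filesystem quota and usage summary
du -sh $SCRATCH/*     # which of your scratch directories use the most space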
For this course:
· You probably won’t be using /archive.
· You will store very very few things (maybe just a few lines of environment-related code) on /home.
Where to store your data
· /scratch/[netid]
· How to get on the Greene login node? ssh (see above).
· How to request Greene GPUs? Later.
Burst: something between B and C
· How to get on the Burst node? After sshing to Greene, run ssh burst.
· It mostly contains the files from B (Greene), but not C (GCP).
Filesystem C: NYU HPC GCP
Where to store your temporary data (will disappear after you exit the node)?
· /tmp or /mnt/ram
Where to store the data you want to keep?
· /scratch/[netid] (recommended)
How to get on GCP compute nodes? Our class will have one account ds_ga_1012_2022sp, and four partitions:
1. interactive
2. n1s8-t4-1 (NVIDIA T4 GPU)
3. n1s8-p100-1 (NVIDIA Tesla P100)
4. n1s8-v100-1 (1 compute node with 8 CPU cores and 1 NVIDIA Tesla V100)
Read more about GCP GPU specs here.
· For simple scripts and file operations:
· srun --account=ds_ga_1012_2022sp --partition=interactive --pty /bin/bash
· Check hostname: this is on Google Cloud.
· lscpu: 1-2 CPUs.
· free -m: around 2GB memory.
· For GPUs
· srun --account=ds_ga_1012_2022sp --partition=n1s8-v100-1 --gres=gpu --time=1:00:00 --pty /bin/bash
Always use the interactive partition if you’re only doing very simple operations (e.g., moving files around, editing code with vim, etc.).
How to copy files around?
From A to B (you must be on the NYU network; VPN is also okay)
· On A, do scp [optional flags] [local-file-path] [netid]@[greene-login-node]:[greene-destination-path]
From B to A (you must be on the NYU network)
· On A, do scp [optional flags] [netid]@[greene-login-node]:[greene-file-path] [local-destination-path]
From B to C
· On C, do scp [optional flags] greene-dtn:[file-path] [gcp-destination-path]
From C to B
· On C, do scp [optional flags] [file-path] greene-dtn:[greene-destination-path]
From A to C
· A -> B -> C
From C to A
· C -> B -> A
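Putting the above together, a sketch of these copies ([greene-login-node] is a placeholder for whatever hostname you ssh to in Part 1, and the data/results paths are made up):
# A -> B: run on your laptop (A)
scp -r ./my_data [netid]@[greene-login-node]:/scratch/[netid]/
# B -> A: also run on your laptop
scp -r [netid]@[greene-login-node]:/scratch/[netid]/results ./
# B -> C: run on the GCP node (C), pulling from Greene via the data-transfer node
scp -rp greene-dtn:/scratch/[netid]/my_data /scratch/[netid]/
# C -> B: run on the GCP node, pushing back to Greene
scp -rp /scratch/[netid]/results greene-dtn:/scratch/[netid]/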
Part 3: Slurm, Burst, Singularity, and your typical workflow
Slurm and Singularity are very popular in both academic and industry settings.
Slurm is a job-management system that allocates resources (compute nodes) to you based on your requests, and can also run scripts that you pass to it.
Singularity is software for instantiating a container-based user space (loosely comparable to a virtual machine). The main idea of using a container is to provide an isolated user space on a compute node and to simplify node management with a single OS container image.
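For instance, the general shape of a Singularity command looks like this (the image path here is just a placeholder; the course-specific images and overlay options appear in Step 4 below):
singularity exec /path/to/image.sif /bin/bash   # run a shell inside the container image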
Part 3.1: Interactive setting
The typical interactive workflow (running or debugging a script) looks like this (a condensed command sketch follows the list):
1. Login: Greene’s login node.
2. (Only if using NYU HPC GCP) Log in to Burst node.
3. Request a job / computational resource and wait until Slurm grants it.
1. If you want to connect to the GCP filesystem, then you need to request an interactive job (see below). If you want to connect to the Greene filesystem, then do nothing (you’re already on it).
2. You always need to request a job for GPUs (no matter for GCP or Greene).
4. Execute singularity and start container instance.
5. Activate conda environment with your own deep learning libraries.
6. Run your code, make changes/debugging.
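As a condensed sketch of these six steps (assuming the GCP route, with [greene-login-node] as a placeholder for the hostname from Part 1 and my_script.py as a made-up script name):
# 1. From your laptop: log in to Greene (see Part 1)
ssh [netid]@[greene-login-node]
# 2. Hop to the Burst node
ssh burst
# 3. Request a GPU node on NYU HPC GCP and wait for Slurm to grant it
srun --account=ds_ga_1012_2022sp --partition=n1s8-t4-1 --gres=gpu --time=1:00:00 --pty /bin/bash
# 4. Start the Singularity container with your overlay (read-only for normal runs)
singularity exec --nv --bind /scratch/[netid] --overlay /scratch/[netid]/overlay-25GB-500K.ext3:ro /scratch/[netid]/cuda11.4.2-cudnn8.2.4-devel-ubuntu20.04.3.sif /bin/bash
# 5. Activate conda inside the container (assuming it is installed under /ext3/miniconda3, as in Step 4 of this part)
source /ext3/miniconda3/etc/profile.d/conda.sh && conda activate base
# 6. Run your code
python my_script.py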
Step 1: Log in to Greene’s login node.
Described above. NEVER run compute-heavy jobs on the login nodes. You’ll get scolded by HPC admins and you may get into trouble.
Step 2: Burst
Log in to the Burst node (run ssh burst from the Greene login node), then check the hostname with the hostname command. Do not run compute-heavy jobs on this node either!
Step 3: Requesting compute node(s)
Step 3a: If you’re using NYU HPC GCP; interactive setting
Note: If you only need CPUs, you can use Greene; CPU nodes are almost always available there. If you want GPUs on Greene, you might need to wait a long time, which is why we recommend NYU HPC GCP.
On NYU HPC GCP, our class will have one account ds_ga_1012_2022sp, and four partitions: interactive, n1s8-t4-1, n1s8-p100-1, n1s8-v100-1.
A confusing point: one partition is called interactive. That partition and the "interactive setting" in this section's header do not refer to the same thing.
For simple scripts / file operations
· srun --account=ds_ga_1012_2022sp --partition=interactive --pty /bin/bash
· After getting onto GCP node:
· Check hostname: this is on Google Cloud.
· lscpu: 1-2 CPUs.
· free -m: around 2GB memory.
For GPUs
· srun --account=ds_ga_1012_2022sp --partition=n1s8-t4-1 --gres=gpu --time=1:00:00 --pty /bin/bash
bash-4.4$ nvidia-smi
Sun Feb 20 23:40:58 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
If you want to end your srun-submitted job, use Ctrl+D or exit.
PROBLEM: If you don’t touch your keyboard for a while, or if your connection is unstable, then the job may die.
A workaround:
· (recommended by HPC staff): on the Burst node,
· sbatch --account=ds_ga_1012_2022sp --partition=interactive --time=1:00:00 --wrap "sleep infinity"
Then run squeue -u [netid] -i 5. What does -i do? Check man squeue!
(base) ~]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
51726 interacti wrap ec2684 CF 0:04 1 b-9-1
After a while:
(base) ~]$ squeue -u ec2684
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
51726 interacti wrap ec2684 R 0:07 1 b-9-1
(Btw: To exit this job, use scancel [jobid].)
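For example:
scancel 51726        # cancel the job with this JOBID (from squeue)
scancel -u $USER     # or cancel all of your own jobs at once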
Now, we can ssh onto the node!
(base) ~]$ ssh b-9-1
The authenticity of host ‘b-9-1 (
ECDSA key fingerprint is SHA256:eE/bk6mc/5wyamL8WtQd1e8MkAmMO1R5EM9XRoG9VCM.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added ‘b-9-1’ (ECDSA) to the list of known hosts.
[ASCII-art banner: NYU HPC GCP]
===============================================
– Hostname…………: b-9-1
– IP Address……….: 10.144.0.12
– Disk Space……….: remaining
===============================================
– CPU usage………..: 0.31, 0.26, 0.10 (1, 5, 15 min)
– Memory used………: 266 MB / 1814 MB
– Swap in use………: 0 MB
===============================================
To get GPUs:
· As discussed above: srun --account=ds_ga_1012_2022sp --partition=n1s8-t4-1 --gres=gpu --time=01:00:00 --pty /bin/bash
· If you’re afraid of disconnecting: sbatch --account=ds_ga_1012_2022sp --partition=n1s8-t4-1 --gres=gpu --time=01:00:00 --wrap "sleep infinity", and then ssh onto this compute node
Warning: In this case you leave the compute node with Ctrl+D or exit, but the job keeps running after you leave. To actually cancel the job, use scancel. Note: each student only has a fixed number of compute hours for the semester.
Step 3b: (strongly discouraged by HPC admins) If you’re using Greene; interactive setting
Note: If you only need CPUs, you can use Greene; CPU nodes are almost always available there. If you want GPUs, the wait for your first few jobs might be relatively short, but you might need to wait a long time later (unfortunately, higher priority is given to full-time researchers), which is why we recommend NYU HPC GCP.
· Your jobs will probably have very low priority (unfortunately, full-time researchers have much higher priority), so the wait time will likely be very long (e.g., >24 hours), whereas you usually only wait a few minutes for GCP GPUs.
· Your jobs may be killed if your GPU utilization is low.
HPC has hundreds of nodes (computers) connected to a high-speed network with a shared filesystem ($SCRATCH).
If you decide to use Greene machines, skip Step 2.
Again, using NYU HPC GCP (following Step 3a) is strongly encouraged.
CPU: srun --nodes=1 --cpus-per-task=1 --mem=32GB --time=1:00:00 --pty /bin/bash
GPU: srun --nodes=1 --cpus-per-task=1 --mem=32GB --time=1:00:00 --gres=gpu:1 --pty /bin/bash
Again, check the man page or here.
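If you prefer a batch job over an interactive shell (so it survives a dropped connection), a minimal sbatch script could look like the sketch below; the resource values mirror the srun lines above, and the job and file names are made up:
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=32GB
#SBATCH --time=1:00:00
#SBATCH --gres=gpu:1
#SBATCH --output=%x_%j.out
# your commands go here, e.g.:
nvidia-smi
Submit it with sbatch example.sbatch and monitor it with squeue -u $USER.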
Check what this job looks like in the slurm queue:
~]$ squeue -u ${USER}
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3167927 rtx8000 bash ik1147 R 2:05 1 gr030-ib0
GPU status after you requested the GPU in the interactive setting:
~]$ nvidia-smi
Thu Feb 25 18:11:06 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:06:00.0 Off |                    0 |
| N/A   39C    P0    59W / 250W |      0MiB / 45556MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Running python:
~]$ python
bash: python: command not found
(unless you configured your environment before, of course)
Step 4: Setting up singularity and activating conda environment
First, we copy over the empty filesystem image where we will put our conda environment later (you only need to do this once in the semester):
# On Greene
cd ${SCRATCH}
cp /scratch/work/public/overlay-fs-ext3/overlay-25GB-500K.ext3.gz .
# On Burst: first get onto a GCP node
srun --account=ds_ga_1012_2022sp --partition=n1s8-t4-1 --gres=gpu --time=1:00:00 --pty /bin/bash
# Then download the overlay filesystem
cd /scratch/[netid]
scp greene-dtn:/scratch/work/public/overlay-fs-ext3/overlay-25GB-500K.ext3.gz .
Unzip the ext3 filesystem image (this may take 5-10 minutes):
# On both Greene and GCP
gunzip -vvv ./overlay-25GB-500K.ext3.gz
The filesystem image can be mounted as read-write (rw) or read-only (ro) when we use it with Singularity.
· read-write: use this one when setting up env (installing conda, libs, other static files)
· read-only: use this one when running your jobs. It has to be read-only since multiple processes will access the same image. It will crash if any job has already mounted it as read-write.
Now let's start a CPU-only job and launch a Singularity container with the fresh filesystem image we just copied over (you need to do the steps below every time you want to run GPU jobs):
# On GCP (assuming our current directory is /scratch/[netid])
scp -rp greene-dtn:/scratch/work/public/singularity/cuda11.4.2-cudnn8.2.4-devel-ubuntu20.04.3.sif .
singularity exec --bind /scratch/[netID] --overlay /scratch/[netID]/overlay-25GB-500K.ext3:rw /scratch/[netID]/cuda11.4.2-cudnn8.2.4-devel-ubuntu20.04.3.sif /bin/bash
# GPU - rw and ro variants
singularity exec --nv --bind /scratch/[netID] --overlay /scratch/[netID]/overlay-25GB-500K.ext3:rw /scratch/[netID]/cuda11.4.2-cudnn8.2.4-devel-ubuntu20.04.3.sif /bin/bash
singularity exec --nv --bind /scratch/[netID] --overlay /scratch/[netID]/overlay-25GB-500K.ext3:ro /scratch/[netID]/cuda11.4.2-cudnn8.2.4-devel-ubuntu20.04.3.sif /bin/bash
Important: if you want to use GPUs inside the Singularity container, add the --nv argument after exec.
# On Greene and GCP
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
--2021-02-26 17:40:31--  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)… 104.16.130.3, 104.16.131.3, 2606:4700::6810:8303, …
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.130.3|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 94235922 (90M) [application/x-sh]
Saving to: ‘Miniconda3-latest-Linux-x86_64.sh’
Miniconda3-latest-Linux-x86_64.sh 100%[============================================================================================>] 89.87M 197MB/s in 0.5s
2021-02-26 17:40:32 (197 MB/s) – ‘Miniconda3-latest-Linux-x86_64.sh’ saved [94235922/94235922]
Now install the conda package. If the installer asks where to install, type in /ext3/miniconda3, and agree to let it update your .bashrc at the end if it asks:
# on Greene and GCP
Singularity> bash ./Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3
PREFIX=/ext3/miniconda3
Unpacking payload …
Collecting package metadata (current_repodata.json): done
Solving environment: done
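At this point conda lives under /ext3/miniconda3. If python or conda is not on your PATH inside the container (for example, because your .bashrc is not sourced there), a common workaround (an assumption, not an official course step) is:
source /ext3/miniconda3/etc/profile.d/conda.sh   # make the conda command available
conda activate base                              # base env created by the installer
which python                                     # should now point to /ext3/miniconda3/bin/python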
We got python! Now let’s install a few more libraries:
pip3 install torch==1.10.2+cu113 torchvision==0.11.3+cu113 torchaudio==0.10.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install transformers
pip install nlp
pip install sklearn
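To sanity-check the installation (run inside the container; the CUDA check only returns True on a GPU node started with --nv):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import transformers; print(transformers.__version__)"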
Many python libraries store static files, such as pretrained models, on disk when you import a particular model. Let's re-route the cache location to $SCRATCH (this is /scratch/[netid] on both Greene and GCP; if the variable doesn't exist, type /scratch/[netid] instead of $SCRATCH, or simply set SCRATCH=/scratch/[netid]).
First, create folders in scratch:
(base) mkdir $SCRATCH/.cache
(base) mkdir $SCRATCH/.conda
Now remove any existing cache directories (run these from your home directory):
(base) rm -rfv .conda
(base) rm -rfv .cache
Now create symbolic links (symlinks) to scratch:
(base) ln -s $SCRATCH/.conda ./
(base) ln -s $SCRATCH/.cache ./
(base) ls -l .conda
lrwxrwxrwx 1 ik1147 ik1147 22 Feb 26 18:02 .conda -> /scratch/ik1147/.conda
Let’s check how ‘heavy’ our filesystem became:
(base) du -sh /ext3/
4.6G /ext3/
We are capped at 25G, so we are good to go. Feel free to install other packages along the way, but remember to mount the overlay with rw; otherwise you will get read-only errors.
Now change your working directory to /home/[netid] and create these files if they do not exist:
1. touch .bashrc
2. touch .bash_profile
Open the .bashrc file with vim (vim .bashrc) and paste in the following lines of code:
1. press i
2. paste in the following code block
3. press esc, followed by :wq and enter
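As a hedged placeholder for that block (an assumption about typical contents for this setup, not the official course snippet), the .bashrc usually just makes conda from the overlay available whenever the container is running:
# Hypothetical ~/.bashrc contents -- replace with the block provided in class.
if [ -f /ext3/miniconda3/etc/profile.d/conda.sh ]; then
    source /ext3/miniconda3/etc/profile.d/conda.sh
    conda activate base
fi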