Empire Cluster
Introduction
Empire is the Computer Science production research GPU cluster. Faculty and graduate students run research jobs on it through a SLURM queue system.
Nodes
- Frieren : 128 Cores, 1.1 TB RAM, 8x H200 NVL GPUs (2x 4-way NVLinks)
Account
Access requires a Computer Science-specific account and is limited to CS faculty and graduate students. If you do not have a CS account, you can create one here: https://admin.cs.vt.edu/create
Primary Access
Access to the Empire headnode is provided via SSH. Linux and macOS usually come with an SSH client already installed. Windows users can install a client by following these directions: https://code.visualstudio.com/docs/remote/troubleshooting#_installing-a-supported-ssh-client
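A minimal connection sketch follows. The username and hostname are placeholders, since this page does not state the headnode's address; substitute the values you were given.

```shell
#!/bin/bash
# Placeholders only: neither value comes from this page.
CS_USER="your_cs_username"
EMPIRE_HOST="empire-headnode.example.edu"

SSH_TARGET="${CS_USER}@${EMPIRE_HOST}"
echo "Connect with: ssh ${SSH_TARGET}"

# Uncomment after replacing the placeholders above.
# ssh "${SSH_TARGET}"
```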
Submitting Jobs
Jobs are typically queued via the "sbatch" command and a control script that tells the queue what resources you need and the actual commands to run.
Example Batch Script
#!/bin/bash
######################################
# Simple GPU submit script for SLURM #
# Run with "sbatch script_name.sh"   #
# Variables: %j = job ID             #
# Recommend 4 CPUs per 1 GPU request #
######################################
#SBATCH --job-name="gpu_test"      # Job name
#SBATCH --output=gpu_test_%j.o     # Standard output log file
#SBATCH --error=gpu_test_%j.e      # Standard error log file
#SBATCH --nodes=1                  # Number of nodes
#SBATCH --ntasks=1                 # Number of tasks
#SBATCH --cpus-per-task=4          # CPUs per task
#SBATCH --mem=16G                  # Requested Memory
#SBATCH --time=00:00:30            # Wall time limit (HH:MM:SS)
#SBATCH --gres=gpu:1               # Request one generic GPU
#SBATCH --partition=batch          # Specify the queue name (batch)

# Start timestamp #
echo -n "Start Time: "
date

# Load necessary modules (system-specific) #
# module load python
# module load cuda/11.8    # Example CUDA version

# Activate a Python virtual/conda environment if needed #
# source /path/to/your/env/bin/activate

# Command(s) to run your program #
# python your_gpu_script.py
echo "Hello World"

# Finish timestamp #
echo -n "End Time: "
date
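As a rough sketch of submitting the example, assume the script above has been saved as gpu_test.sh (a hypothetical filename). By default sbatch prints "Submitted batch job" followed by the job ID, and the helper below (parse_job_id, a name introduced here for illustration) pulls that ID out so the matching log files can be located.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch's default
# "Submitted batch job <id>" message read on stdin.
parse_job_id() {
    awk '/^Submitted batch job/ {print $4}'
}

# Only attempt a submission where sbatch is actually available.
if command -v sbatch >/dev/null 2>&1; then
    job_id=$(sbatch gpu_test.sh | parse_job_id)   # gpu_test.sh is an assumed filename
    echo "Submitted job ${job_id}"
    echo "Logs: gpu_test_${job_id}.o and gpu_test_${job_id}.e"
fi
```

The log names follow from the --output and --error lines in the script, where %j expands to the job ID.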
Interactive Jobs
The preferred method for starting an interactive job is to first create a tmux session so that if you get disconnected you can re-attach to the session.
tmux new -s <session_name>
Next, set up the resource allocation for your job. Be sure to include a time limit, the number of CPUs, the amount of RAM, and the number of GPUs.
salloc -t 2:00:00 --cpus-per-task=4 --mem=8G --gres=gpu:1
At this point, you can monitor the queue with the "squeue" command and wait for your job to go live. Once it is live, you can attach to your job using the srun command.
srun --pty /bin/bash
Please note that you are not limited to bash; you can run entire scripts via the srun command. bash is simply the easiest way to get an interactive shell on the node. Typing "exit" ends your srun session but not your entire interactive session. You have to type "exit" again to leave the salloc session, which concludes your job in the queue. A final "exit" is required to close your tmux session, but this is optional, as many users keep their tmux session open while working on the headnode.
To re-attach to a tmux session (if you get disconnected or close your ssh window) use the following:
tmux a -t <session_name>
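When sizing the salloc request in the interactive workflow above, the batch script's recommendation of 4 CPUs per GPU can be carried over. The sketch below only builds and prints the command; the GPU count and memory value are assumptions chosen for illustration, not site recommendations.

```shell
#!/bin/bash
# Hypothetical request: 2 GPUs for 2 hours. Adjust to your job.
GPUS=2
CPUS=$(( GPUS * 4 ))    # 4 CPUs per GPU, per the batch script's recommendation
MEM="16G"               # assumed value, not a site recommendation

SALLOC_CMD="salloc -t 2:00:00 --cpus-per-task=${CPUS} --mem=${MEM} --gres=gpu:${GPUS}"
echo "${SALLOC_CMD}"
```

Run the printed command inside your tmux session to request the allocation.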
Other Useful Commands
- squeue - used to show what is in the queue and the status of each job
- scontrol show node - used to show the status of the nodes
- scontrol show job <jobid> - used to show the info of a specific job
- scancel <jobid> - used to cancel a job
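The commands above can be combined into a small status check. This is a sketch, not a site-provided tool: job_state and is_active are helper names introduced here, and the job ID comes from the first script argument.

```shell
#!/bin/bash
# Print the state of a job (e.g. PENDING, RUNNING).
# Prints nothing if the job is no longer in the queue.
job_state() {
    squeue --noheader --jobs="$1" --format="%T" 2>/dev/null
}

# Succeed if the given state string means the job is still queued or running.
is_active() {
    case "$1" in
        PENDING|CONFIGURING|RUNNING|COMPLETING) return 0 ;;
        *) return 1 ;;
    esac
}

# Only query the scheduler where squeue exists and a job ID was supplied.
if command -v squeue >/dev/null 2>&1 && [ -n "${1:-}" ]; then
    squeue -u "$USER"                 # list only your own jobs
    state=$(job_state "$1")
    if is_active "$state"; then
        echo "Job $1 is ${state}; cancel it with: scancel $1"
    else
        echo "Job $1 is no longer in the queue"
    fi
fi
```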