Empire Cluster

Introduction

Empire (empire.cs.vt.edu) is the Computer Science production research GPU cluster. It allows faculty and grad students to work on research projects via a queue system.

Nodes

Frieren: 128 cores, 1.1 TB RAM, 8x H200 NVL GPUs (2x 4-way NVLinks)

Account

Access is controlled by a Computer Science-specific account and is limited to CS faculty and grad students. If you do not have a CS account, you can create one here: https://admin.cs.vt.edu/create

Primary Access

Access to the Empire headnode is provided via SSH. Linux and macOS usually ship with an SSH client already installed. Windows users can install a client by following these directions: https://code.visualstudio.com/docs/remote/troubleshooting#_installing-a-supported-ssh-client
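As a minimal sketch, connecting from a terminal might look like the following. It assumes the headnode answers at the hostname given in the introduction (empire.cs.vt.edu); your_cs_username is a placeholder for your own CS account name.

```shell
# Placeholder: replace with your own CS account name.
CS_USER="your_cs_username"
# Assumption: the headnode is reachable at the hostname from the introduction.
EMPIRE_HOST="empire.cs.vt.edu"

# Build and print the command so it can be checked, then run it by hand:
#   ssh your_cs_username@empire.cs.vt.edu
SSH_CMD="ssh ${CS_USER}@${EMPIRE_HOST}"
echo "$SSH_CMD"
```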

Submitting Jobs

Jobs are typically queued via the "sbatch" command and a control script that tells the queue what resources you need and the actual commands to run.

Example Batch Script

#!/bin/bash
######################################
# Simple GPU submit script for SLURM #
# Run with "sbatch script_name.sh"   #
# Variables: %j = job ID             #
# Recommend 4 CPUs per 1 GPU request #
######################################

#SBATCH --job-name="gpu_test"    # Job name
#SBATCH --output=gpu_test_%j.o   # Standard output log file
#SBATCH --error=gpu_test_%j.e    # Standard error log file
#SBATCH --nodes=1                # Number of nodes
#SBATCH --ntasks=1               # Number of tasks
#SBATCH --cpus-per-task=4        # CPUs per task
#SBATCH --mem=16G                # Requested Memory
#SBATCH --time=00:00:30          # Wall time limit (HH:MM:SS)
#SBATCH --gres=gpu:1             # Request one generic GPU
#SBATCH --partition=batch        # Specify the queue name (batch)

# Start timestamp #
echo -n "Start Time: "
date

# Load necessary modules (system-specific) #
# module load python
# module load cuda/11.8 # Example CUDA version

# Activate a Python virtual/conda environment if needed #
# source /path/to/your/env/bin/activate

# Command(s) to run your program #
# python your_gpu_script.py
echo "Hello World"

# Finish timestamp #
echo -n "End Time: "
date
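To connect the script to the submission step, the sketch below submits it and derives the log file names from the job ID. The file name script_name.sh comes from the script's header comment; the helper names submit_job and log_names are hypothetical. It relies on the --parsable option of sbatch, which prints only the job ID (followed by ";cluster" on multi-cluster setups).

```shell
# Hypothetical helper: submit a script and print just the numeric job ID.
submit_job() {
    # --parsable prints "jobid" or "jobid;cluster" instead of a full sentence.
    out=$(sbatch --parsable "$1") || return 1
    echo "${out%%;*}"
}

# Hypothetical helper: log file names the example script writes for a job ID,
# matching its --output and --error patterns (gpu_test_%j.o / gpu_test_%j.e).
log_names() {
    echo "gpu_test_$1.o gpu_test_$1.e"
}

# Only submit when SLURM is available and the script exists here.
if command -v sbatch >/dev/null 2>&1 && [ -f script_name.sh ]; then
    jobid=$(submit_job script_name.sh) && echo "Logs: $(log_names "$jobid")"
fi
```

By default the log files are written relative to the directory from which sbatch was run.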

Interactive Jobs

The preferred method for starting an interactive job is to first create a tmux session so that if you get disconnected you can re-attach to the session.

tmux new -s <session_name>

Next, set up the resource allocation for your job. Be sure to include a time limit, the number of CPUs, the amount of RAM, and the number of GPUs.

salloc -t 2:00:00 --cpus-per-task=4 --mem=8G --gres=gpu:1
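For requests with more than one GPU, the sketch below scales the CPU count using the 4-CPUs-per-GPU recommendation from the batch script header. The helper name salloc_cmd is hypothetical; the time limit and memory are copied from the command above and are not scaled, so adjust them to your job.

```shell
# Hypothetical helper: print an salloc command for a given number of GPUs,
# requesting 4 CPUs per GPU as recommended in the batch script header.
salloc_cmd() {
    gpus=$1
    cpus=$((gpus * 4))
    echo "salloc -t 2:00:00 --cpus-per-task=${cpus} --mem=8G --gres=gpu:${gpus}"
}

# Print the command for a 2-GPU request; copy and run it once it looks right.
salloc_cmd 2
```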

At this point, you can monitor the queue with the "squeue" command and wait for your job to go live. Once it is live, you can attach to it using the srun command.

srun --pty /bin/bash

Note that srun is not limited to bash; it can also run entire scripts. Bash is simply the most convenient way to get an interactive shell on the node. Typing "exit" ends the srun session but not the interactive job: type "exit" a second time to leave the salloc session, which removes your job from the queue. A third "exit" closes the tmux session; this is optional, as many users keep their tmux session open while working on the headnode.

To re-attach to a tmux session (if you get disconnected or close your ssh window) use the following:

tmux a -t <session_name>
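The create and re-attach steps can also be combined. The sketch below uses the -A option of tmux new-session, which attaches to the named session if it already exists and creates it otherwise. The helper name tmux_session is hypothetical.

```shell
# Hypothetical helper: attach to the named session, creating it if needed.
tmux_session() {
    # new-session -A behaves like attach-session when the session exists.
    tmux new-session -A -s "$1"
}

# Usage (run on the headnode):
#   tmux_session my_session
```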

Other Useful Commands

squeue - used to show what is in the queue and the status of each job
scontrol show node - used to show the status of the nodes
scontrol show job <jobid> - used to show the info of a specific job
scancel <jobid> - used to cancel a job
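As a small sketch of how these commands combine, the helpers below report the state of one job and cancel it only if it has not started yet. The helper names job_state and cancel_if_pending are hypothetical; the squeue options used (-u, -h, -j, and -o with the %T state field) are standard.

```shell
# Hypothetical helper: print the state (e.g. PENDING, RUNNING) of one job.
job_state() {
    # -h drops the header, -j selects the job, -o '%T' prints the state name.
    squeue -h -j "$1" -o '%T'
}

# Hypothetical helper: cancel a job only while it is still waiting in the queue.
cancel_if_pending() {
    if [ "$(job_state "$1")" = "PENDING" ]; then
        scancel "$1"
    fi
}

# List only your own jobs when SLURM is available on this machine.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER"
fi
```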
