HowTo:CS Launch GPU
Introduction
This guide gives an example of using GPU resources with the Endeavour container cluster. It is based on Nvidia's example at https://catalog.ngc.nvidia.com/orgs/nvidia/resources/vae/setup, adapted to our local teaching environment.
Dataset
We use the MovieLens 20M dataset, a movie rating dataset containing 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. The VAE-CF model is trained on each user's existing (movie, rating) pairs; the trained model then predicts the rating a user would give to a movie they have not yet rated.
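If you want a feel for the raw data, you can peek at the ratings file once you have downloaded the archive (see the Build Docker Image section below). This is just a suggested sketch; it assumes the standard GroupLens archive layout, with the ratings in ml-20m/ratings.csv inside the zip:
# Print the first few lines of the ratings file straight from the archive
unzip -p ml-20m.zip ml-20m/ratings.csv | head -n 3
# The first line printed is the CSV header: userId,movieId,rating,timestamp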
Overview
The guide is broken up into the following steps. We assume that you already have a project available on the Endeavour cluster and are familiar with how the CS Launch service works. See: HowTo:CS_Launch
- Create your docker image
- Create the container workload
- Create an ingress for jupyterlab
- Connect to jupyter and run the model
Create Docker Image
If you want to skip this step, you can use the pre-built image for this example: container.cs.vt.edu/carnold/gpu:latest
Create Docker Registry
You will need to host your Docker image in a registry. A container registry is available as part of our GitLab instance.
- Login to https://git.cs.vt.edu
- Click on New project button
- You can create a blank project, all we need is to use the container registry which gets created automatically. Make the project Public for ease of use.
- From the menu on the left, select Deploy->Container Registry. This will give you your image registry URL, which you will need for both uploading and deploying.
Build Docker Image
The docker image will be based on the Nvidia VAE for TensorFlow: https://catalog.ngc.nvidia.com/orgs/nvidia/resources/vae_for_tensorflow
- SSH to rlogin.cs.vt.edu
- Make a directory to hold the files and change into it:
mkdir gpu
cd gpu
- Download the image files:
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/vae_for_tensorflow/versions/20.06.3/zip -O vae_for_tensorflow_20.06.3.zip
- Unzip the files:
unzip vae_for_tensorflow_20.06.3.zip
- Download the dataset:
wget http://files.grouplens.org/datasets/movielens/ml-20m.zip
- Modify the Dockerfile:
vim Dockerfile
# Base image: Nvidia's TensorFlow 1.x container
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.06-tf1-py3
FROM ${FROM_IMAGE_NAME}

# Install the Python dependencies for the VAE example
ADD requirements.txt .
RUN pip install -r requirements.txt

# Copy the example code in and extract the MovieLens dataset into /data
WORKDIR /code
COPY . .
RUN mkdir -p /data/ml-20m/extracted; \
    cd /data/ml-20m/extracted; \
    unzip /code/ml-20m.zip

# Start a Jupyter notebook server on port 8888
ENTRYPOINT ["jupyter", "notebook", "--ip", "0.0.0.0", "--port", "8888", "--allow-root"]
- Build the image:
podman build . -t container.cs.vt.edu/carnold/gpu
Note: you will need to substitute your own registry URL from before
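Before uploading, you can optionally sanity-check what you built. This is only a suggested sketch; it assumes podman run works on your build host, and no GPU is needed just to confirm that the image assembles and Jupyter starts:
# Test the dataset archive for corruption
unzip -t ml-20m.zip > /dev/null && echo "ml-20m.zip OK"
# Start the container locally; Jupyter should print a startup URL containing a
# token. Press Ctrl+C to stop it. (No GPU is needed just to test the ENTRYPOINT.)
podman run --rm -p 8888:8888 container.cs.vt.edu/carnold/gpu:latest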
Upload the Docker Image
podman login container.cs.vt.edu
podman push container.cs.vt.edu/carnold/gpu:latest
Note: you will need to substitute your own registry URL from before
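As a quick check that the push succeeded, you can pull the image back (or browse to Deploy->Container Registry in your GitLab project):
podman pull container.cs.vt.edu/carnold/gpu:latest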
Create Workload
Now it is time to actually deploy your GPU container.
- Log into your Endeavour project on https://launch.cs.vt.edu
- Click on Workloads on the left
- Click on the Create button
- Select the Deployment tile
- Fill in the Name with a project-unique name:
gpu-example
- Fill in the Container Image with your registry URL and tag:
container.cs.vt.edu/carnold/gpu:latest
- Click the Add Port or Service button
- Select Cluster IP from the Service Type list
- Fill in the Name with jupyter
- Fill in the Private Container Port with 8888
- Click on the Create button
- It might take a couple of minutes for the workload to start
- After the workload starts, you can Execute Shell into the container to run interactive commands using python3:
python3 main.py --train --amp --checkpoint_dir ./checkpoints
python3 main.py --inference_benchmark --amp --checkpoint_dir ./checkpoints
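For reference, the form fields above correspond roughly to a Kubernetes Deployment plus a ClusterIP Service. The sketch below is illustrative only: it assumes you have kubectl access to your project namespace (the launch UI is the supported path), and it omits the GPU resource limit (e.g. nvidia.com/gpu) that a real GPU workload would also set:
# Hypothetical command-line equivalent of the workload form (placeholder names)
kubectl create deployment gpu-example --image=container.cs.vt.edu/carnold/gpu:latest
kubectl expose deployment gpu-example --name=jupyter --port=8888 --type=ClusterIP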
Create Jupyter Ingress
You can access the running jupyterlab instance through an ingress for easier interaction with the model. (A quick reachability check is sketched after the steps below.)
- Log into your Endeavour project on https://launch.cs.vt.edu
- Click on Service Discovery->Ingresses on the left
- Click on the Create button
- Fill in the Name with a project-unique name:
gpu-example
- Fill in the Request Host with a unique subdomain of endeavour.cs.vt.edu, for example: gpu-example.endeavour.cs.vt.edu (Note: you will need to use a different subdomain)
- Fill in the Prefix with /
- Select your workload from the Target Service list
- Select 8888 from the Port list
- Click on the Create button
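Once the ingress is active, a quick reachability check from any machine looks like the following (substitute your own subdomain; use http:// if TLS is not set up on the ingress):
# An HTTP redirect toward Jupyter's login page suggests the ingress,
# service, and pod are wired together correctly
curl -I https://gpu-example.endeavour.cs.vt.edu/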
Connect to Jupyter
You will need the randomly generated token to log into the jupyterlab instance through the ingress. (A command-line alternative for finding it is sketched after the steps below.)
- Log into your Endeavour project on https://launch.cs.vt.edu
- Click on Workloads on the left
- Click on the name of your GPU workload
- Select View Logs from the hamburger menu of the running pod
- Look through the logs, find your token, and copy it
- Navigate to the ingress URL you created earlier and enter the token to log in
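If you happen to have kubectl access to the project namespace (hypothetical; the UI log viewer shown above is the supported path), the token can also be extracted from the pod logs:
# Jupyter prints a startup URL containing ?token=...; grab the first match
kubectl logs deploy/gpu-example | grep -o 'token=[a-f0-9]*' | head -n 1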