HowTo:CS Launch GPU


Revision as of 10:35, 11 July 2024

Introduction

This guide gives an example of using GPU resources with the Endeavour container cluster. It is based on Nvidia's example at https://catalog.ngc.nvidia.com/orgs/nvidia/resources/vae/setup, adapted to our local teaching environment.

Dataset

We use the MovieLens 20M dataset, a movie rating dataset containing 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. The VAE-CF model is trained on each user's previous (movie, rating) pairs; after training, the model predicts how a user would rate a movie they have not yet rated.
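As a concrete sketch of what the model consumes: the real ml-20m/ratings.csv has userId, movieId, rating, and timestamp columns, and VAE-CF-style models typically binarize ratings into an implicit-feedback user-by-item matrix. The rows below are made up for illustration; only the column meaning matches the real dataset.

```python
import numpy as np

# Toy stand-in for ml-20m/ratings.csv rows: (userId, movieId, rating).
# The values are fabricated; the real file has 20 million such rows.
ratings = [
    (1, 10, 4.0), (1, 20, 5.0),
    (2, 10, 3.5), (2, 30, 4.5),
    (3, 20, 2.0),
]

users = sorted({u for u, _, _ in ratings})
movies = sorted({m for _, m, _ in ratings})
u_idx = {u: i for i, u in enumerate(users)}
m_idx = {m: j for j, m in enumerate(movies)}

# VAE-CF-style models consume implicit feedback: a binary
# user-by-item interaction matrix (1 = the user rated the movie).
X = np.zeros((len(users), len(movies)), dtype=np.float32)
for u, m, _ in ratings:
    X[u_idx[u], m_idx[m]] = 1.0

print(X.shape)  # (3, 3)
```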

Overview

The guide is broken up into the following steps. We assume that you already have a project available on the Endeavour cluster and are familiar with how the CS launch service works. See: HowTo:CS_Launch

  • Create your docker image
  • Create the container workload
  • Create an ingress for jupyterlab
  • Connect to jupyter and run the model

Create Docker Image

If you want to skip this step, you can use a pre-built image for this example: container.cs.vt.edu/carnold/gpu:latest

Create Docker registry

You will need to host your docker image in a docker registry. A docker registry is available with our Gitlab instance.

  • Login to https://git.cs.vt.edu
  • Click on New project button
  • You can create a blank project; all we need is the container registry, which is created automatically. Make the project Public for ease of use.
  • From the menu on the left, select Deploy->Container Registry. This shows your image registry URL, which you will need for both uploading and deploying.

Build Docker Image

The docker image is based on the Nvidia VAE for TensorFlow: https://catalog.ngc.nvidia.com/orgs/nvidia/resources/vae_for_tensorflow. The build context must contain the model code from that resource (including its requirements.txt) alongside the Dockerfile, plus the MovieLens ml-20m.zip archive, since the Dockerfile below copies and extracts them.

# Base image: Nvidia's TensorFlow 1.x container from NGC
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.06-tf1-py3
FROM ${FROM_IMAGE_NAME}

# Install the model's Python dependencies
ADD requirements.txt .
RUN pip install -r requirements.txt

# Copy the model code (and ml-20m.zip) into the image
WORKDIR /code
COPY . .

# Extract the MovieLens dataset to the path the model expects
RUN mkdir -p /data/ml-20m/extracted; \
    cd /data/ml-20m/extracted; \
    unzip /code/ml-20m.zip

# Start jupyter on port 8888, reachable from outside the container
ENTRYPOINT ["jupyter", "notebook", "--ip", "0.0.0.0", "--port", "8888", "--allow-root"]
  • Build the image: podman build . -t container.cs.vt.edu/carnold/gpu (Note: substitute your own registry URL from the previous step)

Upload the docker image

  • podman login container.cs.vt.edu
  • podman push container.cs.vt.edu/carnold/gpu:latest (Note: substitute your own registry URL from the previous step)

Create Workload

Now it is time to actually deploy your GPU container.

  • Log into your Endeavour project on https://launch.cs.vt.edu
  • Click on Workloads on the left
  • Click on Create button
  • Select the Deployment tile
  • Fill in the Name with a name unique within the project: gpu-example
  • Fill in the Container Image with your registry URL and tag: container.cs.vt.edu/carnold/gpu:latest
  • Click the Add Port or Service button
  • Select Cluster IP from the Service Type list
  • Fill in the Name with jupyter
  • Fill in the Private Container Port with 8888
  • Click on the Create button
  • It might take a couple of minutes for the workload to start
  • After the workload starts, you can Execute Shell into the container to run interactive commands using python3:
    • python3 main.py --train --amp --checkpoint_dir ./checkpoints
    • python3 main.py --inference_benchmark --amp --checkpoint_dir ./checkpoints
    • python3 main.py --test --amp --checkpoint_dir ./checkpoints
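For reference, the workload created by the UI steps above corresponds roughly to a Kubernetes Deployment plus a ClusterIP Service. This is only a sketch: the exact objects, labels, and GPU resource fields that CS Launch generates may differ.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-example
  template:
    metadata:
      labels:
        app: gpu-example
    spec:
      containers:
        - name: gpu-example
          image: container.cs.vt.edu/carnold/gpu:latest
          ports:
            - containerPort: 8888   # jupyter's private container port
          resources:
            limits:
              nvidia.com/gpu: 1     # GPU request; Launch may set this for you
---
apiVersion: v1
kind: Service
metadata:
  name: jupyter
spec:
  type: ClusterIP
  selector:
    app: gpu-example
  ports:
    - port: 8888
      targetPort: 8888
```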

Create Jupyter Ingress

You can access the running jupyterlab instance through an ingress, which makes interacting with the model easier.

  • Log into your Endeavour project on https://launch.cs.vt.edu
  • Click on Service Discovery->Ingresses on the left
  • Click on the Create button
  • Fill in the Name with a name unique within the project: gpu-example
  • Fill in the Request Host with a unique subdomain for endeavour.cs.vt.edu Example: gpu-example.endeavour.cs.vt.edu Note: you will need to use a different subdomain.
  • Fill in the Prefix with /
  • Select your workload from the Target Service
  • Select 8888 from the Port list
  • Click on the Create button
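The UI steps above correspond roughly to a standard Kubernetes Ingress object. Again, a sketch only: the host and names come from the example values above (the service name jupyter is the port/service name from the workload step), and the manifest Launch actually generates may differ, e.g. in ingress class.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gpu-example
spec:
  rules:
    - host: gpu-example.endeavour.cs.vt.edu  # substitute your own subdomain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: jupyter    # the ClusterIP service from the workload step
                port:
                  number: 8888
```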

Connect to Jupyter

You will need the randomly generated token to access the jupyterlab site through the ingress.

  • Log into your Endeavour project on https://launch.cs.vt.edu
  • Click on Workloads on the left
  • Click on the name of your GPU workload
  • Select View Logs from the hamburger menu of the running pod
  • Look through the logs, find your token, and copy it
  • Navigate to your ingress that you created earlier, and enter your token to login
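If you would rather extract the token programmatically, jupyter prints login URLs of the form http://host:8888/?token=&lt;hex&gt; to its startup logs, so a small regex over the log output works. A sketch, with a fabricated log line standing in for the real output:

```python
import re

# Example of the URL line jupyter prints on startup
# (the token value here is made up):
log_line = "    http://127.0.0.1:8888/?token=abc123def456"

# Jupyter tokens are hex strings appended as a query parameter.
match = re.search(r"[?&]token=([0-9a-f]+)", log_line)
token = match.group(1) if match else None
print(token)  # abc123def456
```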