HowTo:CS Launch GPU

Introduction

This guide gives an example of using GPU resources with the Endeavour container cluster. It is based on NVIDIA's example at https://catalog.ngc.nvidia.com/orgs/nvidia/resources/vae/setup, adapted to our local teaching environment.

Dataset

We use the MovieLens 20M dataset, on which the VAE-CF model was trained. MovieLens 20M is a movie rating dataset: it includes 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. The goal of the model is to predict a user's rating for a new movie, given that user's previous (movie, rating) pairs.

Overview

The guide is broken up into the following steps. We assume that you already have a project available on the Endeavour cluster and are familiar with how the CS Launch service works. See: HowTo:CS_Launch

  • Create your docker image
  • Create the container workload
  • Create an ingress for jupyterlab
  • Connect to jupyter and run the model

Create Docker Image

If you want to skip this step, you can use the pre-built image for this example: container.cs.vt.edu/carnold/gpu:latest
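As an optional sanity check (assuming you have podman installed locally), you can pull the pre-built image directly:

podman pull container.cs.vt.edu/carnold/gpu:latest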

Create Docker registry

You will need to host your Docker image in a Docker registry. A registry is available with our GitLab instance.

  • Login to https://git.cs.vt.edu
  • Click on New project button
  • You can create a blank project; all we need is the container registry, which gets created automatically. Make the project Public for ease of use.
  • From the menu on the left, select Deploy->Container Registry. This page shows your image registry URL, which you will need for both uploading and deploying.
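For example, with the username carnold and a GitLab project named gpu, the registry URL takes the form below; your own username and project name will differ:

container.cs.vt.edu/carnold/gpu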

Build Docker Image

The Docker image will be based on the NVIDIA VAE for TensorFlow: https://catalog.ngc.nvidia.com/orgs/nvidia/resources/vae_for_tensorflow

# Base image: NVIDIA's TensorFlow 1.x container from NGC
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.06-tf1-py3
FROM ${FROM_IMAGE_NAME}

# Install the model's Python dependencies
ADD requirements.txt .
RUN pip install -r requirements.txt

# Copy the model code into the image
WORKDIR /code
COPY . .

# Extract the MovieLens 20M dataset to where the model expects it
RUN mkdir -p /data/ml-20m/extracted && \
    cd /data/ml-20m/extracted && \
    unzip /code/ml-20m.zip

# Start the Jupyter notebook server on port 8888, reachable from outside the container
ENTRYPOINT ["jupyter", "notebook", "--ip", "0.0.0.0", "--port", "8888", "--allow-root"]
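Note that the Dockerfile unzips /code/ml-20m.zip, so the MovieLens 20M archive must be present in the build directory before you build. The dataset is distributed by GroupLens; assuming their standard download location (check https://grouplens.org/datasets/movielens/ if the URL has moved), you could fetch it with:

curl -O https://files.grouplens.org/datasets/movielens/ml-20m.zip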
  • Build the image: podman build . -t container.cs.vt.edu/carnold/gpu (substitute your own registry URL from before)

Upload the docker image

  • podman login container.cs.vt.edu
  • podman push container.cs.vt.edu/carnold/gpu:latest (substitute your own registry URL from before)
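When prompted by podman login, use your CS GitLab credentials. If the push later fails with an authentication error, you can check which user podman has stored for the registry (a troubleshooting aside, not a required step):

podman login --get-login container.cs.vt.edu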

Create Workload

Now it is time to actually deploy your GPU container.

  • Log into your Endeavour project on https://launch.cs.vt.edu
  • Click on Workloads on the left
  • Click on Create button
  • Select the Deployment tile
  • Fill in the Name with a name unique within the project: gpu-example
  • Fill in the Container Image with your registry URL and tag: container.cs.vt.edu/carnold/gpu:latest
  • Click the Add Port or Service button
  • Select Cluster IP from the Service Type list
  • Fill in the Name with jupyter
  • Fill in the Private Container Port with 8888
  • Click on the Create button
  • It might take a couple of minutes for the workload to start
  • After the workload starts, you can Execute Shell into the container to run interactive commands using python3:
    • python3 main.py --train --amp --checkpoint_dir ./checkpoints
    • python3 main.py --test --amp --checkpoint_dir ./checkpoints
    • python3 main.py --inference_benchmark --amp --checkpoint_dir ./checkpoints
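For reference, the form above roughly corresponds to a Kubernetes Deployment plus a ClusterIP Service. If you happen to have kubectl access to your project namespace (an assumption; it is not needed for this guide), an equivalent sketch would be:

# Create the deployment from the pushed image (substitute your registry URL)
kubectl create deployment gpu-example --image=container.cs.vt.edu/carnold/gpu:latest
# Expose Jupyter inside the cluster on port 8888
kubectl expose deployment gpu-example --name=jupyter --port=8888 --type=ClusterIP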

Create Jupyter Ingress

You can access the running jupyterlab instance using an ingress for ease of interaction with the model.

  • Log into your Endeavour project on https://launch.cs.vt.edu
  • Click on Service Discovery->Ingresses on the left
  • Click on the Create button
  • Fill in the Name with a name unique within the project: gpu-example
  • Fill in the Request Host with a unique subdomain of endeavour.cs.vt.edu, for example gpu-example.endeavour.cs.vt.edu (you will need to choose a different subdomain)
  • Fill in the Prefix with /
  • Select your workload from the Target Service list
  • Select 8888 from the Port list
  • Click on the Create button
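Like the workload form, this corresponds roughly to a Kubernetes Ingress resource. A minimal kubectl sketch, assuming the jupyter service created earlier and your own subdomain:

kubectl create ingress gpu-example --rule="gpu-example.endeavour.cs.vt.edu/*=jupyter:8888"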

Connect to Jupyter

You will need your randomly generated token to access the jupyterlab site through your ingress.

  • Log into your Endeavour project on https://launch.cs.vt.edu
  • Click on Workloads on the left
  • Click on the name of your GPU workload
  • Select View Logs from the hamburger menu of the running pod listed
  • Look through the logs for the line containing your token and copy the token
  • Navigate to the ingress you created earlier and enter your token to log in
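The Jupyter server prints a login URL containing ?token= when it starts, so the token is easy to spot. If you have kubectl access (again an assumption, not required here), you could also pull it straight from the pod logs:

# Print the first log line containing the login token
kubectl logs deploy/gpu-example | grep -m1 "token="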