HowTo:CS Launch GPU
Introduction
This guide gives an example of using GPU resources with the Endeavour container cluster. It is based on Nvidia's example at: https://catalog.ngc.nvidia.com/orgs/nvidia/resources/vae/setup and adapted to our local teaching environment.
Dataset
We use the MovieLens 20M dataset, the movie rating dataset on which the VAE-CF model was trained. It includes 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. The goal of our model is to predict the rating a user would give a new movie, given that user's previous (movie, rating) pairs. The model is trained on these pairs, and the trained model then predicts ratings for movies the user has not yet rated.
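To make the prediction task concrete, the sketch below builds the per-user (movie, rating) history that the model learns from. It assumes the `userId,movieId,rating,timestamp` column layout of the `ratings.csv` file in the MovieLens archive. The sample rows are made up for illustration, and `user_history` is a hypothetical helper name, not part of Nvidia's example.

```shell
# Hypothetical helper: print "movieId rating" pairs for one user.
# Reads CSV rows (userId,movieId,rating,timestamp) from stdin and
# skips the header line.
user_history() {
    awk -F, -v user="$1" 'NR > 1 && $1 == user { print $2, $3 }'
}

# Made-up sample rows, not real MovieLens data.
sample='userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,4.0,1112484676
2,3,4.0,974820889
2,62,5.0,974820598'

# Show the (movie, rating) history for user 1.
printf '%s\n' "$sample" | user_history 1
```

On the real dataset the same function could be fed the extracted file, for example `user_history 1 < ratings.csv`.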
Overview
The guide is broken up into the following steps. We assume that you already have a project available on the Endeavour cluster and are familiar with how the CS Launch service works. See: HowTo:CS_Launch
- Create your docker image
- Create the container workload
- Create an ingress for jupyterlab
- Connect to jupyter and run the model
Create Docker Image
If you want to skip this step, you can use my pre-built image for this example: container.cs.vt.edu/carnold/gpu:latest
Create Docker registry
You will need to host your docker image in a docker registry. A docker registry is available with our GitLab instance.
- Login to https://git.cs.vt.edu
- Click the New project button
- You can create a blank project; all we need is the container registry, which gets created automatically. Make the project Public for ease of use.
- From the menu on the left, select Deploy->Container Registry. This will give you the image registry URL that you will need for both uploading and deploying.
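As a minimal sketch of how the registry URL is used later, the helper below combines it with a tag to form a full image reference. The URL shape `container.cs.vt.edu/<username>/<project>` is an assumption inferred from the pre-built image name given earlier in this guide; `image_ref` is a hypothetical helper name.

```shell
# Hypothetical helper: compose an image reference from the registry URL
# shown on the Container Registry page and an optional tag.
image_ref() {
    registry_url="$1"    # assumed form: container.cs.vt.edu/<username>/<project>
    tag="${2:-latest}"   # fall back to "latest" when no tag is given
    printf '%s:%s\n' "$registry_url" "$tag"
}

# Reproduces the name of the pre-built image mentioned in this guide.
image_ref container.cs.vt.edu/carnold/gpu
```

Substitute the URL from your own project's Container Registry page for the first argument.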
Build Docker Image
The docker image will be based on the Nvidia VAE for TensorFlow: https://catalog.ngc.nvidia.com/orgs/nvidia/resources/vae_for_tensorflow
- SSH to rlogin.cs.vt.edu
- Make a directory to hold the files and change into it:
mkdir gpu
cd gpu
- Download the image files:
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/vae_for_tensorflow/versions/20.06.3/zip -O vae_for_tensorflow_20.06.3.zip
- Unzip the files:
unzip vae_for_tensorflow_20.06.3.zip
- Download the dataset:
wget http://files.grouplens.org/datasets/movielens/ml-20m.zip
- Modify the Dockerfile:
vim Dockerfile
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.06-tf1-py3
FROM ${FROM_IMAGE_NAME}
ADD requirements.txt .
RUN pip install -r requirements.txt
WORKDIR /code
COPY . .
RUN mkdir -p /data/ml-20m/extracted; \
    cd /data/ml-20m/extracted; \
    unzip /code/ml-20m.zip
ENTRYPOINT ["jupyter", "notebook", "--ip", "0.0.0.0", "--port", "8888", "--allow-root"]
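The Dockerfile above adds requirements.txt, copies the build context into /code, and unzips /code/ml-20m.zip, so those files need to sit next to the Dockerfile before the image is built. The sketch below checks for them. It assumes the Nvidia archive was unzipped into the current directory, and `check_build_context` is a hypothetical helper name.

```shell
# Hypothetical helper: list the files the Dockerfile needs that are
# missing from the build context (default: current directory).
# Returns 0 when everything is present, 1 otherwise.
check_build_context() {
    dir="${1:-.}"
    missing=0
    for f in Dockerfile requirements.txt ml-20m.zip; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Check the current directory and report if anything is absent.
check_build_context . || echo "build context is incomplete"
```

If a file is reported missing, repeat the corresponding download or unzip step in the same directory as the Dockerfile.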