Sushi101
Latest revision as of 10:33, 17 April 2020
Summary
Sushi is a small cluster shared by several Stack@CS faculty members.
Sushi Nodes
There are 10 sushi nodes. Each node has:
- 48 cores
- 256GB RAM
- Local scratch space -- a few hundred GB in /tmp
- Shared /home file system -- 180TB over NFS
- 10Gig Ethernet connection to the other nodes
- Access to the external Internet
Sushi Access
Only certain labs have access to Sushi. If you are not sure, ask the lab PI.
If you have an account on sushi, you can access sushi via the head node: sushi.cs.vt.edu (128.173.236.117 on the intranet)
- scp external files to your home directory on the head node
- Launch jobs from the head node
Sushi Jobs
Launch jobs from the head node using the PBS job submission system. There are many guides to PBS on the web. I recommend Purdue's guide.
DO NOT run expensive tasks on the head node itself. This affects the cluster's stability and inconveniences everyone.
Learning to use sushi
Before you use sushi for the first time, you should:
- Read this wiki page
- Learn about the PBS system
- Review the man pages for qsub, qstat, and qnodes
- Try a simple practice job, e.g. an "echo" that prints the node name
This may take you a day or two. It is well worth the investment.
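For the practice job, a minimal script like the following is enough (a sketch; the resource request is an example -- adjust to taste). Note that the #PBS lines are directives to qsub but ordinary comments to bash, so you can smoke-test the script locally before submitting it:

```shell
# practice.sh -- a minimal PBS practice job that prints the node name
cat > practice.sh <<'EOF'
#!/usr/bin/env bash
#PBS -l nodes=1:ppn=1
echo "Hello from node `hostname`"
EOF
chmod +x practice.sh

# Smoke-test locally (the #PBS lines are just comments here):
bash practice.sh

# On the head node, submit with:
#   qsub practice.sh
```

Once the job completes, PBS writes stdout/stderr to files named after the job (e.g. practice.sh.o<jobid>) in the submission directory.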
Example
Here's what I do for "embarrassingly parallel" jobs driven by an input file with one task per line.
Split input into files, one task per line
(10:51:16) davisjam@sushi-headnode ~/qsub-jobs/Memo/input $ split sl-regex-filteredForPrototype-all.json sl-regex-filteredForPrototype-all-piece- --lines=3000 --additional-suffix=.json --numeric-suffixes --suffix-length=4
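You can sanity-check the split flags on a synthetic input first (a sketch with toy sizes and made-up filenames):

```shell
# Build a 10-task input file, then split it into 3-line pieces
seq 1 10 > tasks.json
split tasks.json tasks-piece- --lines=3 --additional-suffix=.json --numeric-suffixes --suffix-length=4

ls tasks-piece-*.json           # tasks-piece-0000.json .. tasks-piece-0003.json
cat tasks-piece-*.json | wc -l  # all 10 tasks survive the split
```

The numeric, fixed-width suffixes keep the pieces in order when you glob them later.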
Write job script
I use this as a template and tweak it from there. You might also try the GNU Parallel tool. There's a copy in /home/davisjam/bin/parallel.
(12:05:29) davisjam@sushi-headnode ~/qsub-jobs/Memo $ cat qsub-memo.sh
 #!/usr/bin/env bash
 # You must provide REGEX_FILE
 # e.g. "qsub -v REGEX_FILE='/home/davisjam/qsub-jobs/RegexRepl/syntax/input/test/500.json' qsub-syntax.sh"
 
 #########################################
 ## PBS Configuration (Single Comment # ONLY)
 #########################################
 #
 #PBS -l nodes=1:ppn=8
 #
 # Save all env vars -- including PERL5LIB
 #PBS -V
 #########################################
 
 #########################################
 ## Setup
 #########################################
 
 #OUT_FILE=~/data/syntax/cross-registry-real/`basename $REGEX_FILE .json`-slras-job$PBS_JOBID.json
 OUT_FILE=~/data/memo/all-SL/`basename $REGEX_FILE .json`-measureMemo-job$PBS_JOBID.pkl.bz2
 STDOUT_FILE=$HOME/logs/qsub-memo-$$.out
 STDERR_FILE=$HOME/logs/qsub-memo-$$.err
 NCORES=`wc -l < $PBS_NODEFILE`
 
 # Flush NFS?
 rm $STDOUT_FILE 2>/dev/null
 rm $STDERR_FILE 2>/dev/null
 sync; sync; sync; sync; sync;
 touch $STDOUT_FILE
 touch $STDERR_FILE
 
 # Here we go!
 echo "Hello on node " `hostname` " with $NCORES cores"
 echo "REGEX_FILE $REGEX_FILE"
 echo "OUT_FILE $OUT_FILE"
 echo "STDOUT_FILE $STDOUT_FILE STDERR_FILE $STDERR_FILE"
 
 export MEMOIZATION_PROJECT_ROOT=~/memoized-regex-engine
 export ECOSYSTEM_REGEXP_PROJECT_ROOT=~/EcosystemRegexps
 
 set -x
 
 # For data about prototype
 #PYTHONUNBUFFERED=1 $MEMOIZATION_PROJECT_ROOT/eval/measure-memoization-behavior.py \
 #  --regex-file $REGEX_FILE \
 #  --queryPrototype \
 #  --trials 1 \
 #  --queryProductionEngines \
 #  --parallelism $NCORES \
 #  --out-file $OUT_FILE \
 #  > $STDOUT_FILE \
 #  2>$STDERR_FILE
 
 # For data about other regex engines -- use if you want to test with extended features not supported by prototype
 PYTHONUNBUFFERED=1 $MEMOIZATION_PROJECT_ROOT/eval/measure-memoization-behavior.py \
   --regex-file $REGEX_FILE \
   --useCSharpToFindMostEI \
   --queryProductionEngines \
   --parallelism $NCORES \
   --out-file $OUT_FILE \
   > $STDOUT_FILE \
   2>$STDERR_FILE
Launch job
(10:56:30) davisjam@sushi-headnode ~/qsub-jobs/Memo $ for f in input/500-piece-*; do echo $f; qsub -v REGEX_FILE=`pwd`/$f qsub-memo.sh; done
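Before submitting for real, it can help to dry-run the loop and eyeball the qsub commands it would issue (a sketch; the echo stands in for qsub, and the touched files are stand-ins for real pieces):

```shell
# Dry run: print the qsub command for each piece instead of executing it
mkdir -p input
touch input/500-piece-0000.json input/500-piece-0001.json  # stand-in pieces
for f in input/500-piece-*; do
  echo qsub -v REGEX_FILE=`pwd`/$f qsub-memo.sh
done > dry-run.txt
cat dry-run.txt
```

When the printed commands look right, drop the redirection and the leading "echo" and run the loop again.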
Monitor job
(11:13:54) davisjam@sushi-headnode ~/qsub-jobs/Memo $ ls -lhtra ~/logs
(and tail log files, etc.)
Export data
If you want to export the data (e.g. for analysis in a Jupyter notebook), try something like this:
(11:15:32) davisjam@sushi-headnode ~/qsub-jobs/Memo $ mkdir ~/export-latest; cp ~/data/memo/all-SL/*.pkl.bz2 ~/export-latest; tar -czvf ~/export-latest.tgz ~/export-latest; scp ...
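A variant of the tar step that keeps the paths inside the archive relative (and avoids tar's "Removing leading /" warning) is to archive from $HOME with -C. A sketch with made-up filenames standing in for real results:

```shell
# Stage results, then archive relative to $HOME so extracted paths stay short
mkdir -p ~/export-latest
touch ~/export-latest/run1.pkl.bz2 ~/export-latest/run2.pkl.bz2  # stand-ins
tar -C ~ -czf ~/export-latest.tgz export-latest
tar -tzf ~/export-latest.tgz  # lists export-latest/run1.pkl.bz2, etc.
```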
Handy scripts
Check on your jobs
How much longer will you be waiting?
summarize-job-state.pl
 #!/usr/bin/env perl
 # Author: Jamie Davis <davisjam@vt.edu>
 # Description: Summarize the status of the jobs of a user
 
 use strict;
 use warnings;
 
 if (scalar(@ARGV) ne 1) {
   die "Summarize state of jobs submitted by a user\nusage: $0 username\n";
 }
 
 my $user = $ARGV[0];
 if (length($user) < 1) {
   die "Error, username is empty\n";
 }
 
 my @lines = `qstat -u $user`;
 my @running = grep { m/\s+$user\s+.*\sR\s/ } @lines;
 my @queued  = grep { m/\s+$user\s+.*\sQ\s/ } @lines;
 my @error   = grep { m/\s+$user\s+.*\sE\s/ } @lines;
 
 my $nRunning = scalar(@running);
 my $nQueued  = scalar(@queued);
 my $nError   = scalar(@error);
 my $nJobs    = $nRunning + $nQueued + $nError;
 
 print " Running jobs: $nRunning\n";
 print " Queued jobs: $nQueued\n";
 print " Error jobs: $nError\n";
 print " + ------------------------\n";
 print " Active jobs: $nJobs\n";
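If you just want the counts at the prompt, the same tallies can be done with grep (a sketch; the sample file stands in for `qstat -u $USER` output, whose exact column layout may differ on your PBS version):

```shell
# Count jobs per state from qstat-style output
cat > qstat-sample.txt <<'EOF'
12345.sushi  davisjam  batch  qsub-memo  4021  1  8  --  24:00:00 R 01:12:33
12346.sushi  davisjam  batch  qsub-memo  4022  1  8  --  24:00:00 Q --
EOF
grep -c ' davisjam .* R ' qstat-sample.txt   # running jobs
grep -c ' davisjam .* Q ' qstat-sample.txt   # queued jobs
```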
Abort a run
Sometimes you see an error show up in your log files and need to abort the run.
kill-my-jobs.pl
 #!/usr/bin/env perl
 # Author: Jamie Davis <davisjam@vt.edu>
 # Description: Kill (qdel) all jobs owned by the given user
 
 use strict;
 use warnings;
 
 if (scalar(@ARGV) ne 1) {
   die "qdel all jobs submitted by a user\nusage: $0 username\n";
 }
 
 my $user = $ARGV[0];
 if (length($user) < 1) {
   die "Error, username is empty\n";
 }
 
 my @jobIDs = &getJobIDs($user);
 if (@jobIDs) {
   &log("qdel'ing the " . scalar(@jobIDs) . " jobs owned by $user");
   my $cmd = "qdel " . join(" ", @jobIDs);
   system($cmd);
 } else {
   print "qstat reported no jobs to kill\n";
 }
 
 ###########
 
 sub getJobIDs {
   my ($user) = @_;
   &log("Using qstat to get the jobs owned by $user");
   my @qstat_output = `qstat -u $user`;
   chomp @qstat_output;
 
   my @jobLines = grep { m/\s+$user\s+/ } @qstat_output;
   my @jobIDs = map { my ($id) = ( $_ =~ m/^(\d+)\./ ); $id; } @jobLines;
   return @jobIDs;
 }
 
 sub log {
   my ($msg) = @_;
   print STDERR "$msg\n";
 }
Clean up /tmp
Sometimes my analysis tools leak files into /tmp on the sushi nodes.
clean-my-tmp.pl
 #!/usr/bin/env perl
 # Author: Jamie Davis <davisjam@vt.edu>
 # Description: Print commands to clean up my files in /tmp across sushi
 
 use strict;
 use warnings;
 
 my @nodes = qw/ sushi01 sushi02 sushi03 sushi04 sushi05 sushi06 sushi07 sushi08 sushi09 sushi10 /;
 
 ## Parse args
 
 if (scalar(@ARGV) < 1 or scalar(@ARGV) > 2) {
   die "Print commands to delete files in /tmp on each sushi node [matching the specified find predicates]
 Usage: $0 owning-user ['find predicates']
 Examples:
   $0 davisjam                               - Deletes all files owned by davisjam in /tmp on all sushi nodes
   $0 davisjam '-name \"protoRegexEngine*\"' - Delete all files ... whose name matches this predicate
 You should wrap predicates in single-quotes, and use double-quotes for any quoting within the predicates
 (The ssh command is wrapped in single-quotes)
 ";
 }
 
 my $user = $ARGV[0];
 if (length($user) < 1) {
   die "Error, username is empty\n";
 }
 
 my $findPredicates = "";
 if (scalar(@ARGV) >= 2) {
   $findPredicates = $ARGV[1];
 }
 
 ## Cleanup operations
 
 for my $node (@nodes) {
   # The -type f test and any user predicates must precede -delete, so they filter what gets deleted
   my $cmd = "find /tmp -user $user -type f $findPredicates -delete";
   print("ssh $node '$cmd' &\n");
 }
 
 &log("\n^^ If the preceding commands are to your liking, copy/paste/execute to run them.");
 
 ############
 
 sub log {
   my ($msg) = @_;
   print STDERR "$msg\n";
 }