Sushi101
Latest revision as of 10:33, 17 April 2020
Summary
Sushi is a small cluster shared by several Stack@CS faculty members.
Sushi Nodes
There are 10 sushi nodes. Each node has:
- 48 cores
- 256GB RAM
- Local scratch space -- a few hundred GB in /tmp
- Shared /home file system -- 180TB over NFS
- 10Gig Ethernet connection to the other nodes
- Access to the external Internet
Sushi Access
Only certain labs have access to Sushi. If you are not sure, ask the lab PI.
If you have an account on sushi, you can access sushi via the head node: sushi.cs.vt.edu (128.173.236.117 on the intranet)
- scp external files to your home directory on the head node
- Launch jobs from the head node
Sushi Jobs
Launch jobs from the head node using the PBS job submission system. There are many guides to PBS on the web. I recommend Purdue's guide.
DO NOT run expensive tasks on the head node itself. This affects the cluster's stability and inconveniences everyone.
Learning to use sushi
Before you use sushi for the first time, you should:
- Read this wiki page
- Learn about the PBS system
- Review the man pages for qsub, qstat, and qnodes
- Try a simple practice job, e.g. an "echo" that prints the node name
This may take you a day or two. It is well worth the investment.
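For the practice job, a minimal script like the following is enough (a sketch; the resource request is an example -- adjust to taste). Note that the #PBS lines are directives to qsub but ordinary comments to bash, so you can smoke-test the script locally before submitting it:

```shell
# practice.sh -- a minimal PBS practice job that prints the node name
cat > practice.sh <<'EOF'
#!/usr/bin/env bash
#PBS -l nodes=1:ppn=1
echo "Hello from node `hostname`"
EOF
chmod +x practice.sh

# Smoke-test locally (the #PBS lines are just comments here):
bash practice.sh

# On the head node, submit with:
#   qsub practice.sh
```

Once the job completes, PBS writes stdout/stderr to files named after the job (e.g. practice.sh.o<jobid>) in the submission directory.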
Example
Here's what I do for "embarrassingly parallel" jobs driven by an input file with one task per line.
Split input into files, one task per line
(10:51:16) davisjam@sushi-headnode ~/qsub-jobs/Memo/input $ split sl-regex-filteredForPrototype-all.json sl-regex-filteredForPrototype-all-piece- --lines=3000 --additional-suffix=.json --numeric-suffixes --suffix-length=4
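You can sanity-check the split flags on a synthetic input first (a sketch with toy sizes and made-up filenames):

```shell
# Build a 10-task input file, then split it into 3-line pieces
seq 1 10 > tasks.json
split tasks.json tasks-piece- --lines=3 --additional-suffix=.json --numeric-suffixes --suffix-length=4

ls tasks-piece-*.json           # tasks-piece-0000.json .. tasks-piece-0003.json
cat tasks-piece-*.json | wc -l  # all 10 tasks survive the split
```

The numeric, fixed-width suffixes keep the pieces in order when you glob them later.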
Write job script
I use this as a template and tweak it from there. You might also try the GNU Parallel tool. There's a copy in /home/davisjam/bin/parallel.
(12:05:29) davisjam@sushi-headnode ~/qsub-jobs/Memo $ cat qsub-memo.sh
 #!/usr/bin/env bash
 # You must provide REGEX_FILE
 # e.g. "qsub -v REGEX_FILE='/home/davisjam/qsub-jobs/RegexRepl/syntax/input/test/500.json' qsub-syntax.sh"
 
 #########################################
 ## PBS Configuration (Single Comment # ONLY)
 #########################################
 #
 #PBS -l nodes=1:ppn=8
 #
 # Save all env vars -- including PERL5LIB
 #PBS -V
 #########################################
 
 #########################################
 ## Setup
 #########################################
 
 #OUT_FILE=~/data/syntax/cross-registry-real/`basename $REGEX_FILE .json`-slras-job$PBS_JOBID.json
 OUT_FILE=~/data/memo/all-SL/`basename $REGEX_FILE .json`-measureMemo-job$PBS_JOBID.pkl.bz2
 STDOUT_FILE=$HOME/logs/qsub-memo-$$.out
 STDERR_FILE=$HOME/logs/qsub-memo-$$.err
 NCORES=`wc -l < $PBS_NODEFILE`
 
 # Flush NFS?
 rm $STDOUT_FILE 2>/dev/null
 rm $STDERR_FILE 2>/dev/null
 sync; sync; sync; sync; sync;
 touch $STDOUT_FILE
 touch $STDERR_FILE
 
 # Here we go!
 echo "Hello on node " `hostname` " with $NCORES cores"
 echo "REGEX_FILE $REGEX_FILE"
 echo "OUT_FILE $OUT_FILE"
 echo "STDOUT_FILE $STDOUT_FILE STDERR_FILE $STDERR_FILE"
 
 export MEMOIZATION_PROJECT_ROOT=~/memoized-regex-engine
 export ECOSYSTEM_REGEXP_PROJECT_ROOT=~/EcosystemRegexps
 
 set -x
 
 # For data about prototype
 #PYTHONUNBUFFERED=1 $MEMOIZATION_PROJECT_ROOT/eval/measure-memoization-behavior.py \
 #  --regex-file $REGEX_FILE \
 #  --queryPrototype \
 #  --trials 1 \
 #  --queryProductionEngines \
 #  --parallelism $NCORES \
 #  --out-file $OUT_FILE \
 #  > $STDOUT_FILE \
 #  2>$STDERR_FILE
 
 # For data about other regex engines -- use if you want to test with extended features not supported by prototype
 PYTHONUNBUFFERED=1 $MEMOIZATION_PROJECT_ROOT/eval/measure-memoization-behavior.py \
   --regex-file $REGEX_FILE \
   --useCSharpToFindMostEI \
   --queryProductionEngines \
   --parallelism $NCORES \
   --out-file $OUT_FILE \
   > $STDOUT_FILE \
   2>$STDERR_FILE
Launch job
(10:56:30) davisjam@sushi-headnode ~/qsub-jobs/Memo $ for f in input/500-piece-*; do echo $f; qsub -v REGEX_FILE=`pwd`/$f qsub-memo.sh; done
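Before submitting for real, it can help to dry-run the loop and eyeball the qsub commands it would issue (a sketch; the echo stands in for qsub, and the touched files are stand-ins for real pieces):

```shell
# Dry run: print the qsub command for each piece instead of executing it
mkdir -p input
touch input/500-piece-0000.json input/500-piece-0001.json  # stand-in pieces
for f in input/500-piece-*; do
  echo qsub -v REGEX_FILE=`pwd`/$f qsub-memo.sh
done > dry-run.txt
cat dry-run.txt
```

When the printed commands look right, drop the redirection and the leading "echo" and run the loop again.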
Monitor job
(11:13:54) davisjam@sushi-headnode ~/qsub-jobs/Memo $ ls -lhtra ~/logs
(and tail log files, etc.)
Export data
If you want to export the data (e.g. for analysis in a Jupyter notebook), try something like this:
(11:15:32) davisjam@sushi-headnode ~/qsub-jobs/Memo $ mkdir ~/export-latest; cp ~/data/memo/all-SL/*.pkl.bz2 ~/export-latest; tar -czvf ~/export-latest.tgz ~/export-latest; scp ...
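A variant of the tar step that keeps the paths inside the archive relative (and avoids tar's "Removing leading /" warning) is to archive from $HOME with -C. A sketch with made-up filenames standing in for real results:

```shell
# Stage results, then archive relative to $HOME so extracted paths stay short
mkdir -p ~/export-latest
touch ~/export-latest/run1.pkl.bz2 ~/export-latest/run2.pkl.bz2  # stand-ins
tar -C ~ -czf ~/export-latest.tgz export-latest
tar -tzf ~/export-latest.tgz  # lists export-latest/run1.pkl.bz2, etc.
```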
Handy scripts
Check on your jobs
How much longer will you be waiting?
summarize-job-state.pl
 #!/usr/bin/env perl
 # Author: Jamie Davis <davisjam@vt.edu>
 # Description: Summarize the status of the jobs of a user
 
 use strict;
 use warnings;
 
 if (scalar(@ARGV) ne 1) {
   die "Summarize state of jobs submitted by a user\nusage: $0 username\n";
 }
 
 my $user = $ARGV[0];
 if (length($user) < 1) {
   die "Error, username is empty\n";
 }
 
 my @lines = `qstat -u $user`;
 my @running = grep { m/\s+$user\s+.*\sR\s/ } @lines;
 my @queued  = grep { m/\s+$user\s+.*\sQ\s/ } @lines;
 my @error   = grep { m/\s+$user\s+.*\sE\s/ } @lines;
 
 my $nRunning = scalar(@running);
 my $nQueued  = scalar(@queued);
 my $nError   = scalar(@error);
 my $nJobs    = $nRunning + $nQueued + $nError;
 
 print " Running jobs: $nRunning\n";
 print " Queued jobs: $nQueued\n";
 print " Error jobs: $nError\n";
 print " + ------------------------\n";
 print " Active jobs: $nJobs\n";
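If you just want the counts at the prompt, the same tallies can be done with grep (a sketch; the sample file stands in for `qstat -u $USER` output, whose exact column layout may differ on your PBS version):

```shell
# Count jobs per state from qstat-style output
cat > qstat-sample.txt <<'EOF'
12345.sushi  davisjam  batch  qsub-memo  4021  1  8  --  24:00:00 R 01:12:33
12346.sushi  davisjam  batch  qsub-memo  4022  1  8  --  24:00:00 Q --
EOF
grep -c ' davisjam .* R ' qstat-sample.txt   # running jobs
grep -c ' davisjam .* Q ' qstat-sample.txt   # queued jobs
```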
Abort a run
Sometimes you see an error show up in your log files and need to abort the run.
kill-my-jobs.pl
 #!/usr/bin/env perl
 # Author: Jamie Davis <davisjam@vt.edu>
 # Description: Kill (qdel) all jobs owned by the given user
 
 use strict;
 use warnings;
 
 if (scalar(@ARGV) ne 1) {
   die "qdel all jobs submitted by a user\nusage: $0 username\n";
 }
 
 my $user = $ARGV[0];
 if (length($user) < 1) {
   die "Error, username is empty\n";
 }
 
 my @jobIDs = &getJobIDs($user);
 if (@jobIDs) {
   &log("qdel'ing the " . scalar(@jobIDs) . " jobs owned by $user");
   my $cmd = "qdel " . join(" ", @jobIDs);
   system($cmd);
 } else {
   print "qstat reported no jobs to kill\n";
 }
 
 ###########
 
 sub getJobIDs {
   my ($user) = @_;
   &log("Using qstat to get the jobs owned by $user");
   my @qstat_output = `qstat -u $user`;
   chomp @qstat_output;
 
   my @jobLines = grep { m/\s+$user\s+/ } @qstat_output;
   my @jobIDs = map { my ($id) = ( $_ =~ m/^(\d+)\./ ); $id; } @jobLines;
   return @jobIDs;
 }
 
 sub log {
   my ($msg) = @_;
   print STDERR "$msg\n";
 }
Clean up /tmp
Sometimes my analysis tools leak files into /tmp on the sushi nodes.
clean-my-tmp.pl
 #!/usr/bin/env perl
 # Author: Jamie Davis <davisjam@vt.edu>
 # Description: Print commands to clean up my files in /tmp across sushi
 
 use strict;
 use warnings;
 
 my @nodes = qw/ sushi01 sushi02 sushi03 sushi04 sushi05 sushi06 sushi07 sushi08 sushi09 sushi10 /;
 
 ## Parse args
 
 if (scalar(@ARGV) < 1 or scalar(@ARGV) > 2) {
   die "Print commands to delete files in /tmp on each sushi node [matching the specified find predicates]
 Usage: $0 owning-user ['find predicates']
 Examples:
   $0 davisjam                               - Deletes all files owned by davisjam in /tmp on all sushi nodes
   $0 davisjam '-name \"protoRegexEngine*\"' - Delete all files ... whose name matches this predicate
 You should wrap predicates in single-quotes, and use double-quotes for any quoting within the predicates
 (The ssh command is wrapped in single-quotes)
 ";
 }
 
 my $user = $ARGV[0];
 if (length($user) < 1) {
   die "Error, username is empty\n";
 }
 
 my $findPredicates = "";
 if (scalar(@ARGV) >= 2) {
   $findPredicates = $ARGV[1];
 }
 
 ## Cleanup operations
 
 for my $node (@nodes) {
   # The -type f test and any user predicates must precede -delete, so they filter what gets deleted
   my $cmd = "find /tmp -user $user -type f $findPredicates -delete";
   print("ssh $node '$cmd' &\n");
 }
 
 &log("\n^^ If the preceding commands are to your liking, copy/paste/execute to run them.");
 
 ############
 
 sub log {
   my ($msg) = @_;
   print STDERR "$msg\n";
 }