slurmjobs 🐌¶
Automating Slurm job generation.
Generate a set of .sbatch files over a grid of parameters. A run script is created which will submit all generated jobs at once.
Now with slurmjobs.Singularity() support!
You can also use Shell, which excludes the slurm/module references, so you can test & run on your local machine or a test server.
Installation¶
pip install slurmjobs
Usage¶
import slurmjobs
jobs = slurmjobs.Singularity(
'python train.py', email='me@nyu.edu', n_gpus=2)
# generate jobs across parameter grid
run_script, job_paths = jobs.generate([
# two different model types
('model', ['AVE', 'AVOL']),
# try mono and stereo audio
('audio_channels', [1, 2]),
], epochs=500)
# NOTE:
# extra keywords (e.g. ``epochs``) are also passed as arguments
# across all scripts and are not included in the job name.
# ******************************
# ** everything was generated ** - now let's see the outputted paths.
slurmjobs.util.summary(run_script, job_paths)
Generated 4 job scripts:
jobs/train/train,model-AVE,audio_channels-1.sbatch
jobs/train/train,model-AVE,audio_channels-2.sbatch
jobs/train/train,model-AVOL,audio_channels-1.sbatch
jobs/train/train,model-AVOL,audio_channels-2.sbatch
To submit all jobs, run:
. jobs/train/run.sh
Tutorial¶
Let’s go through the process of running your code in a singularity container.
Setting up Singularity¶
First we need to create the overlay that will contain our anaconda distribution along with any packages we want to install to the environment.
slurmjobs provides a script that will do that for you! Just type:
singuconda
It will ask you for some values, all of which you can mash enter through if you don't care about them (their defaults will be used). The defaults give you a singularity overlay with an empty conda environment. The prompts give you the opportunity to change the overlay and sif file paths, as well as to specify which packages you'd like installed, in the form of:
- a pip requirements file, i.e. pip install -r ./your/requirements.txt (you provide: ./your/requirements.txt)
- a local python package: a repo directory containing a setup.py file where your project source lives, i.e. pip install -e ./your/package (you provide: ./your/package)
- any conda packages that you want to install, i.e. conda install numpy librosa (you provide: numpy librosa)
Once we have that information, the script will:
- check if the overlay file already exists; if not, copy a fresh overlay file to your current directory
- enter the singularity container and install miniconda
- install all of the pip/conda packages you requested into the conda environment
Then you're all set! :) You have a singularity overlay with anaconda and any packages you asked it to install.
Note
If people would prefer, I can also refactor singuconda into a Python class to allow you to configure it with the rest of your slurmjobs code.
Generating Jobs¶
Now let’s generate some job scripts!
The simplest configuration you can give is:
import slurmjobs
def generate():
jobs = slurmjobs.Singularity('python project/train.py')
# generate the jobs, providing a grid of
# parameters to generate over.
run_script, job_paths = jobs.generate([
# two different model types
('model', ['AVE', 'AVOL']),
# try mono and stereo audio
('audio_channels', [1, 2]),
], epochs=500)
# print out a summary of the scripts generated
slurmjobs.util.summary(run_script, job_paths)
if __name__ == '__main__':
import fire
fire.Fire()
Then you can just generate the jobs by doing:
python jobs.py generate
And your files will be found in ./jobs/project.train (the name is automatically derived from the command).
But you can do a lot more:
import os
import slurmjobs
def generate():
jobs = slurmjobs.Singularity(
'python train.py',
# give the job batch a name
name='my-train-script',
# say your script uses hydra, so tell slurmjobs how to format your arguments
cli='hydra',
# set the working directory for your script
root_dir='/scratch/myuser/myproject',
# disable job backups. By default it'll save them as `~{name}_01` etc.
backup=False,
# set the email to whoever generates the jobs (nyu uses your netid
# for both email and greene, so you can use $USER to make it easier
# with multiple people)
email=f'{os.getenv("USER")}@nyu.edu',
# set the number of cpus and gpus to request per job
n_cpus=2, n_gpus=2,
# set the requested time (e.g. 2 days)
time='2-0',
# disable passing job_id to your script (if your script doesn't accept one)
job_id=None, # or change the key: job_id='my_job_id',
# pass any arbitrary sbatch flags
# see: https://slurm.schedmd.com/sbatch.html
sbatch={
...
},
# you can also pass anything else here and it'll be changeable
# in __init__ and available in the templates (e.g. if you extend the templates)
)
run_script, job_paths = jobs.generate([
('model', ['AVE', 'AVOL']),
('audio_channels', [1, 2]),
], epochs=500)
slurmjobs.util.summary(run_script, job_paths)
if __name__ == '__main__':
import fire
fire.Fire()
To add initialization and cleanup code around your command, see Customizing your Template which will tell you how to add custom code.
Parameter Grids¶
If you have a simple parameter grid, then you don’t really have to think about this, and you can keep passing your grid like the previous example.
But you may also have a slightly more complex grid you want to try. If that is the case, then you can do:
from slurmjobs import Grid, LiteralGrid
g = Grid([
('a', [1, 2]),
('b', [1, 2]),
], name='train')
# append two configurations
g = g + LiteralGrid([{'a': 5, 'b': 5}, {'a': 10, 'b': 10}])
# create a bigger grid from the product of another grid
g = g * Grid([
('c', [5, 6])
], name='dataset')
# omit a configuration from the grid
g = g - [{'a': 2, 'b': 1, 'c': 5}]
# then
assert list(g) == [
{'a': 1, 'b': 1, 'c': 5},
{'a': 1, 'b': 1, 'c': 6},
{'a': 1, 'b': 2, 'c': 5},
{'a': 1, 'b': 2, 'c': 6},
# {'a': 2, 'b': 1, 'c': 5}, omitted
{'a': 2, 'b': 1, 'c': 6},
{'a': 2, 'b': 2, 'c': 5},
{'a': 2, 'b': 2, 'c': 6},
{'a': 5, 'b': 5, 'c': 5},
{'a': 5, 'b': 5, 'c': 6},
{'a': 10, 'b': 10, 'c': 5},
{'a': 10, 'b': 10, 'c': 6},
]
# Then you can pass the grid like normal
jobs = slurmjobs.Singularity('python train.py')
run_script, job_paths = jobs.generate(g, epochs=500)
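If you want to sanity-check what a grid arithmetic expression will expand to without involving slurmjobs at all, the same semantics can be sketched with itertools. This is plain illustrative Python, not library code; `expand` and `combine` are hypothetical helpers mirroring Grid expansion and Grid multiplication:

```python
import itertools

def expand(grid):
    """Expand [(key, values), ...] into a list of config dicts (like Grid)."""
    keys = [k for k, _ in grid]
    return [dict(zip(keys, vals))
            for vals in itertools.product(*(v for _, v in grid))]

def combine(a, b):
    """Product of two config lists (like Grid * Grid)."""
    return [dict(da, **db) for da, db in itertools.product(a, b)]

g = expand([('a', [1, 2]), ('b', [1, 2])])            # 4 configs
g = g + [{'a': 5, 'b': 5}]                            # like + LiteralGrid(...)
g = combine(g, expand([('c', [5, 6])]))               # like * Grid([('c', [5, 6])])
g = [d for d in g if d != {'a': 2, 'b': 1, 'c': 5}]   # like - [{...}]
```

This mirrors the arithmetic above: (4 + 1) configs times 2 values of c, minus one omission, leaves 9 configs.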
Breaking Arguments across multiple functions¶
Sometimes you may have a situation where your arguments need to be broken out across multiple functions. You can do this by naming your grids.
class Singularity(slurmjobs.Singularity):
# remember to extend a base template
template = '''{% extends 'job.singularity.j2' %}
{% block command %}
{{ command }} {{ cli(args.main, indent=4) }}
python my_other_script.py {{ cli(args.other, indent=4) }}
{% endblock %}
'''
g = slurmjobs.Grid([
('a', [1, 2]),
('b', [1, 2]),
], name='main')
g = g * slurmjobs.Grid([
    ('c', [1, 2]),
    ('d', [1, 2]),
], name='other')
Customizing Job IDs¶
The purpose of a job ID is to give each job a readable name that describes its configuration, so you can use it in filenames and log files, while also being unique between job instances so that two jobs aren't trying to write to the same place.
Canonically, job IDs are created using the pattern {key}-{value},{key2}-{value2},... This is meant as a naive but somewhat effective way of encoding the parameters.
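As an illustration of that pattern (plain Python, not the library's actual implementation):

```python
def naive_job_id(name, args):
    """Encode a config dict as '{name},{key}-{value},...'."""
    return ','.join([name] + [f'{k}-{v}' for k, v in args.items()])

naive_job_id('train', {'model': 'AVE', 'audio_channels': 1})
# 'train,model-AVE,audio_channels-1'
```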
But things aren’t always that simple. The main cases where you’d need to change the job ID formatter are:
your grid contains a long value like a list or long string
your grid contains long float values that you want to shorten
your grid contains objects that don’t have a nice string representation (this would most likely lead to an issue with CLI formatting too, but I digress)
your grid contains too many keys and you’d like to abbreviate them to avoid hitting filename length limits.
Out of the box we include some levers that you can pull to tweak your job ID.
class Singularity(slurmjobs.Singularity):
# a dict of abbreviations (full_key -> abbreviated_key)
key_abbreviations = {}
# a length to clip the keys to (e.g. 2 would turn model into mo)
abbreviate_length = None
# limit the precision on float values (e.g. 3 means 3 decimal places)
float_precision = None
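To make those levers concrete, here is a hypothetical standalone sketch of how abbreviation and float precision might shorten an ID. The helper names are illustrative, not part of slurmjobs:

```python
def shorten_key(k, abbreviations=None, abbreviate_length=None):
    """Apply key_abbreviations first, then clip to abbreviate_length."""
    k = (abbreviations or {}).get(k, k)
    return k[:abbreviate_length] if abbreviate_length else k

def shorten_value(v, float_precision=None):
    """Limit float precision; leave everything else as str()."""
    if isinstance(v, float) and float_precision is not None:
        return f'{v:.{float_precision}f}'
    return str(v)

shorten_key('audio_channels', abbreviate_length=2)   # 'au'
shorten_value(0.123456, float_precision=3)           # '0.123'
```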
But of course, I’m sure you may have other ways you want to do things, so you have full liberty to change the job ID generation. Just make sure that your job IDs are still unique between jobs!!
class Singularity(slurmjobs.Singularity):
# formatting for a single key-value pair.
# you can return a tuple, string, or None (to exclude)
def format_id_item(self, k, v):
# I'm wacky and like my keys backwards
k = k[::-1]
# e.g. special formatting for booleans
if v is True:
return k
if v is False:
return f'not-{k}'
return k, v
# formatting the entire job ID. Do what you like!
    def format_job_id(self, args, keys=None, name=None):
        items = (self.format_id_item(k, args[k]) for k in keys or args)
        return ','.join(
            [name] * bool(name) + [
                it if isinstance(it, str) else '-'.join(map(str, it))
                for it in items if it is not None
            ])
Customizing your Template¶
Another thing that you’ll probably want to end up doing at some point is to add some customization, initialization, or cleanup to your scripts.
Often I will personally add most of that internally in my scripts, but you may also prefer to add it with bash.
Here’s the block structure in each of the templates:
- base (job.base.j2)
  - header: a description of the job and its arguments.
  - body: the main body of the script; it wraps around everything. You can use this for top-level initialization / cleanup.
  - environment: loads anaconda and your conda environment.
  - main: an empty wrapper around the command block where you can add script initialization / cleanup.
  - command: the heart of the script. This is where your script gets passed its arguments.
- shell (job.shell.j2 extends job.base.j2)
  - body > main: calls the main block using nohup so that your shell jobs can survive dropped ssh connections.
- sbatch (job.sbatch.j2 extends job.base.j2)
  - header > arguments: adds sbatch arguments at the beginning of the file.
  - body > modules: this is where we do module purge, module load cuda, etc.
- singularity (job.singularity.j2 extends job.sbatch.j2)
  - body: wraps the body in a singularity call. All of the body is run inside the singularity container.
Here’s how you can add initialization / cleanup code around your command. Use {{ super() }}
to add back in the parent template’s code.
class Singularity(slurmjobs.Singularity):
# remember to extend a base template
template = '''{% extends 'job.singularity.j2' %}
{% block main %}
# initialization
export SOMETHING=asdfasdfasdf
mkdir -p my-directory
{# call the rest of the main block #}
{{ super() }}
# some cleanup
rm -r my-directory
{% endblock %}
'''
If you want to add code outside the singularity container, you just need to do:
class Singularity(slurmjobs.Singularity):
# remember to extend a base template
template = '''{% extends 'job.singularity.j2' %}
{% block body %}
export SOMETHING=asdfasdfasdf
mkdir -p my-directory
{{ super() }}
rm -r my-directory
{% endblock %}
'''
And if you want to delete a parent’s section, just do:
class Slurm(slurmjobs.Slurm):
# remember to extend a base template
template = '''{% extends 'job.sbatch.j2' %}
{% block modules %}{% endblock %}
'''
Customizing Argument Formatting¶
You can define your own formatter by subclassing slurmjobs.args.Argument. If your class name ends with 'Argument', you can omit that suffix when passing the cli name.
This works by gathering subclasses and matching the passed string against their names. If a class name ends with 'Argument', the suffix is removed. If you don’t want a class to be available, prefix its name with an underscore.
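Roughly, that subclass lookup can be sketched like this (an approximation of the mechanism, not the actual source):

```python
class Argument:
    @classmethod
    def get(cls, name):
        """Find a subclass whose name (minus an 'Argument' suffix) matches."""
        for sub in cls.__subclasses__():
            key = sub.__name__
            if key.endswith('Argument'):
                key = key[:-len('Argument')]
            # names starting with an underscore are not available
            if not key.startswith('_') and key.lower() == name.lower():
                return sub
        raise KeyError(name)

class FireArgument(Argument):
    pass

class _HiddenArgument(Argument):
    pass

Argument.get('fire')  # returns FireArgument
```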
Example:
import slurmjobs
class MyCustomArgument(slurmjobs.args.Argument):
@classmethod
def format_arg(cls, k, v=None):
if v is None:
return
# idk do something fancy
return '..{}@{}'.format(k, cls.format_value(v)) # ..size@10
batch = slurmjobs.SBatch('echo', cli='mycustom')
print(batch.command, batch.cli(size=10, blah='blorp'))
# echo ..size@10 ..blah@blorp
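The same formatting logic can be checked in isolation, outside of slurmjobs (a standalone rewrite of the formatter above, with free functions standing in for the classmethods):

```python
def format_value(v):
    # format python values as strings; str() is enough for this sketch
    return str(v)

def format_arg(k, v=None):
    if v is None:
        return None
    return '..{}@{}'.format(k, format_value(v))  # ..size@10

args = [format_arg('size', 10), format_arg('blah', 'blorp')]
' '.join(a for a in args if a)
# '..size@10 ..blah@blorp'
```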
Sample Files¶
This is a quick sample of the generated files.
Singularity¶
We have:
jobs/{name}/{job_id}.sbatch
: The sbatch file for each job.
jobs/{name}/run.sh
: A script that submits all jobs at once.
import slurmjobs
jobs = slurmjobs.Singularity(
'python train.py', name='sing', backup=False,
email='me@nyu.edu', n_gpus=2)
run_script, job_paths = jobs.generate([
('model', ['AVE', 'AVOL']),
('audio_channels', [1, 2]),
], epochs=500)
slurmjobs.util.summary(run_script, job_paths)
Generated 4 job scripts:
jobs/sing/sing,model-AVE,audio_channels-1.sbatch
jobs/sing/sing,model-AVE,audio_channels-2.sbatch
jobs/sing/sing,model-AVOL,audio_channels-1.sbatch
jobs/sing/sing,model-AVOL,audio_channels-2.sbatch
To submit all jobs, run:
. jobs/sing/run.sh
jobs/sing/sing,model-AVE,audio_channels-1.sbatch
#!/bin/bash
#SBATCH --job-name=sing,model-AVE,audio_channels-1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=me@nyu.edu
#SBATCH --output=jobs/sing/slurm/slurm_%j__sing,model-AVE,audio_channels-1.log
#SBATCH --time=3-0
#SBATCH --mem=48GB
#SBATCH --gres=gpu:2
#########################
#
# Job: sing,model-AVE,audio_channels-1
# Args:
# {'audio_channels': 1,
# 'epochs': 500,
# 'job_id': 'sing,model-AVE,audio_channels-1',
# 'model': 'AVE'}
#
#########################
###### ( hop into the singularity o_O )
singularity exec \
--nv \
--overlay overlay-5GB-200K.ext3:ro \
/scratch/work/public/singularity/cuda11.0-cudnn8-devel-ubuntu18.04.sif \
/bin/bash << EOF
echo "@: entered singularity container"
source /ext3/env.sh
python train.py \
--model=AVE \
--audio_channels=1 \
--epochs=500 \
--job_id=sing,model-AVE,audio_channels-1
###### (escape from the singularity @o@ )
echo "@: exiting singularity container"
exit 0;
EOF
jobs/sing/sing,model-AVE,audio_channels-2.sbatch
#!/bin/bash
#SBATCH --job-name=sing,model-AVE,audio_channels-2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=me@nyu.edu
#SBATCH --output=jobs/sing/slurm/slurm_%j__sing,model-AVE,audio_channels-2.log
#SBATCH --time=3-0
#SBATCH --mem=48GB
#SBATCH --gres=gpu:2
#########################
#
# Job: sing,model-AVE,audio_channels-2
# Args:
# {'audio_channels': 2,
# 'epochs': 500,
# 'job_id': 'sing,model-AVE,audio_channels-2',
# 'model': 'AVE'}
#
#########################
###### ( hop into the singularity o_O )
singularity exec \
--nv \
--overlay overlay-5GB-200K.ext3:ro \
/scratch/work/public/singularity/cuda11.0-cudnn8-devel-ubuntu18.04.sif \
/bin/bash << EOF
echo "@: entered singularity container"
source /ext3/env.sh
python train.py \
--model=AVE \
--audio_channels=2 \
--epochs=500 \
--job_id=sing,model-AVE,audio_channels-2
###### (escape from the singularity @o@ )
echo "@: exiting singularity container"
exit 0;
EOF
jobs/sing/sing,model-AVOL,audio_channels-1.sbatch
#!/bin/bash
#SBATCH --job-name=sing,model-AVOL,audio_channels-1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=me@nyu.edu
#SBATCH --output=jobs/sing/slurm/slurm_%j__sing,model-AVOL,audio_channels-1.log
#SBATCH --time=3-0
#SBATCH --mem=48GB
#SBATCH --gres=gpu:2
#########################
#
# Job: sing,model-AVOL,audio_channels-1
# Args:
# {'audio_channels': 1,
# 'epochs': 500,
# 'job_id': 'sing,model-AVOL,audio_channels-1',
# 'model': 'AVOL'}
#
#########################
###### ( hop into the singularity o_O )
singularity exec \
--nv \
--overlay overlay-5GB-200K.ext3:ro \
/scratch/work/public/singularity/cuda11.0-cudnn8-devel-ubuntu18.04.sif \
/bin/bash << EOF
echo "@: entered singularity container"
source /ext3/env.sh
python train.py \
--model=AVOL \
--audio_channels=1 \
--epochs=500 \
--job_id=sing,model-AVOL,audio_channels-1
###### (escape from the singularity @o@ )
echo "@: exiting singularity container"
exit 0;
EOF
jobs/sing/sing,model-AVOL,audio_channels-2.sbatch
#!/bin/bash
#SBATCH --job-name=sing,model-AVOL,audio_channels-2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=me@nyu.edu
#SBATCH --output=jobs/sing/slurm/slurm_%j__sing,model-AVOL,audio_channels-2.log
#SBATCH --time=3-0
#SBATCH --mem=48GB
#SBATCH --gres=gpu:2
#########################
#
# Job: sing,model-AVOL,audio_channels-2
# Args:
# {'audio_channels': 2,
# 'epochs': 500,
# 'job_id': 'sing,model-AVOL,audio_channels-2',
# 'model': 'AVOL'}
#
#########################
###### ( hop into the singularity o_O )
singularity exec \
--nv \
--overlay overlay-5GB-200K.ext3:ro \
/scratch/work/public/singularity/cuda11.0-cudnn8-devel-ubuntu18.04.sif \
/bin/bash << EOF
echo "@: entered singularity container"
source /ext3/env.sh
python train.py \
--model=AVOL \
--audio_channels=2 \
--epochs=500 \
--job_id=sing,model-AVOL,audio_channels-2
###### (escape from the singularity @o@ )
echo "@: exiting singularity container"
exit 0;
EOF
jobs/sing/run.sh
#########################
#
# Job Batch: sing
# Params:
# [
# ('model', ['AVE', 'AVOL']),
# ('audio_channels', [1, 2]),
# ]
#
#########################
sbatch "jobs/sing/sing,model-AVE,audio_channels-1.sbatch"
sbatch "jobs/sing/sing,model-AVE,audio_channels-2.sbatch"
sbatch "jobs/sing/sing,model-AVOL,audio_channels-1.sbatch"
sbatch "jobs/sing/sing,model-AVOL,audio_channels-2.sbatch"
Slurm¶
We have:
jobs/{name}/{job_id}.sbatch
: The sbatch file for each job.
jobs/{name}/run.sh
: A script that submits all jobs at once.
import slurmjobs
jobs = slurmjobs.Slurm(
'python train.py', name='slurm', backup=False,
email='me@nyu.edu', n_gpus=2)
run_script, job_paths = jobs.generate([
('model', ['AVE', 'AVOL']),
('audio_channels', [1, 2]),
], epochs=500)
slurmjobs.util.summary(run_script, job_paths)
Generated 4 job scripts:
jobs/slurm/slurm,model-AVE,audio_channels-1.sbatch
jobs/slurm/slurm,model-AVE,audio_channels-2.sbatch
jobs/slurm/slurm,model-AVOL,audio_channels-1.sbatch
jobs/slurm/slurm,model-AVOL,audio_channels-2.sbatch
To submit all jobs, run:
. jobs/slurm/run.sh
jobs/slurm/slurm,model-AVE,audio_channels-1.sbatch
#!/bin/bash
#SBATCH --job-name=slurm,model-AVE,audio_channels-1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=me@nyu.edu
#SBATCH --output=jobs/slurm/slurm/slurm_%j__slurm,model-AVE,audio_channels-1.log
#SBATCH --time=3-0
#SBATCH --mem=48GB
#SBATCH --gres=gpu:2
#########################
#
# Job: slurm,model-AVE,audio_channels-1
# Args:
# {'audio_channels': 1,
# 'epochs': 500,
# 'job_id': 'slurm,model-AVE,audio_channels-1',
# 'model': 'AVE'}
#
#########################
##### Load Modules
module purge
python train.py \
--model=AVE \
--audio_channels=1 \
--epochs=500 \
--job_id=slurm,model-AVE,audio_channels-1
jobs/slurm/slurm,model-AVE,audio_channels-2.sbatch
#!/bin/bash
#SBATCH --job-name=slurm,model-AVE,audio_channels-2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=me@nyu.edu
#SBATCH --output=jobs/slurm/slurm/slurm_%j__slurm,model-AVE,audio_channels-2.log
#SBATCH --time=3-0
#SBATCH --mem=48GB
#SBATCH --gres=gpu:2
#########################
#
# Job: slurm,model-AVE,audio_channels-2
# Args:
# {'audio_channels': 2,
# 'epochs': 500,
# 'job_id': 'slurm,model-AVE,audio_channels-2',
# 'model': 'AVE'}
#
#########################
##### Load Modules
module purge
python train.py \
--model=AVE \
--audio_channels=2 \
--epochs=500 \
--job_id=slurm,model-AVE,audio_channels-2
jobs/slurm/slurm,model-AVOL,audio_channels-1.sbatch
#!/bin/bash
#SBATCH --job-name=slurm,model-AVOL,audio_channels-1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=me@nyu.edu
#SBATCH --output=jobs/slurm/slurm/slurm_%j__slurm,model-AVOL,audio_channels-1.log
#SBATCH --time=3-0
#SBATCH --mem=48GB
#SBATCH --gres=gpu:2
#########################
#
# Job: slurm,model-AVOL,audio_channels-1
# Args:
# {'audio_channels': 1,
# 'epochs': 500,
# 'job_id': 'slurm,model-AVOL,audio_channels-1',
# 'model': 'AVOL'}
#
#########################
##### Load Modules
module purge
python train.py \
--model=AVOL \
--audio_channels=1 \
--epochs=500 \
--job_id=slurm,model-AVOL,audio_channels-1
jobs/slurm/slurm,model-AVOL,audio_channels-2.sbatch
#!/bin/bash
#SBATCH --job-name=slurm,model-AVOL,audio_channels-2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=me@nyu.edu
#SBATCH --output=jobs/slurm/slurm/slurm_%j__slurm,model-AVOL,audio_channels-2.log
#SBATCH --time=3-0
#SBATCH --mem=48GB
#SBATCH --gres=gpu:2
#########################
#
# Job: slurm,model-AVOL,audio_channels-2
# Args:
# {'audio_channels': 2,
# 'epochs': 500,
# 'job_id': 'slurm,model-AVOL,audio_channels-2',
# 'model': 'AVOL'}
#
#########################
##### Load Modules
module purge
python train.py \
--model=AVOL \
--audio_channels=2 \
--epochs=500 \
--job_id=slurm,model-AVOL,audio_channels-2
jobs/slurm/run.sh
#########################
#
# Job Batch: slurm
# Params:
# [
# ('model', ['AVE', 'AVOL']),
# ('audio_channels', [1, 2]),
# ]
#
#########################
sbatch "jobs/slurm/slurm,model-AVE,audio_channels-1.sbatch"
sbatch "jobs/slurm/slurm,model-AVE,audio_channels-2.sbatch"
sbatch "jobs/slurm/slurm,model-AVOL,audio_channels-1.sbatch"
sbatch "jobs/slurm/slurm,model-AVOL,audio_channels-2.sbatch"
Shell¶
We have:
jobs/{name}/{job_id}.job.sh
: The shell file for each job.
jobs/{name}/run.sh
: A script that submits all jobs at once.
import slurmjobs
jobs = slurmjobs.Shell(
    'python train.py', name='shell', backup=False)
run_script, job_paths = jobs.generate([
    ('model', ['AVE', 'AVOL']),
    ('audio_channels', [1, 2]),
], epochs=500)
slurmjobs.util.summary(run_script, job_paths)
Note that Shell does not accept slurm-specific options such as email or n_gpus; passing them raises a TypeError ("Unrecognized options: {'n_gpus'}. If you extended your template to use additional options, please give them default values in your Class.options dictionary.").
Job Generation¶
- class slurmjobs.Singularity(command, overlay=None, sif=None, *a, **kw)[source]¶
Generate jobs for sbatch + singularity. Use this for HPC Greene.
This is functionally equivalent to:
{# sbatch_args #}
# execute it inside the container
singularity exec --overlay ... ... /bin/bash << EOF
. /ext3/env.sh
{# shell_code_body #}
EOF
To set up a singularity overlay with an anaconda python environment, please feel free to use https://gist.github.com/beasteers/84cf1eb2e5cc7bc4cb2429ef2fe5209e. I’m in the process of integrating it with slurmjobs.
Source tutorial: https://sites.google.com/a/nyu.edu/nyu-hpc/services/Training-and-Workshops/tutorials/singularity-on-greene
- class slurmjobs.Slurm(*a, sbatch=None, modules=None, n_gpus=None, n_cpus=None, nv=None, **kw)[source]¶
Generate jobs for sbatch. This was used for HPC Prince. For HPC Greene, please use Singularity.
This is functionally equivalent to:
# sbatch_args
#SBATCH --nodes=...
#SBATCH --cpus-per-task=...
#SBATCH --gres=gpu:...
#SBATCH --mail-user=...
...
{# shell_code_body #}
- class slurmjobs.Shell(command, name=None, cli=None, root_dir=None, backup=True, job_id=True, template=None, run_template=None, **options)[source]¶
Generate batch jobs to be run without a task scheduler. This creates a grid of scripts that are run using nohup, a program that keeps your script running even if the parent process (e.g. your ssh session) dies. This is mainly for small tests and whatnot.
import slurmjobs
jobs = slurmjobs.Shell('python myscript.py')
jobs.generate([
    ('a', [1, 2]),
    ('b', [1, 2]),
])
- class slurmjobs.Jobs(command, name=None, cli=None, root_dir=None, backup=True, job_id=True, template=None, run_template=None, **options)[source]¶
The base class for job generation. Sub-class this if you want to provide your own job type.
import slurmjobs
batch = slurmjobs.SlurmBatch(
    'python train.py',
    email='mynetid@nyu.edu', conda_env='my_env')

# generate jobs across parameter grid
run_script, job_paths = batch.generate([
    ('kernel_size', [2, 3, 5]),
    ('nb_stacks', [1, 2]),
    ('lr', [1e-4, 1e-3]),
], receptive_field=6)
We use Jinja2 for our script templating. See https://jinja.palletsprojects.com/en/3.0.x/templates/ for documentation about its syntax and structure.
- options¶
Template options. Anything here is made available to the template.
- Type
dict
- template¶
The Jinja2 template for the sbatch script.
- Type
str
- job_template¶
The Jinja2 template for the jobs script that launches the sbatch files.
- Type
str
- cli¶
The command line argument format your script uses. By default, this uses Fire. See args.FireArgument.
- Type
- job_id_arg¶
The argument name to use for passing the job ID. Set to None to omit the job ID.
- Type
str
- job_id_key_sep¶
The separator between key and value used in the job ID.
- Type
str
- job_id_item_sep¶
The separator between key-value items in the job ID.
- Type
str
- allowed_job_id_chars¶
Characters (other than alphanumeric) allowed in the job ID. Other characters are removed.
- Type
str, list
- special_character_replacement¶
A string used in place of special characters not found in allowed_job_id_chars. Default is empty.
- Type
str
- key_abbreviations¶
A mapping from full name to abbreviation for keys in the job ID.
- Type
dict
- abbreviate_length¶
The length to abbreviate keys in the job ID.
- Type
int
- float_precision¶
The number of decimals to limit float values in the job ID.
- Type
int
- format_id_item(k, v)[source]¶
Formats a key-value pair for the job ID.
You can override this to change how each key-value pair in a job ID is formatted.
- Parameters
k (str) – The argument key.
v (any) – The argument value.
- Returns
Return a tuple if you want it to be joined using job_id_key_sep. Return a string to be used as-is. Return None to omit it from the job ID.
Be warned that omitting key/value pairs runs the risk of filename collisions, which would mean multiple jobs overwriting each other in the same file.
- format_job_id(args, keys=None, name=None, ignore_keys=None)[source]¶
Convert a dictionary to a job ID.
- Parameters
args (GridItem) – The job dictionary.
keys (list, tuple) – The keys that we want to include in the job ID.
name (str) – The job name to include in the job ID. By default this will be the name associated with the Jobs object. Pass False to use no name.
- generate(grid_=None, *a, ignore_job_id_keys=None, **kw)[source]¶
Generate slurm jobs for every combination of parameters.
- Parameters
grid (Grid, list) – The parameter grid.
*a – Positional arguments to add to the command.
**kw – Additional keyword arguments to pass to the command.
- Returns
str: The path to the run script.
list[str]: The list of paths for each job file.
- Return type
tuple
Parameter Grids¶
Parameter grids!
This lets you do basic grid expansion and grid arithmetic.
g = Grid([
('a', [1, 2]),
('b', [1, 2]),
], name='train')
# append two configurations
g = g + LiteralGrid([{'a': 5, 'b': 5}, {'a': 10, 'b': 10}])
# create a bigger grid from the product of another grid
g = g * Grid([
('c', [5, 6])
], name='dataset')
assert list(g) == [
{'a': 1, 'b': 1, 'c': 5},
{'a': 1, 'b': 1, 'c': 6},
{'a': 1, 'b': 2, 'c': 5},
{'a': 1, 'b': 2, 'c': 6},
{'a': 2, 'b': 1, 'c': 5},
{'a': 2, 'b': 1, 'c': 6},
{'a': 2, 'b': 2, 'c': 5},
{'a': 2, 'b': 2, 'c': 6},
{'a': 5, 'b': 5, 'c': 5},
{'a': 5, 'b': 5, 'c': 6},
{'a': 10, 'b': 10, 'c': 5},
{'a': 10, 'b': 10, 'c': 6},
]
- class slurmjobs.grid.Grid(_Grid__grid, name=None, **constants)[source]¶
A parameter grid! To get all combinations from the grid, just do list(Grid(...)).
- Parameters
grid (list, dict) –
The parameter grid. Should be either a dict or a list of key values, where the values are a list of values to use in the grid. Examples of valid inputs:
import itertools
import slurmjobs
from slurmjobs import Grid

# simple grid
Grid([('a', [1, 2]), ('b', [1, 2])])
Grid({'a': [1, 2], 'b': [1, 2]})

# paired parameters
Grid([
    ('a', [1, 2]),
    (('b', 'c'), ([1, 2], [1, 2])),
])
Grid([
    ('a', [1, 2]),
    [{'b': 1, 'c': 1}, {'b': 2, 'c': 2}],
])

# any of these are valid grid specs
g = slurmjobs.Grid([
    # basic
    ('a', [1, 2]),
    # paired
    (('b', 'c'), ([1, 1, 2, 2], [1, 2, 1, 2])),
    # literal list of dicts
    [{'d': i} for i in [1, 2]],
    # dict generator
    ({'e': i} for i in [1, 2]),
    # function
    lambda: [{'f': i} for i in [1, 2]],
    # function that returns a generator
    lambda: ({'g': i} for i in [1, 2]),
    # basic generator
    ('h', (x for x in [1, 2])),
    # basic function
    ('i', lambda: [x for x in [1, 2]]),
])
keys = 'abcdefghi'
assert list(g) == [
    dict(zip(keys, vals))
    for vals in itertools.product(*([[1, 2]] * len(keys)))
]
name (str) – The name of this grid. Can be used to search for the parameters from this grid.
**constants – Extra parameters to add to the grid that don’t vary. These will not be included in the job_id name.
Just a heads up: there is nothing stopping you from passing an infinite generator, meaning that you can make some fancy sampling grid generators, but slurmjobs will take that and not know when to stop. If you want to use an infinite generator, just wrap it in itertools.islice, which will let you provide a limit.
Obviously, slurmjobs doesn’t operate anywhere near the memory scale where you’d need to even use generators in the first place, but I figured why limit the implementation if it can be used for other things too.
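For example, capping a hypothetical infinite sampler with itertools.islice before handing it to a grid (the sampler below is made up for illustration):

```python
import itertools
import random

def random_configs():
    """An infinite generator of sampled configs."""
    rng = random.Random(0)
    while True:
        yield {'lr': rng.choice([1e-4, 1e-3]), 'seed': rng.randrange(100)}

# only ever take the first 5 sampled configs
sampled = list(itertools.islice(random_configs(), 5))
```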
- class slurmjobs.grid.LiteralGrid(_LiteralGrid__grid, name=None, **constants)[source]¶
A parameter grid, specified as a flattened list. This doesn’t do any grid expansion; it lets you specify the grid exactly as you want.
- Parameters
grid (list, dict) – The parameter list. Should be a list of dicts, each corresponding to a parameter config.
name (str) – The name of this grid. Can be used to search for the parameters from this grid.
g = LiteralGrid([
    {'a': 1, 'b': 1},
    {'a': 1, 'b': 2},
    {'a': 2, 'b': 2},
])
assert list(g) == [
    {'a': 1, 'b': 1},
    {'a': 1, 'b': 2},
    {'a': 2, 'b': 2},
]
Grid Operations¶
- class slurmjobs.grid.GridCombo(*grids, name=None)[source]¶
This handles the multiplication of two grids (combinations). It will create a grid as a product of all provided grids.
You can create this by doing grid_a * grid_b. The only reason to use this directly is if you want to give it a name, or if you want to make a grid product of 3 or more grids.
a = Grid([('a', [1, 2]), ('b', [1, 2])])
b = Grid([('c', [1, 2])])

# functionally equivalent
c = a * b
c = GridCombo(a, b, name='my-a-b-combo-grid')

c_items = [dict(da, **db) for da, db in itertools.product(a, b)]
- class slurmjobs.grid.GridChain(*grids, name=None)[source]¶
This handles the addition of two grids (one after the other).
You can create this by doing grid_a + grid_b. The only reason to use this directly is if you want to give it a name.
a = Grid([('a', [1, 2]), ('b', [1, 2])])
b = Grid([('c', [1, 2])])

# functionally equivalent
c = a + b
c = GridChain(a, b, name='my-a-then-b-grid')

c_items = list(a) + list(b)
- class slurmjobs.grid.GridOmission(grid, omission, name=None)[source]¶
This handles the subtraction of two grids. It will yield only dicts from grid_a that don’t appear in grid_b.
You can create this by doing grid_a - grid_b. The only reason to use this directly is if you want to give it a name.
a = Grid([('a', [1, 2]), ('b', [1, 2])])
b = Grid([('a', [2]), ('b', [1])])

# functionally equivalent
c = a - b
c = GridOmission(a, b, name='my-a-minus-b-grid')

omit = list(b)
c_items = [da for da in a if da not in omit]
Extending Grids¶
- class slurmjobs.grid.BaseGrid(name=None, ignore_job_id_keys=None, **constants)[source]¶
The base class for all grids. Use this if you want to extend another grid.
You just need to implement:
- __iter__: yields all items generated by the grid.
- __len__: returns the number of items in the grid. If you cannot determine the length of a grid, raise a TypeError (as the other grids do).
- __repr__: a nice string representation of the grid.
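For instance, a minimal one-parameter grid implementing that interface might look like the sketch below (written against a plain class for illustration; the exact BaseGrid constructor may differ):

```python
class RangeGrid:
    """A minimal grid: yields {key: i} for i in range(n)."""
    def __init__(self, key, n, name=None):
        self.key, self.n, self.name = key, n, name

    def __iter__(self):
        # yield every item generated by the grid
        for i in range(self.n):
            yield {self.key: i}

    def __len__(self):
        # the number of items; raise TypeError if it can't be determined
        return self.n

    def __repr__(self):
        return f'RangeGrid({self.key!r}, {self.n})'

list(RangeGrid('epoch', 3))
# [{'epoch': 0}, {'epoch': 1}, {'epoch': 2}]
```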
Argument Formatting¶
Argument Formatters
- class slurmjobs.args.Argument[source]¶
The base-class for all argument formatters. If you want to provide a new argument formatter, just override this class and change either:
- format_arg: the method used to format positional and key-value pairs, or
- format_value (which is called in format_arg): formats python values as strings. This is often some form of quoted repr or json formatting.
- classmethod get(key='fire', *a, **kw)[source]¶
Return an instance of an argument formatter, based on its name.
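To make the format_arg / format_value split concrete, here is a hypothetical key=value formatter. It is shown standalone so it is self-contained; a real one would subclass slurmjobs.args.Argument, and the exact method signatures there are assumptions based on the description above.

```python
import json

class EqualsArgument:
    """Hypothetical formatter producing key=value pairs
    (written standalone rather than subclassing slurmjobs.args.Argument)."""

    def format_value(self, value):
        # json gives consistent quoting for strings, bools, lists, ...
        return json.dumps(value)

    def format_arg(self, key, value):
        return '{}={}'.format(key, self.format_value(value))

fmt = EqualsArgument()
print(fmt.format_arg('epochs', 500))   # epochs=500
print(fmt.format_arg('model', 'AVE'))  # model="AVE"
```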
- class slurmjobs.args.FireArgument[source]¶
Argument formatting for Python Fire.
Example:
python script.py a b --arg1 c --arg2 d
- class slurmjobs.args.ArgparseArgument[source]¶
Argument formatting for Python’s builtin argparse.
Example:
python script.py a b --arg1 c --arg2 d
- class slurmjobs.args.SacredArgument[source]¶
Formatting for sacred.
Example:
python script.py with arg1=a arg2=b
Function Receipts¶
- class slurmjobs.receipt.Receipt(name='', *a, receipt_id=None, __dir__=None, **kw)[source]¶
Make a receipt for a function call. This lets you skip a function if it already ran successfully, which is useful when a long script fails partway through and you want to re-run it without redoing the parts that finished.
This will cache the function execution history using the string representation of the function's arguments. That works well since 99% of things have some sort of string representation. However, if a value's string representation isn't consistent across runs, like an object that prints out its ID, the receipt won't work.
To resolve this, you can either:
- subclass this class and override Receipt.hash_args(*a, **kw) to return something invariant for your arguments,
- provide your own receipt_id to the function call, or
- submit a PR for a better hash function!
receipt = Receipt(func.__name__, *a, **kw)
if not receipt.exists():
    try:
        do_something()
        receipt.make()  # yay we can skip it next time!
    except Exception:
        """Oh well. It failed, but we'll just try again next time."""
else:
    """Oh good it ran completely last time so we can skip it and move to the next one."""
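One way to get an invariant hash, sketched with a hypothetical helper name (a real version would live in an overridden Receipt.hash_args):

```python
import hashlib
import json

def stable_receipt_id(*a, **kw):
    """Hypothetical hash: serialize arguments with sorted keys so the
    ID stays the same across runs, unlike repr() of objects whose
    string form changes (e.g. includes a memory address)."""
    payload = json.dumps([a, kw], sort_keys=True, default=str)
    return hashlib.md5(payload.encode()).hexdigest()

# the same arguments always map to the same receipt id,
# regardless of keyword order
stable_receipt_id('train', model='AVE', epochs=500)
```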
- slurmjobs.receipt.use_receipt(func, receipt_dir=None, test=None)[source]¶
Use a receipt for a function call, which lets us skip it if the function completed successfully on the last run. This is just a wrapper around
Receipt
that handles the receipt checking/making logic for you.
# do step 1. If it already ran successfully it will skip and do nothing.
use_receipt(my_step1_function)(**step1_kwargs)
# do step 2. here we're passing a custom receipt ID
custom_receipt_id = ...
use_receipt(my_step2_function)(**step2_kwargs, receipt_id=custom_receipt_id)
# do step 3
use_receipt(my_step3_function)(**step3_kwargs)
Changes¶
Unreleased¶
Rewrote roughly 70% of the code, including the jinja templates and the jobs base class and its subclasses
added a slurmjobs.Singularity() class and template
added parameter grid classes!! Lets you do grid arithmetic :)
Renamed:
slurmjobs.Batch => slurmjobs.Jobs()
slurmjobs.SlurmBatch => slurmjobs.Slurm()
slurmjobs.Jobs.default_options => slurmjobs.Jobs.options
slurmjobs.Jobs.JOB_ID_KEY => slurmjobs.Jobs.job_id_arg
Changed signatures:
removed parameters from
slurmjobs.Jobs.__init__()
:multicmd, cmd_wrapper, init_script, run_init_script, post_script, paths
added
Receipt(receipt_id='a-str-used-instead-of-hash_args()')
removed utilities:
get_job_name, make_job_name_tpl, format_value_for_name, _encode_special, _decode_special, _obliterate_special: replaced by slurmjobs.Jobs.format_id_item() and slurmjobs.Jobs.format_job_id()
expand_grid, expand_paired_params, unpack, split_cond: replaced by slurmjobs.Grid()
singularity_command: replaced by slurmjobs.Singularity()
Factory: replaced by slurmjobs.util.subclass_lookup()
slurmjobs.util.flatten()
now returns a generator rather than a list.
slurmjobs.Jobs.template
andslurmjobs.Jobs.run_template
both expect a template string (not a path). To specify a path, read the file contents yourself or just extend the template
simplified argument formatting (removed special namedtuple classes)
changed
NoArgVal
to just...
changed
Argument.build
to justArgument.__call__
improved how command, cli, and args are rendered (all done in jinja now)
added overridable method
Receipt.hash_args
to allow for custom hashing
added proper docs
added tensorflow
2.7
to cuda_versions.csv
added
scripts/singuconda
and slurmjobs.singuconda
as a WIP rewrite.
0.2.2¶
cmd_wrapper can also be a function
added util.find_tensorflow_cuda_version to look up cuda versions
util.singularity_command returns a function that will now escape quotes in the passed command
0.2.1¶
Now you can pass in a list of dicts and it will use each dictionary as a job. This works alongside the parameter grid expansion, so you can do:
jobs.generate([{'param1': 5}, ('param1', [100, 150]), ('param2', [200, 250])])
Added a cmd_wrapper argument, e.g. SlurmBatch('python myscript.py', cmd_wrapper=slurmjobs.util.singularity_command(overlay, sif)), for easier command formatting (no longer need to use multicmd=True and {__all__}). It expects a string with one positional format arg, e.g. 'sudo {}'
0.2.0¶
Oops - TODO - fill in. This was changes to adapt to NYU Greene and Singularity containers.
0.1.7¶
added JSON metadata that can be stored in the receipt. Currently, adds duration_secs and time
added more receipt logging (on successful write, on skip, on error)
Set slurmjobs.Receipt.TEST instead of slurmjobs.use_receipt.TEST
0.1.2¶
Added a receipt utility to avoid re-running functions slurmjobs.use_receipt(func)(*a, **kw)
Added Batch().generate(expand_grid=False) option to avoid expanding parameters and passing explicit grids
fixed json encoding error in run templates
0.1.1¶
- commands can now access the original values (without command line flag attached) by using the variable name preceded with an underscore. e.g.
'{year}' -> '--year 2016'
'{_year}' -> '2016'
specifying cli=False will disable any formatting and will just pass them sequentially.
weird things were happening when shlex.quote-ing repr so changed to json.dumps
0.1.0¶
expanded support to handle multi-line commands.
added more tests
moved init_script so it happens after activating conda
added source ~/.bashrc to job
added run_init_script so scripts can run code before you submit the jobs
removed hard coded nodes sbatch arg. It is now changeable (not sure why it was hardcoded..)