Using checkpoints (snapshots)

When a task takes a long time to complete, you may want to keep its partial results in case it is interrupted midway (for example, if it is preempted or hits an unexpected error). To do so, call the task.snapshot(sec) method: partial results are then uploaded to your output bucket every sec seconds.

Not only does this let you follow the logs and results while the task runs, it may also let you recover from those partial results, if your software supports it. To recover from previous partial results, you generally need to use the same bucket for input and results, and configure your software to restart from the latest results it finds, as in the example below.

Of course, snapshots mean more data transfers, possibly of large files, so make sure the interval between snapshots is not too short (rule of thumb: between about 10 minutes and an hour) to avoid saturating the network. You can also use whitelists and/or blacklists to retrieve only the useful parts of the output.
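As a rough illustration of the two ideas above, the sketch below checks an interval against the rule of thumb and filters a list of output paths with a whitelist and a blacklist. The helper names (interval_in_rule_of_thumb, select_files) and the sample file names are hypothetical, and the use of regular expressions is an assumption for the example; it does not describe how the platform applies its filters internally.

```python
import re

# Rule-of-thumb bounds taken from the text: roughly 10 minutes to 1 hour.
MIN_INTERVAL_SEC = 10 * 60
MAX_INTERVAL_SEC = 60 * 60


def interval_in_rule_of_thumb(sec):
    """Return True if the snapshot interval lies within the suggested range."""
    return MIN_INTERVAL_SEC <= sec <= MAX_INTERVAL_SEC


def select_files(paths, whitelist=None, blacklist=None):
    """Hypothetical helper: keep paths that match the whitelist pattern
    (if given) and do not match the blacklist pattern (if given)."""
    kept = []
    for path in paths:
        if whitelist is not None and not re.search(whitelist, path):
            continue
        if blacklist is not None and re.search(blacklist, path):
            continue
        kept.append(path)
    return kept


# Illustrative file names only.
outputs = ["log.simpleFoam", "100/U", "processor0/tmp.lock"]
useful = select_files(outputs, whitelist=r"^(log\.|\d+/)", blacklist=r"\.lock$")
print(useful)
```

Filtering this way keeps each snapshot small: only the logs and the written time steps are selected, while temporary files are left out.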

An example of recovery in fluid dynamics

Below is an example that uses snapshots to recover from previous results in OpenFOAM, a numerical simulation software for fluid dynamics that can be configured to reuse previous results.

In our case, it was configured (see controlDict) to perform 1,000 iterations (endTime) and to write the partial results every 100 iterations (writeInterval). The task is also configured to start from the last time (startFrom latestTime;) that is written on disk, so it will use previous results if there are any.
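To make the effect of startFrom latestTime more concrete, here is a minimal Python sketch of the idea: among the directories of a case, pick the one whose name is the largest number. This is only an approximation written for illustration, not OpenFOAM's actual implementation, and the function name latest_time_dir is hypothetical.

```python
import os
import tempfile


def latest_time_dir(case_dir):
    """Return the name of the numeric directory with the highest value,
    or None if the case contains no numeric directory."""
    best_name, best_value = None, None
    for name in os.listdir(case_dir):
        if not os.path.isdir(os.path.join(case_dir, name)):
            continue
        try:
            value = float(name)
        except ValueError:
            continue  # not a time directory, e.g. "constant" or "system"
        if best_value is None or value > best_value:
            best_name, best_value = name, value
    return best_name


# Build a throwaway case layout: two written steps plus the usual folders.
with tempfile.TemporaryDirectory() as case:
    for name in ("0", "100", "200", "constant", "system"):
        os.mkdir(os.path.join(case, name))
    resume_from = latest_time_dir(case)

print(resume_from)
```

If only the initial directory 0 exists, the function returns it, which corresponds to a fresh start.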

To be able to use this feature, we use the same bucket for both input and results, and enable snapshots. Now, if the task is aborted midway, restarting it will automatically make the solver start from the last written step.

Python

#!/usr/bin/env python

import qarnot

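# Connect to the platform and create a single-instance Docker task
conn = qarnot.Connection(client_token='<<<MY_SECRET_TOKEN>>>')
task = conn.create_task("OpenFOAM - checkpoint", "docker-batch", 1)

# Create the bucket and upload the OpenFOAM case
bucket = conn.create_bucket("openfoam-checkpoint")
bucket.add_directory('openfoam_motorcycle/')

# Use the same bucket for input and results, so a restarted task sees earlier snapshots
task.resources.append(bucket)
task.results = bucket

# Docker image and command to run
task.constants["DOCKER_REPO"] = "qarnotlab/openfoam"
task.constants["DOCKER_TAG"] = "v1912"
task.constants["DOCKER_CMD"] = "/job/Allrun"

# Upload partial results to the results bucket every 300 seconds
task.snapshot(300)

task.submit()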

Bash

# Your info
export QARNOT_CLIENT_TOKEN="<<<MY_SECRET_TOKEN>>>"

# Create a bucket for the data
qarnot bucket create \
--name "openfoam-checkpoint" \
--folder "openfoam_motorcycle/"

# Create and run task
qarnot task create \
--name "OpenFOAM - checkpoint" \
--profile docker-batch \
--instance 1 \
--resources "openfoam-checkpoint" \
--result "openfoam-checkpoint" \
--constants "DOCKER_REPO=qarnotlab/openfoam DOCKER_TAG=v1912 DOCKER_CMD=/job/Allrun" \
--periodic 300

OpenFOAM - controlDict

/*--------------------------------*- C++ -*----------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  v2012                                 |
|   \\  /    A nd           | Website:  www.openfoam.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      controlDict;
}
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

application     simpleFoam;

startFrom       latestTime;

startTime       0;

stopAt          endTime;

endTime         1000;

deltaT          1;

writeControl    timeStep;

writeInterval   100;

purgeWrite      0;

writeFormat     binary;

writePrecision  6;

writeCompression off;

timeFormat      general;

timePrecision   6;

runTimeModifiable true;

functions
{
    #include "streamLines"
    #include "wallBoundedStreamLines"
    #include "cuttingPlane"
    #include "forceCoeffs"
    #include "ensightWrite"
}


// ************************************************************************* //
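The numbers in this controlDict can be worked through with a short Python sketch. It assumes, as in the file above, that deltaT is 1, that writeControl is timeStep, and that purgeWrite 0 keeps every written step. The function names are hypothetical and the interruption point (437) is an arbitrary example.

```python
def written_steps(end_time, write_interval):
    """Time steps at which results are written, assuming deltaT = 1."""
    return list(range(write_interval, end_time + 1, write_interval))


def restart_step(interrupted_at, write_interval):
    """Latest step written before the interruption (0 if none was written)."""
    return (interrupted_at // write_interval) * write_interval


steps = written_steps(end_time=1000, write_interval=100)
resume = restart_step(interrupted_at=437, write_interval=100)
redone = 437 - resume

print(len(steps), resume, redone)
```

With these settings, 10 steps are written in total, and an interruption at iteration 437 means restarting from step 400 and redoing 37 iterations. In practice, the step available after a restart is the latest one that was both written to disk and captured by a snapshot, so the snapshot interval also bounds how much work can be lost.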