Checkpoint and Restarting

From Earlham Cluster Department



Checkpointing and Restarting Methods in GROMACS

Three operations involving the GROMACS tools matter for checkpointing and restarting in Folding@Clusters: creating an input file to start a run (our traditional grompp'ing), running the MD, and restarting a failed run. The following shows the minimum set of files needed for each operation. Output files are also noted. The command-line options are noted in []'s after each file.

The files needed from the assignment server are the mdp, gro, and top files (molecule.conf should be in this set, but it isn't relevant to this doc). The rest of the files can be generated from this set. Previous notions of running simulations with only a gro file or only a top file are faulty; the two file types hold different sets of information that are both critical to MD. The gro file contains the names of the atoms, their positions, and their velocities. The top file contains the topology of the molecular system: information like bonds, pairs, angles, and dihedrals. The mdp file is essential because it specifies the parameters of the MD simulation. Note that this is just a barebones list of the minimum requirements for using these tools. Note also that these lists include more command-line arguments than are strictly necessary. If a tool is run without some of the listed arguments, it assumes that files with the default names are present in the working directory.
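As a rough sketch of the first operation, the helper below refuses to run grompp unless the mdp, gro, and top files are all present. The file names are the GROMACS defaults (grompp.mdp, conf.gro, topol.top, topol.tpr), and the function names have_inputs and make_tpr are hypothetical; neither is part of Folding@Clusters.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every file named on the command line exists.
have_inputs() {
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing input file: $f" >&2
            return 1
        fi
    done
    return 0
}

# Hypothetical wrapper: build the run input file (tpr) from the three inputs.
# The names below are the GROMACS defaults; substitute the assignment's own names.
make_tpr() {
    have_inputs grompp.mdp conf.gro topol.top || return 1
    grompp -f grompp.mdp -c conf.gro -p topol.top -o topol.tpr
}
```

Calling make_tpr in a directory that lacks any of the three files prints which one is missing and returns without invoking grompp.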

Notes from the online GROMACS documentation suggest that the most accurate simulation restarts include an edr (energy) file. We may want to start including this in our restart methods.

Checkpoint Frequency

Checkpointing in Folding@Clusters has two distinct parts:

  1. mdrun generates checkpoints at the interval given by the nstxout parameter in grompp.mdp.
  2. The nanny checks the size of the checkpoint file. If the file has become larger, it is transferred to the mother. This check happens about every two seconds (as of 1 June 2005).
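The nanny's size check could look roughly like the sketch below. It is not the actual nanny code: file_size, has_grown, watch_checkpoint, and send_to_mother are hypothetical names, and send_to_mother is only a stub standing in for whatever transfer mechanism the nanny really uses.

```shell
# Hypothetical helper: print the size of a file in bytes (0 if it does not exist).
file_size() {
    if [ -f "$1" ]; then
        wc -c < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Hypothetical helper: succeed if file $1 is now larger than $2 bytes.
has_grown() {
    [ "$(file_size "$1")" -gt "$2" ]
}

# Stub for the transfer to the mother; the real mechanism is not shown here.
send_to_mother() {
    echo "would transfer $1"
}

# Hypothetical polling loop: check every two seconds, transfer only on growth.
watch_checkpoint() {
    ckpt=$1
    last=$(file_size "$ckpt")
    while :; do
        sleep 2
        if has_grown "$ckpt" "$last"; then
            send_to_mother "$ckpt"
            last=$(file_size "$ckpt")
        fi
    done
}
```

Comparing sizes rather than modification times mirrors the description above: a transfer happens only when the file has become larger since the last check.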

Signals accepted by mdrun and their relevance to the checkpoint/restart process

The mdrun process accepts SIGTERM and SIGUSR1. These signals can be received by an mdrun process of any rank. The effects of the signals are as follows:

Neither of these signals provides a useful checkpointing mechanism. With some modification (removing the code that modifies nsteps), the SIGUSR1 mechanism could be useful. Note that we are already using SIGINT in nanny/child communication and SIGINT is our only free signal due to COSM.

Using GROMACS tools


Alternative method:

tools used:

files initially required:

files created:


  1. Create a .gro file that combines the checkpointed result (.trr) with the original run input (.tpr).
    trjconv -s original.tpr -f checkpoint.trr -o new.gro
  2. Use the .gro as input to grompp to create a new .tpr file configured for the new number of nodes.
    grompp -f mdout.mdp -c new.gro -p -np 4 -o new.tpr
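The two steps above can be chained in a small script. The sketch below is only an illustration: run and restart_on_nodes are hypothetical names, the DRY_RUN switch is an addition for checking the arguments without GROMACS installed, and the file names are the ones used in the steps above.

```shell
# Hypothetical wrapper: with DRY_RUN=1, print the command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: rebuild the run input for a new node count.
# $1 = number of nodes for the restarted run
restart_on_nodes() {
    np=$1
    # Step 1: turn the checkpointed trajectory into a structure file.
    run trjconv -s original.tpr -f checkpoint.trr -o new.gro || return 1
    # Step 2: preprocess again for the new number of nodes.
    run grompp -f mdout.mdp -c new.gro -p -np "$np" -o new.tpr
}
```

Setting DRY_RUN=1 and calling restart_on_nodes 4 prints the same two command lines shown in the steps, which makes it easy to confirm the node count before a real restart.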


Without modification, a simulation can only be run for a set amount of simulation time. Even after a restart, it cannot continue past the end time specified when grompp was initially executed. To extend the simulation, tpbconv must be used with the -extend or -until option, each of which takes a number of picoseconds as its argument:

     tpbconv -s topol.tpr -f traj.trr -e ener.edr -o new.tpr -extend 10
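
The difference between the two options is that -extend adds the given time to the current end time, while -until sets the end time directly. The sketch below only prints the resulting command lines; extend_cmd is a hypothetical helper, and the 100 ps original run length in the example is an assumption made for illustration.

```shell
# Hypothetical helper: print the tpbconv command that lengthens a run.
# $1 = mode: "extend" adds picoseconds, "until" sets the absolute end time
# $2 = time in picoseconds
extend_cmd() {
    case $1 in
        extend|until) ;;
        *) echo "mode must be 'extend' or 'until'" >&2; return 1 ;;
    esac
    echo "tpbconv -s topol.tpr -f traj.trr -e ener.edr -o new.tpr -$1 $2"
}

# Assuming a run originally set up for 100 ps, either command takes it to 110 ps:
extend_cmd extend 10
extend_cmd until 110
```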