Example 1: High-throughput geometry optimisations with CASTEP

Example 1.1: Using run3 locally

In this example, we will suppose that you want to perform a geometry optimisation on several different polymorphs of TiO2 from the ICSD. The files for this example can be found in examples/run3_tutorial.

Setting up the files

By default, run3 expects the following files in the job folder:

  • one .res/SHELX file per structure to optimise (see the illustrative sketch after this list)

  • one $seed.cell file that contains the CASTEP CELL keywords common to every structure, e.g. pseudopotentials, kpoint spacing, etc.

  • one $seed.param file that contains the CASTEP PARAM keywords common to every calculation, e.g. cut_off_energy, xc_functional, task, geom_max_iter, etc.

  • any required pseudopotential files.
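For reference, each .res file is a structure in the SHELX-like format used by AIRSS and matador. The sketch below is purely illustrative: the file name, TITL fields, lattice parameters and atomic positions are made-up placeholders rather than values taken from the ICSD entries used in this example.

TITL TiO2-example 0.0 62.4 0.0 0 0 6 (P4_2/mnm) n - 1
CELL 1.0 4.60 4.60 2.96 90.0 90.0 90.0
LATT -1
SFAC Ti O
Ti 1 0.000 0.000 0.000 1.0
Ti 1 0.500 0.500 0.500 1.0
O  2 0.305 0.305 0.000 1.0
O  2 0.695 0.695 0.000 1.0
O  2 0.805 0.195 0.500 1.0
O  2 0.195 0.805 0.500 1.0
END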

Tip

If you have a database set up with structures from the OQMD these could be obtained via matador query --db oqmd_1.1 -f TiO2 --icsd --res.

Tip

Alternatively, you can turn many file types into .res using the various <format>3shx scripts (shx standing for SHELX), e.g. cell3shx *.cell.

The job folder should look something like this:

$ ls
O2Ti-OQMD_112497-CollCode171670.res  O2Ti-OQMD_2575-CollCode9852.res     O2Ti-OQMD_7500-CollCode41493.res
O2Ti-OQMD_117323-CollCode182578.res  O2Ti-OQMD_3070-CollCode15328.res    O2Ti-OQMD_84685-CollCode97008.res
O2Ti-OQMD_13527-CollCode75179.res    O2Ti-OQMD_31247-CollCode657748.res  O2Ti-OQMD_97161-CollCode154036.res
O2Ti-OQMD_19782-CollCode154035.res   O2Ti-OQMD_5979-CollCode31122.res    TiO2.cell
O2Ti-OQMD_2475-CollCode9161.res      O2Ti-OQMD_7408-CollCode41056.res    TiO2.param

with the .param file containing:

$ cat TiO2.param
task                 : geometryoptimization
xc_functional        : LDA
cut_off_energy       : 300.0 eV
geom_force_tol       : 0.1
spin_polarized       : false
fix_occupancy        : false
max_scf_cycles       : 100
opt_strategy         : speed
page_wvfns           : 0
perc_extra_bands     : 40
num_dump_cycles      : 0
backup_interval      : 0
geom_method          : LBFGS
geom_max_iter        : 300
mix_history_length   : 20
finite_basis_corr    : 0
fixed_npw            : false
write_cell_structure : true
write_checkpoint     : none
write_bib            : false
bs_write_eigenvalues : false
calculate_stress     : true

and the .cell file containing:

$ cat TiO2.cell
kpoint_mp_spacing: 0.07

%block species_pot
QC5
%endblock species_pot

symmetry_generate
symmetry_tol: 0.01
snap_to_symmetry
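Here the QC5 line in the species_pot block selects one of CASTEP's built-in on-the-fly pseudopotential libraries for every species. If you would rather point at explicit pseudopotential files (the "required pseudopotential files" mentioned above), the block instead takes one entry per species; a sketch, assuming hypothetical files Ti_OTF.usp and O_OTF.usp sit in the job folder:

%block species_pot
Ti Ti_OTF.usp
O  O_OTF.usp
%endblock species_pot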

Calling run3

Once these files are in place, we can begin the geometry optimisations. To run on the current host machine, simply call:

$ run3 TiO2

This will start a single node CASTEP job on the current machine, using all available cores. If you are on a local cluster without a queuing system, and wish to run on several nodes at once (say node3, node6 and node8), the oddjob script can be used as follows:

$ oddjob 'run3 TiO2' -n 3 6 8

This will start 3 single node CASTEP jobs on the desired nodes. If instead your nodes are called cpu00010912, cpu323232 and cpu123123, the --prefix flag is needed:

$ oddjob 'run3 TiO2' --prefix cpu -n 00010912 323232 123123
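By default, run3 will use all available cores on whichever node it runs on. If you want each calculation to use fewer cores, the -nc flag described in example 1.2.1 sets the number of cores per calculation; for example, to use 8 cores:

$ run3 -nc 8 TiO2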

Tip

On a supercomputer with a queuing system, e.g. PBS or slurm, run3 must be called in your submission script. Array jobs are typically an effective way of spreading out over multiple nodes. An example of this kind can be found in example 1.2.

Monitoring your calculations

If you look at the job folder as run3, er… runs, you will see several files and folders being created. Firstly, 3 .txt files will be made:

  • jobs.txt: this file contains a list of jobs that, at some point, started running.

  • finished_cleanly.txt: this file lists jobs that completed without error.

  • failures.txt: this file lists jobs that produced an error.

Every structure in progress will have a <structure_name>.lock file to prevent clashes with other nodes.

Several folders will also be created:

  • logs/: one log file per structure, containing a complete history of the run.

  • input/: a backup of the starting configuration as a .res file.

  • completed/: all successful calculations will end up here, usually as a .res file with the final configuration, a concatenated .castep file containing every job step, and if requested (via write_cell_structure: true), CASTEP’s -out.cell file.

  • bad_castep/: all failed calculations end up here, including all auxiliary files.

  • <node_name>/: a folder is created per hostname (e.g. when running on multiple nodes) that contains the interim calculations. On failures/timeouts, all files in here are moved back to the main job folder.

Eventually, all jobs will hopefully end up in completed/, at which point you are done!
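A few shell one-liners are usually enough to keep an eye on progress; a sketch (the exact names of the files inside logs/ are an assumption, so check that folder for the naming run3 actually uses):

$ ls *.lock                    # structures currently being optimised
$ ls completed/*.res | wc -l   # structures that finished successfully
$ cat failures.txt             # structures that produced an error
$ tail -f logs/*.log           # follow the run history as it is written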

Example 1.1.1: High-throughput geometry optimisations with CASTEP with per-structure parameters

There are a few occasions where you might need a custom .param file for each structure, for example, if using the implicit nanotube %devel_code in CASTEP.

These calculations are performed in exactly the same way as above, except that a <structure_name>.param file must be made for each structure, containing both the required DFT parameters and the nanotube parameters. In this case, run3 must be called as:

$ run3 --custom_params TiO2
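A quick way to create the per-structure files is to copy the common .param file once for every .res file, then add the structure-specific %devel_code block to each copy by hand; a minimal sketch using standard shell tools:

$ for f in *.res; do cp TiO2.param "${f%.res}.param"; done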

Tip

If you have a .res file that contains a PyAIRSS “REM NT_PROPS” line, this will be ignored.

Example 1.2: High-throughput geometry optimisations with CASTEP on a supercomputer

Each HPC facility has its own quirks, so in this example we will try to be as explicit as possible. The setup of the job is exactly the same as in example 1, but we must now add run3 to our job submission script. The following examples are for the SLURM queuing system on the BlueBear machine at the University of Birmingham and for PBS on ARCHER (Tier-1), but run3 has also been tested on CSD3 (Tier-2), HPC Midlands-Plus (Tier-2), Thomas (Tier-2) and several local group-scale clusters.

Example 1.2.1: SLURM on BlueBear

In this job, we will submit a run3 job that performs CASTEP calculations across two 24-core nodes per structure. Let us presume we have many thousands of structures to run. The submission script looks as follows:

$ cat run3.sub
#!/bin/bash -l

###### MACHINE/USER-SPECIFIC OPTIONS ######

#SBATCH --ntasks 48
#SBATCH --nodes 2-2
#SBATCH --time 24:00:00
#SBATCH --qos <REDACTED>
##SBATCH --qos bbshort
#SBATCH --mail-type ALL
#SBATCH --account=<REDACTED>

module purge
export PATH="$HOME/bin/CASTEP-17.21:$HOME/.conda/bin:$PATH"
module load bluebear
module load mpi/impi/2017.1.132-iccifort-2017.1.132
unset I_MPI_PMI_LIBRARY

# RUN3 COMMANDS
# (assuming installation guide followed at
#  https://matador-db.readthedocs.io/en/latest/install.html)

source activate matador
run3 -nc 48 -v 4 --executable castep.mpi --ignore_jobs_file TiO2

Let’s unpick a few of the flags used to call run3 here:

  • -nc/--ncores: the number of cores to use per structure, per calculation. It is often worth specifying this if more than one node is being used, as the correctness of run3’s core counter is queue/machine-specific.

  • -v 4: sets the verbosity in the log file to the highest level.

  • --ignore_jobs_file: by default, run3 will check for both <structure>.lock files and entries in jobs.txt before running a new structure. It is often worth disabling the jobs.txt check if it is not expected that all structures will complete in one job submission (see below).

  • --executable: by default, run3 will try to call an executable called simply castep. On many machines, CASTEP is instead installed as castep.mpi.

Now to submit this script as a 100-task array job (using up to 200 nodes in total, i.e. running a maximum of 100 structures concurrently, depending on the queue), we call the following:

$ sbatch --array=1-100 run3.sub
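The state of the array can then be checked with the usual SLURM tools, for example:

$ squeue -u $USER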

It may be that this job is not large enough to optimise all structures within the walltime limit. In this case, it can be resubmitted using the same command. Jobs that were running when the walltime ran out should automatically be pushed back into the job folder so that they are available to the next run3 call. In the event that this does not happen (for example, if MPI kept control of the Python thread for too long and the queuing system interrupted run3's clean-up), a <hostname> folder will be left hanging around in the main job folder. Such jobs must then be made restartable manually, by deleting <structure>.lock (and removing <structure> from jobs.txt if not using --ignore_jobs_file). It may also be that the intermediate CASTEP calculation was not copied over from the <hostname> folder; in this case, the CASTEP files can be updated by running:

$ cp -u node*/*.castep .

from inside the root job folder.
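Putting that manual clean-up together for a single stuck structure, the steps might look something like this (a sketch only; the structure name is just a placeholder taken from the listing above, and the sed line is only needed if you are not using --ignore_jobs_file):

$ rm O2Ti-OQMD_2575-CollCode9852.lock
$ sed -i '/O2Ti-OQMD_2575-CollCode9852/d' jobs.txt
$ cp -u node*/O2Ti-OQMD_2575-CollCode9852*.castep .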

Example 1.2.2: PBS on ARCHER

Instructions are almost identical to the above, but the array job script looks a little different for the same 100 copies of 2-node jobs (24 cores per node):

$ cat run3.job
#!/bin/bash --login
# PBS job options (name, compute nodes, job time)
# PBS -N is the job name (e.g. Example_MixedMode_Job)
#PBS -N my_run3_job
# PBS -l select is the number of nodes requested (e.g. 128 nodes = 3072 cores)
#PBS -l select=2
# PBS -l walltime, maximum walltime allowed (e.g. 6 hours)
#PBS -l walltime=24:00:00
# Replace [budget code] below with your project code (e.g. t01)
#PBS -A <REDACTED>
#PBS -m abe
#PBS -M <REDACTED>
#PBS -J 1-100
#PBS -r y

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
# (remember this should be on the /work filesystem)
cd $PBS_O_WORKDIR

source $HOME/.bashrc
module load anaconda-compute/python3
source activate $HOME/work/.conda/matador

run3 --archer -v 4 -nc 48 TiO2

Notice here we have specified --archer: again, run3 should be able to detect that mpirun is missing and thus try aprun, but it can be worth specifying just in case. With PBS, the whole array can be submitted with just:

$ qsub run3.job
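As with SLURM, the state of the PBS array can be checked with the scheduler's own tools, e.g.:

$ qstat -u $USER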