Archiving

Archiving simulation inputs, scripts and output data is a common need for computational physicists. Here are some popular tools and workflows to make archiving easy.

HPC Systems: HPSS

A very common tape-backed archival storage system at HPC centers is HPSS (High Performance Storage System), used, e.g., at NERSC and OLCF.

  • What’s in my archive file system? hsi ls

  • Already something in a specific archive location? Use hsi ls as usual, e.g., hsi ls 2019/cool_campaign/

  • Let’s create a neat directory structure:

    • new directory on the archive: hsi mkdir 2021

    • create sub-dirs per campaign as usual: hsi mkdir 2021/reproduce_paper

  • Create an archive of a simulation: htar -cvf 2021/reproduce_paper/sim_042.tar /global/cfs/cdirs/m1234/ahuebl/reproduce_paper/sim_042

    • This copies all files over to the tape filesystem and stores them as a single .tar archive

    • The first argument here is the new archive .tar file to create on the archive file system; all following arguments (there can be multiple, separated by spaces) are paths to directories and files on the parallel file system.

    • Don’t be confused: htar also creates an index file (.tar.idx) alongside the archive; just leave that file be and don’t interact with it

  • Change permissions of your archive, so your team can read your files:

    • Check the unix permissions via hsi ls -al 2021/ and hsi ls -al 2021/reproduce_paper/

    • Files must be group (g) readable (r): hsi chmod g+r 2021/reproduce_paper/sim_042.tar

    • Directories must be group (g) readable (r) and group accessible (x): hsi chmod -R g+rx 2021

  • Restore things:

    • mkdir here_we_restore

    • cd here_we_restore

    • htar -xvf 2021/reproduce_paper/sim_042.tar

      • This copies the .tar file back from tape to the parallel file system and extracts its contents into the current directory

Argument meaning: -c create; -x extract; -v verbose; -f tar filename. That’s it, folks!
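
Putting these steps together, here is a consolidated sketch of the full archive-and-restore workflow. It reuses the example paths from above; adjust the archive names and project paths to your own setup:

    # create a directory structure on the archive
    hsi mkdir 2021
    hsi mkdir 2021/reproduce_paper

    # bundle one simulation directory into a single .tar archive on tape
    htar -cvf 2021/reproduce_paper/sim_042.tar \
         /global/cfs/cdirs/m1234/ahuebl/reproduce_paper/sim_042

    # make the archive readable for your group
    hsi chmod -R g+rx 2021
    hsi chmod g+r 2021/reproduce_paper/sim_042.tar

    # restore into a fresh directory on the parallel file system
    mkdir here_we_restore
    cd here_we_restore
    htar -xvf 2021/reproduce_paper/sim_042.tar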

Note

Sometimes, for large directories, htar takes a while. You could then consider running it as part of a (single-node, single-CPU) job script, as sketched below.
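
A minimal sketch of such a batch job, assuming a SLURM-based system; the queue/QOS, account, and walltime below are placeholders and differ between centers (many centers provide a dedicated data-transfer queue for this):

    #!/bin/bash
    #SBATCH --job-name=archive_sim_042
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --time=06:00:00
    #SBATCH --qos=xfer        # placeholder: your center's data-transfer queue/QOS
    #SBATCH --account=m1234   # placeholder: your project/account

    # archive one simulation directory as a single .tar file on HPSS
    htar -cvf 2021/reproduce_paper/sim_042.tar \
         /global/cfs/cdirs/m1234/ahuebl/reproduce_paper/sim_042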

Desktops/Laptops: Cloud Drives

Even for small simulation runs, it is worth creating data archives. A good location for such an archive might be the cloud storage provided by one’s institution.

Tools like rclone can help with this, e.g., to quickly sync a large number of directories to a Google Drive, as sketched below.
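
A minimal sketch using rclone, assuming a cloud remote named institution_gdrive has been set up once interactively via rclone config; the remote name and target paths are placeholders:

    # one-time, interactive: configure a cloud remote (e.g., a Google Drive)
    rclone config

    # copy a local campaign directory into the cloud drive
    rclone copy 2021/reproduce_paper institution_gdrive:archives/2021/reproduce_paper --progress

    # or keep the remote in sync with the local directory
    # (careful: sync deletes files on the remote that no longer exist locally)
    rclone sync 2021/reproduce_paper institution_gdrive:archives/2021/reproduce_paper --progress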

Asynchronous File Copies: Globus

The scientific data service Globus makes it easy to perform large-scale data copies, between HPC centers as well as local computers, through a graphical user interface. Copies can be kicked off asynchronously, often use dedicated data-transfer backbones, and are verified once the transfer completes.

Many HPC centers also add their archives as Globus storage endpoints, and one can install a client program to add one’s own desktop/laptop as an endpoint as well.
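
For scripted transfers, the Globus command line interface can be used as well. A minimal sketch, where the endpoint UUIDs (shell variables here) and paths are placeholders to be replaced with your own:

    # authenticate once
    globus login

    # look up endpoint/collection UUIDs, e.g., your HPC center's archive
    globus endpoint search "NERSC"

    # kick off an asynchronous, recursive transfer between two endpoints
    globus transfer "${SRC_ENDPOINT}:/global/cfs/cdirs/m1234/ahuebl/reproduce_paper" \
                    "${DST_ENDPOINT}:/archives/reproduce_paper" \
                    --recursive --label "reproduce_paper archive"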

Scientific Data for Publications

It is good practice to make computational results accessible, scrutinizable and ideally even reusable.

For data artifacts up to approximately 50 GB, consider using free services like Zenodo and Figshare to store supplementary materials of your publications.
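
As a rough sketch of how such an upload could be scripted, the following uses Zenodo’s REST API via curl; the access token, bucket URL, and file name are placeholders, and the current Zenodo API documentation should be consulted for details:

    # create a new, empty draft deposition (requires a personal access token)
    curl -X POST "https://zenodo.org/api/deposit/depositions?access_token=${ZENODO_TOKEN}" \
         -H "Content-Type: application/json" -d "{}"

    # upload a file into the deposition's file bucket
    # (the bucket URL is returned in the "links" section of the previous response)
    curl -X PUT "${BUCKET_URL}/sim_042.tar" \
         -H "Authorization: Bearer ${ZENODO_TOKEN}" \
         --upload-file sim_042.tar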

For more information, see the open science movement, open data and open access.

Note

More information, guidance and templates will be posted here in the future.