Static data

Wellies defines static data as anything that should be made available in a common suite working environment. The data source can be local or remote. Each data entry declares a type, and wellies associates a type-specific retrieval script with that entry. The options are:

  • link: This only works with local data. The associated script creates a symbolic link to the source directory in the suite target directory.
  • copy: The associated script copies the source directory into the suite target directory. If files is specified, only those files are copied.
  • rsync: The associated script rsyncs the source directory into the suite target directory. If files is specified, only those files are transferred. Extra options for the rsync command can be provided with rsync_options; the default is "-avzpL".
  • git: The associated script clones a repository branch into the suite target directory. If files is specified, it clones into a temporary directory named git and then rsyncs everything listed in files, so it also accepts rsync_options.
  • ecfs: The associated script copies data from the ECFS remote archive into the suite directory. If files is specified, only those files are copied.
  • mars: The associated script is a MARS request. All the keys should be given in the request option. For more details on the request option, see the MARS documentation.
  • custom: This is a wildcard option that does nothing by itself; the user supplies the script to run through the pre_script or post_script options.

All scripts can be extended by using the options pre_script and post_script.
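
The ordering of the three parts can be illustrated with a minimal sketch. It is not actual wellies output: the entry name example_data and the log file are hypothetical, DATA_DIR points to a throwaway directory, and the echo lines stand in for real pre_script, retrieval and post_script commands.

```shell
# Sketch only: mimics the layout of a generated retrieval script.
# DATA_DIR is a throwaway directory standing in for the suite variable.
DATA_DIR=$(mktemp -d)
log="$DATA_DIR/stages.log"

# Pre-script
echo "pre" >> "$log"

# Main script for retrieving data
mkdir -p "$DATA_DIR"
dest_dir="$DATA_DIR/example_data"   # hypothetical entry name
mkdir -p "$dest_dir"
echo "main" >> "$log"

# Post-script
echo "post" >> "$log"

# Collect the recorded stage order on one line, then clean up
stages=$(paste -s -d, "$log")
echo "$stages"
rm -rf "$DATA_DIR"
```

The pre-script runs before the target directory is prepared and the post-script runs after the retrieval, which matches the order of the generated snippets in the examples.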

Examples

Configuration entry, assuming DATA_DIR is a well-defined suite variable.

data.yaml
static_data:
    large_datasets:
        type: link
        source: /path/to/dir

wellies data script snippet

# Main script for retrieving data
mkdir -p $DATA_DIR

dest_dir=$DATA_DIR/large_datasets
rm -rf $dest_dir
ln -sfn /path/to/dir $dest_dir
if [[ -L $dest_dir && -d $(readlink $dest_dir) ]]; then
    echo Link and directory exist
else
    echo Link or directory does not exist
    exit 1
fi
cd $dest_dir
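
To see what the check in the snippet guards against, the following sketch wraps an equivalent test in a function and runs it on a valid link and on a dangling one. The function name check_link and the sandbox directory are illustrative assumptions, and the test is written with POSIX `[` rather than `[[` so it runs in any sh.

```shell
# Sketch only: an equivalent of the link check from the generated snippet.
check_link() {
    # $1 is the path of the symbolic link to verify
    if [ -L "$1" ] && [ -d "$(readlink "$1")" ]; then
        echo "Link and directory exist"
    else
        echo "Link or directory does not exist"
        return 1
    fi
}

sandbox=$(mktemp -d)                             # throwaway working area
mkdir -p "$sandbox/source"
ln -sfn "$sandbox/source" "$sandbox/good"        # target exists
ln -sfn "$sandbox/missing" "$sandbox/dangling"   # target does not exist

good_msg=$(check_link "$sandbox/good")
dangling_msg=$(check_link "$sandbox/dangling" || true)
echo "$good_msg"
echo "$dangling_msg"
rm -rf "$sandbox"
```

A link whose target is missing is still a link, so the `-L` test alone would pass; the additional `-d` test on the link target is what makes the script fail.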

Copy data

Configuration entry, assuming DATA_DIR is a well-defined suite variable.

data.yaml
static_data:
    copy_data:
        type: copy
        source: /path/to/dir/file.txt
        post_script: "echo 'Copy Done'"

wellies data script snippet

# Main script for retrieving data
mkdir -p $DATA_DIR

dest_dir=$DATA_DIR/copy_data
rm -rf $dest_dir
mkdir -p $dest_dir
scp /path/to/dir/file.txt  $dest_dir/
cd $dest_dir

# Post-script
echo 'Copy Done'

Or, if using the files option:

data.yaml
static_data:
    copy_data:
        type: copy
        source: /path/to/dir
        files: file.txt
        post_script: "echo 'Copy Done'"

The resulting script is equivalent:

# Main script for retrieving data
mkdir -p $DATA_DIR

dest_dir=$DATA_DIR/copy_data
rm -rf $dest_dir
mkdir -p $dest_dir
scp /path/to/dir/file.txt  $dest_dir/
cd $dest_dir

# Post-script
echo 'Copy Done'

Rsync data

Configuration entry, assuming DATA_DIR is a well-defined suite variable. In this example, post_script refers to an existing script file. A path that is not absolute is resolved relative to the directory where the deployment was executed.

data.yaml
static_data:
    copy_data:
        type: rsync
        source: hpc-login:/path/to/dir/
        files: 
          - dis.nc
          - scov.nc
        rsync_options: "-avz"
        post_script: "install.sh"

wellies data script snippet

# Main script for retrieving data
mkdir -p $DATA_DIR

dest_dir=$DATA_DIR/copy_data
rsync -avz /path/to/dir/dis.nc /path/to/dir/scov.nc  $dest_dir/
cd $dest_dir

# Post-script
echo 'running after every other command'
echo 'bye  bye'
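
The way a source directory and a files list combine into a single command line can be sketched as below. This is an illustration under assumptions, not the wellies implementation: the variable names are hypothetical, the paths mirror the generated snippet, and the command is printed rather than executed.

```shell
# Sketch only: build one rsync command line from a source and a files list.
source_dir="/path/to/dir/"   # trailing slash is stripped below
rsync_options="-avz"

paths=""
for f in dis.nc scov.nc; do
    # Append "<source>/<file>", separated by single spaces
    paths="${paths:+$paths }${source_dir%/}/$f"
done

# \$dest_dir is kept literal so the printed line resembles the generated script
cmd="rsync $rsync_options $paths \$dest_dir/"
echo "$cmd"
```

Each listed file becomes one source argument, so a single rsync call transfers all of them into the destination directory.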

Git data

Configuration entry, assuming DATA_DIR is a well-defined suite variable.

data.yaml
static_data:
  git_data:
    type: git
    source: "git.example.com/repo.git"
    branch: main
    pre_script: "git config --global user.name 'John Doe'"

wellies data script snippet

# Pre-script
git config --global user.name 'John Doe'

# Main script for retrieving data
mkdir -p $DATA_DIR

dest_dir=$DATA_DIR/git_data
rm -rf $dest_dir
giturl=git.example.com/repo.git
gitbranch=main
git clone $giturl --branch $gitbranch --single-branch --depth 1 $dest_dir
cd $dest_dir

If files is specified, the repository is first cloned into a temporary directory:

data.yaml
static_data:
    git_data:
        type: git
        source: "git.example.com/repo.git"
        branch: main
        files:
          - static/dem.nc
          - static/metadata_template.grb

wellies data script snippet

# Main script for retrieving data
mkdir -p $DATA_DIR

dest_dir=$DATA_DIR/git/git_data
rm -rf $dest_dir
giturl=git.example.com/repo.git
gitbranch=main
git clone $giturl --branch $gitbranch --single-branch --depth 1 $dest_dir
cd $dest_dir


dest_dir=$DATA_DIR/git_data
rsync -avzpL $DATA_DIR/git/git_data/static/dem.nc $DATA_DIR/git/git_data/static/metadata_template.grb  $dest_dir/
cd $dest_dir

echo 'cleaning build directory'
rm -rf $DATA_DIR/git/git_data

ECFS data

Configuration entry, assuming DATA_DIR is a well-defined suite variable.

data.yaml
static_data:
    ecfs_data:
        type: ecfs
        source: "ec:/path/to/data"
        post_script: "cd $DATA_DIR/ecfs_data && tar -xvf *.tar && cd -"

Note

For remote datasets the host needs to be specified, even for the ECFS type.

wellies data script snippet

# Main script for retrieving data
mkdir -p $DATA_DIR

dest_dir=$DATA_DIR/ecfs_data
rm -rf $dest_dir
mkdir -p $dest_dir
ecp ec:/path/to/data  $dest_dir/
cd $dest_dir

# Post-script
cd $DATA_DIR/ecfs_data && tar -xvf *.tar && cd -

MARS data

Configuration entry, assuming DATA_DIR is a well-defined suite variable.

data.yaml
static_data:
    mars_data:
        type: mars
        request:
          class: od
          type: an
          expver: "1"
          date: "19990215"
          time: "12"
          param: t
          levtype: "pressure level"
          levelist: [1000, 850, 700, 500]
          target: t.grb
        post_script: "pproc-interpol --grid SMUFF-OPERA-2km-proj t.grb t_2km.grb"

Note

The MARS data type does not accept a source option.

wellies data script snippet

# Main script for retrieving data
mkdir -p $DATA_DIR
dest_dir=$DATA_DIR/mars_data
mkdir -p $dest_dir
cd $dest_dir

mars << EOF
retrieve,
  class=od,
  type=an,
  expver=1,
  date=19990215,
  time=12,
  param=t,
  levtype=pressure level,
  levelist=1000/850/700/500,
  target="t.grb"
EOF

# Post-script
pproc-interpol --grid SMUFF-OPERA-2km-proj t.grb t_2km.grb
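
Note how the YAML list given for levelist appears in the request as a single value separated by "/". A minimal sketch of that joining step, assuming a hypothetical helper named join_levels (wellies itself may do this differently):

```shell
# Sketch only: join list items with "/" as seen in the generated request.
join_levels() (
    # Runs in a subshell, so the IFS change does not leak to the caller.
    IFS=/
    echo "$*"
)

levelist=$(join_levels 1000 850 700 500)
printf 'retrieve,\n  levelist=%s,\n  target="%s"\n' "$levelist" "t.grb"
```

Scalar keys such as class or param are written unchanged as key=value pairs; only list values need this extra step.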

Custom data

data.yaml
static_data:
    custom_data:
        type: custom
        pre_script: "retrieve_data.sh"
        post_script: 
          - "cd $DATA_DIR/custom_data"
          - "tar -xvf *.tar"
          - "cd -"

wellies data script snippet

# Pre-script
echo 'running retrieve_data.sh script contents'
echo 'end of script'


# Main script for retrieving data
mkdir -p $DATA_DIR
# Running custom data command
# Post-script
cd $DATA_DIR/custom_data
tar -xvf *.tar
cd -