Intro to Using Docker and Singularity

Introduction

What is a container

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. This is useful where there are many complex dependencies which are updated on different timescales. It also allows a version of the software, together with those dependencies, to be tested and then deployed repeatedly, knowing that everything is identical and therefore the system will work as intended.

Why Docker

Docker (https://www.docker.com) is the most widely used container system and is available for multiple platforms. It is a lightweight container system which creates a stand-alone executable package including everything needed to run the software: code, runtime, system tools, system libraries and settings.

A single container can then be executed on multiple platforms (i.e., Linux, MacOS and Windows) without making any changes to the software or having to re-build it. The container can also be tagged to create a unique version of the system encapsulated within the container, which can be reused at a later point, ensuring that a result can be reproduced.

In this case, whether the software is built via conda-forge (https://conda-forge.org) or Spack (https://spack.readthedocs.io), we can encapsulate the tools and dependencies within a Docker container for re-use.

Singularity

Docker cannot generally be used on a shared system (e.g., HPC) as it requires root (administrator) access, which is not always possible or desirable. However, an alternative is available which can import Docker containers, called Singularity (https://sylabs.io/singularity/), and this specifically supports HPC systems. However, Singularity is only available for Linux, with MacOS support under development.

Singularity is able to import a Docker image and can therefore easily be used in place of Docker, with the advantage that the container image is stored as a single file within your file system rather than centrally by the system.

However, as Docker is supported across all platforms, the rest of this tutorial will focus on using Docker.

Docker Installation

MacOS and Windows

To install Docker, log in to https://hub.docker.com and follow the instructions to download and install the software.

Linux

Ubuntu / Debian

Instructions are available from https://docs.docker.com/install/linux/docker-ce/ubuntu/ but if you have not previously had Docker installed then the following commands can be used:

sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io

CentOS

Instructions are available from https://docs.docker.com/install/linux/docker-ce/centos/ but if you have not previously had Docker installed then the following command can be used:

sudo yum install docker-ce docker-ce-cli containerd.io

Fedora

Instructions are available from https://docs.docker.com/install/linux/docker-ce/fedora/ but if you have not previously had Docker installed then the following command can be used:

sudo dnf install docker-ce docker-ce-cli containerd.io
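
Whichever platform you are using, once installation is complete you can check that Docker is working from a terminal (on Linux you may need to start the Docker service first and prefix the commands with sudo, depending on your setup):

docker --version
docker run hello-world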

Useful Images/Containers

NOTE Terminology: Docker has images and containers. A Docker image is a set of files which have no state, whereas a Docker container is an instantiation of an image. In other words, a Docker container is the runtime instance of an image.
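
As a minimal illustration of the difference, using the standard hello-world image: pulling downloads the image, while each run creates a new container from it.

docker pull hello-world   # download the image (static files, no state)
docker run hello-world    # create and run a container (a runtime instance)
docker run hello-world    # running again creates a second, separate container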

At https://hub.docker.com you will find many useful Docker images which have been built and are ready to use.

The Docker images we will be using for the remainder of this tutorial were built by me and can be browsed at: https://hub.docker.com/u/petebunting.

au-eoed

This image contains the released version of RSGISLib (https://www.rsgislib.org) and ARCSI (https://arcsi.remotesensing.info) and their dependencies, such as GDAL (https://www.gdal.org). This is the image which most users will want to use for remote sensing (i.e., satellite imagery) data analysis.

More information at: https://hub.docker.com/r/petebunting/au-eoed

It can be installed using the following command:

docker pull petebunting/au-eoed

If you were to use Singularity then the command to create this image would be:

singularity pull docker://petebunting/au-eoed

au-eoed-dev

This image contains the latest development versions of RSGISLib (https://www.rsgislib.org), ARCSI (https://arcsi.remotesensing.info) and EODataDown (https://eodatadown.remotesensing.info) and their dependencies, such as GDAL (https://www.gdal.org).

This image should not be used in most cases, as it changes regularly and at times may contain versions of the software which are broken. However, it does contain the very latest versions of the various software packages and dependencies.

More information at: https://hub.docker.com/r/petebunting/au-eoed-dev

It can be installed using the following command:

docker pull petebunting/au-eoed-dev

If you were to use Singularity then the command to create this image would be:

singularity pull docker://petebunting/au-eoed-dev

spdlib

This image contains the released version of SPDLib (https://www.spdlib.org) and its dependencies, such as GDAL (https://www.gdal.org).

This is the image which most users will want to use for LiDAR data analysis.

More information at: https://hub.docker.com/r/petebunting/spdlib

It can be installed using the following command:

docker pull petebunting/spdlib

If you were to use Singularity then the command to create this image would be:

singularity pull docker://petebunting/spdlib

Docker Basics

Once you have pulled your Docker image it is installed on your system. To see which images you have downloaded to your system, use the following command:

docker images

To see which containers you have running, you can use the following command:

docker ps
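
By default this only lists running containers; to also see stopped containers, add the -a flag:

docker ps -a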

If you find that Docker is using a lot of storage space on your machine then the following command can be used to delete an image from your system:

docker rmi <IMAGE ID>
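
The <IMAGE ID> value is shown in the output of docker images; you can also remove an image by its name and tag, for example:

docker rmi petebunting/au-eoed:latest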

If you want to remove all unused Docker images and containers from your machine, this can be done with the following command:

docker system prune -a 

By default the latest tag is pulled. If you want to pull a Docker image with a specific tag, append a colon (:) and then the tag name, as shown below for the petebunting/au-eoed-dev image to download the image tagged 20200129:

docker pull petebunting/au-eoed-dev:20200129
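
The tag is referenced in the same way when running the image (the run command is explained in the next section), for example:

docker run -i -t petebunting/au-eoed-dev:20200129 /bin/bash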

Running The Container Terminal

The simplest way to use the Docker image is to log into the container at a terminal prompt. At this point you will have access to the software installed within the Docker image, just as you would from the terminal on your own local machine. Running the following command will achieve this; type exit to leave the container:

docker run -i -t petebunting/au-eoed /bin/bash

Once within the container, try running a command such as gdalinfo --formats to check the system is working.

However, you will notice that you do not have access to your files. To get access to your local file system you need to mount it within the Docker container, as shown below. NOTE: the variable ${PWD} is a reference to the current location (i.e., where in your file system you have run the docker command from); this is mapped onto the /data directory within the Docker container.

docker run -i -t -v ${PWD}:/data petebunting/au-eoed /bin/bash

From the terminal prompt within the Docker container you can now navigate to the /data directory; if you list the contents of the directory you will find the same files as in the location where you executed the docker run command.

cd /data
ls -lh

You can also specify a specific local path to be mapped, for example:

docker run -i -t -v /scratch/MyCoolData:/data petebunting/au-eoed /bin/bash

Please note that you will now have to reference all your paths relative to /data and not the local paths on the machine you are working from. Also, all the data and scripts you want to use need to be available under /data.
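
For example, with the mapping -v /scratch/MyCoolData:/data above, a file stored at /scratch/MyCoolData/img.kea on your machine (an illustrative file name) would be referenced within the container as /data/img.kea.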

Using ARCSI in Docker

To run ARCSI using the Docker image, you use the same command as you would otherwise have done, but you need to prepend the Docker command and remember that all the file paths are relative to the mount point within the Docker container.

docker run -i -t -v /scratch:/data petebunting/au-eoed arcsi.py -s ls5tm \
-p CLOUDS DOSAOTSGL STDSREF SATURATE TOPOSHADOW FOOTPRINT METADATA \
-o /data/Outputs/ --tmpath /data/tmp --dem /data/UKSRTM_90m.kea \
--k  clouds.kea meta.json sat.kea toposhad.kea valid.kea stdsref.kea \
--stats --format KEA \
-i /data/Input/LT05_L1TP_203024_19950815_20180217_01_T1_MTL.txt

docker run -i -t -v /scratch:/data petebunting/au-eoed arcsi.py -s sen2 \
-p CLOUDS DOSAOTSGL STDSREF SATURATE TOPOSHADOW FOOTPRINT METADATA SHARP \
-o /data/Outputs  --dem /data/UKSRTM_90m.kea --tmpath /data/tmp \
--k  clouds.kea meta.json sat.kea toposhad.kea valid.kea stdsref.kea \
--stats --format KEA \
-i /data/S2A_MSIL1C_20170617T113321_N0205_R080_T30UVD.SAFE/MTD_MSIL1C.xml

Using GDAL Tools in Docker

Using one of the GDAL tools is similar to ARCSI, in that the commands are all the same but you need to update the file paths to be relative to the mount point in the Docker container. For example:

docker run -i -t -v /scratch:/data petebunting/au-eoed gdal_translate \
-of GTIFF /data/input_img.kea /data/output_img.tif

Running Python (RSGISLib) in Docker

Again, the only change needed relates to the file paths used within the Python script. For example, the following Python script (saved as calc_ndvi.py):

import rsgislib.imagecalc.calcindices

img = '/data/landsat_img.kea'
out_img = '/data/landsat_ndvi.kea'
# Bands 3 and 4 are the red and NIR bands for Landsat 5 TM
rsgislib.imagecalc.calcindices.calcNDVI(img, 3, 4, out_img)

This can be executed using the following Docker command:

docker run -i -t -v /scratch:/data petebunting/au-eoed python /data/calc_ndvi.py

Using Singularity instead of Docker

This is pretty straightforward and very similar to Docker. Once you have pulled your Singularity image (i.e., from Docker Hub as shown above), you will have a file for the image on your system:

-rwxr-xr-x 1 pete pete 753M  Sep 24 20:15 au-eoed-20191015.sif
-rwxr-xr-x 1 pete pete 1000M Sep 24 17:02 au-eoed-dev-20200129.sif

You can remove images by simply deleting the files for the images you no longer wish to have on your system. I would recommend including the tag name (e.g., 20200129) in your Singularity image name -- you might need to rename the file once it has downloaded. You choose the image to run by using the full path to the image file.
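
For example, the pulled file could be renamed to include the tag (the file names here are illustrative):

mv au-eoed-dev_latest.sif au-eoed-dev-20200129.sif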

ARCSI

To run ARCSI using the Singularity image you use the same command as you would otherwise have done, but you need to prepend the Singularity command.

singularity exec --bind /scratch:/data /scratch/sw_imgs/au-eoed-20191015.sif \
arcsi.py -s ls5tm -p CLOUDS DOSAOTSGL STDSREF SATURATE TOPOSHADOW FOOTPRINT METADATA \
-o /data/Outputs/ --tmpath /data/tmp --dem /data/UKSRTM_90m.kea \
--k  clouds.kea meta.json sat.kea toposhad.kea valid.kea stdsref.kea \
--stats --format KEA \
-i /data/Input/LT05_L1TP_203024_19950815_20180217_01_T1_MTL.txt

GDAL

singularity exec --bind /scratch:/data /scratch/sw_imgs/au-eoed-20191015.sif \
gdal_translate -of GTIFF /data/input_img.kea /data/output_img.tif

Python (RSGISLib)

For example, the following Python script (saved as calc_ndvi.py):

import rsgislib.imagecalc.calcindices

img = '/data/landsat_img.kea'
out_img = '/data/landsat_ndvi.kea'
# Bands 3 and 4 are the red and NIR bands for Landsat 5 TM
rsgislib.imagecalc.calcindices.calcNDVI(img, 3, 4, out_img)

This can be executed using the following Singularity command:

singularity exec --bind /scratch:/data /scratch/sw_imgs/au-eoed-20191015.sif python /data/calc_ndvi.py
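
If you want an interactive terminal within the Singularity container, equivalent to the docker run -i -t example earlier, Singularity provides a shell command; a minimal sketch using the same bind mount:

singularity shell --bind /scratch:/data /scratch/sw_imgs/au-eoed-20191015.sif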

Batch Processing Landsat with Docker and ARCSI

ARCSI has tools for bulk downloading and processing Landsat and Sentinel-2 data, but these require the Google Cloud SDK. The following commands use two different Docker images (petebunting/au-eoed-dev and google/cloud-sdk).

Download Landsat Database

The following command will download and build a local SQLite database of all the Landsat images collected globally:

docker run -i -t -v ${PWD}:/data petebunting/au-eoed-dev \
arcsisetuplandsatdb.py -f /data/lsgoog_db_20190924.sqlite

Search Landsat for Scenes

The following command searches the local database for path 227 and row 63, selecting Tier 1 (T1) scenes with a cloud cover of less than 50 percent.

docker run -i -t -v ${PWD}:/data petebunting/au-eoed-dev \
arcsigenlandsatdownlst.py -f /data/lsgoog_db_20190924.sqlite \
-p 227 -r 63 -o /data/ls_scns_dwnld.sh --outpath /data/ls_dwn \
--collection T1 --cloudcover 50 --multi --lstcmds

Setup Google SDK

If this is the first time you are using the Google Cloud SDK Docker image, then you will need to authenticate using the following command:

docker run -ti --name gcloud-config google/cloud-sdk gcloud auth login

Download Landsat Images

To download the scenes which were found by querying the database and written to the file /data/ls_scns_dwnld.sh, use the following command:

docker run -ti -e CLOUDSDK_CONFIG=/config/mygcloud \
-v ${PWD}/mygcloud:/config/mygcloud -v ${PWD}:/certs \
-v ${PWD}:/data  google/cloud-sdk sh /data/ls_scns_dwnld.sh

Generate ARCSI Commands

The following command can be used to generate ARCSI commands for processing all the scenes which have been downloaded:

docker run -i -t -v ${PWD}:/data petebunting/au-eoed-dev \
arcsibuildcmdslist.py -s LANDSAT -f KEA --stats \
-p CLOUDS DOSAOTSGL STDSREF --outpath /data/ls_ard \
--dem /data/srtm/srtm_3arc.kea --tmpath /data/tmp \
--keepfileends stdsref.kea clouds.kea -i /data/ls_dwn \
-e "*MTL.txt" -o /data/ard_arcsi_cmds.sh

Run ARCSI Commands with Docker

The commands outputted do not have the Docker command prepended, so that needs to be added at the front of every line. So that more cores can be used for the processing, the following command can be used to split the commands into 4 output files:

docker run -i -t -v ${PWD}:/data petebunting/au-eoed-dev \
splitcmdslist.py -i /data/ard_arcsi_cmds.sh \
-p "docker run -i -t -v ${PWD}:/data petebunting/au-eoed-dev" \
-o /data/ard_arcsi_cmds_splt.sh -f 4

The output files can then be executed in different terminal windows using the following commands:

sh ard_arcsi_cmds_splt_1.sh
sh ard_arcsi_cmds_splt_2.sh
sh ard_arcsi_cmds_splt_3.sh
sh ard_arcsi_cmds_splt_4.sh
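
Alternatively, if you would rather launch them all from a single terminal, the scripts can be run in the background (a simple sketch; the log file names are illustrative):

nohup sh ard_arcsi_cmds_splt_1.sh > splt_1.log 2>&1 &
nohup sh ard_arcsi_cmds_splt_2.sh > splt_2.log 2>&1 &
nohup sh ard_arcsi_cmds_splt_3.sh > splt_3.log 2>&1 &
nohup sh ard_arcsi_cmds_splt_4.sh > splt_4.log 2>&1 &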

Conclusion

Docker and Singularity are a really useful set of tools which solve a lot of the challenges with software installation and deployment, and with transferring workflows between different machines (e.g., local laptop, workstation, HPC and Amazon or Google Cloud).

It is recommended that you develop your workflow and scripts on your local machine, probably using Docker, and then deploy onto the HPC or high performance workstation using Singularity.
