Data transfer

Quite often large amounts of data are generated, and it can be worth spending some time considering how to transfer the data from the data producer to the storage and analysis environment. Consider the capacity of the internet connection: transfer over a low-bandwidth network can be so time-consuming that it might be faster and easier to send the data on a hard drive by courier.
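
As a rough illustration of this trade-off, the sketch below estimates how long a transfer would take at a given sustained bandwidth. The function name transfer_hours and the example figures are illustrative assumptions, and real transfers are usually slower than the nominal line rate because of protocol overhead and shared links.

```shell
# Estimate transfer time in hours (hypothetical helper).
#   $1: data size in gigabytes (1 GB = 10^9 bytes)
#   $2: sustained bandwidth in megabits per second
transfer_hours() {
  awk -v gb="$1" -v mbps="$2" 'BEGIN {
    seconds = (gb * 8 * 1000) / mbps   # gigabytes to megabits, divided by the rate
    printf "%.1f\n", seconds / 3600
  }'
}

# Example: 10 TB over a 100 Mbit/s connection
transfer_hours 10000 100
```

The example prints 222.2, i.e. more than nine days of continuous transfer, at which point shipping a disk may well be quicker.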

SciLifeLab Data Delivery System

The Data Delivery System (DDS) is a cloud-based system for the delivery of data from SciLifeLab platforms to their users. It consists of a command line interface (CLI) and a web interface. The system is used, for example, by the National Genomics Infrastructure (NGI) for deliveries of sequencing data.

Uppmax

Please find below some useful links from the compute resource Uppmax regarding data transfer:

Using Aspera on Uppmax

Aspera (ascp) is a command-line transfer program that can be used for stable file transfers, e.g. from Uppmax to the ENA (European Nucleotide Archive) upload area when making a submission. ascp offers many options (run ascp --help to list them). Below is an example in which a set of fastq files is uploaded to a subfolder at ENA:

  1. Open a terminal window and log in to Uppmax using your credentials:
    ssh -X username@rackham.uppmax.uu.se
  2. Use interactive mode to run the transfer on a compute node under a compute project:
    interactive -A project-name
  3. To make the ascp command available, load its module:
    module load ascp
  4. Aspera reads an environment variable in which you can store your password for the remote/receiving site (in this case ENA), so you don’t have to type it when executing the transfer command:
    export ASPERA_SCP_PASS='my-ENA-password'
  5. The command below copies all fastq.gz files found in all subfolders under ‘path-to-uppmax-folder’ to the upload area of the user ‘Webin-XXXXX’ (replace with your login name at ENA) at ENA, under the subfolder ‘subfolder-at-ENA’ (which will be created if it doesn’t already exist):
    ascp --file-checksum=md5 -d -k 3 --mode=send --overwrite=always -QT -l300M --host=webin.ebi.ac.uk --user=Webin-XXXXX path-to-uppmax-folder/**/*.fastq.gz subfolder-at-ENA
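
Before running the transfer in step 5, it can be useful to preview which files will be sent. In bash, the ** pattern only descends into subfolders when the globstar option is enabled (shopt -s globstar); otherwise it behaves like a single *. The sketch below sidesteps that by using find to list the files; the function name list_fastq is a hypothetical helper, not part of Aspera or Uppmax.

```shell
# Print every .fastq.gz file below the given folder, one per line.
#   $1: top-level folder (e.g. path-to-uppmax-folder)
list_fastq() {
  find "$1" -type f -name '*.fastq.gz' | sort
}

# Example: count the files that would be uploaded
# list_fastq path-to-uppmax-folder | wc -l
```

If the count is lower than expected when using the ** pattern directly, checking whether globstar is enabled in the current shell is a reasonable first step.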

Dardel

  • Dardel, the compute cluster at Parallelldatorcentrum (PDC), KTH, offers multiple ways of transferring files to and from your local machine; see the PDC documentation on File transfer.

  • In the future, Dardel will have dedicated nodes for transferring large files (see Nodes for file operations), but at the moment transfers can be done directly on the login node (dardel.pdc.kth.se).

Using Aspera on Dardel

  • There is an Aspera client available via command ml aspera-cli/3.9.6.1467.159c5b1

    • Note: Command ml avail aspera-cli lists the available versions
  • Newer versions of Aspera exist though, and can be installed locally on Dardel:

    1. To install Aspera locally, run the following commands after logging in:
      curl -fsSL https://github.com/rbenv/rbenv-installer/raw/HEAD/bin/rbenv-installer | bash
      source ~/.bashrc
      rbenv install 3.2.2
      rbenv global 3.2.2
      gem install aspera-cli
      ascli --version
      Note: The instructions only work if you are in a bash shell. If in doubt, run echo $0; if it doesn’t reply with bash, type bash to switch. Also, if you don’t have a file named .bashrc in your home directory, you can instead type source ~/.bash_profile.
    2. Run ascli conf ascp install
    3. Check current version using ascli conf ascp info
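
The choice of startup file described in the note to step 1 can be sketched as a small helper. The function name startup_file is hypothetical; it simply prints the first of ~/.bashrc and ~/.bash_profile that exists, so that the source command does not fail on a missing file.

```shell
# Print the bash startup file to source after installing rbenv.
#   $1: home directory to inspect (defaults to $HOME)
startup_file() {
  home_dir="${1:-$HOME}"
  if [ -f "$home_dir/.bashrc" ]; then
    printf '%s\n' "$home_dir/.bashrc"
  elif [ -f "$home_dir/.bash_profile" ]; then
    printf '%s\n' "$home_dir/.bash_profile"
  else
    echo "no bash startup file found in $home_dir" >&2
    return 1
  fi
}

# Example (in a bash shell):
# source "$(startup_file)"
```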

Then, to upload to the European Nucleotide Archive (ENA) interactively using the locally installed Aspera:

  1. Run ascp with the options you need, for example:

    ~/.aspera/sdk/ascp -k 3 -d -q --mode=send -QT -l300M --host=webin.ebi.ac.uk --user=Webin-XXXXX /local/path/to/*.gz /
  2. Enter the password if/when prompted. To avoid being prompted for the password, export it first: read -s ASPERA_SCP_PASS && export ASPERA_SCP_PASS

Note: In order to check the progress and outcome of the transfer, a program such as FileZilla can be used to connect to your upload area at ENA from your local computer.
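
To complement a visual check of the upload area, MD5 checksums of the local files can be recorded before the transfer and compared later with whatever the receiving side reports. The sketch below assumes that the md5sum tool is available on the system; the function name write_md5 and the output file name are illustrative.

```shell
# Write "<md5>  <filename>" lines for all .gz files in a folder.
#   $1: folder containing the files to upload
#   $2: output file for the checksums
write_md5() {
  # The subshell keeps the caller's working directory unchanged,
  # and the file names in the output stay free of directory prefixes.
  ( cd "$1" && md5sum -- *.gz ) > "$2"
}

# Example:
# write_md5 /local/path/to checksums.md5
```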

Learn more about uploading files to ENA

Learn more about the ascp command

Transferring files using Rclone

Rclone is a command-line program that can be used to transfer files across a wide range of protocols. This can be useful when you are unable to use specialised submission tools or Aspera, for example when transferring files in bulk to SciLifeLab Data Repository over the FTPS protocol.

The following example describes how to upload files to SciLifeLab Data Repository (or any other Figshare repository):

  1. Find/create your username and password for FTP uploads to Figshare
  2. To configure your FTP connection parameters for rclone, your command will look something like this (rclone lsf :ftp:data lists the content of your data uploads folder on Figshare):
    rclone lsf :ftp:data --ftp-host=ftps.figshare.com --ftp-user=$user --ftp-pass=$(rclone obscure $pass) --ftp-port=21 --ftp-explicit-tls
  3. Use rclone copy path/to/localfile :ftp:data/new-data-item-title-folder to upload (use the same configuration flags as above)
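
Because steps 2 and 3 share the same connection flags, they can be wrapped in a small shell function so that the flags are written only once. This is a sketch: the function name figshare_rclone and the variables FIGSHARE_USER and FIGSHARE_PASS are hypothetical names, while the rclone flags are the ones used in step 2.

```shell
# Run any rclone subcommand with the Figshare FTPS connection flags appended.
# Assumes FIGSHARE_USER and FIGSHARE_PASS (hypothetical names) hold the
# FTP username and password from step 1.
figshare_rclone() {
  rclone "$@" \
    --ftp-host=ftps.figshare.com \
    --ftp-user="$FIGSHARE_USER" \
    --ftp-pass="$(rclone obscure "$FIGSHARE_PASS")" \
    --ftp-port=21 \
    --ftp-explicit-tls
}

# Examples:
# figshare_rclone lsf :ftp:data
# figshare_rclone copy path/to/localfile :ftp:data/new-data-item-title-folder
```

Keeping the credentials in variables rather than typing them inline also keeps the password out of the command text for each individual transfer.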

Learn more about Rclone

Learn more about FTP uploads to Figshare

Resources

Please find below resources concerning data transfer in the form of training, guidance and/or tools.

Guiding resources

Tools