Data transfer
Large amounts of data are often generated, and it is worth spending some time considering how to transfer them from the data producer to the storage and analysis environment. Consider the capacity of the internet connection: transfer over a low-bandwidth network can be so time-consuming that it might be faster and easier to send the data on a hard drive through a carrier service.
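To judge whether a network transfer is realistic, a rough estimate of the transfer time can help. The sketch below is only an illustration: the function name estimate_transfer_hours is hypothetical, and it assumes decimal units (1 GB = 8000 megabits), a constant sustained rate, and no protocol overhead, so real transfers will usually take longer.

```shell
#!/usr/bin/env bash
# Rough transfer-time estimate (illustrative helper, not part of any tool).
# Arguments: data size in gigabytes, sustained link speed in megabits per second.
# Assumes 1 GB = 8000 megabits and ignores protocol overhead.
estimate_transfer_hours() {
    local size_gb="$1" mbit_per_s="$2"
    # LC_ALL=C keeps the decimal separator a dot regardless of locale
    LC_ALL=C awk -v gb="$size_gb" -v mbps="$mbit_per_s" \
        'BEGIN { printf "%.1f\n", (gb * 8000) / mbps / 3600 }'
}

# Example: 2 TB (2000 GB) over a 100 Mbit/s connection
estimate_transfer_hours 2000 100   # prints 44.4
```

With these example numbers the transfer would occupy the connection for almost two days, which is the kind of case where shipping a hard drive may be worth considering.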
SciLifeLab Data Delivery System
The Data Delivery System (DDS) is a cloud-based system for the delivery of data from SciLifeLab platforms to their users. It consists of a command line interface (CLI) and a web interface. The system is used, for example, by the National Genomics Infrastructure (NGI) for deliveries of sequencing data.
Uppmax
Please find below some useful links from the compute resource Uppmax regarding data transfer:
Using Aspera on Uppmax
Aspera (ascp) is a command-line transfer program that can be used for stable file transfers, e.g. from Uppmax to the ENA (European Nucleotide Archive) upload area when making a submission. ascp gives the user many options (run ascp --help to list them). Below is an example where a set of fastq files is uploaded to a subfolder at ENA:
- Open a terminal window and log in to Uppmax using your credentials:
    ssh -X username@rackham.uppmax.uu.se
- Use interactive mode in order to execute the transfer at a compute node using a compute project:
    interactive -A project-name
- In order to be able to use the ascp command, it needs to be activated by typing:
    module load ascp
- Aspera has an environment variable that you can use in order to add your password at the remote/receiving site (in this case ENA) to memory, so you don’t have to type it when executing the transfer command:
    export ASPERA_SCP_PASS='my-ENA-password'
- The command below will copy all fastq.gz files found in all subfolders under ‘path-to-uppmax-folder’ to the user ‘Webin-XXXXX’ (exchange to your login-name at ENA) upload area at ENA under subfolder ‘subfolder-at-ENA’ (which will be created if it doesn’t already exist):
    ascp --file-checksum=md5 -d -k 3 --mode=send --overwrite=always -QT -l300M --host=webin.ebi.ac.uk --user=Webin-XXXXX path-to-uppmax-folder/**/*.fastq.gz subfolder-at-ENA
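Before starting a large upload it can be useful to preview which files the path-to-uppmax-folder/**/*.fastq.gz pattern actually expands to. In bash, ** only descends into nested subfolders when the globstar option is enabled; otherwise it behaves like a single * and matches one directory level. The sketch below assumes bash 4 or later, and the helper name list_fastq_files is hypothetical.

```shell
#!/usr/bin/env bash
# Preview the files a recursive *.fastq.gz pattern would hand to ascp.
# The function body runs in a subshell, so the shopt settings do not
# leak into the calling shell.
list_fastq_files() (
    # globstar: '**' matches zero or more directory levels
    # nullglob: a pattern with no matches expands to nothing
    shopt -s globstar nullglob
    files=( "$1"/**/*.fastq.gz )
    if (( ${#files[@]} > 0 )); then
        printf '%s\n' "${files[@]}"
    fi
)

# Prints nothing if the folder does not exist or holds no fastq.gz files
list_fastq_files path-to-uppmax-folder
```

If the preview lists fewer files than expected, running shopt -s globstar in the same shell before the ascp command is one way to make the pattern recurse.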
Dardel
- Dardel, the compute cluster at Parallelldatorcentrum (PDC), KTH, offers multiple ways of transferring files to and from your local machine; see the PDC documentation on File transfer.
- In the future, Dardel will have dedicated nodes for transferring large files (see Nodes for file operations), but at the moment transfers can be done directly on the login node (dardel.pdc.kth.se).
Using Aspera on Dardel
- There is an Aspera client available via the command:
    ml aspera-cli/3.9.6.1467.159c5b1
  - Note: The command ml avail aspera-cli lists the available versions.
- Newer versions of Aspera exist though, and can be installed locally on Dardel:
  - In order to install Aspera locally, run the following commands after logging in.
    Note: The instructions only work if you are in a bash shell. If in doubt, run echo $0, and if it doesn’t reply with bash, type bash to change. Also, in case you don’t have a file named .bashrc in your home directory, you can instead type source ~/.bash_profile.
      curl -fsSL https://github.com/rbenv/rbenv-installer/raw/HEAD/bin/rbenv-installer | bash
      source ~/.bashrc
      rbenv install 3.2.2
      rbenv global 3.2.2
      gem install aspera-cli
      ascli --version
  - Run:
      ascli conf ascp install
  - Check the current version using:
      ascli conf ascp info
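The shell check and the choice between .bashrc and .bash_profile described in the note can also be scripted. This is a minimal sketch, not part of the PDC instructions: the helper name pick_startup_file is hypothetical, and it relies on the fact that BASH_VERSION is only set when running in bash.

```shell
#!/usr/bin/env bash
# Print the startup file to source after running the rbenv installer:
# .bashrc when it exists in the given home directory, otherwise .bash_profile.
pick_startup_file() {
    local home_dir="${1:-$HOME}"
    if [ -f "$home_dir/.bashrc" ]; then
        echo "$home_dir/.bashrc"
    else
        echo "$home_dir/.bash_profile"
    fi
}

# Warn when the current shell is not bash
if [ -z "${BASH_VERSION:-}" ]; then
    echo "Not running in bash; type 'bash' first" >&2
fi

# Usage (after the installer has finished):
#   source "$(pick_startup_file)"
```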
Then, in order to upload to the European Nucleotide Archive (ENA) interactively using the locally installed Aspera:
- Fill in the desired ascp command:
    ~/.aspera/sdk/ascp -k 3 -d -q --mode=send -QT -l300M --host=webin.ebi.ac.uk --user=Webin-XXXXX /local/path/to/*.gz /
- Enter the password if/when prompted. In order to not be prompted for the password, export it first:
    read -s ASPERA_SCP_PASS && export ASPERA_SCP_PASS
Note: In order to check the progress and outcome of the transfer, a program such as FileZilla can be used to connect to your upload area at ENA from your local computer.
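The password step can be wrapped in a small helper that prompts only when ASPERA_SCP_PASS is not already set and fails cleanly when no terminal is available (for example in a batch job). This is a sketch under those assumptions; the function name ensure_aspera_pass is hypothetical, and read -s requires bash.

```shell
#!/usr/bin/env bash
# Make ASPERA_SCP_PASS available to ascp without echoing the password
# to the screen or writing it into the shell history.
ensure_aspera_pass() {
    if [ -z "${ASPERA_SCP_PASS:-}" ]; then
        if [ -t 0 ]; then
            # Interactive terminal: prompt with hidden input
            printf 'ENA password: ' >&2
            read -r -s ASPERA_SCP_PASS
            printf '\n' >&2
        else
            echo "ASPERA_SCP_PASS is not set and no terminal is available" >&2
            return 1
        fi
    fi
    export ASPERA_SCP_PASS
}

# Usage:
#   ensure_aspera_pass && ~/.aspera/sdk/ascp -k 3 -d -q --mode=send -QT -l300M \
#       --host=webin.ebi.ac.uk --user=Webin-XXXXX /local/path/to/*.gz /
#   unset ASPERA_SCP_PASS   # drop the password once the transfer is done
```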
Learn more about uploading files to ENA
Learn more about the ascp command
Transferring files using Rclone
Rclone is a command-line program that can be used to transfer files across a wide range of protocols. This can be useful when you are unable to use specialised submission tools or Aspera, for example when transferring files in bulk to SciLifeLab Data Repository over the FTPS protocol.
The following example describes how to upload files to SciLifeLab Data Repository (or any other FigShare repository):
- Find/create your username and password for FTP uploads to Figshare.
- To configure your FTP connection parameters for rclone, your command will look something like this (rclone lsf :ftp:data will list the content of your data uploads folder on FigShare):
    rclone lsf :ftp:data --ftp-host=ftps.figshare.com --ftp-user=$user --ftp-pass=$(rclone obscure $pass) --ftp-port=21 --ftp-explicit-tls
- Use rclone copy path/to/localfile :ftp:data/new-data-item-title-folder to upload (use the same configuration flags as above).
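Since the list and copy commands share the same connection flags, the flags can be kept in one place. The sketch below is one possible way to do that, using only the flags shown above; the function name figshare_rclone and the variable names FIGSHARE_USER and FIGSHARE_PASS are hypothetical.

```shell
#!/usr/bin/env bash
# Run any rclone subcommand against the Figshare FTPS endpoint with the
# connection flags from the steps above appended automatically.
# FIGSHARE_USER and FIGSHARE_PASS must hold the FTP credentials.
figshare_rclone() {
    if [ -z "${FIGSHARE_USER:-}" ] || [ -z "${FIGSHARE_PASS:-}" ]; then
        echo "Set FIGSHARE_USER and FIGSHARE_PASS first" >&2
        return 1
    fi
    rclone "$@" \
        --ftp-host=ftps.figshare.com \
        --ftp-user="$FIGSHARE_USER" \
        --ftp-pass="$(rclone obscure "$FIGSHARE_PASS")" \
        --ftp-port=21 \
        --ftp-explicit-tls
}

# Usage:
#   figshare_rclone lsf :ftp:data
#   figshare_rclone copy path/to/localfile :ftp:data/new-data-item-title-folder
```

Quoting the variables keeps usernames or passwords containing spaces or special characters from being split by the shell.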
Learn more about FTP uploads to FigShare
Resources
Please find below resources concerning data transfer in the form of training, guidance and/or tools.