Using Globus to Transfer and Share Big Data

Ashley DeVine and Mark Wance

Sucheta Godbole (left), bioinformatics analyst III, CCR Sequencing Facility Bioinformatics Group, consults with Mark Wance (seated), network architect, Information Systems Program (ISP), and Jeff Rife, systems administrator, ISP, on using the Globus service.

By Ashley DeVine, Staff Writer, and Mark Wance, Guest Writer; photo by Richard Frederickson, Staff Photographer

Editor's note: This article was updated April 30, 2018.

Transferring big data, such as the genomics data delivered to customers from the Center for Cancer Research Sequencing Facility (CCR SF), has been difficult in the past because the transfer systems have not kept pace with the size of the data. However, the situation is changing as a result of the Globus project.

Globus, NIH's Electronic Data Transfer Service, is a data transfer protocol that enables the transfer and sharing of large amounts of data between users all over the world. Globus provides dependable and consistent access to computing resources from multiple geographic locations worldwide. 

Researchers at the Frederick National Laboratory for Cancer Research can use Globus to securely transfer big data to and from the facility. Transferring the data gives researchers access to it for analysis and lets them disseminate the results to their colleagues or collaborators.

Globus Manages the Transfer While You Keep Working

Globus manages file transfers, monitors performance, retries failures, recovers from faults automatically, and reports the status of your data transfer. Once a user initiates a file transfer, it is performed in the background.

“It frees up your laptop or desktop to continue with the paper or article that you’re working on, while the data is transferring in the background,” said Mark Wance, network architect, Information Systems Program (ISP). “And you’ll get e-mails that tell you the status of your transfer.”

If a transfer fails, the system will automatically stop it, wait for a period of time, and then restart the transfer where it left off, Wance said.
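The stop, wait, and resume behavior Wance describes can be illustrated with a minimal sketch. This is not Globus code or its API; the names `transfer_with_resume`, `TransientError`, and `flaky_send` are hypothetical, and the "network" is simulated in memory. It only shows the general idea of retrying from the last confirmed position rather than from the beginning.

```python
import time


class TransientError(Exception):
    """Hypothetical error standing in for a dropped connection."""


def transfer_with_resume(source, send_chunk, chunk_size=4,
                         max_retries=5, wait_seconds=0.0):
    """Send `source` in chunks; after a failure, wait, then resume
    at the last confirmed offset instead of starting over.

    Returns the number of retries that were needed.
    """
    offset = 0   # bytes confirmed as delivered so far
    retries = 0
    while offset < len(source):
        chunk = source[offset:offset + chunk_size]
        try:
            send_chunk(offset, chunk)
        except TransientError:
            retries += 1
            if retries > max_retries:
                raise                      # give up after too many failures
            time.sleep(wait_seconds)       # wait for a period of time
            continue                       # retry the same offset
        offset += len(chunk)               # advance only after success
    return retries


# Simulated receiver that fails on its 2nd and 5th calls (an assumption
# made purely for demonstration).
received = bytearray()
calls = {"n": 0}


def flaky_send(offset, chunk):
    calls["n"] += 1
    if calls["n"] in (2, 5):
        raise TransientError("simulated network drop")
    received[offset:offset + len(chunk)] = chunk


data = b"ACGT" * 5
retry_count = transfer_with_resume(data, flaky_send)
```

In this sketch the offset advances only after a chunk is delivered, so a failure costs one chunk of repeated work rather than the whole transfer.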

“Globus is a very flexible, highly efficient, and robust file transfer system that is easy to use,” said Sean Davis, M.D., Ph.D., staff scientist, Genetics Branch, NCI Center for Cancer Research. “We also use Globus for file sharing of small files to huge file sets with ease comparable to Dropbox, but without requiring us to put our data in the cloud.”

The Challenges of Big Data Delivery before Globus

The CCR SF at the Advanced Technology Research Facility currently operates multiple platforms, including high-throughput next-generation and third-generation sequencing machines, in support of NCI investigators. It requires about 0.5 petabytes (PB) of storage space for data that it processes, according to Yongmei Zhao, manager of the CCR Sequencing Facility Bioinformatics Group.

The facility receives samples from customers, which are extracted from DNA, RNA, or other biological materials. The samples are processed, constructed into libraries, and loaded onto sequencers. A typical high-throughput sequencing run requires 4–6 terabytes (TB) of storage space for both sequencing and data processing. After the sequencing data are processed and analyzed, the result files are delivered to customers and can range in size from several hundred gigabytes (GB) to several TB.

“The delivery of such a large amount of data is a major challenge,” Zhao said. “That’s why we need Globus, a highly efficient data transfer protocol, to help us efficiently deliver the data to our customers.”

To get a sense of the size of this data, consider that the hard drive of a typical desktop computer holds about 1–2 TB. Multiplying 1 TB by 1,000 gives 1 PB of storage space, or about 1,000 individual 1-TB hard drives' worth of space.
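The same arithmetic can be applied to the facility's figure of about 0.5 PB. The short sketch below assumes decimal units (1 PB = 1,000 TB, as used in this article); the function names are illustrative only.

```python
import math

TB_PER_PB = 1_000  # decimal units, matching the article's conversion


def pb_to_tb(petabytes):
    """Convert petabytes to terabytes."""
    return petabytes * TB_PER_PB


def drives_needed(petabytes, drive_size_tb):
    """Number of desktop-sized drives needed to hold the given storage."""
    return math.ceil(pb_to_tb(petabytes) / drive_size_tb)


facility_pb = 0.5  # storage figure cited for the CCR SF
drives_1tb = drives_needed(facility_pb, 1)  # using 1-TB drives
drives_2tb = drives_needed(facility_pb, 2)  # using 2-TB drives
```

Under these assumptions, the facility's storage corresponds to roughly 250 to 500 typical desktop hard drives.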

Prior to Globus, NIH offered several other transfer protocols, including FTP, Secure FTP, SCP, and bbcp. “FTP is inherently insecure because it sends data and authentication information in plain text,” Zhao said. “Other secure data transfer protocol ensures the security, but comes at the price of the transfer speed.”

In some cases, researchers had to use other methods to transport their data, such as transferring it onto a hard drive, which was then shipped or hand-carried to the desired location.

Zhao previously used a different data transfer system that was much slower than Globus. With these other systems, transferring the data was a manual, time-consuming process in which the researcher had to monitor the transfer in case it stopped and needed to be restarted, according to Wance.

How to Set Up a Globus Account

If you are interested in using this service to transfer data to/from your Helix, Biowulf, or Frederick National Laboratory account, you can set up a Globus account at https://www.globus.org. Once your Globus profile is complete, use the links below for instructions on using and starting a file transfer:

Most data transfers will be initiated through the Globus.org website. If you have questions about using Globus, contact the Frederick IT Services Helpdesk at 301-846-5115 or by email at fredhelpdesk@nih.gov.

For more information on Globus and big data transfer, view the following links:

Data transfer and sharing using Globus

Globus website

NIH Biowulf Computational Cluster

Mark Wance is a network architect, Information Systems Program.