By Ashley DeVine, Staff Writer, and Mark Wance, Guest Writer; photo by Richard Frederickson, Staff Photographer
Transferring big data, such as the genomics data delivered to customers from the Center for Cancer Research Sequencing Facility (CCR SF), has been difficult in the past because the transfer systems have not kept pace with the size of the data. However, the situation is changing as a result of the Globus GridFTP project.
Globus GridFTP is a data transfer protocol that enables the transfer and sharing of large amounts of data between users all over the world. The “Grid” in GridFTP refers to grid computing, which pools computational resources from multiple geographic locations. The Globus project connects these computing resources so that users have dependable and consistent access to them.
NCI at Frederick researchers can use Globus GridFTP to securely transfer big data to and from the facility. Once the data is transferred, researchers can analyze it and share the results with their colleagues or collaborators.
Globus Manages the Transfer While You Keep Working
The Globus website manages file transfers, monitors performance, retries failures, recovers from faults automatically, and reports the status of your data transfer. Once a user initiates a file transfer, it is performed in the background.
“It frees up your laptop or desktop to continue with the paper or article that you’re working on, while the data is transferring in the background,” said Mark Wance, network architect, Information Systems Program (ISP). “And you’ll get e-mails that tell you the status of your transfer.”
If a transfer fails, the system will automatically stop it, wait for a period of time, and then restart the transfer where it left off, Wance said.
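The stop, wait, and resume behavior Wance describes can be illustrated with a minimal Python sketch. This is not Globus code and does not use any Globus API; the names `transfer_with_resume`, `make_flaky_sender`, and `FlakyLink` are hypothetical, and the "network" is simulated in memory. It only shows the general idea of restarting from the last successfully delivered offset rather than from the beginning.

```python
import time


class FlakyLink(Exception):
    """Raised by the simulated network when a chunk fails to send."""


def make_flaky_sender(fail_on_calls):
    """Return a hypothetical send function that fails on the given call numbers."""
    calls = {"n": 0}

    def send(chunk):
        calls["n"] += 1
        if calls["n"] in fail_on_calls:
            raise FlakyLink(f"simulated drop on call {calls['n']}")

    return send


def transfer_with_resume(source, dest, send_chunk, chunk_size=4,
                         max_retries=5, wait_seconds=0.0):
    """Copy bytes from source into dest (a bytearray), chunk by chunk.

    After a failure, wait and then resume from the last good offset.
    Returns the total number of retries that were needed.
    """
    offset = len(dest)  # anything already in dest counts as delivered
    retries = 0
    while offset < len(source):
        chunk = source[offset:offset + chunk_size]
        try:
            send_chunk(chunk)
        except FlakyLink:
            retries += 1
            if retries > max_retries:
                raise  # give up after too many failures
            time.sleep(wait_seconds)  # wait before trying again
            continue  # retry the same offset; earlier chunks are not resent
        dest.extend(chunk)
        offset += len(chunk)
    return retries


data = b"ACGTACGTACGTACGT"
received = bytearray()
retries = transfer_with_resume(data, received, make_flaky_sender({2, 4}))
print(bytes(received) == data, retries)
```

In this sketch the second and fourth send attempts fail, yet each chunk is delivered only once, because the loop retries at the offset where the failure occurred.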
“Globus is a very flexible, highly efficient, and robust file transfer system that is easy to use,” said Sean Davis, M.D., Ph.D., staff scientist, Genetics Branch, NCI Center for Cancer Research. “We also use Globus for file sharing of small files to huge filesets with ease comparable to Dropbox, but without requiring us to put our data in the cloud.”
The Challenges of Big Data Delivery before GridFTP
The CCR SF at the Advanced Technology Research Facility currently operates multiple platforms, including high-throughput next-generation and third-generation sequencing machines, in support of NCI investigators. The facility requires about 0.5 petabytes (PB) of storage space for the data that it processes, according to Yongmei Zhao, manager of the CCR Sequencing Facility Bioinformatics Group.
The facility receives customer samples, which are extracted from DNA, RNA, or other biological materials. The samples are processed, constructed into libraries, and loaded onto sequencers. A typical high-throughput sequencing run requires 4–6 terabytes (TB) of storage space for both sequencing and data processing. After the sequencing data are processed and analyzed, the result files are delivered to customers and can range in size from several hundred gigabytes (GB) to several TB.
“The delivery of such a large amount of data is a major challenge,” Zhao said. “That’s why we need GridFTP, a highly efficient data transfer protocol, to help us efficiently deliver the data to our customers.”
To get a sense of the size of this data, consider that the hard drive of a typical desktop computer is about 1–2 TB. If you multiply 1 TB by 1,000, that is equal to 1 PB of storage space—or about 1,000 individual hard drives’ worth of space.
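The same arithmetic can be applied to the facility's own 0.5 PB figure. The short Python sketch below uses the article's decimal convention (1 PB = 1,000 TB); the function name `drives_needed` is hypothetical and exists only for illustration.

```python
import math

TB_PER_PB = 1_000  # decimal units, as used in the article


def drives_needed(storage_pb, drive_tb):
    """Number of drives of size drive_tb (TB) needed to hold storage_pb (PB)."""
    total_tb = storage_pb * TB_PER_PB
    return math.ceil(total_tb / drive_tb)


# The facility's roughly 0.5 PB, expressed in typical desktop hard drives
print(drives_needed(0.5, 1))  # 1 TB drives
print(drives_needed(0.5, 2))  # 2 TB drives
```

Under these assumptions, 0.5 PB corresponds to roughly 250 to 500 desktop hard drives, depending on whether each drive holds 2 TB or 1 TB.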
Prior to the Globus GridFTP project, there were several other transfer protocols used at NIH, including FTP, Secure FTP, SCP, and bbcp. “FTP is inherently insecure because it sends data and authentication information in plain text,” Zhao said. “Other secure data transfer protocol ensures the security, but comes at the price of the transfer speed.”
In some cases, researchers had to use other methods to transport their data, such as transferring it onto a hard drive, which was then shipped or hand-carried to the desired location.
Zhao previously used a different data transfer system that was much slower than GridFTP. With such systems, transferring data was a manual, time-consuming process: the researcher had to monitor the transfer in case it stopped and needed to be restarted, according to Wance.
“By adopting and deploying the GridFTP component of the Globus Toolkit, Frederick National Laboratory is able to reap the benefit of nearly 20 years of U.S. Department of Energy–led development experience in providing software and services to securely share computing power and data across organizational and geographic boundaries,” said Eric Stahlberg, senior high-performance computing engineer, ISP.
How to Set Up a Globus GridFTP Account
If you are interested in using this service to transfer data to/from your Helix, Biowulf, or NCI at Frederick account, you can set up a Globus account at https://www.globus.org. Once your Globus profile is complete, use the links below for instructions on using and starting a file transfer:
Most data transfers will be initiated through the Globus.org website. If you have questions about using Globus GridFTP, contact the NCI at Frederick Computer Helpdesk at 301-846-5115 or email@example.com. Your request will be forwarded to the Information Technology Operations Group, ISP, which is the group that manages the GridFTP server.
Mark Wance is a network architect, Information Systems Program.