Unleashing your inner data scientist: The ability and audacity to scale your science with free and open cyberinfrastructure

Published in TRIPODS, 2018

Talk given in the TRIPODS lecture series at the University of Arizona, about CyVerse and becoming a data scientist.

Co-authored with Nirav Merchant, Eric Lyons, and Blake Joyce.

Abstract

“Data deluge” has become a common phrase for the phenomenon sweeping across almost every scientific discipline. In the life sciences, relatively inexpensive Next Generation Sequencing (NGS) technologies, which generate entire genomes in hours to days, are fueling an explosion of data. Billions of people use their mobile devices daily, streaming data in real time to companies like Twitter and Uber, providing information about global events and patterns of movement. Similarly, the ability to generate spatial data from still images or video via high-resolution digital cameras, coupled with wi-fi enabled global positioning systems (GPS) and autonomous technologies such as drones and self-driving vehicles, is producing actionable data at a historically unprecedented rate.

Movements such as Open Data and data management mandates from national science funding agencies are liberating data and democratizing access to increasingly massive data archives. Complementing this democratization of data are advances in computational capabilities, fueled by the rise of cloud computing and by significant investments from entities such as the U.S. National Science Foundation (NSF) to create distributed computing infrastructure (e.g. the Extreme Science and Engineering Discovery Environment (XSEDE) and institutional capabilities). The wide availability of nearly bespoke software, methodologies, and data analytic capabilities, fueled by the open-source software movement, has made it possible for anyone with an internet connection and basic programming skills to do transformative research.

Yet a gulf still exists between the knowledge and capacity of domain scientists and those of computational scientists. This gap in expertise must be filled by new types of researchers: Data Scientists and Informaticians. Closing the gap between what is taught in domain sciences, computer sciences, and information science requires an understanding of the cyberinfrastructure landscape and of computational thinking. Using these resources to ask bold research questions requires interdisciplinary collaboration, hands-on training, and technology orientation.

This talk will provide a roadmap: a broad overview of exemplar communities that have successfully established their own cyberinfrastructure with these open resources, along with strategies for unleashing the inner “data scientist” in every scientist.

Link to presentation