Introduction to Cloud Native & Analysis Ready Data Formats

Instructor(s):

Tyson Lee Swetnam PhD, Carlos Lizárraga-Celaya PhD

About

This website follows the FAIR and CARE data principles and aims to help advance open science.

Agenda

Lessons	Estimated Time to Complete	Link
Introduction to Cloud Native Data Types	15 minutes	presentation
Hands on with GeoJSON	30 minutes	GeoJSON.io
Hands on with Cloud Optimized GeoTIFF	30 minutes	cogeo.org
Break	10 minutes
Hands on with Xarray & Zarr	30 minutes	Xarray, Zarr
Hands on with Cloud Optimized Point Clouds	30 minutes	COPC
Hands on with Spatio-Temporal Asset Catalogs	30 minutes	STAC
Summary and Conclusion	5 minutes

Pre-requisites

a laptop with an active Wi-Fi connection

Helpful but not required

a basic understanding of the Command Line Interface (UNIX)
a basic understanding of Python3

Why "cloud native"?

There is a shift happening in the way we use Earth Observation System data for research and management. Cloud data storage technologies have advanced at such a pace that we can now find and explore massive amounts of data from a web browser. At the same time, online platforms with specialized software and hardware offer general data science and machine learning tools for exploring these online datasets.

With these advances, it is easier to foster collaborations, promote data-driven discovery, drive scientific innovation, increase transparency, and improve reproducibility.

Conventional: the old ways of receiving and working with GIS data.

Many of us have taken part in "sneaker net" and "mail order" data delivery, ordering data and managing transfers over physical media. These data are then processed on our workstations and laptops and ultimately stored on external hard drives or uploaded back to national data services. GIS data have also changed hands for years over internet protocols (https://, ftp://, and the newer s3://), where datasets are preferentially downloaded to our local compute resources and worked on there.

"Cloud native" means we no longer look to download all of our GIS data. Instead, we send our code and execution tasks to the cloud, where the data are processed and served by a variety of commercial cloud providers that already host these large geospatial datasets (often at no cost to us). Results can be viewed in the browser or streamed back to our local computers in reduced formats.

Cloud-native and "Analysis Ready Data" formats allow us to work with large datasets on the cloud easily and rather painlessly.
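
One reason these formats work well over a network is that they store data in independently addressable chunks (tiles), so a client can request only the pieces that overlap its area of interest. The sketch below is a toy illustration of that idea and not the implementation used by any particular library. The raster size, the chunk size, and the function name chunks_for_window are all hypothetical.

```python
import math


def chunks_for_window(row_start, row_stop, col_start, col_stop, chunk_rows, chunk_cols):
    """Return (chunk_row, chunk_col) indices of the chunks that overlap
    a half-open pixel window [row_start, row_stop) x [col_start, col_stop)."""
    first_r = row_start // chunk_rows
    last_r = (row_stop - 1) // chunk_rows
    first_c = col_start // chunk_cols
    last_c = (col_stop - 1) // chunk_cols
    return [
        (r, c)
        for r in range(first_r, last_r + 1)
        for c in range(first_c, last_c + 1)
    ]


# Hypothetical raster: 20,000 x 20,000 pixels stored as 512 x 512 chunks.
IMAGE_ROWS = IMAGE_COLS = 20_000
CHUNK = 512

total_chunks = math.ceil(IMAGE_ROWS / CHUNK) * math.ceil(IMAGE_COLS / CHUNK)

# A 1,000 x 1,000 pixel area of interest somewhere inside the raster.
needed = chunks_for_window(1000, 2000, 3000, 4000, CHUNK, CHUNK)

fraction = len(needed) / total_chunks
print(f"{len(needed)} of {total_chunks} chunks ({fraction:.2%}) cover the window")
```

Under these assumptions, only a small fraction of the stored chunks has to travel over the network, which is the property the hands-on lessons explore with real formats.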

Open Architectures

Open architectures are a new approach to data sharing that focuses on object storage rather than file downloads. This cloud platform approach is scalable: instead of moving data to processing systems near users, as is traditional, it brings processing, computing, analytics, and visualization to the data. These are the so-called data-proximate workbench capabilities, sometimes also referred to as server-side processing.

(Open Architecture for scalable cloud-based data analytics. From Abernathey, Ryan (2020): Data Access Modes in Science.)
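
The difference between the two access modes can be sketched with a toy, in-memory stand-in for a data service. Everything here is hypothetical (the ToyDataService class, its download and run methods, and the "elevation" key); real platforms expose this idea through their own interfaces.

```python
class ToyDataService:
    """In-memory stand-in for a cloud data service (illustration only)."""

    def __init__(self, objects):
        self._objects = dict(objects)

    def download(self, key):
        """Conventional mode: ship the whole object to the user."""
        return self._objects[key]

    def run(self, key, func):
        """Data-proximate mode: run the user's function next to the data
        and return only its result."""
        return func(self._objects[key])


# Hypothetical dataset of one million values held by the service.
service = ToyDataService({"elevation": list(range(1_000_000))})

# Conventional: move every value to the user, then compute locally.
local_copy = service.download("elevation")
mean_local = sum(local_copy) / len(local_copy)

# Data-proximate: send the computation, receive a single number back.
mean_remote = service.run("elevation", lambda values: sum(values) / len(values))

print(len(local_copy), "values moved vs. 1 result returned:", mean_local, mean_remote)
```

Both paths give the same answer; what changes is how much data has to leave the service to get it.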

Light reading

Gentemann, C. L., et al. (2021). "Science Storms the Cloud". AGU Advances, 2, e2020AV000354. https://doi.org/10.1029/2020AV000354

Abernathey, R. P., et al. (2021). "Cloud-Native Repositories for Big Scientific Data". Computing in Science & Engineering, 23(2), 26-35. https://doi.org/10.1109/MCSE.2021.3059437

Last update: 2022-11-15