Open Earth Monitor — Global Workshop 2023

Daniel Loos

2023 - today Data Scientist and PostDoc, Max Planck Institute for Biogeochemistry
2018 - 2023 Doctoral researcher Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology
2015 - 2017 Master Bioinformatics, Friedrich Schiller University Jena
2012 - 2015 Bachelor Molecular Life Science, University of Lübeck

DGGS.jl: A Julia Package for Scalable Geospatial Analysis Using Discrete Global Grid Systems
Daniel Loos

Discrete Global Grid Systems (DGGS) tessellate the surface of the Earth with hierarchical cells of equal area, minimizing distortion and the loading time of large geospatial datasets, which is crucial in spatial statistics and for building machine learning (ML) models. Successful applications of DGGS include the prediction of flood events by integrating remote sensing datasets of different resolutions as well as vector data. Here we present DGGS.jl, a framework for scalable geospatial analysis written in the Julia programming language. Bindings to the C++ library DGGRID were created to convert between geographical coordinates and DGGS cell IDs, as well as to provide several projections and grids. An efficient data structure and chunking scheme based on data cubes and Zarr arrays was created to store remote sensing data of different resolutions, structured in accordance with the selected grid. This provides the basis for fast and accurate ML modeling, especially distortion-free and spatially aware Graph Convolutional Neural Networks. Furthermore, the hierarchical cell structure of a DGGS enables multiscale modeling, in which regions of interest can be represented at a higher resolution than others.

Poster presentation
Distributed computing on large geodata from multiple sources using the Julia Programming language
Felix Cremer, Daniel Loos

Spatiotemporal data cubes are becoming ever more abundant and are a widely used tool in the Earth System Science community to handle geospatial raster data.
Sophisticated frameworks in high-level programming languages like R and Python allow scientists to draft and run their data analysis pipelines and to scale them in HPC or cloud environments.

While many data cube frameworks can handle harmonized analysis-ready data cubes very well, we repeatedly experienced problems when running complex analyses on multi-source data that was not homogenized. The problems arise when different datasets need to be resampled on the fly to a common resolution and have non-aligning chunk boundaries, which leads to very complex and often unresolvable task graphs in frameworks like xarray+dask.
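
The chunk alignment problem can be illustrated with a minimal one-dimensional sketch. It is plain Python written for illustration only and is not taken from xarray, dask, or any JuliaDataCubes package; the function names and the chunk sizes are hypothetical. It counts how many source chunks each target chunk has to read, which corresponds to the number of dependency edges in a task graph.

```python
def chunk_edges(length, chunk):
    """Return [start, stop) index pairs for 1-D chunks of the given size."""
    return [(start, min(start + chunk, length)) for start in range(0, length, chunk)]


def overlapping_chunks(target, sources):
    """Count how many source chunks intersect one target chunk."""
    t0, t1 = target
    return sum(1 for s0, s1 in sources if s0 < t1 and s1 > t0)


def reads_per_target(length, source_chunk, target_chunk):
    """For every target chunk, count the source chunks it depends on."""
    sources = chunk_edges(length, source_chunk)
    targets = chunk_edges(length, target_chunk)
    return [overlapping_chunks(t, sources) for t in targets]


# Hypothetical axis of 1200 cells stored in source chunks of 100 cells.
aligned = reads_per_target(1200, 100, 200)     # target boundaries fall on source boundaries
misaligned = reads_per_target(1200, 100, 150)  # every second target boundary cuts a source chunk

print(sum(aligned), "chunk reads for aligned targets")
print(sum(misaligned), "chunk reads for misaligned targets")
```

With aligned boundaries each of the 12 source chunks is read exactly once, whereas the misaligned layout needs 16 reads because some source chunks feed two target chunks. In two or three dimensions, and with several datasets resampled at once, this overhead multiplies along every axis.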

In this workshop we present the emerging ecosystem of large-scale geodata processing in the Julia programming language under the JuliaDataCubes GitHub umbrella.
Julia is an interactive scientific programming language designed for HPC applications, with primitives for multi-threaded and distributed computations built into the language.
We will demonstrate an example analysis where data from different sources (global fields of daily MODIS, hourly ERA5, high-resolution land cover), summing to multiple TBs of data, can interoperate on the fly and scale well when run in different computing environments.

OEMC project workshop
EURAC Seminar room 2 & 3