Datacubes on Steroids with ISO Array SQL
Open Source, Open Standards, Open Terabytes
Never before has it been so easy and inexpensive to gather and generate massive amounts of data. Often, data get discretized in space and time, naturally leading to multi-dimensional arrays. In fact, arrays play a core role in most domains of science, engineering, and business: generally speaking, spatio-temporal sensor, image, timeseries, simulation, and statistics data. This raises the need for flexible, scalable, and open services to replace the bespoke silo solutions that have prevailed in the past.
Traditional databases have been successful due to their flexibility (through query languages) and scalability (through manifold optimizations and parallelization in the server). Unfortunately, however, they do not support massive arrays. This is currently being remedied within ISO, where SQL/MDA ("Multi-Dimensional Arrays") is at an advanced stage and likely to be adopted in summer 2017. SQL/MDA adds declarative array definition and operations to SQL. Not only does this pave the way for powerful services; perhaps even more importantly, it allows, for the first time, integrating data and metadata in the same archive, even in one and the same query. As such, SQL/MDA will be a game changer for data services in science, engineering, and beyond.
We present the concepts and rationale, as well as the open-source technology rasdaman ("raster data manager"), which serves as the blueprint for MDA.
We have learnt to live with the pain of separating data and metadata into non-interoperable silos. For metadata, we enjoy the flexibility of databases, be they relational, graph, or some other NoSQL. In contrast, for the data themselves users still "drown in files" as an unstructured, low-level archiving paradigm. It is time to bridge this chasm, which once was technologically induced but today can be overcome.
One building block towards a common re-integrated information space is to support massive multi-dimensional spatio-temporal arrays. These "datacubes" appear as sensor, image, simulation, and statistics data in all science and engineering domains, and beyond. For example, 2-D satellite imagery, 3-D x/y/t image timeseries and x/y/z geophysical voxel data, and 4-D x/y/z/t climate data contribute to today's data deluge in the Earth sciences. Virtual observatories in the Space sciences routinely generate Petabytes of such data. Life sciences deal with microarray data, confocal microscopy, and human brain data, which all fall into the same category.
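To make the datacube notion concrete, the following is a minimal Python sketch of a toy 3-D x/y/t image timeseries and the two basic subsetting operations: cutting out a sub-cube and fixing one axis to reduce dimensionality. It is an illustration only and not SQL/MDA syntax or rasdaman code; the function names (make_cube, trim, slice_time, pixel_timeseries, shape) and the nested-list layout cube[t][y][x] are assumptions chosen for this example.

```python
# Toy 3-D x/y/t datacube stored as nested lists, indexed as cube[t][y][x].
# All names here are hypothetical and chosen only for illustration.

def make_cube(nt, ny, nx):
    """Build a cube whose cell value encodes its own coordinates."""
    return [[[t * 100 + y * 10 + x for x in range(nx)]
             for y in range(ny)]
            for t in range(nt)]

def trim(cube, t_range, y_range, x_range):
    """Keep all three axes, restricting each to a half-open index range."""
    (t0, t1), (y0, y1), (x0, x1) = t_range, y_range, x_range
    return [[row[x0:x1] for row in image[y0:y1]] for image in cube[t0:t1]]

def slice_time(cube, t):
    """Fix the time axis: the 3-D cube becomes a 2-D image."""
    return cube[t]

def pixel_timeseries(cube, y, x):
    """Fix both spatial axes: the 3-D cube becomes a 1-D timeseries."""
    return [image[y][x] for image in cube]

def shape(array):
    """Extent of each axis of a regular nested list."""
    dims = []
    while isinstance(array, list) and array:
        dims.append(len(array))
        array = array[0]
    return tuple(dims)

cube = make_cube(nt=4, ny=3, nx=5)
image = slice_time(cube, 2)                 # 2-D result
sub = trim(cube, (1, 3), (0, 2), (2, 5))    # still 3-D, but smaller
series = pixel_timeseries(cube, y=1, x=4)   # 1-D result

print(shape(cube), shape(image), shape(sub), series)
```

In an Array Database, operations of this kind would be expressed declaratively in the query language and evaluated inside the server rather than in client code.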
The ISO SQL/MDA (Multi-Dimensional Arrays) candidate standard extends SQL with modelling and query support for n-D arrays ("datacubes") in a flexible, domain-neutral way. This heralds a new generation of services with new quality parameters, such as flexibility, ease of access, embedding into well-known user tools, and scalability mechanisms that remain completely transparent to users. Technology like the EU rasdaman ("raster data manager") Array Database system can support all of the above examples simultaneously, with one technology. This is proven in practice: as of today, rasdaman is in operational use on hundreds of Terabytes of satellite image timeseries datacubes, with transparent query distribution across more than 1,000 nodes.
Therefore, Array Databases offering SQL/MDA constitute a natural common building block for next-generation data infrastructures. As initiator and editor of the standard, we present principles, implementation facets, and application examples as a basis for further discussion. Time permitting, we will present live demos from services exceeding 20 TB of "datacubes".