Online / 6 & 7 February 2021


Hypers and Gathers and Takes! Oh my!

Processing large datasets in Raku.


Raku takes the pain out of parallel processing. Combined with rational numbers for lossless conversions, it is an ideal language for ETL. The operators may take a bit of getting used to for Perl hackers, however. This talk looks at pre-processing the nr.gz file, splitting it into separate streams of digests and data to find duplicates efficiently. It walks through the basic process of opening the file with gzip, acquiring data with block reads, gathers and takes, lazy operators, and other bits of technological magick. The goal is to show a reasonable framework for parallel processing of a really large dataset.
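A minimal sketch of the kind of pipeline described above, assuming a pre-split "digest<TAB>data" record layout for illustration (the real pre-processing produces the digests; the file name and format here are stand-ins):

    # Decompress nr.gz on the fly and build a lazy stream of records
    # with gather/take; nothing is read until a consumer asks for it.
    my $gz = run 'gzip', '-dc', 'nr.gz', :out;

    my @records := lazy gather {
        for $gz.out.lines -> $line {
            take $line;                      # one record per take
        }
    };

    # Split each record into its digest and payload, grouping duplicates by key.
    my %seen;
    for @records -> $rec {
        my ($digest, $data) = $rec.split("\t", 2);
        %seen{$digest}.push($data);          # duplicates land under the same digest
    }
    say "{+%seen} distinct digests";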

No, Raku isn't Perl. But it is high-level and offers quite a bit of 20/20 hindsight. The new constructs make it rather well suited to handling ETL, especially transforming bulky datasets. Finding duplicates in the nr.gz file is a multi-stage process; the first pre-processing stage alone can take hours. Going through a dozen iterations of the code, I've found some alternatives for reading and parallelizing the data that could save re-inventing some wheels. Much of the linguistic territory here is foreign to Perl: gathers and takes, hypers, lazy operators, even the maps read backward (well, forward, but that's backward!).
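A hedged sketch of that "reads forward" method-chain style, with a hyper map fanned across threads; digest() here is a placeholder routine, not the talk's actual per-record work:

    # The chain reads top to bottom: source, then parallelism, then the map.
    sub digest(Str $rec) { $rec.chars }      # placeholder transform

    my @lines = 'aaa', 'bbbb', 'aaa', 'cc';
    my @pairs = @lines
        .hyper(:degree(4), :batch(2))        # fan the work out across threads
        .map({ digest($_) => $_ });          # Perl would write: map { ... } @lines
    .say for @pairs;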

Much of the programming involves tradeoffs between speed and size, as usual, so I'll also look at a few different ways to chunk and re-process the data for speed or space.
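An illustrative example of one such knob, assuming nothing beyond .hyper's batch size: a larger :batch moves more data per worker and schedules fewer chunks, while a smaller one keeps memory low at the cost of more scheduling overhead. The numbers are illustrative, not measured.

    my @data = ^100_000;

    my @small-batches = @data.hyper(:batch(64)).map(* ** 2);    # low memory, more overhead
    my @large-batches = @data.hyper(:batch(4096)).map(* ** 2);  # more memory, less overhead

    say @small-batches.elems == @large-batches.elems;           # same answer either way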

Speakers

Steven Lembark
