Brussels / 3 & 4 February 2018


Tools for large-scale collection and analysis of source code repositories

Open source Git repository collection pipeline

There are tens of millions of Git repositories publicly available on the Internet, but what kind of tools would one need to treat all this code as a big dataset? I will talk about new and existing OSS tools that were built and used to collect and analyze millions of Git repositories on clusters of commodity hardware.

Tools:
- Babelfish: universal code parser
- Engine: library, extending Apache Spark, to query Git repositories
- Rovers/Borges: Git repository crawlers written in Go
- Siva: optimized storage format
- go-git: custom implementation of the Git protocol and storage format
- CoreOS and K8s clusters for job orchestration


Alexander Bezzubov