Brussels / 3 & 4 February 2018


Tools for large-scale collection and analysis of source code repositories

Open source Git repository collection pipeline


There are tens of millions of Git repositories publicly available on the Internet, but what kind of tools would one need to treat all this code as a big dataset? I will talk about new and existing OSS tools that were built and used to collect and analyze millions of Git repositories on clusters of commodity hardware.

Tools:
- Babelfish: universal code parser https://doc.bblf.sh/
- Engine: library extending Apache Spark to query Git repositories https://github.com/src-d/engine
- Rovers/Borges: Git repository crawlers written in Go https://github.com/src-d/rovers https://github.com/src-d/borges
- Siva: optimized storage format https://github.com/src-d/go-siva https://github.com/src-d/siva-java
- go-git: custom implementation of the Git protocol and storage format https://github.com/src-d/go-git
- CoreOS and K8s clusters for job orchestration

Speakers

Alexander Bezzubov
