Brussels / 3 & 4 February 2018


It's a Trie... it's a Graph... it's a Traph!

Designing an on-file multi-level graph index for the Hyphe web crawler


Hyphe, a web crawler for social scientists developed by the SciencesPo médialab, introduced the novel concept of web entities to provide a flexible and evolving way of grouping web pages in situations where the notion of a website is not relevant enough: either too broad (for instance with Twitter accounts, newspaper articles or Wikipedia pages) or too narrow to group together multiple domains or TLDs. This comes with technical challenges, since indexing a graph of linked web entities as a dynamic layer on top of a large number of URLs is not as straightforward as it may seem.

We aim to give the graph community some feedback about the design of an on-file index, part Graph, part Trie, named the "Traph", built to solve this particular use case. We also propose to retrace the path we followed, from an old Lucene index, to our experiments with Neo4j, and lastly to our conclusion that we needed to develop our own data structure in order to scale up.
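
To make the "part Graph, part Trie" idea more concrete, here is a minimal in-memory sketch in Python. It is not Hyphe's actual implementation or file format: the `ToyTraph` class, the `url_to_stems` helper and the stem ordering (reversed host labels, then path segments) are all assumptions made for illustration. The trie stores URLs as sequences of stems, some trie nodes are flagged as web entity prefixes, page-level links are kept as a plain set, and the web entity graph is derived on demand by resolving each page to its longest matching entity prefix.

```python
from urllib.parse import urlsplit


def url_to_stems(url):
    """Split a URL into stems ordered from most generic to most specific.

    Assumption for this sketch: host labels reversed (TLD first),
    followed by the non-empty path segments.
    """
    parts = urlsplit(url)
    host_labels = (parts.hostname or "").split(".")
    stems = ["h:" + label for label in reversed(host_labels) if label]
    stems += ["p:" + segment for segment in parts.path.split("/") if segment]
    return stems


def new_node():
    # "entity" holds the name of the web entity whose prefix ends at this node.
    return {"children": {}, "entity": None}


class ToyTraph:
    """Hypothetical toy structure: a trie of URL stems plus page-level links."""

    def __init__(self):
        self.root = new_node()
        self.page_links = set()  # (source_url, target_url) pairs

    def _insert(self, url):
        node = self.root
        for stem in url_to_stems(url):
            node = node["children"].setdefault(stem, new_node())
        return node

    def declare_entity(self, name, prefix_url):
        # Flag the trie node at the end of the prefix as a web entity boundary.
        self._insert(prefix_url)["entity"] = name

    def add_link(self, source_url, target_url):
        self._insert(source_url)
        self._insert(target_url)
        self.page_links.add((source_url, target_url))

    def entity_of(self, url):
        # Walk down the trie and keep the deepest entity flag met on the way.
        node, found = self.root, None
        for stem in url_to_stems(url):
            node = node["children"].get(stem)
            if node is None:
                break
            if node["entity"] is not None:
                found = node["entity"]
        return found

    def entity_graph(self):
        # Aggregate page-level links into links between distinct web entities.
        edges = set()
        for source, target in self.page_links:
            a, b = self.entity_of(source), self.entity_of(target)
            if a is not None and b is not None and a != b:
                edges.add((a, b))
        return edges


traph = ToyTraph()
traph.declare_entity("Twitter", "https://twitter.com")
traph.declare_entity("medialab", "https://twitter.com/medialab")
traph.declare_entity("Le Monde", "https://lemonde.fr")
traph.add_link("https://twitter.com/medialab/status/1",
               "https://www.lemonde.fr/article-42")
traph.add_link("https://twitter.com/other/status/2",
               "https://twitter.com/medialab")
print(sorted(traph.entity_graph()))
```

Because entity boundaries are only flags on trie nodes, redefining a web entity in this sketch does not require rewriting the page-level links: the entity graph is recomputed from the same pages and links. An on-file index would need to avoid the full recomputation done here, which is one reason this toy version should not be read as the real design.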

Speakers

  • Paul Girard
  • Mathieu Jacomy
  • Benjamin Ooghe-Tabanou
  • Guillaume Plique
