Online / 5 & 6 February 2022

A lightning intro to re-Isearch

re-Isearch, the 27-year-old new kid on the search block


Project re-Isearch is a novel multimodal search and retrieval engine that uses mathematical models and algorithms different from the all-too-common inverted index. In practice, the design imposes effectively no limits on word frequency, term length, number of fields, or complexity of structured data. It even supports overlap, where fields or structures cross each other's boundaries (common examples are quotes, lines and sentences, biblical verses, and annotations). Its model enables a completely flexible unit of retrieval and flexible modes of search. Written in a highly portable subset of C++ to be RAM efficient, the engine also provides bindings for a number of other languages such as Python, Tcl, and Java.

re-Isearch is a project in the spirit of the original isearch developed back in the 1990s. It was reborn in 2020, in the middle of the global Covid-19 pandemic, as Project re-Isearch.

Like the original, it is not just about textual words but pushes the envelope: re-Isearch is multi-object and multimodal, with an unconstrained unit of retrieval.

Mainstream search engines are about finding any information: "a list of all documents containing a specific word or phrase". As a result, they paradoxically return both too much information (long lists of links) and too little (links to content, not the content itself). The re-Isearch engine, by contrast, exploits document structure, both explicit (XML and other markup) and implicit (visual groupings such as paragraphs), to zero in on the relevant sections of documents rather than just links to documents. This concept of search granularity is a radical departure from other designs. Typical text indexers have the concept of a document or record, which is both the unit of indexing and the unit of retrieval. re-Isearch instead offers a dynamic, search-time unit of retrieval that is either user-specified or heuristically determined. The structure of documents can be exploited to identify which document elements (such as the appropriate chapter or page) to retrieve. Retrieval granularity may be at the level of substructures of a given document or page, such as a line or paragraph, but may also span part of a larger collection.
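To make the idea of a search-time unit of retrieval concrete, here is a minimal sketch in plain Python. It is not the re-Isearch API and says nothing about how the engine indexes internally: the names `Span`, `build_document`, and `search` are hypothetical, and the structure is assumed to be a simple list of chapters made of paragraphs. The point is only that the same word hit can be returned as a paragraph or as a chapter, depending on what the caller asks for at query time.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A structural element, stored as character offsets into the text."""
    field: str  # hypothetical field name: "chapter" or "paragraph"
    start: int  # inclusive offset
    end: int    # exclusive offset


def build_document(chapters):
    """Concatenate chapters (lists of paragraphs) and record each element's span."""
    text, spans = "", []
    for paragraphs in chapters:
        chapter_start = len(text)
        for paragraph in paragraphs:
            start = len(text)
            text += paragraph
            spans.append(Span("paragraph", start, len(text)))
            text += "\n"
        # The chapter span excludes the trailing newline.
        spans.append(Span("chapter", chapter_start, len(text) - 1))
    return text, spans


def search(text, spans, word, unit):
    """Return the text of every `unit` element that contains a hit for `word`."""
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    results = []
    for match in pattern.finditer(text):
        for span in spans:
            encloses = span.start <= match.start() and match.end() <= span.end
            if span.field == unit and encloses:
                fragment = text[span.start:span.end]
                if fragment not in results:
                    results.append(fragment)
    return results


text, spans = build_document([
    ["Search engines return links.", "Structure narrows results."],
    ["Dates are objects.", "Links are not content."],
])

# Same index, two different units of retrieval chosen at search time.
paragraph_hits = search(text, spans, "links", "paragraph")
chapter_hits = search(text, spans, "structure", "chapter")
```

Because elements are stored as independent offset ranges rather than as a strict tree, nothing in this sketch prevents two spans from overlapping, which is the property the abstract highlights for quotes, lines, and verses.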

Like the original, it also goes beyond textual words in its data model: the design includes a large number of object types, such as numerical, range, and geospatial. It is unique among full-text systems in that these object types come with their own search methods and can also be viewed in parallel as text. A date field, for instance, can be searched as a date but also as text, by searching for the words in the field. The engine will be one of the first to support some key parts of the new ISO 8601:2019 standard date semantics.
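The dual view of a field, as a typed object and as text, can be illustrated with a small sketch. This is plain Python and not the re-Isearch API: `RECORDS`, `parse_date`, `search_as_date`, and `search_as_text` are hypothetical names, and the date format is assumed to be "day month-name year" with English month names.

```python
import re
from datetime import date, datetime

# Hypothetical records: each holds one date field kept as its original text.
RECORDS = {
    "doc1": "5 February 2022",
    "doc2": "12 March 2021",
    "doc3": "6 February 2022",
}


def parse_date(raw):
    """Interpret the field as a date (assumes English month names)."""
    return datetime.strptime(raw, "%d %B %Y").date()


def search_as_date(records, start, end):
    """Typed search: keys whose date falls within [start, end]."""
    return sorted(
        key for key, raw in records.items()
        if start <= parse_date(raw) <= end
    )


def search_as_text(records, word):
    """Text search: keys whose field contains `word` as a whole word."""
    word = word.lower()
    return sorted(
        key for key, raw in records.items()
        if word in re.findall(r"\w+", raw.lower())
    )


in_2022 = search_as_date(RECORDS, date(2022, 1, 1), date(2022, 12, 31))
mentions_february = search_as_text(RECORDS, "february")
```

Both queries run against the same stored field. The typed search compares calendar values, while the text search only looks at the words, so a query for "february" would also match a February in any other year.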

Speakers

Edward Zimmermann