Brussels / 1 & 2 February 2014


How we found a million style and grammar errors in the English Wikipedia

...and how to fix them


LanguageTool is an Open Source proofreading tool developed to detect errors that a common spell checker cannot find, including grammar and style issues. The talk shows how we run LanguageTool on Wikipedia texts, finding many errors (as well as a lot of false alarms). Errors are detected by searching for error patterns that can be specified in XML, making LanguageTool easily extensible.

LanguageTool has existed since 2003, and it now contains almost 1000 patterns to detect errors in English texts. These patterns are a lot like regular expressions, except that they can also refer to, for example, a word's part of speech. Because all patterns are independent of each other, adding more patterns is easy. I'll explain the XML syntax of the rules and how more complicated errors, for which the XML syntax is not powerful enough, can be detected by writing Java code.
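To make the "regular expressions plus part-of-speech" idea concrete, here is a minimal sketch in Python. It is not LanguageTool's implementation or its XML syntax: the `Token`, `Element` and `find_matches` names, the hand-assigned Penn-Treebank-style tags and the example rule ("a"/"an" followed by a plural noun) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    text: str
    pos: str  # part-of-speech tag, assigned by hand here (e.g. "DT", "NN", "NNS")


@dataclass
class Element:
    """One slot of a pattern: constrains the word, the POS tag, or both."""
    word: Optional[str] = None  # regex that must match the whole word
    pos: Optional[str] = None   # regex that must match the whole POS tag

    def matches(self, token: Token) -> bool:
        if self.word is not None and not re.fullmatch(self.word, token.text, re.IGNORECASE):
            return False
        if self.pos is not None and not re.fullmatch(self.pos, token.pos):
            return False
        return True


def find_matches(pattern: List[Element], tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) token index pairs where the pattern matches consecutively."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(element.matches(token) for element, token in zip(pattern, window)):
            hits.append((start, start + len(pattern)))
    return hits


# Hypothetical rule: the article "a" or "an" directly followed by a plural noun.
A_PLUS_PLURAL = [Element(word=r"an?"), Element(pos=r"NNS")]

wrong = [Token("I", "PRP"), Token("saw", "VBD"), Token("a", "DT"), Token("cars", "NNS")]
right = [Token("I", "PRP"), Token("saw", "VBD"), Token("a", "DT"), Token("car", "NN")]

print(find_matches(A_PLUS_PLURAL, wrong))  # flags the span covering "a cars"
print(find_matches(A_PLUS_PLURAL, right))  # no match for "a car"
```

Each pattern in this sketch is self-contained, so adding another rule means adding another list of elements without touching the existing ones.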

Running LanguageTool on a random 20,000-article subset of the English Wikipedia led to 37,000 errors being detected. However, many of these are false alarms, either because of problems with the Wikipedia syntax or because the LanguageTool error patterns are too strict. We therefore manually reviewed 200 of the reported errors and found that 29 of them were real errors. Projected to the whole Wikipedia (currently at 4.3 million articles), that's about 1.1 million real errors - and that does not even count simple typos that could be detected by a spell checker. If you want fewer errors in your Wikipedia: LanguageTool offers a web-based tool to send corrections directly to Wikipedia with just a few clicks. And while these numbers refer to the English Wikipedia, LanguageTool also supports German, French, Polish, and many other languages.
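The projection can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the figures quoted above; the function and parameter names are made up for illustration, and it assumes that the 200 reviewed matches and the 20,000 sampled articles are representative of the whole Wikipedia.

```python
def project_real_errors(flagged: int, sampled_articles: int,
                        reviewed: int, confirmed: int,
                        total_articles: int) -> float:
    """Scale the share of confirmed errors in a sample up to the full article count."""
    flags_per_article = flagged / sampled_articles  # 37,000 / 20,000 = 1.85
    share_real = confirmed / reviewed               # 29 / 200 = 0.145
    return flags_per_article * share_real * total_articles


estimate = project_real_errors(
    flagged=37_000,
    sampled_articles=20_000,
    reviewed=200,
    confirmed=29,
    total_articles=4_300_000,
)
print(f"{estimate / 1e6:.2f} million real errors")
```

The result comes out slightly above 1.1 million, in line with the estimate in the text. With only 200 reviewed matches, the 14.5% share of real errors is itself a rough figure, so the projection should be read as an order of magnitude rather than a precise count.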

This talk will contain lots of examples of errors that can be detected automatically, and others that can't. I'll also explain that LanguageTool itself is just a core written in Java (and available on Maven Central), but that it also comes with several front-ends: a stand-alone user interface, add-ons for LibreOffice/OpenOffice and Firefox, and an embedded HTTP server.

Speakers

Daniel Naber
