Automatically identifying (human) languages with code

For a recent project, we needed to automatically determine whether the text inside Word documents was written in French, German or English.

The simplest way of doing this is to check the language attribute set on the style inside Word; unfortunately, very few Word users set the language value on their styles correctly, or even use styles at all.
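For reference, a .docx file is a zip archive, and the styles in word/styles.xml declare their language in w:lang elements. Here is a minimal sketch of reading those declarations in Scala (the object and method names are ours; it assumes the archive has already been unzipped and the scala-xml library is on the classpath):

```scala
import scala.xml.XML

object StyleLanguage {
  // WordprocessingML namespace used by .docx files
  private val W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

  // Collect the languages declared on styles in an extracted word/styles.xml.
  def declaredLanguages(stylesXmlPath: String): Seq[String] = {
    val styles = XML.loadFile(stylesXmlPath)
    (styles \\ "lang")                        // matches <w:lang w:val="fr-FR" .../>
      .map(lang => (lang \ s"@{$W}val").text) // read the namespaced w:val attribute
      .filter(_.nonEmpty)
      .distinct
  }
}
```

In practice this often returns nothing useful, which is exactly the problem described above.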

So, if we couldn’t trust the styles, we needed a mechanism that worked on the text alone. The first thing we tried was to pick some characteristic French, German and English words (like “des”, “und”, “für” and “and”) and check whether the text contained them. Whichever language’s distinctive words occur most often in a text determines which language it is likely to be.

This worked well, but we couldn’t be sure that those words would always appear in the text we were analyzing. So we switched to an n-gram approach, as described in the thesis “Evaluation of Language Identification Methods”. This works by creating a “fingerprint” for the text, based on the occurrence of bigrams (“un”, “an”) and trigrams (“und”), and then comparing that fingerprint to standard fingerprints for the various languages to find the one it most resembles.
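Here is a simplified sketch of such a fingerprint comparison, using the rank-based “out-of-place” distance from Cavnar & Trenkle’s classic n-gram paper (the names are ours, and the released utility may differ in detail):

```scala
object NgramFingerprint {
  // Build a rank-ordered profile of the most frequent bigrams and trigrams.
  def profile(text: String, size: Int = 300): Seq[String] = {
    val clean  = text.toLowerCase.replaceAll("\\s+", " ")
    val grams  = (2 to 3).flatMap(n => clean.sliding(n).toSeq)
    val counts = grams.groupBy(identity).map { case (g, occ) => g -> occ.size }
    counts.toSeq.sortBy { case (g, c) => (-c, g) }.take(size).map(_._1)
  }

  // "Out-of-place" distance: sum of rank differences between the two
  // profiles; n-grams missing from the reference get a maximum penalty.
  def distance(doc: Seq[String], ref: Seq[String]): Int = {
    val rank = ref.zipWithIndex.toMap
    doc.zipWithIndex.map { case (g, i) =>
      math.abs(rank.getOrElse(g, ref.size) - i)
    }.sum
  }

  // The language whose reference fingerprint is closest wins.
  def identify(text: String, refs: Map[String, Seq[String]]): String =
    refs.minBy { case (_, ref) => distance(profile(text), ref) }._1
}
```

The reference fingerprints would be built once per language, by running profile over a large sample of known French, German and English text.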

The n-gram approach gives better results than word counting when there is not much text available for analysis.

We’ve released the source code for this utility under an Open Source license. It’s written in Scala, an object-functional language that compiles to Java-compatible bytecode.