Bloom Filters for efficient ontology querying and text mining

One of the problems with large ontologies such as SNOMED Clinical Terms is that they’re, well, large. So, it’s not typically possible to hold all of the ontology in memory at once, and queries against it require a database lookup. It’s possible to eliminate a number of database accesses, and thus speed up the query process, by using a Bloom filter.

A Bloom filter is a memory-efficient probabilistic data structure that lets you test whether a particular item is a member of a set. It may return false positives, but not false negatives. So, by adding all of the terms in your ontology to a Bloom filter, you can do a fast, in-memory check to see whether an entered term definitely doesn’t exist in your ontology. If the Bloom filter reports that the term does exist, then you can confirm with a slower file or database query for that term.

In an application where you expect to encounter many terms that aren’t in the ontology, such as automated metadata extraction from documents, and automated document classification, then this can potentially lead to large performance improvements.

I think there are also interesting possibilities in using Bloom filters in environments where storing a whole ontology isn’t feasible. For example, a JavaScript implementation of a Bloom filter, initialized with a few 100kb of data, could give a fairly high probability of testing accurately whether a particular term exists in an ontology of half-a-million terms.