March 2008 – 67 Bricks blog

Bloom Filter implementation in F#

Further to my previous post on Bloom filters for efficient ontology lookup, I’ve made a simple implementation in F#, based on a Java implementation by Ian Clarke.

The neat thing about Ian’s implementation is its use of Random to extend the hash code provided by the object being stored into a hash of arbitrary length, suitable for use by the Bloom filter algorithm: the hash code seeds the generator, and successive random numbers supply as many bit positions as the filter needs. This will reduce the quality of the hash, but for an arbitrary passed-in object, it’s hard to do better. (For a specific application, like storing ontology labels, it would be better to use a purpose-built algorithm such as a Jenkins hash.)

open System
// Based on Java Bloom Filter implementation by Ian Clarke 
// http://locut.us/blog/2008/01/12/a-decent-stand-alone-java-bloom-filter-implementation/
type BloomFilter(bitArraySize : int, expectedElementCount : int) =
    let bitSet = new System.Collections.BitArray(bitArraySize, false)
    // Number of bits set per element: k = ceil((m / n) * ln 2)
    let k = int (Math.Ceiling((float bitArraySize / float expectedElementCount) * Math.Log(2.0)))
    // Infinite sequence of bit positions for o, drawn from a Random seeded with o's hash code.
    // Build a fresh sequence for each use: re-enumerating one would advance the same Random.
    let bitSequence o =
        let r = new Random(hash o)
        Seq.initInfinite (fun _ -> r.Next(bitArraySize))
    member b.expectedFalsePositiveProbability =
        Math.Pow(1.0 - Math.Exp(-(float k) * float expectedElementCount / float bitArraySize), float k)
    // Set the first k bit positions generated for o
    member b.add o =
        for n in Seq.take k (bitSequence o) do bitSet.Set(n, true)
    member b.addAll os = Seq.iter b.add os
    member b.clear() = bitSet.SetAll(false)
    // True if the first k bit positions for o are all set (this may be a false positive)
    member b.contains o =
        bitSequence o |> Seq.take k |> Seq.forall (fun n -> bitSet.Get(n))
    member b.containsAll os = Seq.forall b.contains os

Using F# makes the code slightly neater than Ian’s Java version: I’ve been able to factor the hashing out into a sequence supplied by the bitSequence function, and to collapse some of the for loops into sequence operations instead. But the basic structure of the code is still very similar.
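
The hash-extension trick only works because a Random created with the same seed always produces the same numbers, so the same object always maps to the same bit positions. Here is a minimal sketch of that property on its own. The bitPositions helper, the array size of 1024 and the sample term are my own illustrative choices, not part of Ian's code.

```fsharp
open System

// Hypothetical helper (an assumption for illustration): the first `count`
// bit positions for a value, drawn from a Random seeded with its hash code.
let bitPositions (bitArraySize : int) (count : int) (value : string) : int list =
    let r = new Random(hash value)
    List.init count (fun _ -> r.Next(bitArraySize))

let first = bitPositions 1024 5 "myocardial infarction"
let second = bitPositions 1024 5 "myocardial infarction"

// Both calls seed Random identically, so the two lists should match.
printfn "%A" first
printfn "same positions on repeat: %b" (first = second)
```

If the two lists ever differed, contains could fail to find an item that had been added, which would break the "no false negatives" guarantee.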

Oxford Semantic Web Interest Group

Leigh Dodds of Ingenta and I recently spoke at the Oxford SWIG. We were both talking about SPARQL, the standard query language and protocol for the Semantic Web.

Eamonn Neylon, who organized the session, has written up a summary of the talks. I was talking about DBpedia and how to write SPARQL extension functions; Leigh was talking about his SPARQL tool Twinkle, and about the different SPARQL query forms (SELECT, DESCRIBE, CONSTRUCT, and ASK) and what they are useful for.

Bloom Filters for efficient ontology querying and text mining

One of the problems with large ontologies such as SNOMED Clinical Terms is that they’re, well, large. So, it’s not typically possible to hold all of the ontology in memory at once, and queries against it require a database lookup. It’s possible to eliminate a number of database accesses, and thus speed up the query process, by using a Bloom filter.

A Bloom filter is a memory-efficient probabilistic data structure that lets you test whether a particular item is a member of a set. It may return false positives, but not false negatives. So, by adding all of the terms in your ontology to a Bloom filter, you can do a fast, in-memory check to see whether an entered term definitely doesn’t exist in your ontology. If the Bloom filter reports that the term does exist, then you can confirm with a slower file or database query for that term.
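
The check-then-confirm pattern described above might look something like the following sketch. Everything in it is an assumption for illustration: TermFilter is a cut-down filter for strings, the HashSet stands in for the slow file or database query, and the sizes (1024 bits, 7 hash positions) are arbitrary.

```fsharp
open System
open System.Collections
open System.Collections.Generic

// Minimal Bloom filter for strings, just enough for this sketch.
type TermFilter(bitCount : int, hashCount : int) =
    let bits = new BitArray(bitCount, false)
    // Bit positions for a term, from a Random seeded with the term's hash code.
    let positions (term : string) =
        let r = new Random(hash term)
        List.init hashCount (fun _ -> r.Next(bitCount))
    member this.Add(term : string) =
        for n in positions term do bits.Set(n, true)
    member this.MightContain(term : string) =
        positions term |> List.forall (fun n -> bits.Get(n))

// Stand-in for the slow store; a real application would query a database here.
let ontology = new HashSet<string>([ "asthma"; "diabetes mellitus"; "hypertension" ])
let slowLookups = ref 0
let slowLookup (term : string) =
    slowLookups.Value <- slowLookups.Value + 1
    ontology.Contains(term)

let filter = new TermFilter(1024, 7)
for term in ontology do filter.Add(term)

// Consult the slow store only when the filter says "maybe".
let isInOntology (term : string) =
    filter.MightContain(term) && slowLookup term

printfn "asthma: %b" (isInOntology "asthma")
printfn "the: %b" (isInOntology "the")
printfn "slow lookups made: %d" slowLookups.Value
```

A term that the filter rejects never reaches slowLookup at all, which is where the saving comes from; a false positive just costs one wasted lookup.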

In an application where you expect to encounter many terms that aren’t in the ontology, such as automated metadata extraction from documents or automated document classification, this can potentially lead to large performance improvements.

I think there are also interesting possibilities in using Bloom filters in environments where storing a whole ontology isn’t feasible. For example, a JavaScript implementation of a Bloom filter, initialized with a few hundred kilobytes of data, could test with fairly high accuracy whether a particular term exists in an ontology of half a million terms.
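
To put rough numbers on that, here is a sketch using the same formulas as the F# implementation earlier on this page: k = ceil((m / n) * ln 2) bit positions per term, and an expected false positive probability of (1 - e^(-kn/m))^k. The filter sizes of 300 KB and 600 KB are my own guesses at "a few hundred kilobytes", and I'm assuming 1 KB = 1024 bytes.

```fsharp
open System

// Number of bit positions per element for m bits and n elements: ceil((m / n) * ln 2).
let optimalHashCount (bitCount : int) (elementCount : int) : int =
    int (Math.Ceiling((float bitCount / float elementCount) * Math.Log(2.0)))

// Expected false positive probability: (1 - e^(-k * n / m)) ^ k.
let falsePositiveProbability (bitCount : int) (elementCount : int) : float =
    let k = float (optimalHashCount bitCount elementCount)
    Math.Pow(1.0 - Math.Exp(-k * float elementCount / float bitCount), k)

let termCount = 500000

// Assumed filter sizes in kilobytes (1 KB = 1024 bytes = 8192 bits).
for kilobytes in [ 300; 600 ] do
    let bitCount = kilobytes * 1024 * 8
    printfn "%d KB: k = %d, p = %.4f" kilobytes
        (optimalHashCount bitCount termCount)
        (falsePositiveProbability bitCount termCount)
```

By my arithmetic this gives a false positive rate of a little under 10% at 300 KB and a little under 1% at 600 KB, so doubling the filter size buys roughly a tenfold improvement at this scale.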