Further to my previous post on bloom filters for efficient ontology lookup, I’ve made a simple implementation in F#. This is based on a Java implementation by Ian Clarke.
The neat thing about Ian’s implementation is its use of Random to extend the hashcode provided by the object being stored into a hash of arbitrary length, suitable for use by the bloom filter algorithm. This will reduce the quality of the hash, but for an arbitrary passed-in object, it’s hard to do better. (For a specific application, like storing ontology labels, it would be better to use a more specific algorithm such as a Jenkin’s Hash).
#light open System // Based on Java Bloom Filter implementation by Ian Clarke// http://locut.us/blog/2008/01/12/a-decent-stand-alone-java-bloom-filter-implementation/ type BloomFilter(bitArraySize : int, expectedElementCount : int) = let bitSet = new System.Collections.BitArray(bitArraySize, false) let bitArraySize = bitArraySize let expectedElementCount = expectedElementCount let k = (int) (Math.Ceiling( ((double) bitArraySize / (double) expectedElementCount) * Math.Log(2.0) )) let bitSequence o = let r = new Random( hash o ) Seq.init_infinite (fun n -> r.Next(bitArraySize)) member b.expectedFalsePositiveProbability = Math.Pow((1.0 - Math.Exp( -((double) k) * (double) expectedElementCount / (double) bitArraySize)), (double) k) member b.add o = let sq = bitSequence o for x in 0 to k do bitSet.Set( Seq.hd sq , true) member b.addAll os = Seq.iter b.add os member b.clear = bitSet.SetAll(false) member b.contains o = let sq = bitSequence o let isSet n = bitSet.Get( Seq.hd sq ) Seq.for_all isSet [0..k] member b.containsAll os = Seq.for_all b.contains os
Using F# makes the code slightly neater than Ian’s Java version – I’ve been able to factor out the hash code into a sequence supplied by the bitSequence function, and to collapse some of the for loops into operations over lists instead. But, the basic structure of the code is still very similar.