Bloom Filter implementation in F#

Further to my previous post on Bloom filters for efficient ontology lookup, I’ve made a simple implementation in F#. This is based on a Java implementation by Ian Clarke.

The neat thing about Ian’s implementation is its use of Random to extend the hashcode provided by the object being stored into a hash of arbitrary length, suitable for use by the Bloom filter algorithm. This will reduce the quality of the hash, but for an arbitrary passed-in object, it’s hard to do better. (For a specific application, like storing ontology labels, it would be better to use a more specific algorithm, such as a Jenkins hash).

#light
open System
// Based on the Java Bloom Filter implementation by Ian Clarke
// http://locut.us/blog/2008/01/12/a-decent-stand-alone-java-bloom-filter-implementation/
type BloomFilter(bitArraySize : int, expectedElementCount : int) =
    let bitSet = new System.Collections.BitArray(bitArraySize, false)
    // Optimal number of hash functions: k = (m / n) * ln 2
    let k = int (Math.Ceiling( (double bitArraySize / double expectedElementCount) * Math.Log(2.0) ))
    // A Random seeded with the object's hashcode stands in for k independent hash
    // functions: each call to Seq.hd re-enumerates the lazy sequence, which draws
    // the next value from the shared Random
    let bitSequence o =
        let r = new Random( hash o )
        Seq.init_infinite (fun _ -> r.Next(bitArraySize))
    member b.expectedFalsePositiveProbability =
        Math.Pow((1.0 - Math.Exp( -(double k) * double expectedElementCount / double bitArraySize)), double k)
    member b.add o =
        let sq = bitSequence o
        for _ in 1 .. k do bitSet.Set( Seq.hd sq, true )
    member b.addAll os = Seq.iter b.add os
    member b.clear() = bitSet.SetAll(false)
    member b.contains o =
        let sq = bitSequence o
        let isSet _ = bitSet.Get( Seq.hd sq )
        Seq.for_all isSet [1..k]
    member b.containsAll os = Seq.for_all b.contains os

Using F# makes the code slightly neater than Ian’s Java version – I’ve been able to factor out the hash code into a sequence supplied by the bitSequence function, and to collapse some of the for loops into operations over lists instead. But the basic structure of the code is still very similar.
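
To see it in use, here’s a quick sketch of exercising the filter (the bit array size and expected element count are just illustrative values):

let filter = new BloomFilter(1000, 100)
filter.addAll ["apple"; "banana"; "cherry"]
printfn "expected false positive rate: %f" filter.expectedFalsePositiveProbability
printfn "banana?  %b" (filter.contains "banana")   // always true: no false negatives
printfn "kumquat? %b" (filter.contains "kumquat")  // almost certainly false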

Oxford Semantic Web Interest Group

Leigh Dodds of Ingenta and I recently spoke at the Oxford SWIG. We were both talking about SPARQL, the standard query language and protocol for the Semantic Web.

Eamonn Neylon, who organized the session, has written up a summary of the talks. I was talking about DBPedia and how to write SPARQL extension functions; Leigh was talking about his SPARQL tool Twinkle, and about the different SPARQL query forms (SELECT, DESCRIBE, CONSTRUCT, and ASK) and what they are useful for.

Bloom Filters for efficient ontology querying and text mining

One of the problems with large ontologies such as SNOMED Clinical Terms is that they’re, well, large. So, it’s not typically possible to hold all of the ontology in memory at once, and queries against it require a database lookup. It’s possible to eliminate a number of database accesses, and thus speed up the query process, by using a Bloom filter.

A Bloom filter is a memory-efficient probabilistic data structure that lets you test whether a particular item is a member of a set. It may return false positives, but not false negatives. So, by adding all of the terms in your ontology to a Bloom filter, you can do a fast, in-memory check to see whether an entered term definitely doesn’t exist in your ontology. If the Bloom filter reports that the term does exist, then you can confirm with a slower file or database query for that term.
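
In code, that lookup pattern looks something like the sketch below, where mightContain is the Bloom filter’s membership test and lookupTermInDatabase stands in for whatever slower file or database query you’d actually use:

// Only hit the database for terms the filter says might be present; a
// "not present" answer from the filter is definitive, so those terms never
// cause a database access at all.
let findTerm mightContain lookupTermInDatabase (term : string) =
    if mightContain term
    then lookupTermInDatabase term   // possible false positive, so confirm against the real store
    else None                        // definitely not in the ontology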

In applications where you expect to encounter many terms that aren’t in the ontology, such as automated metadata extraction from documents or automated document classification, this can lead to large performance improvements.

I think there are also interesting possibilities in using Bloom filters in environments where storing a whole ontology isn’t feasible. For example, a JavaScript implementation of a Bloom filter, initialized with a few hundred kilobytes of data, could give a fairly high probability of testing accurately whether a particular term exists in an ontology of half a million terms.
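
As a rough sketch of the numbers (assuming a 400kb bit array for 500,000 terms, and using the same formulas as the F# implementation above):

let m, n = 3200000.0, 500000.0            // 400kb of bits, half a million terms
let k = ceil ((m / n) * log 2.0)          // about 5 hash functions
let falsePositiveRate = (1.0 - exp (-k * n / m)) ** k
// roughly 0.05: a term that isn't in the ontology is correctly rejected
// about 95% of the time, without fetching anything further from the server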

Microsoft takes on GWT

It’s been a problem for a while that developers of web applications need to use a language like JavaScript on the web client, and another language like Java or C# or Python on the server. One popular attempt to fix this is Google’s GWT, and there have been other less mainstream options like ParenScript for Lisp and Links.

Now, Microsoft is launching another contender in the same space: Volta.

The post is somewhat obscure, but it’s essentially a beta version of a GWT competitor for .NET. You use annotations to mark chunks of code to be run on the client side or the server side, and they’re compiled behind the scenes to JavaScript and deployed. There’s a debugger and profiler for the client-side code too.

An interesting feature is that it works on MSIL (the .NET bytecode) rather than on the language syntax (as GWT does). Therefore, you should be able to use the more functional .NET languages with it – F#, for instance, is an ML implementation for .NET that appears well supported by MS. For that matter, C# 3 is already among the most functional mainstream languages.

The beta version is only available for Visual Studio 2008 – currently available if you have an MSDN subscription, but not yet available for purchase.

SWIG-UK Semantic Web workshop

Last Friday I attended and spoke at the SWIG-UK Semantic Web workshop in Bristol, organized jointly by Hewlett-Packard Research and the Oxford Semantic Web Interest Group (OxSWIG).

It was a fun day, with a number of interesting speakers. I’d say the common theme was that the speakers were generally using Semantic Web technologies to solve real-world problems, without being too concerned about the “Semantic Web” in the abstract. Alberto Reggiori of Asemantics described the “RDF on the inside, Web 2.0 on the outside” approach they’re using for work around the BBC Memoryshare site, and this pragmatic approach was typical of the other speakers.

A talk that particularly stood out for me was the first of the day, from Ian Davis of Talis, talking about their “Platform as a Service” model: providing a hosted RDF datastore with a REST API. It seems very much like a higher-level version of Amazon’s S3 – but rather than providing a simple data-bucket API, they’re providing a more sophisticated SPARQL interface, as well as full-text search.

I was also very interested in what Graham Klyne of Oxford University said about their experience in providing access to scientific data on the web. I’ve been working with Graham recently, so I’m generally familiar with their work, but it was good to hear more background. They are using a “Data Webs” approach – putting a SPARQL endpoint on top of their existing ePrints repository, so the data stays in its original location. The key points from his talk were the importance of the loose coupling enabled by their current approach, the advantages of leaving data in place rather than extracting it to a separate store, and that rapid progress is possible.

I talked about applying Semantic Web technologies to data integration and visualization: working with Oxford University to help visualize gene expression in fruit flies, and combining and visualizing customer data from several sources, including a CRM system. My slides are available. In the same session as me, Daniel Lewis spoke about the intersection between the Social Web (i.e. Facebook, LinkedIn, etc.) and the Semantic Web. Unfortunately, he borrowed my laptop to do so, so I don’t have any detailed notes on his talk.

There were a few discussions on the day about a follow-up meeting. With luck, there might be another meeting, possibly in the Oxford/Milton Keynes area, early in the new year. It also sounds like Talis might be arranging a SWIG meeting for next year.

In any case, the next evening meeting of the Oxford SWIG is on Wednesday 12th December, at the Lamb and Flag in Oxford.

Why Wikipedia is a bad example of knowledge management

Nowadays, most people’s first contact with a wiki is through Wikipedia. Wikipedia is a great resource, but I think it’s a bad example of using a wiki for knowledge management. The limitation of Wikipedia is that it aims to be purely a repository of existing knowledge. Wikipedia policies require that content on the site is written from a neutral point of view, is verifiable, and contains no original research: even going so far as to ban synthesis of other sources. Although reasonable in an encyclopaedia, these restrictions give the wrong idea about how wikis can be used for knowledge management. They encourage the idea that wikis are for knowledge capture, rather than knowledge creation, and that their main purpose is to produce a good knowledge store.

This bias extends to commercial wiki vendors, too. Wiki vendor MindTouch describes their wiki as making “knowledge easier to capture, find and consume” (thanks to Zoli’s Blog for the link), and Atlassian describes their wiki Confluence as “lowering the barriers to knowledge capture”.

However, there are other wikis than Wikipedia, and the original Wiki was designed for knowledge creation. The Portland Pattern Repository (also called the WikiWiki, or Ward’s Wiki) was the first Wiki, created by Ward Cunningham in 1995. Although technically simple, the Wiki was brilliant in its design: it facilitated asynchronous conversations between distributed contributors, enabling a dialogue across time and space. At its best, these conversations led to new insights, which were then formalised into documents.

The WikiWiki was originally focused on design patterns, but by the time I first encountered it in 1997 it was starting to focus on Extreme Programming, and since then it has drifted into other topics without a clear focus. Some of the most interesting pages on it are the longest-established ones: I particularly like the patterns pages, and the proto patterns.

At 67 Bricks, we’re using Semantic MediaWiki to help us create new knowledge, as well as to store our existing knowledge. I’ll write more about it in a future blog entry.

Setting up a website using WordPress

I’ve set the 67 Bricks website up using WordPress as a content management system. Previously, I’ve used either simple static HTML pages, or a traditional, full-featured CMS. I decided to use WordPress here because it is easy to use and has an ecosystem of templates written for it, making site design much simpler. It also describes itself as a “Semantic Publishing Platform”, and for a knowledge-management company such as our own, semantic publishing is important.

Setting up WordPress requires very little technical knowledge: my ISP, Heart Internet, allows you to do it with a few clicks in their script library, but even without that, it’s still simple. The changes that I’ve made to make it work as a CMS are:

  • Installing the Navigo plugin, which makes it easy to create a menu
  • Installing Search Everything, so that pages can be searched as well as posts
  • Installing WP Last Posts, to put the text of the last few posts on the home page

Then, I created the bulk of the site using WordPress “pages”, and the news items as “posts”. I set the front page to be a static page under the Options | Reading menu.

To create a design for the site, I looked through the library of templates on the WordPress site, downloaded four or five that looked interesting, and tried them out locally. Having found a template with a good layout, clean HTML code, and a reasonable license, I then customized it to remove the sections I didn’t like, and to change the graphics – a much simpler process than building a web site design from scratch.

That’s really all there is to it.

So, what did I learn?

Setting up a website with WordPress is very quick. Installation of WordPress took a few minutes, installing plugins took maybe a quarter of an hour, choosing and customizing a theme about an hour, and the rest of the time was spent creating content.

The WordPress way of separating content from presentation using PHP functions works well. This is more powerful than the pure CSS approach of CSS Zen Garden, since it allows different templates to display different content. An alternative would be to use XQuery, which would probably be more versatile, but XQuery hasn’t seen wide take-up in the PHP developer community, which seems generally sceptical of XML.

(thanks to twenty3design for help with choosing WordPress plugins)