Thursday, December 07, 2006

Prolog and regular expressions

Some time ago I started writing a lemmatizer for Dutch (i.e. a program that receives as input a word and some tags that describe it, and outputs the dictionary form of the word). I have been programming nearly exclusively in Prolog for the last three months, so the choice of language was quite obvious (also, I have the feeling that trying to write something serious in a new programming language helps you learn it a lot better than writing even a gazillion textbook exercises). However, not everything was so simple. When you are doing any text processing, you surely want to use regular expressions.

Both Prologs I usually use (SWI and SICStus) of course let you use DCGs, but that was not quite what I was looking for. I wanted normal, POSIX-style regular expressions. First of all, that is what I know and am used to. Secondly, it is what the majority of people who will have anything to do with my program will be used to and will understand. And, last but not least, I don't really feel like writing and naming a new predicate every time I have to match something.
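To show what I mean by "writing and naming a new predicate", here is a rough sketch of what matching strings of the form "X=X" could look like with a DCG. The names x_eq_x, nonempty and rest are made up for this example; it only uses standard DCG notation and phrase/2, so it should behave the same way in SWI and SICStus.

```prolog
% x_eq_x(X): the input is X, then '=', then X again (X is a non-empty code list).
x_eq_x(X) --> nonempty(X), [0'=], nonempty(X).

% nonempty(Cs): consumes one or more character codes and collects them in Cs.
nonempty([C|Cs]) --> [C], rest(Cs).

% rest(Cs): consumes zero or more character codes.
rest([])     --> [].
rest([C|Cs]) --> [C], rest(Cs).

% Example query:
% ?- atom_codes('ab=ab', Codes), phrase(x_eq_x(X), Codes), atom_codes(A, X).
```

That is three named rules for something a regular expression says in a single line, and every new pattern means more of them.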

A good place to start my quest for Perl-style regexps was Van Nord's page, a quite extensive list of resources related to regular expressions. My first idea was to find a library that I could use with SWI or SICStus. However (you may call me dumb) I didn't manage to compile or get working any of the things I found there. So I decided to look for a Prolog implementation that comes with regexps out of the box.

The two Prologs I tested were Ciao and Yap.

Ciao makes a good impression at first (even though its web page is *really* ugly). However, the regexps it implements are quite lame: they don't support extracting matched groups (or at least I didn't find out how), which rendered them quite useless for me.

In Yap (which stands for Yet Another Prolog and is developed in Portugal), however, regexps finally worked. Its regexp library is a direct port of the FreeBSD regex library, so it's pleasantly plain and normal. The only drawback is that it is very poorly documented. For example, I spent quite a long time wondering who would implement regular expressions without back-references, and why. I would still be wondering if not for a lucky typo, which revealed that in Yap regexps you access previously matched groups not with a backslash followed by a number, but with a *double* backslash followed by a number (maybe for some reason this should be obvious, but it wasn't to me).

So, if you want a regular expression that matches all strings of the form "X=X", where X may be any non-empty string, you have to write:

?- use_module(library(regexp)).
?- regexp("^(.+)=\\1$", "a=a", []).
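In hindsight, the double backslash is probably just Prolog's own string escaping at work: inside a double-quoted string the backslash is an escape character, so \\ stands for a single backslash, and the regex engine ends up seeing the usual \1.

For pulling the matched groups out (the thing I missed in Ciao), the Yap manual also describes a four-argument regexp/4 whose last argument is a list of variables, one per parenthesized group. The sketch below is based on my reading of that documentation, so treat the exact behaviour as an assumption; the pattern and the input string are made up for the example.

```prolog
?- use_module(library(regexp)).
% Left and Right should be bound to the text matched by the first and
% second group, as lists of character codes.
?- regexp("^(.+)=(.+)$", "key=value", [], [Left, Right]).
```

If the groups do come back as code lists, atom_codes/2 turns them into atoms that are easier to read.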

Apart from that strange incident, Yap turned out to be a nice and fast Prolog implementation. I especially appreciate the SWI-Prolog compatibility module, which lets me write exactly the way I'm used to (btw, there is also a SICStus compatibility module). It's only a pity that Yap depends on some non-free elements (to be exact, “free for non-commercial use”), which is why Yap binaries aren't shipped with any major Linux distribution.

(Oh, and one last problem: the binaries from binaries.tar.gz crash when you try to load the regexp library... Luckily, the RPM version for Fedora works fine on Ubuntu after you convert it to a DEB with alien.)