Xapian: the tricky multi-term removal process

Update 2012-01-17: this article is quite old now and might be completely irrelevant. It is only provided as a hint that might help you write a PHP procedure to manage indexing. As Olly Betts (main developer of Xapian) commented below, the error message doesn't come directly from Xapian either; it might come from something built on top of it. No harm is meant to Xapian: it is a very lightweight solution that fits our need for an indexing component in our PHP application without adding complicated Java requirements, and it's been working for us for several years now. No critical use, but it's never been down either.

Chamilo now implements the Xapian search engine in its professional version. The results are quite good, but to implement a very specific need for one customer, we had to make something a bit complicated: we associated terms in the Xapian database with a specific table of terms in Chamilo. Not playing too much with transactions (as we should, really), we've been relying on keeping the two databases in sync by having code that always updates both together.

Of course, one of our team was taken aback by a client request and decided to "clean some terms directly from the Chamilo database"... Murphy's law is always around...

Anyway, I had to implement a little (very ugly for now) interface to add/remove/edit terms from the Xapian database without affecting the Chamilo database. That's when I realized that, when you remove terms from a XapianDocument object, you have to do the following process:

    $list_of_terms_to_remove = array('term1', 'term2', 'term3');
    $xi = new XapianIndexer();
    $doc = $xi->get_document((int)$doc_id);
    foreach ($list_of_terms_to_remove as $rem_term) {
        // Remove the term from the in-memory document...
        $doc->remove_term($rem_term);
        // ...and write the document back right away, inside the loop
        if ($doc instanceof XapianDocument) {
            $xi->getDb()->replace_document((int)$doc_id, $doc);
        }
    }
    // Commit the pending changes to disk
    $xi->getDb()->flush();

Now... it doesn't look like it, but the replace_document() call is actually quite important. If you don't put it *inside* the loop, Xapian will give you an evil error saying a term cannot be removed from an unexisting document! Want to avoid that? Call replace_document() after each removal. It's that easy.
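
For comparison, a commenter below suggests that a single replace_document() call after all the removals should be enough. The following is only a minimal sketch of that variant, not a tested fix. The function name remove_terms_once() is hypothetical, $db is assumed to behave like the writable database object returned by $xi->getDb() above, and remove_term() is assumed to raise a PHP exception when the term is not in the document.

```php
<?php
// Hypothetical helper: remove several terms, then write the document back once.
// Assumption: $db offers get_document(), replace_document() and flush(),
// like the object returned by $xi->getDb() in the snippet above.
function remove_terms_once($db, $doc_id, array $terms)
{
    $doc = $db->get_document((int)$doc_id);
    $removed = array();
    foreach ($terms as $term) {
        try {
            $doc->remove_term($term);
            $removed[] = $term;
        } catch (Exception $e) {
            // Assumption: a term missing from the document raises an
            // exception; skip it and carry on with the next term.
        }
    }
    // Single write-back after all the in-memory changes
    $db->replace_document((int)$doc_id, $doc);
    $db->flush();
    return $removed;
}
```

It could be called as remove_terms_once($xi->getDb(), $doc_id, $list_of_terms_to_remove); the returned array lists the terms that were actually removed, which may help spot terms that were already missing.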

Comments

I think nowadays it is very important for all "long way" projects to adopt a framework, sooner or later. I remember a company which developed a CMS and spent 70% of its work time every day fixing and adapting its libraries, and I thought "they are crazy": everything they were doing was already done in a better way and ready to use in RoR, Django or CakePHP. Focus on understanding deeply and totally the domain logic of your application, not on developing a framework. One must learn to drop the old libraries one created (and many times loves). This takes a mature developer personality: just let it go.

That's the reason I began my own e-learning tool: http://trac.chipotle-software.com/karamelo/ I know there is dirty stuff in many files, but it is fully OO and MVC, and knowing that there are people making improvements in ActiveRecord, Ajax, ACL, security, and cache features for my application while I work on my own tickets is great and makes the development process faster.

I look at the code in the main LMSs and I don't (I REALLY don't) wanna spend four boring and frustrating months deciphering non-standardized, "practical born" libraries like those in Moodle, Claroline or Dokeos 1.x: libraries which are not backed by any design pattern. I respect their work because I know it takes many hours of hard work to implement a feature, but as long as these projects don't adopt a framework they are driving away new developers.

On the other hand, over the last fourteen months the CakePHP developers have made great advances in reliability and velocity. But above all, when you use a framework you automatically gain millions of potential developers who can read and understand the code immediately: "oh! that is the model method", "here is ActiveRecord" and "that is an HTML helper, mmm, I know where those arrays came from... I can modify them", and so on. In the end, adopting a framework has social repercussions that help build a better community.

Cheers!

Hi Julio, as mentioned in another post, I agree with the idea of not reinventing the wheel, and I feel startled by the fact that you are writing this, yet you started a new e-learning tool on your own with no apparent previous major experience of a PHP framework. Maybe your own ideas do not apply to you?
There are more than 6 major open-source e-learning systems around (Dokeos, Moodle, Atutor, ILIAS, e-doceo and Sakai), yet you still find a way to start your own (be it for a totally non-end-user-focused reason of "not supporting any design pattern"). All of them are probably summing up to more than 20 man-years of development.
The "four boring and frustrating months" you are talking about are probably about the same as the time needed to learn CakePHP (I stopped after one week).
I can only wish you good luck in your endeavours and I hope you can prove me wrong by making a better LMS than Dokeos ;-)

You should be able to just call replace_document() once after removing the terms you want, and in my tests that works fine. I'm not aware of there ever being any bugs in this area. If you can reproduce this, please file a bug. However, there's no error message in Xapian about "unexisting document", so perhaps you were actually getting that error from something else?

Incidentally, I don't see the use of the instanceof check - get_document() will throw an exception if it fails rather than returning null, and besides if $doc isn't a XapianDocument object, then the call to its remove_term method will fail.

Hi Olly,

Actually, as you can see, it's been quite some time since the article was written, and we have been using Xapian in Chamilo successfully for quite a while (thanks for the precious and quick help you provided time and again on the mailing list, by the way). I'm not sure if this still happens, but I have a very busy schedule these days, so I'll check it again sometime but dunno when. In the meantime, I'll edit the post to make it clear it's not really relevant anymore.