ht://Dig vs mnoGoSearch Comparison

This article was first written in February 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/209). Since then, other massive
players like Lucene and Xapian have been considered as open-source search
engines, but they have not yet been included in this comparison.

Introduction

This article is an attempt to objectively compare two wonderful Open Source indexing tools: ht://Dig and mnoGoSearch. The comparison criteria will mainly be aimed at what one can do on a Linux web server with Open Source database systems such as PostgreSQL and MySQL, with PHP bindings if available. This is because this research isn't funded by anyone and I need one of these products for, in this case, the development of a search tool for Dokeos, an e-learning management system developed in PHP+MySQL.

Here, I will intentionally mix the terms search tool and indexing tool, because these products indeed do both, even if the main effort goes into indexing. I know the terms are different, but the distinction doesn't matter here, and it's easier for non-technical people to understand search tool than indexing tool.

The indexing system will probably work with the help of command-line parsers. Those parsers generally only work on Linux systems, so the Dokeos system would need to keep this indexing tool as a plugin, so as not to force the user to use a Linux server. This only affects the server anyway, and the Dokeos system can still be used from Windows computers.

For a more extensive list of search tools, you might want to visit searchtools.com, which has quite an impressive list… In fact, the pages there for ht://Dig and mnoGoSearch are really interesting.

General project aspect

At first glance, we could say that the website is a reflection of the overall quality of the product. Beyond just finding out whether the website is crappy or well-organised (which reflects the mind of the developers), you can find documentation, changelogs, and many other interesting things there. Let's see what we can extract from this part:
Information type            | ht://Dig                 | mnoGoSearch
Design                      | OK                       | Bad
Last changelog date         | January 2002             | January 2005
Last release date           | June 2004                | January 2005
Finding information easily  | Yes                      | Yes
Last documentation update   | 2002?                    | December 2004 (release 3.2.26)
References to "clients"     | Around 500 (impressive!) | Around 200 (including MySQL and Debian!)
Another interesting point is the portability of the system. Both systems are only needed on the server, so only the server's OS is important.
Operating system | ht://Dig   | mnoGoSearch
Win32            | via CygWin | native but not free
Linux            | yes        | yes
Digital Unix     | yes        | yes
FreeBSD          | yes        | yes
OpenBSD          | no         | yes
HP-UX            | yes        | yes
Solaris          | yes        | yes
Irix             | yes        | yes
SunOS            | yes        | yes
Mac OS X         | yes        | yes
Mac OS 9         | yes        | no
BSDI             | no         | yes
SCO UnixWare     | no         | yes
AIX              | no         | yes
As for the mailing-lists, it appears to me as if mnoGoSearch was trying to add value to its support contracts by being very slow to answer help messages. The result is pretty poor, as you need around a week to get an answer to pretty much anything, and the documentation is not really good (although fairly up to date). On the other hand, the ht://Dig support seems faster, but it always seems like only one person gives answers, and sometimes no answer is given at all. As far as mailing-list support goes, I would say both projects suck, but mnoGoSearch offers official commercial support, which might be a good thing for companies looking into the product. As for numbers, over the same 45-day observation period, there were 134 mails on the htdig-general mailing-list whereas mnogosearch-general comes up to almost 400.

Web and offline "crawling"

Both systems were made to crawl web pages and extract information from them. This is not possible when a website is password-protected, unless you use mnoGoSearch to pass the Basic Authentication method, which is not often used to protect a website nowadays. To get into a, say, password-protected PHP website, you will need to write a (very well protected) PHP page that will log you in and initiate a session for you [1] (a sketch of such a page follows this paragraph). While mnoGoSearch offers offline and database crawling, the documentation is lacking here too, and although a little sample is given away with the program code, it is not really helpful, because you still need web pages the users can use, with a particular ID, to get to the documents you indexed offline. There is also a lack of documentation about how to use multiple offline searches on the same server.
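To make this more concrete, here is a minimal sketch of such a login page in PHP. Everything in it is an assumption made for illustration (the file name, the IP check, the read-only account); it is not part of either product and only shows how an indexer could be handed a session before it starts following links.

    <?php
    // crawler_login.php -- hypothetical entry page for the indexer (sketch only).
    // The idea: the crawler requests this page first; the page opens a read-only
    // session so that the protected course pages become reachable for indexing.

    // Only accept requests coming from the machine that runs the indexer.
    $allowed_ip = '127.0.0.1'; // assumed indexer host
    if ($_SERVER['REMOTE_ADDR'] !== $allowed_ip) {
        header('HTTP/1.0 403 Forbidden');
        exit;
    }

    session_start();
    $_SESSION['user_id']   = 'indexer'; // hypothetical read-only account
    $_SESSION['read_only'] = true;

    // Hand the crawler a starting point; the session id travels via trans_sid
    // (see note [1]) or, in later mnoGoSearch versions, via a cookie.
    header('Location: index.php'); // assumed entry page of the protected site
    exit;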

The document types

Both search tools were first aimed at indexing HTML documents. The respective websites of these products make that clear in their titles: they are Internet search tools. As far as Dokeos is concerned, this is not enough. Courses are composed of a mix of HTML pages and documents, which can be whatever the user decides. The more document types the search tool can index, the better. The following table is built as accurately as possible from the documentation of these products as of the writing of this article. Although the search tools only index text and HTML files, they allow the use of external parsers (see the respective documentation of mnoGoSearch and ht://Dig) that will convert other formats to text or HTML, which can then be indexed (a sketch of such a parser wrapper follows the table).
Document type                            | ht://Dig                                                     | mnoGoSearch
HTML                                     | yes                                                          | yes
Plain text                               | yes                                                          | yes
MS-Word                                  | yes, via catdoc                                              | yes, via catdoc
PDF                                      | yes, via xpdf or pdftotext (pdfinfo also extracts meta info) | yes, via pdf2txt
PostScript                               | yes, via ps2ascii                                            | yes, via ps2ascii
Man pages                                | probably, via a parser                                       | yes, via deroff
RPM packages                             | probably, via a parser                                       | yes, via rpminfo
MS-Excel                                 | probably, via a parser                                       | yes, via xls2csv
RTF                                      | probably, via a parser                                       | yes, via unrtf
MS-PowerPoint                            | yes, via ppt2html                                            | yes, via ppt2html
Database content                         | yes, via web frontend :-(                                    | yes, via HTDB features [2]
Websites needing HTTP context (scripts)  | ?                                                            | yes
HTTPS                                    | ?                                                            | yes
Others…                                  | yes, via an appropriate parser                               | yes, via an appropriate parser
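As an illustration of how these external parsers work, here is a minimal command-line wrapper in PHP. The file name and the wrapper itself are assumptions made for this article: it simply turns an MS-Word file into plain text with catdoc, and each engine is then pointed at the wrapper through its own configuration (external_parsers in htdig.conf, Mime directives in indexer.conf), as described in their respective documentation.

    #!/usr/bin/env php
    <?php
    // parse_doc.php -- hypothetical external parser wrapper (sketch only).
    // The indexer calls this script with a document path and indexes whatever
    // the script prints on standard output.

    if ($argc < 2) {
        fwrite(STDERR, "Usage: parse_doc.php <file.doc>\n");
        exit(1);
    }

    $file = $argv[1];
    if (!is_readable($file)) {
        fwrite(STDERR, "Cannot read $file\n");
        exit(1);
    }

    // catdoc writes the extracted text to stdout, which the indexer reads.
    passthru('catdoc ' . escapeshellarg($file), $status);
    exit($status);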

Indexing methods

Indexing can use several kinds of algorithms. Generally speaking, it is the user's role to choose which algorithm(s) will be used. This is a list of supported algorithm types:
Algorithm type             | ht://Dig | mnoGoSearch
Stemming                   | yes      | yes
Soundex                    | yes      |
Fuzzy                      | yes      |
Synonyms                   | yes      |
Substrings                 | yes      |
Thesaurus-based indexation | no       | no
And although they both seem to support boolean searches, support might be somewhat limited in the latest versions:
Boolean operator | ht://Dig | mnoGoSearch
AND              | yes      | yes
OR               | yes      | yes
NOT              | no       | yes
GROUP            | no       | yes
Phrase matching  | no       |

Language handling

An important matter in the current case is the handling of different languages. Dokeos (the company) has its headquarters in Belgium, which is a trilingual country (French, Dutch, German) and uses a lot of English. But the Dokeos product is Open Source and we have heard (and already have a translation for it) that Japanese users exist. So the multilingual aspect is important, for the database data as well as for the documents and HTML pages. There couldn't be a clear comparison table for the two products here. ht://Dig offers multilingual support based on a configuration file and ispell dictionaries. However, this only covers 8-bit character encodings, so Japanese and Chinese cannot be handled (see report). mnoGoSearch offers multilingual support based on a configuration file, but the indexing of non-English documents is badly documented, so we don't know what it's based on. However, the recommended encoding for multilingual support is Unicode, and as such it supports all characters. Language detection is supposed to be automatic, based on words contained in the documents… let us see if we can get more info about this somewhere. According to searchtools.com, the tool uses ispell as a spell checker.

Search interfaces

I will only be able to talk about mnoGoSearch here for now, because I haven't tested the ht://Dig interface yet. The mnoGoSearch project offers a bundled CGI search interface by default. This interface is nice and easy, and works with template files so you can modify them (almost) at will. But sometimes (as in this case), you need to integrate this search interface into another project, for example a PHP application. mnoGoSearch also provides a PHP extension that you need to install (not so easily) on your system to be able to use its functions directly from your PHP project (a usage sketch follows the list below). Let's consider this option in more detail:
  • the install
The first word that comes to my mind here is "brainsquishing". Why does an install need the MySQL and PHP sources and a full recompile? Why isn't there an easy way to download some scripts, move them to the right place, and start to play? Apparently, this was the case previously and it stopped because integrating the mnoGoSearch functions into a C-compiled PHP extension was more efficient. Well, for me, who just wants a fast, integrable solution, it's just a bad point. Not only do you need to compile all this, but you also need to have the MySQL and PHP sources at hand for the compile process. Most users who want to try it on an outsourced server where they don't have near-admin permissions will be unable to use it…
  • the config
I did not get past this compilation step, so I will write something here when I have.
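For completeness, here is a minimal sketch of what a search call looks like once the extension is built, using the udm_* functions documented for this PHP extension. I have not been able to test it myself (see above), and the DBAddr connection string is a placeholder.

    <?php
    // search.php -- rough sketch of the mnoGoSearch PHP extension in use.
    // The connection string below is an example value, not a recommendation.
    $agent = udm_alloc_agent('mysql://user:password@localhost/mnogosearch/');

    // Ask for the first page of ten results.
    udm_set_agent_param($agent, UDM_PARAM_PAGE_SIZE, '10');
    udm_set_agent_param($agent, UDM_PARAM_PAGE_NUM, '0');

    $res = udm_find($agent, 'course material');
    if ($res) {
        $rows = udm_get_res_param($res, UDM_PARAM_NUM_ROWS);
        for ($i = 0; $i < $rows; $i++) {
            echo udm_get_res_field($res, $i, UDM_FIELD_URL), ' - ',
                 udm_get_res_field($res, $i, UDM_FIELD_TITLE), "\n";
        }
        udm_free_res($res);
    } else {
        echo 'Search error: ', udm_error($agent), "\n";
    }
    udm_free_agent($agent);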

Other considerations

ht://Dig (or its rundig program) seems to rebuild the index database from scratch each time it runs. It is written in C++. mnoGoSearch only updates the indexes that are bound to modified sources (but does it work for database sources?). It uses database systems for its index tables and can handle a dozen different DBMSs, implements server clustering and mirroring, can do Basic Authentication if the web server requires it and the settings are configured for it, supports weighting based on page structure, and supports gzip-compressed content. It is written in C. Both systems can do search result highlighting.
[1] But mnoGoSearch (what about ht://Dig?) before version 3.2.34 didn't support cookies, so you had to configure PHP to use trans_sid, which meant the PHPSESSID was passed in the URL. But you didn't want this PHPSESSID to appear in the index database, as users who clicked the link would have automatically been logged in (depending on the session lifetime). So you had to implement something called ReverseAlias to strip this part of the URL before it got into the database. Things then became very hard to do, as there was a big lack of documentation on how to use this in combination with a list of servers kept in a database table.