This article was first written in February 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/209). Since then, other massive
players such as Lucene and Xapian have been considered as open-source search
engines, but they have not yet been included in this comparison.
Introduction
This article is an attempt to compare objectively two wonderful open-source indexing tools: ht://Dig and mnoGoSearch.
The comparison criteria focus mainly on what one can do on a Linux web server with open-source database systems such as PostgreSQL and MySQL, with PHP bindings if available. This is because this research isn't funded by anyone, and I need one of these products to develop a search tool for Dokeos, an e-learning management system developed in PHP+MySQL.
Here, I will intentionally mix the terms search tool and indexing tool, because these products do both, even if the main effort goes into indexing. I know the terms are different, but the distinction doesn't matter here, and search tool is easier for non-technical people to understand than indexing tool.
The indexing system will probably work with the help of command-line parsers. Those parsers generally only work on Linux systems, so the Dokeos system would need to keep this indexing tool as a plugin, so as not to force the user to use a Linux server. This only affects the server anyway, and the Dokeos system can still be used from Windows computers.
For a more extensive list of search tools, you might want to visit searchtools.com, which has quite an impressive list… In fact, the pages there for ht://Dig and mnoGoSearch are really interesting.
General project aspect
At first glance, we could say that the website reflects the overall quality of the product. Beyond finding out whether the website is crappy or well organised (which reflects the mindset of the developers), you can find documentation, changelogs, and much other interesting material there. Let's see what we can extract from this part.
| Information type | ht://Dig | mnoGoSearch |
|---|---|---|
| Design | OK | Bad |
| Last changelog date | January 2002 | January 2005 |
| Last release date | June 2004 | January 2005 |
| Finding information easily | Yes | Yes |
| Last documentation update | 2002? | December 2004 (release 3.2.26) |
| References to "clients" | Around 500 (impressive!) | Around 200 (including MySQL and Debian!) |
Another interesting point is the portability of the system. Both systems are only needed on the server, so only the server's OS matters.
| Operating system | ht://Dig | mnoGoSearch |
|---|---|---|
| Win32 | via CygWin | native, but not free |
| Linux | yes | yes |
| Digital Unix | yes | yes |
| FreeBSD | yes | yes |
| OpenBSD | no | yes |
| HP UX | yes | yes |
| Solaris | yes | yes |
| Irix | yes | yes |
| SunOS | yes | yes |
| Mac OS X | yes | yes |
| Mac OS 9 | yes | no |
| BSDI | no | yes |
| SCO Unixware | no | yes |
| AIX | no | yes |
As for the mailing lists, it appears to me as if mnoGoSearch is trying to give value to its support contracts by being very slow to answer help messages. The result is pretty poor: you need around a week to get an answer to pretty much anything, and the documentation is not really good (although pretty much up to date). On the other hand, ht://Dig support seems faster, but it always seems that only one person gives answers, and sometimes no answer is given at all. As far as mailing-list support goes, I would say both projects suck, but mnoGoSearch offers official commercial support, which might be a good thing for companies looking into the product.
As for numbers, over the same 45-day observation period, the htdig-general mailing list received 134 mails, whereas mnogosearch-general came to almost 400.
Web and offline "crawling"
Both systems were made to crawl web pages and extract information from them. This is not possible when a website is password-protected, unless you use mnoGoSearch with HTTP Basic Authentication, a method that is not often used to protect websites nowadays.
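For readers unfamiliar with Basic Authentication, here is a minimal Python sketch of what a crawler has to send with each request. It only illustrates the standard HTTP mechanism; it is not mnoGoSearch code, and the function name and sample credentials are made up for the example.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP 'Authorization' header for Basic auth.

    The credentials are joined with a colon and Base64-encoded. This is
    an encoding, not encryption, so it is only safe over HTTPS.
    """
    raw = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


# Hypothetical credentials, for illustration only.
header = basic_auth_header("Aladdin", "open sesame")
print("Authorization:", header)
```

A crawler that supports this method simply adds that header to every request it makes to the protected server.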
To get into, say, a password-protected PHP website, you will need to write a (very well protected) PHP page that logs you in and starts a session for you [1].
While mnoGoSearch offers offline and database crawling, the documentation is lacking here too. A small sample ships with the program code, but it is not really helpful, because you still need web pages that users can reach with a particular ID to get the document you indexed offline. Documentation is also lacking on how to run multiple offline searches on the same server.
The document types
Both search tools were first aimed at indexing HTML documents. The respective websites of these products make this clear in their titles: they are Internet search tools. As far as Dokeos is concerned, this is not enough. Courses are composed of a mix of HTML pages and documents, which can be whatever the user decides. The more document types the search tool can index, the better.
The following table is built as accurately as possible from the documentation of these products at the time of writing. Although the search tools themselves only index text and HTML files, they allow the use of parsers (see the parser documentation of each project) that convert other formats to text or HTML files, which can then be indexed.
| Document type | ht://Dig | mnoGoSearch |
|---|---|---|
| HTML | yes | yes |
| Plain text | yes | yes |
| MS-Word | yes, via catdoc | yes, via catdoc |
| PDF | yes, via xpdf or pdf2text (pdfinfo also extracts meta info) | yes, via pdf2txt |
| PostScript | yes, via ps2ascii | yes, via ps2ascii |
| Man pages | probably via parser | yes, via deroff |
| RPM packages | probably via parser | yes, via rpminfo |
| MS-Excel | probably via parser | yes, via xls2csv |
| RTF | probably via parser | yes, via unrtf |
| MS-PowerPoint | yes, via ppt2html | yes, via ppt2html |
| Database content | yes, via web frontend :-( | yes, via htdb features [2] |
| Website needing HTTP context (scripts) | ? | yes |
| HTTPS | ? | yes |
| Others... | yes, via appropriate parser | yes, via appropriate parser |
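The parser mechanism described above boils down to picking an external converter by file type. The following Python sketch shows the idea only. The converter names come from the table, but the extension mapping, the `conversion_command` helper, and the assumption that each tool is called as `<tool> <file>` and writes its result to standard output are mine, not taken from either product's configuration.

```python
import os

# Converter names are taken from the table above. Calling each one as
# "<tool> <file>" is an assumption of this sketch.
CONVERTERS = {
    ".doc": "catdoc",
    ".ps": "ps2ascii",
    ".rtf": "unrtf",
    ".xls": "xls2csv",
}

# Formats both tools index without any external parser.
NATIVE = {".html", ".htm", ".txt"}


def conversion_command(path):
    """Return the external command for a file, or None if no parser is needed.

    Raises ValueError for file types this sketch does not know about.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext in NATIVE:
        return None
    if ext not in CONVERTERS:
        raise ValueError(f"no parser configured for {ext!r}")
    return [CONVERTERS[ext], path]


print(conversion_command("course/notes.doc"))   # command for a Word file
print(conversion_command("course/index.html"))  # None: indexed directly
```

The returned list could be handed to `subprocess.run(..., capture_output=True)` to obtain the converted text, provided the converter is installed on the server.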
Indexing methods
Indexing can rely on several kinds of algorithms. Generally speaking, it is up to the user to choose which algorithm(s) will be used. Here is a list of supported algorithm types:
| Algorithm types | ht://Dig | mnoGoSearch |
|---|---|---|
| Stemming | yes | yes |
| Soundex | yes | |
| Fuzzy | yes | |
| Synonyms | | yes |
| Substrings | | yes |
| Thesaurus-based indexing | no | no |
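To give an idea of what one of these algorithm types does, here is a minimal Python sketch of the classic American Soundex code, which maps words that sound alike to the same four-character key. I do not know which exact Soundex variant ht://Dig implements, so take this as an illustration of the technique rather than of that product's code.

```python
# Consonant groups of the classic American Soundex scheme.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex key of a word ("" if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Both names in the example produce the same key, which is why a Soundex-enabled search for one can also find the other.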
Both tools also seem to support boolean search, although this support might be somewhat limited as of the latest versions:
| Boolean operator | ht://Dig | mnoGoSearch |
|---|---|---|
| AND | yes | yes |
| OR | yes | yes |
| NOT | no | yes |
| GROUP | no | yes |
| Phrase matching | no | |
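The boolean operators in the table map naturally onto set operations over an inverted index. The Python sketch below shows that correspondence on three made-up documents; it is a toy model of the general technique, and says nothing about how either product stores its index or parses queries.

```python
def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


# Made-up sample documents, for illustration only.
DOCS = {
    1: "open source search engine",
    2: "open source database",
    3: "commercial search engine",
}
INDEX = build_index(DOCS)


def lookup(word):
    """Return the ids of the documents containing the word."""
    return INDEX.get(word.lower(), set())


and_result = lookup("open") & lookup("search")            # open AND search
or_result = lookup("database") | lookup("commercial")     # database OR commercial
not_result = lookup("search") - lookup("open")            # search NOT open
group_result = (lookup("open") | lookup("commercial")) & lookup("search")

print(and_result, or_result, not_result, group_result)
```

Grouping, in this model, is just a matter of evaluating the parenthesised sub-expression first, which is why a tool without GROUP support cannot express the last query in one go.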
Language handling
An important matter in the current case is the handling of different languages. Dokeos (the company) has its headquarters in Belgium, which is a trilingual country (French, Dutch, German) that also uses a lot of English. But the Dokeos product is Open Source, and we have heard that Japanese users exist (and we already have a translation for them). So the multilingual aspect is important, for the database data as well as for the documents and HTML pages. No clear table could be drawn up to compare the two products here.
ht://Dig offers multilingual support based on a configuration file and ispell dictionaries. However, this only works for 8-bit character encodings, so Japanese and Chinese cannot be handled (see report).
mnoGoSearch offers multilingual support based on a configuration file, but the indexing of non-English documents is badly documented, so we don't know what it is based on. However, the recommended encoding for multilingual support is Unicode, and as such it supports all characters. The language detection is supposed to be automatic, based on words contained in the documents… let us see if we can get more info about this somewhere. According to searchtools.com, the tool uses ispell as a spell checker.
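Since the documentation does not say how the word-based detection works, the following Python sketch only shows one plausible way such a detection could be done: counting how many very common words of each language appear in the text. The tiny word lists and the `guess_language` helper are my own assumptions, and this is not a description of mnoGoSearch's actual algorithm.

```python
# Tiny, hand-picked lists of very common words per language.
# Real systems would use much larger lists or statistical models.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to"},
    "fr": {"le", "la", "et", "les", "des", "est"},
    "nl": {"de", "het", "een", "en", "van", "is"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    words = text.lower().split()
    scores = {
        lang: sum(1 for w in words if w in common)
        for lang, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("le chat est sur la table et les chiens"))
```

Such an approach needs Unicode-clean input to work across scripts, which fits with the recommendation to use Unicode for multilingual indexes.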
Search interfaces
I will only be able to talk about mnoGoSearch here for now, because I haven't tested the ht://Dig interface yet.
The mnoGoSearch project offers a bundled CGI search interface by default. This interface is nice and easy, and works with template files that you can modify (almost) at will. But sometimes (as in this case), you need to integrate this search interface into another project, for example a PHP application.
mnoGoSearch also provides a PHP interface that you need to install (not so easily) on your system to be able to use PHP extension functions directly from your PHP project.
Let's consider this option in more detail.
The first word that comes to my mind here is "brainsquishing". Why does an install need to have the MySQL and PHP sources and recompile everything? Why isn't there an easy way to download some scripts, move them to the right place, and start to play? Apparently, this used to be the case, and it stopped because integrating the mnoGoSearch functions into a compiled C extension for PHP was more efficient.
Well, for me, who just wants a fast solution that is easy to integrate, it's just a bad point. Not only do you need to compile these, but you also need to have the MySQL and PHP sources at hand for the compile process. Most users who want to try it on an outsourced server where they don't have near-admin permissions will be unable to use it…
I have not yet got through this compilation step, so I will write more here when I have.
Other considerations
ht://Dig (or its rundig program) seems to rebuild the index database from scratch each time it runs. It is written in C++.
mnoGoSearch only updates those indexes that are bound to modified sources (but does it work for database sources?). It uses database systems to create index tables and can handle a dozen different DBMSs. It implements server clustering and mirroring. It can do basic authentication if the web server requires it and the settings are configured for it. It supports weighting based on page structure and the compressed gzip format. It is written in C.
Both systems can do search result highlighting.
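The difference between rebuilding from scratch and updating only what changed can be sketched in a few lines of Python. Here I assume, for illustration, that a change is detected by comparing a content digest with the one stored at the previous run. How mnoGoSearch actually detects modified sources is not something I have verified, and the `changed_sources` helper is hypothetical.

```python
import hashlib


def changed_sources(sources, seen):
    """Return the urls whose content differs from the previous run.

    sources: dict mapping url -> current content (bytes)
    seen:    dict mapping url -> digest stored at the last run (updated in place)
    """
    changed = []
    for url, content in sources.items():
        digest = hashlib.sha256(content).hexdigest()
        if seen.get(url) != digest:
            changed.append(url)   # new or modified: needs re-indexing
            seen[url] = digest
    return changed


seen = {}
first_run = changed_sources({"a.html": b"one", "b.html": b"two"}, seen)
second_run = changed_sources({"a.html": b"one", "b.html": b"TWO"}, seen)
print(first_run, second_run)
```

On the first run everything is indexed; on the second, only the modified page is, which is what makes incremental updates so much cheaper on large sites.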
[1] But mnoGoSearch (what about ht://Dig?) before version 3.2.34 didn't support cookies, so you had to configure PHP to use trans_sid, which meant the PHPSESSID was passed in the URL. But you didn't want this PHPSESSID to appear in the index database, as users who clicked the link would automatically have been logged in (depending on the session lifetime). So you had to implement something called ReverseAlias to strip this part of the URL before it got into the database. Things then became very hard to do, as there was a serious lack of documentation on how to use this in combination with a list of servers kept in a database table.
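The URL clean-up described in this note can be illustrated in Python as follows. This is not ReverseAlias syntax, which I do not reproduce here. It is only a sketch of the transformation itself, with a hypothetical `strip_session_id` helper and made-up example URLs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url, param="PHPSESSID"):
    """Return the url without the session parameter in its query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Made-up URL, for illustration only.
print(strip_session_id("http://example.com/course.php?id=3&PHPSESSID=abc123"))
```

Applying such a rewrite before a URL is stored keeps session identifiers out of the index, so a search result link never hands a visitor someone else's session.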