ht://Dig vs mnoGoSearch Comparison

This article was first written in February 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/209). Since then, other major
players such as Lucene and Xapian have come into consideration as open-source
search engines, but they have not yet been included in this comparison.

Introduction

This article is an attempt to objectively compare two wonderful Open Source indexing tools: ht://Dig and mnoGoSearch. The comparison criteria mainly concern what one can do on a Linux web server with Open Source database systems such as PostgreSQL and MySQL, with PHP bindings if available. This is because this research isn't funded by anyone and I need one of these products for a specific purpose: developing a search tool for Dokeos, an e-learning management system written in PHP+MySQL.

Here, I will intentionally mix the terms search tool and indexing tool, because these products do both, even if most of the effort goes into indexing. I know the terms are different, but the distinction doesn't matter here, and search tool is easier for non-technical people to understand than indexing tool.

The indexing system will probably rely on command-line parsers. Those parsers generally only work on Linux systems, so Dokeos would need to keep this indexing tool as a plugin, so as not to force the user to run a Linux server. This only affects the server anyway, and Dokeos can still be used from Windows computers.

For a more extensive list of search tools, you might want to visit searchtools.com, which has quite an impressive list. In fact, its pages on ht://Dig and mnoGoSearch are really interesting.

General project aspect

At first glance, we could say that a project's website reflects the overall quality of the product. Beyond judging whether the website is sloppy or well organised (which reflects the mindset of the developers), you can find documentation, changelogs and plenty of other interesting material there. Let's see what we can extract from it:

Information type	ht://Dig	mnoGoSearch
Design	OK	Bad
Last changelog date	January 2002	January 2005
Last release date	June 2004	January 2005
Finding information easily	Yes	Yes
Last documentation update	2002?	December 2004 (release 3.2.26)
References to "clients"	Around 500 (impressive!)	Around 200 (including MySQL and Debian!)

Another interesting point is the portability of the system. Both systems are only needed on the server, so only the server's OS matters.

Operating system	ht://Dig	mnoGoSearch
Win32	via Cygwin	native, but not free
Linux	yes	yes
Digital Unix	yes	yes
FreeBSD	yes	yes
OpenBSD	no	yes
HP UX	yes	yes
Solaris	yes	yes
Irix	yes	yes
SunOS	yes	yes
Mac OS X	yes	yes
Mac OS 9	yes	no
BSDI	no	yes
SCO Unixware	no	yes
AIX	no	yes

As for the mailing lists, it appears to me that mnoGoSearch is trying to add value to its support contracts by being very slow to answer help requests. The result is pretty poor: you need around a week to get an answer to pretty much anything, and the documentation is not really good (although fairly up to date). On the other side, ht://Dig support seems faster, but it always seems that only one person answers, and sometimes no answer is given at all. For mailing-list support, I would say both projects are disappointing, but mnoGoSearch offers official commercial support, which might be a good thing for companies looking into the product. As for numbers, over the same 45-day observation period, there were 134 mails on the htdig-general mailing list, whereas mnogosearch-general reached almost 400.

Web and offline "crawling"

Both systems were made to crawl web pages and extract information from them. This is not possible when a website is password-protected, unless the site uses HTTP Basic Authentication, which mnoGoSearch can handle but which is not often used to protect websites nowadays. To get into, say, a password-protected PHP website, you will need to write a (very well protected) PHP page that logs you in and initiates a session for you [1].

While mnoGoSearch offers offline and database crawling, the documentation is lacking here too. A small sample is shipped with the program code, but it is not really helpful, because you still need web pages that users can reach with a particular ID to get to the document you indexed offline. Documentation is also missing on how to use multiple offline searches on the same server.

The document types

Both search tools were first aimed at indexing HTML documents. The titles of their respective websites are clear about this: they are Internet search tools. As far as Dokeos is concerned, this is not enough. Courses are composed of a mix of HTML pages and documents, which can be of whatever type the user decides. The more document types the search tool can index, the better. The following table was built as accurately as possible from the documentation of these products at the time of writing. Although the search tools only index text and HTML files, both allow the use of external parsers (described in each project's documentation) that convert other formats to text or HTML, which can then be indexed.

Document type	ht://Dig	mnoGoSearch
HTML	yes	yes
Plain text	yes	yes
MS-Word	yes, via catdoc	yes, via catdoc
PDF	yes, via xpdf or pdf2text (pdfinfo also extracts meta info)	yes, via pdf2txt
PostScript	yes, via ps2ascii	yes, via ps2ascii
Man pages	probably via parser	yes, via deroff
RPM packages	probably via parser	yes, via rpminfo
MS-Excel	probably via parser	yes, via xls2csv
RTF	probably via parser	yes, via unrtf
MS-PowerPoint	yes, via ppt2html	yes, via ppt2html
Database content	yes, via web frontend :-(	yes, via htdb features 2
Website needing HTTP context (scripts)	?	yes
HTTPS	?	yes
Others...	yes, via appropriate parser	yes, via appropriate parser
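
The parser mechanism described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the configuration syntax of either tool: the MIME-type mapping, the function names and the assumption that each converter is installed and prints plain text on standard output when given a file name are all mine.

```python
import subprocess

# Hypothetical mapping from MIME type to an external converter.
# catdoc and ps2ascii are the tools named in the table above; both are
# assumed to be installed and to write plain text to standard output.
CONVERTERS = {
    "application/msword": ["catdoc"],
    "application/postscript": ["ps2ascii"],
}


def build_command(mime_type, path):
    """Return the command line that converts `path` to plain text."""
    if mime_type not in CONVERTERS:
        raise ValueError("no parser configured for " + mime_type)
    return CONVERTERS[mime_type] + [path]


def convert_to_text(mime_type, path):
    """Run the external parser and return the text to hand to the indexer."""
    result = subprocess.run(
        build_command(mime_type, path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Adding support for a new document type then only means adding one entry to the mapping, which is why the "Others..." row can say yes for both tools.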

Indexing methods

Indexing can rely on several kinds of algorithms. Generally speaking, it is up to the user to choose which algorithm(s) will be used. This is a list of the supported algorithm types:

Algorithm types	ht://Dig	mnoGoSearch
Stemming	yes	yes
Soundex	yes
Fuzzy	yes
Synonyms		yes
Substrings		yes
Thesaurus based indexation	no	no
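
To give an idea of what a phonetic algorithm such as Soundex does, here is a minimal Python sketch of the classic American Soundex code. It is a textbook version written for this article, not the implementation used by ht://Dig or mnoGoSearch, and the function name is my own.

```python
def soundex(word):
    """Return the 4-character American Soundex code of `word`."""
    # Consonant groups that share the same digit.
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        digit = codes.get(ch, "")       # vowels have no code
        if digit and digit != previous:
            result += digit
        previous = digit
    return (result + "000")[:4]         # pad or truncate to 4 characters


print(soundex("Robert"), soundex("Rupert"))
```

Words that sound alike get the same code, so a query for one of them can also match the other.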

And although they both seem to support boolean search, it might be somewhat limited as of the latest versions:

Boolean operator	ht://Dig	mnoGoSearch
AND	yes	yes
OR	yes	yes
NOT	no	yes
GROUP	no	yes
Phrase matching	no
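
The boolean operators in the table above map naturally onto set operations over an inverted index. The following Python sketch shows that idea on three made-up documents; it says nothing about how either tool stores its index, and all names in it are hypothetical.

```python
def build_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def lookup(index, term):
    """Return the ids of the documents containing `term` (empty set if none)."""
    return index.get(term.lower(), set())


# Made-up sample documents, for illustration only.
documents = {
    1: "open source search engine",
    2: "search tool for e-learning",
    3: "open source e-learning platform",
}
index = build_index(documents)

both = lookup(index, "open") & lookup(index, "search")        # AND
either = lookup(index, "engine") | lookup(index, "platform")  # OR
without = lookup(index, "search") - lookup(index, "open")     # NOT
print(both, either, without)
```

Grouping corresponds to evaluating a parenthesised sub-expression first and reusing its result set.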

Language handling

An important matter in the current case is the handling of different languages. Dokeos (the company) has its headquarters in Belgium, which is a trilingual country (French, Dutch, German) and uses a lot of English. But the Dokeos product is Open Source, and we have heard that it has Japanese users (we already have a Japanese translation). So the multilingual aspect is important, for the database data as well as for the documents and HTML pages. No clear table could be drawn up to compare the two products here.

ht://Dig offers multilingual support based on a configuration file and ispell dictionaries. However, this only works for 8-bit character encodings, so Japanese and Chinese cannot be handled (see report).

mnoGoSearch offers multilingual support based on a configuration file, but the indexing of non-English documents is badly documented, so we don't know what it is based on. However, the recommended encoding for multilingual support is Unicode, and as such it supports all characters. Language detection is supposed to be automatic, based on the words contained in the documents; let us see if we can get more info about this somewhere. According to searchtools.com, the tool uses ispell as a spell checker.
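
The 8-bit limitation mentioned above can be illustrated with a short Python check. Latin-1 is used here only as a representative 8-bit encoding (the article does not say which encodings ht://Dig is typically configured with), and the sample words are my own.

```python
def fits_in_encoding(text, encoding="latin-1"):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Sample words chosen for illustration.
samples = {
    "French": "bibliothèque",
    "Dutch": "zoeken",
    "German": "Prüfung",
    "Japanese": "検索",
}
for language, word in samples.items():
    print(language,
          fits_in_encoding(word, "latin-1"),
          fits_in_encoding(word, "utf-8"))
```

The three Belgian languages fit in a single-byte encoding, while the Japanese word can only be represented in a Unicode encoding such as UTF-8.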

Search interfaces

I can only talk about mnoGoSearch here for now, because I haven't tested the ht://Dig interface yet. The mnoGoSearch project bundles a CGI search interface by default. This interface is nice and easy, and works with template files, so you can modify it (almost) at will. But sometimes (as in this case), you need to integrate the search interface into another project, for example a PHP application. mnoGoSearch also provides a PHP extension that you need to install (not so easily) on your system in order to call its functions directly from your PHP project. Let's consider this option in more detail.

the install

The first word that comes to my mind here is "brainsquishing". Why does an install require the MySQL and PHP sources and a recompilation of everything? Why isn't there an easy way to download some scripts, move them to the right place, and start playing? Apparently, this used to be the case, and it stopped because integrating the mnoGoSearch functions into a compiled C extension for PHP was more efficient. Well, for me, who just wants a fast solution that is easy to integrate, it's a bad point. Not only do you need to compile everything, but you also need to have the MySQL and PHP sources at hand to do so. Most users who want to try it on an outsourced server where they don't have near-admin permissions will be unable to use it…

the config

I have not got past the compilation step yet, so I will write something here when I have.

Other considerations

ht://Dig (or at least its rundig program) seems to rebuild the index database from scratch each time it runs. It is written in C++.

mnoGoSearch only updates the indexes bound to modified sources (but does this work for database sources?). It uses database systems to store its index tables and can handle a dozen different DBMSs. It implements server clustering and mirroring. It can do Basic Authentication if the web server requires it and the settings are configured for it. It supports weighting based on page structure, as well as the gzip compressed format. It is written in C.

Both systems can do search result highlighting.
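
The difference between rebuilding everything and updating only what changed can be sketched as follows. This is a generic illustration in Python, assuming a content hash is stored per document; I do not know how mnoGoSearch actually detects modified sources, and every name below is hypothetical.

```python
import hashlib


def fingerprint(content):
    """Return a stable hash of a document's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def documents_to_reindex(documents, known):
    """Return the URLs whose content changed since the last run.

    `documents` maps URL -> current content; `known` maps URL -> hash
    recorded at the previous run and is updated in place.
    """
    changed = []
    for url, content in documents.items():
        digest = fingerprint(content)
        if known.get(url) != digest:
            changed.append(url)
            known[url] = digest
    return changed


# Made-up example: two runs over the same small site.
known_hashes = {}
site = {"/course/1": "Introduction", "/course/2": "Chapter one"}
first_run = documents_to_reindex(site, known_hashes)   # everything is new
site["/course/2"] = "Chapter one, revised"
second_run = documents_to_reindex(site, known_hashes)  # only the edited page
print(first_run, second_run)
```

A full rebuild is equivalent to starting every run with an empty `known_hashes`, which is what makes it slower on large sites.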

[1] However, mnoGoSearch before version 3.2.34 (what about ht://Dig?) didn't support cookies, so you had to configure PHP to use trans_sid, which meant the PHPSESSID was passed in the URL. But you didn't want this PHPSESSID to end up in the index database, as users clicking the link would automatically have been logged in (depending on the session lifetime). So you had to use something called ReverseAlias to strip this part of the URL before it got into the database. Things then became very hard, as documentation was badly lacking on how to use this in combination with a list of servers kept in a database table.
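
The URL clean-up described in this footnote amounts to dropping one query parameter before the URL is stored. The Python sketch below shows that transformation only; it is not the ReverseAlias syntax of mnoGoSearch, and the function name and example URLs are made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url, param="PHPSESSID"):
    """Return `url` without the session parameter in its query string."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_session_id("http://example.org/course.php?id=3&PHPSESSID=abc123"))
```

The remaining parameters are kept in their original order, so the stored URL still points to the same page, just without the session attached.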
