Structured vs unstructured information

This article was first written in September 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/293).

What is "structured information"?

It is information that is already divided into fields, such as "date", "title", "subject", "unit price", "quantity", "total price" or "commission percentage": typically, what you find in a record of a relational database table. When information is structured, it is usually relatively easy to search, since you can simply tell a program: give me the list of record numbers in the table CUSTOMERS where total sales is greater than 1,000 and the name starts with the letter A. The drawback is that relational database management systems (RDBMS) usually require each field to have a fixed maximum size: a date can have at most 8 digits (yyyymmdd), a name can have at most 30 characters, etc. This is because the information must fit into "columns" and "tables", and it is difficult for such systems to efficiently handle data whose size varies significantly from one row to the next.
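As an illustration only, the request quoted above can be sketched in Python with the standard sqlite3 module. The column names (id, name, total_sales), the sample rows and the helper function names are assumptions made up for this sketch; the text only names the table CUSTOMERS and the two conditions.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory CUSTOMERS table with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, total_sales REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            (1, "Adams", 2500.0),   # starts with A, sales above 1,000
            (2, "Baker", 4000.0),   # sales above 1,000, wrong first letter
            (3, "Allen", 300.0),    # starts with A, sales too low
            (4, "Archer", 1000.5),  # starts with A, sales just above 1,000
        ],
    )
    return conn


def matching_record_numbers(conn):
    """Record numbers where total sales > 1,000 and the name starts with A."""
    rows = conn.execute(
        "SELECT id FROM customers "
        "WHERE total_sales > 1000 AND name LIKE 'A%' "
        "ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(matching_record_numbers(build_demo_db()))
```

Because every record has the same named fields, the whole search fits in a single query and the program needs no knowledge of what a customer is.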

What is "unstructured information"?

Generally speaking, what people mean by that is text, such as can be found printed on a 2-page memo. Although there may be some visual structure for a human reader (it's easy to find the date, whether it's on the left or the right side; it's easy to find the subject once you have read a couple of paragraphs), for a program it is something else altogether. Contrary to popular belief, the amount of unstructured information in this day and age is several orders of magnitude larger than the amount of structured information. Unstructured information does not fit easily into the "columns and rows" concept of relational databases: the text of a memo may contain 1 paragraph or 100 (especially mine), a book may have chapters of varying lengths, a technical description of an Airbus plane requires a few hundred boxes of drawings and text pages, etc. Relational database engines therefore have trouble handling that kind of data. They must generally treat it as "blobs" [1], stored differently from the usual columns and rows, which also makes such data harder for this kind of program to work with.
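A minimal sketch of what treating text as a blob can look like, again in Python with the standard sqlite3 module. The memos table, its columns, the sample memo texts and the function names are all hypothetical. The sketch deliberately treats each blob as opaque to the database, so the search happens in the program rather than in the query.

```python
import sqlite3


def build_memo_db():
    """Create an in-memory table holding memo texts as opaque BLOBs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE memos (id INTEGER PRIMARY KEY, body BLOB)")
    memos = [
        (1, "12 September 2005. Subject: maintenance schedule for the Airbus fleet."),
        (2, "Subject: office move.\n\nBoxes must be packed by Friday."),
    ]
    conn.executemany(
        "INSERT INTO memos VALUES (?, ?)",
        [(memo_id, text.encode("utf-8")) for memo_id, text in memos],
    )
    return conn


def memos_mentioning(conn, word):
    """Fetch every blob, decode it, and search the text in the program."""
    needle = word.lower()
    hits = []
    for memo_id, body in conn.execute("SELECT id, body FROM memos ORDER BY id"):
        text = bytes(body).decode("utf-8")  # the program must know the encoding
        if needle in text.lower():
            hits.append(memo_id)
    return hits


if __name__ == "__main__":
    print(memos_mentioning(build_memo_db(), "airbus"))
```

In this sketch the database only stores and returns bytes. Knowing that the bytes are UTF-8 text, and where a date or subject might sit inside them, is left entirely to the program.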

What is a mixture of structured and unstructured information?

Most information is like that. There may be some structured bits such as date, author, etc., but then there are one or more paragraphs of "description". For all practical purposes, a mixture is to be considered unstructured. Information systems that can handle unstructured information can usually (though not always) handle structured information as well. The opposite is generally not true.

[1] "Binary Large OBject": a collection of bits of undetermined length whose content is not really known, or at least not known to the DBMS.