Sorting out web encoding problems

This article was first written in June 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/270).

Introduction

This article is about sorting out web form encoding problems. In particular, it looks into an encoding problem found when serving a UTF-8 form, expecting a completed UTF-8 form to come back, but really getting an ISO-8859-1 or ISO-8859-15 form. It is meant as a solution article that explains, step by step, how to reverse-engineer such a problem to find a reasonable solution. It doesn't go deep into the reasons why the form comes back in another encoding, as that happens on the user's side, where you cannot rely on anything, really. If you want to know more about the history of encodings, I recommend you read this page: UTF-8 explained

Note

Sadly, the Kanji character I am trying to display below doesn't show up. I will try to set that right. Meanwhile, just imagine that 鉉 is in fact a beautiful Kanji (something you most probably cannot read, and neither can I).

How the problem appears

You might some day come across this kind of problem. You display a form to web users in UTF-8. It is UTF-8 at the beginning because you send it with an HTTP header containing
Content-Type: text/html; charset=UTF-8
But strangely, when you get the results of the form, you have funny things happening. The data you get, if you put it into a database, will look as if it had broken out of the HTML structure: the value runs past its closing quote and carries on into the markup that follows the field. If you had a web form like this:
<tr>
  <td align='right'>Name*</td>
  <td><input name="lastname" size="18" style="" value=""></td>
  <td align='right'>Firstname*</td>
  <td ><input name="firstname" size="18" style="" value=""></td>
</tr>
You might end up with data that looks like this:
$_POST['lastname']: Warnier
$_POST['firstname']: Aim鉉 </tr> <tr> <td align='right'> ...
That is the kind of problem we are looking to debug in this article.

The reason why

Note that this paragraph has been contested and should probably be revised in time. Don't take my reasons for granted. It might seem pretty illogical at first, but here is my analysis of the situation. You'll see that, in the end, it makes a lot of sense.

The server sends the form as UTF-8 and expects to get it back as UTF-8. The rules of UTF-8 are well documented on the web, so I'll explain them only briefly. The UTF-8 characters that correspond to ASCII are coded on 1 byte, while other characters (like Kanji) are coded on several bytes. If the server, expecting UTF-8, finds ISO-8859-15 characters instead, it first tries to read each byte as plain ASCII; when that fails, it tries to read the byte as the start of a multi-byte character.

You can probably already smell what goes wrong here. If, for some reason, the server doesn't understand a byte as ASCII, it will take several bytes to make one UTF-8 character. So if you entered, say, the é character, and the server didn't recognise it, it will pull the following bytes into the same character. This way, it will probably swallow the double-quote character used to close the value declaration in the form...
<input .. value="Aim鉉 </tr>... (until the next double-quote char)
thus breaking your form.
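
The swallowing described above can be sketched in a few lines of Python (used here only for illustration; it is not the language of the site in question). This is a minimal sketch under one loud assumption: the decoder trusts the lead byte and never checks that the bytes it pulls in are valid continuation bytes. The names sequence_length and naive_split are hypothetical helpers written for this example. A strict decoder, such as Python's own, refuses the sequence instead of swallowing anything, which may be part of why the reasoning above is contested.

```python
def sequence_length(lead: int) -> int:
    """Number of bytes announced by a UTF-8 lead byte (1 for anything else)."""
    if lead & 0b11100000 == 0b11000000:
        return 2
    if lead & 0b11110000 == 0b11100000:
        return 3
    if lead & 0b11111000 == 0b11110000:
        return 4
    return 1  # ASCII, or a byte this sketch does not try to interpret


def naive_split(data: bytes) -> list:
    """Group bytes the way a careless decoder would (ASSUMPTION: it trusts
    the lead byte and does not validate the bytes that follow it)."""
    groups = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        groups.append(data[i:i + n])
        i += n
    return groups


# "Aimé" as sent in ISO-8859-15, placed back into the form markup.
# \u00e9 is the é character; it becomes the single byte 0xE9.
html = 'value="Aim\u00e9"></td>'.encode("iso-8859-15")

# The group that starts with 0xE9 also contains the closing double quote.
swallowed = next(g for g in naive_split(html) if g[0] == 0xE9)
print(swallowed)

# A strict UTF-8 decoder rejects the same bytes instead.
try:
    html.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
print(strict_rejects)
```

In this sketch the byte 0xE9 announces a three-byte character, so the two bytes after it, including the closing double quote, are consumed as part of that character.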

How I got that reasoning

I had this problem, and the Kanji 鉉 strangely appeared in the middle of my client's data. Understanding how Unicode (UTF-8) works, I guessed it might be a several-bytes-into-one problem, and looked for the Unicode code for 鉉. Luckily, searching on Google for "鉉 encoding", I found this page which lists several of these characters. I searched for the 鉉 string inside that page and found the right encoding for this character. Comparing that encoding (E98980) to the ISO-8859-15 code table, I found that the first byte, E9 in this case (the lead byte, which in UTF-8 announces a multi-byte sequence), represents the é character in ISO-8859-15, which could very possibly be entered by customers on the website (it's a very common French character).
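
The same lookup can be done without any code table. The sketch below (again in Python, purely for illustration) takes the byte sequence E98980 quoted above and reads it both ways: as one UTF-8 character, and, for its first byte alone, as ISO-8859-15. The variable names are made up for this example.

```python
# The three bytes quoted in the text above.
utf8_bytes = bytes.fromhex("E98980")

# Read as UTF-8, the three bytes collapse into a single character.
as_utf8 = utf8_bytes.decode("utf-8")
print(len(as_utf8))

# Read on its own as ISO-8859-15, the first byte is an ordinary accented letter.
first_as_latin9 = utf8_bytes[:1].decode("iso-8859-15")
print(first_as_latin9)

# In binary, the first byte starts with 1110, the UTF-8 pattern that
# announces a three-byte sequence.
print(format(utf8_bytes[0], "08b"))
```

Running the check in the other direction is just as quick when a stray character turns up in your data: encode it to UTF-8, take the first byte, and see what that byte means in the encoding you suspect the browser used.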