stmllr.net

Thinking about UTF-8 character set conversion in TYPO3

UTF-8 is becoming more and more popular as a character set for TYPO3. Changing the charset of a system can be tricky, and there are a lot of traps; the many requests on the mailing lists prove that. Here are some general thoughts on the subject (no howto!).

If you have ever tried to convert your character set to UTF-8 and ended up with strange, garbled characters, then you are not alone. I was trapped in that jungle more than once.

There are lots of tutorials, howtos, blog postings and pieces of documentation around which aim to help you convert a character set to UTF-8. Some are more helpful, some less. But they all have something in common: they were written without knowing how your system is configured. Analyzing your individual environment, paired with a deep understanding of your components, will probably serve you much better than trial-and-error snippets. I'll try to give you an impression of what that could mean.

Taking all components and their interaction into account

The architecture of a TYPO3 system is complex, since it consists of independent components which are even interchangeable. The following questions are essential when thinking about character set conversion:

  • Operating system (Win, Linux, *BSD, ...): What's the default charset of your OS? Does it support UTF-8?
  • Filesystem: What charset do the files have? Is it used consistently or mixed?
  • Webserver: Which charset is used for serving static files?
  • Database: This one is the most complex. The DB has various components, which can come with different charsets:

    • Client: Which charset does the client use to display/process data?
    • Server: How does the server store the data?
    • Connection: What charset is used for data transfer?
    • ...
  • PHP: There is a bunch of modules which try to help with on-the-fly charset conversion: recode, mbstring, iconv, ... Which ones you have depends on whether the module is available for your PHP version and whether it is enabled.
  • TYPO3: What version do you use? What charset configuration?
  • The whole stuff mixed together: What charset do the components use to interact with each other?
  • ...
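For the filesystem question, a first rough check is whether a file's bytes decode as UTF-8 at all. The following is a minimal sketch in Python (used here only for illustration; it is not part of TYPO3), and the helper names `is_valid_utf8`, `classify` and `scan` are made up for this example.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def classify(data: bytes) -> str:
    """Rough label for a byte string: 'ascii', 'utf-8' or 'not utf-8'."""
    if all(b < 0x80 for b in data):
        return "ascii"  # pure ASCII is valid in latin1 and UTF-8 alike
    return "utf-8" if is_valid_utf8(data) else "not utf-8"


def scan(root: str, pattern: str = "*.txt") -> dict:
    """Map every matching file below root to its rough label."""
    return {str(p): classify(p.read_bytes()) for p in Path(root).rglob(pattern)}
```

Note that this only tells you what a file is not: some latin1 byte sequences happen to be valid UTF-8 as well, so a "utf-8" label is a strong hint rather than proof.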

Uff, a lot of stuff! But this is just a loose and incomplete collection of items; I stopped brainstorming after a few minutes. I am sure there's much more to think (and write) about. I will not give you advice on what to do or not to do. As I already mentioned, a lot has been written about that on the web.

Demonstrating my favorite issue

This one was really tricky to track down, and it serves as a great example of how complex things can be:

Updating from TYPO3 4.1 (and earlier versions) to 4.2 (and later)

Some DB fields in 4.1 are of type BLOB (for example the TS templates). Most of these fields get converted to TEXT in 4.2. Now think about the following scenario, which seems to be common. The template was saved using TYPO3 4.1 and a DB using latin1 (ISO-8859-1) as charset. Then the DB was converted to UTF-8 and TYPO3 was configured accordingly. You think you're finished because everything works well. But in most cases, there's still some latin1-encoded data in the BLOB fields, because a charset conversion does not touch binary columns. You just don't see it. Once you upgrade TYPO3 to 4.2, these BLOBs get converted to TEXT, on the assumption that the data is UTF-8. But it's latin1, because the BLOB was never converted. The result is a broken template. Lots of people on the mailing lists complain about whole parts missing. The reason is non-ASCII characters (umlauts and the like: äöüé¢ etc.) whose latin1 bytes are invalid in UTF-8 and break the template view.
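Why leftover latin1 bytes hurt once they are treated as UTF-8 can be reproduced outside the database. This is a small Python sketch for illustration only; the template line is made up, and it does not show what MySQL or TYPO3 do internally.

```python
# A made-up template line with umlauts, stored as latin1 bytes
# (standing in for the content of an old BLOB field).
template = "page.10.value = Übersicht über Bäume"
latin1_bytes = template.encode("latin-1")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The umlaut bytes are not valid UTF-8 sequences.
readable = decodes_as_utf8(latin1_bytes)

# A lenient decoder substitutes U+FFFD for every invalid byte; the ASCII
# part survives, the umlauts are lost.
lossy = latin1_bytes.decode("utf-8", errors="replace")

print(readable)  # False
print(lossy)
```

Depending on how strict the consumer is, the text is cut off, shown with replacement characters, or rejected, which fits the "whole parts missing" reports.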

How to avoid that?

If you change the charset of TYPO3 and/or your DB, first convert to TEXT those BLOB fields which the TYPO3 update would convert anyway, and only then convert the charset. Otherwise, make sure the BLOB data gets converted some other way.
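Converting the BLOB data "some other way" boils down to re-encoding the raw bytes. The sketch below (Python, illustration only) shows one possible heuristic: leave bytes that are already valid UTF-8 alone and re-encode everything else from an assumed legacy charset. The function name `to_utf8` and the `legacy_charset` default are assumptions for this example; which legacy charset is correct depends on your system.

```python
def to_utf8(raw: bytes, legacy_charset: str = "latin-1") -> bytes:
    """Return raw as UTF-8 bytes, leaving already valid UTF-8 untouched."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (plain ASCII included)
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is in the legacy charset.
        return raw.decode(legacy_charset).encode("utf-8")


old_blob = "Bäume".encode("latin-1")
print(to_utf8(old_blob) == "Bäume".encode("utf-8"))  # True
```

The validity check makes the function safe to run twice on the same data, which avoids double-encoding. It is still a heuristic, so work on a backup and inspect the results.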

Lessons learned?

  • Don't believe that snippets from the web will fit your needs 100%. Even if they are written by someone with a good reputation and are rated well, chances are that your system behaves differently.
  • Structured analysis paired with sufficient knowledge of the CMS components is the best basis for successful administration. Trial-and-error will not always lead to a lucky punch and can be very disappointing.
  • Character set conversion is a complex task, even for experienced administrators. Complex tasks are time-consuming, so plan your time generously.
  • Design flaws can cause a disaster which might come up at a point you can't foresee. Using BLOB for text in databases is such a design flaw.

Finally, a tiny helper extension

Sometimes it's not easy to find out what charset TYPO3 uses when connecting to MySQL. When you ask for help, you will probably be asked to provide the output of

SHOW VARIABLES LIKE '%CHARACTER_SET%';

But if you do that on the command line (using the mysql client) or in phpMyAdmin, you might get a different result than TYPO3 would, because each client negotiates its own connection charset.

I have written a tiny charset helper extension which shows the results from a true TYPO3 point of view (by using the standard TYPO3 DB functions). It's available from TER and does not need any configuration. Just install it and navigate to the module:

http://typo3.org/extensions/repository/view/sm_charsethelper/current/

Comments

  1. Steffen

    I forgot to mention some external resources:

    http://dev.mysql.com/doc/refman/5.1/en/charset-connection.html
    On this page you can learn how MySQL handles charsets, what stages exist and how they are configured. This page is mandatory for every TYPO3 administrator.

    http://public.m-plify.net/mysql/MySQL_Charset_Handling.pdf
    It is a hand-drawn diagram which gives you an impression how data is stored, shipped and transformed and what components are involved.

  2. Rudolf

    Hello Steffen, it would be nice to add a trailing ";" to the command "SHOW VARIABLES LIKE '%CHARACTER_SET%';". If you are not so fluent in MySQL, you search for a while to figure that out.
    Regards, Rudolf

  3. Steffen

    Just found another source of information about setting up a TYPO3 environment with UTF-8: http://xavier.perseguers.ch/en/tutorials/typo3/configuration/utf-8.html

  4. Steffen

    I have just uploaded a new version of the sm_charsethelper extension to TER. It now uses the Reports module instead of its own BE module.

  5. Björn

    Hi there,

    Your extension is awesome. Just found it through Google.

    I've got a strange problem with a brand new TYPO3 4.5 installation. I've set up Russian and TYPO3 behaves strangely. Russian content in the bodytext field is shown correctly. Content in the headline is stored and displayed as entities. Your report says everything is UTF-8, so why do I get entities? Have you ever had such a problem?

    Thanks, Björn

  6. Steffen

    Hi Björn,
    I guess you mean HTML entities (like &amp;). This has nothing to do with the character set.
    Please ask for support on the TYPO3 mailing lists. There are quite a lot of geeks in there with great knowledge.

  7. Steffen

    Here's some more info about utf-8. I hope this helps you:

    When your Apache serves text files (e.g. *.txt, *.html) which are utf-8 encoded, you need this setting in Apache:

    AddDefaultCharset UTF-8

    @see: http://httpd.apache.org/docs/2.2/en/mod/core.html#adddefaultcharset

    This will add the following HTTP header to the server response:

    Content-Type: text/html; charset=UTF-8

    If you don't have this header, non-ASCII characters in your file will probably not be shown correctly. For example, the "ä" character will be shown in the browser as "Ã¤".
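The garbling described here can be reproduced without a browser. A short Python sketch (illustration only) of what happens when UTF-8 bytes are read as latin1, and how the misreading can be undone as long as nothing else touched the text:

```python
original = "ä"

# UTF-8 stores "ä" as two bytes: 0xC3 0xA4.
utf8_bytes = original.encode("utf-8")

# A reader that assumes latin1 turns each byte into its own character.
misread = utf8_bytes.decode("latin-1")

# Reversing the wrong step restores the original text.
repaired = misread.encode("latin-1").decode("utf-8")

print(misread)   # Ã¤
print(repaired)  # ä
```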

    When you have plain PHP scripts, there's an alternative to Apache's AddDefaultCharset. PHP has its own configuration option to be set in php.ini:

    default_charset = "UTF-8"

    @see: http://www.php.net/manual/en/ini.core.php#ini.default-charset

    Alternatively, set this in PHP code via:

    ini_set('default_charset', 'UTF-8');

    This results in the same HTTP header as mentioned above. Please note that this will only apply to PHP files, not plain text or html files.

    To debug your HTTP headers, I recommend using the Firefox add-on "Live HTTP headers".

     

  8. Steffen

    Did you know that some PHP string functions are not utf-8 capable?

    Try this:

    <?php
      echo strlen('ö');
    ?>

    Expected result: 1
    Actual result: 2

    So better use this:

    <?php
      echo mb_strlen('ö', 'utf-8');
    ?>

    Expected result: 1
    Actual result: 1
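The difference comes from counting bytes versus counting characters: in UTF-8, "ö" is one character stored in two bytes. The same distinction can be shown in Python (used here only as an illustration, not as a replacement for the PHP functions above):

```python
text = "\u00f6"  # "ö" as a single code point

byte_length = len(text.encode("utf-8"))  # counts bytes
char_length = len(text)                  # counts code points

print(byte_length)  # 2
print(char_length)  # 1
```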

  9. Holgy


    The TYPO3 wiki has a dedicated UTF-8 page: http://wiki.typo3.org/UTF-8_support

    There you can also find a script to convert an existing database to UTF-8.