UTF-8 is becoming more and more popular as a character set for TYPO3. Changing the charset of a system can be tricky, and there are a lot of traps; plenty of mailing list requests prove that. Here are some general thoughts about the subject (not a howto!).
If you have ever tried to convert your character set to UTF-8 and ended up with strange, garbled characters, then you are not alone. I was trapped in that jungle more than once.
There are lots of tutorials, howtos, blog posts and pieces of documentation around which aim to help you convert a character set to UTF-8. Some are more helpful, some less. But they all have something in common: they were written without knowing how your system is configured. Analyzing your individual environment, paired with a deep understanding of your components, will probably serve you much better than trial-and-error snippets. I'll try to give you an impression of what that could mean.
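To make the symptom concrete, here is a minimal Python sketch of how such garbage typically arises: one component writes UTF-8 bytes and another reads them as ISO-8859-1. The sample string is made up for illustration, and this is only one of several ways to end up with garbled output.

```python
# Text as the author intended it (hypothetical sample).
original = "Müller läuft"

# A UTF-8 aware component writes the text as UTF-8 bytes ...
utf8_bytes = original.encode("utf-8")

# ... and a component that assumes ISO-8859-1 reads those same bytes.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # MÃ¼ller lÃ¤uft

# As long as no bytes were lost, the misreading can be undone.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

Each umlaut turns into two characters because UTF-8 stores it in two bytes, and ISO-8859-1 treats every byte as a character of its own.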
Taking all components and their interaction into account
The architecture of a TYPO3 system is complex, since it consists of independent components, which are even interchangeable. The following questions are essential when thinking about character set conversion:
- Operating system (Win, Linux, *BSD, ...): What's the default charset of your OS? Does it support UTF-8?
- Filesystem: What charset do the files have? Is it used consistently or mixed?
- Webserver: Which charset is used for serving static files?
- Database: This one is the most complex. The DB has various components, each of which can come with a different charset:
  - Client: Which charset does the client use to display/process data?
  - Server: How does the server store the data?
  - Connection: What charset is used for data transfer?
- PHP: There are a bunch of modules which try to help with on-the-fly charset conversion: recode, mbstring, iconv, ... Which ones you have depends on whether the module is available for your PHP version and whether it's enabled.
- TYPO3: What version do you use? What charset configuration?
- Everything mixed together: What charset do the components use to interact with each other?
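Why the interaction matters can be shown with a small simulation. The sketch below is not how any particular database works internally; it is a simplified model, and the function `store_via_connection` and its parameters are invented for this example. It assumes the server trusts the declared connection charset and then converts to the column charset.

```python
def store_via_connection(text, client_charset, connection_charset, column_charset):
    """Simulate one write (client -> connection -> column) and return the stored bytes."""
    # The client serialises the text in its own charset.
    wire_bytes = text.encode(client_charset)
    # The server interprets the incoming bytes using the *declared* connection charset.
    understood_by_server = wire_bytes.decode(connection_charset)
    # The server then converts what it understood to the column charset.
    return understood_by_server.encode(column_charset)


consistent = store_via_connection("ä", "utf-8", "utf-8", "utf-8")
mismatched = store_via_connection("ä", "utf-8", "latin-1", "utf-8")

print(consistent)  # b'\xc3\xa4'
print(mismatched)  # b'\xc3\x83\xc2\xa4'
```

In the mismatched case every component is "UTF-8" except the connection, and the stored data is still wrong: the character has been encoded twice. Looking at one component in isolation would not reveal that.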
Phew, that's a lot! And it is just a loose and incomplete collection of items; I stopped brainstorming after a few minutes. I am sure there's much more to think (and write) about. I will not give you advice on what to do or not to do. As I already mentioned, there's a lot of writing about that out there on the web.
Demonstrating my favorite issue
This one was really tricky to track down, and it serves as a great example of how complex things can be:
Updating from TYPO3 4.1 (and earlier versions) to 4.2 (and later)
Some DB fields in 4.1 are of type BLOB (for example the TS templates). Most of these fields get converted to TEXT in 4.2. Now think about the following scenario, which seems to be common. The template was saved using TYPO3 4.1 and a DB using latin1 (ISO-8859-X) as charset. Then the DB was converted to UTF-8 and TYPO3 was configured accordingly. You think you're finished because everything works well. But in most cases, there's still some latin1-encoded data in the BLOB fields. You just don't see it. Once you upgrade TYPO3 to 4.2, these BLOBs get converted to TEXT on the assumption that the data is UTF-8. But it's latin1, because a charset conversion does not touch BLOB data. The result is a broken template. Lots of people on the mailing lists complain about whole parts missing. The reason is non-ASCII characters (like umlauts äöüé¢ etc.) which are invalid as UTF-8 and break the template view.
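The mechanism can be illustrated without a database. The following Python sketch checks raw field contents for byte sequences that are not valid UTF-8; the rows and the helper name `find_invalid_utf8` are hypothetical, standing in for BLOB contents read from a real table.

```python
def find_invalid_utf8(rows):
    """Return the ids of rows whose raw bytes are not valid UTF-8."""
    broken = []
    for row_id, raw in rows:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            broken.append(row_id)
    return broken


# Hypothetical BLOB contents as (uid, raw bytes) pairs.
rows = [
    (1, "page.10.value = Hello".encode("latin-1")),     # pure ASCII, identical in both charsets
    (2, "page.10.value = Über uns".encode("latin-1")),  # latin1 umlaut, invalid as UTF-8
    (3, "page.10.value = Über uns".encode("utf-8")),    # already UTF-8
]

print(find_invalid_utf8(rows))  # [2]
```

Row 1 also shows why the problem stays hidden for so long: as long as a template contains only ASCII, latin1 and UTF-8 produce exactly the same bytes.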
How to avoid that?
If you change the charset of TYPO3 and/or your DB, first convert to TEXT those BLOB fields which the TYPO3 update would change, and only then convert the charset. Alternatively, make sure the BLOB data gets converted some other way.
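What converting the BLOB data "some other way" could look like is sketched below in Python. This is a heuristic under stated assumptions, not a recipe: the function name `blob_to_utf8` is made up, the legacy charset is assumed to be latin1, and data that is already valid UTF-8 is left alone so the conversion can be repeated safely.

```python
def blob_to_utf8(raw, legacy_charset="latin-1"):
    """Return raw as UTF-8 bytes, re-encoding from legacy_charset only when needed."""
    try:
        # Already valid UTF-8 (this includes pure ASCII): leave untouched.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is in the legacy charset.
        return raw.decode(legacy_charset).encode("utf-8")


legacy = "Über uns".encode("latin-1")
converted = blob_to_utf8(legacy)

print(converted.decode("utf-8"))            # Über uns
print(blob_to_utf8(converted) == converted) # True
```

The check is not bulletproof, since some latin1 byte sequences happen to be valid UTF-8 as well, so test on a copy of your data before touching the real thing.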
- Don't believe that snippets from the web will fit your needs 100%. Even if they are written by someone with a good reputation and are rated well, chances are that your system behaves differently.
- Structured analysis paired with sufficient knowledge about the CMS components is the best basis for successful administration. Trial and error will not always lead to a lucky punch and can be very disappointing.
- Character set conversion is a complex task, even for experienced administrators. Complex tasks are time-consuming, so plan generously.
- Design flaws can cause a disaster which might come up at a point you can't foresee. Using BLOB for text in databases is such a design flaw.
Finally, a tiny helper extension
Sometimes it's not easy to find out what charset TYPO3 uses when connecting to MySQL. When you ask for help, you will probably be asked to provide the output of
SHOW VARIABLES LIKE '%CHARACTER_SET%';
But if you do that on the command line (using the mysql client) or in phpMyAdmin, you might get a different result than TYPO3 would produce.
I have written a tiny charset helper extension which shows the results from a true TYPO3 point of view (by using standard TYPO3 DB functions). It's available from TER and does not need any configuration. Just install it and navigate to the module.