Feature #288
automatic character recoding (e.g. latin1 <-> utf8)
| Status: | New | Start: | 01/11/2010 |
| Priority: | Normal | Due date: | |
| Assigned to: | - | % Done: | 0% |
| Category: | Engine | | |
| Target version: | 0.10 | | |
| Complexity: | High | | |
| Votes: | 3 | | |
Description
I did a small survey. Since most users are unaware of their encoding and do not want to change it, smuxi should recode as necessary, as all major IRC clients do nowadays anyway.
History
Updated by Mirco Bauer 1225 days ago
The Perl regex on this page might help to detect UTF-8 characters:
http://www.w3.org/International/questions/qa-forms-utf-8.en.php
Updated by Mirco Bauer 1003 days ago
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
This expression can be adapted to other programming languages. It takes care of various issues, such as illegal overlong encodings and illegal use of surrogates. It will return true if $field is UTF-8, and false otherwise.
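As a rough illustration of adapting the expression to another language, here is a minimal Python sketch that applies the same alternatives to raw bytes. The names `UTF8_PATTERN` and `looks_like_utf8` are hypothetical and are not taken from smuxi or from the W3C page.

```python
import re

# Byte-level port of the Perl expression above (hypothetical name).
# re.VERBOSE allows the same layout and comments as the /x flag in Perl.
UTF8_PATTERN = re.compile(
    rb"""\A(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\Z""",
    re.VERBOSE,
)


def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data matches the pattern."""
    return UTF8_PATTERN.match(data) is not None


print(looks_like_utf8("oué hein :)".encode("utf-8")))       # True
print(looks_like_utf8("oué hein :)".encode("iso-8859-1")))  # False
```

Note that the ASCII alternative only admits tab, LF, CR and printable characters, so a line containing other control bytes (for example the 0x01 delimiter used by CTCP) would be reported as not UTF-8. A real IRC client would probably need to widen that range.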
Updated by Mirco Bauer 1002 days ago
The branch that tries to deal with this:
http://git.qnetp.net/?p=smuxi.git;a=shortlog;h=refs/heads/feature/%23288_automatic_character_recoding
It can detect UTF-8, but the recoding part is not working.
Updated by Raphaël Hertzog 976 days ago
+1 from me, this is really needed, it's one of the regressions that annoy me the most.
I see lots of ? instead of the accented characters on #debian-devel-fr. Some are going through correctly (for those that send UTF-8).
Example:
20:35 <bubulle> et, indirectement, ? cause d'un bug de dak, ?a m'emp?che d'envoyer une mise ? jour de s?curit? dans t-p-u
22:38 <KiBi> oué hein :)
22:39 <KiBi> (faire des IO ? quel drôle d'idée pour une machine qui fait du SQL..)
Updated by Mirco Bauer 970 days ago
- Target version changed from 0.11 to 0.10
- Complexity set to High
OK, I can't recode from ISO 8859-1 once the input has already been converted from raw bytes to a string, because that conversion drops byte values along the way. The IRC lib has to either expose the raw bytes or handle the transformation itself.
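To make the raw-bytes approach concrete, here is a minimal Python sketch, assuming the IRC library hands over each line as undecoded bytes. It is not smuxi's implementation; the function name `decode_irc_line` and the choice of ISO 8859-1 as the fallback are assumptions for illustration.

```python
def decode_irc_line(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode a raw IRC line (hypothetical helper).

    Try strict UTF-8 first; if the bytes are not valid UTF-8,
    reinterpret the same raw bytes using the fallback encoding.
    """
    try:
        return raw.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        return raw.decode(fallback)


message = "ça m'empêche d'envoyer une mise à jour de sécurité"

# Both senders end up with the same text on the receiving side.
print(decode_irc_line(message.encode("utf-8")) == message)       # True
print(decode_irc_line(message.encode("iso-8859-1")) == message)  # True
```

The fallback only works because the decision is made while the original bytes are still available. ISO 8859-1 maps every byte value to a character, so the second decode cannot fail, although it may produce the wrong characters if the sender used some other legacy encoding.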
Here is the chat with alan about this issue:
17:48:40 <meebey> I think my issue is that the encoders are stripping unvalid values
17:48:51 <meebey> but I am not sure, maybe I am just too stupid
17:49:01 <meebey> the initial issue is that the input is not byte[]
17:49:10 <meebey> it is already parsed bytes in strings
17:49:17 <alan> It's too late then :)
17:49:21 <meebey> say iso8859-15
17:49:27 <meebey> but it preserves the utf8 values
17:49:33 <meebey> at leat I can see them
17:49:42 <alan> byte[] -> string is a lossy conversion if you have an invalid byte sequence
17:49:53 <meebey> sure?
17:49:55 <alan> so if you have invalid utf8 you discard those chars
17:49:55 <alan> aye
17:49:55 <alan> 100%
17:49:58 <alan> i hit this before :)
17:50:04 <alan> you have to use the raw bytes
17:50:10 <meebey> ok thanks
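The lossy conversion described in the chat can be reproduced in a few lines. This is a Python sketch rather than the .NET code path smuxi uses, and `lossy_decode` is a hypothetical name. It decodes ISO 8859-1 bytes as UTF-8 with replacement, which is roughly what a decoder does when it substitutes a placeholder for invalid sequences.

```python
def lossy_decode(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for undecodable bytes."""
    return raw.decode("utf-8", errors="replace")


raw = "sécurité".encode("iso-8859-1")  # b's\xe9curit\xe9'
text = lossy_decode(raw)               # 's\ufffdcurit\ufffd'

# Two different inputs collapse to the same string, so the original
# bytes cannot be reconstructed from the decoded text.
other = lossy_decode("sècuritè".encode("iso-8859-1"))
print(text == other)                   # True

# Re-encoding the string does not give the original bytes back.
print(text.encode("utf-8") == raw)     # False

# With the raw bytes still at hand, the text is recoverable.
print(raw.decode("iso-8859-1"))        # sécurité
```

Once the placeholder has replaced the invalid byte, nothing in the string records which byte it was, which is why recoding has to happen before the bytes are turned into a string.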