| Author |
Message |
Jackil Enthusiastic Coder
Joined: 24 May 2006 Posts: 97
|
yeah, I got a lot of errors with filenames and some Unicode coerce stuff
|
| |
|
|
BigDaddy Enthusiastic Coder
Joined: 26 May 2006 Posts: 147
|
1. Make sure the XML file really is UTF-8 by trying to decode it into Unicode with .decode('utf-8')... if that works, there are at least no UTF-8 errors
2. Either convert the Unicode to ASCII with .encode('ascii', 'xmlcharrefreplace'), which replaces funny chars with numeric character references like &#NNNN;
Or 3. run the Unicode through the XML parser and hope it works
Or 4. convert it to a national 8-bit encoding with .encode('latin-1') or some such (this only works if every character exists in that encoding), then run it through, then convert back to UTF-8 later (maybe with .decode('latin-1').encode('utf-8'))
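
Here is a rough sketch of those options, not a definitive recipe. It assumes Python 3, where .decode() lives on bytes and .encode() on str (in Python 2 the same calls work on str and unicode). The sample string is made up and stands in for the contents of the XML file.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for the raw bytes read from the XML file.
raw = "<name>Bjørn</name>".encode("utf-8")

# 1. Check that the data really is UTF-8.
#    Invalid input raises UnicodeDecodeError here.
text = raw.decode("utf-8")

# 2. Squeeze it down to ASCII; non-ASCII characters become
#    numeric character references such as &#248;
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)  # b'<name>Bj&#248;rn</name>'

# 3. Hand the data to an XML parser. The parser resolves the
#    character references back into real characters.
element = ET.fromstring(ascii_bytes)
print(element.text)  # Bjørn

# 4. Round trip through Latin-1 and back to UTF-8.
#    encode('latin-1') raises UnicodeEncodeError if a character
#    does not exist in Latin-1.
latin1_bytes = text.encode("latin-1")
utf8_again = latin1_bytes.decode("latin-1").encode("utf-8")
```

Option 2 is usually the safest for XML, since the result is plain ASCII and any XML parser will turn the references back into the original characters.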
|
| |
|
|
Jackil Enthusiastic Coder
Joined: 24 May 2006 Posts: 97
| |
BigDaddy Enthusiastic Coder
Joined: 26 May 2006 Posts: 147
| |
Jackil Enthusiastic Coder
Joined: 24 May 2006 Posts: 97
|
yeah, been looking at those urls lately. Just one more question, do you know of a certain way to identify what encoding a file has?
|
| |
|
|
BigDaddy Enthusiastic Coder
Joined: 26 May 2006 Posts: 147
|
That is a hard problem... if it has 8-bit data in it (bytes above 0x7F) and the UTF-8 decoder works, it's probably UTF-8
If there is no 8-bit data, it is most likely ASCII
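
A minimal sketch of that check, assuming Python 3 and that the file has been read in binary mode. The function name guess_ascii_or_utf8 is made up for illustration, and the result is a guess, not a guarantee.

```python
def guess_ascii_or_utf8(data: bytes) -> str:
    """Hypothetical helper: return 'ascii', 'utf-8', or 'unknown'."""
    # No byte has the high bit set, so it is most likely plain ASCII.
    if all(b < 0x80 for b in data):
        return "ascii"
    # There is 8-bit data; see whether the UTF-8 decoder accepts it.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"


print(guess_ascii_or_utf8(b"hello"))                      # ascii
print(guess_ascii_or_utf8("blåbær".encode("utf-8")))      # utf-8
print(guess_ascii_or_utf8("blåbær".encode("latin-1")))    # unknown
```

Random 8-bit text rarely happens to be valid UTF-8, which is why a successful decode is a fairly strong hint.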
|
| |
|
|
Jackil Enthusiastic Coder
Joined: 24 May 2006 Posts: 97
| |
BigDaddy Enthusiastic Coder
Joined: 26 May 2006 Posts: 147
|
If it has a Unicode byte order mark (the bytes FF FE or FE FF) at the beginning, it is probably UTF-16
If you are in Norway and it's not UTF-8 and not ASCII, it might be ISO-8859-15
Or who knows, it might be Chinese in some two-byte encoding
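
Putting those guesses together might look something like this. It is only a sketch, assuming Python 3; the names sniff_bom and decode_with_fallback are made up, and the ISO-8859-15 fallback is just the Norway guess from above, so pick whatever fits your data. It only looks for the UTF-16 byte order marks mentioned here.

```python
import codecs


def sniff_bom(data: bytes):
    """Hypothetical helper: return 'utf-16' if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        # The 'utf-16' codec reads the BOM to pick the byte order and strips it.
        return "utf-16"
    return None


def decode_with_fallback(data: bytes, fallback: str = "iso-8859-15") -> str:
    """Hypothetical helper: try BOM, then UTF-8, then a regional 8-bit guess."""
    bom_codec = sniff_bom(data)
    if bom_codec is not None:
        return data.decode(bom_codec)
    try:
        # Pure ASCII is also valid UTF-8, so this covers both cases.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Last resort: a wrong guess here gives wrong characters,
        # so treat the result with suspicion.
        return data.decode(fallback)


print(decode_with_fallback("blåbær".encode("utf-16")))        # blåbær
print(decode_with_fallback("blåbær".encode("utf-8")))         # blåbær
print(decode_with_fallback("blåbær".encode("iso-8859-15")))   # blåbær
```

The order matters: the BOM and UTF-8 checks can actually fail, so they go first, while the 8-bit fallback is a plain guess and goes last.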
|
| |
|
|
Jackil Enthusiastic Coder
Joined: 24 May 2006 Posts: 97
| |
|