What a beautiful räksmörgås
Unicode support in Python is better than in C or PHP. True. But that's like saying SVN is better than CVS. It doesn't say much.
Here's one piece of ugliness: you want to convert something to unicode, no matter what, and you're even happy to lose one or two "strange" characters. So, how do you do that?
>>> class X(object):
...     def __unicode__(self): return u"hej"
...
>>> x = X()
>>> y = "åäö"
>>> unicode(x)
u'hej'
>>> unicode(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Fine, that's what we expected. OK, new try: reading up on the unicode() constructor, we find the errors= parameter, which sounds useful. Let's give it a try:
>>> unicode(y, errors="replace")
u'\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'
That worked fine. Sure, we only got \ufffd back, but we didn't care about what exactly we got back, remember? So, how does our x survive this? It doesn't even return any weird characters, so it should work even better!? Riiiiight:
>>> unicode(x, errors="replace")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, X found
Conclusion: The unicode() constructor is rather random in its choice of operation, and not in the slightest orthogonal. This is, however, an easy problem to work around with a wrapper function. As are all problems in Python. Ugly, but not fatal.
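For what it's worth, here is a rough sketch of what such a wrapper might look like. The name to_unicode, the text_type alias and the UTF-8 default are my own assumptions, not anything Python ships. The idea is simply that errors= only makes sense when there are bytes to decode, so byte strings get decoded leniently and everything else gets asked for its unicode form. The try/except at the top is only there so the same sketch also runs on Python 3, where unicode() no longer exists.

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3: str is already the unicode type


def to_unicode(obj, encoding="utf-8", errors="replace"):
    """Hypothetical helper: return unicode text for obj, tolerating odd bytes."""
    if isinstance(obj, text_type):
        return obj  # already unicode, nothing to do
    if isinstance(obj, bytes):
        # Byte strings are what encoding/errors actually apply to.
        return obj.decode(encoding, errors)
    if hasattr(obj, "__unicode__"):
        # Objects like X from the post: ask them directly.
        return text_type(obj.__unicode__())
    try:
        return text_type(obj)
    except UnicodeError:
        # Python 2 only: str(obj) produced non-ASCII bytes; decode them leniently.
        return bytes(obj).decode(encoding, errors)


class X(object):
    def __unicode__(self):
        return u"hej"


print(repr(to_unicode(X())))
print(repr(to_unicode(b"\xc3\xa5\xc3\xa4\xc3\xb6", encoding="ascii")))
```

With this, both the x and the y from above go through the same call, and neither raises. Passing encoding="ascii" mimics the lossy behaviour of unicode(y, errors="replace"), while the assumed UTF-8 default would actually keep the åäö intact.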