RedBlog

On technology, politics and life

Entries tagged "unicode".

What a beautiful räksmörgås

2008-10-02

Unicode support in Python is better than in C or PHP. True. But that's like saying SVN is better than CVS. It doesn't say much.

Here's one piece of ugliness: you want to convert something to unicode, no matter what, and you're even happy to lose one or two "strange" characters. So, how do you do that?

>>> class X(object):
...  def __unicode__(self): return u"hej"
... 
>>> x = X()
>>> y = "åäö"

>>> unicode(x)
u'hej'

>>> unicode(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Fine, that's what we expected. OK, new try: reading up on the unicode() constructor, we find the errors= parameter, which sounds useful. Let's give it a try:

>>> unicode(y, errors="replace")
u'\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'

That worked fine. Sure, we only got \ufffd back, but we didn't care about what exactly we got back, remember? So, how does our x survive this? It doesn't even return any weird characters, so it should work even better!? Riiiiight:

>>> unicode(x, errors="replace")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, X found

Conclusion: the unicode() constructor is rather random in its choice of operation, and not in the slightest orthogonal. This is, however, an easy problem to work around with a wrapper function. As are all problems in Python. Ugly, but not fatal.
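
For what it's worth, such a wrapper could look roughly like the sketch below. The name to_unicode and the utf-8 default are my own assumptions, not anything from the standard library, and it assumes Python 2.6 or later, where bytes exists as an alias for str. The try/except at the top only lets the same code load on Python 3, where str is already the unicode type.

```python
try:
    text_type = unicode  # Python 2: the builtin used in the examples above
except NameError:
    text_type = str      # Python 3: str is already the unicode type


def to_unicode(obj, encoding="utf-8", errors="replace"):
    """Convert obj to a unicode string (hypothetical helper, not a stdlib API)."""
    if isinstance(obj, text_type):
        # Already unicode: nothing to do.
        return obj
    if isinstance(obj, bytes):
        # Byte strings are the only inputs for which encoding/errors make sense.
        return obj.decode(encoding, errors)
    try:
        # Everything else goes through __unicode__ (Python 2) or __str__.
        return text_type(obj)
    except UnicodeError:
        # Python 2 only: __str__ returned non-ASCII bytes, so decode them leniently.
        return str(obj).decode(encoding, errors)
```

With this, to_unicode(x) and to_unicode(y) both return something instead of one of them blowing up, which is all we asked for in the first place.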

Tags: bugs, languages, python, unicode.
link:http://redhog.org/Blog/What_a_beautiful_r__ksm__rg__s.html

Unicode strings aren't strings

2008-10-27

This is starting to get boring. I mean the Python unicode-bug category. There are just too many ways in which it sucks. Anyway, today's share: most people presume (and reading the official docs could lull you into thinking so) that __unicode__ works just the same way __str__ does, just for unicode strings. Not so for classes:

>>> class X(object):
...  def __init__(self, x):
...   self.x = x
...  def __str__(self):
...   return str(self.x)
...
>>> str(X)
"<class '__main__.X'>"
>>> class X(object):
...  def __init__(self, x):
...   self.x = x
...  def __unicode__(self):
...   return unicode(self.x)
...
>>> unicode(X)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: unbound method __unicode__() must be called with X instance as first argument (got nothing instead)
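
One possible way around it, as a rough sketch: look __unicode__ up on type(obj), the way str() finds __str__, instead of on the object itself. This assumes the failure above comes from unicode() picking the method up from the class object directly. The name class_safe_unicode is made up for illustration, and the try/except only lets the snippet load on Python 3 as well.

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3


def class_safe_unicode(obj):
    """Hypothetical helper: like unicode(obj), but safe to call on classes."""
    if isinstance(obj, text_type):
        return obj
    # Look the hook up on the type, so a class object never sees the
    # __unicode__ it defines for its own instances.
    hook = getattr(type(obj), "__unicode__", None)
    if hook is not None:
        return hook(obj)
    # No hook on the type: fall back to str(), as str(X) does for a class.
    # On Python 2 this still assumes that str(obj) is pure ASCII.
    return text_type(str(obj))
```

Called on an instance, it uses __unicode__ as usual; called on the class X itself, it falls back to the same text str(X) gives.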

Tags: bugs, languages, python, string, unicode.
link:http://redhog.org/Blog/Unicode_strings_arent_t_strings.html
