Unicode woes and Python unit testing in GAE

One of the really cool aspects of deploying to Google's cloud offering (GAE), versus the more machine-oriented Microsoft Azure and Amazon EC2 approaches, is that you really are only dealing with computing resources. You deploy your app not to any particular server, but to the cloud itself. Despite the very real challenges in distributing work across data centers, I am still filled with visions of automagical propagation and distribution and unlimited elastic computing. Beautiful.

Anyway, I was inspecting the logs and was SHOCKED to discover that close to a thousand quotes have been added to the system by 37 users. THIRTY SEVEN USERS!! Now, these are small numbers, I know, but given that the only publicity I've ever given this has been the two posts on this blog, I was absolutely amazed to find that people had not only managed to find the app but were able to use it! (It's quite ugly.)

There is nothing quite as motivating as having users.

So I started looking at log files and discovered that I am actually throwing errors for at least some of those users who are not English speaking and are using Unicode characters. Oops. You'd think I'd know better by now. See: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

In most cases, fixing these bugs just meant calling unicode(var) rather than str(var). For example, when parsing URL variables, I started with the example provided by Google and added my own processing of an array of arguments to a method, like so...

  def update_field(self, fieldName, *args):
    strings = [str(arg) for arg in args]

and then calling that code with a Unicode character such as "–" would produce an error that looked like this...

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 0: ordinal not in range(128)

The same code, corrected, is below. There were a few variants on this, but it is a good example of a function where user-supplied input forced a change.

  def update_field(self, fieldName, *args):
    strings = [unicode(arg) for arg in args]
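To see why the original version blew up: in Python 2, str() on a unicode object quietly tries to encode it with the ASCII codec. Here is a small sketch that makes that encode step explicit, so it behaves the same under Python 2 and Python 3. The helper name ascii_error is made up for illustration and is not part of the app.

```python
EN_DASH = u"\u2013"  # the character from the log entry above


def ascii_error(text):
    """Return the UnicodeEncodeError raised by an ASCII encode, or None.

    Hypothetical helper: it spells out the ASCII encode that Python 2's
    str(u"...") performs implicitly.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


# The en dash has no ASCII representation, so this prints the codec error.
print(ascii_error(EN_DASH))

# Plain ASCII text encodes without complaint, so this prints None.
print(ascii_error(u"hi there"))

# Encoding to UTF-8 instead always succeeds for valid text.
print(repr(EN_DASH.encode("utf-8")))
```

Keeping the value as unicode (as in the corrected method) sidesteps the encode entirely; an explicit encode to UTF-8 is only needed at the point where bytes are really required.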

So in the course of digging in and finding/fixing these bugs, I also recognized I was far overdue for some unit testing. Starting from the premise that I needed a way to reproduce the issue in the lab and write a failing test for this bug, I went in search of the best way to do so. Python actually has pretty good unit test support built in, but I was also looking for something that would work in the context of GAE. It doesn't take much looking before you stumble on GAEUnit, a useful extension that provides some scaffolding for tests that run within GAE. The major negative is that your tests become part of your application. I don't have a problem with that, as it's easy to restrict them to administrator access only, but for some people it's a blocker.

The next issue I ran into was that the Python SDK for GAE doesn't include a way to test content behind authentication. Whereas in Java you can say this:

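    private final LocalServiceTestHelper helper =
        new LocalServiceTestHelper(new LocalUserServiceTestConfig())
            .setEnvIsAdmin(true).setEnvIsLoggedIn(true);

    @Test
    public void testIsAdmin() {
        // Assumes helper.setUp() has been called in a @Before method (not shown).
        UserService userService = UserServiceFactory.getUserService();
        assertTrue(userService.isUserAdmin());
    }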

But today there is no equivalent for that in Python. I wasted a few hours trying to work around this before deciding to just log in before I run the tests. My login lasts for hours, so this isn't much of a constraint. In the long run I plan to drop Google's authentication anyway and move to janrain/rpx to allow users to log in using whatever they like. That will mean abandoning, or at least shimming, the infrastructure provided by GAE, but it has the additional benefit of reducing my application's direct dependency on Google. Once I have my own definition of Session and session provider, I can of course mock it out and control it a lot more easily.

Here's the test. There are no asserts, but if there is a problem we'll see an exception:

  def test_unicode_to_json_error(self):
      badQuote = q.Quote(quote='hi there ' + u"\u2013")
      result = badQuote.jsonSafe()
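For anyone who would rather have an explicit assertion than rely on an exception, a variant might look something like the sketch below. It uses only the standard unittest and json modules. The json_safe function is a hypothetical stand-in for Quote.jsonSafe(), since the real model class lives in the app, so treat this as the shape of the test and not the actual code.

```python
import json
import unittest


def json_safe(quote_text):
    # Hypothetical stand-in for Quote.jsonSafe(): serialize the quote text
    # into a JSON document.
    return json.dumps({"quote": quote_text})


class UnicodeJsonTest(unittest.TestCase):
    def test_unicode_round_trips_through_json(self):
        text = u"hi there " + u"\u2013"
        encoded = json_safe(text)
        # Decoding the JSON should give back exactly the text we put in.
        self.assertEqual(json.loads(encoded)["quote"], text)
```

Saved as a module, this can be run with python -m unittest, and a regression shows up as a failed assertion with a diff, which is a bit friendlier than a bare traceback.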

So one error-type in the log file later I've spent a couple days, done a bunch of refactoring and have produced no new functionality... but I do have this!

Of course, after all that I STILL ran into an error with the JavaScript, for which I have yet to add unit testing. Sigh. I'll have to choose between qUnit, which I've used before, and YUI-test, which is probably better aligned with the rest of my application.

The JavaScript portion of this bug was basically because I was calling unescape() on a URL-encoded variable. URL encoding, though, only carries 8-bit values and doesn't truly allow Unicode, so my Unicode characters each came out as several 8-bit characters (three for "–"), which could include control characters or other undesired results. There seem to be a lot of people standardizing on UTF-8 within URL-encoded variables, but it's purely a convention at this point. Thankfully I found this pre-written code on webtoolkit.info that knows how to look ahead at the next three characters in order to handle UTF-8 input properly.
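To make the three-characters problem concrete, here is a small Python sketch of the same effect (it is not the webtoolkit.info code, just an illustration). It assumes the browser sent "–" as the UTF-8 percent-escapes %E2%80%93, and compares treating each byte as its own character, which is roughly what unescape() does, with decoding the bytes as UTF-8.

```python
# The three bytes behind the percent-escapes %E2%80%93 (assumed UTF-8 input).
RAW = b"\xe2\x80\x93"

# One character per byte, roughly what JavaScript's unescape() produces.
per_byte = RAW.decode("latin-1")

# Reading the same bytes as a single UTF-8 sequence.
as_utf8 = RAW.decode("utf-8")

print(len(per_byte))   # 3 separate characters
print(len(as_utf8))    # 1 character

# The second and third per-byte characters are C1 control characters.
print([hex(ord(ch)) for ch in per_byte])
```

The per-byte result is an "â" followed by two control characters, which matches the kind of junk described above, while the UTF-8 decode gives back the single en dash.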

Now I hope it's safe to say that blogquotes can handle Unicode! (And has some unit tests to boot!)