Tuesday, September 8, 2009

App caching, hooks, sharded counters, and bulk data uploads

This post is a kind of follow-on from a question I asked on the App Engine (Python) newsgroup. I split it up because it would have been just way too long (that one is too long for a newsgroup already, and this will be pretty long too - and rambling).

The background is that at the moment I've got a pretty basic Java/SQL based web site at http://wow.gedsguides.com/ that has write-ups of several kinds of game entities (quests, factions, non-player characters, zones, etc. etc.), and at the moment, the links that let you browse them are prefixed with counts, so "2221 Quests" and so on. It would be nice, if not absolutely vital, if I could reproduce this on App Engine. At the moment, the servlet just connects to the database at startup and does six "select count(*)"s and stores the result in some class-level variables; it's updated whenever the servlet is restarted - once a week by schedule, more often by accident :) And that's good enough.

Things are very different in App Engine. Counts are relatively expensive, and from what I've heard, application instances are flushed pretty aggressively from the servers if they haven't been used for a minute or two. The site only gets about 20,000 page hits a month (awww!) so that would seem to indicate that a naive average time between hits is 2 minutes - just long enough for the app to get flushed, and ensure that it would have to hit the datastore again six more times for its counts. So that would translate to a site that has a page load time of about 30 seconds, each time. Oh dear!
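For what it's worth, here is that back-of-envelope sum spelled out. The 30-day month is my assumption; the hit count is the figure quoted above.

```python
# Back-of-envelope: average gap between page hits.
HITS_PER_MONTH = 20000             # figure quoted in the post
MINUTES_PER_MONTH = 30 * 24 * 60   # assumption: a 30-day month

# Use float division so the result is the same on Python 2 and 3.
avg_gap_minutes = MINUTES_PER_MONTH / float(HITS_PER_MONTH)
print(round(avg_gap_minutes, 2))
```

That comes out at a little over two minutes, which is where the "just long enough to get flushed" worry comes from.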

So, I read the article Sharding counters with interest, and also the article Using hooks in Google App Engine. I'm thinking, do a sharded counter for each article type, put the results in memcache if I'm feeling optimistic, and then even if the app does get flushed at least repeated reads-and-summations of the sharded counters will be a lot faster than doing a load of select counts.

So then the natural way to progress is to make the sharded counters a module, import it into the admin handler script of the site, and increment the counters whenever a new article is added.
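To make the idea concrete, here is a minimal sketch of what such a module might look like. It is NOT the code from the Sharding counters article: plain dicts stand in for the datastore and memcache so that it runs anywhere, and the names (NUM_SHARDS, increment, get_count) are just ones I made up for illustration.

```python
import random

NUM_SHARDS = 20  # hypothetical shard count

# Stand-in for the datastore: maps (counter_name, shard_index) -> count.
_shards = {}
# Stand-in for memcache: maps counter_name -> cached total.
_cache = {}


def increment(name):
    """Add one to a randomly chosen shard of the named counter."""
    index = random.randint(0, NUM_SHARDS - 1)
    key = (name, index)
    _shards[key] = _shards.get(key, 0) + 1
    # Simplification: drop the cached total rather than updating it.
    _cache.pop(name, None)


def get_count(name):
    """Return the total, summing the shards only on a cache miss."""
    if name not in _cache:
        _cache[name] = sum(
            count for (shard_name, _index), count in _shards.items()
            if shard_name == name)
    return _cache[name]
```

The admin handler would then call something like increment('quests') each time a new quest article is saved, and the front page would call get_count('quests') to build its "2221 Quests" link.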

BUT, I've got a ton of data from the existing site that I'm bulkloading; about 4,000 articles altogether. They need to be counted too. One way to do that would be to add the sharded counter increments as hooks on the models' put method(s). That's nice and elegant, and ensures ... wait a second, what does it ensure???

Presumably it ensures that any puts in the admin handler script would call the hook that examines the entity to see if it's new and then increments the counter if it is. But what about code not in the admin handler script? Bulk data loads go through an entirely different code path, $PYTHON_LIB/google/appengine/ext/remote_api/handler.py, and it's not clear to me that the code there would hit the put-hook at all. Perhaps it would, if I ensured that my admin handler script had been executed, had installed the hook, and was still cached by App Engine when I did the data load. But what if I forgot to do that before doing the data load? Or what if I did it, but App Engine decided to uncache the admin script before I started the data load? Would it also unhook the hooks that that script installed? I'm not sure I want to do the work necessary to find out.

Maybe I'll go the lazy way, and write a page in the admin application that counts the article types and resets the sharded counters accordingly. It seems a bit of a waste for a process that will only ever have to be done once (ok, that's in theory; in practice, I'll be doing that data load a good few times yet, before I'm ready to go live with the site - maybe it's not a waste after all).

In fact, really, since there's only ever me that updates the site, it's not really necessary to have sharded counters at all. There's never going to be any contention for writing the counter record, since I can't write articles that fast :))). Maybe I'll save myself a lot of bother if I have non-sharded counters, have a page to recount the articles and reset the counters, and then increment the counters on new puts, without using hooks.
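Roughly what I have in mind, as a sketch only: a list and a dict stand in for the datastore, and the function names recount and add_article are hypothetical.

```python
# Stand-in for the datastore: a list of (kind, title) articles,
# pretending these came from the bulk load.
articles = [("quest", "A"), ("quest", "B"), ("zone", "C")]

# One plain (non-sharded) counter per article kind.
counters = {}


def recount():
    """Rebuild every counter from scratch, e.g. after a bulk load."""
    counters.clear()
    for kind, _title in articles:
        counters[kind] = counters.get(kind, 0) + 1


def add_article(kind, title):
    """Store a new article and bump its counter explicitly (no hooks)."""
    articles.append((kind, title))
    counters[kind] = counters.get(kind, 0) + 1
```

The recount page gets run once after each data load; after that, add_article keeps the numbers in step.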

Yes, writing this post really helped :)

Wednesday, April 22, 2009

URL routing in Python web apps

There are two places to put URL routing information. The first is in your app.yaml, in the 'handlers' section. The second is in the Python script that is called *from* the handlers section, at the end, where you create the webapp.WSGIApplication.

"So," you might think, "all I have to do is split my application into modules, assign modules to urls in app.yaml, and then assign sub-urls to classes when I create the webapp.WSGIApplication, right?" Wrong.

Say you are creating a backend for a Flex rich client application, and suppose you have something like this in your app.yaml:

handlers:
- url: /flex
  script: client.py

and then in your client.py:

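from google.appengine.ext import webapp

# TestPage is a webapp.RequestHandler subclass defined earlier in client.py.
application = webapp.WSGIApplication(
    [('/test', TestPage)],
    debug=True)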

If you think that this will cause the TestPage class to be invoked for a URL like "http://localhost:8080/flex/test" then you are unfortunately wrong. Perhaps because Google's Getting Started documentation deals with pages that are available directly off the root, I naively thought that the url filters in the webapp.WSGIApplication got *added* to the end of those in app.yaml, so that you could use app.yaml for first-order routing and webapp.WSGIApplication for second-order routing.

But no, the url mappings in webapp.WSGIApplication have to match the *full* url. App.yaml just points to the script that needs to do the work: the url fragment that it matches is not then stripped from the start of the url that's passed to the script.
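Here's a little stand-alone illustration of that behaviour. It is not webapp's actual code: find_handler is a hypothetical helper of mine, and I'm assuming that each pattern is treated as a regular expression anchored at both ends of the request path, which is how it appears to behave.

```python
import re


class TestPage(object):
    """Placeholder for a real webapp.RequestHandler subclass."""


def find_handler(url_mapping, path):
    """Return the handler whose pattern matches the whole path, or None."""
    for pattern, handler in url_mapping:
        # Assumption: the pattern must match the full path, start to end.
        if re.match('^' + pattern + '$', path):
            return handler
    return None


# The mapping from the first attempt never sees a match...
print(find_handler([('/test', TestPage)], '/flex/test'))
# ...whereas spelling out the full path does.
print(find_handler([('/flex/test', TestPage)], '/flex/test'))
```

The first lookup finds nothing because '/test' is compared against the whole of '/flex/test', not against whatever is left over after app.yaml has had its turn.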

So if I want to handle "http://localhost:8080/flex/test" in a TestPage class in client.py, I need to have something like:

handlers:
- url: /flex/.*
  script: client.py

in app.yaml and:

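from google.appengine.ext import webapp

# TestPage is a webapp.RequestHandler subclass defined earlier in client.py.
application = webapp.WSGIApplication(
    [('/flex/test', TestPage)],
    debug=True)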

in client.py.

Why is this important? It's important because it means that your WSGIApplication scripts need to know how you are dividing up your url space, which I consider bad form. I can't, for example, change the base url for the data endpoints for my rich internet applications from "/flex" to "/ria" just by changing one line in app.yaml, which is what I would very much like to do. Instead I have to hunt through all my WSGIApplication scripts and change every single mapping in them as well.
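One way to take some of the sting out of this, sketched below, is to keep the prefix in a single constant and build the mappings from it. BASE_URL and with_base are names I've invented for the example, and this is only a partial fix.

```python
BASE_URL = '/flex'  # hypothetical: the one place the prefix is spelled out


class TestPage(object):
    """Placeholder for a real webapp.RequestHandler subclass."""


def with_base(mapping, base=BASE_URL):
    """Prefix every pattern in a url mapping with the base url."""
    return [(base + pattern, handler) for pattern, handler in mapping]


routes = with_base([('/test', TestPage)])
# In client.py this list would then be passed to the application, e.g.:
# application = webapp.WSGIApplication(routes, debug=True)
print(routes[0][0])
```

Moving from "/flex" to "/ria" would then mean changing app.yaml plus that one constant (or one per script if it isn't shared), rather than every mapping. It's still not the one-line change in app.yaml that I was hoping for.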

Wednesday, April 8, 2009

Google App Engine, now with added caffeine!

So I'm coding away my baby steps in Python, feeling all proud because I've got about 80% (ha!) of the site that I'm using as a test project converted over, when I open a new tab to look once again at the syntax for the templating language and BAM! I see "Java Early Look" on the left hand of the page.

Seriously? Java? They actually did it? Trembly-hands time, but I managed to click to this post on the Google App Engine Blog before the excitement got to me and I had to make a cup of tea.

But there's even more, including long-awaited support for cron jobs, a Secure Data Connector so that App Engine applications can access data from behind your firewall, and a new version of GWT that can work with the Java version of App Engine as a back end. Empires will surely rise and fall due to this release; I wonder when they'll be adding support for .NET? (Answer: Never!)

Secure Data Connectors look particularly interesting. It all depends on how fast it works of course, but I can see how it might enable mash-ups between relational data held on your own private servers and Bigtable data held in the App Engine datastore. I had been wondering if Google were going to address the fact that Microsoft is including SQL Server support in Azure (though we have yet to see how well that scales in practice), and this provides at least a partial answer.

There's also intriguing talk in the blog post of something they call Database Import: "move GBs of data easily into your App Engine app. Matching export capabilities are coming soon, hopefully within a month." Unfortunately the link they give points to the old how-to page on uploading data from CSV files. Maybe they'll fix the link, or maybe that page will get updated soon.

Saturday, April 4, 2009

What is this blog, and why am I writing it?

I've just started seriously porting a Java/Postgres-based website to App Engine, and already I'm finding problems with holes in the documentation, misleading threads on discussion groups and so on.

The documentation Google are currently supplying with App Engine is excellent for those first baby steps, but seems to fall down as soon as you enter the real world. No doubt that will be rectified in time, but until then you're left with scanning the source code.

The App Engine newsgroup, while a necessary resource for anyone doing App Engine based development, has quite a low signal-to-noise ratio. There are a lot more questions there than answers.

There are some good App Engine-related blogs out there, some of them written by very skilled developers, but they seem to concentrate on either topics of personal interest or problems they are running into at work. That is, they don't seem general enough. Sometimes, they aren't *basic* enough :)))

So I want this blog to be a resource for would-be App Engineers: intelligent developers who are new(ish) to App Engine. You can be extremely experienced in Java, say, and SQL databases, but still be tripping over yourself when it comes to App Engine basics. This blog is meant to cover the sort of thing you need when you are beginning.

I was originally going to call this blog "App Engineering", which sounds either more objective or more collegiate, or both. But appengineering.blogspot.com was gone (it redirects somewhere), so "App Engineer" it is, even though that might sound a bit me-me-me. I promise to try to make it us-us-us.