This post is a follow-on from a question I asked on the App Engine (Python) newsgroup. I split it off because the question would have been way too long otherwise (it's too long for a newsgroup already, and this will be pretty long too - and rambling).
The background is that I currently have a pretty basic Java/SQL based web site at http://wow.gedsguides.com/ that has write-ups of several kinds of game entities (quests, factions, non-player characters, zones, etc.), and the links that let you browse them are prefixed with counts, so "2221 Quests" and so on. It would be nice, if not absolutely vital, if I could reproduce this on App Engine. At the moment, the servlet just connects to the database at startup, does six "select count(*)"s and stores the results in some class-level variables; they're updated whenever the servlet is restarted - once a week by schedule, more often by accident :) And that's good enough.
Things are very different in App Engine. Counts are relatively expensive, and from what I've heard, application instances are flushed pretty aggressively from the servers if they haven't been used for a minute or two. The site only gets about 20,000 page hits a month (awww!), which works out to a naive average of about 2 minutes between hits - just long enough for the app to get flushed, so that nearly every request would have to hit the datastore six more times for its counts. That would translate to a site with a page load time of about 30 seconds, each time. Oh dear!
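For what it's worth, the 2-minute figure is just back-of-envelope arithmetic. A quick sketch, assuming a 30-day month and perfectly evenly spaced hits (real traffic is never that tidy):

```python
# Back-of-envelope: average gap between page hits.
# Assumptions: a 30-day month and evenly spaced traffic.
HITS_PER_MONTH = 20000
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

avg_gap_minutes = MINUTES_PER_MONTH / float(HITS_PER_MONTH)
print("average gap: %.2f minutes" % avg_gap_minutes)
```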
So, I read the article Sharding counters with interest, and also the article Using hooks in Google App Engine. I'm thinking: do a sharded counter for each article type, put the results in memcache if I'm feeling optimistic, and then even if the app does get flushed, at least repeated reads-and-summations of the sharded counters will be a lot faster than doing a load of select counts.
So then the natural way to progress is to make the sharded counters a module, import it into the admin handler script of the site, and increment the counters whenever a new article is added.
BUT, I've got a ton of data from the existing site that I'm bulkloading; about 4,000 articles altogether. They need to be counted too. One way to do that would be to add the sharded counter increments as hooks on the models' put method(s). That's nice and elegant, and ensures ... wait a second, what does it ensure???
Presumably it ensures that any puts in the admin handler script would call the hook that examines the entity to see if it's new and then increments the counter if it is. But what about code not in the admin handler script? Bulk data loads go through an entirely different code path, $PYTHON_LIB/google/appengine/ext/remote_api/handler.py, and it's not clear to me that the code there would hit the put-hook at all. Perhaps it would, if I ensured that my admin handler script had been executed, had installed the hook, and was still cached by App Engine when I did the data load. But what if I forgot to do that before doing the data load? Or what if I did it, but App Engine decided to uncache the admin script before I started the data load? Would it also unhook the hooks that that script installed? I'm not sure I want to do the work necessary to find out.
Maybe I'll go the lazy way, and write a page in the admin application that counts the article types and resets the sharded counters accordingly. It seems a bit of a waste for a process that will only ever have to be done once (ok, that's in theory; in practice, I'll be doing that data load a good few times yet, before I'm ready to go live with the site - maybe it's not a waste after all).
In fact, really, since there's only ever me that updates the site, it's not really necessary to have sharded counters at all. There's never going to be any contention for writing the counter record, since I can't write articles that fast :))). Maybe I'll save myself a lot of bother if I have non-sharded counters, have a page to recount the articles and reset the counters, and then increment the counters on new puts, without using hooks.
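As a minimal sketch of that simpler plan, assuming one counter record per article type: again a plain dictionary stands in for the datastore, and recount and save_article are hypothetical names rather than anything from App Engine.

```python
counters = {}  # stand-in for one counter record per article type


def recount(article_kinds):
    """Rebuild every counter from scratch, e.g. after a bulk load.

    article_kinds is an iterable with one kind name per stored article.
    """
    counters.clear()
    for kind in article_kinds:
        counters[kind] = counters.get(kind, 0) + 1


def save_article(kind, is_new):
    """Increment explicitly where the article is saved, with no put hook."""
    # (The real version would also store the article itself here.)
    if is_new:
        counters[kind] = counters.get(kind, 0) + 1
```

The recount page covers anything that arrives by bulk load, and the explicit increment covers articles added through the admin pages, so nothing depends on a hook being installed at the right moment.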
Yes, writing this post really helped :)