Haha, apparently you feel a little more pressured than I do!
I took a second look at my Java code, and I can tell I don't have the intellectual fortitude to work on it until I'm a couple days removed from my current project. Coding all day for work has me averse to coding for funsies.
One thing I noticed while working out my own scraper for GAF: even though it seemed very simple conceptually (Discussions --> Threads --> Posts), there are a lot of mildly tricky data structure problems in handling those over a stateless connection with no data store. I ended up:
- Holding most of GAF (as it was browsed by users of my webapp) in memory.
- Assigning cascading caches to each layer (e.g. 2 days for the discussions page, 5 minutes for the thread listings, 1 minute for each individual thread), so that I couldn't hammer the GAF server too badly.
- Manually pruning the caches to prevent them from getting too massive.
- Using redundant data structures so I could easily do reverse lookups (e.g. given a post ID, find the thread, as well as the more typical opposite).
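
For anyone curious, here is a minimal sketch of how the caching and reverse-lookup pieces could look in Java. To be clear, this is not the actual scraper code: `TtlCache`, `PostIndex`, the injectable clock, and the use of `long` IDs are all hypothetical names and choices made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache: entries expire after ttlMillis and can be pruned manually. */
class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be exercised without sleeping

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    synchronized void put(K key, V value) {
        entries.put(key, new Entry<>(value, clock.getAsLong() + ttlMillis));
    }

    /** Returns the cached value, or null if absent or expired (the caller would then re-scrape). */
    synchronized V get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) return null;
        if (clock.getAsLong() >= e.expiresAt) {
            entries.remove(key);
            return null;
        }
        return e.value;
    }

    /** Manual pruning: drops every expired entry and returns how many were removed. */
    synchronized int prune() {
        long now = clock.getAsLong();
        int before = entries.size();
        entries.values().removeIf(e -> now >= e.expiresAt);
        return before - entries.size();
    }

    synchronized int size() { return entries.size(); }
}

/** Hypothetical redundant index: thread -> posts, plus post -> thread for reverse lookups. */
class PostIndex {
    private final Map<Long, List<Long>> postsByThread = new HashMap<>();
    private final Map<Long, Long> threadByPost = new HashMap<>();

    synchronized void add(long threadId, long postId) {
        postsByThread.computeIfAbsent(threadId, k -> new ArrayList<>()).add(postId);
        threadByPost.put(postId, threadId);
    }

    /** Forward lookup: a copy of the post IDs in a thread (empty if unknown). */
    synchronized List<Long> postsIn(long threadId) {
        return new ArrayList<>(postsByThread.getOrDefault(threadId, Collections.emptyList()));
    }

    /** Reverse lookup: the thread a post belongs to, or null if unknown. */
    synchronized Long threadOf(long postId) {
        return threadByPost.get(postId);
    }

    /** Removes a thread from both maps together so they cannot drift apart. */
    synchronized void removeThread(long threadId) {
        List<Long> posts = postsByThread.remove(threadId);
        if (posts != null) posts.forEach(threadByPost::remove);
    }
}
```

With something like this you would create one `TtlCache` per layer, each with its own lifetime (2 days, 5 minutes, 1 minute), passing `System::currentTimeMillis` as the clock, and call `prune()` on a timer. The catch with redundant maps is that every removal has to touch both of them; forgetting one side is an easy way to end up holding on to memory.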
Otherwise, all very doable. I think I spent a total of ~12 hours building the scraper and optimizing it (though I know there's a memory leak in there, because after a few days the web server would run out of heap space).