Halfbakery: still some things we can try

Halfbakery: it's not just a good idea, it's also some bad ones.

It's late, Thursday.
Something is wrong with the halfbakery.
The machine doesn't boot.
It's probably the disk.
I really should have set up those backups.

Jutta Degener, Oct 2004

still some things we can try
Recovering from a catastrophic halfbakery disk crash without recent backups.

The halfbakery disk that didn't boot that Thursday night would never boot again. In fact it wouldn't become readable again beyond its first 25%, which didn't hold halfbakery data. The last backup we found was from late 2002, a little less than two years out of date.

This paper describes how we nevertheless got our site back, and why, when you go to www.halfbakery.com today, you pretty much see what you saw before the disk crash. Especially if you're coming from a Google search.

What is the halfbakery?

The halfbakery is a collaborative database where people post, annotate, and vote for poorly thought out ideas, all via a web interface. It's pretty large for an amateur operation, with more than 10,000 registered users and an active core of maybe 200 that visit the site between once a day and once a week.

Like many such systems, it has roughly three parts. One part is the source code for a CGI script: five libraries and a using application. Part two is a set of templates that control how the output is generated, and some editorial framework; the help file, "what's new", etc. Part three is the actual data; ideas, links, annotations that the users entered. The data is stored in a mostly binary format in large database files.

There are a few other pieces here and there, for example the settings in the httpd configuration file.

Hard disk recovery

Hard disk recovery services exist. You ship them your drive; if they can recover the data, they ship you back another drive or some other piece of mass storage with your data on it. They're expensive, and they don't always work. For example, it didn't work for us.

They give you quotes over the phone and online. Some don't do it over the phone. Quotes depend on your type of drive (SCSI is more expensive than IDE), on the type of damage (whether clean room work is required or not), and on whether there's a RAID involved.

Some places give you a guaranteed maximum. Some places don't charge you if they don't recover the data. Most give you a step-by-step process; they look at the disk, then give a more detailed assessment with possible charges, and the option to back away before the really expensive part - the clean room work - starts.

Before picking a service, it's a good idea to decide whether you're willing to pay a thousand dollars or more for your data if clean room work is needed to recover it. If not, it's probably worth looking around among your acquaintances for someone who has experience with, say, swapping out a faulty controller. It's easier to find people who can do that than to find people who have a clean room in their garage.

The highest quote I got for an obviously badly damaged 18-gig SCSI drive was capped at $2700, with a fixed clean room fee that would be charged whether or not they recovered anything.

The lowest quote I got was $1500 for successful recovery only, from dtidata.com. I went with that one. They ended up being unable to recover the disk, and I didn't pay anything but shipping.

Maybe the $1200 more for the expensive service would have paid for near-magic expertise that would have gotten our data back; or maybe dtidata has economies of scale that other places can't match, and no other place would have been able to recover the disk, either. To tell the difference, one would need a lot more experience with disk recovery than I hope to get.

It's interesting to deal with services that can charge you this much for a few days' work. It makes many things easy. For example, it's simply no use to worry about shipping costs if you're about to pay this much money, and they do answer the phone and are generally eager to get your business. They even had a "tracking number" interface where I could enter the case number that everybody used to refer to the disk and get status updates.

But in the end, it still comes down to Brett and Bob working in a bunny suit in a clean room somewhere trying to figure out how to stitch your file system back together. The status tracker updates never got past "pre-recovery". All the interesting news required me to call in, ask to talk to the guy in the clean room, wait a while on the unusually silent line - no "on hold" music - and then hear that it's looking bad, but there are still some things they can try.

Parts

The last major rewrite of the source code had happened about a year ago, in 2003, when I briefly hoped to sell a copy of the site's code for further development to a third party. That didn't happen, but it meant that I had the halfbakery source code and templates, minus the bugs fixed over the course of a year, on a laptop.

We had most recently changed systems in late 2002. During the transition, the data had moved to a system on which it remained, and it hadn't changed formats since then. So we had data from late 2002, and code from mid 2003, as the basis. With a bit of tooling, that quickly got us to the point of having a site that at least picked up the phone. That was our starting point.

Work

While I was waiting for the local expert to give up on the disk, and then, later, for the commercial service in Pasadena to pronounce it irrecoverable, the halfbakery users were without a site many of them were used to visiting every day. They got more work done in those weeks. I got more work done, too.

Google

You can limit Google searches to a site by using the site: modifier. For example, a search for

site:www.halfbakery.com custard

only finds pages on the halfbakery in which the word "custard" occurs, and

site:www.halfbakery.com

finds everything Google has from the halfbakery.

Google had about a third of the site in its cache. The site had a redundant sub-arm of pages starting with "/lr", for the low-resolution version; between "/lr/idea" and "/idea", about half of the ideas were in the cache. About 16,000.

But Google only ever serves 1000 links per search query. Results after the 1000th are counted, but not actually accessible.

I wrote a semi-automatic Google-sucker that saved Google-cached responses for me and ran it for two days, just saving the idea and user webpages as files in a big idea/ directory.

User input was still necessary to cut different slices out of the Google cache and get new results under the 1000 item limit. I used names of prolific users (excluding them on the later searches), common terms, names of categories the halfbakery is organized in. I'm sure there's a way of doing this automatically, but I couldn't think of a good algorithm.
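The slicing described above can be pictured as a series of queries, each one excluding the terms that earlier queries already covered. What follows is a minimal sketch in Python, not the tool that was actually run. The function name `slice_queries` and the sample terms are made up for illustration, and it assumes only the `site:` modifier and Google's `-term` exclusion syntax. It builds query strings and fetches nothing.

```python
def slice_queries(site, terms):
    """Build one query per term, excluding all earlier terms.

    Each query narrows the results to pages matching one term while
    leaving out pages that an earlier query should already have found.
    """
    queries = []
    seen = []
    for term in terms:
        exclusions = "".join(f" -{t}" for t in seen)
        queries.append(f"site:{site} {term}{exclusions}")
        seen.append(term)
    return queries


# Hypothetical slice terms: common words or category names.
for q in slice_queries("www.halfbakery.com", ["custard", "car", "food"]):
    print(q)
```

Whether a given slice stays under the 1000-result limit still has to be checked by hand, which is the part that resisted automation.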

I talked about the crash and what I was doing to a friend who works at Google. He wrote his own cache-sucker and sent me his results. His results were a bit cleaner than mine, so I used them first. My saved copies still had the search terms highlighted in them. In the end, that wouldn't matter.

Because his tool ran a little later than mine, fewer ideas were left. My tool had found 4,000 more. But many of those that both our tools found were already in the halfbakery from the 2002 backups.

Fakebakery

A user set up a replacement halfbakery, the fakebakery, in about a day, using free bulletin board software. The replacement supported avatars, private messages, polls, and voting.

I was happy that it happened, and I used the fakebakery for status announcements, as promised. But I couldn't use the interface very well. After posting something, I didn't quite know where I was. Things were in the wrong place, and too many of them; when I wanted to see someone's posting, I'd instead end up on their profile page; when I wanted to go home, to the home page of the site, I'd end up on my own profile page, and so on.

The special thing about the halfbakery is not its features but its simplicity. When people asked me before about getting a copy of the software, I referred them to the other, more full-featured systems. That was stupid and I'll never do it again.

But that simplicity isn't necessary to have a functioning online community. Everything seemed to carry over from the old halfbakery, including even a troll -- one user whom I had banned from the halfbakery got an account in the replacement and jeered at us. "This is karmic justice!"

archive.org

I never ended up using the contents of archive.org, and don't know how much of the site is actually saved there. The halfbakery itself is not a good navigation interface for downloading all of the site.

Those are minor complaints; if Google hadn't been so easy, I would have written something to traverse archive.org. And someone, somewhere, should. But I was worn out by then.

Heroic programming

For a weekend, and then sporadically afterwards, I wrote software, tried it on the live system, wrote some more, tried it again, wrote some more, nudging the halfbakery database into something stable. I don't usually work like that. Usually, the goal is to make an aesthetically pleasing program that stands on its own, not one that just does one job, and there is much reading and rereading of code - does this make sense? Is it clearly expressed? Being able to just hack and move on felt different.

I ended up with one back-end, with four calls that inserted a new user, idea, link, or note into the halfbakery; and different frontends built to parse different structures that were fed into the system: the regular halfbakery page, the low-res version, and an XML-like mark-up for texts that I couldn't fit into either of the other forms.
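The shape of that arrangement, one back-end with four insert calls and several front-ends feeding it, might look roughly like this. This is a hedged sketch in Python rather than the original C code; the class `Backend`, the function `feed`, and all argument lists are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Stand-in for the single back-end: four insert calls, nothing else."""
    users: list = field(default_factory=list)
    ideas: list = field(default_factory=list)
    links: list = field(default_factory=list)
    notes: list = field(default_factory=list)

    def insert_user(self, name):
        self.users.append(name)

    def insert_idea(self, name, author, text):
        self.ideas.append((name, author, text))

    def insert_link(self, idea, url, description):
        self.links.append((idea, url, description))

    def insert_note(self, idea, author, text):
        self.notes.append((idea, author, text))


def feed(backend, records):
    """Replay parsed records into the back-end.

    Every front-end, whatever format it parses, reduces its input to
    (kind, fields) tuples, so the insertion logic exists only once.
    """
    dispatch = {
        "user": backend.insert_user,
        "idea": backend.insert_idea,
        "link": backend.insert_link,
        "note": backend.insert_note,
    }
    for kind, fields in records:
        dispatch[kind](*fields)


# Hypothetical output of one front-end for one saved page.
backend = Backend()
feed(backend, [
    ("user", ("kim",)),
    ("idea", ("bar", "kim", "A half-baked idea.")),
    ("note", ("bar", "kim", "An annotation.")),
])
```

Keeping the front-ends this thin means a new input format only needs a new parser, not new insertion code.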

Parsing them was mostly a matter of matching patterns in strings, taking out the text fragments between them, and post-processing them to remove URL or HTML mark-up. A perl programmer would have easily been able to do that in perl; I'm more comfortable in C, especially when doing a lot of string processing.
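As an illustration of that kind of pattern matching, here is a minimal sketch in Python (the original tools were written in C and are not shown in this paper). The helper names `between` and `strip_markup` and the sample page fragment are assumptions; the real marker strings in halfbakery pages are not given here.

```python
import html
import re


def between(page, start, end):
    """Return the text between the first `start` marker and the next `end`, or None."""
    i = page.find(start)
    if i < 0:
        return None
    i += len(start)
    j = page.find(end, i)
    if j < 0:
        return None
    return page[i:j]


def strip_markup(fragment):
    """Drop tags, decode character entities, and collapse whitespace."""
    text = re.sub(r"<[^>]*>", "", fragment)
    text = html.unescape(text)
    return " ".join(text.split())


# Hypothetical page fragment, for illustration only.
page = '<h1 class="title">Custard &amp; <b>Jam</b>\n  Pipeline</h1>'
title = strip_markup(between(page, '<h1 class="title">', "</h1>"))
print(title)  # Custard & Jam Pipeline
```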

The proper way of doing this would be to build the reverse of the halfbakery formatting engine, a "scanning engine" that reconstitutes the XML-like output of the halfbakery database application from the HTML. That's possible (it's not generally possible, but here, it would have worked), but it was much easier to just use some sloppy heuristics.

The halfbakery data is not saved in the form of webpages; it is saved as a big database of objects that are referenced by names (via a separate hashtable) and numbers. The hashtable can be rebuilt from the object table. An idea object contains its name, summary, text, and author; a created and modified date; a list of who voted for or against it; and a list of references to notes and links, which appear as part of the idea when the idea is displayed.
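To make that layout concrete, here is a rough sketch in Python of an object table and of rebuilding the name table from it. The real store is a mostly binary format; the class `Idea`, its field names and types, and the function `rebuild_name_index` are illustrative guesses, not the actual on-disk structure.

```python
from dataclasses import dataclass, field


@dataclass
class Idea:
    name: str
    summary: str
    text: str
    author: str
    created: int                 # seconds since the epoch
    modified: int
    votes: dict = field(default_factory=dict)     # voter name -> +1 or -1
    note_ids: list = field(default_factory=list)  # object numbers of notes
    link_ids: list = field(default_factory=list)  # object numbers of links


def rebuild_name_index(objects):
    """Recreate the name -> object number table from the object table alone."""
    return {obj.name: number for number, obj in objects.items()}


# Object table: object number -> object (hypothetical contents).
objects = {
    7: Idea("bar", "A bar idea", "Full text.", "kim", 1098000000, 1098000000),
    9: Idea("custard pipeline", "Piped custard", "Full text.", "kim",
            1098100000, 1098200000),
}
by_name = rebuild_name_index(objects)
```

Because the name table is derived data, it never needs to be recovered on its own; restoring the objects is enough.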

Most of this is visible. The dates internally are at second granularity, but are only displayed as days externally. That means that now, after the recovery, there are a lot of links and notes that purport to have been created at 12 noon. (But users can't see that.)
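The noon timestamps come from filling in the missing time of day with a fixed placeholder. A minimal sketch in Python, assuming a displayed date format like "Oct 14 2004" (the actual display format is not given in this paper, and `restore_timestamp` is a made-up name):

```python
from datetime import datetime


def restore_timestamp(displayed, fmt="%b %d %Y"):
    """Turn a day-granularity date string into a datetime at 12 noon that day."""
    day = datetime.strptime(displayed, fmt)
    return day.replace(hour=12)


print(restore_timestamp("Oct 14 2004"))  # 2004-10-14 12:00:00
```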

The votes are lost; without internal information about who voted (which isn't visible in the printed page), I can't tell whether someone is voting twice for the same page.

The recovery had to be incremental. For example, there are pages that had been created before 2002, but annotated after 2003, and I had to be able to combine the current text and new annotations after 2003 with the old idea from the 2002 backup without duplicating all the annotations. So, part of adding new things is an analysis of the old stuff that's there, and heuristic guesses about whether something truly is new or just a variation of something old.
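One possible form of such a guess is to compare annotations by a normalized fingerprint instead of by exact text. The sketch below, in Python, is an assumption about how this could be done, not the heuristic actually used; `fingerprint` and `merge_notes` are hypothetical names, and notes are simplified to (author, text) pairs.

```python
import re


def fingerprint(author, text):
    """Reduce a note to a comparison key that ignores case, markup, and spacing."""
    plain = re.sub(r"<[^>]*>", " ", text)           # drop tags left by saving tools
    words = re.findall(r"[a-z0-9]+", plain.lower())
    return (author.lower(), " ".join(words))


def merge_notes(existing, incoming):
    """Append only those incoming (author, text) notes not already present."""
    seen = {fingerprint(a, t) for a, t in existing}
    merged = list(existing)
    for author, text in incoming:
        key = fingerprint(author, text)
        if key not in seen:
            seen.add(key)
            merged.append((author, text))
    return merged


old = [("kim", "Nice idea, but what about the custard?")]
new = [
    ("Kim", "Nice idea,  but what about the <B>custard</B>?"),  # same note, mangled
    ("bob", "Baked in 1987."),                                   # genuinely new
]
print(merge_notes(old, new))
```

A fingerprint this coarse will also merge two notes that differ only in punctuation, which is the kind of trade-off a sloppy heuristic accepts.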

Ontogeny recapitulates phylogeny.

Any evolution the site has undergone over its four years has left traces somewhere. Users have changed their names, categories have changed or expanded or split up, ideas have been renamed (and now exist twice). Accounts have been deleted by accident or in anger, including my own.

In the recovery, I get the chance to pick and choose from those evolutions, and do those over again that I agree with, and ignore those I disagree with.

Thanks, Dad

Within three weeks of the crash, two archives showed up that had been independently created by two fathers - one with about 900 ideas, one with 6000. In both cases, the proud fathers were attempting to archive their son's postings, but ended up downloading large parts of the site instead, as a fortunate accident.

Changes

All of the saved pages users sent me had been modified by their tools in some way. In different ways. I guess I understand it when mirroring tools change the pathnames in the links, but many went beyond that.

Some change the capitalization of tags. The tags are read in, become a tree of "objects" and then the objects are "sequentialized" again.
Some tools add new element parts. The tags are read in, become a tree, the tree is fixed up to comply with some internal idea of how things should look, and then everything is written out.
Some Microsoft tool changed all the fonts into those fonts that were actually displayed in the user's browser at the time. There, the tags are read in, become a tree of objects, become a tree of editing objects that exist in Microsoft's internal document universe (annotating everything with font names, point sizes, colors and so on), and are only then written out as HTML.

Of course, these weren't tools intended to create backups; they were intended to copy documents. But they didn't copy documents, they made them their own in some way and then kept a copy of that localized version.

Another way in which tool transformations created problems for me had to do with links from halfbakery pages.

An internal halfbakery "link" wraps a URL, a name, a description, and a modified date. Links with identical URLs are considered the same. So, when restoring data for an existing page, I discarded a link as a redundant copy if it had the same URL as an existing link on the page.

The tools created problems when they changed the URLs in links, making links that were really the same look different.

In any one idea, you might see the following as links:

bar.html

originally the idea "bar", linked to as http://www.halfbakery.com/idea/bar. Its contents have been saved into "bar.html" in the same directory. (The ".html" is a file ending the software added to help file processing tools that guess a document's file type based on a suffix; the halfbakery doesn't use this convention.) After saving the target, the helpful software has edited the link in the webpage to point to the saved copy of the idea, not to the original idea.
Of course, "bar.html" on the real halfbakery website doesn't exist. When naively restoring something like that, I create dangling, redundant links.

../../www.somewhere.com/xyz.html

originally a link to http://www.somewhere.com/xyz.html. Recursive copying software has saved not only this idea page but also the other pages it pointed to, and then, as before, translated external references to internal ones.

../../www.somewhere.com/_xyz/x.html

originally a link to http://www.somewhere.com/~xyz/x.html. The tilde has been replaced with an underscore, perhaps to avoid complications when addressing it without a prefix.

http://www.somewhere.com/

originally a link to http://www.somewhere.com whose poster had left off the closing '/', something the tool helpfully restored -- creating yet another dangling link.

Because the links are a relatively small amount of data in the overall mix of ideas and annotations, I always only discovered these issues when it was too late and ideas had already been annotated with redundant copies of the links. Time to write yet another tool that traverses all of the halfbakery and changes things around - and hope that it doesn't destroy more than it fixes.
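A clean-up tool of that kind would need to map each rewritten link back to a canonical URL before comparing. The following Python sketch covers only the four cases listed above and is not the tool that was actually written; `canonical_url`, `dedupe_links`, and the constant `SITE` are hypothetical names. The underscore-to-tilde step is a pure guess and would be wrong for paths that legitimately start with an underscore.

```python
import re

SITE = "http://www.halfbakery.com"


def canonical_url(href):
    """Guess the original URL behind a link rewritten by a mirroring tool."""
    if re.match(r"^[a-z][a-z0-9+.-]*://", href, re.IGNORECASE):
        url = href                                   # already absolute
    elif href.startswith("../"):
        # "../../www.somewhere.com/xyz.html": a mirrored external page.
        path = re.sub(r"^(\.\./)+", "", href)
        # Heuristic: a path segment starting with "_" was probably "~".
        path = re.sub(r"/_", "/~", path)
        url = "http://" + path
    else:
        # "bar.html": an idea saved alongside the page; drop the added suffix.
        name = href[:-len(".html")] if href.endswith(".html") else href
        url = f"{SITE}/idea/{name}"
    # Treat a bare host with and without the trailing "/" as the same link.
    if re.match(r"^[a-z]+://[^/]+/$", url, re.IGNORECASE):
        url = url[:-1]
    return url


def dedupe_links(hrefs):
    """Keep the first link for each canonical URL, in order."""
    seen = set()
    unique = []
    for href in hrefs:
        url = canonical_url(href)
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

Running the canonicalization before links are inserted, rather than as a repair pass afterwards, would avoid creating the redundant copies in the first place.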

Reclamations

[This account was destroyed in a disk crash in October 2004 and has been partially restored from a cached copy. If it is yours, please send e-mail to <bakesperson@gmail.com> to reclaim it.]

That's how most of the user account texts in the halfbakery still start. Meanwhile, I'm getting three, four e-mails a day from users who came back and wondered what exactly it meant for them to "reclaim their account".

The halfbakery stored no security questions and no hidden e-mail addresses. (If it had, they'd be gone.) In practice, I really have no idea who anyone is, other than the few users that I've had side conversations with and whose e-mail addresses I recognize.

Fortunately, there's really nothing of value for people to steal. What a typical malicious attacker of the halfbakery wants is lots and lots of accounts (to vote in favor of their own ideas), and I'm trusting myself to eventually detect a pattern in the requests to take over other people's accounts.

So, all that's required is a brief email saying, "I'm so-and-so!", so that I can get back to them with a new password. That's not something that people are used to having to do.

For a while, I used the user's first name as their password. This now works, but at first, it didn't -- halfbakery passwords are case-sensitive, people can't capitalize, so someone told to enter "Kim" would invariably enter "kim" and not even think about it. It works the other way around - people told to enter "kim" don't enter "Kim".

New account

For the first couple of days, users could still create new accounts without talking to a real human. Looking at the audit trail showed that people would try to log in as their old account, wouldn't get in (because their passwords were lost), and would then just create a second account.

The halfbakery is one of the few places where you can take everything back, and edit everything - but for that to work, I need people to keep their account. If their password doesn't work, they need to complain, and have their password reset. But that takes time and social interaction, and people don't like to wait - so they go and make a new account, and let the old one rot.

So I turned off automatic account creation altogether. All new accounts would be created manually by me or another moderator.

Which meant that I then got e-mails asking for new accounts and, if I was lucky, apologetically stating that the user couldn't access their old account - many people didn't think to just demand access to their old account.

I think they should. If you wrote something, the reality that you wrote it should be more important, and more important to your identity, than a mechanism about whether or not you remember some password somewhere. It should be possible to sound sane and insistent and get control back over your stuff, unless there's two people fighting over which of them is real.

still some things we want to try

The halfbakery still doesn't scale. The categories are still filling up. It's still hard to not bog down regular users with mechanism yet retain a reasonable level of discussion. The jokes haven't gotten any better, either.

Yet somehow, having to rebuild the site loosened me up enough to dare change it more. I've added new features to the moderator interface. Recently, I took a few days off work and made the first steps towards adding illustrations. Compared to three weeks without it, the interruptions from these changes hardly even register. If users are putting up with this much disruption, and if I'm working this hard to just keep the site, it would be idiotic not to throw in the extra work to make it do the things I want it to do.

And, yes, I just set up a daily backup.

jutta@pobox.com