Halfbakery: layout optimisation for filesystems

Despite the dropping price of flash most computers still use a hard disk for storing files. They're are kind of slow though. Reading a randomly chosen block can take 5-10 milliseconds.

The problem is that hard disks aren't random access. There's a spinning disk and a moving arm. To read or write a block of data at some position on the platter, the arm has to move to the right offset and the disk has to spin round until the sliver of data passes under the read head to be read. When reading or writing blocks one after another my current disk can process 50,000 per second. When reading blocks randomly, It can barely manage 100 per second.

Modern filesystems are built mostly with trees. BTRFS (B-TRee FileSystem) is a filesystem named after the kind of tree it uses internally. A directory, for example, would have a start block. For small directories, all the data about the files might fit into a single block. For a large directory, that data will have to be split up into parts and stored in other blocks.

When your computer looks for a file in that directory, it will read the start block, figure out which of the linked blocks has the file information and then fetch that block. It might have to go another step and get another block if the directory is really big.

Trees are used for lots of other things in file systems and navigating them quickly is important to cut down on waiting time.

Since trees are usually accessed sequentially, following links in one block to find the next, we can lay the data out on the platter so that after reading a block, the blocks we might want next are nearly at the read head. The block gets read, the computer quickly sends a request for one of the next blocks in the tree and the hard disk moves the head a little bit and gets you the next block really fast. Repeat for however many levels there are in the tree.

The only barrier to doing this today is figuring out how hard disks lay out their blocks. This can be done by requesting blocks and seeing how long between the hard disk returning one and returning the other. This would be done for lots of blocks until the way the drive has them layed out can be mapped. It's going to be some variation of a spiral.

The end result would be a faster filesystem. And more performance for the software engineers to waste.