Halfbakery: Universal Archival Data Format

Oof. This is quite long. Sorry.

This is an idea I had some time ago. I've posted and discussed it elsewhere (before I came to the halfbakery), where, to be honest, it didn't fare too well. In particular, someone got hung up on it being English-based, which it needn't be. I'd post a link, but unfortunately my initial description seems to be missing, so the whole discussion looks a little confusing.
Some people there had good suggestions I've already incorporated.

Also, please read this carefully before suggesting that XML does what is required, because I don't think it does.

The Problem:
Archiving data is fraught with difficulties. The file format may be lost, rendering the data unusable. This has definitely happened with more than one project.

The Solution (well, perhaps?)
(brief) Have a file format which unambiguously defines the data.

Please note before I begin that this does not address the problems of bit rot and hardware obsolescence. These have to be addressed separately (and they can be).

(verbose)
The Universal Archival Data Format (UADF) is designed with the intention that intelligent beings (i.e. humans) would be able to recover the data, almost from first principles. As such it has some similarities to those messages beamed into outer space in an attempt to contact aliens.

UADF is essentially a meta-format: it sets the standard to which the description of any stored data must conform. It is also possible for readers to exist. These would have a 'plug-in' structure and degrade gracefully, reporting the sections of the file they could not comprehend. A big advantage of UADF is that portability is guaranteed, because every file carries a description of its own formats. It is not possible for a company to create a 'closed' format complying with UADF. If your program writes UADF files, a programmer can write another program to read them.

As I currently see it, UADF files have 5 parts:
1) a computer-parsable and human-readable header, which defines the version number etc.
2) a 'bootstrap', designed to show intelligent readers how the file data is stored and in particular how part 3 works.
3) human-readable descriptions of the 'file formats' of the 'sections' in part 5, with associated computer-parsable meta-data on each section type.
4) computer-parsable pointers and other meta-data for all the sections.
5) the 'sections'. This is the data you really want to store, and it can be anything... text, graphics, sound files, raw instrument data etc.

Part 1 is fairly short, and exists so that UADF readers can decide whether they can read the file. It would probably be something like the ASCII string "Universal Archival Data Format, Version 1.1 (UADF foundation mods)" followed by a zero terminator.
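To make part 1 a bit more concrete, here is a rough Python sketch of writing and reading such a header. The exact grammar (a major.minor version followed by an optional note in brackets) and the names write_header and read_header are my assumptions for illustration only, not a fixed part of UADF.

```python
import re

# Assumed grammar: fixed title, "Version major.minor", optional " (note)".
HEADER_RE = re.compile(
    rb"Universal Archival Data Format, Version (\d+)\.(\d+)(?: \((.*)\))?"
)


def write_header(major, minor, note=""):
    """Build the ASCII header string followed by a zero terminator."""
    text = f"Universal Archival Data Format, Version {major}.{minor}"
    if note:
        text += f" ({note})"
    return text.encode("ascii") + b"\x00"


def read_header(data):
    """Return (major, minor, note, offset of the first byte after the header)."""
    end = data.find(b"\x00")
    if end < 0:
        raise ValueError("no zero terminator found")
    match = HEADER_RE.fullmatch(data[:end])
    if match is None:
        raise ValueError("not a UADF header")
    major, minor, note = match.groups()
    return int(major), int(minor), (note or b"").decode("ascii"), end + 1
```

A reader would call read_header first and give up politely if the version is one it does not know.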

Part 2 is a bit tricky. Without getting bogged down in detail, for English it would probably have 8*8 bitmaps (one byte per pixel, so 64 bytes each) for some of the ASCII characters, using zero as off and the character code for on, then use those characters to define the others. It would then describe the use of some of the remaining character-codes in part 3 to give computer-parsable information.
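Here is a small Python sketch of one such bootstrap bitmap, under my reading that each glyph is 64 bytes, one byte per pixel, stored row by row. The glyph shape and the function names are made up for illustration.

```python
# A hand-drawn 8x8 glyph for "A"; "X" means on, "." means off.
GLYPH_A = [
    "..XXXX..",
    ".XX..XX.",
    ".XX..XX.",
    ".XXXXXX.",
    ".XX..XX.",
    ".XX..XX.",
    ".XX..XX.",
    "........",
]


def encode_glyph(char, rows):
    """Encode a glyph as 64 bytes: 0 for off, the character's own code for on."""
    if len(rows) != 8 or any(len(row) != 8 for row in rows):
        raise ValueError("glyph must be 8 rows of 8 cells")
    code = ord(char)
    return bytes(code if cell == "X" else 0 for row in rows for cell in row)


def decode_glyph(block):
    """Recover (character, rows) from a 64-byte bitmap block."""
    codes = set(block) - {0}
    if len(block) != 64 or len(codes) != 1:
        raise ValueError("expected 64 bytes using exactly one non-zero value")
    rows = [
        "".join("X" if value else "." for value in block[i:i + 8])
        for i in range(0, 64, 8)
    ]
    return chr(codes.pop()), rows
```

The point of using the character code as the 'on' value is that anyone dumping the bytes in rows of eight sees the letter drawn out of its own code, which ties the shape to the number without needing any outside reference.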

Part 3 gives a precise, natural language description of each section type. Each of these needs to have a computer-parsable name, originator and version number, so that a UADF reader can check for a suitable plug-in. They are also given number-specifiers which are unique to this document.
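As a sketch of how a reader might use this meta-data to find plug-ins and degrade gracefully, consider the following Python. The SectionType fields follow the description above, but the class itself, the plug-in dictionary keyed by (name, originator, version) and the read_sections function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionType:
    number: int       # specifier, unique within this document only
    name: str
    originator: str
    version: str
    description: str  # the natural-language description of the format


def read_sections(section_types, sections, plugins):
    """Decode what we can; report what we cannot.

    sections is a list of (type_number, payload) pairs.
    plugins maps (name, originator, version) to a decoding function.
    """
    types_by_number = {t.number: t for t in section_types}
    decoded, report = [], []
    for index, (type_number, payload) in enumerate(sections):
        stype = types_by_number.get(type_number)
        if stype is None:
            report.append(f"section {index}: unknown type number {type_number}")
            continue
        plugin = plugins.get((stype.name, stype.originator, stype.version))
        if plugin is None:
            # No plug-in: fall back on the human-readable description.
            report.append(
                f"section {index}: no plug-in for {stype.name} "
                f"{stype.version} ({stype.originator})"
            )
            continue
        decoded.append(plugin(payload))
    return decoded, report
```

Even for the sections in the report, the description from part 3 is still in the file, so a person can write the missing plug-in later.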

Part 4 is just a table of each individual section in the document. It gives their types, the offset to each one and its length.
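A minimal Python sketch of such a table follows. The field widths (a 32-bit type number, then 64-bit offset and length, all big-endian) and the choice to measure offsets from the start of part 5 are assumptions of mine; the idea above doesn't fix either.

```python
import struct

# Assumed entry layout: type number (uint32), offset (uint64), length (uint64).
ENTRY = struct.Struct(">IQQ")


def build_table(sections):
    """Pack (type_number, payload) pairs into a table plus the part 5 body.

    Offsets are relative to the start of the body (an assumption).
    """
    table = bytearray()
    body = bytearray()
    for type_number, payload in sections:
        table += ENTRY.pack(type_number, len(body), len(payload))
        body += payload
    return bytes(table), bytes(body)


def read_table(table, body):
    """Recover the (type_number, payload) pairs from the table and body."""
    return [
        (type_number, body[offset:offset + length])
        for type_number, offset, length in ENTRY.iter_unpack(table)
    ]
```

With this layout each extra section costs 20 bytes of table on top of its own data.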

Part 5 contains all the data of interest.

As you can see, each file contains a significant amount of meta-data. Obviously you want to keep this to a minimum, but note that a lot of it is a 'fixed cost': it doesn't cost much to add another section of a type already present.
Also, if you want to make some assumptions, for instance that your descendants will speak English and use 8-bit ASCII, then some of the meta-data could be left out.