Halfbakery: Web Versioning

This is quite a technical idea with little or no comedic content. Before I explain it, I will first briefly explain the following four concepts.

A "hash" is a one-way function. For any given input, it will produce a predictable and exact output, but by looking at just the output it is very difficult to guess what the input was. One use of a hash is as a checksum: when downloading a big file, it's common to specify the MD5 hash of the file too, so that the user can check it downloaded without errors.
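As a rough illustration of the checksum use, here is a minimal Python sketch. The function names are made up for this example, and it defaults to SHA-256 rather than MD5 (an assumption on my part; MD5 still appears on download pages but is no longer considered collision-resistant).

```python
import hashlib


def file_checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of `data` using a hashlib algorithm name."""
    return hashlib.new(algorithm, data).hexdigest()


def verify_download(data: bytes, published: str, algorithm: str = "sha256") -> bool:
    """Compare the digest of what was downloaded with the published one."""
    # Hex digests are case-insensitive, so normalise before comparing.
    return file_checksum(data, algorithm) == published.strip().lower()
```

Changing even one byte of the input gives a completely different digest, which is why a mismatch reliably signals a corrupted download.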

A "secure website" means two things, encryption and proof of identity. Encryption means that no-one can spy on the data sent back and forth. Identity -- the idea that the website you're connecting to really is who they say they are -- is done through certificates and a chain of trust. The core concept is that the website shows a certificate carrying a digital signature (cryptographic proof, built on a hash function) that a certificate authority vouches for it. That authority's certificate may point to another, and so on, until the chain leads to one of a number of root certificates. A copy of these root certificates is stored on your computer, or phone, and all of the trust you put into a secure website stems from there.

"Version Control" is software to keep track of software. The simplest type of version control is to rename a file with a number on the end, and whenever you change the file, increment the number. Modern version control such as "git" uses a chain of hash functions to link versions together unambiguously. It is common to share a commit hash, which points to an exact snapshot of the data at a given point in time.

An "ETag" is a tool used for caching web pages. It is a header field that gives either a version number or a hash checksum of a webpage. When you visit the site again, the browser sends the ETag it has stored back to the server, and if the page hasn't changed, the server says so and the cached version is shown without re-downloading. This is generally a good thing, except that there's no standard for how an ETag is generated, so malicious websites can abuse it by sending a unique identifier instead and using it to track you (and annoyingly, most browsers provide no control at all over ETags).
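To make the ETag round trip concrete, here is a small server-side sketch in Python with no real networking. The `ETag` and `If-None-Match` header names and the 304 status are standard HTTP; deriving the ETag from a truncated SHA-256 of the body is just one possible choice (an assumption for this example), and `make_etag` and `respond` are made-up names.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive a content-based ETag (one choice among many; servers differ)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET, honouring If-None-Match."""
    etag = make_etag(body)
    if if_none_match == etag:
        # The client's cached copy is current: 304 Not Modified, no body.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body
```

Note that the client has to send its stored ETag to the server on every revisit. A server that hands out a unique value per visitor instead of a content hash gets that value echoed back, which is the tracking problem.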

So, to the idea:

Apply version control to web pages, predictably generate caching checksums and publish them through the certificate process.

The result would be that going to a secure website would, in addition to being encrypted and proving identity, provide a fixed version number for the site. Predictable checksums mean that, unlike the ambiguous ETag, the checksum calculation could be done client side (in the browser), much like the old way of verifying the MD5 hash of a download. Not only does this give you an added assurance that nothing's been corrupted, but it means that you don't necessarily have to tell the server which version you currently have cached (which is the privacy concern with ETags).
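Here is one way the client-side check might look, as a Python sketch. Everything about the format is an assumption made for illustration: a "manifest" mapping each path to the SHA-256 of its content, and a site version defined as the hash of that manifest, which is the single value the certificate process would publish.

```python
import hashlib
import json
from typing import Dict


def resource_hash(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()


def build_manifest(pages: Dict[str, bytes]) -> Dict[str, str]:
    """Map each path to the hash of its content (hypothetical format)."""
    return {path: resource_hash(body) for path, body in pages.items()}


def site_version(manifest: Dict[str, str]) -> str:
    """Hash a canonical serialisation of the manifest into one version id."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_resource(path: str, body: bytes,
                    manifest: Dict[str, str], trusted_version: str) -> bool:
    """Check the manifest against the trusted version, then the resource."""
    if site_version(manifest) != trusted_version:
        return False  # the manifest itself has been altered
    return manifest.get(path) == resource_hash(body)
```

The browser only needs the trusted version and the manifest to decide whether a cached copy is still good, so it never has to tell the server what it holds.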

Certificates are intentionally short-lived and are usually renewed on a regular basis, sometimes at intervals as short as two months. Websites are often updated only slightly more frequently than this, so linking the version hashes into the chain of trust wouldn't be too drastic. A secure website would be expected to keep its version number for a while, then update to the newer version only when a new certificate is issued.

One problem is dynamic content (such as a web page being different when the user has logged in), but this mixture of code and data has always been messy. Many modern sites have a clear division between code and data anyway. Within this system, there would be an enforced difference between versioned code (such as JavaScript files) and data (for example JSON objects). Parts of the site that have to be dynamic can be left out of the versioned set, and a security wall could separate them from the versioned code in the same way that cookies can be flagged as secure (sent over HTTPS only).
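The enforced split between code and data could be sketched as a simple policy check in Python. Which content types count as "code" or "data" is a policy choice invented here for illustration, as are the names `CODE_TYPES`, `DATA_TYPES` and `check_response`; the manifest is assumed to map paths to SHA-256 hashes.

```python
import hashlib
from typing import Dict

# Assumption: these sets are illustrative, not part of any standard.
CODE_TYPES = {"text/html", "text/css", "text/javascript"}
DATA_TYPES = {"application/json"}


def check_response(path: str, content_type: str, body: bytes,
                   manifest: Dict[str, str]) -> str:
    """Classify a response as 'code', 'data' or 'reject'."""
    if content_type in DATA_TYPES:
        # Dynamic data is unversioned, and must never be run as code.
        return "data"
    if content_type in CODE_TYPES:
        expected = manifest.get(path)
        if expected is not None and expected == hashlib.sha256(body).hexdigest():
            return "code"
    # Unversioned code, a hash mismatch, or an unknown type.
    return "reject"
```

Under this sketch, a script that isn't in the manifest is refused outright, while JSON passes through unversioned but is only ever treated as data.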

Overall, security may not be improved all that much, since in a typical man-in-the-middle attack the browser has already warned the user about the certificate error, and the user has usually clicked "ignore this, show me anyway". If an attacker can spoof a certificate, then they can spoof a version number too. I thought about having an enforced "valid until..." marker on the version number, but in the event of a vulnerability in the code being discovered, we want to update the code as soon as possible. Perhaps regular updates should be the expected behaviour, with emergency updates an option. In that case, if your internet banking website has an unexpected update, you know that either an attack is taking place, or that a critical vulnerability has just been patched. In either case, you may want to delay your banking until the situation clears up.