Halfbakery: WTF-64 and WTF-512 Unicode charset encodings

This idea is a bit technical, so I'll start with some background and definitions for the uninitiated, before jumping into the problem statement and a description of the solution.

BACKGROUND / DEFINITIONS:

- Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. (Source: Wikipedia)

- UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.

- WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired. (Source: Simon Sapin, see link)

- WTF-8 (alternative definition) is that unintentionally popular encoding where someone takes UTF-8, accidentally decodes it as their favorite single-byte encoding such as Windows-1252, then encodes those characters as UTF-8 (Source: Hacker News, see link)
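
For the uninitiated, the definitions above can be poked at from a Python prompt. This is only a rough sketch: the sample characters are my own picks, and Python's built-in "surrogatepass" error handler is used as a stand-in for a real WTF-8 encoder. It happens to produce the same bytes for a lone surrogate, but it is not a full WTF-8 implementation.

```python
# UTF-8: one to four bytes per code point.
samples = ["A", "é", "€", "😀"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Valid code points: U+0000..U+10FFFF minus the 2,048 surrogates.
valid_code_points = 0x110000 - (0xE000 - 0xD800)

# WTF-8 (Sapin's definition): strict UTF-8 refuses a lone surrogate...
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-8")
    strict_utf8_accepts = True
except UnicodeEncodeError:
    strict_utf8_accepts = False

# ...while a "wobbly" encoder gives it a three-byte form.
# ASSUMPTION: "surrogatepass" stands in for a WTF-8 encoder here.
lone_bytes = lone_surrogate.encode("utf-8", "surrogatepass")

# WTF-8 (Hacker News definition): UTF-8 bytes misread as Windows-1252,
# then encoded as UTF-8 a second time.
mojibake = "é".encode("utf-8").decode("cp1252")
double_encoded = mojibake.encode("utf-8")

print(utf8_lengths, valid_code_points, strict_utf8_accepts)
print(lone_bytes, mojibake, double_encoded)
```

The second variant is the one that turns a two-byte "é" into four bytes of "Ã©", which is the kind of thing ftfy exists to undo.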

PROBLEM STATEMENT

UTF-8 is extremely hard to implement correctly, as evidenced by a Google search for "utf-8 issue" (32 million results), the emergence of grassroots alternative encodings such as WTF-8 (two variants), and computer-aided UTF-8 issue diagnostic tools such as ftfy (see link).

UTF-8 and its related WTF-8 encodings also suffer from a lack of human readability: someone looking at the bits in memory cannot immediately tell what characters they represent without referencing a Unicode table.

THE SOLUTION

We hereby propose two new sets of encodings for Unicode: WTF-64 and WTF-512.

WTF-512 consists of an 8x8 bitmap, 8-bit grayscale representation of the underlying character, as often represented (written) by humans.

WTF-64 is a compressed representation of WTF-512, using an 8x8 monochrome bitmap, which can be used when it is known that the text contains only Latin-derived scripts. WTF-64 conveniently uses one character per 64-bit machine word, and hence can be processed efficiently by contemporary computers.
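
To make the sizes concrete, here is a minimal sketch of both encodings for a single character. Everything in it is an assumption on my part rather than part of the proposal: the hand-drawn glyph, the row-major, most-significant-bit-first packing order, and the function names (pack_wtf64, unpack_wtf64, to_wtf512) are all hypothetical.

```python
# A hypothetical 8x8 rendering of "A"; your handwriting may vary.
GLYPH_A = [
    "..XXXX..",
    ".XX..XX.",
    ".XX..XX.",
    ".XXXXXX.",
    ".XX..XX.",
    ".XX..XX.",
    ".XX..XX.",
    "........",
]

def pack_wtf64(rows):
    """Pack an 8x8 monochrome bitmap into one 64-bit integer.

    ASSUMPTION: row-major order, top-left pixel in the highest bit.
    """
    if len(rows) != 8 or any(len(row) != 8 for row in rows):
        raise ValueError("WTF-64 needs exactly 8 rows of 8 pixels")
    word = 0
    for row in rows:
        for pixel in row:
            word = (word << 1) | int(pixel == "X")
    return word

def unpack_wtf64(word):
    """Turn a 64-bit word back into 8 printable rows for display."""
    bits = format(word, "064b")
    return [
        bits[i:i + 8].replace("1", "X").replace("0", ".")
        for i in range(0, 64, 8)
    ]

def to_wtf512(rows):
    """Expand the same glyph to 8-bit grayscale: 64 pixels, one byte each."""
    return bytes(255 if pixel == "X" else 0 for row in rows for pixel in row)

word = pack_wtf64(GLYPH_A)
print(hex(word))
print("\n".join(unpack_wtf64(word)))
print(len(to_wtf512(GLYPH_A)) * 8, "bits per character in WTF-512")
```

Note that "decoding" here just means printing the bitmap, which is the whole point: the reader's eyes do the rest.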

The key feature of this scheme is that there is no universal decoding table for either WTF-64 or WTF-512: all decoding must be done either by displaying the bitmap directly on the screen, or by using pre-trained machine learning algorithms to decode the character set into the machine's internal representation (see also: MNIST dataset).

WTF-64 is intended as a transitional technology for compute-limited devices or until machines can be upgraded. New systems should be architected for WTF-512 natively.