h a l f b a k e r y
WTF-64 and WTF-512 Unicode charset encodings

Human-readable replacement for UTF-8 and all WTF-8 variants

This idea is a bit technical, so I'll start with some background and definitions for the uninitiated, before jumping into the problem statement and a description of the solution.


- Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. (Source: Wikipedia)

- UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.

- WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired. (Source: Simon Sapin, see link)

- WTF-8 (alternative definition) is that unintentionally popular encoding where someone takes UTF-8, accidentally decodes it as their favorite single-byte encoding such as Windows-1252, then encodes those characters as UTF-8 (Source: Hacker News, see link)
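The definitions above can be seen in action with a short Python sketch. The examples are my own and purely illustrative: the first loop shows UTF-8's one-to-four-byte variable width, and the last lines reproduce the "alternative" WTF-8 by decoding UTF-8 bytes as Windows-1252 and re-encoding them.

```python
# UTF-8 is variable-width: one to four bytes per code point.
for ch in ("a", "é", "€", "𝄞"):
    print(ch, len(ch.encode("utf-8")))
# a→1, é→2, €→3, 𝄞→4

# The "alternative" WTF-8: UTF-8 bytes mistakenly decoded as
# Windows-1252, then re-encoded as UTF-8 (classic mojibake).
good = "café"
wtf8 = good.encode("utf-8").decode("windows-1252").encode("utf-8")
print(wtf8)  # b'caf\xc3\x83\xc2\xa9' — the 'é' has become 'Ã©'
```

This double-encoding is exactly the damage that tools like ftfy (linked below) are built to reverse.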


UTF-8 is extremely hard to implement correctly, as evidenced by a Google search for "utf-8 issue" (32 million results), the emergence of grassroots alternative encodings such as WTF-8 (two variants), and computer-aided UTF-8 issue diagnostic tools such as ftfy (see link).

UTF-8 and its related WTF-8 encodings also suffer from a lack of human readability: someone looking at the bits in memory would not be able to immediately tell what characters they represent without referencing a Unicode table.


We hereby propose two new sets of encodings for Unicode: WTF-64 and WTF-512.

WTF-512 encodes each character as an 8x8 bitmap with 8-bit grayscale pixels (512 bits in total), depicting the underlying character as often represented (written) by humans.

WTF-64 is a compressed representation of WTF-512, using a monochrome 8x8 bitmap, which can be used when it is known that the text contains only Latin-derived scripts. WTF-64 conveniently uses one character per 64-bit machine word, and hence can be processed efficiently by contemporary computers.
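A minimal sketch of the WTF-64 packing as proposed: eight 8-pixel monochrome rows fit in a single 64-bit word, one character per machine word. The glyph below is a hypothetical hand-drawn capital "T"; a real encoder would pack whatever the writer actually draws.

```python
GLYPH_T = [
    "########",
    "...##...",
    "...##...",
    "...##...",
    "...##...",
    "...##...",
    "...##...",
    "........",
]

def wtf64_encode(rows):
    """Pack eight 8-pixel rows into one 64-bit integer (row-major, MSB first)."""
    word = 0
    for row in rows:
        for px in row:
            word = (word << 1) | (1 if px == "#" else 0)
    return word

def wtf64_decode(word):
    """Unpack a 64-bit word back into an 8x8 bitmap for display."""
    bits = f"{word:064b}"
    return ["".join("#" if b == "1" else "." for b in bits[i:i + 8])
            for i in range(0, 64, 8)]

w = wtf64_encode(GLYPH_T)
assert w < 2**64                     # one character per machine word
assert wtf64_decode(w) == GLYPH_T    # lossless round trip
```

Note that "decoding" here only recovers the bitmap, not a code point; that is the point of the scheme.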

The key feature of this scheme is that there is no universal decoding table for either WTF-64 or WTF-512: all decoding must be done either by displaying the bitmap directly on the screen, or by using pre-trained machine learning algorithms to decode the character set into the machine's internal representation (see also: the MNIST dataset).
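As a toy stand-in for the machine-learning decoding step, here is a hedged sketch of a nearest-neighbour classifier over reference glyphs, using Hamming distance between bitmaps. The reference glyphs and their hex values are hypothetical examples, not part of any proposed standard.

```python
REFERENCE = {
    "I": 0x1818181818181818,  # a vertical bar, two pixels wide
    "-": 0x000000FF00000000,  # a horizontal bar across the middle
}

def hamming(a, b):
    """Number of pixels in which two WTF-64 words differ."""
    return bin(a ^ b).count("1")

def wtf64_classify(word):
    """Return the reference character whose bitmap differs in fewest pixels."""
    return min(REFERENCE, key=lambda ch: hamming(word, REFERENCE[ch]))

# A vertical bar with one smudged pixel still decodes as "I".
print(wtf64_classify(0x1818181838181818))  # prints I
```

A production decoder would of course swap the two-glyph lookup for an MNIST-style model, at which point the encoding's error behaviour becomes a machine-learning problem rather than a parsing one.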

WTF-64 is intended as a transitional technology for compute-limited devices, or for use until machines can be upgraded. New systems should be architected for WTF-512 natively.

ignobel, Mar 24 2019

Simon Sapin's WTF-8 https://simonsapin.github.io/wtf-8
Code library for dealing with UTF-8 issues [ignobel, Mar 24 2019]

The WTF-8 encoding https://news.ycombi...com/item?id=9611710
Lengthy and deep WTF-8 discussion [ignobel, Mar 24 2019]

ftfy https://github.com/...Insight/python-ftfy
Python library for decoding WTF-8 [ignobel, Mar 24 2019, last modified Apr 05 2019]


       And here I thought I knew what wtf meant
theircompetitor, Mar 24 2019

       I think 8 bits of grey-scale are too many. I know that I wouldn't be able to tell the difference between grey-175 and grey-176 (out of 256 levels). Maybe 4 bits for grey-scale, leaving 4 bits for other stuff (I'm not geek enough to know what stuff could be relevant...).
neutrinos_shadow, Mar 24 2019

