First, some prior art.
It's been a long day in the scriptorium for Prior Cadfael. He has to stand at the front and read from the master copy of a codex, while the brother scribes write down new manuscript copies of ... whatever it might be. Today, let's say it's Cicero's Pro Murena. There is no coffee, because it has not yet reached Europe. So no-one can really blame him when, instead of "nemo enim fere saltat sobrius, nisi forte insanit", he reads "[...] salsat [...]". Which makes just as good sense, and may well be true.
This is a well-known problem, with a well-known solution. The well-known solution, for people wanting to know many generations later what Cicero really said, is called an apparatus criticus, which will be familiar to scholarly readers of, for example, Oxford Classical Text editions.
Brother Cerdic has started doodling dicks in the marginal illuminations again, but we will leave him to it.
For now we are reading an Amazon e-book that has been scanned in along with thousands of others, and never properly proof-read. The original texts are not at all ancient; on the contrary, there are often people still alive who knew the authors. However, it would have cost too much to find them and ask their opinion, so the e-books were just churned out as-is.
And scanning means OCR errors.
Now, the Kindle App offers a "report content error" option, where we can submit our observations of what look like OCR errors, so that they can be completely ignored. But in any case, this is not just an Amazon problem. Students often lack money, and Humanities students often also lack any hope of earning any, so they often do their own scanning to PDF and then share the results on the internet. And the kinds of text they scan tend to diverge from the texts probably used as training data for OCR software, especially with respect to the prevalence of foreign words. It gets worse when the original text is, say, a hundred years old and printed on cheap, high-acid paper in fonts that have since fallen out of fashion. And then stained with fly-poop (or something), which gets mistaken for full stops.
So, let us imagine a text in which multiple versions of any given passage may be given in-line, but tagged with XML tags specifying the origin of each variant, and where the document as a whole is accompanied by a stemma codicum (the family tree of the copies) in XML form, to which attributes of those in-line tags refer.
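As a rough sketch of what that might look like, here is a scrap of such a text held in a Python string and read with the standard-library xml.etree.ElementTree. The element and attribute names (text, stemma, witness, variant, reading, id, parent, wit) are all made up for the illustration; nothing here is an agreed schema. The check at the end is a hand-rolled stand-in for one thing a real schema could enforce: every in-line tag must point at a witness that the stemma actually declares.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every element and attribute name below is invented
# for this sketch and is not taken from any existing schema.
SAMPLE = """\
<text>
  <stemma>
    <witness id="A" label="master codex"/>
    <witness id="B" parent="A" label="copy taken down from dictation"/>
  </stemma>
  <p>[...] <variant><reading wit="A">saltat</reading><reading wit="B">salsat</reading></variant> [...]</p>
</text>
"""


def load_witnesses(root):
    """Map each witness id to its (parent id or None, label) pair."""
    return {w.get("id"): (w.get("parent"), w.get("label"))
            for w in root.iter("witness")}


def unknown_witnesses(root):
    """List wit references on readings that the stemma does not declare."""
    known = set(load_witnesses(root))
    used = {r.get("wit", "") for r in root.iter("reading")}
    return sorted(used - known)


def readings_by_witness(root):
    """For each variant, in document order, map witness id to its reading."""
    return [{r.get("wit"): r.text for r in v.findall("reading")}
            for v in root.iter("variant")]


def ancestry(witnesses, wit):
    """Walk the stemma from one witness back to its oldest declared ancestor."""
    chain = [wit]
    while witnesses[chain[-1]][0] is not None:
        chain.append(witnesses[chain[-1]][0])
    return chain


root = ET.fromstring(SAMPLE)
print(unknown_witnesses(root))
print(readings_by_witness(root))
print(ancestry(load_witnesses(root), "B"))
```

Keeping the stemma in the same file as the text is only a convenience for the sketch; it could just as well live in a separate file, as long as the wit attributes can still be resolved against it.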
If all these tags, both the in-line ones and the "stemma" ones, conformed to an agreed XML Schema Definition (XSD), then the marked-up text could be rendered using any one of a range of different XSLT stylesheets, of which the first two would probably include one "scholarly" version, which showed the apparatus in the footnotes, and one "Ee-Zee Read" version, which moved it to the back.
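The two renderings can be imitated in a few lines. This is plain Python standing in for the XSLT, not XSLT itself, and the variant, reading and wit names are invented for the illustration. The second variant (insanit/insanlt) is a made-up OCR-style slip, added only so that there are two notes to move around. The "scholarly" style prints each note directly under its passage, as a crude substitute for a footnote; the "easy" style saves them all for the back.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup and sample data, invented for this sketch.
SAMPLE = """\
<text>
  <p>[...] <variant><reading wit="A">saltat</reading><reading wit="B">salsat</reading></variant> [...]</p>
  <p>[...] <variant><reading wit="A">insanit</reading><reading wit="B">insanlt</reading></variant> [...]</p>
</text>
"""


def render_paragraph(p, preferred, start):
    """Flatten one <p>: print one reading per variant, and note all of them."""
    parts, notes, n = [p.text or ""], [], start
    for v in p.findall("variant"):
        readings = {r.get("wit"): (r.text or "") for r in v.findall("reading")}
        if readings:
            # Fall back to the first reading if the preferred witness is absent.
            wit = preferred if preferred in readings else next(iter(readings))
            entries = [f"{readings[wit]} ({wit})"]
            entries += [f"{t} ({w})" for w, t in readings.items() if w != wit]
            parts.append(f"{readings[wit]}[{n}]")
            notes.append(f"[{n}] " + "; ".join(entries))
            n += 1
        parts.append(v.tail or "")  # text between this variant and the next
    return "".join(parts).strip(), notes


def render(root, preferred="A", style="scholarly"):
    """Return the text with apparatus under each passage or all at the back."""
    blocks, endnotes, n = [], [], 1
    for p in root.iter("p"):
        text, notes = render_paragraph(p, preferred, n)
        n += len(notes)
        blocks.append(text)
        if style == "scholarly":
            blocks.extend(notes)    # apparatus directly under its passage
        else:
            endnotes.extend(notes)  # apparatus held over for the back
    return "\n".join(blocks + endnotes)


root = ET.fromstring(SAMPLE)
print(render(root, style="scholarly"))
print()
print(render(root, style="easy"))
```

Both styles are produced from the same marked-up source, which is the point: the choice of reading and the placement of the apparatus belong to the rendering, not to the text.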
Each text with its tags could be held in a public project on, say, GitLab, and the XSD itself in a separate project.
And then the idea might have a witty and apposite punchline.