h a l f b a k e r y



OCR OCD Apparatus Criticus XSD

Did they really write that?

First, some prior art.

It's been a long day in the scriptorium for Prior Cadfael. He has to stand at the front, and read from the master copy of a codex, while the brother scribes write down new manuscript copies of ... whatever it might be. Today, let's say it's Cicero's Pro Murena. There is no coffee, because it has not yet reached Europe. So no-one can really blame him when, instead of "nemo paene sobrius saltat, nisi insanit", he reads "[...] salsat [...]". Which makes just as good sense, and may well be true.

This is a well-known problem, with a well-known solution. The well-known solution, for people wanting to know many generations later what Cicero really said, is called an apparatus criticus, which will be familiar to scholarly readers of, for example, Oxford Classical Text editions.

Brother Cerdic has started doodling dicks in the marginal illuminations again, but we will leave him to it.

For now we are reading an Amazon e-book that has been scanned in along with thousands of others, and never properly proof-read. The original texts are not at all ancient; on the contrary, there are often people still alive who knew the authors. However, it would have cost too much to find them and ask their opinion, so the e-books were just churned out as-is.

And scanning means OCR errors.

Now, the Kindle App offers a "report content error" option, where we can submit our observations of what look like OCR errors, so that they can be completely ignored. But in any case, this is not just an Amazon problem. Students often lack money, and Humanities students often also lack any hope of earning any, so they often do their own scanning to PDF, and then share the results on the internet. And the kinds of text they scan tend to diverge from the texts probably supplied as training data to OCR algorithms, especially with respect to the prevalence of foreign words. It gets worse when the original text is, say, a hundred years old and printed on cheap, high-acid paper in fonts that have since fallen out of fashion. And then stained with fly-poop (or something), which gets mistaken for full stops.

So, let us imagine a text in which multiple versions of any given passage may be given in-line, but tagged with XML tags specifying the origin of each variant, and where the document as a whole is accompanied by a stemma codicum in XML form, to which attributes of those in-line tags refer.
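
To make that concrete, here is a minimal sketch of what such a marked-up passage might look like, and how the in-line tags could be resolved against the stemma. Every element and attribute name here (stemma, witness, app, rdg, wit, parent) is invented for illustration, since no schema has been agreed; the witnesses "A" and "B" are equally hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a stemma listing the witnesses, and a body whose
# variant readings point back at those witnesses via a "wit" attribute.
DOC = """
<text>
  <stemma>
    <witness id="A" label="master codex"/>
    <witness id="B" label="copy taken at dictation" parent="A"/>
  </stemma>
  <body>nemo paene sobrius <app><rdg wit="A">saltat</rdg><rdg wit="B">salsat</rdg></app>, nisi insanit</body>
</text>
""".strip()

root = ET.fromstring(DOC)

# Index the stemma by witness id so in-line tags can be looked up.
witnesses = {w.get("id"): w for w in root.iter("witness")}


def variants(doc_root):
    """For each <app>, return a list of (witness label, reading) pairs."""
    return [
        [(witnesses[r.get("wit")].get("label"), r.text) for r in app.iter("rdg")]
        for app in doc_root.iter("app")
    ]


def lineage(wit_id):
    """Follow "parent" links from a witness back to the root of the stemma."""
    chain = [wit_id]
    while witnesses[chain[-1]].get("parent"):
        chain.append(witnesses[chain[-1]].get("parent"))
    return chain


for app in variants(root):
    for label, reading in app:
        print(f"{reading!r} according to the {label}")
print("B descends from:", " <- ".join(lineage("B")))
```

A real stemma would need to allow for contamination (a copy with more than one parent), which a single "parent" attribute cannot express; that is exactly the sort of thing the agreed schema would have to settle.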

If all these tags, both the in-line ones and the "stemma" ones, conformed to an agreed XML Schema Definition (XSD), then the marked up text could be rendered using any one of a range of different XSLT transformations, of which the first two would probably include one "scholarly" version, which showed the apparatus in the footnotes, and one "Ee-Zee Read" version, which moved it to the back.
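
As a rough illustration of the two renderings, the sketch below uses plain Python functions as stand-ins for the two XSLT stylesheets (the Python standard library has no XSLT processor). The in-line markup (app, rdg, wit) and the choice of witness "A" as the preferred reading are assumptions made for the example, not part of any agreed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-line markup: each <app> holds one <rdg> per witness.
PASSAGES = [
    '<p>nemo paene sobrius <app><rdg wit="A">saltat</rdg>'
    '<rdg wit="B">salsat</rdg></app>, nisi insanit</p>',
]


def split_passage(xml, preferred="A"):
    """Return (plain text using the preferred witness, list of apparatus notes)."""
    p = ET.fromstring(xml)
    parts, notes = [p.text or ""], []
    for app in p.findall("app"):
        readings = {r.get("wit"): r.text for r in app.findall("rdg")}
        chosen = readings[preferred]
        rejected = "; ".join(f"{t} {w}" for w, t in readings.items() if w != preferred)
        notes.append(f"{chosen}] {rejected}")
        parts += [chosen, app.tail or ""]
    return "".join(parts), notes


def scholarly(passages):
    """Apparatus shown as footnotes directly under each passage."""
    out = []
    for xml in passages:
        text, notes = split_passage(xml)
        out += [text] + [f"  * {n}" for n in notes]
    return "\n".join(out)


def easy_read(passages):
    """Clean text first; the whole apparatus moved to the back."""
    texts, back = [], []
    for i, xml in enumerate(passages, 1):
        text, notes = split_passage(xml)
        texts.append(text)
        back += [f"  {i}: {n}" for n in notes]
    return "\n".join(texts + ["", "Apparatus:"] + back)


print(scholarly(PASSAGES))
print()
print(easy_read(PASSAGES))
```

Both views are derived from the same tagged source, so a correction submitted once shows up in either rendering.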

Each text with its tags could be held in a public project on, say, GitLab, and the XSD itself in a separate project.

And then the idea might have a witty and apposite punchline.

pertinax, Oct 15 2022

Apparatus Criticus https://en.wikipedi.../Critical_apparatus
Also, stemma codicum [pertinax, Oct 15 2022]

XSD https://www.w3schoo...ml/schema_intro.asp
[pertinax, Oct 15 2022]

OCR https://en.wikipedi...aracter_recognition
[pertinax, Oct 15 2022]

OCD https://www.mayocli...causes/syc-20354432
Look, it's just annoying, OK? [pertinax, Oct 15 2022]

Pro Murena: when rules of evidence haven't been invented yet ... https://www.perseus....0019%3Atext%3DMur.
... so you can just distract the jury by telling lawyer-jokes. [pertinax, Oct 15 2022]

Is that you, Brother Cerdic? https://d3h6k4kfl8m...07/phallus-tree.jpg
[pertinax, Oct 16 2022]


       this is brilliant [+] although if this idea's title is an elaborate pun, i don't understand it.
sninctown, Oct 16 2022

pocmloc, Oct 16 2022

       I understood one of those references.   

