h a l f b a k e r y
Inexact change.


Automatic name redaction

  (+3, -2)

This program looks through a document, finds any names, and replaces each one with a randomly generated name

(which is pseudorandom, but cannot be reverted, because the names come from a limited pool used solely to let readers know who's talking in the article, like "John Doe, Jackie Dan, etc."; alternatively you can just say "Guy1, Guy2, etc.")

This would be useful for WikiLeaks, as it would speed up the redaction needed to protect some of the people named in the article being leaked.

mofosyne, Aug 29 2010
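
The substitution half of the idea can be sketched roughly as follows. This is only a minimal sketch: it assumes the names have already been found by some other means (the hard part, as the annotations point out), and the function name `pseudonymize` and the "GuyN" labels are just illustrative choices based on the example above.

```python
import re

def pseudonymize(text, names):
    """Replace every occurrence of each given name with a stable placeholder.

    `names` is assumed to come from some separate detection step.
    Returns the rewritten text and the name -> placeholder mapping.
    """
    mapping = {}
    # Longest names first, so "John Doe" is tried before a bare "John".
    ordered = sorted(set(names), key=len, reverse=True)
    if not ordered:
        return text, mapping
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")

    def substitute(match):
        name = match.group(0)
        # The same name always maps to the same placeholder, so readers
        # can still follow who is talking.
        if name not in mapping:
            mapping[name] = f"Guy{len(mapping) + 1}"
        return mapping[name]

    return pattern.sub(substitute, text), mapping
```

For example, `pseudonymize("John Doe met Jackie Dan. Later John Doe left.", ["John Doe", "Jackie Dan"])` gives the text `"Guy1 met Guy2. Later Guy1 left."`. Discarding the returned mapping afterwards is what makes the replacement irreversible.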

Random name generator http://www.kleimo.com/random/name.cfm
"...uses data from the US census..." [normzone, Aug 29 2010]


       Applications I can think of would require a very, very low error rate at distinguishing names from other text. The most promising approach, so it seems to me, would be brute force, relying on a huge database of names. Would work even better in a context (e.g. medical records) where you had a database certain to contain every name that would need to be redacted.
mouseposture, Aug 29 2010
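
A minimal sketch of the brute-force lookup suggested above, assuming a name database is available as a set of lowercase strings. `KNOWN_NAMES` here is a tiny hypothetical stand-in; a real list (census data, or the patient table in the medical-records case) would be far larger.

```python
import re

# Hypothetical stand-in for a large name database.
KNOWN_NAMES = {"john", "doe", "jackie", "dan"}

def redact_known_names(text, known_names=KNOWN_NAMES, mask="[REDACTED]"):
    """Mask every word that appears in the name database, ignoring case."""
    def substitute(match):
        word = match.group(0)
        return mask if word.lower() in known_names else word

    # Check each run of letters against the database; everything else
    # (spaces, digits, punctuation) passes through untouched.
    return re.sub(r"[A-Za-z']+", substitute, text)
```

With this toy database, `redact_known_names("john doe called Jackie at noon.")` returns `"[REDACTED] [REDACTED] called [REDACTED] at noon."`. The error rate depends entirely on the database: names missing from it slip through, and ordinary words that double as names get masked.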

       [bigsleep] Sure, and s/wedding guests/terrorists/. And s/military installation/Chinese embassy/. And s/peasant/Viet Cong/. And s/cathedral/arms factory/ and s/sports stadium/troop concentration/.
mouseposture, Aug 29 2010

       We could begin by assuming that all the non-identified words are names, then tag all the word/name crossover words, and then also black out every capitalized word in the middle of a sentence. Seems like it should be very easy even for an amateur programmer. Then, if you wanted to get tricky, you could use the sentence structure deducing software used in MS Word to identify the subject of every sentence and redact it unless it was clearly not a name.
WcW, Aug 29 2010
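
The first two steps of that heuristic might look something like this. It is a rough sketch under loud assumptions: `COMMON_WORDS` is a hypothetical miniature dictionary standing in for a real word list, sentence boundaries are guessed from `.`, `!` and `?` alone, and the grammar-based step is left out entirely.

```python
import re

# Hypothetical miniature dictionary; a real one would hold a full word list.
COMMON_WORDS = {"the", "man", "met", "red", "at", "station", "then", "he", "left"}

def redact_by_heuristic(text, dictionary=COMMON_WORDS, mask="[REDACTED]"):
    """Mask words missing from the dictionary, plus mid-sentence capitals."""
    pieces = []
    sentence_start = True
    # Alternate between runs of letters (words) and runs of anything else.
    for chunk in re.findall(r"[A-Za-z]+|[^A-Za-z]+", text):
        if chunk[0].isascii() and chunk[0].isalpha():
            unknown = chunk.lower() not in dictionary
            mid_sentence_capital = chunk[0].isupper() and not sentence_start
            pieces.append(mask if unknown or mid_sentence_capital else chunk)
            sentence_start = False
        else:
            pieces.append(chunk)
            # Crude sentence detection: any ., ! or ? starts a new sentence.
            if re.search(r"[.!?]", chunk):
                sentence_start = True
    return "".join(pieces)
```

Here `redact_by_heuristic("The man met Zorblat at the station. Then he left.")` masks only "Zorblat", and a crossover word like "Red" in "The man met Red at the station." is caught by the capitalization rule. It also masks any other mid-sentence capital (including "I"), and it misses lowercase nicknames that happen to be dictionary words.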

       [WcW] //begin by assuming that all the non-identified words are names// Good.

       //structure deducing software used in MS Word// Bad, because complex, therefore untrustworthy. (E.g., you'd also need foreign-language detection to protect the sentence-structure-deducing algorithm, and then what if some joker used Igpay Atinlay, or leetspeak, ... and so on.)

       Moreover, you'd need to redact nicknames like "Red," or "Chuck" (and on the internet, you can't rely on capitalization) and noms de web (could it tell that "WcW" was a name but not "W3C" ?).
mouseposture, Aug 29 2010

       I don't think this would help much. "Guy1, CEO of Pharmacia, today said....". Redacting is more complex than just the names, I think.
MaxwellBuchanan, Aug 29 2010

       [MaxwellBuchanan] ...which is why the best solution is to apply the filter when the documents are created. Require the person writing the document to use some simple markup to flag redactable passages.   

       This avoids the need for unreliable machine intelligence, and requires the minimum of expensive human intelligence (since no one has an easier time understanding a passage than the person writing it). It also solves the problem you raise of identifying redactable passages that aren't names.   

       This is more feasible than perhaps it seems, if documents are created within the organization's electronic record-keeping and internal-communication system. I work in such an environment, and the system is a powerful tool for imposing standard format on documents as they are written.
mouseposture, Aug 29 2010
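
As an illustration of the author-applied markup approach, here is a minimal sketch. The `[[redact]]...[[/redact]]` tag syntax is invented for this example; nothing above specifies what the markup would actually look like.

```python
import re

# Hypothetical markup: the author wraps sensitive passages in
# [[redact]] ... [[/redact]] while writing the document.
REDACT_SPAN = re.compile(r"\[\[redact\]\](.*?)\[\[/redact\]\]", re.DOTALL)

def render(text, redact=True, mask="[REDACTED]"):
    """Produce the public (redacted) or internal (full) version of a document."""
    if redact:
        # Public version: drop the flagged passage and leave a mask.
        return REDACT_SPAN.sub(lambda m: mask, text)
    # Internal version: keep the passage, strip only the tags.
    return REDACT_SPAN.sub(lambda m: m.group(1), text)
```

Given `"[[redact]]John Doe, CEO of Pharmacia[[/redact]], today said hello."`, the public rendering is `"[REDACTED], today said hello."` and the internal one is the original sentence without tags. Because the author decides what goes inside the tags, this also covers identifying details that aren't names, such as the job title.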

       // unreliable machine intelligence //   

       We're going to kill you.   

       Nothing personal.   

       // human intelligence //   

       Where ? Where ? Show us ! We've always wanted to see that ....
8th of 7, Aug 29 2010

       Mark the documents, Mark.
MechE, Aug 30 2010

       Don't feel bad, [8th]. Intelligence is overrated.
mouseposture, Aug 31 2010

       perhaps people are unfamiliar with current government policies regarding redaction let me explain a ========= ======== ==== ========= ========= ============== ====== ====== ========== ==== ======== ========== which allows for re======== ================ ==== =========== ========= ========= ===== ========of most ====== =======   

       This page inte==== === ======
WcW, Aug 31 2010

       ' "cuppa Joe" isn't a name, but would get screwed up by this system.'
"I could live with that," thought st6f as he drank his cup of Pete.
st3f, Nov 11 2010

