h a l f b a k e r y



Fürst annual HalfBakery “Wo ist der Führer ?” programming competition.

[MARKED-FOR-EXPIRY] 31 January 2021
  (+4, -1)

It seems that the task of detecting and logging occurrences of Chancellor Adolph’s surname on the HB has generated quite a bit of interest.

A rather crude but basically effective bit of code has been developed, although we will be amongst the first to acknowledge that it’s very far from perfect. So, we have decided to institute an actual competition to encourage the development of coding skills by the other inmates of the asylum.


(subject to revision without notice, E&OE)

YOU MUST NOT:

- use more than the absolute minimum of site bandwidth for your task. This means no ripping the entire site to your computing device and then searching it locally.

- generate excessive new HB threads. You must either annotate an existing thread, or if you do create a new thread you must delete all previous ones.

- talk about Fight Club.

- Send heavy naval units through the Denmark Strait without destroyer escort or air cover, or sink any British battlecruisers.

- use an existing search engine. Those doing so will be summarily shot without trial, and then hung up outside a garage in Milan.

- request, conscript or coerce any assistance from any other person, organization or nation, including but not limited to Austria, Italy, Hungary or Romania. The code must be all your own work.

- register a new HB username to spare yourself embarrassment and ridicule.

You should try to avoid :

- using anything other than the existing HB "search" facility.

- overly provoking the U.S. unless absolutely necessary, unavoidable, accidental, funny, because the smug isolationist gits deserve it (being perfectly happy to let someone else do all the actual fighting while making silly money from loans and selling munitions at a swingeing profit), or you* just can't help yourself. (*This condition is of course waived for those of Japanese nationality).

- bombing any neutral countries (except the Republic of Ireland, who deserve whatever they get for not helping).

YOU MUST:

- reliably spot and log any new occurrences of the name; you must ignore previous mentions or edited annotations. You must weed out misspellings and deliberate attempts to deceive the algorithm, but the code must not be case sensitive.

- demonstrate kindness and consideration for dogs, particularly German Shepherds.

- handle month and year-ends and other date and time rollovers (including leap years) cleanly and correctly.

- use your own existing username and do your testing and debugging "in public" to furnish copious opportunities for derision and mockery. If you fail, you must be seen to fail. Badly, repeatedly, and in a humiliating way.

- respect the Geneva Conventions.

- only log the latest occurrence on any particular day, but this can be dependent on the time zone in which the user's machine resides.
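
One possible reading of the last rule above, sketched in Python: bucket mentions by calendar day in the user's own time zone and keep only the latest in each bucket. The `Mention` record, its field names, the sample data and the fixed UTC-5 offset are all assumptions made for illustration; nothing here is provided by the site.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone, tzinfo


@dataclass(frozen=True)
class Mention:
    """Hypothetical record of one mention; the field names are assumptions."""
    when: datetime  # timezone-aware timestamp of the post
    user: str
    thread: str


def latest_per_local_day(mentions, local_tz: tzinfo):
    """Return {local date: latest Mention on that date} for the given zone."""
    latest = {}
    for m in mentions:
        day = m.when.astimezone(local_tz).date()
        if day not in latest or m.when > latest[day].when:
            latest[day] = m
    return latest


# Made-up data: two posts on different UTC dates (either side of a year end)
# that fall on the same calendar day for a user at UTC-5.
utc = timezone.utc
sample = [
    Mention(datetime(2020, 12, 31, 23, 30, tzinfo=utc), "alice", "Thread A"),
    Mention(datetime(2021, 1, 1, 1, 15, tzinfo=utc), "bob", "Thread B"),
]
by_day = latest_per_local_day(sample, timezone(timedelta(hours=-5)))
for day, m in sorted(by_day.items()):
    print(day.isoformat(), m.user, m.thread)
```

Run against the same sample with `timezone.utc` instead, the two posts land on different days and both would be logged, which is the time zone dependence the rule allows for.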

Other rules:

You may use any device, operating system, programming tool, language, link library, SDK, web interface, browser, device driver or other legitimate general purpose device or system.

You will not be expected to publish your code, but you may be asked to send a copy to at least one other HB user to verify that you’re not just doing the edits manually.

Points will be deducted for date rollover issues, bad formatting, truncated or corrupted usernames or thread names, or getting the Sixth Army surrounded at Stalingrad and captured en masse.

Your code does not have to provide links to the thread, or line numbers containing the mention, but this may earn extra Luftflotten if implemented in a neat and consistent way.


The contest will end on 31 December 2020, or when Russian troops reach the Reichstag.

The results of the judging will be announced at Nuremberg the day after the competition closes.

The winner gets to annexe the Eastern European nation of their choice.

The losers get a conducted tour of a restored WW2 bunker in Berlin, and then have to go into a small private room and shoot themselves with a small-calibre pistol. Their bodies will then be taken outside and burnt (allegedly). When they reach Argentina, they should delete their old HB account and register a new username.

Jutta is very respectfully requested please not to enter, as she would not only automatically be announced the winner no matter what anyone else does, but direct access to the hb source code is a bit like the equivalent of finding Marshal Zhukov and 86 armoured divisions down the back of the sofa - it makes things all just a bit too easy.

So far, there appear to be at least two potential participants. We hope others will feel motivated to join in.

8th of 7, Dec 15 2020

Days Since Hitler Was Mentioned Here Days_20Since_20Hitl..._20Mentioned_20Here
The genesis of this idea. [8th of 7, Dec 15 2020]

Start here... https://www.halfbak...=Q:d=iq:dn=100:ds=1
Run this at same time each day and parse/compare results last seen / previous search hit? [kdf, Dec 16 2020, last modified Dec 17 2020]

Today's Notions - rss https://www.halfbak...Today_27s_20Notions
Here's the equivalent of the temporal component of kdf's search link, expressed as rss. Note the dc:coverage attribute that seems to express an idea's "lifespan" [zen_tom, Dec 16 2020]

Mine != Mein https://www.youtube...watch?v=s6-XeyNMCJw
A video illustration of a language matching problem where the matching parameters are set too loose.. [zen_tom, Dec 18 2020]

Github repo https://github.com/zentomhb/hb_scraper
Development in the open - it's messy, but over time should get better. [zen_tom, Dec 19 2020]

epoch time to human readable ... https://www.epochconverter.com/
[kdf, Dec 21 2020]


       I think there should be a Hitler category
pocmloc, Dec 15 2020

       Seems reasonable - you should post that in "Halfbakery: Category: Wanted".
8th of 7, Dec 15 2020

       What if we mention H, but then delete it 10 minutes later?
RayfordSteele, Dec 15 2020

       The system only needs to flag up "existing" occurrences at the end of the "day" (wherever the user is). Something transient doesn't need to be logged.   

       That's possibly covered by the "deliberate attempts to deceive the algorithm" clause. We might add or amend a rule.
8th of 7, Dec 15 2020

       [+] where would be a good place to post the results once we start generating them; here? I'd limit ourselves to posting results manually, I wouldn't want to inadvertently unleash anything horrendous as I sleep soundly one night.
zen_tom, Dec 16 2020

       "LET THE WORLD TREMBLE BEFORE MY SUPERIOR SCRIPT-FU!" was actually not at all the sub-text I had in mind when requesting a copy of the existing code.   

       On the contrary, I haven't used shell scripting for years, and have never tried defining an HB view, so I was hoping for a leisurely opportunity to refresh my rusty and uncompetitive skills.   

       I'm off for a quiet stroll in the Hürtgenwald now, and if [8th] could refrain from detonating any shells at treetop height, that would be appreciated.
pertinax, Dec 16 2020

       You're safe enough there, all the artillery is currently engaged in the Ardennes, supporting the Panzer thrusts towards Bastogne and the Meuse crossings.   

       // I was hoping for a leisurely opportunity to refresh my rusty and uncompetitive skills. //   

       ... and a cheap laugh.
8th of 7, Dec 16 2020

       I'm seeing this as a gamified opportunity to rekindle an old project from a long time ago - where the "win" is less a chance to summer in the Kehlsteinhaus than an excuse to indulge in some seasonal dictator-themed distraction. Happy to share a github repo if anyone would like to work more collaboratively though.
zen_tom, Dec 16 2020

       Just as long as it's not "collaboration horizontale" ... oh, that's an ugly, ugly mental image...
8th of 7, Dec 16 2020

       //... and a cheap laugh//   

       Well, obviously, yes. Incidentally, my inbox is so far void of code, risible or not, as is my spam filter. Did you send it?
pertinax, Dec 16 2020

       No; hence the competition. You show us yours, and we'll show you ours.
8th of 7, Dec 16 2020

       Oh, someone please get that image out of my head...
RayfordSteele, Dec 16 2020

       We've been struggling ... starting to look like it might need special tools.
8th of 7, Dec 16 2020

kdf, Dec 16 2020

       My step-mother's father was the first brain surgeon in their city. His first toolset was, shall we say... interesting.
RayfordSteele, Dec 16 2020

       I'm not gonna code it, just thinking about how I would go about it...   

       Execute a view (linked) to filter for ideas that contain H references and have modify timestamps in past 24 hours. Any search hits will sort newest first. Parse top idea's annos to make sure the H reference does indeed occur in the last 24 hours (as opposed to an idea having an older H anno floated to the top by a newer, non H anno). The first one you find is your reportable event. Repeat on the entire list just in case there are multiple H's on the same day - until you've exhausted the list.   

       If/When you do find one, save the date & other info on the most recent so you can compare future runs, get a "days since" count, etc. If you find multiple on the same day - er... let me go read 8th's rules again on how to report those - "... only log the most recent occurrence" but if there are multiples then it's zero days since the last one?   

       That's such a low bandwidth approach that you could run it once an hour (ftm=r3600 parameter instead of ftm=r86400) for better granularity.
kdf, Dec 16 2020
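
       A minimal sketch of the bookkeeping described in the annotation above, assuming the search results have already been fetched and parsed into records. The `Hit` type, its fields and the sample data are hypothetical. Treating several mentions on one day as "zero days since" is only one possible reading of the rules.

```python
from datetime import date
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """Hypothetical search hit; only a date is assumed, with no time of day."""
    day: date
    user: str
    thread: str


def newest_hit(hits: list) -> Optional[Hit]:
    """Most recent hit, or None if the search came back empty."""
    return max(hits, key=lambda h: h.day, default=None)


def days_since(previous: date, latest: date) -> int:
    """Whole calendar days between two mentions (0 if on the same day)."""
    return (latest - previous).days


# Made-up data for illustration only.
last_logged = Hit(date(2020, 2, 28), "user_a", "Thread A")
todays_hits = [
    Hit(date(2020, 3, 1), "user_b", "Thread B"),
    Hit(date(2020, 3, 1), "user_c", "Thread C"),
]
latest = newest_hit(todays_hits)
if latest is not None:
    print(days_since(last_logged.day, latest.day), "days since the previous mention")
```

       Subtracting `date` objects leaves month ends, year ends and leap days to the standard library, so the gap across 29 February 2020 in the sample comes out as 2.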

       Fine, but you don't need that level of finesse. Once a day, just before midnight in your local time zone, will meet the spec, since the original idea stated "days" since last mentioned.
8th of 7, Dec 16 2020

       OK, I've got the preliminaries set up now I think - connect to a suitable "recent" link and harvest all the idea pointers, then ping each of those one by one and retrieve the idea details, links and annotations, along with their respective metadata (poster, date and sequence). Take an md5 hash of that collection of information and construct a utility idea_posting object that expresses each of these things as a record.   

       Tomorrow, I need to think about how to serialise those objects down in a form that can be compared with future trawls. If we see an object that shares the same md5 hash for a given url, it can be ignored (though strictly speaking, it shouldn't have come up on the "recent" search - but you can't be too careful - maybe our device went down for a day or more, and we need to restart with some leeway, not knowing exactly when the outage started).   

       After filtering out any md5-measured non-changes (which we can assume we've already reported on anyway) we can scan the new content (possibly a "diff"-filtered selection to weed out old content for hitherto unseen pages) for matching phrases - might be nice to include alternate dictator options in some kind of config file, to keep tabs on mentions of Idi Amin, Muammar Gaddafi and Genghis Khan for example. Or, rather than diffing all the time, which might be expensive and presupposes one day potentially loading and storing a copy of every idea page in the bakery, it might be simpler just to filter out content posted before a certain point in time...tbd   

       A search topic match should retrieve the poster, date and approximate location of the mention, storing this for posterity. The most recent one (there could be many in any given day's scan) is fingered for reporting - at which point the time-delta between this and the last recorded match is calculated and added to the trigger report.   

       The last two paragraphs are very much at the "hand wavy" stage where things sound a great deal simpler than they turn out to be in practice - but let's see how it goes. Having egregiously over-engineered the data collection part given the remit, there ought to be plenty of room for options - but it's still early days, and I'm yet to figure out how to commit to a github repository using an alias.
zen_tom, Dec 16 2020
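
       The hash-and-compare step described above might look something like this. It is a sketch under assumptions, not the code in the linked repo: the function names, the choice of fields that feed the hash and the URL keys are all hypothetical.

```python
import hashlib


def fingerprint(idea_text: str, links: list, annos: list) -> str:
    """MD5 hex digest over an idea's text, links and annotations, in order."""
    h = hashlib.md5()
    for part in [idea_text, *links, *annos]:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator, so ["ab", "c"] and ["a", "bc"] differ
    return h.hexdigest()


def changed_pages(trawl: dict, seen: dict) -> list:
    """URLs whose fingerprint is new or differs from the stored one."""
    return [url for url, digest in trawl.items() if seen.get(url) != digest]


# Made-up pages for illustration only.
seen = {"idea/A": fingerprint("some idea text", [], ["first anno"])}
trawl = {
    "idea/A": fingerprint("some idea text", [], ["first anno"]),
    "idea/B": fingerprint("another idea", [], []),
}
print(changed_pages(trawl, seen))

# A new annotation changes the digest, so idea/A shows up as well.
trawl["idea/A"] = fingerprint("some idea text", [], ["first anno", "second anno"])
print(changed_pages(trawl, seen))
```

       MD5 is used here only as a cheap change detector, as in the description above; any stable digest would serve.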

       [zen] why start with a “recent” view? Why not a custom Halfbakery view which shows only those ideas with Adolf’s surname in the idea text?
hippo, Dec 16 2020

       "Once a day, just before midnight in your local time zone...
—8th of 7, Dec 16 2020

       It really makes no difference what time of day I run it, does it? But I have to report based on date-time strings as they are given on HB, right? Two items, posted at 2345 GMT on the 16th and 0115 GMT on the 17th are separated by a "day" in your thinking - even though they'll both become visible to me in Detroit (GMT-5) on the evening of the 16th.
kdf, Dec 16 2020
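
       The two-post example above can be checked with a few lines of Python. The year (2020, taken from the thread dates) and the fixed GMT-5 offset for Detroit are assumptions; a real script would more likely use a named time zone.

```python
from datetime import datetime, timedelta, timezone

GMT = timezone.utc
DETROIT = timezone(timedelta(hours=-5), "GMT-5")  # fixed offset, as in the post

posts = [
    datetime(2020, 12, 16, 23, 45, tzinfo=GMT),
    datetime(2020, 12, 17, 1, 15, tzinfo=GMT),
]
for p in posts:
    local = p.astimezone(DETROIT)
    print(p.strftime("%Y-%m-%d %H:%M GMT"), "->", local.strftime("%Y-%m-%d %H:%M"), "in Detroit")

gmt_dates = {p.date() for p in posts}
local_dates = {p.astimezone(DETROIT).date() for p in posts}
print(len(gmt_dates), "GMT date(s),", len(local_dates), "Detroit date(s)")
```

       The same pair of posts spans two calendar dates in GMT but only one in Detroit, so any "days since" figure depends on which clock is treated as authoritative.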

       "Why not a custom Halfbakery view which shows only those ideas with Adolf’s surname in the idea text?"
-hippo, Dec 16 2020

       I think 8th wanted to spot new annos along with just the original idea text - but yes, that's the View I linked earlier, shortly before zen_tom's remark.   

       I sure wish HB offered complete time stamps instead of just dates. I can see the timestamp of the most recent anno down to the second - but the penultimate one, not so precisely.
kdf, Dec 16 2020

       [hippo] I'm using "recent" here as shorthand for a custom halfbakery view filtered on some time period - I think the reason for reserving the text-based processing till later is this idea that you might want to repurpose it for multiple or alternate topic options. And while you could push that part of the process up front, given the churn of ideas is rarely > 20 a day, it shouldn't be a problem dealing with that volume after data retrieval - and again, building an incremental backup of the halfbakery isn't necessarily a bad side-effect of doing it in this more open way.   

       [kdf] Time stamped annos would be handy - interestingly, under the search view options, there's a sub-option to view an idea's header content in RSS. These details contain a dc:date attribute that *is* a timestamp, presumably of the latest annotation. It's not strictly usable for this, but it might be interesting to see what else is available under the rss feed data contents - I saw somewhere previously a kind of time-period attribute (can't find it now) that seemed to show the effective time-span of an idea, presumably from post-time to include latest annotation/link addition time.
zen_tom, Dec 16 2020

       Now, this is interesting -   

       The result from last nite looks like this:   

       "Der Führer was last mentioned by user:[zen_tom]in thread "Fishties"\n on 17-DEC-2020   


       Days have elapsed since the previous mention of the former Nazi Chancellor of Germany. — 8th of 7, Dec 18 2020 [edit, delete]"   

       Note that the datestamp is Dec 18, not 17, altho cron kicked off the run at 23:59. Some latency there, we wonder ?
8th of 7, Dec 18 2020

       And *that* is why I think you need better granularity. What would it really cost you to run it hourly, or even a few times a day?
kdf, Dec 18 2020

       Only your immortal soul, so no problem at all.
8th of 7, Dec 18 2020

       Your cron and the bakery might well operate in different timezones - does your job post its results automatically? I thought you might be copy-pasting from another source.
zen_tom, Dec 18 2020

       Sorry, my immortal soul runs on a different architecture, and there's no linux distro for it (yet).
kdf, Dec 18 2020

       I had to read the whole thing twice, because I don't get this:   

       // YOU MUST NOT: use more than the absolute minimum of site bandwidth for your task. This means no ripping the entire site to your computing device and then searching it locally.   

       The absolute minimum IS ripping the entire site at least once and then again and again to detect the changes. In fact, storing it locally as a cache helps to minimize bandwidth, not to increase it. What am I missing here?
ixnaum, Dec 18 2020

       It depends - if you rely on pulling an extract on a daily basis, and if your extract is fed by an appropriately synchronised search query, then anything you find in that extract should be reportable. If you're less sure of the reliability of those extracts, then you might need to start caching some of the information - and since sureness is a continuum, then you might (at the extreme end of that continuum) want to cache the entire system. The middle ground is to set a starting point, and from then on, cache and report on what changes since that point in time.   

       I've added a repo in the links - this gets the content and caches it to a sqlite database (by fiddling about you can populate this by pulling extracts from the past, but normally you'd just have it collect "fresh" content since your last pull). It does some checks to see if the content pulled is already cached (to avoid double-caching) and builds a dataframe of date-tagged "contributions" attributable to the users who posted them - these can be of type link, anno or idea. Finally, a crude regex matcher configured with one (or more) keyword topics scans over the text and returns matches. There's a bit more to do in terms of storing these down in order to remember which ones have already been reported on, and also to determine a "how many days since" figure. Once it's stable, we can pull it out of the jupyter notebook and turn it into a proper set of runnable python files.
zen_tom, Dec 19 2020
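
       For illustration, a keyword matcher of the kind described above could be as small as this. It is not the code from the repo: the function names are hypothetical, and the whole-word, case-insensitive policy is one guess at how to meet the "not case sensitive" rule without matching too loosely.

```python
import re


def build_matcher(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def find_mentions(text, matcher):
    """Return (offset, matched text) for every whole-word keyword match."""
    return [(m.start(), m.group(0)) for m in matcher.finditer(text)]


matcher = build_matcher(["Hitler", "Genghis Khan"])
print(find_mentions("HITLER and genghis khan, but not Hitlerian.", matcher))
# prints [(0, 'HITLER'), (11, 'genghis khan')]
```

       The `\b` word boundaries are what keep "Hitlerian" out. Misspellings are simply not matched, which may or may not be what the rules intend.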

       //The absolute minimum IS ripping the entire site at least once and then again and again to detect the changes. In fact, storing it locally as a cache helps to minimize bandwidth, not to increase it. What am I missing here?//   

       He wants you to use the halfbakery search function.
Observe the "You should try to avoid" section.
Quite how to reconcile this with "YOU MUST NOT [...] - use an existing search engine." is up to you, though.

       Just use Perl. Perl was made for this.
Loris, Dec 19 2020

       D'oh. Instead of a view for the last 24 hours "ftm=r86400" it's easier to specify a view since the last time you checked or the last known H reference. "ftm=aX" for example where X is epoch time... It's about 1608579930 as I'm typing this, give or take a few seconds for when I hit OK and it posts.
kdf, Dec 21 2020
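
       A small sketch of building the two filter values mentioned in this thread. The helper names are hypothetical, and the meaning of `ftm=r...` and `ftm=a...` is taken from the annotations here rather than from any site documentation. The rest of the view URL is left out because the links above are truncated.

```python
from datetime import datetime, timezone


def ftm_relative(seconds: int) -> str:
    """Filter value for 'within the last N seconds', e.g. ftm=r86400."""
    return f"ftm=r{seconds}"


def ftm_since(moment: datetime) -> str:
    """Filter value for 'since this moment', written as ftm=a<epoch seconds>."""
    return f"ftm=a{int(moment.timestamp())}"


last_check = datetime(2020, 12, 21, 19, 45, 30, tzinfo=timezone.utc)
print(ftm_relative(3600))     # prints ftm=r3600
print(ftm_since(last_check))  # prints ftm=a1608579930

# Going the other way: the epoch value quoted above as a readable UTC time.
print(datetime.fromtimestamp(1608579930, tz=timezone.utc).isoformat())
# prints 2020-12-21T19:45:30+00:00
```

       Using a timezone-aware `datetime` means the epoch value does not depend on the local clock settings of the machine running the script.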

       Make sure you suppress leading and trailing zeroes in the output, or they'll all start whingeing and whining and wanting the month in Roman numerals and all sorts of petty, pointless fixes ... bet they all come from sales or marketing backgrounds, "never mind the superb functionality, can you make that bit red instead of green ?" ... <bitter bitter bitter/>
8th of 7, Dec 21 2020

       I can sympathize. I tend to focus on runtime improvements and don't know why people whinge about output formatting.
kdf, Dec 21 2020

       // I can sympathize. //   

       Oh no, not +++ ETERNAL DOMAIN ERROR +++ again.   

       <Prepares to re-install correct Universe/>
8th of 7, Dec 21 2020

       Right. Add "agreeing with 8th" to the long list of things that throw him for a loop. Along with mentioning missing crayons, departed friends, and baseball.
kdf, Dec 21 2020

       Zzzzzzzzzzzz ..... zzzzzzzzzz ..... zzzzzzzzzzz ...... zzzzzzzzz.....
8th of 7, Dec 21 2020

       To paraphrase Data, you really shouldn't let anyone know you have an "off" switch.
kdf, Dec 21 2020

       //I can sympathize. I tend to focus on runtime improvements and don't know why people whinge about output formatting.//   

       I'm one of those that whine about formatting, because it ties directly into functionality and efficiency. I spend half my bloody time formatting graphs for presentations because computer scientists apparently never have to give a single presentation to a 68-year-old board member with coke-bottle glasses. I don't care how fast your code runs if it takes me twice the amount of time to invoke some common sense into the output.   

       Matlab is a godsend to creating all sorts of simulations and graphs and analyses, but who in god's name decided that the default graphical outputs should have like 2 point font numbers and black backgrounds? And where do they live?
RayfordSteele, Dec 21 2020

