h a l f b a k e r y"More like a cross between an onion, a golf ball, and a roman multi-tiered arched aquaduct."
It seems that the task of detecting and logging occurrences of Chancellor Adolph's surname on the HB has generated quite a bit of interest.
A rather crude but basically effective bit of code has been developed, although we will be amongst the first to acknowledge that it's very far from perfect.
So, we have decided to institute an actual competition to encourage the development of coding skills by the other inmates of the asylum.
THE RULES:
(subject to revision without notice, E&OE)
YOU MUST NOT:
- use more than the absolute minimum of site bandwidth for your task. This means no ripping the entire site to your computing device and then searching it locally.
- generate excessive new HB threads. You must either annotate an existing thread, or if you do create a new thread you must delete all previous ones.
- talk about Fight Club.
- Send heavy naval units through the Denmark Strait without destroyer escort or air cover, or sink any British battlecruisers.
- use an existing search engine. Those doing so will be summarily shot without trial, and then hung up outside a garage in Milan.
- request, conscript or coerce any assistance from any other person, organization or nation, including but not limited to Austria, Italy, Hungary or Romania. The code must be all your own work.
- register a new HB username to spare yourself embarrassment and ridicule.
You should try to avoid :
- using anything other than the existing HB "search" facility.
- overly provoking the U.S. unless absolutely necessary, unavoidable, accidental, funny, because the smug isolationist gits deserve it (being perfectly happy to let someone else do all the actual fighting while making silly money from loans and selling munitions at a swingeing profit), or you* just can't help yourself.
(*This condition is of course waived for those of Japanese nationality).
- bombing any neutral countries (except the Republic of Ireland, who deserve whatever they get for not helping).
You MUST:
- reliably spot and log any new occurrences of the name; you must ignore previous mentions or edited annotations. You must weed out misspellings and deliberate attempts to deceive the algorithm, but the code must not be case sensitive.
- demonstrate kindness and consideration for dogs, particularly German Shepherds.
- handle month and year-ends and other date and time rollovers (including leap years) cleanly and correctly.
- use your own existing username and do your testing and debugging "in public" to furnish copious opportunities for derision and mockery. If you fail, you must be seen to fail. Badly, repeatedly, and in a humiliating way.
- respect the Geneva Conventions.
- only log the latest occurrence on any particular day, but this can be dependent on the time zone in which the user's machine resides.
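A minimal sketch of two of the "You MUST" rules, for anyone wanting a starting point: case-insensitive matching that rejects simple misspellings, and keeping only the latest mention per day in the user's own time zone. The names `NAME_RE`, `mentions_name` and `latest_per_local_day` are made up for illustration, and a whole-word regex will not catch cleverer attempts to deceive the algorithm.

```python
import re
from datetime import datetime, timedelta, timezone

# Whole-word, case-insensitive match: "HITLER" and "hitler" both count,
# while misspellings such as "Hittler" or "Hitlerr" do not.
NAME_RE = re.compile(r"\bhitler\b", re.IGNORECASE)


def mentions_name(text):
    """Return True if the text contains the correctly spelled name."""
    return NAME_RE.search(text) is not None


def latest_per_local_day(mentions, local_tz):
    """Keep only the latest mention for each day as seen in local_tz.

    mentions: iterable of (timezone-aware datetime, description) pairs.
    Returns {local date: (timestamp, description)}.
    """
    latest = {}
    for stamp, description in mentions:
        day = stamp.astimezone(local_tz).date()
        if day not in latest or stamp > latest[day][0]:
            latest[day] = (stamp, description)
    return latest


if __name__ == "__main__":
    sample = [
        (datetime(2020, 12, 17, 23, 30, tzinfo=timezone.utc), "first"),
        (datetime(2020, 12, 18, 0, 30, tzinfo=timezone.utc), "second"),
    ]
    # Two separate days for a user on UTC, but one day for a user on UTC-8.
    print(len(latest_per_local_day(sample, timezone.utc)))
    print(len(latest_per_local_day(sample, timezone(timedelta(hours=-8)))))
```

The same pair of mentions counts as two days or one depending on where the user's machine resides, which is the time zone dependence the last rule allows for.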
Other rules:
You may use any device, operating system, programming tool, language, link library, SDK, web interface, browser, device driver or other legitimate general purpose device or system.
You will not be expected to publish your code, but you may be asked to send a copy to at least one other HB user to verify that you're not just doing the edits manually.
Points will be deducted for date rollover issues, bad formatting, truncated or corrupted usernames or thread names, or getting the Sixth Army surrounded at Stalingrad and captured en masse.
Your code does not have to provide links to the thread, or line numbers containing the mention, but this may earn extra Luftflotten if implemented in a neat and consistent way.
Judging.
The contest will end on 31 December 2020, or when Russian troops reach the Reichstag.
The results of the judging will be announced at Nuremberg the day after the competition closes.
The winner gets to annexe the Eastern European nation of their choice.
The losers get a conducted tour of a restored WW2 bunker in Berlin, and then have to go into a small private room and shoot themselves with a small-calibre pistol. Their bodies will then be taken outside and burnt (allegedly). When they reach Argentina, they should delete their old HB account and register a new username.
Jutta is very respectfully requested please not to enter, as she would not only automatically be announced the winner no matter what anyone else does, but direct access to the hb source code is a bit like the equivalent of finding Marshal Zhukov and 86 armoured divisions down the back of the sofa - it makes things all just a bit too easy.
So far, there appear to be at least two potential participants. We hope others will feel motivated to join in.
Days Since Hitler Was Mentioned Here
Days_20Since_20Hitl..._20Mentioned_20Here The genesis of this idea. [8th of 7, Dec 15 2020]
Today's Notions - rss
https://www.halfbak...Today_27s_20Notions Here's the equivalent of the temporal component of kdf's search link, expressed as rss. Note the dc:coverage attribute that seems to express an idea's "lifespan" [zen_tom, Dec 16 2020]
Mine != Mein
https://www.youtube...watch?v=s6-XeyNMCJw A video illustration of a language matching problem where the matching parameters are set too loose.. [zen_tom, Dec 18 2020]
Github repo
https://github.com/zentomhb/hb_scraper Development in the open - it's messy, but over time should get better. [zen_tom, Dec 19 2020]
|
|
I think there should be a Hitler category |
|
|
Seems reasonable - you should post that in "Halfbakery: Category: Wanted". |
|
|
What if we mention H, but then delete it 10 minutes
later? |
|
|
The system only needs to flag up "existing" occurrences at the end of the "day" (wherever the user is). Something transient doesn't need to be logged. |
|
|
That's possibly covered by the "deliberate attempts to deceive the algorithm" clause. We might add or amend a rule. |
|
|
[+] where would be a good place to post the results once we
start generating them; here? I'd limit ourselves to posting
results manually, I wouldn't want to inadvertently unleash
anything horrendous as I sleep soundly one night. |
|
|
"LET THE WORLD TREMBLE BEFORE MY SUPERIOR SCRIPT-FU!"
was actually not at all the sub-text I had in mind when
requesting a copy of the existing code. |
|
|
On the contrary, I haven't used shell scripting for years, and have
never tried defining an HB view, so I was hoping for a leisurely
opportunity to refresh my rusty and uncompetitive skills. |
|
|
I'm off for a quiet stroll in the Hürtgenwald now, and if [8th]
could refrain from detonating any shells at treetop height, that
would be appreciated. |
|
|
You're safe enough there, all the artillery is currently engaged in the Ardennes, supporting the Panzer thrusts towards Bastogne and the Meuse crossings. |
|
|
// I was hoping for a leisurely opportunity to refresh my rusty and uncompetitive skills. // |
|
|
I'm seeing this as a gamified opportunity to rekindle
an old project from a long time ago - where the "win"
is less a chance to summer in the Kehlsteinhaus than
an excuse to indulge in some seasonal dictator-themed
distraction. Happy to share a github repo if anyone
would like to work more collaboratively though. |
|
|
Just as long as it's not "collaboration horizontale" ... oh, that's an ugly, ugly mental image... |
|
|
//... and a cheap laugh// |
|
|
Well, obviously, yes. Incidentally, my inbox is so far void of
code, risible or not, as is my spam filter. Did you send it? |
|
|
No; hence the competition. You show us yours, and we'll show you ours. |
|
|
Oh, someone please get that image out of my head... |
|
|
We've been struggling ... starting to look like it might need special tools. |
|
|
My step-mother's father was the first brain surgeon in their
city. His first toolset was, shall we say... interesting. |
|
|
Fine, but you don't need that level of finesse. Once a day, just before midnight in your local time zone, will meet the spec, since the original idea stated "days" since last mentioned. |
|
|
OK, I've got the preliminaries set up now I think - connect to a suitable "recent" link and harvest all the idea pointers,
then ping each of those one by one and retrieve the idea details, links and annotations, along with their respective
meta data (poster, date and sequence). Take an md5 hash of that collection of information and construct a utility
idea_posting object that expresses each of these things as a record. |
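As a rough illustration of that record-plus-hash idea (not the code in the repo), something like the following would do. The field names and the choice of JSON as the thing being hashed are assumptions; the only point is that the same content must always serialise to the same bytes before md5 sees it.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class IdeaPosting:
    """One idea page as harvested: header details plus links and annotations."""
    url: str
    title: str
    poster: str
    date: str
    # Each annotation/link is a dict such as
    # {"poster": ..., "date": ..., "seq": ..., "text": ...} (hypothetical layout).
    annotations: list = field(default_factory=list)
    links: list = field(default_factory=list)

    def md5(self):
        """Hash a stable serialisation so identical content gives an identical digest."""
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.md5(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    page = IdeaPosting("https://example.invalid/idea/Example", "Example", "someone", "Dec 15 2020")
    print(page.md5())
    page.annotations.append({"poster": "other", "date": "Dec 16 2020", "seq": 1, "text": "[+]"})
    print(page.md5())  # differs, because the content changed
```

md5 is used here only as a cheap change detector, as in the description above, not for anything security related.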
|
|
Tomorrow, I need to think about how to serialise those objects down in a form that can be compared with future
trawls. If we see an object that shares the same md5 hash for a given url, it can be ignored (though strictly speaking,
it shouldn't have come up on the "recent" search - but you can't be too careful - maybe our device went down for a day
or more, and we need to restart with some leeway, not knowing exactly when the outage started ) |
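One possible way to serialise things down for comparison with future trawls, sketched under the assumption that a plain url-to-md5 mapping in a JSON file is enough. The function names and the file layout are invented here.

```python
import json
from pathlib import Path


def load_seen(path):
    """Load the {url: md5} mapping left by earlier trawls (empty on first run)."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


def filter_changed(trawl, seen):
    """Return the urls whose hash is new or different, then remember this trawl.

    trawl: {url: md5} for the pages just retrieved.
    seen:  {url: md5} from previous runs; updated in place.
    """
    changed = [url for url, digest in trawl.items() if seen.get(url) != digest]
    seen.update(trawl)
    return changed


def save_seen(seen, path):
    """Write the mapping back out for the next run."""
    Path(path).write_text(json.dumps(seen, indent=2, sort_keys=True), encoding="utf-8")
```

A restart after an outage can then re-pull with as much leeway as it likes: anything with an unchanged hash simply drops out of `filter_changed`.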
|
|
After filtering out any md5-measured non-changes (which we can assume we've already reported on anyway) we can
scan the new content (possibly a "diff"-filtered selection to weed out old content for hitherto unseen pages) for matching phrases - might be nice to
include alternate dictator options in some kind of config
file, to keep tabs of mentions of Idi Amin, Muammar Gaddafi and Genghis Khan for example. Or, rather than diffing all the time, which might be expensive
and presupposes one day potentially loading and storing a copy of every idea page in the bakery, it might be simpler just to filter out content posted
before a certain point in time...tbd |
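For what it's worth, the "config file plus cut-off time" variant might look roughly like this. The `TOPICS` dict stands in for a config file, and the tuple layout of a contribution is an assumption made for the sketch.

```python
import re
from datetime import datetime

# Stand-in for a config file: topic name -> pattern (all hypothetical choices).
TOPICS = {
    "Idi Amin": r"\bidi\s+amin\b",
    "Muammar Gaddafi": r"\bgaddafi\b",
    "Genghis Khan": r"\bgenghis\s+khan\b",
}
COMPILED = {name: re.compile(pattern, re.IGNORECASE) for name, pattern in TOPICS.items()}


def scan(contributions, cutoff):
    """Return (topic, poster, posted_at) for every match posted after cutoff.

    contributions: iterable of (posted_at, poster, text) tuples.
    Anything posted at or before the cutoff is skipped without being searched.
    """
    hits = []
    for posted_at, poster, text in contributions:
        if posted_at <= cutoff:
            continue
        for name, regex in COMPILED.items():
            if regex.search(text):
                hits.append((name, poster, posted_at))
    return hits


if __name__ == "__main__":
    cutoff = datetime(2020, 12, 17)
    rows = [
        (datetime(2020, 12, 16), "early_bird", "Genghis Khan rode here"),
        (datetime(2020, 12, 18), "late_comer", "GENGHIS khan, meet Idi Amin"),
    ]
    for hit in scan(rows, cutoff):
        print(hit)
```

Filtering on time first means nothing older than the cut-off needs to be stored or diffed at all.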
|
|
A search topic match should retrieve the poster, date and approximate location of the mention, storing this for
posterity. The most recent one (there could be many in any given day's scan) is fingered for reporting - at which point
the time-delta between this and the last recorded match is calculated and added to the trigger report. |
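The reporting step could be sketched as below, assuming each match is a dict with a `when` datetime (an invented layout). Subtracting `date` objects lets the standard library deal with month ends, year ends and leap years.

```python
from datetime import date, datetime


def pick_latest(matches):
    """Return the most recent match from a day's scan."""
    return max(matches, key=lambda match: match["when"])


def days_since(previous, latest):
    """Whole days between two calendar dates."""
    return (latest - previous).days


if __name__ == "__main__":
    todays = [
        {"poster": "a", "thread": "One", "when": datetime(2020, 12, 18, 9, 0)},
        {"poster": "b", "thread": "Two", "when": datetime(2020, 12, 18, 21, 30)},
    ]
    newest = pick_latest(todays)
    last_recorded = date(2020, 12, 15)
    print(newest["poster"], days_since(last_recorded, newest["when"].date()))
```

For example, `days_since(date(2020, 2, 28), date(2020, 3, 1))` is 2, because 2020 has a 29 February.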
|
|
The last two paragraphs are very much at the "hand wavy" stage where things sound a great deal simpler than they
turn out to be in practice - but let's see how it goes. Having egregiously over-engineered the data collection part given
the remit, there ought to be plenty of room for options - but it's still early days, and I'm yet to figure out how to
commit to a github repository using an alias. |
|
|
[zen] why start with a recent view? Why not a custom Halfbakery
view which shows only those ideas with Adolf's surname in the idea
text? |
|
|
[hippo] I'm using "recent" here as shorthand for a custom
halfbakery view filtered on some time period - I think the
reason for reserving the text-based processing till later is
this idea that you might want to repurpose it for multiple
or alternate topic options. And while you could push that
part of the process up front, given the churn of ideas is
rarely > 20 a day, it shouldn't be a problem dealing with
that volume after data retrieval - and again, building an
incremental backup of the halfbakery isn't necessarily a
bad side-effect of doing it in this more open way. |
|
|
[kdf] Time stamped annos would be handy - interestingly, under
the search view options, there's a sub-option to view an
idea's header content in RSS. These details contain a
dc:date attribute that *is* a timestamp, presumably of
the latest annotation. It's not strictly usable for this, but
it might be interesting to see what else is available under
the rss feed data contents - I saw somewhere previously a
kind of time-period attribute (can't find it now) that
seemed to show the effective time-span of an idea,
presumably from post-time to include latest
annotation/link addition time. |
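A hedged sketch of pulling those Dublin Core fields out of a feed with the standard library. The `SAMPLE` document is entirely made up (the real feed's layout and value formats may differ, and the fields may turn out to be attributes rather than elements); only the dc namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core elements namespace, in ElementTree's {uri}tag form.
DC = "{http://purl.org/dc/elements/1.1/}"

# Invented sample, NOT copied from the halfbakery feed.
SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="https://example.invalid/idea/Example">
    <title>Example</title>
    <dc:date>2020-12-16T10:15:00Z</dc:date>
    <dc:coverage>2020-12-15/2020-12-16</dc:coverage>
  </item>
</rdf:RDF>
""".strip()


def dc_fields(rss_text):
    """Collect dc:date (and dc:coverage, if present) from every element that has a dc:date child."""
    root = ET.fromstring(rss_text)
    found = []
    for parent in root.iter():
        stamp = parent.findtext(DC + "date")
        if stamp is None:
            continue
        coverage = parent.findtext(DC + "coverage")
        found.append({
            "date": stamp.strip(),
            "coverage": coverage.strip() if coverage else None,
        })
    return found


if __name__ == "__main__":
    print(dc_fields(SAMPLE))
```

The values are returned as raw strings, since the actual timestamp format would need checking against the live feed before parsing.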
|
|
Now, this is interesting - |
|
|
The result from last nite looks like this: |
|
|
"Der Führer was last mentioned by user:[zen_tom]in thread "Fishties"\n on 17-DEC-2020 |
|
|
Days have elapsed since the previous mention of the former Nazi Chancellor of Germany.
8th of 7, Dec 18 2020
[edit, delete]" |
|
|
Note that the datestamp is Dec 18, not 17, altho cron kicked off the run at 23:59. Some latency there, we wonder ? |
|
|
Only your immortal soul, so no problem at all. |
|
|
Your cron, and the bakery might well operate in different
timezones - does your job post its results automatically? I
thought you might be copy pasting from another source. |
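To illustrate the timezone possibility: if the cron machine were on UTC and the bakery's clock were one hour ahead (both offsets are pure assumptions, not known facts about either machine), a 23:59 run would already be stamped with the next day's date.

```python
from datetime import datetime, timedelta, timezone


def server_date(local_run, server_tz):
    """Calendar date that a server in server_tz would stamp on this moment."""
    return local_run.astimezone(server_tz).date()


# Both offsets are hypothetical, chosen only to show the effect.
CRON_TZ = timezone.utc
SERVER_TZ = timezone(timedelta(hours=1))

run = datetime(2020, 12, 17, 23, 59, tzinfo=CRON_TZ)
print(run.date(), server_date(run, SERVER_TZ))  # 2020-12-17 2020-12-18
```

Plain latency would produce the same symptom: a job started at 23:59 that takes more than a minute to post lands after midnight even when both clocks agree.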
|
|
I had to read the whole thing twice, because I don't get this: |
|
|
// YOU MUST NOT: use more than the absolute minimum of site bandwidth for your task. This means no ripping the entire site to your computing device and then searching it locally. |
|
|
The absolute minimum IS ripping the entire site at least once and then again and again to detect the changes. In fact, storing it locally as a cache helps to minimize bandwidth, not to increase it.
What am I missing here? |
|
|
It depends - if you rely on pulling an extract on a daily
basis, and if your extract is fed by an appropriately
synchronised search query, then anything you find in that
extract should be reportable. If you're less sure of the
reliability of those extracts, then you might need to start
caching some of the information - and since sureness is a
continuum, then you might (at the extreme end of that
continuum) want to cache the entire system. The middle
ground is to set a starting point, and from then on, cache
and report on what changes since that point in time. |
|
|
I've added a repo in the links - this gets the content,
caches it to a sqlite database (by fiddling about you can
populate this by pulling extracts from the past, but
normally, you'd just have it collect "fresh" content since
your last pull). It does some checks to see if the content
pulled is already cached (to avoid double-caching) and
builds a dataframe of date-tagged "contributions"
attributable to the users who posted them - these can be
of type link, anno or idea. Finally, a crude regex matcher
configured with one (or more) keyword topics scans over
the text and returns matches. There's a bit more to do in
terms of storing these down in order to remember which
ones have already been reported on, and also to
determine a "how many days since" figure. Once it's
stable, we can pull it out of the jupyter notebook and
turn it into a proper set of runnable python files. |
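For anyone who'd rather not wade through the notebook, the caching step might be sketched like this. This is not the repo's code: the table name, columns and key are guesses at a workable schema, with a primary key doing the double-caching check.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache; one row per idea, anno or link."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS contributions (
               url    TEXT NOT NULL,
               kind   TEXT NOT NULL CHECK (kind IN ('idea', 'anno', 'link')),
               seq    INTEGER NOT NULL,
               poster TEXT NOT NULL,
               posted TEXT NOT NULL,
               body   TEXT NOT NULL,
               PRIMARY KEY (url, kind, seq)
           )"""
    )
    return conn


def cache(conn, rows):
    """Store rows, silently skipping any already cached; return how many were new."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO contributions VALUES (?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return conn.total_changes - before


if __name__ == "__main__":
    db = open_cache()
    pulled = [
        ("https://example.invalid/idea/Example", "idea", 0, "someone", "2020-12-15", "text"),
        ("https://example.invalid/idea/Example", "anno", 1, "other", "2020-12-16", "[+]"),
    ]
    print(cache(db, pulled))  # 2 on the first pull
    print(cache(db, pulled))  # 0 when the same extract is pulled again
```

Note that `INSERT OR IGNORE` keeps the first version of an edited annotation, which happens to suit the rule about ignoring edited annotations.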
|
|
//The absolute minimum IS ripping the entire site at least once and then again and
again to detect the changes. In fact, storing it locally as a cache helps to minimize
bandwidth, not to increase it. What am I missing here?// |
|
|
He wants you to use the halfbakery search function.
Observe the "You should try to avoid" section.
Quite how to reconcile this with "YOU MUST NOT [...] - use an existing search engine."
is up to you, though. |
|
|
Just use Perl. Perl was made for this. |
|
|
Make sure you suppress leading and trailing zeroes in the output, or they'll all start whingeing and whining and wanting the month in Roman numerals and all sorts of petty, pointless fixes ... bet they all come from sales or marketing backgrounds, "never mind the superb functionality, can you make that bit red instead of green ?" ... <bitter bitter bitter/> |
|
|
Oh no, not +++ ETERNAL DOMAIN ERROR +++ again. |
|
|
<Prepares to re-install correct Universe/> |
|
|
Zzzzzzzzzzzz ..... zzzzzzzzzz ..... zzzzzzzzzzz ...... zzzzzzzzz..... |
|
|
//I can sympathize. I tend to focus on runtime
improvements and don't know why people whinge about
output formatting.// |
|
|
I'm one of those that whine about formatting, because it
ties directly into functionality and efficiency. I spend half
my bloody time formatting graphs for presentations
because computer scientists apparently never have to give
a single presentation to a 68-year-old board member with
coke-bottle glasses. I don't care how fast your code runs if
it takes me twice the amount of time to invoke some
common sense into the output. |
|
|
Matlab is a godsend to creating all sorts of simulations and
graphs and analyses, but who in god's name decided that
the default graphical outputs should have like 2 point font
numbers and black backgrounds? And where do they live? |
|
|
Just to draw competitors' attention to the date - seven days remaining for entrants to make their submissions. |
|
|
Submissive entrants!! The horror, the horror. |
|
|
You can squeal like a pig, can't you, [xen] ... ? |
|
|
And what are the results? |
|
|
We don't yet have a definitive final solution. |
|
|
Shit ... you're all lazy bastards. |
|
|
Not sure whether I need to do anything here - the repo is
still up-to-date. I could run it again, but the only addon I
can think of including now is to steer it towards building up
a complete and ongoing incremental bakery backup. Though
I'm not sure releasing that for public consumption would
necessarily be a good thing. Longer term, it might be nice
to unSwiss the occasional posting after a rogue account
deletion, but most of the people who might have done that
probably already have, so it might be too late. |
|
| |