Half a croissant, on a plate, with a sign in front of it saying '50c'
h a l f b a k e r y
"More like a cross between an onion, a golf ball, and a roman multi-tiered arched aquaduct."

idea: add, search, annotate, link, view, overview, recent, by name, random

meta: news, help, about, links, report a problem

account: browse anonymously, or get an account and write.

user:
pass:
register,


       

fast-grep

JIT compiler for grep
 
(0)
  [vote for,
against]

Sometimes I want to grep a command line output with maybe five lines of input, and sometimes I want to grep gigabytes of data against a file full of patterns. Proposed is a "--just- in-time-compiler" command line option in grep where certain costs are paid up front in order to optimize throughput of larger data sets. Without the flag, the existing version runs to avoid compiling costs.
kevinthenerd, Apr 04 2014

[link]






       Ah yes.
MaxwellBuchanan, Apr 04 2014
  

       The compiling overhead isn't that high ... use shell script, shirley ?
8th of 7, Apr 04 2014
  

       // Already done in existing grep // Are you sure about that? No doubt it parses the text search string into some intermediate binary form that is more efficient as the grep code loops through the files, but I seriously doubt that grep writes native assembly code optimized for the regular expression you just entered or writes C++ code and calls the C++ compiler.   

       When I searched for regular expression compilers I found lots of references that turn out to be compiling to an intermediate form. My quick search found one academic paper about an actual compiler for regular expressions, but it wasn't clear if there had ever been a usable implementation. I suspect alot of people have started such projects but I don't expect you would be that highly compensated if you managed to finish a useful version.   

       It's hard to guess how much optimization would actually be achievable. For a simple search criteria, say seaching for [aeiou][rstln], uncompiled code would might have a loop to compares the current data from the file to each of the 5 characters it is looking for. Compiled code for this purpose would probably be written to be space innefficient, so the loop would be unrolled into 5 comparisons to immediate values. I could easily see code execution time for the comparison cut by a factor of 2 to 4. With a really smart compiler you might be able to take advantage of multi-core processors to optimize it even more, and if it's aware of the cache size, if the unrolled code wouldn't fit in cache, it might compile an actual loop rather than an unrolled loop to avoid cache misses.   

       Of course for most search terms that won't make a difference because the limiting factor is the hard drive, so the CPU does all the procesing while waiting for the drive. If you do have a whole file full of patterns you're searching against, maybe you could see some benefit.
scad mientist, Apr 04 2014
  
      
[annotate]
  


 

back: main index

business  computer  culture  fashion  food  halfbakery  home  other  product  public  science  sport  vehicle