Halfbakery: Massively Distributable Primary Key Generation using Prime Numbers

The problem - databases tend to use an integer 'key' to identify rows in the database.

This is fine when you only have one process adding rows into the list, you just look up the last row entered, increment by one and that's your new index number. If you're clever and know how many records you're due to insert, you can do this once at the beginning of your batch, and then update the value at the end, keeping i/o to a minimum.

But as you add processes this gets trickier to manage - you could have each process allocate itself a 'block' to be filled in - but beyond a certain number of concurrent processes, this gets trickier to manage (a problem almost as tricky as the one you started off with)

You could pre-allocate a numeric code that describes the process currently inserting the data. e.g. if your index field is a 10 digit numeric, you could allocate the first 3 digits to represent your process number, and leave the remaining 7 to provide index numbers. Since each process id from 000 to 999 is unique, then all records sent by each process are protected against having index numbers that might 'clash' with those sent by a different process.

But this puts limits on both the number of processes you can run concurrently (999 in the example quoted) and also a limit on the number of records each process can generate (9999999 in the above example) before you run into a clash.

You could have all the parrallel processes generating temporary (unindexed) data and then use a single high-bandwidth process to upload in a single, streamlined method - but then you have to decide when to schedule such updates, and ensure that the data you upload doesn't change while you're uploading it.

You could generate a unique 'checksum' (perhaps an MD5 or other algorithm) to generate a hexadecimal key that should allow you to populate your index with a unique value - nice, but not ideal as you can't be 100% sure (99.99999% sure maybe, but not 100%) that you're not going to get a clash of keys.

So, the challenge is to squeeze as many records into the smallest numeric space possible, using multiple simultaneous processes without them
a) having to lookup each time to see which number is going to be free next, or b) having to chat to one another in order to negotiate what numbers are free to use next.

Freakily, since we're adding naturally collected data, the frequency (and number) of times you're going to need multiple concurrent processes is likely to follow a sort of ln(x) distribution.

Also freakily, the distribution of prime numbers almost follows a ln(x) style distribution.

So, the solution is to maintain a big table of the first (say) 1 billion prime numbers.

Each time your load-balancing requirements need you to spawn a new load process, you assign it the next available prime number. So process 1 uses the 1st prime - (2) and process 2 uses the 2nd prime, (3) and process 3 uses the 3rd prime, (5) and so on.

We'll call this number P1, P2, P3...Pn - where Pn is the n'th processor, and is assigned the n'th prime.

Each processor keeps a track of how many items it has inserted, call this number i - it then determines the value of the i'th Prime Number - Pi(looking it up in the big table) and multiplies this number by its own process-assigned prime to get the next number in the sequence...S.
So
S = Pn x Pi
(you have to start from Pn^2 in order to skip a few trivial overlaps)

Once the batch is complete, the system stores the accumulated number of items it has inserted into the table, to be looked up the next time it is called upon to perform a batch.

The upshot, P1 always inserts items in the sequence 2, 4, 6, 10, 14, 22, 26, 34, 38; P2 inserts items in the sequence 9, 15, 21, 33, 39, 51, 57, 69, 87, 93, 111, 123, 129; P3 inserts items in the sequence 25, 35, 55, 65, 85, 95, 115, 145, 155 etc and so on.

This means that you are always confident that each processes data can be inserted without violating the numeric primary key - you are also confident that you are using the number-space provided in an efficient manner (i.e. you're not leaving too many gaps) and you can scale the number of concurrent processes as wide as you like without even having to think about changing your code - all you need to do is make sure you maintain a big enough list of prime numbers.