Massively Distributable Primary Key Generation using Prime Numbers

Prime-ary Index

The problem - databases tend to use an integer 'key' to identify rows in the database.

This is fine when you only have one process adding rows into the list, you just look up the last row entered, increment by one and that's your new index number. If you're clever and know how many records you're due to insert, you can do this once at the beginning of your batch, and then update the value at the end, keeping i/o to a minimum.

But as you add processes this gets trickier to manage - you could have each process allocate itself a 'block' to be filled in - but beyond a certain number of concurrent processes, this gets trickier to manage (a problem almost as tricky as the one you started off with)

You could pre-allocate a numeric code that describes the process currently inserting the data. e.g. if your index field is a 10 digit numeric, you could allocate the first 3 digits to represent your process number, and leave the remaining 7 to provide index numbers. Since each process id from 000 to 999 is unique, then all records sent by each process are protected against having index numbers that might 'clash' with those sent by a different process.

But this puts limits on both the number of processes you can run concurrently (999 in the example quoted) and also a limit on the number of records each process can generate (9999999 in the above example) before you run into a clash.

You could have all the parrallel processes generating temporary (unindexed) data and then use a single high-bandwidth process to upload in a single, streamlined method - but then you have to decide when to schedule such updates, and ensure that the data you upload doesn't change while you're uploading it.

You could generate a unique 'checksum' (perhaps an MD5 or other algorithm) to generate a hexadecimal key that should allow you to populate your index with a unique value - nice, but not ideal as you can't be 100% sure (99.99999% sure maybe, but not 100%) that you're not going to get a clash of keys.

So, the challenge is to squeeze as many records into the smallest numeric space possible, using multiple simultaneous processes without them

a) having to lookup each time to see which number is going to be free next, or
b) having to chat to one another in order to negotiate what numbers are free to use next.

Freakily, since we're adding naturally collected data, the frequency (and number) of times you're going to need multiple concurrent processes is likely to follow a sort of ln(x) distribution.

Also freakily, the distribution of prime numbers almost follows a ln(x) style distribution.

So, the solution is to maintain a big table of the first (say) 1 billion prime numbers.

Each time your load-balancing requirements need you to spawn a new load process, you assign it the next available prime number. So process 1 uses the 1st prime - (2) and process 2 uses the 2nd prime, (3) and process 3 uses the 3rd prime, (5) and so on.

We'll call this number P1, P2, P3...Pn - where Pn is the n'th processor, and is assigned the n'th prime.

Each processor keeps a track of how many items it has inserted, call this number i - it then determines the value of the i'th Prime Number - Pi(looking it up in the big table) and multiplies this number by its own process-assigned prime to get the next number in the sequence...S.

So

S = Pn x Pi

(you have to start from Pn^2 in order to skip a few trivial overlaps)

Once the batch is complete, the system stores the accumulated number of items it has inserted into the table, to be looked up the next time it is called upon to perform a batch.

The upshot, P1 always inserts items in the sequence 2, 4, 6, 10, 14, 22, 26, 34, 38; P2 inserts items in the sequence 9, 15, 21, 33, 39, 51, 57, 69, 87, 93, 111, 123, 129; P3 inserts items in the sequence 25, 35, 55, 65, 85, 95, 115, 145, 155 etc and so on.

This means that you are always confident that each processes data can be inserted without violating the numeric primary key - you are also confident that you are using the number-space provided in an efficient manner (i.e. you're not leaving too many gaps) and you can scale the number of concurrent processes as wide as you like without even having to think about changing your code - all you need to do is make sure you maintain a big enough list of prime numbers.

-- zen_tom,
Sep 20 2010

Another Idea using lots of primes

Rather different in detail, though. [Vernon, Sep 20 2010]

Clever - although might there not be a danger that the index number you're generating swiftly grows very big?

-- hippo,
Sep 20 2010

//The idea gets a [ ] since the negotiation between process you write off is exactly what your proposing by establishing N and P1 ... PN.//

But there's no inter-process negotiation - each process, identifiable by its process number, simply stores the value it last got to - or rather, the number of items it has cumulatively written (the actual value written is calculated - and I suppose, could be stored for information). None of the processes checks, or cares what any other process has gotten up to as each one is on an entirely independent 'domain' of numbers.

Also, assuming you are doing some level of batch processing (where each process knows the number of records it is going to write ahead of time) you can prefetch/pre-calculate the appropriate index numbers in memory, without worrying about potential conflicts.

In terms of using numeric 'domains' - different strategies might be applied where you already know how many domains you need to operate at once.

e.g.

If you only had two processes, you might give one of them even numbers, and the other one odd. Or if you had 3, you might give each a series of values linked to the determination of mod(3) - and that works out fine, for as long as you have the same number of processes -

but here, we have a dynamically assigned series of processes - up to a very large number - as long as no two processes are given the same id concurrently, there's no problem.

[hippo] yes, after you start using more than 1000 processes, the P1000 processor would be writing to the database in primal sequence increments of 7919 - also, to avoid trickyness, it would have to start at 62710561. But, and here's why I mentioned the ln(x) thing, assuming your load increases/decreases naturally, it would be on rarer occasions that the higher numbers are used, in comparison with say the first or second processes P1 and P2, who would be able to write a lot more data before getting to a number like 62710561.

At some point, we do run the risk of running out of numbers, but could add a date/time element into the key (something not suggested earlier) or some other trick that should restrict usage at the higher levels.

I'd like to use the same formula to allow process P1 to write all even numbers, but can't seem to write a single formula that does that, without also letting in conflicts as you add other processes.

i.e. S = Pn x Pi

which means that you can't get 8, 16, 32 or any other value who's prime factors are only 2 - leaving out quite a bit of numeric real-estate.

-- zen_tom,
Sep 20 2010

Ah, well...hmmm - as long as each process has access to the central database (a reasonable assumption since each one is adding data to that central db)

Each one could scan through the table of prime numbers to see which ones have already got processes assigned to them. So perhaps there's a control-style table with the following fields:

Number, PrimeNumber, BeingUsed, MACAddress

The process skips through this table until it finds a row who's BeingUsed value is 'FALSE' [select min (primenumber) from controlTable where BeingUsed = FALSE;] at which point it 'grabs' it, setting the BeingUsed value to 'TRUE' and (for good measure) populating the MAC (or some other address-type value) with its own identifier. It then is free to use that Prime number for any records that it wants to insert. Once complete, it signs out, tidily setting the BeingUsed flag back to 'FALSE' when it finishes.

The negotiation isn't strictly inter-process - in the same way that customers aren't strictly negotiating with one another when they take a ticket-number at the deli.

I've not come across mutex (mutual exclusion) scopes before - at least not during record creation - but presumably what we're talking about here is a high-level (although presumably the same thing could be implemented lower-down if it caught-on - isn't there an idea on here somewhere for a hardware chip for prime numbers?) implementation of a mutex-scope, is it not?

-- zen_tom,
Sep 20 2010

Have a table with a single identity column which is set to auto-increment by a certain value, and a column whose default value is the auto-increment step (if the table is created so the default value of the second column doesn't match the auto-increment size, bad things can happen). An application that wants to write some records creates a new row in this table with default values, uses @scope_identity() to read it, and then deletes it. The application will then know that it can create up to (identity step) rows in the main table starting with the reported identity value. Fully scalable for any number of processes; if one uses 64-bit numbers, one could allow for batch inserts of up to e.g. 10,000 rows and never use up the counter (if the system was processing a batches per second, it would take over a century to use up the counter). Not quite optimal if one is adding more than 10,000 rows per batch, but not too far off (one extra operation on that other table for every 10,000 rows added to the main one).

-- supercat,
Sep 20 2010

I'm not entirely clear what problem this is solving, or how it's solving it.

Why can't each process just have a reasonably large block of numbers, and grab a new block base number when it's exhausted?

You seem to have excluded this with something about not being able to negotiate it, but there must be *something* which gives out unique numbers, because all processes have unique IDs.

-- Loris,
Oct 30 2014

random,
halfbakery