
I have a requirement to develop a service that must generate a few million random, unique alphanumeric strings of length 10 each day (I can't increase this length; it is a customer prerequisite). Given an ID, the next ID must not be guessable.

I want to be sure that all IDs will be unique because those IDs will later be used in a database (not my own) and will represent the ID of a product, so each one MUST be unique. I know that the probability of a collision is low, but I want to prevent it.

I have no problem generating a random string, but I want to ensure that all generated IDs are really unique. I was thinking about using a local SQL database and inserting a few million IDs per day into it (which would ensure uniqueness, as the ID would be the primary key). I would then pick up those IDs, mark them as "processed" and send them. A rough sketch of the idea is below.
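Something like this is what I have in mind (SQLite, the table layout and the generator are just placeholders, not the real setup):

```python
# Rough sketch of the "primary key guarantees uniqueness" idea.
# SQLite, the table layout and the generator are placeholders.
import secrets
import sqlite3
import string

ALPHABET = string.ascii_letters + string.digits

def random_id() -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(10))

conn = sqlite3.connect("ids.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS ids (id TEXT PRIMARY KEY, processed INTEGER NOT NULL DEFAULT 0)"
)

def generate(n: int) -> None:
    inserted = 0
    while inserted < n:
        try:
            conn.execute("INSERT INTO ids (id) VALUES (?)", (random_id(),))
            inserted += 1
        except sqlite3.IntegrityError:
            pass  # duplicate ID: the primary key rejected it, just try again
    conn.commit()

generate(1000)
print(conn.execute("SELECT COUNT(*) FROM ids WHERE processed = 0").fetchone()[0])
```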

To improve insertion performance, I was thinking about having a table for each year (a partition?).

Do you think this is a good solution to ensure uniqueness? Would you have any other solution, maybe better than a SQL database?

Thanks.

Rouliboy
  • Your approach will work while your database contains unique strings. After that no guarantees... Why is it that the customer cannot simply verify that the string is not in use, and rerequest if it collides? – Kain0_0 Jun 27 '19 at 23:51
  • "the next ID must not be guessable" - this seems like a strange requirement for a product ID. If this is some form of security requirement, you'll want to look into cryptographic hashes instead of just "random" strings. – casablanca Jun 28 '19 at 04:19

2 Answers

3

Re-Think

You did not specify how hard the "not guessable" requirement is. Of course you would not use simple serial numbers, but I would not rule out a mechanism that can create unique strings using simple cryptographic techniques and does not need a database at all.

Risk evaluation

  • What are the IDs used for? Just identification of an item? Or are they used as some kind of license key or access token?

  • What is at stake? What damage happens when an ID is guessed? With millions of products per day, I don't assume that the value of each is in the hundreds of dollars (if it were, your customer would pay an expert who wouldn't ask on stack exchange :-) )

  • What level of adversaries do you expect? Ordinary users, script kiddies, determined hackers, state-level actors?

  • What profit can they gain, i.e. what amount of effort would they invest to hack your scheme?

If there is a real risk of being hacked and the possible damage is high, this scheme is probably too simple, but from the limited information you provide, I don't think so.

The mechanism

10 alphanumeric characters give roughly 60 bits (if you use uppercase, lowercase and digits: log2(62^10) ≈ 59.5). That's close enough to 64 that a 64-bit block cipher can be used here. So you start with just sequential numbers and encrypt them using a 64-bit block cipher (and a key that must be kept secret, of course), skipping those numbers whose encrypted value is out of bounds, i.e. ≥ 62^10 - on average, roughly 1 in 22 numbers is good.
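A quick back-of-the-envelope check of those numbers (assuming the 62-character alphanumeric alphabet):

```python
import math

ids = 62 ** 10                  # number of distinct 10-char alphanumeric strings
print(math.log2(ids))           # ≈ 59.54 bits, i.e. "roughly 60"
print(2 ** 64 / ids)            # ≈ 21.98, so roughly 1 in 22 encrypted values fits
```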

Since the block encryption is reversible, the resulting bit patterns are unique, and you can safely use them without storing them anywhere. By decrypting a given ID, you can verify that it was generated with your algorithm and key (if the decrypted value is less than the current value of the generator, it was created by this algorithm). If you store just the last value of the sequence-number generator for each day, you can even determine on which day a given ID was generated.
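A minimal sketch of this scheme, where a toy Feistel permutation stands in for a real 64-bit block cipher (in practice you would use an established cipher with a secret key); the alphabet, key and round function are illustrative assumptions:

```python
# Sketch of "encrypt a counter" ID generation. The 4-round Feistel network
# is only a stand-in for a proper 64-bit block cipher; key, alphabet and
# round function are assumptions for illustration.
import hashlib

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
MAX_ID = len(ALPHABET) ** 10        # 62^10: largest value a 10-char ID can hold
SECRET_KEY = b"keep-me-secret"      # placeholder key

def _round(half: int, i: int) -> int:
    """Keyed pseudo-random function for one Feistel round (32-bit output)."""
    data = SECRET_KEY + bytes([i]) + half.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def encrypt(n: int, rounds: int = 4) -> int:
    """Bijective permutation of a 64-bit integer (toy block cipher)."""
    left, right = n >> 32, n & 0xFFFFFFFF
    for i in range(rounds):
        left, right = right, left ^ _round(right, i)
    return (left << 32) | right

def decrypt(n: int, rounds: int = 4) -> int:
    """Inverse of encrypt(); lets you check that an ID came from this scheme."""
    left, right = n >> 32, n & 0xFFFFFFFF
    for i in reversed(range(rounds)):
        left, right = right ^ _round(left, i), left
    return (left << 32) | right

def to_base62(n: int) -> str:
    chars = []
    for _ in range(10):
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def next_ids(start_counter: int, count: int):
    """Yield (counter, id) pairs, skipping counters whose encryption is >= 62^10."""
    counter, produced = start_counter, 0
    while produced < count:
        value = encrypt(counter)
        if value < MAX_ID:
            yield counter, to_base62(value)
            produced += 1
        counter += 1

for counter, id_ in next_ids(0, 5):
    # Decode base62 back to the 64-bit value and decrypt to recover the counter.
    value = sum(62 ** (9 - i) * ALPHABET.index(c) for i, c in enumerate(id_))
    assert decrypt(value) == counter
    print(counter, id_)
```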

Of course, once your algorithm and the block cipher being used are known, breaking the key requires "just" a number of IDs known to be sequential, plus enough time and computing power. State-level actors could probably do this; I'm not sure about determined hackers.

Hans-Martin Mosner
  • Thanks for your great answer, I'm gonna investigate this and come back to you – Rouliboy Jun 28 '19 at 10:20
  • Here's a python program demonstrating the concept, even includes a history file: https://pastebin.com/eANVbxE5 (note this contains several errors in the comments, didn't proofread, but the code works) – Hans-Martin Mosner Jun 29 '19 at 11:27
1

Your approach is fine. Only the new table for each year is debatable: it trades insertion speed for query speed (you will then have to query multiple tables to make sure your uniqueness requirement is fulfilled), so it is not inherently clear whether this will really bring the benefit you expect from it.

I would actually keep away from this premature optimization and start thinking about improvements when performance degradation becomes a real issue. Then it is probably soon enough to reorganize the db, and you will have numbers at hand to make a better decision on how to partition it. Maybe it is faster to have a table per month. Maybe it is better to use a DB which can do the partitioning for you "under the hood". Maybe it is better to partition by splitting tables according to the first letter(s) of your randomly generated string, who knows. But I am pretty sure it is currently too early to decide on "partitions per year".

Doc Brown
  • Thanks for your answer. I was thinking about having a table for each year because with hundreds of millions of records in a single table, inserts become slower and slower. So having a dedicated table for a single year would cap the number of records per table at about 400 million, and insert/query/update would be faster. I was also thinking about encoding the year in the first 2 digits of the ID. – Rouliboy Jun 28 '19 at 06:59
  • @Rouliboy: see my edit. Sure, you can also encode the year into the first 2 digits which trades performance for "randomness". But as I said, maybe "per year" turns out to be the wrong granularity, so better try to find a way to defer this decision to a point in time where you have numbers at hand. – Doc Brown Jun 28 '19 at 07:11
  • @Rouliboy which database causes inserts to become slower as the table grows? Or do you mean the unique index checking will become slower? – Kayaman Jun 28 '19 at 09:16
  • @Kayaman : I mean unique index (i.e my ID) checking ;) – Rouliboy Jun 28 '19 at 10:09
  • @Rouliboy with the time complexity of a unique index (depending on whether it's implemented as a hash / b-tree / whatever), it's probably the last place where you should start improving your performance. – Kayaman Jun 28 '19 at 10:15