CSPRNG: Any time guarantees? - cryptography

Does a cryptographically secure pseudorandom number generator also guarantee that entropy is gathered in such a way that the same value cannot occur twice when generated at different times?
I know it's highly unlikely already, but are there specific guarantees?
I need to generate a series of unique IDs from a CSPRNG that must not have conflicts.

An ideal (CS)PRNG assures you that the probability of extracting a certain value is constant and does not change over time, no matter whether that value was already output in the past.
For instance, let's assume your ID is 32 bits long and today you extract 0x12345678. What just happened had a probability of 1/(2^32).
Tomorrow (and at any point in the future), you will still have the same probability 1/(2^32) of extracting the value 0x12345678.
However, the birthday paradox tells us that if you generate 65 536 (=2^(32/2)) values, there is already a probability of about 39% that two IDs are the same, and at roughly 77 000 values it reaches 50%.
In other words, there are no hard guarantees the output of the CSPRNG will not be the same. Whether the chances are sufficiently small strongly depends on how long your ID is and how many IDs you expect to have in total over the whole lifetime of your system (special attention should be paid to security concerns when the attacker can generate IDs at will).
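The birthday estimate can be checked numerically. The sketch below uses the standard approximation p ≈ 1 - exp(-n(n-1)/(2·2^b)) for n uniformly random b-bit values; the function names are made up for illustration.

```python
import math


def collision_probability(id_bits: int, n_ids: int) -> float:
    """Approximate probability that at least two of n_ids uniformly random
    id_bits-bit values coincide (standard birthday approximation)."""
    space = 2 ** id_bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


def ids_for_probability(id_bits: int, p: float) -> int:
    """Approximate number of IDs at which the collision probability reaches p."""
    space = 2 ** id_bits
    return math.ceil(math.sqrt(2 * space * math.log(1 / (1 - p))))


if __name__ == "__main__":
    print(collision_probability(32, 65_536))   # 32-bit IDs, 2^16 of them
    print(ids_for_probability(32, 0.5))        # IDs needed for a 50% chance
    print(collision_probability(128, 10**9))   # 128-bit IDs, a billion of them
```

Running the same function with your own ID length and expected lifetime count of IDs shows whether the residual risk is acceptable for your system.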
For completeness, all of that applies to any good random number generator, including something as simple as flipping a coin. Cryptographically strong PRNGs have additional properties: predicting future or past outputs from any given output should be hard, the generator should be able to recover from a compromise of its state, and it should be possible to feed it fresh entropy.
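Since the generator itself gives no uniqueness guarantee, uniqueness has to be enforced by the application. A minimal sketch, assuming the set of already issued IDs fits in memory (in a real system this would more likely be a unique constraint in the database); `new_unique_id` is a hypothetical helper, while `secrets.token_hex` is the standard-library interface to the OS CSPRNG.

```python
import secrets


def new_unique_id(existing: set[str], n_bytes: int = 16) -> str:
    """Draw IDs from the OS CSPRNG until one is not already in `existing`."""
    while True:
        candidate = secrets.token_hex(n_bytes)  # 2 * n_bytes hex characters
        if candidate not in existing:
            existing.add(candidate)  # remember it so it is never issued twice
            return candidate


if __name__ == "__main__":
    issued: set[str] = set()
    print(new_unique_id(issued))
```

With 16 random bytes (128 bits) the retry branch will practically never run, but it turns "highly unlikely" into an actual guarantee.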

Related

Likelihood of Collision

I want to hash an internal account number and use the result as a unique public identifier for an account record. The identifier is limited to 40 characters. I have approximately 250 records with unique account numbers.
Which of the following is less likely to result in a collision?
Taking the SHA-1 of the SHA-256 hash of the account number.
Taking the SHA-256 of the account number and picking out 40 characters.
These approaches are identical (*), so you should use the second one. There is no reason to inject SHA-1 into the system. Any selection of bits out of SHA-256 is independent and "effectively random."
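A minimal sketch of the second approach, assuming the account number is available as a string and that hex encoding is acceptable for the 40 characters (`public_id` is a name chosen here for illustration):

```python
import hashlib


def public_id(account_number: str, length: int = 40) -> str:
    """Return the first `length` hex characters of SHA-256(account_number)."""
    digest = hashlib.sha256(account_number.encode("utf-8")).hexdigest()  # 64 hex chars
    return digest[:length]


if __name__ == "__main__":
    print(public_id("0001234567"))
```

Forty hex characters carry 160 bits of the digest, the same output size as SHA-1.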
An alternate solution that may be convenient is to turn these into v5 UUIDs. If you keep your namespace secret (which is allowed), this may be a very nice way to do what you're describing.
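For illustration, a v5 UUID can be derived with the standard `uuid` module. The namespace value below is a placeholder; in practice you would generate your own once (for example with `uuid.uuid4()`) and keep it private.

```python
import uuid

# Hypothetical namespace, shown only as an example value; use your own secret one.
ACCOUNT_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def account_uuid(account_number: str) -> str:
    """Deterministic v5 UUID for an account number within the namespace."""
    return str(uuid.uuid5(ACCOUNT_NAMESPACE, account_number))


if __name__ == "__main__":
    print(account_uuid("0001234567"))
```

The canonical text form is 36 characters long, so it fits within the 40-character limit.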
(*) There are some subtleties around the fact that you're using "characters" rather than bytes here, and you could get a larger space in 40 "characters" by using a better encoding than you're likely using. It's possible the spaces are a little different based on how you're actually encoding. But it deeply doesn't matter. These spaces are enormous, and the two approaches will be the same in practice, so use the one that only needs one algorithm.
Another approach that may meet your needs better is to stretch the identifiers. If the space is sufficiently sparse (i.e if the number of possible identifiers is dramatically larger than the number of actually used identifiers), then stretching algorithms like PBKDF2 are designed to handle exactly that. They are expensive to compute, but you can tune their cost to match your security requirements.
The general problem with just hashing is that hashes are very fast, and if your space of possible identifiers is very small, then it's easy to brute force. Stretching algorithms make the cost of guessing arbitrarily expensive, so large spaces are impractical to brute force. They do this without requiring any secrets, which is nice. The general approach is:
Select a "salt" value. This can be publicly known. It does not matter. For this particular use case, because every account number is different, you can select a single global salt. (If the protected data could be the same, then it's important to have different salts for each record.)
Compute PBKDF2(salt, iterations, length, payload)
The number of iterations tunes how slow this operation is. The output is "effectively random" (just like a hash) and can be used in the same ways.
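The two steps can be sketched with `hashlib.pbkdf2_hmac` from the standard library. The salt value, the default iteration count, and the function name are assumptions made for this example; the iteration count in particular should be tuned on your own hardware.

```python
import hashlib

# Hypothetical global salt; it may be public, but it must stay fixed.
GLOBAL_SALT = b"example-global-salt"


def stretched_id(account_number: str,
                 iterations: int = 600_000,  # placeholder, tune to your time budget
                 length: int = 20) -> str:
    """Derive `length` bytes with PBKDF2-HMAC-SHA256 and return them as hex."""
    derived = hashlib.pbkdf2_hmac(
        "sha256",
        account_number.encode("utf-8"),  # payload
        GLOBAL_SALT,
        iterations,
        dklen=length,
    )
    return derived.hex()  # 20 bytes -> 40 hex characters


if __name__ == "__main__":
    print(stretched_id("0001234567"))
```

Note that changing the salt or the iteration count later changes every identifier, so both need to be fixed before the IDs are published.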
A common target for iterations is a value that delivers around 80-100ms. This is fairly fast on the server, but is extremely slow for brute-forcing large spaces, even if the attacker has better hardware than yours. Ideally your space should take at least millions of years to brute force (seriously; this is the kind of headroom we typically like in security; I personally target trillions of years). If it's smaller than a few years, then it probably can be brute forced quickly by throwing more hardware at it.
(Of course all of these choices can be tuned based on your attack model. It depends on how dedicated and well-funded you expect attackers to be.)
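To get a feel for the headroom, the time to exhaust a space is simply its size times the cost per guess, divided by the attacker's parallelism. The figures in this sketch (10-digit account numbers, 100 ms per guess, 1000 machines) are illustrative assumptions, not values from the question.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def brute_force_years(space_size: int,
                      seconds_per_guess: float,
                      parallel_guessers: int = 1) -> float:
    """Years needed to try every value in the space at the given cost per guess."""
    return space_size * seconds_per_guess / parallel_guessers / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Assumed: 10**10 possible account numbers, 0.1 s per PBKDF2 evaluation.
    print(brute_force_years(10**10, 0.1))         # a single machine
    print(brute_force_years(10**10, 0.1, 1000))   # 1000 machines in parallel
```

Under these assumptions a single machine needs a few decades, but a thousand machines need only days, which is why a small input space cannot be rescued by stretching alone.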
A 40 character ID is 320 bits, which gives you plenty of space. With only 250 records, you can easily fit a unique counter into that. Three digits take up only 24 bits (three characters), and you have the range 000 to 999 to play with. Fill up the rest of the ID with, say, the hex expression of part of the SHA-256 hash. With a 3-digit counter, that leaves 37 places for hex, which covers 37*4 = 148 bits of the SHA-256 output.
You may want to put the counter in the middle of the hex string in a fixed position instead of at the beginning or end to make it less obvious.
<11 hex chars><3 digit ID><26 hex chars>
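A minimal sketch of that layout, assuming the counter is assigned elsewhere and is unique per record (`counter_id` is a hypothetical helper):

```python
import hashlib


def counter_id(account_number: str, counter: int) -> str:
    """Build <11 hex chars><3 digit counter><26 hex chars>, 40 characters total."""
    if not 0 <= counter <= 999:
        raise ValueError("counter must fit in three digits")
    digest = hashlib.sha256(account_number.encode("utf-8")).hexdigest()
    # 11 + 26 = 37 hex characters, i.e. 148 bits of the SHA-256 output.
    return f"{digest[:11]}{counter:03d}{digest[11:37]}"


if __name__ == "__main__":
    print(counter_id("0001234567", 7))
```

Because the counter is unique, two IDs can never collide even if the hash parts happened to be equal.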

Redis bitmap split key division strategy

I'm grabbing and archiving A LOT of data from the Federal Elections Commission public data source API which has a unique record identifier called "sub_id" that is a 19 digit integer.
I'd like to think of a memory efficient way to catalog which line items I've already archived and immediately redis bitmaps come to mind.
Reading the documentation on redis bitmaps indicates a maximum storage length of 2^32 (4294967296).
A 19 digit integer could theoretically range anywhere from 0000000000000000001 to 9999999999999999999. Now I know that the datasource in question does not actually have 10 quintillion records, so the ids are clearly sparsely populated and not sequential. Of the data I currently have on file, the maximum ID is 4123120171499720404 and the minimum is 1010320180036112531. (I can tell the ids are date based because the 2017 and 2018 in the keys correspond to the dates of the records they refer to, but I can't suss out the rest of the pattern.)
If I wanted to store which line items I've already downloaded, would I need 2328306437 different redis bitmaps? (9999999999999999999 / 4294967296 = 2328306436.54, rounded up.) I could probably work up a tiny algorithm that, given a 19 digit id, divides by some constant to determine which split bitmap index to check.
There is no way this strategy seems tenable so I'm thinking I must be fundamentally misunderstanding some aspect of this. Am I?
A Bloom Filter such as RedisBloom will be an optimal solution (RedisBloom can even grow if you miscalculated your desired capacity).
After you BF.CREATE your filter, you pass to BF.ADD an 'item' to be inserted. This item can be as long as you want. The filter uses hash functions and modulus to fit it to the filter size. When you want to check whether the item was already added, call BF.EXISTS with the 'item'.
In short, what you describe here is a classic example for when a Bloom Filter is a great fit.
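To illustrate the "hash functions and modulus" idea, here is a small in-memory Bloom filter. It is a teaching sketch only, not how RedisBloom is implemented; the class, its sizing formulas (the textbook ones), and the double-hashing scheme are choices made for this example.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits  # modulus fits it to the filter size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter(capacity=1_000_000)
    seen.add("4123120171499720404")
    print("4123120171499720404" in seen)  # an added item is always reported as present
    print(len(seen.bits))                 # bytes used for one million items at 1% error
```

A "not present" answer is always correct, so an id reported as unseen can be downloaded safely; a "present" answer is occasionally wrong, meaning a small fraction of records may be skipped unless you double-check them.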
How many "items" are there? What is "A LOT"?
Anyway. A linear approach that uses a single bit to track each of the 10^19 potential items requires 1250 petabytes at least. This makes it impractical (atm) to store it in memory.
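The arithmetic behind that figure, together with the split-key computation the question describes, can be written out as follows. `bitmap_location` is a hypothetical helper that assumes each Redis bitmap holds 2^32 bits.

```python
POTENTIAL_IDS = 10 ** 19   # every 19-digit value
BITMAP_BITS = 2 ** 32      # maximum bits per Redis bitmap, as quoted in the question


def petabytes_for_linear_bitmap(potential_ids: int = POTENTIAL_IDS) -> float:
    """Storage needed at one bit per potential id, in petabytes (10**15 bytes)."""
    return potential_ids / 8 / 10 ** 15


def bitmaps_needed(potential_ids: int = POTENTIAL_IDS) -> int:
    """Number of 2**32-bit bitmaps required to cover the whole id range."""
    return -(-potential_ids // BITMAP_BITS)  # ceiling division on integers


def bitmap_location(sub_id: int) -> tuple[int, int]:
    """Return (bitmap index, bit offset inside that bitmap) for a sub_id."""
    return divmod(sub_id, BITMAP_BITS)


if __name__ == "__main__":
    print(petabytes_for_linear_bitmap())
    print(bitmaps_needed())
    print(bitmap_location(4123120171499720404))
```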
I would recommend that you teach yourself about probabilistic data structures in general, and after having grokked the tradeoffs look into using something from the RedisBloom toolbox.
If the ids are not sequential and are very spread out, keeping track of which ones you processed using a bitmap is not the best option, since it would waste a lot of memory.
However, it is hard to point the best solution without knowing the how many distinct sub_ids your data set has. If you are talking about a few 10s of millions, a simple set in Redis may be enough.

Plone - ZODB catalog query sort_on multiple indexes?

I have a ZODB catalog query with a start and end date. I want to sort the result on end_date first and then start_date second.
Sorting on either end_date or start_date works fine.
I tried with a tuple (start_date,end_date), but with no luck.
Is there a way to achieve this or does one have to employ some custom logic afterwards?
The generalized answer ought to be: post-hoc sort your entire result set of catalog brains, using zope.sequencesort (available on PyPI, but already shipped with Plone) or similar.
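As a rough illustration of the post-hoc sort, plain Python's `sorted` with a tuple key is enough. The sketch below uses stand-in objects rather than real catalog brains, and it assumes that `start_date` and `end_date` are available as attributes on each brain (that is, stored as catalog metadata columns).

```python
from collections import namedtuple
from datetime import date

# Stand-in for a catalog brain; real brains come from a catalog query.
Brain = namedtuple("Brain", ["title", "start_date", "end_date"])


def sort_by_end_then_start(brains):
    """Sort on end_date first, then start_date as the tie-breaker."""
    return sorted(brains, key=lambda brain: (brain.end_date, brain.start_date))


if __name__ == "__main__":
    results = [
        Brain("c", date(2020, 3, 1), date(2020, 6, 1)),
        Brain("a", date(2020, 2, 1), date(2020, 5, 1)),
        Brain("b", date(2020, 1, 1), date(2020, 6, 1)),
    ]
    for brain in sort_by_end_then_start(results):
        print(brain.title, brain.end_date, brain.start_date)
```

Sorting this way materializes the whole lazy result sequence, which is fine for modest result sets and is exactly the cost the optimizations below try to avoid.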
The more complex answer is a rabbit-hole of optimizations that you should only go down if you know you need to and know what you are doing:
Make sure when you do sort the brains that your user gets a sticky session to the same instance, at least for cache-affinity to get the same catalog indexes and brains (metadata);
You might want to cache across requests (thread-global) a unique session id, and a sequence of catalog RID (integer) values for your entire sorted result, should you expect the user to come back and need subsequent batches. Of course, RIDs need to be re-constituted into ZCatalog's lazy-sequences of brains, and this requires some know-how (or reading the source).
Finally, for large result sets (many thousands), I would suggest that it is reasonable to make application-specific compromises that approximate a correct ordering by post-hoc sorting only the current batch through to the end of the n batches after it, where n is inversely proportional to len(site.portal_catalog.uniqueValuesFor(indexnamehere)). For a large set of results, the correctness of an approximated secondary sort is high when the values vary a lot and low when they vary little (many items sharing the same secondary value, with a count much larger than the batch size, can make this frustrating).
Do not optimize as such unless you are dealing with particularly large result sets.
It should go without saying: if you do optimize, you need to verify that you are actually getting a superior result (profile and benchmark). If you cannot justify investing the time to do this, you cannot justify optimizing.

Java EE/SQL: Is there a significant performance lag between primary key types?

Currently I am involved in learning some basics of the Java EE technology. I encountered a particular project and took a deeper look into the underlying database structure.
On server-side I investigated a Java function that creates a primary key with a length of 32 characters (based on concatenating the time, a random hash, and an additional cryptographic nonce).
I am interested in an estimate of the performance loss caused by using such a primary key. If there is no security reason to create this kind of unique ID, wouldn't it be much better to let the underlying database create new increasing primary keys, starting at 0?
Wouldn't a SQL/JQL search be much faster when using numbers instead of strings?
Using numbers will probably be faster, but you should measure it with a test case if you need the performance ratio between both options.
I don't think number comparison vs string comparison will give a big performance advantage by itself. However:
larger fields typically mean fewer rows per table block, so you have to read more blocks from the DB in case of a full scan (it will be slower)
accordingly, larger keys typically mean fewer keys per index block, so you have to read more index blocks in case of index scans (it will be slower)
larger fields are, well, larger, so by definition they are less space-efficient.
Note that we are talking about data size and not data type: most likely an 8-byte integer will not be significantly more efficient than an 8-byte string.
Note also that random IDs are usually more "clusterable" than sequence numbers, as sequences / autonumerics need to be administered centrally (although this can be mitigated using techniques such as the Hi-Lo algorithm, which most current persistence frameworks support).
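For readers unfamiliar with Hi-Lo, the idea is that each node fetches a "hi" value from the central sequence only once per block and then hands out the "lo" values locally. The sketch below is a simplified illustration in Python, not the implementation of any particular persistence framework; the class name and the `next_hi` callback are made up for the example.

```python
import itertools


class HiLoGenerator:
    """Hand out ids locally, contacting the central sequence once per block."""

    def __init__(self, next_hi, block_size: int = 100):
        self._next_hi = next_hi        # callable returning the next central "hi" value
        self._block_size = block_size
        self._hi = None
        self._lo = block_size          # forces a fetch on first use

    def next_id(self) -> int:
        if self._lo >= self._block_size:
            self._hi = self._next_hi()  # the only centrally coordinated step
            self._lo = 0
        value = self._hi * self._block_size + self._lo
        self._lo += 1
        return value


if __name__ == "__main__":
    central = itertools.count()  # stands in for a database sequence
    node_a = HiLoGenerator(lambda: next(central))
    node_b = HiLoGenerator(lambda: next(central))
    print([node_a.next_id() for _ in range(3)])
    print([node_b.next_id() for _ in range(3)])
```

The block size trades central round trips against gaps in the id range when a node restarts with an unfinished block.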

Increase increment size to match GUID advantage

I've been thinking of implementing this system, but can't help but feel there's a catch somewhere. One of the points of using GUID over incrementing int is that, in the future, if you were to merge databases together, you wouldn't have any clashes over the primary key/identifier. However, my approach is to set the increment size to X where X is the number of servers I'll most likely have in the future. Then, on each server, have the seed be an increment over the seed number on the previous server. That way, during merging, there would be no clashes with the primary key. Is this a safe, normal method or have I gone mental :)?
Thanks
In multi-master SQL replication, you typically have primary keys defined as:
GUIDs
ints with an increment size > number of installs
ints with a fixed offset
The downside of GUIDs is they can be harder to read and take up slightly more space. However, it allows you to scale to n instances.
Integers are a bit easier to deal with. They also have the advantage of being able to easily tell which server created a record. The downside is you must either predict the maximum number of databases which might be merged, or guess the maximum number of rows a single instance might insert.
An example of a fixed offset is: site A starts at 0, site B starts at 1,000,000 and site C starts at 2,000,000. This scheme works fine until one site inserts one million rows. This scheme might work well for cars at car dealerships, where it's unlikely that any one dealer would ever sell more than 1,000,000 cars, and you might have hundreds of dealerships over the life of the application.
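Both integer schemes are easy to model. The sketch below is only an illustration of the arithmetic (a real setup would configure the seed and increment in the database itself); the function names and the zero-based server index are assumptions.

```python
from itertools import count, islice


def interleaved_ids(server_index: int, num_servers: int):
    """Increment scheme: server k produces k, k + N, k + 2N, ..."""
    return count(start=server_index, step=num_servers)


def offset_ids(server_index: int, block_size: int = 1_000_000):
    """Fixed-offset scheme: server k owns [k * block_size, (k + 1) * block_size)."""
    start = server_index * block_size
    return iter(range(start, start + block_size))


if __name__ == "__main__":
    for server in range(3):
        print(server, list(islice(interleaved_ids(server, 3), 5)))
    for server in range(3):
        print(server, list(islice(offset_ids(server), 3)))
```

In the increment scheme the limit is the number of servers fixed up front; in the offset scheme the limit is the number of rows per site, after which the iterator is exhausted and ids would spill into the next site's range.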
What scares me here is your use of "most likely". You're making assumptions about the future here, and typically that's not a good idea with things like this. Why not use a GUID?
What if you add one extra server over what you thought you'd have? I could see things getting really complicated really quickly.