Likelihood of Collision - cryptography

I want to hash an internal account number and use the result as a unique public identifier for an account record. The identifier is limited to 40 characters. I have approximately 250 records with unique account numbers.
Which is less likely to result in a collision?
1. Taking the SHA-1 of the SHA-256 hash of the account number.
2. Taking the SHA-256 of the account number and picking out 40 characters.

These approaches are effectively identical (*), so you should use the second one. There is no reason to inject SHA-1 into the system. Any selection of bits out of SHA-256 is independent and "effectively random."
An alternate solution that may be convenient is to turn these into v5 UUIDs. If you keep your namespace secret (which is allowed), this may be a very nice way to do what you're describing.
(*) There are some subtleties around the fact that you're using "characters" rather than bytes here, and you could get a larger space in 40 "characters" by using a better encoding than you're likely using. It's possible the spaces are a little different depending on how you're actually encoding. But it deeply doesn't matter. These spaces are enormous, and the two approaches will be the same in practice, so use the one that only needs one algorithm.
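If it helps, here is a minimal Python sketch of both ideas (the account number and the secret namespace below are made-up placeholders):

    import hashlib
    import uuid

    account_number = "12345678"  # hypothetical internal account number

    # Option 2 from the question: SHA-256, keeping the first 40 hex characters.
    # 40 hex characters = 160 bits, far more than enough for 250 records.
    digest = hashlib.sha256(account_number.encode("utf-8")).hexdigest()
    public_id = digest[:40]

    # Alternative: a version-5 UUID under a namespace you keep secret.
    # Generate the namespace once (e.g. with uuid4), store it privately, and reuse it.
    SECRET_NAMESPACE = uuid.uuid4()  # placeholder; persist the real value
    public_uuid = uuid.uuid5(SECRET_NAMESPACE, account_number)

    print(public_id)    # 40 hex characters
    print(public_uuid)  # 36-character UUID string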
Another approach that may meet your needs better is to stretch the identifiers. If the space is sufficiently sparse (i.e., if the number of possible identifiers is dramatically larger than the number of identifiers actually used), then stretching algorithms like PBKDF2 are designed to handle exactly that. They are expensive to compute, but you can tune their cost to match your security requirements.
The general problem with just hashing is that hashes are very fast, and if your space of possible identifiers is very small, then it's easy to brute force. Stretching algorithms make the cost of guessing arbitrarily expensive, so even small spaces become impractical to brute force. They do this without requiring any secrets, which is nice. The general approach is:
Select a "salt" value. This can be publicly known. It does not matter. For this particular use case, because every account number is different, you can select a single global salt. (If the protected data could be the same, then it's important to have different salts for each record.)
Compute PBKDF2(salt, iterations, length, payload)
The number of iterations tunes how slow this operation is. The output is "effectively random" (just like a hash) and can be used in the same ways.
A common target for iterations is a value that delivers around 80-100 ms per derivation. This is fairly fast on the server, but is extremely slow for brute-forcing large spaces, even if the attacker has better hardware than yours. Ideally your space should take at least millions of years to brute force (seriously; this is the kind of headroom we typically like in security; I personally target trillions of years). If brute forcing it would take less than a few years, then it probably can be done quickly by throwing more hardware at it.
(Of course all of these choices can be tuned based on your attack model. It depends on how dedicated and well-funded you expect attackers to be.)
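A rough sketch of the stretching approach with Python's standard library; the salt, iteration count, and account number are placeholders you would tune and persist yourself:

    import hashlib
    import os

    # A single global salt is fine here because every account number is unique.
    # Generate it once, store it with your configuration, and reuse it.
    SALT = os.urandom(16)     # placeholder; persist the real value
    ITERATIONS = 200_000      # tune until one call takes roughly 80-100 ms on your server
    DKLEN = 20                # 20 bytes -> 40 hex characters

    def stretched_id(account_number: str) -> str:
        derived = hashlib.pbkdf2_hmac(
            "sha256",
            account_number.encode("utf-8"),
            SALT,
            ITERATIONS,
            dklen=DKLEN,
        )
        return derived.hex()  # 40-character identifier

    print(stretched_id("12345678"))  # hypothetical account number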

A 40-character ID is 320 bits, which gives you plenty of space. With only 250 records, you can easily fit a unique counter into that. A three-digit counter takes up only 3 of the 40 characters (24 bits), and you have the range 000 to 999 to play with. Fill up the rest of the ID with, say, the hexadecimal representation of part of the SHA-256 hash. With a 3-digit counter, that leaves 37 places for hex, which covers 37 * 4 = 148 bits of the SHA-256 output.
You may want to put the counter in the middle of the hex string in a fixed position instead of at the beginning or end to make it less obvious.
<11 hex chars><3 digit ID><26 hex chars>
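A small sketch of that layout in Python, assuming the counter stays in the range 000-999:

    import hashlib

    def make_id(account_number: str, counter: int) -> str:
        # Build a 40-character ID: 11 hex chars, a 3-digit counter, 26 hex chars.
        hex_digest = hashlib.sha256(account_number.encode("utf-8")).hexdigest()
        hex_part = hex_digest[:37]  # 37 hex chars = 148 bits of the hash
        return hex_part[:11] + f"{counter:03d}" + hex_part[11:]

    print(make_id("12345678", 42))  # 40 characters; positions 12-14 hold the counter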

Redis bitmap split key division strategy

I'm grabbing and archiving A LOT of data from the Federal Elections Commission public data source API which has a unique record identifier called "sub_id" that is a 19 digit integer.
I'd like to think of a memory efficient way to catalog which line items I've already archived and immediately redis bitmaps come to mind.
Reading the documentation on redis bitmaps indicates a maximum storage length of 2^32 (4294967296).
A 19 digit integer could theoretically range anywhere from 0000000000000000001 to 9999999999999999999. Now I know that the data source in question does not actually have anywhere near 10 quintillion records, so they are clearly sparsely populated and not sequential. Of the data I currently have on file, the maximum ID is 4123120171499720404 and the minimum is 1010320180036112531. (I can tell the IDs are date-based because the 2017 and 2018 in the keys correspond to the dates of the records they refer to, but I can't suss out the rest of the pattern.)
If I wanted to store which line items I've already downloaded, would I need 2328306436 different redis bitmaps? (9999999999999999999 / 4294967296 = 2328306436.54). I could probably work up a tiny algorithm that, given a 19-digit ID, divides by some constant to determine which split bitmap index to check.
There is no way this strategy seems tenable so I'm thinking I must be fundamentally misunderstanding some aspect of this. Am I?
A Bloom Filter such as RedisBloom will be an optimal solution (RedisBloom can even grow if you miscalculated your desired capacity).
After you create your filter with BF.RESERVE, you pass an 'item' to BF.ADD to be inserted. This item can be as long as you want; the filter uses hash functions and a modulus to fit it to the filter size. When you want to check whether an item was already added, call BF.EXISTS with the 'item'.
In short, what you describe here is a classic example for when a Bloom Filter is a great fit.
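For illustration, a sketch with the redis-py client against a Redis server that has the RedisBloom module loaded (the key name, error rate, and capacity are made-up values):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Reserve a filter sized for the expected item count and desired error rate.
    # BF.RESERVE fails if the key already exists, hence the guard.
    if not r.exists("fec:seen"):
        r.execute_command("BF.RESERVE", "fec:seen", 0.001, 100_000_000)

    sub_id = "4123120171499720404"

    # BF.ADD returns 1 if the item was newly added, 0 if it was (probably) already there.
    newly_added = r.execute_command("BF.ADD", "fec:seen", sub_id)

    # BF.EXISTS returns 1 if the item has (probably) been seen before.
    seen = r.execute_command("BF.EXISTS", "fec:seen", sub_id)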
How many "items" are there? What is "A LOT"?
Anyway. A linear approach that uses a single bit to track each of the 10^19 potential items requires at least 1250 petabytes, which makes it impractical (at the moment) to store in memory.
I would recommend that you teach yourself about probabilistic data structures in general, and after having grokked the tradeoffs look into using something from the RedisBloom toolbox.
If the IDs are not sequential and are very spread out, keeping track of which ones you have processed using a bitmap is not the best option, since it would waste a lot of memory.
However, it is hard to point to the best solution without knowing how many distinct sub_ids your data set has. If you are talking about a few tens of millions, a simple set in Redis may be enough.
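If a plain Redis set does turn out to be enough, the bookkeeping is just SADD/SCARD; a quick sketch with redis-py (the key name is made up):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    sub_id = "4123120171499720404"

    # SADD returns 1 if the id was not tracked yet, 0 if it was already there.
    if r.sadd("fec:archived_sub_ids", sub_id):
        print("not seen before - archive it")
    else:
        print("already archived")

    # SCARD reports how many distinct ids are being tracked so far.
    print(r.scard("fec:archived_sub_ids"))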

CSPRNG: Any time guarantees?

Does a cryptographically secure pseudorandom number generator also guarantee that the entropy is gathered in such a way that the same value cannot occur twice when generated at different times?
I know it's highly unlikely already, but are there specific guarantees?
I need to generate a series of unique IDs from a CSPRNG that must not have conflicts.
An ideal (CS)PRNG assures you that the probability of extracting a certain value is constant and does not change over time, no matter whether that value was already output in the past.
For instance, let's assume your ID is 32 bits long and today you extract 0x12345678. What just happened had a probability of 1/(2^32).
Tomorrow (and at any point in the future), you will still have the same probability 1/(2^32) of extracting the value 0x12345678.
However, the birthday paradox tells us that if you generate 65,536 (= 2^(32/2)) values, there is roughly a 50% probability that two IDs are the same.
In other words, there are no hard guarantees the output of the CSPRNG will not be the same. Whether the chances are sufficiently small strongly depends on how long your ID is and how many IDs you expect to have in total over the whole lifetime of your system (special attention should be paid to security concerns when the attacker can generate IDs at will).
For completeness, all of that is applicable to any good PRNG, including the simplest coin flip. Cryptographically strong PRNGs have additional properties concerning the difficulty of predicting future or past outputs from any given output (it should be hard), the ability to recover from compromise of the internal state, and the ability to feed in fresh entropy.
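To put rough numbers on the birthday math, here is a small Python sketch using the standard approximation (the ID sizes are just examples):

    import math
    import secrets

    def collision_probability(num_ids: int, id_bits: int) -> float:
        # Birthday-bound approximation: p ~= 1 - exp(-n * (n - 1) / 2^(b + 1))
        return -math.expm1(-num_ids * (num_ids - 1) / 2.0 / 2.0 ** id_bits)

    print(collision_probability(65_536, 32))      # ~0.39; the 50% point is near 77,000 IDs
    print(collision_probability(1_000_000, 128))  # ~1.5e-27 for a million 128-bit IDs

    # Generating such an ID from the operating system's CSPRNG:
    new_id = secrets.token_hex(16)  # 128 bits, hex-encoded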

Java EE/SQL: Is there a significant performance lag between primary key types?

Currently I am involved in learning some basics of the Java EE technology. I encountered a particular project and took a deeper look into the underlying database structure.
On the server side, I investigated a Java function that creates a primary key with a length of 32 characters (based on concatenating the time, a random hash, and an additional cryptographic nonce).
I am interested in an estimate of the performance loss caused by using such a primary key. If there is no security reason to create this kind of unique ID, wouldn't it be much better to let the underlying database generate new, increasing primary keys, starting at 0?
Wouldn't a SQL/JQL search be much faster when using numbers instead of strings?
Using numbers will probably be faster, but you should measure it with a test case if you need the performance ratio between both options.
I don't think number comparison vs string comparison will give a big performance advantage by itself. However:
larger fields typically mean less data per table block, so you have to read more blocks from the DB in case of a full scan (it will be slower)
accordingly, larger keys typically mean fewer keys per index block, so you have to read more index blocks in case of index scans (it will be slower)
larger fields are, well, larger, so by definition they are less space-efficient.
Note that we are talking about data size and not data type: most likely an 8-byte integer will not be significantly more efficient than an 8-byte string.
Note also that randomly generated IDs are usually more "clusterable" than sequence numbers, as sequences/autonumbers need to be administered centrally (although this can be mitigated using techniques such as the Hi-Lo algorithm; most current persistence frameworks support this technique).
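As a reference point for the Hi-Lo idea mentioned above, here is a toy Python sketch; the "central sequence" is just an in-memory counter standing in for a database sequence:

    import itertools

    class HiLoGenerator:
        # Fetch a block ("hi") from a central sequence once per block_size ids,
        # then hand out ids locally without further round trips.
        def __init__(self, fetch_next_hi, block_size: int = 100):
            self._fetch_next_hi = fetch_next_hi
            self._block_size = block_size
            self._hi = 0
            self._lo = block_size  # force a fetch on first use

        def next_id(self) -> int:
            if self._lo >= self._block_size:
                self._hi = self._fetch_next_hi()
                self._lo = 0
            self._lo += 1
            return self._hi * self._block_size + (self._lo - 1)

    # Stand-in for a database sequence; in a real system this would be one
    # round trip to the server per block of ids.
    central_sequence = itertools.count(1)
    gen = HiLoGenerator(lambda: next(central_sequence), block_size=100)
    print([gen.next_id() for _ in range(5)])  # [100, 101, 102, 103, 104]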

Calculating a hash code for a large file in parallel

I would like to improve the performance of hashing large files, say for example in the tens of gigabytes in size.
Normally, you sequentially hash the bytes of the file using a hash function (say, for example, SHA-256, although I will most likely use Skein, so hashing will be slower when compared to the time it takes to read the file from a [fast] SSD). Let's call this Method 1.
The idea is to hash multiple 1 MB blocks of the file in parallel on 8 CPUs and then hash the concatenated hashes into a single final hash. Let's call this Method 2.
I would like to know if this idea is sound and how much "security" is lost (in terms of collisions being more probable) vs doing a single hash over the span of the entire file.
For example:
Let's use the SHA-256 variant of SHA-2 and set the file size to 2^35 = 34,359,738,368 bytes. Therefore, using a simple single pass (Method 1), I would get a 256-bit hash for the entire file.
Compare this with:
Using the parallel hashing (i.e., Method 2), I would break the file into 32,768 blocks of 1 MB, hash those blocks using SHA-256 into 32,768 hashes of 256 bits (32 bytes), concatenate the hashes and do a final hash of the resultant concatenated 1,048,576 byte data set to get my final 256-bit hash for the entire file.
Is Method 2 any less secure than Method 1, in terms of collisions being more possible and/or probable? Perhaps I should rephrase this question as: Does Method 2 make it easier for an attacker to create a file that hashes to the same hash value as the original file, except of course for the trivial fact that a brute force attack would be cheaper since the hash can be calculated in parallel on N cpus?
Update: I have just discovered that my construction in Method 2 is very similar to the notion of a hash list. However the Wikipedia article referenced by the link in the preceding sentence does not go into detail about a hash list's superiority or inferiority with regard to the chance of collisions as compared to Method 1, a plain old hashing of the file, when only the top hash of the hash list is used.
Block-based hashing (your method 2) is a well known technique that is used in practice:
Hash tree, Merkle tree, Tiger tree hash
eDonkey2000 file hash (single-level tree with ~9 MiB block size)
Just like what you're doing, these methods take the list of block hashes and hash it again, down to a single short hash. Since this is a well established practice, I would assume that it is as secure as sequential hashing.
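For what it's worth, a minimal Python sketch of Method 2 (hash 1 MB blocks, then hash the concatenated digests); a real implementation would read and hash inside each worker rather than shipping blocks between processes:

    import hashlib
    from concurrent.futures import ProcessPoolExecutor

    BLOCK_SIZE = 1024 * 1024  # 1 MB blocks, as in the question

    def hash_block(block: bytes) -> bytes:
        return hashlib.sha256(block).digest()

    def read_blocks(path: str):
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                yield block

    def method2_hash(path: str, workers: int = 8) -> str:
        # Hash the blocks in parallel, then hash the concatenated block hashes.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            block_hashes = list(pool.map(hash_block, read_blocks(path), chunksize=16))
        return hashlib.sha256(b"".join(block_hashes)).hexdigest()

    if __name__ == "__main__":
        print(method2_hash("big_file.bin"))  # hypothetical path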
Some modern hash designs allow them to be run in parallel. See An Efficient Parallel Algorithm for Skein Hash Functions. If you are willing to use a new (and hence less thoroughly tested) hash algorithm, this may give you the speed increase you want on a multi-processor machine.
Skein has reached the final stage of the NIST SHA-3 competition so it is not completely untested.
I think it would be significantly easier for an attacker to find a collision, because the time required to generate a hash is a function of the size of the data to hash. One of the great things about cryptographically secure hashes is that an attacker can't take your 100 GB file, find a spot they want to mutate, hash everything before and after that block, and then use those pre-computed values to quickly get the hash of the whole file after small/quick permutations to the bit they're interested in. This is because the hash state is chained through the entire input, so every block's contribution depends on everything hashed before it.
In short, if you edit the middle of the file, you still need to hash the whole file to get the final checksum. Thus a 100 GB file takes a lot longer to find a collision in than a 100-byte file. The exception is when the edit is nonsense appended right at the end of the file, which is why that is so frequently seen in 'in-the-wild' collision examples for large files.
However, if you break your original file up into blocks, the speed of an attack is now a function of the smallest block (or the size of the block you want to mutate). Since hashing time increases linearly with file size, a 100 GB file will take roughly 2000 seconds for each permutation/MD5, while a 1 MB block would allow an attacker to try 50 per second.
The solution would be to break your file up into overlapping chunks, then MD5 those chunks individually. The resultant hash would be a concatenation of the hashes in both start-to-end order, and end-to-start. Now finding a collision requires the whole file to be hashed - albeit in a parallelized way.

Is there an advantage to using tinyint fields when I know that the value will not exceed 255?

Should I choose the smallest data type possible, or, if I am storing the value 1 for example, does the column's data type not matter because the value will occupy the same amount of memory either way?
The question also matters because I will always have to convert the value and work with it in the application.
UPDATE
I think that varchar(1) and varchar(50) take the same amount of memory if the value is "a". I thought it was the same with int and tinyint, but according to the answers I understand it's not. Is that right?
Always choose the smallest data type possible. SQL can't guess what you want the maximum value to be, but it can optimize storage and performance once you tell it the data type.
To answer your update:
varchar does take up only as much space as you use, so you're right when you say that the character "a" will take up 1 byte (in a single-byte encoding such as Latin-1) no matter how large a varchar field you choose. That is not the case with fixed-size field types in SQL.
However, you will likely be sacrificing efficiency for space if you make everything a varchar field. If everything is a fixed-size field, then SQL can do a simple constant-time multiplication to find your value (like an array). If you have varchar fields in there, then the only way to find out where your data is stored is to go through all the previous fields (like a linked list).
If you're beginning SQL then I advise just to stay away from varchar fields unless you expect to have fields that sometimes have very small amounts of text and sometimes very large amounts of text (like blog posts). It takes experience to know when to use variable length fields to the best effect and even I don't know most of the time.
It's a performance consideration particular to the design of your system. In general, the more data you can fit into a page of SQL Server data, the better the performance.
One page in SQL Server is 8 KB. Using tinyint instead of int will enable you to put more data into a single page, but you have to consider whether or not it's worth it. If you're going to be serving up thousands of hits a minute, then yes. If this is a hobby project or something that just a few dozen users will ever see, then it doesn't matter.
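As a rough illustration of the page math (a Python back-of-the-envelope that ignores SQL Server's per-row and per-page overhead, so the real numbers are lower):

    PAGE_BYTES = 8 * 1024  # one SQL Server data page

    def rows_per_page(row_bytes: int) -> int:
        return PAGE_BYTES // row_bytes

    # A hypothetical row: the key column plus 20 bytes of other fixed-width data.
    print(rows_per_page(4 + 20))  # int key     -> ~341 rows per page
    print(rows_per_page(1 + 20))  # tinyint key -> ~390 rows per page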
The advantage is there, but it might not be significant unless you have lots of rows and perform lots of operations. There will be a performance improvement and smaller storage.
Traditionally, every byte saved in the row would mean a little bit of speed improvement: narrower rows mean more rows per page, which means less memory consumed and fewer IO requests, resulting in better speed. However, with SQL Server 2008 page compression things start to get fuzzy. The compression algorithm may compress 4-byte ints with values under 255 into even less than a byte.
Row compression algorithms will store a 4-byte int in a single byte for values under 127 (int is signed), 2 bytes for values under 32768, and so on and so forth.
However, given that the nice compression features are only available on Enterprise Edition servers, it makes sense to keep the habit of using the smallest possible data type.