Calculating a hash code for a large file in parallel - cryptography

I would like to improve the performance of hashing large files, say for example in the tens of gigabytes in size.
Normally, you sequentially hash the bytes of the file using a hash function (say SHA-256, although I will most likely use Skein, so hashing will be slower compared to the time it takes to read the file from a [fast] SSD). Let's call this Method 1.
The idea is to hash multiple 1 MB blocks of the file in parallel on 8 CPUs and then hash the concatenated hashes into a single final hash. Let's call this Method 2.
I would like to know if this idea is sound and how much "security" is lost (in terms of collisions being more probable) vs doing a single hash over the span of the entire file.
For example:
Let's use the SHA-256 variant of SHA-2 and set the file size to 2^35 = 34,359,738,368 bytes. Using a simple single pass (Method 1), I would get a 256-bit hash for the entire file.
Compare this with:
Using the parallel hashing (i.e., Method 2), I would break the file into 32,768 blocks of 1 MB, hash those blocks using SHA-256 into 32,768 hashes of 256 bits (32 bytes), concatenate the hashes and do a final hash of the resultant concatenated 1,048,576 byte data set to get my final 256-bit hash for the entire file.
Is Method 2 any less secure than Method 1, in terms of collisions being more probable? Perhaps I should rephrase the question as: does Method 2 make it easier for an attacker to create a file that hashes to the same value as the original file, apart from the trivial fact that a brute-force attack would be cheaper because the hash can be calculated in parallel on N CPUs?
Update: I have just discovered that my construction in Method 2 is very similar to the notion of a hash list. However, the Wikipedia article linked in the preceding sentence does not go into detail about whether a hash list is superior or inferior to Method 1 (a plain old hash of the whole file) with regard to the chance of collisions, when only the top hash of the hash list is used.

Block-based hashing (your method 2) is a well known technique that is used in practice:
Hash tree, Merkle tree, Tiger tree hash
eDonkey2000 file hash (single-level tree with ~9 MiB block size)
Just like what you're doing, these methods take the list of block hashes and hash it again, down to a single short hash. Since this is a well-established practice, I would assume that it is as secure as sequential hashing.
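To make the construction concrete, here is a minimal Python sketch of this two-level scheme using hashlib and multiprocessing; the block size and worker count mirror the question, everything else is an illustrative choice rather than a prescribed implementation:

# Minimal sketch of Method 2: hash 1 MiB blocks in parallel, then hash the
# concatenation of the block hashes. Illustrative only.
import hashlib
from multiprocessing import Pool

BLOCK_SIZE = 1024 * 1024  # 1 MiB blocks, as in the question


def hash_block(block: bytes) -> bytes:
    # First level: hash one block (runs in a worker process).
    return hashlib.sha256(block).digest()


def read_blocks(path: str):
    # Yield fixed-size blocks; Pool.map below materialises them all in memory,
    # so a production version would stream them in batches instead.
    with open(path, "rb") as f:
        while block := f.read(BLOCK_SIZE):
            yield block


def parallel_file_hash(path: str, workers: int = 8) -> str:
    with Pool(workers) as pool:
        block_hashes = pool.map(hash_block, read_blocks(path))
    # Second level: hash the concatenated block hashes into the final digest.
    return hashlib.sha256(b"".join(block_hashes)).hexdigest()


if __name__ == "__main__":
    print(parallel_file_hash("bigfile.bin"))  # "bigfile.bin" is a placeholder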

Some modern hash designs allow them to be run in parallel. See An Efficient Parallel Algorithm for Skein Hash Functions. If you are willing to use a new (and hence less thoroughly tested) hash algorithm, this may give you the speed increase you want on a multi-processor machine.
Skein has reached the final stage of the NIST SHA-3 competition so it is not completely untested.

I think it would be significantly easier for an attacker to find a collision, because the time required to generate a hash is a function of the size of the data being hashed. One of the great things about cryptographically secure hashes is that an attacker can't take your 100 GB file, find a spot they want to mutate, hash everything before and after that block, and then use those pre-computed values to quickly recompute the hash of the whole file after each small permutation of the part they're interested in. This is because the hash's internal state is chained: each block's contribution depends on everything that came before it, so the part after the edit can't be precomputed.
In short, if you edit the middle of the file, you still need to hash the whole file to get the final checksum. Thus a 100 GB file takes a lot longer to find a collision in than a 100-byte file. The exception is when the edit is nonsense appended right at the end of the file, which is why that is seen so frequently in in-the-wild collision examples for large files.
However, if you break your original file up into blocks, the speed of an attack is now a function of the smallest block (or the size of the block you want to mutate). Since hashing time increases linearly with data size, a 100 GB file will take roughly 2,000 seconds for each permutation/MD5 attempt, while a 1 MB block would allow an attacker to try 50 per second.
The solution would be to break your file up into overlapping chunks, then MD5 those chunks individually. The resulting hash would be a concatenation of the chunk hashes in both start-to-end and end-to-start order. Now finding a collision requires the whole file to be hashed, albeit in a parallelizable way.
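For what it's worth, a literal Python sketch of that overlapping-chunk construction might look like the following; the chunk size, the overlap, and the use of MD5 via hashlib are placeholder choices, not recommendations:

# Sketch of the overlapping-chunk idea: hash overlapping chunks, then hash the
# chunk hashes concatenated start-to-end followed by end-to-start.
import hashlib

CHUNK = 1024 * 1024     # placeholder chunk size
OVERLAP = 64 * 1024     # placeholder overlap between neighbouring chunks


def overlapping_chunk_hash(data: bytes) -> str:
    step = CHUNK - OVERLAP
    chunk_hashes = [hashlib.md5(data[i:i + CHUNK]).digest()
                    for i in range(0, len(data), step)]
    combined = b"".join(chunk_hashes) + b"".join(reversed(chunk_hashes))
    return hashlib.md5(combined).hexdigest()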

Related

Likelihood of Collision

I want to hash an internal account number and use the result as a unique public identifier for an account record. The identifier is limited to 40 characters. I have approximately 250 records with unique account numbers.
Which is less likely to result in a collision?
Taking the SHA-1 of the SHA-256 hash of the account number.
Taking the SHA-256 of the account number and picking out 40 characters.
These approaches are identical (*), so you should use the second one. There is no reason to inject SHA-1 into the system. Any selection of bits out of SHA-256 is independent and "effectively random."
An alternate solution that may be convenient is to turn these into v5 UUIDs. If you keep your namespace secret (which is allowed), this may be a very nice way to do what you're describing.
(*) There are some subtleties around the fact that you're using "characters" rather than bytes here, and you could get a larger space in 40 "characters" by using a better encoding than you're likely using. It's possible the spaces are a little different depending on how you're actually encoding. But it really doesn't matter. These spaces are enormous, and the two approaches will be the same in practice, so use the one that only needs one algorithm.
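To illustrate both suggestions, a short Python sketch using hashlib and the standard uuid module; the namespace value is a made-up placeholder you would keep secret:

import hashlib
import uuid

# Option 2 from the question: SHA-256 of the account number, truncated to 40 hex characters.
def public_id(account_number: str) -> str:
    return hashlib.sha256(account_number.encode()).hexdigest()[:40]

# Alternative: a version-5 UUID derived from a secret namespace and the account number.
SECRET_NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")  # placeholder value

def public_uuid(account_number: str) -> uuid.UUID:
    return uuid.uuid5(SECRET_NAMESPACE, account_number)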
Another approach that may meet your needs better is to stretch the identifiers. If the space is sufficiently sparse (i.e., if the number of possible identifiers is dramatically larger than the number of identifiers actually used), then stretching algorithms like PBKDF2 are designed to handle exactly that. They are expensive to compute, but you can tune their cost to match your security requirements.
The general problem with just hashing is that hashes are very fast, and if your space of possible identifiers is very small, then it's easy to brute force. Stretching algorithms make the cost of guessing arbitrarily expensive, so large spaces are impractical to brute force. They do this without requiring any secrets, which is nice. The general approach is:
Select a "salt" value. This can be publicly known. It does not matter. For this particular use case, because every account number is different, you can select a single global salt. (If the protected data could be the same, then it's important to have different salts for each record.)
Compute PBKDF2(salt, iterations, length, payload)
The number of iterations tunes how slow this operation is. The output is "effectively random" (just like a hash) and can be used in the same ways.
A common target for iterations is a value that delivers around 80-100ms. This is fairly fast on the server, but is extremely slow for brute-forcing large spaces, even if the attacker has better hardware than yours. Ideally your space should take at least millions of years to brute force (seriously; this is the kind of headroom we typically like in security; I personally target trillions of years). If it's smaller than a few years, then it probably can be brute forced quickly by throwing more hardware at it.
(Of course all of these choices can be tuned based on your attack model. It depends on how dedicated and well-funded you expect attackers to be.)
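For concreteness, a minimal sketch of that flow using Python's hashlib.pbkdf2_hmac; the salt, iteration count, and output length below are illustrative placeholders, not recommendations:

import hashlib

SALT = b"a-public-application-salt"  # can be public; one global salt is fine for unique inputs
ITERATIONS = 600_000                 # tune until one call takes roughly 80-100 ms on your hardware
LENGTH = 20                          # 20 bytes of output -> 40 hex characters


def stretched_id(account_number: str) -> str:
    digest = hashlib.pbkdf2_hmac("sha256", account_number.encode(), SALT, ITERATIONS, dklen=LENGTH)
    return digest.hex()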
A 40-character ID is 320 bits, which gives you plenty of space. With only 250 records, you can easily fit a unique counter into that. Three digit characters take up only 24 of those bits, and the range 000 to 999 gives you plenty to play with. Fill up the rest of the ID with, say, the hex representation of part of the SHA-256 hash. With a 3-digit counter, that leaves 37 places for hex, which covers 37 × 4 = 148 bits of the SHA-256 output.
You may want to put the counter in the middle of the hex string in a fixed position instead of at the beginning or end to make it less obvious.
<11 hex chars><3 digit ID><26 hex chars>
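A small sketch of that layout; the split positions follow the <11><3><26> template above, and the counter formatting is my own choice:

import hashlib


def account_public_id(account_number: str, counter: int) -> str:
    hex37 = hashlib.sha256(account_number.encode()).hexdigest()[:37]  # 37 hex chars = 148 bits
    return hex37[:11] + f"{counter:03d}" + hex37[11:]                 # <11 hex><3-digit ID><26 hex>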

Using HSET or SETBIT for storing 6 billion SHA256 hashes in Redis

Problem set: I am looking to store 6 billion SHA256 hashes. I want to check whether a hash exists and, if so, perform an action. When it comes to storing the SHA256 hash (a 64-character string) just to check whether the key exists, I've come across two pairs of commands to use:
HSET/HEXISTS and GETBIT/SETBIT
I want to make sure I take the least amount of memory, but also want to make sure lookups are quick.
The use case will be "check if a SHA256 hash exists".
The problem:
I want to understand how to store this data, as I currently see a 200% size increase going from text to Redis. I want to understand what the best sharding options would be using ziplist entries and ziplist values, and how to split the hash so that the ziplist is maximised.
I've tried setting the ziplist entries to 16 ^ 4 (65536) and the value to 60 based on splitting 4:60
Any help understanding the options and techniques to make the footprint as small as possible while keeping lookups quick would be appreciated.
Thanks
A bit late to the party but you can just use plain Redis keys for this:
# Store a given SHA256 hash
> SET 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08 ""
OK
# Check whether a specific hash exists
> EXISTS 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae
0
Where both SET and EXISTS have a time complexity of O(1) for single keys.
As Redis can handle a maximum of 2^32 keys per instance, you should split your dataset across two or more Redis servers / clusters, depending on the number of nodes and the total combined memory available to them.
I would also suggest using the binary sequence of your hashes instead of their textual representation, as that would save roughly 50% of the memory used to store your keys in Redis.
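As a sketch of that setup, assuming the redis-py client (the connection parameters and the example digest are placeholders):

import redis

r = redis.Redis(host="localhost", port=6379)

hex_digest = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
key = bytes.fromhex(hex_digest)  # 32 raw bytes instead of a 64-character hex string

r.set(key, b"")      # store the hash as a bare key with an empty value
if r.exists(key):    # O(1) membership check
    pass             # perform the action here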

Identifying Differences Efficiently

Every day, we receive huge files from various vendors in different formats (CSV, XML, custom) which we need to upload into a database for further processing.
The problem is that these vendors will send the full dump of their data and not just the updates. We have some applications where we need only send the updates (that is, the changed records only). What we do currently is to load the data into a staging table and then compare it against previous data. This is painfully slow as the data set is huge and we are occasionally missing SLAs.
Is there a quicker way to resolve this issue? Any suggestions or help would be greatly appreciated. Our programmers are running out of ideas.
There are a number of patterns for detecting deltas, i.e. changed records, new records, and deleted records, in full dump data sets.
One of the more efficient ways I've seen is to create hash values of the rows of data you already have, create hashes of the import once it's in the database, then compare the existing hashes to the incoming hashes.
Primary key match + hash match = Unchanged row
Primary key match + hash mismatch = Updated row
Primary key in incoming data but missing from existing data set = New row
Primary key not in incoming data but in existing data set = Deleted row
How to hash varies by database product, but all of the major providers have some sort of hashing available in them.
The advantage comes from only having to compare a small number of fields (the primary key column(s) and the hash) rather than doing a field by field analysis. Even pretty long hashes can be analyzed pretty fast.
It'll require a little rework of your import processing, but the time spent will pay off over and over again in increased processing speed.
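If it helps, here is a minimal sketch of the comparison step expressed outside the database, assuming plain Python dicts that map each primary key to its row hash:

def classify_deltas(existing: dict, incoming: dict):
    # existing/incoming: primary key -> row hash
    new_rows = [pk for pk in incoming if pk not in existing]
    deleted_rows = [pk for pk in existing if pk not in incoming]
    updated_rows = [pk for pk in incoming
                    if pk in existing and incoming[pk] != existing[pk]]
    unchanged = [pk for pk in incoming
                 if pk in existing and incoming[pk] == existing[pk]]
    return new_rows, deleted_rows, updated_rows, unchanged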
The standard solution to this is hash functions. What you do is take each row and calculate an identifier plus a hash of its contents. Now you compare hashes, and if the hashes are the same then you assume that the row is the same. This is imperfect - it is theoretically possible that different values will give the same hash value. But in practice you have more to worry about from cosmic rays causing random bit flips in your computer than from hash functions failing to work as promised.
Both rsync and git are examples of widely used software that use hashes in this way.
In general calculating a hash before you put it in the database is faster than performing a series of comparisons inside of the database. Furthermore it allows processing to be spread out across multiple machines, rather than bottlenecked in the database. And comparing hashes is less work than comparing many fields, whether you do it in the database or out.
There are many hash functions that you can use. Depending on your application, you might want to use a cryptographic hash though you probably don't have to. More bits is better than fewer, but a 64 bit hash should be fine for the application that you describe. After processing a trillion deltas you would still have less than 1 chance in 10 million of having made an accidental mistake.
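One way to produce such a 64-bit row hash before loading, as a sketch; the field separator and the choice of SHA-256 truncated to 8 bytes are my own assumptions, and any stable serialization and hash would do:

import hashlib


def row_hash64(fields) -> int:
    # Join the row's fields with an unlikely separator and truncate the digest to 64 bits.
    payload = "\x1f".join(str(f) for f in fields).encode()
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")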

SHA1-Indexed Hash table in D

I'm using a D builtin hash table indexed by SHA1-digests (ubyte[20]) to relate information in my file system search engine.
Are there any data structures more suitable for this (in D), given all the nice properties of such a key (uniformly distributed, random, fixed size)? Or will D's built-in hash tables automatically figure out that they could, for example, just pick the first n (1-8) bytes of a SHA1 digest as a bucket index?
I think the hash function used inside the standard maps is trivial enough (cost-wise) that it won't make much if any difference unless you are running code that is mostly look-ups. Keep in mind that the full key will be read to do the final comparison, so it will get loaded into the cache either way.
OTOH I think there is a toHash method you can override.

Java EE/SQL: Is there a significant performance lag between primary key types?

Currently I am involved in learning some basics of the Java EE technology. I encountered a particular project and took a deeper look into the underlying database structure.
On server-side I investigated a Java function that creates a primary key with a length of 32 characters (based on concatenating the time, a random hash, and an additional cryptographic nonce).
I am interested in an estimate of the performance loss caused by using such a primary key. If there is no security reason to create this kind of unique ID, wouldn't it be much better to let the underlying database generate new, increasing primary keys, starting at 0?
Wouldn't a SQL/JQL search be much faster when using numbers instead of strings?
Using numbers will probably be faster, but you should measure it with a test case if you need the performance ratio between both options.
I don't think number comparison vs string comparison will give a big performance advantage by itself. However:
larger fields typically mean fewer rows per table block, so you have to read more blocks from the DB in the case of a full scan (it will be slower)
accordingly, larger keys typically mean fewer keys per index block, so you have to read more index blocks in the case of index scans (it will be slower)
larger fields are, well, larger, so by definition they are less space-efficient.
Note that we are talking about data size and not data type: most likely an 8-byte integer will not be significantly more efficient than an 8-byte string.
Note also that random IDs are usually more "clusterable" than sequence numbers, as sequences / autonumerics need to be administered centrally (although this can be mitigated using techniques such as the Hi-Lo algorithm; most current persistence frameworks support this technique).
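For reference, a toy sketch of the Hi-Lo idea mentioned above: each node reserves a block of ids by bumping a shared counter once, then hands out ids locally; the in-process counter below is only a stand-in for a database sequence:

import itertools

LO_RANGE = 100                    # ids handed out locally per reserved block
_shared_hi = itertools.count(1)   # stand-in for a shared database sequence


class HiLoGenerator:
    def __init__(self):
        self._lo = LO_RANGE       # force a block reservation on first use

    def next_id(self) -> int:
        if self._lo >= LO_RANGE:
            self._hi = next(_shared_hi)  # one round-trip reserves LO_RANGE ids
            self._lo = 0
        self._lo += 1
        return self._hi * LO_RANGE + self._lo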