Which data structure should I use for storing hash values? - optimization

I have a hash table that I want to store to disk. The list looks like this:
<16-byte key > <1-byte result>
a7b4903def8764941bac7485d97e4f76 04
b859de04f2f2ff76496879bda875aecf 03
etc...
There are 1-5 million entries. Currently I'm just storing them in one file, 17-bytes per entry times the number of entries. That file is tens of megabytes. My goal is to store them in a way that optimizes first for space on the disk and then for lookup time. Insertion time is unimportant.
What is the best way to do this? I'd like the file to be as small as possible. Multiple files would be okay, too. Patricia trie? Radix trie?
Whatever good suggestions I get, I'll be implementing and testing. I'll post the results here for all to see.

You could just sort entries by key and do a binary search.
Fixed size keys and data entries means you can very quickly jump from row to row, and storing only the key and data means you're not wasting any space on meta data.
I don't think you'll do any better on disk space, and lookup times are O(log(n)). Insertion times are crazy long, but you said that didn't matter.
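A minimal sketch of that lookup, assuming the file holds nothing but 17-byte records sorted by key. The name `lookup_sorted` and the MD5-based demo data are made up for illustration.

```python
import hashlib
import io
import os

KEY_SIZE = 16                 # 16-byte key
RECORD_SIZE = KEY_SIZE + 1    # plus the 1-byte result


def lookup_sorted(f, key):
    """Binary-search a file of sorted 17-byte records; return the result byte or None."""
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell() // RECORD_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD_SIZE)          # fixed-size rows: jump straight to row `mid`
        record = f.read(RECORD_SIZE)
        if record[:KEY_SIZE] < key:
            lo = mid + 1
        elif record[:KEY_SIZE] > key:
            hi = mid
        else:
            return record[KEY_SIZE]
    return None


# Demo data (invented): MD5 digests as keys, a small integer as the result.
entries = sorted((hashlib.md5(str(i).encode()).digest(), i % 5) for i in range(1000))
table = io.BytesIO(b"".join(k + bytes([v]) for k, v in entries))
print(lookup_sorted(table, entries[123][0]))
```

With a real file you would pass an object opened with `open(path, "rb")` instead of the `io.BytesIO` stand-in.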
If you're really willing to tolerate long access times, do sort the table but then chunk it into blocks of some size and compress them. Store the offset* and start/end keys of each block in a section at the start of the file. Using this scheme, you can find the block containing the key you need in time linear in the number of blocks and then perform a binary search within the decompressed block. Choose the block size based on how much of the file you're willing to load into memory at once.
Using an off the shelf compression scheme (like GZIP) you can tune the compression ratio as needed; larger files will presumably have quicker lookup times.
I have doubts that the space savings will be all that great, as your structure seems to be mostly hashes. If they are actually hashes, they're random and won't compress terribly well. Sorting will help increase the compression ratio, but not by a ton.
*Use the header to lookup the offset of a block to decompress and use.
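A rough sketch of the block scheme, with a few assumptions of mine: the header is kept as an in-memory list of first keys rather than as a section of the file, the header is searched with `bisect` instead of a linear scan, and `zlib` stands in for whatever compressor you choose. The function names are hypothetical.

```python
import bisect
import hashlib
import zlib

KEY_SIZE, RECORD_SIZE = 16, 17


def build_blocks(records, per_block=256):
    """records: sorted list of 17-byte entries. Returns (first_keys, compressed_blocks)."""
    first_keys, blocks = [], []
    for i in range(0, len(records), per_block):
        chunk = records[i:i + per_block]
        first_keys.append(chunk[0][:KEY_SIZE])      # header entry for this block
        blocks.append(zlib.compress(b"".join(chunk), 9))
    return first_keys, blocks


def lookup_blocked(first_keys, blocks, key):
    """Pick the block via the header, decompress it, binary-search inside it."""
    b = bisect.bisect_right(first_keys, key) - 1
    if b < 0:
        return None                                 # key sorts before the first block
    data = zlib.decompress(blocks[b])
    lo, hi = 0, len(data) // RECORD_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        k = data[mid * RECORD_SIZE:mid * RECORD_SIZE + KEY_SIZE]
        if k < key:
            lo = mid + 1
        elif k > key:
            hi = mid
        else:
            return data[mid * RECORD_SIZE + KEY_SIZE]
    return None


# Demo data (invented): MD5 digests as keys.
pairs = sorted((hashlib.md5(str(i).encode()).digest(), i % 5) for i in range(2000))
first_keys, blocks = build_blocks([k + bytes([v]) for k, v in pairs])
print(lookup_blocked(first_keys, blocks, pairs[42][0]))
```

Comparing `sum(len(b) for b in blocks)` with the raw size on your real data is a quick way to check whether compression is worth the extra lookup cost.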

5 million records is about 81 MB (5,000,000 x 17 bytes), small enough to work with as an array in memory.
As you describe the problem, these look more like unique keys than hash values.
Try using a hash table to access the values (look at this link).
If I have misunderstood and these are real hashes, try building a second hash level on top of them.
A hash table can be successfully organized on disk too (e.g. as a separate file).
Addition
A solution with good search performance and little overhead:
1. Define a hash function that produces integer values from keys.
2. Sort the records in the file by the values this function produces.
3. Store the file offsets where each hash value starts.
4. To locate a value:
4.1. compute its hash with the function;
4.2. look up the offset for that hash value;
4.3. read records from the file starting at this position until the key is found, the offset of the next hash value is reached, or end of file.
There are some additional things which must be pointed out:
The hash function must be fast to be effective.
The hash function must produce uniformly distributed values, or close to that.
The table of hash value offsets can be placed in a separate file.
The table of hash value offsets can also be built dynamically by reading the whole sorted file sequentially at application start and keeping the result in memory.
At step 4.3, records must be read in blocks, not one by one, to be effective. Ideally, read all records with the computed hash into memory at once.
You can find some examples of hash functions here.
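The steps above can be sketched as follows. The hash function here (the leading 16 bits of the key) is my own assumption, chosen because a file sorted bytewise by key is then automatically sorted by hash value; `data` stands for the contents of the sorted file.

```python
import hashlib

KEY_SIZE, RECORD_SIZE = 16, 17
N_BUCKETS = 1 << 16


def bucket_of(key):
    """Hypothetical hash function: the leading 16 bits of the key as an integer."""
    return int.from_bytes(key[:2], "big")


def build_offsets(data):
    """One sequential pass over the sorted records; offsets[h] is the index of the
    first record whose hash is >= h, and offsets[N_BUCKETS] is the record count."""
    counts = [0] * N_BUCKETS
    for pos in range(0, len(data), RECORD_SIZE):
        counts[bucket_of(data[pos:pos + KEY_SIZE])] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    return offsets


def find_by_offset(data, offsets, key):
    h = bucket_of(key)                                            # step 4.1
    start, end = offsets[h] * RECORD_SIZE, offsets[h + 1] * RECORD_SIZE  # step 4.2
    block = data[start:end]                                       # step 4.3: one block read
    for pos in range(0, len(block), RECORD_SIZE):
        if block[pos:pos + KEY_SIZE] == key:
            return block[pos + KEY_SIZE]
    return None


# Demo data (invented): MD5 digests as keys.
items = sorted((hashlib.md5(str(i).encode()).digest(), i % 5) for i in range(3000))
data = b"".join(k + bytes([v]) for k, v in items)
offsets = build_offsets(data)
print(find_by_offset(data, offsets, items[7][0]))
```

With 65536 hash values and a few million entries, each bucket holds a few dozen records, so the scan in step 4.3 stays short.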

Would the simple approach work and store them in a sqlite database? I don't suppose it'll get any smaller but you should get very good lookup performance, and it's very easy to implement.
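For comparison, a small sketch using Python's built-in `sqlite3` module. The table and column names are invented, and `WITHOUT ROWID` is my own choice to avoid storing a separate row id next to the 16-byte primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # pass a file path instead to store on disk
conn.execute(
    "CREATE TABLE hashes (key BLOB PRIMARY KEY, result INTEGER NOT NULL) WITHOUT ROWID"
)
rows = [
    (bytes.fromhex("a7b4903def8764941bac7485d97e4f76"), 4),
    (bytes.fromhex("b859de04f2f2ff76496879bda875aecf"), 3),
]
conn.executemany("INSERT INTO hashes VALUES (?, ?)", rows)
conn.commit()


def lookup_sqlite(conn, key):
    """Return the result for a 16-byte key, or None if the key is absent."""
    row = conn.execute("SELECT result FROM hashes WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


print(lookup_sqlite(conn, rows[0][0]))
```

Expect the database file to be larger than the raw 17 bytes per entry, since SQLite adds per-row and per-page overhead.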

First of all, multiple files are not OK if you want to optimize for disk space, because of cluster size: when you create a file of ~100 bytes, free disk space decreases by a whole cluster, 2 kB for example.
Secondly, in your case I would store the whole table in a single binary file, sorted in ascending order by the byte values of the keys. This gives you a file whose length is exactly entriesNumber*17, which is minimal if you do not want to use compression. You can also search very quickly, in about log2(entriesNumber) steps: split the file into two halves and compare the key on their border with the key you need. If the "border key" is bigger, take the first half; if it is smaller, take the second half. Then split the chosen half again, and so on.
So you will need about log2(entriesNumber) read operations to search single key.

Your key is 128 bits, but if you have max 10^7 entries, it only takes 24 bits to index it.
You could make a hash table, or
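Use Bentley-style unrolled binary search (at most 24 comparisons).
Here's the unrolled loop (with 32-bit ints).
unsigned int key[4];
unsigned int a[1<<24][4];  /* sorted ascending; pad the unused tail with maximal keys */
/* Lexicographic compare of two 4-word keys: <0, 0 or >0, like memcmp. */
static int cmp4(const unsigned int *x, const unsigned int *y) {
    for (int j = 0; j < 4; j++)
        if (x[j] != y[j]) return x[j] < y[j] ? -1 : 1;
    return 0;
}
#define COMPARE(key, i) cmp4(key, a[i])
int i = 0;
if (COMPARE(key, (i+(1<<23))) >= 0) i += (1<<23);
if (COMPARE(key, (i+(1<<22))) >= 0) i += (1<<22);
if (COMPARE(key, (i+(1<<21))) >= 0) i += (1<<21);
...
if (COMPARE(key, (i+(1<<3))) >= 0) i += (1<<3);
if (COMPARE(key, (i+(1<<2))) >= 0) i += (1<<2);
if (COMPARE(key, (i+(1<<1))) >= 0) i += (1<<1);
if (COMPARE(key, (i+(1<<0))) >= 0) i += (1<<0);
/* a[i] is now the last entry <= key; it is a hit if COMPARE(key, i) == 0 */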

As always with file design, the more you know (and tell us) about the distribution of data the better. On the assumption that your key values are evenly distributed across the set of all 16-byte keys -- which should be true if you are storing a hash table -- I suggest a combination of what others have already suggested:
binary data such as this belongs in a binary file; don't let the fact that the easy representation of your hashes and values are as strings of hexadecimal digits fool you into thinking that this is string data;
file size is such that the whole shebang can be kept in memory on any modern PC or server and a lot of other devices too;
the leading 4 hex digits (2 bytes) of your keys divide the set of possible keys into 16^4 (= 65536) subsets; if your keys are evenly distributed and you have 5x10^6 entries, that's about 76 entries per subset; so create a file with space for, say, 100 entries per subset; then:
at offset 0 start writing all the entries with leading 4 hex digits 0x0000; pad to the total of 100 entries (1700 bytes) with 0s;
at offset 1700 start writing all the entries with leading 4 hex digits 0x0001, pad,
repeat until you've written all the data.
Now your lookup becomes a calculation to figure out the offset into the file followed by a scan of up to 100 entries to find the one you want. If this isn't fast enough then use 16^5 subsets, allowing about 6 entries per subset (6x16^5 = 6291456). I guess that this will be faster than binary search -- but it is only a guess.
Insertion is a bit of a problem, it's up to you with your knowledge of your data to decide whether new entries (a) necessitate the re-sorting of a subset or (b) can simply be added at the end of the list of entries at that index (which means scanning the entire subset on every lookup).
If space is very important you can, of course, drop the leading 2 bytes (4 hex digits) from your entries, since they are implied by the calculation for the offset into the file.
What I'm describing, not terribly well, is a hash table.
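A small sketch of this padded layout, scaled down so it runs quickly: 1 leading byte (256 subsets) and 32 slots per subset instead of 2 bytes and 100 slots. These parameters, the names, and the choice to raise an error when a subset overflows are my assumptions. It also assumes no real record is all zero bytes, since that is the padding marker.

```python
import hashlib

KEY_SIZE, RECORD_SIZE = 16, 17
PREFIX_BYTES = 1                     # demo value; the answer above uses 2 bytes
SLOTS = 32                           # demo value; the answer above uses 100
EMPTY = bytes(RECORD_SIZE)           # zero padding marks an unused slot


def build_padded(entries):
    """entries: iterable of (16-byte key, result). Returns the padded table as bytes."""
    n_subsets = 256 ** PREFIX_BYTES
    table = bytearray(n_subsets * SLOTS * RECORD_SIZE)
    used = [0] * n_subsets
    for key, result in sorted(entries):
        s = int.from_bytes(key[:PREFIX_BYTES], "big")
        if used[s] == SLOTS:
            raise OverflowError("subset %d is full; use more slots" % s)
        pos = (s * SLOTS + used[s]) * RECORD_SIZE
        table[pos:pos + RECORD_SIZE] = key + bytes([result])
        used[s] += 1
    return bytes(table)


def lookup_padded(table, key):
    """Compute the subset offset from the key prefix, then scan that subset."""
    start = int.from_bytes(key[:PREFIX_BYTES], "big") * SLOTS * RECORD_SIZE
    for pos in range(start, start + SLOTS * RECORD_SIZE, RECORD_SIZE):
        record = table[pos:pos + RECORD_SIZE]
        if record == EMPTY:
            return None              # reached the padding: key is absent
        if record[:KEY_SIZE] == key:
            return record[KEY_SIZE]
    return None


# Demo data (invented): MD5 digests as keys.
sample = [(hashlib.md5(str(i).encode()).digest(), i % 5) for i in range(1000)]
padded = build_padded(sample)
print(lookup_padded(padded, sample[10][0]))
```

The padding is the price of the direct offset calculation: the demo table reserves 256 x 32 slots for 1000 entries, so the slot count trades disk space against the risk of a subset overflowing.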

Related

The best way to search millions of fuzzy hashes

I have the spamsum composite hashes for about ten million files in a database table and I would like to find the files that are reasonably similar to each other. Spamsum hashes are composed of two CTPH hashes of maximum 64 bytes and they look like this:
384:w2mhnFnJF47jDnunEk3SlbJJ+SGfOypAYJwsn3gdqymefD4kkAGxqCfOTPi0ND:wemfOGxqCfOTPi0ND
They can be broken down into three sections (split the string on the colons):
Block size: 384 in the hash above
First signature: w2mhnFnJF47jDnunEk3SlbJJ+SGfOypAYJwsn3gdqymefD4kkAGxqCfOTPi0ND
Second signature: wemfOGxqCfOTPi0ND
Block size refers to the block size for the first signature, and the block size for the second signature is twice that of the first signature (here: 384 x 2 = 768). Each file has one of these composite hashes, which means each file has two signatures with different block sizes.
The spamsum signatures can be compared only if their block sizes correspond. That is to say that the composite hash above can be compared to any other composite hash that contains a signature with a block size of 384 or 768. The similarity of the signature strings for hashes with the same block size can be taken as a measure of similarity between the files represented by the hashes.
So if we have:
file1.blk2 = 768
file1.sig2 = wemfOGxqCfOTPi0ND
file2.blk1 = 768
file2.sig1 = LsmfOGxqCfOTPi0ND
We can get a sense of the degree of similarity of the two files by calculating some weighted edit distance (like Levenshtein distance) for the two signatures. Here the two files seem to be pretty similar.
leven_dist(file1.sig2, file2.sig1) = 2
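For reference, a plain (unweighted) Levenshtein distance can be sketched as below. Spamsum's own scoring uses a weighted variant, so treat this only as an illustration of the comparison step.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = cur
    return prev[-1]


print(levenshtein("wemfOGxqCfOTPi0ND", "LsmfOGxqCfOTPi0ND"))  # 2
```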
One can also calculate a normalized similarity score between two hashes (see the details here).
I would like to find any two files that are more than 70% similar based on these hashes, and I have a strong preference for using the available software packages (or APIs/SDKs), although I am not afraid of coding my way through the problem.
I have tried breaking the hashes down and indexing them using Lucene (4.7.0), but the search seems to be slow and tedious. Here is an example of the Lucene queries I have tried (for each single signature -- twice per hash and using the case-sensitive KeywordAnalyzer):
(blk1:768 AND sig1:wemfOGxqCfOTPi0ND~0.7) OR (blk2:768 AND sig2:wemfOGxqCfOTPi0ND~0.7)
It seems that Lucene's incredibly fast Levenshtein automaton does not accept edit distance limits above 2 (I need it to support up to (1 - 0.7) x 64 ≃ 19) and that its normal edit distance algorithm is not optimized for long search terms (the brute force method used does not cut off calculation once the distance limit is reached.) That said, it may be that my query is not optimized for what I want to do, so don't hesitate to correct me on that.
I am wondering whether I can accomplish what I need using any of the algorithms offered by Lucene, instead of directly calculating the editing distance. I have heard that BK-trees are the best way to index for such searches, but I don't know of the available implementations of the algorithm (Does Lucene use those at all?). I have also heard that a probable solution is to narrow down the search list using n-gram methods but I am not sure how that compares to editing distance calculation in terms of inclusiveness and speed (I am pretty sure Lucene supports that one). And by the way, is there a way to have Lucene run a term search in the parallel mode?
Given that I am using Lucene only to pre-match the hashes and that I calculate the real similarity score using the appropriate algorithm later, I just need a method that is at least as inclusive as Levenshtein distance used in similarity score calculation -- that is, I don't want the pre-matching method to exclude hashes that would be flagged as matches by the scoring algorithm.
Any help/theory/reference/code or clue to start with is appreciated.
This is not a definitive answer to the question, but I have tried a number of methods ever since. I am assuming the hashes are saved in a database, but the suggestions remain valid for in-memory data structures as well.
Save all signatures (2 per hash) along with their corresponding block sizes in a separate child table. Since only signatures with the same block size can be compared with each other, you can filter the table by block size before starting to compare the signatures.
Reduce all the repetitive sequences of more than three characters to three characters ('bbbbb' -> 'bbb'). Spamsum's comparison algorithm does this automatically.
Spamsum uses a rolling window of 7 to compare signatures, and won't compare any two signatures that do not have a 7-character overlap after eliminating excessive repetitions. If you are using a database that supports lists/arrays as fields, create a field with a list of all possible 7-character sequences extracted from each signature. Then create the fastest exact match index you have access to on this field. Before trying to find the distance of two signatures, first try to do exact matches over this field (any seven-gram in common?).
The last step I am experimenting with is to save signatures and their seven-grams as the two modes of a bipartite graph, projecting the graph into single mode (composed of hashes only), and then calculating Levenshtein distance only on adjacent nodes with similar block sizes.
The above steps do a good pre-matching and substantially reduce the number of signatures each signature has to be compared with. It is only after these that the modified Levenshtein/Damerau distance has to be calculated.
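An in-memory sketch of the pre-matching steps, assuming the signatures fit in a Python dict rather than a database index. The function names and the sample file ids are invented.

```python
import re
from collections import defaultdict


def collapse_runs(sig):
    """Reduce runs of more than three identical characters to three ('bbbbb' -> 'bbb')."""
    return re.sub(r"(.)\1{3,}", r"\1\1\1", sig)


def seven_grams(sig):
    """All 7-character substrings of the collapsed signature."""
    s = collapse_runs(sig)
    return {s[i:i + 7] for i in range(len(s) - 6)}


def candidate_pairs(signatures):
    """signatures: iterable of (file_id, block_size, signature).
    Returns pairs of file ids that share a block size and at least one seven-gram."""
    index = defaultdict(set)
    for file_id, block_size, sig in signatures:
        for gram in seven_grams(sig):
            index[(block_size, gram)].add(file_id)
    pairs = set()
    for ids in index.values():
        ordered = sorted(ids)
        for i, x in enumerate(ordered):
            for y in ordered[i + 1:]:
                pairs.add((x, y))
    return pairs


sigs = [
    ("file1", 768, "wemfOGxqCfOTPi0ND"),
    ("file2", 768, "LsmfOGxqCfOTPi0ND"),
    ("file3", 768, "abcdefghijklmnop"),
    ("file4", 384, "wemfOGxqCfOTPi0ND"),   # same text, different block size
]
print(candidate_pairs(sigs))
```

Only the pairs returned here need the expensive edit-distance calculation.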

Redis, how does SCAN cursor "state management" work?

Redis has a SCAN command that may be used to iterate keys matching a pattern etc.
Redis SCAN doc
You start by giving a cursor value of 0; each call returns a new cursor value which you pass into the next SCAN call. A value of 0 indicates iteration is finished. Supposedly no server or client state is needed (except for the cursor value)
I'm wondering how Redis implements the scanning algorithm-wise?
You can find the answer in the Redis dict.c source file. I will quote part of it.
Iterating works the following way:
1) Initially you call the function using a cursor (v) value of 0.
2) The function performs one step of the iteration, and returns the new cursor value you must use in the next call.
3) When the returned cursor is 0, the iteration is complete.
The function guarantees all elements present in the dictionary get returned between the start and end of the iteration. However it is possible some elements get returned multiple times. For every element returned, the callback argument 'fn' is called with 'privdata' as first argument and the dictionary entry 'de' as second argument.
How it works
The iteration algorithm was designed by Pieter Noordhuis. The main idea is to increment a cursor starting from the higher order bits. That is, instead of incrementing the cursor normally, the bits of the cursor are reversed, then the cursor is incremented, and finally the bits are reversed again.
This strategy is needed because the hash table may be resized between iteration calls. dict.c hash tables are always power of two in size, and they use chaining, so the position of an element in a given table is given by computing the bitwise AND between Hash(key) and SIZE-1 (where SIZE-1 is always the mask that is equivalent to taking the rest of the division between the Hash of the key and SIZE).
For example if the current hash table size is 16, the mask is (in binary) 1111. The position of a key in the hash table will always be the last four bits of the hash output, and so forth.
What happens if the table changes in size?
If the hash table grows, elements can go anywhere in one multiple of the old bucket: for example let's say we already iterated with a 4 bit cursor 1100 (the mask is 1111 because hash table size = 16).
If the hash table will be resized to 64 elements, then the new mask will be 111111. The new buckets you obtain by substituting in ??1100 with either 0 or 1 can be targeted only by keys we already visited when scanning the bucket 1100 in the smaller hash table.
By iterating the higher bits first, because of the inverted counter, the cursor does not need to restart if the table size gets bigger. It will continue iterating using cursors without '1100' at the end, and also without any other combination of the final 4 bits already explored.
Similarly when the table size shrinks over time, for example going from 16 to 8, if a combination of the lower three bits (the mask for size 8 is 111) were already completely explored, it would not be visited again because we are sure we tried, for example, both 0111 and 1111 (all the variations of the higher bit) so we don't need to test it again.
Wait... You have TWO tables during rehashing!
Yes, this is true, but we always iterate the smaller table first, then we test all the expansions of the current cursor into the larger table. For example if the current cursor is 101 and we also have a larger table of size 16, we also test (0)101 and (1)101 inside the larger table. This reduces the problem back to having only one table, where the larger one, if it exists, is just an expansion of the smaller one.
Limitations
This iterator is completely stateless, and this is a huge advantage, including no additional memory used.
The disadvantages resulting from this design are:
It is possible we return elements more than once. However this is usually easy to deal with in the application level.
The iterator must return multiple elements per call, as it needs to always return all the keys chained in a given bucket, and all the expansions, so we are sure we don't miss keys moving during rehashing.
The reverse cursor is somewhat hard to understand at first, but this comment is supposed to help.
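To make the reversed increment concrete, here is a small model of the cursor arithmetic only. It is not Redis's actual code: `bits` (log2 of the table size) and the function names are my own, and no real hash table is involved.

```python
def reverse_bits(v, bits):
    """Reverse the lowest `bits` bits of v."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (v & 1)
        v >>= 1
    return r


def next_cursor(v, bits):
    """Reverse the cursor, increment it, reverse it back (wrapping to 0 at the end)."""
    mask = (1 << bits) - 1
    return reverse_bits((reverse_bits(v, bits) + 1) & mask, bits)


def scan_order(bits, start=0):
    """Buckets visited from `start` until the cursor returns to 0."""
    order, v = [], start
    while True:
        order.append(v)
        v = next_cursor(v, bits)
        if v == 0:
            return order


print(scan_order(3))           # [0, 4, 2, 6, 1, 5, 3, 7]

# Table of size 4: buckets 0 and 2 are visited, the next cursor is 1.
# The table then grows to size 8 and the scan continues with the same cursor.
print(scan_order(3, start=1))  # [1, 5, 3, 7]
```

Buckets 0 and 2 of the small table expand to 0, 4, 2 and 6 in the larger one, which are exactly the buckets the continued scan skips, so nothing is missed and nothing in those buckets is rescanned.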

Storing trillions of document similarities

I wrote a program to compute similarities among a set of 2 million documents. The program works, but I'm having trouble storing the results. I won't need to access the results often, but will occasionally need to query them and pull out subsets for analysis. The output basically looks like this:
1,2,0.35
1,3,0.42
1,4,0.99
1,5,0.04
1,6,0.45
1,7,0.38
1,8,0.22
1,9,0.76
.
.
.
Columns 1 and 2 are document ids, and column 3 is the similarity score. Since the similarity scores are symmetric I don't need to compute them all, but that still leaves me with 2000000*(2000000-1)/2 ≈ 2,000,000,000,000 lines of records.
A text file with 1 million lines of records is already 9MB. Extrapolating, that means I'd need 17 TB to store the results like this (in flat text files).
Are there more efficient ways to store these sorts of data? I could have one row for each document and get rid of the repeated document ids in the first column. But that'd only go so far. What about file formats, or special database systems? This must be a common problem in "big data"; I've seen papers/blogs reporting similar analyses, but none discuss practical dimensions like storage.
DISCLAIMER: I don't have any practical experience with this, but it's a fun exercise and after some thinking this is what I came up with:
Since you have 2.000.000 documents you're kind of stuck with an integer for the document id's; that makes 4 bytes + 4 bytes; the comparison seems to be between 0.00 and 1.00, I guess a byte would do by encoding the 0.00-1.00 as 0..100.
So your table would be : id1, id2, relationship_value
That brings it to exactly 9 bytes per record. Thus (without any overhead) ((2 * 10^6)^2)*9/2bytes are needed, that's about 17Tb.
Of course that's if you have just a basic table. Since you don't plan on querying it very often I guess performance isn't that much of an issue. So you could go 'creative' by storing the values 'horizontally'.
Simplifying things, you would store the values in a 2 million by 2 million square and each 'intersection' would be a byte representing the relationship between their coordinates. This would "only" require about 3.6Tb, but it would be a pain to maintain, and it also doesn't make use of the fact that the relations are symmetrical.
So I'd suggest to use a hybrid approach, a table with 2 columns. First column would hold the 'left' document-id (4 bytes), 2nd column would hold a string of all values of documents starting with an id above the id in the first column using a varbinary. Since a varbinary only takes the space that it needs, this helps us win back some space offered by the symmetry of the relationship.
In other words,
record 1 would have a string of (2.000.000-1) bytes as value for the 2nd column
record 2 would have a string of (2.000.000-2) bytes as value for the 2nd column
record 3 would have a string of (2.000.000-3) bytes as value for the 2nd column
etc
That way you should be able to get away with something like 2Tb (inc overhead) to store the information. Add compression to it and I'm pretty sure you can store it on a modern disk.
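The same layout can be sketched as one flat byte array holding only the upper triangle, which shows where the roughly 2 TB figure comes from. This assumes 0-based document indices and one byte per score; the function names are invented.

```python
def pair_offset(i, j, n):
    """Byte offset of the score for documents i and j (0-based, i != j) when the
    rows of the upper triangle are stored back to back."""
    if i == j:
        raise ValueError("no score is stored for a document with itself")
    if i > j:
        i, j = j, i                       # symmetry: only i < j is stored
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def encode(score):
    """Map a similarity in 0.00..1.00 to a byte value 0..100."""
    return round(score * 100)


def decode(value):
    return value / 100


n = 5
store = bytearray(n * (n - 1) // 2)       # 10 bytes for 5 documents
store[pair_offset(0, 3, n)] = encode(0.99)
print(decode(store[pair_offset(3, 0, n)]))

n_docs = 2_000_000
print(n_docs * (n_docs - 1) // 2)         # bytes needed for all pairs
```

Lookups by pair are a single seek, but a query such as "all pairs above 0.89" still has to read the whole array.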
Of course the system is far from optimal. In fact, querying the information will require some patience as you can't approach things set-based and you'll pretty much have to scan things byte by byte. A nice 'benefit' of this approach would be that you can easily add new documents by adding a new byte to the string of EACH record + 1 extra record in the end. Operations like that will be costly though as they will result in page-splits; but at least it will be possible without having to completely rewrite the table. But it will cause quite a bit of fragmentation over time and you might want to rebuild the table once in a while to make it more 'aligned' again. Ah.. technicalities.
Selecting and Updating will require some creative use of SubString() operations, but nothing too complex..
PS: Strictly speaking, for 0..100 you only need 7 bits, so if you really want to squeeze the last bit out of it you could actually store 8 values in 7 bytes and save another eighth of the space (roughly 250 GB out of 2 TB), but it would make things quite a bit more complex... then again, it's not like the data is going to be human-readable anyway =)
PS: this line of thinking is completely geared towards reducing the amount of space needed while remaining practical in terms of updating the data. I'm not saying it's going to be fast; in fact, if you'd go searching for all documents that have a relation-value of 0.89 or above the system will have to scan the entire table and even with modern disks that IS going to take a while.
Mind you that all of this is the result of half an hour brainstorming; I'm actually hoping that someone might chime in with a neater approach =)

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset(200 million entries) in a flat file. Data is of the form - a 10 digit phone number followed by 5-6 binary fields.
Every week I will be getting a delta file which will only contain changes to the data.
Problem: Given a list of items, I need to figure out whether each item (which will be the 10 digit number) is present in the dataset.
The approach I have planned:
1. Will parse the dataset and put it in a DB (to be done at the start of the week) like MySQL or Postgres. The reason I want to have an RDBMS in the first step is that I want to have full time series data.
2. Then generate some kind of key-value store out of this database with the latest valid data, which supports an operation to find out whether each item is present in the dataset or not (thinking of some kind of NoSQL DB, like Redis, here, optimised for search; it should have persistence and be distributed). This data structure will be read-only.
3. Query this key-value store to find out whether each item is present (if possible, match a list of values all at once instead of matching one item at a time). Want this to be blazing fast. Will be using this functionality as the back-end to a REST API.
Sidenote: Language of my preference is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use the Redis SINTER which performs set intersection.
You might benefit from using a grid structure by distributing number ranges over some hash function such as the first digit of the phone number (there are probably better ones, you have to experiment), this would e.g. reduce the size per node, when using an optimal hash, to near 20 million entries when using 10 nodes.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.
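A toy model of those three ideas, with plain Python sets standing in for the Redis nodes. The class name, the sharding function (number modulo node count) and the cache size are all assumptions for illustration, not a recommendation for a specific hash.

```python
from collections import OrderedDict


def shard_for(number, n_nodes=10):
    """Hypothetical sharding function: phone number modulo the node count."""
    return int(number) % n_nodes


class PhoneLookup:
    def __init__(self, numbers, n_nodes=10, cache_size=1000):
        self.n_nodes = n_nodes
        self.shards = [set() for _ in range(n_nodes)]   # one set per node
        for num in numbers:
            self.shards[shard_for(num, n_nodes)].add(num)
        self.cache = OrderedDict()                      # recently asked numbers
        self.cache_size = cache_size

    def contains_many(self, queries):
        """Return {number: present?} for a whole batch of numbers."""
        result, by_shard = {}, {}
        for q in queries:
            if q in self.cache:
                self.cache.move_to_end(q)
                result[q] = self.cache[q]
            else:
                by_shard.setdefault(shard_for(q, self.n_nodes), set()).add(q)
        for shard, qs in by_shard.items():
            present = qs & self.shards[shard]           # set intersection per node
            for q in qs:
                result[q] = q in present
                self.cache[q] = result[q]
                if len(self.cache) > self.cache_size:
                    self.cache.popitem(last=False)      # evict the oldest entry
        return result


store = PhoneLookup(["5551230001", "5551230002", "5559870003"])
print(store.contains_many(["5551230001", "5550000000"]))
```

Because the key-value store is read-only between weekly rebuilds, cached answers stay valid until the next rebuild, at which point the cache should be cleared.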

Is varchar(128) better than varchar(100)

Quick question. Does it matter from the point of storing data if I will use decimal field limits or hexadecimal (say 16,32,64 instead of 10,20,50)?
I ask because I wonder if this will have anything to do with clusters on HDD?
Thanks!
VARCHAR(128) is better than VARCHAR(100) if you need to store strings longer than 100 bytes.
Otherwise, there is very little to choose between them; you should choose the one that better fits the maximum length of the data you might need to store. You won't be able to measure the performance difference between them. All else apart, the DBMS probably only stores the data you send, so if your average string is, say, 16 bytes, it will only use 16 (or, more likely, 17 - allowing 1 byte for storing the length) bytes on disk. The bigger size might affect the calculation of how many rows can fit on a page - detrimentally. So choosing the smallest size that is adequate makes sense - waste not, want not.
So, in summary, there is precious little difference between the two in terms of performance or disk usage, and aligning to convenient binary boundaries doesn't really make a difference.
If it were a C program I'd spend some time thinking about that, too. But with a database I'd leave it to the DB engine.
DB programmers spent a lot of time in thinking about the best memory layout, so just tell the database what you need and it will store the data in a way that suits the DB engine best (usually).
If you want to align your data, you'll need exact knowledge of the internal data organization: How is the string stored? One, two or 4 bytes to store the length? Is it stored as plain byte sequence or encoded in UTF-8 UTF-16 UTF-32? Does the DB need extra bytes to identify NULL or > MAXINT values? Maybe the string is stored as a NUL-terminated byte sequence - then one byte more is needed internally.
Also, with VARCHAR it is not necessarily true that the DB will always allocate 100 (128) bytes for your string. Maybe it stores just a pointer to where the space for the actual data is.
So I'd strongly suggest to use VARCHAR(100) if that is your requirement. If the DB decides to align it somehow there's room for extra internal data, too.
Other way around: Let's assume you use VARCHAR(128) and all things come together: The DB allocates 128 bytes for your data. Additionally it needs 2 bytes more to store the actual string length - makes 130 bytes - and then it could be that the DB aligns the data to the next (let's say 32 byte) boundary: The actual data needed on the disk is now 160 bytes 8-}
Yes but it's not that simple. Sometimes 128 can be better than 100 and sometimes, it's the other way around.
So what is going on? varchar only allocates space as necessary so if you store hello world in a varchar(100) it will take exactly the same amount of space as in a varchar(128).
The question is: If you fill up the rows, will you hit a "block" limit/boundary or not?
Databases store their data in blocks. These have a fixed size, for example 512 (this value can be configured for some databases). So the question is: How many blocks does the DB have to read to fetch each row? Rows that span several blocks will need more I/O, so this will slow you down.
But again: This doesn't depend on the theoretical maximum size of the columns but on a) how many columns you have (each column needs a little bit of space even when it's empty or null), b) how many fixed width columns you have (number/decimal, char), and finally c) how much data you have in variable columns.