SHA1-Indexed Hash table in D - indexing

I'm using a D builtin hash table indexed by SHA1-digests (ubyte[20]) to relate information in my file system search engine.
Are there any data structures more suitable for this (in D), given all the nice properties of such a key: uniformly distributed, random, and fixed-size? Or will D's builtin hash tables automatically figure out that they could, for example, just pick the first n (1-8) bytes of a SHA1 digest as the bucket index?

I think the hash function used inside standard maps is cheap enough that it won't make much, if any, difference unless you are running code that is mostly look-ups. Keep in mind that the full key will be read to do the final comparison, so it will get loaded into the cache either way.
OTOH, I think there is a toHash method you can define on a custom key type.
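As a rough illustration of the idea in the question (written in Python rather than D, purely as a sketch): because a SHA-1 digest is already uniformly distributed, a key type can return a prefix of the digest as its hash value instead of rehashing all 20 bytes. The class name DigestKey and the sample path are hypothetical; equality still compares the full digest, as noted above.

```python
import hashlib


class DigestKey:
    """Wraps a 20-byte SHA-1 digest for use as a dictionary key (hypothetical helper)."""

    __slots__ = ("digest",)

    def __init__(self, digest: bytes):
        if len(digest) != 20:
            raise ValueError("expected a 20-byte SHA-1 digest")
        self.digest = digest

    def __hash__(self) -> int:
        # The digest is already uniformly distributed, so its first 8 bytes
        # are used directly, with no extra mixing. (Python may reduce this
        # integer further to fit its native hash width.)
        return int.from_bytes(self.digest[:8], "little")

    def __eq__(self, other) -> bool:
        # The final comparison still reads the whole key.
        return isinstance(other, DigestKey) and self.digest == other.digest


index = {}
key = DigestKey(hashlib.sha1(b"/home/user/file.txt").digest())
index[key] = "metadata for file.txt"
```

Whether this beats the default hashing is something to measure; as the answer says, the difference is likely small unless the workload is dominated by look-ups.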

Related

What is indexing? Why don't we use hashing for everything?

Going over some interview info about data structures etc.
So, as I understand, arrays are O(1) for indexing, which I believe means finding the specific element contained at space x in the array. Just want to confirm this as I am second guessing myself.
Also, hash maps are O(1) for indexing, searching, insertion and deletion. Does that not kind of make any data structure question pointless, since a hash map will always be the best solution?
Thanks
Well, indexing is not only about arrays.
According to one definition, indexing means creating tables (indexes) that point to the location of folders, files, and records. Depending on the purpose, an index identifies the location of resources based on file names, key data fields in a database record, text within a file, or unique attributes in a graphics or video file.
For your second question: hash maps are not always the best data structure, for several reasons, mainly:
Collisions
Hash function calculation time
Extra memory used
There are also lots of data structure questions where hash maps are not superior:
A data structure for finding the k-th minimum element while supporting updates (a hash map would amount to brute force because it does not keep elements sorted, so we need something like a balanced binary search tree)
A data structure for finding whether a word is in a dictionary (sure, a hash map works, but a trie can be faster, shares storage between words with a common prefix, and also answers prefix queries)
A data structure for finding the minimum element in any range of an array with updates (once again a hash map is just too slow for this; we need something like a segment tree)
...
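To make the dictionary example concrete, here is a minimal trie sketch in Python. The class and method names are hypothetical, and each node's children are held in a dict for brevity. It supports the exact-word lookup a hash map would also give, plus a prefix query that a plain hash map cannot answer without scanning every key.

```python
class Trie:
    """Minimal trie: one node per character, shared between common prefixes."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _walk(self, text: str):
        # Follow the characters of text; return the final node, or None.
        node = self
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word: str) -> bool:
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix: str) -> bool:
        return self._walk(prefix) is not None


trie = Trie()
for w in ["hash", "hashing", "heap"]:
    trie.insert(w)
```

Both operations cost time proportional to the length of the query string, independent of how many words are stored.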

Identifying Differences Efficiently

Every day, we receive huge files from various vendors in different formats (CSV, XML, custom) which we need to upload into a database for further processing.
The problem is that these vendors will send the full dump of their data and not just the updates. We have some applications where we need only send the updates (that is, the changed records only). What we do currently is to load the data into a staging table and then compare it against previous data. This is painfully slow as the data set is huge and we are occasionally missing SLAs.
Is there a quicker way to resolve this issue? Any suggestions or help would be greatly appreciated. Our programmers are running out of ideas.
There are a number of patterns for detecting deltas, i.e. changed records, new records, and deleted records, in full dump data sets.
One of the more efficient ways I've seen is to create hash values of the rows of data you already have, create hashes of the import once it's in the database, then compare the existing hashes to the incoming hashes.
Primary key match + hash match = Unchanged row
Primary key match + hash mismatch = Updated row
Primary key in incoming data but missing from existing data set = New row
Primary key not in incoming data but in existing data set = Deleted row
How to hash varies by database product, but all of the major providers have some sort of hashing available in them.
The advantage comes from only having to compare a small number of fields (the primary key column(s) and the hash) rather than doing a field by field analysis. Even pretty long hashes can be analyzed pretty fast.
It'll require a little rework of your import processing, but the time spent will pay off over and over again in increased processing speed.
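The four rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, rows are hashed with SHA-1 over fields joined by a separator assumed not to occur in the data, and the existing hashes are assumed to fit in a dictionary keyed by primary key.

```python
import hashlib


def row_hash(fields) -> str:
    # Join the fields with a separator assumed not to appear in the data.
    joined = "\x1f".join(str(f) for f in fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def classify(existing: dict, incoming: dict) -> dict:
    """existing: primary key -> stored hash; incoming: primary key -> row fields."""
    result = {"unchanged": [], "updated": [], "new": [], "deleted": []}
    for pk, fields in incoming.items():
        if pk not in existing:
            result["new"].append(pk)        # key only in the incoming data
        elif existing[pk] == row_hash(fields):
            result["unchanged"].append(pk)  # key match + hash match
        else:
            result["updated"].append(pk)    # key match + hash mismatch
    # Keys present before but absent from the new dump.
    result["deleted"] = [pk for pk in existing if pk not in incoming]
    return result


existing = {
    1: row_hash(["alice", 10]),
    2: row_hash(["bob", 20]),
    3: row_hash(["carol", 30]),
}
incoming = {1: ["alice", 10], 2: ["bob", 25], 4: ["dave", 40]}
deltas = classify(existing, incoming)
```

In a database the same comparison would typically be a join between the stored (key, hash) pairs and the staged ones; the logic is the same.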
The standard solution to this is hash functions. What you do is have the ability to take each row, and calculate an identifier + a hash of its contents. Now you compare hashes, and if the hashes are the same then you assume that the row is the same. This is imperfect - it is theoretically possible that different values will give the same hash value. But in practice you have more to worry about from cosmic rays causing random bit flips in your computer than you do about hash functions failing to work as promised.
Both rsync and git are examples of widely used software that use hashes in this way.
In general calculating a hash before you put it in the database is faster than performing a series of comparisons inside of the database. Furthermore it allows processing to be spread out across multiple machines, rather than bottlenecked in the database. And comparing hashes is less work than comparing many fields, whether you do it in the database or out.
There are many hash functions that you can use. Depending on your application, you might want to use a cryptographic hash, though you probably don't have to. More bits is better than fewer, but a 64-bit hash should be fine for the application that you describe. After processing a trillion deltas you would still have less than 1 chance in 10 million of having made an accidental mistake.
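The "1 chance in 10 million" figure can be checked with a little arithmetic. The sketch below assumes the hash behaves like a uniform random function, so a changed row keeps its old 64-bit hash with probability 2^-64, and it uses a simple union bound over all comparisons.

```python
comparisons = 10**12   # a trillion changed rows, each compared by hash
hash_bits = 64

# Probability that a single changed row hashes to the same value as before.
p_single = 1 / 2**hash_bits

# Union bound on the probability of missing at least one change overall.
p_any = comparisons * p_single

# Expressed as "1 chance in N".
one_in = 1 / p_any
```

Under these assumptions p_any comes out at roughly 5.4e-8, that is, about 1 chance in 18 million, which is consistent with the claim.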

How to implement a scalable, unordered collection in DynamoDB?

I am looking into implementing a scalable unordered collection of objects on top of Amazon DynamoDB. So far the following options have been considered:
1) Use DynamoDB document data types (map, list) and use document paths to access individual items. This has one obvious drawback: the collection is limited to 400KB of data, meaning perhaps 1..10K objects depending on their size. A less obvious drawback is that the cost of inserting a new object into such a collection is going to be huge: Amazon specifies that write capacity is deducted based on the total item size, not just the newly added object, so it takes ~400 capacity units to insert a 1KB object when approaching the size limit. So I consider this ruled out.
2) Use a composite primary hash + range key, where the hash key remains the same for all objects in the collection, and the range key is just something random or an atomic counter. The obvious drawback is that an identical hash key results in bad key distribution: cardinality is low when there are collections with a large number of objects. This means bad partitioning and a scaling issue, with all reads/writes on the same collection stuck on one shard and subject to the 3000 reads / 1000 writes per second limit of a DynamoDB partition.
3) Use a global secondary index with a secondary hash + range key, where the hash key remains the same for all objects belonging to the same collection, and the range key is just something random or an atomic counter. Similar to the above, partitioning becomes poor for the GSI, and it will become a bottleneck, with too many identical hashes rapidly draining all the capacity provisioned to the index. I didn't find out exactly how the GSI is implemented, so I'm not sure how badly it suffers from low cardinality.
The question is whether I could live with (2) or (3) and suffer from non-ideal key distribution, whether there is another way of implementing a collection that I have overlooked, or whether I should consider another NoSQL database engine altogether.
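The write-cost concern raised for the first option can be worked through numerically. This is a sketch of the arithmetic only, assuming standard (non-transactional) writes billed at one write capacity unit per 1 KB of item size, rounded up, and the 400 KB item limit cited above. The function names are hypothetical.

```python
import math

ITEM_LIMIT_KB = 400  # maximum item size cited in the question


def write_units(item_size_kb: float) -> int:
    # One write capacity unit covers up to 1 KB; larger items round up.
    return max(1, math.ceil(item_size_kb))


def append_cost_in_document(current_size_kb: float, new_object_kb: float) -> int:
    # Appending to a map/list attribute rewrites the whole item, so the
    # cost is driven by the item's total size after the append.
    new_size = current_size_kb + new_object_kb
    if new_size > ITEM_LIMIT_KB:
        raise ValueError("item would exceed the 400 KB limit")
    return write_units(new_size)


cost_embedded = append_cost_in_document(399, 1)  # object added inside one big item
cost_separate = write_units(1)                   # same object stored as its own item
```

With these assumptions the embedded append costs 400 units against 1 unit for a stand-alone item, which is the gap the question describes.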
This is a "shooting from the hip" answer; what you end up doing may depend on how much and what type of reading and writing you do.
Two things the DynamoDB docs encourage you to avoid are hot keys and, in general, scans. You noted that in cases (2) and (3), you end up with a hot key. If you expect this to scale (large collections), the hot key will probably hurt more and more, especially if this is a write-intensive application.
The docs on Query and Scan operations (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html) say that, for a query, "you must specify the hash key attribute name and value as an equality condition." So if you want to avoid scans, this might still force your hand and put you back into that hot key situation.
Maybe one route would be to embrace doing a scan operation, but just have one table devoted to your collection. Then you could just have a fully random (well distributed) hash key and do a scan every time. This assumes you always want everything from the collection (you didn't say). This will still hurt if you scale up to a large collection, but if you always want the full set back, you'll have to deal with that pain regardless. If you just want a subset, you can add a limit parameter. This would help performance, but you will always get back the same subset (or you can use the last evaluated key and keep going). The docs also mention parallel scans.
If you are using AWS, ElastiCache/Redis might be another route to try. A first pass might code up a lot faster and cleaner than situation (1) that you mentioned.

Is O(1) access to a database row possible?

I have a table which uses an auto-increment field (ID) as its primary key. The table is append-only and no row will ever be deleted. The table has been designed to have a constant row size.
Hence, I expected to have O(1) access time using any value as ID, since it is easy to compute the exact position to seek to in the file (ID*row_size). Unfortunately, that is not the case.
I'm using SQL Server.
Is it even possible?
Thanks
"Hence, I expected to have O(1) access time using any value as ID since it is easy to compute exact position to seek in file (ID*row_size),"
Ah. No. Autoincrement does not, even without deletions, guarantee no holes (for example, a rolled-back insert still consumes an ID). Holes = seek via index. Ergo: your assumption is wrong.
I guess the thing that matters to you is the performance.
Databases use indexes to access records which are written on the disk.
Usually this is done with B+ tree indexes, whose lookups take O(log_b n) steps, where the branching factor b for internal nodes is typically between 100 and 200 (optimized to the block size).
Strictly speaking this is still logarithmic performance, but given a decent number of records, let's say a few million, the leaf nodes can be reached in 3 to 4 steps. Together with all the overhead for query planning, session initiation, locking, etc. (which you would have anyway if you need a multiuser, ACID-compliant data management system), that is for all practical purposes comparable to constant time.
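The "3 to 4 steps" estimate follows directly from the branching factor. A small sketch (the function name is hypothetical) counts how many levels a tree with a given fan-out needs before it can address a given number of records. It ignores fill factor and the difference between leaf and internal node capacity.

```python
def levels_needed(records: int, fanout: int) -> int:
    """Smallest number of levels h such that fanout**h >= records."""
    levels, capacity = 0, 1
    while capacity < records:
        capacity *= fanout
        levels += 1
    return levels


few_million = 5_000_000
low_fanout = levels_needed(few_million, 100)       # b = 100
high_fanout = levels_needed(few_million, 200)      # b = 200
billion_rows = levels_needed(1_000_000_000, 200)   # growing n by 200x adds one level
```

For five million records this gives 4 levels at b = 100 and 3 levels at b = 200, and even a billion records need only 4 levels at b = 200.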
The good news is that an indexed read is O(log(n)), which grows so slowly that for practical table sizes it behaves almost like O(1). That said, in this context O notation is not very useful, and actual timings are far more meaningful.
Even if it were possible to address rows directly, your query would still have to go through the client and server protocol stacks and carry out various lookups and memory allocations before it could give the result you want. It seems like you are expecting something that isn't even practical. What is the real problem here? Is SQL Server not fast enough for you? If so there are many options you can use to improve performance but directly seeking an address in a file is not one of them.
Not possible. SQL Server organizes data into a tree-like structure based on key and index values; an "index" in the DB sense is more like a reference book's index and not like an indexed data structure like an array or list. At best, you can get logarithmic performance when searching on an indexed value (PKs are generally treated as an index). Worst-case is a table scan for a non-indexed column, which is linear. Until the database gets very large, the seek time of a well-designed query against a well-designed table will pale in comparison to the time required to send it over the network or even a named pipe.

Representing Sparse Data in PostgreSQL

What's the best way to represent a sparse data matrix in PostgreSQL? The two obvious methods I see are:
1) Store data in a single table with a separate column for every conceivable feature (potentially millions), but with a default value of NULL for unused features. This is conceptually very simple, but I know that with most RDBMS implementations this is typically very inefficient, since the NULL values usually take up some space. However, I read an article (can't find its link unfortunately) that claimed PG doesn't take up data for NULL values, making it better suited for storing sparse data.
2) Create separate "row" and "column" tables, as well as an intermediate table to link them and store the value for the column at that row. I believe this is the more traditional RDBMS solution, but there's more complexity and overhead associated with it.
I also found PostgreDynamic, which claims to better support sparse data, but I don't want to switch my entire database server to a PG fork just for this feature.
Are there any other solutions? Which one should I use?
I'm assuming you're thinking of sparse matrices from a mathematical context:
http://en.wikipedia.org/wiki/Sparse_matrix (The storage techniques described there are for memory storage (fast arithmetic operations), not persistent storage (low disk usage).)
Since one usually operates on these matrices on the client side rather than on the server side, a SQL ARRAY[] is the best choice!
The question is how to take advantage of the sparsity of the matrix. Here are the results from some investigations.
Setup:
Postgres 8.4
Matrices w/ 400*400 elements in double precision (8 bytes) --> 1.28 MB raw size per matrix
33% non-zero elements --> 427 kB effective size per matrix
averaged using ~1000 different random populated matrices
Competing methods:
Rely on the automatic server side compression of columns with SET STORAGE MAIN or EXTENDED.
Only store the non-zero elements plus a bitmap (bit varying(xx)) describing where to locate the non-zero elements in the matrix. (One double precision is 64 times bigger than one bit. In theory (ignoring overheads) this method should be an improvement if <=98% are non-zero ;-).) Server side compression is activated.
Replace the zeros in the matrix with NULL. (The RDBMSs are very effective in storing NULLs.) Server side compression is activated.
(Indexing of non-zero elements using a 2nd index ARRAY[] is not very promising and therefore not tested.)
Results:
Automatic compression:
  - no extra implementation effort
  - no reduced network traffic
  - minimal compression overhead
  - persistent storage = 39% of the raw size
Bitmap:
  - acceptable implementation effort
  - network traffic slightly decreased; dependent on sparsity
  - persistent storage = 33.9% of the raw size
Replace zeros with NULLs:
  - some implementation effort (API needs to know where and how to set the NULLs in the ARRAY[] while constructing the INSERT query)
  - no change in network traffic
  - persistent storage = 35% of the raw size
Conclusion:
Start with the EXTENDED/MAIN storage parameter. If you have some free time, investigate your data and run my test setup with your sparsity level. But the effect may be lower than you expect.
I suggest always using a matrix serialization (e.g. row-major order) plus two integer columns for the matrix dimensions NxM. Since most APIs use textual SQL, this saves a lot of network traffic and client memory compared with nested "ARRAY[ARRAY[..], ARRAY[..], ARRAY[..], ARRAY[..], ..]"!
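A minimal sketch of the suggested serialization in Python (the function names are hypothetical): the matrix is flattened in row-major order into a single one-dimensional list, and the dimensions N and M are carried alongside so the client can rebuild it.

```python
def flatten(matrix):
    """Return (n, m, data) with data holding the rows one after another."""
    n = len(matrix)
    m = len(matrix[0]) if n else 0
    if any(len(row) != m for row in matrix):
        raise ValueError("all rows must have the same length")
    return n, m, [v for row in matrix for v in row]


def unflatten(n, m, data):
    """Rebuild the N x M matrix from its row-major serialization."""
    if len(data) != n * m:
        raise ValueError("data length does not match the dimensions")
    return [data[i * m:(i + 1) * m] for i in range(n)]


matrix = [[1.0, 0.0, 2.0],
          [0.0, 0.0, 3.0]]
n, m, data = flatten(matrix)
rebuilt = unflatten(n, m, data)
```

Element (i, j) of the original matrix sits at position i*m + j of the flat list, so individual entries can still be addressed without rebuilding the nested form.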
Tebas
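-- Dense reference table: EXTERNAL storage keeps the array uncompressed.
CREATE TABLE _testschema.matrix_dense
(
    matdata double precision[]
);
ALTER TABLE _testschema.matrix_dense ALTER COLUMN matdata SET STORAGE EXTERNAL;
-- Same data, relying on the default server-side compression.
CREATE TABLE _testschema.matrix_sparse_autocompressed
(
    matdata double precision[]
);
-- Non-zero elements only, plus a bitmap marking their positions.
CREATE TABLE _testschema.matrix_sparse_bitmap
(
    matdata double precision[],
    bitmap bit varying(8000000)
);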
Insert the same matrices into all tables. The concrete data depends on the particular table.
Do not change the data on the server side afterwards, because that leaves unused but allocated pages behind. Otherwise, run a VACUUM first.
SELECT
pg_total_relation_size('_testschema.matrix_dense') AS dense,
pg_total_relation_size('_testschema.matrix_sparse_autocompressed') AS autocompressed,
pg_total_relation_size('_testschema.matrix_sparse_bitmap') AS bitmap;
A few solutions spring to mind:
1) Separate your features into groups that are usually set together, create a table for each group with a one-to-one foreign key relationship to the main data, and only join on the tables you need when querying
2) Use the EAV anti-pattern: create a 'feature' table with a foreign key field from your primary table as well as a fieldname and a value column, and store the features as rows in that table instead of as attributes in your primary table
3) Similarly to how PostgreDynamic does it, create a table for each 'column' in your primary table (they use a separate namespace for those tables), and create functions to simplify (as well as efficiently index) accessing and updating the data in those tables
4) Create a column in your primary data using XML or VARCHAR, and store some structured text format within it representing your data; create indexes over the data with functional indexes, and write functions to update the data (or use the XML functions if you are using that format)
5) Use the contrib/hstore module to create a column of type hstore that can hold key-value pairs, and can be indexed and updated
6) Live with lots of empty fields
A NULL value will take up no space when it's NULL. It'll take up one bit in a bitmap in the tuple header, but that will be there regardless.
However, the system can't deal with millions of columns, period. There is a hard limit of 1600 columns per table, and you really don't want to go that far.
If you really need that many, in a single table, you need to go the EAV method, which is basically what you're saying in (2).
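For reference, the EAV layout looks roughly like this. The sketch uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere. The table names (item, feature), column names, and sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- One row per (item, feature) pair that is actually set; unset features
-- simply have no row, which is where the sparsity pays off.
CREATE TABLE feature (
    item_id   INTEGER NOT NULL REFERENCES item(id),
    fieldname TEXT    NOT NULL,
    value     REAL    NOT NULL,
    PRIMARY KEY (item_id, fieldname)
);
""")
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [(1, "first"), (2, "second"), (3, "empty")])
conn.executemany("INSERT INTO feature VALUES (?, ?, ?)",
                 [(1, "f17", 0.5), (1, "f90210", 2.0), (2, "f17", 1.5)])


def features_of(item_id: int) -> dict:
    """Return the sparse feature vector of one item as {fieldname: value}."""
    rows = conn.execute(
        "SELECT fieldname, value FROM feature WHERE item_id = ? ORDER BY fieldname",
        (item_id,),
    )
    return dict(rows.fetchall())
```

The composite primary key doubles as the index for per-item lookups; finding all items that have a given feature would need a second index on fieldname.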
If each entry has only relatively few keys, I suggest you look at the "hstore" contrib module, which lets you store this type of data very efficiently, as a third option. It's been enhanced further in the upcoming 9.0 version, so if you are a bit away from production deployment, you might want to look directly at that one. However, it's well worth it in 8.4 as well. And it does support some pretty efficient index-based lookups. Definitely worth looking into.
I know this is an old thread, but MADlib provides a sparse vector type for Postgres, along with several machine learning and statistical methods.