Identifying Differences Efficiently - sql

Every day, we receive huge files from various vendors in different formats (CSV, XML, custom) which we need to upload into a database for further processing.
The problem is that these vendors send the full dump of their data, not just the updates. We have some applications where we only need to send the updates (that is, the changed records only). What we currently do is load the data into a staging table and then compare it against the previous data. This is painfully slow because the data set is huge, and we are occasionally missing SLAs.
Is there a quicker way to resolve this issue? Any suggestions or help would be greatly appreciated. Our programmers are running out of ideas.

There are a number of patterns for detecting deltas, i.e. changed records, new records, and deleted records, in full dump data sets.
One of the more efficient ways I've seen is to create hash values of the rows of data you already have, create hashes of the import once it's in the database, then compare the existing hashes to the incoming hashes.
Primary key match + hash match = Unchanged row
Primary key match + hash mismatch = Updated row
Primary key in incoming data but missing from existing data set = New row
Primary key not in incoming data but in existing data set = Deleted row
How to hash varies by database product, but all of the major providers have some sort of hashing available in them.
The advantage comes from only having to compare a small number of fields (the primary key column(s) and the hash) rather than doing a field-by-field analysis. Even fairly long hashes can be compared quickly.
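As a rough, database-agnostic illustration of the comparison step, here is what it can look like once both sides have been reduced to primary key → hash mappings (how you build those mappings depends on your database's hashing functions):

```python
def classify(existing, incoming):
    """existing/incoming: dicts mapping primary key -> row hash."""
    new_rows     = incoming.keys() - existing.keys()
    deleted_rows = existing.keys() - incoming.keys()
    updated_rows = {pk for pk in incoming.keys() & existing.keys()
                    if incoming[pk] != existing[pk]}
    return new_rows, updated_rows, deleted_rows

# Example: keys 1 and 2 exist on both sides, 2 changed, 3 is new, 0 was deleted.
classify({0: "a9", 1: "b7", 2: "c1"}, {1: "b7", 2: "ff", 3: "d4"})
# -> ({3}, {2}, {0})
```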
It'll require a little rework of your import processing, but the time spent will pay off over and over again in increased processing speed.

The standard solution to this is hash functions. For each row, you calculate an identifier plus a hash of its contents. You then compare hashes, and if the hashes are the same you assume that the row is the same. This is imperfect: it is theoretically possible that different values will give the same hash value. But in practice you have more to worry about from cosmic rays causing random bit flips in your computer than from hash functions failing to work as promised.
Both rsync and git are examples of widely used software that use hashes in this way.
In general, calculating a hash before you put the data in the database is faster than performing a series of comparisons inside the database. It also allows the processing to be spread across multiple machines rather than bottlenecked in the database. And comparing hashes is less work than comparing many fields, whether you do it in the database or outside it.
There are many hash functions that you can use. Depending on your application, you might want to use a cryptographic hash though you probably don't have to. More bits is better than fewer, but a 64 bit hash should be fine for the application that you describe. After processing a trillion deltas you would still have less than 1 chance in 10 million of having made an accidental mistake.
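A minimal sketch of doing the hashing while the file is being read, before anything hits the database. blake2b is just one convenient way to get an 8-byte (64-bit) digest, and a CSV layout with the key in the first column is an assumption here:

```python
import csv
import hashlib

def row_hash_64(row):
    """64-bit hash of a row's values; blake2b lets us pick an 8-byte digest."""
    joined = "\x1f".join(row)  # unit separator so ("ab","c") and ("a","bc") hash differently
    return hashlib.blake2b(joined.encode("utf-8"), digest_size=8).hexdigest()

def hash_csv(path, key_column=0):
    """Yield (primary key, hash) pairs for a vendor file before it touches the database."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            yield row[key_column], row_hash_64(row)
```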

Related

Optimal DB for bulk inserts with unique constraint on binary data field

I am trying to find the optimal DB for inserting binary data many times when there's a unique constraint on it: there is a hash identifying each record that can't be repeated. I'm interested only in inserting the data and receiving feedback if the data can't be inserted because another record has the same identifying hash. I am not going to do anything else with the data: no querying, aggregations, or other computations.
There is a huge number of inserts, and I am not sure that a regular SQL DB can keep up with it.
The DB will also constantly grow in size (data will not be deleted. Maybe there will be some kind of retention every few years), so I can't use a DB which has a hard memory limit.
I've tried running on postgres with an index for the binary field, however it was too slow (around 2ms for each insert, inserting as a batch of 50000).
I've also tried MongoDB (which seemed to have slightly better performance), however it stopped responding after many inserts for an unknown reason.
I have also tried writing the hash field as a Base64-encoded string instead of a binary field, but that didn't perform well either.
I'm looking for the best simple DB that can handle a very large amount of data (terabytes over time) with good performance for a simple unique constraint.
Thanks in advance
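For reference, a minimal sketch of the kind of insert-only workload described, assuming PostgreSQL with psycopg2. The table and column names are invented, and `ON CONFLICT DO NOTHING` is just one way to detect duplicates per row without aborting the whole batch:

```python
import psycopg2

# Hypothetical schema: CREATE TABLE records (hash BYTEA PRIMARY KEY, payload BYTEA);
conn = psycopg2.connect("dbname=hashes")   # placeholder connection string

def insert_batch(rows):
    """rows: iterable of (hash_bytes, payload_bytes). Returns the hashes that were actually new."""
    inserted = []
    with conn.cursor() as cur:
        for h, payload in rows:
            cur.execute(
                "INSERT INTO records (hash, payload) VALUES (%s, %s) "
                "ON CONFLICT (hash) DO NOTHING",
                (psycopg2.Binary(h), psycopg2.Binary(payload)),
            )
            if cur.rowcount == 1:          # 0 means the hash already existed
                inserted.append(h)
    conn.commit()
    return inserted
```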

Is using a timestamp as a hash key on a GSI in DynamoDB a good approach?

I have a large (2B + records) DynamoDB table.
I want to implement a distributed locking process by adding a new field, 'index_due_at' when an item is created or updated. After the create/update, I will do some further processing on the item and then remove the 'index_due_at' field.
I'd like to create a sweeper job which will periodically extract any records with an outstanding 'index_due_at' field (on the assumption that something about the above process failed) to give those records further treatment. I would anticipate at most 100s of records in this state at any one time, more likely 10s.
To optimise the performance of the sweeper, I want to create a GSI including the new field (and project the key data into it).
It seems that using a timestamp (in millis) as the GSI HASH key ought to give a good distribution. And I don't need to query on this field's value, just on its presence. Can anyone identify any drawbacks in this approach and if so, suggest an alternative?
Issues I can anticipate include:
* Non-uniqueness in timestamps at milli level.
* Possible hash key problems with numeric values?
* Possible hash key problems with numeric values that don't vary much in the most significant digits.
This is less of a problem than you might be thinking. GSI hash keys don't actually have to be unique, so you're fine on that front.
You probably already know this, but your GSI will only contain items with GSI keys, so your GSI should be pretty small (100s of items).
One thought I have is that the index_due_at might actually be better as a GSI sort key rather than hash key. Data is sorted within a partition by sort key. So you could have a GSI hash key of index_due_at_flag which would be Y if present, then a sort key of index_due_at. This would mean all your data would be sorted naturally, so you could process it in date order.
That said, you are probably never going to Query this GSI, so I suspect your choice of keys hardly matters at all. Presumably you will just do a Scan, get all the items and try and process them all. In which case you would never even use the keys. Just having a key attribute present would put the item in the GSI.
Another thought is that you need to handle the fact that GSIs are not perfectly synchronous with the base table. It's possible (admittedly unlikely) that an item in your GSI has actually just been processed. Therefore if your sweeper script picks up an item from the GSI, you should handle the possibility that it has already been updated in the base table (e.g. by checking the base table item before attempting to process it).
Good luck with it. I answered because I liked your bio! Hope staying on the right side of barrel shaped is working out :)
This should be a perfect scenario for using a DynamoDB sparse index.
Use 'index_due_at' as the sort key in the GSI; only the items you are interested in will be in the index, which greatly reduces the space needed and improves performance.
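A rough boto3 sketch of that sparse-index flow. The table name, key attribute name, and GSI name below are made up, and the GSI keyed on index_due_at is assumed to already exist:

```python
import time
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my_items")              # hypothetical table name

def mark_for_indexing(item_id):
    """Setting the attribute puts the item into the sparse GSI."""
    table.update_item(
        Key={"id": item_id},                    # 'id' is a hypothetical key attribute
        UpdateExpression="SET index_due_at = :t",
        ExpressionAttributeValues={":t": int(time.time() * 1000)},
    )

def finish_processing(item_id):
    """Removing the attribute drops the item back out of the GSI."""
    table.update_item(
        Key={"id": item_id},
        UpdateExpression="REMOVE index_due_at",
    )

def sweep():
    """The sweeper just scans the (small) sparse index for stragglers."""
    resp = table.scan(IndexName="index_due_at-index")   # hypothetical GSI name
    for item in resp["Items"]:
        ...  # re-process the item, checking the base table first
```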

How to implement a scalable, unordered collection in DynamoDB?

I am looking into implementing a scalable unordered collection of objects on top of Amazon DynamoDB. So far the following options have been considered:
Use DynamoDB document data types (map, list) and use document paths to access stand-alone items. This has one obvious drawback: a collection is limited to 400KB of data, meaning perhaps 1..10K objects depending on their size. A less obvious drawback is that the cost of inserting a new object into such a collection is going to be huge: Amazon specifies that write capacity is deducted based on the total item size, not just the newly added object, therefore ~400 capacity units for inserting a 1KB object when approaching the size limit. So I consider this ruled out.
Using a composite primary hash + range key, where the primary hash remains the same for all objects in the collection, and the range key is just something random or an atomic counter. The obvious drawback is that the identical hash key results in bad key distribution: cardinality is low when there are collections with a large number of objects. This means bad partitioning and a scaling issue, with all reads/writes on the same collection stuck on one shard and subject to the 3000 reads / 1000 writes per second limitation of a DynamoDB partition.
Using a global secondary index with a secondary hash + range key, where the hash key remains the same for all objects belonging to the same collection, and the range key is just something random or an atomic counter. Similar to the above, partitioning becomes poor for the GSI, and it will become a bottleneck, with too many identical hashes rapidly draining all the capacity provisioned for the index. I didn't find how the GSI is implemented exactly, so I am not sure how badly it suffers from low cardinality.
The question is whether I can live with (2) or (3) and suffer from non-ideal key distribution, or whether there is another way of implementing the collection that I have overlooked, or perhaps I should consider another NoSQL database engine altogether.
This is a "shooting from the hip" answer, what you end up doing may depend on how much and what type of reading and writing you do.
Two things the dynamo docs encourage you to avoid are hot keys and, in general, scans. You noted that in cases (2) and (3), you end up with a hot key. If you expect this to scale (large collections), the hot key will probably hurt more and more, especially if this is a write-intensive application.
The docs on Query and Scan operations (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html) say that, for a query, "you must specify the hash key attribute name and value as an equality condition." So if you want to avoid scans, this might still force your hand and put you back into that hot key situation.
Maybe one route would be to embrace doing a scan operation, but just have one table devoted to your collection. Then you could just have a fully random (well distributed) hash key and do a scan every time. This assumes you always want everything from the collection (you didn't say). This will still hurt if you scale up to a large collection, but if you always want the full set back, you'll have to deal with that pain regardless. If you just want a subset, you can add a limit parameter. This would help performance, but you will always get back the same subset (or you can use the last evaluated key and keep going). The docs also mention parallel scans.
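For illustration, a paginated Scan with boto3 looks roughly like this (the table name is a placeholder):

```python
import boto3

table = boto3.resource("dynamodb").Table("my_collection")   # hypothetical table name

def scan_all(page_size=100):
    """Read the whole collection page by page, following LastEvaluatedKey."""
    kwargs = {"Limit": page_size}
    while True:
        resp = table.scan(**kwargs)
        yield from resp["Items"]
        last_key = resp.get("LastEvaluatedKey")
        if last_key is None:          # no more pages
            return
        kwargs["ExclusiveStartKey"] = last_key
```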
If you are using AWS, ElastiCache/Redis might be another route to try. The first pass might code up a lot faster/cleaner than option (1) that you mentioned.

Using key-value databases as a set with persistent indices

Since the below got a bit long, here's the tl;dr version: Is there an existing key/value best practice for fast key and value lookup, something like a hash-based set with persistent indices?
I'm interested in the world of key-value databases and have so far failed to figure out how one would efficiently implement the following use-case:
Assume we want to serialize some data and reference them somewhere else by a persistent, unique integer index. Thus e.g.: Key = unsigned int, Value = MyData.
The database should have fast key lookup and ensure that MyData is unique.
Now, when I insert a new value into the database, I could assign it a new index key, e.g. the current size of the database, or, to prevent clashes after removing items, I could keep some counter externally.
But how would I ensure that I do not insert the same MyData value into my database? So far, it looks to me as if this is not efficiently possible with key-value databases - is this correct? I.e. I do not want to iterate over the whole database just to ensure MyData value is not in there already...
What is the best practice to implement this, then?
For background: I work on KDevelop, where we use the above for our code analysis cache. We actually have a custom implementation of the above use case (1). Search for Bucket and ItemRepository if you are interested in the internals, and see (2) for an example usage of the ItemRepository.
But you will probably agree, that this code is quite hard to understand and thus hard to maintain. I want to compare its performance to alternative solutions which might result in simpler code - but only if it does not incur a severe performance penalty. Considering the hype around the performance of key-value storages such as OpenLDAP MDB, Kyoto Cabinet and LevelDB, this is where I wanted to start.
What we have in KDevelop - as far as I figured out - is basically a sort of hybrid on-disk/in-memory hash map which gets saved to disk periodically (which of course can result in major data corruption in case of crashes etc.). Items are stored in a location based on their hash value which then of course also allows relatively fast value lookups as long as the hash function is fast. The added twist is that you also get some sort of persistent database index which can be used to lookup the items quite efficiently.
So - long story short - how would one do that with a key/value database such as LevelDB, Kyoto Cabinet, OpenLDAP MDB - you name it?
Sounds like you want to do what OpenLDAP does with its equality index. Perhaps this is the same as the OrientDB example; I didn't read it.
The main table is indexed by a monotonically increasing integer key (called the entryID), and stores the data value. The equality index is indexed by a hash of the value, and stores a list of entryIDs that match the hash. Since the hash might have collisions, just the existence of an entry in the equality index doesn't prove uniqueness or duplication. You still need to check the actual values.
A faster/simpler approach, if you're using MDB, BDB, or some other database that supports duplicate keys, is to just keep one table, using the hash as the key. In both MDB and BDB there is a GET_BOTH request which matches both the key and the data to perform a fetch. If it succeeds then you know for certain that the value already exists. Otherwise, it allows you to save whatever data values and not worry whether or not there are hash collisions.
A caveat here, in MDB using duplicate keys, the size of the values is limited to less than one half of a disk page.
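A toy sketch of that entryID / equality-index layout, with plain Python dicts standing in for the two key/value tables (a real version would be MDB/LevelDB puts and gets):

```python
import hashlib
from itertools import count

main_table = {}       # entryID -> value bytes
equality_index = {}   # value hash -> list of entryIDs sharing that hash
next_id = count(1)    # monotonically increasing entryID

def value_hash(value: bytes) -> bytes:
    return hashlib.sha1(value).digest()

def insert_unique(value: bytes) -> int:
    """Return the existing entryID for value, or insert it and return a new one."""
    h = value_hash(value)
    for entry_id in equality_index.get(h, []):
        if main_table[entry_id] == value:   # hash collision check: compare actual values
            return entry_id
    entry_id = next(next_id)
    main_table[entry_id] = value
    equality_index.setdefault(h, []).append(entry_id)
    return entry_id
```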
Unless I'm missing something here - typically your hash algorithm is consistent and will provide the same key for the same data. Thus you should only need to look up the key to see if it already exists, or handle the (likely duplicate key) error the DB gives back to you.
AFAIK key/value DBs can and will enforce a unique value constraint for you, i.e. you will get an error if you try to save a value that already exists.
How big are your value strings?
I would just store them in a key and let the database do all the work.
Typical LevelDB style, which applies to most KV stores, would be to use a pair of keys, prefixed to indicate type, e.g.:
Key = 'i' + ID
Value = valueString
Key = 'v' + valueString
Value = ID
In a system that needs to allow for multiple identical valueStrings you would move the ID into the tail of the second key
Key = 'v' + valueString + ID
Value = empty
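In code, the two-key scheme looks roughly like this; `db` stands for any LevelDB-style handle exposing put/get, which is an assumption about the API rather than a specific library:

```python
def put_value(db, new_id: bytes, value: bytes):
    """Store both directions: id -> value and value -> id."""
    db.put(b"i" + new_id, value)
    db.put(b"v" + value, new_id)

def lookup_by_id(db, key_id: bytes):
    """Fetch the value stored under this ID, or None if absent."""
    return db.get(b"i" + key_id)

def id_for_value(db, value: bytes):
    """Return the existing ID if this exact value has already been stored, else None."""
    return db.get(b"v" + value)
```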

What's the database performance improvement from storing as numbers rather than text?

Suppose I have text such as "Win", "Lose", "Incomplete", "Forfeit", etc. I can store the text directly in the database. If instead I use numbers such as 0 = Win, 1 = Lose, etc., would I get a material improvement in database performance? Specifically on queries where the field is part of my WHERE clause.
At the CPU level, comparing two fixed-size integers takes just one instruction, whereas comparing variable-length strings usually involves looping through each character. So for a very large dataset there should be a significant performance gain with using integers.
Moreover, a fixed-size integer will generally take less space and can allow the database engine to perform faster algorithms based on random seeking.
Most database systems, however, have an enum type which is meant for cases like yours: in the query you can compare the field value against a fixed set of literals while it is internally stored as an integer.
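If the engine has no enum type, the same idea can live in application code: keep the readable labels there and store only a small integer in the column. A minimal sketch (the labels are just the ones from the question, the numeric codes are made up):

```python
from enum import IntEnum

class Outcome(IntEnum):
    WIN = 0
    LOSE = 1
    INCOMPLETE = 2
    FORFEIT = 3

# Store the small integer in the status column and compare against it in queries,
# e.g. "... WHERE status = %s" with parameter Outcome.WIN.value.
row_status = Outcome.WIN.value        # 0 goes into the database
label = Outcome(row_status).name      # "WIN" comes back out for display
```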
There might be significant performance gains if the column is used in an index.
It could range anywhere from negligible to extremely beneficial depending on the table size, the number of possible values being enumerated and the database engine / configuration.
That said, it almost certainly will never perform worse to use a number to represent an enumerated type.
Don't guess. Measure.
Performance depends on how selective the index is (how many distinct values are in it), whether critical information is available in the natural key, how long the natural key is, and so on. You really need to test with representative data.
When I was designing the database for my employer's operational data store, I built a testbed with tables designed around natural keys and with tables designed around id numbers. Both those schemas have more than 13 million rows of computer-generated sample data. In a few cases, queries on the id number schema outperformed the natural key schema by 50%. (So a complex query that took 20 seconds with id numbers took 30 seconds with natural keys.) But 80% of the test queries had faster SELECT performance against the natural key schema. And sometimes it was staggeringly faster--a difference of 30 to 1.
The reason, of course, is that lots of the queries on the natural key schema need no joins at all--the most commonly needed information is naturally carried in the natural key. (I know that sounds odd, but it happens surprisingly often. How often is probably application-dependent.) But zero joins is often going to be faster than three joins, even if you join on integers.
Clearly if your data structures are shorter, they are faster to compare AND faster to store and retrieve.
How much faster? 1x, 2x, 1000x? It all depends on the size of the table and so on.
For example: say you have a table with a productId and a varchar text column.
Each row will take roughly 4 bytes for the int and then another 3 to 24 bytes for the text in your example (depending on whether the column is nullable or Unicode).
Compare that to 5 bytes per row for the same data with a byte status column.
This huge space saving means more rows fit in a page, more data fits in the cache, fewer writes happen when you load or store data, and so on.
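Roughly, the arithmetic above works out like this, assuming an 8 KB page and ignoring per-row and per-page overhead (which real engines add):

```python
PAGE_BYTES = 8 * 1024          # assumed page size; actual engines also store headers/overhead

int_row  = 4 + 1               # 4-byte productId + 1-byte status code
text_row = 4 + 24              # 4-byte productId + up to 24 bytes of text for the status

print(PAGE_BYTES // int_row)   # ~1638 rows per page with the tiny status column
print(PAGE_BYTES // text_row)  # ~292 rows per page with the text column
```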
Also, comparing strings is, in the best case, as fast as comparing bytes and, in the worst case, much slower.
There is a second huge issue with storing text where you intended to have an enum. What happens when people start storing Incompete as opposed to Incomplete?
Having a skinnier column means that you can fit more rows per page.
There is a HUGE difference between a varchar(20) and an integer.