I need to compare two columns declared as string in 2 different databases, all values of the columns (around 10,000 rows) at once.
One database is Firebird, and one is SQLite. I want to create some sort of checksum of two columns to see if there are exactly the same values in the two columns.
For integer columns, I can compute SUM(column), and if the sums from the two tables match, I can assume the columns hold the same values. This is not bulletproof, but with many records the accuracy increases.
For numeric columns the same approach can be used. However, I don't know how to do something similar for a string column.
I see several possibilities.
1. Per row comparison
The easiest is to check the values row by row from Delphi code. It has the huge benefit of immediately finding which data row is not synchronized.
But it has the drawback of reading the whole data set from the DB.
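A minimal sketch of that row-by-row check (shown in Python rather than Delphi, with made-up data; in practice the two iterables would be cursors ordered by the key):

```python
from itertools import zip_longest

def first_mismatch(rows_a, rows_b):
    """Compare two iterables of (id, value) rows, ordered by id, and return the
    first position where they differ (or one side is missing a row), else None."""
    for n, (a, b) in enumerate(zip_longest(rows_a, rows_b), start=1):
        if a != b:
            return n, a, b
    return None

# rows_a / rows_b would come from "SELECT id, col FROM t ORDER BY id" on each DB,
# fetched through whatever drivers you use (hypothetical data shown here).
rows_a = [(1, "alpha"), (2, "beta"), (3, "gamma")]
rows_b = [(1, "alpha"), (2, "bravo"), (3, "gamma")]
print(first_mismatch(rows_a, rows_b))   # (2, (2, 'beta'), (2, 'bravo'))
```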
2. Hash in SQL SELECT
If you want to do the check in-place, with no data retrieval, then you need to compute the hash of the columns.
Firebird 4 has cryptographic hash functions like MD5 or SHA1, and you may be able to find a suitable UDF library for older Firebird revisions. They are easy to implement with SQLite3, as a custom aggregate function.
Ensure you hash the UTF-8 text version of the columns, because the raw binary storage of the two DBs is not compatible.
Then you can compute the hash of the column with a single SELECT statement on each database and compare the two results.
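As a sketch of such a custom aggregate on the SQLite side (Python with the built-in sqlite3 module; table and column names are placeholders), this version hashes each value separately and sums 64-bit prefixes of the digests, so the result does not depend on row order, in the spirit of the SUM(column) trick above:

```python
import hashlib
import sqlite3

class TextChecksum:
    """Order-independent checksum: SHA-1 each value, sum 64-bit digest prefixes."""
    def __init__(self):
        self.total = 0

    def step(self, value):
        if value is None:
            return                                   # NULLs contribute nothing
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        self.total = (self.total + int.from_bytes(digest[:8], "big")) % (1 << 64)

    def finalize(self):
        return format(self.total, "016x")            # hex string fits SQLite's types

# Demo with an in-memory DB; in practice you would open the real SQLite file.
conn = sqlite3.connect(":memory:")
conn.create_aggregate("text_checksum", 1, TextChecksum)
conn.executescript("""
    CREATE TABLE my_table (id INTEGER PRIMARY KEY, my_text_column TEXT);
    INSERT INTO my_table (my_text_column) VALUES ('alpha'), ('beta'), ('gamma');
""")
print(conn.execute("SELECT text_checksum(my_text_column) FROM my_table").fetchone()[0])
```

The harder part is reproducing exactly the same per-value hashing and summation on the Firebird side (on top of its built-in hash functions or a UDF), since both databases must agree byte for byte on the UTF-8 text being hashed.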
3. Iterative approach
If the data is INSERTed into the DB, not UPDATEd, then you could add a "hash" field in each row. When a row is INSERTed, you retrieve the last hash (which may be cached in memory for efficiency), then you hash the new appended values.
Then you can easily compare the hash stored in the last row of each DB to check whether the data are the same on both sides.
An alternative may be to compute this hash in memory, not as a field in the DB. It would only read the data once at startup, then refresh the hash whenever something is INSERTed. That may be enough to check whether the two DBs are in sync.
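A minimal sketch of such a chained hash maintained in memory, assuming an append-only table (all names are made up; the same appends replayed on both sides must produce the same final hash):

```python
import hashlib

class RunningHash:
    """Chained hash over appended rows: new_hash = SHA-1(previous_hash + new values)."""
    def __init__(self, seed: str = ""):
        self.last = seed  # hash of the most recently appended row (hex)

    def append(self, *values) -> str:
        h = hashlib.sha1(self.last.encode("ascii"))
        for v in values:
            h.update(b"\x00" if v is None else str(v).encode("utf-8"))
        self.last = h.hexdigest()
        return self.last

# One RunningHash per database; after replaying the same INSERTs,
# db_a.last == db_b.last means both databases saw identical data in order.
db_a, db_b = RunningHash(), RunningHash()
db_a.append("alice", "row 1 text")
db_b.append("alice", "row 1 text")
print(db_a.last == db_b.last)   # True
```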
Related
My chosen database is MongoDB, but the question should be database-independent.
So, for example, each record will have a flag that can take 1 of 2 possible values.
What are the pros and cons of:
having 1 table with a column to hold the value of this flag,
versus:
having 2 tables to hold the two different types of records, distinguished by the aforementioned flag?
Would this be cheaper in terms of storage, since you don't have that extra column?
Would this also be faster in queries, since you know exactly which table to look without having to perform a filter?
What is the common practice in industry?
Storage for a single column holding just a flag (e.g. active and archived) should be negligible. The query could be faster with two tables; however, your application becomes more complex, since you have to write 2 queries.
When you have only 2 distinct values and these values are more or less evenly distributed, then an index will not be used, thus the performance should be equal - unless you select the entire table.
It might be useful to have 2 tables if the flags are not evenly distributed, for example if you have a rather small active data set that is queried frequently, and an archive data set that is much bigger but hardly ever queried.
If available, you can also work with partitions which is actually a good combination of both.
This question is about how to achieve the best possible performance with SQLite.
If you have a table with one column that you update very often, and another column that you update very rarely, is it better to split that table up, so that you have 2 tables with 1 column (excluding the primary key) each, instead of 1 table with 2 columns (excluding the primary key)?
I have tried to find information about this in the SQLite documentation, but I have not been able to find an explanation of what exactly happens when updating one column of one row of a table. The closest answer to my question I found was this:
During part of SQLite's INSERT and SELECT processing, the complete content of each row in the database is encoded as a single BLOB. So the SQLITE_MAX_LENGTH parameter also determines the maximum number of bytes in a row.
(Quoted from here: https://www.sqlite.org/limits.html)
To me that sounds as if every row is internally stored as one big blob of all columns, and that would mean that updating a single column of the row will indeed lead to the whole row being internally re-encoded again, including all other columns of that row, even if they have not been modified as part of the UPDATE. But I am not sure if I understand that sentence I quoted correctly.
I am thinking about a case where one column stores a big blob multiple MB in size, and another column stores an integer. The column with the big blob might only be updated once a month, while the column with the integer might be updated once per second. As far as I currently understand it, based on that quote, updating the integer once per second would cause the multi-MB blob to be re-encoded every time, which would be very inefficient. Having one table for the blob and a different table for the integer would be a lot better then.
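A minimal sketch of the split layout described above, using Python's built-in sqlite3 module with made-up names, where the frequently updated integer lives in its own small table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Rarely updated: one row per item carrying the large blob.
    CREATE TABLE payload (id INTEGER PRIMARY KEY, data BLOB);
    -- Updated once per second: small row, same id as payload.
    CREATE TABLE counter (id INTEGER PRIMARY KEY, value INTEGER NOT NULL);
""")

conn.execute("INSERT INTO payload (id, data) VALUES (1, zeroblob(5 * 1024 * 1024))")
conn.execute("INSERT INTO counter (id, value) VALUES (1, 0)")

# Frequent updates now touch only the tiny counter row, never the 5 MB blob row.
conn.execute("UPDATE counter SET value = value + 1 WHERE id = 1")
conn.commit()

row = conn.execute(
    "SELECT c.value, length(p.data) FROM counter c JOIN payload p USING (id)"
).fetchone()
print(row)   # (1, 5242880)
```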
Every day, we receive huge files from various vendors in different formats (CSV, XML, custom) which we need to upload into a database for further processing.
The problem is that these vendors send the full dump of their data and not just the updates. We have some applications where we need to send only the updates (that is, the changed records only). What we do currently is load the data into a staging table and then compare it against the previous data. This is painfully slow, as the data set is huge, and we are occasionally missing SLAs.
Is there a quicker way to resolve this issue? Any suggestions or help greatly appreciated. Our programmers are running out of ideas.
There are a number of patterns for detecting deltas, i.e. changed records, new records, and deleted records, in full dump data sets.
One of the more efficient ways I've seen is to create hash values of the rows of data you already have, create hashes of the import once it's in the database, then compare the existing hashes to the incoming hashes.
Primary key match + hash match = Unchanged row
Primary key match + hash mismatch = Updated row
Primary key in incoming data but missing from existing data set = New row
Primary key not in incoming data but in existing data set = Deleted row
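As a sketch of that comparison in code, assuming both the existing and the incoming data have already been reduced to {primary key: row hash} mappings (how those hashes are produced is up to your database or import tooling):

```python
def classify_deltas(existing: dict, incoming: dict):
    """Compare {pk: hash} maps and bucket rows as unchanged/updated/new/deleted."""
    unchanged, updated, new = [], [], []
    for pk, h in incoming.items():
        if pk not in existing:
            new.append(pk)                # key only in incoming data
        elif existing[pk] == h:
            unchanged.append(pk)          # key match + hash match
        else:
            updated.append(pk)            # key match + hash mismatch
    deleted = [pk for pk in existing if pk not in incoming]
    return unchanged, updated, new, deleted

existing = {1: "a1", 2: "b2", 3: "c3"}
incoming = {1: "a1", 2: "XX", 4: "d4"}
print(classify_deltas(existing, incoming))   # ([1], [2], [4], [3])
```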
How to hash varies by database product, but all of the major providers have some sort of hashing available in them.
The advantage comes from only having to compare a small number of fields (the primary key column(s) and the hash) rather than doing a field by field analysis. Even pretty long hashes can be analyzed pretty fast.
It'll require a little rework of your import processing, but the time spent will pay off over and over again in increased processing speed.
The standard solution to this is hash functions. For each row, you calculate an identifier plus a hash of its contents. Now you compare hashes, and if the hashes are the same then you assume that the row is the same. This is imperfect: it is theoretically possible that different values will give the same hash value. But in practice you have more to worry about from cosmic rays causing random bit flips in your computer than from hash functions failing to work as promised.
Both rsync and git are examples of widely used software that use hashes in this way.
In general calculating a hash before you put it in the database is faster than performing a series of comparisons inside of the database. Furthermore it allows processing to be spread out across multiple machines, rather than bottlenecked in the database. And comparing hashes is less work than comparing many fields, whether you do it in the database or out.
There are many hash functions that you can use. Depending on your application, you might want to use a cryptographic hash though you probably don't have to. More bits is better than fewer, but a 64 bit hash should be fine for the application that you describe. After processing a trillion deltas you would still have less than 1 chance in 10 million of having made an accidental mistake.
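As a sketch of computing such a 64-bit hash outside the database before loading (blake2b is just one convenient choice here; the field order and separator must be fixed and identical for old and new data):

```python
import hashlib

def row_hash64(*fields) -> str:
    """64-bit hash of a row's fields, computed client-side before loading."""
    h = hashlib.blake2b(digest_size=8)         # 8 bytes = 64 bits
    for f in fields:
        h.update(b"\x00" if f is None else str(f).encode("utf-8"))
        h.update(b"\x1f")                      # separator so ("ab","c") != ("a","bc")
    return h.hexdigest()

print(row_hash64("ACME Corp", "2024-01-31", 1999.99))

# Rough bound on a missed change: a comparison only fails if the old and new
# versions of one row collide, probability 2**-64 per comparison.
trillion = 10**12
print(trillion / 2**64)    # ~5.4e-08, i.e. less than 1 chance in 10 million
```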
In my application, users can create custom tables with three column types: Text, Numeric and Date. They can have up to 20 columns. I create a SQL table based on their schema, using nvarchar(430) for text, decimal(38,6) for numeric, and datetime for date, along with an identity Id column.
There is the potential for many of these tables to be created by different users, and the data might be updated frequently by users uploading new CSV files. To get the best performance during the upload of the user data, we truncate the table to get rid of existing data, and then do batches of BULK INSERT.
The user can make a selection based on a filter they build up, which can include any number of columns. My issue is that some tables with a lot of rows will have poor performance during this selection. To combat this I thought about adding indexes, but as we don't know what columns will be included in the WHERE condition we would have to index every column.
For example, on a local SQL Server one table with just over a million rows and a WHERE condition on 6 of its columns takes around 8 seconds the first time it runs, then under one second for subsequent runs. With indexes on every column it runs in under one second the first time the query is run. This performance issue is amplified when we test on a SQL Azure database, where the same query takes over a minute the first time it's run and does not improve on subsequent runs, but with the indexes it takes 1 second.
So, would it be a suitable solution to add an index on every column when a user creates a column, or is there a better solution?
Yes, it's a good idea given your model. There will, of course, be more overhead maintaining the indexes on the insert, but if there is no predictable standard set of columns in the queries, you don't have a lot of choices.
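As an illustration only, the "index every column" approach could be wired into the table-creation step along these lines (hypothetical names; the generated statements would be executed through whatever data access layer builds the user tables):

```python
def index_ddl(table: str, columns: list[str]) -> list[str]:
    """One index per user-defined column of a dynamically created table."""
    return [
        f"CREATE INDEX IX_{table}_{col} ON {table} ({col});"
        for col in columns
    ]

for stmt in index_ddl("UserTable_42", ["TextCol1", "NumericCol2", "DateCol3"]):
    print(stmt)
# CREATE INDEX IX_UserTable_42_TextCol1 ON UserTable_42 (TextCol1);
# ...
```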
If by 'updated frequently' you mean that data is added frequently via uploads rather than existing records being modified, then you might consider one of the various non-SQL databases (like Apache Lucene or its variants), which allow efficient querying on any combination of data. For reading massive 'flat' data sets, they are astonishingly fast.
My organisation has hundreds of DB2 tables that each have a randomly generated unique integer index. The random values are generated by either COBOL CICS mainframe programs or Java distributed applications. The normal approach is to randomly generate an integer value (only positive values are employed), then attempt to insert the data row, retrying when a duplicate index value has already been persisted. I would like to improve the performance of this approach, and I'm considering trying to identify integer values that have not yet been generated and persisted to each table; this would mean we never need to retry, since we would know our insert will work. Does DB2 have a function that can return unused index values?
The short answer is no.
The slightly longer answer is to point out that, if such a function existed, in your case on the first insert into one of your tables the size of the result set it would return would be 2,147,483,647 (positive) integers. At 4 bytes each, that would be 8,589,934,588 bytes.
Given the constraints of your existing system, what you're doing is probably the best that can be done. If the performance of retrying is unacceptable, I'm afraid redesigning your key scheme is the next step.
I think that's the question to ask: is this scheme of using random numbers for unique keys actually causing a performance problem? As the tables fill up the key space you will see more and more retries, but you have a relatively large key space. If you're seeing large numbers of retries, maybe your random numbers are less random than you'd like.
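For reference, the retry pattern under discussion looks roughly like this (sketched against SQLite purely so the example runs; with DB2 the duplicate key surfaces through your driver's error reporting instead, and the table and column names are made up):

```python
import random
import sqlite3

MAX_INT32 = 2_147_483_647

def insert_with_random_key(conn: sqlite3.Connection, payload: str,
                           max_attempts: int = 20) -> int:
    """Insert a row under a random positive 32-bit key, retrying on duplicates."""
    for _ in range(max_attempts):
        key = random.randint(1, MAX_INT32)
        try:
            with conn:   # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO demo (id, payload) VALUES (?, ?)", (key, payload)
                )
            return key
        except sqlite3.IntegrityError:
            # Duplicate key (in DB2 this would surface as SQLCODE -803): try again.
            continue
    raise RuntimeError(f"no free key found after {max_attempts} attempts")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, payload TEXT)")
print(insert_with_random_key(conn, "example row"))
```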
Just a thought, but you could use one sequence for a group of tables. That way the value will still look random (because you don't know which table the next insert will go to), but because it is based on a specific sequence you will mostly avoid retries, since the numbers keep ascending. The same sequence can cycle after a few hundred million inserts and start to "fill in the blanks".
As far as other key ideas are concerned, you could also try a different key, maybe one based on a timestamp or ROWID. That will still look random but will not repeat.