We have internal software based on a SQL Server DB with a master table and multiple joined tables. The nature of the data we store is quite difficult to describe, but suppose we have a customers table with some joined tables: orders, shipments, phone logs, complaints, etc.
We need to sync this software with an external one that has its own DB (with the very same structure) and produces an XML file with updated information about our "customers" (one file per customer). Updates may be in the master table and/or in 0 to n joined tables.
To import these files, one option is to query all the involved tables and compare them with the XML file, adding, updating, or deleting rows as needed.
This would require a lot of coding.
Another option is to completely delete all data for the given customer (at least from the joined tables) and insert it again.
This would not be very efficient.
Please consider that the master table has 13 fields and that there are about 6 joined tables with 3 to 15 fields each.
In this app, we mainly use LINQ.
How would you proceed?
PS: I noticed a few answers on this subject here on Stack Overflow, but almost all of them concern single rows in single tables.
For a scenario where I have a lot of joins and lots of rows, I prefer to update and use logical deletes. For example, I have millions of customers and dozens of tables with millions of rows whose FKs point to the customer ID; trying to physically delete a customer can take several minutes.
For your particular scenario, I would use a flag in each pertinent table to tell me: this row is already synchronized, this row was inserted and is pending export, this row is pending deletion, or this row was exported to XML in the past but has since been updated.
For exports:
It makes it easy to query just the rows pending insert, update, or delete, and to ignore the rows that are up to date.
For imports:
If the other system doesn't have this facility, there's a little trick you can do: add an "external ID" column so you can quickly search your database and identify the rows that originated from the external source.
Even with this trick, it can be a pain to find out whether only a phone number was updated in a large table. For those extreme cases you can use a computed hash column to quickly tell whether two rows differ, and then update the entire row (or at least the common columns).
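Here is a minimal T-SQL sketch of the flag / external-ID / hash-column idea described above. All table and column names (Customers, SyncStatus, ExternalId, RowHash, FirstName, LastName, Phone) are illustrative assumptions, not from the original schema:

-- Status flag, external key, and a persisted hash for cheap change detection.
ALTER TABLE Customers ADD
    SyncStatus TINYINT NOT NULL DEFAULT 0,  -- 0 = synchronized, 1 = pending export, 2 = pending delete, 3 = updated since last export
    ExternalId INT NULL,                    -- key of the row in the external system, used to match rows on import
    RowHash AS HASHBYTES('SHA2_256', CONCAT(FirstName, '|', LastName, '|', Phone)) PERSISTED;  -- hash over the shared columns

CREATE INDEX IX_Customers_ExternalId ON Customers (ExternalId);

-- On import, only rows whose hash differs from the incoming data need an update, e.g.:
-- UPDATE c SET ... FROM Customers c JOIN #Incoming i ON c.ExternalId = i.ExternalId WHERE c.RowHash <> i.RowHash;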
An idea (assuming you do this on the database server side):
Build tables out of the customer XML. They can be temporary tables or in-memory tables.
Create SELECT queries to find new, updated, and deleted data. These queries would join the tables in your database with the tables built from the customer XML; the output of the joins tells you whether you have new records, updated records, deleted records, or a mix of them.
Run insert, update, delete accordingly.
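As a rough T-SQL sketch of that idea (the XML shape, the @xml variable, the #IncomingOrders temp table, and the Orders table with CustomerId/OrderId/Amount columns are all assumptions made for illustration):

DECLARE @xml XML = N'<customer id="42">
  <order OrderId="1" Amount="99.50" />
  <order OrderId="2" Amount="12.00" />
</customer>';

-- 1. Shred the customer XML into a temporary table.
SELECT  o.value('@OrderId', 'INT')          AS OrderId,
        o.value('@Amount',  'DECIMAL(9,2)') AS Amount
INTO    #IncomingOrders
FROM    @xml.nodes('/customer/order') AS t(o);

-- 2. Compare it with the real table and insert/update/delete accordingly.
MERGE Orders AS tgt
USING #IncomingOrders AS src
   ON tgt.CustomerId = 42 AND tgt.OrderId = src.OrderId
WHEN MATCHED AND tgt.Amount <> src.Amount THEN
     UPDATE SET Amount = src.Amount
WHEN NOT MATCHED BY TARGET THEN
     INSERT (CustomerId, OrderId, Amount) VALUES (42, src.OrderId, src.Amount)
WHEN NOT MATCHED BY SOURCE AND tgt.CustomerId = 42 THEN  -- guarded so only this customer's rows can be deleted
     DELETE;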
For context, we have a variety of data sources being ingested into our Redshift instance. Our ingestion tool marks rows as deleted if they are deleted from the original source with a marked_deleted column.
Now this is where it gets kind of complicated: some of the sources we ingest data from already do this form of soft deletion and have their own marked_deleted or deleted_at columns. We're aware of these. But we'd like to find the tables that don't have soft deletes enabled on the data source, i.e. the tables that have their rows hard deleted.
Does a query exist that can query all tables on a Redshift instance to find out where marked_deleted = true and return a list of those table names? We have 250+ tables already and that number is set to grow fast, so ideally we would like a query we can periodically run to update the list of tables we need to be aware of that contain hard deletes.
I have no idea where to even begin with a query like this, so any information is helpful!
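I'm not aware of a single built-in query for this, but one possible starting point (assuming the ingestion tool names the column marked_deleted everywhere) is to generate one check per table from the catalog and then run the generated statements from a script or scheduled job:

-- Generate an EXISTS check for every table that has a marked_deleted column.
SELECT 'SELECT ''' || table_schema || '.' || table_name || ''' AS table_name '
    || 'WHERE EXISTS (SELECT 1 FROM ' || table_schema || '.' || table_name
    || ' WHERE marked_deleted = true);' AS check_sql
FROM information_schema.columns
WHERE column_name = 'marked_deleted'
  AND table_schema NOT IN ('pg_catalog', 'information_schema');

Each generated statement returns the table name only if that table actually contains soft-deleted rows, i.e. the source hard-deletes.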
We have 10 tables on the vendor system and the same 10 tables on our DB side, along with 10 _HISTORIC tables, i.e. one per table, in order to capture updated/new records.
We are reading the main tables from the vendor system using Informatica to truncate and load into our tables. How do we find delta records without using triggers or CDC, since those come with a cost on the vendor system?
4 of the tables have around 200 columns and roughly 31K records each, with the expectation that 100-500 records might be updated daily.
We are using a Left Join in Informatica to load new records into our main and _HISTORIC tables.
But what's an efficient approach to find the updated records of a vendor table and load them into our _HISTORIC table?
For new records we use this query:
-- NEW RECORDS
INSERT INTO TABLEA_HISTORIC
SELECT A.*
FROM TABLEA A
LEFT JOIN TABLEB B
ON A.PK = B.PK
WHERE B.PK IS NULL
I believe a system-versioned temporal table is what you are looking for here. You can create a system-versioned table for any table in SQL Server 2016 or later.
For example, say I have a table Employee:
CREATE TABLE Employee
(
    EmployeeId VARCHAR(20) PRIMARY KEY,
    EmployeeName VARCHAR(255) NOT NULL,
    EmployeeDOJ DATE,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL, -- automatically set by the system when the row is inserted or updated
    ValidTo datetime2 GENERATED ALWAYS AS ROW END NOT NULL,     -- auto-updated
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)                 -- defines the row validity period
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory)); -- turns on system versioning; the history table name is your choice
The ValidFrom and ValidTo columns determine the time period during which that particular row was active.
For more info, refer to the Microsoft article:
https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15
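As a hedged example of how the temporal table could then be used to pick up deltas (the @LastSync variable and the idea of persisting the last load time between runs are my assumptions, not part of the original answer):

DECLARE @LastSync datetime2 = '2024-01-01T00:00:00';  -- stored somewhere between loads

-- Row versions created since the last sync: newly inserted rows and the post-update versions of changed rows.
SELECT *
FROM Employee FOR SYSTEM_TIME ALL
WHERE ValidFrom > @LastSync;

-- Deleted rows end up in the history table with ValidTo set to the deletion time.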
Create staging tables and wipe & load them. Next, use them to find the differences that need to be loaded into your target tables.
The CDC logic still needs to be implemented this way, but it will not affect your source system.
Another way - not sure if it's possible in your case - is to load partial data based on some source-system date or key. This way you stage only the new data. This improves performance a lot, but it makes finding deletes in the source impossible.
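A minimal T-SQL sketch of that staging-table comparison, assuming a staging table STG_TABLEA that mirrors TABLEA and shares the key column PK (all names are illustrative):

-- Updated rows: same key, any column different (EXCEPT compares every column at once).
SELECT s.*
FROM STG_TABLEA AS s
JOIN TABLEA    AS t ON t.PK = s.PK
WHERE EXISTS (SELECT s.* EXCEPT SELECT t.*);

-- Deleted rows: present in the target but missing from the freshly loaded staging table.
SELECT t.PK
FROM TABLEA AS t
LEFT JOIN STG_TABLEA AS s ON s.PK = t.PK
WHERE s.PK IS NULL;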
A. To replicate a smaller subset of the records in the source without making schema changes, there are a few options:
Transactional replication. However, this is not very flexible; for example, it would not allow any differences in the target database, and therefore it is not a solution for you.
Identify a "date modified" field in the source. This obviously has to already exist, and it will not let you identify deletes.
Use a "windowing" approach where you simply delete and reload, say, the last month's transactions, again based on an existing date. This requires an existing date that isn't back-dated, and it doesn't work for non-transactional tables (which are usually small enough to just copy in full anyway).
Turn on change tracking. Your vendor may or may not argue that this is a costly change (it isn't) or that it impacts application performance (it probably doesn't).
https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-tracking-sql-server?view=sql-server-ver15
Turning on change tracking will allow you to more easily identify changes to all tables.
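A hedged sketch of enabling and consuming change tracking, assuming a SQL Server source database called VendorDb and a table dbo.TABLEA with primary key PK (the names are illustrative):

ALTER DATABASE VendorDb
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.TABLEA ENABLE CHANGE_TRACKING;

-- On each load, ask for everything that changed since the version stored after the previous run.
DECLARE @last_sync_version BIGINT = 0;  -- persist this between runs (e.g. from CHANGE_TRACKING_CURRENT_VERSION())

SELECT ct.PK, ct.SYS_CHANGE_OPERATION   -- I = insert, U = update, D = delete
FROM CHANGETABLE(CHANGES dbo.TABLEA, @last_sync_version) AS ct;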
You need to ask yourself: is it really an issue to copy the entire table? I have built solutions that simply copy entire large tables (far larger than 31K records) every hour, and there is never an issue.
You need to consider what complications you introduce by building an incremental solution, and whether the associated maintenance and complexity are worth reducing a copy from 31K records (the full table) to 500 records (the changed ones). Again, a full copy of 31K records is actually pretty fast under normal circumstances (10 seconds or so).
B. Target table
As already recommended by many, you might want to consider a temporal table, although if you do decide to do full copies, a temporal table might not be the best option.
I have a table called Customers which contains a number of columns such as FirstName, LastName, DateFileOpened, OrderedInLastMonth, etc... There are more than 20 columns per row, and there are around 500 rows.
Each hour, a mechanism scrapes another source of this data for updated customer records and puts them into a temporary table, which then needs to be copied into my main Customers table. However, any or all of the columns in any or all of the new rows could differ from the existing ones in Customers.
At present, to avoid creating nearly-duplicate records, my pretty crappy code does e.g. a delete from [Customers] where CoOrigin = 'England' before importing the new rows to take their place. However, I have other queries that need to run at around the same time, and this often gets in the way: those queries return no data because the customer records they would have returned are missing, thanks to the delete command.
Once again, I'm aware this is terrible coding, but I'm still quite new. I've looked at the update / replace statements, but they seem to require specifying which columns in each row need updating, and it could be any of the 20+. I'm aware that this would achieve the task, but it seems like more bad code. I'm also unsure how to reference the temporary table that the new records are imported into before they are copied to the main Customers table (and the temporary one dropped).
Any help or pointers you can give me would be very much appreciated. Thanks.
You are dealing with a microscopic amount of data, so you can use a "big hammer" approach to refresh every customer without impacting other processes.
begin;
lock table customer in exclusive mode;
delete from customer;
insert into customer select * from temp_customer;
commit;
Processes that need access to the customer table will just block while the update completes (a couple of seconds tops) then continue unaffected.
I've got a question concerning auto-deleting particular records in one table of an Oracle database using SQL.
I am making a small academic project of a database for a private clinic, and I have to design the Oracle database and a client application in Java.
One of my ideas is to have a table "Visits" which stores all patient visits that took place in the past, for history purposes. The aforementioned table will grow pretty fast, so its search performance will suffer.
So the idea is to make a smaller table called "currentVisits" which holds only appointments for future visits, because it will be much faster to search through ~1000 records than a few million after a few years.
My question is how to implement, in SQL, auto-deleting of records from the temporary table "currentVisits" after the visits have taken place.
Both tables will store fields like dateOfVisit, patientName, doctorID etc.
Is there any way to make this work in a simple fashion, for example using triggers?
I am quite new to this topic, so thanks for every answer.
Don't worry about the data size. Millions of records is not particularly large for a database on modern computing hardware. You will need an appropriate data structure, however.
In this case, you will want an index on the column that indicates current records. In all likelihood, the current records will be appended onto the end of the table, so they will tend to congregate on a handful of data pages. This is a good thing.
If you have a heavy deletion load on the table, or you are using a clustered index, then the pages with the current records might be spread throughout the database. In that case, you want to include the "current" column in the clustered index.
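A minimal sketch of that idea in Oracle SQL, using the Visits table and dateOfVisit column from the question (the index name is made up):

-- Index the column that separates upcoming visits from historical ones.
CREATE INDEX idx_visits_date ON Visits (dateOfVisit);

-- Upcoming visits then become a cheap index range scan instead of needing a separate table:
SELECT *
FROM Visits
WHERE dateOfVisit >= SYSDATE;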
I have some database tables that contain aggregated data. Their records (a few thousand per table) are recomputed periodically by an external .NET app, so the old data has to be deleted and the new data inserted periodically. Updating is not an option in this case.
Between the delete and the insert there is an intermediate period when the records' state is inconsistent (the old ones are deleted, the new ones are not in the table yet), so running a select query in that state returns an incorrect result.
I use SubSonic SimpleRepository to handle database features.
What is the best practice / pattern to workaround / handle this state?
Three options come to my mind:
Create a transaction with a lock on reads until it is done. This only works if the processes are relatively fast. A few thousand records shouldn't be too bad if you transact/lock one table at a time -- if you lock the whole process, that could be costly! But if the data is related, this is what you'd have to do.
Write to temporary versions of the tables, then drop the old tables and rename the temp tables (see the sketch after this list).
Same as above, except bulk copy from the temp tables (not necessarily SQL temporary tables; ancillary holding tables would suffice) into the correct tables, first deleting from the main table. You'd still want to use a transaction for this.
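A hedged T-SQL sketch of the second option, assuming the aggregated table is dbo.AggregatedData and the .NET app has just loaded the fresh data into dbo.AggregatedData_New (both names are made up):

BEGIN TRAN;
    EXEC sp_rename 'dbo.AggregatedData',     'AggregatedData_Old';
    EXEC sp_rename 'dbo.AggregatedData_New', 'AggregatedData';
COMMIT;

DROP TABLE dbo.AggregatedData_Old;  -- readers only ever see the old data or the new data, never a half-empty table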