Background
I have a huge table (Table A) in my database, to which I apply some filters based on business rules. I then load the filtered results into another table (Table B).
Therefore, Table B will always contain data from A and will always be much smaller. Table A contains 500,000 entries and Table B contains 3000 entries.
Both tables have the same structure.
Issue
The huge table (Table A) can be updated at any moment without notice. Therefore, to ensure that Table B contains the most up-to-date business data, it needs to be refreshed regularly. In this instance, I do this once a week.
Solution
How I go about this is by setting up a simple stored procedure that does the following:
TRUNCATE Table B
Apply filters on Table A and INSERT data into Table B
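Roughly, the procedure looks something like this (the table, column, and filter names are simplified placeholders, not my actual rules):
CREATE PROCEDURE dbo.RefreshTableB
AS
BEGIN
    SET NOCOUNT ON;

    -- Step 1: empty Table B
    TRUNCATE TABLE dbo.TableB;

    -- Step 2: re-apply the business-rule filters to Table A and reload Table B
    INSERT INTO dbo.TableB (Col1, Col2, Col3)
    SELECT Col1, Col2, Col3
    FROM dbo.TableA
    WHERE SomeBusinessRule = 1;  -- placeholder for the real filters
END;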
The issue with this is that I have to truncate the table every week. I was wondering if there is a more efficient way of doing this, rather than deleting all the data and re-inserting it into the table all over again?
Is there a way to check which values are in A that are missing in B and add them accordingly?
Thanks
Instead of doing that in a stored procedure, create a View. A View stays in sync with the underlying table: as the data changes in Table A, the View automatically reflects it.
Here are some details you may want to read about Views and how they work.
How to create Views
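As a rough sketch (the filter predicate and object names below are placeholders for your actual business rules):
-- Hypothetical example: readers query this view instead of a physical Table B
CREATE VIEW dbo.FilteredTableA
AS
SELECT Col1, Col2, Col3
FROM dbo.TableA
WHERE SomeBusinessRule = 1;  -- placeholder for your business-rule filters
If the filters are expensive to evaluate, an indexed view could materialize the result, subject to SQL Server's indexed-view restrictions (SCHEMABINDING and so on).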
Related
We have 10 tables on a vendor system and the same 10 tables on our DB side, along with 10 _HISTORIC tables, i.e. one per table, in order to capture updated/new records.
We are reading the main tables from the vendor system using Informatica to truncate and load into our tables. How do we find delta records without using triggers or CDC, since those come with a cost on the vendor system?
4 of the tables have 200 columns and around 31K records each, with the expectation that 100-500 records might update daily.
We are using a Left Join in Informatica to load new records into our main and _HISTORIC tables.
But what's an efficient approach to find the updated records of a vendor table and load them into our _HISTORIC table?
For new records we use this query:
-- NEW RECORDS
INSERT INTO TABLEA_HISTORIC
SELECT A.*
FROM TABLEA A
LEFT JOIN TABLEB B
    ON A.PK = B.PK
WHERE B.PK IS NULL
I believe a system-versioned temporal table is what you are looking for here. You can create a system-versioned table for any table in SQL Server 2016 or later.
For example, say I have a table Employee:
CREATE TABLE Employee
(
    EmployeeId VARCHAR(20) PRIMARY KEY,
    EmployeeName VARCHAR(255) NOT NULL,
    EmployeeDOJ DATE,
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL, -- set automatically by the system when the row is inserted or updated
    ValidTo DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,     -- set automatically when the row version is closed
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)                 -- defines the row's validity period
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory)); -- history table name is just an example
The columns ValidFrom and ValidTo determine the time period during which that particular row was active.
For more info, refer to the Microsoft article:
https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15
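For example, to pull the row versions that changed since the last load (say, to feed a _HISTORIC table), you can query with the FOR SYSTEM_TIME clause; the variable name and cut-off value below are placeholders:
-- All row versions (current and historical) that started after the last load
DECLARE @LastLoadTime DATETIME2 = '2021-01-01';

SELECT *
FROM Employee FOR SYSTEM_TIME ALL
WHERE ValidFrom > @LastLoadTime;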
Create staging tables and wipe-and-load them. Next, use them to find the differences that need to be loaded into your target tables.
The CDC logic still needs to be performed this way, but it will not affect your source system.
Another way - not sure if possible in your case - is to load partial data based on some source-system date or key. This way you stage only the new data. This improves performance a lot, but makes it impossible to find deletes in the source.
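For the updated rows specifically, a rough sketch of the staging comparison (every table and column name below is a placeholder): join staging to the main copy on the key and keep rows where any non-key column differs, which the EXISTS/EXCEPT idiom does in a NULL-safe way.
-- Hypothetical: staged rows that already exist but whose values have changed
INSERT INTO TABLEA_HISTORIC
SELECT S.*
FROM TABLEA_STAGE S
INNER JOIN TABLEA_MAIN M
    ON S.PK = M.PK
WHERE EXISTS (SELECT S.Col1, S.Col2, S.Col3
              EXCEPT
              SELECT M.Col1, M.Col2, M.Col3);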
A. To replicate a smaller subset of records in the source without making schema changes, there are a few options.
Transactional Replication; however, this is not very flexible. For example, it would not allow any differences in the target database, and therefore is not a solution for you.
Identify a "date modified" field in the source. This obviously has to already exist, and will not allow you to identify deletes.
Use a "windowing" approach where you simply delete and reload the last month's transactions, again based on an existing date. This requires an existing date that isn't back-dated, and doesn't work for non-transactional tables (which are usually small enough to just do full copies anyway).
Turn on change tracking. Your vendor may or may not argue that this is a costly change (it isn't) or that it impacts application performance (it probably doesn't).
https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-tracking-sql-server?view=sql-server-ver15
Turning on change tracking will allow you to more easily identify changes to all tables.
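A minimal sketch of what that involves (database, table, and retention values are placeholders):
-- Enable change tracking on the database and on one table (hypothetical names)
ALTER DATABASE VendorDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.TableA
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);

-- On each load, pull everything changed since the version you stored last time
DECLARE @last_sync_version BIGINT = 0;  -- persist this value between runs

SELECT CT.PK, CT.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES dbo.TableA, @last_sync_version) AS CT;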
You need to ask yourself: is it really an issue to copy the entire table? I have built solutions that simply copy entire large tables (far larger than 31K records) every hour, and there is never an issue.
You need to consider what complications you introduce by building an incremental solution, and whether the associated maintenance and complexity are worth being able to reduce a record copy from 31K rows (full table) to 500 rows (changed). Again, a full copy of 31K records is actually pretty fast under normal circumstances (on the order of 10 seconds or so).
B. Target table
As already recommended by many, you might want to consider a temporal table, although if you do decide to do full copies, a temporal table might not be the best option.
DATA_CONSISTENCY_CHECK is ON for my table. I'm trying to check data consistency for audit purposes. When I update a row to the same value in the main table, the temporal table keeps a history row for that unchanged value, which makes it difficult to track version changes. I'm using MS SQL Server.
You misunderstood the function of the DATA_CONSISTENCY_CHECK option. It is used to check that the time ranges defined by the system_start_time_column_name and system_end_time_column_name columns in PERIOD FOR SYSTEM_TIME do not overlap between the base and history table when you enable the link between them (this is done when you execute the CREATE/ALTER TABLE command).
If you need data deduplication in the history table, you have to implement it yourself. It can be a maintenance task which disables the link, removes the duplicates, updates the time range columns correctly, and re-enables the link between the base and history table.
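A rough sketch of such a maintenance task (object names and the deduplication rule are placeholders; a real implementation also has to stitch the ValidFrom/ValidTo ranges back together, as noted above):
-- Break the link between the base and history table (hypothetical names)
ALTER TABLE dbo.Employee SET (SYSTEM_VERSIONING = OFF);

-- Placeholder dedup rule: drop history versions identical to the current row
DELETE H
FROM dbo.EmployeeHistory H
INNER JOIN dbo.Employee E
    ON  E.EmployeeId   = H.EmployeeId
    AND E.EmployeeName = H.EmployeeName
    AND E.EmployeeDOJ  = H.EmployeeDOJ;

-- Re-enable the link
ALTER TABLE dbo.Employee
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory, DATA_CONSISTENCY_CHECK = ON));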
This is probably an incorrect use case for BigQuery, but I have the following problem: I need to periodically update a BigQuery table. The update should be "atomic" in the sense that clients reading the data should see either only the old version of the data or only the complete new version. The only solution I have now is to use date partitions. The problem with this solution is that clients which just need to read up-to-date data have to know about the partitions and read only from certain ones. Every time I want to make a query, I would first have to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally I would like a solution that is easy and transparent for the clients who read the data.
You didn't mention the size of your update, so I can only give some general guidelines.
Most BigQuery updates, including single DML (INSERT/UPDATE/DELETE/MERGE) and single load job, are atomic. Your reader reads either old data or new data.
Lacking multi-statement transactions right now, if you have updates which don't fit into a single load job, the solution is:
Load the updates into a staging table, and wait until all loads have finished.
Use a single INSERT or MERGE to merge the updates from the staging table into the primary data table.
The drawback: scanning the staging table is not free.
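A minimal sketch of that second step in BigQuery standard SQL (dataset, table, and column names are placeholders):
-- Merge staged rows into the primary table in one atomic statement
MERGE dataset.primary_table T
USING dataset.staging_table S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET value = S.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (S.id, S.value);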
Update: since you have multiple tables to update atomically, there is a tiny trick which may be helpful.
Assuming each table that you need to update has an ActivePartition column as its partition key, you may have a table with only one row:
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to the new active date; then your users use a script:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active
I need to create a table based on another table in MySQL, including the constraints and indices.
I have following scenario:
Table A exists, probably with millions of rows.
I want to create table B exactly the same as table A (including constraints and indices).
Process data from A and some other source and insert it into B.
At the end of processing, drop table A (dropping the indices associated with table A) and rename table B to A, including the indices.
What is the best way to do this? Performance is my real concern.
Thanks
In cases like this, we assume you know the structure of the table. In other words, you are not asking "how do I find out what all of these columns, indexes and constraints are".
Second, we tend to assume that all data in table A is valid, so you do not need to enforce constraints while copying from A to B.
Your "some other source" is a wildcard. I'm assuming you do not know if this other source contains valid data, and would suggest:
1) Create B without indexes or constraints.
2) Copy/bulk insert from "other source" into B.
3) Enforce constraints by issuing SELECTs to find invalid rows. Skip this step if you know the data is valid. Once it is OK to proceed:
4) Copy A to B in "chunks". The issue here is that a straight SELECT...INTO... of all X million rows will take forever (because of the explosion of resources required to do it in a single implied transaction), but row-by-row will also take forever (because it's just plain slow to do one row at a time). So you process chunks of 1,000 or 10,000 rows at a time.
5) When all data is copied over, add the indexes.
6) Add the constraints
7) Drop A
8) Rename B
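A rough sketch of the skeleton in MySQL (table, key, and bound values are placeholders); per the steps above, you could also drop B's secondary indexes after cloning and re-add them once the copy is done. Note that CREATE TABLE ... LIKE copies indexes but not foreign key constraints, which would need to be added separately.
-- Clone A's structure, including its indexes (hypothetical names)
CREATE TABLE B LIKE A;

-- Copy the data in chunks keyed on the primary key; repeat with advancing bounds
INSERT INTO B SELECT * FROM A WHERE id >     0 AND id <= 10000;
INSERT INTO B SELECT * FROM A WHERE id > 10000 AND id <= 20000;
-- ...continue until the max id is reached, usually driven by a loop in application code

-- Swap the tables atomically, then drop the old copy
RENAME TABLE A TO A_old, B TO A;
DROP TABLE A_old;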
First off let me say I am running on SQL Server 2005 so I don't have access to MERGE.
I have a table with ~150K rows that I am updating daily from a text file. As rows fall out of the text file, I need to delete them from the database, and if they change or are new I need to update/insert accordingly.
After some testing I've found that, performance-wise, it is dramatically faster to do a full delete and then bulk insert from the text file rather than read through the file line by line doing an update/insert. However, I recently came across some posts discussing mimicking the MERGE functionality of SQL Server 2008 using a temp table and the OUTPUT clause of the UPDATE statement.
I was interested in this because I am looking into how I can eliminate the window of time during which the table is empty in my delete/bulk-insert method. I still think this method will be the fastest, so I am looking for the best way to solve the empty-table problem.
Thanks
I think your fastest method would be to:
1) Drop all foreign keys and indexes from your table.
2) Truncate your table.
3) Bulk insert your data.
4) Recreate your foreign keys and indexes.
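A minimal sketch of those steps in T-SQL (the index, constraint, table, and file names are placeholders):
-- 1) Drop the foreign key and index (hypothetical names)
ALTER TABLE dbo.TargetTable DROP CONSTRAINT FK_TargetTable_Parent;
DROP INDEX IX_TargetTable_Code ON dbo.TargetTable;

-- 2) Empty the table
TRUNCATE TABLE dbo.TargetTable;

-- 3) Bulk insert the daily text file
BULK INSERT dbo.TargetTable
FROM 'C:\datafile.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- 4) Recreate the index and foreign key
CREATE INDEX IX_TargetTable_Code ON dbo.TargetTable (Code);
ALTER TABLE dbo.TargetTable
    ADD CONSTRAINT FK_TargetTable_Parent FOREIGN KEY (ParentId) REFERENCES dbo.Parent (ParentId);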
Is the problem that Joe's solution is not fast enough, or that you can not have any activity against the target table while your process runs? If you just need to prevent users from running queries against your target table, you should contain your process within a transaction block. This way, when your TRUNCATE TABLE executes, it will create a table lock that will be held for the duration of the transaction, like so:
BEGIN TRAN;

-- TRUNCATE takes a table lock held until COMMIT, so readers never see the table empty
TRUNCATE TABLE stage_table;

BULK INSERT stage_table
FROM N'C:\datafile.txt';

COMMIT TRAN;
An alternative solution which would satisfy your requirement of not having "down time" for the table you are updating:
It sounds like originally you were reading the file and doing an INSERT/UPDATE/DELETE one row at a time. A more performant approach that does not involve clearing down the table is as follows:
1) bulk load the file into a new, separate table (no indexes)
2) then create the PK on it
3) Run 3 statements to update the original table from this new (temporary) table:
DELETE rows in the main table that don't exist in the new table
UPDATE rows in the main table where there is a matching row in the new table
INSERT rows into main table from the new table where they don't already exist
This will perform better than row-by-row operations and should hopefully satisfy your overall requirements
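A minimal sketch of those three statements in SQL Server 2005 syntax (table, key, and column names are placeholders):
-- 1) DELETE rows in the main table that no longer exist in the newly loaded table
DELETE M
FROM dbo.MainTable M
LEFT JOIN dbo.LoadTable L ON L.Id = M.Id
WHERE L.Id IS NULL;

-- 2) UPDATE rows in the main table that have a matching row in the new table
UPDATE M
SET M.Col1 = L.Col1,
    M.Col2 = L.Col2
FROM dbo.MainTable M
INNER JOIN dbo.LoadTable L ON L.Id = M.Id;

-- 3) INSERT rows from the new table that are not yet in the main table
INSERT INTO dbo.MainTable (Id, Col1, Col2)
SELECT L.Id, L.Col1, L.Col2
FROM dbo.LoadTable L
LEFT JOIN dbo.MainTable M ON M.Id = L.Id
WHERE M.Id IS NULL;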
There is a way to update the table with zero downtime: keep two days' worth of data in the table, and delete the old rows after loading the new ones!
Add a DataDate column representing the date for which your ~150K rows are valid.
Create a one-row, one-column table with "today's" DataDate.
Create a view of the two tables that selects only rows matching the row in the DataDate table. Index it if you like. Readers will now refer to this view, not the table.
Bulk insert the rows. (You'll obviously need to add the DataDate to each row.)
Update the DataDate table. The view updates instantly!
Delete yesterday's rows at your leisure.
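A rough sketch of the setup (table, view, column names, and the example dates are placeholders):
-- One-row control table holding the currently "live" DataDate (hypothetical names)
CREATE TABLE dbo.DataDate (DataDate DATETIME NOT NULL);
GO

-- Readers query the view, which only exposes rows for the active DataDate
CREATE VIEW dbo.CurrentRows
AS
SELECT R.*
FROM dbo.RowsTable R
INNER JOIN dbo.DataDate D ON D.DataDate = R.DataDate;
GO

-- After bulk inserting the new day's rows (tagged with the new DataDate):
UPDATE dbo.DataDate SET DataDate = '20210602';          -- readers switch over instantly
DELETE FROM dbo.RowsTable WHERE DataDate < '20210602';  -- clean up old rows at leisure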
SELECT performance won't suffer; joining one row to 150,000 rows along the primary key should present no problem to any server less than 15 years old.
I have used this technique often, and have also struggled with processes that relied on sp_rename. Production processes that modify the schema are a headache. Don't.
For raw speed, I think with ~150K rows in the table, I'd just drop the table, recreate it from scratch (without indexes) and then bulk load afresh. Once the bulk load has been done, then create the indexes.
This assumes, of course, that having a period of time when the table is empty or doesn't exist is acceptable, which it sounds like could be the case.