Limit Rows in ETL Without Date Column for Cue - sql

We have two large tables (Clients and Contacts) which undergo an ETL process every night, being inserted into a single "People" table in the data warehouse. This table is used in many places and cannot be significantly altered without a lot of work.
The source tables are populated by third-party software. We used to assume that we could identify the rows that had been updated since last night by using the "UpdateDate" column in each, but we recently identified some rows that were not touched by the ETL because the "UpdateDate" column was not behaving as we had thought; the software company does not see this as a bug, so we have to live with it.
As a result, we now take all source rows, transform them into a temp staging table and then MERGE that into the data warehouse, using the MERGE to identify any changed values. We have noticed that this process is taking too long on some days and would like to limit the number of rows that the ETL process looks at, as we believe the hold-up is principally the sheer volume of data that is examined and stored in the temp database. We can see no way to look purely at the source data and identify when each row last changed.
Here is simplified pseudocode of the ETL stored procedure, although what the procedure actually does is not really relevant to the question (included just in case you disagree with me!):
CREATE TABLE #TempTable (ClientOrContact BIT NOT NULL, Id INT NOT NULL, [Some_Other_Columns]);
INSERT #TempTable
SELECT 1 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ClientsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
INSERT #TempTable
SELECT 0 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ContactsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
ALTER TABLE #TempTable ADD PRIMARY KEY (ClientOrContact, Id);
MERGE Target_PeopleTable AS Tgt
USING (SELECT [SomeColumns] FROM #TempTable JOIN [SomeOtherTables]) AS Src
ON Tgt.ClientOrContact = Src.ClientOrContact AND Tgt.Id = Src.Id
WHEN MATCHED AND NOT EXISTS (SELECT Tgt.* INTERSECT SELECT Src.*)
THEN UPDATE SET ([All_NonKeyTargetColumns] = [All_NonKeySourceColumns])
WHEN NOT MATCHED BY TARGET THEN INSERT ([All_TargetColumns]) VALUES ([All_SourceColumns])
OUTPUT $Action INTO #Changes;
DECLARE @ChangeCount INT = (SELECT COUNT(*) FROM #Changes);
RETURN @ChangeCount;
GO
The source tables have about 1.5M rows each, but each day only a relatively small number of rows are inserted or updated (never deleted). There are about 50 columns in each table; of those, about 40 columns can have changed values each night. Most columns are VARCHAR, and each table contains an independent incremental primary key column. We can add indexes to the source tables, but not alter them in any other way (they have already been indexed by a predecessor). The source tables and target table are on the same server, but in different databases. Edit: The target table has a composite primary key on the ClientOrContact and Id columns, matching that shown on the temp table in the script above.
So, my question is this - please could you suggest any general possible strategies that might be useful to limit the number of rows we look at or copy across each night? If we only touched the rows that we needed to each night, we would be touching less than 1% of the data we do at the moment...

Before you try the following suggestion, one thing to check is that Target_PeopleTable has an index or primary key on the Id column. It probably does, but without schema information I am making no assumptions; if it is missing, adding one might speed up the MERGE stage.
As you've identified, if you could somehow limit the records in #TempTable to just the changed rows, this could offer a performance win for the actual MERGE statement (depending on how expensive determining just the changed rows is).
As a general strategy I would consider some kind of checksum to try to identify only the changed records. The T-SQL CHECKSUM function can calculate a checksum across the required columns (specify the columns as a comma-separated list), and there is also BINARY_CHECKSUM, which detects some changes (such as case-only changes) that CHECKSUM does not.
Since you cannot change the source schema, you would have to maintain a list of record Ids and associated checksums in your target database, so that you can readily compare the source checksums to the target checksums from the last run in order to identify a difference.
You can then insert into the temp table only where there is a checksum difference between target and source, or where the Id does not exist in the target database.
This might just be moving the performance problem to the temp-table insert, but I think it's worth a try.
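As a rough sketch of that idea (the dbo.Client_Checksums tracking table and its name are hypothetical, and the Contacts side would need the same treatment), the staging insert might look something like this:
CREATE TABLE dbo.Client_Checksums (Id INT NOT NULL PRIMARY KEY, RowChecksum INT NOT NULL);

-- Stage only the rows whose checksum differs from the last run, or that are new:
INSERT #TempTable (ClientOrContact, Id, [Some_Other_Columns])
SELECT 1, C.Id, [SomeColumns]
FROM Source_ClientsTable AS C
LEFT JOIN dbo.Client_Checksums AS K ON K.Id = C.Id
WHERE K.Id IS NULL
   OR K.RowChecksum <> BINARY_CHECKSUM([SomeColumns]);
-- After a successful MERGE, upsert the new checksums back into dbo.Client_Checksums.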

Have you considered triggers? I avoid them like the plague, but they really are the solution to some problems.
Put an INSERT/UPDATE [/DELETE?] trigger on your two source tables. Program it such that when rows are added or updated, the trigger logs the IDs of those rows in an audit table (you'll have to create this), where that table contains the ID, the type of change (update or insert, and delete if you have to worry about those) and when the change was made. When you run the ETL, join this list of "to be merged" items with the source tables. When you're done, delete the audit table's contents and it's reset for the next run. (Use the "added on" datetime column to make sure you don't delete rows that may have been added while you were running the ETL.)
There’s lots of details behind proper use and implementation, but overall this idea should do what you need.
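A minimal sketch of such a trigger for the Clients side (the audit table and trigger names here are hypothetical, and Source_ContactsTable would need an equivalent):
CREATE TABLE dbo.ClientChangeLog (
    Id         INT      NOT NULL,
    ChangeType CHAR(1)  NOT NULL,  -- 'I' = insert, 'U' = update
    AddedOn    DATETIME NOT NULL DEFAULT (GETDATE())
);
GO
CREATE TRIGGER trg_Clients_Audit
ON Source_ClientsTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Rows also present in "deleted" were updated; the rest are new inserts.
    INSERT dbo.ClientChangeLog (Id, ChangeType)
    SELECT i.Id,
           CASE WHEN EXISTS (SELECT 1 FROM deleted d WHERE d.Id = i.Id)
                THEN 'U' ELSE 'I' END
    FROM inserted AS i;
END;
GO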

Related

How to do update partitioned table [duplicate]

So I have a main table in Hive; it will store all my data.
I want to be able to load an incremental data update about every month, with a large amount of data, a couple billion rows. There will be new data
as well as updated entries.
What is the best way to approach this? I know Hive recently upgraded and supports update/insert/delete.
What I've been thinking is to somehow find the entries that will be updated, remove them from the main table, and then just insert the new incremental update. However, after trying this, the inserts are very fast, but the deletes are very slow.
The other way is to do something using the update statement to match the key values from the main table and the incremental update and update their fields. I haven't tried this yet. This also sounds painfully slow, since Hive would have to update each entry one by one.
Anyone got any ideas as to how to do this most efficiently and effectively?
I'm pretty new to Hive and databases in general.
If merge in ACID mode is not applicable, then it's possible to update using FULL OUTER JOIN or using UNION ALL + row_number.
To find all entries that will be updated you can join increment data with old data:
insert overwrite table target_data [partition() if applicable]
SELECT
    --select new if exists, old if not exists
    case when i.PK is not null then i.PK else t.PK end as PK,
    case when i.PK is not null then i.COL1 else t.COL1 end as COL1,
    ...
    case when i.PK is not null then i.COL_n else t.COL_n end as COL_n
FROM
    target_data t --restrict partitions if applicable
    FULL JOIN increment_data i on (t.PK = i.PK);
It's possible to optimize this by restricting the partitions in target_data that will be overwritten and joined, using WHERE partition_col in (select distinct partition_col from increment_data), or, if possible, by passing the partition list as a parameter and using it in the where clause; that will work even faster.
Also, if you want to update all columns with new data, you can apply this solution with UNION ALL + row_number(); it works faster than a full join: https://stackoverflow.com/a/44755825/2700344
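For reference, a rough sketch of that UNION ALL + row_number() pattern, using the same PK/COL placeholders as above (increment rows get new_flag = 1, so they win the ranking):
insert overwrite table target_data
select PK, COL1, COL_n  -- ...remaining columns elided, as above
from
(
    select s.*,
           row_number() over (partition by PK order by new_flag desc) as rn
    from
    (
        select PK, COL1, COL_n, 1 as new_flag from increment_data
        union all
        select PK, COL1, COL_n, 0 as new_flag from target_data
    ) s
) ranked
where rn = 1;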
Here is my solution/workaround if you are using an old Hive version. This works better when you have a large target table that you can't drop and recreate with the full data every time.
Create one more table, say a delete_keys table. This will hold all the keys from the main table that have been superseded, along with their surrogate keys.
While loading incremental data into the main table, do a left join with the main table. For all the matching records, we ideally should update the main table. But instead, we take the keys (along with the surrogate key) from the main table for all matching records and insert them into the delete_keys table. Now we can insert all delta records into the main table as is, irrespective of whether they are to be updated or inserted.
Create a view on the main table using the delete_keys table so that rows whose keys match the delete_keys table are not fetched. This view will be the final target table; it will not show records from the main table that have been superseded by the latest records.
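A rough sketch of this workaround, with hypothetical names (main_table, delta, delete_keys, a surrogate_key column, and the main_table_current view):
-- Capture the keys of main-table rows that the delta supersedes:
insert into table delete_keys
select m.pk, m.surrogate_key
from main_table m
join delta d on m.pk = d.pk;

-- Load all delta records into the main table as-is:
insert into table main_table
select * from delta;

-- The view hides superseded rows and becomes the final target:
create view main_table_current as
select m.*
from main_table m
left join delete_keys k
  on m.pk = k.pk and m.surrogate_key = k.surrogate_key
where k.pk is null;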

Incremental load for Updates into Warehouse

I am planning for an incremental load into warehouse (especially for updates of source tables in RDBMS).
I am capturing the updated rows in staging tables from the RDBMS based on the update datetime. But how do I determine which columns of a particular row need to be updated in the target warehouse tables?
Or do I just delete a particular row in the warehouse table (based on the primary key of the row in staging table) and insert the new updated row?
Which is the best way to implement the incremental load between the RDBMS and Warehouse using PL/SQL and SQL coding?
In my opinion, the easiest way to accomplish this is as follows:
Create a stage table identical to your host table. When you do your incremental/net-change load, load all changed records into this table (based on whatever your "last updated" field is).
Delete the records from your actual table based on the primary key. For example, if your primary key is (customer, part), the query might look like this:
delete from main_table m
where exists (
select null
from stage_table s
where
m.customer = s.customer and
m.part = s.part
);
Insert the records from the stage to the main table.
You could also do an update of existing records / insert of new records, but either way that's two steps. The advantage of the method I listed is that it will work even if your tables have partitions and the newly updated data violates one of the original partition rules, whereas an update would not accomplish that. Also, the syntax is much simpler, as your update would have to list every single field, whereas the delete from / insert into approach allows you to list only the primary key fields.
Oracle also has a merge clause that will update if it exists or insert if it does not. I honestly don't know how that would be impacted if you had partitions.
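A minimal sketch of that MERGE, assuming the same (customer, part) key and hypothetical non-key columns col1 and col2:
merge into main_table m
using stage_table s
on (m.customer = s.customer and m.part = s.part)
when matched then
  update set m.col1 = s.col1,
             m.col2 = s.col2
when not matched then
  insert (customer, part, col1, col2)
  values (s.customer, s.part, s.col1, s.col2);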
One major caveat: if your updates include deletes, i.e. records that need to be removed from the main table, none of these approaches will handle that, and you will need some other way to deal with it. It may not be necessary, depending on your circumstances, but it's something to consider.

Oracle SQL merge tables without specifying columns

I have a table people with less than 100,000 records and I have taken a backup of this table using the following:
create table people_backup as select * from people
I add some new records to my people table over time, but eventually I want to merge the records from my backup table into people. Unfortunately I cannot simply DROP my table as my new records will be lost!
So I want to update the records in my people table using the records from people_backup, based on their primary key id and I have found 2 ways to do this:
MERGE the tables together
use some sort of fancy correlated update
Great! However, both of these methods use SET and make me specify which columns I want to update. Unfortunately I am lazy; the structure of people may change over time, and while my CTAS statement doesn't need to be updated, my update/merge script will need changes, which feels like unnecessary work.
Is there a way to merge entire rows without having to specify columns? I see here that not specifying columns during an INSERT will direct SQL to insert values by order. Can the same methodology be applied here, and is it safe?
NB: The structure of the table will not change between backups
Given that your table is small, you could simply
DELETE FROM people p
 WHERE EXISTS( SELECT 1
                 FROM people_backup b
                WHERE p.id = b.id );
INSERT INTO people
SELECT *
  FROM people_backup;
That is slow and not especially elegant (particularly if most of the data from the backup hasn't changed), but assuming the columns in the two tables match, it does allow you to avoid listing out the columns. Personally, I'd much prefer writing out the column names (presumably those don't change all that often) so that I could do an update.
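For comparison, a sketch of that explicit approach (the name and email columns are hypothetical examples; every non-key column would have to be listed):
MERGE INTO people p
USING people_backup b
   ON (p.id = b.id)
 WHEN MATCHED THEN
   UPDATE SET p.name  = b.name,
              p.email = b.email
 WHEN NOT MATCHED THEN
   INSERT (id, name, email)
   VALUES (b.id, b.name, b.email);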

What's a good logic/design of a SQL script to incrementally update a table?

So there's this table of just about 40,000 rows I am looking to update. A colleague said it's best to incrementally update the table instead of doing a complete delete and load.
So I've tried hashing out the design and logic of a script to do this, but my inexperience is getting to me. I just don't know what's efficient and what's unneeded when incrementally updating a table.
Currently, the warehouse looks like this: data comes from source into a table (let's call this T1) in Teradata. Then it's sent into another table (let's call this T2) in Teradata with some added fields such as timestamp. Lastly, a view is built on that last table for security reasons.
So with that laid out, I was thinking of creating a temp/volatile table with data from T1. This would have all the data up to the time the script is run, including new records. Then, go through the entire table, checking whether the ID (primary index) already exists in T2, and if not, add it to another temp table. Then somehow combine the second temp table with T2, overwrite T2, and build a view on top of that.
Does this make any sense?
There's also the possibility of records being updated. So they would already exist in T2, but have updated data in a new version of T1. I think comparing the values of all the columns from T1 to T2 would be highly inefficient, but I can't think of another way to do this.
A 40,000 row delete and insert should be pretty painless for any modern database. Ditto for updates.
The real reason for doing an incremental delete/update/insert is so you can log the changes and timestamp rows in the permanent table with the date/time of insertion and/or last update. The usual technique goes something like this:
remove rows from the permanent table that don't exist in the temp table
update rows that exist in both tables
insert rows that exist in the temp table, but don't exist in the permanent table.
Looking at the Teradata docs, that would be something like this (no warranties about this being syntactically correct, since I don't have a Teradata instance to play with):
delete permanent p
where not exists ( select *
from temp t
where t.id = p.id
)
update p
from permanent p ,
temp t
set ...
where t.id = p.id
insert permanent
select ...
from temp t
where not exists ( select *
from permanent p
where p.id = t.id
)
One might note that the deletes might get a little hairy if there are dependent foreign key constraints involved.
One might also note that on the update, the where clause might get a tad...complicated if you want to check for actual changes to column values: not much point in updating a row if nothing has changed.
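Continuing the same pseudocode style (and the same no-warranty caveat), that change check might look something like this, with col1/col2 as hypothetical columns and COALESCE guarding against NULL-to-NULL comparisons:
update p
from permanent p, temp t
set col1 = t.col1,
    col2 = t.col2
where t.id = p.id
  and ( coalesce(p.col1, '') <> coalesce(t.col1, '')
     or coalesce(p.col2, -1) <> coalesce(t.col2, -1) );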
There's a Teradata MERGE command that you might find useful, check this post:
https://forums.teradata.com/forum/database/merge-syntax-simple-version
merge into merge_tmp as t using (select 1 as a,'stf' as b,'uuj' as c) as s
on t.a = s.a
when matched then update set c = s.c
when not matched then insert values (s.a,s.b,s.c);
If you need to match on more columns, simply put an AND in the ON clause.
Edit: If you want to use MERGE you might also need to use a delete statement like the one in nicholas' post.

Copy data between tables in different databases without PK's ( like synchronizing )

I have a table (A) in a database that doesn't have PKs; it has about 300k records.
I have a subset copy (B) of that table in another database; this has only 50k records and contains a backup for a given time range (July data).
I want to copy the missing records from table B into table A without duplicating existing records, of course. (I can create a database link to make things easier.)
What strategy can I follow to successfully insert into A the missing rows from B?
These are the table columns:
IDLETIME NUMBER
ACTIVITY NUMBER
ROLE NUMBER
DURATION NUMBER
FINISHDATE DATE
USERID NUMBER
.. 40 extra varchar columns here ...
My biggest concern is the lack of PK. Can I create something like a hash or a PK using all the columns?
What could be a possible way to proceed in this case?
I'm using Oracle 9i for table A and Oracle XE (10) for B.
The approximate number of elements to copy is 20,000
Thanks in advance.
If the data volumes are small enough, I'd go with the following
CREATE DATABASE LINK A CONNECT TO ... IDENTIFIED BY ... USING ....;
INSERT INTO COPY
SELECT * FROM table@A
MINUS
SELECT * FROM COPY;
You say there are about 20,000 to copy, but not how many in the entire dataset.
The other option is to delete the current contents of the copy and insert the entire contents of the original table.
If the full datasets are large, you could go with a hash, but I suspect that it would still try to drag the entire dataset across the DB link to apply the hash in the local database.
As long as no duplicate rows should exist in the table, you could apply a unique or primary key across all columns. If the overhead of a key/index would be too much to maintain, you could also query the database in your application to see whether the row exists, and only perform the insert if it is absent.
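A sketch of that last option run as a single set-based statement instead (the xe_link database link name is hypothetical, and in practice the NOT EXISTS would have to compare all ~46 columns, NULL-safely, since there is no PK):
INSERT INTO A
SELECT *
FROM   B@xe_link b
WHERE  NOT EXISTS (
         SELECT 1
         FROM   A a
         WHERE  a.USERID     = b.USERID
         AND    a.ACTIVITY   = b.ACTIVITY
         AND    a.DURATION   = b.DURATION
         AND    a.FINISHDATE = b.FINISHDATE
         -- ...and the remaining columns
       );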