How to test incremental data in ETL - sql

I have been asked at many interviews the same question again and again.The question is how would you test incremental data which gets loaded every day in their database.My position is Data warehouse QA plus BA.The main purpose of testing is to check if we have all the data from source and then to test if all the data copied from source got placed in respective tables as designed by developers.
So every time somebody asks this question i answer like this:To test incremental data we take data from staging tables which will have the data for the daily incremental file.So now i can compare the staging table against the target database.Like all databases there might be some calculation or joins we did according to design to get data from staging to production so i will use that design to make my queries to test data in production against source.
So my question here is i have tested incremental loads this way in the only project i did so can anybody give me detailed answer because i think i might not be answering it right.

Incremental loads are inevitable in any data warehousing environment. Following are the ways to render the incremental data and test it.
1) Source & Target tables should be designed in such a way where you should store date and timestamp of the data (row). Based on the date and timestamp column(s) you can easily fetch the incremental data.
2) If you use sophisticated ETL tools like informatica or Abinitio, then it is simple to see the status of the loads chronologically. These tools store the information for every load. However it has some limitation to store the last 10 loads. You need to configure it to store for more than 10 loads.
3) If you are not using sophisticated ETL tools then you should build ETL strategies to store the statistics of the load and capture the information (like no. of inserts, deletes, updates etc.,) during the load. These information can be retrieved whenever you need. But it needs lots of technical knowledge to adopt.
If you want to succeed in a data warehouse interview, i would suggest the best iOS application(data-iq) created by a us based company and its for candidates like you . check it out and you may like it. good luck for your interview.

I will answer it by telling how testing incremental data is different from History data.
I need to test only and only the incremental data. So I limit it by using the date condition in my source/staging tables and same date condition or Audit ID used for that incremental load in Target table.
Another thing that we need to check while testing incremental data is - Usually in Type 2 tables, we have a condition like
If a record is already existing in target table and there is no change
as compared to the last record in target table, then don't insert that
record.
So to take care of such condition, I need to do a History check where I compare the last record of target table with the first record of incremental data and if they are exactly same then I need drop that record. (Here ACTIVITY_DT is a custom metadata column, so we will look for change only in EMPID, NAME, CITY)
For example - Following are the records in my target table as a part of History load -
And these are the records which I am getting in my Incremental data
So In above scenario, I compare the last record of History data (sorted on ACTIVITY_DT DESC) with the first record of Incremental data (sorted on ACTIVITY_DT ASC). There is no change in data columns, so I need to drop the following record as it should not be inserted into target table
1 Aashish HYD 6/25/2014
So as part of this incremental load only two records are inserted which are as following -
1 Aashish GOA 6/26/2014
1 Aashish BLR 6/27/2014

Related

How to make 2 remote tables in sync along with determination of Delta records

We have 10 tables on vendor system and same 10 tables on our DB side along with 10 _HISTORIC tables i.e. for each table in order to capture updated/new records.
We are reading the main tables from Vendor system using Informatica to truncate and load into our tables. How do we find Delta records without using Triggers and CDC as it comes with cost on vendor system.
4 tables are such that which have 200 columns and records around 31K in each with expectation that 100-500 records might update daily.
We are using Left Join in Informatica to load new Records in our Main and _HISTORIC tables.
But what's efficient approach to find the Updated records of Vendor table and load them in our _HISTORIC table ?
For new Records using query :
-- NEW RECORDS
INSERT INTO TABLEA_HISTORIC
SELECT FROM TABLEA
LEFT JOIN TABLEB
ON A.PK = B.PK
WHERE B.PK IS NULL
I believe a system versioned temporary table will be something you are looking for here. You can create a system versioned table for any table in SQL server 2016 or later.
for example, say I have a table Employee
CREATE TABLE Employee
(
EmployeeId VARCHAR(20) PRIMARY KEY,
EmployeeName VARCHAR(255) NOT NULL,
EmployeeDOJ DATE,
ValidFrom datetime2 GENERATED ALWAYS AS ROW START,--Automatically updated by the system when the row is updated
ValidTo datetime2 GENERATED ALWAYS AS ROW END,--auto-updated
PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)--to set the row validity period
)
the column ValidFrom, ValidTo determines the time period on which that particular row was active.
For More Info refer the micorsoft article
https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15
Create staging tables, load wipe&load them. Next, use them for finding the differences that need to be load into your target tables.
The CDC logic needs to be performed this way, but it will not affect your source system.
Another way - not sure if possible in your case - is to load partial data based on some source system date or key. This way you stage only the new data. This improves performance a lot, but makes finding the deletes in source impossible.
A. To replicate a smaller subset of records in the source without making schema changes, there are a few options.
Transactional Replication, however this is not very flexible. For example would not allow any differences in the target database, and therefore is not a solution for you.
Identify a "date modified" field in the source. This obviously has to already exist, and will not allow you to identify deletes
Use a "windowing approach" where you simply delete and reload the last months transactions, again based on an existing date. Requires an existing date that isn't back dated and doesn't work for non transactional tables (which are usually small enough to just do full copies anyway)
Turn on change tracking. Your vendor may or may not argue that tihs is a costly change (it isn't) or impacts application performance (it probably doesn't)
https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-tracking-sql-server?view=sql-server-ver15
Turning on change tracking will allow you to more easily identify changes to all tables.
You need to ask yourself: is it really an issue to copy the entire table? I have built solutions that simple copy entire large tables (far larger than 31k records) every hour and there is never an issue.
You need to consider what complications you introduce by building an incremental solution, and whether the associated maintenance and complexity is worth being able to reduce a record copy from 31K (full table) to 500 records (changed). Again a full copy of 31K records is actually pretty fast under normal circumstances (like 10 seconds or so)
B. Target table
As already recommended by many, you might want to consider a temporal table, although if you do decide to do full copies, a temporal table might not be the beast option.

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a a timestamp. However our source data files are too big to edit before we add them to cloud storage (4+ GB of textual data/file). As far as I know there is no way to append a timestamp column to each row before bringing them in BigQuery, right?
We are thus toying with the idea of creating daily tables for each source. But don't know how this will work when we have real time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (More info here). This lets you insert single rows or groups of rows into a table just by posting them directly to bigquery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).

Join or storing directly

I have a table A which contains entries I am regularly processing and storing the result in table B. Now I want to determine for each entry in A its latest processing date in B.
My current implementation is joining both tables and retrieving the latest date. However an alternative, maybe less flexible, approach would be to simply store the date in table A directly.
I can think of pros and cons for both cases (performance, scalability, ....), but didnt have such a case yet and would like to see whether someone here on stackoverflow had a similar situation and has a recommendation for either one for a specific reason.
Below a quick schema design.
Table A
id, some-data, [possibly-here-last-process-date]
Table B
fk-for-A, data, date
Thanks
Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.
I would leave Table A alone and just introduce an index on id and date. If the historical table is big, introduce an auto-increment PK for table B and have a separate table that maps the B-Pkid to A-pkid.
I'm not a fan of UPDATE on a warehouse table, that's why I didn't recommend a CURRENT_IND, but that's an alternative.
This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).
You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.
On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticable performance impact unless you have HUGE amounts of data.
So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you have on the production environment. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, de-normalize the table.
Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on a moderately complex system, denormalized data schemas are at the heart of a lot of stupid bugs. It makes introducing new developers harder, it means additional complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect lots of data.
We had a similar situation in our project tracking system where the latest state of the project is stored in the projects table (Cols: project_id, description etc.,) and the history of the project is stored in the project_history table (Cols: project_id, update_id, description etc.,). Whenever there is a new update to the project, we need find out the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and get the MAX(update_id), but the cost would be high considering the number of the project updates (in a couple of hundreds of thousands) and the frequency of update. So, we decided to store the value in the projects table itself in max_update_id column and keep updating it whenever there is a new update to a given project. HTH.
If I understand correctly, you have a table whose each row is a parameter and another table that logs each parameter value historically in a time series. If that is correct, I currently have the same situation in one of the products I am building. My parameter table hosts a listing of measures (29K recs) and the historical parameter value table has the value for that parameter every 1 hr - so that table currently has 4M rows. At any given point in time there will be a lot more requests FOR THE LATEST VALUE than for the history so I DO HAVE THE LATEST VALUE STORED IN THE PARAMETER TABLE in addition to it being in the last record in the parameter value table. While this may look like duplication of data, from the performance standpoint it makes perfect sense because
To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join and more importantly
I do not have to get the latest value for each parameter from such a huge table
So yes, I would in your case most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writing new data but a hell of a lot faster for reads.

Best practice for auditing data in SQL Server and retrieving point in time data

I've been doing history tables for some time now in databases, but never put too much effort or thought into it. I wonder what is the best practice out there.
My main goal is to record any changes to a record for a particular day. If more than one change happens in a day then then only one history record will exist. I need to record the date the record was changed, also when I retrieve data I need to pull the correct record from history as it was at a particular time. So for example I have a customers table and want to pull out what their address was for a particular date. My Sprocs like get Cust details will take in an optional date and if no date is passed in then it returns the most recent record.
So here's what I was looking for advice on:
Do I keep the history table in the same table and use a logical delete flag to hide the historical ones? I normally don't do this as some tables can change a lot and have lots of records. Do I use a separate table that mirrors the main table? I usually do this. Should I only put change records into the history table and not the current one? What is the most efficient way given a date to pull out the right record at a point in time, get every record for a customer <= date passed in, and then sort by most recent date and take the top?
Thanks for all the help... regards M
Suggestion is to use trigger based auditing and create triggers for all tables you need to audit.
With triggers you can accomplish the requirement for not storing more than one record update per day.
I’d suggest you check out ApexSQL Audit that generates triggers for you and try to reverse engineer what triggers they use, how storage tables look like and such.
This will give you a good start and you can work form there.
Disclaimer: not affiliated with ApexSQL but I do use their tools on a daily basis.
I'm no expert in the field but a good sql consultant once told me that a good aproach is generally to use the same table if all data can be changed. Otherwise have the original table contain only core nonchangable data and the historical table contain only stuff that can be changed.
You should defintely read this article on managing bitemporal data. The nice thing about this approach is it enables an auditable way of correcting historical data.
I beleive this will address your concerns about modidying the history data
I've always used a modified version of the audit table described in this article. While it does require you to pivot data so that it resembles your table's native structure, it is resilient against changes to the schema.
You can create a UDF that returns a table and accepts a table name (varchar) and point in time (datetime) as parameters. The UDF should rebuild the table using the audit (historical values) and give you the effective values at that date & time.

Database history for client usage

I'm trying to figure out what would be the best way to have a history on a database, to track any Insert/Delete/Update that is done. The history data will need to be coded into the front-end since it will be used by the users. Creating "history tables" (a copy of each table used to store history) is not a good way to do this, since the data is spread across multiple tables.
At this point in time, my best idea is to create a few History tables, which the tables would reflect the output I want to show to the users. Whenever a change is made to specific tables, I would update this history table with the data as well.
I'm trying to figure out what the best way to go about would be. Any suggestions will be appreciated.
I am using Oracle + VB.NET
I have used very successfully a model where every table has an audit copy - the same table with a few additional fields (time stamp, user id, operation type), and 3 triggers on the first table for insert/update/delete.
I think this is a very good way of handling this, because tables and triggers can be generated from a model and there is little overhead from a management perspective.
The application can use the tables to show an audit history to the user (read-only).
We've got that requirement in our systems. We added two tables, one header, one detail called AuditRow and AuditField. The AuditRow contains one row per row changed in any other table, and the AuditField contains one row per column changed with old value and new value.
We have a trigger on every table that writes a header row (AuditRow) and the needed detail rows (one per changed colum) on each insert/update/delete. This system does rely on the fact that we have a guid on every table that can uniquely represent the row. Doesn't have to be the "business" or "primary" key, but it's a unique identifier for that row so we can identify it in the audit tables. Works like a champ. Overkill? Perhaps, but we've never had a problem with auditors. :-)
And yes, the Audit tables are by far the largest tables in the system.
If you are lucky enough to be on Oracle 11g, you could also use the Flashback Data Archive
Personally, I would stay away from triggers. They can be a nightmare when it comes to debugging and not necessarily the best if you are looking to scale out.
If you are using an PL/SQL API to do the INSERT/UPDATE/DELETEs you could manage this in a simple shift in design without the need (up front) for history tables.
All you need are 2 extra columns, DATE_FROM and DATE_THRU. When a record is INSERTed, the DATE_THRU is left NULL. If that record is UPDATEd or DELETEd, just "end date" the record by making DATE_THRU the current date/time (SYSDATE). Showing the history is as simple as selecting from the table, the one record where DATE_THRU is NULL will be your current or active record.
Now if you expect a high volume of changes, writing off the old record to a history table would be preferable, but I still wouldn't manage it with triggers, I'd do it with the API.
Hope that helps.