SSIS issue with historical data - sql

I have a dimension of metrics found in a table called DimMetrics.
The columns are as follows:
MetricSK - surrogate key (unique per row)
MetricAK - alternate (business) key
Source
Status = Current, Expired
LastUpdate = the start date of my load range
I am pulling data from one data source into this dimension, retrieving and storing the MetricAK and Source on a monthly basis.
The records can change from month to month: a source could be deleted or added.
What is the best method to achieve this? I tried using the built-in Slowly Changing Dimension component, but I only managed to get it to work by adding records that create new MetricSKs.
What I would like is for SSIS, on each monthly import, to check the current records, set any records that are not part of the new batch to Expired, and then add any new records as Current with the first day of the date range I choose.
I hope this makes sense; I am stuck finding a viable solution.
Thanks,
Pete

OK, this is kind of a hack, since the Microsoft SCD component has no built-in flow for deletes. Before you start the data flow task that performs the SCD on your table, use an Execute SQL task that sets the status of all rows to Expired. Then, in the data flow task, take the SCD component's output for unchanged rows and attach an OLE DB Command that updates the status back to Current. This approach would be a performance bottleneck if you are talking about several million rows, but for a medium-sized table it should be fine.
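As a rough sketch, the two statements could look like this (DimMetrics and its columns come from the question; wiring the second statement to the SCD's Unchanged Output, with the ? parameters mapped to the row's columns, is my assumption about your setup):

-- Execute SQL task, run once before the data flow starts
UPDATE DimMetrics SET Status = 'Expired';

-- OLE DB Command on the SCD Unchanged Output, run per row;
-- the ? parameters are mapped to the row's MetricAK and Source
UPDATE DimMetrics
SET Status = 'Current'
WHERE MetricAK = ? AND Source = ?;

Rows arriving on the SCD's New Output get inserted as Current with your chosen start date, and anything that never shows up in the batch simply stays Expired.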

Populating fact table with different sequence time

I am using the following query to populate my fact table:
Select sh.isbn_l, sh.id_c, sh.id_s, sh.data, sh.quantity, b.price
from Book as b
inner join Sales as sh
on b.isbn = sh.isbn_l
The main thing is that I want to load the table for a specific time window. So if I load today, I get all the records from the last load time up to now. And if I load again the day after tomorrow, I get the data from after today's load time up to the day after tomorrow.
What I mean is NO DUPLICATED ROWS or DATA. What should I do?
Any ideas, please?
Thank you in advance
Streams (and maybe Tasks) are your friend here.
A Snowflake Stream records the delta of change data capture (CDC) information for a table (such as a staging table), including inserts and other DML changes. A stream allows querying and consuming a set of changes to a table, at the row level, between two transactional points of time.
In a continuous data pipeline, table streams record when staging tables and any downstream tables are populated with data from business applications using continuous data loading and are ready for further processing using SQL statements.
Snowflake Tasks may optionally use table streams to provide a convenient way to continuously process new or changed data. A task can transform new or changed rows that a stream surfaces. Each time a task is scheduled to run, it can verify whether a stream contains change data for a table (using SYSTEM$STREAM_HAS_DATA) and either consume the change data or skip the current run if no change data exists.
Users can define a simple tree-like structure of tasks that executes consecutive SQL statements to process data and move it to various destination tables.
https://docs.snowflake.com/en/user-guide/data-pipelines-intro.html
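A minimal sketch of that pattern against the question's tables (the stream, task, warehouse, and fact table names and the schedule are assumptions):

-- Record changes landing in the Sales staging table
create or replace stream sales_stream on table Sales;

-- Every 5 minutes, load only the new rows into the fact table
create or replace task load_fact_sales
  warehouse = etl_wh
  schedule = '5 minute'
  when system$stream_has_data('sales_stream')
as
  insert into fact_sales (isbn_l, id_c, id_s, data, quantity, price)
  select s.isbn_l, s.id_c, s.id_s, s.data, s.quantity, b.price
  from sales_stream s
  inner join Book as b on b.isbn = s.isbn_l
  where s.metadata$action = 'INSERT';

alter task load_fact_sales resume;

Because a stream only returns rows added since it was last consumed, each run picks up exactly the delta, which gives you the no-duplicates behaviour you are after.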

How can I present the changes for updated data in Tableau

I am working on some datasets which get updated daily. By updated, I mean that three things happen:
1. New rows get added.
2. Some rows get deleted.
3. Some existing rows get replaced with new values.
Now I have prepared dashboards in Tableau to analyze the daily data, but I would also like to compare how things are changing from day to day (i.e. whether we are progressing or losing ground compared to the previous day).
I am aware that we can take extracts from the data set. But if I go this way, I am not sure how to use all the extracts in one worksheet and compare the info given by all of them.
Tableau is essentially a mechanism that builds a SQL query in the background and then builds tables and charts from the fetched results. This means that once you delete a row from the table it no longer exists, so Tableau cannot read it. If anything, your DB architecture should be creating new records with a create timestamp rather than deleting a record and putting a new one in its place; with delete-and-replace you'll only ever have one record in that table. This sounds like a design issue.
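For what it's worth, a sketch of that append-only design (all names here are invented for illustration):

-- Append-only history table: rows are never updated or deleted
CREATE TABLE metrics_history (
    metric_id  INT,
    value      DECIMAL(18,2),
    create_ts  DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- The daily load appends a fresh snapshot instead of overwriting
INSERT INTO metrics_history (metric_id, value)
SELECT metric_id, value
FROM metrics_current;

In Tableau you can then filter or partition on create_ts to put today's and yesterday's snapshots side by side in one worksheet.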

BigQuery update multi tables

I'm holding huge transaction data in daily tables, named by business date:
transaction_20140101
transaction_20140102
transaction_20140103 ...
The process flow is like this:
1. I load the batch of newly arrived files into a temp table.
2. I group by the transaction_date field to see which date each row belongs to; for each date, I query the temp table for that date and insert the rows into the proper transaction_YYYYMMDD table.
3. I do step 2 in parallel to save time, because the temp table might contain data belonging to 20 different days.
My challenge is what to do if one of these processes fails and the others don't.
I can't just run it all again, since that would create duplicates in the tables that were already updated successfully.
I solved the issue by managing the update myself, but it seems too complex.
Is this a best practice for dealing with multiple tables?
I'd be glad to learn how others deal with loading data into multiple tables according to business date, and not just insert date (that part is easy).
You could add an extra step in the middle: instead of moving directly from today's temp table into the permanent business-date tables, extract into temporary daily tables and then copy the data over to the permanent tables (a sketch follows the list).
1. Query from today's temp table, sharded by day into tmp_transaction_YYYYMMDD. Use the WRITE_EMPTY or WRITE_TRUNCATE write disposition so that this step is idempotent.
2. Verify that all expected tmp_transaction_YYYYMMDD tables exist. If not, debug the failures and go back to step 1.
3. Run parallel copy jobs from each tmp_transaction_YYYYMMDD table to append to the corresponding permanent transaction_YYYYMMDD table.
4. Verify that the copy jobs succeeded. If not, retry the individual failures from step 3.
5. Delete the tmp_transaction_YYYYMMDD tables.
The advantage of this is that you can catch query errors before affecting any of the destination tables, then copy over all the added data at once. You may still have the same issue if the copy jobs fail, but individual copy jobs are easier to debug and retry.
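As a rough sketch of steps 1 and 3 in today's BigQuery Standard SQL (which postdates this answer; the ds dataset and temp_batch table names are assumptions), with CREATE OR REPLACE TABLE playing the role of WRITE_TRUNCATE:

-- Step 1: idempotent per-day extract; safe to re-run after a failure
CREATE OR REPLACE TABLE ds.tmp_transaction_20140101 AS
SELECT *
FROM ds.temp_batch
WHERE transaction_date = DATE '2014-01-01';

-- Step 3: append into the permanent table, one statement per day,
-- run only after every tmp_transaction_YYYYMMDD table has been verified
INSERT INTO ds.transaction_20140101
SELECT * FROM ds.tmp_transaction_20140101;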
Our incentive for incremental load is cost, so we are interested in "touching each record only once".
We use table decorators to identify the increment. We manage the increment timestamps independently and add them to the query at run time. It requires some logic to maintain, but nothing too complicated.

How to test incremental data in ETL

I have been asked the same question again and again at many interviews: how would you test incremental data that gets loaded every day into the database? My position is Data Warehouse QA plus BA. The main purpose of the testing is to check that we have all the data from the source, and then to verify that all the data copied from the source landed in the respective tables as designed by the developers.
Every time somebody asks this question, I answer like this: to test incremental data, we take data from the staging tables, which hold the data for the daily incremental file, so I can compare the staging tables against the target database. As in most designs, there may be calculations or joins applied on the way from staging to production, so I use that design to write my queries testing the data in production against the source (see the reconciliation sketch below).
So my question here is: I have tested incremental loads this way in the only project I have worked on, so can anybody give me a more detailed answer? I think I might not be answering it right.
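One hedged way to write such a reconciliation query (stg_orders, dw_orders, and load_date are made-up names, and the transform here is assumed to be a straight copy):

-- Rows in staging that never made it into the target for this load
SELECT order_id, customer_id, amount
FROM stg_orders
WHERE load_date = '2014-06-25'
EXCEPT
SELECT order_id, customer_id, amount
FROM dw_orders
WHERE load_date = '2014-06-25';

Run it in both directions: staging EXCEPT target finds missing rows, and target EXCEPT staging finds unexpected ones.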
Incremental loads are inevitable in any data warehousing environment. The following are ways to render the incremental data and test it.
1) Source and target tables should be designed so that they store the date and timestamp of each row. Based on the date/timestamp column(s) you can easily fetch the incremental data; a sketch follows this list.
2) If you use a sophisticated ETL tool like Informatica or Ab Initio, it is simple to see the status of the loads chronologically. These tools store information for every load, though by default they may keep only the last 10 loads; you need to configure them to store more.
3) If you are not using a sophisticated ETL tool, you should build ETL strategies that store load statistics and capture information (number of inserts, deletes, updates, etc.) during the load. This information can then be retrieved whenever you need it, though it takes a fair amount of technical knowledge to adopt.
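A minimal sketch of approach 1), assuming a last_update_ts column on the source and a hypothetical etl_load_log control table that records each successful load:

-- Fetch only the rows changed since the last successful load
SELECT *
FROM src_orders s
WHERE s.last_update_ts > (SELECT MAX(l.load_end_ts)
                          FROM etl_load_log l
                          WHERE l.table_name = 'src_orders'
                            AND l.status = 'SUCCESS');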
I will answer it by explaining how testing incremental data differs from testing history data.
I need to test only the incremental data, so I limit it using a date condition on my source/staging tables, and the same date condition (or the Audit ID used for that incremental load) on the target table.
Another thing we need to check while testing incremental data: in Type 2 tables, we usually have a condition like
If a record already exists in the target table and there is no change
compared to the last record in the target table, then don't insert that
record.
To take care of this condition, I need to do a history check, where I compare the last record of the target table with the first record of the incremental data; if they are exactly the same, I need to drop that record. (Here ACTIVITY_DT is a custom metadata column, so we look for changes only in EMPID, NAME, and CITY.)
For example, suppose the last record in my target table from the History load, on the columns (EMPID, NAME, CITY, ACTIVITY_DT), is 1 Aashish HYD with some earlier ACTIVITY_DT.
And these are the records I am getting in my incremental data:
1 Aashish HYD 6/25/2014
1 Aashish GOA 6/26/2014
1 Aashish BLR 6/27/2014
So in the above scenario, I compare the last record of the History data (sorted by ACTIVITY_DT DESC) with the first record of the incremental data (sorted by ACTIVITY_DT ASC). There is no change in the data columns, so I need to drop the following record, as it should not be inserted into the target table:
1 Aashish HYD 6/25/2014
So as part of this incremental load, only two records are inserted, which are the following:
1 Aashish GOA 6/26/2014
1 Aashish BLR 6/27/2014
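A hedged sketch of that history check in SQL (stg_incremental and target_emp are invented names):

-- First incremental record per EMPID that exactly matches the latest
-- target record on the data columns; these are the rows to drop
SELECT i.EMPID, i.NAME, i.CITY, i.ACTIVITY_DT
FROM stg_incremental i
JOIN (SELECT EMPID, NAME, CITY,
             ROW_NUMBER() OVER (PARTITION BY EMPID
                                ORDER BY ACTIVITY_DT DESC) AS rn
      FROM target_emp) t
  ON t.EMPID = i.EMPID
 AND t.NAME = i.NAME
 AND t.CITY = i.CITY
 AND t.rn = 1
WHERE i.ACTIVITY_DT = (SELECT MIN(ACTIVITY_DT)
                       FROM stg_incremental i2
                       WHERE i2.EMPID = i.EMPID);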

How would you maintain a history in SQL tables?

I am designing a database to store product information, and I want to keep several months of historical price data for future reference. However, after a set period I would like to start overwriting the oldest entries, with minimal effort needed to find them. Does anyone have a good idea of how to approach this problem? My initial design is a historical-data table: every day, a job pulls the active data and stores it in that table with a timestamp. Does anyone have a better idea, or see what is wrong with mine?
First, I'd like to comment on your proposed solution. The weak part, of course, is that there can actually be more than one change between your snapshots: if the record changed three times during the day, you would only archive the last change.
A better solution is possible, but it must be event-driven. If your database server supports triggers (like MS SQL Server), you should write trigger code that creates an entry in the history table. If your server does not support triggers, you can add the archiving code to your application (during the Save operation).
You could place a trigger on your price table. That way you can archive the old price into another table on each update or delete event.
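A hedged sketch of such a trigger in SQL Server syntax (the Price and PriceArchive tables and their columns are assumptions):

CREATE TRIGGER trg_price_archive
ON Price
AFTER UPDATE, DELETE
AS
BEGIN
    -- "deleted" holds the pre-change version of each affected row
    INSERT INTO PriceArchive (product_id, price, audit_date)
    SELECT d.product_id, d.price, GETDATE()
    FROM deleted d;
END;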
It's a much broader topic than it initially seems. Martin Fowler has a nice narrative about "things that change with time".
IMO your approach seems sound if the history you need is a snapshot of the data at the end of each day. In the past I have used a similar approach with overnight jobs (stored procedures) that pick up the day's new data, timestamp it, and then "delete all data that has a timestamp < today - x", where x is the period of data I want to keep; a sketch follows.
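A rough sketch of such an overnight job (all table and column names invented):

-- Snapshot today's active data with a timestamp
INSERT INTO HistoricalData (product_id, price, snapshot_ts)
SELECT product_id, price, GETDATE()
FROM Product;

-- Prune anything older than the retention window (here, 6 months)
DELETE FROM HistoricalData
WHERE snapshot_ts < DATEADD(month, -6, GETDATE());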
If you need to track all history changes, then you need to look at triggers.
I would like to, after a set period, start overwriting initial entries with minimal effort to find the initial entries
We store data in Archive tables, using a trigger, as others have suggested. Our archive table has an additional AuditDate column and stores the "deleted" data, i.e. the previous version of the row; the current data is only stored in the actual table.
We prune the Archive table with a business rule along the lines of "Delete all Archive data more than 3 months old where there exists at least one archive record younger than 3 months old; delete all archive data more than 6 months old"
So if there has been no price change in the last 3 months you would still have a price change record from the 3-6 months ago period.
(Ask if you need an example of the self-referencing join to do the delete, or of the trigger to store changes in the Archive table.)
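For what it's worth, here is a hedged sketch of that pruning rule as a self-referencing delete (SQL Server syntax; PriceArchive, product_id, and AuditDate are the assumed names):

DELETE a
FROM PriceArchive a
WHERE a.AuditDate < DATEADD(month, -6, GETDATE())    -- over 6 months old: always delete
   OR (a.AuditDate < DATEADD(month, -3, GETDATE())   -- 3-6 months old: delete only if
       AND EXISTS (SELECT 1                          -- a younger record exists
                   FROM PriceArchive b
                   WHERE b.product_id = a.product_id
                     AND b.AuditDate >= DATEADD(month, -3, GETDATE())));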