Can I load historical data in Upsolver/SQLake at a later time after current data is already ingested - sqlake

Ideally, historical data would be loaded first and then current data, but we already have current data loaded into Snowflake from Kafka via Upsert outputs. We have to ingest historical data later as well, and it will come from a different source, let's say S3 dumps. Can we accomplish this?

You definitely need to create a new data source for the historical data; that is how the historical data gets ingested into Upsolver.
Design considerations for the next step:
If the output were append-only (just keep inserting), your Snowflake output could read from a UNION of the historical data source and the current data source. You can either add multiple data sources while creating the output from the UI, or edit the output SQL to UNION both data sources; both historical and current data will then land in the target table. This design would also handle the Upsert use case if the historical data came first and was fully ingested before the current data source started receiving data.
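For illustration, the append-only variant would look roughly like this in the output SQL (generic SQL rather than exact SQLake syntax; source and column names are made up):
-- Append rows from both sources into the same output.
SELECT record_key, col_a, col_b FROM current_data_source
UNION ALL
SELECT record_key, col_a, col_b FROM historical_data_source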
However, in this specific case, since the historical data arrives later, we can't use this approach: historical records arriving in the future could Upsert over and overwrite the current (latest) data.
Solution 1: if you can stop the current data source until the historical data is completely processed:
1) Stop the current data source.
2) Create a lookup table on the current data source, keyed on the record key.
3) Create a historical output that joins the historical data source with the lookup, filters to only those historical records whose keys do not exist in the lookup, and Upserts them into the target (see the sketch after these steps).
4) Once the historical data is processed completely, stop the historical data source, the lookup, and the historical output.
5) Restart the current data source.
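A rough sketch of step 3, again in generic SQL rather than exact SQLake syntax (table, lookup, and column names are assumptions):
-- Load only the historical records whose key is not already present in the
-- lookup built from the current data source; for these new keys a plain
-- insert behaves the same as an Upsert.
INSERT INTO target_table (record_key, col_a, col_b)
SELECT h.record_key, h.col_a, h.col_b
FROM historical_source h
LEFT JOIN current_keys_lookup l
  ON l.record_key = h.record_key
WHERE l.record_key IS NULL;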
Solution 2: if the current data volume is small and can be reprocessed from the beginning:
1) Stop the current Snowflake output job.
2) Truncate the Snowflake target table.
3) Load (Upsert) the historical data into the target.
4) Once the historical data is processed completely, stop the historical data source and output.
5) Replay the current data Snowflake output job (stopped in step 1) from the beginning.
Solution 3:
1) Load the historical data into a separate history table.
2) Use Snowflake compute to identify what should be applied to the main table from the history table, then discard the history table. (This needs Snowflake joins to identify the latest record keys in the history table that aren't present in the current table and insert them into the current table; a sketch follows.)
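A sketch of that Snowflake-side step (table and column names are placeholders):
-- Take the latest history record per key and insert it only if the key
-- never made it into the main table.
INSERT INTO main_table (record_key, col_a, col_b, updated_at)
SELECT record_key, col_a, col_b, updated_at
FROM (
    SELECT h.*,
           ROW_NUMBER() OVER (PARTITION BY h.record_key ORDER BY h.updated_at DESC) AS rn
    FROM history_table h
) latest
WHERE rn = 1
  AND NOT EXISTS (SELECT 1 FROM main_table m WHERE m.record_key = latest.record_key);
-- The history table can be dropped afterwards:
DROP TABLE history_table;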
Hope this helps.

Related

BigQuery data transfer duration from intraday table to daily table

I am using Firebase Analytics and BigQuery with an average of 50-60 GB of daily data.
For the most recent daily table, a query gives a different result from yesterday's even if the query conditions are exactly the same, including the target date.
I just found that there is a 1-2 day gap between the table creation date and the last modified date.
I assume the difference between the query results is because of this (calculating on a different data volume, maybe).
Does this date gap mean a single daily table needs at least 2 days to be fully loaded from the intraday table?
Thanks in advance.
[Screenshot: BigQuery table info]
In the documentation we can find the following information:
After you link a project to BigQuery, the first daily export of events creates a corresponding dataset in the associated BigQuery project. Then, each day, raw event data for each linked app populates a new daily table in the associated dataset, and raw event data is streamed into a separate intraday BigQuery table in real-time.
It seems that the intraday table is loaded into the main table each day, and if you want to access this data in real time you'll have to use the separate intraday table.
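For example, assuming the current Firebase export naming and schema (the project, dataset, date, and event_name column below are all placeholders), real-time data can be read straight from the intraday table:
-- Placeholder names; in current Firebase exports the real-time rows live in
-- events_intraday_YYYYMMDD next to the finalized events_YYYYMMDD daily tables.
SELECT event_name, COUNT(*) AS event_count
FROM `my_project.analytics_123456789.events_intraday_20240101`
GROUP BY event_name;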
If this information doesn't help you, please provide some extra information so I can help you more efficiently.

Populating fact table with different sequence time

I am using the following query to populate my fact table:
SELECT sh.isbn_l, sh.id_c, sh.id_s, sh.data, sh.quantity, b.price
FROM Book AS b
INNER JOIN Sales AS sh
  ON b.isbn = sh.isbn_l
The main thing is that I want to load the table from a specific time to a specific time. So if I load today, I will get all the records from the last time I loaded up to today.
And if I load it the day after tomorrow, I will get the data from today after the load time up to the day after tomorrow.
What I mean is NO DUPLICATED ROWS or DATA. What should I do?
Any ideas please?
Thank you in advance
Streams (and maybe Tasks) are your friend here.
A Snowflake Stream records the delta of change data capture (CDC) information for a table (such as a staging table), including inserts and other DML changes. A stream allows querying and consuming a set of changes to a table, at the row level, between two transactional points of time.
In a continuous data pipeline, table streams record when staging tables and any downstream tables are populated with data from business applications using continuous data loading and are ready for further processing using SQL statements.
Snowflake Tasks may optionally use table streams to provide a convenient way to continuously process new or changed data. A task can transform new or changed rows that a stream surfaces. Each time a task is scheduled to run, it can verify whether a stream contains change data for a table (using SYSTEM$STREAM_HAS_DATA) and either consume the change data or skip the current run if no change data exists.
Users can define a simple tree-like structure of tasks that executes consecutive SQL statements to process data and move it to various destination tables.
https://docs.snowflake.com/en/user-guide/data-pipelines-intro.html
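A hedged sketch of how that could look for the Book/Sales example above (the stream name, warehouse, schedule, and fact table are assumptions):
-- Record the changes (CDC) landing in the Sales staging table.
CREATE OR REPLACE STREAM sales_stream ON TABLE Sales;

-- A task that wakes up every 5 minutes but only runs when the stream has new rows.
CREATE OR REPLACE TASK load_fact_sales
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('SALES_STREAM')
AS
INSERT INTO fact_sales (isbn_l, id_c, id_s, data, quantity, price)
SELECT sh.isbn_l, sh.id_c, sh.id_s, sh.data, sh.quantity, b.price
FROM sales_stream sh
JOIN Book b ON b.isbn = sh.isbn_l;

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK load_fact_sales RESUME;
Consuming the stream in the INSERT advances its offset, so each run only picks up rows added since the previous run, which is what avoids the duplicate rows the question worries about.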

How to effectively save & restore data from the last three months and delete the old data?

I am using PostgreSQL. I need to delete all transaction data from the database (except the last three months of transaction data), then restore the data into a new database with the created/updated timestamps set to the current timestamp. Also, the data older than the last three months must be recapped into a single record (for example, all invoices from party A must be grouped into one invoice for party A). Another rule: if the data is still referenced by foreign keys from the last three months of data, it must not be deleted, only have its created/updated timestamp changed to the current timestamp.
I am not good at SQL queries, so for now I am using this strategy:
First create the recap data (saved in a temporary table) before deleting anything.
Then delete all data except the last three months.
Next create the recap data after the delete.
Create the recap data from (all data - data remaining after the delete), so I get recap data with totals exactly matching the data from before the last three months.
Then insert the recap data into the table, so the old data is cleaned up and the recap data is in the database.
So my strategy uses only the same database and does not create a new database, because importing the data using the program is very slow (there are 900+ tables).
But the client doesn't want to use this strategy; he wants the data created in a new database and told me to find another way. So the question is: what is the real, correct procedure to clean a database from some dates (filtered by date) and recap the old data?
First of all, there is no way to find out when a row was added to a table unless you track it with a timestamp column.
That's the first change you'll have to make – add a timestamp column to all relevant tables that tracks when the row was created (or updated, depending on the requirement).
Then you have two choices:
Partition the tables by the timestamp column so that you have (for example) one partition per month.
Advantage: it is easy to get rid of old data: just drop the partition.
Disadvantage: Partitioning is tricky in PostgreSQL. It will become somewhat easier to handle in PostgreSQL v10, but the underlying problems remain.
Use mass DELETEs to get rid of old rows. That's easy to implement, but mass deletes really hurt (table and index bloat which might necessitate VACUUM (FULL) or REINDEX which impair availability).
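A minimal sketch of both options, assuming a transactions table with a created_at column (all names and dates below are made up):
-- Option 1: range partitioning by month (PostgreSQL 10+ declarative syntax).
CREATE TABLE transactions (
    id         bigint,
    party_id   bigint,
    amount     numeric,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE transactions_2017_06 PARTITION OF transactions
    FOR VALUES FROM ('2017-06-01') TO ('2017-07-01');

-- Getting rid of an old month is then just dropping its partition:
DROP TABLE transactions_2017_06;

-- Option 2: mass DELETE (simple, but causes table and index bloat):
DELETE FROM transactions
WHERE created_at < now() - interval '3 months';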

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to Cloud Storage (4+ GB of textual data per file). As far as I know, there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source. But don't know how this will work when we have real time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting them in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import into a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all, fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the __DATASET__ pseudo-table to get the modified time of the table, and then add it as a column to your table, either in a separate query or in a JOIN. Here is how you'd use the __DATASET__ pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (more info here). This lets you insert single rows or groups of rows into a table just by posting them directly to BigQuery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).
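For example (legacy BigQuery SQL, with a placeholder table name; 604800000 ms = 7 days), a relative range decorator restricts a query to rows added in the last week:
-- Count rows added to the table in the last 7 days.
SELECT COUNT(*) AS recent_rows
FROM [mydataset.mytable@-604800000-]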

How to test incremental data in ETL

I have been asked the same question again and again at many interviews. The question is: how would you test incremental data which gets loaded every day into the database? My position is Data Warehouse QA plus BA. The main purpose of testing is to check that we have all the data from the source, and then to test that all the data copied from the source got placed in the respective tables as designed by the developers.
So every time somebody asks this question I answer like this: to test incremental data, we take data from the staging tables, which will have the data for the daily incremental file. So now I can compare the staging table against the target database. As with all databases, there might be some calculations or joins we did according to the design to get data from staging to production, so I will use that design to build my queries to test the data in production against the source.
So my question here is: I have tested incremental loads this way in the only project I worked on, so can anybody give me a more detailed answer? I think I might not be answering it right.
Incremental loads are inevitable in any data warehousing environment. Following are the ways to identify the incremental data and test it.
1) Source and target tables should be designed in such a way that you store the date and timestamp of each row. Based on the date and timestamp column(s) you can easily fetch the incremental data (see the sketch after this list).
2) If you use sophisticated ETL tools like Informatica or Ab Initio, it is simple to see the status of the loads chronologically. These tools store information for every load, but by default only keep the last 10 loads; you need to configure them to store more than 10.
3) If you are not using sophisticated ETL tools, you should build an ETL strategy to store the statistics of each load and capture information (such as the number of inserts, deletes, updates, etc.) during the load. This information can be retrieved whenever you need it, but adopting it takes a lot of technical knowledge.
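For example (illustrative table and column names only), the incremental slice can be pulled by comparing load timestamps between staging and target:
-- Rows that arrived in staging since the last successful load into the target.
SELECT *
FROM staging_orders s
WHERE s.load_ts > (SELECT MAX(t.load_ts) FROM dw_orders t);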
If you want to succeed in a data warehouse interview, I would suggest the iOS application data-iq, created by a US-based company for candidates like you. Check it out and you may like it. Good luck with your interview.
I will answer it by explaining how testing incremental data is different from testing History data.
I need to test only the incremental data, so I limit it by using a date condition on my source/staging tables, and the same date condition or the Audit ID used for that incremental load on the target table.
Another thing that we need to check while testing incremental data: usually in Type 2 tables, we have a condition like
If a record already exists in the target table and there is no change compared to the last record in the target table, then don't insert that record.
So to take care of such a condition, I need to do a history check where I compare the last record of the target table with the first record of the incremental data, and if they are exactly the same then I need to drop that record. (Here ACTIVITY_DT is a custom metadata column, so we look for changes only in EMPID, NAME, CITY.)
For example, the following are the records in my target table as part of the History load:
And these are the records which I am getting in my incremental data:
So in the above scenario, I compare the last record of the History data (sorted by ACTIVITY_DT DESC) with the first record of the Incremental data (sorted by ACTIVITY_DT ASC). There is no change in the data columns, so I need to drop the following record, as it should not be inserted into the target table:
EMPID  NAME     CITY  ACTIVITY_DT
1      Aashish  HYD   6/25/2014
So as part of this incremental load, only two records are inserted, which are the following:
EMPID  NAME     CITY  ACTIVITY_DT
1      Aashish  GOA   6/26/2014
1      Aashish  BLR   6/27/2014
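A query along these lines can automate that history check (table and column names are illustrative, following the example above):
-- Incremental rows whose data columns exactly repeat the latest target row
-- for the same key; these are the rows to drop before loading.
SELECT i.EMPID, i.NAME, i.CITY, i.ACTIVITY_DT
FROM incr_stage i
JOIN target t
  ON t.EMPID = i.EMPID
 AND t.NAME  = i.NAME
 AND t.CITY  = i.CITY
WHERE t.ACTIVITY_DT = (SELECT MAX(t2.ACTIVITY_DT) FROM target t2 WHERE t2.EMPID = i.EMPID)
  AND i.ACTIVITY_DT = (SELECT MIN(i2.ACTIVITY_DT) FROM incr_stage i2 WHERE i2.EMPID = i.EMPID);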