How do I create a backup for a table which will be used for a full-refresh? - dbt

I have an incremental model A where each day is calculated using the previous day's value. Running a full-refresh means that this table needs to be recalculated from the beginning of time, which is very inefficient and takes too long.
I have tried to create a backup table that takes a copy of the table's values each month, and to have model A refer to the backup table during a full-refresh, so that only the values after the backup need to be recalculated and I can arrive at today's value much more quickly. However, this gives me an error:
Encountered an error:
Found a cycle: model.model_A --> model.backup --> model.model_A
This is because the backup refers to the model to get the value each month, while model A also refers to the backup to build from in the case of a full-refresh.
Is there a way around this problem, avoiding rebuilding the entire model from the beginning of time every time I do a full-refresh?

You can't have circular references (cycles) in your build process; dbt requires the model graph to be acyclic.
If there is an application that calculates the values for each day, you could perhaps store the new values back in the same source table(s), adding an 'updated_at' column or something similar. If I understand your use case correctly, you could then use this value whenever you need to query only the last day's information.
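One workaround for the cycle itself, sketched under the assumption that the monthly backup is maintained outside the dbt DAG (e.g. by a scheduled job or a post-hook) and registered as a source in sources.yml rather than ref()'d: sources are not models, so dbt no longer sees a model_A --> backup --> model_A cycle. All names below are hypothetical.

{{ config(materialized='incremental', unique_key='calc_date') }}

-- Seed rows come from the backup via source(), not ref(),
-- so the model graph stays acyclic.
select calc_date, value
from {{ source('backups', 'model_a_backup') }}

{% if is_incremental() %}
-- Normal runs: only rebuild days after what is already in the table.
where calc_date > (select max(calc_date) from {{ this }})
{% endif %}

The actual day-by-day calculation is elided here; the point is only that a source() reference, unlike ref(), does not create an edge in the model DAG.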

Related

What is the best approach for bulk cleaning a database table that has a large amount of duplicated data loaded every day (snowflake db)

Thanks in advance for reading this; I hope I explain my problem clearly.
In one of our domains, we have multiple pipelines where data flows from S3 into a Snowflake staging table using Airflow. The data itself originates from a number of different applications, but the process is always the same: the data is extracted from the application by the support teams (multiple support teams across multiple countries, using different technologies), then lands in AWS S3, and is then bulk loaded into Snowflake. Due to limitations on the data from source, there often isn't any filter on the data itself, and effectively the staging table is loaded with the raw CSV every single day; a file date column is added to the data itself. The result is that we have tables that have been loaded with the same data every single day since 2009.
However, the data does change: from day to day a column value will change, so the file date is very useful in tracking changed attributes and is something I want to exploit. Further, if the data were cleansed, we would need only approximately 1% of it.
These tables are huge - some contain around 16 trillion rows - but they can be quite narrow.
I would like to loop through each day's worth of data and load only the new data into the staging tables, as opposed to just loading everything each day.
I have tried the following:
A query that windows over the entire set, compares the hashed value of each row (minus the file date), and only returns a row if it did not appear in the previous date's data set (sketched below). This works, but not for the larger tables, as the warehouse starts to spill to disk and it then takes hours.
A day-by-day loop that looks at each file date's data set, compares it to the previous day, and only loads the difference. This takes too long on the initial clean of the tables, but it is what I am doing once the data has been cleaned, and it will form the initial load procedure.
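For reference, the first attempt might look something like this minimal sketch, assuming FILE_DATE is stored as a DATE and a raw table RAW_STAGE whose other columns are hypothetical:

-- Hash every column except FILE_DATE, then keep a row only if the same
-- hash did not appear on the previous file date.
WITH hashed AS (
    SELECT t.*,
           HASH(col_a, col_b, col_c) AS row_hash  -- list all columns except FILE_DATE
    FROM RAW_STAGE t
)
SELECT h.*
FROM hashed h
WHERE NOT EXISTS (
    SELECT 1
    FROM hashed p
    WHERE p.row_hash = h.row_hash
      AND p.FILE_DATE = DATEADD(day, -1, h.FILE_DATE)
);

At this scale the anti-join is exactly what spills to disk, so the sketch documents the approach rather than fixing its performance.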
The current solution is one where I dynamically create multiple MINUS set statements, each looking at one day minus the day before, and then batch these into blocks of 10-20 based on the average daily row size. So, as an example:
INSERT INTO TEMP_TABLE
(SELECT * FROM TABLE_A WHERE FILE_DATE = 040123
 MINUS
 SELECT * FROM TABLE_A WHERE FILE_DATE = 030123)
UNION ALL
(SELECT * FROM TABLE_A WHERE FILE_DATE = 030123
 MINUS
 SELECT * FROM TABLE_A WHERE FILE_DATE = 020123)
etc...
This is not pretty, and though it does work, it's taking me around 12 hours to process 70-odd tables.
I would like advice on whether there is another approach.
Please bear in mind that I am limited to using Snowflake due to resourcing issues and politics.
Any guidance and ideas would be much appreciated.
Regards

Need to create Period over Period Issue Reporting in SQL Server 2016

I am responsible for creating period-over-period and trend reporting for our Team's Issue Management Department. What I need to do is copy the Issues table at month-end into a new table, IssuesHist, and add a column with the current date (example: 1/31/21). Then at the next month-end I need to take another copy of the Issues table, append it to the existing IssuesHist table, and again add the column with the current date (for example: 2/28/21).
I need to do this to be able to run comparative analysis on a period-over-period basis. The goal is to be able to identify any activity (opening new issues, closing old ones, reopening issues, etc.) that occurred over the period.
Example tables below:
Issues Table with the current data from our front-end tool
I need to copy the above into the new IssuesHist and add a date column like so
Then at the following month-end I need to do the same thing. For example, if the Issues table looked like this (changes highlighted in red),
I would need to append that to the bottom of the existing IssuesHist table with the new date, so that I could run queries comparing the periods to identify any changes.
My research has shown that a temporal table may be the best solution here, but I am unable to alter our existing database's tables to include system versioning.
Please let me know what solution would work best, and if you have any SQL statement tips.
Thank you!
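A minimal T-SQL sketch of the copy-and-stamp step (IssuesHist is from the question; the SnapshotDate column name is an assumption):

-- One-time setup: clone the structure of Issues plus a snapshot-date column.
SELECT CAST(NULL AS date) AS SnapshotDate, i.*
INTO IssuesHist
FROM Issues AS i
WHERE 1 = 0;  -- copies the structure only, no rows

-- Run at each month-end: append a dated copy of the current Issues table.
INSERT INTO IssuesHist
SELECT EOMONTH(GETDATE()) AS SnapshotDate, i.*
FROM Issues AS i;

Period-over-period comparisons then become self-joins of IssuesHist on the issue key across two SnapshotDate values.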

Double or triple timestamp issue

I am using SQL Assistant, and my data brings in snapshots from a huge database in the form of timestamps. Occasionally the snapshots bring in multiples per hour. The data is correct; multiple snapshots do happen from time to time within an hour - not always, but it does happen.
I am bringing this into Spotfire and viewing by hour, and when more than one snapshot happens in the hour, the data shows as doubled.
I only want to display one per hour, preferably the last (max) timestamp for the hour. For example: for the 7 am hour, the data has a snapshot for 7:10 am and one for 7:55 am.
These are correct, but I only want to display the last (max) timestamp - 7:55 am in this case. I can't figure the issue out in Spotfire, so I am leaning towards a fix in SQL. How can I display only one row for each hour?
You'd do this similarly to how you'd probably do it in SQL -- using a ranking/rownumber function.
The basic way Rank in Spotfire works is Rank(Order columns, order direction, partitioned columns, tie method)
You need to partition by the combination of Date and Hour, and then sort descending by your timestamp column.
So the code to identify the rows that you want to isolate should be something along the lines of:
Rank([TimestampColumn], "desc", Date([TimestampColumn]), Hour([TimestampColumn]), "ties.method=first")
What you do with it from here will depend on how you plan to use the data. For example, you can Limit Data Using Expression and set the expression above = 1, which will limit your table accordingly (helpful if you don't want your users to accidentally forget to filter), or you can create a calculated column which turns it into a flag of some sort, like here:
If(Rank([TimestampColumn], "desc", Date([TimestampColumn]), Hour([TimestampColumn]), "ties.method=first") = 1, "Latest", "Duplicate")
This allows your users to filter by this property; that way, they still have the option to look at the extra rows.
Ultimately, though, if you want to only ever see these rows, and have no use for the earlier records, I'd probably do it in SQL, if you have that ability. This reduces the number of rows you have to load into your analytic.
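If you do push it down to SQL, a minimal sketch (generic dialect; the snapshots table and snapshot_ts column are hypothetical names) would be:

-- Keep only the latest snapshot within each calendar hour.
SELECT *
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
               PARTITION BY CAST(snapshot_ts AS DATE),
                            EXTRACT(HOUR FROM snapshot_ts)
               ORDER BY snapshot_ts DESC
           ) AS rn
    FROM snapshots s
) t
WHERE rn = 1;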

How to handle reoccurring calendar events and tasks (SQL Server tables & C#)

I need to schedule events, tasks, appointments, etc. in my DB. Some of them will be one-time appointments, and some will be recurring to-dos which must be checked off. After looking at Google's calendar layout and others, plus doing a lot of reading, here is what I have so far.
Calendar table (could be called a schedule table, I guess): basic event title, start/end, recurrence info.
Calendar occurrence table: ties to the schedule table, occurrence-specific text, next occurrence date/time?
Looked here at how SQL Server does its jobs: http://technet.microsoft.com/en-us/library/ms178644.aspx
but this is slightly different.
Why two tables: I need to track status of each instance of the reoccurring task. Otherwise this would be much simpler...
so... on to the questions:
1) Does this seem like the proper way to go about it? Is there a better way to handle the multiple occurrence issue?
2) How often / how should I trigger creation of the occurrences? I really don't want to create a bunch of occurrences... BUT... What if the user wants to view next year's calendar...
It makes sense to have your schedule definition for a task in one table and then a separate table to record each instance of it - that's the approach I've taken in the past.
With regard to creating the occurrences, there's probably no need to create them all up front, especially when you consider tasks that repeat indefinitely! Again, the approach I've used in the past is to only create the next occurrence. When that instance is actioned, the next instance is then calculated and created.
This leaves the issue of viewing future occurrences. For this, you can start off with the initial/next scheduled occurrence and just calculate the future occurrences on the fly at display time.
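A minimal sketch of that two-table shape (T-SQL; all names and columns are illustrative assumptions):

CREATE TABLE Schedule (
    ScheduleId     int IDENTITY PRIMARY KEY,
    Title          nvarchar(200) NOT NULL,
    StartAt        datetime2 NOT NULL,
    EndAt          datetime2 NULL,
    RecurrenceRule nvarchar(100) NULL  -- e.g. an iCal-style RRULE string
);

CREATE TABLE Occurrence (
    OccurrenceId int IDENTITY PRIMARY KEY,
    ScheduleId   int NOT NULL REFERENCES Schedule(ScheduleId),
    DueAt        datetime2 NOT NULL,
    Status       varchar(20) NOT NULL DEFAULT 'Pending'  -- Pending/Done/Skipped
);

-- Only the next pending occurrence per schedule is materialized; when it is
-- actioned, the following one is computed from RecurrenceRule and inserted.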
While this isn't an exact answer to your question, I've solved this problem before in SQL Server (though the database here is irrelevant) by modeling a solution based on Unix's cron.
Instead of string parsing, we used integer columns in a table to store the various time units.
We had events which could be scheduled; they could either point to a one-time schedule table that represented a distinct point in time (a date/time) or to the recurring schedule table, which is modeled after cron.
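Such a cron-style table might look like the following sketch (column names are assumptions; NULL plays the role of cron's '*', meaning "any"):

CREATE TABLE RecurringSchedule (
    RecurringScheduleId int IDENTITY PRIMARY KEY,
    MinuteOfHour tinyint NULL,  -- 0-59, NULL = every minute
    HourOfDay    tinyint NULL,  -- 0-23
    DayOfMonth   tinyint NULL,  -- 1-31
    MonthOfYear  tinyint NULL,  -- 1-12
    DayOfWeek    tinyint NULL   -- 1-7, NULL = any day
);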
Additionally, remember to model your solution carefully. An event has a duration, but the duration is unrelated to the schedule (though an event's duration may impact the schedule by causing conflicts). Do not try to model duration as part of your schedule.
In the past when we've done this, we had 2 tables:
1) Schedules -> Includes recurrence information
2) Exceptions -> Edit/changes to specific instances
Using SQL, it's possible to get the list of "Schedules" that have at least one instance in a given date range. Then you can expand in the GUI where each instance lies.
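For example, a hedged sketch of that date-range lookup, assuming the Schedules table carries a StartAt column and an optional UntilAt end bound (both assumptions):

-- Schedules whose active window overlaps the reporting range;
-- each match is then expanded into concrete instances in the GUI.
SELECT s.*
FROM Schedules s
WHERE s.StartAt <= @RangeEnd
  AND (s.UntilAt IS NULL OR s.UntilAt >= @RangeStart);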

What do I gain by adding a timestamp column called recordversion to a table in ms-sql?

What do I gain by adding a timestamp column called recordversion to a table in ms-sql?
You can use that column to make sure your users don't overwrite data from another user.
Let's say user A pulls up record 1 and, at the same time, user B pulls up record 1. User A edits the record and saves it. Five minutes later, user B edits the record - but doesn't know about user A's changes. When he saves his changes, you use the recordversion column in your UPDATE's WHERE clause, which will prevent user B from overwriting what user A did. You can detect this condition and throw some kind of data-out-of-date error.
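A minimal T-SQL sketch of that guarded update (table and column names are assumptions; @OriginalVersion holds the rowversion value read when the record was loaded):

UPDATE dbo.Records
SET    SomeColumn = @NewValue
WHERE  RecordId = @RecordId
  AND  RecordVersion = @OriginalVersion;  -- no match if someone else saved first

IF @@ROWCOUNT = 0
    RAISERROR('Record was modified by another user.', 16, 1);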
Nothing that I'm aware of, or that Google seems to find quickly.
You don't get anything inherent by using that name for a column. Sure, you can create a column and do record versioning as described in another response, but there's nothing special about the column name. You could call the column anything you want and do versioning, and you could call any column RecordVersion and nothing special would happen.
Timestamp (now called rowversion) is mainly used for replication. I have also used it successfully to determine whether the data had been updated since the last feed to the client (when I needed to send a delta feed), and thus to pick out only the records which had changed since then. This does require having another table that stores the value of the timestamp (in a varbinary field) at the time you run the report, so you can use it for comparison on the next run.
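A sketch of that delta pick-up, assuming a one-row watermark table that stores the highest timestamp value captured on the previous run (names are hypothetical):

-- rowversion/timestamp values are varbinary(8) and increase monotonically,
-- so a simple comparison finds everything changed since the last feed.
SELECT t.*
FROM dbo.SourceTable t
WHERE t.RowVer > (SELECT LastRowVer FROM dbo.FeedWatermark);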
If you think that timestamp records the date or time of the last update, it does not; you would need datetime fields with default constraints (to capture the original datetime) and triggers (to maintain the updated datetime) to store that information.
Also, keep in mind if you want to keep track of your data, it's a good idea to add these four columns to every table:
CreatedBy(varchar) | CreatedOn(date) | ModifiedBy(varchar) | ModifiedOn(date)
While it doesn't give you full history, it lets you know who created an entry and when, and who last modified it and when. Those four columns provide pretty powerful tracking without any serious overhead to your DB.
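As a hypothetical example of adding those columns with sensible defaults:

ALTER TABLE dbo.Orders ADD
    CreatedBy  varchar(50) NOT NULL CONSTRAINT DF_Orders_CreatedBy DEFAULT SUSER_SNAME(),
    CreatedOn  date        NOT NULL CONSTRAINT DF_Orders_CreatedOn DEFAULT GETDATE(),
    ModifiedBy varchar(50) NULL,  -- set by the application (or a trigger) on UPDATE
    ModifiedOn date        NULL;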
Obviously, you could create a full-blown logging system that tracks every change and gives you complete history, but that's not the solution for the issue I think you are describing.