I have historical data that looks like the following:
I need a DB VIEW showing how each of these Opportunities looked at the end of each month, e.g.
What is the best way to implement this as a Snowflake VIEW?
Well, there are two parts: the SQL that goes in the view, and the kind of view to use.
The SQL needs to pick the "last value of the month" per opportunity where deleted is not true, which can be done a couple of ways.
Off the top of my head, "most recent value of the month" sounds like a ROW_NUMBER filter, which in Snowflake goes in a QUALIFY clause (window functions are not allowed in WHERE):
SELECT last_modified_date,
       opportunity,
       amount
FROM table
WHERE deleted = false
QUALIFY ROW_NUMBER() OVER (PARTITION BY opportunity, date_trunc('month', last_modified_date)
                           ORDER BY last_modified_date DESC) = 1
This keeps the rows you want; then we need to shift last_modified_date to the last day of its month, which can be done via
SELECT dateadd('day', -1 ,dateadd('month', 1, date_trunc('month', last_modified_date))) as month_end
which can be combined as:
SELECT dateadd('day', -1, dateadd('month', 1, date_trunced)) AS month_end,
       opportunity,
       amount
FROM (
    SELECT
        date_trunc('month', last_modified_date) AS date_trunced,
        opportunity,
        amount,
        ROW_NUMBER() OVER (PARTITION BY opportunity, date_trunc('month', last_modified_date)
                           ORDER BY last_modified_date DESC) AS rn
    FROM table
    WHERE deleted = false
)
WHERE rn = 1
ORDER BY 1, 2;
Then, if you have a few million rows, you could just create a normal view. BUT if you have billions of rows and years of data, the prior months get churned over again and again, so a materialized view would be helpful; however, I am fairly sure you cannot use ROW_NUMBER in a materialized view. In that case you could manually build a "historic months" table and a "current months" view, union those together, and on some period (say, monthly) roll the current period's data into history as it becomes stable.
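A minimal sketch of that split, assuming hypothetical object names (opportunity_month_end_hist, opportunity_month_end_curr, and a final opportunity_month_end view) and keeping table as the placeholder source from the query above:

-- Stable, already-closed months live in a plain table (hypothetical name).
CREATE TABLE IF NOT EXISTS opportunity_month_end_hist (
    month_end   date,
    opportunity string,
    amount      number
);

-- Months not yet rolled into history are computed on the fly.
CREATE OR REPLACE VIEW opportunity_month_end_curr AS
SELECT dateadd('day', -1, dateadd('month', 1, date_trunc('month', last_modified_date))) AS month_end,
       opportunity,
       amount
FROM table
WHERE deleted = false
  AND last_modified_date >= (SELECT coalesce(max(month_end) + 1, '1900-01-01'::date)
                             FROM opportunity_month_end_hist)
QUALIFY ROW_NUMBER() OVER (PARTITION BY opportunity, date_trunc('month', last_modified_date)
                           ORDER BY last_modified_date DESC) = 1;

-- What consumers query: history plus the still-changing months.
CREATE OR REPLACE VIEW opportunity_month_end AS
SELECT month_end, opportunity, amount FROM opportunity_month_end_hist
UNION ALL
SELECT month_end, opportunity, amount FROM opportunity_month_end_curr;

-- Run on a schedule (e.g. just after month end) to roll closed months into history:
INSERT INTO opportunity_month_end_hist
SELECT month_end, opportunity, amount
FROM opportunity_month_end_curr
WHERE month_end < date_trunc('month', current_date);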
Related
I have a huge table.
I want to create a third column based on the time difference between two dates for the same id. If the difference is less than a month it's active, if it is between 1 and 2 months it's inactive, and anything more than 2 months is dormant. The expected outcome is below (note: the last entries don't have activity definitions, as there are no previous occurrences for them).
My question is how to do such an operation.
case when date_ >= date_add((select max(date_) from schema.table), -30) then 'Active'
     when date_ < date_add((select max(date_) from schema.table), -30)
      and date_ >= date_add((select max(date_) from schema.table), -60) then 'Inactive'
     when date_ < date_add((select max(date_) from schema.table), -60) then 'Dormant'
end as Activity
The code I came up with is not what I need, as it only checks against the final entry date in the table. What I need is more akin to a for loop: checking each row and comparing it to the previous occurrence.
edit:
By partitioning over id and dense ranking them, I reached something that almost works. I just need to compare to the previous element in the dense rank groups.
Create the base data first using LAG() to get the previous date per id.
Then compare that with the original row.
SELECT ID, DATE,
       CASE
           WHEN DATEDIFF(DATE, PREVIOUS_DATE) <= 30 THEN 'Active'
           WHEN DATEDIFF(DATE, PREVIOUS_DATE) BETWEEN 31 AND 60 THEN 'Inactive'
           ELSE 'Dormant'
       END AS Activity
FROM (SELECT ID, DATE,
             LAG(DATE) OVER (PARTITION BY ID ORDER BY DATE) AS PREVIOUS_DATE
      FROM MYTABLE) RS
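One caveat, going by the note that the earliest entries should have no activity definition: with the CASE above, a row whose PREVIOUS_DATE is NULL (no previous occurrence for that id) falls into the ELSE branch and gets labelled 'Dormant'. A small guard, sketched against the same subquery, keeps such rows unclassified instead:

CASE
    WHEN PREVIOUS_DATE IS NULL THEN NULL  -- first occurrence per id: no classification
    WHEN DATEDIFF(DATE, PREVIOUS_DATE) <= 30 THEN 'Active'
    WHEN DATEDIFF(DATE, PREVIOUS_DATE) BETWEEN 31 AND 60 THEN 'Inactive'
    ELSE 'Dormant'
END AS Activity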
I am using Postgres, and I recently noticed that the code I am using makes too many round trips.
What I am doing is basically reading data from a table day by day, because I have to look for changes each day, but the whole function that does this job is called once a month.
An example of my table
Amount
Id | Itemid | Amount | Date
1 | 2 | 50 | 20-5-20
Now this table can have items added at any point in time, and I have to see the total amount, that is SUM(Amount), every day.
But here's the catch: I have to add interest to the amount for each day at a rate of 5%.
So I can't just call the function once, I have to look at its value every day.
For example, if I add an item of 50$ on the 1st of May, then the interest on that day is 5/100 * 50.
If I add another item worth 50$ on the 5th of May, then the interest on the 5th is 5/100 * 100, because both items count.
But prior to the 5th, the interest was on only 50$, so if I simply use SUM(Amount) * 5/100 for every day it is wrong.
Also, another issue is that the dates are stored as timestamps, and I need to group by the date part of the timestamp; grouping by the raw timestamp would create multiple rows for the same date, which I want to avoid when taking the sum.
So if there are two entries on the same date but at different hours, the query should ideally sum them up under one single date.
Example
Amount Table
Date | Amount
2020-5-5 20:8:8 | 100
2020-5-5 7:8:8 | 100
Result should be
Amount Table
Date | Amount
2020-5-5 | 200
My current code.
for current_date in dates_in_month:  # one query per day -> ~30 round trips
    amount = amount + session.query(func.sum(Amount.Amount)).filter(Amount.date < current_date).scalar() * 5 / 100
I want a query that gets all these values according to dates, for example
date | Sum of amount till that date
20-5-20 | 50
20-6-20 | 100
Any ideas about what I should do to avoid a loop that runs 30 times, given the function is called only once a month?
I am supposed to get all this data in a table, day-wise, aggregated as the sum of the amount for each day.
That is a simple "running total"
select "date",
sum(amount) over (order by "date") as amount_til_date
from the_table
order by "date";
If you need the amount per itemid
select "date",
sum(amount) over (partition by itemid order by "date") as amount_til_date
from the_table
order by "date";
If you also need to calculate the "compound interest rate" up to that day, you can do that as well:
select itemid,
       "date",
       sum(amount) over (partition by itemid order by "date") as amount_til_date,
       sum(amount) over (partition by itemid order by "date") * power(1.05, count(*) over (partition by itemid order by "date")) as compound_interest
from the_table
order by "date";
To get that for a specific month, add a WHERE clause:
where "date" >= date '2020-06-01'
and "date" < date '2020-07-01'
In general, to avoid round trips between application and database, application code must be moved from the application into the database as stored code (stored procedures and stored functions) written in a procedural language. This approach is sometimes called "thick database" in commercial databases like Oracle Database.
PostgreSQL's default procedural language is PL/pgSQL, but you can also use Java, Perl, Python, or JavaScript through PostgreSQL extensions that you would need to install in PostgreSQL.
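For this case, a minimal sketch of that idea is a plain SQL function (hypothetical name monthly_running_total, reusing the query and WHERE clause above) so the monthly job becomes a single call:

create or replace function monthly_running_total(p_month date)
returns table (day date, amount_til_date numeric)
language sql stable
as $$
    select "date"::date as day,
           sum(sum(amount)) over (order by "date"::date) as amount_til_date
    from the_table
    where "date" >= date_trunc('month', p_month)
      and "date" <  date_trunc('month', p_month) + interval '1 month'
    group by "date"::date
    order by 1;
$$;

-- one round trip for the whole month:
-- select * from monthly_running_total(date '2020-06-01');

Because the WHERE clause restricts the data first, the running total restarts at the month boundary, mirroring the month filter above; the application then fetches the whole month with one query instead of ~30.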
I have a customer who uses different devices over a specific period of time, tracked with a valid_from and valid_to date. But every time something changes for a device, a new row is written without any visible change in the row-based data besides a new valid_from/valid_to.
What I'm trying to do is aggregate the first two rows into one, and the same for rows 3 and 4, while leaving 5 and 6 as they are. All the solutions I've come up with so far only work for a usage history where the user does not switch back to device a; everything keeps failing.
I'd really appreciate some help, thanks in advance!
If you know that the previous valid_to is the same as the current valid_from, then you can use lag() to identify where a new grouping starts. Then use a cumulative sum to calculate the grouping and finally aggregation:
select cust, act_dev, min(valid_from), max(valid_to)
from (select t.*,
sum(case when prev_valid_to = valid_from then 0 else 1 end) over (partition by cust order by valid_from) as grouping
from (select t.*,
lag(valid_to) over (partition by cust, act_dev order by valid_from) as prev_valid_to
from t
) t
) t
group by cust, act_dev, grouping;
Here is a db<>fiddle.
So this problem has been bugging me a little for the last week or so. I'm working with a database which hasn't exactly been designed in a way that I like and I'm having to do a lot of work-arounds to get the queries to function in a way I would like.
Essentially, I'm trying to remove duplicate entries that occur as a result of an instance caused by a previous entry. For the sake of argument, say a customer places an order or issues a job (this only occurs once), but as a result of the interactions a series of other rows are created to represent sub-orders or sub-jobs. All duplicate records share the same finish time, so what I'm trying to create is a query that returns the record with the earliest start time and ignores all other records with the same finish time. All of this occurs within the same table.
Something like:
select starttime
, endtime
, description
, entrynumber
from table
where starttime = min
and endtime = endtime
Probably what you want is something like this:
;WITH OrderedTable AS
(
Select ROW_NUMBER() OVER (PARTITION BY endtime ORDER BY starttime) as rn, starttime, endtime, description, entrynumber
From Table
)
Select starttime, endtime, description, entrynumber
FROM OrderedTable
WHERE rn=1
What this does is group all the rows with the same end time, order them within each group by start time, and give each an additional "row number" column starting at 1 and increasing. If you filter on rn = 1, you get only the earliest-start-time row per end time and ignore the rest.
I have a view in SQL Server with prices of items over time. My users will be passing a date variable and I want to return the closest record without going over, or if no such record exists return the oldest record present. For example, with the data below, if the user passes April for item A it will return the March record and for item B it will return the June record.
I've tried a lot of variations with Union All and Order by but keep getting a variety of errors. Is there a way to write this using a Case Statement?
example:
case when min(Month)>Input Date then min(Month)
else max(Month) where Month <= Input Date?
Sincere apologies for attaching sample dataset as an image, I couldn't get it to format right otherwise.
Sample Dataset
You can use SELECT TOP (1) with ORDER BY date DESC plus an item and date comparison to get the latest record: ORDER BY sorts the records by date, so you get the latest record either in that month (if one exists) or from earlier months.
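A minimal sketch of that idea for a single item, assuming hypothetical names (a PRICES view with ITEM, PRICE, ACTUAL_DATE, and user-supplied @ITEM / @INPUT_DATE): rows at or before the input date sort first, and within each group the row closest to the input date wins, which falls back to the oldest row when nothing precedes the input date.

SELECT TOP (1) ITEM, PRICE, ACTUAL_DATE
FROM PRICES
WHERE ITEM = @ITEM
ORDER BY CASE WHEN ACTUAL_DATE <= @INPUT_DATE THEN 0 ELSE 1 END,  -- prefer records not past the input date
         ABS(DATEDIFF(DAY, ACTUAL_DATE, @INPUT_DATE));            -- then the record closest to it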
Here's a rough outline of a query (without more of your table it's hard to be exact):
WITH CTE AS
(
    SELECT
        ITEM,
        PRICE,
        ACTUAL_DATE,
        -- oldest record per item (the fallback)
        MIN(ACTUAL_DATE) OVER (PARTITION BY ITEM) AS MIN_DATE,
        -- latest record per item that does not go over the user-supplied @INPUT_DATE
        MAX(CASE WHEN ACTUAL_DATE <= @INPUT_DATE THEN ACTUAL_DATE END)
            OVER (PARTITION BY ITEM) AS MATCHED_DATE
    FROM TABLE
)
SELECT
    CTE.ITEM,
    CTE.PRICE,
    CTE.ACTUAL_DATE AS MOSTLY_MATCHED_DATE
FROM CTE
WHERE CTE.ACTUAL_DATE =
    CASE
        WHEN CTE.MATCHED_DATE IS NOT NULL THEN CTE.MATCHED_DATE
        ELSE CTE.MIN_DATE
    END
The idea is that, in a common table expression, you use PARTITION BY window aggregates to identify the key date for each item, record by record, and then test against it to pull either your matched record or your default (oldest) record.