Need help joining incremental data to a fact table in an incremental manner - sql

TableA
ID | Counter | Value
1  | 1       | 10
1  | 2       | 28
1  | 3       | 34
1  | 4       | 22
1  | 5       | 80
2  | 1       | 15
2  | 2       | 50
2  | 3       | 39
2  | 4       | 33
2  | 5       | 99
TableB
StartDate  | EndDate
2020-01-01 | 2020-01-11
2020-01-02 | 2020-01-12
2020-01-03 | 2020-01-13
2020-01-04 | 2020-01-14
2020-01-05 | 2020-01-15
2020-01-06 | 2020-01-16
TableC (output)
ID | Counter | StartDate  | EndDate    | Val
1  | 1       | 2020-01-01 | 2020-01-11 | 10
2  | 1       | 2020-01-01 | 2020-01-11 | 15
1  | 2       | 2020-01-02 | 2020-01-12 | 28
2  | 2       | 2020-01-02 | 2020-01-12 | 50
1  | 3       | 2020-01-03 | 2020-01-13 | 34
2  | 3       | 2020-01-03 | 2020-01-13 | 39
1  | 4       | 2020-01-04 | 2020-01-14 | 22
2  | 4       | 2020-01-04 | 2020-01-14 | 33
1  | 5       | 2020-01-05 | 2020-01-15 | 80
2  | 5       | 2020-01-05 | 2020-01-15 | 99
1  | 1       | 2020-01-06 | 2020-01-16 | 10
2  | 1       | 2020-01-06 | 2020-01-16 | 15
I am attempting to come up with some SQL to create TableC. TableC takes the rows of TableB in chronological order and, for each ID in TableA, assigns the next Counter in the sequence to that Start/End date combination for that ID; when it reaches the end of the counter sequence, it starts back at 1.
Is something like this even possible with SQL?

Yes, this is possible. Try the following:
Calculate the maximum Counter value in TableA using SELECT MAX(Counter) ... into max_counter.
Add a row number to each row in TableB, ordered by StartDate, so it can be matched to a Counter value: SELECT ROW_NUMBER() OVER (ORDER BY StartDate) ....
Relate the row number in TableB to Counter in TableA, wrapping the modulo so it cycles back instead of producing 0: ... FROM TableB JOIN TableA ON COALESCE(NULLIF(TableB.row_number % max_counter, 0), max_counter) = TableA.Counter.
Then gather these queries into one query using CTEs (Common Table Expressions), as the official documentation shows; a sketch follows.
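A minimal sketch of those steps (assuming SQL Server/PostgreSQL-style syntax; the CTE names max_c and numbered_b are just illustrative):

with max_c as (
    -- step 1: cycle length from TableA
    select max(Counter) as max_counter
    from TableA
),
numbered_b as (
    -- step 2: number TableB rows in chronological order
    select StartDate, EndDate,
           row_number() over (order by StartDate) as rn
    from TableB
)
-- step 3: map each row number onto a Counter, wrapping back to 1 after max_counter
select a.ID, a.Counter, b.StartDate, b.EndDate, a.Value as Val
from numbered_b b
cross join max_c m
join TableA a
  on a.Counter = coalesce(nullif(b.rn % m.max_counter, 0), m.max_counter)
order by b.StartDate, a.ID;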

Consider below approach
select id, counter, StartDate, EndDate, value
from tableA
join (
  select *, mod(row_number() over(order by StartDate) - 1, 5) + 1 as counter
  from tableB
)
using (counter)
If applied to the sample data in your question, the output matches TableC above.
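The literal 5 assumes the Counter sequence always runs 1 through 5. A sketch of the same idea that derives the cycle length from TableA instead (keeping the BigQuery-style mod()/using() syntax of the answer above):

select id, counter, StartDate, EndDate, value
from tableA
join (
  -- number TableB rows and wrap the row number onto the 1..max(counter) cycle
  select b.StartDate, b.EndDate,
         mod(row_number() over(order by b.StartDate) - 1, m.max_counter) + 1 as counter
  from tableB b
  cross join (select max(counter) as max_counter from tableA) m
)
using (counter)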

Related

Calculate running count of events for continuous dates in SQL

I have data that can be summarized as follows:
eventid startdate enddate productkey date startGroup endGroup eventGroup
123 2020-01-01 2020-01-10 123456 2020-01-01 1 0 1
123 2020-01-01 2020-01-10 123456 2020-01-02 0 0 1
123 2020-01-01 2020-01-10 123456 2020-01-03 0 0 1
123 2020-01-01 2020-01-10 123456 2020-01-04 0 1 1
234 2020-01-05 2020-01-07 123456 2020-01-05 1 0 2
234 2020-01-05 2020-01-07 123456 2020-01-06 0 0 2
234 2020-01-05 2020-01-07 123456 2020-01-07 0 1 2
123 2020-01-01 2020-01-10 123456 2020-01-08 1 0 1
123 2020-01-01 2020-01-10 123456 2020-01-09 0 0 1
123 2020-01-01 2020-01-10 123456 2020-01-10 0 1 1
I store various events for products. Since they can be overlapping, I already have code to de-dup the data, but now, with some of the (de-duped) days missing, I need to put the data back together at an event level. In the example data, you see two events, 123 (running from 1/1 to 1/10) and 234 (running from 1/5 to 1/7). I already cut out the middle two days to get rid of overlaps, and what I want, output-wise, is three groups of events:
1/1-1/4 (i.e. last column = 1)
1/5-1/7 (i.e. last column = 2)
1/8-1/10 (i.e. last column = 3)
I already have code to find the right start and end entries for each block of time, but don't know how to calculate the eventGroup column correctly. Current code for the last three columns is as follows:
CASE WHEN DATEADD(DAY, -1, date) = LAG(date) OVER (PARTITION BY eventid, productkey ORDER BY date) THEN 0 ELSE 1 END startGroup,
CASE WHEN DATEADD(DAY, +1, date) = LEAD(date) OVER (PARTITION BY eventid, productkey ORDER BY date) THEN 0 ELSE 1 END endGroup,
dense_rank() over (order by eventid, productkey) eventGroup
I already tried things like https://dba.stackexchange.com/questions/193680/group-rows-by-uninterrupted-dates, but still wasn't able to create the correct groups.
In Excel logic, it would be eventGroup = if ( startGroup = 0, eventGroup of previous row, eventGroup of previous row + 1), but not sure how to replicate that running counter here.
Can someone help please? Thanks!
To assign the groups, use a cumulative sum:
select t.*,
sum(startGroup) over (partition by eventid, productkey order by date) as eventGroup
from t;
Note: This assumes that you want to restart the numbering for each event/product combination.
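Putting it together with the startGroup logic from the question, a sketch (SQL Server syntax, with t standing in for your de-duplicated table) might look like:

with flagged as (
    -- flag the first day of each uninterrupted run of dates
    select t.*,
           case when dateadd(day, -1, date) =
                     lag(date) over (partition by eventid, productkey order by date)
                then 0 else 1 end as startGroup
    from t
)
-- running sum of the flags numbers the runs within each event/product
select f.*,
       sum(f.startGroup) over (partition by f.eventid, f.productkey
                               order by f.date
                               rows unbounded preceding) as eventGroup
from flagged f;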

Query to find active days per year to find revenue per user per year

I have 2 dimension tables and 1 fact table as follows:
user_dim
user_id | user_name | user_joining_date
1       | Steve     | 2013-01-04
2       | Adam      | 2012-11-01
3       | John      | 2013-05-05
4       | Tony      | 2012-01-01
5       | Dan       | 2010-01-01
6       | Alex      | 2019-01-01
7       | Kim       | 2019-01-01
bundle_dim
bundle_id | bundle_name           | bundle_type | bundle_cost_per_day
101       | movies and TV         | prime       | 5.5
102       | TV and sports         | prime       | 6.5
103       | Cooking               | prime       | 7
104       | Sports and news       | prime       | 5
105       | kids movie            | extra       | 2
106       | kids educative        | extra       | 3.5
107       | spanish news          | extra       | 2.5
108       | Spanish TV and sports | extra       | 3.5
109       | Travel                | extra       | 2
plans_fact
user_id | bundle_id | bundle_start_date | bundle_end_date
1       | 101       | 2019-10-10        | 2020-10-10
2       | 107       | 2020-01-15        | (null)
2       | 106       | 2020-01-15        | 2020-12-31
2       | 101       | 2020-01-15        | (null)
2       | 103       | 2020-01-15        | 2020-02-15
1       | 101       | 2020-10-11        | (null)
1       | 107       | 2019-10-10        | 2020-10-10
1       | 105       | 2019-10-10        | 2020-10-10
4       | 101       | 2021-01-01        | 2021-02-01
3       | 104       | 2020-02-17        | 2020-03-17
2       | 108       | 2020-01-15        | (null)
4       | 102       | 2021-01-01        | (null)
4       | 103       | 2021-01-01        | (null)
4       | 108       | 2021-01-01        | (null)
5       | 103       | 2020-01-15        | (null)
5       | 101       | 2020-01-15        | 2020-02-15
6       | 101       | 2021-01-01        | 2021-01-17
6       | 101       | 2021-01-20        | (null)
6       | 108       | 2021-01-01        | (null)
7       | 104       | 2020-02-17        | (null)
7       | 103       | 2020-01-17        | 2020-01-18
1       | 102       | 2020-12-11        | (null)
2       | 106       | 2021-01-01        | (null)
7       | 107       | 2020-01-15        | (null)
Note: a NULL bundle_end_date indicates an active subscription.
User active days can be calculated as: bundle_end_date - bundle_start_date (for the given bundle).
Total revenue per user can be calculated as: total no. of active days * bundle rate per day.
I am looking to write a query to find revenue generated per user per year.
Here is what I have for the overall revenue per user:
select pf.user_id
, sum(datediff(day, pf.bundle_start_date, coalesce(pf.bundle_end_date, getdate())) * bd.bundle_cost_per_day) total_cost_per_bundle
from plans_fact pf
inner join bundle_dim bd on bd.bundle_id = pf.bundle_id
group by pf.user_id
order by pf.user_id;
You need a 'year' table to help parse each multi-year-spanning row into its separate years. For each year, you also need to recalculate the start and end dates. That's what I do in the yearParsed CTE in the code below. I hard-code the years into the join statement that creates y. You will probably do it differently, but however you get those values will work.
After that, pretty much sum as you did before, just adding the year column to your grouping.
Aside from that, all I did was move the null-coalescing logic into the CTE to keep the overall logic simpler.
with yearParsed as (
select pf.*,
y.year,
startDt = iif(pf.bundle_start_date > y.startDt, pf.bundle_start_date, y.startDt),
endDt = iif(ap.bundle_end_date < y.endDt, ap.bundle_end_date, y.endDt)
from plans_fact pf
cross apply (select bundle_end_date = isnull(pf.bundle_end_date, getdate())) ap
join (values
(2019, '2019-01-01', '2019-12-31'),
(2020, '2020-01-01', '2020-12-31'),
(2021, '2021-01-01', '2021-12-31')
) y (year, startDt, endDt)
on pf.bundle_start_date <= y.endDt
and ap.bundle_end_date >= y.startDt
)
select yp.user_id,
yp.year,
total_cost_per_bundle = sum(datediff(day, yp.startDt, yp.endDt) * bd.bundle_cost_per_day)
from yearParsed yp
join bundle_dim bd on bd.bundle_id = yp.bundle_id
group by yp.user_id,
yp.year
order by yp.user_id,
yp.year;
Now, if this is a common need, you should probably create a base table for your 'year' table. But if it's not common, and for this report you just don't want to keep coming back to hard-code the year information into the y table, you can do this:
declare @yearTable table (
year int,
startDt char(10),
endDt char(10)
);
with y as (
select year = year(min(pf.bundle_start_date))
from plans_fact pf
union all
select year + 1
from y
where year < year(getdate())
)
insert @yearTable
select year,
startDt = convert(char(4),year) + '-01-01',
endDt = convert(char(4),year) + '-12-31'
from y;
and it will create the appropriate years for you. But you can see why creating a base table may be preferred if you have this or a similar need often.
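Once @yearTable is populated, the hard-coded VALUES list in yearParsed can simply be replaced by the table variable. A sketch (run in the same batch as the snippet above, otherwise identical to the earlier query):

with yearParsed as (
    select pf.*,
           y.year,
           startDt = iif(pf.bundle_start_date > y.startDt, pf.bundle_start_date, y.startDt),
           endDt   = iif(ap.bundle_end_date < y.endDt, ap.bundle_end_date, y.endDt)
    from plans_fact pf
    cross apply (select bundle_end_date = isnull(pf.bundle_end_date, getdate())) ap
    join @yearTable y
      on pf.bundle_start_date <= y.endDt
     and ap.bundle_end_date >= y.startDt
)
select yp.user_id,
       yp.year,
       total_cost_per_bundle = sum(datediff(day, yp.startDt, yp.endDt) * bd.bundle_cost_per_day)
from yearParsed yp
join bundle_dim bd on bd.bundle_id = yp.bundle_id
group by yp.user_id, yp.year
order by yp.user_id, yp.year;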

Oracle aggregate by ID with time range

I'm sure I saw it somewhere, but I cannot find it.
Given this table Historic:
ID1 | ID2 | Event_Date | Label
1   | 1   | 2020-01-01 | 1
1   | 1   | 2020-01-02 | 1
1   | 1   | 2020-01-04 | 1
1   | 1   | 2020-01-08 | 1
1   | 1   | 2020-01-20 | 1
1   | 1   | 2020-12-30 | 1
1   | 1   | 2020-01-01 | 0
1   | 1   | 2020-01-02 | 1
1   | 1   | 2020-01-04 | 0
1   | 1   | 2020-01-08 | 1
1   | 1   | 2020-01-20 | 0
1   | 1   | 2020-12-30 | 1
1   | 2   | 2020-01-01 | 1
1   | 2   | 2020-01-02 | 1
1   | 2   | 2020-01-04 | 1
2   | 1   | 2020-01-08 | 1
2   | 1   | 2020-01-20 | 1
2   | 1   | 2020-12-30 | 1
And the table startingpoint
ID1 | ID2 | Event_Date
1   | 1   | 2020-01-01
1   | 1   | 2020-01-02
1   | 1   | 2020-01-05
1   | 1   | 2020-01-08
1   | 1   | 2020-01-21
1   | 1   | 2021-01-01
1   | 1   | 2020-01-01
1   | 1   | 2020-01-03
1   | 1   | 2020-01-06
1   | 1   | 2020-01-11
1   | 1   | 2020-01-20
1   | 1   | 2020-12-31
1   | 2   | 2020-01-03
1   | 2   | 2020-01-05
1   | 2   | 2020-01-08
2   | 1   | 2020-01-08
2   | 1   | 2020-01-21
2   | 1   | 2021-01-01
For each row in startingpoint, compute the number of rows in historic with the same ID1 and ID2, where Event_Date in historic is between StartingPoint.Event_date - n days (I make it n so that I can change for different values) and StartingPoint.Event_date - 2 days. Then use the same rules to compute the fraction of rows with label = 1.
I know I can do this with a join, but if historic and startingpoint are very large this looks very inefficient (for every row in startingpoint it will create a large join, and in the end it will summarize the same set of rows many times repeatedly). From an abstract point of view, it looks to me like it would be better to first summarize historic for every ID1, ID2, Event_Date and then join with startingpoint and select the best, but I'm open to other solutions.
You can try the below solution with a correlated subquery (here with n = 30):
select s.*,
       (select count(*)
        from historic h
        where h.id1 = s.id1
          and h.id2 = s.id2
          and h.event_date between s.event_date - 30 and s.event_date - 2) as cnt
from startingpoint s
You have to have some form of join; either joining directly, or with a scalar subquery, which is probably not going to be as efficient.
The simplest way to do this is probably just a plain join, if you only want to see rows which have historic data:
select sp.id1, sp.id2, sp.event_date,
count(h.event_date) as any_label,
count(case when h.label = 1 then h.label end) as label_1,
count(case when h.label = 1 then h.label end) / count(h.event_date) as fraction_1
from startingpoint sp
join historic h on h.id1 = sp.id1
and h.id2 = sp.id2
and h.event_date >= sp.event_date - 10
and h.event_date < sp.event_date - 2
group by sp.id1, sp.id2, sp.event_date
order by sp.id1, sp.id2, sp.event_date;
where n is 10; which with your data would give you:
ID1 ID2 EVENT_DATE ANY_LABEL LABEL_1 FRACTION_1
--- --- ---------- --------- ------- --------------------
1 1 2020-01-05 4 3 .75
1 1 2020-01-06 4 3 .75
1 1 2020-01-08 6 4 .6666666666666666667
1 1 2020-01-11 8 6 .75
1 2 2020-01-05 2 2 1
1 2 2020-01-08 3 3 1
Or if you want to see zero counts, you can use an outer join; though then the fraction calculation needs some logic to avoid a divide-by-zero error:
select sp.id1, sp.id2, sp.event_date,
count(h.event_date) as any_label,
count(case when h.label = 1 then h.label end) as label_1,
case when count(h.event_date) > 0 then
count(case when h.label = 1 then h.label end) / count(h.event_date)
end as fraction_1
from startingpoint sp
left join historic h on h.id1 = sp.id1
and h.id2 = sp.id2
and h.event_date >= sp.event_date - 10
and h.event_date < sp.event_date - 2
group by sp.id1, sp.id2, sp.event_date
order by sp.id1, sp.id2, sp.event_date;
which gets:
ID1 ID2 EVENT_DATE ANY_LABEL LABEL_1 FRACTION_1
--- --- ---------- --------- ------- --------------------
1 1 2020-01-01 0 0
1 1 2020-01-02 0 0
1 1 2020-01-03 0 0
1 1 2020-01-05 4 3 .75
1 1 2020-01-06 4 3 .75
1 1 2020-01-08 6 4 .6666666666666666667
1 1 2020-01-11 8 6 .75
1 1 2020-01-20 0 0
1 1 2020-01-21 0 0
1 1 2020-12-31 0 0
1 1 2021-01-01 0 0
1 2 2020-01-03 0 0
1 2 2020-01-05 2 2 1
1 2 2020-01-08 3 3 1
2 1 2020-01-08 0 0
2 1 2020-01-21 0 0
2 1 2021-01-01 0 0
db<>fiddle

How can I create a column which computes only the change of another column on Redshift?

I have this dataset:
product customer date value buyer_position
A 123455 2020-01-01 00:01:01 100 1
A 123456 2020-01-02 00:02:01 100 2
A 523455 2020-01-02 00:02:05 100 NULL
A 323455 2020-01-03 00:02:07 100 NULL
A 423455 2020-01-03 00:09:01 100 3
B 100455 2020-01-01 00:03:01 100 1
B 999445 2020-01-01 00:04:01 100 NULL
B 122225 2020-01-01 00:04:05 100 2
B 993848 2020-01-01 10:04:05 100 3
B 133225 2020-01-01 11:04:05 100 NULL
B 144225 2020-01-01 12:04:05 100 4
The dataset has the products the company sells and the customers who saw each product. A customer can see more than one product, but the combination product + customer doesn't have any repetition. I want to get how many people had bought the product before the customer saw it.
This would be the perfect output:
product customer date value buyer_position people_before
A 123455 2020-01-01 00:01:01 100 1 0
A 123456 2020-01-02 00:02:01 100 2 1
A 523455 2020-01-02 00:02:05 100 NULL 2
A 323455 2020-01-03 00:02:07 100 NULL 2
A 423455 2020-01-03 00:09:01 100 3 2
B 100455 2020-01-01 00:03:01 100 1 0
B 999445 2020-01-01 00:04:01 100 NULL 1
B 122225 2020-01-01 00:04:05 100 2 1
B 993848 2020-01-01 10:04:05 100 3 2
B 133225 2020-01-01 11:04:05 100 NULL 3
B 144225 2020-01-01 12:04:05 100 4 3
As you can see, when customer 122225 saw the product he wanted, two people had already bought it. In the case of customer 323455, two people had already bought product A.
I think I should use some window function, like lag(). But the lag() function won't capture this "cumulative" information, so I'm kind of lost here.
This looks like a window count of non-null values of buyer_position over the preceding rows:
select t.*,
coalesce(count(buyer_position) over(
partition by product
order by date
rows between unbounded preceding and 1 preceding
), 0) as people_before
from mytable t
Hmmm . . . If I understand correctly, you want the max of the buyer position for the customer/product minus 1:
select t.*,
max(buyer_position) over (partition by customer, product order by date rows between unbounded preceding and current row) - 1
from t;

Insert multiple rows from result of Average by date and id

I have a table with 1 result per day, like this:
id | item_id | date | amount
-------------------------------------
1 1 2019-01-01 1
2 1 2019-01-02 2
3 1 2019-01-03 3
4 1 2019-01-04 4
5 1 2019-01-05 5
6 2 2019-01-01 1
7 2 2019-01-01 2
8 2 2019-01-01 3
9 2 2019-01-01 4
10 2 2019-01-01 5
11 3 2019-01-01 1
12 3 2019-01-01 2
13 3 2019-01-01 3
14 3 2019-01-01 4
15 3 2019-01-01 5
First I was trying to average the column amount for each day.
SELECT
x.item_id AS id,avg(x.amount) AS result
FROM
(SELECT
il.item_id, il.amount,
ROW_NUMBER() OVER (PARTITION BY il.item_id ORDER BY il.date DESC) rn
FROM
item_prices il) x
WHERE
x.rn BETWEEN 1 AND 50
GROUP BY
x.item_id
The result is going to be the following if calculated on 2019-01-05
item_id | average
1 3
2 3
3 3
or, if calculated 2019-01-04
item_id | average
1 2.5
2 2.5
3 2.5
My goal is to run the average query every day, so that it updates the average automatically and inserts it into a 5th column, "average":
id | item_id | date | amount | average
5 1 2019-01-05 5 3
10 2 2019-01-05 5 3
15 3 2019-01-05 5 3
The issue is that every INSERT ... SELECT example I can find only updates one row, and they all target another table; there is also the most-recent-date issue...
Can someone point me in the right direction?
Perhaps you want to see a running average every day. Storing the value as a separate column is bound to cause problems: whenever rows are updated or deleted, the column also needs to be updated, which would require complex triggers.
Simply create a View and run whenever you want to check the average directly from that View.
CREATE OR REPLACE VIEW v_item_prices AS
SELECT t.*,
       AVG(t.amount) OVER (PARTITION BY item_id ORDER BY date) AS average
FROM item_prices t
ORDER BY item_id, date
DEMO
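If what you ultimately want is just the latest row per item together with its average (the three-row output in your question), you could, as a sketch, filter the view against the latest date per item (assuming one row per item per day):

-- latest row per item, with the running average computed by the view
SELECT v.id, v.item_id, v.date, v.amount, v.average
FROM v_item_prices v
JOIN (SELECT item_id, MAX(date) AS max_date
      FROM item_prices
      GROUP BY item_id) m
  ON m.item_id = v.item_id
 AND m.max_date = v.date;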