I have data that can be summarized as follows:
eventid startdate enddate productkey date startGroup endGroup eventGroup
123 2020-01-01 2020-01-10 123456 2020-01-01 1 0 1
123 2020-01-01 2020-01-10 123456 2020-01-02 0 0 1
123 2020-01-01 2020-01-10 123456 2020-01-03 0 0 1
123 2020-01-01 2020-01-10 123456 2020-01-04 0 1 1
234 2020-01-05 2020-01-07 123456 2020-01-05 1 0 2
234 2020-01-05 2020-01-07 123456 2020-01-06 0 0 2
234 2020-01-05 2020-01-07 123456 2020-01-07 0 1 2
123 2020-01-01 2020-01-10 123456 2020-01-08 1 0 1
123 2020-01-01 2020-01-10 123456 2020-01-09 0 0 1
123 2020-01-01 2020-01-10 123456 2020-01-10 0 1 1
I store various events for products. Since events can overlap, I already have code to de-duplicate the data, but now, with some of the (de-duplicated) days missing, I need to put the data back together at an event level. In the example data you see two events: 123 (running from 1/1 to 1/10) and 234 (running from 1/5 to 1/7). I already cut out the middle days of event 123 to get rid of the overlap, and what I want output-wise is three groups of events:
1/1-1/4 (i.e. last column = 1)
1/5-1/7 (i.e. last column = 2)
1/8-1/10 (i.e. last column = 3)
I already have code to find the right start and end entries for each block of time, but don't know how to calculate the eventGroup column correctly. Current code for the last three columns is as follows:
CASE WHEN DATEADD(DAY, -1, date) = LAG(date) OVER (PARTITION BY eventid, productkey ORDER BY date) THEN 0 ELSE 1 END startGroup,
CASE WHEN DATEADD(DAY, +1, date) = LEAD(date) OVER (PARTITION BY eventid, productkey ORDER BY date) THEN 0 ELSE 1 END endGroup,
dense_rank() over (order by eventid, productkey) eventGroup
I already tried things like https://dba.stackexchange.com/questions/193680/group-rows-by-uninterrupted-dates, but still wasn't able to create the correct groups.
In Excel logic, it would be eventGroup = if(startGroup = 0, eventGroup of previous row, eventGroup of previous row + 1), but I'm not sure how to replicate that running counter here.
Can someone help please? Thanks!
To assign the groups, use a cumulative sum:
select t.*,
       sum(startGroup) over (partition by eventid, productkey order by date) as eventGroup
from t;
Note: This assumes that you want to restart the numbering with each event/product combination.
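For readers who want to see the cumulative-sum trick outside SQL, here is a minimal Python sketch (data hard-coded from the sample above; the function name is just illustrative). It orders all rows by date without partitioning by event, which is what yields the three groups 1/1-1/4, 1/5-1/7, 1/8-1/10 asked for in the question:

```python
from datetime import date, timedelta

# De-duplicated sample rows (eventid, productkey, day), already sorted by day.
rows = [
    (123, 123456, date(2020, 1, 1)),
    (123, 123456, date(2020, 1, 2)),
    (123, 123456, date(2020, 1, 3)),
    (123, 123456, date(2020, 1, 4)),
    (234, 123456, date(2020, 1, 5)),
    (234, 123456, date(2020, 1, 6)),
    (234, 123456, date(2020, 1, 7)),
    (123, 123456, date(2020, 1, 8)),
    (123, 123456, date(2020, 1, 9)),
    (123, 123456, date(2020, 1, 10)),
]

def assign_groups(rows):
    """Mimic SUM(startGroup) OVER (ORDER BY date): bump the group id
    whenever the event changes or the previous row is not exactly one
    day earlier."""
    out, group, prev = [], 0, None
    for eventid, productkey, day in rows:
        is_start = (
            prev is None
            or prev[0] != (eventid, productkey)
            or prev[1] + timedelta(days=1) != day
        )
        group += int(is_start)
        out.append((eventid, productkey, day, group))
        prev = ((eventid, productkey), day)
    return out
```

Each row's start flag is 1 unless the previous row belongs to the same event and is exactly one day earlier; the running total of those flags is the eventGroup.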
TableA
ID  Counter  Value
1   1        10
1   2        28
1   3        34
1   4        22
1   5        80
2   1        15
2   2        50
2   3        39
2   4        33
2   5        99
TableB
StartDate   EndDate
2020-01-01  2020-01-11
2020-01-02  2020-01-12
2020-01-03  2020-01-13
2020-01-04  2020-01-14
2020-01-05  2020-01-15
2020-01-06  2020-01-16
TableC (output)
ID  Counter  StartDate   EndDate     Val
1   1        2020-01-01  2020-01-11  10
2   1        2020-01-01  2020-01-11  15
1   2        2020-01-02  2020-01-12  28
2   2        2020-01-02  2020-01-12  50
1   3        2020-01-03  2020-01-13  34
2   3        2020-01-03  2020-01-13  39
1   4        2020-01-04  2020-01-14  22
2   4        2020-01-04  2020-01-14  33
1   5        2020-01-05  2020-01-15  80
2   5        2020-01-05  2020-01-15  99
1   1        2020-01-06  2020-01-16  10
2   1        2020-01-06  2020-01-16  15
I am attempting to come up with some SQL to create TableC. TableC takes the date ranges from TableB in chronological order and, for each ID in TableA, assigns the next Counter in the sequence to that Start/End date combination; when it reaches the end of the counter, it starts back at 1.
Is something like this even possible with SQL?
Yes, this is possible. Try the following:
Calculate the maximal value of Counter in TableA using SELECT MAX(Counter) ... into max_counter.
Add a row_number identifier to each row in TableB, so it is possible to find the matching Counter value, using SELECT ROW_NUMBER() OVER() ....
Establish the relation between the row number in TableB and Counter in TableA like this: ... FROM TableB JOIN TableA ON COALESCE(NULLIF(TableB.row_number % max_counter, 0), max_counter) = TableA.Counter.
Then combine these queries into one using a CTE (Common Table Expression), as the official documentation shows.
Consider below approach
select id, counter, StartDate, EndDate, value
from tableA
join (
select *, mod(row_number() over(order by StartDate) - 1, 5) + 1 as counter
from tableB
)
using (counter)
If applied to the sample data in your question, the output matches TableC above.
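As a cross-check of the modulo arithmetic, here is a small Python sketch (data hard-coded from TableA/TableB above; the variable names are just for illustration). It numbers TableB's rows chronologically, wraps the row number around with modulo so it cycles 1..max_counter, and uses that to look up the matching TableA rows:

```python
table_a = {  # (id, counter) -> value
    (1, 1): 10, (1, 2): 28, (1, 3): 34, (1, 4): 22, (1, 5): 80,
    (2, 1): 15, (2, 2): 50, (2, 3): 39, (2, 4): 33, (2, 5): 99,
}
table_b = [  # (start_date, end_date), already in chronological order
    ("2020-01-01", "2020-01-11"), ("2020-01-02", "2020-01-12"),
    ("2020-01-03", "2020-01-13"), ("2020-01-04", "2020-01-14"),
    ("2020-01-05", "2020-01-15"), ("2020-01-06", "2020-01-16"),
]

max_counter = max(counter for _, counter in table_a)  # SELECT MAX(Counter)

table_c = []
for row_number, (start, end) in enumerate(table_b, start=1):
    counter = (row_number - 1) % max_counter + 1  # cycles 1..max_counter
    for (id_, c), value in table_a.items():       # JOIN ... USING (counter)
        if c == counter:
            table_c.append((id_, counter, start, end, value))
```

The sixth TableB row wraps back to counter 1, so it picks up values 10 and 15 again, exactly as in TableC.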
I have a transaction table that looks like that:
transaction_start store_no item_no amount post_voided
2021-03-01 10:00:00 001 101 45 N
2021-03-01 10:00:00 001 105 25 N
2021-03-01 10:00:00 001 109 40 N
2021-03-01 10:05:00 002 103 35 N
2021-03-01 10:05:00 002 135 20 N
2021-03-01 10:08:00 001 140 2 N
2021-03-01 10:11:00 001 101 -45 Y
2021-03-01 10:11:00 001 105 -25 Y
2021-03-01 10:11:00 001 109 -40 Y
The table does not have an id column; the transaction_start for a given store_no will never be the same.
Whenever a transaction is post voided, the transaction is then repeated with the same store_no, item_no but with a negative/minus amount and an equal or higher transaction_start. Also, the column post_voided is then equal to 'Y'.
In the example above, the rows 1-3 have the same transaction_start and store_no, thus belonging to the same receipt, containing three different items (101, 105, 109). The same logic is applied to the other rows: rows 4-5 belong to a same receipt, and so on. In the example, 4 different receipts can be seen. The last receipt, given by the last three rows, is a post voided of the first receipt (rows 1-3).
What I want to do is change the transaction_start of the post_voided = 'Y' transactions (in my example, only the receipt represented by the last three rows) to the next/closest datetime of a similar receipt with the same store_no and item_no and the negated amount (but post_voided = 'N'). In my example the similar receipt is given by the first three rows: store_no, all item_no values, and the (positive) amounts match. The transaction_start of the post-voided receipt is always equal to or later than that of the original receipt.
Desired output:
transaction_start store_no item_no amount post_voided
2021-03-01 10:00:00 001 101 45 N
2021-03-01 10:00:00 001 105 25 N
2021-03-01 10:00:00 001 109 40 N
2021-03-01 10:05:00 002 103 35 N
2021-03-01 10:05:00 002 135 20 N
2021-03-01 10:08:00 001 140 2 N
2021-03-01 10:00:00 001 101 -45 Y
2021-03-01 10:00:00 001 105 -25 Y
2021-03-01 10:00:00 001 109 -40 Y
Here a link of the table: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=26142fa24e46acb4213b96c86f4eb94b
Thanks in advance!
Consider below
select a.* replace(ifnull(b.transaction_start, a.transaction_start) as transaction_start)
from `project.dataset.table` a
left join (
select * replace(-amount as amount)
from `project.dataset.table`
where post_voided = 'N'
) b
using (store_no, item_no)
If applied to the sample data in your question, the output matches the desired output above.
Consider below for new / extended example (https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=91f9f180fd672e7c357aa48d18ced5fd)
select x.* replace(ifnull(y.original_transaction_start, x.transaction_start) as transaction_start)
from `project.dataset.table` x
left join (
select b.transaction_start, b.store_no, b.item_no, b.amount amount,
max(a.transaction_start) original_transaction_start
from `project.dataset.table` a
join `project.dataset.table` b
on a.store_no = b.store_no
and a.item_no = b.item_no
and a.amount = -b.amount
and a.post_voided = 'N'
and b.post_voided = 'Y'
and a.transaction_start < b.transaction_start
group by b.transaction_start, b.store_no, b.item_no, b.amount
) y
using (store_no, item_no, amount, transaction_start)
which produces the desired output for the extended example as well.
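The matching rule in the subquery y (latest earlier 'N' row with the same store/item and the negated amount) can be sketched in Python like this; the data is hard-coded from the sample table and the reassign helper is illustrative, not part of the original:

```python
from datetime import datetime

rows = [  # (transaction_start, store_no, item_no, amount, post_voided)
    (datetime(2021, 3, 1, 10, 0), "001", 101, 45, "N"),
    (datetime(2021, 3, 1, 10, 0), "001", 105, 25, "N"),
    (datetime(2021, 3, 1, 10, 0), "001", 109, 40, "N"),
    (datetime(2021, 3, 1, 10, 5), "002", 103, 35, "N"),
    (datetime(2021, 3, 1, 10, 5), "002", 135, 20, "N"),
    (datetime(2021, 3, 1, 10, 8), "001", 140, 2, "N"),
    (datetime(2021, 3, 1, 10, 11), "001", 101, -45, "Y"),
    (datetime(2021, 3, 1, 10, 11), "001", 105, -25, "Y"),
    (datetime(2021, 3, 1, 10, 11), "001", 109, -40, "Y"),
]

def reassign(rows):
    """For each post-voided row, take the transaction_start of the latest
    earlier 'N' row with the same store/item and the negated amount."""
    out = []
    for ts, store, item, amount, voided in rows:
        if voided == "Y":
            candidates = [
                t for t, s, i, a, v in rows
                if v == "N" and s == store and i == item
                and a == -amount and t < ts
            ]
            if candidates:
                ts = max(candidates)  # MAX(...) = closest earlier original
        out.append((ts, store, item, amount, voided))
    return out
```

Taking the max of the earlier candidates is what the `max(a.transaction_start)` with `a.transaction_start < b.transaction_start` does in the SQL version.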
I have 2 dimension tables and 1 fact table as follows:
user_dim
user_id  user_name  user_joining_date
1        Steve      2013-01-04
2        Adam       2012-11-01
3        John       2013-05-05
4        Tony       2012-01-01
5        Dan        2010-01-01
6        Alex       2019-01-01
7        Kim        2019-01-01
bundle_dim
bundle_id  bundle_name            bundle_type  bundle_cost_per_day
101        movies and TV          prime        5.5
102        TV and sports          prime        6.5
103        Cooking                prime        7
104        Sports and news        prime        5
105        kids movie             extra        2
106        kids educative         extra        3.5
107        spanish news           extra        2.5
108        Spanish TV and sports  extra        3.5
109        Travel                 extra        2
plans_fact
user_id  bundle_id  bundle_start_date  bundle_end_date
1        101        2019-10-10         2020-10-10
2        107        2020-01-15         (null)
2        106        2020-01-15         2020-12-31
2        101        2020-01-15         (null)
2        103        2020-01-15         2020-02-15
1        101        2020-10-11         (null)
1        107        2019-10-10         2020-10-10
1        105        2019-10-10         2020-10-10
4        101        2021-01-01         2021-02-01
3        104        2020-02-17         2020-03-17
2        108        2020-01-15         (null)
4        102        2021-01-01         (null)
4        103        2021-01-01         (null)
4        108        2021-01-01         (null)
5        103        2020-01-15         (null)
5        101        2020-01-15         2020-02-15
6        101        2021-01-01         2021-01-17
6        101        2021-01-20         (null)
6        108        2021-01-01         (null)
7        104        2020-02-17         (null)
7        103        2020-01-17         2020-01-18
1        102        2020-12-11         (null)
2        106        2021-01-01         (null)
7        107        2020-01-15         (null)
note: NULL bundle_end_date refers to active subscription.
user active days can be calculated as: bundle_end_date - bundle_start_date (for the given bundle)
total revenue per user could be calculated as : total no. of active days * bundle rate per day
I am looking to write a query to find revenue generated per user per year.
Here is what I have for the overall revenue per user:
select pf.user_id
     , sum(datediff(day, pf.bundle_start_date, coalesce(pf.bundle_end_date, getdate())) * bd.bundle_cost_per_day) total_cost_per_bundle
from plans_fact pf
inner join bundle_dim bd on bd.bundle_id = pf.bundle_id
group by pf.user_id
order by pf.user_id;
You need a 'year' table to help parse each multi-year-spanning row into its separate years. For each year, you also need to recalculate the start and end dates. That's what I do in the yearParsed CTE in the code below. I hard-code the years into the join statement that creates y. You will probably do it differently, but however you get those values, it will work.
After that, pretty much sum as you did before, just adding the year column to your grouping.
Aside from that, all I did was move the null-coalesce logic into the CTE to make the overall logic simpler.
with yearParsed as (
select pf.*,
y.year,
startDt = iif(pf.bundle_start_date > y.startDt, pf.bundle_start_date, y.startDt),
endDt = iif(ap.bundle_end_date < y.endDt, ap.bundle_end_date, y.endDt)
from plans_fact pf
cross apply (select bundle_end_date = isnull(pf.bundle_end_date, getdate())) ap
join (values
(2019, '2019-01-01', '2019-12-31'),
(2020, '2020-01-01', '2020-12-31'),
(2021, '2021-01-01', '2021-12-31')
) y (year, startDt, endDt)
on pf.bundle_start_date <= y.endDt
and ap.bundle_end_date >= y.startDt
)
select yp.user_id,
yp.year,
total_cost_per_bundle = sum(datediff(day, yp.startDt, yp.endDt) * bd.bundle_cost_per_day)
from yearParsed yp
join bundle_dim bd on bd.bundle_id = yp.bundle_id
group by yp.user_id,
yp.year
order by yp.user_id,
yp.year;
Now, if this is common, you should probably create a base table for your 'year' table. But if it's not common, and for this report you don't want to keep coming back to hard-code the year information into the y table, you can do this:
declare #yearTable table (
year int,
startDt char(10),
endDt char(10)
);
with y as (
select year = year(min(pf.bundle_start_date))
from #plans_fact pf
union all
select year + 1
from y
where year < year(getdate())
)
insert #yearTable
select year,
startDt = convert(char(4),year) + '-01-01',
endDt = convert(char(4),year) + '-12-31'
from y;
and it will create the appropriate years for you. But you can see why creating a base table may be preferred if you have this or a similar need often.
I'm sure I saw it somewhere, but I cannot find it.
Given this table Historic:
ID1  ID2  Event_Date  Label
1    1    2020-01-01  1
1    1    2020-01-02  1
1    1    2020-01-04  1
1    1    2020-01-08  1
1    1    2020-01-20  1
1    1    2020-12-30  1
1    1    2020-01-01  0
1    1    2020-01-02  1
1    1    2020-01-04  0
1    1    2020-01-08  1
1    1    2020-01-20  0
1    1    2020-12-30  1
1    2    2020-01-01  1
1    2    2020-01-02  1
1    2    2020-01-04  1
2    1    2020-01-08  1
2    1    2020-01-20  1
2    1    2020-12-30  1
And the table startingpoint
ID1  ID2  Event_Date
1    1    2020-01-01
1    1    2020-01-02
1    1    2020-01-05
1    1    2020-01-08
1    1    2020-01-21
1    1    2021-01-01
1    1    2020-01-01
1    1    2020-01-03
1    1    2020-01-06
1    1    2020-01-11
1    1    2020-01-20
1    1    2020-12-31
1    2    2020-01-03
1    2    2020-01-05
1    2    2020-01-08
2    1    2020-01-08
2    1    2020-01-21
2    1    2021-01-01
For each row in startingpoint, compute the number of rows in historic with the same ID1 and ID2 where Event_Date in historic is between StartingPoint.Event_Date - n days (I use n so that I can try different values) and StartingPoint.Event_Date - 2 days. Then, with the same rules, compute the fraction of rows with Label = 1.
I know I can do this with a join, but if historic and startingpoint are very large this looks very inefficient (for every row in startingpoint it will create a large join, and in the end it will summarize the same set of rows many times repeatedly). From an abstract point of view, it looks to me like it would be better to first summarize historic for every ID1, ID2, Event_Date and then join with startingpoint, but I'm open to other solutions.
You can try the below solution with a subquery (here with n hard-coded as 30):
select s.*,
       (select count(*)
        from historic h
        where h.id1 = s.id1
          and h.id2 = s.id2
          and h.event_date between dateadd(day, -30, s.event_date)
                               and dateadd(day, -2, s.event_date))
from startingpoint s
You have to have some form of join; either joining directly, or with a scalar subquery, which is probably not going to be as efficient.
The simplest way to do this is probably just a plain join, if you only want to see rows which have historic data:
select sp.id1, sp.id2, sp.event_date,
count(h.event_date) as any_label,
count(case when h.label = 1 then h.label end) as label_1,
count(case when h.label = 1 then h.label end) / count(h.event_date) as fraction_1
from startingpoint sp
join historic h on h.id1 = sp.id1
and h.id2 = sp.id2
and h.event_date >= sp.event_date - 10
and h.event_date < sp.event_date - 2
group by sp.id1, sp.id2, sp.event_date
order by sp.id1, sp.id2, sp.event_date;
where n is 10; which with your data would give you:
ID1 ID2 EVENT_DATE ANY_LABEL LABEL_1 FRACTION_1
--- --- ---------- --------- ------- --------------------
1 1 2020-01-05 4 3 .75
1 1 2020-01-06 4 3 .75
1 1 2020-01-08 6 4 .6666666666666666667
1 1 2020-01-11 8 6 .75
1 2 2020-01-05 2 2 1
1 2 2020-01-08 3 3 1
Or if you want to see zero counts, you can use an outer join; though then the fraction calculation needs some logic to avoid a divide-by-zero error:
select sp.id1, sp.id2, sp.event_date,
count(h.event_date) as any_label,
count(case when h.label = 1 then h.label end) as label_1,
case when count(h.event_date) > 0 then
count(case when h.label = 1 then h.label end) / count(h.event_date)
end as fraction_1
from startingpoint sp
left join historic h on h.id1 = sp.id1
and h.id2 = sp.id2
and h.event_date >= sp.event_date - 10
and h.event_date < sp.event_date - 2
group by sp.id1, sp.id2, sp.event_date
order by sp.id1, sp.id2, sp.event_date;
which gets:
ID1 ID2 EVENT_DATE ANY_LABEL LABEL_1 FRACTION_1
--- --- ---------- --------- ------- --------------------
1 1 2020-01-01 0 0
1 1 2020-01-02 0 0
1 1 2020-01-03 0 0
1 1 2020-01-05 4 3 .75
1 1 2020-01-06 4 3 .75
1 1 2020-01-08 6 4 .6666666666666666667
1 1 2020-01-11 8 6 .75
1 1 2020-01-20 0 0
1 1 2020-01-21 0 0
1 1 2020-12-31 0 0
1 1 2021-01-01 0 0
1 2 2020-01-03 0 0
1 2 2020-01-05 2 2 1
1 2 2020-01-08 3 3 1
2 1 2020-01-08 0 0
2 1 2020-01-21 0 0
2 1 2021-01-01 0 0
db<>fiddle
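For intuition, the range-join aggregation can be sketched in Python on a trimmed, hypothetical subset of historic, using inclusive bounds as in the question's "between ... - n days and ... - 2 days":

```python
from datetime import date, timedelta

historic = [  # (id1, id2, event_date, label) -- subset for illustration
    (1, 1, date(2020, 1, 1), 1), (1, 1, date(2020, 1, 2), 1),
    (1, 1, date(2020, 1, 4), 1), (1, 1, date(2020, 1, 8), 1),
    (1, 1, date(2020, 1, 1), 0), (1, 1, date(2020, 1, 2), 1),
    (1, 1, date(2020, 1, 4), 0), (1, 1, date(2020, 1, 8), 1),
]

def window_stats(sp, n):
    """For one startingpoint row, count the historic rows for the same
    (ID1, ID2) within [sp_date - n days, sp_date - 2 days] and the
    fraction of them with label = 1 (None when the window is empty)."""
    id1, id2, sp_date = sp
    lo, hi = sp_date - timedelta(days=n), sp_date - timedelta(days=2)
    labels = [l for i1, i2, d, l in historic
              if (i1, i2) == (id1, id2) and lo <= d <= hi]
    total = len(labels)
    return total, (sum(labels) / total if total else None)
```

For (1, 1, 2020-01-05) with n = 10 this gives 4 matching rows and a fraction of 0.75, matching the ANY_LABEL/FRACTION_1 row in the output above; the None on an empty window mirrors the divide-by-zero guard in the outer-join version.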
I have this dataset:
product customer date value buyer_position
A 123455 2020-01-01 00:01:01 100 1
A 123456 2020-01-02 00:02:01 100 2
A 523455 2020-01-02 00:02:05 100 NULL
A 323455 2020-01-03 00:02:07 100 NULL
A 423455 2020-01-03 00:09:01 100 3
B 100455 2020-01-01 00:03:01 100 1
B 999445 2020-01-01 00:04:01 100 NULL
B 122225 2020-01-01 00:04:05 100 2
B 993848 2020-01-01 10:04:05 100 3
B 133225 2020-01-01 11:04:05 100 NULL
B 144225 2020-01-01 12:04:05 100 4
The dataset has the products the company sells and the customers who saw each product. A customer can see more than one product, but the combination of product + customer has no repetition. I want to get how many people had bought the product before the customer saw it.
This would be the perfect output:
product customer date value buyer_position people_before
A 123455 2020-01-01 00:01:01 100 1 0
A 123456 2020-01-02 00:02:01 100 2 1
A 523455 2020-01-02 00:02:05 100 NULL 2
A 323455 2020-01-03 00:02:07 100 NULL 2
A 423455 2020-01-03 00:09:01 100 3 2
B 100455 2020-01-01 00:03:01 100 1 0
B 999445 2020-01-01 00:04:01 100 NULL 1
B 122225 2020-01-01 00:04:05 100 2 1
B 993848 2020-01-01 10:04:05 100 3 2
B 133225 2020-01-01 11:04:05 100 NULL 3
B 144225 2020-01-01 12:04:05 100 4 3
As you can see, when the customer 122225 saw the product he wanted, two people have already bought it. In the case of customer 323455, two people have already bought the product A.
I think I should use some window function, like lag(). But lag() function won't get this "cumulative" information. So I'm kind of lost here.
This looks like a window count of non-null values of buyer_position over the preceding rows:
select t.*,
coalesce(count(buyer_position) over(
partition by product
order by date
rows between unbounded preceding and 1 preceding
), 0) as people_before
from mytable t
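The window count of non-null buyer_position values can be sketched in plain Python; a single-pass running counter per product is exactly what the frame "rows between unbounded preceding and 1 preceding" computes (sample rows hard-coded from product A above):

```python
rows = [  # (product, customer, buyer_position), already in date order
    ("A", 123455, 1), ("A", 123456, 2), ("A", 523455, None),
    ("A", 323455, None), ("A", 423455, 3),
]

people_before = []
seen = {}  # product -> count of non-null buyer_positions so far
for product, customer, pos in rows:
    # Emit the count BEFORE this row: "unbounded preceding to 1 preceding".
    people_before.append(seen.get(product, 0))
    if pos is not None:  # COUNT(buyer_position) skips NULLs
        seen[product] = seen.get(product, 0) + 1
```

This reproduces the people_before column 0, 1, 2, 2, 2 from the desired output for product A.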
Hmmm . . . if I understand correctly, you want the running max of buyer_position for the product, minus 1 (partitioning by product only, since product + customer is unique and a per-customer partition would contain a single row):
select t.*,
       max(buyer_position) over (partition by product order by date
                                 rows between unbounded preceding and current row) - 1
from t;