I'm very new to SQL and I simply cannot work out a method for the following:
I have this table with the start dates of the codes:
Row Code Start_Date Product
1 A1 2020-01-01 X
2 A1 2020-05-15 Y
3 A2 2020-02-02 X
4 A3 2020-01-31 Z
5 A3 2020-02-15 Y
6 A3 2020-12-31 X
Ultimately I need to be able to query another table and find out what Product a code had on a certain date, so Code A1 on 2020-01-10 was X, but Code A1 today is Y.
I think I can work out how to use a BETWEEN condition in the WHERE clause, but I cannot work out how to alter the table to have an End_Date so it looks like this:
Row Code Start_Date Product End_Date
1 A1 2020-01-01 X 2020-05-14
2 A1 2020-05-15 Y NULL
3 A2 2020-02-02 X NULL
4 A3 2020-01-31 Z 2020-02-14
5 A3 2020-02-15 Y 2020-12-30
6 A3 2020-12-31 X NULL
Please note the database does not have an End_Date field
I think you want lead():
select t.*,
       dateadd(day, -1,
               lead(start_date) over (partition by code order by start_date)
              ) as end_date
from t;
Note: I would recommend not subtracting one day for the end date, so that the end date is non-inclusive. This makes each end date the same as the next start date, which I find makes it easier to ensure that there are no gaps or overlaps in the data.
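A minimal sketch of the non-inclusive variant, run through Python's built-in sqlite3 module so it can be tried without a SQL Server instance. The table name t and the helper product_on are illustrative assumptions, and the dates are stored as ISO strings so that plain comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (code text, start_date text, product text)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [
        ("A1", "2020-01-01", "X"),
        ("A1", "2020-05-15", "Y"),
        ("A2", "2020-02-02", "X"),
        ("A3", "2020-01-31", "Z"),
        ("A3", "2020-02-15", "Y"),
        ("A3", "2020-12-31", "X"),
    ],
)

# end_date is the next start_date of the same code (exclusive),
# or NULL for the row that is still current.
RANGES = """
    select code, product, start_date,
           lead(start_date) over (partition by code order by start_date) as end_date
    from t
"""


def product_on(code, day):
    """Return the product a code had on an ISO date, or None if it had none yet."""
    row = conn.execute(
        f"""
        select product
        from ({RANGES})
        where code = ?
          and start_date <= ?
          and (end_date is null or ? < end_date)
        """,
        (code, day, day),
    ).fetchone()
    return row[0] if row else None


print(product_on("A1", "2020-01-10"))  # X
print(product_on("A1", "2020-06-01"))  # Y
```

Because the end date is exclusive, the lookup uses start_date <= day and day < end_date rather than BETWEEN, so a date that is exactly a change-over date matches only the newer row.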
Related
I have two tables. Let's call the first one A and the other B.
A is:
ID Doc_ID Date
1 1a 1-Jan-2020
1 1a 1-Feb-2020
1 1b 1-Mar-2020
2 1a 1-Jan-2020
B is:
ID Doc2_ID Date
1 2a 1-Mar-2020
1 2a 1-Apr-2020
2 2b 1-Feb-2020
2 2a 1-Mar-2020
Now, using SQL, I want to create a table which has all the values in Table A and the difference between the date in Table A and the closest date in Table B. For example, 1-Jan-2020 should be subtracted from 1-Mar-2020 and, similarly, 1-Feb-2020 should be subtracted from 1-Mar-2020. Can you please help me with it?
I am using the query below in azure databricks:
%sql
SELECT a.ID, a.Doc_ID, DATEDIFF(b.DATE, a.DATE) as day FROM a
LEFT JOIN b
ON a.ID = b.ID
AND a.DATE < b.DATE
But this generates more than one row per row of Table A, i.e. it subtracts from every date in Table B that fulfils the join conditions. For example, it subtracts 1 Jan 2020 from both 1 Mar 2020 and 1 Apr 2020, whereas I want it to subtract only from the closest date in Table B, i.e. 1 Mar 2020.
The expected outcome should be:
ID Doc_ID day
1 1a 59
1 1a 30
1 1b 0
2 1a 30
The day column for first two rows was obtained after subtracting the respective dates in Table A from 1-Mar-2020 i.e. closest value in Table B for ID 1
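One possible way to keep a single row per row of Table A is to aggregate over the joined dates and take only the earliest qualifying date from Table B. The sketch below is an assumption-laden illustration rather than tested Databricks code: it uses Python's sqlite3, stores the dates as ISO strings, uses julianday() in place of DATEDIFF, and treats "closest" as the nearest date on or after the date in A (hence >= rather than <).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table a (id integer, doc_id text, date text);
    create table b (id integer, doc2_id text, date text);
    insert into a values (1, '1a', '2020-01-01'), (1, '1a', '2020-02-01'),
                         (1, '1b', '2020-03-01'), (2, '1a', '2020-01-01');
    insert into b values (1, '2a', '2020-03-01'), (1, '2a', '2020-04-01'),
                         (2, '2b', '2020-02-01'), (2, '2a', '2020-03-01');
""")

# For every row of a, keep only the earliest b.date on or after a.date
# for the same id, then compute the gap in days.
rows = conn.execute("""
    select a.id, a.doc_id, a.date,
           cast(julianday(min(b.date)) - julianday(a.date) as integer) as day
    from a
    left join b
      on b.id = a.id
     and b.date >= a.date
    group by a.id, a.doc_id, a.date
    order by a.id, a.date
""").fetchall()

for row in rows:
    print(row)
```

In Spark SQL the same shape would presumably be DATEDIFF(MIN(b.DATE), a.DATE) with a GROUP BY on the columns of A. Note that 2020 is a leap year, so the computed gaps for this data are 60, 29, 0 and 31 days.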
I am currently doing an ETL task on record data for a process mining task. The goal is to build a "Directly Follow (DF)" matrix from the record data. This is the flow:
I have record (event) data, for example:
ID ev_ID Act Complete
1 1 A 2020-01-13 11:46
2 1 B 2020-01-13 11:50
3 1 C 2020-01-13 11:55
4 1 D 2020-01-13 12:50
5 1 E 2020-01-13 12:52
6 2 A 2020-01-06 09:13
7 2 B 2020-01-06 09:15
8 2 C 2020-01-06 11:46
9 2 D 2020-01-06 11:46
10 3 A 2020-01-06 08:11
11 3 C 2020-01-06 08:10
12 3 B 2020-01-06 09:46
13 3 D 2020-01-06 11:23
14 3 E 2020-01-06 16:05
As mentioned above, I want to create a DF matrix that shows the "direct follow relation". However, I want the output represented as a table rather than a matrix.
The (desired) output:
From To Frequency
A A 0
A B 3
A C 1
… … …
D E 2
… … …
E E 0
The idea is to calculate the frequency of "direct follow relation" for each activity per ev_id. For example:
We have ev_1 = [ABCD]
The ev_1 has direct follow relation: AB, BC, and CD.
So, we can calculate the direct follow frequency for each activity.
My question:
Is there anyone who can suggest how to make the output using a SQL query?
I am doing the task with PostgreSQL now.
Any help is appreciated. Thank you very much.
I tried it myself, but the result does not seem to be 100% correct.
This is my code:
with ev_data as (
select
ID as eid,
ev_ID as ci,
Act as ea,
Complete as ec
from
table_name
),
A0 as (
select
eid,
ci::int,
row_number() over (partition by ci order by ci, ec) as idx,
ea as act1,
ea as act2
from
ev_data
),
A1 as (
select
L1.ci as ci1,
L1.idx as idx1,
L1.act1 as afrom,
L2.ci as ci2,
L2.idx as idx2,
L2.act2 as ato
from A0 as L1
join A0 as L2
on L1.ci = L2.ci
and L2.idx = L1.idx + 1
)
select
afrom,
ato,
count(*) as count
from A1
group by afrom, ato
order by afrom
Let me assume that your goal is the first matrix. You have two issues:
Getting the adjacent counts.
Generating the rows with 0 values.
Neither is really difficult. The first uses lead() and aggregation. The second uses cross join:
select a_f.act_from, a_t.act_to,
       count(t.id)
from (select distinct act as act_from from table_name
     ) a_f cross join
     (select distinct act as act_to from table_name
     ) a_t left join
     (select t.*,
             lead(act) over (partition by ev_id order by complete) as next_act
      from table_name t
     ) t
     on t.act = a_f.act_from and
        t.next_act = a_t.act_to
group by a_f.act_from, a_t.act_to;
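To try the query on the sample events without a PostgreSQL instance, here is a small sketch using Python's sqlite3. One detail is my own assumption and not part of the answer: id is added as a tie-breaker in the window ordering, because events 8 and 9 share the same Complete timestamp and their order would otherwise be undefined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table table_name (id integer, ev_id integer, act text, complete text)"
)
events = [
    (1, 1, "A", "2020-01-13 11:46"), (2, 1, "B", "2020-01-13 11:50"),
    (3, 1, "C", "2020-01-13 11:55"), (4, 1, "D", "2020-01-13 12:50"),
    (5, 1, "E", "2020-01-13 12:52"), (6, 2, "A", "2020-01-06 09:13"),
    (7, 2, "B", "2020-01-06 09:15"), (8, 2, "C", "2020-01-06 11:46"),
    (9, 2, "D", "2020-01-06 11:46"), (10, 3, "A", "2020-01-06 08:11"),
    (11, 3, "C", "2020-01-06 08:10"), (12, 3, "B", "2020-01-06 09:46"),
    (13, 3, "D", "2020-01-06 11:23"), (14, 3, "E", "2020-01-06 16:05"),
]
conn.executemany("insert into table_name values (?, ?, ?, ?)", events)

rows = conn.execute("""
    select a_f.act_from, a_t.act_to, count(t.id) as frequency
    from (select distinct act as act_from from table_name) a_f
    cross join (select distinct act as act_to from table_name) a_t
    left join (
        -- next activity within the same ev_id; id breaks timestamp ties
        select t.*,
               lead(act) over (partition by ev_id order by complete, id) as next_act
        from table_name t
    ) t
      on t.act = a_f.act_from
     and t.next_act = a_t.act_to
    group by a_f.act_from, a_t.act_to
    order by a_f.act_from, a_t.act_to
""").fetchall()

# Map (from, to) -> frequency for easy lookup.
freq = {(f, t): n for f, t, n in rows}
print(freq[("A", "B")], freq[("D", "E")], freq[("A", "A")])
```

The cross join yields every From/To pair, and count(t.id) counts only matched rows, so pairs that never occur come out as 0 instead of being dropped.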
TableA
ID Counter Value
1 1 10
1 2 28
1 3 34
1 4 22
1 5 80
2 1 15
2 2 50
2 3 39
2 4 33
2 5 99
TableB
StartDate EndDate
2020-01-01 2020-01-11
2020-01-02 2020-01-12
2020-01-03 2020-01-13
2020-01-04 2020-01-14
2020-01-05 2020-01-15
2020-01-06 2020-01-16
TableC (output)
ID Counter StartDate EndDate Val
1 1 2020-01-01 2020-01-11 10
2 1 2020-01-01 2020-01-11 15
1 2 2020-01-02 2020-01-12 28
2 2 2020-01-02 2020-01-12 50
1 3 2020-01-03 2020-01-13 34
2 3 2020-01-03 2020-01-13 39
1 4 2020-01-04 2020-01-14 22
2 4 2020-01-04 2020-01-14 33
1 5 2020-01-05 2020-01-15 80
2 5 2020-01-05 2020-01-15 99
1 1 2020-01-06 2020-01-16 10
2 1 2020-01-06 2020-01-16 15
I am attempting to come up with some SQL to create TableC. TableC takes the rows of TableB in chronological order and, for each ID in TableA, assigns the next Counter in the sequence to that Start/End date combination; when it reaches the end of the counter, it starts back at 1.
Is something like this even possible with SQL?
Yes, this is possible. Try the following:
Calculate the maximum value of Counter in TableA into max_counter, using SELECT MAX(Counter) ....
Add a row_number identifier to each row in TableB, so that the matching Counter value can be found, using SELECT ROW_NUMBER() OVER (ORDER BY StartDate) ....
Establish the relation between the row number in TableB and Counter in TableA like this: ... FROM TableB JOIN TableA ON COALESCE(NULLIF(TableB.row_number % max_counter, 0), max_counter) = TableA.Counter.
Then combine these queries into a single query using CTEs (Common Table Expressions), as described in the official documentation.
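The steps above can be sketched as one query. This is only an illustration under assumptions: it runs in SQLite via Python's sqlite3 (the question does not name a database), and the CTE names m and b and the column rn are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table TableA (ID integer, Counter integer, Value integer);
    create table TableB (StartDate text, EndDate text);
    insert into TableA values (1,1,10),(1,2,28),(1,3,34),(1,4,22),(1,5,80),
                              (2,1,15),(2,2,50),(2,3,39),(2,4,33),(2,5,99);
    insert into TableB values ('2020-01-01','2020-01-11'),('2020-01-02','2020-01-12'),
                              ('2020-01-03','2020-01-13'),('2020-01-04','2020-01-14'),
                              ('2020-01-05','2020-01-15'),('2020-01-06','2020-01-16');
""")

rows = conn.execute("""
    with m as (
        -- step 1: highest counter in the cycle
        select max(Counter) as max_counter from TableA
    ),
    b as (
        -- step 2: number the date ranges chronologically
        select StartDate, EndDate,
               row_number() over (order by StartDate) as rn
        from TableB
    )
    -- step 3: wrap the row number into 1..max_counter and join on it
    select a.ID, a.Counter, b.StartDate, b.EndDate, a.Value
    from b
    cross join m
    join TableA a
      on a.Counter = coalesce(nullif(b.rn % m.max_counter, 0), m.max_counter)
    order by b.StartDate, a.ID
""").fetchall()

for row in rows:
    print(row)
```

NULLIF turns a remainder of 0 into NULL, and COALESCE then substitutes max_counter, so row numbers 5, 10, ... map to Counter 5 rather than 0, while row 6 wraps around to Counter 1.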
Consider below approach
select id, counter, StartDate, EndDate, value
from tableA
join (
select *, mod(row_number() over(order by StartDate) - 1, 5) + 1 as counter
from tableB
)
using (counter)
If applied to the sample data in your question, the output contains the same rows as TableC.
I have a transaction table that looks like this:
transaction_start store_no item_no amount post_voided
2021-03-01 10:00:00 001 101 45 N
2021-03-01 10:00:00 001 105 25 N
2021-03-01 10:00:00 001 109 40 N
2021-03-01 10:05:00 002 103 35 N
2021-03-01 10:05:00 002 135 20 N
2021-03-01 10:08:00 001 140 2 N
2021-03-01 10:11:00 001 101 -45 Y
2021-03-01 10:11:00 001 105 -25 Y
2021-03-01 10:11:00 001 109 -40 Y
The table does not have an id column; two receipts from the same store_no never share the same transaction_start.
Whenever a transaction is post voided, it is repeated with the same store_no and item_no, but with a negative amount and an equal or later transaction_start. The column post_voided is then equal to 'Y'.
In the example above, rows 1-3 have the same transaction_start and store_no, and thus belong to the same receipt, which contains three different items (101, 105, 109). The same logic applies to the other rows: rows 4-5 belong to one receipt, and so on. In the example, 4 different receipts can be seen. The last receipt, given by the last three rows, is a post void of the first receipt (rows 1-3).
What I want to do is change the transaction_start of the post_voided = 'Y' transactions (in my example only one receipt, represented by the last three rows) to the closest datetime of the matching original receipt, i.e. the one with the same store_no and item_no, the opposite (positive) amount, and post_voided = 'N' (in my example, the first three rows). The transaction_start of the post-voided receipt is always equal to or later than that of the "original" receipt.
Desired output:
transaction_start store_no item_no amount post_voided
2021-03-01 10:00:00 001 101 45 N
2021-03-01 10:00:00 001 105 25 N
2021-03-01 10:00:00 001 109 40 N
2021-03-01 10:05:00 002 103 35 N
2021-03-01 10:05:00 002 135 20 N
2021-03-01 10:08:00 001 140 2 N
2021-03-01 10:00:00 001 101 -45 Y
2021-03-01 10:00:00 001 105 -25 Y
2021-03-01 10:00:00 001 109 -40 Y
Here a link of the table: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=26142fa24e46acb4213b96c86f4eb94b
Thanks in advance!
Consider below
select a.* replace(ifnull(b.transaction_start, a.transaction_start) as transaction_start)
from `project.dataset.table` a
left join (
select * replace(-amount as amount)
from `project.dataset.table`
where post_voided = 'N'
) b
using (store_no, item_no)
If applied to the sample data in your question, the result matches the desired output.
Consider below for new / extended example (https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=91f9f180fd672e7c357aa48d18ced5fd)
select x.* replace(ifnull(y.original_transaction_start, x.transaction_start) as transaction_start)
from `project.dataset.table` x
left join (
select b.transaction_start, b.store_no, b.item_no, b.amount amount,
max(a.transaction_start) original_transaction_start
from `project.dataset.table` a
join `project.dataset.table` b
on a.store_no = b.store_no
and a.item_no = b.item_no
and a.amount = -b.amount
and a.post_voided = 'N'
and b.post_voided = 'Y'
and a.transaction_start < b.transaction_start
group by b.transaction_start, b.store_no, b.item_no, b.amount
) y
using (store_no, item_no, amount, transaction_start)
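For readers without BigQuery, the same matching idea can be sketched with a correlated subquery in SQLite through Python's sqlite3. This is a swapped-in formulation, not the query above: the table name tx is hypothetical, and the comparison uses <= because the question says the void's transaction_start is equal to or later than the original's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table tx (transaction_start text, store_no text, item_no text,"
    " amount integer, post_voided text)"
)
conn.executemany(
    "insert into tx values (?, ?, ?, ?, ?)",
    [
        ("2021-03-01 10:00:00", "001", "101", 45, "N"),
        ("2021-03-01 10:00:00", "001", "105", 25, "N"),
        ("2021-03-01 10:00:00", "001", "109", 40, "N"),
        ("2021-03-01 10:05:00", "002", "103", 35, "N"),
        ("2021-03-01 10:05:00", "002", "135", 20, "N"),
        ("2021-03-01 10:08:00", "001", "140", 2, "N"),
        ("2021-03-01 10:11:00", "001", "101", -45, "Y"),
        ("2021-03-01 10:11:00", "001", "105", -25, "Y"),
        ("2021-03-01 10:11:00", "001", "109", -40, "Y"),
    ],
)

rows = conn.execute("""
    select case
             when v.post_voided = 'Y' then coalesce(
                 -- latest original row with same store/item and opposite amount
                 (select max(o.transaction_start)
                  from tx o
                  where o.post_voided = 'N'
                    and o.store_no = v.store_no
                    and o.item_no = v.item_no
                    and o.amount = -v.amount
                    and o.transaction_start <= v.transaction_start),
                 v.transaction_start)
             else v.transaction_start
           end as transaction_start,
           v.store_no, v.item_no, v.amount, v.post_voided
    from tx v
    order by v.rowid
""").fetchall()

for row in rows:
    print(row)
```

Taking max() of the earlier originals picks the closest one in time, and the coalesce leaves a voided row unchanged when no original is found.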
I have 2 dimension tables and 1 fact table as follows:
user_dim
user_id user_name user_joining_date
1 Steve 2013-01-04
2 Adam 2012-11-01
3 John 2013-05-05
4 Tony 2012-01-01
5 Dan 2010-01-01
6 Alex 2019-01-01
7 Kim 2019-01-01
bundle_dim
bundle_id  bundle_name            bundle_type  bundle_cost_per_day
101        movies and TV          prime        5.5
102        TV and sports          prime        6.5
103        Cooking                prime        7
104        Sports and news        prime        5
105        kids movie             extra        2
106        kids educative         extra        3.5
107        spanish news           extra        2.5
108        Spanish TV and sports  extra        3.5
109        Travel                 extra        2
plans_fact
user_id bundle_id bundle_start_date bundle_end_date
1 101 2019-10-10 2020-10-10
2 107 2020-01-15 (null)
2 106 2020-01-15 2020-12-31
2 101 2020-01-15 (null)
2 103 2020-01-15 2020-02-15
1 101 2020-10-11 (null)
1 107 2019-10-10 2020-10-10
1 105 2019-10-10 2020-10-10
4 101 2021-01-01 2021-02-01
3 104 2020-02-17 2020-03-17
2 108 2020-01-15 (null)
4 102 2021-01-01 (null)
4 103 2021-01-01 (null)
4 108 2021-01-01 (null)
5 103 2020-01-15 (null)
5 101 2020-01-15 2020-02-15
6 101 2021-01-01 2021-01-17
6 101 2021-01-20 (null)
6 108 2021-01-01 (null)
7 104 2020-02-17 (null)
7 103 2020-01-17 2020-01-18
1 102 2020-12-11 (null)
2 106 2021-01-01 (null)
7 107 2020-01-15 (null)
Note: a NULL bundle_end_date indicates an active subscription.
A user's active days can be calculated as bundle_end_date - bundle_start_date (for the given bundle).
Total revenue per user can be calculated as total number of active days * bundle rate per day.
I am looking to write a query to find the revenue generated per user per year.
Here is what I have for the overall revenue per user:
select pf.user_id
, sum(datediff(day, pf.bundle_start_date, coalesce(pf.bundle_end_date, getdate())) * bd.bundle_cost_per_day) total_cost_per_bundle
from plans_fact pf
inner join bundle_dim bd on bd.bundle_id = pf.bundle_id
group by pf.user_id
order by pf.user_id;
You need a 'year' table to help split each row that spans multiple years into its separate years. For each year, you also need to recalculate the start and end dates. That's what I do in the yearParsed CTE in the code below. I hard-code the years into the join that creates y. You will probably do it differently, but however you get those values will work.
After that, you sum pretty much as you did before, just adding the year column to your grouping.
Aside from that, all I did was move the null-coalescing logic into the CTE to make the overall logic simpler.
with yearParsed as (
    select pf.*,
           y.year,
           startDt = iif(pf.bundle_start_date > y.startDt, pf.bundle_start_date, y.startDt),
           endDt = iif(ap.bundle_end_date < y.endDt, ap.bundle_end_date, y.endDt)
    from plans_fact pf
    cross apply (select bundle_end_date = isnull(pf.bundle_end_date, getdate())) ap
    join (values
              (2019, '2019-01-01', '2019-12-31'),
              (2020, '2020-01-01', '2020-12-31'),
              (2021, '2021-01-01', '2021-12-31')
         ) y (year, startDt, endDt)
      on pf.bundle_start_date <= y.endDt
     and ap.bundle_end_date >= y.startDt
)
select yp.user_id,
       yp.year,
       total_cost_per_bundle = sum(datediff(day, yp.startDt, yp.endDt) * bd.bundle_cost_per_day)
from yearParsed yp
join bundle_dim bd on bd.bundle_id = yp.bundle_id
group by yp.user_id,
         yp.year
order by yp.user_id,
         yp.year;
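The clamping idea can be tried on a few of the sample rows with Python's sqlite3. This is a sketch under stated assumptions, not a port of the query above: the as_of parameter stands in for getdate() so the result is reproducible, only three plan rows are loaded, and each year ends at the following 1 January (exclusive) so that the day counts on either side of a year boundary add up to the overall day difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table bundle_dim (bundle_id integer, bundle_cost_per_day real);
    create table plans_fact (user_id integer, bundle_id integer,
                             bundle_start_date text, bundle_end_date text);
    insert into bundle_dim values (101, 5.5), (103, 7.0);
    insert into plans_fact values
        (1, 101, '2019-10-10', '2020-10-10'),
        (5, 103, '2020-01-15', null),
        (7, 103, '2020-01-17', '2020-01-18');
""")

QUERY = """
    with years(year, start_dt, end_dt) as (
        values (2019, '2019-01-01', '2020-01-01'),
               (2020, '2020-01-01', '2021-01-01'),
               (2021, '2021-01-01', '2022-01-01')
    ),
    year_parsed as (
        -- clamp every plan to the part that falls inside each year
        select pf.user_id, pf.bundle_id, y.year,
               max(pf.bundle_start_date, y.start_dt) as start_dt,
               min(coalesce(pf.bundle_end_date, :as_of), y.end_dt) as end_dt
        from plans_fact pf
        join years y
          on pf.bundle_start_date < y.end_dt
         and coalesce(pf.bundle_end_date, :as_of) > y.start_dt
    )
    select yp.user_id, yp.year,
           sum((julianday(yp.end_dt) - julianday(yp.start_dt))
               * bd.bundle_cost_per_day) as revenue
    from year_parsed yp
    join bundle_dim bd on bd.bundle_id = yp.bundle_id
    group by yp.user_id, yp.year
    order by yp.user_id, yp.year
"""

rows = conn.execute(QUERY, {"as_of": "2021-03-01"}).fetchall()
for row in rows:
    print(row)
```

For user 1, the single plan from 2019-10-10 to 2020-10-10 is split into 83 days in 2019 and 283 days in 2020, which together give the 366 days of the whole plan.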
Now, if this is a common need, you should probably create a base table for your 'year' table. If it isn't, but you don't want to keep coming back to hard-code the year information into the y table for this report, you can do this:
-- table variable holding one row per calendar year
declare @yearTable table (
    year int,
    startDt char(10),
    endDt char(10)
);
-- recursive CTE: start at the earliest bundle year, add one year up to the current year
with y as (
    select year = year(min(pf.bundle_start_date))
    from plans_fact pf
    union all
    select year + 1
    from y
    where year < year(getdate())
)
insert @yearTable
select year,
       startDt = convert(char(4), year) + '-01-01',
       endDt = convert(char(4), year) + '-12-31'
from y;
This will create the appropriate years for you. But you can see why creating a base table may be preferred if you have this or a similar need often.
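A rough SQLite equivalent of the recursive year generation, run through Python's sqlite3, looks like this. The helper year_table and its last_year parameter are assumptions made for the example (the parameter replaces year(getdate()) so the output does not depend on the day it is run), and only three plan rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table plans_fact (user_id integer, bundle_id integer,"
    " bundle_start_date text, bundle_end_date text)"
)
conn.executemany(
    "insert into plans_fact values (?, ?, ?, ?)",
    [
        (1, 101, "2019-10-10", "2020-10-10"),
        (2, 107, "2020-01-15", None),
        (4, 101, "2021-01-01", "2021-02-01"),
    ],
)


def year_table(last_year):
    """Return (year, startDt, endDt) rows from the earliest plan year to last_year."""
    return conn.execute(
        """
        with recursive y(year) as (
            -- anchor: year of the earliest bundle start date
            select cast(strftime('%Y', min(bundle_start_date)) as integer)
            from plans_fact
            union all
            -- recursive step: keep adding one year until last_year is reached
            select year + 1 from y where year < ?
        )
        select year,
               printf('%04d-01-01', year) as startDt,
               printf('%04d-12-31', year) as endDt
        from y
        order by year
        """,
        (last_year,),
    ).fetchall()


for row in year_table(2021):
    print(row)
```

The generated rows can then be joined in place of the hard-coded values list, which is the same role @yearTable plays in the SQL Server version.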