SQL Merge 3 tables by date where some dates are missing - sql

I have 3 tables that I want to merge, each with a different column of interest. I also have an id variable that I want to do separate merges "within" id. The idea is that I want to merge X, Y, and Z by date (within ID), and have missing values if that date does not exist for a particular variable.
Table X:
ID Date X
1 2012-01-01 101
1 2012-01-02 102
1 2012-01-03 103
1 2012-01-04 104
1 2012-01-05 105
2 2012-01-01 150
Table Y:
ID Date Y
1 2012-01-01 301
1 2012-01-02 302
1 2012-01-03 303
1 2012-01-11 311
2 2012-01-01 350
Table Z:
ID Date Z
1 2012-01-01 401
1 2012-01-03 403
1 2012-01-04 404
1 2012-01-11 411
1 2012-01-21 421
2 2012-01-01 450
Desired Result Table:
ID Date X Y Z
1 2012-01-01 101 301 401
1 2012-01-02 102 302 .
1 2012-01-03 103 303 403
1 2012-01-04 104 . 404
1 2012-01-05 105 . .
1 2012-01-11 . 311 411
1 2012-01-21 . . 421
2 2012-01-01 150 350 450
Any ideas how to write this SQL statement? I've tried messing around with "full joins" and where statements for cross products, but I keep getting duplicate values for some of my ID-date combinations, or sometimes no ID.
Any help would be appreciated.

Joins can be tricky things. My usual approach is to form the set of Keys first, and then use those keys to get what I want.
SELECT source.ID, source.Date, x.X, y.Y, z.Z
FROM
(
SELECT ID, Date
FROM TableX
UNION
SELECT ID, Date
FROM TableY
UNION
SELECT ID, Date
FROM TableZ
) as source
LEFT JOIN TableX x ON source.ID = x.ID AND source.Date = x.Date
LEFT JOIN TableY y ON source.ID = y.ID AND source.Date = y.Date
LEFT JOIN TableZ z ON source.ID = z.ID AND source.Date = z.Date
ORDER BY source.ID, source.Date

Related

Compare values in one table basing on previous dates

I have a problem which I can not resolve in dax. I have a table in report view -
Real table has other values and contains 10000+ records, but below is example what I would like to achieve:
(in real table the difference between dates in tales does not always equals 1)
user
username
salary
date
1
x
123
14-10-2022
2
y
455
11-10-2022
3
z
333
13-10-2022
4
t
222
12-10-2022
5
h
111
10-10-2022
desired output:
user
username
salary
date
salary (date-1 day)
salary (date-3 days)
1
x
123
14-10-2022
333
455
2
y
455
11-10-2022
111
3
z
333
13-10-2022
222
111
4
t
222
12-10-2022
455
5
h
111
10-10-2022
I know that the way could be self join like
on table1.user = table2.user and table1.date = table2.date - 1 and table1.date = table2.date - 3
but is there any other idea how to achieve desired tables without doing many joins?
Thanks you very much in advance

Match group of variables and values with the nearest datetime

I have a transaction table that looks like that:
transaction_start store_no item_no amount post_voided
2021-03-01 10:00:00 001 101 45 N
2021-03-01 10:00:00 001 105 25 N
2021-03-01 10:00:00 001 109 40 N
2021-03-01 10:05:00 002 103 35 N
2021-03-01 10:05:00 002 135 20 N
2021-03-01 10:08:00 001 140 2 N
2021-03-01 10:11:00 001 101 -45 Y
2021-03-01 10:11:00 001 105 -25 Y
2021-03-01 10:11:00 001 109 -40 Y
The table does not have an id column; the transaction_start for a given store_no will never be the same.
Whenever a transaction is post voided, the transaction is then repeated with the same store_no, item_no but with a negative/minus amount and an equal or higher transaction_start. Also, the column post_voided is then equal to 'Y'.
In the example above, the rows 1-3 have the same transaction_start and store_no, thus belonging to the same receipt, containing three different items (101, 105, 109). The same logic is applied to the other rows: rows 4-5 belong to a same receipt, and so on. In the example, 4 different receipts can be seen. The last receipt, given by the last three rows, is a post voided of the first receipt (rows 1-3).
What I want to do is to change the transaction_start for the post_voided = 'Y' transactions (in my example, only one receipt - represented by the last three rows - has it) to the next/closest datetime of a similar receipt that has the variables store_no, item_no and (negative) amount (but post_voided = 'N') (in my example, the similar ticket is given by the first three rows - store_no, all item_no and (positive) amount match). The transaction_start for the post voided receipt is always equal or higher than the "original" receipt.
Desired output:
transaction_start store_no item_no amount post_voided
2021-03-01 10:00:00 001 101 45 N
2021-03-01 10:00:00 001 105 25 N
2021-03-01 10:00:00 001 109 40 N
2021-03-01 10:05:00 002 103 35 N
2021-03-01 10:05:00 002 135 20 N
2021-03-01 10:08:00 001 140 2 N
2021-03-01 10:00:00 001 101 -45 Y
2021-03-01 10:00:00 001 105 -25 Y
2021-03-01 10:00:00 001 109 -40 Y
Here a link of the table: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=26142fa24e46acb4213b96c86f4eb94b
Thanks in advance!
Consider below
select a.* replace(ifnull(b.transaction_start, a.transaction_start) as transaction_start)
from `project.dataset.table` a
left join (
select * replace(-amount as amount)
from `project.dataset.table`
where post_voided = 'N'
) b
using (store_no, item_no)
if applied to sample data in your question - output is
Consider below for new / extended example (https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=91f9f180fd672e7c357aa48d18ced5fd)
select x.* replace(ifnull(y.original_transaction_start, x.transaction_start) as transaction_start)
from `project.dataset.table` x
left join (
select b.transaction_start, b.store_no, b.item_no, b.amount amount,
max(a.transaction_start) original_transaction_start
from `project.dataset.table` a
join `project.dataset.table` b
on a.store_no = b.store_no
and a.item_no = b.item_no
and a.amount = -b.amount
and a.post_voided = 'N'
and b.post_voided = 'Y'
and a.transaction_start < b.transaction_start
group by b.transaction_start, b.store_no, b.item_no, b.amount
) y
using (store_no, item_no, amount, transaction_start)
with output

Query to find active days per year to find revenue per user per year

I have 2 dimension tables and 1 fact table as follows:
user_dim
user_id
user_name
user_joining_date
1
Steve
2013-01-04
2
Adam
2012-11-01
3
John
2013-05-05
4
Tony
2012-01-01
5
Dan
2010-01-01
6
Alex
2019-01-01
7
Kim
2019-01-01
bundle_dim
bundle_id
bundle_name
bundle_type
bundle_cost_per_day
101
movies and TV
prime
5.5
102
TV and sports
prime
6.5
103
Cooking
prime
7
104
Sports and news
prime
5
105
kids movie
extra
2
106
kids educative
extra
3.5
107
spanish news
extra
2.5
108
Spanish TV and sports
extra
3.5
109
Travel
extra
2
plans_fact
user_id
bundle_id
bundle_start_date
bundle_end_date
1
101
2019-10-10
2020-10-10
2
107
2020-01-15
(null)
2
106
2020-01-15
2020-12-31
2
101
2020-01-15
(null)
2
103
2020-01-15
2020-02-15
1
101
2020-10-11
(null)
1
107
2019-10-10
2020-10-10
1
105
2019-10-10
2020-10-10
4
101
2021-01-01
2021-02-01
3
104
2020-02-17
2020-03-17
2
108
2020-01-15
(null)
4
102
2021-01-01
(null)
4
103
2021-01-01
(null)
4
108
2021-01-01
(null)
5
103
2020-01-15
(null)
5
101
2020-01-15
2020-02-15
6
101
2021-01-01
2021-01-17
6
101
2021-01-20
(null)
6
108
2021-01-01
(null)
7
104
2020-02-17
(null)
7
103
2020-01-17
2020-01-18
1
102
2020-12-11
(null)
2
106
2021-01-01
(null)
7
107
2020-01-15
(null)
note: NULL bundle_end_date refers to active subscription.
user active days can be calculated as: bundle_end_date - bundle_start_date (for the given bundle)
total revenue per user could be calculated as : total no. of active days * bundle rate per day
I am looking to write a query to find revenue generated per user per year.
Here is what I have for the overall revenue per user:
select pf.user_id
, sum(datediff(day, pf.bundle_start_date, coalesce(pf.bundle_end_date, getdate())) * bd.price_per_day) total_cost_per_bundle
from plans_fact pf
inner join bundle_dim bd on bd.bundle_id = pf.bundle_id
group by pf.user_id
order by pf.user_id;
You need a 'year' table to help parse out each multi-year spanning row into it's seperate years. For each year, you need to also recalculate the start and end dates. That's what I do in the yearParsed cte in the code below. I hard code the years into the join statement that creates y. You probably will do it different but however you get those values will work.
After that, pretty much sum as you did before, just adding the year column to your grouping.
Aside from that, all I did was move the null coalesce logic to the cte to make the overall logic simpler.
with yearParsed as (
select pf.*,
y.year,
startDt = iif(pf.bundle_start_date > y.startDt, pf.bundle_start_date, y.startDt),
endDt = iif(ap.bundle_end_date < y.endDt, ap.bundle_end_date, y.endDt)
from plans_fact pf
cross apply (select bundle_end_date = isnull(pf.bundle_end_date, getdate())) ap
join (values
(2019, '2019-01-01', '2019-12-31'),
(2020, '2020-01-01', '2020-12-31'),
(2021, '2021-01-01', '2021-12-31')
) y (year, startDt, endDt)
on pf.bundle_start_date <= y.endDt
and ap.bundle_end_date >= y.startDt
)
select yp.user_id,
yp.year,
total_cost_per_bundle = sum(datediff(day, yp.startDt, yp.endDt) * bd.bundle_cost_per_day)
from yearParsed yp
join bundle_dim bd on bd.bundle_id = yp.bundle_id
group by yp.user_id,
yp.year
order by yp.user_id,
yp.year;
Now, if this is common, you should probably create a base-table for your 'year' table. But if it's not common, but for this report you don't want to have to keep coming back to hard-code the year information into the y table, you can do this:
declare #yearTable table (
year int,
startDt char(10),
endDt char(10)
);
with y as (
select year = year(min(pf.bundle_start_date))
from #plans_fact pf
union all
select year + 1
from y
where year < year(getdate())
)
insert #yearTable
select year,
startDt = convert(char(4),year) + '-01-01',
endDt = convert(char(4),year) + '-12-31'
from y;
and it will create the appropriate years for you. But you can see why creating a base table may be preferred if you have this or a similar need often.

Convert This SQL Query to ANSI SQL

I would like to convert this SQL query into ANSI SQL. I am having trouble wrapping my head around the logic of this query.
I use Snowflake Data Warehouse, but it does not understand this query because of the 'delete' statement right before join, so I am trying to break it down. From my understanding the row number column is giving me the order from 1 to N based on timestamp and placing it in C. Then C is joined against itself on the rows other than the first row (based on id) and placed in C1. Then C1 is deleted from the overall data, which leaves only the first row.
I may be understanding the logic incorrectly, but I am not used to seeing the 'delete' statement right before a join. Let me know if I got the logic right, or point me in the right direction.
This query was copy/pasted from THIS stackoverflow question which has the exact situation I am trying to solve, but on a much larger scale.
with C as
(
select ID,
row_number() over(order by DT) as rn
from YourTable
)
delete C1
from C as C1
inner join C as C2
on C1.rn = C2.rn-1 and
C1.ID = C2.ID
The specific problem I am trying to solve is this. Let's assume I have this table. I need to partition the rows by primary key combinations (primKey 1 & 2) while maintaining timestamp order.
ID primKey1 primKey2 checkVar1 checkVar2 theTimestamp
100 1 2 302 423 2001-07-13
101 3 6 506 236 2005-10-25
100 1 2 302 423 2002-08-15
101 3 6 506 236 2008-12-05
101 3 6 300 100 2010-06-10
100 1 2 407 309 2005-09-05
100 1 2 302 423 2012-05-09
100 1 2 302 423 2003-07-24
Once the rows are partitioned and the timestamp is ordered within each partition, I need to delete the duplicate checkVar combination (checkVar 1 & 2) rows until the next change. Thus leaving me with the earliest unique row. The rows with asterisks are the ones which need to be removed since they are duplicates.
ID primKey1 primKey2 checkVar1 checkVar2 theTimestamp
100 1 2 302 423 2001-07-13
*100 1 2 302 423 2002-08-15
*100 1 2 302 423 2003-07-24
100 1 2 407 309 2005-09-05
100 1 2 302 423 2012-05-09
101 3 6 506 236 2005-10-25
*101 3 6 506 236 2008-12-05
101 3 6 300 100 2010-06-10
This is the final result. As you can see for ID=100, even though the 1st and 3rd record are the same, the checkVar combination changed in between, which is fine. I am only removing the duplicates until the values change.
ID primKey1 primKey2 checkVar1 checkVar2 theTimestamp
100 1 2 302 423 2001-07-13
100 1 2 407 309 2005-09-05
100 1 2 302 423 2012-05-09
101 3 6 506 236 2005-10-25
101 3 6 300 100 2010-06-10
If you want to keep the earliest row for each id, then you can use:
delete from yourtable yt
where yt.dt > (select min(yt2.dt)
from yourtable yt
where yt2.id = yd.id
);
Your query would not do this, if that is your intent.

Select max date No result I want

SELECT
mat.matid,
MAX (to_date(to_char (matdatetable.matdateupdate,'yyyy-mm-dd'),'yyyy-mm-dd')),
mat.matuserid,
mat.matname,
mat.matprice
FROM
matdatetable
LEFT JOIN mat ON matdatetable.sourceid = mat.matid
RESULT
matid matdate update matuserid matname matprice
-------------------------------------------------------------
1 2012-01-01 0:0:0:0 0111-1 aaa 100
1 2012-08-01 0:0:0:0 0111-1 aaa 125
1 2013-08-30 0:0:0:0 0111-1 aaa 150
2 2012-01-01 0:0:0:0 0222-1 bbb 130
2 2012-08-21 0:0:0:0 0222-1 bbb 110
2 2013-07-30 0:0:0:0 0222-1 bbb 100
3 2012-01-01 0:0:0:0 0565-1 ccc 100
3 2013-09-30 0:0:0:0 0565-1 ccc 230
But I want to. Results
matid matdate update matuserid matname matprice
------------------------------------------------------------------
1 2013-08-30 0:0:0:0 0111-1 aaa 150
2 2013-07-30 0:0:0:0 0222-1 bbb 100
3 2013-09-30 0:0:0:0 0565-1 ccc 230
SELECT DISTINCT ON (1)
t.sourceid AS matid
,t.matdateupdate::date AS matdate_update
,m.matuserid
,m.matname
,m.matprice
FROM matdatetable t
LEFT JOIN mat m ON m.matid = t.sourceid
ORDER BY 1, t.matdateupdate DESC;
Gives you the latest (according to matdateupdate) entry per sourceid. Your question isn't clear what you want exactly.
Using sourceid rather than matid, since you have a LEFT JOIN and matid could be NULL. Or your use of LEFT JOIN is incorrect ...
Explanation for DISTINCT ON in this related answer:
Select first row in each GROUP BY group?
t.matdateupdate::date casts your timestamp (assuming for lack of information) to date. That seems to be what you want. If you really need the redundant time 00:00, use datetrunc('day', t.matdateupdate) instead.