SQL join table and limiting condition matching result

I have a table of rates and a table of transactions, and I want to work out the conversion rate for each transaction based on the latest currency-rate update at or before the transaction's timestamp.
Table - rates
('2018-04-01 00:00:00', 'EUR', 'RUB', '1.71'),
('2018-04-01 01:00:05', 'EUR', 'RUB', '1.82'),
('2018-04-01 00:00:00', 'USD', 'RUB', '0.71'),
('2018-04-01 00:00:05', 'USD', 'RUB', '0.82'),
('2018-04-01 00:01:00', 'USD', 'RUB', '0.92'),
('2018-04-01 01:02:00', 'USD', 'RUB', '0.62'),
Table - transactions
('2018-04-01 00:00:00', 1, 'EUR', 2.45),
('2018-04-01 01:00:00', 1, 'EUR', 8.45),
('2018-04-01 01:30:00', 1, 'USD', 3.5),
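For reference, a minimal Postgres-style setup for this sample could look like the following (ts, from_currency, to_currency, and currency are taken from the queries below; rate, id, and amount are assumed column names):

create table rates (ts timestamp, from_currency text, to_currency text, rate numeric);
create table transactions (ts timestamp, id int, currency text, amount numeric);

insert into rates values
('2018-04-01 00:00:00', 'EUR', 'RUB', 1.71),
('2018-04-01 01:00:05', 'EUR', 'RUB', 1.82),
('2018-04-01 00:00:00', 'USD', 'RUB', 0.71),
('2018-04-01 00:00:05', 'USD', 'RUB', 0.82),
('2018-04-01 00:01:00', 'USD', 'RUB', 0.92),
('2018-04-01 01:02:00', 'USD', 'RUB', 0.62);

insert into transactions values
('2018-04-01 00:00:00', 1, 'EUR', 2.45),
('2018-04-01 01:00:00', 1, 'EUR', 8.45),
('2018-04-01 01:30:00', 1, 'USD', 3.5);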
My attempt to limit those extra rows:
select * from transactions tr1
left outer join rates ex1
on tr1.ts >= ex1.ts
and tr1.currency = ex1.from_currency;
The result I'm getting contains every exchange-rate update that happened before each transaction:
2018-04-01 00:00:00 1 EUR 2.45 2018-04-01 00:00:00 EUR RUB 1.71 (correct)
2018-04-01 01:00:00 1 EUR 8.45 2018-04-01 00:00:00 EUR RUB 1.71 (correct)
2018-04-01 01:30:00 1 USD 3.5 2018-04-01 01:02:00 USD RUB 0.62 (only this should remain)
2018-04-01 01:30:00 1 USD 3.5 2018-04-01 00:01:00 USD RUB 0.92
2018-04-01 01:30:00 1 USD 3.5 2018-04-01 00:00:05 USD RUB 0.82
2018-04-01 01:30:00 1 USD 3.5 2018-04-01 00:00:00 USD RUB 0.71
I tried adding my own filtering condition (to my previous query):
where ex1.ts = (select max(ex2.ts) from rates ex2
where ex2.from_currency=ex1.from_currency
and ex2.to_currency=ex1.from_currency);
But that doesn't return anything...
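EDIT: I think I see one issue with it: the subquery compares ex2.to_currency with ex1.from_currency ('RUB' vs 'EUR'), which never matches, so max(ex2.ts) is NULL and the equality filters out every row. It also never bounds the rate timestamp by the transaction timestamp. A repaired sketch of the same correlated-subquery idea (moving the condition into the ON clause to keep the left join intact):

select *
from transactions tr1
left outer join rates ex1
on tr1.currency = ex1.from_currency
and ex1.ts = (select max(ex2.ts)
              from rates ex2
              where ex2.from_currency = ex1.from_currency
              and ex2.to_currency = ex1.to_currency
              and ex2.ts <= tr1.ts);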

Postgres has the very handy distinct on for getting one row per group. This should do what you want:
select t.*, r.*
from transactions t left join
(select distinct on (from_currency, to_currency) r.*
from rates r
order by from_currency, to_currency, ts desc
) r
on r.from_currency = t.currency and
r.to_currency = 'RUB';
EDIT:
If you want the latest rate as of each transaction, then use a lateral join:
select t.*, r.*
from transactions t left join lateral
(select r.*
from rates r
where r.from_currency = t.currency and
r.ts <= t.ts
order by r.ts desc
limit 1
) r
on 1=1;
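Against the sample data this should return one rate per transaction: 1.71 for both EUR transactions (the 01:00:05 update comes just after the 01:00:00 transaction) and 0.62 for the USD one, matching the rows marked as correct above.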

You can solve this using window/analytic functions. Partition by transaction (here its timestamp and currency, since the sample has no unique transaction id) and order by ex1.ts in descending order, so that the rate closest to the transaction timestamp gets row number 1. The tr1.ts >= ex1.ts condition in the join ensures that we only get exchange rates from on or before the transaction time.
select dt.*
from
(
select tr1.*, ex1.*,
row_number() over (partition by tr1.ts, tr1.currency order by ex1.ts desc) as rn
from transactions tr1
left outer join rates ex1
on tr1.ts >= ex1.ts
and tr1.currency = ex1.from_currency
) as dt
where dt.rn = 1
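With that partition, the query should return exactly one rate row per transaction: the three rows marked above as the desired result.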

Related

List of IDs for from_date to end_date (other columns differ)

Yesterday I posted a question regarding a problem I had in Oracle SQL.
But I still have one problem, which I forgot in my first question: my table also includes other columns for rows with the same DESC. Like this:
ID     Desc  FromDate  ToDate   Color
ID_01  A     08.2017   10.2020  Red
ID_02  B     02.2019   09.2029  Blue
ID_03  C     02.2014   02.2019  Black
ID_04  D     04.2010   01.2019  Yellow
ID_05  D     01.2019   09.2029  Green
This is the reason why I still get both IDs (ID_04 and ID_05) with the posted SQL. What is the best way to prevent this?
Thank you!
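One option is to unpivot the two date columns into rows, filter to the date window of interest, and keep just one row per DESC with row_number():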
select "Date"
,ID
,"Desc"
from (
select "Date"
,ID
,"Desc"
,row_number() over(partition by "Desc" order by ID) as rn
from t
unpivot
("Date" for FDTD in ("FromDate","ToDate"))
where "Date" between date '2019-01-01' and date '2022-09-01'
) t
where rn = 1
Date       ID     Desc
01-OCT-20  ID_01  A
01-FEB-19  ID_02  B
01-FEB-19  ID_03  C
01-JAN-19  ID_04  D
Fiddle
Given the sample data:
CREATE TABLE table_name (ID, "DESC", FromDate, ToDate, Colour) AS
--SELECT 'ID_01', 'A', DATE '2017-08-01', DATE '2020-10-01', 'Red' FROM DUAL UNION ALL
--SELECT 'ID_02', 'B', DATE '2019-02-01', DATE '2029-09-01', 'Blue' FROM DUAL UNION ALL
--SELECT 'ID_03', 'C', DATE '2014-02-01', DATE '2019-02-01', 'Black' FROM DUAL UNION ALL
SELECT 'ID_04', 'D', DATE '2010-04-01', DATE '2019-01-01', 'Yellow' FROM DUAL UNION ALL
SELECT 'ID_05', 'D', DATE '2019-01-01', DATE '2019-09-01', 'Green' FROM DUAL;
(and just focusing on the D values for DESC)
Then the accepted answer to the linked question:
WITH calendar (month) AS (
SELECT ADD_MONTHS(DATE '2019-01-01', LEVEL - 1)
FROM DUAL
CONNECT BY ADD_MONTHS(DATE '2019-01-01', LEVEL - 1) <= DATE '2019-09-01'
)
SELECT month,
MIN(id) AS id,
"DESC"
FROM calendar c
INNER JOIN table_name t
ON (c.month BETWEEN t.fromdate and t.todate)
GROUP BY month, "DESC"
ORDER BY month, id;
only outputs a single row per month for each DESC value, containing the corresponding minimum ID:
MONTH                ID     DESC
2019-01-01 00:00:00  ID_04  D
2019-02-01 00:00:00  ID_05  D
2019-03-01 00:00:00  ID_05  D
2019-04-01 00:00:00  ID_05  D
2019-05-01 00:00:00  ID_05  D
2019-06-01 00:00:00  ID_05  D
2019-07-01 00:00:00  ID_05  D
2019-08-01 00:00:00  ID_05  D
2019-09-01 00:00:00  ID_05  D
The reason you get ID_04 for 2019-01-01 and then ID_05 for the remaining months is that ID_04 has a ToDate of 2019-01-01, so it cannot be included in the range from 2019-02-01 through 2019-09-01; those months are outside the upper bound of its range. Instead, you get ID_05 for those months because they are within its range.
fiddle

UNNEST an interval of all dates without grouping by column

I want to UNNEST an interval of all dates within the min and max of a date column without grouping by any other column.
Building off the answer in this post, I can get close, but instead of getting the date range grouped by each job_name, I want to generate the total date range across all job_name values. So each job_name should be exploded into three rows, one each for 2021-08-20 through 2021-08-22.
WITH
dataset AS (
SELECT *
FROM
( VALUES
('A', DATE '2021-08-21'), ('A', DATE '2021-08-22'),
('B', DATE '2021-08-20'), ('B', DATE '2021-08-21')
) AS d (job_name, run_date)),
nested_dates AS (
select job_name, sequence(min(run_date), max(run_date), interval '1' day) seq
from dataset
group by job_name)
SELECT job_name, dates
FROM nested_dates
CROSS JOIN UNNEST(seq) AS t(dates)
Current output:
# job_name dates
1 A 2021-08-21 00:00:00.000
2 A 2021-08-22 00:00:00.000
3 B 2021-08-20 00:00:00.000
4 B 2021-08-21 00:00:00.000
Desired output:
# job_name dates
1 A 2021-08-20 00:00:00.000
2 A 2021-08-21 00:00:00.000
3 A 2021-08-22 00:00:00.000
4 B 2021-08-20 00:00:00.000
5 B 2021-08-21 00:00:00.000
6 B 2021-08-22 00:00:00.000
One approach is to use window functions and a distinct select:
-- sample data
WITH dataset(job_name, run_date) AS (
VALUES ('A', DATE '2021-08-21'),
('A', DATE '2021-08-22'),
('B', DATE '2021-08-20'),
('B', DATE '2021-08-21')),
nested_dates AS (
select distinct job_name, max (run_date) over() max_run_date, min (run_date) over() min_run_date
from dataset)
-- query
select job_name, dates
from nested_dates,
unnest (sequence(min_run_date, max_run_date, interval '1' day)) AS t(dates)
order by job_name, dates;
Output:
job_name  dates
A         2021-08-20
A         2021-08-21
A         2021-08-22
B         2021-08-20
B         2021-08-21
B         2021-08-22
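A variant of the same idea, as a sketch assuming the Presto/Trino semantics used in the question: compute the global min/max once in a cross-joined derived table instead of with window functions, then unnest the shared sequence per job_name:

-- sample data
WITH dataset(job_name, run_date) AS (
VALUES ('A', DATE '2021-08-21'),
('A', DATE '2021-08-22'),
('B', DATE '2021-08-20'),
('B', DATE '2021-08-21'))
-- query: every distinct job_name gets the full global date range
SELECT d.job_name, dates
FROM (SELECT DISTINCT job_name FROM dataset) d
CROSS JOIN (SELECT sequence(min(run_date), max(run_date), interval '1' day) AS seq
            FROM dataset) s
CROSS JOIN UNNEST(s.seq) AS t(dates)
ORDER BY job_name, dates;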

Count total customer_id's not partitioned by column

I would like to calculate the total number of customers without adding an additional subquery. The count should not be partitioned by country, but rather by the month_ column alone.
EDIT:
I updated the query to use GROUPING SETS
Current query:
select date_trunc('month',date_) as month_,
country,
count(distinct customer_id) as total_customers
from table_a
GROUP BY GROUPING SETS (
(date_trunc('month',date_), country),
(date_trunc('month',date_))
)
Current output
month_ country total_customers_per_country
2020-01-01 US 320
2020-01-01 GB 360
2020-01-01 680
2020-02-01 US 345
2020-02-01 GB 387
2020-02-01 732
Desired output:
month_ country total_customers_per_country total_customers
2020-01-01 US 320 680
2020-01-01 GB 360 680
2020-02-01 US 345 732
2020-02-01 GB 387 732
This may depend on the version of SQL Server you are using, but you are likely looking for "window" functions.
I believe something along the lines of the following will give you the result you are looking for:
select date_trunc('month',date_) as month_,
country,
count(distinct customer_id) as total_customers_by_country,
count(distinct customer_id) OVER (partition by date_trunc('month',date_)) as total_customers
from table_a
https://learn.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql?view=sql-server-ver15
You can use a subquery to group by month/country pairs and then use sum over a window partitioned by month:
-- sample data
WITH dataset (id, date, country) AS (
VALUES (1, date '2020-01-01', 'US'),
(2, date '2020-01-01', 'US'),
(1, date '2020-01-01', 'GB'),
(3, date '2020-01-02', 'US'),
(1, date '2020-01-02', 'GB'),
(1, date '2020-02-01', 'US')
)
--query
select *,
sum(total_customers_per_country) over (partition by month) total_customers
from (
select date_trunc('month', date) as month,
country,
count(distinct id) total_customers_per_country
from dataset
group by 1, country
)
order by month, country desc
Output:
month       country  total_customers_per_country  total_customers
2020-01-01  US       3                            4
2020-01-01  GB       1                            4
2020-02-01  US       1                            1

Oracle Sort of gaps and island query

Instead of writing long sentences and paragraphs let me show the data and what I want to achieve :
create table ssb_price (itm_no varchar2(10), price number, price_code varchar2(10), valid_from_dt date, valid_to_dt date);
insert into ssb_price values ('A001', 83, 'AB', '01-JAN-21', '05-JAN-21');
insert into ssb_price values ('A001', 83, 'AB', '06-JAN-21', '12-JAN-21');
insert into ssb_price values ('A001', 98, 'SPQ', '13-JAN-21', '17-JAN-21');
insert into ssb_price values ('A001', 83, 'AB', '19-JAN-21', '24-JAN-21');
insert into ssb_price values ('A001', 83, 'DE', '25-JAN-21', '30-JAN-21');
insert into ssb_price values ('A001', 83, 'DE', '31-JAN-21', '04-FEB-21');
insert into ssb_price values ('A001', 77, 'XY', '07-FEB-21', '12-FEB-21');
insert into ssb_price values ('A001', 77, 'XY', '15-FEB-21', '20-FEB-21');
insert into ssb_price values ('A001', 62, 'SD', '23-FEB-21', '26-FEB-21');
insert into ssb_price values ('A001', 59, 'SD', '26-FEB-21', '03-MAR-21');
For a particular itm_no and price, if the from and to dates are continuous then I should get that value as a single range. For price 77 there is a gap of 2 days (the 13th and 14th) between the to-date and the next from-date, so it is not continuous. The desired output was posted as a screenshot snipped from Excel.
I previously asked this clubbed together with another post, but that post was old and hasn't gotten any feedback, so I'm creating this one. Please let me know if I should merge this post with the previous one.
This is basically a gaps-and-islands problem. But instead of aggregating to reduce the number of rows, you want to use window functions at the last step.
In your data, the time frames neatly tile. That suggests using lag() and a cumulative sum to define the groups:
select p.*,
min(valid_from_dt) over (partition by itm_no, price, price_code, grp) as new_valid_from_dt,
max(valid_to_dt) over (partition by itm_no, price, price_code, grp) as new_valid_to_dt
from (select p.*,
sum(case when valid_from_dt = prev_valid_to_dt + interval '1' day then 0 else 1 end) over
(partition by itm_no, price, price_code order by valid_from_dt) as grp
from (select p.*,
lag(valid_to_dt) over (partition by itm_no, price, price_code order by valid_from_dt) as prev_valid_to_dt
from ssb_price p
) p
) p
order by itm_no, valid_from_dt;
Here is a db<>fiddle.
From Oracle 12, you can use MATCH_RECOGNIZE:
SELECT itm_no,
price,
price_code,
valid_from_dt,
valid_to_dt,
MIN( valid_from_dt ) OVER ( PARTITION BY itm_no, mnum ) AS new_valid_from_dt,
MAX( valid_to_dt ) OVER ( PARTITION BY itm_no, mnum ) AS new_valid_to_dt
FROM ssb_price
MATCH_RECOGNIZE(
PARTITION BY itm_no
ORDER BY valid_from_dt, valid_to_dt
MEASURES
MATCH_NUMBER() AS mnum
ALL ROWS PER MATCH
PATTERN ( start_range continued_range* )
DEFINE
continued_range AS (
valid_from_dt = PREV( valid_to_dt ) + 1
AND price = PREV( price )
)
)
and, from Oracle 10g, you can use the MODEL clause:
SELECT itm_no,
price,
price_code,
valid_from_dt,
valid_to_dt,
mn,
MIN( valid_from_dt ) OVER ( PARTITION BY itm_no, mn ) AS new_valid_from_dt,
MAX( valid_to_dt ) OVER ( PARTITION BY itm_no, mn ) AS new_valid_to_dt
FROM (
SELECT *
FROM (
SELECT s.*,
ROW_NUMBER() OVER ( PARTITION BY itm_no ORDER BY valid_from_dt ) AS rn
FROM ssb_price s
)
MODEL
PARTITION BY ( itm_no )
DIMENSION BY ( rn )
MEASURES ( price, price_code, valid_from_dt, valid_to_dt, 1 AS mn )
RULES (
mn[rn>1] = mn[cv(rn)-1]
+
CASE
WHEN valid_from_dt[cv(rn)] = valid_to_dt[cv(rn)-1] + 1
AND price[cv(rn)] = price[cv(rn) - 1]
THEN 0
ELSE 1
END
)
)
Which, for the sample data:
create table ssb_price (itm_no, price, price_code, valid_from_dt, valid_to_dt) AS
SELECT 'A001', 83, 'AB', DATE '2021-01-01', DATE '2021-01-05' FROM DUAL UNION ALL
SELECT 'A001', 83, 'AB', DATE '2021-01-06', DATE '2021-01-12' FROM DUAL UNION ALL
SELECT 'A001', 98, 'SPQ', DATE '2021-01-13', DATE '2021-01-17' FROM DUAL UNION ALL
SELECT 'A001', 83, 'AB', DATE '2021-01-19', DATE '2021-01-24' FROM DUAL UNION ALL
SELECT 'A001', 83, 'DE', DATE '2021-01-25', DATE '2021-01-30' FROM DUAL UNION ALL
SELECT 'A001', 83, 'DE', DATE '2021-01-31', DATE '2021-02-04' FROM DUAL UNION ALL
SELECT 'A001', 77, 'XY', DATE '2021-02-07', DATE '2021-02-12' FROM DUAL UNION ALL
SELECT 'A001', 77, 'XY', DATE '2021-02-15', DATE '2021-02-20' FROM DUAL UNION ALL
SELECT 'A001', 62, 'SD', DATE '2021-02-23', DATE '2021-02-26' FROM DUAL UNION ALL
SELECT 'A001', 59, 'SD', DATE '2021-02-26', DATE '2021-03-03' FROM DUAL;
Outputs:
ITM_NO  PRICE  PRICE_CODE  VALID_FROM_DT        VALID_TO_DT          NEW_VALID_FROM_DT    NEW_VALID_TO_DT
A001    83     AB          2021-01-01 00:00:00  2021-01-05 00:00:00  2021-01-01 00:00:00  2021-01-12 00:00:00
A001    83     AB          2021-01-06 00:00:00  2021-01-12 00:00:00  2021-01-01 00:00:00  2021-01-12 00:00:00
A001    98     SPQ         2021-01-13 00:00:00  2021-01-17 00:00:00  2021-01-13 00:00:00  2021-01-17 00:00:00
A001    83     AB          2021-01-19 00:00:00  2021-01-24 00:00:00  2021-01-19 00:00:00  2021-02-04 00:00:00
A001    83     DE          2021-01-25 00:00:00  2021-01-30 00:00:00  2021-01-19 00:00:00  2021-02-04 00:00:00
A001    83     DE          2021-01-31 00:00:00  2021-02-04 00:00:00  2021-01-19 00:00:00  2021-02-04 00:00:00
A001    77     XY          2021-02-07 00:00:00  2021-02-12 00:00:00  2021-02-07 00:00:00  2021-02-12 00:00:00
A001    77     XY          2021-02-15 00:00:00  2021-02-20 00:00:00  2021-02-15 00:00:00  2021-02-20 00:00:00
A001    62     SD          2021-02-23 00:00:00  2021-02-26 00:00:00  2021-02-23 00:00:00  2021-02-26 00:00:00
A001    59     SD          2021-02-26 00:00:00  2021-03-03 00:00:00  2021-02-26 00:00:00  2021-03-03 00:00:00
db<>fiddle here

BigQuery: how to do semi left join?

I couldn't come up with a good title for this question. Sorry about that.
I have two tables A and B. Both have timestamps and share a common ID between them. Here are the schemas of both tables:
Table A:
========
a_id int,
common_id int,
ts timestamp
...
Table B:
========
b_id int,
common_id int,
ts timestamp,
temperature int
Table A is more like device data, recorded whenever a device changes its status. Table B is more like IoT data, containing the temperature of a device every minute or so.
What I want to do is create a Table C from these two tables. Table C would in essence be Table A plus the temperature from the closest-in-time row of Table B.
How can I do this purely in BigQuery SQL? The temperature info doesn't need to be precise.
The option below (for BigQuery Standard SQL) assumes that, in addition to the temperature from table b, you still need all the rest of the values from the respective row.
#standardSQL
SELECT
ARRAY_AGG(
STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature)
ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND))
LIMIT 1
)[SAFE_OFFSET(0)].*
FROM `project.dataset.table_a` a
LEFT JOIN `project.dataset.table_b` b
ON a.common_id = b.common_id
AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
GROUP BY TO_JSON_STRING(a)
I smoke-tested it with the generated dummy data below:
#standardSQL
WITH `project.dataset.table_a` AS (
SELECT CAST(1000000 * RAND() AS INT64) a_id, common_id, ts
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 45*60 + 27 SECOND)) ts
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
), `project.dataset.table_b` AS (
SELECT CAST(1000000 * RAND() AS INT64) b_id, common_id, ts, CAST(60 + 40 * RAND() AS INT64) temperature
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2018-01-01 00:00:00', '2018-01-01 23:59:59', INTERVAL 1 MINUTE)) ts
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 10)) common_id
)
SELECT
ARRAY_AGG(
STRUCT(a_id, a.common_id, a.ts, b_id, b.ts AS b_ts, temperature)
ORDER BY ABS(TIMESTAMP_DIFF(a.ts, b.ts, SECOND))
LIMIT 1
)[SAFE_OFFSET(0)].*
FROM `project.dataset.table_a` a
LEFT JOIN `project.dataset.table_b` b
ON a.common_id = b.common_id
AND ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30
GROUP BY TO_JSON_STRING(a)
with an example of a few rows from the output:
Row a_id common_id ts b_id b_ts temperature
1 276623 1 2018-01-01 00:00:00 UTC 166995 2018-01-01 00:00:00 UTC 74
2 218354 1 2018-01-01 00:45:27 UTC 464901 2018-01-01 00:45:00 UTC 87
3 265634 1 2018-01-01 01:30:54 UTC 565385 2018-01-01 01:31:00 UTC 87
4 758075 1 2018-01-01 02:16:21 UTC 55894 2018-01-01 02:16:00 UTC 84
5 306355 1 2018-01-01 03:01:48 UTC 844429 2018-01-01 03:02:00 UTC 92
6 348502 1 2018-01-01 03:47:15 UTC 375859 2018-01-01 03:47:00 UTC 90
7 774920 1 2018-01-01 04:32:42 UTC 438164 2018-01-01 04:33:00 UTC 61
Here I set table_b to have a temperature for each minute for 10 devices during the whole day of '2018-01-01', and in table_a I set the status to change every 45 min 27 sec for the same 10 devices during the same day. a_id and b_id are just random numbers between 0 and 999999.
Note: the ABS(TIMESTAMP_DIFF(a.ts, b.ts, MINUTE)) < 30 clause in the JOIN controls the period within which it is considered acceptable to look for the closest ts (in case some IoT entries are absent from table_b).
Measure the closest time by TIMESTAMP_DIFF(a.ts, b.ts, SECOND), taking its absolute value to get the closest timestamp in either direction:
WITH a AS (
SELECT 1 id, TIMESTAMP('2018-01-01 11:01:00') ts
UNION ALL SELECT 1, ('2018-01-02 10:00:00')
UNION ALL SELECT 2, ('2018-01-02 10:00:00')
)
, b AS (
SELECT 1 id, TIMESTAMP('2018-01-01 12:01:00') ts, 43 temp
UNION ALL SELECT 1, TIMESTAMP('2018-01-01 12:06:00'), 47
)
SELECT *,
(SELECT temp
FROM b
WHERE a.id=b.id
ORDER BY ABS(TIMESTAMP_DIFF(a.ts,b.ts, SECOND))
LIMIT 1) temp
FROM a
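For the sample data here, this should return temp 43 for the first row of a (the 12:01:00 reading is 60 minutes away versus 65 minutes for 12:06:00), 47 for the second row (12:06:00 is marginally closer to the next day's 10:00:00), and NULL for id 2, which has no rows in b.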