SQL - Fuzzy JOIN on Timestamp columns within X amount of time

Say I have two tables:
a:

timestamp                  precipitation
2015-08-03 21:00:00 UTC    3
2015-08-03 22:00:00 UTC    3
2015-08-04 03:00:00 UTC    4
2016-02-04 18:00:00 UTC    4
and b:

timestamp                  loc
2015-08-03 21:23:00 UTC    San Francisco
2016-02-04 16:04:00 UTC    New York
I want a fuzzy join in which every row in b tries to match a row in a. Criteria:
The time difference is within 60 minutes. If no match exists within 60 minutes, do not include that row in the output.
If a row in b could join onto two rows in a, pick the closer one in time.
Example Output:

timestamp                  loc              precipitation
2015-08-03 21:00:00 UTC    San Francisco    3

What you need is an ASOF join. I don't think there is an easy way to do this in BigQuery. Other databases, like Kinetica (and, I think, ClickHouse), support ASOF functions that can be used to perform 'fuzzy' joins.
The syntax for Kinetica would be something like the following.
SELECT *
FROM a
LEFT JOIN b
ON ASOF(a.timestamp, b.timestamp, INTERVAL '0' MINUTES, INTERVAL '60' MINUTES, MIN)
The ASOF function above sets up an interval of 60 minutes within which to look for matches on the right side table. When there are multiple matches, it selects the one that is closest (MAX would pick the one that is farthest away).

Based on my understanding of the data you provided, the query below should work for your use case.
create temporary table a as (
  select TIMESTAMP('2015-08-03 21:00:00 UTC') as ts, 3 as precipitation union all
  select TIMESTAMP('2015-08-03 22:00:00 UTC'), 3 union all
  select TIMESTAMP('2015-08-04 03:00:00 UTC'), 4 union all
  select TIMESTAMP('2016-02-04 18:00:00 UTC'), 4
);

create temporary table b as (
  select TIMESTAMP('2015-08-03 21:23:00 UTC') as ts, 'San Francisco' as loc union all
  select TIMESTAMP('2016-02-04 16:04:00 UTC') as ts, 'New York' as loc
);

select b_ts, a_ts, loc, precipitation, diff_time_sec
from (
  select
    b.ts as b_ts,
    a.ts as a_ts,
    b.loc,
    a.precipitation,
    ABS(TIMESTAMP_DIFF(b.ts, a.ts, SECOND)) as diff_time_sec
  from b
  inner join a
    on b.ts between timestamp_sub(a.ts, interval 60 minute)
                and timestamp_add(a.ts, interval 60 minute)
)
qualify RANK() OVER (partition by b_ts order by diff_time_sec) = 1
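The same nearest-match-within-60-minutes logic can be checked end to end with standard window functions. Below is a minimal sketch in SQLite (via Python's sqlite3), a substitute I chose because it is easy to run locally; timestamps are stored as plain UTC strings and compared in epoch seconds, and the table and column names mirror the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (ts TEXT, precipitation INTEGER);
INSERT INTO a VALUES
  ('2015-08-03 21:00:00', 3),
  ('2015-08-03 22:00:00', 3),
  ('2015-08-04 03:00:00', 4),
  ('2016-02-04 18:00:00', 4);
CREATE TABLE b (ts TEXT, loc TEXT);
INSERT INTO b VALUES
  ('2015-08-03 21:23:00', 'San Francisco'),
  ('2016-02-04 16:04:00', 'New York');
""")

rows = con.execute("""
SELECT b_ts, loc, precipitation FROM (
  SELECT b.ts AS b_ts, b.loc, a.precipitation,
         -- rank a-rows per b-row by absolute time distance
         RANK() OVER (PARTITION BY b.ts
                      ORDER BY ABS(strftime('%s', b.ts) - strftime('%s', a.ts))) AS rnk
  FROM b
  -- only candidates within 60 minutes survive the join
  JOIN a ON ABS(strftime('%s', b.ts) - strftime('%s', a.ts)) <= 3600
)
WHERE rnk = 1
""").fetchall()
print(rows)  # [('2015-08-03 21:23:00', 'San Francisco', 3)]
```

The San Francisco row matches both 21:00 (23 min away) and 22:00 (37 min away); the rank picks 21:00. The New York row is 116 minutes from its nearest candidate, so it is dropped, matching the expected output above.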

Related

How to average values in one table based on the condition involving another table in SQL?

I have two tables. One defines time intervals (beginning and end); the intervals are not equal in length. The other contains a product ID and the start and end date of each product.
TableOne:

Interval     StartDateTime          EndDateTime
202020201    2020-01-01 00:00:00    2020-02-10 00:00:00
202020202    2020-02-10 00:00:00    2020-02-20 00:00:00
TableTwo:

ProductID    ProductStartDateTime    ProductEndDateTime
ASSDWE1      2018-01-04 00:12:00     2020-04-10 20:00:30
ADFGHER      2020-01-05 00:11:30     2020-01-19 00:00:00
ASDFVBN      2017-10-10 00:12:10     2020-02-23 00:23:23
I need to compute the average length of the products from TableTwo that existed during each time interval defined in TableOne. If a product existed beyond the end of a time interval, its length for that interval is measured from its start date to the end of the interval.
I tried the following:
select
    a.*,
    (select AVG(datediff(day, b.ProductStartDateTime,
                         IIF(b.ProductEndDateTime > a.EndDateTime, a.EndDateTime, b.ProductEndDateTime)))
     -- compute average length of the products
     FROM #TableTwo b
     WHERE not (b.ProductEndDateTime <= a.StartDateTime)
       and not (b.ProductStartDateTime >= a.EndDateTime)
     -- select products that existed during the interval from #TableOne
    ) as AverageProductLength
from #TableOne a
I get the error "Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression."
The result I want:

Interval     StartDateTime          EndDateTime            AverageProductLength
202020201    2020-01-01 00:00:00    2020-02-10 00:00:00    23
202020202    2020-02-10 00:00:00    2020-02-20 00:00:00    34.5
Is there a way I can do the averaging?
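The error is SQL Server refusing an aggregate that references both an outer column (a.EndDateTime) and an inner one. A common workaround is to turn the correlated subquery into an ordinary join plus GROUP BY. Here is a sketch of that rewrite in SQLite via Python's sqlite3 (an assumption for illustration; in SQL Server itself you'd keep DATEDIFF/IIF). Note julianday gives fractional days, unlike DATEDIFF(day, ...), so the numbers differ slightly from whole-day counts.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TableOne (Interval TEXT, StartDateTime TEXT, EndDateTime TEXT);
INSERT INTO TableOne VALUES
  ('202020201', '2020-01-01 00:00:00', '2020-02-10 00:00:00'),
  ('202020202', '2020-02-10 00:00:00', '2020-02-20 00:00:00');
CREATE TABLE TableTwo (ProductID TEXT, ProductStartDateTime TEXT, ProductEndDateTime TEXT);
INSERT INTO TableTwo VALUES
  ('ASSDWE1', '2018-01-04 00:12:00', '2020-04-10 20:00:30'),
  ('ADFGHER', '2020-01-05 00:11:30', '2020-01-19 00:00:00'),
  ('ASDFVBN', '2017-10-10 00:12:10', '2020-02-23 00:23:23');
""")

rows = con.execute("""
SELECT a.Interval,
       -- length = product start .. min(product end, interval end), in days;
       -- AVG ignores the NULLs a LEFT JOIN produces for intervals with no products
       AVG(julianday(MIN(b.ProductEndDateTime, a.EndDateTime))
           - julianday(b.ProductStartDateTime)) AS AverageProductLength
FROM TableOne a
LEFT JOIN TableTwo b
  ON b.ProductEndDateTime > a.StartDateTime     -- overlap test, same as the
 AND b.ProductStartDateTime < a.EndDateTime     -- NOT(...) pair in the question
GROUP BY a.Interval
ORDER BY a.Interval
""").fetchall()
print(rows)
```

The second interval averages ASSDWE1 (about 777 days) and ASDFVBN (about 863 days); ADFGHER ends before the interval starts and is correctly excluded.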

SQL Find Last 30 Days records count grouped by

I am trying to retrieve a daily count of customers per status over a dynamic window - the last 30 days.
For each day, the query should show how many customers there are per status (A, B, C) over the last 30 days (i.e. today() - 29 days). Every customer has one status at a time but can change from one status to another over their lifetime. The purpose of the query is to show customer 'movement' over their lifetime. I've generated a series of dates ranging from the first date a customer was created until today.
I've put together the following query, but something in it appears to be incorrect: the results show most days as having the same count across all statuses, which is not possible, since new customers are created every day. We checked with another, simpler query and confirmed that the split between statuses is not equal.
I tried to depict below the data and the SQL I use to reach the optimal result.
Starting point (example table customer_statuses):
customer_id | status | created_at
---------------------------------------------------
abcdefg1234 B 2019-08-22
abcdefg1234 C 2019-01-17
...
abcdefg1234 A 2018-01-18
bcdefgh2232 A 2017-09-02
ghijklm4950 B 2018-06-06
statuses - A,B,C
There is no sequential order for the statuses, a customer can have any status at the start of the business relationship and switch between statuses throughout their lifetime.
table customers:
id | f_name | country | created_at |
---------------------------------------------------------------------
abcdefg1234 Michael FR 2018-01-18
bcdefgh2232 Sandy DE 2017-09-02
....
ghijklm4950 Daniel NL 2018-06-06
SQL - current version:
WITH customer_list AS (
SELECT
DISTINCT a.id,
a.created_at
FROM
customers a
),
dates AS (
SELECT
generate_series(
MIN(DATE_TRUNC('day', created_at)::DATE),
MAX(DATE_TRUNC('day', now())::DATE),
'1d'
)::date AS day
FROM customers a
),
customer_statuses AS (
SELECT
customer_id,
status,
created_at,
ROW_NUMBER() OVER
(
PARTITION BY customer_id
ORDER BY created_at DESC
) col
FROM
customer_status
)
SELECT
day,
(
SELECT
COUNT(DISTINCT id) AS accounts
FROM customers
WHERE created_at::date BETWEEN day - 29 AND day
),
status
FROM dates d
LEFT JOIN customer_list cus
ON d.day = cus.created_at
LEFT JOIN customer_statuses cs
ON cus.id = cs.customer_id
WHERE
cs.col = 1
GROUP BY 1,3
ORDER BY 1 DESC,3 ASC
Currently what the results from the query look like:
day | count | status
-------------------------
2020-01-24 1230 C
2020-01-24 1230 B
2020-01-24 1230 A
2020-01-23 1200 C
2020-01-23 1200 B
2020-01-23 1200 A
2020-01-22 1150 C
2020-01-22 1150 B
...
2017-01-01 50 C
2017-01-01 50 B
2017-01-01 50 A
Two things I've noticed from the results above: most of the time they show the same count across all statuses on a given day, and there are days on which only two statuses appear, which should not be the case. If no new accounts with a certain status are created on a given day, the count from the previous day should be carried over - right? Or is this a problem with the query I created, or with the logic I have in mind?
Perhaps I'm expecting a result that will not happen logically?
Required result:
day | count | status
-------------------------
2020-01-24 1230 C
2020-01-24 1000 B
2020-01-24 2500 A
2020-01-23 1200 C
2020-01-23 1050 B
2020-01-23 2450 A
2020-01-22 1160 C
2020-01-22 1020 B
2020-01-22 2400 A
...
2017-01-01 10 C
2017-01-01 4 B
2017-01-01 50 A
Thank You!
Your query seems overly complicated. Here is another approach:
Use lead() to get when the status ends for each customer status record.
Use generate_series() to generate the days.
The rest is just filtering and aggregation:
select gs.dte, cs.status, count(*)
from (select cs.*,
lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) as next_ca
from customer_statuses cs
) cs cross join lateral
generate_series(cs.created_at, cs.next_ca - interval '1 day', interval '1 day') gs(dte)
where gs.dte >= now()::date - interval '30 day'
I've altered the query a bit because I noticed that I was getting duplicate records on the days a customer changes status: one record with the old status and one with the new status.
For example output with #Gordon's query:
dte        | status
---------------------------
2020-02-12   B
...          ...
2020-02-01   A
2020-02-01   B
2020-01-31   A
2020-01-30   A
I've adapted the query (see below). The results now depict the status changes correctly (no duplicate records on the day of a change); however, the series runs only up to now()::date - interval '1 day' and does not include now()::date (today). I'm not sure why, and I can't find the right logic to ensure both that the dates correctly depict each customer's status and that the returned statuses include today.
Adjusted query:
select gs.dte, cs.status, count(*)
from (select cs.*,
lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) - INTERVAL '1day' as next_ca
from customer_statuses cs
) cs cross join lateral
generate_series(cs.created_at, cs.next_ca, interval '1 day') gs(dte)
where gs.dte >= now()::date - interval '30 day'
The two adjustments:
a - subtracted one day inside the lead function (line 3):
lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) - INTERVAL '1 day' as next_ca
b - removed the one-day decrease from next_ca in generate_series (line 6), which previously read:
generate_series(cs.created_at, cs.next_ca - interval '1 day', interval '1 day')
The adjustments also seem counter-intuitive, as I appear to be taking the one-day interval away from one part of the query only to add it back in another (which to me should yield the same result).
Example of the output with the adjusted query:
dte        | status
---------------------------
2020-02-11   B
...          ...
2020-02-01   B
2020-01-31   A
2020-01-30   A
Thanks for your help!
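The lead()-plus-day-expansion idea above can be sketched in a self-contained way. The snippet below emulates it in SQLite via Python's sqlite3 (an assumption for illustration: SQLite has no generate_series, so a recursive CTE expands each status span into days) on a tiny made-up data set. Setting the lead() default to the day after "today" makes the current day appear in the output, and a status-change day is counted only under the new status, which addresses both issues raised in the follow-up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer_statuses (customer_id TEXT, status TEXT, created_at TEXT);
INSERT INTO customer_statuses VALUES
  ('abcdefg1234', 'A', '2020-01-01'),
  ('abcdefg1234', 'B', '2020-01-03'),   -- status change on Jan 3
  ('bcdefgh2232', 'A', '2020-01-02');
""")

today = '2020-01-05'
rows = con.execute("""
WITH RECURSIVE spans AS (
  -- each status lasts until the day before the next status starts;
  -- the default for the last span is the day AFTER today, so today is included
  SELECT customer_id, status, created_at AS day,
         COALESCE(LEAD(created_at) OVER (PARTITION BY customer_id
                                         ORDER BY created_at),
                  date(:today, '+1 day')) AS end_day
  FROM customer_statuses
),
days AS (
  SELECT customer_id, status, day, end_day FROM spans
  UNION ALL
  SELECT customer_id, status, date(day, '+1 day'), end_day
  FROM days
  WHERE date(day, '+1 day') < end_day        -- stop just before the next span
)
SELECT day, status, COUNT(*) AS n
FROM days
GROUP BY day, status
ORDER BY day, status
""", {"today": today}).fetchall()
print(rows)
```

On Jan 3 the first customer is counted only as B (no duplicate A row), and Jan 5 (today) appears in the output.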

Join tables with dates within intervals of 5 min (get avg)

I want to join two tables based on timestamp. The problem is that the two tables don't have exactly the same timestamps, so I want to join them on a nearby timestamp, within a 5-minute interval.
The query needs to be done using two common table expressions; each common table expression needs to get the timestamps and round/average them so they can match.
Freezer    Timestamp              Temperature_1
1          2018-04-25 09:45:00    10
1          2018-04-25 09:50:00    11
1          2018-04-25 09:55:00    11

Freezer    Timestamp              Temperature_2
1          2018-04-25 09:46:00    15
1          2018-04-25 09:52:00    13
1          2018-04-25 09:59:00    12

My desired result would be:

Freezer    Timestamp              Temperature_1    Temperature_2
1          2018-04-25 09:45:00    10               15
1          2018-04-25 09:50:00    11               13
1          2018-04-25 09:55:00    11               12
The current query that I'm working on is:
WITH Temperatures_1 AS (
    SELECT Freezer, Temperature_1, Timestamp
    FROM TABLE_A
),
Temperatures_2 AS (
    SELECT Freezer, Temperature_2, Timestamp
    FROM TABLE_B
)
SELECT A.Freezer, A.Timestamp, Temperature_1, Temperature_2
FROM Temperatures_1 as A
RIGHT JOIN Temperatures_2 as B
ON A.FREEZER = B.FREEZER
WHERE A.Timestamp = B.Timestamp
You may want to modify your join criteria instead of filtering the output. Use BETWEEN to bracket your join value on the timestamps. I chose +/- 150 seconds because that's 2-1/2 minutes to either side (a 5-minute range to match). You may need something different.
;WITH Temperatures_1 AS (
    SELECT Freezer, Temperature_1, Timestamp
    FROM TABLE_A
),
Temperatures_2 AS (
    SELECT Freezer, Temperature_2, Timestamp
    FROM TABLE_B
)
SELECT A.Freezer, A.Timestamp, Temperature_1, Temperature_2
FROM Temperatures_1 as A
RIGHT JOIN Temperatures_2 as B
ON A.FREEZER = B.FREEZER
AND A.Timestamp BETWEEN DATEADD(SECOND, -150, B.Timestamp)
                    AND DATEADD(SECOND, 150, B.Timestamp)
Alternatively, you could change the join key to include the timestamp, rounding the datetimes on both sides (tables A and B). First check whether the datetime in the left table (A) is less than 2.5 minutes past a 5-minute mark; if so, round down to the nearest 5 minutes, otherwise round up to the next 5 minutes. Do the same on the right table (B). You can do this in the CTEs, and the right join then remains the same as in your query.
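The BETWEEN approach can be checked on the question's data. Below is a minimal sketch in SQLite via Python's sqlite3 (an assumption for illustration: the DATEADD window is replaced by an absolute epoch-second difference, which is equivalent). Note that with +/- 150 seconds the 09:59 reading stays unmatched, since it is 240 seconds from 09:55; this illustrates the answer's caveat that the window may need to be different for your data (and if you widen it, you may then need a nearest-match tie-break to avoid duplicate matches).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TABLE_A (Freezer INTEGER, Timestamp TEXT, Temperature_1 REAL);
INSERT INTO TABLE_A VALUES
  (1, '2018-04-25 09:45:00', 10),
  (1, '2018-04-25 09:50:00', 11),
  (1, '2018-04-25 09:55:00', 11);
CREATE TABLE TABLE_B (Freezer INTEGER, Timestamp TEXT, Temperature_2 REAL);
INSERT INTO TABLE_B VALUES
  (1, '2018-04-25 09:46:00', 15),
  (1, '2018-04-25 09:52:00', 13),
  (1, '2018-04-25 09:59:00', 12);
""")

rows = con.execute("""
WITH Temperatures_1 AS (
  SELECT Freezer, Temperature_1, Timestamp FROM TABLE_A
),
Temperatures_2 AS (
  SELECT Freezer, Temperature_2, Timestamp FROM TABLE_B
)
SELECT B.Timestamp, B.Temperature_2, A.Timestamp, A.Temperature_1
FROM Temperatures_2 AS B
LEFT JOIN Temperatures_1 AS A
  ON A.Freezer = B.Freezer
 -- +/- 150 seconds, the window suggested in the answer
 AND ABS(strftime('%s', A.Timestamp) - strftime('%s', B.Timestamp)) <= 150
ORDER BY B.Timestamp
""").fetchall()
print(rows)
```

The first two B readings pair up as desired (60 s and 120 s apart); the 09:59 row comes back with NULLs for the A side.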

postgresql query to get counts between 12:00 and 12:00

I have the following query that works fine, but it is giving me counts for a single, whole day (00:00 to 23:59 UTC). For example, it's giving me counts for all of January 1 2017 (00:00 to 23:59 UTC).
My dataset lends itself to be queried from 12:00 UTC to 12:00 UTC. For example, I'm looking for all counts from Jan 1 2017 12:00 UTC to Jan 2 2017 12:00 UTC.
Here is my query:
SELECT count(DISTINCT ltg_data.lat), cwa, to_char(time, 'MM/DD/YYYY')
FROM counties
JOIN ltg_data on ST_contains(counties.the_geom, ltg_data.ltg_geom)
WHERE cwa = 'MFR'
AND time BETWEEN '1987-06-01'
AND '1992-08-1'
GROUP BY cwa, to_char(time, 'MM/DD/YYYY');
FYI...I'm changing the format of the time so I can use the results more readily in javascript.
And a description of the dataset: it's thousands of data points occurring within various polygons every second. I'm determining whether the points occur within the polygon cwa = 'MFR' and then counting them.
Thanks for any help!
I see two approaches here.
First, join against generate_series(start_date::timestamp, end_date, '12 hours'::interval) to get counts per generated interval. This would be the more correct approach, I believe, but it has a major minus: you have to LATERAL join it against the existing data set to use min(time) and max(time)...
Second, a monkey hack, but much less coding and querying: use a different time zone to make 12:00 the start of the day, e.g. (you did not give a sample, so I generate the contents of counties with generate_series at a 2-hour interval as sample data):
t=# with counties as (select generate_series('2017-09-01'::timestamptz,'2017-09-04'::timestamptz,'2 hours'::interval)
g)
select count(1),to_char(g,'MM/DD/YYYY') from counties
group by to_char(g,'MM/DD/YYYY')
order by 2;
count | to_char
-------+------------
12 | 09/01/2017
12 | 09/02/2017
12 | 09/03/2017
1 | 09/04/2017
(4 rows)
So for the UTC time zone there are 12 two-hour interval rows for each of the days above, plus 1 row for the last day due to the inclusive nature of generate_series in my sample: 37 rows in total.
Now a monkey hack:
t=# with counties as (select generate_series('2017-09-01'::timestamptz,'2017-09-04'::timestamptz,'2 hours'::interval)
g)
select count(1),to_char(g at time zone 'utc+12','MM/DD/YYYY') from counties
group by to_char(g at time zone 'utc+12','MM/DD/YYYY')
order by 2;
count | to_char
-------+------------
6 | 08/31/2017
12 | 09/01/2017
12 | 09/02/2017
7 | 09/03/2017
(4 rows)
I select the same dates in a different time zone, shifted exactly 12 hours, so the first day starts at midday on 31 Aug rather than at midnight on 1 Sep. The counts change, but still total 37 rows, grouped the way you requested...
Update: for your query I'd try something like:
SELECT count(DISTINCT ltg_data.lat), cwa, to_char(time at time zone 'utc+12', 'MM/DD/YYYY')
FROM counties
JOIN ltg_data on ST_contains(counties.the_geom, ltg_data.ltg_geom)
WHERE cwa = 'MFR'
AND time BETWEEN '1987-06-01'
AND '1992-08-1'
GROUP BY cwa, to_char(time at time zone 'utc+12', 'MM/DD/YYYY');
Also, if you want to apply the +12 hours logic to the WHERE clause, add at time zone 'utc+12' to the "time" comparison as well.
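The "shift the clock so 12:00 becomes midnight" trick is engine-agnostic. Here is a sketch of the same idea in SQLite via Python's sqlite3 with a few made-up rows (an assumption for illustration; it uses an explicit -12 hour shift instead of at time zone 'utc+12'). Each bucket is labeled by the calendar date whose noon starts it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (t TEXT)")
con.executemany("INSERT INTO events VALUES (?)", [
    ('2017-01-01 06:00:00',),  # before noon Jan 1 -> Dec 31 12:00 bucket
    ('2017-01-01 13:00:00',),  # after noon Jan 1  -> Jan 1 12:00 bucket
    ('2017-01-02 11:59:00',),  # before noon Jan 2 -> Jan 1 12:00 bucket
    ('2017-01-02 12:00:00',),  # noon Jan 2        -> Jan 2 12:00 bucket
])

rows = con.execute("""
SELECT date(t, '-12 hours') AS bucket, COUNT(*)
FROM events
GROUP BY bucket
ORDER BY bucket
""").fetchall()
print(rows)  # [('2016-12-31', 1), ('2017-01-01', 2), ('2017-01-02', 1)]
```

The 13:00 and next-day 11:59 events land in the same noon-to-noon bucket, which is exactly the grouping the question asks for.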

SQL - Get column value based on another column's average between select rows

I've got a table something like:

DateValueField    Hour    Value
2014-09-01        1       200
...
2014-09-01        24      400
2014-09-02        1       220
...
2014-09-02        24      200
...
I need the same value for each DateValueField, based on the average Value over Hours 6-12 (for example), but displayed for all hours, not just 6-12. For instance...
DateValueField    Hour    Value
2014-09-01        1       300
...
2014-09-01        24      300
2014-09-02        1       190
...
2014-09-02        24      190
Query I'm trying is...
select DateValueField, Hour,
(select avg(Value) as Value from MyTable where Hour
between 6 and 12) as Value from MyTable
where DateValueField between '2014' and '2015'
group by DateValueField, Hour
order by DateValueField, Hour
But it gives me the Value as an average of ALL Values; I need it averaged per day, over the hours I specify.
I'd appreciate some help/advice. Thanks!
You can use a derived table to get the average value between hours 6 and 12 grouped by date and then join that to your original table
select t1.DateValueField, t1.Hour, t2.avg_value
from MyTable t1
join (
select DateValueField, avg(Value) avg_value
from MyTable
where hour between 6 and 12
group by DateValueField
) t2 on t2.DateValueField = t1.DateValueField
order by t1.DateValueField, t1.Hour
Note: You may want to use a left join if some of your dates don't have values between hours 6 and 12 but you still want to retrieve all rows from MyTable.
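The derived-table approach can be verified end to end. Below is the same query run in SQLite via Python's sqlite3 on a few made-up rows (an assumption for illustration; the per-day averages here are 300 for 2014-09-01 and 190 for 2014-09-02, matching the question's example).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE MyTable (DateValueField TEXT, Hour INTEGER, Value REAL);
INSERT INTO MyTable VALUES
  ('2014-09-01', 1, 200), ('2014-09-01', 6, 300),
  ('2014-09-01', 12, 300), ('2014-09-01', 24, 400),
  ('2014-09-02', 1, 220), ('2014-09-02', 6, 180),
  ('2014-09-02', 12, 200), ('2014-09-02', 24, 200);
""")

rows = con.execute("""
SELECT t1.DateValueField, t1.Hour, t2.avg_value
FROM MyTable t1
JOIN (
    -- per-day average restricted to hours 6-12
    SELECT DateValueField, AVG(Value) AS avg_value
    FROM MyTable
    WHERE Hour BETWEEN 6 AND 12
    GROUP BY DateValueField
) t2 ON t2.DateValueField = t1.DateValueField
ORDER BY t1.DateValueField, t1.Hour
""").fetchall()
print(rows)
```

Every hour of a given day carries that day's 6-12 average, which is the "same value for all hours" behavior the question asks for.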