Is there a way to do LEFT JOIN LATERAL with BigQuery? - google-bigquery

Given some rows with duplicate names and different timestamps, I would like to select the row with the newest timestamp, if the duplicate name occurs within say, 45 minutes, of the first timestamp.
Here's what worked in PostgreSQL:
SELECT i.ts AS base_timestamp, j.ts AS newer_timestamp, i.name
FROM tbl i
LEFT JOIN LATERAL
(SELECT j.ts
FROM tbl j
WHERE i.name = j.name
AND j.ts > i.ts
AND j.ts < (i.ts + INTERVAL '45 minutes')
) j ON TRUE
WHERE j.ts is NULL
Great explanation of LATERAL here:
https://heap.io/blog/engineering/postgresqls-powerful-new-join-type-lateral
LATERAL join is like a SQL foreach loop, in which PostgreSQL will iterate over each row in a result set and evaluate a subquery using that row as a parameter.
So it's like a correlated subquery, but in the join.
Then I simply take only the rows where there is no newer timestamp (WHERE j.ts is NULL).
How can I do this in BigQuery?
EDIT: I've created an example of the PostgreSQL grouping on SQLFiddle as requested in the comments.
Input:
('Duplication Example','2019-06-22 19:10:25'),
('Duplication Example','2019-06-22 23:58:31'),
('Duplication Example','2019-06-23 00:08:00')
Output (middle row having timestamp 23:58:31 removed):
base_timestamp newer_timestamp name
2019-06-22T19:10:25Z (null) Duplication Example
2019-06-23T00:08:00Z (null) Duplication Example

Your case looks kind of like a task for window functions. But since you seem to be interested in lateral joins more than in solving the problem you presented:
In BigQuery there is afaik only an implicit version of lateral joins: when joining with unnested arrays.
This showcases the idea:
WITH t AS (
SELECT 'a' as id, [2,3] as arr
UNION ALL SELECT 'b', [56, 7]
)
SELECT * EXCEPT(arr)
FROM t LEFT JOIN UNNEST(arr)

This can be archived with a WINDOW function.
SELECT
name,
MAX(timestamp) AS timestamp_new
FROM
(
SELECT
i.name,
COUNT(*) OVER (PARTITION BY i.name ORDER BY i.ts RANGE BETWEEN 45 * 60 * 1000 PRECEDING AND CURRENT ROW) as 45_window_count,
i.ts AS timestamp
FROM
tbl i
)
WHERE 45_window_count > 1
GROUP BY user

Related

PostgreSQL GROUP BY that includes zeros

I have a SQL query (postgresql) that looks something like this:
SELECT
my_timestamp::timestamp::date as the_date,
count(*) as count
FROM my_table
WHERE ...
GROUP BY the_date
ORDER BY the_date
The result is a table of YYYY-MM-DD, count pairs.
Now I've been asked to fill in the empty dates with zero. So if I was previously providing
2022-03-15 3
2022-03-17 1
I'd now want to return
2022-03-15 3
2022-03-16 0
2022-03-17 1
Now I can easily do this client-side (relative to the database) and let my program compute and return the zero-augmented list to its clients based on the original list from postgres. But perhaps it would better if I could just tell postgresql to include zeros.
I suspect this isn't easy at all, because postgres has no obvious way of knowing what I'm up to. But in the interests of learning more about postgres and SQL, I thought I'd have try. The try isn't too promising thus far...
Any pointers before I conclude that I was right to leave this to my (postgres client) program?
Update
This is an interesting case where my simplification of the problem led to a correct answer that didn't work for me. For those who come after, I thought it worth documenting what followed, because it take some fun twists through constructing SQL queries.
#a_horse_with_no_name responded with a query that I've verified works if I simplify my own query to match. Unfortunately, my query had some extra baggage that I didn't think pertinent, and so had trimmed out when posting the original question.
Here's my real (original) query, with all names preserved (if shortened):
-- current query
SELECT
LEAST(time1, time2, time3, time4)::timestamp::date as the_date,
count(*) as count
FROM reading_group_reader rgr
INNER JOIN ( SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
WHERE LEAST(time1, time2, time3, time4) > current_date - 30
GROUP BY the_date
ORDER BY the_date;
If I translate that directly into the proposed solution, however, the inner join between reading_group_reader and the temporary table TT causes the left join to become inner (I think) and the date sequence drops its zeros again. Fwiw, the table TT is a table because sometimes it actually is a subselect.
So I transformed my query into this:
SELECT
g.dt::date as the_date,
count(*) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY the_date;
but this outputs 1's instead of 0's at the places that should be 0.
The reason for that, however, is because I've now selected every date, so, of course, there's one of each. I need to include an additional field (which will be NULL) and count that.
So this query finally does what I want:
SELECT
g.dt::date as the_date,
count(rgrt.device_id) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date,
rgr.device_id
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)
) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt(the_date)
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY g.dt;
And, of course, on re-reading the accepted answer, I eventually saw that he did count an unrelated field, which I'd simply missed on my first several readings.
You will need to join to a list of dates. This can e.g. be done using generate_series()
SELECT g.dt::date as the_date,
count(t.my_timestamp) as count
FROM generate_series(date '2022-03-01',
date '2022-03-31',
interval '1 day') as g(dt)
LEFT JOIN my_table as t
ON t.my_timestamp::date = g.dt::date
AND ... -- the original WHERE clause goes here!
GROUP BY the_date
ORDER BY the_date;
Note that the original WHERE conditions need to go into the join condition of the LEFT JOIN. You can't put them into a WHERE clause because that would turn the outer join back into an inner join (which means the missing dates wouldn't be returned).

Grouping records on consecutive dates

If I have following table in Postgres:
order_dtls
Order_id Order_date Customer_name
-------------------------------------
1 11/09/17 Xyz
2 15/09/17 Lmn
3 12/09/17 Xyz
4 18/09/17 Abc
5 15/09/17 Xyz
6 25/09/17 Lmn
7 19/09/17 Abc
I want to retrieve such customer who has placed orders on 2 consecutive days.
In above case Xyz and Abc customers should be returned by query as result.
There are many ways to do this. Use an EXISTS semi-join followed by DISTINCT or GROUP BY, should be among the fastest.
Postgres syntax:
SELECT DISTINCT customer_name
FROM order_dtls o
WHERE EXISTS (
SELEST 1 FROM order_dtls
WHERE customer_name = o.customer_name
AND order_date = o.order_date + 1 -- simple syntax for data type "date" in Postgres!
);
If the table is big, be sure to have an index on (customer_name, order_date) to make it fast - index items in this order.
To clarify, since Oto happened to post almost the same solution a bit faster:
DISTINCT is an SQL construct, a syntax element, not a function. Do not use parentheses like DISTINCT (customer_name). Would be short for DISTINCT ROW(customer_name) - a row constructor unrelated to DISTINCT - and just noise for the simple case with a single expression, because Postgres removes the pointless row wrapper for a single element automatically. But if you wrap more than one expression like that, you get an actual row type - an anonymous record actually, since no row type is given. Most certainly not what you want.
What is a row constructor used for?
Also, don't confuse DISTINCT with DISTINCT ON (expr, ...). See:
Select first row in each GROUP BY group?
Try something like...
SELECT `order_dtls`.*
FROM `order_dtls`
INNER JOIN `order_dtls` AS mirror
ON `order_dtls`.`Order_id` <> `mirror`.`Order_id`
AND `order_dtls`.`Customer_name` = `mirror`.`Customer_name`
AND DATEDIFF(`order_dtls`.`Order_date`, `mirror`.`Order_date`) = 1
The way I would think of it doing it would be to join the table the date part with itselft on the next date and joining it with the Customer_name too.
This way you can ensure that the same customer_name done an order on 2 consecutive days.
For MySQL:
SELECT distinct *
FROM order_dtls t1
INNER JOIN order_dtls t2 on
t1.Order_date = DATE_ADD(t2.Order_date, INTERVAL 1 DAY) and
t1.Customer_name = t2.Customer_name
The result you should also select it with the Distinct keyword to ensure the same customer is not displayed more than 1 time.
For postgresql:
select distinct(Customer_name) from your_table
where exists
(select 1 from your_table t1
where
Customer_name = your_table.Customer_name and Order_date = your_table.Order_date+1 )
Same for MySQL, just instead of your_table.Order_date+1 use: DATE_ADD(your_table.Order_date , INTERVAL 1 DAY)
This should work:
SELECT A.customer_name
FROM order_dtls A
INNER JOIN (SELECT customer_name, order_date FROM order_dtls) as B
ON(A.customer_name = B.customer_name and Datediff(B.Order_date, A.Order_date) =1)
group by A.customer_name

Oracle Left Join not returning all rows

I am using the following CTE. The first part collects all unique people and the second left joins the unique people with events during a particular time frame. I am expecting that all the rows be returned from my unique people table even if they don't have an event within the time frame. But this doesn't appear to be the case.
WITH DISTINCT_ATTENDING(ATTENDING) AS
(
SELECT DISTINCT ATTENDING
FROM PEOPLE
WHERE ATTENDING IS NOT NULL
), -- returns 62 records
EVENT_HISTORY(ATTENDING, TOTAL) AS
(
SELECT C.ATTENDING,
COUNT(C.ID)
FROM DISTINCT_ATTENDING D
LEFT JOIN PEOPLE C
ON C.ATTENDING = D.ATTENDING
AND TO_DATE(C.DATE, 'YYYYMMDD') < TO_DATE('20140101', 'YYYYMMDD')
GROUP BY C.ATTENDING
ORDER BY C.ATTENDING
)
SELECT * FROM EVENT_HISTORY; -- returns 49 rows
What am I doing wrong here?
Jonny
The problem is inthe column "C.ATTENDING", just change for "D.ATTENDING"
SELECT D.ATTENDING,
COUNT(C.ID)
FROM DISTINCT_ATTENDING D
LEFT JOIN PEOPLE C
ON C.ATTENDING = D.ATTENDING
AND TO_DATE(C.DATE, 'YYYYMMDD') < TO_DATE('20140101', 'YYYYMMDD')
GROUP BY D.ATTENDING
ORDER BY D.ATTENDING
Your query seems too complicated. I think the following does the same thing:
SELECT P.ATTENDING,
SUM(CASE WHEN TO_DATE(P.DATE, 'YYYYMMDD') < TO_DATE('20140101', 'YYYYMMDD')
THEN 1 ELSE 0 END)
FROM PEOPLE P
WHERE P.ATTENDING IS NOT NLL
GROUP BY P.ATTENDING
ORDER BY P.ATTENDING ;
Your problem is that you are aggregating by a column in the second table of a left join. This is NULL when there is no match.

Record returned from function has columns concatenated

I have a table which stores account changes over time. I need to join that up with two other tables to create some records for a particular day, if those records don't already exist.
To make things easier (I hope), I've encapsulated the query that returns the correct historical data into a function that takes in an account id, and the day.
If I execute "Select * account_servicetier_for_day(20424, '2014-08-12')", I get the expected result (all the data returned from the function in separate columns). If I use the function within another query, I get all the columns joined into one:
("2014-08-12 14:20:37",hollenbeck,691,12129,20424,69.95,"2Mb/1Mb 20GB Limit",2048,1024,20.000)
I'm using "PostgreSQL 9.2.4 on x86_64-slackware-linux-gnu, compiled by gcc (GCC) 4.7.1, 64-bit".
Query:
Select
'2014-08-12' As day, 0 As inbytes, 0 As outbytes, acct.username, acct.accountid, acct.userid,
account_servicetier_for_day(acct.accountid, '2014-08-12')
From account_tab acct
Where acct.isdsl = 1
And acct.dslservicetypeid Is Not Null
And acct.accountid Not In (Select accountid From dailyaccounting_tab Where Day = '2014-08-12')
Order By acct.username
Function:
CREATE OR REPLACE FUNCTION account_servicetier_for_day(_accountid integer, _day timestamp without time zone) RETURNS setof account_dsl_history_info AS
$BODY$
DECLARE _accountingrow record;
BEGIN
Return Query
Select * From account_dsl_history_info
Where accountid = _accountid And timestamp <= _day + interval '1 day - 1 millisecond'
Order By timestamp Desc
Limit 1;
END;
$BODY$ LANGUAGE plpgsql;
Generally, to decompose rows returned from a function and get individual columns:
SELECT * FROM account_servicetier_for_day(20424, '2014-08-12');
As for the query:
Postgres 9.3 or newer
Cleaner with JOIN LATERAL:
SELECT '2014-08-12' AS day, 0 AS inbytes, 0 AS outbytes
, a.username, a.accountid, a.userid
, f.* -- but avoid duplicate column names!
FROM account_tab a
, account_servicetier_for_day(a.accountid, '2014-08-12') f -- <-- HERE
WHERE a.isdsl = 1
AND a.dslservicetypeid IS NOT NULL
AND NOT EXISTS (
SELECT FROM dailyaccounting_tab
WHERE day = '2014-08-12'
AND accountid = a.accountid
)
ORDER BY a.username;
The LATERAL keyword is implicit here, functions can always refer earlier FROM items. The manual:
LATERAL can also precede a function-call FROM item, but in this
case it is a noise word, because the function expression can refer to
earlier FROM items in any case.
Related:
Insert multiple rows in one table based on number in another table
Short notation with a comma in the FROM list is (mostly) equivalent to a CROSS JOIN LATERAL (same as [INNER] JOIN LATERAL ... ON TRUE) and thus removes rows from the result where the function call returns no row. To retain such rows, use LEFT JOIN LATERAL ... ON TRUE:
...
FROM account_tab a
LEFT JOIN LATERAL account_servicetier_for_day(a.accountid, '2014-08-12') f ON TRUE
...
Also, don't use NOT IN (subquery) when you can avoid it. It's the slowest and most tricky of several ways to do that:
Select rows which are not present in other table
I suggest NOT EXISTS instead.
Postgres 9.2 or older
You can call a set-returning function in the SELECT list (which is a Postgres extension of standard SQL). For performance reasons, this is best done in a subquery. Decompose the (well-known!) row type in the outer query to avoid repeated evaluation of the function:
SELECT '2014-08-12' AS day, 0 AS inbytes, 0 AS outbytes
, a.username, a.accountid, a.userid
, (a.rec).* -- but be wary of duplicate column names!
FROM (
SELECT *, account_servicetier_for_day(a.accountid, '2014-08-12') AS rec
FROM account_tab a
WHERE a.isdsl = 1
AND a.dslservicetypeid Is Not Null
AND NOT EXISTS (
SELECT FROM dailyaccounting_tab
WHERE day = '2014-08-12'
AND accountid = a.accountid
)
) a
ORDER BY a.username;
Related answer by Craig Ringer with an explanation, why it's better not to decompose on the same query level:
How to avoid multiple function evals with the (func()).* syntax in an SQL query?
Postgres 10 removed some oddities in the behavior of set-returning functions in the SELECT:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
Use the function in the from clause
Select
'2014-08-12' As day,
0 As inbytes,
0 As outbytes,
acct.username,
acct.accountid,
acct.userid,
asfd.*
From
account_tab acct
cross join lateral
account_servicetier_for_day(acct.accountid, '2014-08-12') asfd
Where acct.isdsl = 1
And acct.dslservicetypeid Is Not Null
And acct.accountid Not In (Select accountid From dailyaccounting_tab Where Day = '2014-08-12')
Order By acct.username

Returning the lowest integer not in a list in SQL

Supposed you have a table T(A) with only positive integers allowed, like:
1,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18
In the above example, the result is 10. We always can use ORDER BY and DISTINCT to sort and remove duplicates. However, to find the lowest integer not in the list, I came up with the following SQL query:
select list.x + 1
from (select x from (select distinct a as x from T order by a)) as list, T
where list.x + 1 not in T limit 1;
My idea is start a counter and 1, check if that counter is in list: if it is, return it, otherwise increment and look again. However, I have to start that counter as 1, and then increment. That query works most of the cases, by there are some corner cases like in 1. How can I accomplish that in SQL or should I go about a completely different direction to solve this problem?
Because SQL works on sets, the intermediate SELECT DISTINCT a AS x FROM t ORDER BY a is redundant.
The basic technique of looking for a gap in a column of integers is to find where the current entry plus 1 does not exist. This requires a self-join of some sort.
Your query is not far off, but I think it can be simplified to:
SELECT MIN(a) + 1
FROM t
WHERE a + 1 NOT IN (SELECT a FROM t)
The NOT IN acts as a sort of self-join. This won't produce anything from an empty table, but should be OK otherwise.
SQL Fiddle
select min(y.a) as a
from
t x
right join
(
select a + 1 as a from t
union
select 1
) y on y.a = x.a
where x.a is null
It will work even in an empty table
SELECT min(t.a) - 1
FROM t
LEFT JOIN t t1 ON t1.a = t.a - 1
WHERE t1.a IS NULL
AND t.a > 1; -- exclude 0
This finds the smallest number greater than 1, where the next-smaller number is not in the same table. That missing number is returned.
This works even for a missing 1. There are multiple answers checking in the opposite direction. All of them would fail with a missing 1.
SQL Fiddle.
You can do the following, although you may also want to define a range - in which case you might need a couple of UNIONs
SELECT x.id+1
FROM my_table x
LEFT
JOIN my_table y
ON x.id+1 = y.id
WHERE y.id IS NULL
ORDER
BY x.id LIMIT 1;
You can always create a table with all of the numbers from 1 to X and then join that table with the table you are comparing. Then just find the TOP value in your SELECT statement that isn't present in the table you are comparing
SELECT TOP 1 table_with_all_numbers.number, table_with_missing_numbers.number
FROM table_with_all_numbers
LEFT JOIN table_with_missing_numbers
ON table_with_missing_numbers.number = table_with_all_numbers.number
WHERE table_with_missing_numbers.number IS NULL
ORDER BY table_with_all_numbers.number ASC;
In SQLite 3.8.3 or later, you can use a recursive common table expression to create a counter.
Here, we stop counting when we find a value not in the table:
WITH RECURSIVE counter(c) AS (
SELECT 1
UNION ALL
SELECT c + 1 FROM counter WHERE c IN t)
SELECT max(c) FROM counter;
(This works for an empty table or a missing 1.)
This query ranks (starting from rank 1) each distinct number in ascending order and selects the lowest rank that's less than its number. If no rank is lower than its number (i.e. there are no gaps in the table) the query returns the max number + 1.
select coalesce(min(number),1) from (
select min(cnt) number
from (
select
number,
(select count(*) from (select distinct number from numbers) b where b.number <= a.number) as cnt
from (select distinct number from numbers) a
) t1 where number > cnt
union
select max(number) + 1 number from numbers
) t1
http://sqlfiddle.com/#!7/720cc/3
Just another method, using EXCEPT this time:
SELECT a + 1 AS missing FROM T
EXCEPT
SELECT a FROM T
ORDER BY missing
LIMIT 1;