Avoid correlated subqueries error in BigQuery - google-bigquery

I have a simple query to obtain the currency rate in use at the time a transaction was created:
SELECT t.orderid, t.date,
(SELECT rate FROM sources.currency_rates r WHERE currencyid=1 AND
r.date>=t.date ORDER BY date LIMIT 1) rate
FROM sources.transactions t
This triggers an error:
Error: Correlated subqueries that reference other tables are not
supported unless they can be de-correlated, such as by transforming
them into an efficient JOIN.
I've tried several types of joins and named subqueries, but none seem to work. What is the best way to accomplish this? It seems like a very common scenario that should be straightforward to implement in BigQuery's Standard SQL.

Below is for BigQuery Standard SQL
#standardSQL
SELECT
t.orderid AS orderid,
t.date AS date,
ARRAY_AGG(r.rate ORDER BY r.date LIMIT 1)[SAFE_OFFSET(0)] AS rate
FROM `sources.transactions` AS t
JOIN `sources.currency_rates` AS r
ON currencyid = 1
AND r.date >= t.date
GROUP BY orderid, date
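The idea behind this rewrite (join every candidate rate, then keep the earliest one per transaction) can be checked on a small scale. The following is only a sketch: it uses SQLite through Python's sqlite3 module as a stand-in for BigQuery, the sample rows are invented, and because SQLite has no ARRAY_AGG it substitutes ROW_NUMBER() for the "first element by date" step. It assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (orderid INTEGER, date TEXT);
CREATE TABLE currency_rates (currencyid INTEGER, date TEXT, rate REAL);
-- Invented sample data, for illustration only.
INSERT INTO transactions VALUES (1, '2017-01-02'), (2, '2017-01-05'), (3, '2017-02-01');
INSERT INTO currency_rates VALUES
  (1, '2017-01-01', 1.10), (1, '2017-01-03', 1.20),
  (1, '2017-01-07', 1.30), (2, '2017-01-03', 9.99);
""")

# Original form: correlated scalar subquery (SQLite accepts it as-is).
correlated = conn.execute("""
SELECT t.orderid, t.date,
       (SELECT r.rate FROM currency_rates r
        WHERE r.currencyid = 1 AND r.date >= t.date
        ORDER BY r.date LIMIT 1) AS rate
FROM transactions t
ORDER BY t.orderid
""").fetchall()

# De-correlated form: join all candidate rates, keep the earliest per transaction.
# LEFT JOIN keeps transactions that have no later rate (rate is NULL for them).
decorrelated = conn.execute("""
SELECT orderid, date, rate FROM (
  SELECT t.orderid AS orderid, t.date AS date, r.rate AS rate,
         ROW_NUMBER() OVER (PARTITION BY t.orderid, t.date ORDER BY r.date) AS rn
  FROM transactions t
  LEFT JOIN currency_rates r ON r.currencyid = 1 AND r.date >= t.date
)
WHERE rn = 1
ORDER BY orderid
""").fetchall()

print(correlated)
print(decorrelated)
```

Both queries return the same rows here, including order 3 with a NULL rate. With an inner JOIN, as in the BigQuery answer, a transaction that has no later rate would be dropped instead of returned with NULL, so a LEFT JOIN may be the closer match to the original subquery.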

I've noticed similar behavior with other correlated subqueries. They are useful, but BigQuery can't always transform them into JOINs automatically.
Similar case which works:
#standardSQL
SELECT name, (
SELECT AVG(temp)
FROM `bigquery-public-data.noaa_gsod.gsod2017` b
WHERE a.usaf=b.stn
) temp
FROM `bigquery-public-data.noaa_gsod.stations` a
LIMIT 10
Doesn't work:
#standardSQL
SELECT name, (
SELECT temp
FROM `bigquery-public-data.noaa_gsod.gsod2017` b
WHERE a.usaf=b.stn
ORDER BY da
LIMIT 1
) temp
FROM `bigquery-public-data.noaa_gsod.stations` a
LIMIT 10
Fix:
#standardSQL
SELECT name, ARRAY_AGG(temp ORDER BY da LIMIT 1)[SAFE_OFFSET(0)] temp
FROM `bigquery-public-data.noaa_gsod.stations` a
JOIN `bigquery-public-data.noaa_gsod.gsod2017` b
ON a.usaf=b.stn
GROUP BY 1
LIMIT 10
(give me a public dataset, and I'll write a query that works with your data)

Related

Get minimum without using row number/window function in Bigquery

I have a table as shown below.
What I would like to do is get the minimum for each subject. Although I am able to do this with the row_number function, I would like to do it with a GROUP BY and min() approach, but it doesn't work.
row_number approach - works fine
SELECT * FROM (select subject_id,value,id,min_time,max_time,time_1,
row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
from table A) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see, just the two columns (subject_id and id) are enough to group the items together. They help differentiate the groups. But why am I not able to use the other columns in the select clause? If I add the other columns to the GROUP BY, I may not get the expected output because time_1 has different values.
I expect my output to be as shown below.
In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(0)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
After aggregating, you want the first (and only) element of each array. The .* expands the record referred to by a into its component columns.
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you can easily remove it:
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
FROM A
) a
WHERE RANK = 1;
Are you looking for something like the following?
SELECT
A.subject_id,
A.id,
A.min_time,
A.max_time,
A.time_1,
A.value
FROM table A
INNER JOIN(
SELECT subject_id, MIN(value) Value
FROM table
GROUP BY subject_id
) B ON A.subject_id = B.subject_id
AND A.Value = B.Value
If you do not need to select the Time_1 column, the following query will work (as far as I can see, the values in min_time and max_time are the same within each group):
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, the best approach, if it fits your data, is to apply something like CAST(Time_1 AS DATE) to your time column. This considers only the date part and ignores the time part. The query would be:
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- Check that this CAST ... AS DATE syntax is what BigQuery expects;
-- it may differ slightly from what I have written here.
Below is for BigQuery Standard SQL, and it is the most efficient way to handle cases like the one in your question:
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases leads to a Resources exceeded error.
Note: a self join is also a very inefficient way of achieving your objective.
A bit late to the party, but here is a CTE-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id
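The approaches in this thread differ in how they treat ties on the minimum value. The sketch below shows that difference on invented data. It runs on SQLite through Python's sqlite3 module rather than BigQuery, the table name readings is hypothetical (the question only calls it table), and the min_time and max_time columns are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (subject_id INTEGER, id INTEGER, time_1 TEXT, value INTEGER);
-- Invented sample data: subject 1 has two rows tied for the minimum value.
INSERT INTO readings VALUES
  (1, 10, '2020-01-01 08:00', 5),
  (1, 10, '2020-01-01 09:00', 3),
  (1, 10, '2020-01-01 10:00', 3),
  (2, 20, '2020-01-02 08:00', 7);
""")

# ROW_NUMBER keeps exactly one row per subject; time_1 is added as a tie-breaker.
one_per_subject = conn.execute("""
SELECT subject_id, id, time_1, value FROM (
  SELECT r.*, ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value, time_1) AS rn
  FROM readings r
)
WHERE rn = 1
ORDER BY subject_id
""").fetchall()

# Joining against MIN(value) keeps every row that ties for the minimum.
all_minimum_rows = conn.execute("""
SELECT r.subject_id, r.id, r.time_1, r.value
FROM readings r
JOIN (SELECT subject_id, MIN(value) AS min_value
      FROM readings GROUP BY subject_id) m
  ON m.subject_id = r.subject_id AND m.min_value = r.value
ORDER BY r.subject_id, r.time_1
""").fetchall()

print(one_per_subject)
print(all_minimum_rows)
```

If exactly one row per subject is required, the ROW_NUMBER form (or ARRAY_AGG with LIMIT 1 in BigQuery) is the safer choice. The join and CTE forms return one row per tied minimum unless the ties are collapsed some other way.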

Optimize Query (remove subquery)

Can you help me optimize this query? I need to remove the subquery because the performance is awful.
select LICENSE,
(select top 1 SERVICE_KEY
from SERVICES
where SERVICES.LICENSE = VEHICLE.LICENSE
order by DATE desc, HOUR desc)
from VEHICLE
The problem is that I can have two SERVICES on the same DATE and HOUR, so I haven't been able to code an equivalent SQL avoiding the subquery.
The query runs on a Legacy database where I can't modify its metadata, and it doesn't have any index at all. That's the reason to look for a solution that can avoid a correlated query.
Thank you.
You can express your query using ROW_NUMBER() without the need for a correlated subquery. Try the following query and see how the performance is:
SELECT t.LICENSE, t.SERVICE_KEY
FROM
(
SELECT t1.LICENSE, t2.SERVICE_KEY,
ROW_NUMBER() OVER (PARTITION BY t1.LICENSE
ORDER BY t2.DATE DESC, t2.HOUR DESC) rn
FROM VEHICLE t1
INNER JOIN SERVICES t2
ON t1.LICENSE = t2.LICENSE
) t
WHERE t.rn = 1
The performance of this query would depend, among other things, on having indices on the join columns of your two tables.
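Since the question mentions that two SERVICES rows can share the same DATE and HOUR, it may help to see how a tie-breaker makes the result deterministic. This is a sketch under assumptions: it runs on SQLite through Python's sqlite3 module, not on the legacy database from the question, the sample rows are invented, and using SERVICE_KEY as the final ORDER BY column is a choice made for this example. The LEFT JOIN is also an assumption, added so that vehicles without any service still appear, as they do in the original subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE VEHICLE (LICENSE TEXT);
CREATE TABLE SERVICES (LICENSE TEXT, SERVICE_KEY INTEGER, DATE TEXT, HOUR INTEGER);
-- Invented sample data: AAA has two services at the same DATE and HOUR,
-- and CCC has no service at all.
INSERT INTO VEHICLE VALUES ('AAA'), ('BBB'), ('CCC');
INSERT INTO SERVICES VALUES
  ('AAA', 1, '2020-01-01', 9),
  ('AAA', 2, '2020-01-02', 9),
  ('AAA', 3, '2020-01-02', 9),
  ('BBB', 4, '2020-01-01', 8);
""")

latest = conn.execute("""
SELECT LICENSE, SERVICE_KEY FROM (
  SELECT v.LICENSE AS LICENSE, s.SERVICE_KEY AS SERVICE_KEY,
         ROW_NUMBER() OVER (
           PARTITION BY v.LICENSE
           -- SERVICE_KEY breaks ties between services with equal DATE and HOUR.
           ORDER BY s.DATE DESC, s.HOUR DESC, s.SERVICE_KEY DESC
         ) AS rn
  FROM VEHICLE v
  LEFT JOIN SERVICES s ON s.LICENSE = v.LICENSE
)
WHERE rn = 1
ORDER BY LICENSE
""").fetchall()

print(latest)
```

Without the extra ORDER BY column, either of the two tied services for AAA could be numbered first, which is the same ambiguity the original TOP 1 subquery has.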

PostgreSQL Query Time

SELECT *
FROM vehicles t1
WHERE (SELECT COUNT(*) FROM vehicles t2
WHERE t1.pump_number = t2.pump_number
AND t1.updated_at < t2.updated_at
) < 4
AND t1.updated_at >= ?
And I supply '1970-01-01 00:00:00.000000' for the parameter ?.
I have around 10k records in the vehicles table and no index has been added. The above query takes around 10-20 seconds to execute.
How can I optimize it to decrease the execution time?
Postgres provides a nice admin tool, pgAdmin, which has an EXPLAIN option for viewing the query execution plan.
It gives great insight into where the time goes. Here is the link to the pgAdmin documentation:
http://www.pgadmin.org/docs/1.4/query.html
Also, use joins in your query instead of a subselect; that should improve query performance.
Try this (the columns in the SELECT and GROUP BY clauses need to be replaced by your own):
SELECT
t1.id,
t1.updated_at,
t1.other_columns
FROM vehicles t1
INNER JOIN vehicles t2
ON t1.pump_number = t2.pump_number
AND t1.updated_at < t2.updated_at
WHERE t1.updated_at >= '1970-01-01 00:00:00.000000'
GROUP BY
t1.id,
t1.updated_at,
t1.other_columns
having count(*)< 4
After this change, you could try adding an index on the pump_number column to see if it helps.
This is your query:
SELECT *
FROM vehicles t1
WHERE (SELECT Count(*)
FROM vehicles t2
WHERE t1.pump_number = t2.pump_number AND
t1.updated_at < t2.updated_at
) < 4 AND
t1.updated_at >= ?
I would start by writing this using window functions:
select v.*
from (select v.*, row_number() over (partition by pump_number order by updated_at) as seqnum
from vehicles v
) v
where v.seqnum < 4 and v.updated_at >= ?;
For this query, I would suggest indexes on vehicles(pump_number, updated_at) and vehicles(updated_at).
To get an equivalent query, use the window function rank(), not row_number() here:
SELECT *
FROM (
SELECT *
, rank() OVER (PARTITION BY pump_number ORDER BY updated_at DESC) AS rnk
FROM vehicles t1
) sub
WHERE rnk < 5
AND updated_at >= '1970-01-01 0:0';
And it has to be ORDER BY updated_at DESC, to exclude rows that have four or more newer peers for the same pump_number. In other words:
"Get the four newest rows per pump_number - or more if there are ties on updated_at".
Indexes are not going to help while you read most or all of the table anyway.
Further optimize performance
If (pump_number, updated_at) is unique and/or there are relatively few distinct values for pump_number in vehicles, you can probably optimize further. There is not enough information in your question.
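When rewriting a correlated COUNT(*) filter as a window function, it is easy to get the threshold or the sort direction slightly wrong, so a check on sample data is worthwhile. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for PostgreSQL, with invented rows, and omits the updated_at >= ? condition because it applies identically to both forms. In this data, "fewer than 4 newer rows" corresponds to a descending rank of at most 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicles (id INTEGER, pump_number INTEGER, updated_at TEXT)")

# Invented sample data: pump 1 has six rows on distinct days,
# pump 2 has two rows with the same timestamp (a tie).
rows = [(i, 1, f"2020-01-0{i}") for i in range(1, 7)]
rows += [(7, 2, "2020-01-01"), (8, 2, "2020-01-01")]
conn.executemany("INSERT INTO vehicles VALUES (?, ?, ?)", rows)

# Original form: keep rows that have fewer than 4 newer rows for the same pump.
by_count = conn.execute("""
SELECT id FROM vehicles t1
WHERE (SELECT COUNT(*) FROM vehicles t2
       WHERE t1.pump_number = t2.pump_number
         AND t1.updated_at < t2.updated_at) < 4
ORDER BY id
""").fetchall()

# Window form: rank = 1 + number of strictly newer rows, so count < 4 means rank <= 4.
by_rank = conn.execute("""
SELECT id FROM (
  SELECT id, RANK() OVER (PARTITION BY pump_number ORDER BY updated_at DESC) AS rnk
  FROM vehicles
)
WHERE rnk <= 4
ORDER BY id
""").fetchall()

print(by_count)
print(by_rank)
```

Both queries select ids 3 to 6 for pump 1 and both tied rows for pump 2. ROW_NUMBER() would number the tied rows 1 and 2 arbitrarily, which is why RANK() is the closer equivalent when ties are possible.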

Compare SQL groups against each other

How can one filter a grouped resultset for only those groups that meet some criterion compared against the other groups? For example, only those groups that have the maximum number of constituent records?
I had thought that a subquery as follows should do the trick:
SELECT * FROM (
SELECT *, COUNT(*) AS Records
FROM T
GROUP BY X
) t HAVING Records = MAX(Records);
However the addition of the final HAVING clause results in an empty recordset... what's going on?
In MySQL (which I assume you are using, since you have posted SELECT *, COUNT(*) FROM T GROUP BY X, which would fail in every other RDBMS that I know of), you can use:
SELECT T.*
FROM T
INNER JOIN
( SELECT X, COUNT(*) AS Records
FROM T
GROUP BY X
ORDER BY Records DESC
LIMIT 1
) T2
ON T2.X = T.X
This has been tested in MySQL and removes the implicit grouping/aggregation.
If you can use windowed functions and one of TOP/LIMIT with Ties or Common Table expressions it becomes even shorter:
Windowed function + CTE: (MS SQL-Server & PostgreSQL Tested)
WITH CTE AS
( SELECT *, COUNT(*) OVER(PARTITION BY X) AS Records
FROM T
)
SELECT *
FROM CTE
WHERE Records = (SELECT MAX(Records) FROM CTE)
Windowed Function with TOP (MS SQL-Server Tested)
SELECT TOP 1 WITH TIES *
FROM ( SELECT *, COUNT(*) OVER(PARTITION BY X) [Records]
FROM T
) AS T2
ORDER BY Records DESC
Lastly, I have never used Oracle, so apologies for not adding a solution that works on Oracle...
EDIT
My solution for MySQL did not take ties into account, and my suggested fix steps on the toes of what you have said you want to avoid (duplicate subqueries), so I am not sure I can help after all. However, in case it is still preferable, here is a version that will work as required on your fiddle:
SELECT T.*
FROM T
INNER JOIN
( SELECT X
FROM T
GROUP BY X
HAVING COUNT(*) =
( SELECT COUNT(*) AS Records
FROM T
GROUP BY X
ORDER BY Records DESC
LIMIT 1
)
) T2
ON T2.X = T.X
For the exact question you give, one way to look at it is that you want the group of records where there is no other group that has more records. So if you say
SELECT taxid, COUNT(*) as howMany
FROM flats
GROUP by taxid
You get all taxids and their counts.
Then you can treat that expression as a table by making it a subquery and giving it an alias. Below I assign two "copies" of the query the names X and Y and ask for the taxids in X for which no taxid in Y has a higher count. If two groups are tied for the highest count, I get both (or more). Different databases have proprietary syntax, notably TOP and LIMIT, that makes this kind of query simpler and easier to understand.
SELECT taxid FROM
(select taxid, count(*) as HowMany from flats
GROUP by taxid) as X
WHERE NOT EXISTS
(
SELECT * from
(
SELECT taxid, count(*) as HowMany FROM
flats
GROUP by taxid
) AS Y
WHERE Y.howmany > X.howmany
)
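The NOT EXISTS formulation can be tried on a small table with a tie for the largest group. This is a sketch only: it runs on SQLite through Python's sqlite3 module, the flats table is reduced to a single taxid column with invented rows, and a CTE named counts (a name chosen for this example) replaces the two copies X and Y so the grouped query is written once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flats (taxid TEXT);
-- Invented sample data: A and B tie with three rows each, C has one.
INSERT INTO flats VALUES ('A'), ('A'), ('A'), ('B'), ('B'), ('B'), ('C');
""")

largest_groups = conn.execute("""
WITH counts AS (
  SELECT taxid, COUNT(*) AS howmany FROM flats GROUP BY taxid
)
SELECT x.taxid, x.howmany
FROM counts x
-- Keep a group only if no other group has a strictly higher count.
WHERE NOT EXISTS (SELECT 1 FROM counts y WHERE y.howmany > x.howmany)
ORDER BY x.taxid
""").fetchall()

print(largest_groups)
```

Both tied groups come back, which is the behaviour the ORDER BY ... LIMIT 1 variant loses. The CTE form requires a database that supports WITH, so it would not apply to MySQL versions that lack common table expressions.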
Try this:
SELECT * FROM (
SELECT *, MAX(Records) as max_records FROM (
SELECT *, COUNT(*) AS Records
FROM T
GROUP BY X
) t
) WHERE Records = max_records
I'm sorry that I can't test the validity of this query right now.

Complex SQL pagination Query

I am doing pagination for my data using the solution to this question.
I now need to use this solution for a more complex query, i.e. one where the SELECT inside the brackets has joins and aggregate functions.
This is that solution I'm using as a reference:
;WITH Results_CTE AS
(
SELECT
Col1, Col2, ...,
ROW_NUMBER() OVER (ORDER BY SortCol1, SortCol2, ...) AS RowNum
FROM Table
WHERE <whatever>
)
SELECT *
FROM Results_CTE
WHERE RowNum >= @Offset
AND RowNum < @Offset + @Limit
The query that I need to incorporate into the above solution:
SELECT users.indicator, COUNT(*) as 'queries' FROM queries
INNER JOIN calls ON queries.call_id = calls.id
INNER JOIN users ON calls.user_id = users.id
WHERE queries.isresolved=0 AND users.indicator='ind1'
GROUP BY users.indicator ORDER BY queries DESC
How can I achieve this? So far I've made it work by removing the ORDER BY queries DESC part and putting that in the line ROW_NUMBER() OVER (ORDER BY ...) AS RowNum, but when I do this it doesn't allow me to order by that column ("Invalid column name 'queries'.").
What do I need to do to get it to order by this column?
edit: using SQL Server 2008
Try ORDER BY COUNT(*) DESC. It works on MySQL; I'm not sure about SQL Server 2008.
I think queries is your alias for the COUNT(*) column. If so, use it like this:
SELECT users.indicator, COUNT(*) as 'queries' FROM queries
INNER JOIN calls ON queries.call_id = calls.id
INNER JOIN users ON calls.user_id = users.id
WHERE queries.isresolved=0 AND users.indicator='ind1'
GROUP BY users.indicator ORDER BY COUNT(*) DESC
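One way to combine the aggregate query with the ROW_NUMBER() pagination pattern is to compute the counts in one CTE and number the rows in a second one, where the count is an ordinary column that can be referenced by name. The sketch below illustrates that layering on SQLite through Python's sqlite3 module, not on SQL Server 2008. The sample rows, the query_count alias, the Counts CTE, and the page helper with its start and size parameters are all made up for this example, and the users.indicator = 'ind1' filter is dropped so that there is more than one row to page through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, indicator TEXT);
CREATE TABLE calls (id INTEGER, user_id INTEGER);
CREATE TABLE queries (call_id INTEGER, isresolved INTEGER);
-- Invented sample data: unresolved query counts are ind1=1, ind2=2, ind3=3.
INSERT INTO users VALUES (1, 'ind1'), (2, 'ind2'), (3, 'ind3');
INSERT INTO calls VALUES (1, 1), (2, 2), (3, 3);
INSERT INTO queries VALUES (1, 0), (2, 0), (2, 0), (3, 0), (3, 0), (3, 0), (3, 1);
""")

def page(start, size):
    """Return rows whose 1-based RowNum lies in [start, start + size)."""
    return conn.execute("""
    WITH Counts AS (
      -- Step 1: the aggregate query, with the count given a plain alias.
      SELECT users.indicator AS indicator, COUNT(*) AS query_count
      FROM queries
      INNER JOIN calls ON queries.call_id = calls.id
      INNER JOIN users ON calls.user_id = users.id
      WHERE queries.isresolved = 0
      GROUP BY users.indicator
    ),
    Results_CTE AS (
      -- Step 2: number the aggregated rows; the alias is visible here.
      SELECT indicator, query_count,
             ROW_NUMBER() OVER (ORDER BY query_count DESC, indicator) AS RowNum
      FROM Counts
    )
    SELECT indicator, query_count, RowNum
    FROM Results_CTE
    WHERE RowNum >= :start AND RowNum < :start + :size
    ORDER BY RowNum
    """, {"start": start, "size": size}).fetchall()

print(page(1, 2))
print(page(3, 2))
```

The second ORDER BY column (indicator) is there so that rows with equal counts always land on the same page.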