BigQuery query JOIN unusually slow

I'm trying to run this query, but it is, to my limited level of comprehension, absurdly slow.
Here is the query:
SELECT
STRFTIME_UTC_USEC(req.date, "%Y-%m-%d") AS day,
HOUR(req.date) AS hour,
10000*(COUNT(req.request_id) - COUNT(resp.request_id)) AS nb_bid_requests,
COUNT(resp.request_id) AS nb_bid_responses,
FROM
[server.Request] req
LEFT JOIN EACH
server.Response resp
ON
req.request_id = resp.request_id
WHERE
DATEDIFF(CURRENT_TIMESTAMP(), req.date) < 3
GROUP EACH BY
day,
hour
ORDER BY
day,
hour
What bugs me the most is that this exact same query works perfectly fine on the Production project, which has the same datasets, tables and fields (with the same data types and names). The only difference is that Production has more data than Dev.
I'm by no means an expert in SQL, and I'd appreciate being told where I could improve the query.
Thank you in advance.
EDIT: solved the issue.
It was caused by a large number of duplicate request_id values in server.Response, which slowed the query down "a little bit".
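To see why duplicates hurt so badly, here is a miniature of the effect, sketched with sqlite3 purely for illustration (the table names mirror the question, the data is made up): every duplicate join key multiplies the matched rows, so heavy duplication blows up the join's intermediate result. Deduplicating the right side before joining restores one row per request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE request (request_id INTEGER)")
cur.execute("CREATE TABLE response (request_id INTEGER)")
cur.executemany("INSERT INTO request VALUES (?)", [(1,), (2,), (3,)])
# request_id 1 is duplicated four times in response
cur.executemany("INSERT INTO response VALUES (?)",
                [(1,), (1,), (1,), (1,), (2,)])

# Each duplicate key multiplies the matched rows: 3 requests join to 6 rows.
rows = cur.execute("""
    SELECT COUNT(*) FROM request req
    LEFT JOIN response resp ON req.request_id = resp.request_id
""").fetchone()[0]
print(rows)  # 6: four matches for id 1, one for id 2, one unmatched row for id 3

# Deduplicating the right side before joining restores one row per request.
deduped = cur.execute("""
    SELECT COUNT(*) FROM request req
    LEFT JOIN (SELECT DISTINCT request_id FROM response) resp
      ON req.request_id = resp.request_id
""").fetchone()[0]
print(deduped)  # 3
```

At BigQuery scale the same multiplication happens across millions of rows, which is why the "little bit" of slowdown was so dramatic.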

Try pushing your WHERE clause down inside the join.
BigQuery's optimizer does not (yet) push predicates inside joins, so the query you posted joins all of your data and then filters it, instead of just joining the parts you care about. If you have a date field on both request and response, put filters on both sides of the join!
If you can't filter both sides of the join, then switch sides so that the smaller (filtered) table is on the right. Because of how BQ joins are implemented, they typically perform better if the smaller table is on the right.
SELECT
STRFTIME_UTC_USEC(req.date, "%Y-%m-%d") AS day,
HOUR(req.date) AS hour,
10000*(COUNT(req.request_id) - COUNT(resp.request_id)) AS nb_bid_requests,
COUNT(resp.request_id) AS nb_bid_responses,
FROM
server.Response resp
RIGHT JOIN EACH
(
SELECT *
FROM
[server.Request]
WHERE
DATEDIFF(CURRENT_TIMESTAMP(), date) < 3
) req
ON
req.request_id = resp.request_id
GROUP EACH BY
day,
hour
ORDER BY
day,
hour


BigQuery runs indefinitely

I have a query like this:
WITH A AS (
SELECT id FROM db1.X AS d
WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
),
B AS (
SELECT id
FROM db2.Y as t
WHERE
t.start <= TIMESTAMP(DATE_SUB(current_date(), INTERVAL 7 DAY))
AND t.end >= TIMESTAMP(current_date())
)
SELECT * FROM A as d JOIN B as t on d.id = t.id;
db1.X has 1.6 Billion rows.
db2.Y has 15K rows.
db1.X is a materialized view on a bigger table.
db2.Y is a table with source as a google sheet.
Issue
The query keeps running indefinitely.
I had to cancel it when it reached about an hour, but one query which I left running went on for 6 hours and then timed out without any further error.
The query used to run fine until 2nd Jan. After that I reran it on 9th Jan and it never ended. Both tables are auto-populated, so it is possible that they crossed some threshold during this time, but I could not find any such threshold value. (Three other queries on the same tables met a similar fate.)
What's tried
Replaced the join with a WHERE IN. Still never-ending.
No operation works on A, but all work on B. For example, SELECT count(*) FROM B; works, while the same on A keeps going. (But it works when the definition of B is removed.)
The above behaviour is replicated even when not using subqueries.
A has 10.6 million rows, B has 31 rows (much less than the actual tables, but still the same result).
The actual query had no subqueries and used only multiple date comparisons while joining, so I added subqueries that filter the data before it goes into the join (this is the query above). But it also runs indefinitely.
JOIN EACH: this never got past syntax errors. Replacing JOIN with JOIN EACH in the above query complains about the "AS"; removing that, it complains that I should use dataset.tablename; on fixing that, it complains Expected end of input but got "."
It turns out that the table size is the problem.
I created a smaller table and ran exactly the same queries, and that works.
This was also expected because the query just stopped running one day. The only variable was the amount of data in source tables.
In my case, I needed the data every week, so I created a scheduled query to update the smaller table with only 1 month worth of data.
The smaller versions of the tables have:
db1.X: 40 million rows
db2.Y: 400 rows
Not sure what's going on exactly in terms of issues due to size, but apart from some code-clarity points your query should run as expected. Am I correct in reading from your query that table A should return results within the last 7 days, whereas table B should return results outside of the last 7 days? Some things you might try to make debugging easier:
Use BETWEEN and dates, e.g. WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
Use backticks (`) around table names in your FROM clause to prevent errors like the one you mentioned (Expected end of input but got ".")
Limit your CTE instead of the outer query. A LIMIT in the outer query has no effect on the computed data, only on the output. E.g. to limit the source data from table A, instead use:
WITH A AS (
SELECT id FROM `db1.X`
WHERE DATE(date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
LIMIT 10
)
...

comparing usage of inner join and where in

with two tables - all_data and selected_place_day_hours
all_data has place_id, day, hour, metric
selected_place_day_hours has fields place_id, day, hour
I need to subset all_data such that only records with place_id, day, hour in selected_place_day_hours are selected.
I can go two ways about it:
1. Use an inner join:
select a.*
from all_data as a
inner join selected_place_day_hours as b
on (a.place_id = b.place_id)
and ( a.day = b.day)
and ( a.hour = b.hour)
;
2. Use WHERE IN:
select *
from all_data
where
place_id in (select place_id from selected_place_day_hours)
and day in (select day from selected_place_day_hours)
and hour in (select hour from selected_place_day_hours)
;
I want to get some idea of why, when, and whether you would choose one over the other, from a functional and performance perspective.
One thought is that in #2 above, the sub-selects are probably not performance-friendly, and the code is longer.
The two are semantically different.
The IN does a semi-join, meaning that it returns each row from all_data at most once, regardless of how many rows are matched in selected_place_day_hours.
The JOIN can return multiple rows.
So, the first piece of advice is to use the version that is correct for what you want to accomplish.
Assuming the data in selected_place_day_hours guarantees at most one match, the question becomes one of performance. The next piece of advice is to try both queries on your data and on your system. However, JOIN is often optimized at least as well as IN, so it would usually be a safe choice.
These days, SQL tends to ignore what you say and do its own thing.
This is because SQL is a declarative language, not a procedural one: you tell it what you want, not how to do it. The SQL engine works out what you want and devises its own plan for how to get the results.
In this case, the two versions will probably produce an identical plan, regardless of how you write them. In any case, the plan chosen should be the most efficient one available.
The reasons to prefer the join syntax over the older where syntax are:
to look cool: you don’t want anybody catching you with code that is old-fashioned
the join syntax is easy to adapt to outer joins
the join syntax allows you to separate the join part from additional filter by distinguishing between join and where
The reasons do not include whether one is better, because the interpreter will handle that.
These are some more notes that are too long for a comment.
First, it should be shown that your two queries are different. (Maybe the 2nd query you wrote is simply wrong.)
For example:
all_data
place_id  day   hour  other_cols...
1         4     6     ....
selected_place_day_hours
place_id  day   hour
1         4     9
4444      4444  6
Then your 1st query will return no rows, and your 2nd will return (1, 4, 6).
One more note: if (place_id, day, hour) is unique, your first query serves the same purpose as the following query:
SELECT *
FROM all_data
WHERE
(place_id, day, hour) IN (
SELECT place_id, day, hour
FROM selected_place_day_hours
);
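The semantic difference is easy to demonstrate in miniature, here with sqlite3 and made-up data (the table name sel is shorthand for selected_place_day_hours): with a duplicate on the right-hand side, the JOIN returns one output row per match, while the tuple IN (a semi-join) returns each left row at most once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE all_data (place_id INT, day INT, hour INT)")
cur.execute("CREATE TABLE sel (place_id INT, day INT, hour INT)")
cur.execute("INSERT INTO all_data VALUES (1, 4, 6)")
# The same (place_id, day, hour) appears twice on the right-hand side
cur.executemany("INSERT INTO sel VALUES (?, ?, ?)",
                [(1, 4, 6), (1, 4, 6)])

join_rows = cur.execute("""
    SELECT a.* FROM all_data a
    JOIN sel b ON a.place_id = b.place_id
              AND a.day = b.day AND a.hour = b.hour
""").fetchall()
print(len(join_rows))  # 2: one output row per match

in_rows = cur.execute("""
    SELECT * FROM all_data
    WHERE (place_id, day, hour) IN (SELECT place_id, day, hour FROM sel)
""").fetchall()
print(len(in_rows))  # 1: semi-join, each left row at most once
```

So unless the right-hand key is known to be unique, the two forms are not interchangeable.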

SQL for Next/Prior Business Day from Calendar table (in MS Access)

I have a Calendar table pulled from our mainframe DBs and saved as a local Access table. The table has history back to the 1930s (and I know we use back to the 50s in at least one place), resulting in 31k records. This Calendar table has 3 fields of interest:
Bus_Dt - every day, not just business days. Primary Key
Bus_Day_Ind - indicates if the day was a valid business day for the stock market.
Prir_Bus_Dt - the prior business day. Contains some errors (about 50), all old.
I have written a query to retrieve the first business day on or after the current calendar day, but it runs supremely slowly (5+ minutes). I have examined the showplan output and see it is being run via an x-join, which between 30k+ record tables gives a solution space (and date comparisons) in the order of nearly 10 million. However, the actual task is not hard, and could be performed comfortably by Excel in minimal time using a simple sort.
My question is thus: is there any way to fix the poor performance of the query, or is this an inherent failing of SQL? (DB2 run on the mainframe is also slow, though not crushingly so. Throwing cycles at the problem and all that.) Secondarily, if I were to trust Prir_Bus_Dt, can I get there better? Or restrict the date range (i.e. "cheat"), or any other tricks I haven't thought of yet?
SQL:
SELECT TE2Clndr.BUS_DT AS Cal_Dt
, Min(TE2Clndr_1.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr
, TE2Clndr AS TE2Clndr_1
WHERE TE2Clndr_1.BUS_DAY_IND="Y" AND
TE2Clndr.BUS_DT<=[te2clndr_1].[bus_dt]
GROUP BY TE2Clndr.BUS_DT;
Showplan:
Inputs to Query
Table 'TE2Clndr'
Table 'TE2Clndr'
End inputs to Query
01) Restrict rows of table TE2Clndr
by scanning
testing expression "TE2Clndr_1.BUS_DAY_IND="Y""
store result in temporary table
02) Inner Join table 'TE2Clndr' to result of '01)'
using X-Prod join
then test expression "TE2Clndr.BUS_DT<=[te2clndr_1].[bus_dt]"
03) Group result of '02)'
Again, the question is, can this be made better (faster), or is this already as good as it gets?
I have a new query that is much faster for the same job, but it depends on the Prir_Bus_Dt field (which has some errors). It also isn't great in theory, since the prior business day is not necessarily available on everyone's calendar. So I don't consider this "the" answer, merely an answer.
New query:
SELECT TE2Clndr.BUS_DT as Cal_Dt
, Max(TE2Clndr_1.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr
INNER JOIN TE2Clndr AS TE2Clndr_1
ON TE2Clndr.PRIR_BUS_DT = TE2Clndr_1.PRIR_BUS_DT
GROUP BY TE2Clndr.BUS_DT;
What about this approach:
select min(bus_dt)
from te2Clndr
where bus_dt >= date()
and bus_day_ind = 'Y'
This is my reference for date() representing the current date
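The question's observation that a simple sort would do the job is worth spelling out: a single backward pass over the calendar, sorted by date, yields the next business day for every calendar day with no self-join at all. A sketch of that idea in Python, with a made-up four-day calendar of (bus_dt, bus_day_ind) pairs:

```python
from datetime import date

# Hypothetical miniature calendar: (bus_dt, bus_day_ind)
calendar = [
    (date(2024, 1, 5), "Y"),  # Friday
    (date(2024, 1, 6), "N"),  # Saturday
    (date(2024, 1, 7), "N"),  # Sunday
    (date(2024, 1, 8), "Y"),  # Monday
]

# Walk from newest to oldest, remembering the most recent business day
# seen; that is exactly the "first business day on or after" each date.
next_bus = {}
upcoming = None
for bus_dt, ind in sorted(calendar, reverse=True):
    if ind == "Y":
        upcoming = bus_dt
    next_bus[bus_dt] = upcoming

print(next_bus[date(2024, 1, 6)])  # 2024-01-08
print(next_bus[date(2024, 1, 5)])  # 2024-01-05
```

This is O(n log n) for the sort and O(n) for the pass, versus the roughly quadratic cross join the showplan describes, which is why Excel handles it comfortably.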

How to adapt this query to use window functions

When I started tackling this problem, I thought, "This will be a great query to learn about window functions." I wasn't able to get it to work with window functions, but I was able to get what I wanted using a join.
How would you adapt this query to use window functions?
SELECT
day,
COUNT(i.project) as num_open
FROM generate_series(0, 364) as t(day)
LEFT JOIN issues i on (day BETWEEN i.closed_days_ago AND i.created_days_ago)
GROUP BY day
ORDER BY day;
The query above takes a list of issues, each with a range represented by created_days_ago and closed_days_ago, and for each of the last 365 days it counts the number of issues that were created but not yet closed on that specific day.
http://sqlfiddle.com/#!15/663f6/2
The issues table looks like:
CREATE TABLE issues (
id SERIAL,
project VARCHAR(255),
created_days_ago INTEGER,
closed_days_ago INTEGER);
What I was thinking was that the partition for a given day should include all the rows in issues where day is between the created and closed days ago. Something like SELECT day, COUNT(i.project) OVER (PARTITION day BETWEEN created_days_ago AND closed_days_ago) ...
I've never used window functions before, so I might be missing something basic, but it seemed like this was just the type of query that makes window functions so awesome.
The fact that you use generate_series() to create a full range of days, including those days with no changes, and thus no rows in table issues, does not rule out the use of window functions.
In fact, this query runs 50 times faster than the query in the Q in my local test:
SELECT t.day
, COALESCE(sum(a.created) OVER (ORDER BY t.day DESC), 0)
- COALESCE(sum(b.closed) OVER (ORDER BY t.day DESC), 0) AS open_tickets
FROM generate_series(0, 364) t(day)
LEFT JOIN (SELECT created_days_ago AS day, count(*) AS created
FROM issues GROUP BY 1) a USING (day)
LEFT JOIN (SELECT closed_days_ago AS day, count(*) AS closed
FROM issues GROUP BY 1) b USING (day)
ORDER BY 1;
It is also correct, as opposed to the query in the question, which reports 17 open tickets on day 0, although all of them have been closed.
The error is due to BETWEEN in your join condition, which includes the upper and lower bounds. This way tickets are still counted as "open" on the day they are closed.
Each row in the result reflects the number of open tickets at the end of the day.
Explanation
The query combines window functions with aggregate functions.
Subquery a counts the number of created tickets per day. This results in a single row per day, making the rest easier.
Subquery b does the same for closed tickets.
Use LEFT JOINs to join to the generated list of days in subquery t.
Be wary of joining to multiple unaggregated tables! That could trigger a CROSS JOIN among the joined tables for multiple matches per row, generating incorrect results. Compare:
Two SQL LEFT JOINS produce incorrect result
Finally use two window functions to compute the running total of created versus closed tickets.
An alternative would be to use this in the outer SELECT
sum(COALESCE(a.created, 0)
- COALESCE(b.closed, 0)) OVER (ORDER BY t.day DESC) AS open_tickets
It performs the same in my tests.
-> SQLfiddle demo.
Aside: I would never store "days_ago" in a table, but the absolute date / timestamp. Looks like a simplification for the purpose of this question.
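The running-total logic of the answer can be sketched in plain Python with made-up data: count created and closed tickets per day, then accumulate the difference from the newest day backwards, exactly as the two window functions ordered by day DESC do.

```python
# Hypothetical issues as (created_days_ago, closed_days_ago) pairs;
# None means still open.
issues = [(5, 2), (4, 4), (3, None)]
days = range(0, 6)

# Per-day created/closed counts (the roles of subqueries a and b)
created = {d: 0 for d in days}
closed = {d: 0 for d in days}
for c, x in issues:
    created[c] += 1
    if x is not None:
        closed[x] += 1

# Running difference ordered by day DESC (the window functions' job);
# each value is the number of tickets open at the end of that day.
open_tickets = {}
running = 0
for d in sorted(days, reverse=True):
    running += created[d] - closed[d]
    open_tickets[d] = running

print(open_tickets)  # {5: 1, 4: 1, 3: 2, 2: 1, 1: 1, 0: 1}
```

Note how the issue created and closed on day 4 contributes zero net change, and the still-open issue from day 3 keeps the count at 1 through day 0.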

Optimizing CTE Queries

Yesterday I asked a question about CTEs and running total calculations:
Calculating information by using values from previous line
I came up with a solution, but when I went to apply it to my actual database (over 4.5 million records) it seems to take forever. It ran for over 3 hours before I stopped it. I then tried to run it on a subset (CTEtest as (select top 100)) and it's been going for an hour and a half. Is this because it still needs to run through the whole thing before selecting the top 100? Or should I assume that if this query takes 2 hours for 100 records, it will take days for 4.5 million? How can I optimize it?
Is there any way to see how much time is remaining on the query?
I think you are better off computing the running sum as a correlated subquery. This will allow you to manage indexes for performance more easily:
select memberid,
(select sum(balance - netamt) as runningsum
from txn_by_month t2
where t2.memberid = t.memberid and
t2.accountid <= t.accountid
) as RunningSum
from txn_by_month t
With this structure, an index on txn_by_month(memberid, accountid, balance, netamt) should be able to satisfy this part of the query without going back to the original data.
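A runnable miniature of this pattern, using sqlite3 for illustration (the table and column names follow the answer; the data and index name are made up): the covering index contains every column the correlated subquery touches, so the running sum can be satisfied from the index alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE txn_by_month
               (memberid INT, accountid INT, balance INT, netamt INT)""")
cur.executemany("INSERT INTO txn_by_month VALUES (?, ?, ?, ?)",
                [(1, 1, 100, 10), (1, 2, 200, 20), (2, 1, 50, 5)])
# The covering index the answer recommends: all referenced columns included,
# so the subquery never needs to touch the base table.
cur.execute("""CREATE INDEX ix ON txn_by_month
               (memberid, accountid, balance, netamt)""")

rows = cur.execute("""
    SELECT memberid, accountid,
           (SELECT SUM(balance - netamt)
              FROM txn_by_month t2
             WHERE t2.memberid = t.memberid
               AND t2.accountid <= t.accountid) AS RunningSum
      FROM txn_by_month t
     ORDER BY memberid, accountid
""").fetchall()
print(rows)  # [(1, 1, 90), (1, 2, 270), (2, 1, 45)]
```

Each member's sum restarts (member 2 starts at 45), and within a member the sum accumulates by accountid, which is the running-total behaviour the original CTE was after.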