Bigquery runs indefinitely - google-bigquery

I have a query like this:
WITH A AS (
SELECT id FROM db1.X AS d
WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
),
B AS (
SELECT id
FROM db2.Y as t
WHERE
t.start <= TIMESTAMP(DATE_SUB(current_date(), INTERVAL 7 DAY))
AND t.end >= TIMESTAMP(current_date())
)
SELECT * FROM A as d JOIN B as t on d.id = t.id;
db1.X has 1.6 Billion rows.
db2.Y has 15K rows.
db1.X is a materialized view on a bigger table.
db2.Y is a table with source as a google sheet.
Issue
The query keeps running indefinitely.
I had to cancel it when it reached about an hour, but one query which I left running went on for 6 hours and then timed-out without any further error.
The query used to run fine till 2nd Jan, After that I reran it on 9th Jan and it never ended. Both the tables are auto-populated so it is possible that they ran over some threshold during this time, but I could not find any such threshold value. (Similar fate of 3 other queries, same tables)
What's tried
Removed join to use a WHERE IN. Still never ending.
No operation works on A, but all work on B. For ex: SELECT count(*) from B; will work. It keeps on going for A. (But it works when the definition of B is removed)
The above behaivour is replicated even when not using subqueries.
A has 10.6 Million rows, B has 31 rows (Much less than actual table, but still the same result)
The actual query was without any subqueries and used only multiple date comparisons while joining. So I used subqueries which filters data before going into the join. (This is the one above) But it also runs indefinitely
JOIN EACH: This never got out of syntax error. Replacing JOIN with JOIN EACH in above query complains about the "AS", removing that it complains that I should use dataset.tablename, on fixing that it complains Expected end of input but got "."

It turns out that the table size is the problem.
I created a smaller table and ran exactly the same queries, and that works.
This was also expected because the query just stopped running one day. The only variable was the amount of data in source tables.
In my case, I needed the data every week, so I created a scheduled query to update the smaller table with only 1 month worth of data.
The smaller versions of the tables have:
db1.X: 40 million rows
db2.Y: 400 rows

Not sure what's going on exactly in terms of issues due to size, but apart from some code clarity your query should run as expected. Am I correct in reading from your query that table A should return results within the last 7 days whereas table B should return results outside of the last 7 days? Some things you might try to make debugging easier.
Use BETWEEN and dates. E.g. WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
Use a backtick (`) for your FROM statement to prevent table name errors like the one you mentioned (expected end of input but got ".")
Limit your CTE instead of the outer query. The current limit in the outer query has no effect on computed data only on the output. E.g. to limit the source data from table A instead use:
WITH A AS (
SELECT id FROM `db1.X`
WHERE DATE(date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
LIMIT 10
)
...

Related

comparing usage of inner join and where in

with two tables - all_data and selected_place_day_hours
all_data has place_id, day, hour, metric
selected_place_day_hours has fields place_id, day, hour
I need to subset all_data such that only records with place_id, day, hour in selected_place_day_hours are selected.
I can go two ways about it
1.Use inner join
select a.*
from all_data as a
inner join selected_place_day_hours as b
on (a.place_id = b.place_id)
and ( a.day = b.day)
and ( a.hour = b.hour)
;
2.Use where in
select *
from all_data
where
place_id in (select place_id from selected_place_day_hours)
and day in (select day from selected_place_day_hours)
and hour in (select day from selected_place_day_hours)
;
I want to get some idea on why, when, if you would choose one over the other from a functional and performance perspective ?
One thought is that in #2 above, probably sub-selects is not performance friendly and also longer code.
The two are semantically different.
The IN does a semi-join, meaning that it returns one from all_data regardless of how many rows are matched in selected_place_day_hours.
The JOIN can return multiple rows.
So, the first piece of advice is to use the version that is correct for what you want to accomplish.
Assuming the data in select_place_day_hours guarantees at most one match, then you have an issue with performance. The first piece of advice is to try both queries on your data and on your system. However, often JOIN is optimized at least as well as IN, so that would usually be a safe choice.
These days, SQL tends to ignore what you say and do its own thing.
This is why SQL is a declarative language, not a programming language: you tell it what you want, not how to do it. The SQL interpreter will work out what you want and devise its own plan for how to get the results.
In this case, the 2 versions will probably produce an identical plan, regardless of how you write it. In any case, the plan chosen will be the most efficient one.
The reasons to prefer the join syntax over the older where syntax are:
to look cool: you don’t want anybody catching you with code that is old-fashioned
the join syntax is easy to adapt to outer joins
the join syntax allows you to separate the join part from additional filter by distinguishing between join and where
The reasons do not include whether one is better, because the interpreter will handle that.
These are some more notes that are too long for a comment.
First it should be showed that your two queries is different. (Maybe the 2nd query you wrote is a wrong query)
For example:
all_data
place_id day hour other_cols...
1 4 3 ....
selected_place_day_hours
place_id day hour
1 4 9
4444 4444 6
Then your 1st query will get no row in return, and your 2nd will return (1, 4, 6)
One more note is that if (place_id, day, hour) is unique, your first query is in same purpose of following query
SELECT *
FROM all_data
WHERE
(place_id, day, hour) IN (
SELECT place_id, day, hour
FROM selected_place_day_hours
);

Bigquery query JOIN unusually slow

I'm trying to run this query but it is, to my limited level of comprehension, absurdingly slow.
Here is the query :
SELECT
STRFTIME_UTC_USEC(req.date, "%Y-%m-%d") AS day,
HOUR(req.date) AS hour,
10000*(COUNT(req.request_id) - COUNT(resp.request_id)) AS nb_bid_requests,
COUNT(resp.request_id) AS nb_bid_responses,
FROM
[server.Request] req
LEFT JOIN EACH
server.Response resp
ON
req.request_id = resp.request_id
WHERE
DATEDIFF(CURRENT_TIMESTAMP(), req.date) < 3
GROUP EACH BY
day,
hour
ORDER BY
day,
hour
What bugs me the most is that this exact same query works perfectly fine on the Production project which has the same datasets, tables and fields (with the same data types and names). The only difference is that the Production has more data than the Dev.
I'm not in any case an expert in SQL and I'd enjoy to be told where I could improve the query.
Thank you in advance.
EDIT: Hi, solved the issue.
It was caused by a great number of request_id being duplicates in server.Response which slowed "a little bit" the query.
Try pushing your WHERE clause down inside the join.
BigQuery's optimizer does not (yet) push predicates inside joins, so the query you posted joins all of your data and then filters it, instead of just joining the parts you care about. If you have a date field on both request and response, put filters on both sides of the join!
If you can't filter both sides of the join, then switch sides so that the smaller (filtered) table is on the right. Because of how BQ joins are implemented, they typically perform better if the smaller table is on the right.
SELECT
STRFTIME_UTC_USEC(req.date, "%Y-%m-%d") AS day,
HOUR(req.date) AS hour,
10000*(COUNT(req.request_id) - COUNT(resp.request_id)) AS nb_bid_requests,
COUNT(resp.request_id) AS nb_bid_responses,
FROM
server.Response resp
RIGHT JOIN EACH
(
SELECT *
FROM
[server.Request]
WHERE
DATEDIFF(CURRENT_TIMESTAMP(), date) < 3
) req
ON
req.request_id = resp.request_id
GROUP EACH BY
day,
hour
ORDER BY
day,
hour

How to adapt this query to use window functions

When I started tackling this problem, I thought, "This will be a great query to learn about Window Functions." I wasn't able to end up getting it to work with window functions, but I was able to get what I wanted using a join.
How would you adapt this query to use window functions:
SELECT
day,
COUNT(i.project) as num_open
FROM generate_series(0, 364) as t(day)
LEFT JOIN issues i on (day BETWEEN i.closed_days_ago AND i.created_days_ago)
GROUP BY day
ORDER BY day;
The query above takes a list of issues that have a range represented by created_days_ago and closed_days ago and for the last 365 days, it'll count the number of issues that were created but not yet closed for that specific day.
http://sqlfiddle.com/#!15/663f6/2
The issues table looks like:
CREATE TABLE issues (
id SERIAL,
project VARCHAR(255),
created_days_ago INTEGER,
closed_days_ago INTEGER);
What I was thinking was that the partition for a given day should include all the rows in issues where day is between the created and closed days ago. Something like SELECT day, COUNT(i.project) OVER (PARTITION day BETWEEN created_days_ago AND closed_days_ago) ...
I've never used window functions before, so I might be missing something basic, but it seemed like this was just the type of query that makes window functions so awesome.
The fact that you use generate_series() to create a full range of days, including those days with no changes, and thus no rows in table issues, does not rule out the use of window functions.
In fact, this query runs 50 times faster than the query in the Q in my local test:
SELECT t.day
, COALESCE(sum(a.created) OVER (ORDER BY t.day DESC), 0)
- COALESCE(sum(b.closed) OVER (ORDER BY t.day DESC), 0) AS open_tickets
FROM generate_series(0, 364) t(day)
LEFT JOIN (SELECT created_days_ago AS day, count(*) AS created
FROM issues GROUP BY 1) a USING (day)
LEFT JOIN (SELECT closed_days_ago AS day, count(*) AS closed
FROM issues GROUP BY 1) b USING (day)
ORDER BY 1;
It is also correct, as opposed to the query in the question, which results in 17 open tickets on day 0, although all of them have been closed.
The error is due to BETWEEN in your join condition, which includes upper and lower border. This way tickets are still counted as "open" on the day they are closed.
Each row in the result reflects the number of open tickets at the end of the day.
Explain
The query combines window functions with aggregate functions.
Subquery a counts the number of created tickets per day. This results in a single row per day, making the rest easier.
Subquery b does the same for closed tickets.
Use LEFT JOINs to join to the generated list of days in subquery t.
Be wary of joining to multiple unaggregated tables! That could trigger a CROSS JOIN among the joined tables for multiple matches per row, generating incorrect results. Compare:
Two SQL LEFT JOINS produce incorrect result
Finally use two window functions to compute the running total of created versus closed tickets.
An alternative would be to use this in the outer SELECT
sum(COALESCE(a.created, 0)
- COALESCE(b.closed, 0)) OVER (ORDER BY t.day DESC) AS open_tickets
Performs the same in my tests.
-> SQLfiddle demo.
Aside: I would never store "days_ago" in a table, but the absolute date / timestamp. Looks like a simplification for the purpose of this question.

BigQuery gives Response Too Large error for whole dataset but not for equivalent subqueries

I have a table in BigQuery with the following fields:
time,a,b,c,d
time is a string in ISO8601 format but with a space, a is an integer from 1 to 16000, and the other columns are strings. The table contains one month's worth of data, and there are a few million records per day.
The following query fails with "response too large":
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,b,c,d,count(a),count(distinct a, 1000000)
from [myproject.mytable]
group by day,b,c,d
order by day,b,c,d asc
However, this query works (the data starts at 2012-01-01)
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,
b,c,d,count(a),count(distinct a)
from [myproject.mytable]
where UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) = UTC_USEC_TO_DAY(PARSE_UTC_USEC('2012-01-01 00:00:00'))
group by day,b,c,d
order by day,b,c,d asc
This looks like it might be related to this issue. However, because of the group by clause, the top query is equivalent to repeatedly calling the second query. Is the query planner not able to handle this?
Edit: To clarify my test data:
I am using fake test data I generated. I originally used several fields and tried to get hourly summaries for a month (group by hour, where hour is defined using as in the select part of the query). When that failed I tried switching to daily. When that failed I reduced the columns involved. That also failed when using a count (distinct xxx, 1000000), but it worked when I just did one day's worth. (It also works if I remove the 1000000 parameter, but since that does work with the one-day query it seems the query planner is not separating things as I would expect.)
The one checked for count (distinct) has cardinality 16,000, and the group by columns have cardinality 2 and 20 for a total of just 1200 expected rows. Column values are quite short, around ten characters.
How many results do you expect? There is currently a limitation of about 64MB in the total size of results that are allowed. If you're expecting millions of rows as a result, than this may be an expected error.
If the number of results isn't extremely large, it may be that the size problem is not the final response, but the internal calculation. Specifically, if there are too many results from the GROUP BY, the query can run out of memory. One possible solution is to change "GROUP BY" to "GOUP EACH BY" which alters the way the query is executed. This is a feature that is currently experimental, and as such, is not yet documented.
For your query, since you reference fields named in the select in the group by, you might need to do this:
select day, b,c,d,day,count(a),count(distinct a, 1000000)
FROM (
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day, b, c, d
from [myproject.mytable]
)
group EACH by day,b,c,d
order by day,b,c,d asc

SQL Select statement Where time is *:00

I'm attempting to make a filtered table based off an existing table. The current table has rows for every minute of every hour of 24 days based off of locations (tmcs).
I want to filter this table into another table that has rows for just 1 an hour for each of the 24 days based off the locations (tmcs)
Here is the sql statement that i thought would have done it...
SELECT
Time_Format(t.time, '%H:00') as time, ROUND(AVG(t.avg), 0) as avg,
tmc, Date, Date_Time FROM traffic t
GROUP BY time, tmc, Date
The problem is i still get 247,000 rows effected...and according to simple math I should only have:
Locations (TMCS): 14
Hours in a day: 24
Days tracked: 24
Total = 14 * 24 * 24 = 12,096
My original table has 477,277 rows
When I make a new table off this query i get right around 247,000 which makes no sense, so my query must be wrong.
The reason I did this method instead of a where clause is because I wanted to find the average speed(avg)per hour. This is not mandatory so I'd be fine with using a Where clause for time, but I just don't know how to do this based off *:00
Any help would be much appreciated
Fix the GROUP BY so it's standard, rather then the random MySQL extension
SELECT
Time_Format(t.time, '%H:00') as time,
ROUND(AVG(t.avg), 0) as avg,
tmc, Date, Date_Time
FROM traffic t
GROUP BY
Time_Format(t.time, '%H:00'), tmc, Date, Date_Time
Run this with SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY'; to see the errors that other RDBMS will give you and make MySQL work properly