Optimizing CTE Queries - sql

Yesterday I asked a question about CTEs and running total calculations:
Calculating information by using values from previous line
I came up with a solution; however, when I went to apply it to my actual database (over 4.5 million records) it seems to take forever. It ran for over 3 hours before I stopped it. I then tried to run it on a subset (CTEtest as (select top 100)) and it's been going for an hour and a half. Is this because it still needs to run through the whole thing before selecting the top 100? Or should I assume that if this query takes 2 hours for 100 records, it will take days for 4.5 million? How can I optimize this?
Is there any way to see how much time is remaining on the query?

I think you are better off doing the running sum as a correlated subquery. This will allow you to better manage indexes for performance:
select memberid,
       (select sum(balance - netamt)        -- running total of balance - netamt
        from txn_by_month t2
        where t2.memberid = t.memberid      -- same member as the outer row
          and t2.accountid <= t.accountid   -- all accounts up to and including this one
       ) as RunningSum
from txn_by_month t
With this structure, an index on txn_by_month(memberid, accountid, balance, netamt) should be able to satisfy this part of the query, without going back to the original data.
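A minimal sketch of that index, assuming SQL Server (the question uses TOP) and the column names above; the index name is illustrative:
-- Keying on (memberid, accountid) matches the correlation and range predicates;
-- balance and netamt ride along in the index (on SQL Server they could instead be
-- INCLUDE columns) so the SUM can be computed without touching the base table.
create index ix_txn_by_month_member_account
    on txn_by_month (memberid, accountid, balance, netamt);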

Related

Bigquery runs indefinitely

I have a query like this:
WITH A AS (
SELECT id FROM db1.X AS d
WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
),
B AS (
SELECT id
FROM db2.Y as t
WHERE
t.start <= TIMESTAMP(DATE_SUB(current_date(), INTERVAL 7 DAY))
AND t.end >= TIMESTAMP(current_date())
)
SELECT * FROM A as d JOIN B as t on d.id = t.id;
db1.X has 1.6 Billion rows.
db2.Y has 15K rows.
db1.X is a materialized view on a bigger table.
db2.Y is a table with source as a google sheet.
Issue
The query keeps running indefinitely.
I had to cancel it when it reached about an hour, but one query that I left running went on for 6 hours and then timed out without any further error.
The query used to run fine until 2nd Jan. When I reran it on 9th Jan it never ended. Both tables are auto-populated, so it is possible that they grew past some threshold during this time, but I could not find any such threshold value. (Three other queries on the same tables met a similar fate.)
What I've tried
Replaced the join with a WHERE IN. Still never-ending.
No operation works on A, but everything works on B. For example, SELECT count(*) FROM B; works, while the same on A keeps running. (But A works when the definition of B is removed.)
The above behaviour is replicated even when not using subqueries.
A has 10.6 million rows, B has 31 rows (much less than the actual tables, but still the same result).
The actual query had no subqueries and used only multiple date comparisons while joining, so I switched to subqueries that filter the data before it goes into the join (the query above). It also runs indefinitely.
JOIN EACH: This never got past syntax errors. Replacing JOIN with JOIN EACH in the above query complains about the "AS"; removing that, it complains that I should use dataset.tablename; on fixing that, it complains Expected end of input but got ".".
It turns out that the table size is the problem.
I created a smaller table and ran exactly the same queries, and that works.
This was also expected, because the query just stopped working one day; the only variable was the amount of data in the source tables.
In my case I need the data every week, so I created a scheduled query that refreshes the smaller table with only one month's worth of data (sketched below).
The smaller versions of the tables have:
db1.X: 40 million rows
db2.Y: 400 rows
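A rough sketch of such a refresh in standard SQL; the target table name `db1.X_recent` and the one-month window are assumptions for illustration, and the statement would be run weekly as a scheduled query:
-- Illustrative only: rebuilds a one-month slice of db1.X into a smaller table.
CREATE OR REPLACE TABLE `db1.X_recent` AS
SELECT *
FROM `db1.X`
WHERE DATE(date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH);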
I'm not sure what exactly is going on in terms of size issues, but apart from some code-clarity points your query should run as expected. Am I reading your query correctly that table A should return results within the last 7 days whereas table B should return results outside of the last 7 days? Some things you might try to make debugging easier:
Use BETWEEN and dates. E.g. WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
Use backticks (`) around table names in your FROM clause to prevent errors like the one you mentioned (Expected end of input but got ".").
Limit your CTE instead of the outer query. A LIMIT in the outer query has no effect on the data that is computed, only on the output. E.g. to limit the source data from table A, use:
WITH A AS (
SELECT id FROM `db1.X`
WHERE DATE(date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
LIMIT 10
)
...

comparing usage of inner join and where in

I have two tables, all_data and selected_place_day_hours:
all_data has place_id, day, hour, metric
selected_place_day_hours has fields place_id, day, hour
I need to subset all_data such that only records with place_id, day, hour in selected_place_day_hours are selected.
I can go two ways about it:
1. Use inner join
select a.*
from all_data as a
inner join selected_place_day_hours as b
on (a.place_id = b.place_id)
and ( a.day = b.day)
and ( a.hour = b.hour)
;
2. Use where in
select *
from all_data
where
place_id in (select place_id from selected_place_day_hours)
and day in (select day from selected_place_day_hours)
and hour in (select hour from selected_place_day_hours)
;
I'd like to get some idea of why, when, and whether you would choose one over the other, from a functional and performance perspective.
One thought is that in #2 above, the sub-selects are probably not performance-friendly, and the code is longer.
The two are semantically different.
The IN acts as a semi-join, meaning that it returns each row from all_data at most once, regardless of how many rows match in selected_place_day_hours.
The JOIN can return the same all_data row multiple times.
So, the first piece of advice is to use the version that is correct for what you want to accomplish.
Assuming the data in selected_place_day_hours guarantees at most one match, the question becomes one of performance. The best advice is to try both queries on your data and on your system. However, JOIN is often optimized at least as well as IN, so it would usually be a safe choice.
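If the intent really is "keep all_data rows that have a matching (place_id, day, hour)", an EXISTS expresses that semi-join directly and never multiplies rows; a sketch using the tables from the question:
-- Semi-join: each all_data row appears at most once,
-- no matter how many matching rows exist in selected_place_day_hours.
select a.*
from all_data a
where exists (
    select 1
    from selected_place_day_hours b
    where b.place_id = a.place_id
      and b.day = a.day
      and b.hour = a.hour
);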
These days, the database tends to ignore how you phrase things and does its own thing.
This is why SQL is a declarative language rather than a procedural one: you tell it what you want, not how to do it. The query optimizer works out what you want and devises its own plan for how to get the results.
In this case, the two versions will probably produce an identical plan, regardless of how you write it. In any case, the plan chosen will be the one the optimizer estimates to be the most efficient.
The reasons to prefer the join syntax over the older where syntax are:
to look cool: you don’t want anybody catching you with code that is old-fashioned
the join syntax is easy to adapt to outer joins
the join syntax allows you to separate the join conditions from additional filters by distinguishing between ON and WHERE (see the sketch below)
The reasons do not include whether one is better, because the optimizer will handle that.
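To illustrate those last two points with the tables from the question: the old comma-style syntax mixes the join predicate and the filters in one WHERE clause, while the explicit JOIN keeps them apart. The extra metric filter here is purely illustrative.
-- Old-style comma join: join predicate and filter share the WHERE clause
select a.*
from all_data a, selected_place_day_hours b
where a.place_id = b.place_id
  and a.day = b.day
  and a.hour = b.hour
  and a.metric > 0;       -- illustrative filter

-- Explicit join: ON carries the join, WHERE carries the filter
select a.*
from all_data a
inner join selected_place_day_hours b
    on a.place_id = b.place_id
   and a.day = b.day
   and a.hour = b.hour
where a.metric > 0;       -- illustrative filter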
These are some more notes that are too long for a comment.
First, it should be shown that your two queries are different. (Maybe the 2nd query as you wrote it is simply wrong.)
For example:
all_data
place_id  day  hour  other_cols...
1         4    6     ....
selected_place_day_hours
place_id  day   hour
1         4     9
4444      4444  6
Then your 1st query returns no rows, while your 2nd returns the row (1, 4, 6).
One more note: if (place_id, day, hour) is unique in selected_place_day_hours, your first query serves the same purpose as the following query
SELECT *
FROM all_data
WHERE
(place_id, day, hour) IN (
SELECT place_id, day, hour
FROM selected_place_day_hours
);

Bigquery query JOIN unusually slow

I'm trying to run this query, but it is, to my limited level of comprehension, absurdly slow.
Here is the query :
SELECT
STRFTIME_UTC_USEC(req.date, "%Y-%m-%d") AS day,
HOUR(req.date) AS hour,
10000*(COUNT(req.request_id) - COUNT(resp.request_id)) AS nb_bid_requests,
COUNT(resp.request_id) AS nb_bid_responses,
FROM
[server.Request] req
LEFT JOIN EACH
server.Response resp
ON
req.request_id = resp.request_id
WHERE
DATEDIFF(CURRENT_TIMESTAMP(), req.date) < 3
GROUP EACH BY
day,
hour
ORDER BY
day,
hour
What bugs me the most is that this exact same query works perfectly fine on the Production project, which has the same datasets, tables and fields (with the same data types and names). The only difference is that Production has more data than Dev.
I'm by no means an expert in SQL, so I'd be glad to be told where I could improve the query.
Thank you in advance.
EDIT: solved the issue.
It was caused by a large number of duplicate request_id values in server.Response, which slowed the query down "a little bit".
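A quick way to confirm that kind of duplication, in the same legacy syntax as the query above:
-- Lists request_ids that appear more than once in server.Response
SELECT request_id, COUNT(*) AS dup_count
FROM [server.Response]
GROUP EACH BY request_id
HAVING dup_count > 1
ORDER BY dup_count DESC
LIMIT 100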
Try pushing your WHERE clause down inside the join.
BigQuery's optimizer does not (yet) push predicates inside joins, so the query you posted joins all of your data and then filters it, instead of just joining the parts you care about. If you have a date field on both request and response, put filters on both sides of the join!
If you can't filter both sides of the join, then switch sides so that the smaller (filtered) table is on the right. Because of how BQ joins are implemented, they typically perform better if the smaller table is on the right.
SELECT
STRFTIME_UTC_USEC(req.date, "%Y-%m-%d") AS day,
HOUR(req.date) AS hour,
10000*(COUNT(req.request_id) - COUNT(resp.request_id)) AS nb_bid_requests,
COUNT(resp.request_id) AS nb_bid_responses,
FROM
server.Response resp
RIGHT JOIN EACH
(
SELECT *
FROM
[server.Request]
WHERE
DATEDIFF(CURRENT_TIMESTAMP(), date) < 3
) req
ON
req.request_id = resp.request_id
GROUP EACH BY
day,
hour
ORDER BY
day,
hour
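And if server.Response also carries a date field (an assumption here, since the question doesn't show Response's schema), you can combine both suggestions and filter both sides before the join:
SELECT
  STRFTIME_UTC_USEC(req.date, "%Y-%m-%d") AS day,
  HOUR(req.date) AS hour,
  10000*(COUNT(req.request_id) - COUNT(resp.request_id)) AS nb_bid_requests,
  COUNT(resp.request_id) AS nb_bid_responses,
FROM (
  SELECT request_id
  FROM [server.Response]
  WHERE DATEDIFF(CURRENT_TIMESTAMP(), date) < 3   -- assumes Response has a date column
) resp
RIGHT JOIN EACH (
  SELECT request_id, date
  FROM [server.Request]
  WHERE DATEDIFF(CURRENT_TIMESTAMP(), date) < 3
) req
ON req.request_id = resp.request_id
GROUP EACH BY day, hour
ORDER BY day, hour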

BigQuery gives Response Too Large error for whole dataset but not for equivalent subqueries

I have a table in BigQuery with the following fields:
time,a,b,c,d
time is a string in ISO8601 format but with a space, a is an integer from 1 to 16000, and the other columns are strings. The table contains one month's worth of data, and there are a few million records per day.
The following query fails with "response too large":
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,b,c,d,count(a),count(distinct a, 1000000)
from [myproject.mytable]
group by day,b,c,d
order by day,b,c,d asc
However, this query works (the data starts at 2012-01-01)
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,
b,c,d,count(a),count(distinct a)
from [myproject.mytable]
where UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) = UTC_USEC_TO_DAY(PARSE_UTC_USEC('2012-01-01 00:00:00'))
group by day,b,c,d
order by day,b,c,d asc
This looks like it might be related to this issue. However, because of the group by clause, the top query is equivalent to repeatedly calling the second query. Is the query planner not able to handle this?
Edit: To clarify my test data:
I am using fake test data that I generated. I originally used several fields and tried to get hourly summaries for a month (grouping by an hour column defined with an alias in the SELECT part of the query). When that failed I tried switching to daily. When that failed I reduced the columns involved. That also failed when using count(distinct xxx, 1000000), but it worked when I did just one day's worth. (It also works if I remove the 1000000 parameter, but since it does work with the one-day query, it seems the query planner is not separating things as I would expect.)
The column checked with count(distinct) has cardinality 16,000, and the group-by columns have cardinalities of 2 and 20, for a total of just 1,200 expected rows (2 × 20 combinations per day over roughly a month). Column values are quite short, around ten characters.
How many results do you expect? There is currently a limitation of about 64MB on the total size of results that are allowed. If you're expecting millions of rows as a result, then this may be an expected error.
If the number of results isn't extremely large, it may be that the size problem is not with the final response but with the internal calculation. Specifically, if there are too many results from the GROUP BY, the query can run out of memory. One possible solution is to change "GROUP BY" to "GROUP EACH BY", which alters the way the query is executed. This is a feature that is currently experimental and, as such, is not yet documented.
For your query, since the GROUP BY references fields aliased in the SELECT, you might need to do this:
select day, b, c, d, count(a), count(distinct a, 1000000)
FROM (
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day, b, c, d
from [myproject.mytable]
)
group EACH by day,b,c,d
order by day,b,c,d asc
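To check whether it's the result itself that is blowing past the limit, you can first count how many group combinations the full month actually produces; a sketch in the same syntax, following the wrapping pattern above:
-- Counts the distinct (day, b, c, d) combinations the big query would return
SELECT COUNT(*) AS expected_rows
FROM (
  SELECT day, b, c, d
  FROM (
    SELECT UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) AS day, b, c, d
    FROM [myproject.mytable]
  )
  GROUP EACH BY day, b, c, d
)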

SQL: Calculating system load statistics

I have a table like this that stores messages coming through a system:
Message
-------
ID (bigint)
CreateDate (datetime)
Data (varchar(255))
I've been asked to calculate the messages saved per second at peak load. The only data I really have to work with is the CreateDate. The load on the system is not constant, there are times when we get a ton of traffic, and times when we get little traffic. I'm thinking there are two parts to this problem: 1. Determine ranges of time that are considered peak load, 2. Calculate the average messages per second during these times.
Is this the right approach? Are there things in SQL that can help with this? Any tips would be greatly appreciated.
I agree: you have to figure out what peak load is before you can start to create reports on it.
The first thing I would do is decide how to define peak load, e.g. am I going to look at an hour-by-hour breakdown?
Next I would group by CreateDate formatted to the second (no milliseconds). On top of that grouping I would take an average of the per-second record counts (see the sketch below).
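A sketch of that idea in SQL Server syntax, using the table and column names from the question: bucket the rows per second, then average (and take the peak of) those per-second counts within each hour.
-- Average and peak messages-per-second, broken down hour by hour.
-- Note: seconds with no messages produce no bucket, so they don't pull the average down.
select left(SecondBucket, 13) as HourBucket,                     -- 'yyyy-mm-dd hh'
       avg(1.0 * CountPerSecond) as AvgPerSecond,
       max(CountPerSecond) as PeakPerSecond
from (
    select convert(char(19), CreateDate, 120) as SecondBucket,   -- truncated to the second
           count(*) as CountPerSecond
    from Message
    group by convert(char(19), CreateDate, 120)
) as PerSecond
group by left(SecondBucket, 13)
order by AvgPerSecond desc;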
I don't think you'd need to know the peak hours; you can generate them with SQL, wrapping the full query and selecting the top 20 entries, for example:
select top 20 *
from (
[...load query here...]
) qry
order by LoadPerSecond desc
This answer had a good lesson about averages. You can calculate the load per second by looking at the load per hour, and dividing by 3600.
To get a first glimpse of the load for the last week, you could try (SQL Server syntax):
select datepart(dy,createdate) as DayOfYear,
       datepart(hour,createdate) as Hour,
       count(*)/3600.0 as LoadPerSecond
from message
where CreateDate > dateadd(week,-1,getdate())
group by datepart(dy,createdate), datepart(hour,createdate)
To find the peak load per minute:
select max(MessagesPerMinute)
from (
    select count(*) as MessagesPerMinute
    from message
    where CreateDate > dateadd(day,-7,getdate())
    group by datepart(dy,createdate), datepart(hour,createdate), datepart(minute,createdate)
) as PerMinute
Grouping by datepart(dy,...) is an easy way to distinguish between days without worrying about month borders. It works as long as you don't select more than a year back, but that would be unusual for performance queries.
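If the range could ever span a year boundary, grouping on the calendar date itself avoids the day-of-year wraparound; a sketch of the same hourly breakdown, assuming SQL Server 2008+ so the date type is available:
-- Same hourly breakdown, keyed on the calendar date instead of day-of-year
select cast(createdate as date) as CalendarDay,
       datepart(hour,createdate) as Hour,
       count(*)/3600.0 as LoadPerSecond
from message
where CreateDate > dateadd(week,-1,getdate())
group by cast(createdate as date), datepart(hour,createdate)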
Warning: these will run slowly!
This will group your data into "second" buckets and list them from most activity to least:
SELECT
CONVERT(char(19),CreateDate,120) AS CreateDateBucket,COUNT(*) AS CountOf
FROM Message
GROUP BY CONVERT(Char(19),CreateDate,120)
ORDER BY 2 Desc
this will group your data into "minute" buckets and list them from the most activity to least:
SELECT
LEFT(CONVERT(char(19),CreateDate,120),16) AS CreateDateBucket,COUNT(*) AS CountOf
FROM Message
GROUP BY LEFT(CONVERT(char(19),CreateDate,120),16)
ORDER BY 2 Desc
I'd take those values and calculate what they asked for.
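For instance, the per-second buckets above can be collapsed straight into the requested number; a sketch, with the top-20 cutoff borrowed from the earlier answer purely for illustration:
-- Peak messages saved per second, plus the average across the 20 busiest seconds
select max(CountOf) as PeakPerSecond,
       avg(1.0 * CountOf) as AvgOfTop20Seconds
from (
    select top 20 count(*) as CountOf
    from Message
    group by convert(char(19), CreateDate, 120)
    order by count(*) desc
) as TopSeconds;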