BigQuery gives Response Too Large error for whole dataset but not for equivalent subqueries - google-bigquery

I have a table in BigQuery with the following fields:
time,a,b,c,d
time is a string in ISO 8601 format, except with a space in place of the 'T' separator; a is an integer from 1 to 16,000; and the other columns are strings. The table contains one month's worth of data, with a few million records per day.
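For reference, a couple of rows might look like this (purely illustrative values, not the real data):
time,a,b,c,d
2012-01-01 00:00:00,42,foo,bar,baz
2012-01-01 00:00:01,137,foo,qux,baz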
The following query fails with "response too large":
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,b,c,d,count(a),count(distinct a, 1000000)
from [myproject.mytable]
group by day,b,c,d
order by day,b,c,d asc
However, this query works (the data starts at 2012-01-01):
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,
b,c,d,count(a),count(distinct a)
from [myproject.mytable]
where UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) = UTC_USEC_TO_DAY(PARSE_UTC_USEC('2012-01-01 00:00:00'))
group by day,b,c,d
order by day,b,c,d asc
This looks like it might be related to this issue. However, because of the GROUP BY clause, the first query should be equivalent to running the second query once for each day. Is the query planner not able to handle this?
Edit: To clarify my test data:
I am using fake test data that I generated. I originally used several fields and tried to get hourly summaries for a month (grouping by an hour column defined with an AS alias in the SELECT part of the query). When that failed I switched to daily summaries, and when that failed too I reduced the number of columns involved. The daily query still failed when using COUNT(DISTINCT xxx, 1000000), but it worked when I restricted it to a single day's worth of data. (It also works if I remove the 1000000 parameter, but since the one-day query works even with that parameter, it seems the query planner is not splitting up the work as I would expect.)
The column checked with COUNT(DISTINCT) has cardinality 16,000, and the GROUP BY columns have cardinalities 2 and 20, so over a month I expect only about 1,200 result rows (2 × 20 × ~30 days). Column values are quite short, around ten characters.
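For reference, the hourly version I started with looked roughly like this (a sketch reconstructed from the description above, not the exact query I ran):
select UTC_USEC_TO_HOUR(PARSE_UTC_USEC(time)) as hour, b, c, d,
count(a), count(distinct a, 1000000)
from [myproject.mytable]
group by hour, b, c, d
order by hour, b, c, d asc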

How many results do you expect? There is currently a limit of about 64 MB on the total size of results that can be returned. If you're expecting millions of rows as a result, then this may be an expected error.
If the number of results isn't extremely large, the size problem may lie not in the final response but in the internal calculation. Specifically, if there are too many distinct groups in the GROUP BY, the query can run out of memory. One possible solution is to change "GROUP BY" to "GROUP EACH BY", which alters the way the query is executed. This feature is currently experimental and, as such, is not yet documented.
For your query, since the GROUP BY references a field computed in the SELECT, you might need to do it like this:
select day, b, c, d, count(a), count(distinct a, 1000000)
FROM (
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day, a, b, c, d
from [myproject.mytable]
)
group EACH by day, b, c, d
order by day, b, c, d asc

Related

SQL Server - Weird Index Usage

So here is the original query I'm working with:
SELECT TOP(10) *
FROM Orders o
WHERE (o.DateAdded >= DATEADD(DAY, - 30, getutcdate())) AND (o.DateAdded <= GETUTCDATE())
ORDER BY o.DateAdded ASC,
o.Price ASC,
o.Quantity DESC
Datatype:
DateAdded - smalldatetime
Price - decimal(19,8)
Quantity - int
I have an index on the Orders table with the same three columns in the same order, so when I run this it's perfect: time < 0 ms, and Live Query Statistics shows it reads only the 10 rows. Awesome.
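For reference, the index described above would look roughly like this (the index name is illustrative):
CREATE NONCLUSTERED INDEX IX_Orders_DateAdded_Price_Quantity
ON Orders (DateAdded ASC, Price ASC, Quantity DESC);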
However, as soon as I add this line to the WHERE clause
AND o.Price BETWEEN convert(decimal(19,8), 0) AND @BuyPrice
It all goes to hell (and unfortunately I need that line). It behaves the same if it's just o.Price <= @BuyPrice. Live Query Statistics shows the number of rows read is ~30k, and it also shows that the o.Price comparison isn't being used as a seek predicate; I'm having a hard time understanding why it isn't. I've verified @BuyPrice is the right datatype, since I found several articles that discuss issues with implicit conversions. At first I thought it was because I had two ranges, first on DateAdded and then on Price, but I have other queries with multi-column indexes and multiple ranges, and they all perform just fine. I'm absolutely baffled as to why this one has decided to be a burden. I've tried changing the order of columns in the index and changing them from ASC to DESC, but nada.
Would highly appreciate anyone telling me what I'm missing. Thanks
It is impossible for the optimizer to seek on two range predicates at the same time.
Think about it: the scan starts from a certain spot in the index, sorted by DateAdded. Within each individual DateAdded value, it would then need to seek to a particular Price, start scanning, stop at another Price, and then jump to the next DateAdded value.
This is called skip-scanning. It is only efficient when the leading predicate matches few distinct values; otherwise it is inefficient, which is why only Oracle has implemented it and SQL Server has not.
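To make the access pattern concrete, a skip-scan can be emulated by hand in T-SQL, but it only pays off when the leading column has few distinct values in the range (with a smalldatetime column that is rarely the case). A sketch, not the original query:
-- Enumerate the distinct DateAdded values first, then perform one
-- Price-range seek per value against the (DateAdded, Price, Quantity) index.
SELECT TOP (10) x.*
FROM (
    SELECT DISTINCT DateAdded
    FROM Orders
    WHERE DateAdded >= DATEADD(DAY, -30, GETUTCDATE())
      AND DateAdded <= GETUTCDATE()
) AS d
CROSS APPLY (
    SELECT o.*
    FROM Orders AS o
    WHERE o.DateAdded = d.DateAdded
      AND o.Price <= @BuyPrice
) AS x
ORDER BY x.DateAdded ASC, x.Price ASC, x.Quantity DESC;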
I think this is due to the TOP 10, which cannot be applied before the ORDER BY, and the ORDER BY must wait until the result set is ready.
Without your additional price range, the TOP 10 can be taken directly from the existing index. Adding the second range forces another operation to run first.
In short:
First your filter must get the rows for the price range together with the date range.
The resulting set is sorted and the top 10 rows are taken.
Did you try to add a separate index on your price column? This should speed up the first filter.
We cannot predict the execution plan in many cases, but you might try one of the following:
Write an intermediate set, filtered by the date range, into a temp table and proceed from there. You could even create an index on the price column of that temp table (it depends on the expected row count, but this is probably the best option; see the sketch after this list).
Use a CTE to define a set filtered by the date range and apply your price range to that set. But a CTE is not the same as a temp table, and the final execution plan might be the same as before...
Use two CTEs to define two sets (one per range) and use an INNER JOIN as a way to get the same result as WHERE condition1 AND condition2.
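A rough sketch of the temp-table option (the temp table and index names are illustrative):
-- Step 1: filter by the date range only; the existing index covers this well.
SELECT o.*
INTO #RecentOrders
FROM Orders AS o
WHERE o.DateAdded >= DATEADD(DAY, -30, GETUTCDATE())
  AND o.DateAdded <= GETUTCDATE();

-- Step 2: index the much smaller intermediate set on Price.
CREATE INDEX IX_RecentOrders_Price ON #RecentOrders (Price);

-- Step 3: apply the price range and take the top 10.
SELECT TOP (10) *
FROM #RecentOrders
WHERE Price BETWEEN CONVERT(decimal(19,8), 0) AND @BuyPrice
ORDER BY DateAdded ASC, Price ASC, Quantity DESC;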

Is it possible to create more rows than original data using GROUP BY and COUNT(*)?

I have the following query:
SELECT "domain", site_domain, pageurl, count (*) FROM impressions WHERE imp_date > '20150718' AND insertion_order_id = '227363'
GROUP BY 1,2,3
I understand it was an incorrectly conceived query, but it took over 30 minutes to run, while just pulling the data without the COUNT and GROUP BY took just 20 seconds.
My question is: is it possible for the result to contain more rows than the original data set?
Thanks!
The only time that an aggregation query will return more rows than in the original data set is when two conditions are true:
There are no matching rows in the query.
There is no group by.
In this case, the aggregation query returns one row; without the aggregation you would have no rows.
Otherwise, GROUP BY combines rows together, so the result cannot be larger than the original data.
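A quick way to see the one-row case, reusing the question's table (the filter value is only there to guarantee that nothing matches):
-- No GROUP BY and no matching rows: still returns exactly one row, with a count of 0.
SELECT count(*) FROM impressions WHERE imp_date > '99999999';

-- The same filter with GROUP BY returns zero rows.
SELECT imp_date, count(*) FROM impressions WHERE imp_date > '99999999' GROUP BY 1;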
When you are comparing time spent for returning a result set, you need to distinguish between time to first row and time to last row. When you execute a simple SELECT, you are inclined to measure the time to the first row returned. However, a group by needs to process all the data before returning any rows (under most circumstances). Hence, it would be better to compare the time to the last row returned by the simple query.

SQL: Minimising rows in subqueries/partitioning

So here's an odd thing. I have limited SQL access to a database - the most relevant restriction here being that if I create a query, a maximum of 10,000 rows is returned.
Anyway, I've been trying to have a query return individual case details, but only at busy times - say when 50+ cases are attended to in an hour. So, I inserted the following line:
COUNT(CaseNo) OVER (PARTITION BY DATEADD(hh,
DATEDIFF(hh, 0, StartDate), 0)) AS CasesInHour
... And then used this as a subquery, selecting only those cases where CasesInHour >= 50
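The overall shape was roughly this (a sketch; the table name Cases and the selected columns are guesses, only CaseNo and StartDate come from the snippet above):
SELECT *
FROM (
    SELECT CaseNo, StartDate,
           COUNT(CaseNo) OVER (PARTITION BY DATEADD(hh,
               DATEDIFF(hh, 0, StartDate), 0)) AS CasesInHour
    FROM Cases
) AS busy
WHERE CasesInHour >= 50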
However, it turns out that the 10,000-row limit affects the partitioning: when I tried to run it over a longer period, nothing came up, because it was counting the cases in any given hour from only a (fairly random) much smaller selection.
Can anyone think of a way to get around this limit? The final total returned will be much lower than 10,000 rows, but it will be looking at far more than 10,000 as a starting point.
If this is really MySQL we're talking about, sql_big_selects and max_join_size affect the number of rows examined, not the number of rows returned. So you'll need to reduce the number of rows examined by being more selective and using proper indexes.
For example, the following query may be examining over 10,000 rows:
SELECT * FROM stats
To narrow the selection, you might want to grab only the rows from the last 30 days:
SELECT * FROM stats
WHERE created > DATE_SUB(NOW(), INTERVAL 30 DAY)
However, this only reduces the number of rows examined if there is an index on the created column and the cardinality of the index is sufficient to reduce the rows examined.
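For example, something along these lines (the index name is illustrative):
CREATE INDEX idx_stats_created ON stats (created);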

Using Table Decorators on Big Query Web Interface

I saw the news about Table Decorators being available to limit the amount of data that is queried by specifying a time interval or limit. I did not see any examples of how to use Table Decorators in the BigQuery UI. Below is an example query that I'd like to run against only the data that came in over the last 4 hours. Any tips on how I can modify this query to utilize Table Decorators?
SELECT
foo,
count(*)
FROM [bigtable.201309010000]
GROUP BY 1
EDIT after trying the example below:
The first query above scans 180 GB of data for the month of September (up through Sept 19th). I'd expect the query below to scan only the data that came in during the specified time period, in this case 4 hours, so I'd expect to be billed for about 1.6 GB, not 180 GB. Is there a way to set up the ETL/query so we do not get billed for scanning the whole table?
SELECT
foo,
count(*)
FROM [bigtable.201309010000#-14400000]
GROUP BY 1
To use table decorators, you can specify either #timestamp or #timestamp-end_time. The timestamp can be negative, in which case it is interpreted relative to the current time; the end_time can be empty, in which case it means "now". You can use both of these special cases together to get a time range relative to now, e.g. [table#-time_in_ms-]. So for your case, since 4 hours is 14,400,000 milliseconds, you can use:
SELECT foo, count(*) FROM [dataset.table#-14400000-] GROUP BY 1
This is a little bit confusing; we're intending to publish better documentation and examples soon.

Is there a way to handle immutability that's robust and scalable?

Since BigQuery is append-only, I was thinking about stamping each record I upload with an 'effective date', similar to how PeopleSoft works, if anybody is familiar with that pattern.
Then I could issue a select statement and join on the max effective date:
select UTC_USEC_TO_MONTH(timestamp) as month, sum(amt)/100 as sales
from foo.orders as all
join (select id, max(effdt) as max_effdt from foo.orders group by id) as latest
on all.effdt = latest.max_effdt and all.id = latest.id
group by month
order by month;
Unfortunately, I believe this won't scale because of the BigQuery 'small join' restriction, so I wanted to see if anyone else had thought through this use case.
Yes, adding a timestamp to each record (or, in some cases, a flag that captures the state of a particular record) is the right approach. The small side of a BigQuery "small join" can actually be at least 8 MB (this value is measured compressed on our end, so the raw data is usually 2 to 10 times larger), so for "lookup"-table-style subqueries this can actually accommodate a lot of records.
In your case, it's not clear to me what exact query you are trying to run. It looks like you are trying to return the most recent sale time of every individual item, and then JOIN this information with the SUM of the sales amt per month for each item? Can you provide more info about the query?
It might be possible to do this all in one query. For example, in our wikipedia dataset, an example might look something like...
SELECT contributor_username, UTC_USEC_TO_MONTH(timestamp * 1000000) as month,
SUM(num_characters) as total_characters_used FROM
[publicdata:samples.wikipedia] WHERE (contributor_username != '' AND
contributor_username IS NOT NULL) AND timestamp > 1133395200
AND timestamp < 1157068800 GROUP BY contributor_username, month
ORDER BY contributor_username DESC, month DESC;
...to provide wikipedia contributions per user per month (like sales per month per item). This result is actually really large, so you would have to limit by date range.
UPDATE (based on comments below) a similar query that finds "num_characters" for the latest wikipedia revisions by contributors after a particular time...
SELECT current.contributor_username, current.num_characters
FROM
(SELECT contributor_username, num_characters, timestamp as time FROM [publicdata:samples.wikipedia] WHERE contributor_username != '' AND contributor_username IS NOT NULL)
AS current
JOIN
(SELECT contributor_username, MAX(timestamp) as time FROM [publicdata:samples.wikipedia] WHERE contributor_username != '' AND contributor_username IS NOT NULL AND timestamp > 1265073722 GROUP BY contributor_username) AS latest
ON
current.contributor_username = latest.contributor_username
AND
current.time = latest.time;
If your query requires you to first build a large aggregate (for example, if you essentially need an accurate COUNT DISTINCT), another option is to break the query up into two queries. The first query could provide the max effective date by month, along with a count, and save this result as a new table. Then you could run a sum query on the resulting table.
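One possible reading of that two-step suggestion, based on the schema in the question (foo.latest_orders is an illustrative name; you would set it as the destination table when running the first query):
-- Step 1: compute the max effective date per id and save the result as foo.latest_orders.
select id, max(effdt) as max_effdt
from foo.orders
group by id;

-- Step 2: run the monthly sum against the saved table instead of an inline subquery.
select UTC_USEC_TO_MONTH(o.timestamp) as month, sum(o.amt)/100 as sales
from foo.orders as o
join foo.latest_orders as latest
on o.id = latest.id and o.effdt = latest.max_effdt
group by month
order by month;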
You could also store monthly sales records in separate tables, and only query the particular table for the months you are interested in, simplifying your monthly sales summaries (this could also be a more economical use of BigQuery). When you need to find aggregates across all tables, you could run your queries with multiple tables listed after the FROM clause.
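For example, with one table per month, a quarterly summary might look roughly like this (the table names are illustrative; in legacy BigQuery SQL, listing several tables after FROM, separated by commas, unions them):
select UTC_USEC_TO_MONTH(timestamp) as month, sum(amt)/100 as sales
from [foo.orders_201301], [foo.orders_201302], [foo.orders_201303]
group by month
order by month;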