Resources exceeded BigQuery - google-bigquery

When running the following query I got the error:
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 158% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%
SELECT *, ROW_NUMBER() OVER (PARTITION BY Column_A ORDER BY Column_B)
FROM (
  SELECT * FROM Table_1
  UNION ALL
  SELECT * FROM Table_2
  UNION ALL
  SELECT * FROM Table_3
)
Can someone help me change this query, or is there any way to increase the memory limit in BigQuery?

Welcome Aaron,
This error means BigQuery is unable to process the whole query within its memory limits. The ORDER BY inside your OVER() clause is pretty memory intensive; try removing it and I would expect your query to run fine.
If you need results ordered, try writing the unordered query out to a table then running a new query on this table to order the results.
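As a rough sketch of that two-step approach, using the placeholder table and column names from the question (my_dataset.unioned is just a made-up destination table name):

-- Step 1: materialize the unioned rows without any ordering or ranking.
CREATE TABLE my_dataset.unioned AS
SELECT * FROM Table_1
UNION ALL
SELECT * FROM Table_2
UNION ALL
SELECT * FROM Table_3;

-- Step 2: run the analytic function against the stored table in a separate query.
SELECT *, ROW_NUMBER() OVER (PARTITION BY Column_A ORDER BY Column_B) AS rn
FROM my_dataset.unioned;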
If you're interested, here's an interesting article on how BigQuery executes queries in memory:
https://cloud.google.com/blog/products/gcp/in-memory-query-execution-in-google-bigquery
I don't believe you can override or change this memory limit, but happy to be proven wrong.

Make sure your ORDER BY is executed in the real last step; additionally, consider using a LIMIT clause to avoid "Resources Exceeded" or "Response too large" failures.
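For instance, a minimal sketch using the question's placeholder names (the LIMIT value is arbitrary): with the ORDER BY only in the outermost query and a LIMIT on top, BigQuery only has to keep the top rows of the sort rather than the whole ordered result.

SELECT *
FROM (
  SELECT * FROM Table_1
  UNION ALL
  SELECT * FROM Table_2
  UNION ALL
  SELECT * FROM Table_3
)
ORDER BY Column_B  -- ordering happens only as the real last step
LIMIT 1000;        -- cap the output so the full sorted set never has to be materialized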

My primary recommendation here is to use partitioning and clustering.
Partitioning applies to a date field, so if your Table_1, Table_2, ... have one, partition on it.
Clustering also greatly helps the memory cost of OVER clauses with ORDER BY, because it sorts the storage blocks (BigQuery docs).
To make the most of both of the above, I would also replace your UNION ALL sub-query with a temporary table.
Storing the result of the UNION ALL, partitioning and clustering the resulting dataset, and only then computing the rank is much more efficient in terms of memory and storage (Medium article).
Your final statement should look something like:
CREATE TEMP TABLE tmp
PARTITION BY date
CLUSTER BY Column_A, Column_B
AS
SELECT * FROM Table_1
UNION ALL
SELECT * FROM Table_2
UNION ALL
SELECT * FROM Table_3;

SELECT *, ROW_NUMBER() OVER (PARTITION BY Column_A ORDER BY Column_B)
FROM tmp;

I've encountered this before, and it turns out I was trying to partition by a column with NULL values. Removing the NULL records worked!
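A minimal sketch of that fix, again with the question's placeholder names, filtering out the NULL partition keys before the analytic function runs:

SELECT *, ROW_NUMBER() OVER (PARTITION BY Column_A ORDER BY Column_B) AS rn
FROM (
  SELECT * FROM Table_1
  UNION ALL
  SELECT * FROM Table_2
  UNION ALL
  SELECT * FROM Table_3
)
WHERE Column_A IS NOT NULL;  -- drop rows whose partition key is NULL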

You can try OVER() without using ORDER BY.
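For example (ROW_NUMBER() in BigQuery does not require an ORDER BY, though the numbering within each partition is then arbitrary):

SELECT *, ROW_NUMBER() OVER (PARTITION BY Column_A) AS rn  -- no ORDER BY, so no per-partition sort
FROM Table_1;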

Related

Reverse initial order of SELECT statement

I want to run a SQL query in Postgres that is exactly the reverse of the one that you'd get by just running the initial query without an order by clause.
So if your query was:
SELECT * FROM users
Then
SELECT * FROM users ORDER BY <something here to make it exactly the reverse of before>
Would it just be this?
ORDER BY Desc
You are building on the incorrect assumption that you would get rows in a deterministic order with:
SELECT * FROM users;
What you get is really arbitrary. Postgres returns rows in any way it sees fit. For simple queries typically in order of their physical storage, which typically is the order in which rows were entered. But there are no guarantees, and the order may change any time between two calls. For instance after any UPDATE (writing a new physical row version), or when any background process reorders rows - like VACUUM. Or a more complex query might return rows according to an index or a join. Long story short: there is no reliable order for table rows in a relational database unless you specify it with ORDER BY.
That said, assuming you get rows from the above simple query in the order of physical storage, this would get you the reverse order:
SELECT * FROM users
ORDER BY ctid DESC;
ctid is the internal tuple ID signifying physical order. Related:
In-order sequence generation
How list all tables with data changes in the last 24 hours?
Here is a T-SQL solution; this might give you an idea of how to do it in Postgres:
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rowid
  FROM users
) x
ORDER BY rowid DESC;

Recursive CTE is very slow despite minimal records

I'm trying to write a recursive CTE query similar to this, and when limiting my CTE records to an extremely small sample size the results are correct. I'm assuming it remains correct with more records. However, if I limit the CTE to ~5000 records (using a WHERE clause) I'm getting 11-second execution times overall. If I increase that to ~24,000 records (using a WHERE clause) that jumps up to 3-minute execution times.
I haven't run the query for longer than that because I don't want to eat up system resources.
I know I can avoid using a CTE altogether for this query, but the intention is for this particular query to be a part of a larger query, so it would be more readable if I could use a CTE. I'm also pretty confident recursive CTEs are capable of handling much larger data sets, so I'm curious if someone notices something I'm missing.
Here is the query (tables and fields name have been changed):
WITH TEMP (COL1, COL2, CURR, PREV) AS (
  SELECT COL1,
         COL2,
         ROW_NUMBER() OVER (PARTITION BY COL1 ORDER BY COL2) AS CURR,
         ROW_NUMBER() OVER (PARTITION BY COL1 ORDER BY COL2) - 1 AS PREV
  FROM MYLIB.MYTABLE
  WHERE COLDATE > 20150101 -- Produces about 5000 records
  -- WHERE COLDATE > 20140101 -- Produces about 24000 records
)
SELECT COL1, MAX(TRIM(L ',' FROM CAST(SYS_CONNECT_BY_PATH(COL2, ',') AS VARCHAR(256))))
FROM TEMP
START WITH CURR = 1
CONNECT BY COL1 = PRIOR COL1
       AND PREV = PRIOR CURR
GROUP BY COL1;
Note: COLDATE is only being used to limit the records in the CTE for testing purposes.
The CTE itself doesn't seem to be the issue; I can do a SELECT * FROM TEMP; and it is instantaneous. I believe I may be using SYS_CONNECT_BY_PATH and/or CONNECT BY incorrectly (i.e., I can probably modify the query to be more efficient).
The link I provided goes into more detail, but what I'm trying to achieve is to turn the first layout shown there into the second one, on the fly.
Which is what the query is doing at the moment, albeit very slowly.
Any insight is greatly appreciated.
I noticed two things and managed to work out a solution for my use-case.
First: I realized after executing the same exact query twice, all subsequent queries were near instantaneous. So I suppose it just had to build a query plan and then everything was fine. If I modified the query just the slightest bit, it would take a while though because it had to re-build the query plan.
Second: I realized I did not need to perform the query on the entire table. I only had to tailor this for a single value of COL1. Once I converted this into a UDF, I was able to apply it to each record individually rather than tackling all records at once.

Using COUNT() inside CTE is more expensive than outside of CTE?

I'm doing paging with SQL Server and I'd like to avoid duplication by counting the total number of results as part of my partial resultset, rather than getting that resultset and then doing a separate query to get the count afterwards. However, the trouble is, it seems to be increasing execution time. For example, if I check with SET STATISTICS TIME ON, this:
WITH PagedResults AS (
SELECT
ROW_NUMBER() OVER (ORDER BY AggregateId ASC) AS RowNumber,
COUNT(PK_MatrixItemId) OVER() AS TotalRowCount,
*
FROM [MyTable] myTbl WITH(NOLOCK)
)
SELECT * FROM PagedResults
WHERE RowNumber BETWEEN 3 AND 4810
... or this (whose execution plan is identical):
SELECT * FROM (
SELECT TOP (4813)
ROW_NUMBER() OVER (ORDER BY AggregateId ASC) AS RowNumber,
COUNT(PK_MatrixItemId) OVER() AS TotalRowCount,
*
FROM [MyTable] myTbl WITH(NOLOCK)
) PagedResults
WHERE PagedResults.RowNumber BETWEEN 3 AND 4810
... seems to be averaging a CPU time (all queries added up) of 1.5 to 2 times as much as this:
SELECT * FROM (
SELECT TOP (4813)
ROW_NUMBER() OVER (ORDER BY AggregateId ASC) AS RowNumber,
*
FROM [MyTable] myTbl WITH(NOLOCK)
) PagedResults
WHERE PagedResults.RowNumber BETWEEN 3 AND 4810
SELECT COUNT(*) FROM [MyTable] myTbl WITH(NOLOCK)
Obviously I'd rather use the former than the latter because the latter redundantly repeats the FROM clause (and would repeat any WHERE clauses if I had any), but its execution time is so much better I really have to use it. Is there a way I can get the former's execution time down at all?
CTE's are inlined into the query plan. They perform exactly the same as derived tables do.
Derived tables do not correspond to physical operations. They do not "materialize" the result set into a temp table. (I believe MySQL does this, but MySQL is about the most primitive mainstream RDBMS there is.)
Using OVER() does indeed manifest itself in the query plan as buffering to a temp table. It is not at all clear why that would be faster here than just re-reading the underlying table. Buffering is rather slow because writes are more CPU intensive than reads in SQL Server. We can just read twice from the original table. That's probably why the latter option is faster.
If you want to avoid repeating parts of a query, use a view or table-valued function. Granted, these are not great options for one-off queries. You can also generate SQL in the application layer and reuse strings. ORMs also make this a lot easier.
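For example, a sketch of the table-valued-function option for the paging query above (the function name dbo.GetPagedResults and its parameters are made up; the columns are the ones from the question):

CREATE FUNCTION dbo.GetPagedResults (@FirstRow INT, @LastRow INT)
RETURNS TABLE
AS
RETURN
(
    SELECT *
    FROM (
        SELECT
            ROW_NUMBER() OVER (ORDER BY AggregateId ASC) AS RowNumber,
            COUNT(PK_MatrixItemId) OVER() AS TotalRowCount,
            *
        FROM [MyTable] myTbl WITH(NOLOCK)
    ) PagedResults
    WHERE PagedResults.RowNumber BETWEEN @FirstRow AND @LastRow
);
GO

-- The paging query then becomes a single reusable call:
SELECT * FROM dbo.GetPagedResults(3, 4810);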

How to speed up group-based duplication-count queries on unindexed tables

When I need to know the number of rows containing more than n duplicates for a certain column c, I can do it like this:
WITH duplicateRows AS (
  SELECT COUNT(1) AS cnt  -- columns inside a CTE must be named in SQL Server
  FROM [table]
  GROUP BY c
  HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
This leads to an unwanted behaviour: SQL Server counts all rows grouped by c, which (when there is no index on this table) leads to horrible performance.
However, altering the script so that SQL Server doesn't have to count all the rows doesn't solve the problem:
WITH duplicateRows AS (
  SELECT 1 AS marker  -- the constant still needs a column name inside the CTE
  FROM [table]
  GROUP BY c
  HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
Although SQL Server now in theory can stop counting after n + 1, it leads to the same query plan and query cost.
Of course, the reason is that the GROUP BY really introduces the cost, not the counting. But I'm not at all interested in the numbers. Is there another option to speed up the counting of duplicate rows, on a table without indexes?
The two greatest costs in your query are the re-ordering for the GROUP BY (due to the lack of an appropriate index) and the fact that you're scanning the whole table.
Unfortunately, to identify duplicates, re-ordering the whole table is the cheapest option.
You may get a benefit from the following change, but I highly doubt it would be significant, as I'd expect the execution plan to involve a sort again anyway.
WITH sequenced_data AS
(
  SELECT
    -- SQL Server requires an ORDER BY inside ROW_NUMBER(); ORDER BY (SELECT NULL) keeps the order arbitrary
    ROW_NUMBER() OVER (PARTITION BY fieldC ORDER BY (SELECT NULL)) AS sequence_id
  FROM
    yourTable
)
SELECT
  COUNT(*)
FROM
  sequenced_data
WHERE
  sequence_id = (n + 1)
Assumes SQL Server 2005+.
Without an index, the GROUP BY solution is the best; every PARTITION-based solution involves both a table (clustered index) scan and a sort, instead of the simple scan-and-count of the GROUP BY case.
If the only goal is to determine if there are ANY rows in ANY group (or, to rephrase that, "there is a duplicate inside the table, given the distinction of column c"), adding TOP(1) to the SELECT queries could perform some performance magic.
WITH duplicateRows AS (
  SELECT TOP(1)
         1 AS marker  -- named so the CTE column list is valid
  FROM [table]
  GROUP BY c
  HAVING COUNT(1) > n
) SELECT 1 FROM duplicateRows
Theoretically, SQL Server doesn't need to determine all groups, so as soon as the first group with a duplicate is found, the query is finished (but worst-case will take as long as the original approach). I have to say though that this is a somewhat imperative way of thinking - not sure if it's correct...
Speed and "without indexes" almost never go together.
Although, as others here have mentioned, I seriously doubt that it will have performance benefits, perhaps you could try restructuring your query with PARTITION BY.
For example:
WITH duplicateRows AS (
SELECT a.aFK,
ROW_NUMBER() OVER(PARTITION BY a.aFK ORDER BY a.aFK) AS DuplicateCount
FROM Address a
) SELECT COUNT(DuplicateCount) FROM duplicateRows
I haven't tested the performance of this against the actual group by clause query. It's just a suggestion of how you could restructure it in another way.

Max and Min Time query

How do I show the max time in the first row and the min time in the second row, for Access, using VB6?
What about:
SELECT time_value
FROM (SELECT MIN(time_column) AS time_value FROM SomeTable
UNION
SELECT MAX(time_column) AS time_value FROM SomeTable
)
ORDER BY time_value DESC;
That should do the job unless there are no rows in SomeTable (or your DBMS does not support the notation).
Simplifying per suggestion in comments - thanks!
SELECT MIN(time_column) AS time_value FROM SomeTable
UNION
SELECT MAX(time_column) AS time_value FROM SomeTable
ORDER BY time_value DESC;
If you can accept the two values in a single row instead of two rows, you may improve the performance of the query using:
SELECT MIN(time_column) AS min_time,
MAX(time_column) AS max_time
FROM SomeTable;
A really good optimizer might be able to deal with both halves of the UNION version in one pass over the data (or index), but it is quite easy to imagine an optimizer tackling each half of the UNION separately and processing the data twice. If there is no index on the time column to speed things up, that could involve two table scans, which would be much slower than a single table scan for the two-value, one-row query (if the table is big enough for such things to matter).
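For example, a minimal sketch (the index name is made up; SomeTable and time_column are the placeholders used above) that would let both the MIN and the MAX be read from the ends of the index instead of scanning the table:

CREATE INDEX idx_sometable_time ON SomeTable (time_column);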