I'm doing paging with SQL Server and I'd like to avoid duplication by counting the total number of results as part of my partial result set, rather than getting that result set and then running a separate query afterwards to get the count. The trouble is that this seems to increase execution time. For example, if I check with SET STATISTICS TIME ON, this:
WITH PagedResults AS (
SELECT
ROW_NUMBER() OVER (ORDER BY AggregateId ASC) AS RowNumber,
COUNT(PK_MatrixItemId) OVER() AS TotalRowCount,
*
FROM [MyTable] myTbl WITH(NOLOCK)
)
SELECT * FROM PagedResults
WHERE RowNumber BETWEEN 3 AND 4810
... or this (whose execution plan is identical):
SELECT * FROM (
SELECT TOP (4813)
ROW_NUMBER() OVER (ORDER BY AggregateId ASC) AS RowNumber,
COUNT(PK_MatrixItemId) OVER() AS TotalRowCount,
*
FROM [MyTable] myTbl WITH(NOLOCK)
) PagedResults
WHERE PagedResults.RowNumber BETWEEN 3 AND 4810
... seems to be averaging a CPU time (all queries added up) of 1.5 to 2 times as much as this:
SELECT * FROM (
SELECT TOP (4813)
ROW_NUMBER() OVER (ORDER BY AggregateId ASC) AS RowNumber,
*
FROM [MyTable] myTbl WITH(NOLOCK)
) PagedResults
WHERE PagedResults.RowNumber BETWEEN 3 AND 4810
SELECT COUNT(*) FROM [MyTable] myTbl WITH(NOLOCK)
Obviously I'd rather use the former than the latter because the latter redundantly repeats the FROM clause (and would repeat any WHERE clauses if I had any), but its execution time is so much better I really have to use it. Is there a way I can get the former's execution time down at all?
CTEs are inlined into the query plan. They perform exactly the same as derived tables do.
Derived tables do not correspond to physical operations. They do not "materialize" the result set into a temp table. (I believe MySQL does this, but MySQL is about the most primitive mainstream RDBMS there is.)
Using OVER() does indeed manifest itself in the query plan as buffering to a worktable (a temp table). There is no reason to expect that spooling to be faster than simply re-reading the underlying table: buffering is rather slow, because writes are more CPU-intensive than reads in SQL Server, whereas the two-query version just reads the original table twice. That's probably why the latter option is faster.
If you want to avoid repeating parts of a query, use a view or table-valued function. Granted, these are not great options for one-off queries. You can also generate SQL in the application layer and reuse strings. ORMs also make this a lot easier.
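For example, here is a minimal sketch of the inline table-valued function route. The function name, parameter, and the SomeColumn filter are made-up placeholders, and the literal 42 just stands in for whatever filter value you would actually pass:
CREATE FUNCTION dbo.MyTableFiltered (@SomeFilter int)
RETURNS TABLE
AS RETURN
(
    -- the shared FROM/WHERE logic lives in one place
    SELECT *
    FROM [MyTable] WITH (NOLOCK)
    WHERE SomeColumn = @SomeFilter
);
GO

-- paged query
SELECT * FROM (
    SELECT TOP (4813)
        ROW_NUMBER() OVER (ORDER BY AggregateId ASC) AS RowNumber,
        *
    FROM dbo.MyTableFiltered(42) myTbl
) PagedResults
WHERE PagedResults.RowNumber BETWEEN 3 AND 4810;

-- count query reuses the same definition instead of repeating the FROM/WHERE
SELECT COUNT(*) FROM dbo.MyTableFiltered(42);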
When running the following query I got the error:
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 158% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%
select *, row_number() over(PARTITION BY Column_A ORDER BY Column_B)
from
(SELECT
*
FROM
Table_1 UNION ALL
SELECT
*
FROM
Table_2 UNION ALL
SELECT
*
FROM
Table_3
)
Can someone help me change this query, or is there a possibility of changing the memory limit in BigQuery?
Welcome Aaron,
This error means BigQuery is unable to process the whole query within its memory limits; the ORDER BY inside the OVER() clause is pretty memory intensive. Try removing it and I would expect your query to run fine.
If you need results ordered, try writing the unordered query out to a table then running a new query on this table to order the results.
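As a rough sketch of that two-step approach (my_dataset.combined is a placeholder name for the intermediate table):
-- step 1: materialize the combined rows without any ordering
CREATE OR REPLACE TABLE my_dataset.combined AS
SELECT * FROM Table_1
UNION ALL
SELECT * FROM Table_2
UNION ALL
SELECT * FROM Table_3;

-- step 2: run the analytic function against the materialized table
SELECT *, ROW_NUMBER() OVER (PARTITION BY Column_A ORDER BY Column_B) AS rn
FROM my_dataset.combined;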
If you're interested, here's an interesting article on how BigQuery executes queries in memory:
https://cloud.google.com/blog/products/gcp/in-memory-query-execution-in-google-bigquery
I don't believe you can override or change this memory limit, but happy to be proven wrong.
Make sure your ORDER BY is executed as the real last step; additionally, consider using a LIMIT clause to avoid “Resources Exceeded” or “Response too large” failures.
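For instance, a sketch of keeping the final ORDER BY as the outermost step and capping the output (the LIMIT value here is arbitrary):
SELECT *
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Column_A ORDER BY Column_B) AS rn
  FROM (
    SELECT * FROM Table_1
    UNION ALL
    SELECT * FROM Table_2
    UNION ALL
    SELECT * FROM Table_3
  )
)
ORDER BY Column_A, Column_B  -- final ordering as the last step
LIMIT 1000;                  -- keeps the returned result small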
My primary recommendation here is to make sure to use partitioning and clustering.
Partitioning applies to a date field, so if your Table_1, Table_2, ... have one, partition on it.
Clustering also greatly helps the memory cost of OVER clauses with ORDER BY, because it sorts storage blocks (BigQuery docs).
To make the most of both of the above, I would also replace your UNION ALL sub-query with a temporary table.
Storing the result of the UNION ALL first, partitioning and clustering the resulting dataset, and only then computing the rank is much more efficient in terms of memory and storage (Medium article).
Your final statement should look something like:
CREATE TEMP TABLE tmp
PARTITION BY date
CLUSTER BY column_A, column_B
AS
SELECT
*
FROM
Table_1 UNION ALL
SELECT
*
FROM
Table_2 UNION ALL
SELECT
*
FROM
Table_3
;
select *, row_number() over(PARTITION BY Column_A ORDER BY Column_B) from tmp
I've encountered this before, and it turned out I was trying to partition by a column with NULL values. Removing the NULL records worked!
You can try OVER() without using ORDER BY.
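Applied to the query in the question, that would look something like the sketch below; the row numbers then have no guaranteed order within each partition, so this only helps if you don't rely on that order:
SELECT *, ROW_NUMBER() OVER (PARTITION BY Column_A) AS rn
FROM (
  SELECT * FROM Table_1
  UNION ALL
  SELECT * FROM Table_2
  UNION ALL
  SELECT * FROM Table_3
);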
I have a really complicated query:
select * from (
select * from tbl_user ...
where ...
and date_created between :date_from and :today
...
order by date_created desc
) where rownum <=50;
Currently the query is fast enough because of the WHERE clause (only 3 months before today, date_from = today - 90 days).
I have to remove this clause, but doing so causes performance degradation.
What if I first calculate date_from with
SELECT MIN(date_created) where...
and then insert this value into the main query? The set of data will be the same. Would it improve performance? Does it make sense?
Does anyone have any suggestions for optimizing this?
Using an order by operation will of course cause the query to take a little longer to return. That being said, it is almost always faster to sort in the DB than it is to sort in your application logic.
It's hard to really optimize without the full query and schema information, but I'll take a stab at what seems like the most obvious to me.
Converting to Rank()
Your query could be a lot more efficient if you use a windowed rank() function. I've also converted it to use a common table expression (aka CTE). This doesn't improve performance, but does make it easier to read.
with cte as (
select
u.*
, rank() over (
partition by
-- insert what fields differentiate your rows here
-- unlike a group by clause, this doesn't need to be
-- every field
order by
date_created desc
) as rk -- alias the rank so the outer query can filter on it
from
tbl_user u
...
where
...
and date_created between :date_from and :today
)
select
*
from
cte
where
rk <= 50
Indexing
If date_created is not indexed, it probably should be.
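For example (the index name is made up; adjust it to your naming conventions):
-- lets the optimizer use an index range scan for the date_created filter and ordering
CREATE INDEX idx_tbl_user_date_created ON tbl_user (date_created);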
Take a look at your autotrace results. Figure out what filters have the highest cost. These are probably unindexed, and maybe should be.
If you post your schema, I'd be happy to make better suggestions.
In some other databases (e.g. DB2, or Oracle with ROWNUM), I can omit the ORDER BY clause in a ranking function's OVER() clause. For instance:
ROW_NUMBER() OVER()
This is particularly useful when used with ordered derived tables, such as:
SELECT t.*, ROW_NUMBER() OVER()
FROM (
SELECT ...
ORDER BY
) t
How can this be emulated in SQL Server? I've found people using this trick, but that's wrong, as it will behave non-deterministically with respect to the order from the derived table:
-- This order here ---------------------vvvvvvvv
SELECT t.*, ROW_NUMBER() OVER(ORDER BY (SELECT 1))
FROM (
SELECT TOP 100 PERCENT ...
-- vvvvv ----redefines this order here
ORDER BY
) t
A concrete example (as can be seen on SQLFiddle):
SELECT v, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM (
SELECT TOP 100 PERCENT 1 UNION ALL
SELECT TOP 100 PERCENT 2 UNION ALL
SELECT TOP 100 PERCENT 3 UNION ALL
SELECT TOP 100 PERCENT 4
-- This descending order is not maintained in the outer query
ORDER BY 1 DESC
) t(v)
Also, I cannot reuse any expression from the derived table to reproduce the ORDER BY clause in my case, as the derived table might not be available as it may be provided by some external logic.
So how can I do it? Can I do it at all?
The Row_Number() OVER (ORDER BY (SELECT 1)) trick should NOT be seen as a way to avoid changing the order of underlying data. It is only a means to avoid causing the server to perform an additional and unneeded sort (it may still perform the sort but it's going to cost the minimum amount possible when compared to sorting by a column).
All queries in SQL Server ABSOLUTELY MUST have an ORDER BY clause in the outermost query for the results to be reliably ordered in a guaranteed way.
The concept of "retaining original order" does not exist in relational databases. Tables and queries must always be considered unordered until and unless an ORDER BY clause is specified in the outermost query.
You could try the same unordered query 100,000 times and always receive it with the same ordering, and thus come to believe you can rely on said ordering. But that would be a mistake, because one day, something will change and it will not have the order you expect. One example is when a database is upgraded to a new version of SQL Server--this has caused many a query to change its ordering. But it doesn't have to be that big a change. Something as little as adding or removing an index can cause differences. And more: Installing a service pack. Partitioning a table. Creating an indexed view that includes the table in question. Reaching some tipping point where a scan is chosen instead of a seek. And so on.
Do not rely on results to be ordered unless you have said "Server, ORDER BY".
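Applied to the concrete example above, that means stating the ordering expression both inside OVER() and in the outermost ORDER BY; this sketch assumes the ordering column is visible to the outer query, which the question notes is not always the case:
SELECT v, ROW_NUMBER() OVER (ORDER BY v DESC) AS RN
FROM (
    SELECT 1 UNION ALL
    SELECT 2 UNION ALL
    SELECT 3 UNION ALL
    SELECT 4
) t(v)
ORDER BY v DESC; -- the only ordering guarantee comes from this outermost ORDER BY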
I have a query:
SELECT ROW_NUMBER() OVER(ORDER BY LogId) AS RowNum
FROM [Log] l
where RowNum = 1
and I'm getting the following error:
Invalid column name 'RowNum'.
I did some searching here and found that column aliases cannot be used in the WHERE clause,
so I tried the following and it worked:
select *
from
(
SELECT ROW_NUMBER() OVER(ORDER BY LogId) AS RowNum
FROM [Log] l
) as t
where t.RowNum = 1
Is there a better way, from performance point of view, to make this query?
Thanks in advance.
That's just the way it is.
Column aliases cannot be used on the same logical level where they were defined. You will have to use the derived table (sub-query), as you have found out.
If you are concerned about performance, then don't be. The derived table is mere syntactic sugar; it won't make the query slower (compared to the solution you tried first).
An alternative to this specific query, which won't perform any different but is simpler to write:
SELECT TOP 1 <col list> FROM dbo.[Log] ORDER BY LogId;
As @a_horse explained, don't assume that because your second query looks like more code it is more expensive. If you want to measure the efficiency of different queries that get the same results, compare their execution plans, not code complexity.
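For example, a quick sketch of comparing the two side by side with the statistics options (just LogId is selected here for illustration); the actual execution plans can be captured in the same run:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- version 1: alias resolved via a derived table
SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY LogId) AS RowNum
    FROM dbo.[Log]
) AS t
WHERE t.RowNum = 1;

-- version 2: TOP 1 with an explicit ORDER BY
SELECT TOP 1 LogId
FROM dbo.[Log]
ORDER BY LogId;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;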
When I need to know the number of rows containing more than n duplicates for a certain column c, I can do it like this:
WITH duplicateRows AS (
SELECT COUNT(1) AS cnt
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
This leads to an unwanted behaviour: SQL Server counts all rows grouped by c, which (when there is no index on this table) leads to horrible performance.
However, altering the script so that SQL Server doesn't have to count all the rows doesn't solve the problem:
WITH duplicateRows AS (
SELECT 1 AS dummy
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
Although SQL Server can now, in theory, stop counting after n + 1 rows per group, it leads to the same query plan and query cost.
Of course, the reason is that the GROUP BY really introduces the cost, not the counting. But I'm not at all interested in the numbers. Is there another option to speed up the counting of duplicate rows, on a table without indexes?
The greatest two costs in your query are the re-ordering for the GROUP BY (due to lack of appropriate index) and the fact that you're scanning the whole table.
Unfortunately, to identify duplicates, re-ordering the whole table is the cheapest option.
You may get a benefit from the following change, but I highly doubt it would be significant, as I'd expect the execution plan to involve a sort again anyway.
WITH
sequenced_data AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY fieldC ORDER BY (SELECT NULL)) AS sequence_id -- ROW_NUMBER() requires an ORDER BY in SQL Server; (SELECT NULL) keeps it arbitrary
FROM
yourTable
)
SELECT
COUNT(*)
FROM
sequenced_data
WHERE
sequence_id = (n+1)
Assumes SQL Server 2005+.
Without indexes, the GROUP BY solution is the best; every PARTITION-based solution involves both a table (clustered index) scan and a sort, instead of the simple scan-and-count of the GROUP BY case.
If the only goal is to determine whether there are ANY rows in ANY group (or, to rephrase that, "there is a duplicate inside the table, given the distinction of column c"), adding TOP(1) to the SELECT queries could work some performance magic.
WITH duplicateRows AS (
SELECT TOP(1)
1 AS dummy
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT 1 FROM duplicateRows
Theoretically, SQL Server doesn't need to determine all groups, so as soon as the first group with a duplicate is found, the query is finished (but worst-case will take as long as the original approach). I have to say though that this is a somewhat imperative way of thinking - not sure if it's correct...
Speed and "without indexes" almost never go together.
Although, as others here have mentioned, I seriously doubt that it will have performance benefits, perhaps you could try restructuring your query with PARTITION BY.
For example:
WITH duplicateRows AS (
SELECT a.aFK,
ROW_NUMBER() OVER(PARTITION BY a.aFK ORDER BY a.aFK) AS DuplicateCount
FROM Address a
) SELECT COUNT(*) FROM duplicateRows WHERE DuplicateCount = n + 1 -- counts the groups that reach n + 1 rows
I haven't tested the performance of this against the actual group by clause query. It's just a suggestion of how you could restructure it in another way.