Poor performance on pagination using SQL Server ROW_NUMBER()

I am trying to build a pagination mechanism. I am using an ORM that creates SQL looking like this:
SELECT * FROM
(
    SELECT t1.colX, t2.colY,
           ROW_NUMBER() OVER (ORDER BY t1.col3) AS row
    FROM Table1 t1
    INNER JOIN Table2 t2
        ON t1.col1 = t2.col2
) a
WHERE row >= n AND row <= m
Table1 has >500k rows and Table2 has >10k rows.
I execute the queries directly in SQL Server 2008 R2 Management Studio. The subquery alone takes 2-3 seconds to execute, but the whole query takes more than 2 minutes.
I know SQL Server 2012 supports the OFFSET ... FETCH option, but I cannot upgrade the software.
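For reference, the 2012+ syntax I cannot use would, as far as I understand, look roughly like this (my sketch, reusing n and m as the page bounds):
SELECT t1.colX, t2.colY
FROM Table1 t1
INNER JOIN Table2 t2 ON t1.col1 = t2.col2
ORDER BY t1.col3
OFFSET (n - 1) ROWS FETCH NEXT (m - n + 1) ROWS ONLY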
Can anyone help me improve the performance of the query, or suggest another pagination mechanism that can be implemented through the ORM software?
Update:
Testing Roman Pekar's solution (see comments on the solution) proved that ROW_NUMBER() might not be the cause of the performance problems. Unfortunately the problems persist.
Thanks

As I understand your table structure from the comments, it looks like this:
create table Table2
(
col2 int identity primary key,
colY int
)
create table Table1
(
col3 int identity primary key,
col1 int not null references Table2(col2),
colX int
)
That means that the rows returned from Table1 can never be filtered out by the join to Table2, because Table1.col1 is not null. Nor can the join to Table2 add rows to the result, since Table2.col2 is the primary key.
You can therefore rewrite your query to generate the row numbers on Table1 before the join to Table2. The where clause is then also applied before the join to Table2, meaning that you only have to locate the rows in Table2 that are actually part of the result set.
select T1.colX,
       T2.colY,
       T1.row
from
(
    select col1,
           colX,
           row_number() over(order by col3) as row
    from Table1
) as T1
inner join Table2 as T2
    on T1.col1 = T2.col2
where row >= #n and row <= #m
SQL Fiddle
I have no idea whether you can make your ORM (Lightspeed by Mindscape) generate the paging query like this instead of what you have now.
Comparing the query plan for this query with the plan for the query in the question, there is a huge difference in reads between the two.

Insert just the primary key column(s) of the paginated table into a temp table with an identity column, ordering by the ordered-by columns. (You may have to include the ordered-by columns to ensure the ordering comes out right.) Then, join back to the main table using the temp table as a key for the rows you want. If the data is fairly static, you could save the ordering data to a session-keyed permanent table instead of a temp table, and reuse it for a short period of time (so subsequent page requests within a few minutes are nearly instant).
ROW_NUMBER() tends to perform well on small data sets, but it can hit serious performance snags once you get into serious row counts, as you have with 500k.
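A rough sketch of the temp-table approach with the tables from the question (assuming col3 is both the primary key and the ordering column, and @n/@m are the page bounds):
-- The identity column captures the display order and becomes the row number.
CREATE TABLE #page_keys
(
    row_num int IDENTITY(1,1) PRIMARY KEY,
    col3    int NOT NULL
);

INSERT INTO #page_keys (col3)
SELECT t1.col3
FROM Table1 t1
ORDER BY t1.col3;   -- ORDER BY in an INSERT ... SELECT controls identity assignment order

-- Join back on the key and fetch only the requested page.
SELECT t1.colX, t2.colY, k.row_num AS row
FROM #page_keys k
INNER JOIN Table1 t1 ON t1.col3 = k.col3
INNER JOIN Table2 t2 ON t1.col1 = t2.col2
WHERE k.row_num BETWEEN @n AND @m;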

I suggest you check the indexes on your tables. I think it will help your query if you at least have an index on col2 in Table2. You could also try to rewrite your query like this:
;with cte1 as (
    select top (#m) t1.colX, t2.colY, t1.col3
    from Table1 as t1
    inner join Table2 as t2 on t1.col1 = t2.col2
    order by t1.col3 asc
),
cte2 as (
    select top (#m - #n + 1) *
    from cte1
    order by col3 desc
)
select *
from cte2 as t1
but it could still be slow if you don't have the right indexes.
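For completeness, the kind of indexes meant above might look like this (the index names and INCLUDE columns are my guesses based on the columns in the question; skip any that are already covered by a primary key):
CREATE INDEX IX_Table2_col2 ON Table2 (col2) INCLUDE (colY);
CREATE INDEX IX_Table1_col3 ON Table1 (col3) INCLUDE (col1, colX);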

Related

Left Join Lateral and array aggregates

I'm using Postgres 9.3.
I have two tables T1 and T2 and an n:m relation T1_T2_rel between them. Now I'd like to create a view that, in addition to the columns of T1, provides a column that, for each record in T1, contains an array with the primary key ids of all related records of T2. If there are no related entries in T2, the corresponding fields of this column should contain NULL values.
An abstracted version of my schema would look like this:
CREATE TABLE T1 ( t1_id serial primary key, t1_data int );
CREATE TABLE T2 ( t2_id serial primary key );
CREATE TABLE T1_T2_rel (
t1_id int references T1( t1_id )
, t2_id int references T2( t2_id )
);
Corresponding sample data could be generated as follows:
INSERT INTO T1 (t1_data)
SELECT cast(random()*100 as int) FROM generate_series(0,9) c(i);
INSERT INTO T2 (t2_id) SELECT nextval('T2_t2_id_seq') FROM generate_series(0,99);
INSERT INTO T1_T2_rel
SELECT cast(random()*10 as int) % 10 + 1 as t1_id
, cast(random()*99+1 as int) as t2_id
FROM generate_series(0,99);
So far, I've come up with the following query:
SELECT T1.t1_id, T1.t1_data, agg
FROM T1
LEFT JOIN LATERAL (
SELECT t1_id, array_agg(t2_id) as agg
FROM T1_T2_rel
WHERE t1_id=T1.t1_id
GROUP BY t1_id
) as temp ON temp.t1_id=T1.t1_id;
This works. However, can it be simplified?
A corresponding fiddle can be found here: sql-fiddle. Unfortunately, sql-fiddle does not support Postgres 9.3 (yet), which is required for lateral joins.
[Update] As has been pointed out, a simple left join with a subquery is in principle enough. However, if I compare the query plans, Postgres resorts to sequential scans on the aggregated tables when using a left join, whereas index scans are used in the case of the left join lateral.
As @Denis already commented: no need for LATERAL.
Also, your subquery selected the wrong column. This works:
SELECT t1.t1_id, t1.t1_data, t2_ids
FROM t1
LEFT JOIN (
SELECT t1_id, array_agg(t2_id) AS t2_ids
FROM t1_t2_rel
GROUP BY 1
) sub USING (t1_id);
SQL Fiddle.
Performance and testing
Concerning the ensuing sequential scan you mention: if you query the whole table, a sequential scan is often faster. It depends on the version you are running, your hardware, your settings, and the cardinality and distribution of your data. Experiment with selective WHERE clauses like WHERE t1.t1_id < 1000 or WHERE t1.t1_id = 1000 and combine them with planner settings to learn about the planner's choices:
SET enable_seqscan = off;
SET enable_indexscan = off;
To reset:
RESET enable_seqscan;
RESET enable_indexscan;
Only in your local session, mind you! This related answer on dba.SE has more instructions.
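For example, to see which plan the planner actually picks for a selective query (this reuses the query from above; adjust the WHERE value to your data):
EXPLAIN (ANALYZE, BUFFERS)
SELECT t1.t1_id, t1.t1_data, t2_ids
FROM t1
LEFT JOIN (
    SELECT t1_id, array_agg(t2_id) AS t2_ids
    FROM t1_t2_rel
    GROUP BY 1
) sub USING (t1_id)
WHERE t1.t1_id < 1000;
-- Run it again after SET enable_seqscan = off; to see the alternative plan, then RESET enable_seqscan;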
Of course, your settings may be off, too:
Keep PostgreSQL from sometimes choosing a bad query plan

SQL Optimizer / Execution Plan - Repeated Subqueries

NOTE: I'm currently running the queries in question on a sqlite3 DB, though answers based on expertise in any other DBMS are welcome insight...
I was wondering whether the query optimizer makes any attempt to identify repeated queries/subqueries and, if so, run them only once.
Here is my example query:
SELECT *
FROM table1 AS t1
WHERE t1.fk_id =
      (
          SELECT t2.fk_id
          FROM table2 AS t2
          WHERE t2.id = 1111
      )
   OR t1.fk_id =
      (
          SELECT local_id
          FROM ID_MAP
          WHERE remote_id =
                (
                    SELECT t2.fk_id
                    FROM table2 AS t2
                    WHERE t2.id = 1111
                )
      );
Will the nested query
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111
be run only once (and its results cached for further access) ?
It's not a big deal in this example, since it's a simple query that executes only twice. However, I need it to run about 4-5 more times (x2, twice for each child record, so 8-10 really) in my actual program (it's grabbing all child records (table1) associated with a parent record (table2), bound by a foreign key. It's also checking an id mapping table to make sure it queries for both a locally generated id and the real/updated/new key).
I really appreciate any help with this, thank you.
SQLite has a very simple query optimizer, and does not even try to detect identical subqueries:
> create table t(x);
> explain query plan
select * from t
where x in (select x from t) or
x in (select x from t);
0|0|0|SCAN TABLE t (~500000 rows)
0|0|0|EXECUTE LIST SUBQUERY 1
1|0|0|SCAN TABLE t (~1000000 rows)
0|0|0|EXECUTE LIST SUBQUERY 2
2|0|0|SCAN TABLE t (~1000000 rows)
The same applies to CTEs and views; if the performance actually matters, your best bet is to create a temporary table for the result of the subquery.
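A sketch of that workaround with the tables from the question (assuming the inner query returns a single fk_id; the temp table name is made up):
CREATE TEMP TABLE parent_fk AS
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111;

SELECT *
FROM table1 AS t1
WHERE t1.fk_id IN (SELECT fk_id FROM parent_fk)
   OR t1.fk_id IN (SELECT local_id
                   FROM ID_MAP
                   WHERE remote_id IN (SELECT fk_id FROM parent_fk));

DROP TABLE parent_fk;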
As you asked for insight from other DBs....
In Oracle DBMS, any independent subquery will be executed only once.
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111 -- The result will be the same for any row in t1.
Dependent subqueries will need to be executed repeatedly, of course.
Example of dependent subquery:
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = t1.t2_id -- t1.t2_id will have different values for different rows in t1.

Should I avoid IN() because it is slower than EXISTS()? [duplicate]

Possible Duplicate:
SQL Server IN vs. EXISTS Performance
Should I avoid IN() because it is slower than EXISTS()?
SELECT * FROM TABLE1 t1 WHERE EXISTS (SELECT 1 FROM TABLE2 t2 WHERE t1.ID = t2.ID)
vs.
SELECT * FROM TABLE1 t1 WHERE t1.ID IN(SELECT t2.ID FROM TABLE2 t2)
From my investigation: I set SHOWPLAN_ALL on and get the same execution plan and estimated cost for both. The index (PK) is used, with a seek, in both queries. No difference.
Are there other scenarios or cases where the two forms give very different results? Or is the optimizer doing so much optimization for me that I always get the same execution plan?
Do neither. Do this:
SELECT DISTINCT T1.*
FROM TABLE1 t1
JOIN TABLE2 t2 ON t1.ID = t2.ID;
This will outperform anything else by orders of magnitude.
Both queries will produce the same execution plan (assuming no indexes were created): two table scans and one nested loop (join).
The join suggested by Bohemian will do a Hash Match instead of the loop, which I've always heard (and here is a proof: Link) is the worst kind of join.
Between IN and EXISTS (your actual question), EXISTS gives the better performance (take a look at: Link).
If your table T2 has a lot of records, EXISTS is the better approach hands down, because as soon as the database finds a record that matches your requirement, the condition evaluates to true and the scan of T2 stops. With the IN clause, however, you're scanning Table2 for every row in Table1.
IN is better than EXISTS when you have a fixed list of values, or when the subquery returns only a few values.
Expanding my answer a little, based on an Ask Tom answer:
A SELECT with IN, for example:
Select * from T1 where x in ( select y from T2 )
is usually processed as:
select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;
The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to the original table (typically).
An EXISTS query such as:
select * from t1 where exists ( select null from t2 where y = x )
That is processed more like:
for x in ( select * from t1 )
loop
    if ( exists ( select null from t2 where y = x.x ) )
    then
        OUTPUT THE RECORD
    end if
end loop
It always results in a full scan of T1 whereas the first query can make use of an index on T1(x).
When is where exists appropriate and in appropriate?
Use EXISTS when... the subquery on T2 is huge and would take a long time to materialize, T1 is relatively small, and executing (select null from t2 where y = x.x) is very, very fast.
Use IN when... the result of the subquery is small -- then IN is typically more appropriate.
If both the subquery and the outer table are huge -- either might work as well as the other; it depends on the indexes and other factors.
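Since the best choice depends on your data and indexes, it is worth measuring both forms yourself; for example, in SQL Server you can compare logical reads and timings like this (queries copied from the question):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT * FROM TABLE1 t1 WHERE EXISTS (SELECT 1 FROM TABLE2 t2 WHERE t1.ID = t2.ID);
SELECT * FROM TABLE1 t1 WHERE t1.ID IN (SELECT t2.ID FROM TABLE2 t2);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;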

SQLite table aliases affecting the performance of queries

How does SQLite treat table aliases internally?
Does creating a table alias internally create a copy of the table, or does it just refer to the same table without creating a copy?
When I create multiple aliases of the same table in my code, the performance of the query is severely hit!
In my case, I have one table, call it MainTable, with 2 columns: name and value.
I want to select multiple values in one row as different columns. For example:
Name: a, b, c, d, e, f
Value: p, q, r, s, t, u
such that a corresponds to p, and so on.
I want to select the values for names a, b, c and d in one row => p, q, r, s
So I write a query:
SELECT t1.value, t2.value, t3.value, t4.value
FROM MainTable t1, MainTable t2, MainTable t3, MainTable t4
WHERE t1.name = 'a' and t2.name = 'b' and t3.name = 'c' and t4.name = 'd';
This way of writing the query kills the performance when the size of the table increases, as Larry rightly points out.
Is there any efficient way to retrieve this result? I am bad at SQL queries :(
If you list the same table more than once in your SQL statement and do not supply conditions on which to JOIN the tables, you are creating a cartesian JOIN in your result set and it will be enormous:
SELECT * FROM MyTable A, MyTable B;
If MyTable has 1000 records, this will create a result set with one million records. Any other selection criteria you include will then have to be evaluated across all one million records.
I'm not sure that's what you're doing (your question is very unclear), but it may be a start on solving your problem.
Updated answer now that the poster has added the query that is being executed.
You're going to have to get a little tricky to get the results you want. You need to use CASE and MAX and, unfortunately, the syntax for CASE is a little verbose:
SELECT MAX(CASE WHEN name='a' THEN value ELSE NULL END),
MAX(CASE WHEN name='b' THEN value ELSE NULL END),
MAX(CASE WHEN name='c' THEN value ELSE NULL END),
MAX(CASE WHEN name='d' THEN value ELSE NULL END)
FROM MainTable WHERE name IN ('a','b','c','d');
Please give that a try against your actual database and see what you get (of course, you want to make sure the name column is indexed).
Assuming you have a table dbo.Customers with a million rows,
SELECT * from dbo.Customers A
does not result in a copy of the table being created.
As Larry pointed out, the query as it stands is doing a cartesian product across your table four times, which, as you have observed, kills your performance.
The updated ticket states the desire is to have 4 values from different queries in a single row. That's fairly simple, assuming this syntax is valid for SQLite.
You can see that the following four queries, when run in series, produce the desired values, but as 4 separate rows.
SELECT t1.value
FROM MainTable t1
WHERE t1.name='a';

SELECT t2.value
FROM MainTable t2
WHERE t2.name='b';

SELECT t3.value
FROM MainTable t3
WHERE t3.name='c';

SELECT t4.value
FROM MainTable t4
WHERE t4.name='d';
The trick is simply to run them as subqueries, so that there are 5 queries: 1 driver query and 4 subqueries doing all the work. This pattern will only work if each subquery returns exactly one row.
SELECT
(
    SELECT t1.value
    FROM MainTable t1
    WHERE t1.name='a'
) AS t1_value
,
(
    SELECT t2.value
    FROM MainTable t2
    WHERE t2.name='b'
) AS t2_value
,
(
    SELECT t3.value
    FROM MainTable t3
    WHERE t3.name='c'
) AS t3_value
,
(
    SELECT t4.value
    FROM MainTable t4
    WHERE t4.name='d'
) AS t4_value
Aliasing a table will result in a reference to the original table that exists for the duration of the SQL statement.

Table "Diff" in Oracle

What is the best way to perform a "diff" between two structurally identical tables in Oracle? They're in two different schemas (visible to each other).
Thanks,
If you don't have a tool like PL/SQL Developer, you could full outer join the two tables. If they have a primary key, you can use that in the join. This will give you an instant view of the records missing from either table.
Then, for the records that do exist in both tables, you can compare each of the fields. Note that you cannot compare NULL with the regular = operator, so the check table1.field1 = table2.field1 does not evaluate to true when both fields are null. So for each field you'll have to check whether it has the same value as in the other table, or whether both are null.
Your query might look like this (to return records that don't match):
select *
from table1 t1
full outer join table2 t2 on t2.id = t1.id
where
  -- Only one of the two records exists
  t1.id is null or t2.id is null or
  ( -- The values differ
    t1.field1 <> t2.field1
    -- or exactly one of the two is null
    or (t1.field1 is null and t2.field1 is not null)
    or (t1.field1 is not null and t2.field1 is null)
  )
You will have to copy that field1 condition for each of your fields. Of course you could write a function to compare field data to make your query easier, but you should keep in mind that that might decrease performance dramatically when you need to compare two large tables.
If your tables do not have a primary key, you will need to cross join them and perform these checks for each resulting record. You may speed that up a little by using a full outer join on each mandatory field, because such a field cannot be null and can therefore be used in the join.
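As an aside, a common Oracle shortcut for the per-column check is DECODE, which treats two NULLs as equal, so "differs" collapses to one expression per column (a sketch using the same hypothetical field1):
select *
from table1 t1
full outer join table2 t2 on t2.id = t1.id
where t1.id is null
   or t2.id is null
   -- DECODE returns 0 when the values match (including NULL = NULL), 1 otherwise
   or decode(t1.field1, t2.field1, 0, 1) = 1;
-- add one DECODE line per remaining column, OR-ed together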
Assuming you want to compare the data (diff on entire rows) in the two tables:
SELECT *
FROM (
    SELECT 's1.t' "Row Source", a.*
    FROM (
        SELECT col1, col2 FROM s1.t tbl1
        MINUS
        SELECT col1, col2 FROM s2.t tbl2
    ) a
    UNION ALL
    SELECT 's2.t', b.*
    FROM (
        SELECT col1, col2 FROM s2.t tbl2
        MINUS
        SELECT col1, col2 FROM s1.t tbl1
    ) b
)
ORDER BY 1;
More info about comparing two tables.