I have a performance issue with a query on a table that has 33M rows. The query should return about 6M rows.
I'm trying to make the response to the request start without any significant delay; this is required for data streaming in my app.
After the start, the data transfer itself may take longer. The difficulty is that the query has sorting.
So I created an index on the fields that are used in the ORDER BY statement and in the WHERE clause.
An example looks like this:
CREATE TABLE Table1 (
Id SERIAL PRIMARY KEY,
Field1 INT NOT NULL,
Field2 INT NOT NULL,
Field3 INT NOT NULL,
Field4 VARCHAR(200) NOT NULL,
CreateDate TIMESTAMP,
CloseDate TIMESTAMP NULL
);
CREATE INDEX IX_Table1_SomeIndex ON Table1 (Field2, Field4);
And the query looks like this:
SELECT * FROM Table1 t
WHERE t.CreateDate >= '2020-01-01' AND t.CreateDate < '2021-01-01'
ORDER BY t.Field2, t.Field4
It leads to the following:
when I add "LIMIT 1000", it returns results immediately and builds the following plan:
the plan with 'LIMIT'
when I run it without "LIMIT", it "thinks" for about a minute and then returns data over about 16 minutes, and it builds the following plan:
the plan without 'LIMIT'
Why are the plans different?
Could you help me find a solution so that streaming starts immediately (without LIMIT)?
Thanks!
You will need to use a server side cursor or something similar for this to work. Otherwise it runs the query to completion before returning any results. There is no "streaming" by default. How you do this depends on your client, which you don't mention.
If you simply DECLARE a cursor and then FETCH in chunks, then the setting cursor_tuple_fraction will control whether it chooses the plan with a faster start up cost (like what you get with the LIMIT), or a faster overall run cost (like you get without the LIMIT).
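For illustration, a minimal sketch of that cursor approach in plain SQL, using the table from the question (the cursor name, chunk size, and setting value are arbitrary):

SET cursor_tuple_fraction = 0.05;  -- optional: bias the planner toward a fast start-up plan

BEGIN;

DECLARE table1_cur CURSOR FOR
    SELECT * FROM Table1 t
    WHERE t.CreateDate >= '2020-01-01' AND t.CreateDate < '2021-01-01'
    ORDER BY t.Field2, t.Field4;

-- Fetch and stream the result in chunks; repeat until a FETCH returns no rows.
FETCH 1000 FROM table1_cur;
FETCH 1000 FROM table1_cur;

CLOSE table1_cur;
COMMIT;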
If "when I add LIMIT 1000 it returns result immediately" and you want to avoid latency then I would suggest that you run a slightly modified query many times in a loop with LIMIT 1000. An important benefit would be that there will be no long running transactions.
The query to run many times in a loop should return records starting after the largest value of (field2, field4) from the previous iteration run.
SELECT *
FROM table1 t
WHERE t.CreateDate >= '2020-01-01' AND t.CreateDate < '2021-01-01'
AND (t.field2, t.field4) > (:last_run_largest_f2_value, :last_run_largest_f4_value)
ORDER BY t.field2, t.field4
LIMIT 1000;
last_run_largest_f2_value and last_run_largest_f4_value are parameters. Their values should come from the last record returned by the previous iteration.
The AND (t.field2, t.field4) > (:last_run_largest_f2_value, :last_run_largest_f4_value) predicate should be omitted in the first iteration.
Important limitation
This is an alternative to OFFSET that will only work correctly if the (field2, field4) values are unique.
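If (field2, field4) is not unique, one possible workaround (a sketch, not part of the suggestion above; the index name is arbitrary) is to add the primary key as a final tiebreaker and extend the index accordingly:

CREATE INDEX IX_Table1_SomeIndex2 ON Table1 (Field2, Field4, Id);

SELECT *
FROM table1 t
WHERE t.CreateDate >= '2020-01-01' AND t.CreateDate < '2021-01-01'
  AND (t.field2, t.field4, t.id) > (:last_run_largest_f2_value, :last_run_largest_f4_value, :last_run_largest_id)
ORDER BY t.field2, t.field4, t.id
LIMIT 1000;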
Related
I have a query running against a SQL Server database that is taking over 10 seconds to execute. The table being queried has over 14 million rows.
I want to display the Text column from a Notes table by a given ServiceUserId in date order. There could be thousands of entries so I want to limit the returned values to a manageable level.
SELECT Text
FROM
(SELECT
ROW_NUMBER() OVER (ORDER BY [DateDone]) AS RowNum, Text
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2') AS RowConstrainedResult
WHERE
RowNum >= 40 AND RowNum < 60
ORDER BY
RowNum
Below is the execution plan for the above query.
Nonclustered Index - nonclustered index on the ServiceUserId and DateDone columns in ascending order.
Key lookup - Primary key for the table which is the NoteId
If I run the same query a second time but with different row numbers then I get a response in milliseconds, I assume from a cached execution plan. A query run for a different ServiceUserId will take ~10 seconds though.
Any suggestions for how to speed up this query?
You should look into Keyset Pagination.
It is far more performant than Rowset Pagination.
It differs fundamentally in that, instead of referencing a particular block of row numbers, you reference the starting point from which to look up the index key.
The reason it is much faster is that you don't care how many rows come before a particular key; you just seek to that key and move forward (or backward).
Say you are filtering by a single ServiceUserId, ordering by DateDone. You need an index as follows (you could leave out the INCLUDE if it's too big, it doesn't change the maths very much):
create index IX_DateDone on Notes (ServiceUserId, DateDone) INCLUDE (TEXT);
Now, when you select some rows, instead of giving the start and end row numbers, give the starting key:
SELECT TOP (20)
Text,
DateDone
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
AND DateDone > @startingDate
ORDER BY
DateDone;
On the next run, you pass the last DateDone value you received. This gets you the next batch.
The one small downside is that you cannot jump pages. However, it is much rarer than some may think (from a UI perspective) for a user to want to jump to page 327. So that doesn't really matter.
The key must be unique. If it is not unique you can't seek to exactly the next row. If you need to use an extra column to guarantee uniqueness, it gets a little more complicated:
WITH NotesFiltered AS
(
    SELECT * FROM Notes
    WHERE
        ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
)
SELECT TOP (20)
    Text,
    DateDone
FROM (
    SELECT
        Text,
        DateDone,
        NoteId,
        0 AS ordering
    FROM NotesFiltered
    WHERE
        DateDone = @startingDate AND NoteId > @startingNoteId
    UNION ALL
    SELECT
        Text,
        DateDone,
        NoteId,
        1 AS ordering
    FROM NotesFiltered
    WHERE
        DateDone > @startingDate
) n
ORDER BY
    ordering, DateDone, NoteId;
Side Note
In RDBMSs that support row-value comparisons, the multi-column example could be simplified back to the original code by writing:
WHERE (DateDone, NoteId) > (@startingDate, @startingNoteId)
Unfortunately SQL Server does not support this currently.
Please vote for the Azure Feedback request for this
I would suggest using ORDER BY with OFFSET FETCH:
It starts from row number x and fetches the next z rows; both can be parameterized.
SELECT
Text
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
Order by DateDone
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY
Also make sure you have a proper index on DateDone; maybe include it in the index you already have on Notes if you haven't yet.
You may need to include the Text column in your index:
create index IX_DateDone on Notes(DateDone) INCLUDE (TEXT, ServiceUserId)
However, be aware that adding such a huge column to the index will affect your insert/update efficiency, and of course it will need disk space.
Consider a complex UNION select from dozens of tables with different structures but similar field meanings:
SELECT a1.abc  as field1,
       a1.bcd  as field2,
       a1.date as order_date
FROM a1_table a1
UNION ALL
SELECT a2.def as field1,
       a2.fff as field2,
       a2.ts  as order_date
FROM a2_table a2
UNION ALL ...
ORDER BY order_date
Note also that the results are, in general, sorted by the "synthetic" field order_date.
This query returns a huge number of rows, and we want to work with pages from this result set. Each page is defined by two parameters:
page size
field2 value of last item from previous page
Most importantly, we cannot change the way a page is defined. I.e., it is not possible to use the row number or the date of the last item from the previous page: only the field2 value is acceptable.
The current paging algorithm is implemented in a quite ugly way:
1) The query above is wrapped in an additional select with an extra row_number() column and then wrapped in a stored procedure union_wrapper, which returns an appropriate
table ( field1 ..., field2 character varying),
2) then a complex select is performed:
RETURN QUERY
with tmp as (
select
rownum, field1, field2 from union_wrapper()
)
SELECT field1, field2
FROM tmp
WHERE rownum > (SELECT rownum
FROM tmp
WHERE field2 = last_field_id
LIMIT 1)
LIMIT page_size
The problem is that we have to build the full union-select result in memory in order to later detect the row number from which we want to cut the new page. This is quite slow and takes an unacceptably long time.
Is there any way to restructure these operations to significantly reduce the query complexity and increase its speed?
And again: we cannot change the paging conditions and we cannot change the structure of the tables; only the way rows are retrieved.
UPD: I also cannot use temp tables, because I'm working on a read replica of the database.
You have successfully maneuvered yourself into a tight spot. The query and its ORDER BY expression contradict your paging requirements.
ORDER BY order_date is not a deterministic sort order (there could be multiple rows with the same order_date), which you need before you do anything else here. And field2 does not seem to be unique either. You need both: a deterministic sort order and a unique indicator of page end / start. Ideally, the indicator matches the sort order. That could be (order_date, field2), with both columns defined NOT NULL and the combination UNIQUE. Your restriction "only field2 value is acceptable" contradicts your query.
That's all before thinking about how to get best performance ...
There are proven solutions with row values and multi-column indexes for paging:
Optimize query with OFFSET on large table
But drawing from a combination of multiple source tables complicates matters. Optimization depends on the details of your setup.
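For illustration only, once (order_date, field2) is made NOT NULL and unique as described above, a keyset-style page query over the union could look roughly like this (the parameter names are made up; whether it is actually fast depends on whether each branch can deliver rows pre-sorted from an index):

SELECT field1, field2, order_date
FROM (
    SELECT a1.abc AS field1, a1.bcd AS field2, a1.date AS order_date FROM a1_table a1
    UNION ALL
    SELECT a2.def AS field1, a2.fff AS field2, a2.ts  AS order_date FROM a2_table a2
    -- ... remaining source tables ...
) u
WHERE (order_date, field2) > (:last_order_date, :last_field2)  -- row-value comparison
ORDER BY order_date, field2
LIMIT :page_size;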
If you can't get the performance you need, your only remaining alternative is to materialize the query results somehow. Temp table, cursor, materialized view - the best tool depends on details of your setup.
Of course, general performance tuning might help, too.
I have this table containing 7,000 records:
desc ARADMIN.V_PKGXMLCODE
Name Null Type
--------------------- -------- -------------
REQUEST_ID NOT NULL VARCHAR2(15)
AVAILABILITY VARCHAR2(69)
XML_CODE CLOB
PACKAGENAME_UNIQUE VARCHAR2(50)
CATALOG NUMBER(15)
CHILD VARCHAR2(255)
CLASSIFICATION_SYSTEM NUMBER(15)
E_MAIL VARCHAR2(69)
The query
SELECT COUNT(*) FROM ARADMIN.V_PKGXMLCODE WHERE (CATALOG <> 0 AND CATALOG <> 2) AND (NOT (CHILD IS NULL));
takes less than one second.
The query
SELECT COUNT(*) FROM ARADMIN.V_PKGXMLCODE WHERE (CATALOG IS NULL OR (CATALOG <> 0 AND CATALOG <> 2)) AND (NOT (CHILD IS NULL));
takes 23 seconds.
The explain plan, however, claims it should be really quick...
What can I do?
The only way I can think of to get that kind of difference in execution speed would be to (a) have an index on CATALOG, and (b) have a lot of empty data blocks, possibly from a high water mark set very high by repeated direct-path loads.
The first query would still use the index and perform as expected. But as null values are not indexed, the index cannot be used to check the "or CATALOG is null" condition, so it falls back to a full table scan.
That in itself shouldn't be a problem here, as a full table scan of 7000 rows shouldn't take long. But since it is taking so long, something else is going on. A full table scan has to examine every data block allocated to the table to see if they contain any rows, and the time it's taking suggests there are a lot more blocks than you need to hold 7000 rows, even with inline CLOB storage.
The simplest way to get a lot of empty data blocks is to have a lot of data and then delete most of it. But I believe you said in a now-deleted comment on an earlier question that performance used to be OK and has got worse. That can happen if you do direct-path inserts, particularly if you 'refresh' data by deleting it and then inserting new data in direct-path mode. You could be doing that with inserts that have the /*+ append */ hint; or in parallel; or through SQL*Loader. Each time you did that the high water mark would move, as old empty blocks wouldn't be reused; and each time performance of the query that checks for nulls would degrade a little. After a lot of iterations that would really start to add up.
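For context, a direct-path insert of the kind described looks something like this (the table names are illustrative):

-- Direct-path insert: new rows go in above the high water mark,
-- and empty blocks below it are not reused.
INSERT /*+ APPEND */ INTO mytable
SELECT * FROM staging_table;
COMMIT;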
You can check the data dictionary to see how much space is allocated to your table (user_segments etc.), and compare that to the size of the data you think you actually have. You can reset the HWM by rebuilding the table, e.g by doing:
alter table mytable move;
(preferably in a maintenance window!)
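The data-dictionary check mentioned above could look like this (run as the table owner; use dba_segments if the table lives in another schema):

-- Compare allocated blocks/space with the size you expect for ~7,000 rows.
SELECT segment_name, blocks, bytes / 1024 / 1024 AS mb
FROM   user_segments
WHERE  segment_name = 'V_PKGXMLCODE';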
As a demo I ran a cycle to direct-path insert and delete 7,000 rows over a hundred times, and then ran both your queries. The first one took 0.06 seconds (much of which is SQL Developer overhead); the second took 1.260 seconds. (I also ran Gordon's, which got a similar time, as it still has to do a FTS.) With more iterations the difference would become even more marked, but I ran out of space... I then did an alter table move and re-ran your second query, which then took 0.05 seconds.
That is interesting. I would expect the query to have the same performance, because Oracle has a good optimizer and shouldn't be confused by the NULL.
How does this version have better performance?
select x1.cnt + x2.cnt + x3.cnt
from (select count(*) as cnt
from MYTABLE
where field4 = 1 and child is not null
) x1 cross join
(select count(*) as cnt
from MYTABLE
where field4 = 4 and child is not null
) x2 cross join
(select count(*) as cnt
from MYTABLE
where field4 is null and child is not null
) x3;
This version should be able to take advantage of an index on MYTABLE(field4, child).
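For reference, that index could be created like this (the index name is arbitrary):

CREATE INDEX ix_mytable_field4_child ON MYTABLE (field4, child);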
I was actually struggling with a similar issue. I had a condition where I needed to filter out all the NULL values from my query.
I started with:
ColumnName IS NOT NULL
This increased my query time manyfold. I tried multiple things after this, like functions where I would just return what I needed, but that did not work either. Finally a small change did the trick; what I did was:
IsNull(ColumnName,'') <> ''
And it worked. I am not entirely sure what the difference is, but it worked.
IS NULL does not work with count. You get the error "Incorrect parameter count in the call to native function 'ISNULL'"
I have the following table:
CREATE TABLE dbo.TestSort
(
Id int NOT NULL IDENTITY (1, 1),
Value int NOT NULL
)
The Value column could (and is expected to) contain duplicates.
Let's also assume there are already 1000 rows in the table.
I am trying to prove a point about unstable sorting.
Given this query that returns a 'page' of 10 results from the first 1000 inserted results:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value
My intuition tells me that two runs of this query could return different rows if the Value column contains repeated values.
I'm basing this on the facts that:
the sort is not stable
if new rows are inserted in the table between the two runs of the query, it could possibly create a re-balancing of B-trees (the Value column may be indexed or not)
EDIT: For completeness: I assume rows never change once inserted, and are never deleted.
In contrast, a query with stable sort (ordering also by Id) should always return the same results, since IDs are unique:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value, Id
The question is: Is my intuition correct? If yes, can you provide an actual example of operations that would produce different results (at least "on your machine")? You could modify the query, add indexes on the Value column, etc.
I don't care about the exact query, but about the principle.
I am using MS SQL Server (2014), but am equally satisfied with answers for any SQL database.
If not, then why?
Your intuition is correct. In SQL, the sort for order by is not stable. So, if you have ties, they can be returned in any order. And, the order can change from one run to another.
The documentation sort of explains this:
Using OFFSET and FETCH as a paging solution requires running the query
one time for each "page" of data returned to the client application.
For example, to return the results of a query in 10-row increments,
you must execute the query one time to return rows 1 to 10 and then
run the query again to return rows 11 to 20 and so on. Each query is
independent and not related to each other in any way. This means that,
unlike using a cursor in which the query is executed once and state is
maintained on the server, the client application is responsible for
tracking state. To achieve stable results between query requests using
OFFSET and FETCH, the following conditions must be met:
The underlying data that is used by the query must not change. That is, either the rows touched by the query are not updated or all
requests for pages from the query are executed in a single transaction
using either snapshot or serializable transaction isolation. For more
information about these transaction isolation levels, see SET
TRANSACTION ISOLATION LEVEL (Transact-SQL).
The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
Although this specifically refers to offset/fetch, it clearly applies to running the query multiple times without those clauses.
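To tie the quote back to the table from the question, here is a sketch of deterministic paging (not from the original answer): adding the unique Id as a tiebreaker makes the page boundaries stable across runs.

-- (Value, Id) is a unique combination, so repeated runs return the same pages.
SELECT Id, Value
FROM TestSort
WHERE Id <= 1000
ORDER BY Value, Id
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY;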
If you have ties when ordering, the ORDER BY is not stable.
LiveDemo
CREATE TABLE #TestSort
(
    Id INT NOT NULL IDENTITY (1, 1) PRIMARY KEY,
    Value INT NOT NULL
);

DECLARE @c INT = 0;
WHILE @c < 100000
BEGIN
    INSERT INTO #TestSort(Value)
    VALUES (2);
    SET @c += 1;
END
Example:
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
DBCC DROPCLEANBUFFERS; -- run to clear cache
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
The point is that I force the query optimizer to use a parallel plan, so there is no guarantee that it will read the data sequentially the way a clustered index scan probably would when no parallelism is involved.
You cannot be sure how the query optimizer will read the data unless you explicitly force a specific sort order using ORDER BY Id, Value.
For more info read No Seatbelt - Expecting Order without ORDER BY.
I think this post will answer your question:
Is SQL order by clause guaranteed to be stable ( by Standards)
The result is the same every time when you are in a single-threaded environment. Once multi-threading is used, you can't guarantee it.
I have a very large table (15 million rows, this is an audit table).
I need to run a query that checks for occurrences in the audit table that are after a certain date and meet certain criteria (I am looking for audit records that took place on the current day only).
When I run:
SELECT Field1, Field2 FROM AUDIT_TABLE WHERE AUDIT_DATE >= '8/9/12'
The results come back fairly quick (a few seconds, not bad for 15M rows)
When I run:
SELECT Field1, Field2 FROM AUDIT_TABLE WHERE AUDIT_DATE >= @DateTime
It takes 11-15 seconds and does a full table scan.
The actual field I am querying against is a DATETIME type, and the index is also on that field.
Sounds like you are stuck with a bad plan, probably because someone used a parameter at some point that selected enough of the table that a table scan was the most efficient way for that parameter value. Try running the query once this way:
SELECT ... FROM AUDIT_TABLE WHERE AUDIT_DATE >= @DateTime OPTION (RECOMPILE);
And then change your code this way:
SELECT ... FROM dbo.AUDIT_TABLE WHERE AUDIT_DATE >= @DateTime;
Using the dbo. prefix will at the very least prevent different users with different schemas from polluting the plan cache with different versions of the plan. It will also disassociate future queries from the bad plan that is stored.
If you are going to vary between selecting recent rows (small %) and a lot of rows, I would probably just leave the OPTION (RECOMPILE) on there. Paying the minor CPU penalty in recompilation every time is going to be cheaper than getting stuck with a bad plan for most of your queries.
Another trick I've seen used to bypass parameter sniffing:
ALTER PROCEDURE dbo.whatever
    @DateTime DATETIME
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @dt DATETIME;
    SET @dt = @DateTime;
    SELECT ... WHERE AUDIT_DATE >= @dt;
END
GO
It's kind of a dirty and unintuitive trick, but it gives the optimizer a better glimpse at the parameter value and a better chance to optimize for that value.