Optimizing "ORDER BY" when the result set is very large and it can't be ordered by an index


How can I make an ORDER BY clause with a small LIMIT (i.e. 20 rows at a time) return quickly, when I can't use an index to satisfy the ordering of rows?
Let's say I would like to retrieve a certain number of titles from a table 'node' (simplified below). I'm using MySQL by the way.
node_ID INT(11) NOT NULL auto_increment,
node_title VARCHAR(127) NOT NULL,
node_lastupdated INT(11) NOT NULL,
node_created INT(11) NOT NULL
But I need to limit the rows returned to only those a particular user has access to. Many users have access to large numbers of nodes. I have this information pre-calculated in a big lookup table (an attempt to make things easier) where the primary key covers both columns and the presence of a row means that usergroup has access to that node:
viewpermission_nodeID INT(11) NOT NULL,
viewpermission_usergroupID INT(11) NOT NULL
My query therefore contains something like
FROM
node
INNER JOIN viewpermission ON
viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
... and I also use a GROUP BY or a DISTINCT so that a node is only returned once even if two of the user's 'usergroups' both have access to that node.
My problem is that there seems to be no way for an ORDER BY clause which sorts results by created or last updated date to use an index, because the rows being returned depend on values in the other viewpermission table.
Therefore MySQL would need to find all rows which match the criteria, then sort them all itself. If there are one million rows for a particular user, and we want to view, say, the latest 100 or rows 100-200 when ordered by last update, the DB would need to figure out which one million rows the user can see, sort this whole result set itself, before it can return those 100 rows, right?
Is there any creative way to get around this? I've been thinking along the lines of:
Somehow add dates into the viewpermission lookup table so that I can build an index containing the dates as well as the permissions. It's a possibility I guess.
Edit: Simplified question
Perhaps I can simplify the question by rewriting it like this:
Is there any way to rewrite this query or create an index for the following such that an index can be used to do the ordering (not just to select the rows)?
SELECT nodeid
FROM lookup
WHERE
usergroup IN (2, 3)
GROUP BY
nodeid
An index on (usergroup) allows the WHERE part to be satisfied by an index, but the GROUP BY forces a temporary table and filesort on those rows. An index on (nodeid) does nothing for me, because the WHERE clause needs an index with usergroup as its first column. An index on (usergroup, nodeid) forces a temporary table and filesort because the GROUP BY is not the first column of the index that can vary.
Any solutions?

Can I answer my own question?
I believe I have found that the only way to do what I describe is for my lookup table to have rows for every possible combination of usergroups a person may want to be a member of.
To pick a simplified example, instead of doing this:
SELECT id FROM ids WHERE groups IN(1,2) ORDER BY id
If you need to use the index both to select rows and to order them, you have to abstract that IN(1,2) so that it is constant rather than a range, i.e.:
SELECT id FROM ids WHERE grouplist='1,2' ORDER BY id
Of course instead of using the string '1,2' you could have a foreign key there, etc. The point being that you'd have to have a row not just for each group but for each combination of multiple groups.
So, there is my answer.
Anyway, for my application, I feel that maintaining a lookup for all possible combinations of usergroups for each node is not worth it. For my purposes, I predict that most nodes are visible to most users, so I feel that it is acceptable to simply make the GROUP BY use the index, as the filtering doesn't need it so badly.
In other words, the approach I'll take for my original query may be something like:
SELECT
<fields>
FROM
node FORCE INDEX (node_created_and_node_ID)
INNER JOIN viewpermission ON
viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
GROUP BY
node_created, node_ID
GROUP BY can use an index if it starts at the leftmost column of the index and it is in the first non-const, non-system table to be processed. The join then deals with the entire list (which is already ordered), and only those rows not visible to the current user (which will be a small proportion) are removed by the INNER JOIN.

Copy the value you are going to order by into the viewpermission table and add it to your index.
You could use a trigger to maintain that value from the other table.
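The suggestion above can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in for MySQL, so the trigger syntax is SQLite's, not MySQL's. The column perm_lastupdated, the index perm_group_date and the trigger node_sync_lastupdated are hypothetical names that do not appear in the question. Note that the index can deliver rows in date order only for a single usergroup; with IN over several groups the engine still has to combine the per-group runs.

```python
import sqlite3

# SQLite stand-in for the MySQL schema. perm_lastupdated, perm_group_date and
# node_sync_lastupdated are hypothetical names added for illustration.
SCHEMA = """
CREATE TABLE node (
    node_ID INTEGER PRIMARY KEY,
    node_title TEXT NOT NULL,
    node_lastupdated INTEGER NOT NULL
);
CREATE TABLE viewpermission (
    viewpermission_nodeID INTEGER NOT NULL,
    viewpermission_usergroupID INTEGER NOT NULL,
    perm_lastupdated INTEGER NOT NULL,  -- copy of node.node_lastupdated
    PRIMARY KEY (viewpermission_usergroupID, viewpermission_nodeID)
);
-- Filters on one usergroup and returns its rows already in date order.
CREATE INDEX perm_group_date
    ON viewpermission (viewpermission_usergroupID, perm_lastupdated);
-- Keep the copy in sync whenever a node's date changes.
CREATE TRIGGER node_sync_lastupdated
AFTER UPDATE OF node_lastupdated ON node
BEGIN
    UPDATE viewpermission
       SET perm_lastupdated = NEW.node_lastupdated
     WHERE viewpermission_nodeID = NEW.node_ID;
END;
"""


def build_demo():
    """Create an in-memory database with three nodes visible to usergroup 7."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO node VALUES (?, ?, ?)",
                     [(1, "a", 100), (2, "b", 200), (3, "c", 300)])
    conn.executemany("INSERT INTO viewpermission VALUES (?, ?, ?)",
                     [(1, 7, 100), (2, 7, 200), (3, 7, 300)])
    return conn


def latest_for_group(conn, usergroup, limit):
    """Return the most recently updated node IDs visible to one usergroup."""
    rows = conn.execute(
        "SELECT viewpermission_nodeID FROM viewpermission "
        "WHERE viewpermission_usergroupID = ? "
        "ORDER BY perm_lastupdated DESC LIMIT ?",
        (usergroup, limit))
    return [r[0] for r in rows]
```

Updating node.node_lastupdated fires the trigger, so the lookup table stays in step without any application code touching it.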

select * from
(
select *
FROM node
INNER JOIN viewpermission
ON viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
) a
order by a.node_lastupdated desc
The inner query gives you the filtered subset, which I understand is substantially smaller than the whole set. Only the smaller set has to be sorted.

MySQL has problems when you use GROUP BY and ORDER BY in the same query. That causes a filesort, and that's probably the biggest penalty for performance.
You can eliminate the need for a DISTINCT (or GROUP BY) by using a non-correlated subquery instead of a JOIN.
SELECT * FROM node
WHERE node_id IN (
SELECT viewpermission_nodeID
FROM viewpermission
WHERE viewpermission_usergroupID IN ( <...usergroups...> )
)
ORDER BY node_lastupdated DESC
LIMIT 100;
There's no need to sort or do a DISTINCT on the subquery, since IN (1, 1, 2, 3) is the same as IN (1, 3, 2).
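A small sketch of the difference, run against SQLite through Python's sqlite3 module instead of MySQL (the sample rows and usergroup numbers are made up for illustration). The JOIN returns a node once per matching usergroup, while the IN subquery returns each node at most once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (node_ID INTEGER PRIMARY KEY, "
    "node_lastupdated INTEGER NOT NULL)")
conn.execute(
    "CREATE TABLE viewpermission ("
    "viewpermission_nodeID INTEGER NOT NULL, "
    "viewpermission_usergroupID INTEGER NOT NULL, "
    "PRIMARY KEY (viewpermission_nodeID, viewpermission_usergroupID))")
conn.executemany("INSERT INTO node VALUES (?, ?)",
                 [(1, 10), (2, 30), (3, 20)])
# Node 2 is visible to both usergroups 5 and 6; node 3 only to usergroup 9.
conn.executemany("INSERT INTO viewpermission VALUES (?, ?)",
                 [(1, 5), (2, 5), (2, 6), (3, 9)])

# JOIN form: node 2 appears twice, so DISTINCT or GROUP BY would be needed.
join_ids = [r[0] for r in conn.execute(
    "SELECT node_ID FROM node "
    "INNER JOIN viewpermission ON viewpermission_nodeID = node_ID "
    "AND viewpermission_usergroupID IN (5, 6) "
    "ORDER BY node_lastupdated DESC")]

# Subquery form: membership test only, so no duplicates can arise.
in_ids = [r[0] for r in conn.execute(
    "SELECT node_ID FROM node WHERE node_ID IN ("
    "  SELECT viewpermission_nodeID FROM viewpermission "
    "  WHERE viewpermission_usergroupID IN (5, 6)) "
    "ORDER BY node_lastupdated DESC LIMIT 100")]

print(join_ids)  # [2, 2, 1]
print(in_ids)    # [2, 1]
```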
Note that MySQL will typically use only one index per table in a given query (index merge aside), so it'll try to make the best choice between an index on node_id and an index on node_lastupdated. It generally won't combine the two here, and even if you made a compound index it wouldn't help in this case.
Remember to analyze different solutions with EXPLAIN.

Related

Performance impact of view on aggregate function vs result set limiting

The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
id BIGSERIAL PRIMARY KEY,
joincol VARCHAR
);
CREATE TABLE test2 (
joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
SELECT test1.id,
test1.joincol AS t1charcol,
test2.joincol AS t2charcol
FROM test1, test2
WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100ms. As far as I understand the execution plan, the runtime is independent of the rowcount, since Postgres iterates the rows one by one (starting at the highest id, using the index) until a join on a row is possible and immediately returns.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on rowcount), since the two tables are "joined completely", before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
In my real environment test1 contains only a handful of rows (< 100), having unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on row basis for the second select? Is there anything I could improve on the tables/indexes?

Queries not strictly equivalent
why does Postgres not recognize that it could use a Index Scan Backward on row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
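The NULL-handling difference can be reproduced on a toy table. The sketch below uses Python's sqlite3 module, not PostgreSQL. SQLite's default NULL placement differs from PostgreSQL's, so the PostgreSQL behaviour (NULLs first under DESC) is spelled out explicitly with an `id IS NULL` sort key instead of relying on defaults. The table t and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")  # nullable on purpose
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (3,)])

# Aggregates skip NULL values.
max_id = conn.execute("SELECT max(id) FROM t").fetchone()[0]

# Emulates PostgreSQL's default for ORDER BY id DESC (NULLs sort first).
nulls_first = conn.execute(
    "SELECT id FROM t ORDER BY id IS NULL DESC, id DESC LIMIT 1"
).fetchone()[0]

# Emulates ORDER BY id DESC NULLS LAST, the formal equivalent of max(id).
nulls_last = conn.execute(
    "SELECT id FROM t ORDER BY id IS NULL ASC, id DESC LIMIT 1"
).fetchone()[0]

print(max_id, nulls_first, nulls_last)  # 3 None 3
```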
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be; that could get out of hand quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table directly. Regardless of index sort order, an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure whether that could be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON "test" ("id");
The PK on test.id is implemented with a unique index on the column, that already covers everything the additional index might do for you.
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far away from the actual use case to be meaningful.
In the test setup, each table has 100k rows, there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL.
Your real case has 10M rows in table1 and < 100 rows in table2, every value in table1.joincol has a match in table2.joincol, both are defined NOT NULL, and table2.joincol is unique. A classical one-to-many relationship. There should be a UNIQUE constraint on table2.joincol and a FK constraint t1.joincol --> t2.joincol.
But that's currently all twisted in the question. Standing by till that's cleaned up.

This is a very good problem, and a good test case.
I tested it in Postgres 9.3; perhaps 13 handles it faster.
I applied Occam's razor and ruled out some possibilities:
The view (the query is slow without the view, too).
The JOIN filtering out rows (unfortunately it does not in your test, but with longer md5 values of 5-6 characters it does).
Other basic equivalent SELECT statements (inner query or EXISTS) do not solve your problem.
I managed to get a plan that uses only indexes, but because the tables aren't bigger than the indexes, that was not the solution.
I think
CREATE INDEX on "test" ("id");
is useless, because of the PK!
If you change this
CREATE INDEX on "test" ("joincol");
to this
CREATE INDEX ON TEST (joincol, id);
Then the second query uses only indexes.
After you run this
REINDEX table test;
REINDEX table test2;
VACUUM ANALYZE test;
VACUUM ANALYZE test2;
you can gain some performance, because you created the indexes before the inserts.
I think the reason is that the DB has two optimization goals.
The first goal is to optimize for just a few rows, so it runs a Nested Loop. You can force this with LIMIT x.
The second goal is to optimize for the whole table: run the query fast over the whole table.
In this situation the Postgres optimizer didn't notice that a simple MAX can run with a Nested Loop. Or perhaps Postgres cannot apply a limit inside an aggregate (the aggregate may run over the whole partial select that the query filters).
And this is very expensive. But you can also write other aggregates there, like SUM, MIN, AVG, etc.
Perhaps window functions can help you too.

SQL - Get specific row without a full table scan

I'm using Postgresql (cockroachdb) and I want to select a specific row. For example, there are thousands of records and I want to select row number 999.
In this case we would use LIMIT and OFFSET, SELECT * FROM table LIMIT 1 OFFSET 998;
However, using LIMIT and OFFSET can cause performance issues according to this post. So I'm wondering if there is a way to get a specific row without a full table scan.
I feel like it is possible because the database seems to sort data by primary key: when I do SELECT * FROM table; it always shows a sorted result. Since it is sorted by primary key, the database can use the index to access a specific row, right?

If you select rows based on the primary key (e.g. SELECT * FROM table WHERE <primary key> = <value>), no full table scan is needed under the hood. The same is also true if you define a secondary index on the table and apply a WHERE clause that filters based on the column(s) in the secondary index.
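As a rough illustration, here is the same idea run against SQLite through Python's sqlite3 module (a stand-in, not CockroachDB; the table items and its contents are hypothetical). The primary-key lookup is reported by the planner as a SEARCH on the key, not a scan. The key value matches the row position here only because the ids are a gapless 1..N sequence; in general a key value is not a row number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"row {i}") for i in range(1, 2001)])


def plan(sql, params=()):
    """Return the planner's description (the 'detail' column) for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " ".join(row[3] for row in rows)


# Direct lookup by key: the engine seeks straight to the row.
by_key = conn.execute(
    "SELECT payload FROM items WHERE id = ?", (999,)).fetchone()[0]

# Positional lookup: the engine has to step over the first 998 rows.
by_offset = conn.execute(
    "SELECT payload FROM items ORDER BY id LIMIT 1 OFFSET 998").fetchone()[0]

key_plan = plan("SELECT payload FROM items WHERE id = ?", (999,))
print(by_key, by_offset)
print(key_plan)
```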

iSeries query changes selected RRN of subquery result rows

I'm trying to make an optimal SQL query for an iSeries database table that can contain millions of rows (perhaps up to 3 million per month). The only key I have for each row is its RRN (relative record number, which is the physical record number for the row).
My goal is to join the table with another small table to give me a textual description of one of the numeric columns. However, the number of rows involved can exceed 2 million, which typically causes the query to fail due to an out-of-memory condition. So I want to rewrite the query to avoid joining a large subset with any other table. So the idea is to select a single page (up to 30 rows) within a given month, and then join that subset to the second table.
However, I ran into a weird problem. I use the following query to retrieve the RRNs of the rows I want for the page:
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
This query works just fine, returning the correct RRNs for the rows I need. However, when I attempted to join the result of the subquery with another table, the RRNs changed. So I simplified the query to a subquery within a simple outer query, without any join:
select rrn(e) as RRN, e.*
from TABLE1 as e
where rrn(e) in (
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
)
order by e.UPDATED, e.ACCOUNT
The outer query simply grabs all of the columns of each row selected by the subquery, using the RRN as the row key. But this query does not work - it returns rows with completely different RRNs.
I need the actual RRN, because it will be used to retrieve more detailed information from the table in a subsequent query.
Any ideas about why the RRNs end up different?
Resolution
I decided to break the query into two calls, one to issue the simple subquery and return just the RRNs (row IDs), and the second to do the rest of the JOINs and so forth to retrieve the complete info for each row. (Since the table gets updated only once a day, and rows never get deleted, there are no potential timing problems to worry about.)
This approach appears to work quite well.
Addendum
As to the question of why an out-of-memory error occurs, this appears to be a limitation on only some of our test servers. Some can only handle up to around 2m rows, while others can handle much more than that. So I'm guessing that this is some sort of limit imposed by the admins on a server-by-server basis.

Trying to use RRN as a primary key is asking for trouble.
I find it hard to believe there isn't a key available.
Granted, there may be no explicit primary key defined in the table itself. But is there a unique key defined in the table?
It's possible there are no keys defined in the table itself (a practice that is 20 years out of date), but in that case there's usually a logical file with a unique key defined that is used by the application as the de facto primary key to the table.
Try looking for related objects via green screen (DSPDBR) or GUI (via "Show related"). Keyed logical files show in the GUI as views. So you'd need to look at the properties to determine if they are uniquely keyed DDS logicals instead of non-keyed SQL views.
A few times I've run into tables with no existing de-facto primary key. Usually, it was possible to figure out what could be defined as one from the existing columns.
When there truly is no PK, I simply add one. Usually a generated identity column. There's a technique you can use to easily add columns without having to recompile or test any heritage RPG/COBOL programs. (and note LVLCHK(*NO) is NOT it!)
The technique is laid out in Chapter 4 of the modernizing Redbook
http://www.redbooks.ibm.com/abstracts/sg246393.html
1) Move the data to a new PF (or SQL table)
2) create new LF using the name of the existing PF
3) repoint existing LF to new PF (or SQL table)
Done properly, the record format identifiers of the existing objects don't change and thus you don't have to recompile any RPG/COBOL programs.

I find it hard to believe that querying a table of a mere 3 million rows, even when joined with something else, should cause an out-of-memory condition, so in my view you should address this issue first (or cause it to be addressed).
As for your question of why the RRNs end up different I'll take the liberty of quoting the manual:
If the argument identifies a view, common table expression, or nested table expression derived from more than one base table, the function returns the relative record number of the first table in the outer subselect of the view, common table expression, or nested table expression.
A construct of the type ...where something in (select somethingelse...) typically translates into a join, so there.

Unless you can specifically control it, e.g., via ALWCPYDTA(*NO) for STRSQL, SQL may make copies of result rows for any intermediate set of rows. The RRN() function always accesses physical record number, as contrasted with the ROW_NUMBER() function that returns a logical row number indicating the relative position in an ordered (or unordered) set of rows. If a copy is generated, there is no way to guarantee that RRN() will remain consistent.
Other considerations apply over time; but in this case it's as likely to be simple copying of intermediate result rows as anything.

In SQL Server, is TOP deterministic by default when used on a table with a clustered index?

So I was trying to explain to some people why this query is a bad idea:
SELECT z.ReportDate, z.Zipcode, SUM(z.Sales) AS Sales,
COALESCE(
(SELECT TOP (1) GroupName
FROM dbo.zipGroups
WHERE (Zipcode = z.Zipcode)), 'Unknown') AS GroupName,
COALESCE(
(SELECT TOP (1) GroupCode
FROM dbo.zipGroups
WHERE (Zipcode = z.Zipcode)), 0) AS GroupNumber
FROM dbo.Report_ByZipcode AS z
GROUP BY z.ReportDate, z.Zipcode
and suggesting a better way to write it, when my boss ended the discussion with, "Well, it's been returning the right data for the last year and we haven't had any problems with it, so it's fine."
At which point I thought to myself, how in the world is that even possible?
After some digging, I discovered these facts:
This query is supposed to group sales by Zipcode and date, and link those to the largest Group (by population size) that a Zipcode is assigned to by way of the zipGroups table.
Each Zipcode can be assigned to 0 to many Groups, and if a Zipcode is assigned to 0 Groups, it's simply not in the zipGroups table.
A Group is a geographical area, and the GroupNumbers are ranked by largest to smallest by population (for example, the group covering the NY-NJ-CT tri-state area is GroupNumber 1, and North Platte, Nebraska is GroupNumber 209).
The zipGroups table has not changed in at least 2 years.
The zipGroups table has a clustered index with Zipcode, GroupNumber (ascending) as the keys.
The combination of Zipcode, GroupNumber is unique in zipGroups.
So my question has 2 parts.
A) Even though there are no ORDER BY clauses in those SELECT TOP queries, are they actually deterministic because the clustered index is basically providing it a default ORDER BY?
B1) If that is true, is the query, however precariously, actually doing what it's supposed to do?
B2) If that is not true, can you help me prove it?
Note: I've already re-written this to use joins, so I don't need the SQL to fix it, I need to get it into production so I stop worrying about it breaking.

SQL Server makes no guarantees about the ordering of records in the absence of ORDER BY. It might yield the correct results 999,999 times and then fail on the millionth try. Don't do it.

Always use an ORDER BY with a TOP statement. The order is not guaranteed to be the order of the clustered index, as demonstrated in this blog post (complete with a query that disproves it):
Without ORDER BY, there is no default sort order.
Even if it did go by the clustered index, I wouldn't write queries that depend on undocumented behavior of the DB engine and it is better to be explicit for readability.
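A minimal sketch of the explicit form, using Python's sqlite3 module as a stand-in for SQL Server, so LIMIT 1 plays the role of TOP (1). The sample rows and the helper largest_group are hypothetical. In T-SQL the equivalent subquery would add ORDER BY GroupNumber after the WHERE clause of the SELECT TOP (1).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE zipGroups ("
    "Zipcode TEXT NOT NULL, "
    "GroupNumber INTEGER NOT NULL, "
    "GroupName TEXT NOT NULL, "
    "PRIMARY KEY (Zipcode, GroupNumber))")
# Hypothetical sample data; a lower GroupNumber means a larger population.
conn.executemany("INSERT INTO zipGroups VALUES (?, ?, ?)", [
    ("10001", 57, "Smaller group"),
    ("10001", 1, "NY-NJ-CT"),
    ("69101", 209, "North Platte"),
])


def largest_group(zipcode):
    """Return the name of the largest group for a zipcode, or 'Unknown'."""
    row = conn.execute(
        "SELECT GroupName FROM zipGroups "
        "WHERE Zipcode = ? "
        "ORDER BY GroupNumber ASC "  # states the intent; no reliance on index order
        "LIMIT 1",
        (zipcode,)).fetchone()
    return row[0] if row else "Unknown"


print(largest_group("10001"))  # NY-NJ-CT
print(largest_group("00000"))  # Unknown
```

With the ORDER BY in place the result no longer depends on which index the optimizer happens to pick.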

If you're relying on a clustered index rather than the collation, then getting the right order is coincidental, not deterministic.
In the real world, indexes can be changed from one kind to another, for good reasons, bad reasons, or no reason at all. And, in the real world, you don't necessarily get to choose which index SQL Server will use in executing a query. (Or whether it will use an index at all.)
Technically, the collation can also be changed for good reasons, bad reasons, or no reason at all. But everybody knows changing the collation will change the sort order--that's its job, after all--so it's not a surprise. (Ever heard of "the principle of least surprise"?)

The link by JohnFx is good, although long and hard to follow. Here's a small snippet on its own that will show the data returning in non-clustered index order.
CREATE TABLE t1 (x INT NOT NULL PRIMARY KEY CLUSTERED, z INT NOT NULL UNIQUE);
INSERT INTO t1 (x,z) VALUES (1,4);
INSERT INTO t1 (x,z) VALUES (3,3);
INSERT INTO t1 (x,z) VALUES (2,2);
INSERT INTO t1 (x,z) VALUES (4,1);
SELECT x, z FROM t1;
Output (you should get)
x z
----------- -----------
4 1
2 2
3 3
1 4
The execution plan shows it using the Unique (or other) index instead of the clustered index.
Even if the clustered index is chosen, it may not sort correctly if the data is being merged from parallelism, if the TOP N count is high enough.
Having said that, since you are only using TOP(1) and if the table has only one index available, it can be considered deterministic since it will only use that index and pick the first entry in the index pages.

A) Even though there are no ORDER BY clauses in those SELECT TOP queries, are they actually deterministic because the clustered index is basically providing it a default ORDER BY?
B1) If that is true, is the query, however precariously, actually doing what it's supposed to do?
When top is specified without ordering, the ordering is a side effect of the method of access chosen by the query optimizer. Since the query optimizer would use the clustered index to resolve this query, you get a quite nice side effect.
I wouldn't use the word deterministic, as the query optimizer might not be deterministic. However, in the case where the optimizer chooses the clustered index, yes - the query does what it is supposed to do.
ORDER should still be specified, so as to lock the correctness into the query. One should separate correctness ("What do you want") and implementation ("How do you get it") into query and optimizer plan, respectively.
B2) If that is not true, can you help me prove it?
Assuming there are more columns in the ZipGroups table, a Nonclustered index containing the only two relevant columns could be added that would be preferred over the clustered index. If the nonclustered index had a different ordering (Zipcode asc, GroupNumber desc), then the query would break.

What is the most efficient way to count rows in a table in SQLite?

I've always just used "SELECT COUNT(1) FROM X" but perhaps this is not the most efficient. Any thoughts? Other options include SELECT COUNT(*) or perhaps getting the last inserted id if it is auto-incremented (and never deleted).
How about if I just want to know if there is anything in the table at all? (e.g., count > 0?)

The best way is to make sure that you run SELECT COUNT on a single column (SELECT COUNT(*) is slower) - but SELECT COUNT will always be the fastest way to get a count of things (the database optimizes the query internally).
If you check out the comments below, you can see arguments for why SELECT COUNT(1) is probably your best option.

To follow up on girasquid's answer, as a data point, I have a sqlite table with 2.3 million rows. Using select count(*) from table, it took over 3 seconds to count the rows. I also tried using SELECT rowid FROM table (thinking that rowid is a default primary indexed key), but that was no faster. Then I made an index on one of the fields in the database (just an arbitrary field, but I chose an integer field because I knew from past experience that indexes on short fields can be very fast, I think because a copy of the value is stored in the index itself). SELECT my_short_field FROM table brought the time down to less than a second.

If you are sure (really sure) that you've never deleted any row from that table and your table has not been defined with the WITHOUT ROWID optimization you can have the number of rows by calling:
select max(RowId) from table;
Or if your table is a circular queue you could use something like
select MaxRowId - MinRowId + 1 from
(select max(RowId) as MaxRowId from table) JOIN
(select min(RowId) as MinRowId from table);
This is really really fast (milliseconds), but you must pay attention because sqlite says that row id is unique among all rows in the same table. SQLite does not declare that the row ids are and will be always consecutive numbers.
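The caveat is easy to reproduce. The sketch below (Python's sqlite3 module; the table events is hypothetical) deletes one row and shows that max(rowid) then overstates the real row count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [(f"e{i}",) for i in range(5)])  # rowids 1..5

# A single delete leaves a gap in the rowid sequence.
conn.execute("DELETE FROM events WHERE rowid = 2")

exact = conn.execute("SELECT count(*) FROM events").fetchone()[0]
estimate = conn.execute("SELECT max(rowid) FROM events").fetchone()[0]

print(exact, estimate)  # 4 5
```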

The fastest way to get row counts is directly from the table metadata, if any. Unfortunately, I can't find a reference for this kind of data being available in SQLite.
Failing that, any query of the type
SELECT COUNT(non-NULL constant value) FROM table
should optimize to avoid the need for a table, or even an index, scan. Ideally the engine will simply return the current number of rows known to be in the table from internal metadata. Failing that, it simply needs to know the number of entries in the index of any non-NULL column (the primary key index being the first place to look).
As soon as you introduce a column into the SELECT COUNT you are asking the engine to perform at least an index scan and possibly a table scan, and that will be slower.
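Apart from speed, counting a column can also change the result, because COUNT(column) skips NULLs while COUNT(*) and COUNT(1) count every row. A small sketch with Python's sqlite3 module (the table x follows the question's naming; the note column and its rows are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id INTEGER PRIMARY KEY, note TEXT)")
conn.executemany("INSERT INTO x VALUES (?, ?)",
                 [(1, "a"), (2, None), (3, "c")])

count_star = conn.execute("SELECT COUNT(*) FROM x").fetchone()[0]
count_one = conn.execute("SELECT COUNT(1) FROM x").fetchone()[0]
# Counting a nullable column only counts rows where it is not NULL.
count_note = conn.execute("SELECT COUNT(note) FROM x").fetchone()[0]

print(count_star, count_one, count_note)  # 3 3 2
```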

I do not believe you will find a special method for this. However, you could do your select count on the primary key to be a little bit faster.

sp_spaceused 'table_name' (exclude single quote)
This will return the number of rows in the above table; it is the most efficient way I have come across yet.
It's more efficient than select Count(1) from 'table_name' (exclude single quote).
sp_spaceused (a SQL Server system procedure) can be used for any table. It's very helpful when the table is exceptionally big (hundreds of millions of rows): it returns the number of rows right away, whereas 'select Count(1)' might take more than 10 seconds. Moreover, it does not need any column names/key field to consider.
