Why is the first query faster than the second? - sql

Sorry, just clarifying my question. This extends the question Optimizing sqlite query.
I have a table:
CREATE TABLE IF NOT EXISTS [app_status](
[id] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL ,
[status] TEXT DEFAULT NULL,
[type] INTEGER
)
I have two indexes. One on status and another on type. Which query will run faster and why?
SELECT COALESCE(min(type), 0)
FROM app_status
WHERE status IS NOT NULL
AND type IN (1,2) limit 1
QUERY PLAN output:
0|0|0|SEARCH TABLE app_status USING INDEX idx_type (mailbox_type=?) (~10 rows)
0|0|0|EXECUTE LIST SUBQUERY 1
Or...
SELECT type FROM
app_status WHERE
status IS NOT NULL
ORDER BY type limit 1
QUERY PLAN output:
0|0|0|SCAN TABLE app_status USING INDEX idx_type (~500000 rows)

The first query returns the minimum type among the rows matching the criteria in the WHERE clause (where status is not null and type in (1,2)), or 0 if no row matches.
The second query finds all the rows matching the criteria in its WHERE clause (where status is not null), sorts them by type, and then returns zero or one row.
You should note that the two queries, while they may return identical results, are not guaranteed to. In particular, the second query returns the first row of the result set as ordered by type, regardless of what value that type is. If the lowest value for type where status is not null is, say, 157, that is the row you are going to get from the second query. The first query, in that case, will return 0: no row matches type IN (1,2), so min(type) is NULL and COALESCE turns it into 0.
But assuming type and status are indexed and the query can use one or more of the indexes, then my suspicion is the first query would be faster as it can seek directly to the desired row(s).
But much depends on the shape of the data (how much data is there? How is it distributed? etc.) and on whether or not the index is 'covering' (if the index doesn't cover all the columns in the query, then it must do additional I/O to get the data page(s) required to cover all the columns).
Edited to note: looking at the execution plans you posted (not knowing SQLite), the first plan says it should return about 10 rows; the second about 500,000 rows. Which do you think might be faster?

You should:
CREATE INDEX idx_app_status ON app_status (status, type)
In this way the database engine will not have to look at all of the rows; it can find the exact rows it needs via the WHERE clause. I don't know which query is faster, because they aren't returning the same result set, but with the index above all of these kinds of queries will be fast. The other two indexes could be dropped.
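A minimal sketch, assuming the SQLite shell and the schema above, for checking which index each query actually picks up after adding the composite index (the plan output format varies by SQLite version):
CREATE INDEX idx_app_status ON app_status (status, type);
EXPLAIN QUERY PLAN
SELECT COALESCE(min(type), 0) FROM app_status
WHERE status IS NOT NULL AND type IN (1, 2);
EXPLAIN QUERY PLAN
SELECT type FROM app_status
WHERE status IS NOT NULL ORDER BY type LIMIT 1;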

Related

Performance impact of view on aggregate function vs result set limiting

The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
id BIGSERIAL PRIMARY KEY,
joincol VARCHAR
);
CREATE TABLE test2 (
joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
SELECT test1.id,
test1.joincol AS t1charcol,
test2.joincol AS t2charcol
FROM test1, test2
WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100ms. As far as I understand the execution plan, the runtime is independent of the rowcount, since Postgres iterates the rows one by one (starting at the highest id, using the index) until a join on a row is possible and immediately returns.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on rowcount), since the two tables are "joined completely", before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
In my real environment, test1 contains only a handful of rows (< 100), having unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on row basis for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
why does Postgres not recognize that it could use an Index Scan Backward on row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
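A hedged sketch of that index, assuming it goes on test1 (the table supplying id to the view); the index name is made up:
CREATE INDEX test1_id_desc_nulls_last_idx ON test1 (id DESC NULLS LAST);
-- the formal equivalent can now walk this index from the front
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;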
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test1.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be; that could get out of hand quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table directly, regardless of index sort order; an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON "test" ("id");
The PK on test.id is implemented with a unique index on the column, that already covers everything the additional index might do for you.
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far away from actual use case to be meaningful.
In the test setup, each table has 100k rows, there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL.
Your real case has < 100 rows in test1 and ~10M rows in test2, every value in test2.joincol has a match in test1.joincol, both are defined NOT NULL, and test1.joincol is unique. A classical one-to-many relationship. There should be a UNIQUE constraint on test1.joincol and a FK constraint test2.joincol --> test1.joincol.
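A hedged sketch of those constraints, assuming test1 is the small unique side and test2 the large referencing side as described under "My real environment" (the constraint names are made up):
ALTER TABLE test1 ALTER COLUMN joincol SET NOT NULL;
ALTER TABLE test2 ALTER COLUMN joincol SET NOT NULL;
ALTER TABLE test1 ADD CONSTRAINT test1_joincol_uni UNIQUE (joincol);
ALTER TABLE test2 ADD CONSTRAINT test2_joincol_fk
FOREIGN KEY (joincol) REFERENCES test1 (joincol);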
But that's currently all twisted in the question. Standing by till that's cleaned up.
This is a very good problem, and a good test case.
I tested it in Postgres 9.3; perhaps 13 can do it faster.
I used Occam's razor and excluded some possibilities:
The view (it is slow without the view too)
The JOIN filtering out rows (unfortunately not in your test, though with longer md5 values, 5-6 characters, it can)
Other basically equivalent SELECT statements (an inner query or EXISTS) don't solve your problem
I managed to make it use only the indexes, but because the tables aren't bigger than the indexes, that was not the solution.
I think
CREATE INDEX ON test1 (id);
is useless, because of the PK.
If you change this
CREATE INDEX ON test1 (joincol);
to this
CREATE INDEX ON test1 (joincol, id);
then the second query uses only indexes.
After you run
REINDEX TABLE test1;
REINDEX TABLE test2;
VACUUM ANALYZE test1;
VACUUM ANALYZE test2;
you can achieve some performance tuning, because you created the indexes before the inserts.
I think the reason is that the database has two aims.
The first aim is to optimize for just a few rows, so it runs a Nested Loop; you can force that with LIMIT x.
The second aim is to optimize for the whole table, i.e. run the query fast over all rows.
In this situation the Postgres optimizer didn't notice that a simple MAX can run with a NESTED LOOP. Or perhaps Postgres cannot push the limit into the aggregate clause (which can run over the whole sub-select that the query filters).
And this is very expensive. But it gives you the possibility to write other aggregates there, like SUM, MIN, AVG, etc.
Perhaps window functions could help you too.
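For illustration only, a hedged sketch of what a window-function variant might look like (no claim that the planner produces a faster plan for it):
SELECT id
FROM (
SELECT id, row_number() OVER (ORDER BY id DESC) AS rn
FROM testview
) sub
WHERE rn = 1;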

Index is not getting used

This is an excerpt from Tom Kyte's book.
"We’re using a SELECT COUNT(*) FROM T query (or something similar)
and we have a B*Tree index on table T. However, the optimizer is full
scanning the table, rather than counting the (much smaller) index
entries. In this case, the index is probably on a set of columns that
can contain Nulls. Since a totally Null index entry would never be
made, the count of rows in the index will not be the count of rows in
the table. Here the optimizer is doing the right thing—it would get
the wrong answer if it used the index to count rows."
As far as I know, indexes come into the picture when we use a WHERE clause. Why does the index come into play in the above scenario? Before countering him I wanted to know the facts.
"As far as i know indexes comes in picture when you used where clause. "
That's one use case for indexes, when we want quick access to rows identified by specific values of indexed column(s). But there are other uses.
Counting rows is one. To count the number of rows in a table Oracle actually has to count each row (because statistics may not be fresh enough), which means literally reading each block of storage and counting the rows in each block. Potentially that's a lot of reads.
However, an index on a NOT NULL column also has an entry for each row of the table. Indexes are much smaller than tables (typically only one column) so an Index block contains many more entries than a Table block. Consequently Oracle has to read far fewer Index blocks to get the count of rows than scanning the table would require. Reading fewer blocks is faster than reading more blocks.
This isn't true if the table only has indexes on nullable columns. Oracle doesn't index null values (unless the index is a composite index and at least one column is populated) so a count of the entries in an index couldn't guarantee to be the actual count of the table's rows.
Another common use case for reading indexes is to satisfy a SELECT statement where all the columns in a projection are in one index, and the index also services any WHERE conditions.
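A small hedged sketch of that covering-index case; employees, department_id and last_name are made-up names:
CREATE INDEX emp_dept_name_idx ON employees (department_id, last_name);
-- both the filter column and the projected column are in the index,
-- so the statement can be satisfied from the index without visiting the table
SELECT last_name FROM employees WHERE department_id = 10;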
Oracle Database does not store NULLs in the B-tree index, see the documentation
Oracle Database does not index table rows in which all key columns are
null, except for bitmap indexes or when the cluster key column value
is null.
Because of this, if the index has been created on a column that may contain null values, the database cannot use this index in a query like: SELECT COUNT(*) FROM T. Even when the column does not contain any NULLs, the optimizer doesn't know this unless the column has been marked as NOT NULL.
According to the documentation - FAST FULL INDEX SCAN
Fast Full Index Scan
A fast full index scan is a full index scan in
which the database accesses the data in the index itself without
accessing the table, and the database reads the index blocks in no
particular order.
Fast full index scans are an alternative to a full table scan when
both of the following conditions are met:
The index must contain all columns needed for the query.
A row containing all nulls must not appear in the query result set.
For this result to be guaranteed, at least one column in the index
must have either:
A NOT NULL constraint
A predicate applied to the column that prevents nulls from being
considered in the query result set
So if you know that the indexed column cannot contain NULL values, then mark this column as NOT NULL using ALTER TABLE table_name MODIFY column_name column_type NOT NULL; and the database will use that index in the query: SELECT COUNT(*) FROM T
If the column can have nulls, and cannot be marked as NOT NULL, then use the solution from Gordon Linoff's answer.
You can force the indexing of NULL values by including a constant in the index:
create index t_table_col on t(col, 0);
The 0 is a constant expression that is never NULL.
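Putting that together, a minimal sketch (t and col are the placeholder names used above):
CREATE INDEX t_table_col ON t(col, 0);
-- every row now has an index entry, even when col is NULL,
-- so the optimizer may count the index instead of full-scanning the table
SELECT COUNT(*) FROM t;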

Does a column of integers contain a value

What is the fastest way, performance-wise, to check whether an integer column contains a specific value?
I have a table with 10 million rows in PostgreSQL 8.4. I need to do at least 10,000 checks per second.
Currently I run the query SELECT id FROM table WHERE id = my_value and then check whether the DataReader has rows. But it is quite slow. Is there any way to speed this up without loading the whole column into memory?
You can select COUNT instead:
SELECT COUNT(*) FROM table WHERE id = my_value
It will return just one integer value - number of rows matching your select condition.
You need two things.
As Marcin pointed out, you want to use COUNT(*) if all you need is to know how many. You also need an index on that column. The index will have the answer pretty much right at hand. Without the index, PostgreSQL would still have to go through the entire table to count that one number.
CREATE INDEX id_idx ON table (id ASC NULLS LAST);
Something of the sort should get you there. Whether it is enough to run the query 10,000/sec. will depend on your hardware...
If you use WHERE id = X then all rows matching X will be returned. Suppose 1000 rows match X; then 1000 rows will be returned.
Now, if you only want to check whether the value is present at least once, then after you've matched the first row there is no need to process the other 999. Even if you count the values you are still going through all of them.
What I would do in this case is this:
SELECT 1 FROM table
WHERE id = my_value
LIMIT 1
Note that I'm not even returning the id itself. So if you get one record then the value is there.
Of course, in order to improve this query, make sure you have an index on the id column.
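A related hedged variant, not mentioned in the answers above: if you'd rather get a single boolean back than test for an empty result set, EXISTS expresses the same short-circuiting idea.
SELECT EXISTS (SELECT 1 FROM table WHERE id = my_value);  -- table and my_value are the placeholders from the question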

Explanation of sqlite_stat1 table

I'm trying to diagnose why a particular query is slow against SQLite. There seems to be plenty of information on how the query optimizer works, but scant information on how to actually diagnose issues.
In particular, when I analyze the database I get the expected sqlite_stat1 table, but I don't know what the stat column is telling me. An example row is:
MyTable,ix_id,25112 1 1 1 1
What does the "25112 1 1 1 1" actually mean?
As a wider question, does anyone have any good resources on the best tools and techniques for diagnosing SQLite query performance?
Thanks
from analyze.c:
/* Store the results.
**
** The result is a single row of the sqlite_stat1 table. The first
** two columns are the names of the table and index. The third column
** is a string composed of a list of integer statistics about the
** index. The first integer in the list is the total number of entries
** in the index. There is one additional integer in the list for each
** column of the table. This additional integer is a guess of how many
** rows of the table the index will select. If D is the count of distinct
** values and K is the total number of rows, then the integer is computed
** as:
**
** I = (K+D-1)/D
**
** If K==0 then no entry is made into the sqlite_stat1 table.
** If K>0 then it is always the case the D>0 so division by zero
** is never possible.
Remember that an index can be comprised of more than one column of a table. So, in the case of "25112 1 1 1 1", this would be described as a composite index that is made up of 4 columns of a table. The numbers mean as follows:
25112 is an estimate of the total number of rows in the index
The second integer (the first "1") is an estimate of the number of rows that have the same value in the 1st column on the index.
The third integer (the second "1") is an estimate of the number of rows that have the same value for the first TWO columns of the index. It is NOT the "distinctness" of column 2.
The fourth integer (the third "1") is an estimate of the number of rows that have the same values for the first THREE columns of the index.
The same logic applies to the last integer.
The last integer should always be one. Consider a table that has two rows and two columns with a composite index made up of column1+column2. The data in the table is:
Apple,Red
Apple,Green
The stats would look like "2 2 1". Meaning, there are 2 rows in the index. There are two rows that would be returned if only using column1 of the index (Apple and Apple). And 1 unique row that would be returned using column1+column2 (Apple+Red is unique from Apple+Green)
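A minimal sketch to reproduce that example in an SQLite shell (fruit is a made-up table name):
CREATE TABLE fruit (column1 TEXT, column2 TEXT);
CREATE INDEX fruit_idx ON fruit (column1, column2);
INSERT INTO fruit VALUES ('Apple', 'Red'), ('Apple', 'Green');
ANALYZE;
SELECT * FROM sqlite_stat1;  -- expect the stat column for fruit_idx to read "2 2 1"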
Also, I = (K+D-1)/D means: K is the total number of rows and D is the number of distinct values for each column prefix,
so if your table is created with CREATE TABLE TEST (C1 INT, C2 TEXT, C3 INT, C4 INT);
and you create an index like CREATE INDEX IDX ON TEST(C1, C2)
then you can manually INSERT into, or let SQLite automatically update, the sqlite_stat1 table as:
"TEST" --> table name, "IDX" --> index name, "10000 1000 1" --> the stat. Here 10000 is the total number of rows in table TEST, 1000 means that on average about 1000 rows share each value of C1, and 1 means each (C1, C2) combination is essentially unique. As you can see, the higher the number, the fewer distinct values the index sees for that column prefix.
You can run ANALYZE or manually update the table. (Better to do the first.)
So what are these values used for? SQLite uses these statistics to find the best index to use. For example, consider CREATE INDEX IDX2 ON TEST(C2) with the value "10000 1" in the stat1 table, and CREATE INDEX IDX1 ON TEST(C1) with the value "10000 100".
Suppose we don't have the index IDX we defined before. When you issue
SELECT * FROM TEST WHERE C1=? AND C2=?, SQLite will choose IDX2, not IDX1. Why? It's simple: IDX2 narrows the result down further than IDX1 does.
Clear?
Simply run EXPLAIN QUERY PLAN + your SQL statement. You will see whether the tables referred to in the statement use the index you want; if not, try to rewrite the SQL; if they do, check that it is the index you intended. For more info please refer to www.sqlite.org
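For example, against the TEST table above, with made-up literal values (the plan output format depends on your SQLite version):
EXPLAIN QUERY PLAN
SELECT * FROM TEST WHERE C1 = 5 AND C2 = 'x';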

Optimizing "ORDER BY" when the result set is very large and it can't be ordered by an index

How can I make an ORDER BY clause with a small LIMIT (ie 20 rows at a time) return quickly, when I can't use an index to satisfy the ordering of rows?
Let's say I would like to retrieve a certain number of titles from a table 'node' (simplified below). I'm using MySQL by the way.
node_ID INT(11) NOT NULL auto_increment,
node_title VARCHAR(127) NOT NULL,
node_lastupdated INT(11) NOT NULL,
node_created INT(11) NOT NULL
But I need to limit the rows returned to only those a particular user has access to. Many users have access to large numbers of nodes. I have this information pre-calculated in a big lookup table (an attempt to make things easier) where the primary key covers both columns and the presence of a row means that usergroup has access to that node:
viewpermission_nodeID INT(11) NOT NULL,
viewpermission_usergroupID INT(11) NOT NULL
My query therefore contains something like
FROM
node
INNER JOIN viewpermission ON
viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
... and I also use a GROUP BY or a DISTINCT so that a node is only returned once even if two of the user's 'usergroups' both have access to that node.
My problem is that there seems to be no way for an ORDER BY clause which sorts results by created or last updated date to use an index, because the rows being returned depend on values in the other viewpermission table.
Therefore MySQL would need to find all rows which match the criteria, then sort them all itself. If there are one million rows for a particular user, and we want to view, say, the latest 100 or rows 100-200 when ordered by last update, the DB would need to figure out which one million rows the user can see and sort that whole result set itself before it can return those 100 rows, right?
Is there any creative way to get around this? I've been thinking along the lines of:
Somehow add dates into the viewpermission lookup table so that I can build an index containing the dates as well as the permissions. It's a possibility I guess.
Edit: Simplified question
Perhaps I can simplify the question by rewriting it like this:
Is there any way to rewrite this query or create an index for the following such that an index can be used to do the ordering (not just to select the rows)?
SELECT nodeid
FROM lookup
WHERE
usergroup IN (2, 3)
GROUP BY
nodeid
An index on (usergroup) allows the WHERE part to be satisfied by an index, but the GROUP BY forces a temporary table and filesort on those rows. An index on (nodeid) does nothing for me, because the WHERE clause needs an index with usergroup as its first column. An index on (usergroup, nodeid) forces a temporary table and filesort because the GROUP BY is not the first column of the index that can vary.
Any solutions?
Can I answer my own question?
I believe I have found that the only way to do what I describe is for my lookup table to have rows for every possible combination of usergroups a person may want to be a member of.
To pick a simplified example, instead of doing this:
SELECT id FROM ids WHERE groups IN(1,2) ORDER BY id
If you need to use the index both to select rows and to order them, you have to abstract that IN(1,2) so that it is a constant rather than a range, i.e.:
SELECT id FROM ids WHERE grouplist='1,2' ORDER BY id
Of course instead of using the string '1,2' you could have a foreign key there, etc. The point being that you'd have to have a row not just for each group but for each combination of multiple groups.
So, there is my answer.
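A hedged sketch of what such a pre-expanded lookup could look like, using the ids table from the simplified example (the column types and key definition are my assumptions):
CREATE TABLE ids (
grouplist VARCHAR(64) NOT NULL,  -- a constant key such as '1,2' for each usergroup combination
id INT NOT NULL,
PRIMARY KEY (grouplist, id)
);
-- equality on grouplist plus ORDER BY id can both be satisfied by the primary key
SELECT id FROM ids WHERE grouplist = '1,2' ORDER BY id;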
Anyway, for my application, I feel that maintaining a lookup for all possible combinations of usergroups for each node is not worth it. For my purposes, I predict that most nodes are visible to most users, so I feel that it is acceptable simply to make the GROUP BY use the index, as the filtering doesn't need it so badly.
In other words, the approach I'll take for my original query may be something like:
SELECT
<fields>
FROM
node FORCE INDEX (node_created_and_node_ID)
INNER JOIN viewpermission ON
viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
GROUP BY
node_created, node_ID
GROUP BY can use an index if it starts at the leftmost column of the index and it is in the first non-const non-system table to be processed. The join then deals with the entire list (which is already ordered), and only those not visible to the current user (which will be a small proportion) are removed by the INNER JOIN.
Copy the value you are going to order by into the viewpermission table and add it to your index.
You could use a trigger to maintain that value from the other table.
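A hedged sketch of that idea, using the column names from the question (the copied column, the index name and the trigger name are made up):
ALTER TABLE viewpermission ADD COLUMN node_lastupdated INT(11) NOT NULL DEFAULT 0;
CREATE INDEX viewperm_group_updated
ON viewpermission (viewpermission_usergroupID, node_lastupdated);
-- keep the copy in sync when the source row changes
CREATE TRIGGER viewpermission_sync_lastupdated
AFTER UPDATE ON node
FOR EACH ROW
UPDATE viewpermission
SET node_lastupdated = NEW.node_lastupdated
WHERE viewpermission_nodeID = NEW.node_ID;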
select * from
(
select *
FROM node
INNER JOIN viewpermission
ON viewpermission_nodeID=node_ID
AND viewpermission_usergroupID IN (<...usergroups of current user...>)
) a
order by a.node_lastupdated desc
The inner query gives you the filtered subset, which I understand is substantially smaller than the whole set. Only this smaller set has to be sorted.
MySQL has problems when you use GROUP BY and ORDER BY in the same query. That causes a filesort, and that's probably the biggest penalty for performance.
You can eliminate the need for a DISTINCT (or GROUP BY) by using a non-correlated subquery instead of a JOIN.
SELECT * FROM node
WHERE node_id IN (
SELECT viewpermission_nodeID
FROM viewpermission
WHERE viewpermission_usergroupID IN ( <...usergroups...> )
)
ORDER BY node_lastupdated DESC
LIMIT 100;
There's no need to sort or do a DISTINCT on the subquery, since IN (1, 1, 2, 3) is the same as IN (1, 3, 2).
Note that MySQL can use only one index per table in a given query, so it'll try to make the best choice between an index on node_id and an index on node_lastupdated. It can't use both, and even if you made a compound index it wouldn't help in this case.
Remember to analyze different solutions with EXPLAIN.