Explanation of sqlite_stat1 table

I'm trying to diagnose why a particular query is slow against SQLite. There seems to be plenty of information on how the query optimizer works, but scant information on how to actually diagnose issues.
In particular, when I analyze the database I get the expected sqlite_stat1 table, but I don't know what the stat column is telling me. An example row is:
MyTable,ix_id,25112 1 1 1 1
What does the "25112 1 1 1 1" actually mean?
As a wider question, does anyone have any good resources on the best tools and techniques for diagnosing SQLite query performance?
Thanks

from analyze.c:
/* Store the results.
**
** The result is a single row of the sqlite_stat1 table. The first
** two columns are the names of the table and index. The third column
** is a string composed of a list of integer statistics about the
** index. The first integer in the list is the total number of entries
** in the index. There is one additional integer in the list for each
** column of the table. This additional integer is a guess of how many
** rows of the table the index will select. If D is the count of distinct
** values and K is the total number of rows, then the integer is computed
** as:
**
** I = (K+D-1)/D
**
** If K==0 then no entry is made into the sqlite_stat1 table.
** If K>0 then it is always the case that D>0 so division by zero
** is never possible.
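To make the formula concrete with the sample row above (my arithmetic, not from the source):

K = 25112, D = 25112 (all values distinct)  =>  I = (25112 + 25112 - 1) / 25112 = 1
K = 25112, D = 100                          =>  I = (25112 + 100 - 1) / 100 = 252

So the trailing 1s in "25112 1 1 1 1" say that each column prefix of that index is (nearly) unique.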

Remember that an index can be composed of more than one column of a table. So "25112 1 1 1 1" describes a composite index made up of 4 columns of a table. The numbers mean the following:
25112 is an estimate of the total number of rows in the index
The second integer (the first "1") is an estimate of the number of rows that have the same value in the 1st column of the index.
The third integer (the second "1") is an estimate of the number of rows that have the same value for the first TWO columns of the index. It is NOT the "distinctness" of column 2.
The fourth integer (the third "1") is an estimate of the number of rows that have the same values for the first THREE columns of the index.
The same logic applies to the last integer.
The last integer should always be one. Consider a table that has two rows and two columns, with a composite index made up of column1+column2. The data in the table is:
Apple,Red
Apple,Green
The stats would look like "2 2 1". Meaning: there are 2 rows in the index; 2 rows would be returned if only using column1 of the index (Apple and Apple); and 1 row would be returned using column1+column2 (Apple+Red is distinct from Apple+Green).
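A minimal sketch of that example in SQLite (the table and index names are mine):

CREATE TABLE fruit (column1 TEXT, column2 TEXT);
CREATE INDEX idx_fruit ON fruit (column1, column2);
INSERT INTO fruit VALUES ('Apple', 'Red'), ('Apple', 'Green');
ANALYZE;
SELECT * FROM sqlite_stat1;
-- expected: fruit|idx_fruit|2 2 1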

Also, I = (K+D-1)/D means: K is the total number of rows, and D is the number of distinct values for each column prefix of the index.
So if your table is created with CREATE TABLE TEST (C1 INT, C2 TEXT, C3 INT, C4 INT);
and you create an index with CREATE INDEX IDX ON TEST(C1, C2);
then you can INSERT a row manually, or let SQLite update the sqlite_stat1 table automatically, such as:
"TEST" (the table name), "IDX" (the index name), "10000 100 1" (the stat). Here, 10000 is the total number of rows in table TEST; 100 means that, on average, about 100 rows share any given C1 value; and 1 means each (C1, C2) combination is essentially unique. As a rule, the higher the value, the less selective that index prefix is.
You can run ANALYZE or update the table manually (the former is preferable).
So what are these values used for? SQLite uses the statistics to find the best index to use. Consider CREATE INDEX IDX2 ON TEST(C2), whose row in the stat1 table is "10000 1", and CREATE INDEX IDX1 ON TEST(C1), with value "10000 100".
Suppose we don't have the composite index IDX defined before. When you issue
SELECT * FROM TEST WHERE C1=? AND C2=?, SQLite will choose IDX2, not IDX1. Why? It's simple: IDX2 narrows the candidates down to about 1 row, while IDX1 only narrows them to about 100.
Clear?
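A sketch of that two-index scenario (schema as above; the stat values shown are illustrative assumptions, not real output):

CREATE TABLE TEST (C1 INT, C2 TEXT, C3 INT, C4 INT);
CREATE INDEX IDX1 ON TEST(C1);
CREATE INDEX IDX2 ON TEST(C2);
-- after loading data:
ANALYZE;
SELECT * FROM sqlite_stat1;   -- e.g. TEST|IDX1|10000 100 and TEST|IDX2|10000 1
-- the stats can also be written by hand:
-- INSERT INTO sqlite_stat1 VALUES ('TEST', 'IDX2', '10000 1');
-- (reopen the database afterwards so the planner reloads hand-edited stats;
--  running ANALYZE again would overwrite them)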

Simply run EXPLAIN QUERY PLAN + your SQL statement. You will see whether the tables referred to in the statement use the index you want. If not, try to rewrite the SQL; if they do, check that it is actually the index you want to use. For more info, please refer to www.sqlite.org
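For example, using the TEST schema from the previous answer:

EXPLAIN QUERY PLAN SELECT * FROM TEST WHERE C1 = ? AND C2 = ?;
-- the output names the index (if any) chosen for each table in the statement.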

Related

Is it advised to index the field if I envision retrieving all records corresponding to positive values in that field?

I have a table with definition somewhat like the following:
create table offset_table (
    id serial primary key,
    "offset" numeric NOT NULL, -- OFFSET is a reserved word in PostgreSQL, so the column must be quoted
    ... other fields...
);
The table has about 70 million rows in it.
I envision doing the following query many times
select * from offset_table where "offset" > 0;
For speed issues, I am wondering whether it would be advised to create an index like:
create index on offset_table("offset");
I am trying to avoid creation of unnecessary indices on this table as it is pretty big already.
As you mentioned in the comments, about 70% of the rows match the offset > 0 predicate.
In that case the index would not be beneficial, since PostgreSQL (and basically every other DBMS) would prefer a full table scan instead. That is because a sequential scan of the table is faster than constantly jumping between reading the index and reading table pages in random order.
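A quick way to confirm what the planner chooses (a sketch using the table from the question):

EXPLAIN SELECT * FROM offset_table WHERE "offset" > 0;
-- with ~70% of rows matching, expect a Seq Scan here; an index would only
-- pay off for a much more selective predicate.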

Finding the "next 25 rows" in Oracle SQL based on an indexed column

I have a large table (~200M rows) that is indexed on a numeric column, Z. There is also an index on the key column, K.
K Z
= ==========================================
1 0.6508784068583483336644518457703156855132
2 0.4078768075307567089075462518978907890789
3 0.5365440453204830852096396398565048002638
4 0.7573281573257782352853823856682368153782
What I need to be able to do is find the 25 records "surrounding" a given record. For instance, the "next" record starting at K=3 would be K=1, followed by K=4.
I have been led by several sources (most notably this paper from some folks at Florida State University) to believe that SQL like the following should work. It's not hard to imagine that scanning along the indexed column in ascending or descending order would be efficient.
select * from (
select *
from T
where Z >= [origin's Z value]
order by Z asc
) where rownum <= 25;
In theory, this should find the 25 "next" rows, and a similar variation would find the 25 "previous" rows. However, this can take minutes and the explain plan consistently contains a full table scan. A full table scan is simply too expensive for my purpose, but nothing I do seems to prompt the query optimizer to take advantage of the index (short, of course, of changing the ">=" above to an equals sign, which indicates that the index is present and operational). I have tried several hints to no avail (index, index_asc in several permutations).
Is what I am trying to do impossible? If I were trying to do this on a large data structure over which I had more control, I'd build a linked list on the indexed column's values and a tree to find the right entry point. Then traversing the list would be very inexpensive (yes I might have to run all over the disk to find the records I'm looking for, but I surely wouldn't have to scan the whole table).
I'll add in case it's important to my query that the database I'm using is running Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit.
I constructed a small test case with 10K rows. When I populated the table such that the Z values were already ordered, the exact query you gave tended to use the index. But when I populated it with random values, and refreshed the table statistics, it started doing full table scans, at least for some values of n larger than 25. So there is a tipping point at which the optimizer decides that the amount of work it will do to look up index entries then find the corresponding rows in the table is more than the amount of work to do a full scan. (It might be wrong in its estimate, of course, but that is what it has to go on.)
I noticed that you are using SELECT *, which means the query returns both columns. This means that the actual table rows must be accessed, since neither index includes both columns. This might push the optimizer towards preferring a full table scan for larger samples. If the query could be fulfilled from the index alone, it would be more likely to use the index.
One possibility is that you don't really need to return the values of K at all. If so, I'd suggest that you change both occurrences of SELECT * to SELECT z. In my test, this change caused a query that had been doing a full table scan to use an index scan instead (and not access the table itself at all).
If you do need to include K in the result, then you might try creating an index on (Z, K). This index could be used to satisfy the query without accessing the table.
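A sketch of that (Z, K) suggestion (the index name is mine; :origin_z stands for the origin's Z value):

create index t_z_k on T (Z, K);

select * from (
  select Z, K
  from T
  where Z >= :origin_z
  order by Z asc
) where rownum <= 25;
-- with (Z, K) indexed, the inner query can be answered by an index range
-- scan alone, without visiting the table.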

Does order of data in a column matter in SQL search?

Let's say I have a table named dict with the following data in which the column word is sorted alphabetically.
ID | word | definition
1  | a    | a's def
2  | b    | b's def
3  | c    | c's def
And then I run the query SELECT * FROM dict WHERE word = 'a';
Let's say I have a million rows in that table. Performance-wise, would my query run faster if the data in the word column is sorted alphabetically, or does SQL even care whether the data is sorted, so the speed is the same?
SQL tables represent unordered sets. So, there really isn't a concept of a table being in a particular order (at least until you learn what a clustered index is).
If you want your query to run faster, create an index on word:
create index idx_dict_word on dict(word)
This will look up the word in the index (very fast) and then fetch the right rows.
As for your question, you might start to get results faster if the word appears near the beginning of the table scan. However, the query has to go through the entire table, so the ordering does not matter with respect to the query completing.
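If you want to confirm the index is being used (EXPLAIN QUERY PLAN is SQLite's spelling; other engines have an equivalent EXPLAIN):

EXPLAIN QUERY PLAN SELECT * FROM dict WHERE word = 'a';
-- with the index in place, this should report a SEARCH using idx_dict_word
-- rather than a SCAN of the whole table.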

Does a column of integers contain a value

What is the fastest way, performance-wise, to check whether an integer column contains a specific value?
I have a table with 10 million rows in PostgreSQL 8.4. I need to do at least 10000 checks per second.
Currently I run the query SELECT id FROM table WHERE id = my_value and then check whether the DataReader has rows. But it is quite slow. Is there any way to speed this up without loading the whole column into memory?
You can select COUNT instead:
SELECT COUNT(*) FROM table WHERE id = my_value
It will return just one integer value - number of rows matching your select condition.
You need two things.
As Marcin pointed out, you want to use COUNT(*) if all you need to know is how many. You also need an index on that column. The index will have the answer pretty much right at hand. Without the index, PostgreSQL would still have to go through the entire table to count that one number.
CREATE INDEX id_idx ON table (id ASC NULLS LAST);
Something of the sort should get you there. Whether it is enough to run the query 10,000 times per second will depend on your hardware...
If you use WHERE id = X then all rows matching X will be returned. Suppose 1000 rows match X; then 1000 rows will be returned.
Now, if you only want to check whether the value appears at least once, then after the first match there is no need to process the other 999. Even if you COUNT the values, you are still going through all of them.
What I would do in this case is this:
SELECT 1 FROM table
WHERE id = my_value
LIMIT 1
Note that I'm not even returning the id itself. So if you get one record then the value is there.
Of course, in order to improve this query, make sure you have an index on the id column.
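A variant of the same idea, if you prefer a boolean result (my suggestion, using the placeholder names from the question):

SELECT EXISTS (SELECT 1 FROM table WHERE id = my_value);
-- returns true/false; the inner query stops at the first matching row.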

Why is the first query faster than the second?

Sorry, just clarifying my question. This extends the question Optimizing sqlite query
I have a table:
CREATE TABLE IF NOT EXISTS [app_status](
[id] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL ,
[status] TEXT DEFAULT NULL,
[type] INTEGER
)
I have two indexes. One on status and another on type. Which query will run faster and why?
SELECT COALESCE(min(type), 0)
FROM app_status
WHERE status IS NOT NULL
AND type IN (1,2) limit 1
Query plan output:
0|0|0|SEARCH TABLE app_status USING INDEX idx_type (mailbox_type=?) (~10 rows)
0|0|0|EXECUTE LIST SUBQUERY 1
Or...
SELECT type
FROM app_status
WHERE status IS NOT NULL
ORDER BY type LIMIT 1
Query plan output:
0|0|0|SCAN TABLE app_status USING INDEX idx_type (~500000 rows)
The first query considers only the rows matching the criteria in the WHERE clause (where status is not null and type in (1,2)) and, because of the aggregate with COALESCE, always returns exactly one row.
The second query finds all the rows matching the criteria in its WHERE clause (where status is not null), sorts them by type, and then returns zero or one row.
You should note that the two queries, while they may return identical results, are not guaranteed to. In particular, the second query returns the first row of the result set as ordered by type, regardless of what that value of type is. If the lowest value of type where status is not null is, say, 157, that is the row you are going to get. The first query, in that case, will return 0 (via the COALESCE).
But assuming type and status are indexed and the query can use one or more of the indexes, then my suspicion is the first query would be faster as it can seek directly to the desired row(s).
But much depends on the shape of the data (how much data is there? how is it distributed? etc.) and on whether or not the index is 'covering' (if the index doesn't cover all the columns in the query, the engine must do additional I/O to fetch the data page(s) needed to cover them).
edited to note Looking at the execution plans you posted (not knowing SQLite well), the first plan says it should return about 10 rows; the second about 500,000 rows. Which do you think might be faster?
You should:
CREATE INDEX idx_app_status ON app_status (status, type);
This way the database engine will not have to look at all of the rows; it can find exactly the rows it needs via the WHERE clause. I don't know which query is faster, because they don't return the same result set, but with the index above, all queries of this kind will be fast. The other two indexes could then be dropped.
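To verify the composite index is picked up, re-run the plan (a sketch; the exact plan text varies by SQLite version):

EXPLAIN QUERY PLAN
SELECT COALESCE(min(type), 0)
FROM app_status
WHERE status IS NOT NULL
  AND type IN (1,2);
-- after the new index, this should report a SEARCH using idx_app_status
-- rather than a SCAN of the whole table.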