Does order of data in a column matter in SQL search? - sql

Let's say I have a table named dict with the following data in which the column word is sorted alphabetically.
ID | word | definition
 1 | a    | a's def
 2 | b    | b's def
 3 | c    | c's def
And then I run a query SELECT * FROM dict WHERE word="a";
Let's say I have a million rows in that table. In terms of performance, would my query run faster if the data in the word column were sorted alphabetically, or does SQL even care whether the data are sorted, so the speed is the same?

SQL tables represent unordered sets. So, there really isn't a concept of a table being in a particular order (at least until you learn what a clustered index is).
If you want your query to run faster, create an index on word:
create index idx_dict_word on dict(word)
This will look up the word in the index (very fast) and then fetch the matching rows.
As for your question, you might start to get results faster if the word appears near the beginning of the table scan. However, the query has to go through the entire table, so the ordering does not matter with respect to the query completing.
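If you want to see the difference yourself, most databases let you inspect the execution plan. A minimal sketch, assuming a DBMS such as MySQL or PostgreSQL that supports EXPLAIN:
EXPLAIN SELECT * FROM dict WHERE word = 'a';
-- before the index: typically a full table scan
-- after create index idx_dict_word on dict(word): an index lookup on word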


Oracle multiple vs single column index

Imagine I have a table with the following columns:
Column: A (number(10)) (PK)
Column: B (number(10))
Column: C (number(10))
CREATE TABLE schema_name.table_name (
    column_a number(10) primary key,
    column_b number(10),
    column_c number(10)
);
Column A is my PK.
Imagine my application now has a flow that queries by B and C. Something like:
SELECT * FROM SCHEMA.TABLE WHERE B=30 AND C=99
If I create an index only using the Column B, this will already improve my query right?
The strategy behind this query would benefit from the index on column B?
Q1 - If so, why should I create an index with those two columns?
Q2 - If I decided to create an index with B and C, If I query selecting only B, would this one be affected by the index?
The simple answers to your questions.
For this query:
SELECT *
FROM SCHEMA.TABLE
WHERE B = 30 AND C = 99;
The optimal index is either (B, C) or (C, B). The column order does not matter here because both comparisons are equalities.
An index on just one of the columns can still be used, but all rows matching that column's value will have to be scanned to evaluate the condition on the other column.
If you have an index on (B, C), then this can be used for a query on WHERE B = 30. Oracle also implements a skip-scan optimization, so it is possible that the index could also be used for WHERE C = 99 -- but it probably would not be.
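A sketch using the column names from the question's CREATE TABLE (the index name is made up):
CREATE INDEX ix_table_b_c ON schema_name.table_name (column_b, column_c);

-- both predicates are resolved by the index:
SELECT * FROM schema_name.table_name WHERE column_b = 30 AND column_c = 99;

-- can also use the index, because column_b is its leading column:
SELECT * FROM schema_name.table_name WHERE column_b = 30;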
I think the documentation for MySQL has a good introduction to multi-column indexes. It doesn't cover the skip-scan but is otherwise quite applicable to Oracle.
Short answer: always check the real performance, not the theoretical one. That means my answer needs to be verified against a real database.
Inside an SQL database (Oracle, Postgres, MS SQL Server, etc.) the primary key is used for at least two purposes:
Ordering of rows (e.g. if the PK only ever increases, new values are simply appended)
Linking to a row. Any extra index contains the whole PK so that it can jump from that index to the actual rows.
If I create an index only using the Column B, this will already improve my query right?
The strategy behind this query would benefit from the index on column B?
It depends. If your table is small enough, Oracle may just do a full scan of it. For a large table Oracle can (and in the common scenario will) use the index on column B and then do a range scan. In that case Oracle checks all values with B = 30. So if only one row has B = 30 you get good performance; if millions of rows have it, Oracle will need to do millions of reads. Oracle gets this information from its statistics.
Q1 - If so, why should I create an index with those two columns?
It gives direct access to the row. In that case Oracle needs only a few jumps through the index to find your row. Moreover, you can apply the UNIQUE modifier to help Oracle; it will then know that no more than a single row can be returned.
However, if your table has other columns, the real execution plan will also include an access to the table via the PK (to retrieve the remaining columns).
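A sketch of that suggestion, reusing the names from the question's CREATE TABLE and assuming the (B, C) combination really is unique:
CREATE UNIQUE INDEX ux_table_b_c ON schema_name.table_name (column_b, column_c);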
If I decided to create an index with B and C, If I query selecting only B, would this one be affected by the index?
Yes. Please check the details here. If an index has several columns, Oracle sorts the entries according to the column order. E.g. if you create an index on columns B, C then Oracle will be able to use it for predicates like B = 30, i.e. when you restrict only B.
Well, it all depends.
If that table is tiny, you won't see any benefit regardless of any indexes you might create - it is just too small and Oracle returns data immediately.
If the table is huge, then it depends on the columns' selectivity. There's no guarantee that Oracle will ever use that index. If the optimizer decides (based on the information it has - don't forget to regularly collect statistics!) that the index should not be used, then you created it in vain (though you can choose to use a hint, but - unless you know what you're doing - don't).
How will you know what's going on? See the explain plan.
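For example, a minimal sketch of checking it in Oracle, using the table and column names from the question's CREATE TABLE:
EXPLAIN PLAN FOR
    SELECT * FROM schema_name.table_name WHERE column_b = 30 AND column_c = 99;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);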
But, generally speaking, yes - indexes help.
Q1 - If so, why should I create an index with those two columns?
Which "two columns"? A? If it is a primary key column, Oracle automatically creates an index, you don't have to do that.
Q2 - If I decided to create an index with B and C, If I query selecting only B, would this one be affected by the index?
If you are talking about a composite index (containing both the B and C columns), and if the query uses the B column, then yes - the index will (OK, might) be used. But if the query uses only column C, then this index will be completely useless.
In spite of this question being answered and one answer being accepted already, I'll just throw in some more information :-)
An index is an offer to the DBMS that it can use to access data quicker in some situations. Whether it actually uses the index is a decision made by the DBMS.
Oracle has a built-in optimizer that looks at the query and tries to find the best execution plan to get the results you are after.
Let's say that 90% of all rows have B = 30 AND C = 99. Why then should Oracle laboriously walk through the index only to end up accessing almost every row in the table anyway? So, even with an index on both columns, Oracle may decide not to use the index at all, and may even perform the query faster because of that decision.
Now to the questions:
If I create an index only using the Column B, this will already improve my query right?
It may. If Oracle thinks that B = 30 reduces the rows it will have to read from the table immensely, it will.
If so, why should I create an index with those two columns?
If the combination of B = 30 AND C = 99 limits the rows to read from the table further, it's a good idea to use this index instead.
If I decided to create an index with B and C, If I query selecting only B, would this one be affected by the index?
If the index is on (B, C), i.e. B first, then Oracle may find it useful, yes. In the extreme case that there are only the two columns in the table, that would even be a covering index (i.e. containing all columns accessed in the query) and the DBMS wouldn't have to read any table row, as all the information is already in the index itself. If the index is (C, B), i.e. C first, it is quite unlikely that the index would be used. In some edge-case situations, Oracle might do so, though.

Finding the "next 25 rows" in Oracle SQL based on an indexed column

I have a large table (~200M rows) that is indexed on a numeric column, Z. There is also an index on the key column, K.
K Z
= ==========================================
1 0.6508784068583483336644518457703156855132
2 0.4078768075307567089075462518978907890789
3 0.5365440453204830852096396398565048002638
4 0.7573281573257782352853823856682368153782
What I need to be able to do is find the 25 records "surrounding" a given record. For instance, the "next" record starting at K=3 would be K=1, followed by K=4.
I have been led by several sources (most notably this paper from some folks at Florida State University) to believe that SQL like the following should work. It's not hard to imagine that scanning along the indexed column in ascending or descending order would be efficient.
select * from (
    select *
    from T
    where Z >= [origin's Z value]
    order by Z asc
) where rownum <= 25;
In theory, this should find the 25 "next" rows, and a similar variation would find the 25 "previous" rows. However, this can take minutes and the explain plan consistently contains a full table scan. A full table scan is simply too expensive for my purpose, but nothing I do seems to prompt the query optimizer to take advantage of the index (short, of course, of changing the ">=" above to an equals sign, which indicates that the index is present and operational). I have tried several hints to no avail (index, index_asc in several permutations).
Is what I am trying to do impossible? If I were trying to do this on a large data structure over which I had more control, I'd build a linked list on the indexed column's values and a tree to find the right entry point. Then traversing the list would be very inexpensive (yes I might have to run all over the disk to find the records I'm looking for, but I surely wouldn't have to scan the whole table).
I'll add in case it's important to my query that the database I'm using is running Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit.
I constructed a small test case with 10K rows. When I populated the table such that the Z values were already ordered, the exact query you gave tended to use the index. But when I populated it with random values, and refreshed the table statistics, it started doing full table scans, at least for some values of n larger than 25. So there is a tipping point at which the optimizer decides that the amount of work it will do to look up index entries then find the corresponding rows in the table is more than the amount of work to do a full scan. (It might be wrong in its estimate, of course, but that is what it has to go on.)
I noticed that you are using SELECT *, which means the query is returning both columns. This means that the actual table rows must be accessed, since neither index includes both columns. This might push the optimizer towards preferring a full table scan for larger samples. If the query could be fulfilled from the index alone, it would be more likely to use the index.
One possibility is that you don't really need to return the values of K at all. If so, I'd suggest that you change both occurrences of SELECT * to SELECT z. In my test, this change caused a query that had been doing a full table scan to use an index scan instead (and not access the table itself at all).
If you do need to include K in the result, then you might try creating an index on (Z, K). This index could be used to satisfy the query without accessing the table.
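A sketch of that last suggestion, keeping the shape of the query from the question (the index name is made up; with the index on (Z, K) the inner query can be answered from the index alone, without touching the table):
create index ix_t_z_k on T (Z, K);

select * from (
    select K, Z
    from T
    where Z >= [origin's Z value]
    order by Z asc
) where rownum <= 25;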

SQL Server range indexing ideas

I need help understanding how to create proper indexing on a table for fast range selects.
I have a table with the following columns:
Column --- Type
frameidx --- int
u --- int
v --- int
x --- float(53)
y --- float(53)
z --- float(53)
None of these columns is unique.
There are to be approximately 30 million records in this table.
An average query would look something like this:
Select x, y, z from tablename
Where
    frameidx = 4 AND
    u between 34 AND 500 AND
    v between 0 AND 200
Pretty straightforward, no joins, no nested stuff. Just good ol' subset selection.
What sort of indexing should I do in MS SQL Server (2012) for this table in order to be able to fetch records (which can number in the thousands for this query) in, ideally, less than 100 ms, for example?
Thanks.
If you don't have indices, SQL Server needs to scan the whole table to find the required data. For such a big table (30M rows), that's time consuming.
If you have indices appropriate for your query, the SQL server will seek them (i.e. it will quickly find the required rows in the index, using the index structure). The index consists of the indexed column values, in the given index order, and pointers to the rows in the indexed table, so once the data is found in the index, the necessary data from the indexed table is recovered using those pointers.
So, if you want to speed things up, you need to create indexes on the columns which you're going to use to filter the ranges.
Adding indexes will improve the query response time, but will also take up more space, and make the insertions slower. So you shouldn't create a lot of indexes.
If you're going to use all the columns for filtering all the time, you should create only one index. And, ideally, that index should be the most selective one, i.e. the one that has the most distinct values (the fewest repeated values). Only one index can be used for each query.
If you're going to use different sets of range filters, you should create more indexes.
Using a composite index can be good or bad. In a composite index, the rows are ordered by all of the columns in the index. So, provided you index by A, B, C & D, filtering or ordering by A will give consecutive rows of the index, and it's a quick operation. Filtering by A, B, C & D is ideal for this index. However, filtering or ordering only by D is the worst case for this index, because it will need to recover data spread all over the index: remember that the data is ordered by A, then B, then C, then D, so the D info is spread all over the index. Depending on several factors (table stats, index selectivity, and so on), it's even possible that no index is used at all, and the table is scanned.
A final note on the clustered index: a clustered index defines the physical order in which the data is stored in the table. It doesn't need to be unique. If you're using one of the columns for filtering most of the times, it's a good idea to make that the table's clustered index, because, in this case, instead of seeking an index and finding the data in the indexed table using pointers, the table is sought directly, and that can improve performance.
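As one possible sketch for the query in the question (the index name is made up; the equality column comes first, the range columns after it, and INCLUDE adds x, y and z as non-key columns so the index covers the query and no lookups into the base table are needed):
CREATE NONCLUSTERED INDEX ix_tablename_frameidx_u_v
    ON tablename (frameidx, u, v)
    INCLUDE (x, y, z);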
So there is no simple answer, but I hope you now have the info to improve your query speed.
EDIT
Corrected info, according to a very interesting comment.

Selecting 'highest' X rows without sorting

I've got a table with a huge amount of data, let's say 10 GB of rows, containing a bunch of crap. I need to select, for example, the X rows (X is usually below 10) with the highest amount column.
Is there any way to do it without sorting the whole table? Sorting this amount of data is extremely time-expensive. I'd be OK with one scan through the whole table, selecting the X highest values and leaving the rest untouched. I'm using SQL Server.
Create an index on amount; then SQL Server can select the top 10 from that and do bookmark lookups to retrieve the missing columns.
SELECT TOP 10 Amount FROM myTable ORDER BY Amount DESC
If it is indexed, the query optimizer should use the index.
If not, I do not see how one could avoid scanning the whole thing...
Whether an index is useful or not depends on how often you do that search.
You could also consider putting that query into an indexed view. I think this will give you the best benefit/cost ratio.
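A sketch of such an index, reusing the names from the query above (a descending key lets SQL Server read the top rows straight off the index; selecting additional columns would add bookmark lookups, as mentioned in the other answer):
CREATE INDEX IX_myTable_Amount ON myTable (Amount DESC);

SELECT TOP 10 Amount FROM myTable ORDER BY Amount DESC;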

Explanation of sqlite_stat1 table

I'm trying to diagnose why a particular query is slow against SQLite. There seems to be plenty of information on how the query optimizer works, but scant information on how to actually diagnose issues.
In particular, when I analyze the database I get the expected sqlite_stat1 table, but I don't know what the stat column is telling me. An example row is:
MyTable,ix_id,25112 1 1 1 1
What does the "25112 1 1 1 1" actually mean?
As a wider question, does anyone have any good resources on the best tools and techniques for diagnosing SQLite query performance?
Thanks
from analyze.c:
/* Store the results.
**
** The result is a single row of the sqlite_stat1 table. The first
** two columns are the names of the table and index. The third column
** is a string composed of a list of integer statistics about the
** index. The first integer in the list is the total number of entries
** in the index. There is one additional integer in the list for each
** column of the table. This additional integer is a guess of how many
** rows of the table the index will select. If D is the count of distinct
** values and K is the total number of rows, then the integer is computed
** as:
**
** I = (K+D-1)/D
**
** If K==0 then no entry is made into the sqlite_stat1 table.
** If K>0 then it is always the case the D>0 so division by zero
** is never possible.
Remember that an index can comprise more than one column of a table. So, in the case of "25112 1 1 1 1", this describes a composite index made up of 4 columns of a table. The numbers mean the following:
25112 is an estimate of the total number of rows in the index
The second integer (the first "1") is an estimate of the number of rows that have the same value in the 1st column on the index.
The third integer (the second "1") is an estimate of the number of rows that have the same value for the first TWO columns of the index. It is NOT the "distinctness" of column 2.
The fourth integer (the third "1") is an estimate of the number of rows that have the same values for the first THREE columns of the index.
The same logic applies to the last integer.
The last integer should always be one. Consider a table that has two rows and two columns, with a composite index made up of column1+column2. The data in the table is:
Apple,Red
Apple,Green
The stats would look like "2 2 1". Meaning: there are 2 rows in the index; two rows would be returned if only using column1 of the index (Apple and Apple); and 1 unique row would be returned using column1+column2 (Apple+Red is distinct from Apple+Green).
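A quick way to reproduce that example in SQLite and look at the stats (the table and index names here are made up for illustration):
CREATE TABLE fruit (column1 TEXT, column2 TEXT);
CREATE INDEX ix_fruit ON fruit (column1, column2);
INSERT INTO fruit VALUES ('Apple', 'Red'), ('Apple', 'Green');
ANALYZE;
SELECT * FROM sqlite_stat1;  -- expect a row like: fruit|ix_fruit|2 2 1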
Also, I = (K+D-1)/D means: K is the total number of rows and D is the number of distinct values for a column.
So if your table is created with CREATE TABLE TEST (C1 INT, C2 TEXT, C3 INT, C4 INT);
and you create an index like CREATE INDEX IDX ON TEST(C1, C2),
then you can manually INSERT into the sqlite_stat1 table, or let sqlite update it automatically, with something like:
"TEST" --> table name, "IDX" --> index name, "10000 1 1000" --> the stat. Here 10000 is the total number of rows in table TEST; 1 means that for column C1 all values appear to be distinct (C1 sounds like an ID or similar); 1000 means C2 has far fewer distinct values. As you can see, the higher the number, the fewer distinct values the index has for that column.
You can run ANALYZE or manually update the table. (Better to do the first.)
So what are these values used for? sqlite uses these statistics to pick the best index. Consider CREATE INDEX IDX2 ON TEST(C2), whose value in the stat1 table is "10000 1", and CREATE INDEX IDX1 ON TEST(C1), with value "10000 100".
Suppose we don't have the index IDX which we defined before. When you issue
SELECT * FROM TEST WHERE C1=? AND C2=?, sqlite will choose IDX2, not IDX1. Why? It's simple: IDX2 narrows down the query results more than IDX1 does.
Clear?
Simply run EXPLAIN QUERY PLAN followed by your SQL statement. You will see whether the tables referenced in the statement use the index you want; if not, try to rewrite the SQL, and if yes, check whether it really is the index you intended. For more info please refer to www.sqlite.org
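A sketch of that, using the hypothetical TEST table and IDX index from above (the output line is only illustrative; the exact wording varies between SQLite versions):
EXPLAIN QUERY PLAN SELECT * FROM TEST WHERE C1 = 5 AND C2 = 'x';
-- typically something like: SEARCH TABLE TEST USING INDEX IDX (C1=? AND C2=?)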