Say I have a database that holds information about books and their publication dates (two attributes, bookName and publicationDate).
Say that the attribute publicationDate has a hash index.
If I wanted to display every book that was published in 2010, I would run this query: select bookName from Books where publicationDate = 2010.
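In case it helps, here is roughly what the setup could look like (PostgreSQL-style syntax as one example; the column types and index name are my guesses):

create table Books (
    bookName        varchar(200),
    publicationDate integer   -- the year, as used in the query above
);

create index books_pubdate_hash on Books using hash (publicationDate);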
In my lecture, it is explained that if there is a large volume of data and the publication dates are very diverse, the more efficient approach is to use the hash index to pick out only the books published in 2010.
However, if the vast majority of the books in the database were published in 2010, it is better, performance-wise, to scan the database sequentially.
I really don't understand why. In which situations is using an index the faster option, and why?
It is surprising that you are learning about hash indexes without understanding this concept. Hash indexing is a pretty advanced database feature; most databases don't even support hash indexes.
That said, the example is quite misleading: 2010 is not a DATE; it is a YEAR. This matters because a hash index only works on equality comparisons. So the natural way to get a year of data from a date column:
where publicationDate >= date '2010-01-01' and
publicationDate < date '2011-01-01'
could not use a hash index because the comparisons are not equality comparisons.
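A default b-tree index, on the other hand, can serve that range predicate. A minimal sketch, assuming publicationDate is actually stored as a date column (PostgreSQL-style syntax, index name invented):

-- The default index type is a b-tree, which handles range comparisons:
create index books_pubdate_btree on Books (publicationDate);

select bookName
from Books
where publicationDate >= date '2010-01-01'
  and publicationDate < date '2011-01-01';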
Indexes can be used for several purposes:
To quickly determine which rows match filtering conditions so fewer data pages need to be read.
To identify rows with common key values for aggregations.
To match rows between tables for joins.
To support unique constraints (via unique indexes).
And for b-tree indexes, to support order by.
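A few sketches of those cases (Postgres-flavoured SQL; the table and column names are invented for illustration):

-- 1. Filtering: read only the pages holding matching rows
select * from orders where customer_id = 42;

-- 2. Aggregation grouped on a key
select customer_id, count(*) from orders group by customer_id;

-- 3. Joining tables on indexed columns
select o.order_id, c.name
from orders o
join customers c on c.customer_id = o.customer_id;

-- 4. A unique constraint backed by a unique index
create unique index customers_email_uq on customers (email);

-- 5. order by served by a b-tree index on order_date
select * from orders order by order_date desc limit 10;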
Your query falls under the first purpose in the list above: reducing the number of data pages that need to be read. Reading a data page is non-trivial work, because it needs to be fetched from disk. A sequential scan reads all data pages, regardless of whether or not they are needed.
If only one row matches the index conditions, then only one page needs to be read. That is a big win on performance. However, if every page has a row that matches the condition, then you are reading all the pages anyway. The index seems less useful.
And using an index is not free. The index itself needs to be loaded into memory. The keys need to be hashed and processed during the lookup operation. All of this overhead is unnecessary if you just scan the pages (although there is other overhead for the key comparisons for filtering).
Using an index has a performance cost. If the percentage of matches is a small fraction of the whole table, this cost is more than made up for by not having to scan the whole table. But if there's a large percentage of matches, it's faster to simply read the table.
There is the cost of reading the index. A small, frequently used index might be in memory, but a large or infrequently used one might be on disk. That means slow disk access to search the index and get the matching row numbers. If the query matches a small number of rows this overhead is a win over searching the whole table. If the query matches a large number of rows, this overhead is a waste; you're going to have to read the whole table anyway.
Then there is an IO cost. With disks it's much, much faster to read and write sequentially than randomly. We're talking 10 to 100 times faster.
A spinning disk has a physical part, the head, which must move around to read different parts of the disk. The time it takes to move is known as "seek time". When you skip around between rows in a table, possibly out of order, this is random access and incurs seek time. In contrast, reading the whole table is likely to be one long continuous read; the head does not have to jump around, so there is no seek time.
SSDs are much, much faster since there are no physical parts to move, but they are still considerably faster for sequential access than for random access.
In addition, random access has more overhead between the operating system and the disk; it requires more instructions.
So if the database decides a query is going to match most of the rows of a table, it can decide that it's faster to read them sequentially and weed out the non-matches than to look up rows via the index using slower random access.
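You can usually watch the planner make exactly this choice with EXPLAIN. A rough sketch (PostgreSQL shown; the chosen plan depends entirely on your data and statistics):

-- A rare value: the planner will likely use the index.
explain select bookName from Books where publicationDate = 1897;

-- A very common value: the planner will likely pick a sequential scan
-- and filter out the non-matches as it goes.
explain select bookName from Books where publicationDate = 2010;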
Consider a bank of post office boxes, each numbered in a big grid. It's pretty fast to look up each box by number, but it's much faster to start at a box and open them in sequence. And we have an index of who owns which box and where they live.
You need to get the mail for South Northport. You look up in the index which boxes belong to someone from South Northport, see there's only a few of them, and grab the mail individually. That's an indexed query and random access. It's fast because there's only a few mailboxes to check.
Now I ask you to get the mail for everyone but South Northport. You could use the index in reverse: get the list of boxes for South Northport, subtract those from the list of every box, and then individually get the mail for each box. But this would be slow, random access. Instead, since you're going to have to open nearly every box anyway, it is faster to check every box in sequence and see if it's mail for South Northport.
More formally, the indexed vs table scan performance is something like this.
# Indexed query
C[index] + (C[random] * M)
# Full table scan
(C[sequential] + C[match]) * N
Where C are various constant costs (or near enough constant), M is the number of matching rows, and N is the number of rows in the table.
We know sequential reads are 10 to 100 times faster than random reads, so C[sequential] is far smaller than C[random]. And because disk access is so much slower than CPU or memory operations, C[match] (the cost of checking whether a row matches) will be small compared to C[sequential]. More formally...
C[random] >> C[sequential] >> C[match]
Using that, we can approximate C[sequential] + C[match] as just C[sequential].
# Indexed query
C[index] + (C[random] * M)
# Full table scan
C[sequential] * N
When M << N the indexed query wins. As M approaches N, the full table scan wins.
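To make that concrete with invented numbers: suppose C[sequential] = 1, C[random] = 20, C[index] = 1,000 and the table has N = 1,000,000 rows.

# Indexed query
1,000 + (20 * M)
# Full table scan
1 * 1,000,000 = 1,000,000

The break-even point is at roughly M ≈ 50,000 matching rows, or about 5% of the table: below that the index wins, above it the scan wins. The real crossover depends entirely on the constants, but the shape of the trade-off is the same.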
Note that the cost of using the index isn't really constant. C[index] is things like loading the index, looking up a key, and reading the row IDs. This can be quite variable depending on the size of the index, type of index, and whether it is on disk (cold) or in memory (hot). This is why the first few queries are often rather slow when you've first started a database server.
In the real world it's more complicated than that. In reality rows are broken up into data pages and databases have many tricks to optimize queries and disk access. But, generally, if you're matching most of the rows a full table scan will beat an indexed lookup.
Hash indexes are of limited use these days. A hash index is a simple key/value mapping and can only be used for equality checks. Most databases use a B-tree as their standard index. B-trees are a little more costly, but can handle a broader range of operations, including equality, ranges and comparisons, and prefix searches such as like 'foo%'.
The Postgres Index Types documentation is a pretty good high-level run-down of the advantages and disadvantages of the various index types.
There was a query in production which was running for several (5-6) hours. I looked into its execution plan and found that it was ignoring a parallel hint on a huge table. The reason: it was using TABLE ACCESS BY INDEX ROWID. So after I added a /*+ full(huge_table) */ hint before the parallel(huge_table) hint, the query started running in parallel, and it finished in less than 3 minutes. What I could not fathom was the reason for this HUGE difference in performance.
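For reference, the hinted statement looked roughly like this (the column list and predicates are elided; only the hints are as described above):

SELECT /*+ full(huge_table) parallel(huge_table) */ ...
  FROM huge_table
 WHERE ...;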
The following are the advantages of parallel FTS I can think of:
Parallel operations are inherently fast if you have more idle CPUs.
Parallel operations in 10g use direct I/O, which bypasses the buffer cache, which means there is no risk of "buffer busy waits" or any other contention for the buffer cache.
Sure, there are the above advantages, but then again the following disadvantages are still there:
Parallel operations still have to do I/O, and this I/O would be more than for TABLE ACCESS BY INDEX ROWID, as the entire table is scanned and it is costlier (all physical reads).
Parallel operations are not very scalable, which means that if there aren't enough free resources, it is going to be slow.
With the above knowledge at hand, I see only one reason that could have caused the poor performance for the query when it used ACCESS BY INDEX ROWID - some sort of contention like "buffer busy waits". But that doesn't show up in the AWR top 5 wait events; the top two events were "db file sequential read" and "db file scattered read". Is there something else I have missed? Please enlighten me.
First, without knowing anything about your data volumes, statistics, the selectivity of your predicates, etc., I would guess that the major benefit you're seeing is from doing a table scan rather than trying to use an index. Indexes are not necessarily fast and table scans are not necessarily slow. If you are using a rowid from an index to access a row, Oracle is limited to doing single block reads (sequential reads in Oracle terms), and it may have to read the same block many times if the block has many rows of interest. A full table scan, on the other hand, can do nice, efficient multiblock reads (scattered reads in Oracle terms). Sure, an individual single block read is going to be more efficient than a single multiblock read, but the multiblock read is much more efficient per byte read. Additionally, if you're using an index, you've potentially got to read a number of blocks from the index periodically to find out the next rowid to read from the table.
You don't actually need to read all that much data from the table before a table scan is more efficient than an index. Depending on a host of other factors, the tipping point is probably in the 10-20% range (that's a very, very rough guess). Imagine that you had to get a bunch of names from the phone book and that the phone book had an index that included the information you're filtering on and the page that the entry is on. You could use an index to find the name of a single person you want to look at, flip to the indicated page, record the information, flip back to the index, find the next name, flip back, etc. Or you could simply start at the first name, scan until you find a name of interest, record the information, and continue the scan. It doesn't take too long before you're better off ignoring the index and just reading from the table.
Adding parallelism doesn't reduce the amount of work your query does (in fact, adding in parallel query coordination means that you're doing more work). It's just that you're doing that work over a shorter period of elapsed time by using more of the server's available resources. If you're running the query with 6 parallel slaves, that could certainly allow the query to run 5 times faster overall (parallel query obviously scales a bit less than linearly because of overheads). If that's the case, you'd expect that doing a table scan made the query 20 times faster and adding parallelism added another factor of 5 to get your 100x improvement.
Suppose the data distribution does not change and only the dataset grows. For the same query, will the time taken grow proportionally? And if the data distribution does not change, can the query plan change, at least in theory?
Yes, the query plan may still change even if the data is completely static, though it probably won't.
The autovacuum daemon will ANALYZE your tables and generate new statistics. This usually happens only when they've changed, but may happen for other reasons (wrap-around prevention vacuum, etc).
The statistics include a random sampling to collect common values for a histogram. Being random, the outcome may be somewhat different each time.
To reduce the chances of plans shifting for a static dataset, you probably want to increase the statistics target on the table's columns and re-ANALYZE. Don't set it too high though, as the query planner has to read those histograms when it makes planning decisions, and bigger histograms mean slightly more planning time.
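A minimal sketch of that, assuming a table called books with a column publication_date (the target of 1000 is just an example value; the default is typically 100):

alter table books alter column publication_date set statistics 1000;
analyze books;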
If your table is growing continuously but the distribution isn't changing then you want the planner to change plans at various points. A 1000-row table is almost certainly best accessed by doing a sequential scan; an index scan would be a waste of time and effort. You certainly don't want a million row table being scanned sequentially unless you're retrieving a majority of the rows, though. So the planner should - and does - adjust its decisions based not only on the data distribution, but the overall row counts.
Here is an example. Suppose all of a table's records fit on one data page, and there is an index. Consider the query:
select t.*
from some_table t
where col = x;
And assume you have an index on col. With all of the data on a single page, the fastest approach is simply to read that page and check the where clause against each record. The page could hold 200 records, so the selectivity of the query might be less than 1%, and the index still would not help.
One of the key considerations that a SQL optimizer makes in choosing an algorithm is the number of expected page reads. So, if you have a query like the above, the engine might think "I have to read all pages in the table anyway, so let me just do a full table scan and ignore the index." Note that this will be true when the data is on a single page.
This generalizes to other operations as well. If all the records in your data fit on one data page, then "slow" algorithms are often the best or close enough to the best. So, nested loop joins might be better than using indexes, hash-based, or sort-merge based joins. Similarly, a sort-based aggregation might be better than other methods.
Alas, I am not as familiar with the Postgres query optimizer as I am with SQL Server and Oracle. I have definitely encountered changes in execution plans in those databases as data grew.
I have a table which uses an auto-increment field (ID) as its primary key. The table is append-only and no rows will be deleted. The table has been designed to have a constant row size.
Hence, I expected to have O(1) access time using any value of ID, since it is easy to compute the exact position to seek to in the file (ID * row_size). Unfortunately, that is not the case.
I'm using SQL Server.
Is it even possible?
Thanks
Hence, I expected to have O(1) access time using any value of ID, since it is easy to compute the exact position to seek to in the file (ID * row_size).
Ah. No. Autoincrement does not - even without deletions - guarantee no holes. Holes = seek via index. Ergo: your assumption is wrong.
I guess the thing that matters to you is the performance.
Databases use indexes to access records which are written on the disk.
Usually this is done with B+ tree indexes, which give O(log_b n) lookups, where the branching factor b of the internal nodes is typically between 100 and 200 (optimized to the block size, see ref).
This is still, strictly speaking, logarithmic performance, but given a decent number of records, say a few million, the leaf nodes can be reached in 3 to 4 steps. That, together with all the overhead for query planning, session initiation, locking, etc. (which you would have anyway in any multi-user, ACID-compliant data management system), is for all practical purposes comparable to constant time.
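To put rough numbers on that (b and n chosen purely as an illustration):

log_150(5,000,000) = ln(5,000,000) / ln(150) ≈ 15.4 / 5.0 ≈ 3.1

so even a table with several million rows needs only 3 or 4 index levels per lookup.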
The good news is that an indexed read is O(log(n)), which for large values of n gets pretty close to O(1). That said, in this context O notation is not very useful, and actual timings are far more meaningful.
Even if it were possible to address rows directly, your query would still have to go through the client and server protocol stacks and carry out various lookups and memory allocations before it could give the result you want. It seems like you are expecting something that isn't even practical. What is the real problem here? Is SQL Server not fast enough for you? If so there are many options you can use to improve performance but directly seeking an address in a file is not one of them.
Not possible. SQL Server organizes data into a tree-like structure based on key and index values; an "index" in the DB sense is more like a reference book's index and not like an indexed data structure like an array or list. At best, you can get logarithmic performance when searching on an indexed value (PKs are generally treated as an index). Worst-case is a table scan for a non-indexed column, which is linear. Until the database gets very large, the seek time of a well-designed query against a well-designed table will pale in comparison to the time required to send it over the network or even a named pipe.
So, it seems to me like a query on a table with 10k records and a query on a table with 10mil records are almost equally fast if they are both fetching roughly the same number of records and making good use of simple indexes (auto-increment, record-id-type indexed fields).
My question is, will this extend to a table with close to 4 billion records if it is indexed properly and the database is set up in such a way that queries always use those indexes effectively?
Also, I know that inserting new records into a very large indexed table can be very slow, because all the indexes have to be recalculated. If I add new records only to the end of the table, can I avoid that slowdown, or will that not work because the index is a binary tree and a large chunk of the tree will still have to be recalculated?
Finally, I looked around a bit for a FAQs/caveats about working with very large tables, but couldn't really find one, so if anyone knows of something like that, that link would be appreciated.
Here is some good reading about large tables and the effects of indexing on them, including cost/benefit, as you requested:
http://www.dba-oracle.com/t_indexing_power.htm
Indexing very large tables (as with anything database-related) depends on many factors, including your access patterns, the ratio of reads to writes and the size of available RAM.
If you can fit your 'hot' set (i.e. the frequently accessed index pages) into memory, then accesses will generally be fast.
The strategy used to index very large tables is to use partitioned tables and partitioned indexes. BUT if your query does not join or filter on the partition key, then there will be no improvement in performance over an unpartitioned table, i.e. no partition elimination.
SQL Server Database Partitioning Myths and Truths
Oracle Partitioned Tables and Indexes
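As one concrete illustration (PostgreSQL's declarative range partitioning; table and column names invented - SQL Server and Oracle have their own syntax, covered in the links above):

create table events (
    event_id  bigint,
    logged_at date,
    payload   text
) partition by range (logged_at);

create table events_2010 partition of events
    for values from ('2010-01-01') to ('2011-01-01');

-- A query that filters on the partition key can skip the other partitions
-- (partition elimination / pruning):
select count(*)
from events
where logged_at >= date '2010-06-01'
  and logged_at <  date '2010-07-01';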
It's very important to keep your indexes as narrow as possible.
Kimberly Tripp's The Clustered Index Debate Continues...(SQL Server)
Accessing the data via a unique index lookup will slow down as the table gets very large, but not by much. The index is stored as a B-tree structure in Postgres (not a binary tree, which has only two children per node), so a 10k-row table might have 2 levels whereas a 10B-row table might have 4 levels (depending on the width of the rows). So as the table gets ridiculously large it might go to 5 levels or higher, but this only means one extra page read, so it is probably not noticeable.
When you insert new rows, you can't control where they are inserted in the physical layout of the table, so I assume you mean "end of the table" in terms of using the maximum value being indexed. I know Oracle has some optimisations around leaf block splitting in this case, but I don't know about Postgres.
If it is indexed properly, insert performance may be impacted more than select performance. Indexes in PostgreSQL have a vast number of options, which allow you to index part of a table or the output of an immutable function on tuples in the table. Also, the size of the index (assuming it is usable) will affect speed far more slowly than the cost of actually scanning the table will. The biggest difference is between searching a tree and scanning a list. Of course you still have disk I/O and memory overhead that goes into index usage, and so large indexes don't perform as well as they theoretically could.
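For example (PostgreSQL; the table, column and index names are invented), a partial index and an expression index look like this:

-- Index only the rows you actually query (a partial index):
create index orders_open_idx on orders (created_at) where status = 'open';

-- Index the output of an immutable function (an expression index):
create index users_email_lower_idx on users (lower(email));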
Using SQLite, I've got a table with ~10 columns. There are ~25 million rows.
That table has an INDEX on 'sid, uid, area, type'.
I run a select like so:
SELECT sid from actions where uid=1234 and area=1 and type=2
That returns me 1571 results, and takes 4 minutes to complete.
Is that sane?
I'm far from an SQL expert, so hopefully someone can fill me in on what I'm missing. Why could this possibly take 4+ minutes with everything indexed?
Any recommended resources to learn about achieving high SQL performance? I feel like a lot of the Google results just give me opinions or anecdotes, I wouldn't mind a solid book.
Create uid+area+type index instead, or uid+area+type+sid
Since the index starts with the sid column, it must do a scan (start at the beginning, read to the end) of either the index or the table to find your data matching the other 3 columns. This means it has to read all 25 million rows to find the answer. Even if it's reading just the rows of the index rather than the table, that's a lot of work.
Imagine a phone book of the greater New York metropolitan area, organized by (with an 'index' on) Last Name, First Name.
You submit SELECT [Last Name] FROM NewYorkPhoneBook WHERE [First Name] = 'Thelma'
It has to read all 25 million entries to find all those Thelmas. Unless you either specify the last name and can then turn directly to the page where that last name first appears (a seek), or have an index organized by First Name (a seek on the index followed by a seek on the table, aka a "bookmark lookup"), there's no way around it.
The index you would create to make your query faster is on uid, area, type. You could include sid, though leave it out if sid is part of the primary key.
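Concretely, that would be something like this (the index names are arbitrary):

-- Covers the WHERE columns, with sid appended so the query can be
-- answered from the index alone:
CREATE INDEX idx_actions_uid_area_type_sid ON actions (uid, area, type, sid);

-- Or, if sid is already part of the primary key, the narrower version:
CREATE INDEX idx_actions_uid_area_type ON actions (uid, area, type);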
Note: Tables often do have multiple indexes. Just note that the more indexes, the slower the write performance. Unnecessary indexes can slow overall performance, sometimes radically so. Testing and eventually experience will help guide you in this. Also, reasoning it out as a real-world problem (like my phone book examples) can really help. If it wouldn't make sense with phone books (and separate phone book indexes) then it probably won't make sense in the database.
One more thing: even if you put an index on those columns, if your query is going to end up pulling a great percentage of the rows in the main table, it will still be cheaper to scan the table rather than do the bookmark lookup (seek the index then seek the table for each row found). The exact "tipping point" of whether to do a bookmark lookup with a seek, or to do a table scan isn't something I can tell you off the top of my head, but it is based on solid math.
The index is not really useful, as it starts with the wrong field... which means a table scan.
Looks like you have a normal computer there, not something built for databases. I run table scans over 650 million rows in about a minute on my lower-end db server, but that means reading about a gigabyte per second from the discs, which are a RAID 10 of 10k RPM discs. Just to say that databases love IO, to a degree you have never seen before. Basically, larger db servers have many discs to satisfy the IOPS (IO per second) requirement. I have seen a server with 190 discs.
So, you have two choices: beef up your IOPS capability (which means spending money), or set up indices that actually get used because they are "proper".
Proper means: an index is only useful if the fields it contains are used from left to right. Not necessarily in the same order... but if a field is skipped, there is a chance the SQL system decides it is not worth pursuing the index and instead goes for a table scan (as in your case).
When you create your new index on uid, area and type, you should also do a select distinct on each one to determine which has the fewest distinct entries, then create your index such that the columns with fewer distinct values appear earlier in the index definition.
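One way to check those cardinalities (plain SQL, works in SQLite; adjust the names to your table):

SELECT COUNT(DISTINCT uid)  AS distinct_uids,
       COUNT(DISTINCT area) AS distinct_areas,
       COUNT(DISTINCT type) AS distinct_types
FROM   actions;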