How does Postgres implement a sequential scan?

I understand that when the majority of a table is estimated to be required in the result set for a given query, a sequential scan may be preferred over using an index.
What I'm curious about is how postgres actually reads the pages into memory?
Does it organise them into some kind of ad-hoc in memory index whilst it reads them?
What if the table's too large to fit into memory?
Are there any high level papers on the topic?
(I've done some searching but results are full of blog posts explaining the basics of indexing, not the implementation details of a sequential scan. I expect it's not as straightforward as read into an array when evaluating a join condition over most of a table)

What I'm curious about is how postgres actually reads the pages into memory?
The engine reads the whole heap block by block, in no particular logical order, skipping rows marked as deleted. Hot blocks (those already present in the cache) are much faster to process.
Does it organise them into some kind of ad-hoc in memory index whilst it reads them?
No, a sequential scan avoids indexes and reads the heap directly using buffering and the cache.
What if the table's too large to fit into memory?
A sequential scan is pipelined. This means blocks are read as needed. The engine does not need to have the whole heap in memory before it starts processing it. It reads a few blocks, processes them, and discards them; then it repeats this until it has read all the blocks of the heap.
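The pipelining described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Postgres is implemented: the names scan_blocks and count_matching, the 8192-byte block size, and the byte-level "rows" are all assumptions made for the example.

```python
import io

BLOCK_SIZE = 8192  # assumed block size in bytes, for illustration only


def scan_blocks(stream, block_size=BLOCK_SIZE):
    """Yield one block at a time; earlier blocks are not retained."""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        yield block


def count_matching(stream, predicate, block_size=BLOCK_SIZE):
    """Process each block as it arrives, keeping only a running count."""
    matches = 0
    for block in scan_blocks(stream, block_size):
        # Each byte stands in for a "row" in this toy example.
        matches += sum(1 for byte in block if predicate(byte))
    return matches


# Usage: a 20,000-byte "heap" is scanned in 8,192-byte blocks.
heap = io.BytesIO(bytes(i % 256 for i in range(20000)))
total = count_matching(heap, lambda b: b == 0)
```

Because scan_blocks is a generator, only one block is held at a time, so the size of the "heap" does not limit what can be scanned.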
Are there any high level papers on the topic?
There should be; in any case, any good book on query optimization will describe this process in detail.
EDIT For Your Second Question:
What I guess I mean is if you're joining on some random column X, does it have to iterate through each possible row multiple times to find the correct row for each value in the other table, or does it do something more advanced than that?
Well, when you join a couple of tables (or more) the engine's query planner produces a plan that includes a "Nested Loop", a "Hash Join", or a "Merge Join" operator. There are more operators but these are the common ones.
For each row of the first table, the Nested Loop Join retrieves the matching rows from the linked table. It could perform an index seek or scan on the related table (ideal) or a full table scan (not ideal).
The Hash Join hashes the secondary table first (incurring a high startup cost) and then joins fast.
The Merge Join sorts both tables by the join key (assuming an equi-join), again incurring a heavy startup cost, and then joins fast (like a zipper).
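To make the three strategies concrete, here is a minimal Python sketch over lists of dictionaries. It is a teaching aid under simplifying assumptions (everything fits in memory, equi-join on a single key); the function names are hypothetical and this is not the planner's actual code.

```python
from collections import defaultdict


def nested_loop_join(outer, inner, key):
    """For every outer row, scan the inner rows for matches."""
    return [(o, i) for o in outer for i in inner if o[key] == i[key]]


def hash_join(outer, inner, key):
    """Build a hash table on the inner rows once, then probe it."""
    table = defaultdict(list)
    for i in inner:  # startup cost: one full pass over the inner rows
        table[i[key]].append(i)
    return [(o, i) for o in outer for i in table.get(o[key], [])]


def merge_join(outer, inner, key):
    """Sort both inputs on the key, then advance through them together."""
    left = sorted(outer, key=lambda r: r[key])
    right = sorted(inner, key=lambda r: r[key])
    result, j = [], 0
    for o in left:
        # Skip inner rows whose key is smaller than the current outer key.
        while j < len(right) and right[j][key] < o[key]:
            j += 1
        # Emit every inner row sharing the key without moving j, so that
        # duplicate outer keys can match the same inner rows again.
        k = j
        while k < len(right) and right[k][key] == o[key]:
            result.append((o, right[k]))
            k += 1
    return result
```

The nested loop touches the inner rows once per outer row, while the hash and merge variants pay a one-time build or sort cost and then make a single pass.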

Related

Index versus Sequential search performance?

Say I have a database that holds information about books and their dates of publishing. (two attributes, bookName and publicationDate).
Say that the attribute publicationDate has a Hash Index.
If I wanted to display every book that was published in 2010 I would enter this query : select bookName from Books where publicationDate=2010.
In my lecture, it is explained that if there is a big volume of data and that the publication dates are very diverse, the more optimized way is to use the Hash index in order to keep only the books published in 2010.
However, if the vast majority of the books that are in the database were published in 2010 it is better to search the database sequentially in terms of performance.
I really don't understand why? What are the situations where using an index is more optimized and why?
It is surprising that you are learning about hash indexes without understanding this concept. Hash indexing is a pretty advanced database concept; most databases don't even support them.
The example is also quite misleading. 2010 is not a DATE; it is a YEAR. This is important because a hash index only works on equality comparisons. So the natural way to get a year of data from dates:
where publicationDate >= date '2010-01-01' and
publicationDate < date '2011-01-01'
could not use a hash index because the comparisons are not equality comparisons.
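The difference between equality-only and range-capable lookups can be illustrated in Python, using a dict as a stand-in for a hash index and a sorted list with binary search as a stand-in for an ordered (B-tree style) index. The sample rows and the names equality_lookup and range_lookup are made up for the example.

```python
import bisect
from datetime import date

# Hypothetical sample data: (publicationDate, bookName).
rows = [
    (date(2009, 12, 31), "A"),
    (date(2010, 1, 1), "B"),
    (date(2010, 6, 15), "C"),
    (date(2011, 1, 1), "D"),
]

# Hash-style index: exact key -> book names. Keys have no usable order.
hash_index = {}
for d, title in rows:
    hash_index.setdefault(d, []).append(title)

# Ordered index: keys kept sorted, as a stand-in for a B-tree.
ordered = sorted(rows)
keys = [d for d, _ in ordered]


def equality_lookup(d):
    """Works with the hash index: one exact key."""
    return hash_index.get(d, [])


def range_lookup(lo, hi):
    """Rows with lo <= date < hi, found by binary search on sorted keys."""
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_left(keys, hi)
    return [title for _, title in ordered[start:end]]


books_2010 = range_lookup(date(2010, 1, 1), date(2011, 1, 1))
```

Answering the range query from hash_index alone would mean checking every key, which is exactly the work an index is supposed to avoid.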
Indexes can be used for several purposes:
To quickly determine which rows match filtering conditions so fewer data pages need to be read.
To identify rows with common key values for aggregations.
To match rows between tables for joins.
To support unique constraints (via unique indexes).
And for b-tree indexes, to support order by.
Your question concerns the first purpose, which is to reduce the number of data pages being read. Reading a data page is non-trivial work, because it needs to be fetched from disk. A sequential scan reads all data pages, regardless of whether or not they are needed.
If only one row matches the index conditions, then only one page needs to be read. That is a big win on performance. However, if every page has a row that matches the condition, then you are reading all the pages anyway. The index seems less useful.
And using an index is not free. The index itself needs to be loaded into memory. The keys need to be hashed and processed during the lookup operation. All of this overhead is unnecessary if you just scan the pages (although there is other overhead for the key comparisons for filtering).
Using an index has a performance cost. If the percentage of matches is a small fraction of the whole table, this cost is more than made up for by not having to scan the whole table. But if there's a large percentage of matches, it's faster to simply read the table.
There is the cost of reading the index. A small, frequently used index might be in memory, but a large or infrequently used one might be on disk. That means slow disk access to search the index and get the matching row numbers. If the query matches a small number of rows this overhead is a win over searching the whole table. If the query matches a large number of rows, this overhead is a waste; you're going to have to read the whole table anyway.
Then there is an IO cost. With disks it's much, much faster to read and write sequentially than randomly. We're talking 10 to 100 times faster.
A spinning disk has a physical part, the head, which must move around to read different parts of the disk. The time it takes to move is known as "seek time". When you skip around between rows in a table, possibly out of order, this is random access and induces seek time. In contrast, reading the whole table is likely to be one long continuous read; the head does not have to jump around, so there is no seek time.
SSDs are much, much faster because there are no physical parts to move, but they're still much faster for sequential access than for random access.
In addition, random access has more overhead between the operating system and the disk; it requires more instructions.
So if the database decides a query is going to match most of the rows of a table, it can decide that it's faster to read them sequentially and weed out the non-matches than to look up rows via the index and use slower random access.
Consider a bank of post office boxes, each numbered in a big grid. It's pretty fast to look up each box by number, but it's much faster to start at a box and open them in sequence. And we have an index of who owns which box and where they live.
You need to get the mail for South Northport. You look up in the index which boxes belong to someone from South Northport, see there's only a few of them, and grab the mail individually. That's an indexed query and random access. It's fast because there's only a few mailboxes to check.
Now I ask you to get the mail for everyone but South Northport. You could use the index in reverse: get the list of boxes for South Northport, subtract those from the list of every box, and then individually get the mail for each box. But this would be slow, random access. Instead, since you're going to have to open nearly every box anyway, it is faster to check every box in sequence and skip the ones holding mail for South Northport.
More formally, the indexed vs table scan performance is something like this.
# Indexed query
C[index] + (C[random] * M)
# Full table scan
(C[sequential] + C[match]) * N
Where C are various constant costs (or near enough constant), M is the number of matching rows, and N is the number of rows in the table.
We know C[sequential] is 10 to 100 times smaller than C[random]. Because disk access is so much slower than CPU or memory operations, C[match] (the cost of checking if a row matches) will be relatively small compared to C[sequential]. More formally...
C[random] >> C[sequential] >> C[match]
Using that we can assume that C[sequential] + C[match] is approximately C[sequential].
# Indexed query
C[index] + (C[random] * M)
# Full table scan
C[sequential] * N
When M << N the indexed query wins. As M approaches N, the full table scan wins.
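The simplified formulas above can be turned into a small Python calculation that finds the break-even number of matching rows. The constants used in the usage lines are invented purely for illustration; only their relative sizes (C[random] much larger than C[sequential]) follow the text.

```python
def indexed_cost(c_index, c_random, m):
    """C[index] + (C[random] * M)"""
    return c_index + c_random * m


def full_scan_cost(c_sequential, n):
    """C[sequential] * N (C[match] is treated as negligible)."""
    return c_sequential * n


def break_even_matches(c_index, c_random, c_sequential, n):
    """Value of M at which both plans cost the same."""
    return (c_sequential * n - c_index) / c_random


# Hypothetical costs: a random read is 50x a sequential one.
C_INDEX, C_RANDOM, C_SEQUENTIAL, N = 100, 50, 1, 1_000_000

threshold = break_even_matches(C_INDEX, C_RANDOM, C_SEQUENTIAL, N)
index_wins = indexed_cost(C_INDEX, C_RANDOM, 1_000) < full_scan_cost(C_SEQUENTIAL, N)
scan_wins = indexed_cost(C_INDEX, C_RANDOM, 500_000) > full_scan_cost(C_SEQUENTIAL, N)
```

With these made-up numbers the index stops paying off once roughly 2% of the rows match, which shows how far below "most of the table" the crossover can sit when random reads are expensive.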
Note that the cost of using the index isn't really constant. C[index] is things like loading the index, looking up a key, and reading the row IDs. This can be quite variable depending on the size of the index, type of index, and whether it is on disk (cold) or in memory (hot). This is why the first few queries are often rather slow when you've first started a database server.
In the real world it's more complicated than that. In reality rows are broken up into data pages and databases have many tricks to optimize queries and disk access. But, generally, if you're matching most of the rows a full table scan will beat an indexed lookup.
Hash indexes are of limited use these days. A hash index is a simple key/value structure and can only be used for equality checks. Most databases use a B-Tree as their standard index. B-Trees are a little more costly, but can handle a broader range of operations including equality, ranges, comparisons, and prefix searches such as like 'foo%'.
The Postgres Index Types documentation is a pretty good high-level rundown of the various advantages and disadvantages of the types of indexes.

Oracle 10g Full table scan(parallel access) 100x times faster than index access by rowid

There was a query in production which was running for several hours (5-6 hours). I looked into its execution plan, and found that it was ignoring a parallel hint on a huge table. Reason: it was using TABLE ACCESS BY INDEX ROWID. So after I added a /*+ full(huge_table) */ hint before the parallel(huge_table) hint, the query started running in parallel, and it finished in less than 3 minutes. What I could not fathom was the reason for this HUGE difference in performance.
The following are the advantages of parallel FTS I can think of:
Parallel operations are inherently fast if you have more idle CPUs.
Parallel operations in 10g are direct I/O which bypass the buffer cache, which means there is no risk of "buffer busy waits" or any other contention for the buffer cache.
Sure there are the above advantages but then again the following disadvantages are still there:
Parallel operations still have to do I/O, and this I/O would be more than what we have for TABLE ACCESS BY INDEX ROWID, as the entire table is scanned, and it is costlier (all physical reads).
Parallel operations are not very scalable, which means that if there aren't enough free resources, it is going to be slow.
With the above knowledge at hand, I see only one reason that could have caused the poor performance for the query when it used ACCESS BY INDEX ROWID: some sort of contention like "buffer busy waits". But it doesn't show up in the AWR top 5 wait events. The top two events were "db file sequential read" and "db file scattered read". Is there something else that I have failed to take into consideration? Please enlighten me.
First, without knowing anything about your data volumes, statistics, the selectivity of your predicates, etc., I would guess that the major benefit you're seeing is from doing a table scan rather than trying to use an index. Indexes are not necessarily fast and table scans are not necessarily slow. If you are using a rowid from an index to access a row, Oracle is limited to doing single block reads (sequential reads in Oracle terms), and it is going to have to read the same block many times if the block has many rows of interest. A full table scan, on the other hand, can do nice, efficient multiblock reads (scattered reads in Oracle terms). Sure, an individual single block read is going to be cheaper than a single multiblock read, but the multiblock read is much more efficient per byte read. Additionally, if you're using an index, you've potentially got to read a number of blocks from the index periodically to find out the next rowid to read from the table.
You don't actually need to read all that much data from the table before a table scan is more efficient than an index. Depending on a host of other factors, the tipping point is probably in the 10-20% range (that's a very, very rough guess). Imagine that you had to get a bunch of names from the phone book and that the phone book had an index that included the information you're filtering on and the page that the entry is on. You could use an index to find the name of a single person you want to look at, flip to the indicated page, record the information, flip back to the index, find the next name, flip back, etc. Or you could simply start at the first name, scan until you find a name of interest, record the information, and continue the scan. It doesn't take too long before you're better off ignoring the index and just reading from the table.
Adding parallelism doesn't reduce the amount of work your query does (in fact, adding in parallel query coordination means that you're doing more work). It's just that you're doing that work over a shorter period of elapsed time by using more of the server's available resources. If you're running the query with 6 parallel slaves, that could certainly allow the query to run 5 times faster overall (parallel query obviously scales a bit less than linearly because of overheads). If that's the case, you'd expect that doing a table scan made the query 20 times faster and adding parallelism added another factor of 5 to get your 100x improvement.

Querying Oracle table of high degree of parallelism results in full table scan

Well, the title describes what I've just encountered with an Oracle database.
Here's some background:
The table in question is partitioned by hash into 4 partitions.
Parallel degree of the table is 4.
Hash key equals PK.
There is quite a number of rows in the table, around 200M.
PK index is also partitioned (local partition).
Parallel degree of the index is 1.
Now I've got a query that behaves strangely as I change the parallel degree of the table.
If the table degree is 4, it results in a full table scan (coordinated parallel full table scan), as revealed by the explain plan. The query takes 30 minutes or more to complete.
If the table degree is 1-3, it correctly makes use of the PK index (range scan, single threaded) and returns the result in 20 seconds.
If I set both the table degree and the index degree to 4, it results in a full table scan (same result as the first scenario above).
This behavior, however, does not happen in another database where I have a nearly identical clone of the table. The only difference is the number of records: the table in the other database is slightly smaller (by 1-2 million rows). The smaller table, also with a degree of 4, does not run into a full table scan with the same query.
I've spent some time on Googling around and found the following things about parallel query:
From Oracle official doc
A high degree of parallelism for a table skews the optimizer toward full table scans over range scans. Examine the DEGREE column in ALL_TABLES for the table to determine the degree of parallelism.
And from http://www.toadworld.com/Portals/0/GuyH/Articles/Oracle%20Parallel%20SQL%20Part%201.pdf
Parallel query should be applied when
The SQL performs at least one full table, index or partition scan
And from AskTom.com
Parallel query is suitable for a certain class of large problems: very large problems that have no other solution. Parallel query is my last path of action for solving a performance problem; it's never my first course of action.
It seems that parallel execution is designed for processing very large volumes of data when no better solution exists. It attempts to give better performance by running things in parallel, with each CPU (process) dedicated to working on a separate portion of the data (block range, table partitions or index partitions). As such, it is not designed to speed up general queries, or queries that do not cover a sufficient portion of the whole table.
Is my understanding above correct that parallelism should not be used as a means of speeding up general queries?
If yes, does that also mean that the best practice is to turn off parallelism (degree as 0) and enable it for a particular query/operation through a hint or parallel clause?
In addition, what is the best practice for setting up PARALLEL? If what I want is the best read performance through multi-threading, what should the setup be?
Lots of questions here. Lots of thanks in advance.
As a general rule I agree with Tom. Our main base table is an IOT of approximately 240M rows, plus other indexes, with somewhere between 10 and 1,000 insert, delete, and update operations happening 24 hours a day. We generally get information out of it in a split second, and if we want a lot of information we go for the full scan and deal with the 2.5 hours it takes. In answer to some of your questions: if you're going to be doing more large queries than small ones, then go with the partition. If not, then don't.
For your specific query, parallelism likely isn't your biggest problem. The new estimated cost and time of a query will be very roughly equal to the original cost divided by the degree of parallelism. The optimizer could be wrong here; for example, if you only have one hard drive then the new plan probably won't be any faster at all. But a 4x estimate mistake shouldn't lead to a 90x performance difference. This leads me to believe that your plan was already on the brink of failure, and this just tipped it over. How close are the estimated and actual cardinalities of your non-parallel plan? Whatever is causing those differences might be responsible for the bulk of your problem.
For your more general questions, there are no simple answers. There are several dozen things you may need to consider for parallelism, only you can know which ones will apply to your situation. Your best bet is to stop trying to Google it, and instead read the manual. The Using Parallel Execution chapter in the Data Warehousing Guide is a good place to start.
The degree of a relation or table in SQL means the number of attributes in the relation.
For example: if a relation in SQL has three rows and four columns, then its degree is four. Put simply, the number of columns of a relation is called its degree.

To what degree can effective indexing overcome performance issues with VERY large tables?

So, it seems to me like a query on a table with 10k records and a query on a table with 10mil records are almost equally fast if they are both fetching roughly the same number of records and making good use of simple indexes (auto increment, record id type indexed field).
My question is, will this extend to a table with close to 4 billion records if it is indexed properly and the database is set up in such a way that queries always use those indexes effectively?
Also, I know that inserting new records into a very large indexed table can be very slow because all the indexes have to be recalculated. If I add new records only to the end of the table, can I avoid that slowdown, or will that not work because the index is a binary tree and a large chunk of the tree will still have to be recalculated?
Finally, I looked around a bit for a FAQs/caveats about working with very large tables, but couldn't really find one, so if anyone knows of something like that, that link would be appreciated.
Here is some good reading about large tables and the effects of indexing on them, including cost/benefit, as you requested:
http://www.dba-oracle.com/t_indexing_power.htm
Indexing very large tables (as with anything database related) depends on many factors, including your access patterns, the ratio of reads to writes, and the size of available RAM.
If you can fit your 'hot' (i.e. frequently accessed) index pages into memory then accesses will generally be fast.
The strategy used to index very large tables is to use partitioned tables and partitioned indexes. BUT if your query does not join or filter on the partition key then there will be no improvement in performance over an unpartitioned table, i.e. no partition elimination.
SQL Server Database Partitioning Myths and Truths
Oracle Partitioned Tables and Indexes
It's very important to keep your indexes as narrow as possible.
Kimberly Tripp's The Clustered Index Debate Continues...(SQL Server)
Accessing the data via a unique index lookup will slow down as the table gets very large, but not by much. The index is stored as a B-tree structure in Postgres (not binary tree which only has two children per node), so a 10k row table might have 2 levels whereas a 10B row table might have 4 levels (depending on the width of the rows). So as the table gets ridiculously large it might go to 5 levels or higher, but this only means one extra page read so is probably not noticeable.
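The slow growth in the number of levels can be checked with a short Python calculation. The fanout of 400 entries per page is an assumption chosen for illustration (the real figure depends on page size and key width), and btree_levels is a hypothetical helper, not a Postgres function.

```python
def btree_levels(num_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


ASSUMED_FANOUT = 400  # illustrative only; depends on page size and key width

small = btree_levels(10_000, ASSUMED_FANOUT)          # 10k rows
huge = btree_levels(10_000_000_000, ASSUMED_FANOUT)   # 10B rows
```

Under this assumed fanout, growing the table a million-fold adds only two levels, which is why a unique lookup barely slows down.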
When you insert new rows, you can't control where they are inserted in the physical layout of the table, so I assume you mean "end of the table" in terms of using the maximum value being indexed. I know Oracle has some optimisations around leaf block splitting in this case, but I don't know about Postgres.
If it is indexed properly, insert performance may be impacted more than select performance. Indexes in PostgreSQL have a vast number of options which allow you to index part of a table or the output of an immutable function on tuples in the table. Also, as the table grows, the size of the index, assuming it is usable, will slow a lookup down far more gradually than the table's size slows down a full scan. The biggest difference is between searching a tree and scanning a list. Of course you still have disk I/O and memory overhead that goes into index usage, and so large indexes don't perform as well as they theoretically could.

Querying Postgresql with a very large result set

In an application I need to query a Postgres DB where I expect tens or even hundreds of millions of rows in the result set. I might do this query once a day, or even more frequently. The query itself is relatively simple, although may involve a few JOINs.
My question is: How smart is Postgres with respect to avoiding having to seek around the disk for each row of the result set? Given the time required for a hard disk seek, this could be extremely expensive.
If this isn't an issue, how does Postgres avoid it? How does it know how to lay out data on the disk such that it can be streamed out in an efficient manner in response to this query?
When PostgreSQL analyzes your data, one of the statistics calculated, and used by the query planner is the correlation between the ordering of values in your field or index, and the order on disk.
Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is NULL if the column data type does not have a < operator.)
The index cost estimation functions also calculate a correlation:
The indexCorrelation should be set to the correlation (ranging between -1.0 and 1.0) between the index order and the table order. This is used to adjust the estimate for the cost of fetching rows from the parent table.
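To get a feel for what a correlation between physical and logical order means, here is a small Python sketch that computes a correlation between each row's position on disk and the rank of its value. It is an illustration of the concept only; it is an assumption that this matches the exact statistic PostgreSQL computes, and physical_correlation is a hypothetical name.

```python
def physical_correlation(values):
    """Correlation between physical position and the rank of each value.

    `values` lists the column values in on-disk order; needs at least 2 rows.
    """
    n = len(values)
    # Rank of each value in sorted order (ties broken by position).
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, i in enumerate(order):
        ranks[i] = rank
    # Positions and ranks are both permutations of 0..n-1, so they share
    # the same mean and variance; Pearson's r reduces to cov / var.
    mean = (n - 1) / 2
    cov = sum((p - mean) * (r - mean) for p, r in zip(range(n), ranks))
    var = sum((p - mean) ** 2 for p in range(n))
    return cov / var


in_order = physical_correlation([1, 2, 3, 4, 5])
reversed_order = physical_correlation([5, 4, 3, 2, 1])
shuffled = physical_correlation([3, 1, 5, 2, 4])
```

Values stored in ascending or descending order give results at the extremes of the range, while a shuffled layout lands closer to zero, the case where an index scan implies more random access.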
I don't know for sure, but I assume the planner uses the correlation values of the various possible plans when deciding which is cheaper for the number of rows to be read: a table scan, with sequential I/O (possibly joining in with another concurrent scan of the same table) and filtering for the required rows, or an index scan, with its resulting seeks.
PostgreSQL doesn't keep tables sorted according to any particular key, but they can periodically be recreated in a particular index order using the CLUSTER command (which will be slow, with a disk seek per row, if the data to cluster has low correlation to the index values order).
PostgreSQL is able to effectively collect a set of disk blocks that need retrieving, then obtain them in physical order to reduce seeking. It does this through Bitmap Scans. Release Notes for 8.1 say:
Bitmap scans are useful even with a single index, as they reduce the amount of random access needed; a bitmap index scan is efficient for retrieving fairly large fractions of the complete table, whereas plain index scans are not.
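The idea of collecting block numbers first and then fetching them in physical order can be sketched in Python. This is a simplified model of the access pattern only, not PostgreSQL's implementation: the index is represented as a hypothetical list of (key, block_number) pairs and both function names are made up.

```python
def plain_index_scan_order(index_entries, wanted):
    """Visit heap blocks in key order; blocks may repeat and jump around."""
    return [block for key, block in sorted(index_entries) if key in wanted]


def bitmap_scan_order(index_entries, wanted):
    """Collect matching block numbers, then visit each once in physical order."""
    blocks = {block for key, block in index_entries if key in wanted}
    return sorted(blocks)


# Hypothetical index: key -> heap block holding the row.
entries = [(10, 7), (11, 2), (12, 7), (13, 1), (14, 2)]
wanted_keys = {10, 11, 12, 14}

plain = plain_index_scan_order(entries, wanted_keys)
bitmap = bitmap_scan_order(entries, wanted_keys)
```

In this toy example the plain scan bounces between blocks 7 and 2 and touches each twice, while the bitmap-style order reads each block once, in ascending order.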
Edit: I meant to mention the planner cost constants seq_page_cost and random_page_cost, which inform the planner of the relative costs of fetching a disk page as part of a series of sequential fetches vs. fetching a disk page non-sequentially.