In an application I need to query a Postgres DB where I expect tens or even hundreds of millions of rows in the result set. I might do this query once a day, or even more frequently. The query itself is relatively simple, although it may involve a few JOINs.
My question is: How smart is Postgres with respect to avoiding having to seek around the disk for each row of the result set? Given the time required for a hard disk seek, this could be extremely expensive.
If this isn't an issue, how does Postgres avoid it? How does it know how to lay out data on the disk such that it can be streamed out in an efficient manner in response to this query?
When PostgreSQL analyzes your data, one of the statistics it calculates, and which the query planner uses, is the correlation between the ordering of values in your field or index and the order on disk:
Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is NULL if the column data type does not have a < operator.)
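You can look at this statistic directly in the pg_stats view; a small sketch with hypothetical table and column names:

-- correlation near +1/-1 favors index scans; near 0 favors bitmap or seq scans.
SELECT tablename, attname, n_distinct, correlation
FROM   pg_stats
WHERE  tablename = 'books'
AND    attname   = 'publication_date';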
The index cost estimation functions also calculate a correlation:
The indexCorrelation should be set to the correlation (ranging between -1.0 and 1.0) between the index order and the table order. This is used to adjust the estimate for the cost of fetching rows from the parent table.
I don't know for sure, but I assume the planner uses the correlation values of the various possible plans when determining whether the rows needed from a table can be read at lower cost by a table scan with sequential I/O (possibly joining in with another concurrent scan of the same table), filtering for the required rows, or by an index scan with its resulting seeks.
PostgreSQL doesn't keep tables sorted according to any particular key, but they can periodically be recreated in a particular index order using the CLUSTER command (which will be slow, with a disk seek per row, if the data to cluster has low correlation to the order of the index values).
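For example (hypothetical table and index names), a one-off physical reordering might look like this:

-- Rewrites the table in index order; it takes an exclusive lock while running.
CLUSTER books USING books_publication_date_idx;
-- Refresh statistics so the planner sees the improved correlation.
ANALYZE books;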
PostgreSQL is able to effectively collect a set of disk blocks that need retrieving, then obtain them in physical order to reduce seeking. It does this through Bitmap Scans. Release Notes for 8.1 say:
Bitmap scans are useful even with a single index, as they reduce the amount of random access needed; a bitmap index scan is efficient for retrieving fairly large fractions of the complete table, whereas plain index scans are not.
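You can usually see this in a plan. The shape below is only illustrative (whether the planner actually chooses a bitmap scan depends on the statistics) and the names are made up:

EXPLAIN SELECT * FROM books WHERE publication_date >= date '2010-01-01';
--  Bitmap Heap Scan on books
--    Recheck Cond: (publication_date >= '2010-01-01'::date)
--    ->  Bitmap Index Scan on books_publication_date_idx
--          Index Cond: (publication_date >= '2010-01-01'::date)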
Edit: I meant to mention the planner cost constants seq_page_cost and random_page_cost, which inform the planner of the relative costs of fetching a disk page as part of a series of sequential fetches vs. fetching a non-sequential disk page.
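Their defaults are 1.0 and 4.0 respectively; they can be inspected and, on fast storage, tuned:

SHOW seq_page_cost;      -- 1.0 by default
SHOW random_page_cost;   -- 4.0 by default
-- On SSD-backed storage the gap is often narrowed, for example:
-- SET random_page_cost = 1.1;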
Related
Say I have a database that holds information about books and their publication dates (two attributes, bookName and publicationDate).
Say that the attribute publicationDate has a Hash Index.
If I wanted to display every book that was published in 2010, I would enter this query: select bookName from Books where publicationDate = 2010.
In my lecture, it is explained that if there is a large volume of data and the publication dates are very diverse, the more efficient approach is to use the hash index to retrieve only the books published in 2010.
However, if the vast majority of the books in the database were published in 2010, it is better, performance-wise, to scan the table sequentially.
I really don't understand why. In which situations is using an index more efficient, and why?
It is surprising that you are learning about hash indexes without understanding this concept. Hash indexing is a pretty advanced database concept; most databases don't even support them.
The example is quite misleading, though: 2010 is not a DATE; it is a year. This is important because a hash index only works on equality comparisons. So the natural way to get a year of data from dates:
where publicationDate >= date '2010-01-01' and
publicationDate < date '2011-01-01'
could not use a hash index because the comparisons are not equality comparisons.
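To illustrate with PostgreSQL syntax (index name made up), the equality form can use a hash index while the range form cannot:

CREATE INDEX books_pubdate_hash ON Books USING hash (publicationDate);

-- Can use the hash index: a pure equality comparison.
SELECT bookName FROM Books WHERE publicationDate = date '2010-06-15';

-- Cannot use the hash index: range comparisons need a B-tree instead.
SELECT bookName FROM Books
WHERE publicationDate >= date '2010-01-01'
  AND publicationDate <  date '2011-01-01';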
Indexes can be used for several purposes:
To quickly determine which rows match filtering conditions so fewer data pages need to be read.
To identify rows with common key values for aggregations.
To match rows between tables for joins.
To support unique constraints (via unique indexes).
And for b-tree indexes, to support order by.
This is the first purpose, which is to reduce the number of data pages being read. Reading a data page is non-trivial work, because it needs to be fetched from disk. A sequential scan reads all data pages, regardless of whether or not they are needed.
If only one row matches the index conditions, then only one page needs to be read. That is a big win on performance. However, if every page has a row that matches the condition, then you are reading all the pages anyway. The index seems less useful.
And using an index is not free. The index itself needs to be loaded into memory. The keys need to be hashed and processed during the lookup operation. All of this overhead is unnecessary if you just scan the pages (although there is other overhead for the key comparisons for filtering).
Using an index has a performance cost. If the percentage of matches is a small fraction of the whole table, this cost is more than made up for by not having to scan the whole table. But if there's a large percentage of matches, it's faster to simply read the table.
There is the cost of reading the index. A small, frequently used index might be in memory, but a large or infrequently used one might be on disk. That means slow disk access to search the index and get the matching row numbers. If the query matches a small number of rows this overhead is a win over searching the whole table. If the query matches a large number of rows, this overhead is a waste; you're going to have to read the whole table anyway.
Then there is an IO cost. With disks it's much, much faster to read and write sequentially than randomly. We're talking 10 to 100 times faster.
A spinning disk has a physical part, the head, which must move around to read different parts of the disk. The time it takes to move is known as "seek time". When you skip around between rows in a table, possibly out of order, this is random access, and it incurs seek time. In contrast, reading the whole table is likely to be one long continuous read; the head does not have to jump around, so there is no seek time.
SSDs are much, much faster since there are no physical parts to move, but they are still considerably faster for sequential access than for random access.
In addition, random access has more overhead between the operating system and the disk; it requires more instructions.
So if the database decides a query is going to match most of the rows of a table, it can decide that it's faster to read them sequentially and weed out the non-matches than to look up rows via the index using slower random access.
Consider a bank of post office boxes, each numbered in a big grid. It's pretty fast to look up each box by number, but it's much faster to start at a box and open them in sequence. And we have an index of who owns which box and where they live.
You need to get the mail for South Northport. You look up in the index which boxes belong to someone from South Northport, see there's only a few of them, and grab the mail individually. That's an indexed query and random access. It's fast because there's only a few mailboxes to check.
Now I ask you to get the mail for everyone but South Northport. You could use the index in reverse: get the list of boxes for South Northport, subtract those from the list of every box, and then individually get the mail for each box. But this would be slow, random access. Instead, since you're going to have to open nearly every box anyway, it is faster to check every box in sequence and see if it's mail for South Northport.
More formally, the indexed vs table scan performance is something like this.
# Indexed query
C[index] + (C[random] * M)
# Full table scan
(C[sequential] + C[match]) * N
Where C are various constant costs (or near enough constant), M is the number of matching rows, and N is the number of rows in the table.
We know C[sequential] is 10 to 100 times faster than C[random]. Because disk access is so much slower than CPU or memory operations, C[match] (the cost of checking if a row matches) will be relatively small compared to C[sequential]. More formally...
C[random] >> C[sequential] >> C[match]
Using that, we can assume that C[sequential] + C[match] is approximately C[sequential].
# Indexed query
C[index] + (C[random] * M)
# Full table scan
C[sequential] * N
When M << N the indexed query wins. As M approaches N, the full table scan wins.
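To make that concrete, here is a purely illustrative calculation with made-up constants (C[index] = 50, C[random] = 4, C[sequential] = 1) and a table of N = 1,000,000 rows:

-- Made-up constants: C[index] = 50, C[random] = 4, C[sequential] = 1
SELECT 50 + 4 * 1000     AS indexed_cost_m_1000,     -- 4,050: index wins easily
       50 + 4 * 500000   AS indexed_cost_m_500000,   -- 2,000,050: scan wins
       1 * 1000000       AS full_scan_cost_n_1m;     -- 1,000,000 either way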
Note that the cost of using the index isn't really constant. C[index] is things like loading the index, looking up a key, and reading the row IDs. This can be quite variable depending on the size of the index, type of index, and whether it is on disk (cold) or in memory (hot). This is why the first few queries are often rather slow when you've first started a database server.
In the real world it's more complicated than that. In reality rows are broken up into data pages and databases have many tricks to optimize queries and disk access. But, generally, if you're matching most of the rows a full table scan will beat an indexed lookup.
Hash indexes are of limited use these days. A hash index is a simple key/value lookup and can only be used for equality checks. Most databases use a B-tree as their standard index. B-trees are a little more costly, but can handle a broader range of operations including equality, ranges, comparisons, and prefix searches such as like 'foo%'.
The Postgres Index Types documentation is a pretty good high-level run-down of the advantages and disadvantages of the various index types.
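A brief PostgreSQL sketch of that flexibility (names are made up; text_pattern_ops matters only when the database uses a non-C locale):

-- A plain B-tree index supports equality, ranges, and ORDER BY.
CREATE INDEX books_pubdate_idx ON Books (publicationDate);
SELECT bookName FROM Books
WHERE publicationDate BETWEEN date '2010-01-01' AND date '2010-12-31'
ORDER BY publicationDate;

-- Prefix searches (LIKE 'foo%') under a non-C locale need text_pattern_ops.
CREATE INDEX books_name_prefix_idx ON Books (bookName text_pattern_ops);
SELECT bookName FROM Books WHERE bookName LIKE 'The%';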
I'm doing simple tests on Redshift to try and speed up the insertion of data into a Redshift table. One thing I noticed today is that doing something like this
CREATE TABLE a (x int) DISTSTYLE key DISTKEY (x) SORTKEY (x);
INSERT INTO a (x) VALUES (1), (2), (3), (4);
VACUUM a; ANALYZE a;
EXPLAIN SELECT MAX(x) FROM a;
yields
QUERY PLAN
XN Aggregate (cost=0.05..0.05 rows=1 width=4)
-> XN Seq Scan on a (cost=0.00..0.04 rows=4 width=4)
I know this is only 4 rows, but it still shouldn't be doing a full table scan to find the max value of a pre-sorted column. Isn't that metadata included in the work done by ANALYZE?
And just as a sanity check, the EXPLAIN for SELECT x FROM a WHERE x > 3 only scans 2 rows instead of the whole table.
Edit: I inserted 1,000,000 more rows into the table with random values from 1 to 10,000. Did a vacuum and analyze. The query plan still says it has to scan all 1,000,004 rows.
Analyzing query plans on a tiny data set does not yield any practical insight into how the database would perform a query.
The optimizer has thresholds, and when the cost difference between different plans is small enough it stops considering alternative plans. The idea is that for simple queries, the time spent searching for the "perfect" execution plan can easily exceed the total execution time of a less optimal plan.
Redshift has been developed on the code for ParAccel DB. ParAccel has literally hundreds of parameters that can be changed/adjusted to optimize the database for different workloads/situations.
Since Redshift is a "managed" offering, it has these settings preset at levels deemed optimal by Amazon engineers given an "expected" workload.
In general, Redshift and ParAccel are not that great for single-slice queries. These queries tend to be run on all slices anyway, even if they are only going to find data in a single slice.
Once a query is executing in a slice, the minimum amount of data read is a block. Depending on block size, this can mean hundreds of thousands of rows.
Remember, Redshift does not have indexes. So you are not going to have a simple record lookup that reads a few entries off an index and then goes, laser-focused, to a single page on the disk. It will always read at least an entire block for that table, and it will do that in every slice.
How to have a meaningful data set to be able to evaluate a query plan?
The short answer is that your table needs a "large number" of data blocks per slice.
How many blocks per slice is my table going to require? The answer depends on several factors:
Number of nodes in your cluster
Type of node in the cluster - Number of slices per node
Data Type - How many bytes each value requires.
The type of compression encoding for the column involved in the query (the optimal encoding depends on the data demographics)
So let's start at the top.
Redshift is an MPP database, where processing is spread across multiple nodes. See Redshift's architecture here.
Each node is further subdivided into slices, which are dedicated data partitions with corresponding hardware resources to process queries on that partition of the data.
When a table is created in Redshift, and data is inserted, Redshift will allocate a minimum of one block per slice.
Here is a simple example:
If you created a cluster with two ds1.8xlarge nodes, you would have 16 slices per node times two nodes for a total of 32 slices.
Let's say we are running a query and the column in the WHERE clause is something like ITEM_COUNT, an integer. An integer consumes 4 bytes.
Redshift uses a block size of 1MB.
So in this scenario, your ITEM_COUNT column would have available to it a minimum of 32 blocks times a block size of 1 MB, which equates to 32 MB of storage.
If you have 32 MB of storage and each entry consumes only 4 bytes, you can store more than 8 million entries and still use just a single block per slice.
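As a back-of-the-envelope check (illustrative only, assuming uncompressed 4-byte integers and the 32 slices above):

-- 1 MB block / 4 bytes per value = 262,144 values per block on each slice;
-- across 32 slices that is ~8.4 million values before a second block is needed.
SELECT (1024 * 1024) / 4      AS values_per_block,
       (1024 * 1024) / 4 * 32 AS values_across_32_slices;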
In this example in the Amazon Redshift documentation they load close to 40 million rows to evaluate and compare different encoding techniques. Read it here.
But wait.....
There is also compression: if you have a 75% compression rate, even 32 million records would still fit into that single block per slice.
What is the bottom line?
In order to analyze your query plan meaningfully, you need tables and columns that span several blocks. In our example above, 32 million rows would still be a single block per slice.
This means that in the configuration above, with all the assumptions, a table with a single record would most likely have the same query plan as a table with 32 million records, because in both cases the database only needs to read a single block per slice.
If you want to understand how your data is distributed across slices and how many blocks are being used you can use the queries below:
How many rows per slice:
Select trim(name) as table_name, id, slice, sorted_rows, rows
from stv_tbl_perm
where name like '<<your-tablename>>'
order by slice;
How to count the blocks:
select trim(name) as table_name, col, b.slice, b.num_values, count(b.slice)
from stv_tbl_perm a, stv_blocklist b
where a.id = b.tbl
and a.slice = b.slice
and name like '<<your-tablename>>'
group by 1,2,3,4
order by col, slice;
At the following link
http://www.programmerinterview.com/index.php/database-sql/selectivity-in-sql-databases/
the author writes that since the "SEX" column has only two possible values, its selectivity for 10,000 records would be, according to the formula given, 0.02%.
But my question is: how does a database system come to know that this particular column has this many unique values? Wouldn't it require scanning the entire table at least once? Or is there some other way the database system comes to know about those unique values?
First, you are applying the formula incorrectly. The selectivity for sex (in the example given) would be 50%, not 0.02%. That means that each value appears about 50% of the time.
The general way that databases keep track of this is using something called "statistics". These are measures that are kept about all tables and used by the optimizer. Sometimes, the information can also be provided by an index on the column.
Coming back to your actual question: yes, the database scans all table data regularly and saves some statistics (e.g. max value, min value, number of distinct keys, number of rows in a table, etc.) in an internal table. These statistics are used to estimate the result of your query (or other DML operations) in order to evaluate the optimal execution plan. You can manually trigger statistics generation by running the command EXEC DBMS_STATS.GATHER_DATABASE_STATS; or one of the other gathering procedures. You can also advise Oracle to read only a sample of all data (e.g. 10% of all rows).
Usually the data content does not change drastically, so it does not matter if the numbers are not absolutely exact; they are (usually) sufficient to estimate an execution plan.
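For example, in Oracle you can gather and then inspect those numbers in the data dictionary (table name is just an illustration):

-- Gather statistics for one table, then look at the stored column statistics.
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'EMPLOYEES');

SELECT column_name, num_distinct, num_nulls, last_analyzed
FROM   user_tab_col_statistics
WHERE  table_name = 'EMPLOYEES';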
Oracle has many processes related to calculating the number of distinct values (NDV).
Manual Statistics Gathering: Statistics gathering can be triggered manually, through many different procedures in DBMS_STATS.
AUTOTASK: Since 10g Oracle has a default AUTOTASK job, "auto optimizer stats collection". It will only gather statistics if the current stats are stale.
Bulk Load: In 12c statistics can be gathered during a bulk load.
Sample: The NDV can be computed from 100% of the data or can be estimated based on a sample. The sample can be either based on blocks or rows.
One-pass distinct sampling: 11g introduced a new AUTO_SAMPLE_SIZE algorithm. It scans the entire table but only uses one pass. It's much faster to scan the whole table than to have to sort even a small part of it. There are several more in-depth descriptions of the algorithm, such as this one.
Incremental Statistics: For partitioned tables Oracle can store extra information about the NDV, called a synopsis. With this information, if only a single partition is modified, only that one partition needs to be analyzed to generate both partition and global statistics.
Index NDV: Index statistics are created by default when an index is created. Also, the information can be periodically re-gathered from DBMS_STATS.GATHER_INDEX_STATS or the cascade option in other procedures in DBMS_STATS.
Custom Statistics: The NDV can be manually set with DBMS_STATS.SET_* or ASSOCIATE STATISTICS.
Dynamic Sampling: Right before a query is executed, Oracle can automatically sample a small number of blocks from the table to estimate the NDV. This usually only happens when statistics are missing.
The database scans the data set in a table so that it can use the most efficient method to retrieve the data. It measures the uniqueness of values using the following formula:
Index Selectivity = number of distinct values / the total number of values
The result will be between zero and one. An index selectivity close to zero means that there are very few distinct values. In those cases indexes actually reduce performance, so the database uses sequential scanning instead of seek operations.
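You can compute the same ratio by hand; a sketch with a hypothetical employees table:

-- Selectivity = distinct values / total values.
-- For a two-value SEX column over 10,000 rows this gives 2 / 10000 = 0.0002,
-- i.e. the 0.02% figure from the linked article: a poor candidate for an index.
SELECT COUNT(DISTINCT sex) * 1.0 / COUNT(*) AS selectivity
FROM   employees;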
For more information on indexes read https://dba.stackexchange.com/questions/42553/index-seek-vs-index-scan
For example, there are 20 fields in a record, 5 of which are indexed. Assume proper indexes are set up on those columns and the data will be retrieved via an indexed field. I want to discuss the two situations below:
retrieving a field from a record
retrieving an entire record
The only difference I know of is that in case 1 the system transfers a smaller amount of data, so it spends less on bus traffic. But when it comes to retrieval time, I'm not sure whether there is any difference between these two cases in terms of hardware operations, because I think the main cost of retrieval in a DB is finding the record, regardless of how many fields are read. Is this correct?
Assuming you are retrieving from a heap-based table and your WHERE clause is identical in both cases:
It matters whether the field(s) being retrieved are in the index or not. If they are in the index, the DBMS does not need to access the table heap at all - this is called an index-only scan. If they are not in the index, the DBMS must access the heap page in which the field resides, possibly requiring additional I/O if it is not already cached.
If you are reading the whole row, it is less likely that all of its fields are covered by the index the DBMS query planner chose to use, so it is more likely you'll pay the I/O cost of the table heap access. This is not so bad for a single row, but it can absolutely destroy performance if many rows are retrieved and the index's clustering factor is bad1.
The situation is similar but slightly more complicated for clustered tables, since indexes tend to cover PK fields even when not explicitly mentioned in CREATE INDEX, and the "main" portion of the table cannot (typically) be accessed directly, but through an index seek.
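As a PostgreSQL-flavored sketch of the difference (made-up table and columns; INCLUDE requires PostgreSQL 11 or later):

-- A covering index lets the first query be answered from the index alone.
CREATE INDEX orders_customer_idx ON orders (customer_id) INCLUDE (order_total);

EXPLAIN SELECT order_total FROM orders WHERE customer_id = 42;
-- expected: Index Only Scan (no heap access beyond visibility checks)

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- expected: Index Scan, which must also fetch each row from the table heap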
On top of that, transferring more data puts more pressure on network bandwidth, as you already noted.
For these reasons, always try to select exactly what you need and no more.
1 A good query optimizer will notice that and perform the full table scan because it's cheaper, even though the index is available.
After reading several materials, I came to these conclusions:
Select only those fields required when performing a query.
If only indexed fields will be scanned, the DB will perform an index-only scan, which is fast.
When fetching many rows that include un-indexed fields, the worst case is that the query performs as many block I/Os as there are rows, which is very expensive. In that case it is better to perform a full table scan, because the total number of block I/Os equals the total number of blocks, which can be much smaller than the number of rows.
So, it seems to me that a query on a table with 10k records and a query on a table with 10 million records are almost equally fast if they are both fetching roughly the same number of records and making good use of simple indexes (an auto-increment, record-id type of indexed field).
My question is: will this extend to a table with close to 4 billion records if it is indexed properly and the database is set up in such a way that queries always use those indexes effectively?
Also, I know that inserting new records into a very large indexed table can be very slow because all the indexes have to be recalculated. If I add new records only to the end of the table, can I avoid that slowdown, or will that not work because the index is a binary tree and a large chunk of the tree will still have to be recalculated?
Finally, I looked around a bit for a FAQs/caveats about working with very large tables, but couldn't really find one, so if anyone knows of something like that, that link would be appreciated.
Here is some good reading about large tables and the effects of indexing on them, including cost/benefit, as you requested:
http://www.dba-oracle.com/t_indexing_power.htm
Indexing very large tables (as with anything database-related) depends on many factors, including your access patterns, the ratio of reads to writes, and the size of available RAM.
If you can fit your 'hot' (i.e. frequently accessed) index pages into memory, then accesses will generally be fast.
The strategy used to index very large tables is to use partitioned tables and partitioned indexes. BUT if your query does not join or filter on the partition key, there will be no performance improvement over an unpartitioned table, i.e. no partition elimination.
SQL Server Database Partitioning Myths and Truths
Oracle Partitioned Tables and Indexes
It's very important to keep your indexes as narrow as possible.
Kimberly Tripp's The Clustered Index Debate Continues...(SQL Server)
Accessing the data via a unique index lookup will slow down as the table gets very large, but not by much. The index is stored as a B-tree structure in Postgres (not a binary tree, which has only two children per node), so a 10k-row table might have 2 levels whereas a 10B-row table might have 4 levels (depending on the width of the rows). So as the table gets ridiculously large it might go to 5 levels or higher, but this only means one extra page read, so it is probably not noticeable.
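If you want to verify the depth of a real index, PostgreSQL's pgstattuple extension reports it (index name is hypothetical):

-- Requires: CREATE EXTENSION pgstattuple;
SELECT tree_level      -- B-tree level of the root page, i.e. the depth of the tree
FROM   pgstatindex('big_table_pkey');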
When you insert new rows, you can't control where they are inserted in the physical layout of the table, so I assume you mean "end of the table" in terms of inserting the maximum value being indexed. I know Oracle has some optimisations around leaf block splitting in this case, but I don't know about Postgres.
If it is indexed properly, insert performance may be impacted more than select performance. Indexes in PostgreSQL have a vast number of options which allow you to index part of a table or the output of an immutable function applied to tuples in the table. Also, the size of the index, assuming the index is usable, affects speed far more gradually than the size of the actual table scan does. The biggest difference is between searching a tree and scanning a list. Of course you still have disk I/O and memory overhead that goes into index usage, so large indexes don't perform as well as they theoretically could.
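For example, two of those options sketched with made-up names are a partial index (indexing only part of a table) and an expression index (indexing the output of an immutable function):

-- Partial index: only rows satisfying the predicate are indexed.
CREATE INDEX orders_active_idx ON orders (customer_id) WHERE status = 'active';

-- Expression index: indexes the result of an immutable function on each row.
CREATE INDEX users_email_lower_idx ON users (lower(email));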