How much cost is spent when retrieving a field or entire record from the DB [duplicate] - sql

This question already has answers here:
Which is faster/best? SELECT * or SELECT column1, colum2, column3, etc
(49 answers)
Closed 8 years ago.
For example there are 20 fields in a record, which includes 5 indexed fields out of 20 fields. Given proper indexes on columns are set up and the data will be retrieved with the indexed field. I want to discuss 2 situations below.
retrieving a field from a record
retrieving a entire record
The only difference I know is that in case 1, the system uses small amount of data, so it spent less on the bus traffic. But when it comes to retrieving time, I'm not sure in these 2 cases if there will be any difference in terms of hardware operation, because I think the main cost on retrieving task on DB is finding the record regardless of how many fields. Is this correct?

Assuming you are retrieving from a heap-based table and your WHERE clause is identical in both cases:
It matters whether the field(s) being retrieved is in the index or not. If it's in the index, the DBMS will not need to access the table heap - this is called index-only scan. If it's not in the index, the DBMS must access the heap page in which the the field resides, possibly requiring additional I/O if not already cached.
If you are reading the whole row, it is less likely all of its fields are covered by the index the DBMS query planner chose to use, so it is more likely you'll pay the I/O cost of the table heap access. This is not so bad for a single row, but can absolutely destroy performance if many rows are retrieved and index's clustering factor is bad1.
The situation is similar but slightly more complicated for clustered tables, since indexes tend to cover PK fields even when not explicitly mentioned in CREATE INDEX, and the "main" portion of the table cannot (typically) be accessed directly, but through an index seek.
On top of that, transferring more data puts more pressure on network bandwidth, as you already noted.
For these reasons, always try to select exactly what you need and no more.
1 A good query optimizer will notice that and perform the full table scan because it's cheaper, even though the index is available.

Reading several material I came to conclusions:
Select only those fields required when performing a query.
If only indexed field will be scanned, the DB will perform index-only searching, which is fast.
When trying to fetch many rows which includes un-indexed fields, the worst case is that the query will perform as many block I/Os as number of rows, which is very expensive cost. So the better way is to perform full table scan because the total number of block I/Os equals to the total number of blocks, which could be much smaller than the number of rows.

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, code smell and arguably a potentially less optimized design). The table has an 'id' as PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) in memory and then return only the first_name?
(In other words - the page that contains the row id = 10 if it isn't in the memory already)
I think the answer is yes! unless it has column markers within a row. I understand there might be optimization techniques, but is it a default behavior?
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with 100s of millions of rows with 100 columns each and receive all sorts of SELECT queries on them. The WHERE clause also changes but no incoming request needs all columns. Many of those cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases will load only the requested columns. Yes! Compression will help too to save space and hopefully performance as well.
For MySQL: Indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY in your case needs to be accessed. For example a million rows, that is 3 blocks. Within the leaf block, there are probably dozens of rows, with all their columns (unless a column is "too big"; but that is a different discussion).
For MariaDB's Columnstore: The contents of one columns for 64K rows is held in a packed, compressed structure that varies in size and structure. Before getting to that, the clump of 64K rows must be located. After getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromises between speed and space for both simple and complex queries.
Your simple query is easy and efficient to doing a regular RDBMS, but messier to do in a Columnstore. Columnstore is a niche market in which your query is abnormal.
Be aware that fetching blocks are typically the slowest part of performing the query, especially when I/O is required. There is a cache of blocks in RAM.

Is selecting fewer SQL columns making the request faster? [duplicate]

This question already has answers here:
select * vs select column
(12 answers)
Closed 8 years ago.
I have a rails/backbone single page application processing a lot of SQL queries.
Is this request:
SELECT * FROM `posts` WHERE `thread_id` = 1
faster than this one:
SELECT `id` FROM `posts` WHERE `thread_id` = 1
How big is the impact of selecting unused columns on the query execution time?
For all practical purposes, when looking for a single row, the difference is negligible. As the number of result rows increases, the difference can become more and more important, but as long as you have an index on thread_id and you are not more than 10-20% of all the rows in the table, here is still not a big issue. FYI the differentiation factor comes from the fact that selecting * will force, for each row, an additional lookup in the primary index. Selecting only id can be satisfied just by looking up the secondary index on thread_id.
There is also the obvious cost associated with any large field, like BLOB documents or big test fields. If the posts fields have values measuring tens of KBs, then obviously retrieving them adds extra transfer cost.
All these assume a normal execution engine, with B-Tree or ISAM row-mode storage. Almost all 'tables' and engines would fall into this category. The difference would be significant if you would be talking about a columnar storage, because columnar storage only reads the columns of interests and reading extra columns unnecessary impacts more visible such storage engines.
Having or not having an index on thread_id will have a hugely more visible impact. Make sure you have it.
Selecting fewer columns is generally faster. Unfortunately, it is hard to say exactly how much the time difference will be. It may depend on things like how many columns there are and what data is in them (for example, large CLOBS can take longer to fetch than simple integers), what indexes have been set up, and the network latency between you and the database server.
For a definitive answer on the time difference, the best I can say is do both queries and see how long each takes.
There will be two components: The query time and the I/O time (you could also break down the I/O into server I/O and server-client (network) I/O).
Selecting just one column will be faster in both respects - certainly because there's less data to fetch and transmit, but also because the column in question could be a part of whatever index is used to find the data, so the server may not have to look up the actual data pages - it may be able to pull the data directly from the index.
The performance difference is almost certainly insignificant for your application. Try it and see whether you can detect a difference; it is very simple to try.

Index not used Postgres

Tracking indexes and analyzing the tables on which index add, we encounter some situations:
some of our tables have index, but when I execute a query with a clause where on index field, doesn't account in your idx_scan field respective. Same relname and schemaname, so, I couldn't be wrong.
Testing more, I deleted and create the table again, after that the query returned to account the idx_scan.
That occurred with another tables too, we executed some queries with indexes and didn't account idx_scan field, only in seq_scan and even if I create another field in the same table with index, this new field doesn't count idx_scan.
Whats the problem with these tables? What do we do wrong? Only if I create a new table with indexes that account in idx_scan, just in an old table that has wrong.
We did migration sometimes with this database, maybe it can be the problem? Happened on localhost and server online.
Another event that we saw, some indexes were accounted, idx_scan > 0, and when execute query select, does not increase idx_scan again, the number was fixed and just increase seq_scan.
I believe those problems can be related.
I appreciate some help, it's a big mystery prowling our DB and have no idea what the problem can be.
A couple suggestions (and what to add to your question).
The first is that index scans are not always favored to to sequential scans. For example, if your table is small or the planner estimates that most pages will need to be fetched, an index scan will be omitted in favor of a sequential scan.
Remember: no plan beats retrieving a single page off disk and sequentially running through it.
Similarly if you have to retrieve, say, 50% of the pages of a relation, doing an index scan is going to trade somewhat less disk/IO total for a great deal more random disk/IO. It might be a win if you use SSD's but certainly not with conventional hard drives. After all you don't really want to be waiting for platters to turn. If you are using SSD's you can tweak planner settings accordingly.
So index vs sequential scan is not the end of the story. The question is how many rows are retrieved, how big the tables are, what percentage of disk pages are retrieved, etc.
If it really is picking a bad plan (rather than a good plan that you didn't consider!) then the question becomes why. There are ways of setting statistics targets but these may not be really helpful.
Finally the planner really can't choose an index in some cases where you might like it to. For example, suppose I have a 10 million row table with records spanning 5 years (approx 2 million rows per year on average). I would like to get the distinct years. I can't do this with a standard query and index, but I can build a WITH RECURSIVE CTE to essentially execute the same query once for each year and that will use an index. Of course you had better have an index in that case or WITH RECURSIVE will do a sequential scan for each year which is certainly not what you want!
tl;dr: It's complicated. You want to make sure this is really a bad plan before jumping to conclusions and then if it is a bad plan see what you can do about it depending on your configuration.

Improve performance of querys in Postgresql with an index

I have in PostgreSQL tables, each with millions of records and more that one hundred fields.
One of them is a date field, which we filter by this in our queries. The creation of an index for this date field improved the performance of the queries that read an small range of dates, but in big range of dates the performance decreased...
I must prioritize one over the other? The performance in small ranges can be improved without decreasing the big range queries?
Queries in PostgreSQL cannot be answered just using the information in an index. Whether or not the row is visible, from the perspective of the query that is executing, is stored in the main row itself. So when you add an index to something, and execute a query that uses it, there are two steps involved:
Navigate the index to determine which data blocks are used
Retrieve those blocks and return the rows that match the query
It is therefore possible that answering a query with an index can take longer than just going directly to the data blocks and fetching the rows. The most common case where this happens is if you are actually grabbing a large portion of the data. Typically if more than 20% of the table is used, it's considered fast to just sequentially access it. Sometimes the planner thinks less than 20% will be accessed, so the index is preferred, but that's not true; that's one way adding an index can slow a query. This may be the situation you're seeing, based on your description--if the large ranges are touching more of the table than the optimizer estimates, using an index can be a net slowdown.
To figure this out, the database collects statistics about each column in each table, to determine whether a particular WHERE condition is selective enough to use an index. The idea is that you need to have saved so many blocks by not reading the whole table that adding the index I/O on top of it is still a net win.
This computation can go wrong, such that you end up doing more I/O than had you just read the table directly, in a couple of cases. The cause of most of them show up if you run the query using EXPLAIN ANALYZE. If the "expected" values versus the "actual" numbers are very different, this can suggest the optimizer had bad statistics on the table. Another possibility is that the optimizer just made a mistake about how selective the query is--it thought it would only return a small number of rows, but it actually returns most of the table. Here, again, better statistics is the normal way to start working on that. If you're on PostgreSQL 8.3 or earlier, the amount of statistics collected is very low by default.
Some workloads end up adjusting the random_page_cost tunable as well, which controls where this index vs. table scan trade-off happens at. That's only something to consider after the stats information is checked though. See Tuning Your PostgreSQL Server for an intro to several things you can adjust here.
I'd try several things:
increase DB cache parameters
add the index on that date field
redesign/modify the application to work with smaller ranges (althogh this suggestion might seem obvious, it is usually first to be thrown away)
The creation of an index for this date field improved the performance of the queries that read an small range of dates, but in big range of dates the performance decreased...
Try clustering your table using that index. The performance decrease might be due to the entire table getting opened on large ranges. And if so, clustering the table along that index would lead to less disk seeks.
Two suggestions:
1) Investigate the use of table inheritance for time-series data. For example, create a child table per month and then INDEX the date on each table. PostgreSQL is smart enough to only perform index_scan's on the child tables that have the actual data in the date range. Once the child table is "sealed" because it is a new month, run CLUSTER on the table to sort the data by date.
2) Look at creating a bunch of INDEX's that use WHERE clauses.
Suggestion #1 is going to be the winner long term but will take some work to setup (but will scale/run forever), but suggestion #2 may be a quick interim fix if you have a limited date range that you care about scanning. Remember, you can only use IMMUTABLE functions in your INDEX's WHERE clause.
CREATE INDEX tbl_date_2011_05_idx ON tbl(date) WHERE date >= '2011-05-01' AND date <= '2011-06-01';

To what degree can effective indexing overcome performance issues with VERY large tables?

So, it seems to me like a query on a table with 10k records and a query on a table with 10mil records are almost equally fast if they are both fetching roughly the same number of records and making good use of simple indexes(auto increment, record id type indexed field).
My question is, will this extend to a table with close to 4 billion records if it is indexed properly and the database is set up in such a way that queries always use those indexes effectively?
Also, I know that inserting new records in to a very large indexed table can be very slow because all the indexes have to be recalculated, if I add new records only to the end of the table can I avoid that slow down, or will that not work because the index is a binary tree and a large chunk of the tree will still have to be recalculated?
Finally, I looked around a bit for a FAQs/caveats about working with very large tables, but couldn't really find one, so if anyone knows of something like that, that link would be appreciated.
Here is some good reading about large tables and the effects of indexing on them, including cost/benefit, as you requested:
http://www.dba-oracle.com/t_indexing_power.htm
Indexing very large tables (as with anything database related) depends on many factors, incuding your access patterns, ratio of Reads to Writes and size of available RAM.
If you can fit your 'hot' (i.e. frequently accessed index pages) into memory then accesses will generally be fast.
The strategy used to index very large tables, is using partitioned tables and partitioned indexes. BUT if your query does not join or filter on the partition key then there will no improvement in performance over an unpartitioned table i.e. no partition elimination.
SQL Server Database Partitioning Myths and Truths
Oracle Partitioned Tables and Indexes
It's very important to keep your indexes as narrow as possible.
Kimberly Tripp's The Clustered Index Debate Continues...(SQL Server)
Accessing the data via a unique index lookup will slow down as the table gets very large, but not by much. The index is stored as a B-tree structure in Postgres (not binary tree which only has two children per node), so a 10k row table might have 2 levels whereas a 10B row table might have 4 levels (depending on the width of the rows). So as the table gets ridiculously large it might go to 5 levels or higher, but this only means one extra page read so is probably not noticeable.
When you insert new rows, you cant control where they are inserted in the physical layout of the table so I assume you mean "end of the table" in terms of using the maximum value being indexed. I know Oracle has some optimisations around leaf block splitting in this case, but I dont know about Postgres.
If it is indexed properly, insert performance may be impacted more than select performance. Indexes in PostgreSQL have vast numbers of options which can allow you to index part of a table or the output of an immutable function on tuples in the table. Also size of the index, assuming it is usable, will affect speed much more slowly than will the actual scan of the table. The biggest difference is between searching a tree and scanning a list. Of course you still have disk I/O and memory overhead that goes into index usage, and so large indexes don't perform as well as they theoretically could.