This question already has answers here:
select * vs select column
(12 answers)
Closed 8 years ago.
I have a rails/backbone single page application processing a lot of SQL queries.
Is this request:
SELECT * FROM `posts` WHERE `thread_id` = 1
faster than this one:
SELECT `id` FROM `posts` WHERE `thread_id` = 1
How big is the impact of selecting unused columns on the query execution time?
For all practical purposes, when looking for a single row, the difference is negligible. As the number of result rows increases, the difference can become more and more important, but as long as you have an index on thread_id and you are not more than 10-20% of all the rows in the table, here is still not a big issue. FYI the differentiation factor comes from the fact that selecting * will force, for each row, an additional lookup in the primary index. Selecting only id can be satisfied just by looking up the secondary index on thread_id.
There is also the obvious cost associated with any large field, like BLOB documents or big test fields. If the posts fields have values measuring tens of KBs, then obviously retrieving them adds extra transfer cost.
All these assume a normal execution engine, with B-Tree or ISAM row-mode storage. Almost all 'tables' and engines would fall into this category. The difference would be significant if you would be talking about a columnar storage, because columnar storage only reads the columns of interests and reading extra columns unnecessary impacts more visible such storage engines.
Having or not having an index on thread_id will have a hugely more visible impact. Make sure you have it.
Selecting fewer columns is generally faster. Unfortunately, it is hard to say exactly how much the time difference will be. It may depend on things like how many columns there are and what data is in them (for example, large CLOBS can take longer to fetch than simple integers), what indexes have been set up, and the network latency between you and the database server.
For a definitive answer on the time difference, the best I can say is do both queries and see how long each takes.
There will be two components: The query time and the I/O time (you could also break down the I/O into server I/O and server-client (network) I/O).
Selecting just one column will be faster in both respects - certainly because there's less data to fetch and transmit, but also because the column in question could be a part of whatever index is used to find the data, so the server may not have to look up the actual data pages - it may be able to pull the data directly from the index.
The performance difference is almost certainly insignificant for your application. Try it and see whether you can detect a difference; it is very simple to try.
Related
I have a table with 100 columns (yes, code smell and arguably a potentially less optimized design). The table has an 'id' as PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) in memory and then return only the first_name?
(In other words - the page that contains the row id = 10 if it isn't in the memory already)
I think the answer is yes! unless it has column markers within a row. I understand there might be optimization techniques, but is it a default behavior?
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with 100s of millions of rows with 100 columns each and receive all sorts of SELECT queries on them. The WHERE clause also changes but no incoming request needs all columns. Many of those cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases will load only the requested columns. Yes! Compression will help too to save space and hopefully performance as well.
For MySQL: Indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY in your case needs to be accessed. For example a million rows, that is 3 blocks. Within the leaf block, there are probably dozens of rows, with all their columns (unless a column is "too big"; but that is a different discussion).
For MariaDB's Columnstore: The contents of one columns for 64K rows is held in a packed, compressed structure that varies in size and structure. Before getting to that, the clump of 64K rows must be located. After getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromises between speed and space for both simple and complex queries.
Your simple query is easy and efficient to doing a regular RDBMS, but messier to do in a Columnstore. Columnstore is a niche market in which your query is abnormal.
Be aware that fetching blocks are typically the slowest part of performing the query, especially when I/O is required. There is a cache of blocks in RAM.
Suppose the data distribution does not change, For a same query, only dataset is enlarged a time, will the time taken also becomes 1 time? If the data distribution does not change, will the query plan change if in theory?
Yes, the query plan may still change even if the data is completely static, though it probably won't.
The autovaccum daemon will ANALYZE your tables and generate new statistics. This usually happens only when they've changed, but may happen for other reasons (wrap-around prevention vacuum, etc).
The statistics include a random sampling to collect common values for a histogram. Being random, the outcome may be somewhat different each time.
To reduce the chances of plans shifting for a static dataset, you probably want to increase the statistics target on the table's columns and re-ANALYZE. Don't set it too high though, as the query planner has to read those histograms when it makes planning decisions, and bigger histograms mean slightly more planning time.
If your table is growing continuously but the distribution isn't changing then you want the planner to change plans at various points. A 1000-row table is almost certainly best accessed by doing a sequential scan; an index scan would be a waste of time and effort. You certainly don't want a million row table being scanned sequentially unless you're retrieving a majority of the rows, though. So the planner should - and does - adjust its decisions based not only on the data distribution, but the overall row counts.
Here is an example. You have record on one page and an index. Consider the query:
select t.*
from table t
where col = x;
And, assume you have an index on col. With one record, the fastest way is to simply read the record and check the where clause. You could have 200 records on the page, so the selectivity of the query might be less than 1%.
One of the key considerations that a SQL optimizer makes in choosing an algorithm is the number of expected page reads. So, if you have a query like the above, the engine might think "I have to read all pages in the table anyway, so let me just do a full table scan and ignore the index." Note that this will be true when the data is on a single page.
This generalizes to other operations as well. If all the records in your data fit on one data page, then "slow" algorithms are often the best or close enough to the best. So, nested loop joins might be better than using indexes, hash-based, or sort-merge based joins. Similarly, a sort-based aggregation might be better than other methods.
Alas, I am not as familiar with the Postgres query optimizer as I am with SQL Server and Oracle. I have definitely encountered changes in execution plans in those databases as data grew.
This question already has answers here:
Which is faster/best? SELECT * or SELECT column1, colum2, column3, etc
(49 answers)
Closed 8 years ago.
For example there are 20 fields in a record, which includes 5 indexed fields out of 20 fields. Given proper indexes on columns are set up and the data will be retrieved with the indexed field. I want to discuss 2 situations below.
retrieving a field from a record
retrieving a entire record
The only difference I know is that in case 1, the system uses small amount of data, so it spent less on the bus traffic. But when it comes to retrieving time, I'm not sure in these 2 cases if there will be any difference in terms of hardware operation, because I think the main cost on retrieving task on DB is finding the record regardless of how many fields. Is this correct?
Assuming you are retrieving from a heap-based table and your WHERE clause is identical in both cases:
It matters whether the field(s) being retrieved is in the index or not. If it's in the index, the DBMS will not need to access the table heap - this is called index-only scan. If it's not in the index, the DBMS must access the heap page in which the the field resides, possibly requiring additional I/O if not already cached.
If you are reading the whole row, it is less likely all of its fields are covered by the index the DBMS query planner chose to use, so it is more likely you'll pay the I/O cost of the table heap access. This is not so bad for a single row, but can absolutely destroy performance if many rows are retrieved and index's clustering factor is bad1.
The situation is similar but slightly more complicated for clustered tables, since indexes tend to cover PK fields even when not explicitly mentioned in CREATE INDEX, and the "main" portion of the table cannot (typically) be accessed directly, but through an index seek.
On top of that, transferring more data puts more pressure on network bandwidth, as you already noted.
For these reasons, always try to select exactly what you need and no more.
1 A good query optimizer will notice that and perform the full table scan because it's cheaper, even though the index is available.
Reading several material I came to conclusions:
Select only those fields required when performing a query.
If only indexed field will be scanned, the DB will perform index-only searching, which is fast.
When trying to fetch many rows which includes un-indexed fields, the worst case is that the query will perform as many block I/Os as number of rows, which is very expensive cost. So the better way is to perform full table scan because the total number of block I/Os equals to the total number of blocks, which could be much smaller than the number of rows.
I think this is a pretty common scenario: I have a webpage that's returning links and excerpts to the 10 most recent blog entries.
If I just queried the entire table, I could use my ORM mapped object, but I'd be downloading all the blog text.
If I restricted the query to just the columns that I need, I'd be defining another class that'll hold just those required fields.
How bad is the performance hit if I were to query entire rows? Is it worth selecting just what I need?
The answer is "it depends".
There are two things that affect performance as far as column selection:
Are there covering indexes? E.g. if there is an index containing ALL of the columns in the smaller query, then a smaller column set would be extremely benefifical performance wise, since the index would be read without reading any rows themselves.
Size of columns. Basically, count how big the size of the entire row is, vs. size of only the columns in smaller query.
If the ratio is significant (e.g. full row is 3x bigger), then you might have significant savings in both IO (for retrieval) and network (for transmission) cost.
If the ratio is more like 10% benefit, it might not be worth it as far as DB performance gain.
It depends, but it will never be as efficient as returning only the columns you need (obviously). If there are few rows and the row sizes are small, then network bandwidth won't be affected too badly.
But, returning only the columns you need increases the chance that there is a covering index that can be used to satisfy the query, and that can make a big difference in the time a query takes to execute.
,Since you specify that it's for 10 records, the answer changes from "It Depends" to "Don't spend even a second worrying about this".
Unless your server is in another country on a dialup connection, wire time for 10 records will be zero, regardless of how many bytes you shave off each row. It's simply not something worth optimizing for.
So for this case, you get to set your ORM free to grab you those records in the least efficient manner it can come up with. If your situation changes, and you suddenly need more than, say, 1000 records at once, then you can come back and we'll make fun of you for not specifying columns, but for now you get a free pass.
For extra credit, once you start issuing this homepage query more than 10x per second, you can add caching on the server to avoid repeatedly hitting the database. That'll get you a lot more bang for your buck than optimizing the query.
Let's say I have this query:
select * from table1 r where r.x = 5
Does the speed of this query depend on the number of rows that are present in table1?
The are many factors on the speed of a query, one of which can be the number of rows.
Others include:
index strategy (if you index column "x", you will see better performance than if it's not indexed)
server load
data caching - once you've executed a query, the data will be added to the data cache. So subsequent reruns will be much quicker as the data is coming from memory, not disk. Until such point where the data is removed from the cache
execution plan caching - to a lesser extent. Once a query is executed for the first time, the execution plan SQL Server comes up with will be cached for a period of time, for future executions to reuse.
server hardware
the way you've written the query (often one of the biggest contibutors to poor performance!). e.g. writing something using a cursor instead of a set-based operation
For databases with a large number of rows in tables, partitioning is usually something to consider (with SQL Server 2005 onwards, Enterprise Edition there is built-in support). This is to split the data down into smaller units. Generally, smaller units = smaller tables = smaller indexes = better performance.
Yes, and it can be very significant.
If there's 100 million rows, SQL server has to go through each of them and see if it matches.
That takes a lot more time compared to there being 10 rows.
You probably want an index on the 'x' column, in which case the sql server might check the index rather than going through all the rows - which can be significantly faster as the sql server might not even need to check all the values in the index.
On the other hand, if there's 100 million rows matching x = 5, it's slower than 10 rows.
Almost always yes. The real question is: what is the rate at which the query slows down as the table size increases? And the answer is: by not much if r.x is indexed, and by a large amount if not.
Not the rows (to a certain degree of course) per se, but the amount of data (columns) is what can make a query slow. The data also needs to be transfered from the backend to the frontend.
The Answer is Yes. But not the only factor.
if you did appropriate optimizations and tuning the performance drop will be negligible
Main Performance factors
Indexing Clustered or None clustered
Data Caching
Table Partitioning
Execution Plan caching
Data Distribution
Hardware specs
There are some other factors but these are mainly considered.
Even how you designed your Schema makes effect on the performance.
You should assume that your query always depends on the number of rows. In fact, you should assume the worst case (linear or O(N) for the example you provided) and exponential for more complex queries. There are database specific manuals filled with tricks to help you avoid the worst case but SQL itself is a language and doesn't specify how to execute your query. Instead, the database implementation decides how to execute any given query: if you have indexed a column or set of columns in your database then you will get O(log(N)) performance for a simple lookup; if the system has effective query caching you might get O(1) response. Here is a good introductory article: High scalability: SQL and computational complexity