Performance benefit when SQL query is limited vs calling entire row? - sql

How much of a performance benefit is there by selecting only required field in query instead of querying the entire row? For example, if I have a row of 10 fields but only need 5 fields in the display, is it worth querying only those 5? what is the performance benefit with this limitation vs the risk of having to go back and add fields in the sql query later if needed?

It's not just the extra data aspect that you need to consider. Selecting all columns will negate the usefulness of covering indexes, since a bookmark lookup into the clustered index (or table) will be required.

It depends on how many rows are selected, and how much memory do those extra fields consume. It can run much slower if several text/blobs fields are present for example, or many rows are selected.
How is adding fields later a risk? modifying queries to fit changing requirements is a natural part of the development process.

The only benefit I know of explicitly naming your columns in your select statement is that if a column your code is using gets renamed your select statement will fail before your code. Even better if your select statement is within a proc, your proc and the DB script would not compile. This is very handy if you are using tools like VS DB edition to compile/verify DB scripts.
Otherwise the performance difference would be negligible.

The number of fields retrieved is a second order effect on performance relative to the large overhead of the SQL request itself -- going out of process, across the network to another host, and possibly to disk on that host takes many more cycles than shoveling a few extra bytes of data.
Obviously if the extra fields include a megabyte blob the equation is skewed. But my experience is that the transaction overhead is of the same order, or larger, than the actual data retreived. I remember vaguely from many years ago than an "empty" NOP TNS request is about 100 bytes on the wire.

If the SQL server is not the same machine from which you're querying, then selecting the extra columns transfers more data over the network (which can be a bottleneck), not forgetting that it has to read more data from the disk, allocate more memory to hold the results.
There's not one thing that would cause a problem by itself, but add things up and they all together cause performance issues. Every little bit helps when you have lots of either queries or data.
The risk I guess would be that you have to add the fields to the query later which possibly means changing code, but then you generally have to add more code to handle extra fields anyway.

Related

Selecting one column from a table that has 100 columns

I have a table with 100 columns (yes, code smell and arguably a potentially less optimized design). The table has an 'id' as PK. No other column is indexed.
So, if I fire a query like:
SELECT first_name from EMP where id = 10
Will SQL Server (or any other RDBMS) have to load the entire row (all columns) in memory and then return only the first_name?
(In other words - the page that contains the row id = 10 if it isn't in the memory already)
I think the answer is yes! unless it has column markers within a row. I understand there might be optimization techniques, but is it a default behavior?
[EDIT]
After reading some of your comments, I realized I asked an XY question unintentionally. Basically, we have tables with 100s of millions of rows with 100 columns each and receive all sorts of SELECT queries on them. The WHERE clause also changes but no incoming request needs all columns. Many of those cell values are also NULL.
So, I was thinking of exploring a column-oriented database to achieve better compression and faster retrieval. My understanding is that column-oriented databases will load only the requested columns. Yes! Compression will help too to save space and hopefully performance as well.
For MySQL: Indexes and data are stored in "blocks" of 16KB. Each level of the B+Tree holding the PRIMARY KEY in your case needs to be accessed. For example a million rows, that is 3 blocks. Within the leaf block, there are probably dozens of rows, with all their columns (unless a column is "too big"; but that is a different discussion).
For MariaDB's Columnstore: The contents of one columns for 64K rows is held in a packed, compressed structure that varies in size and structure. Before getting to that, the clump of 64K rows must be located. After getting it, it must be unpacked.
In both cases, the structure of the data on disk is a compromises between speed and space for both simple and complex queries.
Your simple query is easy and efficient to doing a regular RDBMS, but messier to do in a Columnstore. Columnstore is a niche market in which your query is abnormal.
Be aware that fetching blocks are typically the slowest part of performing the query, especially when I/O is required. There is a cache of blocks in RAM.

SQL - split a large table according to how frequently they're accessed?

I have a table that has 50 fields:
10 Fields that are almost always needed.
40 Fields that are very rarely needed.
I would roughly say that the fields in (1) are needed to be accessed 1000 times more frequently than the fields in (2).
Should I split them to two tables with one-to-one relation, or keep all in the same table?
The process that you are describing is sometimes referred to as "vertical partitioning". Taken to an extreme (one column per vertical partition), this is how columnar databases store data. Unfortunately (to the best of my knowledge), Postgres does not currently have direct support for vertical partitioning.
Your idea of splitting the data into two tables is fine. I would note the following:
You will need to modify queries that use the extra columns to use the second table. (You can wrap the join into a view which you use when you want the extra columns.)
If both tables have a clustered primary key that connects them, then the join should be really fast.
If you are inserting/updating/deleting data, then you need to be careful about synchronization. I think you can handle this with an INSTEAD OF trigger on a view combining the tables.
If some records do not have extra columns, this can be a big win on the space side.
If all records and all columns are going to be loaded into the cache, then this probably is not a big win.
This can be a big performance win, under some circumstances. But there is additional manual work to keep the tables synchronized.
There's really not nearly enough information here to estimate (never mind actually quantify) what the benefits might be, but the costs are very clear -- more complex code, a more complex schema, probably greater overall space usage, and a performance overhead when adding and removing rows.
A performance improvement might come from scanning a smaller amount of data when performing a full table scan, or from an increased likelihood in finding data blocks in memory when required, and an overall smaller memory footprint, but without specific information on the types of operation commonly performed, and whether the server is under memory pressure, no reliable advice can be given.
Be very wary of making your system more complex as a side-effect of uncertain performance gains.

Performance of returning entire tables containing blog text as opposed to selecting specific columns

I think this is a pretty common scenario: I have a webpage that's returning links and excerpts to the 10 most recent blog entries.
If I just queried the entire table, I could use my ORM mapped object, but I'd be downloading all the blog text.
If I restricted the query to just the columns that I need, I'd be defining another class that'll hold just those required fields.
How bad is the performance hit if I were to query entire rows? Is it worth selecting just what I need?
The answer is "it depends".
There are two things that affect performance as far as column selection:
Are there covering indexes? E.g. if there is an index containing ALL of the columns in the smaller query, then a smaller column set would be extremely benefifical performance wise, since the index would be read without reading any rows themselves.
Size of columns. Basically, count how big the size of the entire row is, vs. size of only the columns in smaller query.
If the ratio is significant (e.g. full row is 3x bigger), then you might have significant savings in both IO (for retrieval) and network (for transmission) cost.
If the ratio is more like 10% benefit, it might not be worth it as far as DB performance gain.
It depends, but it will never be as efficient as returning only the columns you need (obviously). If there are few rows and the row sizes are small, then network bandwidth won't be affected too badly.
But, returning only the columns you need increases the chance that there is a covering index that can be used to satisfy the query, and that can make a big difference in the time a query takes to execute.
,Since you specify that it's for 10 records, the answer changes from "It Depends" to "Don't spend even a second worrying about this".
Unless your server is in another country on a dialup connection, wire time for 10 records will be zero, regardless of how many bytes you shave off each row. It's simply not something worth optimizing for.
So for this case, you get to set your ORM free to grab you those records in the least efficient manner it can come up with. If your situation changes, and you suddenly need more than, say, 1000 records at once, then you can come back and we'll make fun of you for not specifying columns, but for now you get a free pass.
For extra credit, once you start issuing this homepage query more than 10x per second, you can add caching on the server to avoid repeatedly hitting the database. That'll get you a lot more bang for your buck than optimizing the query.

How many rows in a database are TOO MANY?

I've a MySQL InnoDB table with 1,000,000 records. Is this too much? Or databases can handle this and more? I ask because I noticed that some queries (for example, getting the last row from a table) are slower (seconds) in the table with 1 millon rows than in one with 100.
I've a MySQL InnoDB table with 1000000 registers. Is this too much?
No, 1,000,000 rows (AKA records) is not too much for a database.
I ask because I noticed that some queries (for example, getting the last register of a table) are slower (seconds) in the table with 1 million registers than in one with 100.
There's a lot to account for in that statement. The usual suspects are:
Poorly written query
Not using a primary key, assuming one even exists on the table
Poorly designed data model (table structure)
Lack of indexes
I have a database with more than 97,000,000 records(30GB datafile), and having no problem .
Just remember to define and improve your table index.
So its obvious that 1,000,000 is not MANY ! (But if you don't index; yes, it is MANY )
Use 'explain' to examine your query and see if there is anything wrong with the query plan.
I think this is a common misconception - size is only one part of the equation when it comes to database scalability. There are other issues that are hard (or harder):
How large is the working set (i.e. how much data needs to be loaded in memory and actively worked on). If you just insert data and then do nothing with it, it's actually an easy problem to solve.
What level of concurrency is required? Is there just one user inserting/reading, or do we have many thousands of clients operating at once?
What levels of promise/durability and consistency of performance are required? Do we have to make sure that we can honor each commit. Is it okay if the average transaction is fast, or do we want to make sure that all transactions are reliably fast (six sigma quality control like - http://www.mysqlperformanceblog.com/2010/06/07/performance-optimization-and-six-sigma/).
Do you need to do any operational issues, such as ALTER the table schema? In InnoDB this is possible, but incredibly slow since it often has to create a temporary table in foreground (blocking all connections).
So I'm going to state the two limiting issues are going to be:
Your own skill at writing queries / having good indexes.
How much pain you can tolerate waiting on ALTER TABLE statements.
If you mean 1 million rows, then it depends on how your indexing is done and the configuration of your hardware. A million rows is not a large amount for an enterprise database, or even a dev database on decent equipment.
if you mean 1 million columns (not sure thats even possible in MySQL) then yes, this seems a bit large and will probably cause problems.
Register? Do you mean record?
One million records is not a real big deal for a database these days. If you run into any issue, it's likely not the database system itself, but rather the hardware that you're running it on. You're not going to run into a problem with the DB before you run out of hardware to throw at it, most likely.
Now, obviously some queries are slower than others, but if two very similar queries run in vastly different times, you need to figure out what the database's execution plan is and optimize for it, i.e. use correct indexes, proper normalization, etc.
Incidentally, there is no such thing as a "last" record in a table, from a logical standpoint they have no inherent order.
I've seen non-partitioned tables with several billion (indexed) records, that self-joined for analytical work. We eventually partitioned the thing but honestly we didn't see that much difference.
That said, that was in Oracle and I have not tested that volume of data in MySQL. Indexes are your friend :)
Assuming you mean "records" by "registers" no, it's not too much, MySQL scales really well and can hold as many records as you have space for in your hard disk.
Obviously though search queries will be slower. There is really no way around that except making sure that the fields are properly indexed.
The larger the table gets (as in more rows in it), the slower queries will typically run if there are no indexes. Once you add the right indexes your query performance should improve or at least not degrade as much as the table grows. However, if the query itself returns more rows as the table gets bigger, then you'll start to see degradation again.
While 1M rows are not that many, it also depends on how much memory you have on the DB server. If the table is too big to be cached in memory by the server, then queries will be slower.
Using the query provided will be exceptionally slow because of using a sort merge method to sort the data.
I would recommend rethinking the design so you are using indexes to retrieve it or make sure it is already ordered in that manner so no sorting is needed.

Does varchar result in performance hit due to data fragmentation?

How are varchar columns handled internally by a database engine?
For a column defined as char(100), the DBMS allocates 100 contiguous bytes on the disk. However, for a column defined as varchar(100), that presumably isn't the case, since the whole point of varchar is to not allocate any more space than required to store the actual data value stored in the column. So, when a user updates a database row containing an empty varchar(100) column to a value consisting of 80 characters for instance, where does the space for that 80 characters get allocated from?
It seems that varchar columns must result in a fair amount of fragmentation of the actual database rows, at least in scenarios where column values are initially inserted as blank or NULL, and then updated later with actual values. Does this fragmentation result in degraded performance on database queries, as opposed to using char type values, where the space for the columns stored in the rows is allocated contiguously? Obviously using varchar results in less disk space than using char, but is there a performance hit when optimizing for query performance, especially for columns whose values are frequently updated after the initial insert?
You make a lot of assumptions in your question that aren't necessarily true.
The type of the a column in any DBMS tells you nothing at all about the nature of the storage of that data unless the documentation clearly tells you how the data is stored. IF that's not stated, you don't know how it is stored and the DBMS is free to change the storage mechanism from release to release.
In fact some databases store CHAR fields internally as VARCHAR, while others make a decision about how to the store the column based on the declared size of the column. Some database store VARCHAR with the other columns, some with BLOB data, and some implement other storage, Some databases always rewrite the entire row when a column is updated, others don't. Some pad VARCHARs to allow for limited future updating without relocating the storage.
The DBMS is responsible for figuring out how to store the data and return it to you in a speedy and consistent fashion. It always amazes me how many people to try out think the database, generally in advance of detecting any performance problem.
The data structures used inside a database engine is far more complex than you are giving it credit for! Yes, there are issues of fragmentation and issues where updating a varchar with a large value can cause a performance hit, however its difficult to explain /understand what the implications of those issues are without a fuller understanding of the datastructures involved.
For MS Sql server you might want to start with understanding pages - the fundamental unit of storage (see http://msdn.microsoft.com/en-us/library/ms190969.aspx)
In terms of the performance implications of fixes vs variable storage types on performance there are a number of points to consider:
Using variable length columns can improve performance as it allows more rows to fit on a single page, meaning fewer reads
Using variable length columns requires special offset values, and the maintenance of these values requires a slight overhead, however this extra overhead is generally neglible.
Another potential cost is the cost of increasing the size of a column when the page containing that row is nearly full
As you can see, the situation is rather complex - generally speaking however you can trust the database engine to be pretty good at dealing with variable data types and they should be the data type of choice when there may be a significant variance of the length of data held in a column.
At this point I'm also going to recommend the excellent book "Microsoft Sql Server 2008 Internals" for some more insight into how complex things like this really get!
The answer will depend on the specific DBMS. For Oracle, it is certainly possible to end up with fragmentation in the form of "chained rows", and that incurs a performance penalty. However, you can mitigate against that by pre-allocating some empty space in the table blocks to allow for some expansion due to updates. However, CHAR columns will typically make the table much bigger, which has its own impact on performance. CHAR also has other issues such as blank-padded comparisons which mean that, in Oracle, use of the CHAR datatype is almost never a good idea.
Your question is too general because different database engines will have different behavior. If you really need to know this, I suggest that you set up a benchmark to write a large number of records and time it. You would want enough records to take at least an hour to write.
As you suggested, it would be interesting to see what happens if you write insert all the records with an empty string ("") and then update them to have 100 characters that are reasonably random, not just 100 Xs.
If you try this with SQLITE and see no significant difference, then I think it unlikely that the larger database servers, with all the analysis and tuning that goes on, would be worse than SQLITE.
This is going to be completely database specific.
I do know that in Oracle, the database will reserve a certain percentage of each block for future updates (The PCTFREE parameter). For example, if PCTFREE is set to 25%, then a block will only be used for new data until it is 75% full. By doing that, room is left for rows to grow. If the row grows such that the 25% reserved space is completely used up, then you do end up with chained rows and a performance penalty. If you find that a table has a large number of chained rows, you can tune the PCTFREE for that table. If you have a table which will never have any updates at all, a PCTFREE of zero would make sense
In SQL Server varchar (except varchar(MAX)) is generally stored together with the rest of the row's data (on the same page if the row's data is < 8KB and on the same extent if it is < 64KB. Only the large data types such as TEXT, NTEXT, IMAGE, VARHCAR(MAX), NVARHCAR(MAX), XML and VARBINARY(MAX) are stored seperately.