SQL Read/Write efficiency - sql

Is there any diffrenece in the performance of read and write operations in SQL? Using Linq to SQL in an ASP.NET MVC application, I often update many values in one of my tables in single posts (during this process, many posts of this type will come in rapidly from the user, although the user is unable to submit new data until the previous update is complete). My current implementation is to loop through the input (a list of the current values for each row), and write them to the field (nullable int). I wonder if there would be any performance difference if instead I read the current db value, and only wrote if it has changed. Most of these operations change the values for roughly 1/4 to 2/3 of the rows, some change fewer, and few change more than 2/3 of the rows.
I don't know much about the comparative speeds of these operations (or if there is even any difference). Is there any benefit to be gained from doing this? If so, what table sizes would benefit the most/not benefit at all, and would there be any percentage of the rows changing that would be a threshold for this improvement?

It's always faster to read.
A write is actually always a read followed by a write.
SQL needs to know which row to write to, which involves reading either an index or the table itself in a seek or scan operation, then writing to the appropriate row.
Writing also needs to update any applicable indexes. Depending on the circumstance, the index may get "updated" even when the data doesn't change.
As a very general rule, it's a good idea only to modify the data that needs to be changed.

Related

Best practice to update bulk data in table used for reporting in SQL

I have created a table for reporting purpose where I am storing data for about 50 columns and at some time interval my scheduler executes a service which processes other tables and fill up data in my flat table.
Currently I am deleting and inserting data in that table But I want to know if this is the good practice or should I check every column in every row and update it if any change found and insert new record if data does not exists.
FYI, total number of rows which are being reinserted is 100k+.
This is a very broad question that can only really be answered with access to your environment and discussion on your personal requirements. Obviously this is not possible via Stack Overflow.
This means you will need to make this decision yourself.
The information you need to understand to be able to do this are the types of table updates available and how you can achieve them, normally referred to as Slowly Changing Dimensions. There are several different types, each with their own advantages, disadvantages and optimal use cases.
Once you understand the how of getting your data to incrementally update as required, you can then look at the why and whether the extra processing logic required to achieve this is actually worth it. Your dataset of a few hundred thousand rows of data is not large and probably may therefore not need this level of processing just yet, though that assessment will depend on how complex and time consuming your current process is and how long you have to run it.
It is probably faster to repopulate the table of 100k rows. To do an update, you still need to:
generate all the rows to insert
compare values in every row
update the values that have changed
The expense of updating rows is heavily on the logging and data movement operations at the data page level. In addition, you need to bring the data together.
If the update is updating a significant portion of rows, perhaps even just a few percent of them, then it is likely that all data pages will be modified. So the I/O is pretty similar.
When you simply replace the table, you will start by either dropping the table or truncating it. Those are relatively cheap operations because they are not logged at the row level. Then you are inserting into the table. Inserting 100,000 rows from one table to another should be pretty fast.
The above is general guidance. Of course, if you are only changing 3 rows in the table each day, then update is going to be faster. Or, if you are adding a new layer of data each day, then just an insert, with a handful of changed historical values might be a fine approach.

Is selecting fewer SQL columns making the request faster? [duplicate]

This question already has answers here:
select * vs select column
(12 answers)
Closed 8 years ago.
I have a rails/backbone single page application processing a lot of SQL queries.
Is this request:
SELECT * FROM `posts` WHERE `thread_id` = 1
faster than this one:
SELECT `id` FROM `posts` WHERE `thread_id` = 1
How big is the impact of selecting unused columns on the query execution time?
For all practical purposes, when looking for a single row, the difference is negligible. As the number of result rows increases, the difference can become more and more important, but as long as you have an index on thread_id and you are not more than 10-20% of all the rows in the table, here is still not a big issue. FYI the differentiation factor comes from the fact that selecting * will force, for each row, an additional lookup in the primary index. Selecting only id can be satisfied just by looking up the secondary index on thread_id.
There is also the obvious cost associated with any large field, like BLOB documents or big test fields. If the posts fields have values measuring tens of KBs, then obviously retrieving them adds extra transfer cost.
All these assume a normal execution engine, with B-Tree or ISAM row-mode storage. Almost all 'tables' and engines would fall into this category. The difference would be significant if you would be talking about a columnar storage, because columnar storage only reads the columns of interests and reading extra columns unnecessary impacts more visible such storage engines.
Having or not having an index on thread_id will have a hugely more visible impact. Make sure you have it.
Selecting fewer columns is generally faster. Unfortunately, it is hard to say exactly how much the time difference will be. It may depend on things like how many columns there are and what data is in them (for example, large CLOBS can take longer to fetch than simple integers), what indexes have been set up, and the network latency between you and the database server.
For a definitive answer on the time difference, the best I can say is do both queries and see how long each takes.
There will be two components: The query time and the I/O time (you could also break down the I/O into server I/O and server-client (network) I/O).
Selecting just one column will be faster in both respects - certainly because there's less data to fetch and transmit, but also because the column in question could be a part of whatever index is used to find the data, so the server may not have to look up the actual data pages - it may be able to pull the data directly from the index.
The performance difference is almost certainly insignificant for your application. Try it and see whether you can detect a difference; it is very simple to try.

Improve performance of querys in Postgresql with an index

I have in PostgreSQL tables, each with millions of records and more that one hundred fields.
One of them is a date field, which we filter by this in our queries. The creation of an index for this date field improved the performance of the queries that read an small range of dates, but in big range of dates the performance decreased...
I must prioritize one over the other? The performance in small ranges can be improved without decreasing the big range queries?
Queries in PostgreSQL cannot be answered just using the information in an index. Whether or not the row is visible, from the perspective of the query that is executing, is stored in the main row itself. So when you add an index to something, and execute a query that uses it, there are two steps involved:
Navigate the index to determine which data blocks are used
Retrieve those blocks and return the rows that match the query
It is therefore possible that answering a query with an index can take longer than just going directly to the data blocks and fetching the rows. The most common case where this happens is if you are actually grabbing a large portion of the data. Typically if more than 20% of the table is used, it's considered fast to just sequentially access it. Sometimes the planner thinks less than 20% will be accessed, so the index is preferred, but that's not true; that's one way adding an index can slow a query. This may be the situation you're seeing, based on your description--if the large ranges are touching more of the table than the optimizer estimates, using an index can be a net slowdown.
To figure this out, the database collects statistics about each column in each table, to determine whether a particular WHERE condition is selective enough to use an index. The idea is that you need to have saved so many blocks by not reading the whole table that adding the index I/O on top of it is still a net win.
This computation can go wrong, such that you end up doing more I/O than had you just read the table directly, in a couple of cases. The cause of most of them show up if you run the query using EXPLAIN ANALYZE. If the "expected" values versus the "actual" numbers are very different, this can suggest the optimizer had bad statistics on the table. Another possibility is that the optimizer just made a mistake about how selective the query is--it thought it would only return a small number of rows, but it actually returns most of the table. Here, again, better statistics is the normal way to start working on that. If you're on PostgreSQL 8.3 or earlier, the amount of statistics collected is very low by default.
Some workloads end up adjusting the random_page_cost tunable as well, which controls where this index vs. table scan trade-off happens at. That's only something to consider after the stats information is checked though. See Tuning Your PostgreSQL Server for an intro to several things you can adjust here.
I'd try several things:
increase DB cache parameters
add the index on that date field
redesign/modify the application to work with smaller ranges (althogh this suggestion might seem obvious, it is usually first to be thrown away)
The creation of an index for this date field improved the performance of the queries that read an small range of dates, but in big range of dates the performance decreased...
Try clustering your table using that index. The performance decrease might be due to the entire table getting opened on large ranges. And if so, clustering the table along that index would lead to less disk seeks.
Two suggestions:
1) Investigate the use of table inheritance for time-series data. For example, create a child table per month and then INDEX the date on each table. PostgreSQL is smart enough to only perform index_scan's on the child tables that have the actual data in the date range. Once the child table is "sealed" because it is a new month, run CLUSTER on the table to sort the data by date.
2) Look at creating a bunch of INDEX's that use WHERE clauses.
Suggestion #1 is going to be the winner long term but will take some work to setup (but will scale/run forever), but suggestion #2 may be a quick interim fix if you have a limited date range that you care about scanning. Remember, you can only use IMMUTABLE functions in your INDEX's WHERE clause.
CREATE INDEX tbl_date_2011_05_idx ON tbl(date) WHERE date >= '2011-05-01' AND date <= '2011-06-01';

Performance of returning entire tables containing blog text as opposed to selecting specific columns

I think this is a pretty common scenario: I have a webpage that's returning links and excerpts to the 10 most recent blog entries.
If I just queried the entire table, I could use my ORM mapped object, but I'd be downloading all the blog text.
If I restricted the query to just the columns that I need, I'd be defining another class that'll hold just those required fields.
How bad is the performance hit if I were to query entire rows? Is it worth selecting just what I need?
The answer is "it depends".
There are two things that affect performance as far as column selection:
Are there covering indexes? E.g. if there is an index containing ALL of the columns in the smaller query, then a smaller column set would be extremely benefifical performance wise, since the index would be read without reading any rows themselves.
Size of columns. Basically, count how big the size of the entire row is, vs. size of only the columns in smaller query.
If the ratio is significant (e.g. full row is 3x bigger), then you might have significant savings in both IO (for retrieval) and network (for transmission) cost.
If the ratio is more like 10% benefit, it might not be worth it as far as DB performance gain.
It depends, but it will never be as efficient as returning only the columns you need (obviously). If there are few rows and the row sizes are small, then network bandwidth won't be affected too badly.
But, returning only the columns you need increases the chance that there is a covering index that can be used to satisfy the query, and that can make a big difference in the time a query takes to execute.
,Since you specify that it's for 10 records, the answer changes from "It Depends" to "Don't spend even a second worrying about this".
Unless your server is in another country on a dialup connection, wire time for 10 records will be zero, regardless of how many bytes you shave off each row. It's simply not something worth optimizing for.
So for this case, you get to set your ORM free to grab you those records in the least efficient manner it can come up with. If your situation changes, and you suddenly need more than, say, 1000 records at once, then you can come back and we'll make fun of you for not specifying columns, but for now you get a free pass.
For extra credit, once you start issuing this homepage query more than 10x per second, you can add caching on the server to avoid repeatedly hitting the database. That'll get you a lot more bang for your buck than optimizing the query.

Performance benefit when SQL query is limited vs calling entire row?

How much of a performance benefit is there by selecting only required field in query instead of querying the entire row? For example, if I have a row of 10 fields but only need 5 fields in the display, is it worth querying only those 5? what is the performance benefit with this limitation vs the risk of having to go back and add fields in the sql query later if needed?
It's not just the extra data aspect that you need to consider. Selecting all columns will negate the usefulness of covering indexes, since a bookmark lookup into the clustered index (or table) will be required.
It depends on how many rows are selected, and how much memory do those extra fields consume. It can run much slower if several text/blobs fields are present for example, or many rows are selected.
How is adding fields later a risk? modifying queries to fit changing requirements is a natural part of the development process.
The only benefit I know of explicitly naming your columns in your select statement is that if a column your code is using gets renamed your select statement will fail before your code. Even better if your select statement is within a proc, your proc and the DB script would not compile. This is very handy if you are using tools like VS DB edition to compile/verify DB scripts.
Otherwise the performance difference would be negligible.
The number of fields retrieved is a second order effect on performance relative to the large overhead of the SQL request itself -- going out of process, across the network to another host, and possibly to disk on that host takes many more cycles than shoveling a few extra bytes of data.
Obviously if the extra fields include a megabyte blob the equation is skewed. But my experience is that the transaction overhead is of the same order, or larger, than the actual data retreived. I remember vaguely from many years ago than an "empty" NOP TNS request is about 100 bytes on the wire.
If the SQL server is not the same machine from which you're querying, then selecting the extra columns transfers more data over the network (which can be a bottleneck), not forgetting that it has to read more data from the disk, allocate more memory to hold the results.
There's not one thing that would cause a problem by itself, but add things up and they all together cause performance issues. Every little bit helps when you have lots of either queries or data.
The risk I guess would be that you have to add the fields to the query later which possibly means changing code, but then you generally have to add more code to handle extra fields anyway.