Update necessary fields only in VoltDB - sql

I have a table with about 50 columns. When a row changes, I don't know in advance which columns will change, and I don't want to handle every permutation and combination of columns when updating the table.
So in that situation I update all 50 columns, which, I know, takes much longer than I'd like when dealing with a huge number of updates.
One solution I have is to create different sets of fields that are frequently updated together and design my application around them. I know that will require a change whenever a new field is added to my table.
UPDATE TBLX SET NAME = ? WHERE ID = ?;
Result of Explain Update...
UPDATE
INDEX SCAN of "TBLX" using "TBLX_ID"
scan matches for (U_ID = ?5), filter by (column#0 = ?6)
Another approach is to write a query using CASE WHEN ... THEN (as shown below). This way my code will still need updating when columns are added, but not as much as in the first approach.
UPDATE TBLX SET NAME = CASE WHEN (? != '####') THEN ? ELSE NAME END WHERE ID = ?;
Result of Explain Update...
UPDATE
INDEX SCAN of "TBLX" using "TBLX_ID"
scan matches for (U_ID = ?3), filter by (column#0 = ?4)
So my question is about the internals of query execution.
How will the two types of query be treated, and which one will run faster?
What I want to understand is whether the executor will ignore the part of the query where I am not changing the column's value, i.e. where it assigns the same value back to the column.

The plans show that both queries are using a match on the TBLX_ID index, which is the fastest way to find the particular row or rows to be updated. If it is a single row, this should be quite fast.
The difference between these two queries is essentially what it is doing for the update work once it has found the row. While the plan doesn't show the steps it will take when updating one row, it should be fast either way. At that point, it's native C++ code updating a row in memory that it has exclusive access to. If I had to guess, the one using the CASE clause may take slightly longer, but it could be a negligible difference. You'd have to run a benchmark to measure the difference in execution times to be certain, but I would expect it to be fast in both cases.
What would be more significant than the difference between these two updates is how you handle updating multiple columns. For example, the cost of finding the affected row may be higher than the logic of the actual updates to the columns. If you designed it so that updating n columns requires queuing n SQL statements, then the engine has to execute n statements and use the same index to find the same row n times; all of that overhead would be much more significant. If instead you had a single, more complex UPDATE statement with many parameters, where you can pass in values that either change each column or set it to its current value, then the engine only has to execute one statement and find the row once. Even though that statement looks complex, it would probably be faster. Faster still may be to simply update all of the columns to the new values, whether or not they differ from the current values.
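For illustration, a single parameterized statement along those lines might look like the sketch below. NAME and ID are from the question; CITY and STATUS are hypothetical extra columns, and '####' is the "don't change" sentinel the asker already uses. The application always binds two parameters per column (the sentinel test and the new value) plus the ID, so the statement never changes shape.
UPDATE TBLX
SET NAME   = CASE WHEN ? != '####' THEN ? ELSE NAME   END,
    CITY   = CASE WHEN ? != '####' THEN ? ELSE CITY   END,
    STATUS = CASE WHEN ? != '####' THEN ? ELSE STATUS END
WHERE ID = ?;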
If you can test this and run a few hundred examples, then the "exec @Statistics PROCEDUREDETAIL 0;" output will show the average execution time for each of the SQL statements as well as for the entire procedure. That should provide the metrics necessary to find the optimal way to do this.

Related

Azure SQL query performance significantly degrades when WHERE clause returns a high proportion of rows

I have been trying to get to the bottom of a performance bottleneck in a web app I am developing. I have managed to identify the SQL query that is causing the problem, but I am uncertain how to resolve it. The basic query is:
SELECT *
FROM Table
WHERE ColumnA = 0
ORDER BY AnotherColumn
OFFSET 0 ROWS FETCH NEXT 20 ROWS ONLY
ColumnA is of type BIT, is nullable, and has no default value. At the moment every row (around 290,000 in the table) holds a value of 0. Currently the query takes around 1 minute and 50 seconds to complete.
What I find odd is that by changing a small proportion of the values of ColumnA in the database to 1 the performance dramatically increases.
Simply by running:
UPDATE Table SET ColumnA = 1 WHERE ID % 100 = 0
which switches the value in about 1% of the rows, the query time is reduced to 7 seconds - over 90% quicker.
I don't understand why there is such a dramatic difference and can think of no way to optimise the query to resolve the issue. Removing the WHERE clause entirely results in the same ~7 second query time so I do not think it has to do with the data which is being returned.
I am using Azure SQL with EF Core, but have been running the above queries in SSMS to try to get to the bottom of the issue.
The problem here was the lack of indexing. From your description, you only had one index on your table, on the primary key, and nothing else. This means that lookups on the primary key are nice and quick; for anything else, not so much.
When you start filtering on columns, especially in the WHERE, ORDER BY, ON, etc. clauses, on tables with a lot of data and without indexes on those columns, things start to slow down. Why? Because SQL Server doesn't know where to look for that data, so it has to check every row.
Consider your data, with its ID column and ColumnA. ColumnA happens to be derivable from ID (let's just go with ID % 100), but it is stored as a persisted value, not calculated from ID. Then you ask SQL Server "Can I have all the rows where the value of ColumnA is 0, please?" SQL Server has no idea what those rows contain, and it has no index to help it, so off it goes, checking every single row as it works through all the IDs.
Now, imagine you have an index on that column. When you ask the server the same question, it can look at the index. The index holds an ordered list of ColumnA values and, for each one, the corresponding ID (which is where the data is stored). SQL Server can then see that all the ColumnA values of 0 are placed together in that list (the index), and it simply goes to each ID it needs. It doesn't end up checking the value of ColumnA for every row.
This is, of course, a very simplistic way of looking at indexes; they are, in actuality, far more complex. Indexes normally speed up getting data out of your server, but it's worth noting that they slow down some operations, such as INSERT, because when the data is written the indexes have to be updated as well. That also means more IO, so slower disks will cause performance issues here too (although they will for a SELECT as well). UPDATE commands can be faster, depending on what you're doing.
As I said, this is a very basic description, but it might help you understand a little more. This is in no way me saying you should put an index on every column. Knowing which columns to index, and how, is very important, and it can in no way be taught by a single answer on SO.
Adding an index on (ColumnA, AnotherColumn) solved the problem. Now the query takes less than 1 second. Thanks to Larnu for pointing me in the right direction. I would still appreciate an answer that clarifies why the performance was so bad in the first place.
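For reference, an index like the one that solved this can be created with something along these lines (Table is the placeholder name from the question; use your real table and a descriptive index name):
CREATE NONCLUSTERED INDEX IX_Table_ColumnA_AnotherColumn
ON [Table] (ColumnA, AnotherColumn);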

SQL Server Query running slow when changing constant in where clause

Hello all and thanks in advance. I have a view that when queried with no where clause takes just over 0 seconds to return ~8600 rows. However, when I query with a where clause such as:
SELECT * FROM myView WHERE myID = 123
depending on what constant I put in place of 123 the query execution time changes considerably.
Now, "considerably" in this case means the difference between just above 0 seconds and 3 to 4 seconds. But the view is called frequently and repeatedly for certain tasks which makes 3 seconds turn into 30 or more seconds.
While I cannot give the code for the view itself, what I can confirm is that:
The view is composed of joins over six standard tables (nothing special about them).
While there may not always be records in table A that link up with table B, thus creating null columns in the results, I have confirmed that such instances are not consistently resulting in the longer or shorter query times.
The view itself has no clauses beyond the standard Select, From, and Left Outer Join clauses.
Certain IDs always result in long query times and the others always result in short query times
I have dropped and created the view in between queries on the off chance that there was a cached execution plan that was sub-optimal.
If these known variables are not enough to narrow the possibilities down to 2 or 3 likely causes, I would still like to know what THEORETICAL problems might be causing this issue, just to expand my understanding.
Thanks Again,
ProtoNoob
I would assume that the statistics for the tables are outdated and do not match the real content of the tables. That would mean the optimizer, relying on the statistics, e.g. assumes that a value you use in the WHERE clause does not occur in the data at all, and hence that the result set will be rather small, while in reality it contains many rows. Or the other way round: relying on the statistics, the optimizer could assume that, say, 20% of the rows of the table have this value, and hence that it is better to do a full table scan than to first access index pages to evaluate the WHERE condition, then jump to a data page for almost every index entry to read the data, and in the end have to read nearly all pages anyway. Or it could access the tables in the wrong order, or ... But in reality the value is not contained in the table at all, and the outdated statistics just lead to a wrong plan.
One hint pointing to outdated statistics would be if the query plan shows a huge difference between estimated and actual number of rows.
Which DBMS are you using? If SQL Server, you can see the current statistics using DBCC SHOW_STATISTICS and refresh the statistics for selected columns and tables using the UPDATE STATISTICS statement. There are more views and procedures around this subject; most of them are linked from one of these two articles.
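For example, in SQL Server the following shows and then refreshes statistics (the table and statistics names are placeholders):
-- Look at the current statistics for one index / statistics object
DBCC SHOW_STATISTICS ('dbo.MyTable', 'IX_MyTable_MyID');
-- Rebuild the statistics for the whole table with a full scan
UPDATE STATISTICS dbo.MyTable WITH FULLSCAN;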

trigger or computed field

I wonder, when I want to display the result of a calculation over some fields in the same table, whether I should do it as a computed field or by using a "before insert or update" trigger.
Note: I found a similar question, but it was for SQL Server, and I need to know whether the computed field will affect performance when I display the result in a grid with many records visible.
Example of the calculation I use now in a computed field:
field_1 * (
iif(field_2 is null,0,1)
+iif(field_3 is null,0,1)
+iif(field_4 is null,0,1)
+iif(field_5 is null,0,1))
A trigger only works if you're storing the information in the table, because they only get fired when an actual INSERT, UPDATE, or DELETE happens. They have no effect on SELECT statements. Therefore, the actual question becomes "Should I calculate column values in my SELECT statement, or add a column to store them?".
There's no need to store a value that can be easily calculated in the SELECT, and there's seldom a performance impact when doing a simple calculation like the one you've included here.
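For instance, the expression from the question can be computed on the fly in the SELECT. This is only a sketch: it assumes the iif() syntax shown above and a hypothetical table name TBL.
SELECT t.*,
       t.field_1 * (
           iif(t.field_2 is null, 0, 1)
         + iif(t.field_3 is null, 0, 1)
         + iif(t.field_4 is null, 0, 1)
         + iif(t.field_5 is null, 0, 1)) AS calc_value
FROM TBL t;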
Whether you should store it depends on many factors, such as how frequently the data changes, and how large the result set will be for your typical query. The more rows you return, the greater the impact of the calculations, and at some point the process of calculating becomes more costly than the increased storage requirements adding a column incurs. However, if you can limit the number of rows returned by your query, the cost of calculations can be so negligible that the overhead of maintaining an extra column of data for every row when it's not needed can be higher, as every row that is inserted or updated will have the trigger execute even when that data isn't being accessed.
However, if your typical query returns a very large number of rows or the calculation is extremely complex, the calculation may become so expensive that it's better to store the data in an actual column where it can be quickly and easily retrieved. If data is frequently inserted or updated, though, the execution of the trigger slows those operations, and if they happen much more frequently than the large SELECT queries then it may not be worth the tradeoff.
There's at least one disadvantage (which I failed to mention, but you asked about in a comment below) to actually storing the calculation results in a column. If your calculation (formula) logic changes, you have to:
Disable the trigger
Update all of the rows with a new value based on the new calculation
Edit the trigger to use the new calculation
Re-enable the trigger
With the calculation being done in your query itself, you simply change the query's SQL and you're done.
So my answer here is
It's generally better to calculate the column values on the fly unless you have a clear reason not to do so, and
"A clear reason to do so" means that you have an actual performance impact you can prove is related to the calculation, or you have a need to SELECT very large numbers of rows with a fairly intense calculation.
Performance should be fine, except with larger tables where your computed field becomes part of a WHERE clause. The other thing is that, even if the value can be computed from other fields, your requirements may allow the calculated value to be overwritten for some reason; in that case you need a real physical field as well.

When do sql optimizations become overkill?

I'm updating tables with millions of records and I need to be as efficient as possible. Is there a point at which adding more criteria to the where clause will actually hurt rather than help?
For example, if I know I want to set a column to 3 I could use this query:
update mytable set col = 3
Or I could update the record only if it's different
update mytable set col = 3 where col <> 3
I could also filter it so it only updates records added since the last time I ran this process
update mytable set col = 3 where col <> 3 and createDate > #lastRunDate
And perhaps I could look for more things in additional columns.
I guess my question is whether there is a point where the cost of looking at additional columns outweighs the cost of the update itself, and whether there's a principle you can use to determine where to draw the line.
Update
So here's the principle I'm trying to piece together based on what was said. Feel free to argue with this and I'll update it accordingly:
If there are no indexed columns to filter on, add as many criteria as possible to limit the records being updated, since a full table scan is going to happen anyway.
If the difference in records between filtering on only indexed columns and filtering on all possible columns is marginal, only use the indexed columns and avoid the full table scan.
If you have a mix of indexed and non-indexed columns, definitely use the indexed columns if you can, and only use non-indexed columns if... [[I'm still struggling with this part. What's the threshold for introducing the non-indexed columns in the where clause?]]
Update #2
Sounds like I have my answer.
If you have an index on "col", then running your first query will update millions of rows regardless; your second query would potentially only update a few and find those quickly if there's an index available. If you don't have an index on that column, the effect will be marginal since a full table or index scan must occur to check all rows in your table (you'll just have fewer actual updates, but that's it).
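If such an index on "col" doesn't exist yet, it can be created with something like this (the index name is only an example), bearing in mind that the index itself then has to be maintained on every update of col:
CREATE INDEX ix_mytable_col ON mytable (col);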
The whole point of restricting your queries using WHERE clauses is to reduce the scope of your query, i.e. the number of rows SQL Server has to look at. Less data to process is always faster than doing the work on all of the millions of rows.
In response to your update: the main goal of using a WHERE clause is to reduce the number of rows you need to inspect / touch. If you have a means (typically an index) to reduce that number from 100% to a few percent, then it's definitely worth it. That's the whole point of having indices (mostly for SELECTs, but applies to other operations, too, of course).
If you have a suitable index, and thus you can pluck out a few hundred rows to check against a criteria versus having to inspect millions of rows, you'll always be faster. If you have a good book index in a bookstore that guides you easily to the two shelves where the books that interest you are located, you'll find what you're looking for more quickly than when you have to criss-cross the whole bookstore since there's no index available.
There obviously is a point where yet another criteria or index doesn't help anymore. If that's the case, typically yet another WHERE clause won't really help much - or at all. But in this case, the SQL query optimizer will find those cases and filter them out (possibly even just ignoring them when deciding on what the best query execution plan is).
This really comes down to index usage and query optimization. I would suggest looking at the query plan before making any decisions.
Adding indexed fields to the where clause will often improve query time, however, adding non-indexed fields can result in table scans which will slow your query.
My suggestion is: write a query that works, look at the execution time, and work to reduce it to an acceptable level by looking at the query plan. Don't over-optimize; go for the acceptable solution.
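For example, in SQL Server (which the answers above appear to assume) you can compare candidate UPDATE statements by turning on I/O and timing statistics before running each one:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run one candidate at a time and compare the reads and elapsed time reported
UPDATE mytable SET col = 3 WHERE col <> 3;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;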

Fastest way to count total number and then list a set of records in MySQL

I have a SQL statement to select results from a table. I need to know the total number of records found, and then list a sub-set of them (pagination).
Normally, I would make 2 SQL calls:
one for counting the total number of records (using COUNT),
the other for returning the sub-set (using LIMIT).
But this way you are really duplicating the same operation on MySQL: the WHERE clauses are the same in both calls.
Isn't there a way to gain speed by NOT duplicating the SELECT on MySQL?
That first query is going to result in data being pulled into the cache, so presumably the second query should be fast. I wouldn't be too worried about this.
You have to make both SQL queries, and the COUNT is very fast with no WHERE clause. Cache the data where possible.
You should just run the COUNT a single time and then cache it somewhere. Then you can just run the pagination query as needed.
If you really don't want to run the COUNT() query- and as others have stated, it's not something that slows things down appreciably- then you have to decide on your chunk size (ie the LIMIT number) up front. This will save you the COUNT() query, but you may end up with unfortunate pagination results (like 2 pages where the 2nd page has only 1 result).
So, a quick COUNT() and then a sensible LIMIT set-up, or no COUNT() and an arbitrary LIMIT that may increase the number of more expensive queries you have to do.
You could try selecting just one field (say, the IDs) and see if that helps, but I don't think it will - I imagine the biggest overhead is MySQL finding the correct rows in the first place.
If you simply want to count the total number of rows in the entire table (i.e. without a WHERE clause) then I believe SELECT COUNT(*) FROM table is fairly efficient.
Otherwise, the only solution if you need to have the total number visible is to select all the rows. However, you can cache this in another table. If you are selecting something from a category, say, store the category UID and the total rows selected. Then whenever you add/delete rows, count the totals again.
Another option - though it may sacrifice usability a little - is to only select the rows needed for the current page and next page. If there are some rows available for the next page, add a "Next" link. Do the same for the previous page. If you have 20 rows per page, you're selecting at most 60 rows on each page load, and you don't need to count all the rows available.
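A sketch of a slightly simplified variant of that idea, assuming a hypothetical articles table and 20 rows per page: fetch one extra row so the application knows whether to show a "Next" link, without ever counting the full result set.
-- page 3 with 20 rows per page: skip 40 rows, ask for 21 (20 to display + 1 probe row)
SELECT id, title
FROM articles
WHERE category_id = 7
ORDER BY created_at DESC
LIMIT 21 OFFSET 40;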
If you write your query to include one column that contains the count (in every row), and then the rest of the columns from your second query, you can:
Avoid the second database round-trip (which is probably more expensive than the query itself anyway).
Increase the likelihood that MySQL's parser will generate an optimized execution plan that reuses the base query.
Make the operation atomic.
Unfortunately, it also creates a little repetition, returning more data than you really need. But I would expect it to be much more efficient anyway. This is the sort of strategy used by a lot of ORM products when they eagerly load objects from connected tables with many-to-one or many-to-many relationships.
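On MySQL 8.0 and later, one way to get such a count column in the same query is a window function. This is a sketch only; the table and columns are hypothetical.
SELECT COUNT(*) OVER () AS total_rows,   -- total matching rows, repeated in every row
       id, title
FROM articles
WHERE category_id = 7
ORDER BY created_at DESC
LIMIT 20 OFFSET 0;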
As others have already pointed out, it's probably not worth much concern in this case - as long as 'field' is indexed, both SELECTs will be extremely fast.
If you have (for whatever reason) a situation where that's not enough, you could create a memory-based temporary table (i.e. a temporary table backed by the memory storage engine), and select your records into that temporary table. Then you could do selects from the temporary table and be quite well assured they'll be fast. This can use a lot of memory though (i.e. it forces that data to all stay in memory for the duration), so it's pretty unfriendly unless you're sure that:
The amount of data is really small;
You have so much memory it doesn't matter; or
The machine will be nearly idle otherwise anyway.
The main time this comes in handy is if you have a really complex select that can't avoid scanning all of a large table (or more than one) but yields only a tiny amount of data.
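A sketch of that temporary-table approach in MySQL (table, columns, and filter are hypothetical, and remember the MEMORY engine keeps all of this data in RAM):
-- materialize the expensive result once
CREATE TEMPORARY TABLE tmp_results ENGINE=MEMORY
    SELECT id, title, created_at
    FROM articles
    WHERE category_id = 7;          -- the complex, expensive SELECT goes here
-- cheap follow-up queries against the small in-memory copy
SELECT COUNT(*) FROM tmp_results;
SELECT * FROM tmp_results ORDER BY created_at DESC LIMIT 20;
DROP TEMPORARY TABLE tmp_results;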