When I want to display the result of a calculation over some fields in the same table, should I do it as a computed field or by using a "before insert or update" trigger?
Note: I found a similar question, but it was for SQL Server. I need to know whether the computed field will affect performance when the result is displayed in a grid with many records visible.
Example of the calculation I use now in a computed field:
field_1 * (
iif(field_2 is null,0,1)
+iif(field_3 is null,0,1)
+iif(field_4 is null,0,1)
+iif(field_5 is null,0,1))
A trigger only works if you're storing the information in the table, because they only get fired when an actual INSERT, UPDATE, or DELETE happens. They have no effect on SELECT statements. Therefore, the actual question becomes "Should I calculate column values in my SELECT statement, or add a column to store them?".
There's no need to store a value that can be easily calculated in the SELECT, and there's seldom a performance impact when doing a simple calculation like the one you've included here.
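For example, a minimal sketch of doing the calculation directly in the SELECT (the table name mytable and the alias calc_total are assumptions, not taken from your schema):
select
    field_1,
    field_2,
    field_3,
    field_4,
    field_5,
    field_1 * (
        iif(field_2 is null, 0, 1)
      + iif(field_3 is null, 0, 1)
      + iif(field_4 is null, 0, 1)
      + iif(field_5 is null, 0, 1)) as calc_total
from mytable;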
Whether you should store it depends on many factors, such as how frequently the data changes and how large the result set of your typical query is. The more rows you return, the greater the impact of the calculations, and at some point calculating becomes more costly than the extra storage an added column requires. However, if you can limit the number of rows your query returns, the cost of the calculation can be negligible, while the overhead of maintaining an extra column of data for every row can be higher than it is worth: every row that is inserted or updated fires the trigger, even when that data is never read.
However, if your typical query returns a very large number of rows or the calculation is extremely complex, the calculation may become so expensive that it's better to store the data in an actual column where it can be quickly and easily retrieved. If data is frequently inserted or updated, though, the execution of the trigger slows those operations, and if they happen much more frequently than the large SELECT queries then it may not be worth the tradeoff.
There's at least one disadvantage to actually storing the calculation results in a column (which I failed to mention, but you asked about in a comment below). If your calculation (formula) logic changes, you have to:
Disable the trigger
Update all of the rows with a new value based on the new calculation
Edit the trigger to use the new calculation
Re-enable the trigger
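For context, a minimal sketch of what the trigger-based alternative might look like, assuming Firebird PSQL and a stored column named calc_total (both assumptions):
set term ^ ;
create trigger trg_mytable_calc for mytable
active before insert or update position 0
as
begin
    -- keep the stored column in sync with its source fields
    new.calc_total = new.field_1 * (
        iif(new.field_2 is null, 0, 1)
      + iif(new.field_3 is null, 0, 1)
      + iif(new.field_4 is null, 0, 1)
      + iif(new.field_5 is null, 0, 1));
end^
set term ; ^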
With the calculation being done in your query itself, you simply change the query's SQL and you're done.
So my answer here is:
It's generally better to calculate the column values on the fly unless you have a clear reason not to, and
"a clear reason not to" means that you have an actual performance impact you can prove is related to the calculation, or you have a need to SELECT very large numbers of rows with a fairly intensive calculation.
Performance should be fine, except with larger tables where your computed field becomes part of a WHERE clause. The other thing to consider: even if the value is computed from other fields, your requirements may allow the calculated value to be overwritten for some reason. In that case you need a real, physical field as well.
Related
I have a table with about 50 columns. Whenever a row changes I don't know in advance which columns will change, and I don't want to deal with every permutation and combination when updating the table.
So when I have to update, I currently update all 50 columns, which, I know, takes much more time than I expect when dealing with a huge number of updates.
One solution is to create different sets of fields that are frequently updated together and design my application around them, which I know will require changes whenever a new field is added to my table.
UPDATE TBLX SET NAME = ? WHERE ID = ?;
Result of Explain Update...
UPDATE
INDEX SCAN of "TBLX" using "TBLX_ID"
scan matches for (U_ID = ?5), filter by (column#0 = ?6)
Another approach is to write a query with CASE WHEN ... THEN (as shown below). This way my code will still need updating, but not as much as it would in the first approach.
UPDATE TBLX SET NAME = CASE WHEN (? != '####') THEN ? ELSE NAME END WHERE ID = ?;
Result of Explain Update...
UPDATE
INDEX SCAN of "TBLX" using "TBLX_ID"
scan matches for (U_ID = ?3), filter by (column#0 = ?4)
So my question is about the internals of query execution.
How will both types of query be treated, and which one will run faster?
What I want to understand is whether the executor will ignore the part of the query where I am not changing the column's value, i.e. where I assign the same value back to the column.
The plans show that both queries are using a match on the TBLX_ID index, which is the fastest way to find the particular row or rows to be updated. If it is a single row, this should be quite fast.
The difference between these two queries is essentially what it is doing for the update work once it has found the row. While the plan doesn't show the steps it will take when updating one row, it should be fast either way. At that point, it's native C++ code updating a row in memory that it has exclusive access to. If I had to guess, the one using the CASE clause may take slightly longer, but it could be a negligible difference. You'd have to run a benchmark to measure the difference in execution times to be certain, but I would expect it to be fast in both cases.
What would be more significant than the difference between these two updates is how you handle updating multiple columns. For example, the cost of finding the affected row may be higher than the logic of the actual updates to the columns. If you designed it so that updating n columns requires queueing n SQL statements, the engine has to execute n statements and use the same index to find the same row n times; all of that overhead would be much more significant. If instead you had a complex UPDATE statement with many parameters that lets you pass in new values for some columns and keep the current values for others, the engine only has to execute one statement and find the row once, so even though that seems complex, it would probably be faster. Faster still may be to simply update all of the columns to the new values, whether they are the same as the current values or not.
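As an illustration of that single-statement approach, here is a hedged sketch that extends the CASE pattern from the question to several columns (COL_A and COL_B are hypothetical column names; the '####' sentinel is borrowed from the question):
UPDATE TBLX
SET NAME  = CASE WHEN (? != '####') THEN ? ELSE NAME  END,
    COL_A = CASE WHEN (? != '####') THEN ? ELSE COL_A END,
    COL_B = CASE WHEN (? != '####') THEN ? ELSE COL_B END
WHERE ID = ?;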
If you can test this and run a few hundred examples, then the "exec @Statistics PROCEDUREDETAIL 0;" output will show the average execution time for each of the SQL statements as well as for the entire procedure. That should provide the metrics necessary to find the optimum way to do this.
I have a View whose structure is exactly the same as the query used to generate the records in the Table.
But the SELECT statement below takes much longer to execute against the View than against the Table.
Does that mean the View takes longer to retrieve the data than the Table?
If I have huge amounts of data, is it more suitable to use the Table instead of the View?
select count(*) from XYZ_VIEW -- the view: returns in 4 min 4 sec, count = 5896
select count(*) from XYZ -- the table: returns in less than 1 second, count = 5896
You don't show the query behind the view but I assume it involves some joining to expose normalized data and/or other processing. Here's how you work with views in this regard.
Views should not (as in never) be the target of an unfiltered query. Actually, such queries should be discouraged against tables as well, though they are sometimes necessary during testing. Fortunately, they are almost never justified or even necessary in production code.
And since you don't really care about performance during testing but rather performance during production, you haven't said anything that suggests you might have a problem. That's right: four minutes against one second for the query you show is meaningless. Many of my views (if not most) would give similar results.
Instead, use a query that is more like what would actually run in production, and time it with millisecond precision. Use a query more in the form:
select ...
from table/view
where <typical filtering criteria>;
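For instance, time the same filtered query against both the view and the table from your example (the status filter below is hypothetical):
select * from XYZ_VIEW where status = 'OPEN';
select * from XYZ where status = 'OPEN';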
That will give you more useful information. Depending on your particular application and requirements, an acceptable range would be 10-20%. That is, if the table returns the result in 30 ms, the view is good if it returns the same result in less than about 36 ms.
You are presumably using a view because it manipulates the data to present it in a more useful form. That processing is overhead you will not have when you query the tables directly. When you omit filtering, you perform that processing on every row of the underlying table(s), probably needlessly. When you do something particularly stupid like select count(*), you perform all that extra processing for no benefit whatsoever.
When you filter the query, the extra processing is performed only on the qualifying result set. Should that result set be only one row, the processing will be performed only on the single row, rendering the performance from the view (depending, of course, on the exact amount and type of processing) practically indistinguishable from the table.
I always recommend the use of views -- lots and lots of views -- to present your users with data in whatever form is best for their various purposes. Test those views, by all means. But use meaningful tests.
I have a table in which 3 rows of data are added per second and in which I intend to keep around 30M rows. (Older data will be removed).
I need to add a column: varchar(1000). I can't tell in advance what its content will be, but I do know it will be very repetitive: thousands to millions of rows will have the same value. It is usually around 200 characters long.
Since data is being added using a stored procedure, I see two options:
Add a varchar(1000) column
Create a table (int id, varchar(1000) value)
Within the stored procedure, look up whether the value already exists in that other table, or create it
I would expect this other table to hold a maximum of 100 values at any time.
I know some of the tradeoffs between these two options, but I have difficulty making up my mind on the question.
Option 1 is heavier, but I get faster inserts. It requires fewer joins, hence queries are simpler.
Option 2 is lighter; inserts take longer but queries have the potential to be faster. I think I'm closer to normal form, but then I also have a table with a single meaningful column.
From the information I gave you, which option seems better? (You can also come up with another option).
You should also investigate page compression; perhaps you can do the simple thing and still get a small(ish) table. Although, if it is SQL Express as you say, you won't be able to use it, since that is an Enterprise Edition feature.
I have repeatedly used your second approach in my projects. Every insert goes through a stored procedure that gets the lookup value's id, or inserts a new one if not found and returns the id. Especially for large columns like yours seems to be, with plenty of rows yet so few distinct values, the saving in space should trump the additional overhead of the foreign key and the lookup cost in query joins. See also Disk is Cheap... That's not the point!.
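A minimal sketch of that second approach, assuming SQL Server T-SQL (table, column, and parameter names are hypothetical):
CREATE TABLE dbo.LookupValue (
    id    int IDENTITY(1,1) PRIMARY KEY,
    value varchar(1000) NOT NULL
);

-- Inside the insert stored procedure: resolve @value to an id, creating it if needed
DECLARE @valueId int;
SELECT @valueId = id FROM dbo.LookupValue WHERE value = @value;
IF @valueId IS NULL
BEGIN
    INSERT INTO dbo.LookupValue (value) VALUES (@value);
    SET @valueId = SCOPE_IDENTITY();
END;
-- ... then store @valueId in the main table instead of the 1000-character string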
I have a SQL statement to select results from a table. I need to know the total number of records found, and then list a sub-set of them (pagination).
Normally, I would make 2 SQL calls:
one for counting the total number of records (using COUNT),
the other for returning the sub-set (using LIMIT).
But this way you are really duplicating the same operation in MySQL: the WHERE clauses are the same in both calls.
Isn't there a way to gain speed by NOT duplicating the select in MySQL?
That first query is going to result in data being pulled into the cache, so presumably the second query should be fast. I wouldn't be too worried about this.
You have to make both SQL queries, and the COUNT is very fast with no WHERE clause. Cache the data where possible.
You should just run the COUNT a single time and then cache it somewhere. Then you can just run the pagination query as needed.
If you really don't want to run the COUNT() query- and as others have stated, it's not something that slows things down appreciably- then you have to decide on your chunk size (ie the LIMIT number) up front. This will save you the COUNT() query, but you may end up with unfortunate pagination results (like 2 pages where the 2nd page has only 1 result).
So, a quick COUNT() and then a sensible LIMIT set-up, or no COUNT() and an arbitrary LIMIT that may increase the number of more expensive queries you have to do.
You could try selecting just one field (say, the IDs) and see if that helps, but I don't think it will - I imagine the biggest overhead is MySQL finding the correct rows in the first place.
If you simply want to count the total number of rows in the entire table (i.e. without a WHERE clause) then I believe SELECT COUNT(*) FROM table is fairly efficient.
Otherwise, the only solution if you need to have the total number visible is to select all the rows. However, you can cache this in another table. If you are selecting something from a category, say, store the category UID and the total rows selected. Then whenever you add/delete rows, count the totals again.
Another option - though it may sacrifice usability a little - is to only select the rows needed for the current page and next page. If there are some rows available for the next page, add a "Next" link. Do the same for the previous page. If you have 20 rows per page, you're selecting at most 60 rows on each page load, and you don't need to count all the rows available.
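A common variant of that idea (not from the answer above; table and column names are hypothetical) is to fetch one extra row per page and show a "Next" link only if it comes back:
SELECT id, title
FROM items
WHERE category_id = 3
ORDER BY id
LIMIT 21 OFFSET 20;  -- page 2 at 20 rows per page; a 21st row means there is a next page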
If you write your query to include one column that contains the count (in every row), and then the rest of the columns from your second query, you can:
Avoid the second database round-trip (which is probably more expensive than your query anyway)
Increase the likelihood that MySQL's parser will generate an optimized execution plan that reuses the base query.
Make the operation atomic.
Unfortunately, it also creates a little repetition, returning more data than you really need. But I would expect it to be much more efficient anyway. This is the sort of strategy used by a lot of ORM products when they eagerly load objects from connected tables with many-to-one or many-to-many relationships.
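A hedged sketch of that single-query shape (the items table, its columns, and the filter are assumptions):
SELECT
    (SELECT COUNT(*) FROM items WHERE category_id = 3) AS total_rows,
    t.*
FROM items AS t
WHERE t.category_id = 3
ORDER BY t.created_at DESC
LIMIT 20 OFFSET 0;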
As others have already pointed out, it's probably not worth much concern in this case -- as long as 'field' is indexed, both SELECTs will be extremely fast.
If you have (for whatever reason) a situation where that's not enough, you could create a memory-based temporary table (i.e. a temporary table backed by the memory storage engine), and select your records into that temporary table. Then you could do selects from the temporary table and be quite well assured they'll be fast. This can use a lot of memory though (i.e. it forces that data to all stay in memory for the duration), so it's pretty unfriendly unless you're sure that:
The amount of data is really small;
You have so much memory it doesn't matter; or
The machine will be nearly idle otherwise anyway.
The main time this comes in handy is if you have a really complex select that can't avoid scanning all of a large table (or more than one) but yields only a tiny amount of data.
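A minimal sketch of that approach, assuming MySQL and hypothetical table and column names:
-- materialise the expensive result once, in memory
CREATE TEMPORARY TABLE tmp_results ENGINE=MEMORY
    SELECT id, name, score
    FROM big_table
    WHERE status = 'active' AND score > 100;

SELECT COUNT(*) FROM tmp_results;                        -- total for the pager
SELECT * FROM tmp_results ORDER BY score DESC LIMIT 20;  -- first page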
I need to do paging with the sort order based on a calculation. The calculation is similar to something like reddit's hotness algorithm in that it's dependent on time: the time since post creation.
I'm wondering what the best practice for this would be: whether to have this sort as a SQL function, or to run an update once an hour to recalculate the whole table.
The table has hundreds of thousands of rows, and I'm using NHibernate, so the scheduled full recalculation could cause problems.
Any advice?
It most likely will depend a lot on the load on your server. A few assumptions for my answer:
Your calculation is most likely not simple, but will take into account a variety of factors, including the time elapsed since the post
You are expecting at least reasonable growth in your site, meaning new data will be added to your table.
I would suggest your best bet would be to calculate and store your ranking value and, as Nuno G mentioned, retrieve it using an ORDER BY clause. As you note, there are likely to be some implications, two of which would be:
Scheduling Updates
Ensuring access to the table
As far as scheduling goes, you may be able to look at ways of intelligently recalculating your value. For example, you may be able to identify when a calculation is likely to change (for example, when a dependent record is updated you might fire a trigger that adds the ID of your row to a queue for recalculation). You may also do the update in ranges, rather than over the full table.
You will also want to minimise any locking of your table while you are recalculating. There are a number of ways to do this, including setting your isolation levels (using MS SQL terminology). If you are really worried you could even perform your calculation externally (e.g. in a temp table) and then simply run an update of the values into your main table.
As a final note I would recommend looking into the paging options available to you - if you are talking about thousands of records make sure that your mechanism determines the page you need on the SQL server so that you are not returning the thousands of rows to your application, as this will slow things down for you.
If you can perform the calculation in SQL, try using NHibernate to load the sorted collection by executing a SQLQuery whose SQL includes an 'ORDER BY' expression.
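For example, a hedged sketch of such a query (T-SQL; the Posts table, its columns, and the decay formula are assumptions, and OFFSET/FETCH requires SQL Server 2012 or later), which NHibernate can execute via CreateSQLQuery:
SELECT Id, Title, Score, CreatedAt
FROM Posts
ORDER BY Score / POWER(CAST(DATEDIFF(HOUR, CreatedAt, GETDATE()) + 2 AS float), 1.5) DESC
OFFSET 0 ROWS FETCH NEXT 20 ROWS ONLY;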