Why is the load method of a datatable sometimes so slow? - vb.net

The project is a web app in ASP/VB.net. The issue is that some pages are mind-bogglingly slow. After tracking down the bottleneck, we found it to be the Load method when filling a DataTable with query results.
We are using an Oracle database and queries are executed in stored procedures. As an example, we have a relatively simple select statement within a procedure which returns 2 columns and 6 rows, and which was measured to take about 0.015 seconds to execute. However, it takes on average 7 seconds to load the OracleDataReader into a DataTable, a ridiculous amount of time for such a small record set. After messing around with the query, I found that a simple DECODE appeared to be causing the issue. The DECODE is used roughly as follows:
WHERE
DECODE(iBln, 1, column1, column2) BETWEEN iDate1 and iDate2
The iBln variable is simply a number being passed in to act as a boolean variable for determining which column should be between two dates. If I comment this decode statement out and make it simply "column1 BETWEEN iDate1 and iDate2" then the load method takes no time at all as it should, signifying that it is indeed the decode statement causing issues.
So I'm just hoping to hear from anyone that could have an idea as to what's causing this or how to fix it. It's a simple decode, how does it even affect the load method anyway?

I would verify that indexes exist for column1 and column2. If so, the likely problem is that the DECODE is preventing the use of the indexes. Try rewriting as:
WHERE ( ( iBln = 1 AND column1 BETWEEN iDate1 AND iDate2)
OR
( (iBln IS NULL OR iBln <> 1) AND column2 BETWEEN iDate1 AND iDate2)
)
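One way to check that this rewrite returns the same rows as the DECODE version, including when the flag is NULL, is to run both predicates over a small table. The sketch below uses Python's built-in sqlite3 module purely as a stand-in: SQLite has no DECODE, so a CASE expression plays that role, and the table name t and the sample dates are made up for illustration. It checks logical equivalence only and says nothing about Oracle's query plan.

```python
import sqlite3

# CASE stands in for Oracle's DECODE(iBln, 1, column1, column2).
DECODE_STYLE = """
    SELECT id FROM t
    WHERE CASE :flag WHEN 1 THEN column1 ELSE column2 END
          BETWEEN :d1 AND :d2
    ORDER BY id
"""

# The rewrite from the answer: each branch compares a bare column.
OR_STYLE = """
    SELECT id FROM t
    WHERE (:flag = 1 AND column1 BETWEEN :d1 AND :d2)
       OR ((:flag IS NULL OR :flag <> 1) AND column2 BETWEEN :d1 AND :d2)
    ORDER BY id
"""

def make_db():
    """Build a tiny in-memory table with hypothetical ISO-formatted dates."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT)"
    )
    conn.executemany(
        "INSERT INTO t VALUES (?, ?, ?)",
        [
            (1, "2020-01-10", "2021-06-01"),
            (2, "2020-03-15", "2020-02-01"),
            (3, "2022-01-01", "2020-01-20"),
            (4, None, "2020-02-15"),
        ],
    )
    return conn

def ids(conn, sql, flag, d1="2020-01-01", d2="2020-03-31"):
    """Run one of the two queries and return the matching ids."""
    params = {"flag": flag, "d1": d1, "d2": d2}
    return [row[0] for row in conn.execute(sql, params)]
```

Calling ids(conn, DECODE_STYLE, flag) and ids(conn, OR_STYLE, flag) with flag set to 1, 0 and None should give matching lists in each case.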

If your stored procedure is returning a REF CURSOR, opening the cursor in the stored procedure will be very fast regardless of the query you're executing. Opening a cursor doesn't require that Oracle do any of the work of actually running the query, it just requires that Oracle determine the query plan which should be more or less instantaneous.
How long does it take to fetch the data from the REF CURSOR in something like SQL*Plus? If it takes something close to 7 seconds (as I suspect it will), you can eliminate the OracleDataReader class as the source of the problem. In that case, the problem would almost certainly be that the query plan is inefficient.
Based on your description, my guess is that column1 is indexed. column2 may also be indexed, it's not clear. But a regular index on either column1 or column2 could not be used to evaluate the predicate that involves the call to the DECODE function. If there are no other predicates on indexed columns, that may force Oracle to do a table scan on the underlying table (posting the full query, the table definition, and the query plan would be helpful).

Related

TSQL: Is there a way to limit the rows returned and count the total that would have been returned without the limit (without adding it to every row)?

I'm working to update a stored procedure that currently selects up to n rows; if the number of rows returned = n, it does a select count without the limit, and then returns the original select and the total impacted rows.
Kinda like:
SELECT TOP (@rowsToReturn)
A.data1,
A.data2
FROM
mytable A
SET @maxRows = @@ROWCOUNT
IF @rowsToReturn = @maxRows -- compare the saved value: the SET above resets @@ROWCOUNT to 1
BEGIN
SET @maxRows = (SELECT COUNT(1) FROM mytable A)
END
I want to reduce this to a single select statement. Based on this question, COUNT(*) OVER() allows this, but it is put on every single row instead of in an output parameter. Maybe something like FOUND_ROWS() in MySQL, such as a @@TOTALROWCOUNT or similar.
As a side note, since the actual select has an ORDER BY, the database will already need to traverse the entire set (to make sure that it gets the correct first n ordered records), so the database should already have this count somewhere.
As @MartinSmith mentioned in a comment on this question, there is no direct (i.e. pure T-SQL) way of getting the total number of rows that would be returned while at the same time limiting it. In the past I have used this method:
dump the query to a temp table to grab @@ROWCOUNT (the total set)
use ROW_NUMBER() AS [ResultID] on the ordered results of the main query
SELECT TOP (n) FROM #Temp ORDER BY [ResultID] or something similar
Of course, the downside here is that you have the disk I/O cost of getting those records into the temp table. Put [tempdb] on SSD? :)
I have also experienced the "run COUNT(*) with the same rest of the query first, then run the regular SELECT" method (as advocated by @Blam), and it is not a "free" re-run of the query:
It is a full re-run in many cases. The issue is that when doing COUNT(*) (hence not returning any fields), the optimizer only needs to worry about indexes in terms of the JOIN, WHERE, GROUP BY, ORDER BY clauses. But when you want some actual data back, that could change the execution plan quite a bit, especially if the indexes used to get the COUNT(*) are not "covering" for the fields in the SELECT list.
The other issue is that even if the indexes are all the same and hence all of the data pages are still in cache, that just saves you from the physical reads. But you still have the logical reads.
I'm not saying this method doesn't work, but I think the method in the Question that only does the COUNT(*) conditionally is far less stressful on the system.
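For reference, the conditional approach from the Question can be sketched from the application side as follows. This is only an illustration using Python's sqlite3 module and an in-memory table as stand-ins (LIMIT replaces TOP, and the helper names are made up); the point is that the second COUNT(*) query runs only when the first query filled the whole page.

```python
import sqlite3

def first_n_with_total(conn, n):
    """Return (rows, total): at most n rows, plus the size of the full set."""
    rows = conn.execute(
        "SELECT data1, data2 FROM mytable ORDER BY data1 LIMIT ?", (n,)
    ).fetchall()
    total = len(rows)
    # Only a full page can mean there are more rows, so count only then.
    if total == n:
        total = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
    return rows, total

def make_db(row_count):
    """Hypothetical test table with row_count rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (data1 INTEGER, data2 TEXT)")
    conn.executemany(
        "INSERT INTO mytable VALUES (?, ?)",
        [(i, f"row {i}") for i in range(row_count)],
    )
    return conn
```

When the table holds fewer rows than requested, the count query is skipped entirely, which is why this variant tends to be cheaper than always running both statements.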
The method advocated by @Gordon is actually functionally very similar to the temp table method I described above: it dumps the full result set to [tempdb] (the INSERTED table is in [tempdb]) to get the full @@ROWCOUNT and then it gets a subset. On the downside, the INSTEAD OF TRIGGER method is:
a lot more work to set up (as in 10x - 20x more): you need a real table to represent each distinct result set, you need a trigger, the trigger needs to either be built dynamically, or get the number of rows to return from some config table, or I suppose it could get it from CONTEXT_INFO() or a temp table. Still, the whole process is quite a few steps and convoluted.
very inefficient: first it does the same amount of work dumping the full result set to a table (i.e. into the INSERTED table--which lives in [tempdb]) but then it does an additional step of selecting the desired subset of records (not really a problem as this should still be in the buffer pool) to go back into the real table. What's worse is that second step is actually double I/O as the operation is also represented in the transaction log for the database where that real table exists. But wait, there's more: what about the next run of the query? You need to clear out this real table. Whether via DELETE or TRUNCATE TABLE, it is another operation that shows up (the amount of representation based on which of those two operations is used) in the transaction log, plus is additional time spent on the additional operation. AND, let's not forget about the step that selects the subset out of INSERTED into the real table: it doesn't have the opportunity to use an index since you can't index the INSERTED and DELETED tables. Not that you always would want to add an index to the temp table, but sometimes it helps (depending on the situation) and you at least have that choice.
overly complicated: what happens when two processes need to run the query at the same time? If they are sharing the same real table to dump into and then select out of for the final output, then there needs to be another column added to distinguish between the SPIDs. It could be @@SPID. Or it could be a GUID created before the initial INSERT into the real table is called (so that it can be passed to the INSTEAD OF trigger via CONTEXT_INFO() or a temp table). Whatever the value is, it would then be used to do the DELETE operation once the final output has been selected. And if not obvious, this part influences a performance issue brought up in the prior bullet: TRUNCATE TABLE cannot be used as it clears the entire table, leaving DELETE FROM dbo.RealTable WHERE ProcessID = @WhateverID; as the only option.
Now, to be fair, it is possible to do the final SELECT from within the trigger itself. This would reduce some of the inefficiency as the data never makes it into the real table and then also never needs to be deleted. It also reduces the over-complication as there should be no need to separate the data by SPID. However, this is a very time-limited solution as the ability to return results from within a trigger is going bye-bye in the next release of SQL Server, so sayeth the MSDN page for the disallow results from triggers Server Configuration Option:
This feature will be removed in the next version of Microsoft SQL Server. Do not use this feature in new development work, and modify applications that currently use this feature as soon as possible. We recommend that you set this value to 1.
The only actual way to do:
the query one time
get a subset of rows
and still get the total row count of the full result set
is to use .Net. If the procs are being called from app code, please see "EDIT 2" at the bottom. If you want to be able to randomly run various stored procedures via ad hoc queries, then it would have to be a SQLCLR stored procedure so that it could be generic and work for any query as stored procedures can return dynamic result sets and functions cannot. The proc would need at least 3 parameters:
@QueryToExec NVARCHAR(MAX)
@RowsToReturn INT
@TotalRows INT OUTPUT
The idea is to use "Context Connection = true;" to make use of the internal / in-process connection. You then do these basic steps:
call ExecuteReader()
before you read any rows, do a GetSchemaTable()
from the SchemaTable you get the result set field names and datatypes
from the result set structure you construct a SqlDataRecord
with that SqlDataRecord you call SqlContext.Pipe.SendResultsStart(_DataRecord)
now you start calling Reader.Read()
for each row you call:
Reader.GetValues()
DataRecord.SetValues()
SqlContext.Pipe.SendResultRow(_DataRecord)
RowCounter++
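Rather than doing the typical "while (Reader.Read())", you instead include the @RowsToReturn param: while (RowCounter < RowsToReturn.Value && Reader.Read()). Testing the counter first matters: with the operands the other way around, the final Read() would consume one row that is neither sent nor counted.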
After that while loop, call SqlContext.Pipe.SendResultsEnd() to close the result set (the one that you are sending, not the one you are reading)
then do a second while loop that cycles through the rest of the result, but never gets any of the fields:
while (Reader.Read())
{
RowCounter++;
}
then just set TotalRows = RowCounter; which will pass back the number of rows for the full result set, even though you only returned the top n rows of it :)
Not sure how this performs against the temp table method, the dual call method, or even @M.Ali's method (which I have also tried and kinda like, but the question was specific to not sending the value as a column), but it should be fine and does accomplish the task as requested.
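The two-loop idea (keep the first n rows, then walk the rest only to count them) is independent of SQLCLR, so it can be sketched with any forward-only cursor. The example below is a rough illustration using Python's sqlite3 cursor in place of a SqlDataReader; the function name is invented, and none of the SqlContext.Pipe calls are modelled. The counter is checked before each fetch so that no row is consumed without being counted.

```python
import sqlite3

def send_top_n_and_count(cursor, rows_to_return):
    """Consume the whole result set once; keep only the first rows_to_return rows."""
    kept = []
    row_counter = 0
    # First loop: collect rows until the limit is reached.
    while row_counter < rows_to_return:
        row = cursor.fetchone()
        if row is None:
            # Result set ended before the limit was hit.
            return kept, row_counter
        kept.append(row)
        row_counter += 1
    # Second loop: walk the remaining rows without keeping any of them.
    while cursor.fetchone() is not None:
        row_counter += 1
    return kept, row_counter
```

The returned pair corresponds to the rows that would be sent to the client and the value that would be assigned to TotalRows.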
EDIT:
Even better! Another option (a variation on the above C# suggestion) is to use the @@ROWCOUNT from the T-SQL stored procedure, sent as an OUTPUT parameter, rather than cycling through the rest of the rows in the SqlDataReader. So the stored procedure would be similar to:
CREATE PROCEDURE SchemaName.ProcName
(
@Param1 INT,
@Param2 VARCHAR(05),
@RowCount INT = -1 OUTPUT -- default so it doesn't have to be passed in
)
AS
SET NOCOUNT ON;
{any ol' query}
SET @RowCount = @@ROWCOUNT;
Then, in the app code, create a new SqlParameter, Direction = Output, for "@RowCount". The numbered steps above stay the same, except the last two (10 and 11), which change to:
Instead of the 2nd while loop, just call Reader.Close()
Instead of using the RowCounter variable, set TotalRows = (int)RowCountOutputParam.Value;
I have tried this and it does work. But so far I have not had time to test the performance against the other methods.
EDIT 2:
If the T-SQL stored procs are being called from the app layer (i.e. no need for ad hoc execution) then this is actually a much simpler variation of the above C# methods. In this case you don't need to worry about the SqlDataRecord or the SqlContext.Pipe methods. Assuming you already have a SqlDataReader set up to pull back the results, you just need to:
Make sure the T-SQL stored proc has a @RowCount INT = -1 OUTPUT parameter
Make sure to SET @RowCount = @@ROWCOUNT; immediately after the query
Register the OUTPUT param as a SqlParameter having Direction = Output
Use a loop similar to: while (RowCounter < RowsToReturn && Reader.Read()) so that you can stop retrieving results once you have pulled back the desired number of rows.
Remember to not limit the result in the stored proc (i.e. no TOP (n))
At that point, just like what was mentioned in the first "EDIT" above, just close the SqlDataReader and grab the .Value of the OUTPUT param :).
How about this....
DECLARE @N INT = 10
;WITH CTE AS
(
SELECT
A.data1,
A.data2
FROM mytable A
)
SELECT TOP (@N) * , (SELECT COUNT(*) FROM CTE) Total_Rows
FROM CTE
The last column will be populated with the total number of rows it would have returned without the TOP Clause.
The issue with your requirement is that you are expecting a SINGLE select statement to return both a table and a scalar value, which is not possible.
A single select statement will return a table or a scalar value. Or you can have two separate selects, one returning a scalar value and the other returning a table. The choice is yours :)
Just because you think T-SQL should have a row count because of a sort does not mean it does. And if it does, it is not currently sharing it with the outside world.
What you are missing is this is very efficient
select count(*)
from ...
where ...
select top x
from ...
where ...
order by ...
With the count(*) unless the query is just plain ugly those indexes should be in memory.
It has to perform a count to sort based on what?
Did you actually evaluate any query plans?
If TSQL has to perform a sort then explain the following.
Why is the count(*) 100% of the cost when the second had to do a count anyway?
Just where in that second query plan is there a free opportunity to count?
Why are those query plans so different if they both need to count?
I think there is an arcane way to do what you want. It involves triggers and non-temporary tables. And, I should mention, although I have implemented each piece (for different purposes), I have never put them together for this purpose.
The idea starts with this Stack Overflow question. According to this source, @@ROWCOUNT counts the number of attempted inserts, even when they don't really happen. Now, I must admit that a perusal of available documentation doesn't seem to touch on this topic, so this may or may not be "correct" behavior. This method is relying on this "problem".
So, you could do what you want by:
Creating a new table for the output -- but not a table variable or a temporary table.
Creating an "instead of" trigger that prevents more than @maxRows from going into the table.
Selecting the query results into the table.
Reading @@ROWCOUNT after the select.
Note that you can create the table and trigger using dynamic SQL. You could also create it once, and have the trigger read the @maxRows value from some sort of parameter table. As mentioned before, this needs to be a real table that supports triggers.

Database Function VS Case Statement

Yesterday we had a scenario where we had to get the type of a DB field and, based on that, write out a description of the field, like:
Select ( Case DB_Type When 'I' Then 'Intermediate'
When 'P' Then 'Pending'
Else 'Basic'
End)
From DB_table
I suggested writing a DB function instead of this CASE statement because that would be more reusable, like:
Select dbo.GetTypeName(DB_Type)
from DB_table
The interesting part is that one of our developers said using a database function would be inefficient, as database functions are slower than CASE statements. I searched the internet to find out which approach is better in terms of efficiency, but unfortunately I found nothing that could be considered a satisfactory answer. Please enlighten me with your thoughts: which approach is better?
A scalar UDF is slower than an equivalent CASE expression.
Please refer to this article:
http://blogs.msdn.com/b/sqlserverfaq/archive/2009/10/06/performance-benefits-of-using-expression-over-user-defined-functions.aspx
The following article suggests when to use a UDF:
http://www.sql-server-performance.com/2005/sql-server-udfs/
Summary :
There is a large performance penalty when user-defined functions are used. This penalty shows up as poor query execution time when a query applies a UDF to a large number of rows, typically 1000 or more. The penalty is incurred because the SQL Server database engine must create its own internal cursor-like processing. It must invoke each UDF on each row. If the UDF is used in the WHERE clause, this may happen as part of filtering the rows. If the UDF is used in the select list, this happens when creating the results of the query to pass to the next stage of query processing.
It's the row by row processing that slows SQL Server the most.
When using a scalar function (a function that returns one value) the contents of the function will be executed once per row but the case statement will be executed across the entire set.
By operating against the entire set you allow the server to optimise your query more efficiently.
So the theory goes that if the same query is run both ways against a large dataset, the function version should be slower. However, the difference may be trivial when operating against your data, so you should try both methods and test them to determine whether any performance trade-off is worth the increased utility of a function.
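The "once per row" behaviour is easy to observe in a toy setting. The sketch below registers a Python function with SQLite (via the standard sqlite3 module) as a stand-in for dbo.GetTypeName and counts how many times the engine calls it; the call counter and the sample data are assumptions made for illustration, and SQLite's cost model is not SQL Server's, so this shows the invocation pattern rather than the actual timings.

```python
import sqlite3

calls = {"count": 0}

def get_type_name(db_type):
    """Python stand-in for dbo.GetTypeName; counts how often the engine calls it."""
    calls["count"] += 1
    return {"I": "Intermediate", "P": "Pending"}.get(db_type, "Basic")

def make_db(row_count):
    """Hypothetical DB_table with row_count rows cycling through I, P, B."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("GetTypeName", 1, get_type_name)
    conn.execute("CREATE TABLE DB_table (DB_Type TEXT)")
    conn.executemany(
        "INSERT INTO DB_table VALUES (?)",
        [("IPB"[i % 3],) for i in range(row_count)],
    )
    return conn

CASE_SQL = """
    SELECT CASE DB_Type WHEN 'I' THEN 'Intermediate'
                        WHEN 'P' THEN 'Pending'
                        ELSE 'Basic' END
    FROM DB_table ORDER BY rowid
"""
UDF_SQL = "SELECT GetTypeName(DB_Type) FROM DB_table ORDER BY rowid"
```

Running both statements against the same table should return identical rows, with calls["count"] growing by one for every row the UDF query touches. To compare real cost on your own data, time both forms on the actual server.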
Your developer is right. Functions will slow down your query.
https://sqlserverfast.com/?s=user+defined+ugly
Calling functions is like:
wrap parts into paper
put it into a bag
carry it to the mechanic
let him unwrap it, do something, then wrap the result
carry it back
use it

For an Oracle NUMBER datatype, LIKE operator vs BETWEEN..AND operator

Assume mytable is an Oracle table and it has a field called id. The datatype of id is NUMBER(8). Compare the following queries:
select * from mytable where id like '715%'
and
select * from mytable where id between 71500000 and 71599999
I would think the second is more efficient, since I think a "number comparison" would require fewer assembly language instructions than a "string comparison". I need a confirmation or correction. Please confirm/correct and add any further comment related to either operator.
UPDATE: I forgot to mention 1 important piece of info. id in this case must be an 8-digit number.
If you only want values between 71500000 and 71599999 then yes, the second one is much more efficient. The first one would also return values between 7150-7159, 71500-71599, and so forth. You would either need to sift through unnecessary results or write another couple of lines of code to filter the rest of them out. The second option is definitely more efficient for what you seem to want to do.
It seems like the execution plan on the second query is more efficient.
The first query is doing a full table scan of the id's, whereas the second query is not.
My Test Data:
Execution Plan of first query:
Execution Plan of second query:
I don't like the idea of using LIKE with a numeric column.
Also, it may not give the results you are looking for.
If you have a value of 715000000, it will show up in the query result, even though it is larger than 71599999.
Also, I do not like between on principle.
If a thing is between two other things, it should not include those two other things. But this is just a personal annoyance.
I prefer to use >= and <= This avoids confusion when I read the query. In addition, sometimes I have to change the query to something like >= a and < c. If I started by using the between operator, I would have to rewrite it when I don't want to be inclusive.
Harv
In addition to the other points raised, using LIKE in the manner you suggest would cause Oracle to not use any indexes on the ID column due to the implicit conversion of the data from number to character, resulting in a full table scan when using LIKE versus an index range scan when using BETWEEN. Assuming, of course, you have an index on ID. Even if you don't, however, Oracle will have to do the type conversion on each value it scans in the LIKE case, which it won't have to do in the other.
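Both effects, the extra matches and the loss of the index, can be reproduced in miniature. The following sketch uses SQLite through Python's sqlite3 module as a stand-in for Oracle, with a made-up handful of ids and an INTEGER PRIMARY KEY playing the part of the index on ID. SQLite also converts the number to text for LIKE, but its plan output and optimizer are its own, so treat this as an analogy rather than a reproduction of Oracle's behaviour.

```python
import sqlite3

def make_db():
    """Tiny table with ids chosen to expose the difference between the queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO mytable VALUES (?)",
        [(7150,), (71500000,), (71599999,), (71600000,), (715000000,)],
    )
    return conn

LIKE_SQL = "SELECT id FROM mytable WHERE id LIKE '715%' ORDER BY id"
BETWEEN_SQL = (
    "SELECT id FROM mytable WHERE id BETWEEN 71500000 AND 71599999 ORDER BY id"
)

def ids(conn, sql):
    """Return the ids matched by a query."""
    return [row[0] for row in conn.execute(sql)]

def plan(conn, sql):
    """Return the planner's description lines (last column of EXPLAIN QUERY PLAN)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
```

Here the LIKE query also picks up the 4-digit and 9-digit ids, and its plan is a scan of the whole table, while the BETWEEN query returns only the two 8-digit ids in range and is answered by a keyed search.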
You can use a math function instead; otherwise you have to use the to_char function in order to use LIKE, which will cause performance problems.
select * from mytable where floor(id /100000) = 715
or
select * from mytable where floor(id /100000) = TO_NUMBER('715') -- this is parametric

Performance of SQL comparison using substring vs like with wildcard

I am working on a join condition between 2 tables where one of the columns to match on is a concatenation of values. I need to join columnA from tableA to the first 2 characters of columnB from tableB.
I have developed 2 different statements to handle this and I have tried to analyze the performance of each method.
Method 1:
ON tB.columnB like tA.columnA || '%'
Method 2:
ON substr(tB.columnB,1,2) = tA.columnA
The query execution plan has a lot less steps using Method 1 compared to Method 2, however, it looks like Method 2 executes much faster. Also, the execution plan shows a recommended index for Method 2 that could improve its performance.
I am running this on an IBM iSeries, though would be interested in answers in a general sense to learn more about sql query optimization.
Does it make sense that Method 2 would execute faster?
This SO question is similar, but it looks like no one provided any concrete answers to the performance difference of these approaches: T-SQL speed comparison between LEFT() vs. LIKE operator.
PS: The table design that requires this type of join is not something that I can get changed at this time. I realize that having separate fields for the different types of data would be preferable.
I ran the following in the SQL Advisor in IBM Data Studio on one of the tables in my DB2 LUW 10.1 database:
SELECT *
FROM PDM.DB30
WHERE DB30_SYSTEM_ID = 'XXX'
AND DB30_VERSION_ID = 'YYY'
AND SUBSTR(DB30_REL_TABLE_NM, 1, 4) = 'ZZZZ'
and
SELECT *
FROM PDM.DB30
WHERE DB30_SYSTEM_ID = 'XXX'
AND DB30_VERSION_ID = 'YYY'
AND DB30_REL_TABLE_NM LIKE 'ZZZZ%'
They both had the exact same access path utilizing the same index, the same estimated IO cost and the same estimated cardinality, the only difference being the estimated total CPU cost for the LIKE was 178,343.75 while the SUBSTR was 197,518.48 (~10% difference).
The cumulative total cost for both were the same though, so this difference is negligible as per the advisor.
Yes, Method 2 would be faster. LIKE is not as efficient a function.
To compare the performance of various techniques, try using Visual Explain. You will find it buried in System i Navigator. Under your system connection, expand Databases, then click on your RDB name. In the lower right pane you can then click on the option to Run an SQL Script. Enter your SELECT statement, and choose the menu option for Visual Explain or Run and Explain. Visual Explain will break down the execution plan for your statement and show you the cost for each part as estimated on your tables with the indexes available.
You can actually run with real examples in your database.
LIKE is always faster in my runs.
select count(*) from u_log where log_text like 'AUT%';
1 row(s) returned : 90ms taken
select count(*) from u_log where substr(log_text,1,3)='AUT';
1 row(s) returned : 493ms taken
I found this reference in an IBM redbook related to SQL performance. It sounds like the SUBSTR scalar function can be handled in an optimized manner by an iSeries.
If you search for the first character and want to use the SQE instead
of the CQE, you can use the scalar function substring on the left sign
of the equal sign. If you have to search for additional characters in
the string, you can additionally use the scalar function POSSTR. By
splitting the LIKE predicate into several scalar function, you can
affect the query optimizer to use the SQE.
http://publib-b.boulder.ibm.com/abstracts/sg246654.html?Open

Large number of UPDATE queries slowing down page

I am reading and validating large fixed-width text files (ranging from 10-50K lines) that are submitted via our ASP.net website (coded in VB.Net). I do an initial scan of the file to check for basic issues (line length, etc). Then I import each row into a MS SQL table. Each DB row basically consists of a record_ID (primary, auto-incrementing) and about 50 varchar fields.
After the insert is done, I run a validation function on the file that checks each field in each row based on a bunch of criteria (trimmed length, isnumeric, range checks, etc). If it finds an error in any field, it inserts a record into the Errors table, which has an error_ID, the record_ID and an error message. In addition, if the field fails in a particular way, I have to do a "reset" on that field. A reset might consist of blanking the entire field, or simply replacing the value with another value (e.g. replacing the string with a new one that has all illegal chars taken out).
I have a 5,000 line test file. The upload, initial check, and import takes about 5-6 seconds. The detailed error check and insert into the Errors table takes about 5-8 seconds (this file has about 1200 errors in it). However, the "resets" part takes about 40-45 seconds for 750 fields that need to be reset. When I comment out the resets function (returning immediately without actually calling the UPDATE stored proc), the process is very fast. With the resets turned on, the pages take 50 seconds to return.
My UPDATE stored proc is using some recommended code from http://sommarskog.se/dynamic_sql.html, whereby it uses CASE instead of dynamic SQL:
UPDATE dbo.Records
SET dbo.Records.file_ID = CASE @field_name WHEN 'file_ID' THEN @field_value ELSE file_ID END,
.
. (all 50 varchar field CASE statements here)
.
WHERE dbo.Records.record_ID = @record_ID
Is there any way I can help my performance here? Can I somehow group all of these UPDATE calls into a single transaction? Should I be reworking the UPDATE query somehow? Or is it just the sheer quantity of 750+ UPDATEs that makes things slow (it's a quad-processor server with 8GB RAM)?
Any suggestions appreciated.
Don't do this in SQL; fix the data up in code, then do your updates.
If you have SQL Server 2008, then look into table-valued parameters. They enable you to pass an entire table as a parameter to a stored proc. From there you just have the one insert/update or merge statement.
If you're looping through the lines and doing individual updates/inserts, this can be really expensive... Consider using SqlBulkCopy, which can speed up all your inserts. Similarly, you can create a DataSet, make your updates on the DataSet and then submit them all in one shot through a SqlDataAdapter.
I believe you are doing 50 case statements on every update. Sounds like that would be slow.
It is possible to solve this problem with injection-proof code via parameterized queries and a table of string constants.
Quick and dirty example code.
string [] queryList = { "UPDATE records SET col1 = {val} WHERE ID={key}",
"UPDATE records SET col2 = {val} WHERE ID={key}",
"UPDATE records SET col3 = {val} WHERE ID={key}",
...
"UPDATE records SET col50 = {val} WHERE ID={key}"}
Then in your call to SQL you just pick the item in the array corresponding to the col you want to update and set the value and key for the parameterized items.
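A rough sketch of that idea, combined with the single-transaction grouping the question asks about, might look like this. It is written in Python against SQLite (standard sqlite3 module) rather than VB.Net and SQL Server, the three column names and the apply_resets helper are hypothetical, and "?" placeholders take the place of the {val} and {key} markers. The column name only ever comes from the fixed table, while the value and key are bound as parameters.

```python
import sqlite3

# Fixed table of statements: column names never come from user input.
# Only three hypothetical columns are listed here.
RESET_SQL = {
    name: f"UPDATE records SET {name} = ? WHERE id = ?"
    for name in ("col1", "col2", "col3")
}

def apply_resets(conn, resets):
    """Apply (record_id, column_name, new_value) resets in one transaction."""
    batches = {}
    for record_id, column, value in resets:
        if column not in RESET_SQL:
            raise ValueError(f"unknown column: {column!r}")
        batches.setdefault(column, []).append((value, record_id))
    # One transaction for the whole batch; rolled back if any statement fails.
    with conn:
        for column, params in batches.items():
            conn.executemany(RESET_SQL[column], params)
```

Each statement touches a single column, so the 50-branch CASE is avoided, and committing once at the end avoids paying the per-statement commit cost hundreds of times.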
I'm guessing you will see a significant improvement... let me know how it goes.
Um. Why are you inserting numeric data into VARCHAR fields then trying to run numeric checks on it? This is yucky.
Apply correct data typing and constraints to your table, do the INSERT, and see if it failed. SQL Server will happily report errors back to you.
I would try changing the recovery model to simple and look at my indexes. Kimberly Tripp did a session showing a scenario with improved performance using a heap.