LINQ / SQL Performance problem with a very long list

In our database we have a table with more than 100000 entries, but most of the time we only need a part of it. We do this with a very simple query.
items.AddRange(from i in this
where i.ResultID == resultID && i.AgentID == parentAgentID
orderby i.ChangeDate descending
select i);
After this query we get a List with up to 500 items. But even from this result we only need the newest item and the one after it. My coworker did this very simply with:
items[0];
items[1];
Works fine since the query result is already ordered by date. But the overall performance is very poor. Takes some seconds even.
My idea was to add a .Take(2) at the end of the query but my coworker said this will make no difference.
items.AddRange((from i in resultlist
where i.ResultID == resultID && i.AgentID == parentAgentID
orderby i.ChangeDate descending
select i).Take(2));
We haven't tried this yet and we are still looking for additional ways to speed things up. But database programming is not our strong side and any advice would be great.
Maybe we can even make some adjustments to the database itself? We use a SQL Compact Database.

Using Take(2) should indeed make a difference, if the optimiser is reasonably smart, and particularly if the ChangeDate column is indexed. (I don't know how much optimization SQL Compact edition does, but I'd still expect limiting the results to be helpful.)
However, you shouldn't trust me or anyone else to say so. See what query is being generated in each case, and run it against the SQL profiler. See what the execution plan is. Measure the performance with various samples. Measure, measure, measure.

The problem you might be having is that the data is being pulled down to your computer and then you're doing the Take(2) on it. The part that probably takes the most time is pulling all of that data to your application. If you want SQL server to do it then make sure you don't access any of the result set record's values until you're done with your query statements.
Second, LINQ isn't fast for doing things like sorting and where clauses on large sets of data in application memory. It's much easier to write in LINQ at times but it's always better to do as much sorting and where clauses in the database as opposed to manipulating in memory sets of objects.
If you really care about performance in this scenario, don't use LINQ. Just make a loop.
http://ox.no/posts/linq-vs-loop-a-performance-test
I love using LINQ-To-SQL and LINQ but it's not always the right tool for the job. If you have a lot of data and performance is critical then you don't want to use LINQ for in memory sorting and where statements.

Adding .Take(2) will make a big difference. If you only need two items then you should definitely use it and it will most certainly make a performance difference for you.
Add it and look at the SQL that is generated from it. The SQL generated will only get 2 records, which should save you time on the SQL side and also on the object instantiation side.
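The difference between limiting in the database and limiting in the application can be sketched as follows. This is only an illustration, not the asker's setup: it uses Python's built-in sqlite3 module as a stand-in for SQL Compact, `LIMIT 2` as the SQLite spelling of what `.Take(2)` should become (`TOP 2` on SQL Server), and made-up sample data for the `items` table.

```python
import sqlite3
from datetime import datetime, timedelta

# In-memory SQLite database standing in for the real one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, ResultID INTEGER, "
    "AgentID INTEGER, ChangeDate TEXT)"
)
start = datetime(2020, 1, 1)
rows = [
    (i, i % 20, i % 7, (start + timedelta(minutes=i)).isoformat())
    for i in range(10_000)
]
conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", rows)

base = (
    "SELECT id, ChangeDate FROM items "
    "WHERE ResultID = ? AND AgentID = ? "
    "ORDER BY ChangeDate DESC"
)
params = (3, 3)

# What the original code does: fetch every matching row, then use two.
all_rows = conn.execute(base, params).fetchall()
newest_two_in_memory = all_rows[:2]

# What Take(2) should translate to: the database returns only two rows.
newest_two_in_db = conn.execute(base + " LIMIT 2", params).fetchall()

print(len(all_rows), "rows fetched without a limit")
print(newest_two_in_db)
```

Both variants yield the same two rows; the second one simply never transfers or materializes the rest.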

1 - Add an index that covers the fields you use in the query
2 - Make sure the saving from fetching only the top 2 is not cancelled out by repeating the query too frequently;
try to define query criteria that let you fetch a batch of records at once
3 - Try to compile your LINQ query
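As a rough sketch of the first point, the snippet below creates an index over the filter and sort columns and asks the engine how it plans to run the query before and after. It assumes SQLite (via Python's sqlite3) rather than SQL Compact, so the plan text is SQLite-specific; the index name `ix_items_lookup` and the `Payload` column are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, ResultID INTEGER, "
    "AgentID INTEGER, ChangeDate TEXT, Payload TEXT)"
)

query = (
    "SELECT * FROM items WHERE ResultID = ? AND AgentID = ? "
    "ORDER BY ChangeDate DESC LIMIT 2"
)

def plan(sql, params):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)

plan_before = plan(query, (1, 1))

# Hypothetical index: equality columns first, then the sort column.
conn.execute(
    "CREATE INDEX ix_items_lookup ON items (ResultID, AgentID, ChangeDate)"
)
plan_after = plan(query, (1, 1))

print("before:", plan_before)
print("after: ", plan_after)
```

On SQL Server or SQL Compact the equivalent check is to look at the execution plan, as suggested in the first answer.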

Related

Entity Framework Skip/Take is very slow when number to skip is big

So, code is very simple:
var result = dbContext.Skip(x).Take(y).ToList();
When x is big (~1.000.000), query is very slow. y is small - 10, 20.
SQL code for this is: (from sql profiler)
SELECT ...
FROM ...
ORDER BY ...
OFFSET x ROWS FETCH NEXT y ROWS ONLY
The question is if anybody knows how to speed up such paging?

You are right on that: the Skip().Take() approach is slow on SQL Server. When I noticed that, I used another approach and it worked well. Instead of using LINQ Skip().Take(), which generates the code you showed, I explicitly write the SQL as:
select top NTake ... from ... where orderedByValue > lastRetrievedValue order by ...
This one works fast (provided I have an index on the ordered-by column(s)).
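The idea of seeking past the last retrieved value, instead of skipping rows, could look like the sketch below. It is written against SQLite through Python's sqlite3 (so `LIMIT`/`OFFSET` instead of `TOP` and `OFFSET ... FETCH`), and the `events` table and helper names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, f"event-{i}") for i in range(1, 1001)],
)

PAGE_SIZE = 10

def page_by_offset(skip):
    # Equivalent of Skip(x).Take(y): the engine still walks past `skip` rows.
    return conn.execute(
        "SELECT id, name FROM events ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, skip),
    ).fetchall()

def page_after(last_id):
    # Keyset paging: seek straight to the first row after the last one seen.
    return conn.execute(
        "SELECT id, name FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, PAGE_SIZE),
    ).fetchall()

offset_page = page_by_offset(900)
keyset_page = page_after(900)  # 900 is the last id of the previous page
print(keyset_page[0], keyset_page[-1])
```

The trade-off is that the caller has to remember the last value of the ordering column, so jumping directly to an arbitrary page number is no longer possible.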

I think OFFSET .. FETCH is very useful when browsing the first pages from your large data (which is happening very often in most applications) and have a performance drawback when querying high order pages from large data.
Check this article for more details regarding performance and alternatives to OFFSET .. FETCH.
Try to apply as many filters to your data as possible before applying paging, so that paging is run against a smaller data volume. It is hard to imagine that the user wants to navigate through 1M rows.

Navigating through a million records in a database will always be slow in comparison to other ways: the database has to "skip" a million records, which it does by creating the result in memory and then discarding the first million rows.
Have you thought about a non-sql alternative (solr, lucene, etc), at least to get the ids of your rows first and then using a where id in () query?
Alternatively you could have a Search table (cooked up table) of your main table with only the bare minimum data and the Ids, so you can skip over that and get the ids and query the big table with those.

You may lack some index on your table (or you may have too many of them), causing the SQL ordering/filtering to be unable to efficiently skip that many rows (or in case of too many indexes, causing it to fail choosing a good index for the job).
Try testing the SQL query directly:
check its actual execution plan,
check for missing index hints (may require to rewrite the query as a non-dynamic sql query, if ef has issued some dynamic query code),
check for sorts spilling in temp db,
...
So, in short, check whether the issue is really an Entity-Framework issue or a 'pure' SQL issue.
Side note: EF issues offset/fetch paged queries only if it is configured for SQL2012 dialect. For previous dialects, it uses row_number() instead.

The reason is that when EF Core translates the query to raw SQL, the Take/Skip values are parameterized, which can cause slowness. This can be solved by using a view.

is "where (ParamID = @ParamID) OR (@ParamID = -1)" a good practice in sql selection

I used to write SQL statements like
select * from teacher where (TeacherID = @TeacherID) OR (@TeacherID = -1)
and pass @TeacherID value = -1 to select all teachers.
Now I'm worried about the performance.
Can you tell me whether that is good practice or bad?
Many thanks

If TeacherID is indexed and you are passing a value other than -1 as TeacherID to search for details of a specific teacher then this query will end up doing a full table scan rather than the potentially far more efficient option of seeking into the index to retrieve the details of the specific teacher...
... Unless you are on SQL 2008 SP1 CU5 and later and use the OPTION (RECOMPILE) hint. See Dynamic Search Conditions in T-SQL for the definitive article on the topic.

We use this in a very limited fashion in stored procedures.
The problem is that the database engine isn't able to keep a good query plan for it. When dealing with a lot of data this can have a serious negative performance impact.
However, for smaller data sets (I'd say less than 1000 records, but that's a guess) it should be fine. You'll have to test in your particular environment.
If it's in a stored procedure, you might want to include something like a WITH RECOMPILE option so that the plan is regenerated on each execution. This adds (slightly) to the time for each run, but over several runs can actually reduce the average execution time. Also, this allows the database to inspect the actual query and "short circuit" the parts that aren't necessary on each call.
If you are directly creating your SQL and passing it through, then I'd suggest you make the part that builds your sql a little smarter so that it only includes the part of the where clause you actually need.
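A minimal sketch of a "smarter" query builder, assuming the SQL is assembled in application code: the WHERE clause is only emitted when a filter is actually supplied, and the value is still passed as a parameter. The example uses Python with sqlite3; the `Name` column, the sample rows, and the function name `build_teacher_query` are hypothetical.

```python
import sqlite3

def build_teacher_query(teacher_id=None):
    """Return (sql, params) with a WHERE clause only when a filter is given."""
    sql = "SELECT TeacherID, Name FROM teacher"
    params = []
    if teacher_id is not None:
        # The value stays a bound parameter; only the clause is optional.
        sql += " WHERE TeacherID = ?"
        params.append(teacher_id)
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teacher (TeacherID INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO teacher VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace"), (3, "Edsger")],
)

one = conn.execute(*build_teacher_query(2)).fetchall()
everyone = conn.execute(*build_teacher_query()).fetchall()
print(one, len(everyone))
```

Each shape of the query then gets its own plan, so the single-teacher lookup can use the index while "all teachers" scans the table.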
Another path you might consider is using UNION ALL queries as opposed to optional parameters. For example:
SELECT * FROM Teacher WHERE (TeacherId = @TeacherID)
UNION ALL
SELECT * FROM Teacher WHERE (@TeacherId = -1)
This actually accomplishes the exact same thing; however, the query plan is cacheable. We've used this method in a few places as well and saw performance improvements over using WITH RECOMPILE. We don't do this everywhere because some of our queries are extremely complicated and I'd rather have a performance hit than to complicate them further.
Ultimately though, you need to do a lot of testing.
There is a second part here that you should reconsider. SELECT *. It is ALWAYS preferable to actually name the columns you want returned and to make sure that you are only returning the ones you will actually need. Moving data across network boundaries is very expensive and you can generally get a fair amount of performance boost simply by specifying exactly what you want. In addition if what you need is very limited you can sometimes do covering indexes so that the database engine doesn't even have to touch the underlying tables to get the data you want.

If you're really worried about performance, you could break up your procedure to call on two different procs: one for all records, and one based on the parameter.
If @TeacherID = -1
exec proc_Get_All_Teachers
else
exec proc_Get_Teacher_By_TeacherID @TeacherID
Each one can be optimized individually.
It's your system, so compare the performance. Consider optimizing for the most popular choice. If most users are going to select a single record, why hinder their performance just to accommodate the few that select all teachers (and who should have a reasonable expectation of performance)?
I know a single select query is easier to maintain, but at some point ease of maintenance eventually gives way to performance.

How to improve query performance

I have a lot of records in table. When I execute the following query it takes a lot of time. How can I improve the performance?
SET ROWCOUNT 10
SELECT StxnID
,Sprovider.description as SProvider
,txnID
,Request
,Raw
,Status
,txnBal
,Stxn.CreatedBy
,Stxn.CreatedOn
,Stxn.ModifiedBy
,Stxn.ModifiedOn
,Stxn.isDeleted
FROM Stxn,Sprovider
WHERE Stxn.SproviderID = SProvider.Sproviderid
AND Stxn.SProviderid = ISNULL(@pSProviderID,Stxn.SProviderid)
AND Stxn.status = ISNULL(@pStatus,Stxn.status)
AND Stxn.CreatedOn BETWEEN ISNULL(@pStartDate,getdate()-1) and ISNULL(@pEndDate,getdate())
AND Stxn.CreatedBy = ISNULL(@pSellerId,Stxn.CreatedBy)
ORDER BY StxnID DESC
The stxn table has more than 100,000 records.
The query is run from a report viewer in asp.net c#.

This is my go-to article when I'm trying to do a search query that has several search conditions which might be optional.
http://www.sommarskog.se/dyn-search-2008.html
The biggest problem with your query is the column=ISNULL(@column, column) syntax. MSSQL won't use an index for that. Consider changing it to (@column IS NULL OR column = @column), which keeps the "NULL means no filter" behaviour.

You should consider using the execution plan and look for missing indexes. Also, how long it takes to execute? What is slow for you?
Maybe you could also not return so many rows, but that is just a guess. Actually we need to see your table and indexes plus the execution plan.
Check sql-tuning-tutorial

For one, use SELECT TOP () instead of SET ROWCOUNT - the optimizer will have a much better chance that way. Another suggestion is to use a proper inner join instead of potentially ending up with a cartesian product using the old style table,table join syntax (this is not the case here but it can happen much easier with the old syntax). Should be:
...
FROM Stxn INNER JOIN Sprovider
ON Stxn.SproviderID = SProvider.Sproviderid
...
And if you think 100K rows is a lot, or that this volume is a reason for slowness, you're sorely mistaken. Most likely you have really poor indexing strategies in place, possibly some parameter sniffing, possibly some implicit conversions... hard to tell without understanding the data types, indexes and seeing the plan.

There are a lot of things that could impact the performance of query. Although 100k records really isn't all that many.
Items to consider (in no particular order)
Hardware:
Is SQL Server memory constrained? In other words, does it have enough RAM to do its job? If it is swapping memory to disk, then this is a sure sign that you need an upgrade.
Is the machine disk constrained? In other words, are the drives fast enough to keep up with the queries you need to run? If it's memory constrained, then disk speed becomes a larger factor.
Is the machine processor constrained? For example, when you execute the query does the processor spike for long periods of time? Or, are there already lots of other queries running that are taking resources away from yours...
Database Structure:
Do you have indexes on the columns used in your where clause? If the tables do not have indexes then it will have to do a full scan of both tables to determine which records match.
Eliminate the ISNULL function calls. If this is a direct query, have the calling code validate the parameters and set default values before executing. If it is in a stored procedure, do the checks at the top of the procedure. Unless you are executing this with RECOMPILE, which does parameter sniffing, those functions will have to be evaluated for each row.
Network:
Is the network slow between you and the server? Depending on the amount of data pulled you could be pulling GB's of data across the wire. I'm not sure what is stored in the "raw" column. The first question you need to ask here is "how much data is going back to the client?" For example, if each record is 1MB+ in size, then you'll probably have disk and network constraints at play.
General:
I'm not sure what "slow" means in your question. Does it mean that the query is taking around 1 second to process or does it mean it's taking 5 minutes? Everything is relative here.
Basically, it is going to be impossible to give a hard answer without a lot of questions asked by you. All of these will bear out if you profile the queries, understand what and how much is going back to the client and watch the interactions amongst the various parts.
Finally depending on the amount of data going back to the client there might not be a way to improve performance short of hardware changes.

Make sure Stxn.SproviderID, Stxn.status, Stxn.CreatedOn, Stxn.CreatedBy, Stxn.StxnID and SProvider.Sproviderid all have indexes defined.
(NB -- you might not need all, but it can't hurt.)

I don't see much that can be done on the query itself, but I can see things being done on the schema :
Create an index / PK on Stxn.SproviderID
Create an index / PK on SProvider.Sproviderid
Create indexes on status, CreatedOn, CreatedBy, StxnID

Something to consider: When ROWCOUNT or TOP are used with an ORDER BY clause, the entire result set is created and sorted first and then the top 10 results are returned.
How does this run without the Order By clause?
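Whether the full sort actually happens depends on whether an index can deliver the rows in the requested order. The sketch below shows this with SQLite's planner through Python's sqlite3, which is only a stand-in for SQL Server here; the cut-down `stxn` table and the index name `ix_stxn_id` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the Stxn table; StxnID is deliberately not a key.
conn.execute("CREATE TABLE stxn (StxnID INTEGER, Status TEXT)")

query = "SELECT StxnID, Status FROM stxn ORDER BY StxnID DESC LIMIT 10"

def plan(sql):
    """Return SQLite's query plan as a single readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# No index on the sort column: the engine has to sort the result itself.
plan_without_index = plan(query)

# With an index on the sort column it can read rows already in order.
conn.execute("CREATE INDEX ix_stxn_id ON stxn (StxnID)")
plan_with_index = plan(query)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

In SQLite the first plan reports a temporary B-tree for the ORDER BY and the second does not; on SQL Server the corresponding thing to look for is a Sort operator in the execution plan.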

using distinct command

Is using the DISTINCT command in SQL good practice or not? Are there any drawbacks to the DISTINCT command?

It depends entirely on what your use case is. DISTINCT is useful in certain circumstances, but it can be overused.
The drawbacks are mainly increased load on the query engine to perform the sort (since it needs to compare the resultset to itself to remove duplicates), and it can be used to mask an issue in your data - if you are getting duplicates there may be a problem with your source data.
The command itself isn't inherently good or bad. You can use a screwdriver to hammer a nail, but that doesn't mean it's a good idea, or that screwdrivers are bad in all cases.

If you need to use it regularly to get the correct output then you have a design or JOIN issue
It's perfectly valid for use otherwise.
It is a kind of aggregate though: the equivalent of a GROUP BY on all output columns. So it is an extra step in query processing.
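The equivalence with GROUP BY can be checked on a toy table. This is a small sketch using Python's sqlite3 with invented `orders` data, not a statement about how any particular engine implements DISTINCT internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", "Oslo"), ("ann", "Oslo"), ("bob", "Bergen"), ("ann", "Bergen")],
)

# DISTINCT over the selected columns...
distinct_rows = conn.execute(
    "SELECT DISTINCT customer, city FROM orders ORDER BY customer, city"
).fetchall()

# ...returns the same rows as grouping by every selected column.
grouped_rows = conn.execute(
    "SELECT customer, city FROM orders GROUP BY customer, city "
    "ORDER BY customer, city"
).fetchall()

print(distinct_rows)
```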

From this http://www.mindfiresolutions.com/Think-Before-Using-Distinct-Command-Arbitarily-1050.php
Sometimes beginners who get duplicates in their result set simply reach for DISTINCT. But this has its own disadvantages.
DISTINCT decreases the query's performance, because the normal procedure is to sort the results and then remove rows that are equal to the row immediately before them.
DISTINCT compares all fields of the record, so it increases computation.

It is part of the language, so should be used.
In some circumstances using DISTINCT may cause a table scan where otherwise one would not occur.
You will need to test for each of your own use cases to see if there is an impact and find a workaround if the impact is unacceptable.

If you want the work to make sure the results are distinct to happen inside the SQL server on the SQL machine, then use it. If you don't mind sending extra results to the client and doing the work there (to reduce server load) then do that. It depends on your performance requirements and the characteristics of your database.
For example, if it's extremely unlikely that distinct will reduce the result set much, and you don't have the right columns indexed to make it fast, and you need to reduce SQL Server load, and you have spare cycles on the client, and it's easy to ensure distinctness on the client -- then you might want to do that.
That's a lot of ifs, ands, and mights. If you don't know -- just use it.

LEFT JOIN vs. multiple SELECT statements

I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?

This is definitely wrong. You are going over the wire a second time for no reason. DBs are very fast at their problem space. Joining tables is one of those things, and you'll see more of a performance degradation from the second query than from the join. Unless your tablespace is hundreds of millions of records, this is not a good idea.

There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing are the only things that will tell you that. MySQL with moderately large tables seems to be susceptible to this, IME. Performing three queries on one table can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.

I'm with you - a single SQL would be better

There's a danger of treating your SQL DBMS as if it were an ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.

It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something From foo a Left Join foo2 b On a.foreign_key = b.key Where a.Key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
Yeah, I think it seems like what you're saying is correct.
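To make the comparison concrete, here is a sketch of both approaches side by side, using Python's sqlite3 instead of PHP and a tiny invented data set for the `foo` and `foo2` tables from the question. It only shows that the LEFT JOIN returns the same information in one round trip; it does not measure performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo2 (key INTEGER PRIMARY KEY, something TEXT);
    CREATE TABLE foo (key INTEGER PRIMARY KEY, blah1 TEXT, blah2 TEXT,
                      foreign_key INTEGER);
    INSERT INTO foo2 VALUES (10, 'related');
    INSERT INTO foo VALUES (1, 'a', 'b', 10);
    INSERT INTO foo VALUES (2, 'c', 'd', 0);
""")

def two_queries(foo_key):
    # The pattern from the question: a second SELECT only if a key is set.
    row = conn.execute(
        "SELECT blah1, blah2, foreign_key FROM foo WHERE key = ?", (foo_key,)
    ).fetchone()
    something = None
    if row[2] > 0:
        other = conn.execute(
            "SELECT something FROM foo2 WHERE key = ?", (row[2],)
        ).fetchone()
        something = other[0] if other else None
    return (row[0], row[1], something)

def one_query(foo_key):
    # Single round trip: a missing related row shows up as NULL (None).
    return conn.execute(
        "SELECT a.blah1, a.blah2, b.something FROM foo a "
        "LEFT JOIN foo2 b ON a.foreign_key = b.key WHERE a.key = ?",
        (foo_key,),
    ).fetchone()

print(one_query(1), one_query(2))
```

The application still has to branch on whether `something` is None, but that check replaces the second database call.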

The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!

You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.

Considering that one database hit gives you all the data you need, a single SQL statement would perform better 99% of the time. I'm not sure whether the connection is being created dynamically in this case, but if so, that is expensive. Even if the process is reusing existing connections, the DBMS doesn't get to optimize the queries in the best way and isn't really making use of the relationships.
The only way I could ever see doing the calls like this for performance reasons is if the data being retrieved by the foreign key is a large amount and it is only needed in some cases. But in the sample you describe it just grabs it if it exists so this is not the case and therefore not gaining any performance.

The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the original query I was inheriting consisted of a single query that had so many joins in it that it would take the SQL a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables) and broke the query down into a lot of the smaller single select type statements and constructed the final result set in this manner.
This update dramatically fixed the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
I'm not trying to object for objection's sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.

A single SQL query would lead to better performance, as the SQL server (which sometimes isn't at the same location) needs to handle just one request. If you use multiple SQL queries, you introduce a lot of overhead:
executing more CPU instructions,
sending a second query to the server,
creating a second thread on the server,
possibly executing more CPU instructions on the server,
destroying the second thread on the server,
sending the second result back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.

Doing a simple two table join is usually the best way to go after this problem domain, however depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements, but typically I haven't run into this problem until I started approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables to ensure you aren't scanning the disk for all records, that is the biggest performance hit a database gets (in my limited experience)

You should always try to minimize the number of queries to the database when you can. Your example is perfect for a single query. This way you will later be able to cache more easily, or handle more requests at the same time, because instead of always using 2-3 queries that each require a connection, you will have only 1 each time.

There are many cases that will require different solutions and it isn't possible to explain all together.
A join scans both tables and loops to match each record of the first table in the second table. A simple select query will work faster in many cases, as it only needs the primary/unique key (if one exists) to look up the data internally.
