LINQ vs datasets - performance hit? - vb.net

I am refactoring an existing VB.NET application to use Linq. I've been able to successfully get it to work, but it takes ages (over a minute) on the client machine!
They have lots of rows in the database table, but the old version of the programme on the same machine (which uses Datasets) takes 5 seconds.
My Linq queries are pretty standard, like so:
Dim query = From t As TRANSACTION In db.TRANSACTIONs _
            Where t.transactionID = transactionID _
            Select t
They only ever return one or zero rows. Any thoughts?

I am surprised by the huge time differential (5 seconds to 60+ seconds). I guess it would depend on how complex the TRANSACTION entity is. LINQ to SQL will process each row from your result set and turn it into an object, then add some state-tracking information to the DataContext. A DataSet simply stores the data raw, and processes it into strongly typed data as you read it from the DataTable. I wouldn't expect L2S to incur a 12-fold cost increase, but I would expect some increase.

The code you've pasted doesn't actually access the database at all -- what you do next with 'query' will determine how much data ends up getting transferred to the client. Is it possible something you're doing later on is causing the LINQ version to download more data than the Dataset version?
I have done the same transition on a project and seen only equivalent or better performance from LINQ but there have been instances where the LINQ version was doing a lot more roundtrips to the server, e.g. doing a Count() followed by a fetch of the data as two separate server queries. I usually solved this by doing a .ToList() to get the data locally before working on it. You have to use SQL Profiler sometimes to find out what is going on behind the scenes.
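The extra round trip described above can be made visible with a small sketch. This is not LINQ: it uses Python's built-in sqlite3 module as a stand-in, with a statement trace playing the role of SQL Profiler. The `transactions` table and the helper names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO transactions (amount) VALUES (?)", [(1.0,), (2.0,), (3.0,)])
conn.commit()

executed = []  # every statement the connection actually runs is appended here
conn.set_trace_callback(executed.append)


def count_then_fetch():
    """Two round trips: one query for the count, another for the rows."""
    n = conn.execute("SELECT COUNT(*) FROM transactions WHERE amount > 1").fetchone()[0]
    rows = conn.execute("SELECT id, amount FROM transactions WHERE amount > 1").fetchall()
    return n, rows


def fetch_once():
    """One round trip: materialise the rows, then count them locally."""
    rows = conn.execute("SELECT id, amount FROM transactions WHERE amount > 1").fetchall()
    return len(rows), rows


def count_statements(fn):
    """Run fn and report how many statements reached the database."""
    executed.clear()
    result = fn()
    return len(executed), result
```

Calling `count_statements(count_then_fetch)` and `count_statements(fetch_once)` returns the same data either way, but the first reports two statements and the second only one, which is the kind of difference a profiler trace would show.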

Running an SQL query in background

I'm trying to update a modest dataset of 60k records with a value which takes a little time to compute. From a small trial run of 6k records in the production environment, it took 4 minutes to complete, so the full execution should take around 40 minutes.
However this trial run showed that there were SQL timeouts occurring on user requests when accessing data in related tables (but not necessarily on the actual rows which were being updated).
My question is: is there a way of running non-urgent queries as a background operation in SQL Server without causing timeouts or locking tables for extended periods? Requests do not need to see the new value of the column being updated during this period; if a request came in for such a row, returning the old value would be perfectly acceptable, rather than locking the set until the update is complete. (I'm not sure of the ins and outs of how this works. Obviously I do want to prevent data corruption; perhaps there is a way of queuing any additional changes in the background.)
This is possibly a situation where the NOLOCK hint is appropriate. You can read about SQL Server isolation levels in the documentation. And Googling "SQL Server NOLOCK" will give you plenty of material on why you should not over-use the construct.
I might also investigate whether you need a SQL query to compute values. A single query that takes 4 minutes on 6k records . . . well, that is a long time. You might want to consider reading the data into an application (say, using Python, R, or whatever) and doing the data manipulation there. It may also be possible to speed up the query processing itself.

Performance Issue : Sql Views vs Linq For Decreasing Query Execution Time

I have a system built in ASP.NET Web Forms with an accounts records generation form. In some specific situations I need to fetch all records, which is close to 1 million.
One solution could be to reduce the number of records fetched, but when we need records for more than a year, or for 5 years, there are half a million, 1 million, and so on. How can I decrease the time this takes?
What points could I use to reduce the time? I can't show the full query here; it's a big view that calls some other views.
Would it take less time if I wrote it as a LINQ query? That's why I asked about LINQ vs views.
I have executed a "Select * from TableName" query; it has been 40 minutes and it is still executing. The table has 117,000 records. Can we reduce this time?
I started this as a comment but ran out of room.
Use the server to do as much filtering for you as possible and return as few rows as possible. Client-side filtering is always going to be much slower than server-side filtering; for example, the client does not have access to the indexes and optimisation techniques that exist on the server.
Linq uses "lazy evaluation" which means that it builds up a method for filtering but does not execute it until it is forced to. I've used it and was initially impressed with the speed ... until I started to access the data it returned. When you use the data you want from Linq, this will trigger the actual selection process, which you'll find is slow.
Use the server to return a series of small resultsets and then process those. If you need to join these resultsets on a key, save them into dictionaries with that key so you can join them quickly.
Another approach is to look at Entity Framework to create a mirror of the server database structure along with indexes so that the subset of data you retrieve can be joined quickly.
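The dictionary join mentioned above could look roughly like this. It is a sketch in Python rather than .NET, and the two result sets (`customers`, `invoices`) and their columns are invented for the example.

```python
from collections import defaultdict

# Two small result sets, as they might come back from two separate server queries.
customers = [(1, "Ada"), (2, "Grace"), (3, "Edsger")]          # (customer_id, name)
invoices = [(101, 1, 250.0), (102, 1, 75.5), (103, 3, 12.0)]   # (invoice_id, customer_id, total)


def join_on_customer(customers, invoices):
    """Hash join: index one side by key, then probe it once per row of the other."""
    invoices_by_customer = defaultdict(list)
    for invoice_id, customer_id, total in invoices:
        invoices_by_customer[customer_id].append((invoice_id, total))

    joined = []
    for customer_id, name in customers:
        # A dictionary lookup replaces a scan over every invoice.
        for invoice_id, total in invoices_by_customer.get(customer_id, []):
            joined.append((name, invoice_id, total))
    return joined
```

Building the dictionary is one pass over the invoices, and each lookup is constant time on average, so the join costs roughly the sum of the two sizes instead of their product.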

Association properties, nightmare performance (Entity Framework)

I have a fairly large EF4 model, using POCO code gen. I've got lots of instances where I select a single entity from whichever table by its ID.
However on some tables, this takes 2 minutes or more, where on most tables it takes less than a second. I'm out of ideas as to where to look now, because I can't see any reason. It's always the same tables that cause problems, but I can query them directly against the database without problems, so it must be somewhere in Entity Framework territory that the problem is coming from.
The line is quite innocuous:
Dim newProd As New Product
newProd.ShippingSize = Entities.ShippingSizes.Single(Function(ss) ss.Id = id)
id is simply an integer passed in from the UI, Id on my entity is the primary key, which is indexed on the database
Entities is a freshly created instance of my entity framework datacontext
This is not the first query being executed against the Context, it is the first query against this EntitySet though
I have re-indexed all tables having seen posts suggesting that a corrupt index could cause slow access, that hasn't made any difference
The exact same line of code against other tables runs almost instantly, it's only certain tables
This particular table is tiny - it only has 4 things in it
Any suggestions as to where to even start?
--edit - I'd oversimplified the code in the question to the point where the problem disappeared!
Where to start?
Print or log the actual SQL string that's being sent to the database.
Execute that literal string on the server and measure its performance.
Use your server's EXPLAIN plan system to see what the server's actually doing.
Compare the raw SQL performance to your EF performance.
That should tell you whether you have a database problem or an EF problem.
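As a rough illustration of the "run the literal SQL and look at the plan" steps, here is a sketch using Python's built-in sqlite3 module in place of the real server. SQLite's `EXPLAIN QUERY PLAN` stands in for whatever plan tool your database offers; the `shipping_sizes` table, its columns and the helper names are assumptions made for the example.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipping_sizes (id INTEGER PRIMARY KEY, code TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO shipping_sizes (code, label) VALUES (?, ?)",
    [(f"S{i}", f"size {i}") for i in range(10_000)],
)
conn.commit()


def plan(sql, params=()):
    """Return the plan steps the engine reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


def timed(sql, params=()):
    """Run the literal SQL and return (elapsed_seconds, rows)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return time.perf_counter() - start, rows


by_code = "SELECT id, label FROM shipping_sizes WHERE code = ?"
plan_before = plan(by_code, ("S42",))  # no index on code yet, so a full scan
conn.execute("CREATE INDEX ix_shipping_sizes_code ON shipping_sizes (code)")
plan_after = plan(by_code, ("S42",))   # the same statement now uses the index
elapsed, rows = timed(by_code, ("S42",))
```

If the raw statement is fast and its plan looks sensible, as it would be here after the index is added, the time is being lost somewhere above the database.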
Seems like this is a function of the POCO template's Fixup behaviour in combination with lazy loading.
Because the entity has already been loaded via Single, subsequent operations seem to be happening in memory rather than against the database. The Fixup method by default makes Contains() calls, which is where everything grinds to a halt while tens of thousands of items get retrieved, initialised as proxies, and evaluated in memory.
I tried changing this Contains() to a Where(Function(x) x.Id = id).Count > 0 (will do logically the same thing, but trying to force a quick DB operation instead of the slow in-memory one). The query was still performed in-memory, and just as slow.
I switched from POCO to the standard EntityGenerator, and this problem just disappeared with no other changes. Say what you will about patterns/practices, but this is a nasty problem to have - I didn't spot this until I switched from fakes and small test databases to a full size database. Entity Generator saves the day for now.

How to load 1 million records from a database fast?

We have a Firebird database with 1.000.000 records that must be processed after ALL of them are loaded into RAM. Getting all of them by extracting data with (select * first 1000 ...) takes 8 hours. What is the solution for this?
Does each of your "select * first 1000" (as you described it) do a full table scan? Look at those queries, and make sure they are using an index.
How long does it take to construct the DTO object that you are creating with each data read?
{
    int a = read.GetInt32(0);
    int b = read.GetInt32(1);
    mylist.Add(new DTO(a, b));
}
You are creating a million of these objects. If it takes 29 milliseconds to create one DTO object, then that is going to take over 8 hours to complete.
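The arithmetic behind that estimate, as a quick Python check (the function names are just for this example):

```python
def total_hours(rows, ms_per_row):
    """Hours needed when rows are processed one after another."""
    return rows * ms_per_row / 1000 / 3600


def ms_per_row_for(hours, rows):
    """Per-row time in milliseconds implied by a total run time."""
    return hours * 3600 * 1000 / rows


hours_at_29ms = total_hours(1_000_000, 29)   # a little over 8 hours
implied_ms = ms_per_row_for(8, 1_000_000)    # 28.8 ms per row
```

Read the other way round, an 8 hour run over a million rows means each row is costing about 29 ms, which is far more than reading two integers and constructing a small object should need.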
"to load data from a table with 1.000.000 rows in C# using a firebird db takes on a Pentium 4 3Ghz at least 8 hours"
Everybody's been assuming you were running a SQL query to select the records from the database. Something like:
select *
from your_big_table
/
Because that really would take a few seconds. Well, a little longer to display it on a screen, but executing the actual select should be lightning fast.
But that reference to C# makes me think you're doing something else. Perhaps what you really have is an RBAR loop instantiating one million objects. I can see how that might take a little longer. But even so, eight hours? Where does the time go?
edit
My guess was right and you are instantiating 1000000 objects in a loop. The correct advice would be to find some other way of doing whatever it is you do once you have got all your objects in memory. Without knowing more about the details it is hard to give specifics. But it seems unlikely this is a UI thing - what user is going to peruse a million objects?
So a general observation will have to suffice: use bulk operations to implement bulk activity. SQL databases excel at handling sets. Leverage the power of SQL to process your million rows in a single set, rather than as individual rows.
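To make the contrast concrete, here is a sketch of the same change done row by row and as a single set-based statement. It uses Python with an in-memory SQLite database purely as a stand-in (not Firebird), and the `items` table and the 10% price change are made up.

```python
import sqlite3


def make_db(n):
    """Create a throwaway database with n rows to update."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL)")
    conn.executemany("INSERT INTO items (price) VALUES (?)", [(float(i),) for i in range(n)])
    conn.commit()
    return conn


def raise_prices_row_by_row(conn):
    """RBAR: pull every row into the client, then send one UPDATE per row."""
    for row_id, price in conn.execute("SELECT id, price FROM items").fetchall():
        conn.execute("UPDATE items SET price = ? WHERE id = ?", (price * 1.1, row_id))
    conn.commit()


def raise_prices_as_set(conn):
    """Set-based: one statement, and the rows never leave the database."""
    conn.execute("UPDATE items SET price = price * 1.1")
    conn.commit()
```

Both functions leave the table in the same state; the second simply lets the database do the whole job in one pass instead of a million small ones.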
If you don't find this answer helpful then you need to give us more details regarding what you're trying to achieve.
What sort of processing do you need to do that would require to load them in memory and not just process them via SQL statements?
There are two techniques I use that work depending on what I am trying to do.
Assuming there is some sort of artificial key (identity), work in batches, incrementing the last identity value processed.
BCP the data out to a text file, churn through the updates, then BCP it back in, remembering to turn off constraints and indexes before the IN step.
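The first technique, batching on the last identity value processed, might be sketched like this. It is Python with sqlite3 as a stand-in database, `big_table` and its columns are hypothetical, and it assumes the key is a positive, increasing integer.

```python
import sqlite3


def iter_batches(conn, batch_size=1000):
    """Yield lists of (id, payload) rows, resuming after the last id seen."""
    last_id = 0  # assumes ids start above zero
    while True:
        batch = conn.execute(
            "SELECT id, payload FROM big_table WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            return
        yield batch
        last_id = batch[-1][0]  # the next query starts just past this key
```

Because each query seeks directly to `id > last_id` on the key, later batches cost about the same as the first one, which is not the case when every batch has to skip over all the rows already read.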
Take a look at this:
http://www.firebirdfaq.org/faq13/

Simulating queries of large views for benchmarking purposes

A Windows Forms application of ours pulls records from a view on SQL Server through ADO.NET and a SOAP web service, displaying them in a data grid. We have had several cases with ~25,000 rows, which works relatively smoothly, but a potential customer needs to have many times that much in a single list.
To figure out how well we scale right now, and how (and how far) we can realistically improve, I'd like to implement a simulation: instead of displaying actual data, have the SQL Server send fictional, random data. The client and transport side would be mostly the same; the view (or at least the underlying table) would of course work differently. The user specifies the amount of fictional rows (e.g. 100,000).
For the time being, I just want to know how long it takes for the client to retrieve and process the data until it is just about ready to display it.
What I'm trying to figure out is this: how do I make the SQL Server send such data?
Do I:
Create a stored procedure that has to be run beforehand to fill an actual table?
Create a function that I point the view to, thus having the server generate the data 'live'?
Somehow replicate and/or randomize existing data?
The first option sounds to me like it would yield the results closest to the real world. Because the data is actually 'physically there', the SELECT query would be quite similar performance-wise to one on real data. However, it taxes the server with an otherwise meaningless operation. The fake data would also be backed up, as it would live in one and the same database — unless, of course, I delete the data after each benchmark run.
The second and third option tax the server while running the actual simulation, thus potentially giving unrealistically slow results.
In addition, I'm unsure how to create those rows, short of using a loop or cursor. I can use SELECT top <n> random1(), random2(), […] FROM foo if foo actually happens to have <n> entries, but otherwise I'll (obviously) only get as many rows as foo happens to have. A GROUP BY newid() or similar doesn't appear to do the trick.
For data for testing CRM type tables, I highly recommend fakenamegenerator.com, you can get 40,000 fake names for free.
You didn't mention if you're using SQL Server 2008. If you use 2008 and you use Data Compression, be aware that random data will act very differently (slower) than real data. Random data is much harder to compress.
Quest Toad for SQL Server and Microsoft Visual Studio Data Dude both have test data generators that will put fake "real" data into records for you.
If you want results you can rely on you need to make the testing scenario as realistic as possible, which makes option 1 by far your best bet. As you point out if you get results that aren't good enough with the other options you won't be sure that it wasn't due to the different database behaviour.
How you generate the data will depend to a large degree on the problem domain. Can you take data sets from multiple customers and merge them into a single mega-dataset? If the data is time series then maybe it can be duplicated over a different range.
The data is typically CRM-like, i.e. contacts, projects, etc. It would be fine to simply duplicate the data (e.g., if I only have 20,000 rows, I'll copy them five times to get my desired 100,000 rows). Merging, on the other hand, would only work if we never deploy the benchmarking tool publicly, for obvious privacy reasons (unless, of course, I apply a function to each column that renders the original data unintelligible beyond repair? Similar to a hashing function, only without modifying the value's size too much).
To populate the rows, perhaps something like this would do:
WHILE (SELECT count(1) FROM benchmark) < 100000
INSERT INTO benchmark
SELECT TOP 100000 * FROM actualData
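For the earlier question of producing an arbitrary number of fictional rows without a loop or cursor, one set-based option is a recursive row generator. The sketch below shows the idea in Python against SQLite, not SQL Server; the `benchmark` columns are invented, and the T-SQL equivalent would differ in syntax (SQL Server also caps recursion depth by default), so treat it as an illustration only.

```python
import sqlite3


def fill_benchmark(conn, n):
    """Insert n rows (n >= 1) of random data with a single statement."""
    conn.execute(
        """
        WITH RECURSIVE seq(i) AS (
            SELECT 1
            UNION ALL
            SELECT i + 1 FROM seq WHERE i < ?
        )
        INSERT INTO benchmark (id, name, amount)
        SELECT i, 'contact ' || i, abs(random() % 10000)
        FROM seq
        """,
        (n,),
    )
    conn.commit()
```

The row count comes from the generated sequence rather than from however many rows some existing table happens to have, which was the limitation with SELECT TOP <n> ... FROM foo.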