Association properties, nightmare performance (Entity Framework) - sql

I have a fairly large EF4 model, using POCO code gen. I've got lots of instances where I select a single entity from whichever table by its ID.
However, on some tables this takes 2 minutes or more, where on most tables it takes less than a second. I'm out of ideas as to where to look now, because I can't see any reason for it. It's always the same tables that cause problems, but I can query them directly against the database without problems, so the problem must be coming from somewhere in Entity Framework territory.
The offending line is quite innocuous:
Dim newProd As New Product
newProd.ShippingSize = Entities.ShippingSizes.Single(Function(ss) ss.Id = id)
id is simply an integer passed in from the UI; Id on my entity is the primary key, which is indexed in the database
Entities is a freshly created instance of my entity framework datacontext
This is not the first query executed against the Context, though it is the first query against this EntitySet
I have re-indexed all the tables, having seen posts suggesting that a corrupt index could cause slow access; that hasn't made any difference
The exact same line of code against other tables runs almost instantly, it's only certain tables
This particular table is tiny - it only has 4 rows in it
Any suggestions as to where to even start?
Edit: I'd oversimplified the code in the question to the point where the problem disappeared!

Where to start?
Print or log the actual SQL string that's being sent to the database.
Execute that literal string on the server and measure its performance.
Use your server's EXPLAIN plan system to see what the server's actually doing.
Compare the raw SQL performance to your EF performance.
That should tell you whether you have a database problem or an EF problem.
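For SQL Server, once you have that captured SQL in hand, measuring it on the server can be as simple as the following (the SELECT here is only a stand-in for whatever statement EF actually generates, with a made-up id value):
-- Report CPU/elapsed time and logical reads for the statement; in Management Studio,
-- also turn on "Include Actual Execution Plan" (Ctrl+M) before running it.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT * FROM ShippingSizes WHERE Id = 42;  -- paste the literal SQL captured from EF here

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;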

Seems like this is a function of the POCO template's Fixup behaviour in combination with lazy loading.
Because the entity has already been loaded via Single, subsequent operations seem to be happening in memory rather than against the database. The Fixup method by default makes Contains() calls, which is where everything grinds to a halt while tens of thousands of items get retrieved, initialised as proxies, and evaluated in memory.
I tried changing this Contains() to a Where(Function(x) x.Id = id).Count > 0 (which does logically the same thing, but was an attempt to force a quick DB operation instead of the slow in-memory one). The query was still performed in memory, and was just as slow.
I switched from POCO to the standard EntityGenerator, and this problem just disappeared with no other changes. Say what you will about patterns/practices, but this is a nasty problem to have - I didn't spot this until I switched from fakes and small test databases to a full size database. Entity Generator saves the day for now.

Related

Is it OK to loop a SQL query in a programming language?

I have a question about retrieving data from the database.
There are two tables, and the master table's id is always inserted into the other table.
I know that data can be retrieved from the two tables by joining, but I want to know:
if I first retrieve all the data I need from the master table and then, in a loop (in the programming language), join to the other table and retrieve its data, which approach is more efficient, and why?
As far as efficiency goes the rule is you want to minimize the number of round trips to the database, because each trip adds a lot of time. (This may not be as big a deal if the database is on the same box as the application calling it. In the world I live in the database is never on the same box as the application.) Having your application loop means you make a trip to the database for every row in the master table, so the time your operation takes grows linearly with the number of master table rows.
Be aware that in dev or test environments you may be able to get away with inefficient queries if there isn't very much test data. In production you may see a lot more data than you tested with.
It is more efficient to work in the database, in fewer, larger queries, but unless the site or program is going to be very busy, I doubt it'll make much difference whether the loop runs inside the database or outside it. If it is a web application, then running large loops outside the database and waiting on the results will take a significantly larger amount of time.
What you're describing is sometimes called the N+1 problem. The 1 is your first query against the master table, the N is the number of queries against your detail table.
This is almost always a big mistake for performance.*
The problem is typically associated with using an ORM. The ORM presents your database entities as though they were objects, and the mistake is to assume that instantiating a data object is no more costly than creating any other object. But of course you can write code that does the same thing yourself, without using an ORM.
The hidden cost is that you now have code that automatically runs N queries, and N is determined by the number of matching rows in your master table. What happens when 10,000 rows match your master query? You won't get any warning before your database is expected to execute those queries at runtime.
And it may be unnecessary. What if the master query matches 10,000 rows, but you really only wanted the 27 rows for which there are detail rows (in other words, an INNER JOIN)?
Some people are concerned with the number of queries because of network overhead. I'm not as concerned about that. You should not have a slow network between your app and your database. If you do, then you have a bigger problem than the N+1 problem.
I'm more concerned about the overhead of running thousands of queries per second when you don't have to. The overhead is in the memory and in all the work the server process does to parse and execute each SQL statement.
Just Google for "sql n+1 problem" and you'll find lots of people discussing how bad this is, how to detect it in your code, and how to solve it (spoiler: do a JOIN, as sketched below).
* Of course every rule has exceptions, so to answer this for your application, you'll have to do load-testing with some representative sample of data and traffic.
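To make the fix concrete, this is roughly what collapsing N+1 into a single statement looks like (master/detail table and column names are made up):
-- N+1 pattern: the application issues this once per master row, N times in total.
-- SELECT * FROM detail WHERE master_id = @id

-- Set-based alternative: one round trip, returning only master rows that have detail rows.
SELECT m.id, m.name, d.id AS detail_id, d.amount
FROM master AS m
INNER JOIN detail AS d ON d.master_id = m.id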

Merge identical databases into one

We have 15 databases of 75 tables with an average of a million rows, all with the same schema but different data. We have now been given the requirement by the client to bring all 15 into one database, with each set of data filtered by the user's login.
The changes to the application have been completed to do the filtering. We are now left with the task of merging all databases into one.
The issue is conflicting PKs and FKs: since the PKs and FKs are of type int, we will end up with 15 PK ids of 1.
One idea is to use .NET and the DBML to insert the records as new records into the new database, letting LINQ deal with the PKs and FKs and using code to deal with duplicate data.
What other ways are there to do this?
It's never a trivial job to integrate databases when the records don't have unique primary keys in all databases. A few weeks ago I built a similar integration script for which I decided to use Entity Framework.
First the good news. With EF's DbContext API it's ridiculously easy to insert a complete object graph and make EF take care of all newly generated primary keys and foreign keys. The reason this is so easy is that when an object's state is changed to Added, all of the objects attached to it become Added as well, and EF figures out the right order of inserts. This is truly great! It let me build the core of the copy routine in a few hours, where it would have taken many days had I done it in T-SQL, for example. The latter is also much more error-prone.
Of course life isn't that easy. Now the bad news:
1. This takes tons of machine resources. Of course I used a new context instance for each copy step, but I still had to execute the program on a machine with a decent processor and a fair amount of internal memory. The exact specifications don't matter; the message is: test with the largest databases and see what kind of beast you need. If the memory consumption can't be managed by any machine at your disposal, you have to split the routine up into smaller chunks, but that will take more programming.
2. The object graph that's changed to Added must be divergent. By this I mean that there should only be 1-n associations starting from the root. The reason is that EF will really mark all objects as Added. So if somewhere in the graph a few branches refer back to the same object (because there is an n-1 association), these "new" objects will be multiplied, because EF doesn't know their identity. An example of this could be Company -< Customer -< Order >- OrderType: when there are only 2 order types, inserting one root company with 10 customers with 10 orders each will create 100 order type records instead of 2.
So the hard part is to find paths in your class structure that are as divergent as possible. This won't always be possible; when it isn't, you'll have to add the leaves of the converging paths first. In the example: first insert the order types. When a new company is inserted, you first load the existing order types into the context and then add the company. Now link the new orders to the existing order types. This can only be done if you can match objects by natural keys (in this example, the order type names), but usually this is possible.
3. You must take care not to insert multiple copies of master data. Suppose the order types in the previous example are the same in all databases (although their primary keys may differ!). The order types from the source database should not be reinserted in the target database. Moreover, you must fix the references in the source data to the correct records in the target database (again by matching by natural key).
So although it wasn't trivial, it was doable and the job was done in a relatively short time. I'm sure that the alternatives (T-SQL, Integration Services, BIDS, if doable at all) would have taken more time or would have been more buggy. And the problem with bugs in this area is that they may become apparent much later.
I later found out that the issues I describe under 2) are related to fetching the source objects with AsNoTracking. See this interesting post: Entity Framework 6 - use my getHashCode(). I used AsNoTracking because it performs better and it reduces memory consumption.
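For comparison, the T-SQL route typically comes down to capturing an old-to-new key mapping per source database and rewriting the foreign keys through it, which gives an idea of the bookkeeping involved. A rough sketch with invented table names (it assumes identity primary keys in the target and SQL Server 2008 or later, since only MERGE lets the OUTPUT clause reference source columns):
-- Remap Customer primary keys from one source database while preserving the
-- Customer -> Order relationship.
DECLARE @CustomerMap TABLE (OldId INT, NewId INT);

MERGE INTO TargetDb.dbo.Customers AS t
USING SourceDb01.dbo.Customers AS s
    ON 1 = 0                                  -- never matches: every source row is inserted
WHEN NOT MATCHED THEN
    INSERT (Name, City) VALUES (s.Name, s.City)
OUTPUT s.Id, inserted.Id INTO @CustomerMap (OldId, NewId);  -- capture old and new ids in one pass

-- Child rows then pick up their new foreign keys through the mapping table.
INSERT INTO TargetDb.dbo.Orders (CustomerId, OrderDate)
SELECT m.NewId, o.OrderDate
FROM SourceDb01.dbo.Orders AS o
JOIN @CustomerMap AS m ON m.OldId = o.CustomerId;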

Is it possible to Cache the result set of a select query in the database?

I am trying to optimize the search query which is the most used in our system. So far I have added some missing indexes and that has helped slightly. But I want to further reduce the load on the db server. One option that I will use is caching the result set as a LIST in the asp.net Cache so that I don't have to hit the db often.
However, I was wondering if there is a way to cache some portions of the select query at the db as well, e.g. for the search results we consider only users who have been active in the last 180 days and who have share-info set to true. So this is like a super set which the db processes every time, and then it applies the other conditions that are passed in, such as the specified category, city, etc. Is it possible to somehow cache the super set so that I can run queries against it rather than against the whole table? Will creating a View help with this? I am a bit hesitant to create a view as I've read that managing views can be an overhead and takes away some flexibility to modify the tables.
I am using SQL Server 2005, so I cannot create a filtered index on the table, which I think would have been helpful.
I agree with @Neville K. SQL Server is pretty smart about caching data in memory. You might see limited or no performance gains for your effort.
You could consider indexed views (Enterprise Edition only) http://technet.microsoft.com/en-us/library/cc917715.aspx for your sub-query.
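A minimal sketch of such an indexed view (column names are invented; note that the 180-day activity filter can't live inside the view because GETDATE() is non-deterministic, so only the share-info filter is materialised here):
-- The view must be schema-bound and reference its tables with two-part names.
CREATE VIEW dbo.UsersWithShareInfo
WITH SCHEMABINDING
AS
SELECT UserId, City, Category, LastActiveDate
FROM dbo.Users
WHERE ShareInfo = 1
GO

-- The unique clustered index is what actually materialises the view.
CREATE UNIQUE CLUSTERED INDEX IX_UsersWithShareInfo ON dbo.UsersWithShareInfo (UserId)
GO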
It is, of course, possible to do this - but I'm not sure if it will help.
You can create a scheduled job - once a night, perhaps - which populates a table called "active_users_with_share_info" by truncating it and then repopulating it from a SELECT query that keeps only users active in the last 180 days with "share_info = true".
Then you can join your search query to this table.
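The nightly refresh itself could be as small as this (column names are invented; schedule it as a SQL Agent job step):
-- Rebuild the snapshot table from scratch on each run.
TRUNCATE TABLE dbo.active_users_with_share_info

INSERT INTO dbo.active_users_with_share_info (user_id, city, category)
SELECT user_id, city, category
FROM dbo.users
WHERE share_info = 1
  AND last_active_date >= DATEADD(DAY, -180, GETDATE())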
However, I doubt this would do much good - SQL Server is pretty smart at caching. Unless you're dealing with huge volumes of data (hundreds of millions of records), or very limited hardware, I doubt you'd get any measurable performance improvement - but by all means try it!
Of course, the price for this would be more moving parts in your application, more interesting failure modes (what happens if the overnight batch fails silently?), and more training for any new developers you bring into the team.

LINQ vs datasets - performance hit?

I am refactoring an existing VB.NET application to use Linq. I've been able to successfully get it to work, but it takes ages (over a minute) on the client machine!
They have lots of rows in the database table, but the old version of the programme on the same machine (which uses Datasets) takes 5 seconds.
My Linq queries are pretty standard, like so:
Dim query = From t As TRANSACTION In db.TRANSACTIONs _
            Where t.transactionID = transactionID _
            Select t
They only ever return one or zero rows. Any thoughts?
I am surprised by the huge time differential (5 seconds to 60+ seconds). I guess it would depend on how complex the TRANSACTION entity is. LINQ to SQL will process each row from your result set and turn it into an object, then add some state-tracking information to the DataContext. A DataSet simply stores the data raw, and processes it into strongly typed data as you read it from the DataTable. I wouldn't expect L2S to incur a 12-fold cost increase, but I would expect some increase.
The code you've pasted doesn't actually access the database at all -- what you do next with 'query' will determine how much data ends up getting transferred to the client. Is it possible something you're doing later on is causing the LINQ version to download more data than the Dataset version?
I have done the same transition on a project and seen only equivalent or better performance from LINQ but there have been instances where the LINQ version was doing a lot more roundtrips to the server, e.g. doing a Count() followed by a fetch of the data as two separate server queries. I usually solved this by doing a .ToList() to get the data locally before working on it. You have to use SQL Profiler sometimes to find out what is going on behind the scenes.

Simulating queries of large views for benchmarking purposes

A Windows Forms application of ours pulls records from a view on SQL Server through ADO.NET and a SOAP web service, displaying them in a data grid. We have had several cases with ~25,000 rows, which works relatively smoothly, but a potential customer needs to have many times that much in a single list.
To figure out how well we scale right now, and how (and how far) we can realistically improve, I'd like to implement a simulation: instead of displaying actual data, have the SQL Server send fictional, random data. The client and transport side would be mostly the same; the view (or at least the underlying table) would of course work differently. The user specifies the amount of fictional rows (e.g. 100,000).
For the time being, I just want to know how long it takes for the client to retrieve and process the data, up to the point where it is just about ready to display it.
What I'm trying to figure out is this: how do I make the SQL Server send such data?
Do I:
Create a stored procedure that has to be run beforehand to fill an actual table?
Create a function that I point the view to, thus having the server generate the data 'live'?
Somehow replicate and/or randomize existing data?
The first option sounds to me like it would yield the results closest to the real world. Because the data is actually 'physically there', the SELECT query would be quite similar performance-wise to one on real data. However, it taxes the server with an otherwise meaningless operation. The fake data would also be backed up, as it would live in one and the same database — unless, of course, I delete the data after each benchmark run.
The second and third option tax the server while running the actual simulation, thus potentially giving unrealistically slow results.
In addition, I'm unsure how to create those rows, short of using a loop or cursor. I can use SELECT top <n> random1(), random2(), […] FROM foo if foo actually happens to have <n> entries, but otherwise I'll (obviously) only get as many rows as foo happens to have. A GROUP BY newid() or similar doesn't appear to do the trick.
For data for testing CRM type tables, I highly recommend fakenamegenerator.com, you can get 40,000 fake names for free.
You didn't mention if you're using SQL Server 2008. If you use 2008 and you use Data Compression, be aware that random data will act very differently (slower) than real data. Random data is much harder to compress.
Quest Toad for SQL Server and Microsoft Visual Studio Data Dude both have test data generators that will put fake "real" data into records for you.
If you want results you can rely on you need to make the testing scenario as realistic as possible, which makes option 1 by far your best bet. As you point out if you get results that aren't good enough with the other options you won't be sure that it wasn't due to the different database behaviour.
How you generate the data will depend to a large degree on the problem domain. Can you take data sets from multiple customers and merge them into a single mega-dataset? If the data is time series then maybe it can be duplicated over a different range.
The data is typically CRM-like, i.e. contacts, projects, etc. It would be fine to simply duplicate the data (e.g., if I only have 20,000 rows, I'll copy them five times to get my desired 100,000 rows). Merging, on the other hand, would only work if we never deploy the benchmarking tool publicly, for obvious privacy reasons (unless, of course, I apply a function to each column that renders the original data unintelligible beyond repair? Similar to a hashing function, only without modifying the value's size too much).
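If I went the anonymisation route, perhaps a one-way transform per sensitive column would be enough, along these lines (hypothetical column names; the binary-to-hex CONVERT style needs SQL Server 2008 or later):
-- Replace identifying text with a hex digest of itself: the shape of the data
-- (row counts, indexes, join keys) stays the same while the content becomes meaningless.
SELECT ContactId,
       CONVERT(VARCHAR(40), HASHBYTES('SHA1', ContactName), 2) AS ContactName,
       CONVERT(VARCHAR(40), HASHBYTES('SHA1', Email), 2)       AS Email
FROM dbo.Contacts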
To populate the rows, perhaps something like this would do:
-- Keeps appending copies of the source rows until the target count is reached
-- (it may overshoot a little, since the last batch isn't trimmed to the rows still missing).
WHILE (SELECT COUNT(*) FROM benchmark) < 100000
    INSERT INTO benchmark
    SELECT TOP 100000 * FROM actualData
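Or, to avoid the loop entirely, perhaps a set-based variant could multiply the rows in one statement (still assuming benchmark and actualData share the same column layout and there's no IDENTITY column getting in the way of SELECT *):
-- Cross-join the source data against a small row source to multiply it,
-- then keep exactly as many rows as the benchmark needs.
INSERT INTO benchmark
SELECT TOP 100000 d.*
FROM actualData AS d
CROSS JOIN (SELECT TOP 50 object_id FROM sys.all_objects) AS n  -- 50 copies of each source row; raise this if actualData is small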