Does LINQPad load the entire SQL result set into memory before displaying it?

I'm attempting to use LINQPad as an SSMS replacement, but it takes an inordinate amount of time to return large result sets. I usually give up waiting after a few minutes, but if I leave it alone, LINQPad will often fail with an out-of-memory error.
Does LINQPad load the entire result set into memory before displaying it in the grid? Is it capable of returning records in chunks, adding records to the output grid as more results become available from the database -- similar to the way SSMS works?
Cross-posted (and revised) from the LINQPad forums (http://forum.linqpad.net/discussion/303/is-entire-sql-result-set-loaded-into-memory-before-display), as I haven't had a response there.

This shouldn't happen in rich text mode, because LINQPad implicitly limits the amount of data it fetches (by default, 1000 rows). After some investigation, it appears the problem is due to a bug in ADO.NET's SqlDataReader. When you dispose a data reader after reading only a portion of the rows, it "cleans" the reader by enumerating all remaining data. It certainly is annoying, so I'm looking into whether it's possible to detect this condition and cancel the underlying command.
Edit: there's a workaround for this in the latest beta, so in rich text mode, the query should now complete quickly with the first 1000 rows.

Related

SQL Server select with large varchar columns takes time to load

I am trying to run a simple select query that has a column called instructions, a varchar(8000), in the select column list. The table has
90,000 records, and it took my SQL Server Management Studio console about 10 seconds to return and display the full table data:
SELECT id, name, instructions, etc.... FROM TABLE;
However, when I remove instructions from the select list, it took only 1 second to execute and display the result. Can anyone please help me understand the theory behind this?
Thanks
Keth
There are some obvious things here that impact the time, and a few more subtle ones around it. How SQL Server stores and retrieves this data is a book in itself, of which there are many. (I'd personally recommend Kalen Delaney, but everyone will have their own preference, and I appreciate we should keep away from subjectivity on SO.)
90k rows of instructions potentially have to be marshalled across your network connection if you are connecting from a machine other than the server.
The SSMS console itself has to display all of that text, which takes time of its own.
Depending on the size of what you are reading versus your buffer cache and the other queries being executed, you could be putting pressure on the cache and generating more physical IO load for the server as a whole.
As mentioned in comments, more data is being read, but does this mean more is being read from the disk? This one is far more subtle when looked at in detail.
In terms of the disk IO issue, it depends on when the instructions are placed in the row and on the column's settings around inlining of data. The instructions may be stored inline with the rest of the row, in which case no additional disk IO occurs whether you read them or not; it's more a case of whether SQL Server bothers to decode the value from the page already in memory.
The varchar(8000), though, might not be inline with the rest of the data: it could sit on a row_overflow_data page, sometimes referred to as a short large object (SLOB). In that case the instructions field in the row stores a pointer to where the data actually lives, and reading the instructions forces SQL Server to read another, essentially random, page (and extent) elsewhere on the disk for every row (see the sketch at the end of this answer for one way to check which case applies).
Depending on how and when instructions are added, you could see a high level of fragmentation and a lack of contiguous extents allocated for these instructions, although depending on the IO subsystem this may be immaterial to the problem.
There are a lot of unknowns at this point which makes it harder to give anything definitive - you are in the 'it depends' area of the DB, which would need a lot more specifics and investigation to be able to point at a specific cause, vs the more general (and not entirely complete) list above.
As Tim Biegeleisen mentioned, do not read the instructions unless you need to.
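If you want to see which case applies, one rough check (a sketch only, assuming SQL Server 2005 or later and a made-up table name) is to look at the allocation units for the table; pages of type ROW_OVERFLOW_DATA indicate values that have been pushed off-row:

-- Page counts and fragmentation per allocation unit for a hypothetical dbo.YourTable.
SELECT index_id,
       alloc_unit_type_desc,          -- IN_ROW_DATA vs ROW_OVERFLOW_DATA (vs LOB_DATA)
       page_count,
       avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.YourTable'), NULL, NULL, 'DETAILED');

A large ROW_OVERFLOW_DATA page count relative to IN_ROW_DATA would be consistent with the extra random reads described above; if it is zero, the instructions are stored in-row and the cost is mostly network and rendering.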

DBGrid Filter, Delphi.

I've recently delved into the world of Delphi. For my current mini project, I'm obtaining data via an SQL query and then using the filter property to display exactly what I want.
I discovered the filter by accident and now prefer it to making multiple connections or calls to the database. For example, I'm returning a person object that may own many cars; the app has check boxes, and depending on which one is selected it updates the filter to display only the cars that are blue or pink or whatever.
As far as I understand it, the filter works like a WHERE clause, but on the dataset returned by the initial query. So, my question is: is it faster to use the filter property when working with a small dataset in this manner, or am I completely wrong in thinking that the dataset is returned and stored once, with the filter then applied to it, as opposed to the data being constantly re-queried?
I've looked online, the resources do lead me to believe that it is more efficient but I'm still unsure. Thanks for any help.
A filter on a dataset does indeed work (or at least behave) like a WHERE clause, and in some cases can be very fast.
The issues with depending on filters are:
Increased network traffic. You're moving considerably more data from the server to the client than is needed, because you're just filtering much of it out anyway.
Filters are applied to the data row by row. A WHERE clause can be optimized by the server to make full (or at least partial) use of existing indexes, whereas the client does not have those indexes available (see the sketch after this answer).
Increased memory and CPU use on the client to maintain data it isn't using in memory and to process the rows for filtering.
Data updated by other users or processes is not visible to the client app, as you're now working with all of the data in local memory and not refreshing from the server.
IMO, using a filter on all but a trivial dataset isn't a good option, and if the amount of data is that small you can move the entire dataset into a TClientDataSet and keep it in memory yourself anyway. Like every other optimization being considered, the proper answer depends on the needs of your application and the actual data in question, and should be benchmarked against those criteria to determine what is actually the better solution.
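For comparison, a sketch of pushing that colour filter back to the server as a parameterized WHERE clause (table, column, and index names are invented); the server can then resolve the predicate from an index instead of shipping every row to the client:

-- Only matching rows cross the network; the :params are bound by the Delphi dataset component.
SELECT c.CarID, c.OwnerID, c.Colour, c.Model
FROM Cars c
WHERE c.OwnerID = :OwnerID
  AND c.Colour = :Colour;

-- An index that supports the predicate, so the server can seek rather than scan.
CREATE INDEX IX_Cars_Owner_Colour ON Cars (OwnerID, Colour);

Whether this beats a client-side filter still comes down to the benchmarking point above: for a few hundred rows the filter is perfectly fine, while for large or frequently changing data the WHERE clause usually wins.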
Two different animals. You're asking if it's less overhead to repeatedly query a database or do the filtering exclusively on the client side.
If your app and db are both running on the same machine, then it's probably a toss-up.
But if you're running this in a client-server, n-tier, or partitioned mobile application, and this is a common operation, then I'd say you're probably better off caching a larger set of results from a single query on the client side and using filters to let the users see different views of the results. That reduces the bandwidth to the host, and the users enjoy faster response times.
(It's a pet peeve of mine to be searching for cars or apartments or real estate and I check or un-check a box to change the view, and I have to wait 5-10 seconds for the app to reply.)
That said, you might also want to consider the overall size of the data, its temporality, and how often it's updated, and see whether it's worthwhile loading significant chunks down to the client and localizing even more of the specialized views. Pull down whole records and cache them locally to offer users faster response times. And minimize reloading of cached records whenever possible.
A lot of the time, the actual data is fairly small on a per-record basis. But when you add in the media, it explodes. People often don't think about that and consider only the aggregate size of each "record", including the media blobs. If the DB designer was smart, the media isn't even stored in the DB, but elsewhere, accessible via URLs.

Fine tuning an Oracle query with a pipelined function

I have a query (that powers an Oracle Application Express Report) that I was told by our users was executing "slowly" or at an unacceptable speed (wasn't given an actual load time for the page and the query is the only thing on the page).
The query involves many tables and references a pipelined function that identifies the currently logged-in user to our website and returns a custom "table" of the records they have permission to see, based upon a custom security scheme we have.
My main question is around Oracle's caching of queries and how they could be affected by our setup.
When I took the query out of the webpage and ran it in Sql Developer (and manually specified a user ID to simulate a logged-in user to the website), the performance went from 71 seconds to 19 seconds to .5 seconds. Clearly, Oracle is utilizing its caching mechanism to make subsequent runs faster.
How is this affected by the fact that different users will get different tables from the pipelined function (all the same columns, just a different number of rows and different values in the rows)? Does the pipelining prevent caching from working? Am I only seeing caching because I'm running a very isolated test?
Furthermore, is caching easily influenced by the number of people using the system? I'm not sure how "much" can get cached. If we have 50 concurrent users accessing different parts of the website that load different queries all day long, is it likely that Oracle won't be able to cache many (or any) of them because it is constantly seeing different queries?
Sorry my question isn't very technical.
I'm a developer who has been asked to help out with what is seemingly a DBA question.
Also, this is complicated because I can't really determine what the actual load times are since our users don't report that level of detail.
Any thoughts on:
how I can determine if this query is actually slow?
what the average processing time would be?
and how to proceed with fine tuning if it is a problem?
Thanks!
It doesn't sound like this has anything to do with APEX, pipelined table functions, or query caching. It sounds like you are describing the effects of plain old data caching (most likely at the database level but potentially at the operating system and disk subsystem layers).
As a very basic overview, data is stored in rows, rows are stored in blocks (most commonly 8 kb in size), blocks are stored in extents (generally a few MB in size), and extents roll up to segments (i.e. a table). Oracle maintains a buffer cache where the most recently accessed blocks are stored. When you run a query, Oracle figures out which blocks it needs to read in order to get your data (this is the query plan). It then looks to see whether those blocks are in the buffer cache or whether they have to be read from disk. Obviously, reading a block from cache is much more efficient than reading it off the disk since RAM is much faster than disk. If you run the same query with the same set of bind variable values multiple times in a row, you'll be accessing the same set of blocks each time but more and more of the blocks you care about are going to be in the cache. So you'd generally expect that the second and third time that you call the query, you'll see faster performance.
If you run the query with a different set of bind variable values, if the second set of bind variable values causes Oracle to access many of the same blocks, those executions will benefit from the data the prior test cached. Otherwise, you'd be back to square 1 potentially reading all the data you need off disk. Most likely, you'll see some combination of the two.
Remember as well that it is not just Oracle that is caching data. Frequently, the operating system will be caching the most active pieces of the underlying Oracle data files. And the I/O subsystem will be caching the most recently accessed data as well. So even if Oracle thinks that it needs to go out to fetch a block because it is not in the database's buffer cache, the file system or the I/O subsystem may have cached that data so it may not require an actual physical read off of disk. These other caches behave similarly where running the same query multiple times in a row is likely to cause the cache to be "warm" and improve the performance of the later runs.
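If you want to measure the caching effect rather than guess at it, one rough approach (a sketch, assuming your account can read v$sql and that you can find the statement in the shared pool; the sql_text filter is a placeholder) is to compare logical reads against physical reads across executions:

-- Oracle: per-statement I/O picture from the shared pool.
SELECT sql_id,
       executions,
       buffer_gets,   -- logical reads satisfied from the buffer cache
       disk_reads,    -- blocks that had to be fetched from below the buffer cache
       ROUND(elapsed_time / NULLIF(executions, 0) / 1000000, 2) AS avg_seconds
FROM v$sql
WHERE sql_text LIKE '%your_report_query_marker%';

If avg_seconds stays high and disk_reads remains large across executions, you have a genuine tuning problem; if the first execution is slow and subsequent ones are cheap, you are mostly looking at the cold-cache effect described above.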

Problems with huge CPU overhead

I'm writing an application in vb.net 2005. The app reads a spreadsheet into a DataSet with ADO.NET and uses a column of that table to populate a ListBox. When a ListBox Item is selected, the user will be presented with detailed information on the selected record.
One part of this information isn't in the DataSet. I have to compare a column from the spreadsheet with several external data sources to determine the nature of the record in question. Here's where I have my problem.
This comparison has to search through 9.5m rows in a SQL table at one stage. I've checked and there's no way to "shrink" the query down as I'm already only searching absolutely essential data.
What happens is that the application never visibly does anything. The CPU usage shoots up to 100% regardless of what it was at beforehand and the system's performance becomes almost unbearably slow.
Can anyone suggest a way I can improve this situation while this massive query is running?
EDIT: I was originally going to write the contents of the 9.5m rows in the database table to a text file which I'd then read from, but after 6.5m rows, I got an OutOfMemoryException.
I suspect your CPU might be used in populating the DataSet, though you would have to profile your application to confirm that. Try using a DataReader instead and either storing the results in some more compact format in memory or, if you're running out of memory, then writing them to a file as you go. With the DataReader approach you never need to store the entire result set in memory at the same time.
An index on the column you are searching?
A new field in the table to help the search go faster?
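For example, a minimal sketch (invented table, column, and index names) of an index that lets the 9.5m-row lookup seek instead of scanning:

-- Index on the column used in the WHERE clause; INCLUDE covers the column the query returns.
CREATE NONCLUSTERED INDEX IX_BigTable_LookupValue
    ON dbo.BigTable (LookupValue)
    INCLUDE (NatureCode);

With a covering index like this, each lookup becomes a handful of page reads instead of a scan of the whole table, which should shorten each comparison dramatically.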

How do you speed up CSV file processing? (5 million or more records)

I wrote a VB.net console program to process CSV records that come in a text file. I'm using the FileHelpers library
along with the MSFT Enterprise Library 4 to read the records one at a time and insert them into the database.
It took about 3 - 4 hours to process the 5+ million records in the text file.
Is there any way to speed up the process? Has anyone dealt with such a large number of records before, and how would you update such records if there is new data to be updated?
Edit: Can someone recommend a profiler? I'd prefer open source or free.
read the records one at a time and insert them into the database
Read them in batches and insert them in batches.
Use a profiler - find out where the time is going.
Short of a real profiler, try the following:
Time how long it takes to just read the files line by line, without doing anything with them
Take a sample line, and time how long it takes just to parse it and do whatever processing you need, 5+ million times
Generate random data and insert it into the database, and time that
My guess is that the database will be the bottleneck. You should look into doing a batch insert - if you're inserting just a single record at a time, that's likely to be a lot slower than batch inserting.
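As a minimal sketch of what "batch inserting" means at the SQL level (invented table and columns; the multi-row VALUES syntax needs SQL Server 2008 or later, so on 2005 you'd use UNION ALL SELECTs or a bulk API instead):

-- One round trip inserts many rows instead of one row per statement.
INSERT INTO dbo.CsvStaging (Id, Name, Amount)
VALUES (1, 'Alpha', 10.50),
       (2, 'Beta', 22.00),
       (3, 'Gamma', 7.25);

Batches of a few hundred to a few thousand rows per statement usually give most of the benefit; beyond that, a bulk API or BULK INSERT (see below) is the next step.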
I have done many applications like this in the past and there are a number of ways that you can look at optimizing.
Ensure that the code you are writing manages memory properly; with something like this, one little mistake can slow the process to a crawl.
Think about making the database calls asynchronous, as the database may be the bottleneck, so a bit of queuing could be OK.
Consider dropping indexes, doing the import, and then rebuilding the indexes afterwards.
Consider using SSIS to do the import; it is already optimized and does this kind of thing out of the box.
Why not just insert the data directly into the SQL Server database using Microsoft SQL Server Management Studio or the command line - SQLCMD? It knows how to process CSV files.
The BulkInsert property should be set to True on your database.
If the data has to be modified, you can insert it into a temporary table and then apply your modifications with T-SQL.
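A sketch of that staging-table approach (file path, table, and column names are invented; the terminators must match the actual CSV):

-- Load the raw file into a temp table in one bulk operation...
CREATE TABLE #Staging (Id INT, Name VARCHAR(100), Amount DECIMAL(10,2));

BULK INSERT #Staging
FROM 'C:\data\records.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

-- ...then apply any modifications set-based before moving the rows into the real table.
UPDATE #Staging SET Name = UPPER(Name);

INSERT INTO dbo.Records (Id, Name, Amount)
SELECT Id, Name, Amount FROM #Staging;

This keeps the per-row work on the server in set-based T-SQL, rather than paying a network round trip per record.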
Your best bet would be to try using a profiler with a relatively small sample -- this could identify where the actual hold-ups are.
Load it into memory and then insert into the DB. 5 million rows shouldn't tax your memory. The problem is you are essentially thrashing your disk--both reading the CSV and writing to the DB.
I'd speed it up the same way I'd speed anything up: by running it through a profiler and figuring out what's taking the longest.
There is absolutely no way to guess what the bottleneck here is -- maybe there is a bug in the code which parses the CSV file, resulting in polynomial runtimes? Maybe there is some very complex logic used to process each row? Who knows!
Also, for the "record", 5 million rows isn't all THAT heavy -- an off-the-top-of-my-head guess says that a reasonable program should be able to churn through that in half an hour, an a good program in much less.
Finally, if you find that the database is your bottleneck, check to see if a transaction is being committed after each insert. That can lead to some nontrivial slowdown...
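For example, a sketch (invented table) of committing once per batch rather than once per row; each commit forces a log flush, so per-row commits can dominate the run time:

BEGIN TRANSACTION;

INSERT INTO dbo.Records (Id, Name, Amount) VALUES (1, 'Alpha', 10.50);
INSERT INTO dbo.Records (Id, Name, Amount) VALUES (2, 'Beta', 22.00);
-- ... repeat for the rest of the batch, e.g. 1,000 to 10,000 rows ...

COMMIT TRANSACTION;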
Not sure what you're doing with them, but have you considered perl? I recently re-wrote a vb script that was doing something similar - processing thousands of records - and the time went from about an hour for the vb script to about 15 seconds for perl.
After reading all the records from the file (I would read the entire file in one pass, or in blocks), use the SqlBulkCopy class to import your records into the DB. SqlBulkCopy is, as far as I know, the fastest approach to importing a block of records. There are a number of tutorials online.
As others have suggested, profile the app first.
That said, you will probably gain from doing batch inserts. That was the case for one app I worked with, and it was a high-impact change.
Consider that 5 million round trips are a lot, especially if each of them is for a simple insert.
In a similar situation we saw considerable performance improvement by switching from one-row-at-time inserts to using the SqlBulkCopy API.
There is a good article here.
You need to bulk load the data into your database, assuming it has that facility. In SQL Server you'd be looking at BCP, DTS or SSIS - BCP is the oldest but maybe the fastest. OTOH, if that's not possible in your DB, turn off all indexes before doing the run. I'm guessing it's the DB that's causing problems, not the .NET code.
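If you do go the bulk-load route, a common companion step (a sketch with invented index and table names, for SQL Server 2005 or later) is to disable the nonclustered indexes before the load and rebuild them once at the end, so the load isn't maintaining them row by row:

-- Disable nonclustered indexes before the load (do NOT disable the clustered index,
-- or the table becomes inaccessible until it is rebuilt).
ALTER INDEX IX_Records_Name ON dbo.Records DISABLE;

-- ... run the BCP / BULK INSERT / SSIS load here ...

-- One rebuild at the end re-enables the disabled indexes.
ALTER INDEX ALL ON dbo.Records REBUILD;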