I'm writing an application in vb.net 2005. The app reads a spreadsheet into a DataSet with ADO.NET and uses a column of that table to populate a ListBox. When a ListBox Item is selected, the user will be presented with detailed information on the selected record.
One part of this information isn't in the DataSet. I have to compare a column from the spreadsheet with several external data sources to determine the nature of the record in question. Here's where I have my problem.
This comparison has to search through 9.5m rows in a SQL table at one stage. I've checked and there's no way to "shrink" the query down as I'm already only searching absolutely essential data.
What happens is that the application never visibly does anything while the query runs. CPU usage shoots up to 100% regardless of what it was beforehand, and the system's performance becomes almost unbearably slow.
Can anyone suggest a way I can improve this situation while this massive query is running?
EDIT: I was originally going to write the contents of the 9.5m rows in the database table to a text file which I'd then read from, but after 6.5m rows, I got an OutOfMemoryException.
I suspect your CPU time is going into populating the DataSet, though you would have to profile your application to confirm that. Try using a DataReader instead and either storing the results in some more compact format in memory or, if you're running out of memory, writing them to a file as you go. With the DataReader approach you never need to hold the entire result set in memory at once.
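To make the "stream and write as you go" idea concrete, here is a minimal sketch. It uses Python's sqlite3 module purely as a stand-in for the real data source, and the file, table, and column names are made up; in VB.NET 2005 the equivalent would be a SqlDataReader loop calling Read() instead of filling a DataSet.

```python
import sqlite3

# Hypothetical stand-in for the 9.5M-row table; all names are illustrative only.
conn = sqlite3.connect("records.db")
cur = conn.cursor()
cur.execute("SELECT key_column FROM big_table")

# Stream one row at a time and append it to the output file immediately,
# so the full result set is never held in memory.
with open("keys.txt", "w", encoding="utf-8") as out:
    for (key,) in cur:
        out.write(f"{key}\n")

conn.close()
```

Writing incrementally like this is what avoids the OutOfMemoryException you hit when the whole file's contents are built up in memory first.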
An index on the column you search?
A new field in the table to help search faster?
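If an index is an option, it is a one-off change on the database side. A hedged sketch, again using Python's sqlite3 as a stand-in (the index, table, and column names are assumptions; the CREATE INDEX statement is essentially the same on SQL Server):

```python
import sqlite3

conn = sqlite3.connect("records.db")  # hypothetical database

# One-off: index the column the lookup filters on.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_big_table_key ON big_table (key_column)"
)
conn.commit()

# With the index in place, an equality lookup no longer scans all 9.5M rows.
row = conn.execute(
    "SELECT * FROM big_table WHERE key_column = ?", ("some value",)
).fetchone()
conn.close()
```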
I currently find myself needing to do fairly simple computations on several million datapoints. (Constructing a large list of strings from a well-defined multi-gigabyte file, sorting that list, and then comparing it to another list, a superset.) This is the sort of simple work most of us normally do with the data entirely in memory, but the size and quantity of the units of data I need to work with could make RAM an issue if I try to keep everything in memory. I quickly realized I probably need to write the data to a file at a few points to avoid exhausting my system's resources. I decided to use SQLite3 for this. (This is probably a bit much for a CSV.) It is fairly lightweight, while its storage limits seem to safely exceed my requirements.
The problem I am having is understanding exactly how the result set works. The documentation I have come across seems a little vague on this. Obviously, SQLite is not writing a whole new table to the database every time a SELECT statement is executed. Does this mean it is duplicating all the selected fields in a complete in-memory table, or does it only keep some sort of pointers in memory (rather than the actual data)? Something else altogether?
I need to be able to sort the data in question. If the result set is really just an in-memory data structure, then simply creating a new table and populating it with the help of ORDER BY could be a bad idea.
SQLite does not really have result sets. It has cursors, which allow access to only the current row, and which cannot go backwards.
SQLite computes results on the fly, so only one row needs to be in memory at a time.
When a computation needs to access multiple rows (e.g., aggregate functions, or sorting without a usable index), as much data as possible is kept in the cache, and then spilled to disk in a temporary database.
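A small illustration of that cursor behaviour, using Python's sqlite3 module (the same applies to stepping a prepared statement through the C API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strings (value TEXT)")
conn.executemany("INSERT INTO strings VALUES (?)",
                 [("banana",), ("apple",), ("cherry",)])

# ORDER BY without an index forces SQLite to sort (spilling to a temporary
# database if it has to), but rows are still handed back one at a time.
cur = conn.execute("SELECT value FROM strings ORDER BY value")
for (value,) in cur:   # only the current row is materialised here
    print(value)

conn.close()
```

So selecting with ORDER BY (or populating a new sorted table) does not mean the whole result is duplicated in memory; the sort will spill to disk when it exceeds the cache.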
I'm using PostgreSQL 9.4.
First, I have PostgreSQL installed on a system with only one SSD drive.
I'm trying to understand what a sequential read is, and I've run into a question. For instance, if we ask the SQL server for some unindexed data, a seq scan is likely to happen. But what if two different clients ask for data from two different tables simultaneously? In that case, the server creates a separate process for each client and executes the queries concurrently.
But if the queries are executed concurrently, the drive head would need to jump from the area where the first table is stored to the area where the second one is.
So we don't actually have a sequential read; we're jumping between the tables' areas. Where am I wrong? Could you explain this a bit?
"sequential scan" means a table was read from the beginning to the end, sequentially row by row. It means nothing in terms of how data is read from physical storage.
So the term is about logical reads.
Not sure if the answer needs more explanation.
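If it helps to see the planner's choice, EXPLAIN shows when a sequential scan is used. A minimal sketch with psycopg2; the connection parameters, table, and column names are placeholders:

```python
import psycopg2

# Placeholder connection details and schema; adjust for your database.
conn = psycopg2.connect(dbname="mydb", user="postgres")
cur = conn.cursor()

# EXPLAIN returns the plan as rows of text; an unindexed predicate
# typically produces a line like "Seq Scan on big_table  (cost=...)".
cur.execute("EXPLAIN SELECT * FROM big_table WHERE unindexed_col = 42")
for (line,) in cur.fetchall():
    print(line)

cur.close()
conn.close()
```

"Seq Scan" there describes the logical access pattern (read the table from start to end); whether the underlying I/O is physically sequential is up to the filesystem and the storage device, and on an SSD it matters far less anyway.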
I've recently delved into the world of Delphi. For my current mini project I'm obtaining data via an SQL query and then using the filter property to display exactly what I want.
I discovered the filter by accident and now prefer it to making multiple connections or calls to the database. For example, I'm returning a person object that may own many cars; the app has a check box, and depending on which one is selected it will update the filter to display only the cars that are blue or pink or whatever.
As far as I understand it, the filter works like a WHERE clause, but on the dataset that is returned from the initial query. So my question is: is it faster to use the filter property when working with a small dataset in this manner? And am I wrong in thinking that the dataset is returned and stored, and the filter is then applied to it, as opposed to the database being queried again each time?
I've looked online, and the resources do lead me to believe it is more efficient, but I'm still unsure. Thanks for any help.
A filter on a dataset does indeed work (or at least behave) like a WHERE clause, and in some cases can be very fast.
The issues with depending on filters are:
Increased network traffic. You're moving considerably more data from the server to the client than is needed, because you're just filtering it out anyway.
Filters are applied to the data row-by-row. A WHERE clause can be optimized by the server to be all (or at least partially) based on existing indexes, whereas the client does not have those indexes available.
Increased memory and CPU use on the client to maintain data it isn't using in memory and to process the rows for filtering.
Data updated by other users or processes is not visible to the client app, as you're now working with all of the data in local memory and not refreshing from the server.
IMO, using a filter on all but a trivial dataset isn't a good option, and if the amount of data is that small you can move the entire dataset into a TClientDataSet and keep it in memory yourself anyway. Like every other optimization being considered, the proper answer depends on the needs of your application and the actual data in question, and should be benchmarked against those to determine what is actually the better solution.
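For a language-neutral picture of the trade-off, here is a hedged sketch in Python (sqlite3 as a stand-in; the cars table and its columns are made up). The client-side list comprehension plays the role of the dataset filter:

```python
import sqlite3

conn = sqlite3.connect("cars.db")  # hypothetical database

# Server-side: the WHERE clause can use an index, and only matching
# rows ever cross the wire.
blue_cars = conn.execute(
    "SELECT id, owner, colour FROM cars WHERE colour = ?", ("blue",)
).fetchall()

# Client-side "filter": fetch everything once, then filter locally.
# This costs bandwidth and client memory, but avoids repeated round
# trips when the user keeps toggling check boxes.
all_cars = conn.execute("SELECT id, owner, colour FROM cars").fetchall()
blue_cars_local = [row for row in all_cars if row[2] == "blue"]

conn.close()
```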
Two different animals. You're asking if it's less overhead to repeatedly query a database or do the filtering exclusively on the client side.
If your app and db are both running on the same machine, then it's probably a toss-up.
But if you're running this in a client-server, n-tier, or partitioned mobile application, and this is a common operation, then I'd say you're probably better off caching a larger set of results from a single query on the client side and using filters to let the users see different views of the results. That reduces the bandwidth to the host, and the users enjoy faster response times.
(It's a pet peeve of mine to be searching for cars or apartments or real estate and I check or un-check a box to change the view, and I have to wait 5-10 seconds for the app to reply.)
That said, you might also want to consider the overall size of the data, its temporality, and how often it's updated, and see if it's worthwhile loading significant chunks down to the client and localizing even more of the specialized views. Pull down whole records and cache them locally to offer users faster response times. And minimize reloading of cached records whenever possible.
A lot of times, the actual data is fairly small on a per-record basis. But when you add-in the media stuff, it explodes. People often don't think about that, considering only the aggregate size of each "record" including the media blobs. If the DB designer was smart, the media isn't even being stored in the DB, but elsewhere, and accessible via URLs.
I'm attempting to use LINQPad as an SSMS replacement, but it takes an inordinate amount of time to return large result sets. I usually give up waiting after a few minutes, but if I leave it alone LINQPad will often time out with an out of memory error.
Does LINQPad load the entire result set into memory before displaying it in the grid? Is it capable of returning records in chunks, adding records to the output grid as more results become available from the database -- similar to the way SSMS works?
Cross-posted (and revised) from the LINQPad forums (http://forum.linqpad.net/discussion/303/is-entire-sql-result-set-loaded-into-memory-before-display) as I haven't had a response there.
This shouldn't happen in rich text mode, because LINQPad implicitly limits the amount of data it fetches (by default, 1000 rows). After some investigation, it appears the problem is due to a bug in ADO.NET's SqlDataReader. When you dispose a data reader after reading only a portion of the rows, it "cleans" the reader by enumerating all remaining data. It certainly is annoying, so I'm looking into whether it's possible to detect this condition and cancel the underlying command.
Edit: there's a workaround for this in the latest beta, so in rich text mode, the query should now complete quickly with the first 1000 rows.
A long time ago, when I was a young lad, I used to do a lot of assembler and optimization programming. Today I mainly find myself building web apps (which is alright too...). However, whenever I create fields for database tables I find myself using values like 16, 32 & 128 for text fields, and I try to combine boolean values into SET data fields.
Is giving a text field a length of 9 going to make my database slower in the long run, and do I actually help it by specifying a field length that aligns more easily in memory?
Database optimization is quite unlike machine code optimization. With databases, most of the time you want to reduce disk I/O, and wastefully trying to align fields will only make fewer records fit in a disk block/page. Also, if any alignment is beneficial, the database engine will do it for you automatically.
What will matter most is indexes and how well you use them. Trying tricks to pack more information in less space can easily end up making it harder to have good indexes. (Do not overdo it, however; not only do indexes slow down INSERTs and UPDATEs to indexed columns, they also mean more work for the planner, which has to consider all the possibilities.)
Most databases have an EXPLAIN command; try using it on your selects (in particular, the ones with more than one table) to get a feel for how the database engine will do its work.
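As a quick, hedged illustration of the idea (using SQLite's EXPLAIN QUERY PLAN from Python; the exact syntax and output differ per engine, and the table is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT)")

query = "SELECT * FROM people WHERE city = 'Oslo'"

# Without an index the plan reports a full table scan.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# After adding an index, the plan switches to an index search.
conn.execute("CREATE INDEX idx_people_city ON people (city)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

conn.close()
```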
The size of the field itself may be important, but usually for text, if you use nvarchar or varchar it is not a big deal, since the DB will only store what you actually use. The following will have a greater impact on your SQL speed:
Don't have more columns than you need. A bigger table in terms of columns means the database will be less likely to find the results for your queries on the same disk page. Note that this is true even if you only ask for 2 out of 10 columns in your SELECT... (there is one way to battle this, with clustered indexes, but that can only address one limited scenario).
You should give more details on the type of design issues/alternatives you are considering to get additional tips.
Something that is implied above, but which can stand being made explicit. You don't have any way of knowing what the computer is actually doing. It's not like the old days when you could look at the assembler and know pretty well what steps the program is going to take. A value that "looks" like it's in a CPU register may actually have to be fetched from a cache on the chip or even from the disk. If you are not writing assembler but using an optimizing compiler, or even more surely, bytecode on a runtime engine (Java, C#), abandon hope. Or abandon worry, which is the better idea.
It's probably going to take thousands, maybe tens of thousands of machine cycles to write or retrieve that DB value. Don't worry about the 10 additional cycles due to full word alignments.