Does anyone know how DbDataReaders actually work? We can use SqlDataReader as an example.
When you do the following
cmd.CommandText = "SELECT * FROM Customers";
var rdr = cmd.ExecuteReader();
while (rdr.Read())
{
    // Do something
}
Does the data reader have all of the rows in memory, or does it just grab one and then, when Read is called, go back to the database for the next one? It seems like bringing only one row into memory at a time would hurt performance, but bringing all of them would make the call to ExecuteReader take a while.
I know I'm just the consumer of the object and it shouldn't really matter how it's implemented, but I'm curious. I'd probably spend a couple of hours in Reflector to get an idea of what it's doing, so I thought I'd ask someone who might already know.
As stated here:
Using the DataReader can increase application performance both by retrieving data as soon as it is available, and (by default) storing only one row at a time in memory, reducing system overhead.
And as far as I know, that's the way every data reader in the .NET Framework works.
Rhapsody is correct.
Results are returned as the query executes, and are stored in the network buffer on the client until you request them using the Read method of the DataReader.
I ran a test comparing a DataReader against a DataAdapter on the same 10,000-record data set, and I found that the DataAdapter was consistently 3-4 milliseconds faster than the DataReader, but the DataAdapter ends up holding onto more memory.
When I ran the same test on 50,000-record data sets, the DataReader came out ahead by about 50 milliseconds.
With that said, if you have a long-running query or a huge result set, I think you may be better off with a DataReader, since you get your first results sooner and don't have to hold all of that data in memory. It is also important to keep in mind that a DataReader is forward-only, so if you need to move around in your result set, it is not the best choice.
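To make the trade-off concrete, here is a minimal sketch of the two approaches; the connection string and column name are placeholders, not taken from the original post:

// Rough sketch, not production code. Requires:
//   using System.Data;
//   using System.Data.SqlClient;
using (var conn = new SqlConnection(connStr))
using (var cmd = new SqlCommand("SELECT * FROM Customers", conn))
{
    conn.Open();

    // DataReader: rows are streamed as you call Read(); only the current row
    // (plus the provider's network buffer) is held in memory, forward-only.
    using (var rdr = cmd.ExecuteReader())
    {
        while (rdr.Read())
        {
            var name = rdr["CompanyName"];   // process the row, then move on
        }
    }

    // DataAdapter: the entire result set is materialized into a DataTable,
    // so you can move around in it freely, at the cost of holding it all in memory.
    var table = new DataTable();
    using (var da = new SqlDataAdapter(cmd))
    {
        da.Fill(table);   // internally loops over a DataReader to pull every row
    }
}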
Why do most programming languages use the concept of a result set when returning data from a database? Why aren't the results returned directly in a more common, immediately usable structure like an array? Why is this extra layer between querying and using the results better, or even necessary?
An array is just a container of data. A result set is a much more powerful abstraction that encapsulates a very complex interaction between the database server and the client program making the data retrieval request.
"Immediately usable"... that's very naive. Yes, of course, often you just want the data, and often everything goes well and a result set object may seem a bit of a hindrance. But you should stop a moment and think of the complexity that's behind that data retrieval you are executing.
Data fetching
The first and most important consideration is that an array is a static structure that contains all data of all rows. While that might seem like a good solution for small queries, I assure you it is not in most cases. It assumes that fetching all data will require little time and memory, which is not always the case.
An RDBMS usually returns one row at a time... that's how things work. That way it can serve many clients, you can cancel your data retrieval, and the server can cut you off if you are hogging too many resources.
The result set handles the complexity of fetching one row, a page of rows, or all the rows from the back end, perhaps caching the result internally. It then gives the program access to one row of data at a time, with methods to navigate back and forth, without the program having to think about what is happening behind the scenes. Usually that is not for you to know, but there are many optimizations and gotchas down there.
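In ADO.NET terms, for instance, that abstraction can be surfaced as a lazy sequence that pulls one row from the underlying reader per iteration. This is only a sketch of the idea; the Customer class and the query are made up for illustration:

// Sketch of the idea. Requires:
//   using System.Collections.Generic;
//   using System.Data.Common;
class Customer { public int Id; public string Name; }

static IEnumerable<Customer> QueryCustomers(DbConnection conn)
{
    using (var cmd = conn.CreateCommand())
    {
        cmd.CommandText = "SELECT Id, Name FROM Customers";
        using (var rdr = cmd.ExecuteReader())
        {
            // One row is materialized per iteration; the caller decides whether
            // to stream the rows (foreach) or buffer them all (e.g. ToList()).
            while (rdr.Read())
            {
                yield return new Customer
                {
                    Id = rdr.GetInt32(0),
                    Name = rdr.GetString(1)
                };
            }
        }
    }
}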
Unidirectional queries
Some queries on some RDBMSs are more efficient if executed unidirectionally. That is, you tell the server you will never need to look up a row of data you have already fetched. Even so, result set objects can often cache this data internally and allow the program to navigate back to it (without disturbing the server).
Updatable queries
Some RDBMS support SELECT FOR UPDATE. Result set objects can often allow the program to modify the fetched data and then handle internally all the operations necessary to reflect those updates on the underlying database... and in many languages this is possible even if the RDBMS does not support SELECT FOR UPDATE.
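ADO.NET's DataAdapter/CommandBuilder pair is one familiar example of this: you edit the fetched rows locally and the adapter generates and runs the UPDATE statements for you. A minimal sketch, with placeholder connection, table, and column names:

// Sketch only. Requires System.Data and System.Data.SqlClient; names are placeholders.
using (var conn = new SqlConnection(connStr))
using (var da = new SqlDataAdapter("SELECT Id, Name FROM Customers", conn))
using (var cb = new SqlCommandBuilder(da))   // generates INSERT/UPDATE/DELETE commands
{
    var table = new DataTable();
    da.Fill(table);                          // fetch the rows into memory

    table.Rows[0]["Name"] = "New name";      // modify the fetched data locally

    da.Update(table);                        // the adapter issues the UPDATE for us
}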
Better handling of exceptions
When you ask for data, if things go well you get a stream of data that can fit in an array... if things go wrong, you get a stream of information that requires a different structure to be handled. A result set object can provide the client program with that structured information... and can maybe also provide a way of recovering.
I'm adding some more info on cursors, even though it is less relevant to this question. Fetching rows from the server is done through the use of a CURSOR. It typically involves four steps (DECLARE the cursor, OPEN it, use it to FETCH data, then CLOSE it). Declaring and opening a CURSOR allocates resources on the server which are used to remember what that specific client is asking for and what data has already been returned. FETCHing lets you navigate the result set and retrieve another row of data (not necessarily the next row). Closing the cursor tells the server you are done with that request and allows it to deallocate those resources.
Because arrays require all memory to be allocated at once, and all results to be pulled immediately. You might want to stream through terabytes of data. Or you might want to stop pulling results and abort the query mid-way.
Also note that the way a specific API exposes query results is arbitrary. You could write yourself an API that exposes the data to you as an array. This is a design choice the creator of the API makes.
I've been looking around and haven't been able to find a good answer to this question. Is there a real performance difference between the following DBI methods?
fetchrow_arrayref vs
selectall_arrayref vs
fetchall_arrayref
It probably doesn't matter when making a single SELECT call that gives a smallish result set (under 50 records or so), but what about making multiple SELECT statements in a row? What about when the result sets are huge (i.e. thousands of records)?
Thanks.
The key question to ask yourself is whether you need to keep all the returned rows around in memory.
If you do then you can let the DBI fetch them all for you - it will be faster than writing the equivalent code yourself.
If you don't need to keep all the rows in memory, in other words, if you can process each row in turn, then using fetchrow_arrayref in a loop will typically be much faster.
The reason is that the DBI goes to great lengths to reuse the memory buffers for each row. That can be a significant saving in effort. You can see the effect on this slide, although the examples don't directly match your question. You can also see from that slide the importance of benchmarking. It can be hard to tell where the balance lies.
If you can work on a per-row basis, then binding columns can yield a useful performance gain by reducing the work of accessing the values in the fetched rows.
You also asked about "huge" results. (I doubt "thousands of records" would be a problem on modern machines unless the rows themselves were very 'large'.) Clearly processing a row at a time is preferable for very large result sets. Note that some databases default to streaming all the results to the driver which then buffers them in a compact form and returns the rows one by one as your perl code (or a DBI method) fetches them. Again, benchmark and test for yourself.
If you need more help, the dbi-users mailing list is a good place to ask. You don't need to subscribe.
The difference between the fetchrow* and fetchall* methods will be the location of the loop code in the call stack. That is, fetchall* implies the fetch loop, while fetchrow* implies that you will write your own loop.
The difference between the fetch* and select* methods is that one requires you to manually prepare() and execute() the query, while the other does that for you. The time differences will come from how efficient your code is compared to DBI's.
My reading has shown that the main differences between methods are between *_arrayref and *_hashref, where the *_hashref methods are slower due to the need to look up the hash key names in the database's metadata.
I have an issue where my .NET 3.5 applications are causing the IIS worker process to continually eat up memory and never release it, until the applications start throwing memory-related errors and I have to recycle the IIS worker process. Another thing I've noticed is that the connection to the Oracle DB server also doesn't close, and it will remain open until I recycle the IIS worker process (as far as I can tell I'm closing the Oracle connections properly).

From what I've read in other similar posts, the GC is supposed to clean up unused memory and allow it to be reallocated, but this is quite clearly not happening here (I'm observing the same problem on both the remote host and the local host). I'm going to assume that this isn't an issue related to IIS settings but rather that I'm not doing proper housecleaning in my code; what things should I be looking at? Thanks.
Here is my code related to querying the Oracle DB:
Using conn As New OracleConnection(oradb)
    Try
        cmd.Connection = conn
        daData = New OracleDataAdapter(cmd)
        cbData = New OracleCommandBuilder(daData)
        dtData = New DataTable()
        dtDADPLIs = New DataTable()

        conn.Open()

        ' Fill both tables through the same adapter
        cmd.CommandText = "SELECT * FROM TABLE"
        daData.Fill(dtData)
        cmd.CommandText = "SELECT * FROM TABLE2"
        daData.Fill(dtDADPLIs)

        QueryName = "SD_TIER_REPORT"
        WriteQueryLog(QueryName)
    Catch ex As OracleException
        'MessageBox.Show(ex.Message.ToString())
    Finally
        conn.Close()
        conn.Dispose()
    End Try
End Using
I once ran into the same issue, and I bumped into this article and this one.
I exchanged a few emails with the author (Paul Wilson), and he helped me understand the problem with large objects: they are allocated in a separate "Large Object Heap", which never gets compacted.
This is what he told me:
Larger objects are indeed allocated separately, where large is something around 60-90 KB or larger (I don't remember exactly, and it's not officially documented anyhow). So if your byte arrays, and other objects for that matter, are larger than that threshold, then they will be allocated separately.

When does the large object heap get collected? You may have run into statements about there being several generations of normal memory allocation (0, 1, and 2 in the current frameworks) -- well, the large object heap is basically considered to be generation 2 automatically. That means it will not be collected until there isn't enough memory left after collecting gen 0 and gen 1 -- so basically it only happens on a full GC collection. So to answer your question -- there is no way to make sure objects in the large object heap get collected any sooner.

The problem is that I'm talking about garbage collection, which assumes that your objects (large objects in this case) are no longer referenced anywhere and thus available to be collected. If they are still referenced somewhere, then it simply doesn't matter how much the GC runs -- your memory usage is simply going to go up and up. So do you have all references gone? It may seem you do, and you might be right -- all I can tell you is that it's very easy to be wrong, and it's a terrible amount of work with memory profilers and no shortcuts to prove it one way or the other.

I can tell you that if a manual GC.Collect reliably does reduce your memory usage, then you've obviously got your objects de-referenced -- else a GC.Collect wouldn't help. So the question may simply be what makes you think you are having a memory problem? There may be no reason for a GC to collect memory if you have plenty available on a big server system!
Another article which is worth reading is this.
Solution?
Fetch only the data you need.
Avoid using DataSets when possible and choose a DataReader instead.
UPDATE:
If you're using a reporting tool like MS ReportViewer, see if you can bind your report to a "business object" instead.
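For instance, instead of filling a DataSet, the report's data source can be a plain list of small objects built from a DataReader. This is just a sketch: the TierReportRow class, the column names, and the commented-out ReportViewer binding are assumptions, and the Oracle classes are whichever ADO.NET provider you already use:

// Sketch only; written in C#, but the same classes exist for VB.NET.
// Requires System.Collections.Generic plus your Oracle ADO.NET provider namespace.
class TierReportRow { public int Id; public string Name; }

static List<TierReportRow> LoadTierReport(string oradb)
{
    var rows = new List<TierReportRow>();

    using (var conn = new OracleConnection(oradb))
    using (var cmd = new OracleCommand("SELECT ID, NAME FROM SOME_TABLE", conn))
    {
        conn.Open();
        using (var rdr = cmd.ExecuteReader())
        {
            while (rdr.Read())
            {
                // Pull only the columns the report needs, one row at a time.
                rows.Add(new TierReportRow
                {
                    Id = rdr.GetInt32(0),
                    Name = rdr.GetString(1)
                });
            }
        }
    }

    return rows;
}

// Hypothetical ReportViewer binding, e.g. in the page or form code-behind:
// reportViewer.LocalReport.DataSources.Add(new ReportDataSource("TierReport", LoadTierReport(oradb)));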
I am a newbie in VB.NET Windows application development, and I am currently studying an already developed application as a reference. In my reference project, every data update goes through a DataAdapter: the existing data is first loaded through the adapter, then a new row is created, and then the adapter's update is applied. So this method first fetches data and then updates data. Using direct SQL commands, there is only one database insert/update statement. Is there a significant impact between these two methods?
Is there significant impact in between these two methods?
That is a very general question, so the general answer is "It depends...". The overhead of creating and filling a DataAdapter will cost you something, but whether or not that is a significant cost will depend on factors like...
how much data is in the table(s) you are updating
how much information you are pulling into the DataAdapter
whether the database file is on a local drive or a network share
...and (perhaps) many other factors. The only way for you to know if there is a significant performance difference between the two approaches in your particular circumstances would be for you to run some tests and compare the results.
The latter option will, on the whole, be faster, to my knowledge.
The DataAdapter will, as you suggest, query for the data and load it into memory before making any required updates. I would be extremely surprised, however, if the DataAdapter updated all rows regardless of whether a row has changed; I would expect it to send only the required SQL to the DBMS.
Running an update statement via a SqlCommand skips querying for and loading the data and, depending on your query, will probably apply the update to the DB itself in roughly the same way.
The decision between the two depends entirely on context.
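For comparison, a direct parameterized command skips the fetch step entirely; here is a rough C# sketch (table, column, and connection names are placeholders), which is what the second approach in the question boils down to:

// Sketch only. Requires System.Data.SqlClient (or your provider's equivalent classes).
using (var conn = new SqlConnection(connStr))
using (var cmd = new SqlCommand(
    "INSERT INTO Orders (CustomerId, Total) VALUES (@customerId, @total)", conn))
{
    cmd.Parameters.AddWithValue("@customerId", 42);
    cmd.Parameters.AddWithValue("@total", 99.95m);

    conn.Open();
    cmd.ExecuteNonQuery();   // one round trip; no existing rows are fetched into memory
}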
I have a reasonable number of records in an Azure Table on which I'm attempting to do some one-time data encryption. I thought I could speed things up by using Parallel.ForEach. Also, because there are more than 1K records and I don't want to mess around with continuation tokens myself, I'm using a CloudTableQuery to get my enumerator.
My problem is that some of my records have been double encrypted and I realised that I'm not sure how thread safe the enumerator returned by CloudTableQuery.Execute() is. Has anyone else out there had any experience with this combination?
I would be willing to bet it's highly unlikely that Execute returns a thread-safe IEnumerator implementation. That said, this sounds like yet another case for the producer-consumer pattern.
In your specific scenario I would have the original thread that called Execute read the results off sequentially and stuff them into a BlockingCollection<T>. Before you start doing that, though, you want to start a separate Task that will control the consumption of those items using Parallel.ForEach. Now, you will probably also want to look into using the GetConsumingPartitioner method of the ParallelExtensions library in order to be most efficient, since the default partitioner will create more overhead than you want in this case. You can read more about this in this blog post.
An added bonus of using BlockingCollection<T> over a raw ConcurrentQueue<T> is that it offers the ability to set bounds, which can keep the producer from adding more items to the collection than the consumers can keep up with. You will of course need to do some performance testing to find the sweet spot for your application.
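A rough sketch of that shape is below; query.Execute(), the MyEntity type, and EncryptAndSave() stand in for your own code, and GetConsumingEnumerable is the built-in (if less efficient) alternative to the blog's GetConsumingPartitioner:

// Sketch only. Requires System.Collections.Concurrent and System.Threading.Tasks.
var pending = new BlockingCollection<MyEntity>(boundedCapacity: 1000);

// Producer: only this thread ever touches the (probably non-thread-safe) enumerator.
var producer = Task.Factory.StartNew(() =>
{
    foreach (var entity in query.Execute())
        pending.Add(entity);                 // blocks if the consumers fall behind
    pending.CompleteAdding();                // tell the consumers no more items are coming
});

// Consumers: Parallel.ForEach pulls items off the collection as they arrive.
// (GetConsumingPartitioner from ParallelExtensionsExtras avoids the default
// partitioner's chunk buffering, which is why it is recommended above.)
Parallel.ForEach(pending.GetConsumingEnumerable(), entity =>
{
    EncryptAndSave(entity);
});

producer.Wait();                             // surface any producer exceptions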
Despite my best efforts I've been unable to replicate my original problem. My conclusion is therefore that it is perfectly OK to use Parallel.ForEach loops with CloudTableQuery.Execute().