What are the benefits of using a database cursor? - sql

This is based on an interview question that I faced.
A very short definition could be:
It can be used to manipulate the rows returned by a query.
Besides the uses of the cursor (the points are listed on MSDN), I have a question in my mind: if we can perform all the same operations using a query or stored procedure (if I'm not wrong, like we can use Transact-SQL for MS SQL Server), is there any concrete case in which we should use a cursor?

Using cursors compared to big resultsets is like using video streaming instead of downloading a video in one swoop and watching it once it has downloaded.
If you download, you have to have a few gigs of space and the patience to wait until the download has finished. Now, no matter how fast your machine or network may be, everyone watches a movie at the same speed.
Normally any query gets sent to the server, executed, and the resultset sent over the network to you, in one burst of activity.
The cursor will give you access to the data row by row and stream each row only when you request it (when you can actually view it). A minimal sketch follows the list of benefits below.
A cursor can save you time - because you don't need to wait for the processing and download of your complete recordset.
It will save you memory, both on the server and on the client, because they don't have to dedicate a big chunk of memory to resultsets.
Load-balance both your network and your server - Working in "burst" mode is usually more efficient, but it can completely block your server and your network. Such delays are seldom desirable for multiuser environments. Streaming leaves room for other operations.
Allows operations on queried tables (under certain conditions) that do not affect your cursor directly. So while you are holding a cursor on a row, other processes are able to read, update and even delete other rows. This helps especially with very busy tables that see many concurrent reads and writes.
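To make that row-by-row access concrete, here is a minimal T-SQL sketch; the table and column names are hypothetical:

DECLARE @OrderId INT, @Total MONEY;

DECLARE order_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT OrderId, Total FROM dbo.Orders;

OPEN order_cur;
FETCH NEXT FROM order_cur INTO @OrderId, @Total;  -- one row travels at a time
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT CONCAT('Order ', @OrderId, ': ', @Total);  -- work on the current row only
    FETCH NEXT FROM order_cur INTO @OrderId, @Total;
END;
CLOSE order_cur;
DEALLOCATE order_cur;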
Which brings us to some caveats, however:
Consistency: Using a cursor, you (usually) do not operate on a consistent snapshot of the data, but on a row. So your concurrency/consistency/isolation guarantees drop from the whole database (ACID) to only one row. You can usually inform your DBMS what level of concurrency you want, but if you are too nitpicky (e.g. locking the complete table you are in), you will throw away many of the resource savings on the server side.
Transmitting every row by itself can be very inefficient, since every packet has negotiation overhead that you might avoid by sending big, maybe compressed, chunks of data per packet. (No DB server or client library is stupid enough to transmit every row individually; there's caching and chunking on both ends, but it is still relevant.)
Cursors are harder to do right. Consider a query with a big resultset, motivating you to use a cursor, that uses a GROUP BY clause with aggregate functions. (Such queries are common in data warehouses). The GROUP BY can completely trash your server, because it has to generate and store the whole resultset at once, maybe even holding locks on other tables.
Rule of thumb:
If you work on small, quickly created resultsets, don't use cursors.
Cursors excel at ad hoc, referentially complex queries of a sequential nature with big resultsets and low consistency requirements.
"Sequential nature" means there are no aggregate functions in heavy GROUP BY clauses in your query. The server can lazily decide to compute 10 rows for your cursor to consume from a cache and do other stuff meanwhile.
HTH

A cursor is a tool that allows you to iterate over the records in a set. It has the concepts of order and current record.
Generally, SQL operates with multisets: these are sets of possibly repeating records in no given order, taken as a whole.
Say, this query:
SELECT *
FROM a
JOIN b
ON b.a = a.id
operates on the multisets a and b.
Nothing in this query makes any assumptions about the order of the records, how they are stored, in which order they should be accessed, etc.
This abstracts away implementation details and lets the system try to choose the best possible algorithm to run this query.
However, after you have transformed all your data, ultimately you will need to access the records in an ordered way and one by one.
You don't care about how exactly the entries of a phonebook are stored on a hard drive, but a printer does require them to be fed in alphabetical order, and the formatting tags should be applied to each record individually.
That's exactly where the cursors come into play. Each time you are processing a resultset on the client side, you are using a cursor. You don't get megabytes of unsorted data from the server: you just get a tiny variable: a resultset descriptor, and just write something like this:
while (!rs.EOF) {     // until the cursor moves past the last row
    process(rs);      // work on the current record only
    rs.moveNext();    // ask the cursor to advance to the next record
}
It's the cursor that implements all this for you.
This of course concerns database-client interaction.
As for the database itself: inside the database, you rarely need cursors, since, as I said above, almost all data transformations can be implemented more efficiently using set operations.
However, there are exceptions:
Analytic operations in SQL Server are implemented very poorly. A cumulative sum, for instance, could be calculated much more efficiently with a cursor than using set-based operations.
Processing data in chunks. There are cases when a set-based operation should be applied sequentially to portions of a set and the results of each chunk should be committed independently. While it's still possible to do this using set-based operations, a cursor is often the preferred way; a sketch follows this list.
Recursion in systems that do not support it natively.
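As an illustration of the chunked-processing case, here is a minimal T-SQL sketch that iterates customers with a cursor and commits each customer's set-based update independently; the table, column and cursor names are all hypothetical:

DECLARE @CustomerId INT;

DECLARE cust_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT CustomerId FROM dbo.Customers;

OPEN cust_cur;
FETCH NEXT FROM cust_cur INTO @CustomerId;
WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRANSACTION;
    UPDATE dbo.Orders
    SET Archived = 1
    WHERE CustomerId = @CustomerId
      AND OrderDate < DATEADD(YEAR, -1, GETDATE());
    COMMIT;  -- each customer's chunk is committed on its own
    FETCH NEXT FROM cust_cur INTO @CustomerId;
END;
CLOSE cust_cur;
DEALLOCATE cust_cur;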
You also may find this article worth reading:
The Island of Misfit Cursors

Using a cursor, it is possible to read sequentially through a set of data, programmatically, so it behaves in a similar manner to conventional file access, rather than the set-based behaviour characteristic of SQL.
There are a couple of situations where this may be of use:
Where it is necessary to simulate file-based record access behaviour - for example, where a relational database is being used as the data storage mechanism for a piece of code that was previously written to use indexed files for data storage.
Where it is necessary to process data sequentially - a simple example might be to calculate a running total balance for a specific customer. (A number of relational databases, such as Oracle and SQL Server, now have analytical extensions to SQL that should greatly reduce the need for this.)
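For instance, the running-balance case can now be written set-based with a windowed aggregate. A minimal T-SQL sketch (SQL Server 2012+; the table and columns are hypothetical):

SELECT CustomerId,
       EntryDate,
       SUM(Amount) OVER (PARTITION BY CustomerId
                         ORDER BY EntryDate
                         ROWS UNBOUNDED PRECEDING) AS RunningBalance
FROM dbo.Ledger;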
Inevitably, Wikipedia has more: http://en.wikipedia.org/wiki/Database_cursor

With a cursor you access one row at a time, so it is good to use when you want to manipulate many rows but only deal with one at any given time.
I was told in my classes that the reason to use a cursor is that you may want to access more rows than you can fit in memory - so you can't just read all the rows into a collection and then loop through it.

Sometimes set-based logic can get quite complex and opaque. In these cases, and if performance is not an issue, a server-side cursor can be used to replace the relational logic with more manageable and familiar (to a non-relational thinker) procedural logic, resulting in easier maintenance.

Related

SQL SERVER equivalent of a HashMap?

In a Java application, if a (small) table has to be queried many times, one can simply create a HashMap and store the contents of the table in it. Once this is done, subsequent calls can be made to the HashMap instead of the table, substantially improving the performance of the program.
I have a similar situation in SQL Server where I have to loop, say, 50,000 times in a stored procedure, and each time several small tables are queried. Is there a way I can substantially reduce the time taken to execute the procedure by using an equivalent concept in SQL Server?
Databases have similar functionality, but it is hidden in the page caching mechanisms. If you repeatedly use a table, then it will be loaded into memory, and memory operations are much faster than disk operations.
I should note that you can explicitly store tables in memory using memory optimized tables (see here). Those maintain data integrity for data modifications, if that is necessary for your application.
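A minimal sketch of such a table (SQL Server 2014+; the database needs a MEMORY_OPTIMIZED_DATA filegroup first, and the table and column names here are hypothetical):

CREATE TABLE dbo.CountryLookup
(
    CountryCode CHAR(2) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1024),
    CountryName NVARCHAR(100) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

Lookups against a table like this go through a hash index in memory, which is about as close as SQL Server gets to the HashMap idea.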
Further, you should define indexes if you are randomly accessing rows in the table. Indexes in most databases (including SQL Server) are implemented as B-trees. The difference in performance between a B-tree and a hashmap -- particularly on a small table -- is not going to be significant.
I should note that if your goal is to write the most optimized code possible, then SQL is probably not the right environment. SQL engines do a lot more than process data, particularly with respect to data management and maintaining data integrity. This incurs overhead. For everything they do, they are usually quite efficient, but for a particular algorithm, it might be faster to write the logic elsewhere. (Of course, in a database you get parallelism for free.)

Are SQLite result sets in-memory data structures?

I currently find myself needing to do fairly simple computations on several million datapoints. (Constructing a large list of strings from a well defined multi-gigabyte file, sorting that list, and then comparing it to another list, a superset.) This is the sort of simple work most of us normally do with the data entirely in-memory, but the size and quantity of the units of data I need to work with could make RAM an issue if I try to keep everything in memory. I quickly realized I probably need to write the data to a file, at a few points, to avoid exhausting my system's resources. I decided to use SQLite3 for this. (This is probably a bit much for a CSV.) It is fairly lightweight, while its storage limits seem to safely exceed my requirements.
The problem I am having is understanding exactly how the result set works. The documentation I have come across seems a little vague on this. Obviously, SQLite is not writing a whole new table to the database every time a SELECT statement is executed. Does this mean it duplicates all the selected fields in a complete in-memory table, or does it only keep some sort of pointers in memory (rather than the actual data)? Or something else altogether?
I need to be able to sort the data in question. If the result set is really just an in-memory data structure, then simply creating a new table and populating it with the help of ORDER BY could be a bad idea.
SQLite does not really have result sets. It has cursors, which allow access to only the current row, and which cannot go backwards.
SQLite computes results on the fly, so only one row needs to be in memory at a time.
When a computation needs to access multiple rows (i.e., aggregate functions, or sorting without a usable index), as much data as possible is kept in the cache, and then spilled to disk in a temporary database.
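A small illustration in SQLite SQL (the table and index are hypothetical): with a usable index, the ORDER BY needs no temporary storage at all, and rows are streamed through the cursor one at a time in index order.

CREATE TABLE entries(word TEXT);
CREATE INDEX idx_entries_word ON entries(word);

-- rows come back one at a time via the cursor, in index order,
-- so no full in-memory result set is materialized:
SELECT word FROM entries ORDER BY word;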

Is it possible to reject excessively large queries on specific views?

I'm working with MS-SQL Server, and we have several views that have the potential to return enormous amounts of processed data, enough to spike our servers to 100% resource usage for 30 minutes straight with a single query (if queried irresponsibly).
There is absolutely no business case in which such huge amounts of data would need to be returned from these views, so we'd like to lock it down to make sure nobody can DoS our SQL servers (intentionally or otherwise) by simply querying these particular views without proper where clauses etc.
Is it possible, via triggers or another method, to check the where clause etc. and confirm whether a given query is "safe" to execute (based on thresholds we determine), and reject the query if it doesn't meet our guidelines?
Or can we configure the server to reject given execution plans based on estimated time-to-completion etc.?
One potential way to reduce the overall cost of certain queries coming from a certain group of people is to use the Resource Governor. You can throttle how much CPU and/or memory is used up by a particular user/group. This is effective if you have a "wild west" kind of environment where some users submit bad queries that eat your resources alive. See here.
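A minimal sketch of that setup (the pool, group, login name and limits below are all hypothetical; the classifier function must live in master):

CREATE RESOURCE POOL ReportingPool
    WITH (MAX_CPU_PERCENT = 20, MAX_MEMORY_PERCENT = 20);

CREATE WORKLOAD GROUP ReportingGroup
    USING ReportingPool;
GO

CREATE FUNCTION dbo.rg_classifier()
RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
    -- route the ad hoc reporting login into the throttled group
    IF SUSER_SNAME() = N'reporting_user'
        RETURN N'ReportingGroup';
    RETURN N'default';
END;
GO

ALTER RESOURCE GOVERNOR
    WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;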
Another thing to consider is to set your MAXDOP (max degree of parallelism) to prevent any single query from taking all of the available CPU threads. That is, if MAXDOP is 2, then any query can only use 2 CPU threads to process. This is useful to prevent a large query from stopping smaller, quicker ones from processing. See here.
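For reference, the server-wide setting is changed like this (the value 2 is only illustrative):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 2;
RECONFIGURE;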
Kind of hacky, but you could put a TOP X in every view.
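Something like this, where the view name and the cap are hypothetical:

CREATE VIEW dbo.vOrdersCapped
AS
SELECT TOP (10000) *
FROM dbo.vOrdersExpensive;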
You cannot enforce it on the SQL side, but on the app side they could use a timeout. But if they lack QC, they probably lack the discipline for timeouts. If you have some queries running 30 minutes, they are probably already setting a value longer than the default.
I'm not convinced about Blam's TOP X in each view. Without a corresponding ORDER BY clause the data will be returned in an indeterminate order. There may be benefits to CDC's MAXDOP suggestion - not so much for the query itself, but for the other queries that want to run at the same time.
I'd be inclined to look at moving to stored procedures. Then you can require input parameters and evaluate them before the query gets run in earnest. If, for example, a date range is too big, you can restrict it. You should also find out who is running the expensive query and what they really need. Seems like they might benefit from some ETL. Just some ideas.

DBGrid Filter, Delphi.

I've recently delved into the world of Delphi, for my current mini project I'm obtaining data via an SQL query and then using the filter property to display exactly what I want.
I discovered the filter by mistake and now prefer it instead of making multiple connections or calls to the database. For example, I'm returning a person object that may own many cars; the app has a check box, and depending on which one is selected it will update the filter to display only the cars that are blue or pink or whatever.
As far as I understand it, the filter works like a WHERE clause, but on the dataset that is returned from the initial query. So, my question is: is it faster to use the filter property when working with a small dataset in this manner? Or am I completely wrong in thinking that the dataset is returned, stored, and then the filter applied to that, as opposed to it constantly being updated?
I've looked online, the resources do lead me to believe that it is more efficient but I'm still unsure. Thanks for any help.
A filter on a dataset does indeed work (or at least behave) like a WHERE clause, and in some cases can be very fast.
The issues with depending on filters are:
Increased network traffic. You're moving considerably more data from the server to the client than is needed, because you're just filtering it out anyway.
Filters are applied to the data row by row. A WHERE clause can be optimized by the server to make full (or at least partial) use of existing indexes, whereas the client does not have those indexes available.
Increased memory and CPU use on the client to maintain data it isn't using in memory and to process the rows for filtering.
Data updated by other users or processes is not visible to the client app, as you're now working with all of the data in local memory and not refreshing from the server.
IMO, using a filter on all but a trivial dataset isn't a good option, and if the amount of data is that small you can move the entire dataset into a TClientDataSet and keep it in memory yourself anyway. Like every other optimization being considered, the proper answer depends on the needs of your application and the actual data in question, and should be benchmarked using that criteria to determine what is actually the better solution.
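For comparison, pushing the filter into the query itself lets the server use its indexes and send only matching rows. A minimal sketch with Delphi-style parameters (the table and column names are hypothetical):

SELECT CarId, OwnerId, Color, Model
FROM Cars
WHERE OwnerId = :OwnerId
  AND Color = :Color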
Two different animals. You're asking if it's less overhead to repeatedly query a database or do the filtering exclusively on the client side.
If your app and db are both running on the same machine, then it's probably a toss-up.
But if you're running this in a client-server, n-tier, or partitioned mobile application, and this is a common operation, then I'd say you're probably better off caching a larger set of results made in a single query on the client side and using filters to allow the users to see different views of the results. That reduces the bandwidth to the host, and the users enjoy faster response times.
(It's a pet peeve of mine to be searching for cars or apartments or real estate and I check or un-check a box to change the view, and I have to wait 5-10 seconds for the app to reply.)
That said, you might also want to consider the overall size of the data, its temporality, and how often it's updated, and see if it's worthwhile loading down significant chunks to the client to localize even more of the specialized views. Pull down whole records and cache them locally to offer users faster response times. And minimize reloading of cached records whenever possible.
A lot of times, the actual data is fairly small on a per-record basis. But when you add in the media stuff, it explodes. People often don't think about that, considering only the aggregate size of each "record" including the media blobs. If the DB designer was smart, the media isn't even being stored in the DB, but elsewhere, and is accessible via URLs.

Would this method work to scale out SQL queries?

I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (horizontally), with one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly as it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the resultsets. I realize that an aggregate function like average needs to be rewritten into a count and a sum on the nodes, and that the coordinator then divides the sum of the sums by the sum of the counts to get the average.
What kinds of problems could not easily be solved using this model? I believe one issue would be the count distinct function.
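To make the aggregate rewrite concrete, a sketch in SQL (the table, column and staging names are hypothetical):

-- on every node:
SELECT COUNT(price) AS cnt, SUM(price) AS total
FROM sales;

-- on the coordinator, over the union of the per-node rows
-- (assuming non-integer types, so the division is not truncated):
SELECT SUM(total) / SUM(cnt) AS avg_price
FROM node_results;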
Edit: I am getting so many nice suggestions, but none have addressed the method.
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing your data, you need to de-normalize it.
You mention in a comment that your database is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several identical SQL queries to individual "shard" databases, each containing a subset, then the simple selection query will scale to the network (i.e. you will eventually become network-bound at the controller), as this is a truly parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 nodes and inevitably have to summarize or sort the 1000-row result set from each node, resulting in 1M rows, you still have a long result time and a large data processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But, my general point, is if you are running, specifically, aggregates, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
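A sketch of that kind of roll-up in T-SQL (the hit table and columns are hypothetical):

SELECT CAST(hit_time AS DATE) AS hit_date,
       DATEPART(HOUR, hit_time) AS hit_hour,
       COUNT(*) AS hit_count
INTO dbo.HitsByHour
FROM dbo.WebHits
GROUP BY CAST(hit_time AS DATE), DATEPART(HOUR, hit_time);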
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running Analysis Services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying a custom solution (grid) you introduce a lot of complexity. Maybe it's your only solution, but first, did you try partitioning the table (the native solution)?
I'd seriously be looking into an OLAP solution. The trick with the cube is that once built it can be queried in lots of ways that you may not have considered. And as @HLGEM mentioned, have you addressed indexing?
Even at millions of rows, a good search should be logarithmic, not linear. If you have even one query which results in a scan, then your performance will be destroyed. We might need an example of your structure to see if we can help more.
I also agree fully with @Mason: have you profiled your query and investigated the query plan to see where your bottlenecks are? The fact that adding nodes improves speed makes me think that your query might be CPU bound.
David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau, will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by reaggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with PK/FK constraints. If this were my world, I would probably create many indexed views on top of my table (and other views), optimized for the particular queries I need to execute (an approach I have used successfully on 10-million-plus row tables).
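A minimal sketch of one such indexed view (the schema is hypothetical; SQL Server requires SCHEMABINDING and COUNT_BIG(*) for an indexed view with GROUP BY, and the summed column must be NOT NULL):

CREATE VIEW dbo.vSalesByDay
WITH SCHEMABINDING
AS
SELECT sale_date,
       SUM(amount) AS total_amount,
       COUNT_BIG(*) AS row_count
FROM dbo.Sales
GROUP BY sale_date;
GO
CREATE UNIQUE CLUSTERED INDEX ix_vSalesByDay
    ON dbo.vSalesByDay(sale_date);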
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.