In a Java application, if a small table has to be queried many times, one can simply create a HashMap and store the contents of the table in it. Once this is done, subsequent lookups can go to the HashMap instead of the table, thereby substantially improving the performance of the program.
I have a similar situation in SQL Server, where I have to loop, say, 500,000 times in a stored procedure, and on each iteration several small tables are queried. Is there a way I can substantially reduce the time taken to execute the procedure by using an equivalent concept in SQL Server?
Databases have similar functionality, but it is hidden in the page caching mechanisms. If you repeatedly use a table, then it will be loaded into memory, and memory operations are much faster than disk operations.
I should note that you can explicitly store tables in memory using memory-optimized tables (see here). These maintain data integrity under data modifications, if that is necessary for your application.
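For instance, a minimal sketch of a memory-optimized lookup table, assuming SQL Server 2014 or later and a database that already has a MEMORY_OPTIMIZED_DATA filegroup (the table and column names are hypothetical):

    -- Small lookup table pinned in memory; the hash index gives
    -- HashMap-like point lookups on Code.
    CREATE TABLE dbo.LookupCodes
    (
        Code     INT          NOT NULL
            PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1024),
        CodeText NVARCHAR(50) NOT NULL
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);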
Further, you should define indexes if you are randomly accessing rows in the table. Indexes in most databases (including SQL Server) are implemented as B-trees, not hash tables. Even so, the difference in performance between a B-tree and a hashmap -- particularly on a small table -- is not going to be significant.
I should note that if your goal is to write the most optimized code possible, then SQL is probably not the right environment. SQL engines do a lot more than process data, particularly with respect to data management and maintaining data integrity. This incurs overhead. For everything they do, they are usually quite efficient, but for a particular algorithm, it might be faster to write the logic elsewhere. (Of course, in a database you get parallelism for free.)
I currently find myself needing to do fairly simple computations on several million datapoints. (Constructing a large list of strings from a well-defined multi-gigabyte file, sorting that list, and then comparing it to another list, a superset.) This is the sort of simple work most of us normally do with the data entirely in-memory, but the size and quantity of the units of data I need to work with could make RAM an issue if I try to keep everything in memory. I quickly realized I probably need to write the data to a file at a few points to avoid exhausting my system's resources. I decided to use SQLite3 for this. (This is probably a bit much for a CSV.) It is fairly lightweight, while its storage limits seem to safely exceed my requirements.
The problem I am having is understanding exactly how the result set works. The documentation I have come across seems a little vague on this. Obviously, SQLite is not writing a whole new table to the database every time a SELECT statement is executed. Does this mean it is duplicating all the selected fields in a complete in-memory table, or does it only keep some sort of pointers in memory (rather than the actual data)? Something else altogether?
I need to be able to sort the data in question. If the result set is really just an in-memory data structure, then simply creating a new table and populating it with the help of ORDER BY could be a bad idea.
SQLite does not really have result sets. It has cursors, which allow access to only the current row, and which cannot go backwards.
SQLite computes results on the fly, so only one row needs to be in memory at a time.
When a computation needs to access multiple rows (e.g., aggregate functions, or sorting without a usable index), as much data as possible is kept in the cache and then spilled to disk in a temporary database.
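For the sorting concern specifically, a minimal SQLite sketch (the table and column are hypothetical): with an index on the sort column there is usually no need to materialize a sorted copy, because the cursor can walk the index and stream rows in order one at a time.

    CREATE TABLE items (value TEXT);
    CREATE INDEX idx_items_value ON items(value);

    -- With idx_items_value available, SQLite streams rows in sorted order
    -- through the cursor; without a usable index, the sorter holds what it
    -- can in cache and spills the rest to a temporary database.
    SELECT value FROM items ORDER BY value;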
I'm working with MS-SQL Server, and we have several views that have the potential to return enormous amounts of processed data, enough to spike our servers to 100% resource usage for 30 minutes straight with a single query (if queried irresponsibly).
There is absolutely no business case in which such huge amounts of data would need to be returned from these views, so we'd like to lock it down to make sure nobody can DoS our SQL servers (intentionally or otherwise) by simply querying these particular views without proper where clauses etc.
Is it possible, via triggers or another method, to check the where clause etc. and confirm whether a given query is "safe" to execute (based on thresholds we determine), and reject the query if it doesn't meet our guidelines?
Or can we configure the server to reject given execution plans based on estimated time-to-completion etc.?
One potential way to reduce the overall cost of certain queries coming from a certain group of people is to use the Resource Governor. You can throttle how much CPU and/or memory is used by a particular user/group. This is effective if you have a "wild west" kind of environment where some users submit bad queries that eat your resources alive. See here.
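A hedged sketch of what that can look like (Resource Governor is Enterprise-only, and every name below is hypothetical):

    -- Cap a group of sessions at 25% CPU and 25% memory.
    CREATE RESOURCE POOL ReportingPool
        WITH (MAX_CPU_PERCENT = 25, MAX_MEMORY_PERCENT = 25);
    CREATE WORKLOAD GROUP ReportingGroup USING ReportingPool;
    GO
    -- Classifier function (created in master) routes sessions into the group.
    CREATE FUNCTION dbo.fnClassifier() RETURNS SYSNAME
    WITH SCHEMABINDING
    AS
    BEGIN
        DECLARE @grp SYSNAME = N'default';
        IF SUSER_SNAME() = N'reporting_user'  -- hypothetical login
            SET @grp = N'ReportingGroup';
        RETURN @grp;
    END;
    GO
    ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifier);
    ALTER RESOURCE GOVERNOR RECONFIGURE;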
Another thing to consider is setting MAXDOP (max degree of parallelism) to prevent any single query from taking all of the available CPU threads. For example, if MAXDOP is 2, then any query can use at most 2 CPU threads. This is useful to prevent a large query from starving smaller, quicker ones. See here.
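Both the server-wide setting and the per-query hint are available; a quick sketch (the view name is hypothetical):

    -- Server-wide cap: at most 4 threads per parallel query.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 4;
    RECONFIGURE;

    -- Or per query, via a hint:
    SELECT COUNT(*) FROM dbo.BigView OPTION (MAXDOP 2);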
Kind of hacky, but put a TOP x in every view.
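Something like this, for illustration (names hypothetical):

    CREATE VIEW dbo.vSalesCapped
    AS
    SELECT TOP (10000) *
    FROM dbo.Sales;
    -- Note: without an ORDER BY, which 10,000 rows come back is indeterminate.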
You cannot enforce it on the SQL side, but on the app side they could use a timeout. But if they lack QC, they probably lack the discipline for timeouts. If you have some queries running for 30 minutes, they are probably setting a value longer than the default.
I'm not convinced by Blam's TOP x in each view. Without a corresponding ORDER BY clause, the data will be returned in an indeterminate order. There may be benefits to CDC's MAXDOP suggestion. Not so much for itself, but for the other queries that want to run at the same time.
I'd be inclined to look at moving to stored procedures. Then you can require input parameters and evaluate them before the query gets run in earnest. If, for example, a date range is too big, you can restrict it. You should also find out who is running the expensive query and what they really need. Seems like they might benefit from some ETL. Just some ideas.
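For example, a minimal sketch of the date-range check (the procedure, table, and threshold are all hypothetical):

    CREATE PROCEDURE dbo.GetSales
        @StartDate DATE,
        @EndDate   DATE
    AS
    BEGIN
        -- Reject ranges that would produce an expensive scan.
        IF DATEDIFF(DAY, @StartDate, @EndDate) > 31
        BEGIN
            RAISERROR('Date range too large; request 31 days or fewer.', 16, 1);
            RETURN;
        END;
        SELECT *
        FROM dbo.Sales
        WHERE SaleDate >= @StartDate AND SaleDate < @EndDate;
    END;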
I've recently delved into the world of Delphi. For my current mini project, I'm obtaining data via an SQL query and then using the filter property to display exactly what I want.
I discovered the filter by accident and now prefer it to making multiple connections or calls to the database. For example, I'm returning a person object that may own many cars; the app has check boxes, and depending on which one is selected, it will update the filter to display only the cars that are blue or pink or whatever.
As far as I understand it, the filter works like a WHERE clause, but on the dataset that is returned from the initial query. So, my question is: is it faster to use the filter property when working with a small dataset in this manner? And am I wrong in thinking that the dataset is returned and stored, and the filter is then applied to it, as opposed to the query being constantly re-run?
I've looked online; the resources do lead me to believe that it is more efficient, but I'm still unsure. Thanks for any help.
A filter on a dataset does indeed work (or at least behave) like a WHERE clause, and in some cases can be very fast.
The issues with depending on filters are:
Increased network traffic. You're moving considerably more data from the server to the client than is needed, only to filter it out anyway.
Filters are applied to the data row by row. A WHERE clause can be optimized by the server to make full (or at least partial) use of existing indexes, whereas the client does not have those indexes available.
Increased memory and CPU use on the client, which has to keep data it isn't using in memory and process every row when filtering.
Data updated by other users or processes is not visible to the client app, as you're now working with all of the data in local memory and not refreshing from the server.
IMO, using a filter on all but a trivial dataset isn't a good option, and if the amount of data is that small, you can move the entire dataset into a TClientDataSet and keep it in memory yourself anyway. Like every other optimization being considered, the proper answer depends on the needs of your application and the actual data in question, and should be benchmarked against those criteria to determine what is actually the better solution.
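To make the trade-off concrete, a hedged sketch of the two approaches (table, columns, and parameters are hypothetical):

    -- Filter approach: one broad query; the checkbox state is applied
    -- client-side as a dataset filter, row by row.
    SELECT car_id, owner_id, colour
    FROM cars
    WHERE owner_id = :owner;

    -- WHERE approach: push the checkbox state to the server, so only
    -- matching rows cross the network and server indexes can be used.
    SELECT car_id, owner_id, colour
    FROM cars
    WHERE owner_id = :owner
      AND colour = :colour;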
Two different animals. You're asking if it's less overhead to repeatedly query a database or do the filtering exclusively on the client side.
If your app and db are both running on the same machine, then it's probably a toss-up.
But if you're running this as a client-server, n-tier, or partitioned mobile application, and this is a common operation, then I'd say you're probably better off caching a larger set of results from a single query on the client side and using filters to let the users see different views of the results. That reduces the bandwidth to the host, and the users enjoy faster response times.
(It's a pet peeve of mine to be searching for cars or apartments or real estate and I check or un-check a box to change the view, and I have to wait 5-10 seconds for the app to reply.)
That said, you might also want to consider the overall size of the data, its temporality, and how often it's updated, and see if it's worthwhile downloading significant chunks to the client to localize even more of the specialized views. Pull down whole records and cache them locally to offer users faster response times. And minimize reloading of cached records whenever possible.
A lot of times the actual data is fairly small on a per-record basis, but when you add in the media stuff, it explodes. People often don't think about that and consider only the aggregate size of each "record", media blobs included. If the DB designer was smart, the media isn't even being stored in the DB, but elsewhere, accessible via URLs.
I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned horizontally, with one partition placed on each node. The idea is to take an incoming SQL query and simply run it, exactly as it is, on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the result sets. I realize that an aggregate function like average needs to be rewritten into a count and a sum for the nodes, and that the coordinator divides the sum of the sums by the sum of the counts to get the average.
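To make the average rewrite concrete, a minimal sketch (the table, column, and the coordinator's staging table are hypothetical):

    -- Sent unchanged to every node, each holding one horizontal partition:
    SELECT SUM(amount) AS part_sum, COUNT(amount) AS part_count
    FROM sales;

    -- On the coordinator, over the union of the per-node rows:
    SELECT SUM(part_sum) * 1.0 / SUM(part_count) AS avg_amount
    FROM node_results;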
What kinds of problems could not easily be solved using this model? I believe one issue would be COUNT(DISTINCT).
Edit: I am getting so many nice suggestions, but none have addressed the method.
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing your data, you need to de-normalize it.
You mention in a comment that your data base is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several identical SQL queries to individual "shard" databases, each containing a subset of the data, then the simple selection query will scale to the network (i.e., you will eventually become network bound at the controller), as this is a truly parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck, and in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1,000 nodes and inevitably have to summarize or sort the 1,000-row result set from each node, you end up with 1M rows, a long result time, and a large data processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But, my general point, is if you are running, specifically, aggregates, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
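A hedged sketch of that roll-up, assuming a PostgreSQL-style date_trunc and hypothetical table names:

    -- Summarize raw hits once, then query the small table instead.
    CREATE TABLE hits_hourly AS
    SELECT date_trunc('hour', hit_time) AS hit_hour,
           page_id,
           COUNT(*) AS hit_count
    FROM web_hits
    GROUP BY date_trunc('hour', hit_time), page_id;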
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly SELECT queries, have you considered either denormalising into a data warehouse that you then query, or running Analysis Services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying a custom solution (grid) you introduce a lot of complexity. Maybe it's your only solution, but first, did you try partitioning the table (the native solution)?
I'd seriously be looking into an OLAP solution. The trick with the cube is that once built, it can be queried in lots of ways that you may not have considered. And as @HLGEM mentioned, have you addressed indexing?
Even at millions of rows, a good search should be logarithmic, not linear. If you have even one query that results in a scan, your performance will be destroyed. We might need an example of your structure to see if we can help more.
I also agree fully with @Mason: have you profiled your query and investigated the query plan to see where your bottlenecks are? The fact that adding nodes improves speed makes me think that your query might be CPU bound.
David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
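For reference, a minimal sketch of constraint exclusion partitioning, assuming PostgreSQL underneath (the table and date ranges are hypothetical):

    CREATE TABLE sales (sale_id BIGINT, sale_date DATE, amount NUMERIC);
    CREATE TABLE sales_2010 (
        CHECK (sale_date >= DATE '2010-01-01' AND sale_date < DATE '2011-01-01')
    ) INHERITS (sales);
    SET constraint_exclusion = on;

    -- The planner can now skip any child table whose CHECK constraint
    -- rules it out of the requested range.
    SELECT SUM(amount)
    FROM sales
    WHERE sale_date >= DATE '2010-06-01' AND sale_date < DATE '2010-07-01';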
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau: will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by re-aggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with PK/FK constraints. If this were my world, I would probably create many indexed views on top of my table (and other views), optimized for the particular queries I need to execute (an approach I have used successfully on tables with 10 million+ rows).
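A minimal sketch of one such indexed view (SQL Server; the names are hypothetical, and Amount is assumed NOT NULL, which indexed views require for SUM):

    CREATE VIEW dbo.vSalesByDay
    WITH SCHEMABINDING
    AS
    SELECT SaleDate,
           SUM(Amount)  AS TotalAmount,
           COUNT_BIG(*) AS RowCnt  -- COUNT_BIG(*) is required in indexed views
    FROM dbo.Sales
    GROUP BY SaleDate;
    GO
    -- Materializing the view lets the optimizer answer matching aggregates
    -- from this index instead of scanning the base table.
    CREATE UNIQUE CLUSTERED INDEX IX_vSalesByDay ON dbo.vSalesByDay (SaleDate);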
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.