Is it possible to somehow optimize the performance of the queries (apart from playing with hardware and OS settings) under these conditions:
1) You can't add indexes.
2) You can't alter the queries themselves.
These are common constraints when benchmarking the performance of a database.
I understand that the DBMS has a query optimizer that plays a numbers game with all the statistics pertaining to accessing the tables touched by the query. Are there cases when the query optimizer comes up with suboptimal solutions? I know that you can force the optimizer to use a particular query plan. I'm not sure how to cache it, though, without altering the query plan. The DB in question is Sybase.
Independent of the specific case here (Sybase), there are multiple ways to optimize a query under the given conditions. Syntax is system-specific.
Most systems rely on statistics for finding the best query plan. So updating the statistics could help improve performance.
Many systems let you set an optimization level independently of the application. This can have a positive impact on performance.
Many systems let you reuse query plans for similar ad hoc queries (dynamic SQL). Usually this has a positive impact.
Allowing the database system (independently of the OS) to assign more memory to bottlenecks can also help.
What privileges do you have, what are the benchmark rules?
Data Henrik mentions optimisation level - you can set this system-wide for Sybase, or per session.
You can even have a flexible method that sets the level according to application name or login Id (see Rob Verschoor's Sybase site - login triggers.) I'd guess if you're not allowed to change queries or indexes you'd not likely be allowed to do this.
As far as I can tell you don't have a specific problem - you just mention benchmarking.
You should be sure all tables have UPDATE INDEX STATISTICS run on them, and you could then do your benchmarks with the 3 Sybase optimisation levels - OLTP, MIX, DSS.
If you have specific problems, that's another subject.
I'm currently researching a very large table (~100 million rows, 35 columns). It's currently stored in a SQL db, but the queries I'm running (and they're various) run very, very slowly.
So I gather I should probably move to a NoSQL db. The questions are:
How can I tell which (NoSQL) db is best for me?
How can I move my current SQL table to the new NoSQL scheme?
OR should I stay in SQL and just fine tune it?
A few more details: rows will not be added or removed; this is historical data, and all of the analysis will be done on that table. I plan to run various queries on it. The data is numerical.
I routinely work with a SQL Server 2012 table that has 900 million rows. This table has rows being added to it about every 2 minutes with a total of about 200K per day. I can query this table and get rows back in a couple seconds (using the clustered index / PK). I can also query on one of the other indexes and get results back in seconds or less.
So, it's all a matter of making sure your indexes are set up correctly, AND BEING USED!! Check your queries against the query plan being generated and make sure seeks are being done.
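To make "check the query plan" concrete, here is a minimal sketch. It uses SQLite (via Python's standard library) purely as a stand-in, because it runs anywhere; the table and index names are made up for illustration. In SQL Server you would look at the execution plan and check for an Index Seek rather than a Scan, but the idea is the same: the same query goes from a full scan to an index lookup once a usable index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, used only for this illustration.
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)


def plan(sql):
    """Return the plan descriptions SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable description is the fourth column of each row.
    return " | ".join(row[3] for row in rows)


query = "SELECT value FROM readings WHERE sensor = 42"

plan_before = plan(query)  # no usable index yet: a full table scan
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")
plan_after = plan(query)   # the index can now be used for a lookup

print(plan_before)
print(plan_after)
```

The exact wording of the plan text differs between SQLite versions, so read it for the SCAN versus SEARCH distinction rather than matching it literally.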
There could be good reasons for moving to NoSQL, or something similar. But moving to NoSQL because you think you can't get good performance in SQL Server, before making sure you've done everything you can do to improve performance first, is not a good reason.
Some food for thought:
100M rows is well within SQL's "sweet spot". You can grow by x10 and still be assured that SQL will be able to support you with fairly trivial effort.
NoSQL is not a silver bullet for solving performance problems at scale. It offers a set of tradeoffs which, with careful planning, can provide better results. But it sounds like you don't fully understand your performance issues in SQL, and without that your chances of making the correct design decisions in a NoSQL environment are slim.
One of the common tradeoffs in NoSQL systems is that they typically provide less flexibility in querying, in return for greater flexibility in schema management. You mentioned your queries are "various". If they are truly varied or, more importantly, frequently changing, then moving to a NoSQL system can put you in a world of pain, especially if you are not familiar with the technology yet.
Bottom line: you aren't doing anything which is clearly "beyond" the capabilities of SQL, and your problems are probably caused more by inefficient implementation than by any inherent platform limitations. Moving to a NoSQL system won't magically solve any of your problems, and will probably introduce new ones.
If you are running a query on columns that are not indexed, it will be very slow. You can add more indexes to speed such queries up. If your DB is static this should work.
One major speed up is the usage of map-reduce queries, where aggregations are carried out by multiple processes or computers. NoSQL databases like MongoDB can be used in such ways. But even MySQL has Cluster capabilities nowadays: http://www.mysql.de/products/cluster/scalability.html. SQL Server can be clustered as well.
So I guess the best first shot would be to tailor the indexes on the table to the query. Each column the query uses as an argument (in comparisons, counts, and so on) should be indexed.
If this is not doing any better you probably count and calculate a lot and you should use map-reduce jobs and a DB which can handle this like MongoDB: http://docs.mongodb.org/manual/aggregation/
I hope this helps
For each account, I have millions of data items (rows in analytics logs), each with 20-50 numeric properties (they can be null too). I need to show them stats which mostly involve queries like SELECT SUM(f1), f2, f3 WHERE f4>f5 GROUP BY f2, f3. The aggregation functions are sometimes more complex than SUM(), and GROUP BY sometimes involves simple functions like ROUND(). The problem is that such queries are built in the user interface and can be run on any combination of those properties (though there are some popular combinations of course).
Once in the database, the data would most likely not be modified, only read. It should be possible to easily add/remove properties – not necessarily realtime in database terms, but it should not require complete table blocks like in MySQL.
What SQL or NoSQL databases would be best to handle these kinds of queries? I was thinking of PostgreSQL or MongoDB, even though in the latter I will most likely have to use MapReduce rather than its Group feature because of its limitations.
Any other advice on performance of such queries? Does this sound possible to do at all, or do I absolutely have to ask users to pre-define which exact queries they want to run?
Any ideas would be much appreciated.
What query performance are you looking for? How often will it be queried?
If you're OK with query performance in the low minutes and have a similarly low query rate, then you can use a relational table with a main table for the data items, and a join table for the properties. Be sure to put a combined index on the second table on the combination (property_type, data_item_id, property_value) to guarantee good query performance. You don't actually need property_value in there, but if you have it then queries can pull their data from the index in a highly efficient manner, which will make joins much, much easier. You can do this with any relational database. I happen to like PostgreSQL, but MySQL can also work. (But less efficiently on complex queries.)
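A rough sketch of that layout, using SQLite from Python only because it is easy to run; the same DDL idea carries over to PostgreSQL or MySQL. The table names (data_item, item_property) and the sample values are invented for illustration, and the query is a simplified form of the one in the question (grouping by f2 only).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_item (id INTEGER PRIMARY KEY, account_id INTEGER);
CREATE TABLE item_property (
    property_type  TEXT    NOT NULL,
    data_item_id   INTEGER NOT NULL REFERENCES data_item(id),
    property_value REAL
);
-- The combined index from the answer; with property_value included,
-- lookups can be answered from the index alone.
CREATE INDEX idx_item_property
    ON item_property (property_type, data_item_id, property_value);
""")

sample = [  # (item id, f1, f2, f4, f5), made-up values
    (1, 10.0, 1.0, 5.0, 3.0),
    (2, 20.0, 1.0, 2.0, 9.0),  # dropped by the filter: f4 <= f5
    (3, 30.0, 2.0, 8.0, 1.0),
    (4, 40.0, 1.0, 7.0, 6.0),
]
for item_id, f1, f2, f4, f5 in sample:
    conn.execute("INSERT INTO data_item VALUES (?, 1)", (item_id,))
    conn.executemany(
        "INSERT INTO item_property VALUES (?, ?, ?)",
        [("f1", item_id, f1), ("f2", item_id, f2),
         ("f4", item_id, f4), ("f5", item_id, f5)],
    )

# SELECT SUM(f1), f2 WHERE f4 > f5 GROUP BY f2, with one join per property.
sql = """
SELECT p2.property_value AS f2, SUM(p1.property_value) AS sum_f1
FROM item_property p1
JOIN item_property p2
  ON p2.property_type = 'f2' AND p2.data_item_id = p1.data_item_id
JOIN item_property p4
  ON p4.property_type = 'f4' AND p4.data_item_id = p1.data_item_id
JOIN item_property p5
  ON p5.property_type = 'f5' AND p5.data_item_id = p1.data_item_id
WHERE p1.property_type = 'f1'
  AND p4.property_value > p5.property_value
GROUP BY p2.property_value
ORDER BY p2.property_value
"""
result = conn.execute(sql).fetchall()
print(result)
```

Note that an item with a missing (null) property simply has no row for it, so an inner join silently drops that item; whether that is what you want depends on the statistic.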
If you follow this strategy then each property you want will require you to add yet another join. But the joins will be fairly efficient.
You can build this kind of application in an RDBMS or in a NoSQL database (Berkeley DB for example, has both a key-value pair API and a SQL API). The key-value pair API is a nice option, since it supports some pretty low level optimizations that may help when looking at how to tune the performance to meet your application needs.
Another option is to look into a columnar data store, but even that kind of product is going to have to retrieve data from multiple columns (which is slow in these kinds of databases) in order to resolve the kinds of queries that you list.
Ultimately the issue here boils down to disk I/O vs. cache and data organization. The more data that you can fit into memory, the less I/O you have to perform, and I/O is going to be the performance killer. The more compact you can make the data, the more rows will fit in the memory that you have. I would suggest looking into Berkeley DB, especially the key-value pair API. You can then choose to create one or more tables with the properties organized in a manner that optimizes the most frequent kinds of access. Additionally, if you're using the key-value pair API, take a look at the Bulk Get functions -- this allows you to fetch and process whole groups of records at a time.
You may also want to create and maintain some "well known" statistical results (in memory and/or persisted on disk) that allow you to take "shortcuts" when the user is asking for a value that has already been computed.
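One simple way to get such shortcuts, sketched here under the assumption that the data is read-only once loaded (as the question says), is to memoize aggregate results by their parameters. The table, the column whitelist, and the grouped_sum helper are all hypothetical; a real system would also need invalidation if the data ever changed.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (f1 REAL, f2 INTEGER, f3 INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1.0, 1, 0), (2.0, 1, 0), (3.0, 2, 1)],
)

# Column names are interpolated into SQL, so only accept known ones.
ALLOWED_COLUMNS = {"f1", "f2", "f3"}
stats = {"executions": 0}  # counts how often the database is actually hit


@lru_cache(maxsize=256)
def grouped_sum(measure, group_col):
    """SUM(measure) grouped by group_col, computed once per combination."""
    if measure not in ALLOWED_COLUMNS or group_col not in ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    stats["executions"] += 1
    sql = (
        f"SELECT {group_col}, SUM({measure}) FROM items "
        f"GROUP BY {group_col} ORDER BY {group_col}"
    )
    return tuple(conn.execute(sql).fetchall())


first = grouped_sum("f1", "f2")   # runs the query
second = grouped_sum("f1", "f2")  # served from the in-memory cache
print(first, stats["executions"])
```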
Good luck in your research.
What you've described - essentially ad hoc aggregate queries on data that does not need to be realtime - is what OLAP solutions are very good at. In addition to other suggestions you've seen, you should look into whether an OLAP solution makes sense for you.
What are the best docs/articles out there that show how different queries perform on large data sets? I'm trying to get a feel for which approaches are better than others when building a dashboard where the number and type of queries easily become large, complex, and slow.
I suggest you take the time to get a book on performance tuning for the database backend you are using. Performance tuning is very database specific, and techniques that are fast on one database may be dog slow on another (cursors on Oracle and SQL Server come to mind). If you are going to be writing complex queries, you need to understand performance tuning in depth before trying to write them. I will give you a hint: joins are usually faster than correlated subqueries, which developers often use because they seem more straightforward.
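To illustrate the hint, here is the same question ("total order amount per customer") written both ways. This is only a sketch on made-up tables in SQLite; it shows that the two forms are interchangeable here, not how fast either is. Which one wins depends on the engine and its optimizer, so measure on your own backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'ann'), (2, 'bob');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.0);
""")

# Correlated subquery: the inner SELECT is evaluated per customer row.
correlated = """
SELECT c.name,
       (SELECT SUM(o.amount) FROM orders o WHERE o.customer_id = c.id) AS total
FROM customer c
ORDER BY c.name
"""

# Join form: one pass over the joined rows, then a GROUP BY.
joined = """
SELECT c.name, SUM(o.amount) AS total
FROM customer c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY c.name
"""

correlated_rows = conn.execute(correlated).fetchall()
joined_rows = conn.execute(joined).fetchall()
print(correlated_rows)
print(joined_rows)
```

One caveat when rewriting: a customer with no orders appears with a NULL total in the subquery form but is dropped by the inner join, so a LEFT JOIN is needed to keep the results identical in that case.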
It is not easy to compare the actual performance of one database architecture against another. I think the most common comparisons you'll find are benchmarks, which will vary, or theoretical proposals.
It's certainly harder to compare SQL vs NoSQL because you need a lot of data in the first place to have NoSQL be beneficial.
Minimizing joins, indexing properly, and having enough memory/cpu availability are going to be the three dominating factors.
I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (vertically), with one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly as it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the result sets. I realize that an aggregate function like average needs to be rewritten into a count and a sum for the nodes, and that the coordinator divides the sum of the sums by the sum of the counts to get the average.
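The average rewrite described above can be sketched as follows. Each in-memory SQLite database stands in for one node holding a subset of the rows; the table t, its columns, and the helper names are invented for this illustration. The naive version (averaging the per-node averages) is included only to show why the rewrite is needed.

```python
import sqlite3


def make_node(rows):
    """Create one stand-in 'node' holding a subset of the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (grp TEXT, x REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return conn


nodes = [
    make_node([("a", 1.0), ("a", 2.0), ("b", 10.0)]),
    make_node([("a", 6.0)]),
    make_node([("b", 20.0), ("b", 30.0)]),
]

# Original query: SELECT grp, AVG(x) FROM t GROUP BY grp
# Rewritten for the nodes as a sum and a count:
NODE_SQL = "SELECT grp, SUM(x), COUNT(x) FROM t GROUP BY grp"


def distributed_avg(nodes):
    """Coordinator step: sum of sums divided by sum of counts, per group."""
    totals = {}
    for node in nodes:
        for grp, part_sum, part_count in node.execute(NODE_SQL):
            acc = totals.setdefault(grp, [0.0, 0])
            acc[0] += part_sum
            acc[1] += part_count
    return {grp: s / c for grp, (s, c) in totals.items()}


def naive_avg_of_avgs(nodes):
    """Running AVG unchanged on each node and averaging again (incorrect)."""
    partials = {}
    for node in nodes:
        for grp, avg in node.execute("SELECT grp, AVG(x) FROM t GROUP BY grp"):
            partials.setdefault(grp, []).append(avg)
    return {grp: sum(v) / len(v) for grp, v in partials.items()}


correct = distributed_avg(nodes)
naive = naive_avg_of_avgs(nodes)
print(correct)
print(naive)
```

SUM, COUNT, MIN and MAX can be combined from partial results this way; aggregates that need to see all the values at once (a median, or a distinct count) cannot be combined from a single number per node.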
What kinds of problems could not easily be solved using this model? I believe one issue would be the count distinct function.
Edit: I am getting so many nice suggestions, but none have addressed the method.
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing your data, you need to de-normalize it.
You mention in a comment that your database is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several, identical SQL queries to individual "shard" databases, each containing a subset, then the simple selection query will scale to the network (i.e. you will eventually become network bound to the controller), as this is a truly, parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 nodes and inevitably have to summarize or sort the 1000-row result set from each node, resulting in 1M rows, you still have a long result time and a large data processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But my general point is that if you are running aggregates specifically, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
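As a sketch of that kind of roll-up (SQLite again, with an invented hits table and a handful of made-up rows): the raw rows are summarized once into a small table keyed by hour of day, and the dashboard-style queries then read the summary instead of the raw data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical raw log with subsecond timestamps.
conn.execute("CREATE TABLE hits (ts TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [
        ("2024-01-01 09:00:01.120", "/a"),
        ("2024-01-01 09:15:42.003", "/a"),
        ("2024-01-01 09:59:59.999", "/b"),
        ("2024-01-01 10:00:00.000", "/a"),
        ("2024-01-02 09:30:00.500", "/b"),
    ],
)

# Build the summary once; later queries never touch the raw rows.
conn.execute("""
CREATE TABLE hits_by_hour AS
SELECT CAST(strftime('%H', ts) AS INTEGER) AS hour_of_day,
       COUNT(*) AS hits
FROM hits
GROUP BY CAST(strftime('%H', ts) AS INTEGER)
""")

summary = conn.execute(
    "SELECT hour_of_day, hits FROM hits_by_hour ORDER BY hour_of_day"
).fetchall()
print(summary)
```

However many raw rows there are, this particular summary never has more than 24 rows; the price is that questions finer than an hour can no longer be answered from it.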
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running analysis services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying a custom solution (grid) you introduce a lot of complexity. Maybe it's your only solution, but did you first try partitioning the table (the native solution)?
I'd seriously be looking into an OLAP solution. The trick with the cube is that, once built, it can be queried in lots of ways that you may not have considered. And as @HLGEM mentioned, have you addressed indexing?
Even at millions of rows, a good search should be logarithmic, not linear. If you have even one query which results in a scan then your performance will be destroyed. We might need an example of your structure to see if we can help more.
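To put numbers on "logarithmic, not linear", here is a small illustration in plain Python. A sorted list stands in for an index and a loop over the list stands in for a table scan; real B-tree indexes differ in detail, but the growth rates are the point.

```python
def linear_steps(keys, target):
    """Number of rows examined by a scan that stops at the first match."""
    steps = 0
    for key in keys:
        steps += 1
        if key == target:
            break
    return steps


def binary_steps(keys, target):
    """Number of probes a binary search makes on sorted keys."""
    lo, hi, steps = 0, len(keys), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return steps


keys = list(range(1_000_000))  # one million sorted keys
target = keys[-1]              # worst case for the scan

scan_cost = linear_steps(keys, target)
search_cost = binary_steps(keys, target)
print(scan_cost, search_cost)
```

For a million keys the scan examines up to a million rows, while the search needs about twenty probes; growing the data tenfold adds only three or four more probes.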
I also agree fully with @Mason: have you profiled your query and investigated the query plan to see where your bottlenecks are? The fact that adding nodes improves speed makes me think that your query might be CPU bound.
David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau: will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by reaggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with pk/fk/constraints. If this were my world, I would probably create many indexed views on top of my table (and other views) that optimized for the particular queries I need to execute (which I have worked with successfully on 10million+ row tables.)
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
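For what it's worth, here is a generic sketch of why count(distinct) is awkward in this model and one way it can be made to work. This is not taken from shard-query; the shard contents and function names are made up. The per-node distinct counts cannot simply be added, because the same value may live on several nodes, so each node has to return its distinct values and the coordinator counts the union.

```python
def count_distinct_by_summing(shards):
    """Adding up per-shard COUNT(DISTINCT x) results (overcounts)."""
    return sum(len(set(values)) for values in shards)


def count_distinct_by_union(shards):
    """Each shard sends its distinct values; the coordinator counts the union."""
    seen = set()
    for values in shards:
        seen |= set(values)  # what SELECT DISTINCT x would return per shard
    return len(seen)


# Hypothetical values of column x on three shards; 1, 3 and 4 are repeated.
shards = [[1, 2, 3], [3, 4], [1, 4, 5]]

wrong = count_distinct_by_summing(shards)
right = count_distinct_by_union(shards)
print(wrong, right)
```

The cost is that the coordinator receives every distinct value rather than one number per node, which can be a lot of data for high-cardinality columns.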
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.