What are the best docs/articles out there that show how different queries perform on large data sets? I'm trying to get a feel for which approaches are better than others when building a dashboard, where the number and type of queries easily become large, complex, and slow.

I suggest you take the time to get a book on performance tuning for the database backend you are using. Performance tuning is very database specific, and techniques that are fast on one database may be dog slow on another (cursors on Oracle and SQL Server come to mind). If you are going to be writing complex queries, you need to understand performance tuning in depth before trying to write them. I will give you a hint: joins are usually faster than correlated subqueries, which developers often use because they seem more straightforward.
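To make the hint about joins and correlated subqueries concrete, here is a minimal sketch that writes the same report both ways. It uses Python's built-in sqlite3 module as a stand-in for whatever backend you run; the customers/orders schema and all names are assumptions made up for illustration. It only shows that the two forms return the same rows; which one is faster depends on your database and has to be measured there.

```python
import sqlite3

# Hypothetical schema, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE INDEX idx_orders_customer ON orders (customer_id);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# Correlated subquery: the inner SELECT is evaluated per customer row.
correlated = """
    SELECT c.name,
           (SELECT COALESCE(SUM(o.amount), 0)
              FROM orders o
             WHERE o.customer_id = c.id) AS total
      FROM customers c
     ORDER BY c.name
"""

# Join form: one pass that groups the joined rows per customer.
joined = """
    SELECT c.name, COALESCE(SUM(o.amount), 0) AS total
      FROM customers c
      LEFT JOIN orders o ON o.customer_id = c.id
     GROUP BY c.id, c.name
     ORDER BY c.name
"""

correlated_rows = conn.execute(correlated).fetchall()
joined_rows = conn.execute(joined).fetchall()
print(correlated_rows == joined_rows)
```

Once both forms are known to be equivalent, compare their query plans and timings on your real data before settling on one.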

It is not easy to compare the actual performance of one database architecture against another. The most common material you'll find is benchmarks, which vary, or theoretical proposals.
It's certainly harder to compare SQL vs NoSQL because you need a lot of data in the first place to have NoSQL be beneficial.
Minimizing joins, indexing properly, and having enough memory/cpu availability are going to be the three dominating factors.


Would mysql with sharding (Vitess) be faster than any no SQL database?

I have been trying to understand why NoSQL databases are considered to be faster than an RDBMS. I understand that NoSQL databases don't follow the ACID properties but instead follow the BASE principle, which is the reason NoSQL is able to scale horizontally.
What I am trying to understand here is whether there is a big difference in similar queries.
For instance let's say we have a search which searches all the matching users in our db. We have same data in MySQL as well as in any nosql db and let's assume that for nosql we have just 1 shard. So would there still be a difference in the speed of the query or would it be the same?
With disks getting faster and cheaper, the speed advantage may not be significant enough to make a difference.
However, you can do a lot more with systems like Vitess in terms of transactions, joins and indexes. At the same time, you can continue to scale like NoSQL systems. For this reason, I think Vitess is an overall better trade-off.

Should I move to NoSQL? (big data)

I'm currently researching a very large table (~100 million rows, 35 columns). It's currently stored in a SQL db, but the queries I'm running (and they're various) run very, very slow.
So I gather I should probably move to a NoSQL db. The question is:
How can I tell which (NoSQL) db is best for me?
How can I move my current SQL table to the new NoSQL scheme?
OR should I stay in SQL and just fine tune it?
A few more details: rows will not be added/removed, this is historical data and all of the analysis will be done on that table. plan to run various queries on it. data is numerical.
I routinely work with a SQL Server 2012 table that has 900 million rows. This table has rows being added to it about every 2 minutes with a total of about 200K per day. I can query this table and get rows back in a couple seconds (using the clustered index / PK). I can also query on one of the other indexes and get results back in seconds or less.
So, it's all a matter of making sure your indexes are set up correctly, AND BEING USED!! Check your queries against the query plan being generated and make sure seeks are being done.
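Here is a small sketch of what "check the query plan" can look like. It uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module as a stand-in; SQL Server has its own execution plan tooling, and the readings table and index name are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, invented for this illustration.
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor_id INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO readings (sensor_id, value) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

def plan(sql):
    """Return the human-readable plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text

query = "SELECT * FROM readings WHERE sensor_id = 42"

plan_before = plan(query)   # no index on sensor_id yet: expect a full scan
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor_id)")
plan_after = plan(query)    # now the plan should search via the new index

print("before:", plan_before)
print("after: ", plan_after)
```

The point is the habit rather than the tool: after adding an index, confirm in the plan that the query actually seeks on it instead of scanning.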
There could be good reasons for moving to NoSQL, or something similar. But moving to NoSQL because you think you can't get good performance in SQL Server, before making sure you've done everything you can do to improve performance first, is not a good reason.
Some food for thought:
100M rows is well within SQL's "sweet spot". You can grow by x10 and still be assured that SQL will be able to support you with fairly trivial effort.
NoSQL is not a silver bullet for solving performance problems at scale. It offers a set of tradeoffs which, with careful planning, can provide better results. But it sounds like you don't fully understand your performance issues in SQL, and without that your chances of making the correct design decisions in a NoSQL environment are slim.
One of the common tradeoffs in NoSQL systems is that they typically provide less flexibility in querying, in return for greater flexibility in schema management. You mentioned your queries are "various": if they are truly varied or, more importantly, frequently changing, then moving to a NoSQL system can put you in a world of pain, especially if you are not familiar with the technology yet.
Bottom line- You aren't doing anything which is clearly "beyond" the capabilities of SQL, and your problems are probably caused more by inefficient implementation than by any inherent platform limitations. Moving to a NoSQL system won't magically solve any of your problems, and will probably introduce new ones.
If you are running a query on columns that are not indexed, it will be very slow. You can add more indexes to speed such queries up. If your DB is static this should work.
One major speed up is the usage of map-reduce queries, where aggregations are carried out by multiple processes or computers. NoSQL databases like MongoDB can be used in such ways. But even MySQL has Cluster capabilities nowadays: http://www.mysql.de/products/cluster/scalability.html. SQL Server can be clustered as well.
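The map-reduce idea can be sketched in a few lines of plain Python. This is not MongoDB's or MySQL Cluster's API, just an illustration of the shape: each worker aggregates its own partition (map), and the partial results are merged at the end (reduce). The sample rows and the three-way partitioning are assumptions for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_partition(rows):
    """Map step: aggregate one partition independently of the others."""
    partial = Counter()
    for key, value in rows:
        partial[key] += value
    return partial

def reduce_partials(partials):
    """Reduce step: merge the per-partition sums into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return total

# Hypothetical (group_key, value) rows, split round-robin into 3 partitions.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]
partitions = [rows[i::3] for i in range(3)]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_partition, partitions))

totals = reduce_partials(partials)
print(dict(totals))
```

In a real cluster the partitions live on different processes or machines; threads are used here only to keep the sketch self-contained.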
So I guess the best first shot would be to optimize your indexes in the table to the query. Each argument column to the query (compare, count ...) etc. should be indexed.
If this is not doing any better you probably count and calculate a lot and you should use map-reduce jobs and a DB which can handle this like MongoDB: http://docs.mongodb.org/manual/aggregation/
I hope this helps

What database is most efficient for simple group by queries on tons of data?

For each account, I have millions of data items (rows in analytics logs), each with 20-50 numeric properties (they can be null too). I need to show them stats which mostly involve queries like SELECT SUM(f1), f2, f3 WHERE f4>f5 GROUP BY f2, f3. The aggregation functions are sometimes more complex than SUM(), and GROUP BY sometimes involves simple functions like ROUND(). The problem is that such queries are built in the user interface and can be run on any combination of those properties (though there are some popular combinations of course).
Once in the database, the data would most likely not be modified, only read. It should be possible to easily add/remove properties – not necessarily realtime in database terms, but it should not require complete table blocks like in MySQL.
What SQL or NoSQL databases would be best to handle these kinds of queries? I was thinking of PostgreSQL or MongoDB, even though in the latter I will most likely have to use MapReduce rather than its Group feature because of its limitations.
Any other advice on performance of such queries? Does this sound possible to do at all, or do I absolutely have to ask users to pre-define which exact queries they want to run?
Any ideas would be much appreciated.
What query performance are you looking for? How often will it be queried?
If you're OK with query performance in the low minutes and have a similarly low query rate, then you can use a relational table with a main table for the data items, and a join table for the properties. Be sure to put a combined index on the second table on the combination (property_type, data_item_id, property_value) to guarantee good query performance. You don't actually need property_value in there, but if you have it then queries can pull their data from the index in a highly efficient manner, which will make joins much, much easier. You can do this with any relational database. I happen to like PostgreSQL, but MySQL can also work. (But less efficiently on complex queries.)
If you follow this strategy then each property you want will require you to add yet another join. But the joins will be fairly efficient.
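A minimal sketch of this layout, assuming SQLite (via Python's sqlite3) as a stand-in for PostgreSQL or MySQL. The table names data_item and item_property and the sample values are invented for illustration; the combined index follows the (property_type, data_item_id, property_value) order described above. The query is the join form of SELECT SUM(f1), f2 WHERE f4 > f5 GROUP BY f2, with one join per property.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical names: one main table plus a join table holding the properties.
conn.executescript("""
    CREATE TABLE data_item (id INTEGER PRIMARY KEY, account_id INTEGER);
    CREATE TABLE item_property (
        property_type  TEXT    NOT NULL,
        data_item_id   INTEGER NOT NULL REFERENCES data_item (id),
        property_value REAL
    );
    CREATE INDEX idx_property_lookup
        ON item_property (property_type, data_item_id, property_value);
""")

items = {
    1: {"f1": 10, "f2": 1, "f4": 5, "f5": 3},
    2: {"f1": 20, "f2": 1, "f4": 2, "f5": 9},
    3: {"f1": 30, "f2": 2, "f4": 8, "f5": 1},
}
for item_id, props in items.items():
    conn.execute("INSERT INTO data_item VALUES (?, 1)", (item_id,))
    conn.executemany(
        "INSERT INTO item_property VALUES (?, ?, ?)",
        [(name, item_id, value) for name, value in props.items()],
    )

# Each property used by the query costs one more join against item_property.
sql = """
    SELECT SUM(p1.property_value), p2.property_value
      FROM data_item d
      JOIN item_property p1 ON p1.property_type = 'f1' AND p1.data_item_id = d.id
      JOIN item_property p2 ON p2.property_type = 'f2' AND p2.data_item_id = d.id
      JOIN item_property p4 ON p4.property_type = 'f4' AND p4.data_item_id = d.id
      JOIN item_property p5 ON p5.property_type = 'f5' AND p5.data_item_id = d.id
     WHERE p4.property_value > p5.property_value
     GROUP BY p2.property_value
     ORDER BY p2.property_value
"""
result = conn.execute(sql).fetchall()
print(result)
```

Because the SQL is built from a fixed pattern per property, a user interface can generate it for any combination of properties without schema changes.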
You can build this kind of application in an RDBMS or in a NoSQL database (Berkeley DB for example, has both a key-value pair API and a SQL API). The key-value pair API is a nice option, since it supports some pretty low level optimizations that may help when looking at how to tune the performance to meet your application needs.
Another option is to look into a columnar data store, but even that kind of product is going to have to retrieve data from multiple columns (which is slow in these kinds of databases) in order to resolve the kinds of queries that you list.
Ultimately the issue here boils down to disk I/O vs. cache and data organization. The more data that you can fit into memory, the less I/O you have to perform, and I/O is going to be the performance killer. The more compact you can make the data, the more rows will fit in the memory that you have. I would suggest looking into Berkeley DB, especially the key-value pair API. You can then choose to create one or more tables with the properties organized in a manner that optimizes the most frequent kinds of access. Additionally, if you're using the key-value pair API, take a look at the Bulk Get functions -- this allows you to fetch and process whole groups of records at a time.
You may also want to create and maintain some "well known" statistical results (in memory and/or persisted on disk) that allow you to take "shortcuts" when the user is asking for a value that has already been computed.
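One way to picture the "well known" results idea is a small in-memory cache in front of the aggregate query. This is only a sketch: the events table, the sum_by helper, and the column whitelist are hypothetical, and it assumes the data is read-only once loaded, so cached results never need invalidating.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, invented for this illustration.
conn.execute("CREATE TABLE events (f1 REAL, f2 INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)", [(1.0, 1), (2.0, 1), (5.0, 2)]
)

ALLOWED_GROUP_COLUMNS = {"f2"}  # whitelist, since the name is put into the SQL
_cache = {}                     # group column -> previously computed rows
db_calls = 0                    # counts how often the database was really hit

def sum_by(group_column):
    """Return SUM(f1) per group, computing it only on the first request."""
    global db_calls
    if group_column not in ALLOWED_GROUP_COLUMNS:
        raise ValueError(f"unsupported group column: {group_column}")
    if group_column not in _cache:
        db_calls += 1
        sql = (
            f"SELECT {group_column}, SUM(f1) FROM events "
            f"GROUP BY {group_column} ORDER BY {group_column}"
        )
        _cache[group_column] = conn.execute(sql).fetchall()
    return _cache[group_column]

first = sum_by("f2")    # computed by the database
second = sum_by("f2")   # served from the cache
print(first, db_calls)
```

For the popular property combinations, the same results could be precomputed at load time and persisted instead of being filled lazily.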
Good luck in your research.
What you've described - essentially ad hoc aggregate queries on data that does not need to be realtime - is what OLAP solutions are very good at. In addition to other suggestions you've seen, you should look into whether an OLAP solution makes sense for you.

Scalability of Using MySQL as a Key/Value Database

I am interested to know the performance impacts of using MySQL as a key-value database vs. say Redis/MongoDB/CouchDB. I have used both Redis and CouchDB in the past so I'm very familiar with their use cases, and know that it's better to store key/value pairs in say NoSQL vs. MySQL.
But here's the situation:
the bulk of our applications already have lots of MySQL tables
We host everything on Heroku (which only has MongoDB and MySQL, and is basically 1-db-type per app)
we don't want to be using multiple different databases in this case.
So basically, I'm looking for some info on the scalability of having a key/value table in MySQL. Maybe at three different arbitrary tiers:
1000 writes per day
1000 writes per hour
1000 writes per second
1000 reads per hour
1000 reads per second
A practical example is in building something like MixPanel's Real-time Web Analytics Tracker, which would require writing very often depending on traffic.
Wordpress and other popular software use this all the time: Post has "Meta" model which is just key/value, so you can add arbitrary properties to an object which can be searched over.
Another option is to store a serializable hash in a blob but that seems worse.
What is your take?
I'd say that you'll have to run your own benchmark because it is only you that knows the following important aspects:
the size of the data to be stored in this KV table
the level of parallelism you want to achieve
the number of existing queries reaching your MySQL instance
I'd also say that depending on the durability requirements for this data, you'll also want to test multiple engines: InnoDB, MyISAM.
While I do expect some NoSQL solutions to be faster, based on your constraints you may find out that MySQL will perform well enough for your requirements.
SQL databases are more and more used as a persistence layer, with computations and delivery cached in Key-Value repositories.
With this in mind, those guys have done quite a test here:
InnoDB inserts 43,000 records per second AT ITS PEAK*;
TokuDB inserts 34,000 records per second AT ITS PEAK*;
This KV inserts 100 million records per second (2,000+ times more).
To answer your question, a Key-Value repository is more than likely to outdo MySQL by several orders of magnitude:
Processing 100,000,000 items:
kv_add()....time:....978.32 ms
kv_get().....time:....297.07 ms
kv_free()....time:........0.00 ms
OK, your test was 1,000 ops per second, but it can't hurt to be able to do 1,000 times more!
See this for further details (they also compare it with Tokyo Cabinet).
There is no doubt that using a NOSQL solution is going to be faster, since it is simpler.
NOSQL and Relational do not compete with each other, they are different tools that can solve different problems.
That being said for 1000 writes/day or per hour, MySQL will have no problem.
For 1000 per second you will need some fancy hardware to get there. For the NOSQL solution you will probably still need some distributed file system.
It also depends on what you are storing.
Check out the series of blog posts here where the author runs tests comparing MongoDB and MySQL performance, and fights through the MySQL performance tuning mess. MongoDB was doing ~100K row reads per second, MySQL in c/s mode was doing 43K max, but with the embedded library he managed to get it up to 172K row reads per second.
It sounds a little complicated to get that high on a single node, so ymmv.
The writes/second question is a little harder, but this still might give you some ideas on configs to try.
You should first implement it in the simplest way then compare that. Always test things. This means:
Create a schema that's representative of your use case.
Create queries representative of your use case.
Create significant amounts of dummy data representative of your use case.
In a variety of loops, including both random access and sequential, benchmark it.
Ensure you use concurrency (run many processes randomly hammering the server with all kinds of queries representative of your use cases).
Once you have that, measure, test. There are different ways you can go about it. Some tests can be simple but might be less realistic. Measure throughput and latency.
Then try to optimise it.
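The steps above can be sketched as a tiny harness. This is a minimal, single-threaded sketch that uses an in-memory SQLite table as a stand-in for a MySQL key/value table; the kv schema, key format, value size, and operation counts are all assumptions, and a real benchmark would run against your actual server with several concurrent processes, as the list says.

```python
import random
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical key/value schema standing in for the real table.
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

def bench(label, operation, keys):
    """Run operation(key) for every key; report throughput and p99 latency."""
    latencies = []
    start = time.perf_counter()
    for key in keys:
        t0 = time.perf_counter()
        operation(key)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    p99 = latencies[int(len(latencies) * 0.99)]
    return {
        "label": label,
        "ops_per_sec": len(keys) / elapsed,
        "p99_ms": p99 * 1000,
    }

def write(key):
    conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, "x" * 64))

def read(key):
    return conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()

keys = [f"key-{i}" for i in range(5000)]
results = [bench("sequential write", write, keys)]
conn.commit()

shuffled = random.sample(keys, len(keys))  # random access order
results.append(bench("random read", read, shuffled))

for r in results:
    print(f"{r['label']}: {r['ops_per_sec']:.0f} ops/s, p99 {r['p99_ms']:.3f} ms")
```

The absolute numbers from an in-memory SQLite run say nothing about MySQL; only the structure (representative schema, representative operations, throughput and latency measured together) carries over.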
MySQL has one particular limitation for KV: the standard engines with persistence use indexes optimised for range lookups, not for KV, which might introduce some overhead. It is also difficult to make structures such as hash indexes work with persistent storage because of rehashing. Memory tables do support a hash index.
Many people associate certain things with being slow such as SQL, RELATIONAL, JOINS, ACID, etc.
When using an ACID capable relational database, you don't have to necessarily use ACID or relations.
While joins have a bad reputation for being slow, this is usually down to misconceptions about joins. Often people simply write bad queries. This is made more difficult because SQL is declarative: the planner can get things wrong, especially with JOINs, where there are often multiple ways to perform the join. What people are actually getting out of NoSQL in this case is an imperative interface. "NoDeclarative" would be a more accurate name, as that's the problem with SQL a lot of people are having. Quite often people simply lack indexes. That's not an argument in favour of joins but rather an illustration of where people can get it wrong on speed.
Traditional databases can be extremely fast if you do certain special things for that such as ignoring data integrity or handling it elsewhere. You don't have to wait for the harddrive to flush writes, you don't have to enforce relations, you don't have to enforce unique constraints, you don't have to use transactions but if you do replace safety with speed then you need to know what you're doing.
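As one illustration of trading safety for speed, the sketch below turns off waiting for disk flushes and batches all writes into a single transaction. It uses SQLite's PRAGMA synchronous through Python's sqlite3 module as a stand-in; other databases expose their own durability settings under different names. The file name and log table are assumptions for illustration.

```python
import os
import sqlite3
import tempfile

# A throwaway on-disk database, so that flush behaviour is actually relevant.
path = os.path.join(tempfile.mkdtemp(), "fast_but_unsafe.db")
conn = sqlite3.connect(path)

# Do not wait for the operating system to flush writes to disk.
# A power loss or OS crash at the wrong moment can now corrupt the file.
conn.execute("PRAGMA synchronous = OFF")

# Hypothetical table, invented for this illustration.
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, payload TEXT)")

# One transaction for the whole batch instead of one commit per row.
with conn:
    conn.executemany(
        "INSERT INTO log (payload) VALUES (?)",
        [(f"row {i}",) for i in range(10_000)],
    )

sync_mode = conn.execute("PRAGMA synchronous").fetchone()[0]  # 0 means OFF
row_count = conn.execute("SELECT COUNT(*) FROM log").fetchone()[0]
print(sync_mode, row_count)
```

This is only reasonable when the data can be rebuilt from elsewhere or integrity is handled by another layer, which is exactly the "know what you're doing" caveat above.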
NoSQL solutions by comparison first and foremost tend to be designed to support various modes of scaling out of the box. The performance of an individual node might not be quite what you expect. NoSQL solutions also struggle for general use with many having quite unusual performance characteristics or limited feature sets.

What are "SQL-Hints"?

I am an advocate of ORM-solutions and from time to time I am giving a workshop about Hibernate.
When talking about framework-generated SQL, people usually start talking about how they need to be able to use "hints", and this is supposedly not possible with ORM frameworks.
Usually something like: "We tried Hibernate. It looked promising in the beginning, but when we let it loose on our very very complex production database it broke down because we were not able to apply hints!".
But when asked for a concrete example, the memory of those people is suddenly not so clear any more ...
I usually feel intimidated, because the whole "hints"-topic sounds like voodoo to me...
So can anybody enlighten me? What is meant by SQL-hints or DB-Hints?
The only thing I know, that is somehow "hint-like" is SELECT ... FOR UPDATE. But this is supported by the Hibernate-API...
A SQL statement, especially a complex one, can actually be executed by the DB engine in any number of different ways (which table in the join to read first, which index to use based on many different parameters, etc).
An experienced dba can use hints to encourage the DB engine to choose a particular method when it generates its execution plan. You would only normally need to do this after extensive testing and analysis of the specific queries (because the DB engines are usually pretty darn good at figuring out the optimum execution plan).
Some MSSQL-specific discussion and syntax here:
Edit: some additional examples at http://geeks.netindonesia.net/blogs/kasim.wirama/archive/2007/12/31/sql-server-2005-query-hints.aspx
Query hints are used to guide the query optimiser when it doesn't produce sensible query plans by default. First, a small background in query optimisers:
Database programming is different from pretty much all other software development because it has a mechanical component. Disk seeks and rotational latency (waiting for a particular sector to arrive under the disk head) are very expensive in comparison to CPU. Different query resolution strategies will result in different amounts of I/O, often radically different amounts. Getting this right or wrong can make a major difference to the performance of the query. For an overview of query optimisation, see This paper.
SQL is declarative - you specify the logic of the query and let the DBMS figure out how to resolve it. A modern cost-based query optimiser (some systems, such as Oracle also have a legacy query optimiser retained for backward compatibility) will run a series of transformations on the query. These maintain semantic equivalence but differ in the order and choice of operations. Based on statistics collected on the tables (sizes, distribution histograms of keys) the optimiser computes an estimate of the amount of work needed for each query plan. It selects the most efficient plan.
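To make "statistics collected on the tables" less abstract, here is a small sketch that gathers and inspects them in SQLite, used here only as a convenient stand-in. SQLite's ANALYZE command stores per-index statistics in the sqlite_stat1 table; other systems keep richer statistics (such as histograms) in their own catalogs. The users table and its data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, invented for this illustration.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT)")
conn.execute("CREATE INDEX idx_users_country ON users (country)")
conn.executemany(
    "INSERT INTO users (country) VALUES (?)",
    [(("DE", "FR", "US", "JP")[i % 4],) for i in range(1000)],
)

# Collect statistics; the planner consults them when estimating plan costs.
conn.execute("ANALYZE")

# Each row describes one index: table name, index name, and a stat string
# holding the row count and the average number of rows per distinct key.
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
for tbl, idx, stat in stats:
    print(tbl, idx, stat)
```

If the data changes a lot after statistics were gathered, the estimates drift from reality, which is one common reason a cost-based optimiser picks a poor plan.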
Cost-based optimisation is heuristic, and is dependent on accurate statistics. As query complexity goes up the heuristics can produce incorrect plans, which can potentially be wildly inefficient.
Query hints can be used in this situation to force certain strategies in the query plan, such as a type of join. For example, on a query that usually returns very small result sets you may wish to force a nested loops join. You may also wish to force a certain join order of tables.
An O/R mapper (or any tool that generates SQL) generates its own queries, which will typically not have hinting information. In the case that such a query runs inefficiently you have limited options, some of which are:
Examine the indexing on the tables. Possibly you can add an index. Some systems (recent versions of Oracle for example) allow you to index joins across more than one table.
Some database management systems (again, Oracle comes to mind) allow you to manually associate a query plan with a specific query string. Query plans are cached by a hash value of the query. If the queries are parameterised, the base query string is constant and will resolve to the same hash value.
As a last resort, you can modify the database schema, but this is only possible if you control the application.
If you control the SQL you can hint queries. In practice it's fairly uncommon to actually need to do this. A more common failure mode on O/R mappers with complex database schemas is they can make it difficult to express complex query predicates or do complex operations over large bodies of data.
I tend to advocate using the O/R mapper for the 98% of work that it's suited for and dropping to stored procedures where they are the appropriate solution. If you really need to hint a query then this might be the appropriate strategy. Unless there is something unusual about your application (for example some sort of DSS) you should only need to escape from the O/R mapper in a minority of situations. You might also find (again, an example would be DSS tools working with the data in aggregate) that an O/R mapper is not really the appropriate strategy for the application.
While HINTS do as the other answers describe, you should only use them in rare, researched circumstances. 9 times out of 10 a HINT will result in a poor query plan. Unless you really know what you are doing, don't use them.
There is no such thing as "optimized SQL code", because SQL code is never executed.
SQL code is translated into an execution plan by the Optimizer. The Optimizer will use the information it has to choose (among other things):
the order in which tables are involved
the join method for each involved table (nested/merge/hash)
how to access a table's data (direct table access/ index with bookmark lookup/direct index access) (scan/seek)
should parallelism be used, and when to end parallelism (gather streams)
Query hints allow a programmer to over-ride (in most cases) or suggest politely (in other cases) the optimizer's choices.
Query hints can let you force off parallelism, force all joins to be implemented as nested loop, force one index to be used over another... as a few examples.
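Hint syntax is vendor specific, so the following is only an analogy: SQLite has no general hint mechanism, but its INDEXED BY and NOT INDEXED clauses are hint-like enough to show the effect of forcing one index or forbidding all of them. The sketch uses Python's sqlite3 module; the orders table and index names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with two candidate indexes.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c TEXT)"
)
conn.execute("CREATE INDEX idx_a ON orders (a)")
conn.execute("CREATE INDEX idx_b ON orders (b)")
conn.executemany(
    "INSERT INTO orders (a, b, c) VALUES (?, ?, ?)",
    [(i % 10, i % 500, "x") for i in range(5000)],
)

def plan(sql):
    """Return the plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Let the optimizer choose on its own.
plan_default = plan("SELECT c FROM orders WHERE a = 1 AND b = 2")

# Force the use of idx_a, even if the optimizer would prefer another index.
plan_force_a = plan("SELECT c FROM orders INDEXED BY idx_a WHERE a = 1 AND b = 2")

# Forbid index use entirely, which leaves a full table scan.
plan_no_index = plan("SELECT c FROM orders NOT INDEXED WHERE a = 1 AND b = 2")

print("default:   ", plan_default)
print("forced a:  ", plan_force_a)
print("no index:  ", plan_no_index)
```

Comparing the forced plans with the default one is also a cheap way to check whether overriding the optimizer would help at all before committing a hint to production code.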
Since the optimizer is really good, if one over-rides the optimizer, one is generally asking for a non-optimal plan. Query hints are best served when the optimizer does not have the required information to make a good choice.
One place I use query hints is for table variables. Table variables are assumed by the Optimizer to hold almost no rows (the estimate is typically a single row), and so the Optimizer joins table variables using nested loop (the best join implementation for small numbers of rows). If I have a large table variable, already ordered in a favorable way for merge join, I can specify that a merge join be used by applying a query hint.
All modern RDBMSes have some sort of query optimizer that calculates the best query plan, which is the sequence of read/write operations needed to execute a SQL query.
Sometimes plans can be suboptimal, so RDBMS designers included "hints" in SQL. Hints are instructions you can embed in your SQL that affect the query optimizer. With hints you can instruct the query optimizer on, e.g., which indexes it should use, in what order data should be read from tables, ...
So, with hints you can resolve some bottlenecks that the query optimizer cannot solve by itself.
For example, here is list of Oracle hints.