Should I move to NoSQL? (big data) - sql

I'm currently researching a very large table (~100 million rows, 35 columns), it's currently stored in SQL db, but the queries I'm running (and they're various) run very, very slow..
so I get it I should probably move to NoSQL db. question is:
How can I tell which (NoSQL) db is best for me?
How can I move my current SQL table to the new NoSQL scheme?
OR should I stay in SQL and just fine tune it?
A few more details: rows will not be added/removed, this is historical data and all of the analysis will be done on that table. plan to run various queries on it. data is numerical.

I routinely work with a SQL Server 2012 table that has 900 million rows. This table has rows being added to it about every 2 minutes with a total of about 200K per day. I can query this table and get rows back in a couple seconds (using the clustered index / PK). I can also query on one of the other indexes and get results back in seconds or less.
So, it's all a matter of making sure your indexes are set up correctly, AND BEING USED!! Check your queries against the query plan being generated and make sure seeks are being done.
There could be good reasons for moving to NoSQL, or something similar. But moving to NoSQL because you think you can't get good performance in SQL Server, before making sure you've done everything you can do to improve performance first, is not a good reason.

Some food for thought:
100M rows is well within SQL's "sweet spot". You can grow by x10 and still be assured that SQL will be able to support you with fairly trivial effort.
NoSQL is not a silver bullet for solving performance problems at scale. It offers a set of tradeoffs which, with careful planning, can provide better results. But if sounds like you don't fully understand your performance issues in SQL, and without that your chances of making the correct design decisions in a NoSQL environment are slim.
One of the common tradeoffs in NoSQL systems is that they typically provide less flexibilty in querying, in return for greater flexibility in schema management. You mentioned your queries are "various"- if they are truly varied, or more importantly- frequently changing - then moving to a NoSQL system can put you in a world of pain. Especially if you are not familiar with the technology yet.
Bottom line- You aren't doing anything which is clearly "beyond" the capabilities of SQL, and your problems are probably caused more by inefficient implementation than by any inherent platform limitations. Moving to a NoSQL system won't magically solve any of your problems, and will probably introduce new ones.

If you are running a query on columns that are not indexed you will be very slow. You can add more indexes to speed them up. If your DB is static this should work.
One major speed up is the usage of map-reduce queries, where aggregations are carried out by multiple processes or computers. NoSQL databases like MongoDB can be used in such ways. But even MySQL has Cluster capabilities nowadays: http://www.mysql.de/products/cluster/scalability.html. SQL Server can be clustered as well.
So I guess the best first shot would be to optimize your indexes in the table to the query. Each argument column to the query (compare, count ...) etc. should be indexed.
If this is not doing any better you probably count and calculate a lot and you should use map-reduce jobs and a DB which can handle this like MongoDB: http://docs.mongodb.org/manual/aggregation/
I hope this helps

Related

Is a Data-filled SQL table queryable while setting up a new index?

Given a live table in SQL with some non-trivial number of columns/entries, with one or more applications actively querying it, what would be the effect of introducing a new index on some column of this table? What takes priority? Serving the query, or constructing the index? Put another way, would setting up the index be experienced by the querying applications as a delay in getting their responses?
It is possible to use the database while indexing is taking place, but it's effects on performance is nearly impossible for us to say. A great deal about the optimizer is magic to anyone who hasn't worked on it themselves, and the answer could change greatly depending on which RDMS you're using. On top of that, your own hardware will play a huge part in the answer.
That being said, if you're primarily reading from the table, there's a good chance you won't see a major performance hit, if your system has the IO/CPU capabilities of handling both tasks at the same time. Inserting however, will be slowed down considerably.
Whether this impact is problematic will depend on your current system load, size of your tables, and what exactly it is you're indexing. Generally speaking, if you have a decent server, a lowish load, and a table with only a few million rows or less, I wouldn't expect to see a performance hit at all.

SQL Database Query Performance Chart on Large Datasets?

What are the best docs/articles out there that show how different queries perform on large data sets? I'm trying to get a feel approaches are better than others in building a dashboard where the number and type of queries easily becomes large and complex, and slow.
Suggest you take the time to get a book on performance tuning for the database backend you are using. Performance tuning is very database specific and techiniques that are fast on one database may be dog slow on another (Cursors on Oracle and SQL Server comes to mind). If you are going to be writing complex queries, you need to understand performance tuning in depth before trying to write them. I will give you a hint - joins are usually faster than correlated subqueries whcih developers often use becasue they seem more straightforward.
It is not easy to compare the actual performance of one database architecture over another, I think the most common you'll find are benchmarks, which will vary, or theoretical proposals.
It's certainly harder to compare SQL vs NoSQL because you need a lot of data in the first place to have NoSQL be beneficial.
Minimizing joins, indexing properly, and having enough memory/cpu availability are going to be the three dominating factors.

Scalability of Using MySQL as a Key/Value Database

I am interested to know the performance impacts of using MySQL as a key-value database vs. say Redis/MongoDB/CouchDB. I have used both Redis and CouchDB in the past so I'm very familiar with their use cases, and know that it's better to store key/value pairs in say NoSQL vs. MySQL.
But here's the situation:
the bulk of our applications already have lots of MySQL tables
We host everything on Heroku (which only has MongoDB and MySQL, and is basically 1-db-type per app)
we don't want to be using multiple different databases in this case.
So basically, I'm looking for some info on the scalability of having a key/value table in MySQL. Maybe at three different arbitrary tiers:
1000 writes per day
1000 writes per hour
1000 writes per second
1000 reads per hour
1000 reads per second
A practical example is in building something like MixPanel's Real-time Web Analytics Tracker, which would require writing very often depending on traffic.
Wordpress and other popular software use this all the time: Post has "Meta" model which is just key/value, so you can add arbitrary properties to an object which can be searched over.
Another option is to store a serializable hash in a blob but that seems worse.
What is your take?
I'd say that you'll have to run your own benchmark because it is only you that knows the following important aspects:
the size of the data to be stored in this KV table
the level of parallelism you want to achieve
the number of existing queries reaching your MySQL instance
I'd also say that depending on the durability requirements for this data, you'll also want to test multiple engines: InnoDB, MyISAM.
While I do expect some NoSQL solutions to be faster, based on your constraints you may find out that MySQL will perform good enough for your requirements.
SQL databases are more and more used as a persistance layer, with computations and delivery cached in Key-Value repositories.
With this in mind, those guys have done quite a test here:
InnoDB inserts 43,000 records per second AT ITS PEAK*;
TokuDB inserts 34,000 records per second AT ITS PEAK*;
This KV inserts 100 millions of records per second (2,000+ times more).
To answer your question, a Key-Value repository is more than likely to outdo MySQL by several orders of magnitude:
Processing 100,000,000 items:
kv_add()....time:....978.32 ms
kv_get().....time:....297.07 ms
kv_free()....time:........0.00 ms
OK, your test was 1,000 ops per second, but it can't hurt to be able to do 1,000 times more!
See this for further details (they also compare it with Tokyo Cabinet).
There is no doubt that using a NOSQL solution is going to be faster, since it is simpler.
NOSQL and Relational do not compete with each other, they are different tools that can solve different problems.
That being said for 1000 writes/day or per hour, MySQL will have no problem.
For 1000 per second you will need some fancy hardware to get there. For the NOSQL solution you will probably still need some distributed file system.
It also depends on what you are storing.
Check out the series of blog posts here where the author runs tests comparing MongoDB and MySQL performance, and fights through the MySQL performance tuning mess. MongoDB was doing ~100K row reads per second, MySQL in c/s mode was doing 43K max, but with the embedded library he managed to get it up to 172K row reads per second.
It sounds a little complicated to get that high on a single node, so ymmv.
The writes/second question is a little harder, but this still might give you some ideas on configs to try.
You should first implement it in the simplest way then compare that. Always test things. This means:
Create a schema that's representative of your use case.
Create queries representative of your use case.
Create significant amounts of dummy data representive of your use case.
In a variety of loops, including both random access and sequential, bench mark it.
Ensure you use concurrency (run many processes randomly hammering the server with all kinds of queries representative of your use cases).
Once you have that, measure, test. There are different ways you can go about it. Some tests can be simple but might be less realistic. Measure throughput and latency.
Then try to optimise it.
MySQL has one particular limitation for KV which is the standard Engines with persistence use indexes optimised for range lookups, not for KV, which might introduce some overhead, though it's also difficult to have things such as hash work with persistent storage due to rehashing. Memory tables support a hash index.
Many people associate certain things with being slow such as SQL, RELATIONAL, JOINS, ACID, etc.
When using an ACID capable relational database, you don't have to necessarily use ACID or relations.
While joins have a bad reputation for being slow this is usually down to misconceptions about joins. Often people simply write bad queries. This is made more difficult as SQL is declarative, it can get things wrong, especially with JOINs where there are often multiple ways to perform the join. What people are actually getting out of NoSQL in this case is imperative. NoDeclaritive would be more accurate as that's the problem with SQL a lot of people are having. Quite often people simply lack indexes. That's not an argument in favour of joins but rather to illuminate where people can get it wrong on speed.
Traditional databases can be extremely fast if you do certain special things for that such as ignoring data integrity or handling it elsewhere. You don't have to wait for the harddrive to flush writes, you don't have to enforce relations, you don't have to enforce unique constraints, you don't have to use transactions but if you do replace safety with speed then you need to know what you're doing.
NoSQL solutions by comparison first and foremost tend to be designed to support various modes of scaling out of the box. The performance of an individual node might not be quite what you expect. NoSQL solutions also struggle for general use with many having quite unusual performance characteristics or limited feature sets.

Practical size limitations for RDBMS

I am working on a project that must store very large datasets and associated reference data. I have never come across a project that required tables quite this large. I have proved that at least one development environment cannot cope at the database tier with the processing required by the complex queries against views that the application layer generates (views with multiple inner and outer joins, grouping, summing and averaging against tables with 90 million rows).
The RDBMS that I have tested against is DB2 on AIX. The dev environment that failed was loaded with 1/20th of the volume that will be processed in production. I am assured that the production hardware is superior to the dev and staging hardware but I just don't believe that it will cope with the sheer volume of data and complexity of queries.
Before the dev environment failed, it was taking in excess of 5 minutes to return a small dataset (several hundred rows) that was produced by a complex query (many joins, lots of grouping, summing and averaging) against the large tables.
My gut feeling is that the db architecture must change so that the aggregations currently provided by the views are performed as part of an off-peak batch process.
Now for my question. I am assured by people who claim to have experience of this sort of thing (which I do not) that my fears are unfounded. Are they? Can a modern RDBMS (SQL Server 2008, Oracle, DB2) cope with the volume and complexity I have described (given an appropriate amount of hardware) or are we in the realm of technologies like Google's BigTable?
I'm hoping for answers from folks who have actually had to work with this sort of volume at a non-theoretical level.
The nature of the data is financial transactions (dates, amounts, geographical locations, businesses) so almost all data types are represented. All the reference data is normalised, hence the multiple joins.
I work with a few SQL Server 2008 databases containing tables with rows numbering in the billions. The only real problems we ran into were those of disk space, backup times, etc. Queries were (and still are) always fast, generally in the < 1 sec range, never more than 15-30 secs even with heavy joins, aggregations and so on.
Relational database systems can definitely handle this kind of load, and if one server or disk starts to strain then most high-end databases have partitioning solutions.
You haven't mentioned anything in your question about how the data is indexed, and 9 times out of 10, when I hear complaints about SQL performance, inadequate/nonexistent indexing turns out to be the problem.
The very first thing you should always be doing when you see a slow query is pull up the execution plan. If you see any full index/table scans, row lookups, etc., that indicates inadequate indexing for your query, or a query that's written so as to be unable to take advantage of covering indexes. Inefficient joins (mainly nested loops) tend to be the second most common culprit and it's often possible to fix that with a query rewrite. But without being able to see the plan, this is all just speculation.
So the basic answer to your question is yes, relational database systems are completely capable of handling this scale, but if you want something more detailed/helpful then you might want to post an example schema / test script, or at least an execution plan for us to look over.
90 million rows should be about 90GB, thus your bottleneck is disk.
If you need these queries rarely, run them as is.
If you need these queries often, you have to split your data and precompute your gouping summing and averaging on the part of your data that doesn't change (or didn't change since last time).
For example if you process historical data for the last N years up to and including today, you could process it one month (or week, day) at a time and store the totals and averages somewhere. Then at query time you only need to reprocess period that includes today.
Some RDBMS give you some control over when views are updated (at select, at source change, offline), if your complicated grouping summing and averaging is in fact simple enough for the database to understand correctly, it could, in theory, update a few rows in the view at every insert/update/delete in your source tables in reasonable time.
It looks like you're calculating the same data over and over again from normalized data. One way to speed up processing in cases like this is to keep SQL with it's nice reporting and relationships and consistency and such, and use a OLAP Cube which is calculated every x amount of minutes. Basically you build a big table of denormalized data on a regular basis which allows quick lookups. The relational data is treated as the master, but the Cube allows quick precalcuated values to be retrieved from the database at any one point.
If that is only 1/20 of your data, you almost surely need to look into more scalable and efficient solutions, such as Google's Big Table. Have a look at NoSQL
I personally think that MongoDB is an awesome inbetween of NoSQL and RDMS. It isn't relational, but it provides a lot more features than a simple document store.
In dimensional (Kimball methodology) models in our data warehouse on SQL Server 2005, we regularly have fact tables with that many rows just in a single month partition.
Some things are instant and some things take a while, it depends on the operation and how many stars are being combined and what's going on.
The same models perform poorly on Teradata, but it is my understanding that if we re-model in 3NF, Teradata parallelization will work a lot better. The Teradata installation is many times more expensive than the SQL Server installation, so it just goes to show how much of a difference modeling and matching your data and processes to the underlying feature set matters.
Without knowing more about your data, and how it's currently modeled and what indexing choices you've made it's hard to say anything more.

What is the best way to partition large tables in SQL Server?

In a recent project the "lead" developer designed a database schema where "larger" tables would be split across two separate databases with a view on the main database which would union the two separate database-tables together. The main database is what the application was driven off of so these tables looked and felt like ordinary tables (except some quirky things around updating). This seemed like a HUGE performance problem. We do see problems with performance around these tables but nothing to make him change his mind about his design. Just wondering what is the best way to do this, or if it is even worth doing?
I don't think that you are really going to gain anything by partitioning the table across multiple databases in a single server. All you have essentially done there is increased the overhead in working with the "table" in the first place by having several instances (i.e. open in two different DBs) of it under a single SQL Server instance.
How large of a dataset do you have? I have a client with a 6 million row table in SQL Server that contains 2 years worth of sales data. They use it transactionally and for reporting without any noticiable speed problems.
Tuning the indexes and choosing the correct clustered index is crucial to performance of course.
If your dataset is really large and you are looking to partition, you will get more bang for your buck partitioning the table across physical servers.
Partitioning is not something to be undertaken lightly as there can be many subtle performance implications.
My first question is are you referring simply to placing larger table objects in separate filegroups (on separate spindles) or are you referring to data partitioning inside of a table object?
I suspect that the situation described is an attempt to have the physical storage of certain large tables on different spindles from the rest of the tables. In this case, adding the extra overhead of separate databases, losing any ability to enforce referential integrity across databases, and the security implications of enabling cross-database ownership chaining does not provide any benefit over using multiple filegroups within a single database. If, as is quite possible, the separate databases you refer to in your question are not even stored on separate spindles but are all stored on the same spindle then you negate even the slight performance benefit you could have gained by physically separating your disk activity and have received absolutely no benefit.
I would suggest instead of using additional databases to hold large tables you look into the Filegroup topic in SQL Server Books Online or for a quick review see this article:
If you are interested in data partitioning (including partitioning into multiple file groups) then I recommend reading articles by Kimberly Tripp, who gave an excellent presentation at the time SQL Server 2005 came out about the improvements available there. A good place to start is this whitepaper
Which version of SQL Server are you using? SQL Server 2005 has partitioned tables, but in 2000 (or 7.0) you needed to use partition views.
Also, what was the reasoning for putting the table partitions in a separate database?
When I've had to partition tables in the past (pre-2005), it's usually by a date column or something similar, with a view over the various partitions. Books Online has a section that talks about how to do this and all of the rules around it. You need to follow the rules to make it work how it's supposed to work.
The key thing to remember is that your partitioning column must be part of the primary key and you want to try to always use that column in any access against the table so that the optimizer can ignore partitions that shouldn't be affected by the query.
Look up "partitioned table" in MSDN and you should be able to find a more complete tutorial for SQL Server 2005 partitioned tables as well as advice on how to set them up for maximum performance.
Are you asking about best practices in terms of database design, or convincing your lead to change his mind? :)
In terms of design... Back in the goode olde days, vertical partitioning was sometimes needed to work around database engine limitations, where the number of columns in a table was a hard limit, like 255 columns. These days the main benefits are purely for performance: putting rarely used columns, or blobs on a separate disk array. But if you're regularly pulling things from both tables it will likely be a loss. It sounds like your lead is suffering from a case of premature optimisation.
In terms of telling your lead is wrong... that requires diplomacy. If he's aware of mutterings of discontent in terms of performance, a benchmark is probably the best way to show the difference.
Create a new physical table somewhere with 'create table t1 as select * from view1' and then run some lengthy batch with the vertically partitioned table and your new table. If it's as bad as you say, the difference should be evident.
But this too may be premature optimisation. Find out what the end-users think of the performance. If the performance is good enough, for some definition of good, then don't fix what ain't broke.
There is a definite benefit for table partitioning (regardless whether it's on same or different filegroups /disks). If the partition column is correctly selected, you'll realize that your queries will hit only the required partition. So imagine if you have 100 million records (I've partitioned tables much bigger than that - about 20+ Billion rows) and if for the most part, more than 70% of your data access is only a certain category or timeline or type of data then it helps to keep the most accessed data in a separate partition. Plus you can align the partition with separate file groups with various type of disks (SATA, Fiber channel, SSDs) so that the most accessed/busy data are on the fastest storage and the least/rarely accessed are virtually on slower disks.
Although, in SQL Server there's limited partitioning ability, unlike Oracle. You can choose only one column for partitioning (even in SQL 2008). So you've to choose a column wisely where that column also is part of most of your frequent queries. For the most part, people find it easy to choose to partition by a date column. However although it seems logical to partition that way, if your queries do not have that column as part of the condition, you won't be gaining sufficient benefits from partitioning (in other words, your query will hit all the partition regardless).
It's much easier to partition for data warehouse/data mining type databases than OLTP as most DW database queries are limited by time period.
That's why these days due to the volume of data being handled by databases, it's wise to design the application in such a way that ever query is limited by some broader group such as time, geographical location or such so that when such columns are chosen for partitioning you'll gain maximum benefits.
I would disagree with the assumption that nothing can be gained by partitioning.
If the partition data is physically and logically aligned, then the potential IO of queries should be dramatically reduced.
For example, We have a table which has the batch field as an INT representing an INT.
If we partition the data by this field and then re-run a query for a particular batch, we should be able to run set statistics io ON before and after partitioning and see a reduction in IO,
If we have a million rows per partition and each partition is written to a separate device. The query should be able to eliminate the nonessential partitions.
I've not done a lot of partitioning on SQL Server, but I do have experience of partitioning on Sybase ASE, and this is known as partition eliminiation. When I have time I'm going to test out the scenario on a SQL Server 2005 machine.