One of our customer is in manufacturing domain. He has multiple factories across the country. For the quality control, he is using window application deployed independently in all factories (approx 100 in count). Our customer is interested in replacing all the window applications with a single web application. Now the problem is volume of the data will be 100 times bigger and same as the velocity (in case we keep a single database for all the factories). There are lots of reporting use cases in this domain. Looking at the numbers, it looks like SQL will be not be able to handle this much load.
Is it a valid use case to move to NoSQL database?
Can Volume/Velocity alone be a deciding factor to move to NoSQL?
Would we be able to get all those reporting from NoSQL database efficiently?
Any kind of help will be appreciated.
Thanks In Advance
This is a usefull discussion.
In my opinion a well designed MS-SQL server 2012 (or Oracle server, but no experience for me) must be capable of handling 1000 complex transactions per second.
MS-SQL server 2014 with in-memory processing raises even higher expecations.
Consider multi processor, large memories, table partitioning, file mapping, multiple access paths to the SAN or to separate server discs. Use well designed transactions (consider to remove most indexes on transaction tables).
As an extra benefit you keep all functionality of the SQL server. In my opinion most NOSQL solutions are NOSQL because they are deprived of essential SQL functionality.
Switch to NOSQL databases is most usefull when you require functionality outside the transaction domain, e.g. document indexing or network indexing.
Related
I've been reading through the excellent Developing Multi-tenant Applications for the Cloud, 3rd Edition and am trying to understand how partitioning impacts on query performance in Windows Azure SQL Database.
If you use the shared schema approach and put all tenants records in a single table and separate their data using partitions, is that always going to be slower than a separate schema approach due to the larger number of records in the table, or does the partitioning effectively make each partition act like its own table?
(I appreciate query execution speed is only one of many factors to consider when choosing a multi tenancy strategy, we're not basing our decisions on performance alone.)
The approach that uses different schemas for different tenants has its problems, too. For instance, it is very easy to bloat the plan cache with this approach since each tenant gets its own set of query plans. You may end up with more recompiles (and lower performance) with this approach because of that.
I would recommend to take a look at an approach where you place each tenant in its own database. That provides for great isolation and, in combination with Elastic Database Pools in Azure SQL DB, it actually becomes quite affordable. A good entry point into the documentation is this: https://azure.microsoft.com/en-us/documentation/articles/sql-database-elastic-scale-introduction/.
So I've been trying hard to figure out if NoSQL is really bringing that much value outside of auto-sharding and handling UNSTRUCTURED data.
Assuming I can fit my STRUCTURED data on a single machine OR have an effective 'auto-sharding' feature for SQL, what advantages do any NoSQL options offer? I've determined the following:
Document-based (MongoDB, Couchbase, etc) - Outside of it's 'auto-sharding' capabilities, I'm having a hard time understanding where the benefit is. Linked objects are quite similar to SQL joins, while Embedded objects significantly bloat doc size and causes a challenge regarding to replication (a comment could belong to both a post AND a user, and therefore the data would be redundant). Also, loss of ACID and transactions are a big disadvantage.
Key-value based (Redis, Memcached, etc) - Serves a different use case, ideal for caching but not complex queries
Columnar (Cassandra, HBase, etc ) - Seems that the big advantage here is more how the data is stored on disk, and mostly useful for aggregations rather than general use
Graph (Neo4j, OrientDB, etc) - The most intriguing, the use of both edges and nodes makes for an interesting value-proposition, but mostly useful for highly complex relational data rather than general use.
I can see the advantages of Key-value, Columnar and Graph DBs for specific use cases (Caching, social network relationship mapping, aggregations), but can't see any reason to use something like MongoDB for STRUCTURED data outside of it's 'auto-sharding' capabilities.
If SQL has a similar 'auto-sharding' ability, would SQL be a no-brainer for structured data? Seems to me it would be, but I would like the communities opinion...
NOTE: This is in regards to a typical CRUD application like a Social Network, E-Commerce site, CMS etc.
If you're starting off on a single server, then many advantages of NoSQL go out the window. The biggest advantages to the most popular NoSQL are high availability with less down time. Eventual consistency requirements can lead to performance improvements as well. It really depends on your needs.
Document-based - If your data fits well into a handful of small buckets of data, then a document oriented database. For example, on a classifieds site we have Users, Accounts and Listings as the core data. The bulk of search and display operations are against the Listings alone. With the legacy database we have to do nearly 40 join operations to get the data for a single listing. With NoSQL it's a single query. With NoSQL we can also create indexes against nested data, again with results queried without Joins. In this case, we're actually mirroring data from SQL to MongoDB for purposes of search and display (there are other reasons), with a longer-term migration strategy being worked on now. ElasticSearch, RethinkDB and others are great databases as well. RethinkDB actually takes a very conservative approach to the data, and ElasticSearch's out of the box indexing is second to none.
Key-value store - Caching is an excellent use case here, when you are running a medium to high volume website where data is mostly read, a good caching strategy alone can get you 4-5 times the users handled by a single server. Key-value stores (RocksDB, LevelDB, Redis, etc) are also very good options for Graph data, as individual mapping can be held with subject-predicate-target values which can be very fast for graphing options over the top.
Columnar - Cassandra in particular can be used to distribute significant amounts of load for even single-value lookups. Cassandra's scaling is very linear to the number of servers in use. Great for heavy read and write scenarios. I find this less valuable for live searches, but very good when you have a VERY high load and need to distribute. It takes a lot more planning, and may well not fit your needs. You can tweak settings to suite your CAP needs, and even handle distribution to multiple data centers in the box. NOTE: Most applications do emphatically NOT need this level of use. ElasticSearch may be a better fit in most scenarios you would consider HBase/Hadoop or Cassandra for.
Graph - I'm not as familiar with graph databases, so can't comment here (beyond using a key-value store as underlying option).
Given that you then comment on MongoDB specifically vs SQL ... even if both auto-shard. PostgreSQL in particular has made a lot of strides in terms of getting unstrictured data usable (JSON/JSONB types) not to mention the power you can get from something like PLV8, it's probably the most suited to handling the types of loads you might throw at a document store with the advantages of NoSQL. Where it happens to fall down is that replication, sharding and failover are bolted on solutions not really in the box.
For small to medium loads sharding really isn't the best approach. Most scenarios are mostly read so having a replica-set where you have additional read nodes is usually better when you have 3-5 servers. MongoDB is great in this scenario, the master node is automagically elected, and failover is pretty fast. The only weirdness I've seen is when Azure went down in late 2014, and only one of the servers came up first, the other two were almost 40 minutes later. With replication any given read request can be handled in whole by a single server. Your data structures become simpler, and your chances of data loss are reduced.
Again in my own example above, for a mediums sized classifieds site, the vast majority of data belongs to a single collection... it is searched against, and displayed from that collection. With this use case a document store works much better than structured/normalized data. The way the objects are stored are much closer to their representation in the application. There's less of a cognitive disconnect and it simply works.
The fact is that SQL JOIN operations kill performance, especially when aggregating data across those joins. For a single query for a single user it's fine, even with a dozen of them. When you get to dozens of joins with thousands of simultaneous users, it starts to fall apart. At this point you have several choices...
Caching - caching is always a great approach, and the less often your data changes, the better the approach. This can be anything from a set of memcache/redis instances to using something like MongoDB, RethinkDB or ElasticSearch to hold composite records. The challenge here comes down to updating or invalidating your cached data.
Migrating - migrating your data to a data store that better represents your needs can be a good idea as well. If you need to handle massive writes, or very massive read scenarios no SQL database can keep up. You could NEVER handle the likes of Facebook or Twitter on SQL.
Something in between - As you need to scale it depends on what you are doing and where your pain points are as to what will be the best solution for a given situation. Many developers and administrators fear having data broken up into multiple places, but this is often the best answer. Does your analytical data really need to be in the same place as your core operational data? For that matter do your logins need to be tightly coupled? Are you doing a lot of correlated queries? It really depends.
Personal Opinions Ahead
For me, I like the safety net that SQL provides. Having it as the central store for core data it's my first choice. I tend to treat RDBMS's as dumb storage, I don't like being tied to a given platform. I feel that many people try to over-normalize their data. Often I will add an XML or JSON field to a table so additional pieces of data can be stored without bloating the scheme, specifically if it's unlikely to ever be queried... I'll then have properties in my objects in the application code that store in those fields. A good example may be a payment... if you are currently using one system, or multiple systems (one for CC along with Paypal, Google, Amazon etc) then the details of the transaction really don't affect your records, why create 5+ tables to store this detailed data. You can even use JSON for primary storage and have computed columns derived and persisted from that JSON for broader query capability and indexing where needed. Databases like postgresql and mysql (iirc) offer direct indexing against JSON data as well.
When data is a natural fit for a document store, I say go for it... if the vast majority of your queries are for something that fits better to a single record or collection, denormalize away. Having this as a mirror to your primary data is great.
For write-heavy data you want multiple systems in play... It depends heavily on your needs here... Do you need fast hot-query performance? Go with ElasticSearch. Do you need absolute massive horizontal scale, HBase or Cassandra.
The key take away here is not to be afraid to mix it up... there really isn't a one size fits all. As an aside, I feel that if PostgreSQL comes up with a good in the box (for the open-source version) solution for even just replication and automated fail-over they're in a much better position than most at that point.
I didn't really get into, but feel I should mention that there are a number of SaaS solutions and other providers that offer hybrid SQL systems. You can develop against MySQL/MariaDB locally and deploy to a system with SQL on top of a distributed storage cluster. I still feel that HBase or ElasticSearch are better for logging and analitical data, but the SQL on top solutions are also compelling.
More: http://www.mongodb.com/nosql-explained
Schema-less storage (or schema-free). Ability to modify the storage (basically add new fields to records) without having to modify the storage 'declared' schema. RDBMSs require the explicit declaration of said 'fields' and require explicit modifications to the schema before a new 'field' is saved. A schema-free storage engine allows for fast application changes, just modify the app code to save the extra fields, or rename the fields, or drop fields and be done.
Traditional RDBMS folk consider the schema-free a disadvantage because they argue that on the long run one needs to query the storage and handling the heterogeneous records (some have some fields, some have other fields) makes it difficult to handle. But for a start-up the schema-free is overwhelmingly alluring, as fast iteration and time-to-market is all that matter (and often rightly so).
You asked us to assume that either the data can fit on a single machine, OR your database has an effective auto-sharding feature.
Going with the assumption that your SQL data has an auto-sharding feature, that means you're talking about running a cluster. Any time you're running a cluster of machines you have to worry about fault-tolerance.
For example, let's say you're using the simplest approach of sharding your data by application function, and are storing all of your user account data on server A and your product catalog on server B.
Is it acceptable to your business if server A goes down and none of your users can login?
Is it acceptable to your business if server B goes down and no one can buy things?
If not, you need to worry about setting up data replication and high-availability failover. Doable, but not pleasant or easy for SQL databases. Other types of sharding strategies (key, lookup service, etc) have the same challenges.
Many NoSQL databases will automatically handle replication and failovers. Some will do it out of the box, with very little configuration. That's a huge benefit from an operational point of view.
Full disclosure: I'm an engineer at FoundationDB, a NoSQL database that automatically handles sharding, replication, and fail-over with very little configuration. It also has a SQL layer so you you don't have to give up structured data.
I have to rewrite a large database application, running on 32 servers. The hardware is up to date, each machine has two quad core Xeon and 32 GByte RAM.
The database is multi-tenant, each customer has his own file, around 5 to 10 GByte each. I run around 50 databases on this hardware. The app is open to the web, so I have no control
on the load. There are no really complex queries, so SQL is not required if there is a better solution.
The databases get updated via FTP every day at midnight. The database is read-only.
C# is my favourite language and I want to use ASP.NET MVC.
I thought about the following options:
Use two big SQL servers running SQL Server 2012 to serve the 32 servers with data. On the 32 servers running IIS hosting providing REST services.
Denormalize the database and use Redis on each webserver. Use booksleeve as a Redis client.
Use a combination of SQL Server and Redis
Use SQL Server 2012 together with Hadoop
Use Hadoop without SQL Server
What is the best way for a read-only database, to get the best performance without loosing maintainability? Does Map-Reduce make sense at all in such a scenario?
The reason for the rewrite is, the old app written in C++ with ISAM technology is too slow, the interfaces are old fashioned and not nice to use from an website, especially when using ajax.
The app uses a relational datamodel with many tables, but it is possible to write one accerlerator table where all queries can be performed on, and all other information from the other tables are possible by a simple key lookup.
Few questions. What problems have come up that you're rewriting this? What do the query patterns look like? It sounds like you would be most comfortable with a SQLServer + caching (memcached) to address whatever issues that are causing you to rewrite this. Redis is good, but you won't need the data structure features with the db handling queries, and you don't need persistance if it's only being used as a cache. Without knowing more about the problem, I guess I'd look at MongoDB to handle data sharding, redundant storage, and caching all in one solution. There are no special machines in this setup, redundancy can be configured, and the load should balance well.
This question is almost an opinion piece. I'd personally prefer an Oracle RAC with TimesTen for caching if performance is of the utmost importance, and if volume of concurrent reads is high during the day.
There's a white paper here...
http://www.oracle.com/us/products/middleware/timesten-in-memory-db-504865.pdf
The specs of the disk subsystem and organization of indexes and data files across physical disks is probably the most important factor though.
I am asking this in the context of NoSQL - which achieves scalability and performance without being expensive.
So, if I needed to achieve massively parallel distributed computing across databases ...
What are the various methodologies available today (within the RDBMS paradigm) to achieve distributed computing with high-scalability?
Does database clustering & mirroring contribute in any way towards distributed computing?
I guess you are asking about scalability of RDBMS databases. Talking about NoSQL databases based on ( amazon dynamo, BigTable ) are a whole another topic. I am talking about HBase, Cassandra etc. There are also commerical products like Oracle Coherence thats more like a distributed cache and key value store , to put it crudely.
going back to rdbms,
Sharding
to scale RDBMS one can do cusstom sharding. Sharding is a technique where you have multiple table is possibly multiple hosts. And then you decide in a certain fashion to assign certain rows to certain tables. For example you can say that rows 1-1M goes to table1, 1M-2M goes to table2 etc. But, this is a difficult process from an administration point of view. A lot of large scale websites scale by relying on sharding. Other techniques worth mentioning are partioning and mysql federation and mysql cluster.
MPP databases
Then there are databases are there very RDBMS which does distribution and scaling for you. Terradata is the most successful of these companies. I believe they used postgres core code at some point. A significant number of fortune 500 companies and a lot of the airlines use Terradata. But, its ridiculously expensive. There are newer companies like greenplum, vertica, netezza.
Unless you're a very big company with extreme scalability requirements, you can horizontally and ACID scale up your DB by building a cluster of identical RDBMS instances and synchronizing them with JTA transactions.
Take a look to this Java/JDBC based article the JEPLayer framework is used but you can use straight JDBC and JTA code.
Within the RDBMS paradigm: Sharding.
Outside the RDBMS paradigm: Key-value stores.
My pick: (I come from an RDBMS background) Key-value stores of the tabluar type - HBase.
Within the RDBMS paradigm, sharding will not get you far.
Use the RDBMS paradigm to design your model, to get your project up and running.
Use tabular key-value stores to SCALE OUT.
Sharding:
A good way to think about sharding is to see it as user-account-oriented
DB design.
The all schema entities touched by a user-account are kept on one host.
The assignment of user to host happens when the user creates an account.
The least loaded host gets that user.
When that user signs on after account creation, he gets connected
to the host that has his data.
Each host has a set of user accounts.
The problem with this approach is that if the host gets hosed,
a fraction of users will be blacked out.
The solution to this is have a replicated standby host that
becomes the primary when the primary host encounters problems.
Also, it's a fairly rigid setup for processes where the design does
not change dramatically.
From the user standpoint, I've noticed that web sites
with a sharded DB backend are not as quick to "turn on a dime"
to create different business models on their platform.
Contrast this with web sites that have truly distributed
key-value stores. These businesses can host any range of
services. Their platform is just that - a platform.
It's not relational and it does have an API interface,
but it just seems to work.
I'm a beginner. How many number of tables are recommended in a SQL Server Express database? Mainly attaining best performance speedwise as an objective. Is it generally recommend to use two databases as compared to one for a single application?
SQL Express databases have a limit of 4GB in size. Within that limit, any number of tables is fair game. The number of tables makes absolutely no impact on performance. The only thing that drives performance of the application vis-a-vis the database is the proper design of the tables, both as logical model and as physical database structure (ie. correct choices of clustered indexes, non-clustered indexes, constraints, defaults, data types etc), and the proper querying and updating of the database ie. queries that can be satisfied (covered, efficiently) by the existing indexes.
Splitting an application database into multiple distinct databases is a bad idea. You are loosing consistency of the recovery unit (you can't backup/restore the two databases in a consistent state) and you need to replicate all the infrastructure around the database twice (security roles and permissions, maintenance activities and procedures etc). Also spliting an application database into distinct databases gaves absolutely no performance advantage.
What you can do to make things speedier:
-break up your databases so that they use multiple files across multiple, fast drives
-federate (not really something you'll do if you're running Express)
-Install memory, memory, memory
-I can't remember Express's limitations and I don't care to look them up, but on the configuration screen where you can assign the number of CPUs to dedicate to SQL, give it as many as you can. You should also be able to set affinity there (if not, then in Task Manager)
Don't run anything you don't need (scheduler, report engine, Server, DHCP Client) if you don't have to