How would I implement separate databases for reading and writing operations? - sql

I am interested in implementing an architecture that has two databases one for read operations and the other for writes. I have never implemented something like this and have always built single database, highly normalised systems so I am not quite sure where to begin. I have a few parts to this question.
1. What would be a good resource to find out more about this architecture?
2. Is it just a question of replicating between two identical schemas, or would your schemas differ depending on the operations, would normalisation vary too?
3. How do you insure that data written to one database is immediately available for reading from the second?
Any further help, tips, resources would be appreciated. Thanks.
EDIT
After some research I have found this article which I found very informative for those interested..
http://www.codefutures.com/database-sharding/
I found this highscalability article very informative

I'm not a specialist but the read/write master database and read-only slaves pattern is a "common" pattern, especially for big applications doing mostly read accesses or data warehouses:
it allows to scale (you add more read-only slaves if required)
it allows to tune the databases differently (for either efficient reads or efficient writes)
What would be a good resource to find out more about this architecture?
There are good resources available on the Internet. For example:
Highscalability.com has good examples (e.g. Wikimedia architecture, the master-slave category,...)
Handling Data in Mega Scale Systems (starting from slide 29)
MySQL Scale-Out approach for better performance and scalability as a key factor for Wikipedia’s growth
Chapter 24. High Availability and Load Balancing in PostgreSQL documentation
Chapter 16. Replication in MySQL documentation
http://www.google.com/search?q=read%2Fwrite+master+database+and+read-only+slaves
Is it just a question of replicating between two identical schemas, or would your schemas differ depending on the operations, would normalisation vary too?
I'm not sure - I'm eager to read answers from experts - but I think the schemas are identical in traditional replication scenari (the tuning may be different though). Maybe people are doing more exotic things but I wonder if they rely on database replication in that case, it sounds more like "real-time ETL".
How do you insure that data written to one database is immediately available for reading from the second?
I guess you would need synchronous replication for that (which is of course slower than asynchronous). While some databases do support this mode, not all do AFAIK. But have a look at this answer or this one for SQL Server.

You might look up data warehouses.
These serve as 'normalized for reporting' type databases, while you can keep a normalized OLTP style instance for the data maintenance.
I don't think the idea of 'immediate' equivalence will be a reality. There will be some delay while the new data and changes are migrated in to the other system. The schedule and scope will be your big decisions here.

In regards to questions 2:
It really depends on what you are trying to achieve by having two databases. If it is for performance reasons (which i suspect it may be) i would suggest you look into denormalizing the read-only database as needed for performance. If performance isn't an issue then I wouldn't mess with the read-only schema.
I've worked on similar systems where there would be a read/write database that was only lightly used by administrative users. That database would then be replicated to the read only database during a nightly process.
Question 3:
How immediate are we talking here? Less than a second? 10 seconds? Minutes?

Related

How can NoSQL databases achieve much better write throughput than some relational databases?

How is this possible? What is it about NoSQL that gives it a higher write throughput than some RDBMS? Does it boil down to scalability?
Some noSQL systems are basically just persistent key/value storages (like Project Voldemort). If your queries are of the type "look up the value for a given key", such a system will (or at least should be) faster that an RDBMS, because it only needs to have a much smaller feature set.
Another popular type of noSQL system is the document database (like CouchDB). These databases have no predefined data structure. Their speed advantage relies heavily on denormalization and creating a data layout that is tailored to the queries that you will run on it. For example, for a blog, you could save a blog post in a document together with its comments. This reduces the need for joins and lookups, making your queries faster, but it also could reduce your flexibility regarding queries.
There are many NoSQL solutions around, each one with its own strengths and weaknesses, so the following must be taken with a grain of salt.
But essentially, what many NoSQL databases do is rely on denormalization and try to optimize for the denormalized case. For instance, say you are reading a blog post together with its comments in a document-oriented database. Often, the comments will be saved together with the post itself. This means that it will be faster to retrieve all of them together, as they are stored in the same place and you do not have to perform a join.
Of course, you can do the same in SQL, and denormalizing is a common practice when one needs performance. It is just that many NoSQL solutions are engineered from the start to be always used this way. You then get the usual tradeoffs: for instance, adding a comment in the above example will be slower because you have to save the whole document with it. And once you have denormalized, you have to take care of preserving data integrity in your application.
Moreover, in many NoSQL solutions, it is impossible to do arbitrary joins, hence arbitrary queries. Some databases, like CouchDB, require you to think ahead of the queries you will need and prepare them inside the DB.
All in all, it boils down to expecting a denormalized schema and optimizing reads for that situation, and this works well for data that is not highly relational and that requires much more reads than writes.
This link explains a lot moreover where:
RDBMS -> data integrity is a key feature (which can slow down some operations like writing)
NoSQL -> Speed and horizontal scalability are imperative (So speed is really high with this imperatve)
AAAND... The thing about NoSQL is that NoSQl cannot be compared to SQL in any way. NoSQL is name of all persistence technologies that are not SQL. Document DBs, Key-Value DBs, Event DBs are all NoSQL. They are all different in almost all aspects, be it structure of saved data, querying, performance and available tools.
Hope it is useful to understand
In summary, NoSQL databases are built to easily scale across a large number of servers (by sharding/horizontal partitioning of data items), and to be fault tolerant (through replication, write-ahead logging, and data repair mechanisms). Furthermore, NoSQL supports achieving high write throughput (by employing memory caches and append-only storage semantics), low read latencies (through caching and smart storage data models), and flexibility (with schema-less design and denormalization).
From:
Open Journal of Databases (OJDB)
Volume 1, Issue 2, 2014
www.ronpub.com/journals/ojdb
ISSN 2199-3459
https://estudogeral.sib.uc.pt/bitstream/10316/27748/1/Which%20NoSQL%20Database.pdf - page 19
A higher write throughput can also be credited to the internal data structures that power the database storage engine.
Even though B-tree implementations used by some RDBMS have stood the test of time, LSM-trees used in some key-value datastores are typically faster for writes:
1: When a write comes, you add it to the in-memory balanced tree, called memtable.
2: When the memtable grows big, it is flushed to the disk.
To understand this data structure better, please check this video and this answer.

Distributed SQL Database

I need to decide which database to use for a system where I need AP from CAP theorem. Data I constantly but slowly going in. Big queries are expected. It should be reliable - no single point of failure. I can use up to 3 instances on different nodes. In-memory solutions are bad for me because of data size -
it should be running for years and I expect up to terabyte data sizes. Most guys in my team prefer SQL. But I understand that traditional SQL databases are not fault tolerant in terms of hardware failure. Any ideas?
Since this question was asked there have been some significant changes in the Distributed SQL or NewSQL landscape...the most noteworthy being the viability of CockroachDB. That appears to be the best option in a situation like the one referenced in this question.
No single points of failure. Easy to scale. Can handle tons of volume. You can run it wherever you want. Speaks postgres. Super fault tolerant.
Amazon Redshift seems to be the best answer(thank you kuujo). But we will try rethinkdb because it has some nice feature

Are there any REAL advantages to NoSQL over RDBMS for structured data on one machine?

So I've been trying hard to figure out if NoSQL is really bringing that much value outside of auto-sharding and handling UNSTRUCTURED data.
Assuming I can fit my STRUCTURED data on a single machine OR have an effective 'auto-sharding' feature for SQL, what advantages do any NoSQL options offer? I've determined the following:
Document-based (MongoDB, Couchbase, etc) - Outside of it's 'auto-sharding' capabilities, I'm having a hard time understanding where the benefit is. Linked objects are quite similar to SQL joins, while Embedded objects significantly bloat doc size and causes a challenge regarding to replication (a comment could belong to both a post AND a user, and therefore the data would be redundant). Also, loss of ACID and transactions are a big disadvantage.
Key-value based (Redis, Memcached, etc) - Serves a different use case, ideal for caching but not complex queries
Columnar (Cassandra, HBase, etc ) - Seems that the big advantage here is more how the data is stored on disk, and mostly useful for aggregations rather than general use
Graph (Neo4j, OrientDB, etc) - The most intriguing, the use of both edges and nodes makes for an interesting value-proposition, but mostly useful for highly complex relational data rather than general use.
I can see the advantages of Key-value, Columnar and Graph DBs for specific use cases (Caching, social network relationship mapping, aggregations), but can't see any reason to use something like MongoDB for STRUCTURED data outside of it's 'auto-sharding' capabilities.
If SQL has a similar 'auto-sharding' ability, would SQL be a no-brainer for structured data? Seems to me it would be, but I would like the communities opinion...
NOTE: This is in regards to a typical CRUD application like a Social Network, E-Commerce site, CMS etc.
If you're starting off on a single server, then many advantages of NoSQL go out the window. The biggest advantages to the most popular NoSQL are high availability with less down time. Eventual consistency requirements can lead to performance improvements as well. It really depends on your needs.
Document-based - If your data fits well into a handful of small buckets of data, then a document oriented database. For example, on a classifieds site we have Users, Accounts and Listings as the core data. The bulk of search and display operations are against the Listings alone. With the legacy database we have to do nearly 40 join operations to get the data for a single listing. With NoSQL it's a single query. With NoSQL we can also create indexes against nested data, again with results queried without Joins. In this case, we're actually mirroring data from SQL to MongoDB for purposes of search and display (there are other reasons), with a longer-term migration strategy being worked on now. ElasticSearch, RethinkDB and others are great databases as well. RethinkDB actually takes a very conservative approach to the data, and ElasticSearch's out of the box indexing is second to none.
Key-value store - Caching is an excellent use case here, when you are running a medium to high volume website where data is mostly read, a good caching strategy alone can get you 4-5 times the users handled by a single server. Key-value stores (RocksDB, LevelDB, Redis, etc) are also very good options for Graph data, as individual mapping can be held with subject-predicate-target values which can be very fast for graphing options over the top.
Columnar - Cassandra in particular can be used to distribute significant amounts of load for even single-value lookups. Cassandra's scaling is very linear to the number of servers in use. Great for heavy read and write scenarios. I find this less valuable for live searches, but very good when you have a VERY high load and need to distribute. It takes a lot more planning, and may well not fit your needs. You can tweak settings to suite your CAP needs, and even handle distribution to multiple data centers in the box. NOTE: Most applications do emphatically NOT need this level of use. ElasticSearch may be a better fit in most scenarios you would consider HBase/Hadoop or Cassandra for.
Graph - I'm not as familiar with graph databases, so can't comment here (beyond using a key-value store as underlying option).
Given that you then comment on MongoDB specifically vs SQL ... even if both auto-shard. PostgreSQL in particular has made a lot of strides in terms of getting unstrictured data usable (JSON/JSONB types) not to mention the power you can get from something like PLV8, it's probably the most suited to handling the types of loads you might throw at a document store with the advantages of NoSQL. Where it happens to fall down is that replication, sharding and failover are bolted on solutions not really in the box.
For small to medium loads sharding really isn't the best approach. Most scenarios are mostly read so having a replica-set where you have additional read nodes is usually better when you have 3-5 servers. MongoDB is great in this scenario, the master node is automagically elected, and failover is pretty fast. The only weirdness I've seen is when Azure went down in late 2014, and only one of the servers came up first, the other two were almost 40 minutes later. With replication any given read request can be handled in whole by a single server. Your data structures become simpler, and your chances of data loss are reduced.
Again in my own example above, for a mediums sized classifieds site, the vast majority of data belongs to a single collection... it is searched against, and displayed from that collection. With this use case a document store works much better than structured/normalized data. The way the objects are stored are much closer to their representation in the application. There's less of a cognitive disconnect and it simply works.
The fact is that SQL JOIN operations kill performance, especially when aggregating data across those joins. For a single query for a single user it's fine, even with a dozen of them. When you get to dozens of joins with thousands of simultaneous users, it starts to fall apart. At this point you have several choices...
Caching - caching is always a great approach, and the less often your data changes, the better the approach. This can be anything from a set of memcache/redis instances to using something like MongoDB, RethinkDB or ElasticSearch to hold composite records. The challenge here comes down to updating or invalidating your cached data.
Migrating - migrating your data to a data store that better represents your needs can be a good idea as well. If you need to handle massive writes, or very massive read scenarios no SQL database can keep up. You could NEVER handle the likes of Facebook or Twitter on SQL.
Something in between - As you need to scale it depends on what you are doing and where your pain points are as to what will be the best solution for a given situation. Many developers and administrators fear having data broken up into multiple places, but this is often the best answer. Does your analytical data really need to be in the same place as your core operational data? For that matter do your logins need to be tightly coupled? Are you doing a lot of correlated queries? It really depends.
Personal Opinions Ahead
For me, I like the safety net that SQL provides. Having it as the central store for core data it's my first choice. I tend to treat RDBMS's as dumb storage, I don't like being tied to a given platform. I feel that many people try to over-normalize their data. Often I will add an XML or JSON field to a table so additional pieces of data can be stored without bloating the scheme, specifically if it's unlikely to ever be queried... I'll then have properties in my objects in the application code that store in those fields. A good example may be a payment... if you are currently using one system, or multiple systems (one for CC along with Paypal, Google, Amazon etc) then the details of the transaction really don't affect your records, why create 5+ tables to store this detailed data. You can even use JSON for primary storage and have computed columns derived and persisted from that JSON for broader query capability and indexing where needed. Databases like postgresql and mysql (iirc) offer direct indexing against JSON data as well.
When data is a natural fit for a document store, I say go for it... if the vast majority of your queries are for something that fits better to a single record or collection, denormalize away. Having this as a mirror to your primary data is great.
For write-heavy data you want multiple systems in play... It depends heavily on your needs here... Do you need fast hot-query performance? Go with ElasticSearch. Do you need absolute massive horizontal scale, HBase or Cassandra.
The key take away here is not to be afraid to mix it up... there really isn't a one size fits all. As an aside, I feel that if PostgreSQL comes up with a good in the box (for the open-source version) solution for even just replication and automated fail-over they're in a much better position than most at that point.
I didn't really get into, but feel I should mention that there are a number of SaaS solutions and other providers that offer hybrid SQL systems. You can develop against MySQL/MariaDB locally and deploy to a system with SQL on top of a distributed storage cluster. I still feel that HBase or ElasticSearch are better for logging and analitical data, but the SQL on top solutions are also compelling.
More: http://www.mongodb.com/nosql-explained
Schema-less storage (or schema-free). Ability to modify the storage (basically add new fields to records) without having to modify the storage 'declared' schema. RDBMSs require the explicit declaration of said 'fields' and require explicit modifications to the schema before a new 'field' is saved. A schema-free storage engine allows for fast application changes, just modify the app code to save the extra fields, or rename the fields, or drop fields and be done.
Traditional RDBMS folk consider the schema-free a disadvantage because they argue that on the long run one needs to query the storage and handling the heterogeneous records (some have some fields, some have other fields) makes it difficult to handle. But for a start-up the schema-free is overwhelmingly alluring, as fast iteration and time-to-market is all that matter (and often rightly so).
You asked us to assume that either the data can fit on a single machine, OR your database has an effective auto-sharding feature.
Going with the assumption that your SQL data has an auto-sharding feature, that means you're talking about running a cluster. Any time you're running a cluster of machines you have to worry about fault-tolerance.
For example, let's say you're using the simplest approach of sharding your data by application function, and are storing all of your user account data on server A and your product catalog on server B.
Is it acceptable to your business if server A goes down and none of your users can login?
Is it acceptable to your business if server B goes down and no one can buy things?
If not, you need to worry about setting up data replication and high-availability failover. Doable, but not pleasant or easy for SQL databases. Other types of sharding strategies (key, lookup service, etc) have the same challenges.
Many NoSQL databases will automatically handle replication and failovers. Some will do it out of the box, with very little configuration. That's a huge benefit from an operational point of view.
Full disclosure: I'm an engineer at FoundationDB, a NoSQL database that automatically handles sharding, replication, and fail-over with very little configuration. It also has a SQL layer so you you don't have to give up structured data.

gDatabase Optimization: Need a really big database to test some of the features of sql server

I have done database optimization for dbs upto 3GB size. Need a really large database to test optimization.
Simply generating a lot of data and throwing it into a table proves nothing about the DBMS, the database itself, the queries being issued against it, or the applications interacting with them, all of which factor into the performance of a database-dependent system.
The phrase "I have done database optimization for [databases] up to 3 GB" is highly suspect. What databases? On what platform? Using what hardware? For what purposes? For what scale? What was the model? What were you optimizing? What was your budget?
These same questions apply to any database, regardless of size. I can tell you first-hand that "optimizing" a 250 GB database is not the same as optimizing a 25 GB database, which is certainly not the same as optimizing a 3 GB database. But that is not merely on account of the database size, it is because databases that contain 250 GB of data invariably deal with requirements that are vastly different from those addressed by a 3 GB database.
There is no magic size barrier at which you need to change your optimization strategy; every optimization requires in-depth knowledge of the specific data model and its usage requirements. Maybe you just need to add a few indexes. Maybe you need to remove a few indexes. Maybe you need to normalize, denormalize, rewrite a couple of bad queries, change locking semantics, create a data warehouse, implement caching at the application layer, or look into the various kinds of vertical scaling available for your particular database platform.
I submit that you are wasting your time attempting to create a "really big" database for the purposes of trying to "optimize" it with no specific requirements in mind. Various data-generation tools are available for when you need to generate data fitting specific patterns for testing against a specific set of scenarios, but until you have that information on hand, you won't accomplish very much with a database full of unorganized test data.
The best way to do this is to create your schema and write a script to populate it with lots of random(ish) dummy data. Random, meaning that your text-fields don't necessarily have to make sense. 'ish', meaning that the data distribution and patterns should generally reflect your real-world DB usage.
Edit: a quick Google search reveals a number of commercial tools that will do this for you if you don't want to write your own populate scripts: DB Data Generator, DTM Data Generator. Disclaimer: I've never used either of these and can't really speak to their quality or usefulness.
Here is a free procedure I wrote to generate Random person names. Quick and dirty, but it works and might help.
http://www.joebooth-consulting.com/products/genRandNames.sql
I use Red-Gate's Data Generator regularly to test out problems as well as loads on real systems and it works quite well. That said, I would agree with Aaronnaught's sentiment in that the overall size of the database isn't nearly as important as the usage patterns and the business model. For example, generating 10 GB of data on a table that will eventually get no traffic will not provide any insight into optimization. The goal is to replicate the expected transaction and storage loads you anticipate to occur in order to identify bottlenecks before they occur.

Is DB4O Replication faster than SQL Server Merge Replication?

Does the replication system that comes with DB4O work well? Basically I would like to know if anyone has some good numbers on the record throughput of their replication system and if it handles concurrency errors gracefully or not. What is the relative performance difference between SQL Server's merge replication between two SQL servers and using DRS between two DB4O databases?
We are currently working on improving the replication system further and improving performance certainly is a goal.
I think it's quite hard to produce comparable figures. Every object that needs to be replicated requires a lookup in the UUID BTree. If you know what you are doing, you can finetune that to run completely in memory. Then again the throughput will depend very much on how many indexes you have on each side and how big indexes are. db4o and the SQL server of your choice (and any other SQL server) may scale differently with size and that may very much depend on the hardware you use (db4o loves solid state discs with short seek times).
This is like with any other benchmark: You can only find out how things really will work for you if you mock up the scenario that you think you need and run it on your hardware.
As to handling concurrency: Any conflict will call back into your code and it's your choice how you handle it. You can resolve by hand by merging changes to either side and you can also ignore objects. It's up to your code to find out what it thinks is right.
With respect to concurrency if you have a replication session running side-by-side with another live session that constantly modifies objects: Currently released dRS code is not yet strong for this case. While we implement replication between db4o and the high-end object database Versant VOD we will try to cover these kind of concurrency cases also.