Consistency/Atomicity (or even ACID) properties in multiple SQL/NoSQL databases architecture - sql

I'm used to working with a single database (say PostgreSQL or ElasticSearch).
But I'm currently using a mix (PG and ES) in a prototype app and may add other kinds of databases (e.g. Redis).
Say some piece of data needs to be persisted to each database in a different way.
How do you keep a system consistent in the event of a failure of one of the components/databases?
Example scenario that I'm facing:
Data update on PostgreSQL, ElasticSearch is unavailable.
At this point, the system is inconsistent, as I should have updated both databases.
As I'm using an SQL db, I can simply abort the transaction to put the system in its previous consistent state.
But what is the best way to keep the system consistent?
Check every time that the value has been persisted in all databases?
In case of failure, restore the previous state? But some NoSQL databases have no transaction/ACID mechanism, so I can't revert to the previous state as easily.
Additionally, if multiple databases must be kept in sync, is there any good practice to follow, like adding some kind of "version" metadata (whether a timestamp or a home-made incrementing version number) so you can put your databases back in sync? (Not talking about CouchDB, where it is built in!)
Moreover, the databases are not all updated atomically, so some parts are inconsistent for a short period. I think it depends on the business of the app, but does anyone have thoughts about the problems that may occur or how to fix them? I guess it must be tough and depends a lot on the configuration (for maybe very few real benefits).
I guess this may be a common architecture issue, but I'm having trouble finding information on the subject.

Keep things simple.
A search engine can and will lag behind sometimes. You may fight it. You may embrace it. It's fine, and most of the time it's acceptable.
Don't mix the data. If you use Redis for sessions - good. Don't store stuff from database A in B and vice versa.
Select a proper database with ACID and strong consistency for your Super Important Business Data™®.
Again, do not mix the data.

Using more than one database technology in one product is a decision one shouldn't make lightly. The more technologies you use, the more complex your project will become in development, deployment, maintenance and administration. Also, every database technology becomes its own point of failure. That means it is often much wiser to stick to one technology, even when it means that you need to make some compromises.
But when you have a good(!) reason to use multiple DBMS, you should try to keep them as separated as possible. Avoid placing related data across multiple databases. When possible, no feature should require more than one DBMS to work (preferably a failure of one DBMS would only affect those features which use it). Storing redundant data in two different DBMS should also be avoided.
When you can't avoid redundancies and relationships spanning multiple DBMS, you should decide on one system to be the single source of truth (preferably one which you trust most regarding consistency). When there are inconsistencies between systems, they should be resolved by synchronizing the data with the SSOT.
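A minimal sketch of resolving inconsistencies against the single source of truth, assuming every record carries a version number (the "version" metadata idea from the question). The two dictionaries merely stand in for the primary database and a lagging secondary store such as a search index; `resync` and the record layout are hypothetical names chosen for illustration, not any product's API.

```python
from typing import Dict, List, Tuple

# Each store maps a record id to (version, payload).
Record = Tuple[int, str]


def resync(source_of_truth: Dict[str, Record], replica: Dict[str, Record]) -> List[str]:
    """Make the replica match the source of truth; return the ids that were repaired."""
    repaired = []
    for key, (version, payload) in source_of_truth.items():
        current = replica.get(key)
        # Missing record or differing version: overwrite from the source of truth.
        if current is None or current[0] != version:
            replica[key] = (version, payload)
            repaired.append(key)
    # Records that no longer exist in the source of truth are removed.
    for key in list(replica):
        if key not in source_of_truth:
            del replica[key]
            repaired.append(key)
    return repaired


primary = {"a": (2, "alice v2"), "b": (1, "bob v1")}
search_index = {"a": (1, "alice v1"), "c": (1, "stale")}  # lagging replica
fixed = resync(primary, search_index)
```

The replica is never trusted here: whenever it disagrees, the source of truth wins. A real job would page through both stores rather than hold them in memory.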

Related

How to go from a full SQL querying to something like a NoSQL?

In one of my processes I have a SQL query that takes 10-20% of the total execution time. This query filters my database and loads a list of PricingGrid objects.
So I want to improve its performance.
So far I have thought of 2 solutions:
Use a NoSQL solution. AFAIK these are good for improving read performance.
But the migration seems hard and needs a lot of work (like importing the data from SQL Server to NoSQL on a regular basis)
I don't have any knowledge of it; I don't even know which one I should use (the first I'd try is RavenDB because I follow Ayende and it's made by the .NET community).
I might have to change some things in my model to make my objects suitable for a NoSQL database
Load all my PricingGrid objects in memory (in a static IEnumerable)
This might be a problem when my server doesn't have enough memory to load everything
I might reinvent the wheel (indexes...) already invented by the NoSQL providers
I think I'm not the first one to wonder about this, so what would be the best solution? Are there any tools that could help me?
.net 3.5, SQL Server 2005, windows server 2005
Migrating your data from SQL is only the first step.
Moving to a document store (like RavenDB or MongoDB) also means that you need to:
Denormalize your data
Perform schema validation in your code
Handle concurrency of complex operations in your code since you no longer have transactions (at least not the same way)
Perform rollbacks in the event of partial commits (changes)
Depending on your updates, reads and network model you might also need to handle conflicts
You provided very limited information but it sounds like your needs include a single database server and that your data fits well in the relational model.
In such a case I would vote against a NoSQL solution; it is more likely that you can speed up your queries with database optimizations and still retain all the added value of an RDBMS.
Non-relational databases are tools for a specific job (no matter how they sell them). If you need them, it is usually because your data doesn't fit well in the relational model or because you need to distribute your data over multiple machines (size or availability). For instance, I use MongoDB for a write-intensive high throughput job management application. It is centralized and the data is very transient so the "cost" of having low durability is acceptable. This doesn't sound like the case for you.
If you prefer to use a NoSQL solution, perhaps you should try using Memcached+MySQL (InnoDB). This will allow you to get the speed benefits of an in-memory cache (in the form of a memcached daemon plugin) with the underlying protection and capabilities of an RDBMS (MySQL). It should also ease data migration and somewhat reduce the amount of changes required in your code.
I myself have never used it, I find that I either need NoSQL for the reasons I stated above or that I can optimize the RDBMS using stored procedures, indexes and table views in a way which is sufficient for my needs.
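As a small illustration of the index-based optimization mentioned above, here is a sketch that uses SQLite (from Python's standard library) as a stand-in for SQL Server. The `pricing_grid` table, its columns and the index name are assumptions made up for the example; the point is only that the query planner switches to the index for the filter once it exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pricing_grid (id INTEGER PRIMARY KEY, product_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO pricing_grid (product_id, price) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def plan(sql: str) -> str:
    """Return the query planner's description of how it would run the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)  # last column holds the readable detail


query = "SELECT price FROM pricing_grid WHERE product_id = 42"
before = plan(query)  # planner has no index to use for the filter
conn.execute("CREATE INDEX idx_pricing_product ON pricing_grid (product_id)")
after = plan(query)   # planner now reports the index
```

On SQL Server the equivalent step would be to inspect the execution plan of the slow query before and after adding an index on the filtered columns.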
Asaf has provided great information in regards to the usage of NoSQL and when it is most appropriate. Given that your main concern was performance, I would tend to agree with his opinion - it would take you much more time and effort to adopt a completely new (and very different) data persistence platform than it would to trick out your SQL Server cluster. That said, my answer is mainly to address the "how" part of your question.
Addressing misunderstandings:
Denormalizing Data - You do not need to manually denormalize your existing data. This will be done for you when it is migrated over. More than anything you need to simply think about your data in a different fashion - root aggregates, entity and value types, etc.
Concurrency/Transactions - Transactions are possible in both Mongo and Raven, they are simply done in a different fashion. One of the inherent ways Raven does this is by using an ORM-like "unit of work" pattern with its RavenSession objects. Yes, your data validation needs to be done in code, but you already should be doing it there anyway. In my experience this is an over-hyped con.
How:
Install Raven or Mongo on a primary server, run it as a service.
Create or extend an existing application that uses the database you intend to port. This application needs all the model classes/libraries that your SQL database provides persistence for.
a. In your "data layer" you likely have a repository class somewhere. Extract an interface from this, and use it to build another repository class for your Raven/Mongo persistence. Both DBs have plenty of good documentation for using their APIs to push/pull/update changes in the document graphs. It's pretty damn simple.
b. Load your SQL data into C# objects in memory. Pull back your top-level objects (just the entities) and load their inner collections and related data in memory. Your repository is probably already doing this (e.g. when fetching an Order object, ensure not only its properties but also associated collections like Items are loaded in memory).
c. Instantiate your Raven/Mongo repository and push the data to it. Primary entities become "top level documents" or "root aggregates" serialized in JSON, and their collections' data nested within. Save changes and close the repository. Note: You may break this step down into as many little pieces as your data deems necessary.
Once your data is migrated, play around with it and ensure you are satisfied. You may want to modify your application Models a little to adjust the way they are persisted to Raven/Mongo - for instance you may want to make both Orders and Items top-level documents and simply use reference values (much like relationships in RDBMS systems). Watch out here though, as doing so sort of goes against the principle and performance behind NoSQL, as now you have to hit the DB twice to get the Order and the Items.
If satisfied, shard/replicate your mongo/raven servers across your remaining available server boxes.
Obviously there are tons of little details I did not explain, but that is the general process, and much of it depends on the applications already consuming the database and may be tricky if more than one app/system talks to it.
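Steps b and c above can be sketched roughly as follows. This is only an illustration under stated assumptions: SQLite stands in for the SQL source, the `orders`/`items` tables and the `load_order_documents` helper are hypothetical, and instead of calling a real Raven or Mongo client the documents are just serialized to JSON to show the nested shape.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT, qty INTEGER);
    INSERT INTO orders VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO items VALUES (1, 1, 'bolt', 10), (2, 1, 'nut', 20), (3, 2, 'gear', 1);
    """
)


def load_order_documents(db: sqlite3.Connection) -> list:
    """Load each order with its items nested inside, one document per root aggregate."""
    documents = []
    for order_id, customer in db.execute("SELECT id, customer FROM orders ORDER BY id"):
        items = [
            {"sku": sku, "qty": qty}
            for sku, qty in db.execute(
                "SELECT sku, qty FROM items WHERE order_id = ? ORDER BY id", (order_id,)
            )
        ]
        documents.append({"id": order_id, "customer": customer, "items": items})
    return documents


docs = load_order_documents(conn)
# In a real migration each document would be handed to the document store's client;
# here it is only serialized to JSON to show the shape.
payload = json.dumps(docs[0])
```

Each order becomes one top-level document with its items embedded, which is the "root aggregate" shape described in step c.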
Lastly, just to reiterate what Asaf said... learn as much as you can about NoSQL and its best use-cases. It is an amazing tool, but not a golden solution for all data persistence. In your case try to really find the bottlenecks in your current solution and see if they are solvable. As one of my systems guys says, "technology for technology's sake is bullshit".

Dynamic sql vs stored procedures - pros and cons?

I have read many strong views (both for and against) SPs or DS.
I am writing a query engine in C++ (MySQL backend for now, though I may decide to go with a C++ ORM). I can't decide whether to write an SP or to dynamically create the SQL and send the query to the DB engine.
Any tips on how to decide?
Here's the simple answer:
If your programmers do both database and coding work, keep the SQL with the app. It's easier to maintain that way. Otherwise, let the DB guys handle it in SPs.
You have more control over the mechanisms outside the database. The biggest win for taking care of this outside the database is simply maintenance (in my mind). It'd be slightly hard to version control the SP vs the code you generate outside the database. One more thing to keep track of.
While we're on the topic, it's similar to handling data/schema migrations. It's annoyingly complex to version/handle schema migrations; if you don't already have a mechanism for this, you will have yet another thing you'll need to manage. It comes down to simply being easier to manage/version these things outside the database.
Consider the scenario where you have a bug in your SP. Now it needs to be changed, but then you hop over to another developer's database/sandbox. What version is the sandbox, and what version is the SP? Now you have to track multiple versions.
One of the main differentiators is whether you are writing the "one true front end" or whether the database is the central piece of your application.
If you are going to have multiple front ends stored procedures make a lot of sense because you reduce your maintenance overhead. If you are writing only one interface, stored procedures are a pain, because you lose a lot of flexibility in changing your data set as your front end needs change, plus you now have to do code maintenance, version control, etc. in two places. Databases are a real pain to keep in sync with code repositories.
Finally, if you are coding for multiple databases (Oracle and SQL compatible code, for example), I'd avoid stored procedures completely.
You may in certain rare circumstances, after profiling, determine that some limited stored procedures are useful to you. This situation comes up way less than people think it does.
The main scenarios where you MUST have an SP are:
1) When you have very complex set of queries with heavy compile overhead and data drift low enough that recompiling is not needed on a regular basis.
2) When the "Only True" logic for accessing the specific data set is VERY complicated, needs to be accessed from several different codebases on different platforms (so writing multiple APIs in code is much more expensive).
Any other scenario, it's debatable, and can be decided one way or another.
I must also say that the other posters' arguments about versioning are not really such a big deal in my experience - having your SPs in version control is as easy as creating a "sql/db_name" directory structure and having a basic "database release" script which releases the SP code from the version control location to the database. Every company I worked for had some kind of setup like this, either a central one run by DBAs or a departmental one run by developers.
The one thing you want to avoid is to have your business logic spread across multiple tiers of your application. Database DDL and DML are difficult enough to keep in sync with an application code base as it is.
My recommendation is to create a good relational schema and put in all your constraints and triggers, so that the data retains integrity even if somebody goes to the database and tries to do something through some command line SQL.
Put all your business logic in an application or service that calls (static/dynamic) SQL and then wraps the business functionality you are trying to expose.
Stored-procedures have two purposes that I can think of.
An aid to simplifying data access.
The Stored Procedure does not have any business logic in it, it just knows about the structure of the data and exposes an interface to isolate accessing three tables and a view just to get a single piece of information.
Mapping the Domain Model to the Data Model, Stored Procedures can assist in making the Data Model look like a given Domain Model.
After the program has been completed and has been profiled there are often performance issues with the pre 1.0 release. Stored procedures do offer batching of SQL without traffic needing to go back and forth between the DBMS and the Application. That being said in rare and extreme cases due to performance a few business rules might need to be migrated to the Stored-Procedure side. Make sure to document any exceptions to the architectural philosophy in multiple prominent places.
Stored Procedures are ideal for:
Creating reusable abstractions over complex queries;
Enforcing specific types of insertions/updates to tables (if you also deny permissions to the table);
Performing privileged operations that the logged-in user wouldn't normally be allowed to do;
Guaranteeing a consistent execution plan;
Extending the capabilities of an ORM (batch updates, hierarchy queries, etc.)
Dynamic SQL is ideal for:
Variable search arguments or output columns:
Optional search conditions
Pivot tables
IN clauses with user-specified values
ORM implementations (most can use SPs, but can't be built entirely on them);
DDL and administrative scripts.
They solve different problems, really. Use whichever one is more appropriate to the task at hand, and don't restrict yourself to just one or the other. After you work on database code for a while you'll start to get a more intuitive feel for these things; you'll find yourself banging together some rat's nest of strings for a query and think, "this should really go in a stored procedure."
Final note: Because this question implies a certain level of inexperience with SQL, I feel obliged to say, don't forget that you still need to parameterize your queries when you write dynamic SQL. Parameters aren't just for stored procedures.
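To make the final note concrete, here is a minimal sketch of dynamic SQL with optional search conditions where every user-supplied value is still passed as a bound parameter. It uses Python's built-in sqlite3 module rather than the asker's C++/MySQL stack, and the `products` table and `search` function are invented for the example.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("bolt", "hardware", 0.5), ("nut", "hardware", 0.2), ("pen", "office", 1.5)],
)


def search(category: Optional[str] = None, max_price: Optional[float] = None) -> list:
    """Build the WHERE clause dynamically, but pass every value as a bound parameter."""
    clauses, params = [], []
    if category is not None:
        clauses.append("category = ?")
        params.append(category)
    if max_price is not None:
        clauses.append("price <= ?")
        params.append(max_price)
    sql = "SELECT name FROM products"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


cheap_hardware = search(category="hardware", max_price=0.3)
# A hostile value is treated as data, not as SQL text.
injected = search(category="x' OR '1'='1")
```

Only the fixed fragments of the statement are concatenated; the values never become part of the SQL string, so the injection attempt simply matches no rows.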
DS is more flexible. SP approach makes your system more manageable.

Relationships and constraints across databases

Hi there,
What are the possible ways in which I can maintain relationships across instances of databases? I know relationships across DBs are a bad approach, but I have to do it this way.
I am using SQL Server 2005.
Thanks
DEE
As far as I know it is impossible to do.
Your options are:
Set up replication between the databases so that the tables you want to define a relationship with are available in your local database. But that could get messy.
Create a UDF that does the checking and use that as a constraint.
Triggers.
However, this is such a bad idea that you really should re-evaluate whatever reasoning drove you to create multiple databases in the first place.
If you are looking for a business-level solution, you probably want one of the databases to act as the clearing house that decides, for business process purposes, the existence or deletion of (a) records identified by keys and (b) possibly scoped conditions expressed by particular rule data. This might or might not come under an IT heading, and may or may not involve programming.
I am assuming of course in answering this question that the constraints management problem is substantial enough that people couldn't manage it on the back of an envelope; I am trying to answer the question as stated as generally as possible.
The kinds of source code perhaps relevant to constraint management here are, I would suggest:
triggers and stored procedures in T-SQL
validation in 3GL/4GL app code writing to databases
validation in message-data-mapping and BRE business rule engine and other types of server systems
The constraint validation checks against either:
a separate set of keys and rule data
the main body of the data
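The "validation in app code" option, checking against the main body of the data, might look like the sketch below. It is an assumption-laden illustration: two in-memory SQLite connections stand in for two separate SQL Server databases, and the `customers`/`orders` tables and `insert_order` function are hypothetical.

```python
import sqlite3

# Two independent connections stand in for two separate databases.
customers_db = sqlite3.connect(":memory:")
orders_db = sqlite3.connect(":memory:")
customers_db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
customers_db.execute("INSERT INTO customers VALUES (1, 'acme')")
orders_db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")


def insert_order(customer_id: int) -> bool:
    """Insert an order only if the referenced customer exists in the other database."""
    exists = customers_db.execute(
        "SELECT 1 FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    if exists is None:
        return False  # would violate the cross-database relationship
    orders_db.execute("INSERT INTO orders (customer_id) VALUES (?)", (customer_id,))
    orders_db.commit()
    return True


ok = insert_order(1)
rejected = insert_order(99)
order_count = orders_db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Note that the check and the insert are not atomic across the two databases: a customer deleted between the two statements would still leave an orphaned order, so this kind of validation reduces violations rather than ruling them out.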
It is possible to set up live or near-to-live links of the whole data set or a subset of it, but this rarely goes on in practice for four reasons:
(a) performance, overall system reliability, maintenance and cost issues of the system tend to get big very quickly; it has to be worth doing
(b) the integration issues of adding further systems are bigger
(c) there is the potential for people unaware of the necessity to maintain database links to go into a system and update data in a way which they think is helpful but which causes knock-on problems; so the system as a whole then requires stricter management processes which can be expensive
(d) delayed data through batch updates is often (though it is not unproblematic in itself!) sufficient for most business systems
From the system analyst's point of view, enough checkpoints need to be set up in the processes to validate the data, and enough synchronisation and/or integrity violation management steps need to occur.

Is using MS SQL Identity good practice?

Is using MS SQL Identity good practice in enterprise applications? Doesn't it create difficulties in writing business logic and in migrating a database from one system to another?
Personally I couldn't live without identity columns and use them everywhere however there are some reasons to think about not using them.
Originally the main reason not to use identity columns, AFAIK, was distributed multi-database schemas (disconnected) using replication and/or various middleware components to move data. There just was no distributed synchronization machinery available and therefore no reliable means to prevent collisions. This has changed significantly as SQL Server does support distributing IDs. However, their use still may not map into more complex application controlled replication schemes.
They can leak information. Account ID's, Invoice numbers, etc. If I get an invoice from you every month I can ballpark the number of invoices you send or customers you have.
I run into issues all the time with merging customer databases and all sides still wanting to keep their old account numbers. This sometimes makes me question my addiction to identity fields :)
Like most things, the ultimate answer is "it depends": the specifics of a given situation should hold a lot of weight in your decision.
Yes, they work very well and are reliable, and perform the best. One big benefit of using identity fields vs non, is they handle all of the complex concurrency issues of multiple callers attempting to reserve new id's. This may seem like something trivial to code but it's not.
These links below offer some interesting information about identity fields and why you should use them whenever possible.
DB: To use identity column or not?
http://www.codeproject.com/KB/database/AgileWareNewGuid.aspx?display=Print
http://www.sqlmag.com/Article/ArticleID/48165/sql_server_48165.html
The question is always:
What are the chances that you're realistically going to migrate from one database to another? If you're building a multi-db app it's a different story, but most apps don't ever get ported over to a new db midstream - especially when they start out with something as robust as SQL Server.
The identity construct is excellent, and there are really very few reasons why you shouldn't use it. If you're interested, I wrote a blog article on some of the common myths surrounding identity values.
The IDENTITY Property: A Much-Maligned Construct in SQL Server
Yes.
They generally work as intended, and you can use the DBCC CHECKIDENT command to manipulate and work with them.
The most common idea of an identity is to provide an ordered list of numbers on which to base a primary key.
Edit: I was wrong about the fill factor, I didn't take into account that all of the inserts would happen on one side of the B-tree.
Also, In your revised question, you asked about migrating from one DB to another:
Identities are perfectly fine as long as the migrating is a one-way replication. If you have two databases that need to replicate to each other, a UniqueIdentifier column may be your best bet.
See: When are you truly forced to use UUID as part of the design? for a discussion on when to use a UUID in a database.
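The two-way replication problem can be illustrated without any database at all. In this sketch, plain Python dictionaries play the role of tables, `itertools.count(1)` imitates an identity column that each site seeds independently, and `uuid.uuid4()` plays the role of a UniqueIdentifier; `make_rows` is a made-up helper for the example.

```python
import itertools
import uuid


def make_rows(names, id_source):
    """Assign a key from id_source to every name, the way a database would on insert."""
    return {next(id_source): name for name in names}


# Two databases, each using its own identity-style counter starting at 1.
site_a = make_rows(["alice", "bob"], itertools.count(1))
site_b = make_rows(["carol", "dave"], itertools.count(1))
merged_identity = {**site_a, **site_b}  # keys 1 and 2 collide; site_b overwrites site_a

# The same data keyed by random UUIDs instead.
uuid_source = iter(uuid.uuid4, None)  # endless stream of fresh UUIDs
site_a_uuid = make_rows(["alice", "bob"], uuid_source)
site_b_uuid = make_rows(["carol", "dave"], uuid_source)
merged_uuid = {**site_a_uuid, **site_b_uuid}
```

With independent counters the merge silently loses rows, while the UUID-keyed rows merge cleanly. The same collision is what makes merging customer databases with identity-based account numbers painful.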
Good article on identities, http://www.simple-talk.com/sql/t-sql-programming/identity-columns/
IMO, migrating to another RDBMS is rarely needed these days. Even if it is needed, the best way to develop portable applications is to develop a layer of stored procedures isolating your application from proprietary features:
http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/02/24/writing-ansi-standard-sql-is-not-practical.aspx

Does an ORM integrate with existing applications or do I not understand?

Assume Hibernate for the ORM.
I'm not sure how to ask this. I want to build an application that can replace part of another. For example, say I have an application with various modules, called the "big" app. This application may handle HR, financial, purchases, skill sets, etc. But maybe, for whatever reason, I don't like the skill set module, but I like the rest of the application. I want to build an app that uses the same database that the rest of the "big" app uses but use my software as the front end for that piece.
I could build my app and have it hit the database directly with no ORM. My question is: is there an advantage to using an ORM here? I'm thinking there is, because if the "big" app goes away and another app is purchased, we could continue to use my version of skill set because I am using Hibernate instead of hitting things directly. I'm still learning, but I thought that my application used objects that I named and that in the case I just described I'd only have to change my mapping files and/or very little of my code.
Here is another example. I have a legacy application and legacy database. It uses database X. I decide that I no longer like the old terminal emulator application that is used to get the data and that I want a graphical version. Can I use Hibernate with my application and, when I finally decide to get rid of the legacy database and change to the latest Oracle or SQL Server, do so with minimal headache? Or is my database going to change so much that it wouldn't have mattered anyway (I'm suggesting that upon changing to a new database more information will want to be captured)?
I was hoping for comments, if I am misunderstanding why hibernate/ORM might or might not be a benefit.
Thank you.
I do not think you will get a huge benefit from Hibernate if the database schema changes to something completely different; you might have to change more than just your mapping, especially if more "structure" is added to the database (tables, columns and similar schema things). That said, if the database is structured mostly the same way, but let's say just the column and table names change and a couple of tables are merged or something like that, you can get by with just changing your mapping.
But I would really recommend using Hibernate for database agnosticism; that is a pretty easy path.
And even though it doesn't exactly help you if your entire database changes, it has such an incredible number of other strengths that I would choose it over direct DB access most of the time.
Lastly, you could think about using a service layer, such as the repository pattern, that abstracts away the data access, so the business logic of your application wouldn't need to change if the database changes.
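A minimal sketch of the repository idea, written in Python rather than Java/Hibernate purely for brevity. All names (`SkillRepository`, the two implementations, `register_skill`, the `skills` table) are hypothetical; the point is that the business function depends only on the interface, so the storage behind it can be swapped.

```python
import sqlite3
from typing import Dict, Optional, Protocol


class SkillRepository(Protocol):
    """What the business code needs from storage, independent of any database."""

    def add(self, employee: str, skill: str) -> None: ...
    def find(self, employee: str) -> Optional[str]: ...


class InMemorySkillRepository:
    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def add(self, employee: str, skill: str) -> None:
        self._rows[employee] = skill

    def find(self, employee: str) -> Optional[str]:
        return self._rows.get(employee)


class SqliteSkillRepository:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS skills (employee TEXT PRIMARY KEY, skill TEXT)")

    def add(self, employee: str, skill: str) -> None:
        self._conn.execute("INSERT OR REPLACE INTO skills VALUES (?, ?)", (employee, skill))

    def find(self, employee: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT skill FROM skills WHERE employee = ?", (employee,)
        ).fetchone()
        return row[0] if row else None


def register_skill(repo: SkillRepository, employee: str, skill: str) -> str:
    """Business logic that only talks to the repository interface."""
    repo.add(employee, skill)
    return f"{employee}: {repo.find(employee)}"


from_memory = register_skill(InMemorySkillRepository(), "dee", "sql")
from_sqlite = register_skill(SqliteSkillRepository(sqlite3.connect(":memory:")), "dee", "sql")
```

If the underlying database or schema changes, only the repository implementation needs rewriting; `register_skill` stays untouched.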
Switching from one DBMS to another (e.g. Oracle to SQL Server) is one thing that using an ORM would certainly make much easier.
As for switching from one "big app" to another "big app", I doubt if using an ORM would help that much. It's likely that the database structure and business logic would be different enough that you would find yourself rewriting lots of code anyways.
You can generate domain objects with Hibernate Tools; if you do that, then it will be painless and fast. However, if you write all the objects by hand, you will die. I think it's a good idea to rewrite part of the app and get to know Hibernate better.
I think it's generally a bad idea to make any decision based on the unknowns versus the knowns. Whether you're deciding on a data access/persistence strategy, what car to buy, or what college to go to, you should put the most weight on the things you know you want today, rather than worrying about what may or may not happen tomorrow.
So when considering ORMs, I wouldn't worry too much about things such as apps "going away" or DBMSs changing (unless that's either already been talked about, or there's a history of this in your company). I'm not saying that these things will never happen, but rather that they should take a back seat to the generally much more important considerations of maintainability, performance, and developer productivity.
So in short, choose an ORM based on its ability to solve the problems and satisfy the requirements that you have today.