Is using MS SQL Identity good practice? - sql

Is using MS SQL Identity good practice in enterprise applications? Doesn't it make it harder to build business logic and to migrate from one database to another?

Personally I couldn't live without identity columns and use them everywhere; however, there are some reasons to think about not using them.
Originally, the main reason not to use identity columns, AFAIK, was distributed (disconnected) multi-database schemas that use replication and/or various middleware components to move data. There was no distributed synchronization machinery available and therefore no reliable means to prevent collisions. This has changed significantly, as SQL Server does support distributing IDs. However, their use still may not map onto more complex application-controlled replication schemes.
They can leak information: account IDs, invoice numbers, etc. If I get an invoice from you every month, I can ballpark the number of invoices you send or customers you have.
I run into issues all the time when merging customer databases where all sides still want to keep their old account numbers. This sometimes makes me question my addiction to identity fields :)
Like most things, the ultimate answer is "it depends": the specifics of a given situation should carry a lot of weight in your decision.
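To make the information-leak point above concrete, here is a minimal sketch in Python. The invoice numbers are made up for illustration, and it assumes the numbers come from a plain sequential identity with no gaps, which real systems do not guarantee.

```python
def estimate_monthly_volume(observed_ids):
    """Average gap between consecutive monthly invoice numbers.

    Assumes one observation per month from a sequential, gap-free counter.
    """
    gaps = [later - earlier for earlier, later in zip(observed_ids, observed_ids[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical invoice numbers received in three consecutive months.
observed = [10432, 11287, 12101]
estimate = estimate_monthly_volume(observed)
print(f"Roughly {estimate:.0f} invoices issued per month")
```

A customer needs nothing more than two or three of their own invoices to get this estimate, which is why some people expose a separate, non-sequential number to the outside world.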

Yes, they work very well, are reliable, and perform the best. One big benefit of identity fields over hand-rolled keys is that they handle all of the complex concurrency issues of multiple callers attempting to reserve new IDs. This may seem like something trivial to code, but it's not.
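A rough sketch of why hand-rolled ID reservation is harder than it looks. It uses Python's built-in sqlite3 module as a stand-in (SQLite's AUTOINCREMENT plays the role of IDENTITY here; the table names are made up), and it simulates two callers sequentially rather than with real threads so the outcome is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE manual_ids (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE auto_ids (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")


def next_manual_id(connection):
    """Naive 'read MAX(id) + 1' reservation, with no locking at all."""
    row = connection.execute("SELECT COALESCE(MAX(id), 0) + 1 FROM manual_ids").fetchone()
    return row[0]


# Both callers read the "next" id before either of them has inserted.
id_a = next_manual_id(conn)
id_b = next_manual_id(conn)

conn.execute("INSERT INTO manual_ids (id, name) VALUES (?, ?)", (id_a, "caller A"))
try:
    conn.execute("INSERT INTO manual_ids (id, name) VALUES (?, ?)", (id_b, "caller B"))
    collided = False
except sqlite3.IntegrityError:
    # Caller B was handed the same id as caller A.
    collided = True

# Letting the database assign the key avoids the read-then-write window.
auto_a = conn.execute("INSERT INTO auto_ids (name) VALUES (?)", ("caller A",)).lastrowid
auto_b = conn.execute("INSERT INTO auto_ids (name) VALUES (?)", ("caller B",)).lastrowid
```

With the manual scheme you would have to add locking or retries yourself; with a database-generated key that work is already done for you.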
These links below offer some interesting information about identity fields and why you should use them whenever possible.
DB: To use identity column or not?
http://www.codeproject.com/KB/database/AgileWareNewGuid.aspx?display=Print
http://www.sqlmag.com/Article/ArticleID/48165/sql_server_48165.html

The question is always:
What are the chances that you're realistically going to migrate from one database to another? If you're building a multi-db app it's a different story, but most apps don't ever get ported over to a new db midstream - especially when they start out with something as robust as SQL Server.
The identity construct is excellent, and there are really very few reasons why you shouldn't use it. If you're interested, I wrote a blog article on some of the common myths surrounding identity values.
The IDENTITY Property: A Much-Maligned Construct in SQL Server

Yes.
They generally work as intended, and you can use the DBCC CHECKIDENT command to manipulate and work with them.
The most common idea of an identity is to provide an ordered list of numbers on which to base a primary key.
Edit: I was wrong about the fill factor, I didn't take into account that all of the inserts would happen on one side of the B-tree.
Also, in your revised question, you asked about migrating from one DB to another:
Identities are perfectly fine as long as the migration is a one-way replication. If you have two databases that need to replicate to each other, a UniqueIdentifier column may be your best bet.
See: When are you truly forced to use UUID as part of the design? for a discussion on when to use a UUID in a database.
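A small sketch of the two-way replication problem, written in Python rather than T-SQL. The two "sites" are plain dictionaries, the counters stand in for identity columns that each start at 1, and uuid.uuid4() stands in for a UniqueIdentifier default; all names are made up for illustration.

```python
import itertools
import uuid


def make_identity_rows(site_name, count):
    """Rows keyed by a per-site counter starting at 1, like an independent identity."""
    counter = itertools.count(1)
    return {next(counter): f"{site_name}-row-{n}" for n in range(count)}


def make_guid_rows(site_name, count):
    """Rows keyed by a randomly generated GUID."""
    return {uuid.uuid4(): f"{site_name}-row-{n}" for n in range(count)}


identity_a = make_identity_rows("A", 3)
identity_b = make_identity_rows("B", 3)
# Both sites handed out 1, 2 and 3, so a merge has to renumber one side.
colliding_keys = sorted(set(identity_a) & set(identity_b))

guid_a = make_guid_rows("A", 3)
guid_b = make_guid_rows("B", 3)
# GUID keys can simply be combined; nothing is overwritten.
merged = {**guid_a, **guid_b}
```

Identity ranges or different seeds per site are another way around this, but they need coordination up front, which GUIDs do not.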

Good article on identities, http://www.simple-talk.com/sql/t-sql-programming/identity-columns/
IMO, migrating to another RDBMS is rarely needed these days. Even if it is needed, the best way to develop portable applications is to develop a layer of stored procedures isolating your application from proprietary features:
http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/02/24/writing-ansi-standard-sql-is-not-practical.aspx

Related

Consistency/Atomicity (or even ACID) properties in multiple SQL/NoSQL databases architecture

I'm used to working with a single database (say PostgreSQL or ElasticSearch).
But currently I'm using a mix (PG and ES) in a prototype app and may throw other kinds of databases into the mix (e.g. Redis).
Say some piece of data needs to be persisted to each database in a different way.
How do you keep the system consistent when one of the components/databases fails?
Example scenario that I'm facing:
Data is updated in PostgreSQL, but ElasticSearch is unavailable.
At this point, the system is inconsistent, as I should have updated both databases.
As I'm using an SQL db, I can simply abort the transaction to put the system back in its previous consistent state.
But what is the best way to keep the system consistent?
Check every time that the value has been persisted in all databases?
In case of failure, restore the previous state? But some NoSQL databases have no transaction/ACID mechanism, so I can't revert to the previous state as easily.
Additionally, if multiple databases must be kept in sync, is there any good practice to follow, like adding some kind of "version" metadata (whether a timestamp or a home-made incrementing version number) so you can put your databases back in sync? (Not talking about CouchDB, where it is built in!)
Moreover, the databases are not all updated atomically, so some parts are inconsistent for a short period. I think it depends on the business of the app, but does anyone have thoughts about the problems that may occur or how to fix them? I guess it must be tough and depends a lot on the configuration (for maybe very few real benefits).
I guess this may be a common architecture issue, but I'm having trouble finding information on the subject.
Keep things simple.
Search engines can and will lag behind sometimes. You may fight it. You may embrace it. It's fine, and most of the time it's acceptable.
Don't mix the data. If you use Redis for sessions - good. Don't store stuff from database A in B and vice versa.
Select proper database with ACID and strong consistency for your Super Important Business Data™®.
Again, do not mix the data.
Using more than one database technology in one product is a decision one shouldn't make lightly. The more technologies you use, the more complex your project will become in development, deployment, maintenance and administration. Also, every database technology will become an individual point of failure. That means it is often much wiser to stick to one technology, even when it means that you need to make some compromises.
But when you have good(!) reason to use multiple DBMS, you should try to keep them as separated as possible. Avoid placing related data spanning multiple databases. When possible, no feature should require more than one DBMS to work (preferably a failure of the DBMS would only affect those features which use it). Storing redundant data in two different DBMS should also be avoided.
When you can't avoid redundancies and relationships spanning multiple DBMS, you should decide on one system to be the single source of truth (preferably one which you trust most regarding consistency). When there are inconsistencies between systems, they should be resolved by synchronizing the data with the SSOT.
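A minimal sketch of resolving inconsistencies against a single source of truth, combined with the "version" metadata idea from the question. Both stores are plain dictionaries here; in practice the source of truth might be PostgreSQL and the replica a search index, and the (version, value) layout is an assumption made for this example.

```python
def resync(source_of_truth, replica):
    """Bring the replica in line with the source of truth.

    Both stores map key -> (version, value). Returns the keys that were repaired.
    """
    repaired = []
    for key, (version, value) in source_of_truth.items():
        # Missing or stale in the replica: copy the authoritative record over.
        if key not in replica or replica[key][0] < version:
            replica[key] = (version, value)
            repaired.append(key)
    for key in list(replica):
        # Present in the replica only: the source of truth wins, so drop it.
        if key not in source_of_truth:
            del replica[key]
            repaired.append(key)
    return repaired


source = {"doc1": (3, "new title"), "doc2": (1, "unchanged")}
replica = {"doc1": (2, "old title"), "doc2": (1, "unchanged"), "doc9": (1, "orphan")}
repaired = resync(source, replica)
```

Run periodically, or after a known outage, something like this lets the secondary store lag without the lag becoming permanent.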

SQL Azure Federations and the Atomic Unit Identity

I've started work on my first Azure application, and I'm learning a lot as I go. One of the features I discovered recently was Federations in SQL Azure, essentially the SQL Azure sharding implementation so we can scale horizontally.
My project started using SQL Server, and was already largely grouped by user Profile, so I decided that makes the most sense to federate on. I've created the federation, including all of the child tables with one snag - Identity is not supported. I get why it's not supported, what I'm not sure on is what best to replace it with. This seems like a huge problem that someone else must have solved, but I haven't been able to find much.
I could just use UniqueIdentifier, but I read that can be a pain to split on. I'm also not too sure of what other performance issues I could run into using a GUID as my Primary Key for federated tables.
I'm using this with Entity Framework, but haven't got to the point of making that federation friendly yet. From what I can tell, it's not much more complicated than executing some code to select your federation before writing your LINQ query, but I'll cross that bridge when I get to it.
For the moment, I have no idea how best to actually add items to my federation, because there is no good solution to generating an identity.
Any advice would be greatly appreciated.
I use GUIDs with SQL Azure Federations; they are almost always the best choice when sharding data. If you use Identity across many federation members, you will end up with duplicate primary key values. When you need to merge the data back, or archive it, how do you deal with those records?
People think GUIDs perform poorly on insert, especially when used as a clustered index. But I have never run into this problem. Or I should say, there are many other places to tune before this one.
I can't speak to the EF question, but I can comment on the idea of using UniqueIdentifier as your key type. This, in my mind, is the best choice. UniqueIdentifier is actually very easy to split on... the reason people think it's hard is that they forget what a UniqueIdentifier is. The GUID that we all know and love is a hex representation of a 128-bit integer. This means that we can use standard integer operations with it, and thus it's actually as easy to work with as the int (auto number) you know and love.
While it's not specifically about SQL Azure federations (it's about Windows Azure Storage) this blog post of mine on using the GUID type for sharding should give you all you need to know.
http://www.syringe.net.nz/CommentView,guid,cebe3e19-85e6-4d5b-bc24-afb6f66aaeb1.aspx
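To illustrate the "a GUID is just a 128-bit integer" point, here is a short Python sketch. The two shard functions are hypothetical helpers, not part of any Azure API, and they use plain integer order; how a particular database orders uniqueidentifier values internally is a separate matter that this sketch does not model.

```python
import uuid


def modulo_shard(key: uuid.UUID, shard_count: int) -> int:
    """Pick a shard from the remainder of the GUID's integer value."""
    return key.int % shard_count


def range_shard(key: uuid.UUID, shard_count: int) -> int:
    """Split the 128-bit key space into equal contiguous ranges."""
    width = (1 << 128) // shard_count
    # Clamp so the very top of the key space still lands in the last shard.
    return min(key.int // width, shard_count - 1)


sample = uuid.UUID("cebe3e19-85e6-4d5b-bc24-afb6f66aaeb1")
print(modulo_shard(sample, 4), range_shard(sample, 4))
```

Range-based splitting is the variant that lets you later cut one range into two without touching keys in the other ranges.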

How to go from a full SQL querying to something like a NoSQL?

In one of my processes I have a SQL query that takes 10-20% of the total execution time. This query filters my database and loads a list of PricingGrid objects.
So I want to improve this performance.
So far I have come up with 2 solutions:
Use a NoSQL solution; AFAIK these are good at improving read performance.
But the migration seems hard and needs a lot of work (like importing the data from SQL Server to NoSQL on a regular basis)
I don't have any knowledge of them; I don't even know which one I should use (the first I'd try is RavenDB because I follow Ayende and it's made by the .NET community).
I might have to change some things in my model to make my objects suitable for a NoSQL database
Load all my PricingGrid objects in memory (in a static IEnumerable)
This might be a problem when my server doesn't have enough memory to load everything
I might reinvent the wheel (indexes...) already invented by the NoSQL providers
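For a sense of what the in-memory option involves, here is a minimal sketch. It is written in Python rather than .NET, and the PricingGrid fields and the (product_code, region) lookup key are invented for the example, since the question does not show the real filter.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PricingGrid:
    # Hypothetical fields; the real class is not shown in the question.
    grid_id: int
    product_code: str
    region: str
    price: float


class PricingGridCache:
    """Loads every grid once and indexes it by the columns the query filters on."""

    def __init__(self, grids):
        self._by_key = defaultdict(list)
        for grid in grids:
            self._by_key[(grid.product_code, grid.region)].append(grid)

    def find(self, product_code, region):
        # Dictionary lookup instead of scanning the whole list on every call.
        return list(self._by_key.get((product_code, region), []))


cache = PricingGridCache([
    PricingGrid(1, "P-100", "EU", 9.5),
    PricingGrid(2, "P-100", "US", 11.0),
    PricingGrid(3, "P-200", "EU", 20.0),
])
matches = cache.find("P-100", "EU")
```

This is the "reinventing indexes" part: every extra filter combination needs its own dictionary, and the cache has to be refreshed whenever the table changes.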
I think I'm not the first one wondering about this, so what would be the best solution? Are there any tools that could help me?
.net 3.5, SQL Server 2005, windows server 2005
Migrating your data from SQL is only the first step.
Moving to a document store (like RavenDB or MongoDB) also means that you need to:
Denormalize your data
Perform schema validation in your code
Handle concurrency of complex operations in your code since you no longer have transactions (at least not the same way)
Perform rollbacks in the event of partial commits (changes)
Depending on your updates, reads and network model you might also need to handle conflicts
You provided very limited information but it sounds like your needs include a single database server and that your data fits well in the relational model.
In such a case I would vote against a NoSQL solution; it is more likely that you can speed up your queries with database optimizations and still retain all the added value of an RDBMS.
Non-relational databases are tools for a specific job (no matter how they are sold). If you need them, it is usually because your data doesn't fit well in the relational model or because you need to distribute your data over multiple machines (for size or availability). For instance, I use MongoDB for a write-intensive, high-throughput job management application. It is centralized and the data is very transient, so the "cost" of having low durability is acceptable. This doesn't sound like the case for you.
If you prefer to use a NoSQL solution, perhaps you should try Memcached+MySQL (InnoDB). This will allow you to get the speed benefits of an in-memory cache (in the form of a memcached daemon plugin) with the underlying protection and capabilities of an RDBMS (MySQL). It should also ease data migration and somewhat reduce the amount of changes required in your code.
I myself have never used it; I find that I either need NoSQL for the reasons I stated above or can optimize the RDBMS using stored procedures, indexes and table views in a way that is sufficient for my needs.
Asaf has provided great information in regards to the usage of NoSQL and when it is most appropriate. Given that your main concern was performance, I would tend to agree with his opinion - it would take you much more time and effort to adopt a completely new (and very different) data persistence platform than it would to trick out your SQL Server cluster. That said, my answer is mainly to address the "how" part of your question.
Addressing misunderstandings:
Denormalizing Data - You do not need to manually denormalize your existing data. This will be done for you when it is migrated over. More than anything you need to simply think about your data in a different fashion - root aggregates, entity and value types, etc.
Concurrency/Transactions - Transactions are possible in both Mongo and Raven, they are simply done in a different fashion. One of the inherent ways Raven does this is by using an ORM-like "unit of work" pattern with its RavenSession objects. Yes, your data validation needs to be done in code, but you already should be doing it there anyway. In my experience this is an over-hyped con.
How:
Install Raven or Mongo on a primary server, run it as a service.
Create or extend an existing application that uses the database you intend to port. This application needs all the model classes/libraries that your SQL database provides persistence for.
a. In your "data layer" you likely have a repository class somewhere. Extract an interface from this, and use it to build another repository class for your Raven/Mongo persistence. Both DBs have plenty of good documentation for using their APIs to push/pull/update changes in the document graphs. It's pretty damn simple.
b. Load your SQL data into C# objects in memory. Pull back your top-level objects (just the entities) and load their inner collections and related data in memory. Your repository is probably already doing this (e.g. when fetching an Order object, ensure that not only its properties but also associated collections like Items are loaded in memory).
c. Instantiate your Raven/Mongo repository and push the data to it. Primary entities become "top level documents" or "root aggregates" serialized in JSON, and their collections' data nested within. Save changes and close the repository. Note: You may break this step down into as many little pieces as your data deems necessary.
Once your data is migrated, play around with it and ensure you are satisfied. You may want to modify your application Models a little to adjust the way they are persisted to Raven/Mongo - for instance you may want to make both Orders and Items top-level documents and simply use reference values (much like relationships in RDBMS systems). Watch out here though, as doing so sort of goes against the principle and performance behind NoSQL, as now you have to tap the DB twice to get the Order and the Items.
If satisfied, shard/replicate your mongo/raven servers across your remaining available server boxes.
Obviously there are tons of little details I did not explain, but that is the general process, and much of it depends on the applications already consuming the database and may be tricky if more than one app/system talks to it.
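The repository-based migration in steps a to c can be sketched roughly as follows. This is Python rather than C#, the "SQL" side is faked with lists of row dictionaries, and the "document store" is a dictionary of JSON strings, so none of this is the Raven or Mongo API; the class names and the orders/{id} key format are made up.

```python
import json
from abc import ABC, abstractmethod


class OrderRepository(ABC):
    """The interface extracted from the existing repository class (step a)."""

    @abstractmethod
    def load_all(self): ...

    @abstractmethod
    def save(self, order): ...


class RelationalOrderRepository(OrderRepository):
    """Stand-in for the SQL-backed repository; rows are plain dictionaries."""

    def __init__(self, order_rows, item_rows):
        self._order_rows = order_rows
        self._item_rows = item_rows

    def load_all(self):
        # Step b: load each entity together with its inner collection.
        orders = []
        for row in self._order_rows:
            items = [{"sku": i["sku"], "qty": i["qty"]}
                     for i in self._item_rows if i["order_id"] == row["id"]]
            orders.append({"id": row["id"], "customer": row["customer"], "items": items})
        return orders

    def save(self, order):
        raise NotImplementedError("the source side is read-only in this sketch")


class DocumentOrderRepository(OrderRepository):
    """Stand-in for a document store: one JSON document per root aggregate."""

    def __init__(self):
        self.documents = {}

    def load_all(self):
        return [json.loads(doc) for doc in self.documents.values()]

    def save(self, order):
        self.documents[f"orders/{order['id']}"] = json.dumps(order)


def migrate(source, target):
    """Step c: push every loaded aggregate into the new store."""
    count = 0
    for order in source.load_all():
        target.save(order)
        count += 1
    return count


sql_repo = RelationalOrderRepository(
    [{"id": 1, "customer": "Acme"}],
    [{"order_id": 1, "sku": "A", "qty": 2}, {"order_id": 1, "sku": "B", "qty": 1}],
)
doc_repo = DocumentOrderRepository()
migrated = migrate(sql_repo, doc_repo)
```

Because both classes share one interface, the application code that consumes the repository does not need to know which store sits behind it during the transition.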
Lastly, just to reiterate what Asaf said... learn as much as you can about NoSQL and its best use-cases. It is an amazing tool, but not a golden solution for all data persistence. In your case, try to really find the bottlenecks in your current solution and see if they are solvable. As one of my systems guys says, "technology for technology's sake is bullshit".

How to create a multi-tenant database with shared table structures?

Our software currently runs on MySQL. The data of all tenants is stored in the same schema. Since we are using Ruby on Rails we can easily determine which data belongs to which tenant. However there are some companies of course who fear that their data might be compromised, so we are evaluating other solutions.
So far I have seen three options:
Multi-Database (each tenant gets its own - nearly the same as 1 server per customer)
Multi-Schema (not available in MySQL, each tenant gets its own schema in a shared database)
Shared Schema (our current approach, maybe with an additional identifying column on each record)
Multi-Schema is my favourite (considering costs). However, creating a new account and doing migrations seem to be quite painful, because I would have to iterate over all schemas and change their tables/columns/definitions.
Q: Multi-Schema seems to be designed to have slightly different tables for each tenant - I don't want this. Is there any RDBMS which allows me to use a multi-schema multi-tenant solution, where the table structure is shared between all tenants?
P.S. By multi I mean something like ultra-multi (10.000+ tenants).
"However there are some companies of course who fear that their data might be compromised, so we are evaluating other solutions."
This is unfortunate, as customers sometimes suffer from a misconception that only physical isolation can offer enough security.
There is an interesting MSDN article, titled Multi-Tenant Data Architecture, which you may want to check. This is how the authors addressed the misconception towards the shared approach:
"A common misconception holds that only physical isolation can provide an appropriate level of security. In fact, data stored using a shared approach can also provide strong data safety, but requires the use of more sophisticated design patterns."
As for technical and business considerations, the article makes a brief analysis on where a certain approach might be more appropriate than another:
The number, nature, and needs of the tenants you expect to serve all affect your data architecture decision in different ways. Some of the following questions may bias you toward a more isolated approach, while others may bias you toward a more shared approach.
How many prospective tenants do you expect to target? You may be nowhere near being able to estimate prospective use with authority, but think in terms of orders of magnitude: are you building an application for hundreds of tenants? Thousands? Tens of thousands? More? The larger you expect your tenant base to be, the more likely you will want to consider a more shared approach.
How much storage space do you expect the average tenant's data to occupy? If you expect some or all tenants to store very large amounts of data, the separate-database approach is probably best. (Indeed, data storage requirements may force you to adopt a separate-database model anyway. If so, it will be much easier to design the application that way from the beginning than to move to a separate-database approach later on.)
How many concurrent end users do you expect the average tenant to support? The larger the number, the more appropriate a more isolated approach will be to meet end-user requirements.
Do you expect to offer any per-tenant value-added services, such as per-tenant backup and restore capability? Such services are easier to offer through a more isolated approach.
UPDATE: Further to update about the expected number of tenants.
That expected number of tenants (10k) should exclude the multi-database approach, for most, if not all scenarios. I don't think you'll fancy the idea of maintaining 10,000 database instances, and having to create hundreds of new ones every day.
From that parameter alone, it looks like the shared-database, single-schema approach is the most suitable. The fact that you'll be storing just about 50Mb per tenant, and that there will be no per-tenant add-ons, makes this approach even more appropriate.
The MSDN article cited above mentions three security patterns that tackle security considerations for the shared-database approach:
Trusted Database Connections
Tenant View Filter
Tenant Data Encryption
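As a rough illustration of the filtering idea behind the shared-schema approach, here is a sketch using Python's built-in sqlite3 module. It is not the MSDN implementation: SQLite has no per-login identity to build a view on, so the tenant id is passed as a parameter to a single query helper instead, and the invoices table is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "id INTEGER PRIMARY KEY, tenant_id INTEGER NOT NULL, amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO invoices (tenant_id, amount) VALUES (?, ?)",
    [(1, 10.0), (1, 25.0), (2, 99.0)],
)


def invoices_for_tenant(connection, tenant_id):
    """The only read path the application uses; it always filters by tenant."""
    return connection.execute(
        "SELECT id, amount FROM invoices WHERE tenant_id = ? ORDER BY id",
        (tenant_id,),
    ).fetchall()


tenant_one = invoices_for_tenant(conn, 1)
tenant_two = invoices_for_tenant(conn, 2)
```

The point is that no code path can reach the shared table without a tenant filter; on a server database the same guarantee can be pushed down into views and permissions.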
When you are confident in your application's data safety measures, you will be able to offer your clients a Service Level Agreement that provides strong data safety guarantees. In your SLA, apart from the guarantees, you could also describe the measures that you would be taking to ensure that data is not compromised.
UPDATE 2: Apparently the Microsoft guys moved / made a new article regarding this subject, the original link is gone and this is the new one: Multi-tenant SaaS database tenancy patterns (kudos to Shai Kerer)
Below is a link to a white-paper on Salesforce.com about how they implement multi-tenancy:
http://www.developerforce.com/media/ForcedotcomBookLibrary/Force.com_Multitenancy_WP_101508.pdf
They have 1 huge table with 500 string columns (Value0, Value1, ... Value500). Dates and numbers are stored as strings in a format such that they can be converted to their native types at the database level. There are metadata tables that define the shape of the data model, which can be unique per tenant. There are additional tables for indexing, relationships, unique values etc.
Why the hassle?
Each tenant can customize their own data schema at run-time without having to make changes at the database level (alter table etc). This is definitely the hard way to do something like this but is very flexible.
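A toy version of the metadata-driven layout described above, in Python. The metadata dictionary, the tenant names and the field names are all invented for the example; it only shows the idea of generic string slots plus per-tenant metadata and says nothing about how Salesforce actually implements it.

```python
from datetime import date

# Per-tenant metadata: logical field name -> (generic slot, logical type).
FIELD_METADATA = {
    ("tenant_a", "Invoice"): {"amount": ("Value0", "number"), "due": ("Value1", "date")},
    ("tenant_b", "Invoice"): {"customer": ("Value0", "text")},
}

# How each logical type is recovered from its string representation.
DECODERS = {"text": str, "number": float, "date": date.fromisoformat}


def decode_row(tenant, object_type, row):
    """Turn a generic row of string slots into a typed record for one tenant."""
    fields = FIELD_METADATA[(tenant, object_type)]
    return {name: DECODERS[kind](row[slot]) for name, (slot, kind) in fields.items()}


record_a = decode_row("tenant_a", "Invoice", {"Value0": "19.99", "Value1": "2024-03-01"})
record_b = decode_row("tenant_b", "Invoice", {"Value0": "Acme"})
```

Note that both tenants use the Value0 slot for different things; adding a field for one tenant is a metadata change, not an ALTER TABLE.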
My experience (albeit with SQL Server) is that multi-database is the way to go, where each client has their own database. So although I have no MySQL or Ruby on Rails experience, I'm hoping my input might add some value.
The reasons why include :
data security/disaster recovery. Each company's data is stored entirely separately from the others', reducing the risk of data being compromised (for example, if you introduce a code bug that mistakenly looks at another client's data), and limiting the potential loss to one client if one particular database gets corrupted. The perceived security benefits to the client are even greater (added bonus side effect!)
scalability. Essentially you'd be partitioning your data out to enable greater scalability - e.g. databases can be put on to different disks, you could bring multiple database servers online and move databases around easier to spread the load.
performance tuning. Suppose you have one very large client and one very small. Usage patterns, data volumes etc. can vary wildly. You can tune/optimise easier for each client should you need to.
I hope this does offer some useful input! There are more reasons, but my mind went blank. If it kicks back in, I'll update :)
EDIT:
Since I posted this answer, it's now clear that we're talking 10,000+ tenants. My experience is in hundreds of large scale databases - I don't think 10,000 separate databases is going to be too manageable for your scenario, so I'm now not favouring the multi-db approach for your scenario. Especially as it's now clear you're talking small data volumes for each tenant!
Keeping my answer here anyway, as it may be of use to other people in a similar boat (with fewer tenants).
As you mention, one database per tenant is an option, and it comes with some larger trade-offs. It can work well at smaller scale, such as single-digit or low tens of tenants, but beyond that it becomes harder to manage, both for migrations and for simply keeping the databases up and running.
The per-schema model isn't only useful when each tenant has a unique schema, but running migrations across all tenants still becomes difficult, and at thousands of schemas Postgres can start to have trouble.
A more scalable approach is absolutely having tenants randomly distributed, stored in the same database, but across different logical shards (or tables). Depending on your language, there are a number of libraries that can help with this. If you're using Rails, there is a library to enforce tenancy, acts_as_tenant; it helps ensure your tenant queries only pull back that tenant's data. There's also the apartment gem; though it uses the schema model, it does help with migrations across all schemas. If you're using Django there are a number, but one of the more popular ones seems to be across schemas. All of these help more at the application level. If you're looking for something more at the database level directly, Citus focuses on making this type of sharding for multi-tenancy work more out of the box with Postgres.

Does an ORM integrate with existing applications or do I not understand?

Assume Hibernate for the ORM.
I'm not sure how to ask this. I want to build an application that can replace part of another. For example, say I have an application with various modules, called the "big" app. This application may handle HR, financial, purchases, skill sets, etc. But maybe, for whatever reason, I don't like the skill set module, but I like the rest of the application. I want to build an app that uses the same database that the rest of the "big" app uses but use my software as the front end for that piece.
I could build my app and have it hit the database directly with no ORM. My question is: is there an advantage to using an ORM here? I'm thinking there is, because if the "big" app goes away and another app is purchased, we could continue to use my version of skill set because I am using Hibernate instead of hitting things directly. I'm still learning, but I thought that my application used objects that I named and that in the case I just described I'd only have to change my mapping files and/or very little of my code.
Here is another example. I have a legacy application and legacy database. It uses database X. I decide that I no longer like the old terminal emulator application that is used to get the data and that I want a graphical version. I can use Hibernate with my application, and when I finally decide to get rid of the legacy database and change to the latest Oracle or SQL Server, I can do so with minimal headache? Or is my database going to change so much that it wouldn't have mattered anyway (I'm suggesting that upon changing to a new database, more information will want to be captured)?
I was hoping for comments, if I am misunderstanding why hibernate/ORM might or might not be a benefit.
Thank you.
I do not think you will get a huge benefit from Hibernate if the database schema changes to something completely different; you might have to change more than just your mapping, especially if more "structure" is added to the database (tables, columns and similar schema things). That said, if the database is structured mostly the same way, but let's say just the column names and table names change and a couple of tables are merged or something like that, you can get by with just changing your mapping.
But I would really recommend using Hibernate for database agnosticism; that is a pretty easy path.
And even though it doesn't exactly help you if your entire database changes, it has such an incredible number of other strengths that I would choose it over direct DB access most of the time.
Lastly, you could think about using a service layer, such as the repository pattern, that abstracts away the data access, so the business logic of your application wouldn't need to change if the database changes.
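A minimal sketch of that last suggestion, in Python rather than Java/Hibernate. The skill set example comes from the question; the class names, the row layouts and the has_skill function are made up to show the shape of the idea.

```python
from typing import List, Protocol


class SkillRepository(Protocol):
    """What the business logic needs from the data layer, and nothing more."""

    def skills_for(self, employee_id: int) -> List[str]: ...


class LegacySkillRepository:
    """Stand-in for access to the current "big" app schema: (employee_id, skill) rows."""

    def __init__(self, rows):
        self._rows = rows

    def skills_for(self, employee_id):
        return [skill for emp, skill in self._rows if emp == employee_id]


class ReplacementSkillRepository:
    """Stand-in for a differently shaped schema after the "big" app is replaced."""

    def __init__(self, skills_by_employee):
        self._skills = skills_by_employee

    def skills_for(self, employee_id):
        return list(self._skills.get(employee_id, []))


def has_skill(repo: SkillRepository, employee_id: int, skill: str) -> bool:
    # Business rule: skill names are compared case-insensitively.
    return skill.lower() in (s.lower() for s in repo.skills_for(employee_id))


legacy = LegacySkillRepository([(1, "Java"), (1, "SQL"), (2, "COBOL")])
replacement = ReplacementSkillRepository({1: ["java", "sql"]})
```

has_skill stays the same whichever repository is passed in; only the repository class has to be rewritten when the underlying database changes, with or without an ORM inside it.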
Switching from one DBMS to another (ala Oracle to SQL Server) is one thing that using an ORM would certainly make much easier.
As for switching from one "big app" to another "big app", I doubt if using an ORM would help that much. It's likely that the database structure and business logic would be different enough that you would find yourself rewriting lots of code anyways.
You can generate domain objects with Hibernate Tools; if you do that, then it will be painless and fast. However, if you write all the objects by hand, you will die. I think it's a good idea to rewrite part of the app and get to know Hibernate better.
I think it's generally a bad idea to make any decision based on the unknowns versus the knowns. Whether you're deciding on a data access/persistence strategy, what car to buy, or what college to go to, you should put the most weight on the things you know you want today, rather than worrying about what may or may not happen tomorrow.
So when considering ORMs, I wouldn't worry too much about things such as apps "going away" or DBMSs changing (unless that's either already been talked about, or there's a history of this in your company). I'm not saying that these things will never happen, but rather that they should take a back seat to the generally much more important considerations of maintainability, performance, and developer productivity.
So in short, choose an ORM based on its ability to solve the problems and satisfy the requirements that you have today.