Virtual instance of a database schema on DB2 - sql

My question is pretty straight forward:
I'm developing a system which will be installed at several companies. I don't want to use a field in every table to separate one company's data from another's; I want a more scalable solution.
My idea is to model a single database schema and have an "instance" of it for every company in my system, so that I can back up a single database or base schema and also propagate changes to every schema just by modifying the base schema. In short, there would be only one base or physical schema and many data instances of it, each referenced by a unique schema name.
Is this possible? What is this technique called? (I would be using this concept for the first time) How do I implement it on IBM DB2?

If I understand you correctly, I don't think it's possible. I'm pretty sure it's not practical.
Imagine adding a column and an index on that column to your base schema. In one company, where they don't have a lot of data, that change is fairly trivial. But a company that has many millions of rows might find itself suddenly out of disk space.
A DBA might manage that problem by moving an index or table to a different disk or a different filesystem, but that might not be possible if the database were "instanced" off your base schema. (Depends on what you mean by instanced, which has a DB2-specific meaning as well as a more general OOP meaning.)
The term you're looking for, I think, is "multi-tenant". (Tenant and customer mean just about the same thing here.) SO has a "multi-tenant" tag; I'll add it for you, and you can delete it if it's not appropriate. IBM's developerWorks library has an introductory article on the various multi-tenant architectures.
Multi-tenant architecture ranges from "shared nothing" (each tenant gets their own database) to "shared everything" (all tenants share the tables in a single database, each row has a column that identifies the tenant who "owns" that row).
Between "shared nothing" and "shared everything" is "shared schema" (tenants share a database; each tenant has a private schema).
The most obvious differences are in customization, disaster recovery, and data isolation. These differences also drive the choice of architecture.
"Shared nothing" makes customization easy; changing the database can't affect any other tenants. Disaster recovery is simple, too; you just restore a whole database from backup. Permissions can be applied at the database level for near-perfect isolation.
"Shared everything" makes customization hard; every change affects every tenant to some degree. Disaster recovery is also hard; for a single-company disaster, you have to restore some rows--just the rows that single company "owns"--to every table. Data isolation is harder, too, because every view and query you deploy has to correctly account for the tenant identifier. Forget that just once, and you might be out of business.

Related

Creating "Archive" database to unload application "Main" database

I want to create a web application that is supposed to contain a lot of data. I want to ask if any of you have ever met a system that contained two databases - main and archive. I want to create a mechanism that will move old data from the main database to the archive database in order to unload it. For instance, when I have a table of user accounts, I want to move the ones that weren't used for, say, more than three months to the archive database. Having done this, the main database would be significantly unloaded, so I expect it to work faster. However, such a mechanism has to work in two directions - not only migrating from main to archive, but also from archive to main in order to allow users to "refresh" their accounts. Of course, in such a scenario I would use GUIDs instead of BIGINTs as the PRIMARY KEY. What do you think about it? Is such a concept right, or shouldn't I bother and assume there should be only one database? Thanks in advance.
Having an archive database never hurts, but usually it's used for restore or reporting purposes. I think in most cases partitioning will serve your purpose better. Also, many RDBMS products offer solutions for this out of the box, like database clustering, mirroring, etc.
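For the partitioning suggestion, here is a minimal T-SQL sketch, assuming SQL Server and a hypothetical UserAccounts table partitioned by last-use date. Instead of physically moving rows to a second database, old rows simply land in older partitions, which can be placed on cheaper filegroups or switched out later.

```sql
-- Hypothetical sketch: date-based partitioning instead of a separate archive DB.
-- Table name and boundary dates are made up for illustration.
CREATE PARTITION FUNCTION pfLastUsed (date)
    AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-04-01', '2012-07-01');

CREATE PARTITION SCHEME psLastUsed
    AS PARTITION pfLastUsed ALL TO ([PRIMARY]);

CREATE TABLE dbo.UserAccounts (
    AccountId  bigint        NOT NULL,
    UserName   nvarchar(100) NOT NULL,
    LastUsedOn date          NOT NULL,
    CONSTRAINT PK_UserAccounts PRIMARY KEY (AccountId, LastUsedOn)
) ON psLastUsed (LastUsedOn);

-- "Old" accounts stay in the same table, just in older partitions,
-- so no two-way migration logic or GUID keys are needed.
```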

Database Client Specific Tables v/s Relational Tables

I have a scenario: my application is a SaaS-based app catering to multiple clients. Data integrity for clients is essential.
Is it better to keep my Tables
Client specific
OR
Relational Tables
For example: I have a mapping table with fields MapField1, MapField2. I need this kind of data for each client.
Should I have tables like MappingData_
or a single table with a mapping to the ClientId:
MappingData with fields MapField1, MapField2, ClientId
I would have a separate database for each customer. (Multiple databases in a single SQL Server instance.)
This would allow you to design it once, with a single schema.
No dynamically named tables compromising test & development
Upgrades and maintenance can be designed and tested in one DB, then rolled out to all
A single customer's data can be backed up, restored or dropped exceedingly simply
Bugs discovered/exploited in one DB won't compromise the integrity of other DBs
Data access (read and write) can be managed using SQL Logins (No re-inventing the wheel)
If there is a need for globally shared data, that would go in another database, with its own set of permissions for the different SQL Logins.
Using a single database with all the customers in it is my next-best choice. You still have a single schema, but you don't get to partition the customers' data, you have to manage access rights and permissions yourself, and you take on a whole host of additional design and testing work.
I would never go near dynamically creating new tables for additional customers. A new table name means all your queries need to be updated with the new table name, and a whole host of other maintenance headaches.
I'm pretty much of the opinion that if you want to create tables dynamically during the Business As Usual use of an application/service, you've designed it badly.
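To illustrate the separate-database-per-customer setup recommended above (one schema, SQL Logins for access control), here is a hedged T-SQL sketch; the database name, login name, and password are hypothetical.

```sql
-- Sketch only: one database and one SQL login per customer.
-- CustomerA / CustomerA_App are hypothetical names.
CREATE DATABASE CustomerA;
GO
CREATE LOGIN CustomerA_App WITH PASSWORD = 'ChangeMe!123';
GO
USE CustomerA;
GO
CREATE USER CustomerA_App FOR LOGIN CustomerA_App;
-- Read/write rights are scoped to this database only, so the login
-- cannot see any other customer's data.
EXEC sp_addrolemember 'db_datareader', 'CustomerA_App';
EXEC sp_addrolemember 'db_datawriter', 'CustomerA_App';
```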
SO has a tag for the thing you're describing: "multi-tenant".
Visualize the architecture for supporting a multi-tenant database application as a spectrum. At one extreme of the spectrum is "shared nothing", which means each tenant has its own database. At the other extreme of the spectrum is "shared everything", which means tenants share tables, and each row in each table belongs to one tenant. (Each row contains a tenant identifier.)
Terminology seems to overlap, so read carefully. What one writer means by shared schema might be identical to what another writer means by shared everything.
This SO answer, also written by me, describes the differences and the tradeoffs in terms of cost, data isolation and protection, maintenance, and disaster recovery. It also links to a fairly good introductory article.

Single or multiple databases

SQL Server 2008 database design problem.
I'm defining the architecture for a service where site users would manage a large volume of data on multiple websites that they own (100MB average, 1GB maximum per site). I am considering whether to split the databases up such that the core site management tables (users, payments, contact details, login details, products etc) are held in one database, and the database relating to the customer's own websites is held in a separate database.
I see a possible gain in that I can distribute the hardware so that the heavy lifting done in the websites database gets more resources, leaving the site management database on more modest hardware. But I'm also conscious of losing the ability to directly relate the sites to the customers through a foreign key (as far as I know this can't be done across databases?).
So, the question is twofold - in general terms, should data in this sort of scenario be split out into multiple databases, or should it all be held in a single database?
If it is split into multiple, is there a recommended way to protect the integrity and security of the system at the database layer to ensure that there is a strong relationship between the two?
Thanks for your help.
This question and thus my answer may be close to the gray line of subjective, but at the least I think it would be common practice to separate out the 'admin' tables into their own db for what it sounds like you're doing. If you can tie a client to a specific server and db instance then by having separate db instances, it opens up some easy paths for adding servers to add clients. A single db would require you to monkey with various clustering approaches if you got too big.
[edit] Building in the idea early that each client gets its own DB also sets the tone for how you develop, while it is still easy to make structural and organizational changes. Discovering two years from now that you need to do it will be a lot more painful. I've worked with split DBs plenty of times in the past and it really isn't hard to deal with, as long as you can establish some idea of what the context is. Here it sounds like you already have the idea that the client is the context.
Just my two cents, like I said, you could be close to subjective on this one.
Single Database Pros
One database to maintain. One database to rule them all, and in the darkness - bind them...
One connection string
Can use Clustering
Separate Database per Customer Pros
Support for customization on per customer basis
Security: No chance of customers seeing each others data
Conclusion
The separate database approach would be valid if you plan to support per customer customization. I don't see the value if otherwise.
You can use a link (a linked server) to connect the databases.
Your architecture is smart.
If you can't use a link, you can always replicate critical data to the website database from the users database in a read only mode.
Concerning security: the best way is to have a service layer between ASP (or another web language) and the database, so your databases will be pretty much isolated.
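Since the question mentions losing cross-database foreign keys: a real FOREIGN KEY constraint can't span databases, but a trigger can approximate the check. A hedged sketch, with the database, table, and column names all hypothetical:

```sql
-- Sketch of enforcing a cross-database "foreign key" with a trigger.
-- SiteManagement, SitesData, and the tables/columns are hypothetical.
USE SitesData;
GO
CREATE TRIGGER dbo.trg_Sites_CheckCustomer
ON dbo.Sites
AFTER INSERT, UPDATE
AS
BEGIN
    IF EXISTS (
        SELECT 1
        FROM inserted AS i
        LEFT JOIN SiteManagement.dbo.Customers AS c
               ON c.CustomerId = i.CustomerId
        WHERE c.CustomerId IS NULL
    )
    BEGIN
        -- Reject rows that reference a customer missing from the management DB.
        RAISERROR('CustomerId does not exist in SiteManagement.dbo.Customers', 16, 1);
        ROLLBACK TRANSACTION;
    END
END;
```

This keeps a reasonably strong relationship between the two databases, though it is weaker than a real constraint; deletes on the customers side would need a matching trigger over there.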
If you expect to have to split the databases across different hardware in the future because of heavy load, I'd say split it now. You can use replication to push copies of some of the tables from the main database to the site management databases. For now, you can run both databases on the same instance of SQL Server and later on, when you need to, you can move some of the databases to a separate machine as your volume grows.
Imagine we had infinitely fast computers: would you split your databases? Of course not. The only reason we split them is to make it easy to scale out at some point. You don't really have much choice here; 100 MB to 1,000 MB per client is huge.

What good are SQL Server schemas?

I'm no beginner to using SQL databases, and in particular SQL Server. However, I've been primarily a SQL 2000 guy and I've always been confused by schemas in 2005+. Yes, I know the basic definition of a schema, but what are they really used for in a typical SQL Server deployment?
I've always just used the default schema. Why would I want to create specialized schemas? Why would I assign any of the built-in schemas?
EDIT: To clarify, I guess I'm looking for the benefits of schemas. If you're only going to use it as a security scheme, it seems like database roles already filled that... er... um... role. And using it as a namespace specifier seems to have been something you could have done with ownership (dbo versus user, etc.).
I guess what I'm getting at is: what do schemas do that you couldn't do with owners and roles? What are their specific benefits?
Schemas logically group tables, procedures, views together. All employee-related objects in the employee schema, etc.
You can also give permissions to just one schema, so that users can only see the schema they have access to and nothing else.
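A short T-SQL sketch of both points (grouping plus schema-level permissions); the employee schema, table, and role names are hypothetical:

```sql
-- Hypothetical sketch: group objects in a schema, grant access at the schema level.
CREATE SCHEMA employee;
GO
CREATE TABLE employee.TimeSheet (
    TimeSheetId  int  NOT NULL PRIMARY KEY,
    EmployeeId   int  NOT NULL,
    WeekStarting date NOT NULL
);
GO
CREATE ROLE hr_reader;
-- One GRANT covers every current and future object in the schema.
GRANT SELECT ON SCHEMA::employee TO hr_reader;
```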
Just like namespaces in C# code.
They can also provide a kind of naming collision protection for plugin data. For example, the new Change Data Capture feature in SQL Server 2008 puts the tables it uses in a separate cdc schema. This way, they don't have to worry about a naming conflict between a CDC table and a real table used in the database, and for that matter can deliberately shadow the names of the real tables.
I know it's an old thread, but I just looked into schemas myself and think the following could be another good candidate for schema usage:
In a data warehouse, with data coming from different sources, you can use a different schema for each source and then, for example, control access based on the schemas. It also avoids possible naming collisions between the various sources, as another poster replied above.
If you keep your schemas discrete, then you can scale an application by deploying a given schema to a new DB server. (This assumes you have an application or system which is big enough to have distinct functionality.)
An example: consider a system that performs logging. All logging tables and SPs are in the [logging] schema. Logging is a good example because it is rare (if ever) that other functionality in the system would overlap with (that is, join to) objects in the logging schema.
A hint for using this technique -- have a different connection string for each schema in your application / system. Then you deploy the schema elements to a new server and change your connection string when you need to scale.
At an ORACLE shop I worked at for many years, schemas were used to encapsulate procedures (and packages) that applied to different front-end applications. A different 'API' schema for each application often made sense, as the use cases, users, and system requirements were quite different. For example, one 'API' schema was for a development/configuration application only to be used by developers. Another 'API' schema was for accessing the client data via views and procedures (searches). Another 'API' schema encapsulated code that was used for synchronizing development/configuration and client data with an application that had its own database. Some of these 'API' schemas, under the covers, would still share common procedures and functions with each other (via other 'COMMON' schemas) where it made sense.
I will say that not having a schema is probably not the end of the world, though it can be very helpful. Really, it is the lack of packages in SQL Server that really creates problems in my mind... but that is a different topic.
I tend to agree with Brent on this one... see this discussion here. http://www.brentozar.com/archive/2010/05/why-use-schemas/
In short... schemas aren't terribly useful except for very specific use cases. Makes things messy. Do not use them if you can help it. And try to obey the K(eep) I(t) S(imple) S(tupid) rule.
I don't see the benefit in aliasing out users tied to schemas. Here is why:
Most people connect their user accounts to databases via roles initially. As soon as you assign a user to either sysadmin or the db_owner database role, in any form, that account is either aliased to the "dbo" user account or has full permissions on the database. Once that occurs, no matter how you assign yourself to a schema beyond your default schema (which has the same name as your user account), those dbo rights are assigned to the objects you create under your user and schema. It's kind of pointless: it's just a namespace, and it confuses true ownership of those objects. It's poor design if you ask me, whoever designed it.
What they should have done is create "groups", throw out schemas and roles, and just allow you to tier groups of groups in any combination you like, then at each tier tell the system whether permissions are inherited, denied, or overridden with custom ones. That would have been much more intuitive and allowed DBAs to better control who the real owners of those objects are. Right now it's implied in most cases that the dbo default SQL Server user has those rights, not the user.
I think schemas are like a lot of new features (whether to SQL Server or any other software tool). You need to carefully evaluate whether the benefit of adding it to your development kit offsets the loss of simplicity in design and implementation.
It looks to me like schemas are roughly equivalent to optional namespaces. If you're in a situation where object names are colliding and the granularity of permissions is not fine enough, here's a tool. (I'd be inclined to say there might be design issues that should be dealt with at a more fundamental level first.)
The problem can be that, if it's there, some developers will start casually using it for short-term benefit; and once it's in there it can become kudzu.
In SQL Server 2000, objects created were linked to the particular user who created them: if a user Sam creates an object, say Employees, that table appears as Sam.Employees. What happens if Sam leaves the company or moves to some other business area? As soon as you delete the user Sam, what would happen to the Sam.Employees table? You would probably have to change the ownership first, from Sam.Employees to dbo.Employees. Schemas provide a solution to this problem. Sam can create all his objects within a schema such as Emp_Schema. Now, if he creates an object Employees within Emp_Schema, the object is referred to as Emp_Schema.Employees. Even if the user account Sam needs to be deleted, the schema is not affected.
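A small sketch of that scenario (Emp_Schema comes from the answer above; the other names are hypothetical):

```sql
-- The schema is owned by dbo (not by Sam), so dropping the user Sam later
-- does not orphan the table. Names other than Emp_Schema are hypothetical.
CREATE SCHEMA Emp_Schema AUTHORIZATION dbo;
GO
CREATE TABLE Emp_Schema.Employees (
    EmployeeId int           NOT NULL PRIMARY KEY,
    FullName   nvarchar(200) NOT NULL
);
GO
-- Sam works through schema-level permissions
-- (plus CREATE TABLE at the database level if he creates objects himself).
GRANT ALTER, SELECT, INSERT, UPDATE, DELETE ON SCHEMA::Emp_Schema TO Sam;
-- When Sam leaves, the user can be dropped without touching Emp_Schema.Employees:
-- DROP USER Sam;
```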
Development: each of our devs gets their own schema as a sandbox to play in.
Here is a good implementation example of using schemas with SQL Server. We had several MS Access applications that we wanted to convert to an ASP.NET app portal. Every MS Access application is rewritten as an app for that portal, and every MS Access application has its own database tables. Some of those are shared, and we put them in the common dbo schema of SQL Server; the rest each get their own schema. That way, the tables that belong to an app on the ASP.NET portal can easily be navigated, visualised and maintained.

Ideas for Combining Thousand Databases into One Database

We have a SQL server that has a database for each client, and we have hundreds of clients. So imagine the following: database001, database002, database003, ..., database999. We want to combine all of these databases into one database.
Our thoughts are to add a siteId column, 001, 002, 003, ..., 999.
We are exploring options to make this transition as smoothly as possible. And we would LOVE to hear any ideas you have. It's proving to be a VERY challenging problem.
I've heard of a technique that would create a view that would match and then filter.
Any ideas guys?
Create a client database id for each of the client databases. You will use this id to keep the data logically separated. This is the "site id" concept, but you can use a derived key (identity field) instead of manually creating these numbers. Create a table that has database name and id, with any other metadata you need.
The next step would be to create an SSIS package that gets the ID for the database in question and adds it to the tables that have to have their data separated out logically. You then can run that same package over each database with the lookup for ID for the database in question.
After you have imported the data with its unique id, you will have to alter your apps to fit the new schema (actually before you import, or you are pretty much screwed).
If you want to do this in steps, you can create views or functions in the different "databases" so the old client can still hit the client's data, even though it has been moved. This step may not be necessary if you deploy with some downtime.
The method I propose is fairly flexible and can be applied to one client at a time, depending on your client application deployment methodology.
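The core of that stamping step, whether done in SSIS or in plain T-SQL, looks roughly like this hedged sketch; the SiteMap lookup table, the Orders table, and the Consolidated database name are hypothetical.

```sql
-- Run inside each client database: add the identifier column, then stamp
-- every row with the id looked up for this database. Names are hypothetical.
ALTER TABLE dbo.Orders ADD SiteId int NULL;
GO
UPDATE o
SET    o.SiteId = m.SiteId
FROM   dbo.Orders AS o
CROSS JOIN Consolidated.dbo.SiteMap AS m   -- one-row lookup for this database
WHERE  m.DatabaseName = DB_NAME();
GO
-- Once stamped (and verified), tighten and index the column.
ALTER TABLE dbo.Orders ALTER COLUMN SiteId int NOT NULL;
CREATE INDEX IX_Orders_SiteId ON dbo.Orders (SiteId);
```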
Why do you want to do that?
You can read about Multi-Tenant Data Architecture and also listen to SO #19 (around 40-50 min) about this design.
The "site-id" solution is what's done.
Another possibility that may not work out as well (but is still appealing) is multiple schemas within a single database. You can pull common tables into a "common" schema and leave the customer-specific stuff in customer-specific schemas. In some database products, however, each schema is effectively a separate database. In other products (Oracle and DB2, for example) you can easily write queries that work across multiple schemas.
Also note that -- as an optimization -- you may not need to add siteId column to EVERY table.
Sometimes you have a "contains" relationship. It's a master-detail FK, often defined with a cascade delete so that detail cannot exist without the parent. In this case, the children don't need siteId because they don't have an independent existence.
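For example, an order-lines table that can only exist under an order doesn't need its own siteId (table names hypothetical):

```sql
-- Hypothetical sketch of a "contains" relationship: only the parent carries SiteId.
CREATE TABLE dbo.Orders (
    OrderId int NOT NULL PRIMARY KEY,
    SiteId  int NOT NULL
);

CREATE TABLE dbo.OrderLines (
    OrderId  int NOT NULL,
    LineNo   int NOT NULL,
    Quantity int NOT NULL,
    CONSTRAINT PK_OrderLines PRIMARY KEY (OrderId, LineNo),
    CONSTRAINT FK_OrderLines_Orders FOREIGN KEY (OrderId)
        REFERENCES dbo.Orders (OrderId) ON DELETE CASCADE
);
-- OrderLines reaches its tenant through Orders, so it needs no SiteId of its own.
```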
Your first step will be to determine if these databases even have the same structure. Even if you think they do, you need to compare them to make sure they do. Chances are there will be some that are customized or missed an upgrade cycle or two.
Now depending on the number of clients and the number of records per client, your tables may get huge. Are you sure this will not create a performance problem? At any rate, you may need to take a fresh look at indexing. You may need a much more powerful set of servers, and may also need to partition by client anyway for performance.
Next, yes, each table will need a site id of some sort. Further, depending on your design, you may have primary keys that are no longer unique. You may need to redefine all primary keys to include the siteid. Always index this field when you add it.
Now all your queries, stored procs, views, and UDFs will need to be rewritten to ensure that the siteid is part of them. Pay particular attention to any dynamic SQL. Otherwise you could be showing client A's information to client B; clients don't tend to like that. We brought a client from a separate database into the main application one time (when they decided they no longer wanted to pay for a separate server). The developer missed just one place where client_id had to be added. Unfortunately, that sent emails concerning this client's proprietary information to every client, and to make matters worse, it was a nightly process that ran in the middle of the night, so it wasn't known about until the next day. (The developer was very lucky not to get fired.) The point is: be very, very careful when you do this, and test, test, test, and then test some more. Make sure to test all the automated behind-the-scenes stuff as well as the UI.
What I was explaining in Florence towards the end of last year applies if you have to keep the database names and the logical layer of the database the same for the application. In that case you'd do the following:
Collapse all the data into consolidated tables into one master, consolidated database (hereafter referred to as the consolidated DB).
Those tables would have to have an identifier like SiteID.
Create the new databases with the existing names.
Create views with the old table names which use row-level security to query the tables in the consolidated DB, but using the SiteID to filter.
Set up the databases for cross-database ownership chaining so that the service accounts can't "accidentally" query the base tables in the consolidated DB. Access must happen through the views or through stored procedures and other constructs that will enforce row-level security. Now, if it's the same service account for all sites, you can avoid the cross DB ownership chaining and assign the rights on the objects in the consolidated DB.
Rewrite the stored procedures to either handle the change (since they now refer to views and don't know to hit the base tables and include SiteID), or use INSTEAD OF triggers on the views to intercept write requests and put the appropriate site-specific information into the base tables.
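A hedged sketch of the view-plus-trigger part of that plan, for one hypothetical site database (Site001, SiteID 1) in front of a hypothetical consolidated Customers table; it shows the insert case only.

```sql
-- Sketch only: the old database name survives, but its "table" is now a view
-- over the consolidated DB, filtered by SiteID. All names are hypothetical.
USE Site001;
GO
CREATE VIEW dbo.Customers
AS
SELECT CustomerId, Name, Email
FROM   Consolidated.dbo.Customers
WHERE  SiteID = 1;
GO
-- Writes through the old name are intercepted and stamped with the site id.
CREATE TRIGGER dbo.trg_Customers_Insert
ON dbo.Customers
INSTEAD OF INSERT
AS
BEGIN
    INSERT INTO Consolidated.dbo.Customers (SiteID, CustomerId, Name, Email)
    SELECT 1, CustomerId, Name, Email
    FROM inserted;
END;
```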
If the data is large you could look at using a partitioned view. This would simplify your access code, as all you'd have to maintain is the view; however, if the data is not large, just add a column to identify the customer.
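A minimal sketch of a partitioned view, assuming hypothetical per-site member tables; the CHECK constraints on SiteId are what let SQL Server prune to the right member table and allow updates through the view.

```sql
-- Hypothetical partitioned-view sketch: one member table per site.
CREATE TABLE dbo.Orders_Site1 (
    OrderId int NOT NULL,
    SiteId  int NOT NULL CHECK (SiteId = 1),
    CONSTRAINT PK_Orders_Site1 PRIMARY KEY (SiteId, OrderId)
);
CREATE TABLE dbo.Orders_Site2 (
    OrderId int NOT NULL,
    SiteId  int NOT NULL CHECK (SiteId = 2),
    CONSTRAINT PK_Orders_Site2 PRIMARY KEY (SiteId, OrderId)
);
GO
-- The application only ever touches the view.
CREATE VIEW dbo.Orders
AS
SELECT OrderId, SiteId FROM dbo.Orders_Site1
UNION ALL
SELECT OrderId, SiteId FROM dbo.Orders_Site2;
```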
Depending on what the data is and your security requirements the threat of cross contamination may be a show stopper.
Assuming you have considered this and deem it "safe enough", you may need or want to create views or impose some other access control to prevent customers from seeing each other's data.
IIRC a product called "Trusted Oracle" had the ability to partition data based on such a key (about the time Oracle 7 or 8 was out). The idea was that any given query would automagically have "and sourceKey = #userSecurityKey" (or some such) appended. The feature may have been rolled into later versions of the popular commercial product.
To expand on Gregory's answer, you can also make a parent SSIS package that calls the package doing the actual moving inside a Foreach Loop container.
The parent package queries a config table and puts the result in an object variable. The Foreach Loop then uses this recordset to pass variables to the child package, such as the database name and any other details the package might need.
Your table could list all of your client databases and have a flag to mark when you are ready to move them. That way you are not sitting around running the SSIS package on 32,767 databases. I'm hooked on the Foreach Loop in SSIS.