When to use separate SQL databases on the same server? - sql

I've worked in several SQL environments. In one environment, the different tables holding business data were split across several different SQL databases, all on the same server.
In another environment, almost all the tables are kept on one single SQL database.
I'm creating a new project that is closely related to another project, and I've been wondering if I should put the new tables in the same SQL database or a new SQL database.
This all runs on MS SQL Server.
What factors do I need to consider as I make this decision?

It's tough from your question to tell what your actual requirements are, or what data you would consider to store in different databases. But in addition to Gordon's points I can address a couple of additional reasons why you might want to use separate databases for data belonging to different customers / users (and this answer assumes that one possible separation of data, whether by database or schema, would be by customer):
As I mentioned in a comment, some customers will demand that their data be stored separately, and you may need to agree to that in writing before you see a penny or are able to secure their business. So you may as well be prepared for that inevitability.
Keeping each customer in their own database makes it very easy to move them if they outgrow your current server. At my previous job we designed the system this way, and it saved our bacon later - we were able to move customers to a different server with what essentially amounted to a metadata operation. During a maintenance window, we backed up their database, took the original offline, restored the backup on the new server, and updated a config table that told all the apps where to find that database. This is much more flexible than trying to extract all of their data from a database shared by others...
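To make that concrete, here is a minimal sketch of such a per-customer move. All names (the Customer042 database, its logical file names, the backup path, and the dbo.CustomerConfig table) are hypothetical, and a real move would add checks around each step:

-- on the old server: take a final backup and take the customer out of service
BACKUP DATABASE Customer042 TO DISK = N'\\backupshare\Customer042.bak' WITH INIT;
ALTER DATABASE Customer042 SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- on the new server: restore the backup, relocating the files as needed
-- (logical file names are assumed here)
RESTORE DATABASE Customer042
    FROM DISK = N'\\backupshare\Customer042.bak'
    WITH MOVE 'Customer042'     TO N'D:\Data\Customer042.mdf',
         MOVE 'Customer042_log' TO N'E:\Logs\Customer042_log.ldf';

-- finally, point the applications at the new location
UPDATE dbo.CustomerConfig SET ServerName = 'NEWSRV01' WHERE CustomerId = 42;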
Separate databases also allow you to handle maintenance differently. One customer needs point-in-time restore, and another doesn't? Perfect, you can just use a different recovery model on separate databases. Much easier than separating by filegroups and trying to implement some filegroup-level backup solution, and much more efficient than just treating one big database in full recovery.
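As a rough illustration (database names are made up), the recovery model is just a per-database setting:

-- CustomerA pays for point-in-time restore, CustomerB does not
ALTER DATABASE CustomerA SET RECOVERY FULL;    -- needs regular log backups
ALTER DATABASE CustomerB SET RECOVERY SIMPLE;  -- full/differential backups only

-- point-in-time restore for CustomerA is then possible via log backups
BACKUP LOG CustomerA TO DISK = N'\\backupshare\CustomerA_log.trn';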
This isn't free, of course; it's about trade-offs. Multiple databases scare some people away, but having managed such a system for 13 years I can tell you that managing 100 or 500 databases that are largely identical is not much more complicated than managing 500 schemas in one massive database (in fact I would say it is less complicated in a lot of respects).

A database is the unit of backup and recovery, so that should be the first consideration when designing database structures. If the data has different back up and recovery requirements, then they are very good candidates for separate databases.
That is only half the problem, though. In most environments, backup/recovery is pretty much the same for all databases. It becomes a question of application design. In other words, the situation becomes quite subjective.
In the environment that I'm working in right now, here are some criteria for splitting data into different databases:
(1) Publishing tables to a wide audience. We "publish" data in tables and put these into a database, separate from other tables used for building them or for special purposes. Admittedly, SQL Server claims that schemas are the unit of security. However, databases seem to do a good job in the real world (a small sketch of both levels of separation follows this list).
(2) Strict security requirements. Some data is so sensitive that lawyers have to approve who can see it. This goes into its own database, with its own access.
(3) Separation of data tables (which users can see) and tables that describe the production system.
(4) Separation of tables used for general querying by a skilled group of analysts (the published tables) versus tables used for specific reports/applications.
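To illustrate points (1) and (2), here is a minimal sketch of the two levels of separation; the schema, role, database, and login names are invented for the example:

-- schema-level security inside one database: publish tables to a wide audience
CREATE SCHEMA published AUTHORIZATION dbo;
GO
CREATE ROLE analysts;
GRANT SELECT ON SCHEMA::published TO analysts;

-- database-level separation for the lawyer-approved data: nobody gets in
-- unless they are explicitly mapped into this database
CREATE DATABASE SensitiveData;
GO
USE SensitiveData;
CREATE USER approved_reviewer FOR LOGIN approved_reviewer;  -- assumes the login already exists
GRANT SELECT ON SCHEMA::dbo TO approved_reviewer;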
Finally, I would add this. If some of the data is being updated continuously throughout the day and other data is used for reporting, I would tend to put them in different databases. This helps separate them in the case of problems.

Related

Database design - Sharing data between two databases?

I am thinking about and exploring options for designing the database for my new application. In general, I will have registered users and info about them. They will be able to do some things in the app, and that data will be in the same DB as the user data (so I can have shared FKs and such).
But then I plan to have a second database that will be logically totally independent of the first database, except that it will share userID as an FK.
I don't know whether I should even put that second set of logic in an extra DB, or whether I should have everything in the same database. I plan to have a subdomain in my app for the second part (it is like an app within an app), but what if I discover they should share more data? Will that cross-querying hurt my performance? And is that actually the way to go - is there a real reason to separate databases?
As soon as you have two databases you have potential complexity. You have not given any particular reason why you need two databases. So keep it simple until you have a reason.
An example of what folks do: have a "current" database, small, holding just the data needed right now. That might be where orders are taken and fulfilled. Once the data is no longer current (say some days or weeks after the order is filled), move it to a "historic" database. There, marketing and management folks can look at overall trends in the history without affecting performance of the "current" database, whose performance might be critical to keeping your customers happy.
As an example of complexity: any time you have two databases you need to consider consistency between them, and this is much harder to ensure than it might appear. Databases do offer two-phase commit capabilities, or you can devise batch processes, but there are always subtleties that are hard to catch.
I would just keep it all in one database. Unless you have dozens of tables there should be no real performance problems, imho. It will also make your life much easier, only having to work with one database connection and not having to worry about merging information from two queries.
I also agree that unless the volume of your data is going to be huge (judging by the question, that doesn't seem to be the case here), you can use a single database to store your data without performance issues.
For "visual" separation of the data structure, you can always create tables in two schemas of a single database.

Is there a good reason to split related datasets into different SQL databases rather than just tables?

This question is motivated by a recent update to some business software a friend of mine is using. Their architecture was based on Access databases until now, which was awfully slow. They had split their datasets into multiple mdb files (sales.mdb, products.mdb, stock.mdb, ...). They are now moving on to SQL Server Express and keeping this structure. Instead of using a table for each of these datasets, they created different databases on the same instance of SQL Server 2008 Express.
From my (admittedly limited) understanding of SQL Server, this does not seem sensible, as it prevents JOINs between different tables and requires a program that needs sales and stock data to maintain two DB connections instead of just one.
One of the software vendor's consultants claimed that this would circumvent SQL Express' RAM limit of 1GB physical memory - he says that is per-database, but from what I gather from MSDN, it's actually per-instance, so they win nothing here.
Is there a good reason for splitting data in the same business domain into databases rather than tables?
(One that I can think of is that you can restrict access per database, but not per table - but this is irrelevant in this particular case, all modules of the program can access all databases.)
You can write joins across databases, so that's not a major issue. Generally, I would suggest keeping everything in one database unless there was a very good reason to split it, and those reasons might have something to do with compatibility, say where you had an application that had to run against a database in compatibility mode 80 or something - you might choose to separate some data into a separate database at that compatibility level. Or if you had a major chunk of functionality that you wanted to be able to easily move to another server - say data migration, or ETL.
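For example, a cross-database join on the same instance is just a matter of three-part names (the database, schema, and table names below are made up):

SELECT o.OrderId, o.Quantity, s.QuantityOnHand
FROM Sales.dbo.Orders AS o
JOIN Stock.dbo.Items  AS s ON s.ProductId = o.ProductId
WHERE o.OrderDate >= '20240101';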
It sounds like the limitation is in the application.
Depending on your version of SQL Server Express, splitting the databases up could allow for more data storage (2005 and 2008 have a 4GB per-database limit; 2008 R2 raised it to 10GB).
In addition to the 1GB RAM per instance limit, I believe SQL Server Express is also limited to 1 CPU per instance.
I would agree with HABO's comment. You cannot enforce referential integrity across databases, that will all have to be managed within the application.
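If you are stuck with the split, one common workaround (sketched here with hypothetical names, and with the usual caveats about trigger-based checks) is to approximate the missing foreign key with a trigger in the referencing database:

-- runs in the Sales database; rejects rows whose ProductId is unknown in ProductsDb
CREATE TRIGGER dbo.trg_Orders_ProductExists
ON dbo.Orders
AFTER INSERT, UPDATE
AS
BEGIN
    IF EXISTS (SELECT 1
               FROM inserted AS i
               WHERE NOT EXISTS (SELECT 1
                                 FROM ProductsDb.dbo.Products AS p
                                 WHERE p.ProductId = i.ProductId))
    BEGIN
        RAISERROR('Unknown ProductId; cross-database integrity check failed.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END;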

How to create a multi-tenant database with shared table structures?

Our software currently runs on MySQL. The data of all tenants is stored in the same schema. Since we are using Ruby on Rails we can easily determine which data belongs to which tenant. However there are some companies of course who fear that their data might be compromised, so we are evaluating other solutions.
So far I have seen three options:
Multi-Database (each tenant gets its own - nearly the same as 1 server per customer)
Multi-Schema (not available in MySQL, each tenant gets its own schema in a shared database)
Shared Schema (our current approach, maybe with an additional tenant-identifying column on each table)
Multi-Schema is my favourite (considering costs). However creating a new account and doing migrations seems to be quite painful, because I would have to iterate over all schemas and change their tables/columns/definitions.
Q: Multi-Schema seems to be designed to have slightly different tables for each tenant - I don't want this. Is there any RDBMS which allows me to use a multi-schema multi-tenant solution, where the table structure is shared between all tenants?
P.S. By multi I mean something like ultra-multi (10,000+ tenants).
However there are some companies of course who fear that their data might be compromised, so we are evaluating other solutions.
This is unfortunate, as customers sometimes suffer from a misconception that only physical isolation can offer enough security.
There is an interesting MSDN article, titled Multi-Tenant Data Architecture, which you may want to check. This is how the authors addressed the misconception towards the shared approach:
A common misconception holds that only physical isolation can provide an appropriate level of security. In fact, data stored using a shared approach can also provide strong data safety, but requires the use of more sophisticated design patterns.
As for technical and business considerations, the article makes a brief analysis on where a certain approach might be more appropriate than another:
The number, nature, and needs of the tenants you expect to serve all affect your data architecture decision in different ways. Some of the following questions may bias you toward a more isolated approach, while others may bias you toward a more shared approach.
How many prospective tenants do you expect to target? You may be nowhere near being able to estimate prospective use with authority, but think in terms of orders of magnitude: are you building an application for hundreds of tenants? Thousands? Tens of thousands? More? The larger you expect your tenant base to be, the more likely you will want to consider a more shared approach.
How much storage space do you expect the average tenant's data to occupy? If you expect some or all tenants to store very large amounts of data, the separate-database approach is probably best. (Indeed, data storage requirements may force you to adopt a separate-database model anyway. If so, it will be much easier to design the application that way from the beginning than to move to a separate-database approach later on.)
How many concurrent end users do you expect the average tenant to support? The larger the number, the more appropriate a more isolated approach will be to meet end-user requirements.
Do you expect to offer any per-tenant value-added services, such as per-tenant backup and restore capability? Such services are easier to offer through a more isolated approach.
UPDATE: Further to the update about the expected number of tenants.
That expected number of tenants (10k) should exclude the multi-database approach, for most, if not all scenarios. I don't think you'll fancy the idea of maintaining 10,000 database instances, and having to create hundreds of new ones every day.
From that parameter alone, it looks like the shared-database, single-schema approach is the most suitable. The fact that you'll be storing only about 50 MB per tenant, and that there will be no per-tenant add-ons, makes this approach even more appropriate.
The MSDN article cited above mentions three security patterns that tackle security considerations for the shared-database approach:
Trusted Database Connections
Tenant View Filter
Tenant Data Encryption
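As a rough sketch of the Tenant View Filter idea (the table and column names are hypothetical, not taken from the article), each tenant's application login only ever queries views that filter on its own tenant:

-- base table holds all tenants' rows side by side
CREATE TABLE dbo.Orders (
    OrderId  bigint IDENTITY(1,1) PRIMARY KEY,
    TenantId int    NOT NULL,
    Amount   decimal(12,2) NOT NULL
);
GO
-- dbo.TenantLogins(TenantId, LoginName) is an assumed mapping table;
-- the view maps the connected login back to its tenant and filters accordingly
CREATE VIEW dbo.TenantOrders
AS
SELECT o.OrderId, o.Amount
FROM dbo.Orders AS o
JOIN dbo.TenantLogins AS t ON t.TenantId = o.TenantId
WHERE t.LoginName = SUSER_SNAME();
GO
-- tenants are granted access to the view only, never to dbo.Orders directly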
When you are confident in your application's data safety measures, you will be able to offer your clients a Service Level Agreement that provides strong data safety guarantees. In your SLA, apart from the guarantees, you could also describe the measures you would take to ensure that data is not compromised.
UPDATE 2: Apparently the Microsoft folks moved / made a new article regarding this subject; the original link is gone and this is the new one: Multi-tenant SaaS database tenancy patterns (kudos to Shai Kerer)
Below is a link to a white-paper on Salesforce.com about how they implement multi-tenancy:
http://www.developerforce.com/media/ForcedotcomBookLibrary/Force.com_Multitenancy_WP_101508.pdf
They have one huge table with 500 string columns (Value0, Value1, ... Value500). Dates and numbers are stored as strings in a format such that they can be converted to their native types at the database level. There are metadata tables that define the shape of the data model, which can be unique per tenant. There are additional tables for indexing, relationships, unique values, etc.
Why the hassle?
Each tenant can customize their own data schema at run-time without having to make changes at the database level (ALTER TABLE etc.). This is definitely the hard way to do something like this, but it is very flexible.
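A very stripped-down sketch of that flex-column idea (all names invented here; the real Force.com schema is far more involved) looks something like this:

-- one wide table stores every tenant's rows as strings
CREATE TABLE dbo.DataRows (
    RowGuid    uniqueidentifier NOT NULL PRIMARY KEY DEFAULT NEWID(),
    TenantId   int           NOT NULL,
    ObjectName varchar(100)  NOT NULL,      -- e.g. 'Invoice', 'Contact'
    Value0     nvarchar(max) NULL,
    Value1     nvarchar(max) NULL,
    Value2     nvarchar(max) NULL
    -- ... up to ValueN
);

-- metadata says what each slot means for each tenant's object
CREATE TABLE dbo.FieldDefs (
    TenantId   int          NOT NULL,
    ObjectName varchar(100) NOT NULL,
    FieldName  varchar(100) NOT NULL,
    SlotNumber int          NOT NULL,       -- which ValueN column holds it
    DataType   varchar(20)  NOT NULL,       -- 'string', 'date', 'number', ...
    PRIMARY KEY (TenantId, ObjectName, FieldName)
);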
My experience (albeit SQL Server) is that multi-database is the way to go, where each client has their own database. So although I have no mySQL or Ruby On Rails experience, I'm hoping my input might add some value.
The reasons why include:
data security/disaster recovery. Each company's data is stored entirely separately from the others, reducing the risk of data being compromised (think of things like introducing a code bug that mistakenly looks at another client's data when it shouldn't) and minimizing the potential loss to one client if one particular database gets corrupted, etc. The perceived security benefit to the client is even greater (added bonus side effect!)
scalability. Essentially you'd be partitioning your data out to enable greater scalability - e.g. databases can be put on different disks, you could bring multiple database servers online, and you can move databases around more easily to spread the load.
performance tuning. Suppose you have one very large client and one very small. Usage patterns, data volumes etc. can vary wildly. You can tune/optimise more easily for each client should you need to.
I hope this does offer some useful input! There are more reasons, but my mind went blank. If it kicks back in, I'll update :)
EDIT:
Since I posted this answer, it's now clear that we're talking about 10,000+ tenants. My experience is with hundreds of large-scale databases; I don't think 10,000 separate databases are going to be very manageable for your scenario, so I'm now not favouring the multi-db approach in your case. Especially as it's now clear you're talking about small data volumes for each tenant!
Keeping my answer here anyway, as it may be of some use to other people in a similar boat (with fewer tenants).
As you mention, one database per tenant is an option, and it does come with some larger trade-offs. It can work well at smaller scale, such as single digits or low tens of tenants, but beyond that it becomes harder to manage - not just the migrations, but also just keeping the databases up and running.
The per-schema model isn't only useful when each tenant needs a unique schema, though running migrations across all tenants still becomes difficult, and at thousands of schemas Postgres can start to have trouble.
A more scalable approach is to have tenants randomly distributed, stored in the same database, but across different logical shards (or tables). Depending on your language there are a number of libraries that can help with this. If you're using Rails there is the acts_as_tenant library to enforce tenancy; it helps ensure your tenant queries only pull back that tenant's data. There's also the apartment gem - though it uses the schema model, it does help with running migrations across all schemas. If you're using Django there are a number of options as well; one of the more popular ones works across schemas. All of these help more at the application level. If you're looking for something more at the database level directly, Citus focuses on making this type of sharding for multi-tenancy work more out of the box with Postgres.
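Whichever library you use, the underlying shape is usually just a tenant_id column on every table, with keys and indexes that lead with it. A generic SQL sketch (table and column names invented):

CREATE TABLE orders (
    tenant_id  int    NOT NULL,
    order_id   bigint NOT NULL,
    created_at date   NOT NULL,
    total      decimal(12,2) NOT NULL,
    PRIMARY KEY (tenant_id, order_id)          -- every key is scoped to the tenant
);

CREATE INDEX orders_by_tenant_date ON orders (tenant_id, created_at);

-- every query the application issues is constrained to a single tenant
SELECT order_id, total FROM orders WHERE tenant_id = 42 AND created_at >= '2024-01-01';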

Help with setting up a Database

My site is going to have many products available, but they'll be categorised into completely different sites (domains).
My question is, am I better off lumping all products into one database and using an ID to distinguish between the sites, or should I set up a table and/or DB per site?
Here are my thoughts
SEPARATE DATABASES
Easier to read from a backend
Categorised better
Makes backups more difficult
If I need to make a change to the schema, it will need to be pushed out to all databases
SAME DATABASES
All in one place
Could get unwieldy
One database will have a massive file size and lookups could suffer
Can someone please offer me some advice on which way is best and why?
You didn't give too many details (which makes it difficult to provide a good answer), though the words you chose to use in your question lead me to believe that this is a single application with different "skins".
My site is going to have many products available, but they'll be categorised into completely different sites (domains).
My assumption is that you will have a single web store with several different store fronts: cool-widgets.com, awesome-sprockets.com, neato-things.com, etc. These will all be the same, save for maybe a CSS skin or something simple like that. The store admin stuff will all be done in some central system, and the domain name will simply act as a category name.
As such, splitting the same data into two different containers using an arbitrary criterion (category_name == 'cool-widgets.com') is data partitioning, which is an anti-pattern here. Just as you wouldn't have two different user tables based on the user name ([Users$A-to-M] and [Users$N-to-Z]), it makes little sense to have two different tables (or databases) based on category names.
There is, and will be, lots of code common among the categories: user management, admin, order processing, data import, etc. It will be far more difficult to aggregate the multiple datastores in the common code than it will be to segregate the categories in the store display code. Not only that, segregation bugs will be much more obvious (the price comparison page shows items from all three stores), while aggregation bugs will be much subtler (only three of the four stores were updated). This is why it's an anti-pattern.
Side note: yes, before you say that data partitioning has its uses (which it does), those uses come into play only after performance problems occur. Many serious database platforms allow behind-the-scenes partitioning so as not to create a goofy data model.
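In other words, a single catalogue with the site as an ordinary attribute is usually all that is needed; a minimal sketch (table and column names are invented, and @SiteId stands in for whatever the application passes):

CREATE TABLE dbo.Sites (
    SiteId     int IDENTITY(1,1) PRIMARY KEY,
    DomainName nvarchar(255) NOT NULL UNIQUE   -- 'cool-widgets.com', 'awesome-sprockets.com', ...
);

CREATE TABLE dbo.Products (
    ProductId int IDENTITY(1,1) PRIMARY KEY,
    SiteId    int NOT NULL REFERENCES dbo.Sites (SiteId),
    Name      nvarchar(200) NOT NULL,
    Price     decimal(10,2) NOT NULL
);

-- each storefront simply filters on its own SiteId
SELECT Name, Price FROM dbo.Products WHERE SiteId = @SiteId;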
If data needs to be shared among all the sites, then sharing the same database is recommended, since data transfer between databases is eliminated and the data is more centralized.
If data does not need to be shared among the sites, it's fine to split things up into one database per site. As for the difficulty of updating table structures, you can simply record the database changes you make for one site (saving the ALTER, UPDATE, and DELETE statements in a SQL file) and apply the same SQL file to the other databases.
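For example, a change script kept in source control can be replayed against each site database in turn; a crude sketch (database and object names invented) might look like:

-- 2024-06-upgrade.sql, applied to every site database
ALTER TABLE Products ADD Discontinued bit NOT NULL DEFAULT 0;
CREATE INDEX IX_Products_Discontinued ON Products (Discontinued);

-- run it against each database, e.g. with sqlcmd:
-- sqlcmd -S myserver -d SiteA_DB -i 2024-06-upgrade.sql
-- sqlcmd -S myserver -d SiteB_DB -i 2024-06-upgrade.sql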
Storing sites in different databases might also help with security. You can set different user permissions for each site, and if one gets compromised, you protect the other sites.
Also, you are able to maintain and track each database more easily when the databases are clearly split up.
As you already say, both options have their pros and cons. Since you're talking about two stores, it probably doesn't matter much.
However, a few questions you might want to ask yourself:
Will it really be two stores, or possibly more? If more, one database might be smarter.
Are the products really the same? If you would have to squeeze products of completely different kinds into one general database (e.g. cars vs. food, where the amount and nature of the details you want to store are completely different), then don't; use two databases/tables instead.
The central question is: what is most likely to become more elaborate in the future: the stores, or the products?
I think separate databases will be easier. You can have a quick-start template database from which you can build a new store database. You can even create a common database containing shared tables and a list of stores and their databases. After all, you can access any database within the same server using a qualified name; observe:
SELECT value FROM CommonDB.dbo.currencies WHERE type = 'euro';
SELECT price FROM OldTownDB.dbo.Products WHERE id = @newtownprodid;

Database System Architecture discussion

I'd like to start a discussion about the implementation of a database system.
I'm working for a company having a database system grown over ca. the last 10 years.
Let me try to describe what it's doing and how it's implemented:
The system is divided into 3 main parts handled by 3 different teams.
Entry:
The Entry Team is responsible for creating GUIs for the system. In the background is a huge MS SQL database (ca. 100 tables) and the GUI is created using .NET. There are different GUI applications and each application has lots of different tabs to fill in the corresponding tables. If e.g. a new column is added to the database, this column is added manually to the GUI application.
Dataflow:
The purpose of the Dataflow Team is to do data calculations and prepare the data for the Reporting Team. This is done in multiple stages. Let me try to explain the process in a little more detail: the Dataflow Team uses the data from the Entry database, copied to another server and another database via transactional replication (this data contains information from all clients).
Once per hour, a self-written application checks for changed rows in the input tables (using a ChangedDate column) and then calls a stored procedure for each output table, calculating new data using 1-N of the input tables. After that the data is copied to another database on another server, again using transactional replication. There another stored procedure is called to calculate additional output tables; this stored procedure is started by a SQL job. From there the data is split into different databases, each database being client specific. This copying is done by another self-written application using the .NET BulkCopy command (filtering on the client). These client-specific databases are copied to different client-specific reporting databases on other servers via yet another self-written application, which compares the reporting database with the client-specific database to calculate the data difference. Only the differences are copied (because the reporting databases used to run on the client servers).
This whole process is orchestrated by yet another self-written application that controls, for example, whether the transactional replications are finished before starting the job that calls the stored procedure, etc. Furthermore, the synchronisation between the different clients is also orchestrated here. The process can be graphically displayed by a self-written monitoring tool, which looks pretty complex as you can imagine...
The status of all these components is logged and can be viewed in another self-written application.
If new columns or tables are added, all these components have to be changed manually.
For deployment, installation instructions are written in MS Word. (About 10 people work on this team.)
Reporting:
The Reporting Team created its own platform, written in .NET, to allow the client to create custom reports via a GUI. The reports are accessible via the web.
The biggest tables have around 1 million rows. So, I hope I didn't forget anything important.
Well, what I want to discuss is how other people handle this scenario; I can't imagine that every company writes its own custom applications.
What are the actual options for doing fast calculations on databases (besides using T-SQL)? I'm somehow missing the link here to the object-oriented programming I'm used to from my old company, but we never dealt with so much data, and maybe for fast calculations this is the way to do it... Or is it possible to create the algorithms and calculations using, e.g., LINQ or BizTalk Server, maybe even in a graphical way? The question is just how to convert the existing meter-long stored procedures into the new format...
In future we want to use data warehousing, but that will take a while, so maybe it's possible to have a separate step to streamline the process.
Any comments are appreciated.
Thanks
Daniel
Why on earth would you want to convert existing, working, complex stored procs (which can be performance tuned) to LINQ (or am I misunderstanding you)? Because you personally don't like T-SQL? That's not a good enough reason. Are they too slow? Then they can be tuned (which is something you really don't want to try to do in LINQ). It is possible the process could be made better using SSIS, but given how complex SSIS is and the amount of time a rewrite of the process would take, I'm not sure you would really gain anything by doing so.
"I'm somehow missing the link here to the object oriented programming..." Relational databases are NOT Object-oriented and cannot perform well if you try to treat them like they are. Learn to think in terms of sets not objects when accessing databases. You are coming from the mindset of one user at a time inserting one record at a time, but this is not the mindset neeeded to deal with the transfer of large amounts of data. For these types of things, using the database to handle the problem is better than doing things in an object-oriented manner. Once you have a large amount of data and lots of reporting, people are far more interested in performance than you may have been used to in the past when you used some tools that might not be so good for performance. Whether you like T-SQL or not, it is SQL Server's native language and the database is optimized for it's use.
The best advice, having been here before, is to start by first learning how SQL works, and doing it in the context of the existing architecture sounds like a good way to start (since nothing you've described sounds irrational on the face of it).
Whatever abstractions you try to lay on top (LINQ, Biztalk, whatever) all eventually resolve to pure SQL. And almost always they add overhead and complexity.
Your OO paradigms aren't transferable. Any suggestions about abstractions will need to be firmly defensible based on your firm grasp of the SQL consequences.
It will take a while, but it's all worth knowing, both professionally and personally.
I'm currently re-engineering a complex system which is moving from Focus (a database and language) to a data warehouse (separate team) and processing (my team) and reporting (separate team).
The current process is combined - data is loaded and managed in the Focus language and Focus database(s) and then reported (and historical data is retained)
In the new process, the DW is loaded and then our process begins. Our processes are coded entirely in SQL, and a million-row fact table (for one month) would be relatively small. We have some feeds where the monthly data is 25 million rows. There are some statistics tables produced which are over 200 million rows (a month). The processing can take several hours a month, end to end. We use tables to store intermediate results, and we ensure indexing strategies are suitable for the processing. Except for one piece implemented as an SSIS flow from the database back to itself (because of extremely poor scalar UDF performance), the entire system is implemented as a series of T-SQL SPs.
We also have a process monitoring system similar to what you are discussing, as well as the dependencies in a table, which ensures that each process runs only if all its prerequisites are satisfied. I've recently grafted on MSAGL to graphically display and interact with the process (previously I was using Graphviz to generate static images) from a .NET Windows application. The new system thus has much clearer dependency information, as well as good information about process performance, so effort can be concentrated on the slowest-performing bottlenecks.
I would not plan on doing any re-engineering of any complex system without a clear strategy, a good inventory of the existing system and a large budget for time and money.
From the sounds of what you are saying, you have a three step process.
Input data
Analyze data
Report data
Steps one and three need to be completed by "users". Therefore, a GUI is needed for each respective team to do the task at hand; otherwise, they would be working directly in SQL Server and would require extensive SQL knowledge. For these items, I do not see any issue with the approach your organization is taking: you are building a customized system to report on the data at hand. The only item that might be worth considering on this side is standardization between the teams on common libraries and the technologies used.
Your middle step does seem to be a bit lengthy, with many moving parts. However, I've worked on a number of large reporting systems where that is truly the only way to get around it, and without knowing more about your organization and the exact nature of its operations, it's hard to suggest specific simplifications.
By "fast calculations" you must mean "fast retrieval" Data warehouses (both relational and otherwise) are fast with math because the answers are pre-calculated in advance. SQL, unless you are using CLR stored procedures, is usually a rather slow when it comes to math.
You'd be hard pressed to beat the performance of BCP and SQL with anything else. If the update routines are long and bloated because they loop through the tables, then sure, I can see why you'd want to go to .NET. But you'd probably gain more by figuring out how to rewrite them to be nice and set-based; BCP is not going to be beaten. When I used SQL Server 2000, BCP was often faster than DTS. And SSIS in general (due to all the data type checking) seems to be way slower than DTS. If you kill performance, no doubt people are going to come to you. Still, if you are doing a ton of row-by-row complex calculations, optimizing that into a CLR stored procedure or even a .NET application called from SQL Server to do the processing will probably result in a speed-up. Of course, if you were row processing and you manage to rewrite the queries to do set processing, you'd probably get a bigger speed-up. But depending on how complex the calculations are, .NET may help.
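To make the set-based point concrete, here is the kind of rewrite being described, in sketch form only (table and column names are invented):

-- row-by-row: a cursor recalculating one order at a time (slow for large volumes)
DECLARE @OrderId int;
DECLARE order_cur CURSOR FOR SELECT OrderId FROM dbo.Orders WHERE Recalc = 1;
OPEN order_cur;
FETCH NEXT FROM order_cur INTO @OrderId;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Orders
    SET Total = (SELECT SUM(Price * Quantity) FROM dbo.OrderLines WHERE OrderId = @OrderId)
    WHERE OrderId = @OrderId;
    FETCH NEXT FROM order_cur INTO @OrderId;
END
CLOSE order_cur; DEALLOCATE order_cur;

-- set-based: the same work in one statement
UPDATE o
SET o.Total = l.LineTotal
FROM dbo.Orders AS o
JOIN (SELECT OrderId, SUM(Price * Quantity) AS LineTotal
      FROM dbo.OrderLines
      GROUP BY OrderId) AS l ON l.OrderId = o.OrderId
WHERE o.Recalc = 1;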
Now if a front end change could immediately update and propagate the data, then you might want to change things to .NET so that as soon as a row is changed it can be recalculated and update all the clients. However if a lot of rows are changed or the database is just ginormous then you will kill performance. If the operation needs to be done in bulk then probably the way it is currently being done is the best.
The only thing I might add is that maybe there is a lot of duplicate SQL that looks exactly the same except for a table name and/or the column names. If so, you can probably use .NET combined with SQL-SMO (or DMO if using SQL Server 2000) to generate it.
Here's a pattern I often see used to load a data warehouse (a rough T-SQL sketch follows the list below). Assuming some raw tables are loaded with the data from the source:
select changed rows from source into temporary tables
see if any columns that matter were changed
if so terminate existing row (or clone it into some history table)
insert/update new row
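A bare-bones T-SQL sketch of that pattern (the source, staging, and dimension table names, and the @LastRunTime parameter, are all invented for the example):

-- 1. select changed rows from the source into a temporary table
SELECT s.CustomerId, s.Name, s.Region
INTO #changed
FROM SourceDb.dbo.Customers AS s
WHERE s.ChangedDate > @LastRunTime;

-- 2./3. terminate the existing row if a column that matters changed
UPDATE d
SET d.ValidTo = SYSDATETIME(), d.IsCurrent = 0
FROM dbo.DimCustomer AS d
JOIN #changed AS c ON c.CustomerId = d.CustomerId
WHERE d.IsCurrent = 1
  AND (d.Name <> c.Name OR d.Region <> c.Region);

-- 4. insert the new version of each changed row
INSERT INTO dbo.DimCustomer (CustomerId, Name, Region, ValidFrom, ValidTo, IsCurrent)
SELECT c.CustomerId, c.Name, c.Region, SYSDATETIME(), NULL, 1
FROM #changed AS c
WHERE NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d
                  WHERE d.CustomerId = c.CustomerId AND d.IsCurrent = 1);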
I often see one of those queries per table, and the only variations are the table/column names and maybe references to the key column. You can easily get the column definitions and key definitions out of SQL Server and then write a .NET program to create the INSERT/SELECT/etc. In the worst case you may just have to store some kind of table with TABLE_NAME, COLUMN_NAME for the columns that matter. Then instead of having to wrap your head around a complex ETL process and 20 or 200 update queries, you just need to wrap your head around one UPDATE and one query. Any changes to the way things are done can be made once and applied to all the queries.
In particular my guess is that you can apply this technique to the individual client databases if you haven't already. Probably all the queries/bulk copy scripts are the same or almost the same with the exception of database/server name. So you can just autogenerate them based on a CLIENTs table or something.....