My site is going to have many products available, but they'll be categorised into completely different sites (domains).
My question is, am I better off lumping all products into one database and using an ID to distinguish between the sites, or should I set up a table and /or DB per site?
Here are my thoughts
SEPARATE DATABASES
Easier to read from a backend
Categorised better
Makes backups more difficult
If I need to make a change to the schema, it will need to be pushed out to all databases
SAME DATABASES
All in one place
Could get unwieldy
One database will have a massive file size and lookups could suffer
Can someone please offer me some advice on which way is best and why?
You didn't give too many details (which makes it difficult to provide a good answer), though the words you chose to use in your question lead me to believe that this is a single application with different "skins".
My site is going to have many products available, but they'll be categorised into completely different sites (domains).
My assumption is that you will have a single web store with several different store fronts: cool-widgets.com, awesome-sprockets.com, neato-things.com, etc. These will all be the same, save for maybe a CSS skin or something simple like that. The store admin stuff will all be done in some central system, and the domain name will simply act as a category name.
As such, splitting the same data into two different containers using an arbitrary criterion (category_ name=='cool-widges.com') is data partitioning, which is an anti-pattern. Just as you wouldn't have two different user tables based on the user name ([Users$A-to-M] and [Users$N-to-Z]), it makes little sense to have two different tables (or databases) for category names.
There is, and will be, lots of code common among the categories: user management, admin, order processing, data import, etc. It will be far more difficult to aggregate the multiple datastores in the common code than it will be to segregate the categories in the store display code. Not only that, the segregation bugs will be much more obvious: the price comparison page shows items from all three stores. The aggregation bugs will be much less: only three of the four stores were updated. This is why it's an anti-pattern.
Side note: yes, before you say that data portioning has its uses (which it does), those uses come in far after performance problems occur. Many serious database platforms allow behind-the-scenes partitioning as not to create a goofy data model.
If data needs to be shared among all the sites, then it will be recommended to share the same database since data transfer is eliminated. Also data is more centralized.
If data does not need to be shared among all sites, it'll be good to split up one database per site. Talking about difficulty to update table structures, you can just simply record down the database changes (saving the ALTER, UPDATE, DELETE queries in a SQL file) you make for one, and update the other databases with the same SQL file.
Storing in different databases might also help with security. You can set different user permissions for each of the site. and if one gets compromised, you protect the other sites.
Also, you are able to easily maintain and track database when the databases are clearly split up.
As you already say, both options have their pros and cons. Since you're talking about two stores, it probably doesn't matter much.
However, a few questions you might want to ask yourself:
Will it really be two stores, or possibly more? If more, one database might be smarter.
Are the products really the same? If you're gonna have to squeeze products in one general database, because they are of a different kind (eg. cars vs. food; the amount and nature of the details you want to store are completely different), then don't; use two databases / tables instead.
The central question is: what is most likely to become more elaborate in the future: the stores, or the products?
I think separate databases will be easier. You can have a quick-start template database from which you can build a new store database. You can even create a common database and contain common tables and list of stores and their databases. After all you can access to any database within the same server using qualified name, observe:
SELECT value FROM CommonDB.currencies WHERE type='euro';
SELECT price FROM OldTownDB.Products WHERE id=newtownprodid;
Related
I've research this topic and I'm relatively sure in most practices the answer is "No", but I would like some second opinions specific to my case.
We're currently working on a multi user web-app where each user will basically have there own copy "portal/app" within the web-app. It's not performance I'm worried about, but security.
I'm considering partitioning the data with a prefix userid_table1, userid_table2 to make it more manageable and ensure no security validation oversight is made by the team in development as we can easily add a validation to ensure that queries can only be run against tables with userid_*.
Would you still recommend against this method ?
I'm considering partitioning the data with a prefix userid_table1, userid_table2 to make it more manageable and ensure no security validation oversight is made by the team in development as we can easily add a validation to ensure that queries can only be run against tables with userid_*.
More manageable? That sounds like a joke. Your database will end up with a zillion different tables. Any operation that you want to do across all users will be a nightmare:
Declaring foreign key constraints.
Defining a new index on the tables.
Adding a new column.
Restructuring the tables.
And so on. And so on.
Your users may be limited to a single table. But the application developer and DBA need to deal with all of them. I cringe thinking about trying to figure out where performance bottlenecks are in such a system.
I should add that databases are optimized for big tables not lots of tables, so multiple tables are typically less efficient. And even less efficient when you think about all the half-filled pages in all those tables.
The same entities should not be spread among multiple tables, unless you have a really, really good reason. This is not a really good reason. One simple solution is to prevent users from having access to the base tables. Just give them access to views or user-defined table functions -- and have all of these filter on user ids.
There are some edge cases where you do want separate tables for each user. Typically, each user would have a very complex tables (think B2B application) and, in fact, they might have their own database. There may also be legal requirements to separate data. In these cases, though, the "separateness" would typically be at the database level, not the table level.
I am thinking and exploring options on designing database for my new application. In general, I will have registered users and info about them. They will be able to do some things in app and that data will be in the sam DB as users data (so I can have FK's shared and stuff)
But, then I plan to have second database that will be in logic totally independent of the first database except it will share userID as FK.
I don't know should I even put that second logic in an extra DB or should I have everything in the same database. I plan to have subdomain in my app for second logic (it is like app in app) but what if I discover they should share more data? Will that cross querying drop my peformances? And is that a way to go actually, is there a real reason to separate databases ?
As soon as you have two databases you have potential complexity. You have not given any particular reason why you need two databases. So keep it simple until you have a reason.
An example of what folks do: have a "current" database, small, holding just the data needed right now. That might be where orders are taken and fulfilled. Once the data is no longer current, say some days or weeks after the order is filled move the data to a "historic" database. There marketing and mangement folks can look at overall trends in the history without affecting performance of the "current" database, whose performance might be critical to keeping your customers happy.
As an example of complexity: any time you have two databases you need to consider consistency between them, this is much harder to ensure than it might appear. Databases do offer Two-Phase Transactional capabilities, or you can devise batch processes but there are always subtleties that are hard to catch.
I would just keep all in one database. Unless you have dozens of tables there should be no real performance problems, imho. It will however facilitate your life greatly, only having to work with one database connection & not having to worry about merging information from two queries,
Also agree that unless volume of your data is going to be huge (judging by the question, doesn't seem like that is the case here), you can use single database to store your data without performance issues.
For "visual" separation of data structure, you can always create tables in two schemas of single database.
I've worked in several SQL environments. In one environment, the different tables holding business data were split across several different SQL databases, all on the same server.
In another environment, almost all the tables are kept on one single SQL database.
I'm creating a new project that is closely related to another project, and I've been wondering if I should put the new tables in the same SQL database or a new SQL database.
This all runs on MS SQL Server.
What factors do I need to consider as I make this decision?
It's tough from your question to tell what your actual requirements are, or what data you would consider to store in different databases. But in addition to Gordon's points I can address a couple of additional reasons why you might want to use separate databases for data belonging to different customers / users (and this answer assumes that one possible separation of data, whether by database or schema, would be by customer):
As I mentioned in a comment, some customers will demand that their data be stored separately, and you may need to agree to that in writing before you see a penny or are able to secure their business. So you may as well be prepared for that inevitability.
Keeping each customer in their own database makes it very easy to move them if they outgrow your current server. At my previous job we designed the system in this way, and it saved our bacon later - we were able to move customers completely to a different server with what essentially amounted to a metadata operation. During a maintenance window, backed up their database, set the original to offline, restored the backup to a new server, and updated a config table that told all the apps where to find that database. This is much more flexible than trying to extract all of their data from a database shared by others...
Separate databases also allow you to handle maintenance differently. One customer needs point-in-time restore, and another doesn't? Perfect, you can just use a different recovery model on separate databases. Much easier than separating by filegroups and trying to implement some filegroup-level backup solution, and much more efficient than just treating one big database in full recovery.
This isn't free, of course, it's about trade-offs. Multiple databases scares some people away but having managed such a system for 13 years I can tell you that managing 100 or 500 databases that are largely identical is not that much more complicated than managing 500 schemas in one massive database (in fact I would say it is less so in a lot of respects).
A database is the unit of backup and recovery, so that should be the first consideration when designing database structures. If the data has different back up and recovery requirements, then they are very good candidates for separate databases.
That is only half the problem, though. In most environments, backup/recovery is pretty much the same for all databases. It becomes a question of application design. In other words, the situation becomes quite subjective.
In the environment that I'm working in right now, here are some criteria for splitting data into different databases:
(1) Publishing tables to a wide audience. We "publish" data in tables and put these into a database, separate from other tables used for building them or for special purposes. Admittedly, SQL Server claims that "schema" are the unit of security. However, databases seem to do a good job in the real world.
(2) Strict security requiremeents. Some data is so sensitive that lawyers have to approve who can see it. This goes into its own database, with its own access.
(3) Separation of data tables (which users can see) and tables that describe the production system.
(4) Separation of tables used for general querying by a skilled group of analysts (the published tables) versus tables used for specific reports/applications.
Finally, I would add this. If some of the data is being updated continuously throughout the day and other data is used for reporting, I would tend to put them in different databases. This helps separate them in the case of problems.
I am creating an app with a WPF frontend and a PostgreSQL database. The data includes patient addresses and supplier addresses. There is an average of about 3 contacts per mailing address listed. I'm estimating 10,000 - 15,000 contact records per database.
When designing the database structure, it occurred to me that rather than storing mailing addresses in a single "contacts" table, I could have one table storing names and other individual data, with a second table holding addresses. I could then create a relationship between the tables, to match addresses with contacts.
I have a pretty good idea how I can neatly organise situations such as changing the address of a single contact, where the other contacts are staying at the same address.
The question is: is it worth it? Can I expect to save much in the way of storage size? Will this impact the speed of queries adversley? How about if I was using something other than PostgreSQL?
I would strongly suggest normalizing this. You never know what kind of trouble you will run into. LedgerSMB has a relatively decent entity/user/contact/location schema that creates a very flexible environment. You can see it here (starts at line 363):
http://ledger-smb.svn.sourceforge.net/viewvc/ledger-smb/trunk/sql/Pg-database.sql?revision=3042&view=markup
Unless you think a large number of your users will be sharing addresses and they'll be often changing, I don't see the need to normalize out the address portion. In the various places I've worked and see users tables, sometimes it is, sometimes it isn't - never really seemed to make a terrible amount of trouble one way or another.
In terms of performance, with just 10-15k records and proper indexes, I can't imagine you'd notice too much difference one way or the other on modern hardware (although technically the separate table should be slower).
I agree with Joshua. Once it's set up properly (normalized) it's very easy to manage any changes in your app in the future.
Our software currently runs on MySQL. The data of all tenants is stored in the same schema. Since we are using Ruby on Rails we can easily determine which data belongs to which tenant. However there are some companies of course who fear that their data might be compromised, so we are evaluating other solutions.
So far I have seen three options:
Multi-Database (each tenant gets its own - nearly the same as 1 server per customer)
Multi-Schema (not available in MySQL, each tenant gets its own schema in a shared database)
Shared Schema (our current approach, maybe with additional identifying record on each column)
Multi-Schema is my favourite (considering costs). However creating a new account and doing migrations seems to be quite painful, because I would have to iterate over all schemas and change their tables/columns/definitions.
Q: Multi-Schema seems to be designed to have slightly different tables for each tenant - I don't want this. Is there any RDBMS which allows me to use a multi-schema multi-tenant solution, where the table structure is shared between all tenants?
P.S. By multi I mean something like ultra-multi (10.000+ tenants).
However there are some companies of
course who fear that their data might
be compromised, so we are evaluating
other solutions.
This is unfortunate, as customers sometimes suffer from a misconception that only physical isolation can offer enough security.
There is an interesting MSDN article, titled Multi-Tenant Data Architecture, which you may want to check. This is how the authors addressed the misconception towards the shared approach:
A common misconception holds that
only physical isolation can provide an
appropriate level of security. In
fact, data stored using a shared
approach can also provide strong data
safety, but requires the use of more
sophisticated design patterns.
As for technical and business considerations, the article makes a brief analysis on where a certain approach might be more appropriate than another:
The number, nature, and needs of the
tenants you expect to serve all affect
your data architecture decision in
different ways. Some of the following
questions may bias you toward a more
isolated approach, while others may
bias you toward a more shared
approach.
How many prospective tenants do you expect to target? You may be nowhere
near being able to estimate
prospective use with authority, but
think in terms of orders of magnitude:
are you building an application for
hundreds of tenants? Thousands? Tens
of thousands? More? The larger you
expect your tenant base to be, the
more likely you will want to consider
a more shared approach.
How much storage space do you expect the average tenant's data to occupy?
If you expect some or all tenants to
store very large amounts of data, the
separate-database approach is probably
best. (Indeed, data storage
requirements may force you to adopt a
separate-database model anyway. If so,
it will be much easier to design the
application that way from the
beginning than to move to a
separate-database approach later on.)
How many concurrent end users do you expect the average tenant to support?
The larger the number, the more
appropriate a more isolated approach
will be to meet end-user requirements.
Do you expect to offer any per-tenant value-added services, such
as per-tenant backup and restore
capability? Such services are easier
to offer through a more isolated
approach.
UPDATE: Further to update about the expected number of tenants.
That expected number of tenants (10k) should exclude the multi-database approach, for most, if not all scenarios. I don't think you'll fancy the idea of maintaining 10,000 database instances, and having to create hundreds of new ones every day.
From that parameter alone, it looks like the shared-database, single-schema approach is the most suitable. The fact that you'll be storing just about 50Mb per tenant, and that there will be no per-tenant add-ons, makes this approach even more appropriate.
The MSDN article cited above mentions three security patterns that tackle security considerations for the shared-database approach:
Trusted Database Connections
Tenant View Filter
Tenant Data Encryption
When you are confident with your application's data safety measures, you would be able to offer your clients a Service Level Agrement that provides strong data safety guarantees. In your SLA, apart from the guarantees, you could also describe the measures that you would be taking to ensure that data is not compromised.
UPDATE 2: Apparently the Microsoft guys moved / made a new article regarding this subject, the original link is gone and this is the new one: Multi-tenant SaaS database tenancy patterns (kudos to Shai Kerer)
Below is a link to a white-paper on Salesforce.com about how they implement multi-tenancy:
http://www.developerforce.com/media/ForcedotcomBookLibrary/Force.com_Multitenancy_WP_101508.pdf
They have 1 huge table w/ 500 string columns (Value0, Value1, ... Value500). Dates and Numbers are stored as strings in a format such that they can be converted to their native types at the database level. There are meta data tables that define the shape of the data model which can be unique per tenant. There are additional tables for indexing, relationships, unique values etc.
Why the hassle?
Each tenant can customize their own data schema at run-time without having to make changes at the database level (alter table etc). This is definitely the hard way to do something like this but is very flexible.
My experience (albeit SQL Server) is that multi-database is the way to go, where each client has their own database. So although I have no mySQL or Ruby On Rails experience, I'm hoping my input might add some value.
The reasons why include :
data security/disaster recovery. Each companies data is stored entirely separately from others giving reduced risk of data being compromised (thinking things like if you introduce a code bug that means something mistakenly looks at other client data when it shouldn't), minimizes potential loss to one client if one particular database gets corrupted etc. The perceived security benefits to the client are even greater (added bonus side effect!)
scalability. Essentially you'd be partitioning your data out to enable greater scalability - e.g. databases can be put on to different disks, you could bring multiple database servers online and move databases around easier to spread the load.
performance tuning. Suppose you have one very large client and one very small. Usage patterns, data volumes etc. can vary wildly. You can tune/optimise easier for each client should you need to.
I hope this does offer some useful input! There are more reasons, but my mind went blank. If it kicks back in, I'll update :)
EDIT:
Since I posted this answer, it's now clear that we're talking 10,000+ tenants. My experience is in hundreds of large scale databases - I don't think 10,000 separate databases is going to be too manageable for your scenario, so I'm now not favouring the multi-db approach for your scenario. Especially as it's now clear you're talking small data volumes for each tenant!
Keeping my answer here as anyway as it may have some use for other people in a similar boat (with fewer tenants)
As you mention the one database per tenant is an option and does have some larger trade-offs with it. It can work well at smaller scale such as a single digit or low 10's of tenants, but beyond that it becomes harder to manage. Both just the migrations but also just in keeping the databases up and running.
The per schema model isn't only useful for unique schemas for each, though still running migrations across all tenants becomes difficult and at 1000's of schemas Postgres can start to have troubles.
A more scalable approach is absolutely having tenants randomly distributed, stored in the same database, but across different logical shards (or tables). Depending on your language there are a number of libraries that can help with this. If you're using Rails there is a library to enfore the tenancy acts_as_tenant, it helps ensure your tenant queries only pull back that data. There's also a gem apartment - though it uses the schema model it does help with the migrations across all schemas. If you're using Django there's a number but one of the more popular ones seems to be across schemas. All of these help more at the application level. If you're looking for something more at the database level directly, Citus focuses on making this type of sharding for multi-tenancy work more out of the box with Postgres.