Multi-tenancy with SQL/WCF/Silverlight

We're building a Silverlight application which will be offered as SaaS. The end product is a Silverlight client that connects to a WCF service. As the number of clients is potentially large, updating needs to be easy, preferably so that all instances can be updated in one go.
Not having implemented multi-tenancy before, I'm looking for opinions on how to achieve:
Easy upgrades
Data security
Scalability
Three different models to consider are listed on MSDN (sketched in DDL below):
1. Separate databases. This is not easy to maintain, as all schema changes have to be applied to each customer's database individually. Are there other drawbacks? A pro is data separation and security. This also allows for slight modifications per customer (which might be more hassle than it's worth!)
2. Shared Database, Shared Schema. A TenantID column is added to each table. Ensuring that each customer gets the correct data is potentially dangerous. Easy to maintain and scales well (?).
3. Shared Database, Separate Schemas. Similar to the first model, but each customer has its own set of tables in the database. Hard to restore backups for a single customer. Maintainability otherwise similar to model 1 (?).
Any recommendations on articles on the subject? Has anybody explored something similar with a Silverlight SaaS app? What do I need to consider on the client side?
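For concreteness, here is roughly how I picture the three models in DDL; all names are hypothetical:

    -- Model 1: separate databases; the same schema is deployed
    -- into every customer's own database.
    CREATE DATABASE CustomerA_Db;
    GO

    -- Model 2: shared database, shared schema; every table carries
    -- a TenantID column and every query filters on it.
    CREATE TABLE dbo.Orders (
        TenantID  int      NOT NULL,
        OrderID   int      NOT NULL,
        OrderDate datetime NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY (TenantID, OrderID)
    );
    GO

    -- Model 3: shared database, separate schemas; each customer
    -- gets its own schema with its own copy of every table.
    CREATE SCHEMA CustomerA;
    GO
    CREATE TABLE CustomerA.Orders (
        OrderID   int      NOT NULL PRIMARY KEY,
        OrderDate datetime NOT NULL
    );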

It depends on the type of application and the scale of the data. Each option has downsides.
1a) Separate databases + a single instance of WCF/client. Keeping everything in sync will be a challenge. How do you upgrade X number of DB servers at the same time, and what if one fails, ends up out of sync, and is no longer compatible with the client/WCF layer?
1b) "Silos": a separate DB/WCF/client for each customer. You don't have the sync issue, but you do have the overhead of managing many different instances of each layer. Also, you will have to look at SQL licensing; I can't remember if separate instances of SQL are licensed separately ($$$). Even if you can install as many instances as you want, the overhead of multiple instances will not be trivial after a certain point.
3) Basically the same issues as 1a/b, except for licensing.
2) Best upgrade/management scenario. You are right that maintaining data isolation is a huge concern (1a technically shares this issue at a higher level). The other issue is that if your application is data-intensive, you have to worry about data scalability. For example, if every customer is expected to have tens or hundreds of millions of rows of data, you will start to run into data volume issues and degraded query performance for individual customers due to the total customer base's volume. Clients are more forgiving of slowdowns caused by their own data volume; being told it's slow because the other 99 clients' data is large is generally a no-go.
Unless you know for a fact that you will be dealing with huge data volumes from the start, I would probably go with #2 for now and begin looking at clustering or moving to a 1a/b setup if needed in the future.

We also have a SaaS product, and we use solution #2 (shared DB / shared schema with a TenantId). Some things to consider for shared DB / same schema for all:
As mentioned above, a high volume of data for one tenant may affect the performance of the other tenants if you're not careful; for starters, index your tables properly/carefully and never ever write queries that force a table scan. Monitor query performance, and at least plan/design to be able to partition your DB later on, based on some criteria that make sense for your domain.
Data separation is very, very important; you don't want to end up showing a piece of data to one tenant that belongs to another tenant. Every query must have a WHERE TenantId = ... in it, and you should be able to verify/enforce this during dev (see the sketch after this list).
Extensibility of the schema is something that solutions 1 and 3 may give you, but you can work around it by designing a way to extend the fields associated with the documents/tables in your domain where it makes sense (i.e., metadata for tables, as the MSDN article mentions).
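A minimal sketch of the indexing and filtering points, with hypothetical names; putting TenantId first in the clustered key keeps each tenant's rows together, so per-tenant queries seek instead of scanning the table:

    CREATE TABLE dbo.Invoices (
        TenantId  int           NOT NULL,
        InvoiceId int           NOT NULL,
        Amount    decimal(18,2) NOT NULL,
        CONSTRAINT PK_Invoices PRIMARY KEY (TenantId, InvoiceId)
    );
    GO
    -- Every data-access query carries the tenant predicate; the value
    -- comes from the authenticated session, never from user input.
    DECLARE @TenantId int = 42;
    SELECT InvoiceId, Amount
    FROM   dbo.Invoices
    WHERE  TenantId = @TenantId;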

What about solutions that provide an out-of-the-box architecture, like Apprenda's SaaSGrid? They let you make database decisions at deploy and maintenance time rather than at design time. It seems they actively transform and manage the data layer, as well as provide an upgrade engine.

I had a similar case, and my solution was to take advantage of both approaches.
Where the data lives and how it is stored are questions tenants will ask. As a tenant, of course I don't want my data to be shared; I want my data isolated and secure, and I want to be able to get it at any time.
Certain data can possibly be shared, e.g. a company list. So there can be a global database plus a per-tenant database; just make sure the tenant database schema is locked down operationally, and have a procedure to update all tenant databases at once.
In any case, in the SaaS model everything is delivered as a server / web service, so no matter where the database lives, the data comes to the client as a service and is only rendered by the client GUI.
Thanks

Existing answers are good. You should look deeply into the issue of upgrading and managing multiple databases. Without knowing the specific app, it might turn out easier to have multiple databases and not have to pay the extra cost of tracking the TenantID. This might not end up being the right decision, but you should certainly be wary of the dev cost of data sharing.

Related

MariaDB data separation into public and private, database design

I am working at a company that merged with another company a while ago.
We have several business units that are basically equivalent: one in Europe and one in China. We already had an in-house MariaDB database, which we now want to start sharing.
The problem is that different GDPR regulations and contracts prohibit sharing certain data across sites. So what I can't do is replicate the data across sites and then just hide it from the user in the frontend. The private data has to stay at the facility it belongs to.
So my idea was to split each existing table that may contain sensitive information into two tables:
say, table_contracts_private and table_contracts_public.
This still seems pretty doable with basic database replication, replicating the public tables across sites. But how would you go about publishing the private data? Also, how would I best combine private and public data? Just using a view?
I just could not find any good mechanisms for this, especially because we would also like to avoid data duplication, so the private entries would need to be removed and replaced by the public ones, which would also entail changing all referencing IDs.
Is this a possible application of sharding?
I'd be really grateful if someone could point me in the right direction, or if someone has a demo project with similar requirements that I could check out.
Cheers
Is this a possible application of sharding?
I wouldn't think so. Sharding is a performance optimization method. What you need is to support legal constraints. Those are two very different problems.
I think you are on the right track. I call this a "walled garden" approach. You create a database with all the non-PII information, using ids only: nothing that even remotely identifies people directly, such as their addresses, phones, or credit cards. This can be tricky; in some jurisdictions, combinations of demographics can be PII.
Some of those ids then refer to another database where you store all the sensitive information; this is the "walled garden". I would recommend that this second database be on a separate server. It has a very restricted access list. And this is where you implement requirements for things like "forgetting" a customer (see the sketch below).
In any case, the point is that sharding is not the right approach. You want an application redesign with privacy and security as the top priorities. Happily, this is not actually that hard to implement, although if the databases are changing, you may need periodic auditing. For instance, in one database I worked with, we discovered that "coupon codes" sometimes contained unencrypted email addresses. Arrgggh!
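A minimal sketch of that split in MariaDB, with hypothetical table names, assuming the private table lives in a separate, access-restricted database or server and only the public table is replicated; at the owning site, a view recombines the halves:

    -- Replicated everywhere: no PII, opaque ids only.
    CREATE TABLE contracts_public (
        contract_id INT PRIMARY KEY,
        site        VARCHAR(20)   NOT NULL,
        volume      DECIMAL(12,2) NOT NULL
    );

    -- The "walled garden": stays at the owning facility,
    -- with a very restricted access list.
    CREATE TABLE contracts_private (
        contract_id   INT PRIMARY KEY,  -- same id as the public row
        customer_name VARCHAR(200),
        email         VARCHAR(200)
    );

    -- At the owning site only, a view joins the halves back together.
    CREATE VIEW contracts AS
    SELECT p.contract_id, p.site, p.volume, s.customer_name, s.email
    FROM contracts_public p
    LEFT JOIN contracts_private s ON s.contract_id = p.contract_id;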

Handling a relational model in Cassandra

Background
We have chosen Cassandra as our storage engine because we have an application that must handle async messaging between many users on the website, as well as event storing (some types of analytics: what happens on the site and when, etc.). We also have a voting platform, so we store votes per user per day, and Cassandra is good for those use cases.
Recently we got new requirements to build a relational model on top of our existing system (at least we think it is relational): types of political candidates with lists of jobs, education, historical voting, endorsements, etc.
Problem
We have relations which can be edited on both ends (i.e. a candidate is supported by companies, but in our admin panel that company can be edited without the candidate). A candidate is one row in our Cassandra DB, identified by a UUID. On the front end we need full information about a candidate (political party, schools, jobs, voting history, supporting companies). We want to place the majority of the candidate info in a single row so we can read the data with a single read. However, when we store the list of supporting companies as a UDT collection, we have problems editing it (we need to change it in both the company_by_id and candidate_by_id tables).
Question
How to solve the editing problem and relational model issues in our situation?
We came up with a couple of solutions:
1. Track relations in Cassandra with additional index-like tables, e.g. candidates_by_supporting_company. When updating a company, we also update the candidates who have that company.
2. Similar to 1, but using a secondary index if the relation has low cardinality, and updating based on the secondary index (we have 10 political parties, so we can place an index on political party in the candidates table; when a political party changes, we can change the candidates by political party since we have the index).
3. Use a relational database for the relational type of data and leave Cassandra to handle only the use cases it suits, like time-series data, messaging, and event storing (this adds the maintenance cost of one more database, plus deployment costs, and, since our system is distributed, the problem of how to replicate the data).
4. Use Spark to do the joins (this would not be the sole purpose of adding Spark to the system; we are thinking of adding it for importing huge CSV data sets and doing transformations, so having Spark would be an added bonus, and we could use Spark SQL where we need joins).
We are leaning towards option 4, since we will add Spark anyway, we stay with only the Cassandra database (which avoids complicating maintenance and deployment with one more database), and we get a sort of efficient JOIN and GROUP BY at the application level with it.
What do you think?
If you want to use only Cassandra, the right way to proceed is option 1: denormalization (sketched below). But if you have a lot of relationships, it will bring a lot of effort at the application level.
If adding another DBMS is not a problem in your environment, then using the right tool for the right job is the best choice: option 3 for me.
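If you go with option 1, the denormalization looks roughly like this in CQL; the table and column names here are hypothetical apart from company_by_id, which is the table from your question (I'm assuming it carries a name column):

    -- Index-like lookup table: which candidates a company supports.
    CREATE TABLE candidates_by_supporting_company (
        company_id   uuid,
        candidate_id uuid,
        company_name text,   -- denormalized copy, duplicated per candidate
        PRIMARY KEY (company_id, candidate_id)
    );

    -- Editing a company then means rewriting every duplicated copy:
    -- first find the affected candidates...
    SELECT candidate_id FROM candidates_by_supporting_company
    WHERE company_id = 11111111-2222-3333-4444-555555555555;

    -- ...then update the source row and each duplicate in a logged
    -- batch, so they succeed or fail together (one UPDATE per candidate).
    BEGIN BATCH
        UPDATE company_by_id
           SET name = 'New Name'
         WHERE company_id = 11111111-2222-3333-4444-555555555555;
        UPDATE candidates_by_supporting_company
           SET company_name = 'New Name'
         WHERE company_id = 11111111-2222-3333-4444-555555555555
           AND candidate_id = aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee;
    APPLY BATCH;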

Is it considered poor practice to use a single database for different uses across different applications

What if you had one large database to serve all your apps? Then the website that needs to store customer orders could use the same database that your game uses to store registered users. Different applications could have tables only for their own use. Some may say that this could be a security issue, because if someone cracks your database, they could attack all your applications. But in a lot of databases you can use a line like the following to restrict access:
DENY SELECT ON aTable TO aUser;
I am wondering if this central database would be considered a poor practice, and if so why?
The way I look at it, a web application is nothing more than a collection of web pages. Because of this, it really doesn't matter if one page is about, say, cooking, while another page is about computer programming.
If you consider it, this is very similar to OpenID, which I use to log into my SO account!
If you have your fundamental security implemented correctly, it doesn't matter how the user is interacting with your website. Where I would make a distinction is in two cases:
Don't mix HTTP with HTTPS. On a shared host this isn't going to be an issue anyway; if you buy the certificate for HTTPS, make everything that way (excluding the rare case where this might affect performance).
E-commerce or financial data should be handled in a fundamentally different way. If you look at your typical bank, they have multiple log-in protocols, picture verification, and short log-in times. This builds confidence in users' security. It would be a pain in the butt for a game site or most other non-mission-critical applications.
Regarding structure, if you do mix applications into one large database, you should consider the other maintenance issues, such as:
Keep tables separate; consider a unique prefix for every table in each application. Following my example above, you would then start the cooking DB table names with 'ck' and the computer-programming DB table names with 'pg'. This allows you to easily separate the applications if you need to in the future.
Use a matching table to identify which ID goes with which web application.
Consider what you would do, and how you would handle it, if a user decided to register for both applications. Do you want to offer transparency so that they can share the same username?
Keep an eye on both your data storage limit AND your bandwidth limit.
If you are counting on these applications to drive revenue, you are putting all your eggs in one basket. Make sure that if it goes down, you have options to restore or move to another host.
These are just a few of the things to consider. But fundamentally, outside of huge (big data) applications, there is nothing wrong with sharing resources/databases/hardware between applications.
Conceptually, it could be done.
Implementation-wise, to make the various parts distinct from one another, you could use naming conventions (as per @Sable Foste) and/or separate database schemas (tables Finance.Users, GameApp.Users, etc.), as sketched below.
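A minimal T-SQL sketch of the schema approach, with hypothetical schema, table, and user names; each application's database user is then granted rights only on its own schema:

    CREATE SCHEMA Finance;
    GO
    CREATE SCHEMA GameApp;
    GO
    CREATE TABLE Finance.Users (UserId int PRIMARY KEY, Email  nvarchar(200) NOT NULL);
    CREATE TABLE GameApp.Users (UserId int PRIMARY KEY, Handle nvarchar(50)  NOT NULL);
    GO
    -- Each application's database user only sees its own schema.
    GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA::Finance TO FinanceAppUser;
    DENY  SELECT ON SCHEMA::Finance TO GameAppUser;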
Management-wise, things could get tricky. Repeating some points, adding others:
One application could use a disproportionately large share of resources (disk space, I/O, CPU).
Tracking versions could be tricky (the game app is on v4, finance on v7); it depends on how many application instances you have to support.
Disaster-recovery-wise, everything is lumped together. It all gets backed up as one set, and it all gets restored as one set. Finance corrupt? Restore from backup... and lose your more recent game data.
Single point of failure: if the one database goes down, all your applications are down.
These (and other similar issues) are trade-offs you'll want to consider. Plan ahead to lessen the chance that what's reasonable and economical today becomes a major headache tomorrow.

Should I create separate SQL Server database for each user?

I am working on an ASP.NET MVC web application; the back end is SQL Server 2012.
This application will provide billing, accounting, and inventory management. Users will create accounts by signing up, just like on http://www.quickbooks.in. Each user will create some master records and various transactions. There is no limit; a user can create unlimited records in the database.
I want database performance to stay stable under heavy data load. I am maintaining proper indexing and primary keys, but there will be a heavy per-user load on the database.
So, should I create a separate database for each user, or should I maintain one database, adding a UserID to each table and partitioning based on UserID?
I am not an expert in SQL Server, so please provide suggestions with clear specifications.
Please inform me if there is any lack of information.
A DB per user is what happens when customers need to be able to pack up and leave, taking the actual database with them (think of a self-hosted WordPress website), or when there are incredible risks of one user accidentally seeing another user's data, so that it's safer to rely on the server's security model than on remembering to add the UserID filter to all your queries. I can't imagine a scenario like that, but who knows; maybe if the privacy laws allowed for jail time, I would rather have the data partitioned by security rules than depend on carefully written WHERE clauses.
If you did go database-per-user, creating a new user would be 10x more effort. While INSERT, UPDATE and so on stay the same from version to version, the syntax for database creation, user creation, permission granting and so on evolves enough with each SQL version upgrade to break those scripts.
Also, this will multiply your migration headaches by the number of users. Let's say you have 5000 users and you need to add some new columns, change a column's data type, update a trigger, and so on. Instead of needing to run that change script once, you need to run it 5000 times (a sketch of that loop follows).
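To make that concrete, a hedged sketch of what every schema change turns into under database-per-user, assuming the tenant databases share a hypothetical Tenant_ name prefix:

    DECLARE @db sysname, @sql nvarchar(max);
    DECLARE dbs CURSOR LOCAL FAST_FORWARD FOR
        SELECT name FROM sys.databases WHERE name LIKE N'Tenant[_]%';
    OPEN dbs;
    FETCH NEXT FROM dbs INTO @db;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- The same change script, once per tenant database; a single
        -- failure leaves that tenant out of sync with the application.
        SET @sql = N'USE ' + QUOTENAME(@db)
                 + N'; ALTER TABLE dbo.Orders ADD Notes nvarchar(max) NULL;';
        EXEC sys.sp_executesql @sql;
        FETCH NEXT FROM dbs INTO @db;
    END
    CLOSE dbs;
    DEALLOCATE dbs;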
Per-user DBs also probably waste disk space. Each of those databases is going to have a transaction log sitting idle, taking up at least the minimum log space.
As for load, if collectively your 5000 users are doing 1 billion inserts, updates and so on per day, my intuition tells me it's going to be faster on one database, unless there is some sort of contention issue (everyone reading and writing to the same pages of the same table at the same time). Each database has machine resources (probably threads and memory) doing per-database housekeeping, so these extra DBs can't be free.
Anyhow, the best thing to do is to simulate the two architectures, use a random data generator to simulate load, and see how they perform.
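A hedged sketch of such a load generator for the shared-database variant, with hypothetical names; it spreads a million rows across 5000 simulated users so you can compare plans and timings against a per-user layout:

    CREATE TABLE dbo.Ledger (
        UserId  int           NOT NULL,
        EntryId int           NOT NULL,
        Amount  decimal(18,2) NOT NULL,
        CONSTRAINT PK_Ledger PRIMARY KEY (UserId, EntryId)
    );
    GO
    -- Cross-joining a catalog view with itself yields enough rows
    -- to number one million synthetic entries.
    WITH n AS (
        SELECT TOP (1000000)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
        FROM sys.all_objects a CROSS JOIN sys.all_objects b
    )
    INSERT INTO dbo.Ledger (UserId, EntryId, Amount)
    SELECT i % 5000,                               -- simulated user
           i,                                      -- unique entry id
           ABS(CHECKSUM(NEWID())) % 100000 / 100.0 -- random amount
    FROM n;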
It's not an easy answer to give.
First, there is logical design to be considered. Then you have integrity, security, management and performance (in this very order).
A database is a logical unit of data, self-contained. Ideally, you should be able to take a database, move it to another instance, probably change the connection strings, and be running again.
All the constraints are database-level. No foreign keys can exist referencing some object outside the database.
So, try thinking in these terms first.
How would you reliably prevent one user from messing up another user's data? Keep in mind that it's just a matter of time before someone opens an Excel sheet and fires up queries against the database, bypassing your application. Row-level security in SQL Server is something you don't want to deal with.
Multiple databases mean that all management tasks should be scripted out and executed on all databases. Yes, there is some overhead to it, but once you set it up, it's just a matter of monitoring. If a database goes suspect, a single customer is down, not all of them. You can even have different versions for different customers if each customer has its own database. Additionally, if you roll out an upgrade, you can do it per customer, so the impact will be much smaller.
Performance is the least relevant factor here. Of course, it really depends on how many customers and how much data, but proper indexing will solve these issues. Scale-out is much easier with multiple databases.
BTW, partitioning, as you mentioned it, is never a performance booster; it's simply a management feature, allowing for faster loading and evicting of data in a table (sketched below).
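For reference, partitioning on UserID looks roughly like this (hypothetical names; note it was an Enterprise-only feature before SQL Server 2016 SP1). As said above, it mainly buys you fast loading and eviction via partition switching, not faster queries:

    CREATE PARTITION FUNCTION pfUser (int)
        AS RANGE RIGHT FOR VALUES (1000, 2000, 3000);
    GO
    CREATE PARTITION SCHEME psUser
        AS PARTITION pfUser ALL TO ([PRIMARY]);
    GO
    -- The partitioning column must be part of the clustered key.
    CREATE TABLE dbo.Transactions (
        UserID int   NOT NULL,
        TranId int   NOT NULL,
        Amount money NOT NULL,
        CONSTRAINT PK_Transactions PRIMARY KEY (UserID, TranId)
    ) ON psUser (UserID);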
I'd probably put each customer in a separate database, but eventually it's up to you to make the decision for yourself. Hope I've helped some with this.

Merge multiple databases into one

I have a desktop app that clients are using at the moment and each client has access to their own local network database.
My manager has decided that it's best to merge these databases and only have one. All clients would then access that one database through a web service that sits in the cloud. I would like to weigh the pros and cons before we go ahead with this decision.
One option we have is to add a ClientID to each of the tables, which will result in each table having a composite key.
I have heard that another option would be to use schemas. Please advise how the schema way would work, and whether it is better than having a composite key in each table.
Thank you.
This is a seriously difficult and time-consuming task. You will need to have extensive regression tests already built, because the risk of things breaking is huge.
Let me tell you a story of a client that had a separate database on a separate server that got merged with another database that contained many clients. It took several months to make all the changes to convert the data. Everything looked good, and it was pushed to prod. Unfortunately, the developer missed one place where the client id needed to be referenced (it usually wasn't referenced in the old code, since they were the only client on that server). On the first day in production, a process that sent out emails sent client-proprietary data not only to the client's sales reps but to the sales reps of many of their competitors. Of all the places where the change could have been missed, this was the worst possible one. It not only harmed our relationship with the first client, but with all the clients that got some other client's info by mistake.
There is also the problem of migrating the data; that project alone (without the code changes the application will need) will take months, and then you have to consider that the clients will be adding data as you go, so the final push may run into unexpected hiccups due to new data. You may also have to turn off the old system for at least a weekend to do the production change.
Using schemas won't make it any easier, as you will then have to adjust the code to hit the correct schema per client. And when you change something, you will have to change it for each individual schema, so it tends to make the database much more difficult to maintain (see the sketch below).
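For completeness, the schema route would look roughly like this (hypothetical names), repeated for every client and every table:

    CREATE SCHEMA Client1;
    GO
    -- Move an existing table into the client's schema; every query in
    -- the code base must then be pointed at Client1.Orders (or routed
    -- via a per-client default schema).
    ALTER SCHEMA Client1 TRANSFER dbo.Orders;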
While I am a great fan of having multiple clients in one database, when you didn't start out that way, it is extremely risky and expensive to change. I would not do it at all unless I had these things:
Code in source control
Extensive unit and regression tests
Separate dev, QA and prod environments
A process for client UAT testing
Extensive knowledge of how cloud computing and web services work (everyone I know who has moved stuff to the cloud has hit some real gotchas)
A QA department
Six months to one year time frame for the project
At least one senior data analyst on the team.