Merge multiple databases into one - sql

I have a desktop app that clients are using at the moment and each client has access to their own local network database.
My manager has decided that it's best to merge these databases and only have one. All clients would then access that one database through a web service that sits in the cloud. I would like to weigh the pros and cons before we go ahead with this decision.
One option is to add a ClientID column to each of the tables, which would result in each table having a composite key.
I have heard that another option would be to use schemas. Please advise how the schema approach would work, and whether it is better than having a composite key in each table.
Thank you.

This is a seriously difficult and time-consuming task. You will need to have extensive regression tests already built, because the risk of things breaking is huge.
Let me tell you a story of a client that had a separate database on a separate server that got merged with another database that contained many clients. It took several months to make all the changes to convert the data. Everything looked good and it was pushed to prod. Unfortunately, the developer missed one place where the client ID needed to be referenced (it usually wasn't referenced in the old code, since they were the only client on the server). On the first day in production, a process that sent out emails sent client proprietary data not only to that client's sales reps but to the sales reps of many of their competitors. Of all the places the change could have been missed, this was the worst possible one. It harmed our relationship not only with the first client but with all the clients that got some other client's info by mistake.
There is also the problem of migrating the data; that project alone (without the code changes the application will need) will take months, and then you have to consider that the clients will be adding data as you go, so the final push may run into unexpected hiccups due to new data. You may also have to turn off the old system for at least a weekend to do the production change.
Using schemas won't make it any easier, as you will then have to adjust the code to hit the correct schema per client. And when you change something, you will have to change it for each individual schema, which tends to make the database much more difficult to maintain.
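To make the comparison concrete, here is a minimal sketch of the two options (table, column, and client names are invented for illustration):
-- Option 1: shared tables, with ClientID as part of a composite key.
CREATE TABLE Orders (
    ClientID  INT  NOT NULL,
    OrderID   INT  NOT NULL,
    OrderDate DATE NOT NULL,
    PRIMARY KEY (ClientID, OrderID)
);
-- Every single query must now filter on the client:
SELECT OrderID, OrderDate FROM Orders WHERE ClientID = 42;
-- Option 2: one schema per client, each holding its own copy of every table.
CREATE SCHEMA ClientA;
CREATE TABLE ClientA.Orders (
    OrderID   INT  PRIMARY KEY,
    OrderDate DATE NOT NULL
);
-- The application must pick the right schema at runtime:
SELECT OrderID, OrderDate FROM ClientA.Orders;
Note that with option 2, adding a column later means repeating the ALTER TABLE once per client schema, which is exactly the maintenance burden described above.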
While I am a great fan of having multiple clients in one database, when you didn't start out that way, it is extremely risky and expensive to change. I would not do it at all unless I had these things:
Code in source control
Extensive Unit and regression tests
Separate dev, QA and prod environments
A process for client UAT testing
Extensive knowledge of how cloud computing and web services work (everyone I know who has moved things to the cloud has hit some real gotchas)
A QA department
Six months to one year time frame for the project
At least one senior data analyst on the team.

Related

Is it considered poor practice to use a single database for different uses across different applications

What if you had one large database to serve all your apps? Your website that needs to store customer orders could use the same database that your game uses to store registered users. Different applications could have tables only for them to use. Some may say that this could be a security issue, because if someone cracks your database, they could attack all your applications. But in a lot of databases you could use a line like the following to restrict access:
DENY SELECT ON aTable TO aUser;
I am wondering if this central database would be considered a poor practice, and if so why?
The way I look at it, a web application is nothing more than a collection of web pages. Because of this, it really doesn't matter if one page is about, say, cooking, while the other page is about computer programming.
If you also consider it, this is very similar to OpenID, which I use to log into my SO account!
If you have your fundamental security implemented correctly, it doesn't matter how the user is interacting with your website. Where I would make this distinction is in two cases:
Don't mix HTTP with HTTPS. On a shared host this isn't going to be an issue anyway; if you buy the certificate for HTTPS, make everything that way (excluding the rare case where this might affect performance).
E-commerce or financial data should be handled in a fundamentally different way. If you look at your typical bank, they have multiple log-in protocols, picture verification, and short log-in times. This builds user confidence in security. It would be a pain in the butt for a game site, or for most other non-mission-critical applications.
Regarding structure, if you do mix applications into one large database, you should consider the other maintenance issues, such as:
Keep tables separate; consider a prefix for every table, unique to each application (see the sketch after this list). Following my example above, you would start the cooking DB table names with 'ck', and the computer programming DB table names with 'pg'. This would allow you to easily separate the applications if you need to in the future.
Use a matching table to identify which ID goes to which web application.
Consider what you would do and how to handle it if a user decided to register for both applications. Do you want to offer transparency that they can share the same username?
Keep an eye on both your data storage limit AND your bandwidth limit.
If you are counting on these applications to drive revenue, you are putting "all your eggs in one basket". Make sure if it goes down, you have options to restore or move to another host.
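As a rough sketch of the prefix and matching-table ideas (all names here are invented for illustration):
-- Prefixed tables keep each application's data visibly separate.
CREATE TABLE ck_recipes  (recipe_id  INT PRIMARY KEY, title VARCHAR(100));
CREATE TABLE pg_articles (article_id INT PRIMARY KEY, title VARCHAR(100));
-- A matching table records which shared user ID belongs to which application.
CREATE TABLE app_users (
    user_id INT         NOT NULL,
    app     VARCHAR(10) NOT NULL,  -- e.g. 'ck' or 'pg'
    PRIMARY KEY (user_id, app)
);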
These are just a few of the things to consider. But fundamentally, outside of huge (big data) applications there is nothing wrong with sharing resources/databases/hardware between applications.
Conceptually, it could be done.
Implementation-wise, to make the various parts distinct from one another, you could use both naming conventions (as per @Sable Foste) and/or separate database schemas (tables Finance.Users, GameApp.Users, etc.)
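A minimal sketch of the schema variant (assuming an engine that supports schemas; the GRANT syntax shown is SQL Server's):
-- Each application gets its own schema; the table names can collide safely.
CREATE SCHEMA Finance;
CREATE SCHEMA GameApp;
CREATE TABLE Finance.Users (user_id INT PRIMARY KEY, name VARCHAR(50));
CREATE TABLE GameApp.Users (user_id INT PRIMARY KEY, name VARCHAR(50));
-- Each application's login can then be confined to its own schema:
GRANT SELECT, INSERT, UPDATE ON SCHEMA::Finance TO finance_app;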
Management-wise, things could get tricky. Repeating some points, adding others:
One application could use a disproportionately large share of resources (disk space, I/O, CPU)
Tracking versions could be tricky (App is v4, finance is v7) -- depends on how many application instances you have to support.
Disaster recovery-wise, everything is lumped together. It all gets backed up as one set, it all gets restored as one set. Finance corrupt? Restore from backup... and lose your more recent game data.
Single point of failure. One database goes down, all your applications are down.
These (and other similar issues) are trade-offs you'll want to consider. Plan ahead, to lessen the chance that what's reasonable and economic today becomes a major headache tomorrow.

Syncing Postgres Database Instances

I have a strange situation. I am managing an e-commerce site built on Django with PostgreSQL. It has two versions, English and Japanese. Because of a release that has brought a huge number of users, the site (specifically Postgres) is overloaded and crashing. The only safe solution I can think of is to split these onto two separate servers, so that the EN and JP traffic each get a dedicated server. Now, the new server is ready, but during DNS propagation, and during half-propagated stages (the new server being seen from some countries and the old one from others), there will be transactions on both. Users are making hundreds of purchases of digital goods every minute, so there is no way to turn the server off for the cutover.
Is there a way to sync the two databases at a later stage (because if both share one database, the new server will be pointless)? The bottleneck is Postgres, which has already been tuned for the maximum possible connections on this server, and kernel.shmmax is at its limit. DB pooling would also need time to set up, plus some downtime, which I am not permitted to take at the moment. What I mean by sync is that once full propagation occurs, I wish to unify the DB dump files from both servers and make one that has all the records of both, synced in time. The structure is rather complex, so many tables will need syncing. Is this doable?
Thanks in advance!

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more product names. Analytics shows that the feature is not heavily used (about 10 user requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes, it would. You have very little data, so there is no need to try to 'cache' it in some way (apart from the fact that CSV might not be the best way to do that anyway).
There is nothing stopping you from making these requests from the web server to the database server. With as little information as this, you will not find performance an issue; and even if it becomes one as everything grows, there is a lot to be gained on the database side (indexes, etc.) that will help you survive the next 100 years in this fashion.
The number of requests from your users (also extremely small) does not need any special treatment, so again, a direct query would be best.
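For instance (table and column names are assumed, not from the question), the whole feature reduces to one indexed range scan:
-- Assumed layout of the warehouse data described in the question.
CREATE TABLE daily_prices (
    product_name VARCHAR(25)   NOT NULL,
    price_date   DATE          NOT NULL,
    price        DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (product_name, price_date)  -- doubles as the only index you need
);
-- A typical user request: a date range for one or more products.
SELECT product_name, price_date, price
FROM daily_prices
WHERE product_name IN ('Product A', 'Product B')
  AND price_date BETWEEN '2015-01-01' AND '2015-03-31';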
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify the CSV method. Examples, and why you need not worry:
The connection with the database server is down.
This is an issue for both methods, but with only one connection per day, the chance of a 1-in-10,000 failure might seem to favor the once-a-day method. These issues should not come up very often, though, and if they do, you should be able to handle them (retry the request, give a message to the user). This is what enormous numbers of websites do, so trust me when I say this will not be an issue. Also, think of what it would mean if your daily update failed. That would present a bigger problem!
Performance issues
As said, given the small amount of data and requests, this is not a problem. And even if it becomes one, it is a problem you should be able to catch at a different level. Use a caching system (not CSV) on the database server. Use a caching system on the web server. Fix your indexes to stop performance from being a problem.
BUT:
It is far from strange to want your data warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse database (the one I just defended as being good enough to query directly) on another machine. You might get good results with a master-slave setup:
Your data warehouse is the master database: it sends all changes to the slave but is inaccessible otherwise.
Your second database (even on your web server) gets all updates from the master and is read-only; you can only query it for data.
Your web server cannot connect to the data warehouse, but can connect to the slave to read information. Even if there were an injection hack, it wouldn't matter, as the slave is read-only.
Now there is never a moment where you have to update the queried database yourself (the master-slave replication keeps it up to date), and there is no chance that queries from the web server put your warehouse in danger. Profit!
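If, for example, the warehouse were a reasonably recent PostgreSQL (10+), logical replication gives you exactly this read-only copy; a sketch with invented names:
-- On the warehouse (master): publish the tables the website needs.
CREATE PUBLICATION web_pub FOR TABLE daily_prices;
-- On the web-side replica (slave): subscribe to that publication.
CREATE SUBSCRIPTION web_sub
    CONNECTION 'host=warehouse.internal dbname=dw user=replicator'
    PUBLICATION web_pub;
Other engines have their own equivalents (SQL Server replication, MySQL read replicas); the principle is the same.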
I don't really see how SQL injection could be a real concern. I assume you have some calendar-type field that the user fills in to get data out. If this is the only form, just ensure that the only field in it is a date, so that something like DROP TABLE isn't possible. As for getting access to the database, that is another issue. However, a separate file with just the connection function should do fine in most cases, so that a user can't, say, open your web page in an HTML viewer and see your database connection string.
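For example, binding the dates as parameters instead of concatenating them into the SQL keeps any hostile input inert; a sketch in PostgreSQL's prepared-statement syntax (table and column names assumed as before):
-- The dates are bound values, never part of the query text,
-- so input like '; DROP TABLE ...' cannot change the statement.
PREPARE price_range(date, date) AS
    SELECT product_name, price_date, price
    FROM daily_prices
    WHERE price_date BETWEEN $1 AND $2;
EXECUTE price_range('2015-01-01', '2015-03-31');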
As for the CSV, I would have to say that querying the database per user, especially when the feature is only used ~10 times weekly, would be much more efficient. The CSV is overkill: with only ~10 users attempting to get some information, exporting an updated CSV every day would be too much work for such a little payoff.
EDIT:
Also, if an attack is a big concern, which really depends on the nature of the business, the data being stored, and the visitors you receive, you could always create a backup as another option. I don't really see a reason for this as your question is currently stated, but it is possible that even with the best security an attack could happen. That mainly just depends on whether attackers want the information you have.

Running the same web app on 2 or more physically separate servers?

I am not sure if I should be posting this question here or over at ServerFault so apologies if it is in the wrong place.
I have a small web app that is starting to get some more business.
Currently I have a single dedicated LAMP server for this, and this has worked well - the single server is able to handle all of our traffic.
However... Recently I have been approached by some potential customers who are interested in using the app, but only if their data can be stored on a server in the same province as they are (legal reasons).
I could migrate the server, but I am reluctant to do this. I like where it is now.
So, I am wondering what is involved in having multiple servers in physically separate datacentres far apart, running the same web app? Data between the servers would not need to stay synced, necessarily.
I have never done anything like this before, and am not sure how complicated a job it is. Any suggestions on how and where to start looking into this would be much appreciated.
Thanks (in advance) for your advice.
As long as each customer has their own set of data, you can just install another copy of the application in the other datacentre. It will require you to add some structure to your source control and deployment process, but it works. This option gives you two separate databases.
If you have to have one common database for all the customers (e.g. some kind of booking/reservation system for shared resources), then you're up against a completely different level of complexity, with replicating databases etc. It's doable, but it's hard.

Multi-tenancy with SQL/WCF/Silverlight

We're building a Silverlight application which will be offered as SaaS. The end product is a Silverlight client that connects to a WCF service. As the number of clients is potentially large, updating needs to be easy, preferably so that all instances can be updated in one go.
Not having implemented multi-tenancy before, I'm looking for opinions on how to achieve:
Easy upgrades
Data security
Scalability
Three different models to consider are listed on MSDN:
Separate Databases. This is not easy to maintain, as all schema changes have to be applied to each customer's database individually. Are there other drawbacks? A pro is data separation and security. This also allows for slight modifications per customer (which might be more hassle than it's worth!).
Shared Database, Shared Schema. A TenantID column is added to each table. Ensuring that each customer gets the correct data is potentially dangerous. Easy to maintain and scales well (?).
Shared Database, Separate Schemas. Similar to the first model, but each customer has its own set of tables in the database. Hard to restore backups for a single customer. Maintainability otherwise similar to model 1 (?).
Any recommendations on articles on the subject? Has anybody explored something similar with a Silverlight SaaS app? What do I need to consider on the client side?
It depends on the type of application and the scale of the data. Each option has drawbacks.
1a) Separate databases + single instance of WCF/client. Keeping everything in sync will be a challenge. How do you upgrade X number of DB servers at the same time, what if one fails and is now out of sync and not compatible with the client/WCF layer?
1b) "Silos", separate DB/WCF/Client for each customer. You don't have the sync issue but you do have the overhead of managing many different instances of each layer. Also you will have to look at SQL licensing, I can't remember if separate instances of SQL are licensed separately ($$$). Even if you can install as many instances as you want, the overhead of multiple instances will not be trivial after a certain point.
3) Basically same issues as 1a/b except for licensing.
2) Best upgrade/management scenario. You are right that maintaining data isolation is a huge concern (1a technically shares this issue at a higher level). The other issue is that if your application is data intensive, you have to worry about data scalability. For example, if every customer is expected to have tens or hundreds of millions of rows of data, you will start to run into query-performance issues for individual customers due to the total customer-base volume. Clients are more forgiving of slowdowns caused by their own data volume; being told it's slow because the other 99 clients' data is large is generally a no-go.
Unless you know for a fact that you will be dealing with huge data volumes from the start, I would probably go with #2 for now, and begin looking at clustering or moving to a 1a/1b setup if needed in the future.
We also have a SaaS product and we use solution #2 (shared DB / shared schema with TenantId). Some things to consider for a shared DB with the same schema for all:
As mentioned above, a high volume of data for one tenant may affect the performance of the other tenants if you're not careful; for starters, index your tables properly/carefully, and never ever write queries that force a table scan. Monitor query performance, and at least plan/design to be able to partition your DB later on, based on some criteria that make sense for your domain.
Data separation is very, very important; you don't want to end up showing a piece of data to some tenant that belongs to another tenant. Every query must have a WHERE TenantId = ... in it, and you should be able to verify/enforce this during dev (see the sketch after this list).
Extensibility of the schema is something that solutions 1 and 3 may give you, but you can work around it by designing a way to extend the fields associated with the documents/tables in your domain that make sense (i.e. metadata for tables, as the MSDN article mentions).
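One way to verify/enforce that filter centrally, rather than trusting every query, is row-level security; a sketch assuming SQL Server 2016+ (function, policy, and table names are invented):
-- Predicate: a row is visible only when its TenantId matches the tenant
-- the application layer stored in the session context after authentication.
CREATE FUNCTION dbo.fn_TenantPredicate(@TenantId INT)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS allowed
    WHERE @TenantId = CAST(SESSION_CONTEXT(N'TenantId') AS INT);
GO
-- Attach the predicate so every query against dbo.Orders is filtered
-- automatically, even if a developer forgets the WHERE TenantId clause.
CREATE SECURITY POLICY dbo.TenantIsolation
    ADD FILTER PREDICATE dbo.fn_TenantPredicate(TenantId) ON dbo.Orders
    WITH (STATE = ON);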
What about solutions that provide an out-of-the-box architecture, like Apprenda's SaaSGrid? They let you make database decisions at deploy and maintenance time rather than at design time. It seems they actively transform and manage the data layer, as well as provide an upgrade engine.
I had a similar case, but my solution takes advantage of both approaches.
Where the data lives and how it is stored is the tenant's question. As a tenant, of course, I don't want my data to be shared; I want my data isolated and secure, and I want to be able to get it any time I want.
Certain data can possibly be shared, e.g. a company list. So there should be a global database plus per-tenant databases; just make sure the tenant database schema is locked down in operation, and have a procedure to update all tenant databases at once.
Anyway, in the SaaS model everything is delivered as a server / web service, so no matter where the database lives, the data comes to the client as a service and is only rendered by the client GUI.
Thanks
Existing answers are good. You should look deeply into the issue of upgrading and managing multiple databases. Without knowing the specific app, it might turn out easier to have multiple databases and not have to pay the extra cost of tracking the TenantID. This might not end up being the right decision, but you should certainly be wary of the dev cost of data sharing.