Database design - Sharing data between two databases? - sql

I am exploring options for designing the database for my new application. In general, I will have registered users and info about them. They will be able to do some things in the app, and that data will be in the same DB as the users' data (so I can share FKs and so on).
But then I plan to have a second database that is logically independent of the first, except that it will share userID as an FK.
I don't know whether I should put that second logic in a separate DB or keep everything in the same database. I plan to have a subdomain in my app for the second logic (it is like an app within the app), but what if I discover they should share more data? Will cross-database querying hurt performance? And is that the way to go at all: is there a real reason to separate the databases?

As soon as you have two databases you have potential complexity. You have not given any particular reason why you need two databases. So keep it simple until you have a reason.
An example of what folks do: have a small "current" database holding just the data needed right now. That might be where orders are taken and fulfilled. Once the data is no longer current, say some days or weeks after the order is filled, move it to a "historic" database. There, marketing and management folks can look at overall trends in the history without affecting the performance of the "current" database, whose performance might be critical to keeping your customers happy.
As an example of the complexity: any time you have two databases you need to consider consistency between them, and this is much harder to ensure than it might appear. Databases do offer two-phase commit capabilities, or you can devise batch processes, but there are always subtleties that are hard to catch.

I would just keep everything in one database. Unless you have dozens of tables there should be no real performance problems, imho. It will, however, make your life much easier: you only have to work with one database connection and don't have to worry about merging information from two queries.

I also agree that unless the volume of your data is going to be huge (judging by the question, that doesn't seem to be the case here), you can use a single database to store your data without performance issues.
For "visual" separation of the data structure, you can always create the tables in two schemas of a single database.
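A minimal sketch of the single-database layout, using Python's built-in sqlite3 module as a stand-in for whatever DBMS the application ends up on. The table names (users, main_app_posts, sub_app_scores) and the name prefixes standing in for two schemas are assumptions for illustration, not something from the question.

```python
import sqlite3


def build_single_db() -> sqlite3.Connection:
    """Create one in-memory database holding users plus both apps' data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        -- The prefixes stand in for two schemas in a server DBMS.
        CREATE TABLE main_app_posts (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            title TEXT NOT NULL
        );
        CREATE TABLE sub_app_scores (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            score INTEGER NOT NULL
        );
    """)
    return conn


def user_summary(conn: sqlite3.Connection, user_id: int):
    """One query over both apps' data; no merging of results in application code."""
    return conn.execute("""
        SELECT u.name,
               (SELECT COUNT(*) FROM main_app_posts p WHERE p.user_id = u.id),
               (SELECT COALESCE(SUM(s.score), 0)
                  FROM sub_app_scores s WHERE s.user_id = u.id)
        FROM users u
        WHERE u.id = ?
    """, (user_id,)).fetchone()
```

Because both sets of tables live in one database, the foreign keys are enforced by the engine, which is not possible for a userID shared across two separate databases.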

Related

Create SQL tables for each user as security measure

I've researched this topic and I'm relatively sure that in most practices the answer is "No", but I would like some second opinions specific to my case.
We're currently working on a multi-user web app where each user will basically have their own copy of a "portal/app" within the web app. It's not performance I'm worried about, but security.
I'm considering partitioning the data with a prefix userid_table1, userid_table2 to make it more manageable and ensure no security validation oversight is made by the team in development as we can easily add a validation to ensure that queries can only be run against tables with userid_*.
Would you still recommend against this method ?
"I'm considering partitioning the data with a prefix userid_table1, userid_table2 to make it more manageable and ensure no security validation oversight is made by the team in development as we can easily add a validation to ensure that queries can only be run against tables with userid_*."
More manageable? That sounds like a joke. Your database will end up with a zillion different tables. Any operation that you want to do across all users will be a nightmare:
Declaring foreign key constraints.
Defining a new index on the tables.
Adding a new column.
Restructuring the tables.
And so on. And so on.
Your users may be limited to a single table. But the application developer and DBA need to deal with all of them. I cringe thinking about trying to figure out where performance bottlenecks are in such a system.
I should add that databases are optimized for big tables not lots of tables, so multiple tables are typically less efficient. And even less efficient when you think about all the half-filled pages in all those tables.
The same entities should not be spread among multiple tables, unless you have a really, really good reason. This is not a really good reason. One simple solution is to prevent users from having access to the base tables. Just give them access to views or user-defined table functions -- and have all of these filter on user ids.
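A rough sketch of the "one shared table, filtered view" idea, again with Python's sqlite3. SQLite has no per-user permissions, so this only shows the shape of the filter; in a server DBMS you would additionally revoke access to the base table and grant it on the view. The names notes, session, my_notes and open_for_user are hypothetical.

```python
import sqlite3


def make_notes_db() -> sqlite3.Connection:
    """One shared table for all users, keyed by user_id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE notes (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            body TEXT NOT NULL
        );
        CREATE INDEX notes_by_user ON notes (user_id);
    """)
    return conn


def open_for_user(conn: sqlite3.Connection, user_id: int) -> None:
    """Bind this connection to one user; my_notes then shows only their rows."""
    conn.executescript("""
        CREATE TEMP TABLE IF NOT EXISTS session (user_id INTEGER NOT NULL);
        DELETE FROM session;
        CREATE TEMP VIEW IF NOT EXISTS my_notes AS
            SELECT id, body
            FROM main.notes
            WHERE user_id = (SELECT user_id FROM session);
    """)
    conn.execute("INSERT INTO session (user_id) VALUES (?)", (user_id,))
```

Application queries then go against my_notes rather than notes, so a forgotten WHERE clause cannot leak another user's rows, and an index or new column is still added in exactly one place.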
There are some edge cases where you do want separate tables for each user. Typically, each user would have very complex tables (think of a B2B application) and, in fact, they might have their own database. There may also be legal requirements to separate data. In these cases, though, the "separateness" would typically be at the database level, not the table level.

When to use separate SQL databases on the same server?

I've worked in several SQL environments. In one environment, the different tables holding business data were split across several different SQL databases, all on the same server.
In another environment, almost all the tables are kept on one single SQL database.
I'm creating a new project that is closely related to another project, and I've been wondering if I should put the new tables in the same SQL database or a new SQL database.
This all runs on MS SQL Server.
What factors do I need to consider as I make this decision?
It's tough from your question to tell what your actual requirements are, or what data you would consider to store in different databases. But in addition to Gordon's points I can address a couple of additional reasons why you might want to use separate databases for data belonging to different customers / users (and this answer assumes that one possible separation of data, whether by database or schema, would be by customer):
As I mentioned in a comment, some customers will demand that their data be stored separately, and you may need to agree to that in writing before you see a penny or are able to secure their business. So you may as well be prepared for that inevitability.
Keeping each customer in their own database makes it very easy to move them if they outgrow your current server. At my previous job we designed the system in this way, and it saved our bacon later: we were able to move customers completely to a different server with what essentially amounted to a metadata operation. During a maintenance window, we backed up their database, set the original to offline, restored the backup to a new server, and updated a config table that told all the apps where to find that database. This is much more flexible than trying to extract all of their data from a database shared by others...
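The "config table that tells the apps where to find a database" can be sketched as follows. This is only an illustration of the lookup-and-repoint step under assumed names (customer_databases, locate, repoint); the actual backup and restore would be done with the DBMS's own tooling and is not shown.

```python
import sqlite3


def make_config_db() -> sqlite3.Connection:
    """Central catalog: which server currently hosts each customer's database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE customer_databases (
            customer TEXT PRIMARY KEY,
            server   TEXT NOT NULL,
            db_name  TEXT NOT NULL
        )
    """)
    return conn


def locate(conn: sqlite3.Connection, customer: str):
    """Return (server, db_name) for a customer; apps call this before connecting."""
    row = conn.execute(
        "SELECT server, db_name FROM customer_databases WHERE customer = ?",
        (customer,),
    ).fetchone()
    if row is None:
        raise KeyError(customer)
    return row


def repoint(conn: sqlite3.Connection, customer: str, new_server: str) -> None:
    """Final step of a move: after the restore, switch the catalog entry."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE customer_databases SET server = ? WHERE customer = ?",
            (new_server, customer),
        )
    if cur.rowcount != 1:
        raise KeyError(customer)
```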
Separate databases also allow you to handle maintenance differently. One customer needs point-in-time restore, and another doesn't? Perfect, you can just use a different recovery model on separate databases. Much easier than separating by filegroups and trying to implement some filegroup-level backup solution, and much more efficient than just treating one big database in full recovery.
This isn't free, of course; it's about trade-offs. Multiple databases scare some people away, but having managed such a system for 13 years I can tell you that managing 100 or 500 databases that are largely identical is not that much more complicated than managing 500 schemas in one massive database (in fact I would say it is less so in a lot of respects).
A database is the unit of backup and recovery, so that should be the first consideration when designing database structures. If the data has different back up and recovery requirements, then they are very good candidates for separate databases.
That is only half the problem, though. In most environments, backup/recovery is pretty much the same for all databases. It becomes a question of application design. In other words, the situation becomes quite subjective.
In the environment that I'm working in right now, here are some criteria for splitting data into different databases:
(1) Publishing tables to a wide audience. We "publish" data in tables and put these into a database, separate from other tables used for building them or for special purposes. Admittedly, SQL Server claims that schemas are the unit of security. However, databases seem to do a good job in the real world.
(2) Strict security requirements. Some data is so sensitive that lawyers have to approve who can see it. This goes into its own database, with its own access.
(3) Separation of data tables (which users can see) and tables that describe the production system.
(4) Separation of tables used for general querying by a skilled group of analysts (the published tables) versus tables used for specific reports/applications.
Finally, I would add this. If some of the data is being updated continuously throughout the day and other data is used for reporting, I would tend to put them in different databases. This helps separate them in the case of problems.

Help with setting up a Database

My site is going to have many products available, but they'll be categorised into completely different sites (domains).
My question is, am I better off lumping all products into one database and using an ID to distinguish between the sites, or should I set up a table and /or DB per site?
Here are my thoughts
SEPARATE DATABASES
Easier to read from a backend
Categorised better
Makes backups more difficult
If I need to make a change to the schema, it will need to be pushed out to all databases
SAME DATABASES
All in one place
Could get unwieldy
One database will have a massive file size and lookups could suffer
Can someone please offer me some advice on which way is best and why?
You didn't give too many details (which makes it difficult to provide a good answer), though the words you chose to use in your question lead me to believe that this is a single application with different "skins".
"My site is going to have many products available, but they'll be categorised into completely different sites (domains)."
My assumption is that you will have a single web store with several different store fronts: cool-widgets.com, awesome-sprockets.com, neato-things.com, etc. These will all be the same, save for maybe a CSS skin or something simple like that. The store admin stuff will all be done in some central system, and the domain name will simply act as a category name.
As such, splitting the same data into two different containers using an arbitrary criterion (category_name == 'cool-widgets.com') is data partitioning, which is an anti-pattern. Just as you wouldn't have two different user tables based on the user name ([Users$A-to-M] and [Users$N-to-Z]), it makes little sense to have two different tables (or databases) for category names.
There is, and will be, lots of code common among the categories: user management, admin, order processing, data import, etc. It will be far more difficult to aggregate the multiple datastores in the common code than it will be to segregate the categories in the store display code. Not only that, the segregation bugs will be much more obvious: the price comparison page shows items from all three stores. The aggregation bugs will be much less obvious: only three of the four stores were updated. This is why it's an anti-pattern.
Side note: yes, before you say that data partitioning has its uses (which it does), those uses come in far after performance problems occur. Many serious database platforms allow behind-the-scenes partitioning so as not to create a goofy data model.
If data needs to be shared among all the sites, then sharing the same database is recommended, since data transfer between databases is eliminated. The data is also more centralized.
If data does not need to be shared among all sites, it is reasonable to split it into one database per site. As for the difficulty of updating table structures, you can simply record the database changes you make to one database (saving the ALTER, UPDATE, DELETE queries in a SQL file) and update the other databases with the same SQL file.
Storing in different databases might also help with security. You can set different user permissions for each of the sites, and if one gets compromised, the other sites are protected.
Also, it is easier to maintain and track each database when the databases are clearly split up.
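Replaying one saved SQL file against every per-site database might look like the sketch below (Python with sqlite3; the MIGRATION text, the products table and the apply_to_all helper are made up for illustration). It reports which databases failed, since a partially applied change is exactly the "only three of the four stores were updated" problem.

```python
import sqlite3

# Contents of the saved SQL file; a hypothetical schema change.
MIGRATION = """
ALTER TABLE products ADD COLUMN weight_grams INTEGER;
"""


def apply_to_all(connections: dict, script: str) -> list:
    """Run the same script against every database; return names that failed."""
    failed = []
    for name, conn in connections.items():
        try:
            conn.executescript(script)
        except sqlite3.Error:
            # Keep going, but remember which store is now out of step.
            failed.append(name)
    return failed
```

A real deployment would also want to record which migrations each database has already received, so the same file is never applied twice.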
As you already say, both options have their pros and cons. Since you're talking about two stores, it probably doesn't matter much.
However, a few questions you might want to ask yourself:
Will it really be two stores, or possibly more? If more, one database might be smarter.
Are the products really the same? If you're gonna have to squeeze products in one general database, because they are of a different kind (eg. cars vs. food; the amount and nature of the details you want to store are completely different), then don't; use two databases / tables instead.
The central question is: what is most likely to become more elaborate in the future: the stores, or the products?
I think separate databases will be easier. You can have a quick-start template database from which you can build a new store database. You can even create a common database that contains the common tables and a list of stores and their databases. After all, you can access any database within the same server using a qualified name. Observe:
SELECT value FROM CommonDB.currencies WHERE type='euro';
SELECT price FROM OldTownDB.Products WHERE id=newtownprodid;
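For a runnable variant of the qualified-name idea, SQLite can attach a second database to a connection and address it by name. This is an analogue, not the same mechanism as the statements above: in MySQL the form is database.table, while SQL Server uses database.schema.table. The names common, currencies, products and price_label are assumptions for the sketch.

```python
import sqlite3


def build_store_with_common_db() -> sqlite3.Connection:
    """A store database with a separate 'common' database attached to it."""
    conn = sqlite3.connect(":memory:")  # plays the role of one store's database
    conn.execute("ATTACH DATABASE ':memory:' AS common")  # the shared database
    conn.executescript("""
        CREATE TABLE common.currencies (code TEXT PRIMARY KEY, symbol TEXT NOT NULL);
        INSERT INTO common.currencies VALUES ('EUR', '€'), ('USD', '$');
        CREATE TABLE main.products (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            price_cents INTEGER NOT NULL,
            currency TEXT NOT NULL
        );
        INSERT INTO main.products VALUES (1, 'widget', 1250, 'EUR');
    """)
    return conn


def price_label(conn: sqlite3.Connection, product_id: int) -> str:
    """Join a store table against the common database via its qualified name."""
    symbol, cents = conn.execute("""
        SELECT c.symbol, p.price_cents
        FROM main.products AS p
        JOIN common.currencies AS c ON c.code = p.currency
        WHERE p.id = ?
    """, (product_id,)).fetchone()
    return f"{symbol}{cents // 100}.{cents % 100:02d}"
```

Note that the currency column here cannot be a real foreign key into the other database; keeping it valid is up to the application.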

SQL and Flat Files... In harmony?

I was just thinking, how quick it would be to store the actual data of an application in a flat file.
Now, you can't just go storing everything in a flat file... sometimes sorts and searches are required, and to go through directories and files recursively could be a pain.
Now, imagine, you stored all your search-able data in a database, and had a pointer field, that pointed to a data file?
This would be very specific per app, however- so long as all my search-able data is stored in the database, why should I store the actual data in a database?
(Locking, Data integrity aside) it would be faster, I am sure... but how much, and is it worth doing it?
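The layout being asked about, searchable fields in the database plus a pointer to a data file, could be sketched like this in Python with sqlite3. The documents table and the store and load_by_title helpers are hypothetical names.

```python
import sqlite3
import uuid
from pathlib import Path


def make_index() -> sqlite3.Connection:
    """Database holding only the searchable fields and a pointer to the file."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            size_bytes INTEGER NOT NULL,
            path TEXT NOT NULL
        )
    """)
    return conn


def store(conn: sqlite3.Connection, data_dir, title: str, payload: bytes) -> int:
    """Write the payload to a flat file, then record where it lives."""
    path = Path(data_dir) / f"{uuid.uuid4().hex}.bin"
    path.write_bytes(payload)
    # The file write and the INSERT are two separate steps: a crash between
    # them leaves an orphaned file that the database knows nothing about.
    with conn:
        cur = conn.execute(
            "INSERT INTO documents (title, size_bytes, path) VALUES (?, ?, ?)",
            (title, len(payload), str(path)),
        )
    return cur.lastrowid


def load_by_title(conn: sqlite3.Connection, title: str):
    """Search in the database, then follow the pointer to the actual data."""
    row = conn.execute(
        "SELECT path FROM documents WHERE title = ?", (title,)
    ).fetchone()
    return None if row is None else Path(row[0]).read_bytes()
```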
Well, you often want to do things in queries beyond searching on the data. For instance, you might not search on a field called cost_center, but you might have a CASE statement that processes things differently depending on the information in the field. Or you might need to concatenate information together. You might update one field based on the information in another field. You might not search on a field today and need to search on it tomorrow.
A properly designed relational database can easily perform well with terabytes of data.
And frankly you should never even consider "data integrity aside". If you don't have data integrity you don't have data.
As to whether what you want is a good idea, it depends on the type of data you are storing and the types of things you intend to do with it. There isn't enough information to say for sure.
Well, "Locking, Data integrity aside" should mean a faster system. If you drop constraints you should improve performance.
But in practical terms, I don't think it's going to be faster. There's a lot of development time behind RDBMSs, and that's why they are quick. Sure, non-relational databases perform better than them in highly parallel situations and in scenarios that take advantage of their particular qualities. However, your idea does not offer an improvement such as exploiting parallelism... any performance advantage would come from dropping the qualities of RDBMSs...
As well as other answers...
Sharing of data: how are multiple clients going to access data on a share?
Backup/Restore: synching of text and "searchable"
Security/permissions on text data
Change anomalies
There is no need to implement a SQL database just to perform searches. Lots of applications store their data in XML, and you can search in many ways, e.g., using Lucene. How fast it is entirely depends on the quantity of data and how you structure it - just like a database.
It can perform very fast, but can complicate things when you want to run more than one app server.
BTrieve was essentially what you describe. Back in the DOS days it was a very fast database.

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLite files are entire databases which our users would like to query across. There are two ways (that we see) of implementing this: one, create a single database and a backend in Python that appends tables from each database to the master database; and two, query across the separate databases' tables and unify the results in Python.
Both methods run into trouble when the black box alters its table structures, for example by renaming a column or splitting up a table. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming.)
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables; in the worst case, your interpretation of the data will entirely break down.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a commonkey/code/value schema).
Even in a data warehouse ETL situation, it has to be determined whether a new column is a fact or a dimensional attribute, and then, if it is a dimensional attribute, which dimension table it is best assigned to. This could be somewhat automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating something like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually more easy to anticipate as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick and dirty solution, not a "generic" solution, because generic solutions like the entity-attribute-value model often perform badly. Don't do client-side joining (unifying the results) inside your Python code, because that is very slow. Use SQL for joining; it is designed for that purpose. Users can also make their own reports with all kinds of reporting tools that generate SQL statements. You don't have to do everything in your app: just start by solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes, you can define views for backward compatibility that keep your app functioning.
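As a small illustration of such a compatibility view, suppose (hypothetically) the black box renamed a table from measurements to measurements_v2 and a column from temp_c to temperature_celsius. A view under the old names lets existing queries run unchanged:

```python
import sqlite3


def make_db_with_compat_view() -> sqlite3.Connection:
    """New table layout plus a view that still answers to the old names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Layout after the (assumed) change in the black box's output.
        CREATE TABLE measurements_v2 (
            id INTEGER PRIMARY KEY,
            temperature_celsius REAL NOT NULL
        );
        -- Old table name and old column name, kept alive for existing queries.
        CREATE VIEW measurements AS
            SELECT id, temperature_celsius AS temp_c
            FROM measurements_v2;
    """)
    return conn


def old_style_query(conn: sqlite3.Connection):
    """A query written against the old layout; it needs no changes."""
    return conn.execute("SELECT temp_c FROM measurements ORDER BY id").fetchall()
```

This covers renames and simple restructurings for reads; a change in the meaning of the data cannot be papered over this way.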
Maybe the scientific software will add a lot of new features, and maybe it will change its data model because of those new features? That is possible, but then you will have to change your application anyway to take advantage of those new features.
It sounds to me as if your problem isn't really about MySQL or SQLite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the user of the same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal with it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it was up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe, some design like star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
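A very reduced sketch of that extract-transform-load step, in Python with sqlite3. It assumes each supplier layout can be described by a hand-maintained column map; the layout names (v1, v2), the source table results, the central table readings and the helper names are all invented for the example.

```python
import sqlite3

# Source column -> stable central column, one entry per known supplier layout.
COLUMN_MAPS = {
    "v1": {"sample": "sample_id", "val": "value"},
    "v2": {"sample_name": "sample_id", "reading": "value"},
}


def make_central_db() -> sqlite3.Connection:
    """The stable schema that reports and queries are written against."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE readings (
            run_label TEXT NOT NULL,
            sample_id TEXT NOT NULL,
            value REAL NOT NULL
        )
    """)
    return conn


def load(central, source, layout: str, run_label: str) -> int:
    """Extract rows from one source file, rename columns, load them centrally."""
    # Invert the map so source columns are selected in a fixed target order.
    by_target = {target: src for src, target in COLUMN_MAPS[layout].items()}
    # Identifiers come from our own config above, never from user input.
    select_sql = (
        f"SELECT {by_target['sample_id']}, {by_target['value']} FROM results"
    )
    rows = source.execute(select_sql).fetchall()
    with central:  # one transaction per source file
        central.executemany(
            "INSERT INTO readings (run_label, sample_id, value) VALUES (?, ?, ?)",
            [(run_label, sample_id, value) for sample_id, value in rows],
        )
    return len(rows)
```

An unknown layout fails loudly with a KeyError, and a missing column raises an sqlite3 error, so a change in the supplier's format is noticed at load time rather than in the reports.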
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.