Is there a database design pattern name for reducing duplicate join table data? - sql

I have two tables with a join table to allow a many-to-many relationship.
It's a very familiar design pattern. It indicates which Branches each Member has access to.
As the number of members and branches increases I end up with a lot of data in the join table that is duplicated across members. Members tend to have access to the same groups of Branches as other Members.
So I'm looking at normalizing my data by creating a MemberProfile table that is effectively immutable. Rather than creating MemberBranch records for every Member, I check for a matching MemberProfile, use it if it already exists, or create one if it doesn't.
The idea being that if I have a million Members with only a hundred access profiles, this will save me a lot of space in my database.
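Sketched roughly in SQL, with illustrative names (the existing Member and Branch tables are assumed from the original design), the reworked schema looks something like this:

    -- illustrative sketch only; names are not the actual schema
    CREATE TABLE MemberProfile (
        MemberProfileId INT NOT NULL PRIMARY KEY
    );

    -- one row per branch included in a profile
    CREATE TABLE MemberProfileBranch (
        MemberProfileId INT NOT NULL REFERENCES MemberProfile (MemberProfileId),
        BranchId        INT NOT NULL REFERENCES Branch (BranchId),
        PRIMARY KEY (MemberProfileId, BranchId)
    );

    -- each Member then points at a shared profile instead of owning its own
    -- set of MemberBranch rows (on a populated table the column starts NULLable)
    ALTER TABLE Member
        ADD MemberProfileId INT NULL REFERENCES MemberProfile (MemberProfileId);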
I'm happy that it all works and that the development effort is worth it.
My question is "Is this a standard database design pattern, and if so, what is it called?"
EDIT: It's been pointed out that this is compressing the data rather than normalizing it, which is indeed the intent behind the design.

Unless your many-to-many table is always equal to the join of particular other base tables, replacing it like this is not normalization, and you aren't normalizing here. Normalization does not introduce new column names; it just rearranges the current ones among different base tables.
You are just compressing/encoding your data. There is not necessarily any benefit in this, since now some queries and updates will be slower although your database is smaller. (You have reported that it is worth it in your case.)

I understand you'd like to put a label on that precise transformation, but unfortunately there aren't many books that discuss database design or refactoring patterns. One of the few is Refactoring Databases by Scott Ambler and Pramod Sadalage, published in Martin Fowler's signature series (you may know Fowler from his work on analysis patterns; he also has a great blog, worth following!). In that book the authors present a bunch of refactoring patterns that can be applied to databases and have put names on common database transformations, including the one you have presented, which they call Split Table.
Split Table. Vertically split (e.g. by columns) an existing table into one or more tables.
A catalog of the database refactorings presented in that book is available here.
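As a rough, generic illustration of that refactoring (all table and column names here are invented, not taken from your schema), a vertical split might look like:

    -- illustrative only: move a group of columns out of Member into a new
    -- table that shares the same key
    CREATE TABLE MemberPreferences (
        MemberId   INT NOT NULL PRIMARY KEY REFERENCES Member (MemberId),
        Newsletter BIT NOT NULL,
        Theme      VARCHAR(20) NOT NULL
    );

    INSERT INTO MemberPreferences (MemberId, Newsletter, Theme)
    SELECT MemberId, Newsletter, Theme
    FROM Member;

    ALTER TABLE Member DROP COLUMN Newsletter, Theme;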

Hi, I don't know of a pattern name, but I've used the same principle before.
To keep this performing well, introduce a checksum column on MemberProfile based upon the branches in the profile; that way the lookup for an existing profile is easy and fast.
But do remember that the checksum is not necessarily unique; in case of collisions you will still have to compare the branches themselves, but only for the profiles sharing the same checksum.
Cleanup can be a scheduled task; it is nothing more than deleting the profiles without users.
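In T-SQL, for example, the lookup could be done along these lines (BranchChecksum and the other names are only illustrations of the idea, not your real schema):

    -- illustrative names; assumes MemberProfile has a BranchChecksum INT column
    -- and MemberProfileBranch lists each profile's branches
    DECLARE @CandidateBranches TABLE (BranchId INT NOT NULL PRIMARY KEY);
    -- ... populate @CandidateBranches with the branches the new member needs ...

    -- order-independent checksum over the candidate set
    DECLARE @Checksum INT;
    SELECT @Checksum = CHECKSUM_AGG(CHECKSUM(BranchId))
    FROM @CandidateBranches;

    -- profiles sharing the checksum are candidates; the two EXCEPTs verify that
    -- the branch sets really are identical (guarding against collisions)
    SELECT mp.MemberProfileId
    FROM MemberProfile AS mp
    WHERE mp.BranchChecksum = @Checksum
      AND NOT EXISTS (SELECT pb.BranchId
                      FROM MemberProfileBranch AS pb
                      WHERE pb.MemberProfileId = mp.MemberProfileId
                      EXCEPT
                      SELECT cb.BranchId FROM @CandidateBranches AS cb)
      AND NOT EXISTS (SELECT cb.BranchId FROM @CandidateBranches AS cb
                      EXCEPT
                      SELECT pb.BranchId
                      FROM MemberProfileBranch AS pb
                      WHERE pb.MemberProfileId = mp.MemberProfileId);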

Related

Create SQL tables for each user as security measure

I've researched this topic and I'm relatively sure that in most cases the answer is "No", but I would like some second opinions specific to my case.
We're currently working on a multi-user web app where each user will basically have their own copy ("portal/app") within the web app. It's not performance I'm worried about, but security.
I'm considering partitioning the data with a per-user prefix (userid_table1, userid_table2) to make it more manageable and to ensure no security validation oversight is made by the team during development, since we can easily add a check that queries can only be run against tables matching userid_*.
Would you still recommend against this method?
"I'm considering partitioning the data with a per-user prefix (userid_table1, userid_table2) to make it more manageable and to ensure no security validation oversight is made by the team during development, since we can easily add a check that queries can only be run against tables matching userid_*."
More manageable? That sounds like a joke. Your database will end up with a zillion different tables. Any operation that you want to do across all users will be a nightmare:
Declaring foreign key constraints.
Defining a new index on the tables.
Adding a new column.
Restructuring the tables.
And so on. And so on.
Your users may be limited to a single table. But the application developer and DBA need to deal with all of them. I cringe thinking about trying to figure out where performance bottlenecks are in such a system.
I should add that databases are optimized for big tables, not lots of tables, so multiple tables are typically less efficient. And they are even less efficient when you think about all the half-filled pages in all those tables.
The same entities should not be spread among multiple tables, unless you have a really, really good reason. This is not a really good reason. One simple solution is to prevent users from having access to the base tables. Just give them access to views or user-defined table functions -- and have all of these filter on user ids.
There are some edge cases where you do want separate tables for each user. Typically, each user would have very complex tables (think B2B application) and, in fact, they might have their own database. There may also be legal requirements to separate data. In these cases, though, the "separateness" would typically be at the database level, not the table level.
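As a sketch of the view-based approach mentioned above (illustrative names; SESSION_CONTEXT needs SQL Server 2016 or later, and other engines have equivalents):

    -- illustrative names only: one shared table for everyone
    CREATE TABLE dbo.Orders (
        OrderId INT IDENTITY PRIMARY KEY,
        UserId  INT NOT NULL,
        Amount  DECIMAL(10, 2) NOT NULL
    );
    GO

    -- a view that only shows the calling session's rows; the app sets the id
    -- per connection with: EXEC sp_set_session_context N'UserId', @UserId;
    CREATE VIEW dbo.MyOrders
    AS
    SELECT OrderId, Amount
    FROM dbo.Orders
    WHERE UserId = CAST(SESSION_CONTEXT(N'UserId') AS INT);
    GO

    -- users get SELECT on the view and no rights at all on dbo.Orders, e.g.
    -- GRANT SELECT ON dbo.MyOrders TO AppUserRole;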

Refactor schema with multiple many-to-many relationships to semantically different but structurally similar entities?

I have a schema with a number of many-to-many relationships, and what I'm seeing is a lot of similar data structures spread out among tables with different names. My intuition is that there is a more efficient/desirable way to achieve the same result, but I'm not sure what alternative approaches fall within reasonable design/best practices.
Note: Countries, TrafficTypes and People - as they exist now - could all be represented by Id and Name columns, but in the future may have additional fields. Maybe what I'm after is some kind of technique akin to inheritance?
Here's the diagram of what I've got:
Don't lump together things which you think are similar; they may diverge later when you need to store more information about each entity.
Is there a problem with the number of tables you have in your database?
You are probably thinking about the problem from an object oriented design position and thinking you can use some sort of "parent" table to represent the common parts - databases don't work that way.
If you are not careful you will end up with a MUCK (Massively Unified Code Key) or OTLT (One True Lookup Table).
Off the bat, I would keep separate entities/objects (cars vs. animals) in separate tables.
The chances of such entities overlapping properties are slim at best.
Thing is, once those entities start to evolve in your system, you will find that a single table will have hundreds of columns, with only a few populated per entity.
I don't see how inheritance applies to your case. But if you are interested in inheritance, as it applies to SQL tables or relations, here are some things to look up:
At the level of ER modeling look up "ER model specialization". This is the way the extended ER model diagrams "is A" relationships.
At the level of table design, look up "Class Table Inheritance" and "Shared Primary Key" for a couple of techniques that, used together, sort of mimic what inheritance does for you in an OOP language. You might also want to look up "single table inheritance" for an alternative that's simpler, but can be more wasteful.
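A bare-bones sketch of those two techniques used together, with invented names only loosely based on your entities, would be:

    -- invented names: the "superclass" table holds the common Id/Name part
    CREATE TABLE NamedEntity (
        EntityId INT NOT NULL PRIMARY KEY,
        Name     VARCHAR(100) NOT NULL
    );

    -- each "subclass" shares the superclass key (shared primary key)
    CREATE TABLE Country (
        EntityId INT NOT NULL PRIMARY KEY REFERENCES NamedEntity (EntityId),
        IsoCode  CHAR(2) NOT NULL          -- a future country-only field
    );

    CREATE TABLE Person (
        EntityId  INT NOT NULL PRIMARY KEY REFERENCES NamedEntity (EntityId),
        BirthDate DATE NULL                -- a future person-only field
    );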
Don't worry. You're doing it right.
In a highly normalized schema, you're going to have tons of tables that are nearly identical.
As the others have said, what you're starting to consider (a generic table for multiple things) is a very bad idea and has many drawbacks.
The biggest drawback is that your relationships are made useless by it, in terms of maintaining data integrity. There's nothing stopping someone from assigning a CountryCode where a TrafficTypeId should be, and so on.
Another drawback would be that having one larger table will likely perform worse than many smaller, specialized tables; due to extra, unnecessary blocking.
You may still want to implement some type of inheritance concept, but that'll be best done in whatever code accesses the database.

T-SQL database design and tables

I'd like to hear some opinions or discussion on a matter of database design. My colleagues and I are developing a complex application in the finance industry that is being installed in several countries.
Our contractors wanted us to keep a single application for all the countries so we naturally face the difficulties with different workflows in every one of them and try to make the application adjustable to satisfy various needs.
The issue I encountered today was a request from the head of the IT department on the contractor's side that we keep the database model as it is, in terms of the tables and the columns they consist of.
For example, we have a table with different risks and we needed to add a flag column IsSomething (BIT NOT NULL ...). It fully qualifies to exist within the risk table according to third normal form: no transitive dependency on the key, a non-key value ...
BUT, the guy said that he wants to keep the tables as they are, so we had to make a new table "riskinfo" holding the new column and link it 1:1 to the risk table.
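Roughly, and with invented key and column names, the two alternatives look like this:

    -- invented names; what we wanted to do: add the flag to the existing table
    ALTER TABLE Risk
        ADD IsSomething BIT NOT NULL CONSTRAINT DF_Risk_IsSomething DEFAULT (0);

    -- what we were asked to do instead: a separate 1:1 table
    CREATE TABLE RiskInfo (
        RiskId      INT NOT NULL PRIMARY KEY REFERENCES Risk (RiskId),
        IsSomething BIT NOT NULL
    );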
What is your opinion?
We add columns to our tables that are referenced by a variety of apps all the time.
So long as the applications specifically reference the columns they want to use, and you make sure the new fields are either nullable or have a sensible default defined so they don't interfere with inserts, I don't see any real problem.
That said, if an app does a select * and then proceeds to reference the columns by index rather than name, you could produce issues in existing code. Personally I have confidence that nothing referencing our database does this because of our coding conventions (that, and I suspect the code review process would lynch someone who tried it :P), but if you're not certain then there is at least some small risk to such a change.
In your actual scenario I'd go back to the contractor, give your reasons why you don't think the change will cause any problems, and ask the rationale behind their choice. Maybe they have some application-specific wisdom behind their suggestion, maybe it's just paranoia from dealing with other companies that change the database structure in ways that aren't backwards-compatible, or maybe it's just a policy at their company that got rubber-stamped long ago and nobody has challenged since. Until you ask, you never know.
This question is indeed subjective, as Binary Worrier commented. I do not have an answer nor any suggestion; just sharing my 2 cents.
Do you know the rationale for those decisions? Sometimes good designs are compromised for the sake of not breaking currently working applications or simply for the fact that too much has been done based on the previous one. It could also be many other non-technical reasons.
Very often, the programming community is unreasonably concerned about the ripple effect that results from redefining tables. Usually, this is a result of failure to understand data independence, and failure to guard the data independence of their operations on the data. Occasionally, the original database designer is at fault.
Most object oriented programmers understand encapsulation better than I do. But these same experts typically don't understand squat about data independence. And anyone who has learned how to operate on an SQL database, but never learned the concept of data independence is dangerously ignorant. The superficial aspects of data independence can be learned in about five minutes. But to really learn it takes time and effort.
Other responders have mentioned queries that use "select *". A select with a wildcard is more data dependent than the same select that lists the names of all the columns in the table. This is just one example among dozens.
The thing is, both data independence and encapsulation pursue the same goal: containing the unintended consequences of a change in the model.
Here's how to keep your IT chief happy. Define a new table with a new name that contains all the columns from the old table, and also all the additional columns that are now necessary. Create a view, with the same name as the old table, that contains precisely the same columns, and in the same order, that the old table had. Typically, this view will show all the rows in the old table, and the old PK will still guarantee uniqueness.
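In T-SQL, one way to carry that out (here renaming the existing table in place rather than copying it, and with invented table and column names) is:

    -- invented names: rename the real table and widen it
    EXEC sp_rename 'dbo.Risk', 'RiskBase';

    ALTER TABLE dbo.RiskBase
        ADD IsSomething BIT NOT NULL CONSTRAINT DF_RiskBase_IsSomething DEFAULT (0);
    GO

    -- recreate the old name as a view exposing exactly the old columns, in the old order
    CREATE VIEW dbo.Risk
    AS
    SELECT RiskId, RiskName, Amount    -- the original column list only
    FROM dbo.RiskBase;
    GO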
Once in a while, this will fail to meet all of the IT chief's needs. And if the IT chief is really saying "I don't understand databases; so don't change anything" then you are up the creek until the IT chief changes or gets changed.

How to handle column growth of wide, flat tables

How would you DBAs handle this? I have taken ownership of an existing app (VB6) and database that was written in 1999. The database design is fairly 'flat', meaning the main tables are fairly wide (100+ columns) and developers have continued to tack additional columns onto the end of the tables. This has resulted in columns that have a lot of NULLs since they don't directly relate to the primary key.
I am considering splitting the main table out as a way to abstract myself from the years and years of 'column explosion'. I am certain that new fields will continue to be added as new requirements come up.
So the question is, as new fields are needed, do you continue to grow the width of the existing table? Or do you STOP extending an existing table and split it out into a separate supporting table that will house new fields, thereby creating a 1-to-1 relationship? If you were to split the main table, what would your naming scheme be?
Let's assume for this example I have a table called 'Foreclosure' with 150 fields.
What is a good name for the new 1-to-1 table? 'ForeclosureExtended'? 'ForeclosureOtherInfo'?
By the way, there are Views and Stored Procs that will need to be modified to support any new tables, but that is inevitable anyway when columns are added.
Thanks in advance for any thoughts.
80% of the time, your nulls have definite patterns.
These patterns define subclasses of your table. In your case, they will be subclasses of Foreclosure.
Your splitting should be based on these subclass relationships.
Say, for example, some Foreclosure instances have a bunch of fields related to legal proceedings that are nearly all filled in. And other Foreclosure instances have the legal proceeding fields entirely filled with nulls.
You have two classes. You need to work out the relationship between them -- are they superclass-subclass or are they peer subclasses of some other superclass?
This tells you how to partition your table to make useful stuff happen.
You may have proper superclass-subclass relationships.
You may have found a thing (a LegalProceeding) which should have been a separate table all along. It should not have been permanently joined into Foreclosure. This is remarkably common.
You now have some relational implementation choices.
One common choice is to put all subclasses into a single, massive table with a lot of nulls. This is what you have today, and it isn't working.
One choice is to split the two subclasses into peer tables, duplicating the common information.
One choice is to have a superclass table with an optional FK reference to the additional information in the subclass.
One choice is to have a subclass table with a mandatory FK reference to the superclass information.
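For example, that last choice might look roughly like this (column names invented for illustration):

    -- invented names: superclass holds the columns every foreclosure has
    CREATE TABLE Foreclosure (
        ForeclosureId   INT NOT NULL PRIMARY KEY,
        PropertyAddress VARCHAR(200) NOT NULL
        -- ... the other always-populated columns ...
    );

    -- subclass: only foreclosures with a legal proceeding get a row here
    CREATE TABLE ForeclosureLegalProceeding (
        ForeclosureId INT NOT NULL PRIMARY KEY
            REFERENCES Foreclosure (ForeclosureId),   -- mandatory FK, shared key
        CourtName     VARCHAR(100) NOT NULL,
        FilingDate    DATE NOT NULL
        -- ... the columns that are otherwise mostly NULL ...
    );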
Unless you are really brave, the app is very small/simple, or there are major performance issues, do not fix the schema. If it ain't broke, don't fix it.
Just create a new table ForeclosureExtended, as you suggest, with the same key, and start adding columns. Or you could make proper tables with grouped columns as new columns appear. Either way, if the schema is this bad, I'll bet the code is very fragile.
Why do you feel that you have a problem? To my mind it's easier to deal with one table that has a lot of columns than it is to deal with a ton of narrower tables and all the associated views you have to maintain.

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLite files are entire databases which our users would like to query across. There are two ways we see of implementing this: one, create a single master database and a backend in Python that appends the tables from each incoming database to it; two, query across the separate databases' tables and unify the results in Python.
Both methods run into trouble when the black box alters its table structures, say for example renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming.)
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
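A rough sketch of the consolidation approach in SQLite itself (the file, table, and column names here are invented for illustration):

    -- invented names: one master table that also records which file each row came from
    CREATE TABLE IF NOT EXISTS measurements (
        source_file TEXT    NOT NULL,
        sample_id   INTEGER NOT NULL,
        value       REAL    NOT NULL
    );

    -- pull one black-box output file into the master database
    ATTACH DATABASE 'run_2024_01.sqlite' AS run1;

    INSERT INTO measurements (source_file, sample_id, value)
    SELECT 'run_2024_01', sample_id, value
    FROM run1.measurements;

    DETACH DATABASE run1;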
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables, in the worst case, your interpretation of the data will entirely break down.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a commonkey/code/value schema).
Even in a data warehouse ETL situation, a new column has to be classified as either a fact or a dimensional attribute, and then, if it is a dimensional attribute, you have to decide which dimension table it is best assigned to. This could somewhat be automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating something like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually more easy to anticipate as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick and dirty solution, not a 'generic' solution, because generic solutions like the entity-attribute-value model often perform badly. Don't do client-side joining (unifying the results) inside your Python code because that is very slow; use SQL for joining, it is designed for that purpose. Users can also make their own reports with all kinds of reporting tools that generate SQL statements. You don't have to do everything in your app; just start with solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes, you can define views for backward compatibility that keep your app functioning.
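For example (with invented names), if the black box renames a column, a view can keep your existing queries working:

    -- invented names: suppose a new release renames "conc" to "concentration";
    -- a view re-exposes the old name so existing queries keep working
    CREATE VIEW samples_compat AS
    SELECT sample_id,
           concentration AS conc
    FROM samples;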
Maybe the scientific software will add a lot of new features, and maybe it will change its data model because of those new features? That is possible, but then you will have to change your application anyway to take advantage of those new features.
It sounds to me as if your problem isn't really about MySQL or SQLlite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the user of the same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal with it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it were up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe some design like a star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.