SQL relationship table - sql

I have theoretical question about SQL design. When I have 10s of tables, between which I need relations m:n, is better approach to do relation table for each required pair or is it possible (or from performance view better) to have one relation table with columns (id,table1,row1,table2,row2), with integers in it?

I am an Informatics student and based on what I have learned in class it is always better to create a table in between the two tables in order for it to hold the primary keys of each of those two tables. This table will hold the relations like the following:
student: student_id, first_name, last_name
classes: class_id, name, teacher_id
student_classes: class_id, student_id # the relations table

Of course it would be better to have a more detailed example, at least for a smaller scope, which are tables, why do you need that relation etc.
But generally, in my opinion, database design should follow logic of your data, so relations should be where they are needed.
Other thing, if you will have only one table for all relations, it will be kind of bottle neck for the whole application. So then, how critical this situation is, also depends, on how are you going to use it in application (mostly reads, or lots of writes/updates/deletes, how many rows.. etc).
Also this consideration also could be somewhat even dependent on what RDBMS you're using.

Depends on the case, sometimes it's better to follow normal forms for designing (https://www.sqlshack.com/what-is-database-normalization-in-sql-server/), but after years working in software... from experience that is lost with time, the original design gets messy unless there is a team for that.

Related

Simple SQL schema design? When do we implement redundant relationships?

I often find myself in a situation where I don't know how redundant my SQL relationships should be. Here's a simple example:
In both cases, all data is accessible via joins. If I want to query the school in the first example (if I only know the student_id), I would need two joins. In the second example, I'd need one less join but the relationships are more complicated.
I'm not sure if both approaches are valid or if one is an antipattern in SQL?
(of course, this is a simplified example)
It's worth thinking through the implications of both designs.
I assume this is a hypothetical scenario - in practice, you're likely to have many-to-many relationships between school and class, and between class and student.
The first option you present is a normalized design. This means you're making explicit statements about the problem domain:
A school has 0 or more classes
A class belongs to exactly 1 school
A class has 0 or more students
A student belongs to exactly on class
The information about the relationships between your entities is contained in a single place. This has several benefits - you only have to update it in one place if a student changes class, or a class changes school. This is much less likely to cause bugs in future. The schema is easy to understand for new developers who are used to a normalized database model - you don't have to worry about bugs, or remember to update two different tables when relationships change.
The "drawback" is that you need joins to answer queries like "find all students for a school". I've put quotes around drawback, because joins are inherent to the relational model - it's like saying feet are a drawback of having legs. As long as you have indexes on the identifying columns, there's usually no noticable performance impact.
Option 2 is de-normalized by duplicating the information about the relationship between class and school in the student table. De-normalization is not quite an anti-pattern, but it brings risks. The duplication of information means you have to maintain it in two places, and handle bugs where that information gets out of sync. For non-trivial systems, this can be a lot of work.
Denormalization is used when there are performance challenges with the normalized model. On modern hardware, that's typically in databases with hundreds of millions of records or more.
Check what is normalization. The second schema is often used in star schema in datawarehouses. This is because the difference implicit in oltp vs olap systems. In oltp relations using joins are required, but if it's an "only read" (olap) database joins don't carry a good performance.
I would follow the second way as it is more realistic to a real-world situation.
Eg:
Considering a university like VTU which has 4 regions and different colleges in each region, the student is identified with a USN in the format region_code + college_code + year + branch_code + student_code .
ex: 4 XX 10 YY ZZZ
I could easily query something like:-
All students of a region.
All students of a college.
All students of a region of some branch.
etc...

Should I use one-to-one relationships to avoid repetition even if the new table wasn't really needed?

I'm designing a database from scratch and I'm wondering if the way I'm using one-to-one relationships is correct.
Imagine I have a table that needs the columns city and country_id, the first being alphanumeric and the second being a foreign key to another table. Should I place these in a locations table and use a one-to-one relationship?
Another example:
I have a table with the factory information of a device like the serial number and other fields. These will later be used to register a device in another table. Of course this is a one-to-one relationship, but should the columns of the first table be in the second table instead? Have in mind that the registrations table has another 4-5 columns.
I've read a lot of times that these relationships can often be omitted. However, I like the separation of concerns that creating a new table can give, in some cases.
Thanks in advance!
This may be a duplicate question, e.g. see:
SQL one to one relationship vs. single table
There's no perfect answer to this question as it depends on use case, but here's the rule of thumb I recommend abiding by: if you can already envision the potential need for separate tables, then I would err on the side of splitting them and using a 1:1 relationship. For example, imagine in the future that you want to have some kind of one-to-many relationship between some new table and the country table, or between some new table and the device table. In these cases you probably don't want city information mixed up with the former, and you probably don't want device registration information mixed up with the latter. By keeping your DB schema normalized, you can better future-proof it and you can mitigate the need for (likely extremely painful) updates that may have otherwise cropped up.

SQL one to one relationship vs. single table

Consider a data structure such as the below where the user has a small number of fixed settings.
User
[Id] INT IDENTITY NOT NULL,
[Name] NVARCHAR(MAX) NOT NULL,
[Email] VNARCHAR(2034) NOT NULL
UserSettings
[SettingA],
[SettingB],
[SettingC]
Is it considered correct to move the user's settings into a separate table, thereby creating a one-to-one relationship with the users table? Does this offer any real advantage over storing it in the same row as the user (the obvious disadvantage being performance).
You would normally split tables into two or more 1:1 related tables when the table gets very wide (i.e. has many columns). It is hard for programmers to have to deal with tables with too many columns. For big companies such tables can easily have more than 100 columns.
So imagine a product table. There is a selling price and maybe another price which was used for calculation and estimation only. Wouldn't it be good to have two tables, one for the real values and one for the planning phase? So a programmer would never confuse the two prices. Or take logistic settings for the product. You want to insert into the products table, but with all these logistic attributes in it, do you need to set some of these? If it were two tables, you would insert into the product table, and another programmer responsible for logistics data would care about the logistic table. No more confusion.
Another thing with many-column tables is that a full table scan is of course slower for a table with 150 columns than for a table with just half of this or less.
A last point is access rights. With separate tables you can grant different rights on the product's main table and the product's logistic table.
So all in all, it is rather rare to see 1:1 relations, but they can give a clearer view on data and even help with performance issues and data access.
EDIT: I'm taking Mike Sherrill's advice and (hopefully) clarify the thing about normalization.
Normalization is mainly about avoiding redundancy and relateded lack of consistence. The decision whether to hold data in only one table or more 1:1 related tables has nothing to do with this. You can decide to split a user table in one table for personal information like first and last name and another for his school, graduation and job. Both tables would stay in the normal form as the original table, because there is no data more or less redundant than before. The only column used twice would be the user id, but this is not redundant, because it is needed in both tables to identify a record.
So asking "Is it considered correct to normalize the settings into a separate table?" is not a valid question, because you don't normalize anything by putting data into a 1:1 related separate table.
Creating a new table with 1-1 relationships is not a reasonable solution. You might need to do it sometimes, but there would typically be no reason to have two tables where the user id is the primary key.
On the other hand, splitting the settings into a separate table with one row per user/setting combination might be a very good idea. This would be a three-table solution. One for users, one for all possible settings, and one for the junction table between them.
The junction table can be quite useful. For instance, it might contain the effective and end dates of the setting.
However, this assumes that the settings are "similar" to each other, in a SQL sense. If the settings are different such as:
Preferred location as latitude/longitude
Preferred time of day to receive an email
Flag to be excluded from certain contacts
Then you have a data-type problem when storing them in a table. So, the answer is "it depends". A lot of the answer depends on what the settings look like, how they will be used, and the type of constraints on them.
You're all wrong :) Just kidding.
On a very high load, high volume, heavily updated system splitting a table by 1:1 helps optimize I/O.
For example, this way you can place heavily read columns onto separate physical hard-drives to speed-up parallel reads (the 1-1 tables have to be in different "filegroups" for this). Or you can optimize table-level locks. Etc. Etc.
But this type of optimization usually does not happen until you have millions of rows and huge read/write concurrency
Splitting tables into distinct tables with 1:1 relationships between them is usually not practiced, because :
If the relationship is really 1:1, then integrity enforcement boils down to "inserts being done in all concerned tables, or none at all". Achieving this on the server side requires systems that support deferred constraint checking, and AFAIK that's a feature of the rather high-end systems. So in many cases the 1:1 enforcement is pushed over to the application side, and that approach has its own obvious downsides.
A case when splitting tables is nonetheless advisable, is when there are security perspectives, i.e. when not all columns can be updated by one user. But note that by definition, in such cases the relationship between the tables can never be strictly 1:1 .
(I also suggest you read carefully the discussion between Thorsten/Mike. You used the word 'normalization' but normalization has very little to do with your scenario - except if you were considering 6NF, which I think is rather unlikely.)
It makes more sense that your settings are not only in a separate table, but also use a on-to-many relationship between the ID and Settings. This way, you could potentially have a as many (or as few) setting as required.
UserSettings
[Settings_ID]
[User_ID]
[Settings]
In fact, one could make the same argument for the [Email] field.

Hierarchical Database, multiple tables or column with parent id?

I need to store info about county, municipality and city in Norway in a mysql database. They are related in a hierarchical manner (a city belongs to a municipality which again belongs to a county).
Is it best to store this as three different tables and reference by foreign key, or should I store them in one table and relate them with a parent_id field?
What are the pros and cons of either solution? (both structural end efficiency wise)
If you've really got a limit of these three levels (county, municipality, city), I think you'll be happiest with three separate tables with foreign keys reaching up one level each. This will make queries almost trivial to write.
Using a single table with a parent_id field referencing the same table allows you to represent arbitrary tree structures, but makes querying to extract the full path from node to root an iterative process best handled in your application code.
The separate table solution will be much easier to use.
three different tables:
more efficient, if your application mostly accesses information about only one entity (county, municipality, city)
owner-member-relationship is a clear and elegant model ;)
County, Municipality, and City don't sound like they are the same kind of data ; so, I would use three different tables : one per data-type.
And, then, I would indeed use foreign keys between those.
Efficiency-speaking, not sure it'll change much :
you'll do joins on 3 tables instead of joining 3 times on the same table ; I suppose it's quite the same.
it might make a little difference when you need to work on only one of those three type of data ; but with the right indexes, the differences should be minimal.
But, structurally speaking, if those are three different kind of entities, it makes sense to use three different tables.
I would recommend for using three different tables as they are three different entities.
I would use only one table in those cases you don´t know the depth of the hierarchy, but it is not case.
I would put them in three different tables, just on the grounds that it is 3 different concepts. This will hamper speed and will complicate your queries. However given that MySQL does not have any special support for hirachical queries (like Oracle's connect by statement) these would be complicated anyway.
Different tables: it's just "right". I doubt you'll see any performance gains/losses either way but this is one where modelling it properly up-front will probably save you lots of headaches later on. For one thing it'll make SQL SELECTs easier to write and read.
You'll get different opinions coming back to you on this but my personal preference would be to have separate tables because they are separate entities.
In reality you need to think about the queries you will doing on this data and usually your answer will come from that. With separate tables your queries will look much cleaner and in the end your not saving yourself anything because you'll still be joining tables together, even if they are the same table.
I would use three separate tables, since you know exactly what categories of information you are working with, and won't need to dynamically alter the 'depth' of your hierarchy.
It'll also make the data simpler to manage, as you'll be able to tell if the data is for a city, municipality or a county just by knowing the table (and without having to discern the 'depth' of a record in the hierarchy first!).
Since you'll probably be doing self joins anyway to get the hierarchy to work, I'd doubt there would be any benefits from having all the data in a single table.
In dataware housing applications, adherents of the Kimball methodology might place these fields in the same attribute table:
create table city (
id int not null,
county varchar(50) not null,
municipality varchar(50),
city varchar(50),
primary key(id)
);
The idea being that attibutes should never be more than l join away from the fact table.
I just state this as an alternative view. I would go with the 3 table design personally.
This is a case of ‘Database Normalization’, which is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. The purpose is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.
Multiple tables will help in the situation if the task has been distributed among different developers, or users at different levels require different rights to view and change the data or the small tables help when you need this data for other purposes as well or so.
My vote would be for multiple tables - with data appropriately distributed.

Why to have a join table for 1:m relation in SQL

What is the benefit of having junction tables between the first 1:m and the second 1:m relations in the following database?
alt text http://dl.getdropbox.com/u/175564/db/db-simple.png
The book Joe Celko's trees and hierarchies in SQL for Smarties says that the reason is to have unique relations in 1:m's. For instance, the following tables resrict users to ask the exactly same question twice and to give exactly the same answer twice, respectively.
The first 1:m relation
users-questions
===============
user_id REFERENCES users( user_id )
question_id REFERENCES questions ( question_id )
PK( user_id, question_id) // User is not allowed to ask same question twice
The second 1:m relation
questions-answers
=================
question_id REFERENCES questions( question_id)
answer_id REFERENCES answers( aswer_id )
PK( question_id, answer_id ) // Question is not allowed to have to same answers
This benefit about uniqueness does not convince me to make my code more challenging.
I cannot understand why I should restrict the possibility of having questions or answers with the same ID in the db, since I can perhaps use PHP to forbid that.
Well, the unique relations thing seems nonsensical to me, probably because I'm used to DBMSes where you can define unique keys other than the primary key. In my world, mapping tables like those are how you implement a many-to-many relationship, and using them for a one-to-many relationship is madness — I mean, if you do that, maybe you intend for the relationship to be used as one-to-many, but what you've actually implemented is many-to-many support.
I don't agree with what you're saying about there being no utility to unique compound keys in the persistence layer because you can enforce that in the application layer, though. Persistence-layer uniqueness constraints have a lot of difficult-to-replicate benefits, such as, in MySQL, the ability to take advantage of INSERT ... ON DUPLICATE KEY UPDATE.
Its usually due to duplication of data.
As for your reasoning, yes you can enforce this in the business layer, but if you make a mistake, it could break a significant amount of code. The issue you have is your data model may have only a few tables. Lucky you. When your data model grows, if you can't make sense of the structure and you have to put all the logic to maintain denormalised tables in your GUI layer you could very easily run into problems. Note that it is hard to make things threadsafe on a GUI for your SQL Database without using locking which will destroy your performance.
DBMS are very very good at dealing with these problems. You can keep your data model clean and use indexing to provide you with the speed you need. Your goal should be to get it right first, and only denormalise your tables when you can see a clear need to do so (for performance etc.)
Believe it or not, there are many situations where having normalised data makes your life easier, not harder when it comes to your application. For instance, if you have one big table with questions and answers, you have to write code to check if it is unique. If you have a table with a primary key, you simply write
insert into table (col1, col2) values (#id, #value) --NOTE: You would probably
--make the id column an autonumber so you dont have to worry about this
The database will prevent you from inserting if you have a non unique value there OR if you are placing in an answer with no question. All you need to do is check whether the insertion worked, nothing more. Which one do you think is less code?
I agree that the join table for a one-to-many in this situation doesn't seem to add much benefit, and as #chaos says, you actually end up implementing many-to-many support. But Joe Celko is a smart guy - is this really the exact answer he gives?
One other possible reason for implementing a join table on a one-to-many is that it completely separates questions/answers from a dependence on users.
For example, say you added a Dogs tables and an Deities table. We all know that dogs can't register as users because they don't have email addresses, and gods don't register as users because, well, it's beneath them. Maybe dogs and gods still ask questions though, but to do that you might want to implement a dogs-questions table and a deities-questions table. In theory this is still many-to-many, but in practice you do it so that you can have multiple one-to-manys.