I am setting up a database where I'd like to have many-to-many relationships between some tables. There's no user interface for this database; we will be putting data into the tables using R scripts and retrieving it using Python scripts.
The entities involved are projects and cost forecasts. Multiple projects may use the same forecast. For each forecast, there are costs to develop a project in each of several future years. I need to be able to retrieve the cost forecast for each future year for each individual project.
I think the tables below would be a fairly standard way to represent these relationships. Note that "pk" means "primary key" and "fk" means "foreign key".
PROJECT: forecast_id (fk)
FORECAST: forecast_id (pk)
COST: forecast_id (fk)
To retrieve the forecast for a particular project, I would just retrieve all the rows from COST that have a matching forecast_id. I don't need the FORECAST table for anything, except as a home for the forecast_id that establishes the many-to-many relationship between PROJECT and COST.
So my main question is, can I just drop the FORECAST table and have a direct many-to-many relationship between PROJECT and COST, using the forecast_id? I know this is physically possible, but many discussions use language along the lines that "many-to-many relationships aren't possible without a bridge table." But why would I want to add the bridge table, if I can do all my queries without it and it is one more table I would have to maintain?
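A minimal sketch of the simplified two-table layout, using Python's sqlite3 module with an in-memory database. The column names name, year and cost, the sample rows, and the helper costs_for_project are assumptions made for illustration; only forecast_id comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project (
        project_id  INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,          -- hypothetical column
        forecast_id INTEGER NOT NULL        -- shared key, no FORECAST table
    );
    CREATE TABLE cost (
        forecast_id INTEGER NOT NULL,
        year        INTEGER NOT NULL,       -- hypothetical column
        cost        REAL NOT NULL,          -- hypothetical column
        PRIMARY KEY (forecast_id, year)
    );
    -- Projects A and B share forecast 10; project C uses forecast 20.
    INSERT INTO project VALUES (1, 'A', 10), (2, 'B', 10), (3, 'C', 20);
    INSERT INTO cost VALUES (10, 2025, 100.0), (10, 2026, 110.0), (20, 2025, 90.0);
""")


def costs_for_project(project_id):
    """Return (year, cost) rows for one project by matching on forecast_id."""
    return conn.execute(
        """
        SELECT c.year, c.cost
        FROM project p
        JOIN cost c ON c.forecast_id = p.forecast_id
        WHERE p.project_id = ?
        ORDER BY c.year
        """,
        (project_id,),
    ).fetchall()
```

Projects 1 and 2 get identical cost rows because they carry the same forecast_id, and no FORECAST table is involved in the query.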
Going further, many discussions of many-to-many relationships (including @mike-organek's comment below) suggest a structure similar to this:
PROJECT: project_id (pk)
PROJECT_COST: project_id (fk), cost_id (fk)
COST: cost_id (pk)
While this seems like a commonly preferred approach, it suits my needs even less well. Now every time I add a new project, instead of just assigning the forecast_id corresponding to a particular forecast, I have to add a bunch of link records to the PROJECT_COST table, one for each future year. This will also require a lot of management, and allows potential creation of relationships I don't want (e.g., one project uses costs from one forecast for the first two years, then costs from a different forecast for the next two years).
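The concern about unwanted combinations can be illustrated with a small sqlite3 sketch of the bridge-table layout. The columns name, year and cost and all sample values are assumptions; the point is only that the schema itself accepts link rows that mix two forecasts for one project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project (project_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE cost (
        cost_id     INTEGER PRIMARY KEY,
        forecast_id INTEGER NOT NULL,
        year        INTEGER NOT NULL,
        cost        REAL NOT NULL
    );
    CREATE TABLE project_cost (
        project_id INTEGER NOT NULL REFERENCES project(project_id),
        cost_id    INTEGER NOT NULL REFERENCES cost(cost_id),
        PRIMARY KEY (project_id, cost_id)
    );
    INSERT INTO project VALUES (1, 'A');
    INSERT INTO cost VALUES
        (1, 10, 2025, 100.0), (2, 10, 2026, 110.0),
        (3, 20, 2025,  90.0), (4, 20, 2026,  95.0);
    -- One link row per future year; nothing stops the rows from mixing forecasts.
    INSERT INTO project_cost VALUES (1, 1), (1, 4);
""")

# Which forecasts does project 1 draw its costs from?
forecasts_used = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT c.forecast_id
        FROM project_cost pc
        JOIN cost c ON c.cost_id = pc.cost_id
        WHERE pc.project_id = 1
        ORDER BY c.forecast_id
        """
    )
]
```

Here forecasts_used contains both 10 and 20, which is exactly the mixed-forecast situation described above.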
So my second question is, is there anything preferable about the second approach over the first approach, or over my simplified approach (using just the PROJECT and COST tables)?
There seems to be some confusion about what I'm asking here. So I've revised the question significantly to try to make it clearer. Note that I renamed cost_group to forecast as part of this.

The second approach (with the project_cost table containing two foreign keys) is the correct way to model a many-to-many relationship.
But your idea of a shared forecast_id (with or without a forecast table) shows that you are not thinking of a many-to-many relationship in the ordinary sense: if one project is associated with a certain set of costs, every other project must be associated with either the same set or a disjoint set of costs.
If that is what you want, I see no problem with removing the forecast table. There is no referential integrity you are losing that way.
If you have additional requirements, for example that there has to be at least a cost and a project for each existing forecast_id, things may change. That could be guaranteed with foreign keys from the forecast table, but not without that table.


Determining if an entity is weak or not

I'm creating a relational database of a store and its stock of products.
In the brief, it says "products can be returned under agreed terms e.g. expiry date or manufacturers error", based on this I created a weak entity "Terms" with product_ID as the foreign key and errors & expiry as two attributes.
My logic was that the terms only exist if the product exists, therefore it is a weak entity: every product has terms, but you wouldn't have terms not associated with a product.
Looking at it though, the "Terms" table would basically be Product ID (1) ---> Errors (No) ---> Expiry (01/01/23), and now I'm starting to think those two attributes should be attributes of the product table and not a separate entity, mainly because "Terms" doesn't have a partial/discriminator key that could be used as a composite primary.
Does anyone have any thoughts about which way is correct?
I think this answer really comes down to the trade-offs in terms of performance.
To make sure I understand your question correctly - you basically have two tables:
The main product table
A "lookup" table that just has Product_ID (FK), Errors, and Expiry as the columns
If this is the case, you have two options:
Option 1: Just add Errors and Expiry as columns to the primary product table.
Option 2: Keep the two tables separated as you have them, and just JOIN that data when needed.
Option 1 has the benefit of keeping all the data in one table, assuming that "Expiry" and "Errors" are unique to the product_ID; if they're not, you may end up duplicating data, and it's better to keep these fields in your separate table to have a 1:Many relationship. The other drawback would be that if your main Product table is beefy, you've slowed down the query even further by adding these columns.
Option 2 can circumvent the two shortcomings of Option 1 - by keeping this data separate, your Product table is much lighter, and if you have a 1:many relationship, you don't duplicate data (saving you more memory overall!). The drawback with Option 2 is that your ERD gets a bit more complicated - you have one more table to keep track of.
Based on these, I recommend keeping your separate "lookup" table - the benefits of separating this data out will help you in the long run - but ultimately you'll need to weigh the pros and cons since I don't know the extent of your project.
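As a rough sketch of the separate-table option, here is what the JOIN might look like with sqlite3. The table and column names, the product names, and the ISO date format are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL            -- hypothetical column
    );
    CREATE TABLE terms (
        -- product_id as primary key means at most one terms row per product
        product_id INTEGER PRIMARY KEY REFERENCES product(product_id),
        errors     TEXT NOT NULL,
        expiry     TEXT NOT NULL            -- ISO date string, an assumption
    );
    INSERT INTO product VALUES (1, 'Milk'), (2, 'Hammer');
    INSERT INTO terms VALUES (1, 'No', '2023-01-01');
""")

# LEFT JOIN keeps products that have no terms row; their terms come back as NULL.
rows = conn.execute(
    """
    SELECT p.product_id, p.name, t.errors, t.expiry
    FROM product p
    LEFT JOIN terms t ON t.product_id = p.product_id
    ORDER BY p.product_id
    """
).fetchall()
```

If several terms rows per product were needed (the 1:many case), the primary key on terms would have to include another column, which is the missing discriminator the question mentions.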

SQL one to one relationship vs. single table

Consider a data structure such as the below where the user has a small number of fixed settings.
Is it considered correct to move the user's settings into a separate table, thereby creating a one-to-one relationship with the users table? Does this offer any real advantage over storing them in the same row as the user (the obvious disadvantage being performance)?
You would normally split tables into two or more 1:1 related tables when the table gets very wide (i.e. has many columns). It is hard for programmers to have to deal with tables with too many columns. For big companies such tables can easily have more than 100 columns.
So imagine a product table. There is a selling price and maybe another price which was used for calculation and estimation only. Wouldn't it be good to have two tables, one for the real values and one for the planning phase? So a programmer would never confuse the two prices. Or take logistic settings for the product. You want to insert into the products table, but with all these logistic attributes in it, do you need to set some of these? If it were two tables, you would insert into the product table, and another programmer responsible for logistics data would care about the logistic table. No more confusion.
Another thing with many-column tables is that a full table scan is of course slower for a table with 150 columns than for a table with just half of this or less.
A last point is access rights. With separate tables you can grant different rights on the product's main table and the product's logistic table.
So all in all, it is rather rare to see 1:1 relations, but they can give a clearer view on data and even help with performance issues and data access.
EDIT: I'm taking Mike Sherrill's advice and (hopefully) clarify the thing about normalization.
Normalization is mainly about avoiding redundancy and the related lack of consistency. The decision whether to hold data in only one table or in several 1:1 related tables has nothing to do with this. You can decide to split a user table into one table for personal information like first and last name and another for school, graduation and job. Both tables would stay in the same normal form as the original table, because no data is more or less redundant than before. The only column used twice would be the user id, but this is not redundant, because it is needed in both tables to identify a record.
So asking "Is it considered correct to normalize the settings into a separate table?" is not a valid question, because you don't normalize anything by putting data into a 1:1 related separate table.
Creating a new table with 1-1 relationships is not a reasonable solution. You might need to do it sometimes, but there would typically be no reason to have two tables where the user id is the primary key.
On the other hand, splitting the settings into a separate table with one row per user/setting combination might be a very good idea. This would be a three-table solution. One for users, one for all possible settings, and one for the junction table between them.
The junction table can be quite useful. For instance, it might contain the effective and end dates of the setting.
However, this assumes that the settings are "similar" to each other, in a SQL sense. If the settings are different such as:
Preferred location as latitude/longitude
Preferred time of day to receive an email
Flag to be excluded from certain contacts
Then you have a data-type problem when storing them in a table. So, the answer is "it depends". A lot of the answer depends on what the settings look like, how they will be used, and the type of constraints on them.
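A minimal sqlite3 sketch of the three-table layout with effective and end dates on the junction table. All table names, column names and sample values are assumptions. Storing every value as TEXT is one way around the data-type problem, at the cost of losing type checking in the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE settings (setting_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE user_settings (
        user_id        INTEGER NOT NULL REFERENCES users(user_id),
        setting_id     INTEGER NOT NULL REFERENCES settings(setting_id),
        value          TEXT NOT NULL,   -- everything stored as text
        effective_date TEXT NOT NULL,
        end_date       TEXT,            -- NULL means "still in effect"
        PRIMARY KEY (user_id, setting_id, effective_date)
    );
    INSERT INTO users VALUES (1, 'a@example.com');
    INSERT INTO settings VALUES (1, 'email_hour'), (2, 'exclude_from_contacts');
    INSERT INTO user_settings VALUES
        (1, 1, '08',   '2023-01-01', '2023-06-30'),
        (1, 1, '18',   '2023-07-01', NULL),
        (1, 2, 'true', '2023-01-01', NULL);
""")


def current_settings(user_id):
    """Return the settings currently in effect for a user as a name -> value dict."""
    rows = conn.execute(
        """
        SELECT s.name, us.value
        FROM user_settings us
        JOIN settings s ON s.setting_id = us.setting_id
        WHERE us.user_id = ? AND us.end_date IS NULL
        ORDER BY s.name
        """,
        (user_id,),
    ).fetchall()
    return dict(rows)
```

The caller has to convert '18' or 'true' back to the intended type, which is where the "it depends" in the answer starts to matter.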
You're all wrong :) Just kidding.
On a very high load, high volume, heavily updated system splitting a table by 1:1 helps optimize I/O.
For example, this way you can place heavily read columns onto separate physical hard-drives to speed-up parallel reads (the 1-1 tables have to be in different "filegroups" for this). Or you can optimize table-level locks. Etc. Etc.
But this type of optimization usually does not happen until you have millions of rows and huge read/write concurrency.
Splitting tables into distinct tables with 1:1 relationships between them is usually not practiced, because:
If the relationship is really 1:1, then integrity enforcement boils down to "inserts being done in all concerned tables, or none at all". Achieving this on the server side requires systems that support deferred constraint checking, and AFAIK that's a feature of the rather high-end systems. So in many cases the 1:1 enforcement is pushed over to the application side, and that approach has its own obvious downsides.
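To make "deferred constraint checking" concrete, here is a small sketch using SQLite, which supports DEFERRABLE INITIALLY DEFERRED foreign keys. The tables users and user_settings, their columns, and the helper insert_pair are hypothetical. Each table references the other, so a row in one without a matching row in the other is rejected at COMMIT rather than at INSERT time.

```python
import sqlite3

# isolation_level=None: transactions are managed explicitly with BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY
             REFERENCES user_settings(user_id) DEFERRABLE INITIALLY DEFERRED,
        name TEXT NOT NULL
    );
    CREATE TABLE user_settings (
        user_id INTEGER PRIMARY KEY
                REFERENCES users(id) DEFERRABLE INITIALLY DEFERRED,
        theme   TEXT NOT NULL
    );
""")


def insert_pair(user_id, name, theme=None):
    """Insert a user and, optionally, its settings row in one transaction.

    Returns True if the transaction committed, False if it was rejected.
    """
    conn.execute("BEGIN")
    try:
        conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, name))
        if theme is not None:
            conn.execute("INSERT INTO user_settings VALUES (?, ?)", (user_id, theme))
        conn.execute("COMMIT")  # deferred foreign keys are checked here
        return True
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")
        return False


ok = insert_pair(1, "ann", "dark")   # both rows present
rejected = insert_pair(2, "bob")     # settings row missing
user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Without the deferred check, the first INSERT of each pair would fail immediately, because the row it references does not exist yet.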
A case where splitting tables is nonetheless advisable is when there are security considerations, i.e. when not all columns can be updated by one user. But note that by definition, in such cases the relationship between the tables can never be strictly 1:1.
(I also suggest you read carefully the discussion between Thorsten/Mike. You used the word 'normalization' but normalization has very little to do with your scenario - except if you were considering 6NF, which I think is rather unlikely.)
It makes more sense for your settings not only to live in a separate table, but also to use a one-to-many relationship between the ID and Settings. This way, you could potentially have as many (or as few) settings as required.
In fact, one could make the same argument for the [Email] field.

Normalize SQL database

I'm creating a database for a project and I'm a little confused about how normalization applies to my schema. Every time a loan is approved for a customer, they have two options, a check or an EFT, so I want to know whether the loan was paid out by check or by EFT.
These are my 3 tables:
loans: id_loan (PK)
checks: id_check (PK)
EFT: id_eft (PK)
Then I created a 4th table to establish a relationship between loans and money disposal.
id_payment (PK)
id_loan (FK loans)
id_disposal (FK checks or EFT)
In this table I store whether the loan is related to a check or an EFT, disposal_type field is a varchar with two possible values "check" or "EFT". id_disposal field acts as a foreign key for two tables.
The problem is that I think my database isn't normalized with this structure, am I right? What would be the best way to solve this?
You need something like the attached. Note that the customer_loans table is kind of extraneous and overkill, but if there's any columns that relate to the customer and the loan, and not the customer's loan payments, that's where it would go.
In the object world, you'd use inheritance for this. There would be a base type Disposal which CheckDisposal and EftDisposal would derive from. Modern O/RMs support several techniques for mapping this to a relational structure.
TablePerHierarchy puts all of the records into a single table with a discriminator column to identify what type a specific record holds and maps to. The advantage is that it requires fewer joins to get a record. Disadvantage is that it requires app logic to enforce data integrity.
TablePerType maps records into different tables with a fk relationship back to the base table. Of course this requires more joins (especially for deep or wide hierarchies) but data integrity can be enforced in the DB.
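A rough sqlite3 sketch of the TablePerType idea applied to loans. The base table disposals, the subtype tables checks and efts, and the columns amount, check_number and account_number are assumptions, not the asker's actual schema. Each subtype row points back to the base row, so the foreign keys are real and the database can enforce them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE loans (id_loan INTEGER PRIMARY KEY, amount REAL NOT NULL);
    -- Base type: one disposal per loan, with a discriminator column.
    CREATE TABLE disposals (
        id_disposal   INTEGER PRIMARY KEY,
        id_loan       INTEGER NOT NULL UNIQUE REFERENCES loans(id_loan),
        disposal_type TEXT NOT NULL CHECK (disposal_type IN ('check', 'EFT'))
    );
    -- Subtypes: primary key doubles as foreign key to the base table.
    CREATE TABLE checks (
        id_disposal  INTEGER PRIMARY KEY REFERENCES disposals(id_disposal),
        check_number TEXT NOT NULL
    );
    CREATE TABLE efts (
        id_disposal    INTEGER PRIMARY KEY REFERENCES disposals(id_disposal),
        account_number TEXT NOT NULL
    );
    INSERT INTO loans VALUES (1, 5000.0);
    INSERT INTO disposals VALUES (1, 1, 'check');
    INSERT INTO checks VALUES (1, 'CHK-001');
""")


def disposal_for_loan(id_loan):
    """Return (disposal_type, detail) for a loan, or None if it has no disposal."""
    return conn.execute(
        """
        SELECT d.disposal_type, COALESCE(c.check_number, e.account_number)
        FROM disposals d
        LEFT JOIN checks c ON c.id_disposal = d.id_disposal
        LEFT JOIN efts   e ON e.id_disposal = d.id_disposal
        WHERE d.id_loan = ?
        """,
        (id_loan,),
    ).fetchone()


def orphan_check_rejected():
    """Try to insert a check with no base disposal row; the FK should refuse it."""
    try:
        conn.execute("INSERT INTO checks VALUES (99, 'CHK-999')")
    except sqlite3.IntegrityError:
        conn.rollback()
        return True
    conn.rollback()
    return False
```

This sketch does not stop one disposal from having both a checks row and an efts row; that would need further constraints, for example a composite foreign key that includes disposal_type.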

Storing Revisions of Relational Objects in an Efficient Way

I'm not sure if this type of question has been answered before. In my database I have a product table and specifications table. Each product can have multiple specifications. Here I need to store the revisions of each product in database in order to query them later on for history purposes.
So I need an efficient way to store the products' relations to specifications each time users change them. The amount of data can also become very large. For example, suppose there are 100000 products in the database, each product can have 30 specifications, and there is a minimum of 20 revisions per product. Storing all of that in a single table makes the table enormous.
Any suggestions?
If this is purely for 'archival' purposes then maybe a separate table for the revisions is better.
However if you need to treat previous revisions equally to current revisions (for example, if you want to give users the ability to revert a product to a previous revision), then it is probably best to keep a single products table, rather than copying data between tables. If you are worried about performance, this is what indexes are for.
You can create a compound primary key for the product table, e.g. PRIMARY KEY (product_id, revision). Maybe a stored proc to find the current revision—by selecting the row with the highest revision for a particular product_id—will be useful.
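A small sketch of the compound-key idea in sqlite3. SQLite has no stored procedures, so a Python function stands in for the stored proc here; the name column, the sample rows, and the function name current_revision are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER NOT NULL,
        revision   INTEGER NOT NULL,
        name       TEXT NOT NULL,           -- hypothetical column
        PRIMARY KEY (product_id, revision)
    );
    INSERT INTO product VALUES
        (1, 1, 'Widget'),
        (1, 2, 'Widget v2'),
        (2, 1, 'Gadget');
""")


def current_revision(product_id):
    """Return the row with the highest revision for a product, or None."""
    return conn.execute(
        """
        SELECT product_id, revision, name
        FROM product
        WHERE product_id = ?
        ORDER BY revision DESC
        LIMIT 1
        """,
        (product_id,),
    ).fetchone()
```

The primary key index on (product_id, revision) is what lets the lookup stay cheap as the number of revisions grows.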
I would recommend having a history table that is an exact copy of the current table plus a HistoryDate column, and storing the revisions in that table. You can do this for all 3 tables in question.
By keeping the revision separate from the main tables, you will not incur any performance penalties when querying the main tables.
You can also look at keeping a record to indicate the user that changed the data.

Do these database design styles (or anti-pattern) have names?

Consider a database with tables Products and Employees. There is a new requirement to model current product managers, being the sole employee responsible for a product, noting that some products are simple or mature enough to require no product manager. That is, each product can have zero or one product manager.
Approach 1: alter table Product to add a new NULLable column product_manager_employee_ID so that a product with no product manager is modelled by the NULL value.
Approach 2: create a new table ProductManagers with non-NULLable columns product_ID and employee_ID, with a unique constraint on product_ID, so that a product with no product manager is modelled by the absence of a row in this table.
There are other approaches but these are the two I seem to encounter most often.
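For concreteness, both approaches can be sketched side by side in sqlite3. The table names products_a1 and products_a2, the name columns and the sample rows are assumptions; product_manager_employee_ID and ProductManagers follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Approach 1: nullable column on the product table.
    CREATE TABLE products_a1 (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        product_manager_employee_id INTEGER REFERENCES employees(employee_id)
    );

    -- Approach 2: separate table, no NULLs, unique on product_id.
    CREATE TABLE products_a2 (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product_managers (
        product_id  INTEGER NOT NULL UNIQUE REFERENCES products_a2(product_id),
        employee_id INTEGER NOT NULL REFERENCES employees(employee_id)
    );

    INSERT INTO employees VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO products_a1 VALUES (1, 'Anvil', 1), (2, 'Nail', NULL);
    INSERT INTO products_a2 VALUES (1, 'Anvil'), (2, 'Nail');
    INSERT INTO product_managers VALUES (1, 1);
""")

# Approach 1: the manager is read straight off the product row.
managers_a1 = conn.execute(
    "SELECT name, product_manager_employee_id FROM products_a1 ORDER BY product_id"
).fetchall()

# Approach 2: an outer join reintroduces NULL for products without a manager.
managers_a2 = conn.execute(
    """
    SELECT p.name, pm.employee_id
    FROM products_a2 p
    LEFT JOIN product_managers pm ON pm.product_id = p.product_id
    ORDER BY p.product_id
    """
).fetchall()


def second_manager_rejected():
    """The unique constraint should refuse a second manager for product 1."""
    try:
        conn.execute("INSERT INTO product_managers VALUES (1, 2)")
    except sqlite3.IntegrityError:
        conn.rollback()
        return True
    conn.rollback()
    return False
```

Both queries return the same rows, so the difference between the approaches is where the "no manager" fact lives, not what can be asked of the data.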
Assuming these are both legitimate design choices (as I'm inclined to believe) and merely represent differing styles, do they have names? I prefer approach 2 and find it hard to convey the difference in style to someone who prefers approach 1 without employing an actual example (as I have done here!). It would be nice if I could say, "I prefer the inclination-towards-6NF (or whatever) style myself."
Assuming one of these approaches is in fact an anti-pattern (as I merely suspect may be the case for approach 1 by modelling a relationship between two entities as an attribute of one of those entities) does this anti-pattern have a name?
Well, the first is nothing more than a one-to-many relationship (one employee to many products). This is sometimes referred to as a 0:M relationship (zero to many) because it's optional (not every product has a product manager). Also, not every employee is a product manager, so it's optional on the other side too.
The second is a join table, usually used for a many-to-many relationship. But since one side is only one-to-one (each product is only in the table once) it's really just a convoluted one-to-many relationship.
Personally I prefer the first one but neither is wrong (or bad).
The second would be used for two reasons that come to mind.
You envision the possibility that a product will have more than one manager; or
You want to track the history of who the product manager is for a product. You do this with, say a current_flag column set to 'Y' (or similar) where only one at a time can be current. This is actually a pretty common pattern in database-centric applications.
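One way the current_flag pattern might look in sqlite3, as a sketch. The table layout, the index name one_current_manager and the helper set_manager are assumptions. A partial unique index is used so the database itself refuses a second current row per product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_managers (
        product_id   INTEGER NOT NULL,
        employee_id  INTEGER NOT NULL,
        current_flag TEXT NOT NULL CHECK (current_flag IN ('Y', 'N'))
    );
    -- At most one row per product may carry current_flag = 'Y'.
    CREATE UNIQUE INDEX one_current_manager
        ON product_managers(product_id) WHERE current_flag = 'Y';
""")


def set_manager(product_id, employee_id):
    """Retire the current manager (if any) and record the new one atomically."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE product_managers SET current_flag = 'N' "
            "WHERE product_id = ? AND current_flag = 'Y'",
            (product_id,),
        )
        conn.execute(
            "INSERT INTO product_managers VALUES (?, ?, 'Y')",
            (product_id, employee_id),
        )


def second_current_rejected():
    """Inserting another current row directly should violate the unique index."""
    try:
        with conn:
            conn.execute("INSERT INTO product_managers VALUES (1, 12, 'Y')")
    except sqlite3.IntegrityError:
        return True
    return False


set_manager(1, 10)
set_manager(1, 11)
history = conn.execute(
    "SELECT employee_id, current_flag FROM product_managers ORDER BY rowid"
).fetchall()
```

After the two calls, history holds the retired manager with flag 'N' followed by the current one with flag 'Y'.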
It looks to me like the two model different behaviour. In the first example, you can have one product manager per product and one employee can be product manager for more than one product (one to many). The second appears to allow for more than one product manager per product (many to many). This would suggest the two solutions are equally valid in different situations and which one you use would depend on the business rule.
There is a flaw in the first approach. Imagine for a second that the business requirements have changed and now you need to be able to assign two Product Managers to a product. What will you do? Add another column to the Product table? Yuck. That obviously violates 1NF.
Another option the second approach gives you is the ability to store attributes of a particular Product Manager <-> Product relation. For example, if you have two Product Managers for a product, you can mark one of them as the primary...
Or, for example, an employee can have a phone number, but as a product manager he/she can have another phone number... This also goes to the special table then.
Approach 1)
Slows down the use of the Product table with the additional Product Manager field (maybe not for all databases but for some).
Linking from the Product table to the Employee table is simple.
Approach 2)
Existing queries using the Product table are not affected.
Increases the size of your database. You've now duplicated the Product ID column to another table as well as added unique constraints and indexes to that table.
Linking from the Product table to the Employee table is more cumbersome and costly as you have to link to the intermediate table first.
How often must you link between the two tables?
How many other queries use the Product table?
How many records in the Product table?
In the particular case you give, I think the main motivation for two tables is avoiding nulls for missing data, and that's how I would characterise the two approaches.
There's a discussion of the pros and cons on Wikipedia.
I am pretty sure that, given C. J. Date's dislike of nulls, he defines relational theory so that only the multiple-table solution is "valid". For example, you could call the single-table approach "poorly typed" (since the type of null is unclear - see quote on p4).