I'm trying to figure out what the appropriate way to set up a database would be given this scenario:
I am creating a Movie / TV database. A movie may have multiple genres and a tv show may have multiple genres.
Essentially what I am wondering is if you have a Movie table, TV table... should you:
have a MovieHasGenre table consisting of a foreign key to the Movie table and a regular field for the genre value
or
have a MovieHasGenre table AND a Genre table where the MovieHasGenre has two foreign keys, one pointing to the Movie in the Movie table the other pointing to the Genre in the Genre table
I'm really not sure if this is something standardized or just a matter of preference. Should I be concerned about speed, since removing the Genre table means one less join?
Go with option 2.
It's useful to store each Genre once, and make reference to it via the MovieHasGenre table. That way, if you have other attribute columns for a genre, you don't have to store those attributes redundantly on each row where a given genre is mentioned.
Re your comment:
Another case is if you want to change the spelling of a genre, and have it apply to all rows that reference it, with no chance you forget some.
Option 2 is how you will normalize your data.
The problem with option 1 is data redundancy. As opposed to using a few bytes to store an INT, you are using a potentially large value to store the name of the genre. The other problem, as Bill said, is the potential for data inconsistency, since a genre change would require updating many rows instead of a single row in the Genre table.
However, what you have in option 1 is a denormalized version of option 2, which would have some performance benefit; but given the seemingly small size of this database, I wouldn't expect a significant difference.
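A minimal sketch of option 2 (the Movie table is assumed to already exist; types and column names are illustrative):

CREATE TABLE Genre (
    id   INT PRIMARY KEY,
    name VARCHAR(50) NOT NULL UNIQUE  -- stored once; correct a spelling here and every reference follows
);

CREATE TABLE MovieHasGenre (
    movie_id INT NOT NULL REFERENCES Movie (id),
    genre_id INT NOT NULL REFERENCES Genre (id),
    PRIMARY KEY (movie_id, genre_id)  -- one row per movie/genre pair
);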
I'm modelling a tier-list database using PostgreSQL. This is how it works:
A user can create a new Tier List;
A user can add as many tiers as he wants to the list;
A user can add as many items as he wants. Initially, the items are added to an "unranked" section (not assigned to any tier), then the user can rank them as he wants.
Modeling details:
A tier necessarily belongs to a tier_list;
An item can be in multiple tier_lists and in multiple tiers as well;
An item added to a tier_list has not necessarily been added to one of the tiers.
For modelling the relations between item-tier and item-tier_list, I thought about two scenarios:
Creating a junction table with a composite PK/FK of item and tier_list, plus a nullable tier FK. The records with no tier value would be the unranked ones, while the ones with an assigned tier would be the ranked ones;
Creating two M-N relations: one between item and tier, storing ranked items, and another between item and tier_list, storing unranked items.
I feel like the first option would be easier to deal with when having to persist things like moving a product between tiers (or even unranking it), while the second looks more compliant with SQL standards. Am I missing something?
First proposed solution model:
Second proposed solution model:
You can create a composite key using three different fields.
First of all, why use smallint and not int? I'm not fluent in Postgres, but it's usually better to have the biggest integer possible as a primary key (things can grow faster than you expect).
Second, I strongly suggest putting ID_ before, not after, the name of the field used for lookup. It makes it easier to read.
As how to build your tables:
Item
ID PK
Title
Descriptions
I see no problems here. I'd just change the name to tblProducts, for easier reading.
Tier_List
ID PK
Description
Works fine too. Again, I'd look for a better name; I'd call this one tblTiers or tblLeagues instead. Using similar names can bring trouble in 2-3 years, when you have to add things and you're not sure what's what. Better to use distinctive names for the tables.
Tier (suggesting tblTiers or tblRankings)
ID PK
Tier_List_ID PK FK
Title
Description
Here I see a HUGE problem. From experience, I don't really understand why you create a combination key here with ID and Tier_List_ID. Do you need to reuse the same ID for different tiers? If that ID has a meaning, absolutely bring it out of the PK! PKs must be simple counters that will NEVER be changed. I've seen people use an ID that had a meaning for the end user. It was a total disaster! I can't even begin to describe the quantity of garbage data that that DB contained.
I suppose, because you were talking about ranking, that the ID there is a Rank, a level or something like that.
The table should become
ID PK uuid
Tier_List_ID FK
Rank smallint
Title
Description
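As a rough PostgreSQL sketch (the FK and text column types are assumptions):

CREATE TABLE tier (
    id           uuid PRIMARY KEY,  -- simple surrogate key, never reused or changed
    tier_list_id int NOT NULL REFERENCES tier_list (id),
    rank         smallint,          -- the meaningful value, kept out of the PK
    title        text,
    description  text
);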
There's another reason why I'd have you do this: when you have a combined PK, certain DBMSs require you to use the same combined key in the lookup tables, and that can become messy fast!
Now, the lookup table:
tier_list_item (tblRankingLookup?)
ID_Product FK PK
ID_Tier_List FK PK
ID_Tier FK PK
You don't need anything else to make it work smoothly! At least, that's how I'd envision it.
I'd also add an ID_User (because I'm not sure if all users can see all tiers and all rankings, or only their own).
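A sketch of that lookup table (types assumed from the sketches above; note that if unranked items need a NULL ID_Tier, that column would have to be left out of the composite PK):

CREATE TABLE tier_list_item (
    id_product   int  NOT NULL REFERENCES item (id),
    id_tier_list int  NOT NULL REFERENCES tier_list (id),
    id_tier      uuid NOT NULL REFERENCES tier (id),
    PRIMARY KEY (id_product, id_tier_list, id_tier)
);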
Addendum: if you need unique combinations of different elements, I'm pretty sure you can create a combined index and mark it as "unique" (I don't remember the correct syntax, and I'm not sure it is the same in Postgres).
For example, if you want each rank in the Tier table to appear only once per tier_list_ID, you can create an index on tier_list_ID and Rank and mark it unique. This way, two tiers in the same tier_list cannot have the same value for the Rank field (Rank can still be null).
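In PostgreSQL that would look something like this (names taken from the tier sketch above):

CREATE UNIQUE INDEX tier_rank_per_list ON tier (tier_list_id, rank);

By default, rows with a NULL rank are not treated as duplicates of each other, so unranked tiers are unaffected.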
I am trying to make a simple app for chess tournaments, but I have a problem with the database. I have users that participate in a tournament (that's fine), but how do I assign users to a round and a match? Should I make more relations, like user_tournament-round-tournament and user_tournament-match-round?
Please see this answer as food for thought rather than a solution. Your question doesn't contain enough information to fully cover all use cases, so the answer below contains a lot of speculation.
In my oversimplified view, picking up on your initial model, the tournament_competitors table (renamed from user_tournament, as we have competitors rather than users) would create a unique id for each enrolled competitor. This id would be used as a reference in a tournament_matches table, which would link twice to tournament_competitors, since it connects two opponents (constraint warning). The table would also register the match type.
For the match type, I see two possibilities.
The matches table would list all possible match types (final, semi-final, quarter-final, elimination rounds, etc.), and these would be referred to in the tournament_matches table via id (a composite key in the form tournament_id-competitor_id-group_id). This approach, especially for the elimination-round matches, requires finding a way to link the number of competitors in each elimination group with the number of matches each competitor has to go through before they are considered eliminated or not - effectively creating a round number. I see this as business logic, so it doesn't belong in the DB. The group_id also needs to be calculated, and that is best done in the application.
The alternative is to have the various match types in the tournament_matches table as a free field - populated by the application. The tournament structure (Number of Groups, number of opponents in each group, etc.) would be defined as attributes in the tournaments table. In this view there is no need for the rounds table.
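A rough sketch of that second alternative, in PostgreSQL flavour (every name beyond those mentioned above is an assumption):

CREATE TABLE tournament_competitors (
    id            serial PRIMARY KEY,  -- unique id per enrolled competitor
    tournament_id int NOT NULL REFERENCES tournaments (id),
    user_id       int NOT NULL REFERENCES users (id)
);

CREATE TABLE tournament_matches (
    id              serial PRIMARY KEY,
    tournament_id   int NOT NULL REFERENCES tournaments (id),
    competitor_a_id int NOT NULL REFERENCES tournament_competitors (id),
    competitor_b_id int NOT NULL REFERENCES tournament_competitors (id),
    match_type      text,  -- free field, populated by the application
    CHECK (competitor_a_id <> competitor_b_id)  -- the constraint warning: no one plays themselves
);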
I'm not very familiar with relational databases but here is my question.
I have some raw data that's collected as a result of a customer survey. For each customer who participated, there is only one record, uniquely identified by the CustomerId attribute. I believe all other attributes fall under the description of non-prime attributes, as no attribute depends on anything other than the non-composite candidate key. Also, all columns are atomic, as in, none can be split into multiple columns.
For example, the columns are like CustomerId(non-sequential), Race, Weight, Height, Salary, EducationLevel, JobFunction, NumberOfCars, NumberOfChildren, MaritalStatus, GeneralHealth, MentalHealth and I have 100+ columns like this in total.
So, as far as I understand we can't talk about any form of normalization for this kind of dataset, am I correct?
However, given the excessive number of columns, if I wanted to split this monolithic table into tables with fewer columns, i.e. based on some categorisation of columns like demographics, health, employment, etc., is there a specific name for such a structure/approach in the literature? All the tables would still use CustomerId as their primary key.
Yes, this is part of an assignment and as part of a task, it's required to fit this dataset into a relational DB, not a document DB which I don't think would gain anything in this case anyway.
So, there is no direct question as such as I worded above but creating a table with 100+ columns doesn't feel right to me. Therefore, what I am trying to understand is how the theory approaches such blobs. Some concept names or potential ideas for further investigation would be appreciated as I even don't know how to look this up.
In relational databases, putting all the information in one table is not good practice.
As you mentioned, grouping some columns into other tables and joining them all to a master table works well. With this approach you can also manage one-to-many, many-to-one and many-to-many relationships, such as customers with more than one address or phone number.
Another approach is to make a table like customer_properties with columns like property_type and property_value, storing the data as rows (this is known as the entity-attribute-value, or EAV, model).
But the first approach is more effective and the most common. The second would look like this:
customer_id   property_type    property_value
1             num_of_child     3
1             age              22
1             marital_status   Single
...
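A sketch of the first approach, splitting by category while keeping CustomerId as the shared key (the groupings and the master table are assumptions):

CREATE TABLE customer (
    customer_id INT PRIMARY KEY
);

CREATE TABLE customer_demographics (
    customer_id        INT PRIMARY KEY REFERENCES customer (customer_id),
    race               VARCHAR(50),
    marital_status     VARCHAR(20),
    number_of_children INT
);

CREATE TABLE customer_health (
    customer_id    INT PRIMARY KEY REFERENCES customer (customer_id),
    weight         DECIMAL(5,2),
    height         DECIMAL(5,2),
    general_health VARCHAR(30),
    mental_health  VARCHAR(30)
);

Each satellite table stays in a one-to-one relationship with the master through the shared primary key.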
Are there any drawbacks to storing additional data in my join table / junction table?
For example, I am working on a database of trucking companies and I have 3 tables:
Table 1 - company
Table 2 - trailer_type
Table 3 - junction_table
Each company can have more than one trailer type, but I also need a trailer count of each trailer type per company. The most logical place to put the trailer count would seem to be in the junction table with the company.id and trailer_type.id.
Are there any drawbacks to doing it this way, and, if so, is there a better way?
From the way you phrased the question, I think your intuition is mostly correct. You identified the junction table as the place to keep your counts. But you're hesitating, apparently because it's a "junction table".
All tables are created equal. From the point of view of SQL, there are no fact tables, no dimension tables, no junction tables. There are only tables.
A normalized design denotes a minimal key to identify each row. In your case, the natural key of the junction table is something like {company_id, trailer_type_id}. Is there information that's functionally dependent on that key? Why, yes, there is: ntrailers. Thus a column is born.
So don't worry about what kind of table it is. Think about what the row means, and how it's identified. That will keep you on the fairway.
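In SQL terms, that row definition might look like this (all names except ntrailers are taken or assumed from the question):

CREATE TABLE junction_table (
    company_id      INT NOT NULL REFERENCES company (id),
    trailer_type_id INT NOT NULL REFERENCES trailer_type (id),
    ntrailers       INT NOT NULL,              -- functionally dependent on the key below
    PRIMARY KEY (company_id, trailer_type_id)  -- the minimal, natural key of the junction
);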
First, it is fine to store additional information in the junction table. For instance, the date created is often very interesting. And, there are many examples of entities that you can think of as junction tables -- say a billing record that connects a customer to an invoice -- and the records naturally have additional fields.
However, the count is not something that you would store under ordinary circumstances. This type of information should go in the Company table, particularly if it is fixed over time.
If it is not fixed, two options come to mind. The first is a slowly changing dimension where you store the number of trailers for a company along with effective and end dates for the value. A viable alternative is to store the value in the junction table. It is not the first choice, but it might be a good solution under many circumstances.
You may wish to think of your schema a little differently than as a many-to-many relationship between companies and trailers: that's the reason you have a junction table.
How about thinking of it this way:
Companies have a one-to-many relationship with Trailers
Trailers have a many-to-one relationship with TrailerDescriptions.
So, your Trailer table will have these columns:
company_id
trailer_description_id
count
etc.
You might even choose to enumerate the individual trailers, by adding columns and setting "count" to 1.
trailer_id
company_id
trailer_description_id
count = 1
date_placed_in_service
license_plate_number
etc.
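A sketch of that enumerated version (types are assumptions):

CREATE TABLE trailer (
    trailer_id             INT PRIMARY KEY,
    company_id             INT NOT NULL REFERENCES company (id),
    trailer_description_id INT NOT NULL REFERENCES trailer_description (id),
    count                  INT NOT NULL DEFAULT 1,  -- always 1: each row is one physical trailer
    date_placed_in_service DATE,
    license_plate_number   VARCHAR(20)
);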
I am working on a research project, using the IMDb dataset as my source of secondary data. I downloaded the entire database in text format from FTP servers provided by IMDb itself, and used the IMDbPY Python package to compile all of the unsorted information into a relational database. I chose SQLite as my SQL engine, as it seemed like the least cumbersome option thanks to its ability to create locally stored databases. After a bit of poking around and a lot of documentation-reading, I ended up with a 9.04 GB im.db file hosting the entirety of IMDb.
Now I need to isolate my dataset according to my requirements, but due to my lack of experience with SQL I'm finding it difficult to figure out the most optimal way of doing so.
Specifically, I want to look at:
Movies only (i.e. exclude TV series, episodes, etc.);
Produced in the period between 2000-2015, inclusive;
Of feature length (i.e. running time over 40 minutes);
Non-adult (I didn't even know IMDb hosts information on these, but apparently so);
Produced in the USA;
With complete information on crew.
Here's a representation of my database schema. I was confused by some of the database design choices that IMDbPY creators made, but I'm no SQL expert, and this is what I get to work with. Some clarifications:
The title table holds basic information about every instance of films, shows, episodes, and so on: 3,673,485 rows in total. The id column is an auto-incremented primary key, which is referenced as the movie_id foreign key in all other relevant tables. However, it seems that none of the foreign keys in the other tables are indexed properly, so I can't use simple query statements to efficiently get the necessary information just by knowing a particular film's id value.
Running SELECT count(*) FROM title WHERE kind_id=1 AND production_year BETWEEN 2000 AND 2015; tells me that there are 442,135 instances of movies, produced between 2000-2015. So far so good.
The complete_cast and comp_cast_type tables hold info about the completion status of a film's crew/cast list. Since I only need to consider films with complete crew information, I need to isolate only those instances, where (i) movie_id exists in my previous query (i.e. out of the 442,135 movie rows); (ii) subject_id=2; and (iii) status_id=3 or 4.
This is where it gets tricky for me. The movie_info table holds 20 million rows of information about films and TV shows, including runtimes, genres, countries of production, years of production, etc. Basically all of the information that I need to isolate my dataset. Within that table (i) id is an arbitrary auto-incremented primary key; (ii) movie_id refers to the id values from title; (iii) info_type_id refers to one of the 113 types of information as listed in the table info_type; (iv) info holds the actual information, as integers or strings.
For example: Running SELECT id FROM title WHERE title='2001: A Space Odyssey' AND kind_id=1; returns '2484213'. Running SELECT info FROM movie_info WHERE movie_id=2484213 AND info_type_id=1; returns '142, 161, 149', indicating the running times in minutes of the three available versions of the film. Running SELECT info FROM movie_info WHERE movie_id=2484213 AND info_type_id=8 returns 'USA, UK', indicating the countries involved in production. And so on.
Basically I'm trying to create a new table, populated only with films that fall under my requirements, and I'm having a hard time figuring out the most efficient way of doing so. Here's how I translated my requirements into basic SQL syntax:
SELECT * FROM title WHERE kind_id=1 AND production_year BETWEEN 2000 AND 2015;
Then a bunch of requirements from the movie_info table, which cross-references only those instances, where movie_id exists as id in the query above, and (i) info_type_id=1 AND info>40; (ii) info_type_id=3 AND info!='Adult'; (iii) info_type_id=8 AND info='USA';
Finally, I need to make sure that all of the selections exist in the complete_cast table, with subject_id=2 AND status_id IN (3, 4);
I've been reading SQLite documentation, and suspect that I need to use some combination of INNER/LEFT OUTER JOIN, EXISTS and UNION/INTERSECT/EXCEPT statements, but not sure how to approach this exactly. I would like to write this code efficiently, since brute-forcing queries requirement by requirement takes a while for my computer to process. Thank you in advance for your help.
TL;DR. I can't figure out an efficient way of using INNER/LEFT OUTER JOIN, EXISTS and UNION/INTERSECT/EXCEPT statements to help me isolate a smaller dataset in accordance with multiple requirements, to satisfy which I need to cross-query a number of existing tables without properly indexed foreign keys.
Is an inner join with all the required tables too slow for your needs?
You could create tables that just contain the subset data that you need and then run an inner join on those.
So create a table "movie" and insert only those records from "title" with kind_id of 1. Then do something like
SELECT *
FROM movie m
INNER JOIN movie_info mi
    ON m.id = mi.movie_id
INNER JOIN complete_cast cc
    ON m.id = cc.movie_id  -- complete_cast references title ids via movie_id
WHERE
...
Provided your new tables don't have the same volume of data, it should perform better.
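Building on that idea, the whole filter could be expressed in a single statement when populating the subset table. The info_type_id values come from the question; creating indexes on movie_info(movie_id, info_type_id) and complete_cast(movie_id) first should make up for the missing foreign-key indexes. A sketch, untested against the real schema:

CREATE INDEX idx_movie_info_lookup ON movie_info (movie_id, info_type_id);
CREATE INDEX idx_complete_cast_lookup ON complete_cast (movie_id);

CREATE TABLE movie AS
SELECT t.*
FROM title t
WHERE t.kind_id = 1
  AND t.production_year BETWEEN 2000 AND 2015
  -- feature length: at least one listed runtime over 40 minutes
  AND EXISTS (SELECT 1 FROM movie_info mi
              WHERE mi.movie_id = t.id AND mi.info_type_id = 1
                AND CAST(mi.info AS INTEGER) > 40)
  -- non-adult: no 'Adult' genre row
  AND NOT EXISTS (SELECT 1 FROM movie_info mi
                  WHERE mi.movie_id = t.id AND mi.info_type_id = 3
                    AND mi.info = 'Adult')
  -- produced in the USA
  AND EXISTS (SELECT 1 FROM movie_info mi
              WHERE mi.movie_id = t.id AND mi.info_type_id = 8
                AND mi.info = 'USA')
  -- complete crew information
  AND EXISTS (SELECT 1 FROM complete_cast cc
              WHERE cc.movie_id = t.id AND cc.subject_id = 2
                AND cc.status_id IN (3, 4));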