How to handle multivalued fields in a MySQL database for a movie collection? - sql

I have a large number of movies and TV series, which I currently keep track of in an MS Excel worksheet. Due to the large number of records and the variety of data required, it is no longer a convenient option, so I want to switch to a MySQL database, accessed through a GUI programmed in Java using the NetBeans IDE.
I have the following tables in Excel:
Media_Library,
To_Be_Watched,
Statistics,
Wish_List,
Orders
Each film and TV series in my collection is in the Media_Library table, which has the following fields:
Sorting_Title
Title
Collection
Genre
Release_Year
Director
Age_Rating
Country
Runtime (min)
Watched
Media_Type
Format
For example: 'Alien 2', 'Aliens', 'Alien: Anthology', 'Action/Horror/Sci-Fi', 1986, 'James Cameron', 'M', 'America', 137, 'Yes', 'Movie', '4K UHD'
I'm stuck on what to do for the following fields: Genre, Director, Country, Runtime
Those 4 fields can each have multiple values, and I don't know how best to handle that; e.g. most films only have 1 runtime, but many have multiple (2 of the films have 4 different cuts). Also anthology films can have something like 6 different directors. I want to include all relevant genres, directors, countries and runtimes, but I don't know how to best do that.
I've tried adding a column for each value; genre1, genre2, ... This results in many blank values though. In the spreadsheet in Excel I put all applicable genres in a single field as one string, e.g. 'comedy/horror'.
What would be the easiest way to resolve this issue? Can I do a many-to-many relationship to achieve what I want?

Simply put a hard limit on the number of genres.
For instance, while you may want the user to be able to enter as many genres as they want, is it rational to go above 20 genres? That doesn't make much sense and will only make searches much more time intensive.
For other possible duplicates, you can do something like this (in sqlite3 at least):
CREATE TABLE IF NOT EXISTS Directors
(id INTEGER PRIMARY KEY,
director TEXT,
UNIQUE(director) ON CONFLICT IGNORE);
CREATE TABLE IF NOT EXISTS file
(file_id INTEGER PRIMARY KEY,
filename TEXT,
director_id INTEGER,
watched INTEGER,
FOREIGN KEY (director_id)
REFERENCES Directors (id)
ON UPDATE CASCADE
ON DELETE SET NULL);
It doesn't matter if more than one file has the same director, as long as the 'file' table knows which one it's referencing and stays updated.
The 'watched' column holds a type of value that doesn't make sense to create an individual table for. For instance, say a song's track number is 2. Creating a table just for track numbers to reference doesn't make sense because you're going to spend a point in that table, then spend another point in the 'file' table to reference. So, just spend 1 point and put in the 'file'.
https://www.sqlitetutorial.net/sqlite-foreign-key/
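To see what the two clauses in the tables above actually do, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The file name 'aliens.mkv' is made up for illustration, and note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ignores FOREIGN KEY clauses unless this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Directors (
    id INTEGER PRIMARY KEY,
    director TEXT,
    UNIQUE (director) ON CONFLICT IGNORE
);
CREATE TABLE file (
    file_id INTEGER PRIMARY KEY,
    filename TEXT,
    director_id INTEGER,
    watched INTEGER,
    FOREIGN KEY (director_id) REFERENCES Directors (id)
        ON UPDATE CASCADE
        ON DELETE SET NULL
);
""")

# The second insert violates UNIQUE(director) and is silently skipped.
conn.execute("INSERT INTO Directors (director) VALUES ('James Cameron')")
conn.execute("INSERT INTO Directors (director) VALUES ('James Cameron')")
director_count = conn.execute("SELECT COUNT(*) FROM Directors").fetchone()[0]

director_id = conn.execute(
    "SELECT id FROM Directors WHERE director = ?", ("James Cameron",)
).fetchone()[0]
# 'aliens.mkv' is a hypothetical file name used only for this demo.
conn.execute(
    "INSERT INTO file (filename, director_id, watched) VALUES (?, ?, 1)",
    ("aliens.mkv", director_id),
)

# Deleting the director leaves the file row in place with director_id set to NULL.
conn.execute("DELETE FROM Directors WHERE id = ?", (director_id,))
orphaned_director_id = conn.execute("SELECT director_id FROM file").fetchone()[0]

print(director_count, orphaned_director_id)
```

Keep in mind that this layout still allows only one director per file; it removes duplicate director names but does not by itself solve the multiple-directors case.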

Generally, you would add a second table, Directors, for instance, and then you relate that back to the movie title. You will need a uniqueID for the movie, and you do a join where that uniqueID is referenced in the Directors table, something like this working demo (not all fields were included in my demo):
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=9e8d73bc798767b56b974ab4ebd30517
SELECT m.id, m.title, d.director
FROM Media_Library m
JOIN Directors d ON (m.id = d.mediaID);
or to concatenate the directors:
SELECT m.id, m.title, group_concat(d.director) as directors
FROM Media_Library m
JOIN Directors d ON (m.id = d.mediaID)
GROUP BY m.id;
Usually when you make this kind of relationship you will define a foreign key constraint, creating a link between the primary key of one table and a key (or keys) in another. In this case, the link is between id in Media_Library and mediaID in Directors, so you would alter the create statement like this:
CREATE TABLE Directors (
id int not null auto_increment,
mediaID int,
director varchar(50),
PRIMARY KEY(id),
FOREIGN KEY (mediaID) REFERENCES Media_Library(id)
);
The foreign key is not strictly necessary, but it can reinforce database integrity. The ins and outs of foreign keys are out of scope for this answer, but you should probably read about them.
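The Directors table above repeats a director's name for every film. For genres, where the same value is shared by many films, the many-to-many variant the question asks about might look like the sketch below. It uses Python's sqlite3 module as a stand-in for MySQL, and the Genres and Media_Genres table names are my own assumptions, not part of the fiddle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Media_Library (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- Hypothetical lookup table: each genre is stored exactly once.
CREATE TABLE Genres (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Hypothetical junction table: one row per (film, genre) pair.
CREATE TABLE Media_Genres (
    mediaID INTEGER NOT NULL REFERENCES Media_Library (id),
    genreID INTEGER NOT NULL REFERENCES Genres (id),
    PRIMARY KEY (mediaID, genreID)
);
INSERT INTO Media_Library (id, title) VALUES (1, 'Aliens');
INSERT INTO Genres (id, name) VALUES (1, 'Action'), (2, 'Horror'), (3, 'Sci-Fi');
INSERT INTO Media_Genres (mediaID, genreID) VALUES (1, 1), (1, 2), (1, 3);
""")

# All genres of one film.
genres = [
    row[0]
    for row in conn.execute("""
        SELECT g.name
        FROM Media_Library m
        JOIN Media_Genres mg ON mg.mediaID = m.id
        JOIN Genres g ON g.id = mg.genreID
        WHERE m.title = 'Aliens'
        ORDER BY g.name
    """)
]

# All films of one genre, which is awkward with 'Action/Horror/Sci-Fi' strings.
horror_titles = [
    row[0]
    for row in conn.execute("""
        SELECT m.title
        FROM Genres g
        JOIN Media_Genres mg ON mg.genreID = g.id
        JOIN Media_Library m ON m.id = mg.mediaID
        WHERE g.name = 'Horror'
    """)
]

print(genres, horror_titles)
```

The same three-table shape should carry over to directors and countries; runtimes are different in that they belong to one film only, so a plain child table like Directors above is probably enough for them.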
It is also possible to store the data in a JSON field since v5.7, like this:
CREATE TABLE test.Media_Library (id int not null auto_increment, title varchar (50), director JSON, PRIMARY KEY (id));
INSERT INTO test.Media_Library (title, director) VALUES
('Alien', json_array('Scott', 'Scorsese')),
('The Alienist', '["Tarantino", "Nolan", "Kubrick"]'),
('Alien 2', '["Scott"]');
SELECT * FROM test.Media_Library;
https://www.db-fiddle.com/f/tG1SZorEHEYi5cYwgXjPeY/1
In the second query in that fiddle, I select only the first director from the list:
SELECT id, title, director->>'$[0]' as firstDirector
FROM Media_Library;
There are advantages to storing data this way, but there are tradeoffs, and unless you know what you are doing or you have a specific reason to be using JSON fields (for instance, you are getting the data from an api as JSON and you just want to use it as is), I would stick with the join method. Also, storing arrays is inherently non-normal (read about database normalization, a quick overview on wiki: https://en.wikipedia.org/wiki/Database_normalization).

Related

Benefits of using an autogenerated primary key instead of a constant unique name

I've heard that having an autogenerated primary key is a convention. However, I'm trying to understand its benefits in the following scenario:
CREATE TABLE countries
(
countryID int(11) PRIMARY KEY AUTO_INCREMENT,
countryName varchar(128) NOT NULL
);
CREATE TABLE students
(
studentID int(11) PRIMARY KEY AUTO_INCREMENT,
studentName varchar(128) NOT NULL,
countryOfOrigin int(11) NOT NULL,
FOREIGN KEY (countryOfOrigin) REFERENCES countries (countryID)
);
INSERT INTO countries (countryName)
VALUES ('Germany'), ('Sweden'), ('Italy'), ('China');
If I want to insert something into the students table, I need to lookup the countryIDs in the countries table:
INSERT INTO students (studentName, countryOfOrigin)
VALUES ('Benjamin Schmidt', (SELECT countryID FROM countries WHERE countryName = 'Germany')),
('Erik Jakobsson', (SELECT countryID FROM countries WHERE countryName = 'Sweden')),
('Manuel Verdi', (SELECT countryID FROM countries WHERE countryName = 'Italy')),
('Min Lin', (SELECT countryID FROM countries WHERE countryName = 'China'));
In a different scenario, as I know that the countryNames in the countries table are unique and not null, I could do the following:
CREATE TABLE countries2
(
countryName varchar(128) PRIMARY KEY
);
CREATE TABLE students2
(
studentID int(11) PRIMARY KEY AUTO_INCREMENT,
studentName varchar(128) NOT NULL,
countryOfOrigin varchar(128) NOT NULL,
FOREIGN KEY (countryOfOrigin) REFERENCES countries2 (countryName)
);
INSERT INTO countries2 (countryName)
VALUES ('Germany'), ('Sweden'), ('Italy'), ('China');
Now, inserting data into the students2 table is simpler:
INSERT INTO students2 (studentName, countryOfOrigin)
VALUES ('Benjamin Schmidt', 'Germany'),
('Erik Jakobsson', 'Sweden'),
('Manuel Verdi', 'Italy'),
('Min Lin', 'China');
So why should the first option be the better one, given that countryNames are unique and are never going to change?
There are two aspects involved here:
natural keys vs. surrogate keys
autoincremented values
You are wondering why you have to deal with some arbitrary number for a country, when a country can be uniquely identified by its name. Well, imagine you use the country names in several tables to relate rows to each other. Then at some point you are told that you misspelled a country. You want to correct this, but have to do this in every table the country occurs in. In big databases you usually don't have cascading updates in order to avoid updates that unexpectedly take hours instead of mere minutes or seconds. So you must do this manually, but the foreign key constraints get in your way. You cannot change the parent table's key, because there are child tables using this, and you cannot change the key in the child tables first, because that key has to exist in the parent table. You'll have to work with a new row in the parent table and start from there. Quite some task. And even if you have no spelling issue, at some point someone might say "we need the official country names; you have China, but it must be the People's Republic of China instead" and again you must look up and change that country in all those tables. And what about partial backups? A table gets totally messed up due to some programming error and must be replaced by last week's backup, because this is the best you have. But suddenly some keys don't match any more. You never want a table's key to change.
You say "country names are unique and are never going to change". Think again :-)
It is easier to have your database use a technical arbitrary ID instead. Then the country name only exists in the country table. And if that name must get changed, you change it just in that one place, and all relations stay intact. This, however, doesn't mean that natural keys are worse than technical IDs. They are not. But it's more difficult with them to set up a database correctly. In case of countries a good natural key would be a country ISO code, because this uniquely identifies a country and doesn't change. This would be my choice here.
With students it's the same. Students usually have a student number or student ID in the real world, so you can simply use this number to uniquely identify a student in the database. But then, how do we get these unique student IDs? At a large university, two secretaries may want to enter new students at the same time. They ask the system what the last student's ID was. It was #11223, so they both want to issue #11224, which causes a conflict of course, because only one student can be given that number. In order to avoid this, DBMS offer sequences of which numbers are taken. Thus one of the secretaries pulls #11224 and the other #11225. Auto-incremented IDs work this way. Both secretaries enter their new student, the rows get inserted into the student table and result in the two different IDs that get reported back to the secretaries. This makes sequences and auto incrementing IDs a great and safe tool to work with.
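The renaming argument can be tried out in a few lines. This is only a sketch using Python's sqlite3 module rather than MySQL, with the question's first schema reduced to the columns that matter; the point is that the rename touches one row in one table and every relation stays intact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE countries (
    countryID INTEGER PRIMARY KEY AUTOINCREMENT,
    countryName TEXT NOT NULL UNIQUE
);
CREATE TABLE students (
    studentID INTEGER PRIMARY KEY AUTOINCREMENT,
    studentName TEXT NOT NULL,
    countryOfOrigin INTEGER NOT NULL REFERENCES countries (countryID)
);
INSERT INTO countries (countryName) VALUES ('Germany'), ('China');
INSERT INTO students (studentName, countryOfOrigin)
VALUES ('Min Lin', (SELECT countryID FROM countries WHERE countryName = 'China'));
""")

# Rename the country: the surrogate key stays the same, so no child row changes.
cursor = conn.execute(
    "UPDATE countries SET countryName = ? WHERE countryName = ?",
    ("People's Republic of China", "China"),
)
rows_changed = cursor.rowcount

student_rows = conn.execute("""
    SELECT s.studentName, c.countryName
    FROM students s
    JOIN countries c ON c.countryID = s.countryOfOrigin
""").fetchall()

print(rows_changed, student_rows)
```

With the name itself as the key (the countries2/students2 variant), the same rename would have to reach every referencing table, either by cascading updates or by hand.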
Convention can be a useful guide. It isn't necessarily the best option in all situations.
There are usually tradeoffs involved, like space, convenience, etc.
While you showed one method of resolving / inserting the proper country key value, there's a slightly less wordy option supported by standard SQL (and many databases).
INSERT INTO students (studentName, countryOfOrigin)
WITH list (name, country) AS (
SELECT *
FROM (
VALUES ('Benjamin Schmidt', 'Germany')
, ('Erik Jakobsson', 'Sweden')
, ('Manuel Verdi', 'Italy')
, ('Min Lin', 'China')
) AS x
)
SELECT name, countryID
FROM list AS l
JOIN countries AS c
ON c.countryName = l.country
;
and a little less wordy again:
INSERT INTO students (studentName, countryOfOrigin)
WITH list (name, country) AS (
VALUES ('Benjamin Schmidt', 'Germany')
, ('Erik Jakobsson', 'Sweden')
, ('Manuel Verdi', 'Italy')
, ('Min Lin', 'China')
)
SELECT name, countryID
FROM list AS l
JOIN countries AS c
ON c.countryName = l.country
;
Here's a test case with MariaDB 10.5:
Working test case (updated)

Best database design for multiple entity types

I'm working on a web app and I have to design its database. There's a part that wasn't very straightforward to me, so after some thinking and research I came up with multiple ideas. Still, none seems completely suitable, so I'm not sure which one to implement and why.
The simplified problem looks as follows:
I have a table Teacher. There are 2 types of teachers, according to the relations with their Fields and Subjects:
A Teacher that's related to a Field; the Field is necessarily related to a Category
A Teacher that's not related to a Field, but directly to a Category
My initial idea was to have two nullable foreign keys, one to the table Field, and the other to the table Category. But in this case, how can I make sure that exactly one is null, and the other one is not?
The other idea is to create a hierarchy, with two types of Teacher tables derived from the table Teacher (is-a relation), but I couldn't find any useful tutorial on this.
I'm developing the app using Django with SQLite db
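For the first idea (two nullable foreign keys), one common way to guarantee that exactly one of them is set is a CHECK constraint on the table. Below is a minimal sketch with Python's sqlite3 module; the column names and sample rows are assumptions for illustration, and this is only one possible answer to the "exactly one is null" part, not a recommendation over the other designs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Category (CategoryID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Field (
    FieldID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    CategoryID INTEGER NOT NULL REFERENCES Category (CategoryID)
);
CREATE TABLE Teacher (
    TeacherID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    FieldID INTEGER REFERENCES Field (FieldID),
    CategoryID INTEGER REFERENCES Category (CategoryID),
    -- True only when exactly one of the two foreign keys is NULL.
    CHECK ((FieldID IS NULL) <> (CategoryID IS NULL))
);
INSERT INTO Category VALUES (1, 'Science');
INSERT INTO Field VALUES (1, 'Physics', 1);
""")

# Valid rows: either a field or a category, never both, never neither.
conn.execute("INSERT INTO Teacher (Name, FieldID, CategoryID) VALUES ('A', 1, NULL)")
conn.execute("INSERT INTO Teacher (Name, FieldID, CategoryID) VALUES ('B', NULL, 1)")

rejected = []
for field_id, category_id in [(1, 1), (None, None)]:
    try:
        conn.execute(
            "INSERT INTO Teacher (Name, FieldID, CategoryID) VALUES ('C', ?, ?)",
            (field_id, category_id),
        )
    except sqlite3.IntegrityError:
        # Both set or both NULL violates the CHECK constraint.
        rejected.append((field_id, category_id))

teacher_count = conn.execute("SELECT COUNT(*) FROM Teacher").fetchone()[0]
print(rejected, teacher_count)
```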
OK, your comment made it much clearer:
If a Teacher belongs to exactly one category, you should keep this in the Teacher's table directly:
Secondly, each teacher belongs to zero or one fields. If you are sure this will always hold, use a nullable FieldID column, which is either set or left NULL.
Category (CategoryID, Name, ...)
Field (FieldID,Name,...)
Teacher (TeacherID,FieldID [NULL FK],CategoryID [NOT NULL FK], FirstName, Lastname, ...)
Remark: This is almost the same as my mapping table of the last answer. The only difference is, that you'll have a strict limitation with your "exactly one" or "exactly none or one"... From my experience I'd still prefer the open approach. It is easy to enforce your rules with unique indexes including the TeacherID-column. Sooner or later you'll probably have to re-structure this...
As you continue, one category is related to "zero or more" fields. There are two approaches:
Add a CategoryID-column to the Field-table (NOT NULL FK). This way you define a field several times with differing CategoryIDs (combined unique index!). A category's fields list you'll get simply by asking the Field-table for all fields with the given CategoryID.
Better, in my eyes, would be a mapping table CategoryField. If you enforce a unique FieldID, you can be sure that no field is mapped twice. And add a unique index on the combination of CategoryID and FieldID...
A SELECT could be something like this (SQL Server Syntax, untested):
SELECT Teacher.TeacherID
,Teacher.FieldID --might be NULL
,Teacher.CategoryID --never NULL
,Teacher.[... Other columns ...]
,Field.Name --might be NULL
--The following columns you pick from the right source,
--depending on the return value of the LEFT JOIN to Field and the related "catField"
--the directly joined "Category" (which is never NULL) is the "default"
,ISNULL(catField.CategoryID,Category.CategoryID) AS ResolvedCategoryID
,ISNULL(catField.Name,Category.Name) AS ResolvedCategoryName
,[... Other columns ...]
FROM Teacher
INNER JOIN Category ON Teacher.CategoryID=Category.CategoryID --never NULL
LEFT JOIN Field ON Teacher.FieldID=Field.FieldID --might be NULL
LEFT JOIN Category AS catField ON Field.CategoryID=catField.CategoryID
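Since the SELECT above is untested SQL Server syntax, here is a small runnable sketch of the same idea using Python's sqlite3 module. It assumes the first approach (a CategoryID column on Field), uses COALESCE in place of SQL Server's ISNULL, and the sample teachers are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Category (CategoryID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Field (
    FieldID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    CategoryID INTEGER NOT NULL REFERENCES Category (CategoryID)
);
CREATE TABLE Teacher (
    TeacherID INTEGER PRIMARY KEY,
    LastName TEXT NOT NULL,
    FieldID INTEGER REFERENCES Field (FieldID),                   -- might be NULL
    CategoryID INTEGER NOT NULL REFERENCES Category (CategoryID)  -- never NULL
);
-- Made-up sample data.
INSERT INTO Category VALUES (1, 'Science'), (2, 'Arts');
INSERT INTO Field VALUES (10, 'Physics', 1);
INSERT INTO Teacher VALUES (100, 'Curie', 10, 1), (101, 'Monet', NULL, 2);
""")

rows = conn.execute("""
    SELECT Teacher.LastName,
           Field.Name,
           -- Prefer the category reached through the field, else the direct one.
           COALESCE(catField.Name, Category.Name) AS ResolvedCategoryName
    FROM Teacher
    INNER JOIN Category ON Teacher.CategoryID = Category.CategoryID
    LEFT JOIN Field ON Teacher.FieldID = Field.FieldID
    LEFT JOIN Category AS catField ON Field.CategoryID = catField.CategoryID
    ORDER BY Teacher.TeacherID
""").fetchall()

print(rows)
```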
This was the answer before the EDIT:
I try to help you even if the concept is not absolutely clear to me
Teacher-Table: TeacherID, person's data (name, address...), ...
Category-Table: CategoryID, category title, ...
Field-Table: FieldID, field title, ...
You say, that fields are bound to a category in all cases. If this is the same category in all cases, you should set the category as a FK-column in the Field-Table. If there is the slightest chance, that a field's category could differ with the context, you should not...
Same with teachers: if a teacher is always bound to one single category, set a FK-column within the Teacher-table; otherwise don't.
You'll be most flexible with at least one mapping table:
(SQL Server Syntax)
CREATE TABLE TeacherFieldCategory
(
--A primary key to identify this row. This is not needed actually, but it will serve as clustered key index as a lookup index...
TeacherFieldCategoryID INT IDENTITY NOT NULL CONSTRAINT PK_TeacherFieldCategory PRIMARY KEY
--Must be set
,TeacherID INT NOT NULL CONSTRAINT FK_TeacherFieldCategory_TeacherID FOREIGN KEY REFERENCES Teacher(TeacherID)
--Field may be left NULL
,FieldID INT NULL CONSTRAINT FK_TeacherFieldCategory_FieldID FOREIGN KEY REFERENCES Field(FieldID)
--Must be set. This makes sure that a teacher always has a category and - if the field is set - the field will have a category
,CategoryID INT NOT NULL CONSTRAINT FK_TeacherFieldCategory_CategoryID FOREIGN KEY REFERENCES Category(CategoryID)
);
--This unique index will ensure, that each combination will exist only once.
CREATE UNIQUE INDEX IX_TeacherFieldCategory_UniqueCombination ON TeacherFieldCategory(TeacherID,FieldID,CategoryID);
It could be a better concept to have a mapping table FieldCategory and this table mapped to the mapping table above through a foreign key. Doing so you could avoid invalid field-category combinations.
Hope this helps...

How to construct a Junction Table for Many-to-Many relationship without breaking Normal Form

I have these two tables, Company and Owner.
Right now they are both in Normal Form, but I need to create a Many-to-Many relationship between them, since one Company can have many Owners and one Owner can have many Companies.
I previously asked whether adding an array of CompanyOwners (with Owner UUIDs) to Companies would break Normal Form. The answer was that it would, and I gathered that a Junction Table could be used instead (see thread).
My question is as follows: will the creation of an additional Junction Table, as shown below, break Normal Form?
-- This is the junction table.
CREATE TABLE CompanyOwners(
Connection-ID UUID NOT NULL, // Just the ID (PK) of the relationship.
Company-ID UUID NOT NULL REFERENCES Company (Company-ID),
Owner-ID UUID NOT NULL REFERENCES Owner (Owner-ID),
CONSTRAINT "CompanyOwners" PRIMARY KEY ("Connection-ID")
)
Your structure allows duplicate data. For example, it allows data like this. (UUIDs abbreviated to prevent horizontal scrolling.)
Connection_id Company_id Owner_id
--
b56f5dc4...af5762ad2f86 4d34cd58...a4a529eefd65 3737dd70...a359346a13b3
0778038c...ad9525bd6099 4d34cd58...a4a529eefd65 3737dd70...a359346a13b3
8632c51e...1876f6d2ebd7 4d34cd58...a4a529eefd65 3737dd70...a359346a13b3
Each row in a relation should have a distinct meaning. This table allows millions of rows that mean the same thing.
Something along these lines is better. It's in 5NF.
CREATE TABLE CompanyOwners(
Company_ID UUID NOT NULL references Company (Company_ID),
Owner_ID UUID NOT NULL references Owner (Owner_ID),
PRIMARY KEY (Company_ID, Owner_ID)
);
Standard SQL doesn't allow "-" in identifiers.
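A quick way to convince yourself that the composite primary key rules out the duplicate rows shown earlier is to try inserting the same pair twice. This sketch uses Python's sqlite3 module and stores the UUIDs as TEXT, which is an assumption of the demo (SQLite has no native UUID type).

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Company (Company_ID TEXT PRIMARY KEY);
CREATE TABLE Owner (Owner_ID TEXT PRIMARY KEY);
CREATE TABLE CompanyOwners (
    Company_ID TEXT NOT NULL REFERENCES Company (Company_ID),
    Owner_ID TEXT NOT NULL REFERENCES Owner (Owner_ID),
    PRIMARY KEY (Company_ID, Owner_ID)
);
""")

company_id = str(uuid.uuid4())
owner_id = str(uuid.uuid4())
conn.execute("INSERT INTO Company VALUES (?)", (company_id,))
conn.execute("INSERT INTO Owner VALUES (?)", (owner_id,))
conn.execute("INSERT INTO CompanyOwners VALUES (?, ?)", (company_id, owner_id))

try:
    # Same (company, owner) pair again: the composite primary key forbids it.
    conn.execute("INSERT INTO CompanyOwners VALUES (?, ?)", (company_id, owner_id))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

pair_count = conn.execute("SELECT COUNT(*) FROM CompanyOwners").fetchone()[0]
print(duplicate_rejected, pair_count)
```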
This is fine as it is, but you could add a couple more columns, like
DateOwned Datetime --<-- when the owner bought the company
DateSold Datetime --<-- when the owner sold the company
After all, you would want to know things like whether the company is still owned by the same owner, and keep track of the company's ownership history etc.

How To Create a Complex Table in sqlServer?

Let's say I have a table called Employees, and each employee has a primary key called E_ID.
I have another table called Positions, and each position has a primary key called P_ID.
I also have another table called Offices, and each office has an ID called O_ID.
Now I want to create a table whose primary key is made up of three columns: E_ID, P_ID and O_ID.
Of course, these three values must be drawn from the first three tables, but I can't work out how to do it.
Please help me, because I need it badly.
Thanks very much.
If it was me, I think I'd just add P_ID and O_ID to Employees. The same Position might be filled by multiple employees, and there might be multiple Employees at a given Office, but it's unlikely (without using Cloning technology) that the same Employee would need to be replicated multiple times - thus, just add P_ID and O_ID to Employee and I think you're good to go. Of course, you'll need foreign key constraints from Employee to Position (P_ID) and Office (O_ID).
EDIT: After some thought, and recalling that I've had jobs where I filled multiple positions (although at the same location), I suppose it's conceivable that a single person might fill multiple positions, which might be at different locations.
If you're really set on having a junction table between Employees, Positions, and Offices - OK, create a table called EmployeePositionOffice (or something like that) which contains the three columns E_ID, P_ID, and O_ID. The primary key should be (E_ID, P_ID, O_ID), and each field should be foreign-keyed to the related base table.
EDIT:
Not sure about the SQL Server syntax, but in Oracle the first would be something like:
ALTER TABLE EMPLOYEES
ADD (P_ID NUMBER REFERENCES POSITIONS(P_ID),
O_ID NUMBER REFERENCES OFFICES(O_ID));
while the second would be something like
CREATE TABLE EMPLOYEES_POSITIONS_OFFICES
(E_ID NUMBER REFERENCES EMPLOYEES(E_ID),
P_ID NUMBER REFERENCES POSITIONS(P_ID),
O_ID NUMBER REFERENCES OFFICES(O_ID),
PRIMARY KEY (E_ID, P_ID, O_ID));
Share and enjoy.

Polymorphism in SQL database tables?

I currently have multiple tables in my database which consist of the same 'basic fields' like:
name character varying(100),
description text,
url character varying(255)
But I have multiple specializations of that basic table, which is for example that tv_series has the fields season, episode, airing, while the movies table has release_date, budget etc.
Now at first this is not a problem, but I want to create a second table, called linkgroups with a Foreign Key to these specialized tables. That means I would somehow have to normalize it within itself.
One way of solving this I have heard of is to normalize it with a key-value-pair-table, but I do not like that idea since it is kind of a 'database-within-a-database' scheme, I do not have a way to require certain keys/fields nor require a special type, and it would be a huge pain to fetch and order the data later.
So I am looking for a way now to 'share' a Primary Key between multiple tables or even better: a way to normalize it by having a general table and multiple specialized tables.
Right, the problem is you want only one object of one sub-type to reference any given row of the parent class. Starting from the example given by @Jay S, try this:
create table media_types (
media_type int primary key,
media_name varchar(20)
);
insert into media_types (media_type, media_name) values
(2, 'TV series'),
(3, 'movie');
create table media (
media_id int not null,
media_type int not null,
name varchar(100),
description text,
url varchar(255),
primary key (media_id),
unique key (media_id, media_type),
foreign key (media_type)
references media_types (media_type)
);
create table tv_series (
media_id int primary key,
media_type int check (media_type = 2),
season int,
episode int,
airing date,
foreign key (media_id, media_type)
references media (media_id, media_type)
);
create table movies (
media_id int primary key,
media_type int check (media_type = 3),
release_date date,
budget numeric(9,2),
foreign key (media_id, media_type)
references media (media_id, media_type)
);
This is an example of the disjoint subtypes mentioned by @mike g.
Re comments by @Countably Infinite and @Peter:
INSERT to two tables would require two insert statements. But that's also true in SQL any time you have child tables. It's an ordinary thing to do.
UPDATE may require two statements, but some brands of RDBMS support multi-table UPDATE with JOIN syntax, so you can do it in one statement.
When querying data, you can do it simply by querying the media table if you only need information about the common columns:
SELECT name, url FROM media WHERE media_id = ?
If you know you are querying a movie, you can get movie-specific information with a single join:
SELECT m.name, v.release_date
FROM media AS m
INNER JOIN movies AS v USING (media_id)
WHERE m.media_id = ?
If you want information for a given media entry, and you don't know what type it is, you'd have to join to all your subtype tables, knowing that only one such subtype table will match:
SELECT m.name, t.episode, v.release_date
FROM media AS m
LEFT OUTER JOIN tv_series AS t USING (media_id)
LEFT OUTER JOIN movies AS v USING (media_id)
WHERE m.media_id = ?
If the given media is a movie, then all columns in t.* will be NULL.
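To check that the composite foreign key plus CHECK really keeps a media row from being claimed by the wrong subtype, here is a cut-down sketch using Python's sqlite3 module. Two things are my own additions and not part of the schema above: the subtype tables declare media_type as NOT NULL with a DEFAULT, so the type cannot be left NULL (a NULL would slip past both the CHECK and the foreign key), and dates are stored as TEXT because SQLite has no date type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE media_types (media_type INTEGER PRIMARY KEY, media_name TEXT);
INSERT INTO media_types VALUES (2, 'TV series'), (3, 'movie');
CREATE TABLE media (
    media_id INTEGER PRIMARY KEY,
    media_type INTEGER NOT NULL REFERENCES media_types (media_type),
    name TEXT,
    UNIQUE (media_id, media_type)
);
CREATE TABLE tv_series (
    media_id INTEGER PRIMARY KEY,
    -- NOT NULL DEFAULT is an assumption added for this sketch.
    media_type INTEGER NOT NULL DEFAULT 2 CHECK (media_type = 2),
    season INTEGER,
    episode INTEGER,
    FOREIGN KEY (media_id, media_type) REFERENCES media (media_id, media_type)
);
CREATE TABLE movies (
    media_id INTEGER PRIMARY KEY,
    media_type INTEGER NOT NULL DEFAULT 3 CHECK (media_type = 3),
    release_date TEXT,
    FOREIGN KEY (media_id, media_type) REFERENCES media (media_id, media_type)
);
INSERT INTO media VALUES (1, 3, 'Aliens');
INSERT INTO movies (media_id, release_date) VALUES (1, '1986-07-18');
""")

try:
    # media 1 is a movie (type 3), so a tv_series row (type 2) has no parent row.
    conn.execute("INSERT INTO tv_series (media_id, season, episode) VALUES (1, 1, 1)")
    wrong_subtype_rejected = False
except sqlite3.IntegrityError:
    wrong_subtype_rejected = True

# Type unknown: join every subtype table, only one of them will match.
row = conn.execute("""
    SELECT m.name, t.episode, v.release_date
    FROM media AS m
    LEFT OUTER JOIN tv_series AS t ON t.media_id = m.media_id
    LEFT OUTER JOIN movies AS v ON v.media_id = m.media_id
    WHERE m.media_id = ?
""", (1,)).fetchone()

print(wrong_subtype_rejected, row)
```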
Consider using a main basic data table with tables extending off of it with specialized information.
Ex.
basic_data
id int,
name character varying(100),
description text,
url character varying(255)
tv_series
id int,
BDID int, --foreign key to basic_data
season,
episode,
airing
movies
id int,
BDID int, --foreign key to basic_data
release_date,
budget
What you are looking for is called 'disjoint subtypes' in the relational world. They are not supported in sql at the language level, but can be more or less implemented on top of sql.
You could create one table with the main fields plus a uid then extension tables with the same uid for each specific case. To query these like separate tables you could create views.
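A rough sketch of that view idea, using Python's sqlite3 module: the movies_full view name and the sample rows are invented for illustration, and only one extension table is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Main table with the shared fields plus a uid.
CREATE TABLE media (uid INTEGER PRIMARY KEY, name TEXT, url TEXT);
-- Extension table reusing the same uid.
CREATE TABLE movies (
    uid INTEGER PRIMARY KEY REFERENCES media (uid),
    release_date TEXT,
    budget NUMERIC
);
-- Hypothetical view that makes a movie look like one wide table.
CREATE VIEW movies_full AS
    SELECT m.uid, m.name, m.url, v.release_date, v.budget
    FROM media AS m
    JOIN movies AS v ON v.uid = m.uid;
INSERT INTO media VALUES (1, 'Aliens', NULL), (2, 'Some TV series', NULL);
INSERT INTO movies VALUES (1, '1986-07-18', NULL);
""")

# The view only returns rows that have a movies extension row.
view_rows = conn.execute(
    "SELECT uid, name, release_date FROM movies_full"
).fetchall()
print(view_rows)
```

Reading through the view is straightforward; whether you can also write through it depends on the database, so inserts would probably still go to the two base tables.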
Using the disjoint subtype approach suggested by Bill Karwin, how would you do INSERTs and UPDATEs without having to do it in two steps?
For getting data, I can introduce a view that joins and selects based on a specific media_type, but AFAIK I can't update or insert into that view because it affects multiple tables (I am talking about MS SQL Server here). Can this be done without doing two operations, and without a stored procedure, naturally?
Thanks
The question is quite old, but for modern PostgreSQL versions it's also worth considering the json/jsonb/hstore types.
For example:
create table some_table (
name character varying(100),
description text,
url character varying(255),
additional_data json
);