Matching a delimited string to table rows - SQL

So I have two tables in this simplified example: People and Houses. People can own multiple houses, so I have a People.Houses field, which is a string with comma delimiters (e.g. "House1, House2, House4"). Houses can have multiple people in them, so I have a Houses.People field, which works the same way ("Sam, Samantha, Daren").
I want to find all the rows in the People table corresponding to the names of the people in a given house, and vice versa for the houses belonging to a given person. But I can't figure out how to do that.
This is as close as I've come up with so far:
SELECT People.*
FROM Houses
LEFT JOIN People ON Houses.People Like CONCAT(CONCAT('%', People.Name), '%')
WHERE Houses.Name = 'SomeArbitraryHouseImInterestedIn'
But I get some false positives (e.g. Sam and Samantha might both be matched when I just want Samantha, and likewise with House3, House34, and House343 when I want House343).
I thought I might try to write a SplitString function so I could split a string (using a list of delimiters) into a set and run a subquery against that set, but MySQL functions can't return tables.
Likewise you can't store arrays as fields, and from what I gather comma-delimited elements in a long string seem to be the usual way to approach this problem.
I can think of some different ways to get what I want but I'm wondering if there isn't a nice solution.

Likewise you can't store arrays as fields, and from what I gather comma-delimited elements in a long string seem to be the usual way to approach this problem.
I hope that's not true. Representing "arrays" in SQL databases shouldn't be done with comma-delimited strings; the problem is correctly solved with a junction table. Comma-separated fields have no place in relational databases, and they violate the very first normal form.
You'd want your table schema to look something like this:
CREATE TABLE people (
id int NOT NULL,
name varchar(50),
PRIMARY KEY (id)
) ENGINE=INNODB;
CREATE TABLE houses (
id int NOT NULL,
name varchar(50),
PRIMARY KEY (id)
) ENGINE=INNODB;
CREATE TABLE people_houses (
house_id int,
person_id int,
PRIMARY KEY (house_id, person_id),
FOREIGN KEY (house_id) REFERENCES houses (id),
FOREIGN KEY (person_id) REFERENCES people (id)
) ENGINE=INNODB;
Then searching for people will be as easy as this:
SELECT p.*
FROM houses h
JOIN people_houses ph ON ph.house_id = h.id
JOIN people p ON p.id = ph.person_id
WHERE h.name = 'SomeArbitraryHouseImInterestedIn';
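The reverse lookup, all houses for a given person, is just the mirror image (a sketch using the same schema; the person name is a placeholder):
SELECT h.*
FROM people p
JOIN people_houses ph ON ph.person_id = p.id
JOIN houses h ON h.id = ph.house_id
WHERE p.name = 'SomeArbitraryPersonImInterestedIn';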
No more false positives, and they all lived happily ever after.

The nice solution is to redesign your schema so that you have the following tables:
People
------
PeopleID (PK)
...
PeopleHouses
------------
PeopleID (PK) (FK to People)
HouseID (PK) (FK to Houses)
Houses
------
HouseID (PK)
...

Short Term Solution
For your immediate problem, the FIND_IN_SET function is what you want to use for joining:
For People
SELECT p.*
FROM PEOPLE p
JOIN HOUSES h ON FIND_IN_SET(p.name, h.people)
WHERE h.name = ?
For Houses
SELECT h.*
FROM HOUSES h
JOIN PEOPLE p ON FIND_IN_SET(h.name, p.houses)
WHERE p.name = ?
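One caveat worth hedging: FIND_IN_SET matches against a strictly comma-separated list, so if the stored strings contain spaces after the commas (as in "Sam, Samantha, Daren"), entries with a leading space will be missed. A sketch that strips the spaces first:
SELECT p.*
FROM PEOPLE p
JOIN HOUSES h ON FIND_IN_SET(p.name, REPLACE(h.people, ', ', ','))
WHERE h.name = ?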
Long Term Solution
Properly model this by adding a table that links houses to people; right now you're likely storing the same relationship redundantly in both tables:
CREATE TABLE people_houses (
house_id int,
person_id int,
PRIMARY KEY (house_id, person_id),
FOREIGN KEY (house_id) REFERENCES houses (id),
FOREIGN KEY (person_id) REFERENCES people (id)
)
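For illustration, each ownership then becomes a plain row in that table (the IDs here are assumptions):
INSERT INTO people_houses (house_id, person_id) VALUES
(1, 2), -- e.g. Samantha owns House1
(1, 3), -- Daren also lives in House1
(2, 2); -- Samantha also owns House2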

The point is that you need a different schema, like the one proposed by @RedFilter. You can see it as:
People table:
PeopleID
otherFields
Houses table:
HouseID
otherFields
Ownership table:
PeopleID
HouseID
otherFields
Hope that helps,

You just need to swap the table names: People on the left side and Houses on the right:
SELECT People.*
FROM People
LEFT JOIN Houses ON Houses.People Like CONCAT(CONCAT('%', People.Name), '%')
WHERE Houses.Name = 'SomeArbitraryHouseImInterestedIn'

Related

How to handle multivalued fields in a MySQL database for a movie collection?

I have a large number of movies and TV series, which I currently keep track of in an MS Excel worksheet. Due to the large number of records and the variety of data required, it is no longer a convenient option, so I want to switch to a MySQL database, accessed through a GUI programmed in Java using the NetBeans IDE.
I have the following tables in Excel:
Media_Library,
To_Be_Watched,
Statistics,
Wish_List,
Orders
Each film and TV series in my collection is in the Media_Library table, which has the following fields:
Sorting_Title
Title
Collection
Genre
Release_Year
Director
Age_Rating
Country
Runtime (min)
Watched
Media_Type
Format
For example: 'Alien 2', 'Aliens', 'Alien: Anthology', 'Action/Horror/Sci-Fi', 1986, 'James Cameron', 'M', 'America', 137, 'Yes', 'Movie', '4K UHD'
I'm stuck on what to do for the following fields: Genre, Director, Country, Runtime
Those 4 fields can each have multiple values, and I don't know how best to handle that; e.g. most films only have 1 runtime, but many have multiple (2 of the films have 4 different cuts). Also anthology films can have something like 6 different directors. I want to include all relevant genres, directors, countries and runtimes, but I don't know how to best do that.
I've tried adding a column for each value; genre1, genre2, ... This results in many blank values though. In the spreadsheet in Excel I put all applicable genres in a single field as one string, e.g. 'comedy/horror'.
What would be the easiest way to resolve this issue? Can I do a many-to-many relationship to achieve what I want?
Simply put a hard limit on the number of genres.
For instance, while you may want the user to be able to enter as many genres as they want, is it rational to go above 20 genres? That doesn't make much sense and will only make searches much more time-intensive.
For other possible duplicates, you can do something like this (in sqlite3 at least):
CREATE TABLE IF NOT EXISTS Directors
(id INTEGER PRIMARY KEY,
director TEXT,
UNIQUE(director) ON CONFLICT IGNORE);
CREATE TABLE IF NOT EXISTS file
(file_id INTEGER PRIMARY KEY,
filename TEXT,
director_id INTEGER,
watched INTEGER,
FOREIGN KEY (director_id)
REFERENCES Directors (id)
ON UPDATE CASCADE
ON DELETE SET NULL);
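A short usage sketch (SQLite syntax, names as above): inserting the same director twice keeps only one row thanks to the UNIQUE ... ON CONFLICT IGNORE clause, and the file row just stores the id it looks up. The filename is made up for illustration.
INSERT INTO Directors (director) VALUES ('James Cameron');
INSERT INTO Directors (director) VALUES ('James Cameron'); -- silently ignored, no duplicate row
INSERT INTO file (filename, director_id, watched)
VALUES ('Aliens (1986).mkv',
(SELECT id FROM Directors WHERE director = 'James Cameron'),
1);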
It doesn't matter if more than one file has the same director, just as long as the 'file' table knows which row it's referencing and stays updated.
The 'watched' column holds the kind of value that doesn't warrant its own lookup table. For instance, say a song's track number is 2. Creating a table just for track numbers to reference doesn't make sense, because you would use a row in that table and then another reference in the 'file' table to point at it. So just store the value directly in 'file'.
https://www.sqlitetutorial.net/sqlite-foreign-key/
Generally, you would add a second table, Directors, for instance, and then you relate that back to the movie title. You will need a uniqueID for the movie, and you do a join where that uniqueID is referenced in the Directors table, something like this working demo (not all fields were included in my demo):
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=9e8d73bc798767b56b974ab4ebd30517
SELECT m.id, m.title, d.director
FROM Media_Library m
JOIN Directors d ON (m.id = d.mediaID);
or to concatenate the directors:
SELECT m.id, m.title, group_concat(d.director) as directors
FROM Media_Library m
JOIN Directors d ON (m.id = d.mediaID)
GROUP BY m.id;
Usually when you make this kind of relationship you will define a foreign key constraint, creating a link between the primary key of one table and a key (or keys) in another. In this case, the link is between id in Media_Library and mediaID in Directors, so you would alter the create statement like this:
CREATE TABLE Directors (
id int not null auto_increment,
mediaID int,
director varchar(50),
PRIMARY KEY(id),
FOREIGN KEY (mediaID) REFERENCES Media_Library(id)
);
The foreign key is not strictly necessary, but it can reinforce database integrity. The ins and outs of foreign keys are out of scope for this answer, but you should probably read about them.
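For illustration, assuming a Media_Library row with id = 1 already exists for 'Alien: Anthology' (an assumed id, not one taken from the fiddle), the director rows would simply be:
INSERT INTO Directors (mediaID, director) VALUES
(1, 'Ridley Scott'),
(1, 'James Cameron');
The group_concat query above would then return both names in a single row for that title.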
It is also possible to store the data in a JSON field since MySQL 5.7, like this:
CREATE TABLE test.Media_Library (id int not null auto_increment, title varchar (50), director JSON, PRIMARY KEY (id));
INSERT INTO test.Media_Library (title, director) VALUES
('Alien', json_array("Scott", "Scorsese")),
('The Alienist', '["Tarantino", "Nolan", "Kubrick"]'),
('Alien 2', '["Scott"]');
SELECT * FROM test.Media_Library;
https://www.db-fiddle.com/f/tG1SZorEHEYi5cYwgXjPeY/1
In the second query in that fiddle, I select only the first director from the list:
SELECT id, title, director->>'$[0]' as firstDirector
FROM Media_Library;
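And if you need to filter on a value inside the array, a hedged sketch for MySQL 5.7+ (using the same table):
-- Find every title whose director array contains "Scott"
SELECT id, title
FROM test.Media_Library
WHERE JSON_CONTAINS(director, '"Scott"');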
There are advantages to storing data this way, but there are tradeoffs. Unless you know what you are doing or have a specific reason to use JSON fields (for instance, you receive the data from an API as JSON and just want to store it as is), I would stick with the join method. Also, storing arrays is inherently non-normal (read about database normalization; a quick overview is on Wikipedia: https://en.wikipedia.org/wiki/Database_normalization).

Benefits of using an autogenerated primary key instead of a constant unique name

I've heard that having an autogenerated primary key is a convention. However, I'm trying to understand its benefits in the following scenario:
CREATE TABLE countries
(
countryID int(11) PRIMARY KEY AUTO_INCREMENT,
countryName varchar(128) NOT NULL
);
CREATE TABLE students
(
studentID int(11) PRIMARY KEY AUTO_INCREMENT,
studentName varchar(128) NOT NULL,
countryOfOrigin int(11) NOT NULL,
FOREIGN KEY (countryOfOrigin) REFERENCES countries (countryID)
);
INSERT INTO countries (countryName)
VALUES ('Germany'), ('Sweden'), ('Italy'), ('China');
If I want to insert something into the students table, I need to lookup the countryIDs in the countries table:
INSERT INTO students (studentName, countryOfOrigin)
VALUES ('Benjamin Schmidt', (SELECT countryID FROM countries WHERE countryName = 'Germany')),
('Erik Jakobsson', (SELECT countryID FROM countries WHERE countryName = 'Sweden')),
('Manuel Verdi', (SELECT countryID FROM countries WHERE countryName = 'Italy')),
('Min Lin', (SELECT countryID FROM countries WHERE countryName = 'China'));
In a different scenario, as I know that the countryNames in the countries table are unique and not null, I could do the following:
CREATE TABLE countries2
(
countryName varchar(128) PRIMARY KEY
);
CREATE TABLE students2
(
studentID int(11) PRIMARY KEY AUTO_INCREMENT,
studentName varchar(128) NOT NULL,
countryOfOrigin varchar(128) NOT NULL,
FOREIGN KEY (countryOfOrigin) REFERENCES countries2 (countryName)
);
INSERT INTO countries2 (countryName)
VALUES ('Germany'), ('Sweden'), ('Italy'), ('China');
Now, inserting data into the students2 table is simpler:
INSERT INTO students2 (studentName, countryOfOrigin)
VALUES ('Benjamin Schmidt', 'Germany'),
('Erik Jakobsson', 'Sweden'),
('Manuel Verdi', 'Italy'),
('Min Lin', 'China');
So why should the first option be the better one, given that countryNames are unique and are never going to change?
There are two aspects involved here:
natural keys vs. surrogate keys
autoincremented values
You are wondering why you should have to deal with some arbitrary number for a country when a country can be uniquely identified by its name. Well, imagine you use the country names in several tables to relate rows to each other. Then at some point you are told that you misspelled a country. You want to correct this, but you have to do it in every table the country occurs in. In big databases you usually don't have cascading updates, in order to avoid updates that unexpectedly take hours instead of mere minutes or seconds. So you must do this manually, but the foreign key constraints get in your way: you cannot change the parent table's key, because there are child tables using it, and you cannot change the key in the child tables first, because that key has to exist in the parent table. You'll have to work with a new row in the parent table and start from there. Quite some task.
And even if you have no spelling issue, at some point someone might say "we need the official country names; you have China, but it must be the People's Republic of China instead", and again you must look up and change that country in all those tables.
And what about partial backups? A table gets totally messed up due to some programming error and must be replaced by last week's backup, because this is the best you have. But suddenly some keys don't match any more. You never want a table's key to change.
You say "country names are unique and are never going to change". Think again :-)
It is easier to have your database use an arbitrary technical ID instead. Then the country name only exists in the country table, and if that name must be changed, you change it in just that one place and all relations stay intact. This, however, doesn't mean that natural keys are worse than technical IDs. They are not. But it's more difficult to set up a database correctly with them. In the case of countries, a good natural key would be the country's ISO code, because it uniquely identifies a country and doesn't change. That would be my choice here.
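A minimal sketch of that ISO-code variant (table names, types, and the sample codes are assumptions, mirroring the examples above):
CREATE TABLE countries3
(
countryCode char(2) PRIMARY KEY, -- ISO 3166-1 alpha-2, e.g. 'DE', 'SE'
countryName varchar(128) NOT NULL UNIQUE
);
CREATE TABLE students3
(
studentID int(11) PRIMARY KEY AUTO_INCREMENT,
studentName varchar(128) NOT NULL,
countryOfOrigin char(2) NOT NULL,
FOREIGN KEY (countryOfOrigin) REFERENCES countries3 (countryCode)
);
INSERT INTO countries3 (countryCode, countryName)
VALUES ('DE', 'Germany'), ('SE', 'Sweden'), ('IT', 'Italy'), ('CN', 'China');
INSERT INTO students3 (studentName, countryOfOrigin)
VALUES ('Benjamin Schmidt', 'DE'), ('Min Lin', 'CN');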
With students it's the same. Students usually have a student number or student ID in the real world, so you can simply use this number to uniquely identify a student in the database. But then, how do we get these unique student IDs? At a large university, two secretaries may want to enter new students at the same time. They ask the system what the last student's ID was. It was #11223, so they both want to issue #11224, which of course causes a conflict, because only one student can be given that number. In order to avoid this, DBMS offer sequences from which numbers are taken. Thus one of the secretaries pulls #11224 and the other #11225. Auto-incremented IDs work this way. Both secretaries enter their new student, the rows get inserted into the student table and result in two different IDs that get reported back to the secretaries. This makes sequences and auto-incrementing IDs a great and safe tool to work with.
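In MySQL terms, that per-connection behaviour looks like this (a sketch against the students table above; the student name and country id are made up):
-- Each secretary's session runs its own insert; AUTO_INCREMENT hands out distinct IDs.
INSERT INTO students (studentName, countryOfOrigin)
VALUES ('New Student', 1);
-- LAST_INSERT_ID() is evaluated per connection, so each session sees only the ID its own insert generated.
SELECT LAST_INSERT_ID();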
Convention can be a useful guide. It isn't necessarily the best option in all situations.
There are usually tradeoffs involved, like space, convenience, etc.
While you showed one method of resolving / inserting the proper country key value, there's a slightly less wordy option supported by standard SQL (and many databases).
INSERT INTO students (studentName, countryOfOrigin)
WITH list (name, country) AS (
SELECT *
FROM (
VALUES ('Benjamin Schmidt', 'Germany')
, ('Erik Jakobsson', 'Sweden')
, ('Manuel Verdi', 'Italy')
, ('Min Lin', 'China')
) AS x
)
SELECT name, countryID
FROM list AS l
JOIN countries AS c
ON c.countryName = l.country
;
and a little less wordy again:
INSERT INTO students (studentName, countryOfOrigin)
WITH list (name, country) AS (
VALUES ('Benjamin Schmidt', 'Germany')
, ('Erik Jakobsson', 'Sweden')
, ('Manuel Verdi', 'Italy')
, ('Min Lin', 'China')
)
SELECT name, countryID
FROM list AS l
JOIN countries AS c
ON c.countryName = l.country
;
Here's a test case with MariaDB 10.5:
Working test case (updated)
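If your server doesn't support the VALUES table constructor used above, the same name-to-ID lookup can still be done per row with an ordinary INSERT ... SELECT (a sketch):
INSERT INTO students (studentName, countryOfOrigin)
SELECT 'Benjamin Schmidt', countryID
FROM countries
WHERE countryName = 'Germany';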

SQL Relation and Query

I am trying to create a database that contains two tables. I have included the create_tables.sql code if this helps. I am trying to set the relationship to make STKEY the defining key so that a query can be used to search for the key and show what issues this student has been having. At the moment when I search using:
SELECT *
FROM student, student_log
WHERE 'tilbun' like student.stkey
It shows all the issues in the table regardless of the STKEY. I think I may have the foreign key set incorrectly. I have included the create_tables.sql here.
CREATE TABLE `student`
(
`STKEY` VARCHAR(10),
`first_name` VARCHAR(15),
`surname` VARCHAR(15),
`year_group` VARCHAR(4),
PRIMARY KEY (STKEY)
)
;
CREATE TABLE `student_log`
(
`issue_number` int NOT NULL AUTO_INCREMENT,
`STKEY` VARCHAR(10),
`date_field` DATETIME,
`issue` VARCHAR(150),
PRIMARY KEY (issue_number),
INDEX (STKEY),
FOREIGN KEY (STKEY) REFERENCES student (STKEY)
)
;
Cheers for the help.
Though you have correctly defined the foreign key relationship in the tables, you must still specify a join condition when performing the query. Otherwise, you'll get a cartesian product of the two tables (all rows of one times all rows of the other).
SELECT
student.*,
student_log.*
FROM student INNER JOIN student_log ON student.STKEY = student_log.STKEY
WHERE student.STKEY LIKE 'tilbun'
And note that rather than using an implicit join (comma-separated list of tables), I have used an explicit INNER JOIN, which is the preferred modern syntax.
Finally, there's little use in a LIKE clause instead of = unless you also use wildcard characters:
WHERE student.STKEY LIKE '%tilbun%'

Many-to-many relations in RDBMS databases

What is the best way of handling many-to-many relations in an RDBMS like MySQL?
I have tried using a pivot table to keep track of the relationships, but it leads to one of the following:
Normalization gets left behind
Columns that are empty or null
What approach have you taken in order to support many-to-many relationships?
Keep track of a many-to-many relationship in a table specifically for that relationship (sometimes called a junction table). This table models the relationship as two one-to-many relationships pointing in opposite directions.
CREATE TABLE customer (
customer_id VARCHAR NOT NULL,
name VARCHAR NOT NULL,
PRIMARY KEY (customer_id));
CREATE TABLE publication (
issn VARCHAR NOT NULL,
name VARCHAR NOT NULL,
PRIMARY KEY (issn));
-- Many-to-many relationship for subscriptions.
CREATE TABLE subscription (
customer_id VARCHAR NOT NULL,
issn VARCHAR NOT NULL,
begin TIMESTAMP NOT NULL,
PRIMARY KEY (customer_id, issn),
FOREIGN KEY (customer_id) REFERENCES customer (customer_id),
FOREIGN KEY (issn) REFERENCES publication (issn));
You then use the junction table to join other tables through it via the foreign keys.
-- Which customers subscribe to publications named 'Your Garden Gnome'?
SELECT customer.*
FROM customer
JOIN subscription
ON subscription.customer_id = customer.customer_id
JOIN publication
ON subscription.issn = publication.issn
WHERE
publication.name = 'Your Garden Gnome';
-- Which publications do customers named 'Fred Nurk' subscribe to?
SELECT publication.*
FROM publication
JOIN subscription
ON subscription.issn = publication.issn
JOIN customer
ON subscription.customer_id = customer.customer_id
WHERE
customer.name = 'Fred Nurk';
I would use a pivot table, but I don't see where your issues are coming from. Using a simple student/class example:
Student
-------
Id (Primary Key)
FirstName
LastName
Course
------
Id (Primary Key)
Title
StudentCourse
-------------
StudentId (Foreign Key -> Student)
CourseId (Foreign Key -> Course)
Or, as somebody else mentioned in response to your Student/Teacher/Course question (which would have an additional table to store the type of person in the course):
PersonType
----------
Id (Primary Key)
Type
Person
------
Id (Primary Key)
FirstName
LastName
Type (Foreign Key -> PersonType)
Course
------
Id (Primary Key)
Title
PersonCourse
------------
PersonId (Foreign Key -> Person)
CourseId (Foreign Key -> Course)
The Student table contains student information, the Course table stores course information...and the pivot table simply contains the Ids of the relevant students and courses. That shouldn't lead to any null/empty columns or anything.
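As a concrete sketch of that outline in MySQL (the column types are assumptions):
CREATE TABLE Student (
Id INT PRIMARY KEY,
FirstName VARCHAR(50),
LastName VARCHAR(50)
);
CREATE TABLE Course (
Id INT PRIMARY KEY,
Title VARCHAR(100)
);
CREATE TABLE StudentCourse (
StudentId INT,
CourseId INT,
PRIMARY KEY (StudentId, CourseId),
FOREIGN KEY (StudentId) REFERENCES Student (Id),
FOREIGN KEY (CourseId) REFERENCES Course (Id)
);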
In addition to Justin's answer: if you make clever use of Foreign Key constraints, you can control what happens when data gets updated or deleted. That way, you can make sure that you do not end up with de-normalized data.
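For example, declaring the junction table's foreign keys with ON DELETE CASCADE (one possible choice, as a variation of the sketch above) means that removing a student automatically removes their enrolment rows:
CREATE TABLE StudentCourse (
StudentId INT,
CourseId INT,
PRIMARY KEY (StudentId, CourseId),
FOREIGN KEY (StudentId) REFERENCES Student (Id) ON DELETE CASCADE ON UPDATE CASCADE,
FOREIGN KEY (CourseId) REFERENCES Course (Id) ON DELETE CASCADE ON UPDATE CASCADE
);
DELETE FROM Student WHERE Id = 1; -- matching StudentCourse rows disappear automatically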

Polymorphism in SQL database tables?

I currently have multiple tables in my database which consist of the same 'basic fields' like:
name character varying(100),
description text,
url character varying(255)
But I have multiple specializations of that basic table: for example, tv_series has the fields season, episode, and airing, while the movies table has release_date, budget, etc.
At first this is not a problem, but now I want to create a second table, called linkgroups, with a Foreign Key to these specialized tables. That means I would somehow have to normalize it within itself.
One way of solving this that I have heard of is to normalize it with a key-value-pair table, but I do not like that idea: it is kind of a 'database-within-a-database' scheme, I have no way to require certain keys/fields or types, and it would be a huge pain to fetch and order the data later.
So I am looking for a way now to 'share' a Primary Key between multiple tables or even better: a way to normalize it by having a general table and multiple specialized tables.
Right, the problem is that you want only one object of one sub-type to reference any given row of the parent class. Starting from the example given by @Jay S, try this:
create table media_types (
media_type int primary key,
media_name varchar(20)
);
insert into media_types (media_type, media_name) values
(2, 'TV series'),
(3, 'movie');
create table media (
media_id int not null,
media_type int not null,
name varchar(100),
description text,
url varchar(255),
primary key (media_id),
unique key (media_id, media_type),
foreign key (media_type)
references media_types (media_type)
);
create table tv_series (
media_id int primary key,
media_type int check (media_type = 2),
season int,
episode int,
airing date,
foreign key (media_id, media_type)
references media (media_id, media_type)
);
create table movies (
media_id int primary key,
media_type int check (media_type = 3),
release_date date,
budget numeric(9,2),
foreign key (media_id, media_type)
references media (media_id, media_type)
);
This is an example of the disjoint subtypes mentioned by @mike g.
Re comments by @Countably Infinite and @Peter:
INSERT to two tables would require two insert statements. But that's also true in SQL any time you have child tables. It's an ordinary thing to do.
UPDATE may require two statements, but some brands of RDBMS support multi-table UPDATE with JOIN syntax, so you can do it in one statement.
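In MySQL terms the two-step insert, and a joined update, might look like this (a sketch against the tables above; the literal values are made up):
-- Insert the parent row first, then the subtype row with the same media_id and media_type.
INSERT INTO media (media_id, media_type, name, description, url)
VALUES (1, 3, 'Alien', 'Sci-fi horror', 'http://example.com/alien');
INSERT INTO movies (media_id, media_type, release_date, budget)
VALUES (1, 3, '1979-05-25', 9000000.00);
-- Multi-table UPDATE with JOIN syntax updates both tables in one statement:
UPDATE media AS m
JOIN movies AS v USING (media_id)
SET m.name = 'Alien (Director''s Cut)', v.budget = 9500000.00
WHERE m.media_id = 1;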
When querying data, you can do it simply by querying the media table if you only need information about the common columns:
SELECT name, url FROM media WHERE media_id = ?
If you know you are querying a movie, you can get movie-specific information with a single join:
SELECT m.name, v.release_date
FROM media AS m
INNER JOIN movies AS v USING (media_id)
WHERE m.media_id = ?
If you want information for a given media entry, and you don't know what type it is, you'd have to join to all your subtype tables, knowing that only one such subtype table will match:
SELECT m.name, t.episode, v.release_date
FROM media AS m
LEFT OUTER JOIN tv_series AS t USING (media_id)
LEFT OUTER JOIN movies AS v USING (media_id)
WHERE m.media_id = ?
If the given media is a movie, then all the columns in t.* will be NULL.
Consider using a main basic data table with tables extending off of it with specialized information.
Ex.
basic_data
id int,
name character varying(100),
description text,
url character varying(255)
tv_series
id int,
BDID int, --foreign key to basic_data
season,
episode,
airing
movies
id int,
BDID int, --foreign key to basic_data
release_date,
budget
What you are looking for is called 'disjoint subtypes' in the relational world. They are not supported in SQL at the language level, but can be more or less implemented on top of it.
You could create one table with the main fields plus a uid, then extension tables with the same uid for each specific case. To query these like separate tables, you could create views.
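A minimal sketch of such a view, reusing the media and movies tables defined above (the view name is an assumption):
CREATE VIEW movie_details AS
SELECT m.media_id, m.name, m.description, m.url, v.release_date, v.budget
FROM media AS m
JOIN movies AS v USING (media_id);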
Using the disjoint subtype approach suggested by Bill Karwin, how would you do INSERTs and UPDATEs without having to do it in two steps?
For getting data, I can introduce a View that joins and selects based on a specific media_type, but AFAIK I can't update or insert into that view because it affects multiple tables (I am talking MS SQL Server here). Can this be done without doing two operations - and without a stored procedure, naturally?
Thanks
The question is quite old, but for modern PostgreSQL versions it's also worth considering the json/jsonb/hstore types.
For example:
create table some_table (
name character varying(100),
description text,
url character varying(255),
additional_data json
);
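A hedged sketch of reading one attribute back out of that column (PostgreSQL syntax; the 'seasons' key is just an assumed example):
-- ->> extracts a JSON field as text; it works on both json and jsonb.
SELECT name, additional_data->>'seasons' AS seasons
FROM some_table
WHERE additional_data->>'seasons' IS NOT NULL;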