Pset7 - Movies Stuck on 12 and 13 SQL? - sql

I am currently working on CS50 PSET7 (https://cs50.harvard.edu/x/2020/psets/7/movies/) and I cannot figure out how to do 12.sql and 13.sql (explained in the link). Can someone please help me?

For 12.sql: Find movie titles where 'id' in "id's of Johnny Depp movies" and 'id' in "id's of Helena Bonham Carter movies", such as:
SELECT "title" FROM "movies"
WHERE "id" IN (-- code to select movie id's in which "Johnny Depp" starred)
AND "id" IN (-- code to select movie id's in which "Helena Bonham Carter" starred);
For 13.sql: Find names of people where "person_id's" in "stars" correspond to the "movie_id" in which "Kevin Bacon (born: 1958)" starred, and names != "Kevin Bacon", such as:
SELECT "name" FROM "people"
WHERE "id" IN
(-- select "person id's" from "stars" where "movie id" in
(-- select "movie id's" in which "Kevin Bacon (born: 1958)" starred))
AND "name" != 'Kevin Bacon';
Inside the second brackets of 13.sql, to query "Kevin Bacon born in 1958", you can write some code like this:
... WHERE "people"."name" = 'Kevin Bacon' AND "people"."birth" = 1958))...
Think simple, no need to do anything fancy.

12.sql
Consider using HAVING COUNT()
https://www.w3resource.com/sql/aggregate-functions/count-having.php
13.sql
As I've also answered in another thread, I found these steps helpful:
Get the ID of Kevin Bacon, with the criteria that it's the Kevin Bacon who was born in 1958
Get the movie IDs of Kevin Bacon using his ID (hint: linking his ID in table1 with table2)
Get other stars' IDs with the same movie IDs
Get the name of these stars, and exclude Kevin Bacon (because the spec says he shouldn't be included in the resulting list)

For both of these Psets you need to use nested SELECT statements e.g.:
SELECT table.column FROM table WHERE table.column IN (SELECT table.column2 FROM table WHERE ...)
Based on my experience for 12 you will need to use 2 separate nested queries (each of which should have multiple values) and then use an AND operator to find movies that appear in both of these.
For 13 I found using several nested queries helped, starting with finding the id for Kevin Bacon and working up to selecting people.name values from a query that contained multiple possible people.id values.
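The nested-query shape described in the answers above can be tried on a toy database first. This is only a sketch: the three-table layout (movies, people, stars) mirrors the column names used in this thread, but the rows, titles, and actor names are invented placeholders, and it runs through Python's built-in sqlite3 module rather than against the real CS50 database.

```python
import sqlite3

# Toy data only: every title and name below is a made-up placeholder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, birth INTEGER);
CREATE TABLE stars  (movie_id INTEGER, person_id INTEGER);
INSERT INTO movies VALUES (1, 'Film One'), (2, 'Film Two'), (3, 'Film Three');
INSERT INTO people VALUES (1, 'Actor A', 1960), (2, 'Actor B', 1970),
                          (3, 'Actor C', 1980), (4, 'Actor A', 1990);
INSERT INTO stars VALUES (1, 1), (1, 2), (2, 1), (2, 3), (3, 2), (3, 3), (3, 4);
""")

# Shape of 12.sql: two nested IN lists combined with AND.
both_titles = [row[0] for row in conn.execute("""
SELECT title FROM movies
WHERE id IN (SELECT movie_id FROM stars WHERE person_id IN
                (SELECT id FROM people WHERE name = 'Actor B'))
  AND id IN (SELECT movie_id FROM stars WHERE person_id IN
                (SELECT id FROM people WHERE name = 'Actor C'))
ORDER BY title
""")]

# Shape of 13.sql: person -> their movies -> everyone in those movies,
# then drop the person's own name. The birth year picks one of the
# two people who share the name 'Actor A'.
co_stars = [row[0] for row in conn.execute("""
SELECT name FROM people
WHERE id IN (SELECT person_id FROM stars WHERE movie_id IN
                (SELECT movie_id FROM stars WHERE person_id IN
                    (SELECT id FROM people
                     WHERE name = 'Actor A' AND birth = 1960)))
  AND name != 'Actor A'
ORDER BY name
""")]

print(both_titles)
print(co_stars)
```

Building the innermost SELECT first and checking its output before wrapping it in the next layer makes mistakes much easier to spot.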

Related

Joining 2 Tables where one is a specific column and the other is a string?

I asked this yesterday, but I think I need to clarify a few things. I appreciate everyone's help here!
Basically, my goal is to join 2 tables through a "middle table" where the values shared with one of the tables sit in the description column, which is free text.
Say I have SportsTable with player_column, and NewsTable with a description_column that contains a player's name.
For example, SportsTable has player_column = "Lebron James" and NewsTable has description_column = "The leading scorer in the game was Lebron James."
Can I join these two?
The responses I got were for a specific name in the LIKE CONCAT ("Lebron James"), but I'm looking to get all the player names that are present in SportsTable ("Lebron James", "Michael Jordan", etc.) that are also present in the description string column in NewsTable.
Is this even possible?
Consider the following:
with SportsTable as (
select 'Lebron James' as player_column union all
select 'Tom Brady' union all
select 'Michael Jordan'
),
NewsTable as (
select 'I love Lebron James' as description_column union all
select 'Tom Brady is the GOAT' union all
select 'Lebron James and Michael Jordan are good at basketball'
)
select
description_column,
player_column
from SportsTable
cross join NewsTable
where description_column like concat('%',player_column,'%')
order by 1
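The same pattern match can be checked locally. The sketch below is an assumption-laden variant of the query above: it runs on SQLite through Python's sqlite3 module, uses real tables instead of the WITH clause, and builds the pattern with the || operator because concat() is not available in every SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SportsTable (player_column TEXT);
CREATE TABLE NewsTable (description_column TEXT);
INSERT INTO SportsTable VALUES ('Lebron James'), ('Tom Brady'), ('Michael Jordan');
INSERT INTO NewsTable VALUES
    ('I love Lebron James'),
    ('Tom Brady is the GOAT'),
    ('Lebron James and Michael Jordan are good at basketball');
""")

# Every (description, player) pair where the player's name occurs
# somewhere inside the description text.
matches = conn.execute("""
SELECT n.description_column, s.player_column
FROM SportsTable AS s
JOIN NewsTable AS n
  ON n.description_column LIKE '%' || s.player_column || '%'
ORDER BY 1, 2
""").fetchall()

for description, player in matches:
    print(player, "<-", description)
```

A description that mentions two players appears once per player. Because every player is compared with every description, this can get slow on large tables.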

sql finding multiple characters within multiple books

I can't seem to find all comics with certain characters in them. The comics and characters tables have a many to many relationship as follows:
My database schema:
**comics table**
comic_id
comic_name
comic_date
**character table**
character_id
character_name
**comics_character table**
comic_id
character_id
This works fine for one character:
sqlite3 comics.db
select comic_name
from comics as c, comics_characters as cc, characters as h
on c.comic_id = cc.comic_id and h.character_id = cc.character_id
where h.character_name = 'Superman';
But if I want all comics with say Superman and Batman in them, I tried using this:
sqlite3 comics.db
select comic_name
from comics as c, comics_characters as cc, characters as h
on c.comic_id = cc.comic_id and h.character_id = cc.character_id
where h.character_name in ('Batman', 'Superman');
but this only gets me a list of comics featuring either Batman OR Superman rather than comics with both Batman AND Superman in
I've also tried this which doesn't return anything:
sqlite3 comics.db
select comic_name
from comics as c, comics_characters as cc, characters as h
on (c.comic_id = cc.comic_id and h.character_id = cc.character_id)
where (h.character_name = 'Batman' and h.character_name = 'Superman');
I've tried other variations but can't get the desired result
The OR route doesn't work, for reasons you've worked out: you get rows for Superman or Batman, and comics that have both characters have two rows with the same comic id and different character ids. The AND route doesn't work because a row cannot be simultaneously both characters.
So, you need to use the OR route to get comics with one or both characters and then also use a count to show only comics with both characters. Essentially, "filter to Superman or Batman, and then filter again to only comic ids that appear on two rows" or "filter to only Batman or Superman, then group them up based on the comic id and only take groups that have two entities in them". Ultimately, the lesson here is that database rows are thought of as different entities, and when you want to treat them as one you have to group them. So we are identifying comics based on some attribute of the group after (deliberately) losing the detail of exactly which entities the group contains:
SELECT comic_id
FROM comics_characters
WHERE character_id IN (1,2) --Batman or Superman
GROUP BY comic_id
HAVING COUNT(*) = 2
The number on the right hand side of the = must be the same as the number of character IDs in the IN() clause. If you IN for 4 character IDs, then use COUNT(*) = 4
You can join in other tables so you can use names etc; I simplified this to make the point without extraneous detail
Footnote: this technique will find comics that feature at least Batman and Superman. The comic could very well contain other characters too, but we lost those other guys at the WHERE stage before we did the GROUP BY. If you wanted comics that ONLY featured Batman and Superman, it's a different thing. For that we could do something like grouping first and counting conditionally: give Batman or Superman a score of 1 and everyone else a score of 10. Comics that had only B and S would score 2, comics that featured only one of them would score 1, and anyone else's presence would cause a score of 10 or more, so we could filter on the 2.
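Both variants from this answer, with the name tables joined in, might look like the sketch below. The table and column names follow the queries in the question; the comic titles and rows are invented, and the whole thing runs on an in-memory SQLite database via Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comics (comic_id INTEGER PRIMARY KEY, comic_name TEXT);
CREATE TABLE characters (character_id INTEGER PRIMARY KEY, character_name TEXT);
CREATE TABLE comics_characters (comic_id INTEGER, character_id INTEGER);
INSERT INTO comics VALUES (1, 'Issue A'), (2, 'Issue B'), (3, 'Issue C');
INSERT INTO characters VALUES (1, 'Batman'), (2, 'Superman'), (3, 'Wonder Woman');
-- Issue A: Batman + Superman; Issue B: Batman; Issue C: all three.
INSERT INTO comics_characters VALUES (1, 1), (1, 2), (2, 1), (3, 1), (3, 2), (3, 3);
""")

# "At least both": filter to the wanted names, then keep the groups
# in which both of them survived the filter.
at_least_both = [row[0] for row in conn.execute("""
SELECT c.comic_name
FROM comics AS c
JOIN comics_characters AS cc ON cc.comic_id = c.comic_id
JOIN characters AS h ON h.character_id = cc.character_id
WHERE h.character_name IN ('Batman', 'Superman')
GROUP BY c.comic_id, c.comic_name
HAVING COUNT(DISTINCT h.character_id) = 2
ORDER BY c.comic_name
""")]

# "Only those two": no WHERE filter; score wanted characters 1 and
# everyone else 10, so a total of exactly 2 means B and S and nobody else.
only_both = [row[0] for row in conn.execute("""
SELECT c.comic_name
FROM comics AS c
JOIN comics_characters AS cc ON cc.comic_id = c.comic_id
JOIN characters AS h ON h.character_id = cc.character_id
GROUP BY c.comic_id, c.comic_name
HAVING SUM(CASE WHEN h.character_name IN ('Batman', 'Superman')
                THEN 1 ELSE 10 END) = 2
ORDER BY c.comic_name
""")]

print(at_least_both)
print(only_both)
```

COUNT(DISTINCT ...) is used instead of COUNT(*) as a precaution in case the link table ever contains the same comic/character pair twice.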

BigQuery: grouping by similar strings for a large dataset

I have a table of invoice data with over 100k unique invoices and several thousand unique company names associated with them.
I'm trying to group these company names into more general groups to understand how many invoices they're responsible for, how often they receive them, etc.
Currently, I'm using the following code to identify unique company names:
SELECT DISTINCT(company_name)
FROM invoice_data
ORDER BY company_name
The problem is that this only gives me exact matches, when it's obvious that there are many string values in company_name that are similar. For example: McDonalds Paddington, McDonlads Oxford Square, McDonalds Peckham, etc.
How can I make my GROUP BY statement more general?
Sometimes the issue isn't as simple as the example listed above, occasionally there is simply an extra space or PTY/LTD which throws off a GROUP BY match.
EDIT
To give an example of what I'm looking for, I'd be looking to turn the following:
company_name
----------------------
Jim's Pizza Paddington
Jim's Pizza Oxford
McDonald's Peckham
McDonald's Victoria
----------------------
And be able to group by their company name rather than exclusively with an exact string match.
Have you tried using the Soundex function?
SELECT
SOUNDEX(name) AS code,
MAX( name) AS sample_name,
count(name) as records
FROM ((
SELECT
"Jim's Pizza Paddington" AS name)
UNION ALL (
SELECT
"Jim's Pizza Oxford" AS name)
UNION ALL (
SELECT
"McDonald's Peckham" AS name)
UNION ALL (
SELECT
"McDonald's Victoria" AS name))
GROUP BY
1
ORDER BY
You can then use the soundex code to create groupings, with a split or other type of function to pull out the part of the string that matches the name group, or use a window function to pull back one occurrence to get the name string. Not perfect, but it means you do not need to pull the data into other tools with advanced language recognition.

Querying for swapped columns in SQL database

We've had a few cases of people entering in first names where last names should be and vice versa. So I'm trying to come up with a SQL search to match the swapped columns. For example, someone may have entered the record as first_name = Smith, last_name = John by accident. Later, another person may see that John Smith is not in the database and enter a new user as first_name = John, last_name = Smith, when in fact it is the same person.
I used this query to help narrow my search:
SELECT person_id, first_name, last_name
FROM people
WHERE first_name IN (
SELECT last_name FROM people
) AND last_name IN (
SELECT first_name FROM people
);
But if we have people named John Allen, Allen Smith, and Smith John, they would all be returned even though none of those are actually duplicates. In this case, it's actually good enough that I can see the duplicates in my particular data set, but I'm wondering if there's a more precise way to do this.
I would do a self join like this:
SELECT p1.person_id, p1.first_name, p1.last_name
FROM people p1
join people p2 on p1.first_name = p2.last_name and p1.last_name = p2.first_name
To also find typos on names I recommend this:
SELECT p1.person_id, p1.first_name, p1.last_name
FROM people p1
join people p2 on soundex(p1.first_name) = soundex(p2.last_name) and
soundex(p1.last_name) = soundex(p2.first_name)
soundex is a neat function that "hashes" words in a way that two words that sound the same get the same hash. This means Anne and Ann will have the same soundex. So if you had an Anne Smith and a Smith Ann the query above would find them as a match.
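To see the difference between the exact and the soundex self-join, here is a sketch on SQLite through Python's sqlite3 module. Not every SQLite build ships a soundex() function, so the example registers its own: the soundex helper below is a hand-written version of the classic American Soundex rules and may differ in edge cases from your database's built-in. The sample rows and the extra person_id <> person_id condition (which stops a row from matching itself) are additions for illustration.

```python
import sqlite3

def soundex(word):
    """Hand-written American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    letters = [ch for ch in (word or "").lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":          # h and w do not break a run of equal codes
            continue
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit            # vowels reset prev to ""
    return (result + "000")[:4]

conn = sqlite3.connect(":memory:")
conn.create_function("soundex", 1, soundex)
conn.executescript("""
CREATE TABLE people (person_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
INSERT INTO people VALUES
    (1, 'Anne', 'Smith'), (2, 'Smith', 'Ann'),
    (3, 'John', 'Allen'), (4, 'Allen', 'Smith'),
    (5, 'Smith', 'John'), (6, 'John', 'Smith');
""")

exact_ids = [row[0] for row in conn.execute("""
SELECT DISTINCT p1.person_id
FROM people p1
JOIN people p2 ON p1.first_name = p2.last_name
              AND p1.last_name = p2.first_name
              AND p1.person_id <> p2.person_id
ORDER BY p1.person_id
""")]

fuzzy_ids = [row[0] for row in conn.execute("""
SELECT DISTINCT p1.person_id
FROM people p1
JOIN people p2 ON soundex(p1.first_name) = soundex(p2.last_name)
              AND soundex(p1.last_name) = soundex(p2.first_name)
              AND p1.person_id <> p2.person_id
ORDER BY p1.person_id
""")]

print(exact_ids)
print(fuzzy_ids)
```

The exact join only pairs Smith John with John Smith, while the soundex join also pairs Anne Smith with Smith Ann. John Allen and Allen Smith are left out by both, unlike with the IN-based query from the question.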
Interesting. This is a problem that I cover in Data Analysis Using SQL and Excel (note: I only very rarely mention books in my answers or comments).
The idea is to summarize the data to get a likelihood of a mismatch. So, look at the number of times a name appears as a first name and as a last name and then combine these. So:
with names as (
select first_name as name, 1.0 as isf, 0.0 as isl
from people
union all
select last_name, 0, 1
from people
),
nl as (
select name, sum(isf) as numf, sum(isl) as numl,
avg(isf) as p_f, avg(isl) as p_l
from names
group by name
)
select p.*
from people p join
nl nlf
on p.first_name = nlf.name join
nl nll
on p.last_name = nll.name
order by (coalesce(nlf.p_l, 0) + coalesce(nll.p_f, 0));
This orders the records by a measure of mismatch of the names: the sum of the probability of the first name being used as a last name and the probability of the last name being used as a first name.

mysql where IN on large dataset or Looping?

I have the following scenario:
Table 1: articles
id  article_text   category  author_id
1   "hello world"  4         1
2   "hi"           5         2
3   "wasup"        4         3
Table 2: authors
id  name    friends_with
1   "Joe"   "Bob"
2   "Sue"   "Joe"
3   "Fred"  "Bob"
I want to know the total number of authors that are friends with "Bob" for a given category.
So for example, for category 4 how many authors are there that are friends with "Bob".
The authors table is quite large, in some cases I have a million authors that are friends with "Bob"
So I have tried:
Get list of authors that are friends with bob, and then loop through them and get the count for each of them of that given category and sum all those together in my code.
The issue with this approach is it can generate a million queries, even though they are very fast, it seems there should be a better way.
I was thinking of trying to get a list of authors that are friends with Bob and then building an IN clause with that list, but I fear that would exceed the amount of memory allowed for a query.
Seems like this is a common problem. Any ideas?
thanks
SELECT COUNT(DISTINCT auth.id)
FROM authors auth
INNER JOIN articles art ON auth.id = art.author_id
WHERE friends_with = 'bob' AND art.category = 4
COUNT(DISTINCT auth.id) is required because the join to articles can produce multiple rows for each author.
But if you have any control over the database, I would use a link table for friends_with. Your current design either has to store a comma-separated list of names, which will be disastrous for performance and requires a completely different query, or each author can only have one friend.
Friends
id friend_id
then the query would look like this
SELECT COUNT(DISTINCT auth.id)
FROM authors auth
INNER JOIN articles art ON auth.id = art.author_id
INNER JOIN friends f ON auth.id = f.id
INNER JOIN authors fauth ON fauth.id = f.friend_id
WHERE fauth.name = 'bob' AND art.category = 4
It's more complex but will allow for many friends. Just remember, this construct calls for 2 rows in friends for each pair, one from Joe to Bob and one from Bob to Joe.
You could build it differently but that would make the query even more complex.
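The link-table design from this answer can be sketched as follows. The friends table layout (id, friend_id) is taken from the answer; the rows, the extra author Bob, and the extra article are made up for the example, and it runs on SQLite through Python's sqlite3 module rather than MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE articles (id INTEGER PRIMARY KEY, article_text TEXT,
                       category INTEGER, author_id INTEGER);
CREATE TABLE friends (id INTEGER, friend_id INTEGER);
INSERT INTO authors VALUES (1, 'Joe'), (2, 'Sue'), (3, 'Fred'), (4, 'Bob');
INSERT INTO articles VALUES (1, 'hello world', 4, 1), (2, 'hi', 5, 2),
                            (3, 'wasup', 4, 3), (4, 'again', 4, 1);
-- Two rows per friendship, one in each direction.
INSERT INTO friends VALUES (1, 4), (4, 1), (3, 4), (4, 3), (2, 1), (1, 2);
""")

# Authors with at least one category-4 article who are friends with Bob.
bob_friend_count = conn.execute("""
SELECT COUNT(DISTINCT auth.id)
FROM authors auth
INNER JOIN articles art ON auth.id = art.author_id
INNER JOIN friends f ON auth.id = f.id
INNER JOIN authors fauth ON fauth.id = f.friend_id
WHERE fauth.name = 'Bob' AND art.category = 4
""").fetchone()[0]

print(bob_friend_count)
```

Joe has two category-4 articles here, so without DISTINCT he would be counted twice.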
Maybe something like
select fr.name,
fr.id,
au.name,
ar.article_text,
ar.category,
ar.author_id
from authors fr, authors au, articles ar
where fr.id = ar.author_id
and au.friends_with = fr.name
and ar.category = 4 ;
Just the count...
select count(distinct fr.name)
from authors fr, authors au, articles ar
where fr.id = ar.author_id
and au.friends_with = fr.name
and ar.category = 4 ;
A version without using joins (hopefully will work!)
SELECT count(distinct id) from authors where friends_with = 'Bob' and id in(select author_id from articles where category = 4)
When I started out with SQL, I found statements with IN easier to understand.
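A quick way to convince yourself that the JOIN version and the IN version agree is to run both on the sample data from the question. The sketch below does that on SQLite through Python's sqlite3 module (an assumption, since the question is about MySQL); the fourth article is an invented extra row so that one author has two articles in the same category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT, friends_with TEXT);
CREATE TABLE articles (id INTEGER PRIMARY KEY, article_text TEXT,
                       category INTEGER, author_id INTEGER);
INSERT INTO authors VALUES (1, 'Joe', 'Bob'), (2, 'Sue', 'Joe'), (3, 'Fred', 'Bob');
INSERT INTO articles VALUES (1, 'hello world', 4, 1), (2, 'hi', 5, 2),
                            (3, 'wasup', 4, 3), (4, 'again', 4, 1);
""")

# Note: string comparison with = is case-sensitive in SQLite by default,
# so the literal must be 'Bob' exactly as stored.
join_count = conn.execute("""
SELECT COUNT(DISTINCT auth.id)
FROM authors auth
INNER JOIN articles art ON auth.id = art.author_id
WHERE auth.friends_with = 'Bob' AND art.category = 4
""").fetchone()[0]

in_count = conn.execute("""
SELECT COUNT(DISTINCT id)
FROM authors
WHERE friends_with = 'Bob'
  AND id IN (SELECT author_id FROM articles WHERE category = 4)
""").fetchone()[0]

print(join_count, in_count)
```

Either way it is a single query, so there is no need to loop over a million authors in application code. Which form is faster depends on the database and its indexes, so it is worth checking the query plan on the real data.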