SQL: finding multiple characters within multiple books

I can't seem to find all comics with certain characters in them. The comics and characters tables have a many to many relationship as follows:
My database schema:
**comics table**
comic_id
comic_name
comic_date
**characters table**
character_id
character_name
**comics_characters table**
comic_id
character_id
This works fine for one character:
sqlite3 comics.db
select comic_name
from comics as c, comics_characters as cc, characters as h
on c.comic_id = cc.comic_id and h.character_id = cc.character_id
where h.character_name = 'Superman';
But if I want all comics with say Superman and Batman in them, I tried using this:
sqlite3 comics.db
select comic_name
from comics as c, comics_characters as cc, characters as h
on c.comic_id = cc.comic_id and h.character_id = cc.character_id
where h.character_name in ('Batman', 'Superman');
but this only gets me a list of comics featuring either Batman OR Superman, rather than comics with both Batman AND Superman in them.
I've also tried this, which doesn't return anything:
sqlite3 comics.db
select comic_name
from comics as c, comics_characters as cc, characters as h
on (c.comic_id = cc.comic_id and h.character_id = cc.character_id)
where (h.character_name = 'Batman' and h.character_name = 'Superman');
I've tried other variations but can't get the desired result.

The OR route doesn't work, for reasons you've worked out: you get rows for Superman or Batman, and a comic that has both characters produces two rows with the same comic id and different character ids. The AND route doesn't work because a single row cannot be both characters at once.
So you need to use the OR route to get comics with one or both characters, and then use a count to keep only the comics with both. Essentially: "filter to Superman or Batman, then filter again to only the comic ids that appear on two rows", or "filter to only Batman or Superman, then group the rows by comic id and keep only the groups that contain two rows". The broader lesson is that database rows are treated as separate entities, and when you want to treat several of them as one you have to group them. We identify comics by an attribute of the group after (deliberately) losing the detail of exactly which rows the group contains:
SELECT comic_id
FROM comics_characters
WHERE character_id IN (1,2) --Batman or Superman
GROUP BY comic_id
HAVING COUNT(*) = 2
The number on the right-hand side of the = must equal the number of character IDs in the IN() list. If you use IN with 4 character IDs, use COUNT(*) = 4. This assumes each (comic_id, character_id) pair appears only once in the table.
You can join in other tables so you can use names etc; I simplified this to make the point without extraneous detail
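As a rough sketch of what joining in the other tables could look like, the snippet below runs the same filter-then-group idea through Python's built-in sqlite3 module against an in-memory database. The table and column names follow the question; the sample rows, the comic titles and the helper name comics_with_all are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comics (comic_id INTEGER PRIMARY KEY, comic_name TEXT, comic_date TEXT);
    CREATE TABLE characters (character_id INTEGER PRIMARY KEY, character_name TEXT);
    CREATE TABLE comics_characters (
        comic_id INTEGER, character_id INTEGER,
        PRIMARY KEY (comic_id, character_id)  -- each pairing stored once
    );
    -- Hypothetical sample rows, not from the question.
    INSERT INTO comics (comic_id, comic_name) VALUES
        (1, 'Team-Up'), (2, 'Solo Bat'), (3, 'Solo Super'), (4, 'Trinity');
    INSERT INTO characters VALUES (1, 'Batman'), (2, 'Superman'), (3, 'Wonder Woman');
    INSERT INTO comics_characters VALUES
        (1, 1), (1, 2), (2, 1), (3, 2), (4, 1), (4, 2), (4, 3);
""")


def comics_with_all(conn, names):
    """Return names of comics that feature every character in `names`."""
    wanted = sorted(set(names))                     # drop accidental duplicates
    placeholders = ", ".join("?" for _ in wanted)   # one ? per character name
    sql = f"""
        SELECT c.comic_name
        FROM comics AS c
        JOIN comics_characters AS cc ON cc.comic_id = c.comic_id
        JOIN characters AS h ON h.character_id = cc.character_id
        WHERE h.character_name IN ({placeholders})
        GROUP BY c.comic_id, c.comic_name
        HAVING COUNT(DISTINCT h.character_name) = ?
        ORDER BY c.comic_name
    """
    # The last parameter keeps the HAVING count in step with the IN list.
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]


both = comics_with_all(conn, ["Batman", "Superman"])
print(both)
```

Passing the list length as the last parameter means the count cannot drift out of sync with the IN list when more characters are added.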
Footnote: this technique will find comics that feature at least Batman and Superman. The comic could well contain other characters too, but we lost those at the WHERE stage, before the GROUP BY. If you wanted comics that featured ONLY Batman and Superman, that's a different problem. For that we could group first and count conditionally: give Batman or Superman a score of 1 and everyone else a score of 10. Comics with only Batman and Superman would score 2, comics featuring only one of them would score 1, and anyone else's presence would push the score to 10 or more, so we could filter on a score of 2.
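The scoring idea in the footnote might look something like the following sketch, again using Python's sqlite3 with an in-memory table. The character ids (1 = Batman, 2 = Superman, 3 = someone else) and the sample rows are assumptions, and the approach relies on each (comic_id, character_id) pair being stored only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comics_characters (
        comic_id INTEGER, character_id INTEGER,
        PRIMARY KEY (comic_id, character_id)
    );
    -- Hypothetical ids: 1 = Batman, 2 = Superman, 3 = any other character.
    INSERT INTO comics_characters VALUES
        (1, 1), (1, 2),          -- both, nobody else   -> score 2
        (2, 1),                  -- Batman only         -> score 1
        (3, 2),                  -- Superman only       -> score 1
        (4, 1), (4, 2), (4, 3);  -- both plus a third   -> score 12
""")

# No WHERE clause here: every character must reach the GROUP BY so that
# outsiders can push the score past 2.
only_both = [row[0] for row in conn.execute("""
    SELECT comic_id
    FROM comics_characters
    GROUP BY comic_id
    HAVING SUM(CASE WHEN character_id IN (1, 2) THEN 1 ELSE 10 END) = 2
    ORDER BY comic_id
""")]
print(only_both)
```

The weight of 10 only works while the wanted list has fewer than 10 characters; a larger list would need a larger penalty value.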

Related

Tricks to exceed column limitations in SQL Database

Hello swarm intelligence!
I have the following use case: for every movie that is requested by a user, I create a number of tags for that specific movie, derived from several sources (actors, plot etc.).
I will use this data for association mining.
The problem: if I use movies as rows and tags as columns, the tags will easily exceed the technical limit of 3000 columns (there are even more actors than that, and then plot keywords etc.).
Is there any way I can organize this data so that I can then use it for (quick) association mining?
Thanks a lot
Don't put tags in columns. Instead create a separate table, named something like movie_tags with two columns, movie_id and tag. Put each tag in a separate row of that table.
This is known as "normalizing" your data. Here's a nice walkthrough with an example very similar to yours.
Edit: Let's say you have a catalog of movies about the Italian Mafia in New York City in the 20th century. Let's say the movies are
1 Godfather
2 Goodfellas
3 Godfather II
Then your movie_tags table might contain these rows.
1 Gangsters
2 Gangsters
3 Gangsters
1 Francis Ford Coppola
3 Francis Ford Coppola
2 Martin Scorsese
Pro tip: if you find yourself thinking about putting lots of data items with the same meaning in their own columns, you probably need to normalize the data and add appropriate tables.
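To make the one-row-per-tag layout concrete, here is a small sketch using Python's sqlite3 with the example rows above. The movie_tags table and its movie_id and tag columns come from this answer; the movies table and the pair-counting query are assumptions, shown only as one plausible starting point for association mining (counting how often two tags occur on the same movie).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE movie_tags (
        movie_id INTEGER REFERENCES movies(movie_id),
        tag      TEXT,
        PRIMARY KEY (movie_id, tag)   -- a tag is attached to a movie at most once
    );
    INSERT INTO movies VALUES (1, 'Godfather'), (2, 'Goodfellas'), (3, 'Godfather II');
    INSERT INTO movie_tags VALUES
        (1, 'Gangsters'), (2, 'Gangsters'), (3, 'Gangsters'),
        (1, 'Francis Ford Coppola'), (3, 'Francis Ford Coppola'),
        (2, 'Martin Scorsese');
""")

# Self-join on movie_id to list tag pairs on the same movie;
# a.tag < b.tag keeps each unordered pair once.
pair_counts = conn.execute("""
    SELECT a.tag, b.tag, COUNT(*) AS movies_with_both
    FROM movie_tags AS a
    JOIN movie_tags AS b
      ON a.movie_id = b.movie_id AND a.tag < b.tag
    GROUP BY a.tag, b.tag
    ORDER BY movies_with_both DESC, a.tag, b.tag
""").fetchall()

for first, second, n in pair_counts:
    print(first, "+", second, "->", n)
```

Adding a new tag is then just another row, so the number of distinct tags is no longer bounded by a column limit.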

How does SQL count(distinct) work in this case?

I'm trying to find the match no in which Germany played against Poland. This is from https://www.w3resource.com/sql-exercises/soccer-database-exercise/sql-subqueries-exercise-soccer-database-4.php. There are two tables : match_details and soccer_country. I don't understand how the count(distinct) works in this case. Can someone please clarify? Thanks!
SELECT match_no
FROM match_details
WHERE team_id = (
        SELECT country_id
        FROM soccer_country
        WHERE country_name = 'Germany')
   OR team_id = (
        SELECT country_id
        FROM soccer_country
        WHERE country_name = 'Poland')
GROUP BY match_no
HAVING COUNT(DISTINCT team_id) = 2;
As Lamak mentioned, that is an ugly way to write the query, but there are many ways to approach it.
As mentioned, COUNT(DISTINCT team_id) makes sure that there are 2 unique teams. If there is ever a Cartesian result, the same team can be repeated across multiple rows for a match; counting distinct TEAM_ID values ignores that repetition.
That said, other "team" data structures I have seen have a single record per match, with a column for EACH TEAM playing the match. That is easier by a long shot, but this is still a relatively easy query.
Break the query down a little, and consider a large-scale set of data (not that this, or even any professional league, would have record counts large enough to slow down a SQL engine).
Your first criterion is games with Germany, so let's start with that.
SELECT
    md1.match_no
FROM
    match_details md1
        JOIN soccer_country sc1
            on md1.team_id = sc1.country_id
            AND sc1.country_name = 'Germany'
So why even look at any other match if Germany is not part of it on either side? This alone returns 6 matches from the sample data of 51. Now all you need to do is join to the match_details table a second time, for only those matches, requiring that the second team is Poland:
SELECT
    md1.match_no
FROM
    match_details md1
        JOIN soccer_country sc1
            on md1.team_id = sc1.country_id
            AND sc1.country_name = 'Germany'
        -- joining again for the same match Germany was already qualified
        JOIN match_details md2
            on md1.match_no = md2.match_no
            -- but we want the OTHER team record since Germany was first team
            and md1.team_id != md2.team_id
        -- and on to the second country table based on the SECOND team ID
        JOIN soccer_country sc2
            on md2.team_id = sc2.country_id
            -- and the second team was Poland
            AND sc2.country_name = 'Poland'
Yes, it may be a longer query, but by eliminating the 45 other matches up front (again, think of a LARGE database) you avoid churning through tons of data and work on a small, finite set, finishing with only the Germany / Poland matches. No aggregates, counts or DISTINCTs, just direct joins.
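For anyone who wants to see the two approaches side by side, the sketch below runs both the HAVING COUNT(DISTINCT ...) query and the double-join query through Python's sqlite3 on a tiny in-memory data set. The table and column names are from the question; the country ids and the three matches are invented and are not the exercise's real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE soccer_country (country_id INTEGER PRIMARY KEY, country_name TEXT);
    CREATE TABLE match_details (match_no INTEGER, team_id INTEGER);
    -- Hypothetical ids and matches, two rows per match (one per team).
    INSERT INTO soccer_country VALUES (1, 'Germany'), (2, 'Poland'), (3, 'France');
    INSERT INTO match_details VALUES
        (10, 1), (10, 3),   -- Germany v France
        (11, 1), (11, 2),   -- Germany v Poland
        (12, 2), (12, 3);   -- Poland v France
""")

having_sql = """
    SELECT match_no
    FROM match_details
    WHERE team_id = (SELECT country_id FROM soccer_country WHERE country_name = 'Germany')
       OR team_id = (SELECT country_id FROM soccer_country WHERE country_name = 'Poland')
    GROUP BY match_no
    HAVING COUNT(DISTINCT team_id) = 2
"""

join_sql = """
    SELECT md1.match_no
    FROM match_details md1
    JOIN soccer_country sc1
      ON md1.team_id = sc1.country_id AND sc1.country_name = 'Germany'
    JOIN match_details md2
      ON md1.match_no = md2.match_no AND md1.team_id != md2.team_id
    JOIN soccer_country sc2
      ON md2.team_id = sc2.country_id AND sc2.country_name = 'Poland'
"""

via_having = [row[0] for row in conn.execute(having_sql)]
via_joins = [row[0] for row in conn.execute(join_sql)]
print(via_having, via_joins)
```

On clean data like this, both queries pick out the same single match.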
FEEDBACK
Let's take a look at some BAD sample data... which, as all programmers know, never exists (NOT). Anyhow, let's take a look at these few matches (team names are shown in place of IDs for simplicity).
Match  Team ID
52     Poland
52     Poland
53     Germany
53     Germany
If you ran the query without the DISTINCT on team, both match 52 and match 53 would show up, because Poland appears 2 times for match 52 and Germany appears 2 times for match 53. With COUNT(DISTINCT team_id), each of those matches counts only 1 team and is therefore excluded. Does that help? Again, no such thing as bad data :)
And yet another sample, where a match has more than 2 teams or only one:
Match  Team ID
54     France
54     Poland
54     England
55     Hungary
56     Austria
None of these matches would be returned. Match 54 has 3 distinct teams (and only one of them, Poland, survives the Germany / Poland filter), and matches 55 and 56 have only a single entry each, thus no opponent to compete against.
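The effect of DISTINCT on this kind of bad data can be checked with a short sketch using Python's sqlite3. Team names stand in for team ids, as in the tables above; match 51 is an extra, made-up clean Germany v Poland match added so that something is returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Team names stand in for team ids to keep the example readable.
    CREATE TABLE match_details (match_no INTEGER, team TEXT);
    INSERT INTO match_details VALUES
        (51, 'Germany'), (51, 'Poland'),   -- hypothetical clean match
        (52, 'Poland'),  (52, 'Poland'),   -- same team entered twice
        (53, 'Germany'), (53, 'Germany'),  -- same team entered twice
        (54, 'France'),  (54, 'Poland'), (54, 'England'),
        (55, 'Hungary'),
        (56, 'Austria');
""")

QUERY = """
    SELECT match_no
    FROM match_details
    WHERE team IN ('Germany', 'Poland')
    GROUP BY match_no
    HAVING {count_expr} = 2
    ORDER BY match_no
"""


def matches(count_expr):
    """Run the grouped query with the given counting expression."""
    return [row[0] for row in conn.execute(QUERY.format(count_expr=count_expr))]


with_star = matches("COUNT(*)")                  # counts rows, duplicates included
with_distinct = matches("COUNT(DISTINCT team)")  # counts different teams only
print(with_star, with_distinct)
```

COUNT(*) lets the duplicated matches 52 and 53 through, while COUNT(DISTINCT team) keeps only the match that really has both teams.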
2nd FEEDBACK
To clarify the query: in the short query for just Germany, the aliased instance "md1" is already sitting on a record for a Germany match. For the second join, to "md2", I only care about the same match, so I join on the same match_no. In that join, "!=" means NOT EQUAL ("!" is logical NOT). So the join says: from md1, join to md2 on the same match number, but only where the teams are NOT the same. The first instance holds Germany's team ID (already qualified), so the second gives me the other team's ID. I can then use that second (md2) team ID to join to the country table and keep only Poland.
Does this now clarify things for you?

SUM column values based off differing results of separate column

Consider a relation that contained the names and number of locations of restaurants including split and stand alone restaurants:
RESTAURANT: NUM_OF_LOC
Pizza Hut 1
Pizza Hut/Taco Bell 2
Taco Bell 2
Also assume you will not know in advance the restaurant names, whether a row is stand-alone or split, or the number of locations. The only consistent piece is the "/" character between the names in a split restaurant.
How can I return the above table with the location counts of the stand-alone restaurants summed into the count of the split restaurant, in descending order, like so:
RESTAURANT: NUM_OF_LOC
Pizza Hut/Taco Bell 5
Taco Bell 2
Pizza Hut 1
So are you looking to get the count of all restaurants for just Taco Bell and Pizza Hut where a joint counts as 1 for each or are you looking to count all occurrences of each variant?
I'm thinking you aren't just looking for totals and are looking to tear apart the combined restaurants so you can do something like
SELECT name, count(*)
FROM restaurants
WHERE CONTAINS (name, 'Taco Bell')
Looking again, it seems you want the consolidated groups to include all occurrences of either, which would be something like:
CREATE TABLE sums AS
SELECT name, count(*)
FROM restaurants
WHERE CONTAINS (name, 'Taco Bell')
OR CONTAINS(name, 'Pizza Hut')
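If the goal is the summed table from the question rather than a count for hard-coded names, one possible approach is a self-join in which every row collects the rows whose name is one of its "/"-separated parts. The sketch below tries this with Python's sqlite3; the restaurants table, its name and num_of_loc columns, and the use of SQLite's LIKE and || are assumptions, and it has only been reasoned through for the three sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurants (name TEXT PRIMARY KEY, num_of_loc INTEGER);
    INSERT INTO restaurants VALUES
        ('Pizza Hut', 1),
        ('Pizza Hut/Taco Bell', 2),
        ('Taco Bell', 2);
""")

# Wrap both names in "/" so 'Taco Bell' matches the part '/Taco Bell/'
# of '/Pizza Hut/Taco Bell/' without matching partial names.
totals = conn.execute("""
    SELECT s.name, SUM(r.num_of_loc) AS total
    FROM restaurants AS s
    JOIN restaurants AS r
      ON '/' || s.name || '/' LIKE '%/' || r.name || '/%'
    GROUP BY s.name
    ORDER BY total DESC, s.name
""").fetchall()

for name, total in totals:
    print(name, total)
```

Names containing the LIKE wildcards % or _ would need escaping, and the other database engines have different string functions, so treat this as an illustration of the idea rather than a portable query.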

Populating column for Oracle Text search from 2 tables

I am investigating the benefits of Oracle Text search, and currently am looking at collecting search text data from multiple (related) tables and storing the data in the smaller table in a 1-to-many relationship.
Consider these 2 simple tables, house and inhabitants, and there are NEVER any uninhabited houses:
HOUSE
ID  Address           Search_Text
1   44 Some Road
2   31 Letsby Avenue
3   18 Moon Crescent
INHABITANT
ID  House  Name           Nickname
1   1      Jane Doe       Janey
2   1      John Doe       JD
3   2      Jo Smythe      Smithy
4   2      Percy Plum     PC
5   3      Apollo Lander  Moony
I want to write SQL that updates the HOUSE.Search_Text column with text from INHABITANT. Because this is a 1-to-many relationship, the SQL needs to collect the INHABITANT rows for each matching row in HOUSE, combine the data (comma separated), and update the Search_Text field.
Once done, the Oracle Text search index on HOUSE.Search_Text will return me HOUSEs that match the search criteria, and I can look up INHABITANTs accordingly.
Of course, this is a very simplified example, I want to pick up data from many columns and Full Text Search across fields in both tables.
With the help of a colleague we've got:
select id, ADDRESS||'; '||Names||'; '||Nicknames as Search_Text
from house
left join (
    SELECT distinct house_id,
           LISTAGG(NAME, ', ') WITHIN GROUP (ORDER BY NAME) OVER (PARTITION BY house_id) as Names,
           LISTAGG(NICKNAME, ', ') WITHIN GROUP (ORDER BY NICKNAME) OVER (PARTITION BY house_id) as Nicknames
    FROM INHABITANT
) i on house.id = i.house_id;
which returns:
1 44 Some Road; Jane Doe, John Doe; JD, Janey
2 31 Letsby Avenue; Jo Smythe, Percy Plum; PC, Smithy
3 18 Moon Crescent; Apollo Lander; Moony
Some questions:
Is this an efficient query to return this data? I'm slightly concerned about the distinct.
Is this the right way to use Oracle Text search across multiple text fields?
How to update House.Search_Text with the results above? I think I need a correlated subquery, but can't quite work it out.
Would it be more efficient to create a new table containing House_ID and Search_Text only, rather than update House?

mysql where IN on large dataset or Looping?

I have the following scenario:
Table 1:
articles
id  article_text   category  author_id
1   "hello world"  4         1
2   "hi"           5         2
3   "wasup"        4         3
Table 2:
authors
id  name    friends_with
1   "Joe"   "Bob"
2   "Sue"   "Joe"
3   "Fred"  "Bob"
I want to know the total number of authors that are friends with "Bob" for a given category.
So for example, for category 4, how many authors are there that are friends with "Bob"?
The authors table is quite large; in some cases I have a million authors that are friends with "Bob".
So I have tried:
Get the list of authors that are friends with Bob, then loop through them, get each one's count for the given category, and sum all those together in my code.
The issue with this approach is that it can generate a million queries. Even though they are very fast, it seems there should be a better way.
I was thinking of getting the list of authors that are friends with Bob and then building an IN clause from that list, but I fear that would exceed the amount of memory allowed for the query.
This seems like a common problem. Any ideas?
thanks
SELECT COUNT(DISTINCT auth.id)
FROM authors auth
INNER JOIN articles art ON auth.id = art.author_id
WHERE friends_with = 'bob' AND art.category = 4
COUNT(DISTINCT auth.id) is required because the join to articles can produce multiple rows for each author.
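A quick way to see why the DISTINCT matters is to run the join on the sample tables with one extra article. The sketch below uses Python's sqlite3 rather than MySQL, so the literal is written as 'Bob' to match the stored value (SQLite compares text case-sensitively by default); the fourth article row is made up so that one author has two articles in category 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT, friends_with TEXT);
    CREATE TABLE articles (
        id INTEGER PRIMARY KEY, article_text TEXT, category INTEGER, author_id INTEGER
    );
    INSERT INTO authors VALUES (1, 'Joe', 'Bob'), (2, 'Sue', 'Joe'), (3, 'Fred', 'Bob');
    INSERT INTO articles VALUES
        (1, 'hello world', 4, 1),
        (2, 'hi',          5, 2),
        (3, 'wasup',       4, 3),
        (4, 'second post', 4, 1);  -- hypothetical: Joe writes again in category 4
""")

# One joined row per matching article, so Joe shows up twice.
row_count, author_count = conn.execute("""
    SELECT COUNT(*), COUNT(DISTINCT auth.id)
    FROM authors auth
    INNER JOIN articles art ON auth.id = art.author_id
    WHERE auth.friends_with = 'Bob' AND art.category = 4
""").fetchone()

print("joined rows:", row_count, "distinct authors:", author_count)
```

The whole count comes back from a single query, so there is no need to loop over authors or build a large IN list in application code.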
But if you have any control over the database, I would use a link table for friends_with. Your current design either has to store a comma-separated list of names, which will be disastrous for performance and require a completely different query, or limits each author to a single friend.
Friends
id friend_id
then the query would look like this
SELECT COUNT(DISTINCT auth.id)
FROM authors auth
INNER JOIN articles art ON auth.id = art.author_id
INNER JOIN friends f ON auth.id = f.id
INNER JOIN authors fauth ON fauth.id = f.friend_id
WHERE fauth.name = 'bob' AND art.category = 4
It's more complex but allows for many friends. Just remember, this construct calls for 2 rows in friends for each pair: one from Joe to Bob and one from Bob to Joe.
You could build it differently, but that would make the query even more complex.
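Here is a minimal sketch of the link-table layout, run through Python's sqlite3. The friends table with its id and friend_id columns is the one proposed above; giving Bob his own authors row (id 4) and the specific friendship rows are assumptions made so the example has something to join to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE articles (
        id INTEGER PRIMARY KEY, article_text TEXT, category INTEGER, author_id INTEGER
    );
    CREATE TABLE friends (
        id INTEGER, friend_id INTEGER,
        PRIMARY KEY (id, friend_id)
    );
    -- Hypothetical: Bob is stored as author 4.
    INSERT INTO authors VALUES (1, 'Joe'), (2, 'Sue'), (3, 'Fred'), (4, 'Bob');
    INSERT INTO articles VALUES
        (1, 'hello world', 4, 1), (2, 'hi', 5, 2), (3, 'wasup', 4, 3);
    -- Two rows per friendship, one in each direction.
    INSERT INTO friends VALUES
        (1, 4), (4, 1),   -- Joe  <-> Bob
        (3, 4), (4, 3),   -- Fred <-> Bob
        (2, 1), (1, 2);   -- Sue  <-> Joe
""")

friend_count = conn.execute("""
    SELECT COUNT(DISTINCT auth.id)
    FROM authors auth
    INNER JOIN articles art ON auth.id = art.author_id
    INNER JOIN friends f ON auth.id = f.id
    INNER JOIN authors fauth ON fauth.id = f.friend_id
    WHERE fauth.name = 'Bob' AND art.category = 4
""").fetchone()[0]

print(friend_count)
```

Joe has two friends in this data, yet he is counted once, because the filter on fauth.name keeps only his link to Bob.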
Maybe something like
select fr.name,
fr.id,
au.name,
ar.article_text,
ar.category,
ar.author_id
from authors fr, authors au, articles ar
where fr.id = ar.author_id
and au.friends_with = fr.name
and ar.category = 4 ;
Just the count...
select count(distinct fr.name)
from authors fr, authors au, articles ar
where fr.id = ar.author_id
and au.friends_with = fr.name
and ar.category = 4 ;
A version without using joins (hopefully will work!)
SELECT count(distinct id) from authors where friends_with = 'Bob' and id in(select author_id from articles where category = 4)
I found statements with IN easier to understand when I started out with SQL.