Retrieving duplicate values in column SQL database - sql

I have a small database which contains a table that holds information in each row on a Movie (e.g, Movie Name, Movie Runtime, Movie Rating) and I also have a separate Genre table which contains a list of genres (Horror, Action etc).
I have an association table which links a movie to a genre (a typical row will contain the unique Id for that row, the genreId and the movieId).
I have written a query which pulls back all the genres a user has watched; however, it is removing the duplicate row values and is giving me what seems to be a distinct count.
Below is the SQL statement:
SELECT g.Type,
g.Id
FROM GenreTable g
WHERE
g.Id in (
SELECT gma.GenreId
FROM MovieGenreAssociationTable gma
WHERE gma.MovieId in (
SELECT uma.MovieSeriesId
FROM UserMovieAssociationTable uma
WHERE uma.UserId = '1'
)
);
This returns all of the genres a user has watched, but I'm noticing that it's not bringing back the duplicates which I know exist in the association table.
How do I get those duplicates?

You are not making a JOIN but a SELECT on a single table, so it will never return any duplicates unless they exist in GenreTable.
If you do something like SELECT a FROM tbl WHERE b IN (1,1,1,1,1), it will return only one row -- not five. And even if you have a complicated WHERE there, it's still a simple IN clause.
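To see this for yourself, here is a minimal sketch using Python's built-in sqlite3 module. The table tbl and its single row are made up for the illustration; other engines should behave the same way for this query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table with exactly one row where b = 1.
    CREATE TABLE tbl (a TEXT, b INTEGER);
    INSERT INTO tbl VALUES ('only row', 1);
""")

# Repeating 1 in the IN list does not multiply the matching row:
# IN is only a yes/no test applied to each row of tbl.
rows = conn.execute("SELECT a FROM tbl WHERE b IN (1, 1, 1, 1, 1)").fetchall()
print(rows)  # [('only row',)]
```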
update: quick and dirty refresher on JOINs.
I'd actually suggest you look for a SQL tutorial. I make no claim about the completeness of this note - rather, the contrary. First google hit, second hit, etc.
Say that you have two simple tables:
a.id  a.a      b.id  b.b
1     1        1     'Hello'
2     1        2     'World'
3     2        7     'foobar'
4     3
If you run a JOIN between a and b, ON(a.a = b.id), the query will select all records in a; each of them will be then joined to all matching records in b. This is what JOIN is for.
In this case, the second and third columns will always be equal:
1 1 1 'Hello'
2 1 1 'Hello'
3 2 2 'World'
Notice that the fourth row of a is discarded because it has no matches, and the third row of b is never selected at all. The first row of b is selected twice, because there are two rows of a which match it.
A LEFT JOIN works the same, except that if there are no matches for the left side of the query (i.e. table a), as it happens for the fourth row, that row is selected all the same; but the extra fields that would have come from b are replaced by NULLs. You get a further row for which the JOIN clause, ON(a.a = b.id), is actually false:
4 3 NULL NULL
(And you can use this to select the rows of a that have no matches in b: just specify e.g. WHERE b.primary_key_of_b IS NULL).
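The two sample tables can be tried out with a small sqlite3 sketch like the one below. The data is the same as in the tables above; using b.id as the primary key of b is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, a INTEGER);
    CREATE TABLE b (id INTEGER PRIMARY KEY, b TEXT);
    INSERT INTO a VALUES (1, 1), (2, 1), (3, 2), (4, 3);
    INSERT INTO b VALUES (1, 'Hello'), (2, 'World'), (7, 'foobar');
""")

# Plain (inner) JOIN: rows of a without a match in b are dropped.
inner_rows = conn.execute(
    "SELECT a.id, a.a, b.id, b.b FROM a JOIN b ON a.a = b.id ORDER BY a.id"
).fetchall()

# LEFT JOIN: every row of a is kept; missing b columns come back as NULL (None).
left_rows = conn.execute(
    "SELECT a.id, a.a, b.id, b.b FROM a LEFT JOIN b ON a.a = b.id ORDER BY a.id"
).fetchall()

# Rows of a that have no match in b at all.
unmatched = conn.execute(
    "SELECT a.id FROM a LEFT JOIN b ON a.a = b.id WHERE b.id IS NULL"
).fetchall()

print(inner_rows)  # [(1, 1, 1, 'Hello'), (2, 1, 1, 'Hello'), (3, 2, 2, 'World')]
print(left_rows[-1])  # (4, 3, None, None)
print(unmatched)  # [(4,)]
```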
Your case
You should do something like:
SELECT
g.Type,
g.Id
FROM GenreTable AS g
JOIN MovieGenreAssociationTable AS gma ON (gma.GenreId = g.Id)
JOIN UserMovieAssociationTable AS uma ON (uma.MovieSeriesId = gma.MovieId)
WHERE uma.UserId = 1;
You can then GROUP BY e.g. Type and Id to get the COUNT() of movies watched for each genre.
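A rough sketch of that GROUP BY, run through sqlite3 so it can be tested quickly. The column layout follows the question, but the sample rows (two horror movies, one action movie, all watched by user 1) and the MoviesWatched alias are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE GenreTable (Id INTEGER PRIMARY KEY, Type TEXT);
    CREATE TABLE MovieGenreAssociationTable (
        Id INTEGER PRIMARY KEY, GenreId INTEGER, MovieId INTEGER);
    CREATE TABLE UserMovieAssociationTable (UserId INTEGER, MovieSeriesId INTEGER);
    -- Hypothetical sample data.
    INSERT INTO GenreTable VALUES (1, 'Horror'), (2, 'Action');
    INSERT INTO MovieGenreAssociationTable VALUES (1, 1, 10), (2, 1, 11), (3, 2, 12);
    INSERT INTO UserMovieAssociationTable VALUES (1, 10), (1, 11), (1, 12);
""")

# One output row per genre, with the number of joined (watched) movies.
genre_counts = conn.execute("""
    SELECT g.Type, g.Id, COUNT(*) AS MoviesWatched
    FROM GenreTable AS g
    JOIN MovieGenreAssociationTable AS gma ON gma.GenreId = g.Id
    JOIN UserMovieAssociationTable AS uma ON uma.MovieSeriesId = gma.MovieId
    WHERE uma.UserId = 1
    GROUP BY g.Type, g.Id
    ORDER BY g.Id
""").fetchall()
print(genre_counts)  # [('Horror', 1, 2), ('Action', 2, 1)]
```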
But...
Say that you have a GenreTable with two rows (Id=123, Type="Science Fiction" and Id=456, Type="Comedy"), a Movie table with one row (777, "Galaxy Quest"), a MovieGenreAssociationTable with (123, 777) and (456, 777) because that movie is a great comedy too, and finally user 1 watched only movie 777. You would get:
Genre                    gma        uma      Movie
123 "Science Fiction"    123, 777   777, 1   777, "Galaxy Quest"
456 "Comedy"             456, 777   777, 1   777, "Galaxy Quest"
and would see that user 1 has seen two movies - one SciFi, one Comedy.
In this case you need to either accept the result (how many comedies did he watch? One. How many SciFis? One), or make a more complicated query for which you must decide which is the main genre. Otherwise you would get illogical results ("How many comedies? One. How many movies? One. Then number of non-comedies is one minus one, ie, zero? No, it is again one - wait, what?").
In this case you could add a column for this purpose in MovieGenreAssociationTable, a boolean column "IsMainGenre". So when you want to know how many comedies one watched, you would do as above. But when you split movies by genre, you add AND IsMainGenre=1 and you count "Galaxy Quest" among SciFis, but not among comedies or parodies.
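Here is a hedged sketch of the IsMainGenre idea, again using sqlite3. The IsMainGenre column is the proposed addition, not something that exists in the question's schema, and the rows are the "Galaxy Quest" example from above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE GenreTable (Id INTEGER PRIMARY KEY, Type TEXT);
    -- IsMainGenre is the proposed extra column (1 = main genre of the movie).
    CREATE TABLE MovieGenreAssociationTable (
        GenreId INTEGER, MovieId INTEGER, IsMainGenre INTEGER);
    CREATE TABLE UserMovieAssociationTable (UserId INTEGER, MovieSeriesId INTEGER);
    INSERT INTO GenreTable VALUES (123, 'Science Fiction'), (456, 'Comedy');
    INSERT INTO MovieGenreAssociationTable VALUES (123, 777, 1), (456, 777, 0);
    INSERT INTO UserMovieAssociationTable VALUES (1, 777);
""")

query = """
    SELECT g.Type, COUNT(*) AS MoviesWatched
    FROM GenreTable AS g
    JOIN MovieGenreAssociationTable AS gma ON gma.GenreId = g.Id
    JOIN UserMovieAssociationTable AS uma ON uma.MovieSeriesId = gma.MovieId
    WHERE uma.UserId = 1 {extra}
    GROUP BY g.Type, g.Id
    ORDER BY g.Id
"""

# "How many comedies / SciFis did he watch?": every genre of a movie counts.
all_genres = conn.execute(query.format(extra="")).fetchall()
# "Split the movies by genre": each movie counts once, under its main genre.
main_only = conn.execute(query.format(extra="AND gma.IsMainGenre = 1")).fetchall()

print(all_genres)  # [('Science Fiction', 1), ('Comedy', 1)]
print(main_only)   # [('Science Fiction', 1)]
```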

Related

SQL column compare

I have 3 columns in the same SQL table: one for a number, one for a name, and one for unrelated data. Each number repeats a certain number of times with a name next to it. A name can't appear twice under the same number, but the same name can appear under multiple different numbers. I need to write an SQL query to find which names have appeared under the same number the most times. Any help will be much appreciated.
Example: the SQL query should find which names have been grouped together the most.
1 Bill
1 Bob
1 Dave
2 Bob
2 John
2 Bill
To confirm - you would like to find
The pairs of names that occur together within a 'number'
Of those, find the pair that occurs most often
The trick here is to get all the pairs, then count how many 'numbers' that pair appears in.
To get the pairs, join the table to itself (on the number) - and then to only have one pairing in each, also join on name with the first in the pair < second in the pair.
The answer to this question depends on your database (SQL Server, MySQL, etc). However, here is a fairly generic example written in T-SQL that does most of the work: it shows the counts and orders them by the relevant count.
Feel free to get the TOP or LIMIT 1 just to get a pair with the most matches (noting that if there is a tie, only one would be chosen this way)
Alternatively modify the query to work out what the maximum number is, then get the pairs with that number.
CREATE TABLE NameGrps (NameNum int, Name varchar(30));
INSERT INTO NameGrps (NameNum, Name)
VALUES
(1, 'Bill'),
(1, 'Bob'),
(1, 'Dave'),
(2, 'Bob'),
(2, 'John'),
(2, 'Bill');
SELECT NamePairs.FirstInPair, NamePairs.SecondInPair, COUNT(NameNum) AS Num_Paired
FROM
(SELECT A.Name AS FirstInPair, B.Name AS SecondInPair, A.NameNum
FROM NameGrps A
INNER JOIN NameGrps B ON A.NameNum = B.NameNum AND A.Name < B.Name
) AS NamePairs
GROUP BY NamePairs.FirstInPair, NamePairs.SecondInPair
ORDER BY COUNT(NameNum) DESC, NamePairs.FirstInPair, NamePairs.SecondInPair;
And here are the results of the above
FirstInPair SecondInPair Num_Paired
Bill Bob 2
Bill Dave 1
Bill John 1
Bob Dave 1
Bob John 1
If you take a TOP or LIMIT 1 of that, it will find the pair of Bill and Bob is the most frequent.
Here is a db<>fiddle with the query, as well as additional information (e.g., what the sub-query does, and adding a TOP 1 version).
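For the alternative mentioned earlier (work out the maximum, then keep every pair with that count, so ties are not lost), a possible sketch follows. It uses a common table expression, named PairCounts here purely for illustration, and runs on SQLite through Python; the same SQL should need little or no change on SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE NameGrps (NameNum int, Name varchar(30));
    INSERT INTO NameGrps (NameNum, Name) VALUES
        (1, 'Bill'), (1, 'Bob'), (1, 'Dave'),
        (2, 'Bob'), (2, 'John'), (2, 'Bill');
""")

# Count each pair, then keep only the pairs whose count equals the maximum.
top_pairs = conn.execute("""
    WITH PairCounts AS (
        SELECT A.Name AS FirstInPair, B.Name AS SecondInPair,
               COUNT(*) AS Num_Paired
        FROM NameGrps A
        INNER JOIN NameGrps B ON A.NameNum = B.NameNum AND A.Name < B.Name
        GROUP BY A.Name, B.Name
    )
    SELECT FirstInPair, SecondInPair, Num_Paired
    FROM PairCounts
    WHERE Num_Paired = (SELECT MAX(Num_Paired) FROM PairCounts)
    ORDER BY FirstInPair, SecondInPair
""").fetchall()
print(top_pairs)  # [('Bill', 'Bob', 2)]
```

If two pairs were tied for the top count, both would be returned, which a TOP 1 or LIMIT 1 would not do.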

Can I get duplicate results (from one table) in an INTERSECT operation between two tables?

I know the wording of the question is awkward, but I couldn't phrase it any better. Let me explain the situation.
There's table A which has a bunch of columns (a, b, c ... ) and I run a SELECT query on it like so:
SELECT a FROM A WHERE b IN ('....') (the ellipsis indicates a number of values to be matched to)
There's another table B which has a bunch of columns (d, e, f ... ) and I run a SELECT query on it like so:
SELECT d FROM B WHERE f = '...' (the ellipsis indicates a single value to be matched to)
Now I should say here that the two tables store different types of information about the same entity, but the columns a and d contain the exact same data (in this case, an ID). I want to find out the intersection of the two tables so I run this:
SELECT a FROM A WHERE b IN ('....') INTERSECT SELECT d FROM B WHERE f = '...'
Now here's the problem:
The first SELECT contains a set of values in the WHERE clause, right? So let's say the set is (1234, 2345,3456). Now, the result of this query when b is matched ONLY to 1234 is, let's say, abc. When it's matched to 2345, it's def, suppose. And matching to 3456, it gives abc.
Let's suppose these two results (abc and def) are also in the set of results from the second SELECT.
So, now, putting back the entire set of values to matched into the WHERE clause, the INTERSECT operation will give me abc and def. But I want abc twice since two values in the WHERE clause set match to the second SELECT.
Is there any way I can get that?
I hope it's not too complicated to understand my problem. This is a real-life problem I'm facing in my job.
Data structure and my code
Table A contains general information about a company:
company_id | branch_id | no_of_employees | city
Table B contains the financials of the company:
company_id | branch_id | revenue | profits
First SELECT:
SELECT branch_id FROM A WHERE CITY IN ('Dallas', 'Miami', 'New Orleans')
Now, running each city separately in the first SELECT, I get the branch_ids:
branch_id | city
23 | Dallas
45 | Miami
45 | New Orleans
Once again, this seems impractical as to how two cities can have the same branch ids, but please bear with me on this.
Second SELECT:
SELECT branch_id FROM B
WHERE REVENUE = 5000000
I know this is a little impractical, but for the purpose of this example, it suffices.
Running this query I get the following set:
11
23
45
22
10
So the INTERSECT will give me just 23 and 45. But I want 45 twice, since both Miami and New Orleans have that branch_id and that branch_id has generated a revenue of 5 million.
Directly from Microsoft's documentation (https://msdn.microsoft.com/en-us/library/ms188055.aspx):
"INTERSECT returns distinct rows that are output by both the left and right input queries operator."
So NO, it is not possible to get the same value twice when using INTERSECT because the results will be DISTINCT. However if you build an INNER JOIN correctly you can do essentially the same thing as INTERSECT except keep the repetitive results by NOT using distinct or group by.
SELECT
A.a
FROM
A
INNER JOIN B
ON A.a = B.d
AND B.f = '....'
WHERE A.b IN ('....')
And for your specific Example that you edited:
SELECT
A.branch_id
FROM
A
INNER JOIN B
ON A.branch_id = B.branch_id
AND B.REVENUE = 5000000
WHERE A.CITY IN ('Dallas', 'Miami', 'New Orleans')
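A quick way to compare the two behaviours is the sqlite3 sketch below. The branch ids and cities come from the question; the company_id, no_of_employees and profits values are filler invented for the example, and it assumes each branch_id appears only once in B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (company_id INTEGER, branch_id INTEGER,
                    no_of_employees INTEGER, city TEXT);
    CREATE TABLE B (company_id INTEGER, branch_id INTEGER,
                    revenue INTEGER, profits INTEGER);
    -- Filler values for the columns the question does not show.
    INSERT INTO A VALUES (1, 23, 10, 'Dallas'), (1, 45, 20, 'Miami'),
                         (1, 45, 30, 'New Orleans');
    INSERT INTO B VALUES (1, 11, 5000000, 0), (1, 23, 5000000, 0),
                         (1, 45, 5000000, 0), (1, 22, 5000000, 0),
                         (1, 10, 5000000, 0);
""")

# INTERSECT applies DISTINCT, so 45 appears once.
intersect_rows = sorted(conn.execute("""
    SELECT branch_id FROM A WHERE city IN ('Dallas', 'Miami', 'New Orleans')
    INTERSECT
    SELECT branch_id FROM B WHERE revenue = 5000000
""").fetchall())

# INNER JOIN keeps one output row per matching row of A, so 45 appears twice.
join_rows = conn.execute("""
    SELECT A.branch_id
    FROM A
    INNER JOIN B ON A.branch_id = B.branch_id AND B.revenue = 5000000
    WHERE A.city IN ('Dallas', 'Miami', 'New Orleans')
    ORDER BY A.branch_id
""").fetchall()

print(intersect_rows)  # [(23,), (45,)]
print(join_rows)       # [(23,), (45,), (45,)]
```

Note that if B itself held several matching rows for the same branch_id, the join would multiply the rows of A accordingly.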
You overcomplicated your task a lot:
SELECT *
FROM A
WHERE CITY IN (...)
AND EXISTS
(
SELECT 1 FROM B
WHERE B.REVENUE = 5000000
AND B.branch_id = A.branch_id
)
INTERSECT and EXCEPT both return row sets with DISTINCT applied.
Regular joining/filtering operations are not performed by INTERSECT or EXCEPT.

Query to find duplicate values for two fields

Sorry for the title, but I didn't know how to explain it.
I have a table that has 2 fields, A and B.
I want to find all rows in the table that have a duplicate A (more than one record), but A should be considered a duplicate only if B differs between the rows.
Example:
FIELD A Field B
10 10
10 10 // This is not duplicate
10 10
10 5 // this is a duplicate
How do I do this in a single query?
Let's break this down into how you would go about constructing such a query. You don't make it clear whether you're looking for all values of A or all rows but let's assume all values of A initially.
The first step therefore is to create a list of all values of A. This can be done two ways, DISTINCT or GROUP BY. I'm going to use GROUP BY because of what else you want to do:
select a
from your_table
group by a
This returns a single column that is unique on A. Now, how can you change this to give you the unique values? The most obvious thing to use is the HAVING clause, which allows you to restrict on aggregated values. For instance the following will give you all values of A which only appear once in the table
select a
from your_table
group by a
having count(*) = 1
That is the count of all values of A inside the group is 1. You don't want this of course, you want to do this with the column B. You need there to exist more than one value of B in order for the situation you want to identify to be possible (if there's only one value of B then it's impossible). This gets us to
select a
from your_table
group by a
having count(b) > 1
This still isn't enough as you want two different values of B. The above just counts the number of records in the group with a non-null value in column B. Inside an aggregate function you use the DISTINCT keyword to count only unique values; bringing us to:
select a
from your_table
group by a
having count(distinct b) > 1
To transcribe this into English: select all unique values of A from YOUR_TABLE that have more than one distinct value of B in the group.
You can use this method, or something similar, to build up your own queries as you create them. Determine what you want to achieve and slowly build up to it.
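The difference between the last two steps can be checked with a short sqlite3 sketch. The rows for A = 10 are the ones from the question; the A = 20 rows are an extra group made up to show a value of A that repeats with the same B every time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE your_table (a INTEGER, b INTEGER);
    INSERT INTO your_table VALUES (10, 10), (10, 10), (10, 10), (10, 5);
    -- Hypothetical extra group: A = 20 repeats, but always with the same B.
    INSERT INTO your_table VALUES (20, 7), (20, 7);
""")

# COUNT(b) > 1 only says "A has more than one row".
plain = conn.execute(
    "SELECT a FROM your_table GROUP BY a HAVING COUNT(b) > 1 ORDER BY a"
).fetchall()

# COUNT(DISTINCT b) > 1 says "A has more than one different value of B".
distinct = conn.execute(
    "SELECT a FROM your_table GROUP BY a HAVING COUNT(DISTINCT b) > 1 ORDER BY a"
).fetchall()

print(plain)     # [(10,), (20,)]
print(distinct)  # [(10,)]
```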
select FIELD from your_table group by FIELD having count(b) > 1
Take into consideration that this counts every row in the group, including duplicates.
For example, if you have the values
1
1
2
1
the count for value 1 is 3, not 2.

Oracle / SQL - Count number of occurrences of values in a single column

Okay, I probably could have come up with a better title, but wasn't sure how to word it so let me explain.
Say I have a table with the column 'CODE'. Each record in my table will have either 'A', 'B', or 'C' as its value in the 'CODE' column. What I would like is to get a count of how many 'A's, 'B's, and 'C's I have.
I know I could accomplish this with 3 different queries, but I'm wondering if there is a way to do it with just 1.
Use:
SELECT t.code,
COUNT(*) AS numInstances
FROM YOUR_TABLE t
GROUP BY t.code
The output will resemble:
code numInstances
--------------------
A 3
B 5
C 1
If a code exists that has not been used, it will not show up. You'd need to LEFT JOIN to the table containing the list of codes in order to see those that don't have any references.
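A sketch of that LEFT JOIN variant, run on SQLite through Python. The lookup table CODES and the unused code 'D' are assumptions made for the example; the counts for A, B and C mirror the sample output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical lookup table listing every valid code, including unused ones.
    CREATE TABLE CODES (code TEXT PRIMARY KEY);
    CREATE TABLE YOUR_TABLE (code TEXT);
    INSERT INTO CODES VALUES ('A'), ('B'), ('C'), ('D');
""")
# Three A's, five B's, one C and no D.
conn.executemany("INSERT INTO YOUR_TABLE VALUES (?)", [(c,) for c in "AAABBBBBC"])

# COUNT(t.code) ignores the NULLs produced by the LEFT JOIN, so unused codes get 0.
counts = conn.execute("""
    SELECT c.code, COUNT(t.code) AS numInstances
    FROM CODES c
    LEFT JOIN YOUR_TABLE t ON t.code = c.code
    GROUP BY c.code
    ORDER BY c.code
""").fetchall()
print(counts)  # [('A', 3), ('B', 5), ('C', 1), ('D', 0)]
```

Using COUNT(*) here instead would report 1 for 'D', because the LEFT JOIN still produces one row for it.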

SQL Query - Ensure a row exists for each value in ()

I'm currently struggling to find a way to validate 2 tables efficiently (there are lots of rows in Table A).
I have two tables
Table A
ID
A
B
C
Table matched
ID Number
A 1
A 2
A 9
B 1
B 9
C 2
I am trying to write a SQL Server query that basically checks that for every value in Table A there exists a row for each of a variable set of values (1, 2, 9).
The example above is incorrect because it should have, for every record in A, a corresponding record in Table matched for each value (1, 2, 9). The end goal is:
Table matched
ID Number
A 1
A 2
A 9
B 1
B 2
B 9
C 1
C 2
C 9
I know it's confusing, but in general for every X in (some set) there should be a corresponding record in Table matched. I have obviously simplified things.
Please let me know if you all need clarification.
Use:
SELECT a.id
FROM TABLE_A a
JOIN TABLE_B b ON b.id = a.id
WHERE b.number IN (1, 2, 9)
GROUP BY a.id
HAVING COUNT(DISTINCT b.number) = 3
The DISTINCT in the COUNT prevents duplicates (i.e., A having two records in TABLE_B with the value "2") from being falsely counted toward a correct record. It can be omitted if the combination of id and number has a unique or primary key constraint on it.
The HAVING COUNT(...) must equal the number of values provided in the IN clause.
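Running the query against the question's sample data gives a feel for it. This is only a sketch on SQLite through Python, with TABLE_A and TABLE_B standing in for "Table A" and "Table matched".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_A (id TEXT PRIMARY KEY);
    CREATE TABLE TABLE_B (id TEXT, number INTEGER);
    INSERT INTO TABLE_A VALUES ('A'), ('B'), ('C');
    INSERT INTO TABLE_B VALUES ('A', 1), ('A', 2), ('A', 9),
                               ('B', 1), ('B', 9), ('C', 2);
""")

# Ids that have a row for every one of the three required numbers.
complete_ids = conn.execute("""
    SELECT a.id
    FROM TABLE_A a
    JOIN TABLE_B b ON b.id = a.id
    WHERE b.number IN (1, 2, 9)
    GROUP BY a.id
    HAVING COUNT(DISTINCT b.number) = 3
""").fetchall()
print(complete_ids)  # [('A',)]
```

Only A is returned; B and C are the ids that still need rows added.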
Create a temp table of values you want. You can do this dynamically if the values 1, 2 and 9 are in some table you can query from.
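Then, for each ID, select the values from tempTable that are not in TableMatched: SELECT Number FROM tempTable WHERE Number NOT IN (SELECT Number FROM TableMatched WHERE ID = @id), where @id stands for the ID being checked.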
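One way to flesh this out for all IDs at once is sketched below. It swaps in a CROSS JOIN with NOT EXISTS rather than a per-ID NOT IN, which is a variation on the idea and not necessarily what was meant; the table tempTable is an ordinary table here, and the data is the question's sample.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID TEXT PRIMARY KEY);
    CREATE TABLE TableMatched (ID TEXT, Number INTEGER);
    CREATE TABLE tempTable (Number INTEGER);  -- the required values
    INSERT INTO TableA VALUES ('A'), ('B'), ('C');
    INSERT INTO TableMatched VALUES ('A', 1), ('A', 2), ('A', 9),
                                    ('B', 1), ('B', 9), ('C', 2);
    INSERT INTO tempTable VALUES (1), (2), (9);
""")

# Build every (ID, Number) combination that should exist,
# then keep the ones with no matching row in TableMatched.
missing = conn.execute("""
    SELECT a.ID, t.Number
    FROM TableA a
    CROSS JOIN tempTable t
    WHERE NOT EXISTS (
        SELECT 1 FROM TableMatched m
        WHERE m.ID = a.ID AND m.Number = t.Number
    )
    ORDER BY a.ID, t.Number
""").fetchall()
print(missing)  # [('B', 2), ('C', 1), ('C', 9)]
```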
I had this situation one time. My solution was as follows.
In addition to TableA and TableMatched, there was a table that defined the rows that should exist in TableMatched for each row in TableA. Let’s call it TableMatchedDomain.
The application then accessed TableMatched through a view that controlled the returned rows, like this:
create view TableMatchedView as
select a.ID,
d.Number,
m.OtherValues
from TableA a
cross join TableMatchedDomain d
left join TableMatched m on m.ID = a.ID and m.Number = d.Number
This way, the rows returned were always correct. If there were missing rows from TableMatched, then the Numbers were still returned but with OtherValues as null. If there were extra values in TableMatched, then they were not returned at all, as though they didn't exist. By changing the rows in TableMatchedDomain, this behavior could be controlled very easily. If a value were removed from TableMatchedDomain, then it would disappear from the view. If it were added back again in the future, then the corresponding OtherValues would appear again as they were before.
The reason I designed it this way was that I felt that establishing an invariant on the row configuration in TableMatched was too brittle and, even worse, introduced redundancy. So I removed the restriction from groups of rows (in TableMatched) and instead made the entire contents of another table (TableMatchedDomain) define the correct form of the data.
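As a rough illustration of how such a view behaves, here is a small sqlite3 sketch. The sample rows are invented, and it assumes the domain table is combined with TableA through a cross join so that every ID is paired with every required Number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID TEXT PRIMARY KEY);
    CREATE TABLE TableMatchedDomain (Number INTEGER PRIMARY KEY);
    CREATE TABLE TableMatched (ID TEXT, Number INTEGER, OtherValues TEXT);
    -- Hypothetical data: B is missing Number 2 and has a stray Number 7.
    INSERT INTO TableA VALUES ('A'), ('B');
    INSERT INTO TableMatchedDomain VALUES (1), (2);
    INSERT INTO TableMatched VALUES ('A', 1, 'a1'), ('A', 2, 'a2'),
                                    ('B', 1, 'b1'), ('B', 7, 'stray');

    CREATE VIEW TableMatchedView AS
    SELECT a.ID, d.Number, m.OtherValues
    FROM TableA a
    CROSS JOIN TableMatchedDomain d
    LEFT JOIN TableMatched m ON m.ID = a.ID AND m.Number = d.Number;
""")

view_rows = conn.execute(
    "SELECT ID, Number, OtherValues FROM TableMatchedView ORDER BY ID, Number"
).fetchall()
for row in view_rows:
    print(row)
# ('A', 1, 'a1')
# ('A', 2, 'a2')
# ('B', 1, 'b1')
# ('B', 2, None)
```

The missing (B, 2) row shows up with a null OtherValues, and the stray (B, 7) row is not returned at all.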