how to avoid duplicate on Joining two tables


Student Table
SID Name
1 A
2 B
3 C
Marks Table
id mark subject
1 50 physics
2 40 biology
1 50 chemistry
3 30 mathematics
SELECT distinct(std.id),std.name,m.mark, row_number() over() as rownum FROM
student std JOIN marks m ON std.id=m.id AND m.mark=50
This returns A twice even after using DISTINCT. My expected result has only one A. If I remove row_number() over() as rownum, it works fine. Why is this happening, and how can I resolve it? I am using DB2.

There are two rows in the marks table with id = 1 and mark = 50, so the join produces two output rows for that student. DISTINCT applies to the whole output row, and row_number() gives each of those rows a different value, so DISTINCT no longer collapses them.
If you only want one row, you have to use GROUP BY:
SELECT std.sid, std.name, m.mark, row_number() over() as rownum
FROM student std
JOIN marks m
ON m.id=std.sid AND m.mark=50
GROUP BY std.sid, std.name, m.mark

Now that you've clarified your question as:
I want to find all students with a mark of 50 in at least one subject. I would use the query:
SELECT student.id, '50'
FROM student
WHERE EXISTS (SELECT 1 FROM marks WHERE marks.id = student.id AND marks.mark = 50)
This also gives you flexibility to change the criteria, e.g. at least one mark of 50 or less.
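To make the difference concrete, here is a minimal sketch that loads the question's sample data into an in-memory SQLite database (an assumption for illustration; the question uses DB2) and compares a plain join with the EXISTS form. The column name sid follows the Student table in the question.

```python
import sqlite3

# In-memory SQLite stands in for DB2 here; the SQL used is portable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (sid INTEGER, name TEXT);
    CREATE TABLE marks (id INTEGER, mark INTEGER, subject TEXT);
    INSERT INTO student VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO marks VALUES
        (1, 50, 'physics'), (2, 40, 'biology'),
        (1, 50, 'chemistry'), (3, 30, 'mathematics');
""")

# Plain join: student A matches two marks rows, so A comes back twice.
join_rows = conn.execute("""
    SELECT std.sid, std.name, m.mark
    FROM student std
    JOIN marks m ON m.id = std.sid
    WHERE m.mark = 50
    ORDER BY std.sid
""").fetchall()

# EXISTS: each student row is tested once, so A comes back once.
exists_rows = conn.execute("""
    SELECT s.sid, s.name
    FROM student s
    WHERE EXISTS (SELECT 1 FROM marks m
                  WHERE m.id = s.sid AND m.mark = 50)
    ORDER BY s.sid
""").fetchall()

print(join_rows)    # [(1, 'A', 50), (1, 'A', 50)]
print(exists_rows)  # [(1, 'A')]
```

Because EXISTS never multiplies rows, no DISTINCT or GROUP BY is needed to get one row per student.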

Similar to Charles's answer, but it is good practice to put the filter predicate (mark=50) in the WHERE clause and keep only the join condition in ON. For an inner join the result is the same either way, but for an outer join the placement changes which rows come back, so it is a habit worth forming before you work with real data.
SELECT std.sid,
std.name,
m.mark,
row_number() over() AS rownum
FROM student std
JOIN marks m
ON std.sid=m.id
WHERE m.mark=50
GROUP BY std.sid, std.name, m.mark

Related

How to return all names that appear multiple times in table [duplicate]

Suppose I have the following schema:
student(name, siblings)
The table holds names and siblings. Each name appears in as many rows as that individual has siblings. For instance, the table could contain:
Jack, Lucy
Jack, Tim
Meaning that Jack has Lucy and Tim as his siblings.
I want to identify an SQL query that reports the names of all students who have 2 or more siblings. My attempt is the following:
select name
from student
where count(name) >= 1;
I'm not sure I'm using count correctly in this SQL query. Can someone please help with identifying the correct SQL query for this?

You're almost there:
select name
from student
group by name
having count(*) > 1;
HAVING is like a WHERE clause that runs after grouping is done. In it you can use things that grouping makes available (like counts and other aggregates). By grouping on the name and filtering on the count (use > 1 if you want two or more, not >= 1, because that would include 1) you get the names you want.
This will just deliver "Jack" as a single result (in the example data from the question). If you then want all the detail, like who Jack's siblings are, you can join your grouped, filtered list of names back to the table:
select *
from
student
INNER JOIN
(
select name
from student
group by name
having count(*) > 1
) morethanone ON morethanone.name = student.name
You can't avoid doing this "joining back" because the grouping has thrown the detail away in order to create the group. The only way to get the detail back is to take the name list the group gave you and use it to filter the original detail data again.
Full disclosure: it's a bit of a lie to say "can't avoid doing this". SQL Server supports window functions, which effectively perform a grouping in the background and join it back to the detail. Such a query would look like:
select student.*, count(*) over(partition by name) n
from student
And for a table like this:
jack, lucy
jack, tim
jane, bill
jane, fred
jane, tom
john, dave
It would produce:
jack, lucy, 2
jack, tim, 2
jane, bill, 3
jane, fred, 3
jane, tom, 3
john, dave, 1
The jack rows have 2 because there are two jack rows; likewise there are 3 jane rows and 1 john row. You could then wrap all that in a subquery and filter for n > 1, which removes john:
select *
from
(
select student.*, count(*) over(partition by name) n
from student
) x
where x.n > 1
If SQL Server didn't have window functions, it would look more like:
select *
from
student
INNER JOIN
(
select name, count(*) as n
from student
group by name
) x ON x.name = student.name
The COUNT(*) OVER(PARTITION BY name) is like a mini "group by name and return the count, then auto join back to the main detail using the name as key" i.e. a short form of the latter query
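As a quick check of the claim that the two forms are equivalent, the sketch below runs both against the sample rows above. It uses an in-memory SQLite database purely for illustration (the answer targets SQL Server), and the window-function query assumes SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (name TEXT, siblings TEXT);
    INSERT INTO student VALUES
        ('jack', 'lucy'), ('jack', 'tim'),
        ('jane', 'bill'), ('jane', 'fred'), ('jane', 'tom'),
        ('john', 'dave');
""")

# GROUP BY + HAVING: one row per name that occurs more than once.
names = conn.execute("""
    SELECT name FROM student
    GROUP BY name HAVING COUNT(*) > 1
    ORDER BY name
""").fetchall()

# Join the grouped names back to the table to recover the detail rows.
joined_back = conn.execute("""
    SELECT s.name, s.siblings
    FROM student s
    INNER JOIN (
        SELECT name FROM student GROUP BY name HAVING COUNT(*) > 1
    ) morethanone ON morethanone.name = s.name
    ORDER BY s.name, s.siblings
""").fetchall()

# Window-function form: count per name, then filter in an outer query.
windowed = conn.execute("""
    SELECT name, siblings
    FROM (
        SELECT name, siblings,
               COUNT(*) OVER (PARTITION BY name) AS n
        FROM student
    ) x
    WHERE x.n > 1
    ORDER BY name, siblings
""").fetchall()

print(names)                    # [('jack',), ('jane',)]
print(joined_back == windowed)  # True
```

Both detail queries return the five jack and jane rows and drop john.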

You can do:
select distinct name
from student as s1
where exists (
select 1
from student as s2
where s1.name = s2.name and s1.siblings != s2.siblings
)

I think the best approach is the one Caius Jard described. However, here is an additional way if you also want to see how many siblings each name has:
SELECT name, COUNT(*) AS Occurrences
FROM student
GROUP BY name
HAVING (COUNT(*) > 1)

I wanted to share another solution I came up with:
select distinct s1.name
from student s1, student s2
where s1.name = s2.name and s1.siblings != s2.siblings;

Retrieve only the first row per student

Suppose I have a table that stores students and their respective grades per class. In most cases, there are multiple rows per student in the table.
student_id math chemistry science
100 A B C <--------
100 B A D
100 D F C
200 B A C <--------
300 C D F <--------
300 A A A
400 F C B <--------
400 B A C
500 A B A <--------
I want to retrieve the first row as per the student_id as explained below.
Requested in postgreSQL:
student_id math chemistry science
100 A B C
200 B A C
300 C D F
400 F C B
500 A B A

At the point you decide on an ordering strategy (such as adding a column that records the date they took the exams and defining "first" as "most recent") you can get the "first" row with something like
SELECT *
FROM(
SELECT *, ROW_NUMBER() OVER(PARTITION BY student_id ORDER BY date_taken DESC) rn
FROM yourtable
) x
WHERE rn = 1
ROW_NUMBER creates an incrementing counter 1, 2, 3, 4... that restarts from 1 for every different student id, and the rows are numbered in order of descending date (so the most recent gets 1). By then requiring rn to be 1 we get the most recent row.
You might instead decide to give the student their best marks, and maybe use the ASCII value of each score (A is 65, B is 66, etc.). If we add the scores up, then the lowest total (i.e. order by ascending) is the best set of marks (BBB is better than AAF):
OVER(PARTITION BY student_id ORDER BY ASCII(math)+ASCII(chemistry)+ASCII(science))
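Here is a minimal sketch of the ROW_NUMBER approach, run against an in-memory SQLite database (3.25 or later) rather than PostgreSQL, purely for illustration. The table name grades, the date_taken column and its values are all hypothetical; the question's table has no ordering column, which is exactly why one has to be chosen.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- date_taken is a hypothetical column added to define "first".
    CREATE TABLE grades (student_id INTEGER, math TEXT, chemistry TEXT,
                         science TEXT, date_taken TEXT);
    INSERT INTO grades VALUES
        (100, 'A', 'B', 'C', '2023-03-01'),
        (100, 'B', 'A', 'D', '2023-02-01'),
        (100, 'D', 'F', 'C', '2023-01-01'),
        (200, 'B', 'A', 'C', '2023-01-15'),
        (300, 'C', 'D', 'F', '2023-02-10'),
        (300, 'A', 'A', 'A', '2023-01-10');
""")

# Number each student's rows from most recent to oldest, keep number 1.
latest = conn.execute("""
    SELECT student_id, math, chemistry, science
    FROM (
        SELECT *, ROW_NUMBER() OVER (
                   PARTITION BY student_id
                   ORDER BY date_taken DESC) AS rn
        FROM grades
    ) x
    WHERE rn = 1
    ORDER BY student_id
""").fetchall()

for row in latest:
    print(row)
```

With these made-up dates, the most recent row per student is returned: (100, A, B, C), (200, B, A, C) and (300, C, D, F).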

This can be achieved with the following query:
SELECT DISTINCT ON (student_id) student_id, math, chemistry, science
FROM
student
ORDER BY
student_id
This query will return just a single row per student_id. You should determine how you want to actually order it (such as an index or timestamp), otherwise you can't guarantee that you get the same output each time.
But as a basic solution, you can use this to just get a single row if you don't care about the actual order and only care about removing duplicates.

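A basic solution is to aggregate the values with MIN/MAX. Note that each column is aggregated independently, so the result can combine grades from different rows rather than returning one of the original rows: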
SELECT student_id, MAX(math) math, MAX(chemistry) chemistry, MAX(science) science
FROM
student
GROUP BY
student_id

Case Statement for multiple criteria

I would like to ignore some of the results of my query. For all intents and purposes some of the results are duplicates, but because of the way the request was made we need to use this hierarchy, and although we are seeing different Company_Name values, we need to ignore one of the results.
Query:
SELECT
COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM
some_table AS A12
GROUP BY
2
ORDER BY
3 ASC, 2 ASC
This code omits half a dozen joins and WHERE clauses that are not germane to this question.
Results:
Customer_Name_Count Company_Name Total_Sales
-------------------------------------------------------------
1 3 Blockbuster 1,000
2 6 Jimmy's Bar 1,500
3 6 Jimmy's Restaurant 1,500
4 9 Impala Hotel 2,000
5 12 Sports Drink 2,500
In the above set, we can see that numbers 2 & 3 have the same count and the same total_sales number and similar company names. Is there a way to create a case statement that takes these 3 factors into consideration and then drops one or the other for Jimmy's enterprises? The other issue is that this has to be variable as there are other instances where this happens. And I would only want this to happen if the count and sales number match each other with a similar name in the company name.
Desired result:
Customer_Name_Count Company_Name Total_Sales
--------------------------------------------------------------
1 3 Blockbuster 1,000
2 6 Jimmy's Bar 1,500
3 9 Impala Hotel 2,000
4 12 Sports Drink 2,500

It looks like the other answers are accurate, based on the assumption that the Company_IDs are the same for both.
If the Company_IDs are different for Jimmy's Bar and Jimmy's Restaurant, then you can use something like this. I suggest you get functional users involved and do some data clean-up, or else you'll be maintaining this every time the issue arises:
SELECT
COUNT(DISTINCT CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END) AS Customer_Name_Count
,CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END AS Company_Name
,SUM(A12.Total_Sales) AS Total_Sales
FROM some_table AS A12
GROUP BY CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END

Your problem is that the joins you are using are multiplying the number of rows. Somewhere along the way, multiple names are associated with exactly the same entity (which is why the numbers are the same). You can fix this by aggregating by the right id:
SELECT COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
MAX(Company_Name) as Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM some_table AS A12
GROUP BY Company_id -- I'm guessing the column is something like this
ORDER BY 3 ASC, 2 ASC;
This might actually overstate the sales (I don't know). Better would be fixing the join so it only returned one name. One possibility is that it is a type-2 dimension, meaning that there is a time component for values that change over time. You may need to restrict the join to a single time period.

You need a function that returns a common name for the companies (here a user-defined function, dbo.GetCommonName, which you would have to write yourself), and then use DISTINCT:
SELECT DISTINCT
Customer_Name_Count,
dbo.GetCommonName(Company_Name) as Company_Name,
Total_Sales
FROM dbo.theTable

You can try using the ROW_NUMBER window function to number the rows within each combination of Customer_Name_Count and Total_Sales, then keep only rn = 1:
SELECT * FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY Customer_Name_Count,Total_Sales ORDER BY Company_Name) rn
FROM (
SELECT
COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM
some_table AS A12
GROUP BY
Company_Name
)t1
)t2
WHERE rn = 1

Ranking Aggregate Field in Access Query

I am trying to rank an aggregate field in access but my efforts are in vain with errors based on referencing. I am ranking using a subquery but the problem comes about due to the alias names resulting from performing an average on a field. The code is as below:
SELECT [Exams].[StudentID],
Avg([Exams].[Biology]) AS [AvgBiology],
(SELECT Avg(T.Biology) AS [TAvgBiology],
Count(*)
FROM [Exams] AS T
WHERE T.[TAvgBiology] > [AvgBiology])
+ 1 AS Rank
FROM [Exams]
GROUP BY [Exams].[StudentID]
ORDER BY Avg([Exams].[Biology]) DESC;
Errors that come about state: "You have selected a subquery that can return more than one value blah blah...please use the Exist keyword.. ".
From the code above I think you get the gist of what I am trying to achieve.

Start with the basic GROUP BY query Gordon Linoff suggested to compute the average Biology for each StudentID.
SELECT
e.StudentID,
Avg(e.Biology) AS AvgBiology
FROM Exams AS e
GROUP BY e.StudentID
Save that query as qryAvgBiology and then use it in another query where you compute Rank.
SELECT
q.StudentID,
q.AvgBiology,
(
(
SELECT Count(*)
FROM qryAvgBiology AS q2
WHERE q2.AvgBiology > q.AvgBiology
)
+1
) AS Rank
FROM qryAvgBiology AS q
ORDER BY 3;
For example, if qryAvgBiology returns this result set ...
StudentID AvgBiology
--------- ----------
1 70
2 80
3 90
The ranking query will transform it to this ...
StudentID AvgBiology Rank
--------- ---------- ----
3 90 1
2 80 2
1 70 3
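The two-step approach can be tried outside Access as well. The sketch below is an assumption-laden illustration: it uses an in-memory SQLite database, a view in place of the saved Access query qryAvgBiology, made-up exam scores chosen to give the averages 70, 80 and 90 shown above, and the alias BiologyRank instead of Rank.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical scores; averages work out to 70, 80 and 90.
    CREATE TABLE Exams (StudentID INTEGER, Biology REAL);
    INSERT INTO Exams VALUES (1, 60), (1, 80), (2, 80), (3, 85), (3, 95);

    -- A view plays the role of the saved query qryAvgBiology.
    CREATE VIEW qryAvgBiology AS
        SELECT e.StudentID, AVG(e.Biology) AS AvgBiology
        FROM Exams AS e
        GROUP BY e.StudentID;
""")

# Rank = 1 + number of students with a strictly higher average.
ranked = conn.execute("""
    SELECT q.StudentID,
           q.AvgBiology,
           (SELECT COUNT(*)
            FROM qryAvgBiology AS q2
            WHERE q2.AvgBiology > q.AvgBiology) + 1 AS BiologyRank
    FROM qryAvgBiology AS q
    ORDER BY 3
""").fetchall()

for row in ranked:
    print(row)
```

Counting only strictly higher averages means tied students share the same rank, and the next rank after a tie is skipped.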

I assume your basic query is:
SELECT e.StudentId, Avg(e.Biology) AS AvgBiology
FROM exams as e
GROUP BY e.StudentId;
(Square braces don't help me understand the query at all.)
I think the following will work in Access:
SELECT e.StudentId, Avg(e.Biology) AS AvgBiology,
(SELECT 1 + COUNT(*)
FROM (SELECT e.StudentId, Avg(e.Biology) AS AvgBiology
FROM exams as e
GROUP BY e.StudentId
) e2
WHERE e2.AvgBiology > Avg(e.Biology)
) as ranking
FROM exams as e
GROUP BY e.StudentId;

write a query to identify discrepancy

I have a table with student IDs and student names. There have been issues with assigning unique student IDs to students, so I want to find the duplicates.
Here is the sample Table:
Student ID Student Name
1 Jack
1 John
1 Bill
2 Amanda
2 Molly
3 Ron
4 Matt
5 James
6 Kathy
6 Will
Here I want a third column "Duplicate_Count" to display count of duplicate records.
For example, "Duplicate_Count" would display "3" for Student ID = 1, and so on. How can I do this?
Thanks in advance

Select StudentId, Count(*) DupCount
From Table
Group By StudentId
Having Count(*) > 1
Order By Count(*) desc;

Select
aa.StudentId, aa.StudentName, bb.DupCount
from
Table as aa
join
(
Select StudentId, Count(*) as DupCount from Table group by StudentId
) as bb
on aa.StudentId = bb.StudentId
The virtual table gives the count for each StudentId, this is joined back to the original table to add the count to each student record.
If you want to add a column to the table to hold dupcount, this query can be used in an update statement to update that column in the table
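A minimal sketch of the join-back query on the question's sample data follows, using an in-memory SQLite database for illustration. The table name students is an assumption, since Table in the query above is only a placeholder (and a reserved word in most databases).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (StudentId INTEGER, StudentName TEXT);
    INSERT INTO students VALUES
        (1, 'Jack'), (1, 'John'), (1, 'Bill'),
        (2, 'Amanda'), (2, 'Molly'),
        (3, 'Ron'), (4, 'Matt'), (5, 'James'),
        (6, 'Kathy'), (6, 'Will');
""")

# Count rows per StudentId in a derived table, then join the count
# back onto every student record.
rows = conn.execute("""
    SELECT aa.StudentId, aa.StudentName, bb.DupCount
    FROM students AS aa
    JOIN (
        SELECT StudentId, COUNT(*) AS DupCount
        FROM students
        GROUP BY StudentId
    ) AS bb ON aa.StudentId = bb.StudentId
    ORDER BY aa.StudentId, aa.StudentName
""").fetchall()

for row in rows:
    print(row)
```

Every one of the ten rows is kept; students sharing ID 1 get a count of 3, while Ron, Matt and James get 1.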

This should work:
update mytable
set duplicate_count = (select count(*) from mytable t where t.id = mytable.id)
UPDATE:
As mentioned by @HansUp, adding a new column with the duplicate count probably doesn't make sense, but that really depends on what the OP originally intended to use it for. I'm leaving the answer in case it helps someone else.
