Why do multiple-table joins produce duplicate rows?

Let's say I have three tables A, B, and C. Each has two columns: a primary key and some other piece of data. They each have the same number of rows. If I JOIN A and B on the primary key, I should end up with the same number of rows as are in either of them (as opposed to A.rows * B.rows).
Now, if I JOIN the result of A JOIN B with C, why do I end up with duplicate rows? I have run into this problem on several occasions and I do not understand it. It seems like it should produce the same result as JOINing A and B, since C has the same number of rows, but instead duplicates are produced.
Queries that produce results like this are of the format:
SELECT *
FROM M
INNER JOIN S
on M.mIndex = S.mIndex
INNER JOIN D
ON M.platformId LIKE '%' + D.version + '%'
INNER JOIN H
ON D.Name = H.Name
AND D.revision = H.revision
Here are the schemas for the tables. H is a historic table containing everything that was ever in D. There are many M rows for each D and one S for each M.
Table M
[mIndex] [int] NOT NULL PRIMARY KEY,
[platformId] [nvarchar](256) NULL,
[ip] [nvarchar](64) NULL,
[complete] [bit] NOT NULL,
[date] [datetime] NOT NULL,
[DeployId] [int] NOT NULL PRIMARY KEY REFERENCES D.DeployId,
[source] [nvarchar](64) NOT NULL PRIMARY KEY
Table S
[order] [int] NOT NULL PRIMARY KEY,
[name] [nvarchar](64) NOT NULL,
[parameters] [nvarchar](256) NOT NULL,
[Finished] [bit] NOT NULL,
[mIndex] [int] NOT NULL PRIMARY KEY,
[mDeployId] [int] NOT NULL PRIMARY KEY,
[Date] [datetime] NULL,
[status] [nvarchar](10) NULL,
[output] [nvarchar](max) NULL,
[config] [nvarchar](64) NOT NULL PRIMARY KEY
Table D
[Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[branch] [nvarchar](64) NOT NULL,
[revision] [int] NOT NULL,
[version] [nvarchar](64) NOT NULL,
[path] [nvarchar](256) NOT NULL
Table H
[IdDeploy] [int] IDENTITY(1,1) NOT NULL,
[name] [nvarchar](64) NOT NULL,
[version] [nvarchar](64) NOT NULL,
[path] [nvarchar](max) NOT NULL,
[StartDate] [datetime] NOT NULL,
[EndDate] [datetime] NULL,
[Revision] [nvarchar](64) NULL,
I didn't post the tables and query initially because I am more interested in understanding this problem for myself and avoiding it in the future.

When you have related tables you often have one-to-many or many-to-many relationships. So when you join to TableB, each record in TableA may have multiple records in TableB. This is normal and expected.
At times, though, you only need certain columns, and those are the same for all the matching records. Then you need some sort of GROUP BY or DISTINCT to remove the duplicates. Let's look at an example:
TableA
Id Field1
1 test
2 another test
TableB
ID Field2 field3
1 Test1 something
1 Test1 More something
2 Test2 Anything
So when you join them and select all the fields you get:
select *
from tableA a
join tableb b on a.id = b.id
a.Id a.Field1 b.id b.field2 b.field3
1 test 1 Test1 something
1 test 1 Test1 More something
2 another test 2 Test2 Anything
These are not duplicates, because the values of Field3 are different even though there are repeated values in the earlier fields. When you select only certain columns, the same number of records are still being joined together, but since the columns with the differing information are not displayed, the rows look like duplicates.
select a.Id, a.Field1, b.field2
from tableA a
join tableb b on a.id = b.id
a.Id a.Field1 b.field2
1 test Test1
1 test Test1
2 another test Test2
These rows appear to be duplicates, but they are not: each one comes from a different matching record in TableB.
You normally fix this by using aggregates and group by, by using distinct or by filtering in the where clause to remove duplicates. How you solve this depends on exactly what your business rule is and how your database is designed and what kind of data is in there.
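The example above can be reproduced with a small sketch. It uses Python's built-in sqlite3 module and SQLite syntax rather than SQL Server, and it assumes Field2 is 'Test1' in both TableB rows for Id 1; the variable names are made up for illustration.

```python
import sqlite3

# In-memory database holding the sample data from the example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (Id INTEGER, Field1 TEXT);
    CREATE TABLE TableB (Id INTEGER, Field2 TEXT, Field3 TEXT);
    INSERT INTO TableA VALUES (1, 'test'), (2, 'another test');
    INSERT INTO TableB VALUES
        (1, 'Test1', 'something'),
        (1, 'Test1', 'More something'),
        (2, 'Test2', 'Anything');
""")

# Plain join: one output row per matching TableB row, so Id 1 appears twice.
plain_rows = conn.execute("""
    SELECT a.Id, a.Field1, b.Field2
    FROM TableA a
    JOIN TableB b ON a.Id = b.Id
    ORDER BY a.Id
""").fetchall()

# DISTINCT collapses rows that are identical in the selected columns.
distinct_rows = conn.execute("""
    SELECT DISTINCT a.Id, a.Field1, b.Field2
    FROM TableA a
    JOIN TableB b ON a.Id = b.Id
    ORDER BY a.Id
""").fetchall()

# GROUP BY does the same and can also report how many rows were collapsed.
grouped_rows = conn.execute("""
    SELECT a.Id, a.Field1, b.Field2, COUNT(*) AS matches
    FROM TableA a
    JOIN TableB b ON a.Id = b.Id
    GROUP BY a.Id, a.Field1, b.Field2
    ORDER BY a.Id
""").fetchall()

print(plain_rows)
print(distinct_rows)
print(grouped_rows)
```

Which of the two fixes is appropriate still depends on the business rule; the sketch only shows that both remove the apparent duplicates.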

If one of the tables M, S, D, or H has more than one row for a given Id (if just the Id column is not the Primary Key), then the query would result in "duplicate" rows. If you have more than one row for an Id in a table, then the other columns, which would uniquely identify a row, also must be included in the JOIN condition(s).
References:
Related Question on MSDN Forum

This might sound like a really basic "DUH" answer, but make sure that the column you're using to Lookup from on the merging file is actually full of unique values!
I noticed earlier today that PowerQuery won't throw you an error (like in PowerPivot) and will happily allow you to run a Many-Many merge. This will result in multiple rows being produced for each record that matches with a non-unique value.

Make sure your join conditions are correct. I was facing this issue because of a mistake in a join condition:
/****** Script for command from SSMS ******/
SELECT [TransWorkShopNo]
,[TransformerCapacity].[CapacistyPrice]
,[TransformerCapacity].[HTCoilPrice]
,[TransformerCapacity].[LTCoilReclaimedPrice]
,[TransformerCapacity].[LTCoilNewPrice]
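FROM [Hi-Lit-Electronics].[dbo].[TransformerData]
INNER JOIN TransformerCapacity ON [TransformerData].CapacistyID = [TransformerCapacity].CapacistyID
INNER JOIN TransformerItem ON [TransformerData].ItemID = TransformerCapacity.ItemID
The mistake was TransformerCapacity.ItemID in the last join condition: it never references TransformerItem, the table being joined (presumably it should have been TransformerItem.ItemID).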

Ok in this example you are getting duplicates because you are joining both D and S onto M.
I assume you should be joining D.id onto S.id like below:
SELECT *
FROM M
INNER JOIN S
on M.Id = S.Id
INNER JOIN D
ON S.Id = D.Id
INNER JOIN H
ON D.Id = H.Id

Use a GROUP BY clause on the main table's id; I hope it works. For CodeIgniter:
$this->db->group_by('products.id');

Related

Join 2 tables into 1

I have a Customer and Customer_2 table which I am trying to join together:
Both tables have data in them, but upon joining with a statement only the column names are being returned without data. I am trying to use the following join statement:
select distinct *
from Customer c
join Customer_2 d on c.CUST_NUM = d.CUST_NUM
These are the tables:
CREATE TABLE [Customer]
(
[CUST_NUM] [INT] NOT NULL,
[CUST_LNAME] [VARCHAR](50) NULL,
[CUST_FNAME] [VARCHAR](50) NULL,
CUST_BALANCE [MONEY] NOT NULL,
)
ON [PRIMARY]
CREATE TABLE [Customer_2]
(
[CUST_NUM] [INT] NOT NULL,
[CUST_LNAME] [VARCHAR](50) NULL,
[CUST_FNAME] [VARCHAR](50) NULL,
)
ON [PRIMARY]
Data in each Table:
INSERT INTO Customer
VALUES
('1000', 'Smith', 'Jeanne', '1050.11'),
('1001', 'Ortega', 'Juan', '840.92');
INSERT INTO CUSTOMER_2
VALUES
('2000', 'McPherson', 'Anne'),
('2001', 'Ortega', 'Juan'),
('2002', 'Kowalski', 'Jan'),
('2003', 'Chan', 'George');
The expected output would combine Customer_2 onto the bottom of the Customer table, with the extra column CUST_BALANCE being 0 or null for each of the four customers in the Customer_2 table. The desired output should also exclude the second entry for Juan Ortega, i.e. the row where CUST_NUM is 2001.

You can use UNION ALL for this operation, e.g.:
select cust_num, cust_fname, cust_lname, cust_balance from Customer
union all
select cust_num, cust_fname, cust_lname, 0 from Customer_2 c2
where not exists (select * from Customer c
where c.cust_fname = c2.cust_fname and c.cust_lname = c2.cust_lname);
DBFiddle demo
PS: These are probably not official terms, but JOIN combines tables horizontally (adding columns side by side), while UNION [ALL] combines them vertically (stacking rows).
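As a rough check of the UNION ALL approach, here is a sketch using Python's built-in sqlite3 module with the sample rows from the question. It uses SQLite types (REAL instead of MONEY) and adds an ORDER BY for a stable result; both are assumptions made for the sketch, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        CUST_NUM INTEGER NOT NULL,
        CUST_LNAME TEXT,
        CUST_FNAME TEXT,
        CUST_BALANCE REAL NOT NULL  -- MONEY in the original schema
    );
    CREATE TABLE Customer_2 (
        CUST_NUM INTEGER NOT NULL,
        CUST_LNAME TEXT,
        CUST_FNAME TEXT
    );
    INSERT INTO Customer VALUES
        (1000, 'Smith', 'Jeanne', 1050.11),
        (1001, 'Ortega', 'Juan', 840.92);
    INSERT INTO Customer_2 VALUES
        (2000, 'McPherson', 'Anne'),
        (2001, 'Ortega', 'Juan'),
        (2002, 'Kowalski', 'Jan'),
        (2003, 'Chan', 'George');
""")

# The join from the question returns nothing: no CUST_NUM exists in both tables.
joined = conn.execute("""
    SELECT * FROM Customer c
    JOIN Customer_2 d ON c.CUST_NUM = d.CUST_NUM
""").fetchall()

# UNION ALL stacks the rows; NOT EXISTS skips Customer_2 rows whose
# first and last name already appear in Customer (Juan Ortega, 2001).
combined = conn.execute("""
    SELECT CUST_NUM, CUST_FNAME, CUST_LNAME, CUST_BALANCE FROM Customer
    UNION ALL
    SELECT CUST_NUM, CUST_FNAME, CUST_LNAME, 0 FROM Customer_2 c2
    WHERE NOT EXISTS (SELECT 1 FROM Customer c
                      WHERE c.CUST_FNAME = c2.CUST_FNAME
                        AND c.CUST_LNAME = c2.CUST_LNAME)
    ORDER BY 1  -- sort by the first output column (CUST_NUM)
""").fetchall()

for row in combined:
    print(row)
```

Matching on first and last name is only as reliable as those columns; two different customers who share a name would be treated as the same person.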

Another option
Just use a full outer join on the two tables.
You will get all rows from both tables, matched and unmatched.
https://www.w3schools.com/sql/sql_join_full.asp

SQL Server : select from parent where id doesnt exist in either child tables

I need to select all rows from our customer table where they have no rows in the call table and no rows in the call archive table. Seems simple, but I have wrapped myself up in knots trying to get the query running.
So the structure of the tables is below: customer is the parent, with call and call archive both linked to the customer_id.
Can anyone help me out here please!
CREATE TABLE [dbo].[customer]
(
[customer_Id] [varchar](50) NOT NULL
CONSTRAINT [PK_customer]
PRIMARY KEY CLUSTERED ([customer_Id] ASC)
)
CREATE TABLE [dbo].[call]
(
[call_Id] [int] NOT NULL,
[customer_Id] [int] NULL,
[call_description] [varchar](50) NULL,
CONSTRAINT [PK_call]
PRIMARY KEY CLUSTERED ([call_Id] ASC)
)
ALTER TABLE [dbo].[call] WITH CHECK
ADD CONSTRAINT [FK_call_customer]
FOREIGN KEY([customerKey]) REFERENCES [dbo].[customer] ([customerkey])
GO
CREATE TABLE [dbo].[callArchive]
(
[call_Id] [int] NOT NULL,
[customer_Id] [int] NULL,
[call_description] [varchar](50) NULL,
CONSTRAINT [PK_callArchive]
PRIMARY KEY CLUSTERED ([call_Id] ASC)
)
ALTER TABLE [dbo].[callArchive] WITH CHECK
ADD CONSTRAINT [FK_callArchive_customer]
FOREIGN KEY([customerKey]) REFERENCES [dbo].[customer] ([customerkey])
GO
I tried doing a select count on the call_id columns using left outer joins but I am getting records in there that I was not expecting to see:
SELECT
COUNT(call.Call_id) AS Calls,
COUNT(callArchive.Call_id) AS Archive_Calls
FROM
customer
LEFT OUTER JOIN
call ON customer.customer_id = call.customer_id
LEFT OUTER JOIN
callArchive ON customer.customer_id = callArchive.customer_id
HAVING
((COUNT(callArchive.Call_id) = 0)
AND (COUNT(call.Call_id) = 0))
ORDER BY
customer.customer_dateAdded DESC

Instead of faffing around with joins, a much simpler approach would be to use the not exists operator:
SELECT *
FROM [customer] c
WHERE NOT EXISTS (SELECT *
FROM [call]
WHERE [call].customer_id = c.customer_id) AND
NOT EXISTS (SELECT *
FROM [callArchive]
WHERE [callArchive].customer_id = c.customer_id)

Faffing around with joins? Joins as 'not simple'? OK...now it's getting personal. :-P
SELECT * FROM [customer] cu
LEFT JOIN call c on cu.customer_id = c.customer_id
LEFT JOIN callArchive ca on cu.customer_id = ca.customer_id
WHERE c.customer_id is null AND ca.customer_id is null
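To compare the NOT EXISTS and LEFT JOIN ... IS NULL approaches side by side, here is a small sketch using Python's built-in sqlite3 module. The sample rows are invented, and customer_id is assumed to be an integer in all three tables (the posted schema mixes varchar and int).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE [call] (call_id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE callArchive (call_id INTEGER PRIMARY KEY, customer_id INTEGER);
    -- Hypothetical data: customer 1 has calls in both tables,
    -- customer 2 only in the archive, customers 3 and 4 have none.
    INSERT INTO customer VALUES (1), (2), (3), (4);
    INSERT INTO [call] VALUES (100, 1);
    INSERT INTO callArchive VALUES (200, 2), (201, 1);
""")

# Anti-join written with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT c.customer_id
    FROM customer c
    WHERE NOT EXISTS (SELECT 1 FROM [call]
                      WHERE [call].customer_id = c.customer_id)
      AND NOT EXISTS (SELECT 1 FROM callArchive
                      WHERE callArchive.customer_id = c.customer_id)
    ORDER BY c.customer_id
""").fetchall()

# The same anti-join written with LEFT JOINs and IS NULL filters.
left_join_rows = conn.execute("""
    SELECT cu.customer_id
    FROM customer cu
    LEFT JOIN [call] c ON cu.customer_id = c.customer_id
    LEFT JOIN callArchive ca ON cu.customer_id = ca.customer_id
    WHERE c.customer_id IS NULL AND ca.customer_id IS NULL
    ORDER BY cu.customer_id
""").fetchall()

print(not_exists_rows)
print(left_join_rows)
```

On this data both forms return the same customers; which one performs better on a real SQL Server database would have to be checked against the actual execution plans.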

Returns all values in 3 tables

I have three tables:
CREATE TABLE [dbo].[Data]
(
[PorID] [int] NOT NULL,
[HourS] [int] NOT NULL
)
CREATE TABLE [dbo].[TimeData]
(
[HId] [bigint] NOT NULL,
[HName] [varchar](50) NOT NULL,
[HHour] [int] NOT NULL
)
CREATE TABLE [dbo].PortInfo
(
[Id] [bigint] NOT NULL,
[PortName] [varchar](50) NOT NULL
)
Even if a port is not present in the Data table, the query should return rows for every port in the PortInfo table. Similarly, it should always return 24 records for each port. The result should display all ports for each record even if the port doesn't exist in the Data table.

Updated Answer
Based on what you are telling me in the comments and me filling in some blanks, this is what I assume you are looking for. This will produce a record for every hour, for every port.
SELECT
td.HHour,
td.HName,
pi.Id,
pi.PortName,
d.PorID,
d.HourS
FROM
dbo.TimeData td
FULL OUTER JOIN
dbo.PortInfo pi
ON (1 = 1)
LEFT OUTER JOIN
dbo.Data d
ON (d.PorID = pi.Id)
AND (d.HourS = td.HHour)
Some feedback to make this process easier. Share your schema (relationships) and/or some sample data. Also, consider creating more logical/intuitive names for your columns so that relationships and content may be implied.
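The every-hour-for-every-port query from the updated answer can be tried out with the following sketch, which uses Python's built-in sqlite3 module. It substitutes CROSS JOIN for FULL OUTER JOIN ... ON (1 = 1), which pairs the same rows when both tables are non-empty; the ports and data rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Data (PorID INTEGER NOT NULL, HourS INTEGER NOT NULL);
    CREATE TABLE TimeData (HId INTEGER NOT NULL, HName TEXT NOT NULL, HHour INTEGER NOT NULL);
    CREATE TABLE PortInfo (Id INTEGER NOT NULL, PortName TEXT NOT NULL);
    -- Hypothetical sample data: two ports, data only for port 1.
    INSERT INTO PortInfo VALUES (1, 'P1'), (2, 'P2');
    INSERT INTO Data VALUES (1, 5), (1, 7);
""")
# One TimeData row per hour of the day (0..23).
conn.executemany(
    "INSERT INTO TimeData VALUES (?, ?, ?)",
    [(h, f"hour {h}", h) for h in range(24)],
)

# CROSS JOIN builds every (hour, port) pair; the LEFT JOIN then attaches
# a Data row where one exists and NULLs where it does not.
rows = conn.execute("""
    SELECT td.HHour, pi.Id, pi.PortName, d.PorID, d.HourS
    FROM TimeData td
    CROSS JOIN PortInfo pi
    LEFT JOIN Data d
        ON d.PorID = pi.Id
       AND d.HourS = td.HHour
    ORDER BY pi.Id, td.HHour
""").fetchall()

# Count rows per port and pick out the pairs that had matching data.
per_port = {}
for row in rows:
    per_port[row[1]] = per_port.get(row[1], 0) + 1
matched = [row for row in rows if row[3] is not None]

print(per_port)
print(matched)
```

Every port gets 24 rows regardless of what is in Data, which is the behaviour the question asks for.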
Original Answer
This sounds like what you are looking for. The query below will return all port information (from PortInfo) even if its Id is not in the Data table (PorID). This is done by using a LEFT JOIN from the PortInfo table to the Data table.
SELECT
po.Id,
po.PortName,
d.HourS
FROM
dbo.PortInfo po
LEFT JOIN
dbo.Data d
ON (d.PorID = po.Id)
Now, you don't mention how or whether the third table, TimeData, should be used, but if you want that information in your result as well, you can simply LEFT JOIN it too:
SELECT
po.Id,
po.PortName,
d.HourS,
td.HName,
td.HHour
FROM
dbo.PortInfo po
LEFT JOIN
dbo.Data d
ON (d.PorID = po.Id)
LEFT JOIN
dbo.TimeData td
ON (td.HId = d.HourS) -- I assume this is the link, you may need to update if not.

Selecting a field in a subquery to be part of outer query.Sql Server 2008

How can I select a field that is in a WHERE NOT EXISTS subquery into the main query?
These two tables are not related in terms of foreign keys etc., and I have no control over them.
Let's take a fictitious sample: I have two tables, Category and Order.
CREATE TABLE [dbo].[Category](
[Id] [bigint] NOT NULL,
[Name] [varchar](255) NULL,
[OrderName] [varchar](50) NULL,
[CategoryType] [varchar](50) NULL
)
CREATE TABLE [dbo].[Order](
[OrderId] [bigint] NOT NULL,
[OrderName] [varchar](50) NULL
)
Given the query below, how can I rewrite it so that CategoryType is selected as well?
At the moment I get the following error:
"The multi-part identifier "C.CategoryType" could not be bound."
INSERT INTO Category(Name,OrderName,CategoryType)
SELECT DISTINCT 'Fruit',O.OrderName , C.CategoryType
FROM [Order]O
WHERE NOT EXISTS(
SELECT 1
FROM Category c
WHERE c.Name='Fruit'
AND C.OrderName=O.OrderName)
Or could you tell me the equivalent of the above query using a join instead of the WHERE NOT EXISTS?
Many thanks for any suggestions.

You cannot reference a subquery's table in the main query.
You can do this instead:
INSERT INTO Category(Name,OrderName,CategoryType)
SELECT DISTINCT 'Fruit', O.OrderName , c.CategoryType
FROM [Order]O
LEFT JOIN Category c
ON c.OrderName=O.OrderName
LEFT JOIN Category c2
ON c2.OrderName=O.OrderName
AND c2.Name = 'Fruit'
WHERE
c.OrderName IS NOT NULL
AND c2.OrderName IS NULL
Or, with your example:
INSERT INTO Category(Name,OrderName,CategoryType)
SELECT DISTINCT 'Fruit',O.OrderName , Cat.CategoryType
FROM [Order]O
LEFT JOIN Category cat
ON cat.OrderName=O.OrderName
WHERE NOT EXISTS(
SELECT 1
FROM Category c
WHERE c.Name='Fruit'
AND C.OrderName=O.OrderName)
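To see what the second variant selects, here is a sketch using Python's built-in sqlite3 module. It runs only the SELECT part (not the INSERT) against invented sample rows, and it lets SQLite assign Category.Id automatically, which is an assumption about the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Category (
        Id INTEGER PRIMARY KEY,  -- auto-assigned here; an assumption
        Name TEXT,
        OrderName TEXT,
        CategoryType TEXT
    );
    CREATE TABLE [Order] (OrderId INTEGER NOT NULL, OrderName TEXT);
    -- Hypothetical data: order A has a non-Fruit category,
    -- order B already has a Fruit category, order C has no category at all.
    INSERT INTO Category (Name, OrderName, CategoryType) VALUES
        ('Veg', 'A', 'T1'),
        ('Fruit', 'B', 'T2');
    INSERT INTO [Order] VALUES (1, 'A'), (2, 'B'), (3, 'C');
""")

# The joined alias "cat" lives in the outer query, so its CategoryType
# can be selected; the NOT EXISTS subquery only filters.
new_rows = conn.execute("""
    SELECT DISTINCT 'Fruit', O.OrderName, cat.CategoryType
    FROM [Order] O
    LEFT JOIN Category cat
        ON cat.OrderName = O.OrderName
    WHERE NOT EXISTS (SELECT 1
                      FROM Category c
                      WHERE c.Name = 'Fruit'
                        AND c.OrderName = O.OrderName)
    ORDER BY O.OrderName
""").fetchall()

print(new_rows)
```

Note that with the LEFT JOIN, an order with no Category row at all (C here) still comes through with a NULL CategoryType; the first variant, which filters on c.OrderName IS NOT NULL, would leave it out.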

T-SQL - get count of joined entries

I wonder how best to write the following query for Microsoft SQL Server.
I have three tables: surveys, survey_presets and survey_scenes. They have the following columns:
CREATE TABLE [dbo].[surveys](
[id] [int] IDENTITY(1,1) NOT NULL,
[caption] [nvarchar](255) NOT NULL,
[creation_time] [datetime] NOT NULL,
)
CREATE TABLE [dbo].[survey_presets](
[id] [int] IDENTITY(1,1) NOT NULL,
[survey_id] [int] NOT NULL,
[preset_id] [int] NOT NULL,
)
CREATE TABLE [dbo].[survey_scenes](
[id] [int] IDENTITY(1,1) NOT NULL,
[survey_id] [int] NOT NULL,
[scene_id] [int] NOT NULL,
)
Both survey_presets and survey_scenes have foreign keys on surveys for survey_id column.
Now I want to select all surveys with the count of corresponding presets and scenes for each. Here is the "pseudo-query" of what I want:
SELECT
surveys.*,
COUNT(survey_presets, where survey_presets.survey_id = surveys.id),
COUNT(survey_scenes, where survey_scenes.survey_id = surveys.id)
FROM surveys
ORDER BY surveys.creation_time
I can do a mess with SELECT DISTINCT, JOIN, GROUP BY, etc., but I'm new to T-SQL and I doubt my query will be optimal in any sense.

I would do the counting in subqueries to avoid a Cartesian product. If a survey has several matching rows in presets and also several in scenes, a plain join pairs every preset row with every scene row, so the resulting counts are multiplied. Alternatively, you could write a simple join query and avoid the multiplication by counting distinct survey_presets.id and distinct survey_scenes.id.
SELECT
surveys.*,
isnull(presets_count, 0) presets_count,
isnull(scenes_count, 0) scenes_count
FROM surveys
LEFT JOIN
(
SELECT survey_id,
count(*) presets_count
FROM survey_presets
GROUP BY survey_id
) presets
ON surveys.id = presets.survey_id
LEFT JOIN
(
SELECT survey_id,
count(*) scenes_count
FROM survey_scenes
GROUP BY survey_id
) scenes
ON surveys.id = scenes.survey_id
ORDER BY surveys.creation_time
How it works
You can introduce a special kind of subquery, called a derived table, into the FROM section of your query. A derived table is defined as a normal query enclosed in parentheses and followed by a table alias. It cannot use any column from the outer query, but it can expose columns that you use in the ON section to join the derived table to the main body of the query.
In this case each derived table simply counts rows grouped by survey_id; the joins connect the counts to surveys.
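The multiplication effect and the two ways around it can be checked with a short sketch using Python's built-in sqlite3 module. The sample rows are invented (survey 1 has two presets and three scenes, survey 2 has none), and COALESCE stands in for T-SQL's ISNULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE surveys (id INTEGER PRIMARY KEY, caption TEXT NOT NULL, creation_time TEXT NOT NULL);
    CREATE TABLE survey_presets (id INTEGER PRIMARY KEY, survey_id INTEGER NOT NULL, preset_id INTEGER NOT NULL);
    CREATE TABLE survey_scenes (id INTEGER PRIMARY KEY, survey_id INTEGER NOT NULL, scene_id INTEGER NOT NULL);
    INSERT INTO surveys VALUES (1, 'first', '2020-01-01'), (2, 'second', '2020-01-02');
    INSERT INTO survey_presets (survey_id, preset_id) VALUES (1, 10), (1, 11);
    INSERT INTO survey_scenes (survey_id, scene_id) VALUES (1, 20), (1, 21), (1, 22);
""")

# Naive double join: each preset row pairs with each scene row (2 x 3 = 6).
naive = conn.execute("""
    SELECT s.id, COUNT(p.id), COUNT(sc.id)
    FROM surveys s
    LEFT JOIN survey_presets p ON p.survey_id = s.id
    LEFT JOIN survey_scenes sc ON sc.survey_id = s.id
    GROUP BY s.id
    ORDER BY s.id
""").fetchall()

# Derived tables: count each child table on its own, then join the counts.
derived = conn.execute("""
    SELECT s.id,
           COALESCE(presets.presets_count, 0),
           COALESCE(scenes.scenes_count, 0)
    FROM surveys s
    LEFT JOIN (SELECT survey_id, COUNT(*) AS presets_count
               FROM survey_presets GROUP BY survey_id) presets
        ON s.id = presets.survey_id
    LEFT JOIN (SELECT survey_id, COUNT(*) AS scenes_count
               FROM survey_scenes GROUP BY survey_id) scenes
        ON s.id = scenes.survey_id
    ORDER BY s.id
""").fetchall()

# Same double join as "naive", but counting distinct child ids.
distinct = conn.execute("""
    SELECT s.id, COUNT(DISTINCT p.id), COUNT(DISTINCT sc.id)
    FROM surveys s
    LEFT JOIN survey_presets p ON p.survey_id = s.id
    LEFT JOIN survey_scenes sc ON sc.survey_id = s.id
    GROUP BY s.id
    ORDER BY s.id
""").fetchall()

print(naive)
print(derived)
print(distinct)
```

The derived-table and COUNT(DISTINCT ...) versions agree, while the naive join reports six for both columns of survey 1.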

SELECT surveys.ID, surveys.caption, surveys.creation_time,
count(DISTINCT survey_presets.id) as survey_presets,
count(DISTINCT survey_scenes.id) as survey_scenes
FROM surveys
LEFT OUTER JOIN survey_presets on survey_presets.survey_id = surveys.id
LEFT OUTER JOIN survey_scenes on survey_scenes.survey_id = surveys.id
GROUP BY surveys.ID, surveys.caption, surveys.creation_time
ORDER BY surveys.creation_time
