Duplicate rows when joining tables

I am having trouble with a SQL query. I get a lot of duplicate rows and I don't know how to fix it.
I have the following tables:
tblCGG with columns: listId, description
tblCLA with columns: listid, CLADescription
tblHEA with columns: listid, HEADescription
tblACT with columns: listid, ACTDescription
If I query these tables separately with listid = '132623', I get the following output:
tblCGG: 1 row
tblCLA: 4 rows
tblHEA: 10 rows
tblACT: 4 rows
I want to join these tables together, but I am getting way too many rows.
I tried the query below, but I get 160 rows:
select distinct cgg.listid, cla.claDescription, hea.heaDescription,
act.actDescription
from tblCGG cgg
left join tblCLA cla on cgg.listid = cla.listid
left join tblHEA hea on cgg.listid = hea.listid
left join tblACT act on cgg.listid = act.listid
where cgg.listid = '132623'
Desired Output
listid claDescription heaDescription actDescription
132623 claTest hea1 act1
132623 clads hea2 act2
132623 cloas hea3 act3
132623 ccaa hea4 act4
132623 null hea5 null
132623 null hea6 null
132623 null hea7 null
132623 null hea8 null
132623 null hea9 null
132623 null hea10 null

I am not sure the desired output really makes sense, but if it is what you really, REALLY need, then:
select coalesce(t.listid, c.listid, a.listid, h.listid) listid,
cladescription, headescription, actdescription
from tblcgg t
FULL OUTER join (select a.*, row_number() over(partition by listid order by cladescription) seq_no from tblcla a) c on t.listid=c.listid
FULL OUTER join (select a.*, row_number() over(partition by listid order by actdescription) seq_no from tblact a) a on t.listid=a.listid and a.seq_no=c.seq_no
FULL OUTER join (select a.*, row_number() over(partition by listid order by headescription) seq_no from tblhea a) h on h.listid=a.listid and (h.seq_no=c.seq_no or h.seq_no=a.seq_no)
where coalesce(t.listid, c.listid, a.listid, h.listid)=132623
I am not happy with this code, because it will perform poorly on bigger datasets, but I can't quickly find a better solution without writing a function.
A few words of explanation:
row_number() is a window function that assigns a sequence number to each description within each table (you can change its "order by" to get the ordering you want)
full outer join shouldn't be used lightly, since it tends to perform poorly, but it suits the rather unusual output you want
coalesce() returns the first non-null value
You really should consider whether a union all of the descriptions would serve you better:
select listid, 'cgg' source,description from tblcgg where listid=132623
UNION ALL
select listid, 'act' source,actdescription from tblact where listid=132623
UNION ALL
select listid, 'head' source,headescription from tblhea where listid=132623
UNION ALL
select listid, 'cla' source,cladescription from tblcla where listid=132623

You want a separate list in each column. This isn't really a SQL'ish thing to do, but you can arrange it. One method uses row_number() and group by:
select listid, max(claDescription) as claDescription,
max(heaDescription) as heaDescription,
max(actDescription) as actDescription
from ((select cla.listid, cla.claDescription, NULL as heaDescription, NULL as actDescription,
row_number() over (partition by cla.listid order by cla.listid) as seqnum
from tblCLA cla
) union all
(select hea.listid, NULL as claDescription, hea.heaDescription, NULL as actDescription,
row_number() over (partition by hea.listid order by hea.listid) as seqnum
from tblHEA hea
) union all
(select act.listid, NULL as claDescription, NULL as heaDescription, act.actDescription,
row_number() over (partition by act.listid order by act.listid) as seqnum
from tblACT act
)
) x
where listid = 132623 -- only use single quotes if this is really a string
group by listid, seqnum;
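The row_number() and group by method can be tried on sample data. The following is a minimal sketch, not the answerer's exact code: it uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer), the description values are made up, and the ordering inside row_number() is by description rather than by listid so that the result is deterministic. It also shows where the 160 rows in the question come from (4 x 10 x 4).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblCLA (listid INTEGER, claDescription TEXT);
    CREATE TABLE tblHEA (listid INTEGER, heaDescription TEXT);
    CREATE TABLE tblACT (listid INTEGER, actDescription TEXT);
""")
# Made-up sample rows: 4 CLA, 10 HEA and 4 ACT descriptions for one listid.
conn.executemany("INSERT INTO tblCLA VALUES (132623, ?)", [(f"cla{i}",) for i in range(1, 5)])
conn.executemany("INSERT INTO tblHEA VALUES (132623, ?)", [(f"hea{i:02d}",) for i in range(1, 11)])
conn.executemany("INSERT INTO tblACT VALUES (132623, ?)", [(f"act{i}",) for i in range(1, 5)])

# Joining only on listid pairs every row with every other row.
fan_out = conn.execute("""
    SELECT COUNT(*)
    FROM tblCLA cla
    JOIN tblHEA hea ON hea.listid = cla.listid
    JOIN tblACT act ON act.listid = cla.listid
""").fetchone()[0]

# Number the rows of each table, stack them, then collapse by sequence number.
rows = conn.execute("""
    SELECT listid,
           MAX(claDescription), MAX(heaDescription), MAX(actDescription)
    FROM (
        SELECT listid, claDescription, NULL AS heaDescription, NULL AS actDescription,
               ROW_NUMBER() OVER (PARTITION BY listid ORDER BY claDescription) AS seqnum
        FROM tblCLA
        UNION ALL
        SELECT listid, NULL, heaDescription, NULL,
               ROW_NUMBER() OVER (PARTITION BY listid ORDER BY heaDescription)
        FROM tblHEA
        UNION ALL
        SELECT listid, NULL, NULL, actDescription,
               ROW_NUMBER() OVER (PARTITION BY listid ORDER BY actDescription)
        FROM tblACT
    ) x
    WHERE listid = 132623
    GROUP BY listid, seqnum
    ORDER BY seqnum
""").fetchall()

print(fan_out)    # 160
print(len(rows))  # 10
print(rows[0])    # (132623, 'cla1', 'hea01', 'act1')
print(rows[9])    # (132623, None, 'hea10', None)
```

The row count of the result equals the size of the largest list, and the shorter lists are padded with NULLs.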

The following query will give the results you're looking for. It's a slight mod of your original, but depends on knowing that tblHEA has the most rows in it:
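WITH ctecla as (select listid, cladescription, row_number() over (partition by listid order by cladescription) as cla_rownum from tblcla),
ctehea as (select listid, headescription, row_number() over (partition by listid order by headescription) as hea_rownum from tblhea),
cteact as (select listid, actdescription, row_number() over (partition by listid order by actdescription) as act_rownum from tblact)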
select cgg.listid,
cla.claDescription,
hea.heaDescription,
act.actDescription
from tblCGG cgg
left join cteHEA hea
on hea.listid = cgg.listid
left join cteCLA cla
on cla.listid = hea.listid AND
cla.cla_rownum = hea.hea_rownum
left join cteACT act
on act.listid = hea.listid AND
act.act_rownum = hea.hea_rownum
where cgg.listid = '132623';

Related

Remove multiple rows with same ID

So I've done some looking around and wasn't able to find quite what I was looking for. I have two tables:
1.) A table where general user information is stored
2.) A table where a status is generated and stored
The problem is that there are multiple rows for the same user, so querying them returns multiple rows. I can't just merge them because they don't all have the same status. I need just the newest status from that table.
Example of the query:
SELECT DISTINCT
TOP(50) cam.UserID AS PatientID,
mppi.DisplayName AS Surgeon,
ISNULL(sci.IOPStatus, 'N/A') AS Status,
tkstat.TrackerStatusID AS Stat_2
FROM
Main AS cam
INNER JOIN
Providers AS rap
ON cam.VisitID = rap.VisitID
INNER JOIN
ProviderInfo AS mppi
ON rap.UnvUserID = mppi.UnvUserID
LEFT OUTER JOIN
Inop AS sci
ON cam.CwsID = sci.CwsID
LEFT OUTER JOIN
TrackerStatus AS tkstat
ON cam.CwsID = tkstat.CwsID
WHERE
(
cam.Location_ID IN
(
'SURG'
)
)
AND
(
rap.IsAttending = 'Y'
)
AND
(
cam.DateTime BETWEEN CONCAT(CAST(GETDATE() AS DATE), ' 00:00:00') AND CONCAT(CAST(GETDATE() AS DATE), ' 23:59:59')
)
AND
(
cam.Status_StatusID != 'Cancelled'
)
ORDER BY
cam.UserID ASC
So I need to grab only the newest Stat_2 for each ID so that the query doesn't return multiple rows. Each Stat_2 also has an update time, so I can sort by the date/time column, StatusDateTime.

One way to handle this is to calculate a row_number for the table where you need the newest record.
The easiest way to do that is to change your tkstat join to a derived table that includes the row_number calculation, and then add a condition to the join so that rn = 1:
SELECT DISTINCT TOP (50)
cam.UserID AS PatientID, mppi.DisplayName AS Surgeon, ISNULL(sci.IOPStatus, 'N/A') AS Status, tkstat.TrackerStatusID AS Stat_2
FROM Main AS cam
INNER JOIN Providers AS rap ON cam.VisitID = rap.VisitID
INNER JOIN ProviderInfo AS mppi ON rap.UnvUserID = mppi.UnvUserID
LEFT OUTER JOIN Inop AS sci ON cam.CwsID = sci.CwsID
LEFT OUTER JOIN (SELECT tk.CwsID, tk.TrackerStatusId, ROW_NUMBER() OVER (PARTITION BY tk.cwsId ORDER BY tk.CreationDate DESC) AS rn FROM TrackerStatus tk)AS tkstat ON cam.CwsID = tkstat.CwsID
AND tkstat.rn = 1
WHERE (cam.Location_ID IN ( 'SURG' )) AND (rap.IsAttending = 'Y')
AND (cam.DateTime BETWEEN CONCAT(CAST(GETDATE() AS DATE), ' 00:00:00') AND CONCAT(CAST(GETDATE() AS DATE), ' 23:59:59'))
AND (cam.Status_StatusID != 'Cancelled')
ORDER BY cam.UserID ASC;
Note that you need a way to determine which status is the "newest". I assumed a CreationDate column; substitute the correct column name (the question mentions StatusDateTime):
ROW_NUMBER() OVER (PARTITION BY tk.cwsId ORDER BY tk.CreationDate DESC) AS rn

SQL Server doesn't offer a FIRST function, but you can reproduce the functionality with ROW_NUMBER() like this:
With Qry1 As (
Select <other columns>,
ROW_NUMBER() OVER(
PARTITION BY <group by columns>
ORDER BY <time stamp column*> DESC
) As Seq
From <the rest of your select statement>
)
Select *
From Qry1
Where Seq = 1
* for the "newest" record.
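The ROW_NUMBER() pattern above can be checked on a small table. This is a sketch under stated assumptions: it runs on SQLite through Python's sqlite3 module rather than on SQL Server, the TrackerStatus rows and status names are invented, and StatusDateTime is taken from the question as the timestamp column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TrackerStatus (CwsID INTEGER, TrackerStatusID TEXT, StatusDateTime TEXT);
    -- Invented sample data: several status updates per CwsID.
    INSERT INTO TrackerStatus VALUES
        (1, 'Waiting',  '2024-01-01 08:00:00'),
        (1, 'InRoom',   '2024-01-01 09:30:00'),
        (1, 'Recovery', '2024-01-01 11:15:00'),
        (2, 'Waiting',  '2024-01-01 08:45:00'),
        (2, 'InRoom',   '2024-01-01 10:00:00');
""")

# Number the rows of each CwsID from newest to oldest, then keep number 1.
latest = conn.execute("""
    WITH ranked AS (
        SELECT CwsID, TrackerStatusID,
               ROW_NUMBER() OVER (PARTITION BY CwsID
                                  ORDER BY StatusDateTime DESC) AS rn
        FROM TrackerStatus
    )
    SELECT CwsID, TrackerStatusID
    FROM ranked
    WHERE rn = 1
    ORDER BY CwsID
""").fetchall()

print(latest)  # [(1, 'Recovery'), (2, 'InRoom')]
```

Because the filter leaves at most one status row per CwsID, joining this result to the main table can no longer multiply its rows.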

Selecting from two queries select not null

I have a query whose result is a single value. In many cases it returns a null value, which is not what I need, so I have to run another query to get a value. I need a single query that skips the first query's result when it is null and runs the second query instead.
The first query is:
SELECT DISTINCT
FIRST_VALUE (pac1.pac_name)
OVER (ORDER BY pac1.pac_final_date DESC)
FROM matricula mac
INNER JOIN
periodo pac1
ON mac.pac_id = pac1.pac_id
WHERE mac.ent_id = 26172 AND mac.mac_estado IN (8072, 10221)
The second query is:
SELECT DISTINCT
FIRST_VALUE (pac1.pac_name)
OVER (ORDER BY pac1.pac_final_date DESC)
FROM registro rea
INNER JOIN
periodo pac1
ON rea.pac_id = pac1.pac_id
WHERE rea.ent_id = 26172
Both queries return the same kind of value, but I need to try the first query first. There are two cases.
Case 1: query #1 is executed and returns the result
Result
FIRST_VALUE (pac1.pac_name)
--------------------------
|Oct/2012 - Feb/2013 |
--------------------------
Case 2: query #1 returns null, so query #2 is executed, which is guaranteed to return a value
Result
FIRST_VALUE (pac1.pac_name)
--------------------------
|Oct/2012 - Feb/2013 |
--------------------------
This is probably an easy question, but any help is appreciated.

SELECT COALESCE(
(SELECT DISTINCT
FIRST_VALUE (pac1.pac_name)
OVER (ORDER BY pac1.pac_final_date DESC)
FROM matricula mac
INNER JOIN
periodo pac1
ON mac.pac_id = pac1.pac_id
WHERE mac.ent_id = 26172 AND mac.mac_estado IN (8072, 10221) )
,(SELECT DISTINCT
FIRST_VALUE (pac1.pac_name)
OVER (ORDER BY pac1.pac_final_date DESC)
FROM registro rea
INNER JOIN
periodo pac1
ON rea.pac_id = pac1.pac_id
WHERE rea.ent_id = 26172))
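The fallback behaviour of COALESCE over two scalar subqueries can be seen in a small experiment. This is only a sketch: it uses SQLite through Python's sqlite3 module, the rows and the entity ids 100, 200 and 300 are made up, and each subquery uses ORDER BY ... LIMIT 1 in place of DISTINCT FIRST_VALUE(...) to pick the latest period.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE periodo   (pac_id INTEGER, pac_name TEXT, pac_final_date TEXT);
    CREATE TABLE matricula (ent_id INTEGER, pac_id INTEGER, mac_estado INTEGER);
    CREATE TABLE registro  (ent_id INTEGER, pac_id INTEGER);
    INSERT INTO periodo VALUES (1, 'Mar/2012 - Jul/2012', '2012-07-31'),
                               (2, 'Oct/2012 - Feb/2013', '2013-02-28');
    -- Entity 100 has a qualifying matricula row; entity 200 has only registro rows.
    INSERT INTO matricula VALUES (100, 1, 8072);
    INSERT INTO registro  VALUES (100, 2), (200, 1), (200, 2);
""")

# COALESCE returns the first subquery's value unless it is NULL (no row found).
QUERY = """
    SELECT COALESCE(
        (SELECT p.pac_name
           FROM matricula m JOIN periodo p ON p.pac_id = m.pac_id
          WHERE m.ent_id = :ent AND m.mac_estado IN (8072, 10221)
          ORDER BY p.pac_final_date DESC LIMIT 1),
        (SELECT p.pac_name
           FROM registro r JOIN periodo p ON p.pac_id = r.pac_id
          WHERE r.ent_id = :ent
          ORDER BY p.pac_final_date DESC LIMIT 1))
"""

def latest_period(ent_id):
    """Return the period name from matricula, falling back to registro."""
    return conn.execute(QUERY, {"ent": ent_id}).fetchone()[0]

print(latest_period(100))  # Mar/2012 - Jul/2012  (first query wins)
print(latest_period(200))  # Oct/2012 - Feb/2013  (fallback to second query)
print(latest_period(300))  # None                 (neither query finds a row)
```

A scalar subquery that finds no row evaluates to NULL, which is what lets COALESCE move on to the second subquery.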
I am not positive without test data and a few more details, but you could also use periodo as the main table and then LEFT OUTER JOIN the other two tables. Use a case expression in the order by to determine which table's rows to prefer. A similar method with EXISTS could probably also be considered:
SELECT DISTINCT FIRST_VALUE(pac1.pac_name) OVER (ORDER BY
CASE WHEN mac.pac_id IS NOT NULL THEN 0 ELSE 1 END, pac1.pac_final_date DESC)
FROM
periodo pac1
LEFT JOIN matricula mac
ON pac1.pac_id = mac.pac_id
AND mac.ent_id = 26172
AND mac.mac_estado IN (8072, 10221)
LEFT JOIN registro rea
ON pac1.pac_id = rea.pac_id
AND rea.ent_id = 26172

Cross apply a table valued function

A real mind bender here guys!
I have a table which basically positions users in a league:
LeagueID Stake League_EntryID UserID TotalPoints TotalBonusPoints Prize
13028 2.00 58659 2812 15 5 NULL
13028 2.00 58662 3043 8 3 NULL
13029 5.00 58665 2812 8 3 NULL
The League_EntryID is the unique field here, but you will see this query returns multiple leagues that the user has entered for that day.
I also have a table-valued function which returns the current prize standings for a league. It accepts the LeagueID as a parameter and returns the people who qualify for prize money. This is a complex function which ideally I would like to keep as a function accepting the LeagueID. Its result looks like this:
UserID Position League_EntryID WinPerc Prize
2812 1 58659 36.000000 14.00
3043 6 58662 2.933333 4.40
3075 6 58664 2.933333 4.40
What I want to do is join the table-valued function to the topmost query, passing in the LeagueID, to fill in the Prize field for each League_EntryID, i.e.
SELECT * FROM [League]
INNER JOIN [League_Entry] ON [League].[LeagueID] = [League_Entry].[LeagueID]
INNER JOIN [dbo].[GetPrizesForLeague]([League].[LeagueID]) ....
I'm not sure if a CROSS APPLY would work here, but I believe I need to JOIN on both the LeagueID and the League_EntryID to get my value for the Prize. I'm not sure of the best way to do this without resorting to a scalar function which would in turn call the table-valued function and obtain the Prize from that.
Speed is worrying me here.
P.S. Not all League_EntryID's will exist as a part of the table value function output so maybe an OUTER JOIN/APPLY can be used?
EDIT See the query below
SELECT DISTINCT [LeagueID],
[CourseName],
[Refunded],
[EntryID],
[Stake],
d.[League_EntryID],
d.[UserID],
[TotalPoints],
[TotalBonusPoints],
[TotalPointsLastRace],
[TotalBonusPointsLastRace],
d.[Prize],
[LeagueSizeID],
[TotalPool],
d.[Position],
[PositionLastRace],
t.Prize
FROM
(
SELECT [LeagueID],
[EntryID],
[Stake],
[MeetingID],
[Refunded],
[UserID],
[League_EntryID],
[TotalPoints],
[TotalBonusPoints],
[TotalPointsLastRace],
[TotalBonusPointsLastRace],
[Prize],
[LeagueSizeID],
[dbo].[GetTotalPool]([LeagueID], 1) AS [TotalPool],
RANK() OVER( PARTITION BY [LeagueID] ORDER BY [TotalPoints] DESC, [TotalBonusPoints] DESC) AS [Position],
RANK() OVER( PARTITION BY [LeagueID] ORDER BY [TotalPointsLastRace] DESC, [TotalBonusPointsLastRace] DESC) AS [PositionLastRace],
ROW_NUMBER() OVER (PARTITION BY [LeagueID]
ORDER BY [TotalPoints] DESC, [TotalBonusPoints] DESC
) as [Position_Rownum]
FROM [DATA] ) AS d
INNER JOIN [Meeting] WITH (NOLOCK) ON [d].[MeetingID] = [Meeting].[MeetingID]
INNER JOIN [Course] ON [Meeting].[CourseID] = [Course].[CourseID]
OUTER APPLY (SELECT * FROM [dbo].[GetLeaguePrizes](d.[LeagueID])) t
WHERE (
([LeagueSizeID] = 3 AND [Position_Rownum] <= 50)
OR (d.[UserID] = @UserID AND [LeagueSizeID] = 3)
)
OR
(
[LeagueSizeID] in (1,2)
)
ORDER BY [LeagueID], [Position]
Any direction would be appreciated.

You need to use OUTER APPLY (a mix of CROSS APPLY and LEFT JOIN).
SELECT * FROM [League]
INNER JOIN [League_Entry] ON [League].[LeagueID] = [League_Entry].[LeagueID]
OUTER APPLY [dbo].[GetPrizesForLeague]([League].[LeagueID]) t
Performance is generally good with CROSS APPLY/OUTER APPLY, and it can replace some correlated subqueries and cursors.
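To make the semantics of OUTER APPLY concrete, here is a rough emulation in plain Python, not T-SQL. Everything in it is an assumption for illustration: get_prizes_for_league is a hypothetical stand-in for the table-valued function, the data is copied loosely from the question, and the extra match on League_EntryID corresponds to applying a subquery such as (SELECT * FROM dbo.GetPrizesForLeague(LeagueID) p WHERE p.League_EntryID = League_Entry.League_EntryID).

```python
def get_prizes_for_league(league_id):
    """Hypothetical stand-in for the table-valued function in the question."""
    prizes = {
        13028: [
            {"League_EntryID": 58659, "Prize": 14.00},
            {"League_EntryID": 58662, "Prize": 4.40},
        ],
    }
    # Leagues without prize winners return an empty result set.
    return prizes.get(league_id, [])


entries = [
    {"LeagueID": 13028, "League_EntryID": 58659},
    {"LeagueID": 13028, "League_EntryID": 58662},
    {"LeagueID": 13029, "League_EntryID": 58665},
]


def outer_apply(rows, func):
    """Call func once per left row; keep the left row even if nothing matches."""
    out = []
    for row in rows:
        matches = [p for p in func(row["LeagueID"])
                   if p["League_EntryID"] == row["League_EntryID"]]
        if matches:
            out.extend({**row, "Prize": m["Prize"]} for m in matches)
        else:
            # This is the OUTER part: CROSS APPLY would drop the row here.
            out.append({**row, "Prize": None})
    return out


result = outer_apply(entries, get_prizes_for_league)
print([r["Prize"] for r in result])  # [14.0, 4.4, None]
```

Without the match on League_EntryID, every entry in a league would be paired with every prize row for that league, which is one way the extra rows hidden by DISTINCT in the question's query can arise.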

INNER JOIN on a Sub Query

I have a list of tasks in a table called dbo.Task
In the database, each Task can have 1 or more rows in the TaskLine table.
TaskLine has a TaskID to relate the TaskLines to the Task.
A TaskLine has a column called TaskHeadingTypeID
I need to return all the tasks, joined to the LAST TaskLine for that Task.
In English: I need to display each task with its latest TaskLine heading. So I basically need to join to the TaskLine table like this (which is incorrect and maybe inefficient, but hopefully shows what I am trying to do):
SELECT *
FROM #Results r
INNER JOIN (
SELECT TOP 1 TaskID, TaskHeadingTypeID FROM dbo.TaskLine
ORDER BY TaskLineID DESC
) tl
ON tl.TaskID = r.TaskID
However, the issue is that the subquery brings back only the single last TaskLine row overall, not the last row for each task.
Edit:
At the moment, it's 'Working' like the code below, but it seems highly inefficient, because for each task row, it has to run two extra queries. And they're both on the same table, just slightly different columns in that table:
(An extract of the columns in the SELECT clause)
SELECT TaskStatusID,
TaskStatus,
(SELECT TOP 1 TaskHeadingTypeID FROM dbo.TaskLine
WHERE TaskID = r.TaskID
ORDER BY TaskLineID DESC) AS TaskHeadingID,
(SELECT TOP 1 LongName FROM dbo.TaskLine tl
INNER JOIN ref.TaskHeadingType tht
ON tht.TaskHeadingTypeID = tl.TaskHeadingTypeID
WHERE TaskID = r.TaskID
ORDER BY TaskLineID DESC) AS TaskHeading,
PersonInCareID,
ICMSPartyID,
CarerID.... FROM...
EDIT:
Thanks to the ideas and comments below, I have ended up with this, using CTE:
;WITH ValidTaskLines (RowNumber, TaskID, TaskHeadingTypeID, TaskHeadingType)
AS
(SELECT
ROW_NUMBER()OVER(PARTITION BY tl.TaskID, tl.TaskHeadingTypeID ORDER BY tl.TaskLineID) AS RowNumber,
tl.TaskID,
tl.TaskHeadingTypeID,
LongName AS TaskHeadingType
FROM dbo.TaskLine tl
INNER JOIN ref.TaskHeadingType tht
ON tht.TaskHeadingTypeID = tl.TaskHeadingTypeID
)
SELECT AssignedByBusinessUserID,
BusinessUserID,
LoginName,
Comments,
r.CreateDate,
r.CreateUser,
r.Deleted,
r.Version,
IcmsBusinessUserID,
r.LastUpdateDate,
r.LastUpdateUser,
OverrrideApprovalBusinessUserID,
PlacementID,
r.TaskID,
TaskPriorityTypeID,
TaskPriorityCode,
TaskPriorityType,
TaskStatusID,
TaskStatus,
vtl.TaskHeadingTypeID AS TaskHeadingID,
vtl.TaskHeadingType AS TaskHeading,
PersonInCareID,
ICMSPartyID,
CarerID,
ICMSCarerEntityID,
StartDate,
EndDate
FROM #Results r
INNER JOIN ValidTaskLines vtl
ON vtl.TaskID = r.TaskID
AND vtl.RowNumber = 1

You could use the ROW_NUMBER() function for this:
SELECT *
FROM #Results r
INNER JOIN (SELECT TaskID
, TaskHeadingTypeID
, ROW_NUMBER() OVER (PARTITION BY TaskID, TaskHeadingTypeID ORDER BY TaskLineID DESC) RN
FROM dbo.TaskLine
) tl
ON tl.TaskID = r.TaskID
AND tl.RN = 1
The ROW_NUMBER() function assigns a number to each row. PARTITION BY is optional; it restarts the numbering for each value in that group, i.e. if you PARTITION BY Some_Date, the numbering starts over at 1 for each unique date value. ORDER BY defines the order in which rows are numbered and is required in the ROW_NUMBER() function.
You may need to adjust the PARTITION BY to suit your query. Run the subquery by itself to get an idea of how ROW_NUMBER() works.
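The effect of the PARTITION BY choice is easy to see by running the subquery both ways. The sketch below uses SQLite through Python's sqlite3 module with invented TaskLine rows, and last_lines is a helper name made up for this example. Partitioning by TaskID alone keeps one row per task, while partitioning by TaskID and TaskHeadingTypeID keeps one row per task and heading type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TaskLine (TaskLineID INTEGER, TaskID INTEGER, TaskHeadingTypeID INTEGER);
    -- Invented sample data: task 10 has three lines, task 20 has two.
    INSERT INTO TaskLine VALUES
        (1, 10, 1), (2, 10, 2), (3, 10, 2),
        (4, 20, 1), (5, 20, 3);
""")


def last_lines(partition_by):
    """Return the rows numbered 1 when counting down from the highest TaskLineID."""
    sql = f"""
        SELECT TaskID, TaskHeadingTypeID, TaskLineID
        FROM (SELECT tl.*,
                     ROW_NUMBER() OVER (PARTITION BY {partition_by}
                                        ORDER BY TaskLineID DESC) AS RN
              FROM TaskLine tl)
        WHERE RN = 1
        ORDER BY TaskID, TaskLineID
    """
    return conn.execute(sql).fetchall()


per_task = last_lines("TaskID")
per_task_and_heading = last_lines("TaskID, TaskHeadingTypeID")

print(per_task)              # [(10, 2, 3), (20, 3, 5)]
print(per_task_and_heading)  # [(10, 1, 1), (10, 2, 3), (20, 1, 4), (20, 3, 5)]
```

If the goal is exactly one TaskLine per task, the first form is the one that guarantees it.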

Multiple MAX values select using inner join

I have a query that works only when the values in StakeValue don't repeat.
Basically, I need to select maximum values from SI_STAKES table with their relations from two other tables grouped by internal type.
SELECT a.StakeValue, b.[StakeName], c.[ProviderName]
FROM SI_STAKES AS a
INNER JOIN SI_STAKESTYPES AS b ON a.[StakeTypeID] = b.[ID]
INNER JOIN SI_PROVIDERS AS c ON a.[ProviderID] = c.[ID] WHERE a.[EventID]=6
AND a.[StakeGroupTypeID]=1
AND a.StakeValue IN
(SELECT MAX(d.StakeValue) FROM SI_STAKES AS d
WHERE d.[EventID]=a.[EventID] AND d.[StakeGroupTypeID]=a.[StakeGroupTypeID]
GROUP BY d.[StakeTypeID])
ORDER BY b.[StakeName], a.[StakeValue] DESC
The results should look like this, for example:
[ID] [MaxValue] [StakeTypeID] [ProviderName]
1 1,5 6 provider1
2 3,75 7 provider2
3 7,6 8 provider3
Thank you for your help

There are two problems to solve here.
1) Finding the max values per type. This will get the Max value per StakeType and make sure that we do the exercise only for the wanted events and group type.
SELECT StakeGroupTypeID, EventID, StakeTypeID, MAX(StakeValue) AS MaxStakeValue
FROM SI_STAKES AS Stake
WHERE Stake.[EventID]=6
AND Stake.[StakeGroupTypeID]=1
GROUP BY StakeGroupTypeID, EventID, StakeTypeID
2) Then we need to get only one row back for that value, since it may be present more than once.
Using the max value, we must find a unique row for each type. I usually do this by taking the max ID, which has the added advantage of giving me the most recent entry.
SELECT MAX(SMaxID.ID) AS ID
FROM SI_STAKES AS SMaxID
INNER JOIN (
SELECT StakeGroupTypeID, EventID, StakeTypeID, MAX(StakeValue) AS MaxStakeValue
FROM SI_STAKES AS Stake
WHERE Stake.[EventID]=6
AND Stake.[StakeGroupTypeID]=1
GROUP BY StakeGroupTypeID, EventID, StakeTypeID
) AS SMaxVal ON SMaxID.StakeTypeID = SMaxVal.StakeTypeID
AND SMaxID.StakeValue = SMaxVal.MaxStakeValue
AND SMaxID.EventID = SMaxVal.EventID
AND SMaxID.StakeGroupTypeID = SMaxVal.StakeGroupTypeID
GROUP BY SMaxID.StakeTypeID
3) Now that we have the ID's of the rows that we want, we can just get that information.
SELECT Stake.ID, Stake.StakeValue, SType.StakeName, SProv.ProviderName
FROM SI_STAKES AS Stake
INNER JOIN SI_STAKESTYPES AS SType ON Stake.[StakeTypeID] = SType.[ID]
INNER JOIN SI_PROVIDERS AS SProv ON Stake.[ProviderID] = SProv.[ID]
WHERE Stake.ID IN (
SELECT MAX(SMaxID.ID) AS ID
FROM SI_STAKES AS SMaxID
INNER JOIN (
SELECT StakeGroupTypeID, EventID, StakeTypeID, MAX(StakeValue) AS MaxStakeValue
FROM SI_STAKES AS Stake
WHERE Stake.[EventID]=6
AND Stake.[StakeGroupTypeID]=1
GROUP BY StakeGroupTypeID, EventID, StakeTypeID
) AS SMaxVal ON SMaxID.StakeTypeID = SMaxVal.StakeTypeID
AND SMaxID.StakeValue = SMaxVal.MaxStakeValue
AND SMaxID.EventID = SMaxVal.EventID
AND SMaxID.StakeGroupTypeID = SMaxVal.StakeGroupTypeID
GROUP BY SMaxID.StakeTypeID
)

You can use the over clause since you're using T-SQL (hopefully 2005+):
select distinct
a.stakevalue,
max(a.stakevalue) over (partition by a.staketypeid) as maxvalue,
b.staketypeid,
c.providername
from
si_stakes a
inner join si_stakestypes b on
a.staketypeid = b.id
inner join si_providers c on
a.providerid = c.id
where
a.eventid = 6
and a.stakegrouptypeid = 1
Essentially, this will find the max a.stakevalue for each a.staketypeid. Using a distinct will return one and only one row. Now, if you wanted to include the min a.id along with it, you could use row_number to accomplish this:
select
s.id,
s.maxvalue,
s.staketypeid,
s.providername
from (
select
row_number() over (partition by a.staketypeid
order by a.stakevalue desc, a.id) as rownum,
a.id,
a.stakevalue as maxvalue,
b.staketypeid,
c.providername
from
si_stakes a
inner join si_stakestypes b on
a.staketypeid = b.id
inner join si_providers c on
a.providerid = c.id
where
a.eventid = 6
and a.stakegrouptypeid = 1
) s
where
s.rownum = 1
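The tie problem from the question, and how row_number() avoids it, can be reproduced on a few rows. This is a sketch only: it runs on SQLite through Python's sqlite3 module, the SI_STAKES rows are invented, the joins to the type and provider tables are left out, and the tie-break on ID is an assumption about which of the tied rows should win.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SI_STAKES (ID INTEGER PRIMARY KEY, StakeTypeID INTEGER,
                            ProviderID INTEGER, StakeValue REAL);
    -- Invented sample data: two providers tie for the max of stake type 6.
    INSERT INTO SI_STAKES VALUES
        (1, 6, 1, 1.5), (2, 6, 2, 1.5), (3, 6, 3, 1.2),
        (4, 7, 1, 3.75), (5, 7, 2, 3.1);
""")

# Comparing against MAX() returns every row that ties for the maximum.
with_ties = conn.execute("""
    SELECT a.ID, a.StakeTypeID, a.StakeValue
    FROM SI_STAKES a
    WHERE a.StakeValue = (SELECT MAX(d.StakeValue)
                          FROM SI_STAKES d
                          WHERE d.StakeTypeID = a.StakeTypeID)
    ORDER BY a.ID
""").fetchall()

# ROW_NUMBER() gives tied rows different numbers, so exactly one survives.
one_per_type = conn.execute("""
    SELECT ID, StakeTypeID, StakeValue
    FROM (SELECT s.*,
                 ROW_NUMBER() OVER (PARTITION BY StakeTypeID
                                    ORDER BY StakeValue DESC, ID) AS rownum
          FROM SI_STAKES s)
    WHERE rownum = 1
    ORDER BY StakeTypeID
""").fetchall()

print(with_ties)     # [(1, 6, 1.5), (2, 6, 1.5), (4, 7, 3.75)]
print(one_per_type)  # [(1, 6, 1.5), (4, 7, 3.75)]
```

Ordering by ID DESC instead would keep the most recent of the tied rows, matching the max ID approach in the other answer.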
