Remove duplicate SQL Join results - sql

Okay, I and totally new to SQL so bear with me. I created a statement which I have the results I wanted but wanted get rid of duplicate results. What's a easy solution to this? Here is my statement
SELECT
li.location,
li.logistics_unit,
li.item,
li.company,
li.item_desc,
li.on_hand_qty,
li.in_transit_qty,
li.allocated_qty,
li.lot,
i.item_category3,
location.locating_zone,
location.location_subclass,
i.item_category4
FROM
location_inventory li
INNER JOIN item i ON li.item = i.item
INNER JOIN location l ON l.location = li.location
WHERE
i.item_category3 = 'AS' AND
li.warehouse = 'river' AND
li.location NOT LIKE 'd%' AND
li.location NOT LIKE 'stg%'
ORDER BY
li.item asc

If you are confident in your JOIN then DISTINCT should do the trick like so:
select DISTINCT
location_inventory.location ,
location_inventory.logistics_unit ,
location_inventory.item ,
location_inventory.company ,
location_inventory.item_desc ,
location_inventory.on_hand_qty ,
location_inventory.in_transit_qty ,
location_inventory.allocated_qty ,
location_inventory.lot ,
item.item_category3 ,
location.locating_zone ,
location.location_subclass ,
item.item_category4
from location_inventory
INNER JOIN item
on location_inventory.item=item.item
INNER JOIN location
on location_inventory.location=location.location
where item.item_category3 = 'AS' and
location_inventory.warehouse = 'river' and
location_inventory.location not like 'd%' and
location_inventory.location not like 'stg%'
order by location_inventory.item asc

Assuming that i.item and l.location are primary or unique keys, any duplicates you see are being caused by duplicate items in your location_inventory table. That might or might not be what you want.
SELECT DISTINCT will only eliminate true duplicates (i.e. those where all of the selected columns are duplicate). If that's what you want, use it. Otherwise, you might need to make an inner select that uses SELECT DISTINCT to identify the columns in which you don't want duplicates, and join the results of the inner select back to the tables to pull out all the other data.

Related

Too many columns in GROUP BY

I'm trying to aggregate some data, but I've a problem. There's my query (using 3 tables):
SELECT
ufc.counter_id,
gcrvf.goal_id,
gcrvf.date_of_visit,
ufc.utm_campaign,
ufc.utm_source,
ufc.utm_medium,
ufc.utm_content,
ufc.utm_term,
ufc.original_join_id,
max(gcrvf.last_update_time) AS last_update_time,
sum(gcrvf.conversions) AS conversions,
c.name AS counter_name,
c.owner_login AS owner_login,
c.status AS counter_status,
concat(g.goal_source,CAST('Goal','text')) AS metric_type,
multiIf(g.is_retargeting = 0,'non-retargeting',g.is_retargeting = 1,'retargeting',NULL) AS metric_key,
concat(g.name,' (',CAST(gcrvf.goal_id,'String'),')') AS metric_name
FROM connectors_yandex_metrika.goal_conversions_report_v_final AS gcrvf
INNER JOIN connectors_yandex_metrika.utm_for_collect AS ufc ON gcrvf.counter_id = ufc.counter_id
LEFT JOIN connectors_yandex_metrika.counter AS c ON gcrvf.counter_id = c.id
LEFT JOIN connectors_yandex_metrika.goal AS g ON gcrvf.goal_id = g.id
WHERE
((gcrvf.utm_campaign = ufc.utm_campaign) OR (ufc.utm_campaign IS NULL))
AND ((gcrvf.utm_source = ufc.utm_source) OR (ufc.utm_source IS NULL))
AND ((gcrvf.utm_medium = ufc.utm_medium) OR (ufc.utm_medium IS NULL))
AND ((gcrvf.utm_content = ufc.utm_content) OR (ufc.utm_content IS NULL))
AND ((gcrvf.utm_term = ufc.utm_term ) OR (ufc.utm_term IS NULL))
GROUP BY
ufc.counter_id,
gcrvf.date_of_visit,
gcrvf.goal_id,
ufc.utm_campaign,
ufc.utm_source,
ufc.utm_medium,
ufc.utm_content,
ufc.utm_term,
ufc.original_join_id,
c.name,
c.owner_login,
c.status,
metric_type,
metric_key,
metric_name
I have to GROUP BY by almost all columns. Is it a real problem?
Columns ufc.original_join_id, c.name,c.owner_login, c.status, metric_type, metric_key,metric_name are not necessary here. I added them to group by just because I need these columns. And I want to ask: any way to make it more abbreviated? Any ways to avoid unnecessary columns from group by? Or it's okay?
And my second question: does ClickHouse cache right table when we use JOINs? So I always should put huge table as left table?
All columns are required in the group by. It is not possible to leaf some columns out which where mentioned as select columns.
Depending on your indexed columns you can improve the speed of the query. You should try to make an index on the key columns.
The Database will handle the cache logic for you. Depending on how often you execute the query.

Error in nested SQL statement

Can someone help me fix this SQL statement? I have 2 tables... trying to get a list of all records in table 1 (c) along with a count (if any) of matching records in table 2 (cp_docs).
SELECT TOP 100 c.cal_procedure ,
c.description ,
c.active ,
c.create_user ,
c.create_date ,
c.edit_user ,
c.edit_date ,
c.id,
cp_docs.cpd
FROM cal_procedure c
OUTTER JOIN (select cal_procedure as cp, count(id) as cpd
from cal_procedure_doc
group by cal_procedure) cp_docs
ON cp_docs.cp = c.cal_procedure
Thanks,
Tracy
Hard to say without the error message but your outer join has a couple issues.
OUTER is incorrectly written at OUTTER
Your OUTER keyword needs to be prefixed with LEFT OR RIGHT. With the logic in your query you want likely want LEFT
Fixed SQL:
SELECT TOP 100 c.cal_procedure ,
c.description ,
c.active ,
c.create_user ,
c.create_date ,
c.edit_user ,
c.edit_date ,
c.id,
cp_docs.cpd
FROM cal_procedure c
LEFT OUTER JOIN (select cal_procedure as cp, count(id) as cpd
from cal_procedure_doc
group by cal_procedure) cp_docs
ON cp_docs.cp = c.cal_procedure
Now in your query, you could get null values in the cpd column if there were no values in the cal_prodcedure_doc table. If you look at Max's answer, you would get 0's instead. If you wanted to use your current approach but have the zero's display you would need to wrap cp_docs.cpd in a coalesce function
coalesce(cp_docs.cpd, 0)
In the end I think Max's answer is easier to read and probably the way I would write this query as I think it's easier to read. If the tables are huge you may want to check how each performs to see one is better than the other.
You can just add a subquery to the SELECT clause. It's cleaner than joining a temp table. If you try to read someone else's query to figure out how a calculation is done, you'll start with the SELECT statement. If the select statement points you to a table alias (e.g. cp_docs), you need to find the table in the FROM clause... etc. The execution plans are almost identical; the proposed SELECT clause subquery actually eliminates one innocuous Compute Scaler step.
SELECT c.cal_procedure ,
c.description ,
c.active ,
c.create_user ,
c.create_date ,
c.edit_user ,
c.edit_date ,
c.id,
(SELECT COUNT(*) FROM cal_procedure_docs where cal_procedure = c.cal_procedure) AS cpd
FROM cal_procedure c
Perhaps you want outer apply :
SELECT TOP 100 c.cal_procedure, c.description, c.active, c.create_user,
c.create_date, c.edit_user, c.edit_date, c.id, cp_docs.cpd
FROM cal_procedure c OUTER APPLY
(select count(id) as cpd
from cal_procedure_doc
where cal_procedure = c.cal_procedure
) cp_docs
ORDER BY ? ? ? ;

Matching values from two columns with the values of two columns in a sub-query

I have a problem I can't really figure out, even though I thought I had the solution.
I think this is DB2 SQL by the way.
I have a customer number and a country code (extracted from a string using SUBSTR) which I don't want to find in combination in a subquery, like so:
SELECT ku.orgnr AS customer ,
Substr(bu.bank_account_swiftadr,5,2) AS country
FROM db811.bet_utl bu
LEFT JOIN db811.henv_utl bh
ON bu.betaling_urn = bh.betaling_urn
LEFT JOIN db811.betaling_status bs
ON bu.betaling_status = bs.betaling_status
LEFT JOIN db811.kunde_orgnr ku
ON bu.kundenr = ku.kundenr
WHERE bu.kanal = 'N'
AND (ku.orgnr, Substr(bu.bank_account_swiftadr,5,2)) ;
Not in the results below
SELECT ku.orgnr AS customer ,
Substr(bu.bank_account_swiftadr,5,2) AS country ,
COUNT(*) AS numberof
FROM db811.bet_utl_hist bu
LEFT JOIN db811.kunde_orgnr ku
ON bu.kundenr = ku.kundenr
WHERE
and bu.kanal = 'N'
AND bu.betalingsdato > '2016-01-01'
GROUP BY ku.orgnr ,
substr(bu.bank_account_swiftadr,5,2);
This should work I though, but it seems to match on just one of them, and I need both to be true in order for me to exclude it with the NOT IN.
I assume I am missing something basic since I am quite new at this.

SQL: SELECT DISTINCT not returning distinct values

The code below is supposed to return unique records in the lp_num field from the subquery to then be used in the outer query, but I am still getting multiples of the lp_num field. A ReferenceNumber can have multiple ApptDate records, but each lp_num can only have 1 rf_num. That's why I tried to retrieve unique lp_num records all the way down in the subquery, but it doesn't work. I am using Report Builder 3.0.
Current Output
Screenshot
The desired output would be to have only unique records in the lp_num field. This is because each value in the lp_num field is a pallet, one single pallet. the info to the right is when it arrived (ApptDate) and what the reference number is for the delivery (ref_num). Therefore, it makes no sense for a pallet to have multiple receipt dates...it can only arrive once...
SELECT DISTINCT
dbo.ISW_LPTrans.item,
dbo.ISW_LPTrans.lot,
dbo.ISW_LPTrans.trans_type,
dbo.ISW_LPTrans.lp_num,
dbo.ISW_LPTrans.ref_num,
(MIN(CONVERT(VARCHAR(10),dbo.CW_CheckInOut.ApptDate,101))) as appt_date_only,
dbo.CW_CheckInOut.ApptTime,
dbo.item.description,
dbo.item.u_m,
dbo.ISW_LPTrans.qty,
(CASE
WHEN dbo.ISW_LPTrans.trans_type = 'F'
THEN 'Produced internally'
ELSE
(CASE
WHEN dbo.ISW_LPTrans.trans_type = 'R'
THEN 'Received from outside'
END)
END
) as original_source
FROM
dbo.ISW_LPTrans
INNER JOIN dbo.CW_Dock_Schedule ON LTRIM(RTRIM(dbo.ISW_LPTrans.ref_num)) = dbo.CW_Dock_Schedule.ReferenceNumber
INNER JOIN dbo.CW_CheckInOut ON dbo.CW_CheckInOut.TruckID = dbo.CW_Dock_Schedule.TruckID
INNER JOIN dbo.item ON dbo.item.item = dbo.ISW_LPTrans.item
WHERE
(dbo.ISW_LPTrans.trans_type = 'R') AND
--CONVERT(VARCHAR(10),dbo.CW_CheckInOut.ApptDate,101) <= CONVERT(VARCHAR(10),dbo.ISW_LPTrans.trans_date,101) AND
dbo.ISW_LPTrans.lp_num IN
(SELECT DISTINCT
dbo.ISW_LPTrans.lp_num
FROM
dbo.ISW_LPTrans
INNER JOIN dbo.item ON dbo.ISW_LPTrans.item = dbo.item.item
INNER JOIN dbo.job ON dbo.ISW_LPTrans.ref_num = dbo.job.job AND dbo.ISW_LPTrans.ref_line_suf = dbo.job.suffix
WHERE
(dbo.ISW_LPTrans.trans_type = 'W' OR dbo.ISW_LPTrans.trans_type = 'I') AND
dbo.ISW_LPTrans.ref_num IN
(SELECT
dbo.ISW_LPTrans.ref_num
FROM
dbo.ISW_LPTrans
--INNER JOIN dbo.ISW_LPTrans on dbo.ISW_LPTrans.
WHERE
dbo.ISW_LPTrans.item LIKE #item AND
dbo.ISW_LPTrans.lot LIKE #lot AND
dbo.ISW_LPTrans.trans_type = 'F'
GROUP BY
dbo.ISW_LPTrans.ref_num
) AND
dbo.ISW_LPTrans.ref_line_suf IN
(SELECT
dbo.ISW_LPTrans.ref_line_suf
FROM
dbo.ISW_LPTrans
--INNER JOIN dbo.ISW_LPTrans on dbo.ISW_LPTrans.
WHERE
dbo.ISW_LPTrans.item LIKE #item AND
dbo.ISW_LPTrans.lot LIKE #lot AND
dbo.ISW_LPTrans.trans_type = 'F'
GROUP BY
dbo.ISW_LPTrans.ref_line_suf
)
GROUP BY
dbo.ISW_LPTrans.lp_num
HAVING
SUM(dbo.ISW_LPTrans.qty) < 0
)
GROUP BY
dbo.ISW_LPTrans.item,
dbo.ISW_LPTrans.lot,
dbo.ISW_LPTrans.trans_type,
dbo.ISW_LPTrans.lp_num,
dbo.ISW_LPTrans.ref_num,
dbo.CW_CheckInOut.ApptDate,
dbo.CW_CheckInOut.ApptTime,
dbo.item.description,
dbo.item.u_m,
dbo.ISW_LPTrans.qty
ORDER BY
dbo.ISW_LPTrans.lp_num
In a nutshell - the way you use DISTINCT is logically wrong from SQL perspective.
Your DISTINCT is in an IN subquery in the WHERE clause - and at that point of code it has absolutely no effect (except from the performance penalty). Think on it - if the outer query returns non-unique values of dbo.ISW_LPTrans.lp_num (which obvioulsy happens) those values can still be within the distinct values of the IN subquery - the IN does not enforce a 1-to-1 match, it only enforces the fact that the outer query values are within the inner values, but they can match multiple times. So it is definitely not DISTINCT's fault.
I would go through the following check steps:
See if there is insufficient JOIN ON condition(s) in the outer FROM section that leads to data multiplication (e.g. if a table has primary-to-foreign key relation on several columns, but you join on one of them only etc.).
Check which of the sources contains non-distinct records in the outer FROM section - then either cleanse your source, or adjust the JOIN condition and / or the WHERE clause so that you only pick distinct & correct records. In fact you might need to SELECT DISTINCT in the FROM sections - there it would make much more sense.

Using COALESCE with JOIN on a different database column

Trying to populate the location column of a query and was hoping that the use of the COALESCE function would help me get what I want.
SELECT OrderItem.Code AS ItemCode, MAX(COALESCE(OrderItem.Location, [Picklist].[dbo].[ItemData].InventoryLocation)) AS Location, SUM(OrderItem.Quantity) AS Quantity, MAX(Store.StoreName) AS Store
FROM OrderItem
INNER JOIN [Order] ON OrderItem.OrderID = [Order].OrderID
INNER JOIN [Store] ON [Order].StoreID = [Store].StoreID
LEFT JOIN [AmazonOrder] ON [AmazonOrder].OrderID = [Order].OrderID
JOIN [Picklist].[dbo].[ItemData] ON [Picklist].[dbo].[ItemData].[InventoryNumber] = [OrderItem].[Code]
WHERE (CASE WHEN [Order].[LocalStatus] = 'Recently Downloaded' AND [AmazonOrder].FulfillmentChannel = 2 THEN 1
WHEN [Order].[LocalStatus] = 'Recently Downloaded' AND [Store].StoreName != 'Amazon' THEN 1 ELSE 0 END ) = 1
GROUP BY OrderItem.Code
ORDER BY ItemCode
There will not be a location when the Store is Amazon so I need to Join on another table in another database. I don't believe I'm using this correctly. Also I do get the right Location results returned if I use :
SELECT InventoryLocation From [Picklist].[dbo].[ItemData] WHERE InventoryNumber = 'L1201-2W-EA'
Perhaps this is more like the query that you want:
SELECT oi.Code AS ItemCode, COALESCE(oi.Location, id.InventoryLocation) AS Location,
oi.Quantity, s.StoreName AS Store
FROM OrderItem oi INNER JOIN
[Order] o
ON oi.OrderID = o.OrderID INNER JOIN
[Store]
ON o.StoreID = s.StoreID LEFT JOIN
AmazonOrder ao
ON ao.OrderID = o.OrderID JOIN
[Picklist].[dbo].[ItemData] id
ON id.InventoryNumber = oi.[Code]
WHERE o.LocalStatus = 'Recently Downloaded' AND
(ao.FulfillmentChannel = 2 OR s.StoreName <> 'Amazon')
ORDER BY ItemCode
Here are the changes:
Removed the aggregation. It does not seem to be part of the question.
Introduced table aliases, so the query is easier to write and to read.
Simplified the logic in the where clause.
As the comment above says, the max seems somewhat strange, an arbitrary aggregation no doubt due to one of the joins bringing back more information than you might of expected.
Then the statement has a few issues:
The coalesce is using two fields, neither if which is in a left join, only the AmazonOrder is left joined, so that seems a bit strange, that would only work if the first field in the coalesce (OrderItem.Location) is nullable - which it might be, there is no schema posted.
The left join itself is an inner join in disguise at present - within the where clause you have given explicit conditions on a field from that table - AND [AmazonOrder].FulfillmentChannel = 2 - if the record was actually missing the left join would return null for that field, and the where clause would then drop it out of the results. If you want this to properly work as a left join, any condition on fields from that table must move into the join condition, or the where clause itself must allow for that field being null (explicitly or using a coalesce.)
SELECT OrderItem.Code AS Code,
CASE WHEN (LEN(ISNULL(MAX([OrderItem].[Location]),'')) = 1)
THEN MAX([OrderItem].[Location])
ELSE MAX([Picklist].[dbo].[ItemData].InventoryLocation)
END AS Location,
SUM(OrderItem.Quantity) AS Quantity,
MAX(Store.StoreName) AS Store
FROM OrderItem
INNER JOIN [Order] ON OrderItem.OrderID = [Order].OrderID
INNER JOIN [Store] ON [Order].StoreID = [Store].StoreID
LEFT JOIN [AmazonOrder] ON [AmazonOrder].OrderID = [Order].OrderID
LEFT JOIN [Picklist].[dbo].[ItemData] ON [Picklist].[dbo].[ItemData].[InventoryNumber] = [OrderItem].[Code] OR
[Picklist].[dbo].[ItemData].[MediaCreator] = [OrderItem].[Code]
WHERE [Order].LocalStatus = 'Recently Downloaded' AND (AmazonOrder.FulfillmentChannel = 2 OR Store.StoreName <> 'Amazon')
GROUP BY OrderItem.Code
ORDER BY OrderItem.Code
Decided to go with case statement on location column route because I could not get COALESCE to work for me. Schema, some not all data, at SQLFiddle.
I guess if someone gets COALESCE to work I'll change the answer?
#Gordon Linoff I used the re-written WHERE clause because it looked cleaner than using the CASE statement. It worked and guessed there was a simpler way to go about it but was more worried about getting COALESCE to work. As for the Aliases sometimes I like to use them but in this case since there was a lot of tables I like to code out what I'm actually working in. Just my preference .