How to find duplicates in a large table based on matching and non-matching fields?

I have a very large table with more than 10 million records. I want to find duplicates where some fields match and others do not.
The query I am currently using is below:
SELECT DISTINCT MainTable.[lineitemid]
FROM [dbo].[lineitem] MainTable
INNER JOIN [dbo].[lineitem] AS ChildTable
ON ChildTable.invoicedate = MainTable.invoicedate
AND LEFT(ChildTable.vendorname, 4) = LEFT(MainTable.vendorname, 4)
AND ChildTable.invoiceid <> MainTable.invoiceid AND -- Invoice ID column not matching
ChildTable.documentcurrencyamount = MainTable.documentcurrencyamount
WHERE ChildTable.lineitemid <> MainTable.lineitemid AND -- LineItemId is PK
MainTable.projectid = 1125 AND ChildTable.projectid = 1125 -- Duplicates should be identified with specific ProjectId
This query works fine when the number of records for the ProjectId is under 100,000.
When the ProjectId has more than 1 million records, tempdb grows to 100 GB while the query runs, causing low disk space issues, and the query takes forever to execute.
Please help me optimize the query.
Added the lines below after getting an answer to the question above:
Thanks a lot, @Gordon-Linoff. The query you suggested ran much faster. The VendorName comes from a different table. Can I include an inner join as shown below?
SELECT li1.[LineItemId]
FROM [dbo].[LineItem] li1
INNER JOIN VendorMaster vm1 ON li1.VendorNumber = vm1.VendorNumber
AND vm1.CompanyCode = li1.CompanyCode
WHERE EXISTS (SELECT 1
FROM [dbo].[LineItem] as li2
INNER JOIN VendorMaster vm2 on li2.VendorNumber = vm2.VendorNumber
AND vm2.CompanyCode = li2.CompanyCode
WHERE li2.InvoiceDate = li1.InvoiceDate and
LEFT(vm2.VendorName, 4) = LEFT(vm1.VendorName, 4) and -- VendorName comes from VendorMaster
li2.InvoiceId <> li1.InvoiceId and -- Invoice ID column not matching
li2.DocumentCurrencyAmount = li1.DocumentCurrencyAmount and
li2.LineItemId <> li1.LineItemId and
li2.ProjectId = li1.ProjectId and
li2.VendorNumber = li1.VendorNumber)
AND li1.ProjectId = 1125
Is it an efficient approach?

A less expensive way to run this query is to use exists and dispense with the distinct:
SELECT li.[LineItemId]
FROM [dbo].[LineItem] li
WHERE EXISTS (SELECT 1
FROM [dbo].[LineItem] as li2
WHERE li2.InvoiceDate = li.InvoiceDate and
LEFT(li2.VendorName, 4) = LEFT(li.VendorName, 4) and
li2.InvoiceId <> li.InvoiceId and -- Invoice ID column not matching
li2.DocumentCurrencyAmount = li.DocumentCurrencyAmount and
li2.LineItemId <> li.LineItemId and
li2.ProjectId = li.ProjectId
) AND
li.ProjectId = 1125;
For performance, an index on LineItem(ProjectId, InvoiceDate, DocumentCurrencyAmount, VendorName, InvoiceId, LineItemId) would help. You could further speed the query by declaring LEFT(LineItem.VendorName, 4) as a computed column and adding it to the index before VendorName.
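As a rough sketch of the index suggestion, assuming VendorName is a column of dbo.LineItem as in the original query, the computed column and index might look like the following. The names VendorPrefix and IX_LineItem_DuplicateCheck are made up for illustration and do not come from the question.

```sql
-- Hypothetical column name: VendorPrefix.
-- LEFT() is deterministic, so the computed column can be persisted and indexed.
ALTER TABLE dbo.LineItem
    ADD VendorPrefix AS LEFT(VendorName, 4) PERSISTED;
GO

-- Hypothetical index name. Key order follows the answer,
-- with the computed prefix placed before VendorName.
CREATE NONCLUSTERED INDEX IX_LineItem_DuplicateCheck
    ON dbo.LineItem (ProjectId, InvoiceDate, DocumentCurrencyAmount,
                     VendorPrefix, VendorName, InvoiceId, LineItemId);
GO
```

With such a column in place, the EXISTS query could compare li2.VendorPrefix = li.VendorPrefix instead of calling LEFT() on both sides. If VendorName is a wide column, the key may exceed SQL Server's index key size limit, in which case it could be moved to an INCLUDE clause instead.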

Related

SQL Where clause greatly increases query time

I have a table that I do some joins and operations on. This table has about 150,000 rows, and if I select everything and run it, it returns in about 10 seconds. If I wrap my query in a derived table and filter out all the rows where a certain field is null, the query takes 10 minutes to run. Is it supposed to be like this, or is there any way to fix it? Here is the query.
SELECT *
FROM
(
Select
I.Date_Created
,I.Company_Code
,I.Division_Code
,I.Invoice_Number
,Sh.CUST_PO
,I.Total_Quantity
,ID.Total
,SH.Ship_City City
,CASE WHEN SH.Ship_Cntry <> 'US' THEN 'INT' ELSE SH.Ship_prov END State
,SH.Ship_Zip Zip
,SH.Ship_Cntry Country
,S.CustomerEmail
from [JMNYC-AMTDB].[AMTPLUS].[dbo].Invoices I (nolock)
LEFT JOIN (SELECT
ID.Company_Code
,ID.Division_Code
,ID.Invoice_Number
,SUM (ID.Price* ID.Quantity) Total
FROM [JMNYC-AMTDB].[AMTPLUS].[dbo].Invoices_Detail ID (nolock)
GROUP BY ID.Company_Code, ID.Division_Code, ID.Invoice_Number) ID
ON I.Company_Code = ID.Company_Code
AND I.Division_Code = ID.Division_Code
AND I.Invoice_Number = ID.Invoice_Number
LEFT JOIN
[JMDNJ-ACCELSQL].[A1WAREHOUSE].[dbo].SHIPHIST SH (nolock) ON I.Pickticket_Number = SH.Packslip
LEFT JOIN
[JMDNJ-ACCELSQL].[A1WAREHOUSE].[dbo].[MagentoCustomerEmailData] S on SH.CUST_PO = S.InvoiceNumber
Where I.Company_Code ='09' AND I.Division_Code = '001'
AND I.Customer_Number = 'ECOM2X'
)T
Where T.CustomerEmail IS NOT NULL -- This is the problematic line
Order By T.Date_Created desc
If you are aware of the index considerations and are sure this is the problem point, you can use this to improve it:
USE A1WAREHOUSE;
GO
CREATE NONCLUSTERED INDEX IX_MagentoCustomerEmailData_CustomerEmail
ON [dbo].[MagentoCustomerEmailData] (CustomerEmail ASC);
GO
In general, you should index the columns used in the ORDER BY, WHERE, GROUP BY, and ON clauses. Before adding indexes, be sure that you are aware of the consequences.
Read more about indexes:
https://www.mssqltips.com/sqlservertutorial/9133/sql-server-nonclustered-indexes/
https://www.itprotoday.com/sql-server/indexing-dos-and-don-ts
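Following the same advice for ON columns: the query in the question joins MagentoCustomerEmailData on InvoiceNumber and only filters on CustomerEmail, so an index on the join column may also be worth testing. This is a sketch under that assumption, not a confirmed fix; the index name is hypothetical, and whether it helps depends on the actual execution plan, especially since the tables sit on linked servers.

```sql
USE A1WAREHOUSE;
GO
-- Hypothetical index name. Keyed on the join column from the question,
-- with CustomerEmail included so the filter and select can be answered from the index.
CREATE NONCLUSTERED INDEX IX_MagentoCustomerEmailData_InvoiceNumber
ON [dbo].[MagentoCustomerEmailData] (InvoiceNumber ASC)
INCLUDE (CustomerEmail);
GO
```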

SQL Server Execution Plan Review Request

I'm having trouble understanding why my query is taking so long, and I'm looking for advice on how to optimise it, please.
update Laserbeak_Main.dbo.ACCOUNT_MPN set
DateUpgrade = ord.ConnectedDate
FROM [ORDER] ord
WHERE ord.AccountNumber = Laserbeak_Main.dbo.ACCOUNT_MPN.AccountNumber
AND ord.ordertypeID = '2'
AND ord.ConnectedDate IS NOT NULL
AND DateUpgrade <> ord.ConnectedDate
Execution plan as requested on brentozar.com
UPDATE: Following suggestions, the new query looks like this and seems to run much more quickly. However, if you run the query, it sets the rows as expected; run it again and it updates the exact same number of rows. Converting it to a SELECT confirms that the same rows are being updated each time. The <> clause should stop this, but it doesn't. I believed it had something to do with collation, but I have been unable to confirm whether it is possible to have different collations at table level in the same database.
;WITH cteOrderInfo AS (
SELECT DISTINCT ord.AccountNumber, ord.ConnectedDate
FROM [ORDER] ord
WHERE ord.ordertypeID = '2'
AND ord.ConnectedDate IS NOT NULL
)
UPDATE Laserbeak_Main.dbo.ACCOUNT_MPN
SET Laserbeak_Main.dbo.ACCOUNT_MPN.DateUpgrade = cteOrderInfo.ConnectedDate
FROM cteOrderInfo
INNER JOIN Laserbeak_Main.dbo.ACCOUNT_MPN acc
ON cteOrderInfo.AccountNumber = acc.AccountNumber
WHERE cteOrderInfo.ConnectedDate <> acc.DateUpgrade
The SELECT to confirm:
;WITH cteOrderInfo AS (
SELECT DISTINCT ord.AccountNumber, ord.ConnectedDate
FROM [ORDER] ord
WHERE ord.ordertypeID = '2'
AND ord.ConnectedDate IS NOT NULL
)
SELECT cteOrderInfo.ConnectedDate, acc.DateUpgrade
FROM cteOrderInfo
INNER JOIN Laserbeak_Main.dbo.ACCOUNT_MPN acc
ON cteOrderInfo.AccountNumber = acc.AccountNumber
WHERE cteOrderInfo.ConnectedDate <> acc.DateUpgrade
SELECT Results Sample:
As Serge suggested, we did not have unique rows.
The solution we arrived at:
;WITH cteSourceStuff AS (
SELECT AccountNumber, MpnUpgrade, MAX(DateConnected) maxConnDate
FROM ORDER_DETAIL, [ORDER]
WHERE ORDER_DETAIL.OrderID = [ORDER].OrderID
AND LEN(MpnUpgrade) > 10
AND OrderTypeID = 2
GROUP BY AccountNumber, MpnUpgrade
)
UPDATE Laserbeak_Main.dbo.ACCOUNT_MPN set
DateUpgrade = cteSourceStuff.maxConnDate
FROM cteSourceStuff
WHERE cteSourceStuff.MpnUpgrade = ACCOUNT_MPN.Mpn
AND cteSourceStuff.AccountNumber = ACCOUNT_MPN.AccountNumber
AND DateUpgrade <> cteSourceStuff.maxConnDate
This works because the duplicates are removed first, so we only update the rows that we are actually targeting. The reason we had issues before was that SQL Server was updating from the first matching row it found; then, when we re-ran the update or ran the SELECT, it returned rows that matched on the key but had not previously been applied.
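To check for this kind of problem up front, a query along these lines can list the accounts whose source rows carry more than one candidate date. It is a minimal sketch using the table and column names from the first version of the update; any account it returns can be matched by several rows, so the UPDATE ... FROM has no single well-defined value to apply.

```sql
-- Accounts with more than one distinct ConnectedDate among type-2 orders.
SELECT ord.AccountNumber,
       COUNT(DISTINCT ord.ConnectedDate) AS DistinctConnectedDates
FROM [ORDER] AS ord
WHERE ord.ordertypeID = '2'
  AND ord.ConnectedDate IS NOT NULL
GROUP BY ord.AccountNumber
HAVING COUNT(DISTINCT ord.ConnectedDate) > 1;
```

If this returns no rows, the source is already unique per account and the repeated updates would need another explanation.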

Confused in join query in SQL

The following works:
SELECT IBAD.TRM_CODE, IBAD.IPABD_CUR_QTY, BM.BOQ_ITEM_NO,
IBAD.BCI_CODE, BCI.BOQ_CODE
FROM IPA_BOQ_ABSTRCT_DTL IBAD,
BOQ_CONFIG_INF BCI,BOQ_MST BM
WHERE BM.BOQ_CODE = BCI.BOQ_CODE
AND BCI.BCI_CODE = IBAD.BCI_CODE
AND BCI.STATUS = 'Y'
AND BM.STATUS = 'Y'
order by boq_item_no;
Results:
But after joining many tables with that query, the result is confusing:
SELECT (SELECT CMN_NAME
FROM CMN_MST
WHERE CMN_CODE= BRI.CMN_RLTY_MTRL) MTRL,
RRI.RRI_RLTY_RATE AS RATE,
I.BOQ_ITEM_NO,
(TRIM(TO_CHAR(IBAD.IPABD_CUR_QTY,
'9999999999999999999999999999990.999'))) AS IPABD_CUR_QTY,
TRIM(TO_CHAR(BRI.BRI_WT_FACTOR,
'9999999999999999999999999999990.999')) AS WT,
TRIM(TO_CHAR((IBAD.IPABD_CUR_QTY*BRI.BRI_WT_FACTOR),
'9999999999999999999999990.999')) AS RLTY_QTY,
(TRIM(TO_CHAR((IBAD.IPABD_CUR_QTY*BRI.BRI_WT_FACTOR*RRI.RRI_RLTY_RATE),
'9999999999999999999999990.99'))) AS TOT_AMT,
I.TRM_CODE AS TRM
FROM
(SELECT * FROM ipa_boq_abstrct_dtl) IBAD
INNER JOIN
(SELECT * FROM BOQ_RLTY_INF) BRI
ON IBAD.BCI_CODE = BRI.BCI_CODE
INNER JOIN
(SELECT * FROM RLTY_RATE_INF) RRI
ON BRI.CMN_RLTY_MTRL = RRI.CMN_RLTY_MTRL
INNER JOIN
( SELECT IBAD.TRM_CODE, IBAD.IPABD_CUR_QTY,
BM.BOQ_ITEM_NO, IBAD.BCI_CODE, BCI.BOQ_CODE
FROM IPA_BOQ_ABSTRCT_DTL IBAD,
BOQ_CONFIG_INF BCI,BOQ_MST BM
WHERE
BM.BOQ_CODE = BCI.BOQ_CODE
AND BCI.BCI_CODE = IBAD.BCI_CODE
and BCI.status = 'Y'
and bm.status = 'Y') I
ON BRI.BCI_CODE = I.BCI_CODE
AND I.TRM_CODE = BRI.TRM_CODE
AND BRI.TRM_CODE =4
group by BRI.CMN_RLTY_MTRL, RRI.RRI_RLTY_RATE, I.BOQ_ITEM_NO,
IBAD.IPABD_CUR_QTY, BRI.BRI_WT_FACTOR, I.TRM_CODE, I.bci_code
order by BRI.CMN_RLTY_MTRL
Results:
TRM should be 11 instead of 4 in the first row.
You are getting 4 because you use
AND BRI.TRM_CODE =4
If you remove this criterion, you will get the correct result.
In your first query, both of the rows you've highlighted have BCI_CODE=1866.
In the second query, you are joining that result set with a number of others (which come from the same tables, which seems odd). In particular, you are joining from the subquery to another table using BCI_CODE, and from there to (SELECT * FROM ipa_boq_abstrct_dtl) IBAD. Since both of the rows from the subquery have the same BCI_CODE, they will join to the same rows in the other tables.
The quantity that you are actually displaying in the second query is from (SELECT * FROM ipa_boq_abstrct_dtl) IBAD, not from the other subquery.
Is the problem simply that you mean to select I.IPABD_CUR_QTY instead of IBAD.IPABD_CUR_QTY?
You might find this clearer if you did not reuse the same aliases for tables at multiple points in the query.
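As an illustration of both points, here is a simplified sketch that takes the quantity from the filtered subquery and uses a distinct alias (DTL, a name chosen here) inside it. It assumes the outer join to ipa_boq_abstrct_dtl is not otherwise needed, and it leaves out the TO_CHAR formatting, the material name lookup, the GROUP BY and the TRM_CODE = 4 filter, so it is a starting point rather than a drop-in replacement.

```sql
SELECT BRI.CMN_RLTY_MTRL,
       RRI.RRI_RLTY_RATE AS RATE,
       I.BOQ_ITEM_NO,
       I.IPABD_CUR_QTY,                        -- quantity from the filtered subquery
       BRI.BRI_WT_FACTOR AS WT,
       I.IPABD_CUR_QTY * BRI.BRI_WT_FACTOR AS RLTY_QTY,
       I.IPABD_CUR_QTY * BRI.BRI_WT_FACTOR * RRI.RRI_RLTY_RATE AS TOT_AMT,
       I.TRM_CODE AS TRM
FROM BOQ_RLTY_INF BRI
INNER JOIN RLTY_RATE_INF RRI
        ON BRI.CMN_RLTY_MTRL = RRI.CMN_RLTY_MTRL
INNER JOIN (SELECT DTL.TRM_CODE, DTL.IPABD_CUR_QTY,
                   BM.BOQ_ITEM_NO, DTL.BCI_CODE
            FROM IPA_BOQ_ABSTRCT_DTL DTL       -- DTL: alias used only here
            INNER JOIN BOQ_CONFIG_INF BCI
                    ON BCI.BCI_CODE = DTL.BCI_CODE
            INNER JOIN BOQ_MST BM
                    ON BM.BOQ_CODE = BCI.BOQ_CODE
            WHERE BCI.STATUS = 'Y'
              AND BM.STATUS = 'Y') I
        ON BRI.BCI_CODE = I.BCI_CODE
       AND BRI.TRM_CODE = I.TRM_CODE
ORDER BY BRI.CMN_RLTY_MTRL;
```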

Optimization DB2 query with mass join

I have a complex query:
select rma.RELATION_MANAGER_ID,
rm.ORG_STRUCTURE_ID,
rm.RELATIONSHIP_MANAGER_NM,
count(distinct ppa.PARTY_ID) as count_party
from RELATIONSHIP_MANAGER rm --15808 row
join RELATIONSHIP_MANAGER_MARKET rmm --1560 row
on rm.RELATIONSHIP_MANAGER_ID = rmm.RELATIONSHIP_MANAGER_ID
and rmm.INCLUDE_IN_REPORT = 'Y'
join MARKET_SEGMENT rm_ms --4 row
on rmm.MARKET_SEGMENT_ID = rm_ms.MARKET_SEGMENT_ID
and rm_ms.MARKET_SEGMENT = '01'
join RELATIONSHIP_MANAGER_ALLOCATION rma --61349 row
on rm.RELATIONSHIP_MANAGER_ID = rma.RELATIONSHIP_MANAGER_ID
join CMD_PARTY_PORTFOLIO_ALLOCATION ppa --3114096 row
on ppa.PORTFOLIO_ID = rma.PORTFOLIO_ID
join person ps --3112575 row
on ps.IS_DELETED != 1 and ppa.party_id = ps.party_id
join PARTY p --3114146 row
on ppa.party_id=p.party_id
join MARKET_SEGMENT ms --4 row
on p.MARKET_SEGMENT_ID = ms.MARKET_SEGMENT_ID and ms.MARKET_SEGMENT = '01'
where rm.IS_CM = 1 and rm.IS_DELETED != 1
group by rm.RELATIONSHIP_MANAGER_NM, rma.RELATIONSHIP_MANAGER_ID, rm.ORG_STRUCTURE_ID
Table columns have indexes:
rm.RELATIONSHIP_MANAGER_ID,
rmm.RELATIONSHIP_MANAGER_ID,
rmm.MARKET_SEGMENT_ID,
rm_ms.MARKET_SEGMENT_ID,
rma.RELATIONSHIP_MANAGER_ID,
ppa.PORTFOLIO_ID,
rma.PORTFOLIO_ID,
ppa.party_id,
ps.party_id,
p.party_id,
p.MARKET_SEGMENT_ID,
ms.MARKET_SEGMENT_ID
The tables PARTY and PERSON have ~1-3 million rows each.
The query runtime is ~20 seconds. When I comment out one condition:
join MARKET_SEGMENT ms
on p.MARKET_SEGMENT_ID = ms.MARKET_SEGMENT_ID --and ms.MARKET_SEGMENT = '01'
the query runtime drops to ~3 seconds.
Can you explain why this is happening, please?
The explain plan doesn't help me. How can I optimize the query?
EDIT:
The platform is DB2 for z/OS V9.7;
I added the table sizes.
EDIT2: The explain plan shows that the small tables are always joined first.
Just for grins, see if this makes any difference:
WITH
MktSeg( MARKET_SEGMENT_ID ) AS
( SELECT MARKET_SEGMENT_ID
FROM MARKET_SEGMENT
WHERE MARKET_SEGMENT = '01' )
select rma.RELATION_MANAGER_ID,
rm.ORG_STRUCTURE_ID,
rm.RELATIONSHIP_MANAGER_NM,
count(distinct ppa.PARTY_ID) as count_party
from RELATIONSHIP_MANAGER rm --15808 row
join RELATIONSHIP_MANAGER_MARKET rmm --1560 row
on rm.RELATIONSHIP_MANAGER_ID = rmm.RELATIONSHIP_MANAGER_ID
and rmm.INCLUDE_IN_REPORT = 'Y'
join MktSeg rm_ms --4 row
on rmm.MARKET_SEGMENT_ID = rm_ms.MARKET_SEGMENT_ID
join RELATIONSHIP_MANAGER_ALLOCATION rma --61349 row
on rm.RELATIONSHIP_MANAGER_ID = rma.RELATIONSHIP_MANAGER_ID
join CMD_PARTY_PORTFOLIO_ALLOCATION ppa --3114096 row
on ppa.PORTFOLIO_ID = rma.PORTFOLIO_ID
join person ps --3112575 row
on ps.IS_DELETED != 1 and ppa.party_id = ps.party_id
join PARTY p --3114146 row
on ppa.party_id=p.party_id
join MktSeg ms --4 row
on p.MARKET_SEGMENT_ID = ms.MARKET_SEGMENT_ID
WHERE rm.IS_CM = 1 AND rm.IS_DELETED != 1
group by rm.RELATIONSHIP_MANAGER_NM, rma.RELATIONSHIP_MANAGER_ID,
rm.ORG_STRUCTURE_ID, rma.RELATION_MANAGER_ID
Note also that I've added an item to the Group By clause.

WHERE in Sql, combining two fast conditions multiplies costs many times

I have a fairly complex SQL query that returns the ids of 2158 rows from a table with ~14M rows. I'm using CTEs for simplification.
The WHERE consists of two conditions. If I comment out either one of them, the other runs in ~2 seconds. If I leave them both in (separated by OR), the query runs for ~100 seconds. The first condition alone needs 1-2 seconds and returns 19 rows; the second condition alone needs 0 seconds and returns 2139 rows.
What can be the reason?
This is the complete SQL:
WITH fpcRepairs AS
(
SELECT FPC_Row = ROW_NUMBER()OVER(PARTITION BY t.SSN_Number ORDER BY t.Received_Date, t.Claim_Creation_Date, t.Repair_Completion_Date, t.Claim_Submitted_Date)
, idData, Repair_Completion_Date, Received_Date, Work_Order, SSN_number, fiMaxActionCode, idModel,ModelName
, SP=(SELECT TOP 1 Reused_Indicator FROM tabDataDetail td INNER JOIN tabSparePart sp ON td.fiSparePart=sp.idSparePart
WHERE td.fiData=t.idData
AND (td.Material_Quantity <> 0)
AND (sp.SparePartName = '1254-3751'))
FROM tabData AS t INNER JOIN
modModel AS m ON t.fiModel = m.idModel
WHERE (m.ModelName = 'LT26i')
AND EXISTS(
SELECT NULL
FROM tabDataDetail AS td
INNER JOIN tabSparePart AS sp ON td.fiSparePart = sp.idSparePart
WHERE (td.fiData = t.idData)
AND (td.Material_Quantity <> 0)
AND (sp.SparePartName = '1254-3751')
)
), needToChange AS
(
SELECT idData FROM tabData AS t INNER JOIN
modModel AS m ON t.fiModel = m.idModel
WHERE (m.ModelName = 'LT26i')
AND EXISTS(
SELECT NULL
FROM tabDataDetail AS td
INNER JOIN tabSparePart AS sp ON td.fiSparePart = sp.idSparePart
WHERE (td.fiData = t.idData)
AND (td.Material_Quantity <> 0)
AND (sp.SparePartName IN ('1257-2741','1257-2742','1248-2338','1254-7035','1248-2345','1254-7042'))
)
)
SELECT t.idData
FROM tabData AS t INNER JOIN modModel AS m ON t.fiModel = m.idModel
INNER JOIN needToChange ON t.idData = needToChange.idData -- needs to change FpcAssy
LEFT OUTER JOIN fpcRepairs rep ON t.idData = rep.idData
WHERE
rep.idData IS NOT NULL -- FpcAssy replaced, check if reused was claimed correctly
AND rep.FPC_Row > 1 -- other FpcAssy repair before
AND (
SELECT SP FROM fpcRepairs lastRep
WHERE lastRep.SSN_Number = rep.SSN_Number
AND lastRep.FPC_Row = rep.FPC_Row - 1
) = rep.SP -- same SP, must be rejected(reused+reused or new+new)
OR
rep.idData IS NOT NULL -- FpcAssy replaced, check if reused was claimed correctly
AND rep.FPC_Row = 1 -- no other FpcAssy repair before
AND rep.SP = 0 -- not reused, must be rejected
order by t.idData
Here's the execution plan:
Download: http://www.filedropper.com/exeplanfpc
Try using a UNION ALL of two separate queries instead of the OR condition.
I've tried it many times and it really helped. I read about this issue in The Art of SQL.
It is worth reading; it has a lot of useful information about performance issues.
UPDATE:
Check related questions
UNION ALL vs OR condition in sql server query
http://www.sql-server-performance.com/2011/union-or-sql-server-queries/
Can UNION ALL be faster than JOINs or do my JOINs just suck?
Check Wes's answer
The usage of the OR is probably causing the query optimizer to no longer use an index in the second query.
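A sketch of what the UNION ALL rewrite could look like for the final SELECT in the question. It is a fragment: it assumes the WITH clause defining fpcRepairs and needToChange is kept unchanged in front of it. The LEFT OUTER JOIN plus rep.idData IS NOT NULL is written as an INNER JOIN, which is equivalent here. Because one branch requires FPC_Row > 1 and the other FPC_Row = 1, no row can satisfy both, so UNION ALL should not introduce duplicates relative to the OR version.

```sql
-- Branch 1: an earlier FpcAssy repair exists with the same SP
SELECT t.idData
FROM tabData AS t
INNER JOIN modModel AS m ON t.fiModel = m.idModel
INNER JOIN needToChange ON t.idData = needToChange.idData
INNER JOIN fpcRepairs AS rep ON t.idData = rep.idData
WHERE rep.FPC_Row > 1
  AND (
        SELECT SP FROM fpcRepairs AS lastRep
        WHERE lastRep.SSN_Number = rep.SSN_Number
          AND lastRep.FPC_Row = rep.FPC_Row - 1
      ) = rep.SP

UNION ALL

-- Branch 2: first FpcAssy repair, not claimed as reused
SELECT t.idData
FROM tabData AS t
INNER JOIN modModel AS m ON t.fiModel = m.idModel
INNER JOIN needToChange ON t.idData = needToChange.idData
INNER JOIN fpcRepairs AS rep ON t.idData = rep.idData
WHERE rep.FPC_Row = 1
  AND rep.SP = 0

ORDER BY idData;
```

Each branch can then be optimized on its own, which matches the observation that each condition runs in a couple of seconds when executed alone. Whether the combined query is actually faster should be confirmed against the execution plan.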