I have a problem I have been working on for the past several hours. It is complex (for me) and I don't expect someone to do it for me; I just need a push in the right direction.
Problem: We had the tables (below) added to our database, and I need to populate them based on data already in our DailyCosts table. The tricky part is that I need to take DailyCosts.Notes and move it to PurchaseOrder.PoNumber; Notes is where we currently keep the PO numbers.
I started with the insert below, testing it on one WellID. It inserts records from our DailyCosts table into the new PurchaseOrder table:
Insert Into PurchaseOrder (PoNumber, WellID, JobID, ID)
Select Distinct Cast(Notes As nvarchar(20)), WellID, JobID,
DailyCosts.DailyCostID
From DailyCosts
Where WellID = '24A-23'
It affected 1,973 rows. (The Notes column is ntext, hence the cast to nvarchar.)
However, I need to update the other new tables because we need to see the actual PONumbers in the application.
This next insert copies records from our DailyCosts table and the new PurchaseOrder table (from above) into another new table called PurchaseOrderDailyCost:
Insert Into PurchaseOrderDailyCost (WellID, JobID, ReportNo, AccountCode, PurchaseOrderID, ID, DailyCostSeqNo, DailyCostID)
Select Distinct DailyCosts.WellID, DailyCosts.JobID, DailyCosts.ReportNo, DailyCosts.AccountCode,
PurchaseOrder.ID, NEWID(), 0, DailyCosts.DailyCostID
From DailyCosts
Join PurchaseOrder On DailyCosts.WellID = PurchaseOrder.WellID
Where DailyCosts.WellID = '24A-23'
Unfortunately, this produces 3,892,729 records. The Notes field contains the same list of PO numbers each day; that is by design, so the people entering data out in the field can easily track their PO numbers. The new PONumber column we are moving the Notes to should hold only unique PO numbers. I modified the query by replacing NEWID() with DailyCostID and changing the join condition to ON DailyCosts.DailyCostID = PurchaseOrder.ID.
This affected 1,973 rows, the same as the first insert.
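For clarity, the modified version of that insert looked something like this (reconstructed from the description above, so treat it as a sketch rather than the exact statement I ran):

Insert Into PurchaseOrderDailyCost (WellID, JobID, ReportNo, AccountCode, PurchaseOrderID, ID, DailyCostSeqNo, DailyCostID)
Select Distinct DailyCosts.WellID, DailyCosts.JobID, DailyCosts.ReportNo, DailyCosts.AccountCode,
PurchaseOrder.ID, DailyCosts.DailyCostID, 0, DailyCosts.DailyCostID   -- DailyCostID used in place of NEWID()
From DailyCosts
Join PurchaseOrder On DailyCosts.DailyCostID = PurchaseOrder.ID       -- joined on the IDs instead of WellID
Where DailyCosts.WellID = '24A-23'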
The next Insert looks like this:
Insert Into PurchaseOrderAccount (WellID, JobID, PurchaseOrderID, ID, AccountCode)
Select PurchaseOrder.WellID, PurchaseOrder.JobID, PurchaseOrder.ID, PurchaseOrderDailyCost.DailyCostID, PurchaseOrderDailyCost.AccountCode
From PurchaseOrder
Inner Join PurchaseOrderDailyCost On PurchaseOrder.ID = PurchaseOrderDailyCost.DailyCostID
Where PurchaseOrder.WellID = '24A-23'
The page in the application now shows the PONumbers in the correct column. Everything looks like I want it to.
Unfortunately, it slows down the application to an unacceptable level. I need to figure out how to either modify my Insert or delete duplicate records. The problem is that there are multiple foreign key constraints. I have some more information below for reference.
This shows the application after the inserts. These are all duplicate records that I am hoping to eliminate.
Here is some additional information I received from the vendor about the tables:
-- add a new purchase order
INSERT INTO PurchaseOrder
(WellID, JobID, ID, PONumber, Amount, Description)
VALUES ('MyWell', 'MyJob', NEWID(), 'PO444444', 500.0, 'A new Purchase Order')
-- link a purchase order with id 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3' to a DailyCost record with SeqNo 0 and AccountCode 'MyAccount'
INSERT INTO PurchaseOrderDailyCost
(WellID, JobID, ReportNo, AccountCode, DailyCostSeqNo, PurchaseOrderID, ID)
VALUES ('MyWell', 'MyJob', 4, 'MyAccount', 0, 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3', NEWID())
-- link a purchase order with id 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3' to an account code 'MyAccount'
-- (i.e. make it choosable from the DailyCost PO-column dropdown for any DailyCost record whose account code is 'MyAccount')
INSERT INTO PurchaseOrderAccount
(WellID, JobID, PurchaseOrderID, ID, AccountCode)
VALUES ('MyWell', 'MyJob', 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3', NEWID(), 'MyAccount')
-- link a purchase order with id 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3' to an AFE No. 'MyAFENo'
-- (same behavior as with the account codes above)
INSERT INTO PurchaseOrderAFE
(WellID, JobID, PurchaseOrderID, ID, AFENo)
VALUES ('MyWell', 'MyJob', 'A356FBF4-A19B-4466-9E5C-20C5FD0E95C3', NEWID(), 'MyAFENo')
So it turns out I missed some simple joining principles. The better I get, the more silly mistakes I seem to make. Basically, on my very first insert, I did not include a GROUP BY. Adding it took that insert from 1,973 rows down to 93. Then on my next insert (INSERT 2 in my question), I joined DailyCosts.Notes to PurchaseOrder.PONumber, since those are the only records from DailyCosts I needed. From there, everything came together. Two steps forward and one step back. Thanks to everyone who responded.
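For anyone following along, the corrected inserts looked roughly like this. This is a reconstruction from the description above, not the exact code: in particular, using NEWID() for the new PurchaseOrder.ID is an assumption (it matches the vendor's examples), and the join relies on casting the ntext Notes column so it can be compared to PONumber.

-- First insert, now grouped so each PO number appears only once per well/job
Insert Into PurchaseOrder (PoNumber, WellID, JobID, ID)
Select Cast(Notes As nvarchar(20)), WellID, JobID, NEWID()
From DailyCosts
Where WellID = '24A-23'
Group By Cast(Notes As nvarchar(20)), WellID, JobID

-- Second insert, joining DailyCosts.Notes to PurchaseOrder.PONumber
Insert Into PurchaseOrderDailyCost (WellID, JobID, ReportNo, AccountCode, PurchaseOrderID, ID, DailyCostSeqNo, DailyCostID)
Select Distinct DailyCosts.WellID, DailyCosts.JobID, DailyCosts.ReportNo, DailyCosts.AccountCode,
PurchaseOrder.ID, NEWID(), 0, DailyCosts.DailyCostID
From DailyCosts
Join PurchaseOrder On Cast(DailyCosts.Notes As nvarchar(20)) = PurchaseOrder.PONumber
Where DailyCosts.WellID = '24A-23'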
Say I have a table with 100,000 User IDs (UserID is an int).
When I run a query like
SELECT COUNT(Distinct User ID) from tableUserID
the result I get is HIGHER than the result from the following statement:
SELECT COUNT(User ID) from tableUserID
I thought Distinct implied unique, which would mean a lower result. What would cause this discrepancy and how would I identify those user IDs that don't show up in the 2nd query?
Thanks
UPDATE - 11:14 AM EST
Hi All
I sincerely apologize as I should've taken the trouble to reproduce this in my local environment. But I just wanted to see if there was a general consensus about this. Here are the full details:
The query is a result of an inner join between 2 tables.
One has this information:
TABLE ACTIVITY (NO PRIMARY KEY)
UserID int (not Nullable)
JoinDate datetime
Status tinyint
LeaveDate datetime
SentAutoMessage tinyint
SectionDetails varchar
And here is the second table:
TABLE USER_INFO (CLUSTERED PRIMARY KEY)
UserID int (not Nullable)
UserName varchar
UserActive int
CreatedOn datetime
DisabledOn datetime
The tables are joined on UserID, and the UserID selected in the original two queries is the one from the ACTIVITY table.
Hope this clarifies the question.
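In simplified form, the two counts come from a join shaped something like this (a simplified reconstruction; the real queries have more columns and filters):

SELECT COUNT(DISTINCT a.UserID)
FROM ACTIVITY a
INNER JOIN USER_INFO u ON u.UserID = a.UserID

SELECT COUNT(a.UserID)
FROM ACTIVITY a
INNER JOIN USER_INFO u ON u.UserID = a.UserID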
This is not technically an answer, but since I took the time to analyze this, I might as well post it (even though I risk being downvoted).
There was no way I could reproduce the described behavior.
This is the scenario:
declare @table table ([user id] int)
insert into @table values
(1),(1),(1),(1),(1),(1),(1),(2),(2),(2),(2),(2),(2),(null),(null)
And here are some queries and their results:
SELECT COUNT(User ID) FROM @table --error: this does not run
SELECT COUNT(distinct User ID) FROM @table --error: this does not run
SELECT COUNT([User ID]) FROM @table --result: 13 (nulls not counted)
SELECT COUNT(distinct [User ID]) FROM @table --result: 2 (nulls not counted)
And something interesting:
SELECT user --result: 'dbo' in my sandbox DB
SELECT count(user) from @table --result: 15 (nulls are counted because the user value is not null)
SELECT count(distinct user) from @table --result: 1 (user is always the same value)
I find it very odd that you were able to run the queries exactly as you described. You'd have to let us know the table structure and the data to get further help.
how would I identify those user IDs that don't show up in the 2nd query
Try this query
SELECT UserID from tableUserID Where UserID not in (SELECT Distinct UserID from tableUserID)
I suspect it will return no rows.
Edit:
User is a reserved keyword. Did you mean UserID in your queries?
Ray : Yes
I tried to reproduce the problem in my environment, and my conclusion is that, given the conditions you described, the result of the first query cannot be higher than that of the second. Even if there were NULLs, it just won't happen.
Did you run the query @Jean-Charles suggested?
I'm very intrigued with this, please let us know what turns out to be the problem.
Let's say I have a database table called "Scrape", possibly set up like:
UserID (int)
UserName (varchar)
Wins (int)
Losses (int)
ScrapeDate (datetime)
I'm trying to rank my users based on their win/loss ratio. However, each week I'll be scraping new data on the users and making another entry in the Scrape table.
How can I query a list of users sorted by wins/losses, but only taking into consideration the most recent entry (ScrapeDate)?
Also, do you think it matters that people will be hitting the site and the scrape may possibly be in the middle of completing?
For example I could have:
1 - Bob - Wins: 320 - Losses: 110 - ScrapeDate: 7/8/09
1 - Bob - Wins: 360 - Losses: 122 - ScrapeDate: 7/17/09
2 - Frank - Wins: 115 - Losses: 20 - ScrapeDate: 7/8/09
This represents a scrape that has only updated Bob so far and is in the process of updating Frank, whose new row has yet to be inserted. How would you handle this situation as well?
So, my question is:
How would you handle querying only the most recent scrape of each user to determine the rankings
Do you think the fact that the database may be in a state of updating (especially if a scrape could take up to 1 day to complete), and not all users have completely updated yet matters? If so, how would you handle this?
Thank you, and thank you for your responses you have given me on my related question:
When scraping a lot of stats from a webpage, how often should I insert the collected results in my DB?
This is what I call the "greatest-n-per-group" problem. It comes up several times per week on StackOverflow.
I solve this type of problem using an outer join technique:
SELECT s1.*, s1.wins * 1.0 / NULLIF(s1.losses, 0) AS win_loss_ratio -- *1.0 avoids integer division; NULLIF avoids divide-by-zero
FROM Scrape s1
LEFT OUTER JOIN Scrape s2
ON (s1.username = s2.username AND s1.ScrapeDate < s2.ScrapeDate)
WHERE s2.username IS NULL
ORDER BY win_loss_ratio DESC;
This will return only one row for each username -- the row with the greatest value in the ScrapeDate column. That's what the outer join is for, to try to match s1 with some other row s2 with the same username and a greater date. If there is no such row, the outer join returns NULL for all columns of s2, and then we know s1 corresponds to the row with the greatest date for that given username.
This should also work when you have a partially-completed scrape in progress.
This technique isn't necessarily as speedy as the CTE and RANKING solutions other answers have given. You should try both and see what works better for you. The reason I prefer my solution is that it works in any flavor of SQL.
Try something like:
Select the user id and the max date of the last entry for each user.
Select and order the records to get the ranking, based on the results of the query above.
This should work; however, performance depends on your database size.
DECLARE
@last_entries TABLE(id int, dte datetime)
-- insert date (dte) of last entry for each user (id)
INSERT INTO
@last_entries (id, dte)
SELECT
UserID,
MAX(ScrapeDate)
FROM
Scrape WITH (NOLOCK)
GROUP BY
UserID
-- select ranking
SELECT
-- optionally you can use the RANK() OVER() function to get a rank value
UserName,
Wins,
Losses
FROM
@last_entries
JOIN
Scrape WITH (NOLOCK)
ON
UserID = id
AND ScrapeDate = dte
ORDER BY
Wins,
Losses
I haven't tested this code, so it may not compile on the first run.
The answer to part one of your question depends on the version of SQL Server you are using - SQL 2005+ offers ranking functions which make this kind of query a bit simpler than in SQL 2000 and earlier. I'll update this with more detail if you indicate which platform you're using.
I suspect the clearest way to handle part 2 is to display the stats for the latest complete scraping exercise, otherwise you aren't showing a time-consistent ranking (although, if your data collection exercise takes 24 hours, there's a certain amount of latitude already).
To simplify this, you could create a table to hold metadata about each scrape operation, giving each one an id, start date and completion date (at a minimum), and display those records which relate to the latest complete scrape. To make this easier, you could remove the "scrape date" from the data collection table, and replace it with a foreign key linking each data row to a row in the scrape table.
EDIT
The following code illustrates how to rank users by their latest score, regardless of whether they are time-consistent:
create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeDate datetime
)
INSERT #scrape
select 'Alice',100,200,'20090101'
union select 'Alice',120,210,'20090201'
union select 'Bob' ,200,200,'20090101'
union select 'Clara',300,100,'20090101'
union select 'Clara',300,210,'20090201'
union select 'Dave' ,100,10 ,'20090101'
;with latestScrapeCTE
AS
(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY userName
ORDER BY scrapeDate desc
) AS rn
,wins + losses AS totalPlayed
,wins - losses as winDiff
from #scrape
)
SELECT userName
,wins
,losses
,scrapeDate
,winDiff
,totalPlayed
,RANK() OVER (ORDER BY winDiff desc
,totalPlayed desc
) as rankPos
FROM latestScrapeCTE
WHERE rn = 1
ORDER BY rankPos
EDIT 2
An illustration of the use of a metadata table to select the latest complete scrape:
create table #scrape_run
(runID int identity
,startDate datetime
,completedDate datetime
)
create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeRunID int
)
INSERT #scrape_run
select '20090101', '20090102'
union select '20090201', null --null completion date indicates that the scrape is not complete
INSERT #scrape
select 'Alice',100,200,1
union select 'Alice',120,210,2
union select 'Bob' ,200,200,1
union select 'Clara',300,100,1
union select 'Clara',300,210,2
union select 'Dave' ,100,10 ,1
;with latestScrapeCTE
AS
(
SELECT TOP 1 runID
,startDate
FROM #scrape_run
WHERE completedDate IS NOT NULL
ORDER BY startDate DESC --latest completed run
)
SELECT userName
,wins
,losses
,startDate AS scrapeDate
,wins - losses AS winDiff
,wins + losses AS totalPlayed
,RANK() OVER (ORDER BY (wins - losses) desc
,(wins + losses) desc
) as rankPos
FROM #scrape
JOIN latestScrapeCTE
ON runID = scrapeRunID
ORDER BY rankPos
My memory is failing me. I have a simple audit log table based on a trigger:
ID int (identity, PK)
CustomerID int
Name varchar(255)
Address varchar(255)
AuditDateTime datetime
AuditCode char(1)
It has data like this:
ID  CustomerID  Name     Address              AuditDateTime            AuditCode
1   123         Bob      123 Internet Way     2009-07-17 13:18:06.353  I
2   123         Bob      123 Internet Way     2009-07-17 13:19:02.117  D
3   123         Jerry    123 Internet Way     2009-07-17 13:36:03.517  I
4   123         Bob      123 My Edited Way    2009-07-17 13:36:08.050  U
5   100         Arnold   100 SkyNet Way       2009-07-17 13:36:18.607  I
6   100         Nicky    100 Star Way         2009-07-17 13:36:25.920  U
7   110         Blondie  110 Another Way      2009-07-17 13:36:42.313  I
8   113         Sally    113 Yet another Way  2009-07-17 13:36:57.627  I
What would be an efficient select statement to get the most current record for each customer between a start and end time? FYI: I is for insert, D for delete, and U for update.
Am I missing anything in the audit table? My next step is to create an audit table that only records changes, yet still lets me extract the most recent records for a given time frame. For the life of me I cannot find this easily with any search engine. Links would work too. Thanks for the help.
Another (better?) method of keeping audit history is to use 'startDate' and 'endDate' columns rather than an AuditDateTime and AuditCode column. This is often the approach for tracking Type 2 changes (new versions of a row) in data warehouses.
This lets you more directly select the current rows (WHERE endDate is NULL), and you will not need to treat updates differently than inserts or deletes. You simply have three cases:
Insert: copy the full row along with a start date and NULL end date
Delete: set the End Date of the existing current row (endDate is NULL)
Update: do a Delete then Insert
Your select would simply be:
select * from AuditTable where endDate is NULL
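As a rough sketch of what that might look like (the table and column names here are illustrative, and @CustomerID / @NewName / @NewAddress stand in for whatever values your trigger or procedure has at hand):

-- Illustrative audit table keyed by row versions rather than audit codes
CREATE TABLE CustomerAudit (
    ID int identity PRIMARY KEY,
    CustomerID int,
    Name varchar(255),
    Address varchar(255),
    StartDate datetime NOT NULL,
    EndDate datetime NULL   -- NULL means this is the current version
)

-- Update case: close the current row, then insert the new version
UPDATE CustomerAudit
SET EndDate = GETDATE()
WHERE CustomerID = @CustomerID AND EndDate IS NULL

INSERT INTO CustomerAudit (CustomerID, Name, Address, StartDate, EndDate)
VALUES (@CustomerID, @NewName, @NewAddress, GETDATE(), NULL)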
Anyway, here's my query for your existing schema:
declare @from datetime
declare @to datetime
select b.* from (
    select
        customerId,
        max(auditdatetime) 'auditDateTime'
    from
        AuditTable
    where
        auditcode in ('I', 'U')
        and auditdatetime between @from and @to
    group by customerId
    having
        /* rely on "current" being defined as INSERTS > DELETES */
        sum(case when auditcode = 'I' then 1 else 0 end) >
        sum(case when auditcode = 'D' then 1 else 0 end)
) a
cross apply (
    select top 1 customerId, name, address, auditdateTime
    from AuditTable
    where auditdatetime = a.auditdatetime and customerId = a.customerId
) b
References
A cribsheet for data warehouses, but has a good section on type 2 changes (what you want to track)
MSDN page on data warehousing
Ok, a couple of things for audit log tables.
For most applications, we want audit tables to be extremely quick on insertion.
If the audit log is truly for diagnostic or for very irregular audit reasons, then the quickest insertion criteria is to make the table physically ordered upon insertion time.
And this means to put the audit time as the first column of the clustered index, e.g.
create unique clustered index idx_mytable on mytable(AuditDateTime, ID)
This will allow for extremely efficient select queries upon AuditDateTime O(log n), and O(1) insertions.
If you wish to look up your audit table on a per CustomerID basis, then you will need to compromise.
You may add a nonclustered index upon (CustomerID, AuditDateTime), which will allow for O(log n) lookup of per-customer audit history, however the cost will be the maintenance of that nonclustered index upon insertion - that maintenance will be O(log n) conversely.
However that insertion time penalty may be preferable to the table scan (that is, O(n) time complexity cost) that you will need to pay if you don't have an index on CustomerID and this is a regular query that is performed.
An O(n) lookup for an irregular query can lock the table and block the writers. So it is sometimes in the writers' interest to be slightly slower on insert, if that guarantees readers won't block their commits by having to table-scan for lack of a good index to support them.
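For illustration, the supporting index described above might look like this (the index name is arbitrary):

create nonclustered index idx_mytable_customer on mytable (CustomerID, AuditDateTime)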
Addition: if you are looking to restrict to a given timeframe, the most important thing first of all is the index upon AuditDateTime. And make it clustered as you are inserting in AuditDateTime order. This is the biggest thing you can do to make your query efficient from the start.
Next, if you are looking for the most recent update for every CustomerID within a given timespan, a full scan of the data, restricted by insertion date, is required.
You will need a subquery over your audit table, restricted to the range:
select CustomerID, max(AuditDateTime) MaxAuditDateTime
from AuditTrail
where AuditDateTime >= @begin and AuditDateTime <= @end
and then incorporate that into your select query proper, e.g.:
select AuditTrail.* from AuditTrail
inner join
(select CustomerID, max(AuditDateTime) MaxAuditDateTime
from AuditTrail
where AuditDateTime >= @begin and AuditDateTime <= @end
) filtration
on filtration.CustomerID = AuditTrail.CustomerID and
filtration.MaxAuditDateTime = AuditTrail.AuditDateTime
Another approach is to use a sub-select:
select a.ID
, a.CustomerID
, a.Name
, a.Address
, a.AuditDateTime
, a.AuditCode
from myauditlogtable a
join (select s.CustomerID
      , max(s.AuditDateTime) as maxAuditDateTime
      from myauditlogtable as s
      group by s.CustomerID) as subq
on subq.CustomerID = a.CustomerID and subq.maxAuditDateTime = a.AuditDateTime;
Do you mean a start and end time, e.g. between 1am and 3am?
Or a start and end date-time, e.g. 2009-07-17 13:36 to 2009-07-18 13:36?
Suppose I have a table of customers:
CREATE TABLE customers (
customer_number INTEGER,
customer_name VARCHAR(...),
customer_address VARCHAR(...)
)
This table does not have a primary key. However, customer_name and customer_address should be unique for any given customer_number.
It is not uncommon for this table to contain many duplicate customers. To get around this duplication, the following query is used to isolate only the unique customers:
SELECT
DISTINCT customer_number, customer_name, customer_address
FROM customers
Fortunately, the table has traditionally contained accurate data. That is, there has never been a conflicting customer_name or customer_address for any customer_number. However, suppose conflicting data did make it into the table. I wish to write a query that will fail, rather than returning multiple rows for the customer_number in question.
For example, I tried this query with no success:
SELECT
customer_number, DISTINCT(customer_name, customer_address)
FROM customers
GROUP BY customer_number
Is there a way to write such a query using standard SQL? If not, is there a solution in Oracle-specific SQL?
EDIT: The rationale behind the bizarre query:
Truth be told, this customers table does not actually exist (thank goodness). I created it hoping that it would be clear enough to demonstrate the needs of the query. However, people are (fortunately) catching on that the need for such a query is the least of my worries, based on that example. Therefore, I must now peel away some of the abstraction and hopefully restore my reputation for suggesting such an abomination of a table...
I receive a flat file containing invoices (one per line) from an external system. I read this file, line-by-line, inserting its fields into this table:
CREATE TABLE unprocessed_invoices (
invoice_number INTEGER,
invoice_date DATE,
...
-- other invoice columns
...
customer_number INTEGER,
customer_name VARCHAR(...),
customer_address VARCHAR(...)
)
As you can see, the data arriving from the external system is denormalized. That is, the external system includes both the invoice data and its associated customer data on the same line. It is possible that multiple invoices will share the same customer, therefore it is possible to have duplicate customer data.
The system cannot begin processing the invoices until all customers are guaranteed to be registered with the system. Therefore, the system must identify the unique customers and register them as necessary. This is why I wanted the query: because I was working with denormalized data I had no control over.
SELECT
customer_number, DISTINCT(customer_name, customer_address)
FROM unprocessed_invoices
GROUP BY customer_number
Hopefully this helps clarify the original intent of the question.
EDIT: Examples of good/bad data
To clarify: customer_name and customer_address only have to be unique for a particular customer_number.
customer_number | customer_name | customer_address
----------------------------------------------------
1 | 'Bob' | '123 Street'
1 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
3 | 'Fred' | '456 Avenue'
3 | 'Fred' | '789 Crescent'
The first two rows are fine because it is the same customer_name and customer_address for customer_number 1.
The middle two rows are fine because it is the same customer_name and customer_address for customer_number 2 (even though another customer_number has the same customer_name and customer_address).
The last two rows are not okay because there are two different customer_addresses for customer_number 3.
The query I am looking for would fail if run against all six of these rows. However, if only the first four rows actually existed, the view should return:
customer_number | customer_name | customer_address
----------------------------------------------------
1 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
I hope this clarifies what I meant by "conflicting customer_name and customer_address". They have to be unique per customer_number.
I appreciate those of you who are explaining how to properly import data from external systems. In fact, I am already doing most of that. I purposely hid the details of what I'm doing so that it would be easier to focus on the question at hand. This query is not meant to be the only form of validation; I just thought it would make a nice finishing touch (a last line of defense, so to speak). This question was simply designed to investigate what is possible with SQL. :)
Your approach is flawed. You do not want data that was successfully stored to then throw an error on a select; that is a land mine waiting to happen, and it means you never know when a select could fail.
What I recommend is that you add a unique key to the table, and slowly start modifying your application to use this key rather than relying on any combination of meaningful data.
You can then stop caring about duplicate data, which is not really duplicate in the first place. It is entirely possible for two people with the same name to share the same address.
You will also gain performance improvements from this approach.
As an aside, I highly recommend you normalize your data: break the name up into FirstName and LastName (optionally MiddleName too), and break the address field into separate fields for each component (Address1, Address2, City, State, Country, Zip, or whatever).
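Purely as an illustration of that split (the names and sizes here are made up, not from the question):

CREATE TABLE customers (
    customer_id      INTEGER PRIMARY KEY,   -- surrogate key, as recommended above
    customer_number  INTEGER,
    first_name       VARCHAR(50),
    middle_name      VARCHAR(50),
    last_name        VARCHAR(50),
    address1         VARCHAR(100),
    address2         VARCHAR(100),
    city             VARCHAR(50),
    state            VARCHAR(50),
    country          VARCHAR(50),
    zip              VARCHAR(20)
)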
Update: If I understand your situation correctly (which I am not sure I do), you want to prevent duplicate combinations of name and address from ever occurring in the table (even though that is a possible occurrence in real life). This is best done by a unique constraint or index on these two fields to prevent the data from being inserted. That is, catch the error before you insert it. That will tell you the import file or your resulting app logic is bad and you can choose to take the appropriate measures then.
I still maintain that throwing the error when you query is too late in the game to do anything about it.
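For example, a unique constraint along these lines would reject the bad rows at insert time (the constraint name is arbitrary; pick whichever column combination actually has to be unique for your data):

ALTER TABLE customers
ADD CONSTRAINT UQ_customers_name_address UNIQUE (customer_name, customer_address)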
A scalar sub-query must only return one row (per result set row...) so you could do something like:
select distinct
customer_number,
(
select distinct
customer_address
from customers c2
where c2.customer_number = c.customer_number
) as customer_address
from customers c
Making the query fail may be tricky...
This will show you if there are any duplicate records in the table:
select customer_number, customer_name, customer_address
from customers
group by customer_number, customer_name, customer_address
having count(*) > 1
If you just add a unique index across all three fields, no one can create a duplicate record in the table.
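For instance (the index name is arbitrary):

CREATE UNIQUE INDEX ux_customers_num_name_addr
ON customers (customer_number, customer_name, customer_address)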
The de facto key is Name+Address, so that's what you need to group by.
SELECT
Customer_Name,
Customer_Address,
CASE WHEN Count(DISTINCT Customer_Number) > 1
THEN 1/0 ELSE 0 END as LandMine
FROM Customers
GROUP BY Customer_Name, Customer_Address
If you want to do it from the point of view of a Customer_Number, then this is good too.
SELECT *,
CASE WHEN Exists((
SELECT top 1 1
FROM Customers c2
WHERE c1.Customer_Number != c2.Customer_Number
AND c1.Customer_Name = c2.Customer_Name
AND c1.Customer_Address = c2.Customer_Address
)) THEN 1/0 ELSE 0 END as LandMine
FROM Customers c1
WHERE Customer_Number = @Number
If you have dirty data, I would clean it up first.
Use this to find the duplicate customer records...
Select * From customers
Where customer_number in
(Select Customer_number from customers
Group by customer_number Having count(*) > 1)
If you want it to fail you're going to need to have an index. If you don't want to have an index, then you can just create a temp table to do this all in.
CREATE TABLE #temp_customers
(customer_number int,
customer_name varchar(50),
customer_address varchar(50),
PRIMARY KEY (customer_number),
UNIQUE (customer_name, customer_address)
)
INSERT INTO #temp_customers
SELECT DISTINCT customer_number, customer_name, customer_address
FROM customers
SELECT customer_number, customer_name, customer_address
FROM #temp_customers
DROP TABLE #temp_customers
This will fail if there are issues but will keep your duplicate records from causing issues.
Let's put the data into a temp table or table variable with your distinct query
select distinct customer_number, customer_name, customer_address,
IDENTITY(int, 1,1) AS ID_Num
into #temp
from unprocessed_invoices
Personally, I would add an identity to unprocessed_invoices as well, if possible. I never do an import without creating a staging table that has an identity column, simply because it makes it easier to delete duplicate records.
Now let's query the table to find your problem records. I assume you would want to see what is causing the problem, not just fail the import.
Select t1.* from #temp t1
join #temp t2
on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address
where t1.customer_number <> t2.customer_number
select t1.* from #temp t1
join
(select customer_number from #temp group by customer_number having count(*) >1) t2
on t1.customer_number = t2.customer_number
You can use a variation on these queries to delete the problem records from #temp (depending on whether you choose to keep one row or delete all possible problems) and then insert from #temp into your production table. You can also provide the problem records back to whoever supplies your data so they can be fixed at their end.
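For example, one possible variation that simply drops every conflicting customer_number from the staging data (keeping one of each, or reporting them back instead, would be equally valid; this is just a sketch):

delete from #temp
where customer_number in
      (select customer_number
       from #temp
       group by customer_number
       having count(*) > 1)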