T-SQL Duplicate, selecting 'master' record based on Modified Date - sql

I have a database table with two ID fields: a GUID assigned by the system, and an ExternalID, which is used to denote duplicates after data cleansing. The table also contains a ModifiedDate.
I am trying to merge these records, with the most recently modified record absorbing the older accounts. I have tried the following query types.
SELECT
a1.GUID
,a1.ModifiedDate
,a2.GUID
,a2.ModifiedDate
FROM Accounts a1
INNER JOIN Accounts a2 on a1.ExternalID = a2.ExternalID
This unfortunately causes each pair of duplicate accounts to appear twice: once for the Master record, and again for the subordinate record, which returns the Master record as a duplicate.
WITH Dup as (
SELECT 1 as track
,ExternalID DomEx
,ExternalID
,GUID DomGUID
,ModDate
from crm.Accounts
WHERE ExternalID is not null
UNION ALL
SELECT track +1
,OI.DomEx
,OG.ExternalID
,OG.GUID
,Og.ModDate
from crm.Accounts OG
INNER JOIN Dup OI on OI.ExternalID = OG.ExternalID
)
,
cte_dp as(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ExternalID Order by track, ModDate desc) rn
FROM Dup
)
SELECT * FROM cte_dp
This unfortunately reaches the recursion limit of 100, and runs indefinitely if the limit is escaped.
Is it possible to correct the logic here in order to present the results required, or is there a more elegant solution?
+--------------+--------------------+-------------------+
| MasterGUID   | SharedExternalID   | SubordinateGUID   |
+--------------+--------------------+-------------------+
| (MasterGUID) | (SharedExternalID) | (SubordinateGUID) |
| (MasterGUID) | (SharedExternalID) | (SubordinateGUID) |
+--------------+--------------------+-------------------+
The above is the result I would ideally like to achieve, with MasterGUID being the GUID with the most recent modified date of the two duplicates.

MERGE Accounts a1
USING Accounts a2
ON a1.ExternalID = a2.ExternalID
WHEN MATCHED THEN
UPDATE
SET a1.ModifiedDate = a2.ModifiedDate,
a1.guid = a2.guid;
SELECT * FROM Accounts a1;

a1.ExternalID = a2.ExternalID
is symmetric: if you swap a1 and a2, the condition has the same logical result. So every pair of distinct rows that matches will appear twice in the result, once in each order (and every row also matches itself). We need to break the symmetry with an additional condition:
a1.ExternalID = a2.ExternalID and a1.GUID < a2.GUID
This will prevent joining with self. If that is needed, you can use union, but for now I will assume that is not needed. If there is another match of ExternalID, the match will yield true if the left side has a strictly smaller GUID than the right side, therefore the inverse will not be true and the duplicates will go away.
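To see the effect of the extra condition, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The three sample rows and their values are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: g1 and g2 share an ExternalID, g3 is unique.
conn.executescript("""
    CREATE TABLE Accounts (GUID TEXT PRIMARY KEY, ExternalID TEXT, ModifiedDate TEXT);
    INSERT INTO Accounts VALUES
        ('g1', 'ext-A', '2020-01-01'),
        ('g2', 'ext-A', '2021-06-15'),
        ('g3', 'ext-B', '2019-03-10');
""")

# Symmetric join: each duplicate pair appears in both orders, plus self-matches.
symmetric = conn.execute("""
    SELECT a1.GUID, a2.GUID
    FROM Accounts a1
    INNER JOIN Accounts a2 ON a1.ExternalID = a2.ExternalID
    ORDER BY a1.GUID, a2.GUID
""").fetchall()

# Asymmetric join: the strict inequality keeps only one ordering of each pair
# and drops the self-matches.
asymmetric = conn.execute("""
    SELECT a1.GUID, a2.GUID
    FROM Accounts a1
    INNER JOIN Accounts a2
        ON a1.ExternalID = a2.ExternalID AND a1.GUID < a2.GUID
    ORDER BY a1.GUID, a2.GUID
""").fetchall()

print(symmetric)
print(asymmetric)
```

If the left side should be the most recently modified record rather than the smaller GUID, the same idea should work with a1.ModifiedDate > a2.ModifiedDate, though rows with equal dates would then need a tie-breaker.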

If you post sample data this will be easier but is this what you mean?
SELECT *
FROM (
select *
, ROW_NUMBER() OVER (PARTITION BY ExternalID ORDER BY ModifiedDate DESC) rnk
from accounts
) i
WHERE i.rnk = 1
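Building on that ranking, one way to get the MasterGUID / SubordinateGUID layout from the question is to join the rank-1 row of each ExternalID to the remaining rows. This is only a sketch, run through Python's sqlite3 module (which needs SQLite 3.25 or later for window functions); the sample rows and the GUID tie-breaker are assumptions, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: three accounts share ext-A, one account has ext-B.
conn.executescript("""
    CREATE TABLE Accounts (GUID TEXT PRIMARY KEY, ExternalID TEXT, ModifiedDate TEXT);
    INSERT INTO Accounts VALUES
        ('g1', 'ext-A', '2020-01-01'),
        ('g2', 'ext-A', '2021-06-15'),
        ('g3', 'ext-A', '2018-05-05'),
        ('g4', 'ext-B', '2019-03-10');
""")

pairs = conn.execute("""
    WITH ranked AS (
        SELECT GUID, ExternalID,
               -- rnk = 1 is the most recently modified row per ExternalID;
               -- GUID is an assumed tie-breaker for equal dates.
               ROW_NUMBER() OVER (PARTITION BY ExternalID
                                  ORDER BY ModifiedDate DESC, GUID) AS rnk
        FROM Accounts
        WHERE ExternalID IS NOT NULL
    )
    SELECT m.GUID AS MasterGUID,
           m.ExternalID AS SharedExternalID,
           s.GUID AS SubordinateGUID
    FROM ranked m
    INNER JOIN ranked s
        ON s.ExternalID = m.ExternalID AND s.rnk > 1
    WHERE m.rnk = 1
    ORDER BY m.ExternalID, s.rnk
""").fetchall()

print(pairs)
```

Accounts whose ExternalID is not shared (g4 here) have no subordinate row and so drop out of the result, and each subordinate appears exactly once.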

Related

Why doesn't this SQL sub query statement work?

The statement produces the following error.
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
I presume I somehow need to concatenate the field names in the subquery?
SELECT (
SELECT COALESCE(Table_Field, Field) AS Fields
FROM API_Objects_Fields
WHERE Field IN (
'fullname'
,'confirmed'
,'primary_email'
,'location_short'
)
)
FROM user_basics U
INNER JOIN Pod_Membership PM ON U.UserID = PM.UserID
WHERE PM.PodID = 164
ORDER BY U.Ctime DESC
The sub query specifies the fields to be returned from the table.
DECLARE @Name VARCHAR(1000)
Select @Name =
COALESCE(@Name,'') +Table_Field + ';'
FROM API_Objects_Fields
WHERE Field IN
( 'fullname' ,'confirmed' ,'primary_email' ,'location_short' )
Select @Name As FieldName
@akfkmupiwu you need to do it like this, as per the comment above
WITH CTE AS
(SELECT (
SELECT DISTINCT TOP 1 COALESCE(Table_Field, Field)
FROM API_Objects_Fields F
WHERE F.UserID = PM.UserID AND F.Field IN (
'fullname'
,'confirmed'
,'primary_email'
,'location_short'
)
)AS Fields,
ROW_NUMBER()OVER (PARTITION BY Table_Field ORDER BY FIELD)AS RN
FROM user_basics U
INNER JOIN Pod_Membership PM ON U.UserID = PM.UserID
WHERE PM.PodID = 164
ORDER BY U.Ctime DESC
)
Select * from CTE WHERE RN = 1
This query is based on assumptions about your question.
What the error is telling you
The problem with your query is exactly what the error says: the subquery brings back more than one result. Since your subquery is in the select portion of the outer query (as opposed to the from or the where), SQL is looking for a single value to populate that column. Think of it in terms of filling in an Excel spreadsheet: you cannot add two separate values to one cell. Instead, you need the data to go into two separate rows.
On another note, COALESCE checks whether the first value is null; if it is, it returns the second value. If the first value is not null, that value is returned. It sounds to me that this is not the behavior that you are looking for.
How to fix this
You need to either change your query to pull back different rows for each of the possible values that Fields can be or you need to find a way to specify only one value to return for Fields. Since I am unsure what you are looking for, I am going to demonstrate the first way of solving this.
Data
Your question does not provide any data for API_Objects_Fields, so I am going to make some up. Let's assume the columns in this table are Field_ID, Table_Field, and Field and let's say that your table looks like this:
Field_ID | Table_Field | Field
1 | Alan Turing | fullname
2 | Catherine Zeta Jones | fullname
3 | True | confirmed
4 | MN | location_short
5 | 123-456-7890 | phone_number
As I mentioned before, right now your query would try to pull back all the rows where the field is fullname, confirmed, or location_short. Instead of trying to stuff 4 results into one column of one row, let's change your query to bring back 4 rows.
The Query
SELECT f.Table_Field, f.Field
FROM user_basics U
INNER JOIN Pod_Membership PM ON U.UserID = PM.UserID
INNER JOIN (
SELECT Table_Field, Field
FROM API_Objects_Fields
WHERE Field IN (
'fullname'
,'confirmed'
,'primary_email'
,'location_short'
)
) f ON 1 = 1 -- placeholder condition: nothing links f to the other tables yet
WHERE PM.PodID = 164
ORDER BY U.Ctime DESC
What will happen
This query will now pull back data that looks more like this:
Table_Field | Field
Alan Turing | fullname
Catherine Zeta Jones | fullname
True | confirmed
MN | location_short
However, I think you will be surprised with the results you actually end up with. Since the query does not connect the data in API_Objects_Fields with any other tables, you would get the values from the results table above over and over again. In fact, you would get the values above for every single row returned by
Select *
From user_basics u
INNER JOIN Pod_Membership PM ON U.UserID = PM.UserID
WHERE PM.PodID = 164
If this query returns 12 results, you would end up with 12 Alan Turings, 12 Catherine Zeta Jones, 12 Trues, and 12 MNs. If this is not the result you are looking for, you will need to give the inner join on f an ON condition that actually connects its results with the other tables.
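The multiplication effect is easy to reproduce. Here is a small sketch using Python's sqlite3 module with the made-up field data from above and three hypothetical pod members (the user rows are invented purely for the demonstration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_basics (UserID INTEGER PRIMARY KEY, Ctime TEXT);
    CREATE TABLE Pod_Membership (UserID INTEGER, PodID INTEGER);
    CREATE TABLE API_Objects_Fields (Field_ID INTEGER PRIMARY KEY,
                                     Table_Field TEXT, Field TEXT);
    -- Three hypothetical users, all members of pod 164.
    INSERT INTO user_basics VALUES (1, '2020-01-01'), (2, '2020-01-02'), (3, '2020-01-03');
    INSERT INTO Pod_Membership VALUES (1, 164), (2, 164), (3, 164);
    INSERT INTO API_Objects_Fields VALUES
        (1, 'Alan Turing', 'fullname'),
        (2, 'Catherine Zeta Jones', 'fullname'),
        (3, 'True', 'confirmed'),
        (4, 'MN', 'location_short'),
        (5, '123-456-7890', 'phone_number');
""")

# Number of rows produced by the user/pod part of the query on its own.
base_count = conn.execute("""
    SELECT COUNT(*)
    FROM user_basics U
    INNER JOIN Pod_Membership PM ON U.UserID = PM.UserID
    WHERE PM.PodID = 164
""").fetchone()[0]

# Joining the field list without any linking condition pairs every field row
# with every user row (written as CROSS JOIN to make that explicit).
joined = conn.execute("""
    SELECT f.Table_Field, f.Field
    FROM user_basics U
    INNER JOIN Pod_Membership PM ON U.UserID = PM.UserID
    CROSS JOIN (
        SELECT Table_Field, Field
        FROM API_Objects_Fields
        WHERE Field IN ('fullname', 'confirmed', 'primary_email', 'location_short')
    ) f
    WHERE PM.PodID = 164
""").fetchall()

turing_count = sum(1 for table_field, _ in joined if table_field == "Alan Turing")
print(base_count, len(joined), turing_count)
```

With 3 matching users and 4 matching field rows the join yields 3 x 4 rows, and each field value is repeated once per user.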

SQL Query to remove cyclic redundancy

I have a table that looks like this:
Column A | Column B | Counter
---------------------------------------------
A | B | 53
B | C | 23
A | D | 11
C | B | 22
I need to remove the last row because it's cyclic to the second row. Can't seem to figure out how to do it.
EDIT
There is an indexed date field. This is for Sankey diagram. The data in the sample table is actually the result of a query. The underlying table has:
date | source node | target node | path count
The query to build the table is:
SELECT source_node, target_node, COUNT(1)
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
GROUP BY source_node, target_node
In the sample, the last row C to B is going backwards and I need to ignore it or the Sankey won't display. I need to only show forward path.
Removing all edges from your graph where the tuple (source_node, target_node) is not ordered alphabetically and the symmetric row exists should give you what you want:
DELETE
FROM sankey_table t1
WHERE source_node > target_node
AND EXISTS (
SELECT NULL from sankey_table t2
WHERE t2.source_node = t1.target_node
AND t2.target_node = t1.source_node)
If you don't want to DELETE them, just use this WHERE clause in your query for generating the input for the diagram.
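As a rough illustration of the non-destructive variant, the sketch below applies the inverse of that condition as a filter. It uses Python's sqlite3 module rather than Oracle, and it runs against a hypothetical table named edges holding the aggregated rows from the question, so treat the names as assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "edges" is a hypothetical name for the aggregated result shown in the question.
conn.executescript("""
    CREATE TABLE edges (source_node TEXT, target_node TEXT, counter INTEGER);
    INSERT INTO edges VALUES
        ('A', 'B', 53),
        ('B', 'C', 23),
        ('A', 'D', 11),
        ('C', 'B', 22);
""")

# Keep every row except those that are "backwards" (source > target)
# AND have a matching forward row.
forward = conn.execute("""
    SELECT source_node, target_node, counter
    FROM edges t1
    WHERE NOT (
        source_node > target_node
        AND EXISTS (
            SELECT NULL FROM edges t2
            WHERE t2.source_node = t1.target_node
              AND t2.target_node = t1.source_node
        )
    )
    ORDER BY source_node, target_node
""").fetchall()

print(forward)
```

A backwards row with no forward counterpart is kept, which matches the DELETE above: only the second half of a symmetric pair is dropped.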
If you can adjust how your table is populated, you can change the query you're using to only retrieve the values for the first direction (for that date) in the first place, with a little bit of analytic manipulation:
SELECT source_node, target_node, counter FROM (
SELECT source_node,
target_node,
COUNT(*) OVER (PARTITION BY source_node, target_node) AS counter,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
)
WHERE rnk = 1;
The inner query gets the same data you collect now but adds a ranking column, which will be 1 for the first row for any source/target pair in any order for a given day. The outer query then just ignores everything else.
This might be a candidate for a materialised view if you're truncating and repopulating it daily.
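For anyone who wants to experiment with the ranking idea outside Oracle, here is an approximate sketch in Python's sqlite3 module (SQLite 3.25 or later). SQLite has no GREATEST/LEAST or TRUNC, so the two-argument MAX()/MIN() and date() are swapped in as stand-ins; the timestamps are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical underlying rows: B -> C is seen first, C -> B appears later.
conn.executescript("""
    CREATE TABLE sankey_table (data_date TEXT, source_node TEXT, target_node TEXT);
    INSERT INTO sankey_table VALUES
        ('2013-08-19 08:00:00', 'A', 'B'),
        ('2013-08-19 09:00:00', 'B', 'C'),
        ('2013-08-19 10:00:00', 'C', 'B'),
        ('2013-08-19 11:00:00', 'B', 'C'),
        ('2013-08-19 12:00:00', 'A', 'D');
""")

first_direction = conn.execute("""
    SELECT source_node, target_node, counter FROM (
        SELECT source_node,
               target_node,
               COUNT(*) OVER (PARTITION BY source_node, target_node) AS counter,
               -- Two-argument MAX/MIN are scalar functions in SQLite and play
               -- the role of GREATEST/LEAST: they name the unordered pair.
               RANK() OVER (PARTITION BY MAX(source_node, target_node),
                                         MIN(source_node, target_node),
                                         date(data_date)
                            ORDER BY data_date) AS rnk
        FROM sankey_table
        WHERE date(data_date) = '2013-08-19'
    )
    WHERE rnk = 1
    ORDER BY source_node, target_node
""").fetchall()

print(first_direction)
```

Only the earliest row of each unordered pair gets rank 1, so the direction seen first wins, while the counter still reflects every row in that direction.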
If you can't change your intermediate table but can still see the underlying table you could join back to it using the same kind of idea; assuming the table you're querying from is called sankey_agg_table:
SELECT sat.source_node, sat.target_node, sat.counter
FROM sankey_agg_table sat
JOIN (SELECT source_node, target_node,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table) st
ON st.source_node = sat.source_node
AND st.target_node = sat.target_node
AND st.rnk = 1;
SQL Fiddle demos.
DELETE FROM yourTable
where [Column A]='C'
given that these are all your rows
EDIT
I would recommend that you clean up your source data if you can, i.e. delete the rows that you call backwards, if those rows are incorrect as you state in your comments.

Select distinct values for a particular column choosing arbitrarily from duplicates

I have health data relating to deaths. An individual should die at most once. In the database they sometimes don't, probably because causes of death were changed but the original entry was not deleted. I don't really understand how this was allowed to happen, but it has. So, as a made-up example, I have:
Row_number | Individual_ID | Cause_of_death | Date_of_death
------------+---------------+-----------------------+---------------
1 | 1 | Stroke | 3 march 2008
2 | 2 | Myocardial infarction | 1 jan 2009
3 | 2 | Pulmonary Embolus | 1 jan 2009
I want each individual to have only one cause of death.
In the example, I want a query that returns row 1 and either row 2 or row 3 (not both). I have to make an arbitrary choice between rows 2 and 3 because there is no timestamp in any of the fields that can be used to determine which is the revision; it's not ideal but is unavoidable.
I can't make the SQL work to do this. I've tried inner joining distinct Individual_ID to the other fields, but this still gives all the rows. I've tried adding a 'having count(Individual_ID) = 1' clause with it. This leaves out people with more than one cause of death completely. Suggestions on the internet seem to be based on using a timestamped field to choose the most recent, but I don't have that.
IBM DB2. Windows XP. Any thoughts gratefully received.
Have you tried using MIN (or MAX) against the cause of death (and the date of death, if they died on two different dates)?
SELECT Individual_ID, MIN(Cause_of_death), MIN(Date_of_death)
FROM deaths
GROUP BY Individual_ID
I don't know DB2 so I'll answer in general. There are two main approaches:
select *
from T
join (
select keys, min(ID) as MinID
from T
group by keys
) on T.ID = MinID
And
select *, row_number() over (partition by keys) as r
from T
where r = 1
Both return a row for every "key", whether it is duplicated or not, but they return only one row per "key".
Notice, that both statements are pseudo-SQL.
The row_number() approach is probably preferable from a performance standpoint. Here is usr's example, in DB2 syntax:
select * from (
select T.*, row_number() over (partition by Individual_ID) as r
from T
)
where r=1;
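Both approaches can be tried out quickly without a DB2 instance. The sketch below uses Python's sqlite3 module (SQLite 3.25 or later for row_number()) on the sample rows from the question; the table name deaths and the column name Row_id are assumptions made for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample rows from the question; "deaths" and "Row_id" are hypothetical names.
conn.executescript("""
    CREATE TABLE deaths (Row_id INTEGER PRIMARY KEY, Individual_ID INTEGER,
                         Cause_of_death TEXT, Date_of_death TEXT);
    INSERT INTO deaths VALUES
        (1, 1, 'Stroke', '2008-03-03'),
        (2, 2, 'Myocardial infarction', '2009-01-01'),
        (3, 2, 'Pulmonary Embolus', '2009-01-01');
""")

# Approach 1: row_number() with no ORDER BY picks an arbitrary row per individual.
arbitrary = conn.execute("""
    SELECT Individual_ID, Cause_of_death FROM (
        SELECT d.*, ROW_NUMBER() OVER (PARTITION BY Individual_ID) AS r
        FROM deaths d
    ) ranked
    WHERE r = 1
    ORDER BY Individual_ID
""").fetchall()

# Approach 2: join back on the smallest row id per individual (deterministic).
by_min_id = conn.execute("""
    SELECT d.Individual_ID, d.Cause_of_death
    FROM deaths d
    INNER JOIN (
        SELECT Individual_ID, MIN(Row_id) AS MinID
        FROM deaths
        GROUP BY Individual_ID
    ) m ON d.Row_id = m.MinID
    ORDER BY d.Individual_ID
""").fetchall()

print(arbitrary)
print(by_min_id)
```

The second form always keeps the lowest row id, so it gives the same answer on every run; the first leaves the choice to the database.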

in-house records over outside records

I am stuck with this problem.
I have duplicates in my query because some of the records are in-house and outside simultaneously. I prefer in-house over outside, but the outside record should be kept when there is no in-house entry.
Select id, date, location from prod
id date location
----------
2 01/01/2012 in-house
2 05/01/2012 outside <- in this situation i want to keep just in-house
id date location
----------
4 01/01/2012 in-house
5 03/01/2012 outside <- in this situation I want to keep both since there is no in-house entry for id=5, therefore I have just the info from outside
Could someone help?
One way to do this is to do a full outer join from your table to itself and then use coalesce.
Select
COALESCE(Inside.Id, outside.id) Id,
COALESCE(Inside.date, outside.date) Date,
COALESCE(Inside.location, outside.location) Location
From
prod Inside
FULL OUTER JOIN prod Outside
ON Inside.id = Outside.iD
and Inside.location <> Outside.Location
Where
(Inside.Location = 'in-house'
or
Inside.Location is null)
and
(outside.Location = 'outside'
or
outside.Location is null)
DEMO
Notes:
If your fields can be nullable you may want to use a Case statement instead of coalesce and use the ID field to determine which table to use. Using the date as an example
CASE WHEN Inside.Id is not null THEN Inside.date ELSE outside.date END date
As Dems noted this also assumes that {id, location} is unique.
UPDATE
Since you're using SQL Server, {ID, Location} isn't unique, and you want the max date value while always choosing in-house over outside, you can use ROW_NUMBER/WHERE RowNumber = 1 effectively here, by ordering first on location and then by date.
WITH cte
AS (SELECT Row_number() OVER ( partition BY ID
ORDER BY CASE LOCATION WHEN 'in-house' THEN 0
WHEN 'outside' THEN 1 END,
DATE DESC) rn,
ID,
Date,
Location
FROM prod)
SELECT ID,
Date,
Location
FROM cte
WHERE rn = 1
Demo
Note: We didn't have to use a case statement but I wanted the mapping to be explicit.
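Here is a small runnable sketch of the same ROW_NUMBER query, using Python's sqlite3 module (SQLite 3.25 or later) instead of SQL Server. The dates are rewritten as ISO strings so they sort correctly as text, and an extra in-house row for id 2 is invented to show the non-unique {ID, Location} case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample rows loosely based on the question; ISO dates are an assumption.
conn.executescript("""
    CREATE TABLE prod (id INTEGER, date TEXT, location TEXT);
    INSERT INTO prod VALUES
        (2, '2012-01-01', 'in-house'),
        (2, '2012-02-01', 'in-house'),
        (2, '2012-05-01', 'outside'),
        (4, '2012-01-01', 'in-house'),
        (5, '2012-03-01', 'outside');
""")

kept = conn.execute("""
    WITH cte AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY id
                   -- in-house sorts before outside, then newest date first
                   ORDER BY CASE location WHEN 'in-house' THEN 0
                                          WHEN 'outside' THEN 1 END,
                            date DESC) AS rn,
               id, date, location
        FROM prod
    )
    SELECT id, date, location
    FROM cte
    WHERE rn = 1
    ORDER BY id
""").fetchall()

print(kept)
```

For id 2 the newest in-house row wins even though the outside row is more recent, while id 5 falls back to its only (outside) row.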

How to group by a column

Hi, I know how to use the GROUP BY clause in SQL. I am not sure how to explain this, so I'll draw some charts. Here is my original data:
Name Location
----------------------
user1 1
user1 9
user1 3
user2 1
user2 10
user3 97
Here is the output I need
Name Location
----------------------
user1 1
9
3
user2 1
10
user3 97
Is this even possible?
The normal method for this is to handle it in the presentation layer, not the database layer.
Reasons:
The Name field is a property of that data row
If you leave the Name out, how do you know what Location goes with which name?
You are implicitly relying on the order of the data, which in SQL is a very bad practice (since there is no inherent ordering to the returned data)
Any solution will need to involve a cursor or a loop, which is not what SQL is optimized for - it likes working in SETS not on individual rows
Hope this helps
SELECT A.FINAL_NAME, A.LOCATION
FROM (SELECT DISTINCT DECODE((LAG(YT.NAME, 1) OVER(ORDER BY YT.NAME)),
YT.NAME,
NULL,
YT.NAME) AS FINAL_NAME,
YT.NAME,
YT.LOCATION
FROM YOUR_TABLE_7 YT) A
As Jirka correctly pointed out, I was using the outer select, DISTINCT, and the raw Name unnecessarily. My mistake was that, because I used DISTINCT, I got the result sorted like this:
1 1
2 user2 1
3 user3 97
4 user1 1
5 3
6 9
7 10
I wanted to avoid output like this.
Hence I added the raw id and outer select
However, removing the DISTINCT solves the problem.
Hence only this much is enough:
SELECT DECODE((LAG(YT.NAME, 1) OVER(ORDER BY YT.NAME)),
YT.NAME,
NULL,
YT.NAME) AS FINAL_NAME,
YT.LOCATION
FROM YOUR_TABLE_7 YT
Thanks Jirka
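The LAG idea is not Oracle-specific. As a rough sketch, here it is in Python's sqlite3 module (SQLite 3.25 or later), with CASE standing in for DECODE. The id column is an assumption added so that the row order within each name is well defined; the question's data has no such column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Question's data plus a hypothetical id column that fixes the row order.
conn.executescript("""
    CREATE TABLE your_table (id INTEGER PRIMARY KEY, name TEXT, location INTEGER);
    INSERT INTO your_table (name, location) VALUES
        ('user1', 1), ('user1', 9), ('user1', 3),
        ('user2', 1), ('user2', 10),
        ('user3', 97);
""")

report = conn.execute("""
    SELECT CASE
               -- Blank the name when the previous row has the same name.
               WHEN LAG(name) OVER (ORDER BY name, id) = name THEN NULL
               ELSE name
           END AS final_name,
           location
    FROM your_table
    ORDER BY name, id
""").fetchall()

for final_name, location in report:
    print(final_name or "", location)
```

The outer ORDER BY uses the same keys as the window so that the blanked rows stay directly under the row that carries their name.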
If you're using straight SQL*Plus to make your report (don't laugh, you can do some pretty cool stuff with it), you can do this with the BREAK command:
SQL> break on name
SQL> WITH q AS (
SELECT 'user1' NAME, 1 LOCATION FROM dual
UNION ALL
SELECT 'user1', 9 FROM dual
UNION ALL
SELECT 'user1', 3 FROM dual
UNION ALL
SELECT 'user2', 1 FROM dual
UNION ALL
SELECT 'user2', 10 FROM dual
UNION ALL
SELECT 'user3', 97 FROM dual
)
SELECT NAME,LOCATION
FROM q
ORDER BY name;
NAME LOCATION
----- ----------
user1 1
9
3
user2 1
10
user3 97
6 rows selected.
SQL>
I cannot but agree with the other commenters that this kind of problem does not look like it should ever be solved using SQL, but let us face it anyway.
SELECT
CASE WHEN MIN(preceding.id) IS NULL THEN main.name ELSE NULL END AS name,
main.location
FROM mytable main LEFT JOIN mytable preceding
ON main.name = preceding.name AND preceding.id < main.id
GROUP BY main.id, main.name, main.location
ORDER BY main.id
The GROUP BY clause is not responsible for the grouping job, at least not directly. In the first approximation, an outer join to the same table (the LEFT JOIN above) can be used to determine on which row a particular value occurs for the first time. This is what we are after. This assumes that there are some unique id values that make it possible to arbitrarily order all the records. (The ORDER BY clause does NOT do this; it orders the output, not the input of the whole computation, but it is still necessary to make sure that the output is presented correctly, because the remaining SQL does not imply any particular order of processing.)
As you can see, there is still a GROUP BY clause in the SQL, but with a perhaps unexpected purpose. Its job is to "undo" a side effect of the LEFT JOIN, which is duplication of all main records that have many "preceding" ( = successfully joined) records.
This is quite normal with GROUP BY. The typical effect of a GROUP BY clause is a reduction of the number of records; and impossibility to query or test columns NOT listed in the GROUP BY clause, except through aggregate functions like COUNT, MIN, MAX, or SUM. This is because these columns really represent "groups of values" due to the GROUP BY, not just specific values.
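To make the self-join idea concrete, here is a minimal sketch that runs in Python's sqlite3 module. The table name mytable and its unique id column follow the assumption stated above; the aliases cur and prev are chosen here only for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Question's data with the assumed unique id column.
conn.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT, location INTEGER);
    INSERT INTO mytable (name, location) VALUES
        ('user1', 1), ('user1', 9), ('user1', 3),
        ('user2', 1), ('user2', 10),
        ('user3', 97);
""")

report = conn.execute("""
    SELECT CASE
               -- No earlier row with the same name means this is the first one.
               WHEN MIN(prev.id) IS NULL THEN cur.name
               ELSE NULL
           END AS name,
           cur.location
    FROM mytable cur
    LEFT JOIN mytable prev
        ON cur.name = prev.name AND prev.id < cur.id
    -- GROUP BY collapses the duplicates created by multiple earlier matches.
    GROUP BY cur.id, cur.name, cur.location
    ORDER BY cur.id
""").fetchall()

print(report)
```

The third user1 row joins to two earlier rows, and the GROUP BY folds those two copies back into one output row.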
If you are using SQL*Plus, use the BREAK command. In this case, break on NAME.
If you are using another reporting tool, you may be able to compare the "name" field to the previous record and suppress printing when they are equal.
In MySQL (through 5.7), if you use GROUP BY, output rows are sorted according to the GROUP BY columns as if you had an ORDER BY for the same columns. To avoid the overhead of sorting that GROUP BY produces, add ORDER BY NULL:
SELECT a, COUNT(b) FROM test_table GROUP BY a ORDER BY NULL;
Relying on implicit GROUP BY sorting in MySQL 5.6 is deprecated. To achieve a specific sort order of grouped results, it is preferable to use an explicit ORDER BY clause. GROUP BY sorting is a MySQL extension that may change in a future release; for example, to make it possible for the optimizer to order groupings in whatever manner it deems most efficient and to avoid the sorting overhead.
For full information - http://academy.comingweek.com/sql-groupby-clause/
SQL GROUP BY STATEMENT
SQL GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups.
Syntax:
SELECT column_nm, aggregate_function(column_nm) FROM table_nm WHERE column_nm operator value GROUP BY column_nm;
Example:
To understand the GROUP BY clause, refer to the sample database. The "orders" table has these fields:
| EMPORD_ID | employee1ID | customerID | shippers_ID |
The "shipper" table has these fields:
| shippers_ID | shippers_Name |
The "table_emp1" table has these fields:
| employee1ID | first1_nm | last1_nm |
Example:
To find the number of orders sent by each shipper:
SELECT shipper.shippers_Name, COUNT(orders.EMPORD_ID) AS No_of_orders FROM orders LEFT JOIN shipper ON orders.shippers_ID = shipper.shippers_ID GROUP BY shippers_Name;
| shippers_Name | No_of_orders |
Example:
To use the GROUP BY statement on more than one column:
SELECT shipper.shippers_Name, table_emp1.last1_nm, COUNT(orders.EMPORD_ID) AS No_of_orders FROM ((orders INNER JOIN shipper ON orders.shippers_ID = shipper.shippers_ID) INNER JOIN table_emp1 ON orders.employee1ID = table_emp1.employee1ID)
GROUP BY shippers_Name, last1_nm;
| shippers_Name | last1_nm | No_of_orders |