Unable to get duplicate records from table - sql

I have a table with the structure given below:
Each User_ID has values for its items within a specific time interval. An item's value can be text or an integer, depending on the item.
I want to check whether two or more UserIds have the same values, meaning their items are the same, with the same values, in the same time interval.
As in the table above, UserId 213456 and UserId 213458 have the same records.
I tried using a cursor and loops, but it takes too long. My table has more than 50 million UserIds. Is there an efficient way to do this?
I also tried GROUP BY with subqueries, but none of my attempts produced a working query.
I created the following query based on the question "How do I find duplicate values in a table in Oracle?":
select t1.USERID, count(t1.USERID)
from USERS_ITEM_VAL t1
where exists ( select *
from USERS_ITEM_VAL t2
where t1.rowid <> t2.rowid and
t2.ITEMID = t1.ITEMID and
t2.TEXT_VALUE = t1.TEXT_VALUE and
--t2.INTEGER_VALUE = t1.INTEGER_VALUE and
t2.INIT_DATE = t1.INIT_DATE and
t2.FINAL_DATE = t1.FINAL_DATE )
group by t1.USERID having count(t1.USERID) > 1 order by count(t1.USERID);
The problem is that it works when the INTEGER_VALUE column is excluded, but returns no output when I include INTEGER_VALUE in the join, even though the data in the INTEGER_VALUE column is the same.
Here is the structure of my table:
USERID - NUMBER
ITEMID - NUMBER
TEXT_VALUE - VARCHAR2(500)
INTEGER_VALUE - NUMBER
INIT_DATE - DATE
FINAL_DATE - DATE

One way to approach this uses a self join. The idea is to count the number of items that two users have in common (taking the date columns into account). Then compare this to the number of items that each has:
with t as (
select v.*, count(*) over (partition by userid) as numitems
from USERS_ITEM_VAL v
)
select t1.userid, t2.userid
from t t1 join
t t2
on t1.userid < t2.userid and
t1.itemid = t2.itemid and
t1.init_date = t2.init_date and
t1.final_date = t2.final_date and
t1.numitems = t2.numitems
group by t1.userid, t2.userid, t1.numitems
having count(*) = t1.numitems;
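The self-join idea above can be tried on a small in-memory table. This is only a sketch under assumptions: it uses Python's sqlite3 module (SQLite 3.25 or later for window functions) rather than Oracle, the sample rows are made up for illustration, and it also compares the two value columns with SQLite's NULL-safe `IS` operator, which the query above does not do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users_item_val (
    userid INTEGER, itemid INTEGER,
    text_value TEXT, integer_value INTEGER,
    init_date TEXT, final_date TEXT
);
-- Hypothetical sample rows: 213456 and 213458 are identical, 213457 differs on item 2.
INSERT INTO users_item_val VALUES
    (213456, 1, 'abc', NULL, '2016-01-01', '2016-02-01'),
    (213456, 2, NULL,  7,    '2016-01-01', '2016-02-01'),
    (213457, 1, 'abc', NULL, '2016-01-01', '2016-02-01'),
    (213457, 2, NULL,  9,    '2016-01-01', '2016-02-01'),
    (213458, 1, 'abc', NULL, '2016-01-01', '2016-02-01'),
    (213458, 2, NULL,  7,    '2016-01-01', '2016-02-01');
""")

# Count the rows two users share and keep pairs where that count
# equals the number of rows each user has.
pairs = conn.execute("""
WITH t AS (
    SELECT v.*, COUNT(*) OVER (PARTITION BY userid) AS numitems
    FROM users_item_val v
)
SELECT t1.userid, t2.userid
FROM t t1
JOIN t t2
  ON t1.userid < t2.userid
 AND t1.itemid = t2.itemid
 AND t1.text_value IS t2.text_value        -- NULL-safe comparison in SQLite
 AND t1.integer_value IS t2.integer_value  -- NULL-safe comparison in SQLite
 AND t1.init_date = t2.init_date
 AND t1.final_date = t2.final_date
 AND t1.numitems = t2.numitems
GROUP BY t1.userid, t2.userid, t1.numitems
HAVING COUNT(*) = t1.numitems
""").fetchall()

print(pairs)
```

On a 50-million-user table the self join would still need supporting indexes, so treat this as an illustration of the logic rather than a tuned solution.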

The reason your query failed is that either text_value or integer_value will be NULL in every row. For this reason, it's not possible to use an equality predicate in the self-join without using NVL functions to plug the NULL values.
However, below is a query that uses an analytic function to accomplish the goal:
Select * From (
Select t.*, Count(*) Over (Partition By t.itemId,
t.text_value,
t.integer_value,
t.init_date,
t.final_date) as Cnt
From USERS_ITEM_VAL t)
Where cnt > 1;
The query returns all rows where multiple records have identical values in the five columns of the Partition By clause.
A benefit of this technique over the self-join approach is that the table is scanned only once, whereas it would be scanned twice with a self join. This could result in better performance if the table is large.
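The NULL behaviour described in this answer can be reproduced with a tiny sketch. It assumes Python's sqlite3 module instead of Oracle, a made-up two-row table `v`, standard `COALESCE` in place of `NVL`, and a sentinel of -1 that is assumed never to occur as a real integer_value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical rows: same text value, integer_value is NULL in both.
CREATE TABLE v (userid INTEGER, text_value TEXT, integer_value INTEGER);
INSERT INTO v VALUES (1, 'abc', NULL), (2, 'abc', NULL);
""")

def matches(condition):
    """Count pairs of distinct users whose rows satisfy the join condition."""
    sql = "SELECT COUNT(*) FROM v a JOIN v b ON a.userid < b.userid AND " + condition
    return conn.execute(sql).fetchone()[0]

# NULL = NULL evaluates to NULL, not true, so the pair is rejected.
plain = matches(
    "a.text_value = b.text_value AND a.integer_value = b.integer_value"
)

# Plugging the NULLs with a sentinel makes the comparison succeed.
plugged = matches(
    "a.text_value = b.text_value AND "
    "COALESCE(a.integer_value, -1) = COALESCE(b.integer_value, -1)"
)

print(plain, plugged)
```

The analytic query avoids the issue because PARTITION BY groups NULLs together.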

Related

Foreach/per-item iteration in SQL

I'm new to SQL and I think I must just be missing something, but I can't find any resources on how to do the following:
I have a table with three relevant columns: id, creation_date, latest_id. latest_id refers to the id of another entry (a newer revision).
For each entry, I would like to find the min creation date of all entries with latest_id = this.id. How do I perform this type of iteration in SQL / reference the value of the current row in an iteration?
select
t.id, min(t2.creation_date) as min_creation_date
from
mytable t
left join
mytable t2 on t2.latest_id = t.id
group by
t.id
You could solve this with a loop, but that is nowhere close to the best strategy. Instead, try this:
SELECT tf.id, tf.Creation_Date
FROM
(
SELECT t0.id, t1.Creation_Date,
row_number() over (partition by t0.id order by t1.creation_date) rn
FROM [MyTable] t0 -- table prime
INNER JOIN [MyTable] t1 ON t1.latest_id = t0.id -- table 1
) tf -- table final
WHERE tf.rn = 1
This connects the id to the latest_id by joining the table to itself. Then it uses a windowing function to help identify the smallest Creation_Date for each match.
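The two queries in this thread differ in which ids they return, which a small sketch can show. It assumes Python's sqlite3 module (SQLite 3.25 or later for ROW_NUMBER) and four made-up rows in which entries 1 and 2 point at entry 3 as their latest revision.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER PRIMARY KEY, creation_date TEXT, latest_id INTEGER);
-- Hypothetical data: entries 1 and 2 are older revisions of entry 3.
INSERT INTO mytable VALUES
    (1, '2020-01-01', 3),
    (2, '2020-02-01', 3),
    (3, '2020-03-01', NULL),
    (4, '2020-04-01', NULL);
""")

# LEFT JOIN + GROUP BY: every id appears, with NULL when nothing points at it.
left_join_rows = conn.execute("""
SELECT t.id, MIN(t2.creation_date) AS min_creation_date
FROM mytable t
LEFT JOIN mytable t2 ON t2.latest_id = t.id
GROUP BY t.id
ORDER BY t.id
""").fetchall()

# INNER JOIN + ROW_NUMBER: only ids that have at least one older revision.
window_rows = conn.execute("""
SELECT tf.id, tf.creation_date
FROM (
    SELECT t0.id, t1.creation_date,
           ROW_NUMBER() OVER (PARTITION BY t0.id ORDER BY t1.creation_date) AS rn
    FROM mytable t0
    JOIN mytable t1 ON t1.latest_id = t0.id
) tf
WHERE tf.rn = 1
""").fetchall()

print(left_join_rows)
print(window_rows)
```

Which form is preferable depends on whether entries without older revisions should appear in the output.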

How to return column changes in a column [duplicate]

I need to calculate the difference of a column between two lines of a table. Is there any way I can do this directly in SQL? I'm using Microsoft SQL Server 2008.
I'm looking for something like this:
SELECT value - (previous.value) FROM table
Imagine that the "previous" variable references the last selected row. Of course, with a select like that I will end up with n-1 rows selected from a table with n rows; that's not a problem, it is actually exactly what I need.
Is that possible in some way?
Use the lag function:
SELECT value - lag(value) OVER (ORDER BY Id) FROM table
Sequences used for Ids can skip values, so Id-1 does not always work.
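The point about skipped ids can be checked with a short sketch. It assumes Python's sqlite3 module (SQLite 3.25 or later for LAG) rather than SQL Server, and a made-up `readings` table whose ids have gaps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical data: ids 3, 4, 6, 7 and 8 are missing.
CREATE TABLE readings (id INTEGER PRIMARY KEY, value INTEGER);
INSERT INTO readings VALUES (1, 10), (2, 15), (5, 12), (9, 20);
""")

# LAG looks at the previous row in ORDER BY order, regardless of gaps.
diffs = conn.execute("""
SELECT id, value - LAG(value) OVER (ORDER BY id) AS diff
FROM readings
ORDER BY id
""").fetchall()

# Joining on id - 1 only finds a partner where the ids are consecutive.
joined = conn.execute("""
SELECT t1.id, t1.value - t2.value AS diff
FROM readings t1
JOIN readings t2 ON t2.id = t1.id - 1
ORDER BY t1.id
""").fetchall()

print(diffs)   # first row has no predecessor, so its diff is NULL (None)
print(joined)
```

Filtering out the row whose diff is NULL leaves the n-1 rows the question asks for.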
SQL has no built in notion of order, so you need to order by some column for this to be meaningful. Something like this:
select t1.value - t2.value from table t1, table t2
where t1.primaryKey = t2.primaryKey - 1
If you know how to order things but not how to get the previous value given the current one (EG, you want to order alphabetically) then I don't know of a way to do that in standard SQL, but most SQL implementations will have extensions to do it.
Here is a way for SQL server that works if you can order rows such that each one is distinct:
select rank() OVER (ORDER BY id) as 'Rank', value into temp1 from t
select t1.value - t2.value from temp1 t1, temp1 t2
where t1.Rank = t2.Rank - 1
drop table temp1
If you need to break ties, you can add as many columns as necessary to the ORDER BY.
WITH CTE AS (
SELECT
rownum = ROW_NUMBER() OVER (ORDER BY columns_to_order_by),
value
FROM table
)
SELECT
cur.value - prev.value
FROM CTE cur
INNER JOIN CTE prev on prev.rownum = cur.rownum - 1
Oracle, PostgreSQL, SQL Server and many more RDBMS engines have analytic functions called LAG and LEAD that do this very thing.
In SQL Server prior to 2012 you'd need to do the following:
SELECT value - (
SELECT TOP 1 value
FROM mytable m2
WHERE m2.col1 < m1.col1 OR (m2.col1 = m1.col1 AND m2.pk < m1.pk)
ORDER BY
col1, pk
)
FROM mytable m1
ORDER BY
col1, pk
Here, COL1 is the column you are ordering by.
Having an index on (COL1, PK) will greatly improve this query.
LEFT JOIN the table to itself, with the join condition worked out so the row matched in the joined version of the table is one row previous, for your particular definition of "previous".
Update: At first I was thinking you would want to keep all rows, with NULLs where there was no previous row. Reading it again, you just want those rows culled, so you should use an inner join rather than a left join.
Update:
Newer versions of Sql Server also have the LAG and LEAD Windowing functions that can be used for this, too.
select t2.col from (
select col,MAX(ID) id from
(
select ROW_NUMBER() over(PARTITION by col order by col) id ,col from testtab t1) as t1
group by col) as t2
The selected answer will only work if there are no gaps in the sequence. However if you are using an autogenerated id, there are likely to be gaps in the sequence due to inserts that were rolled back.
This method should work if you have gaps
create table #temp (value int, primaryKey int, tempid int identity)
insert into #temp (value, primaryKey) select value, primarykey from mytable order by primarykey
select t1.value - t2.value from #temp t1
join #temp t2
on t1.tempid = t2.tempid - 1
Another way to refer to the previous row in an SQL query is to use a recursive common table expression (CTE):
CREATE TABLE t (counter INTEGER);
INSERT INTO t VALUES (1),(2),(3),(4),(5);
WITH cte(counter, previous, difference) AS (
-- Anchor query
SELECT MIN(counter), 0, MIN(counter)
FROM t
UNION ALL
-- Recursive query
SELECT t.counter, cte.counter, t.counter - cte.counter
FROM t JOIN cte ON cte.counter = t.counter - 1
)
SELECT counter, previous, difference
FROM cte
ORDER BY counter;
Result:
counter|previous|difference
---------------------------
1|0|1
2|1|1
3|2|1
4|3|1
5|4|1
The anchor query generates the first row of the common table expression cte where it sets cte.counter to column t.counter in the first row of table t, cte.previous to 0, and cte.difference to the first row of t.counter.
The recursive query joins each row of table t to the row of cte generated for the preceding counter value. In its select list, the first column (the new cte.counter) is t.counter from the current row of t, the second (cte.previous) is cte.counter from the previous row of cte, and the third (cte.difference) is t.counter - cte.counter, the difference between the two.
Note that a recursive CTE is more flexible than the LAG and LEAD functions because a row can refer to any arbitrary result of a previous row. (A recursive function or process is one where the input of the process is the output of the previous iteration of that process, except the first input which is a constant.)
I tested this query at SQLite Online.
You can use the following function to get the current row's value and the previous row's value:
SELECT value,
min(value) over (order by id rows between 1 preceding and 1 preceding) as value_prev
FROM table
Then you can just select value - value_prev from that select to get your answer.

SQL - Summarize column with maximum date value and other fields

I have a table with the following fields:
Id|Date|Name
---------------
A|2019-04-24|"VALUE1"
A|2019-04-23|"VALUE2"
A|2019-06-11|"VALUE3"
A|2019-06-12|"VALUE4"
B|2019-05-21|"VALUE5"
B|2019-05-22|"VALUE6"
B|2019-03-13|"VALUE7"
C|2019-01-03|"VALUE8"
I would like to get one line per Id having the info of the maximum date line. This would be the output:
Id|Date|Name
---------------
A|2019-06-12|"VALUE4"
B|2019-05-22|"VALUE6"
C|2019-01-03|"VALUE8"
I have managed, through a GROUP BY, to get the Id and the MAX Date, but not the value associated with that date.
What I am working on now is an inner join of that result with the input table on date and id, but I am not able to join on two fields.
Is there any way to bring the value field related to the max date into the result of the GROUP BY?
Otherwise, how could I join those two tables on two different fields?
Any suggestions?
Thank you so much!!
You can use a correlated subquery:
select t.*
from table t
where t.date = (select max(t1.date) from table t1 where t1.id = t.id);
However, most DBMSs support analytic functions, so you can use:
select t.*
from (select t.*, row_number() over (partition by t.id order by t.date desc) as seq
from table t
) t
where seq = 1;
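Both approaches can be run against the sample data from the question. The sketch below assumes Python's sqlite3 module (SQLite 3.25 or later for ROW_NUMBER), a hypothetical table name `items`, and dates stored as ISO text so that string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id TEXT, date TEXT, name TEXT);
INSERT INTO items VALUES
    ('A', '2019-04-24', 'VALUE1'),
    ('A', '2019-04-23', 'VALUE2'),
    ('A', '2019-06-11', 'VALUE3'),
    ('A', '2019-06-12', 'VALUE4'),
    ('B', '2019-05-21', 'VALUE5'),
    ('B', '2019-05-22', 'VALUE6'),
    ('B', '2019-03-13', 'VALUE7'),
    ('C', '2019-01-03', 'VALUE8');
""")

# Correlated subquery: keep rows whose date is the maximum for their id.
by_subquery = conn.execute("""
SELECT t.id, t.date, t.name
FROM items t
WHERE t.date = (SELECT MAX(t1.date) FROM items t1 WHERE t1.id = t.id)
ORDER BY t.id
""").fetchall()

# Window function: number rows per id from newest to oldest, keep the first.
by_window = conn.execute("""
SELECT ranked.id, ranked.date, ranked.name
FROM (
    SELECT i.*, ROW_NUMBER() OVER (PARTITION BY i.id ORDER BY i.date DESC) AS seq
    FROM items i
) ranked
WHERE ranked.seq = 1
ORDER BY ranked.id
""").fetchall()

print(by_window)
```

One difference worth knowing: if two rows of the same id share the maximum date, the correlated subquery returns both, while ROW_NUMBER keeps only one of them.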

SQL Find Possible Duplicates

I need SQL code that will identify possible duplicates in a table. Let's say my table has 4 columns:
ID (primary key)
Date1
Date2
GroupID
(Date1, Date2, GroupID) form a unique key.
This table gets populated with blocks of data at a time, and it often happens that a new block is loaded that contains a number of records that are already in there. This is fine as long as the unique key catches them. Unfortunately, sometimes Date1 is empty (or at least '1900/01/01'), either in the first or in subsequent uploads.
So what I need is something to identify where the (Date2, GroupID) combination appears more than once and where, for one of the records, Date1 = '1900/01/01'.
Thanks
Karl
bkm kind of has it, but the inner select can perform poorly on some databases.
This is more straightforward:
select t1.* from
t as t1 left join t as t2
on (t1.date2=t2.date2 and t1.groupid=t2.groupid)
where t1.id != t2.id and (t1.date1='1900/01/01' or t2.date1='1900/01/01')
You can identify duplicates on (date2, GroupID) using
Select date2,GroupID
from t
group by (date2,GroupID)
having count(*) >1
Use this to identify records in main table that are duplicates:
Select *
from t
where date1='1900/01/01'
and (date2,groupID) IN (Select date2,GroupID
from t
group by (date2,GroupID)
having count(*) >1)
NOTE: Since Date1, Date2, GroupID forms a unique key, check if your design is right in allowing Date1 to be NULL. You could have a genuine case where Date 1 is different for two rows while (date2,GroupID) is the same
If I understand correctly, you are looking for a group of IDs for which GroupID and Date2 are the same, there's one occurrence of Date1 that's different from 1900/01/01, and all the rest of the Date1s are 1900/01/01.
If I got it right, here's the query for you:
SELECT T1.ID
FROM Table T1
WHERE
(T1.GroupID, T1.Date2) IN
(SELECT T2.GroupID, T2.Date2
FROM Table T2
WHERE T2.Date1 = '1900/01/01' OR
T2.Date1 IS NULL
GROUP BY T2.GroupID, T2.Date2)
AND
1 >=
(
SELECT COUNT(*)
FROM TABLE T3
WHERE NOT (T3.Date1 = '1900/01/01')
AND NOT (T3.Date1 IS NULL)
AND T3.GroupID = T1.GroupID
AND T3.Date2 = T1.Date2
)
Hope that helps.
In addition to having a PRIMARY KEY field defined on the table, you can also add other UNIQUE constraints to perform the same sort of thing you're asking for. They'll validate that a particular column or set of columns have a unique value in the table.
Check out the entry in the MySQL manual for an example:
http://dev.mysql.com/doc/refman/5.1/en/create-table.html
A check constraint perhaps.
Something along the lines of select count(*) where date1 = '1900/01/01' and date2 = #date2 and groupid = #groupid.
Just need to see if you can do this in a table-level constraint ....
select * from table a
join (
select Date2, GroupID, Count(*)
from table
group by Date2, GroupID
having count(*) > 1
) b on (a.Date2 = b.Date2 and a.GroupID = b.GroupID)
where a.Date1 = '1900/01/01'
This is the most straightforward way I can think to do it:
SELECT DISTINCT t1.*
FROM t t1 JOIN t t2 USING (date2, groupid)
WHERE t1.date1 = '1900/01/01' AND t1.id <> t2.id;
No need to use GROUP BY, which performs poorly on some brands of database.
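For comparison, the grouped-subquery join suggested earlier in this thread can be exercised on a few made-up rows. The sketch assumes Python's sqlite3 module and hypothetical data in which only ids 1 and 2 share a (date2, groupid) combination.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER PRIMARY KEY, date1 TEXT, date2 TEXT, groupid INTEGER);
-- Hypothetical data: id 2 is a likely duplicate of id 1 loaded without Date1;
-- id 3 also has the placeholder date but no sibling row.
INSERT INTO t VALUES
    (1, '2008/05/01', '2008/06/01', 10),
    (2, '1900/01/01', '2008/06/01', 10),
    (3, '1900/01/01', '2008/07/01', 10),
    (4, '2008/05/02', '2008/08/01', 20);
""")

# Find (date2, groupid) combinations that occur more than once, then
# return the rows in those groups that carry the placeholder Date1.
suspects = conn.execute("""
SELECT a.id
FROM t a
JOIN (
    SELECT date2, groupid
    FROM t
    GROUP BY date2, groupid
    HAVING COUNT(*) > 1
) b ON a.date2 = b.date2 AND a.groupid = b.groupid
WHERE a.date1 = '1900/01/01'
ORDER BY a.id
""").fetchall()

print(suspects)
```

Row 3 is not reported because its (date2, groupid) combination appears only once, which matches the requirement stated in the question.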
