Remove all non contiguous records with identical fields - sql

I have a table with columns like:
ID  RecordID  DateInserted
1   10        now + 1
2   10        now + 2
3   4         now + 3
4   10        now + 4
5   10        now + 5
I would like to remove all non-contiguous duplicates of the RecordID column when the rows are sorted by DateInserted.
In my example I would like to remove records 4 and 5, because between records 2 and 4 there is a record with a different RecordID.
Is there a way to do it with one query?

You can use window functions. One method is to count the changes in value that occur up to each row, which numbers each contiguous run, and then keep only the rows that belong to the first run of their RecordID:
select t.*
from (select t.*,
             min(grp_num) over (partition by recordid) as first_grp
      from (select t.*,
                   sum(case when prev_recordid = recordid then 0 else 1 end) over (order by dateinserted) as grp_num
            from (select t.*,
                         lag(recordid) over (order by dateinserted) as prev_recordid
                  from t
                 ) t
           ) t
     ) t
where grp_num = first_grp;
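The run-numbering idea can be sketched end to end as follows. This is a minimal example, not a definitive implementation: it uses Python's built-in sqlite3 module as a stand-in database (an assumption, since the question does not name an engine; window functions need SQLite 3.25 or later), and integer DateInserted values stand in for now + 1 to now + 5. The CTE names flagged, runs and ranked are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, recordid INTEGER, dateinserted INTEGER)"
)
# Sample rows from the question; dateinserted 1..5 stands in for now + 1 .. now + 5.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 10, 1), (2, 10, 2), (3, 4, 3), (4, 10, 4), (5, 10, 5)],
)

query = """
WITH flagged AS (
    -- 1 when a row starts a new contiguous run, 0 when it continues the previous one
    SELECT t.*,
           CASE WHEN LAG(recordid) OVER (ORDER BY dateinserted) = recordid
                THEN 0 ELSE 1 END AS is_start
    FROM t
),
runs AS (
    -- running total of run starts = run number of each row
    SELECT flagged.*,
           SUM(is_start) OVER (ORDER BY dateinserted) AS grp_num
    FROM flagged
),
ranked AS (
    -- first run in which each recordid appears
    SELECT runs.*,
           MIN(grp_num) OVER (PARTITION BY recordid) AS first_grp
    FROM runs
)
SELECT id, recordid
FROM ranked
WHERE grp_num = first_grp
ORDER BY dateinserted
"""

kept = conn.execute(query).fetchall()
print(kept)
```

On the sample data this keeps IDs 1, 2 and 3 and drops IDs 4 and 5, which belong to a later run of RecordID 10.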

One way would be to "flag" all the rows where it is not the first time this RecordID appeared and the prior row contained a different RecordID. Then you just exclude any row beyond that point for that RecordID.
;WITH cte AS
(
  SELECT ID, RecordID, DateInserted,
    dr = DENSE_RANK() OVER (PARTITION BY RecordID ORDER BY DateInserted),
    prior = COALESCE(LAG(RecordID,1) OVER (ORDER BY DateInserted), RecordID)
  FROM dbo.table_name
),
FlaggedRows AS
(
  SELECT RecordID, dr
  FROM cte
  WHERE dr > 1 AND prior <> RecordID
)
SELECT cte.ID, cte.RecordID, cte.DateInserted
FROM cte
LEFT OUTER JOIN FlaggedRows AS f
  ON cte.RecordID = f.RecordID
WHERE cte.dr < COALESCE(f.dr, cte.dr + 1)
ORDER BY cte.DateInserted;
If you want to actually delete the rows from the source (remove will typically be inferred as removing from the result), then change the SELECT at the end to:
DELETE cte
FROM cte
INNER JOIN FlaggedRows f
ON cte.RecordID = f.RecordID
WHERE cte.dr >= f.dr;

Related

Finding unique combination of columns associated with 1 non-unique column

Here's my table:
ItemID  ItemName  ItemBatch  TrackingNumber
a       bag       1          498239
a       bag       1          498239
a       bag       1          958103
b       paper     2          123444
b       paper     2          123444
I'm trying to find occurrences of ItemID + ItemName + ItemBatch that have a non-unique TrackingNumber. So in the example above, there are 3 occurrences of a bag 1 and at least 1 of those rows has a different TrackingNumber from any of the other rows. In this case 958103 is different from 498239 so it should be a hit.
For b paper 2 the TrackingNumber is unique for all the respective rows so we ignore this. Is there a query that can pull this combination of columns with 3 identical fields and 1 non-unique field?
Yet another option:
SELECT *
FROM tab
WHERE ItemBatch IN (SELECT ItemBatch
FROM tab
GROUP BY ItemBatch, TrackingNumber
HAVING COUNT(TrackingNumber) = 1)
This query finds the combinations of (ItemBatch, TrackingNumber) that occur only once, then gets all rows corresponding to their ItemBatch values.
You can use GROUP BY and HAVING
SELECT
t.ItemID,
t.ItemName,
t.ItemBatch,
COUNT(*)
FROM YourTable t
GROUP BY
t.ItemID,
t.ItemName,
t.ItemBatch
HAVING COUNT(DISTINCT TrackingNumber) > 1;
Or, if you want each individual row, you can use a window function. You cannot use COUNT(DISTINCT ...) in a window function, but you can simulate it with DENSE_RANK and MAX:
SELECT
t.*
FROM (
SELECT *,
Count = MAX(dr) OVER (PARTITION BY t.ItemID, t.ItemName, t.ItemBatch)
FROM (
SELECT *,
dr = DENSE_RANK() OVER (PARTITION BY t.ItemID, t.ItemName, t.ItemBatch ORDER BY t.TrackingNumber)
FROM YourTable t
) t
) t
WHERE t.Count > 1;
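Both variants can be tried on the question's sample rows. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in database (an assumption; the answer's alias = expression syntax is SQL Server style, so standard AS aliases are used here). The names row_count and distinct_count are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable (ItemID TEXT, ItemName TEXT, ItemBatch INTEGER, TrackingNumber INTEGER)"
)
# Sample rows from the question.
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?, ?)",
    [
        ("a", "bag", 1, 498239),
        ("a", "bag", 1, 498239),
        ("a", "bag", 1, 958103),
        ("b", "paper", 2, 123444),
        ("b", "paper", 2, 123444),
    ],
)

# One row per combination that has more than one distinct TrackingNumber.
groups = conn.execute(
    """
    SELECT ItemID, ItemName, ItemBatch, COUNT(*) AS row_count
    FROM YourTable
    GROUP BY ItemID, ItemName, ItemBatch
    HAVING COUNT(DISTINCT TrackingNumber) > 1
    """
).fetchall()

# Every individual row of those combinations: the highest DENSE_RANK in a
# partition equals the number of distinct TrackingNumber values in it.
rows = conn.execute(
    """
    SELECT ItemID, ItemName, ItemBatch, TrackingNumber
    FROM (
        SELECT r.*,
               MAX(dr) OVER (PARTITION BY ItemID, ItemName, ItemBatch) AS distinct_count
        FROM (
            SELECT y.*,
                   DENSE_RANK() OVER (PARTITION BY ItemID, ItemName, ItemBatch
                                      ORDER BY TrackingNumber) AS dr
            FROM YourTable y
        ) r
    )
    WHERE distinct_count > 1
    ORDER BY TrackingNumber
    """
).fetchall()

print(groups)
print(rows)
```

The grouped query returns only the a / bag / 1 combination, and the windowed query returns its three underlying rows.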

SUM a specific column in next rows until a condition is true

Here is a table of articles. I want to store the sum of the mass column from the following rows in the sumNext column, based on a condition.
If the next row has the same floor (in the floorNo column) as the current row, then add up the mass of the following rows until the floor changes.
E.g.: row three has sumNext = 2. That is computed by adding the mass from row four and row five, because both rows have the same floor number as row three.
id       mass  symbol  floorNo  sumNext
2891176  1     D       1        0
2891177  1     L       8        0
2891178  1     L       1        2
2891179  1     L       1        1
2891180  1             1        0
2891181  1             5        2
2891182  1             5        1
2891183  1             5        0
Here is the query that generates this table; I just want to add the sumNext column with the right values.
WITH items AS (SELECT
SP.id,
SP.mass,
SP.symbol,
SP.floorNo
FROM articles SP
ORDER BY
DECODE(SP.symbol,
'P',1,
'D',2,
'L',3,
4 ) asc)
SELECT CLS.*
FROM items CLS;
You could use the solution below. It uses a recursive common table expression (CTE) to put all consecutive rows with the same FLOORNO value in the same group (the new grp column).
It then uses the analytic version of the SUM function to sum the MASS of all following rows within each grp, as required.
with Items_RowsNumbered (id, mass, symbol, floorNo, rnb) as (
select ID, MASS, SYMBOL, FLOORNO
, row_number()over(
order by DECODE(symbol, 'P',1, 'D',2, 'L',3, 4 ) asc, ID )
/*
You need to add ID column (or any others columns that can identify each row uniquely)
in the "order by" clause to make the result deterministic
*/
from (Your source query)Items
)
, cte(id, mass, symbol, floorNo, rnb, grp) as (
select id, mass, symbol, floorNo, rnb, 1 grp
from Items_RowsNumbered
where rnb = 1
union all
select t.id, t.mass, t.symbol, t.floorNo, t.rnb
, case when t.floorNo = c.floorNo then c.grp else c.grp + 1 end grp
from Items_RowsNumbered t
join cte c on (c.rnb + 1 = t.rnb)
)
select
ID, MASS, SYMBOL, FLOORNO
/*, RNB, GRP*/
, nvl(
sum(MASS)over(
partition by grp
order by rnb
ROWS BETWEEN 1 FOLLOWING and UNBOUNDED FOLLOWING)
, 0
) sumNext
from cte
;
This is a typical gaps-and-islands problem. You can use the difference between two ROW_NUMBER() values to determine the islands, and then apply the SUM() analytic function within each island, such as
WITH ii AS
(
 SELECT i.*,
        ROW_NUMBER() OVER (ORDER BY id DESC) AS rn2,
        ROW_NUMBER() OVER (PARTITION BY floorNo ORDER BY id DESC) AS rn1
   FROM items i
)
SELECT id, mass, symbol, floorNo,
       SUM(mass) OVER (PARTITION BY floorNo, rn2 - rn1 ORDER BY id DESC) - mass AS sumNext
  FROM ii
 ORDER BY id
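The gaps-and-islands idea can also be sketched with a LAG-based change flag and a running sum. This is a minimal example under stated assumptions: Python's built-in sqlite3 module stands in for the database (the thread's queries are Oracle), the rows are ordered by id rather than by the DECODE expression (the two orders coincide on the sample data), and the names flagged, islands, is_start and grp are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, mass INTEGER, symbol TEXT, floorNo INTEGER)"
)
# Sample rows from the question; the last four rows have no symbol.
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [
        (2891176, 1, "D", 1),
        (2891177, 1, "L", 8),
        (2891178, 1, "L", 1),
        (2891179, 1, "L", 1),
        (2891180, 1, None, 1),
        (2891181, 1, None, 5),
        (2891182, 1, None, 5),
        (2891183, 1, None, 5),
    ],
)

query = """
WITH flagged AS (
    -- 1 when the floor differs from the previous row (or there is no previous row)
    SELECT i.*,
           CASE WHEN LAG(floorNo) OVER (ORDER BY id) = floorNo
                THEN 0 ELSE 1 END AS is_start
    FROM items i
),
islands AS (
    -- running total of island starts = island number
    SELECT flagged.*,
           SUM(is_start) OVER (ORDER BY id) AS grp
    FROM flagged
)
SELECT id, floorNo,
       -- mass of the rows that follow the current row inside its island
       COALESCE(SUM(mass) OVER (PARTITION BY grp ORDER BY id
                                ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS sumNext
FROM islands
ORDER BY id
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The frame is empty for the last row of each island, so SUM returns NULL there and COALESCE turns it into 0.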

DISTINCT with PagedResults

I am sure this will be answered somewhere...
Aim: Get DISTINCT DOCURL and additional columns
Tried:
1. Changing SELECT * FROM to SELECT DISTINCT DOCURL FROM which only yields the DOCURL column
2. Adding DISTINCT into the second select (as per example) but again I get all columns and rows.
Notes: Code is normally built dynamically so I've taken the print...
SELECT *
FROM
(
Select DISTINCT
isnull(d.DOCURL,'-') As DOCURL,
isnull(d.ID,'-') As ID,
isnull(d.UPRN,'-') As UPRN,
isnull(d.VFMDISCIPLINE,'-') As VFMDISCIPLINE,
isnull(d.VFMDISCIPLINEELEMENT,'-') As VFMDISCIPLINEELEMENT ,
isnull(d.SurveyDate,' ') As SurveyDate,
isnull(d.WorkOrder,'-') As WorkOrder,
ROW_NUMBER() OVER (ORDER BY DOCURL) AS ResultSetRowNumber
From TblData As D
WHERE 1 = 1
AND d.UPRN = '123XYZ'
AND (d.VFMDISCIPLINE = '1' OR d.VFMDISCIPLINE = '2' )
) As PagedResults
WHERE ResultSetRowNumber > 0 And ResultSetRowNumber <= 20
Assuming DOCURL is a unique column, the issue with DISTINCT here is that ROW_NUMBER() generates a new row number for each row in the subquery, so all rows are considered different. You should apply DISTINCT first and then assign the row numbers.
Edit: I removed DISTINCT since your result set does not satisfy the criteria. Instead, I've added a partition inside the subquery, so that row numbers start from 1 for each unique DOCURL; they are ordered by ID, since I assumed that's what you mean by first. The outer query then reassigns row numbers based on the unique results from the subquery.
Select * From (
SELECT *, ROW_NUMBER() OVER (ORDER BY DOCURL) AS ResultSetRowNumber
FROM
(
Select
isnull(d.DOCURL,'-') As DOCURL,
isnull(d.ID,'-') As ID,
isnull(d.UPRN,'-') As UPRN,
isnull(d.VFMDISCIPLINE,'-') As VFMDISCIPLINE,
isnull(d.VFMDISCIPLINEELEMENT,'-') As VFMDISCIPLINEELEMENT ,
isnull(d.SurveyDate,' ') As SurveyDate,
isnull(d.WorkOrder,'-') As WorkOrder,
ROW_NUMBER() OVER (PARTITION BY d.DOCURL ORDER BY d.ID) As PART
From TblData As D
WHERE 1 = 1
AND d.UPRN = '123XYZ'
AND (d.VFMDISCIPLINE = '1' OR d.VFMDISCIPLINE = '2' )
) As t Where PART = 1
) As PagedResults
WHERE ResultSetRowNumber > 0 And ResultSetRowNumber <= 20
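The two-step numbering (first row per DOCURL, then page numbers over the survivors) can be sketched as follows. This is a minimal example, assuming Python's built-in sqlite3 module as a stand-in for SQL Server; the five sample rows and the reduced column list are hypothetical and only serve to show the paging behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TblData (ID INTEGER PRIMARY KEY, DOCURL TEXT, UPRN TEXT)")
# Hypothetical rows: doc/a.pdf and doc/c.pdf each appear twice.
conn.executemany(
    "INSERT INTO TblData VALUES (?, ?, ?)",
    [
        (1, "doc/a.pdf", "123XYZ"),
        (2, "doc/a.pdf", "123XYZ"),
        (3, "doc/b.pdf", "123XYZ"),
        (4, "doc/c.pdf", "123XYZ"),
        (5, "doc/c.pdf", "123XYZ"),
    ],
)

query = """
SELECT DOCURL, ID, ResultSetRowNumber
FROM (
    -- step 2: number the surviving rows for paging
    SELECT DOCURL, ID,
           ROW_NUMBER() OVER (ORDER BY DOCURL) AS ResultSetRowNumber
    FROM (
        -- step 1: number the rows inside each DOCURL, lowest ID first
        SELECT DOCURL, ID,
               ROW_NUMBER() OVER (PARTITION BY DOCURL ORDER BY ID) AS part
        FROM TblData
        WHERE UPRN = :uprn
    ) AS t
    WHERE part = 1
) AS PagedResults
WHERE ResultSetRowNumber > :first AND ResultSetRowNumber <= :last
ORDER BY ResultSetRowNumber
"""

# First page with a page size of 2.
page = conn.execute(query, {"uprn": "123XYZ", "first": 0, "last": 2}).fetchall()
print(page)
```

Each DOCURL appears once, represented by its lowest ID, and the page boundaries apply to the de-duplicated rows rather than to the raw rows.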

Getting all fields from table filtered by MAX(Column1)

I have table with some data, for example
ID Specified TIN Value
----------------------
1 0 tin1 45
2 1 tin1 34
3 0 tin2 23
4 3 tin2 47
5 3 tin2 12
I need to get the rows, with all fields, that have MAX(Specified) for each TIN. If several rows share the maximum (in the example, IDs 4 and 5), I must take the last one (ID 5).
Finally, the result must be
ID Specified TIN Value
-----------------------
2 1 tin1 34
5 3 tin2 12
This will give the desired result using a window function:
;with cte as (select *, row_number() over (partition by tin order by specified desc, id desc) as rn
from tablename)
select * from cte where rn = 1
Edit: Updated query after question edit.
Here is the fiddle
http://sqlfiddle.com/#!9/20e1b/1/0
SELECT * FROM TBL WHERE ID IN (
SELECT max(id) FROM
TBL WHERE SPECIFIED IN
(SELECT MAX(SPECIFIED) FROM TBL
GROUP BY TIN)
group by specified)
I am sure we can simplify it further, but this will work.
select * from tbl where id =(
SELECT MAX(ID) FROM
tbl where specified =(SELECT MAX(SPECIFIED) FROM tbl))
One method is to use window functions, specifically row_number():
select t.*
from (select t.*,
             row_number() over (partition by tin
                                order by specified desc, id desc
                               ) as seqnum
      from t
     ) t
where seqnum = 1;
However, if you have an index on (tin, specified, id) and on id, the most efficient method is:
select t.*
from t
where t.id = (select top 1 t2.id
              from t t2
              where t2.tin = t.tin
              order by t2.specified desc, t2.id desc
             );
The reason this is better is that the index will be used for the subquery, and then the index on id will be used for the outer query as well. This is highly efficient. Although an index can also be used for the window function, the resulting execution plan probably requires scanning the entire table.
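Both approaches can be compared on the question's sample rows. This is a minimal sketch that assumes Python's built-in sqlite3 module as a stand-in database, so LIMIT 1 takes the place of SQL Server's TOP 1; the index name idx_t_tin_specified_id is made up, and the example checks results only, not execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, specified INTEGER, tin TEXT, value INTEGER)"
)
# Sample rows from the question.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, 0, "tin1", 45), (2, 1, "tin1", 34), (3, 0, "tin2", 23),
     (4, 3, "tin2", 47), (5, 3, "tin2", 12)],
)
# Composite index matching the correlated subquery's filter and sort columns.
conn.execute("CREATE INDEX idx_t_tin_specified_id ON t (tin, specified, id)")

# Window-function variant: number the rows per tin, best row first.
window_rows = conn.execute(
    """
    SELECT id, specified, tin, value
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY tin
                                    ORDER BY specified DESC, id DESC) AS seqnum
          FROM t)
    WHERE seqnum = 1
    ORDER BY id
    """
).fetchall()

# Correlated-subquery variant: pick the best id for each row's tin.
subquery_rows = conn.execute(
    """
    SELECT t.id, t.specified, t.tin, t.value
    FROM t
    WHERE t.id = (SELECT t2.id
                  FROM t t2
                  WHERE t2.tin = t.tin
                  ORDER BY t2.specified DESC, t2.id DESC
                  LIMIT 1)
    ORDER BY t.id
    """
).fetchall()

print(window_rows)
print(subquery_rows)
```

Both queries return the rows with IDs 2 and 5, matching the expected result in the question.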

Oracle SQL query : finding the last time a data was changed

I want to retrieve elapsed days since the last time the data of the specific column was changed, for example :
TABLE_X contains
ID PDATE DATA1 DATA2
A 10-Jan-2013 5 10
A 9-Jan-2013 5 10
A 8-Jan-2013 5 11
A 7-Jan-2013 5 11
A 6-Jan-2013 14 12
A 5-Jan-2013 14 12
B 10-Jan-2013 3 15
B 9-Jan-2013 3 15
B 8-Jan-2013 9 15
B 7-Jan-2013 9 15
B 6-Jan-2013 14 15
B 5-Jan-2013 14 8
I simplified the table for the purpose of the example.
The result should be :
ID DATA1_LASTUPDATE DATA2_LASTUPDATE
A 4 2
B 2 5
which says,
- data1 of A last update is 4 days ago,
- data2 of A last update is 2 days ago,
- data1 of B last update is 2 days ago,
- data2 of B last update is 5 days ago.
The query below works, but it takes too long to complete when I apply it to the real table, which has lots of records, and add 2 more data columns to find their latest update days.
I use the LEAD function for this purpose.
Are there any alternatives to speed up the query?
with qdata1 as
(
select ID, pdate from
(
select a.*, row_number() over (partition by ID order by pdate desc) rnum from
(
select a.*,
lead(data1,1,0) over (partition by ID order by pdate desc) - data1 as data1_diff
from table_x a
) a
where data1_diff <> 0
)
where rnum=1
),
qdata2 as
(
select ID, pdate from
(
select a.*, row_number() over (partition by ID order by pdate desc) rnum from
(
select a.*,
lead(data2,1,0) over (partition by ID order by pdate desc) - data2 as data2_diff
from table_x a
) a
where data2_diff <> 0
)
where rnum=1
)
select a.ID,
       trunc(sysdate) - b.pdate data1_lastupdate,
       trunc(sysdate) - c.pdate data2_lastupdate
from table_master a, qdata1 b, qdata2 c
where a.ID = b.ID(+)
  and a.ID = c.ID(+)
Thanks a lot.
You can avoid the multiple hits on the table and the joins by doing both lag (or lead) calculations together:
with t as (
select id, pdate, data1, data2,
lag(data1) over (partition by id order by pdate) as lag_data1,
lag(data2) over (partition by id order by pdate) as lag_data2
from table_x
),
u as (
select t.*,
case when lag_data1 is null or lag_data1 != data1 then pdate end as pdate1,
case when lag_data2 is null or lag_data2 != data2 then pdate end as pdate2
from t
),
v as (
select u.*,
rank() over (partition by id order by pdate1 desc nulls last) as rn1,
rank() over (partition by id order by pdate2 desc nulls last) as rn2
from u
)
select v.id,
max(trunc(sysdate) - (case when rn1 = 1 then pdate1 end))
as data1_last_update,
max(trunc(sysdate) - (case when rn2 = 1 then pdate2 end))
as data2_last_update
from v
group by v.id
order by v.id;
I'm assuming that you meant your data to be for Jun-2014, not Jan-2013; and that you're comparing the most recent change dates with the current date. With the data adjusted to use 10-Jun-2014 etc., this gives:
ID DATA1_LAST_UPDATE DATA2_LAST_UPDATE
-- ----------------- -----------------
A 4 2
B 2 5
The first CTE (t) gets the actual table data and adds two extra columns, one for each of the data columns, using lag (which is the same as lead ordered by descending dates).
The second CTE (u) adds two date columns that are only set when the data columns are changed (or when they are first set, just in case they have never changed). So if a row has data1 the same as the previous row, its pdate1 will be blank. You could combine the first two by repeating the lag calculation but I've left it split out to make it a bit clearer.
The third CTE (v) assigns a ranking to those pdate columns such that the most recent is ranked first.
And the final query works out the difference from the current date to the highest-ranked (i.e. most recent) change for each of the data columns.
SQL Fiddle, including all the CTEs run individually so you can see what they are doing.
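A condensed variant of the same single-pass idea is sketched below. It is not the query above: it replaces the ranking step with a conditional MAX, uses Python's built-in sqlite3 module as a stand-in for Oracle, stores dates as ISO strings, and takes a fixed reference date of 2013-01-11 instead of sysdate so that the output is reproducible. All of these are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_x (id TEXT, pdate TEXT, data1 INTEGER, data2 INTEGER)")
# Sample rows from the question, with dates written as ISO strings.
conn.executemany(
    "INSERT INTO table_x VALUES (?, ?, ?, ?)",
    [
        ("A", "2013-01-10", 5, 10), ("A", "2013-01-09", 5, 10),
        ("A", "2013-01-08", 5, 11), ("A", "2013-01-07", 5, 11),
        ("A", "2013-01-06", 14, 12), ("A", "2013-01-05", 14, 12),
        ("B", "2013-01-10", 3, 15), ("B", "2013-01-09", 3, 15),
        ("B", "2013-01-08", 9, 15), ("B", "2013-01-07", 9, 15),
        ("B", "2013-01-06", 14, 15), ("B", "2013-01-05", 14, 8),
    ],
)

query = """
WITH t AS (
    -- previous value of each data column within the same id
    SELECT id, pdate, data1, data2,
           LAG(data1) OVER (PARTITION BY id ORDER BY pdate) AS lag_data1,
           LAG(data2) OVER (PARTITION BY id ORDER BY pdate) AS lag_data2
    FROM table_x
)
SELECT id,
       -- latest date on which data1 differed from the previous row
       CAST(julianday(:today) - julianday(
            MAX(CASE WHEN lag_data1 IS NULL OR lag_data1 <> data1 THEN pdate END)
       ) AS INTEGER) AS data1_last_update,
       -- latest date on which data2 differed from the previous row
       CAST(julianday(:today) - julianday(
            MAX(CASE WHEN lag_data2 IS NULL OR lag_data2 <> data2 THEN pdate END)
       ) AS INTEGER) AS data2_last_update
FROM t
GROUP BY id
ORDER BY id
"""

result = conn.execute(query, {"today": "2013-01-11"}).fetchall()
print(result)
```

With the reference date set one day after the latest sample row, the output matches the expected result in the question: 4 and 2 days for A, 2 and 5 days for B.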
Your query wasn't returning the right results for me, maybe I missed something, but I got the correct results also with the below query (you can check this SQLFiddle demo):
with ranked as (
select ID,
data1,
data2,
rank() over(partition by id order by pdate desc) r
from table_x
)
select id,
sum(DATA1_LASTUPDATE) DATA1_LASTUPDATE,
sum(DATA2_LASTUPDATE) DATA2_LASTUPDATE
from (
-- here I get when data1 was updated
select id,
count(1) DATA1_LASTUPDATE,
0 DATA2_LASTUPDATE
from ranked
start with r = 1
CONNECT BY (PRIOR data1 = data1)
and PRIOR r = r - 1
group by id
union
-- here I get when data2 was updated
select id,
0 DATA1_LASTUPDATE,
count(1) DATA2_LASTUPDATE
from ranked
start with r = 1
CONNECT BY (PRIOR data2 = data2)
and PRIOR r = r - 1
group by id
)
group by id