Eliminating duplicate records in SQL

I have a table called attribute_value with the following columns:
attribute_id | start_date | value | latest_ind | mod_dtime
The latest_ind column can have a value of either 1 or 0.
I want to run an update script on this table that finds all rows sharing the same attribute_id and start_date with latest_ind equal to 1, and sets latest_ind to 0 on all of them EXCEPT the most recent record (the one with the latest mod_dtime).
I've managed to put together the following SELECT query, but I have no idea how to convert it into an UPDATE. Any pointers would be appreciated.
SELECT av.attribute_id, av.start_date, count(latest_ind), max(mod_dtime)
FROM t_attribute_value av
where latest_ind = 1
group by attribute_id, start_date
having count(latest_ind) > 1

This is a case where an UPDATE using a CTE comes in handy:
;WITH ToUpdate AS (
SELECT latest_ind,
ROW_NUMBER() OVER (PARTITION BY attribute_id, start_date
ORDER BY mod_dtime DESC) AS rn
FROM attribute_value
WHERE latest_ind = 1
)
UPDATE ToUpdate
SET latest_ind = 0
WHERE rn > 1
The update operation is propagated to the real table. Hence, for any (attribute_id, start_date) partition containing more than one row, all records but the latest are updated.
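For engines that cannot update through a CTE, the same ranking idea can be applied by collecting row identifiers first. Below is a minimal sketch using Python's built-in sqlite3 module, assuming SQLite 3.25 or later (needed for window functions). The sample rows are made up for illustration, and rowid is SQLite's implicit row identifier, not something from the question.

```python
import sqlite3

# Hypothetical sample rows; column names follow the question.
rows = [
    (1, "2020-01-01", "a", 1, "2020-01-01 10:00:00"),
    (1, "2020-01-01", "b", 1, "2020-01-02 10:00:00"),  # latest of its group
    (1, "2020-02-01", "c", 1, "2020-02-01 10:00:00"),  # only row of its group
    (2, "2020-01-01", "d", 1, "2020-01-01 09:00:00"),
    (2, "2020-01-01", "e", 1, "2020-01-03 09:00:00"),  # latest flagged row
    (2, "2020-01-01", "f", 0, "2020-01-04 09:00:00"),  # already 0, ignored
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attribute_value ("
    "attribute_id INTEGER, start_date TEXT, value TEXT, "
    "latest_ind INTEGER, mod_dtime TEXT)"
)
conn.executemany("INSERT INTO attribute_value VALUES (?, ?, ?, ?, ?)", rows)

# Rank the flagged rows per (attribute_id, start_date), newest first,
# then clear the flag on every row except rank 1.
conn.execute(
    """
    UPDATE attribute_value
    SET latest_ind = 0
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (PARTITION BY attribute_id, start_date
                                      ORDER BY mod_dtime DESC) AS rn
            FROM attribute_value
            WHERE latest_ind = 1
        )
        WHERE rn > 1
    )
    """
)

flagged = [r[0] for r in conn.execute(
    "SELECT value FROM attribute_value WHERE latest_ind = 1 ORDER BY value")]
print(flagged)
```

Only one flagged row per (attribute_id, start_date) group should remain after the update.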

Maybe something like this:
Method 1 : With CTE
;WITH T AS
( SELECT attribute_id, start_date, latest_ind,
ROW_NUMBER() OVER (PARTITION BY attribute_id, start_date ORDER BY mod_dtime DESC) RN
FROM t_attribute_value
where latest_ind = 1
)
UPDATE T
SET latest_ind = 0
WHERE RN > 1
Method 2: You don't need a CTE for this
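UPDATE T
SET T.latest_ind = 0
FROM t_attribute_value T
INNER JOIN
(
SELECT attribute_id, start_date, mod_dtime,
ROW_NUMBER() OVER (PARTITION BY attribute_id, start_date ORDER BY mod_dtime DESC) RN
FROM t_attribute_value
where latest_ind = 1
) V
-- Match each ranked row back to its own source row
-- (assumes mod_dtime is unique within an attribute_id, start_date partition)
ON T.attribute_id = V.attribute_id
AND T.start_date = V.start_date
AND T.mod_dtime = V.mod_dtime
AND V.RN > 1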

Related

Codility SqlEventsDelta (Compute the difference between the latest and the second latest value for each event type)

Recently, I've been practicing code exercises on Codility.
The problem is in the Exercises 6 - SQL section; start a test there to see the problem description for SqlEventsDelta.
Problem definition:
I wrote this solution to the SqlEventsDelta question in SQLite. It works fine in my local tool, but it does not work in the web tool.
Can anyone give any advice on how I can solve this problem?
※ I searched for this problem on Stack Overflow and I know there is better code than mine.
But, if possible, I want to keep my own SQLite logic and functions.
WITH cte1 AS
(
SELECT *, CASE WHEN e2.event_type = e2.prev THEN 0
WHEN e2.event_type = e2.next THEN 0
ELSE 1 END AS grp
FROM (SELECT *, LAG(e1.event_type) OVER(ORDER BY (SELECT 1)) AS prev , LEAD(e1.event_type) OVER(ORDER BY (SELECT 1)) AS next FROM events e1) e2
)
,cte2 AS
(
SELECT cte1.event_type, cte1.time, cte1.grp, cte1.value - LAG(cte1.value) OVER(ORDER BY cte1.event_type, cte1.time) AS value
FROM cte1
WHERE cte1.grp = 0
ORDER BY cte1.event_type, cte1.time
)
SELECT c2.event_type, c2.value
FROM cte2 c2
WHERE (c2.event_type, c2.time) IN (
SELECT c2.event_type, MAX(c2.time) AS time
FROM cte2 c2
GROUP BY c2.event_type)
GROUP BY c2.event_type
ORDER BY c2.event_type, c2.time
It ran without error in my local tool (DB Browser for SQLite, Version 3.12.2).
event_type | value
-----------+-----------
2 | -5
3 | 4
Execution finished without errors.
Result: 2 rows returned in 7ms
But in the web tool (Codility test editor, SQLite Version 3.11.0) it does not run, and I get the following errors.
| Compilation successful.
| Example test: (example test)
| Output (stderr):
| error on query: ...
| ...
| ...,
| details: near "(": syntax error
| RUNTIME ERROR (tested program terminated with exit code 1)
Detected some errors.
SqlEventsDelta question:
Write an SQL query that, for each event_type that has been registered more than once, returns the difference between the latest (i.e. the most recent in terms of time) and the second latest value.
The table should be ordered by event_type (in ascending order).
The names of the columns in the rowset don't matter, but their order does.
Given a table events with the following structure :
create table events (
event_type integer not null,
value integer not null,
time timestamp not null,
unique(event_type, time)
);
For example, given the following data :
event_type | value | time
-----------+------------+--------------------
2 | 5 | 2015-05-09 12:42:00
4 | -42 | 2015-05-09 13:19:57
2 | 2 | 2015-05-09 14:48:30
2 | 7 | 2015-05-09 12:54:39
3 | 16 | 2015-05-09 13:19:57
3 | 20 | 2015-05-09 15:01:09
Given the above data, the output should return the following rowset :
event_type | value
-----------+-----------
2 | -5
3 | 4
Thank you.
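One likely cause of the syntax error: SQLite only gained window functions in version 3.25.0, so a 3.11.0 engine rejects LAG(...) OVER ( at the opening parenthesis. As a hedged sketch (not the asker's logic, and not any official Codility solution), the query below avoids window functions entirely by using correlated subqueries, so it should also run on older SQLite builds. It is wrapped in Python's sqlite3 module only to make it runnable against the sample data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "event_type INTEGER NOT NULL, value INTEGER NOT NULL, "
    "time TIMESTAMP NOT NULL, UNIQUE(event_type, time))"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (2, 5, "2015-05-09 12:42:00"),
        (4, -42, "2015-05-09 13:19:57"),
        (2, 2, "2015-05-09 14:48:30"),
        (2, 7, "2015-05-09 12:54:39"),
        (3, 16, "2015-05-09 13:19:57"),
        (3, 20, "2015-05-09 15:01:09"),
    ],
)

query = """
SELECT e.event_type,
       -- latest value minus the value of the closest earlier row
       e.value - (SELECT p.value
                  FROM events p
                  WHERE p.event_type = e.event_type
                    AND p.time < e.time
                  ORDER BY p.time DESC
                  LIMIT 1) AS delta
FROM events e
-- keep only the latest row of each event_type ...
WHERE e.time = (SELECT MAX(m.time)
                FROM events m
                WHERE m.event_type = e.event_type)
  -- ... and only if an earlier row exists (registered more than once)
  AND EXISTS (SELECT 1
              FROM events o
              WHERE o.event_type = e.event_type
                AND o.time < e.time)
ORDER BY e.event_type
"""

result = conn.execute(query).fetchall()
print(result)
```

The timestamps are stored as ISO-style text here, so plain string comparison orders them chronologically.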
I tried a somewhat naive approach. I'm aware that it is bad for performance because of the many subqueries, but the trick here is PostgreSQL's "DISTINCT ON". In any case, I got 100% 😃
Hope you like it!
select distinct on (event_type) event_type, result * -1
from (select event_type, value, lead(value) over (order by event_type) - value result
from (select *
from events
where event_type in (select event_type
from events
group by event_type
having count(event_type) >= 2)
order by event_type, time desc) a) b
with data as (SELECT a.event_type, a.value, a.time,
--Produce a virtual table that stores the next and previous values for each event_type.
--Order by the time column itself: quoted 'time' is a string constant and sorts nothing.
LEAD(a.value,1) over (PARTITION by a.event_type ORDER by a.time) as recent_val,
LAG(a.value,1) over (PARTITION by a.event_type ORDER by a.time) as penult_val
from events a
JOIN (SELECT event_type
from events --Filter the initial dataset for duplicates. Store in correct order
group by event_type HAVING COUNT(*) > 1
ORDER by event_type) b
on a.event_type = b.event_type) --Compare the virtual table to the filtered dataset
SELECT event_type, ("value"-"penult_val") as diff --Perform the desired arithmetic
from data
where recent_val is NULL --Filter for the most recent value (it has no next row)
order by event_type
Hi team! This one's my answer. It's largely a goopy conglomerate of the answers above, but it reads more simply and it's commented for context. Being a newbie, I hope it helps other newbies.
I have the same problem when using SQLite.
Try the code below with PostgreSQL:
with data as (select
e.event_type,
e.value,
e.time,
lead(e.value,1) over (PARTITION by e.event_type order by e.event_type,e.time asc) as next_val,
lag (e.value,1) over (PARTITION by e.event_type order by e.event_type,e.time asc) as prev_val
from events e)
select distinct d.event_type, (d.value-d.prev_val) as diff
from
events e,data d
where e.event_type = d.event_type
and d.next_val is null
and e.event_type in ( SELECT event_type
from data
group by
event_type
having count(1) > 1)
order by 1;
Adding another answer involving self joins -
PostgreSQL
-- write your code in PostgreSQL 9.4
WITH TotalRowCount AS (
SELECT
event_type,
COUNT(*) as row_count
FROM events
GROUP BY 1
),
RankedEventType AS (
SELECT
event_type,
value,
ROW_NUMBER() OVER(PARTITION BY event_type ORDER BY time) as row_num
FROM events
)
SELECT
a.event_type,
a.value - b.value as value
FROM RankedEventType a
INNER JOIN TotalRowCount c
ON a.event_type = c.event_type
INNER JOIN RankedEventType b
ON a.event_type = b.event_type
WHERE 1 = 1
AND a.row_num = c.row_count
AND b.row_num = c.row_count - 1
ORDER BY 1
without nested queries, got 100%
with data as (
with count as (select event_type
from events
group by event_type
having count(event_type) >= 2)
select e.event_type , e.value, e.time from events as e inner join count as r on e.event_type=r.event_type order by e.event_type, e.time desc
)
select distinct on (event_type) event_type,
value - (LEAD(value) over (order by event_type)) result from data
Solution with one subquery
WITH diff AS
(SELECT event_type,
value,
time,
LEAD(value) OVER (PARTITION BY event_type
ORDER BY time DESC) AS prev
FROM events
GROUP BY event_type,
value,
time
)
-- DISTINCT ON keeps the first row per event_type, so sort the latest row first
SELECT DISTINCT ON (event_type) event_type,
value - prev
FROM diff
WHERE prev IS NOT NULL
ORDER BY event_type, time DESC;
with deltas as (
select distinct event_type,
first_value(value) over (PARTITION by event_type ORDER by time DESC) -
nth_value(value, 2) over (PARTITION by event_type ORDER by time DESC) as delta
from events
)
select * from deltas where delta is not null order by 1;
--in PostgreSQL 9.4
with ct1 as (SELECT
event_type,
value,
time,
rank() over (partition by event_type order by time desc) as rank
from events),
ct2 as (
select event_type, value, rank, lag (value,1) over (partition by event_type order by rank) as previous_value
from ct1
order by event_type)
select event_type, previous_value - value from ct2
where rank = 2
order by event_type
My solution:
--Get table with rank 1, 2 group by event_type
with t2 as(
select event_type, value, rank from (
select event_type, value,
rank() over(
partition by event_type
order by time desc) as rank,
count(*) over (partition by event_type) as count
from events) as t
where t.rank <= 2 and t.count > 1
)
--Calculate diff using Lead() and filter out null diff with max
select t3.event_type, max(t3.diff) from (
select event_type,
value - lead(value, 1) over (
partition by event_type
order by rank) as diff
from t2) as t3
group by t3.event_type

Update Flag Based On Change of Previous Value

I have the table below. I need SQL that sets FLAG to 1 if the INPUT value changed from the previous row, and to 0 otherwise.
INPUT START_DATE PERSON_ID FLAG
42707 2017-01-01 227317 0
40000 2018-01-01 227317 1
42400 2019-01-01 227317 1
42400 2019-01-02 227317 0
You can use lag() :
select t.*,
(case when lag(input, 1, input) over (partition by person_id order by start_date) = input
then 0 else 1
end) as FLAG
from table t;
If you want this in a query, then use row_number():
select t.*,
(case when row_number() over (partition by person_id order by start_date) = 1
then 0 else 1
end) as flag
from t;
If the input value could be the same on different rows, then use first_value():
select t.*,
(case when input = first_value(input) over (partition by person_id order by start_date)
then 0 else 1
end) as flag
from t;
Either form could be incorporated into an update using an updatable CTE if you want to update the table.
EDIT:
If you want to know if the value changes from one row to the "next", then use lag(). In an update, this looks like:
with toupdate as (
select t.*,
lag(input) over (partition by person_id order by start_date) as prev_input
from t
)
update toupdate
set flag = (case when prev_input <> input then 1 else 0 end);
That said, I would not advise you to store the data in the table. Instead, just put the logic in a select when you need it. Otherwise, the data could get out of date if a historical value is updated.
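To illustrate computing the flag at query time rather than storing it, here is a small sketch using Python's built-in sqlite3 module (assuming SQLite 3.25 or later for LAG). It loads the four sample rows from the question into a table named t, which is a placeholder name, and derives the flag in a SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (input INTEGER, start_date TEXT, person_id INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (42707, "2017-01-01", 227317),
        (40000, "2018-01-01", 227317),
        (42400, "2019-01-01", 227317),
        (42400, "2019-01-02", 227317),
    ],
)

# LAG(input, 1, input) falls back to the row's own value when there is
# no previous row, so the first row of each person compares equal (flag 0).
query = """
SELECT start_date,
       CASE WHEN LAG(input, 1, input)
                 OVER (PARTITION BY person_id ORDER BY start_date) = input
            THEN 0 ELSE 1
       END AS flag
FROM t
ORDER BY person_id, start_date
"""

flags = [flag for _, flag in conn.execute(query)]
print(flags)
```

Because the flag is derived on each read, correcting a historical INPUT value cannot leave a stale flag behind.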

Filter the table with latest date having duplicate OrderId

I have the following table:
I need to keep only the rows whose start date is the latest for their order id. For the table given, rows 2 and 3 should be the output.
Rows 1 and 2 have the same order id and order date, but the start date of row 2 is later than that of row 1. The same goes for rows 3 and 4, so I need row 3. I am trying to write the query in SQL Server. Any help is appreciated. Please let me know if you need more details. Apologies for my poor English.
You can do this easily with a ROW_NUMBER() windowed function:
;With Cte As
(
Select *,
Row_Number() Over (Partition By OrderId Order By StartDate Desc) RN
From YourTable
)
Select *
From Cte
Where RN = 1
But I question the StartDate datatype. It looks like these are being stored as VARCHAR. If that is the case, you need to CONVERT the value to a DATETIME:
;With Cte As
(
Select *,
Row_Number() Over (Partition By OrderId
Order By Convert(DateTime, StartDate) Desc) RN
From YourTable
)
Select *
From Cte
Where RN = 1
Another way using a derived table.
select
t.*
from
YourTable t
inner join
(select OrderId, max(StartDate) dt
from YourTable
group by OrderId) t2 on t2.dt = t.StartDate and t2.OrderId = t.OrderId
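The derived-table approach can be tried out with a short sketch in Python's built-in sqlite3 module. The table contents are hypothetical (the table in the question is not shown), and RowNo is an invented column used only to identify rows; dates are stored as ISO-style text so that MAX() compares them chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable (RowNo INTEGER, OrderId INTEGER, StartDate TEXT)"
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [
        (1, 101, "2021-01-05"),
        (2, 101, "2021-03-10"),  # latest start date for order 101
        (3, 102, "2021-02-01"),  # latest start date for order 102
        (4, 102, "2021-01-15"),
    ],
)

# Find the latest StartDate per OrderId, then join back to fetch full rows.
query = """
SELECT t.RowNo
FROM YourTable t
INNER JOIN (SELECT OrderId, MAX(StartDate) AS dt
            FROM YourTable
            GROUP BY OrderId) t2
        ON t2.OrderId = t.OrderId
       AND t2.dt = t.StartDate
ORDER BY t.RowNo
"""

latest_rows = [r[0] for r in conn.execute(query)]
print(latest_rows)
```

One difference worth keeping in mind: if two rows of the same OrderId share the latest StartDate, this join returns both, whereas the ROW_NUMBER() version returns exactly one row per OrderId.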

How to update rows based only on ROW_NUMBER()?

This SQL query:
SELECT ROW_NUMBER() OVER (PARTITION BY ID, YEAR order by ID ), ID, YEAR
from table t
gives me the following result set:
1 1000415591 2012
1 1000415591 2013
2 1000415591 2013
1 1000415591 2014
2 1000415591 2014
How could I update the records where ROW_NUMBER() equals 2? The other fields of these records are identical (select distinct from table where id = 1000415591 gives 3 records, while there are 5 without the distinct keyword), so I can rely only on the ROW_NUMBER() value.
I need a solution for Oracle; I saw something similar for SQL Server, but it won't work with Oracle.
You could use a MERGE statement which is quite verbose and easy to understand.
For example,
MERGE INTO t s
USING
(SELECT rowid AS rid,
ROW_NUMBER() OVER (PARTITION BY ID, YEAR order by ID ) RN
FROM t
) u ON (s.rowid = u.rid) -- join on ROWID so each source row matches exactly one target row
WHEN MATCHED THEN
UPDATE SET s.YEAR = some_value WHERE u.RN = 2
/
Note: You cannot update a column that is referenced in the ON clause.
Try to use ROWID field:
UPDATE T
SET t.year = t.year*1000
WHERE (rowid,2) in (SELECT rowid,
ROW_NUMBER()
OVER (PARTITION BY ID, t.YEAR order by ID )
FROM T)
If you need to update a range of row numbers, then:
UPDATE T
SET t.year = t.year*1000
WHERE rowid in ( SELECT rowid FROM
(
SELECT rowid,
ROW_NUMBER()
OVER (PARTITION BY ID, t.YEAR order by ID ) as RN
FROM T
) T2 WHERE RN >=2 AND RN <=10
)
This is not the update statement but this is how to get the 2 rows you wanted to update:
SELECT *
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY ID, YEAR order by ID ) as rn, ID, YEAR
from t )
where rn = 2
After posting the question, I realized that this could be the wrong approach. I can modify the table and add new fields, so a better solution is to create one more field, IDENTITY, and fill it with numbers from a new sequence, from 1 to the total number of rows. Then I can update fields based on this IDENTITY field.
I'll keep this question open in case someone comes up with a solution based on the ROW_NUMBER() analytic function.
update TABLE set NEW_ID = TABLE_SEQ.nextval
where IDENTITY in (
select IDENTITY from (
select row_number() over(PARTITION BY ID, YEAR order by ID) as row_num, t.ID, t."YEAR", t.IDENTITY
from TABLE t
) where row_num > 1
)
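The ROWID-based idea from the answers above can be tried without an Oracle instance. The sketch below swaps in SQLite (via Python's sqlite3 module, assuming SQLite 3.25 or later), whose implicit rowid plays a role similar to Oracle's ROWID for this purpose; it is an analogy, not Oracle syntax. The table name t and the year * 1000 update are taken from the answer, and the rows mirror the result set in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, year INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        (1000415591, 2012),
        (1000415591, 2013),
        (1000415591, 2013),  # exact duplicate
        (1000415591, 2014),
        (1000415591, 2014),  # exact duplicate
    ],
)

# The duplicates are indistinguishable by column values, so the physical
# row identifier is the only way to target "the second copy".
conn.execute(
    """
    UPDATE t
    SET year = year * 1000
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (PARTITION BY id, year ORDER BY id) AS rn
            FROM t
        )
        WHERE rn = 2
    )
    """
)

years = sorted(r[0] for r in conn.execute("SELECT year FROM t"))
print(years)
```

Which of two identical rows receives row number 2 is arbitrary, which is harmless here precisely because the rows are identical.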

SQL update last occurrence

I have a simple SELECT query that works fine and returns one row, which is the last occurrence of a specific value in order_id column. I want to update this row. However, I cannot combine this SELECT query with the UPDATE query.
This is the working query that returns one row, which I want to update:
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY order_id
ORDER BY start_hour DESC) rn
FROM general_report
WHERE order_id = 16836
) q
WHERE rn = 1
I tried many combinations to update the row returned by this statement. For example, I tried to remove SELECT * and update the table q as in the following, but it didn't work; the error says that relation q does not exist.
UPDATE q
SET q.cost = 550.01685
FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY order_id
ORDER BY start_hour DESC) rn
FROM general_report
WHERE order_id = 16836
) q
WHERE rn = 1
How can I combine these codes with a correct UPDATE syntax? In case needed, I test my codes at SQL Manager for PostgreSQL.
Try something like this. I am not sure on PostgreSQL syntax:
UPDATE general_report AS d
SET cost = 550.01685
FROM (
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY order_id
ORDER BY start_hour DESC) rn
FROM general_report
WHERE order_id = 16836
) q
WHERE rn = 1
) s
WHERE d.id = s.id
An alternative method for updating the most recent record is to use NOT EXISTS (that is, no even more recent record exists):
UPDATE general_report dst
SET cost = 550.01685
WHERE order_id = 16836
AND NOT EXISTS (
SELECT *
FROM general_report nx
WHERE nx.order_id = dst.order_id
AND nx.start_hour > dst.start_hour
);
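A quick way to check the NOT EXISTS form is the sketch below, using Python's built-in sqlite3 module. The id column and the sample rows are assumptions for illustration; the question does not show the table definition. Since the statement uses no window functions, it should not depend on a recent SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE general_report ("
    "id INTEGER PRIMARY KEY, order_id INTEGER, start_hour TEXT, cost REAL)"
)
conn.executemany(
    "INSERT INTO general_report VALUES (?, ?, ?, ?)",
    [
        (1, 16836, "2020-01-01 08:00", 100.0),
        (2, 16836, "2020-01-01 12:00", 200.0),  # last occurrence for 16836
        (3, 16836, "2020-01-01 10:00", 300.0),
        (4, 99999, "2020-01-01 23:00", 400.0),  # different order, untouched
    ],
)

new_cost = 550.01685

# Update the row of order 16836 for which no later start_hour exists.
conn.execute(
    """
    UPDATE general_report
    SET cost = ?
    WHERE order_id = 16836
      AND NOT EXISTS (
          SELECT 1
          FROM general_report nx
          WHERE nx.order_id = general_report.order_id
            AND nx.start_hour > general_report.start_hour
      )
    """,
    (new_cost,),
)

updated_ids = [r[0] for r in conn.execute(
    "SELECT id FROM general_report WHERE cost = ? ORDER BY id", (new_cost,))]
print(updated_ids)
```

If two rows of the same order share the latest start_hour, both satisfy NOT EXISTS and both are updated, so a tie-breaker is needed when exactly one row must change.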
Try the query below:
UPDATE general_report
SET cost = 550.01685
where id in
(select id from
(
SELECT *,
ROW_NUMBER() OVER(PARTITION BY order_id
ORDER BY start_hour DESC) rn
FROM general_report
WHERE order_id = 16836
) q
WHERE rn = 1)