How to delete duplicates data that is in between two common value? - sql

How can I delete duplicate data based on the common value (Start and End)
(Time is unique key)
My table is:
Time
Data
10:24:11
Start
10:24:12
Result
10:24:13
Result
10:24:14
End
10:24:15
Start
10:24:16
Result
10:24:17
End
I want to get Data: Result in between Start and End that is with the MAX(TIME) when duplication does occur. as such
The result that I want:
Time
Data
10:24:11
Start
10:24:13
Result
10:24:14
End
10:24:15
Start
10:24:16
Result
10:24:17
End
I have tried rearranging the data, but couldn't seems to get the result that I want, Could someone give their advice on this case?
Update
I ended up not using either of the the approach suggested by #fredt and #airliquide as my version of HSQLDB doesn't support the function.
so what I did was, adding sequence and making Start = 1, Result = 2, and End = 3.
Sequence
Time
Data
Indicator
1
10:24:11
Start
1
2
10:24:12
Result
2
3
10:24:13
Result
2
4
10:24:14
End
3
5
10:24:15
Start
1
6
10:24:16
Result
2
7
10:24:17
End
3
Thereon, I make use of the indicator and sequence to get only latest Result. Such that if previous row is 2 (which is result), remove it.
The guide that I follow:
From: Is there a way to access the "previous row" value in a SELECT statement?
select t1.value - t2.value from table t1, table t2
where t1.primaryKey = t2.primaryKey - 1

Hi a first approach will be to use a lead function as folow
select hour,status from (select *,lead(status,1) over ( order by hour) as lead
from newtable)compare
where compare.lead <> status
OR lead is null
Give me what's expected using a postgres engine.

You can do this sort of thing with SQL procedures.
-- create the table with only two columns
CREATE TABLE actions (attime TIME UNIQUE, data VARCHAR(10));
-- drop the procedure if it exists
DROP PROCEDURE del_duplicates IF EXISTS;
create procedure del_duplicates() MODIFIES SQL DATA begin atomic
DECLARE last_time time(0) default null;
for_loop:
-- loop over the rows in order
FOR SELECT * FROM actions ORDER BY attime DO
-- each time 'Start' is found, clear the last_time variable
IF data = 'Start' THEN
SET last_time = NULL;
ITERATE for_loop;
END IF;
-- each time 'Result' is found, delete the row with previous time
-- if last_time is null, no row is actually deleted
IF data = 'Result' THEN
DELETE FROM actions WHERE attime = last_time;
-- then store the latest time
SET last_time = attime;
ITERATE for_loop;
END IF;
END FOR;
END
Your data must all belong to a single day, otherwise there will be strange overlaps that cannot be distinguished. It is better to use TIMESTAMP instead of TIME.

Related

Complex INSERT INTO SELECT statement in SQL

I have two tables in SQL. I need to add rows from one table to another. The table to which I add rows looks like:
timestamp, deviceID, value
2020-10-04, 1, 0
2020-10-04, 2, 0
2020-10-07, 1, 1
2020-10-08, 2, 1
But I have to add a row to this table if a state for a particular deviceID was changed in comparison to the last timestamp.
For example this record "2020-10-09, 2, 1" won't be added because the value wasn't changed for deviceID = 2 and last timestamp = "2020-10-08". In the same time record "2020-10-09, 1, 0" will be added, because the value for deviceID = 1 was changed to 0.
I have a problem with writing a query for this logic. I have written something like this:
insert into output
select *
from values
where value != (
select value
from output
where timestamp = (select max(timestamp) from output) and output.deviceID = values.deviceID)
Of course it doesn't work because of the last part of the query "and output.deviceID = values.deviceID".
Actually the problem is that I don't know how to take the value from "output" table where deviceID is the same as in the row that I try to insert.
I would use order by and something to limit to one row:
insert into output
select *
from values
where value <> (select o2.value
from output o2
where o2.deviceId = v.deviceId
order by o2.timestamp desc
fetch first 1 row only
);
The above is standard SQL. Specific databases may have other ways to express this, such as limit or top (1).

Compare counts between two tables even when one of the tables has no record in oracle

I am stuck in a difficult situation. I need to compare the counts of 2 and update one of the tables based on the comparison. Below are the details:
I have 2 tables: dwh_idh_to_iva_metadata_t and lnd_sml_price_t.
Every day there is a package running that loads data from source table dwh_sml_price_t to destination table lnd_sml_price_t. This package also checks the counts between the tables "dwh_idh_to_iva_metadata_t" and "lnd_sml_price_t". If the counts are matching then it updates few columns in the table dwh_idh_to_iva_metadata_t and if the counts are not matching then the package throws an exception and exits. To do the count comparison we created a cursor and then fetching that cursor for comparison. The code for cursor is as follows:
CURSOR C_CNT_PRICE IS
SELECT
lnd.smp_batchrun_id batch_id,
lnd.lnd_count,
dwh.dwh_count
FROM
(
SELECT
smp_batchrun_id,
COUNT(*) lnd_count
FROM
iva_landing.lnd_sml_price_t
GROUP BY
smp_batchrun_id
) lnd
LEFT JOIN (
SELECT
batchrun_id,
sent_records_count dwh_count,
dwh_sending_table
FROM
dwh.dwh_idh_to_iva_metadata_t
) dwh ON dwh.batchrun_id = lnd.smp_batchrun_id
WHERE
dwh.dwh_sending_table = 'DWH_SML_PRICE_T'
ORDER BY
1 DESC; `
And the actual code for comparison is:
` FOR L_COUNT IN C_CNT_PRICE LOOP --0001,0002
IF L_COUNT.lnd_count = L_COUNT.dwh_count THEN
UPDATE DWH.DWH_IDH_TO_IVA_METADATA_T idh
SET idh.IVA_RECEIVING_TABLE = 'LND_SML_PRICE_T',
idh.RECEIVED_DATE = SYSDATE,
idh.RECEIVED_RECORDS_COUNT = L_COUNT.lnd_count,
idh.status = 'Verified'
WHERE L_COUNT.batch_id = idh.batchrun_id
AND idh.dwh_sending_table = 'DWH_SML_PRICE_T';
COMMIT;
ELSE
RAISE EXCPT_MISSDATA; -- Throw error and exit process immediately
END IF;
END LOOP; `
Now, the problem is that there may be certain cases when the table "LND_SML_PRICE_T" is not having any data, in which case, DWH_IDH_TO_IVA_METDATA_T should have the columns updated as in case of matching counts.
I need help in modifying the code so that comparison is done even in case of no records in LND_SML_PRICE_T table.
Thanks!
If there are no rows in LND_SML_PRICE_T table, cursor C_CNT_PRICE's LND_COUNT equals 0 which means that comparison code should be modified to
FOR L_COUNT IN C_CNT_PRICE LOOP
IF (L_COUNT.lnd_count = L_COUNT.dwh_count)
or L_COUNT.lnd_count = 0 --> this
THEN

Fetch rows based on condition

I am using PostgreSQL on Amazon Redshift.
My table is :
drop table APP_Tax;
create temp table APP_Tax(APP_nm varchar(100),start timestamp,end1 timestamp);
insert into APP_Tax values('AFH','2018-01-26 00:39:51','2018-01-26 00:39:55'),
('AFH','2016-01-26 00:39:56','2016-01-26 00:40:01'),
('AFH','2016-01-26 00:40:05','2016-01-26 00:40:11'),
('AFH','2016-01-26 00:40:12','2016-01-26 00:40:15'), --row x
('AFH','2016-01-26 00:40:35','2016-01-26 00:41:34') --row y
Expected output:
'AFH','2016-01-26 00:39:51','2016-01-26 00:40:15'
'AFH','2016-01-26 00:40:35','2016-01-26 00:41:34'
I had to compare start and endtime between alternate records and if the timedifference < 10 seconds get the next record endtime till last or final record.
I,e datediff(seconds,2018-01-26 00:39:55,2018-01-26 00:39:56) Is <10 seconds
I tried this :
SELECT a.app_nm
,min(a.start)
,max(b.end1)
FROM APP_Tax a
INNER JOIN APP_Tax b
ON a.APP_nm = b.APP_nm
AND b.start > a.start
WHERE datediff(second, a.end1, b.start) < 10
GROUP BY 1
It works but it doesn't return row y when conditions fails.
There are two reasons that row y is not returned is due to the condition:
b.start > a.start means that a row will never join with itself
The GROUP BY will return only one record per APP_nm value, yet all rows have the same value.
However, there are further logic errors in the query that will not successfully handle. For example, how does it know when a "new" session begins?
The logic you seek can be achieved in normal PostgreSQL with the help of a DISTINCT ON function, which shows one row per input value in a specific column. However, DISTINCT ON is not supported by Redshift.
Some potential workarounds: DISTINCT ON like functionality for Redshift
The output you seek would be trivial using a programming language (which can loop through results and store variables) but is difficult to apply to an SQL query (which is designed to operate on rows of results). I would recommend extracting the data and running it through a simple script (eg in Python) that could then output the Start & End combinations you seek.
This is an excellent use-case for a Hadoop Streaming function, which I have successfully implemented in the past. It would take the records as input, then 'remember' the start time and would only output a record when the desired end-logic has been met.
Sounds like what you are after is "sessionisation" of the activity events. You can achieve that in Redshift using Windows Functions.
The complete solution might look like this:
SELECT
start AS session_start,
session_end
FROM (
SELECT
start,
end1,
lead(end1, 1)
OVER (
ORDER BY end1) AS session_end,
session_boundary
FROM (
SELECT
start,
end1,
CASE WHEN session_switch = 0 AND reverse_session_switch = 1
THEN 'start'
ELSE 'end' END AS session_boundary
FROM (
SELECT
start,
end1,
CASE WHEN datediff(seconds, end1, lead(start, 1)
OVER (
ORDER BY end1 ASC)) > 10
THEN 1
ELSE 0 END AS session_switch,
CASE WHEN datediff(seconds, lead(end1, 1)
OVER (
ORDER BY end1 DESC), start) > 10
THEN 1
ELSE 0 END AS reverse_session_switch
FROM app_tax
)
AS sessioned
WHERE session_switch != 0 OR reverse_session_switch != 0
UNION
SELECT
start,
end1,
'start'
FROM (
SELECT
start,
end1,
row_number()
OVER (PARTITION BY APP_nm
ORDER BY end1 ASC) AS row_num
FROM APP_Tax
) AS with_row_number
WHERE row_num = 1
) AS with_boundary
) AS with_end
WHERE session_boundary = 'start'
ORDER BY start ASC
;
Here is the breadkdown (by subquery name):
sessioned - we first identify the switch rows (out and in), the rows in which the duration between end and start exceeds limit.
with_row_number - just a patch to extract the first row because there is no switch into it (there is an implicit switch that we record as 'start')
with_boundary - then we identify the rows where specific switches occur. If you run the subquery by itself it is clear that session start when session_switch = 0 AND reverse_session_switch = 1, and ends when the opposite occurs. All other rows are in the middle of sessions so are ignored.
with_end - finally, we combine the end/start of 'start'/'end' rows into (thus defining session duration), and remove the end rows
with_boundary subquery answers your initial question, but typically you'd want to combine those rows to get the final result which is the session duration.

Return overlapping date records in SQL

I used the following query to fetch the overlapping records in SQL:
SELECT QUOTE_ID,FUNCTION_ID,FUNCTION_DT,FUNC_SPACE_ID,FN_START_TIME,FN_END_TIME,DATE_AUTH_LEVEL
FROM R_13_ALL_RESERVED A
WHERE
A.FUNC_SPACE_ID = '401-ZFU-52'
AND A.FUNCTION_DT = TO_DATE('09/03/2015','MM/DD/YYYY')
AND EXISTS ( SELECT 'X'
FROM R_13_ALL_RESERVED B
WHERE A.PROPERTY = B.PROPERTY
AND A.FUNCTION_DT = B.FUNCTION_DT
AND A.FUNCTION_ID <> B.FUNCTION_ID
AND ( ( A.FN_START_TIME > B.FN_START_TIME
AND A.FN_START_TIME < B.FN_END_TIME)
OR ( B.FN_START_TIME > A.FN_START_TIME
AND B.FN_START_TIME < A.FN_END_TIME)
OR ( A.FN_START_TIME = B.FN_START_TIME
AND A.FN_END_TIME = B.FN_END_TIME)
)
)
But eventhough the dates are not overlapping it still returns the records as overlapping.
I am missing some thing here?
Also if the date records overlap, I need to compare the count of function_id records with DATE_AUTH_LEVEL, if 2 function_id records overlap and the count of function_id would be 2 and DATE_AUTH_LEVEL is 1, such record should in the result set.
Please find the data set in SQLFiddle
http://sqlfiddle.com/#!9/95874/1
Desired Output : The SQL should return overlapping FN_START_TIME and FN_END_TIME for a function_space_id and it's function_dt
In the provided example, row 5 and 6 overlap for the function space id '401-ZFU-12' and function_dt 'August, 15 2015' and all others are not overlapping
The simplest predicate (where clause condition) for detecting the overlap of two ranges is to compare the start of the first range with the end of the 2nd range, and the start of the 2nd range with the end of the first range:
WHERE R1.Start_Date <= R2.End_Date
AND R2.Start_Date <= R1.End_Date
As you can see each of the two inequalities looks at a start and end value from separate records (R1 and R2 and then R2 and R1 respectively) all that remains is to add the conditions that will correlate the records, and also ensure that you aren't comparing a row to itself So if you want to find all Common_IDs that have Distinct_IDs with over lapping date ranges:
select *
from Your_Table R1
where exists (select 1 from Your_Table R2
where R1.Common_ID = R2.Common_ID
and R1.Distinct_ID <> R2.Distinct_ID
and R1.Start_Date <= R2.End_Date
and R2.Start_Date <= R1.End_Date)
If there is no Distinct_ID to use, you can use R1.rowid <> R2.rowid in place of R1.Distinct_ID <> R2.Distinct_ID
Here is an approach to troubleshooting the issue on your end.
My first suspicion is that the results of your exists clause are too broad and thus returning rows for every record matching in the outer clause unexpectedly. Likely there are rows that do not fall on the desired date or spaceid that share one component of their interval with your inner criteria.
Inspect the results of the inner select statement (the one within the exists clause) for an example row, exchanging all the 'A' aliased values with actual values from one of the rows returned you did not expect to receive.
Additionally, you can inspect what I think would be a semi join in the execution profile to see what the join criteria are. If you expect it to be filtered by a constant for 'FUNC_SPACE_ID' of '401-ZFU-52', you will discover that it is not.

SQL query to retrieve discrepancies in punch order

Consider the table below.
The rule is - an employee cannot take a break (needs to clock out) from job num 1 before clocking in to job num 2. In this case the employee "A" was supposed to clock OUT instead of BREAK on jobnum 1 because he later clocked in to JobNum#2
Is it possible to write a query to find this in plain SQL?
Idea is to check if next record is proper one. To find next record one has to find first punchtime after current for same employee. Once this information is retrieved one can isolate record itself and check fields of interest, specifically is jobnum the same and [optionally] is punch_type 'IN'. If it is not, not exists evaluates to true and record is output.
select *
from #punch p
-- Isolate breaks only
where p.punch_type = 'BREAK'
-- The ones having no proper entry
and not exists
(
select null
-- The same table
from #punch a
where a.emplid = p.emplid
and a.jobnum = p.jobnum
-- Next record has punchtime from subquery
and a.punchtime = (select min (n.punchtime)
from #punch n
where n.emplid = p.emplid
and n.punchtime > p.punchtime
)
-- Optionally you might force next record to be 'IN'
and a.punch_type = 'IN'
)
Replace #punch with your table name. -- is comment in Sql Server; if you are not using this database, remove this lines. It is a good idea to tag your database and version as there are probably faster/better ways to do this.
Here is the SQL
select * from employees e1 cross join employees e2 where e1.JOBNUM = (e2.JOBNUM + 1)
and e1.PUNCH_TYPE = 'BREAK' and e2.PUNCH_TYPE = 'IN'
and e1.PUNCHTIME < e2.PUNCHTIME
and e1.EMPLID = e2.EMPLID