Select duplicate rows based on time difference and occurrence count - sql

I have a table like this:
As you can see, some records with the same farsi_pelak value were added (detected) more than once within a few seconds.
This happened because of an application bug, which has since been fixed.
Now I need to select and then delete the duplicate rows that were added at the same time (within a few seconds of each other).
This is my query:
SELECT TOP 100 PERCENT
y.id, y.farsi_pelak , y.detection_date_p , y.detection_time
FROM dbo._tbl_detection y
INNER JOIN
(SELECT TOP 100 PERCENT
farsi_pelak , detection_date_p
FROM dbo._tbl_detection WHERE camera_id = 2
GROUP BY farsi_pelak , detection_date_p
HAVING COUNT(farsi_pelak)>1) dt
ON
y.farsi_pelak=dt.farsi_pelak AND y.detection_date_p =dt.detection_date_p
ORDER BY farsi_pelak , detection_date_p DESC
But I can't calculate the time difference, because detection_time is not part of the GROUP BY and so is not available in the grouped result.

If you use SQL Server 2012 or later, you can use the LAG function to get values from the "previous" row.
Then calculate the difference between adjacent timestamps and keep the rows where this difference is small.
WITH
CTE
AS
(
SELECT
id
,farsi_pelak
,detection_date_p
,detection_time
,LAG(detection_time) OVER (PARTITION BY farsi_pelak, detection_date_p
ORDER BY detection_time) AS prev_detection_time -- compare times within the same date only
FROM dbo._tbl_detection
)
,CTE_Diff
AS
(
SELECT
id
,farsi_pelak
,detection_date_p
,detection_time
,prev_detection_time
,DATEDIFF(second, prev_detection_time, detection_time) AS diff
FROM CTE
)
SELECT
id
,farsi_pelak
,detection_date_p
,detection_time
,prev_detection_time
,diff
FROM CTE_Diff
WHERE
diff <= 10
;
Once you have run this query and verified that it returns only the rows you want to delete, you can change the final SELECT to DELETE:
WITH
CTE
AS
(
SELECT
id
,farsi_pelak
,detection_date_p
,detection_time
,LAG(detection_time) OVER (PARTITION BY farsi_pelak, detection_date_p
ORDER BY detection_time) AS prev_detection_time -- compare times within the same date only
FROM dbo._tbl_detection
)
,CTE_Diff
AS
(
SELECT
id
,farsi_pelak
,detection_date_p
,detection_time
,prev_detection_time
,DATEDIFF(second, prev_detection_time, detection_time) AS diff
FROM CTE
)
DELETE
FROM CTE_Diff
WHERE
diff <= 10
;
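The LAG approach above can be tried out on a small data set before touching real data. The following is only a minimal sketch, assuming SQLite 3.25 or later (driven from Python's sqlite3 module) instead of SQL Server; the table detection and the columns plate and detected_at are hypothetical stand-ins for dbo._tbl_detection, and the time difference is computed with SQLite's strftime rather than DATEDIFF.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for dbo._tbl_detection.
conn.execute(
    "CREATE TABLE detection (id INTEGER PRIMARY KEY, plate TEXT, detected_at TEXT)"
)
conn.executemany(
    "INSERT INTO detection (id, plate, detected_at) VALUES (?, ?, ?)",
    [
        (1, "A1", "2020-01-01 10:00:00"),
        (2, "A1", "2020-01-01 10:00:04"),  # 4 s after id 1: duplicate
        (3, "A1", "2020-01-01 12:30:00"),  # hours later: genuine detection
        (4, "B2", "2020-01-01 10:00:02"),
        (5, "B2", "2020-01-02 10:00:05"),  # next day: genuine detection
    ],
)

duplicate_ids = [
    row[0]
    for row in conn.execute(
        """
        WITH with_prev AS (
            SELECT id,
                   detected_at,
                   LAG(detected_at) OVER (
                       PARTITION BY plate ORDER BY detected_at
                   ) AS prev_at
            FROM detection
        )
        SELECT id
        FROM with_prev
        WHERE prev_at IS NOT NULL
          -- seconds between this detection and the previous one for the plate
          AND CAST(strftime('%s', detected_at) AS INTEGER)
            - CAST(strftime('%s', prev_at) AS INTEGER) <= 10
        ORDER BY id
        """
    )
]
print(duplicate_ids)
```

Because the sketch stores date and time in one column, the comparison across midnight needs no special handling; with separate date and time columns, as in the question, the date has to be taken into account as well.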

I guess you need ROW_NUMBER() to rank the detections by time, as below. This keeps the earliest detection for each plate and date (rn = 1) and selects the rest (rn > 1) as duplicates:
SELECT id, farsi_pelak, detection_date_p, detection_time
FROM (
SELECT y.id, y.farsi_pelak, y.detection_date_p, y.detection_time,
ROW_NUMBER() OVER (PARTITION BY y.farsi_pelak, y.detection_date_p
ORDER BY y.detection_time) AS rn
FROM dbo._tbl_detection y -- or the query from the question, used as a derived table
) t
WHERE rn > 1

Related

Select every second record then determine earliest date

I have a table that looks like the following:
I have to select every second record per PatientID, which would give the following result (my last query returns this result):
I then have to select the record with the oldest date, which would be the following (this is the end result I want):
What I have done so far: I have a CTE that gets all the data I need
WITH cte
AS
(
SELECT visit.PatientTreatmentVisitID, mat.PatientMatchID,pat.PatientID,visit.RegimenDate AS VisitDate,
ROW_NUMBER() OVER(PARTITION BY mat.PatientMatchID, pat.PatientID ORDER BY visit.VisitDate ASC) AS RowNumber
FROM tblPatient pat INNER JOIN tblPatientMatch mat ON mat.PatientID = pat.PatientID
LEFT JOIN tblPatientTreatmentVisit visit ON visit.PatientID = pat.PatientID
)
I then write a query against the CTE but so far I can only return the second row for each patientID
SELECT *
FROM
(
SELECT PatientTreatmentVisitID,PatientMatchID,PatientID, VisitDate, RowNumber FROM cte
) as X
WHERE RowNumber = 2
How do I return the record with the oldest date only? Is there perhaps a MIN() function that I could be including somewhere?
If I follow you correctly, you can just order your existing resultset and retain the top row only.
In standard SQL, you would write this using a FETCH clause:
SELECT *
FROM (
SELECT
visit.PatientTreatmentVisitID,
mat.PatientMatchID,
pat.PatientID,
visit.RegimenDate AS VisitDate,
ROW_NUMBER() OVER(PARTITION BY mat.PatientMatchID, pat.PatientID ORDER BY visit.VisitDate ASC) AS rn
FROM tblPatient pat
INNER JOIN tblPatientMatch mat ON mat.PatientID = pat.PatientID
LEFT JOIN tblPatientTreatmentVisit visit ON visit.PatientID = pat.PatientID
) t
WHERE rn = 2
ORDER BY VisitDate
OFFSET 0 ROWS FETCH FIRST 1 ROW ONLY
This syntax is supported in Postgres, Oracle, SQL Server (and possibly other databases).
If you need the oldest date among all the selected rows (the second row for each patient ID), you can try the window function MIN:
SELECT * FROM
(
SELECT *, MIN(VisitDate) OVER (Order By VisitDate) MinDate
FROM
(
SELECT PatientTreatmentVisitID,PatientMatchID,PatientID, VisitDate,
RowNumber FROM cte
) as X
WHERE RowNumber = 2
) Y
WHERE VisitDate=MinDate
Or you can use SELECT TOP, which limits the number of rows returned in a query result set:
SELECT TOP 1 PatientTreatmentVisitID,PatientMatchID,PatientID, VisitDate FROM
(
SELECT *
FROM
(
SELECT PatientTreatmentVisitID,PatientMatchID,PatientID, VisitDate,
RowNumber FROM cte
) as X
WHERE RowNumber = 2
) Y
ORDER BY VisitDate
For simplicity, add an ORDER BY on the date column (ascending, so that the oldest date comes first) and use TOP to get the first row only:
SELECT TOP 1 *
FROM
(
SELECT PatientTreatmentVisitID,PatientMatchID,PatientID, VisitDate, RowNumber FROM cte
) as X
WHERE RowNumber = 2
order by VisitDate asc

sql: Select count(*) - nth record from each group

I'm grouping by tenant_id. I want to select the count() - 1000th record (ordered by _updated time) from each GROUPBY group, for the groups where count() is greater than 1000. As follows:
select t1.tenant_id,
(select temp._updated
from trace temp
where temp.tenant_id = t1.tenant_id
order by _updated limit 1 offset
count(*) - 1000
) as timekey
from fgc.trace as t1
group by tenant_id
having count(*) > 1000;
But this is not allowed as count(*) cannot be used inside the subquery.
So I tried the following, which still doesn't work as I don't have access to t1 since this is not a join.
select t1.tenant_id,
(select temp._updated
from trace temp
where temp.tenant_id = t1.tenant_id
order by _updated limit 1 offset
(select count(*)-1000
from trace t2
group by tenant_id
having t2.tenant_id = t1.tenant_id)
) as timekey
from fgc.trace as t1
group by tenant_id
having count(*) > 1000;
So how can I get the following?
tenant_id | timekey
+-----------+----------------------------------+
n7ia6ryc | 2019-07-23 23:09:49.951406+00:00
You seem to want ROW_NUMBER(). CockroachDB supports window functions, so:
SELECT tenant_id, _updated AS timekey
FROM (
SELECT
tenant_id,
_updated,
ROW_NUMBER() OVER(PARTITION BY tenant_id ORDER BY _updated DESC) rn
FROM fgc.trace
) x WHERE rn = 1001
For each tenant_id, this returns the timestamp of the 1001st most recent record. A tenant with 1000 records or fewer will not appear in the results.
This returns the tenants that have more than 1000 rows:
select x.tenant_id
from (
select t.tenant_id,
row_number() over (partition by t.tenant_id order by t._updated) as tenant_number
from fgc.trace as t
) x
where x.tenant_number > 1000
group by x.tenant_id
Just the one timestamp would look like this:
select min(x._updated) as min_timestamp
from (
select t.tenant_id, t._updated,
row_number() over (partition by t.tenant_id order by t._updated) as tenant_number
from fgc.trace as t
) x
where x.tenant_number > 1000
Note that grouping does not matter here, because each row can only be in one group and you are only looking at one row.
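To see how the ROW_NUMBER() filter picks one row per tenant, here is a small sketch, assuming SQLite 3.25 or later via Python's sqlite3 module rather than CockroachDB. The threshold N = 3 stands in for the 1000 of the question, and the column is named updated (integers instead of timestamps) purely for illustration.

```python
import sqlite3

N = 3  # stands in for the 1000 used in the question

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trace (tenant_id TEXT, updated INTEGER)")
# Tenant "a" has 5 rows (more than N), tenant "b" has only 2.
rows = [("a", t) for t in range(1, 6)] + [("b", t) for t in range(1, 3)]
conn.executemany("INSERT INTO trace VALUES (?, ?)", rows)

result = conn.execute(
    """
    SELECT tenant_id, updated AS timekey
    FROM (
        SELECT tenant_id,
               updated,
               ROW_NUMBER() OVER (
                   PARTITION BY tenant_id ORDER BY updated DESC
               ) AS rn
        FROM trace
    ) AS x
    WHERE rn = ?          -- the (N + 1)th most recent row per tenant
    ORDER BY tenant_id
    """,
    (N + 1,),
).fetchall()
print(result)
```

Tenant "b" drops out without any HAVING clause, simply because none of its rows reaches row number N + 1.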

Per year one maximum date row according to previous row date

I have a table with two columns, and I want to fetch 6 years of data according to these rules:
The first row should be the row with the maximum date that is on or before an input date (I will pass the input date).
From the second row to the sixth row, I need the row with the maximum date that is earlier than the previously selected row's date. There should not be two rows for the same year; I need only the latest one relative to the previous row, but not in the same year.
declare @tbl table (id int identity, marketdate date )
insert into @tbl (marketdate)
values('2018-05-31'),
('2017-06-01'),
('2017-05-28'),
('2017-04-28'),
('2016-05-26'),
('2015-04-18'),
('2015-04-20'),
('2015-03-18'),
('2014-05-31'),
('2014-04-18'),
('2013-04-15')
output:
id marketdate
1 2018.05.31
3 2017.05.28
5 2016.05.27
7 2015.04.20
9 2014.04.18
10 2013.04.15
Can't you do this with a simple order by/desc?
SELECT TOP 6 id, max(marketdate) FROM tbl
WHERE tbl.marketdate <= @date
GROUP BY YEAR(marketdate), id, marketdate
ORDER BY YEAR(marketdate) DESC
Based purely on your "Output" given your sample data, I believe the following is what you are after (The max date for each distinct year of data):
SELECT TOP 6
max(marketdate),
Year(marketDate) as marketyear
FROM @tbl t
WHERE t.marketdate <= getdate()
GROUP BY YEAR(marketdate)
ORDER BY YEAR(marketdate) DESC;
SQLFiddle of this matching your output
You can use ROW_NUMBER() if you are using SQL Server:
select top 6
id
, t.marketdate
from ( select rn = row_number() over (partition by year(marketdate) order by marketdate desc)
, id
, marketdate
from @tbl) as t
where t.rn = 1
order by t.marketdate desc
The following recursively searches for the next date, which must be at least one year earlier than the previous date.
Your parameterised start position goes where I chose 2018-06-01.
WITH
recursiveSearch AS
(
SELECT
id,
marketDate
FROM
(
SELECT
yourTable.id,
yourTable.marketDate,
ROW_NUMBER() OVER (ORDER BY yourTable.marketDate DESC) AS relative_position
FROM
yourTable
WHERE
yourTable.marketDate <= '2018-06-01'
)
search
WHERE
relative_position = 1
UNION ALL
SELECT
id,
marketDate
FROM
(
SELECT
yourTable.id,
yourTable.marketDate,
ROW_NUMBER() OVER (ORDER BY yourTable.marketDate DESC) AS relative_position
FROM
yourTable
INNER JOIN
recursiveSearch
ON yourTable.marketDate < DATEADD(YEAR, -1, recursiveSearch.marketDate)
)
search
WHERE
relative_position = 1
)
SELECT
*
FROM
recursiveSearch
WHERE
id IS NOT NULL
ORDER BY
recursiveSearch.marketDate DESC
OPTION
(MAXRECURSION 0)
http://sqlfiddle.com/#!18/56246/13
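The same "step back at least one year from the previous pick" idea can be sketched against the sample data from the question. This is only an illustration, assuming SQLite via Python's sqlite3 module and a hypothetical table name tbl. It is a variation on the query above, not a port of it: SQLite does not allow window functions in the recursive part of a CTE, so the sketch uses a correlated scalar subquery with ORDER BY ... LIMIT 1 in place of ROW_NUMBER(), and SQLite's date(..., '-1 year') in place of DATEADD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, marketdate TEXT)")
dates = [
    "2018-05-31", "2017-06-01", "2017-05-28", "2017-04-28", "2016-05-26",
    "2015-04-18", "2015-04-20", "2015-03-18", "2014-05-31", "2014-04-18",
    "2013-04-15",
]
conn.executemany("INSERT INTO tbl (marketdate) VALUES (?)", [(d,) for d in dates])

picked = [
    row[0]
    for row in conn.execute(
        """
        WITH RECURSIVE search(marketdate) AS (
            -- Anchor: the latest date on or before the input date.
            SELECT MAX(marketdate) FROM tbl WHERE marketdate <= ?
            UNION ALL
            -- Step: the latest date strictly earlier than
            -- (previous pick minus one year); NULL when none is left.
            SELECT (SELECT t.marketdate
                    FROM tbl AS t
                    WHERE t.marketdate < date(search.marketdate, '-1 year')
                    ORDER BY t.marketdate DESC
                    LIMIT 1)
            FROM search
            WHERE search.marketdate IS NOT NULL
        )
        SELECT marketdate
        FROM search
        WHERE marketdate IS NOT NULL
        ORDER BY marketdate DESC
        LIMIT 6
        """,
        ("2018-06-01",),
    )
]
print(picked)
```

The input date '2018-06-01' is the same placeholder used in the query above; the recursion stops by itself once no earlier date qualifies.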

how to calculate the time length of 0-1 sequence with hive?

I have data like this:
time(string) id(int)
201801051127 0
201801051130 0
201801051132 0
201801051135 1
201801051141 1
201801051145 0
201801051147 0
The data has three distinct parts, and I want to calculate the time length of each of them; for example, the first run of zeros lasts 5 minutes. If I simply GROUP BY id (0 and 1), the first run of zeros is combined with the third one, which is not what I want. How can I calculate the length of each of the three parts with SQL? The MySQL code I tried is as follows:
SET @id_label:=0;
SELECT id_label,id,TIMESTAMPDIFF(MINUTE,MIN(DATE1),MAX(DATE1)) FROM
(SELECT id, DATE1, id_label FROM (
SELECT id, str_to_date ( TIME,'%Y%m%d%H%i' ) DATE1,
@id_label := IF(@id = id, @id_label, @id_label+1) id_label,
@id := id
FROM test.t
ORDER BY str_to_date ( TIME,'%Y%m%d%H%i' )
) a)b
GROUP BY id_label,id;
I don't know how to convert it into Hive code.
I would suggest some transformations:
add an indication whether a row is the first one in its group (flag as 1, or null otherwise)
count the number of such flags that precede a row to know its group number
Then you can just group by that new group number.
Oracle version (original question)
with q1 as (
select to_date(time, 'YYYYMMDDHH24MI') time, id,
case id when lag(id) over(order by time) then null else 1 end first_in_group
from t
), q2 as (
select time, id, count(first_in_group) over (order by time) grp_id
from q1
)
select min(id) id, (max(time) - min(time)) * 24 * 60 minutes
from q2
group by grp_id
order by grp_id
SQL fiddle
Hive version
Different database engines use different functions to deal with date/time values, so use Hive's unix_timestamp and deal with the number of seconds it returns:
with q1 as (
select unix_timestamp(time, 'yyyyMMddHHmm')/60 time, id,
case id when lag(id) over(order by time) then null else 1 end first_in_group
from t
), q2 as (
select time, id, count(first_in_group) over (order by time) grp_id
from q1
)
select min(id) id, max(time) - min(time) minutes
from q2
group by grp_id
order by grp_id
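The two transformations (flag the first row of each run, then take a running count of the flags) can be checked against the sample data from the question. The following is a minimal sketch, assuming SQLite 3.25 or later via Python's sqlite3 module instead of Hive; the timestamps are converted to whole minutes in Python (the helper to_minutes and the column name minute are illustrative), so the SQL only has to deal with integers.

```python
import sqlite3
from datetime import datetime

samples = [
    ("201801051127", 0), ("201801051130", 0), ("201801051132", 0),
    ("201801051135", 1), ("201801051141", 1),
    ("201801051145", 0), ("201801051147", 0),
]

def to_minutes(stamp):
    """Minutes since 1970-01-01 for a 'yyyyMMddHHmm' string."""
    delta = datetime.strptime(stamp, "%Y%m%d%H%M") - datetime(1970, 1, 1)
    return int(delta.total_seconds() // 60)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (minute INTEGER, id INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(to_minutes(s), i) for s, i in samples]
)

runs = conn.execute(
    """
    WITH q1 AS (
        -- 1 on the first row of each run, NULL on the others
        SELECT minute, id,
               CASE WHEN id = LAG(id) OVER (ORDER BY minute)
                    THEN NULL ELSE 1 END AS first_in_group
        FROM t
    ), q2 AS (
        -- the running count of flags numbers the runs 1, 2, 3, ...
        SELECT minute, id,
               COUNT(first_in_group) OVER (ORDER BY minute) AS grp_id
        FROM q1
    )
    SELECT grp_id, MIN(id) AS id, MAX(minute) - MIN(minute) AS minutes
    FROM q2
    GROUP BY grp_id
    ORDER BY grp_id
    """
).fetchall()
print(runs)
```

Each result row is (run number, id, length in minutes), so the two runs of zeros stay separate instead of being merged by a plain GROUP BY id.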
Try this:
SELECT id, ( max( TO_DATE ( time,'YYYYMMDDHH24MI' ) )
- min( TO_DATE ( time,'YYYYMMDDHH24MI' ) ) ) *24*60 diff_in_minutes from
(
select t.*,
row_number() OVER ( ORDER BY
TO_DATE ( time,'YYYYMMDDHH24MI' ) )
- row_number() OVER ( PARTITION BY ID ORDER BY
TO_DATE ( time,'YYYYMMDDHH24MI' ) ) seq
FROM Table1 t ORDER BY time
) GROUP BY ID,seq
ORDER BY max(time)
;
DEMO
EDIT: This answer was written when the question was tagged oracle. The tag has since been changed to hive.
As a Hive alternative to Oracle's TO_DATE,
unix_timestamp(time, 'yyyyMMddHHmm')
could be used.

Filter the table with latest date having duplicate OrderId

I have the following table:
For each order id, I need to select the row with the latest start date. For the table above, rows 2 and 3 should be the output.
Rows 1 and 2 have the same order id and order date, but the start date of row 2 is later than that of row 1. The same goes for rows 3 and 4, so I need row 3. I am trying to write the query in SQL Server. Any help is appreciated. Please let me know if you need more details. Apologies for my poor English.
You can do this easily with a ROW_NUMBER() windowed function:
;With Cte As
(
Select *,
Row_Number() Over (Partition By OrderId Order By StartDate Desc) RN
From YourTable
)
Select *
From Cte
Where RN = 1
But I question the StartDate datatype. It looks like these are being stored as VARCHAR. If that is the case, you need to CONVERT the value to a DATETIME:
;With Cte As
(
Select *,
Row_Number() Over (Partition By OrderId
Order By Convert(DateTime, StartDate) Desc) RN
From YourTable
)
Select *
From Cte
Where RN = 1
Another way, using a derived table:
select
t.*
from
YourTable t
inner join
(select OrderId, max(StartDate) dt
from YourTable
group by OrderId) t2 on t2.dt = t.StartDate and t2.OrderId = t.OrderId
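Both approaches can be compared side by side on a few rows. The sketch below assumes SQLite 3.25 or later via Python's sqlite3 module instead of SQL Server; the table orders, its columns, and the sample values are hypothetical, since the original table is not shown, and the dates are stored as ISO-8601 text so that they sort correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (row_no INTEGER, order_id INTEGER, start_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 100, "2019-01-01"),
        (2, 100, "2019-02-01"),  # latest start date for order 100
        (3, 200, "2019-03-15"),  # latest start date for order 200
        (4, 200, "2019-03-01"),
    ],
)

# Approach 1: rank the rows per order and keep the first one.
by_row_number = conn.execute(
    """
    SELECT row_no
    FROM (
        SELECT row_no,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id ORDER BY start_date DESC
               ) AS rn
        FROM orders
    ) AS ranked
    WHERE rn = 1
    ORDER BY row_no
    """
).fetchall()

# Approach 2: join back to the per-order maximum.
by_max_join = conn.execute(
    """
    SELECT o.row_no
    FROM orders AS o
    JOIN (
        SELECT order_id, MAX(start_date) AS dt
        FROM orders
        GROUP BY order_id
    ) AS m ON m.order_id = o.order_id AND m.dt = o.start_date
    ORDER BY o.row_no
    """
).fetchall()
print(by_row_number, by_max_join)
```

The two queries agree here, but they differ on ties: if two rows of the same order share the latest start date, ROW_NUMBER() keeps one of them arbitrarily, while the join returns both.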