Fetch rows based on condition - sql

I am using PostgreSQL on Amazon Redshift.
My table is:
drop table APP_Tax;
create temp table APP_Tax(APP_nm varchar(100), start timestamp, end1 timestamp);
insert into APP_Tax values('AFH','2016-01-26 00:39:51','2016-01-26 00:39:55'),
('AFH','2016-01-26 00:39:56','2016-01-26 00:40:01'),
('AFH','2016-01-26 00:40:05','2016-01-26 00:40:11'),
('AFH','2016-01-26 00:40:12','2016-01-26 00:40:15'), --row x
('AFH','2016-01-26 00:40:35','2016-01-26 00:41:34'); --row y
Expected output:
'AFH','2016-01-26 00:39:51','2016-01-26 00:40:15'
'AFH','2016-01-26 00:40:35','2016-01-26 00:41:34'
I need to compare the end time of each record with the start time of the next record; if the time difference is less than 10 seconds, take the next record's end time, and keep going until the last (final) record.
I.e. datediff(seconds, '2016-01-26 00:39:55', '2016-01-26 00:39:56') is < 10 seconds.
I tried this:
SELECT a.app_nm
,min(a.start)
,max(b.end1)
FROM APP_Tax a
INNER JOIN APP_Tax b
ON a.APP_nm = b.APP_nm
AND b.start > a.start
WHERE datediff(second, a.end1, b.start) < 10
GROUP BY 1
It works, but it doesn't return row y when the condition fails.

There are two reasons that row y is not returned:
b.start > a.start means that a row will never join with itself
The GROUP BY will return only one record per APP_nm value, yet all rows have the same value.
However, there are further logic errors that the query does not handle successfully. For example, how does it know when a "new" session begins?
The logic you seek can be achieved in normal PostgreSQL with the help of the DISTINCT ON clause, which returns one row per distinct value of a specific column. However, DISTINCT ON is not supported by Redshift.
Some potential workarounds: DISTINCT ON like functionality for Redshift
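A common DISTINCT ON-style workaround on Redshift is ROW_NUMBER() plus a filter. A minimal sketch against the APP_Tax table above (illustrative only; it picks one row per APP_nm and is not a full solution to the sessionisation problem):
SELECT app_nm, start, end1
FROM (
    SELECT app_nm, start, end1,
           ROW_NUMBER() OVER (PARTITION BY app_nm ORDER BY start) AS rn  -- 1 = earliest row per APP_nm
    FROM app_tax
) ranked
WHERE rn = 1;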
The output you seek would be trivial using a programming language (which can loop through results and store variables) but is difficult to achieve in a single SQL query (which is designed to operate on sets of rows). I would recommend extracting the data and running it through a simple script (e.g. in Python) that could then output the Start & End combinations you seek.
This is an excellent use-case for a Hadoop Streaming function, which I have successfully implemented in the past. It would take the records as input, then 'remember' the start time and would only output a record when the desired end-logic has been met.

Sounds like what you are after is "sessionisation" of the activity events. You can achieve that in Redshift using window functions.
The complete solution might look like this:
SELECT
  start AS session_start,
  session_end
FROM (
  SELECT
    start,
    end1,
    lead(end1, 1) OVER (ORDER BY end1) AS session_end,
    session_boundary
  FROM (
    SELECT
      start,
      end1,
      CASE WHEN session_switch = 0 AND reverse_session_switch = 1
           THEN 'start'
           ELSE 'end' END AS session_boundary
    FROM (
      SELECT
        start,
        end1,
        CASE WHEN datediff(seconds, end1,
                           lead(start, 1) OVER (ORDER BY end1 ASC)) > 10
             THEN 1
             ELSE 0 END AS session_switch,
        CASE WHEN datediff(seconds,
                           lead(end1, 1) OVER (ORDER BY end1 DESC), start) > 10
             THEN 1
             ELSE 0 END AS reverse_session_switch
      FROM app_tax
    ) AS sessioned
    WHERE session_switch != 0 OR reverse_session_switch != 0
    UNION
    SELECT
      start,
      end1,
      'start'
    FROM (
      SELECT
        start,
        end1,
        row_number() OVER (PARTITION BY APP_nm ORDER BY end1 ASC) AS row_num
      FROM APP_Tax
    ) AS with_row_number
    WHERE row_num = 1
  ) AS with_boundary
) AS with_end
WHERE session_boundary = 'start'
ORDER BY start ASC;
Here is the breakdown (by subquery name):
sessioned - we first identify the switch rows (out and in), i.e. the rows where the gap between one record's end and the next record's start exceeds the limit.
with_row_number - just a patch to extract the first row, because there is no switch into it (there is an implicit switch that we record as 'start').
with_boundary - then we identify the rows where the specific switches occur. If you run the subquery by itself, it is clear that a session starts when session_switch = 0 AND reverse_session_switch = 1, and ends when the opposite occurs. All other rows are in the middle of sessions, so they are ignored.
with_end - finally, we pair each 'start' row with the end time of the following 'end' row (thus defining the session duration), and remove the 'end' rows.
The with_boundary subquery answers your initial question, but typically you'd want to combine those rows to get the final result, which is the session duration.
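For comparison, the same sessionisation can often be written more compactly as a "gaps and islands" query: flag each row that starts more than 10 seconds after the previous row ended, then turn the running count of flags into a session id. A sketch only, assuming the same APP_Tax table and 10-second threshold:
SELECT app_nm,
       MIN(start) AS session_start,
       MAX(end1)  AS session_end
FROM (
    SELECT app_nm, start, end1,
           -- running count of gap flags = session id
           SUM(new_session) OVER (PARTITION BY app_nm ORDER BY start
                                  ROWS UNBOUNDED PRECEDING) AS session_id
    FROM (
        SELECT app_nm, start, end1,
               CASE WHEN datediff(seconds,
                                  LAG(end1) OVER (PARTITION BY app_nm ORDER BY start),
                                  start) > 10
                    THEN 1 ELSE 0 END AS new_session  -- 1 = gap to previous row exceeds 10 seconds
        FROM app_tax
    ) flagged
) grouped
GROUP BY app_nm, session_id
ORDER BY session_start;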

Related

How to delete duplicate data that is in between two common values?

How can I delete duplicate data based on the common values (Start and End)?
(Time is a unique key.)
My table is:
Time      Data
--------  ------
10:24:11  Start
10:24:12  Result
10:24:13  Result
10:24:14  End
10:24:15  Start
10:24:16  Result
10:24:17  End
When duplication does occur, I want to keep only the Result row with the MAX(Time) between each Start and End.
The result that I want:
Time      Data
--------  ------
10:24:11  Start
10:24:13  Result
10:24:14  End
10:24:15  Start
10:24:16  Result
10:24:17  End
I have tried rearranging the data, but couldn't seem to get the result that I want. Could someone give their advice on this case?
Update
I ended up not using either of the approaches suggested by @fredt and @airliquide, as my version of HSQLDB doesn't support the functions.
So what I did was add a sequence and an indicator, making Start = 1, Result = 2, and End = 3.
Sequence  Time      Data    Indicator
--------  --------  ------  ---------
1         10:24:11  Start   1
2         10:24:12  Result  2
3         10:24:13  Result  2
4         10:24:14  End     3
5         10:24:15  Start   1
6         10:24:16  Result  2
7         10:24:17  End     3
From there, I make use of the indicator and sequence to keep only the latest Result: if the previous row's indicator is 2 (i.e. a Result), I remove that previous row.
The guide that I follow:
From: Is there a way to access the "previous row" value in a SELECT statement?
select t1.value - t2.value from table t1, table t2
where t1.primaryKey = t2.primaryKey - 1
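Applied to the table above, a minimal sketch of that previous-row join (hypothetical: it assumes the Sequence/Indicator columns shown, stored in a table named actions_seq, with the Sequence column called seq to avoid keyword clashes):
DELETE FROM actions_seq
WHERE seq IN (
    SELECT prev.seq
    FROM actions_seq cur
    JOIN actions_seq prev
      ON cur.seq = prev.seq + 1   -- pair each row with its previous row
    WHERE cur.Indicator = 2       -- current row is a Result
      AND prev.Indicator = 2      -- previous row is also a Result, so drop it
);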
Hi, a first approach would be to use a lead function as follows:
select hour, status from (select *, lead(status,1) over (order by hour) as lead
from newtable) compare
where compare.lead <> status
or lead is null
This gives me what's expected using a Postgres engine.
You can do this sort of thing with SQL procedures.
-- create the table with only two columns
CREATE TABLE actions (attime TIME UNIQUE, data VARCHAR(10));
-- drop the procedure if it exists
DROP PROCEDURE del_duplicates IF EXISTS;
create procedure del_duplicates() MODIFIES SQL DATA begin atomic
DECLARE last_time time(0) default null;
for_loop:
-- loop over the rows in order
FOR SELECT * FROM actions ORDER BY attime DO
-- each time 'Start' is found, clear the last_time variable
IF data = 'Start' THEN
SET last_time = NULL;
ITERATE for_loop;
END IF;
-- each time 'Result' is found, delete the row with previous time
-- if last_time is null, no row is actually deleted
IF data = 'Result' THEN
DELETE FROM actions WHERE attime = last_time;
-- then store the latest time
SET last_time = attime;
ITERATE for_loop;
END IF;
END FOR;
END
Your data must all belong to a single day, otherwise there will be strange overlaps that cannot be distinguished. It is better to use TIMESTAMP instead of TIME.
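A minimal usage sketch, assuming HSQLDB and the table and procedure above (the sample rows mirror the question's data):
INSERT INTO actions VALUES
  (TIME '10:24:11', 'Start'), (TIME '10:24:12', 'Result'),
  (TIME '10:24:13', 'Result'), (TIME '10:24:14', 'End'),
  (TIME '10:24:15', 'Start'), (TIME '10:24:16', 'Result'),
  (TIME '10:24:17', 'End');
CALL del_duplicates();
-- the earlier duplicate Result at 10:24:12 has been removed
SELECT * FROM actions ORDER BY attime;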

JOIN other table only if condition is true for ALL joined rows

I have two tables I'm trying to conditionally JOIN.
dbo.Users looks like this:
UserID
------
24525
5425
7676
dbo.TelemarketingCallAudits looks like this (date format dd/mm/yyyy):
UserID Date CampaignID
------ ---------- ----------
24525 21/01/2018 1
24525 26/08/2018 1
24525 17/02/2018 1
24525 12/01/2017 2
5425 22/01/2018 1
7676 16/11/2017 2
I'd like to return a table that contains ONLY users that I called at least 30 days ago (if CampaignID=1) and at least 70 days ago (if CampaignID=2).
The end result should look like this (today is 02/09/18):
UserID Date CampaignID
------ ---------- ----------
5425 22/01/2018 1
7676 16/11/2017 2
Note that because I called user 24525 with Campaign 1 only 7 days ago, I shall not see that user at all.
I tried this simple AND/OR condition, and then I found out it still returns users I shouldn't see, because they have rows for other calls and it simply ignores the conditioned calls... which obviously misses the goal.
I have no idea how to suppress a user entirely if ANY of their associated rows in the second table fails the condition.
AND
(
internal_TelemarketingCallAudits.CallAuditID IS NULL --No telemarketing calls is fine
OR
(
internal_TelemarketingCallAudits.CampaignID = 1 --Campaign 1
AND
DATEADD(dd, 75, MAX(internal_TelemarketingCallAudits.Date)) < GETDATE() --Last call occurred at least 75 days ago
)
OR
(
internal_TelemarketingCallAudits.CampaignID != 1 --Other campaigns
AND
DATEADD(dd, 10, MAX(internal_TelemarketingCallAudits.Date)) < GETDATE() --Last call occurred at least 10 days ago
)
)
I really appreciate your help.
Try this: SQL Fiddle
select *
from dbo.Users u
inner join ( --get the most recent call per user (taking into account different campaign timescales)
select tca.UserId
, tca.CampaignId
, tca.[Date]
, case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end LastCalledInWindow
, row_number() over (partition by tca.UserId order by case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r
from dbo.TelemarketingCallAudits tca
inner join (
values (1, 60)
, (2, 70)
) c (CampaignId, DaysSinceLastCall)
on tca.CampaignId = c.CampaignId
) mrc
on mrc.UserId = u.UserId
and mrc.r = 1 --only accept the most recent call
and mrc.LastCalledInWindow = 0 --only include if they haven't been contacted in the last x days
I'm not comparing all rows here; rather, I saw that you're interested in when the most recent call was, and then you only care whether that falls within the X-day window. There's a bit of additional complexity given that X varies by campaign, so it's not quite the most recent call you care about so much as the one most likely to fall within its window. To get around that, I sort each user's calls so that those within their campaign's window come first, followed by those which aren't, and then by most recent first within those two groups. This gives me the field r.
By filtering on r = 1 for each user, we only get the most recent call (adjusted for campaign windows). By filtering on LastCalledInWindow = 0 we exclude those who have been called within the campaign's window.
NB: I've used an inner query (aliased c) to hold the campaign ids and their corresponding windows. In reality you'd probably want a campaigns table holding that same information instead of coding inside the query itself.
Hopefully everything else is self-explanatory; but give me a nudge in the comments if you need any further information.
UPDATE
Just realised you'd also said "no calls is fine"... Here's a tweaked version to allow for scenarios where the person has not been called.
SQL Fiddle Example.
select *
from dbo.Users u
left outer join ( --get the most recent call per user (taking into account different campaign timescales)
select tca.UserId
, tca.CampaignId
, tca.[Date]
, case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end LastCalledInWindow
, row_number() over (partition by tca.UserId order by case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r
from dbo.TelemarketingCallAudits tca
inner join (
values (1, 60)
, (2, 70)
) c (CampaignId, DaysSinceLastCall)
on tca.CampaignId = c.CampaignId
) mrc
on mrc.UserId = u.UserId
where
(
mrc.r = 1 --only accept the most recent call
and mrc.LastCalledInWindow = 0 --only include if they haven't been contacted in the last x days
)
or mrc.r is null --no calls at all
Update: Including a default campaign offset
To include a default, you could do something like the code below (SQL Fiddle Example). Here, I've put each campaign's offset value in the Campaigns table, but created a default campaign with ID = -1 to handle anything for which there is no offset defined. I use a left join between the audit table and the campaigns table so that we get all records from the audit table, regardless of whether there's a campaign defined, then a cross join to get the default campaign. Finally, I use a coalesce to say "if the campaign isn't defined, use the default campaign".
select *
from dbo.Users u
left outer join ( --get the most recent call per user (taking into account different campaign timescales)
select tca.UserId
, tca.CampaignId
, tca.[Date]
, case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end LastCalledInWindow
, row_number() over (partition by tca.UserId order by case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r
from dbo.TelemarketingCallAudits tca
left outer join Campaigns c
on tca.CampaignId = c.CampaignId
cross join Campaigns dflt
where dflt.CampaignId = -1
) mrc
on mrc.UserId = u.UserId
where
(
mrc.r = 1 --only accept the most recent call
and mrc.LastCalledInWindow = 0 --only include if they haven't been contacted in the last x days
)
or mrc.r is null --no calls at all
That said, I'd recommend not using a default, but rather ensuring that every campaign has an offset defined. i.e. Presumably you already have a campaigns table; and since this offset value is defined per campaign, you can include a field in that table for holding this offset. Rather than leaving this as null for some records, you could set it to your default value; thus simplifying the logic / avoiding potential issues elsewhere where that value may subsequently be used.
You'd also asked about the order by clause. There is no order by 1/0; so I assume that's a typo. Rather the full statement is row_number() over (partition by tca.UserId order by case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r.
The purpose of this piece is to find the "most important" call for each user. By "most important" I basically mean the most recent, since that's generally what we're after; though there's one caveat. If a user is part of 2 campaigns, one with an offset of 30 days and one with an offset of 60 days, they may have had 2 calls, one 32 days ago and one 38 days ago. Though the call from 32 days ago is more recent, if that's on the campaign with the 30 day offset it's outside the window, whilst the older call from 38 days ago may be on the campaign with an offset of 60 days, meaning that it's within the window, so is more of interest (i.e. this user has been called within a campaign window).
Given the above requirement, here's how this code meets it:
row_number() produces a number from 1, counting up, for each row in the (sub)query's results. The counter is reset to 1 for each partition.
partition by tca.UserId says that we're partitioning by the user id; so for each user there will be one row for which row_number() returns 1, and each additional row for that user gets the next consecutive number.
The order by part of this statement defines which of each user's rows gets #1, and how the numbers progress thereafter; i.e. the first row according to the order by gets number 1, the next number 2, etc.
case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end returns 1 for calls within their campaign's window, and 0 for those outside of it. Since we're ordering by this result in descending order, any records within their campaign's window are returned before any outside of it.
We then order by tca.[Date] desc; i.e. more recent calls are returned before older calls.
Finally, we name the output of this row number r and, in the outer query, filter on r = 1; meaning that for each user we only take one row, the first row according to the order criteria above; i.e. if there's a call within its campaign's window we take the most recent of those, otherwise simply the most recent call.
Take a look at the output of the subquery to get a better idea of exactly how this works: SQL Fiddle
I hope that explanation makes sense and helps you to understand the code. Sadly I can't find a way to explain it more concisely than the code itself does; so if it doesn't make sense, try playing with the code and seeing how changes affect the output.
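As an aside, the "condition is true for ALL joined rows" requirement in the title can also be expressed directly as an anti-join. A hedged sketch under the same assumptions as above (the inline 60/70-day offsets; users with no calls are included):
SELECT u.UserID
FROM dbo.Users u
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.TelemarketingCallAudits tca
    INNER JOIN (VALUES (1, 60), (2, 70)) c (CampaignId, DaysSinceLastCall)
        ON tca.CampaignID = c.CampaignId
    WHERE tca.UserID = u.UserID
      -- any call still inside its campaign's window disqualifies the user
      AND DATEADD(DAY, c.DaysSinceLastCall, tca.[Date]) > GETUTCDATE()
);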

Oracle SQL Statement - Identify & Count Unique Callers

I'm looking to make some improvements to our telephony call data, and have a requirement to identify whether a CALLER is unique: if they call more than once on a given date (CALL_DATE), flag the call with a 1 value; if only once, a 0 value.
Any ideas how I can modify this existing statement to reflect this?
SELECT /*+ PARALLEL (4) */
A.CALL_ID,
A.CALL_DATE,
O.OT_OUTLET_CODE,
A.CALL_TIME,
TO_CHAR(TO_DATE(A.CALL_TIME, 'HH24:MI:SS')+A.TALK_TIME/(24*60*60),'HH24:MI:SS') "CALL_END_TIME",
A.TALK_TIME,
A.RING_TIME,
A.OUTCOME,
CASE WHEN A.TRANSFER_TO = '10000' THEN 1 ELSE 0 END AS "VOICEMAIL"
FROM
OWBI.ODS_FACT_TIGER_TELEPHONY A,
OWBI.WHS_DIM_CAL_DATE C,
OWBI.WHS_DIM_OUTLET O
WHERE
A.CALL_DATE = C.CD_DAY_DATE
AND A.WHS_DIM_OUTLET = O.DIMENSION_KEY
AND C.EY_YEAR_CODE IN ('2019')
AND C.EW_WEEK_IN_YEAR IN ('1') -- **FILTER ON PREVIOUS BUSINESS WEEK NUMBER**
ORDER BY A.CALL_DATE DESC;
What you are describing sounds like a job for the analytic count(*) function.
Add this to the SELECT clause and don't change anything else:
case when count(*) over (partition by a.call_id, a.call_date) = 1 then 0
else 1 end as unique_flag
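For illustration, a self-contained toy example of the analytic count (a hypothetical CALLS table with CALLER and CALL_DATE columns, not the poster's schema, which partitions by CALL_ID instead):
SELECT caller,
       call_date,
       CASE WHEN COUNT(*) OVER (PARTITION BY caller, call_date) = 1
            THEN 0
            ELSE 1
       END AS unique_flag   -- 1 = caller rang more than once that day, 0 = only once
FROM calls;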

Return overlapping date records in SQL

I used the following query to fetch the overlapping records in SQL:
SELECT QUOTE_ID,FUNCTION_ID,FUNCTION_DT,FUNC_SPACE_ID,FN_START_TIME,FN_END_TIME,DATE_AUTH_LEVEL
FROM R_13_ALL_RESERVED A
WHERE
A.FUNC_SPACE_ID = '401-ZFU-52'
AND A.FUNCTION_DT = TO_DATE('09/03/2015','MM/DD/YYYY')
AND EXISTS ( SELECT 'X'
FROM R_13_ALL_RESERVED B
WHERE A.PROPERTY = B.PROPERTY
AND A.FUNCTION_DT = B.FUNCTION_DT
AND A.FUNCTION_ID <> B.FUNCTION_ID
AND ( ( A.FN_START_TIME > B.FN_START_TIME
AND A.FN_START_TIME < B.FN_END_TIME)
OR ( B.FN_START_TIME > A.FN_START_TIME
AND B.FN_START_TIME < A.FN_END_TIME)
OR ( A.FN_START_TIME = B.FN_START_TIME
AND A.FN_END_TIME = B.FN_END_TIME)
)
)
But even though the dates are not overlapping, it still returns the records as overlapping.
Am I missing something here?
Also, if the date records overlap, I need to compare the count of FUNCTION_ID records with DATE_AUTH_LEVEL: if 2 FUNCTION_ID records overlap, the count of FUNCTION_ID would be 2 and DATE_AUTH_LEVEL is 1, and such a record should be in the result set.
Please find the data set in SQLFiddle
http://sqlfiddle.com/#!9/95874/1
Desired output: the SQL should return overlapping FN_START_TIME and FN_END_TIME for a FUNC_SPACE_ID and its FUNCTION_DT.
In the provided example, rows 5 and 6 overlap for the function space id '401-ZFU-12' and function_dt 'August, 15 2015', and all the others do not overlap.
The simplest predicate (where clause condition) for detecting the overlap of two ranges is to compare the start of the first range with the end of the 2nd range, and the start of the 2nd range with the end of the first range:
WHERE R1.Start_Date <= R2.End_Date
AND R2.Start_Date <= R1.End_Date
As you can see, each of the two inequalities compares a start value and an end value from separate records (R1 and R2, then R2 and R1, respectively). All that remains is to add the conditions that correlate the records, and also ensure that you aren't comparing a row to itself. So if you want to find all Common_IDs that have Distinct_IDs with overlapping date ranges:
select *
from Your_Table R1
where exists (select 1 from Your_Table R2
where R1.Common_ID = R2.Common_ID
and R1.Distinct_ID <> R2.Distinct_ID
and R1.Start_Date <= R2.End_Date
and R2.Start_Date <= R1.End_Date)
If there is no Distinct_ID to use, you can use R1.rowid <> R2.rowid in place of R1.Distinct_ID <> R2.Distinct_ID
Here is an approach to troubleshooting the issue on your end.
My first suspicion is that the results of your EXISTS clause are too broad and are thus unexpectedly returning rows for every record matching the outer clause. Likely there are rows that do not fall on the desired date or space id but that share one component of their interval with your inner criteria.
Inspect the results of the inner select statement (the one within the EXISTS clause) for an example row, exchanging all the 'A' aliased values with actual values from one of the returned rows you did not expect to receive.
Additionally, you can inspect what I think would be a semi-join in the execution plan to see what the join criteria are. If you expect the inner query to be filtered by a constant FUNC_SPACE_ID of '401-ZFU-52', you will discover that it is not.
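Putting the two answers together, a hedged sketch of the query with the inner query correlated on FUNC_SPACE_ID and FUNCTION_DT and with the simpler overlap predicate (column names taken from the question; change <= to < if intervals that merely touch should not count as overlapping):
SELECT A.QUOTE_ID, A.FUNCTION_ID, A.FUNCTION_DT, A.FUNC_SPACE_ID,
       A.FN_START_TIME, A.FN_END_TIME, A.DATE_AUTH_LEVEL
FROM R_13_ALL_RESERVED A
WHERE EXISTS (
        SELECT 1
        FROM R_13_ALL_RESERVED B
        WHERE A.FUNC_SPACE_ID = B.FUNC_SPACE_ID   -- correlate on space...
          AND A.FUNCTION_DT   = B.FUNCTION_DT     -- ...and date
          AND A.FUNCTION_ID  <> B.FUNCTION_ID     -- don't compare a row to itself
          AND A.FN_START_TIME <= B.FN_END_TIME    -- the two time ranges overlap
          AND B.FN_START_TIME <= A.FN_END_TIME
      );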

SQL query determine stopped time in a range

I have to determine the stopped time of a vehicle that sends its status data back to the server every 30 seconds; this data is stored in a table of a database.
The fields of a status record consist of (vehicleID, ReceiveDate, ReceiveTime, Speed, Location).
Now what I want to do is determine each stop interval, from the point at which the vehicle's speed reaches zero until the status shows it moving again, and so on for the next stop.
For example, on a given day a given vehicle may have 10 stopped statuses, and I must determine the duration of each with a query.
The result can be like this:
id  Recvdate    Rtime  Duration
--  ----------  -----  --------
1   2010-05-01  8:30   45min
1   2010-05-01  12:21  3hour
This is an application of window functions (called analytic functions in Oracle).
Your goal is to assign a "block number" to each sequence of stops. That is, all stops in a sequence (for a vehicle) will have the same block number, and this will be different from all other sequences of stops.
Here is a way to assign the block number:
Create a speed flag that says 1 when speed > 0 and 0 when speed = 0.
Enumerate all the records where the speed flag = 1. These are "blocks".
Do a self join to put each flag = 0 in a block (this requires grouping and taking the max blocknum).
Summarize by duration or however you want.
The following code is a sketch of what I mean. It won't solve your problem, because you are not clear about how to handle day breaks, what information you want to summarize, and it has an off-by-1 error (in each sequence of stops it includes the previous non-stop, if any).
with vd as
(
    select vd.*,
           -- number only the "moving" records; these numbers become block ids
           (case when SpeedFlag = 1
                 then ROW_NUMBER() over (partition by id, SpeedFlag order by rtime)
            end) as blocknum
    from (
        select vd.*, (case when speed = 0 then 0 else 1 end) as SpeedFlag
        from vehicaldata vd
    ) vd
)
select id, blocknum, COUNT(*) as numrecs, SUM(duration) as duration
from (
    -- assign each record the block number of the most recent "moving" record before it
    select vd.id, vd.rtime, vd.duration, MAX(vdprev.blocknum) as blocknum
    from vd
    left outer join vd vdprev
      on vd.id = vdprev.id
      and vd.rtime > vdprev.rtime
    group by vd.id, vd.rtime, vd.duration
) vd
group by id, blocknum
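Since the question's table has no duration column (its fields are vehicleID, ReceiveDate, ReceiveTime, Speed, Location), in practice the per-stop duration would be derived from the first and last sample times of each block. A hedged sketch, assuming the block assignment above has been materialised in a table or CTE named stopped_blocks with a combined sample_time timestamp column, and a SQL Server-style DATEDIFF:
SELECT vehicleID,
       blocknum,
       MIN(sample_time) AS stop_start,
       MAX(sample_time) AS stop_end,
       DATEDIFF(minute, MIN(sample_time), MAX(sample_time)) AS stop_minutes
FROM stopped_blocks
WHERE Speed = 0            -- only the stopped records within each block
GROUP BY vehicleID, blocknum;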