Oracle SQL Statement - Identify & Count Unique Callers

I'm looking to improve our telephony call data and need to flag whether a CALLER is a repeat caller: if they call more than once on a given date (CALL_DATE), the row should be flagged with 1; if they call only once, with 0.
Any ideas how I can modify this existing statement to reflect this?
SELECT /*+ PARALLEL (4) */
A.CALL_ID,
A.CALL_DATE,
O.OT_OUTLET_CODE,
A.CALL_TIME,
TO_CHAR(TO_DATE(A.CALL_TIME, 'HH24:MI:SS')+A.TALK_TIME/(24*60*60),'HH24:MI:SS') "CALL_END_TIME",
A.TALK_TIME,
A.RING_TIME,
A.OUTCOME,
CASE WHEN A.TRANSFER_TO = '10000' THEN 1 ELSE 0 END AS "VOICEMAIL"
FROM
OWBI.ODS_FACT_TIGER_TELEPHONY A,
OWBI.WHS_DIM_CAL_DATE C,
OWBI.WHS_DIM_OUTLET O
WHERE
A.CALL_DATE = C.CD_DAY_DATE
AND A.WHS_DIM_OUTLET = O.DIMENSION_KEY
AND C.EY_YEAR_CODE IN ('2019')
AND C.EW_WEEK_IN_YEAR IN ('1') -- **FILTER ON PREVIOUS BUSINESS WEEK NUMBER**
ORDER BY A.CALL_DATE DESC;

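What you are describing sounds like a job for the analytic count(*) function.
Add this to the SELECT clause and don't change anything else. Note that the partition must be on whichever column identifies the caller; if CALL_ID is unique per call, the count is always 1 and the flag is always 0, so substitute the caller column in that case: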
case when count(*) over (partition by a.call_id, a.call_date) = 1 then 0
else 1 end as unique_flag
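As a rough illustration of how the analytic count behaves, here is a minimal sketch run against an in-memory SQLite database (3.25 or later, for window functions) rather than Oracle. The table calls and the column caller are hypothetical stand-ins; the real caller column is not shown in the question.

```python
import sqlite3

# Hypothetical miniature of the telephony table; "caller" is an assumed column name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (call_id INTEGER, caller TEXT, call_date TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        (1, "0123", "2019-01-01"),
        (2, "0123", "2019-01-01"),  # same caller, same day -> repeat
        (3, "0456", "2019-01-01"),
        (4, "0123", "2019-01-02"),  # same caller, different day -> not a repeat
    ],
)

# Count the rows sharing the same caller and date without collapsing them.
rows = conn.execute(
    """
    SELECT call_id,
           CASE WHEN COUNT(*) OVER (PARTITION BY caller, call_date) = 1
                THEN 0 ELSE 1 END AS unique_flag
    FROM calls
    ORDER BY call_id
    """
).fetchall()
print(rows)
```

Every row of a repeat caller gets the flag, not only the second and later calls, because the count is computed over the whole partition.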

Related

Date-wise and bucket capacity-wise serial number

I couldn't give an appropriate title to my problem. Let me explain it through example.
Suppose I have the following table
INPUT
What I want
First, I want to group the transactions by date (dd-MM-yyyy)
Then I want to create a chunk/bucket of at most 2 items. Thus I want to assign a sub_batch_ref_id to each chunk/bucket of 2 items. In each chunk/bucket transactions must belong to exactly one date.
Convention of SUB_BATCH_REF_ID is BATCH_REF{global serial number of chunk/bucket}
A chunk or bucket can contain at most 2 items of the same date
I understand that this can be achieved through any high-level programming language (except data-oriented language like SQL) quite easily but I don't have such provision. The solution can be shaped into the following pseudocode for better understanding:
Pseudocode
//I have the following map (assumed)
Map<Date, List<Transaction>> dateWiseTransactions;
BUCKET_CAPACITY = 2
GLOBAL_SERIAL = 0
for each entry in dateWiseTransactions
LOOP
GLOBAL_SERIAL = GLOBAL_SERIAL + 1;
for each transaction in entry.value i.e. List<Transaction>
LOOP
if loopIndex > 1 and (loopIndex - 1) mod BUCKET_CAPACITY = 0 //loopIndex starts from 1; a new bucket starts after every BUCKET_CAPACITY items
GLOBAL_SERIAL = GLOBAL_SERIAL + 1
end if;
transaction.SUB_BATCH_REF_ID = CONCAT(transaction.BATCH_REF, GLOBAL_SERIAL)
END LOOP;
END LOOP;
EXPECTED OUTPUT
What I tried
I tried to partition the transaction data by date first then assigned a row number but I couldn't come to a solution.
SELECT
T.*,
ROW_NUMBER() over (PARTITION BY TRUNC(INSERT_DATE) ORDER BY TRANSACTION_ID) rn
FROM TRANSACTION T
WHERE BATCH_REF='XYZ'
Any help is much appreciated.
SQL FIDDLE
This is not an optimal solution. Your actual problem is a rather complex graph problem -- because you want transactions only to be used once over all the dates.
One solution is to assign a "working" date to the transaction. The following does this randomly:
SELECT T.*
FROM (SELECT T.*,
ROW_NUMBER() over (PARTITION BY TRUNC(INSERT_DATE) ORDER BY TRANSACTION_ID) AS rn
FROM (SELECT T.*,
ROW_NUMBER() OVER (PARTITION BY BATCH_REF, TRANSACTION_ID ORDER BY DBMS_RANDOM.RANDOM) as seqnum
FROM TRANSACTION T
) T
WHERE BATCH_REF = 'XYZ' AND
SEQNUM = 1
) T
WHERE rn <= 2;
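For comparison, the bucketing described in the question's pseudocode can also be sketched with two window functions: ROW_NUMBER per date to derive a bucket number within the day, then DENSE_RANK over (date, bucket) for the global serial. This is not the approach of the query above; it is a minimal sketch on an in-memory SQLite database (3.25 or later), with a hypothetical table txn standing in for TRANSACTION and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "txn" is a hypothetical stand-in for the TRANSACTION table in the question.
conn.execute("CREATE TABLE txn (transaction_id INTEGER, batch_ref TEXT, insert_date TEXT)")
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?)",
    [
        (1, "XYZ", "2019-03-01 09:00:00"),
        (2, "XYZ", "2019-03-01 10:00:00"),
        (3, "XYZ", "2019-03-01 11:00:00"),  # third item of the day -> second bucket
        (4, "XYZ", "2019-03-02 09:00:00"),  # new date -> new bucket
    ],
)

rows = conn.execute(
    """
    SELECT transaction_id,
           batch_ref || DENSE_RANK() OVER (ORDER BY day, bucket) AS sub_batch_ref_id
    FROM (
        SELECT transaction_id,
               batch_ref,
               date(insert_date) AS day,
               -- integer division: rows 1-2 -> bucket 1, rows 3-4 -> bucket 2, ...
               (ROW_NUMBER() OVER (PARTITION BY date(insert_date)
                                   ORDER BY transaction_id) + 1) / 2 AS bucket
        FROM txn
        WHERE batch_ref = 'XYZ'
    )
    ORDER BY transaction_id
    """
).fetchall()
print(rows)
```

In Oracle the same idea would use TRUNC(INSERT_DATE) and CEIL(rn / 2); the bucket capacity of 2 is hard-coded here for brevity.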

SQL - How to make exclude condition

I have the following table. I'm trying to get the rows that met my specific condition.
Table look as follows:
account|transactiontypecode|
-------|-------------------|
1000058| 8|
1000067| 2|
1000067| 8|
The query should return only account 1000058, because its only transactiontypecode is 8. The other account has code 8 too, but it also has another transactiontypecode that does not qualify.
So the requirement is to get the accounts that have specific transaction codes, and to exclude any account that has the required code but also has unwanted codes.
This was my attempt at the issue above, among others, but I think other eyes may point me in a better direction.
with cte1 as (
select
gp.account,
case
when gp.transactiontypecode in (2,8,17) then TRUE
else false
end as txcheck
from
gp.t2001 gp
group by
1, 2)
select
account,
txcheck
from
cte1
where
txcheck is true and txcheck is not false;
If anyone can help me achieve above requirement, would be great!
Just use not exists if you want the entire rows:
select t.*
from gp.t2001 t
where t.transactiontypecode = 8 and
not exists (select 1
from gp.t2001 t2
where t2.account = t.account and
t2.transactiontypecode <> 8
);
Or aggregation if you just want the account:
select t.account
from gp.t2001 t
group by t.account
having min(transactiontypecode) = max(transactiontypecode) and
min(transactiontypecode) = 8;
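To check both variants against the sample rows from the question, here is a small sketch using an in-memory SQLite database. The gp schema prefix is dropped because SQLite has no such schema here; everything else follows the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample rows from the question; the "gp." schema prefix is omitted for SQLite.
conn.execute("CREATE TABLE t2001 (account INTEGER, transactiontypecode INTEGER)")
conn.executemany(
    "INSERT INTO t2001 VALUES (?, ?)",
    [(1000058, 8), (1000067, 2), (1000067, 8)],
)

# Variant 1: keep rows with code 8 whose account has no row with another code.
not_exists_rows = conn.execute(
    """
    SELECT t.account, t.transactiontypecode
    FROM t2001 t
    WHERE t.transactiontypecode = 8
      AND NOT EXISTS (SELECT 1
                      FROM t2001 t2
                      WHERE t2.account = t.account
                        AND t2.transactiontypecode <> 8)
    """
).fetchall()

# Variant 2: one row per account whose codes are all equal to 8.
agg_rows = conn.execute(
    """
    SELECT t.account
    FROM t2001 t
    GROUP BY t.account
    HAVING MIN(transactiontypecode) = MAX(transactiontypecode)
       AND MIN(transactiontypecode) = 8
    """
).fetchall()

print(not_exists_rows, agg_rows)
```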
You can use aggregation in a HAVING clause checking the count of codes to be exactly one and the code is 8 -- wrap it in e.g. max(), if there's only one value the maximum is that one value:
SELECT gp.account
FROM gp.t2001 gp
GROUP BY gp.account
HAVING count(gp.transactiontypecode) = 1
AND max(gp.transactiontypecode) = 8;
Or, if it is allowed that the code of 8 can occur multiple times for an account and you want all of them not having any other code, change it using conditional aggregation to count the codes of 8 and compare it to the overall count of codes. If they match they're all 8:
...
HAVING count(CASE
WHEN gp.transactiontypecode = 8 THEN
1
END) = count(*);
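The difference between the two HAVING conditions only shows up when code 8 repeats for an account. A minimal sketch on an in-memory SQLite database, with made-up account numbers, might look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2001 (account INTEGER, transactiontypecode INTEGER)")
# Made-up data: account 1 has code 8 twice, account 2 mixes codes, account 3 has only code 2.
conn.executemany(
    "INSERT INTO t2001 VALUES (?, ?)",
    [(1, 8), (1, 8), (2, 8), (2, 2), (3, 2)],
)

# Requires exactly one row per account, so the repeated 8 of account 1 is rejected.
count_one = conn.execute(
    """
    SELECT account FROM t2001
    GROUP BY account
    HAVING COUNT(transactiontypecode) = 1 AND MAX(transactiontypecode) = 8
    """
).fetchall()

# Conditional aggregation: the number of 8s must equal the number of rows.
all_eights = conn.execute(
    """
    SELECT account FROM t2001
    GROUP BY account
    HAVING COUNT(CASE WHEN transactiontypecode = 8 THEN 1 END) = COUNT(*)
    """
).fetchall()

print(count_one, all_eights)
```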
Another option, if the code may occur more than once is to use NOT EXISTS to check for other rows with another code:
SELECT DISTINCT
gp1.account
FROM gp.t2001 gp1
WHERE gp1.transactiontypecode = 8
AND NOT EXISTS (SELECT *
FROM gp.t2001 gp2
WHERE gp2.account = gp1.account
AND gp2.transactiontypecode <> 8);
Try something like this:
SELECT account
FROM [Table]
GROUP BY account
HAVING
COUNT(transactiontypecode) = 1 AND
MAX(transactiontypecode) = 8
The COUNT inside the HAVING clause gives you the accounts with only one transaction type code row. Because HAVING is evaluated per group, any further condition on a non-grouped column has to be wrapped in an aggregate such as MAX.

JOIN other table only if condition is true for ALL joined rows

I have two tables I'm trying to conditionally JOIN.
dbo.Users looks like this:
UserID
------
24525
5425
7676
dbo.TelemarketingCallAudits looks like this (date format dd/mm/yyyy):
UserID Date CampaignID
------ ---------- ----------
24525 21/01/2018 1
24525 26/08/2018 1
24525 17/02/2018 1
24525 12/01/2017 2
5425 22/01/2018 1
7676 16/11/2017 2
I'd like to return a table that contains ONLY users that I called at least 30 days ago (if CampaignID=1) and at least 70 days ago (if CampaignID=2).
The end result should look like this (today is 02/09/18):
UserID Date CampaignID
------ ---------- ----------
5425 22/01/2018 1
7676 16/11/2017 2
Note that because I called user 24525 with Campaign 1 only 7 days ago, I should not see that user at all.
I tried this simple AND/OR condition, and then found out that it still returns the users I shouldn't see: they have rows for other calls that pass the condition, and the calls that fail it are simply ignored, which obviously misses the goal.
I have no idea how to exclude the user entirely if ANY of their associated rows in the second table does not meet the condition.
AND
(
internal_TelemarketingCallAudits.CallAuditID IS NULL --No telemarketing calls is fine
OR
(
internal_TelemarketingCallAudits.CampaignID = 1 --Campaign 1
AND
DATEADD(dd, 75, MAX(internal_TelemarketingCallAudits.Date)) < GETDATE() --Last call occurred at least 75 days ago
)
OR
(
internal_TelemarketingCallAudits.CampaignID != 1 --Other campaigns
AND
DATEADD(dd, 10, MAX(internal_TelemarketingCallAudits.Date)) < GETDATE() --Last call occurred at least 10 days ago
)
)
I really appreciate your help.
Try this: SQL Fiddle
select *
from dbo.Users u
inner join ( --get the most recent call per user (taking into account different campaign timescales)
select tca.UserId
, tca.CampaignId
, tca.[Date]
, case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end LastCalledInWindow
, row_number() over (partition by tca.UserId order by case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r
from dbo.TelemarketingCallAudits tca
inner join (
values (1, 60)
, (2, 70)
) c (CampaignId, DaysSinceLastCall)
on tca.CampaignId = c.CampaignId
) mrc
on mrc.UserId = u.UserId
and mrc.r = 1 --only accept the most recent call
and mrc.LastCalledInWindow = 0 --only include if they haven't been contacted in the last x days
I'm not comparing all rows here; but rather saw that you're interested in when the most recent call is; then you only care if that's in the X day window. There's a bit of additional complexity given the X days varies by campaign; so it's not the most recent call you care about so much as the most likely to fall within that window. To get around that, I sort each users' calls by those which are in the window first followed by those which aren't; then sort by most recent first within those 2 groups. This gives me the field r.
By filtering on r = 1 for each user, we only get the most recent call (adjusted for campaign windows). By filtering on LastCalledInWindow = 0 we exclude those who have been called within the campaign's window.
NB: I've used an inner query (aliased c) to hold the campaign ids and their corresponding windows. In reality you'd probably want a campaigns table holding that same information instead of coding inside the query itself.
Hopefully everything else is self-explanatory; but give me a nudge in the comments if you need any further information.
UPDATE
Just realised you'd also said "no calls is fine"... Here's a tweaked version to allow for scenarios where the person has not been called.
SQL Fiddle Example.
select *
from dbo.Users u
left outer join ( --get the most recent call per user (taking into account different campaign timescales)
select tca.UserId
, tca.CampaignId
, tca.[Date]
, case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end LastCalledInWindow
, row_number() over (partition by tca.UserId order by case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r
from dbo.TelemarketingCallAudits tca
inner join (
values (1, 60)
, (2, 70)
) c (CampaignId, DaysSinceLastCall)
on tca.CampaignId = c.CampaignId
) mrc
on mrc.UserId = u.UserId
where
(
mrc.r = 1 --only accept the most recent call
and mrc.LastCalledInWindow = 0 --only include if they haven't been contacted in the last x days
)
or mrc.r is null --no calls at all
Update: Including a default campaign offset
To include a default, you could do something like the code below (SQL Fiddle Example). Here, I've put each campaign's offset value in the Campaigns table, but created a default campaign with ID = -1 to handle anything for which there is no offset defined. I use a left join between the audit table and the campaigns table so that we get all records from the audit table, regardless of whether there's a campaign defined, then a cross join to get the default campaign. Finally, I use a coalesce to say "if the campaign isn't defined, use the default campaign".
select *
from dbo.Users u
left outer join ( --get the most recent call per user (taking into account different campaign timescales)
select tca.UserId
, tca.CampaignId
, tca.[Date]
, case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end LastCalledInWindow
, row_number() over (partition by tca.UserId order by case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r
from dbo.TelemarketingCallAudits tca
left outer join Campaigns c
on tca.CampaignId = c.CampaignId
cross join Campaigns dflt
where dflt.CampaignId = -1
) mrc
on mrc.UserId = u.UserId
where
(
mrc.r = 1 --only accept the most recent call
and mrc.LastCalledInWindow = 0 --only include if they haven't been contacted in the last x days
)
or mrc.r is null --no calls at all
That said, I'd recommend not using a default, but rather ensuring that every campaign has an offset defined. i.e. Presumably you already have a campaigns table; and since this offset value is defined per campaign, you can include a field in that table for holding this offset. Rather than leaving this as null for some records, you could set it to your default value; thus simplifying the logic / avoiding potential issues elsewhere where that value may subsequently be used.
You'd also asked about the order by clause. There is no order by 1/0; so I assume that's a typo. Rather the full statement is row_number() over (partition by tca.UserId order by case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r.
The purpose of this piece is to find the "most important" call for each user. By "most important" I basically mean the most recent, since that's generally what we're after; though there's one caveat. If a user is part of 2 campaigns, one with an offset of 30 days and one with an offset of 60 days, they may have had 2 calls, one 32 days ago and one 38 days ago. Though the call from 32 days ago is more recent, if that's on the campaign with the 30 day offset it's outside the window, whilst the older call from 38 days ago may be on the campaign with an offset of 60 days, meaning that it's within the window, so is more of interest (i.e. this user has been called within a campaign window).
Given the above requirement, here's how this code meets it:
row_number() produces a number from 1, counting up, for each row in the (sub)query's results. The counter is reset to 1 for each partition
partition by tca.UserId says that we're partitioning by the user id; so for each user there will be 1 row for which row_number() returns 1, then for each additional row for that user there will be a consecutive number returned.
The order by part of this statement defines which of each users' rows gets #1, then how the numbers progress thereafter; i.e. the first row according to the order by gets number 1, the next number 2, etc.
case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end returns 1 for calls within their campaign's window, and 0 for those outside of the window. Since we're ordering by this result in descending order, that says that any records within their campaign's window should be returned before any outside of their campaign's window.
we then order by tca.[Date] desc; i.e. the more recent calls are returned before the earlier calls.
finally, we name the output of this row number as r and in the outer query filter on r = 1; meaning that for each user we only take one row, and that's the first row according to the order criteria above; i.e. if there's a row in its campaign's window we take that, after which it's whichever call was most recent (within those in the window if there were any; then outside that window if there weren't).
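The ordering described in the points above can be tried out with a small sketch on an in-memory SQLite database (3.25 or later). The table names audits and campaigns, the fixed "today" of 2018-09-02 and the sample rows are assumptions for illustration; SQLite's date() modifier stands in for DateAdd.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature tables; offsets of 30 and 60 days match the example above.
conn.execute("CREATE TABLE campaigns (campaign_id INTEGER, days_since_last_call INTEGER)")
conn.execute("CREATE TABLE audits (user_id INTEGER, call_date TEXT, campaign_id INTEGER)")
conn.executemany("INSERT INTO campaigns VALUES (?, ?)", [(1, 30), (2, 60)])
conn.executemany(
    "INSERT INTO audits VALUES (?, ?, ?)",
    [
        (1, "2018-08-01", 1),  # 32 days before "today", 30-day window -> outside
        (1, "2018-07-26", 2),  # 38 days before "today", 60-day window -> inside
        (2, "2018-01-22", 1),  # long ago -> outside
    ],
)

top = conn.execute(
    """
    WITH flagged AS (
        SELECT a.user_id, a.call_date, a.campaign_id,
               CASE WHEN date(a.call_date, '+' || c.days_since_last_call || ' days') > :today
                    THEN 1 ELSE 0 END AS in_window
        FROM audits a
        JOIN campaigns c ON c.campaign_id = a.campaign_id
    ),
    ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY user_id
                                  ORDER BY in_window DESC, call_date DESC) AS r
        FROM flagged
    )
    SELECT user_id, call_date, campaign_id, in_window
    FROM ranked
    WHERE r = 1
    ORDER BY user_id
    """,
    {"today": "2018-09-02"},
).fetchall()

# Users whose top-ranked call is outside its window may be called again.
eligible = [row[0] for row in top if row[3] == 0]
print(top, eligible)
```

For user 1 the older call wins rank 1 because it is still inside its campaign's window, which is what keeps that user out of the final result.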
Take a look at the output of the subquery to get a better idea of exactly how this works: SQL Fiddle
I hope that explanation makes some sense / helps you to understand the code? Sadly I can't find a way to explain it more concisely than the code itself does; so if it doesn't make sense try playing with the code and seeing how that affects the output to see if that helps your understanding.

WHERE conditions being listed in a column if they are met

I have a file that I receive each morning which contains details of customers whose information doesn't meet certain criteria. I have built a script with many WHERE conditions that, if met, shows the customers' information and puts them in a file, but I'm having trouble finding out why they are wrong.
As I have many conditions in the WHERE clause, is there a way to show which column has the incorrect information?
For example i could have a table like this:
NAME|ADDRESS |PHONE|COUNTRY
John|123avenue |12345|UK
My conditions could be
SELECT * FROM CUSTOMERS
WHERE NAME LIKE 'J%'
AND LEFT(PHONE,1) = '1'
so this row would show in the file because both conditions are met, but as I have over 80 rows and 40 conditions, it's hard to look at each row and find out why it's in there.
Is there a way I can add a column which will tell me which WHERE condition has been met?
As worded, no. You should reverse your logic. Add fields that show what's wrong, then use those fields in a WHERE clause.
SELECT
*,
CASE WHEN LEFT(phone, 1) = '1' THEN 1 ELSE 0 END AS phone_starts_with_1,
CASE WHEN LEFT(name, 1) = 'Z' THEN 1 ELSE 0 END AS name_starts_with_z
FROM
customers
WHERE
phone_starts_with_1 = 1
OR name_starts_with_z = 1
Depending on which dialect of SQL you use, you may need to nest this, such that the new fields are resolved before you can use them in the WHERE clause...
SELECT
*
FROM
(
SELECT
*,
CASE WHEN LEFT(phone, 1) = '1' THEN 1 ELSE 0 END AS phone_starts_with_1,
CASE WHEN LEFT(name, 1) = 'Z' THEN 1 ELSE 0 END AS name_starts_with_z
FROM
customers
)
checks
WHERE
phone_starts_with_1 = 1
OR name_starts_with_z = 1
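A runnable sketch of the nested form, using an in-memory SQLite database: substr(..., 1, 1) replaces LEFT(), which SQLite does not provide, and the extra customer rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, address TEXT, phone TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        ("John", "123avenue", "12345", "UK"),  # row from the question
        ("Zed", "9 High St", "98765", "UK"),   # made-up row
        ("Amy", "4 Low Rd", "55555", "UK"),    # made-up row, matches no condition
    ],
)

# The flags are computed in the inner query so the outer WHERE can filter on them.
rows = conn.execute(
    """
    SELECT name, phone_starts_with_1, name_starts_with_z
    FROM (
        SELECT c.*,
               CASE WHEN substr(phone, 1, 1) = '1' THEN 1 ELSE 0 END AS phone_starts_with_1,
               CASE WHEN substr(name, 1, 1) = 'Z' THEN 1 ELSE 0 END AS name_starts_with_z
        FROM customers c
    ) checks
    WHERE phone_starts_with_1 = 1
       OR name_starts_with_z = 1
    ORDER BY name
    """
).fetchall()
print(rows)
```

Each returned row now carries one 0/1 column per condition, so the reason a customer appears in the file can be read directly from the output.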

Fetch rows based on condition

I am using PostgreSQL on Amazon Redshift.
My table is :
drop table APP_Tax;
create temp table APP_Tax(APP_nm varchar(100),start timestamp,end1 timestamp);
insert into APP_Tax values('AFH','2016-01-26 00:39:51','2016-01-26 00:39:55'),
('AFH','2016-01-26 00:39:56','2016-01-26 00:40:01'),
('AFH','2016-01-26 00:40:05','2016-01-26 00:40:11'),
('AFH','2016-01-26 00:40:12','2016-01-26 00:40:15'), --row x
('AFH','2016-01-26 00:40:35','2016-01-26 00:41:34') --row y
Expected output:
'AFH','2016-01-26 00:39:51','2016-01-26 00:40:15'
'AFH','2016-01-26 00:40:35','2016-01-26 00:41:34'
I have to compare the end time of each record with the start time of the next record and, if the time difference is < 10 seconds, carry on to the next record's end time, until the last record of the run.
I.e. datediff(seconds, '2016-01-26 00:39:55', '2016-01-26 00:39:56') is < 10 seconds
I tried this :
SELECT a.app_nm
,min(a.start)
,max(b.end1)
FROM APP_Tax a
INNER JOIN APP_Tax b
ON a.APP_nm = b.APP_nm
AND b.start > a.start
WHERE datediff(second, a.end1, b.start) < 10
GROUP BY 1
It works, but it doesn't return row y when the condition fails.
There are two reasons that row y is not returned:
b.start > a.start means that a row will never join with itself
The GROUP BY will return only one record per APP_nm value, yet all rows have the same value.
However, there are further logic errors that the query will not handle successfully. For example, how does it know when a "new" session begins?
The logic you seek can be achieved in normal PostgreSQL with the help of the DISTINCT ON clause, which returns one row per distinct value of a specific column. However, DISTINCT ON is not supported by Redshift.
Some potential workarounds: DISTINCT ON like functionality for Redshift
The output you seek would be trivial using a programming language (which can loop through results and store variables) but is difficult to apply to an SQL query (which is designed to operate on rows of results). I would recommend extracting the data and running it through a simple script (eg in Python) that could then output the Start & End combinations you seek.
This is an excellent use-case for a Hadoop Streaming function, which I have successfully implemented in the past. It would take the records as input, then 'remember' the start time and would only output a record when the desired end-logic has been met.
Sounds like what you are after is "sessionisation" of the activity events. You can achieve that in Redshift using window functions.
The complete solution might look like this:
SELECT
start AS session_start,
session_end
FROM (
SELECT
start,
end1,
lead(end1, 1)
OVER (
ORDER BY end1) AS session_end,
session_boundary
FROM (
SELECT
start,
end1,
CASE WHEN session_switch = 0 AND reverse_session_switch = 1
THEN 'start'
ELSE 'end' END AS session_boundary
FROM (
SELECT
start,
end1,
CASE WHEN datediff(seconds, end1, lead(start, 1)
OVER (
ORDER BY end1 ASC)) > 10
THEN 1
ELSE 0 END AS session_switch,
CASE WHEN datediff(seconds, lead(end1, 1)
OVER (
ORDER BY end1 DESC), start) > 10
THEN 1
ELSE 0 END AS reverse_session_switch
FROM app_tax
)
AS sessioned
WHERE session_switch != 0 OR reverse_session_switch != 0
UNION
SELECT
start,
end1,
'start'
FROM (
SELECT
start,
end1,
row_number()
OVER (PARTITION BY APP_nm
ORDER BY end1 ASC) AS row_num
FROM APP_Tax
) AS with_row_number
WHERE row_num = 1
) AS with_boundary
) AS with_end
WHERE session_boundary = 'start'
ORDER BY start ASC
;
Here is the breakdown (by subquery name):
sessioned - we first identify the switch rows (out and in), i.e. the rows for which the gap between one row's end and the next row's start exceeds the limit.
with_row_number - just a patch to extract the first row, because there is no switch into it (there is an implicit switch that we record as 'start')
with_boundary - then we identify the rows where specific switches occur. If you run the subquery by itself, it is clear that a session starts when session_switch = 0 AND reverse_session_switch = 1, and ends when the opposite occurs. All other rows are in the middle of sessions, so they are ignored.
with_end - finally, we combine each 'start' row with the end time of the following 'end' row (thus defining the session duration), and remove the 'end' rows
The with_boundary subquery answers your initial question, but typically you'd want to combine those rows to get the final result, which is the session duration.
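A more compact way to sketch the same sessionisation idea is the common "gaps and islands" pattern: flag a row as a new session when the gap from the previous row's end is 10 seconds or more, take a running sum of the flags as a session id, then aggregate. This is a different formulation from the query above, shown on an in-memory SQLite database (3.25 or later); strftime('%s', ...) stands in for Redshift's datediff, and the first row is assumed to be dated 2016, as the expected output implies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_tax (app_nm TEXT, start TEXT, end1 TEXT)")
conn.executemany(
    "INSERT INTO app_tax VALUES (?, ?, ?)",
    [
        ("AFH", "2016-01-26 00:39:51", "2016-01-26 00:39:55"),  # assumed 2016, per expected output
        ("AFH", "2016-01-26 00:39:56", "2016-01-26 00:40:01"),
        ("AFH", "2016-01-26 00:40:05", "2016-01-26 00:40:11"),
        ("AFH", "2016-01-26 00:40:12", "2016-01-26 00:40:15"),  # row x
        ("AFH", "2016-01-26 00:40:35", "2016-01-26 00:41:34"),  # row y
    ],
)

sessions = conn.execute(
    """
    WITH flagged AS (
        SELECT app_nm, start, end1,
               -- 1 when this row opens a new session (first row, or gap >= 10 seconds)
               CASE WHEN LAG(end1) OVER (PARTITION BY app_nm ORDER BY start) IS NULL
                      OR CAST(strftime('%s', start) AS INTEGER)
                         - CAST(strftime('%s', LAG(end1) OVER (PARTITION BY app_nm
                                                               ORDER BY start)) AS INTEGER) >= 10
                    THEN 1 ELSE 0 END AS new_session
        FROM app_tax
    ),
    numbered AS (
        -- running total of the flags gives every row of a session the same id
        SELECT *,
               SUM(new_session) OVER (PARTITION BY app_nm ORDER BY start) AS session_id
        FROM flagged
    )
    SELECT app_nm, MIN(start) AS session_start, MAX(end1) AS session_end
    FROM numbered
    GROUP BY app_nm, session_id
    ORDER BY session_start
    """
).fetchall()
print(sessions)
```

On Redshift the gap test would be written with datediff(second, ...) and the running sum would need an explicit frame clause (rows between unbounded preceding and current row), since Redshift requires one for aggregates with ORDER BY.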