How to aggregate events for denormalization? - Hive

A user clickstream is represented by events with type and event_timestamp properties. For example:
userid type event_timestamp (yyyy-MM-ddThh:mm:ss.SSS)
01 install 2018-01-01T00:00:00.000
01 level_up 2018-01-15T00:00:00.000
01 new_item 2018-02-03T00:00:00.000
All input data is stored in partitions keyed by the event_timestamp field, e.g. in folders named 2018-01-01, 2018-01-02, and so on.
To do the denormalization, there is a Hive query like this (just an idea; the syntax is not checked):
select userid,
       MIN(install_date),
       MIN(level_up_date),
       MIN(new_item_date)
from (
    select
        userid,
        CASE when type = 'install' then event_timestamp else null END as install_date,
        CASE when type = 'level_up' then event_timestamp else null END as level_up_date,
        CASE when type = 'new_item' then event_timestamp else null END as new_item_date
    from event_table
) t
group by userid;
When this is run over all the data, everything works. But what about partitioning?
When the input data is split by event_timestamp and only newly arrived data is processed (i.e. input partitions are processed separately), I get 3 rows instead of 1 (in different partitions, of course):
userid install_date level_up_date new_item_date
01 2018-01-01 null null
01 null 2018-01-15 null
01 null null 2018-02-03
Instead of:
userid install_date level_up_date new_item_date
01 2018-01-01 2018-01-15 2018-02-03
Note that the time gap between the dates is unbounded: a user may send the install event this year and level_up next year.
Is there any common way to solve this? Theoretically, I could keep storing different events in different partitions and run select userid, MIN(install_date), MIN(level_up_date), MIN(new_item_date) from processed_data on the entire processed data set.
But that is a full data set scan.

This is called conditional aggregation. The following would work.
select userid,
MIN(CASE when type = 'install' then event_timestamp END) as install_date,
MIN(CASE when type = 'level_up' then event_timestamp END) as level_up_date,
MIN(CASE when type = 'new_item' then event_timestamp END) as new_item_date
from event_table
group by userid
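That solves the single-pass case. For the partitioning concern, one option is to keep a merged result table and fold each newly processed partition into it, so only the new partitions are scanned per run. A minimal sketch, assuming a hypothetical user_events_agg result table and a hypothetical dt partition column for the new data (whether an INSERT OVERWRITE may read from the table it overwrites depends on your Hive version and setup; staging to a temporary table is the conservative route):
INSERT OVERWRITE TABLE user_events_agg
SELECT userid,
       MIN(install_date)  AS install_date,
       MIN(level_up_date) AS level_up_date,
       MIN(new_item_date) AS new_item_date
FROM (
    -- previously merged state (one row per user)
    SELECT userid, install_date, level_up_date, new_item_date
    FROM user_events_agg
    UNION ALL
    -- conditional aggregation over the newly arrived partition(s) only
    SELECT userid,
           MIN(CASE WHEN type = 'install'  THEN event_timestamp END),
           MIN(CASE WHEN type = 'level_up' THEN event_timestamp END),
           MIN(CASE WHEN type = 'new_item' THEN event_timestamp END)
    FROM event_table
    WHERE dt = '2018-02-03'   -- hypothetical partition column: new data only
    GROUP BY userid
) merged
GROUP BY userid;
Since MIN is associative, merging per-partition minimums with the stored minimums gives the same result as aggregating the full history, without a full data set scan.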

Related

MSSQL - Delete duplicate rows using common column values

I haven't used SQL in quite a while, so I'm a bit lost here. I want to check for rows with duplicate values in the Duration and date columns and remove them from the query results. I need to keep the rows where the Status column = 'Transfer', since these hold more information about the call and how it was routed through our system.
I want to use this for a dashboard, which would include counting the total number of calls from the query, which is why I cannot keep both rows.
Here's the (Simplified) code used:
SELECT status, user, duration, phonenumber, date
FROM (SELECT * FROM view_InboundPhoneCalls) as Phonecalls
WHERE date>=DATEADD(dd, -15, getdate())
--GROUP BY duration
Which gives something of the sort:
Status    User           Duration  phonenumber                       date
Received  Receptionnist  00:34:03  from: +1234567890                 2021-09-30 16:01:57
Received  Receptionnist  00:03:12  from: +9876543210                 2021-09-30 16:02:40
Transfer  User1          00:05:12  +14161654965;Receptionnist;User1  2021-09-30 16:01:57
Received  Receptionnist  00:05:12  from: +14161654965                2021-09-30 16:01:57
The end result would be something like this:
Status    User           Duration  phonenumber                       date
Received  Receptionnist  00:34:03  from: +1234567890                 2021-09-30 16:01:57
Received  Receptionnist  00:03:12  from: +9876543210                 2021-09-30 16:02:40
Transfer  Receptionnist  00:05:12  +14161654965;Receptionnist;User1  2021-09-30 16:01:57
The normal "trick" is to detect duplicates first. One of the easier ways is a CTE (Common Table Expression) along with the ROW_NUMBER() function.
Part One - Mark the duplicates
WITH cte_Sorted_List
(
    status, usertype, duration, phonenumber, dated, duplicate_check
)
AS
(   -- only use the required fields, to speed things up
    SELECT [status], [user], duration, phonenumber, [date],
        -- the marks depend on choosing the correct columns!
        ROW_NUMBER() OVER
        (   -- group duplicates by the relevant columns
            PARTITION BY [user], phonenumber, [date], duration
            -- with the correct sort order;
            -- bit of a hack: as 'T' sorts after 'R',
            -- each 'Transfer' row gets row number 1 in its duplicate group
            ORDER BY [status] DESC
        ) AS duplicate_check
    FROM view_InboundPhoneCalls
    -- and lose all unnecessary data
    WHERE [date] >= DATEADD(dd, -15, getdate())
)
Part Two - show relevant rows
SELECT status, usertype, duration, phonenumber, dated
FROM cte_Sorted_List
WHERE duplicate_check = 1;
The CTE extracts the required fields in a single pass; only that data is then used for the output.
You could go for a blacklist, say with a CTE, then filter out the undesired rows.
Something like:
WITH Blacklist ([date], [duration]) AS (
    SELECT [date], [duration]
    FROM view_InboundPhoneCalls
    GROUP BY [date], [duration]
    HAVING count(*) > 1
)
SELECT Phonecalls.[status], Phonecalls.[user], Phonecalls.[duration],
       Phonecalls.[phonenumber], Phonecalls.[date]
FROM view_InboundPhoneCalls AS Phonecalls
LEFT JOIN Blacklist
    ON Phonecalls.[date] = Blacklist.[date]
    AND Phonecalls.[duration] = Blacklist.[duration]
WHERE Blacklist.[date] IS NULL
    OR Phonecalls.[status] = 'Transfer'
You can use row-numbering for this, along with a custom ordering. There is no need for any joins.
SELECT status, [user], duration, phonenumber, date
FROM (
SELECT *,
rn = ROW_NUMBER() OVER (PARTITION BY duration, date
ORDER BY CASE WHEN Status = 'Transfer' THEN 1 ELSE 2 END)
FROM view_InboundPhoneCalls
WHERE date >= DATEADD(day, -15, getdate())
) as Phonecalls
WHERE rn = 1
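Since the title asks about deleting the duplicates rather than just filtering them out of a SELECT, note that in T-SQL the same row-numbering CTE can drive a DELETE directly, because a CTE over a single table is updatable. A minimal sketch, assuming a hypothetical dbo.InboundPhoneCalls base table (deleting through view_InboundPhoneCalls only works if the view itself is updatable):
WITH cte_Duplicates AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY duration, [date]
                              ORDER BY CASE WHEN [status] = 'Transfer' THEN 1 ELSE 2 END
                             ) AS rn
    FROM dbo.InboundPhoneCalls   -- hypothetical base table behind the view
    WHERE [date] >= DATEADD(day, -15, getdate())
)
DELETE FROM cte_Duplicates
WHERE rn > 1;
This keeps the 'Transfer' row of each (duration, date) group and removes the rest.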

Find minimum overlap of Each Status

I need to find the date ranges where the status is Missing/Not Ready in all of the groups (only the overlapping date ranges where every group has a status of Missing or Not Ready).
ID  Group  Eff_Date        Exp_Date        Status
1   1      1/1/18 10:00    3/4/18 15:23    Ready
1   1      3/4/18 15:24    7/12/18 13:54   Not Ready
1   1      7/12/18 13:55   11/22/19 11:20  Missing
1   1      11/22/19 11:21  9/25/20 1:12    Ready
1   1      9/25/20 1:13    12/31/99        Missing
1   2      1/1/16 10:00    2/2/17 17:20    Ready
1   2      2/2/17 17:21    5/25/18 1:23    Missing
1   2      5/25/18 1:24    9/2/18 4:15     Not Ready
1   2      9/2/18 4:16     6/3/21 7:04     Missing
1   2      6/3/21 7:04     12/31/99        Ready
Output for Not Ready (the date range where every group has Not Ready status):
5/25/18 1:24    7/12/18 13:54   Not Ready
Output for Missing (the date range where every group has Missing status):
9/25/20 1:13    6/3/21 7:04     Missing
Note: each ID can have any number of groups. The database is Snowflake.
You can do this by unpivoting and counting: each interval start contributes +1 and each interval end contributes -1, so a running sum tells you how many groups are in the given status at any point in time. Assuming that the periods do not overlap for a given id:
with x as (
    select eff_date as date, 1 as inc
    from t
    where status = 'Missing'
    union all
    select exp_date as date, -1 as inc
    from t
    where status = 'Missing'
)
select date, next_date, active_on_date
from (select date,
             sum(sum(inc)) over (order by date) as active_on_date,
             lead(date) over (order by date) as next_date
      from x
      group by date
     ) x
where active_on_date = (select count(distinct "GROUP") from t);  -- GROUP is reserved in Snowflake, hence the quoting
Note: This handles one status at a time, which is what this question is asking. If you want to handle all event types, then ask a new question with appropriate sample data, desired results, and explanation.
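For the 'Not Ready' ranges in the question, the same query is simply rerun with the status literal swapped; nothing else changes:
with x as (
    select eff_date as date, 1 as inc
    from t
    where status = 'Not Ready'
    union all
    select exp_date as date, -1 as inc
    from t
    where status = 'Not Ready'
)
select date, next_date
from (select date,
             sum(sum(inc)) over (order by date) as active_on_date,
             lead(date) over (order by date) as next_date
      from x
      group by date
     ) x
where active_on_date = (select count(distinct "GROUP") from t);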

Improving the performance of a query

My background is Oracle, but we've moved to Hadoop on AWS and I'm accessing our logs using Hive SQL. I've been asked to return a report, by uptime band, of cases where the number of high-severity errors of any given type on a system exceeds 9 in any rolling period of 30 days (9 in reality, but I use 2 in the example to keep the data volumes down). I've written code to do this, but I don't really understand performance tuning in Hive; a lot of the stuff I learned in Oracle doesn't seem applicable.
Can this be improved?
Data is roughly
CREATE TABLE LOG_TABLE
(SYSTEM_ID VARCHAR(1),
EVENT_TYPE VARCHAR(2),
EVENT_ID VARCHAR(3),
EVENT_DATE DATE,
UPTIME INT);
INSERT INTO LOG_TABLE
VALUES
('1','A1','138','2018-10-29',34),
('1','A2','146','2018-11-13',49),
('1','A3','140','2018-11-02',38),
('1','B1','130','2018-10-13',18),
('1','B1','150','2018-11-19',55),
('1','B2','137','2018-10-27',32),
('2','A1','128','2018-10-11',59),
('2','A1','131','2018-10-16',64),
('2','A1','136','2018-10-25',73),
('2','A2','139','2018-10-31',79),
('2','A2','145','2018-11-11',90),
('2','A2','147','2018-11-14',93),
('2','A3','135','2018-10-24',72),
('2','B1','124','2018-10-03',51),
('2','B1','133','2018-10-19',67),
('2','B2','134','2018-10-22',70),
('2','B2','142','2018-11-06',85),
('2','B2','148','2018-11-15',94),
('2','B2','149','2018-11-17',96),
('3','A2','127','2018-10-10',122),
('3','A3','123','2018-10-01',113),
('3','A3','125','2018-10-06',118),
('3','A3','126','2018-10-07',119),
('3','A3','141','2018-11-05',148),
('3','A3','144','2018-11-10',153),
('3','B1','132','2018-10-18',130),
('3','B1','143','2018-11-08',151),
('3','B2','129','2018-10-12',124);
The code that works is as follows. I do a self-join on the log table to return all pairs of records with the gap between them, keeping only those with a gap of 30 days or less. I then select those where there are more than 2 events into a second CTE, and from these I count distinct event types and event IDs by system and uptime band.
WITH EVENTGAP AS
(SELECT T1.EVENT_TYPE,
T1.SYSTEM_ID,
T1.EVENT_ID,
T2.EVENT_ID AS EVENT_ID2,
T1.EVENT_DATE,
T2.EVENT_DATE AS EVENT_DATE2,
T1.UPTIME,
DATEDIFF(T2.EVENT_DATE,T1.EVENT_DATE) AS EVENT_GAP
FROM LOG_TABLE T1
INNER JOIN LOG_TABLE T2
ON (T1.EVENT_TYPE=T2.EVENT_TYPE
AND T1.SYSTEM_ID=T2.SYSTEM_ID)
WHERE DATEDIFF(T2.EVENT_DATE,T1.EVENT_DATE) BETWEEN 0 AND 30
AND T1.UPTIME BETWEEN 0 AND 299
AND T2.UPTIME BETWEEN 0 AND 330),
EVENTCOUNT
AS (SELECT EVENT_TYPE,
SYSTEM_ID,
EVENT_ID,
EVENT_DATE,
COUNT(1)
FROM EVENTGAP
GROUP BY EVENT_TYPE,
SYSTEM_ID,
EVENT_ID,
EVENT_DATE
HAVING COUNT(1)>2)
SELECT EVENTGAP.SYSTEM_ID,
CASE WHEN FLOOR(UPTIME/50) = 0 THEN '0-49'
WHEN FLOOR(UPTIME/50) = 1 THEN '50-99'
WHEN FLOOR(UPTIME/50) = 2 THEN '100-149'
WHEN FLOOR(UPTIME/50) = 3 THEN '150-199'
WHEN FLOOR(UPTIME/50) = 4 THEN '200-249'
WHEN FLOOR(UPTIME/50) = 5 THEN '250-299' END AS UPTIME_BAND,
COUNT(DISTINCT EVENTGAP.EVENT_ID2) AS EVENT_COUNT,
COUNT(DISTINCT EVENTGAP.EVENT_TYPE) AS TYPE_COUNT
FROM EVENTGAP
WHERE EVENTGAP.EVENT_ID IN (SELECT DISTINCT EVENTCOUNT.EVENT_ID FROM EVENTCOUNT)
GROUP BY EVENTGAP.SYSTEM_ID,
CASE WHEN FLOOR(UPTIME/50) = 0 THEN '0-49'
WHEN FLOOR(UPTIME/50) = 1 THEN '50-99'
WHEN FLOOR(UPTIME/50) = 2 THEN '100-149'
WHEN FLOOR(UPTIME/50) = 3 THEN '150-199'
WHEN FLOOR(UPTIME/50) = 4 THEN '200-249'
WHEN FLOOR(UPTIME/50) = 5 THEN '250-299' END
This gives the following result, which should be unique counts of event IDs and event types that have 3 or more events falling in any rolling 30-day period. Some events may be in more than one period, but will only be counted once.
EVENTGAP.SYSTEM_ID UPTIME_BAND EVENT_COUNT TYPE_COUNT
2 50-99 10 3
3 100-149 4 1
In both Hive and Oracle, you would want to do this using window functions, using a window frame clause. The exact logic is different in the two databases.
In Hive you can use RANGE BETWEEN if you convert event_date to a number. A typical method is to subtract a fixed value from it; another is to use Unix timestamps:
select lt.*
from (select lt.*,
             count(*) over (partition by system_id, event_type
                            order by unix_timestamp(event_date)
                            range between 60*60*24*30 preceding and current row  -- 30 days in seconds
                           ) as rolling_count
      from log_table lt
     ) lt
where rolling_count > 2 -- or > 9 for the real threshold
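To produce the banded report from this, the rolling count can be wrapped in the same uptime banding as the original query. A sketch building on the subquery above (note the semantics differ slightly from the self-join version: here an event is flagged only when it closes a 30-day window that exceeds the threshold):
select system_id,
       uptime_band,
       count(distinct event_id)   as event_count,
       count(distinct event_type) as type_count
from (select lt.*,
             case when floor(uptime/50) = 0 then '0-49'
                  when floor(uptime/50) = 1 then '50-99'
                  when floor(uptime/50) = 2 then '100-149'
                  when floor(uptime/50) = 3 then '150-199'
                  when floor(uptime/50) = 4 then '200-249'
                  when floor(uptime/50) = 5 then '250-299' end as uptime_band,
             count(*) over (partition by system_id, event_type
                            order by unix_timestamp(event_date)
                            range between 60*60*24*30 preceding and current row
                           ) as rolling_count
      from log_table lt
     ) lt
where rolling_count > 2 -- or > 9
group by system_id, uptime_band;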

SQL - select processes that were cancelled with a date

I have a table showing the statuses of processes (in particular, I am searching for canceled processes); the rows are not sorted. I want to select all processes that were resumed again after being canceled. My idea is to pin down the specific cancellation date and check whether there are other statuses after the cancellation status.
Example:
[id] [moddate] [status]
1    01/01/17  started
1    02/01/17  waiting for signature
1    04/01/17  canceled
1    09/01/17  delivery documents
1    11/01/17  completed   <-- I want to select these statuses (canceled and then somehow resumed)
I've got something like this to start with:
SELECT * FROM DATABASE
WHERE APPLICATIONSTATUSSYMBOL LIKE 'CANCELED%'
AND APPLICATIONDATE BETWEEN '17/01/01' AND '17/07/24';
One method for doing this uses window functions:
select d.*
from (select d.*,
max(case when status = 'canceled' then applicationdate end) over (partition by id) as canceldate
from database
where applicationdate between date '2017-01-01' and date '2017-07-24'
) d
where applicationdate > canceldate;
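A join-free alternative, in case window functions are unavailable: an EXISTS test against the cancellation rows (a different technique from the window-function version above, using the same table and column names; note it flags rows after the first cancellation, whereas the window version uses the latest one):
select d.*
from database d
where exists (select 1
              from database c
              where c.id = d.id
                and c.status = 'canceled'
                and c.applicationdate < d.applicationdate  -- d is a later status row after a cancellation
             );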

Group report together, possibly with SQL?

I have a table called Register which contains the following fields:
Date, AMPM, Mark.
A day can have two records (one AM, one PM). It's fairly easy to select and display all the records in a list ordered by date ascending.
What I would like to do is display the data as a grid, something along the lines of:
         | Mon | Tues | Wed | Thurs | Fri | Sat
9/8/2014 | /\  | /P   | /\  | L     | /\  | /
That is, have a week-beginning date and then group the days of that week together. I'm not even sure SQL is the best option for this, but the GROUP BY commands seem to suggest it may be able to do this.
The Data structure is as follows.
Date, AMPM, Mark
9/8/2014, AM, /
9/8/2014, PM, \
9/9/2014, AM, /
9/9/2014, PM, P
9/10/2014, AM, /
9/10/2014, PM, \
9/11/2014, PM, L
....
The mark field can contain a number of letters. P for instance means they are participating in a sporting activity. L means they were late.
Does anyone have any resources they can point me towards that would be helpful? I'm not even sure what this type of report is called, or whether I should be using SQL or JavaScript to group this data in a presentable format. The / represents the AM mark and the \ the PM mark.
This kind of report is usually called a pivot (or crosstab). The following query would get you the desired result. If you need Sunday as well, you'll have to add a small condition to test for when days_after_last_Monday = 6 in the CASE statement.
select
    last_Monday Week_Starting,
    -- if the # of days between the previous Monday and reg_date is zero, take that day's mark, and so on
    max(case when days_after_last_Monday = 0 then mark end) Mon,
    max(case when days_after_last_Monday = 1 then mark end) Tues,
    max(case when days_after_last_Monday = 2 then mark end) Wed,
    max(case when days_after_last_Monday = 3 then mark end) Thurs,
    max(case when days_after_last_Monday = 4 then mark end) Fri,
    max(case when days_after_last_Monday = 5 then mark end) Sat
from
(
    select
        reg_date,
        last_Monday,
        julianday(reg_date) - julianday(last_Monday) as days_after_last_Monday, -- number of days between the previous Monday and reg_date
        mark
    from
    (
        select
            reg_date,
            case
                when cast(strftime('%w', reg_date) as integer) = 1 then date(reg_date, 'weekday 1')
                else date(reg_date, 'weekday 1', '-7 days')
            end last_Monday, -- the date of the previous Monday
            mark
        from
        (
            select
                reg_date,
                group_concat(mark, '') mark -- concatenate the AM and PM marks for each reg_date
            from
            (
                select reg_date, ampm, mark
                from register
                order by reg_date, ampm -- order by ampm so that AM rows are selected before PM
            )
            group by reg_date
        )
    )
)
group by last_Monday
order by last_Monday;
SQL Fiddle demo
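For anyone wanting to reproduce this locally, here is a minimal SQLite schema with the sample rows from the question (column names taken from the query above; dates converted to the ISO format SQLite's date functions expect):
CREATE TABLE register (
    reg_date TEXT,  -- ISO yyyy-mm-dd, as required by julianday()/strftime()
    ampm     TEXT,  -- 'AM' or 'PM'
    mark     TEXT   -- e.g. '/', '\', 'P', 'L'
);
INSERT INTO register (reg_date, ampm, mark) VALUES
    ('2014-09-08', 'AM', '/'),
    ('2014-09-08', 'PM', '\'),
    ('2014-09-09', 'AM', '/'),
    ('2014-09-09', 'PM', 'P'),
    ('2014-09-10', 'AM', '/'),
    ('2014-09-10', 'PM', '\'),
    ('2014-09-11', 'PM', 'L');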