Improving the performance of a query - sql
My background is Oracle, but we've moved to Hadoop on AWS and I'm accessing our logs using Hive SQL. I've been asked to produce a report, broken down by uptime band, of systems where the number of high-severity errors of any given type exceeds 9 in any rolling 30-day period (the real threshold is 9, but I use 2 in the example to keep the sample data small). I've written code that does this, but I don't really understand performance tuning in Hive, and a lot of what I learned in Oracle doesn't seem applicable.
Can this be improved?
The data is roughly:
CREATE TABLE LOG_TABLE
(SYSTEM_ID VARCHAR(1),
EVENT_TYPE VARCHAR(2),
EVENT_ID VARCHAR(3),
EVENT_DATE DATE,
UPTIME INT);
INSERT INTO LOG_TABLE
VALUES
('1','A1','138','2018-10-29',34),
('1','A2','146','2018-11-13',49),
('1','A3','140','2018-11-02',38),
('1','B1','130','2018-10-13',18),
('1','B1','150','2018-11-19',55),
('1','B2','137','2018-10-27',32),
('2','A1','128','2018-10-11',59),
('2','A1','131','2018-10-16',64),
('2','A1','136','2018-10-25',73),
('2','A2','139','2018-10-31',79),
('2','A2','145','2018-11-11',90),
('2','A2','147','2018-11-14',93),
('2','A3','135','2018-10-24',72),
('2','B1','124','2018-10-03',51),
('2','B1','133','2018-10-19',67),
('2','B2','134','2018-10-22',70),
('2','B2','142','2018-11-06',85),
('2','B2','148','2018-11-15',94),
('2','B2','149','2018-11-17',96),
('3','A2','127','2018-10-10',122),
('3','A3','123','2018-10-01',113),
('3','A3','125','2018-10-06',118),
('3','A3','126','2018-10-07',119),
('3','A3','141','2018-11-05',148),
('3','A3','144','2018-11-10',153),
('3','B1','132','2018-10-18',130),
('3','B1','143','2018-11-08',151),
('3','B2','129','2018-10-12',124);
and the code that works is as follows. I self-join the log table to pair each record with the others of the same system and event type, keeping pairs whose gap is 30 days or less. A second CTE keeps only the events that anchor more than 2 such pairs, and from these the final query counts distinct event types and event IDs by system and uptime band.
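-- EVENTGAP: pair each event with the later events of the same system and type that fall within 30 days of it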
WITH EVENTGAP AS
(SELECT T1.EVENT_TYPE,
T1.SYSTEM_ID,
T1.EVENT_ID,
T2.EVENT_ID AS EVENT_ID2,
T1.EVENT_DATE,
T2.EVENT_DATE AS EVENT_DATE2,
T1.UPTIME,
DATEDIFF(T2.EVENT_DATE,T1.EVENT_DATE) AS EVENT_GAP
FROM LOG_TABLE T1
INNER JOIN LOG_TABLE T2
ON (T1.EVENT_TYPE=T2.EVENT_TYPE
AND T1.SYSTEM_ID=T2.SYSTEM_ID)
WHERE DATEDIFF(T2.EVENT_DATE,T1.EVENT_DATE) BETWEEN 0 AND 30
AND T1.UPTIME BETWEEN 0 AND 299
AND T2.UPTIME BETWEEN 0 AND 330),
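-- EVENTCOUNT: anchor events whose 30-day window holds more than 2 events (each event pairs with itself, so this means 3 or more)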
EVENTCOUNT
AS (SELECT EVENT_TYPE,
SYSTEM_ID,
EVENT_ID,
EVENT_DATE,
COUNT(1) AS WINDOW_EVENTS
FROM EVENTGAP
GROUP BY EVENT_TYPE,
SYSTEM_ID,
EVENT_ID,
EVENT_DATE
HAVING COUNT(1)>2)
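-- Final step: using only windows anchored at qualifying events, count distinct event IDs and types per system and uptime band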
SELECT EVENTGAP.SYSTEM_ID,
CASE WHEN FLOOR(UPTIME/50) = 0 THEN '0-49'
WHEN FLOOR(UPTIME/50) = 1 THEN '50-99'
WHEN FLOOR(UPTIME/50) = 2 THEN '100-149'
WHEN FLOOR(UPTIME/50) = 3 THEN '150-199'
WHEN FLOOR(UPTIME/50) = 4 THEN '200-249'
WHEN FLOOR(UPTIME/50) = 5 THEN '250-299' END AS UPTIME_BAND,
COUNT(DISTINCT EVENTGAP.EVENT_ID2) AS EVENT_COUNT,
COUNT(DISTINCT EVENTGAP.EVENT_TYPE) AS TYPE_COUNT
FROM EVENTGAP
WHERE EVENTGAP.EVENT_ID IN (SELECT DISTINCT EVENTCOUNT.EVENT_ID FROM EVENTCOUNT)
GROUP BY EVENTGAP.SYSTEM_ID,
CASE WHEN FLOOR(UPTIME/50) = 0 THEN '0-49'
WHEN FLOOR(UPTIME/50) = 1 THEN '50-99'
WHEN FLOOR(UPTIME/50) = 2 THEN '100-149'
WHEN FLOOR(UPTIME/50) = 3 THEN '150-199'
WHEN FLOOR(UPTIME/50) = 4 THEN '200-249'
WHEN FLOOR(UPTIME/50) = 5 THEN '250-299' END
This gives the following result: unique counts of event IDs and event types that have 3 or more events falling in any rolling 30-day period. An event may fall in more than one period but is only counted once.
EVENTGAP.SYSTEM_ID  UPTIME_BAND  EVENT_COUNT  TYPE_COUNT
2                   50-99        10           3
3                   100-149      4            1
In both Hive and Oracle, you would want to do this using window functions with a window frame clause; the exact logic differs between the two databases.
In Hive you can use RANGE BETWEEN if you convert event_date to a number. One typical method is to order by the day offset from a fixed date. Another is to use Unix timestamps:
select lt.*
from (select lt.*,
             count(*) over (partition by system_id, event_type
                            order by unix_timestamp(event_date)
                            range between 60*60*24*30 preceding and current row -- 30 days in seconds
                           ) as rolling_count
      from log_table lt
     ) lt
where rolling_count > 2 -- or > 9 for the real threshold
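The fixed-date method mentioned above can be sketched the same way. Hive's DATEDIFF returns whole days, so ordering by the day offset from an arbitrary anchor date lets the frame be written directly in days (an untested sketch of the same query, under the same assumptions):

select lt.*
from (select lt.*,
             count(*) over (partition by system_id, event_type
                            order by datediff(event_date, '1970-01-01')
                            range between 30 preceding and current row
                           ) as rolling_count
      from log_table lt
     ) lt
where rolling_count > 2 -- or > 9 for the real threshold

In Oracle, which accepts an interval in the frame clause when ordering by a DATE, the equivalent sketch would be:

select t.*
from (select t.*,
             count(*) over (partition by system_id, event_type
                            order by event_date
                            range between interval '30' day preceding and current row
                           ) as rolling_count
      from log_table t
     ) t
where rolling_count > 2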
Related
See the distribution of secondary requests grouped by time interval in sql
I have the following table:

RequestId  Type  Date        ParentRequestId
1          1     2020-10-15  null
2          2     2020-10-19  1
3          1     2020-10-20  null
4          2     2020-11-15  3

For this example I am interested in request types 1 and 2, to keep the example simple. My task is to query a big database and see the distribution of secondary requests based on the difference in dates from the parent one. The result would look like:

Interval    Percentage
0-7 days    50 %
8-15 days   0 %
16-50 days  50 %

The first line of the expected result comes from the request with id 2, and the third line from the request with id 4, because their date differences fall in those intervals. How do I achieve this? I'm using SQL Server 2014.
We like to see your attempts, but by the looks of it, it seems like you're going to need to treat this table as 2 tables and do a basic GROUP BY, but make it fancy by grouping on a CASE statement.

WITH dateDiffs as (
    /* perform our date calculations first, to get that out of the way */
    SELECT DATEDIFF(Day, parent.[Date], child.[Date]) as daysDiff,
           1 as rowsFound
    FROM (SELECT RequestID, [Date] FROM myTable WHERE Type = 1) parent
    INNER JOIN (SELECT ParentRequestID, [Date] FROM myTable WHERE Type = 2) child
            ON parent.requestID = child.parentRequestID
)
/* Now group and aggregate and enjoy your maths! */
SELECT case when daysDiff between 0 and 7 then '0-7'
            when daysDiff between 8 and 15 then '8-15'
            when daysDiff between 16 and 50 THEN '16-50'
            else '50+'
       end as myInterval,
       sum(rowsFound) as totalFound,
       (select sum(rowsFound) from dateDiffs) as totalRows,
       1.0 * sum(rowsFound) / (select sum(rowsFound) from dateDiffs) * 100.00 as percentFound
FROM dateDiffs
GROUP BY case when daysDiff between 0 and 7 then '0-7'
              when daysDiff between 8 and 15 then '8-15'
              when daysDiff between 16 and 50 THEN '16-50'
              else '50+'
         end;
This seems like basically a join and group by query:

with dates as (
    select 0 as lo, 7 as hi, '0-7 days' as grp
    union all
    select 8 as lo, 15 as hi, '8-15 days'
    union all
    select 16 as lo, 50 as hi, '16-50 days'
)
select d.grp, count(*) as cnt,
       count(*) * 1.0 / sum(count(*)) over () as ratio
from dates d left join
     (t join t tp on tp.RequestId = t.ParentRequestId)
     on datediff(day, tp.date, t.date) between d.lo and d.hi
group by d.grp
order by min(d.lo);

The only trick is generating all the date groups, so you have rows with zero values.
Determine cluster of access time within 10min intervals per user per day in SQL Server
How can I write a SQL query that, from the sample data, groups or clusters the access_time per user per day within 10-minute intervals?
This is a complete guess, based on reading between the lines, and is untested due to a lack of consumable sample data. It looks, however, like you are after a triangular JOIN (these can perform poorly, especially as this won't be SARGable) and a DENSE_RANK:

SELECT YT.[date],
       YT.User_ID,
       YT2.AccessTime,
       DENSE_RANK() OVER (PARTITION BY YT.[date], YT.User_ID ORDER BY YT.AccessTime) AS Cluster
FROM dbo.YourTable YT
JOIN dbo.YourTable YT2
  ON YT.[date] = YT2.[date]
 AND YT.User_ID = YT2.User_ID
 AND YT.AccessTime <= YT2.AccessTime --This will join the row to itself
 AND DATEADD(MINUTE,10,YT.AccessTime) >= YT2.AccessTime; --That is intentional
If I have understood your problem, you want to group all accesses for a user in a day where every access in the group falls within a 10-minute interval, not counting single accesses, so an access more than 10 minutes away from every other is not counted as a cluster. You can identify the clusters by joining the access table to itself to get all possible 10-minute time intervals and numbering them. Finally, simply rejoin the access table to get the accesses for each cluster:

; with user_clusters as (
    select a1.date, a1.user_id,
           a1.access_time cluster_start,
           a2.access_time cluster_end,
           ROW_NUMBER() over (partition by a1.date, a1.user_id order by a1.access_time) user_cluster_id
    from ACCESS_TIMES a1
    join ACCESS_TIMES a2
      on a1.date = a2.date
     and a1.user_id = a2.user_id
     and a1.access_time < a2.access_time
     and datediff(minute, a1.access_time, a2.access_time) < 10
)
select *
from user_clusters c
join ACCESS_TIMES a
  on a.date = c.date
 and a.user_id = c.user_id
 and a.access_time between c.cluster_start and c.cluster_end
order by a.date, a.user_id, c.user_cluster_id, a.access_time

output:

date          user_id   access_time            user_cluster_id
'2020-09-19'  'AA083P'  '2020-09-19 18:15:00'  1
'2020-09-19'  'AA083P'  '2020-09-19 18:22:00'  1
'2020-09-19'  'AA083P'  '2020-09-19 18:22:00'  2
'2020-09-19'  'AA083P'  '2020-09-19 18:28:00'  2
'2020-09-20'  'AB162Y'  '2020-09-20 19:34:00'  1
'2020-09-20'  'AB162Y'  '2020-09-20 19:37:00'  1
Creating average for specific timeframe
I'm setting up a time series with each row = 1 hr. The input data sometimes has multiple values per hour; this can vary. Right now the specific code looks like this:

select patientunitstayid,
       generate_series(ceil(min(nursingchartoffset)/60.0),
                       ceil(max(nursingchartoffset)/60.0)) as hr,
       avg(case when nibp_systolic >= 1 and nibp_systolic <= 250
                then nibp_systolic else null end) as nibp_systolic_avg
from nc
group by patientunitstayid
order by patientunitstayid asc;

The data it generates takes the average over the entire time series for each patient instead of per hour. How can I fix this?
I'm expecting something like this:

select nc.patientunitstayid, gs.hr,
       avg(case when nc.nibp_systolic >= 1 and nc.nibp_systolic <= 250
                then nibp_systolic
           end) as nibp_systolic_avg
from (select nc.*,
             min(nursingchartoffset) over (partition by patientunitstayid) as min_nursingchartoffset,
             max(nursingchartoffset) over (partition by patientunitstayid) as max_nursingchartoffset
      from nc
     ) nc cross join lateral
     generate_series(ceil(min_nursingchartoffset/60.0),
                     ceil(max_nursingchartoffset/60.0)
                    ) as gs(hr)
group by nc.patientunitstayid, hr
order by nc.patientunitstayid asc, hr asc;

That is, you need to be aggregating by hr. I put this into the from clause to highlight that it generates rows. If you are using an older version of Postgres, then you might not have lateral joins; if so, just use a subquery in the from clause.

EDIT: You can also try:

from (select nc.*,
             generate_series(ceil(min(nursingchartoffset) over (partition by patientunitstayid) / 60.0),
                             ceil(max(nursingchartoffset) over (partition by patientunitstayid) / 60.0)
                            ) as hr
      from nc
     ) nc

and adjust the references to hr in the outer query.
Trying to calculate a SUM from another column in Materialized View
I am trying to calculate the sum of working days per month in an Oracle MV. Here is my query:

CREATE MATERIALIZED VIEW DIM_DATE_MV
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
START WITH sysdate NEXT (TRUNC(sysdate)+1) + 7 / 24
as
SELECT CAL.DATE_D as ID_DATE,
       (CASE WHEN ( (TRIM(TO_CHAR(CAL.DATE_D,'Day','nls_date_language=english')) IN ('Saturday','Sunday'))
                 OR (TRIM(TO_CHAR(CAL.DATE_D,'DD-MM')) IN ('01-01', '01-05', '08-05', '14-07', '15-08', '01-11', '11-11', '25-12'))
                 OR (TO_CHAR(CAL.DATE_D, 'DD-MM-YYYY') IN (SELECT TO_CHAR(DOFF.DATE_OFF, 'DD-MM-YYYY')
                                                           FROM ODSISIC.DAY_OFF DOFF
                                                           where DOFF.IMPACT='ALL')) )
             THEN 0
             ELSE 1 END) as IS_WORKING_DAY,
       (CASE WHEN TO_CHAR(CAL.DATE_D , 'YYYY-MM') = TO_CHAR(CAL.DATE_D , 'YYYY-MM')
             THEN (Select SUM(IS_WORKING_DAY) from DIM_DATE_MV group by CAL.YEAR_MONTH_NUM)
             ELSE 0 END) as NB_WORKING_DAY_MONTH
FROM ODSISIC.ORACLE_CALENDAR CAL
LEFT JOIN ODSISIC.DAY_OFF DOFF
       ON DOFF.DATE_OFF = CAL.DATE_D

IS_WORKING_DAY = 0 if it's a holiday, a weekend, or a date in the DAY_OFF table, which contains all holidays whose date changes from year to year. I want the SUM, grouped by month, of IS_WORKING_DAY = 1 in NB_WORKING_DAY_MONTH. How can I calculate this SUM directly in my query rather than creating an intermediate table for my join with the DAY_OFF table? Thanks :)
After thinking it through, I resolved this by redoing my SQL query:

CREATE MATERIALIZED VIEW DIM_DATE_MV
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
START WITH sysdate NEXT (TRUNC(sysdate)+1) + 7 / 24
as
SELECT CAL.DATE_D as ID_DATE,
       IS_WORKING_DAY as IS_WORKING_DAY,
       A.SUM as NB_WORKING_DAY_MONTH
FROM (SELECT SUM(IS_WORKING_DAY) as SUM,
             OCAL.YEAR_MONTH_NUM as ID_MONTH
      from ODSISIC.ORACLE_CALENDAR OCAL
      group by OCAL.YEAR_MONTH_NUM) A
INNER JOIN ODSISIC.ORACLE_CALENDAR CAL
        on CAL.YEAR_MONTH_NUM = A.ID_MONTH
LEFT JOIN ODSISIC.DAY_OFF DOFF
       ON DOFF.DATE_OFF = CAL.DATE_D;

I calculated the working days before creating the view (which implies that my DAY_OFF table must be fed before ORACLE_CALENDAR) and added a join to populate the table according to ID_MONTH. It's working fine now.
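For what it's worth, the monthly sum could also be computed directly in the view with an analytic SUM over the month partition, avoiding the self-join on ORACLE_CALENDAR entirely. This is an untested sketch, not the accepted answer's code; for brevity it shows only the weekend part of the IS_WORKING_DAY test, with the holiday and DAY_OFF checks from the question slotting in as extra ORs:

SELECT CAL.DATE_D AS ID_DATE,
       CAL.IS_WORKING_DAY,
       SUM(CAL.IS_WORKING_DAY)
           OVER (PARTITION BY CAL.YEAR_MONTH_NUM) AS NB_WORKING_DAY_MONTH
FROM (SELECT OCAL.DATE_D,
             OCAL.YEAR_MONTH_NUM,
             -- weekend test only; add the holiday and DAY_OFF conditions here
             CASE WHEN TRIM(TO_CHAR(OCAL.DATE_D, 'Day', 'nls_date_language=english'))
                       IN ('Saturday', 'Sunday')
                  THEN 0 ELSE 1 END AS IS_WORKING_DAY
      FROM ODSISIC.ORACLE_CALENDAR OCAL) CAL;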
Datetime SQL statement (Working in SQL Developer)
I'm new to the SQL scene, but I've started to gather some data that makes sense to me after learning a little about SQL Developer. I do, however, need help with a query. My goal: to use the current criteria I have and select records only when the date-time value is within 5 minutes of the latest date-time. Here is my current SQL statement:

SELECT ABAMS.T_WORKORDER_HIST.LINE_NO AS Line,
       ABAMS.T_WORKORDER_HIST.STATE AS State,
       ASMBLYTST.V_SEQ_SERIAL_ALL.BUILD_DATE,
       ASMBLYTST.V_SEQ_SERIAL_ALL.SEQ_NO,
       ASMBLYTST.V_SEQ_SERIAL_ALL.SEQ_NO_EXT,
       ASMBLYTST.V_SEQ_SERIAL_ALL.UPD_REASON_CODE,
       ABAMS.V_SERIAL_LINESET.LINESET_DATE AS "Lineset Time",
       ABAMS.T_WORKORDER_HIST.SERIAL_NO AS ESN,
       ABAMS.T_WORKORDER_HIST.ITEM_NO AS "Shop Order",
       ABAMS.T_WORKORDER_HIST.CUST_NAME AS Customer,
       ABAMS.T_ITEM_POLICY.PL_LOC_DROP_ZONE_NO AS PLDZ,
       ABAMS.T_WORKORDER_HIST.CONFIG_NO AS Configuration,
       ASMBLYTST.V_EDP_ENG_LAST_ABSN.LAST_ASMBLY_ABSN AS "Last Sta",
       ASMBLYTST.V_LAST_ENG_LOCATION.LAST_ASMBLY_LOC,
       ASMBLYTST.V_LAST_ENG_LOCATION.LAST_MES_LOC,
       ASMBLYTST.V_LAST_ENG_LOCATION.LAST_ASMBLY_TIME,
       ASMBLYTST.V_LAST_ENG_LOCATION.LAST_MES_TIME
FROM ABAMS.T_WORKORDER_HIST
LEFT JOIN ABAMS.V_SERIAL_LINESET
       ON ABAMS.V_SERIAL_LINESET.SERIAL_NO = ABAMS.T_WORKORDER_HIST.SERIAL_NO
LEFT JOIN ASMBLYTST.V_EDP_ENG_LAST_ABSN
       ON ASMBLYTST.V_EDP_ENG_LAST_ABSN.SERIAL_NO = ABAMS.T_WORKORDER_HIST.SERIAL_NO
LEFT JOIN ASMBLYTST.V_SEQ_SERIAL_ALL
       ON ASMBLYTST.V_SEQ_SERIAL_ALL.SERIAL_NO = ABAMS.T_WORKORDER_HIST.SERIAL_NO
LEFT JOIN ABAMS.T_ITEM_POLICY
       ON ABAMS.T_ITEM_POLICY.ITEM_NO = ABAMS.T_WORKORDER_HIST.ITEM_NO
LEFT JOIN ABAMS.T_CUR_STATUS
       ON ABAMS.T_CUR_STATUS.SERIAL_NO = ABAMS.T_WORKORDER_HIST.SERIAL_NO
INNER JOIN ASMBLYTST.V_LAST_ENG_LOCATION
        ON ASMBLYTST.V_LAST_ENG_LOCATION.SERIAL_NO = ABAMS.T_WORKORDER_HIST.SERIAL_NO
WHERE ABAMS.T_WORKORDER_HIST.LINE_NO = 10
  AND (ABAMS.T_WORKORDER_HIST.STATE = 'PROD'
       OR ABAMS.T_WORKORDER_HIST.STATE = 'SCHED')
  AND ASMBLYTST.V_SEQ_SERIAL_ALL.BUILD_DATE BETWEEN TRUNC(SysDate) - 10 AND TRUNC(SysDate) + 1
  AND (ABAMS.V_SERIAL_LINESET.LINESET_DATE IS NOT NULL
       OR ABAMS.V_SERIAL_LINESET.LINESET_DATE IS NULL)
  AND (ASMBLYTST.V_EDP_ENG_LAST_ABSN.LAST_ASMBLY_ABSN < '1800'
       OR ASMBLYTST.V_EDP_ENG_LAST_ABSN.LAST_ASMBLY_ABSN IS NULL)
ORDER BY ASMBLYTST.V_EDP_ENG_LAST_ABSN.LAST_ASMBLY_ABSN DESC Nulls Last,
         ABAMS.V_SERIAL_LINESET.LINESET_DATE Nulls Last,
         ASMBLYTST.V_SEQ_SERIAL_ALL.BUILD_DATE,
         ASMBLYTST.V_SEQ_SERIAL_ALL.SEQ_NO,
         ASMBLYTST.V_SEQ_SERIAL_ALL.SEQ_NO_EXT

Here are some of the records I get for ASMBLYTST.V_LAST_ENG_LOCATION.LAST_ASMBLY_TIME:

2018-06-14 01:28:25
2018-06-14 01:29:26
2018-06-14 01:27:30
2018-06-13 22:44:03
2018-06-14 01:28:45
2018-06-14 01:27:37
2018-06-14 01:27:41

What I essentially want is for 2018-06-13 22:44:03 to be excluded from the query because it is not within the 5-minute window from the latest record, which in this data set is 2018-06-14 01:29:26. The one dynamic problem I seem to have is that the date-time values are constantly updating. Any ideas? Thank you!
Here are two different solutions; each uses a table called "ASET". ASET contains 20 records 1 minute apart:

WITH aset (ttime, cnt)
AS (SELECT systimestamp AS ttime, 1 AS cnt
    FROM DUAL
    UNION ALL
    SELECT ttime + INTERVAL '1' MINUTE AS ttime, cnt + 1 AS cnt
    FROM aset
    WHERE cnt < 20)
select * from aset;

Now using ASET for our data, the following query finds the maximum date in ASET and restricts the results to the six records within 5 minutes of it:

SELECT *
FROM aset
WHERE ttime >= (SELECT MAX (ttime) FROM aset) - INTERVAL '5' MINUTE;

An alternative is to use an analytic function:

with bset AS (SELECT ttime, cnt, MAX (ttime) OVER () - ttime AS delta
              FROM aset)
SELECT *
FROM bset
WHERE delta <= INTERVAL '5' MINUTE