Hive query takes forever on Superset - sql

I have a query written in Presto SQL (about 100 lines that insert a query result into an existing table). It returns a result within 10 minutes.
Now I am moving to Airflow and need to convert the query to Hive SQL so that it appends the previous month's data. There is no error, but the query has been running for 75+ minutes and has not returned any result.
Should I stop it, or is there anything else to consider?
SET hive.limit.query.max.table.partition = 1000000;
INSERT INTO TABLE schema.temp_tbl partition(year_month_key)
Select
distinct
tbl.account_id,
tbl.theme_status,
streaming.streaming_hours,
tbl.year_month as year_month_key
From
(
Select
tbl_0.year_month,
tbl_0.account_id,
case when max(tbl_0.theme_status) = 1 then 'With Theme' else 'No Theme' end as theme_status
From
(Select
streaming.year_month,
streaming.account_id,
case when theme_events.account_id is not null then 1 else 0 end as theme_status
from
(
Select
substring(date_key, 1, 7) as year_month,
last_day(add_months(date_key, -1)) as year_month_ed,
date_key,
upper(account_id) as account_id,
play_seconds
from agg_device_streaming_metrics_daily
Where date_key between date_add(last_day(add_months(current_date, -2)),1) and last_day(add_months(current_date, -1))
and play_seconds > 0
) streaming
left join
(
Select
upper(theme.virtualuserid) as account_id,
min(theme.createddate) as min_createddate,
min(theme.date_key) as date_key
From
(
select * from theme_activate_event_history
where date_key between '2019-01-01' and '2020-01-01'
and activate = 'true' and themetype in ('ThemeBundle','ScreenSaver','Skin','Audio')
union
select * from theme_activate_event_history
where date_key between '2020-01-01' and '2021-01-01'
and activate = 'true' and themetype in ('ThemeBundle','ScreenSaver','Skin','Audio')
union
select * from theme_activate_event_history
where date_key between '2021-01-01' and '2022-01-01'
and activate = 'true' and themetype in ('ThemeBundle','ScreenSaver','Skin','Audio')
union
select * from theme_activate_event_history
where date_key between cast('2022-01-01' as date) and last_day(add_months(current_date, -1))
and activate = 'true' and themetype in ('ThemeBundle','ScreenSaver','Skin','Audio')
) theme
group by theme.virtualuserid
) theme_events
on streaming.account_id = theme_events.account_id
and date(theme_events.date_key) <= date(streaming.year_month_ed)
) tbl_0
group by tbl_0.year_month, tbl_0.account_id
) tbl
inner join
(Select
substring(date_key, 1, 7) as year_month,
upper(account_id) as account_id,
cast(sum(play_seconds) / 3600 as double) as streaming_hours
from agg_device_streaming_metrics_daily
Where date_key between date_add(last_day(add_months(current_date, -2)),1) and last_day(add_months(current_date, -1))
and play_seconds > 0
group by substring(date_key, 1, 7), upper(account_id)
) streaming
on tbl.account_id = streaming.account_id and tbl.year_month = streaming.year_month;

Related

SQL - Adding conditions to SELECT

I have a table which has a timestamp and the inCycle status of a machine. I'm using two CTEs and doing an INNER JOIN on row number so I can easily compare the timestamp of one row to the next. I have the DATEDIFF working and now I need to look at the inCycle status. Basically, if inCycleThis and inCycleNext both = 1, I need to add the difference to an InCycle total.
Similarly (Shown table will make this clear):
incycleThis/next = 0,1 = not in cycle
incycleThis/next = 0,0 = not in cycle
incycleThis/next = 1,1 = in cycle
If I was doing this client side, this would be pretty simple. I need to do this in a stored procedure though due to there being a lot of records. I'd love to use an 'IF' in the SELECT section, but it seems that's not how it works.
The result I'm looking for at the end is simply: InCycle = Xtime. Something like:
SUM(Diff_seconds if((InCycleThis = 1 AND InCycleNext = 1) OR (InCycleThis = 1 AND InCycleNext = 0))
This is what I have so far:
WITH History_CTE (DT, MID, FRO, IC, RowNum)
AS
(
SELECT DateAndTime
,MachineID
,FeedRateOverride
,InCycle
,ROW_NUMBER()OVER(ORDER BY MachineID, DateAndTime) AS "row number"
FROM History
WHERE DateAndTime >= '2020-11-15'
AND DateAndTime < '2020-11-16'
),
History2_CTE (DT2, MID2, FRO2, IC2, RowNum2)
AS
(
SELECT DateAndTime
,MachineID
,FeedRateOverride
,InCycle
,ROW_NUMBER()OVER(ORDER BY MachineID, DateAndTime) AS "row number"
FROM History
WHERE DateAndTime >= '2020-11-15'
AND DateAndTime < '2020-11-16'
)
SELECT DT as 'TimeStamp'
,DT2 as 'TimeStamp Next Row'
,MID
,FRO
,IC as 'InCycle this'
,IC2 as 'InCycle next'
,RowNum
,DATEDIFF(s, History2_CTE.DT2, History_CTE.DT) AS 'Diff_seconds'
FROM History_CTE
INNER JOIN
History2_CTE ON History_CTE.RowNum = History2_CTE.RowNum2 + 1
Consider adding a third CTE to first conditionally calculate your needed value, then aggregate in the final statement. Recall that CTEs can reference previously defined CTEs. Be sure to always qualify columns with table aliases in JOIN queries.
WITH
... first two ctes...
, sub AS (
SELECT h1.DT AS 'TimeStamp'
, h2.DT2 AS 'TimeStamp Next Row'
, h1.MID
, h1.FRO
, h1.IC AS 'InCycle this'
, h2.IC2 AS 'InCycle next'
, h1.RowNum
, DATEDIFF(s, h2.DT2, h1.DT) AS 'Diff_seconds'
, CASE
WHEN (h1.IC = 1 AND h2.IC2 = 1) OR (h1.IC= 1 AND h2.IC2 = 0)
THEN DATEDIFF(s, h2.DT2, h1.DT)
END AS 'IC_Diff_seconds'
FROM History_CTE h1
INNER JOIN History2_CTE h2
ON h1.RowNum = h2.RowNum2 + 1
)
SELECT SUM([Diff_seconds]) AS Diff_seconds_Total
, SUM([IC_Diff_seconds]) AS IC_Diff_seconds_Total
FROM sub
And if needing to add groupings, incorporate GROUP BY:
SELECT sub.MID
, sub.FRO
, SUM([Diff_seconds]) AS Diff_seconds_Total
, SUM([IC_Diff_seconds]) AS IC_Diff_seconds_Total
FROM sub
GROUP BY sub.MID
, sub.FRO
Even aggregate calculations by day:
SELECT CONVERT(date, [TimeStamp]) AS [Day]
, SUM([Diff_seconds]) AS Diff_seconds_Total
, SUM([IC_Diff_seconds]) AS IC_Diff_seconds_Total
FROM sub
GROUP BY CONVERT(date, [TimeStamp])
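The conditional sum above can be tried out in isolation. This is a minimal sketch, not the answerer's code: it uses SQLite through Python's `sqlite3` instead of SQL Server, and the `pairs` table with its four rows is invented to stand in for the output of the joined CTEs. The point is only that a `CASE` with no `ELSE` yields NULL for non-matching rows, and `SUM` skips NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-in for the joined CTE rows: one row per (this, next) pair.
CREATE TABLE pairs (MID INTEGER, IC INTEGER, IC2 INTEGER, Diff_seconds INTEGER);
INSERT INTO pairs VALUES
  (1, 1, 1, 30),
  (1, 1, 0, 20),
  (1, 0, 1, 45),
  (1, 0, 0, 15);
""")

# CASE returns NULL when the condition fails, so SUM only adds in-cycle rows.
totals = conn.execute("""
SELECT SUM(Diff_seconds) AS Diff_seconds_Total,
       SUM(CASE
             WHEN (IC = 1 AND IC2 = 1) OR (IC = 1 AND IC2 = 0)
             THEN Diff_seconds
           END) AS IC_Diff_seconds_Total
FROM pairs
""").fetchone()

print(totals)
```

With this sample data, only the first two rows satisfy the condition, so the in-cycle total covers 30 + 20 seconds while the overall total covers all four rows.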
The result I'm looking for at the end is simply: InCycle = Xtime. Something like:
SUM(Diff_seconds if((InCycleThis = 1 AND InCycleNext = 1) OR (InCycleThis = 1 AND InCycleNext = 0))
As I understand your question, you just need to sum the difference between the timestamp of "in cycle" rows and the timestamp of the next row.
select machineid,
sum(datediff(s, dateandtime, lead_dateandtime)) as total_in_time
from (
select h.*,
lead(dateandtime) over(partition by machineid order by dateandtime) as lead_dateandtime
from history h
) h
where incycle = 1
group by machineid
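To see the `LEAD` approach run end to end, here is a small sketch under stated assumptions: it uses SQLite (3.25 or later, for window functions) through Python's `sqlite3` rather than SQL Server, replaces `DATEDIFF` with a difference of `strftime('%s', ...)` epoch seconds, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Invented sample data; column names follow the query above.
CREATE TABLE history (machineid INTEGER, dateandtime TEXT, incycle INTEGER);
INSERT INTO history VALUES
  (1, '2020-11-15 08:00:00', 1),
  (1, '2020-11-15 08:00:30', 0),
  (1, '2020-11-15 08:01:00', 1),
  (1, '2020-11-15 08:01:45', 1),
  (1, '2020-11-15 08:02:00', 0),
  (2, '2020-11-15 09:00:00', 1),
  (2, '2020-11-15 09:00:10', 0);
""")

# LEAD fetches the next timestamp per machine; the outer WHERE keeps only
# rows that start an in-cycle interval before the durations are summed.
rows = conn.execute("""
SELECT machineid,
       SUM(CAST(strftime('%s', lead_dateandtime) AS INTEGER)
         - CAST(strftime('%s', dateandtime) AS INTEGER)) AS total_in_time
FROM (
    SELECT h.*,
           LEAD(dateandtime) OVER (PARTITION BY machineid
                                   ORDER BY dateandtime) AS lead_dateandtime
    FROM history h
) h
WHERE incycle = 1
GROUP BY machineid
ORDER BY machineid
""").fetchall()

print(rows)
```

Partitioning by `machineid` keeps the last row of one machine from being paired with the first row of the next, which the row-number join in the question would otherwise do.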

Left Join Lateral is Very Slow

I have the following query
WITH time_series AS (
SELECT *
FROM generate_series(now() - interval '1days', now(), INTERVAL '1 hour') AS ts
), recent_instances AS (
SELECT instance_id,
(CASE WHEN last_update_granted_ts IS NOT NULL THEN last_update_granted_ts ELSE created_ts END),
version,
4 status
FROM instance_application
WHERE group_id=$1
AND last_check_for_updates >= now() - interval '1days'
ORDER BY last_update_granted_ts DESC
), instance_versions AS (
SELECT instance_id, created_ts, version, status
FROM instance_status_history
WHERE instance_id IN (SELECT instance_id
FROM recent_instances)
AND status = 4
UNION
(SELECT * FROM recent_instances)
ORDER BY created_ts DESC
)
SELECT ts,
(CASE WHEN version IS NULL THEN '' ELSE version END),
sum(CASE WHEN version IS NOT null THEN 1 ELSE 0 END) total
FROM (
SELECT *
FROM time_series
LEFT JOIN LATERAL (
SELECT distinct ON (instance_id) instance_Id, version, created_ts
FROM instance_versions
WHERE created_ts <= time_series.ts
ORDER BY instance_Id, created_ts DESC
) _ ON true
) AS _
GROUP BY 1,2
ORDER BY ts DESC;
So the instance_versions subquery is executed for every timestamp generated by the time_series query (see the last SELECT statement). For some reason the lateral join is very slow. The lateral subquery returns around 12k-15k rows for a single timestamp from time_series, which is not a big number, and the final number of rows after the lateral join is around 250k-350k. Is there a way I can optimize this?

Oracle - Display 'No Rows' when query returns no results

I would like my Oracle SQL output to display 'No Rows Found' when the query returns no results.
I am trying to use the NVL function, but I'm getting an error stating
'ERROR at line 21: ORA-00907: missing right parenthesis'
SELECT NVL((
SELECT TO_CHAR(CHGDATE, 'yyyy-mm')
,CHGFIELD
,DBNAME
,COUNT(*)
FROM APPCHANGEHIST A
,DATABASEFIELD D
WHERE A.CHGFIELD = D.FIELDNUM
AND trunc(CHGDATE) BETWEEN add_months(to_date(to_char((sysdate - to_char(sysdate, 'dd') + 1), 'dd-mon-yyyy')), - 1)
AND to_date(to_char((sysdate - to_char(sysdate, 'dd')), 'dd-mon-yyyy'))
AND CHGFIELD = 79
AND OLDVALUE IS NOT NULL
AND EXISTS (
SELECT 1
FROM USERPROF
WHERE USERID = A.CHGREQUESTOR
)
GROUP BY TO_CHAR(CHGDATE, 'yyyy-mm')
,CHGFIELD
,DBNAME
ORDER BY 1
,4 DESC
), "No Rows");
I don't have issues when I run this statement alone without the NVL
SELECT TO_CHAR(CHGDATE, 'yyyy-mm')
,CHGFIELD
,DBNAME
,COUNT(*)
FROM APPCHANGEHIST A
,DATABASEFIELD D
WHERE A.CHGFIELD = D.FIELDNUM
AND trunc(CHGDATE) BETWEEN add_months(to_date(to_char((sysdate - to_char(sysdate, 'dd') + 1), 'dd-mon-yyyy')), - 1)
AND to_date(to_char((sysdate - to_char(sysdate, 'dd')), 'dd-mon-yyyy'))
AND CHGFIELD = 79
AND OLDVALUE IS NOT NULL
AND EXISTS (
SELECT 1
FROM USERPROF
WHERE USERID = A.CHGREQUESTOR
)
GROUP BY TO_CHAR(CHGDATE, 'yyyy-mm')
,CHGFIELD
,DBNAME
ORDER BY 1
,4 DESC
Ok, at a high level, you can use the following pattern:
WITH results AS
(
SELECT *
FROM dual d
WHERE d.dummy = 'Y'
)
SELECT *
FROM results
UNION ALL
SELECT 'No Rows Found'
FROM dual
WHERE NOT EXISTS (SELECT 'X'
FROM results);
You can play with this by changing the value in the WITH clause between 'X' and 'Y'.
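In your query, you would just replace the SELECT within the WITH clause with your query. Both branches of the UNION ALL must return the same number of columns with compatible types, so for a multi-column query, pad the 'No Rows Found' branch with NULLs and convert non-character columns in the first column's position to text.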
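The fallback-row pattern can be exercised with a small runnable sketch. It is an illustration only: it uses SQLite through Python's `sqlite3` instead of Oracle (so there is no `dual`, and the fallback branch has no FROM clause), and the `changes` table, its columns, and the `run` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, much simplified stand-in for the change-history table.
CREATE TABLE changes (chg_month TEXT, chgfield INTEGER);
INSERT INTO changes VALUES ('2021-03', 79), ('2021-03', 79), ('2021-03', 12);
""")

# The second branch contributes a row only when the CTE is empty.
# Both branches return two columns; the count is cast to text so the
# column types line up, as a stricter database would require.
QUERY = """
WITH results AS (
    SELECT chg_month, CAST(COUNT(*) AS TEXT) AS change_count
    FROM changes
    WHERE chgfield = ?
    GROUP BY chg_month
)
SELECT chg_month, change_count FROM results
UNION ALL
SELECT 'No Rows Found', NULL
WHERE NOT EXISTS (SELECT 1 FROM results)
"""

def run(field):
    """Run the query for one field number and return all rows."""
    return conn.execute(QUERY, (field,)).fetchall()

print(run(79))  # field with matching rows
print(run(99))  # field with no matching rows
```

Calling `run` with a field that has data returns the grouped rows; calling it with a field that has none returns the single placeholder row.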

SQL - '1' IF hour in month EXISTS, '0' IF NOT EXISTS

I have a table that has aggregations down to the hour level YYYYMMDDHH. The data is aggregated and loaded by an external process (I don't have control over). I want to test the data on a monthly basis.
The question I am looking to answer is: Does every hour in the month exist?
I'm looking to produce output that will return a 1 if the hour exists or 0 if the hour does not exist.
The aggregation table looks something like this...
YYYYMM  YYYYMMDD  YYYYMMDDHH  DATA_AGG
201911  20191101  2019110100  100
201911  20191101  2019110101  125
201911  20191101  2019110103  135
201911  20191101  2019110105  95
…       …         …           …
201911  20191130  2019113020  100
201911  20191130  2019113021  110
201911  20191130  2019113022  125
201911  20191130  2019113023  135
And defined as...
CREATE TABLE YYYYMMDDHH_DATA_AGG (
YYYYMM VARCHAR,
YYYYMMDD VARCHAR,
YYYYMMDDHH VARCHAR,
DATA_AGG INT
);
I'm looking to produce the following below...
YYYYMMDDHH  HOUR_EXISTS
2019110100  1
2019110101  1
2019110102  0
2019110103  1
2019110104  0
2019110105  1
...         ...
In the example above, two hours do not exist, 2019110102 and 2019110104.
I assume I'd have to join the aggregation table against a computed table that contains all the YYYYMMDDHH combos???
The database is Snowflake, but assume most generic ANSI SQL queries will work.
You can get what you want with a recursive CTE.
The recursive CTE generates the list of possible hours, and a simple left outer join then flags whether any record matches each hour.
WITH RECURSIVE CTE (YYYYMMDDHH) as
(
SELECT YYYYMMDDHH
FROM YYYYMMDDHH_DATA_AGG
WHERE YYYYMMDDHH = (SELECT MIN(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG)
UNION ALL
SELECT TO_VARCHAR(DATEADD(HOUR, 1, TO_TIMESTAMP(C.YYYYMMDDHH, 'YYYYMMDDHH')), 'YYYYMMDDHH') YYYYMMDDHH
FROM CTE C
WHERE TO_VARCHAR(DATEADD(HOUR, 1, TO_TIMESTAMP(C.YYYYMMDDHH, 'YYYYMMDDHH')), 'YYYYMMDDHH') <= (SELECT MAX(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG)
)
SELECT
C.YYYYMMDDHH,
IFF(A.YYYYMMDDHH IS NOT NULL, 1, 0) HOUR_EXISTS
FROM CTE C
LEFT OUTER JOIN YYYYMMDDHH_DATA_AGG A
ON C.YYYYMMDDHH = A.YYYYMMDDHH;
If your time range is too long, you'll have issues with the CTE recursing too much. You can create a table or temp table with all of the possible hours instead. For example:
CREATE OR REPLACE TEMPORARY TABLE HOURS (YYYYMMDDHH VARCHAR) AS
SELECT TO_VARCHAR(DATEADD(HOUR, SEQ4(), TO_TIMESTAMP((SELECT MIN(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG), 'YYYYMMDDHH')), 'YYYYMMDDHH')
FROM TABLE(GENERATOR(ROWCOUNT => 10000)) V
ORDER BY 1;
SELECT
H.YYYYMMDDHH,
IFF(A.YYYYMMDDHH IS NOT NULL, 1, 0) HOUR_EXISTS
FROM HOURS H
LEFT OUTER JOIN YYYYMMDDHH_DATA_AGG A
ON H.YYYYMMDDHH = A.YYYYMMDDHH
WHERE H.YYYYMMDDHH <= (SELECT MAX(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG);
You can then fiddle with the generator count to make sure you have enough hours.
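The recursive-CTE idea can be checked on the sample rows from the question. This sketch makes several assumptions: it runs on SQLite through Python's `sqlite3`, not Snowflake, so `TO_TIMESTAMP`, `DATEADD` and `IFF` are replaced with `substr`, `datetime(..., '+1 hour')` and `CASE`; the table name `agg` and the helper CTE `bounds` are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agg (yyyymmddhh TEXT, data_agg INTEGER);
INSERT INTO agg VALUES
  ('2019110100', 100), ('2019110101', 125),
  ('2019110103', 135), ('2019110105', 95);
""")

# SQLite has no TO_TIMESTAMP, so rebuild 'YYYY-MM-DD HH:00:00' from the key.
TS = ("substr(yyyymmddhh,1,4) || '-' || substr(yyyymmddhh,5,2) || '-' || "
      "substr(yyyymmddhh,7,2) || ' ' || substr(yyyymmddhh,9,2) || ':00:00'")

rows = conn.execute(f"""
WITH RECURSIVE
bounds AS (SELECT MIN({TS}) AS lo, MAX({TS}) AS hi FROM agg),
hours(ts) AS (
    SELECT lo FROM bounds                  -- anchor: first hour present
    UNION ALL
    SELECT datetime(ts, '+1 hour')         -- step: move one hour forward
    FROM hours, bounds
    WHERE datetime(ts, '+1 hour') <= hi    -- stop at the last hour present
)
SELECT strftime('%Y%m%d%H', h.ts) AS hour_key,
       CASE WHEN a.yyyymmddhh IS NULL THEN 0 ELSE 1 END AS hour_exists
FROM hours h
LEFT JOIN agg a ON a.yyyymmddhh = strftime('%Y%m%d%H', h.ts)
ORDER BY hour_key
""").fetchall()

for row in rows:
    print(row)
```

Hours 02 and 04 are missing from the sample data, so they come back with a flag of 0, matching the output the question asks for.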
You can generate a table with every hour of the month and LEFT OUTER JOIN your aggregation to it:
WITH EVERY_HOUR AS (
SELECT TO_CHAR(DATEADD(HOUR, HH, TO_DATE(YYYYMM::TEXT, 'YYYYMM')),
'YYYYMMDDHH')::NUMBER YYYYMMDDHH
FROM (SELECT DISTINCT YYYYMM FROM YYYYMMDDHH_DATA_AGG) t
CROSS JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY NULL) - 1 HH
FROM TABLE(GENERATOR(ROWCOUNT => 745))
) h
QUALIFY YYYYMMDDHH < (YYYYMM + 1) * 10000
)
SELECT h.YYYYMMDDHH, NVL2(a.YYYYMM, 1, 0) HOUR_EXISTS
FROM EVERY_HOUR h
LEFT OUTER JOIN YYYYMMDDHH_DATA_AGG a ON a.YYYYMMDDHH = h.YYYYMMDDHH
Here's something that might help get you started. I'm guessing you want to have 'synthetic' [YYYYMMDD] values? Otherwise, if the values aren't there, they shouldn't appear in the list.
DROP TABLE IF EXISTS #_hours
DROP TABLE IF EXISTS #_temp
--Populate a table with hours ranging from 00 to 23
CREATE TABLE #_hours ([hour_value] VARCHAR(2))
DECLARE @_i INT = 0
WHILE (@_i < 24)
BEGIN
INSERT INTO #_hours
SELECT FORMAT(@_i, '0#')
SET @_i += 1
END
-- Replicate OP's sample data set
CREATE TABLE #_temp (
[YYYYMM] INTEGER
, [YYYYMMDD] INTEGER
, [YYYYMMDDHH] INTEGER
, [DATA_AGG] INTEGER
)
INSERT INTO #_temp
VALUES
(201911, 20191101, 2019110100, 100),
(201911, 20191101, 2019110101, 125),
(201911, 20191101, 2019110103, 135),
(201911, 20191101, 2019110105, 95),
(201911, 20191130, 2019113020, 100),
(201911, 20191130, 2019113021, 110),
(201911, 20191130, 2019113022, 125),
(201911, 20191130, 2019113023, 135)
SELECT X.YYYYMM, X.YYYYMMDD, X.YYYYMMDDHH
-- Case: If 'target_hours' doesn't exist, then 0, else 1
, CASE WHEN X.target_hours IS NULL THEN '0' ELSE '1' END AS [HOUR_EXISTS]
FROM (
-- Select right 2 characters from converted [YYYYMMDDHH] to act as 'target values'
SELECT T.*
, RIGHT(CAST(T.[YYYYMMDDHH] AS VARCHAR(10)), 2) AS [target_hours]
FROM #_temp AS T
) AS X
-- Right join to keep all of our hours and only the target hours that match.
RIGHT JOIN #_hours AS H ON H.hour_value = X.target_hours
Sample output:
YYYYMM  YYYYMMDD  YYYYMMDDHH  HOUR_EXISTS
201911  20191101  2019110100  1
201911  20191101  2019110101  1
NULL    NULL      NULL        0
201911  20191101  2019110103  1
NULL    NULL      NULL        0
201911  20191101  2019110105  1
NULL    NULL      NULL        0
With (almost) standard sql, you can do a cross join of the distinct values of YYYYMMDD to a list of all possible hours and then left join to the table:
select concat(d.YYYYMMDD, h.hour) as YYYYMMDDHH,
case when t.YYYYMMDDHH is null then 0 else 1 end as hour_exists
from (select distinct YYYYMMDD from tablename) as d
cross join (
select '00' as hour union all select '01' union all
select '02' union all select '03' union all
select '04' union all select '05' union all
select '06' union all select '07' union all
select '08' union all select '09' union all
select '10' union all select '11' union all
select '12' union all select '13' union all
select '14' union all select '15' union all
select '16' union all select '17' union all
select '18' union all select '19' union all
select '20' union all select '21' union all
select '22' union all select '23'
) as h
left join tablename as t
on concat(d.YYYYMMDD, h.hour) = t.YYYYMMDDHH
order by concat(d.YYYYMMDD, h.hour)
Maybe in Snowflake you can construct the list of hours with a sequence much easier instead of all those UNION ALLs.
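As a rough illustration of replacing the UNION ALL list with a generated sequence, the sketch below builds hours 0 to 23 with a small recursive CTE. It is an assumption-laden example: it runs on SQLite through Python's `sqlite3` (not Snowflake), uses SQLite's `printf` for zero padding, and the sample rows are invented; `tablename` is the placeholder name from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablename (YYYYMMDD TEXT, YYYYMMDDHH TEXT);
INSERT INTO tablename VALUES
  ('20191101', '2019110100'), ('20191101', '2019110101'),
  ('20191101', '2019110103'), ('20191130', '2019113023');
""")

# h(n) yields 0..23; every distinct day is cross joined with every hour,
# then left joined back to the data to flag which hours are present.
rows = conn.execute("""
WITH RECURSIVE h(n) AS (
    SELECT 0
    UNION ALL
    SELECT n + 1 FROM h WHERE n < 23
)
SELECT d.YYYYMMDD || printf('%02d', h.n) AS hour_key,
       CASE WHEN t.YYYYMMDDHH IS NULL THEN 0 ELSE 1 END AS hour_exists
FROM (SELECT DISTINCT YYYYMMDD FROM tablename) AS d
CROSS JOIN h
LEFT JOIN tablename AS t
       ON t.YYYYMMDDHH = d.YYYYMMDD || printf('%02d', h.n)
ORDER BY hour_key
""").fetchall()

print(len(rows), sum(flag for _, flag in rows))
```

Note that this only covers days that appear at least once in the table; a day with no rows at all produces no output, which is the same limitation as the query above.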
This version accounts for the full range of days, across months and years. It's a simple cross join of the set of possible days with the set of possible hours of the day -- left joined to actual dates.
set first = (select min(yyyymmdd::number) from YYYYMMDDHH_DATA_AGG);
set last = (select max(yyyymmdd::number) from YYYYMMDDHH_DATA_AGG);
with
hours as (select row_number() over (order by null) - 1 h from table(generator(rowcount=>24))),
days as (
select
row_number() over (order by null) - 1 as n,
to_date($first::text, 'YYYYMMDD')::date + n as d,
to_char(d, 'YYYYMMDD') as yyyymmdd
from table(generator(rowcount=>($last-$first+1)))
)
select days.yyyymmdd || lpad(hours.h,2,0) as YYYYMMDDHH, nvl2(t.yyyymmddhh,1,0) as HOUR_EXISTS
from days cross join hours
left join YYYYMMDDHH_DATA_AGG t on t.yyyymmddhh = days.yyyymmdd || lpad(hours.h,2,0)
order by 1
;
$first and $last can be packed in as sub-queries if you prefer.

Union ALL fetches empty rows

I have two queries joined with a UNION ALL.
SELECT select 'Finished' AS Status,amount AS amount,units As Date
from table1 WHERE Pdate > cdate AND name = @name
UNION ALL
SELECT select 'Live' AS Live,amount,units
from table1 Where Pdate = cdate And name = @name
Result
Status    amount  units
Finished  100     20
Live      200     10
When either query returns an empty set I get only one row, and if both return an empty set I get no rows.
So how can I get a result like this:
Status    amount  Units
Finished  100     20
Live      0       0
OR
Status    amount  Units
Finished  0       0
Live      200     10
OR
Status    amount  Units
Finished  0       0
Live      0       0
Thanks.
You can do it with SUM. An aggregate query without GROUP BY always returns exactly one row, even when no rows match the WHERE clause. SUM over an empty set returns NULL rather than 0, so wrap it in COALESCE:
SELECT 'Finished' AS Status, COALESCE(SUM(amount), 0) AS amount, COALESCE(SUM(units), 0) AS Unit
from table1 WHERE Pdate > cdate AND name = @name
UNION ALL
SELECT 'Live' AS Status, COALESCE(SUM(amount), 0) AS amount, COALESCE(SUM(units), 0) AS Unit
from table1 Where Pdate = cdate And name = @name
If you are not trying to sum the results, COALESCE(amount, 0) on its own is not enough: it only replaces NULL values in rows that exist and cannot produce a row when the query matches nothing.
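The behaviour this answer relies on is easy to verify: an aggregate query with no GROUP BY returns one row even when nothing matches, and `SUM` over zero rows is NULL. Below is a minimal sketch using SQLite through Python's `sqlite3` (not SQL Server); the sample row and the value bound to the `name` parameter are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (name TEXT, pdate TEXT, cdate TEXT,
                     amount INTEGER, units INTEGER);
-- One 'Finished' row (pdate after cdate) and no 'Live' row.
INSERT INTO table1 VALUES ('a', '2020-02-01', '2020-01-01', 100, 20);
""")

# Each branch aggregates without GROUP BY, so each yields exactly one row;
# COALESCE turns the NULL sum of the empty 'Live' branch into 0.
rows = conn.execute("""
SELECT 'Finished' AS Status,
       COALESCE(SUM(amount), 0) AS amount,
       COALESCE(SUM(units), 0)  AS units
FROM table1 WHERE pdate > cdate AND name = :name
UNION ALL
SELECT 'Live',
       COALESCE(SUM(amount), 0),
       COALESCE(SUM(units), 0)
FROM table1 WHERE pdate = cdate AND name = :name
ORDER BY Status
""", {"name": "a"}).fetchall()

print(rows)
```

The trade-off is that each status collapses to a single summed row, so this fits only when one total per status is what you want.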
I just want to point out that your query is needlessly complex, with nested selects and a union all. A better way to write the query is:
select (case when pdate > cdate then 'Finished' else 'Live' end) AS Status,
amount AS amount, units As Date
from table1
WHERE Pdate >= cdate AND name = @name
This query does not produce what you want, since it only produces rows where there is data.
One way to get the additional rows is to augment the original data and then check if it is needed.
select status, amount, units as Date
from (select Status, amount, units,
row_number() over (partition by status order by amount desc, units desc) as seqnum
from ((select (case when pdate > cdate then 'Finished' else 'Live' end) AS Status,
amount, units, name
from table1
WHERE Pdate >= cdate AND name = @name
) union all
(select 'Finished', 0, 0, @name
) union all
(select 'Live', 0, 0, @name
)
) u
) t
where (amount > 0 or units > 0) or
(seqnum = 1)
This adds in the extra rows that you want. It then enumerates them, so they would go last in any sequence. They are ignored, unless they are the first in the sequence.
Try something like this
with stscte as
(
select 'Finished' as status
union all
select 'Live'
),
datacte
as(
select 'Finished' AS Status,amount AS amount,units AS units
from table1 WHERE Pdate > cdate AND name = @name
UNION ALL
select 'Live' ,amount,units
from table1 Where Pdate = cdate And name = @name
)
select sc.status,isnull(dc.amount,0) as amount,isnull(dc.units,0) as units
from stscte sc left join datacte dc
on sc.status = dc.status
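Here is a small runnable sketch of the same left-join idea, under a few assumptions: it uses SQLite through Python's `sqlite3` rather than SQL Server, so `ISNULL` becomes `COALESCE`; the data CTE is written with a single `CASE` instead of a UNION ALL; and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (name TEXT, pdate TEXT, cdate TEXT,
                     amount INTEGER, units INTEGER);
-- Two 'Finished' rows and no 'Live' row.
INSERT INTO table1 VALUES
  ('a', '2020-02-01', '2020-01-01', 100, 20),
  ('a', '2020-03-01', '2020-01-01', 50, 5);
""")

# stscte lists every status that must appear; the LEFT JOIN keeps a status
# even when datacte has no row for it, and COALESCE fills in the zeros.
rows = conn.execute("""
WITH stscte(status) AS (
    SELECT 'Finished' UNION ALL SELECT 'Live'
),
datacte AS (
    SELECT CASE WHEN pdate > cdate THEN 'Finished' ELSE 'Live' END AS status,
           amount, units
    FROM table1
    WHERE pdate >= cdate AND name = ?
)
SELECT sc.status,
       COALESCE(dc.amount, 0) AS amount,
       COALESCE(dc.units, 0)  AS units
FROM stscte sc
LEFT JOIN datacte dc ON sc.status = dc.status
ORDER BY sc.status, amount DESC
""", ("a",)).fetchall()

print(rows)
```

Unlike the SUM-based approach, this keeps every detail row for a status that has data and adds a zero row only for a status that has none.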