Hi brilliant thinkers,
I want to create a CASE condition that gives me a "yes" for active_users if, for the same anonymous_id, a more recent uuid_ts exists within 60 days.
SELECT t1.anonymous_id user_id,
t1.uuid_ts activity_date,
t2.uuid_ts signup_date,
-- Activity Lifetime: difference of number of days signed up to last activity
DATE_DIFF(CAST(t2.uuid_ts AS DATE), CAST(t1.uuid_ts AS DATE), DAY) AS activity_lifetime,
-- New Users: If month of activity is same as sign_up month
(CASE WHEN DATE_DIFF(CAST(t1.uuid_ts AS DATE), CAST(t2.uuid_ts AS DATE), MONTH)=0 THEN TRUE ELSE FALSE END) AS new_user,
-- Active Users: If month of activity is greater than sign_up month AND activity is found
(CASE WHEN DATE_DIFF(CAST(t1.uuid_ts AS DATE), CAST(t2.uuid_ts AS DATE), MONTH)>0
-- ** ____ NEED HELP HERE ____ **
AND anonymous_id NOT IN (SELECT anonymous_id FROM datascience.last_user_activity)
AND DATE_ADD(activity_date, INTERVAL 60 DAY) > (S)
FROM datascience.last_user_activity AS t1
INNER JOIN datascience.full_signup_completed AS t2
ON t2.anonymous_id = t1.anonymous_id
WHERE DATE(t1.uuid_ts) IS NOT NULL AND DATE(t2.uuid_ts) IS NOT NULL
ORDER BY activity_lifetime DESC
SAMPLE DATA:
anon_id|signup_date|activity_date|
__________________________________
123 |01-01-2019 |02-01-2019 |
123 |01-01-2019 |02-02-2019 |
123 |01-01-2019 |02-03-2019 |
123 |01-01-2019 |02-04-2019 |
WANTED:
anon_id|signup_date|activity_date| active
__________________________________
123 |01-01-2019 |02-01-2019 | yes
123 |01-01-2019 |02-02-2019 | yes
123 |01-01-2019 |02-03-2019 | no
123 |01-01-2019 |02-04-2019 | no
If a later activity_date exists for the same anon_id, within a range of 60 days, then the active field shows "yes", otherwise "no".
Still not 100% sure this is what you are looking for, but I hope it helps:
WITHIN 60 days:
(The output would be "yes, yes, yes, no" since 02-04-2019 > 02-03-2019 and within 60 days)
WITH
sample_data AS (
SELECT
'123' AS anon_id, DATE('2019-01-01') AS signup_date,
DATE('2019-01-02') AS activity_date
UNION ALL
SELECT
'123' AS anon_id,
DATE('2019-01-01') AS signup_date,
DATE('2019-02-02') AS activity_date
UNION ALL
SELECT
'123' AS anon_id,
DATE('2019-01-01') AS signup_date,
DATE('2019-03-02') AS activity_date
UNION ALL
SELECT
'123' AS anon_id,
DATE('2019-01-01') AS signup_date,
DATE('2019-04-02') AS activity_date)
SELECT
anon_id,
signup_date,
activity_date,
(CASE
WHEN EXISTS( SELECT 'found' FROM sample_data t2 WHERE t2.anon_id = t1.anon_id AND t2.activity_date > t1.activity_date AND t2.activity_date <= DATE_ADD(t1.activity_date, INTERVAL 60 DAY)) THEN 'yes'
ELSE
'no'
END
) AS active
FROM
sample_data t1
ORDER BY 1,2,3
60 DAYS or BEYOND:
(The output would be "yes, no, no, no" since February has 28 days and March 31, so between 02-02-2019 and 02-04-2019 there are 59 days)
WITH
sample_data AS (
SELECT
'123' AS anon_id,
DATE('2019-01-01') AS signup_date,
DATE('2019-01-02') AS activity_date
UNION ALL
SELECT
'123' AS anon_id,
DATE('2019-01-01') AS signup_date,
DATE('2019-02-02') AS activity_date
UNION ALL
SELECT
'123' AS anon_id,
DATE('2019-01-01') AS signup_date,
DATE('2019-03-02') AS activity_date
UNION ALL
SELECT
'123' AS anon_id,
DATE('2019-01-01') AS signup_date,
DATE('2019-04-02') AS activity_date)
SELECT
anon_id,
signup_date,
activity_date,
(CASE
WHEN EXISTS( SELECT 'found' FROM sample_data t2 WHERE t2.anon_id = t1.anon_id AND t2.activity_date >= DATE_ADD(t1.activity_date, INTERVAL 60 DAY)) THEN 'yes'
ELSE
'no'
END
) AS active
FROM
sample_data t1
ORDER BY 1,2,3
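A side note rather than a replacement: because the activities for an anon_id are ordered by date, only the next activity matters for the "within 60 days" check, so a window function can produce the same flag without the correlated subquery. A minimal sketch, assuming BigQuery syntax as in your original query and reusing the sample_data CTE from the first query above:
SELECT
  anon_id,
  signup_date,
  activity_date,
  (CASE
    WHEN LEAD(activity_date) OVER (PARTITION BY anon_id ORDER BY activity_date)
         <= DATE_ADD(activity_date, INTERVAL 60 DAY) THEN 'yes'
    ELSE 'no'
  END) AS active
FROM sample_data
ORDER BY 1, 2, 3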
Your question/logic/dates are a bit unclear, but I think the following query should point you in the right direction.
with joined as (
-- Join your tables and handle casting here (only have to do it once)
select
anonymous_id,
date(full_signup_completed.uuid_ts) as signup_date,
extract(month from full_signup_completed.uuid_ts) as signup_month,
date(last_user_activity.uuid_ts) as activity_date,
extract(month from last_user_activity.uuid_ts) as activity_month
from datascience.full_signup_completed
left join datascience.last_user_activity using(anonymous_id)
where full_signup_completed.uuid_ts is not null and last_user_activity.uuid_ts is not null
),
activity60 as (
-- for each activity date, is there a future activity date within 60 days?
select j1.anonymous_id,j1.activity_date, true as has_activity_within_60_days
from joined j1
cross join joined j2
where j1.anonymous_id = j2.anonymous_id and j2.activity_date > j1.activity_date and date_diff(j2.activity_date, j1.activity_date, day) <= 60
group by 1,2
),
final as (
-- Get all of your logic
select
joined.*,
date_diff(activity_date,signup_date, day) as activity_lifetime,
signup_month = activity_month as new_user, -- Evaluates to T/F
(activity_month > signup_month) and coalesce(has_activity_within_60_days, false) as your_custom_field -- Evaluates to T/F
from joined
left join activity60 using(anonymous_id, activity_date)
)
select * from final
order by activity_lifetime desc
In your example, are your dates in DD-MM-YYYY format? If not, I'm not sure how the 60-day constraint makes sense.
Related
I have a table that has aggregations down to the hour level YYYYMMDDHH. The data is aggregated and loaded by an external process (I don't have control over). I want to test the data on a monthly basis.
The question I am looking to answer is: Does every hour in the month exist?
I'm looking to produce output that will return a 1 if the hour exists or 0 if the hour does not exist.
The aggregation table looks something like this...
YYYYMM YYYYMMDD YYYYMMDDHH DATA_AGG
201911 20191101 2019110100 100
201911 20191101 2019110101 125
201911 20191101 2019110103 135
201911 20191101 2019110105 95
… … … …
201911 20191130 2019113020 100
201911 20191130 2019113021 110
201911 20191130 2019113022 125
201911 20191130 2019113023 135
And defined as...
CREATE TABLE YYYYMMDDHH_DATA_AGG (
YYYYMM VARCHAR,
YYYYMMDD VARCHAR,
YYYYMMDDHH VARCHAR,
DATA_AGG INT
);
I'm looking to produce the following below...
YYYYMMDDHH HOUR_EXISTS
2019110100 1
2019110101 1
2019110102 0
2019110103 1
2019110104 0
2019110105 1
... ...
In the example above, two hours do not exist, 2019110102 and 2019110104.
I assume I'd have to join the aggregation table against a computed table that contains all the YYYYMMDDHH combos???
The database is Snowflake, but assume most generic ANSI SQL queries will work.
You can get what you want with a recursive CTE.
The recursive CTE generates the list of possible hours, and then a simple left outer join gets you a flag for whether any records match that hour.
WITH RECURSIVE CTE (YYYYMMDDHH) as
(
SELECT YYYYMMDDHH
FROM YYYYMMDDHH_DATA_AGG
WHERE YYYYMMDDHH = (SELECT MIN(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG)
UNION ALL
SELECT TO_VARCHAR(DATEADD(HOUR, 1, TO_TIMESTAMP(C.YYYYMMDDHH, 'YYYYMMDDHH')), 'YYYYMMDDHH') YYYYMMDDHH
FROM CTE C
WHERE TO_VARCHAR(DATEADD(HOUR, 1, TO_TIMESTAMP(C.YYYYMMDDHH, 'YYYYMMDDHH')), 'YYYYMMDDHH') <= (SELECT MAX(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG)
)
SELECT
C.YYYYMMDDHH,
IFF(A.YYYYMMDDHH IS NOT NULL, 1, 0) HOUR_EXISTS
FROM CTE C
LEFT OUTER JOIN YYYYMMDDHH_DATA_AGG A
ON C.YYYYMMDDHH = A.YYYYMMDDHH;
If your time range is too long, you'll have issues with the CTE recursing too much. You can create a table or temp table with all of the possible hours instead. For example:
CREATE OR REPLACE TEMPORARY TABLE HOURS (YYYYMMDDHH VARCHAR) AS
SELECT TO_VARCHAR(DATEADD(HOUR, SEQ4(), TO_TIMESTAMP((SELECT MIN(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG), 'YYYYMMDDHH')), 'YYYYMMDDHH')
FROM TABLE(GENERATOR(ROWCOUNT => 10000)) V
ORDER BY 1;
SELECT
H.YYYYMMDDHH,
IFF(A.YYYYMMDDHH IS NOT NULL, 1, 0) HOUR_EXISTS
FROM HOURS H
LEFT OUTER JOIN YYYYMMDDHH_DATA_AGG A
ON H.YYYYMMDDHH = A.YYYYMMDDHH
WHERE H.YYYYMMDDHH <= (SELECT MAX(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG);
You can then fiddle with the generator count to make sure you have enough hours.
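If you'd rather not guess at that count, here is a sketch of sizing it from the data itself. The hour_span session variable is a name introduced here, and this assumes a session variable is acceptable inside GENERATOR (its ROWCOUNT must be a constant, but a $variable is substituted before execution, which the $first/$last answer further down also relies on):
SET hour_span = (
  SELECT DATEDIFF(HOUR,
                  TO_TIMESTAMP(MIN(YYYYMMDDHH), 'YYYYMMDDHH'),
                  TO_TIMESTAMP(MAX(YYYYMMDDHH), 'YYYYMMDDHH')) + 1
  FROM YYYYMMDDHH_DATA_AGG);
CREATE OR REPLACE TEMPORARY TABLE HOURS (YYYYMMDDHH VARCHAR) AS
SELECT TO_VARCHAR(
         DATEADD(HOUR, SEQ4(),
                 TO_TIMESTAMP((SELECT MIN(YYYYMMDDHH) FROM YYYYMMDDHH_DATA_AGG), 'YYYYMMDDHH')),
         'YYYYMMDDHH')
FROM TABLE(GENERATOR(ROWCOUNT => $hour_span))
ORDER BY 1;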
You can generate a table with every hour of the month and LEFT OUTER JOIN your aggregation to it:
WITH EVERY_HOUR AS (
SELECT TO_CHAR(DATEADD(HOUR, HH, TO_DATE(YYYYMM::TEXT, 'YYYYMM')),
'YYYYMMDDHH')::NUMBER YYYYMMDDHH
FROM (SELECT DISTINCT YYYYMM FROM YYYYMMDDHH_DATA_AGG) t
CROSS JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY NULL) - 1 HH
FROM TABLE(GENERATOR(ROWCOUNT => 745))
) h
QUALIFY YYYYMMDDHH < (YYYYMM + 1) * 10000
)
SELECT h.YYYYMMDDHH, NVL2(a.YYYYMM, 1, 0) HOUR_EXISTS
FROM EVERY_HOUR h
LEFT OUTER JOIN YYYYMMDDHH_DATA_AGG a ON a.YYYYMMDDHH = h.YYYYMMDDHH
Here's something that might help get you started. I'm guessing you want to have 'synthetic' [YYYYMMDDHH] values? Otherwise, if the values aren't there, they shouldn't appear in the list.
DROP TABLE IF EXISTS #_hours
DROP TABLE IF EXISTS #_temp
--Populate a table with hours ranging from 00 to 23
CREATE TABLE #_hours ([hour_value] VARCHAR(2))
DECLARE @_i INT = 0
WHILE (@_i < 24)
BEGIN
INSERT INTO #_hours
SELECT FORMAT(@_i, '0#')
SET @_i += 1
END
-- Replicate OP's sample data set
CREATE TABLE #_temp (
[YYYYMM] INTEGER
, [YYYYMMDD] INTEGER
, [YYYYMMDDHH] INTEGER
, [DATA_AGG] INTEGER
)
INSERT INTO #_temp
VALUES
(201911, 20191101, 2019110100, 100),
(201911, 20191101, 2019110101, 125),
(201911, 20191101, 2019110103, 135),
(201911, 20191101, 2019110105, 95),
(201911, 20191130, 2019113020, 100),
(201911, 20191130, 2019113021, 110),
(201911, 20191130, 2019113022, 125),
(201911, 20191130, 2019113023, 135)
SELECT X.YYYYMM, X.YYYYMMDD, X.YYYYMMDDHH
-- Case: If 'target_hours' doesn't exist, then 0, else 1
, CASE WHEN X.target_hours IS NULL THEN '0' ELSE '1' END AS [HOUR_EXISTS]
FROM (
-- Select right 2 characters from converted [YYYYMMDDHH] to act as 'target values'
SELECT T.*
, RIGHT(CAST(T.[YYYYMMDDHH] AS VARCHAR(10)), 2) AS [target_hours]
FROM #_temp AS T
) AS X
-- Right join to keep all of our hours and only the target hours that match.
RIGHT JOIN #_hours AS H ON H.hour_value = X.target_hours
Sample output:
YYYYMM YYYYMMDD YYYYMMDDHH HOUR_EXISTS
201911 20191101 2019110100 1
201911 20191101 2019110101 1
NULL NULL NULL 0
201911 20191101 2019110103 1
NULL NULL NULL 0
201911 20191101 2019110105 1
NULL NULL NULL 0
With (almost) standard SQL, you can do a cross join of the distinct values of YYYYMMDD to a list of all possible hours and then left join to the table:
select concat(d.YYYYMMDD, h.hour) as YYYYMMDDHH,
case when t.YYYYMMDDHH is null then 0 else 1 end as hour_exists
from (select distinct YYYYMMDD from tablename) as d
cross join (
select '00' as hour union all select '01' union all
select '02' union all select '03' union all
select '04' union all select '05' union all
select '06' union all select '07' union all
select '08' union all select '09' union all
select '10' union all select '11' union all
select '12' union all select '13' union all
select '14' union all select '15' union all
select '16' union all select '17' union all
select '18' union all select '19' union all
select '20' union all select '21' union all
select '22' union all select '23'
) as h
left join tablename as t
on concat(d.YYYYMMDD, h.hour) = t.YYYYMMDDHH
order by concat(d.YYYYMMDD, h.hour)
Maybe in Snowflake you can construct the list of hours with a sequence much easier instead of all those UNION ALLs.
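It can indeed; here is a minimal sketch of producing the same 00-23 hour list with Snowflake's generator, as a drop-in replacement for the UNION ALL block above (this assumes you are on Snowflake, as stated in the question):
select lpad(row_number() over (order by null) - 1, 2, '0') as hour
from table(generator(rowcount => 24))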
This version accounts for the full range of days, across months and years. It's a simple cross join of the set of possible days with the set of possible hours of the day -- left joined to actual dates.
set first = (select min(yyyymmdd::number) from YYYYMMDDHH_DATA_AGG);
set last = (select max(yyyymmdd::number) from YYYYMMDDHH_DATA_AGG);
with
hours as (select row_number() over (order by null) - 1 h from table(generator(rowcount=>24))),
days as (
select
row_number() over (order by null) - 1 as n,
to_date($first::text, 'YYYYMMDD')::date + n as d,
to_char(d, 'YYYYMMDD') as yyyymmdd
from table(generator(rowcount=>($last-$first+1)))
)
select days.yyyymmdd || lpad(hours.h,2,0) as YYYYMMDDHH, nvl2(t.yyyymmddhh,1,0) as HOUR_EXISTS
from days cross join hours
left join YYYYMMDDHH_DATA_AGG t on t.yyyymmddhh = days.yyyymmdd || lpad(hours.h,2,0)
order by 1
;
$first and $last can be packed in as sub-queries if you prefer.
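A sketch of what that might look like. Note that GENERATOR's ROWCOUNT still has to be a literal, so the day list is over-generated and trimmed with a WHERE; the bounds CTE and the 10000-row ceiling are assumptions you would adjust to your data:
with
bounds as (
  select to_date(min(yyyymmdd), 'YYYYMMDD') as first_day,
         to_date(max(yyyymmdd), 'YYYYMMDD') as last_day
  from YYYYMMDDHH_DATA_AGG
),
hours as (
  select row_number() over (order by null) - 1 as h
  from table(generator(rowcount => 24))
),
days as (
  select (select first_day from bounds) + (row_number() over (order by null) - 1) as d
  from table(generator(rowcount => 10000))  -- must stay a constant; over-generate and trim below
)
select to_char(days.d, 'YYYYMMDD') || lpad(hours.h, 2, '0') as YYYYMMDDHH,
       nvl2(t.yyyymmddhh, 1, 0) as HOUR_EXISTS
from days
cross join hours
left join YYYYMMDDHH_DATA_AGG t
  on t.yyyymmddhh = to_char(days.d, 'YYYYMMDD') || lpad(hours.h, 2, '0')
where days.d <= (select last_day from bounds)
order by 1;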
First: I have two tables with a primary key (Agent_ID). I want to join both tables and filter on Agent_Type = 1 and status = 1.
Second: for agents who have not done any transaction in the last three months, get the total transaction value, month-wise, for their last active year.
Agent table
Agent_ID Agent_Type
234 1
456 1
567 1
678 0
Agent_Transaction table
Agent_ID Amount Transaction_Date status
234 70 23/7/2019 1
234 54 11/6/2019 0
234 30 23/5/2019 1
456 56 12/1/2019 1
456 80 15/3/2019 1
456 99 20/2/2019 1
456 76 23/12/2018 1
567 56 10/10/2018 0
567 60 30/6/2018 1
456
select Agent_ID,CONCAT(Extract(MONTH from Agent_Transaction.Transaction_Date),
EXTRACT (YEAR FROM Agent_Transaction.Transaction_Date))as MONTH_YEAR,
SUM(Agent_Transaction.Amount)AS TOTAL
from Agent
inner join Agent_Transaction
on Agent_Transaction.Agent_ID = Agent.Agent_ID
where Agent.Agent_Type='1' AND Agent_Transaction.status='1' AND
(Agent_Transaction.Transaction_Date between ADD_MONTHS(SYSDATE,-3) and SYSDATE)
GROUP BY Agent.Agent_ID,
CONCAT(Extract(MONTH from Agent_Transaction.Transaction_Date),EXTRACT (YEAR FROM Agent_Transaction.Transaction_Date)),
Agent_Transaction.Amount
But I didn't get what I expected.
As far as I understood the requirement, you can use the following query:
SELECT
A.AGENT_ID,
TO_CHAR(TRUNC(ATR.TRANSACTION_DATE, 'MONTH'), 'MONYYYY'), -- YOU CAN USE DIFFERENT FORMAT ACCORDING TO REQUIREMENT
SUM(AMOUNT) AS TOTAL_MONTHWISE_AMOUNT -- MONTHWISE TRANSACTION TOTAL
FROM
AGENT A
JOIN AGENT_TRANSACTION ATR ON ( A.AGENT_ID = ATR.AGENT_ID )
WHERE
-- EXCLUDING THE AGENTS THAT HAVE DONE ANY TRANSACTION IN THE LAST THREE MONTHS (KEEPING ONLY THOSE WITH NO RECENT TRANSACTION) USING THE FOLLOWING NOT IN
ATR.AGENT_ID NOT IN (
SELECT
DISTINCT ATR_IN1.AGENT_ID
FROM
AGENT_TRANSACTION ATR_IN1
WHERE
ATR_IN1.TRANSACTION_DATE > ADD_MONTHS(SYSDATE, - 3)
AND ATR_IN1.STATUS = 1 -- YOU CAN USE IT ACCORDING TO REQUIREMENT
)
-- FETCHING LAST YEAR DATA
AND EXTRACT(YEAR FROM ATR.TRANSACTION_DATE) = EXTRACT(YEAR FROM ADD_MONTHS(SYSDATE, - 12))
AND A.AGENT_TYPE = 1
AND ATR.STATUS = 1
GROUP BY
A.AGENT_ID,
TRUNC(ATR.TRANSACTION_DATE, 'MONTH');
Please comment if minor changes are required or you need different logic.
Cheers!!
-- Update --
Updated the query after OP described the original issue:
SELECT
AGENT_ID,
TO_CHAR(TRUNC(TRANSACTION_DATE, 'MONTH'), 'MONYYYY'), -- YOU CAN USE DIFFERENT FORMAT ACCORDING TO REQUIREMENT
SUM(AMOUNT) AS TOTAL_MONTHWISE_AMOUNT -- MONTHWISE TRANSACTION TOTAL
FROM
(
SELECT
A.AGENT_ID,
TRUNC(ATR.TRANSACTION_DATE, 'MONTH') AS TRANSACTION_DATE,
MAX(TRUNC(ATR.TRANSACTION_DATE, 'MONTH')) OVER(
PARTITION BY A.AGENT_ID
) AS LAST_TR_DATE,
AMOUNT,
AGENT_TYPE,
STATUS
FROM
AGENT A
JOIN AGENT_TRANSACTION ATR ON ( A.AGENT_ID = ATR.AGENT_ID )
WHERE
A.AGENT_TYPE = 1
AND ATR.STATUS = 1
)
WHERE
-- KEEPING ONLY AGENTS WHOSE MOST RECENT TRANSACTION FALLS WITHIN THE LAST THREE MONTHS
LAST_TR_DATE > ADD_MONTHS(SYSDATE, - 3)
-- FETCHING LAST YEAR DATA
AND TRANSACTION_DATE BETWEEN ADD_MONTHS(LAST_TR_DATE, - 12) AND LAST_TR_DATE
GROUP BY
AGENT_ID,
TRANSACTION_DATE;
Cheers!!
-- Update --
Your exact query should look like this:
SELECT
AGENT_ID,
TO_CHAR(TRUNC(TX_TIME, 'MONTH'), 'MONYYYY') AS MONTHYEAR,
SUM(TX_VALUE) AS TOTALMONTHWISE
FROM
(
SELECT
A.AGENT_ID,
TRUNC(ATR.TX_TIME, 'MONTH') AS TX_TIME, -- changed this alias name
MAX(TRUNC(ATR.TX_TIME, 'MONTH')) OVER(
PARTITION BY A.AGENT_ID
) AS LAST_TR_DATE,
ATR.TX_VALUE,
A.AGENT_TYPE_ID
FROM
TBLEZ_AGENT A
JOIN TBLEZ_TRANSACTION ATR ON ( A.AGENT_ID = ATR.SRC_AGENT_ID )
WHERE
A.AGENT_TYPE_ID = '3'
AND ATR.STATUS = '0'
AND ATR.TX_TYPE_ID = '5'
)
WHERE
LAST_TR_DATE < ADD_MONTHS(SYSDATE, - 3)
AND ( TX_TIME BETWEEN ADD_MONTHS(LAST_TR_DATE, - 12) AND LAST_TR_DATE )
GROUP BY
AGENT_ID,
TX_TIME;
-- UPDATE --
In response to this comment: "Hi Tejash, how do I get the total day-wise for the above scenario?"
SELECT
AGENT_ID,
TX_TIME,
SUM(TX_VALUE) AS TOTALDAYWISE
FROM
(
SELECT
A.AGENT_ID,
TRUNC(ATR.TX_TIME) AS TX_TIME, -- changed TRUNC INPUT PARAMETER -- REMOVED MONTH IN TRUNC
MAX(TRUNC(ATR.TX_TIME)) OVER( -- changed TRUNC INPUT PARAMETER -- REMOVED MONTH IN TRUNC
PARTITION BY A.AGENT_ID
) AS LAST_TR_DATE,
ATR.TX_VALUE,
A.AGENT_TYPE_ID
FROM
TBLEZ_AGENT A
JOIN TBLEZ_TRANSACTION ATR ON ( A.AGENT_ID = ATR.SRC_AGENT_ID )
WHERE
A.AGENT_TYPE_ID = '3'
AND ATR.STATUS = '0'
AND ATR.TX_TYPE_ID = '5'
)
WHERE
LAST_TR_DATE < ADD_MONTHS(SYSDATE, - 3)
AND ( TRUNC(TX_TIME, 'MONTH') BETWEEN ADD_MONTHS(LAST_TR_DATE, - 12) AND LAST_TR_DATE )
-- changed TRUNC INPUT PARAMETER -- ADDED MONTH IN TRUNC
GROUP BY
AGENT_ID,
TX_TIME;
Cheers!!
SELECT
AGENT_ID,
TO_CHAR(TRANSC_DATE, 'MONYYYY') AS MONTHYEAR,
SUM(TX_VALUE) AS TOTALMONTHWISE
FROM
(
SELECT
A.AGENT_ID,
TRUNC(ATR.TX_TIME, 'MONTH') AS TRANSC_DATE,
MAX(TRUNC(ATR.TX_TIME, 'MONTH')) OVER(
PARTITION BY A.AGENT_ID
) AS LAST_TR_DATE,
ATR.TX_VALUE,
A.AGENT_TYPE_ID
FROM
TBLEZ_AGENT A
JOIN TBLEZ_TRANSACTION ATR ON ( A.AGENT_ID = ATR.SRC_AGENT_ID )
WHERE
A.AGENT_TYPE_ID = '3'
AND ATR.STATUS = '0'
AND ATR.TX_TYPE_ID = '5'
)
WHERE
LAST_TR_DATE > ADD_MONTHS(SYSDATE, - 3)
AND ( TRANSC_DATE BETWEEN ADD_MONTHS(LAST_TR_DATE, - 12) AND LAST_TR_DATE )
GROUP BY
AGENT_ID, TRANSC_DATE
I have a table which shows a good, bad, or other status for a device every day. I want to display a row per device with today's status and the previous best status ('Good' if the status was good at any time in the time span, otherwise the previous day's status). I am using a join, and the query is shared below.
SELECT t1.devid,
t1.status AS Today_status,
t2.status AS yest_status,
t2.runtime AS yest_runtime
FROM devtable t1
INNER JOIN devtable t2
ON t1.devid = t2.devid
AND t1.RUNTIME = '17-jul-2018'
AND t2.runtime > '30-jun-2018'
ORDER BY t1.devID, (CASE WHEN t2.status LIKE 'G%' THEN 0 END), t2.runtime;
Now I am not able to group it to a single record per device (I'm getting many records per device). Can you suggest a solution for this?
This would be easier to interpret with sample data and results, but it sounds like you want something like:
select devid, runtime, status, prev_status,
coalesce(good_status, prev_status) as best_status
from (
select devid, runtime, status,
lag(status) over (partition by devid order by runtime) as prev_status,
max(case when status = 'Good' then status end) over (partition by devid) as good_status
from (
select devid, runtime, status
from devtable
where runtime > date '2018-06-30'
)
)
where runtime = date '2018-07-17';
The innermost query restricts the date range; if you need an upper bound on that (i.e. it isn't today as in your example) then include that as another filter.
The next layer out uses lag() and max() analytic functions to find the previous status, and any 'Good' status (via a case expression), for each ID.
The outer query then filters to only show the target end date, and uses coalesce() to show 'Good' if that existed, or the previous status if not.
Demo with some made-up sample data in a CTE:
with devtable (devid, runtime, status) as (
select 1, date '2018-06-30', 'Good' from dual -- should be ignored
union all select 1, date '2018-07-01', 'a' from dual
union all select 1, date '2018-07-16', 'b' from dual
union all select 1, date '2018-07-17', 'c' from dual
union all select 2, date '2018-07-01', 'Good' from dual
union all select 2, date '2018-07-16', 'e' from dual
union all select 2, date '2018-07-17', 'f' from dual
union all select 3, date '2018-07-01', 'g' from dual
union all select 3, date '2018-07-16', 'Good' from dual
union all select 3, date '2018-07-17', 'i' from dual
union all select 4, date '2018-07-01', 'j' from dual
union all select 4, date '2018-07-16', 'k' from dual
union all select 4, date '2018-07-17', 'Good' from dual
)
select devid, runtime, status, prev_status,
coalesce(good_status, prev_status) as best_status
from (
select devid, runtime, status,
lag(status) over (partition by devid order by runtime) as prev_status,
max(case when status = 'Good' then status end) over (partition by devid) as good_status
from (
select devid, runtime, status
from devtable
where runtime > date '2018-06-30'
)
)
where runtime = date '2018-07-17';
DEVID RUNTIME STAT PREV BEST
---------- ---------- ---- ---- ----
1 2018-07-17 c b b
2 2018-07-17 f e Good
3 2018-07-17 i Good Good
4 2018-07-17 Good k Good
You could remove the innermost query by moving that filter into the case expression:
select devid, runtime, status, prev_status,
coalesce(good_status, prev_status) as best_status
from (
select devid, runtime, status,
lag(status) over (partition by devid order by runtime) as prev_status,
max(case when runtime > date '2018-06-30' and status = 'Good' then status end)
over (partition by devid) as good_status
from devtable
)
where runtime = date '2018-07-17';
but that would probably do quite a lot more work as it would examine and calculate a lot of data you don't care about.
Analytic functions should do what you want. It is unclear what your results should look like, but this gathers the information you need:
SELECT d.*
FROM (SELECT d.*,
LAG(d.status) OVER (PARTITION BY d.devid ORDER BY d.runtime) as prev_status,
LAG(d.runtime) OVER (PARTITION BY d.devid ORDER BY d.runtime) as prev_runtime,
ROW_NUMBER() OVER (PARTITION BY d.devid ORDER BY d.runtime) as seqnum,
SUM(CASE WHEN status = 'Good' THEN 1 ELSE 0 END) OVER (PARTITION BY d.devid) as num_good
FROM devtable d
WHERE d.runtime = DATE '2018-07-17' AND
d.runtime > DATE '2018-06-30'
) d
WHERE seqnum = 1;
I have a SQL Server question that I'm trying to figure out at work:
There is a table with a status field which can contain a status called "Participate." I am only trying to find records if the latest status of the day is "Participate" and only if the status changed on the same day from another status to "Participate."
I don't want any records where the status was already "Participate." It must have changed to that status on the same day. You can tell when the status was changed by the datetime field ChangedOn.
In the sample below I would only want to bring back ID 1880 since the status of "Participated" has the latest timestamp. I would not bring back ID 1700 since the last record is "Other," and I would not bring back ID 1600 since "Participated" is the only status of that day.
ChangedOn Status ID
02/01/17 15:23 Terminated 1880
02/01/17 17:24 Participated 1880
02/01/17 09:00 Other 1880
01/31/17 01:00 Terminated 1700
01/31/17 02:00 Participated 1700
01/31/17 03:00 Other 1700
01/31/17 02:00 Participated 1600
I was thinking of using a Window function, but I'm not sure how to get started on this. It's been a few months since I've written a query like this so I'm a bit out of practice.
Thanks!
You can use window functions for this:
select t.*
from (select t.*,
             row_number() over (partition by id, cast(ChangedOn as date)
                                order by ChangedOn desc
                               ) as seqnum,
             sum(case when status <> 'Participate' then 1 else 0 end) over (partition by id, cast(ChangedOn as date)) as num_nonparticipate
      from t
     ) t
where seqnum = 1 and status = 'Participate' and
      num_nonparticipate > 0;
Can you check this?
WITH sample_table(ChangedOn,Status,ID)AS(
SELECT CONVERT(DATETIME,'02/01/2017 15:23'),'Terminated',1880 UNION ALL
SELECT '02/01/2017 17:24','Participated',1880 UNION ALL
SELECT '02/01/2017 09:00','Other',1880 UNION ALL
SELECT '01/31/2017 01:00','Terminated',1700 UNION ALL
SELECT '01/31/2017 02:00','Participated',1700 UNION ALL
SELECT '01/31/2017 03:00','Other',1700 UNION ALL
SELECT '01/31/2017 02:00','Participated',1600
)
SELECT ID FROM (
SELECT *
,ROW_NUMBER()OVER(PARTITION BY ID,CONVERT(VARCHAR,ChangedOn,112) ORDER BY ChangedOn) AS rn
,COUNT(0)OVER(PARTITION BY ID,CONVERT(VARCHAR,ChangedOn,112)) AS cnt
,CASE WHEN Status<>'Participated' THEN 1 ELSE 0 END AS ss
,SUM(CASE WHEN Status!='Participated' THEN 1 ELSE 0 END)OVER(PARTITION BY ID,CONVERT(VARCHAR,ChangedOn,112)) AS OtherStatusCnt
FROM sample_table
) AS t WHERE t.rn=t.cnt AND t.Status='Participated' AND t.OtherStatusCnt>0
--Return:
1880
Try this with other sample data:
declare @t table(ChangedOn datetime,Status varchar(50),ID int)
insert into @t VALUES
('02/01/17 15:23', 'Terminated' ,1880)
,('02/01/17 17:24', 'Participated' ,1880)
,('02/01/17 09:00', 'Other' ,1880)
,('01/31/17 01:00', 'Terminated' ,1700)
,('01/31/17 02:00', 'Participated' ,1700)
,('01/31/17 03:00', 'Other' ,1700)
,('01/31/17 02:00', 'Participated' ,1600)
;
WITH CTE
AS (
SELECT *
,row_number() OVER (
PARTITION BY id
,cast(ChangedOn AS DATE) ORDER BY ChangedOn DESC
) AS seqnum
FROM @t
)
SELECT *
FROM cte c
WHERE seqnum = 1
AND STATUS = 'Participated'
AND EXISTS (
SELECT id
FROM cte c1
WHERE seqnum > 1
AND c.id = c1.id
)
Second query (this one is better); the CTE here is the same:
SELECT *
FROM cte c
WHERE seqnum = 1
AND STATUS = 'Participated'
AND EXISTS (
SELECT id
FROM cte c1
WHERE STATUS != 'Participated'
AND c.id = c1.id
)
How do you create a moving average in SQL?
Current table:
Date Clicks
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520
2012-05-04 1,330
2012-05-05 2,260
2012-05-06 3,540
2012-05-07 2,330
Desired table or output:
Date Clicks 3 day Moving Average
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520 4,360
2012-05-04 1,330 3,330
2012-05-05 2,260 3,120
2012-05-06 3,540 3,320
2012-05-07 2,330 3,010
This is an evergreen Joe Celko question.
I don't know which DBMS platform is used, but in any case Joe was able to answer it more than 10 years ago with standard SQL.
Joe Celko SQL Puzzles and Answers citation:
"That last update attempt suggests that we could use the predicate to
construct a query that would give us a moving average:"
SELECT S1.sample_time, AVG(S2.load) AS avg_prev_hour_load
FROM Samples AS S1, Samples AS S2
WHERE S2.sample_time
BETWEEN (S1.sample_time - INTERVAL 1 HOUR)
AND S1.sample_time
GROUP BY S1.sample_time;
Is the extra column or the query approach better? The query is
technically better because the UPDATE approach will denormalize the
database. However, if the historical data being recorded is not going
to change and computing the moving average is expensive, you might
consider using the column approach.
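For completeness, a hedged sketch of the "extra column" approach the quote mentions, written in T-SQL to match the example below; it assumes an avg_prev_hour_load column has already been added to the Samples table (and brackets [load] because LOAD is a reserved word in T-SQL):
UPDATE S1
SET avg_prev_hour_load = (
    SELECT AVG(S2.[load])
    FROM Samples AS S2
    WHERE S2.sample_time BETWEEN DATEADD(HOUR, -1, S1.sample_time)
                             AND S1.sample_time)
FROM Samples AS S1;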
MS SQL Example:
CREATE TABLE #TestDW
( Date1 datetime,
LoadValue Numeric(13,6)
);
INSERT INTO #TestDW VALUES('2012-06-09' , '3.540' );
INSERT INTO #TestDW VALUES('2012-06-08' , '2.260' );
INSERT INTO #TestDW VALUES('2012-06-07' , '1.330' );
INSERT INTO #TestDW VALUES('2012-06-06' , '5.520' );
INSERT INTO #TestDW VALUES('2012-06-05' , '3.150' );
INSERT INTO #TestDW VALUES('2012-06-04' , '2.230' );
SQL Puzzle query:
SELECT S1.date1, AVG(S2.LoadValue) AS avg_prev_3_days
FROM #TestDW AS S1, #TestDW AS S2
WHERE S2.date1
BETWEEN DATEADD(d, -2, S1.date1 )
AND S1.date1
GROUP BY S1.date1
order by 1;
One way to do this is to join on the same table a few times.
select
(Current.Clicks
+ isnull(P1.Clicks, 0)
+ isnull(P2.Clicks, 0)
+ isnull(P3.Clicks, 0)) / 4 as MovingAvg3
from
MyTable as Current
left join MyTable as P1 on P1.Date = DateAdd(day, -1, Current.Date)
left join MyTable as P2 on P2.Date = DateAdd(day, -2, Current.Date)
left join MyTable as P3 on P3.Date = DateAdd(day, -3, Current.Date)
Adjust the DateAdd component of the ON clauses to match whether you want your moving average to run strictly from the past through now, or from days ago through days ahead.
This works nicely for situations where you need a moving average over only a few data points.
This is not an optimal solution for moving averages with more than a few data points.
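For engines that support window functions (SQL Server 2012+, Postgres, Oracle, and others), a minimal sketch of the same 3-point average without the self-joins, assuming one row per Date in MyTable:
select [Date],
       Clicks,
       -- * 1.0 avoids integer division truncating the average
       avg(Clicks * 1.0) over (order by [Date]
                               rows between 2 preceding and current row) as MovingAvg3
from MyTable
order by [Date];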
select t2.date, round(sum(ct.clicks)/3) as avg_clicks
from
(select date from clickstable) as t2,
(select date, clicks from clickstable) as ct
where datediff(t2.date, ct.date) between 0 and 2
group by t2.date
Obviously you can change the interval to whatever you need. You could also use count() instead of a magic number to make it easier to change, but that will also slow it down.
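A sketch of that count() variant, reusing clickstable from the query above, so the divisor tracks however many rows actually fall in the window:
select t2.date, round(sum(ct.clicks) / count(ct.clicks)) as avg_clicks
from
(select date from clickstable) as t2,
(select date, clicks from clickstable) as ct
where datediff(t2.date, ct.date) between 0 and 2
group by t2.date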
General template for rolling averages that scales well for large data sets
WITH moving_avg AS (
SELECT 0 AS [lag] UNION ALL
SELECT 1 AS [lag] UNION ALL
SELECT 2 AS [lag] UNION ALL
SELECT 3 AS [lag] --ETC
)
SELECT
DATEADD(day,[lag],[date]) AS [reference_date],
[otherkey1],[otherkey2],[otherkey3],
AVG([value1]) AS [avg_value1],
AVG([value2]) AS [avg_value2]
FROM [data_table]
CROSS JOIN moving_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];
And for weighted rolling averages:
WITH weighted_avg AS (
SELECT 0 AS [lag], 1.0 AS [weight] UNION ALL
SELECT 1 AS [lag], 0.6 AS [weight] UNION ALL
SELECT 2 AS [lag], 0.3 AS [weight] UNION ALL
SELECT 3 AS [lag], 0.1 AS [weight] --ETC
)
SELECT
DATEADD(day,[lag],[date]) AS [reference_date],
[otherkey1],[otherkey2],[otherkey3],
AVG([value1] * [weight]) / AVG([weight]) AS [wavg_value1],
AVG([value2] * [weight]) / AVG([weight]) AS [wavg_value2]
FROM [data_table]
CROSS JOIN weighted_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];
select *
, (select avg(c2.clicks) from #clicks_table c2
where c2.date between dateadd(dd, -2, c1.date) and c1.date) mov_avg
from #clicks_table c1
Use a different join predicate:
SELECT current.date
,avg(periods.clicks)
FROM current left outer join current as periods
ON current.date BETWEEN dateadd(d,-2, periods.date) AND periods.date
GROUP BY current.date HAVING COUNT(*) >= 3
The having statement will prevent any dates without at least N values from being returned.
assume x is the value to be averaged, xDate is the date value, and @refDate (a placeholder variable you supply) is the day you want the average for:
SELECT avg(x) FROM myTable WHERE xDate BETWEEN dateadd(d, -2, @refDate) AND @refDate -- @refDate is hypothetical; substitute your target date
In hive, maybe you could try
select date, clicks, avg(clicks) over (order by date rows between 2 preceding and current row) as moving_avg from clicktable;
For the purpose, I'd like to create an auxiliary/dimensional date table like
create table date_dim(date date, date_1 date, date_2 date, date_3 date, ...)
where date is the key, date_1 is this day, date_2 covers this day and the day before, date_3 covers this day and the two days before, and so on.
Then you can do the equi-join in hive.
Using a view like:
select date, date as join_date from date_dim
union all
select date, date_add(date, -1) from date_dim
union all
select date, date_add(date, -2) from date_dim
union all
select date, date_add(date, -3) from date_dim
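And then the equi-join itself, a sketch that assumes the union above is saved as a view named date_window with columns (date, join_date); both names are placeholders:
select w.date, avg(c.clicks) as moving_avg
from date_window w
join clicktable c on c.date = w.join_date
group by w.date;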
NOTE: THIS IS NOT AN ANSWER but an enhanced code sample of Diego Scaravaggi's answer. I am posting it as an answer because the comment section is insufficient. Note that I have parameterized the period for the moving average.
declare @p int = 3
declare @t table(d int, bal float)
insert into @t values
(1,94),
(2,99),
(3,76),
(4,74),
(5,48),
(6,55),
(7,90),
(8,77),
(9,16),
(10,19),
(11,66),
(12,47)
select a.d, avg(b.bal)
from
@t a
left join @t b on b.d between a.d-(@p-1) and a.d
group by a.d
--@p1 is the period of the moving average, @o1 is the offset
declare @p1 as int
declare @o1 as int
set @p1 = 5;
set @o1 = 3;
with np as(
select *, rank() over(partition by cmdty, tenor order by markdt) as r
from p_prices p1
where
1=1
)
, x1 as (
select s1.*, avg(s2.val) as avgval from np s1
inner join np s2
on s1.cmdty = s2.cmdty and s1.tenor = s2.tenor
and s2.r between s1.r - (@p1 - 1) - (@o1) and s1.r - (@o1)
group by s1.cmdty, s1.tenor, s1.markdt, s1.val, s1.r
)
select * from x1
I'm not sure that your expected result (output) shows a classic "simple moving (rolling) average" over 3 days, because, for example, the first triple of numbers by definition gives:
ThreeDaysMovingAverage = (2.230 + 3.150 + 5.520) / 3 = 3.6333333
but you expect 4.360, which is confusing.
Nevertheless, I suggest the following solution, which uses the window function AVG. This approach is much more efficient (clearer and less resource-intensive) than the self-joins introduced in other answers (and I'm surprised that no one has given a better solution).
-- Oracle-SQL dialect
with
data_table as (
select date '2012-05-01' AS dt, 2.230 AS clicks from dual union all
select date '2012-05-02' AS dt, 3.150 AS clicks from dual union all
select date '2012-05-03' AS dt, 5.520 AS clicks from dual union all
select date '2012-05-04' AS dt, 1.330 AS clicks from dual union all
select date '2012-05-05' AS dt, 2.260 AS clicks from dual union all
select date '2012-05-06' AS dt, 3.540 AS clicks from dual union all
select date '2012-05-07' AS dt, 2.330 AS clicks from dual
),
param as (select 3 days from dual)
select
dt AS "Date",
clicks AS "Clicks",
case when rownum >= p.days then
avg(clicks) over (order by dt
rows between p.days - 1 preceding and current row)
end
AS "3 day Moving Average"
from data_table t, param p;
You see that AVG is wrapped with case when rownum >= p.days then to force NULLs in first rows, where "3 day Moving Average" is meaningless.
We can apply Joe Celko's "dirty" left outer join method (as cited above by Diego Scaravaggi) to answer the question as it was asked.
declare @ClicksTable table ([Date] date, Clicks int)
insert into @ClicksTable
select '2012-05-01', 2230 union all
select '2012-05-02', 3150 union all
select '2012-05-03', 5520 union all
select '2012-05-04', 1330 union all
select '2012-05-05', 2260 union all
select '2012-05-06', 3540 union all
select '2012-05-07', 2330
This query:
SELECT
T1.[Date],
T1.Clicks,
-- AVG ignores NULL values so we have to explicitly NULLify
-- the days when we don't have a full 3-day sample
CASE WHEN count(T2.[Date]) < 3 THEN NULL
ELSE AVG(T2.Clicks)
END AS [3-Day Moving Average]
FROM @ClicksTable T1
LEFT OUTER JOIN @ClicksTable T2
ON T2.[Date] BETWEEN DATEADD(d, -2, T1.[Date]) AND T1.[Date]
GROUP BY T1.[Date]
Generates the requested output:
Date Clicks 3-Day Moving Average
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520 4,360
2012-05-04 1,330 3,330
2012-05-05 2,260 3,120
2012-05-06 3,540 3,320
2012-05-07 2,330 3,010