I am trying to find a pure SQL (Oracle) way to reveal how many jobs are running at various times of day in a batch scheduling system. The database table contains historical data about past runs of jobs, including the start time (AH_TimeStamp1) and the end time (AH_TimeStamp4).
My aim is to assemble data for plotting a histogram with the time of day on the X-axis and the number of jobs per time subdivision on the Y-axis. The time subdivisions could be hours (24 divisions per day) or finer-grained such as 10-minute intervals. Ideally the query should be constructed so that both the number of subdivisions per day and the offset from midnight are easily adjustable.
I could perform lots of UNIONs — one per time subdivision — but this seems inelegant and tedious, particularly for smaller time subdivisions.
My first answer above works fine for small datasets. But if you have a massive data set, and particularly if you want fine-grained time slots, you will need a divide-and-conquer strategy to make it acceptably performant. The SQL below is based on the principle that a hash join on an equality is eminently desirable for reducing the size of the temporary Cartesian product between your date ticks and your time-range data. Some date ranges will have the same hour for start and end. Some will have the same minute. Some will be within 30 minutes of each other, and so on. You can take advantage of this by progressively taking rows from the closest pairings (same minute) up to the more distant pairings (same 12-hour period), using a hash join in each case on the equality of the period bucket. Then pick up any stragglers that defy all your bucketizing at the end, without any equality condition. Hopefully that will be a small set. Here's the SQL:
WITH ahx AS (SELECT /*+ PARALLEL(8) MATERIALIZE */
ah_timestamp1,
ah_timestamp4,
TO_CHAR(TRUNC(ah.ah_timestamp1) + (FLOOR((ah.ah_timestamp1 - TRUNC(ah.ah_timestamp1))*2)/2),'HH24') ah_start_12hour,
TO_CHAR(TRUNC(ah.ah_timestamp4) + (FLOOR((ah.ah_timestamp4 - TRUNC(ah.ah_timestamp4))*2)/2),'HH24') ah_end_12hour,
TO_CHAR(TRUNC(ah.ah_timestamp1) + (FLOOR((ah.ah_timestamp1 - TRUNC(ah.ah_timestamp1))*6)/6),'HH24') ah_start_4hour,
TO_CHAR(TRUNC(ah.ah_timestamp4) + (FLOOR((ah.ah_timestamp4 - TRUNC(ah.ah_timestamp4))*6)/6),'HH24') ah_end_4hour,
TO_CHAR(ah.ah_timestamp1,'HH24') ah_start_hour,
TO_CHAR(ah.ah_timestamp4,'HH24') ah_end_hour,
TO_CHAR(TRUNC(ah.ah_timestamp1) + (FLOOR((ah.ah_timestamp1 - TRUNC(ah.ah_timestamp1))*48)/48),'HH24:MI') ah_start_30mins,
TO_CHAR(TRUNC(ah.ah_timestamp4) + (FLOOR((ah.ah_timestamp4 - TRUNC(ah.ah_timestamp4))*48)/48),'HH24:MI') ah_end_30mins,
TO_CHAR(ah.ah_timestamp1,'HH24:MI') ah_start_min,
TO_CHAR(ah.ah_timestamp4,'HH24:MI') ah_end_min
FROM ah),
ticks AS (SELECT /*+ NO_MERGE */
tick,
divisor,
time_offset,
TO_CHAR(TRUNC(tick_date) + (FLOOR((tick_date - TRUNC(tick_date))*2)/2),'HH24') tick_12hour,
TO_CHAR(TRUNC(tick_date) + (FLOOR((tick_date - TRUNC(tick_date))*6)/6),'HH24') tick_4hour,
TO_CHAR(tick_date,'HH24') tick_hour,
TO_CHAR(TRUNC(tick_date) + (FLOOR((tick_date - TRUNC(tick_date))*48)/48),'HH24:MI') tick_30mins,
TO_CHAR(tick_date,'HH24:MI') tick_min
FROM (SELECT ROWNUM tick,
divisor,
((ROWNUM-1)/divisor) time_offset,
TRUNC(SYSDATE)+((ROWNUM-1)/divisor) tick_date
FROM (SELECT 144 divisor FROM dual)
CONNECT BY level <= divisor))
SELECT time_period,
SUM(cnt) cnt
FROM (SELECT /*+ PARALLEL(8) USE_HASH(ahx ticks) */
TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI') time_period,
COUNT(*) cnt,
'min' bucket
FROM ticks,
ahx
WHERE ahx.ah_start_min = ahx.ah_end_min
AND ticks.tick_min = ahx.ah_start_min
AND TRUNC(ahx.ah_timestamp1)+time_offset BETWEEN ahx.ah_timestamp1 AND ahx.ah_timestamp4
GROUP BY TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI')
UNION ALL
SELECT /*+ PARALLEL(8) USE_HASH(ahx ticks) */
TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI') time_period,
COUNT(*) cnt,
'30 mins' bucket
FROM ticks,
ahx
WHERE ahx.ah_start_min != ahx.ah_end_min
AND ahx.ah_start_30mins = ahx.ah_end_30mins
AND ticks.tick_30mins = ahx.ah_start_30mins
AND TRUNC(ahx.ah_timestamp1)+time_offset BETWEEN ahx.ah_timestamp1 AND ahx.ah_timestamp4
GROUP BY TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI')
UNION ALL
SELECT /*+ PARALLEL(8) USE_HASH(ahx ticks) */
TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI') time_period,
COUNT(*) cnt,
'hour' bucket
FROM ticks,
ahx
WHERE ahx.ah_start_30mins != ahx.ah_end_30mins
AND ahx.ah_start_hour = ahx.ah_end_hour
AND ticks.tick_hour = ahx.ah_start_hour
AND TRUNC(ahx.ah_timestamp1)+time_offset BETWEEN ahx.ah_timestamp1 AND ahx.ah_timestamp4
GROUP BY TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI')
UNION ALL
SELECT /*+ PARALLEL(8) USE_HASH(ahx ticks) */
TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI') time_period,
COUNT(*) cnt,
'4 hour' bucket
FROM ticks,
ahx
WHERE ahx.ah_start_hour != ahx.ah_end_hour
AND ahx.ah_start_4hour = ahx.ah_end_4hour
AND ticks.tick_4hour = ahx.ah_start_4hour
AND TRUNC(ahx.ah_timestamp1)+time_offset BETWEEN ahx.ah_timestamp1 AND ahx.ah_timestamp4
GROUP BY TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI')
UNION ALL
SELECT /*+ PARALLEL(8) USE_HASH(ahx ticks) */
TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI') time_period,
COUNT(*) cnt,
'12 hour' bucket
FROM ticks,
ahx
WHERE ahx.ah_start_4hour != ahx.ah_end_4hour
AND ahx.ah_start_12hour = ahx.ah_end_12hour
AND ticks.tick_12hour = ahx.ah_start_12hour
AND TRUNC(ahx.ah_timestamp1)+time_offset BETWEEN ahx.ah_timestamp1 AND ahx.ah_timestamp4
GROUP BY TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI')
UNION ALL
SELECT /*+ PARALLEL(8) USE_HASH(ahx ticks) */
TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI') time_period,
COUNT(*) cnt,
'other' bucket
FROM ticks,
ahx
WHERE ahx.ah_start_12hour != ahx.ah_end_12hour
AND ahx.ah_start_4hour != ahx.ah_end_4hour
AND ahx.ah_start_hour != ahx.ah_end_hour
AND ahx.ah_start_30mins != ahx.ah_end_30mins
AND ahx.ah_start_min != ahx.ah_end_min
AND TRUNC(ahx.ah_timestamp1)+time_offset BETWEEN ahx.ah_timestamp1 AND ahx.ah_timestamp4
GROUP BY TO_CHAR(TRUNC(ahx.ah_timestamp1)+time_offset,'HH24:MI')
UNION ALL
SELECT tick_min time_period,
0 cnt,
'empty' bucket
FROM ticks)
GROUP BY time_period
ORDER BY 1
You need to generate a calendar of hours or minutes or whatever and drive your count off each of them. Something like this:
SELECT TO_CHAR(TRUNC(ah.ah_timestamp1)+((tick-1)/divisor),'HH24:MI') hour_of_day,
COUNT(*)
FROM (SELECT ROWNUM tick, 24 divisor
FROM dual
CONNECT BY level <= 24) x,
archiver_header ah
WHERE TRUNC(ah.ah_timestamp1)+((tick-1)/divisor) BETWEEN ah.ah_timestamp1 AND ah.ah_timestamp4
GROUP BY TO_CHAR(TRUNC(ah.ah_timestamp1)+((tick-1)/divisor),'HH24:MI')
For 10-minute intervals, change the divisor to 144.
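If you also want the offset from midnight to be adjustable, here is a sketch along the same lines (params, divisor and day_offset are illustrative names I've introduced; the table and columns are as in the question):
WITH params AS (SELECT 144 divisor, 0 day_offset FROM dual),
     ticks AS (SELECT ROWNUM tick, divisor, day_offset
               FROM params
               CONNECT BY level <= divisor)
SELECT TO_CHAR(TRUNC(ah.ah_timestamp1) + day_offset + ((tick-1)/divisor),'HH24:MI') time_period,
       COUNT(*) cnt
FROM ticks,
     archiver_header ah
WHERE TRUNC(ah.ah_timestamp1) + day_offset + ((tick-1)/divisor)
      BETWEEN ah.ah_timestamp1 AND ah.ah_timestamp4
GROUP BY TO_CHAR(TRUNC(ah.ah_timestamp1) + day_offset + ((tick-1)/divisor),'HH24:MI')
ORDER BY 1
Here day_offset is a fraction of a day, so for example 0.5/24 would shift the whole tick grid by 30 minutes.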
I have a query that returns, per network, a ratio of issuances (issuances from a specific network within a specific time period, divided by total issuances from all networks in that period). Right now it only returns the ratios for the current year (year-to-date, I mean); I want to include several time periods in it, such as one month ago, two months ago, etc. LEFT JOIN usually works, but I couldn't figure it out for this one. How do I do it?
Here is the query:
SELECT IR1.network,
count(*) / ((select count(*) FROM issuances_extended
where status = 'completed' and
issued_at >= date_trunc('year',current_date)) * 1.) as issuance_ratio_ytd
FROM issuances_extended as IR1 WHERE status = 'completed' and
(issued_at >= date_trunc('year',current_date))
GROUP BY
IR1.network
order by IR1.network
I would break your query into CTEs something like this:
with periods (period_name, period_range) as (
  values
    ('YTD', daterange(date_trunc('year', current_date)::date, null)),
    ('LY',  daterange(date_trunc('year', current_date - interval '1 year')::date,
                      date_trunc('year', current_date)::date)),
    ('LM',  daterange(date_trunc('month', current_date - interval '1 month')::date,
                      date_trunc('month', current_date)::date))
    -- Add whatever other intervals you want to see
), period_totals as ( -- Get period totals
  select p.period_name, p.period_range, count(*) as total_issuances
  from periods p
  join issuances_extended i
    on i.status = 'completed'
   and i.issued_at::date <@ p.period_range
  group by p.period_name, p.period_range
)
select p.period_name, p.period_range,
       i.network, count(*) as network_issuances,
       1.0 * count(*) / p.total_issuances as issuance_ratio
from period_totals p
join issuances_extended i
  on i.status = 'completed'
 and i.issued_at::date <@ p.period_range
group by p.period_name, p.period_range, i.network, p.total_issuances;
The problem with this is that you get rows instead of columns, but you can use a spreadsheet program or reporting tool to pivot if you need to. This method simplifies the calculations and lets you add whatever period ranges you want by adding more values to the periods CTE.
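If you'd rather pivot in SQL than in a spreadsheet, here is a sketch using conditional aggregation over the query above (the period names match the periods CTE; the subquery placeholder is illustrative):
select network,
       max(issuance_ratio) filter (where period_name = 'YTD') as ratio_ytd,
       max(issuance_ratio) filter (where period_name = 'LY')  as ratio_ly,
       max(issuance_ratio) filter (where period_name = 'LM')  as ratio_lm
from ( /* the full query above */ ) ratios
group by network
order by network;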
Something like this? Obviously not tested
SELECT
    mon.t AS period_start,
    IR1.network,
    count(*)/((select count(*) FROM issuances_extended
               where status = 'completed' and
                     issued_at between mon.t and current_date) * 1.) as issuance_ratio
FROM
    issuances_extended as IR1,
    (
    SELECT
        generate_series('2022-01-01'::date,
                        '2022-07-01'::date, '1 month') AS t)
    AS mon
WHERE
    IR1.status = 'completed' and
    (IR1.issued_at between mon.t and current_date)
GROUP BY
    mon.t,
    IR1.network
ORDER BY
    mon.t,
    IR1.network
I've managed to join these tables, so I am answering my own question for those who need similar help. To add more periods, all you have to do is add new subqueries as LEFT JOINs and reference them in the base query (IR3, IR4, and so on).
SELECT
IR1.network,
count(*) / (
(
select
count(*)
FROM
issuances_extended
where
status = 'completed'
and issued_at >= date_trunc('year', current_date)
) * 1./ 100
) as issuances_ratio_ytd,
max(coalesce(IR2.issuances_ratio_m0, 0)) as issuances_ratio_m0
FROM
issuances_extended as IR1
LEFT JOIN (
SELECT
network,
count(*) / (
(
select
count(*)
FROM
issuances_extended
where
status = 'completed'
and issued_at >= date_trunc('month', current_date)
) * 1./ 100
) as issuances_ratio_m0
FROM
issuances_extended
WHERE
status = 'completed'
and (issued_at >= date_trunc('month', current_date))
GROUP BY
network
) AS IR2 ON IR1.network = IR2.network
WHERE
status = 'completed'
and (issued_at >= date_trunc('year', current_date))
GROUP BY
IR1.network,
IR2.issuances_ratio_m0
order by
IR1.network
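For what it's worth, the same percentages can be computed in a single pass with conditional aggregation instead of one LEFT JOIN per period. A sketch against the same table (untested, assuming PostgreSQL's FILTER clause is available):
select network,
       100.0 * count(*) filter (where issued_at >= date_trunc('year', current_date))
             / nullif(sum(count(*) filter (where issued_at >= date_trunc('year', current_date))) over (), 0) as issuances_ratio_ytd,
       100.0 * count(*) filter (where issued_at >= date_trunc('month', current_date))
             / nullif(sum(count(*) filter (where issued_at >= date_trunc('month', current_date))) over (), 0) as issuances_ratio_m0
from issuances_extended
where status = 'completed'
group by network
order by network;
Each sum(...) over () totals the per-network counts across all networks, so no correlated subqueries are needed, and adding another period is just one more pair of expressions.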
I'm trying to write a query that builds a session duration for each event.
The database houses events from a webapp, each with a session-id and a timestamp.
Each row represents one event.
I thought I could solve this with a recursive query, but every attempt runs for minutes with no return. It's driving me crazy.
This is what I have so far.
with recursive session_time as (
select
f.data->'sessionId' as session_id,
f.ts,
null::timestamp with time zone as prev_timestamp,
0 as session_duration
from arbiter_events as f
union
select
n.data->'sessionId' as session_id,
n.ts,
st.ts as prev_timestamp,
(EXTRACT(epoch from (n.ts - (
select
st.ts
from arbiter_events p
where p.ts < n.ts
order by p.ts desc
limit 1
))) + st.session_duration)::integer as session_duration
from arbiter_events as n
inner join session_time st on st.session_id = n.data->'sessionId'
)
SELECT
ae.customer,
ae.username,
ae.data->'category' as category,
ae.data->'subCategory' as subcategory,
st.session_id,
st.session_duration
from arbiter_events ae
left join session_time st on ae.data->'sessionId' = st.session_id;
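A recursive CTE may be more machinery than this needs. Here is a non-recursive sketch using a window function, assuming arbiter_events has the columns shown above and that "session duration" means seconds elapsed from the session's first event to each event:
select ae.customer,
       ae.username,
       ae.data->'category' as category,
       ae.data->'subCategory' as subcategory,
       ae.data->'sessionId' as session_id,
       -- seconds from the session's first event to this event
       extract(epoch from ae.ts - min(ae.ts) over (partition by ae.data->'sessionId'))::integer as session_duration
from arbiter_events ae;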
I'm new to the SQL scene, but I've started to gather some data that makes sense to me after learning a little about SQL Developer. I do need help with a query, though.
My goal:
To use the current criteria I have and select records only when the date-time value is within 5 minutes of the latest date-time. Here is my current SQL statement:
SELECT ABAMS.T_WORKORDER_HIST.LINE_NO AS Line,
ABAMS.T_WORKORDER_HIST.STATE AS State,
ASMBLYTST.V_SEQ_SERIAL_ALL.BUILD_DATE,
ASMBLYTST.V_SEQ_SERIAL_ALL.SEQ_NO,
ASMBLYTST.V_SEQ_SERIAL_ALL.SEQ_NO_EXT,
ASMBLYTST.V_SEQ_SERIAL_ALL.UPD_REASON_CODE,
ABAMS.V_SERIAL_LINESET.LINESET_DATE AS "Lineset Time",
ABAMS.T_WORKORDER_HIST.SERIAL_NO AS ESN,
ABAMS.T_WORKORDER_HIST.ITEM_NO AS "Shop Order",
ABAMS.T_WORKORDER_HIST.CUST_NAME AS Customer,
ABAMS.T_ITEM_POLICY.PL_LOC_DROP_ZONE_NO AS PLDZ,
ABAMS.T_WORKORDER_HIST.CONFIG_NO AS Configuration,
ASMBLYTST.V_EDP_ENG_LAST_ABSN.LAST_ASMBLY_ABSN AS "Last Sta",
ASMBLYTST.V_LAST_ENG_LOCATION.LAST_ASMBLY_LOC,
ASMBLYTST.V_LAST_ENG_LOCATION.LAST_MES_LOC,
ASMBLYTST.V_LAST_ENG_LOCATION.LAST_ASMBLY_TIME,
ASMBLYTST.V_LAST_ENG_LOCATION.LAST_MES_TIME
FROM ABAMS.T_WORKORDER_HIST
LEFT JOIN ABAMS.V_SERIAL_LINESET
ON ABAMS.V_SERIAL_LINESET.SERIAL_NO = ABAMS.T_WORKORDER_HIST.SERIAL_NO
LEFT JOIN ASMBLYTST.V_EDP_ENG_LAST_ABSN
ON ASMBLYTST.V_EDP_ENG_LAST_ABSN.SERIAL_NO = ABAMS.T_WORKORDER_HIST.SERIAL_NO
LEFT JOIN ASMBLYTST.V_SEQ_SERIAL_ALL
ON ASMBLYTST.V_SEQ_SERIAL_ALL.SERIAL_NO = ABAMS.T_WORKORDER_HIST.SERIAL_NO
LEFT JOIN ABAMS.T_ITEM_POLICY
ON ABAMS.T_ITEM_POLICY.ITEM_NO = ABAMS.T_WORKORDER_HIST.ITEM_NO
LEFT JOIN ABAMS.T_CUR_STATUS
ON ABAMS.T_CUR_STATUS.SERIAL_NO = ABAMS.T_WORKORDER_HIST.SERIAL_NO
INNER JOIN ASMBLYTST.V_LAST_ENG_LOCATION
ON ASMBLYTST.V_LAST_ENG_LOCATION.SERIAL_NO = ABAMS.T_WORKORDER_HIST.SERIAL_NO
WHERE ABAMS.T_WORKORDER_HIST.LINE_NO = 10
AND (ABAMS.T_WORKORDER_HIST.STATE = 'PROD'
OR ABAMS.T_WORKORDER_HIST.STATE = 'SCHED')
AND ASMBLYTST.V_SEQ_SERIAL_ALL.BUILD_DATE BETWEEN TRUNC(SysDate) - 10 AND TRUNC(SysDate) + 1
AND (ABAMS.V_SERIAL_LINESET.LINESET_DATE IS NOT NULL
OR ABAMS.V_SERIAL_LINESET.LINESET_DATE IS NULL)
AND (ASMBLYTST.V_EDP_ENG_LAST_ABSN.LAST_ASMBLY_ABSN < '1800'
OR ASMBLYTST.V_EDP_ENG_LAST_ABSN.LAST_ASMBLY_ABSN IS NULL)
ORDER BY ASMBLYTST.V_EDP_ENG_LAST_ABSN.LAST_ASMBLY_ABSN DESC Nulls Last,
ABAMS.V_SERIAL_LINESET.LINESET_DATE Nulls Last,
ASMBLYTST.V_SEQ_SERIAL_ALL.BUILD_DATE,
ASMBLYTST.V_SEQ_SERIAL_ALL.SEQ_NO,
ASMBLYTST.V_SEQ_SERIAL_ALL.SEQ_NO_EXT
Here are some of the records I get from the table
ASMBLYTST.V_LAST_ENG_LOCATION.LAST_ASMBLY_TIME
2018-06-14 01:28:25
2018-06-14 01:29:26
2018-06-14 01:27:30
2018-06-13 22:44:03
2018-06-14 01:28:45
2018-06-14 01:27:37
2018-06-14 01:27:41
What I essentially want is for
2018-06-13 22:44:03
to be excluded from the query because it is not within the 5-minute window from the latest record, which in this data set is
2018-06-14 01:29:26
The one dynamic problem I seem to have is that the date-time values are constantly updating.
Any ideas?
Thank you!
Here are two different solutions; each uses a table called "ASET".
ASET contains 20 records 1 minute apart:
WITH
aset (ttime, cnt)
AS
(SELECT systimestamp AS ttime, 1 AS cnt
FROM DUAL
UNION ALL
SELECT ttime + INTERVAL '1' MINUTE AS ttime, cnt + 1 AS cnt
FROM aset
WHERE cnt < 20)
select * from aset;
Now, using ASET for our data, the following query finds the maximum date in ASET and restricts the results to the six records within 5 minutes of that maximum:
SELECT *
FROM aset
WHERE ttime >= (SELECT MAX (ttime)
FROM aset)
- INTERVAL '5' MINUTE;
An alternative is to use an analytic function:
with bset
AS
(SELECT ttime, cnt, MAX (ttime) OVER () - ttime AS delta
FROM aset)
SELECT *
FROM bset
WHERE delta <= INTERVAL '5' MINUTE
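Applied to your own query, the analytic version just means wrapping what you already have, something like this sketch (the inline comment stands in for your full SELECT, minus its ORDER BY):
SELECT *
FROM (SELECT q.*,
             MAX(LAST_ASMBLY_TIME) OVER () AS max_asmbly_time
      FROM ( /* your existing query, without the ORDER BY */ ) q)
WHERE LAST_ASMBLY_TIME >= max_asmbly_time - INTERVAL '5' MINUTE
ORDER BY LAST_ASMBLY_TIME DESC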
I need help formulating a cohort/retention query
I am trying to build a query to look at visitors who performed Action X on their first visit (in the time frame) and then see how many days later they returned to perform Action X again.
The output I (eventually) need looks like this...
The table I am dealing with is an export of Google Analytics to BigQuery
If anyone could help me with this, or has written a similar query that I can manipulate, I would appreciate it.
Thanks
Just to give you a simple idea/direction:
Below is for BigQuery Standard SQL
#standardSQL
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
You can test it with the dummy data below, taken from your question:
#standardSQL
WITH `OutputFromQuery` AS (
SELECT '01.07.17' AS Date_of_action_first_taken, 1000 AS Visits, 800 AS later_1_day, 400 AS later_2_days, 300 AS later_3_days UNION ALL
SELECT '02.07.17', 1000, 860, 780, 860 UNION ALL
SELECT '29.07.17', 1000, 780, 120, 0 UNION ALL
SELECT '30.07.17', 1000, 710, 0, 0
)
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
The OutputFromQuery data is as below:
Date_of_action_first_taken Visits later_1_day later_2_days later_3_days
01.07.17 1000 800 400 300
02.07.17 1000 860 780 860
29.07.17 1000 780 120 0
30.07.17 1000 710 0 0
and the final output is:
Date_of_action_first_taken later_1_day later_2_days later_3_days
01.07.17 80.0 40.0 30.0
02.07.17 90.0 78.0 86.0
29.07.17 80.0 12.0 0.0
30.07.17 70.0 0.0 0.0
I found this query in "Turn Your App Data into Answers with Firebase and BigQuery" (Google I/O '19).
It should work :)
#standardSQL
###################################################
# Part 1: Cohort of New Users Starting on DEC 24
###################################################
WITH
new_user_cohort AS (
SELECT DISTINCT
user_pseudo_id as new_user_id
FROM
`[your_project].[your_firebase_table].events_*`
WHERE
event_name = '[chosen_event]' AND
#set the date from when starting cohort analysis
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) = '20191224' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
),
num_new_users AS (
SELECT count(*) as num_users_in_cohort FROM new_user_cohort
),
#############################################
# Part 2: Engaged users from Dec 24 cohort
#############################################
engaged_users_by_day AS (
SELECT
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) as event_day,
COUNT(DISTINCT user_pseudo_id) as num_engaged_users
FROM
`[your_project].[your_firebase_table].events_*`
INNER JOIN
new_user_cohort ON new_user_id = user_pseudo_id
WHERE
event_name = 'user_engagement' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
GROUP BY
event_day
)
####################################################################
# Part 3: Daily Retention = [Engaged Users / Total Users]
####################################################################
SELECT
event_day,
num_engaged_users,
num_users_in_cohort,
ROUND((num_engaged_users / num_users_in_cohort), 3) as retention_rate
FROM
engaged_users_by_day
CROSS JOIN
num_new_users
ORDER BY
event_day
So I think I may have cracked it... from this output I then would need to manipulate it (pivot table it) to make it look like the desired output.
Can anyone review this for me and let me know what you think?
WITH
cohort_items AS (
SELECT
MIN( TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 +
h.time*1000)), DAY) ) AS cohort_day, fullVisitorID
FROM
TABLE123 AS U,
UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN'
GROUP BY 2
),
user_activites AS (
SELECT
A.fullVisitorID,
DATE_DIFF(DATE(TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 + h.time*1000)), DAY)), DATE(C.cohort_day), DAY) AS day_number
FROM `Table123` A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID,
UNNEST(hits) AS h
WHERE
A._TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN'
GROUP BY 1,2),
cohort_size AS (
SELECT
cohort_day,
count(1) as number_of_users
FROM
cohort_items
GROUP BY 1
ORDER BY 1
),
retention_table AS (
SELECT
C.cohort_day,
A.day_number,
COUNT(1) AS number_of_users
FROM
user_activites A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID
GROUP BY 1,2
)
SELECT
B.cohort_day,
S.number_of_users as total_users,
B.day_number,
B.number_of_users / S.number_of_users as percentage
FROM retention_table B
LEFT JOIN cohort_size S ON B.cohort_day = S.cohort_day
WHERE B.cohort_day IS NOT NULL
ORDER BY 1, 3
Thank you in advance!
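One way to do that pivot without leaving BigQuery, sketched against the final output above (retention_results is a placeholder standing in for the whole query wrapped as a subquery or CTE; days beyond 3 follow the same pattern):
SELECT
  cohort_day,
  ROUND(100 * MAX(IF(day_number = 1, percentage, 0)), 1) AS later_1_day,
  ROUND(100 * MAX(IF(day_number = 2, percentage, 0)), 1) AS later_2_days,
  ROUND(100 * MAX(IF(day_number = 3, percentage, 0)), 1) AS later_3_days
FROM retention_results
GROUP BY cohort_day
ORDER BY cohort_day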
If you use some techniques available in BigQuery, you can potentially solve this type of problem with a very cost-effective and performant solution. As an example:
SELECT
init_date,
ARRAY((SELECT AS STRUCT days, freq, ROUND(freq * 100 / MAX(freq) OVER(), 2) FROM UNNEST(data) ORDER BY days)) data
FROM(
SELECT
init_date,
ARRAY_AGG(STRUCT(days, freq)) data
FROM(
SELECT
init_date,
data AS days,
COUNT(data) freq
FROM(
SELECT
init_date,
ARRAY(SELECT DATE_DIFF(PARSE_DATE("%Y%m%d", dts), PARSE_DATE("%Y%m%d", init_date), DAY) AS dt FROM UNNEST(dts) dts) data
FROM(
SELECT
MIN(date) init_date,
ARRAY_AGG(DISTINCT date) dts
FROM `Table123`
WHERE TRUE
AND EXISTS(SELECT 1 FROM UNNEST(hits) where eventinfo.eventCategory = 'recommendation') -- This is your 'ACTION TAKEN' filter
AND _TABLE_SUFFIX BETWEEN "20170724" AND "20170731"
GROUP BY fullvisitorid
)
),
UNNEST(data) data
GROUP BY init_date, days
)
GROUP BY init_date
)
I tested this query against our G.A. data, selecting customers who interacted with our recommendation system (as you can see in the filter selection WHERE EXISTS...). Absolute values of freq are omitted here for privacy reasons, but as an example of the results: at day 28, 8% of customers came back 1 day later and interacted with the system again.
I recommend you play around with this query and see if it works well for you. It's simpler, cheaper, faster, and hopefully easier to maintain.
Created a SQL query to summarize some data. It is slow, so I thought I'd ask for some help.
The table is a log table that has:
loc, tag, entrytime, exittime, visits, entrywt, exitwt
My test log has 700,000 records in it. The entrytime and exittime are epoch values.
I know my query is inefficient as it rips through the table 4 times.
select
loc, edate, tag,
(select COUNT(*) from mylog as ml
where mvlog.loc = ml.loc
and mvlog.edate = CONVERT(date, DATEADD(ss, ml.entrytime, '19700101'))
and mvlog.tag = ml.tag) as visits,
(select SUM(entrywt - exitwt) from mylog as ml2
where mvlog.loc = ml2.loc
and mvlog.edate = CONVERT(date, DATEADD(ss, ml2.entrytime, '19700101'))
and mvlog.tag = ml2.tag) as consumed,
(select SUM(exittime - entrytime) from mylog as ml3
where mvlog.loc = ml3.loc
and mvlog.edate = CONVERT(date, DATEADD(ss, ml3.entrytime, '19700101'))
and mvlog.tag = ml3.tag) as occupancy
from
eventlogV as mvlog with (INDEX(pt_index))
Index pt_index is made up of columns tag and loc.
When I run this query, it completes in roughly 30 seconds. Since my query is inefficient, I am sure it can be better.
Any ideas appreciated.
Seems like you can just LEFT JOIN mylog to eventlogV once and get the same results.
SELECT mvlog.loc,
mvlog.edate,
mvlog.tag,
COUNT(ml.loc) AS visits,
SUM(entrywt - exitwt) AS consumed,
SUM(exittime - entrytime) AS occupancy
FROM eventlogV AS mvlog
LEFT OUTER JOIN mylog ml ON mvlog.loc = ml.loc
AND mvlog.edate = CONVERT(DATE,DATEADD(ss,ml.entrytime,'19700101'))
AND mvlog.tag = ml.tag
GROUP BY mvlog.loc,
mvlog.edate,
mvlog.tag
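If that's still not fast enough, the CONVERT(date, DATEADD(...)) in the join has to be evaluated per row and blocks index seeks. One option, sketched here under the assumption that this is SQL Server and that you can alter mylog (entry_date and the index name are made up):
-- Persist the derived calendar date once, so the join compares plain columns
ALTER TABLE mylog
  ADD entry_date AS CONVERT(date, DATEADD(ss, entrytime, CONVERT(datetime, '19700101', 112))) PERSISTED;

CREATE INDEX ix_mylog_loc_date_tag
  ON mylog (loc, entry_date, tag)
  INCLUDE (entrywt, exitwt, entrytime, exittime);
The join condition then becomes mvlog.edate = ml.entry_date.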