Funnel query with Amazon Redshift / PostgreSQL - sql

I'm trying to analyze a funnel using event data in Redshift and have difficulties finding an efficient query to extract that data.
For example, in Redshift I have:
timestamp         action        user id
----------------  ------------  -------
2015-05-05 12:00  homepage      1
2015-05-05 12:01  product page  1
2015-05-05 12:02  homepage      2
2015-05-05 12:03  checkout      1
I would like to extract the funnel statistics. For example:
homepage_count  product_page_count  checkout_count
--------------  ------------------  --------------
100             50                  25
Where homepage_count represents the distinct number of users who visited the homepage, product_page_count represents the distinct number of users who visited the product page after visiting the homepage, and checkout_count represents the number of users who checked out after visiting the homepage and the product page.
What would be the best query to achieve that with Amazon Redshift? Is it possible to do with a single query?

I think the best method might be to add flags to the data for the first visit of each type for each user and then use these for aggregation logic:
select sum(case when ts_homepage is not null then 1 else 0 end) as homepage_count,
sum(case when ts_productpage > ts_homepage then 1 else 0 end) as productpage_count,
sum(case when ts_checkout > ts_productpage and ts_productpage > ts_homepage then 1 else 0 end) as checkout_count
from (select userid,
min(case when action = 'homepage' then timestamp end) as ts_homepage,
min(case when action = 'product page' then timestamp end) as ts_productpage,
min(case when action = 'checkout' then timestamp end) as ts_checkout
from table t
group by userid
) t
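To check the query above against the sample rows from the question, here is a minimal sketch that runs it in an in-memory SQLite database from Python. SQLite stands in for Redshift here, and the table name events and the column name ts are my assumptions (the original uses the placeholders table and timestamp). Timestamps are stored as ISO strings, which compare correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: "events" and "ts" are illustrative names, not from the question.
conn.execute("CREATE TABLE events (ts TEXT, action TEXT, userid INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2015-05-05 12:00", "homepage", 1),
        ("2015-05-05 12:01", "product page", 1),
        ("2015-05-05 12:02", "homepage", 2),
        ("2015-05-05 12:03", "checkout", 1),
    ],
)

FUNNEL_SQL = """
select sum(case when ts_homepage is not null then 1 else 0 end) as homepage_count,
       sum(case when ts_productpage > ts_homepage then 1 else 0 end) as productpage_count,
       sum(case when ts_checkout > ts_productpage
                 and ts_productpage > ts_homepage then 1 else 0 end) as checkout_count
from (select userid,
             -- first time each user hit each step (NULL if never)
             min(case when action = 'homepage' then ts end) as ts_homepage,
             min(case when action = 'product page' then ts end) as ts_productpage,
             min(case when action = 'checkout' then ts end) as ts_checkout
      from events
      group by userid
     ) t
"""

funnel = conn.execute(FUNNEL_SQL).fetchone()
print(funnel)  # (2, 1, 1)
```

User 2 only reaches the homepage, so the comparisons against NULL fall through to the else 0 branch and that user is counted in the first step only.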

The above answer is correct. I have adapted it for people using AWS Mobile Analytics with Redshift:
select sum(case when ts_homepage is not null then 1 else 0 end) as homepage_count,
sum(case when ts_productpage > ts_homepage then 1 else 0 end) as productpage_count,
sum(case when ts_checkout > ts_productpage and ts_productpage > ts_homepage then 1 else 0 end) as checkout_count
from (select client_id,
min(case when event_type = 'App Launch' then event_timestamp end) as ts_homepage,
min(case when event_type = 'SignUp Success' then event_timestamp end) as ts_productpage,
min(case when event_type = 'Start Quiz' then event_timestamp end) as ts_checkout
from awsma.v_event
group by client_id
) ts;

In case a more precise model is required: the product page can be opened twice, the first time before the homepage and the second time after it. This case should usually be counted as a conversion as well.
Redshift SQL query:
SELECT
COUNT(
DISTINCT CASE WHEN cur_homepage_time IS NOT NULL
THEN user_id END
) Step1,
COUNT(
DISTINCT CASE WHEN cur_homepage_time IS NOT NULL AND cur_productpage_time IS NOT NULL
THEN user_id END
) Step2,
COUNT(
DISTINCT CASE WHEN
cur_homepage_time IS NOT NULL AND cur_productpage_time IS NOT NULL AND cur_checkout_time IS NOT NULL
THEN user_id END
) Step3
FROM (
SELECT
user_id,
timestamp,
COALESCE(homepage_time,
LAG(homepage_time) IGNORE NULLS OVER(PARTITION BY user_id
ORDER BY timestamp)
) cur_homepage_time,
COALESCE(productpage_time,
LAG(productpage_time) IGNORE NULLS OVER(PARTITION BY user_id
ORDER BY timestamp)
) cur_productpage_time,
COALESCE(checkout_time,
LAG(checkout_time) IGNORE NULLS OVER(PARTITION BY user_id
ORDER BY timestamp)
) cur_checkout_time
FROM
(
SELECT
timestamp,
user_id,
(CASE WHEN event = 'homepage'
THEN timestamp END) homepage_time,
(CASE WHEN event = 'product page'
THEN timestamp END) productpage_time,
(CASE WHEN event = 'checkout'
THEN timestamp END) checkout_time
FROM events
WHERE timestamp > '2016-05-01' AND timestamp < '2017-01-01'
ORDER BY user_id, timestamp
) event_times
ORDER BY user_id, timestamp
) event_windows
This query fills each row's cur_homepage_time, cur_productpage_time and cur_checkout_time with the most recent timestamp at which that event occurred. So if an event has occurred at or before a given time (that is, a given row), the corresponding column is not NULL.
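To make the fill behaviour concrete, here is a rough sketch in Python with an in-memory SQLite database. SQLite has no LAG ... IGNORE NULLS, so I swap in a correlated subquery that looks up the latest matching event at or before each row; that is a stand-in for the Redshift window function, not the same query. The column name ts and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema and sample data (hypothetical, not from the answer).
conn.execute("CREATE TABLE events (ts TEXT, event TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2016-06-01 10:00", "product page", 1),  # product page before homepage
        ("2016-06-01 10:05", "homepage", 1),
        ("2016-06-01 10:10", "product page", 1),  # product page again, after homepage
        ("2016-06-01 10:15", "checkout", 1),
    ],
)

FILL_SQL = """
SELECT e.ts,
       -- latest homepage event at or before this row, NULL if none yet
       (SELECT MAX(h.ts) FROM events h
         WHERE h.user_id = e.user_id AND h.event = 'homepage'
           AND h.ts <= e.ts) AS cur_homepage_time,
       -- latest product page event at or before this row, NULL if none yet
       (SELECT MAX(p.ts) FROM events p
         WHERE p.user_id = e.user_id AND p.event = 'product page'
           AND p.ts <= e.ts) AS cur_productpage_time
FROM events e
ORDER BY e.user_id, e.ts
"""

rows = conn.execute(FILL_SQL).fetchall()
for row in rows:
    print(row)
```

The first row has no homepage event yet, so cur_homepage_time is NULL there; from the 10:05 row onward both columns are filled.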

Related

SQL aggregate function inside an aggregate function

I know it's not possible to nest aggregate functions, but I want to achieve something like this and am quite confused about how to do it without compromising performance.
SELECT
date,
count(CASE WHEN SUM(active_time) > 5 THEN user_id END) AS total_active_users,
count(CASE WHEN SUM(active_time) > 5 AND is_admin = true THEN user_id END) AS total_active_admin_users
FROM
(
SELECT date, user_id, user_name, active_time, is_admin FROM users
)
GROUP BY date
I would really appreciate it if someone could suggest a way to achieve this.
Perhaps you want something like this:
select date,
sum(case when sum_active_time > 5 then 1 else 0 end) as total_active_users,
sum(case when sum_active_time > 5 and is_admin then 1 else 0 end) as total_active_admin_users
from (select u.*, sum(active_time) over (partition by user_id) as sum_active_time
from users u
) u
group by date;
However, I would expect user_id to be unique in a table called users. That makes me wonder why you need to do a count or sum at all. So, you might want:
select date,
sum(case when active_time > 5 then 1 else 0 end) as total_active_users,
sum(case when active_time > 5 and is_admin then 1 else 0 end) as total_active_admin_users
from users
group by date;
SELECT date,
COUNT(user_id) as total_active_users,
COUNT(CASE WHEN is_admin = 1 THEN user_id END ) as total_active_admin_users
FROM (
SELECT date, is_admin, user_id
FROM users
GROUP BY date, is_admin, user_id
HAVING SUM(active_time) > 5
) t
GROUP BY date
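For anyone who wants to try the two-level approach (aggregate per user first, then count per date), here is a small sketch that runs the last query above in an in-memory SQLite database from Python. The sample rows are invented, and I assume is_admin is stored as 0/1 and that a user can have several rows per date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (date TEXT, user_id INTEGER, is_admin INTEGER, active_time INTEGER)"
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01", 1, 1, 3),
        ("2020-01-01", 1, 1, 4),   # user 1 totals 7 on this date
        ("2020-01-01", 2, 0, 6),
        ("2020-01-01", 3, 0, 2),   # below the threshold
        ("2020-01-02", 1, 1, 1),   # below the threshold
        ("2020-01-02", 2, 0, 10),
    ],
)

ACTIVE_SQL = """
SELECT date,
       COUNT(user_id) AS total_active_users,
       COUNT(CASE WHEN is_admin = 1 THEN user_id END) AS total_active_admin_users
FROM (
    -- inner level: one row per (date, user) whose summed active_time exceeds 5
    SELECT date, is_admin, user_id
    FROM users
    GROUP BY date, is_admin, user_id
    HAVING SUM(active_time) > 5
) t
GROUP BY date
ORDER BY date
"""

active = conn.execute(ACTIVE_SQL).fetchall()
print(active)  # [('2020-01-01', 2, 1), ('2020-01-02', 1, 0)]
```

The HAVING clause does the job of the inner SUM, so the outer query never has to nest one aggregate inside another.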

Hive rolling sum of data over date

I am working on Hive and am facing an issue with rolling counts. The sample data I am working on is as shown below:
and the output I am expecting is as shown below:
I tried using the following query but it is not returning the rolling count:
select event_dt,status, count(distinct account) from
(select *, row_number() over (partition by account order by event_dt
desc)
as rnum from table.A
where event_dt between '2018-05-02' and '2018-05-04') x where rnum =1
group by event_dt, status;
Please help me with this if someone has solved a similar issue.
You seem to just want conditional aggregation:
select event_dt,
sum(case when status = 'Registered' then 1 else 0 end) as registered,
sum(case when status = 'active_acct' then 1 else 0 end) as active_acct,
sum(case when status = 'suspended' then 1 else 0 end) as suspended,
sum(case when status = 'reactive' then 1 else 0 end) as reactive
from table.A
group by event_dt
order by event_dt;
EDIT:
This is a tricky problem. The solution I've come up with does a cross-product of dates and users and then calculates the most recent status as of each date.
So:
select a.event_dt,
sum(case when aa.status = 'Registered' then 1 else 0 end) as registered,
sum(case when aa.status = 'active_acct' then 1 else 0 end) as active_acct,
sum(case when aa.status = 'suspended' then 1 else 0 end) as suspended,
sum(case when aa.status = 'reactive' then 1 else 0 end) as reactive
from (select d.event_dt, ac.account, a.status,
max(case when a.status is not null then a.timestamp end) over (partition by ac.account order by d.event_dt) as last_status_timestamp
from (select distinct event_dt from table.A) d cross join
(select distinct account from table.A) ac left join
(select a.*,
row_number() over (partition by account, event_dt order by timestamp desc) as seqnum
from table.A a
) a
on a.event_dt = d.event_dt and
a.account = ac.account and
a.seqnum = 1 -- get the last one on the date
) a left join
table.A aa
on aa.timestamp = a.last_status_timestamp and
aa.account = a.account
group by a.event_dt
order by a.event_dt;
What this is doing is creating a derived table with rows for all accounts and dates. This has the status on certain days, but not all days.
The cumulative max for last_status_timestamp calculates the most recent timestamp that has a valid status. This is then joined back to the table to get the status on that date. Voila! This is the status used for the conditional aggregation.
The cumulative max and join is a work-around because Hive does not (yet?) support the ignore nulls option in lag().
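The idea described above (cross join dates and accounts, then carry the last known status forward) can be tried out on a small scale. The sketch below uses Python with an in-memory SQLite database rather than Hive, and it replaces the cumulative max window with a correlated subquery that finds the latest timestamp on or before each date. The table name status_log, the column ts and the sample rows are all hypothetical, since the original sample data is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE status_log (event_dt TEXT, account TEXT, status TEXT, ts TEXT)"
)
# Hypothetical sample data: each account changes status on some days only.
conn.executemany(
    "INSERT INTO status_log VALUES (?, ?, ?, ?)",
    [
        ("2018-05-02", "A", "Registered", "2018-05-02 09:00"),
        ("2018-05-02", "B", "Registered", "2018-05-02 10:00"),
        ("2018-05-03", "A", "active_acct", "2018-05-03 09:00"),
        ("2018-05-04", "B", "suspended", "2018-05-04 09:00"),
    ],
)

ROLLING_SQL = """
SELECT d.event_dt,
       SUM(CASE WHEN s.status = 'Registered' THEN 1 ELSE 0 END) AS registered,
       SUM(CASE WHEN s.status = 'active_acct' THEN 1 ELSE 0 END) AS active_acct,
       SUM(CASE WHEN s.status = 'suspended' THEN 1 ELSE 0 END) AS suspended
FROM (SELECT DISTINCT event_dt FROM status_log) d
CROSS JOIN (SELECT DISTINCT account FROM status_log) ac
LEFT JOIN status_log s
       ON s.account = ac.account
      -- most recent status row for this account on or before this date
      AND s.ts = (SELECT MAX(x.ts) FROM status_log x
                   WHERE x.account = ac.account
                     AND x.event_dt <= d.event_dt)
GROUP BY d.event_dt
ORDER BY d.event_dt
"""

rolling = conn.execute(ROLLING_SQL).fetchall()
for row in rolling:
    print(row)
```

On 2018-05-03 account B has no new row, so it is still counted as Registered; that carry-forward is what the plain conditional aggregation misses. The sketch assumes timestamps are unique per account, otherwise the join would double count.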

How I can group by and count in PostgreSQL to prevent empty cells in result

I have this table in a PostgreSQL DB:
I need to calculate the SUM of counts for each event_type (for example, 4 and 1).
When I use a query like this:
SELECT account_id, date,
CASE
WHEN event_type = 1 THEN SUM(count)
ELSE null
END AS shows,
CASE
WHEN event_type = 4 THEN SUM(count)
ELSE null
END AS clicks
FROM widgetstatdaily WHERE account_id = 272 AND event_type = 1 OR event_type = 4 GROUP BY account_id, date, event_type ORDER BY date
I receive this table
With <null> fields. It's because I have event_type in the select and I need to GROUP BY on it.
How can I write the query so that the result is grouped by account_id and date, without nulls in the cells? Like this (first row):
272 2018-03-28 00:00:00.000000 57 2
Maybe I can group it after receiving the result?
You need conditional aggregation and some other fixes. Try this:
SELECT account_id, date,
SUM(CASE WHEN event_type = 1 THEN count END) as shows,
SUM(CASE WHEN event_type = 4 THEN count END) as clicks
FROM widgetstatdaily
WHERE account_id = 272 AND
event_type IN (1, 4)
GROUP BY account_id, date
ORDER BY date;
Notes:
The CASE expression should be an argument to the SUM().
The ELSE NULL is redundant. The default without an ELSE is NULL.
The logic in the WHERE clause is probably not what you intend. That is fixed using IN.
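To illustrate the WHERE clause note: AND binds more tightly than OR, so account_id = 272 AND event_type = 1 OR event_type = 4 is read as (account_id = 272 AND event_type = 1) OR event_type = 4 and lets in event_type 4 rows from every account. Here is a small sketch in Python with an in-memory SQLite database; the rows for account 999 are invented to make the difference visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE widgetstatdaily (account_id INTEGER, date TEXT, event_type INTEGER, count INTEGER)"
)
conn.executemany(
    "INSERT INTO widgetstatdaily VALUES (?, ?, ?, ?)",
    [
        (272, "2018-03-28", 1, 57),
        (272, "2018-03-28", 4, 2),
        (999, "2018-03-28", 4, 5),  # hypothetical other account
        (999, "2018-03-28", 1, 7),  # hypothetical other account
    ],
)

# Original filter: the OR branch ignores the account_id condition.
loose = conn.execute(
    "SELECT COUNT(*) FROM widgetstatdaily "
    "WHERE account_id = 272 AND event_type = 1 OR event_type = 4"
).fetchone()[0]

# Fixed filter: IN keeps both event types under the account_id condition.
strict = conn.execute(
    "SELECT COUNT(*) FROM widgetstatdaily "
    "WHERE account_id = 272 AND event_type IN (1, 4)"
).fetchone()[0]

# Conditional aggregation: one row per account and date, no NULL cells here.
report = conn.execute(
    """
    SELECT account_id, date,
           SUM(CASE WHEN event_type = 1 THEN count END) AS shows,
           SUM(CASE WHEN event_type = 4 THEN count END) AS clicks
    FROM widgetstatdaily
    WHERE account_id = 272 AND event_type IN (1, 4)
    GROUP BY account_id, date
    ORDER BY date
    """
).fetchall()

print(loose, strict, report)  # 3 2 [(272, '2018-03-28', 57, 2)]
```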
Try this:
SELECT account_id, date,
SUM(CASE WHEN event_type = 1 THEN count else 0 END) as shows,
SUM(CASE WHEN event_type = 4 THEN count else 0 END) as clicks
FROM widgetstatdaily
WHERE account_id = 272 AND
event_type IN (1, 4)
GROUP BY account_id, date
ORDER BY date;

Finding entry page, exit page and bounces -sql

My table structure is as follows:
Sessionid  Pageurl     timestamp
abc1       /testpage1  1465374987308
abc1       /testpage2  1465375020477
abc2       /testpage2  1465374987308
I wish to create a report of entry page count, exit page count and bounces count per page.
For any session, the first page is entry page and last page an exit page.
A bounce occurs when the user leaves after viewing the first page (the session has a single entry).
The final report would be as below:
pageurl     EntrypageCount  ExitPagecount  BounceCount
/testpage1  1               0              0
/testpage2  1               2              1
I have been able to get bounces, but only on a per-day basis.
For bounces, the base select is:
SELECT sessionid, min(timestamp),CASE WHEN count(*) = 1 THEN 1 ELSE 0 END AS bounces
FROM auditdata GROUP BY sessionid;
But I cannot figure out how to get them by pageurl.
All help is sincerely appreciated.
Thanks
The following is one way.
SELECT Pageurl,
COUNT(CASE WHEN timestamp = First THEN 1 END) AS EntrypageCount,
COUNT(CASE WHEN timestamp = Last THEN 1 END) AS ExitPagecount,
COUNT(CASE WHEN Count = 1 THEN 1 END) AS BounceCount
FROM (SELECT Pageurl,
timestamp,
MIN(timestamp) OVER (PARTITION BY Sessionid) AS First,
MAX(timestamp) OVER (PARTITION BY Sessionid) AS Last,
COUNT(*) OVER (PARTITION BY Sessionid) AS Count
FROM auditdata) T
GROUP BY Pageurl;
The above uses window functions, which most modern RDBMSs support. A version without them would be:
SELECT Pageurl,
COUNT(CASE WHEN timestamp = First THEN 1 END) AS EntrypageCount,
COUNT(CASE WHEN timestamp = Last THEN 1 END) AS ExitPagecount,
COUNT(CASE WHEN Count = 1 THEN 1 END) AS BounceCount
FROM auditdata a
JOIN (SELECT Sessionid,
MIN(timestamp) AS First,
MAX(timestamp) AS Last,
COUNT(*) AS Count
FROM auditdata
GROUP BY Sessionid) g
ON a.Sessionid = g.Sessionid
GROUP BY Pageurl;
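As a quick check, here is a sketch that runs the join-based version against the three sample rows from the question, using Python with an in-memory SQLite database. I renamed the derived columns to first_ts, last_ts and n_rows and the timestamp column to ts; those names are my own choice to stay clear of keywords, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auditdata (sessionid TEXT, pageurl TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO auditdata VALUES (?, ?, ?)",
    [
        ("abc1", "/testpage1", 1465374987308),
        ("abc1", "/testpage2", 1465375020477),
        ("abc2", "/testpage2", 1465374987308),
    ],
)

REPORT_SQL = """
SELECT a.pageurl,
       COUNT(CASE WHEN a.ts = g.first_ts THEN 1 END) AS entry_count,
       COUNT(CASE WHEN a.ts = g.last_ts THEN 1 END) AS exit_count,
       COUNT(CASE WHEN g.n_rows = 1 THEN 1 END) AS bounce_count
FROM auditdata a
JOIN (
    -- per-session first hit, last hit and number of page views
    SELECT sessionid,
           MIN(ts) AS first_ts,
           MAX(ts) AS last_ts,
           COUNT(*) AS n_rows
    FROM auditdata
    GROUP BY sessionid
) g ON a.sessionid = g.sessionid
GROUP BY a.pageurl
ORDER BY a.pageurl
"""

report = conn.execute(REPORT_SQL).fetchall()
print(report)  # [('/testpage1', 1, 0, 0), ('/testpage2', 1, 2, 1)]
```

This matches the report the question asks for. Note that two page views with the same timestamp in one session would both count as entry (or exit) pages.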

SQL sum of column value, unique per user per day

I have a postgres table that looks like this:
id | user_id | state | created_at
The state can be any of the following:
new, paying, paid, completing, complete, payment_failed, completion_failed
I need a statement that returns a report with the following:
sum of all paid states by date
sum of all completed states by date
sum of all new, paying, completing states by date with only one per user per day to be counted
sum of all payment_failed, completion_failed by date with only one per user per day to be counted
So far I have this:
SELECT
DATE(created_at) AS date,
SUM(CASE WHEN state = 'complete' THEN 1 ELSE 0 END) AS complete,
SUM(CASE WHEN state = 'paid' THEN 1 ELSE 0 END) AS paid
FROM orders
WHERE created_at BETWEEN ? AND ?
GROUP BY DATE(created_at)
A sum of the in progress and failed states is easy enough by adding this to the select:
SUM(CASE WHEN state IN('new','paying','completing') THEN 1 ELSE 0 END) AS in_progress,
SUM(CASE WHEN state IN('payment_failed','completion_failed') THEN 1 ELSE 0 END) AS failed
But I'm having trouble figuring out how to count only one in_progress and one failed state per user_id per day.
The reason I need this is to manipulate the failure rate in our stats, as many users who trigger a failure or incomplete order go on to trigger more which inflates our failure rate.
Thanking you in advance.
SELECT created_at::date AS the_date
,SUM(CASE WHEN state = 'complete' THEN 1 ELSE 0 END) AS complete
,SUM(CASE WHEN state = 'paid' THEN 1 ELSE 0 END) AS paid
,COUNT(DISTINCT CASE WHEN state IN('new','paying','completing')
THEN user_id ELSE NULL END) AS in_progress
,COUNT(DISTINCT CASE WHEN state IN('payment_failed','completion_failed')
THEN user_id ELSE NULL END) AS failed
FROM orders
WHERE created_at BETWEEN ? AND ?
GROUP BY created_at::date
I use the_date as the alias, since it is unwise (though allowed) to use the keyword date as an identifier.
You could use a similar technique for complete and paid, one is as good as the other there:
COUNT(CASE WHEN state = 'complete' THEN 1 ELSE NULL END) AS complete
Try something like:
SELECT
DATE(created_at) AS date,
SUM(CASE WHEN state = 'complete' THEN 1 ELSE 0 END) AS complete,
SUM(CASE WHEN state = 'paid' THEN 1 ELSE 0 END) AS paid,
COUNT(DISTINCT CASE WHEN state IN('new','paying','completing') THEN user_id ELSE NULL END) AS in_progress,
COUNT(DISTINCT CASE WHEN state IN('payment_failed','completion_failed') THEN user_id ELSE NULL END) AS failed
FROM orders
WHERE created_at BETWEEN ? AND ?
GROUP BY DATE(created_at);
The main idea: COUNT(DISTINCT ...) counts unique user_id values and won't count NULL values.
Details: aggregate functions, 4.2.7. Aggregate Expressions
The whole query with same style counts and simplified CASE WHEN ...:
SELECT
DATE(created_at) AS date,
COUNT(CASE WHEN state = 'complete' THEN 1 END) AS complete,
COUNT(CASE WHEN state = 'paid' THEN 1 END) AS paid,
COUNT(DISTINCT CASE WHEN state IN('new','paying','completing') THEN user_id END) AS in_progress,
COUNT(DISTINCT CASE WHEN state IN('payment_failed','completion_failed') THEN user_id END) AS failed
FROM orders
WHERE created_at BETWEEN ? AND ?
GROUP BY DATE(created_at);
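Here is a small sketch of the COUNT(DISTINCT CASE ...) idea on made-up data, run from Python against an in-memory SQLite database (standing in for Postgres). The orders rows are hypothetical, and I dropped the BETWEEN ? AND ? filter to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, user_id INTEGER, state TEXT, created_at TEXT)"
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, "new", "2020-01-01 09:00:00"),
        (2, 10, "paying", "2020-01-01 09:05:00"),          # same user, same day
        (3, 10, "paid", "2020-01-01 09:10:00"),
        (4, 11, "payment_failed", "2020-01-01 10:00:00"),
        (5, 11, "payment_failed", "2020-01-01 10:05:00"),  # repeated failure
        (6, 12, "new", "2020-01-01 11:00:00"),
        (7, 10, "complete", "2020-01-02 08:00:00"),
    ],
)

DAILY_SQL = """
SELECT DATE(created_at) AS the_date,
       COUNT(CASE WHEN state = 'complete' THEN 1 END) AS complete,
       COUNT(CASE WHEN state = 'paid' THEN 1 END) AS paid,
       -- DISTINCT user_id: each user counts at most once per day
       COUNT(DISTINCT CASE WHEN state IN ('new', 'paying', 'completing')
                           THEN user_id END) AS in_progress,
       COUNT(DISTINCT CASE WHEN state IN ('payment_failed', 'completion_failed')
                           THEN user_id END) AS failed
FROM orders
GROUP BY DATE(created_at)
ORDER BY the_date
"""

daily = conn.execute(DAILY_SQL).fetchall()
print(daily)  # [('2020-01-01', 0, 1, 2, 1), ('2020-01-02', 1, 0, 0, 0)]
```

User 10 has two in-progress rows and user 11 has two failures on 2020-01-01, but each is counted once, which is the point of counting distinct user_id values instead of rows.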