Finding entry page, exit page and bounces (SQL)

My table structure is as follows:
Sessionid Pageurl timestamp
abc1 /testpage1 1465374987308
abc1 /testpage2 1465375020477
abc2 /testpage2 1465374987308
I wish to create a report of entry page count, exit page count and bounces count per page.
For any session, the first page is the entry page and the last page is the exit page.
A bounce occurs when the user leaves after viewing the first page (the session has a single entry).
The final report would look like this:
pageurl EntrypageCount ExitPagecount BounceCount
/testpage1 1 0 0
/testpage2 1 2 1
I have been able to get bounces, but only on a per-day basis.
For bounces, the base select is:
SELECT sessionid, MIN(timestamp), CASE WHEN COUNT(*) = 1 THEN 1 ELSE 0 END AS bounces
FROM auditdata GROUP BY sessionid;
But I cannot figure out how to get them by pageurl.
All help is sincerely appreciated.
Thanks

The following is one way (demo).
SELECT Pageurl,
COUNT(CASE WHEN timestamp = First THEN 1 END) AS EntrypageCount,
COUNT(CASE WHEN timestamp = Last THEN 1 END) AS ExitPagecount,
COUNT(CASE WHEN Count = 1 THEN 1 END) AS BounceCount
FROM (SELECT Pageurl,
timestamp,
MIN(timestamp) OVER (PARTITION BY Sessionid) AS First,
MAX(timestamp) OVER (PARTITION BY Sessionid) AS Last,
COUNT(*) OVER (PARTITION BY Sessionid) AS Count
FROM auditdata) T
GROUP BY Pageurl;
The above uses window functions, which most modern RDBMSs support. A version without them would be:
SELECT Pageurl,
COUNT(CASE WHEN timestamp = First THEN 1 END) AS EntrypageCount,
COUNT(CASE WHEN timestamp = Last THEN 1 END) AS ExitPagecount,
COUNT(CASE WHEN Count = 1 THEN 1 END) AS BounceCount
FROM auditdata a
JOIN (SELECT Sessionid,
MIN(timestamp) AS First,
MAX(timestamp) AS Last,
COUNT(*) AS Count
FROM auditdata
GROUP BY Sessionid) g
ON a.Sessionid = g.Sessionid
GROUP BY Pageurl;
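As a quick check, the join-based version can be run against the three sample rows from the question. This is only a sketch: it uses Python's built-in sqlite3 module with an in-memory database rather than the asker's actual RDBMS, and the aliases first_ts, last_ts and n_pages are renamed here (an assumption, to stay clear of keyword-like names such as First, Last and Count).

```python
import sqlite3

# In-memory stand-in for the asker's auditdata table (sample rows from the question).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auditdata (Sessionid TEXT, Pageurl TEXT, timestamp INTEGER)")
conn.executemany(
    "INSERT INTO auditdata VALUES (?, ?, ?)",
    [
        ("abc1", "/testpage1", 1465374987308),
        ("abc1", "/testpage2", 1465375020477),
        ("abc2", "/testpage2", 1465374987308),
    ],
)

# Join every page view to its session's first/last timestamp and page count,
# then count matches per page. first_ts/last_ts/n_pages are illustrative aliases.
query = """
SELECT a.Pageurl,
       COUNT(CASE WHEN a.timestamp = g.first_ts THEN 1 END) AS EntrypageCount,
       COUNT(CASE WHEN a.timestamp = g.last_ts  THEN 1 END) AS ExitPagecount,
       COUNT(CASE WHEN g.n_pages = 1 THEN 1 END)            AS BounceCount
FROM auditdata a
JOIN (SELECT Sessionid,
             MIN(timestamp) AS first_ts,
             MAX(timestamp) AS last_ts,
             COUNT(*)       AS n_pages
      FROM auditdata
      GROUP BY Sessionid) g
  ON a.Sessionid = g.Sessionid
GROUP BY a.Pageurl
ORDER BY a.Pageurl
"""
report = conn.execute(query).fetchall()
for row in report:
    print(row)
```

On the sample data this prints ('/testpage1', 1, 0, 0) and ('/testpage2', 1, 2, 1), which matches the report requested in the question.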

Related

SQL - Set marker for special data-constellations

I need some SQL advice here...
I've got a table with an object (called "entityid") , an updated timestamp and a status of that object.
I now want to track how often that object was set to "inactive" by the user. It should count at most once per day, and if the previous status was also inactive, it should not count.
Here is a small example I prepared in Excel to show where the marker should appear and where it should not:
Do you have any advice on how I can solve this using SQL? (We're currently working with Redshift -> PostgreSQL.)
If I understand correctly, you can use window functions. This returns the first "inactive" on each day:
select t.*,
(content_status = 'inactive' and
row_number() over (partition by entityid, updated_at::date, content_status order by lastmodifiedtimestamp) = 1
) as needed_marker
from t;
Note: I'm not sure if updated_at is just the date. If it is, then the logic is more like:
select t.*,
(content_status = 'inactive' and
row_number() over (partition by entityid, updated_at, content_status order by lastmodifiedtimestamp) = 1
) as needed_marker
from t;
EDIT:
If you want the first time that the status changes from active to inactive, then:
select t.*,
(content_status = 'inactive' and
num_actives = 1 and
prev_status = 'active'
) as needed_marker
from (select t.*,
sum(case when content_status = 'active' then 1 else 0 end) over (partition by entityid, updated_at order by lastmodifiedtimestamp) as num_actives,
lag(content_status) over (partition by entityid, updated_at order by lastmodifiedtimestamp) as prev_status
from t
) t;
Actually, the subquery is not needed:
select t.*,
(content_status = 'inactive' and
sum(case when content_status = 'active' then 1 else 0 end) over (partition by entityid, updated_at order by lastmodifiedtimestamp) = 1 and
lag(content_status) over (partition by entityid, updated_at order by lastmodifiedtimestamp) = 'active'
) as needed_marker
from t;
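To see how the "first inactive per day" marker behaves, here is a small sketch of the row_number() variant run through Python's sqlite3 module. Several things are assumptions: the sample rows are invented (the Excel example from the question is not available), updated_at is treated as a full timestamp and also used for ordering instead of lastmodifiedtimestamp, date(updated_at) replaces the Redshift cast updated_at::date, and the bundled SQLite must be 3.25 or newer for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (entityid INTEGER, updated_at TEXT, content_status TEXT)")
# Invented sample rows: one entity, two days.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "2020-01-01 08:00:00", "active"),
        (1, "2020-01-01 09:00:00", "inactive"),
        (1, "2020-01-01 10:00:00", "inactive"),
        (1, "2020-01-02 08:00:00", "inactive"),
        (1, "2020-01-02 09:00:00", "active"),
    ],
)

# Number the rows within each (entity, day, status) group by time;
# the marker is set only on the first 'inactive' row of each day.
query = """
SELECT updated_at, content_status,
       (content_status = 'inactive' AND
        ROW_NUMBER() OVER (PARTITION BY entityid, date(updated_at), content_status
                           ORDER BY updated_at) = 1) AS needed_marker
FROM t
ORDER BY entityid, updated_at
"""
rows = conn.execute(query).fetchall()
markers = [marker for _, _, marker in rows]
for row in rows:
    print(row)
```

Note that this variant still marks the 08:00 row on the second day even though the previous status was already inactive; ruling that out is what the lag()-based queries in the EDIT are for.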

Hive rolling sum of data over date

I am working on Hive and am facing an issue with rolling counts. The sample data I am working on is as shown below:
and the output I am expecting is as shown below:
I tried using the following query but it is not returning the rolling count:
select event_dt, status, count(distinct account)
from (select *, row_number() over (partition by account order by event_dt desc) as rnum
from table.A
where event_dt between '2018-05-02' and '2018-05-04') x
where rnum = 1
group by event_dt, status;
Please help me with this if someone has solved a similar issue.
You seem to just want conditional aggregation:
select event_dt,
sum(case when status = 'Registered' then 1 else 0 end) as registered,
sum(case when status = 'active_acct' then 1 else 0 end) as active_acct,
sum(case when status = 'suspended' then 1 else 0 end) as suspended,
sum(case when status = 'reactive' then 1 else 0 end) as reactive
from table.A
group by event_dt
order by event_dt;
EDIT:
This is a tricky problem. The solution I've come up with does a cross-product of dates and users and then calculates the most recent status as of each date.
So:
select a.event_dt,
sum(case when aa.status = 'Registered' then 1 else 0 end) as registered,
sum(case when aa.status = 'active_acct' then 1 else 0 end) as active_acct,
sum(case when aa.status = 'suspended' then 1 else 0 end) as suspended,
sum(case when aa.status = 'reactive' then 1 else 0 end) as reactive
from (select d.event_dt, ac.account, a.status,
max(case when a.status is not null then a.timestamp end) over (partition by ac.account order by d.event_dt) as last_status_timestamp
from (select distinct event_dt from table.A) d cross join
(select distinct account from table.A) ac left join
(select a.*,
row_number() over (partition by account, event_dt order by timestamp desc) as seqnum
from table.A a
) a
on a.event_dt = d.event_dt and
a.account = ac.account and
a.seqnum = 1 -- get the last one on the date
) a left join
table.A aa
on aa.timestamp = a.last_status_timestamp and
aa.account = a.account
group by a.event_dt
order by a.event_dt;
What this is doing is creating a derived table with rows for all accounts and dates. This has the status on certain days, but not all days.
The cumulative max for last_status_timestamp calculates the most recent timestamp that has a valid status. This is then joined back to the table to get the status on that date. Voila! This is the status used for the conditional aggregation.
The cumulative max and join is a work-around because Hive does not (yet?) support the ignore nulls option in lag().
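The cross-join, cumulative-max and join-back idea can be tried on a toy dataset. The sketch below is not Hive: it runs on SQLite through Python's sqlite3 module (3.25 or newer for window functions), and the table name acct_events, the column ts and the sample rows are all made up for illustration. It also assumes timestamps increase over time within each account.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acct_events (account INTEGER, event_dt TEXT, ts INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO acct_events VALUES (?, ?, ?, ?)",
    [
        (1, "2018-05-02", 1, "Registered"),
        (1, "2018-05-03", 2, "active_acct"),
        (2, "2018-05-02", 3, "Registered"),
        (2, "2018-05-04", 4, "suspended"),
    ],
)

query = """
WITH grid AS (
    -- one row per (date, account); last_ts carries the latest known timestamp forward
    SELECT d.event_dt, ac.account,
           MAX(l.ts) OVER (PARTITION BY ac.account ORDER BY d.event_dt) AS last_ts
    FROM (SELECT DISTINCT event_dt FROM acct_events) d
    CROSS JOIN (SELECT DISTINCT account FROM acct_events) ac
    LEFT JOIN (SELECT account, event_dt, MAX(ts) AS ts
               FROM acct_events
               GROUP BY account, event_dt) l
           ON l.event_dt = d.event_dt AND l.account = ac.account
)
SELECT g.event_dt,
       SUM(CASE WHEN aa.status = 'Registered'  THEN 1 ELSE 0 END) AS registered,
       SUM(CASE WHEN aa.status = 'active_acct' THEN 1 ELSE 0 END) AS active_acct,
       SUM(CASE WHEN aa.status = 'suspended'   THEN 1 ELSE 0 END) AS suspended,
       SUM(CASE WHEN aa.status = 'reactive'    THEN 1 ELSE 0 END) AS reactive
FROM grid g
LEFT JOIN acct_events aa            -- join back to fetch the status at last_ts
       ON aa.ts = g.last_ts AND aa.account = g.account
GROUP BY g.event_dt
ORDER BY g.event_dt
"""
rolling = conn.execute(query).fetchall()
for row in rolling:
    print(row)
```

Account 2 has no row on 2018-05-03, yet it is still counted as Registered on that date because its last known timestamp is carried forward by the cumulative max.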

How I can group by and count in PostgreSQL to prevent empty cells in result

I have a table in a PostgreSQL DB.
I need to calculate the SUM of counts for each event_type (for example, for 4 and 1).
When I use a query like this:
SELECT account_id, date,
CASE
WHEN event_type = 1 THEN SUM(count)
ELSE null
END AS shows,
CASE
WHEN event_type = 4 THEN SUM(count)
ELSE null
END AS clicks
FROM widgetstatdaily
WHERE account_id = 272 AND event_type = 1 OR event_type = 4
GROUP BY account_id, date, event_type
ORDER BY date
I receive this table with <null> fields. That happens because I have event_type in the SELECT, so I need to GROUP BY it.
How can I write a query that returns the result grouped by account_id and date, without nulls in the cells? Like this (first row):
272 2018-03-28 00:00:00.000000 57 2
Maybe I could group it after receiving the result?
You need conditional aggregation and some other fixes. Try this:
SELECT account_id, date,
SUM(CASE WHEN event_type = 1 THEN count END) as shows,
SUM(CASE WHEN event_type = 4 THEN count END) as clicks
FROM widgetstatdaily
WHERE account_id = 272 AND
event_type IN (1, 4)
GROUP BY account_id, date
ORDER BY date;
Notes:
The CASE expression should be an argument to the SUM().
The ELSE NULL is redundant. The default without an ELSE is NULL.
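The logic in the WHERE clause is probably not what you intend: AND binds more tightly than OR, so the original condition also matches rows with event_type = 4 from any account. That is fixed using IN.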
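The notes above can be checked on a few made-up rows. This sketch uses Python's sqlite3 module instead of PostgreSQL, and the sample data (including account 999) is invented to show both the conditional aggregation and the effect of the original WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE widgetstatdaily (account_id INTEGER, date TEXT, event_type INTEGER, count INTEGER)"
)
conn.executemany(
    "INSERT INTO widgetstatdaily VALUES (?, ?, ?, ?)",
    [
        (272, "2018-03-28", 1, 50),
        (272, "2018-03-28", 1, 7),
        (272, "2018-03-28", 4, 2),
        (272, "2018-03-28", 2, 9),    # other event type, should be ignored
        (999, "2018-03-28", 4, 100),  # other account, should be ignored
    ],
)

# Conditional aggregation: the CASE goes inside SUM(), so event_type
# does not have to appear in GROUP BY.
fixed = conn.execute("""
    SELECT account_id, date,
           SUM(CASE WHEN event_type = 1 THEN count END) AS shows,
           SUM(CASE WHEN event_type = 4 THEN count END) AS clicks
    FROM widgetstatdaily
    WHERE account_id = 272 AND event_type IN (1, 4)
    GROUP BY account_id, date
    ORDER BY date
""").fetchall()

# The original filter parses as (account_id = 272 AND event_type = 1) OR event_type = 4,
# so it also lets account 999 through.
leaked_accounts = [r[0] for r in conn.execute("""
    SELECT DISTINCT account_id
    FROM widgetstatdaily
    WHERE account_id = 272 AND event_type = 1 OR event_type = 4
    ORDER BY account_id
""")]

print(fixed)
print(leaked_accounts)
```

With these rows the fixed query returns a single row with shows = 57 and clicks = 2, while the original WHERE clause matches both accounts 272 and 999.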
Try this:
SELECT account_id, date,
SUM(CASE WHEN event_type = 1 THEN count else 0 END) as shows,
SUM(CASE WHEN event_type = 4 THEN count else 0 END) as clicks
FROM widgetstatdaily
WHERE account_id = 272 AND
event_type IN (1, 4)
GROUP BY account_id, date
ORDER BY date;

Funnel query with Amazon Redshift / PostgreSQL

I'm trying to analyze a funnel using event data in Redshift and have difficulties finding an efficient query to extract that data.
For example, in Redshift I have:
timestamp action user id
--------- ------ -------
2015-05-05 12:00 homepage 1
2015-05-05 12:01 product page 1
2015-05-05 12:02 homepage 2
2015-05-05 12:03 checkout 1
I would like to extract the funnel statistics. For example:
homepage_count product_page_count checkout_count
-------------- ------------------ --------------
100 50 25
Here homepage_count represents the distinct number of users who visited the homepage, product_page_count represents the distinct number of users who visited the product page after visiting the homepage, and checkout_count represents the number of users who checked out after visiting the homepage and the product page.
What would be the best query to achieve that with Amazon Redshift? Is it possible to do with a single query?
I think the best method might be to add flags to the data for the first visit of each type for each user and then use these for aggregation logic:
select sum(case when ts_homepage is not null then 1 else 0 end) as homepage_count,
sum(case when ts_productpage > ts_homepage then 1 else 0 end) as productpage_count,
sum(case when ts_checkout > ts_productpage and ts_productpage > ts_homepage then 1 else 0 end) as checkout_count
from (select userid,
min(case when action = 'homepage' then timestamp end) as ts_homepage,
min(case when action = 'product page' then timestamp end) as ts_productpage,
min(case when action = 'checkout' then timestamp end) as ts_checkout
from table t
group by userid
) t
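A small runnable sketch of this first-visit-timestamp approach, using Python's sqlite3 module rather than Redshift. The table name events, the column names ts and userid, and user 3 (who opens the product page before the homepage) are assumptions added for illustration; the other rows come from the question's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, action TEXT, userid INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2015-05-05 12:00", "homepage", 1),
        ("2015-05-05 12:01", "product page", 1),
        ("2015-05-05 12:02", "homepage", 2),
        ("2015-05-05 12:03", "checkout", 1),
        ("2015-05-05 12:04", "product page", 3),  # product page BEFORE homepage
        ("2015-05-05 12:05", "homepage", 3),
    ],
)

# Inner query: first timestamp of each step per user (NULL if never reached).
# Outer query: a step counts only if it happened after the previous step;
# comparisons against NULL are not true, so missing steps fall through to 0.
query = """
SELECT SUM(CASE WHEN ts_homepage IS NOT NULL THEN 1 ELSE 0 END) AS homepage_count,
       SUM(CASE WHEN ts_productpage > ts_homepage THEN 1 ELSE 0 END) AS productpage_count,
       SUM(CASE WHEN ts_checkout > ts_productpage
                 AND ts_productpage > ts_homepage THEN 1 ELSE 0 END) AS checkout_count
FROM (SELECT userid,
             MIN(CASE WHEN action = 'homepage'     THEN ts END) AS ts_homepage,
             MIN(CASE WHEN action = 'product page' THEN ts END) AS ts_productpage,
             MIN(CASE WHEN action = 'checkout'     THEN ts END) AS ts_checkout
      FROM events
      GROUP BY userid) per_user
"""
funnel = conn.execute(query).fetchone()
print(funnel)
```

This prints (3, 1, 1): all three users reached the homepage, but only user 1 viewed the product page after it and then checked out. User 3 is not counted at the second step because the first product page view came before the first homepage view.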
The above answer is correct. I have modified it for people using AWS Mobile Analytics with Redshift.
select sum(case when ts_homepage is not null then 1 else 0 end) as homepage_count,
sum(case when ts_productpage > ts_homepage then 1 else 0 end) as productpage_count,
sum(case when ts_checkout > ts_productpage and ts_productpage > ts_homepage then 1 else 0 end) as checkout_count
from (select client_id,
min(case when event_type = 'App Launch' then event_timestamp end) as ts_homepage,
min(case when event_type = 'SignUp Success' then event_timestamp end) as ts_productpage,
min(case when event_type = 'Start Quiz' then event_timestamp end) as ts_checkout
from awsma.v_event
group by client_id
) ts;
In case a more precise model is required: the product page can be opened twice, the first time before the home page and the second time after it. This case should usually be counted as a conversion as well.
Redshift SQL query:
SELECT
COUNT(
DISTINCT CASE WHEN cur_homepage_time IS NOT NULL
THEN user_id END
) Step1,
COUNT(
DISTINCT CASE WHEN cur_homepage_time IS NOT NULL AND cur_productpage_time IS NOT NULL
THEN user_id END
) Step2,
COUNT(
DISTINCT CASE WHEN
cur_homepage_time IS NOT NULL AND cur_productpage_time IS NOT NULL AND cur_checkout_time IS NOT NULL
THEN user_id END
) Step3
FROM (
SELECT
user_id,
timestamp,
COALESCE(homepage_time,
LAG(homepage_time) IGNORE NULLS OVER(PARTITION BY user_id
ORDER BY timestamp)
) cur_homepage_time,
COALESCE(productpage_time,
LAG(productpage_time) IGNORE NULLS OVER(PARTITION BY user_id
ORDER BY timestamp)
) cur_productpage_time,
COALESCE(checkout_time,
LAG(checkout_time) IGNORE NULLS OVER(PARTITION BY user_id
ORDER BY timestamp)
) cur_checkout_time
FROM
(
SELECT
timestamp,
user_id,
(CASE WHEN event = 'homepage'
THEN timestamp END) homepage_time,
(CASE WHEN event = 'product page'
THEN timestamp END) productpage_time,
(CASE WHEN event = 'checkout'
THEN timestamp END) checkout_time
FROM events
WHERE timestamp > '2016-05-01' AND timestamp < '2017-01-01'
ORDER BY user_id, timestamp
) event_times
ORDER BY user_id, timestamp
) event_windows
This query fills each row's cur_homepage_time, cur_productpage_time and cur_checkout_time with the most recent timestamp at which the corresponding event occurred. So if an event has already occurred by the time of a given row, its column is not NULL on that row.
More info here.

Multiple Queries in different table

(Also posted here.)
So I have two tables, one is invalid table and the other is valid table.
valid table:
id
status
date
invalid table:
id
status
date
I have to produce a report with this output:
date on-time late total valid invalid1 invalid2 total rate
--------- ------- ---- ----- ----- -------- -------- ----- ----
9/10/2011 4 10 14 3 3 3 6
date: common fields on the 2 tables, field to group by, how many records on that day has
on-time: count of all the id on the valid table
late: count of all the records(id) on the invalid table
total: total of on-time and late
valid: count of id on the valid table with the "valid" status
invalid1: count of id on the invalid table with "invalid1" status
invalid2: count of id on the invalid table with "invalid2" status
total: total of valid, invalid1, invalid2
rate: average of totals
It's basically multiple queries with different table. How can I achieve it?
Something like this?
SELECT
*,
(result.total + result._total) / 2 AS rate
FROM (
SELECT
date,
SUM(CASE WHEN data.valid = 1 THEN 1 ELSE 0 END) AS ontime,
SUM(CASE WHEN data.valid = 0 THEN 1 ELSE 0 END) AS late,
COUNT(*) AS total,
SUM(CASE WHEN data.valid = 1 AND data.status = 'valid' THEN 1 ELSE 0 END) AS valid,
SUM(CASE WHEN data.valid = 0 AND data.status = 'invalid1' THEN 1 ELSE 0 END) AS invalid1,
SUM(CASE WHEN data.valid = 0 AND data.status = 'invalid2' THEN 1 ELSE 0 END) AS invalid2,
SUM(CASE WHEN data.status IN ('valid', 'invalid1', 'invalid2') THEN 1 ELSE 0 END) AS _total
FROM (
SELECT
date,
status,
1 AS valid
FROM
Valid
UNION ALL
SELECT
date,
status,
0 AS valid
FROM
InValid ) AS data
GROUP BY
date) AS result
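The UNION ALL plus flag-column idea can be sketched on a few invented rows. This runs on SQLite through Python's sqlite3 module, not the asker's database; the flag is named is_valid and the output columns valid_count etc. are renamed here (assumptions) to avoid clashing with the table name valid, and the rate column is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE valid (id INTEGER, status TEXT, date TEXT)")
conn.execute("CREATE TABLE invalid (id INTEGER, status TEXT, date TEXT)")
# Invented sample rows for a single day.
conn.executemany("INSERT INTO valid VALUES (?, ?, ?)", [
    (1, "valid", "2011-09-10"),
    (2, "valid", "2011-09-10"),
    (3, "pending", "2011-09-10"),
])
conn.executemany("INSERT INTO invalid VALUES (?, ?, ?)", [
    (4, "invalid1", "2011-09-10"),
    (5, "invalid2", "2011-09-10"),
    (6, "invalid2", "2011-09-10"),
])

# Stack both tables, tagging each row with its origin, then aggregate per date.
query = """
SELECT date,
       SUM(CASE WHEN is_valid = 1 THEN 1 ELSE 0 END) AS ontime,
       SUM(CASE WHEN is_valid = 0 THEN 1 ELSE 0 END) AS late,
       COUNT(*) AS total,
       SUM(CASE WHEN is_valid = 1 AND status = 'valid'    THEN 1 ELSE 0 END) AS valid_count,
       SUM(CASE WHEN is_valid = 0 AND status = 'invalid1' THEN 1 ELSE 0 END) AS invalid1_count,
       SUM(CASE WHEN is_valid = 0 AND status = 'invalid2' THEN 1 ELSE 0 END) AS invalid2_count
FROM (SELECT date, status, 1 AS is_valid FROM valid
      UNION ALL
      SELECT date, status, 0 AS is_valid FROM invalid) AS data
GROUP BY date
ORDER BY date
"""
summary = conn.execute(query).fetchall()
print(summary)
```

For the sample rows this gives one row for 2011-09-10 with 3 on-time, 3 late, 6 total, and 2, 1 and 2 for the three status counts.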
SELECT date, ontime, late, ontime+late total, valid, invalid1, invalid2, valid+invalid1+invalid2 total
FROM
(SELECT date,
COUNT(*) late,
COUNT(IIF(status = 'invalid1', 1, NULL)) invalid1,
COUNT(IIF(status = 'invalid2', 1, NULL)) invalid2
FROM invalid
GROUP BY date
) JOIN (
SELECT date,
COUNT(*) ontime,
COUNT(IIF(status = 'valid', 1, NULL)) valid
FROM valid
GROUP BY date
) USING (date)
First of all, it seems that you are holding exactly the same information in 2 tables. I would recommend merging those tables and adding a boolean column called valid to record whether each row is valid.
A query on your existing DB structure might look something like this:
SELECT unioned.date, SUM(unioned.valid) AS valid, SUM(unioned.invalid1) AS invalid1, SUM(unioned.invalid2) AS invalid2 FROM (
( SELECT v.date AS date, COUNT(v.id) AS valid, 0 AS invalid1, 0 AS invalid2 FROM valid v WHERE v.status = 'valid' GROUP BY v.date)
UNION ALL
( SELECT i1.date AS date, 0 AS valid, COUNT(i1.id) AS invalid1, 0 AS invalid2 FROM invalid i1 WHERE i1.status = 'invalid1' GROUP BY i1.date)
UNION ALL
( SELECT i2.date AS date, 0 AS valid, 0 AS invalid1, COUNT(i2.id) AS invalid2 FROM invalid i2 WHERE i2.status = 'invalid2' GROUP BY i2.date)
) AS unioned GROUP BY unioned.date;