Calculating last order date by UserID from GA data - sql

I would like to calculate the last order date of an individual, by their UserID. My UserID is derived from a custom dimension in the automatically imported Google Analytics data.
I'm not sure how to go about this, as I'm quite new to SQL; I think I might be looking for a window function, but I'm not entirely sure!
Here is my code so far, but it returns the most recent order date against ALL IDs:
SELECT * FROM
(SELECT MAX(date) AS lastorddate, customDimension.value AS UserID
FROM `PROJECTNAME.ga_sessions_20*` AS t
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE customDimension.index = 2
AND totals.transactions > 0
GROUP BY Date, UserID)
GROUP BY UserID, lastorddate
ORDER BY lastorddate DESC
LIMIT 500

Below should work:
#standardSQL
SELECT MAX(date) AS lastorddate, customDimension.value AS UserID
FROM `PROJECTNAME.ga_sessions_20*` AS t
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE customDimension.index = 2
AND totals.transactions > 0
GROUP BY UserID
ORDER BY lastorddate DESC
LIMIT 500
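Since the question mentions window functions: below is a minimal sketch of the analytic equivalent (same table and fields as above, untested), which computes the per-user maximum with MAX() OVER and de-duplicates instead of grouping:
#standardSQL
SELECT DISTINCT
  customDimension.value AS UserID,
  MAX(date) OVER (PARTITION BY customDimension.value) AS lastorddate
FROM `PROJECTNAME.ga_sessions_20*` AS t
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE customDimension.index = 2
AND totals.transactions > 0
ORDER BY lastorddate DESC
LIMIT 500
Both forms return one row per UserID; GROUP BY is the simpler choice here, but the analytic version is handy if you want each session row annotated with the user's last order date (drop the DISTINCT).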

Related

Calculate rolling year totals in sql

I am gathering something that is essentially an "enrollment date" for users. The "enrollment date" is not stored in the database (for a reason too long to explain here), so I have to deduce it from the data. I then want to reuse this CTE in numerous places throughout another query to gather values such as "total orders 1 year before enrollment" and "total orders 1 year after enrollment".
I haven't gotten this code to run, as the real query is much more complex than this paraphrased version, and I have a feeling it's not the best way to do this. As you can see, my date conditionals are mostly just placeholders, but I think it should be obvious what I am trying to do.
That said, I think this would mostly work. My question is: is there a better way to do this? Additionally, could I combine the rolling year before and rolling year after into one table somehow (maybe with window functions)? This is part of a much bigger query, so the more consolidation I can do, the better.
For what it's worth, the subquery that derives the "enrollment date" is also more complex than shown here.
With enroll as (Select
user_id,
MIN(date) as e_date
FROM `orders` o
WHERE (subscribed = True)
group by user_id
)
Select *
from users
left join (select
user_id,
SUM(total_paid)
from orders where date > (select enroll.e_date where user_id = user_id) AND date < (select enroll.e_date where user_id = user_id + 365 days)
and order_type = 'special'
group by user_id
) as rolling_year_after on rolling_year_after.user_id = users.user_id
left join (select
user_id,
SUM(total_paid)
from orders where date < (select enroll.e_date where user_id = user_id) and date > (select enroll.e_date where user_id = user_id - 365 days)
and order_type = 'special'
group by user_id
) as rolling_year_before on rolling_year_before.user_id = users.user_id
Maybe something like this - not sure if it's more performant, but it looks a bit cleaner, and it combines the rolling year before and after into a single pass with conditional aggregation:
With enroll as (
  Select user_id,
         MIN(date) as e_date
  FROM `orders` o
  WHERE subscribed = True
  group by user_id
)
, rolling_year as (
  select o.user_id,
         -- DATE_ADD/DATE_SUB with INTERVAL assumed (BigQuery-style);
         -- swap in your dialect's date arithmetic if different
         SUM(CASE WHEN o.date BETWEEN e.e_date AND DATE_ADD(e.e_date, INTERVAL 365 DAY)
                  THEN o.total_paid ELSE 0 END) as rolling_year_after,
         SUM(CASE WHEN o.date BETWEEN DATE_SUB(e.e_date, INTERVAL 365 DAY) AND e.e_date
                  THEN o.total_paid ELSE 0 END) as rolling_year_before
  from orders o
  left join enroll e
    on o.user_id = e.user_id
  where o.order_type = 'special'
  group by o.user_id
)
Select *
from users
left join rolling_year
on users.user_id = rolling_year.user_id
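A note on the CTE-reuse part of the question: a CTE declared once in the WITH list (like enroll here) can be referenced by any later CTE and by the final SELECT of the same statement, so it can feed other aggregates besides rolling_year without being restated.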

How to get 1 record on the basis of two column values in a single table?

The query is
select distinct b.UserID , cast(b.entrytime as date) ,count(*) as UserCount
from [dbo].[person] as a
join [dbo].[personcookie] as b
on a.UserID = b.UserID
where cast (b.entrytime as date) >= '08/21/2020'
and cast (b.leavetime as date) <= '08/27/2020' and a.distinction = 99
group by cast(b.entrytime as date), b.UserID
If the same UserID has a count of more than 1 for the same date, it should be counted as 1. As shown in the image, UserID 10 has count 1 for 2020-08-26 and count 2 for 2020-08-27. The output should show that UserID 10 has a total count of 2 across 2020-08-26 and 2020-08-27 (because the count for 2020-08-27 should be reduced to 1), as per the requirement.
I have added an image of the tables and the output I want.
It seems you want one result row per user, so group by user, not by user and date. You want to count dates per user, but each day only once. This is a distinct count.
select
p.userid,
count(distinct cast(pc.entrytime as date)) as date_count
from dbo.person as p
join dbo.personcookie as pc on pc.userid = p.userid
where p.distinction = 99
and pc.entrytime >= '2020-08-21'
and pc.leavetime < '2020-08-28'
group by p.userid
order by p.userid;
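Note also that the rewritten WHERE clause compares entrytime and leavetime directly against a half-open date range instead of CASTing the columns, which keeps the predicates sargable (able to use an index on those columns).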
You seem to want dense_rank():
select p.UserID, cast(pc.entrytime as date),
dense_rank() over (partition by p.userID order by min(pc.entrytime)) as usercount
from [dbo].[person] p join
[dbo].[personcookie] pc
on pc.UserID = p.UserID
where cast(pc.entrytime as date) >= '2020-08-21' and
cast(pc.leavetime as date) <= '2020-08-27'
group by cast(pc.entrytime as date), p.UserID;
Notes:
The only "real" change is using dense_rank(), which enumerates the days for a given user.
Use meaningful table aliases, rather than arbitrary letters.
Use standard date/time constants. In SQL Server, that is either YYYYMMDD or YYYY-MM-DD.
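Applied to the question's own example (UserID 10 active on 2020-08-26 and 2020-08-27): the distinct-count query returns a single row, (10, 2), while the dense_rank() version returns one row per date, (10, 2020-08-26, 1) and (10, 2020-08-27, 2).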

How to generate session_id by sql?

My tracking system does not generate session IDs.
I have user_id & event_date_time.
I need a new session_id for each user's session that starts 30 minutes or more after that user's last event_date_time.
My final goal is to calculate median session time.
I tried to generate session_id=1 and session_id=2 once event_date_time-next_event_time>30 and guid=guid, but I'm stuck from here:
select a.*,
case when (a.next_event_date-a.event_date)*24*60<30 and userID=next_userID
then 1
when (a.next_event_date-a.event_date)*24*60>=30 and userID=next_userID then
2
end session_id
from
(select f.userID,
lead(f.userID) over (partition by f.guid order by f.event_date)
next_guid,
f.event_date,
lead(f.event_date) over (partition by f.guid order by f.event_date)
next_event_date
from event_table f
)a
where next_event_date is not null
If I understood correctly, you could generate IDs this way:
select id, guid, event_date,
sum(chg) over (partition by guid order by event_date) session_id
from (
select id, guid, event_date,
case when lag(guid) over (partition by guid order by event_date) = guid
and 24 * 60 * (event_date -lag(event_date)
over (partition by guid order by event_date) ) < 30
then 0 else 1
end chg
from event_table ) a
Compare neighbouring rows: if the guids differ or the time difference is greater than 30 minutes, assign 1 (otherwise 0), then sum these flags analytically to get the session_id.
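The question's final goal is the median session time, which none of the snippets above computes. Here is a follow-up sketch under the same assumptions (Oracle-style DATE arithmetic, MEDIAN() available, and the ID-assignment query above reused verbatim):
with sessionized as (
  select id, guid, event_date,
         sum(chg) over (partition by guid order by event_date) session_id
  from (
    select id, guid, event_date,
           case when lag(guid) over (partition by guid order by event_date) = guid
                 and 24 * 60 * (event_date - lag(event_date)
                     over (partition by guid order by event_date)) < 30
                then 0 else 1
           end chg
    from event_table) a
)
select median(session_minutes) as median_session_minutes
from (
  -- one row per session; single-event sessions count as 0 minutes here
  select guid, session_id,
         (max(event_date) - min(event_date)) * 24 * 60 as session_minutes
  from sessionized
  group by guid, session_id
);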
I think you're on the right track using lead or lag. My recommendation would be to break this into steps and create a temp table to work against:
With the first query, assign every record its own unique ID, either a sequence number or GUID. You could also capture some of the lagged data in this step.
With a second query, find the overlaps (< 30 minutes) and make the overlapping records all the same -- either the same as the earliest or latest in that grouping, doesn't matter as long as it's consistent.
Something like this:
create table events_temp as (
select f.*,
row_number() over (partition by f.userID order by f.event_date) as user_row,
lag(f.userID) over (partition by f.userID order by f.event_date) as prev_userID,
lag(f.event_date) over (partition by f.userID order by f.event_date) as prev_event_date
from event_table f
order by f.userId, f.event_date
)
select a.*,
case when prev_userID = userID
and 24 * 60 * (event_date - prev_event_date) < 30
then lag(user_row) over (partition by userID order by user_row)
else user_row
end as session_id
from events_temp a
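One caveat with this sketch: lag(user_row) only looks one row back, so in a chain of three or more events that each fall within 30 minutes of the previous one, the later rows inherit different IDs (1, 1, 2, ...) rather than all collapsing to the first row's ID. The running SUM of flags in the previous answer handles such chains directly.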

Query for both daily aggregate, and then monthly aggregates in the same query?

I would like to count the number of daily unique active users by subreddit and day, and then aggregate these counts into monthly unique active users by subreddit and month. Doing each one individually is simple enough, but when I try to do them in one combined query, it tells me that I need to group by date_month_day in my second-level subquery, which would result in monthly_unique_authors being the same as daily_unique_authors (Error: Expression 'date_month_day' is not present in the GROUP BY list [invalidQuery]).
Here is the query I have so far:
SELECT * FROM
(
SELECT *,
(daily_unique_authors/monthly_unique_authors) * 1.0 AS ratio,
ROW_NUMBER() OVER (PARTITION BY date_month_day ORDER BY ratio DESC) rank
FROM
(
SELECT subreddit,
date_month_day,
daily_unique_authors,
SUM(daily_unique_authors) AS monthly_unique_authors,
LEFT(date_month_day, 7) as date_month
FROM
(
SELECT subreddit,
LEFT(DATE(SEC_TO_TIMESTAMP(created_utc)), 10) as date_month_day,
COUNT(UNIQUE(author)) as daily_unique_authors
FROM TABLE_QUERY([fh-bigquery:reddit_comments], "table_id CONTAINS '20' AND LENGTH(table_id)<8")
GROUP EACH BY subreddit, date_month_day
)
GROUP EACH BY subreddit, date_month))
WHERE rank <= 100
ORDER BY date_month ASC
The final output should ideally be something like:
subreddit date_month date_month_day daily_unique_users monthly_unique_users ratio
1 google 2005-12 2005-12-29 77 600 0.128
2 google 2005-12 2005-12-31 52 600 0.087
3 google 2005-12 2005-12-28 81 600 0.135
4 google 2005-12 2005-12-27 73 600 0.121
Below is for BigQuery Standard SQL
#standardSQL
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY date_month_day ORDER BY ratio DESC) rank
FROM (
SELECT
daily.subreddit subreddit,
daily.date_month date_month,
date_month_day,
daily_unique_authors,
monthly_unique_authors,
1.0 * daily_unique_authors / monthly_unique_authors AS ratio
FROM (
SELECT subreddit,
DATE(TIMESTAMP_SECONDS(created_utc)) AS date_month_day,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_SECONDS(created_utc))) AS date_month,
COUNT(DISTINCT author) AS daily_unique_authors
FROM `fh-bigquery.reddit_comments.2018*`
GROUP BY subreddit, date_month_day, date_month
) daily
JOIN (
SELECT subreddit,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_SECONDS(created_utc))) AS date_month,
COUNT(DISTINCT author) AS monthly_unique_authors
FROM `fh-bigquery.reddit_comments.2018*`
GROUP BY subreddit, date_month
) monthly
ON daily.subreddit = monthly.subreddit
AND daily.date_month = monthly.date_month
)
)
WHERE rank <= 100
ORDER BY date_month
Note: I tried to leave the original logic and structure as much as possible as it is in the question - so OP will be able to correlate answer with question and make further adjustments if needed :o)
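If scanning the comments table twice is a concern, a variant sketch (same output shape, untested) reads it once by first de-duplicating to one row per subreddit/day/author; the monthly COUNT(DISTINCT author) stays exact because that table merely repeats an author once per active day:
#standardSQL
WITH day_author AS (
  SELECT subreddit,
    DATE(TIMESTAMP_SECONDS(created_utc)) AS date_month_day,
    author
  FROM `fh-bigquery.reddit_comments.2018*`
  GROUP BY subreddit, date_month_day, author
), daily AS (
  SELECT subreddit, date_month_day,
    FORMAT_DATE('%Y-%m', date_month_day) AS date_month,
    COUNT(*) AS daily_unique_authors
  FROM day_author
  GROUP BY subreddit, date_month_day, date_month
), monthly AS (
  SELECT subreddit,
    FORMAT_DATE('%Y-%m', date_month_day) AS date_month,
    COUNT(DISTINCT author) AS monthly_unique_authors
  FROM day_author
  GROUP BY subreddit, date_month
)
SELECT * FROM (
  SELECT daily.subreddit, daily.date_month, date_month_day,
    daily_unique_authors, monthly_unique_authors,
    1.0 * daily_unique_authors / monthly_unique_authors AS ratio,
    ROW_NUMBER() OVER (PARTITION BY date_month_day
      ORDER BY 1.0 * daily_unique_authors / monthly_unique_authors DESC) rank
  FROM daily
  JOIN monthly
  ON daily.subreddit = monthly.subreddit
  AND daily.date_month = monthly.date_month
)
WHERE rank <= 100
ORDER BY date_month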

Counting transactions on 2 different levels

I am using Google Analytics data in BigQuery, my desired output is
USERID INTERACTIONS TRANSACTIONS SCORE CHANNEL
XXX 3 1 33.33 Paid
Below is my query so far - I am getting duplicate transactions and I can't work out why. Unnesting my hits led to a high count of interactions, as every line was being counted, so I added the AND hit.isentrance IS TRUE clause, which means I can't use COUNT(DISTINCT hit.transaction.transactionid) as the entry row will never contain an order ID - instead I have to use totals.transactions, which is where I think my issues could be coming from?
SELECT UserID,
       SUM(Campaign_Interactions) AS Interactions,
       SUM(Transactions) AS Transactions,
       ROUND(SUM(Transactions)/SUM(Campaign_Interactions), 2) AS Con_Score,
       MasterChannel
FROM (
(SELECT customDimension.value AS UserID,
        visitid AS visitid1,
        trafficSource.campaign AS Campaign,
        COUNT(trafficSource.campaign) AS Campaign_Interactions,
        SUM(totals.transactions) AS Transactions,
        ROUND(MAX(totals.transactions)/COUNT(trafficSource.campaign), 2) AS Conversion_Score
FROM `xxx.ga_sessions_20*` AS m
CROSS JOIN UNNEST(m.customdimensions) AS customDimension
CROSS JOIN UNNEST (hits) AS hit
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 7 day) and
DATE_sub(current_date(), interval 1 day)
AND customDimension.index = 2
AND trafficSource.campaign IS NOT NULL
AND (customDimension.value NOT LIKE 'true' AND customDimension.value NOT LIKE 'undefined')
AND hit.isentrance IS TRUE
GROUP BY visitid1, Campaign, userID
ORDER BY Transactions DESC)
JOIN
(SELECT * FROM `xxx.7Days_VisitID_MasterChan`)
ON visitid1 = visitid)
GROUP BY UserID, MasterChannel
ORDER BY UserID
And a screenshot of the results. Note that for the ID 00004180-16f5-46e4-9caa-c6b47e03d795 (near the bottom) there should be only 1 order, but we are seeing it on each row.
It's fine for the user to have interactions across multiple channels; this is expected. Multiple transactions across multiple channels are also fine, but I can see in our CRM that this UserID has made only one order in the last 7 days, so I'd only expect to see a single transaction against the ID here.
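One common cause of exactly this symptom, offered as an assumption to check rather than a confirmed diagnosis: if `xxx.7Days_VisitID_MasterChan` contains more than one row per visitid, the join fans out and each visit's totals.transactions is summed once per duplicate row. A quick check:
#standardSQL
-- hypothetical diagnostic: list visitids with duplicate channel rows
SELECT visitid, COUNT(*) AS rows_per_visit
FROM `xxx.7Days_VisitID_MasterChan`
GROUP BY visitid
HAVING COUNT(*) > 1
ORDER BY rows_per_visit DESC
LIMIT 10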