Counting transactions at two different levels - SQL

I am using Google Analytics data in BigQuery. My desired output is:
USERID  INTERACTIONS  TRANSACTIONS  SCORE  CHANNEL
XXX     3             1             33.33  Paid
Below is my query so far. I am getting duplicate transactions and I can't work out why. Unnesting my hits led to a high count of interactions, as every line was being counted, so I added the AND hit.isentrance IS TRUE clause. That means I can't use COUNT(DISTINCT hit.transaction.transactionid), as the entry row will never contain an order ID; instead I have to use totals.transactions, which is where I think my issues could be coming from.
SELECT
  UserID,
  SUM(Campaign_Interactions) AS Interactions,
  SUM(Transactions) AS Transactions,
  ROUND(SUM(Transactions) / SUM(Campaign_Interactions), 2) AS Con_Score,
  MasterChannel
FROM (
  (SELECT
     customDimension.value AS UserID,
     visitid AS visitid1,
     trafficSource.campaign AS Campaign,
     COUNT(trafficSource.campaign) AS Campaign_Interactions,
     SUM(totals.transactions) AS Transactions,
     ROUND(MAX(totals.transactions) / COUNT(trafficSource.campaign), 2) AS Conversion_Score
   FROM `xxx.ga_sessions_20*` AS m
   CROSS JOIN UNNEST(m.customdimensions) AS customDimension
   CROSS JOIN UNNEST(hits) AS hit
   WHERE PARSE_DATE('%y%m%d', _table_suffix) BETWEEN
         DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND
         DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
     AND customDimension.index = 2
     AND trafficSource.campaign IS NOT NULL
     AND (customDimension.value NOT LIKE 'true' AND customDimension.value NOT LIKE 'undefined')
     AND hit.isentrance IS TRUE
   GROUP BY visitid1, Campaign, UserID
   ORDER BY Transactions DESC)
  JOIN
  (SELECT * FROM `xxx.7Days_VisitID_MasterChan`)
  ON visitid1 = visitid
)
GROUP BY UserID, MasterChannel
ORDER BY UserID
And a screenshot of the results. Note that for the ID 00004180-16f5-46e4-9caa-c6b47e03d795 (near the bottom) there should be only one order, but we are seeing it on each row.
It's fine for the user to have interactions across multiple channels; this is expected. Multiple transactions across multiple channels are also fine, but I can see in our CRM that this UserID has made only one order in the last 7 days, so I'd only expect to see a single transaction against the ID here.
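One way to remove the duplication (a sketch only, untested, assuming the duplicates come from totals.transactions being a session-level value that repeats on every row) is to resolve transactions once per visit with MAX() before joining on the channel table. Note this only helps if the duplication happens before the channel join; if `xxx.7Days_VisitID_MasterChan` can return several rows per visitid, that side needs de-duplicating as well:
-- Sketch: take the session-level totals.transactions once per visit with
-- MAX() rather than summing it across unnested hit rows.
SELECT
  UserID,
  SUM(Campaign_Interactions) AS Interactions,
  SUM(Transactions) AS Transactions,
  MasterChannel
FROM (
  SELECT
    customDimension.value AS UserID,
    visitid AS visitid1,
    COUNT(trafficSource.campaign) AS Campaign_Interactions,
    MAX(totals.transactions) AS Transactions  -- once per visit, not per hit
  FROM `xxx.ga_sessions_20*` AS m
  CROSS JOIN UNNEST(m.customdimensions) AS customDimension
  CROSS JOIN UNNEST(hits) AS hit
  WHERE customDimension.index = 2
    AND hit.isentrance IS TRUE
  GROUP BY UserID, visitid1
) AS s
JOIN `xxx.7Days_VisitID_MasterChan` AS mc ON s.visitid1 = mc.visitid
GROUP BY UserID, MasterChannel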

Calculate rolling year totals in SQL

I am gathering something that is essentially an "enrollment date" for users. The "enrollment date" is not stored in the database (for a reason too long to explain here), so I have to deduce it from the data. I then want to reuse this CTE in numerous places throughout another query to gather values such as "total orders 1 year before enrollment" and "total orders 1 year after enrollment".
I haven't gotten this code to run, as it's much more complex in my actual data set (this code is paraphrased from the actual code) and I have a feeling it's not the best way to do this. As you can see, my date conditionals are mostly just placeholders, but I think it should be obvious what I am trying to do.
That said, I think this would mostly work. My question is: is there a better way to do this? Additionally, could I combine the rolling year before and rolling year after into one table somehow (maybe with window functions)? This is part of a much bigger query, so the more consolidation I can do, the better.
For what it's worth, the subquery to derive the "enrollment date" is also more complex than shown here.
WITH enroll AS (
  SELECT
    user_id,
    MIN(date) AS e_date
  FROM `orders`
  WHERE subscribed = TRUE
  GROUP BY user_id
)
SELECT *
FROM users
LEFT JOIN (
  SELECT
    o.user_id,
    SUM(o.total_paid) AS total_paid_after
  FROM orders o
  JOIN enroll e ON e.user_id = o.user_id
  WHERE o.date > e.e_date
    AND o.date < DATE_ADD(e.e_date, INTERVAL 365 DAY)
    AND o.order_type = 'special'
  GROUP BY o.user_id
) AS rolling_year_after ON rolling_year_after.user_id = users.user_id
LEFT JOIN (
  SELECT
    o.user_id,
    SUM(o.total_paid) AS total_paid_before
  FROM orders o
  JOIN enroll e ON e.user_id = o.user_id
  WHERE o.date < e.e_date
    AND o.date > DATE_SUB(e.e_date, INTERVAL 365 DAY)
    AND o.order_type = 'special'
  GROUP BY o.user_id
) AS rolling_year_before ON rolling_year_before.user_id = users.user_id
Maybe something like this; I'm not sure if it's more performant, but it looks a bit cleaner:
WITH enroll AS (
  SELECT
    user_id,
    MIN(date) AS e_date
  FROM `orders`
  WHERE subscribed = TRUE
  GROUP BY user_id
),
rolling_year AS (
  SELECT
    o.user_id,
    SUM(CASE WHEN o.date BETWEEN e.e_date AND DATE_ADD(e.e_date, INTERVAL 365 DAY)
             THEN o.total_paid ELSE 0 END) AS rolling_year_after,
    SUM(CASE WHEN o.date BETWEEN DATE_SUB(e.e_date, INTERVAL 365 DAY) AND e.e_date
             THEN o.total_paid ELSE 0 END) AS rolling_year_before
  FROM orders o
  LEFT JOIN enroll e ON o.user_id = e.user_id
  WHERE o.order_type = 'special'
  GROUP BY o.user_id
)
SELECT *
FROM users
LEFT JOIN rolling_year ON users.user_id = rolling_year.user_id

Using OFFSET instead of UNNEST for nested fields in Google BigQuery

A quick question for GBQ gurus.
Here are two queries that are identical in purpose.
The first:
SELECT
  fullVisitorId AS userid,
  CONCAT(fullVisitorId, visitStartTime) AS session,
  visitStartTime + (hits[OFFSET(0)].time / 1000) AS eventtime,
  date,
  trafficSource.campaign,
  trafficSource.source,
  trafficSource.medium,
  trafficSource.adContent,
  trafficSource.adwordsClickInfo.campaignId,
  geoNetwork.region,
  geoNetwork.city,
  trafficSource.keyword,
  totals.visits AS visits,
  device.deviceCategory AS deviceType,
  hits[OFFSET(0)].eventInfo.eventAction,
  hits[OFFSET(0)].TRANSACTION.transactionId,
  hits[OFFSET(0)].TRANSACTION.transactionRevenue,
  SUBSTR(channelGrouping, 0, 3) AS newchannelGrouping
FROM
  `some_site.ga_sessions_*`
WHERE
  ARRAY_LENGTH(hits) > 0
  AND _table_suffix BETWEEN '20200201' AND '20200201'
  AND fullVisitorId IN (
    SELECT DISTINCT fullVisitorId
    FROM `some_site.ga_sessions_*`,
      UNNEST(hits) AS hits
    WHERE _table_suffix BETWEEN '20200201' AND '20200201'
      AND hits.TRANSACTION.transactionId != 'None'
  )
The second:
SELECT
  fullVisitorId AS userid,
  CONCAT(fullVisitorId, visitStartTime) AS session,
  visitStartTime + (hits.time / 1000) AS eventtime,
  date,
  trafficSource.campaign,
  trafficSource.source,
  trafficSource.medium,
  trafficSource.adContent,
  trafficSource.adwordsClickInfo.campaignId,
  geoNetwork.region,
  geoNetwork.city,
  trafficSource.keyword,
  totals.visits AS visits,
  device.deviceCategory AS deviceType,
  hits.eventInfo.eventAction,
  hits.TRANSACTION.transactionId,
  hits.TRANSACTION.transactionRevenue,
  SUBSTR(channelGrouping, 0, 3) AS newchannelGrouping
FROM
  `some_site.ga_sessions_*`, UNNEST(hits) AS hits
WHERE
  _table_suffix BETWEEN '20200201' AND '20200201'
  AND fullVisitorId IN (
    SELECT DISTINCT fullVisitorId
    FROM `some_site.ga_sessions_*`,
      UNNEST(hits) AS hits
    WHERE _table_suffix BETWEEN '20200201' AND '20200201'
      AND hits.TRANSACTION.transactionId != 'None'
  )
The first one uses OFFSET to extract data from the nested fields. According to the execution details report, the query requires about 1.5 MB of shuffling.
The second query uses UNNEST to reach the nested data, and the amount of shuffled bytes is around (!) 75 MB.
The amount of processed data is the same in both cases.
Now, the question is:
Does that mean that, according to this article concerning optimizing communication between slots, I should use OFFSET instead of UNNEST to get the data stored in nested fields?
Thanks!
Let's consider the following examples, using a BigQuery public dataset.
UNNEST - returns 6 results:
WITH t AS (
  SELECT *
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
  WHERE visitId = 1501571504
)
SELECT h FROM t, UNNEST(hits) h
OFFSET - returns 1 result:
WITH t AS (
  SELECT *
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
  WHERE visitId = 1501571504
)
SELECT hits[OFFSET(0)] FROM t
Both queries reference the same record inside the GA public table. They show that a join with UNNEST brings one row per element of the array, while OFFSET(0) brings only one row containing the first element of the array.
The reason for the large difference in data shuffling is that UNNEST performs a JOIN operation, which requires the data to be organized in a specific way, whereas the OFFSET approach takes only the first element of the array.
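As an aside (an addition of mine, not part of the original answer): when only the first array element is needed but some sessions may have an empty hits array, SAFE_OFFSET can be used instead of OFFSET. It returns NULL rather than raising an error, which makes a guard such as ARRAY_LENGTH(hits) > 0 unnecessary:
-- Sketch: SAFE_OFFSET yields NULL instead of an error for empty arrays.
SELECT
  fullVisitorId,
  hits[SAFE_OFFSET(0)].eventInfo.eventAction AS first_action
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`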

Determine clusters of access times within 10-minute intervals per user per day in SQL Server

How can I write a SQL query that, from the sample data, groups or clusters the access_time values per user per day within 10-minute intervals?
This is a complete guess, based on reading between the lines, and is untested due to a lack of consumable sample data (see the hypothetical setup after the query below).
It looks, however, like you are after a triangular JOIN (these can perform poorly, especially as this won't be SARGable) and a DENSE_RANK:
SELECT YT.[date],
       YT.User_ID,
       YT2.AccessTime,
       DENSE_RANK() OVER (PARTITION BY YT.[date], YT.User_ID ORDER BY YT.AccessTime) AS Cluster
FROM dbo.YourTable YT
JOIN dbo.YourTable YT2 ON YT.[date] = YT2.[date]
                      AND YT.User_ID = YT2.User_ID
                      AND YT.AccessTime <= YT2.AccessTime --This will join the row to itself
                      AND DATEADD(MINUTE, 10, YT.AccessTime) >= YT2.AccessTime; --That is intentional
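For anyone who wants to try either answer in this section, here is a minimal, hypothetical sample-data setup (table and column names are assumed from the answers; the values are copied from the output shown further below):
-- Hypothetical sample data; the second answer's ACCESS_TIMES table would be
-- populated with the same rows (columns date, user_id, access_time).
CREATE TABLE dbo.YourTable ([date] date, User_ID varchar(10), AccessTime datetime);
INSERT INTO dbo.YourTable ([date], User_ID, AccessTime) VALUES
  ('2020-09-19', 'AA083P', '2020-09-19 18:15:00'),
  ('2020-09-19', 'AA083P', '2020-09-19 18:22:00'),
  ('2020-09-19', 'AA083P', '2020-09-19 18:28:00'),
  ('2020-09-20', 'AB162Y', '2020-09-20 19:34:00'),
  ('2020-09-20', 'AB162Y', '2020-09-20 19:37:00');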
If I have understood your problem, you want to group all accesses for a user in a day where every access in the group falls within a 10-minute interval, not counting single accesses; an access more than 10 minutes away from every other is not counted as a cluster.
You can identify the clusters by joining the accesses table with itself to get all possible 10-minute time intervals, and numbering them.
Finally, simply rejoin the access table to get the accesses for each cluster:
;with user_clusters as (
    select a1.date,
           a1.user_id,
           a1.access_time as cluster_start,
           a2.access_time as cluster_end,
           ROW_NUMBER() over (partition by a1.date, a1.user_id order by a1.access_time) as user_cluster_id
    from ACCESS_TIMES a1
    join ACCESS_TIMES a2 on a1.date = a2.date
                        and a1.user_id = a2.user_id
                        and a1.access_time < a2.access_time
                        and datediff(minute, a1.access_time, a2.access_time) < 10
)
select *
from user_clusters c
join ACCESS_TIMES a on a.date = c.date
                   and a.user_id = c.user_id
                   and a.access_time between c.cluster_start and c.cluster_end
order by a.date, a.user_id, c.user_cluster_id, a.access_time
output:
date          user_id   access_time            user_cluster_id
'2020-09-19'  'AA083P'  '2020-09-19 18:15:00'  1
'2020-09-19'  'AA083P'  '2020-09-19 18:22:00'  1
'2020-09-19'  'AA083P'  '2020-09-19 18:22:00'  2
'2020-09-19'  'AA083P'  '2020-09-19 18:28:00'  2
'2020-09-20'  'AB162Y'  '2020-09-20 19:34:00'  1
'2020-09-20'  'AB162Y'  '2020-09-20 19:37:00'  1

Calculating last order date by UserID from GA data

I would like to calculate the last order date of an individual by their UserID; my UserID is derived from a custom dimension in the automatically imported Google Analytics data.
I'm not sure how to go about this. I'm quite new to SQL; I think I might be looking for a window function, but I'm not entirely sure!
Here is my code so far, but this returns the most recent order date against ALL IDs:
SELECT * FROM (
  SELECT MAX(date) AS lastorddate, customDimension.value AS UserID
  FROM `PROJECTNAME.ga_sessions_20*` AS t
  CROSS JOIN UNNEST(t.customdimensions) AS customDimension
  WHERE customDimension.index = 2
    AND totals.transactions > 0
  GROUP BY date, UserID
)
GROUP BY UserID, lastorddate
ORDER BY lastorddate DESC
LIMIT 500
Below should work:
#standardSQL
SELECT MAX(date) AS lastorddate, customDimension.value AS UserID
FROM `PROJECTNAME.ga_sessions_20*` AS t
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE customDimension.index = 2
AND totals.transactions > 0
GROUP BY UserID
ORDER BY lastorddate DESC
LIMIT 500
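Since the question mentions window functions, here is an equivalent sketch of mine (not part of the original answer) using an analytic MAX; DISTINCT then collapses the result to one row per user:
-- Sketch: MAX(date) OVER (PARTITION BY UserID) attaches the latest order date
-- to every qualifying row; DISTINCT reduces this to one row per UserID.
SELECT DISTINCT
  customDimension.value AS UserID,
  MAX(date) OVER (PARTITION BY customDimension.value) AS lastorddate
FROM `PROJECTNAME.ga_sessions_20*` AS t
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE customDimension.index = 2
  AND totals.transactions > 0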

How to get the maximum interim value of a parameter in a SELECT statement in SQL Server?

Example:
I have a table userconnection that contains the login and logout time as below:
action  time                 user
Login   2013-24-11 13:00:00  a
Login   2013-24-11 13:30:00  b
Login   2013-24-11 14:00:00  c
Logout  2013-24-11 14:10:00  b
...
Can anyone help me with the query below to show the maximum number of concurrent users at any time during the day (= 3 from the above example set) and the number of concurrent users at the current time of day (= 2 from the above example set)?
select DATEADD(day, 0, DATEDIFF(day, 0, time)) as calendarday,
       sum(case when action = 'Login' then 1
                when action = 'Logout' then -1
                else 0 end) as concurrentuser,
       max of (concurrentuser interim values) as maxconcurrentuser -- placeholder for the part I'm asking about
from userconnection
where time > sysdate - 1
group by DATEADD(day, 0, DATEDIFF(day, 0, time))
order by calendarday
I would much appreciate any help with how to get the max of (concurrentuser interim values) as maxconcurrentuser in the above query, without using user-defined functions etc. - just inline queries.
I think that this will work, but obviously you've only given us minimal sample data to work from:
;with PairedEvents as (
    select a.[user],
           a.time as timeIn,
           b.time as timeOut
    from userconnection a
    left join userconnection b
        on a.[user] = b.[user]
       and a.time < b.time
       and b.action = 'logout'
    left join userconnection b_anti
        on a.[user] = b_anti.[user]
       and a.time < b_anti.time
       and b_anti.time < b.time
       and b_anti.action = 'logout'
    where a.action = 'Login'
      and b_anti.action is null
), PossibleMaxima as (
    select pe.timeIn,
           COUNT(*) as Cnt
    from PairedEvents pe
    inner join PairedEvents pe_all
        on pe_all.timeIn <= pe.timeIn
       and (pe_all.timeOut > pe.timeIn or pe_all.timeOut is null)
    group by pe.timeIn
), Ranked as (
    select *,
           RANK() OVER (ORDER BY Cnt desc) as rnk
    from PossibleMaxima
)
select * from Ranked where rnk = 1
This assumes that all login events can be paired with logout events, and that you don't have stray extras (a logout without a login, or two logins in a row without a logout).
It works by generating three CTEs. The first, PairedEvents, associates the login rows with their corresponding logout rows (and relies on the above assumption).
Then, in PossibleMaxima, we take each login event and try to find any PairedEvents rows that overlap that time. The number of times that join succeeds is the number of users who were concurrently online.
Finally, we have the Ranked CTE, which gives the maximum value the rank of 1. If multiple periods achieve the maximum, they will each be ranked 1 and returned in the final result set.
If it's possible for multiple users to have identical login times, then a slight tweak to PossibleMaxima may be required - but only if that case actually occurs.
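For completeness, a more compact alternative of my own (a sketch, not the original answer; assumes SQL Server 2012+ for the windowed SUM, plus the same assumption that logins and logouts pair up cleanly): treat each login as +1 and each logout as -1, accumulate in time order, and take the maximum of the running total:
-- Sketch: the running total at each event is the number of users online at
-- that moment, so its maximum is the peak concurrency for the data set.
SELECT MAX(concurrent) AS maxconcurrentuser
FROM (
    SELECT SUM(CASE WHEN action = 'Login' THEN 1 ELSE -1 END)
               OVER (ORDER BY time ROWS UNBOUNDED PRECEDING) AS concurrent
    FROM userconnection
) AS running;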