How do I find the total number of sessions played by all users within a one-month time frame? The user_engagement event has a session_count parameter that increments on each session. The issue is that a user who plays 10 sessions will have session_count values 1 to 10. So how am I supposed to add only the maximum session count, i.e. 10 in this instance, and do the same for all users?
SELECT
SUM(session_count) AS total_sessions,
COUNT(DISTINCT user_pseudo_id) AS users
FROM
`xyz.analytics_111.events_*`
WHERE
event_name = "user_engagement" AND (_TABLE_SUFFIX BETWEEN "20200201" AND "20200229")
AND platform = "ANDROID"
Try below (BigQuery Standard SQL)
#standardSQL
SELECT
SUM(session_count) AS total_sessions,
COUNT(user_pseudo_id) AS users
FROM (
SELECT user_pseudo_id, MAX(session_count) session_count
FROM `xyz.analytics_111.events_*`
WHERE event_name = "user_engagement"
AND _TABLE_SUFFIX BETWEEN "20200201" AND "20200229"
AND platform = "ANDROID"
GROUP BY user_pseudo_id
)
I am unclear on what your data looks like. If there is one row per session, then you can simply use:
SELECT COUNT(*) AS total_sessions,
COUNT(DISTINCT user_pseudo_id) AS users
. . .
If you can have multiple events per session, you can use a hacky approach:
SELECT COUNT(DISTINCT CONCAT(user_pseudo_id, ':', CAST(session_count as string)))
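For completeness, a sketch of how that expression might slot into the query from the question (the table name, date range, and platform filter are simply copied from the original query above):
#standardSQL
-- counts distinct (user, session_count) pairs, i.e. one per observed session
SELECT
COUNT(DISTINCT CONCAT(user_pseudo_id, ':', CAST(session_count AS STRING))) AS total_sessions,
COUNT(DISTINCT user_pseudo_id) AS users
FROM
`xyz.analytics_111.events_*`
WHERE
event_name = "user_engagement"
AND _TABLE_SUFFIX BETWEEN "20200201" AND "20200229"
AND platform = "ANDROID"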
I offer this because, in a complex query, it is sometimes simpler to just tweak a single expression. Otherwise, Mikhail's solution is reasonable.
However, I would suggest window functions instead:
SELECT SUM(CASE WHEN seqnum = 1 THEN session_count END) AS total_sessions,
COUNT(DISTINCT user_pseudo_id) AS users
FROM (SELECT e.*,
ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY session_count DESC) as seqnum
FROM `xyz.analytics_111.events_*` e
WHERE e.event_name = 'user_engagement' AND
_TABLE_SUFFIX BETWEEN '20200201' AND '20200229' AND
platform = 'ANDROID'
) e;
The reason I recommend this is that you can keep the rest of the calculations without changing them. That is handy in a complex query.
I have the following schema of a data model (I only have the schema, not the actual tables) on BigQuery with standard SQL.
I have created this query to select the top 10 users that generated the most revenue in the last three months in the Love game:
SELECT
users.user_id,
SUM(pay.amount) AS total_rev
FROM
`my-database.User` AS users
INNER JOIN
`my-database.IAP_events` AS pay
ON
users.User_id = pay.User_id
INNER JOIN
`my-database.Games` AS games
ON
users.Game_id = games.Game_id
WHERE
games.game_name = "Love"
GROUP BY
users.user_id
ORDER BY
total_rev DESC
LIMIT
10
But then the exercise says to only consider users that played on 10 different days in the last 3 months. I understand I would use a subquery with a count over the dates, but I am a little lost on how to do it...
Thanks a lot!
EDIT: You need to count distinct dates, not transactions, so in the QUALIFY clause you'll need COUNT(DISTINCT date_) OVER ... instead of COUNT(transaction_id) OVER .... The code below has already been fixed.
As far as I understood, you need to count the distinct transaction_id inside IAP_Events over a window of the previous 3 months, check that the count is greater than 10, and then sum the amounts of all the users that meet that constraint.
To do so, you can use BigQuery's analytic functions, aka window functions:
with window_counting as (
select
user_id,
amount
from
iap_events
where
date_ >= date_sub(current_date(), interval 3 month)
qualify
count(distinct date_) over (partition by user_id) > 10
),
final as (
select
user_id,
sum(amount)
from
window_counting
group by
1
order by
2 desc
limit 10
)
select * from final
You will just need to add the needed joins inside the first CTE in order to filter by game_name :)
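For example, the first CTE with those joins added might look roughly like this (just a sketch: it reuses the table names and join keys from the question's query and assumes the purchase date column in IAP_events is called date_, as in the query above):
with window_counting as (
select
users.user_id,
pay.amount
from
`my-database.IAP_events` as pay
inner join
`my-database.User` as users
on
users.user_id = pay.user_id
inner join
`my-database.Games` as games
on
users.game_id = games.game_id
where
games.game_name = "Love"
and pay.date_ >= date_sub(current_date(), interval 3 month)
qualify
count(distinct pay.date_) over (partition by users.user_id) > 10
)
-- the final aggregation stays the same as above
select
user_id,
sum(amount) as total_rev
from
window_counting
group by
1
order by
2 desc
limit 10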
I have a table of events which has:
user_id
event_name
event_time
There are event names of types: meeting_started, meeting_ended, email_sent
I want to create a query that counts the number of times an email has been sent during a meeting.
UPDATE: I'm using Google BigQuery.
Example query:
SELECT
event_name,
count(distinct user_id) users
FROM
events_table
WHERE
event_name IN ('meeting_started', 'meeting_ended')
group by 1
How can I achieve that?
Thanks!
You can do this in BigQuery using last_value():
Presumably, an email is sent during a meeting if the most recent "meeting" event is 'meeting_started'. So, you can solve this by getting the most recent meeting event for each event and then filtering:
select et.*
from (select et.*,
last_value(case when event_name in ('meeting_started', 'meeting_ended') then event_name end ignore nulls) over
(partition by user_id order by event_time) as last_meeting_event
from events_table et
) et
where event_name = 'email_sent' and last_meeting_event = 'meeting_started'
This reads like some kind of gaps-and-islands problem, where an island is a meeting, and you want the emails that belong to islands.
How do we define an island? Assuming that meeting starts and ends properly interleave, we can just compare the count of starts and ends on a per-user basis. If there are more starts than ends, then a meeting is in progress. Using this logic, you can get all emails that were sent during a meeting like so:
select *
from (
select e.*,
countif(event_name = 'meeting_started') over(partition by user_id order by event_time) as cnt_started,
countif(event_name = 'meeting_ended' ) over(partition by user_id order by event_time) as cnt_ended
from events_table e
) e
where event_name = 'email_sent' and cnt_started > cnt_ended
It is unclear where you want to go from here. If you want the count of such emails, just use select count(*) instead of select * in the outer query.
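For example, to get just the count with the second approach, only the outer select changes (same table and column names as above):
select count(*) as emails_sent_during_meetings
from (
select e.*,
countif(event_name = 'meeting_started') over(partition by user_id order by event_time) as cnt_started,
countif(event_name = 'meeting_ended'  ) over(partition by user_id order by event_time) as cnt_ended
from events_table e
) e
where event_name = 'email_sent' and cnt_started > cnt_ended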
So I'm currently on a project where I'm working with multiple sources, and one of them is SAP data.
I need to return "duplicates", in essence: find all the different users that are linked to the same SAP User ID. There are valid entries, however, since the data describes access roles to the different SAP systems, so it is normal for the same user to occur more than once. What I need to find is where a different name is assigned to the same User ID.
This is what I currently have:
select *
from (
select *,
row_number() over (partition by FULL_NAME order by USER_ID) as row_number
from SAP_TABLE
) as rows order by USER_ID desc
Any help would be appreciated. Thanks!
You would partition by the user_id
select *
from (
select *,
count(distinct full_name) over (partition by user_id) as rnk
from SAP_TABLE
) as rows
where rnk>1
order by USER_ID desc
Are you looking for this?
select count(distinct FULL_NAME),
USER_ID
from SAP_TABLE
group by USER_ID
having count(distinct FULL_NAME) > 1
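If you also want to see the conflicting rows themselves rather than just the IDs, one possible follow-up (a sketch built on the same query) is to join that result back to the table:
select s.*
from SAP_TABLE s
inner join (
select USER_ID
from SAP_TABLE
group by USER_ID
having count(distinct FULL_NAME) > 1
) d
on s.USER_ID = d.USER_ID
order by s.USER_ID desc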
You can use this piece of code to find all the user IDs that occur more than once.
SELECT USER_ID, COUNT(*)
FROM SAP_TABLE
GROUP BY 1
HAVING COUNT(*) > 1;
I want to remove all customer hits that I see on my site after they have registered. However, not all customers register on the same day, so I cannot simply filter on a specific date. I have a registration indicator of 1 or 0 and a hit timestamp, along with unique identifiers for the specific customers. I have tried this:
rank() over (partition by customer_id, registration_ind order by hit_timestamp asc) rnk
However, this still partitions by customer and isn't working for what I want.
Any help please?
Thanks!
Is this what you want?
select t.*
from (select t.*,
min(case when registration_ind = 1 then hit_timestamp end) over (partition by customer_id) as registration_timestamp
from t
) t
where registration_timestamp is null or
hit_timestamp < registration_timestamp;
It returns all rows before the first registration timestamp.
I'm trying to write a query based on accounts and their contracts. The table has all contracts for each account, whether the contract is active, expired, etc. I want the query to only bring back the contract with the earliest start date per account, so only one row for each account. However, I don't know the status of the earliest contract for each account: some might be active, some might be pending. I now run into the problem where it brings back multiple records for each account if the contract status is in the list I specify. Simple sample code below:
Select t.account, t.contract, t.status, Min(t.start_date)
From table t
where t.status in ('Active','Countersigned','Pending')
Group by t.account, t.contract, t.status
If your database supports them (e.g. Oracle, Postgres, SQL Server, but not older versions of MySQL or SQLite), you can use window functions. For instance, you can rank your contracts within each account by starting_at:
SELECT *, rank() OVER (PARTITION BY account_id ORDER BY starting_at ASC) AS rank
FROM contracts
Then you can use that in a subquery to join to accounts and only take contracts with a rank of 1. You'll need to put it in a subquery, because unfortunately (in Postgres at least) you
can't use window functions inside WHERE. So this won't work:
SELECT *, rank() OVER (PARTITION BY account_id ORDER BY starting_at ASC) AS rank
FROM contracts
WHERE rank = 1
but this will:
SELECT *
FROM (SELECT *, rank() OVER (PARTITION BY account_id ORDER BY starting_at ASC) AS rank
FROM contracts) x
WHERE rank = 1
Note you can easily add filtering by status, etc. to any of these queries.
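For instance, a sketch with the status list from the question applied and the ranked contracts joined back to accounts (the accounts table name and the account_id join key are assumptions here; adjust them to your schema):
SELECT a.*, c.*
FROM (SELECT *, rank() OVER (PARTITION BY account_id ORDER BY starting_at ASC) AS rank
      FROM contracts
      WHERE status IN ('Active', 'Countersigned', 'Pending')) c
JOIN accounts a ON a.account_id = c.account_id
WHERE c.rank = 1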
This should work:
select account, contract, status, MinDate
from
(
Select t.account, t.contract, t.status, t.start_date,
Min(t.start_date) over(partition by t.account) MinDate
From table t
where t.status in ('Active','Countersigned','Pending')
) x
where start_date=MinDate
A solution that works if you don't have multiple contracts for an account on the same MIN(date) (in that case you'd get multiple rows for that account, and you'd have to decide which of those N contracts you want to see; I can't decide that for you):
SELECT t.*
FROM (
Select t.account, Min(t.start_date) AS MinDate
From table t
where t.status in ('Active','Countersigned','Pending')
GROUP BY t.account
) AS t2
INNER JOIN table t ON t.account = t2.account AND t.start_date = t2.MinDate
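One caveat with this join-back approach: if an account happens to have another contract with a different status but the same earliest start date, the join would return it too. If that matters, repeat the status filter on the joined table; a sketch of that variant:
SELECT t.*
FROM (
Select t.account, Min(t.start_date) AS MinDate
From table t
where t.status in ('Active','Countersigned','Pending')
GROUP BY t.account
) AS t2
INNER JOIN table t
ON t.account = t2.account
AND t.start_date = t2.MinDate
AND t.status in ('Active','Countersigned','Pending')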