I'd like to retrieve the sum of visits that have a custom dimension hit within the visit, split by date.
The query below gives me the sum across all selected dates, but how do I get it split by date?
Many thanks in advance!
select sum(sessions) as total_sessions, from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (TABLE_DATE_RANGE([XXX.ga_sessions_], TIMESTAMP('2016-09-01'), TIMESTAMP('2016-09-03')))
where totals.visits = 1
AND hits.customDimensions.index = 3
AND hits.customDimensions.value = 'play'
group each by fullvisitorid
)
ga_sessions tables have a date field (see the Analytics to BigQuery Export schema).
So, if you want to stay with BigQuery Legacy SQL for the query above, you can use this date field, as in the example below:
SELECT date, SUM(sessions) AS total_sessions FROM (
SELECT
date,
fullvisitorid,
COUNT(DISTINCT visitid) AS sessions
FROM (TABLE_DATE_RANGE([XXX.ga_sessions_], TIMESTAMP('2016-09-01'), TIMESTAMP('2016-09-03')))
WHERE totals.visits = 1
AND hits.customDimensions.index = 3
AND hits.customDimensions.value = 'play'
GROUP BY date, fullvisitorid
)
GROUP BY date
If you can (or want to) migrate from BigQuery Legacy SQL to BigQuery Standard SQL, you can use the example below:
SELECT
_TABLE_SUFFIX AS date,
COUNTIF(EXISTS (SELECT 1 FROM UNNEST(hits), UNNEST(customDimensions)
WHERE index = 3 AND value = 'play')) AS sessions
FROM `XXX.ga_sessions_*`
WHERE totals.visits = 1
AND _TABLE_SUFFIX BETWEEN '20160901' AND '20160903'
GROUP BY date
See more details about using Wildcard Tables
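To make the COUNTIF(EXISTS ...) logic easier to follow, here is a minimal Python sketch of the same idea on in-memory data. The sample rows, the field layout and the helper names (has_play_hit, sessions_per_date) are assumptions made up for illustration; they only mimic the shape of the export and are not part of BigQuery.

```python
from collections import Counter

# Hypothetical sample rows shaped like the GA export: one dict per session.
sessions = [
    {"date": "20160901", "visits": 1,
     "hits": [{"customDimensions": [{"index": 3, "value": "play"}]},
              {"customDimensions": [{"index": 3, "value": "play"}]}]},
    {"date": "20160901", "visits": 1,
     "hits": [{"customDimensions": [{"index": 1, "value": "play"}]}]},
    {"date": "20160902", "visits": 1,
     "hits": [{"customDimensions": [{"index": 3, "value": "play"}]}]},
    {"date": "20160902", "visits": None,
     "hits": [{"customDimensions": [{"index": 3, "value": "play"}]}]},
]


def has_play_hit(session):
    """Mirror of EXISTS(...): true if any hit carries index 3 with value 'play'."""
    return any(
        cd["index"] == 3 and cd["value"] == "play"
        for hit in session["hits"]
        for cd in hit["customDimensions"]
    )


def sessions_per_date(rows):
    """Mirror of COUNTIF(...) GROUP BY date, restricted to visits = 1."""
    counts = Counter()
    for row in rows:
        if row["visits"] == 1:
            # A session counts once, no matter how many hits match.
            counts[row["date"]] += int(has_play_hit(row))
    return dict(counts)


result = sessions_per_date(sessions)
print(result)
```

The first session has two matching hits but is still counted once, which is the point of testing for existence per session instead of flattening the hits.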
Can you try this with your table using standard SQL (uncheck "Use Legacy SQL" under "Show Options")? I may have misunderstood the question, but it computes the total number of visits for each day matching the condition on customDimensions, which I believe is what you want.
SELECT
_PARTITIONTIME,
COUNTIF(EXISTS (SELECT 1 FROM UNNEST(hits), UNNEST(customDimensions)
WHERE index = 3 AND value = 'play')) as sessions
FROM `XXX.ga_sessions_*`
WHERE totals.visits = 1
GROUP BY _PARTITIONTIME;
I have written the sql query:
SELECT id,
date_diff("day", create_date, date) as day,
action_type
FROM "my_database"
It brings this:
id day action_type
1 0 upload
1 0 upload
1 0 upload
1 1 upload
1 1 upload
2 0 upload
2 0 upload
2 1 upload
How can I change my query to get a table with the unique days in column day and the average number of "upload" actions per id? The desired result should look like this:
day avg_num_action
0 2.5
1 1.5
It is 2.5 because (3+2)/2 (3 uploads for id 1 and 2 uploads for id 2); likewise for 1.5.
Please try this. It treats your given query as a derived table. If you need a WHERE condition, keep the WHERE clause; otherwise remove it.
SELECT t.day
, COUNT(*) / COUNT(DISTINCT t.id) avg_num_action
FROM (SELECT id,
date_diff("day", create_date, date) as day,
action_type
FROM "my_database") t
WHERE t.action_type = 'upload'
GROUP BY t.day
Create a table from your given result set and write query based on that.
SELECT t.tday
, COUNT(*) / COUNT(DISTINCT t.id) avg_num_action
FROM my_database t
GROUP BY t.tday
Please check from url https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=871935ea2b919c4e24eb83fcbce78973
Update: I think my two-step approach is more complicated than needed. Rahul Biswas shows how this can be done in one step. I suggest you use and accept his answer.
Original answer:
Two steps:
Count entries per ID and day
Take the average count per day
The query:
with rows as (select id, date_diff('day', create_date, date) as day from mytable)
, per_id_and_day as (select id, day, count(*) as cnt from rows group by id, day)
select day, avg(cnt)
from per_id_and_day
group by day
order by day;
You don't need a subquery for this logic:
SELECT date_diff("day", create_date, date) as day,
COUNT(*) * 1.0 / COUNT(DISTINCT id)
FROM "my_database"
GROUP BY date_diff("day", create_date, date)
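As a quick check that the one-step and two-step approaches agree, here is a small sketch using Python's built-in sqlite3 module. The table name uploads and its precomputed day column are assumptions for illustration (SQLite has no date_diff, so the sample rows from the question are loaded directly).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the result set shown in the question.
conn.execute("CREATE TABLE uploads (id INTEGER, day INTEGER, action_type TEXT)")
sample = [(1, 0), (1, 0), (1, 0), (1, 1), (1, 1), (2, 0), (2, 0), (2, 1)]
conn.executemany("INSERT INTO uploads VALUES (?, ?, 'upload')", sample)

# One step: total uploads per day divided by distinct ids per day.
# The "* 1.0" forces non-integer division.
one_step = conn.execute("""
    SELECT day, COUNT(*) * 1.0 / COUNT(DISTINCT id) AS avg_num_action
    FROM uploads
    WHERE action_type = 'upload'
    GROUP BY day
    ORDER BY day
""").fetchall()

# Two steps: count per id and day first, then average those counts per day.
two_step = conn.execute("""
    WITH per_id_and_day AS (
        SELECT id, day, COUNT(*) AS cnt
        FROM uploads
        GROUP BY id, day
    )
    SELECT day, AVG(cnt) FROM per_id_and_day GROUP BY day ORDER BY day
""").fetchall()

print(one_step)
print(two_step)
```

Both queries return the same rows here because the number of distinct ids on a day equals the number of (id, day) groups for that day.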
I am trying to write a query to find month over month percent change in user registration.
Users table has the logs for user registrations
user_id - pk, integer
created_at - account created date, varchar
activated_at - account activated date, varchar
state - active or pending, varchar
I found the number of users for each year and month. How do I find month over month percent change in user registration? I think I need a window function?
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count(distinct user_id) as number_of_registration
FROM users
GROUP BY 1,2
ORDER BY 1,2
This is the output of above query:
Then I wrote this to find the difference in user registration in the previous year.
SELECT
*
,number_of_registration - lag(number_of_registration) over (partition by created_month) as difference_in_previous_year
FROM (
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count( user_id) as number_of_registration
FROM users as u
GROUP BY 1,2
ORDER BY 1,2) as temp
The output is this:
You want an order by clause that contains created_year.
number_of_registration
- lag(number_of_registration) over (partition by created_month order by created_year) as difference_in_previous_year
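The effect of the ORDER BY inside the window can be tried out on a tiny data set. The sketch below uses Python's sqlite3 module and assumes a SQLite build with window function support (3.25 or newer); the monthly table and its numbers are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-aggregated table; rows are inserted out of order on purpose.
conn.execute(
    "CREATE TABLE monthly ("
    "created_year INTEGER, created_month INTEGER, number_of_registration INTEGER)"
)
conn.executemany(
    "INSERT INTO monthly VALUES (?, ?, ?)",
    [(2014, 1, 150), (2013, 1, 100), (2013, 2, 80), (2014, 2, 60)],
)

# Within each month, order by year so LAG() picks the same month one year earlier.
rows = conn.execute("""
    SELECT created_year,
           created_month,
           number_of_registration,
           number_of_registration
             - LAG(number_of_registration) OVER (
                 PARTITION BY created_month ORDER BY created_year
               ) AS difference_in_previous_year
    FROM monthly
    ORDER BY created_year, created_month
""").fetchall()

for row in rows:
    print(row)
```

The 2013 rows have no earlier year in their partition, so their difference is NULL (None in Python).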
Note that you don't actually need a subquery for this. You can do:
select
extract(year from created_at) as created_year,
extract(month from created_at) as created_month,
count(*) as number_of_registration,
count(*) - lag(count(*)) over(partition by extract(month from created_at) order by extract(year from created_at))
from users as u
group by created_year, created_month
order by created_year, created_month
I used count(*) instead of count(user_id), because I assume that user_id is not nullable (in which case count(*) is equivalent, and more efficient). Casting to a timestamp is also probably superfluous.
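The remark about count(*) versus count(user_id) can be illustrated with a short sqlite3 sketch. The table below is hypothetical and deliberately allows a NULL user_id, which the real users table (where user_id is the primary key) would not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: user_id is nullable here only to show the difference.
conn.execute("CREATE TABLE users (user_id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "2013-01-05"), (2, "2013-01-09"), (None, "2013-01-20")],
)

# COUNT(*) counts rows; COUNT(column) skips rows where the column is NULL.
count_star, count_col = conn.execute(
    "SELECT COUNT(*), COUNT(user_id) FROM users"
).fetchone()
print(count_star, count_col)
```

When the column can never be NULL, both expressions return the same number.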
These queries work as long as you have data for every month. If you have gaps, then the problem should be addressed differently - but this is not the question you asked here.
I can get the registrations from each year as two tables and join them. But it is not that effective
SELECT
t1.created_year as year_2013
,t2.created_year as year_2014
,t1.created_month as month_of_year
,t1.number_of_registration_2013
,t2.number_of_registration_2014
,100.0 * (t2.number_of_registration_2014 - t1.number_of_registration_2013) / t1.number_of_registration_2013 as percent_change_in_previous_year_month
FROM
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2013
from users
where extract(year from created_at) = '2013'
group by 1,2) t1
inner join
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2014
from users
where extract(year from created_at) = '2014'
group by 1,2) t2
on t1.created_month = t2.created_month
First off, why are you using strings to hold date/time values? Your first step should be to define created_at and activated_at as proper timestamps. The resulting query assumes this correction. If you do not correct it, then cast the string to a timestamp in the CTE generating the date range. But keep in mind that if you leave it as text, you will at some point get a conversion exception.
To calculate the month-over-month change, use the formula "100*(Nt - Nl)/Nl", where Nt is the number of users this month and Nl is the number of users last month. There are 2 potential issues:
There are gaps in the data.
Nl is 0 (would incur divide by 0 exception)
The following handles this by first generating the months from the earliest date to the latest date, then outer joining the monthly counts to the generated dates. When Nl = 0 the query returns NULL, indicating that the percent change could not be calculated.
with full_range(the_month) as
(select generate_series(low_month, high_month, interval '1 month')
from (select min(date_trunc('month',created_at)) low_month
, max(date_trunc('month',created_at)) high_month
from users
) m
)
select to_char(the_month,'yyyy-mm')
, users_this_month
, case when users_last_month = 0
then null::float
else round((100.00*(users_this_month-users_last_month)/users_last_month),2)
end percent_change
from (
select the_month, users_this_month , lag(users_this_month) over(order by the_month) users_last_month
from ( select f.the_month, count(u.created_at) users_this_month
from full_range f
left join users u on date_trunc('month',u.created_at) = f.the_month
group by f.the_month
) mc
) pc
order by the_month;
NOTE: There are several places where the above can be shortened. But the longer form is intentional, to show how the final values are derived.
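For readers who want to check the arithmetic outside the database, here is a minimal Python sketch of the same steps: build every month between the first and last registration, fill gaps with 0, and return None where the previous month is 0 or missing. The function names and the sample dates are assumptions for illustration only.

```python
from collections import Counter
from datetime import date


def month_range(first, last):
    """Yield the first day of every month from first to last, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def percent_changes(created_dates):
    """Return (month, users_this_month, percent_change) for every month in range."""
    counts = Counter(d.replace(day=1) for d in created_dates)
    result = []
    previous = None
    for m in month_range(min(counts), max(counts)):
        current = counts.get(m, 0)  # months with no registrations count as 0
        if previous is None or previous == 0:
            change = None  # first month, or division by zero
        else:
            change = round(100.0 * (current - previous) / previous, 2)
        result.append((m.strftime("%Y-%m"), current, change))
        previous = current
    return result


# Hypothetical registrations: no one registered in March 2013.
sample = [
    date(2013, 1, 5), date(2013, 1, 20),
    date(2013, 2, 3), date(2013, 2, 10), date(2013, 2, 28),
    date(2013, 4, 1),
]
changes = percent_changes(sample)
for row in changes:
    print(row)
```

March appears with 0 users and a change of -100.0, and April gets None because the previous month is 0.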
When I run this query, I get two rows for the same date: one contains zero and the other contains the event count.
How can I solve this? Any help will be really appreciated!
(Select
distinct(case
when event_text = 'poll_vote' THEN device_id Else 0 END) as
pollvote,event_date from
(Select event_date,event_text,count(distinct users) as device_id from
(SELECT event.name as event_text, ( user.value.value.string_value)
AS users,
CAST(TIMESTAMP_ADD(TIMESTAMP_MICROS(event.timestamp_micros),
INTERVAL 330 MINUTE) AS date) AS event_date
FROM
`dataset.tablename`,
UNNEST(event_dim) AS event,
UNNEST(user_dim.user_properties) AS user
where
user.key="context_device_id"
GROUP BY
event_date,event_text,users)
GROUP BY
event_text,event_date))
Using ‘GROUP BY’ for event_date only should give you only one row per date, as you wanted. Here are some of the GROUP BY examples.
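One way to read this advice is as conditional aggregation: group by event_date alone and count the distinct devices only for rows where the event is poll_vote. The sketch below shows that pattern with Python's sqlite3 module; the flat events table and its rows are assumptions for illustration and stand in for the unnested Firebase data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened table: one row per event and device.
conn.execute("CREATE TABLE events (event_date TEXT, event_text TEXT, device_id TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2018-01-18", "poll_vote", "a"),
        ("2018-01-18", "poll_vote", "b"),
        ("2018-01-18", "poll_vote", "a"),
        ("2018-01-18", "app_open", "c"),
        ("2018-01-19", "app_open", "a"),
    ],
)

# Grouping by event_date only yields one row per date. The CASE returns NULL
# for other events, and COUNT(DISTINCT ...) ignores NULLs.
rows = conn.execute("""
    SELECT event_date,
           COUNT(DISTINCT CASE WHEN event_text = 'poll_vote' THEN device_id END)
             AS pollvote
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""").fetchall()

print(rows)
```

In the original query the grouping also includes event_text, so each date gets one row per event name, and the CASE turns every row that is not poll_vote into 0. That is where the extra zero row comes from.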
Taking what has been described on https://webmasters.stackexchange.com/a/87523, as well as my own understanding, I've come up with what I think would be considered "Returning Users".
1. First, a query to show users who had their first "latest visit" within a two year time period:
SELECT
parsedDate,
CASE
# return fullVisitorId when the first latest visit is between 2 years and today
WHEN parsedDate BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR) AND CURRENT_DATE() THEN fullVisitorId
END fullVisitorId
FROM (
SELECT
# convert the date field from string to date and get the latest date
PARSE_DATE('%Y%m%d',
MAX(date)) parsedDate,
fullVisitorId
FROM
`project.dataset.ga_sessions_*`
WHERE
# only show fullVisitorId if first visit
totals.newVisits = 1
GROUP BY
fullVisitorId)
2. Then a separate query to select some fields within a specific date range:
SELECT
PARSE_DATE('%Y%m%d',
date) parsedDate,
fullVisitorId,
visitId,
totals.newVisits,
totals.visits,
totals.bounces,
device.deviceCategory
FROM
`project.dataset.ga_sessions_*`
WHERE
_TABLE_SUFFIX = "20180118"
3. Joining these two queries together to find "Returning Users":
SELECT
q1.parsedDate date,
COUNT(DISTINCT q1.fullVisitorId) users,
# Default way to determine New Users
SUM(q1.newVisits) newVisits,
# Number of "New Users" based on my queries (matches with default way above)
COUNT(DISTINCT IF(q2.parsedDate < q1.parsedDate, NULL, q2.fullVisitorId)) newUsers,
# Number of "Returning Users" based on my queries
COUNT(DISTINCT IF(q2.parsedDate < q1.parsedDate, q2.fullVisitorId, NULL)) returningUsers
FROM (
(SELECT
PARSE_DATE('%Y%m%d',
date) parsedDate,
fullVisitorId,
visitId,
totals.newVisits,
totals.visits,
totals.bounces,
device.deviceCategory
FROM
`project.dataset.ga_sessions_*`
WHERE
_TABLE_SUFFIX = "20180118") q1
LEFT JOIN (
SELECT
parsedDate,
CASE
# return fullVisitorId when the first latest visit is between 2 years and today
WHEN parsedDate BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR) AND CURRENT_DATE() THEN fullVisitorId
END fullVisitorId
FROM (
SELECT
# convert the date field from string to date and get the latest date
PARSE_DATE('%Y%m%d',
MAX(date)) parsedDate,
fullVisitorId
FROM
`project.dataset.ga_sessions_*`
WHERE
# only show fullVisitorId if first visit
totals.newVisits = 1
GROUP BY
fullVisitorId)) q2
ON q1.fullVisitorId = q2.fullVisitorId)
GROUP BY
date
Results in BQ
Un-sampled new/returning visitors split by Users report for the same period in GA
Questions/Issues:
Given that newVisits (the default field) and newUsers (my calculation) give the same results, which are in line with the GA report's New Visitor Users, why is there a mismatch between GA's Returning Visitor Users and my calculation of returningUsers in BQ? Can these two even be compared? What am I missing?
Is my approach the most efficient and less verbose way of going about this?
Is there a better way to get the figures, something I'm missing?
SOLUTION
Based on Martin's answer below, I managed to create the "Returning Users" metric/field within the context of the query I was running:
SELECT
date,
deviceCategory,
# newUsers - SUM result if it's a new user
SUM(IF(userType="New Visitor", 1, 0)) newUsers,
# returningUsers - COUNT DISTINCT fullvisitorId if it's a returning user
COUNT(DISTINCT IF(userType="Returning Visitor", fullvisitorid, NULL)) returningUsers,
COUNT(DISTINCT fullvisitorid) users,
SUM(visits) sessions
FROM (
SELECT
date,
fullVisitorId,
visitId,
totals.visits,
device.deviceCategory,
IF(totals.newVisits IS NOT NULL, "New Visitor", "Returning Visitor") userType
FROM
`project.dataset.ga_sessions_20180118` )
GROUP BY
deviceCategory,
date
Google Analytics uses approximations for users (fullvisitorid) - even if it says "based on 100%". You get better user numbers when using an unsampled report.
Another thing to mention: fullvisitorids are taken into consideration even if totals.visits != 1, while sessions are only counted where totals.visits = 1
Also, users are double-counted if they were new and then returned. Meaning, this should give you correct numbers:
SELECT
totals.newVisits IS NOT NULL AS isNew,
COUNT(DISTINCT fullvisitorid) AS visitors,
SUM(totals.visits) AS sessions
FROM
`project.dataset.ga_sessions_20180214`
GROUP BY
1
If you want to avoid double counting you can use this, where a user is counted as new even if she returned:
WITH
visitors AS (
SELECT
fullvisitorid,
-- check if any visit of this visitor was new - will be used for grouping later
MAX(totals.newVisits ) isNew,
SUM(totals.visits) as sessions
FROM
`project.dataset.ga_sessions_20180214`
GROUP BY 1
)
SELECT
isNew IS NOT NULL AS isNew,
COUNT(1) AS visitors,
sum(sessions) as sessions
FROM
visitors
GROUP BY 1
Of course these numbers match with GA only in totals.
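The double counting described above can be seen on a handful of rows. This is a plain Python sketch with made-up sessions; the tuple layout and the function names are assumptions for illustration, not GA or BigQuery API.

```python
# Hypothetical sessions for one day: (fullvisitorid, newVisits, visits).
sessions = [
    ("A", 1, 1),     # A's first session
    ("A", None, 1),  # A comes back later the same day
    ("B", None, 1),  # B is a returning visitor
    ("C", 1, 1),     # C's first session
]


def per_session_type(rows):
    """Distinct visitors per session type, like grouping by newVisits IS NOT NULL."""
    seen = {True: set(), False: set()}
    for visitor, new_visits, _ in rows:
        seen[new_visits is not None].add(visitor)
    return {is_new: len(ids) for is_new, ids in seen.items()}


def per_visitor_type(rows):
    """Classify each visitor once: new if any of their sessions was new."""
    is_new = {}
    for visitor, new_visits, _ in rows:
        is_new[visitor] = is_new.get(visitor, False) or new_visits is not None
    counts = {True: 0, False: 0}
    for flag in is_new.values():
        counts[flag] += 1
    return counts


by_session = per_session_type(sessions)
by_visitor = per_visitor_type(sessions)
print(by_session)  # visitor A appears in both groups
print(by_visitor)  # visitor A is counted once, as new
```

With the first grouping the two groups add up to 4 although there are only 3 distinct visitors; with the second they add up to 3.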
I am trying to get the new-user count and user count of qualified visitors by custom dimension value and date. Here is the code, but I couldn't get the data to tie out with Google Analytics. I think the problem is that the UNNEST creates duplicates and totals.newVisits is at a different granularity. Thank you!
SELECT
PARSE_DATE('%Y%m%d', t.date) as Date
,count(distinct(FullvisitorID)) as visitor_count
,sum( totals.newVisits ) AS New_Visitors
,if(customDimensions.index=2, customDimensions.value,null) as orig
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions) AS customDimensions
WHERE
date='20170101'
GROUP BY DATE,if(customDimensions.index=2, customDimensions.value,null)
Try this instead:
SELECT
PARSE_DATE('%Y%m%d', date) AS Date,
COUNT(DISTINCT fullvisitorid) visitor_count,
SUM(totals.newVisits) AS New_Visitors,
(SELECT value FROM UNNEST(hits), UNNEST(customDimensions) WHERE index = 2 LIMIT 1) orig
FROM `dataset_id.ga_sessions_20170101`
GROUP BY Date, orig
It's basically the same thing, but instead of doing the UNNEST in the outer query, this solution only applies the operation at the hit level, which avoids the duplication of totals.newVisits you observed in your query.
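To see why the outer UNNEST inflates the sum, here is a small Python sketch on made-up sessions. The dictionaries only imitate the nested export shape, and the function names are hypothetical. Note that the sketch returns the first matching value in hit order, whereas LIMIT 1 without an ORDER BY makes no promise about which match is returned.

```python
# Hypothetical sessions imitating the nested GA export.
sessions = [
    {"fullVisitorId": "A", "newVisits": 1,
     "hits": [
         {"customDimensions": [{"index": 1, "value": "x"},
                               {"index": 2, "value": "organic"}]},
         {"customDimensions": [{"index": 2, "value": "organic"}]},
     ]},
    {"fullVisitorId": "B", "newVisits": None,
     "hits": [{"customDimensions": [{"index": 2, "value": "paid"}]}]},
]


def flattened_new_visits(rows):
    """Sum newVisits after cross joining hits and custom dimensions."""
    total = 0
    for s in rows:
        for hit in s["hits"]:
            for _cd in hit["customDimensions"]:
                # The session-level value is repeated once per flattened row.
                total += s["newVisits"] or 0
    return total


def first_dimension_value(session, index):
    """Like the scalar subquery: one value per session, or None if absent."""
    for hit in session["hits"]:
        for cd in hit["customDimensions"]:
            if cd["index"] == index:
                return cd["value"]
    return None


def per_session_new_visits(rows):
    """Sum newVisits once per session, grouped by the dimension value."""
    totals = {}
    for s in rows:
        orig = first_dimension_value(s, 2)
        totals[orig] = totals.get(orig, 0) + (s["newVisits"] or 0)
    return totals


inflated = flattened_new_visits(sessions)
correct = per_session_new_visits(sessions)
print(inflated)
print(correct)
```

Session A is new exactly once, but flattening repeats its newVisits value for each of its three custom dimension rows.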