How to group multiple AND/OR statements in BigQuery - sql

I want to filter NHT (non-human traffic) out of my BigQuery results, using criteria I found in my Google Analytics dataset. For this example, I want to exclude rows that match either of these two sets of criteria:
networkLocation REGEXP_Contains (r"^(ovh \(nwk\)|hostwinds llc.|bhost inc|prisma networks llc|psychz networks|buyvm services|private customer|secure dragon llc.|vmpanel|netaction telecom srl-d|hostigation|frontlayer technologies inc.|digital energy technologies limited|owned-networks|rica web services|netaction telecom srl-d|hurricane electric inc.|private customer - host.howpick.com|ssdvirt|sway broadband|detect network|gorillaservers inc.|micfo llc.| netaction telecom srl|egihosting|zenlayer inc|intercom online inc.|gs1 argentine|ovh hosting inc.|vps cheap inc.|limeip networks|blackhost ltd.|amazon.com inc.)$")
AND
device.browserVersion REGEXP_Contains(r"^(41.0|55.0)$")
OR
networkLocation REGEXP_Contains ("^(hpro group ltd)$")
AND
device.browserVersion REGEXP_Contains("45.0")
My SQL:
SELECT
channelGrouping,
date,
h.page.pagePath AS Page,
SUM(totals.timeOnSite) AS Session_Duration,
SUM(totals.visits) AS Visits,
AVG(totals.timeonSite/totals.visits) AS Avg_Time_per_Session,
SUM(totals.bounces) AS Bounce,
(SUM(totals.bounces)/SUM(totals.visits)) AS Bounce_rate,
geoNetwork.networkLocation,
device.browserVersion,
device.browser
FROM
`93868086.ga_sessions_*`,
UNNEST(hits) as h
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY))
AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY
date,
channelGrouping,
geoNetwork.networkLocation,
device.browserVersion,
device.browser,
h.page.pagePath
I need a HAVING NOT clause, but I am not sure how to group the conditions so that both sets of criteria are filtered out. Any help would be great!

Assuming the expressions in your criteria are correct, the following should work:
HAVING NOT (
(
REGEXP_CONTAINS (networkLocation, r"^(ovh \(nwk\)|hostwinds llc.|bhost inc|prisma networks llc|psychz networks|buyvm services|private customer|secure dragon llc.|vmpanel|netaction telecom srl-d|hostigation|frontlayer technologies inc.|digital energy technologies limited|owned-networks|rica web services|netaction telecom srl-d|hurricane electric inc.|private customer - host.howpick.com|ssdvirt|sway broadband|detect network|gorillaservers inc.|micfo llc.| netaction telecom srl|egihosting|zenlayer inc|intercom online inc.|gs1 argentine|ovh hosting inc.|vps cheap inc.|limeip networks|blackhost ltd.|amazon.com inc.)$")
AND REGEXP_CONTAINS(device.browserVersion, r"^(41.0|55.0)$")
) OR (
REGEXP_CONTAINS (networkLocation, r"^(hpro group ltd)$")
AND REGEXP_CONTAINS(device.browserVersion, r"45.0")
)
)
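To see how the parentheses group the conditions, here is a minimal, self-contained sketch. The sessions rows are hypothetical inline data and the network list is shortened. The dots in the version patterns are escaped and the 45.0 pattern is anchored; both are assumptions that differ from the expressions above. Because the two columns are plain grouping columns rather than aggregates, the sketch filters in WHERE; in the original query the same predicate works after HAVING.

```sql
-- Hypothetical inline rows standing in for the GA sessions data.
WITH sessions AS (
  SELECT 'bhost inc' AS networkLocation, '41.0' AS browserVersion UNION ALL
  SELECT 'bhost inc', '60.0' UNION ALL
  SELECT 'hpro group ltd', '45.0' UNION ALL
  SELECT 'comcast cable', '41.0'
)
SELECT networkLocation, browserVersion
FROM sessions
WHERE NOT (
  -- Set 1: listed network AND one of the two browser versions.
  (REGEXP_CONTAINS(networkLocation, r'^(bhost inc|vmpanel)$')
   AND REGEXP_CONTAINS(browserVersion, r'^(41\.0|55\.0)$'))
  OR
  -- Set 2: a different network AND a different browser version.
  (REGEXP_CONTAINS(networkLocation, r'^hpro group ltd$')
   AND REGEXP_CONTAINS(browserVersion, r'^45\.0$'))
)
```

Only the rows that match neither set should remain, here 'bhost inc' with 60.0 and 'comcast cable' with 41.0. Note that REGEXP_CONTAINS returns NULL for a NULL input, so rows with a NULL networkLocation or browserVersion may also be dropped by NOT (...).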

Related

Google Analytics BigQuery get the time difference between two different pages by user

I'm trying to get the difference in time by user between the first step checkout and final purchase. This is my query:
SELECT transactionid1,MAX((t1.hit_moment1-t2.hit_moment2)) as diff_hits,MAX(t2.checkout_step_2) as day FROM ((SELECT clientId as client1_id,
hits_1.page.pagePath as page_event1,
hits_1.eventInfo.eventAction as action_event1,
hits_1.transaction.transactionId as transactionId1,
TIMESTAMP_SECONDS(visitStartTime) as checkout_step_1,
hits_1.hour as hour1,
hits_1.minute as minute1,
(hits_1.hour*60+hits_1.minute) as hit_moment1
from `616180.ga_sessions_*` ,
UNNEST(hits) as hits_1 where hits_1.page.pagePath like '%/buy1/suscription%' and hits_1.eventInfo.eventAction="Transaction" and hits_1.transaction.transactionId is not null)t1 INNER JOIN (SELECT clientId as client2_id,
hits_2.page.pagePath as page_event2,
hits_2.eventInfo.eventAction as action_event2,
TIMESTAMP_SECONDS(visitStartTime) as checkout_step_2,
hits_2.hour as hour2,
hits_2.minute as minute2,
(hits_2.hour*60+hits_2.minute) as hit_moment2
from `616180.ga_sessions_*` ,UNNEST(hits) as hits_2 where hits_2.page.pagePath like '%/buy4/suscription%' and hits_2.eventInfo.eventAction="Checkout" )t2 on t1.client1_id=t2.client2_id) where (t1.hit_moment1-t2.hit_moment2)>0 and (t1.hit_moment1-t2.hit_moment2)<180 group by transactionId1 order by transactionid1
Here, a pagePath containing /buy1/suscription represents the transaction event, and a pagePath containing /buy4/suscription represents the first checkout step. I get results, but many of them are extremely long periods of time. Have I made a mistake?
Thank you.
I don't fully follow what the sample data looks like or exactly what format you want for the result set.
That said, you can use aggregation to do the calculation you want. The following assumes that the checkout is after the transaction, but it gives the basic idea:
-- s.transaction_id is a placeholder: use whatever column identifies the purchase in your data
select s.transaction_id,
max(hit.hour * 60 + hit.minute) - min(hit.hour * 60 + hit.minute) as diff_minute
from `616180.ga_sessions_*` s cross join
unnest(s.hits) as hit
where (hit.page.pagePath like '%/buy1/suscription%' and
hit.eventInfo.eventAction = 'Transaction'
) or
(hit.page.pagePath like '%/buy4/suscription%' and
hit.eventInfo.eventAction = 'Checkout'
)
group by s.transaction_id;
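The aggregation idea can be tried on inline data first. This is only a sketch: the hits rows, the clientId grouping key and the hit_hour / hit_minute column names are hypothetical stand-ins for the unnested GA fields. It uses conditional aggregation so that the checkout time is subtracted from the transaction time, rather than relying on max minus min.

```sql
-- Hypothetical inline rows standing in for unnested GA hits.
WITH hits AS (
  SELECT 'c1' AS clientId, 'Checkout' AS eventAction, 10 AS hit_hour, 5 AS hit_minute UNION ALL
  SELECT 'c1', 'Transaction', 10, 17 UNION ALL
  SELECT 'c2', 'Checkout', 23, 50 UNION ALL
  SELECT 'c2', 'Transaction', 23, 58
)
SELECT
  clientId,
  -- Latest transaction minute-of-day minus earliest checkout minute-of-day.
  MAX(IF(eventAction = 'Transaction', hit_hour * 60 + hit_minute, NULL))
    - MIN(IF(eventAction = 'Checkout', hit_hour * 60 + hit_minute, NULL)) AS diff_minute
FROM hits
GROUP BY clientId
```

Keep in mind that hour * 60 + minute is a minute-of-day value, so it wraps at midnight and says nothing about the date. If the rows being compared come from different days, the difference is not a real elapsed time.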

In SQL can I join data using a like or contains clause to exclude some values?

I am working on a project to build some queries from Google Analytics data in BigQuery to replicate some reports for one particular KPI, I have a table with a list of sites that I need to have excluded from the Google Analytics data in order to get the correct metric.
My list might have something such as:
sitename.com
However I need to match this to the eventLabel column in BigQuery where the URL could come back as:
http://sitename.com/subpage/extra-subpage
I can't use NOT IN, as that requires an exact match. I have tried a LIKE statement, but I get the following error:
Scalar subquery produced more than one element
I'm not sure how else to proceed. I am wondering whether I need a query that tests whether the string matches: I can get it to work if I use an inner join and then use the resulting table for the exclusions, since I can keep the eventLabel and do my NOT IN based on that.
SELECT Distinct
h.eventinfo.eventAction eventAction,
h.eventinfo.eventlabel eventlabel
FROM `projectName.ga_sessions_*`, unnest(Hits) h
WHere
_TABLE_SUFFIX BETWEEN "20190101" AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
and type = 'EVENT'
and h.eventInfo.eventCategory = 'EventName'
and Replace(Replace(Replace(h.eventInfo.eventLabel,'http://',''),'https://',''),'www.','')
Not like (select concat(ThirdPartyURL,'%') from `projectName.datasetName.ExclusionList`)
I hope the above makes sense.
TIA.
After reproducing your problem, the solution is to use NOT IN instead of NOT LIKE, as follows:
WITH `projectName.datasetName.ExclusionList` AS
(SELECT 'label1' AS ThirdPartyURL UNION ALL
SELECT 'label2')
SELECT DISTINCT h.eventinfo.eventAction eventAction,
h.eventinfo.eventlabel eventlabel
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
unnest(Hits) h
WHERE _TABLE_SUFFIX BETWEEN "20170801" AND "20170802"
AND TYPE = 'EVENT'
AND h.eventInfo.eventCategory = 'EventName'
AND Replace(Replace(Replace(h.eventInfo.eventLabel, 'http://', ''), 'https://', ''), 'www.', '')
NOT IN
(SELECT ThirdPartyURL FROM `projectName.datasetName.ExclusionList`)
This is the link to BigQuery related SQL documentation
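The "Scalar subquery produced more than one element" error arises because the subquery on the right-hand side of LIKE must return a single value. NOT IN avoids the error but compares for exact equality. If prefix matching against the exclusion list is what is wanted, one possible alternative is a correlated NOT EXISTS. The sketch below is an assumption about the intent, not part of the answer above; the ExclusionList and events rows are hypothetical inline data, and REGEXP_REPLACE stands in for the nested Replace calls.

```sql
-- Hypothetical inline data standing in for the exclusion table and the GA events.
WITH ExclusionList AS (
  SELECT 'sitename.com' AS ThirdPartyURL
),
events AS (
  SELECT 'http://sitename.com/subpage/extra-subpage' AS eventLabel UNION ALL
  SELECT 'https://www.other.org/page'
)
SELECT e.eventLabel
FROM events AS e
WHERE NOT EXISTS (
  SELECT 1
  FROM ExclusionList AS x
  -- Strip the scheme and an optional www. prefix, then test for a prefix match.
  WHERE STARTS_WITH(
    REGEXP_REPLACE(e.eventLabel, r'^https?://(www\.)?', ''),
    x.ThirdPartyURL)
)
```

With this data only the other.org row should be returned, because the sitename.com URL starts with an excluded value once the scheme is stripped.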

how to write Bigquery in new schema with replacing event_dim in old schema from Firebase analytics?

The script below runs against the old BigQuery Export schema. I want to rewrite it for the new export schema, since the BigQuery schema has changed. Please help, because in the new BigQuery Export schema I can't find any record corresponding to event_dim (event_dim belongs to the old BigQuery Export schema).
Here is the link to the BigQuery Export schema: click here
SELECT user_dim.app_info.app_instance_id
, (SELECT MIN(timestamp_micros) FROM UNNEST(event_dim)) min_time
, (SELECT MAX(timestamp_micros) FROM UNNEST(event_dim)) max_time,
event.name,
params.value.int_value engagement_time
FROM `xxx.app_events_*`,
UNNEST(event_dim) as event,
UNNEST(event.params) as params,
UNNEST(user_dim.user_properties) as user_params
where (event.name = "user_engagement" and params.key = "engagement_time_msec")
and
(user_params.key = "access" and user_params.value.value.string_value = "true") and
PARSE_DATE('%Y%m%d', event.date) >= date_sub("{{upto_date (yyyy-mm-dd)}}", interval {{last n days}} day) and
PARSE_DATE('%Y%m%d', event.date) <= "{{upto_date (yyyy-mm-dd)}}"
I tried the query below, but I want app_instance, min_time, max_time, event_name and engagement_time in a single SELECT statement. Because I am using GROUP BY, I am not able to get all of them at once. Please help.
SELECT user_pseudo_id
, MIN(event_timestamp) AS min_time
,MAX(event_timestamp) AS max_time
FROM `xxx.app_events_*` as T,
T.event_params,
T.user_properties,
T.event_timestamp
where (event_name = "user_engagement" and event_params.key = "engagement_time_msec")
and
(user_properties.key = "access" and user_properties.value.string_value = "true") and
PARSE_DATE('%Y%m%d', event_date) >= date_sub("{{upto_date (yyyy-mm-dd)}}", interval {{last n days}} day) and
PARSE_DATE('%Y%m%d', event_date) <= "{{upto_date (yyyy-mm-dd)}}"
group by 1
It is true that there was a schema change in the Google Analytics for Firebase BigQuery Export. Although there is no clear mapping of the old fields to the new ones, the SQL query provided in the documentation for migrating existing BQ datasets from the old schema to the new one gives some hints about how these fields have changed.
I share the migration_script.sql SQL query below, just for reference, but let me pinpoint the most relevant changes for your use case:
event_dim is mapped as event in the SQL query, but does not have any final representation in the schema, because event_dim is no longer a nested field: UNNEST(event_dim) AS event
event_dim.timestamp_micros is mapped as event_timestamp: event.timestamp_micros AS event_timestamp
event_dim.name is mapped as event_name: event.name AS event_name
event_param.value.int_value is mapped as event_params.value.int_value: event_param.value.int_value AS int_value
user_dim.user_properties is mapped as user_properties, and all its nested values follow the same structure: UNNEST(user_dim.user_properties) AS user_property) AS user_properties
So, in summary, the schema change has focused on unnesting several of the fields for simplicity, so that, for example, instead of having to access event_dim.name (which requires unnesting and complicates the query), you can query the field event_name directly.
With this in mind, I am sure you will be able to adapt your query to the new schema, and it will probably look much simpler, given that you will not have to unnest so many fields.
Just for clarification, let me share with you a couple of sample BQ queries comparing the old and the new schema (they are using public Firebase tables, so you should be able to run them out-of-the-box):
# Old Schema - UNNEST() required because there are nested fields
SELECT
user_dim.app_info.app_instance_id,
MIN(event.timestamp_micros) AS min_time,
MAX(event.timestamp_micros) AS max_time,
event.name
FROM
`firebase-public-project.com_firebase_demo_ANDROID.app_events_20180503`,
UNNEST(event_dim) AS event
WHERE
event.name = "user_engagement"
GROUP BY
user_dim.app_info.app_instance_id,
event.name
As compared to:
# New Schema - UNNEST() not required because there are no nested fields
SELECT
user_pseudo_id,
MIN(event_timestamp) AS min_time,
MAX(event_timestamp) AS max_time,
event_name
FROM
`firebase-public-project.analytics_153293282.events_20180815`
WHERE
event_name = "user_engagement"
GROUP BY
user_pseudo_id,
event_name
These queries are equivalent, but referencing tables with the old and new schema. Please note that, as your query is more complex, you may need to add some UNNEST() in order to access the remaining nested fields in the table.
Additionally, you may want to have a look at these samples that can help you with some ideas on how to write queries with the new schema.
EDIT 2
My understanding is that a query like the one below should allow you to query for all the fields in a single statement. I am grouping by all the non-aggregated/filtered fields, but depending on your use case (this is definitely something you would need to work on your own) you may want to apply a different strategy in order to be able to query the non-grouped fields (i.e. use a MIN/MAX filter, etc.).
SELECT
user_pseudo_id,
MIN(event_timestamp) AS min_time,
MAX(event_timestamp) AS max_time,
event_name,
par.value.int_value AS engagement_time
FROM
`firebase-public-project.analytics_153293282.events_20180815`,
UNNEST(event_params) as par
WHERE
event_name = "user_engagement" AND par.key = "engagement_time_msec"
GROUP BY
user_pseudo_id,
event_name,
par.value.int_value
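As one example of a different strategy, the parameter value can be read with a scalar subquery and aggregated, instead of cross joining UNNEST(event_params) and grouping by the value. This is a sketch on hypothetical inline rows that mimic the new schema; summing the engagement time per user is an assumption about what is wanted, not something stated in the question.

```sql
-- Hypothetical inline rows shaped like the new export schema.
WITH events AS (
  SELECT
    'user_a' AS user_pseudo_id,
    'user_engagement' AS event_name,
    1000 AS event_timestamp,
    [STRUCT('engagement_time_msec' AS key,
            STRUCT(5000 AS int_value) AS value)] AS event_params
  UNION ALL
  SELECT
    'user_a', 'user_engagement', 2000,
    [STRUCT('engagement_time_msec' AS key,
            STRUCT(7000 AS int_value) AS value)]
)
SELECT
  user_pseudo_id,
  event_name,
  MIN(event_timestamp) AS min_time,
  MAX(event_timestamp) AS max_time,
  -- The scalar subquery picks one parameter per event; SUM adds it up per group.
  SUM((SELECT p.value.int_value
       FROM UNNEST(event_params) AS p
       WHERE p.key = 'engagement_time_msec')) AS engagement_time_msec
FROM events
GROUP BY user_pseudo_id, event_name
```

This keeps one output row per user and event name, so min_time and max_time span all of that user's events rather than being split by each distinct engagement value.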
ANNEX
migration_script.sql:
SELECT
#date AS event_date,
event.timestamp_micros AS event_timestamp,
event.previous_timestamp_micros AS event_previous_timestamp,
event.name AS event_name,
event.value_in_usd AS event_value_in_usd,
user_dim.bundle_info.bundle_sequence_id AS event_bundle_sequence_id,
user_dim.bundle_info.server_timestamp_offset_micros as event_server_timestamp_offset,
(
SELECT
ARRAY_AGG(STRUCT(event_param.key AS key,
STRUCT(event_param.value.string_value AS string_value,
event_param.value.int_value AS int_value,
event_param.value.double_value AS double_value,
event_param.value.float_value AS float_value) AS value))
FROM
UNNEST(event.params) AS event_param) AS event_params,
user_dim.first_open_timestamp_micros AS user_first_touch_timestamp,
user_dim.user_id AS user_id,
user_dim.app_info.app_instance_id AS user_pseudo_id,
"" AS stream_id,
user_dim.app_info.app_platform AS platform,
STRUCT( user_dim.ltv_info.revenue AS revenue,
user_dim.ltv_info.currency AS currency ) AS user_ltv,
STRUCT( user_dim.traffic_source.user_acquired_campaign AS name,
user_dim.traffic_source.user_acquired_medium AS medium,
user_dim.traffic_source.user_acquired_source AS source ) AS traffic_source,
STRUCT( user_dim.geo_info.continent AS continent,
user_dim.geo_info.country AS country,
user_dim.geo_info.region AS region,
user_dim.geo_info.city AS city ) AS geo,
STRUCT( user_dim.device_info.device_category AS category,
user_dim.device_info.mobile_brand_name,
user_dim.device_info.mobile_model_name,
user_dim.device_info.mobile_marketing_name,
user_dim.device_info.device_model AS mobile_os_hardware_model,
#platform AS operating_system,
user_dim.device_info.platform_version AS operating_system_version,
user_dim.device_info.device_id AS vendor_id,
user_dim.device_info.resettable_device_id AS advertising_id,
user_dim.device_info.user_default_language AS language,
user_dim.device_info.device_time_zone_offset_seconds AS time_zone_offset_seconds,
IF(user_dim.device_info.limited_ad_tracking, "Yes", "No") AS is_limited_ad_tracking ) AS device,
STRUCT( user_dim.app_info.app_id AS id,
#firebase_app_id AS firebase_app_id,
user_dim.app_info.app_version AS version,
user_dim.app_info.app_store AS install_source ) AS app_info,
(
SELECT
ARRAY_AGG(STRUCT(user_property.key AS key,
STRUCT(user_property.value.value.string_value AS string_value,
user_property.value.value.int_value AS int_value,
user_property.value.value.double_value AS double_value,
user_property.value.value.float_value AS float_value,
user_property.value.set_timestamp_usec AS set_timestamp_micros ) AS value))
FROM
UNNEST(user_dim.user_properties) AS user_property) AS user_properties
FROM
`SCRIPT_GENERATED_TABLE_NAME`,
UNNEST(event_dim) AS event
As I believe my previous answer provides some general ideas for the Community, I will keep it and write a new one in order to be more specific for your use case.
First of all, adapting a query (as you are asking us to do) requires a clear understanding of the statement, the objective of the query, the expected results and the data involved. That is not the case here, and some parts of the query are unclear. For example, to obtain "min_time" and "max_time" for each event, you take the minimum and maximum across multiple events, which does not make clear sense to me (it may, depending on your use case, which is why I suggested providing more details or working on the query further yourself).
Moreover, the new schema "flattens" events, so that each event is written to its own row. You can check this by comparing SELECT COUNT(*) FROM 'table_with_old_schema' with SELECT COUNT(*) FROM 'table_with_new_schema'; the second returns many more rows. Your query therefore no longer makes sense as written: events are no longer grouped, so you cannot pick a minimum and maximum across nested fields.
With that clarified, and after removing some fields that cannot be directly adapted to the new schema (you may be able to adapt them yourself, but that requires additional effort and an understanding of what those fields meant in your previous query), here are two queries that return exactly the same results when run against the same table in each schema:
Query against a table with the old schema:
SELECT
user_dim.app_info.app_instance_id,
event.name,
params.value.int_value engagement_time
FROM
`DATASET.app_events_YYYYMMDD`,
UNNEST(event_dim) AS event,
UNNEST(event.params) AS params,
UNNEST(user_dim.user_properties) AS user_params
WHERE
(event.name = "user_engagement"
AND params.key = "engagement_time_msec")
AND (user_params.key = "plays_quickplay"
AND user_params.value.value.string_value = "true")
ORDER BY 1, 2, 3
Query against the same table, with the new schema:
SELECT
user_pseudo_id,
event_name,
params.value.int_value engagement_time
FROM
`DATASET.events_YYYYMMDD`,
UNNEST(event_params) AS params,
UNNEST(user_properties) AS user_params
WHERE
(event_name = "user_engagement"
AND params.key = "engagement_time_msec")
AND (user_params.key = "plays_quickplay"
AND user_params.value.string_value = "true")
ORDER BY 1, 2, 3
Again, for this I am using the following table from the public dataset: firebase-public-project.com_firebase_demo_ANDROID.app_events_YYYYMMDD, so I had to change some filters and remove some others in order for it to retrieve sensible results against that table. Therefore, feel free to modify or add the ones you need in order for it to be useful for your use case.

Query hits and custom dimensions in the BigQuery?

I am working with Google Analytics data in BigQuery.
I want to output two columns: specific event actions (hits) and a custom dimension (session-based), all using Standard SQL. I cannot figure out how to do it correctly, and the documentation does not help either. Please help me. This is what I am trying:
SELECT
(SELECT MAX(IF(index=80, value, NULL)) FROM UNNEST(customDimensions)) AS is_app,
(SELECT hits.eventInfo.eventAction) AS ea
FROM
`table-big-query.105229861.ga_sessions_201711*`, UNNEST(hits) hits
WHERE
totals.visits = 1
AND _TABLE_SUFFIX BETWEEN '21' and '21'
AND EXISTS(SELECT 1 FROM UNNEST(hits) hits
WHERE hits.eventInfo.eventCategory = 'SomeEventCategory'
)
Try to give your tables and sub-tables aliases that are not names from the original table schema, and always state which table you are referring to. When cross joining, you are basically adding new columns (here h.*, flattened), but the old ones (hits.*, nested) still exist.
I aliased ga_sessions_* as t and use it to refer to both the cross join and customDimensions.
Also: you no longer need the legacy SQL trick of using MAX() for customDimensions. It's a simple sub-query now :)
Try:
SELECT
(SELECT value FROM t.customDimensions where index=80) AS is_app, -- use h.customDimensions if it is hit-scope
h.eventInfo.eventAction AS ea
FROM
`projectid.dataset.ga_sessions_201711*` t, t.hits h
WHERE
totals.visits = 1
AND _TABLE_SUFFIX BETWEEN '21' and '21'
AND h.eventInfo.eventCategory is not null
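The same pattern can be tested without access to a GA export. The sketch below uses a single hypothetical inline session row with the same nesting (an array of customDimensions and an array of hits); the values are made up for illustration.

```sql
-- One hypothetical session row with nested customDimensions and hits.
WITH sessions AS (
  SELECT
    [STRUCT(80 AS index, 'app' AS value),
     STRUCT(5 AS index, 'other' AS value)] AS customDimensions,
    [STRUCT(STRUCT('SomeEventCategory' AS eventCategory,
                   'click' AS eventAction) AS eventInfo)] AS hits
)
SELECT
  -- Session-scope custom dimension, read with a scalar subquery.
  (SELECT cd.value FROM UNNEST(t.customDimensions) AS cd WHERE cd.index = 80) AS is_app,
  -- Hit-scope field, read from the flattened alias h.
  h.eventInfo.eventAction AS ea
FROM sessions AS t, UNNEST(t.hits) AS h
WHERE h.eventInfo.eventCategory IS NOT NULL
```

A scalar subquery like this must return at most one row, so it relies on each index appearing only once in customDimensions for a given session.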

Error: The project hits has not enabled BigQuery

I'm trying to export data from GA using BigQuery, and the query failed.
I use these functions:
FLATTEN
TABLE_DATE_RANGE
because I need data from hits.
Can anyone help me with this error?
Error:
The project hits has not enabled BigQuery
Now the error is a different one: Field CampaignGrouping not found:
SELECT
a.hits.contentGroup.contentGroup2 AS CampaignGrouping,
a.customDimensions.value AS member_PK,
'Web' AS Canal,
'ES' AS country_id,
count(a.hits.contentGroup.contentGroupUniqueViews2) AS VistasUnicas
FROM FLATTEN(FLATTEN(
(SELECT
hits.contentGroup.contentGroupUniqueViews2,
hits.contentGroup.contentGroup2,
customDimensions.value
FROM TABLE_DATE_RANGE([###.ga_sessions_], TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-30'))),
hits.contentGroup.contentGroupUniqueViews2), customDimensions.value
)a
WHERE hits.contentGroup.contentGroup2<>'(not set)' AND customDimensions.value<>'null' AND hits.contentGroup.contentGroupUniqueViews2 IS NOT NULL
GROUP BY 1,2,3,4
ORDER BY 5 ASC
Solving your problem in Standard SQL is much easier than in Legacy.
This query might help you on computing this:
SELECT
hits.contentgroup.contentgroup2 CampaignGrouping,
custd.value member_PK,
'Web' Canal,
'ES' AS country_id,
SUM(hits.contentGroup.contentGroupUniqueViews2) VistasUnicas
FROM
`project_id.dataset_id.ga_sessions_*`,
UNNEST(customdimensions) custd,
UNNEST(hits) AS hits
WHERE
1 = 1
AND _TABLE_SUFFIX BETWEEN '20170501' AND '20170506'
and hits.contentGroup.contentGroup2<>'(not set)'
AND custd.value<>'null'
AND hits.contentGroup.contentGroupUniqueViews2 IS NOT NULL
GROUP BY
1, 2
ORDER BY 5 ASC
You just need to enable Standard SQL and the query is ready to run.
As you said you are learning SQL, it's highly recommended that you start with the Standard version instead of the Legacy one, as it's more stable and offers several different techniques to better assist you in your analyses.