Bug or new behavior in BigQuery? - google-bigquery

As of two days ago (August 10th, 2016), a query which used to work (using tables from the BQ Export for Google Analytics Premium) has stopped working. It returns the following error:
Error: Cannot union tables : Incompatible types.
'hits.latencyTracking.userTimingVariable' : TYPE_INT64
'hits.latencyTracking.userTimingVariable' : TYPE_STRING
After some investigation, it seems to be a problem with using IN in a WHERE clause when I query tables from before and after August 10th (table ga_sessions_20160810).
I've simplified my original query to provide a dummy one which has the same basic structure. The following query works (querying data from 2016-08-08 and 2016-08-09):
SELECT fullVisitorId, sum(totals.visits)
FROM (select * from TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-08'),TIMESTAMP('2016-08-09')))
WHERE fullVisitorId in(
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-08'),TIMESTAMP('2016-08-09'))
)
GROUP BY fullVisitorId
But this other one (just changing dates, in this case from 2016-08-09 and 2016-08-10) returns the error:
SELECT fullVisitorId, sum(totals.visits)
FROM (select * from TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10')))
WHERE fullVisitorId in(
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10'))
)
GROUP BY fullVisitorId
This last query works fine either if I delete the WHERE clause or if I run just the query inside the IN by itself, so I guess the problem is with the WHERE field IN (...) structure. Furthermore, querying only data from 2016-08-10 does work. Also, the same happens when using a field other than fullVisitorId and when running the same queries in different BQ projects.
Looking at the error description, it should be a problem with variable types, but I don't know what hits.latencyTracking.userTimingVariable is. My query used to work properly, so I can't figure out what has changed to produce the error. Have some fields changed their type, or what happened?
Has anyone experienced this? Is this a bug or a new behavior in BigQuery? How can this error be solved?

Because you are using * in the SELECT clause, the implicit union over the date range tries to combine two different column types (the schema of that field changed from INT64 to STRING), which is what causes the problem.
There are two approaches:
1) Select only the fields you need instead of using * in the SELECT clause:
SELECT fullVisitorId, sum(totals.visits)
FROM (select fullVisitorId,totals.visits from TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10')))
WHERE fullVisitorId in(
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10'))
) GROUP BY fullVisitorId
2) Use views to split out the inner queries and reference the views later in the query (even in the views, select only the fields that are required):
SELECT fullVisitorId, sum(totals.visits)
FROM [view.innertable2]
WHERE fullVisitorId in(
SELECT fullVisitorId from [view.innertable1] ) GROUP BY fullVisitorId
This excludes hits.latencyTracking.userTimingVariable, so the error will not occur.
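For approach 2, the two views referenced above could contain queries like the following sketch (legacy SQL has no CREATE VIEW statement, so the views would be saved through the BigQuery UI or API; the view names innertable1/innertable2 are taken from the example and are illustrative):

```sql
-- view.innertable1: only the field needed by the IN clause
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_], TIMESTAMP('2016-08-09'), TIMESTAMP('2016-08-10'))

-- view.innertable2: only the fields needed by the outer query
SELECT fullVisitorId, totals.visits
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_], TIMESTAMP('2016-08-09'), TIMESTAMP('2016-08-10'))
```

Because neither view selects the hits record, the incompatible hits.latencyTracking.userTimingVariable field never enters the union.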

If the fields that you are querying are compatible, you may try using Standard SQL wildcard tables (you'll have to uncheck the "Use Legacy SQL" box if you are doing this from the UI). Something like this:
SELECT fullVisitorId, sum(totals.visits)
FROM `xxxxxxxx.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160808' and '20160810'
GROUP BY fullVisitorId;

Related

Dynamically including all available custom dimensions in BigQuery select statement

We are using a query similar to the one below for a report
SELECT
visitId AS visitId,
hits.hitNumber AS hits_hitNumber,
hits.time AS hits_time,
hits.page.pagePath AS hits_page_pagePath,
-- hit scope custom dimensions
(SELECT value from hits.customDimensions where index=1) AS CD1,
(SELECT value from hits.customDimensions where index=2) AS CD2,
-- user and session level custom dimensions
(SELECT value from sessions.customDimensions where index=3) AS CD3
FROM `ga_sessions_20191031` AS sessions, UNNEST(hits) as hits
ORDER BY visitId, hits_hitNumber
LIMIT 50
The query uses un-nesting to flatten some of the custom dimensions. However, the index values are hard-coded in the query, so every time a new custom dimension is defined, the query needs to be updated. Is it possible to use a subquery to select all available distinct index values and add them to the query dynamically?
EDIT:
The following queries provide the distinct index values. Is there a way to link them into the first query?
(hit scope )
SELECT
DISTINCT cds.index as hit_cd_index
FROM `ga_sessions_20191031` AS sessions, UNNEST(hits) as hits, UNNEST(hits.customDimensions) as cds
ORDER BY hit_cd_index
(user and session scope )
SELECT
DISTINCT session_cds.index as session_cd_index
FROM `ga_sessions_20191031`, UNNEST(customDimensions) as session_cds
ORDER BY session_cd_index asc
The most robust solution would be to add a table to your BigQuery dataset containing data from the Management API, so you can construct your SELECT based on values from the most recent custom dimensions list: https://developers.google.com/analytics/devguides/config/mgmt/v3/mgmtReference/management/customDimensions/list
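Alternatively, if BigQuery scripting is available in your project, the SELECT list could be assembled and executed dynamically with EXECUTE IMMEDIATE. This is an untested sketch for the hit-scope dimensions only; the variable name cd_exprs is illustrative:

```sql
-- Build one "(SELECT value ...) AS CD<n>" expression per distinct hit-scope index,
-- then run the assembled query dynamically.
DECLARE cd_exprs STRING;

SET cd_exprs = (
  SELECT STRING_AGG(
           FORMAT("(SELECT value FROM hits.customDimensions WHERE index=%d) AS CD%d",
                  idx, idx), ",\n")
  FROM (
    SELECT DISTINCT cds.index AS idx
    FROM `ga_sessions_20191031`, UNNEST(hits) AS hits, UNNEST(hits.customDimensions) AS cds
  )
);

EXECUTE IMMEDIATE FORMAT("""
  SELECT visitId, hits.hitNumber AS hits_hitNumber, %s
  FROM `ga_sessions_20191031` AS sessions, UNNEST(hits) AS hits
  ORDER BY visitId, hits_hitNumber
  LIMIT 50
""", cd_exprs);
```

The drawback is that the output schema changes whenever a new dimension appears, which downstream reports may need to tolerate.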

Migrating from Legacy SQL: options for "WITHIN RECORD" with Standard SQL

I am trying to migrate to Standard SQL from BigQuery Legacy SQL. The Legacy product offered the ability to query "WITHIN RECORD" which came in handy on numerous occasions.
I am looking for an efficient alternative to WITHIN RECORD. I could always just use a few subqueries and join them but wondering if there may be a more efficient way using ARRAY + ORDINAL.
EXAMPLE: Consider the following Standard SQL
WITH
sessPageVideoPlays AS (
SELECT fullVisitorId, visitNumber, h.page.pagePath,
# This would previously use WITHIN RECORD in Legacy SQL:
ARRAY( SELECT eventInfo.eventAction FROM UNNEST(hits)
WHERE eventInfo.eventCategory="videoPlay"
ORDER BY hitNumber DESC
)[ORDINAL(1)] AS lastVideoSeen
FROM
`proj.ga_sessions`, UNNEST(hits) as h
GROUP BY fullVisitorId, visitNumber, h.page.pagePath, lastVideoSeen
)
SELECT
pagePath, lastVideoSeen, numOccur
FROM
(SELECT
pagePath, lastVideoSeen, count(1) numOccur
FROM
sessPageVideoPlays
GROUP BY
pagePath, lastVideoSeen
)
Resulting output:
Questions:
1) I would like to see the last video play event on a given page, which is what I used to accomplish using WITHIN RECORD but am attempting with the ARRAY + ORDINAL approach shown above. However, for this to work, I'm thinking the SELECT statement within ARRAY() must get synchronized to the outer record since it is now flattened? Is that accurate?
2) I would also like to get a COUNT of DISTINCT videos played on a given page, and I'm wondering whether the more efficient approach would be joining to a separate query OR inserting another inline aggregate function, like done with ARRAY above.
Any suggestions would be appreciated.
1) I would like to see the last video play event on a given page,
which is what I used accomplish using WITHIN RECORD but am attempting
the ARRAY + ORDINAL approach shown above. However for this to work,
I'm thinking the SELECT statement within ARRAY() must get synchronized
to the outer record since it is now flattened? Is that accurate?
I think that is correct. In your query, the UNNEST(hits) in the inner query would be independent of the outer UNNEST, which is probably not what you wanted.
One way to write it is this:
WITH
sessPageVideoPlays AS (
SELECT fullVisitorId, visitNumber,
ARRAY(
SELECT AS STRUCT pagePath, lastVideoSeen FROM (
SELECT
page.pagePath,
eventInfo.eventAction AS lastVideoSeen,
ROW_NUMBER() OVER (PARTITION BY page.pagePath ORDER BY hitNumber DESC) AS rank
FROM UNNEST(hits)
WHERE eventInfo.eventCategory="videoPlay")
WHERE rank = 1
) AS lastVideoSeenOnPage
FROM
`proj.ga_sessions`
)
SELECT
pagePath, lastVideoSeen, numOccur
FROM (
SELECT
pagePath, lastVideoSeen, count(1) numOccur
FROM
sessPageVideoPlays, UNNEST(lastVideoSeenOnPage)
GROUP BY
pagePath, lastVideoSeen
)
2) I would also like get a COUNT of DISTINCT videos played on a given
page and wondering if more efficient approach would be joining to a
separate query OR inserting another inline aggregate function, like
done with ARRAY above.
I think both are OK, but inserting another inline aggregate function keeps the evaluations closer together, so it might be a bit easier for the query engine to optimize if there is a chance to do so.
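For (2), an inline aggregate for the distinct count could look like the following sketch (per session and per page, reusing the table and field names from the query above; videosPerPage and distinctVideosPlayed are illustrative aliases):

```sql
SELECT fullVisitorId, visitNumber,
  ARRAY(
    SELECT AS STRUCT
      page.pagePath,
      COUNT(DISTINCT eventInfo.eventAction) AS distinctVideosPlayed
    FROM UNNEST(hits)
    WHERE eventInfo.eventCategory = "videoPlay"
    GROUP BY page.pagePath
  ) AS videosPerPage
FROM `proj.ga_sessions`
```

A session-level total could then be obtained by unnesting videosPerPage and summing, much like the lastVideoSeenOnPage array is consumed above.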

Trying to understand partition by statement

I've seen other posts on SO about getting ones head around partition and order by. Kinda get it but still a bit confused.
Here is a query provided by my colleague that works:
SELECT EMAIL, SUBSCRIPTION_NAME, SOURCE, BILLING_SYSTEM,
RATE_PLAN, NEXT_CHARGE_DATE, SERVICE_ACTIVATION_DATE, CONTRACT_EFFECTIVE_DATE,
SUBSCRIPTION_END_DATE, STATUS, LAST_MODIFIED_DATE, PRODUCT_NAME,
RATE_PLAN_NAME, LOAD_DATE
FROM theDB
QUALIFY COUNT(*) OVER (PARTITION BY EMAIL,CONTRACT_EFFECTIVE_DATE ) > 1
Is this query saying, in plain English: return the selected fields only where the count of records for CONTRACT_EFFECTIVE_DATE appears more than once for each EMAIL?
Put another way, is it doing the following? (This version does not run; I'm using Teradata and receive the error "Improper use of aggregate function". When I see that message, should I think "use QUALIFY and PARTITION BY"?)
SELECT EMAIL, SUBSCRIPTION_NAME, SOURCE, BILLING_SYSTEM,
RATE_PLAN, NEXT_CHARGE_DATE, SERVICE_ACTIVATION_DATE, CONTRACT_EFFECTIVE_DATE,
SUBSCRIPTION_END_DATE, STATUS, LAST_MODIFIED_DATE, PRODUCT_NAME,
RATE_PLAN_NAME, LOAD_DATE
FROM RDMATBLSANDBOX.TmpNIMSalesForceDB
WHERE COUNT(CONTRACT_EFFECTIVE_DATE) >1
GROUP BY EMAIL
Not quite. Your query, if it ran, would return one row per email (at least that is how MySQL interprets this non-standard syntax). The original version will return multiple rows for each email.
The equivalent query is essentially:
select q.*
from (<your query here>
) q join
(select EMAIL, CONTRACT_EFFECTIVE_DATE
from theDB
group by EMAIL, CONTRACT_EFFECTIVE_DATE
having count(*) > 1
) filter
on q.email = filter.email and q.CONTRACT_EFFECTIVE_DATE = filter.CONTRACT_EFFECTIVE_DATE;
There is a subtle difference, which is usually immaterial. Your version will recognize NULL values in either or both fields. This version will filter those out, even if there are duplicates.
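On engines without QUALIFY, the same filter is also commonly written by moving the window function into a derived table; this is a sketch of the original query in that form (cnt and x are illustrative aliases):

```sql
SELECT EMAIL, SUBSCRIPTION_NAME, SOURCE, BILLING_SYSTEM,
       RATE_PLAN, NEXT_CHARGE_DATE, SERVICE_ACTIVATION_DATE, CONTRACT_EFFECTIVE_DATE,
       SUBSCRIPTION_END_DATE, STATUS, LAST_MODIFIED_DATE, PRODUCT_NAME,
       RATE_PLAN_NAME, LOAD_DATE
FROM (
  SELECT t.*,
         -- how many rows share this EMAIL + CONTRACT_EFFECTIVE_DATE pair
         COUNT(*) OVER (PARTITION BY EMAIL, CONTRACT_EFFECTIVE_DATE) AS cnt
  FROM theDB t
) x
WHERE cnt > 1
```

QUALIFY is essentially shorthand for this pattern: it filters on a window function after the windows are computed, just as the outer WHERE does here.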
EDIT:
If you just want the list of emails, use group by:
select email
from theDB t
where CONTRACT_EFFECTIVE_DATE between #start and #end
group by email
having count(*) = 5
(or whatever the specific conditions are).
If you need more information about the email or joins, join back to the original tables.
When you are comfortable with this process, you can think about using window/analytic functions to do the same thing. My concern is that the conditions that you really want may become more complicated and doing the logic in two steps (get the emails, get the additional information) will help you refine it.

Is there a way to select table_id in a Bigquery Table Wildcard Query

I have a set of day-sharded data where individual entries do not contain the day. I would like to use table wildcards to select all available data and get back data that is grouped by both the column I am interested in and the day that it was captured. Something, in other words, like this:
SELECT table_id, identifier, Sum(AppAnalytic) as AppAnalyticCount
FROM (TABLE_QUERY(database_main,'table_id CONTAINS "Title_" AND length(table_id) >= 4'))
GROUP BY identifier, table_id order by AppAnalyticCount DESC LIMIT 10
Of course, this does not actually work because table_id is not visible in the table aggregation resulting from the TABLE_QUERY function. Is there any way to accomplish this? Some sort of join on table metadata perhaps?
This functionality is available now in BigQuery through _TABLE_SUFFIX pseudocolumn. Full documentation is at https://cloud.google.com/bigquery/docs/querying-wildcard-tables.
Couple of things to note:
You will need to use Standard SQL to enable table wildcards
You will have to alias _TABLE_SUFFIX to something else in your SELECT list; the following example illustrates it:
SELECT _TABLE_SUFFIX as table_id, ... FROM `MyDataset.MyTablePrefix_*`
Not available today, but something I'd love to have too. The team takes feature requests seriously, so thanks for filing this one :).
In the meantime, a workaround is doing a manual union of a SELECT of each table, plus an additional column with the date data.
For example, instead of:
SELECT x, #TABLE_ID
FROM table201401, table201402, table201403
You could do:
SELECT x, month
FROM
(SELECT x, '201401' AS month FROM table201401),
(SELECT x, '201402' AS month FROM table201402),
(SELECT x, '201403' AS month FROM table201403)

SQL Having on columns not in SELECT

I have a table with 3 columns:
userid mac_address count
The entries for one user could look like this:
57193 001122334455 42
57193 000C6ED211E6 15
57193 FFFFFFFFFFFF 2
I want to create a view that displays only those MAC's that are considered "commonly used" for this user. For example, I want to filter out the MAC's that are used <10% compared to the most used MAC-address for that user. Furthermore I want 1 row per user. This could easily be achieved with a GROUP BY, HAVING & GROUP_CONCAT:
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
And indeed, the result is as follows:
57193 001122334455,000C6ED211E6 42
However I really don't want the count-column in my view. But if I take it out of the SELECT statement, I get the following error:
#1054 - Unknown column 'count' in 'having clause'
Is there any way I can perform this operation without being forced to have a nasty count-column in my view? I know I can probably do it using inner queries, but I would like to avoid doing that for performance reasons.
Your help is very much appreciated!
Since HAVING explicitly refers to the column names in the select list, what you want is not possible directly.
However, you can use your SELECT as a subselect of an outer SELECT that returns only the columns you want.
SELECT a.userid, a.macs
FROM
(
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
) as a
UPDATE:
Because of a limitation of MySQL, this is not possible there, although it works in other DBMSs such as Oracle.
One solution would be to create a view for the subquery. Another solution seems cleaner:
CREATE VIEW YOUR_VIEW (userid, macs) AS
SELECT userid, GROUP_CONCAT(mac_address SEPARATOR ',') AS macs, count
FROM mactable
GROUP BY userid
HAVING count*10 >= MAX(count)
This will declare the view as returning only the columns userid and macs although the underlying SELECT statement returns more columns than those two.
Although I am not sure whether MySQL supports this or not...