Bad performance using _TABLE_SUFFIX filter in BigQuery data source (Google Looker Studio / Google Data Studio)

CONTEXT
Hi,
in BigQuery, I have a table that is partitioned by an integer that can range from 0 to 999.
Every time I use this data source in Looker Studio for reporting, I filter this column with a parameter to hit the right partition; after that, another filter is applied on the date column.
The queries are fast but very expensive.
GOAL
To reduce cost, I split the table into 1,000 sharded tables in my BigQuery project, each partitioned by date and queried through a wildcard.
So,
before: I had my_project.big_table, partitioned by id;
now: I have my_project.table_*, partitioned by date, and I can use _TABLE_SUFFIX to select the right table.
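For concreteness, here is a sketch of how one of the 1,000 sharded tables could be created (keeping the two-part table names used in this question; dropping the id column is an assumption, since the id is now encoded in the table suffix):
CREATE TABLE `my_project.table_400`
PARTITION BY date
AS
SELECT * EXCEPT (id)  -- id 400 is now encoded in the table suffix (assumption)
FROM `my_project.big_table`
WHERE id = 400;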
In Looker Studio, I changed the custom query for the data source from:
SELECT a.*
FROM `my_project.big_table` AS a
WHERE a.date BETWEEN PARSE_DATE('%Y%m%d', #DS_START_DATE) AND PARSE_DATE('%Y%m%d', #DS_END_DATE)
AND a.id = #id1
AND a.user_email = #DS_USER_EMAIL
to:
SELECT a.*
FROM `my_project.table_*` AS a
WHERE a.date BETWEEN PARSE_DATE('%Y%m%d', #DS_START_DATE) AND PARSE_DATE('%Y%m%d', #DS_END_DATE)
AND a._TABLE_SUFFIX = #id1
AND a.user_email = #DS_USER_EMAIL
ISSUE DESCRIPTION
The change above caused a dramatic regression in the performance of the dashboard.
Every page now takes more than 5 minutes to return results; before, pages loaded in less than 10 seconds.
I tried to use:
the parameter #id1 directly in the FROM clause, but it is not automatically substituted and it causes the error Not found: Table my_project.table_#{id1} was not found in location EU;
an EXECUTE IMMEDIATE statement (sketched below), but it is not recognized by the tool.
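For reference, this is roughly the kind of EXECUTE IMMEDIATE script that runs in the BigQuery console but that Looker Studio rejects (the hard-coded id and date values are placeholders):
-- Placeholder for the #id1 parameter that Looker Studio would normally inject.
DECLARE id1 STRING DEFAULT '400';
EXECUTE IMMEDIATE FORMAT("""
  SELECT a.*
  FROM `my_project.table_%s` AS a
  WHERE a.date BETWEEN @start AND @end
""", id1)
USING DATE '2023-01-01' AS start, DATE '2023-01-31' AS end;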
When I hard-code one of the 1,000 table suffixes directly, for example id 400:
SELECT a.*
FROM `my_project.table_400` AS a
WHERE a.date BETWEEN PARSE_DATE('%Y%m%d', #DS_START_DATE) AND PARSE_DATE('%Y%m%d', #DS_END_DATE)
AND a.user_email = #DS_USER_EMAIL
performance is exactly the same as before, but I need the filter to stay parameterized for reporting.
I know that wildcard tables are limited in many respects (caching, for example), but when I test the query directly in BigQuery it completes in about a second.
Is there something I am missing, or something I can change in the query?
Do you have some advice/suggestions?
Many thanks!

Related

How do I query GA RealtimeView with Standard SQL in BigQuery?

When exporting Google Analytics data to Google BigQuery, you can set up a realtime table that is populated with Google Analytics data in real time. However, this table will contain duplicates due to the eventually consistent nature of distributed computing.
To overcome this, Google provides a view where the duplicates are filtered out. However, this view is not queryable with Standard SQL.
If I try querying it with Standard SQL, I get:
Cannot reference a legacy SQL view in a standard SQL query
We have standardized on Standard SQL, and I am hesitant to rewrite all our batch queries in legacy SQL for when we want to use them on realtime data. Is there a way to switch the realtime view to be a standard view?
EDIT:
This is the view definition (which is recreated every day by Google):
SELECT *
FROM [111111.ga_realtime_sessions_20190625]
WHERE exportKey IN (
  SELECT exportKey
  FROM (
    SELECT
      exportKey,
      exportTimeUsec,
      MAX(exportTimeUsec) OVER (PARTITION BY visitKey) AS maxexportTimeUsec
    FROM [111111.ga_realtime_sessions_20190625])
  WHERE exportTimeUsec >= maxexportTimeUsec);
You can create a logical view like this using standard SQL:
CREATE VIEW dataset.realtime_view_20190625 AS
SELECT
  visitKey,
  -- Keep only the most recent row per visitKey, then flatten its
  -- remaining fields back into top-level columns.
  ARRAY_AGG(
    (SELECT AS STRUCT t.* EXCEPT (visitKey))
    ORDER BY exportTimeUsec DESC LIMIT 1)[OFFSET(0)].*
FROM dataset.ga_realtime_sessions_20190625 AS t
GROUP BY visitKey
This selects the most recent row for each visitKey. If you want to generalize this across days, you can do something like this:
CREATE VIEW dataset.realtime_view AS
SELECT
  CONCAT('20', _TABLE_SUFFIX) AS date,
  visitKey,
  ARRAY_AGG(
    (SELECT AS STRUCT t.* EXCEPT (visitKey))
    ORDER BY exportTimeUsec DESC LIMIT 1)[OFFSET(0)].*
FROM `dataset.ga_realtime_sessions_20*` AS t
GROUP BY date, visitKey
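As a usage sketch (assuming the generalized view above), the deduplicated data can then be queried like any ordinary table:
SELECT date, COUNT(*) AS sessions
FROM dataset.realtime_view
WHERE date = '20190625'  -- date is a STRING built from the table suffix
GROUP BY date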

Bigquery Select all latest partitions from a wildcard set of tables

We have a set of Google BigQuery tables which are all distinguished by a wildcard for technical reasons, for example content_owner_asset_metadata_*. These tables are updated daily, but at different times.
We need to select the latest partition from each table in the wildcard.
Right now we are using this query to build our derived tables:
SELECT
*
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME = (
SELECT
MIN(time)
FROM (
SELECT
MAX(_PARTITIONTIME) as time
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
)
This statement finds the date that all the up-to-date tables are guaranteed to have and selects that date's data. However, I need a filter that selects the data from the maximum partition time of each table. I know that I'd need to use _TABLE_SUFFIX together with _PARTITIONTIME, but I cannot quite work out how to make the select work without loading all our data (very costly) and applying a standard greatest-n-per-group solution.
We cannot just union a bunch of static tables, as our dataset ingestion is liable to change and the scripts we build need to be able to accommodate that.
With BigQuery scripting (in beta at the time of writing), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. In the subsequent query, the scripting variable is then used as a filter to prune the partitions to be scanned.
The example below uses a BigQuery public dataset to demonstrate how to prune the scan to only the latest day of data.
-- Capture the most recent partition time once, in a scripting variable.
DECLARE max_date TIMESTAMP
  DEFAULT (SELECT MAX(_PARTITIONTIME) FROM `bigquery-public-data.sec_quarterly_financials.numbers`);
-- Because max_date is now a constant, BigQuery can prune the scan
-- to that single partition.
SELECT * FROM `bigquery-public-data.sec_quarterly_financials.numbers`
WHERE _PARTITIONTIME = max_date;
With INFORMATION_SCHEMA.PARTITIONS (in preview as of this posting), this can also be achieved by joining against the PARTITIONS metadata table as follows (e.g. with HOUR partitioning):
SELECT i.*
FROM `project.dataset.prefix_*` i
JOIN (
  -- For each sharded table, keep only its most recent partition_id.
  SELECT * EXCEPT (r)
  FROM (
    SELECT *,
      ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY partition_id DESC) AS r
    FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
    WHERE table_name LIKE "%prefix%"
      AND partition_id NOT IN ("__NULL__", "__UNPARTITIONED__"))
  WHERE r = 1) p
ON (FORMAT_TIMESTAMP("%Y%m%d%H", i._PARTITIONTIME) = p.partition_id
  AND CONCAT("prefix_", i._TABLE_SUFFIX) = p.table_name)
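To sanity-check what partition_id values look like before wiring up the join, the metadata table can be inspected directly (a sketch, assuming the same project.dataset and shard prefix as above):
SELECT table_name, partition_id, total_rows
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE "%prefix%"
  AND partition_id NOT IN ("__NULL__", "__UNPARTITIONED__")
ORDER BY table_name, partition_id DESC;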

Bug or new behavior in BigQuery?

Since two days ago (August 10th 2016), a query which used to work (using tables of the BQ Export for Google Analytics Premium) has stopped working. It returns the following error:
Error: Cannot union tables : Incompatible types.
'hits.latencyTracking.userTimingVariable' : TYPE_INT64
'hits.latencyTracking.userTimingVariable' : TYPE_STRING
After some investigation, it seems to be a problem with using IN in a WHERE clause when I query tables from both before and after August 10th (table ga_sessions_20160810).
I've simplified my original query to provide a dummy one which has the same basic structure. The following query works (querying data from 2016-08-08 and 2016-08-09):
SELECT fullVisitorId, sum(totals.visits)
FROM (select * from TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-08'),TIMESTAMP('2016-08-09')))
WHERE fullVisitorId in(
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-08'),TIMESTAMP('2016-08-09'))
)
GROUP BY fullVisitorId
But this other one (just changing the dates, in this case to 2016-08-09 and 2016-08-10) returns the error:
SELECT fullVisitorId, sum(totals.visits)
FROM (select * from TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10')))
WHERE fullVisitorId in(
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10'))
)
GROUP BY fullVisitorId
This last query works fine either if I delete the WHERE clause or if I run the query inside the IN on its own, so I guess the problem is with the WHERE field IN (...) structure. Furthermore, querying only data from 2016-08-10 does work. The same happens with a field other than fullVisitorId, and when running the same queries in different BQ projects.
Looking at the error description, it should be a problem with column types, but I don't know what hits.latencyTracking.userTimingVariable is. My query used to work properly, so I can't figure out what has changed to produce the error. Have some fields changed their type, or what happened?
Has anyone experienced this? Is this a bug or a new behavior in BigQuery? How can this error be solved?
Because you are using * in the SELECT clause, the union fails when it tries to combine two different column types (the schema of hits.latencyTracking.userTimingVariable changed from INT64 to STRING).
There are two approaches:
1) Select only the fields you need instead of using * in the SELECT clause:
SELECT fullVisitorId, sum(totals.visits)
FROM (select fullVisitorId,totals.visits from TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10')))
WHERE fullVisitorId in(
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_],TIMESTAMP('2016-08-09'),TIMESTAMP('2016-08-10'))
) GROUP BY fullVisitorId
2) Use views to split out the inner queries, and reference the views later in the main query (even in the views, select only the fields that are required):
SELECT fullVisitorId, sum(totals.visits)
FROM [view.innertable2]
WHERE fullVisitorId in(
SELECT fullVisitorId from [view.innertable1] ) GROUP BY fullVisitorId
This excludes hits.latencyTracking.userTimingVariable, so there will be no error.
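For completeness, a sketch of what the two views might contain (legacy SQL, matching the bracketed names above; the exact field lists are an assumption based on the outer query):
-- [view.innertable1]: visitor ids across the date range
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_], TIMESTAMP('2016-08-09'), TIMESTAMP('2016-08-10'))
-- [view.innertable2]: only the fields the outer query needs
SELECT fullVisitorId, totals.visits
FROM TABLE_DATE_RANGE([XXXXXXXX.ga_sessions_], TIMESTAMP('2016-08-09'), TIMESTAMP('2016-08-10'))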
If the fields that you are querying are compatible, you may try using Standard SQL wildcard tables (you'll have to uncheck the 'Use Legacy SQL' box if you are doing this from the UI). Something like this:
SELECT fullVisitorId, sum(totals.visits)
FROM `xxxxxxxx.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160808' and '20160810'
GROUP BY fullVisitorId;

Is there a way to select table_id in a Bigquery Table Wildcard Query

I have a set of day-sharded data where individual entries do not contain the day. I would like to use table wildcards to select all available data and get back data that is grouped by both the column I am interested in and the day that it was captured. Something, in other words, like this:
SELECT table_id, identifier, Sum(AppAnalytic) as AppAnalyticCount
FROM (TABLE_QUERY(database_main,'table_id CONTAINS "Title_" AND length(table_id) >= 4'))
GROUP BY identifier, table_id order by AppAnalyticCount DESC LIMIT 10
Of course, this does not actually work because table_id is not visible in the table aggregation resulting from the TABLE_QUERY function. Is there any way to accomplish this? Some sort of join on table metadata perhaps?
This functionality is available now in BigQuery through _TABLE_SUFFIX pseudocolumn. Full documentation is at https://cloud.google.com/bigquery/docs/querying-wildcard-tables.
A couple of things to note:
You will need to use Standard SQL to enable table wildcards.
You will have to alias _TABLE_SUFFIX to something else in your SELECT list; the following example illustrates it:
SELECT _TABLE_SUFFIX as table_id, ... FROM `MyDataset.MyTablePrefix_*`
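Tying this back to the query in the question, a fuller sketch might look like this (MyDataset.MyTablePrefix_* stands in for the real shard prefix; identifier and AppAnalytic come from the question):
SELECT
  _TABLE_SUFFIX AS table_id,
  identifier,
  SUM(AppAnalytic) AS AppAnalyticCount
FROM `MyDataset.MyTablePrefix_*`
GROUP BY table_id, identifier
ORDER BY AppAnalyticCount DESC
LIMIT 10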
Not available today, but something I'd love to have too. The team takes feature requests seriously, so thanks for asking for this one :).
In the meantime, a workaround is doing a manual union of a SELECT of each table, plus an additional column with the date data.
For example, instead of:
SELECT x, #TABLE_ID
FROM table201401, table201402, table201403
You could do:
SELECT x, month
FROM
(SELECT x, '201401' AS month FROM table201401),
(SELECT x, '201402' AS month FROM table201402),
(SELECT x, '201403' AS month FROM table201403)
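In today's Standard SQL, the same workaround would be written with UNION ALL (a sketch; the MyDataset qualifier is an assumption):
SELECT x, '201401' AS month FROM `MyDataset.table201401`
UNION ALL
SELECT x, '201402' AS month FROM `MyDataset.table201402`
UNION ALL
SELECT x, '201403' AS month FROM `MyDataset.table201403`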