When exporting Google Analytics data to Google BigQuery, you can set up a realtime table that is populated with Google Analytics data in real time. However, this table will contain duplicates due to the eventually consistent nature of distributed computing.
To overcome this, Google provides a view where the duplicates are filtered out. However, this view is not queryable with Standard SQL.
If I try querying it with Standard SQL, I get:
Cannot reference a legacy SQL view in a standard SQL query
We have standardized on Standard SQL, and I am hesitant to rewrite all our batch queries in legacy SQL for when we want to use them on realtime data. Is there a way to switch the realtime view to be a Standard SQL view?
EDIT:
This is the view definition (which is recreated every day by Google):
SELECT *
FROM [111111.ga_realtime_sessions_20190625]
WHERE exportKey IN (
  SELECT exportKey
  FROM (
    SELECT
      exportKey,
      exportTimeUsec,
      MAX(exportTimeUsec) OVER (PARTITION BY visitKey) AS maxexportTimeUsec
    FROM [111111.ga_realtime_sessions_20190625])
  WHERE exportTimeUsec >= maxexportTimeUsec);
You can create a logical view like this using standard SQL:
CREATE VIEW dataset.realtime_view_20190625 AS
SELECT
  visitKey,
  ARRAY_AGG(
    (SELECT AS STRUCT t.* EXCEPT (visitKey))
    ORDER BY exportTimeUsec DESC LIMIT 1)[OFFSET(0)].*
FROM dataset.ga_realtime_sessions_20190625 AS t
GROUP BY visitKey
This selects the most recent row for each visitKey. If you want to generalize this across days, you can do something like this:
CREATE VIEW dataset.realtime_view AS
SELECT
  CONCAT('20', _TABLE_SUFFIX) AS date,
  visitKey,
  ARRAY_AGG(
    (SELECT AS STRUCT t.* EXCEPT (visitKey))
    ORDER BY exportTimeUsec DESC LIMIT 1)[OFFSET(0)].*
FROM `dataset.ga_realtime_sessions_20*` AS t
GROUP BY date, visitKey
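For example, querying the generalized view for a single day could look roughly like this (a sketch only; the deduplicated session count is just an illustration):
-- Rough usage sketch: count deduplicated realtime sessions for one day
-- via the realtime_view defined above.
SELECT
  date,
  COUNT(*) AS sessions
FROM dataset.realtime_view
WHERE date = '20190625'
GROUP BY date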
CONTEXT
Hi,
in BigQuery, I have a table that is partitioned by an integer that can range from 0 to 999.
Every time I use this data source in Looker Studio for reporting, I filter this column using a parameter to get the right partition; after that, another filter is applied on the date column.
The queries are fast but very expensive.
GOAL
To reduce cost, I split the table into 1,000 wildcard tables in my BigQuery project and partitioned each of them by date.
So,
before: I had my_project.big_table partitioned by id;
now: I have my_project.table_* partitioned by date, and I can use _TABLE_SUFFIX to get the right table (see the sketch below).
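For illustration, the split could be produced with statements roughly like the following (a rough sketch of the idea, not the exact DDL used):
-- Rough sketch: one date-partitioned table per id value, shown here for id = 400
-- and repeated for each of the 1,000 ids.
CREATE TABLE `my_project.table_400`
PARTITION BY date AS
SELECT *
FROM `my_project.big_table`
WHERE id = 400;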
In Looker Studio, I changed the custom query for the data source from:
SELECT a.*
FROM `my_project.big_table` AS a
WHERE a.date BETWEEN PARSE_DATE('%Y%m%d', #DS_START_DATE) AND PARSE_DATE('%Y%m%d', #DS_END_DATE)
AND a.id = #id1
AND a.user_email = #DS_USER_EMAIL
to:
SELECT a.*
FROM `my_project.table_*` AS a
WHERE a.date BETWEEN PARSE_DATE('%Y%m%d', #DS_START_DATE) AND PARSE_DATE('%Y%m%d', #DS_END_DATE)
AND a._TABLE_SUFFIX = #id1
AND a.user_email = #DS_USER_EMAIL
ISSUE DESCRIPTION
The change above caused a dramatic drop in the performance of the dashboard.
Every page now takes more than 5 minutes to return results, whereas before the pages loaded in less than 10 seconds.
I tried to use:
the parameter #id1 directly in the FROM clause, but it is not automatically substituted and it causes the error Not found: Table my_project.table_#{id1} was not found in location EU
an EXECUTE IMMEDIATE statement, but it is not recognized by the tool (a rough sketch of what I mean is shown below)
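For reference, the kind of EXECUTE IMMEDIATE statement meant above looks roughly like this (a sketch only; the declared id1 variable and the date literals merely stand in for the report parameters):
-- Rough sketch of the rejected EXECUTE IMMEDIATE approach; id1 and the
-- date literals are stand-ins for the Looker Studio parameters.
DECLARE id1 INT64 DEFAULT 400;
EXECUTE IMMEDIATE FORMAT("""
  SELECT a.*
  FROM `my_project.table_%d` AS a
  WHERE a.date BETWEEN @start_date AND @end_date
""", id1)
USING DATE '2024-01-01' AS start_date,
      DATE '2024-01-31' AS end_date;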
When I directly use one of the 1,000 table suffixes, for example id 400:
SELECT a.*
FROM `my_project.table_400` AS a
WHERE a.date BETWEEN PARSE_DATE('%Y%m%d', #DS_START_DATE) AND PARSE_DATE('%Y%m%d', #DS_END_DATE)
AND a.user_email = #DS_USER_EMAIL
the performance is exactly the same as before, but I still need to filter by id for reporting.
I know that wildcard tables are limited in many respects (caching, for example), but when I test the query directly in BigQuery it completes in under a second.
Is there something I am missing or could change in the query?
Do you have any advice or suggestions?
Many thanks!
I am trialling materialised views in our BQ eventing system but have hit a roadblock.
For context:
Our source event ingest tables use streaming inserts only (append only), are partitioned by event time (more or less true time, but always in order with respect to the entity involved in the event stream), and we extract a given entity's 'latest' / most recent full state. I feel that with data being append-only and history immutable there could be benefits here, but I currently cannot get it to work (yet).
A lot of our base BQ code is spent determining what the 'latest' state of an entity is. This latest state is baked into the payload of the most recent event ingested into that table, e.g. for an OrderAccepted followed later by an OrderItemsDespatched event (for the same OrderId), the OrderItemsDespatched event will have the most up-to-date snapshot of the order (after processing the items' dispatch).
Thus, in BQ for BI, we need to surface the most current state of that order, e.g. we need to extract the order struct from the OrderItemsDespatched event since it is the most recent event.
This could involve an analytic function:
ROW_NUMBER() OVER (PARTITION BY entityId ORDER BY EventOccurredTimestamp DESC)
and picking row = 1; however, analytic functions are not supported in MVs, and this is not as efficient anyway as the ARRAY_AGG approach below.
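Written out as a plain (non-materialized) query, the ROW_NUMBER approach would look roughly like the sketch below, reusing the field names from the MV definition that follows:
-- Rough sketch of the ROW_NUMBER variant (not valid inside a materialized view);
-- names mirror the MV definition below.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    oe.*,
    ROW_NUMBER() OVER (
      PARTITION BY orderEvent.latestOrder.id
      ORDER BY PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", event.eventOccurredTime) DESC
    ) AS rn
  FROM `project.dataset.order_events` oe
)
WHERE rn = 1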
CREATE MATERIALIZED VIEW dataset.order_events_latest_mv
PARTITION BY EventOccurredDate
CLUSTER BY OrderId
AS
WITH ord_events AS (
  SELECT
    oe.*,
    orderEvent.latestOrder.id AS OrderId,
    PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", event.eventOccurredTime) AS EventOccurredTimestamp,
    EXTRACT(DATE FROM PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", event.eventOccurredTime)) AS EventOccurredDate,
  FROM
    `project.dataset.order_events` oe
),
ord_events_latest AS (
  SELECT
    ARRAY_AGG(
      e ORDER BY EventOccurredTimestamp DESC LIMIT 1
    )[OFFSET(0)].*
  FROM
    ord_events e
  GROUP BY
    e.OrderId
)
SELECT
  *
FROM
  ord_events_latest
However, this fails with the error:
Materialized view query contains unsupported feature.
Fundamentally, we could save a heck of a lot of current processing and cost by only processing changed data rather than scanning all the data every time, which, given it's an append-only, partitioned source table, seems feasible?
The logic would be quite similar for deduplicating our events, which we also do a lot, with a slightly different query but again using ARRAY_AGG (see the sketch below).
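For illustration, the dedup variant would look roughly like the following sketch (eventId as the duplicate key is a hypothetical field used only for illustration):
-- Rough sketch of the dedup pattern; event.eventId is a hypothetical duplicate key.
WITH evts AS (
  SELECT
    oe.*,
    event.eventId AS DedupKey,
    PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", event.eventOccurredTime) AS EventOccurredTimestamp
  FROM `project.dataset.order_events` oe
)
SELECT
  ARRAY_AGG(e ORDER BY EventOccurredTimestamp DESC LIMIT 1)[OFFSET(0)].*
FROM evts e
GROUP BY e.DedupKey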
Any advice welcome; hopefully support for the feature the error message mentions is not far off. Thanks!
I hope it works:
WITH
latest_records AS
(
  -- Pack each entity's columns into one delimited string; because the
  -- timestamp comes first, MAX() picks the most recent record.
  SELECT
    entityId,
    SPLIT(MAX(CONCAT(
      CAST(EventOccurredTimestamp AS STRING), '||',
      Col1, '||',
      CAST(Col2 AS STRING), '||',
      CAST(Col3 AS STRING))), '||') AS values
  FROM `project.dataset.order_events`
  GROUP BY entityId
)
SELECT
  entityId,
  CAST(values[OFFSET(0)] AS TIMESTAMP) AS EventOccurredTimestamp,
  values[OFFSET(1)] AS Col1,                -- let's say it's a string
  CAST(values[OFFSET(2)] AS BOOL) AS Col2,  -- it's bool
  CAST(values[OFFSET(3)] AS INT64) AS Col3  -- it's int64
FROM latest_records
We have a set of Google BigQuery tables which are all distinguished by a wildcard for technical reasons, for example content_owner_asset_metadata_*. These tables are updated daily, but at different times.
We need to select the latest partition from each table in the wildcard.
Right now we are using this query to build our derived tables:
SELECT
*
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME = (
SELECT
MIN(time)
FROM (
SELECT
MAX(_PARTITIONTIME) as time
FROM
`project.content_owner_asset_metadata_*`
WHERE
_PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
)
This statement finds the date that all the up-to-date tables are guaranteed to have and selects that date's data; however, I need a filter that selects the data from the maximum partition time of each table. I know that I'd need to use _TABLE_SUFFIX with _PARTITIONTIME, but cannot quite work out how to make the select work without just loading all our data (very costly) and using a standard greatest-n-per-group solution.
We cannot just union a bunch of static tables, as our dataset ingestion is liable to change and the scripts we build need to be able to accommodate that.
With BigQuery scripting (in beta at the time of writing), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then, in the subsequent query, the scripting variable is used as a filter to prune the partitions to be scanned.
The example below uses a BigQuery public dataset to demonstrate how to prune partitions so that only the latest day of data is queried and scanned.
DECLARE max_date TIMESTAMP
DEFAULT (SELECT MAX(_PARTITIONTIME) FROM `bigquery-public-data.sec_quarterly_financials.numbers`);
SELECT * FROM `bigquery-public-data.sec_quarterly_financials.numbers`
WHERE _PARTITIONTIME = max_date;
With INFORMATION_SCHEMA.PARTITIONS (in preview as of posting), this can be achieved by joining to the PARTITIONS table as follows (e.g. with HOUR partitioning):
SELECT i.*
FROM `project.dataset.prefix_*` i
JOIN (
SELECT * EXCEPT (r)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY table_name ORDER BY partition_id DESC) AS r
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name LIKE "%prefix%"
AND partition_id NOT IN ("__NULL__", "__UNPARTITIONED__"))
WHERE r = 1) p
ON (FORMAT_TIMESTAMP("%Y%m%d%H", i._PARTITIONTIME) = p.partition_id
AND CONCAT("prefix_", i._TABLE_SUFFIX) = p.table_name)
I have Google Analytics data that's spread across multiple BigQuery datasets, all using the same schema. I would like to query multiple tables each across these datasets at the same time using BigQuery's new Standard SQL dialect. I know I can query multiple tables within a single database like so:
FROM `12345678`.`ga_sessions_2016*` s
WHERE s._TABLE_SUFFIX BETWEEN '0501' AND '0720'
What I can't figure out is how to query against not just 12345678 but also against 23456789 at the same time.
How about using a simple UNION, with a SELECT wrapping around it (I tested this using the new standard SQL option and it worked as expected):
SELECT
SUM(foo)
FROM (
SELECT
COUNT(*) AS foo
FROM
<YOUR_DATASET_1>.<YOUR_TABLE_1>
UNION ALL
SELECT
COUNT(*) AS foo
FROM
<YOUR_DATASET_2>.<YOUR_TABLE_2>)
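Applied to the ga_sessions example from the question, the same idea might look roughly like this (a sketch assuming both datasets share the export schema):
-- Rough sketch: UNION ALL the wildcard tables from both datasets
-- (dataset IDs taken from the question; schemas assumed identical).
SELECT s.*
FROM (
  SELECT * FROM `12345678.ga_sessions_2016*`
  WHERE _TABLE_SUFFIX BETWEEN '0501' AND '0720'
  UNION ALL
  SELECT * FROM `23456789.ga_sessions_2016*`
  WHERE _TABLE_SUFFIX BETWEEN '0501' AND '0720'
) s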
I believe that using a table wildcard & union (in BigQuery legacy SQL, a comma achieves the union) will get what you need very quickly, if the tables have the same schema.
select *
from
  (select * from TABLE_DATE_RANGE([dataset1.table_prefix_], date1, date2)),
  (select * from TABLE_DATE_RANGE([dataset2.table_prefix_], date3, date4)),
  ......
I have a set of day-sharded data where individual entries do not contain the day. I would like to use table wildcards to select all available data and get back data that is grouped by both the column I am interested in and the day that it was captured. Something, in other words, like this:
SELECT table_id, identifier, Sum(AppAnalytic) as AppAnalyticCount
FROM (TABLE_QUERY(database_main,'table_id CONTAINS "Title_" AND length(table_id) >= 4'))
GROUP BY identifier, table_id order by AppAnalyticCount DESC LIMIT 10
Of course, this does not actually work because table_id is not visible in the table aggregation resulting from the TABLE_QUERY function. Is there any way to accomplish this? Some sort of join on table metadata perhaps?
This functionality is available now in BigQuery through _TABLE_SUFFIX pseudocolumn. Full documentation is at https://cloud.google.com/bigquery/docs/querying-wildcard-tables.
Couple of things to note:
You will need to use Standard SQL to enable table wildcards
You will have to alias _TABLE_SUFFIX as something else in your SELECT list; the following example illustrates it:
SELECT _TABLE_SUFFIX as table_id, ... FROM `MyDataset.MyTablePrefix_*`
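Applied to the aggregation in the question, this might look roughly like the following sketch (assuming the day-sharded tables share a Title_ prefix, so a wildcard can replace TABLE_QUERY):
-- Rough sketch: _TABLE_SUFFIX aliased as table_id, assuming a shared Title_ prefix.
SELECT
  _TABLE_SUFFIX AS table_id,
  identifier,
  SUM(AppAnalytic) AS AppAnalyticCount
FROM `database_main.Title_*`
GROUP BY table_id, identifier
ORDER BY AppAnalyticCount DESC
LIMIT 10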
Not available today, but something I'd love to have too. The team takes feature requests seriously, so thanks for adding your support for this one :).
In the meantime, a workaround is doing a manual union of a SELECT of each table, plus an additional column with the date data.
For example, instead of:
SELECT x, #TABLE_ID
FROM table201401, table201402, table201403
You could do:
SELECT x, month
FROM
(SELECT x, '201401' AS month FROM table201401),
(SELECT x, '201402' AS month FROM table201402),
(SELECT x, '201403' AS month FROM table201403)