BigQuery: Querying repeated fields - google-bigquery

I'm trying to use the following query to get the rows for which the event names are equal to: EventGamePlayed, EventGetUserBasicInfos or EventGetUserCompleteInfos
select *
from [com_test_testapp_ANDROID.app_events_20170426]
where event_dim.name in ("EventGamePlayed", "EventGetUserBasicInfos", "EventGetUserCompleteInfos");
I'm getting the following error: Cannot query the cross product of repeated fields event_dim.name and user_dim.user_properties.value.index.
Is it possible to make this work without flattening the result?
Also, I'm not sure why the error is talking about the "user_dim.user_properties.value.index" field.

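The error comes from SELECT *, which selects every column, including other repeated fields such as user_dim.user_properties. Legacy SQL cannot return the cross product of independent repeated fields, which is why the message mentions user_dim.user_properties.value.index. Rather than using legacy SQL, try standard SQL, which doesn't have this problem with repeated field cross products: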
#standardSQL
SELECT *
FROM com_test_testapp_ANDROID.app_events_20170426
-- The alias differs from the column name so that SELECT * does not emit two columns named event_dim.
CROSS JOIN UNNEST(event_dim) AS event
WHERE event.name IN ("EventGamePlayed", "EventGetUserBasicInfos", "EventGetUserCompleteInfos");
You can read more about working with repeated fields/arrays in the Working with Arrays topic. If you are used to using legacy SQL, you can read about differences between legacy and standard SQL in BigQuery in the migration guide.

Related

"contains" in Bigquery standard SQL

I want to migrate from legacy SQL to standard SQL.
I had the following code in legacy SQL:
SELECT
hits.page.pageTitle
FROM [mytable]
WHERE hits.page.pageTitle contains '%'
And I tried this in Standard SQL:
SELECT
hits.page.pageTitle
FROM `mytable`
WHERE STRPOS(hits.page.pageTitle, "%")
But it gives me this error:
Error: Cannot access field page on a value with type ARRAY<STRUCT<...>> at [4:21]
Try this one:
SELECT
hits.page.pageTitle
FROM `table`,
UNNEST(hits) hits
WHERE REGEXP_CONTAINS(hits.page.pageTitle, r'%')
LIMIT 1000
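In the ga_sessions schema, "hits" is an ARRAY (that is, a field in REPEATED mode), so hits.page cannot be accessed directly. You need to apply UNNEST to work with the elements of an array in BigQuery. Note also that STRPOS returns an integer position rather than a boolean, so using it in WHERE requires a comparison such as STRPOS(hits.page.pageTitle, "%") > 0.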
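To make the UNNEST step more concrete, here is a rough Python model of what the comma join with UNNEST(hits) does: it pairs each row with each element of its own array, and the filter then runs per element. This is only an illustration under assumptions; the sample rows and the helper name unnest_hits are made up, and nothing here calls BigQuery.

```python
# Toy stand-in for a ga_sessions-style table: each row has a repeated "hits" field.
# The sample data is hypothetical.
rows = [
    {"visitId": 1, "hits": [{"page": {"pageTitle": "Home"}},
                            {"page": {"pageTitle": "50% off"}}]},
    {"visitId": 2, "hits": [{"page": {"pageTitle": "About"}}]},
    {"visitId": 3, "hits": []},
]


def unnest_hits(rows):
    """Yield one (row, hit) pair per array element, like FROM t, UNNEST(hits) hit."""
    for row in rows:
        for hit in row["hits"]:
            yield row, hit


# Equivalent in spirit to WHERE REGEXP_CONTAINS(hits.page.pageTitle, r'%').
titles = [
    hit["page"]["pageTitle"]
    for _row, hit in unnest_hits(rows)
    if "%" in hit["page"]["pageTitle"]
]
print(titles)
```

Note that the row with an empty array contributes no pairs at all, which matches how a cross join against an empty UNNEST drops the row.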

BigQuery: how to convert this legacy SQL to standardSQL?

I have a data import pipeline that loads into BigQuery tables. The hourly tables are named transactions_20170616_00, transactions_20170616_01, ... and there are also daily, weekly, ... rollups. I want a single view that always points to the latest table, but I found it hard to write one static standardSQL view that does this. My current solution is to update the view's content to SELECT * FROM project.dataset.transactions_201706.... after every successful import.
Then I came across httparchive's latest view. It does exactly what I want, but in legacy SQL. My project uses standardSQL only, and I prefer standardSQL because it's the future. Does anyone know how to convert this legacy SQL to standardSQL, so that I won't need to constantly update my view?
https://bigquery.cloud.google.com/table/httparchive:runs.latest_requests?tab=details
SELECT *
FROM TABLE_QUERY(httparchive:runs,
"table_id IN (
SELECT table_id FROM [httparchive:runs.__TABLES__]
WHERE REGEXP_MATCH(table_id, '2.*requests$')
ORDER BY table_id DESC LIMIT 1)")
Following this guide, I tried:
https://cloud.google.com/bigquery/docs/querying-wildcard-tables#the_table_query_function
#standardSQL
SELECT * FROM `httparchive.runs.*`
WHERE _TABLE_SUFFIX IN
( SELECT table_id
FROM httparchive.runs.__TABLES__
WHERE REGEXP_CONTAINS(table_id, r'2.*requests$')
ORDER BY table_id DESC
LIMIT 1)
but the query failed with:
Query Failed
Error: Views cannot be queried through prefix. Matched views are: httparchive:runs.latest_pages, httparchive:runs.latest_pages_mobile, httparchive:runs.latest_requests, httparchive:runs.latest_requests_mobile
Job ID: bidder-1183:bquijob_1400109e_15cb1dc3c0c
It seems the wildcard can only be used at the end of the table name? If so, why doesn't SELECT * FROM httparchive.runs.*_requests WHERE ... work?
Does this mean the Wildcard Tables feature in standardSQL isn't as flexible as TABLE_QUERY in legacySQL?
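For reference, the table selection that the legacy TABLE_QUERY filter performs can be modeled outside BigQuery. The Python sketch below mirrors the regex filter, the descending sort, and the LIMIT 1 from the query above. The table ids and the function name latest_requests_table are hypothetical, and the sketch does not address the wildcard or view restriction reported in the error.

```python
import re

# Hypothetical table ids, as they might come back from a __TABLES__ listing.
table_ids = [
    "2017_05_15_requests",
    "2017_06_01_pages",
    "2017_06_01_requests",
    "2017_06_01_requests_mobile",
    "latest_requests",
]


def latest_requests_table(ids):
    """Mirror: WHERE REGEXP_MATCH(table_id, '2.*requests$') ORDER BY table_id DESC LIMIT 1."""
    matches = [t for t in ids if re.search(r"2.*requests$", t)]
    # ORDER BY ... DESC LIMIT 1 on strings is the lexicographic maximum.
    return max(matches) if matches else None


print(latest_requests_table(table_ids))
```

The sort is lexicographic, so this only picks the newest table when the date prefix is zero-padded, as it is in these sample names.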

Google BigQuery Legacy Syntax Help Needed

I'm having trouble converting a Google BigQuery statement from Standard SQL
to Legacy SQL. For context, I have posted the Standard SQL and the respective table schema.
In a nutshell...the code below selects the 'latest' (AS-IS) version of a
Product Hierarchy for reporting. This was done with the use of STRUCTs in Standard SQL.
I'm not sure how to do this in legacy SQL.
Any help would be greatly appreciated!
clbarrineau
Standard SQL Example
SELECT STR_NBR
, SKU
, SKU_CRT_DT
, DS.*
, (
SELECT AS STRUCT
X.*
FROM (
SELECT *
, ROW_NUMBER() OVER(ORDER BY EFF_BGN_DT DESC) AS ROW_NUM
FROM SLS.PROD_HIER
) AS X
WHERE ROW_NUM = 1
) AS P_HIER
FROM `XXXX.YYYY.SKU_STR_SLS_20141201` SLS
, UNNEST(DAILY_SALES) AS DS;
Schema Definition
STR_NBR--------------------------------STRING-----------NULLABLE
SKU------------------------------------INTEGER----------NULLABLE
SKU_CRT_DT-----------------------------DATE-------------NULLABLE
DAILY_SALES----------------------------RECORD-----------REPEATED
DAILY_SALES.SLS_DT---------------------DATE-------------NULLABLE
DAILY_SALES.*(many other attributes) --XXXX-------------XXXX
PROD_HIER------------------------------RECORD-----------REPEATED
PROD_HIER.eff_bgn_dt-------------------DATE-------------NULLABLE
PROD_HIER.*(many other attributes) ----XXXX-------------XXXX
A couple of suggestions, though you may want to contact Tableau's support to ask what the status of being able to use standard SQL is as well. In some tools, it's possible to force standard SQL by putting #standardSQL at the top of the query.
For legacy SQL, instead of the comma operator with UNNEST, you'll need to use FLATTEN. Something like FLATTEN([XXXX:YYYY.SKU_STR_SLS_20141201], DAILY_SALES.SLS_DT), for example (legacy SQL table references go in square brackets, with a colon after the project name). Since you want to compute row numbers prior to flattening, though, you may need to apply FLATTEN to the subquery itself. My legacy SQL is a bit rusty, so I don't want to lead you astray with a non-functional query, but take a look at some of the other SO questions about FLATTEN to see how it's used.

Bigquery type cast on JOIN Clause

I'm trying to join two tables on two columns
-- query to join two tables
SELECT
*
FROM
[raw.raw_sales] AS game_records
JOIN
[facebook_aggregate.avg_aggregate] AS avg_aggregate
ON
(game_records.Away_team = avg_aggregate.team_name)
AND (game_records.game_date = avg_aggregate.time_update)
It gives me this error: Error: 10.30 - 10.56: Timestamp literal or explicit conversion to timestamp is required. This is because game_records.game_date is of type STRING and avg_aggregate.time_update is of type DATE.
But if I do the conversion within the JOIN..ON.. clause:
-- query to join two tables
SELECT
*
FROM
[raw.raw_sales] AS game_records
JOIN
[facebook_aggregate.avg_aggregate] AS avg_aggregate
ON
(game_records.Away_team = avg_aggregate.team_name)
AND (DATE(game_records.game_date) = DATE(avg_aggregate.time_update))
It gives me this error:
Error: ON clause must be AND of = comparisons of one field name from each table, with all field names prefixed with table name.
Is there any way to do this without creating an intermediate table? Thanks!
Try using standard SQL (uncheck "Use Legacy SQL" under "Show Options"). You shouldn't need to do anything aside from removing the brackets around the table names:
SELECT
*
FROM
raw.raw_sales AS game_records
JOIN
facebook_aggregate.avg_aggregate AS avg_aggregate
ON
game_records.Away_team = avg_aggregate.team_name
AND game_records.game_date = avg_aggregate.time_update;
BigQuery Standard SQL (see Enabling Standard SQL) does not have this limitation on the ON clause. Try running your query in Standard SQL.
As Elliott mentioned, make sure you are not using square brackets around table references. In Standard SQL, when you need to escape special characters, you should use backticks.
Also check Migrating from legacy SQL if you follow this direction.

BigQuery query creation without variables?

Coming from SQL Server and a little bit of MySQL, I'm not sure how to proceed on google's BigQuery web browser query tool.
There doesn't appear to be any way to create, use or Set/Declare variables. How are folks working around this? Or perhaps I have missed something obvious in the instructions or the nature of BigQuery? Java API?
It is now possible to declare and set variables using SQL. For more information, see the documentation, but here is an example:
-- Declare a variable to hold names as an array.
DECLARE top_names ARRAY<STRING>;
-- Build an array of the top 100 names from the year 2017.
SET top_names = (
SELECT ARRAY_AGG(name ORDER BY number DESC LIMIT 100)
FROM `bigquery-public-data`.usa_names.usa_1910_current
WHERE year = 2017
);
-- Which names appear as words in Shakespeare's plays?
SELECT
name AS shakespeare_name
FROM UNNEST(top_names) AS name
WHERE name IN (
SELECT word
FROM `bigquery-public-data`.samples.shakespeare
);
There is currently no way to set/declare variables in BigQuery. If you need variables, you'll need to cut and paste them where you need them. Feel free to file this as a feature request here.
It's not elegant, and it's a pain, but...
The way we handle it is with a Python script that replaces a "variable placeholder" in our query and then sends the amended query via the API.
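A minimal sketch of that placeholder approach, using only the Python standard library. The template text, the placeholder names $year and $limit, and the function render_query are assumptions for illustration, not the poster's actual script, and the step that sends the query to BigQuery is left out.

```python
from string import Template

# Hypothetical query with $-style placeholders standing in for "variables".
QUERY_TEMPLATE = Template("""
SELECT name
FROM `bigquery-public-data`.usa_names.usa_1910_current
WHERE year = $year
LIMIT $limit
""")


def render_query(year, limit):
    """Return the SQL text with the placeholders filled in."""
    # Coerce to int so that only numeric literals ever reach the SQL text.
    return QUERY_TEMPLATE.substitute(year=int(year), limit=int(limit))


sql = render_query(2017, 100)
print(sql)
```

Plain text substitution is only reasonable for trusted values. For values that come from users, BigQuery's standard SQL query parameters are the safer route.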
I have opened a feature request asking for "Dynamic SQL" capabilities.
If you want to avoid BQ scripting, you can sometimes use an idiom that combines WITH and CROSS JOIN.
In the example below:
the events table contains some timestamped events;
the reports table contains occasional aggregate values of the events;
the goal is to write a query that only generates incremental (non-duplicate) aggregate rows.
This is achieved by:
introducing a state temp table that looks at a target table for aggregate results,
to determine parameters (params) for the actual query;
CROSS JOINing the params with the actual query,
allowing the param row's columns to be used to constrain the query.
This query will repeatably return the same results
until the results themselves are appended to the reports table.
WITH state AS (
SELECT
-- what was the newest report's ending time?
-- (a scalar subquery must be wrapped in its own parentheses)
COALESCE(
(SELECT MAX(report_end_ts) FROM `x.y.reports`),
TIMESTAMP("2019-01-01")
) AS latest_report_ts,
...
),
params AS (
SELECT
-- look for events since end of last report
latest_report_ts AS event_after_ts,
-- and go until now
CURRENT_TIMESTAMP() AS event_before_ts
FROM state
)
SELECT
MIN(event_ts) AS report_begin_ts,
MAX(event_ts) AS report_end_ts,
COUNT(1) AS event_count,
SUM(errors) AS error_total
FROM `x.y.events`
CROSS JOIN params
WHERE event_ts > event_after_ts
AND event_ts < event_before_ts
This approach is useful for BigQuery scheduled queries.