BigQuery: how to convert this legacy SQL to standardSQL? - google-bigquery

I have a data import pipeline that loads into BigQuery tables (hourly tables named transactions_20170616_00, transactions_20170616_01, ..., plus more daily/weekly/... rollups). I want a single view that always points to the latest table, but I found it hard to write one static standardSQL view that does this. My current solution is to update the view's definition to SELECT * FROM project.dataset.transactions_201706.... after every successful import.
Then I read httparchive's latest view: it's exactly what I want, but in legacy SQL. My project uses standardSQL only, and I prefer standardSQL because it's the future. Does anyone know how to convert this legacy SQL to standardSQL? Then I won't need to constantly update my view.
https://bigquery.cloud.google.com/table/httparchive:runs.latest_requests?tab=details
SELECT *
FROM TABLE_QUERY(httparchive:runs,
  "table_id IN (
    SELECT table_id FROM [httparchive:runs.__TABLES__]
    WHERE REGEXP_MATCH(table_id, '2.*requests$')
    ORDER BY table_id DESC LIMIT 1)")
Following this guide, I'm trying to use:
https://cloud.google.com/bigquery/docs/querying-wildcard-tables#the_table_query_function
#standardSQL
SELECT * FROM `httparchive.runs.*`
WHERE _TABLE_SUFFIX IN (
  SELECT table_id
  FROM `httparchive.runs.__TABLES__`
  WHERE REGEXP_CONTAINS(table_id, r'2.*requests$')
  ORDER BY table_id DESC
  LIMIT 1)
but the query failed with:
Query Failed
Error: Views cannot be queried through prefix. Matched views are: httparchive:runs.latest_pages, httparchive:runs.latest_pages_mobile, httparchive:runs.latest_requests, httparchive:runs.latest_requests_mobile
Job ID: bidder-1183:bquijob_1400109e_15cb1dc3c0c
I found that the wildcard can only be used as the last part of the table name. In that case, why doesn't SELECT * FROM `httparchive.runs.*_requests` WHERE ... work?
Does this mean the Wildcard Tables feature in standardSQL isn't as flexible as TABLE_QUERY in legacySQL?
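One idea that might sidestep the error (an untested sketch, not an official answer): make the wildcard prefix specific enough that no view matches it. The views listed in the error all start with latest_, while the tables start with 2, so a prefix of 2 should exclude them. Note that _TABLE_SUFFIX then only covers the part after the prefix, hence the SUBSTR below; also, a subquery filter on _TABLE_SUFFIX does not limit the number of tables scanned.
#standardSQL
SELECT *
FROM `httparchive.runs.2*`
WHERE _TABLE_SUFFIX = (
  SELECT SUBSTR(table_id, 2)  -- drop the leading '2' consumed by the wildcard prefix
  FROM `httparchive.runs.__TABLES__`
  WHERE REGEXP_CONTAINS(table_id, r'^2.*requests$')
  ORDER BY table_id DESC
  LIMIT 1)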

Related

Redshift showing 0 rows for external table, though data is viewable in Athena

I created an external table in Redshift and then added some data to the specified S3 folder. I can view all the data perfectly in Athena, but I can't seem to query it from Redshift. What's weird is that select count(*) works, which means it can find the data, but it can't actually show anything. I'm guessing it's some misconfiguration somewhere, but I'm not sure what.
Some details that may be relevant (I anonymized some values):
create external schema spectrum_staging
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::############:role/RedshiftSpectrumRole'
create external database if not exists;

create external table spectrum_staging.errors(
  id varchar(100),
  error varchar(100))
stored as parquet
location 's3://mybucket/errors/';
My sample data is stored in s3://mybucket/errors/2018-08-27-errors.parquet
This query works:
db=# select count(*) from spectrum_staging.errors;
count
-------
11
(1 row)
This query does not:
db=# select * from spectrum_staging.errors;
id | error
----+-------
(0 rows)
Check your parquet file and make sure the column data types in the Spectrum table match up.
Then run SELECT pg_last_query_id(); after your query to get the query number and look in the system tables STL_S3CLIENT and STL_S3CLIENT_ERROR to find further details about the query execution.
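A sketch of that flow (the query id 12345 is hypothetical):
select pg_last_query_id();
-- suppose it returned 12345; then inspect the tables named above:
select * from stl_s3client where query = 12345;
select * from stl_s3client_error where query = 12345;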
You don't need to define external tables when you have defined an external schema based on the Glue Data Catalog; Redshift Spectrum picks up all the tables that are in the Catalog.
What's probably going on is that you somehow have two objects with the same name: in one case the query resolves to the Data Catalog table, and in the other it tries to use the external table you defined.
Check these tables from Redshift side to get a better view of what's there:
select * from SVV_EXTERNAL_SCHEMAS
select * from SVV_EXTERNAL_TABLES
select * from SVV_EXTERNAL_PARTITIONS
select * from SVV_EXTERNAL_COLUMNS
And these tables for queries that use the tables from external schema:
select * from SVL_S3QUERY_SUMMARY
select * from SVL_S3LOG order by eventtime desc
select * from SVL_S3QUERY where query = xyz
select * from SVL_S3PARTITION where query = xyz
Was there ever a resolution for this? A year on, I have the same problem today.
Nothing stands out in terms of schema differences, though an error does exist:
select recordtime, file, process, errcode, linenum as line,
trim(error) as err
from stl_error order by recordtime desc;
/home/ec2-user/padb/src/sys/cg_util.cpp padbmaster 1 601 Compilation of segment failed: /rds/bin/padb.1.0.10480/data/exec/227/48844003/de67afa670209cb9cffcd4f6a61e1c32a5b3dccc/0
Not sure what this means.
I encountered a similar issue when creating an external table in Athena using the RegexSerDe row format. I was able to query this external table from Athena without any issues. However, when querying the external table from Redshift, the results were null.
I resolved it by converting to Parquet format, as Spectrum cannot handle regular-expression serialization.
See link below:
Redshift spectrum shows NULL values for all rows
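For example, one way to do that conversion is an Athena CTAS that rewrites the table as Parquet (a sketch; the database, table, and bucket names are made up):
CREATE TABLE mydb.errors_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://mybucket/errors_parquet/'
) AS
SELECT * FROM mydb.errors_regex;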

Cannot query over table without a filter that can be used for partition elimination

I have a partitioned table and would love to use a MERGE statement, but for some reason it doesn't work out.
MERGE `wr_live.p_email_event` t
USING `wr_live.email_event` s
ON t.user_id=s.user_id AND t.event=s.event AND t.timestamp=s.timestamp
WHEN NOT MATCHED THEN
INSERT (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
VALUES (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
I get
Cannot query over table 'wr_live.p_email_event' without a filter that
can be used for partition elimination.
What's the proper syntax? Also, is there a shorter way to express the INSERT part, without naming all the columns?
What's the proper syntax?
As you can see from the error message, your partitioned wr_live.p_email_event table was created with "require partition filter" set to true. This means that any query over this table must have a filter on the partitioning field.
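For illustration, a query with no such filter is rejected while a filtered one runs (a sketch, assuming the table is partitioned on DATE(timestamp)):
#standardSQL
-- rejected: no filter usable for partition elimination
-- SELECT COUNT(*) FROM `wr_live.p_email_event`;
-- accepted: filters on the partitioning field
SELECT COUNT(*)
FROM `wr_live.p_email_event`
WHERE DATE(timestamp) >= '2018-01-01';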
Assuming that timestamp is that partitioning field, you can do something like below:
MERGE `wr_live.p_email_event` t
USING `wr_live.email_event` s
ON t.user_id=s.user_id AND t.event=s.event AND t.timestamp=s.timestamp
AND DATE(t.timestamp) > CURRENT_DATE() -- this is the filter you should tune
WHEN NOT MATCHED THEN
INSERT (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
VALUES (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
So you need to tune the line below such that, in practice, it does not filter out any rows you need to be involved:
AND DATE(t.timestamp) > CURRENT_DATE() -- this is the filter you should tune
For example, I found that setting it to a timestamp in the future addresses the issue in many cases, like:
AND DATE(t.timestamp) > DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
Of course, if your wr_live.email_event table is also partitioned with "require partition filter" set to true, you need to add the same kind of filter for s.timestamp, as sketched below.
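One way to do that is to pre-filter the source in the USING clause (a sketch; the source-side condition is a placeholder to tune so it keeps every row you want merged):
MERGE `wr_live.p_email_event` t
USING (
  SELECT * FROM `wr_live.email_event`
  WHERE DATE(timestamp) <= CURRENT_DATE()  -- tune: must keep all rows to be inserted
) s
ON t.user_id=s.user_id AND t.event=s.event AND t.timestamp=s.timestamp
  AND DATE(t.timestamp) > DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
WHEN NOT MATCHED THEN
INSERT (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
VALUES (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)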
Also is there a way I can express shorter the insert stuff? without naming all columns?
BigQuery DML's INSERT requires column names to be specified; there is no way (at least that I am aware of) to avoid that with an INSERT statement.
Meanwhile, you can avoid this by using DDL's CREATE TABLE from the result of a query, which does not require listing the columns.
For example, something like below:
CREATE OR REPLACE TABLE `wr_live.p_email_event`
PARTITION BY DATE(timestamp) AS
SELECT * FROM `wr_live.p_email_event`
WHERE DATE(timestamp) <> DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
UNION ALL
SELECT * FROM `wr_live.email_event` s
WHERE NOT EXISTS (
  SELECT 1 FROM `wr_live.p_email_event` t
  WHERE t.user_id=s.user_id AND t.event=s.event AND t.timestamp=s.timestamp
    AND DATE(t.timestamp) > DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
)
You might also want to include a table options list via OPTIONS(), but it looks like the partition-filter attribute is not supported there yet, so if you do have/need it, the statement above will "erase" this attribute :o(

In BigQuery, what is the workaround for _TABLE_SUFFIX to pull the correct schema?

When using standardSQL with _TABLE_SUFFIX, BigQuery infers the schema from the most recently created table matching the wildcard. The result is that if there is more than one kind of table in the dataset, the following query will return NULL values wherever it picks up an incorrect schema:
SELECT *
FROM `dataset.*`
WHERE _TABLE_SUFFIX LIKE '%some_table_name'
So if I save a table named dataset.different_table and then execute the above query, the result won't be as expected.
Is there a workaround for this workflow? An alternative to using _TABLE_SUFFIX?
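One workaround, if the naming convention can be changed, is to put the shared name first so the wildcard only matches tables with the same schema; _TABLE_SUFFIX then carries the varying part. A sketch with assumed names and date-sharded suffixes:
#standardSQL
SELECT *
FROM `dataset.some_table_name_*`
WHERE _TABLE_SUFFIX BETWEEN '20180101' AND '20180131'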

BigQuery: Querying repeated fields

I'm trying to use the following query to get the rows for which the event name is one of EventGamePlayed, EventGetUserBasicInfos, or EventGetUserCompleteInfos:
select *
from [com_test_testapp_ANDROID.app_events_20170426]
where event_dim.name in ("EventGamePlayed", "EventGetUserBasicInfos", "EventGetUserCompleteInfos");
I'm getting the following error: Cannot query the cross product of repeated fields event_dim.name and user_dim.user_properties.value.index.
Is it possible to make it work without having a flattened result?
Also, I'm not sure why the error is talking about the "user_dim.user_properties.value.index" field.
The error is due to the SELECT *, which includes all columns. Rather than using legacy SQL, try this using standard SQL, which doesn't have this problem with repeated field cross products:
#standardSQL
SELECT *
FROM com_test_testapp_ANDROID.app_events_20170426
CROSS JOIN UNNEST(event_dim) AS event_dim
WHERE event_dim.name IN ("EventGamePlayed", "EventGetUserBasicInfos", "EventGetUserCompleteInfos");
You can read more about working with repeated fields/arrays in the Working with Arrays topic. If you are used to using legacy SQL, you can read about differences between legacy and standard SQL in BigQuery in the migration guide.
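If you only need a few fields, a variant of the same standard SQL approach avoids selecting every column and sidesteps the cross product entirely (a sketch):
#standardSQL
SELECT ed.name
FROM `com_test_testapp_ANDROID.app_events_20170426`
CROSS JOIN UNNEST(event_dim) AS ed
WHERE ed.name IN ('EventGamePlayed', 'EventGetUserBasicInfos', 'EventGetUserCompleteInfos');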

Google bigquery - update sql?

It seems you can run SELECT SQL in BigQuery, but can you update only certain rows in a table through the API or from the web console?
Currently BigQuery only accepts SELECT statements. Updates to data need to be done via API, web UI, or CLI.
BigQuery is a WORM technology (append-only by design). It looks to me like you are not aware of this, as there is no operation like UPDATE or DELETE row.
To delete data, you could re-materialize the table without the desired rows:
SELECT *
FROM [mytable]
WHERE id NOT IN (SELECT id FROM [rows_to_delete])
To update data, you could follow a similar process:
SELECT * FROM (
  SELECT *
  FROM [mytable]
  WHERE id NOT IN (SELECT id FROM [rows_to_update])
), (
  SELECT *
  FROM [rows_to_update]
)
Re-materializing a table in BigQuery is fast enough compared to native updates/deletes on other analytical databases, AFAIK.
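Note that BigQuery standard SQL now supports DML, so updating only certain rows directly is possible; a sketch with assumed table and column names:
#standardSQL
UPDATE `project.dataset.mytable`
SET error = 'corrected'
WHERE id = 'abc123';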