BigQuery standard SQL syntax: _TABLE_SUFFIX and .yesterday tables - google-bigquery

My goal is to query across multiple tables of a dataset using BigQuery standard SQL syntax.
I can successfully make it work when all tables of a dataset follow the same number pattern. However, for datasets that contain additional tables like .yesterday, I get an error: Views cannot be queried through prefix. Matched views are: githubarchive:day.yesterday
Here is the query I used:
SELECT
COUNT(*)
FROM
`githubarchive.day.*`
WHERE
type = "WatchEvent"
AND _TABLE_SUFFIX BETWEEN '20170101' AND '20170215'

Try using a longer prefix, so that the wildcard no longer matches the yesterday view. For example:
SELECT
COUNT(*)
FROM
`githubarchive.day.2017*`
WHERE
type = "WatchEvent"
AND _TABLE_SUFFIX BETWEEN '0101' AND '0215';
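If the date range spans more than one year, the same idea still applies with a shorter prefix. This is a minimal sketch, assuming the daily tables are all named YYYYMMDD and that no view in the dataset has a name starting with 20; the date range is only an illustration.

```sql
#standardSQL
-- The prefix "20" matches tables such as 20161231 and 20170101,
-- but not the view named "yesterday".
SELECT
  COUNT(*) AS watch_events
FROM
  `githubarchive.day.20*`
WHERE
  type = "WatchEvent"
  -- _TABLE_SUFFIX holds only the part matched by *, so the leading "20" is dropped.
  AND _TABLE_SUFFIX BETWEEN '161201' AND '170215'
```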

Related

Use wildcard query on dataset with models in BigQuery

I have a series of tables that are named {YYYYMM}_{id} and I have ML models that are named {groupid}_cost_model. I'm attempting to collate some data across all the tables using the following query:
SELECT * FROM `mydataset.20*`
The problem I'm having is that I have a model named 200_cost_model and it causes the following error:
Wildcard table over non partitioning tables and field based partitioning tables is not yet supported, first normal table myproject:mydataset.200_cost_model, first column table myproject:mydataset.202001_4544248676.
Is there a way to filter out the models from wildcard queries or am I stuck joining all the tables together?
When using wildcard tables, you can use the _TABLE_SUFFIX pseudo column to filter results. From the documentation:
Queries with wildcard tables support the _TABLE_SUFFIX pseudo column
in the WHERE clause. This column contains the values matched by the
wildcard character, so that queries can filter which tables are
accessed. For example, the following WHERE clauses use comparison
operators to filter the matched tables
I tested this on my side, although only on freshly created standard tables. To exclude the models, a filter like this should work:
SELECT *
FROM
`mydataset.20*`
WHERE
_TABLE_SUFFIX NOT LIKE '%cost_model';
To list every _TABLE_SUFFIX value that the wildcard matches, this worked for me:
SELECT DISTINCT _TABLE_SUFFIX AS suffix FROM `mydataset.20*`
However, I am not sure whether this will work in your situation.
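Another way to express the filter is to keep only the suffixes that look like the {YYYYMM}_{id} tables, rather than excluding the models by name. This is a sketch, assuming every data table follows that pattern with a numeric id. It is not certain that a suffix filter alone avoids the error, because the wildcard itself still matches the model.

```sql
#standardSQL
-- With the prefix "20", the table 202001_4544248676 has the suffix
-- "2001_4544248676", while 200_cost_model has the suffix "0_cost_model".
SELECT *
FROM `mydataset.20*`
WHERE REGEXP_CONTAINS(_TABLE_SUFFIX, r'^[0-9]{4}_[0-9]+$')
```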

BigQuery, Wildcard table over non partitioning tables and field based partitioning tables is not yet supported

I'm trying to run a simple query with a wildcard table using standard SQL on BigQuery. Here's the code:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_*`
WHERE _TABLE_SUFFIX BETWEEN '20150518' AND '20210406'
GROUP BY 1
My sharded dataset contains one table per day since 18/05/2015, so the first table is 'dataset_20150518'.
The error is: 'Wildcard table over non partitioning tables and field based partitioning tables is not yet supported, first normal table dataset_test, first column table dataset_20150518.'
I've tried different kinds of selects and aggregations, but the error persists. I just want to query all tables in that timeframe.
This is because all tables matched by a wildcard must have the same schema. In your case, the wildcard also matches dataset_test, which does not have the same schema as the others (is dataset_test a partitioned table?).
You should be able to get around this limitation by deleting dataset_test and any other tables with a different schema, or by narrowing the wildcard prefix as in this query:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_20*`
WHERE _TABLE_SUFFIX BETWEEN '150518' AND '210406'
GROUP BY 1
Official documentation
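To see which tables a prefix would match before running the wildcard query, the dataset metadata can be inspected. This is a minimal sketch, assuming the tables live in a dataset called mydataset (a placeholder, since the question does not name it).

```sql
#standardSQL
-- List every table whose name starts with the wildcard prefix and flag
-- the ones that have a partitioning column. These appear to be what the
-- error message calls "column tables" (an interpretation, not confirmed).
SELECT
  table_name,
  LOGICAL_OR(is_partitioning_column = 'YES') AS is_column_partitioned
FROM `mydataset`.INFORMATION_SCHEMA.COLUMNS
WHERE STARTS_WITH(table_name, 'dataset_')
GROUP BY table_name
ORDER BY table_name
```

Any table that stands out in this list, such as dataset_test, is a candidate for exclusion through a longer prefix.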

Google BigQuery - Using wildcard table query with date partitioned table?

I am trying to use wildcard table functions to query a bunch of date-partitioned tables.
This query works:
select * from `Mydataset.fact_table_1` where _partitiontime='2016-09-30' limit 10
This query does not work:
select * from `Mydataset.fact_table_*` where _partitiontime='2016-09-30' limit 10
Is this operation not supported?
If it is not supported what's the best way to read same day's data from multiple date-partitioned tables?
The following legacy SQL statement should do the trick:
select * from TABLE_QUERY(YOUR_DATASET,'table_id contains "fact_table_"') where _PARTITIONTIME = TIMESTAMP('2016-09-30')
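In standard SQL, where TABLE_QUERY is not available, one fallback is to name each table explicitly and combine the results with UNION ALL. This is a sketch, assuming the tables share the same schema; fact_table_2 is a hypothetical second table.

```sql
#standardSQL
-- Each branch filters to a single day's partition before the results are combined.
SELECT * FROM `Mydataset.fact_table_1`
WHERE _PARTITIONTIME = TIMESTAMP('2016-09-30')
UNION ALL
SELECT * FROM `Mydataset.fact_table_2`
WHERE _PARTITIONTIME = TIMESTAMP('2016-09-30')
```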

SQL querying multiple schemas

I am looking to run a query on several schemas in Workbench. Basically, they are all symmetric, just with different dates. In Workbench, I can only select one of them and run the query. Is there a way to aggregate them and run the query over a selection of schemas?
EDIT:
To elaborate a bit more, I have schemas named yyyy_mm_dd, one for each day. Ideally, instead of doing a union over them as suggested by Guish below, I would like a dynamic query that turns the name of each schema into a valid date and unions all of those whose date falls within a defined range. Is this possible? I am using Oracle and SQL Workbench.
I guess you are using MySQL Workbench.
Use a UNION operator.
(SELECT a FROM `schema1`.`t1` )
UNION
(SELECT a FROM `schema2`.`t1`);
Info here
You can then create a view from your query.
A thread here on querying multiple schemas
I know Transact-SQL much better, and it is similar:
SELECT ProductModelID, Name
FROM Schema1.ProductModel
UNION ALL
SELECT ProductModelID, Name
FROM Schema2.ProductModel
ORDER BY Name;
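For the dynamic case described in the edit, one option on Oracle is to generate the UNION ALL branches from the data dictionary instead of typing them by hand. This is only a sketch under assumptions: the table is called T1 in every schema, the column a is a placeholder, and the date range is an example. Because the schema names follow yyyy_mm_dd, plain string comparison orders them the same way as the dates.

```sql
-- Produces one SELECT per schema in the range; join the resulting rows
-- with UNION ALL to build the final query.
SELECT 'SELECT a FROM "' || owner || '"."' || table_name || '"' AS branch
FROM all_tables
WHERE table_name = 'T1'
  AND owner BETWEEN '2017_01_01' AND '2017_01_31'
ORDER BY owner;
```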

How do I use the TABLE_QUERY() function in BigQuery?

A couple of questions about the TABLE_QUERY function:
The examples show using table_id in the query string, are there other fields available?
It seems difficult to debug. I'm getting "error evaluating subsidiary query" when I try to use it.
How does TABLE_QUERY() work?
The TABLE_QUERY() function allows you to write a SQL WHERE clause that is evaluated to find which tables to run the query over. For instance, you can run the following query to count the rows in all tables in the publicdata:samples dataset that are older than 7 days:
SELECT count(*)
FROM TABLE_QUERY(publicdata:samples,
"MSEC_TO_TIMESTAMP(creation_time) < "
+ "DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY')")
Or you can run this to query over all tables that have ‘git’ in the name (which are the github_timeline and the github_nested sample tables) and find the most common urls:
SELECT url, COUNT(*) AS cnt
FROM TABLE_QUERY(publicdata:samples, "table_id CONTAINS 'git'")
GROUP EACH BY url
ORDER BY cnt DESC
LIMIT 100
Despite being very powerful, TABLE_QUERY() can be difficult to use. The WHERE clause must be specified as a string, which can be a little bit awkward. Moreover, it can be difficult to debug, since when there is a problem, you only get the error “Error evaluating subsidiary query”, which isn’t always helpful.
How it works:
TABLE_QUERY() essentially executes two queries. When you run TABLE_QUERY(<dataset>, <table_query>), BigQuery executes SELECT table_id FROM <dataset>.__TABLES_SUMMARY__ WHERE <table_query> to get the list of table IDs to run the query on, then it executes your actual query over those tables.
The __TABLES_SUMMARY__ portion of that query may look unfamiliar. __TABLES_SUMMARY__ is a meta-table containing information about tables in a dataset. You can use this meta-table yourself. For example, the query SELECT * FROM publicdata:samples.__TABLES_SUMMARY__ will return metadata about the tables in the publicdata:samples dataset.
Available Fields:
The fields of the __TABLES_SUMMARY__ meta-table (that are all available in the TABLE_QUERY query) include:
table_id: name of the table.
creation_time: time, in milliseconds since 1/1/1970 UTC, that the table was created. This is the same as the creation_time field on the table.
type: whether it is a view (2) or regular table (1).
The following fields are not available in TABLE_QUERY() since they are members of __TABLES__ but not __TABLES_SUMMARY__. They're kept here for historical interest and to partially document the __TABLES__ metatable:
last_modified_time: time, in milliseconds since 1/1/1970 UTC, that the table was updated (either metadata or table contents). Note that if you use tabledata.insertAll() to stream records to your table, this might be a few minutes out of date.
row_count: number of rows in the table.
size_bytes: total size in bytes of the table.
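The fields in this second group can still be read by querying __TABLES__ directly. A minimal legacy SQL sketch against the same sample dataset:

```sql
#legacySQL
-- Size and freshness of every table in the dataset, largest first.
SELECT
  table_id,
  row_count,
  size_bytes,
  MSEC_TO_TIMESTAMP(last_modified_time) AS last_modified
FROM [publicdata:samples.__TABLES__]
ORDER BY size_bytes DESC
```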
How to debug
In order to debug your TABLE_QUERY() queries, you can do the same thing that BigQuery does; that is, you can run the metatable query yourself. For example:
SELECT * FROM publicdata:samples.__TABLES_SUMMARY__
WHERE MSEC_TO_TIMESTAMP(creation_time) <
DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY')
lets you not only debug your query but also see what tables would be returned when you run the TABLE_QUERY function. Once you have debugged the inner query, you can put it together with your full query over those tables.
Alternative answer, for those moving forward to Standard SQL:
BigQuery Standard SQL doesn't support TABLE_QUERY, but it supports * expansion for table names.
When expanding table names *, you can use the meta-column _TABLE_SUFFIX to narrow the selection.
Table expansion with * only works when all tables have compatible schemas.
For example, to get the average worldwide NOAA GSOD temperature between 2010 and 2014:
#standardSQL
SELECT AVG(temp) avg_temp, _TABLE_SUFFIX y
FROM `bigquery-public-data.noaa_gsod.gsod20*` #every year that starts with "20"
WHERE _TABLE_SUFFIX BETWEEN "10" AND "14" #only years between 2010 and 2014
GROUP BY y
ORDER BY y
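The debugging step described earlier has a standard SQL counterpart: the __TABLES__ meta-table can be queried to check which tables exist before writing a wildcard. This is a sketch, assuming the sample dataset is reachable as bigquery-public-data.samples.

```sql
#standardSQL
-- Tables created more than 7 days ago, mirroring the legacy SQL example.
SELECT table_id
FROM `bigquery-public-data.samples.__TABLES__`
WHERE TIMESTAMP_MILLIS(creation_time)
      < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
```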