Use wildcard query on dataset with models in BigQuery - google-bigquery

I have a series of tables that are named {YYYYMM}_{id} and I have ML models that are named {groupid}_cost_model. I'm attempting to collate some data across all the tables using the following query:
SELECT * FROM `mydataset.20*`
The problem I'm having is that I have a model named 200_cost_model and it causes the following error:
Wildcard table over non partitioning tables and field based partitioning tables is not yet supported, first normal table myproject:mydataset.200_cost_model, first column table myproject:mydataset.202001_4544248676.
Is there a way to filter out the models from wildcard queries or am I stuck joining all the tables together?

When using wildcard tables, you can use the _TABLE_SUFFIX pseudo column to filter results:
Queries with wildcard tables support the _TABLE_SUFFIX pseudo column
in the WHERE clause. This column contains the values matched by the
wildcard character, so that queries can filter which tables are
accessed. For example, the following WHERE clauses use comparison
operators to filter the matched tables
I tested this on my side, although only on ordinary, freshly created tables, and a filter along these lines should work. Since you want to leave the models out, the condition has to be NOT LIKE:
SELECT *
FROM
`mydataset.20*`
WHERE
_TABLE_SUFFIX NOT LIKE '%cost_model';
To list every _TABLE_SUFFIX value that the wildcard matches, this worked for me:
SELECT DISTINCT _TABLE_SUFFIX AS suffix FROM `mydataset.20*`
but I am not sure whether this will work in your situation.

Related

Can SELECT * FROM multiple tables with same _TABLE_SUFFIX pattern

I am trying to select all rows from 3 tables that match a _TABLE_SUFFIX pattern, but I don't receive the expected output.
The query I am using:
SELECT
*
FROM
`project-id.airbyte_google_ads.client_id_*`
WHERE
REGEXP_CONTAINS(_TABLE_SUFFIX, r"_campaign_performance_overview$")
The output I receive contains columns from other tables rather than from the ones I want. However, if I use:
SELECT
DISTINCT _TABLE_SUFFIX as tables
FROM
`project-id.airbyte_google_ads.client_id_*`
WHERE
REGEXP_CONTAINS(_TABLE_SUFFIX, r"_campaign_performance_overview$")
the names of the tables I want to select rows from are returned correctly.
My thought is that something is wrong with the wildcard line, and I wondered whether there is a way to use it like this:
`project-id.airbyte_google_ads.client_id_*_campaign`
or something similar, because it looks like the query resolves the FROM clause first and applies the WHERE clause at a later point.
Let me know your thoughts on this.
Thank you for your time!
As per this documentation, when using wildcard tables, all the tables in the dataset whose names begin with the text before the * are scanned, even if _TABLE_SUFFIX is used in combination with REGEXP_CONTAINS. In our case the wildcard pattern is client_id_*, so tables such as client_id_1_campaigns are also matched, irrespective of the pattern in REGEXP_CONTAINS.
The reason for this behaviour is that the wildcard is resolved first: every table matching the wildcard prefix is scanned, and the regex is only applied afterwards as a row filter. Combining a wildcard with REGEXP_CONTAINS layers one pattern match on top of another and is not recommended.
If you wish to have the intended target tables you will need to use the below query instead of using wildcards to query multiple tables.
SELECT *
FROM (
SELECT * FROM `project-id.dataset-id.client_id_2_campaign_performance_overview` UNION ALL
SELECT * FROM `project-id.dataset-id.client_id_7_campaign_performance_overview` UNION ALL
SELECT * FROM `project-id.dataset-id.client_id_10_campaign_performance_overview`);
Using the LIKE operator also does not give the expected results for the same reason mentioned above. The tables are scanned first then filtered giving extra columns in the result.
Also, BigQuery uses the schema for the most recently created table that matches the wildcard as the schema for the wildcard table. Even if you restrict the number of tables that you want to use from the wildcard table using the _TABLE_SUFFIX pseudo column in a WHERE clause, BigQuery uses the schema for the most recently created table that matches the wildcard. You will see the extra columns in the result if the most recently created table has them.
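The behaviour described above can be modelled with a small Python sketch. This is only an illustration of the two points made in this answer (prefix matching happens before the _TABLE_SUFFIX filter, and the schema comes from the most recently created matching table), not BigQuery's actual implementation. The table names, creation order, column names and the helper functions expand_wildcard and wildcard_schema are all hypothetical.

```python
import re

# Hypothetical dataset: table name -> (creation order, column names).
TABLES = {
    "client_id_2_campaign_performance_overview": (1, ["campaign_id", "clicks"]),
    "client_id_7_campaign_performance_overview": (2, ["campaign_id", "clicks"]),
    "client_id_7_campaigns": (3, ["campaign_id", "budget", "status"]),
}


def expand_wildcard(tables, prefix):
    """Map every table whose name starts with `prefix` to its suffix value."""
    return {name: name[len(prefix):] for name in tables if name.startswith(prefix)}


def wildcard_schema(tables, prefix):
    """Return the columns of the most recently created table matching `prefix`."""
    matched = expand_wildcard(tables, prefix)
    newest = max(matched, key=lambda name: tables[name][0])
    return tables[newest][1]


# Step 1: the prefix alone decides which tables belong to the wildcard table.
matched = expand_wildcard(TABLES, "client_id_")

# Step 2: the suffix filter narrows the rows, but only after step 1.
pattern = re.compile(r"_campaign_performance_overview$")
selected = sorted(name for name, suffix in matched.items() if pattern.search(suffix))

# The schema is still taken from the newest matched table (client_id_7_campaigns).
schema = wildcard_schema(TABLES, "client_id_")
```

In this model the filter picks the two intended tables, yet the schema still carries the budget and status columns of the unrelated table, which mirrors the extra columns seen in the question.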

BigQuery, Wildcard table over non partitioning tables and field based partitioning tables is not yet supported

I'm trying to run a simple query with a wildcard table using standardSQL on Bigquery. Here's the code:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_*`
WHERE _TABLE_SUFFIX BETWEEN '20150518' AND '20210406'
GROUP BY 1
My sharded dataset contains one table for each day since 18/05/2015, so the first table is 'dataset_20150518'.
The error is: 'Wildcard table over non partitioning tables and field based partitioning tables is not yet supported, first normal table dataset_test, first column table dataset_20150518.'
I've tried different kinds of selects and aggregations but the error persists. I just want to query all tables in that timeframe.
This happens because all the tables matched by the wildcard need to have the same schema. In your case the prefix dataset_ also matches dataset_test, whose schema differs from the others (is dataset_test a partitioned table?).
You should be able to get around this limitation by deleting dataset_test and any other tables with a different schema, or by using a longer prefix that no longer matches them (note that the _TABLE_SUFFIX values get shorter accordingly):
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_20*`
WHERE _TABLE_SUFFIX BETWEEN '150518' AND '210406'
GROUP BY 1
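To see why the longer prefix helps and why the BETWEEN bounds lose their leading "20", here is a minimal Python sketch of the matching logic. It is a simplified model under the assumption that the suffix is whatever follows the prefix and that BETWEEN compares strings; the list of table names and the function tables_in_range are made up for illustration.

```python
def tables_in_range(tables, prefix, low, high):
    """Return tables that start with `prefix` and whose suffix lies in [low, high]."""
    selected = []
    for name in tables:
        if not name.startswith(prefix):
            continue  # not part of the wildcard table at all
        suffix = name[len(prefix):]  # plays the role of _TABLE_SUFFIX
        if low <= suffix <= high:  # string comparison, like BETWEEN on strings
            selected.append(name)
    return selected


# Hypothetical dataset contents.
tables = ["dataset_20150518", "dataset_20210406", "dataset_20210407", "dataset_test"]

# With the short prefix, dataset_test is part of the wildcard table.
matched_by_short_prefix = [name for name in tables if name.startswith("dataset_")]

# With the longer prefix, dataset_test is never matched and the suffix is 6 digits.
narrow = tables_in_range(tables, "dataset_20", "150518", "210406")
```

Because the daily suffixes all have the same length, comparing them as strings gives the same order as comparing the dates.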
Official documentation

BigQuery standard SQL syntax: _TABLE_SUFFIX and .yesterday tables

My goal is to query across multiple tables of a dataset using BigQuery standard SQL syntax.
I can successfully make it work when all tables of a dataset follow the same number pattern. However, for datasets that contain additional tables like .yesterday, I get an error: Views cannot be queried through prefix. Matched views are: githubarchive:day.yesterday
Here is the query I used:
SELECT
COUNT(*)
FROM
`githubarchive.day.*`
WHERE
type = "WatchEvent"
AND _TABLE_SUFFIX BETWEEN '20170101' AND '20170215'
Try using a longer prefix, so that the wildcard no longer matches the view. For example,
SELECT
COUNT(*)
FROM
`githubarchive.day.2017*`
WHERE
type = "WatchEvent"
AND _TABLE_SUFFIX BETWEEN '0101' AND '0215';

BigQuery flattens result when selecting into table with GROUP BY even with "noflatten_results" flag on

I have a table with duplicate records that I want to remove. I've created a column called "hash_code", which is just a SHA-1 hash of all the columns, so duplicate rows have the same hash code. Everything is fine except when I try to create a new table from a query containing GROUP BY. My table has a RECORD data type, but the new table flattens it even though I specified that results should not be flattened. It seems that GROUP BY and the "--noflatten_results" flag don't play nicely together.
Here's an example command line I ran:
bq query --allow_large_results --destination_table mydataset.my_events --noflatten_results --replace
"select hash_code, min(event) as event, min(properties.adgroup_name) as properties.adgroup_name,
min(properties.adid) as properties.adid, min(properties.app_id) as properties.app_id,
min(properties.campaign_name) as properties.campaign_name from mydataset.my_orig_events group each
by hash_code "
In the above example, properties is a RECORD data type with nested fields. The resulting table doesn't have properties as RECORD data type. Instead it translated properties.adgroup_name to properties_adgroup_name, etc.
Any way to force BigQuery to treat the result set as RECORD and not flatten in GROUP BY?
Thanks!
There are a few known cases where query results can be flattened despite requesting unflattened results.
Queries containing a GROUP BY clause
Queries containing an ORDER BY clause
Selecting a nested field with a flat alias (e.g. SELECT record.record.field AS flat_field). Note that this only flattens the specific field with the alias applied, and only flattens the field if it and its parent records are non-repeated.
The BigQuery query engine always flattens query results in these cases. As far as I know, there is no workaround for this behavior, other than removing these clauses or aliases from the query.

What is the difference between Select and Project Operations

I'm referring to the basic relational algebra operators here.
As I see it, everything that can be done with project can be done with select.
I don't know if there is a difference or a certain nuance that I've missed.
PROJECT eliminates columns while SELECT eliminates rows.
Select Operation: This operation is used to select the rows of a table (relation) that satisfy a given condition, called a predicate. The predicate is a user-defined condition for picking the rows of the user's choice.
Project Operation: If the user is interested in the values of only a few attributes, rather than all attributes of the table (relation), then the PROJECT operation is the one to use.
See more : Relational Algebra and its operations
In Relational algebra 'Selection' and 'Projection' are different operations, but the SQL SELECT combines these operations in a single statement.
Select retrieves the tuples (rows) in a relation (table) for which the condition in 'predicate' section (WHERE clause) stands true.
Project retrieves the attributes (columns) specified.
The following SQL SELECT query:
select field1,field2 from table1 where field1 = 'Value';
is a combination of both Projection and Selection operations of relational algebra.
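As a rough illustration of how that statement decomposes, the Python sketch below applies a selection and then a projection to a small in-memory relation. The functions select and project, the extra column field3 and the sample rows are hypothetical and exist only for this example.

```python
def select(relation, predicate):
    """Selection: keep only the rows (tuples) for which the predicate holds."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Projection: keep only the named columns (attributes) of every row.

    Like SQL SELECT, this sketch does not remove duplicate rows.
    """
    return [{name: row[name] for name in attributes} for row in relation]


# Hypothetical contents of table1.
table1 = [
    {"field1": "Value", "field2": 1, "field3": "x"},
    {"field1": "Other", "field2": 2, "field3": "y"},
    {"field1": "Value", "field2": 3, "field3": "z"},
]

# select field1, field2 from table1 where field1 = 'Value';
rows = select(table1, lambda row: row["field1"] == "Value")  # the WHERE clause
result = project(rows, ["field1", "field2"])                 # the column list
```

The WHERE clause corresponds to the selection step and the column list to the projection step.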
Project is not a statement. It is a capability of the SELECT statement.
The SELECT statement has three capabilities: selection, projection and join.
Selection: it retrieves the rows that satisfy the condition given in the query.
Projection: it chooses the columns that are named in the query.
Join: it combines two or more tables.
The selection operation is used to select the subset of tuples of a relation that satisfy a selection condition. It can be visualized as a horizontal partition of the relation into two sets of tuples: the tuples that satisfy the condition are selected, and the tuples that do not satisfy it are discarded.
σ<selection condition>(R)
The projection operation is used to select certain attributes (columns) of a relation. It can be visualized as a vertical partition of the relation into two parts: the attributes in the attribute list are kept, and the others are discarded.
π<attribute list>(R)
where <attribute list> is a list of attributes of R.
Project affects the columns of a table, while Select affects the rows. In other words, Project is used to pick specific columns rather than returning the data of all columns.
Select extracts the rows of a relation that satisfy some condition, and Project extracts a particular set of attributes (columns) from the relation, with or without a condition.
The difference between the project operator (π) in relational algebra and the SELECT keyword in SQL is that if the resulting table/set contains more than one occurrence of the same tuple, π returns only one of them, while SQL SELECT returns all of them.
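The duplicate-handling difference can be tried out with Python's built-in sqlite3 module. This is a small sketch with a made-up users table; SELECT DISTINCT is used here as the closest SQL counterpart to the set semantics of π.

```python
import sqlite3

# In-memory database with a hypothetical table in which two users share an age.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (fname TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Ann", 30), ("Bob", 30), ("Cid", 41)],
)

# Plain SELECT keeps duplicate rows (bag semantics).
sql_select = [row[0] for row in conn.execute("SELECT age FROM users ORDER BY age")]

# SELECT DISTINCT drops them, matching the set semantics of the project operator.
algebra_project = [
    row[0] for row in conn.execute("SELECT DISTINCT age FROM users ORDER BY age")
]

conn.close()
```

Here sql_select has three entries and algebra_project has two, so projecting onto age also reduces the cardinality under set semantics.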
Select changes only the cardinality of the result (the number of rows), whereas project changes the degree of the relation (the number of columns) and can also change its cardinality.
The difference comes from relational algebra, where project affects columns and select affects rows. In query syntax, however, SELECT is the only keyword; there is no PROJECT statement.
Assume there is a table named users with hundreds of thousands of records (rows) and 6 fields (userID, Fname, Lname, age, pword, salary). Let's say we want to restrict access to sensitive data (userID, pword and salary) and also restrict the amount of data that can be accessed. In MySQL or MariaDB we create a view as follows (CREATE VIEW users1 AS SELECT Fname, Lname, age FROM users LIMIT 100;) and then query the view (SELECT Fname FROM users1;). This query is both a select and a project.