BigQuery Wildcard tables with Regex and date range

Is it possible to combine the table wildcard functions as documented here?
I've taken a look through the Table Query functions SO answer, but it doesn't quite seem to cover my use case.
I have table names in the format: s_CUSTOMER_ID_YYYYMMDD
I can find all the tables for a customer ID using:
SELECT *
FROM TABLE_QUERY([project:dataset],
'REGEXP_MATCH(table_id, r"^s_CUSTOMER_ID")')
And I can find all the tables for a date range via:
SELECT *
FROM (TABLE_DATE_RANGE([project:dataset],
TIMESTAMP('2016-01-01'),
TIMESTAMP('2016-03-01')))
But how do I query for both at the same time?
I tried using subqueries like this:
SELECT * FROM
(SELECT *
FROM TABLE_QUERY([project:dataset],
'REGEXP_MATCH(table_id, r"^s_CUSTOMER_ID")'))
,(SELECT *
FROM (TABLE_DATE_RANGE([project:dataset],
TIMESTAMP('2016-01-01'),
TIMESTAMP('2016-03-01'))))
...but the parser complains with Error: Can't parse table: project:dataset.
Adding a dot so it reads project:dataset. brings another error: Error: Error preparing subsidiary query: Dataset project:dataset. not found
Are my table names poorly done? What would be a better way of organising them if so?

Below is a quick "solution" - it should work, and you can improve it based on the real/extra requirements you probably have:
SELECT *
FROM
TABLE_QUERY([project:dataset],
'REGEXP_MATCH(table_id, r"^s_CUSTOMER_ID")
AND RIGHT(table_id, 8) BETWEEN "20160101" AND "20160301"')
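If these tables are ever queried with standard SQL instead, the same combined filter can be written with a wildcard table and _TABLE_SUFFIX - a sketch, assuming the date is the only thing after the s_CUSTOMER_ID_ prefix (project.dataset is the placeholder from the question):
#standardSQL
SELECT *
FROM `project.dataset.s_CUSTOMER_ID_*`
WHERE _TABLE_SUFFIX BETWEEN '20160101' AND '20160301'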

Related

Query Snowflake Jobs

Is there any way within Snowflake/SQL to view which tables are being queried the most, as well as which columns? I want to know what data is of most value to my users and I'm not sure how to do this programmatically. Any thoughts are appreciated - thank you!
2021 update
The new ACCESS_HISTORY view has this information (in preview right now, enterprise edition).
For example, if you want to find the most used columns:
select obj.value:objectName::string objName
, col.value:columnName::string colName
, count(*) uses
, min(query_start_time) since
, max(query_start_time) until
from snowflake.account_usage.access_history
, table(flatten(direct_objects_accessed)) obj
, table(flatten(obj.value:columns)) col
group by 1, 2
order by uses desc
Ref: https://docs.snowflake.com/en/sql-reference/account-usage/access_history.html
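The same view can also rank whole tables instead of columns - a minimal variant of the query above that skips the column flatten:
select obj.value:objectName::string objName
, count(*) uses
from snowflake.account_usage.access_history
, table(flatten(direct_objects_accessed)) obj
group by 1
order by uses desc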
2020 answer
The best I found (for now):
For any given query, you can find what tables are scanned through looking at the plan generated for it:
SELECT *, "objects"
FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.any_table_or_view')))
WHERE "operation"='TableScan'
You can find all of your previously run queries too:
select QUERY_TEXT
from table(information_schema.query_history())
So the natural next step would be to combine both - but that's not straightforward, as you'll get an error like:
SQL compilation error: argument 1 to function EXPLAIN_JSON needs to be constant, found 'SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.c')'
The solution is to combine the queries from query_history() with SYSTEM$EXPLAIN_PLAN_JSON outside of a single SQL statement (so the query strings become constants), and then you will be able to find out the most queried tables.
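A rough sketch of that combination with Snowflake Scripting - hypothetical glue, not tested here: it restricts the history to SELECT statements, inlines each query text as a string constant (doubling single quotes), collects the TableScan objects into an invented temporary table scanned_objects, and counts them:
EXECUTE IMMEDIATE $$
DECLARE
  c CURSOR FOR
    SELECT query_text
    FROM table(information_schema.query_history())
    WHERE execution_status = 'SUCCESS' AND query_type = 'SELECT';
BEGIN
  CREATE OR REPLACE TEMPORARY TABLE scanned_objects (obj STRING);
  FOR rec IN c DO
    -- rebuild the EXPLAIN call with the query text inlined as a constant
    LET stmt STRING := 'INSERT INTO scanned_objects ' ||
      'SELECT "objects"::string FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON(''' ||
      REPLACE(rec.query_text, '''', '''''') ||
      '''))) WHERE "operation" = ''TableScan''';
    EXECUTE IMMEDIATE stmt;
  END FOR;
  LET rs RESULTSET := (SELECT obj, COUNT(*) uses FROM scanned_objects GROUP BY 1 ORDER BY uses DESC);
  RETURN TABLE(rs);
END;
$$;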

How do you query an array in Standard SQL that meets a certain conditional?

I am trying to pull records whose arrays only meet a certain condition.
For example, I want only the results that contain "IAB3".
Here is what the table looks like:
Table name: bids
Columns: BidderBanner / WinCat
Entries:
1600402 / null
1911048 / null
1893069 / [IAB3-11, IAB3]
1214894 / IAB3
How I initially thought it would work:
SELECT * FROM bids WHERE WinCat = "IAB3"
but I get an error that says no match for operator types array, string.
The database is Google BigQuery.
Below is for BigQuery Standard SQL
#standardSQL
SELECT * FROM `project.dataset.bids` WHERE 'IAB3' IN UNNEST(WinCat)
You can test, play with above using sample data from your question as in example below
#standardSQL
WITH `project.dataset.bids` AS (
SELECT 1600402 BidderBanner, NULL WinCat UNION ALL
SELECT 1911048, NULL UNION ALL
SELECT 1893069, ['IAB3-11', 'IAB3'] UNION ALL
SELECT 1214894, ['IAB3']
)
SELECT * FROM `project.dataset.bids` WHERE 'IAB3' IN UNNEST(WinCat)
with result:
BidderBanner / WinCat
1893069 / [IAB3-11, IAB3]
1214894 / [IAB3]
You need to use single quotes in SQL for all strings. It should be WHERE WinCat = 'IAB3', not WHERE WinCat = "IAB3".
One method uses unnest(), something like this:
SELECT b.*
FROM bids b
WHERE 'IAB3' IN (SELECT unnest(b.WinCat))
However, array syntax varies among the databases that support it, and arrays are not part of "standard SQL".
This will work:
SELECT * FROM bids WHERE REGEXP_LIKE (WinCat, '(.)*(IAB3)+()*');
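For completeness - if you also want partial matches inside the array elements (which the regex attempt above seems to aim at), a BigQuery standard SQL sketch using EXISTS over UNNEST, with the table and column names taken from the question:
#standardSQL
SELECT *
FROM `project.dataset.bids` b
WHERE EXISTS (SELECT 1 FROM UNNEST(b.WinCat) w WHERE w LIKE '%IAB3%')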

BigQuery can use wildcard table names and table_suffix, but I am looking for a similar solution like wildcard datasets and dataset_suffix

So if you process data daily and put the results into the same dataset, such as results, and each day uses the same table name (first part) with the date as table_suffix, such as result1_20190101, result1_20190102, etc., then you can query the result tables using wildcard table names and table_suffix.
So your dataset/tables looks like
results/result1_20190101
results/result1_20190102
results/result2_20190101
results/result2_20190102
So I can query all the result1 tables:
select * from `xxxx.results.result1*`
But I arrange my results tables differently. Because I have dozens of tables processed each day, I use the date as the dataset name, to easily check and manage each day's results.
so my dataset/tables look like
20190101/result1
20190101/result2
...
20190102/result1
20190102/result2
...
And my daily data processing usually will not query across dates (datasets); the daily results are pushed to the next steps of the data pipelines, etc.
But once in a while, I need to do a quick check that requires querying across the dates (in my case, across the datasets).
So when I try to query result1, I have to hard-code the dataset name:
select * from `xxxxxx.20190101.result1`
union all
select * from `xxxxxx.20190102.result1`
union all
...
1) First question: is there any way I could use wildcards and suffixes on datasets, like we can with tables?
2) Second question: how could I use a date function, such as DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY), to get the date values and use them in the query below
select * from `xxxxxx.20190101.result1`
union all
select * from `xxxxxx.20190102.result1`
union all
...
to replace the hard-coded values 20190101, 20190102, etc.?
There are no wildcards and/or suffixes available on BigQuery datasets (at least as of today).
Meantime, you can check a feature request for INFORMATION_SCHEMA that is in Alpha now. You can apply for it by submitting the form that is available there.
In short: you will be able to query the list of datasets in the project and then use it to construct your query. Please note - you still need to use some sort of client to script all this properly.
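A sketch of what that could look like once INFORMATION_SCHEMA is enabled for your project (the project name xxxxxx and the 90-day window come from the question; the feature is in Alpha, so the exact shape may differ):
#standardSQL
SELECT schema_name
FROM `xxxxxx`.INFORMATION_SCHEMA.SCHEMATA
WHERE schema_name >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
Your client can then template each returned name into select * from `xxxxxx.<schema_name>.result1` and join the statements with union all.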

Group By Using Wildcards in Big Query

I have this query:
SELECT SomeTableA.*
FROM SomeTableB
LEFT JOIN SomeTableA USING (XYZ)
GROUP BY SomeTableA.*
I know that I cannot do the GROUP BY part with wildcards. At the same time, I don't really like listing all the columns (can be up to 20) manually.
Could this be added as a new feature? Or is there any way to easily get the list of all 20 columns from SomeTableA for the GROUP BY part?
If you really have the exact query shown in your question - then try below instead - no grouping required
#standardSQL
SELECT DISTINCT *
FROM `project.dataset.tableA`
WHERE xyz IN (SELECT xyz FROM `project.dataset.tableB`)
As for grouping using wildcards in BigQuery - this sounds more like grouping by a STRUCT, which is not supported, so you can submit a feature request if you want - https://issuetracker.google.com/issues/new?component=187149&template=0
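If the join really does produce duplicate rows that need collapsing, one workaround sketch (a community pattern, not a dedicated feature) is to group on a JSON serialization of the whole row; the project.dataset names are placeholders as above:
#standardSQL
SELECT AS VALUE ANY_VALUE(a)
FROM `project.dataset.tableB` b
LEFT JOIN `project.dataset.tableA` a USING (xyz)
GROUP BY TO_JSON_STRING(a)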

Google BigQuery Trying to run a TABLE_RANGE_DATE

I am building a partition-based table in a dataset and I am trying to query those partitions using a date range.
Here is an example of the data:
Dataset:
logs
Tables:
logs_20170501
logs_20170502
logs_20170503
First I am trying TABLE_DATE_RANGE:
SELECT count(*) FROM TABLE_DATE_RANGE([logs.logs_],
TIMESTAMP("2017-05-01"),
TIMESTAMP("2017-05-03")) as logs_count
I keep getting: "ERROR: Error evaluating subsidiary query"
I tried these options as well:
Single quotes:
SELECT count(*) FROM TABLE_DATE_RANGE([logs.logs_],
TIMESTAMP('2017-05-01'),
TIMESTAMP('2017-05-03')) as logs_count
Add Project ID:
SELECT count(*) FROM TABLE_DATE_RANGE([main_sys_logs:logs.logs_],
TIMESTAMP('2017-05-01'),
TIMESTAMP('2017-05-03')) as logs_count
And it didn't work.
So I tried to use _TABLE_SUFFIX:
SELECT
count(*)
FROM [main_sys_logs:logs.logs_*]
WHERE _TABLE_SUFFIX BETWEEN '20170501' AND '20170503'
And I got this error:
Invalid table name: 'main_sys_logs:logs.logs_*'
I have been switching the SQL dialect between legacy SQL on/off and just got different errors on the table name part.
Are there any tips or help for this matter? Maybe my table name is built wrong, with the "_" at the end causing the problem? Thanks for any help.
So I tried this query and it worked:
SELECT count(*) FROM TABLE_DATE_RANGE(logs.logs_,
TIMESTAMP("2017-05-01"),
TIMESTAMP("2017-05-03")) as logs_count
It started to work after I ran this query; I don't know if this is the reason, but I just queried the __TABLES__ metadata for the dataset:
SELECT *
FROM logs.__TABLES__
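For what it's worth, the _TABLE_SUFFIX attempt failed because wildcard tables only work in standard SQL with backtick-quoted names, not with legacy [project:dataset.table] brackets. A sketch of the equivalent standard SQL query, reusing the names from the question:
#standardSQL
SELECT COUNT(*) AS logs_count
FROM `main_sys_logs.logs.logs_*`
WHERE _TABLE_SUFFIX BETWEEN '20170501' AND '20170503'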