I'd like to view all existing datasets in a project, along with their properties, just as I can for tables using __TABLES__. Is there an equivalent syntax for querying all datasets? The following doesn't work for me:
SELECT *
FROM TABLE_QUERY([gdelt-bq:__DATASETS__], 'true')
LIMIT 1000
or
SELECT *
FROM [gdelt-bq:__DATASETS__.__TABLES__]
LIMIT 1000
but the following will give me information on all tables in a given dataset. Is there a query that can be run to get a list of all datasets in a project?
SELECT *
FROM [gdelt-bq:extra.__TABLES__]
LIMIT 1000
From what I know, unfortunately there is no __DATASETS__ equivalent of __TABLES__ - https://cloud.google.com/bigquery/querying-data#meta-tables
Still, you can list the datasets in a project with the Datasets: list API, or with bq ls.
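For example, with the bq command-line tool (using the project ID from the question):
bq ls --project_id=gdelt-bq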
Is there any way within a Snowflake SQL query to view which tables are being queried the most, as well as which columns? I want to know what data is of most value to my users and I'm not sure how to do this programmatically. Any thoughts are appreciated - thank you!
2021 update
The new ACCESS_HISTORY view has this information (in preview right now, Enterprise edition).
For example, if you want to find the most used columns:
select obj.value:objectName::string objName
, col.value:columnName::string colName
, count(*) uses
, min(query_start_time) since
, max(query_start_time) until
from snowflake.account_usage.access_history
, table(flatten(direct_objects_accessed)) obj
, table(flatten(obj.value:columns)) col
group by 1, 2
order by uses desc
Ref: https://docs.snowflake.com/en/sql-reference/account-usage/access_history.html
2020 answer
The best I found (for now):
For any given query, you can find which tables are scanned by looking at the plan generated for it:
SELECT *, "objects"
FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.any_table_or_view')))
WHERE "operation"='TableScan'
You can find all of your previously run queries too:
select QUERY_TEXT
from table(information_schema.query_history())
So the natural next step would be to combine both - but that's not straightforward, as you'll get an error like:
SQL compilation error: argument 1 to function EXPLAIN_JSON needs to be constant, found 'SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.c')'
The solution is to pull the query texts out of query_history() first, and run SYSTEM$EXPLAIN_PLAN_JSON over each of them from the outside (so the strings are constants); then you will be able to find out the most queried tables.
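A sketch of that combination using Snowflake Scripting (a feature added after this 2020 answer was written; the scanned_objects table and the 100-query limit are arbitrary choices here). Splicing each query text into a freshly compiled statement makes the EXPLAIN_JSON argument a constant:
EXECUTE IMMEDIATE $$
DECLARE
  c1 CURSOR FOR
    SELECT query_text
    FROM TABLE(information_schema.query_history())
    WHERE query_type = 'SELECT'
    LIMIT 100;
BEGIN
  CREATE OR REPLACE TEMPORARY TABLE scanned_objects (obj STRING);
  FOR rec IN c1 DO
    -- double any embedded single quotes so the query text becomes a valid literal
    LET stmt STRING :=
      'INSERT INTO scanned_objects ' ||
      'SELECT "objects"::string ' ||
      'FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON(''' ||
      REPLACE(rec.query_text, '''', '''''') ||
      '''))) WHERE "operation" = ''TableScan''';
    EXECUTE IMMEDIATE stmt;
  END FOR;
  RETURN 'done';
END;
$$;
After this, grouping and counting scanned_objects ranks the most queried tables.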
When trying to generate a large array using the following command
GENERATE_ARRAY(1467331200, 1530403201, 15)
I'm getting the following error:
google.api_core.exceptions.BadRequest: 400 GENERATE_ARRAY(1467331200, 1530403201, 15) produced too many elements
Is there a way to generate an array of said size?
There is a limit on the number of result elements: at most 1048575.
Test: bq query --dry_run --nouse_legacy_sql "[replace query below]"
Query: select GENERATE_ARRAY(1, 1048575) as test_array;
Output: Query successfully validated. Assuming the tables are not modified, running this query will process 0 bytes of data.
Query: select GENERATE_ARRAY(1, 1048576) as test_arr;
Output: GENERATE_ARRAY(1, 1048576, 1) produced too many elements
There's no mention of this limit in the documentation, so I suggest that you either send documentation feedback on the page or file a feature request to increase the limit or, if possible, remove it.
A possible workaround is to concatenate arrays.
Example: SELECT ARRAY_CONCAT(GENERATE_ARRAY(1,1048575), GENERATE_ARRAY(1,1048575))...
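Applied to the original range, the same idea can be written as step-aligned chunks flattened with UNNEST - a sketch (the chunk size of 1,000,000 elements is an arbitrary choice under the limit):
-- five chunks of at most 1,000,000 elements each cover the full range
SELECT ts
FROM
  UNNEST(GENERATE_ARRAY(0, 4)) AS chunk,  -- chunk index
  UNNEST(GENERATE_ARRAY(
    1467331200 + chunk * 15 * 1000000,    -- chunk start, aligned to the step
    LEAST(1467331200 + (chunk + 1) * 15 * 1000000 - 15, 1530403201),
    15)) AS ts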
If you process data daily and put the results into the same dataset, such as results, where each day's table shares the same name prefix and has the date as a suffix (result1_20190101, result1_20190102, etc.), then you can query the result tables using wildcard table names and the table suffix.
So your dataset/tables look like
results/result1_20190101
results/result1_20190102
results/result2_20190101
results/result2_20190102
So I can query all the result1 tables:
select * from `xxxx.results.result1*`
But I arrange the result tables differently. Because I have dozens of tables processed each day, I use the date as the dataset name, to easily check and manage each day's results.
So my datasets/tables look like
20190101/result1
20190101/result2
...
20190102/result1
20190102/result2
...
My daily data processing usually does not query across dates (datasets); the daily results are pushed to the next steps of the data pipeline, etc.
But once in a while I need to do a quick check, and then I need to query across the dates (in my case, across the datasets).
So when I try to query result1, I have to hard-code the dataset names:
select * from `xxxxxx.20190101.result1`
union all
select * from `xxxxxx.20190102.result1`
union all
...
1) First question: is there any way I could use wildcards and suffixes on datasets, like we can with tables?
2) Second question: how could I use a date function, such as DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY), to get the date values and use them in the query below
select * from `xxxxxx.20190101.result1`
union all
select * from `xxxxxx.20190102.result1`
union all
...
to replace the hard-coded values 20190101, 20190102, etc.?
There are no wildcards and/or suffixes available on BigQuery datasets (at least as of today).
Meantime, you can check the feature request for INFORMATION_SCHEMA, which is in Alpha now. You can apply for it by submitting the form that is available there.
In short: you will be able to query the list of datasets in the project and then use it to construct your query. Please note - you still need to use some sort of client to script all this properly.
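For the second question, the date values themselves are easy to produce in Standard SQL - a sketch (splicing them into the UNION ALL still has to happen in your client script):
SELECT FORMAT_DATE('%Y%m%d', d) AS dataset_name
FROM UNNEST(GENERATE_DATE_ARRAY(
  DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY),
  CURRENT_DATE())) AS d
ORDER BY dataset_name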
I am trying to get a sample of data from a large table and want to make sure this can be repeated later on. Other SQL dialects allow repeatable sampling, done by either setting a seed using set.seed(integer) or a repeatable (integer) clause. However, this is not working for me in Presto. Is such a command not available yet? Thanks.
One solution is to simulate the sampling by adding a column (or creating a view) with random stuff (such as a UUID) and then selecting rows by filtering on this column (for example, rows where the UUID ends with '1'). You can tune the condition to get the sample size you need.
By design, the result is random and also repeatable across multiple runs.
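A minimal sketch of that approach (table and column names are hypothetical); because the key is materialized once, the filter returns the same rows on every run:
-- store a random key alongside each row, one time
CREATE TABLE orders_with_key AS
SELECT *, CAST(uuid() AS varchar) AS sample_key
FROM orders;

-- repeatable ~1/16 sample: keep rows whose 36-char key ends in one fixed hex digit
SELECT *
FROM orders_with_key
WHERE substr(sample_key, 36, 1) = '1';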
If you are using Presto 0.263 or higher you can use key_sampling_percent to reproducibly generate a double between 0.0 and 1.0 from a varchar.
For example, to reproducibly sample 20% of records in table using the id column:
select
id
from table
where key_sampling_percent(id) < 0.2
If you are using an older version of Presto (e.g. AWS Athena), you can use what's in the source code for key_sampling_percent:
select
id
from table
where (abs(from_ieee754_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
I have found that you have to use from_big_endian_64 instead of from_ieee754_64 to get reliable results in Athena. Otherwise I got too many numbers close to zero, because of the negative exponent.
select id
from table
where (abs(from_big_endian_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
You may create a simple intermediate table with selected ids:
CREATE TABLE IF NOT EXISTS <temp1>
AS
SELECT <id_column>
FROM <tablename> TABLESAMPLE SYSTEM (10);
This will contain only the sampled ids and will be ready to use downstream in your analysis by JOINing it with the data of interest.
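For instance, the downstream join could look like this (keeping the same placeholders):
SELECT t.*
FROM <tablename> t
JOIN <temp1> s
  ON t.<id_column> = s.<id_column>;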
Is it possible to combine the table wildcard functions as documented here?
I've taken a look through the Table Query functions SO answer, but it doesn't quite seem to cover my use case.
I have table names in the format: s_CUSTOMER_ID_YYYYMMDD
I can find all the tables for a customer ID using:
SELECT *
FROM TABLE_QUERY([project:dataset],
'REGEXP_MATCH(table_id, r"^s_CUSTOMER_ID")')
And I can find all the tables for a date range via:
SELECT *
FROM (TABLE_DATE_RANGE([project:dataset],
TIMESTAMP('2016-01-01'),
TIMESTAMP('2016-03-01')))
But how do I query for both at the same time?
I tried using sub queries like this:
SELECT * FROM
(SELECT *
FROM TABLE_QUERY([project:dataset],
'REGEXP_MATCH(table_id, r"^s_CUSTOMER_ID")'))
,(SELECT *
FROM (TABLE_DATE_RANGE([project:dataset],
TIMESTAMP('2016-01-01'),
TIMESTAMP('2016-03-01'))))
...but the parser complains of Error: Can't parse table: project:dataset.
Adding a dot so they are project:dataset. brings the error Error: Error preparing subsidiary query: Dataset project:dataset. not found
Are my table names poorly done? What would be a better way of organising them if so?
Below is a quick "solution" - it should work, and you can improve it based on the real/extra requirements you probably have:
SELECT *
FROM
TABLE_QUERY([project:dataset],
'REGEXP_MATCH(table_id, r"^s_CUSTOMER_ID")
AND RIGHT(table_id, 8) BETWEEN "20160101" AND "20160301"')
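If you would rather have the window track the current date instead of fixed literals, legacy SQL date functions can be used inside the TABLE_QUERY expression - a sketch, assuming the usual usec-arithmetic idiom (the 60-day window is an arbitrary example):
SELECT *
FROM
  TABLE_QUERY([project:dataset],
    'REGEXP_MATCH(table_id, r"^s_CUSTOMER_ID")
     AND RIGHT(table_id, 8) >= STRFTIME_UTC_USEC(NOW() - 60*24*60*60*1000000, "%Y%m%d")')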