BigQuery select __TABLES__ from all tables within project? - sql

Using BigQuery, is there a way I can select __TABLES__ from every dataset within my project? I've tried SELECT * FROM '*.__TABLES' but that is not allowed within BigQuery. Any help would be great, thanks!

You can use this SQL query to generate the list of dataset in your project:
select string_agg(
concat("select * from `[PROJECT ID].", schema_name, ".__TABLES__` ")
, "union all \n"
)
from `[PROJECT ID]`.INFORMATION_SCHEMA.SCHEMATA;
You will have this list:
select * from `[PROJECT ID].[DATASET ID 1].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 2].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 3].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 4].__TABLES__`
...
Then put the list within this query:
SELECT
table_id
,DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date
,DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
,row_count
,size_bytes
,round(safe_divide(size_bytes, (1000*1000)),1) as size_mb
,round(safe_divide(size_bytes, (1000*1000*1000)),2) as size_gb
,CASE
WHEN type = 1 THEN 'table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external'
ELSE '?'
END AS type
,TIMESTAMP_MILLIS(creation_time) AS creation_time
,TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time
,FORMAT_TIMESTAMP("%Y-%m", TIMESTAMP_MILLIS(last_modified_time)) as last_modified_month
,dataset_id
,project_id
FROM
(
select * from `[PROJECT ID].[DATASET ID 1].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 2].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 3].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 4].__TABLES__`
)
ORDER BY dataset_id, table_id asc

__TABLES__ syntax is supported only for specific dataset and does not work across datasets
What you can do is something as below
#standardSQL
WITH ALL__TABLES__ AS (
SELECT * FROM `bigquery-public-data.1000_genomes.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.baseball.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.bls.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.census_bureau_usa.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.cloud_storage_geo_index.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.cms_codes.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.common_us.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.fec.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.genomics_cannabis.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.ghcn_d.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.ghcn_m.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.github_repos.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.hacker_news.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.irs_990.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.medicare.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.new_york.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.nlm_rxnorm.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.noaa_gsod.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.open_images.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.samples.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.san_francisco.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.stackoverflow.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.usa_names.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.utility_us.__TABLES__`
)
SELECT *
FROM ALL__TABLES__
In this case you need to know in advance list of datasets, which you can easily do via Datasets: list API or using respective bq ls
Please note: above approach will work only for datasets with data in same location. If you have datasets with data in different locations you will need to query them in two different queries
For example:
#standardSQL
WITH ALL_EU__TABLES__ AS (
SELECT * FROM `bigquery-public-data.common_eu.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.utility_eu.__TABLES__`
)
SELECT *
FROM ALL_EU__TABLES__

I know that you ask for using BigQuery, but I did a Python Script to get this information that you are asking for, maybe it is can help another coders:
Pip install:
!pip install google-cloud
!pip install google-api-python-client
!pip install oauth2client
Code:
import subprocess
import sys
import threading
from google.cloud import bigquery
def _worker_query(project, dataset_id, results_scan ):
query_str = 'SELECT * FROM `{}.{}.__TABLES__`'.format(project, dataset_id )
QUERY = (query_str)
query_job = client.query(QUERY)
rows = query_job.result()
count=0;
for row in rows:
count = count+1
results_scan.append({'dataset_id':dataset_id, 'count':count})
def main_execute():
project = 'bigquery-public-data'
dataset = client.list_datasets(project)
count = 0
threads_project = []
results_scan = []
for d in dataset:
t = threading.Thread(target=_worker_query, args=(project,d.dataset_id, results_scan))
threads_project.append(t)
t.start()
for t in threads_project:
t.join()
total_count = 0
for result in results_scan:
print(result)
total_count = total_count + result['count']
print('\n\nTOTAL TABLES: "{}"'.format(total_count))
JSON_FILE_NAME = 'sa_bq.json'
client = bigquery.Client.from_service_account_json(JSON_FILE_NAME)
main_execute()

You can extend Mikhail Berlyant's answer and automatically generate the SQL with a single query.
The INFORMATION_SCHEMA.SCHEMATA lists all the datasets. You can use a WHILE loop to generate all of the UNION ALL statements dynamically, like so:
DECLARE schemas ARRAY<string>;
DECLARE query string;
DECLARE i INT64 DEFAULT 0;
DECLARE arrSize INT64;
SET schemas = ARRAY(select schema_name from <your_project>.INFORMATION_SCHEMA.SCHEMATA);
SET query = "SELECT * FROM (";
SET arrSize = ARRAY_LENGTH(schemas);
WHILE i < arrSize - 1 DO
SET query = CONCAT(query, "SELECT '", schemas[OFFSET(i)], "', table_ID, row_count, size_bytes from <your project>.", schemas[OFFSET(i)], '.__TABLES__ UNION ALL ');
SET i = i + 1;
END WHILE;
SET query = CONCAT(query, "SELECT '", schemas[ORDINAL(arrSize)], "', table_ID, row_count, size_bytes from <your project>.", schemas[ORDINAL(arrSize)], '.__TABLES__` )');
EXECUTE IMMEDIATE query;

Building on #mikhail-berlyant 's nice solution above it's now possible to take advantage of BigQuery's scripting features to automate gathering the list of dataset and retreiving table metadata. Simply replace the *_name variables to generate a view of metadata for all of your tables in a given project.
DECLARE project_name STRING;
DECLARE dataset_name STRING;
DECLARE table_name STRING;
DECLARE view_name STRING;
DECLARE generate_metadata_query_for_all_datasets STRING;
DECLARE retrieve_table_metadata STRING;
DECLARE persist_table_metadata STRING;
DECLARE create_table_metadata_view STRING;
SET project_name = "your-project";
SET dataset_name = "your-dataset";
SET table_name = "your-table";
SET view_name = "your-view";
SET generate_metadata_query_for_all_datasets = CONCAT("SELECT STRING_AGG( CONCAT(\"select * from `",project_name,".\", schema_name, \".__TABLES__` \"), \"union all \\n\" ) AS datasets FROM `",project_name,"`.INFORMATION_SCHEMA.SCHEMATA");
SET
retrieve_table_metadata = generate_metadata_query_for_all_datasets;
SET create_table_metadata_view = CONCAT(
"""
CREATE VIEW IF NOT EXISTS
`""",project_name,".",dataset_name,".",view_name,"""`
AS
SELECT
project_id
,dataset_id
,table_id
,DATE(TIMESTAMP_MILLIS(creation_time)) AS created_date
,TIMESTAMP_MILLIS(creation_time) AS created_at
,DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
,TIMESTAMP_MILLIS(last_modified_time) AS last_modified_at
,row_count
,size_bytes
,round(safe_divide(size_bytes, (1000*1000)),1) as size_mb
,round(safe_divide(size_bytes, (1000*1000*1000)),2) as size_gb
,CASE
WHEN type = 1 THEN 'native table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external table'
ELSE 'unknown'
END AS type
FROM `""",project_name,".",dataset_name,".",table_name,"""`
ORDER BY dataset_id, table_id asc""");
EXECUTE IMMEDIATE retrieve_table_metadata INTO persist_table_metadata;
EXECUTE IMMEDIATE CONCAT("CREATE OR REPLACE TABLE `",project_name,".",dataset_name,".",table_name,"` AS (",persist_table_metadata,")");
EXECUTE IMMEDIATE create_table_metadata_view;
After that you can query your new view.
SELECT * FROM `[PROJECT ID].[DATASET ID].[VIEW NAME]`

Maybe you can use INFORMATION_SCHEMA instead of TABLES:
SELECT * FROM region-us.INFORMATION_SCHEMA.TABLES;
Just replace region-us for the region where your datasets are.
If you have more than a region, you'll need to use UNION ALL.. but it's simpler than using UNION for all the datasets.
Or you can use a query to get all the unions all, like this:
With SelectTable AS (
SELECT 1 AS ID,'SELECT * FROM '|| table_schema ||'.__TABLES__ UNION ALL' AS SelectColumn FROM region-us.INFORMATION_SCHEMA.TABLES
GROUP BY table_schema
)
Select STRING_AGG(SelectColumn,'\n') FROM SelectTable
GROUP BY ID

slight changes referring to > #Dinh Tran answer:
#!/bin/bash
project_name="abc-project-name"
echo -e "project_id,dataset_id,table_id,row_count,size_mb,size_gb,type,partiton,partition_expiration_days,cluster_key" > /tmp/bq_out.csv
for dataset in $(bq ls|tail -n +3); do
bq query --format=csv --use_legacy_sql=false '
SELECT
t1.project_id as project_id,
t1.dataset_id as dataset_id ,
t1.table_id as table_id,
t1.row_count as row_count,
round(safe_divide(t1.size_bytes, (1000*1000)),1) as size_mb,
round(safe_divide(t1.size_bytes, (1000*1000*1000)),2) as size_gb,
case
when t1.type = 1 then "table"
when t1.type = 2 then "view"
when t1.type = 3 then "external"
else "?"
END AS type,
case
when t2.ddl like "%PARTITION BY%" then "Yes"
else "No"
end as partiton,
REGEXP_EXTRACT(t2.ddl, r".*partition_expiration_days=([0-9-].*)") as partition_expiration_days,
REGEXP_EXTRACT(t2.ddl, r"CLUSTER BY(.*)") as cluster_key,
FROM `'"${project_name}"'.'"${dataset}"'.__TABLES__` as t1,`'"${project_name}"'.'"${dataset}"'.INFORMATION_SCHEMA.TABLES` as t2
where t1.table_id=t2.table_name' | sed "1d" >> /tmp/bq_out.csv
done

Mikhail Berlyant's answer is very good. I would like to add there's a cleaner way to use in some cases.
So, if you have only one dataset, the tables are within the same dataset and they follow a pattern, you could query them using a wildcard table.
Let's say you want to query the noaa_gsod dataset (its tables have the following names gsod1929, gsod1930, ... 2018, 2019), then simply use
FROM
`bigquery-public-data.noaa_gsod.gsod*`
This is going to match all tables in the noaa_gsod dataset that begin with the string gsod.

Related

Multiple Linked Servers in one select statement with one where clause, possible?

Got a tricky one today (Might even just be me):
I have 8 Linked SQL 2012 servers configured to my main SQL server and I need to create table views so that I can filter all these combined table results only using one where clause, currently I use UNION because they all have the same table structures.
Currently my solution looks as follows:
SELECT * FROM [LinkedServer_1].[dbo].[Table] where value = 'xxx'
UNION
SELECT * FROM [LinkedServer_2].[dbo].[Table] where value = 'xxx'
UNION
SELECT * FROM [LinkedServer_3].[dbo].[Table] where value = 'xxx'
UNION
SELECT * FROM [LinkedServer_4].[dbo].[Table] where value = 'xxx'
UNION
SELECT * FROM [LinkedServer_5].[dbo].[Table] where value = 'xxx'
UNION
SELECT * FROM [LinkedServer_6].[dbo].[Table] where value = 'xxx'
UNION
SELECT * FROM [LinkedServer_7].[dbo].[Table] where value = 'xxx'
UNION
SELECT * FROM [LinkedServer_8].[dbo].[Table] where value = 'xxx'
As you can see this is becoming quite ugly because I have a select statement and where clause for each linked server and would like to know if there was a simpler way of doing this!
Appreciate the feedback.
Brakkie101
Instead of using views, you can use inline table-valued functions (a view with parameters). It will not save initial efforts for creating the queries, but could save some work in the future:
CREATE FUNCTION [dbo].[fn_LinkedSever] (#value NVARCHAR(128))
AS
RETURNS TABLE
AS
RETURN
(
SELECT * FROM [LinkedServer_1].[dbo].[Table] where value = #value
UNION
SELECT * FROM [LinkedServer_2].[dbo].[Table] where value = #value
UNION
SELECT * FROM [LinkedServer_3].[dbo].[Table] where value = #value
UNION
SELECT * FROM [LinkedServer_4].[dbo].[Table] where value = #value
UNION
SELECT * FROM [LinkedServer_5].[dbo].[Table] where value = #value
UNION
SELECT * FROM [LinkedServer_6].[dbo].[Table] where value = #value
UNION
SELECT * FROM [LinkedServer_7].[dbo].[Table] where value = #value
UNION
SELECT * FROM [LinkedServer_8].[dbo].[Table] where value = #value
);
Also, if possible, use UNION ALL instead of UNION.

SQL join for Loop?

I have beginner knowledge on SQL and I am wondering whether this is possible in SQL.
SQL query 1 >>
select distinct(id) as active_pod from schema_naming
Query 1 output >>
active_pod
DB_1
DB_2
...
DB_20
SQL query 2 >>
select * from DB_1.mapping UNION
select * from DB_2.mapping UNION
....
select * from DB_20.mapping UNION
Due to my limited knowledge on SQL, I'm currently running #1 query first and change DB_1, DB2,.. DB_20 in query 2 everytime and run #2.
However, I was wondering whether there's way to this in one query so I don't have to manually change DB number in the #2 query and don't have to union every line.
something like this..(but not sure what to do with union)
select * from {
select distinct id from schema_naming}.user_map
It will be great if someone can shed light on this. (I'm trying to do this on Oracle SQL)
thank you in advance.
Are you trying to get something like this?
SELECT 'SELECT * FROM ' || active_pod || '.' || 'Mapping UNION'
FROM
(
select distinct(id) as active_pod from schema_naming
) as DT;
Alternatively, use PL/SQL block:
BEGIN
For i in (SELECT 'SELECT * FROM ' || ACTIVE_POD || '.MAPPING UNION' AS QUERY
FROM SCHEMA_NAMING) loop
dbms_output.put_line(i.query);
end loop;
END
Your queries will appear in the output window on your IDE.
This is a definitely a hack but it might make your life easier until a better solution is proposed. Basically use a query to generate your 2nd query, only manual edit needed would to remove the unecessary UNION on the final line.
SELECT 'SELECT * FROM ' || ACTIVE_POD || '.MAPPING UNION' AS QUERY
FROM SCHEMA_NAMING
Results:
SELECT * FROM DB_1.MAPPING UNION
SELECT * FROM DB_2.MAPPING UNION
SELECT * FROM DB_3.MAPPING UNION
SELECT * FROM DB_4.MAPPING UNION
SELECT * FROM DB_5.MAPPING UNION
SELECT * FROM DB_6.MAPPING UNION
SELECT * FROM DB_7.MAPPING UNION
SELECT * FROM DB_8.MAPPING UNION
SELECT * FROM DB_9.MAPPING UNION
SELECT * FROM DB_10.MAPPING UNION
SELECT * FROM DB_11.MAPPING UNION
SELECT * FROM DB_12.MAPPING UNION
SELECT * FROM DB_13.MAPPING UNION
SELECT * FROM DB_14.MAPPING UNION
SELECT * FROM DB_15.MAPPING UNION
SELECT * FROM DB_16.MAPPING UNION
SELECT * FROM DB_17.MAPPING UNION
SELECT * FROM DB_18.MAPPING UNION
SELECT * FROM DB_19.MAPPING UNION
SELECT * FROM DB_20.MAPPING UNION

Input to determine how many datasets to pull from

Hopefully this will make sense:
Basically I want the user to input a variable at the start to determine how many datasets they want to compare:
DECLARE #NumberOfRuns INT
SET #NumberOfRuns = (1,2,3.. normally just 2)
After this I will have a query on multiple datasets:
SELECT*
FROM (Dataset 1)
UNION ALL
SELECT*
FROM (Dataset 2)
Is there a way that if the input for #NumberOfRuns = 1 it only runs the first part, and if the input was 2 it would run both?
Which rdbms? sql-server, mysql oracle? etc. For the non-dynamic route one other method is to use temp table
CREATE TABLE #TempTable (
--Columns And DataTypes of Select Statements
)
INSERT INTO #TempTable (columns)
SELECT *
FROM
Dataset 1
IF #NumberOfRuns >= 2
INSERT INTO #TempTable (columns)
SELECT *
FROM
Dataset 2
IF #NumberOfRuns >= 3
INSERT INTO #TempTable (columns)
SELECT *
FROM
Dataset 3
SELECT *
FROM
#TempTable
You can do an IF ... ELSE ... and hard code it into
IF #NumberOfRuns = 1
SELECT * FROM Dataset1
IF #NumberOfRuns = 2
SELECT * FROM Dataset1
UNION ALL
SELECT * FROM Dataset2
IF #NumberOfRuns = 3
SELECT * FROM Dataset1
UNION ALL
SELECT * FROM Dataset2
UNION ALL
SELECT * FROM Dataset3
etc...
In SQL Server you could just write your union ONCE and pull it into a #Temp and then filter on a column you set.
select
1 as ColumnID
....
INTO #Temp
....
UNION ALL
SELECT
2 as ColumnID
....
UNION ALL
SELECT
3 as ColumnID
....
SELECT * FROM #Temp WHERE ColumnID <= #NumberOfRuns

Using UNION with Sequel

I want to define a SQL-command like this:
SELECT * FROM WOMAN
UNION
SELECT * FROM MEN
I tried to define this with the following code sequence in Ruby + Sequel:
require 'sequel'
DB = Sequel::Database.new()
sel = DB[:women].union(DB[:men])
puts sel.sql
The result is (I made some pretty print on the result):
SELECT * FROM (
SELECT * FROM `women`
UNION
SELECT * FROM `men`
) AS 't1'
There is an additional (superfluous?) SELECT.
If I define multiple UNION like in this code sample
sel = DB[:women].union(DB[:men]).union(DB[:girls]).union(DB[:boys])
puts sel.sql
I get more superfluous SELECTs.
SELECT * FROM (
SELECT * FROM (
SELECT * FROM (
SELECT * FROM `women`
UNION
SELECT * FROM `men`
) AS 't1'
UNION
SELECT * FROM `girls`
) AS 't1'
UNION
SELECT * FROM `boys`
) AS 't1'
I detected no problem with it up to now, the results seem to be the same.
My questions:
Is there a reason for the additional selects (beside sequel internal procedures)
Can I avoid the selects?
Can I get problems with this additional selects? (Any Performance issue?)
The reason for the extra SELECTs is so code like DB[:girls].union(DB[:boys]).where(:some_column=>1) operates properly. You can use DB[:girls].union(DB[:boys], :from_self=>false) to not wrap it in the extra SELECTs, as mentioned in the documentation.

How can I treat a UNION query as a sub query

I have a set of tables that are logically one table split into pieces for performance reasons. I need to write a query that effectively joins all the tables together so I use a single where clause of the result. I have successfully used a UNION on the result of using the WHERE clause on each subtable explicitly as in the following
SELECT * FROM FRED_1 WHERE CHARLIE = 42
UNION
SELECT * FROM FRED_2 WHERE CHARLIE = 42
UNION
SELECT * FROM FRED_3 WHERE CHARLIE = 42
but as there are ten separate subtables updating the WHERE clause each time is a pain. What I want is something like this
SELECT *
FROM (
SELECT * FROM FRED_1
UNION
SELECT * FROM FRED_2
UNION
SELECT * FROM FRED_3)
WHERE CHARLIE = 42
If it makes a difference the query needs to run against a DB2 database.
Here is a more comprehensive (sanitised) version of what I need to do.
select *
from ( select * from FRD_1 union select * from FRD_2 union select * from FRD_3 ) as FRD,
( select * from REQ_1 union select * from REQ_2 union select * from REQ_3 ) as REQ,
( select * from RES_1 union select * from RES_2 union select * from RES_3 ) as RES
where FRD.KEY1 = 123456
and FRD.KEY1 = REQ.KEY1
and FRD.KEY1 = RES.KEY1
and REQ.KEY2 = RES.KEY2
NEW INFORMATION:
It looks like the problem has more to do with the number of fields in the union than anything else. If I greatly restrict the fields I can get most of the syntax variations below working. Unfortunately, restricting the fields so much means the resulting query, while potentially useful, is not giving me the result I wanted. I've managed to get an additional 3 fields from one of the tables in addition to the 2 keys. Any more than that and the query fails.
I believe you have to give a name to your subquery result. I don't know db2 so I'm taking a shot in the dark, but I know this works on several other platforms.
SELECT *
FROM (
SELECT * FROM FRED_1
UNION
SELECT * FROM FRED_2
UNION
SELECT * FROM FRED_3) AS T1
WHERE CHARLIE = 42
If the logical implementation is a single table but the physical implementation is multiple tables then how about creating a view that defines the logical model.
CREATE VIEW VW_FRED AS
SELECT * FROM FRED_1
UNION
SELECT * FROM FRED_2
UNION
SELECT * FROM FRED_3
then it's a simple matter of
SELECT * FROM VW_FRED WHERE CHARLIE = 42
Again, I'm not familiar with db2 syntax but this gives you the general idea.
with
FRD as ( select * from FRD_1 union select * from FRD_2 union select * from FRD_3 ),
REQ as ( select * from REQ_1 union select * from REQ_2 union select * from REQ_3 ),
RES as ( select * from RES_1 union select * from RES_2 union select * from RES_3 )
SELECT * from FRD, REQ, RES
WHERE FRD.KEY1 = 123456
and FRD.KEY1 = REQ.KEY1
and FRD.KEY1 = RES.KEY1
and REQ.KEY2 = RES.KEY2
I'm not familiar with DB2 syntax but why aren't you doing this as an INNER JOIN or LEFT JOIN?
SELECT *
FROM FRED_1
INNER JOIN FRED_2
ON FRED_1.Charlie = FRED_2.Charlie
INNER JOIN FRED_3
ON FRED_1.Charlie = FRED_3.Charlie
WHERE FRED_1.Charlie = 42
If the values don't exist in FRED_2 or FRED_3 then use a LEFT/OUTER JOIN. I'm assuming that FRED_1 is a master table, and if a record exists then it will be in this table.
maybe:
SELECT * FROM
(select * from FRD_1
union
select * from FRD_2
union
select * from FRD_3) FRD
INNER JOIN (select * from REQ_1 union select * from REQ_2 union select * from REQ_3) REQ
on FRD.KEY1 = REQ.KEY1
INNER JOIN (select * from RES_1 union select * from RES_2 union select * from RES_3) RES
on FRD.KEY1 = RES.KEY1
WHERE FRD.KEY1 = 123456 and REQ.KEY2 = RES.KEY2