Using BigQuery, is there a way I can select __TABLES__ from every dataset within my project? I've tried SELECT * FROM '*.__TABLES' but that is not allowed within BigQuery. Any help would be great, thanks!
You can use this SQL query to generate the list of datasets in your project:
select string_agg(
concat("select * from `[PROJECT ID].", schema_name, ".__TABLES__` ")
, "union all \n"
)
from `[PROJECT ID]`.INFORMATION_SCHEMA.SCHEMATA;
You will get a list like this:
select * from `[PROJECT ID].[DATASET ID 1].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 2].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 3].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 4].__TABLES__`
...
Then paste the generated list into this query:
SELECT
table_id
,DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date
,DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
,row_count
,size_bytes
,round(safe_divide(size_bytes, (1000*1000)),1) as size_mb
,round(safe_divide(size_bytes, (1000*1000*1000)),2) as size_gb
,CASE
WHEN type = 1 THEN 'table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external'
ELSE '?'
END AS type
,TIMESTAMP_MILLIS(creation_time) AS creation_time
,TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time
,FORMAT_TIMESTAMP("%Y-%m", TIMESTAMP_MILLIS(last_modified_time)) as last_modified_month
,dataset_id
,project_id
FROM
(
select * from `[PROJECT ID].[DATASET ID 1].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 2].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 3].__TABLES__` union all
select * from `[PROJECT ID].[DATASET ID 4].__TABLES__`
)
ORDER BY dataset_id, table_id asc
The __TABLES__ syntax is supported only for a specific dataset and does not work across datasets.
What you can do instead is something like the below:
#standardSQL
WITH ALL__TABLES__ AS (
SELECT * FROM `bigquery-public-data.1000_genomes.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.baseball.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.bls.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.census_bureau_usa.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.cloud_storage_geo_index.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.cms_codes.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.common_us.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.fec.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.genomics_cannabis.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.ghcn_d.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.ghcn_m.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.github_repos.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.hacker_news.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.irs_990.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.medicare.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.new_york.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.nlm_rxnorm.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.noaa_gsod.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.open_images.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.samples.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.san_francisco.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.stackoverflow.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.usa_names.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.utility_us.__TABLES__`
)
SELECT *
FROM ALL__TABLES__
In this case you need to know the list of datasets in advance, which you can easily obtain via the Datasets: list API or the corresponding bq ls command.
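As a minimal sketch, assuming you have already captured the dataset names (for example from the output of `bq ls`), the UNION ALL text can be generated with a few lines of Python; the project and dataset names below are placeholders:

```python
# Build the UNION ALL query text from a list of dataset names.
# "my-project" and the dataset names are placeholders for illustration.
project = "my-project"
datasets = ["dataset_a", "dataset_b", "dataset_c"]

selects = [
    f"SELECT * FROM `{project}.{d}.__TABLES__`"
    for d in datasets
]
query = "\nUNION ALL\n".join(selects)
print(query)
```

The resulting string can then be pasted into (or wrapped by) the metadata query shown earlier.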
Please note: the above approach will only work for datasets whose data is in the same location. If you have datasets with data in different locations, you will need to query them in separate queries.
For example:
#standardSQL
WITH ALL_EU__TABLES__ AS (
SELECT * FROM `bigquery-public-data.common_eu.__TABLES__` UNION ALL
SELECT * FROM `bigquery-public-data.utility_eu.__TABLES__`
)
SELECT *
FROM ALL_EU__TABLES__
I know you asked about doing this in BigQuery, but I wrote a Python script to get the information you are asking for; maybe it can help other coders:
Pip install:
!pip install google-cloud
!pip install google-api-python-client
!pip install oauth2client
Code:
import threading
from google.cloud import bigquery

def _worker_query(project, dataset_id, results_scan):
    query_str = 'SELECT * FROM `{}.{}.__TABLES__`'.format(project, dataset_id)
    query_job = client.query(query_str)
    rows = query_job.result()
    count = 0
    for row in rows:
        count = count + 1
    results_scan.append({'dataset_id': dataset_id, 'count': count})

def main_execute():
    project = 'bigquery-public-data'
    datasets = client.list_datasets(project)
    threads_project = []
    results_scan = []
    for d in datasets:
        t = threading.Thread(target=_worker_query, args=(project, d.dataset_id, results_scan))
        threads_project.append(t)
        t.start()
    for t in threads_project:
        t.join()
    total_count = 0
    for result in results_scan:
        print(result)
        total_count = total_count + result['count']
    print('\n\nTOTAL TABLES: "{}"'.format(total_count))

JSON_FILE_NAME = 'sa_bq.json'
client = bigquery.Client.from_service_account_json(JSON_FILE_NAME)
main_execute()
You can extend Mikhail Berlyant's answer and automatically generate the SQL with a single query.
The INFORMATION_SCHEMA.SCHEMATA view lists all the datasets. You can use a WHILE loop to generate all of the UNION ALL statements dynamically, like so:
DECLARE schemas ARRAY<string>;
DECLARE query string;
DECLARE i INT64 DEFAULT 0;
DECLARE arrSize INT64;
SET schemas = ARRAY(select schema_name from `<your_project>`.INFORMATION_SCHEMA.SCHEMATA);
SET query = "SELECT * FROM (";
SET arrSize = ARRAY_LENGTH(schemas);
WHILE i < arrSize - 1 DO
SET query = CONCAT(query, "SELECT '", schemas[OFFSET(i)], "', table_ID, row_count, size_bytes from <your_project>.", schemas[OFFSET(i)], '.__TABLES__ UNION ALL ');
SET i = i + 1;
END WHILE;
SET query = CONCAT(query, "SELECT '", schemas[ORDINAL(arrSize)], "', table_ID, row_count, size_bytes from <your_project>.", schemas[ORDINAL(arrSize)], '.__TABLES__ )');
EXECUTE IMMEDIATE query;
Building on @mikhail-berlyant's nice solution above, it's now possible to take advantage of BigQuery's scripting features to automate gathering the list of datasets and retrieving table metadata. Simply replace the *_name variables to generate a view of metadata for all of your tables in a given project.
DECLARE project_name STRING;
DECLARE dataset_name STRING;
DECLARE table_name STRING;
DECLARE view_name STRING;
DECLARE generate_metadata_query_for_all_datasets STRING;
DECLARE retrieve_table_metadata STRING;
DECLARE persist_table_metadata STRING;
DECLARE create_table_metadata_view STRING;
SET project_name = "your-project";
SET dataset_name = "your-dataset";
SET table_name = "your-table";
SET view_name = "your-view";
SET generate_metadata_query_for_all_datasets = CONCAT("SELECT STRING_AGG( CONCAT(\"select * from `",project_name,".\", schema_name, \".__TABLES__` \"), \"union all \\n\" ) AS datasets FROM `",project_name,"`.INFORMATION_SCHEMA.SCHEMATA");
SET retrieve_table_metadata = generate_metadata_query_for_all_datasets;
SET create_table_metadata_view = CONCAT(
"""
CREATE VIEW IF NOT EXISTS
`""",project_name,".",dataset_name,".",view_name,"""`
AS
SELECT
project_id
,dataset_id
,table_id
,DATE(TIMESTAMP_MILLIS(creation_time)) AS created_date
,TIMESTAMP_MILLIS(creation_time) AS created_at
,DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date
,TIMESTAMP_MILLIS(last_modified_time) AS last_modified_at
,row_count
,size_bytes
,round(safe_divide(size_bytes, (1000*1000)),1) as size_mb
,round(safe_divide(size_bytes, (1000*1000*1000)),2) as size_gb
,CASE
WHEN type = 1 THEN 'native table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external table'
ELSE 'unknown'
END AS type
FROM `""",project_name,".",dataset_name,".",table_name,"""`
ORDER BY dataset_id, table_id asc""");
EXECUTE IMMEDIATE retrieve_table_metadata INTO persist_table_metadata;
EXECUTE IMMEDIATE CONCAT("CREATE OR REPLACE TABLE `",project_name,".",dataset_name,".",table_name,"` AS (",persist_table_metadata,")");
EXECUTE IMMEDIATE create_table_metadata_view;
After that you can query your new view.
SELECT * FROM `[PROJECT ID].[DATASET ID].[VIEW NAME]`
Maybe you can use INFORMATION_SCHEMA instead of __TABLES__:
SELECT * FROM `region-us`.INFORMATION_SCHEMA.TABLES;
Just replace region-us with the region where your datasets are.
If you have more than one region, you'll need to run a separate query per region (a single query cannot span locations), but that is still simpler than a UNION for every dataset.
Or you can use a query to generate all the UNION ALLs, like this:
With SelectTable AS (
SELECT 1 AS ID,'SELECT * FROM '|| table_schema ||'.__TABLES__ UNION ALL' AS SelectColumn FROM `region-us`.INFORMATION_SCHEMA.TABLES
GROUP BY table_schema
)
Select STRING_AGG(SelectColumn,'\n') FROM SelectTable
GROUP BY ID
Slight changes referring to @Dinh Tran's answer:
#!/bin/bash
project_name="abc-project-name"
echo -e "project_id,dataset_id,table_id,row_count,size_mb,size_gb,type,partiton,partition_expiration_days,cluster_key" > /tmp/bq_out.csv
for dataset in $(bq ls|tail -n +3); do
bq query --format=csv --use_legacy_sql=false '
SELECT
t1.project_id as project_id,
t1.dataset_id as dataset_id ,
t1.table_id as table_id,
t1.row_count as row_count,
round(safe_divide(t1.size_bytes, (1000*1000)),1) as size_mb,
round(safe_divide(t1.size_bytes, (1000*1000*1000)),2) as size_gb,
case
when t1.type = 1 then "table"
when t1.type = 2 then "view"
when t1.type = 3 then "external"
else "?"
END AS type,
case
when t2.ddl like "%PARTITION BY%" then "Yes"
else "No"
end as partiton,
REGEXP_EXTRACT(t2.ddl, r".*partition_expiration_days=([0-9-].*)") as partition_expiration_days,
REGEXP_EXTRACT(t2.ddl, r"CLUSTER BY(.*)") as cluster_key,
FROM `'"${project_name}"'.'"${dataset}"'.__TABLES__` as t1,`'"${project_name}"'.'"${dataset}"'.INFORMATION_SCHEMA.TABLES` as t2
where t1.table_id=t2.table_name' | sed "1d" >> /tmp/bq_out.csv
done
Mikhail Berlyant's answer is very good. I would like to add that there's a cleaner way you can use in some cases.
If all the tables are within the same dataset and they follow a naming pattern, you can query them using a wildcard table.
Let's say you want to query the noaa_gsod dataset (its tables are named gsod1929, gsod1930, ..., gsod2018, gsod2019); then simply use
FROM
`bigquery-public-data.noaa_gsod.gsod*`
This is going to match all tables in the noaa_gsod dataset that begin with the string gsod.
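For example, to restrict the wildcard to a range of years, you can filter on the _TABLE_SUFFIX pseudo-column, which holds whatever part of the table name the * matched; this is a sketch against the public dataset mentioned above:

```sql
-- Count rows per yearly table, restricted to 2015-2019 via _TABLE_SUFFIX.
SELECT
  _TABLE_SUFFIX AS year,
  COUNT(*) AS row_count
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '2015' AND '2019'
GROUP BY year
ORDER BY year;
```

Note that wildcard tables require standard SQL and only work over tables with compatible schemas.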
I have this code below, but I keep getting an error that says A_NUM was specified multiple times for A.
SELECT Alleg,
CASE
WHEN ALLEGCU LIKE '%F%'
OR ALLEGCU LIKE'%A%' THEN 'EF'
ELSE ALLEGCU
END,
ALLEGCU
FROM
(SELECT *, CASE
WHEN ISNUMERIC(SUBSTRING(LTRIM(RTRIM(ALL)), 1,1)) = 0
THEN UPPER(LTRIM(RTRIM(SUBSTRING(ALL, CHARINDEX(' ', ALL)+1, LEN(ALL)
- CHARINDEX(' ', ALL)))))
ELSE ALL
END ALLEGCU
FROM ASSI_O
LEFT OUTER JOIN ALLS
ON ASSI_O.A_NUM = ALLS.A_NUM ) A
The A_NUM column exists in both tables, ASSI_O and ALLS, and you are doing a SELECT * while doing the join.
Select only the needed columns, and also prefix the table name to the column name.
Changing SELECT * to SELECT ASSI_O.A_NUM as ASSI_NUM, ALLS.A_NUM as ALLS_ANUM should solve the problem:
SELECT ASSI_O.A_NUM as ASSI_NUM, ALLS.A_NUM as ALLS_ANUM ...
I have the following select query, which runs fine as expected (its purpose is to build a list of words from strings in another table):
SELECT UPPER(LTRIM(RTRIM(allWords.words)))
FROM
(
SELECT section.Cols.value('.', 'varchar(250)') words
FROM #xml.nodes('/c') section(Cols)
) AS allWords
WHERE
LTRIM(RTRIM(words)) <> ''
AND dbo.RegExIsMatch('.*[\W\d].*',LTRIM(RTRIM(words)),1) <> 1
AND LEN(LTRIM(RTRIM(words))) > 3
GROUP BY words
But when I add the INTO clause it fails:
IF OBJECT_ID('usr.nameList') IS NOT NULL DROP TABLE usr.nameList;
SELECT UPPER(LTRIM(RTRIM(allWords.words)))
INTO usr.nameList
FROM
(
SELECT section.Cols.value('.', 'varchar(250)') words
FROM #xml.nodes('/c') section(Cols)
) AS allWords
WHERE
LTRIM(RTRIM(words)) <> ''
AND dbo.RegExIsMatch('.*[\W\d].*',LTRIM(RTRIM(words)),1) <> 1
AND LEN(LTRIM(RTRIM(words))) > 3
GROUP BY words
Output is:
An object or column name is missing or empty. For SELECT INTO
statements, verify each column has a name. For other statements, look
for empty alias names. Aliases defined as "" or [] are not allowed.
Change the alias to a valid name.
If the SELECT query works, why would such a simple modification to insert the data into a new table fail?
What are you missing in the error message? You need to give the column a name:
SELECT UPPER(LTRIM(RTRIM(allWords.words))) as words
INTO usr.nameList
FROM
(
SELECT section.Cols.value('.', 'varchar(250)') words
FROM #xml.nodes('/c') section(Cols)
) AS allWords
WHERE
LTRIM(RTRIM(words)) <> ''
AND dbo.RegExIsMatch('.*[\W\d].*',LTRIM(RTRIM(words)),1) <> 1
AND LEN(LTRIM(RTRIM(words))) > 3
GROUP BY words;
I have a SQL script which is nothing but a combination of multiple SELECT queries, like:
Select * from ABC
Select * from CD
Select * from EN
Now when I execute it, I get output like:
<output 1>
<output 2>
<output 3>
Requirement: I need a title to be displayed for each of the outputs.
To be more clear, I want output like:
Heading for Output of SQL query 1
output 1
Heading for Output of SQL query 2
output 2
Heading for Output of SQL query 3
output 3
The database is SQL Server 2008 R2.
There are a lot of ways to achieve this. What exactly do you need it for?
1.
SELECT 'ABC' As title
Select * from ABC
SELECT 'CD' As title
Select * from CD
SELECT 'EN' As title
Select * from EN
2.
Select 'ABC' As title, * from ABC
Select 'CD' As title, * from CD
Select 'EN' As title, * from EN
3.
Works for SQL Server; not sure about other DBs.
PRINT 'ABC'
Select * from ABC
PRINT 'CD'
Select * from CD
PRINT 'EN'
Select * from EN
Probably this has been asked before, but I cannot find an answer.
Table Data has two columns:
Source Dest
1 2
1 2
2 1
3 1
I am trying to come up with an MS Access 2003 SQL query that will return:
1 2
3 1
But all to no avail. Please help!
UPDATE: exactly, I'm trying to exclude (2,1) because (1,2) is already included. I need only the unique combinations where order doesn't matter.
For MS Access you can try:
SELECT DISTINCT
*
FROM Table1 tM
WHERE NOT EXISTS(SELECT 1 FROM Table1 t WHERE tM.Source = t.Dest AND tM.Dest = t.Source AND tm.Source > t.Source)
EDIT:
Example with table Data, which is the same...
SELECT DISTINCT
*
FROM Data tM
WHERE NOT EXISTS(SELECT 1 FROM Data t WHERE tM.Source = t.Dest AND tM.Dest = t.Source AND tm.Source > t.Source)
or (nicely Access-formatted):
SELECT DISTINCT *
FROM Data AS tM
WHERE (((Exists (SELECT 1 FROM Data t WHERE tM.Source = t.Dest AND tM.Dest = t.Source AND tm.Source > t.Source))=False));
Your question is stated imprecisely: "unique combinations" would be all of your records. But I think you mean one line per Source, so it is:
SELECT *
FROM tab t1
WHERE t1.Dest IN
(
SELECT TOP 1 DISTINCT t2.Dest
FROM tab t2
WHERE t1.Source = t2.Source
)
SELECT t1.* FROM
(SELECT
LEAST(Source, Dest) AS min_val,
GREATEST(Source, Dest) AS max_val
FROM table_name) AS t1
GROUP BY t1.min_val, t1.max_val
Will return
1, 2
1, 3
in MySQL.
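The same LEAST/GREATEST normalization can be sketched outside SQL; this minimal Python example, using made-up rows matching the question's data, shows the idea of keying each pair on its sorted form:

```python
# Deduplicate unordered pairs by normalizing each (source, dest)
# to (min, max), mirroring the LEAST/GREATEST trick.
rows = [(1, 2), (1, 2), (2, 1), (3, 1)]

seen = set()
unique_pairs = []
for source, dest in rows:
    key = (min(source, dest), max(source, dest))
    if key not in seen:
        seen.add(key)
        unique_pairs.append((source, dest))

print(unique_pairs)  # -> [(1, 2), (3, 1)]
```

Each unordered pair keeps its first occurrence, which matches the output the question asks for.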
To eliminate duplicates, "select distinct" is easier than "group by":
select distinct source,dest from data;
EDIT: I see now that you're trying to get unique combinations (don't include both 1,2 and 2,1). You can do that like:
select distinct source,dest from data
minus
select dest,source from data where source < dest
The "minus" flips the order around and eliminates cases where you already have a match; the "where source < dest" keeps you from removing both (1,2) and (2,1).
Use this query:
SELECT DISTINCT * FROM tabval;