BigQuery: get tables' sizes from all datasets - google-bigquery

I have a simple query that returns the size of each table in the dataset orders:
SELECT
table_id,
TRUNC(size_bytes/1024/1024/1024/1024,2) size_tb,
FROM orders.__TABLES__
If I want to run this query once for the whole project and all its tables, how can I do it?
I tried changing the last line to FROM __TABLES__, but that raises an error.

I use this Python script for something similar (it probably originated on Stack Overflow), with my adjustments:
from google.cloud import bigquery
client = bigquery.Client()
datasets = list(client.list_datasets())
project = client.project
sizes = []
if datasets:
    print('Datasets in project {}:'.format(project))
    for dataset in datasets:  # API request(s)
        print('Dataset: {}'.format(dataset.dataset_id))
        # One query per dataset against its __TABLES__ metadata table; size is in GB (10^9 bytes).
        query_job = client.query("select table_id, sum(size_bytes)/pow(10,9) as size from `"+dataset.dataset_id+"`.__TABLES__ group by 1")
        results = query_job.result()
        for row in results:
            print("\tTable: {} : {}".format(row.table_id, row.size))
            item = {
                'project': project,
                'dataset': dataset.dataset_id,
                'table': row.table_id,
                'size': row.size
            }
            sizes.append(item)
else:
    print('{} project does not contain any datasets.'.format(project))
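Once the script has filled the sizes list, the per-table entries can be rolled up per dataset. The following is a minimal sketch of that step, not part of the original script; the function name total_by_dataset and the sample values are made up for illustration, and it only assumes the item layout used above (keys 'dataset' and 'size').

```python
from collections import defaultdict


def total_by_dataset(sizes):
    """Sum the 'size' of every item per dataset, largest dataset first."""
    totals = defaultdict(float)
    for item in sizes:
        totals[item['dataset']] += item['size']
    # Sort by total size, descending, so the biggest datasets come first.
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical sample in the same shape the script appends to `sizes`.
sample_sizes = [
    {'project': 'p', 'dataset': 'orders', 'table': 'a', 'size': 1.5},
    {'project': 'p', 'dataset': 'orders', 'table': 'b', 'size': 2.5},
    {'project': 'p', 'dataset': 'logs', 'table': 'c', 'size': 10.0},
]
for dataset, size in total_by_dataset(sample_sizes):
    print('{}: {} GB'.format(dataset, size))
```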

You could query the INFORMATION_SCHEMA views instead:
select
project_id,
TABLE_SCHEMA,
TABLE_NAME,
sum(TOTAL_PHYSICAL_BYTES) / pow(10,9) as size
from
project.region.INFORMATION_SCHEMA.TABLE_STORAGE
group by 1,2, 3
order by size DESC
Here project is your project name and region is the region where the data is located (e.g. region-us). Refer to https://cloud.google.com/bigquery/docs/information-schema-table-storage for more info.
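To make the two placeholders concrete, here is a small sketch that fills them into the query text. The helper name table_storage_query and the values "my-project" and "region-us" are assumptions for illustration only; the resulting string would still have to be run through the BigQuery client or console.

```python
def table_storage_query(project: str, region: str) -> str:
    """Return the TABLE_STORAGE size query for one project and region."""
    # Project and region are quoted with backticks because they usually
    # contain hyphens (e.g. my-project, region-us).
    return (
        "SELECT project_id, table_schema, table_name,\n"
        "       SUM(total_physical_bytes) / POW(10, 9) AS size\n"
        f"FROM `{project}`.`{region}`.INFORMATION_SCHEMA.TABLE_STORAGE\n"
        "GROUP BY 1, 2, 3\n"
        "ORDER BY size DESC"
    )


# Hypothetical project and region names.
print(table_storage_query("my-project", "region-us"))
```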

OK. Let's do it in steps:
Step 1 - Generate the query for a single project and its own datasets:
SELECT
string_agg(concat("SELECT * FROM `$_PROJECT_ID.", schema_name, ".__TABLES__` ")," UNION ALL \n")
FROM
`$_PROJECT_ID`.INFORMATION_SCHEMA.SCHEMATA;
Or, if it isn't for a single project:
Step 1.1 - List all projects, considering those that were referenced in the query history of the last 6 months (180 days):
WITH LISTA_PROJETOS AS (
SELECT DISTINCT R.PROJECT_ID
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION J, UNNEST(REFERENCED_TABLES) R
ORDER BY 1 ASC
), RESULTADOS AS (
SELECT 'SELECT \n\t' ||AGG_RESULTADOS FROM (
SELECT STRING_AGG('(SELECT STRING_AGG(CONCAT("SELECT * FROM `'||PROJECT_ID||'.", SCHEMA_NAME, ".__TABLES__` UNION ALL "), "\\n") FROM `'||PROJECT_ID||'`.INFORMATION_SCHEMA.SCHEMATA)', ' ||"\\n"||\n\t') AS AGG_RESULTADOS
FROM LISTA_PROJETOS
)
)
SELECT * FROM RESULTADOS;
If you choose step 1.1, copy its one-line output to the clipboard and execute it.
You will then get something like this:
SELECT * FROM `teste.raw.__TABLES__` UNION ALL
SELECT * FROM `teste.stage.__TABLES__` UNION ALL
Take care: the maximum number of unions for this query is 100.
You must remove the trailing UNION ALL from the last line for the query to work.
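If the dataset names are already known on the client side, the same UNION ALL text can be assembled without a trailing UNION ALL to remove. This is only a sketch of that idea; the function name union_all_tables_query is hypothetical, and the project and dataset names reuse the teste.raw / teste.stage example above.

```python
def union_all_tables_query(project, datasets):
    """Build one SELECT per dataset's __TABLES__ and join them with UNION ALL."""
    selects = [
        "SELECT * FROM `{}.{}.__TABLES__`".format(project, dataset)
        for dataset in datasets
    ]
    # str.join only puts the separator between items, so there is no
    # dangling UNION ALL after the last SELECT.
    return " UNION ALL\n".join(selects)


print(union_all_tables_query("teste", ["raw", "stage"]))
```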
Then you should do the next step:
Step 2:
/***** Query that performs the lookup... *****/
SELECT
project_id,
dataset_id,
table_id,
concat(project_id,':',dataset_id,'.',table_id) objeto,
case type
when 1 then 'TABLE'
when 2 then 'VIEW'
else 'OTHER'
end as tipo,
row_count as qtd_linhas,
round(size_bytes/power(1024, 3), 2) as tamanho_gb,
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', TIMESTAMP_MILLIS(creation_time), 'America/Sao_Paulo') as data_criacao,
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', TIMESTAMP_MILLIS(last_modified_time), 'America/Sao_Paulo') as ultima_modificacao, /* Data for the last 6 months only (GCP) */
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', MAX(last_query_in), 'America/Sao_Paulo') as ultima_consulta_em,
MAX(user_email) as consultado_por
FROM (
/***** HERE YOU SHOULD PASTE THE CODE OUTPUT FROM STEP 1 OR 1.1 *****/
SELECT * FROM `teste.raw.__TABLES__` UNION ALL
SELECT * FROM `teste.stage.__TABLES__`
/***** HERE YOU SHOULD PASTE THE CODE OUTPUT FROM STEP 1 OR 1.1 *****/
) AS tables
LEFT JOIN (
SELECT
creation_time AS last_query_in, user_email,
x
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION,
UNNEST(referenced_tables) AS x)
ON
project_id=x.project_id
AND x.dataset_id=dataset_id
AND x.table_id=table_id
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9 ORDER BY 2, 7
Finally, you have the data you wanted.
Let me know if this helps.

Related

How to include more than one value in IN operator in BigQuery

I have the following query that is working fine in BigQuery:
SELECT
date,
nombre,
i.identifier,
i.hour
FROM
`table` t, unnest(identifier_s) i
where 2 in unnest(i.hour)
However, I need to include more integers in the search value of the IN operator. Something like this:
...where (2 or 5 or 6) in unnest...
Consider the example below. I hope you will be able to adapt it to your specific use case:
select date, nombre, i.identifier, i.hour
from `adsmovil-produccion.analisis_alejandro.100_temp_pers` t,
unnest(identifier_s) i
where exists (
select 1
from unnest(i.hour) x
join unnest([2, 5, 6]) x
using(x)
)
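The EXISTS with a join keeps a row when its hour array shares at least one value with [2, 5, 6]. As a rough illustration of that "any overlap" logic outside SQL, here is a small Python sketch; the function name and the sample rows are invented for the example and do not come from the question's table.

```python
def rows_with_any_hour(rows, wanted):
    """Keep rows whose 'hour' list contains at least one of the wanted values."""
    wanted = set(wanted)
    # A non-empty intersection plays the role of EXISTS (... JOIN ... USING(x)).
    return [row for row in rows if wanted.intersection(row["hour"])]


# Hypothetical rows, one per unnested identifier.
sample_rows = [
    {"identifier": "a", "hour": [1, 2]},
    {"identifier": "b", "hour": [3, 4]},
    {"identifier": "c", "hour": [6]},
]
matches = rows_with_any_hour(sample_rows, [2, 5, 6])
print([row["identifier"] for row in matches])
```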

How can I flatten table in SQL in Google Big Query?

I have this table
And tried to achieve the following output:
I found different articles (like this) how to do it, unfortunately they do not work with my table.
The schema of the table is the following:
Consider the approach below. It is less verbose and easy to manage if any adjustments are needed:
select * from (
select id_car, kv.element.key, kv.element.value
from `project.dataset.table`, unnest(table.keyvalue.list) as kv
)
pivot (min(value) for key in ('id', 'model', 'status', 'speed'))
If applied to the sample data in your question, the output is one row per id_car with the columns id, model, status and speed.
I created a table with the schema you mentioned and data you gave:
I ran the following query on this table:
SELECT id_car,
STRING_AGG(id, '') AS id, STRING_AGG(model, '') AS model,
STRING_AGG(Status, '') AS status, STRING_AGG(speed, '') AS speed
FROM (
SELECT id_car,
IF(my.element.key = "id", my.element.value, '') AS id,
IF(my.element.key = "model", my.element.value, '') AS `model`,
IF(my.element.key = "Status", my.element.value, '') AS Status,
IF(my.element.key = "speed", my.element.value, '') AS speed,
FROM `ProjectID.Dataset.Table`, UNNEST(table.keyvalue.list) AS my
)
GROUP BY id_car
This gives me the same output that you expect.
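Both answers turn (key, value) pairs into one row per id_car with one column per key. A minimal Python sketch of that pivot, assuming the pairs have already been unnested into (id_car, key, value) tuples; the function name and the sample values are made up, since the question's data was only shown as an image.

```python
def pivot_key_values(pairs, keys):
    """Pivot (id_car, key, value) tuples into {id_car: {key: value}}."""
    result = {}
    for id_car, key, value in pairs:
        # Every car gets all requested keys, defaulting to None (like NULL).
        row = result.setdefault(id_car, dict.fromkeys(keys))
        if key in row:
            # Keep the smallest value per key, mirroring pivot(min(value) ...).
            row[key] = value if row[key] is None else min(row[key], value)
    return result


# Hypothetical unnested data.
sample_pairs = [
    (1, "id", "A1"), (1, "model", "X"), (1, "status", "ok"), (1, "speed", "90"),
    (2, "id", "B2"), (2, "model", "Y"),
]
print(pivot_key_values(sample_pairs, ["id", "model", "status", "speed"]))
```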

Adding summary statistics to an existing table in SQL

I am trying to add summary statistics (just total and average) to a table with 21 columns and 7 rows of data. I would like the two rows of summary statistics to start at row 8. I've been trying a query along these lines without any luck:
SELECT *
FROM
( SELECT 1,
weekday, summer_member_total, summer_member_avg_duration, summer_casual_total, summer_casual_avg_duration,
fall_member_total, fall_member_avg_duration, fall_casual_total, fall_casual_avg_duration,
winter_member_total, winter_member_avg_duration, winter_casual_total, winter_casual_avg_duration,
spring_member_total, spring_member_avg_duration, spring_casual_total, spring_casual_avg_duration,
member_total, member_avg_duration, casual_total, casual_avg_duration,
FROM `case-study-319921.2020_2021_Trip_Data.2020_2021_Summary_Stats`
UNION ALL
SELECT 8,
'TOTAL',
SUM(summer_member_total),
SUM(summer_member_avg_duration),
SUM(summer_casual_total),
SUM(summer_casual_avg_duration),
SUM(fall_member_total),
SUM(fall_member_avg_duration),
SUM(fall_casual_total),
SUM(fall_casual_avg_duration),
SUM(winter_member_total),
SUM(winter_member_avg_duration),
SUM(winter_casual_total),
SUM(winter_casual_avg_duration),
SUM(spring_member_total),
SUM(spring_member_avg_duration),
SUM(spring_casual_total),
SUM(spring_casual_avg_duration),
SUM(member_total),
SUM(member_avg_duration),
SUM(casual_total),
SUM(casual_avg_duration),
FROM `case-study-319921.2020_2021_Trip_Data.2020_2021_Summary_Stats`
UNION ALL
SELECT 9,
'AVG',
AVG(summer_member_total),
AVG(summer_member_avg_duration),
AVG(summer_casual_total),
AVG(summer_casual_avg_duration),
AVG(fall_member_total),
AVG(fall_member_avg_duration),
AVG(fall_casual_total),
AVG(fall_casual_avg_duration),
AVG(winter_member_total),
AVG(winter_member_avg_duration),
AVG(winter_casual_total),
AVG(winter_casual_avg_duration),
AVG(spring_member_total),
AVG(spring_member_avg_duration),
AVG(spring_casual_total),
AVG(spring_casual_avg_duration),
AVG(member_total),
AVG(member_avg_duration),
AVG(casual_total),
AVG(casual_avg_duration),
FROM `case-study-319921.2020_2021_Trip_Data.2020_2021_Summary_Stats` )
ORDER BY 1
Any ideas on how to approach this?
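As an option to fix your issue (UNION ALL needs matching column types, so weekday has to be a STRING to line up with 'TOTAL' and 'AVG'), replace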
SELECT 1,
weekday, summer_
with
SELECT 1,
CAST(weekday AS STRING) weekday , summer_

Postgres Materialized Path Search using Bookshelf

Let's say I am using materialized paths to store management chains:
Table: User
id name management_chain
1 Senior VP {1}
2 Middle Manager {1,2}
3 Cubicle Slave {1,2,3}
4 Janitor {1,2,4}
How do I construct a query that, given a user id, returns all of that user's direct reports? E.g. given the Middle Manager, it should return Cubicle Slave and Janitor; given the Senior VP, it should return the Middle Manager. Put another way, what would be a good way to get all records where the management_chain contains the queried id at the second-to-last position (given that the last item represents the user's own id)?
In other words, how do I represent the following SQL:
SELECT *
FROM USER u
WHERE u.management_chain #> {stored_variable, u.id}
My current JS:
var collection = Users.forge()
.query('where', 'management_chain', '#>', [req.user.id, id]);
Which errors out with
ReferenceError: id is not defined
Assuming management_chain is an integer array (int[]) you could do the following (in plain SQL)
select *
from (
select id,
name,
'/'||array_to_string(management_chain, '/') as path
from users
) t
where path like '%/2/%';
This works, because array_to_string() will not append the delimiter to the end of the string. Therefore if a path contains the sequence /2/ it means there are more nodes "below" that one. The nodes where 2 is the last id in the management_chain will end with /2 (no trailing /) and will not be included in the result.
The expression will not make use of an index, so this might not be feasible for large tables.
However I don't know how this would translate into that JS thing.
SQLFiddle example: http://sqlfiddle.com/#!15/75948/2
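To see what the '/2/' pattern matches on the question's sample table, here is a small Python sketch. Note that the substring test returns everyone below the given id, not just direct reports; the second function checks the second-to-last position that the question describes. Both function names are invented for this illustration.

```python
# Rows from the question's User table: (id, name, management_chain).
users = [
    (1, "Senior VP", [1]),
    (2, "Middle Manager", [1, 2]),
    (3, "Cubicle Slave", [1, 2, 3]),
    (4, "Janitor", [1, 2, 4]),
]


def path_of(chain):
    """Same shape as '/' || array_to_string(management_chain, '/')."""
    return "/" + "/".join(str(node) for node in chain)


def everyone_below(users, manager_id):
    """Mimics path LIKE '%/<id>/%': the id appears, but not as the last node."""
    needle = "/{}/".format(manager_id)
    return [name for _, name, chain in users if needle in path_of(chain)]


def direct_reports(users, manager_id):
    """Only rows whose second-to-last chain entry is the manager's id."""
    return [
        name for _, name, chain in users
        if len(chain) >= 2 and chain[-2] == manager_id
    ]


print(everyone_below(users, 1))
print(direct_reports(users, 1))
```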
Look up WITH RECURSIVE.
As an example, take a look at this code:
CREATE VIEW
mvw_pre_import_cellpath_check
(
pkid_cell,
id_cell ,
id_parent,
has_child,
id_path ,
name_path,
string_path
) AS WITH RECURSIVE cell_paths
(
pkid_cell,
id_cell ,
id_parent,
id_path ,
name_path
) AS
(
SELECT
tbl_cell.pkid ,
tbl_cell.cell_id ,
tbl_cell.cell_parent_id ,
ARRAY[tbl_cell.cell_id] AS "array",
ARRAY[tbl_cell.cell_name] AS "array"
FROM
ufo.tbl_cell
WHERE
(((
tbl_cell.cell_parent_id IS NULL)
AND (
tbl_cell.reject_reason IS NULL))
AND (
tbl_cell.processed_dt IS NULL))
UNION ALL
SELECT
tbl_cell.pkid ,
tbl_cell.cell_id ,
tbl_cell.cell_parent_id ,
(cell_paths_1.id_path || tbl_cell.cell_id),
(cell_paths_1.name_path || tbl_cell.cell_name)
FROM
(cell_paths cell_paths_1
JOIN
ufo.tbl_cell
ON
((
tbl_cell.cell_parent_id = cell_paths_1.id_cell)))
WHERE
(((
NOT (
tbl_cell.cell_id = ANY (cell_paths_1.id_path)))
AND (
tbl_cell.reject_reason IS NULL))
AND (
tbl_cell.processed_dt IS NULL))
)
SELECT
cell_paths.pkid_cell,
cell_paths.id_cell ,
cell_paths.id_parent,
(
SELECT
COUNT(*) AS COUNT
FROM
ufo.tbl_cell x
WHERE
((
cell_paths.id_cell = x.cell_id)
AND (
EXISTS
(
SELECT
1
FROM
ufo.tbl_cell y
WHERE
(
x.cell_id = y.cell_parent_id))))) AS has_child,
cell_paths.id_path ,
cell_paths.name_path ,
array_to_string(cell_paths.name_path, ' -> '::text) AS string_path
FROM
cell_paths
ORDER BY
cell_paths.id_path;
There are plenty more examples to be found on SO when searching for recursive CTE.
In contrast to your example, the top-level cells (managers) have parent_id = NULL in my example. These are the starting points for the different branches.
HTH
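A much smaller recursive CTE of the same shape can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions, not the Postgres view above: SQLite stands in for Postgres, the table and column names are made up, and because SQLite has no array type the cycle guard (NOT ... = ANY(id_path)) is left out. The rows reuse the names from the question, with the top-level manager having a NULL parent.

```python
import sqlite3


def management_paths():
    """Build 'A -> B -> C' paths by walking parent links with WITH RECURSIVE."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cell (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
        INSERT INTO cell VALUES
            (1, NULL, 'Senior VP'),
            (2, 1, 'Middle Manager'),
            (3, 2, 'Cubicle Slave'),
            (4, 2, 'Janitor');
    """)
    rows = conn.execute("""
        WITH RECURSIVE paths(id, name_path) AS (
            -- Anchor: top-level rows have no parent.
            SELECT id, name FROM cell WHERE parent_id IS NULL
            UNION ALL
            -- Recursive step: append each child to its parent's path.
            SELECT c.id, p.name_path || ' -> ' || c.name
            FROM paths p JOIN cell c ON c.parent_id = p.id
        )
        SELECT id, name_path FROM paths ORDER BY id
    """).fetchall()
    conn.close()
    return rows


for row in management_paths():
    print(row)
```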

Trouble with SQL UNION operation

I have the following table:
I am trying to create an SQL query that returns a table with three fields:
Year (ActionDate), Count of Built (actiontype = 12), Count of Lost (actiontype = a few different ones)
Basically, ActionType is a lookup code. So, I'd get back something like:
YEAR CountofBuilt CountofLost
1905 30 18
1929 12 99
1940 60 1
etc....
I figured this would take two SELECT statements put together with a UNION.
I tried the following, but it only returns two columns (year and countBuilt). My countLost field doesn't appear.
My sql currently (MS Access):
SELECT tblHist.ActionDate, Count(tblHist.ActionDate) as countBuilt
FROM ...
WHERE ((tblHist.ActionType)=12)
GROUP BY tblHist.ActionDate
UNION
SELECT tblHist.ActionDate, Count(tblHist.ActionDate) as countLost
FROM ...
WHERE (((tblHist.ActionType)<>2) AND
((tblHist.ActionType)<>3))
GROUP BY tblHist.ActionDate;
Use:
SELECT h.actiondate,
SUM(IIF(h.actiontype = 12, 1, 0)) AS numBuilt,
SUM(IIF(h.actiontype NOT IN (2,3), 1, 0)) AS numLost
FROM tblHist h
GROUP BY h.actiondate
You should not use UNION for such queries. There are many ways to do what you want, for example (updated to fit Access syntax):
SELECT tblHist.ActionDate,
COUNT(SWITCH(tblHist.ActionType = 12,1)) as countBuilt,
COUNT(SWITCH(tblHist.ActionType <> 1 AND tblHist.ActionType <> 2 AND ..., 1)) as countLost
FROM ..
WHERE ....
GROUP BY tblHist.ActionDate
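The conditional-aggregation pattern from both answers can be tried out with Python's sqlite3 module. This is a sketch under assumptions: SQLite stands in for Access, so CASE WHEN replaces IIF and SWITCH; the rows are invented; and the "lost" filter simply mirrors the question's ActionType <> 2 AND <> 3 condition, which also counts type 12.

```python
import sqlite3


def counts_by_year(rows):
    """Return (ActionDate, numBuilt, numLost) per year in a single pass."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tblHist (ActionDate INTEGER, ActionType INTEGER)")
    conn.executemany("INSERT INTO tblHist VALUES (?, ?)", rows)
    result = conn.execute("""
        SELECT ActionDate,
               SUM(CASE WHEN ActionType = 12 THEN 1 ELSE 0 END) AS numBuilt,
               SUM(CASE WHEN ActionType NOT IN (2, 3) THEN 1 ELSE 0 END) AS numLost
        FROM tblHist
        GROUP BY ActionDate
        ORDER BY ActionDate
    """).fetchall()
    conn.close()
    return result


# Hypothetical (ActionDate, ActionType) rows.
sample_hist = [(1905, 12), (1905, 12), (1905, 5), (1905, 2), (1929, 7), (1929, 3)]
print(counts_by_year(sample_hist))
```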