I have a dataset with many tables. Is there an easy way to query an output that displays the table_name and the number of rows in that table without having to do count(*) on each table in the dataset?
Yes, you can do this by querying the dataset's metadata:
SELECT
  dataset_id,
  table_id,
  # Convert size in bytes to GB
  ROUND(size_bytes/POW(10,9),2) AS size_gb,
  # Convert creation_time and last_modified_time from UNIX EPOCH format to a timestamp
  TIMESTAMP_MILLIS(creation_time) AS creation_time,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time,
  row_count,
  # Convert table type from numerical value to description
  CASE
    WHEN type = 1 THEN 'table'
    WHEN type = 2 THEN 'view'
    ELSE NULL
  END AS type
FROM
  `project_id.dataset_id.__TABLES__`
ORDER BY
  size_gb DESC;
You can also get this metadata from INFORMATION_SCHEMA.TABLE_STORAGE using the query below:
SELECT table_schema, table_name, total_rows
FROM `your-project-id`.`region-REGION`.INFORMATION_SCHEMA.TABLE_STORAGE;
You may refer to the INFORMATION_SCHEMA.TABLE_STORAGE documentation for more information on retrieving this kind of metadata.
Please note that when using INFORMATION_SCHEMA.TABLE_STORAGE, the query should include a region qualifier. If none is specified, the query defaults to the US region.
The Scope and syntax documentation for INFORMATION_SCHEMA.TABLE_STORAGE says that when you do not specify any region, the metadata is retrieved from all regions; during testing, however, the query only retrieved metadata from the US region, matching the behavior described in the scope and syntax documentation for INFORMATION_SCHEMA.SCHEMATA. I think the documentation for the INFORMATION_SCHEMA.TABLE_STORAGE syntax should be updated.
In addition, please note that __TABLES__ has been removed from the official Google BigQuery documentation because it is deprecated, as mentioned in this similar SO post. It is better to use INFORMATION_SCHEMA when retrieving BigQuery metadata, since that is what Google will support moving forward.
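If you only need the table names and row counts for a single dataset, a minimal variant of the TABLE_STORAGE query (the project, region, and dataset names below are placeholders) would be:
SELECT table_name, total_rows
FROM `your-project-id`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'your_dataset'
ORDER BY total_rows DESC;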
In Hive, when we use the following command: SHOW CREATE TABLE table_name;
it returns a list of metadata related to that particular table. Among that metadata, there are two fields I am confused about:
'last_modified_time'='1620814731',
'transient_lastDdlTime'='1620820769'
What is the underlying difference between these two metrics?
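As an aside, both values can also be inspected directly with Hive's SHOW TBLPROPERTIES command (table_name is a placeholder):
SHOW TBLPROPERTIES table_name;
SHOW TBLPROPERTIES table_name("transient_lastDdlTime");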
I created an external table in Redshift and then added some data to the specified S3 folder. I can view all the data perfectly in Athena, but I can't seem to query it from Redshift. What's weird is that select count(*) works, so it can find the data, but it can't actually show anything. I'm guessing it's a misconfiguration somewhere, but I'm not sure what.
Some stuff that may be relevant (I anonymized some stuff):
create external schema spectrum_staging
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::############:role/RedshiftSpectrumRole'
create external database if not exists;
create external table spectrum_staging.errors(
id varchar(100),
error varchar(100))
stored as parquet
location 's3://mybucket/errors/';
My sample data is stored in s3://mybucket/errors/2018-08-27-errors.parquet
This query works:
db=# select count(*) from spectrum_staging.errors;
count
-------
11
(1 row)
This query does not:
db=# select * from spectrum_staging.errors;
id | error
----+-------
(0 rows)
Check your parquet file and make sure the column data types in the Spectrum table match up.
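For example, to see which column types Redshift has registered for the table (the schema and table names here match the DDL from the question):
select columnname, external_type
from svv_external_columns
where schemaname = 'spectrum_staging'
  and tablename = 'errors';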
Then run SELECT pg_last_query_id(); after your query to get the query number and look in the system tables STL_S3CLIENT and STL_S3CLIENT_ERROR to find further details about the query execution.
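Something along these lines, where 12345 is a placeholder for whatever pg_last_query_id() returns:
select pg_last_query_id();

select * from stl_s3client where query = 12345 order by recordtime desc;
select * from stl_s3client_error where query = 12345 order by recordtime desc;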
You don't need to define external tables when you have defined an external schema based on the Glue Data Catalog; Redshift Spectrum picks up all the tables that are in the catalog.
What's probably going on there is that you somehow have two things with the same name, and in one case it picks it up from the data catalog while in the other case it tries to use the external table.
Check these tables from Redshift side to get a better view of what's there:
select * from SVV_EXTERNAL_SCHEMAS
select * from SVV_EXTERNAL_TABLES
select * from SVV_EXTERNAL_PARTITIONS
select * from SVV_EXTERNAL_COLUMNS
And these tables for queries that use the tables from external schema:
select * from SVL_S3QUERY_SUMMARY
select * from SVL_S3LOG order by eventtime desc
select * from SVL_S3QUERY where query = xyz
select * from SVL_S3PARTITION where query = xyz
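If you run these immediately after the problem query, pg_last_query_id() can stand in for the literal id (assuming nothing else ran in between), e.g.:
select * from svl_s3query_summary where query = pg_last_query_id();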
Was there ever a resolution for this? A year on, I have the same problem today.
Nothing stands out in terms of schema differences, though an error does show up:
select recordtime, file, process, errcode, linenum as line,
trim(error) as err
from stl_error order by recordtime desc;
/home/ec2-user/padb/src/sys/cg_util.cpp padbmaster 1 601 Compilation of segment failed: /rds/bin/padb.1.0.10480/data/exec/227/48844003/de67afa670209cb9cffcd4f6a61e1c32a5b3dccc/0
Not sure what this means.
I encountered a similar issue when creating an external table in Athena using the RegexSerDe row format. I was able to query this external table from Athena without any issues. However, when querying the external table from Redshift, the results were null.
Resolved by converting to Parquet format, as Spectrum cannot handle regular-expression serialization.
See link below:
Redshift spectrum shows NULL values for all rows
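In Athena, one way to do that conversion is a CTAS statement along these lines (a sketch; the table names and S3 path are placeholders):
CREATE TABLE errors_parquet
WITH (format = 'PARQUET', external_location = 's3://mybucket/errors-parquet/')
AS SELECT * FROM errors_regex;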
I know of these:
To get column names in a table we can fire:
show columns in <database>.<table_name>
To get description of a table (including column_name, column_type and many other details):
describe [formatted] <database>.<table_name>
I know that I can use the above queries and filter the results to get the column names and types. But I want to know if there is any direct command to get just the column names and types, like select column_name, column_type ...?
In Hive you could use:
DESCRIBE FORMATTED [DatabaseName].[TableName] [Column Name];
This gives you the column data type and some stats of that column.
DESCRIBE [DatabaseName].[TableName] [Column Name];
This just gives you the data type and comments if available for a specific column.
Hope this helps.
Unlike a traditional RDBMS, Hive stores its metadata in a separate database, in most cases MySQL or Postgres. If you have access to the metastore database, you can run SELECT on the table TBLS to get details about the tables and on COLUMNS_V2 to get details about the columns.
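For example, against a MySQL metastore, a join along these lines returns just the column names and types (the metastore schema varies between Hive versions, so treat this as a sketch; the database and table names are placeholders):
SELECT c.COLUMN_NAME, c.TYPE_NAME
FROM DBS d
JOIN TBLS t ON t.DB_ID = d.DB_ID
JOIN SDS s ON s.SD_ID = t.SD_ID
JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
WHERE d.NAME = 'your_database'
  AND t.TBL_NAME = 'your_table'
ORDER BY c.INTEGER_IDX;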
I have a data import pipeline into BigQuery tables (the hourly tables are named transactions_20170616_00, transactions_20170616_01, ..., and there are further daily/weekly/... rollups). I want a single view to always point to the latest one, but found it hard to write one static standardSQL view that does so. My current solution is to update the view's definition to SELECT * FROM project.dataset.transactions_201706... after every successful import.
Then I read httparchive's latest view: it is exactly what I want, but in legacy SQL. My project uses standardSQL only, and I prefer standardSQL because it's the future. Does anyone know how to convert this legacy SQL to standardSQL? Then I won't need to constantly update my view:
https://bigquery.cloud.google.com/table/httparchive:runs.latest_requests?tab=details
SELECT *
FROM TABLE_QUERY(httparchive:runs,
"table_id IN (
SELECT table_id FROM [httparchive:runs.__TABLES__]
WHERE REGEXP_MATCH(table_id, '2.*requests$')
ORDER BY table_id DESC LIMIT 1)")
Following this guide, I'm trying to use:
https://cloud.google.com/bigquery/docs/querying-wildcard-tables#the_table_query_function
#standardSQL
SELECT * FROM `httparchive.runs.*`
WHERE _TABLE_SUFFIX IN
( SELECT table_id
FROM httparchive.runs.__TABLES__
WHERE REGEXP_CONTAINS(table_id, r'2.*requests$')
ORDER BY table_id DESC
LIMIT 1)
but the query failed with:
Query Failed
Error: Views cannot be queried through prefix. Matched views are: httparchive:runs.latest_pages, httparchive:runs.latest_pages_mobile, httparchive:runs.latest_requests, httparchive:runs.latest_requests_mobile
Job ID: bidder-1183:bquijob_1400109e_15cb1dc3c0c
I found that the wildcard can only be used as the last part of the table name. In that case, why doesn't SELECT * FROM httparchive.runs.*_requests WHERE ... work?
Is this saying that the Wildcard Tables feature in standardSQL isn't as flexible as TABLE_QUERY in legacySQL?
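For what it's worth, one workaround suggested by the error message (untested here, so treat it as a sketch) is to lengthen the wildcard prefix so that it no longer matches the latest_* views, and strip that prefix from table_id when matching _TABLE_SUFFIX:
#standardSQL
SELECT * FROM `httparchive.runs.2*`
WHERE _TABLE_SUFFIX IN
  ( SELECT SUBSTR(table_id, 2)
    FROM `httparchive.runs.__TABLES__`
    WHERE REGEXP_CONTAINS(table_id, r'2.*requests$')
    ORDER BY table_id DESC
    LIMIT 1)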
Is there any method in SQL (Oracle) with which I can get something like:
select checksum(select * from table) from table;
You can use DBMS_SQLHASH.GETHASH for this. The query results must be ordered and must not contain any LOBs, or the results won't be deterministic.
select dbms_sqlhash.gethash(q'[select * from some_table order by 1,2]', digest_type => 1)
from dual;
Where digest_type 1 = HASH_MD4, 2 = HASH_MD5, 3 = HASH_SH1.
That package is not granted to anyone by default. To use it, you'll need someone to log on as SYS and run this:
SQL> grant execute on dbms_sqlhash to <your_user>;
The query results must be ordered, as described in "Bug 17082212 : DBMS_SQLHASH DIFFERENT RESULTS FROM DIFFERENT ACCESS PATH".
I'm not sure why LOBs don't work, but it might be related to the fact that ORA_HASH does not work well with LOBs. This Jonathan Lewis article includes some examples of ORA_HASH returning different results for the same LOB data, and recent versions of the SQL Language Reference warn that ORA_HASH does not support LOBs.
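If you can't get the SYS grant, one rough alternative (a sketch with its own collision risks, not equivalent to DBMS_SQLHASH) is to hash each row with ORA_HASH and aggregate; SUM is order-independent, which also sidesteps the ordering requirement. The column names are placeholders:
select sum(ora_hash(col1 || '|' || col2)) as table_checksum
from some_table;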