getting Clustering/Bucketing columns programmatically - sql

For reference, I am connecting to amazon-athena via sqlalchemy using essentially:
create_engine(
f'awsathena+rest://:#athena.{myRegion}.amazonaws.com:443/{athena_schema}?s3_staging_dir={myS3_staging_path}',
echo=True)
In most relational databases that adhere to the ANSI-SQL standard, I can programmatically get the partition columns of a table by running something like the following:
select *
from information_schema.columns
where table_name='myTable' and table_schema='mySchema'
and extra_info = 'partition key'
However the bucketing or clustering columns seem to not be similarly flagged. I know I can access this information via:
show create table mySchema.myTable
but I am interested in clean programmatical solution, if one exists. I am trying to not reinvent the wheel. Please show me how to do this or point me to the relevant documentation.
Thank you in advance.
PS: It would also be great if other information about the table, like location of files and storage format were also accessible programmatically.

Athena uses Glue Data Catalog to store metadata about databases and tables. I don't know how much of this is exposed in information_schema, and there is very little documentation about it.
However, you can get everything Athena knows by querying the Glue Data Catalog directly. In this case if you call GetTable (e.g. aws glue get-table …) you will find the bucketing information in Table.StorageDescriptor.BucketColumns.
The GetTable call will also give you the storage format and the location of the files (but for a partitioned table you need to make additional calls with GetPartitions to retrieve the location of each partition's data).

Related

Extracting information about View from Bigquery auditlog

I have created a Sink using Log explorer that pushes data to Bigquery. I can get information about tables by using the following query.
SELECT
SPLIT(REGEXP_EXTRACT(protopayload_auditlog.resourceName, '^projects/[^/]+/datasets/[^/]+/tables/(.*)$'), '$')[OFFSET(0)] AS TABLE
FROM `project.dataset` WHERE
JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableDataRead") IS NOT NULL
OR JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableDataChange") IS NOT NULL
However, I am unable to find information about Views. I have tried
Audit logs https://cloud.google.com/bigquery/docs/reference/auditlogs
And biguqery asset information https://cloud.google.com/asset-inventory/docs/resource-name-format
however, I am unable to find how to get the information about "View". What do I need to include? Is that something in my sink or there is an alternative resource name I should use?
It seems like auditLogs treat tables and views the same way.
I made this query to track view/table changes. InsertJob will tell you about view creations. UpdateTable/PatchTable will tell you about updates
SELECT
resource.labels.dataset_id,
resource.labels.project_id,
--protopayload_auditlog.methodName,
REGEXP_EXTRACT(protopayload_auditlog.methodName,r'.*\.([^/$]*)') as method,
--protopayload_auditlog.resourceName,
REGEXP_EXTRACT(protopayload_auditlog.resourceName,r'.*tables\/([^/$]*)') as tableName,
protopayload_auditlog.authenticationInfo.principalEmail,
protopayload_auditlog.metadataJson,
case when protopayload_auditlog.methodName = 'google.cloud.bigquery.v2.JobService.InsertJob' then JSON_EXTRACT(JSON_EXTRACT(JSON_EXTRACT(JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableCreation"),"$.table"),"$.view"),"$.query")
else JSON_EXTRACT(JSON_EXTRACT(JSON_EXTRACT(JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableChange"),"$.table"),"$.view"),"$.query") end
as query,
receiveTimestamp
FROM `<project-id>.<bq_auditlog>.cloudaudit_googleapis_com_activity_*`
WHERE DATE(timestamp) >= "2022-07-10"
and protopayload_auditlog.methodName in
('google.cloud.bigquery.v2.TableService.PatchTable',
'google.cloud.bigquery.v2.TableService.UpdateTable',
'google.cloud.bigquery.v2.TableService.InsertTable',
'google.cloud.bigquery.v2.JobService.InsertJob',
'google.cloud.bigquery.v2.TableService.DeleteTable' )
Views are virtual table which are created and queried in the same way as queried from tables. Since you are looking for Views in BigQuery which is setup as a logging sink, you need to create Views in BigQuery by using the steps given in this documentation.
Currently there are two versions supported, v1 and v2. V1 reports API invocation and V2 reports resource interactions. After creating the views, you can do further analysis in BigQuery by saving or querying the Views.

How can I extract DDL for particurar object of a database in PostgreSQL

I need to get DDL query of a particular object of database like schema, table, column, etc. Is there a way to extract it from system catalog tables using sql?
I tried to find any table in information_schema or pg_catalog with required information, but I didn't find such one.
There is a way to extract it from the system catalogs, but the method depends on what type of object it is, and is not easy.
The "pg_dump" knows how to do it. I would just use that, rather than reinventing things. You can get just the DDL (exclude the data itself) using "-s" option. Then you can fish out the DDL for your specific desired object using your favorite text editor. If the object is a table, you can tell pg_dump to dump just that table, but for other objects you can't.

create BigQuery external tables partitioned by one/multiple columns

I am porting a java application from Hadoop/Hive to Google Cloud/BigQuery. The application writes avro files to hdfs and then creates Hive external tables with one/multiple partitions on top of the files.
I understand Big Query only supports date/timestamp partitions for now, and no nested partitions.
The way we now handle hive is that we generate the ddl and then execute it with a rest call.
I could not find support for CREATE EXTERNAL TABLE in the BigQuery DDL docs, so I've switched to using the java library.
I managed to create an external table, but I cannot find any reference to partitions in the parameters passed to the call.
Here's a snippet of the code I use:
....
ExternalTableDefinition extTableDef =
ExternalTableDefinition.newBuilder(schemaName, null, FormatOptions.avro()).build();
TableId tableID = TableId.of(dbName, tableName);
TableInfo tableInfo = TableInfo.newBuilder(tableID, extTableDef).build();
Table table = bigQuery.create(tableInfo);
....
There is however support for partitions for non external tables.
I have a few questions questions:
is there support for creating external tables with partition(s)? Can you please point me in the right direction
is loading the data into BigQuery preferred to having it stored in GS avro files?
if yes, how would we deal with schema evolution?
thank you very much in advance
You cannot create partitioned tables over files on GCS, although you can use the special _FILE_NAME pseudo-column to filter out the files that you don't want to read.
If you can, prefer just to load data into BigQuery rather than leaving it on GCS. Loading data is free, and queries will be way faster than if you run them over Avro files on GCS. BigQuery uses a columnar format called Capacitor internally, which is heavily optimized for BigQuery, whereas Avro is a row-based format and doesn't perform as well.
In terms of schema evolution, if you need to change a column type, drop a column, etc., you should recreate your table (CREATE OR REPLACE TABLE ...). If you are only ever adding columns, you can add the new columns using the API or UI.
See also a relevant blog post about lazy data loading.

Auto-Match Fields to Columns SQL LOADER

I'm trying to load some data from csv to Oracle 11g database tables through sqlldr
So I was I thinking if there's a way to carry those data matching the columns described on the ctl file with the table columns by the name. Just like an auto-match, with no sequential order or filler command
Anyone knows a thing about that? I've been searching in documentation and forums but haven't found a thing
Thank you, guys
Alas you're on 11g. What you're looking for is a new feature in 12c SQL Loader Express Mode. This allows us to load a comma-delimited file to a table without defining a Loader control file; instead Oracle uses the data dictionary ALL_TAB_COLUMNS to figure out the mapping.
Obviously there are certain limitations. Perhaps the biggest one is that external tables are the underlying mechanism so it requires the same privileges , including privileges on Directory objects. I think this reduces the helpfulness of the feature, because many people need to use SQL Loader precisely because their DBAs or sysadmins won't grant them the privileges necessary for external tables.

when should we go for a external table and internal table in Hive?

I understand the difference between Internal tables and external tables in hive as below
1) if we drop the internal Table File and metadata will be deleted, however , in case of External only metadata will be
deleted
2) if the file data need to be shared by other tools/applications then we go for external table if not
internal table, so that if we drop the table(external) data will still be available for other tools/applications
I have gone through the answers for question "Difference between Hive internal tables and external tables? "
but still I am not clear about the proper uses cases for Internal Table
so my question is why is that I need to make an Internal table ? why cant I make everything as External table?
Use EXTERNAL tables when:
The data is also used outside of Hive.
For example, the data files are read and processed by an existing program that doesn't lock the files.
The data is permanent i.e used when needed.
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the lifecycle of the table and data.
Let's understand it with two simple scenarios:
Suppose you have a data set, and you have to perform some analytics/problem statements on it. Because of the nature of problem statements, few of them can be done by HiveQL, few of them need Pig Latin and few of them need Map Reduce etc., to get the job done. In this situation External Table comes into picture- the same data set can be used to solve entire analytics instead of having different different copies of same data set for the different different tools. Here Hive don't need authority on the data set because several tools are going to use it.
There can be a scenario, where entire analytics/problem statements can be solved by only HiveQL. In such situation Internal Table comes into picture- Means you can put the entire data set into Hive's Warehouse and Hive is going to have complete authority on the data set.