I have created a sink using Logs Explorer that pushes data to BigQuery. I can get information about tables by using the following query:
SELECT
SPLIT(REGEXP_EXTRACT(protopayload_auditlog.resourceName, '^projects/[^/]+/datasets/[^/]+/tables/(.*)$'), '$')[OFFSET(0)] AS TABLE
FROM `project.dataset` WHERE
JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableDataRead") IS NOT NULL
OR JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableDataChange") IS NOT NULL
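For checking the pattern outside BigQuery, here is a minimal Python sketch of what the REGEXP_EXTRACT plus SPLIT expression above does to a resource name. The function name and the sample resource name are made up for illustration.

```python
import re

# Same pattern as in the SQL above: capture everything after ".../tables/".
TABLE_PATTERN = re.compile(r"^projects/[^/]+/datasets/[^/]+/tables/(.*)$")


def table_from_resource_name(resource_name):
    """Return the table ID without any '$partition' decorator, or None if no match."""
    match = TABLE_PATTERN.match(resource_name)
    if match is None:
        return None
    # SPLIT(..., '$')[OFFSET(0)] keeps the part before the first '$'.
    return match.group(1).split("$")[0]


# Hypothetical resource name with a partition decorator.
print(table_from_resource_name("projects/p/datasets/d/tables/events$20220710"))
```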
However, I am unable to find information about views. I have looked at
the audit logs reference https://cloud.google.com/bigquery/docs/reference/auditlogs
and the BigQuery asset information https://cloud.google.com/asset-inventory/docs/resource-name-format
but I cannot find how to get information about a view. What do I need to include? Is it something in my sink, or is there an alternative resource name I should use?
It seems that the audit logs treat tables and views the same way.
I made this query to track view/table changes. InsertJob will tell you about view creations; UpdateTable/PatchTable will tell you about updates.
SELECT
resource.labels.dataset_id,
resource.labels.project_id,
--protopayload_auditlog.methodName,
REGEXP_EXTRACT(protopayload_auditlog.methodName,r'.*\.([^/$]*)') as method,
--protopayload_auditlog.resourceName,
REGEXP_EXTRACT(protopayload_auditlog.resourceName,r'.*tables\/([^/$]*)') as tableName,
protopayload_auditlog.authenticationInfo.principalEmail,
protopayload_auditlog.metadataJson,
case when protopayload_auditlog.methodName = 'google.cloud.bigquery.v2.JobService.InsertJob' then JSON_EXTRACT(JSON_EXTRACT(JSON_EXTRACT(JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableCreation"),"$.table"),"$.view"),"$.query")
else JSON_EXTRACT(JSON_EXTRACT(JSON_EXTRACT(JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableChange"),"$.table"),"$.view"),"$.query") end
as query,
receiveTimestamp
FROM `<project-id>.<bq_auditlog>.cloudaudit_googleapis_com_activity_*`
WHERE DATE(timestamp) >= "2022-07-10"
and protopayload_auditlog.methodName in
('google.cloud.bigquery.v2.TableService.PatchTable',
'google.cloud.bigquery.v2.TableService.UpdateTable',
'google.cloud.bigquery.v2.TableService.InsertTable',
'google.cloud.bigquery.v2.JobService.InsertJob',
'google.cloud.bigquery.v2.TableService.DeleteTable' )
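If you export the rows and want to pull the view definition out of metadataJson client-side, the nested JSON_EXTRACT calls can be sketched in Python roughly as follows. This is only an illustration: the function name and the trimmed sample payload are assumptions, and unlike the CASE expression above it simply tries both keys instead of switching on methodName.

```python
import json


def view_query_from_metadata(metadata_json):
    """Return the view SQL found in a metadataJson string, or None if absent."""
    metadata = json.loads(metadata_json)
    # Creation events carry "tableCreation"; update events carry "tableChange".
    for event_key in ("tableCreation", "tableChange"):
        view = metadata.get(event_key, {}).get("table", {}).get("view", {})
        if "query" in view:
            return view["query"]
    return None


# Hypothetical, heavily trimmed payload for illustration only.
sample = json.dumps({"tableCreation": {"table": {"view": {"query": "SELECT 1"}}}})
print(view_query_from_metadata(sample))
```

A row for a plain table (no "view" key) returns None, which is one way to tell views from tables in these logs.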
Views are virtual tables that are created and queried in the same way as tables. Since you are looking for views in a BigQuery dataset that is set up as a logging sink, you need to create the views in BigQuery by following the steps given in this documentation.
Currently two versions of the audit log messages are supported, v1 and v2: v1 reports API invocations and v2 reports resource interactions. After creating the views, you can do further analysis in BigQuery by saving or querying them.
For reference, I am connecting to amazon-athena via sqlalchemy using essentially:
create_engine(
f'awsathena+rest://:@athena.{myRegion}.amazonaws.com:443/{athena_schema}?s3_staging_dir={myS3_staging_path}',
echo=True)
In most relational databases that adhere to the ANSI-SQL standard, I can programmatically get the partition columns of a table by running something like the following:
select *
from information_schema.columns
where table_name='myTable' and table_schema='mySchema'
and extra_info = 'partition key'
However, the bucketing or clustering columns do not seem to be flagged in a similar way. I know I can access this information via:
show create table mySchema.myTable
but I am interested in a clean programmatic solution, if one exists. I am trying not to reinvent the wheel. Please show me how to do this or point me to the relevant documentation.
Thank you in advance.
PS: It would also be great if other information about the table, like location of files and storage format were also accessible programmatically.
Athena uses Glue Data Catalog to store metadata about databases and tables. I don't know how much of this is exposed in information_schema, and there is very little documentation about it.
However, you can get everything Athena knows by querying the Glue Data Catalog directly. In this case if you call GetTable (e.g. aws glue get-table …) you will find the bucketing information in Table.StorageDescriptor.BucketColumns.
The GetTable call will also give you the storage format and the location of the files (but for a partitioned table you need to make additional calls with GetPartitions to retrieve the location of each partition's data).
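As a rough sketch of where those fields sit, the helper below picks the bucketing, partitioning, format and location details out of a GetTable response. The helper name and the sample response are made up for illustration; a real response would come from something like boto3.client("glue").get_table(DatabaseName="mySchema", Name="myTable").

```python
def describe_glue_table(response):
    """Pull bucketing, partitioning, format and location out of a GetTable response."""
    table = response["Table"]
    storage = table.get("StorageDescriptor", {})
    return {
        "bucket_columns": storage.get("BucketColumns", []),
        "partition_columns": [key["Name"] for key in table.get("PartitionKeys", [])],
        "location": storage.get("Location"),
        "input_format": storage.get("InputFormat"),
    }


# Hypothetical, trimmed response used only to show the shape of the data.
sample_response = {
    "Table": {
        "Name": "myTable",
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
        "StorageDescriptor": {
            "BucketColumns": ["user_id"],
            "Location": "s3://my-bucket/my-table/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        },
    }
}
print(describe_glue_table(sample_response))
```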
I'm looking to save a view which uses federated queries (from a MySQL Cloud SQL connection) between two projects. I'm receiving two different errors (depending on which project I try to save in).
If I try to save in the project containing the dataset I get error:
Not found: Connection my-connection-name
If I try to save in the project that contains the connection I get error:
Not found: Dataset my-project:my_dataset
My example query that crosses projects looks like:
SELECT
bq.uuid,
sql.item_id,
sql.title
FROM
`project_1.my_dataset.psa_v2_202005` AS bq
LEFT OUTER JOIN
EXTERNAL_QUERY( 'project_2.us-east1.my-connection-name',
'''SELECT item_id, title
FROM items''') AS sql
ON
bq.looks_info.query_item.item_id = sql.item_id
The documentation at https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries#known_issues_and_limitations doesn't mention any limitations here.
Is there a way around this so I can save a view using an external connection from one project and dataset from another?
Your BigQuery table is located in US and your MySQL data source is located in us-east1. BigQuery automatically runs the query in the location of your BigQuery table (i.e. in US), but your Cloud SQL MySQL instance is in us-east1, and that is why your query fails. The BigQuery table and the Cloud SQL instance must be in the same location for this query to succeed.
The solution in cases like this is to move your BigQuery dataset to the same location as your Cloud SQL instance manually, by following the steps explained in detail in this documentation. However, us-east1 is not currently supported for copying datasets, so I recommend that you create a new connection in one of the locations mentioned in the documentation.
I hope you find the above pieces of information useful.
I am porting a Java application from Hadoop/Hive to Google Cloud/BigQuery. The application writes Avro files to HDFS and then creates Hive external tables with one or more partitions on top of the files.
I understand that BigQuery only supports date/timestamp partitions for now, and no nested partitions.
The way we currently handle Hive is that we generate the DDL and then execute it with a REST call.
I could not find support for CREATE EXTERNAL TABLE in the BigQuery DDL docs, so I've switched to using the Java library.
I managed to create an external table, but I cannot find any reference to partitions in the parameters passed to the call.
Here's a snippet of the code I use:
....
ExternalTableDefinition extTableDef =
ExternalTableDefinition.newBuilder(schemaName, null, FormatOptions.avro()).build();
TableId tableID = TableId.of(dbName, tableName);
TableInfo tableInfo = TableInfo.newBuilder(tableID, extTableDef).build();
Table table = bigQuery.create(tableInfo);
....
There is, however, support for partitions for non-external tables.
I have a few questions:
Is there support for creating external tables with partition(s)? Can you please point me in the right direction?
Is loading the data into BigQuery preferred to having it stored in GCS Avro files?
If yes, how would we deal with schema evolution?
Thank you very much in advance.
You cannot create partitioned tables over files on GCS, although you can use the special _FILE_NAME pseudo-column to filter out the files that you don't want to read.
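To illustrate the _FILE_NAME idea, here is a small Python sketch that builds such a filter query as a string. The helper name, table name and bucket path are hypothetical, and the helper does no escaping, so it is only suitable for trusted input.

```python
def file_name_filter_query(table, uri_prefix):
    """Build a query that keeps only rows read from files under uri_prefix."""
    # _FILE_NAME holds the full gs:// URI of the file each row was read from.
    # No quoting or escaping is done here; pass trusted values only.
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE STARTS_WITH(_FILE_NAME, '{uri_prefix}')"
    )


# Hypothetical external table and a Hive-style directory layout on GCS.
print(file_name_filter_query("my_dataset.my_external_table",
                             "gs://my-bucket/events/dt=2020-05-01/"))
```

With a Hive-style layout, filtering on the directory prefix gives a poor man's partition filter over the external table.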
If you can, prefer just to load data into BigQuery rather than leaving it on GCS. Loading data is free, and queries will be way faster than if you run them over Avro files on GCS. BigQuery uses a columnar format called Capacitor internally, which is heavily optimized for BigQuery, whereas Avro is a row-based format and doesn't perform as well.
In terms of schema evolution, if you need to change a column type, drop a column, etc., you should recreate your table (CREATE OR REPLACE TABLE ...). If you are only ever adding columns, you can add the new columns using the API or UI.
See also a relevant blog post about lazy data loading.
I have several databases within a BigQuery project which are populated by various job engines and applications. I would like to maintain a dashboard of the Last Modified dates for every table within our project to monitor job failures.
Are there any command line or SQL commands which could provide this list of Last Modified dates?
For a SQL command you could try this one:
#standardSQL
SELECT *, TIMESTAMP_MILLIS(last_modified_time)
FROM `dataset.__TABLES__` where table_id = 'table_id'
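If you read last_modified_time on the client side instead, it is a number of milliseconds since the Unix epoch, so the equivalent of TIMESTAMP_MILLIS can be sketched in Python like this (the function name and the sample value are just for illustration):

```python
from datetime import datetime, timezone


def last_modified_utc(last_modified_time_ms):
    """Convert epoch milliseconds (as stored in __TABLES__) to an aware UTC datetime."""
    return datetime.fromtimestamp(last_modified_time_ms / 1000, tz=timezone.utc)


# Hypothetical value as it might appear in the last_modified_time column.
print(last_modified_utc(1657411200000).isoformat())
```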
That said, I recommend checking whether you can log these errors at the application level. By doing so you can also understand why something didn't work as expected.
If you are already using GCP, you can make use of Stackdriver (it works on AWS as well). We started using it in our projects and I recommend giving it a try. We have only tested it with Python applications, so I am not sure how it performs with other clients, but it is probably quite similar.
I've just queried stacked GA4 data using the following code:
FROM analytics_#########.__TABLES__
where table_id LIKE 'events_2%'
I kept the 2 after events_ so that my intraday tables are not pulled through as well.
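To see which table IDs that LIKE pattern keeps, here is a small Python sketch of the same filter. Note that in SQL LIKE the underscore matches any single character, so the regular expression below uses "." in its place; the table IDs are made-up examples.

```python
import re

# LIKE 'events_2%': '_' is any single character, '%' is any run of characters.
DAILY_PATTERN = re.compile(r"events.2.*", re.DOTALL)


def is_daily_events_table(table_id):
    """Return True when table_id would match LIKE 'events_2%'."""
    return DAILY_PATTERN.fullmatch(table_id) is not None


# Hypothetical table IDs from a GA4 export dataset.
table_ids = ["events_20230101", "events_intraday_20230102", "events_20230102"]
daily = [t for t in table_ids if is_daily_events_table(t)]
print(daily)
```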
I am new to Tableau and am having performance issues, so I need some help. I have a Hive query result in Azure Blob Storage named part-00000.
The problem is that I want to execute a custom query in Tableau and generate graphical reports from it in Tableau.
Can I do this? How?
The Hive table holds 7.0 M records.
You can find the custom query option in the data source connection (see the linked image).
You might want to consider creating an extract instead of a live connection. Additional considerations would include hiding unused fields and using filters at the data source level to limit data as per requirement.