Get the Last Modified date for all BigQuery tables in a BigQuery Project - sql

I have several databases within a BigQuery project which are populated by various job engines and applications. I would like to maintain a dashboard of the Last Modified dates for every table within our project so that we can monitor job failures.
Are there any command line or SQL commands which could provide this list of Last Modified dates?

For a SQL command you could try this one:
#standardSQL
SELECT *, TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `dataset.__TABLES__`
WHERE table_id = 'table_id'
I recommend, though, that you check whether you can log these errors at the application level. That way you can also understand why something didn't work as expected.
If you are already on GCP, you can use Stackdriver (it works on AWS as well). We started using it in our projects and I recommend giving it a try. We only tested it with Python applications, so I am not sure how it performs with other clients, but it is probably quite similar.
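Because `__TABLES__` is scoped to a single dataset, covering a whole project means repeating the query once per dataset. Below is a minimal sketch in Python, assuming you already have the list of dataset names. The helper names `build_last_modified_sql` and `millis_to_utc` and the example project and dataset names are hypothetical. It only builds the SQL text and converts the millisecond value; it does not call BigQuery itself.

```python
from datetime import datetime, timezone


def build_last_modified_sql(project, datasets):
    """Build one UNION ALL query over each dataset's __TABLES__ meta-table."""
    parts = [
        "SELECT dataset_id, table_id, "
        "TIMESTAMP_MILLIS(last_modified_time) AS last_modified "
        f"FROM `{project}.{dataset}.__TABLES__`"
        for dataset in datasets
    ]
    return "\nUNION ALL\n".join(parts)


def millis_to_utc(last_modified_time):
    """Convert the epoch-millisecond value stored in __TABLES__ to a UTC datetime."""
    return datetime.fromtimestamp(last_modified_time / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Hypothetical project and dataset names, for illustration only.
    print(build_last_modified_sql("my-project", ["sales", "analytics"]))
    print(millis_to_utc(1_600_000_000_000))
```

The generated statement can be pasted into the query editor or scheduled, and the result gives one row per table with its last modified timestamp.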

I've just queried stacked GA4 data using the following code:
FROM analytics_#########.__TABLES__
where table_id LIKE 'events_2%'
I kept the 2 in the 'events_2%' pattern so that my intraday tables (events_intraday_*) are not pulled through as well.
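To see why the pattern excludes the intraday tables, here is a small Python sketch that mimics SQL LIKE matching. The helper names `like_to_regex` and `matches_like` are hypothetical, and the sketch assumes the pattern contains no backslash escapes. Note that `_` in a LIKE pattern is itself a single-character wildcard, so 'events_2%' means "events", any one character, "2", then anything.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (no escape handling) into an anchored regex."""
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")  # % matches any run of characters
        elif ch == "_":
            out.append(".")  # _ matches exactly one character
        else:
            out.append(re.escape(ch))
    return re.compile("^" + "".join(out) + "$", re.DOTALL)


def matches_like(value, pattern):
    """Return True when value matches the LIKE pattern."""
    return like_to_regex(pattern).match(value) is not None


if __name__ == "__main__":
    for name in ["events_20230101", "events_intraday_20230101"]:
        print(name, matches_like(name, "events_2%"))
```

The daily table matches and the intraday table does not, because the character after "events_" is "i" rather than "2".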

Related

Need to simulate resourceName with full table path in Log Explorer

I need to understand under what circumstances protoPayload.resourceName contains the full table path, i.e. projects/<project_id>/datasets/<dataset_id>/tables/<table_id>, in the Log Explorer, as shown in the example below.
The entries below were generated by a Composer DAG running a KubernetesPodOperator that executes some dbt commands on some models. Based on these entries, I have a sink linked to Pub/Sub for further processing.
As seen in the image, the resourceName value appears as:
projects/gcp-project-name/datasets/dataset-name/tables/table-name
I have shaded the actual values of projectid, datasetid, and tablename.
I can't run a similar DAG job with the KubernetesPodOperator on test tables owing to environment restrictions, so I tried running some UPDATE and INSERT queries in the BigQuery editor. In that case protoPayload.resourceName comes through as:
projects/gcp-project-name/jobs/bxuxjob_
I tried the same queries from a Composer DAG using BigQueryInsertJobOperator. In that case protoPayload.resourceName comes through as:
projects/gcp-project-name/jobs/airflow_<>_
Here is my question: which operation or operations in BigQuery will give me protoPayload.resourceName in the form I am expecting, i.e.:
projects/<project_id>/datasets/<dataset_id>/tables/<table_id>
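For the downstream Pub/Sub processing, it may help to tell the two shapes of resourceName apart explicitly. This is a minimal Python sketch, assuming only the two formats quoted in this question. The function name `classify_resource_name` is hypothetical, and the sketch does not say which BigQuery operations produce each shape.

```python
import re

# Full table path: projects/<project_id>/datasets/<dataset_id>/tables/<table_id>
TABLE_RE = re.compile(
    r"^projects/(?P<project>[^/]+)/datasets/(?P<dataset>[^/]+)/tables/(?P<table>[^/]+)$"
)
# Job path: projects/<project_id>/jobs/<job_id>
JOB_RE = re.compile(r"^projects/(?P<project>[^/]+)/jobs/(?P<job>[^/]+)$")


def classify_resource_name(resource_name):
    """Return (kind, parts) where kind is 'table', 'job' or 'other'."""
    match = TABLE_RE.match(resource_name)
    if match:
        return "table", match.groupdict()
    match = JOB_RE.match(resource_name)
    if match:
        return "job", match.groupdict()
    return "other", {}


if __name__ == "__main__":
    print(classify_resource_name(
        "projects/gcp-project-name/datasets/dataset-name/tables/table-name"))
    print(classify_resource_name("projects/gcp-project-name/jobs/bxuxjob_"))
```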

Extracting information about View from Bigquery auditlog

I have created a sink using Log Explorer that pushes data to BigQuery. I can get information about tables with the following query.
SELECT
SPLIT(REGEXP_EXTRACT(protopayload_auditlog.resourceName, '^projects/[^/]+/datasets/[^/]+/tables/(.*)$'), '$')[OFFSET(0)] AS TABLE
FROM `project.dataset` WHERE
JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableDataRead") IS NOT NULL
OR JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableDataChange") IS NOT NULL
However, I am unable to find information about views. I have looked at:
Audit logs https://cloud.google.com/bigquery/docs/reference/auditlogs
and BigQuery asset information https://cloud.google.com/asset-inventory/docs/resource-name-format
but I cannot work out how to get the information about a view. What do I need to include? Is it something in my sink, or is there an alternative resource name I should use?
It seems that audit logs treat tables and views the same way.
I wrote this query to track view/table changes. InsertJob tells you about view creations; UpdateTable/PatchTable tell you about updates.
SELECT
resource.labels.dataset_id,
resource.labels.project_id,
--protopayload_auditlog.methodName,
REGEXP_EXTRACT(protopayload_auditlog.methodName,r'.*\.([^/$]*)') as method,
--protopayload_auditlog.resourceName,
REGEXP_EXTRACT(protopayload_auditlog.resourceName,r'.*tables\/([^/$]*)') as tableName,
protopayload_auditlog.authenticationInfo.principalEmail,
protopayload_auditlog.metadataJson,
CASE WHEN protopayload_auditlog.methodName = 'google.cloud.bigquery.v2.JobService.InsertJob'
THEN JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableCreation.table.view.query")
ELSE JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.tableChange.table.view.query") END
AS query,
receiveTimestamp
FROM `<project-id>.<bq_auditlog>.cloudaudit_googleapis_com_activity_*`
WHERE DATE(timestamp) >= "2022-07-10"
and protopayload_auditlog.methodName in
('google.cloud.bigquery.v2.TableService.PatchTable',
'google.cloud.bigquery.v2.TableService.UpdateTable',
'google.cloud.bigquery.v2.TableService.InsertTable',
'google.cloud.bigquery.v2.JobService.InsertJob',
'google.cloud.bigquery.v2.TableService.DeleteTable' )
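To check the extraction logic outside BigQuery, the same regular expressions and JSON paths can be tried in Python. This is a rough sketch: the function names are hypothetical, and the sample metadataJson in the usage section is made up, with its shape assumed from the JSON paths used in the query above. Unlike JSON_EXTRACT, which returns a JSON-formatted string, this returns the plain string value.

```python
import json
import re

INSERT_JOB = "google.cloud.bigquery.v2.JobService.InsertJob"


def short_method(method_name):
    """Mirror REGEXP_EXTRACT(methodName, r'.*\\.([^/$]*)'): text after the last dot."""
    match = re.search(r".*\.([^/$]*)", method_name)
    return match.group(1) if match else None


def table_name(resource_name):
    """Mirror REGEXP_EXTRACT(resourceName, r'.*tables\\/([^/$]*)')."""
    match = re.search(r".*tables/([^/$]*)", resource_name)
    return match.group(1) if match else None


def view_query(method_name, metadata_json):
    """Pick tableCreation for InsertJob, tableChange otherwise, then read view.query."""
    metadata = json.loads(metadata_json)
    key = "tableCreation" if method_name == INSERT_JOB else "tableChange"
    return metadata.get(key, {}).get("table", {}).get("view", {}).get("query")


if __name__ == "__main__":
    # Hypothetical payload, shaped only after the JSON paths in the query.
    sample = json.dumps({"tableChange": {"table": {"view": {"query": "SELECT 1"}}}})
    method = "google.cloud.bigquery.v2.TableService.PatchTable"
    print(short_method(method))
    print(table_name("projects/p/datasets/d/tables/my_view"))
    print(view_query(method, sample))
```

For a table rather than a view, the `view` key is absent and `view_query` returns None, which corresponds to the NULL in the query column.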
Views are virtual tables that are created and queried in the same way as tables. Since you are looking for views in a BigQuery dataset that is set up as a logging sink, you need to create views in BigQuery by following the steps given in this documentation.
Currently two versions are supported, v1 and v2: v1 reports API invocations and v2 reports resource interactions. After creating the views, you can do further analysis in BigQuery by saving or querying them.

Azure Data Factory - Rerun Failed Pipeline Against Azure SQL Table With Differential Date Filter

I am using ADF to keep an Azure SQL DB in sync with an on-prem DB. The on-prem DB is read only and the direction is one-way, from the Azure SQL DB to the on-prem DB.
My source table in the Azure SQL cloud DB is quite large (tens of millions of rows), so I have the pipeline set to use an UPSERT (a merge, to get a differential merge). I am using a filter on the source table, and the filter query has a WHERE condition that looks like this:
[HistoryDate] >= '@{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm' )}'
AND [HistoryDate] < '@{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm' )}'
The HistoryDate column is auto-maintained in the source table with a GETUTCDATE()-style approach. New records always get a higher value and are included by the WHERE condition.
This works well, but here is my question. I am testing on my local machine before deploying to the client. When I am not working, my laptop hibernates and the pipeline rightfully fails because my local SQL instance is offline during that run. In production, hibernation should not be an issue, but what happens if the client's connection is temporarily lost (i.e., the client loses internet access for a time)? Because my pipeline has a WHERE condition on the source to reduce the upsert to a practical number of rows, any failure would lose the data created during that 5-minute window.
A failed pipeline can be rerun, but the run time would be different at that point, so pipeline().parameters.windowStart and pipeline().parameters.windowEnd would also be different, and I would effectively miss the block of records that would have been picked up had the pipeline run on time.
As an FYI, I have this running every 5 minutes to keep the local copy in sync as close to real-time as possible.
Am I approaching this correctly? I'm sure others have this scenario and it's likely I am missing something obvious. :-)
Thanks...
Sorry to answer my own question, but to potentially help others in the future, it seems there was a better way to deal with this.
ADF offers a "Metadata-driven Copy Task" wizard on the home screen that creates a pipeline. When I used it, it offered a "Delta Load" option for tables, which takes a "Watermark". The watermark is a column whose value only increases, such as an IDENTITY column or a date or timestamp. At the end of the wizard, you can download a script that builds a table and a corresponding stored procedure that maintain the value of each parameter after each run. For example, if I want my delta load to be based on an IDENTITY column, it stores the maximum value seen in a particular pipeline run. The next time a run is triggered, it uses this as the MIN value (minus 1) and the current MAX value of the IDENTITY column to get the records added since the last run.
I was going to approach things this way, but it seems like ADF already does this heavy lifting for us. :-)
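To illustrate why a stored watermark survives a failed run where a time window does not, here is a simplified in-memory sketch in Python. It is not the script the wizard generates: the function name `run_delta_load`, the dictionary used as the watermark table, and the `copy` callback are all hypothetical, and it uses a strict "greater than the last watermark" comparison rather than the "minus 1" variant described above.

```python
def run_delta_load(source_rows, watermarks, table, copy, key="id"):
    """Copy rows added since the last successful run and advance the watermark."""
    last = watermarks.get(table, 0)  # watermark left by the last successful run
    current_max = max((row[key] for row in source_rows), default=last)
    batch = [row for row in source_rows if last < row[key] <= current_max]
    copy(batch)  # if this raises, the stored watermark is left untouched
    watermarks[table] = current_max  # advance only after a successful copy
    return batch


if __name__ == "__main__":
    rows = [{"id": 1}, {"id": 2}, {"id": 3}]
    watermarks, target = {}, []
    run_delta_load(rows, watermarks, "History", target.extend)

    rows += [{"id": 4}, {"id": 5}]

    def offline(batch):
        # Simulates the target being unreachable during a run.
        raise ConnectionError("target is unreachable")

    try:
        run_delta_load(rows, watermarks, "History", offline)
    except ConnectionError:
        pass  # failed run: the watermark has not moved

    # The rerun picks up exactly the rows the failed run missed.
    run_delta_load(rows, watermarks, "History", target.extend)
    print(watermarks, [row["id"] for row in target])
```

Because the lower bound comes from the last successful run rather than from the trigger time, a rerun at any later moment still covers the rows the failed run would have copied.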

How to save a view using federated queries across two projects?

I'm looking to save a view which uses federated queries (from a MySQL Cloud SQL connection) between two projects. I'm receiving two different errors (depending on which project I try to save in).
If I try to save in the project containing the dataset I get error:
Not found: Connection my-connection-name
If I try to save in the project that contains the connection I get error:
Not found: Dataset my-project:my_dataset
My example query that crosses projects looks like:
SELECT
bq.uuid,
sql.item_id,
sql.title
FROM
`project_1.my_dataset.psa_v2_202005` AS bq
LEFT OUTER JOIN
EXTERNAL_QUERY( 'project_2.us-east1.my-connection-name',
'''SELECT item_id, title
FROM items''') AS sql
ON
bq.looks_info.query_item.item_id = sql.item_id
The documentation at https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries#known_issues_and_limitations doesn't mention any limitations here.
Is there a way around this so I can save a view using an external connection from one project and dataset from another?
Your BigQuery table is located in US and your MySQL data source is located in us-east1. BigQuery automatically chooses to run the query in the location of your BigQuery table (i.e. in US); however, your Cloud SQL MySQL instance is in us-east1, which is why your query fails. The BigQuery table and the Cloud SQL instance must be in the same location for this query to succeed.
The solution in this kind of case is to move your BigQuery dataset manually to the same location as your Cloud SQL instance, following the steps explained in detail in this documentation. However, us-east1 is not currently supported for copying datasets, so I recommend creating a new connection in one of the locations mentioned in the documentation.
I hope you find the above pieces of information useful.

Create an Azure Data Factory pipeline to copy new records from DocumentDB to Azure SQL

I am trying to find the best way to copy yesterday's data from DocumentDB to Azure SQL.
I have a working DocumentDB database that is recording data gathered via a web service. I would like to routinely (daily) copy all new records from the DocumentDB to an Azure SQL DB table. In order to do so I have created and successfully executed an Azure Data Factory Pipeline that copies records with a datetime > '2018-01-01', but I've only ever been able to get it to work with an arbitrary date - never getting the date from a variable.
My research on DocumentDB SQL querying shows that it has Mathematical, Type checking, String, Array, and Geospatial functions but no date-time functions equivalent to SQL Server's getdate() function.
I understand that Data Factory Pipelines have some system variables that are accessible, including utcnow(). I cannot figure out, though, how to actually use those by editing the JSON successfully. If I try just including utcnow() within the query I get an error from DocumentDB that "'utcnow' is not a recognized built-in function name".
"query": "SELECT * FROM c where c.StartTimestamp > utcnow()",
If I try instead to build the string within the JSON using utcnow() I can't even save it because of a syntax error:
"query": "SELECT * FROM c where c.StartTimestamp > " + utcnow(),
I am willing to try a different technology than a Data Factory Pipeline, but I have a lot of data in our DocumentDB so I'm not interested in abandoning that, and I have much greater familiarity with SQL programming and need to move the data there for joining and other analysis.
What is the easiest and best way to copy those new entries over every day into the staging table in Azure SQL?
Are you using ADF V2 or V1?
For ADF V2.
I think you can follow the incremental approach that they recommend. For example, you could have a watermark table (it could live in your target Azure SQL database) and two Lookup activities: one obtains the watermark value from the previous run (a date, an integer, or whatever your audit value is), and the other obtains the MAX of the watermark column (i.e. the date) from your source documents. A Copy activity then copies all documents where c.StartTimeStamp <= MaxWatermarkValueFromSource AND c.StartTimeStamp > LastWaterMarkValue.
I followed this example using the Python SDK and it worked for me.
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-powershell
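The key point is that the date has to be turned into a literal before the query reaches DocumentDB, since DocumentDB does not evaluate utcnow() itself. Below is a minimal Python sketch of building the source query from the two watermark values. The helper names `to_iso_utc` and `build_incremental_query` are hypothetical, and it assumes c.StartTimestamp is stored as an ISO 8601 UTC string, so that string comparison follows time order.

```python
from datetime import datetime, timezone


def to_iso_utc(moment):
    """Format a timezone-aware datetime as an ISO 8601 UTC string."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_incremental_query(last_watermark, max_watermark):
    """Build the query text with both watermark values embedded as literals."""
    return (
        "SELECT * FROM c "
        f"WHERE c.StartTimestamp > '{to_iso_utc(last_watermark)}' "
        f"AND c.StartTimestamp <= '{to_iso_utc(max_watermark)}'"
    )


if __name__ == "__main__":
    last = datetime(2018, 1, 1, tzinfo=timezone.utc)
    newest = datetime(2018, 1, 2, tzinfo=timezone.utc)
    print(build_incremental_query(last, newest))
```

In the pipeline itself, the two values would come from the Lookup activities described above rather than from hard-coded dates.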