Need to simulate resourceName with full table path in Log Explorer - google-bigquery

I need to understand under what circumstances the protoPayload.resourceName with the full table path, i.e. projects/<project_id>/datasets/<dataset_id>/tables/<table_id>, appears in the Log Explorer, as shown in the example below.
The entries below were generated by a Composer DAG running a KubernetesPodOperator that executes some dbt commands on some models. Based on this, I have a sink linked to Pub/Sub for further processing.
As seen in the image, the resourceName value appears as:
projects/gcp-project-name/datasets/dataset-name/tables/table-name
I have masked the actual values of the project ID, dataset ID, and table name.
I can't run a similar DAG job with the KubernetesPodOperator on test tables owing to environment restrictions, so I tried running some UPDATE and INSERT queries using the BigQuery editor. Here is how the value of protoPayload.resourceName comes out:
projects/gcp-project-name/jobs/bxuxjob_
I tried the same queries from a Composer DAG using BigQueryInsertJobOperator. Here is how the value of protoPayload.resourceName comes out:
projects/gcp-project-name/jobs/airflow_<>_
Here is my question: what operation or operations in BigQuery will give me protoPayload.resourceName in the form I am expecting, i.e.:
projects/<project_id>/datasets/<dataset_id>/tables/<table_id>

Related

BigQuery Scheduled Query won't run

I've got data buckets set up in GCS and use BigQuery to load all my .csv files from that bucket to build a table. That works flawlessly. I made a simple deduplication query that, when run manually, selects only distinct rows and creates a new table with "DeDupe" appended (code below). That runs flawlessly.
CREATE OR REPLACE TABLE
`project-name-123456.dataset_2022.dataset 2022 DeDuped` AS
SELECT
DISTINCT *
FROM
`project-name-123456.dataset_2022.dataset 2022`
The issue I am having is with scheduling that query. Every time it tries to run I get the error "Error status: Not found: Dataset project-name-123456:dataset_2022 was not found in location US; JobID: project-name-123456:628d7766-0000-2d36-a82f-94eb2c0a664a"
The only thing I can figure is that I have the data location for the dataset set to "us-central1", as it has a free tier. And when I go to my scheduled query, whether I select the same data location or "Default", it always changes to "US Multiple".
Is there a way to fix this?
Or do I need to create my dataset in "US Multiple"?
I'm trying to cut down on costs as much as possible by keeping it in us-central1.
EDIT: Seems like I just needed to delete and recreate the scheduled query again. Chatted with Google Support and they sorted it. Sorry all!

How to save a view using federated queries across two projects?

I'm looking to save a view which uses federated queries (from a MySQL Cloud SQL connection) between two projects. I'm receiving two different errors (depending on which project I try to save in).
If I try to save in the project containing the dataset I get error:
Not found: Connection my-connection-name
If I try to save in the project that contains the connection I get error:
Not found: Dataset my-project:my_dataset
My example query that crosses projects looks like:
SELECT
bq.uuid,
sql.item_id,
sql.title
FROM
`project_1.my_dataset.psa_v2_202005` AS bq
LEFT OUTER JOIN
EXTERNAL_QUERY( 'project_2.us-east1.my-connection-name',
'''SELECT item_id, title
FROM items''') AS sql
ON
bq.looks_info.query_item.item_id = sql.item_id
The documentation at https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries#known_issues_and_limitations doesn't mention any limitations here.
Is there a way around this so I can save a view using an external connection from one project and dataset from another?
Your BigQuery table is located in the US multi-region and your MySQL data source is located in us-east1. BigQuery automatically chooses to run the query in the location of your BigQuery table (i.e. in US); however, your Cloud SQL MySQL instance is in us-east1, and that's why your query fails. Therefore the BigQuery table and the Cloud SQL instance must be in the same location for this query to succeed.
The solution for this kind of case is to move your BigQuery dataset to the same location as your Cloud SQL instance by following the steps explained in detail in this documentation. However, us-east1 is not currently supported for copying datasets, so I would recommend creating a new connection in one of the locations mentioned in the documentation.
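As a rough, untested sketch of how the view could then be saved once the connection and the dataset share a location: my_federated_view is a placeholder name, and the connection is assumed to have been recreated in the US multi-region to match the dataset.
CREATE OR REPLACE VIEW `project_1.my_dataset.my_federated_view` AS
SELECT
  bq.uuid,
  sql.item_id,
  sql.title
FROM
  `project_1.my_dataset.psa_v2_202005` AS bq
LEFT OUTER JOIN
  -- connection assumed to be recreated in the "us" multi-region to match the dataset
  EXTERNAL_QUERY('project_2.us.my-connection-name',
    '''SELECT item_id, title FROM items''') AS sql
ON
  bq.looks_info.query_item.item_id = sql.item_id
The point, per the above, is simply that the location in the connection ID passed to EXTERNAL_QUERY matches the location of the dataset the view is saved in.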
I hope you find the above pieces of information useful.

Get the Last Modified date for all BigQuery tables in a BigQuery Project

I have several datasets within a BigQuery project which are populated by various job engines and applications. I would like to maintain a dashboard of all of the Last Modified dates for every table within our project to monitor job failures.
Are there any command line or SQL commands which could provide this list of Last Modified dates?
For a SQL command you could try this one:
#standardSQL
SELECT *, TIMESTAMP_MILLIS(last_modified_time)
FROM `dataset.__TABLES__`
WHERE table_id = 'table_id'
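If you need this for every table in a dataset rather than a single table, a minimal variation along these lines should work (my-project and my_dataset are placeholder names):
#standardSQL
SELECT
  table_id,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM
  `my-project.my_dataset.__TABLES__`  -- placeholder project and dataset
ORDER BY
  last_modified
Since __TABLES__ is scoped to a single dataset, covering every table in the project means running this once per dataset (or combining the per-dataset queries with UNION ALL).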
I recommend, though, that you see if you can log these errors at the application level. By doing so you can also understand why something didn't work as expected.
If you are already using GCP you can make use of Stackdriver (it works on AWS as well). We started using it in our projects and I recommend giving it a try (we tested it with Python applications, though; I'm not sure how the tool performs with other clients, but it might be quite similar).
I've just queried stacked GA4 data using the following code:
SELECT *, TIMESTAMP_MILLIS(last_modified_time)
FROM `analytics_#########.__TABLES__`
WHERE table_id LIKE 'events_2%'
I have kept the 2 after events_ to ensure my intraday tables do not pull through as well.

Error In Query Operation: Cannot start a job without a project id

I keep getting an error using the bq command-line tool. For example, I can easily run this command and it returns the table that I want:
head -n 10 xxxx-bq:name_name.Report2
Note that xxxx-bq is the project id, and name_name is the dataset id. When I try to run a query against this table, say the following:
query "SELECT count(*) FROM xxxx-bq:name_name.Report2"
I get an error that says that I cannot start a job without a project id. What am I doing wrong here? How can I specify the project ID in the query? I know people have asked some similar questions. That said, I have been following along and my approach is not working.
Do you have a project id? If not, this page can help you set one up: https://developers.google.com/bigquery/bq-command-line-tool-quickstart
All BigQuery jobs (which include queries) require a project id, which is the project that gets billed for any damage done by the job. (by damage, I mean work)
You should either set your default project id (you can do this by running bq init)
or set the project id that you're running the job under via --project_id=
So if you're running bq shell, you would use bq shell --project_id=myprojectid instead.
Strange... I just started working with bq and got the same error, but it didn't like me passing --project_id=[myprojectid]. Although I was already authed with gcloud auth login, I had to run bq init (and it seemingly didn't do anything); after that, my queries worked just fine.

Pig step execution details

I am a newbie to Pig.
I have written a small script in Pig in which I first load data from two different tables and then right outer join them; later I also have another join of tables for two different sets of data. It works fine, but I want to see the steps of execution, e.g. in which step my data is loaded, so that I can note the time needed for loading, and later the details of the join step, such as how much time it takes to join that many records.
Basically, I want to know which part of my Pig script is taking longer to run so that I can further optimize it.
Is there any way we could print from within the script and find out which steps have executed or have started executing?
Through the JobTracker details link I could not get much info; I could just see that a mapper is running and a reducer is running, but I could not find which part of the script the mapper is running for.
For example, for a Hive job run we can see in the JobTracker details link which step is currently being executed.
Any information will be really helpful.
Thanks in advance.
I'd suggest you have a look at the following:
Pig's Progress Notification Listener
Penny: this is a monitoring tool, but I'm afraid it hasn't been updated in the recent past (e.g. it won't compile for Pig 0.12.0 unless you make some code changes)
Twitter's Ambrose project: https://github.com/twitter/ambrose
On the other hand, after executing the script you can see detailed statistics about the execution time of each alias (see: Job Stats (time in seconds)).
Have a look at the EXPLAIN operator. This doesn't give you real-time stats as your code is executing, but it should give you enough information about the MapReduce plan your script generates that you'll be able to match up the MR jobs with the steps in your script.
Also, while your script is running you can inspect the configuration of the Hadoop job. Look at the variables "pig.alias" and "pig.job.feature". These tell you, respectively, which of your aliases (tables/relations) is involved in that job and what Pig operations are being used (e.g., HASH_JOIN for a JOIN step, SAMPLER or ORDER BY for an ORDER BY step, and so on). This information is also available in the job stats that are output to the console upon completion.