Get the BQ Query History Data - google-bigquery

I am running the following query in my BQ console to see the query history data:
select * from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT;
I can see all the query history data in the results. I also came across the Audit Logs documentation: https://cloud.google.com/bigquery/docs/reference/auditlogs
I have created a sink using the following command:
gcloud logging sinks create bq-audit-sink pubsub.googleapis.com/projects/my-project/topics/bq_audit --log-filter='protoPayload.metadata."@type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata"'
But I am not able to find the query data in the sink output, i.e. the past queries that were run and the information about their jobs.
How can I get the same data that is available via the INFORMATION_SCHEMA.JOBS_BY_PROJECT view?

The INFORMATION_SCHEMA view is a historical record; the log sink only receives events as they flow through the logging mechanism. The sink doesn't get backfilled with events from before it was set up, if that was your hope.
Are you not receiving any events in the pubsub topic? Try running a query in the instrumented project and observe what's emitted into the pubsub topic.
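For reference, the kind of job history that view exposes can be pulled with a query along these lines (the columns are part of the JOBS_BY_PROJECT schema; the QUERY filter and the 7-day window are just an example):
-- Recent query jobs in this project: who ran what, when, and how much it scanned.
SELECT
  creation_time,
  user_email,
  job_id,
  statement_type,
  query,
  total_bytes_processed,
  state
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY creation_time DESC;
The audit-log sink, by contrast, will only ever contain events emitted after the sink was created.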

Related

Dialogflow CX logs sink to BigQuery. sink error - field: value is not a record

I am using Google Cloud Logging to sink Dialogflow CX request data to BigQuery. The BigQuery tables are auto-generated when you create the sink via Cloud Logging.
We keep getting a sink error - field: value is not a record.
This is because pageInfo/formInfo/parameterInfo/value is of type String in BigQuery BUT there are values that are records, not strings. One example is #sys.date-time
How do we fix this?
We have not tried anything at this point, since the BigQuery dataset is auto-created via a logging filter. We cannot modify the logs, and even if we could modify the table schema, what would we change it to, given that most of the time "Value" is a String but other times it is a Record?

BigQuery streaming insert from Dataflow - no results

I have a Dataflow pipeline that reads messages from Pub/Sub Lite and streams the data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
The BigQuery UI tells me "This query will process 1.9 GB when run", but when actually running the query I don't get any results. My pipeline has been running for a whole week now and I am getting the same result for the last two days. However, for 2021-10-11 and the days before that I am seeing actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
    .withSchema(createTableSchema())
    .withFormatFunction(event -> createTableRow(event))
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
    .to(TABLE);
Why is BigQuery taking so long to commit the values to the partitions, while at the same time telling me there is actually data available?
BigQuery is processing data and not returning any rows because it is also processing the data in your streaming buffer. Data in the buffer can take up to 90 minutes to be committed to the partitioned tables.
Check more details in this Stack Overflow thread and also in the documentation available here.
When streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
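If the table were ingestion-time partitioned, the rows still sitting in the streaming buffer could be counted like this (a sketch only; the table name is the one from the question, and the _PARTITIONTIME pseudo column does not exist on column-partitioned tables):
-- Rows in the streaming buffer have a NULL _PARTITIONTIME.
SELECT COUNT(*) AS rows_in_streaming_buffer
FROM `my-project.my-dataset.my-table`
WHERE _PARTITIONTIME IS NULL;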
If you are having problems writing the data from Pub/Sub to BigQuery, I recommend using a template available in Dataflow.
Use the Dataflow template available in GCP to write the data from Pub/Sub to BigQuery:
There is a template to write data from a Pub/Sub topic to BigQuery, and it already takes care of the possible corner cases.
I tested it as follows and it works perfectly:
Create a subscription on your Pub/Sub topic;
Create a bucket for temporary storage;
Create the job using the Pub/Sub Topic to BigQuery template.
For testing, I just sent a message to the topic in JSON format and the new data was added to the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
If you want something more complex, you can fork the template code on GitHub and adjust it to your needs.
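To confirm the test row actually landed, a simple lookup against the output table can be run (table and column names here just follow the test message above and are assumptions):
-- Look for the test message published with gcloud above.
SELECT *
FROM `my-project.my-dataset.output_table`
WHERE item = '9999';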

Google Dataflow SQL | Creating Branches | Error Handling

We are trying to use Dataflow SQL for stream ingestion:
We have a Pub/Sub topic (source) and a BigQuery table (sink).
To achieve that, we need to follow these steps:
From the BigQuery UI, add a schema for the topic manually.
Question: Can we automate this process using command-line options?
Write SQL for the transformation and execute it using the gcloud dataflow sql query command (this helps us with dynamic queries and automation); a sketch of such a query is shown below.
Question: Suppose a key is missing from a Pub/Sub message and the pipeline marks that message as an error in Stackdriver. Can we add some capability so that, if schema validation fails, the message goes to table y, and otherwise to table x? Something like: if we get message type y, route it to table y, else to table x?
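As a rough illustration of step 2, a Dataflow SQL transformation over a Pub/Sub topic looks something like this (project, topic, and field names are made up here, and the topic must already have a schema assigned):
-- Read from the Pub/Sub topic and keep only valid-looking rows.
SELECT
  tr.event_timestamp,
  tr.item,
  tr.amount
FROM pubsub.topic.`my-project`.`my-topic` AS tr
WHERE tr.amount > 0
Such a statement can then be submitted with gcloud dataflow sql query, pointing it at a BigQuery dataset and table for the output.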
You can use gcloud to add a schema to a topic. This was actually the only way to do it, at first: https://cloud.google.com/dataflow/docs/guides/sql/data-sources-destinations#gcloud
For saving messages that cannot be parsed into SQL rows, the functionality is often called "dead letter queue". It is available in Beam SQL DDL for Pubsub but is not yet available when using Dataflow SQL through the BigQuery UI. See https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/#pubsub
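For what it's worth, in Beam SQL DDL the dead letter queue is just a table property on the Pub/Sub external table; a minimal sketch (topic names and payload fields are assumptions) looks like:
-- Messages that cannot be parsed into the payload schema go to the dead letter topic.
CREATE EXTERNAL TABLE events (
  event_timestamp TIMESTAMP,
  attributes MAP<VARCHAR, VARCHAR>,
  payload ROW<item VARCHAR, amount DOUBLE>
)
TYPE pubsub
LOCATION 'projects/my-project/topics/my-topic'
TBLPROPERTIES '{"format": "json", "deadLetterQueue": "projects/my-project/topics/my-dlq"}'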

How to fetch COMPLETE data from Cumulocity using cep

I am currently getting data using the Event processing tab in the Administration application with CEP queries, but only data that arrives after I save the query is fetched. How can we query the complete set of data?
@Name("CarLoadWeekly")
@Resilient
insert into createEvent
select "vechiclecount" as type, e.event.time as time from EventCreated as e
where (e.text ='active')
Event processing is a pure realtime engine. The scripts that you deploy there are injected into the realtime stream of data and do not apply to historical data.
If you need to access historical data within your realtime script, there are certain functions that let you query historical data. You can take a look at the analytics guide for that.

Result of Bigquery job running on a table in which data is loaded via streamingAPI

I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will be exported later into Google Cloud Storage)
I'm running this query using the BQ CLI with all_companies as the destination table and this generates a BQ Job (runtime: 20mins+).
The company_* tables are populated constantly using the streamingAPI.
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streamingAPI adds data to company_* tables at T0+1min and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
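For context, the wildcard merge in question presumably looks something like this (project, dataset, and table names are placeholders):
-- _TABLE_SUFFIX identifies which company_* table each row came from.
SELECT
  *,
  _TABLE_SUFFIX AS company
FROM `my-project.my-dataset.company_*`
It would then be run via the bq CLI with --destination_table pointing at all_companies.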
As described here, the query engine will look at both the columnar storage and the streaming buffer, so the query should potentially see the streamed data.
It depends on what you mean by a runtime of 20+ minutes. If the query runs 20 minutes after you create the job, then all data in the streaming buffer by T0+20min will be included.
If on the other hand the job starts immediately and takes 20 minutes to complete, you will only see data that is in the streaming buffer at the moment the table is queried.