Dialogflow CX logs sink to BigQuery. sink error - field: value is not a record - google-bigquery

I am using google cloud logging to sink Dialogflow CX requests data to big query. BigQuery tables are auto generated when you create the sink via Google Logging.
We keep getting a sink error - field: value is not a record.
This is because pageInfo/formInfo/parameterInfo/value is of type String in BigQuery BUT there are values that are records, not strings. One example is #sys.date-time
How do we fix this?
We have not tried anything at this point since the BigQuery dataset is auto created via a Logging Filter. We cannot modify the logs and if we could modify the table schema, what would we change it to since most of the time "Value" is a String but other times it is a Record

Related

BigQuery streaming insert from Dataflow - no results

I have a Dataflow pipeline which is reading messages from PubSub Lite and streams data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
The BigQuery UI tells me This query will process 1.9 GB when run. But when actually running the query I don't get any results. My pipeline is running for a whole week now and I am getting the same results for the last two days. However, for 2021-10-11 and the days before that I am seeing actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
.withSchema(createTableSchema())
.withFormatFunction(event -> createTableRow(event))
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
.to(TABLE);
Why is BigQuery taking so long for committing the values to the partitions but at the same time telling me there is actually data available?
EDIT 1:
BigQuery is processing data and not returning any rows because its processing also the data in your streaming buffer. Data on buffer is can take up to 90 min to be committed in the partitioned tables.
Check more details in this stack and also in the documentation available here.
When streaming to a partitioned table, data in the
streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
If you are having problems to write the data from pubsub in BigQuery, I recommend you to use an template avaiable in dataflow.
Use an Dataflow template avaiable in GCP to write the data from PubSub to BigQuery:
There is an tempate to write data from a pubsub topic to bigquery and it already takes care of the possible corner cases.
I tested it as following and works perfectly:
Create a subscription in you PubSub topic;
Create bucket for temporary storage;
Create the job as following:
For testing, I just sent a message to the topic in json format and the new data was added in the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
If you want something more complex, you can fork from the templates code from github and adjust it for your need.

Data Flow output to Azure SQL Database contains only NULL data on Azure Data Factory

I'm testing the data flow on my Azure Data Factory. I created Data Flow with the following details:
Source dataset linked service - from CSV files dataset from Blob storage
Sink linked service - Azure SQL database with pre-created table
My CSV files are quite simple as they contain only 2 columns (PARENT, CHILD). So, my table in SQL DB also have only 2 columns.
For the sink setting of my data flow, I have allowed insert data and leaving other options as default.
I have also mapped the 2 columns for input and output columns as per screenshot.
The pipeline with data flow ran successfully when I checked the result, I could see thqat 5732 rows were processed. Is this the correct way to check? As this is the first time I try this functionality in Azure Data Factory.
But, when I click on Data preview tab, they are all NULL value.
And; when I checked my Azure SQL DB in the table where I tried to insert the data from CSV files from Blob storage with selecting top 1000 rows from this table, I don't see any data.
Could you please let me know what I configured incorrectly on my Data Flow? Thank you very much in advance.
Here is the screenshot of ADF data flow source data, it does see the data on the right side as they are not NULL, but on the left side are all NULLs. I imagine that the right side are the data from the CSV from the source on the blob right? And the left side is the sink destination as the table is empty for now?
And here is the screenshot for the sink inspect input, I think this is correct as it reads the 2 columns correctly (Parent, Child), is it?
After adding Map drifted, to map "Parent" => "parent" and "Child" => "child"
I get this error message after running the pipeline.
When checking on sink data preview, I get this error message. It seems like there is incorrect mapping?
I rename the MapDrifted1 expression to "toString(byName('Parent1))" and Child1 as suggested.
The data flow executed successfully, however I still get NULL data in the sink SQL table.
Can you copy/paste the script behind your data flow design graph? Go to the ADF UI, open the data flow, then click the Script button on top right.
In your Source transformation, click on Data Preview to see the data. Make sure you are seeing your data, not NULLs. Also, look at the Inspect on the INPUT for your Sink, to see if ADF is reading additional columns.

How transfer data from Google Cloud Storage to Biq Query in partitioned and clustered table?

In first time,I created a empty table with partition and cluster. After that, I would like to configure data transfere service to fill my table from Google Cloud Storage.But when I configure the transfer, I didn't see a parameter field which allows to choose the cluster field.
I tried to do the same thing without the cluster and I can fill my table easily.
Big query error when I ran the transfer:
Failed to start job for table matable$20190701 with error INVALID_ARGUMENT: Incompatible table partitioning specification. Destination table exists with partitioning specification interval(type:DAY,field:) clustering(string_field_15), but transfer target partitioning specification is interval(type:DAY,field:). Please retry after updating either the destination table or the transfer partitioning specification.
When you define the table you specify the partitioning and clustering columns. That's everything you need to do.
When you load the data (from CLI or UI) from GCS BigQuery automatically partition and cluster the data.
If you can give more detail of how you create the table and set up the transfer would be helpful to provide a more detailed explanation.
Thanks for your time.
Of course :
empty table configuration
transfer configuration
I success to transfer datat without cluster but, when I add a cluster in my empty table,the trasnfer fails.

Truncate Behavior in BigQuery

I am trying to load data in google bigquery table with OAuth authentication and Truncate mode.
Whenever I am pushing the data from mysql database table to google big query table, it is changing the structure of the bigquery table. Columns which are defined as REQUIRED are getting changed into NULLABLE.
Is this an expected behavior?
enter image description here

BigQuery: Cannot insert new value after updating several schema fields using streaming API

The issue I am facing in my nodejs application is identical to this user's question: Cannot insert new value to BigQuery table after updating with new column using streaming API.
To my understanding changes such as widening a table's schema may require some period of time before streamed inserts can reference the new columns otherwise a 'no such field' error is returned. For me this error is not always consistent as sometimes I am able to successfully insert.
However, I specifically wanted to know if you could alternatively use a load job instead of streaming? If so what drawbacks does it have as I am not sure of the difference even having read the documentation.
Alternatively, if I do use streaming but with the ignoreUnknownValues option, does that mean that all of the data is eventually inserted including data referencing new columns? Just that new columns are not queryable until the table schema is finished updating?