Why is my Upsolver Kafka data source stuck and/or not pulling any data - SQLake

The Kafka topic has messages, but the Upsolver data source is stuck or not pulling any new messages. We have about 15 such data sources; some are working fine but some seem stuck. What is happening?

When a Kafka data source is created, there is a configuration option called "Read From Start". When it is true (checked), the Kafka data source will only pull data if there were messages in the topic at the time the data source was created. If the topic was empty, the data source will get stuck and stall. In that case, the data source should be created with the Read From Start property unchecked; it will then start pulling messages from the point of its creation.
Alternatively, you can publish a sample dummy message matching your schema to the topic before you create the data source with Read From Start = True, if you don't want to or can't wait for real data to arrive before creating it.
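For example, a throwaway message can be published with the standard Kafka console producer (kafka-console-producer.sh in the Apache Kafka distribution) before the data source is created. This is only a sketch; the broker address, topic name, and JSON payload are placeholders for your own values:
# Publish one dummy message matching your schema so the topic is not empty
# (older Kafka versions use --broker-list instead of --bootstrap-server)
echo '{"id": 1, "event_time": "2023-01-01T00:00:00Z"}' | \
  kafka-console-producer --bootstrap-server my-broker:9092 --topic my-topic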

Related

Dialogflow CX logs sink to BigQuery. sink error - field: value is not a record

I am using Google Cloud Logging to sink Dialogflow CX request data to BigQuery. The BigQuery tables are auto-generated when you create the sink via Cloud Logging.
We keep getting a sink error: field "value" is not a record.
This is because pageInfo/formInfo/parameterInfo/value is of type String in BigQuery, but there are values that are records, not strings. One example is #sys.date-time.
How do we fix this?
We have not tried anything at this point since the BigQuery dataset is auto-created via a logging filter. We cannot modify the logs, and even if we could modify the table schema, what would we change it to, since most of the time "value" is a String but other times it is a Record?

BigQuery streaming insert from Dataflow - no results

I have a Dataflow pipeline which is reading messages from PubSub Lite and streams data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
The BigQuery UI tells me This query will process 1.9 GB when run. But when actually running the query I don't get any results. My pipeline has been running for a whole week now and I am getting the same results for the last two days. However, for 2021-10-11 and the days before that I am seeing actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
.withSchema(createTableSchema())
.withFormatFunction(event -> createTableRow(event))
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
.to(TABLE);
Why is BigQuery taking so long to commit the values to the partitions, while at the same time telling me there is actually data available?
BigQuery is processing data but not returning any rows because it is also processing the data in your streaming buffer. Data in the buffer can take up to 90 minutes to be committed to the partitioned tables.
Check more details in this Stack Overflow thread and also in the documentation available here.
When streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
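To confirm whether rows are still sitting in the streaming buffer, one option is to inspect the table metadata with the bq CLI and look for the streamingBuffer section, which only appears while a buffer is active. A sketch; the project, dataset, and table names are placeholders:
# A "streamingBuffer" block with estimatedRows/oldestEntryTime means rows
# have not yet been committed to the partitions
bq show --format=prettyjson my-project:my-dataset.my-table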
If you are having problems writing the data from Pub/Sub into BigQuery, I recommend using a Dataflow template available in GCP.
There is a template that writes data from a Pub/Sub topic to BigQuery, and it already takes care of the possible corner cases.
I tested it as follows and it works perfectly:
Create a subscription on your Pub/Sub topic;
Create a bucket for temporary storage;
Create the job as follows (see the CLI sketch below):
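If you prefer the CLI over the console, the classic Pub/Sub Subscription to BigQuery template can be launched with gcloud. A sketch; the job name, region, bucket, subscription, and output table below are placeholders to replace with your own:
gcloud dataflow jobs run pubsub-to-bq-test \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
  --staging-location=gs://my-temp-bucket/temp \
  --parameters=inputSubscription=projects/my-project/subscriptions/test-topic-sub,outputTableSpec=my-project:my_dataset.output_table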
For testing, I just sent a message to the topic in JSON format and the new data was added to the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
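You can then confirm the row landed with a quick query against the output table (a sketch; the table name is a placeholder):
bq query --use_legacy_sql=false 'SELECT * FROM `my-project.my_dataset.output_table` LIMIT 10'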
If you want something more complex, you can fork the template code from GitHub and adjust it to your needs.

Get the BQ Query History Data

I am running the following query in my BQ console to see the query history data:
select * from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT;
I can see all the query history data present in the results. I also came across the audit logs: https://cloud.google.com/bigquery/docs/reference/auditlogs
I have created the sink using command:
gcloud logging sinks create bq-audit-sink pubsub.googleapis.com/projects/my-project/topics/bq_audit --log-filter='protoPayload.metadata."@type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata"'
But I am not able to find the query data, i.e. the past queries that were fired and the information about those jobs.
How can I get the data that is available via the INFORMATION_SCHEMA.JOBS_BY_PROJECT view?
The INFORMATION_SCHEMA view is a historical record, whereas the log sink only receives events as they flow through the logging mechanism. The sink doesn't get backfilled with events from before it was set up, if that was your hope.
Are you not receiving any events in the pubsub topic? Try running a query in the instrumented project and observe what's emitted into the pubsub topic.
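A quick way to check is to attach a subscription to the sink topic, run a test query in BigQuery, and then pull a few messages (a sketch; the subscription name is a placeholder):
gcloud pubsub subscriptions create bq-audit-sub --topic=bq_audit
gcloud pubsub subscriptions pull bq-audit-sub --limit=5 --auto-ack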

Azure Data Factory Copy Activity - Does not fail, despite being unsuccessful

Overview
I have a Data Factory copy activity in a pipeline that moves tabular data (.csv) from a file in Azure Data Lake into a landing schema in a SQL database.
The table definitions will be refreshed each load, in principle, and so there should not be errors. However, I want to log errors in the case that the table exists and the file cannot be mapped to the destination.
I anticipated that the activity would fail when it could not map a source column to the destination table definition. Strangely, my process succeeds despite not loading the file into the database.
Testing
I added a column in my source file.
I did not update the destination database table definition.
I truncated the destination table.
I unset fault tolerance.
Deleted all copy activity logs (nascent development environment, so this is OK).
Ran the pipeline.
Checked the Data Factory pipeline output: 100% success rating!?
Queried the destination table (empty).
Opened each log file and looked for relevant information (nothing, just headers per copy activity call).
Copy activity settings

Data Flow output to Azure SQL Database contains only NULL data on Azure Data Factory

I'm testing the data flow on my Azure Data Factory. I created Data Flow with the following details:
Source dataset linked service - a CSV files dataset from Blob storage
Sink linked service - Azure SQL database with pre-created table
My CSV files are quite simple, as they contain only 2 columns (PARENT, CHILD). So my table in the SQL DB also has only 2 columns.
For the sink settings of my data flow, I have allowed insert and left the other options as default.
I have also mapped the 2 input and output columns as per the screenshot.
The pipeline with the data flow ran successfully. When I checked the result, I could see that 5732 rows were processed. Is this the correct way to check? This is the first time I have tried this functionality in Azure Data Factory.
But when I click on the Data preview tab, the values are all NULL.
And when I checked the table in my Azure SQL DB where I tried to insert the data from the CSV files in Blob storage (selecting the top 1000 rows), I don't see any data.
Could you please let me know what I configured incorrectly on my Data Flow? Thank you very much in advance.
Here is the screenshot of the ADF data flow source data. It does see the data on the right side, as those values are not NULL, but the left side is all NULLs. I imagine the right side is the data from the CSV source in Blob storage, right? And the left side is the sink destination, since the table is empty for now?
And here is the screenshot of the sink's Inspect input. I think this is correct, as it reads the 2 columns (Parent, Child) correctly, isn't it?
After adding a Map Drifted transformation to map "Parent" => "parent" and "Child" => "child":
I get this error message after running the pipeline.
When checking the sink data preview, I get this error message. It seems like there is an incorrect mapping?
I renamed the MapDrifted1 expression to "toString(byName('Parent1'))", and similarly for Child1, as suggested.
The data flow executed successfully; however, I still get NULL data in the sink SQL table.
Can you copy/paste the script behind your data flow design graph? Go to the ADF UI, open the data flow, then click the Script button at the top right.
In your Source transformation, click on Data preview to see the data. Make sure you are seeing your data, not NULLs. Also, look at the Inspect tab on the input for your Sink to see if ADF is reading additional columns.