Pub/Sub message ingestion into BigQuery using Data Fusion

I've built a simple realtime pipeline that receives messages and attributes from a Pub/Sub subscription, wrangles them to keep only a few fields, and loads the result into a BigQuery table. When deployed and run, the pipeline log says: Importing into table '<tablename>' from 0 paths; path[0] is '(empty)'; awaitCompletion: true
I can't understand why there are 0 paths, or why all the records are going to errors even though an error collector was set up. Is there a better way to debug the Wrangler stage?
Sample Wrangler directives are below:
keep message,attributes
set-charset :message 'utf-8'
set-type :attributes string
parse-as-json :attributes 1
parse-as-json :message 5
keep attributes_page_url,attributes_cart_remove,attributes_page_title,attributes_transaction_complete,message_event_id,message_data_dom_domain,message_data_dom_title,message_data_dom_pathname,message_data_udo_ut_visitor_id
columns-replace s/^attributes_//g
columns-replace s/^message_//g
Any help is appreciated. Thanks.

The reason you see that there are 0 paths to load is that all records are causing errors during wrangling.
There are 2 ways in which you can capture these errors:
Configure the Wrangler stage to Fail Pipeline on error. This will show the exception/error in the logs.
Attach the Error output from the Wrangler stage to an Error Collector, and store the output in a File or GCS Sink. This allows you to capture the error message for each row. Configure the Error Collector as follows:
Error Message Column Name = errorMsg
Error Code Column Name = errorCode
Error Emitter Node Name = invalidRecord

Related

Load from GCS to GBQ causes an internal BigQuery error

My application creates thousands of "load jobs" daily to load data from Google Cloud Storage URIs into BigQuery, and only a few of them fail with this error:
"Finished with errors. Detail: An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 7916072"
The application is written in Python and uses these libraries:
google-cloud-storage==1.42.0
google-cloud-bigquery==2.24.1
google-api-python-client==2.37.0
The load job is created by calling:
load_job = self._client.load_table_from_uri(
    source_uris=source_uri,
    destination=destination,
    job_config=job_config,
)
This method has a default parameter:
retry: retries.Retry = DEFAULT_RETRY,
so I expected the job to be retried automatically on such errors.
Id of specific job that finished with error:
"load_job_id": "6005ab89-9edf-4767-aaf1-6383af5e04b6"
"load_job_location": "US"
After getting the error, the application recreates the job, but that doesn't help.
Subsequent failed job ids:
5f43a466-14aa-48cc-a103-0cfb4e0188a2
43dc3943-4caa-4352-aa40-190a2f97d48d
43084fcd-9642-4516-8718-29b844e226b1
f25ba358-7b9d-455b-b5e5-9a498ab204f7
...
As mentioned in the error message, wait according to the back-off requirements described in the BigQuery Service Level Agreement, then try the operation again.
If the error continues to occur and you have a support plan, create a new GCP support case. Otherwise, you can open a new issue on the issue tracker describing the problem. You can also try to reduce the frequency of this error by using Reservations.
For more information about the error messages, you can refer to this document.
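The "wait, then try again" advice can be sketched as a small retry helper with exponential back-off. This is a minimal illustration, not the client library's own mechanism: the function name retry_with_backoff and its parameters are made up for the example, and the delays are placeholders rather than values taken from the SLA. As far as I understand, the retry argument of load_table_from_uri applies to the API request that creates the job, not to a job that was created and later finished with an error, which is why an outer loop like this may be needed.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=32.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Between attempts, wait base_delay * 2**attempt seconds (capped at
    max_delay) plus up to one second of random jitter.
    All names and defaults here are hypothetical.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Real code should catch only the errors known to be transient.
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(base_delay * (2 ** attempt), max_delay)
            sleep(delay + random.uniform(0, 1))
```

In the application above, operation would be a function that calls load_table_from_uri(...) and then load_job.result(), so that a job finishing with an error raises and a fresh job is submitted after the delay.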

Azure data factory How to catch any error on any activity and log it into database?

I am currently working on error handling in ADF.
I know how to get the error from a particular activity by using @activity('').error.message and then pass it to a database.
Is there a generic way to catch any error from any activity in a pipeline?
Best regards!
You can use the dependency conditions between activities in Azure Data Factory to log errors with a single logging activity instead of duplicating the same activity after every step.
This blog post explains the approach in detail: https://datasharkx.wordpress.com/2021/08/19/error-logging-and-the-art-of-avoiding-redundant-activities-in-azure-data-factory/
The basic principles are:
Multiple dependencies with the same source are OR’ed together.
Multiple dependencies with different sources are AND’ed together.
Together this looks like (Act_3fails OR Act_3Skipped) AND (Act_2Completes OR Act_2skipped)
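The OR/AND rule can be modelled in a few lines of Python to check when a shared logging activity would fire. This is only a sketch of the logic, not ADF code: the function should_run and the dictionaries are hypothetical, and it assumes that a "Completed" condition is satisfied by either a succeeded or a failed upstream activity.

```python
# Which actual outcomes satisfy each dependency condition.
# Assumption: "Completed" matches both Succeeded and Failed.
CONDITION_MATCHES = {
    "Succeeded": {"Succeeded"},
    "Failed": {"Failed"},
    "Skipped": {"Skipped"},
    "Completed": {"Succeeded", "Failed"},
}


def should_run(dependencies, outcomes):
    """Return True if an activity with these dependencies would run.

    dependencies: {upstream_activity: [condition, ...]}
    outcomes:     {upstream_activity: actual outcome}
    Conditions on the same source are OR'ed; different sources are AND'ed.
    """
    return all(
        any(outcomes[source] in CONDITION_MATCHES[cond] for cond in conditions)
        for source, conditions in dependencies.items()
    )


# (Act_3 fails OR Act_3 skipped) AND (Act_2 completes OR Act_2 skipped)
log_error_deps = {
    "Act_3": ["Failed", "Skipped"],
    "Act_2": ["Completed", "Skipped"],
}

# Act_2 failed, so Act_3 was skipped: the logging activity runs.
print(should_run(log_error_deps, {"Act_2": "Failed", "Act_3": "Skipped"}))       # True
# Everything succeeded: the logging activity does not run.
print(should_run(log_error_deps, {"Act_2": "Succeeded", "Act_3": "Succeeded"}))  # False
```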
When pipeline1 contains several activities and you want to capture the error message whenever any one of them fails, the Execute Pipeline activity is the right approach.
When you use an Execute Pipeline activity to trigger pipeline1 and any activity inside pipeline1 fails, it throws an error message that includes the name of the activity that failed.
Look at the following demonstration. I have a pipeline named p1 with 3 activities: Get Metadata, ForEach, and Script.
I trigger the p1 pipeline from the p2 pipeline using an Execute Pipeline activity and store the error message in a variable using a Set Variable activity. For the demonstration, I ran it 3 times and made a different activity fail each time.
When the get metadata activity fails, the respective error message captured in p2 pipeline will be as follows:
Operation on target Get Metadata1 failed: The required Blob is missing. ContainerName: data2, path: data2/files/.
When only the ForEach activity in the p1 pipeline fails, the error message captured in the p2 pipeline is:
Operation on target ForEach1 failed: Activity failed because an inner activity failed
When only the Script activity in the p1 pipeline fails, the error message captured in the p2 pipeline is:
Operation on target Script1 failed: Invalid object name 'mydemotable'
So, using an Execute Pipeline activity (to execute pipeline1) lets you capture the error message from whichever activity fails inside that pipeline.
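If the captured message is later processed outside ADF (for example, before writing it to a log table), the failed activity name can be separated from the detail text. The sketch below assumes the message always follows the "Operation on target <activity> failed: <detail>" shape seen in the three examples above; split_error is a hypothetical helper, not part of ADF.

```python
import re

# Assumed message shape, based only on the three examples above.
_PATTERN = re.compile(
    r"^Operation on target (?P<activity>.+?) failed: (?P<detail>.*)$",
    re.DOTALL,
)


def split_error(message):
    """Return (activity_name, detail); activity_name is None if no match."""
    match = _PATTERN.match(message)
    if match is None:
        return None, message
    return match.group("activity"), match.group("detail")


activity, detail = split_error(
    "Operation on target Script1 failed: Invalid object name 'mydemotable'"
)
print(activity)  # Script1
print(detail)    # Invalid object name 'mydemotable'
```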

LeaseAlreadyPresent Error in Azure Data Factory V2

I am getting the following error in a pipeline that has a Copy activity with a REST API as the source and Azure Data Lake Storage Gen2 as the sink.
"message": "Failure happened on 'Sink' side. ErrorCode=AdlsGen2OperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code 'Conflict'. Account: '{Storage Account Name}'. FileSystem: '{Container Name}'. Path: 'foodics_v2/Burgerizzr/transactional/_567a2g7a/2018-02-09/raw/inventory-transactions.json'. ErrorCode: 'LeaseAlreadyPresent'. Message: 'There is already a lease present.'. RequestId: 'd27f1a3d-d01f-0003-28fb-400303000000'..,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.Azure.Storage.Data.Models.ErrorSchemaException,Message=Operation returned an invalid status code 'Conflict',Source=Microsoft.DataTransfer.ClientLibrary,'",
The pipeline runs in a for loop with Batch size = 5. When I make it sequential, the error goes away, but I need to run it in parallel.
This is a known ADF limitation with variables when loop iterations run in parallel.
You are probably setting the file name using a variable.
One option is to run a child pipeline after each variable assignment,
i.e. variable -> Execute Pipeline
or
remove those variables and hard-code the variable expressions directly in the activity.
Hope this helps

How to receive root cause for Pipeline Dataflow job failure

I am running my pipeline in Dataflow. I want to collect all error messages from a Dataflow job using its ID. I am using Apache Beam 2.3.0 and Java 8.
DataflowPipelineJob dataflowPipelineJob = ((DataflowPipelineJob) entry.getValue());
String jobId = dataflowPipelineJob.getJobId();
DataflowClient client = DataflowClient.create(options);
Job job = client.getJob(jobId);
Is there any way to receive only the error messages from the pipeline?
Programmatic support for reading Dataflow log messages is not very mature, but there are a couple of options:
Since you already have the DataflowPipelineJob instance, you could use the waitUntilFinish() overload which accepts a JobMessagesHandler parameter to filter and capture error messages. You can see how DataflowPipelineJob uses this in its own waitUntilFinish() implementation.
Alternatively, you can query job logs using the Dataflow REST API: projects.jobs.messages/list. The API takes in a minimumImportance parameter which would allow you to query just for errors.
Note that in both cases, there may be error messages which are not fatal and don't directly cause job failure.
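For the REST option, a response from projects.jobs.messages/list that has already been fetched and decoded can be filtered like this. It is a sketch under assumptions: the field names jobMessages, messageText and messageImportance and the value JOB_MESSAGE_ERROR follow my reading of the API reference, the helper error_messages is hypothetical, and the sample response is invented for illustration.

```python
def error_messages(response):
    """Return the text of every error-level message in a decoded
    projects.jobs.messages/list response (a plain dict)."""
    return [
        message.get("messageText", "")
        for message in response.get("jobMessages", [])
        if message.get("messageImportance") == "JOB_MESSAGE_ERROR"
    ]


# Invented sample response, only to show the expected shape.
sample = {
    "jobMessages": [
        {"messageText": "Worker pool started.",
         "messageImportance": "JOB_MESSAGE_BASIC"},
        {"messageText": "Workflow failed.",
         "messageImportance": "JOB_MESSAGE_ERROR"},
    ]
}
print(error_messages(sample))  # ['Workflow failed.']
```

Passing minimumImportance=JOB_MESSAGE_ERROR in the request should make the server do this filtering, so the client-side check is mostly a safeguard.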

BigQuery loads manually but not through the Java SDK

I have a Dataflow pipeline, running locally. The objective is to read a JSON file using TextIO, build sessions, and load the result into BigQuery. Given the structure, I have to create a temp directory in GCS and then load the data into BigQuery from there. Previously I had a data schema error that prevented me from loading the data, see here. That issue is resolved.
So now when I run the pipeline locally, it ends by dumping a temporary newline-delimited JSON file into GCS. The SDK then gives me the following:
Starting BigQuery load job beam_job_xxxx_00001-1: try 1/3
INFO [main] (BigQueryIO.java:2191) - BigQuery load job failed: beam_job_xxxx_00001-1
...
Exception in thread "main" com.google.cloud.dataflow.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:187)
at pedesys.Dataflow.main(Dataflow.java:148)
Caused by: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.load(BigQueryIO.java:2198)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.processElement(BigQueryIO.java:2146)
The errors are not very descriptive and the data is still not loaded in BigQuery. What is puzzling is that if I go to the BigQuery UI and load the same temporary file from GCS that was dumped by the SDK's Dataflow pipeline manually, in the same table, it works beautifully.
The relevant code parts are as follows:
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BigQueryOptions.class)
.setTempLocation("gs://test/temp");
Pipeline p = Pipeline.create(options);
...
...
session_windowed_items.apply(ParDo.of(new FormatAsTableRowFn()))
.apply(BigQueryIO.Write
.named("loadJob")
.to("myproject:db.table")
.withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
);
The SDK is swallowing the error/exception and not reporting it to users. It's most likely a schema problem. To get the actual error that is happening you need to fetch the job details by either:
CLI - bq show -j beam_job_<xxxx>_00001-1
Browser/Web: use "try it" at the bottom of the page here.
@jkff has raised an issue here to improve the error reporting.
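Once the job details are available as JSON (for example from bq show with --format=prettyjson, or from the jobs.get API), the real error can be pulled out of the status block. This is a sketch that assumes the job resource carries status.errorResult and status.errors, each with a message field; load_job_errors is a hypothetical helper and the sample JSON is invented.

```python
import json


def load_job_errors(job_json):
    """Return the distinct error messages found in a BigQuery job resource.

    job_json is the JSON text of the job; an empty list means no error
    was recorded in its status block.
    """
    status = json.loads(job_json).get("status", {})
    messages = []
    error_result = status.get("errorResult")
    if error_result:
        messages.append(error_result.get("message", ""))
    for error in status.get("errors", []):
        text = error.get("message", "")
        if text not in messages:
            messages.append(text)
    return messages


# Invented example of a failed job's status, for illustration only.
sample_job = """
{
  "status": {
    "state": "DONE",
    "errorResult": {"reason": "invalid", "message": "Schema mismatch"},
    "errors": [
      {"reason": "invalid", "message": "Schema mismatch"},
      {"reason": "invalid", "message": "Too many errors encountered"}
    ]
  }
}
"""
print(load_job_errors(sample_job))
```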