Apache Flink Stateful Reading File From S3 - amazon-s3

I have a Flink batch job that reads a very large Parquet file from S3 and then sinks JSON into a Kafka topic.
The problem is: how can I make the file reading process stateful? That is, whenever the job is interrupted or crashes, it should resume from its previous reading state. I don't want to send duplicate items to Kafka when the job restarts.
Here is my example code:
val env = ExecutionEnvironment.getExecutionEnvironment
val input = Parquet.input[User](new Path(s"s3a://path"))
env.createInput(input)
  .filter(r => Option(r.token).getOrElse("").nonEmpty)

Related

Get Flink FileSystem filename for secondary sink

This is for Flink on KDA (Kinesis Data Analytics), so it is limited to version 1.13.
The desired flow of the application:
ingest from a Kinesis/Kafka source
sink to S3
after the S3 sink, get the object key/filename and publish it to Kafka (to inform other processes that a file is ready to be examined)
The first 2 steps are simple enough. The 3rd is the tricky one.
From my understanding, the S3 object key will be made up of the following:
bucket partition (let's assume the default here of processing date)
filename. The filename is made up of + + + . It then goes through a series of changes - in-progress, pending and final.
The file will be in a finished state when a checkpoint occurs and all pending files are moved to finished.
What I would like is to use this information as a trigger for a Kafka publish.
On checkpoint, give me a list of all the files (object keys) that have moved to a finished state. These can then be put on a Kafka topic.
Is this possible?
Thanks in advance

Bigquery Data Transfer from S3 intermittent success

When using BigQuery Data Transfer to move data into BigQuery from S3, I get intermittent success (I've actually only seen it work correctly one time).
The success:
6:00:48 PM Summary: succeeded 1 jobs, failed 0 jobs.
6:00:14 PM Job bqts_5f*** (table test_json_data) completed successfully. Number of records: 516356, with errors: 0.
5:59:13 PM Job bqts_5f*** (table test_json_data) started.
5:59:12 PM Processing files from Amazon S3 matching: "s3://bucket-name/*.json"
5:59:12 PM Moving data from Amazon S3 to Google Cloud complete: Moved 2661 object(s).
5:58:50 PM Starting transfer from Amazon S3 for files with prefix: "s3://bucket-name/"
5:58:49 PM Starting transfer from Amazon S3 for files modified before 2020-07-27T16:48:49-07:00 (exclusive).
5:58:49 PM Transfer load date: 20200727
5:58:48 PM Dispatched run to data source with id 138***3616
The usual outcome, though, is just 0 successes and 0 failures, like the following:
8:33:13 PM Summary: succeeded 0 jobs, failed 0 jobs.
8:32:38 PM Processing files from Amazon S3 matching: "s3://bucket-name/*.json"
8:32:38 PM Moving data from Amazon S3 to Google Cloud complete: Moved 3468 object(s).
8:32:14 PM Starting transfer from Amazon S3 for files with prefix: "s3://bucket-name/"
8:32:14 PM Starting transfer from Amazon S3 for files modified between 2020-07-27T16:48:49-07:00 and 2020-07-27T19:22:14-07:00 (exclusive).
8:32:13 PM Transfer load date: 20200728
8:32:13 PM Dispatched run to data source with id 13***0415
What might be going on such that the second log above doesn't have the Job bqts... run? Is there somewhere I can get more details about these data transfer jobs? I had a different job that ran into a JSON error, so I don't believe it was that.
Thanks!
I was a bit confused by the logging, since it does find and move the objects.
I believe I misread the docs. I had previously thought that an Amazon URI of s3://bucket-name/*.json would crawl the directory for the JSON files. Even though the message above seems to indicate that it does, only files at the top level of the bucket are loaded into BigQuery (for the s3://bucket-name/*.json URI).

How to receive root cause for Pipeline Dataflow job failure

I am running my pipeline in Dataflow. I want to collect all error messages from a Dataflow job using its ID. I am using Apache Beam 2.3.0 and Java 8.
DataflowPipelineJob dataflowPipelineJob = ((DataflowPipelineJob) entry.getValue());
String jobId = dataflowPipelineJob.getJobId();
DataflowClient client = DataflowClient.create(options);
Job job = client.getJob(jobId);
Is there any way to receive only the error messages from the pipeline?
Programmatic support for reading Dataflow log messages is not very mature, but there are a couple of options:
Since you already have the DataflowPipelineJob instance, you could use the waitUntilFinish() overload which accepts a JobMessagesHandler parameter to filter and capture error messages. You can see how DataflowPipelineJob uses this in its own waitUntilFinish() implementation.
Alternatively, you can query job logs using the Dataflow REST API: projects.jobs.messages/list. The API takes in a minimumImportance parameter which would allow you to query just for errors.
Note that in both cases, there may be error messages which are not fatal and don't directly cause job failure.
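For the REST option, a minimal Java sketch of the request URL might look like the following. It assumes the public v1b3 endpoint and the JOB_MESSAGE_ERROR importance level; the class name, project ID and job ID are hypothetical placeholders, and authentication (an OAuth 2.0 bearer token on the GET request) is left out.

```java
import java.net.URI;

class DataflowErrorMessagesUrl {
    // Base URL of the Dataflow REST API (v1b3).
    static final String BASE = "https://dataflow.googleapis.com/v1b3";

    // Builds the projects.jobs.messages.list URL, restricted to error-level messages.
    static URI errorMessagesUri(String projectId, String jobId) {
        return URI.create(BASE + "/projects/" + projectId + "/jobs/" + jobId
                + "/messages?minimumImportance=JOB_MESSAGE_ERROR");
    }

    public static void main(String[] args) {
        // "my-project" and "my-job-id" are placeholders, not values from the question.
        System.out.println(errorMessagesUri("my-project", "my-job-id"));
    }
}
```

The job ID passed in would be the value returned by dataflowPipelineJob.getJobId() in the question's code.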

BigQuery loads manually but not through the Java SDK

I have a Dataflow pipeline, running locally. The objective is to read a JSON file using TextIO, make sessions and load them into BigQuery. Given the structure, I have to create a temp directory in GCS and then load the data into BigQuery from there. Previously I had a data schema error that prevented me from loading the data, see here. That issue is resolved.
So now when I run the pipeline locally, it ends by dumping a temporary newline-delimited JSON file into GCS. The SDK then gives me the following:
Starting BigQuery load job beam_job_xxxx_00001-1: try 1/3
INFO [main] (BigQueryIO.java:2191) - BigQuery load job failed: beam_job_xxxx_00001-1
...
Exception in thread "main" com.google.cloud.dataflow.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:187)
at pedesys.Dataflow.main(Dataflow.java:148)
Caused by: java.lang.RuntimeException: Failed to create the load job beam_job_xxxx_00001, reached max retries: 3
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.load(BigQueryIO.java:2198)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$WriteTables.processElement(BigQueryIO.java:2146)
The errors are not very descriptive and the data is still not loaded in BigQuery. What is puzzling is that if I go to the BigQuery UI and load the same temporary file from GCS that was dumped by the SDK's Dataflow pipeline manually, in the same table, it works beautifully.
The relevant code parts are as follows:
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BigQueryOptions.class)
    .setTempLocation("gs://test/temp");
Pipeline p = Pipeline.create(options);
...
...
session_windowed_items.apply(ParDo.of(new FormatAsTableRowFn()))
    .apply(BigQueryIO.Write
        .named("loadJob")
        .to("myproject:db.table")
        .withSchema(schema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    );
The SDK is swallowing the error/exception and not reporting it to users. It's most likely a schema problem. To get the actual error that is happening you need to fetch the job details by either:
CLI - bq show -j beam_job_<xxxx>_00001-1
Browser/Web: use "try it" at the bottom of the page here.
@jkff has raised an issue here to improve the error reporting.

Spark execution occasionally gets stuck at mapPartitions at Exchange.scala:44

I am running a Spark job on a two node standalone cluster (v 1.0.1).
Spark execution often gets stuck at the task mapPartitions at Exchange.scala:44.
This happens at the final stage of my job in a call to saveAsTextFile (as I expect from Spark's lazy execution).
It is hard to diagnose the problem because I never experience it in local mode with local IO paths, and occasionally the job on the cluster does complete as expected with the correct output (same output as with local mode).
This seems possibly related to reading from s3 (of a ~170MB file) immediately prior, as I see the following logging in the console:
DEBUG NativeS3FileSystem - getFileStatus returning 'file' for key '[PATH_REMOVED].avro'
INFO FileInputFormat - Total input paths to process : 1
DEBUG FileInputFormat - Total # of splits: 3
...
INFO DAGScheduler - Submitting 3 missing tasks from Stage 32 (MapPartitionsRDD[96] at mapPartitions at Exchange.scala:44)
DEBUG DAGScheduler - New pending tasks: Set(ShuffleMapTask(32, 0), ShuffleMapTask(32, 1), ShuffleMapTask(32, 2))
The last logging I see before the task apparently hangs/gets stuck is:
INFO NativeS3FileSystem: INFO NativeS3FileSystem: Opening key '[PATH_REMOVED].avro' for reading at position '67108864'
Has anyone else experienced non-deterministic problems related to reading from S3 in Spark?
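One observation on the logs above: the position 67108864 in the last log line is exactly 64 MiB (64 * 1024 * 1024), which is where the second of three 64 MiB splits of a ~170MB file would begin. The Java sketch below works through that arithmetic. The 64 MiB split size and the exact file size are assumptions for illustration (neither is stated in the logs), the class and method names are hypothetical, and the sketch ignores the slack factor that Hadoop's FileInputFormat applies to the final split.

```java
import java.util.ArrayList;
import java.util.List;

class SplitOffsets {
    // Start offsets of fixed-size splits covering a file of the given length.
    static List<Long> splitOffsets(long fileSize, long splitSize) {
        List<Long> offsets = new ArrayList<>();
        for (long pos = 0; pos < fileSize; pos += splitSize) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        long splitSize = 64L * 1024 * 1024;  // assumed split size: 64 MiB = 67108864 bytes
        long fileSize = 170L * 1024 * 1024;  // assumed size for the "~170MB" file
        System.out.println(splitOffsets(fileSize, splitSize));
    }
}
```

Under these assumptions the file yields three splits, matching "Total # of splits: 3", and the hang would coincide with a task opening the second split.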