Flink Table API streaming S3 sink throws SerializedThrowable exception - amazon-s3

I am trying to write a simple Table API S3 streaming sink (CSV format) using Flink 1.15.1, and I am hitting the following exception:
Caused by: org.apache.flink.util.SerializedThrowable: S3RecoverableFsDataOutputStream cannot sync state to S3. Use persist() to create a persistent recoverable intermediate point.
at org.apache.flink.fs.s3.common.utils.RefCountedBufferingFileStream.sync(RefCountedBufferingFileStream.java:111) ~[flink-s3-fs-hadoop-1.15.1.jar:1.15.1]
at org.apache.flink.fs.s3.common.writer.S3RecoverableFsDataOutputStream.sync(S3RecoverableFsDataOutputStream.java:129) ~[flink-s3-fs-hadoop-1.15.1.jar:1.15.1]
at org.apache.flink.formats.csv.CsvBulkWriter.finish(CsvBulkWriter.java:110) ~[flink-csv-1.15.1.jar:1.15.1]
at org.apache.flink.connector.file.table.FileSystemTableSink$ProjectionBulkFactory$1.finish(FileSystemTableSink.java:642) ~[flink-connector-files-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.BulkPartWriter.closeForCommit(BulkPartWriter.java:64) ~[flink-file-sink-common-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.closePartFile(Bucket.java:263) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.prepareBucketForCheckpointing(Bucket.java:305) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.onReceptionOfCheckpoint(Bucket.java:277) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.snapshotActiveBuckets(Buckets.java:270) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.snapshotState(Buckets.java:261) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSinkHelper.snapshotState(StreamingFileSinkHelper.java:87) ~[flink-streaming-java-1.15.1.jar:1.15.1]
at org.apache.flink.connector.file.table.stream.AbstractStreamingWriter.snapshotState(AbstractStreamingWriter.java:129) ~[flink-connector-files-1.15.1.jar:1.15.1]
In my setup I read from Kafka and write to S3 (s3a) using the Table API, with checkpoints configured on s3p (Presto). I also tried a simple datagen source instead of Kafka and got the same issue. As far as I can tell I am following the exact steps in the docs, and the exception above is not much help. It fails exactly when the code triggers a checkpoint, but beyond that I have no clue. Could someone please help me understand what I am missing here? I can't find any open issue with these logs.
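For reference, a minimal sketch of the kind of job described above, with a datagen source standing in for Kafka; the table names, bucket path and checkpoint interval are illustrative, not the exact failing job:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    public class KafkaToS3CsvJob {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is what eventually drives closePartFile() -> finish() -> sync() in the trace above.
        env.enableCheckpointing(60_000);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Datagen source standing in for the Kafka table; both show the same failure.
        tableEnv.executeSql(
            "CREATE TABLE source_table (id INT, message STRING) WITH ("
                + " 'connector' = 'datagen', 'rows-per-second' = '10')");

        // CSV filesystem sink on s3a; the bucket and path are illustrative.
        tableEnv.executeSql(
            "CREATE TABLE sink_table (id INT, message STRING) WITH ("
                + " 'connector' = 'filesystem',"
                + " 'path' = 's3a://my-bucket/output/',"
                + " 'format' = 'csv')");

        tableEnv.executeSql("INSERT INTO sink_table SELECT id, message FROM source_table").await();
      }
    }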

This looks like a bug. I've raised it here (after discussing it with the community). Sadly, I could not find any workaround for a Table API S3 CSV streaming sink. There is a similar issue here for the DataStream API, which does have a workaround.

Related

Looking for examples of the Airflow GCSToS3Operator. Thanks

I am trying to send a file from a GCS bucket to an S3 bucket using Airflow. I came across this article https://medium.com/apache-airflow/generic-airflow-transfers-made-easy-5fe8e5e7d2c2, but I am looking for specific code implementations and examples that also explain the requirements for this. I am a newbie to Airflow and GCP.
Astronomer is a good place to start; see the doc for GCSToS3Operator.
There you can find the dependencies, an explanation of each parameter, and links to examples.

Read Parquet data from an S3 bucket using NiFi

Hi guys!
I'm just starting to learn NiFi, so please bear with me; any help or guidance is welcome. I need to read Parquet data from an S3 bucket, and I don't understand how to set up the ListS3 and FetchS3Object processors to read the data.
The full path looks like this:
s3://inbox/prod/export/date=2022-01-07/user=100/
2022-01-09 06:51:23 23322557 cro.parquet
I"ll write data to sql database - I don"t have problems with it.
I tried to configure the lists3 processor myself and I think is not very good
bucket inbox
aws_access_key_id
aws_secret_access_key
region US EAST
endpoint override URL http://s3.wi-fi.ru:8080
What I would do is test the Access Key ID and Secret Access Key outside of NiFi to make sure they are working. If they are, then it's an issue with the NiFi configuration. If the key/ID pair isn't working, then getting new values that work and providing them to NiFi gives it a much better shot of working.
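For example, such a check outside of NiFi could look like the sketch below, using the AWS SDK for Java v2 with the bucket, prefix and endpoint override from the question; the access key and secret are placeholders:

    import java.net.URI;
    import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
    import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

    public class S3CredentialCheck {
      public static void main(String[] args) {
        // Same values as in the ListS3 processor; replace the placeholders with the real key and secret.
        S3Client s3 = S3Client.builder()
            .endpointOverride(URI.create("http://s3.wi-fi.ru:8080"))
            .region(Region.US_EAST_1)
            .credentialsProvider(StaticCredentialsProvider.create(
                AwsBasicCredentials.create("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY")))
            .build();

        // If the credentials and endpoint are good, this prints the Parquet keys under the prefix.
        s3.listObjectsV2(ListObjectsV2Request.builder()
                .bucket("inbox")
                .prefix("prod/export/date=2022-01-07/user=100/")
                .build())
            .contents()
            .forEach(obj -> System.out.println(obj.key() + "  " + obj.size()));
      }
    }

If this listing works but NiFi still fails, the problem is likely in the processor configuration (for a custom endpoint you may also need to enable path-style access).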

Flink: Is it possible, in the same Flink job, to read data from a Kafka topic (file names) and then read the files' content from Amazon S3?

I have a use case where I need to process data from files stored in S3 and write the processed data to local files.
New files are constantly added to the S3 bucket.
Each time a file is added to the bucket, its full path is published to a Kafka topic.
I want to achieve the following in a single job:
1. Read the file names from Kafka (unbounded stream).
2. An evaluator that receives a file name, reads its content from S3 (a second source) and creates a DataStream.
3. Process the DataStream (applying some logic to each row).
4. Sink to a file.
I managed to do the first, third and fourth parts of the design.
Is there a way to achieve this?
Thanks in advance.
I don't believe there's any straightforward way to do this.
To do everything in a single job, maybe you could convince the FileSource to use a custom FileEnumerator that gets the paths from Kafka.
A simpler alternative would be to launch a new (bounded) job for every file to be ingested. The file to be read could be passed in as a parameter.
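A minimal sketch of that simpler alternative, assuming the files are plain text and that the S3 path from the Kafka message is passed to the job as a program argument; the output path and per-row logic are placeholders:

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.connector.file.src.FileSource;
    import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SingleFileJob {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // The source is bounded, so run in batch mode and let the sink finalize its files at the end.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // args[0] is the S3 path taken from the Kafka message, e.g. s3a://bucket/some/file.csv
        FileSource<String> source = FileSource
            .forRecordStreamFormat(new TextLineInputFormat(), new Path(args[0]))
            .build();

        DataStream<String> lines = env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-file");

        // Placeholder per-row logic, followed by a local file sink.
        lines.map(String::toUpperCase)
            .sinkTo(FileSink.forRowFormat(new Path("file:///tmp/output"), new SimpleStringEncoder<String>()).build());

        env.execute("process-single-file");
      }
    }

A small driver could then consume the Kafka topic outside of Flink and submit this job once per path it receives.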
This is possible to implement in general, but as David Anderson has already suggested, there is currently no straightforward way to do this with the vanilla Flink connectors.
Another approach would be to write the pipeline in Apache Beam, which already supports this and can use Flink as a runner (which is proof that it can be implemented with the existing primitives).
I think this is a legitimate use case that Flink should eventually support out of the box.

"o76.getSource EXTERNAL table not supported" Error with AWS Glue custom connector with BigQuery

I was following this step-by-step guide to connect data from BigQuery to AWS Glue and store it in S3. Everything worked OK until I tried to run the job, which keeps failing with:
An error occurred while calling o76.getSource. The type of table {datasetName}.{table_name} is currently not supported: EXTERNAL
I can't seem to find any similar error online, and I can't find further helpful info in the log. It seems to be stuck on an issue with the BigQuery table. I followed exactly what the author did in the blog, using the key-value pairs to indicate the project ID and dataset/table (the image refers to the blog author's table name).
Does anybody know what's causing this?

Read messages from SQS into Dataflow

I've got a bunch of data being generated in AWS S3, with PUT notifications being sent to SQS whenever a new file arrives in S3. I'd like to load the contents of these files into BigQuery, so I'm working on setting up a simple ETL in Google Dataflow. However, I can't figure out how to integrate Dataflow with any service that it doesn't already support out of the box (Pubsub, Google Cloud Storage, etc.).
The GDF docs say:
In the initial release of Cloud Dataflow, extensibility for Read and Write transforms has not been implemented.
I think I can confirm this, as I tried to write a Read transform and wasn't able to figure out how to make it work (I tried to base an SqsIO class on the provided PubsubIO class).
So I've been looking at writing a custom source for Dataflow, but I can't wrap my head around how to adapt a Source to polling SQS for changes. It doesn't really seem like the right abstraction anyway, but I wouldn't really mind as long as I could get it working.
Additionally, it looks like I'd have to do some work to download the S3 files (I tried creating a Reader for that as well, with no luck, because of the reason mentioned above).
Basically, I'm stuck. Any suggestions for integrating SQS and S3 with Dataflow would be very appreciated.
The Dataflow Java SDK now includes an API for defining custom unbounded sources:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/UnboundedSource.java
This can be used to implement a custom SQS Source.
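To give an idea of its shape, here is a rough sketch of what an SQS source built on that API could look like, written against the current Apache Beam packaging of the same UnboundedSource API and using the AWS SDK v2 for the SQS calls. The class names are illustrative, and splitting, timestamps, watermarks and error handling are all heavily simplified; it is not production code.

    import java.io.Serializable;
    import java.util.*;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.SerializableCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.UnboundedSource;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.joda.time.Instant;
    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.*;

    // Rough sketch of an SQS source on top of the UnboundedSource API linked above
    // (shown with its current Apache Beam package names). Not production code.
    public class SqsUnboundedSource extends UnboundedSource<String, SqsUnboundedSource.SqsCheckpointMark> {

      private final String queueUrl; // hypothetical: the queue that receives the S3 PUT notifications

      public SqsUnboundedSource(String queueUrl) { this.queueUrl = queueUrl; }

      // SQS allows many consumers on one queue, so each split can simply be a copy of this source.
      @Override
      public List<SqsUnboundedSource> split(int desiredNumSplits, PipelineOptions options) {
        return Collections.nCopies(desiredNumSplits, this);
      }

      @Override
      public UnboundedReader<String> createReader(PipelineOptions options, SqsCheckpointMark checkpoint) {
        return new SqsReader(this);
      }

      @Override public Coder<SqsCheckpointMark> getCheckpointMarkCoder() { return SerializableCoder.of(SqsCheckpointMark.class); }
      @Override public Coder<String> getOutputCoder() { return StringUtf8Coder.of(); }

      /** Receipt handles read since the last checkpoint; deleting them acknowledges the messages. */
      public static class SqsCheckpointMark implements UnboundedSource.CheckpointMark, Serializable {
        private final String queueUrl;
        private final List<String> receiptHandles;

        SqsCheckpointMark(String queueUrl, List<String> receiptHandles) {
          this.queueUrl = queueUrl;
          this.receiptHandles = receiptHandles;
        }

        @Override
        public void finalizeCheckpoint() {
          try (SqsClient sqs = SqsClient.create()) {
            for (String handle : receiptHandles) {
              sqs.deleteMessage(DeleteMessageRequest.builder().queueUrl(queueUrl).receiptHandle(handle).build());
            }
          }
        }
      }

      private static class SqsReader extends UnboundedSource.UnboundedReader<String> {
        private final SqsUnboundedSource source;
        private final SqsClient sqs = SqsClient.create();
        private final Queue<Message> buffer = new ArrayDeque<>();
        private final List<String> pendingHandles = new ArrayList<>();
        private Message current;

        SqsReader(SqsUnboundedSource source) { this.source = source; }

        @Override public boolean start() { return advance(); }

        @Override
        public boolean advance() {
          if (buffer.isEmpty()) {
            buffer.addAll(sqs.receiveMessage(ReceiveMessageRequest.builder()
                .queueUrl(source.queueUrl).maxNumberOfMessages(10).build()).messages());
          }
          current = buffer.poll();
          if (current == null) return false;
          pendingHandles.add(current.receiptHandle());
          return true;
        }

        @Override
        public String getCurrent() {
          if (current == null) throw new NoSuchElementException();
          return current.body(); // the S3 event notification JSON (bucket + key of the new file)
        }

        // Simplifications: a real reader would use the SentTimestamp attribute and a proper watermark.
        @Override public Instant getCurrentTimestamp() { return Instant.now(); }
        @Override public Instant getWatermark() { return Instant.now(); }

        @Override
        public UnboundedSource.CheckpointMark getCheckpointMark() {
          SqsCheckpointMark mark = new SqsCheckpointMark(source.queueUrl, new ArrayList<>(pendingHandles));
          pendingHandles.clear();
          return mark;
        }

        @Override public UnboundedSource<String, ?> getCurrentSource() { return source; }
        @Override public void close() { sqs.close(); }
      }
    }

The checkpoint mark carries the receipt handles of the messages emitted since the last checkpoint, so the messages are only deleted from the queue once Dataflow has durably committed that checkpoint.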