Apache Flink: Reading Parquet files from S3 in the DataStream API

We have several external jobs producing small (500MiB) parquet objects on S3 partitioned by time. The goal is to create an application that would read those files, join them on a specific key and dump the result into a Kinesis stream or another S3 bucket.
Can this be achieved with Flink alone? Can it monitor for new S3 objects as they are created and load them into the application?

The newer FileSource class (available in recent Flink versions) supports monitoring a directory for new/modified files. See FileSource.forBulkFileFormat() in particular for reading Parquet files.
You use the FileSourceBuilder returned by that method, and then .monitorContinuously(Duration.ofHours(1)) (or whatever interval makes sense).
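A minimal sketch of that setup, assuming the Avro-backed Parquet readers from the flink-parquet module and a hypothetical bucket/prefix (the forBulkFileFormat() builder is configured the same way):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.time.Duration;

public class ParquetS3SourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Avro schema describing the Parquet records (hypothetical fields).
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                        + "{\"name\":\"key\",\"type\":\"string\"},"
                        + "{\"name\":\"value\",\"type\":\"long\"}]}");

        // Hypothetical bucket/prefix; the source re-scans it for new objects every hour.
        FileSource<GenericRecord> source = FileSource
                .forRecordStreamFormat(AvroParquetReaders.forGenericRecord(schema),
                        new Path("s3://my-bucket/events/"))
                .monitorContinuously(Duration.ofHours(1))
                .build();

        DataStream<GenericRecord> events =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "parquet-s3-source");

        events.print();
        env.execute("parquet-s3-monitoring-sketch");
    }
}

With monitorContinuously the source is unbounded and keeps picking up new objects under the prefix; without it the read is a one-shot bounded job.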

Related

Read S3 file based on the path that comes in Kafka - Apache Flink

I have a pipeline that listens to a Kafka topic that receives the s3 file-name & path. The pipeline has to read the file from S3 and do some transformation & aggregation.
I see that Flink has support for reading an S3 file directly as a source connector, but this use case needs the file to be read as part of the transformation stage.
I don't believe this is currently possible.
An alternative might be to keep a Flink session cluster running, and dynamically create and submit a new Flink SQL job running in batch mode to handle the ingestion of each file.
Another approach you might be tempted by would be to implement a RichFlatMapFunction that accepts the path as input, reads the file, and emits its records one by one (a sketch of that approach follows). But this is unlikely to work well unless the files are rather small, because Flink really doesn't like user functions that run for long periods of time (among other things, a long-running call holds up checkpointing).
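For completeness, a minimal sketch of that flat-map approach, assuming the incoming records are S3 paths and the files are small, line-oriented text (names are hypothetical):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.util.Collector;

// Reads the whole file whose path arrives on the input stream and emits it line by line.
// Each flatMap call blocks until the file is fully read, which is why this only suits small files.
public class FileContentsFlatMap extends RichFlatMapFunction<String, String> {
    @Override
    public void flatMap(String s3Path, Collector<String> out) throws Exception {
        Path path = new Path(s3Path);               // e.g. "s3://bucket/key" taken from the Kafka record
        FileSystem fs = path.getFileSystem();       // resolves to Flink's S3 filesystem if it is configured
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.collect(line);
            }
        }
    }
}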

aws s3 sync cli ignoring multipart upload config when syncing between buckets

I'm trying to sync a large number of files from one bucket to another; some of the files are up to 2GB in size. After using the AWS CLI's s3 sync command like so
aws s3 sync s3://bucket/folder/folder s3://destination-bucket/folder/folder
and verifying the files that had been transferred, it became clear that the large files had lost the metadata that was present on the original files in the source bucket.
This is a "known" issue with larger files where s3 switches to multipart upload to handled the transfer.
This multipart handeling can be configured via the .aws/config file which has been done like so
[default]
s3 =
  multipart_threshold = 4500MB
However, when testing the transfer again, the metadata on the larger files is still not present. It is present on the smaller files, so it's clear that I'm hitting the multipart upload issue.
Given this is an S3-to-S3 transfer, is the local s3 configuration taken into consideration at all?
As an alternative, is there a way to just sync the metadata now that all the files have been transferred?
I have also tried aws s3 cp, with no luck either.
You could use Cross/Same-Region Replication to copy the objects to another Amazon S3 bucket.
However, only newly added objects will copy between the buckets. You can, however, trigger the copy by copying the objects onto themselves. I'd recommend you test this on a separate bucket first, to make sure you don't accidentally lose any of the metadata.
The method suggested seems rather complex: Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena | AWS Big Data Blog
The final option would be to write your own code to copy the objects, and copy the metadata at the same time.
Or, you could write a script that compares the two buckets to see which objects did not get their correct metadata, and have it just update the metadata on the target object. This actually involves copying the object to itself, while specifying the metadata. This is probably easier than copying ALL objects yourself, since it only needs to 'fix' the ones that didn't get their metadata.
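For a single object, that in-place copy can be done with the CLI along these lines (bucket, key, and metadata values here are placeholders):
aws s3 cp s3://destination-bucket/folder/file s3://destination-bucket/folder/file --metadata-directive REPLACE --metadata some-key=some-value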
Finally managed to implement a solution for this and took the opportunity to play around with the Serverless framework and Step Functions.
The general flow I went with was:
A Step Function is triggered using a CloudWatch Event Rule targeting S3 events of the type 'CompleteMultipartUpload', as the metadata is only ever missing on S3 objects that had to be transferred using a multipart process.
The initial task on the Step Function checks if all the required metadata is present on the object that raised the event.
If it is present, then the Step Function is finished.
If it is not present, then the second Lambda task is fired, which copies all metadata from the source object to the destination object (a rough sketch of that copy follows this answer).
This could be achieved without Step Functions, but it was a good, simple exercise to give them a go. The first 'Check Meta' task is actually redundant, as the metadata is never present if a multipart transfer is used; I was originally also triggering off PutObject and CopyObject as well, which is why I had the Check Meta task.
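The Lambda code itself wasn't shared; a rough sketch of the copy step using a recent AWS SDK for Java v2 might look like this (bucket and key names are placeholders, and it assumes the source object still carries the correct metadata):

import java.util.Map;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CopyObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.MetadataDirective;

public class CopyMetadataSketch {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            String sourceBucket = "bucket";                   // placeholder
            String destinationBucket = "destination-bucket";  // placeholder
            String key = "folder/folder/large-file";          // placeholder

            // Read the user metadata from the original object in the source bucket.
            Map<String, String> metadata = s3.headObject(HeadObjectRequest.builder()
                    .bucket(sourceBucket)
                    .key(key)
                    .build())
                    .metadata();

            // Copy the destination object onto itself, replacing its metadata.
            // (Single-request CopyObject is fine here; it is limited to objects up to 5 GB.)
            s3.copyObject(CopyObjectRequest.builder()
                    .sourceBucket(destinationBucket)
                    .sourceKey(key)
                    .destinationBucket(destinationBucket)
                    .destinationKey(key)
                    .metadataDirective(MetadataDirective.REPLACE)
                    .metadata(metadata)
                    .build());
        }
    }
}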

Convert Avro in Kafka to Parquet directly into S3

I have topics in Kafka that are stored in Avro format. I would like to consume an entire topic (which, at the time of consumption, will not have any new or changed messages) and convert it into Parquet, saving directly to S3.
I currently do this, but it requires consuming the messages from Kafka one at a time and processing them on a local machine, converting them to a Parquet file, and, once the entire topic is consumed and the Parquet file completely written, closing the writer and then initiating an S3 multipart file upload. In short: | Avro in Kafka -> convert to Parquet locally -> copy file to S3 |.
What I'd like to do instead is | Avro in Kafka -> parquet in S3 |
One of the caveats is that the Kafka topic name isn't static; it needs to be fed in as an argument, used once, and then never used again.
I've looked into Alpakka and it seems like it might be possible - but it's unclear, I haven't seen any examples. Any suggestions?
You just described Kafka Connect :)
Kafka Connect is part of Apache Kafka, and the S3 sink connector is available as a plugin, although at the moment Parquet support for it is still in development.
For a primer in Kafka Connect see http://rmoff.dev/ksldn19-kafka-connect
Try adding "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat" to your PUT request when you set up your connector (a rough example of the full config follows below).
You can find more details here.
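For reference, the body of that PUT (to /connectors/<name>/config) could look roughly like this; topic, bucket, region and Schema Registry URL are placeholders, and ParquetFormat needs a schema-aware converter such as the Avro converter:

{
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "topics": "my-avro-topic",
  "s3.bucket.name": "my-bucket",
  "s3.region": "us-east-1",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081",
  "flush.size": "1000",
  "tasks.max": "1"
}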

Spark to process many tar.gz files from s3

I have many files in the format log-.tar.gz in S3. I would like to process them (extract a field from each line) and store the result in a new file.
There are many ways we can do this. One simple and convenient method is to access the files using textFile method.
// Read files from S3
val rdd = sc.textFile("s3://bucket/project_name/date_folder/logfile1.*.gz")
I am concerned about the memory limits of the cluster; this way, the master node will be overloaded. Is there any rough estimate for the size of the files that can be processed by a given type of cluster?
I am wondering if there is a way to parallelize the process of getting the *.gz files from s3 as they are already grouped by date.
With the exception of parallelize / makeRDD, all methods that create RDDs / DataFrames require the data to be accessible from all workers, and they are executed in parallel without loading the data onto the driver.
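A minimal sketch of that, reading a whole date-partitioned prefix in one job (the paths and the extracted field are hypothetical, and textFile handles the gzip compression but not tar archives inside it):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3LogFieldExtractSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("s3-log-field-extract");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // A glob across all date folders: each matched object is read by the executors,
            // not pulled through the driver/master node.
            JavaRDD<String> lines = sc.textFile("s3://bucket/project_name/*/logfile*.gz");

            // Extract one field per line (here: the second whitespace-separated token, as an example).
            JavaRDD<String> fields = lines
                    .filter(line -> line.split("\\s+").length > 1)
                    .map(line -> line.split("\\s+")[1]);

            fields.saveAsTextFile("s3://bucket/project_name/output/");
        }
    }
}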

How map-reduce works on HDFS vs S3?

I have been trying to understand how different a map-reduce job is executed on HDFS vs S3. Can someone please address my questions:
Typically HDFS clusters are not only storage-oriented but also have the horsepower to execute MR jobs; that is why jobs are mapped on several data nodes and reduced on a few. To be exact, the mapping (filter etc.) is done on the data locally, whereas the reducing (aggregation) is done on a common node.
Does this approach work as-is on S3? As far as I understand, S3 is just a data store. Does Hadoop have to copy the WHOLE data set from S3 and then run map (filter) and reduce (aggregation) locally, or does it follow exactly the same approach as HDFS? If the former is true, running jobs on S3 could be slower than running them on HDFS (due to copying overhead).
Please share your thoughts.
S3 is slower than HDFS, but it provides other features like bucket versioning, elasticity, and other data-recovery schemes (Netflix runs Hadoop clusters backed by S3).
Theoretically, before the split computation, the sizes of the input files need to be determined, so Hadoop itself has a filesystem implementation on top of S3 which allows higher layers to be agnostic of the source of the data. MapReduce calls the generic file-listing API against each input directory to get the sizes of all files in the directory.
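To illustrate that abstraction, the same listing call works against HDFS or S3 depending only on the URI scheme; a small sketch with a hypothetical path (using the s3a connector):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListInputSketch {
    public static void main(String[] args) throws Exception {
        // The URI scheme (hdfs:// or s3a://) selects the FileSystem implementation; callers stay agnostic.
        Path input = new Path("s3a://bucket/input/");   // hypothetical; an hdfs:// path behaves the same
        FileSystem fs = input.getFileSystem(new Configuration());
        for (FileStatus status : fs.listStatus(input)) {
            // These sizes are exactly what the split computation needs.
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }
    }
}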
Amazon EMR has a special version of the S3 filesystem that can stream data directly to S3 instead of buffering to intermediate local files, which can make it faster on EMR.
If you have a Hadoop cluster in EC2 and you run a MapReduce job over S3 data, yes the data will be streamed into the cluster in order to run the job. As you say, S3 is just a data store, so you can not bring the computation to the data. These non-local reads could cause a bottleneck on processing large jobs, depending on the size of the data and the size of the cluster.