Flink streaming application writing to S3 in Parquet format - amazon-s3

I am developing a Flink streaming application which consumes messages from Kafka/Kinesis and, after processing them, has to write the output to S3 in Parquet format every, say, 5 minutes.
Kindly suggest an approach to achieve this, as I am facing lots of issues.
Currently I am using Flink 1.4.2, as I am planning to deploy it on an AWS EMR 5.15 cluster.
Approach already tried - I have used the "flink-s3-fs-hadoop" module, the Parquet API, and Flink's "BucketingSink".

Related

Do I need an S3 bucket for Python ETL scripts that run as AWS Batch jobs for a Splunk Forwarder?

I am trying to deploy (in CDK) scheduled Python ETL scripts as Batch jobs (Fargate?) to parse data from AWS and other tools we utilize. A Splunk Forwarder consumes this data and sends it to our Splunk Index. Am I going to need an S3 bucket for the output of logs from my ETL scripts? How can I deploy the Splunk Forwarder alongside these scripts?
There are about 5-6 scripts that I would like to deploy via CDK.
AWS Batch jobs can send STDERR and STDOUT to CloudWatch Logs. Depending on how logging is configured in your Python scripts, that may be the easy answer. If logging is configured to write to a file, then yes, I would recommend you upload the file to S3 after the ETL is finished.
Output from the scripts (the ETL results) will need to land someplace, and S3 is a great choice for that. Your Splunk Forwarder can be set up to monitor the bucket for new data and ingest it. If the scripts send data directly to the forwarder you should not need an S3 bucket, but I personally would recommend that you decouple the ETL data from the ingestion of the result into Splunk.
Splunk Forwarders (stable servers) would be deployed separately from AWS Batch resources.
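As a rough sketch of that decoupling (assuming the ETL result is written to a local file first, and with the bucket and key names as placeholders), the upload step with boto3 could look like this:

```python
import logging
from pathlib import Path

import boto3  # assumes the Batch job role has s3:PutObject on the bucket

# Log to STDOUT so AWS Batch forwards it to CloudWatch Logs automatically.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")


def upload_results(local_path: str, bucket: str, key: str) -> None:
    """Upload one ETL output file to S3 for the Splunk Forwarder to pick up."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)
    logger.info("Uploaded %s to s3://%s/%s", local_path, bucket, key)


if __name__ == "__main__":
    result_file = Path("etl_output.json")  # hypothetical output of one script
    # ... run the ETL here and write result_file ...
    upload_results(str(result_file), "my-etl-results-bucket", f"etl/{result_file.name}")
```

The forwarder then only needs to watch the bucket (or receive S3 event notifications); it never has to run on the Batch compute environment itself.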

Send data from kafka to s3 using python

For my current project, I am working with Kafka (Python) and wanted to know if there is any method by which I can send the streaming Kafka data to an AWS S3 bucket (without using Confluent). I am getting my source data from the Reddit API.
I also wanted to know whether Kafka + S3 is a good combination for storing the data that will be processed using PySpark, or whether I should skip the S3 step and read the data directly from Kafka.
The Kafka Connect S3 sink connector doesn't require "using Confluent". It's completely free, open source, and works with any Apache Kafka cluster.
Otherwise, sure, Spark or a plain Kafka Python consumer can write events to S3, but you've not clearly explained what happens once the data is in S3, so maybe start with processing the data directly from Kafka.
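A minimal sketch of the plain-consumer route, assuming kafka-python and boto3, with the topic, broker, and bucket names as placeholders:

```python
import json
import uuid

import boto3
from kafka import KafkaConsumer  # pip install kafka-python

# Topic, broker, and bucket names below are placeholders.
consumer = KafkaConsumer(
    "reddit-posts",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
s3 = boto3.client("s3")

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 100:  # flush every 100 records; tune for your volume
        key = f"reddit/raw/{uuid.uuid4()}.json"
        s3.put_object(
            Bucket="my-kafka-landing-bucket",
            Key=key,
            Body="\n".join(json.dumps(r) for r in batch).encode("utf-8"),
        )
        batch = []
```

Batching before each put_object matters: writing one tiny S3 object per Kafka record is both slow and expensive to read back from PySpark later.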

How to run the map reduce jobs on EMRserverless?

Based on the documentation, Amazon EMR Serverless seems to accept only Spark and Hive as job drivers. Is there any support for a custom Hadoop jar for MapReduce jobs on Serverless, similar to EMR?
That's correct. EMR Serverless currently only supports Spark and Hive jobs, so no MapReduce.
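For reference, a job is submitted to EMR Serverless through a sparkSubmit (or hive) job driver, and there is no driver type for a plain MapReduce jar. A minimal boto3 sketch, with the application ID, role ARN, and S3 path as placeholders:

```python
import boto3

client = boto3.client("emr-serverless")

# Only 'sparkSubmit' and 'hive' job drivers are accepted; there is no
# driver type for a custom Hadoop MapReduce jar.
response = client.start_job_run(
    applicationId="00example1234567",  # placeholder EMR Serverless application ID
    executionRoleArn="arn:aws:iam::123456789012:role/EmrServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/etl_job.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
)
print(response["jobRunId"])
```

If you need to keep an existing MapReduce jar as-is, a regular (or transient) EMR cluster is still the place to run it.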

Scheduling over different AWS Components - Glue and EMR

I was wondering how I would tackle the following on AWS - or whether it is even possible?
Transient EMR Cluster for some bulk Spark processing
When that cluster terminates, then and only then use a Glue Job to do some limited processing
I am not convinced AWS Glue Triggers will help across these environments.
Or could one say, well, just keep everything in the EMR cluster - is this not a good use case? Glue can write to SAP HANA with the appropriate connector, and loading Redshift via a Glue job with Redshift Spectrum is a common use case.
You can use "Run a job" service integration using AWS Step Functions. Step functions supports both EMR and Glue integration.
Please refer to the link for details.
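A rough sketch of that pattern: the .sync resource ARNs make Step Functions wait for the EMR step, and only then start the Glue job. The cluster ID, job names, paths, and role ARNs below are placeholders:

```python
import json

import boto3

# Hypothetical two-state machine: run an EMR step, and only when it has
# finished, start a Glue job. The .sync suffix makes each Task state wait
# for the underlying job to complete before transitioning.
definition = {
    "StartAt": "RunEmrStep",
    "States": {
        "RunEmrStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId": "j-EXAMPLECLUSTER",  # placeholder cluster ID
                "Step": {
                    "Name": "bulk-spark-processing",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", "s3://my-bucket/jobs/bulk_job.py"],
                    },
                },
            },
            "Next": "RunGlueJob",
        },
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "limited-post-processing"},  # placeholder Glue job
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="emr-then-glue",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
)
```

For a truly transient cluster, the same workflow can start with the elasticmapreduce:createCluster.sync integration and end with terminateCluster, so the Glue job only ever runs after the EMR work has completed.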
Having spoken to Amazon about this, they indicate that Airflow via MWAA is now the preferred option.

Delta Table transactional guarantees when loading using Autoloader from AWS S3 to Azure Datalake

Trying to use Autoloader where AWS S3 is the source and the Delta lake is in Azure Data Lake Gen. When I try to read files, I get the following error:
Writing to Delta table on AWS from non-AWS is unsafe in terms of providing transactional guarantees. If you can guarantee that no one else will be concurrently modifying the same Delta table, you may turn this check off by setting the SparkConf: "spark.databricks.delta.logStore.crossCloud.fatal" to false when launching your cluster.
Tried setting this up at the cluster level and it works fine. My question is: is there any way we can ensure transactional guarantees while loading data from AWS S3 to Azure Data Lake (the Data Lake is the backend storage for our Delta Lake)? We don't want to set "spark.databricks.delta.logStore.crossCloud.fatal" at the cluster level. Will there be any issue if we do, and will it be a good solution for a production ETL pipeline?
This warning appears when Databricks detects that you're doing multi-cloud work.
But this warning is for the case when you're writing into AWS S3 using Delta, because S3 doesn't have an atomic write operation (like put-if-absent), so it requires a coordinator process that is only available on AWS.
In your case you can ignore this message, because you're just reading from AWS S3 and writing into a Delta table that lives on Azure Data Lake.
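For context, a minimal Auto Loader sketch of that pattern on Databricks, with all paths (and the assumed JSON input format) as placeholders; the cross-cloud warning concerns the Delta write side, which here stays entirely on Azure:

```python
# Databricks Auto Loader sketch: read files landing in S3, write a Delta
# table that lives on Azure Data Lake. All paths below are placeholders,
# and `spark` is the SparkSession Databricks provides in a notebook/job.
source_path = "s3://my-source-bucket/incoming/"
target_path = "abfss://lake@mystorageaccount.dfs.core.windows.net/delta/events"
checkpoint_path = "abfss://lake@mystorageaccount.dfs.core.windows.net/_checkpoints/events"

stream = (
    spark.readStream
    .format("cloudFiles")                              # Auto Loader source
    .option("cloudFiles.format", "json")               # assumed format of the incoming files
    .option("cloudFiles.schemaLocation", checkpoint_path)  # enables schema inference/tracking
    .load(source_path)
)

(
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(once=True)  # or a processingTime trigger for continuous runs
    .start(target_path)
)
```

Since both the checkpoint and the Delta transaction log live on ADLS, the S3 side is read-only and the warning about non-atomic S3 writes does not apply to this job.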