I am new to Flink and I am trying to write a simple streaming job with exactly-once semantics that reads from Kafka and writes the data to S3. By "exactly once" I mean that I don't want to end up with duplicates if an intermediate failure occurs between writing to S3 and the commit of the file sink operator. I am using Kafka v2.5.0, and according to the connector described in this page, I am guessing my use case will end up with exactly-once behavior.
Questions:
1) Is my assumption correct that my use case will have exactly-once semantics even if a failure occurs in any part of the pipeline, so that I can say my S3 files won't have duplicate records?
2) How does Flink handle exactly-once with S3? The documentation says it uses multipart uploads to get exactly-once semantics, but how is that handled internally? Say the task fails after the S3 multipart upload has succeeded but before the operator commits; in that case, once the operator restarts, will it stream the data that was already written to S3 again, i.e. will there be duplicates?
If you read from Kafka and then write to S3 with the StreamingFileSink you should indeed be able to get exactly-once.
Though it is not specifically about S3, this article gives a nice explanation on how to ensure exactly once in general.
https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
My key takeaway: After a failure we must always be able to see where we stand from the perspective of the sink.
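For reference, here is a minimal PyFlink sketch of that setup. The topic, broker address and bucket path are placeholders, the Kafka connector jar has to be available to the job, and the exact connector classes vary by Flink version, so treat it as an outline rather than a drop-in job:

from pyflink.common.serialization import SimpleStringSchema, Encoder
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode
from pyflink.datastream.connectors import FlinkKafkaConsumer, StreamingFileSink

env = StreamExecutionEnvironment.get_execution_environment()

# Exactly-once needs checkpointing: the file sink only finalizes its
# in-progress (multipart) uploads when a checkpoint completes.
env.enable_checkpointing(60000)  # every 60 seconds
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)

consumer = FlinkKafkaConsumer(
    topics='my-topic',                                  # placeholder
    deserialization_schema=SimpleStringSchema(),
    properties={'bootstrap.servers': 'kafka:9092',      # placeholder
                'group.id': 'my-group'})

sink = StreamingFileSink \
    .for_row_format('s3a://my-bucket/output',           # placeholder
                    Encoder.simple_string_encoder()) \
    .build()

env.add_source(consumer).add_sink(sink)
env.execute('kafka-to-s3-exactly-once')

As I understand the failure scenario in question 2: parts uploaded since the last checkpoint are never completed as visible S3 objects, and on restart the sink restores its state while the Kafka source rewinds to the checkpointed offsets, so the committed output does not end up with duplicates.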
Related
We have records in an S3 bucket (which get updated daily via a job). We need to listen to a Kafka topic, and when a new event arrives on that topic, update the corresponding record in S3.
Is this possible?
To my understanding, we would need to take a data dump of the S3 data (via Scala code or something) and write to it. IMO, this is not practical.
Is there an efficient way to do it?
It's unclear whether you mean an S3 event or a Kafka event.
For S3 write events, you can use a Lambda function to trigger some code to run; Kafka does not need to be involved. It could be any language - Python and Node.js seem to be the most popular for Lambdas.
For Kafka records, your consumer would see the event, then use an S3 client API to do whatever it needs with that data, such as writing or updating a file. Again, any language that supports the Kafka protocol can be used, not only Scala. Afterwards, a Lambda could also react to the resulting S3 event.
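As a rough sketch of the Kafka-driven path in Python (the topic, broker, bucket and the rule for deriving the object key from the event are all made-up placeholders):

import json
import boto3
from kafka import KafkaConsumer  # kafka-python package

s3 = boto3.client('s3')
consumer = KafkaConsumer('record-updates',                # placeholder topic
                         bootstrap_servers='kafka:9092',  # placeholder broker
                         value_deserializer=lambda v: json.loads(v))

for message in consumer:
    event = message.value
    # Derive the object key from the event and overwrite that record's file.
    key = 'records/{}.json'.format(event['record_id'])    # hypothetical layout
    s3.put_object(Bucket='my-bucket',                     # placeholder bucket
                  Key=key,
                  Body=json.dumps(event).encode('utf-8'))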
I have a pipeline that listens to a Kafka topic that receives the S3 file name & path. The pipeline has to read the file from S3 and do some transformation & aggregation.
I see that Flink has support for reading an S3 file directly as a source connector, but in this use case the file has to be read as part of the transformation stage.
I don't believe this is currently possible.
An alternative might be to keep a Flink session cluster running, and dynamically create and submit a new Flink SQL job running in batch mode to handle the ingestion of each file.
Another approach you might be tempted by would be to implement a RichFlatMapFunction that accepts the path as input, reads the file, and emits its records one by one. But this is unlikely to work very well unless the files are rather small, because Flink really doesn't like user functions that run for long periods of time.
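For what it's worth, a rough PyFlink sketch of that flat-map approach, with boto3 doing the actual S3 read (the 'bucket/key' path format and the line-by-line parsing are simplifying assumptions, and the caveat above about long-running calls still applies):

import boto3
from pyflink.datastream.functions import FlatMapFunction

class ReadS3File(FlatMapFunction):
    """Takes 'bucket/key' strings from the Kafka source and emits the
    file's lines as individual records."""

    def open(self, runtime_context):
        self.s3 = boto3.client('s3')

    def flat_map(self, path):
        bucket, key = path.split('/', 1)
        body = self.s3.get_object(Bucket=bucket, Key=key)['Body']
        for line in body.iter_lines():
            yield line.decode('utf-8')

You would wire it in with something like stream.flat_map(ReadS3File(), output_type=Types.STRING()).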
How do Athena GET requests on S3 work? I had the impression that one S3 GET request = getting one single file from a bucket. But that doesn't seem to be the case since a single query that uses 4 files is costing me around 400 GET requests.
What's happening exactly?
If you run queries against files that are splittable and large enough, Athena will spin up workers that read parts of the files. This improves performance through parallelization. Parquet files, for example, are splittable.
A 100x amplification sounds very high though. I don't know what size Athena aims for when it comes to splits, and I don't know the sizes for your files. There could also be other explanations for the additional GET operations, both inside of Athena and from other sources – how sure are you that these requests are from Athena?
One way you could investigate further is to turn on object-level logging in CloudTrail for the bucket. You should be able to see the request parameters, such as which byte ranges are read. If you assume a role with a unique session name and make only a single query with the credentials you get, you should be able to isolate all the S3 operations made by Athena for that query.
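A boto3 sketch of that isolation trick (the role ARN, session name, database, query and output location are all placeholders):

import boto3

sts = boto3.client('sts')
creds = sts.assume_role(
    RoleArn='arn:aws:iam::123456789012:role/athena-investigation',  # placeholder
    RoleSessionName='athena-get-count-test')['Credentials']         # unique per experiment

# Athena reads the data files with the caller's credentials, so S3 data
# events in CloudTrail will carry this session name and can be filtered on it.
athena = boto3.client('athena',
                      aws_access_key_id=creds['AccessKeyId'],
                      aws_secret_access_key=creds['SecretAccessKey'],
                      aws_session_token=creds['SessionToken'])

athena.start_query_execution(
    QueryString='SELECT count(*) FROM my_table',                      # placeholder
    QueryExecutionContext={'Database': 'my_database'},                # placeholder
    ResultConfiguration={'OutputLocation': 's3://my-query-results/'}) # placeholder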
I have what I think should be a relatively simple use case for AWS Glue, yet I'm having a lot of trouble figuring out how to implement it.
I have a Kinesis Firehose job dumping streaming data into an S3 bucket. These files consist of a series of discrete web-browsing events represented as JSON documents with varying structures (so, say, one document might have the field 'date' but not 'name', whereas another might have 'name' but not 'date').
I wish to run hourly ETL jobs on these files, the specifics of which are not relevant to the matter at hand.
I'm trying to run an S3 data catalog crawler, and the problem I'm running into is that the Kinesis output format is not, itself, valid JSON, which is just baffling to me. Instead it's a bunch of JSON documents separated by line breaks. The crawler can automatically identify and parse JSON files, but it cannot parse this.
I thought of writing a Lambda function to 'fix' the Firehose file, triggered by its creation in the bucket, but that sounds like a cheap workaround for two pieces that should fit together neatly.
Another option would be just bypassing the data catalog altogether and doing the necessary transformations in the Glue script itself, but I have no idea how to get started on this.
Am I missing anything? Is there an easier way to parse Firehose output files or, failing that, to bypass the need for a crawler?
cheers and thanks in advance
It sounds like you're describing the behaviour of Kinesis Firehose, which is to concatenate multiple incoming records according to the buffering settings (time and size), and then write the records to S3 as a single object (see Firehose Data Delivery).
The batching of multiple records into a single file is important if the workload will contain a large number of records, as performance (and S3 costs) for processing many small files from S3 can be less than optimal.
AWS Glue Crawlers and ETL jobs do support processing of the 'JSON lines' (newline-delimited JSON) format.
If the crawler is failing to run, please include the logs or error details (and, if possible, the crawler run duration and the number of tables created and updated).
I have seen a crawler fail in an instance where differences in the files being crawled forced it into a table-per-file mode, and it hit the limit on the number of tables (see AWS Glue Limits).
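If you do end up bypassing the data catalog and reading the files directly in the Glue script, as the question considers, something along these lines should work, since to my knowledge the json format of a DynamicFrame reader handles newline-delimited documents (the S3 path is a placeholder):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the newline-delimited JSON that Firehose drops into the bucket,
# without going through a crawler or catalog table.
events = glue_context.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': ['s3://my-firehose-bucket/2020/']},  # placeholder path
    format='json')

events.printSchema()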
I managed to fix this; basically the problem was that not every JSON document had the same underlying structure.
I wrote a Lambda script as part of the Kinesis pipeline that forces every document into the same structure by adding NULL fields where necessary. The crawlers were then able to correctly parse the resulting files and map them to a single table.
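For anyone looking for a starting point: a stripped-down normalization of that kind, written as a Firehose data-transformation Lambda (EXPECTED_FIELDS is a made-up stand-in for whatever your unified schema is), might look like this:

import base64
import json

# Hypothetical unified schema; every output document gets all of these keys.
EXPECTED_FIELDS = ['date', 'name', 'url', 'user_id']

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        doc = json.loads(base64.b64decode(record['data']))
        # Add any missing fields as nulls so every document has the same shape.
        for field in EXPECTED_FIELDS:
            doc.setdefault(field, None)
        payload = (json.dumps(doc) + '\n').encode('utf-8')
        output.append({'recordId': record['recordId'],
                       'result': 'Ok',
                       'data': base64.b64encode(payload).decode('utf-8')})
    return {'records': output}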
Can you please paste a few lines from the JSON file that Firehose is creating? I ran the crawler on a JSON file generated by Kinesis Streams and it was able to parse it successfully.
Did you also try "convert record format" when creating the Firehose job? There you can specify the JSON SerDe or a Glue catalog table to parse your data.
What solved this for me was to add a newline character '\n' to the end of each payload sent to Firehose.
# Append a newline so each JSON document lands on its own line in the S3 object
msg_pkg = (str(json_response) + '\n').encode('utf-8')
record = {'Data': msg_pkg}
put_firehose('agg2-na-firehose', record)  # put_firehose is a small helper that sends the record to the delivery stream
This works because apparently the Hive JSON SerDe, which is the default used to process JSON data, expects each document on its own line. After doing this I was able to crawl the JSON data and read it in Athena as well.
Is there any way of allowing only one GET operation on an S3 file and then having it deleted automatically?
My current implementation depends on a server downloading the file from S3 and serving it to the user, which works, but it ends up being a bottleneck when there are too many users...
I am very interested in allowing only one GET operation per S3 file, maybe via some kind of policy?
I did some searching but found nothing.
Nope, there is no way (or at least I haven't heard of one) to enforce such a restriction on the Amazon S3 side. But such a function could be implemented on your side, in the client code that talks to S3.
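One pattern you can implement on your side that also takes the download server out of the hot path: hand out a short-lived presigned URL, record that it has been issued, and delete the object afterwards. A boto3 sketch (bucket, key and expiry are placeholders; enforcing "only once" still requires your own bookkeeping):

import boto3

s3 = boto3.client('s3')

def issue_one_time_url(bucket, key, expires_in=60):
    """Return a short-lived download link; hand it out only once and
    schedule the delete afterwards."""
    return s3.generate_presigned_url('get_object',
                                     Params={'Bucket': bucket, 'Key': key},
                                     ExpiresIn=expires_in)

def delete_after_download(bucket, key):
    # Call this once the client confirms the download (or after the URL expires).
    s3.delete_object(Bucket=bucket, Key=key)

url = issue_one_time_url('my-bucket', 'exports/report.zip')  # placeholder bucket/key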