Kafka Connect S3 sink - how to use no partitioning in s3-sink.properties

For legitimate reasons I am looking to dump all Kafka messages into an AWS S3 bucket without partitioning. I have a requirement to use JSON.
My initial thought was to use the FieldPartitioner and designate a key/value that is consistent across all topic messages, so that all messages would end up in the same partition. However, since FieldPartitioner only works with Avro and not JSON, this is not possible.
My current s3-sink.properties replicates Kafka's partitioning structure:
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=50
topics=topic_a,topic_b,topic_c
s3.region=us-east-2
s3.bucket.name=datapipeline
s3.part.size=5242880
flush.size=1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
schema.generator.class=io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
schema.compatibility=NONE
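
For reference, the same configuration can be submitted as JSON to a Kafka Connect worker's REST API when running in distributed mode. Below is a minimal sketch, assuming a worker REST endpoint at localhost:8083 (host and port are placeholders) and the Python requests library:

import requests  # assumes the requests library is installed

# Hypothetical Connect worker REST endpoint; adjust for your cluster.
CONNECT_URL = "http://localhost:8083/connectors"

connector = {
    "name": "s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "50",
        "topics": "topic_a,topic_b,topic_c",
        "s3.region": "us-east-2",
        "s3.bucket.name": "datapipeline",
        "s3.part.size": "5242880",
        "flush.size": "1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
        "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
        "schema.compatibility": "NONE",
    },
}

response = requests.post(CONNECT_URL, json=connector)
response.raise_for_status()
print(response.json())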

Related

Kafka S3 sink connector does not commit offset

I have the following case:
My Lambda is sending messages to a Kafka topic; these messages contain fields with different dates.
My Kafka connector has flush.size=1000 and partitions messages from the topic by the year, month, and day fields into the S3 bucket.
The problem is that Kafka Connect does not commit the offset on the topic. It reads from the same offset all the time, so it keeps overwriting the S3 object with the same data.
When I change flush.size=10, everything works fine.
How can I overcome this problem and keep flush.size=1000?
Offsets only get committed when an S3 file is written. If you're not sending 1000 events for each day partition, then those records will be held in memory. They shouldn't be duplicated or overwritten in S3, since the sink connector has exactly-once delivery (as documented).
Lowering the flush size is one solution. Alternatively, you can add the scheduled rotation interval property, as sketched below.
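A sketch of the second option, assuming a connector named s3-sink-connector running in distributed mode behind a Connect REST endpoint at localhost:8083 (both placeholders). rotate.schedule.interval.ms flushes whatever has accumulated on a wall-clock schedule even if flush.size has not been reached; the S3 sink also expects a timezone to be configured for scheduled rotation:

import requests  # assumes the requests library is installed

# Hypothetical connector name and Connect REST endpoint.
url = "http://localhost:8083/connectors/s3-sink-connector/config"

config = requests.get(url).json()
config.update({
    # Rotate files every 10 minutes of wall-clock time, regardless of
    # whether flush.size records have arrived for a given partition.
    "rotate.schedule.interval.ms": "600000",
    # Scheduled rotation requires an explicit timezone.
    "timezone": "UTC",
})
requests.put(url, json=config).raise_for_status()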

DynamoDB data to S3 in Kinesis Firehose output format

Kinesis Data Firehose has a default format for adding files into separate partitions in an S3 bucket, which looks like: s3://bucket/prefix/yyyy/MM/dd/HH/file.extension
I have created event streams to dump data from DynamoDB to S3 using Firehose. There is a transformation Lambda in between which converts DDB records into TSV (tab-separated) format.
All of this is added on an existing table which already contains huge amounts of data. I need to backfill the existing data from DynamoDB to the S3 bucket while maintaining parity with the existing Firehose output format.
Solution I tried:
Step 1: Export the table to S3 using the DDB export feature, then use a Glue crawler to create a Data Catalog table.
Step 2: Use Athena's CREATE TABLE AS SELECT query to imitate the transformation done by the intermediate Lambda and store that output to an S3 location.
Step 3: However, Athena CTAS applies a default compression that cannot be turned off. So I wrote a Glue job that reads from the previous table and writes to another S3 location. This job also takes care of adding the partitions based on year/month/day/hour, as is the format with Firehose, and writes the decompressed, tab-separated files to S3.
However, the problem is that Glue creates Hive-style partitions, which look like s3://bucket/prefix/year=2021/month=02/day=02/, and I need to match the Firehose block-style S3 partitions instead.
I am looking for an approach to help achieve this. I couldn't find a way to add block-style partitions using Glue. Another approach I have is to use the AWS CLI s3 mv command to move all this data into separate folders with the correct file names, which is neither clean nor optimised.
Leaving the solution I ended up implementing here in case it helps anyone.
I created a Lambda and added an S3 event trigger on this bucket. The Lambda did the job of moving each file from the Hive-style partitioned S3 folder to the correctly structured block-style S3 folder.
The Lambda used the copy and delete functions from the boto3 S3 client to implement this, as sketched below.
It worked like a charm even though I had more than 10^6 output files split across different partitions.
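A minimal sketch of such a Lambda, assuming the Hive-style keys look like prefix/year=YYYY/month=MM/day=DD/file and the target block-style layout is prefix/YYYY/MM/DD/file; the regex and key layout are illustrative assumptions, not the exact implementation:

import re
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

# Matches Hive-style keys, e.g. prefix/year=2021/month=02/day=02/data.tsv
HIVE_KEY = re.compile(
    r"^(?P<prefix>.*?)year=(?P<y>\d{4})/month=(?P<m>\d{2})/day=(?P<d>\d{2})/(?P<file>.+)$"
)

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        match = HIVE_KEY.match(key)
        if not match:
            continue  # not a Hive-style key; leave it alone
        # Rebuild the key in the Firehose block style: prefix/yyyy/MM/dd/file
        new_key = "{prefix}{y}/{m}/{d}/{file}".format(**match.groupdict())
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)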

Kinesis Firehose to S3 in Parquet format (and Snappy compression)

I have set up AWS DMS and taken data from my SQL Server RDS instance into Kinesis Data Streams and then into Kinesis Data Firehose, and now I want to write the data into S3 in Parquet format with Snappy compression.
On the console for Firehose it says I need to specify a schema for the source records. Is this necessary? If so, how do I take the source data definition and create a Glue catalog to use at this point? Also, if I add to the table additional optional columns that Kinesis Data Streams can add (e.g. the time of the transaction), will these need defining in the catalog as well?
On the Console for Firehose it says I need to specify a schema for the source records. Is this necessary?
Yes. Parquet is columnar in nature. Kinesis needs to know the schema to convert to Parquet format.
if I add to the table columns with additional optional columns that Kinesis Streams can add (eg the time of the transaction), will these need defining in the catalog also?
Yes. Any columns you send that are not defined in the schema will be ignored and lost; they need to be defined in the schema.
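As a hedged illustration, the schema Firehose reads for record format conversion can be registered in the Glue Data Catalog with boto3; the database, table, columns, and S3 location below are hypothetical stand-ins for the actual source definition:

import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="dms_replication",   # hypothetical Glue database
    TableInput={
        "Name": "orders",             # hypothetical table name
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                # Columns coming from the SQL Server source table...
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "customer_id", "Type": "bigint"},
                {"Name": "amount", "Type": "decimal(18,2)"},
                # ...plus any optional fields added on the stream, such as the
                # transaction time; columns not declared here are dropped.
                {"Name": "transact_ts", "Type": "timestamp"},
            ],
            "Location": "s3://my-bucket/orders/",  # hypothetical output location
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)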

Apache Nifi 1.7.1 PutHive3Streaming Hive 3.0 - Managed table compression

I am using PutHive3Streaming to load Avro data from NiFi into Hive. For a sample, I am sending 10 MB of JSON data to NiFi, converting it to Avro (reducing the size to 118 KB), and using PutHive3Streaming to write to a managed Hive table. However, I see that the data is not compressed in Hive.
hdfs dfs -du -h -s /user/hive/warehouse/my_table*
32.1 M /user/hive/warehouse/my_table (<-- replication factor 3)
At the table level, I have:
STORED AS ORC
TBLPROPERTIES (
'orc.compress'='ZLIB',
'orc.compression.strategy'='SPEED',
'orc.create.index'='true',
'orc.encoding.strategy'='SPEED',
'transactional'='true');
and I have also enabled:
hive.exec.dynamic.partition=true
hive.optimize.sort.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
avro.output.codec=zlib
hive.exec.compress.intermediate=true;
hive.exec.compress.output=true;
It looks like despite this, compression is not enabled in Hive. Any pointers to enable this?
Hive does not compress data that is inserted by the Streaming Data Ingest API.
It will be compressed when compaction runs.
See https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2#StreamingDataIngestV2-APIUsage
If you don't want to wait, use ALTER TABLE your_table PARTITION (key=value) COMPACT 'major'.
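A small sketch of issuing that statement programmatically over HiveServer2, assuming the PyHive client and placeholder host, table, and partition values:

from pyhive import hive  # assumes the PyHive package is installed

conn = hive.connect(host="hiveserver2.example.com", port=10000, username="hive")
cursor = conn.cursor()

# Force a major compaction on one partition instead of waiting for the
# automatic compactor; the deltas are rewritten (and ZLIB-compressed,
# per the table's orc.compress setting) into a base file.
cursor.execute("ALTER TABLE my_table PARTITION (dt='2019-01-01') COMPACT 'major'")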
Yes, K.M is correct insofar as compaction needs to be used.
a) Hive compaction strategies need to be used to manage the size of the data. Only after compaction is the data encoded. Below are the default properties for auto-compaction.
hive.compactor.delta.num.threshold=10
hive.compactor.delta.pct.threshold=0.1
b) Even with these defaults, one of the challenges I had with compaction is that the delta files written by NiFi were not accessible (deletable) by the compaction cleaner after the compaction itself. I fixed this by making the hive user the table owner and giving the hive user rights to the delta files, per the standards laid out by Kerberos.
c) Another challenge I continue to face is in triggering auto-compaction jobs. In my case, as delta files continue to get streamed into Hive for a given table/partition, the very first major compaction job completes successfully, deletes the deltas, and creates a base file. But after that point, auto-compaction jobs are not triggered, and Hive accumulates a huge number of delta files, which have to be cleaned up manually (not desirable).

"not a Parquet file (too small)" from Presto during Spark structured streaming run

I have a pipeline set up that reads data from Kafka, processes it using Spark Structured Streaming, and then writes Parquet files to HDFS. Downstream clients query the data using Presto, configured to read the data as Hive tables.
Kafka --> Spark --> Parquet on HDFS --> Presto
In general this works. The problem arises when a query happens while the Spark job is running a batch. The Spark job creates a zero-length Parquet file on HDFS. If Presto attempts to open this file in the course of processing a query, then it throws an error:
Query 20171116_170937_07282_489cc failed: Error opening Hive split hdfs://namenode:50071/hive/warehouse/table/part-00000-5a7c242a-3e53-46d0-9ee4-5d004ef4b1e4-c000.snappy.parquet (offset=0, length=0): hdfs://namenode:50071/hive/warehouse/table/part-00000-5a7c242a-3e53-46d0-9ee4-5d004ef4b1e4-c000.snappy.parquet is not a Parquet file (too small)
The file is indeed zero bytes at this time, so the error is strictly correct, but this is not the behavior I want for the pipeline. I would like to be able to continuously write into the appropriate HDFS folders without disturbing the Presto queries.
The Spark scala code for the job looks like this:
val FilesOnDisk = 1
Spark
  .initKafkaStream("fleet_profile_test")
  .filter(_.name.contains(job.kafkaTag))
  .flatMap(job.parser)
  .coalesce(FilesOnDisk)
  .writeStream
  .trigger(ProcessingTime("1 hours"))
  .outputMode("append")
  .queryName(job.queryName)
  .format("parquet")
  .option("path", job.outputFilesPath)
  .start()
The job starts at the top of the hour, :00. The file is first visible on HDFS as a zero-length file at :05. It is not updated until it is written completely at :21, just before the job finishes. This makes the table effectively unusable from Presto 25% of the time.
Each file is only a little over 500kB, so I wouldn't expect the physical writing of the file to take very long. From my understanding, Parquet files have their metadata at the end of the file so someone writing bigger files would have even more trouble.
What strategies have people used to integrate Spark structured streaming and Presto while working around this Presto error?
You could try to persuade Presto (or the Presto team) to ignore empty files, but that wouldn't help: the program writing the file (here, Spark) will eventually flush partial data, and the file would then appear partial, non-empty, and not well formed, leading to an error as well.
The approach that prevents Presto (or any other program reading the table data, for that matter) from seeing a partial file is to assemble the file in a different location and then atomically move it into the correct location.
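A hedged sketch of that pattern: the completed Parquet files are first written to a staging directory that Presto does not read, then renamed into the table location. The paths below reuse the namenode from the error message above but are otherwise placeholders; a rename within a single HDFS filesystem is atomic, so Presto sees either the whole file or nothing:

import subprocess

# Hypothetical staging and table directories.
STAGING_DIR = "hdfs://namenode:50071/hive/staging/table"
TABLE_DIR = "hdfs://namenode:50071/hive/warehouse/table"

def publish(filename: str) -> None:
    # hdfs dfs -mv performs a rename, which is atomic within one filesystem,
    # so readers never observe a zero-length or partially written file.
    subprocess.run(
        ["hdfs", "dfs", "-mv", f"{STAGING_DIR}/{filename}", f"{TABLE_DIR}/{filename}"],
        check=True,
    )

if __name__ == "__main__":
    publish("part-00000-5a7c242a-3e53-46d0-9ee4-5d004ef4b1e4-c000.snappy.parquet")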