Kafka S3 sink connector does not commit offset - amazon-s3

I have the following case:
My lambda is sanding messages to Kafka Topic, this messages contains fields with different dates
My Kafka Connector has flush.size=1000 and partition messages from topic by: year,month,day fields to the S3 bucket.
The problem is that Kafka Connect does not commit offset on the topic. It reads the same offset all time -> it overwrites S3 object with the same data all time.
When I change flush.size=10, everything works thine.
How can I ocercome this problem to keep flush.size=1000?

Offsets only get committed when S3 file is written. If you're not sending 1000 events for each day partition, then those records will be held in memory. They shouldn't be duplicated/overridden in S3 since the sink connector has exactly once delivery (as documented)
Lowering the flush size is one solution. Or you can add scheduled rotation interval property

Related

Reading data from GCS with BigQuery fails with "Not Found", but the date (files) exists

I have a service that is constantly updating files in GCS bucket with hive format:
bucket
device_id=aaaa
month=01
part-0.parquet
month=02
part-0.parquet
....
device_id=bbbb
month=01
part-0.parquet
month=02
part-0.parquet
....
If today we are at month=02 and I ran the following with BigQuery:
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '02';
I get the error: Not found: Files /bigstore/bucket_name/device_id=aaaa/month=02/part-0.parquet
I checked and the file is there when the query ran.
If I run
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '01';
I get results without any errors. I guess the error is related to the fact that I'm modifying the data while querying it. But as I understand this should not be the case with GCS, this is from their docs.
Because uploads are strongly consistent, you will never receive a 404 Not Found response or stale data for a read-after-write or read-after-metadata-update operation.
I saw some posts that this could be related to my bucket been Multi-region.
Any other insights?
It could be for some reason that you get this error.
When you load data from Cloud Storage into a BigQuery table, the
dataset that contains the table must be in the same regional or
multi- regional location as the Cloud Storage bucket.
Due to consistency, for buckets, while metadata updates are strongly
consistent for read-after-metadata-update operations, the process
could take time to finish the changes.
Using a Multi-region bucket is not recommended.
In this case, it could be due to consistency, because while you are updating the files GCS at the same time you are executing the query, so when you execute a query the parquet file was available to read and you didn’t get the error, but the next time the parquet file wasn’t available because the service was updating the file and you got the error.
Unfortunately, there is not a simple way, to solve this problem, but here are some options:
You can add a pub/sub routine to the bucket and/or file and quick off
your query after the service finished updating the files.
Make a workflow that blocks the updating of the files in their
buckets until their query finishes.
If the query fails with “not found” for file ABCD and you have
verified ABCD exists in GCS, then retry the query X times.
You need to backup your data into another location where you won't
update these files constantly, just once a day.
You could move the data into a managed storage where you won't have
this problem because you can do snapshotting.

BigQuery streaming insert from Dataflow - no results

I have a Dataflow pipeline which is reading messages from PubSub Lite and streams data into a BigQuery table. The table is partitioned by day. When querying the table with:
SELECT * FROM `my-project.my-dataset.my-table` WHERE DATE(timestamp) = "2021-10-14"
The BigQuery UI tells me This query will process 1.9 GB when run. But when actually running the query I don't get any results. My pipeline is running for a whole week now and I am getting the same results for the last two days. However, for 2021-10-11 and the days before that I am seeing actual results.
I am currently using Apache Beam version 2.26 and my Dataflow writer looks like this:
return BigQueryIO.<Event>write()
.withSchema(createTableSchema())
.withFormatFunction(event -> createTableRow(event))
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withTimePartitioning(new TimePartitioning().setType("DAY").setField("timestamp"))
.to(TABLE);
Why is BigQuery taking so long for committing the values to the partitions but at the same time telling me there is actually data available?
EDIT 1:
BigQuery is processing data and not returning any rows because its processing also the data in your streaming buffer. Data on buffer is can take up to 90 min to be committed in the partitioned tables.
Check more details in this stack and also in the documentation available here.
When streaming to a partitioned table, data in the
streaming buffer has a NULL value for the _PARTITIONTIME pseudo column.
If you are having problems to write the data from pubsub in BigQuery, I recommend you to use an template avaiable in dataflow.
Use an Dataflow template avaiable in GCP to write the data from PubSub to BigQuery:
There is an tempate to write data from a pubsub topic to bigquery and it already takes care of the possible corner cases.
I tested it as following and works perfectly:
Create a subscription in you PubSub topic;
Create bucket for temporary storage;
Create the job as following:
For testing, I just sent a message to the topic in json format and the new data was added in the output table:
gcloud pubsub topics publish test-topic --message='{"field_dt": "2021-10-15T00:00:00","field_ts": "2021-10-15 00:00:00 UTC","item": "9999"}'
If you want something more complex, you can fork from the templates code from github and adjust it for your need.

Kafka Connect S3 sink - how to utilize no partitioning in s3-sink.properties

For legitimate reasons I am looking to dump all Kafka messages into a AWS S3 bucket without partitioning. I have a requirement to use JSON.
My initial thought was to use the FieldPartitioner and designate a key/value that is consistent across all topic messages. This would result in all messages ending in the same partition. However, since FieldPartitioner only works with Avro and not JSON, this is not possible.
My current s3-sink.properties replicates Kafka's partitioning stucture:
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=50
topics=topic_a,topic_b,topic_c
s3.region=us-east-2
s3.bucket.name=datapipeline
s3.part.size=5242880
flush.size=1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
schema.generator.class=io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
schema.compatibility=NONE

Apache Nifi 1.7.1 PutHive3Streaming Hive 3.0 - Managed table compression

I am using PutHive3Streaming to load avro data from Nifi to Hive. For a sample, I am sending 10 MB data Json data to Nifi, converting it to Avro (reducing the size to 118 KB) and using PutHive3Streaming to write to a managed hive table. However, I see that the data is not compressed at hive.
hdfs dfs -du -h -s /user/hive/warehouse/my_table*
32.1 M /user/hive/warehouse/my_table (<-- replication factor 3)
At the table level, I have:
STORED AS ORC
TBLPROPERTIES (
'orc.compress'='ZLIB',
'orc.compression.strategy'='SPEED',
'orc.create.index'='true',
'orc.encoding.strategy'='SPEED',
'transactional'='true');
and I have also enabled:
hive.exec.dynamic.partition=true
hive.optimize.sort.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
hive.optimize.sort.dynamic.partition=true
avro.output.codec=zlib
hive.exec.compress.intermediate=true;
hive.exec.compress.output=true;
It looks like despite this, compression is not enabled in Hive. Any pointers to enable this?
Hive does not compress datas which inserted by Streaming Data Ingest API.
They'll be compressed when compaction runs.
See https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2#StreamingDataIngestV2-APIUsage
If you dont' wanna wait, use ALTER TABLE your_table PARTITION(key=value) COMPACT "MAJOR".
Yes, #K.M is correct in so far that Compaction needs to be used.
a) Hive compaction strategies need to be used to manage the size of the data. Only after compaction is the data encoded. Below are the default properties for auto-compaction.
hive.compactor.delta.num.threshold=10
hive.compactor.delta.pct.threshold=0.1
b) Despite this being default, one of the challenges I had for compaction is that the delta files written by nifi were not accessible(delete-able) by the compaction cleaner (after the compaction itself). I fixed this by using the hive user as the table owner as well as giving the hive user 'rights' to the delta files as per standards laid out by kerberos.
d) Another challenge I continue to face is in triggering auto compaction jobs. In my case, as delta files continue to get streamed into hive for a given table/partition, the very first major compaction job completes successfully, deletes deltas and creates a base file. But after that point, auto-compact jobs are not triggered. And hive accumulates a huge number of delta files. (which have to be cleaned up manually <--- not desirable)

read external data in a spark-streaming job

Suppose a system is generating a batch of data every 1h on hdfs or aws-s3, like this
s3://..../20170901-00/ # generated at 0am
s3://..../20170901-01/ # 1am
...
and I need to pipe these batches of data into kafka once they're generated.
My solution for this is, set up a spark-streaming job and set a moderate job interval (say, half an hour), so at each streaming interval try to read from s3 and if the data is there, then read it and write to kafka.
Is this doable? And I don't know how to read from s3 or hdfs in a spark-streaming job, how?