read external data in a spark-streaming job - amazon-s3

Suppose a system generates a batch of data every hour on HDFS or AWS S3, like this:
s3://..../20170901-00/ # generated at 0am
s3://..../20170901-01/ # 1am
...
and I need to pipe these batches of data into Kafka once they are generated.
My plan is to set up a Spark Streaming job with a moderate batch interval (say, half an hour), so that at each interval it tries to read from S3, and if the data for that hour is there, it reads it and writes it to Kafka.
Is this doable? Also, I don't know how to read from S3 or HDFS inside a Spark Streaming job. How?
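One way to sketch this is with Structured Streaming's file source, which picks up new files under a path as they appear, combined with the Kafka sink. Below is a minimal, untested sketch; the schema, paths, broker list and topic name are placeholders I'm assuming, not values from your setup, and it needs the spark-sql-kafka package on the classpath:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("s3-to-kafka").getOrCreate()

# Streaming file sources require an explicit schema (placeholder fields here).
schema = StructType([StructField("col1", StringType()), StructField("col2", StringType())])

batches = (spark.readStream
    .schema(schema)
    .format("json")                      # or csv/parquet, whatever the hourly dumps are
    .load("s3a://my-bucket/input/*/"))   # placeholder path; glob covers the hourly directories

# The Kafka sink expects a string/binary 'value' column (and optionally 'key').
query = (batches
    .select(to_json(struct(*batches.columns)).alias("value"))
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")       # placeholder brokers
    .option("topic", "my-topic")                              # placeholder topic
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
    .trigger(processingTime="30 minutes")                     # matches the half-hour interval idea
    .start())

query.awaitTermination()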

Related

Kafka S3 sink connector does not commit offset

I have the following case:
My Lambda is sending messages to a Kafka topic; these messages contain fields with different dates.
My Kafka connector has flush.size=1000 and partitions messages from the topic by the year, month, and day fields into the S3 bucket.
The problem is that Kafka Connect does not commit the offset on the topic. It keeps reading from the same offset, so it overwrites the S3 object with the same data over and over.
When I change flush.size=10, everything works fine.
How can I overcome this problem and keep flush.size=1000?
Offsets only get committed when an S3 file is written. If you're not sending 1000 events for each day partition, those records will be held in memory. They shouldn't be duplicated/overwritten in S3, since the sink connector has exactly-once delivery (as documented).
Lowering the flush size is one solution. Or you can add the scheduled rotation interval property, as sketched below.
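For illustration, the relevant S3 sink connector properties could look roughly like this; the 10-minute value is just an example, and if I remember correctly rotate.schedule.interval.ms also requires the connector's timezone property to be set:
flush.size=1000
# Wall-clock based rotation: close and commit the current file at least this often,
# even when fewer than flush.size records have arrived for a partition.
rotate.schedule.interval.ms=600000
timezone=UTC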

How to export Load Balancer log to BigQuery in Real Time?

We are trying to export all the HTTP requests hitting our Google load balancer into BigQuery. Unfortunately, we notice that the data arrives in BigQuery about 3 minutes late.
Starting from this tutorial: https://cloud.google.com/solutions/serverless-pixel-tracking
We created a Load Balancer that points to a pixel.png in a public storage bucket
Created a sink to export all logs to Pub/Sub
Created a Dataflow job from the provided template to stream-insert from Pub/Sub into a BigQuery table
The table is partitioned on date and clustered on hour and minute columns.
After we scaled to 1,000 requests per second, we noticed that data was delayed by 2 or 3 minutes:
SELECT * FROM DATASET ORDER BY Timestamp desc Limit 100
This query executes in a few seconds, but the most recent result is 3 minutes old.
I am exporting logs for a lot of different resources directly into BigQuery, without using Dataflow or Pub/Sub, and I can see them in real time. If you do not need to do any special pre-processing in Dataflow, you might want to try exporting directly into BigQuery and removing the other components in between that introduce latency.
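As a rough illustration of that direct route, a log sink straight to BigQuery can be created with gcloud; the sink name, project, dataset and log filter below are assumptions about your setup, and the sink's writer identity still needs write access to the dataset:
gcloud logging sinks create lb-to-bq \
  bigquery.googleapis.com/projects/my-project/datasets/lb_logs \
  --log-filter='resource.type="http_load_balancer"'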

Apache Nifi 1.7.1 PutHive3Streaming Hive 3.0 - Managed table compression

I am using PutHive3Streaming to load Avro data from NiFi into Hive. As a sample, I am sending 10 MB of JSON data to NiFi, converting it to Avro (which reduces the size to 118 KB), and using PutHive3Streaming to write it to a managed Hive table. However, I see that the data is not compressed in Hive.
hdfs dfs -du -h -s /user/hive/warehouse/my_table*
32.1 M /user/hive/warehouse/my_table (<-- replication factor 3)
At the table level, I have:
STORED AS ORC
TBLPROPERTIES (
'orc.compress'='ZLIB',
'orc.compression.strategy'='SPEED',
'orc.create.index'='true',
'orc.encoding.strategy'='SPEED',
'transactional'='true');
and I have also enabled:
hive.exec.dynamic.partition=true
hive.optimize.sort.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
avro.output.codec=zlib
hive.exec.compress.intermediate=true;
hive.exec.compress.output=true;
It looks like, despite all this, compression is not enabled in Hive. Any pointers on how to enable it?
Hive does not compress data inserted via the Streaming Data Ingest API.
It will be compressed when compaction runs.
See https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2#StreamingDataIngestV2-APIUsage
If you don't want to wait, use ALTER TABLE your_table PARTITION (key=value) COMPACT 'MAJOR'.
Yes, #K.M is correct insofar as compaction needs to be used.
a) Hive compaction strategies need to be used to manage the size of the data. Only after compaction is the data encoded. Below are the default properties for auto-compaction.
hive.compactor.delta.num.threshold=10
hive.compactor.delta.pct.threshold=0.1
b) Despite these being the defaults, one of the challenges I had with compaction is that the delta files written by NiFi were not accessible (deletable) by the compaction cleaner after the compaction itself. I fixed this by making the hive user the table owner, as well as giving the hive user rights to the delta files per the standards laid out by Kerberos.
c) Another challenge I continue to face is triggering auto-compaction jobs. In my case, as delta files keep streaming into Hive for a given table/partition, the very first major compaction job completes successfully, deletes the deltas, and creates a base file. But after that point, auto-compaction jobs are no longer triggered, and Hive accumulates a huge number of delta files (which have to be cleaned up manually, which is not desirable).
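For reference, a sketch of the manual fallback and of the metastore-side settings worth double-checking when auto-compaction never fires; table, partition and values are placeholders:
-- Queue a major compaction for one partition by hand
ALTER TABLE my_table PARTITION (day='2019-01-01') COMPACT 'major';
-- Inspect queued/running/failed compactions
SHOW COMPACTIONS;
-- Auto-compaction only runs if the metastore has the compactor enabled:
-- hive.compactor.initiator.on=true
-- hive.compactor.worker.threads=1 (or more)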

How to avoid reading old files from S3 when appending new data?

Once every 2 hours, a Spark job runs to convert some tgz files to Parquet.
The job appends the new data into an existing parquet in s3:
df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")
In the spark-submit output I can see that significant time is being spent reading the old Parquet files, for example:
16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading
16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key
'foo.parquet/id=123/day=2016-11-26/part-r-00003-e80419de-7019-4859-bbe7-dcd392f6fcd3.snappy.parquet'
seeking to position '149195444'
It looks like this operation takes less than 1 second per file, but the number of files grows over time (each append adds new files), which makes me think my code will not be able to scale.
Any ideas on how to avoid reading old Parquet files from S3 when I only need to append new data?
I use EMR 4.8.2 and DirectParquetOutputCommitter:
sc._jsc.hadoopConfiguration().set('spark.sql.parquet.output.committer.class', 'org.apache.spark.sql.parquet.DirectParquetOutputCommitter')
I resolved this issue by writing the DataFrame to HDFS on the EMR cluster and then using s3-dist-cp to upload the Parquet files to S3.
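A minimal sketch of that two-step approach, with placeholder paths (s3-dist-cp runs on the EMR cluster, e.g. as an EMR step):
# 1) Append the new batch to HDFS on the cluster instead of S3
df.write.mode("append").partitionBy("id", "day").parquet("hdfs:///tmp/foo.parquet")

# 2) Then copy the result to S3 with s3-dist-cp, e.g.:
#    s3-dist-cp --src hdfs:///tmp/foo.parquet --dest s3://myBucket/foo.parquet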
Switch this over to using Dynamic Partition Overwrite Mode using:
.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
Also, avoid the DirectParquetOutputCommitter; instead, don't override the committer at all - you will get better write speed with the EMRFS file committer.
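For reference, a sketch of the dynamic-overwrite variant (assuming Spark 2.3+, where this setting exists); with it, mode("overwrite") only replaces the partitions present in the incoming DataFrame instead of appending ever more files:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate())

# Only the (id, day) partitions contained in df are rewritten;
# all other existing partitions under the path are left untouched.
(df.write
    .mode("overwrite")
    .partitionBy("id", "day")
    .parquet("s3://myBucket/foo.parquet"))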

Performance issue writing to S3 from Spark Structured Streaming application

Basically I am running a Structured Streaming job 24x7 writing to S3, but I came across an issue where _spark_metadata takes hours to write a single compact file, and no new data is ingested during that time.
Any idea how to solve this issue and enable no-downtime ingestion?
19/10/24 00:48:34 INFO ExecutorAllocationManager: Existing executor 40 has been removed (new total is 1)
19/10/24 00:49:03 INFO CheckpointFileManager: Writing atomically to s3a://.../data/_spark_metadata/88429.compact using temp file s3a://.../data/_spark_metadata/.88429.compact.00eb0d4b-ec83-4f8c-9a67-4155918a5f83.tmp
19/10/24 03:32:53 INFO CheckpointFileManager: Renamed temp file s3a://.../data/_spark_metadata/.88429.compact.00eb0d4b-ec83-4f8c-9a67-4155918a5f83.tmp to s3a://brivo-prod-dataplatform-kafka-streaming/data/_spark_metadata/88429.compact
19/10/24 03:32:53 INFO FileStreamSinkLog: Current compact batch id = 88429 min compaction batch id to delete = 88329
19/10/24 03:32:54 INFO ManifestFileCommitProtocol: Committed batch 88429
Rename is mimicked on S3 with a copy followed by a delete, and is O(data). Checkpoint more often to create smaller files.