Performance issue writing to S3 from Spark Structured Streaming application

I am running a Structured Streaming job 24x7 that writes to S3, and I have hit an issue where writing a single _spark_metadata compact file takes hours; no new data is ingested during that time.
Any idea how to fix this and keep ingestion running without downtime?
19/10/24 00:48:34 INFO ExecutorAllocationManager: Existing executor 40 has been removed (new total is 1)
19/10/24 00:49:03 INFO CheckpointFileManager: Writing atomically to s3a://.../data/_spark_metadata/88429.compact using temp file s3a://.../data/_spark_metadata/.88429.compact.00eb0d4b-ec83-4f8c-9a67-4155918a5f83.tmp
19/10/24 03:32:53 INFO CheckpointFileManager: Renamed temp file s3a://.../data/_spark_metadata/.88429.compact.00eb0d4b-ec83-4f8c-9a67-4155918a5f83.tmp to s3a://brivo-prod-dataplatform-kafka-streaming/data/_spark_metadata/88429.compact
19/10/24 03:32:53 INFO FileStreamSinkLog: Current compact batch id = 88429 min compaction batch id to delete = 88329
19/10/24 03:32:54 INFO ManifestFileCommitProtocol: Committed batch 88429

Rename is mimicked on S3 with a copy followed by a delete, so it is O(data). Checkpoint more often so the files being written (and renamed) stay small.
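As a rough sketch of that advice (the source, bucket paths, and trigger interval below are illustrative, not taken from the question), the knob on the Spark side is simply how often the stream triggers and commits:
from pyspark.sql import SparkSession
# Minimal sketch: triggering micro-batches more frequently keeps each committed
# file small, so each copy-based "rename" on S3 has less data to move.
spark = SparkSession.builder.appName("s3-streaming-sketch").getOrCreate()
stream = spark.readStream.format("rate").load()  # stand-in for the real source
query = (stream.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/data")                       # hypothetical bucket
         .option("checkpointLocation", "s3a://my-bucket/checkpoints")  # hypothetical path
         .trigger(processingTime="1 minute")                           # commit/checkpoint more often
         .start())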

Related

AWS Glue PySpark job deletes S3 folder unexpectedly

My Glue workflow is DDB -> Glue table (via Crawler) -> S3 (via Glue job).
I create the S3 folder manually before the workflow runs.
For a DDB table of around 500 MB it always works fine (7-10 minutes to finish), and the S3 path ends up with the correct result, e.g. s3://glue_example/ddb_500MB/ (I know the data is correct because I checked it in Athena after connecting to S3).
For a 50 GB DDB table, the folder is deleted by the Glue job (it runs 2 hours and finishes with no error), e.g. s3://glue_example/ddb_50GB is gone. (I enabled S3 access logging, and the log shows GlueJobRunnerSession used DeleteObject on this folder path.)
This folder-deletion behavior is not consistent; it happens most of the time, but if I find the folder deleted and recreate it manually, the next run writes correct data into that S3 folder.
The Glue job code (Glue 3.0, Spark 3.1, Scala 2, Python 3) is very simple; the only line that writes to S3 is: ApplyMapping_node2.toDF().write.mode("overwrite").format("parquet").save('s3://glue_example/ddb_50GB')
The concurrency of the workflow/job is 1, so this is not a problem caused by competing runs.
I use overwrite so the folder holds only the latest data, but I don't understand why the folder keeps getting deleted when the source DDB table is large. Any ideas?
The issue was due to the whole table being read into a single partition, which is the default behaviour. Increasing dynamodb.splits while reading from the DDB table should help, as the data is then read in parallel into multiple partitions. Below is an example in PySpark.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "test_source",
        "dynamodb.throughput.read.percent": "1.0",
        "dynamodb.splits": "100",  # read the table in parallel across 100 partitions
    },
)
Refer to the link below for more information:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb
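For completeness, here is an end-to-end sketch tying the parallel read back to the write from the question; the GlueContext boilerplate is assumed, and only the table name, splits, and S3 path come from the post.
from awsglue.context import GlueContext
from pyspark.context import SparkContext
# Sketch only: read the DynamoDB table in parallel, then perform the same
# overwrite write as in the question.
glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "test_source",
        "dynamodb.throughput.read.percent": "1.0",
        "dynamodb.splits": "100",
    },
)
dyf.toDF().write.mode("overwrite").format("parquet").save("s3://glue_example/ddb_50GB")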

Not able to update a BigQuery table with a Transfer from a Cloud Storage file

I am not able to update a BigQuery table from a Cloud Storage file. I have the latest data file and the transfer runs successfully, but it says "8:36:01 AM Detected that no changes will be made to the destination table.".
I have tried multiple approaches.
Please help.
Thanks,
-Srini
You have to wait 1 hour after your file has been updated in Cloud Storage: https://cloud.google.com/bigquery-transfer/docs/cloud-storage-transfer?hl=en_US#minimum_intervals
I had the same error. I created two transfers from GCS to BigQuery, with write preference set to MIRROR and APPEND. I got the logs below (no error). The GCS file was uploaded less than one hour before.
MIRROR: Detected that no changes will be made to the destination table. Summary: succeeded 0 jobs, failed 0 jobs.
APPEND: None of the 1 new file(s) found matching "gs://mybucket/myfile" meet the requirement of being at least 60 minutes old. They will be loaded in next run. Summary: succeeded 0 jobs, failed 0 jobs.
Both jobs went through one hour later.

Apache NiFi 1.7.1 PutHive3Streaming Hive 3.0 - Managed table compression

I am using PutHive3Streaming to load Avro data from NiFi into Hive. As a sample, I am sending 10 MB of JSON data to NiFi, converting it to Avro (which reduces the size to 118 KB), and using PutHive3Streaming to write to a managed Hive table. However, I see that the data is not compressed in Hive.
hdfs dfs -du -h -s /user/hive/warehouse/my_table*
32.1 M /user/hive/warehouse/my_table (<-- replication factor 3)
At the table level, I have:
STORED AS ORC
TBLPROPERTIES (
'orc.compress'='ZLIB',
'orc.compression.strategy'='SPEED',
'orc.create.index'='true',
'orc.encoding.strategy'='SPEED',
'transactional'='true');
and I have also enabled:
hive.exec.dynamic.partition=true
hive.optimize.sort.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
avro.output.codec=zlib
hive.exec.compress.intermediate=true;
hive.exec.compress.output=true;
Despite all this, compression does not seem to be enabled in Hive. Any pointers on how to enable it?
Hive does not compress data inserted through the Streaming Data Ingest API; it gets compressed when compaction runs.
See https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2#StreamingDataIngestV2-APIUsage
If you don't want to wait, use ALTER TABLE your_table PARTITION(key=value) COMPACT "MAJOR".
Yes, K.M is correct insofar as compaction needs to be used.
a) Hive compaction strategies need to be used to manage the size of the data. Only after compaction is the data encoded. Below are the default properties for auto-compaction.
hive.compactor.delta.num.threshold=10
hive.compactor.delta.pct.threshold=0.1
b) Despite these defaults, one of the challenges I had with compaction is that the delta files written by NiFi were not accessible (deletable) by the compaction cleaner after the compaction itself. I fixed this by making the hive user the table owner and giving the hive user rights to the delta files, as per the standards laid out by Kerberos.
c) Another challenge I continue to face is triggering auto-compaction jobs. In my case, as delta files keep streaming into Hive for a given table/partition, the very first major compaction job completes successfully, deletes the deltas, and creates a base file. After that point, however, auto-compaction jobs are no longer triggered, and Hive accumulates a huge number of delta files that have to be cleaned up manually (not desirable).

Hive Table deletion and query processing

As per my understanding of Hive concepts, if we load a dataset into a Hive table, the data file is moved from the source path into the Hive warehouse within HDFS, and HDFS is set to keep three replicas of the data.
These questions might look silly, but as a beginner I want to clear my doubts.
My questions are:
1) If I delete the Hive table, will it delete the data file from the Hive warehouse only, or the other two replicas in HDFS as well?
2) If we run a query on the Hive table, will that query be processed in a distributed fashion?
Say one data file is 1 GB in size (i.e. 8 blocks x 128 MB); with a replication factor of three, there are 24 blocks in total for this file.
Will our Hive query be distributed among all the data blocks, or will it be processed only against the Hive warehouse blocks?
Thanks in advance.
If you do "load data inpath" from a HDFS path the data will be moved from source to destination HDFS path,
If you do "load data local inpath", it doesn't move data from local to HDFS path, instead it copies
For your question
If you delete file in HDFS all the replicas are deleted.
If you have a 1gb file (8 blocks) with 3 replication factor, when you trigger the query in hive CLI, it converts your query to MR. It process only 8 blocks, in case of the datanode failure of the triggered job, it accesses the 2nd replica on a different node and processes the data (speculative execution)
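If it helps to see the move-vs-copy distinction, here is a hedged sketch run through Spark SQL for consistency with the rest of this page (the table and paths are hypothetical; the same statements work from the Hive CLI):
from pyspark.sql import SparkSession
# Sketch only: LOAD DATA INPATH moves the file from its HDFS location into the
# table's warehouse directory, while LOAD DATA LOCAL INPATH copies it from the
# local filesystem and leaves the original in place.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS demo (line STRING) STORED AS TEXTFILE")
spark.sql("LOAD DATA INPATH '/staging/data.txt' INTO TABLE demo")          # moves within HDFS
spark.sql("LOAD DATA LOCAL INPATH '/home/user/data.txt' INTO TABLE demo")  # copies from local disk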

How to avoid reading old files from S3 when appending new data?

Once every 2 hours, a Spark job runs to convert some tgz files to Parquet.
The job appends the new data to an existing Parquet dataset in S3:
df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")
In the spark-submit output I can see that significant time is spent reading old Parquet files, for example:
16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading
16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key 'foo.parquet/id=123/day=2016-11-26/part-r-00003-e80419de-7019-4859-bbe7-dcd392f6fcd3.snappy.parquet' seeking to position '149195444'
It looks like this operation takes less than 1 second per file, but the number of files grows over time (each append adds new files), which makes me think my code will not scale.
Any ideas how to avoid reading old Parquet files from S3 if I just need to append new data?
I use EMR 4.8.2 and DirectParquetOutputCommitter:
sc._jsc.hadoopConfiguration().set('spark.sql.parquet.output.committer.class', 'org.apache.spark.sql.parquet.DirectParquetOutputCommitter')
I resolved this issue by writing the DataFrame to EMR HDFS and then using s3-dist-cp to upload the Parquet files to S3.
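A rough sketch of that workaround (paths are illustrative, and the DataFrame below stands in for the converted tgz batch from the question): stage the append on HDFS, where listing and renaming are cheap, then push the result to S3 with s3-dist-cp.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Stand-in for the batch produced by the tgz-to-parquet conversion
df = spark.createDataFrame([(123, "2016-11-26", 1.0)], ["id", "day", "value"])
# Append to HDFS first, so Spark never lists the old Parquet files in the bucket
df.write.mode("append").partitionBy("id", "day").parquet("hdfs:///staging/foo.parquet")
# Then copy the result to S3 outside of Spark, e.g. as an EMR step:
#   s3-dist-cp --src hdfs:///staging/foo.parquet --dest s3://myBucket/foo.parquet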
Switch this over to using Dynamic Partition Overwrite Mode:
.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
Also, avoid DirectParquetOutputCommitter; don't modify the committer at all - you will get better speed using the EMRFS file committer.
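A hedged sketch of that suggestion (the DataFrame below is a stand-in; the path and partition columns come from the question): with dynamic partition overwrite enabled, mode("overwrite") replaces only the partitions present in the incoming batch rather than the whole dataset.
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())
# Stand-in for the new batch of data
df = spark.createDataFrame([(123, "2016-11-26", 1.0)], ["id", "day", "value"])
# Only the (id, day) partitions contained in df are rewritten
df.write.mode("overwrite").partitionBy("id", "day").parquet("s3://myBucket/foo.parquet")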