Stream data into Silver Delta Table - apache-spark-sql

I have one delta table name "ali" . I have read stream from that delta table :
from pyspark.sql.functions import col
streamDF = spark.readStream.format('delta').load('dbfs:/user/hive/warehouse/pdf/ali')
display(streamDF)
Now I want write my data stream into my silver delta table :
(streamDF
.writeStream
.format("delta")
.option("checkpointLocation" ,"dbfs:/user/hive/warehouse/pdf/ali") #allows us to pick up where we left off if we lose connectivity
.outputMode("append") # appends data to our table
.option("mergeSchema","true")
.start('dbfs:/tmp/hive/ali4') )
Spark job continuously in running stage :
What should I do ?
Thanks

These are streaming jobs and will be in running mode till you manually cancel it. It will stream the data from source delta (Say bronze) and it will write to destination delta table (Say Silver).
You will see a graph showing whats the input rate and whats the output rate.
This page has more info around streaming UI of spark services - https://www.databricks.com/blog/2020/07/29/a-look-at-the-new-structured-streaming-ui-in-apache-spark-3-0.html
If you want to validate, you can perform below.
-> Open a new notebook instead of adding a new cell below your write stream code and add a new cell like this - spark.read.format("delta").load("silver location")
-> ingest some sample data to your source delta table (In your case it would be bronze delta location)
-> Run the cell you created in first step where you are reading the data from silver and you should see the newly ingested data flowing into silver(destination) from bronze location(Source).

Related

Spark streaming writing as delta and checkpoint location

I am trying to stream from a delta table as a source and then also writing as delta after performing some transformations. so, this all worked. I recently looked at some videos and posts about best practices and found that I needed to do an additional thing and a modification.
The addition was adding queryName
Changing the checkpoint location, so that it resides alongside the data and not in a separate directory , like I was doing.
So, I have one question and a problem
Question is- can I add the queryName now, after my stream has been running for sometime , without any consequences?
and the problem, is: Now, that I have put my checkpoint location as the same directory as my delta table would be , I can't seem to create an external hive table anymore , it seems. It fails with
pyspark.sql.utils.AnalysisException: Cannot create table ('`spark_catalog`.`schemaname`.`tablename`'). The associated location ('abfss://refined#datalake.dfs.core.windows.net/curated/schemaname/tablename') is not empty but it's not a Delta table
So, this was my original code, which worked
def upsert(microbatchdf, batchId):
.....some transformations on microbatchdf
..........................
..........................
# Create Delta table beforehand as otherwise generated columns can't be created
# after having written the data into the data lake with the usual partionBy
deltaTable = (
DeltaTable.createIfNotExists(spark)
.tableName(f"{target_schema_name}.{target_table_name}")
.addColumns(microbatchdf_deduplicated.schema)
.addColumn(
"trade_date_year",
"INT",
generatedAlwaysAs="Year(trade_date) ",
)
.addColumn(
"trade_date_month",
"INT",
generatedAlwaysAs="MONTH(trade_date)",
)
.addColumn("trade_date_day", "INT", generatedAlwaysAs="DAY(trade_date)")
.partitionedBy("trade_date_year", "trade_date_month", "trade_date_day")
.location(
f"abfss://{target_table_location_filesystem}#{datalakename}.dfs.core.windows.net/{target_table_location_directory}"
)
.execute()
)
.....some transformations and writing to the delta table
#end
#this is how the stream is run
streamjob = (
spark.readStream.format("delta")
.table(f"{source_schema_name}.{source_table_name}")
.writeStream.format("delta")
.outputMode("append")
.foreachBatch(upsert)
.trigger(availableNow=True)
.option(
"checkpointLocation",
f"abfss://{target_table_location_filesystem}#{datalakename}.dfs.core.windows.net/curated/checkpoints/",
)
.start()
)
streamjob.awaitTermination()
Now, to this working piece , I only tried adding the queryName and modifying the checkpoint location (see comment for the modification and addition)
streamjob = (
spark.readStream.format("delta")
.table(f"{source_schema_name}.{source_table_name}")
.writeStream.format("delta")
.queryName(f"{source_schema_name}.{source_table_name}") # this added
.outputMode("append")
.foreachBatch(upsert)
.trigger(availableNow=True)
.option(
"checkpointLocation",
f"abfss://{target_table_location_filesystem}#{datalakename}.dfs.core.windows.net/{target_table_location_directory}/_checkpoint", # this changed
)
.start()
)
streamjob.awaitTermination()
In my datalake the _checkpoint did get created and apparently for this folder, the external table creation complains of non empty folder, whereas the documentation here, mentions that
So, why is the external hive table creation fails then? Also, please note my question about the queryName addition to an already running stream.
Point to note is- I have tried dropping the external table and also removed the contents of that directory, so there is nothing in that directory except the _checkpoint folder Which got created when I ran the streaming job , just before it got to creating the table inside the upsert method.
Any questions and I can help clarify.
The problem is that checkpoint files are put before you call the ``DeltaTable.createIfNotExists` function that checks if you have any data in that location or not, and fails because additional files are there, but they don't belong to the Delta Lake table.
If you want to keep checkpoint with your data, you need to put DeltaTable.createIfNotExists(spark)... outside of the upsert function - in this case, table will be created before any checkpoint files are created.

Stream analytics time agnostic processing

my stream analytics jobs is not generating any output and watermark delay matrix keeps on increasing. for all other metrices like Input events, output events, cpu%(increases to 10%), runtime error, data conversion error there is no change (value is zero).
Input stream: 2 gb file in ADLS gen2. with file having multiple JSON objects (not JSON array)
Sample data:
{"a":"1"} //row 1
{"a":"2"} // row 2
{"a":"1"} // row 3
The data doesn't have any time field to indicate event time.
output stream: Cosmos db
late event set to 5 sec
out of order event set to 25 sec.
Expected output: i want job to write to output every X mins.
current output: job is not writing any data to output and all metric values are zero
Thanks

How to set up job dependencies in google bigquery?

I have a few jobs, say one is loading a text file from a google cloud storage bucket to bigquery table, and another one is a scheduled query to copy data from one table to another table with some transformation, I want the second job to depend on the success of the first one, how do we achieve this in bigquery if it is possible to do so at all?
Many thanks.
Best regards,
Right now a developer needs to put together the chain of operations.
It can be done either using Cloud Functions (supports, Node.js, Go, Python) or via Cloud Run container (supports gcloud API, any programming language).
Basically you need to
issue a job
get the job id
poll for the job id
job is finished trigger other steps
If using Cloud Functions
place the file into a dedicated GCS bucket
setup a GCF that monitors that bucket and when a new file is uploaded it will execute a function that imports into GCS - wait until the operations ends
at the end of the GCF you can trigger other functions for next step
another use case with Cloud Functions:
A: a trigger starts the GCF
B: function executes the query (copy data to another table)
C: gets a job id - fires another function with a bit of delay
I: a function gets a jobid
J: polls for job is ready?
K: if not ready, fires himself again with a bit of delay
L: if ready triggers next step - could be a dedicated function or parameterized function
It is possible to address your scenario with either cloud functions(CF) or with a scheduler (airflow). The first approach is event-driven getting your data crunch immediately. With the scheduler, expect data availability delay.
As it has been stated once you submit BigQuery job you get back job ID, that needs to be check till it completes. Then based on the status you can handle on success or failure post actions respectively.
If you were to develop CF, note that there are certain limitations like execution time (max 9min), which you would have to address in case BigQuery job takes more than 9 min to complete. Another challenge with CF is idempotency, making sure that if the same datafile event comes more than once, the processing should not result in data duplicates.
Alternatively, you can consider using some event-driven serverless open source projects like BqTail - Google Cloud Storage BigQuery Loader with post-load transformation.
Here is an example of the bqtail rule.
rule.yaml
When:
Prefix: "/mypath/mysubpath"
Suffix: ".json"
Async: true
Batch:
Window:
DurationInSec: 85
Dest:
Table: bqtail.transactions
Transient:
Dataset: temp
Alias: t
Transform:
charge: (CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)
SideInputs:
- Table: bqtail.fees
Alias: f
'On': t.fee_id = f.id
OnSuccess:
- Action: query
Request:
SQL: SELECT
DATE(timestamp) AS date,
sku_id,
supply_entity_id,
MAX($EventID) AS batch_id,
SUM( payment) payment,
SUM((CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)) charge,
SUM(COALESCE(qty, 1.0)) AS qty
FROM $TempTable t
LEFT JOIN bqtail.fees f ON f.id = t.fee_id
GROUP BY 1, 2, 3
Dest: bqtail.supply_performance
Append: true
OnFailure:
- Action: notify
Request:
Channels:
- "#e2e"
Title: Failed to aggregate data to supply_performance
Message: "$Error"
OnSuccess:
- Action: query
Request:
SQL: SELECT CURRENT_TIMESTAMP() AS timestamp, $EventID AS job_id
Dest: bqtail.supply_performance_batches
Append: true
- Action: delete
You want to use an orchestration tool, especially if you want to set up this tasks as recurring jobs.
We use Google Cloud Composer, which is a managed service based on Airflow, to do workflow orchestration and works great. It comes with automatically retry, monitoring, alerting, and much more.
You might want to give it a try.
Basically you can use Cloud Logging to know almost all kinds of operations in GCP.
BigQuery is no exception. When the query job completed, you can find the corresponding log in the log viewer.
The next question is how to anchor the exact query you want, one way to achieve this is to use labeled query (means attach labels to your query) [1].
For example, you can use below bq command to issue query with foo:bar label
bq query \
--nouse_legacy_sql \
--label foo:bar \
'SELECT COUNT(*) FROM `bigquery-public-data`.samples.shakespeare'
Then, when you go to Logs Viewer and issue below log filter, you will find the exactly log generated by above query.
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.labels.foo="bar"
The next question is how to emit an event based on this log for the next workload. Then, the Cloud Pub/Sub comes into play.
2 ways to publish an event based on log pattern are:
Log Routers: set Pub/Sub topic as the destination [1]
Log-based Metrics: create alert policy whose notification channel is Pub/Sub [2]
So, the next workload can subscribe to the Pub/Sub topic, and be triggered when the previous query has completed.
Hope this helps ~
[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfiguration
[2] https://cloud.google.com/logging/docs/routing/overview
[3] https://cloud.google.com/logging/docs/logs-based-metrics

Cloud Storage to BigQuery (upsert) via DataFlow

Whenever a file is written to Cloud Storage, I want it to trigger a Cloud Function that executes a DataFlow template to transform the file content and write the results to BigQuery.
I think I got a handle that much for the most part. But the problem is that I don't need to just insert into a BQ table, I need to upsert (using the Merge operation). This seems like it would be a common requirement, but the Apache Beam BQ connector doesn't offer this option (only write, create and truncate/write).
So then I thought... OK, if I can just capture when the DataFlow pipeline is done executing, I could have DataFlow write to a temporary table and then I could call a SQL Merge query to merge data from the temp table to the target table. However, I'm not seeing any way to trigger a cloud function upon pipeline execution completion.
Any suggestions on how to accomplish the end goal?
Thanks
There is no native built in solution to generate an event at the end of Dataflow job. However, you can cheat thanks to the logs.
For this:
Go to logs, select advanced filter (arrow on the right of the filter bar) and paste this custom filter:
resource.type="dataflow_step" textPayload="Worker pool stopped."
You should see only your end of dataflow. Then, you have to create a sink into PubSub of this result. Then, you have to plug your function on these PubSub messages and you can do what you want.
For this, after having filling up your custom filter
Click on create sink
Set a sink name
Set the destination to PubSub
Select your topic
Now, plug a function on this topic, it will be trigger only at the end of dataflow.
I have implemented the exact use case, but instead of using 2 different pipeline, you can just create 1 pipeline.
Step 1: Read file from gcs and convert it into TableRow.
Step 2: Read the entire row from BigQuery.
Step 3: Create 1 pardo where you have your custom upsert operation like below code.
PCollection<KV<String,TableRow>> val = p.apply(BigQueryIO.readTableRows().from(""));
PCollection<KV<String,TableRow>> val1 = p.apply(TextIO.read().from("")).apply(Convert to TableRow()));
Step 4: Perform CoGroupByKey and perform pardo on top of that result to get the updated one(equivalent to MERGE OPERATION).
Step 5: Insert the complete TableRow to BQ using WRITE_TRUNCATE mode.
Here the code part would be little bit complicate, but that would perform better using single pipeline.
Interesting question, some good ideas already but I'd like to show another possibility with just Dataflow and BigQuery. If this is a non-templated Batch job we can use PipelineResult.waitUntilFinish() which:
Waits until the pipeline finishes and returns the final status.
Then we check if State is DONE and proceed with the MERGE statement if needed:
PipelineResult res = p.run();
res.waitUntilFinish();
if (res.getState() == PipelineResult.State.DONE) {
LOG.info("Dataflow job is finished. Merging results...");
MergeResults();
LOG.info("All done :)");
}
In order to test this we can create a BigQuery table (upsert.full) which will contain the final results and be updated each run:
bq mk upsert
bq mk -t upsert.full name:STRING,total:INT64
bq query --use_legacy_sql=false "INSERT upsert.full (name, total) VALUES('tv', 10), ('laptop', 20)"
at the start we'll populate it with a total of 10 TVs. But now let's imagine that we sell 5 extra TVs and, in our Dataflow job, we'll write a single row to a temporary table (upsert.temp) with the new corrected value (15):
p
.apply("Create Data", Create.of("Start"))
.apply("Write", BigQueryIO
.<String>write()
.to(output)
.withFormatFunction(
(String dummy) ->
new TableRow().set("name", "tv").set("total", 15))
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withSchema(schema));
So now we want to update the original table with the following query (DML syntax):
MERGE upsert.full F
USING upsert.temp T
ON T.name = F.name
WHEN MATCHED THEN
UPDATE SET total = T.total
WHEN NOT MATCHED THEN
INSERT(name, total)
VALUES(name, total)
Therefore, we can use BigQuery's Java Client Library in MergeResults:
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
QueryJobConfiguration queryConfig =
QueryJobConfiguration.newBuilder(
"MERGE upsert.full F "
+ ...
+ "VALUES(name, total)")
.setUseLegacySql(false)
.build();
JobId jobId = JobId.of(UUID.randomUUID().toString());
Job queryJob = bigquery.create(JobInfo.newBuilder(queryConfig).setJobId(jobId).build());
This is based on this snippet which includes some basic error handling. Note that you'll need to add this to your pom.xml or equivalent:
<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>google-cloud-bigquery</artifactId>
<version>1.82.0</version>
</dependency>
and it works for me:
INFO: 2020-02-08T11:38:56.292Z: Worker pool stopped.
Feb 08, 2020 12:39:04 PM org.apache.beam.runners.dataflow.DataflowPipelineJob logTerminalState
INFO: Job 2020-02-08_REDACTED finished with status DONE.
Feb 08, 2020 12:39:04 PM org.apache.beam.examples.BigQueryUpsert main
INFO: Dataflow job is finished. Merging results...
Feb 08, 2020 12:39:09 PM org.apache.beam.examples.BigQueryUpsert main
INFO: All done :)
$ bq query --use_legacy_sql=false "SELECT name,total FROM upsert.full LIMIT 10"
+--------+-------+
| name | total |
+--------+-------+
| tv | 15 |
| laptop | 20 |
+--------+-------+
Tested with the 2.17.0 Java SDK and both the Direct and Dataflow runners.
Full example here

Spark : Data processing using Spark for large number of files says SocketException : Read timed out

I am running Spark in standalone mode on 2 machines which have these configs
500gb memory, 4 cores, 7.5 RAM
250gb memory, 8 cores, 15 RAM
I have created a master and a slave on 8core machine, giving 7 cores to worker. I have created another slave on 4core machine with 3 worker cores. The UI shows 13.7 and 6.5 G usable RAM for 8core and 4core respectively.
Now on this I have to process an aggregate of user ratings over a period of 15 days. I am trying to do this using Pyspark
This data is stored in hourwise files in day-wise directories in an s3 bucket, every file must be around 100MB eg
s3://some_bucket/2015-04/2015-04-09/data_files_hour1
I am reading the files like this
a = sc.textFile(files, 15).coalesce(7*sc.defaultParallelism) #to restrict partitions
where files is a string of this form 's3://some_bucket/2015-04/2015-04-09/*,s3://some_bucket/2015-04/2015-04-09/*'
Then I do a series of maps and filters and persist the result
a.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to do a reduceByKey to get an aggregate score over the span of days.
b = a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to make a redis call for the actual terms for the items the user has rated, so I call mapPartitions like this
final_scores = b.mapPartitions(get_tags)
get_tags function creates a redis connection each time of invocation and calls redis and yield a (user, item, rate) tuple
(The redis hash is stored in the 4core)
I have tweaked the settings for SparkConf to be at
conf = (SparkConf().setAppName(APP_NAME).setMaster(master)
.set("spark.executor.memory", "5g")
.set("spark.akka.timeout", "10000")
.set("spark.akka.frameSize", "1000")
.set("spark.task.cpus", "5")
.set("spark.cores.max", "10")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.max.mb", "10")
.set("spark.shuffle.consolidateFiles", "True")
.set("spark.files.fetchTimeout", "500")
.set("spark.task.maxFailures", "5"))
I run the job with driver-memory of 2g in client mode, since cluster mode doesn't seem to be supported here.
The above process takes a long time for 2 days' of data (around 2.5hours) and completely gives up on 14 days'.
What needs to improve here?
Is this infrastructure insufficient in terms of RAM and cores (This is offline and can take hours, but it has got to finish in 5 hours or so)
Should I increase/decrease the number of partitions?
Redis could be slowing the system, but the number of keys is just too huge to make a one time call.
I am not sure where the task is failing, in reading the files or in reducing.
Should I not use Python given better Spark APIs in Scala, will that help with efficiency as well?
This is the exception trace
Lost task 4.1 in stage 0.0 (TID 11, <node>): java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:227)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at org.apache.http.util.EntityUtils.consume(EntityUtils.java:88)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:126)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:236)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
I could really use some help, thanks in advance
Here is what my main code looks like
def main(sc):
f=get_files()
a=sc.textFile(f, 15)
.coalesce(7*sc.defaultParallelism)
.map(lambda line: line.split(","))
.filter(len(line)>0)
.map(lambda line: (line[18], line[2], line[13], line[15])).map(scoring)
.map(lambda line: ((line[0], line[1]), line[2])).persist(StorageLevel.MEMORY_ONLY_SER)
b=a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
c=taggings.mapPartitions(get_tags)
c.saveAsTextFile("f")
a.unpersist()
b.unpersist()
The get_tags function is
def get_tags(partition):
rh = redis.Redis(host=settings['REDIS_HOST'], port=settings['REDIS_PORT'], db=0)
for element in partition:
user = element[0]
song = element[1]
rating = element[2]
tags = rh.hget(settings['REDIS_HASH'], song)
if tags:
tags = json.loads(tags)
else:
tags = scrape(song, rh)
if tags:
for tag in tags:
yield (user, tag, rating)
The get_files function is as:
def get_files():
paths = get_path_from_dates(DAYS)
base_path = 's3n://acc_key:sec_key#bucket/'
files = list()
for path in paths:
fle = base_path+path+'/file_format.*'
files.append(fle)
return ','.join(files)
The get_path_from_dates(DAYS) is
def get_path_from_dates(last):
days = list()
t = 0
while t <= last:
d = today - timedelta(days=t)
path = d.strftime('%Y-%m')+'/'+d.strftime('%Y-%m-%d')
days.append(path)
t += 1
return days
As a small optimization, I have created two separate tasks, one to read from s3 and get additive sum, second to read transformations from redis. The first tasks has high number of partitions since there are around 2300 files to read. The second one has much lesser number of partitions to prevent redis connection latency, and there is only one file to read which is on the EC2 cluster itself. This is only partial, still looking for suggestions to improve ...
I was in a similar usecase: doing coalesce on a RDD with 300,000+ partitions. The difference is that I was using s3a(SocketTimeoutException from S3AFileSystem.waitAysncCopy). Finally the issue was resolved by setting a larger fs.s3a.connection.timeout(Hadoop's core-site.xml). Hopefully you can get a clue.