How to disable the rollups tables in OpsCenter keyspace? - datastax

We are using DSE 4.8.8 with OpsCenter 5.2.4. I don't want to collect 1min and 5min metrics. Here is my cluster config file settings.
[cassandra_metrics]
1min_ttl = -1
5min_ttl = -1
2hr_ttl = 7776000
24hr_ttl = 31536000
I have the same issue with 4.8.15 DSE. According to the documentation, TTL value of -1 should disable the metric collection. I restarted Opscenter. However, I still see the 1min and 5min metrics being collected. I even truncated the tables and clear the snapshots to make sure there is no old records. I check the rollups60 and rollups300 directories and the number of files and size of the files grow over time. How to disable the Cassandra metrics collection?

Related

How to run Airflow S3 sensor exactly once?

I want to continue the DAG only if a csv file exists in S3, otherwise it should just end.
The DAG itself is being scheduled hourly.
with DAG(dag_id="my_dag",
start_date=datetime(2023, 1, 1),
schedule_interval='#hourly',
catchup=False
) as dag:
check_for_new_csv = S3KeySensor(
task_id='check_for_new_csv',
bucket_name='bucket-data',
bucket_key='*.csv',
wildcard_match=True,
soft_fail=True,
retries=1
)
start_instance = EC2StartInstanceOperator(
task_id="start_ec2_instance_task",
instance_id=INSTANCE_ID,
region_name=REGION
)
check_for_new_csv >> start_instance
But the sensor seems to run forever - in the log I can see it keeps on running:
[2023-01-10, 15:02:06 UTC] {s3.py:98} INFO - Poking for key : s3://bucket-data/*.csv
[2023-01-10, 15:03:08 UTC] {s3.py:98} INFO - Poking for key : s3://bucket-data/*.csv
Maybe the sensor in not the best choice for such logic?
A sensor is a perfect choice for this use case. I'd try setting the poke_interval and timeout to different smaller values than their default to make sure the sensor that Airflow is checking on the right intervals (by default, they are very long).
One thing to watch out for is if your sensors run on longer intervals than your schedule interval. For example, if your DAG is scheduled to run hourly, but your sensors's timeout is set for 2 hours, your next DAG run may not run as expected (depending on your concurrency and max_active_dag settings), or it may run unexpectedly because the sensor detects an older file. Ideally, you can append a timestamp in the name of your file to avoid this.

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application, that reads something like 70K records from the DB to a data frame, each record has 2 fields.
After reading the data from the DB, we make minor mapping and load this as a broadcast for later usage.
Now, in local environment, there is an exception, timeout from the RetryingBlockFetcher while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR
org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while
beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
at
org.apache.spark.network.client.
TransportClientFactory.createClient(Transpor .tClientFactory.java:253)
at
org.apache.spark.network.client.
TransportClientFactory.createClient(TransportClientFactory.java:195)
at
org.apache.spark.network.netty.
NettyBlockTransferService$$anon$2.
createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the spark session with local "spark.master"
When I limit the max of records to 20K, it works well.
Can you please help? maybe I need to configure something in my local environment in order that the original code will work properly?
Update:
I tried to change a lot of Spark-related configurations in my local environment, both memory, a number of executors, timeout-related settings, and more, but nothing helped! I just got the timeout after more time...
I realized that the data frame that I'm reading from the DB has 1 partition of 62K records, while trying to repartition with 2 or more partitions the process worked correctly and I managed to map and collect as needed.
Any idea why this solves the issue? Is there a configuration in the spark that can solve this instead of repartition?
Thanks!

Janusgraph not using index in production

Problem
When performing queries in my production environment, the index is not being used and a full scan is performed, but my development environment works fine and uses the index.
After looking deeper at the problem in production, it also seems that the index information is being saved to the storage backend, but the data is not, and is being stored locally. I have no idea why this is...
I will explain the architecture now:
Environments
The following describe my two environments. Important to note, the index in question is a composite index, as such uses the storage backend, but I still included the index-backend in the architecture environment (aka Elasticsearch).
Both local and production environment versions are the same, i.e:
Janusgraph: 0.5.2
ScyllaDB: 0.5.2
Elasticsearch: 7.13.1
Local Environment
Services are running in docker-compose, consisting of a single Janusgraph instance, a single ScyllaDB instance, and a single Elasticsearch Instance.
Production Environment
Running on AWS, kubernetes cluster managed with EKS, I have multiple janusgraph deployments, which connect to a ScyllaDB cluster (in the same k8s cluster), which is done via Scylla For Kubernetes (https://operator.docs.scylladb.com/stable/), and an Elasticsearch cluster.
Setup
The following will give the simplest example I can that contains the problems I describe.
I pre-create the index's with the Janusgraph management system, such as:
# management.groovy
import org.janusgraph.graphdb.database.management.ManagementSystem
cluster = Cluster.open("/opt/janusgraph/my_scripts/gremlin.yaml")
client = cluster.connect()
graph = JanusGraphFactory.open("/opt/janusgraph/my_scripts/env.properties")
g = graph.traversal().withRemote(DriverRemoteConnection.using(client, "g"))
m = graph.openManagement()
uid_property = m.makePropertyKey("uid").dataType(String).make()
user_label = m.makeVertexLabel("User").make()
m.buildIndex("index::User::uid", Vertex.class).addKey(uid_property).indexOnly(user_label).buildCompositeIndex()
m.commit()
Upon inspection with m.printSchema() I can see that the index's are ENABLED, in both my local environment and production environment.
I proceed to import all the data that needs to exist on the graph, both local env and production env are OK.
Performing Queries
The following outline what happens when I run a query
Local Environment
What we see here is a simple lookup just to check that the query is using the index:
gremlin> g.V().has("User", "uid", "00003b90-dcc2-494d-a179-ac9009029501").profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
JanusGraphStep([],[~label.eq(User), uid.eq(... 1 1 1.837 100.00
\_condition=(~label = User AND uid = 00003b90-dcc2-494d-a179-ac9009029501)
\_isFitted=true
\_query=multiKSQ[1]#4000
\_index=index::User::uid
\_orders=[]
\_isOrdered=true
optimization 0.038
optimization 0.497
backend-query 1 0.901
\_query=index::User::uid:multiKSQ[1]#4000
\_limit=4000
>TOTAL - - 1.837 -
Production Environment
Again, we run the query to see if it using the index (which it is not)
g.V().has("User", "uid", "00003b90-dcc2-494d-a179-ac9009029501").profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
JanusGraphStep([],[~label.eq(User), uid.eq(... 1 1 11296.568 100.00
\_condition=(~label = User AND uid = 00003b90-dcc2-494d-a179-ac9009029501)
\_isFitted=false
\_query=[]
\_orders=[]
\_isOrdered=true
optimization 0.025
optimization 0.102
scan 0.000
\_condition=VERTEX
\_query=[]
\_fullscan=true
>TOTAL - - 11296.568 -
What Happened? So far my best guess:
The storage backend is NOT being used for storing data, but is being used for storing information about the indexes
Update: Aug 16 2021, after digging around some more I found out something interesting
It is now clear that the data is actually not being saved to the storage backend at all.
In my local environment I set the storage.directory environment variable to /var/lib/janusgraph/data, which mounts onto an empty directory, this directory remains empty. Any vertex/edge updates get's saved to the scyllaDB storage backend, and the data persists between janusgraph instance restarts.
In my production environment, this directory (/var/lib/janusgraph/data) is populated with files:
-rw-r--r-- 1 janusgraph janusgraph 0 Aug 16 05:46 je.lck
-rw-r--r-- 1 janusgraph janusgraph 9650 Aug 16 05:46 je.config.csv
-rw-r--r-- 1 janusgraph janusgraph 450 Aug 16 05:46 je.info.0
-rw-r--r-- 1 janusgraph janusgraph 0 Aug 16 05:46 je.info.0.lck
drwxr-xr-x 2 janusgraph janusgraph 118 Aug 16 05:46 .
-rw-r--r-- 1 janusgraph janusgraph 7533 Aug 16 05:46 00000000.jdb
drwx------ 1 janusgraph janusgraph 75 Aug 16 05:53 ..
-rw-r--r-- 1 janusgraph janusgraph 19951 Aug 16 06:09 je.stat.csv
and any subsequent updates on the graph seem to be reflected here, the update do not get put onto the storage backend, and other janusgraph instances on kubernetes cannot see any changes other instances make, leading me to come to the conclusion, the storage backend is not being used for storing data
The domain name used for the storage.hostname and index.hostname both resolve to IP address's, confirmed with using nslookup.
The endpoints must also work, as the keyspace janusgraph is created, and also has a different replication factor that I defined, and also retains the index information regardless of restarting the janusgraph instances.
Idea 1 (Index is not enabled)
This was disproved via running m.printSchema() showing that all the index's were ENABLED
Idea 2 (Storage backends have different data)
I looked at the data stored in scylladb, and got a summary with nodetool cfstats, this does show something different:
# Local
Keyspace : janusgraph
Read Count: 1688328
Read Latency: 2.5682805710738673E-5 ms
Write Count: 1055210
Write Latency: 1.702409946835227E-5 ms
...
Memtable cell count: 126411
Memtable data size: 345700491
Memtable off heap memory used: 480247808
# Production
Keyspace : janusgraph
Read Count: 6367
Read Latency: 2.1203078372860058E-5 ms
Write Count: 21
Write Latency: 0.0 ms
...
Memtable cell count: 4
Memtable data size: 10092
Memtable off heap memory used: 131072
Although I don't know how to explain the difference, it is clear that both backends contain all the data, verified with various count() queries over labels, such as g.V().hasLabel("User").count(), which both environments report the same result
Idea 3 (Elasticsearch Warnings)
When launching a gremlin console session, there is a difference in that the production environment shows:
07:27:09 WARN org.elasticsearch.client.RestClient - request [PUT http://*******<i_removed_the_domain>******:9200/_cluster/settings] returned 3 warnings: [299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.data] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."],[299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.master] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."],[299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.ml] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."]
but as my problem is using composite index's, I believe we can disregard elasticsearch warnings.
Idea 4 (ScyllaDB cluster node resources)
Another idea I had was increasing the node resources, even with 7gb RAM, the problem still persists.
Finally...
I don't know what to try next in order to solve this problem, this is my first time pushing Janusgraph into production and perhaps I have missed something important. I have been stuck on this problem for quite a while, hence now asking the community here for help.
Thank you very much for reading this for, and hopefully helping me to solve this problem
I solved the problem myself, I realised that my K8s Deployment .yaml file I use for deploying needed all environment variables to have the prefix janusgraph., as such the janusgraph server was starting with all default variables rather than my selected ones.
Every-time I was creating a gremlin shell session (which connected to it's localhost server), although I was specifying all the correct endpoints and configuration, it was still saving the data according to default janusgraph variables. Although, even in this case, I don't know why the index's were successfully created on my specified backend.
But none the less, the solution was to make sure environment variables have the prefix janusgraph.

How to scale down a CrateDB cluster?

For testing, I wanted to shrink my 3 node cluster to 2 nodes, to later go and do the same thing for my 5 node cluster.
However, after following the best practice of shrinking a cluster:
Back up all tables
For all tables: alter table xyz set (number_of_replicas=2) if it was less than 2 before
SET GLOBAL PERSISTENT discovery.zen.minimum_master_nodes = <half of the cluster + 1>;
3 a. If the data check should always be green, set the min_availability to 'full':
https://crate.io/docs/reference/configuration.html#graceful-stop
Initiate graceful stop on one node
Wait for the data check to turn green
Repeat from 3.
When done, persist the node configurations in crate.yml:
gateway.recover_after_nodes: n
discovery.zen.minimum_master_nodes:[![enter image description here][1]][1] (n/2) +1
gateway.expected_nodes: n
My cluster never went back to "green" again, and I also have a critical node check failing.
What went wrong here?
crate.yml:
...
################################## Discovery ##################################
# Discovery infrastructure ensures nodes can be found within a cluster
# and master node is elected. Multicast discovery is the default.
# Set to ensure a node sees M other master eligible nodes to be considered
# operational within the cluster. Its recommended to set it to a higher value
# than 1 when running more than 2 nodes in the cluster.
#
# We highly recommend to set the minimum master nodes as follows:
# minimum_master_nodes: (N / 2) + 1 where N is the cluster size
# That will ensure a full recovery of the cluster state.
#
discovery.zen.minimum_master_nodes: 2
# Set the time to wait for ping responses from other nodes when discovering.
# Set this option to a higher value on a slow or congested network
# to minimize discovery failures:
#
# discovery.zen.ping.timeout: 3s
#
# Time a node is waiting for responses from other nodes to a published
# cluster state.
#
# discovery.zen.publish_timeout: 30s
# Unicast discovery allows to explicitly control which nodes will be used
# to discover the cluster. It can be used when multicast is not present,
# or to restrict the cluster communication-wise.
# For example, Amazon Web Services doesn't support multicast discovery.
# Therefore, you need to specify the instances you want to connect to a
# cluster as described in the following steps:
#
# 1. Disable multicast discovery (enabled by default):
#
discovery.zen.ping.multicast.enabled: false
#
# 2. Configure an initial list of master nodes in the cluster
# to perform discovery when new nodes (master or data) are started:
#
# If you want to debug the discovery process, you can set a logger in
# 'config/logging.yml' to help you doing so.
#
################################### Gateway ###################################
# The gateway persists cluster meta data on disk every time the meta data
# changes. This data is stored persistently across full cluster restarts
# and recovered after nodes are started again.
# Defines the number of nodes that need to be started before any cluster
# state recovery will start.
#
gateway.recover_after_nodes: 3
# Defines the time to wait before starting the recovery once the number
# of nodes defined in gateway.recover_after_nodes are started.
#
#gateway.recover_after_time: 5m
# Defines how many nodes should be waited for until the cluster state is
# recovered immediately. The value should be equal to the number of nodes
# in the cluster.
#
gateway.expected_nodes: 3
So there are two things that are important:
The number of replicas is essentially the number of nodes you can loose in a typical setup (2 is recommended so that you can scale down AND loose a node in the process and still be ok)
The procedure is recommended for clusters > 2 nodes ;)
CrateDB will automatically distribute the shards across the cluster in a way that no replica and primary share a node. If that is not possible (which is the case if you have 2 nodes and 1 primary with 2 replicas, the data check will never return to 'green'. So in your case, set the number of replicas to 1 in order to get the cluster back to green (alter table mytable set (number_of_replicas = 1)).
The critical node check is due to the cluster not having received an updated crate.yml yet: Your file also still has the configuration of a 3-node cluster in it, hence the message. Since CrateDB only loads the expected_nodes at startup (it's not a runtime setting), a restart of the whole cluster is required to conclude scaling down. It can be done with a rolling restart, but be sure to set SET GLOBAL PERSISTENT discovery.zen.minimum_master_nodes = <half of the cluster + 1>; properly, otherwise the consensus will not work...
Also, it's recommended to scale down one-by-one in order to avoid overloading the cluster with rebalancing and accidentally loosing data.

Spark : Data processing using Spark for large number of files says SocketException : Read timed out

I am running Spark in standalone mode on 2 machines which have these configs
500gb memory, 4 cores, 7.5 RAM
250gb memory, 8 cores, 15 RAM
I have created a master and a slave on 8core machine, giving 7 cores to worker. I have created another slave on 4core machine with 3 worker cores. The UI shows 13.7 and 6.5 G usable RAM for 8core and 4core respectively.
Now on this I have to process an aggregate of user ratings over a period of 15 days. I am trying to do this using Pyspark
This data is stored in hourwise files in day-wise directories in an s3 bucket, every file must be around 100MB eg
s3://some_bucket/2015-04/2015-04-09/data_files_hour1
I am reading the files like this
a = sc.textFile(files, 15).coalesce(7*sc.defaultParallelism) #to restrict partitions
where files is a string of this form 's3://some_bucket/2015-04/2015-04-09/*,s3://some_bucket/2015-04/2015-04-09/*'
Then I do a series of maps and filters and persist the result
a.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to do a reduceByKey to get an aggregate score over the span of days.
b = a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to make a redis call for the actual terms for the items the user has rated, so I call mapPartitions like this
final_scores = b.mapPartitions(get_tags)
get_tags function creates a redis connection each time of invocation and calls redis and yield a (user, item, rate) tuple
(The redis hash is stored in the 4core)
I have tweaked the settings for SparkConf to be at
conf = (SparkConf().setAppName(APP_NAME).setMaster(master)
.set("spark.executor.memory", "5g")
.set("spark.akka.timeout", "10000")
.set("spark.akka.frameSize", "1000")
.set("spark.task.cpus", "5")
.set("spark.cores.max", "10")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.max.mb", "10")
.set("spark.shuffle.consolidateFiles", "True")
.set("spark.files.fetchTimeout", "500")
.set("spark.task.maxFailures", "5"))
I run the job with driver-memory of 2g in client mode, since cluster mode doesn't seem to be supported here.
The above process takes a long time for 2 days' of data (around 2.5hours) and completely gives up on 14 days'.
What needs to improve here?
Is this infrastructure insufficient in terms of RAM and cores (This is offline and can take hours, but it has got to finish in 5 hours or so)
Should I increase/decrease the number of partitions?
Redis could be slowing the system, but the number of keys is just too huge to make a one time call.
I am not sure where the task is failing, in reading the files or in reducing.
Should I not use Python given better Spark APIs in Scala, will that help with efficiency as well?
This is the exception trace
Lost task 4.1 in stage 0.0 (TID 11, <node>): java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:227)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at org.apache.http.util.EntityUtils.consume(EntityUtils.java:88)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:126)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:236)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
I could really use some help, thanks in advance
Here is what my main code looks like
def main(sc):
f=get_files()
a=sc.textFile(f, 15)
.coalesce(7*sc.defaultParallelism)
.map(lambda line: line.split(","))
.filter(len(line)>0)
.map(lambda line: (line[18], line[2], line[13], line[15])).map(scoring)
.map(lambda line: ((line[0], line[1]), line[2])).persist(StorageLevel.MEMORY_ONLY_SER)
b=a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
c=taggings.mapPartitions(get_tags)
c.saveAsTextFile("f")
a.unpersist()
b.unpersist()
The get_tags function is
def get_tags(partition):
rh = redis.Redis(host=settings['REDIS_HOST'], port=settings['REDIS_PORT'], db=0)
for element in partition:
user = element[0]
song = element[1]
rating = element[2]
tags = rh.hget(settings['REDIS_HASH'], song)
if tags:
tags = json.loads(tags)
else:
tags = scrape(song, rh)
if tags:
for tag in tags:
yield (user, tag, rating)
The get_files function is as:
def get_files():
paths = get_path_from_dates(DAYS)
base_path = 's3n://acc_key:sec_key#bucket/'
files = list()
for path in paths:
fle = base_path+path+'/file_format.*'
files.append(fle)
return ','.join(files)
The get_path_from_dates(DAYS) is
def get_path_from_dates(last):
days = list()
t = 0
while t <= last:
d = today - timedelta(days=t)
path = d.strftime('%Y-%m')+'/'+d.strftime('%Y-%m-%d')
days.append(path)
t += 1
return days
As a small optimization, I have created two separate tasks, one to read from s3 and get additive sum, second to read transformations from redis. The first tasks has high number of partitions since there are around 2300 files to read. The second one has much lesser number of partitions to prevent redis connection latency, and there is only one file to read which is on the EC2 cluster itself. This is only partial, still looking for suggestions to improve ...
I was in a similar usecase: doing coalesce on a RDD with 300,000+ partitions. The difference is that I was using s3a(SocketTimeoutException from S3AFileSystem.waitAysncCopy). Finally the issue was resolved by setting a larger fs.s3a.connection.timeout(Hadoop's core-site.xml). Hopefully you can get a clue.