Writing Spark checkpoints to S3 is too slow

I'm using Spark Streaming 1.5.2 and I am ingesting data from Kafka using the Direct Stream approach.
I have enabled checkpoints so that my Driver can be restarted and pick up where it left off without losing unprocessed data.
Checkpoints are written to S3 as I'm on Amazon AWS and not running on top of a Hadoop cluster.
The batch interval is 1 second because I want low latency.
The issue is that it takes from 1 to 20 seconds to write a single checkpoint to S3. The checkpoints back up in memory and, eventually, the application fails.
2016-04-28 18:26:55,483 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882407000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882407000', took 6071 bytes and 1724 ms
2016-04-28 18:26:58,812 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882407000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882407000', took 6024 bytes and 3329 ms
2016-04-28 18:27:00,327 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882408000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882408000', took 6068 bytes and 1515 ms
2016-04-28 18:27:06,667 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882408000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882408000', took 6024 bytes and 6340 ms
2016-04-28 18:27:11,689 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882409000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882409000', took 6067 bytes and 5022 ms
2016-04-28 18:27:15,982 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882409000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882409000', took 6024 bytes and 4293 ms
Is there a way to increase the interval between checkpoints without increasing the batch interval?

Yes, you can achieve that with the checkpointInterval parameter. You set the duration when enabling the checkpoint, as described in the documentation quoted below.
Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may cause an increase in the processing time of those batches where RDDs get checkpointed. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects. For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
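The rule quoted above (a default that is a multiple of the batch interval and at least 10 seconds, and a suggested range of 5 - 10 sliding intervals) can be worked through in a few lines. This is only a sketch of the arithmetic: the helper names are made up for illustration, and it assumes the default is the smallest such multiple. It covers the RDD checkpoint interval described in the quoted passage and does not call Spark itself.

```python
import math


def default_checkpoint_interval(batch_seconds, minimum_seconds=10.0):
    """Smallest multiple of the batch interval that is >= minimum_seconds.

    Hypothetical helper: it mirrors the documented rule, assuming the
    smallest qualifying multiple is the one that gets picked.
    """
    if batch_seconds <= 0:
        raise ValueError("batch_seconds must be positive")
    multiples = math.ceil(minimum_seconds / batch_seconds)
    return batch_seconds * multiples


def suggested_checkpoint_range(slide_seconds):
    """The '5 - 10 sliding intervals' range from the quoted documentation."""
    return (5 * slide_seconds, 10 * slide_seconds)


batch_seconds = 1  # the 1-second batch interval from the question
print(default_checkpoint_interval(batch_seconds))
print(suggested_checkpoint_range(batch_seconds))

# The chosen value would then be passed to the stream, as in the docs:
#     dstream.checkpoint(checkpointInterval)
```

With a 1-second batch interval, both calculations point at checkpointing every 5 to 10 seconds rather than every batch.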


Dask-RAPIDS data movement and out-of-memory issue

I am using Dask (2021.3.0) and RAPIDS (0.18) in my project. I perform a preprocessing task on the CPU, and the preprocessed data is later transferred to the GPU for K-means clustering. During this process, I get the following error:
1 of 1 worker jobs failed: std::bad_alloc: CUDA error: ~/envs/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
(The error is raised before the GPU memory is fully used, i.e. the GPU memory is not being used completely.)
I have a single GPU with 40 GB of memory.
The RAM size is 512 GB.
I am using following snippet of code:
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
# perform my preprocessing on the data and get the output in variable A
# convert variable A to CuPy
x = A.map_blocks(cp.asarray)
km = KMeans(n_clusters=4)
I am also looking for a solution that allows data larger than GPU memory to be preprocessed, so that whenever GPU memory spills, the spilled data is moved to a temp directory or to CPU memory (as with Dask, where we define a temp directory for when RAM spills).
Any help will be appreciated.
There are several ways to run datasets that are larger than GPU memory.
Check out Nick Becker's blog, which has a few well-documented methods.
Check out BlazingSQL, which is built on top of RAPIDS and can perform out-of-core processing. You can try it at beta.blazingsql.com.

How to effectively use the TFRC program with the GCP AI platform Jobs

I'm trying to run a hyperparameter tuning job on GCP's AI Platform job service. The TensorFlow Research Cloud (TFRC) program approved the following for me:
100 preemptible Cloud TPU v2-8 device(s) in zone us-central1-f
20 on-demand Cloud TPU v2-8 device(s) in zone us-central1-f
5 on-demand Cloud TPU v3-8 device(s) in zone europe-west4-a
I already built a custom model in TensorFlow 2, and I want to run the job in that exact zone to take advantage of the TFRC program together with the AI Platform job service. Right now I have a YAML config file that looks like:
scaleTier: basic-tpu
region: us-central1
hyperparameterMetricTag: val_accuracy
maxTrials: 100
maxParallelTrials: 16
maxFailedTrials: 30
enableTrialEarlyStopping: True
In theory, running 16 parallel trials, each on a separate TPU instance, should work, but instead the request returns an error because it exceeds the TPU_V2 quota:
ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED: Quota failure for project ###################. The request for 128 TPU_V2 accelerators for 16 parallel runs exceeds the allowed maximum of 0 A100, 0 TPU_V2_POD, 0 TPU_V3_POD, 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 30 K80, 30 P100, 6 T4 accelerators.
I then reduced maxParallelTrials to 2 and the job worked, which confirms, given the error message above, that the quota is counted per TPU core (8 per v2-8 instance), not per TPU instance.
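The arithmetic behind that conclusion can be checked directly from the numbers in the error message (128 accelerators requested for 16 parallel runs, against a quota of 16 TPU_V2). The snippet below is just that calculation; the function names are illustrative and it does not query any GCP API.

```python
def accelerators_per_trial(requested_accelerators, parallel_trials):
    """Accelerators the service counts for a single trial."""
    return requested_accelerators // parallel_trials


def max_parallel_trials(quota, per_trial):
    """Largest number of trials that fits within the accelerator quota."""
    return quota // per_trial


# Figures taken from the RESOURCE_EXHAUSTED message above.
per_trial = accelerators_per_trial(128, 16)
print(per_trial)                            # accelerators counted per trial
print(max_parallel_trials(16, per_trial))   # trials that fit in a quota of 16 TPU_V2
```

Each trial is charged 8 TPU_V2 accelerators, so a quota of 16 leaves room for only 2 parallel trials, which matches the value that worked.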
Thinking I might have completely misunderstood the quota approved by the TFRC program, I checked whether the job was using the us-central1-f zone, but it turns out it is using an unwanted zone:
-tpu_node={"project": "p091c8a0a31894754-tp", "zone": "us-central1-c", "tpu_node_name": "cmle-training-1597710560117985038-tpu"}"
That behavior keeps me from making effective use of the free approved quota, and if I understand correctly, the job running in us-central1-c is consuming credits from my account rather than using the free resources. Is there a way to set the zone for an AI Platform job, and is it possible to pass a flag to use preemptible TPUs?
Unfortunately the two can't be combined.

Disk I/O extremely slow on P100-NC6s-V2

I am training an image segmentation model in an Azure ML pipeline. During the testing step, I save the output of the model to the associated blob storage. Then I want to compute the IoU (Intersection over Union) between the calculated output and the ground truth. Both sets of images are on the blob storage. However, the IoU calculation is extremely slow, and I think it's disk bound. In my IoU calculation code I'm just loading the two images (the other code is commented out), and it still takes close to 6 seconds per iteration, while training and testing were fast enough.
Is this behavior normal? How do I debug this step?
A few notes on the drives that an AzureML remote run has available:
Here is what I see when I run df on a remote run (in this one, I am using a blob Datastore via as_mount()):
Filesystem                             1K-blocks     Used  Available Use% Mounted on
overlay                                103080160 11530364   86290588  12% /
tmpfs                                      65536        0      65536   0% /dev
tmpfs                                    3568556        0    3568556   0% /sys/fs/cgroup
/dev/sdb1                              103080160 11530364   86290588  12% /etc/hosts
shm                                      2097152        0    2097152   0% /dev/shm
//danielscstorageezoh...-620830f140ab 5368709120  3702848 5365006272   1% /mnt/batch/tasks/.../workspacefilestore
blobfuse                               103080160 11530364   86290588  12% /mnt/batch/tasks/.../workspaceblobstore
The interesting items are overlay, /dev/sdb1, //danielscstorageezoh...-620830f140ab and blobfuse:
overlay and /dev/sdb1 are both the mount of the local SSD on the machine (I am using a STANDARD_D2_V2 which has a 100GB SSD).
//danielscstorageezoh...-620830f140ab is the mount of the Azure File Share that contains the project files (your script, etc.). It is also the current working directory for your run.
blobfuse is the blob store that I had requested to mount in the Estimator as I executed the run.
I was curious about the performance differences between these 3 types of drives. My mini benchmark was to download and extract this file: http://download.tensorflow.org/example_images/flower_photos.tgz (it is a 220 MB tar file that contains about 3600 jpeg images of flowers).
Here are the results:
Filesystem/Drive          Download and save  Extract
Local SSD                 2s                 2s
Azure File Share          9s                 386s
Premium File Share        10s                120s
Blobfuse                  10s                133s
Blobfuse w/ Premium Blob  8s                 121s
In summary, writing small files is much, much slower on the network drives, so it is highly recommended to use /tmp or Python tempfile if you are writing smaller files.
For reference, here is the script I ran to measure: https://gist.github.com/danielsc/9f062da5e66421d48ac5ed84aabf8535
And this is how I ran it: https://gist.github.com/danielsc/6273a43c9b1790d82216bdaea6e10e5c
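As a rough illustration of the recommendation above to use /tmp or Python tempfile for small files, the sketch below writes the small files to local temporary storage first and then copies a single archive to the final location. Bundling into one archive is an assumption added here (it follows from the benchmark, where saving one large file was far cheaper than extracting many small ones). The function name and the "data" folder name are hypothetical, and the destination is a stand-in for a mounted datastore path.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def write_small_files_locally(files, destination_dir, archive_name="outputs.tar.gz"):
    """Write many small files to local temp storage, then copy one archive.

    files: mapping of file name -> bytes content.
    destination_dir: final directory, e.g. a network-mounted path.
    Returns the path of the archive in destination_dir.
    """
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as scratch:
        scratch = Path(scratch)
        data_dir = scratch / "data"
        data_dir.mkdir()
        # All the small writes hit the local disk only.
        for name, content in files.items():
            (data_dir / name).write_bytes(content)
        archive_path = scratch / archive_name
        with tarfile.open(archive_path, "w:gz") as archive:
            archive.add(data_dir, arcname="data")
        # A single large write goes to the slow destination.
        return Path(shutil.copy(archive_path, destination_dir / archive_name))


# Demo with a second temp directory standing in for the mounted blob store.
with tempfile.TemporaryDirectory() as fake_mount:
    result = write_small_files_locally({"mask_0.png": b"\x00" * 16}, fake_mount)
    print(result.name)
```

The same idea applies in reverse for reads: copy or download the inputs to local disk once, then iterate over the local copies.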

How to compute 'DynamoDB read throughput ratio' while setting up DataPipeline to export DynamoDB data to S3

I have a DynamoDB table with ~16M records, each 4 KB in size. The table is configured for autoscaling with Target utilization: 70%, Minimum provisioned capacity for Reads: 250 and Maximum provisioned capacity for Writes: 3000.
I am trying to set up a Data Pipeline to back up DynamoDB to S3. The pipeline configuration asks for a Read Throughput Ratio, which is 0.25 by default.
So the question is how to compute the Read Throughput Ratio to back up the table in ~1 hour. I understand read capacity units. How is the Read Throughput Ratio related to Read Capacity Units and the Auto Scaling configuration?
In theory, one RCU covers one read of up to 4 KB per second, so dividing your data volume by 4 KB gives the total RCUs required to read all of the data in a single second. Dividing that value by 60*60 (minutes*seconds in 1 hour) gives the RCU configuration required to read it in 1 hour, but take into account the time required to set up the EMR cluster.
But I am confused on how this will behave if auto scaling is configured to the particular table.
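The division described in the answer can be worked through with the numbers from the question. This is a simplified sketch under several assumptions: one RCU is treated as one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half as much), each item is rounded up to whole 4 KB units, and the Read Throughput Ratio is taken to be the fraction of provisioned read capacity the export may consume. The helper names and the provisioned figure in the last line are hypothetical.

```python
import math

RCU_ITEM_BYTES = 4 * 1024  # size covered by one strongly consistent read unit


def required_rcu(item_count, item_bytes, window_seconds, eventually_consistent=False):
    """RCUs per second needed to read every item once within the window."""
    units_per_item = math.ceil(item_bytes / RCU_ITEM_BYTES)
    total_units = item_count * units_per_item
    if eventually_consistent:
        total_units = total_units / 2  # eventually consistent reads cost half
    return math.ceil(total_units / window_seconds)


def read_throughput_ratio(needed_rcu, provisioned_rcu):
    """Fraction of provisioned read capacity the export would have to use."""
    return needed_rcu / provisioned_rcu


# ~16M items of 4 KB each, read within one hour (60 * 60 seconds).
needed = required_rcu(16_000_000, 4 * 1024, 60 * 60)
print(needed)

# Hypothetical provisioned capacity, only to show how the ratio is formed.
print(read_throughput_ratio(needed, 8890))
```

Because cluster setup takes time, the usable read window is shorter than the full hour, so the window passed in should be reduced accordingly. A ratio above 1.0 means the provisioned read capacity is too low for the target duration.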

What is the performance overhead of XADisk for read and write operations?

What does XADisk do in addition to reading/writing from the underlying file? How does that translate into a percentage of the read/write throughput (approximately)?
It depends on the size of the writes. If they are large, set the "heavyWrite" flag to true when opening the xaFileOutputStream.
A test was run with 500 files of 1 MB each. Below is the time taken, averaged over 10 executions:
Java IO - 37.5 seconds
Java NIO - 24.8 seconds
XADisk - 30.3 seconds
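Since the question asks for an approximate percentage, the timings above can be converted into relative differences. The snippet only restates the three measured figures; the function name is illustrative, and the result says nothing about workloads other than this 500-file test.

```python
# Timings in seconds, copied from the test results above.
timings = {"Java IO": 37.5, "Java NIO": 24.8, "XADisk": 30.3}


def relative_change_percent(candidate, baseline):
    """Percentage by which candidate time differs from baseline time."""
    return (candidate - baseline) / baseline * 100.0


for name in ("Java IO", "Java NIO"):
    change = relative_change_percent(timings["XADisk"], timings[name])
    print(f"XADisk vs {name}: {change:+.1f}% time")
```

In this test XADisk took roughly a fifth less time than plain Java IO and roughly a fifth more time than Java NIO.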