What is the performance overhead of XADisk for read and write operations?

What does XADisk do in addition to reading/writing from the underlying file? How does that translate into a percentage of the read/write throughput (approximately)?

It depends on the size of the writes; for large writes, set the "heavyWrite" flag to true when opening the XAFileOutputStream.
A test with 500 files of 1 MB each gave the following times, averaged over 10 executions:
Java IO - 37.5 seconds
Java NIO - 24.8 seconds
XADisk - 30.3 seconds
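For reference, here is a minimal sketch of how the heavyWrite hint is passed when opening the stream. Class and method names follow the XADisk 1.2.x standalone API as I recall it; treat the exact signatures, file paths, and boot configuration as assumptions rather than a verified snippet.

import java.io.File;
import org.xadisk.bridge.proxies.interfaces.Session;
import org.xadisk.bridge.proxies.interfaces.XAFileOutputStream;
import org.xadisk.bridge.proxies.interfaces.XAFileSystem;
import org.xadisk.bridge.proxies.interfaces.XAFileSystemProxy;
import org.xadisk.filesystem.standalone.StandaloneFileSystemConfiguration;

public class HeavyWriteSketch {
    public static void main(String[] args) throws Exception {
        // Boot a standalone (in-JVM) XADisk instance; both paths below are placeholders.
        StandaloneFileSystemConfiguration config =
                new StandaloneFileSystemConfiguration("/tmp/xadisk-system", "instance-1");
        XAFileSystem xafs = XAFileSystemProxy.bootNativeXAFileSystem(config);
        xafs.waitForBootup(-1);

        Session session = xafs.createSessionForLocalTransaction();
        File target = new File("/tmp/data/file-1.bin");
        if (!session.fileExists(target)) {
            session.createFile(target, false);   // false = regular file, not a directory
        }

        // Second argument is the heavyWrite hint; true tells XADisk that a large
        // amount of data will be written through this stream.
        XAFileOutputStream out = session.createXAFileOutputStream(target, true);
        out.write(new byte[1024 * 1024]);        // 1 MB payload, as in the test above
        out.close();

        session.commit();
        xafs.shutdown();
    }
}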

Related

ScyllaDB: scylla_io_setup script IOPS calculation not matching other IOPS calculation tools like fio

I have a 3-node containerized Scylla cluster. When the Scylla container launches, Scylla calculates IOPS (read and write) for the mounted disk. The IOPS calculated by scylla_io_setup are much lower than those reported by the fio tool.
IOPS numbers from scylla_io_setup:
disks:
  - mountpoint: /var/lib/scylla
    read_iops: 264
    read_bandwidth: 84052448
    write_iops: 1197
    write_bandwidth: 129923792
IOPS measured with fio:
read_iops: 70000
write_iops: 60000
Many write requests to Scylla (around 6k, according to Grafana) are blocking on the commitlog, and I am seeing write latency issues with Scylla.

What will be the best choice for batch size for one device (using Mirrored Strategy in TF)?

Question:
Suppose you have 4 GPUs (having 2GB memory each) to train your deep learning model. You have 1000 data points in your dataset that takes around 10 GB of storage. What will be the best choice for batch size for one device (using Mirrored Strategy in TF)?
Can someone help me to solve this assignment problem? Thanks in advance.
Each GPU has 2 GB of memory and there are 4 GPUs, so you have a total of 8 GB of memory to work with.
You cannot fit 10 GB of data into 8 GB in one go, so you split the 10 GB in half, giving an overall batch size of 500 data points (or rather 512, to be closer to a power of 2).
You then distribute these 512 data points across the 4 GPUs, giving ~128 data points per device.
So the overall batch size would be 512 data points, and the per-GPU batch size would be 128.
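A quick back-of-the-envelope version of that arithmetic (the halving of the dataset and the power-of-two rounding are taken from the reasoning above; nothing here is TensorFlow-specific):

public class BatchSizeEstimate {
    public static void main(String[] args) {
        int numGpus = 4;
        int datasetSize = 1000;                   // data points, ~10 GB total
        double totalMemoryGb = numGpus * 2.0;     // 4 GPUs x 2 GB = 8 GB

        // 10 GB does not fit in 8 GB at once, so process half the dataset per step.
        int globalBatch = datasetSize / 2;                                  // 500
        int globalBatchPow2 = Integer.highestOneBit(globalBatch - 1) << 1;  // round up to 512
        int perDeviceBatch = globalBatchPow2 / numGpus;                     // 128

        System.out.println("total memory (GB): " + totalMemoryGb);
        System.out.println("global batch:      " + globalBatchPow2);
        System.out.println("per-GPU batch:     " + perDeviceBatch);
    }
}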

What is the best window size (in seconds) and hop size (in seconds) for an audio sample that is 3 seconds long?

I have some voice samples of 3 s length for an audio feature extraction project. I initially chose a 0.5 s window size and a 0.2 s hop size, but I am not sure how to select the best window size and hop size for better results.
Unfortunately, these are hyper-parameters that need to be optimized on your data.
I often obtain decent results with a hop_length between 10 ms and 40 ms and a window length between 10 ms and 100 ms, depending on whether you want more frequency resolution or more time resolution.
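For a concrete feel of what those values mean on a 3-second clip, here is a small frame-count calculation; the 16 kHz sample rate and the specific 25 ms / 10 ms choices are assumptions for illustration, not recommendations from the answer.

public class FrameCount {
    public static void main(String[] args) {
        int sampleRate = 16_000;          // assumed sample rate in Hz
        double clipSeconds = 3.0;         // length of each voice sample
        double windowSeconds = 0.025;     // 25 ms window (within the 10-100 ms range above)
        double hopSeconds = 0.010;        // 10 ms hop (within the 10-40 ms range above)

        int clipSamples = (int) Math.round(clipSeconds * sampleRate);      // 48000
        int windowSamples = (int) Math.round(windowSeconds * sampleRate);  // 400
        int hopSamples = (int) Math.round(hopSeconds * sampleRate);        // 160

        // Number of full analysis windows that fit into the clip.
        int frames = 1 + (clipSamples - windowSamples) / hopSamples;       // 298
        System.out.println(frames + " frames per 3-second clip");
    }
}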

How to compute 'DynamoDB read throughput ratio' while setting up DataPipeline to export DynamoDB data to S3

I have a DynamoDB table with ~16M records, each of size ~4 KB. The table is configured for autoscaling with a target utilization of 70%, a minimum provisioned read capacity of 250, and a maximum provisioned write capacity of 3000.
I am trying to setup data pipeline to backup DynamoDB to S3. The pipeline configuration asks for Read Throughput Ratio which is 0.25 by default.
So the question is how to compute the Read Throughput Ratio to back up the table in ~1 hour. I understand read capacity units. How is the Read Throughput Ratio related to Read Capacity Units and the Auto Scaling configuration?
Theoretically, one RCU covers 4 KB, so if you divide your data volume by 4 KB you get the total read units required to read the complete data set. If you then divide this value by 60*60 (the number of seconds in 1 hour), you get the required RCU configuration, but take into account the time required to set up the EMR cluster.
But I am confused on how this will behave if auto scaling is configured to the particular table.
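A rough worked version of that arithmetic, under the answer's assumption that one RCU covers one 4 KB read per second. The 5,000 provisioned RCU used for the ratio at the end is purely hypothetical, and how the ratio interacts with autoscaling is exactly the open question above, so that fixed value is only for illustration.

public class ReadThroughputEstimate {
    public static void main(String[] args) {
        long items = 16_000_000L;            // ~16M records
        long itemSizeBytes = 4 * 1024L;      // ~4 KB each
        long rcuSizeBytes = 4 * 1024L;       // one RCU covers a 4 KB read per second
        long exportSeconds = 60L * 60L;      // finish in ~1 hour

        long totalReadUnits = (items * itemSizeBytes) / rcuSizeBytes;  // 16,000,000 read units
        long rcuSustained = totalReadUnits / exportSeconds;            // ~4,444 RCU for the whole hour

        // The read throughput ratio is the fraction of the table's provisioned
        // read capacity the export job is allowed to consume.
        long provisionedRcu = 5_000L;                                  // hypothetical provisioned value
        double ratio = (double) rcuSustained / provisionedRcu;         // ~0.89

        System.out.printf("sustained RCU needed: %d, ratio against %d provisioned: %.2f%n",
                rcuSustained, provisionedRcu, ratio);
    }
}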

Writing Spark checkpoints to S3 is too slow

I'm using Spark Streaming 1.5.2 and I am ingesting data from Kafka 0.8.2.2 using the Direct Stream approach.
I have enabled checkpoints so that my driver can be restarted and pick up where it left off without losing unprocessed data.
Checkpoints are written to S3 as I'm on Amazon AWS and not running on top of a Hadoop cluster.
The batch interval is 1 second as I want a low latency.
Issue is, it takes from 1 to 20 seconds to write a single checkpoint to S3. They are backing up in memory and, eventually, the application fails.
2016-04-28 18:26:55,483 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882407000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882407000', took 6071 bytes and 1724 ms
2016-04-28 18:26:58,812 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882407000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882407000', took 6024 bytes and 3329 ms
2016-04-28 18:27:00,327 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882408000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882408000', took 6068 bytes and 1515 ms
2016-04-28 18:27:06,667 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882408000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882408000', took 6024 bytes and 6340 ms
2016-04-28 18:27:11,689 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882409000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882409000', took 6067 bytes and 5022 ms
2016-04-28 18:27:15,982 INFO [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882409000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882409000', took 6024 bytes and 4293 ms
Is there a way to increase the interval between checkpoints without increasing the batch interval?
Yes, you can achieve that using the checkpointInterval parameter. You can set the duration when checkpointing, as described in the documentation quoted below.
Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may cause an increase in the processing time of those batches where RDDs get checkpointed. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects. For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
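For example, with the Spark Streaming Java API (the Kafka direct stream construction is elided, and the bucket name is a placeholder; the point is only that dstream.checkpoint() can take a duration larger than the 1-second batch interval):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointIntervalSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("cxp-filter");

        // Keep the 1-second batch interval for low latency.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Checkpoint data still goes to S3, as in the question (placeholder bucket).
        jssc.checkpoint("s3a://my-bucket/checkpoints/cxp-filter");

        // Placeholder: build the Kafka direct stream here (KafkaUtils.createDirectStream).
        JavaDStream<String> lines = buildKafkaDirectStream(jssc);

        // Checkpoint the DStream every 10 seconds (10 batches) instead of every batch.
        lines.checkpoint(Durations.seconds(10));

        // ... transformations and output operations ...

        jssc.start();
        jssc.awaitTermination();
    }

    private static JavaDStream<String> buildKafkaDirectStream(JavaStreamingContext jssc) {
        throw new UnsupportedOperationException("wire up KafkaUtils.createDirectStream here");
    }
}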