ScyllaDB: scylla_io_setup script IOPS calculation not matching other IOPS calculation tools like fio - scylla

I have a 3-node containerized Scylla cluster. At the launch of the Scylla container, Scylla calculates IOPS (read and write) for the mounted disk. The IOPS rates calculated by scylla_io_setup are much lower than what the fio tool reports.
IOPS numbers from scylla_io_setup:
disks:
- mountpoint: /var/lib/scylla
read_iops: 264
read_bandwidth: 84052448
write_iops: 1197
write_bandwidth: 129923792
IOPS using fio:
read_iops: 70000
write_iops: 60000
A large number of write requests to Scylla (around 6k, according to Grafana) are blocking on the commitlog, and I am seeing write latency issues with Scylla.
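For what it's worth, here is a quick back-of-the-envelope comparison of the two sets of numbers (a sketch in Python; it assumes the scylla_io_setup bandwidth figures are bytes per second and that the fio run used a 4 KiB block size, neither of which is stated above):
# Implied request size of the scylla_io_setup measurement vs. the bandwidth
# fio would need to sustain its reported IOPS at an assumed 4 KiB block size.
read_iops, read_bw = 264, 84052448            # from scylla_io_setup (bytes/s assumed)
write_iops, write_bw = 1197, 129923792
print(f"avg read size:  {read_bw / read_iops / 1024:.0f} KiB")    # ~311 KiB per read
print(f"avg write size: {write_bw / write_iops / 1024:.0f} KiB")  # ~106 KiB per write
fio_read_iops, fio_write_iops, fio_bs = 70000, 60000, 4096        # fio block size is an assumption
print(f"fio read bandwidth:  {fio_read_iops * fio_bs / 1e6:.0f} MB/s")   # ~287 MB/s
print(f"fio write bandwidth: {fio_write_iops * fio_bs / 1e6:.0f} MB/s")  # ~246 MB/s
So the two tools are not necessarily measuring the same thing: the scylla_io_setup figures imply much larger requests, which naturally yield a lower IOPS number at a comparable bandwidth.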

Related

Terminology used in Nsight Compute

Two questions:
According to Nsight Compute, my kernel is compute bound. The SM % of utilization relative to peak performance is 74% and the memory utilization is 47%. However, when I look at each pipeline utilization percentage, LSU utilization is way higher than others (75% vs 10-15%). Wouldn't that be an indication that my kernel is memory bound? If the utilization of compute and memory resources doesn't correspond to pipeline utilization, I don't know how to interpret those terms.
The schedulers are only issuing an instruction every 4 cycles; wouldn't that mean my kernel is latency bound? People usually define latency bound in terms of the utilization of compute and memory resources. What is the relationship between the two?
In Nsight Compute on CC7.5 GPUs
SM% is defined by sm__throughput, and
Memory% is defined by gpu__compute_memory_throughput
sm__throughput is the MAX of the following metrics:
sm__instruction_throughput
  sm__inst_executed
  sm__issue_active
  sm__mio_inst_issued
  sm__pipe_alu_cycles_active
  sm__inst_executed_pipe_cbu_pred_on_any
  sm__pipe_fp64_cycles_active
  sm__pipe_tensor_cycles_active
  sm__inst_executed_pipe_xu
  sm__pipe_fma_cycles_active
  sm__inst_executed_pipe_fp16
  sm__pipe_shared_cycles_active
  sm__inst_executed_pipe_uniform
  sm__instruction_throughput_internal_activity
sm__memory_throughput
  idc__request_cycles_active
  sm__inst_executed_pipe_adu
  sm__inst_executed_pipe_ipa
  sm__inst_executed_pipe_lsu
  sm__inst_executed_pipe_tex
  sm__mio_pq_read_cycles_active
  sm__mio_pq_write_cycles_active
  sm__mio2rf_writeback_active
  sm__memory_throughput_internal_activity
gpu__compute_memory_throughput is the MAX of the following metrics:
gpu__compute_memory_access_throughput
  l1tex__data_bank_reads
  l1tex__data_bank_writes
  l1tex__data_pipe_lsu_wavefronts
  l1tex__data_pipe_tex_wavefronts
  l1tex__f_wavefronts
  lts__d_atomic_input_cycles_active
  lts__d_sectors
  lts__t_sectors
  lts__t_tag_requests
  gpu__compute_memory_access_throughput_internal_activity
gpu__compute_memory_request_throughput
  l1tex__lsuin_requests
  l1tex__texin_sm2tex_req_cycles_active
  l1tex__lsu_writeback_active
  l1tex__tex_writeback_active
  l1tex__m_l1tex2xbar_req_cycles_active
  l1tex__m_xbar2l1tex_read_sectors
  lts__lts2xbar_cycles_active
  lts__xbar2lts_cycles_active
  lts__d_sectors_fill_device
  lts__d_sectors_fill_sysmem
  gpu__dram_throughput
  gpu__compute_memory_request_throughput_internal_activity
In your case the limiter is sm__inst_executed_pipe_lsu, which is an instruction throughput. If you review sections/SpeedOfLight.py, latency bound is defined as having both sm__throughput and gpu__compute_memory_throughput < 60%.
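To make that rule concrete, here is a rough sketch of how the two top-level percentages combine (only the both-below-60% latency-bound rule comes from SpeedOfLight.py as described above; picking the higher of the two as the reported limiter is my reading of the report, not something stated here):
def classify_bound(sm_pct: float, mem_pct: float, threshold: float = 60.0) -> str:
    """Rough Speed-of-Light style classification from SM% and Memory%."""
    if sm_pct < threshold and mem_pct < threshold:
        return "latency bound"          # neither unit is kept busy enough
    return "compute bound" if sm_pct >= mem_pct else "memory bound"

# Values from the question: SM% = 74, Memory% = 47 -> reported as compute bound.
print(classify_bound(74, 47))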
Some instruction pipelines have lower throughput than others, such as fp64, xu, and lsu (this varies with the chip). The pipeline utilization is part of sm__throughput. In order to improve performance, the options are:
Reduce instructions to the oversubscribed pipeline, or
Issue instructions of different type to use empty issue cycles.
GENERATING THE BREAKDOWN
As of Nsight Compute 2020.1 there is not a simple command line to generate the list without running a profiling session. For now you can collect one throughput metric using breakdown:<throughput metric>.avg.pct_of_peak_sustained_elapsed and parse the output to get the sub-metric names.
For example:
ncu.exe --csv --metrics breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed --details-all -c 1 cuda_application.exe
generates:
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
...
The keyword breakdown can be used in Nsight Compute section files to expand a throughput metric. This is used in the SpeedOfLight.section.
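If you want to see which sub-metric actually sets the throughput value, here is a small sketch that parses the --csv output shown above (the report file name breakdown.csv is a hypothetical placeholder):
import csv
from collections import defaultdict

# For each kernel, find the breakdown sub-metric with the highest
# pct_of_peak_sustained_elapsed -- that one determines the throughput value.
best = defaultdict(lambda: ("", 0.0))
with open("breakdown.csv", newline="") as f:
    for row in csv.DictReader(f):
        kernel, name, value = row["Kernel Name"], row["Metric Name"], float(row["Metric Value"])
        if value > best[kernel][1]:
            best[kernel] = (name, value)

for kernel, (name, value) in best.items():
    print(f"{kernel}: limited by {name} at {value:.1f}%")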

Disk I/O extremely slow on P100-NC6s-V2

I am training an image segmentation model on an Azure ML pipeline. During the testing step, I save the output of the model to the associated blob storage. Then I want to find the IOU (Intersection over Union) between the calculated output and the ground truth. Both of these sets of images lie on blob storage. However, the IOU calculation is extremely slow, and I think it's disk bound. In my IOU calculation code I'm just loading the two images (other code is commented out), and it still takes close to 6 seconds per iteration, while training and testing were fast enough.
Is this behavior normal? How do I debug this step?
A few notes on the drives that an AzureML remote run has available:
Here is what I see when I run df on a remote run (in this one, I am using a blob Datastore via as_mount()):
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 103080160 11530364 86290588 12% /
tmpfs 65536 0 65536 0% /dev
tmpfs 3568556 0 3568556 0% /sys/fs/cgroup
/dev/sdb1 103080160 11530364 86290588 12% /etc/hosts
shm 2097152 0 2097152 0% /dev/shm
//danielscstorageezoh...-620830f140ab 5368709120 3702848 5365006272 1% /mnt/batch/tasks/.../workspacefilestore
blobfuse 103080160 11530364 86290588 12% /mnt/batch/tasks/.../workspaceblobstore
The interesting items are overlay, /dev/sdb1, //danielscstorageezoh...-620830f140ab and blobfuse:
overlay and /dev/sdb1 are both the mount of the local SSD on the machine (I am using a STANDARD_D2_V2 which has a 100GB SSD).
//danielscstorageezoh...-620830f140ab is the mount of the Azure File Share that contains the project files (your script, etc.). It is also the current working directory for your run.
blobfuse is the blob store that I had requested to mount in the Estimator as I executed the run.
I was curious about the performance differences between these 3 types of drives. My mini benchmark was to download and extract this file: http://download.tensorflow.org/example_images/flower_photos.tgz (it is a 220 MB tar file that contains about 3600 jpeg images of flowers).
Here are the results:
Filesystem/Drive            Download and save    Extract
Local SSD                   2s                   2s
Azure File Share            9s                   386s
Premium File Share          10s                  120s
Blobfuse                    10s                  133s
Blobfuse w/ Premium Blob    8s                   121s
In summary, writing small files is much, much slower on the network drives, so it is highly recommended to use /tmp or Python tempfile if you are writing smaller files.
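For example, a minimal sketch of that pattern, downloading and extracting on the local disk first and then copying to the mounted blob store in one pass (the mount path below is a hypothetical placeholder):
import shutil, tarfile, tempfile, urllib.request

blob_mount = "/mnt/batch/tasks/.../workspaceblobstore/flowers"   # placeholder path

with tempfile.TemporaryDirectory() as tmp:                       # lives on the local SSD
    archive = f"{tmp}/flower_photos.tgz"
    urllib.request.urlretrieve(
        "http://download.tensorflow.org/example_images/flower_photos.tgz", archive)
    with tarfile.open(archive) as tar:
        tar.extractall(tmp)                                      # fast: the many small files hit local disk
    shutil.copytree(f"{tmp}/flower_photos", blob_mount)          # single bulk copy to the network drive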
For reference, here is the script I ran to measure: https://gist.github.com/danielsc/9f062da5e66421d48ac5ed84aabf8535
And this is how I ran it: https://gist.github.com/danielsc/6273a43c9b1790d82216bdaea6e10e5c

How to compute 'DynamoDB read throughput ratio' while setting up DataPipeline to export DynamoDB data to S3

I have a DynamoDB table with ~16M records, each about 4 KB in size. The table is configured for autoscaling with a target utilization of 70%, a minimum provisioned capacity for reads of 250, and a maximum provisioned capacity for writes of 3000.
I am trying to set up a Data Pipeline to back up the DynamoDB table to S3. The pipeline configuration asks for a Read Throughput Ratio, which is 0.25 by default.
So the question is how to compute the Read Throughput Ratio to back up the table in ~1 hour. I understand read capacity units. How is the Read Throughput Ratio related to read capacity units and the auto scaling configuration?
Theoretically one RCU covers 4 KB, so if you divide your data volume by 4 KB you get the total read units required to read the complete data set. If you then divide that value by 60*60 (minutes*seconds) for 1 hour, you get the sustained RCU rate required, but also take into account the time needed to set up the EMR cluster.
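A minimal sketch of that arithmetic with the numbers from the question (the final ratio assumes the export job consumes ratio x the table's provisioned read capacity, which is how the setting is usually described; treat that as an assumption):
# Read capacity needed to scan ~16M items of ~4 KB each within one hour.
records = 16_000_000
item_size_kb = 4
rcu_size_kb = 4                       # one read capacity unit covers 4 KB
backup_window_s = 60 * 60             # target: finish in ~1 hour

total_read_units = records * item_size_kb / rcu_size_kb     # 16,000,000 read units
required_rcu = total_read_units / backup_window_s           # ~4,444 sustained RCU
print(f"required sustained RCU: {required_rcu:,.0f}")

provisioned_rcu = 250                 # the table's minimum from the question
print(f"implied read throughput ratio: {required_rcu / provisioned_rcu:.1f}")  # far above 1.0
In other words, at the table's minimum of 250 RCU the one-hour target is not reachable with any sensible ratio; either the provisioned read capacity (or the autoscaling ceiling for reads) has to be high enough, or the backup window has to be longer.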
But I am confused about how this behaves if auto scaling is configured on the particular table.

Why no linear scaling of Redis Cluster

I am trying to build a horizontally scalable system based on Redis Cluster, so I've measured the throughput of Redis Cluster with different numbers of nodes. However, the measured results don't show the linear scalability that the cluster spec claims: "High performance and linear scalability up to 1000 nodes."
redis cluster benchmark:
The image above shows the measured results for Redis Clusters of (3+3), (4+4), (5+5), (6+6), (8+8), (10+10), and (12+12) nodes, where (3+3) means 3 master nodes plus 3 slave nodes. The results for C(reate) and U(pdate) do not show the linear scalability of Redis Cluster, as the following picture shows.
I'd like to know why these measured results don't show linear scalability. Is there any possible reason that limits the scaling?
My test environment and related information are described below:
Server
HW: HP BL460c G9, 24 CPU (E5-2620 v3 @ 2.40GHz), 64G memory, 300G disk
I have two machines. In order to know the capacity of one HW machine, I run all master nodes on one machine and all slave nodes on the other. All Redis nodes are included in one Redis Cluster.
OS: SLES 12
I have updated some system settings to achieve higher performance:
echo 65535 > /proc/sys/net/core/somaxconn
echo 65535 > /proc/sys/net/ipv4/tcp_max_syn_backlog
echo never > /sys/kernel/mm/transparent_hugepage/enabled
sysctl vm.overcommit_memory=1
sysctl vm.swappiness=0
Furthermore, I've turned off swap, which could otherwise cause very unstable throughput when an AOF rewrite happens, even with swappiness already set to 0. As observed, the 15 million records in my test occupy around 48G of memory.
Redis 3.0.6: To eliminate the bursts caused by RDB, I turned off RDB entirely and only enabled AOF. The other settings in redis.conf were left at their default values.
Client
HW: HP DL380 G7, 16 CPU (E5620 @ 2.40GHz), 24G memory, 600G disk
OS: SLES 12
YCSB (0.6.0) with jedis (2.8.0)
I use hash keys to store all records (1 key with 21 fields each) and N sorted sets to store all keys and their random scores, where N is the number of master nodes in the cluster. The N sorted sets are distributed evenly across the master nodes.
The YCSB workload configuration is pasted below:
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=15000000
operationcount=150000000
insertstart=0
fieldcount=21
fieldlength=188
readallfields=true
writeallfields=false
fieldlengthdistribution=zipfian
readproportion=0.0
updateproportion=1.0
insertproportion=0
readmodifywriteproportion=0.0
scanproportion=0
maxscanlength=1000
scanlengthdistribution=uniform
insertorder=hashed
requestdistribution=zipfian
hotspotdatafraction=0.2
hotspotopnfraction=0.8
table=subscriber
measurementtype=histogram
histogram.buckets=1000
timeseries.granularity=1000
In most cases the machine resources look sufficient to me, even though the throughput has already hit its limit:
CPU: there is plenty of CPU left, 60~70% idle.
I/O usage: not very busy, 30~40% utilization at peak time.
Memory: only memory comes close to being exhausted at peak time, i.e. when an AOF rewrite happens. Most of the time it is around 80%.

What is the performance overhead of XADisk for read and write operations?

What does XADisk do in addition to reading/writing from the underlying file? How does that translate into a percentage of the read/write throughput (approximately)?
It depends on the file size; for large files, set the "heavyWrite" flag to true while opening the XAFileOutputStream.
A test with 500 files of 1 MB each gave the times below, averaged over 10 executions (a quick percentage comparison follows the list):
Java IO - 37.5 seconds
Java NIO - 24.8 seconds
XADisk - 30.3 seconds
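For a rough sense of the overhead implied by those averages (just arithmetic over the three numbers above):
# Relative overhead implied by the averaged timings above (500 x 1 MB files).
java_io, java_nio, xadisk = 37.5, 24.8, 30.3   # seconds

print(f"XADisk vs Java NIO: {100 * (xadisk - java_nio) / java_nio:+.0f}%")  # about +22% slower
print(f"XADisk vs Java IO:  {100 * (xadisk - java_io) / java_io:+.0f}%")    # about -19%, i.e. faster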