Why no linear scaling of Redis Cluster - redis

I am trying to build a horizontally scalable system based on Redis Cluster, so I've measured the throughput of Redis Cluster with different numbers of nodes. However, the measured results don't show the linear scalability that the cluster spec promises: "High performance and linear scalability up to 1000 nodes."
redis cluster benchmark:
The image above shows the measured results for Redis Cluster configurations of (3+3), (4+4), (5+5), (6+6), (8+8), (10+10) and (12+12), where (3+3) means 3 master nodes plus 3 slave nodes. The C(reate) and U(pdate) results do not scale linearly, as the picture shows.
I'd like to know why these measured results don't show linear scalability. What could be limiting the scaling?
My test environment and related information are described below.
Server
HW: HP BL460c G9, 24 CPUs (E5-2620 v3 @ 2.40GHz), 64G memory, 300G disk
I have two machines. In order to know the capacity of one machine, I run all master nodes on one machine and all slave nodes on the other. All Redis nodes are included in one Redis Cluster.
OS: SLES 12
I have updated some system settings to achieve higher performance:
echo 65535 > /proc/sys/net/core/somaxconn                  # raise the listen backlog limit
echo 65535 > /proc/sys/net/ipv4/tcp_max_syn_backlog        # raise the SYN backlog for incoming connections
echo never > /sys/kernel/mm/transparent_hugepage/enabled   # disable THP, which Redis warns causes latency spikes on fork
sysctl vm.overcommit_memory=1                              # allow overcommit so the fork for AOF rewrite doesn't fail
sysctl vm.swappiness=0                                      # avoid swapping Redis memory
Furthermore, I've turned off swap entirely, which otherwise caused very unstable throughput when AOF rewrite happened, even with swappiness already set to 0. As observed, the 15 million records in my test occupy around 48G of memory.
Redis 3.0.6: To eliminate the bursts caused by RDB, I turned off RDB entirely and only enabled AOF. The other settings in redis.conf are left at their default values.
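Concretely, that corresponds to something like the following sketch using redis-py (I actually edited redis.conf; the client choice and the host/port here are just placeholders for illustration):

import redis

r = redis.StrictRedis(host="127.0.0.1", port=7000)   # placeholder host/port
r.config_set("save", "")            # an empty "save" disables all RDB snapshot points
r.config_set("appendonly", "yes")   # turn on AOF persistence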
Client
HW: HP DL380 G7, 16 CPUs (E5620 @ 2.40GHz), 24G memory, 600G disk
OS: SLES 12
YCSB (0.6.0) with jedis (2.8.0)
I use a hash key to store each record (1 key and 21 fields) and N sorted sets to store all keys with random scores, where N is the number of master nodes in the cluster. The N sorted sets are distributed evenly across the master nodes.
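For reference, Redis Cluster maps each key to one of 16384 hash slots using CRC16 of the key, and each master owns a range of slots. A minimal Python sketch of that mapping (the key names below are illustrative, not my actual keys, and it assumes no {hash tags} in the key names) can be used to check how the sorted-set keys spread across the masters:

def crc16_xmodem(data: bytes) -> int:
    # CRC16 (XMODEM variant) as used by Redis Cluster key hashing
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    # assumes the key contains no {hash tag}; Redis would otherwise hash only the tag
    return crc16_xmodem(key.encode()) % 16384

# e.g. one sorted set per master in a 6-master cluster: check the slots they land on
for i in range(6):
    key = f"subscriber_keys:{i}"          # illustrative key name
    print(key, hash_slot(key))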
The YCSB workload configuration is pasted below:
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=15000000
operationcount=150000000
insertstart=0
fieldcount=21
fieldlength=188
readallfields=true
writeallfields=false
fieldlengthdistribution=zipfian
readproportion=0.0
updateproportion=1.0
insertproportion=0
readmodifywriteproportion=0.0
scanproportion=0
maxscanlength=1000
scanlengthdistribution=uniform
insertorder=hashed
requestdistribution=zipfian
hotspotdatafraction=0.2
hotspotopnfraction=0.8
table=subscriber
measurementtype=histogram
histogram.buckets=1000
timeseries.granularity=1000
In most cases the machine resources look sufficient to me, even though the throughput has already hit its limit:
CPU: there is plenty of CPU left, 60~70% idle.
I/O: not very busy, around 30~40% utilization at peak time.
Memory: only memory gets nearly exhausted at peak time, i.e. when AOF rewrite happens; most of the time it's around 80%.

Related

How to get the current consumed CPU % of a vmhost in vCenter using PowerShell

How can I get the current consumed CPU % of a vmhost in vCenter using a PowerShell script?
The commands below don't give output similar to what we checked manually:
Get-Stat -Entity $command1 -Stat cpu.usagemhz.average -Realtime -MaxSamples 1
Get-Stat -Entity $myHost -Stat cpu.usage.average -Realtime -MaxSamples 1 -Instance ""
From VMware's doc on this cpu usage perf counter:
Actively used CPU, as a percentage of the total available CPU, for each physical CPU on the host. Active CPU is approximately equal to the ratio of the used CPU to the available CPU. Available CPU = # of physical CPUs × clock rate.
100% represents all CPUs on the host. For example, if a four-CPU host is running a virtual machine with two CPUs, and the usage is 50%, the host is using two CPUs completely.
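As a quick sanity check of that formula, here is a small Python sketch (the MHz figures are made-up examples, not values from this host):

# Example numbers only: a 4-CPU host at 2400 MHz per CPU, with a 2-vCPU VM running flat out
num_physical_cpus = 4
clock_rate_mhz = 2400
used_mhz = 4800

available_mhz = num_physical_cpus * clock_rate_mhz     # Available CPU = # of physical CPUs x clock rate
cpu_usage_pct = 100.0 * used_mhz / available_mhz       # actively used CPU as % of total available
print(cpu_usage_pct)                                   # 50.0, matching the four-CPU host / two-CPU VM example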
Explanations from Luc Dekens around the -Instance filter...
If the ESX/ESXi server is equipped with a quadcore CPU, there will be four instances: 0, 1, 2 and 3. In this case the instance corresponds with the numeric position within the CPU core.
And there will be a so-called aggregate, which is the metric averaged over all the instances.
These instances each get their own identifier which will be part of the returned statistical data. The aggregate instance is always represented by a blank identifier.
...and -MaxSamples
Although I asked for 1 sample (-MaxSamples 1) the cmdlet returned 9 values. The -MaxSamples parameter apparently only looks at the Timestamp; it doesn't count the number of returned values.

Terminology used in Nsight Compute

Two questions:
According to Nsight Compute, my kernel is compute bound. The SM utilization relative to peak performance is 74% and the memory utilization is 47%. However, when I look at each pipeline's utilization percentage, LSU utilization is way higher than the others (75% vs 10-15%). Wouldn't that be an indication that my kernel is memory bound? If the utilization of compute and memory resources doesn't correspond to pipeline utilization, I don't know how to interpret those terms.
The schedulers are only issuing every 4 cycles; wouldn't that mean that my kernel is latency bound? People usually define that in terms of the utilization of compute and memory resources. What is the relationship between the two?
In Nsight Compute on CC 7.5 GPUs:
SM% is defined by sm__throughput, and
Memory% is defined by gpu__compute_memory_throughput
sm__throughput is the MAX of the following metrics:
sm__instruction_throughput
sm__inst_executed
sm__issue_active
sm__mio_inst_issued
sm__pipe_alu_cycles_active
sm__inst_executed_pipe_cbu_pred_on_any
sm__pipe_fp64_cycles_active
sm__pipe_tensor_cycles_active
sm__inst_executed_pipe_xu
sm__pipe_fma_cycles_active
sm__inst_executed_pipe_fp16
sm__pipe_shared_cycles_active
sm__inst_executed_pipe_uniform
sm__instruction_throughput_internal_activity
sm__memory_throughput
idc__request_cycles_active
sm__inst_executed_pipe_adu
sm__inst_executed_pipe_ipa
sm__inst_executed_pipe_lsu
sm__inst_executed_pipe_tex
sm__mio_pq_read_cycles_active
sm__mio_pq_write_cycles_active
sm__mio2rf_writeback_active
sm__memory_throughput_internal_activity
gpu__compute_memory_throughput is the MAX of the following metrics:
gpu__compute_memory_access_throughput
l1tex__data_bank_reads
l1tex__data_bank_writes
l1tex__data_pipe_lsu_wavefronts
l1tex__data_pipe_tex_wavefronts
l1tex__f_wavefronts
lts__d_atomic_input_cycles_active
lts__d_sectors
lts__t_sectors
lts__t_tag_requests
gpu__compute_memory_access_throughput_internal_activity
gpu__compute_memory_request_throughput
l1tex__lsuin_requests
l1tex__texin_sm2tex_req_cycles_active
l1tex__lsu_writeback_active
l1tex__tex_writeback_active
l1tex__m_l1tex2xbar_req_cycles_active
l1tex__m_xbar2l1tex_read_sectors
lts__lts2xbar_cycles_active
lts__xbar2lts_cycles_active
lts__d_sectors_fill_device
lts__d_sectors_fill_sysmem
gpu__dram_throughput
gpu__compute_memory_request_throughput_internal_activity
In your case the limiter is sm__inst_executed_pipe_lsu, which is an instruction throughput. If you review sections/SpeedOfLight.py, latency bound is defined as having both sm__throughput and gpu__compute_memory_throughput < 60%.
Some instruction pipelines have lower throughput than others, such as fp64, xu, and lsu (this varies with the chip). The pipeline utilization is part of sm__throughput. In order to improve performance, the options are:
Reduce instructions to the oversubscribed pipeline, or
Issue instructions of different type to use empty issue cycles.
GENERATING THE BREAKDOWN
As of Nsight Compute 2020.1 there is not a simple command line to generate the list without running a profiling session. For now you can collect one throughput metric using breakdown:<throughput metric>.avg.pct_of_peak_sustained_elapsed and parse the output to get the sub-metric names.
For example:
ncu.exe --csv --metrics breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed --details-all -c 1 cuda_application.exe
generates:
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
...
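A small Python sketch for pulling the sub-metric names and values out of that CSV (assuming the ncu output above was redirected to a file; "breakdown.csv" is just an example name):

import csv

# Parse the ncu --csv output and list each breakdown sub-metric with its value
with open("breakdown.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["Metric Name"], row["Metric Value"], row["Metric Unit"])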
The keyword breakdown can be used in Nsight Compute section files to expand a throughput metric. This is used in the SpeedOfLight.section.

ScyllaDB: scylla_io_setup script IOPS calculation not matching other IOPS calculation tools like fio

I have a 3-node containerized Scylla cluster. When the Scylla container launches, Scylla calculates IOPS (read and write) for the mounted disk. I see that the IOPS rate calculated by scylla_io_setup is much lower compared to the fio tool.
IOPS numbers from scylla_io_setup:
disks:
- mountpoint: /var/lib/scylla
read_iops: 264
read_bandwidth: 84052448
write_iops: 1197
write_bandwidth: 129923792
IOPS using fio
read_iops: 70000
write_iops: 60000
A great many write requests to Scylla (around 6k, as seen in Grafana) are blocking on the commitlog, and I am seeing write latency issues with Scylla.

How to run TensorFlow on multiple nodes with several CPUs each

I want to run linear regression with TensorFlow on very large datasets. I have a cluster with 9 nodes and 36 CPUs each. What is the best way to distribute the computations across all the resources available?
According to this course https://www.coursera.org/learn/intro-tensorflow, the best way to use TensorFlow in a distributed setting is to use Estimators. So I wrote my code as suggested there and followed the instructions at https://www.tensorflow.org/deploy/distributed for the parallelisation. I then tried to run my script my_code.py (on a "small" dataset with 120 million data points and 2 feature columns, to test the code) on nodes 2 and 3 as follows:
python my_code.py \
--ps_hosts=node1:2222 \
--worker_hosts=node2:2222,node3:2222 \
--job_name=worker \
--task_index="i-2"
where i is the number of the node (either 2 or 3); on node 1 I do the same but with --job_name=ps and --task_index=0. However, this way it seems that only one CPU per node is used. Do I need to specify each CPU individually?
Thank you in advance.
As far as I understand, the best thing to do is to use all the CPUs on the same node together as a single worker, in order to make the most of the shared memory. So, in the case above, one would manually specify only 9 workers and make sure that each of them corresponds to one node on which all 36 CPUs are used. The commands to do this depend on the specific cluster used.
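For example, a rough TF 1.x-style sketch of letting one worker process use all the cores on its node (the thread counts and placeholder feature columns are assumptions to adapt, not tested on your cluster; the ps/worker cluster spec is still set up as in your command line):

import tensorflow as tf

# Let a single worker process use all 36 cores on its node; tune these counts for your cluster
session_config = tf.ConfigProto(
    intra_op_parallelism_threads=36,   # threads available to a single op (e.g. a matmul)
    inter_op_parallelism_threads=2)    # how many ops may run concurrently

run_config = tf.estimator.RunConfig(session_config=session_config)

# Placeholder feature columns standing in for the two real ones in my_code.py
feature_columns = [tf.feature_column.numeric_column("x1"),
                   tf.feature_column.numeric_column("x2")]

estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns,
                                         config=run_config)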

How to compute 'DynamoDB read throughput ratio' while setting up DataPipeline to export DynamoDB data to S3

I have a DynamoDB table with ~16M records, where each record is 4 KB in size. The table is configured for auto scaling with a target utilization of 70%, a minimum provisioned read capacity of 250, and a maximum provisioned write capacity of 3000.
I am trying to set up a Data Pipeline to back up the DynamoDB table to S3. The pipeline configuration asks for a Read Throughput Ratio, which is 0.25 by default.
So the question is how to compute the Read Throughput Ratio to back up the table in ~1 hour. I understand read capacity units, but how is the Read Throughput Ratio related to read capacity units and the auto scaling configuration?
Theoretically one RCU covers a 4 KB read, so if you divide your data volume by 4 KB you get the total number of read units required to read the complete data set. If you then divide that value by 60*60 (minutes * seconds) for a 1-hour window, you get the read rate per second, and from that the required RCU configuration, but take into account the time required to set up the EMR cluster.
But I am confused about how this will behave if auto scaling is configured on the table.
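Working through that arithmetic in Python with the numbers from the question (the provisioned figure at the end is hypothetical, and reading the ratio as "fraction of provisioned read capacity the export may consume" is my interpretation, so verify it against the Data Pipeline docs):

records = 16_000_000
record_size_kb = 4
rcu_size_kb = 4                     # one read unit covers up to 4 KB

total_read_units = records * record_size_kb / rcu_size_kb   # 16,000,000 unit-sized reads
backup_window_s = 60 * 60                                    # ~1 hour

reads_per_second = total_read_units / backup_window_s        # ~4,445 reads/s needed
print(reads_per_second)

# If the table were provisioned at, say, 5000 RCU for that hour (hypothetical figure),
# the ratio would be roughly reads_per_second / provisioned capacity:
provisioned_rcu = 5000
print(round(reads_per_second / provisioned_rcu, 2))          # ~0.89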