How to compute 'DynamoDB read throughput ratio' while setting up DataPipeline to export DynamoDB data to S3 - amazon-s3

I have a DynamoDB table with ~16M records, each about 4 KB in size. The table is configured for auto scaling with Target utilization: 70%, Minimum provisioned capacity for Reads: 250, and Maximum provisioned capacity for Writes: 3000.
I am trying to set up a Data Pipeline to back up the DynamoDB table to S3. The pipeline configuration asks for a Read Throughput Ratio, which is 0.25 by default.
So the question is how to compute the Read Throughput Ratio needed to back up the table in ~1 hour. I understand read capacity units. How is the Read Throughput Ratio related to Read Capacity Units and the auto scaling configuration?

In theory, one RCU covers one read of up to 4 KB per second, so dividing your data volume by 4 KB gives the total number of read units needed to read the complete data set. Dividing that value by 60*60 (minutes * seconds, i.e. the number of seconds in 1 hour) gives the RCU rate required to finish within the hour. Also take into account the time required to set up the EMR cluster.
However, I am not sure how this behaves when auto scaling is configured on the table.
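The arithmetic described above can be sketched in a few lines of Python. This is only a rough estimate under stated assumptions: items are read with strongly consistent reads at one read unit per 4 KB, the full hour is available for reading (EMR setup time is ignored), and the Read Throughput Ratio is interpreted as the fraction of the table's provisioned read capacity the export may consume. The function names are hypothetical and not part of any AWS SDK.

```python
import math

RCU_READ_SIZE_BYTES = 4 * 1024  # one read unit covers up to 4 KB


def required_read_rate(item_count, item_size_bytes, window_seconds,
                       strongly_consistent=True):
    """Estimate the read units per second needed to read every item once."""
    units_per_item = math.ceil(item_size_bytes / RCU_READ_SIZE_BYTES)
    total_units = item_count * units_per_item
    if not strongly_consistent:
        # Eventually consistent reads cost half as much.
        total_units /= 2
    return total_units / window_seconds


def throughput_ratio(required_rcu, provisioned_rcu):
    """Assumed meaning: share of provisioned read capacity the export uses."""
    return required_rcu / provisioned_rcu


# Figures from the question: ~16M items of 4 KB each, 1 hour window.
needed = required_read_rate(16_000_000, 4096, 60 * 60)
ratio_at_min_capacity = throughput_ratio(needed, 250)

print(f"required read rate: {needed:.0f} RCU")
print(f"ratio against 250 provisioned RCU: {ratio_at_min_capacity:.2f}")
```

Under these assumptions, a computed ratio above 1 means the provisioned read capacity alone cannot cover the backup window, so either the capacity or the window would have to grow.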

Related

Terminology used in Nsight Compute

Two questions:
According to Nsight Compute, my kernel is compute bound. The SM % of utilization relative to peak performance is 74% and the memory utilization is 47%. However, when I look at each pipeline utilization percentage, LSU utilization is way higher than others (75% vs 10-15%). Wouldn't that be an indication that my kernel is memory bound? If the utilization of compute and memory resources doesn't correspond to pipeline utilization, I don't know how to interpret those terms.
The schedulers are only issuing every 4 cycles. Wouldn't that mean that my kernel is latency bound? People usually define that in terms of the utilization of compute and memory resources. What is the relationship between the two?
In Nsight Compute on CC7.5 GPUs
SM% is defined by sm__throughput, and
Memory% is defined by gpu__compute_memory_throughput
sm__throughput is the MAX of the following metrics:
sm__instruction_throughput
sm__inst_executed
sm__issue_active
sm__mio_inst_issued
sm__pipe_alu_cycles_active
sm__inst_executed_pipe_cbu_pred_on_any
sm__pipe_fp64_cycles_active
sm__pipe_tensor_cycles_active
sm__inst_executed_pipe_xu
sm__pipe_fma_cycles_active
sm__inst_executed_pipe_fp16
sm__pipe_shared_cycles_active
sm__inst_executed_pipe_uniform
sm__instruction_throughput_internal_activity
sm__memory_throughput
idc__request_cycles_active
sm__inst_executed_pipe_adu
sm__inst_executed_pipe_ipa
sm__inst_executed_pipe_lsu
sm__inst_executed_pipe_tex
sm__mio_pq_read_cycles_active
sm__mio_pq_write_cycles_active
sm__mio2rf_writeback_active
sm__memory_throughput_internal_activity
gpu__compute_memory_throughput is the MAX of the following metrics:
gpu__compute_memory_access_throughput
l1tex__data_bank_reads
l1tex__data_bank_writes
l1tex__data_pipe_lsu_wavefronts
l1tex__data_pipe_tex_wavefronts
l1tex__f_wavefronts
lts__d_atomic_input_cycles_active
lts__d_sectors
lts__t_sectors
lts__t_tag_requests
gpu__compute_memory_access_throughput_internal_activity
gpu__compute_memory_access_throughput
l1tex__lsuin_requests
l1tex__texin_sm2tex_req_cycles_active
l1tex__lsu_writeback_active
l1tex__tex_writeback_active
l1tex__m_l1tex2xbar_req_cycles_active
l1tex__m_xbar2l1tex_read_sectors
lts__lts2xbar_cycles_active
lts__xbar2lts_cycles_active
lts__d_sectors_fill_device
lts__d_sectors_fill_sysmem
gpu__dram_throughput
gpu__compute_memory_request_throughput_internal_activity
In your case the limiter is sm__inst_executed_pipe_lsu, which is an instruction throughput. If you review sections/SpeedOfLight.py, latency bound is defined as having both sm__throughput and gpu__compute_memory_throughput < 60%.
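The latency-bound rule quoted above can be written as a small check. This is a minimal sketch, not the actual contents of sections/SpeedOfLight.py: the function name is hypothetical, and only the 60% threshold comes from the answer.

```python
LATENCY_BOUND_THRESHOLD_PCT = 60.0  # threshold quoted in the answer


def is_latency_bound(sm_throughput_pct, memory_throughput_pct,
                     threshold_pct=LATENCY_BOUND_THRESHOLD_PCT):
    """True when both SM% and Memory% are below the threshold."""
    return (sm_throughput_pct < threshold_pct
            and memory_throughput_pct < threshold_pct)


# Values from the question: SM% = 74, Memory% = 47.
print(is_latency_bound(74.0, 47.0))
```

With the numbers from the question, SM% is above the threshold, so the kernel does not meet this definition of latency bound.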
Some set of instruction pipelines have lower throughput such as fp64, xu, and lsu (varies with chip). The pipeline utilization is part of sm__throughput. In order to improve performance the options are:
Reduce instructions to the oversubscribed pipeline, or
Issue instructions of different type to use empty issue cycles.
GENERATING THE BREAKDOWN
As of Nsight Compute 2020.1 there is no simple command line to generate the list without running a profiling session. For now you can collect one throughput metric using breakdown:<throughput metric>.avg.pct_of_peak_sustained_elapsed and parse the output to get the sub-metric names.
For example:
ncu.exe --csv --metrics breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed --details-all -c 1 cuda_application.exe
generates:
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
...
The keyword breakdown can be used in Nsight Compute section files to expand a throughput metric. This is used in the SpeedOfLight.section.
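The parsing step mentioned above could look roughly like the following Python sketch. It assumes the input is exactly the CSV text shown in the example output (any non-CSV lines the profiler prints would have to be stripped first), and the helper name sub_metric_names is hypothetical.

```python
import csv
import io

SUFFIX = ".avg.pct_of_peak_sustained_elapsed"

# The header and rows are copied from the example output above.
SAMPLE = """\
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
"""


def sub_metric_names(csv_text):
    """Return the unique sub-metric base names, in order of appearance."""
    names = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        metric = row["Metric Name"]
        if metric.endswith(SUFFIX):
            metric = metric[: -len(SUFFIX)]  # drop the rollup suffix
        if metric not in names:
            names.append(metric)
    return names


for name in sub_metric_names(SAMPLE):
    print(name)
```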

What are Redis timeseries module restrictions

I couldn't find this in any official document: what are the Redis timeseries module's restrictions in terms of the following?
Max number of labels which can be added?
Max size of keys?
Max number of keys or time series?
Please let me know
Max number of labels which can be added?
A. There is no hard limit but you might experience some performance degradation when querying a very large number of labels.
Max size of keys?
A. There is no limit as long as you have memory available for the module. It is recommended to downsample your data and retire raw data by using the RETENTION option. Data compression was recently added to the module which reduces memory footprint significantly.
Max number of keys or time series?
A. There is no limit beyond the usual limits on Redis itself.

Scylladb : scylla_io_setup script iops calculation not matching with other iops calculation tool like fio

I have a 3-node containerized Scylla cluster. When a Scylla container launches, Scylla calculates the IOPS (read and write) for the mounted disk. The IOPS calculated by scylla_io_setup are much lower than those reported by the fio tool.
IOPS numbers from scylla_io_setup
disks:
  - mountpoint: /var/lib/scylla
    read_iops: 264
    read_bandwidth: 84052448
    write_iops: 1197
    write_bandwidth: 129923792
IOPS using fio
read_iops: 70000
write_iops: 60000
Many write requests to Scylla (around 6k, according to Grafana) are blocking on the commitlog, and I am seeing write latency issues with Scylla.

How is BigQuery slot quota calculated?

We are considering moving to flat rate pricing for BigQuery, but it is unclear from the documentation how slot utilization is computed.
You pay a monthly rate for flat-rate pricing, and if I look at our slot utilization over a month in Stackdriver, it is consistently reported as under 500 slots. But if I switch to graphing the daily utilization, we sometimes peak at over 2000 slots.
So are the allocated slots we are allowed to use measured against average or peak usage?
Allocated slots are counted against peak usage. With a 500 slot quota you can never utilize more than 500 slots at the same time. The result is that your query takes longer to run.
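As a rough illustration of why a cap on concurrent slots stretches runtime, here is a simplified Python sketch. It assumes an idealized model in which a query's work is a fixed number of slot-seconds that parallelizes perfectly; real BigQuery scheduling is more complex, and the function name and workload figures are hypothetical.

```python
def min_runtime_seconds(slot_seconds, slot_cap):
    """Lower bound on runtime when at most slot_cap slots run at once."""
    if slot_cap <= 0:
        raise ValueError("slot_cap must be positive")
    return slot_seconds / slot_cap


# Hypothetical workload: it would keep 2000 slots busy for 60 seconds.
work = 2000 * 60

uncapped = min_runtime_seconds(work, 2000)
capped = min_runtime_seconds(work, 500)

print(f"with 2000 slots: {uncapped:.0f} s, with 500 slots: {capped:.0f} s")
```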

Why no linear scaling of Redis Cluster

I am trying to build a horizontally scalable system based on Redis Cluster, so I measured the throughput of Redis Cluster with different numbers of nodes. However, the measured results don't show the linear scalability that the cluster spec states: “High performance and linear scalability up to 1000 nodes.”
redis cluster benchmark:
The image above shows the measured results for Redis Cluster with (3+3), (4+4), (5+5), (6+6), (8+8), (10+10) and (12+12) nodes, where (3+3) means 3 master nodes plus 3 slave nodes. The results for C (create) and U (update) don't show the linear scalability of Redis Cluster.
I'd like to know why these measured results don't show linear scalability. Is there anything that could be limiting the scaling?
My test environment and related information are described below.
Server
HW: HP BL460c G9, 24 CPU (E5-2620 v3 @ 2.40GHz), 64G memory, 300G disk
I have two machines. To find the capacity of a single machine, I run all master nodes on one machine and all slave nodes on the other. All Redis nodes are included in one Redis Cluster.
OS: SLES 12
I have updated some system settings to achieve higher performance.
echo 65535 > /proc/sys/net/core/somaxconn
echo 65535 > /proc/sys/net/ipv4/tcp_max_syn_backlog
echo never > /sys/kernel/mm/transparent_hugepage/enabled
sysctl vm.overcommit_memory=1
sysctl vm.swappiness=0
Furthermore, I've turned off swap, which could cause very unstable throughput during AOF rewrites even with swappiness already set to 0. As observed, the 15 million records in my test occupy around 48G of memory.
Redis 3.0.6: To eliminate the bursts caused by RDB, I turned off all RDB snapshots and enabled only AOF. All other settings in redis.conf are left at their default values.
Client
HW: HP DL380 G7, 16 CPU (E5620 @ 2.40GHz), 24G memory, 600G disk
OS: SLES 12
YCSB (0.6.0) with jedis (2.8.0)
I use hash keys to store the records (1 key and 21 fields each) and N sorted sets to store all keys and their random scores. Here N is the number of master nodes in the cluster. The N sorted sets are distributed evenly across the master nodes.
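To check how keys such as the N sorted sets land on the masters, the key-to-slot mapping from the Redis Cluster specification (CRC16 of the key modulo 16384, with hash-tag handling) can be sketched in Python. This is only an illustration: the key names below are hypothetical, and which master owns a given slot depends on the cluster's slot assignment, which is not shown here.

```python
import binascii

HASH_SLOTS = 16384


def key_slot(key):
    """Compute the Redis Cluster hash slot for a key."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty {tag} restricts hashing to the tag content.
        if end != -1 and end > start + 1:
            data = data[start + 1:end]
    # crc_hqx with initial value 0 is the CRC16 (XMODEM) variant
    # used by the Redis Cluster specification.
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


# Hypothetical sorted-set names; print where each one would be placed.
for name in ("index:0", "index:1", "index:2"):
    print(name, key_slot(name))
```

If several sorted sets map to slots owned by the same master, that node receives a larger share of the update traffic, so it may be worth verifying the placement rather than assuming it is even.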
The YCSB workload configuration is pasted below:
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=15000000
operationcount=150000000
insertstart=0
fieldcount=21
fieldlength=188
readallfields=true
writeallfields=false
fieldlengthdistribution=zipfian
readproportion=0.0
updateproportion=1.0
insertproportion=0
readmodifywriteproportion=0.0
scanproportion=0
maxscanlength=1000
scanlengthdistribution=uniform
insertorder=hashed
requestdistribution=zipfian
hotspotdatafraction=0.2
hotspotopnfraction=0.8
table=subscriber
measurementtype=histogram
histogram.buckets=1000
timeseries.granularity=1000
In most cases, the computing resources look sufficient to me, even though the throughput has already hit its limit.
CPU: there is plenty of CPU left, 60~70% idle.
I/O usage: not very busy, 30~40% utilization at peak time.
Memory: only memory comes close to being exhausted at peak time, i.e. when an AOF rewrite happens. Most of the time it's around 80%.