Two questions:
According to Nsight Compute, my kernel is compute bound. The SM % of utilization relative to peak performance is 74% and the memory utilization is 47%. However, when I look at each pipeline utilization percentage, LSU utilization is way higher than others (75% vs 10-15%). Wouldn't that be an indication that my kernel is memory bound? If the utilization of compute and memory resources doesn't correspond to pipeline utilization, I don't know how to interpret those terms.
The schedulers are only issuing every 4 cycles. Wouldn't that mean that my kernel is latency bound? People usually define that in terms of utilization of compute and memory resources. What is the relationship between the two?
In Nsight Compute on CC7.5 GPUs:
SM% is defined by sm__throughput, and
Memory% is defined by gpu__compute_memory_throughput
sm__throughput is the MAX of the following metrics:
sm__instruction_throughput
    sm__inst_executed
    sm__issue_active
    sm__mio_inst_issued
    sm__pipe_alu_cycles_active
    sm__inst_executed_pipe_cbu_pred_on_any
    sm__pipe_fp64_cycles_active
    sm__pipe_tensor_cycles_active
    sm__inst_executed_pipe_xu
    sm__pipe_fma_cycles_active
    sm__inst_executed_pipe_fp16
    sm__pipe_shared_cycles_active
    sm__inst_executed_pipe_uniform
    sm__instruction_throughput_internal_activity
sm__memory_throughput
    idc__request_cycles_active
    sm__inst_executed_pipe_adu
    sm__inst_executed_pipe_ipa
    sm__inst_executed_pipe_lsu
    sm__inst_executed_pipe_tex
    sm__mio_pq_read_cycles_active
    sm__mio_pq_write_cycles_active
    sm__mio2rf_writeback_active
    sm__memory_throughput_internal_activity
gpu__compute_memory_throughput is the MAX of the following metrics:
gpu__compute_memory_access_throughput
    l1tex__data_bank_reads
    l1tex__data_bank_writes
    l1tex__data_pipe_lsu_wavefronts
    l1tex__data_pipe_tex_wavefronts
    l1tex__f_wavefronts
    lts__d_atomic_input_cycles_active
    lts__d_sectors
    lts__t_sectors
    lts__t_tag_requests
    gpu__compute_memory_access_throughput_internal_activity
gpu__compute_memory_request_throughput
    l1tex__lsuin_requests
    l1tex__texin_sm2tex_req_cycles_active
    l1tex__lsu_writeback_active
    l1tex__tex_writeback_active
    l1tex__m_l1tex2xbar_req_cycles_active
    l1tex__m_xbar2l1tex_read_sectors
    lts__lts2xbar_cycles_active
    lts__xbar2lts_cycles_active
    lts__d_sectors_fill_device
    lts__d_sectors_fill_sysmem
    gpu__dram_throughput
    gpu__compute_memory_request_throughput_internal_activity
In your case the limiter is sm__inst_executed_pipe_lsu, which is an instruction throughput metric and therefore counts toward sm__throughput rather than toward memory throughput. If you review sections/SpeedOfLight.py, latency bound is defined as having both sm__throughput and gpu__compute_memory_throughput below 60%.
Some instruction pipelines, such as fp64, xu, and lsu, have lower throughput than others (this varies by chip). Pipeline utilization is part of sm__throughput. To improve performance, the options are:
Reduce instructions to the oversubscribed pipeline, or
Issue instructions of different type to use empty issue cycles.
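The 60% rule mentioned above can be sketched in a few lines of Python. This is only an illustration, not the code from sections/SpeedOfLight.py: the function name is hypothetical, and labelling the remaining cases by whichever percentage is higher is an assumption of this sketch.

```python
def classify_speed_of_light(sm_pct, mem_pct, threshold=60.0):
    """Classify a kernel from its SM% and Memory% values.

    The 60% threshold is the one quoted in the answer; how the other
    cases are labelled is an assumption made for illustration.
    """
    if sm_pct < threshold and mem_pct < threshold:
        return "latency bound"
    return "compute bound" if sm_pct >= mem_pct else "memory bound"


# Values from the question: SM% = 74, Memory% = 47.
print(classify_speed_of_light(74.0, 47.0))
```

With the numbers from the question the kernel lands in the compute bound case because sm__throughput, driven by the LSU pipe, is above the threshold.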
GENERATING THE BREAKDOWN
As of Nsight Compute 2020.1 there is no simple command line that generates the list without running a profiling session. For now you can collect one throughput metric using breakdown:<throughput metric>.avg.pct_of_peak_sustained_elapsed and parse the output to get the sub-metric names.
For example:
ncu.exe --csv --metrics breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed --details-all -c 1 cuda_application.exe
generates:
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
...
The keyword breakdown can be used in Nsight Compute section files to expand a throughput metric. This is used in the SpeedOfLight.section.
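Parsing the CSV output for the sub-metric names could look roughly like the following. It is a minimal sketch that assumes the text starts at the CSV header row (any profiler log lines printed before it would need to be stripped first); the function name is hypothetical and the sample rows are the ones shown above.

```python
import csv
import io


def breakdown_metric_names(csv_text):
    """Return the base names of the metrics listed in the CSV output."""
    names = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Drop the ".avg.pct_of_peak_sustained_elapsed" suffix.
        base = row["Metric Name"].split(".", 1)[0]
        if base not in names:
            names.append(base)
    return names


sample = '''\
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
'''

print(breakdown_metric_names(sample))
```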
I couldn't find this in any official documentation. What are the RedisTimeSeries module's limits in terms of the following:
Max number of labels which can be added?
Max size of keys?
Max number of keys or time series?
Please let me know
Max number of labels which can be added?
A. There is no hard limit but you might experience some performance degradation when querying a very large number of labels.
Max size of keys?
A. There is no limit as long as you have memory available for the module. It is recommended to downsample your data and retire raw data by using the RETENTION option. Data compression was recently added to the module which reduces memory footprint significantly.
Max number of keys or time series?
A. There is no limit beyond the usual limits on Redis itself.
We are considering moving to flat rate pricing for BigQuery, but it is unclear from the documentation how slot utilization is computed.
Flat rate is billed monthly, and if I look at our slot utilization over a month in Stackdriver it is consistently reported under 500 slots. But if I switch to graphing the daily utilization, we sometimes peak over 2000 slots.
So are the allocated slots we are allowed to use measured against average or peak usage?
Allocated slots are counted against peak usage. With a 500 slot quota you can never utilize more than 500 slots at the same time. The result is that your query takes longer to run.
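As a rough illustration of why a cap on concurrent slots stretches run time, here is a toy model. It assumes the work is perfectly divisible and that total slot-time stays constant, which real queries only approximate; the function name and the 10-minute duration are hypothetical, while the 2000 and 500 slot figures come from the question.

```python
def min_runtime_minutes(slots_wanted, minutes_uncapped, slot_cap):
    """Lower bound on run time when concurrent slots are capped.

    Toy model: total slot-minutes are conserved and the work can be
    spread evenly over however many slots are available.
    """
    slot_minutes = slots_wanted * minutes_uncapped
    usable_slots = min(slots_wanted, slot_cap)
    return slot_minutes / usable_slots


# A burst that would use 2000 slots for 10 minutes, run under a 500-slot cap.
print(min_runtime_minutes(2000, 10, 500))  # 40.0
```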
I have a DynamoDB table with ~16M records where each record is of size 4k. The table is configured for autoscaling with Target utilization: 70%, Minimum provisioned capacity for Reads: 250 and Maximum provisioned capacity for Writes: 3000.
I am trying to set up a Data Pipeline to back up the DynamoDB table to S3. The pipeline configuration asks for a Read Throughput Ratio, which is 0.25 by default.
So the question is how to compute the Read Throughput Ratio to back up the table in ~1 hour. I understand read capacity units. How is the Read Throughput Ratio related to Read Capacity Units and the Auto Scaling configuration?
Theoretically, one RCU covers a read of up to 4 KB per second, so dividing your data volume by 4 KB gives the total RCUs required to read the complete data within a single second. Dividing this value by 60*60 (minutes*seconds) for 1 hour gives the required RCU configuration, but take into account the time required to set up the EMR cluster.
But I am confused about how this will behave if auto scaling is configured on the table.
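The arithmetic above can be worked through as follows. This is a sketch under stated assumptions: "4k" is taken to mean 4096 bytes, reads are strongly consistent unless stated otherwise, and the Read Throughput Ratio is treated as the fraction of the table's provisioned read capacity that the export is allowed to consume. The function names are hypothetical, and auto scaling is ignored.

```python
import math


def required_read_capacity(item_count, item_size_bytes, duration_s,
                           strongly_consistent=True):
    """Estimate the RCUs per second needed to read every item once."""
    # One RCU covers one strongly consistent read per second of up to 4 KB.
    units_per_item = math.ceil(item_size_bytes / 4096)
    total_units = item_count * units_per_item
    if not strongly_consistent:
        total_units /= 2  # eventually consistent reads cost half
    return total_units / duration_s


def scan_duration_hours(total_units, provisioned_rcu, ratio):
    """Hours to read total_units when only ratio * provisioned_rcu is used.

    Assumes the ratio is a fraction of provisioned read capacity.
    """
    return total_units / (provisioned_rcu * ratio) / 3600


needed = required_read_capacity(16_000_000, 4096, 60 * 60)
print(round(needed))  # about 4444 RCUs sustained for one hour

# With the minimum of 250 provisioned RCUs and the default ratio of 0.25:
print(scan_duration_hours(16_000_000, 250, 0.25))  # roughly 71 hours
```

Under these assumptions, finishing in about an hour needs the product of provisioned read capacity and ratio to be near 4444, before allowing for EMR cluster start-up time.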
I want to create a CloudWatch alarm that triggers autoscaling based on more than one metric. Since this is not natively supported by CloudWatch (correct me if I am wrong), I was wondering how to work around it.
Can we get the data from different metrics, say CPUUtilization, NetworkIn and NetworkOut, and then use mon-put-data to publish a new custom metric built from them, on which to trigger autoscaling?
You can now make use of CloudWatch Metric Math.
Metric math enables you to query multiple CloudWatch metrics and use
math expressions to create new time series based on these metrics. You
can visualize the resulting time series in the CloudWatch console and
add them to dashboards.
More information regarding Metric Math syntax and functions is available here:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html#metric-math-syntax
However, note that there are no logical operators, so you have to build the condition out of arithmetic functions.
To help anyone who lands here, here is an example:
Let's say you want to trigger an alarm if CPUUtilization < 20% and MemoryUtilization < 30%.
m1 = Avg CPU Utilization % for 5mins
m2 = Avg Mem Utilization % for 5mins
Then:
Avg. CPU Utilization % < 20 for 5 mins AND Avg Mem Utilization % < 30 for 5mins ... (1)
is the same as
(m1 - 20) / ABS(m1 - 20) + (m2 - 30) / ABS(m2 - 30) < 0 ... (2)
So, define your two metrics and build a metric query which looks like the LHS of expression (2) above. Set your threshold to 0 and set the comparison operator to LessThanThreshold.
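A quick way to convince yourself that expression (2) behaves like a logical AND is to evaluate it for the four possible cases. The snippet below is plain Python, not CloudWatch syntax; the function name is hypothetical. Each term is the sign of (metric - threshold), so the sum is negative only when both metrics are below their thresholds.

```python
def sign_sum(m1, m2, t1=20.0, t2=30.0):
    """Left-hand side of expression (2): sum of the signs of (m - t)."""
    return (m1 - t1) / abs(m1 - t1) + (m2 - t2) / abs(m2 - t2)


for m1, m2 in [(10, 10), (10, 50), (50, 10), (50, 50)]:
    lhs = sign_sum(m1, m2)
    # Only the first case (both below threshold) gives lhs < 0.
    print(m1, m2, lhs, lhs < 0)
```

Note that a metric exactly equal to its threshold makes the denominator zero. In this Python sketch that raises ZeroDivisionError; how the alarm treats such a data point is not covered here.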
Yes. CloudWatch alarms can only trigger on a single CloudWatch metric, so you would need to publish your own 'aggregate' custom metric and alarm on that, as you suggest yourself.
Here is a blog post describing using custom metrics to trigger autoscaling.
http://www.thatsgeeky.com/2012/01/autoscaling-with-custom-metrics/
This is supported now. See
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html for details.
As an example, you can use something like (CPU Utilization>80) OR (MEMORY Consumed>55)