how to get the current consumed cpu % of vmhost in vcenter using powershell - virtual-machine

How to get the current consumed CPU% of vmhost in vcenter using powershell script.
Below command doesn't gives similar output what we checked manually.
Get-Stat -Entity $command1 -Stat cpu.usagemhz.average -Realtime -MaxSamples 1

Get-Stat -Entity $myHost -Stat cpu.usage.average -Realtime -MaxSamples 1 -Instance ""
From VMware's doc on this cpu usage perf counter:
Actively used CPU, as a percentage of the total available CPU, for
each physical CPU on the host. Active CPU is approximately equal to
the ratio of the used CPU to the available CPU.
Available CPU = # of physical CPUs × clock rate.
100% represents all CPUs on the host. For example, if a four-CPU host
is running a virtual machine with two CPUs, and the usage is 50%, the
host is using two CPUs completely
Explanations from Luc Dekens around the -Instance filter...
If the ESX/ESXi server is equipped with a quadcore CPU, there will be
four instances: 0, 1, 2 and 3. In this case the instance corresponds
with the numeric position within the CPU core
And there will be a so-called aggregate, which is the metric averaged
over all the instances.
These instances each get their own identifier which will be part of
the returned statistical data. The aggregate instance is always
represented by a blank identifier.
...and -MaxSamples
Although I asked for 1 sample (-MaxSamples 1) the cmdlet returned 9
values. The -MaxSamples parameter apparently only looks at the
Timestamp. It doesn’t count the number of returned values

Related

Opencl Maximum Size of Private memory per Work Item

I Have an AMD RX 570 4G,
Opencl tells me that I can use a Maximum of 256 Workgroup and 256 WorkItem per group...
Let's say I use all 256 Workgroup with 256 WorkItem in each of them,
Now, What is the Maximum Size of private memory per work item?
Is Private memory Equal to Total VRAM(4GB) Divided by Total Work Items(256x256)?
Or is it equal to Cache if so, how?
VRAM is represented in OpenCL as global memory.
Private memory is initially allocated from the register file. Your RX 570 is from AMD's Polaris architecture, a.k.a. GCN 4 where each compute unit (64 shader processors) has access to 256 vector (SIMD) registers (64x32 bits wide) and 512 32-bit scalar registers. So that works out to about 66KiB per CU, but it's not as simple as just quoting that total.
A workgroup will always be scheduled on a single compute unit, so if you assign it 256 work items, then it will have to perform every vector instruction 4 times in sequence (64 x 4 = 256) and the vector registers will (simplifying slightly) effectively have to be treated as 64 256-entry registers.
Scalar registers are used for data and calculations which are identical on each work item, e.g. incrementing a loop counter, holding buffer base pointers, etc.
Private memory will usually spill to global if you use more than will fit in your register file. So performance simply drops.
So essentially, on GCN, your optimal workgroup size is usually 64. Use as little private memory as possible; definitely aim for less than half of the available register file so that more than one workgroup can be scheduled so latency from memory access can be papered over, otherwise your shader cores will be spending a lot of time just waiting for data to arrive or be written out.
Cache is used for OpenCL local and constant memory spaces. (Constant will again spill to global if you try to use too much. The size of local memory can be checked via the OpenCL API and again is divided among workgroups scheduled on the same compute unit, so if you use more than half, only one group can run on a CU, etc.)
I don't know where you're getting a limit of 256 workgroups from, the limit is essentially set by whether the GPU uses 32-bit or 64-bit addressing. Most applications won't get close to 4bn work items even in the 32-bit case.
Private memory space is registers on the GPU die (0 cycle access latency) and not related to the amount of VRAM (global memory space) at all. The amount of private memory depends on the device (private memory per compute unit).
I don't know private memory size for the RX 570, but for older HD7000 series GPUs it is 256kB per CU. If you have a work group size of 256, you get 1kB per work item, which is equal to 256 float variables.
Cache size determines the size of local and constant memory space.

How to get multi GPUs same type on slurm?

How can I create a job with a multi GPU of the same type but not specific that type directly? My experiment has a constraint that all GPUs have the same type but this type can be whatever we want.
Currently I am able only to create a experiment with multi GPUs with telling exactly what type I want:
--gres=gpu:gres_type:amount
If I don't specify gres_type, then sometimes I get mixed GPUs packs (let say 2x titan V and 2x titan X).
If you are fortunate enough that the cluster is consistent in the types of nodes that host the GPUs, and that the features of the nodes a properly specified and allow distinguishing between the nodes that host the different GPU types, you can use the --constraint parameter.
For the sake of the argument, let's assume that the nodes that host the titanV have haswell CPUs, and those that host the titanX have skylake CPUs and that those are defined as features. Then, you can request
--gres=gpu:2
--constraint=[haswell|skylake]
If the above does not apply to your use case, you can submit two jobs and keep only the one that starts the earliest. For that, give your jobs an identical name, and use the singleton dependency.
Write a submission script like this one
#!/bin/bash
#SBATCH --dependency=singleton
#SBATCH --job-name=gpujob
# Other options
scancel --state=PENDING --jobname=gpujob
# etc.
and submit it twice with
$ sbatch --gres=gpu:titanX:2 submit.sh
$ sbatch --gres=gpu:titanV:2 submit.sh
Each job will be assigned only one type of GPU, and the first one that starts will cancel the other one. This approach can scale up with more than two GPU types.

Terminology used in Nsight Compute

Two questions:
According to Nsight Compute, my kernel is compute bound. The SM % of utilization relative to peak performance is 74% and the memory utilization is 47%. However, when I look at each pipeline utilization percentage, LSU utilization is way higher than others (75% vs 10-15%). Wouldn't that be an indication that my kernel is memory bound? If the utilization of compute and memory resources doesn't correspond to pipeline utilization, I don't know how to interpret those terms.
The schedulers are only issuing every 4 cycles, wouldn't that mean that my kernel is latency bound? People usually define that in terms of utilization of compute and memory resources. What is the relationship between both?
In Nsight Compute on CC7.5 GPUs
SM% is defined by sm__throughput, and
Memory% is defined by gpu__compute_memory_throughtput
sm_throughput is the MAX of the following metrics:
sm__instruction_throughput
sm__inst_executed
sm__issue_active
sm__mio_inst_issued
sm__pipe_alu_cycles_active
sm__inst_executed_pipe_cbu_pred_on_any
sm__pipe_fp64_cycles_active
sm__pipe_tensor_cycles_active
sm__inst_executed_pipe_xu
sm__pipe_fma_cycles_active
sm__inst_executed_pipe_fp16
sm__pipe_shared_cycles_active
sm__inst_executed_pipe_uniform
sm__instruction_throughput_internal_activity
sm__memory_throughput
idc__request_cycles_active
sm__inst_executed_pipe_adu
sm__inst_executed_pipe_ipa
sm__inst_executed_pipe_lsu
sm__inst_executed_pipe_tex
sm__mio_pq_read_cycles_active
sm__mio_pq_write_cycles_active
sm__mio2rf_writeback_active
sm__memory_throughput_internal_activity
gpu__compute_memory_throughput is the MAX of the following metrics:
gpu__compute_memory_access_throughput
l1tex__data_bank_reads
l1tex__data_bank_writes
l1tex__data_pipe_lsu_wavefronts
l1tex__data_pipe_tex_wavefronts
l1tex__f_wavefronts
lts__d_atomic_input_cycles_active
lts__d_sectors
lts__t_sectors
lts__t_tag_requests
gpu__compute_memory_access_throughput_internal_activity
gpu__compute_memory_access_throughput
l1tex__lsuin_requests
l1tex__texin_sm2tex_req_cycles_active
l1tex__lsu_writeback_active
l1tex__tex_writeback_active
l1tex__m_l1tex2xbar_req_cycles_active
l1tex__m_xbar2l1tex_read_sectors
lts__lts2xbar_cycles_active
lts__xbar2lts_cycles_active
lts__d_sectors_fill_device
lts__d_sectors_fill_sysmem
gpu__dram_throughput
gpu__compute_memory_request_throughput_internal_activity
In your case the limiter is sm__inst_executed_pipe_lsu which is an instruction throughput. If you review sections/SpeedOfLight.py latency bound is defined as having both sm__throughput and gpu__compute_memory_throuhgput < 60%.
Some set of instruction pipelines have lower throughput such as fp64, xu, and lsu (varies with chip). The pipeline utilization is part of sm__throughput. In order to improve performance the options are:
Reduce instructions to the oversubscribed pipeline, or
Issue instructions of different type to use empty issue cycles.
GENERATING THE BREAKDOWN
As of Nsight Compute 2020.1 there is not a simple command line to generate the list without running a profiling session. For now you can collect one throughput metric using breakdown:<throughput metric>avg.pct_of_peak_sustained.elapsed and parse the output to get the sub-metric names.
For example:
ncu.exe --csv --metrics breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed --details-all -c 1 cuda_application.exe
generates:
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
...
The keyword breakdown can be used in Nsight Compute section files to expand a throughput metric. This is used in the SpeedOfLight.section.

headache for clEnqueueNDRangeKernel local work size

For opencl optimization, my idea is try to make match for
1 workgroup(kernel coding) as compute unit(GPU Hardware)
1 workitem(kernel coding) as process element(GPU Hardware)
( Maybe my idea is not correct, please teach me )
for example:
1. I have a global work size of 4000 by 3000.
2. My GPU opnecl device has a maximum work-group-size of 8192.
3. I call clEnqueueNDRangeKernel with the desired local-work-size (along with all other necessary parameters)
4. by fucntion call:
a. clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), (void*)&workGroupSizeUsed, NULL);
b. clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), (void*)&workGroupSizeUsed, NULL);
above a and b are return 8192.
maximum work-group-size, CL_KERNEL_WORK_GROUP_SIZE, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE all are 8192.
I have no idea what I should follow to define my local work size...
(Q1)Any good idea for setting the local work size? (10x10? 40x30, X by Y )
clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_item_size, local_work_item_size, 0, NULL, NULL);
Very headache to define this "local_work_item_size" of clEnqueueNDRangeKernel function.
(Q2)
Could some one explain the difference if I set local work size = 1,1 between
local work size = 4000,3000 ?
Thank you in advance!
(Q1)Any good idea for setting the local work size? (10x10? 40x30, X by Y )
As pmdj pointed out, this highly depends on your application. Since it is unclear how you selected your global_work_size and it is also linked to the local_work_size I would like to explain that one first. Usually what you would want to do is to map the size of the data you want to process to the global_work_size. E.g. if you have an array with 1024 values you would also pick a global_work_size of 1024 because then you can easily use the global id as an index in your OpenCL program:
int index = get_global_id(0);
input_array[index]++; // your data processing
However, the global_work_size is limited to a maximum 2^32 - 1. If you have more data to process than that you can pass your global_work_size and data size as parameters and use a loop like the following one:
int index = get_global_id(0);
for (int i = index; i < data_size; i += global_work_size) {
input_array[i]++; // your data processing
}
The last fact which is important for the global_work_size is that it needs to be dividable by the local_work_size. This can result into a your global_work_size being bigger than your data size, e.g. you could have 1000 values while your local_work_size is 32. Then you would make your global_work_size 1024 and ensure through a condition like the one above (i < data_size) that the redundant work items are not doing anything weird like accessing not allocated memory areas.
The local_work_size depends on your platform. First of all you should always have a local_work_size which is a multiple of 32 for NVIDIA or a multiple of 64 for AMD GPUs. This represents the amount of operations which are scheduled together anyway. If you use a different number the GPU will have idle threads which won't do anything but decrease your performance.
Not only the manufacturer but also the specific type of your GPU has to be considered to find the optimal local_work_size. The global_work_size divided by the local_work_size is the number of work groups. Each work group is executed by one thread inside your CPU/GPU. If you use OpenCL to run your application on powerful hardware you want to make sure that it runs as parallel as possible. E.g. if you use an Intel i7 with 8 threads you would want to make sure that you have at least 8 work groups (global_work_size / local_work_size >= 8). If you use a NVIDIA GeForce GTX 1060 with 1280 CUDA Cores you would want to have at least 1280 work groups. But never at the cost of having a local_work_size of less than 32 which is more important!
If you are having more work groups than your hardware has threads that does not matter, they will be processed sequentially. Hence for most applications you can always set your local_work_size to 32/64. The only exception is if you require synchronization among more than work items. E.g. barriers only work inside work groups but not among different work groups. An example: If you need to to sum up chunks of 1024 values before being able to proceed with your algorithm you would need to set your local_work_size to 1024 for the barrier to work as desired.
(Q2) Could some one explain the difference if I set local work size = 1,1 between local work size = 4000,3000 ?
Both, the global_work_size and the local_work_size can have more than one dimension. If this is used or not solely depends on the preference of the programmer. All algorithms can be implemented in one dimension as well and the number of work groups is calculated by multiplying the dimensions, e.g. if your global_work_size is 20*20 and your local_work_size is 10*10 you would run the program with (20*20) / (10*10) = 400 work groups.
I personally like to use the dimensions if I am processing data which has multiple dimensions. Imagine your input is a two-dimensional image, you could simply use its width and height as global_work_size (e.g. 1024 * 1024) and the local_work_size accordingly (e.g. 32 * 32). In your code you could then use the following indices:
int x = get_global_id(0);
int y = get_global_id(1);
input_array[x][y]++; // your data processing

Why no linear scaling of Redis Cluster

I am trying to build one horizontal scalability system based on Redis Cluster. So I've measured the throughput of Redis Cluster with different nodes. But finally, the measured result doesn't show the linear scalability as the cluster spec stated, “High performance and linear scalability up to 1000 nodes.”
redis cluster benchmark:
The image above shows the measure result of redis cluster of (3+3), (4+4), (5+5), (6+6), (8+8), (10+10) and (12+12). (3+3) means 3 master node plus 3 slave nodes. The result of C (reate) and you (update) don't show the linear scalability of redis cluster as following picture.
I'd like to know why these measured result don't show the linear scalability. Is there any possible reason to limit the scaling?
My test environment and related information are described as below
Server
HW: HP BL460c G9, 24 CPU (E5-2620 v3 #2.40GHz), 64G memory, 300G disk
I have two machines. In order to know the capacity of one HW machine, I run all master nodes on one machine and all slaves nodes on another machine. All redis nodes will be include in one Redis Cluster.
OS: SLES 12
I have updates some system settings to achieve higher performance.
echo 65535 > /proc/sys/net/core/somaxconn
echo 65535 > /proc/sys/net/ipv4/tcp_max_syn_backlog
echo never > /sys/kernel/mm/transparent_hugepage/enabled
sysctl vm.overcommit_memory=1
sysctl vm.swappiness=0
Furthermore, I've turned off the swap, which could cause very unstable throughput when AOF re-write happened even swappiness is already set to 0. As observed, 15 million records in my test will occupy around 48G memory.
Redis 3.0.6: To eliminate the burst caused by RDB, I turned off all RDB and only enable AOF. For other configurations in redis.conf, just left with default values.
Client
HW: HP DL380 G7, 16 CPU (E5620 #2.40GHz), 24G memory, 600G disk
OS: SLES 12
YCSB (0.6.0) with jedis (2.8.0)
I will use hash key to store all records (1 key and 21 fields) and N sorted sets to store all keys and its random scores. Here N is the number of master nodes in the cluster. N sorted sets will be distributed evenly in each master node.
The YCSB workload configuration is pasted below:
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=15000000
operationcount=150000000
insertstart=0
fieldcount=21
fieldlength=188
readallfields=true
writeallfields=false
fieldlengthdistribution=zipfian
readproportion=0.0
updateproportion=1.0
insertproportion=0
readmodifywriteproportion=0.0
scanproportion=0
maxscanlength=1000
scanlengthdistribution=uniform
insertorder=hashed
requestdistribution=zipfian
hotspotdatafraction=0.2
hotspotopnfraction=0.8
table=subscriber
measurementtype=histogram
histogram.buckets=1000
timeseries.granularity=1000
At most cases, the computer resource is enough in my view though the throughput already hit the limit.
CPU: there are much CPU left, 60~70% CPU idle
I/O usage: it's not so busy, it's 30~40% utility at peak time.
Memory: only memory could be exhausted almost at peak time, i.e. when AOF re-write happened. At most time it's around 80%.