scylladb: scylla_io_setup script not showing recommended --max-io-requests param in output

I am running the scylla_io_setup script inside a Docker container, and I have mounted the /var/lib/scylla/ data directory on a 1 TB XFS SSD. scylla_io_setup is not showing the recommended --max-io-requests parameter in its output. The following is the output of the script.
[root#ip /]# ./usr/lib/scylla/scylla_io_setup
tuning /sys/devices/virtual/block/dm-4
tuning: /sys/devices/virtual/block/dm-4/queue/nomerges 2
warning: unable to tune /sys/devices/virtual/block/dm-4/queue/nomerges to 2
tuning /sys/devices/pci0000:00/0000:00:02.2/0000:02:00.0/host0/target0:2:4/0:2:4:0/block/sde
tuning: /sys/devices/pci0000:00/0000:00:02.2/0000:02:00.0/host0/target0:2:4/0:2:4:0/block/sde/queue/nomerges 2
warning: unable to tune /sys/devices/pci0000:00/0000:00:02.2/0000:02:00.0/host0/target0:2:4/0:2:4:0/block/sde/queue/nomerges to 2
tuning /sys/devices/virtual/block/dm-4
tuning /sys/devices/virtual/block/dm-4
tuning /sys/devices/virtual/block/dm-4
tuning /sys/devices/virtual/block/dm-4
WARNING: unable to mbind shard memory; performance may suffer:
WARN 2020-01-29 10:26:47,892 [shard 0] seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
WARNING: unable to mbind shard memory; performance may suffer:
INFO 2020-01-29 10:26:48,161 [shard 0] iotune - /var/lib/scylla/saved_caches passed sanity checks
WARN 2020-01-29 10:26:48,161 [shard 0] iotune - Scheduler for /sys/devices/pci0000:00/0000:00:02.2/0000:02:00.0/host0/target0:2:4/0:2:4:0/block/sde/queue/scheduler set to deadline. It is recommend to set it to noop before evaluation so as not to skew the results.
WARN 2020-01-29 10:26:48,161 [shard 0] iotune - nomerges for /sys/devices/pci0000:00/0000:00:02.2/0000:02:00.0/host0/target0:2:4/0:2:4:0/block/sde/queue/nomerges set to 0. It is recommend to set it to 2 before evaluation so that merges are disabled. Results can be skewed otherwise.
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 188 MB/s
Measuring sequential read bandwidth: 424 MB/s
Measuring random write IOPS: 23843 IOPS
Measuring random read IOPS: 66322 IOPS
Writing result to /etc/scylla.d/io_properties.yaml
Writing result to /etc/scylla.d/io.conf

Was the result written to io.conf?

how to grep multiple strings from a file and print them grouped by pattern

I'm trying to grep a list of errors from a host log file; being a huge file, it prints a lot of data and it's hard to see which errors are repeated and logged. A sample of the log:
0x45bae19d6bc0 IO type 16648 (READ) isOrdered:NO isSplit:NO isEncr:NO since 7990 msec status I/O error
Throttled: 82 IO failed on disk e3d17cdb-3190-9e21-ea45-4cff39420501, Wake up 0x45ba3a34f9c0 with status I/O error
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10432 microseconds to 5392073 microseconds.
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10444 microseconds to 10822733 microseconds.
naa.5000c500bb7a661f performance has improved. I/O latency reduced from 10822733 microseconds to 2163435 microseconds.
naa.5000c500bb7a661f performance has improved. I/O latency reduced from 2163435 microseconds to 426054 microseconds.
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10465 microseconds to 925119 microseconds.
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10469 microseconds to 1904014 microseconds.
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10472 microseconds to 3936215 microseconds.
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10479 microseconds to 8517984 microseconds.
cpu3:2099278)Migrate: 448: Error reading from pending connection: Failure
cpu3:2099278)Migrate: 448: Error reading from pending connection: Failure
cpu3:2099278)Migrate: 448: Error reading from pending connection: Failure
cpu3:2099278)Migrate: 448: Error reading from pending connection: Failure
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10490 microseconds to 17358740 microseconds.
0x45bae0fefe40 IO type 16648 (READ) isOrdered:NO isSplit:NO isEncr:NO since 48543 msec status I/O error
Throttled: 82 IO failed on disk e3d17cdb-3190-ea45-4cff39420501, Wake up 0x45da36318840 with status I/O error
naa.5000c500ba661f performance has improved. I/O latency reduced from 17358740 microseconds to 3372968 microseconds.
naa.5000c500bb7a661f performance has improved. I/O latency reduced from 3372968 microseconds to 674458 microseconds.
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10677 microseconds to 1353205 microseconds.
naa.5000c500bb7a661f performance has improved. I/O latency reduced from 1353205 microseconds to 268942 microseconds.
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10682 microseconds to 419051 microseconds.
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10682 microseconds to 872847 microseconds.
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10684 microseconds to 1770518 microseconds.
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10687 microseconds to 3640051 microseconds.
0x45dae4fe25c0 IO type 16648 (READ) isOrdered:NO isSplit:NO isEncr:NO since 15991 msec status I/O error
Throttled: 82 IO failed on disk e3d17cdb-3190--ea45-4cff39420501, Wake up 0x45da362677c0 with status I/O error
0x45dae4fe2340 IO type 16648 (READ) isOrdered:NO isSplit:NO isEncr:NO since 24806 msec status I/O error
cpu3:2099278)Migrate: 448: Error reading from pending connection: Failure
cpu3:2099278)Migrate: 448: Error reading from pending connection: Failure
cpu10:36926358)MemSchedAdmit: 471: Admission failure in path: vm.36926352/vmmanon.36926352
cpu23:36926381)MemSchedAdmit: 471: Admission failure in path: vm.36926375/vmmanon.36926375
Throttled: 82 IO failed on disk e3d17cdb-3190-9e21-ea45-4cff39420501, Wake up 0x45ba3abe8880 with status I/O error
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10696 microseconds to 7557465 microseconds.
Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10711 microseconds to 15202991 microseconds.
naa.5000c500bb7a661f performance has improved. I/O latency reduced from 15202991 microseconds to 2944264 microseconds.
naa.5000c500bb7a661f performance has improved. I/O latency reduced from 2944264 microseconds to 577176 microseconds.
naa.5000c500bb7a661f performance has improved. I/O latency reduced from 577176 microseconds to 112712 microseconds.
I'm expecting the following output. I've searched a lot of places and didn't find a suitable solution; I'm hoping it may be possible with awk and sed.
egrep -i "latency|I/O error|Failure" error.log
Failure
cpu3:2099278)Migrate: 448: Error reading from pending connection: Failure
cpu3:2099278)Migrate: 448: Error reading from pending connection: Failure
cpu3:2099278)Migrate: 448: Error reading from pending connection: Failure
cpu3:2099278)Migrate: 448: Error reading from pending connection: Failure
IO Errors
cpu5:2098752)WARNING: LSOM: RCIOCompletionLoop:93: Throttled: 82 IO failed on disk e3d17cdb-3190-9e21-ea45-4cff39420501, Wake up 0x45da362677c0 with status I/O error
cpu6:2097866)LSOMCommon: IORETRYCompleteIO:470: Throttled: 0x45dae4fe2340 IO type 16648 (READ) isOrdered:NO isSplit:NO isEncr:NO since 24806 msec status I/O error
cpu2:2098752)WARNING: LSOM: RCIOCompletionLoop:93: Throttled: 82 IO failed on disk e3d17cdb-3190-9e21-ea45-4cff39420501, Wake up 0x45ba3abe8880 with status I/O error
cpu9:2099365 opID=add9908b)WARNING: ScsiDeviceIO: 12028: READ CAPACITY on device “naa.5000c500bb7a661f” from Plugin “HPP” failed. I/O error
LAtency
cpu5:2097866)WARNING: ScsiDeviceIO: 1596: Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10682 microseconds to 419051 microseconds.
cpu19:2097867)WARNING: ScsiDeviceIO: 1596: Device naa.5000c500bb7a661f performance has deteriorated. I/O latency increased from average value of 10682 microseconds to 872847 microseconds
Assumptions:
if multiple patterns match a single line, we'll display the line in each of the matching output groups
group headings are exact reprints of the search patterns (i.e., we won't be reformatting the group headers as is done in the question, where the search pattern I/O error becomes the group heading IO Errors)
there is no requirement to match on only whole words (e.g., failure will match on failure, failures, nonfailures, stufffailuresXYZ)
within an output group we wish to maintain the input ordering of the rows
The question's current input and expected output don't match, so until that's fixed we'll use a small(er) set of input data for demonstration purposes:
$ cat test.log
you can ignore this line
you should match this line on abcLaTeNcYxyz
yeah, match this line on Failures and throttled
you can ignore this line
more matches for i/o error and latency
single match on I/O error
couple more matches on failures
couple more matches on failure
ignore this line, too
Adding a non-matching string (no-match) to the mix:
$ patterns='latency|I/O error|Failure|throttled|no-match'
One GNU awk idea (this relies on GNU awk for arrays of arrays and PROCINFO["sorted_in"]):
awk -v plist="${patterns}" '
BEGIN { IGNORECASE=1
        delete groups
        n=split(plist,arr,"|")                    # break plist up into components
        for (i=1;i<=n;i++) {
            ptns[arr[i]]                          # assign as indices of ptns[] array for easier processing
            groups[arr[i]][0]                     # place holder to allow us to print an empty group
        }
      }
      { for (ptn in ptns)                         # loop through list of patterns and ...
            if ($0 ~ ptn)                         # if found then ...
               groups[ptn][c++]=$0                # save in groups[] array
      }
END   { PROCINFO["sorted_in"]="#ind_str_asc"
        for (ptn in ptns) {
            printf "\n######### %s\n\n", ptn
            PROCINFO["sorted_in"]="#ind_num_asc"  # sort the c++ values in ascending order => maintain input ordering
            for (i in groups[ptn])
                if (groups[ptn][i] != "")
                   print groups[ptn][i]
        }
      }
' test.log
This generates:
######### Failure
yeah, match this line on Failures and throttled
couple more matches on failures
couple more matches on failure
######### I/O error
more matches for i/o error and latency
single match on I/O error
######### latency
you should match this line on abcLaTeNcYxyz
more matches for i/o error and latency
######### no-match
######### throttled
yeah, match this line on Failures and throttled

Dask-RAPIDS data movement and out-of-memory issue

I am using Dask (2021.3.0) and RAPIDS (0.18) in my project. In it, I perform a preprocessing task on the CPU, and the preprocessed data is later transferred to the GPU for K-means clustering. In this process, I am getting the following problem:
1 of 1 worker jobs failed: std::bad_alloc: CUDA error: ~/envs/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
(it gave the error before using the GPU memory completely, i.e. it is not using the GPU memory fully)
I have a single GPU with 40 GB of memory.
The RAM size is 512 GB.
I am using the following snippet of code:
# imports assumed for this snippet (they are not shown in the original post):
#   from dask.distributed import Client, LocalCluster
#   import cupy as cp
#   from cuml.dask.cluster import KMeans
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
cluster.scale(100)
## perform my preprocessing on data and get the output in variable A
# convert variable A to CuPy-backed blocks
x = A.map_blocks(cp.asarray)
km = KMeans(n_clusters=4)
predict = km.fit_predict(x).compute()
I am also looking for a solution so that data larger than GPU memory can be preprocessed, and whenever GPU memory spills, the spilled data is transferred to a temp directory or to the CPU (as we do with Dask, where we define a temp directory for spills from RAM).
Any help will be appreciated.
There are several ways to run larger-than-GPU datasets.
Check out Nick Becker's blog, which has a few methods well documented.
Check out BlazingSQL, which is built on top of RAPIDS and can perform out-of-core processing. You can try it at beta.blazingsql.com.
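For the spilling behaviour described in the question (moving data that no longer fits in GPU memory to host memory or a temp directory), dask-cuda's LocalCUDACluster exposes a device_memory_limit and a local_directory. A minimal sketch, assuming dask-cuda and cuML are installed and that A is the preprocessed Dask array from the question's snippet; the worker count, limit, and directory values are illustrative only:
# Sketch: CUDA-aware local cluster that spills from GPU memory to host memory.
# Assumes dask-cuda and cuml are installed; A comes from the question's preprocessing step.
import cupy as cp
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.cluster import KMeans

cluster = LocalCUDACluster(
    n_workers=1,                        # one worker per GPU
    device_memory_limit="30GB",         # spill device memory to host past this point (illustrative)
    local_directory="/tmp/dask-spill",  # hypothetical temp directory for worker spill files
)
client = Client(cluster)

x = A.map_blocks(cp.asarray)            # A: preprocessed Dask array from the question
km = KMeans(n_clusters=4)
predict = km.fit_predict(x).compute()
Blocks evicted past device_memory_limit are moved from GPU to host memory; spilling from host memory to disk still follows the normal Dask worker memory settings, so this trades speed for the ability to work on datasets larger than the GPU.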

aws gpu oom issue onnx cuda

I am doing predictions on an AWS GPU instance, g4dn.4xlarge (16 GB GPU memory, 64 GB CPU memory), deployed with Kubernetes and Docker.
Tested with (CUDA 10.1 + onnxruntime-gpu==1.4.0) and (CUDA 10.2 + onnxruntime-gpu==1.6.0); same error. The models are customised for our purposes, so I can't point to the weights.
The problem is:
I am getting a CUDA OOM (out of memory) error:
Error: onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'Conv_16' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:298 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 33554432
On some backtracking:
Using nvidia-smi and GPU memory profiling, I found that for the first prediction and all subsequent predictions, a constant amount of GPU memory is blocked: at minimum ~1.8 GB for some models, and ~3 GB for others (I think it's blocked for multiprocessing). Releasing the memory doesn't make sense, because the same amount will be blocked again for the next prediction.
My understanding:
At peak we scale up to 22 pods, and in every pod the model load is initialized, so every pod blocks 1.8~3 GB of memory while pointing to a single GPU instance with 16 GB of GPU memory. So, with 22 pods, OOM is expected.
What is confusing:
The CUDA message above reports OOM, but GPU profiling shows memory utilisation is never more than 50%, although SM (streaming multiprocessor) utilisation is 100% at peak (when pods are scaled to 22). Image attached for reference.
From research I understood that SM has nothing to do with OOM and that CUDA handles the SMs efficiently. Then why am I getting a CUDA OOM error if only 50% of the memory is utilised?
Ruled out:
I ruled out a memory leak from the model, as it runs without OOM errors when the load is low.
Why GPU and not CPU for prediction:
We want faster predictions. It ran on the CPU without any error, even under high load.
What I am looking for:
A solution to scale AWS GPU instances on the basis of GPU memory. If OOM is the reason, scaling on GPU memory should solve the problem, but I can't find one.
An understanding of the CUDA message: why OOM when memory is available?
Very hypothetically: if there were a way, by design or via Kubernetes, to create a singleton object for a particular model load, scaled-up pods could use that model-load object for prediction rather than creating a new server. But that would defeat the purpose of using Kubernetes for availability and scalability.
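One related knob worth noting: onnxruntime-gpu releases newer than the 1.4/1.6 versions mentioned above can cap the CUDA memory arena per session through CUDAExecutionProvider options, which keeps each pod's session footprint bounded. A minimal sketch, assuming a recent onnxruntime-gpu; the model path and the 2 GB limit are hypothetical placeholders:
# Sketch: cap the per-session CUDA arena (requires a newer onnxruntime-gpu than 1.4/1.6).
# "model.onnx" and the 2 GB limit are placeholder assumptions, not values from the question.
import onnxruntime as ort

providers = [
    ("CUDAExecutionProvider", {
        "device_id": 0,
        "gpu_mem_limit": 2 * 1024 ** 3,               # ~2 GB arena cap for this session
        "arena_extend_strategy": "kSameAsRequested",  # grow the arena only as much as requested
    }),
    "CPUExecutionProvider",                           # fallback provider
]

sess = ort.InferenceSession("model.onnx", providers=providers)
# sess.run(...) is then used for predictions as usual.
With a hard cap per session, 22 pods sharing one 16 GB GPU could still oversubscribe it, so the cap mainly makes the per-pod footprint predictable rather than eliminating OOM on its own.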

Terminology used in Nsight Compute

Two questions:
According to Nsight Compute, my kernel is compute bound. The SM % of utilization relative to peak performance is 74% and the memory utilization is 47%. However, when I look at each pipeline utilization percentage, LSU utilization is way higher than others (75% vs 10-15%). Wouldn't that be an indication that my kernel is memory bound? If the utilization of compute and memory resources doesn't correspond to pipeline utilization, I don't know how to interpret those terms.
The schedulers are only issuing every 4 cycles; wouldn't that mean that my kernel is latency bound? People usually define that in terms of the utilization of compute and memory resources. What is the relationship between the two?
In Nsight Compute on CC 7.5 GPUs:
SM% is defined by sm__throughput, and
Memory% is defined by gpu__compute_memory_throughput
sm__throughput is the MAX of the following metrics:
sm__instruction_throughput
sm__inst_executed
sm__issue_active
sm__mio_inst_issued
sm__pipe_alu_cycles_active
sm__inst_executed_pipe_cbu_pred_on_any
sm__pipe_fp64_cycles_active
sm__pipe_tensor_cycles_active
sm__inst_executed_pipe_xu
sm__pipe_fma_cycles_active
sm__inst_executed_pipe_fp16
sm__pipe_shared_cycles_active
sm__inst_executed_pipe_uniform
sm__instruction_throughput_internal_activity
sm__memory_throughput
idc__request_cycles_active
sm__inst_executed_pipe_adu
sm__inst_executed_pipe_ipa
sm__inst_executed_pipe_lsu
sm__inst_executed_pipe_tex
sm__mio_pq_read_cycles_active
sm__mio_pq_write_cycles_active
sm__mio2rf_writeback_active
sm__memory_throughput_internal_activity
gpu__compute_memory_throughput is the MAX of the following metrics:
gpu__compute_memory_access_throughput
l1tex__data_bank_reads
l1tex__data_bank_writes
l1tex__data_pipe_lsu_wavefronts
l1tex__data_pipe_tex_wavefronts
l1tex__f_wavefronts
lts__d_atomic_input_cycles_active
lts__d_sectors
lts__t_sectors
lts__t_tag_requests
gpu__compute_memory_access_throughput_internal_activity
gpu__compute_memory_access_throughput
l1tex__lsuin_requests
l1tex__texin_sm2tex_req_cycles_active
l1tex__lsu_writeback_active
l1tex__tex_writeback_active
l1tex__m_l1tex2xbar_req_cycles_active
l1tex__m_xbar2l1tex_read_sectors
lts__lts2xbar_cycles_active
lts__xbar2lts_cycles_active
lts__d_sectors_fill_device
lts__d_sectors_fill_sysmem
gpu__dram_throughput
gpu__compute_memory_request_throughput_internal_activity
In your case the limiter is sm__inst_executed_pipe_lsu, which is an instruction throughput. If you review sections/SpeedOfLight.py, latency bound is defined as having both sm__throughput and gpu__compute_memory_throughput < 60%.
Some instruction pipelines have lower throughput than others, such as fp64, xu, and lsu (this varies by chip). Pipeline utilization is part of sm__throughput. In order to improve performance, the options are:
Reduce instructions to the oversubscribed pipeline, or
Issue instructions of a different type to use empty issue cycles.
GENERATING THE BREAKDOWN
As of Nsight Compute 2020.1 there is not a simple command line to generate the list without running a profiling session. For now you can collect one throughput metric using breakdown:<throughput metric>.avg.pct_of_peak_sustained_elapsed and parse the output to get the sub-metric names.
For example:
ncu.exe --csv --metrics breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed --details-all -c 1 cuda_application.exe
generates:
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed","%","0.38"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed","%","0.05"
"0","33396","cuda_application.exe","127.0.0.1","kernel()","2020-Aug-20 13:26:26","1","7","Command line profiler metrics","l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed","%","0.05"
...
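The sub-metric names can then be pulled out of that CSV programmatically; a minimal sketch, assuming the output above was redirected to a file named breakdown.csv (the file name is an assumption):
# Sketch: collect the distinct sub-metric names from the ncu --csv breakdown output above.
import csv

with open("breakdown.csv", newline="") as f:
    rows = csv.DictReader(f)
    # Preserve first-seen order while dropping duplicates across kernel launches.
    metric_names = list(dict.fromkeys(row["Metric Name"] for row in rows))

for name in metric_names:
    print(name)  # e.g. gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed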
The keyword breakdown can be used in Nsight Compute section files to expand a throughput metric. This is used in the SpeedOfLight.section.

SAS regression plot errors

I use proc reg to fit a regression model and plot the results. However, I get the error below. Is there any way to solve it?
ods graphics on;
proc reg data = Work.Cmds PLOTS(MAXPOINTS=NONE);
model Investment = Size Growth_New Leverage complex Deficit pc_income_NEW
Density/hcc adjrsq ;
output out=CMDSreg r = AbnInvestment; run;
ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: GC overhead limit exceeded.
ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: GC overhead limit exceeded.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.CMDSREG may be incomplete. When this step was stopped there were 0 observations and 0 variables.
WARNING: Data set WORK.CMDSREG was not replaced because this step was stopped.
NOTE: PROCEDURE REG used (Total process time):
real time 1:05.39
cpu time 13.48 seconds
quit;
ods graphics off;
I browsed the website here but still don't understand.
Note:
The data set WORK.CMDS has 587831 observations and 142 variables.