Why is memset slow? - optimization

Why is memset slow? - optimization

The spec for my CPU says it should get 5.336GB/s bandwidth to memory. To test this, I wrote a simple program that runs memset (or memcpy) on a big array and reports the timing. I'm showing 3.8GB/s on memset and 1.9GB/s on memcpy. http://en.wikipedia.org/wiki/Intel_Core_(microarchitecture) says my Q9400 should be getting 5.336MB/s. What's wrong?
I've tried replacing memset or memcpy with assignment loops. I've googled around to try to learn about memory alignment. I've tried different compiler flags. I've spent an embarrassing number of hours on this. Thanks for any help you can provide!
I'm using Ubuntu 12.04 with libc-dev version 2.15-0ubuntu10.5 and kernel 3.8.0-37-generic
The code:
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>
#define numBytes ((long)(1024*1024*1024))
#define numTransfers ((long)(8))
int main(int argc,char**argv){
if(argc!=3){
printf("Usage: %s BLOCK_SIZE_IN_BYTES NUMBER_OF_BLOCKS_TO_TRANSFER\n",argv[0]);
return -1;
}
char*__restrict__ source=(char*)malloc(numBytes);
char*__restrict__ dest=(char*)malloc(numBytes);
struct timespec start,end;
long totalTimeMs;
int i;
clock_gettime(CLOCK_MONOTONIC_RAW,&start);
for(i=0;i<numTransfers;++i)
memset(source,0,numBytes);
clock_gettime(CLOCK_MONOTONIC_RAW,&end);
totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
printf("memset %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s). ",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);
clock_gettime(CLOCK_MONOTONIC_RAW,&start);
for(i=0;i<numTransfers;++i)
memcpy( dest, source, numBytes);
clock_gettime(CLOCK_MONOTONIC_RAW,&end);
totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
printf("memcpy %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s).\n",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);
free(source);
free(dest);
return EXIT_SUCCESS;
}
Compile commands:
gcc -O3 -DNDEBUG -o memcpyStackOverflowNoParameters.c.o -c memcpyStackOverflowNoParameters.c
gcc -O3 -DNDEBUG memcpyStackOverflowNoParameters.c.o -o memcpy -rdynamic -lrt
Sample outputs:
memset 1073741824 bytes 8 times (8.00GB total) in 2214ms (3.880GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4466ms (1.923GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4557ms (1.885GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2222ms (3.866GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4433ms (1.938GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2216ms (3.876GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4521ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2217ms (3.875GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4520ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4430ms (1.939GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2226ms (3.859GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4444ms (1.933GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2225ms (3.861GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4485ms (1.915GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2620ms (3.279GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4855ms (1.769GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2535ms (3.389GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4870ms (1.764GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2423ms (3.545GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4905ms (1.751GB/s).
My hardware according to lshw:
product: OptiPlex 960 ()
vendor: Winbond Electronics
width: 64 bits
*-core
description: Motherboard
product: 0Y958C
vendor: Winbond Electronics
*-firmware
description: BIOS
capabilities: pci pnp apm upgrade shadowing escd cdboot bootselect edd int13floppytoshiba int13floppy720 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification netboot
*-cpu
product: Intel(R) Core(TM)2 Quad CPU Q9400 # 2.66GHz
physical id: 400
size: 2666MHz
width: 64 bits
clock: 1333MHz
capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
configuration: cores=4 enabledcores=4 threads=4
*-cache:0
description: L1 cache
physical id: 700
size: 256KiB
capacity: 256KiB
capabilities: internal write-back unified
*-cache:1
description: L2 cache
physical id: 701
size: 6MiB
capacity: 6MiB
capabilities: internal varies unified
*-memory
description: System Memory
physical id: 1000
slot: System board or motherboard
size: 4GiB
*-bank:0
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns)
product: CT51264AA667.M16FC
vendor: 7F7F7F7F7F9B0000
slot: DIMM_1
size: 4GiB
width: 64 bits
clock: 667MHz (1.5ns)
*-bank:1
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
*-bank:2
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
*-bank:3
description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]

Memory addresses are "virtualized", the addresses your program uses are translated to real addresses. This translation makes it possible to allocate what your program sees as contiguous memory from whatever pieces are handy at the time. Every general-purpose CPU does this. The translation requires a table lookup, which requires memory access. The CPU's got caches for it, but long stretches of virtual addresses can easily blow its cache, the "TLB" ("translation lookaside buffer"). So every 4KB (2MB on a linux system that's figured out what you're doing) the CPU stalls hunting up where to really send your memory traffic. Those stalls can take quite a bit of time. You might try running two copies of your benchmark, it seems reasonable that the TLB misses won't coincide and you'll get aggregate bandwidth much closer to your rated capacity.
(edit: um, you might want to replace your #defines with
size_t numBytes=atoi(argv[1]);
size_t numTransfers=atoi(argv[2]);
in the main body ...)
Edit: by the way: the bandwidth I saw (and reported in comments) from this test on my box was so far below rated capacity for my cpu that it got me investigating my own system. My box builder had put really crap memory in those slots. I've long since replaced them with a known-good brand, more than doubled the reported throughput, and very visibly improved the performance of my machine.

Last I checked memcpy and memset were not optimized in GCC. This was still true in 2012. See Agner Fog's Optimizing software in C++ section 2.6
2.6 "Choice of function libraries" and Table 2.1. He compares for several different compilers and OS's.
GCC has built in functions for doing memcpy. Apparently, they are even worse than memcpy in Glib. As far as I understand the GCC developers and the Glib developers work independently. To get the libraries from Glib you need to use -fno-builtin. However, although Glib is (or at least was) better it's still not optimal. To get the best results use Agner Fog's asmlib. He has optimized memcpy and memset and many other common functions in assembly to take advantage of SSE and AVX among other optimizations.

Related

Openvino: Failed to create plugin libclDNNPlugin.so for device GPU

I would like to run OpenVINO on an integrated GPU Intel HD 400.
When I run it I have the following error:
12.05.21 16:02:27 (-0400) self._engine = self._ie.load_network(**openvino_config)
12.05.21 16:02:27 (-0400) File "ie_api.pyx", line 311, in openvino.inference_engine.ie_api.IECore.load_network
12.05.21 16:02:27 (-0400) File "ie_api.pyx", line 320, in openvino.inference_engine.ie_api.IECore.load_network
12.05.21 16:02:27 (-0400) RuntimeError: Failed to create plugin /opt/intel/openvino_2021.1.110/deployment_tools/inference_engine/lib/intel64/libclDNNPlugin.so for device GPU
I followed the instructions to install the GPU plugin here Steps for Intel® Processor Graphics (GPU) and I can confirm that libclDNNPlugin.so exists.
I am running the code within a docker container and I am not sure if the host os has the proper drivers installed.
I run lsmod on host os and I got the following output
Module Size Used by
bnep 20480 2
ip6t_REJECT 16384 2
nf_reject_ipv6 16384 1 ip6t_REJECT
ip6table_filter 16384 1
xt_state 16384 0
ipt_REJECT 16384 2
nf_reject_ipv4 16384 1 ipt_REJECT
ip6_tables 28672 1 ip6table_filter
xt_MASQUERADE 16384 12
nf_conntrack_netlink 32768 0
nfnetlink 16384 2 nf_conntrack_netlink
xfrm_user 36864 1
xt_owner 16384 0
snd_hda_codec_hdmi 53248 1
snd_hda_codec_realtek 98304 1
snd_hda_codec_generic 65536 1 snd_hda_codec_realtek
snd_hda_intel 32768 0
coretemp 16384 0
snd_intel_dspcfg 16384 1 snd_hda_intel
snd_hda_codec 94208 4 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec_realtek
i915 1802240 5
at24 20480 0
r8169 73728 0
snd_hda_core 65536 5 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
regmap_i2c 16384 1 at24
snd_pcm 86016 4 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_core
efivars 20480 0
realtek 20480 2
libphy 81920 2 r8169,realtek
snd_timer 28672 1 snd_pcm
video 40960 1 i915
backlight 16384 2 video,i915
sch_fq_codel 20480 4
The error specifies: [CLDNN ERROR]. No GPU device was found.
Also, clinfo reported 0 platform available.
I run the following commands:
sudo apt install ocl-icd-libopencl1
sudo apt install opencl-headers
sudo apt install clinfo
sudo apt install ocl-icd-opencl-dev
sudo apt install beignet
and now the clinfo output is:
Number of platforms 1
Platform Name Intel Gen OCL Driver
Platform Vendor Intel
Platform Version OpenCL 2.0 beignet 1.3
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing
Platform Extensions function suffix Intel
Platform Name Intel Gen OCL Driver
Number of devices 1
Device Name Intel(R) HD Graphics Cherryview
Device Vendor Intel
Device Vendor ID 0x8086
Device Version OpenCL 1.2 beignet 1.3
Driver Version 1.3
Device OpenCL C Version OpenCL C 1.2 beignet 1.3
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 12
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 1
Supported partition types None, None, None
Max work item dimensions 3
Max work item sizes 512x512x512
Max work group size 512
Preferred work group size multiple 16
Preferred / native vector sizes
char 16 / 8
short 8 / 8
int 4 / 4
long 2 / 2
half 0 / 8 (cl_khr_fp16)
float 4 / 4
double 0 / 2 (n/a)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 2147483648 (2GiB)
Error Correction support No
Max memory allocation 1610612736 (1.5GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 8192 (8KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 4096 bytes
Pitch alignment for 2D image buffers 1 pixels
Max 2D image size 8192x8192 pixels
Max 3D image size 8192x8192x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 65536 (64KiB)
Max number of constant args 8
Max constant buffer size 134217728 (128MiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 80ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_khr_fp16
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Intel Gen OCL Driver
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [Intel]
clCreateContext(NULL, ...) [default] Success [Intel]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1
root#406231114801:/src# clinfo
Number of platforms 1
Platform Name Intel Gen OCL Driver
Platform Vendor Intel
Platform Version OpenCL 2.0 beignet 1.3
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing
Platform Extensions function suffix Intel
Platform Name Intel Gen OCL Driver
Number of devices 1
Device Name Intel(R) HD Graphics Cherryview
Device Vendor Intel
Device Vendor ID 0x8086
Device Version OpenCL 1.2 beignet 1.3
Driver Version 1.3
Device OpenCL C Version OpenCL C 1.2 beignet 1.3
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 12
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 1
Supported partition types None, None, None
Max work item dimensions 3
Max work item sizes 512x512x512
Max work group size 512
Preferred work group size multiple 16
Preferred / native vector sizes
char 16 / 8
short 8 / 8
int 4 / 4
long 2 / 2
half 0 / 8 (cl_khr_fp16)
float 4 / 4
double 0 / 2 (n/a)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 2147483648 (2GiB)
Error Correction support No
Max memory allocation 1610612736 (1.5GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 8192 (8KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 4096 bytes
Pitch alignment for 2D image buffers 1 pixels
Max 2D image size 8192x8192 pixels
Max 3D image size 8192x8192x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 65536 (64KiB)
Max number of constant args 8
Max constant buffer size 134217728 (128MiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 80ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_khr_fp16
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Intel Gen OCL Driver
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [Intel]
clCreateContext(NULL, ...) [default] Success [Intel]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1
How can I solve the issue?

According to these supported platforms information, your GPU currently might not supported. Could you run the programs on CPU? Try running this command " lspci | grep -i vga " and share the output here.

How to compute GPU memory bus?

I'm learning OpenCL/CUDA for GPU computing.
When I study the GDDR5 architecture, I'm told that
memory bus = quntity of memory channel * memory channel width
I see an AMD GPU has 16 memory channels with 32-bits wide, so I get the memory pin width = 16 * 32 = 512bits.
But I found that the mainstream graphic card has only 256/384-bits memory bus.
What's going wrong with it?

For GPUs, the number of memory channels usually is not explicitely stated, but rather the total bus width (in bits) for all channels combined. The bus width varies greatly depending on how many memory modules are on the PCB and the bus width per memory module. GPUs with 256bit total bus width typically have 8 memory modules with 1GB capacity each and GPUs with 384bit have 12.
For CPUs or integrated GPUs which share main memory:
memory bus width per channel = 64bit
numer of memory channels = 2 (mainstream plattforms) / 4 or 8 (high-end desktop / workstation)
memory clock = 1600MHz (DDR3) - 3200+MHz (DDR4)
memory bandwidth = 0.125 * memory bus width per channel * numer of memory channels * memory clock
For dedicated GPUs:
total memory bus width = 64bit (GDDR3) - 256bit (GDDR5) - 5120bit (HBM2)
effective memory clock = <5GHz (GDDR5) - 19.5GHz (GDDR6X)
memory bandwidth = 0.125 * total memory bus width * effective memory clock

Why after installation Raspbian on Orangepi PC disk size is shown only 3.6G of 32G?

I am very newbie in OrangePI PC. I have installed it by dd on my macOS, and I have tried installing a Raspbian image which downloaded from the orangepi.org in Windows as well, after installation when I check free disk space it is showing:
root#orangepi:~# df -h
Filesystem Size Used Avail Use% Mounted on
rootfs 3.4G 2.7G 474M 86% /
/dev/root 3.4G 2.7G 474M 86% /
devtmpfs 374M 0 374M 0% /dev
tmpfs 101M 188K 101M 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 201M 0 201M 0% /run/shm
/dev/mmcblk0p1 41M 4.9M 37M 12% /boot
I have installed it on 32G flash drive. But when I check it through fdisk command it shows 32G as a disk size:
root#orangepi:~# sudo fdisk -l
Disk /dev/mmcblk0: 32.0 GB, 32010928128 bytes
4 heads, 16 sectors/track, 976896 cylinders, total 62521344 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x34605ba5
Device Boot Start End Blocks Id System
/dev/mmcblk0p1 40960 124927 41984 83 Linux
/dev/mmcblk0p2 124928 7170047 3522560 83 Linux
root#orangepi:~#
How to fix this?

This solved my problem (solution is taken from here):
root#orangepi:~# fdisk /dev/mmcblk0
Command (m for help): p
Disk /dev/mmcblk0: 15.8 GB, 15804137472 bytes
4 heads, 16 sectors/track, 482304 cylinders, total 30867456 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x34605ba5
Device Boot Start End Blocks Id System
/dev/mmcblk0p1 40960 124927 41984 83 Linux
/dev/mmcblk0p2 124928 7170047 3522560 83 Linux
Command (m for help): d
Partition number (1-4): 2
Command (m for help): n
Partition type:
p primary (1 primary, 0 extended, 3 free)
e extended
Select (default p): p
Partition number (1-4, default 2): 2
First sector (2048-30867455, default 2048): 124928
Last sector, +sectors or +size{K,M,G} (124928-30867455, default 30867455):
Using default value 30867455
Command (m for help): w
Then quit (command q), reboot. You will then be able to use resize:
resize2fs /dev/root

Tensorflow strange CPU usage

my problem is with Tensorflow and the usage of the CPU.
My System:
CPU => AMD FX 8320 (8 Cores á 3,5ghz) and 8 Threads
Grafik => GTX 970
RAM => 16Gb and i belive ddr3 2600
I want to run a A3C algorithm for Starcraft 2 (pysc2) on my pc what works fine but the usage of the cpu ist somewhat strange.
If i start the algorithm with 4 Workers i get something about 150k Steps in 1h
and all cpu's are used about 25-30%
If i start the same algorithm with 8 Workers i get something about 120k Steps in 1h and all cpu's are used about 25-30%
If i now start the algorithm with 4 workers twice i get each 150k steps 1h and the cpu usage is 60-70%
Why cant i start the algorithm with 8 Workers, get the double amount of steps in 1H and the cpu is used to 70%?

JVM fails to allocate XMS under Suse SLES10 X64 running on VMWare ESX

I am trying to allocate ram with xms = xmx on a sles10 x64 running under VMware.
When stopping the JVM the following error is thrown:
Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory (errno = 12).
The RAM of the VM is 8 GB and they are reserved.
The VM sees 8GB and it can be allocated during runtime via the XMX setting.
On another Virtual SLES10 with 16 GB RAM Reserved via VMWare I don't have a problem with allocation of RAM even when setting the hugepages and shmax only by echo it works fine.
echo 8000 > /proc/sys/vm/nr_hugepages
echo 8589934592 > /proc/sys/kernel/shmmax
Using the echo commands on the other SLES10 show no effect in /proc/meminfo at all.
here are my configs 1st on is the SLES10 where XMS fails to allocate.
# more /apps/liferay-portal-5.2.5/tomcat-5.5.27/bin/setenv.sh
JAVA_HOME=/apps/java5
JRE_HOME=/apps/java5
JAVA_OPTS="$JAVA_OPTS -Xms3G -Xmx3G -XX:NewRatio=3 -XX:MaxPermSize=256m -XX:SurvivorRatio=20 -Dsun.rmi.dgc.client.gcInterval=1800000 -Dsun.rmi.dgc.server.gcInterval=1800000 -XX:+UsePa
rallelGC -XX:ParallelGCThreads=4 -XX:+UseLargePages -Xloggc:/apps/gc.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGC -XX:+PrintGCTimeStamps -
XX:+PrintGCDetails -Dfile.encoding=UTF8 -Duser.timezone=GMT+2 -Djava.security.auth.login.config=$CATALINA_HOME/conf/jaas.config -Dorg.apache.catalina.loader.WebappClassLoader.ENABLE_C
LEAR_REFERENCES=false"
more /etc/sysctl.conf
kernel.shmmax=7516192768
vm.nr_hugepages=3072
vm.hugetlb_shm_group=1000
more /etc/securtiy/limits.conf
#
#
#* soft core 0
#* hard rss 10000
##student hard nproc 20
##faculty soft nproc 20
##faculty hard nproc 50
#ftp hard nproc 0
##student - maxlogins 4
* soft memlock unlimited
* hard memlock unlimited
tomcat soft memlock 6291456
tomcat hard memlock 6291456
# End of file
# cat /proc/meminfo
MemTotal: 7928752 kB
MemFree: 737004 kB
Buffers: 0 kB
Cached: 417368 kB
SwapCached: 0 kB
Active: 487428 kB
Inactive: 324072 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 7928752 kB
LowFree: 737004 kB
SwapTotal: 2097144 kB
SwapFree: 2097020 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 397208 kB
Mapped: 72180 kB
Slab: 62136 kB
CommitLimit: 2915792 kB
Committed_AS: 748576 kB
PageTables: 3292 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 7028 kB
VmallocChunk: 34359731271 kB
HugePages_Total: 3072
HugePages_Free: 2305
HugePages_Rsvd: 897
Hugepagesize: 2048 kB
# ipcs -l
Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 7340032
max total shared memory (kbytes) = 4611686018427386880
min seg size (bytes) = 1
Semaphore Limits --------
max number of arrays = 1024
max semaphores per array = 250
max semaphores system wide = 256000
max ops per semop call = 32
semaphore max value = 32767
Messages: Limits --------
max queues system wide = 16
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536
# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 65536
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 65536
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
On the Second VM it looks like this
cat /proc/meminfo
MemTotal: 16190448 kB
MemFree: 176812 kB
Buffers: 52752 kB
Cached: 755256 kB
SwapCached: 0 kB
Active: 713808 kB
Inactive: 425300 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 16190448 kB
LowFree: 176812 kB
SwapTotal: 35658896 kB
SwapFree: 35658796 kB
Dirty: 932 kB
Writeback: 0 kB
AnonPages: 333620 kB
Mapped: 79120 kB
Slab: 37492 kB
CommitLimit: 36356744 kB
Committed_AS: 646284 kB
PageTables: 3584 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 23500 kB
VmallocChunk: 34359713907 kB
HugePages_Total: 7224
HugePages_Free: 6654
HugePages_Rsvd: 582
Hugepagesize: 2048 kB
JAVA_OPTS="$JAVA_OPTS -Xms2G -Xmx2G -XX:NewRatio=3 -XX:MaxPermSize=256m -XX:SurvivorRatio=20 -Dsun.rmi.dgc.client.gcInterval=1800000 -Dsun.rmi.dgc.server.gcI
nterval=1800000 -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+UseLargePages -Xloggc:/apps/gc.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplication
ConcurrentTime -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dfile.encoding=UTF8 -Duser.timezone=GMT+2 -Djava.security.auth.login.config=$CATALINA
_HOME/conf/jaas.config -Dorg.apache.catalina.loader.WebappClassLoader.ENABLE_CLEAR_REFERENCES=false"
hepide01pep1:~ # ipcs -l
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 8388608
max total shared memory (kbytes) = 4611686018427386880
min seg size (bytes) = 1
------ Semaphore Limits --------
max number of arrays = 1024
max semaphores per array = 250
max semaphores system wide = 256000
max ops per semop call = 32
semaphore max value = 32767
------ Messages: Limits --------
max queues system wide = 16
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536

Did you tried with the less size of heap.. may be with 2gig. You can just do simple try with
java -Xmx3G -version .Let us know how it goes and what it spit out.

I have stumbled with this issue (errno 12) on CentOS 5.9 as well using 16G heaps.
After verifying hard / soft memory locks were unlimited in /etc/security/limits.conf and still getting the error, I started running java -version as suggested by Anil, with all of my JAVA_OPTS intact.
I have found that removing the "-XX:+UseLargePages" option gets rid of that error.
I hope this helps you!

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Why is memset slow? - optimization

Related

Openvino: Failed to create plugin libclDNNPlugin.so for device GPU

How to compute GPU memory bus?

Why after installation Raspbian on Orangepi PC disk size is shown only 3.6G of 32G?

Tensorflow strange CPU usage

JVM fails to allocate XMS under Suse SLES10 X64 running on VMWare ESX

Categories

Resources