Vega 56 Bandwidth Slow in (not only) TensowFlow - Why and how? - tensorflow

I just built a new AMD-based PC, with CPU - AMD Ryzen 7 3700X, GPU - AMD Radeon RX Vega 56, OS - Ubuntu 18.04. In order to use AMD GPU for Tensorflow, I follow these two to install ROCm. Everything seems fine and no problems in installation. I think I install ROCm 3. I do exact as the posts.
https://towardsdatascience.com/train-neural-networks-using-amd-gpus-and-keras-37189c453878
https://www.videogames.ai/Install-ROCM-Machine-Learning-AMD-GPU
video link: https://www.youtube.com/watch?v=fkSRkAoMS4g
But when I ran rocm-bandwidth-test in the terminal, as the video, I had result as below.
(base) nick#nick-nbpc:~$ rocm-bandwidth-test
........
RocmBandwidthTest Version: 2.3.11
Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, AMD Ryzen 7 3700X 8-Core Processor
Device: 1, Vega 10 XT [Radeon RX Vega 64], 2f:0.0
Inter-Device Access
D/D 0 1
0 1 0
1 1 1
Inter-Device Numa Distance
D/D 0 1
0 0 N/A
1 20 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 9.295924
1 8.892247 72.654038
Bdirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 17.103560
1 17.103560 N/A
(base) nick#nick-nbpc:~$
The video is using AMD RX 580 GPU, and I compare the technical specs from the link below.
https://www.youtube.com/watch?v=shstdFZJJ_o
which is showing that RX580 has memory bandwidth 256 Gb/s and Vega 56 has 409.6 Gb/s. In the other video, the uploader has a bandwidth 195 Gb/s at time 11:09 of the video. But my Vega 56 only has 72.5 Gb/s! This is a huge difference. I don't know what is wrong.
Then I install python 3.6 and TensorFlow-ROCm. And I git clone https://github.com/tensorflow/benchmarks.git, just as the video, to do the benchmark test in tensorflow.
Execute the code:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Gives this result:
Done warm up
Step Img/sec total_loss
1 images/sec: 81.0 +/- 0.0 (jitter = 0.0) 7.765
10 images/sec: 80.7 +/- 0.1 (jitter = 0.2) 8.049
20 images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.808
30 images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.976
40 images/sec: 80.9 +/- 0.1 (jitter = 0.2) 7.591
50 images/sec: 81.2 +/- 0.1 (jitter = 0.3) 7.549
60 images/sec: 81.5 +/- 0.1 (jitter = 0.6) 7.819
70 images/sec: 81.7 +/- 0.1 (jitter = 1.1) 7.820
80 images/sec: 81.8 +/- 0.1 (jitter = 1.5) 7.847
90 images/sec: 82.0 +/- 0.1 (jitter = 0.8) 8.025
100 images/sec: 82.1 +/- 0.1 (jitter = 0.6) 8.029
----------------------------------------------------------------
total images/sec: 82.07
----------------------------------------------------------------
The result is not as good as I expected. I was expecting some number 100+. But due to my limited knowledge on Ubuntu/AMD/TensorFlow, I might be very likely wrong. If not, can someone tell me why my bandwidth is not as fast as 400 Gb/s?
========================================
clinfo
(base) nick#nick-nbpc:~$ clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (3137.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Vega 10 XT [Radeon RX Vega 64]
Device Topology: PCI[ B#47, D#0, F#0 ]
Max compute units: 56
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1590Mhz
Address bits: 64
Max memory allocation: 7287183769
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 26751
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8573157376
Constant buffer size: 7287183769
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2992216473
Max global variable size: 7287183769
Max global variable preferred total size: 8573157376
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7fe56aa5fcf0
Name: gfx900
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3137.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
(base) nick#nick-nbpc:~$
rocminfo
(base) nick#nick-nbpc:~$ rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 3700X 8-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 3700X 8-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 0
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 16436616(0xfacd88) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16436616(0xfacd88) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx900
Uuid: GPU-02151e1bb9ee2144
Marketing Name: Vega 10 XT [Radeon RX Vega 64]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26751(0x687f)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1590
BDFID: 12032
Internal Node ID: 1
Compute Unit: 56
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx900
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
(base) nick#nick-nbpc:~$

i cant answer the bandwidth question but i have just tried out the same benchmarks (according to the youtube video)
i get:
(vrocm1) user1#t1000test:~$ rocm-bandwidth-test
........
RocmBandwidthTest Version: 2.3.11
Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, AMD Ryzen 7 2700X Eight-Core Processor
Device: 1, Vega 10 XL/XT [Radeon RX Vega 56/64], 28:0.0
Inter-Device Access
D/D 0 1
0 1 0
1 1 1
Inter-Device Numa Distance
D/D 0 1
0 0 N/A
1 20 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 9.542044
1 9.028717 72.202459
Bdirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 17.144430
1 17.144430 N/A
which is the same as you got. but:
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
gives me:
Done warm up
Step Img/sec total_loss
1 images/sec: 172.0 +/- 0.0 (jitter = 0.0) 7.765
10 images/sec: 172.5 +/- 0.1 (jitter = 0.6) 8.049
20 images/sec: 172.6 +/- 0.1 (jitter = 0.4) 7.808
30 images/sec: 172.5 +/- 0.1 (jitter = 0.6) 7.976
40 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.591
50 images/sec: 172.5 +/- 0.1 (jitter = 0.6) 7.549
60 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.819
70 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.819
80 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.848
90 images/sec: 172.6 +/- 0.0 (jitter = 0.5) 8.025
100 images/sec: 172.5 +/- 0.0 (jitter = 0.5) 8.029
----------------------------------------------------------------
total images/sec: 172.39
----------------------------------------------------------------
cliinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (3182.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Vega 10 XL/XT [Radeon RX Vega 56/64]
Device Topology: PCI[ B#40, D#0, F#0 ]
Max compute units: 64
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1630Mhz
Address bits: 64
Max memory allocation: 7287183769
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 26751
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8573157376
Constant buffer size: 7287183769
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2992216473
Max global variable size: 7287183769
Max global variable preferred total size: 8573157376
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7efe04b66cd0
Name: gfx900
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3182.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 2700X Eight-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 2700X Eight-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3700
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32898020(0x1f5fbe4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32898020(0x1f5fbe4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx900
Uuid: GPU-021508a5025618e4
Marketing Name: Vega 10 XL/XT [Radeon RX Vega 56/64]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26751(0x687f)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1630
BDFID: 10240
Internal Node ID: 1
Compute Unit: 64
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx900
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
the only thing that you seem to have done differently is :
Execute the code: python tf_cnn_benchmarks.py --num_gpus=1
--batch_size=32 --model=resnet50
which executes the test in python2 ... (but maybe its just a typo)
greetings gspeet

Related

Openvino: Failed to create plugin libclDNNPlugin.so for device GPU

I would like to run OpenVINO on an integrated GPU Intel HD 400.
When I run it I have the following error:
12.05.21 16:02:27 (-0400) self._engine = self._ie.load_network(**openvino_config)
12.05.21 16:02:27 (-0400) File "ie_api.pyx", line 311, in openvino.inference_engine.ie_api.IECore.load_network
12.05.21 16:02:27 (-0400) File "ie_api.pyx", line 320, in openvino.inference_engine.ie_api.IECore.load_network
12.05.21 16:02:27 (-0400) RuntimeError: Failed to create plugin /opt/intel/openvino_2021.1.110/deployment_tools/inference_engine/lib/intel64/libclDNNPlugin.so for device GPU
I followed the instructions to install the GPU plugin here Steps for Intel® Processor Graphics (GPU) and I can confirm that libclDNNPlugin.so exists.
I am running the code within a docker container and I am not sure if the host os has the proper drivers installed.
I run lsmod on host os and I got the following output
Module Size Used by
bnep 20480 2
ip6t_REJECT 16384 2
nf_reject_ipv6 16384 1 ip6t_REJECT
ip6table_filter 16384 1
xt_state 16384 0
ipt_REJECT 16384 2
nf_reject_ipv4 16384 1 ipt_REJECT
ip6_tables 28672 1 ip6table_filter
xt_MASQUERADE 16384 12
nf_conntrack_netlink 32768 0
nfnetlink 16384 2 nf_conntrack_netlink
xfrm_user 36864 1
xt_owner 16384 0
snd_hda_codec_hdmi 53248 1
snd_hda_codec_realtek 98304 1
snd_hda_codec_generic 65536 1 snd_hda_codec_realtek
snd_hda_intel 32768 0
coretemp 16384 0
snd_intel_dspcfg 16384 1 snd_hda_intel
snd_hda_codec 94208 4 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec_realtek
i915 1802240 5
at24 20480 0
r8169 73728 0
snd_hda_core 65536 5 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
regmap_i2c 16384 1 at24
snd_pcm 86016 4 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_core
efivars 20480 0
realtek 20480 2
libphy 81920 2 r8169,realtek
snd_timer 28672 1 snd_pcm
video 40960 1 i915
backlight 16384 2 video,i915
sch_fq_codel 20480 4
The error specifies: [CLDNN ERROR]. No GPU device was found.
Also, clinfo reported 0 platform available.
I run the following commands:
sudo apt install ocl-icd-libopencl1
sudo apt install opencl-headers
sudo apt install clinfo
sudo apt install ocl-icd-opencl-dev
sudo apt install beignet
and now the clinfo output is:
Number of platforms 1
Platform Name Intel Gen OCL Driver
Platform Vendor Intel
Platform Version OpenCL 2.0 beignet 1.3
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing
Platform Extensions function suffix Intel
Platform Name Intel Gen OCL Driver
Number of devices 1
Device Name Intel(R) HD Graphics Cherryview
Device Vendor Intel
Device Vendor ID 0x8086
Device Version OpenCL 1.2 beignet 1.3
Driver Version 1.3
Device OpenCL C Version OpenCL C 1.2 beignet 1.3
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 12
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 1
Supported partition types None, None, None
Max work item dimensions 3
Max work item sizes 512x512x512
Max work group size 512
Preferred work group size multiple 16
Preferred / native vector sizes
char 16 / 8
short 8 / 8
int 4 / 4
long 2 / 2
half 0 / 8 (cl_khr_fp16)
float 4 / 4
double 0 / 2 (n/a)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 2147483648 (2GiB)
Error Correction support No
Max memory allocation 1610612736 (1.5GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 8192 (8KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 4096 bytes
Pitch alignment for 2D image buffers 1 pixels
Max 2D image size 8192x8192 pixels
Max 3D image size 8192x8192x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 65536 (64KiB)
Max number of constant args 8
Max constant buffer size 134217728 (128MiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 80ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_khr_fp16
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Intel Gen OCL Driver
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [Intel]
clCreateContext(NULL, ...) [default] Success [Intel]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1
root#406231114801:/src# clinfo
Number of platforms 1
Platform Name Intel Gen OCL Driver
Platform Vendor Intel
Platform Version OpenCL 2.0 beignet 1.3
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing
Platform Extensions function suffix Intel
Platform Name Intel Gen OCL Driver
Number of devices 1
Device Name Intel(R) HD Graphics Cherryview
Device Vendor Intel
Device Vendor ID 0x8086
Device Version OpenCL 1.2 beignet 1.3
Driver Version 1.3
Device OpenCL C Version OpenCL C 1.2 beignet 1.3
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 12
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 1
Supported partition types None, None, None
Max work item dimensions 3
Max work item sizes 512x512x512
Max work group size 512
Preferred work group size multiple 16
Preferred / native vector sizes
char 16 / 8
short 8 / 8
int 4 / 4
long 2 / 2
half 0 / 8 (cl_khr_fp16)
float 4 / 4
double 0 / 2 (n/a)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 2147483648 (2GiB)
Error Correction support No
Max memory allocation 1610612736 (1.5GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 8192 (8KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 4096 bytes
Pitch alignment for 2D image buffers 1 pixels
Max 2D image size 8192x8192 pixels
Max 3D image size 8192x8192x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 65536 (64KiB)
Max number of constant args 8
Max constant buffer size 134217728 (128MiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 80ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_khr_fp16
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Intel Gen OCL Driver
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [Intel]
clCreateContext(NULL, ...) [default] Success [Intel]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1
How can I solve the issue?
According to these supported platforms information, your GPU currently might not supported. Could you run the programs on CPU? Try running this command " lspci | grep -i vga " and share the output here.

Rigaya's NVEnc encodes file with no video or audio track

My source video file (1h 30min movie) is playable in both PotPlayer and VLC: h264, 8-bit color and 7755kb/s bitrate.
The NVEnc command I'm using is this:
.\nvencc\NVEncC64.exe --avhw -i "input.mkv" --codec hevc --preset quality --bframes 4 --ref 7 --cu-max 32 --cu-min 8 --output-depth 10 --audio-copy --sub-copy -o "output.mkv"
Encoding works fine (I believe):
NVEncC (x64) 5.26 (r1786) by rigaya, Jan 31 2021 09:23:04 (VC 1928/Win/avx2)
OS Version Windows 10 x64 (19042)
CPU AMD Ryzen 5 1600 Six-Core Processor [3.79GHz] (6C/12T)
GPU #0: GeForce GTX 1660 (1408 cores, 1830 MHz)[PCIe3x16][457.51]
NVENC / CUDA NVENC API 11.0, CUDA 11.1, schedule mode: auto
Input Buffers CUDA, 21 frames
Input Info avcuvid: H.264/AVC, 1920x800, 24000/1001 fps
AVSync vfr
Vpp Filters cspconv(nv12 -> p010)
Output Info H.265/HEVC main10 # Level auto
1920x800p 1:1 23.976fps (24000/1001fps)
avwriter: hevc, eac3, subtitle#1 => matroska
Encoder Preset quality
Rate Control CQP I:20 P:23 B:25
Lookahead off
GOP length 240 frames
B frames 4 frames [ref mode: disabled]
Ref frames 7 frames, MultiRef L0:auto L1:auto
AQ off
CU max / min 32 / 8
Others mv:auto
encoded 142592 frames, 219.97 fps, 1549.90 kbps, 1098.83 MB
encode time 0:10:48, CPU: 8.7%, GPU: 5.2%, VE: 98.3%, VD: 21.5%, GPUClock: 1966MHz, VEClock: 1816MHz
frame type IDR 595
frame type I 595, avgQP 20.00, total size 39.44 MB
frame type P 28519, avgQP 23.00, total size 471.93 MB
frame type B 113478, avgQP 25.00, total size 587.45 MB
but when I try to play it in either PotPlayer or VLC it says there is no video track or it just doesn't play at all.
MediaInfo also doesn't show any video, audio, or subtitle tracks either, just the name of the file and the file size. Am I missing something?
Switching --avhw to --avsw solved the problem.

Stats.txt in gem5 is always 0 Bytes

I compiled a higher version of the kernel.
The program data that was eventually run in fs mode is not written to stats.txt.
How do I solve this problem?
Gem5 is the latest version
ubuntu-14.04
build/X86/gem5.opt configs/example/fs.py --disk-image=linux-x86.img --kernel=/home/jiaotong/gem5/gem5-NUCA-master/fs-image/binaries/vmlinux
The kernel is a 3.2.24 kernel compiled with the configuration file of the official website 2.6.28.4.
this is system.terminal:
Booting Linux on physical CPU 0x0
Initializing cgroup subsys cpuset
Linux version 3.14.0 (sim#openstack1) (gcc version 4.8.4 (Ubuntu/Linaro 4.8.4-2ubuntu1~14.04.1) ) #1 SMP PREEMPT Sun May 15 16:33:56 CST 2016
Kernel was built at commit id 'b2af788'
CPU: ARMv7 Processor [410fc0f0] revision 0 (ARMv7), cr=10c53c7d
CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
Machine model: V2P-CA15
bootconsole [earlycon0] enabled
Memory policy: Data cache writealloc
kdebugv2m: Following are test values to confirm proper working
kdebugv2m: Ranges 42000000 0
kdebugv2m: Regs 30000000 1000000
kdebugv2m: Virtual-Reg f0000000
kdebugv2m: pci node addr_cells 3
kdebugv2m: pci node size_cells 2
kdebugv2m: motherboard addr_cells 2
On node 0 totalpages: 131072
free_area_init_node: node 0, pgdat 80679840, node_mem_map 9fbf9000
Normal zone: 1024 pages used for memmap
Normal zone: 0 pages reserved
Normal zone: 131072 pages, LIFO batch:31
sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 178956969942ns
PERCPU: Embedded 8 pages/cpu #9fbe1000 s12224 r8192 d12352 u32768
pcpu-alloc: s12224 r8192 d12352 u32768 alloc=8*4096
pcpu-alloc: [0] 0 [0] 1
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 130048
Kernel command line: earlyprintk=pl011,0x1c090000 console=ttyAMA0 lpj=19988480 norandmaps rw loglevel=8 mem=512MB root=/dev/sda1
PID hash table entries: 2048 (order: 1, 8192 bytes)
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)

Why after installation Raspbian on Orangepi PC disk size is shown only 3.6G of 32G?

I am very newbie in OrangePI PC. I have installed it by dd on my macOS, and I have tried installing a Raspbian image which downloaded from the orangepi.org in Windows as well, after installation when I check free disk space it is showing:
root#orangepi:~# df -h
Filesystem Size Used Avail Use% Mounted on
rootfs 3.4G 2.7G 474M 86% /
/dev/root 3.4G 2.7G 474M 86% /
devtmpfs 374M 0 374M 0% /dev
tmpfs 101M 188K 101M 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 201M 0 201M 0% /run/shm
/dev/mmcblk0p1 41M 4.9M 37M 12% /boot
I have installed it on 32G flash drive. But when I check it through fdisk command it shows 32G as a disk size:
root#orangepi:~# sudo fdisk -l
Disk /dev/mmcblk0: 32.0 GB, 32010928128 bytes
4 heads, 16 sectors/track, 976896 cylinders, total 62521344 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x34605ba5
Device Boot Start End Blocks Id System
/dev/mmcblk0p1 40960 124927 41984 83 Linux
/dev/mmcblk0p2 124928 7170047 3522560 83 Linux
root#orangepi:~#
How to fix this?
This solved my problem (solution is taken from here):
root#orangepi:~# fdisk /dev/mmcblk0
Command (m for help): p
Disk /dev/mmcblk0: 15.8 GB, 15804137472 bytes
4 heads, 16 sectors/track, 482304 cylinders, total 30867456 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x34605ba5
Device Boot Start End Blocks Id System
/dev/mmcblk0p1 40960 124927 41984 83 Linux
/dev/mmcblk0p2 124928 7170047 3522560 83 Linux
Command (m for help): d
Partition number (1-4): 2
Command (m for help): n
Partition type:
p primary (1 primary, 0 extended, 3 free)
e extended
Select (default p): p
Partition number (1-4, default 2): 2
First sector (2048-30867455, default 2048): 124928
Last sector, +sectors or +size{K,M,G} (124928-30867455, default 30867455):
Using default value 30867455
Command (m for help): w
Then quit (command q), reboot. You will then be able to use resize:
resize2fs /dev/root

Using `overlap`, `kernel time` and `utilization` to optimize one's kernels

My kernel archive 100% utilization, but the kernel time is at only 3% and there is no time overlap between memory copies and kernels.
Especially the high utilization and the low kernel time don't make sense to me.
So how should I proceed in optimizing my kernel?
I already made sure, that I only have coalesced and pinned memory access, like the profiler recommended.
`Quadro FX 580 utilization = 100.00% (62117.00/62117.00)`
Kernel time = 3.05 % of total GPU time
Memory copy time = 0.9 % of total GPU time
Kernel taking maximum time = Pinned (0.7% of total GPU time)
Memory copy taking maximum time = memcpyHtoD (0.5% of total GPU time)
There is no time overlap between memory copies and kernels on GPU
Furtermore I have no warp serialization, no divergent branches, and no occupancy limiting factor.
Kernel details: Grid size: [4 1 1], Block size: [256 1 1]
Register Ratio: 0.9375 ( 7680 / 8192 ) [10 registers per thread]
Shared Memory Ratio: 0.09375 ( 1536 / 16384 ) [60 bytes per Block]
Active Blocks per SM: 3 (Maximum Active Blocks per SM: 8)
Active threads per SM: 768 (Maximum Active threads per SM: 768)
Potential Occupancy: 1 ( 24 / 24 )
Achieved occupancy: 0.333333 (on 4 SMs)
Occupancy limiting factor: None
p.s. I don't claim that I wrote wundercode, but I just don't know how to proceed from here.
it seems the grid size of your kernel is too small to make full use of SM.
why not decrease block size and increase the grid size.
i think it will do some help.