My source video file (1h 30min movie) is playable in both PotPlayer and VLC: h264, 8-bit color and 7755kb/s bitrate.
The NVEnc command I'm using is this:
.\nvencc\NVEncC64.exe --avhw -i "input.mkv" --codec hevc --preset quality --bframes 4 --ref 7 --cu-max 32 --cu-min 8 --output-depth 10 --audio-copy --sub-copy -o "output.mkv"
Encoding works fine (I believe):
NVEncC (x64) 5.26 (r1786) by rigaya, Jan 31 2021 09:23:04 (VC 1928/Win/avx2)
OS Version Windows 10 x64 (19042)
CPU AMD Ryzen 5 1600 Six-Core Processor [3.79GHz] (6C/12T)
GPU #0: GeForce GTX 1660 (1408 cores, 1830 MHz)[PCIe3x16][457.51]
NVENC / CUDA NVENC API 11.0, CUDA 11.1, schedule mode: auto
Input Buffers CUDA, 21 frames
Input Info avcuvid: H.264/AVC, 1920x800, 24000/1001 fps
AVSync vfr
Vpp Filters cspconv(nv12 -> p010)
Output Info H.265/HEVC main10 # Level auto
1920x800p 1:1 23.976fps (24000/1001fps)
avwriter: hevc, eac3, subtitle#1 => matroska
Encoder Preset quality
Rate Control CQP I:20 P:23 B:25
Lookahead off
GOP length 240 frames
B frames 4 frames [ref mode: disabled]
Ref frames 7 frames, MultiRef L0:auto L1:auto
AQ off
CU max / min 32 / 8
Others mv:auto
encoded 142592 frames, 219.97 fps, 1549.90 kbps, 1098.83 MB
encode time 0:10:48, CPU: 8.7%, GPU: 5.2%, VE: 98.3%, VD: 21.5%, GPUClock: 1966MHz, VEClock: 1816MHz
frame type IDR 595
frame type I 595, avgQP 20.00, total size 39.44 MB
frame type P 28519, avgQP 23.00, total size 471.93 MB
frame type B 113478, avgQP 25.00, total size 587.45 MB
but when I try to play it in either PotPlayer or VLC it says there is no video track or it just doesn't play at all.
MediaInfo also doesn't show any video, audio, or subtitle tracks either, just the name of the file and the file size. Am I missing something?
Switching --avhw to --avsw solved the problem.
Related
I would like to run OpenVINO on an integrated GPU Intel HD 400.
When I run it I have the following error:
12.05.21 16:02:27 (-0400) self._engine = self._ie.load_network(**openvino_config)
12.05.21 16:02:27 (-0400) File "ie_api.pyx", line 311, in openvino.inference_engine.ie_api.IECore.load_network
12.05.21 16:02:27 (-0400) File "ie_api.pyx", line 320, in openvino.inference_engine.ie_api.IECore.load_network
12.05.21 16:02:27 (-0400) RuntimeError: Failed to create plugin /opt/intel/openvino_2021.1.110/deployment_tools/inference_engine/lib/intel64/libclDNNPlugin.so for device GPU
I followed the instructions to install the GPU plugin here Steps for IntelĀ® Processor Graphics (GPU) and I can confirm that libclDNNPlugin.so exists.
I am running the code within a docker container and I am not sure if the host os has the proper drivers installed.
I run lsmod on host os and I got the following output
Module Size Used by
bnep 20480 2
ip6t_REJECT 16384 2
nf_reject_ipv6 16384 1 ip6t_REJECT
ip6table_filter 16384 1
xt_state 16384 0
ipt_REJECT 16384 2
nf_reject_ipv4 16384 1 ipt_REJECT
ip6_tables 28672 1 ip6table_filter
xt_MASQUERADE 16384 12
nf_conntrack_netlink 32768 0
nfnetlink 16384 2 nf_conntrack_netlink
xfrm_user 36864 1
xt_owner 16384 0
snd_hda_codec_hdmi 53248 1
snd_hda_codec_realtek 98304 1
snd_hda_codec_generic 65536 1 snd_hda_codec_realtek
snd_hda_intel 32768 0
coretemp 16384 0
snd_intel_dspcfg 16384 1 snd_hda_intel
snd_hda_codec 94208 4 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec_realtek
i915 1802240 5
at24 20480 0
r8169 73728 0
snd_hda_core 65536 5 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
regmap_i2c 16384 1 at24
snd_pcm 86016 4 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_core
efivars 20480 0
realtek 20480 2
libphy 81920 2 r8169,realtek
snd_timer 28672 1 snd_pcm
video 40960 1 i915
backlight 16384 2 video,i915
sch_fq_codel 20480 4
The error specifies: [CLDNN ERROR]. No GPU device was found.
Also, clinfo reported 0 platform available.
I run the following commands:
sudo apt install ocl-icd-libopencl1
sudo apt install opencl-headers
sudo apt install clinfo
sudo apt install ocl-icd-opencl-dev
sudo apt install beignet
and now the clinfo output is:
Number of platforms 1
Platform Name Intel Gen OCL Driver
Platform Vendor Intel
Platform Version OpenCL 2.0 beignet 1.3
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing
Platform Extensions function suffix Intel
Platform Name Intel Gen OCL Driver
Number of devices 1
Device Name Intel(R) HD Graphics Cherryview
Device Vendor Intel
Device Vendor ID 0x8086
Device Version OpenCL 1.2 beignet 1.3
Driver Version 1.3
Device OpenCL C Version OpenCL C 1.2 beignet 1.3
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 12
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 1
Supported partition types None, None, None
Max work item dimensions 3
Max work item sizes 512x512x512
Max work group size 512
Preferred work group size multiple 16
Preferred / native vector sizes
char 16 / 8
short 8 / 8
int 4 / 4
long 2 / 2
half 0 / 8 (cl_khr_fp16)
float 4 / 4
double 0 / 2 (n/a)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 2147483648 (2GiB)
Error Correction support No
Max memory allocation 1610612736 (1.5GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 8192 (8KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 4096 bytes
Pitch alignment for 2D image buffers 1 pixels
Max 2D image size 8192x8192 pixels
Max 3D image size 8192x8192x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 65536 (64KiB)
Max number of constant args 8
Max constant buffer size 134217728 (128MiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 80ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_khr_fp16
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Intel Gen OCL Driver
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [Intel]
clCreateContext(NULL, ...) [default] Success [Intel]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1
root#406231114801:/src# clinfo
Number of platforms 1
Platform Name Intel Gen OCL Driver
Platform Vendor Intel
Platform Version OpenCL 2.0 beignet 1.3
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing
Platform Extensions function suffix Intel
Platform Name Intel Gen OCL Driver
Number of devices 1
Device Name Intel(R) HD Graphics Cherryview
Device Vendor Intel
Device Vendor ID 0x8086
Device Version OpenCL 1.2 beignet 1.3
Driver Version 1.3
Device OpenCL C Version OpenCL C 1.2 beignet 1.3
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 12
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 1
Supported partition types None, None, None
Max work item dimensions 3
Max work item sizes 512x512x512
Max work group size 512
Preferred work group size multiple 16
Preferred / native vector sizes
char 16 / 8
short 8 / 8
int 4 / 4
long 2 / 2
half 0 / 8 (cl_khr_fp16)
float 4 / 4
double 0 / 2 (n/a)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 2147483648 (2GiB)
Error Correction support No
Max memory allocation 1610612736 (1.5GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 8192 (8KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 4096 bytes
Pitch alignment for 2D image buffers 1 pixels
Max 2D image size 8192x8192 pixels
Max 3D image size 8192x8192x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 65536 (64KiB)
Max number of constant args 8
Max constant buffer size 134217728 (128MiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 80ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_khr_fp16
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Intel Gen OCL Driver
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [Intel]
clCreateContext(NULL, ...) [default] Success [Intel]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics Cherryview
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1
How can I solve the issue?
According to these supported platforms information, your GPU currently might not supported. Could you run the programs on CPU? Try running this command " lspci | grep -i vga " and share the output here.
I am trying to set the fps of the coral mipi camera to 60fps. From the datasheet, it says it can run 720p at 60fps so i know the camera is capable.
I have set the resolution using the following command which works.
v4l2-ctl --device=/dev/video1 --set-fmt-video=width=1280,height=720
but when I set the fps to 60, the max limit seem to be 30.
mendel#mocha-calf:~$ v4l2-ctl --device=/dev/video1 -p 60
Frame rate set to 30.000 fps
There are no option to set exposure or gain. Would I have to rebuild the driver to be able to have these options?
Regards
Paul
60 FPS not supported. Please see the supported formats below.
mendel#tuned-rabbit:~$ v4l2-ctl --list-formats-ext
ioctl: VIDIOC_ENUM_FMT
Type: Video Capture
[0]: 'YUYV' (YUYV 4:2:2)
Size: Discrete 640x480
Interval: Discrete 0.033s (30.000 fps)
Size: Discrete 720x480
Interval: Discrete 0.033s (30.000 fps)
Size: Discrete 1280x720
Interval: Discrete 0.033s (30.000 fps)
Size: Discrete 1920
And for changing/setting exposure and gain, you may need to make modifications in the current driver code and rebuild the kernel image. Driver code can be found at:
https://coral.googlesource.com/linux-imx/+/refs/heads/master/drivers/media/platform/mxc/capture/ov5645_mipi_v2.c#2774
And instructions to build the kernel image can be seen at : https://coral.googlesource.com/docs/+/refs/heads/master/GettingStarted.md
I just built a new AMD-based PC, with CPU - AMD Ryzen 7 3700X, GPU - AMD Radeon RX Vega 56, OS - Ubuntu 18.04. In order to use AMD GPU for Tensorflow, I follow these two to install ROCm. Everything seems fine and no problems in installation. I think I install ROCm 3. I do exact as the posts.
https://towardsdatascience.com/train-neural-networks-using-amd-gpus-and-keras-37189c453878
https://www.videogames.ai/Install-ROCM-Machine-Learning-AMD-GPU
video link: https://www.youtube.com/watch?v=fkSRkAoMS4g
But when I ran rocm-bandwidth-test in the terminal, as the video, I had result as below.
(base) nick#nick-nbpc:~$ rocm-bandwidth-test
........
RocmBandwidthTest Version: 2.3.11
Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, AMD Ryzen 7 3700X 8-Core Processor
Device: 1, Vega 10 XT [Radeon RX Vega 64], 2f:0.0
Inter-Device Access
D/D 0 1
0 1 0
1 1 1
Inter-Device Numa Distance
D/D 0 1
0 0 N/A
1 20 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 9.295924
1 8.892247 72.654038
Bdirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 17.103560
1 17.103560 N/A
(base) nick#nick-nbpc:~$
The video is using AMD RX 580 GPU, and I compare the technical specs from the link below.
https://www.youtube.com/watch?v=shstdFZJJ_o
which is showing that RX580 has memory bandwidth 256 Gb/s and Vega 56 has 409.6 Gb/s. In the other video, the uploader has a bandwidth 195 Gb/s at time 11:09 of the video. But my Vega 56 only has 72.5 Gb/s! This is a huge difference. I don't know what is wrong.
Then I install python 3.6 and TensorFlow-ROCm. And I git clone https://github.com/tensorflow/benchmarks.git, just as the video, to do the benchmark test in tensorflow.
Execute the code:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Gives this result:
Done warm up
Step Img/sec total_loss
1 images/sec: 81.0 +/- 0.0 (jitter = 0.0) 7.765
10 images/sec: 80.7 +/- 0.1 (jitter = 0.2) 8.049
20 images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.808
30 images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.976
40 images/sec: 80.9 +/- 0.1 (jitter = 0.2) 7.591
50 images/sec: 81.2 +/- 0.1 (jitter = 0.3) 7.549
60 images/sec: 81.5 +/- 0.1 (jitter = 0.6) 7.819
70 images/sec: 81.7 +/- 0.1 (jitter = 1.1) 7.820
80 images/sec: 81.8 +/- 0.1 (jitter = 1.5) 7.847
90 images/sec: 82.0 +/- 0.1 (jitter = 0.8) 8.025
100 images/sec: 82.1 +/- 0.1 (jitter = 0.6) 8.029
----------------------------------------------------------------
total images/sec: 82.07
----------------------------------------------------------------
The result is not as good as I expected. I was expecting some number 100+. But due to my limited knowledge on Ubuntu/AMD/TensorFlow, I might be very likely wrong. If not, can someone tell me why my bandwidth is not as fast as 400 Gb/s?
========================================
clinfo
(base) nick#nick-nbpc:~$ clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (3137.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Vega 10 XT [Radeon RX Vega 64]
Device Topology: PCI[ B#47, D#0, F#0 ]
Max compute units: 56
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1590Mhz
Address bits: 64
Max memory allocation: 7287183769
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 26751
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8573157376
Constant buffer size: 7287183769
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2992216473
Max global variable size: 7287183769
Max global variable preferred total size: 8573157376
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7fe56aa5fcf0
Name: gfx900
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3137.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
(base) nick#nick-nbpc:~$
rocminfo
(base) nick#nick-nbpc:~$ rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 3700X 8-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 3700X 8-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 0
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 16436616(0xfacd88) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16436616(0xfacd88) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx900
Uuid: GPU-02151e1bb9ee2144
Marketing Name: Vega 10 XT [Radeon RX Vega 64]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26751(0x687f)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1590
BDFID: 12032
Internal Node ID: 1
Compute Unit: 56
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx900
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
(base) nick#nick-nbpc:~$
i cant answer the bandwidth question but i have just tried out the same benchmarks (according to the youtube video)
i get:
(vrocm1) user1#t1000test:~$ rocm-bandwidth-test
........
RocmBandwidthTest Version: 2.3.11
Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, AMD Ryzen 7 2700X Eight-Core Processor
Device: 1, Vega 10 XL/XT [Radeon RX Vega 56/64], 28:0.0
Inter-Device Access
D/D 0 1
0 1 0
1 1 1
Inter-Device Numa Distance
D/D 0 1
0 0 N/A
1 20 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 9.542044
1 9.028717 72.202459
Bdirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 17.144430
1 17.144430 N/A
which is the same as you got. but:
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
gives me:
Done warm up
Step Img/sec total_loss
1 images/sec: 172.0 +/- 0.0 (jitter = 0.0) 7.765
10 images/sec: 172.5 +/- 0.1 (jitter = 0.6) 8.049
20 images/sec: 172.6 +/- 0.1 (jitter = 0.4) 7.808
30 images/sec: 172.5 +/- 0.1 (jitter = 0.6) 7.976
40 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.591
50 images/sec: 172.5 +/- 0.1 (jitter = 0.6) 7.549
60 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.819
70 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.819
80 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.848
90 images/sec: 172.6 +/- 0.0 (jitter = 0.5) 8.025
100 images/sec: 172.5 +/- 0.0 (jitter = 0.5) 8.029
----------------------------------------------------------------
total images/sec: 172.39
----------------------------------------------------------------
cliinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (3182.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Vega 10 XL/XT [Radeon RX Vega 56/64]
Device Topology: PCI[ B#40, D#0, F#0 ]
Max compute units: 64
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1630Mhz
Address bits: 64
Max memory allocation: 7287183769
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 26751
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8573157376
Constant buffer size: 7287183769
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2992216473
Max global variable size: 7287183769
Max global variable preferred total size: 8573157376
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7efe04b66cd0
Name: gfx900
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3182.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 2700X Eight-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 2700X Eight-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3700
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32898020(0x1f5fbe4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32898020(0x1f5fbe4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx900
Uuid: GPU-021508a5025618e4
Marketing Name: Vega 10 XL/XT [Radeon RX Vega 56/64]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26751(0x687f)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1630
BDFID: 10240
Internal Node ID: 1
Compute Unit: 64
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx900
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
the only thing that you seem to have done differently is :
Execute the code: python tf_cnn_benchmarks.py --num_gpus=1
--batch_size=32 --model=resnet50
which executes the test in python2 ... (but maybe its just a typo)
greetings gspeet
Loading 1500 images of size (1000,1000,3) breaks the code and throughs kill 9 without any further error. Memory used before this line of code is 16% of system total memory. Total size of images direcotry is 7.1G.
X = np.asarray(images).astype('float64')
y = np.asarray(labels).astype('float64')
system spec is:
OS: macOS Catalina
processor: 2.2 GHz 6-Core Intel Core i7 16 GB 2
memory: 16 GB 2400 MHz DDR4
Update:
getting the bellow error while running the code on 32 vCPUs, 120 GB memory.
MemoryError: Unable to allocate 14.1 GiB for an array with shape (1200, 1024, 1024, 3) and data type float32
You would have to provide some more info/details for an exact answer but, assuming that this is a memory error(incredibly likely, size of the images on disk does not represent the size they would occupy in memory, so that is irrelevant. In 100% of all cases, the images in memory will occupy a lot more space due to pointers, objects that are needed and so on. Intuitively I would say that 16GB of ram is nowhere nearly enough to load 7GB of images. It's impossible to tell you how much you would need but from experience I would say that you'd need to bump it up to 64GB. If you are using Keras, I would suggest looking into the DirectoryIterator.
Edit:
As Cris Luengo pointed out, I missed the fact that you stated the size of the images.
No matter what library/SDK to use, I want to convert from avi to asf very quickly (I could even sacrifice some quality of video and audio). I am working on Windows platform (Vista and 2008 Server), better .Net SDK/code, C++ code is also fine. :-)
I learned from the below link, that there could be a very quick way to convert from avi to asf to support streaming better, as mentioned "could convert the video from AVI to ASF format using a simple copy (i.e. the content is the same, but container changes).". My question is after some hours of study and trial various SDK/tools, as a newbie, I do not know how to begin with so I am asking for reference sample code to do this task. :-)
(as this is a different issue, we decide to start a new topic. :-) )
Issue with streaming AVI files
thanks in advance,
George
EDIT 1:
I have tried to get the binary of ffmpeg from,
http://ffmpeg.arrozcru.org/autobuilds/ffmpeg-latest-mingw32-static.tar.bz2
then run the following command,
C:\software\ffmpeg-latest-mingw32-static\bin>ffmpeg.exe -i test.avi -acodec copy
-vcodec copy test.asf
FFmpeg version SVN-r18506, Copyright (c) 2000-2009 Fabrice Bellard, et al.
configuration: --enable-memalign-hack --prefix=/mingw --cross-prefix=i686-ming
w32- --cc=ccache-i686-mingw32-gcc --target-os=mingw32 --arch=i686 --cpu=i686 --e
nable-avisynth --enable-gpl --enable-zlib --enable-bzlib --enable-libgsm --enabl
e-libfaac --enable-pthreads --enable-libvorbis --enable-libmp3lame --enable-libo
penjpeg --enable-libtheora --enable-libspeex --enable-libxvid --enable-libfaad -
-enable-libschroedinger --enable-libx264
libavutil 50. 3. 0 / 50. 3. 0
libavcodec 52.25. 0 / 52.25. 0
libavformat 52.32. 0 / 52.32. 0
libavdevice 52. 2. 0 / 52. 2. 0
libswscale 0. 7. 1 / 0. 7. 1
built on Apr 14 2009 04:04:47, gcc: 4.2.4
Input #0, avi, from 'test.avi':
Duration: 00:00:44.86, start: 0.000000, bitrate: 5291 kb/s
Stream #0.0: Video: msvideo1, rgb555le, 1280x1024, 5 tbr, 5 tbn, 5 tbc
Stream #0.1: Audio: pcm_s16le, 22050 Hz, mono, s16, 352 kb/s
Output #0, asf, to 'test.asf':
Stream #0.0: Video: CRAM / 0x4D415243, rgb555le, 1280x1024, q=2-31, 1k tbn,
5 tbc
Stream #0.1: Audio: pcm_s16le, 22050 Hz, mono, s16, 352 kb/s
Stream mapping:
Stream #0.0 -> #0.0
Stream #0.1 -> #0.1
Press [q] to stop encoding
frame= 224 fps=222 q=-1.0 Lsize= 29426kB time=44.80 bitrate=5380.7kbits/s
video:26910kB audio:1932kB global headers:0kB muxing overhead 2.023317%
C:\software\ffmpeg-latest-mingw32-static\bin>
http://www.microsoft.com/windows/windowsmedia/player/webhelp/default.aspx?&mpver=11.0.6001.7000&id=C00D11B1&contextid=230&originalid=C00D36E6
then have the following error when using Windows Media Player to play it, does anyone have any ideas?
http://www.microsoft.com/windows/windowsmedia/player/webhelp/default.aspx?&mpver=11.0.6001.7000&id=C00D11B1&contextid=230&originalid=C00D36E6
Maybe you could use FFMPEG and run a command like this (I haven't tried):
ffmpeg.exe -i test.avi -acodec copy -vcodec copy test.asf