I am trying to install OpenCL for BEAGLE. First, I downloaded intel_sdk_for_opencl_applications_2020.3.494.tar.gz from here. Then I unpacked it and ran install.sh; the installation was successful. Since I already have BEAGLE installed, I went to the build folder in beagle-lib and ran cmake -DCMAKE_INSTALL_PREFIX:PATH=$HOME .. (intending to follow up with make install), but I get the following message:
-- JAVA_HOME=
-- JNI_INCLUDE_DIRS=/usr/lib/jvm/java/include;/usr/lib/jvm/java/include/linux;/usr/lib/jvm/java/include
-- JNI_LIBRARIES=/usr/lib/jvm/java/lib/libjawt.so;/usr/lib/jvm/java/lib/server/libjvm.so
-- Not using libtools for plugins
-- Could NOT find OpenCL (missing: OpenCL_LIBRARY OpenCL_INCLUDE_DIR)
CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY)
-- Configuring done
-- Generating done
-- Build files have been written to: /export/home/rinkman/beagle-lib/build
I have tried setting these variables, both as environment variables and as shell variables:
OpenCL_LIBRARY=~/intel/system_studio_2020/opencl/SDK/include/CL/cl.h
OpenCL_INCLUDE_DIR=True
But the result of running cmake is the same. I have verified that the variables were set correctly, so I cannot understand what is wrong. I want to use the OpenCL framework in BEAGLE when running the BEAST 2 software. Could anyone please help with this?
P.S. I am a Linux novice.
My result of running beast -beagle-info:
BEAST v2.6.6, 2002-2021
Bayesian Evolutionary Analysis Sampling Trees
Designed and developed by
Remco Bouckaert, Alexei J. Drummond, Andrew Rambaut & Marc A. Suchard
Centre for Computational Evolution
University of Auckland
r.bouckaert@auckland.ac.nz
alexei@cs.auckland.ac.nz
Institute of Evolutionary Biology
University of Edinburgh
a.rambaut@ed.ac.uk
David Geffen School of Medicine
University of California, Los Angeles
msuchard@ucla.edu
Downloads, Help & Resources:
http://beast2.org/
Source code distributed under the GNU Lesser General Public License:
http://github.com/CompEvol/beast2
BEAST developers:
Alex Alekseyenko, Trevor Bedford, Erik Bloomquist, Joseph Heled,
Sebastian Hoehna, Denise Kuehnert, Philippe Lemey, Wai Lok Sibon Li,
Gerton Lunter, Sidney Markowitz, Vladimir Minin, Michael Defoin Platel,
Oliver Pybus, Tim Vaughan, Chieh-Hsi Wu, Walter Xie
Thanks to:
Roald Forsberg, Beth Shapiro and Korbinian Strimmer
--- BEAGLE RESOURCES ---
0 : CPU
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_SSE VECTOR_NONE THREADING_NONE PROCESSOR_CPU FRAMEWORK_CPU
Here is the output on my Windows computer, for comparison:
BEAST v2.6.6, 2002-2021
[... same banner as above ...]
--- BEAGLE RESOURCES ---
0 : CPU
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_SSE VECTOR_NONE THREADING_NONE PROCESSOR_CPU FRAMEWORK_CPU
1 : NVIDIA GeForce 940MX
Global memory (MB): 2048
Clock speed (Ghz): 1.19
Number of cores: 384
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH COMPUTATION_ASYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_CUDA
2 : Intel(R) HD Graphics 620 (OpenCL 2.1 )
Global memory (MB): 3219
Clock speed (Ghz): 1.00
Number of compute units: 24
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH COMPUTATION_ASYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_OPENCL
3 : Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (OpenCL 2.1 (Build 10))
Global memory (MB): 8067
Clock speed (Ghz): 2.50
Number of compute units: 4
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH COMPUTATION_ASYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_CPU FRAMEWORK_OPENCL
Could NOT find OpenCL (missing: OpenCL_LIBRARY OpenCL_INCLUDE_DIR)
This means that CMake could not find OpenCL because the CMake variables OpenCL_LIBRARY and OpenCL_INCLUDE_DIR are not set. They are CMake variables, not environment variables, so setting them in the environment changes nothing.
You need to tell CMake where things are if they are not installed in standard directories (as seems to be the case with your OpenCL installation).
Try adding -DOCL_ROOT=path/to/base/of/your/OpenCL to your call to CMake as a package-specific hint, or -DCMAKE_LIBRARY_PATH=path/to/... to make CMake search that path (in addition to the standard paths) for any package it is looking for. Make sure you clean away all cached files beforehand (e.g. delete CMakeCache.txt in the build directory), so that CMake runs clean and does not reuse stale cached values.
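For example, with the Intel SDK location from your question, the invocation might look like this (a sketch; adjust the SDK path to wherever install.sh actually put things on your machine):
rm -f CMakeCache.txt   # start from a clean CMake run, discarding cached values
cmake -DOCL_ROOT=$HOME/intel/system_studio_2020/opencl/SDK -DCMAKE_INSTALL_PREFIX:PATH=$HOME ..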
The solution was to set the variables through -D options to cmake:
cmake -DOpenCL_INCLUDE_DIR=~/intel/system_studio_2020/opencl/SDK/include/ -DOpenCL_LIBRARY=~/intel/system_studio_2020/opencl/SDK/lib64/libOpenCL.so.1.2 -DCMAKE_INSTALL_PREFIX:PATH=$HOME ..
Compilation was successful after that, make test ran without any problems, and I found the file libhmsbeagle-opencl.so.40.0.0 in the ~/lib folder.
Related
I’m trying to run a TensorFlow2 example from the Graphcore public examples (MNIST). I’m using the IPU model instead of IPU hardware because my machine doesn’t have access to IPU hardware, so I’ve followed the documentation (Running on the IPU Model simulator) and added the following to my model:
import os  # os.environ is used below

# Using IPU model instead of IPU hardware
if self.base_dictionary['ipu_model']:
    os.environ['TF_POPLAR_FLAGS'] = '--use_ipu_model'
When I run the model, it fails with: Illegal instruction (core dumped). I don’t see where this comes from as I used an existing example. What is this error and how do I solve it?
Illegal instruction means that your program contains instructions that your CPU can't execute. The Graphcore TensorFlow wheel is compiled for Skylake-class CPUs with the AVX-512 instruction set, so processors that do not meet this requirement (i.e. a Skylake-class CPU with AVX-512 capabilities) will not be able to run Graphcore TensorFlow code. (You can see the requirements in the "Requirements" section of the SDK Overview documentation here.)
To see whether your processors have AVX-512 capabilities, run cat /proc/cpuinfo and look at the flags field of any of the processors (they should all have the same flags). If you don't see avx512f, your processors don't meet the Graphcore requirements for running TensorFlow code; there is also a one-line check after the example below. Here is an example of what the cat command returns on a machine that meets the requirements (truncated to one processor):
processor : 95
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
stepping : 4
microcode : 0x2000064
cpu MHz : 1200.703
cache size : 33792 KB
physical id : 1
siblings : 48
core id : 27
cpu cores : 24
apicid : 119
initial apicid : 119
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 5401.49
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
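Rather than eyeballing the flags line, you can grep for the flag directly (a simple check; assumes a Linux shell):
grep -q avx512f /proc/cpuinfo && echo "AVX-512 supported" || echo "AVX-512 NOT supported"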
Machines provided by Graphcore or their partners will always meet these requirements, so it's best to use them. They'll also have enough cores and memory, which might not be the case on e.g. a personal laptop.
I am going to buy a laptop to do some TF work. Is the GPU version of TF able to take advantage of the Nvidia Quadro P1000 and P2000? Will it run faster on these two GPUs than on the mobile version of the 1050 Ti? Thanks
If I am correct, TensorFlow can run on all Nvidia devices that support CUDA.
Check this website for their compute capabilities:
https://developer.nvidia.com/cuda-gpus
There you can see the compute capability of each Nvidia GPU card.
As for your question about those three cards (P1000, P2000, GeForce 1050 Ti): they all have the same compute capability, 6.1, which means they support the same set of GPU operations and won't differ much in that respect.
But from their datasheets (P2000, P1000, 1050 Ti):
------------------------------------------------------------
|        | Memory    | Memory Interface | Memory Bandwidth |
------------------------------------------------------------
| P1000  | 4GB GDDR5 | 128-bit          | 82 GB/s          |
| P2000  | 5GB GDDR5 | 160-bit          | 140 GB/s         |
| 1050Ti | 4GB GDDR5 | 128-bit          | 112 GB/s         |
------------------------------------------------------------
Based on memory bandwidth, I would say: P2000 > 1050 Ti > P1000.
By the way, what does that 6.1 number mean? Basically, it indicates which operations and features the GPU supports. You can find the details at this link, and there is a similar discussion here.
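If you already have an Nvidia GPU at hand, you can also query its compute capability directly instead of looking it up (a sketch; the compute_cap query field requires a reasonably recent nvidia-smi from the NVIDIA driver):
nvidia-smi --query-gpu=name,compute_cap --format=csv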
I am running the following code on Google Cloud ML using the BASIC GPU tier (Tesla K80):
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10
LRN is taking the most time, and it is running on the CPU. I am wondering whether the following stats quoted in https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_train.py were obtained by running on the CPU, because I don't see that being the case here.
System | Step Time (sec/batch) | Accuracy
1 Tesla K20m | 0.35-0.60 | ~86% at 60K steps (5 hours)
If I force it to run on the GPU, it throws the following error:
Cannot assign a device to node 'norm1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available. [[Node: norm1 = LRN[T=DT_HALF, alpha=0.00011111111, beta=0.75, bias=1, depth_radius=4, _device="/device:GPU:0"]]]
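A quick way to see which device each op lands on is the tutorial's own flag (assuming the cifar10_train.py version that defines log_device_placement; it makes the session log the device assigned to every op, including norm1):
python cifar10_train.py --log_device_placement=True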
I have 6 RX 470 GPUs. Each should be mining at an average of 25-27 MH/s, but each is only doing 20 MH/s, so the total is 120 instead of 150-170. I think the problem is the GPU BIOS configuration, but I can't figure out anything else. Any suggestions?
25 MH/s is what you would expect from a stock RX 480. To get the same hashrate from an RX 470, you'd be looking at overclocking the memory speed (+600). How to overclock depends on whether you're running Linux or Windows.
Can anyone shed any light on the output of intel_gpu_top? Specifically, what are the tasks GAM, VS, etc.? (The man page isn't much help.)
Also, what does bitstream busy mean? It always seems to be zero...
render busy: 45%: █████████ render space: 83/131072
bitstream busy: 0%: bitstream space: 0/131072
blitter busy: 0%: blitter space: 0/131072
task percent busy
GAM: 43%: ████████▋ vert fetch: 0 (0/sec)
VS: 35%: ███████ prim fetch: 0 (0/sec)
CL: 33%: ██████▋ VS invocations: 2101845324 (1427552/sec)
SF: 33%: ██████▋ GS invocations: 0 (0/sec)
VF: 33%: ██████▋ GS prims: 0 (0/sec)
GAFS: 33%: ██████▋ CL invocations: 701123988 (475776/sec)
SOL: 32%: ██████▌ CL prims: 701708489 (475888/sec)
GS: 32%: ██████▌ PS invocations: 1254669239424 (116548992/sec)
DS: 32%: ██████▌ PS depth pass: 604287310764 (222384008/sec)
TDG: 2%: ▌
URBM: 2%: ▌
GAFM: 1%: ▎
HS: 0%:
SVG: 0%:
VFE: 0%:
I was curious as well, so here are just a few things I could glean from the reference manuals. Also of interest is the intel-gpu-tools source, especially lib/instdone.c, which describes what can appear on all Intel GPU models. This patch was also hugely helpful in translating all those acronyms!
Some may be wrong; I'd love it if somebody more knowledgeable could chime in! I'll come back to update the answer with more as I learn this stuff.
First, the three lines on the right:
The render space is probably used by regular 3D operations.
The bitstream section refers to the BSD (Bit-Stream Decoder), which handles hardware acceleration for media decoding. It does not appear on my GPU though (Skylake HD 530), so it might not be enabled/visible everywhere.
The blitter is described in vol. 11 and seems responsible for hardware acceleration of 2D operations (blitting).
Fixed-function (FF) pipeline units (old-school GPU features):
VF: Vertex Fetcher (vol. 1), the first FF unit in the 3D Pipeline responsible for fetching vertex data from memory.
VS: Vertex Shader (vol.1), computes things on the vertices of each primitive drawn by the GPU. Pretty standard operation on GPUs.
HS: Hull Shader
TE: Tessellation Engine
DS: Domain Shader
GS: Geometry Shader
SOL: Stream Output Logic
CL: Clip Unit
SF: Strips and Fans (vol.1), FF unit whose main function is to decompose primitive topologies such as strips and fans into primitives or objects.
Units used for thread and pipeline management, for both FF units and GPGPU (see the Intel Open Source HD Graphics Programmer's Reference Manual for a lot of info on how this all works):
CS: Command Streamer (vol.1), functional unit of the Graphics Processing Engine that fetches commands, parses them, and routes them to the appropriate pipeline.
TDG: Thread Dispatcher
VFE: Video Front-End
TSG: Thread Spawner
URBM: Unified Return Buffer Manager
Other stuff:
GAM: see GFX Page Walker (vol. 5), also called the Memory Arbiter; it has to do with how the GPU keeps track of its memory pages, quite similar to what the TLB (see also SLAT) does for your RAM.
SDE: South Display Engine; according to vol. 12, "the South Display Engine supports Hot Plug Detection, GPIO, GMBUS, Panel Power Sequencing, and Backlight Modulation".