TensorFlow 2.0: utilize all CPU cores at 100%

My TensorFlow model makes heavy use of data preprocessing that should be done on the CPU to leave the GPU free for training.
top - 09:57:54 up 16:23, 1 user, load average: 3,67, 1,57, 0,67
Tasks: 400 total, 1 running, 399 sleeping, 0 stopped, 0 zombie
%Cpu(s): 19,1 us, 2,8 sy, 0,0 ni, 78,1 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
MiB Mem : 32049,7 total, 314,6 free, 5162,9 used, 26572,2 buff/cache
MiB Swap: 6779,0 total, 6556,0 free, 223,0 used. 25716,1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17604 joro 20 0 22,1g 2,3g 704896 S 331,2 7,2 4:39.33 python
This is what top shows me. I would like to make this Python process use at least 90% of the available CPU across all cores. How can this be achieved?
GPU utilization is better, at around 90%, even though I don't know why it is not at 100%.
Mon Aug 10 10:00:13 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 On | N/A |
| 35% 41C P2 90W / 260W | 10515MiB / 11016MiB | 11% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1128 G /usr/lib/xorg/Xorg 102MiB |
| 0 1648 G /usr/lib/xorg/Xorg 380MiB |
| 0 1848 G /usr/bin/gnome-shell 279MiB |
| 0 10633 G ...uest-channel-token=1206236727 266MiB |
| 0 13794 G /usr/lib/firefox/firefox 6MiB |
| 0 17604 C python 9457MiB |
+-----------------------------------------------------------------------------+
All I found was a solution for TensorFlow 1.x:
sess = tf.Session(config=tf.ConfigProto(
    intra_op_parallelism_threads=NUM_THREADS))
I have an Intel i9-9900K and an RTX 2080 Ti and use Ubuntu 20.04.
Edit: When I add the following code at the top, it uses one core at 100%:
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)
But increasing this number to 16 again only utilizes all cores at ~30%.

Just setting set_intra_op_parallelism_threads and set_inter_op_parallelism_threads wasn't working for me. In case someone else is in the same situation: after a lot of struggle with the same issue, the piece of code below worked for me in limiting TensorFlow's CPU usage to below 500%:
import os
import tensorflow as tf

num_threads = 5
os.environ["OMP_NUM_THREADS"] = str(num_threads)
os.environ["TF_NUM_INTRAOP_THREADS"] = str(num_threads)
os.environ["TF_NUM_INTEROP_THREADS"] = str(num_threads)

tf.config.threading.set_inter_op_parallelism_threads(num_threads)
tf.config.threading.set_intra_op_parallelism_threads(num_threads)
tf.config.set_soft_device_placement(True)
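Note that these threading settings have to be applied before TensorFlow executes its first operation; changing them after TensorFlow has initialized raises a RuntimeError. A minimal check that the values took effect (just a sketch, assuming the configuration above ran first):
import tensorflow as tf

# Should print the values configured above (5 in this example).
print(tf.config.threading.get_intra_op_parallelism_threads())
print(tf.config.threading.get_inter_op_parallelism_threads())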

There can be many causes for this; I solved it the following way:
Set both
tf.config.threading.set_intra_op_parallelism_threads(<Your_Physical_Core_Count>)
tf.config.threading.set_inter_op_parallelism_threads(<Your_Physical_Core_Count>)
to your physical core count. You do not want Hyper-Threading for highly vectorized operations, as you cannot benefit from parallelized operations when there aren't any execution gaps.
"With a high level of vectorization, the number of execution gaps is
very small and there is possibly insufficient opportunity to make up
any penalty due to increased contention in HT."
From: Saini et al, published by NASAA dvanced Supercomputing Division, 2011: The Impact of Hyper-Threading on Processor
Resource Utilization in Production Applications
EDIT: I am not sure anymore whether one of the two has to be 1, but one of them definitely needs to be set to the physical core count.
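For reference, a minimal sketch of this approach (assuming the psutil package is available; psutil.cpu_count(logical=False) returns the physical core count):
import psutil
import tensorflow as tf

physical_cores = psutil.cpu_count(logical=False)  # e.g. 8 on an i9-9900K

# Must run before TensorFlow executes its first operation.
tf.config.threading.set_intra_op_parallelism_threads(physical_cores)
tf.config.threading.set_inter_op_parallelism_threads(physical_cores)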

Related

TensorFlow can't find the GPU

pip install tensorflow==1.14.0
pip install tensorflow-gpu==1.14
import tensorflow as tf
tf.test.is_gpu_available(
    cuda_only=False,
    min_cuda_compute_capability=None
)
False
# Imports for this snippet: GPU is the GPUtil package ('import GPUtil as GPU'); psutil and humanize are also required.
import os, platform
import psutil, humanize
import GPUtil as GPU

GPUs = GPU.getGPUs()
gpu = GPUs[0]

def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("# of CPU: {0}".format(psutil.cpu_count()))
    print("CPU type: {0}".format(platform.uname()))
    print("GPU Type: {0}".format(gpu.name))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(
        gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()
Gen RAM Free: 65.8 GB | Proc size: 294.4 MB
# of CPU: 8
CPU type: uname_result(system='Linux', node='lian-2', release='5.3.0-53-generic', version='#47~18.04.1-Ubuntu SMP Thu May 7 13:10:50 UTC 2020', machine='x86_64', processor='x86_64')
GPU Type: GeForce RTX 2080 Ti
GPU RAM Free: 10992MB | Used: 26MB | Util 0% | Total 11018MB
I reinstalled CUDA 10.
Mon Jun 8 15:48:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 24% 28C P8 2W / 260W | 26MiB / 11018MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1805 G /usr/lib/xorg/Xorg 9MiB |
| 0 2192 G /usr/bin/gnome-shell 14MiB |
+-----------------------------------------------------------------------------+
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 4
#define CUDNN_PATCHLEVEL 2
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
I'm trying to train this from a Jupyter notebook (Ubuntu 18, over SSH):
model = Sequential()
model.add(layers.Conv1D(16, 9, activation='relu', input_shape=(800, X_train.shape[-1])))
model.add(layers.MaxPooling1D(2))
model.add(layers.Bidirectional(layers.CuDNNGRU(8, return_sequences=True)))
model.add(layers.Bidirectional(layers.CuDNNGRU(8)))
model.add(layers.Dense(8, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer= 'Adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()
train_to_epoch = 150
start_epoch = 3
t1 = datetime.datetime.now()
print('Training start time = %s' % t1)
history = model.fit(X_train, y_train,
                    batch_size=128, epochs=train_to_epoch, verbose=0,
                    validation_data=(X_val, y_val))
print('\nTraining Duration = %s' % (datetime.datetime.now()-t1))
InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNN' used by {{node bidirectional_1/CudnnRNN}}with these attrs: [input_mode="linear_input", T=DT_FLOAT, direction="unidirectional", rnn_mode="gru", seed2=0, is_training=true, seed=87654321, dropout=0]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
[[bidirectional_1/CudnnRNN]]
Maybe another version of CUDA is needed?
Yes, it is the CUDA 10.1 version.
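A quick way to see what the installed TensorFlow build actually detects (a minimal sketch for TF 1.x; device_lib is an internal but commonly used module):
import tensorflow as tf
from tensorflow.python.client import device_lib

# True only if the installed wheel was built with CUDA support.
print(tf.test.is_built_with_cuda())

# Lists the devices TensorFlow can see; a working setup includes a /device:GPU:0 entry.
print(device_lib.list_local_devices())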

How to get the device memory or application size (memory taken in internal storage) using Appium

I have to get details such as CPU, memory (RAM), battery, startup time, and app size. I'm able to get everything except the app size using Appium. Can someone please help?
Thanks
It seems Appium is not capable of doing this for you. However, you can check the memory usage on Android using adb:
$ adb shell dumpsys meminfo com.android.systemui
Which returns something like this:
Applications Memory Usage (in Kilobytes):
Uptime: 433213978 Realtime: 631179899
** MEMINFO in pid 1541 [com.android.systemui] **
Pss Private Private SwapPss Heap Heap Heap
Total Dirty Clean Dirty Size Alloc Free
------ ------ ------ ------ ------ ------ ------
Native Heap 106880 106780 0 73228 196200 182430 13769
Dalvik Heap 35655 35636 0 217 40741 20371 20370
Dalvik Other 3347 3344 0 112
Stack 56 56 0 16
Ashmem 4 0 0 0
Other dev 18 0 16 0
.so mmap 2998 164 352 499
.jar mmap 669 0 248 0
.apk mmap 17904 0 13620 0
.ttf mmap 372 0 180 0
.dex mmap 10041 16 10020 20
.oat mmap 3848 0 84 0
.art mmap 1641 1364 0 123
Other mmap 64 12 24 0
EGL mtrack 106434 106434 0 0
GL mtrack 55564 55564 0 0
Unknown 1952 1948 0 1565
TOTAL 423227 311318 24544 75780 236941 202801 34139
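If you need to collect this from a script rather than by hand, here is a minimal sketch (Python, assuming adb is on the PATH and a device is connected) that runs the same command and extracts the TOTAL line; the package name is just an example:
import subprocess

package = "com.android.systemui"  # example package name

result = subprocess.run(
    ["adb", "shell", "dumpsys", "meminfo", package],
    capture_output=True, text=True, check=True,
)

# Print only the summary line of the meminfo report.
for line in result.stdout.splitlines():
    if line.strip().startswith("TOTAL"):
        print(line.strip())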
For iOS, my currently limited toolchain doesn't help much. You may have luck with Instruments, for which there is a command-line reference here, but I haven't tried it: https://help.apple.com/instruments/mac/current/#/devb14ffaa5

TensorFlow slower on GPU than on CPU

Using Keras with the TensorFlow backend, I am trying to train an LSTM network, and it is taking much longer to run on a GPU than on a CPU.
I am training the LSTM network using the fit_generator function. An epoch takes ~250 seconds on CPU and ~900 seconds on GPU. The packages in my GPU environment include
keras-applications 1.0.8 py_0 anaconda
keras-base 2.2.4 py36_0 anaconda
keras-gpu 2.2.4 0 anaconda
keras-preprocessing 1.1.0 py_1 anaconda
...
tensorflow 1.13.1 gpu_py36h3991807_0 anaconda
tensorflow-base 1.13.1 gpu_py36h8d69cac_0 anaconda
tensorflow-estimator 1.13.0 py_0 anaconda
tensorflow-gpu 1.13.1 pypi_0 pypi
My CUDA compilation tools are version 9.1.85, and my CUDA and driver versions are:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 On | 00000000:0A:00.0 Off | N/A |
| 0% 39C P8 5W / 225W | 7740MiB / 7952MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 2080 On | 00000000:42:00.0 Off | N/A |
| 0% 33C P8 19W / 225W | 142MiB / 7951MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 49251 C .../whsu014/.conda/envs/whsuphd/bin/python 7729MiB |
| 1 1354 G /usr/lib/xorg/Xorg 16MiB |
| 1 49251 C .../whsu014/.conda/envs/whsuphd/bin/python 113MiB |
+-----------------------------------------------------------------------------+
When I insert this line of code:
tf.Session(config=tf.ConfigProto(log_device_placement=True)):
I see the below in my terminal
...
ining_1/Adam/Const_10: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/Const_11: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720653: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/Const_11: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/add_15/y: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720666: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/add_15/y: (Const)/job:localhost/replica:0/task:0/device:GPU:0
...
So it seems that TensorFlow is using the GPU.
When I profile the code, these are the first 10 lines on GPU:
10852017 function calls (10524203 primitive calls) in 184.768 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
16200 173.827 0.011 173.827 0.011 {built-in method _pywrap_tensorflow_internal.TF_SessionRunCallable}
6 0.926 0.154 0.926 0.154 {built-in method _pywrap_tensorflow_internal.TF_SessionMakeCallable}
62 0.813 0.013 0.813 0.013 {built-in method _pywrap_tensorflow_internal.TF_SessionRun_wrapper}
156954 0.414 0.000 0.415 0.000 {built-in method numpy.array}
16200 0.379 0.000 1.042 0.000 training.py:643(_standardize_user_data)
24300 0.338 0.000 0.338 0.000 {method 'partition' of 'numpy.ndarray' objects}
68 0.301 0.004 0.301 0.004 {built-in method _pywrap_tensorflow_internal.ExtendSession}
32458 0.223 0.000 2.122 0.000 tensorflow_backend.py:156(get_session)
3206 0.212 0.000 0.238 0.000 tf_stack.py:31(extract_stack)
76024 0.210 0.000 0.702 0.000 ops.py:5246(get_controller)
...
and these are the first 10 lines on CPU:
22123473 function calls (21647174 primitive calls) in 60.173 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
16269 42.491 0.003 42.491 0.003 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_Run}
16269 0.568 0.000 48.964 0.003 session.py:1042(_run)
56 0.532 0.010 0.532 0.010 {built-in method time.sleep}
153641 0.458 0.000 0.460 0.000 {built-in method numpy.core.multiarray.array}
183148/125354 0.447 0.000 1.316 0.000 python_message.py:469(init)
1226659 0.362 0.000 0.364 0.000 {built-in method builtins.getattr}
2302110/2301986 0.339 0.000 0.358 0.000 {built-in method builtins.isinstance}
8 0.285 0.036 0.285 0.036 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_ExtendGraph}
12150 0.267 0.000 0.271 0.000 callbacks.py:211(on_batch_end)
147026/49078 0.264 0.000 1.429 0.000 python_message.py:1008(ByteSize)
...
This is my code.
def train_generator(x_list, y_list):
    # 0.1 validation split
    train_length = (len(x_list)//10)*9
    while True:
        for i in range(train_length):
            train_x = np.array([x_list[i]])
            train_y = np.array([y_list[i]])
            yield train_x, train_y

def val_generator(x_list, y_list):
    # 0.1 validation split
    val_length = len(x_list)//10
    while True:
        for i in range(-val_length, 0, 1):
            val_x = np.array([x_list[i]])
            val_y = np.array([y_list[i]])
            yield val_x, val_y

with tf.Session(config=tf.ConfigProto(log_device_placement=True)):
    model = Sequential()
    model.add(LSTM(64, return_sequences=False,
                   input_shape=(None, 24)))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')

    checkpointer = ModelCheckpoint(filepath="weights.hdf5",
                                   monitor='val_loss', verbose=1,
                                   save_best_only=True)

    history = model.fit_generator(generator=train_generator(train_x, train_y),
                                  steps_per_epoch=(len(train_x)//10)*9,
                                  epochs=5,
                                  validation_data=val_generator(train_x, train_y),
                                  validation_steps=len(train_x)//10,
                                  callbacks=[checkpointer],
                                  verbose=2, shuffle=False)

    # plot history
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='validation')
    pyplot.legend()
    pyplot.show()
I expect a significant speed-up when using the GPU for training. How can I fix this? Can someone help me understand what is causing the slowdown? Thank you.
A couple of observations:
Use CuDNNLSTM instead of LSTM to train on the GPU; you will see a considerable increase in speed.
Sometimes, for very small networks, the overhead of transferring data between CPU and GPU outweighs the parallel computation done on the GPU; in other words, more time is lost moving the data than is gained by training on the GPU.
GPUs should be used for highly intensive tasks and computations (very big LSTM/heavy CNN networks). For very small MLPs and even small LSTMs you might observe that the network trains equally fast on CPU and GPU, or that in some particular cases the CPU is even faster (very particular cases with super-small networks).
UPDATE FOR TENSORFLOW >= 2.0
The standard LSTM/GRU layers default to the cuDNN implementation when a compatible video card is detected, so there is no need to import CuDNNLSTM/CuDNNGRU explicitly.
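As a sketch of the first suggestion, the only change to the model in the question is swapping the layer for its cuDNN-backed variant (assuming Keras 2.2.x with the TensorFlow GPU backend, as in the listed environment):
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

model = Sequential()
# Same architecture as in the question, but the recurrence now runs
# entirely in cuDNN on the GPU.
model.add(CuDNNLSTM(64, return_sequences=False, input_shape=(None, 24)))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')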

My query runs faster the second time around, how do I stop that?

I am running a query in Oracle 10: select A from B where C = D
B has millions of records and there is no index on C.
The first time I run it, it takes about 30 seconds; the second time I run the query, it takes about 1 second.
Obviously it's caching something, and I want it to stop that: each time I run the query I want it to take 30 s, just like it did the first time.
I am over-simplifying the issue that I am having for the sake of making the question readable.
Thanks
Clearing the caches to measure performance is possible but very unwieldy.
A very good metric for tracking the progress of tuning efforts is the number of blocks read during query execution. One of the easiest ways to do this is using sqlplus with autotrace, like so:
set autotrace traceonly
<your query>
outputs
...
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
1 consistent gets
0 physical reads
0 redo size
363 bytes sent via SQL*Net to client
364 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
The number of blocks read, be it from cache or from disk, is reported as consistent gets.
Another way is to run the query with extended statistics, i.e. with the hint gather_plan_statistics, and then look at the query plan from the cursor cache:
set autotrace off
set serveroutput off
<your query with hint gather_plan_statistics>
select * from table(dbms_xplan.display_cursor(null,null,'typical allstats'));
The number of blocks read is shown in the Buffers column.
---------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | | 1 (100)| | 3 |00:00:00.01 | 3 |
| 1 | SORT AGGREGATE | | 3 | 1 | | | 3 |00:00:00.01 | 3 |
| 2 | INDEX FULL SCAN| ABCDEF | 3 | 176 | 1 (0)| 00:00:01 | 528 |00:00:00.01 | 3 |
---------------------------------------------------------------------------------------------------------------------
See this question...
It shows how to clear caches for data and execution plans, but also expands on whether it's a good idea or not.
The obvious answer is for each test case to run the query multiple times, and throw out the first result.
It's not easy to completely replicate the conditions of the first query run, because of the various caches involved: some are Oracle caches (cursor, buffer, etc.); some are OS-level (disk cache, depending on the Oracle config); some are hardware (SAN, RAID, disk).
Rebooting the database server before each trial will probably come pretty close to consistent conditions.
It might help you
http://www.mssqltips.com/tip.asp?tip=1360
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO

MySQL Performance inquiry

I have moved my Drupal site from one MySQL server to another one.
The old machine has 1 CPU and 1 GB RAM.
The new machine has 4 CPUs and 4 GB RAM.
I have a huge negative difference in performance on this query (2 mins vs 2 secs):
select distinct c.client
from client_table c
LEFT JOIN reps r on ( c.client = r.client )
where r.user_id is NULL
AND c.client not in ( select distinct client from billing where first_purchase = 1 )
NEW OLD
| connect_timeout 10 |connect_timeout 5
| have_federated_engine DISABLED | have_federated_engine YES
| max_connections 100 | max_connections 400
| max_seeks_for_key 18446744073709551615 | max_seeks_for_key 4294967295
| max_write_lock_count 18446744073709551615 | max_write_lock_count 4294967295
| myisam_max_sort_file_size 9223372036853727232 | myisam_max_sort_file_size 2147483647
| max_binlog_cache_size 18446744073709547520 | max_binlog_cache_size 4294967295
| myisam_recover_options BACKUP | myisam_recover_options OFF
| range_alloc_block_size 4096 | range_alloc_block_size 2048
| table_cache 457 | table_cache 307
| version 5.0.67-0ubuntu6-log | version 5.0.51a-3ubuntu5.4-log
| version_compile_machine x86_64 | version_compile_machine i486
ONLY on NEW | relay_log |
ONLY on NEW | relay_log_index |
ONLY on NEW | relay_log_info_file | relay-log.info
ONLY on NEW innodb_adaptive_hash_index | ON
Any ideas on how to identify what is causing the problem or how to fix it?
You might need to rebuild your indexes on the new instance.
Make triple-sure you've rebuilt your indices; they don't really carry over.
Try using the MySQL Query Profiler.
I would profile in both environments.
So how do you go about analyzing database performance? There are three forms of performance analysis that are used to troubleshoot and tune database systems:
Bottleneck analysis - focuses on answering the questions: What is my database server waiting on; what is a user connection waiting on; what is a piece of SQL code waiting on?
Workload analysis - examines the server and who is logged on to determine the resource usage and activity of each.
Ratio-based analysis - utilizes a number of rule-of-thumb ratios to gauge the performance of a database, user connection, or piece of code.