I encountered an OOM error while running TensorFlow code. I am confused by the statistics in the log, such as Limit, InUse, MaxInUse, NumAllocs, and MaxAllocSize. Can anyone explain what they mean? Thanks!
2021-12-01 21:08:29.733639: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 14 Chunks of size 221773824 totalling 2.89GiB
2021-12-01 21:08:29.733644: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 325730304 totalling 1.21GiB
2021-12-01 21:08:29.733649: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 325844992 totalling 310.75MiB
2021-12-01 21:08:29.733654: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 591396864 totalling 1.10GiB
2021-12-01 21:08:29.733658: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 9.26GiB
2021-12-01 21:08:29.733667: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit: 10255859712
InUse: 9942288128
MaxInUse: 9942288128
NumAllocs: 8892
MaxAllocSize: 591396864
2021-12-01 21:08:29.733905: W tensorflow/core/common_runtime/bfc_allocator.cc:271] *********************************************************_******************************************
2021-12-01 21:08:29.733938: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[4096,141,96] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
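As far as I understand the BFC allocator statistics, Limit is the maximum number of bytes the allocator may hand out, InUse is the number of bytes currently allocated, MaxInUse is the peak value of InUse, NumAllocs is the number of allocations made so far, and MaxAllocSize is the largest single allocation. A minimal sketch that checks these readings against the numbers in the log (the helper name tensor_bytes is my own, and 4 bytes per element assumes that "type float" means float32):

```python
# Values copied from the "Stats:" block in the log above.
stats = {
    "Limit": 10255859712,
    "InUse": 9942288128,
    "MaxInUse": 9942288128,
    "NumAllocs": 8892,
    "MaxAllocSize": 591396864,
}

GIB = 1024 ** 3
MIB = 1024 ** 2


def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor (assumption: float32, 4 bytes per element)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# The tensor named in the "OOM when allocating tensor" message.
requested = tensor_bytes([4096, 141, 96])

# Bytes still available under the allocator's limit.
headroom = stats["Limit"] - stats["InUse"]

print(f"InUse     : {stats['InUse'] / GIB:.2f} GiB of {stats['Limit'] / GIB:.2f} GiB")
print(f"Requested : {requested} bytes ({requested / MIB:.1f} MiB)")
print(f"Headroom  : {headroom} bytes ({headroom / MIB:.1f} MiB)")
```

The requested tensor needs 221773824 bytes, which is the same size as the 14 chunks reported in the first log line. The headroom under Limit is somewhat larger than that, so the failure may be caused by fragmentation (no single free block of that size), but the part of the log shown here does not list the free chunks, so that is only a guess.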
Related
I'm trying to reproduce the training of the Mask RCNN in the following repository: https://github.com/maxkferg/metal-defect-detection
The training code is the following:
# Training - Stage 1
print("Training network heads")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40,
            layers='heads')

# Training - Stage 2
# Fine-tune layers from ResNet stage 4 and up
print("Fine tune Resnet stage 4 and up")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=120,
            layers='4+')

# Training - Stage 3
# Fine-tune all layers
print("Fine tune all layers")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=160,
            layers='all')
Stage 1 runs smoothly, but training fails from Stage 2 onward with the following output:
2020-08-17 15:53:10.685456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 123 Chunks of size 2048 totalling 246.0KiB
2020-08-17 15:53:10.685456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2816 totalling 2.8KiB
2020-08-17 15:53:10.686456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 3072 totalling 18.0KiB
2020-08-17 15:53:10.686456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 387 Chunks of size 4096 totalling 1.51MiB
2020-08-17 15:53:10.687456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 6144 totalling 6.0KiB
2020-08-17 15:53:10.687456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 6656 totalling 6.5KiB
2020-08-17 15:53:10.688456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 60 Chunks of size 8192 totalling 480.0KiB
2020-08-17 15:53:10.688456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 9216 totalling 18.0KiB
2020-08-17 15:53:10.689456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 12 Chunks of size 12288 totalling 144.0KiB
2020-08-17 15:53:10.689456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 16384 totalling 32.0KiB
2020-08-17 15:53:10.690456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 21248 totalling 20.8KiB
2020-08-17 15:53:10.691456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 24064 totalling 23.5KiB
2020-08-17 15:53:10.691456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 5 Chunks of size 24576 totalling 120.0KiB
2020-08-17 15:53:10.692456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 37632 totalling 36.8KiB
2020-08-17 15:53:10.692456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 40960 totalling 40.0KiB
2020-08-17 15:53:10.693456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 4 Chunks of size 49152 totalling 192.0KiB
2020-08-17 15:53:10.693456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 65536 totalling 384.0KiB
2020-08-17 15:53:10.694456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 81920 totalling 80.0KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 90624 totalling 88.5KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 131072 totalling 128.0KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 3 Chunks of size 147456 totalling 432.0KiB
2020-08-17 15:53:10.696456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 12 Chunks of size 262144 totalling 3.00MiB
2020-08-17 15:53:10.696456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 327680 totalling 320.0KiB
2020-08-17 15:53:10.697457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 11 Chunks of size 524288 totalling 5.50MiB
2020-08-17 15:53:10.697457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 4 Chunks of size 589824 totalling 2.25MiB
2020-08-17 15:53:10.698457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 194 Chunks of size 1048576 totalling 194.00MiB
2020-08-17 15:53:10.699457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 17 Chunks of size 2097152 totalling 34.00MiB
2020-08-17 15:53:10.699457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2211840 totalling 2.11MiB
2020-08-17 15:53:10.700457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 146 Chunks of size 2359296 totalling 328.50MiB
2020-08-17 15:53:10.701457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2360320 totalling 2.25MiB
2020-08-17 15:53:10.701457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2621440 totalling 2.50MiB
2020-08-17 15:53:10.702457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2698496 totalling 2.57MiB
2020-08-17 15:53:10.702457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 3670016 totalling 3.50MiB
2020-08-17 15:53:10.703457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 31 Chunks of size 4194304 totalling 124.00MiB
2020-08-17 15:53:10.703457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 4718592 totalling 27.00MiB
2020-08-17 15:53:10.704457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 5 Chunks of size 8388608 totalling 40.00MiB
2020-08-17 15:53:10.705457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 25 Chunks of size 9437184 totalling 225.00MiB
2020-08-17 15:53:10.705457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 9438208 totalling 18.00MiB
2020-08-17 15:53:10.706457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 9441280 totalling 9.00MiB
2020-08-17 15:53:10.706457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 16138752 totalling 15.39MiB
2020-08-17 15:53:10.707457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 18874368 totalling 18.00MiB
2020-08-17 15:53:10.707457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 37748736 totalling 36.00MiB
2020-08-17 15:53:10.708457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 7 Chunks of size 51380224 totalling 343.00MiB
2020-08-17 15:53:10.708457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:684] Sum Total of in-use chunks: 1.41GiB
2020-08-17 15:53:10.709457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:686] Stats:
Limit: 1613615104
InUse: 1510723072
MaxInUse: 1510723072
NumAllocs: 3860
MaxAllocSize: 119947776
The training is running on a Quadro K420 with 2 GB of RAM. Is this only a problem of low RAM, or am I missing something?
Is there a way to train with my equipment?
The problem is the GPU memory of your video card.
In the first stage you were able to train smoothly because you trained only the "heads" of the network, which means a smaller number of trainable parameters.
In the second stage you started to run out of memory because you were training many more layers.
I suggest using a video card with at least 8 GB of VRAM for computer vision problems.
Out-of-memory problems can sometimes be solved by reducing the batch size, but in your case the only viable solution is to opt for a bigger/better video card.
It's most probably a RAM issue. You can try reducing your batch size to 1 or simplifying your network. If either of those methods works, get something with more RAM.
One way that sometimes fixes this is to put an upsampling layer in the model: lower the target size in the image generator and then add an upsampling layer. It is a good way to trick it. If that works, then you know that Colab couldn't handle it.
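To see which allocations dominate an allocator dump like the one in the question, the "Chunks of size" lines can be parsed and ranked by their total. This is only a sketch: the four lines below are copied from the dump above with the long path prefix shortened, and the names CHUNK_RE and chunk_totals are my own.

```python
import re

# A few lines copied from the allocator dump in the question (path prefix shortened).
log_lines = [
    "bfc_allocator.cc:680] 194 Chunks of size 1048576 totalling 194.00MiB",
    "bfc_allocator.cc:680] 146 Chunks of size 2359296 totalling 328.50MiB",
    "bfc_allocator.cc:680] 25 Chunks of size 9437184 totalling 225.00MiB",
    "bfc_allocator.cc:680] 7 Chunks of size 51380224 totalling 343.00MiB",
]

# Captures the chunk count and the size of one chunk in bytes.
CHUNK_RE = re.compile(r"(\d+) Chunks of size (\d+) totalling")


def chunk_totals(lines):
    """Return (chunk_size, count, total_bytes) tuples, largest total first."""
    rows = []
    for line in lines:
        match = CHUNK_RE.search(line)
        if match:
            count, size = int(match.group(1)), int(match.group(2))
            rows.append((size, count, count * size))
    return sorted(rows, key=lambda row: row[2], reverse=True)


for size, count, total in chunk_totals(log_lines):
    print(f"{count:4d} x {size:>9d} B = {total / 1024 ** 2:7.2f} MiB")
```

In these four lines, a handful of large chunks account for as much memory as hundreds of small ones, which is why lowering the batch size or the input resolution (both of which shrink the large activation tensors) is the usual first thing to try.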
My Tensorflow model makes heavy use of data preprocessing that should be done on the CPU to leave the GPU open for training.
top - 09:57:54 up 16:23, 1 user, load average: 3,67, 1,57, 0,67
Tasks: 400 total, 1 running, 399 sleeping, 0 stopped, 0 zombie
%Cpu(s): 19,1 us, 2,8 sy, 0,0 ni, 78,1 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
MiB Mem : 32049,7 total, 314,6 free, 5162,9 used, 26572,2 buff/cache
MiB Swap: 6779,0 total, 6556,0 free, 223,0 used. 25716,1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17604 joro 20 0 22,1g 2,3g 704896 S 331,2 7,2 4:39.33 python
This is what top shows me. I would like to make this python process use at least 90% of available CPU across all cores. How can this be achieved?
GPU utilization is better, around 90%, although I don't know why it is not at 100%.
Mon Aug 10 10:00:13 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 On | N/A |
| 35% 41C P2 90W / 260W | 10515MiB / 11016MiB | 11% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1128 G /usr/lib/xorg/Xorg 102MiB |
| 0 1648 G /usr/lib/xorg/Xorg 380MiB |
| 0 1848 G /usr/bin/gnome-shell 279MiB |
| 0 10633 G ...uest-channel-token=1206236727 266MiB |
| 0 13794 G /usr/lib/firefox/firefox 6MiB |
| 0 17604 C python 9457MiB |
+-----------------------------------------------------------------------------+
All I found was a solution for TensorFlow 1.0:
sess = tf.Session(config=tf.ConfigProto(
intra_op_parallelism_threads=NUM_THREADS))
I have an Intel 9900k and a RTX 2080 Ti and use Ubuntu 20.04
Edit: When I add the following code at the top, it uses one core at 100%:
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)
But increasing this number to 16 again utilizes all cores at only ~30%.
Just setting set_intra_op_parallelism_threads and set_inter_op_parallelism_threads wasn't working for me. In case someone else is in the same situation: after a lot of struggle with the same issue, the piece of code below worked for me to keep TensorFlow's CPU usage below 500%:
import os
import tensorflow as tf
num_threads = 5
os.environ["OMP_NUM_THREADS"] = "5"
os.environ["TF_NUM_INTRAOP_THREADS"] = "5"
os.environ["TF_NUM_INTEROP_THREADS"] = "5"
tf.config.threading.set_inter_op_parallelism_threads(
num_threads
)
tf.config.threading.set_intra_op_parallelism_threads(
num_threads
)
tf.config.set_soft_device_placement(True)
There can be many causes for this. I solved it for myself in the following way:
Set
tf.config.threading.set_intra_op_parallelism_threads(<Your_Physical_Core_Count>)
tf.config.threading.set_inter_op_parallelism_threads(<Your_Physical_Core_Count>)
both to your physical core count. You do not want Hyper-Threading for highly vectorized operations, because you cannot benefit from parallelized operations when there aren't any execution gaps.
"With a high level of vectorization, the number of execution gaps is
very small and there is possibly insufficient opportunity to make up
any penalty due to increased contention in HT."
From: Saini et al., published by the NASA Advanced Supercomputing Division, 2011: The Impact of Hyper-Threading on Processor Resource Utilization in Production Applications
EDIT: I am no longer sure whether one of the two has to be 1, but one of them definitely needs to be set to the physical core count.
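A minimal sketch of the suggestion above. The standard library only reports logical CPUs, so the physical core count is estimated under the assumption of two hardware threads per core (true for a Hyper-Threading CPU such as the 9900K in the question, but not in general). The function names are my own, and the TensorFlow calls are skipped when TensorFlow is not installed or has already been initialized.

```python
import os


def guess_physical_cores(threads_per_core=2):
    """Estimate physical cores from the logical CPU count.

    Assumption: every core exposes `threads_per_core` hardware threads.
    """
    logical = os.cpu_count() or 1
    return max(1, logical // threads_per_core)


def configure_tensorflow_threads(num_threads):
    """Apply the thread settings; return True only if they were applied."""
    try:
        import tensorflow as tf
    except ImportError:
        return False
    try:
        tf.config.threading.set_intra_op_parallelism_threads(num_threads)
        tf.config.threading.set_inter_op_parallelism_threads(num_threads)
    except (AttributeError, RuntimeError):
        # Older TensorFlow versions lack this API, and the settings cannot be
        # changed once the TensorFlow runtime has been initialized.
        return False
    return True


physical = guess_physical_cores()
applied = configure_tensorflow_threads(physical)
print(f"estimated physical cores: {physical}, settings applied: {applied}")
```

The call has to happen before any TensorFlow op runs, which is why it belongs at the very top of the training script.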
Using Keras with Tensorflow backend, I am trying to train an LSTM network and it is taking much longer to run it on a GPU than a CPU.
I am training an LSTM network using the fit_generator function. It takes CPU ~250 seconds per epoch while it takes GPU ~900 seconds per epoch. The packages in my GPU environment include
keras-applications 1.0.8 py_0 anaconda
keras-base 2.2.4 py36_0 anaconda
keras-gpu 2.2.4 0 anaconda
keras-preprocessing 1.1.0 py_1 anaconda
...
tensorflow 1.13.1 gpu_py36h3991807_0 anaconda
tensorflow-base 1.13.1 gpu_py36h8d69cac_0 anaconda
tensorflow-estimator 1.13.0 py_0 anaconda
tensorflow-gpu 1.13.1 pypi_0 pypi
My CUDA compilation tools are version 9.1.85, and my CUDA and driver versions are:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 On | 00000000:0A:00.0 Off | N/A |
| 0% 39C P8 5W / 225W | 7740MiB / 7952MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 2080 On | 00000000:42:00.0 Off | N/A |
| 0% 33C P8 19W / 225W | 142MiB / 7951MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 49251 C .../whsu014/.conda/envs/whsuphd/bin/python 7729MiB |
| 1 1354 G /usr/lib/xorg/Xorg 16MiB |
| 1 49251 C .../whsu014/.conda/envs/whsuphd/bin/python 113MiB |
+-----------------------------------------------------------------------------+
When I insert this line of code
with tf.Session(config=tf.ConfigProto(log_device_placement=True)):
I see the below in my terminal
...
ining_1/Adam/Const_10: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/Const_11: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720653: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/Const_11: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/add_15/y: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720666: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/add_15/y: (Const)/job:localhost/replica:0/task:0/device:GPU:0
...
So it seems that Tensorflow is using GPU.
When I profile the code,
on GPU this is the first 10 lines
10852017 function calls (10524203 primitive calls) in 184.768 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
16200 173.827 0.011 173.827 0.011 {built-in method _pywrap_tensorflow_internal.TF_SessionRunCallable}
6 0.926 0.154 0.926 0.154 {built-in method _pywrap_tensorflow_internal.TF_SessionMakeCallable}
62 0.813 0.013 0.813 0.013 {built-in method _pywrap_tensorflow_internal.TF_SessionRun_wrapper}
156954 0.414 0.000 0.415 0.000 {built-in method numpy.array}
16200 0.379 0.000 1.042 0.000 training.py:643(_standardize_user_data)
24300 0.338 0.000 0.338 0.000 {method 'partition' of 'numpy.ndarray' objects}
68 0.301 0.004 0.301 0.004 {built-in method _pywrap_tensorflow_internal.ExtendSession}
32458 0.223 0.000 2.122 0.000 tensorflow_backend.py:156(get_session)
3206 0.212 0.000 0.238 0.000 tf_stack.py:31(extract_stack)
76024 0.210 0.000 0.702 0.000 ops.py:5246(get_controller)
...
on CPU this is the first 10 lines
22123473 function calls (21647174 primitive calls) in 60.173 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
16269 42.491 0.003 42.491 0.003 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_Run}
16269 0.568 0.000 48.964 0.003 session.py:1042(_run)
56 0.532 0.010 0.532 0.010 {built-in method time.sleep}
153641 0.458 0.000 0.460 0.000 {built-in method numpy.core.multiarray.array}
183148/125354 0.447 0.000 1.316 0.000 python_message.py:469(init)
1226659 0.362 0.000 0.364 0.000 {built-in method builtins.getattr}
2302110/2301986 0.339 0.000 0.358 0.000 {built-in method builtins.isinstance}
8 0.285 0.036 0.285 0.036 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_ExtendGraph}
12150 0.267 0.000 0.271 0.000 callbacks.py:211(on_batch_end)
147026/49078 0.264 0.000 1.429 0.000 python_message.py:1008(ByteSize)
...
This is my code.
import numpy as np
import tensorflow as tf
from keras.callbacks import ModelCheckpoint
from keras.layers import LSTM, Dense
from keras.models import Sequential
from matplotlib import pyplot


def train_generator(x_list, y_list):
    # 0.1 validation split: the first 90% of the samples are used for training
    train_length = (len(x_list)//10)*9
    while True:
        for i in range(train_length):
            # Each step yields a single sample, i.e. a batch of size 1
            train_x = np.array([x_list[i]])
            train_y = np.array([y_list[i]])
            yield train_x, train_y


def val_generator(x_list, y_list):
    # 0.1 validation split: the last 10% of the samples are used for validation
    val_length = len(x_list)//10
    while True:
        for i in range(-val_length, 0, 1):
            val_x = np.array([x_list[i]])
            val_y = np.array([y_list[i]])
            yield val_x, val_y


with tf.Session(config = tf.ConfigProto(log_device_placement = True)):
    model = Sequential()
    model.add(LSTM(64, return_sequences=False,
                   input_shape=(None, 24)))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')

    checkpointer = ModelCheckpoint(filepath="weights.hdf5",
                                   monitor='val_loss', verbose=1,
                                   save_best_only=True)

    history = model.fit_generator(generator=train_generator(train_x,
                                                            train_y),
                                  steps_per_epoch=(len(train_x)//10)*9,
                                  epochs=5,
                                  validation_data=val_generator(train_x,
                                                                train_y),
                                  validation_steps=len(train_x)//10,
                                  callbacks=[checkpointer],
                                  verbose=2, shuffle=False)

# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.legend()
pyplot.show()
I expect a significant speed up when using GPU for training. How can I fix this? Can someone help me to understand what is causing the slowdown? Thank you.
A couple of observations:
Use CuDNNLSTM instead of LSTM to train on GPU; you will see a considerable increase in speed.
Sometimes, for very small networks, the overhead of transferring data between CPU and GPU outweighs the benefit of the parallel computation on the GPU; in other words, more time is lost transferring the data than is gained by training on the GPU.
GPUs should be used for highly intensive tasks and computations (very big LSTMs or heavy CNNs). For very small MLPs and even small LSTMs, you might observe that the network trains equally fast on CPU and GPU, or that in some particular cases (super small networks) the CPU is even faster.
UPDATE FOR TENSORFLOW >= 2.0
The standard LSTM and GRU layers default to the cuDNN implementation if a GPU is detected (provided the layer keeps cuDNN-compatible arguments such as the default activations); therefore it is not necessary to import CuDNNLSTM/CuDNNGRU explicitly.
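The two profiles in the question are consistent with the overhead argument above. Both are dominated by a single session-run entry that is called roughly once per training step, and each step processes a single sample because the generators yield batches of size 1. A small worked calculation using only the numbers from those profiles:

```python
# (total seconds, number of calls) for the dominant entry of each profile.
gpu_seconds, gpu_calls = 173.827, 16200  # TF_SessionRunCallable, GPU run
cpu_seconds, cpu_calls = 42.491, 16269   # TF_Run, CPU run

gpu_ms_per_call = gpu_seconds / gpu_calls * 1000
cpu_ms_per_call = cpu_seconds / cpu_calls * 1000
slowdown = gpu_ms_per_call / cpu_ms_per_call

print(f"GPU: {gpu_ms_per_call:.2f} ms per step")
print(f"CPU: {cpu_ms_per_call:.2f} ms per step")
print(f"GPU step is about {slowdown:.1f}x slower")
```

With one short sequence per step, a fixed per-step cost on the GPU path is never amortized over a larger batch, so larger batches (which would need padding here, since the sequence length is variable) are worth trying in addition to the cuDNN layer.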
I would like to perform two different calculations across consecutive columns in a pandas or pyspark dataframe.
Columns are weeks and the metrics are displayed as rows.
I want to calculate the actual and percentage differences across the columns.
The input/output tables incl. the calculations used in Excel are displayed in the following image.
I want to replicate these calculations on a pandas or pyspark dataframe.
Raw Data Attached:
Metrics           Week20  Week21  Week22  Week23  Week24  Week25  Week26  Week27
Sales              20301   21132   20059   23062   19610   22734   22140   20699
TRXs                 739     729     690     779     701     736     762     655
Attachment Rate     4.47    4.44    4.28    4.56    4.41    4.58    4.55    4.96
AOV                27.47   28.99   29.07    29.6   27.97   30.89   29.06    31.6
Profit              5177    5389    5115    5881    5001    5797    5646    5278
Profit per TRX      7.01    7.39    7.41    7.55    7.13    7.88    7.41    8.06
In pandas you could use the pct_change(axis=1) and diff(axis=1) methods:
df = df.set_index('Metrics')
# list of metrics with "actual diff"
actual = ['AOV', 'Attachment Rate']
rep = (df[~df.index.isin(actual)].pct_change(axis=1).round(2)*100).fillna(0).astype(str).add('%')
rep = pd.concat([rep,
df[df.index.isin(actual)].diff(axis=1).fillna(0)
])
In [131]: rep
Out[131]:
                  Week20  Week21  Week22  Week23  Week24  Week25  Week26  Week27
Metrics
Sales               0.0%    4.0%   -5.0%   15.0%  -15.0%   16.0%   -3.0%   -7.0%
TRXs                0.0%   -1.0%   -5.0%   13.0%  -10.0%    5.0%    4.0%  -14.0%
Profit              0.0%    4.0%   -5.0%   15.0%  -15.0%   16.0%   -3.0%   -7.0%
Profit per TRX      0.0%    5.0%    0.0%    2.0%   -6.0%   11.0%   -6.0%    9.0%
Attachment Rate        0   -0.03   -0.16    0.28   -0.15    0.17   -0.03    0.41
AOV                    0    1.52    0.08    0.53   -1.63    2.92   -1.83    2.54
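For reference, a self-contained sketch of the same two calculations on a subset of the question's data (two metrics, four weeks). Unlike the answer above, it keeps the results numeric instead of converting them to strings with a "%" suffix, and it leaves the first week as NaN rather than filling it with 0; both choices are mine, not part of the original answer.

```python
import pandas as pd

# Subset of the raw data from the question: rows are metrics, columns are weeks.
df = pd.DataFrame(
    {
        "Week20": [20301, 27.47],
        "Week21": [21132, 28.99],
        "Week22": [20059, 29.07],
        "Week23": [23062, 29.60],
    },
    index=pd.Index(["Sales", "AOV"], name="Metrics"),
)

# Actual difference between each week and the previous one (column-wise).
actual_diff = df.diff(axis=1)

# Percentage change between each week and the previous one, in percent.
pct_diff = df.pct_change(axis=1).mul(100).round(1)

print(actual_diff.round(2))
print(pct_diff)
```

Because both results share the index and columns of df, the rows for each metric can be selected with .loc and combined with pd.concat, as in the answer above.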
Recently I got a new batch of dumps to identify the high memory usage in three of our WCF services, which are hosted in a 64-bit AppPool on Windows Server 2012.
Application one :
ProcessUp Time : 22 days
GC Heap usage : 2.69 Gb
Loaded Modules : 220 Mb
Committed Memory : 3.08 Gb
Native memory : 2 Gb
The issue was identified as large GC heap usage due to unclosed WCF client proxy objects, which account for almost 2.26 Gb; the rest of the GC heap is cache.
Application Two :
ProcessUp Time : 9 Hours
GC Heap usage : 4.43 Gb
Cache size : 2.45 Gb
Loaded Modules : 224 Mb
Committed Memory : 5.13 Gb
Native memory heap : 2 Gb
The issue was identified as a large cache: most of the objects are of type System.Web.Caching.CacheEntry. 2.2 Gb of String objects on the GC heap have roots in CacheRef root objects.
Application Three :
Cache size : 950 Mb
GC heap : 1.2 Gb
Native Heap : 2 Gb
We recently upgraded to Windows Server 2012, and I have old dumps as well. Those dumps do not show this native heap usage for the same application; it was only around 90 Mb.
I also used WinDbg to explore the native heap with the !heap -s command,
which shows very small native heap sizes, as shown below.
I am confused about why DebugDiag 2.0 shows 2 Gb of native heap in every WCF service. My understanding is that !heap -s should dump the same native heaps and should match the graphs in the DebugDiag report. The report also shows values in thousands of TBytes.
0:000> !heap -s
LFH Key : 0x53144a890e31e98b
Termination on corruption : ENABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-------------------------------------------------------------------------------------
000000fc42c10000 00000002 32656 31260 32552 2885 497 6 2 f LFH
000000fc42a40000 00008000 64 4 64 2 1 1 0 0
000000fc42bf0000 00001002 3228 1612 3124 43 9 3 0 0 LFH
000000fc43400000 00001002 1184 76 1080 1 5 2 0 0 LFH
000000fc43390000 00001002 1184 148 1080 41 7 2 0 0 LFH
000000fc43d80000 00001002 60 8 60 5 1 1 0 0
000000fc433f0000 00001002 60 8 60 5 1 1 0 0
000000fc442a0000 00001002 1184 196 1080 1 6 2 0 0 LFH
000000fc44470000 00041002 60 8 60 5 1 1 0 0
000001008e9f0000 00041002 164 40 60 3 1 1 0 0 LFH
000001008f450000 00001002 3124 1076 3124 1073 3 3 0 0
External fragmentation 99 % (3 free blocks)
-------------------------------------------------------------------------------------
Can anybody explain why the output of the WinDbg command !heap -s and the DebugDiag report differ? Or do I misunderstand the command above?
I also used a Pykd script to dump native object stats, which does not show a large number of objects.
Also, what is meant by External fragmentation 99 % (3 free blocks)
in the output above? I understand that fragmented memory has fewer large blocks of contiguous memory, but I fail to relate that to the percentage.
Edit 1 :
Application 2 :
0:000> !address -summary
Mapping file section regions...
Mapping module regions...
Mapping PEB regions...
Mapping TEB and stack regions...
Mapping heap regions...
Mapping page heap regions...
Mapping other regions...
Mapping stack trace database regions...
Mapping activation context regions...
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free 363 7ffb`04a14000 ( 127.981 Tb) 99.98%
<unknown> 952 4`e8c0c000 ( 19.637 Gb) 98.54% 0.01%
Image 2122 0`0e08d000 ( 224.551 Mb) 1.10% 0.00%
Heap 88 0`03372000 ( 51.445 Mb) 0.25% 0.00%
Stack 124 0`013c0000 ( 19.750 Mb) 0.10% 0.00%
Other 7 0`001be000 ( 1.742 Mb) 0.01% 0.00%
TEB 41 0`00052000 ( 328.000 kb) 0.00% 0.00%
PEB 1 0`00001000 ( 4.000 kb) 0.00% 0.00%
--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_PRIVATE 643 4`ebf44000 ( 19.687 Gb) 98.79% 0.02%
MEM_IMAGE 2655 0`0eb96000 ( 235.586 Mb) 1.15% 0.00%
MEM_MAPPED 37 0`00b02000 ( 11.008 Mb) 0.05% 0.00%
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE 363 7ffb`04a14000 ( 127.981 Tb) 99.98%
MEM_RESERVE 725 3`b300d000 ( 14.797 Gb) 74.25% 0.01%
MEM_COMMIT 2610 1`485cf000 ( 5.131 Gb) 25.75% 0.00%
--- Protect Summary (for commit) - RgnCount ----------- Total Size -------- %ofBusy %ofTotal
PAGE_READWRITE 868 1`3939d000 ( 4.894 Gb) 24.56% 0.00%
PAGE_EXECUTE_READ 157 0`09f10000 ( 159.063 Mb) 0.78% 0.00%
PAGE_READONLY 890 0`035ed000 ( 53.926 Mb) 0.26% 0.00%
PAGE_WRITECOPY 433 0`0149c000 ( 20.609 Mb) 0.10% 0.00%
PAGE_EXECUTE_READWRITE 148 0`0065d000 ( 6.363 Mb) 0.03% 0.00%
PAGE_EXECUTE_WRITECOPY 67 0`0017c000 ( 1.484 Mb) 0.01% 0.00%
PAGE_READWRITE|PAGE_GUARD 41 0`000b9000 ( 740.000 kb) 0.00% 0.00%
PAGE_NOACCESS 4 0`00004000 ( 16.000 kb) 0.00% 0.00%
PAGE_EXECUTE 2 0`00003000 ( 12.000 kb) 0.00% 0.00%
--- Largest Region by Usage ----------- Base Address -------- Region Size ----------
Free 101`070a0000 7ef6`e77b2000 ( 126.964 Tb)
<unknown> fd`72f14000 0`d156c000 ( 3.271 Gb)
Image 7ff9`91344000 0`012e8000 ( 18.906 Mb)
Heap 100`928a0000 0`00544000 ( 5.266 Mb)
Stack fc`43240000 0`0007b000 ( 492.000 kb)
Other fc`42ea0000 0`00181000 ( 1.504 Mb)
TEB 7ff7`ee852000 0`00002000 ( 8.000 kb)
PEB 7ff7`eeaaf000 0`00001000 ( 4.000 kb)
Application Three :
0:000> !address -summary
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free 323 7ffb`9f8ea000 ( 127.983 Tb) 99.99%
<unknown> 832 4`4bbb6000 ( 17.183 Gb) 98.15% 0.01%
Image 2057 0`0e5ab000 ( 229.668 Mb) 1.28% 0.00%
Heap 196 0`04f52000 ( 79.320 Mb) 0.44% 0.00%
Stack 127 0`01440000 ( 20.250 Mb) 0.11% 0.00%
Other 7 0`001be000 ( 1.742 Mb) 0.01% 0.00%
TEB 42 0`00054000 ( 336.000 kb) 0.00% 0.00%
PEB 1 0`00001000 ( 4.000 kb) 0.00% 0.00%
--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_PRIVATE 783 4`51099000 ( 17.266 Gb) 98.63% 0.01%
MEM_IMAGE 2444 0`0ec06000 ( 236.023 Mb) 1.32% 0.00%
MEM_MAPPED 35 0`00a67000 ( 10.402 Mb) 0.06% 0.00%
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE 323 7ffb`9f8ea000 ( 127.983 Tb) 99.99%
MEM_RESERVE 621 3`e3504000 ( 15.552 Gb) 88.83% 0.01%
MEM_COMMIT 2641 0`7d202000 ( 1.955 Gb) 11.17% 0.00%
--- Protect Summary (for commit) - RgnCount ----------- Total Size -------- %ofBusy %ofTotal
PAGE_READWRITE 919 0`6dc07000 ( 1.715 Gb) 9.80% 0.00%
PAGE_EXECUTE_READ 153 0`0a545000 ( 165.270 Mb) 0.92% 0.00%
PAGE_READONLY 734 0`02cf5000 ( 44.957 Mb) 0.25% 0.00%
PAGE_WRITECOPY 470 0`01767000 ( 23.402 Mb) 0.13% 0.00%
PAGE_EXECUTE_READWRITE 240 0`009cf000 ( 9.809 Mb) 0.05% 0.00%
PAGE_EXECUTE_WRITECOPY 76 0`001c5000 ( 1.770 Mb) 0.01% 0.00%
PAGE_READWRITE|PAGE_GUARD 42 0`000be000 ( 760.000 kb) 0.00% 0.00%
PAGE_NOACCESS 5 0`00005000 ( 20.000 kb) 0.00% 0.00%
PAGE_EXECUTE 2 0`00003000 ( 12.000 kb) 0.00% 0.00%
--- Largest Region by Usage ----------- Base Address -------- Region Size ----------
Free 52`892e0000 7fa5`65548000 ( 127.646 Tb)
<unknown> 4f`4ec81000 0`e9c3f000 ( 3.653 Gb)
Image 7ff9`91344000 0`012e8000 ( 18.906 Mb)
Heap 52`8833b000 0`00fa4000 ( 15.641 Mb)
Stack 4e`37a70000 0`0007b000 ( 492.000 kb)
Other 4e`37720000 0`00181000 ( 1.504 Mb)
TEB 7ff7`ee828000 0`00002000 ( 8.000 kb)
PEB 7ff7`eea43000 0`00001000 ( 4.000 kb)
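When comparing these numbers with other tools, it can help to convert the Total Size column to plain byte counts. WinDbg prints 64-bit values in hexadecimal with a backtick between the high and low 32 bits. A small sketch (the helper name parse_windbg_size is my own) using the State Summary values for Application 2 from the output above:

```python
GIB = 1024 ** 3


def parse_windbg_size(text):
    """Convert a WinDbg 64-bit hex value such as "1`485cf000" to a byte count."""
    return int(text.replace("`", ""), 16)


# State Summary values for Application 2, copied from the output above.
state_summary = {
    "MEM_RESERVE": "3`b300d000",
    "MEM_COMMIT": "1`485cf000",
}

for name, raw in state_summary.items():
    size = parse_windbg_size(raw)
    print(f"{name:12s} {size:>14d} bytes = {size / GIB:.3f} GiB")
```

The committed size for Application 2 comes out at about 5.13 Gb, which matches the committed memory figure quoted for that application, while the Heap row in the same summary is only about 51 Mb. This does not by itself explain the 2 Gb figure in the DebugDiag report, but it shows which rows of the summary the other numbers can be checked against.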