I am running Python 3.8.5, TensorFlow and TensorBoard 2.3.0, and cudnn 10.1.
In the example here: https://www.tensorflow.org/tensorboard/get_started
model.fit executes only 1 epoch, and then the kernel restarts.
The log file is created for that 1 epoch.
Why is the kernel restarting after only 1 epoch?
I've done a lot of work with TF1 and recently upgraded to TF2, but I'm running into issues running TF2 on a GPU: the network isn't converging, even though the same code converges when running on a CPU.
Following the CNN tutorials on https://www.tensorflow.org/tutorials I have noticed that the models are failing to learn during training. Any ideas on what is causing this?
Another post suggested that this may be caused by floating point errors, but I have a hard time believing things are that unstable, especially across multiple tutorials. I had this problem occur on the following tutorials: Convolutional Neural Network (CNN), Transfer learning and fine tuning, and Transfer learning with TF hub.
I am running:
Tensorflow version 2.3.0
Cuda compilation tools release 11.2 V11.2.125
On an NVIDIA GeForce RTX 3090 (GPU runs) or an Intel i7-10700K CPU (CPU runs)
I had some trouble installing things initially but the method described in this answer ended up working -- could that be the root issue?
To demonstrate, I copy/pasted the code from the CNN tutorial into the following script:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # minimize logs
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
RUN_ON_CPU = False
if RUN_ON_CPU:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # hide the GPU from TensorFlow so training runs on the CPU
print('gpu available', tf.config.list_physical_devices('GPU'))
# load dataset
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0
# build model backbone
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
# add dense layers on top
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.summary()
# compile and train
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))
plt.figure()
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
# plt.ylim([0.5, 1])
plt.legend(loc='lower right')
if RUN_ON_CPU:
    plt.title('Training on CPU')
else:
    plt.title('Training on GPU')
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('test loss and accuracy', test_loss, test_acc)
Which plots the following training curves depending on the RUN_ON_CPU flag:
GPU test loss and accuracy 2.302645444869995 0.10000000149011612
CPU test loss and accuracy 0.879743754863739 0.7060999870300293
The tutorial claims that the CNN should achieve a test accuracy of ~70% which the GPU doesn't come close to. To be sure I logged tf.config.list_physical_devices('GPU') and the GPU took 2-3s per epoch whereas CPU took 11-14s. Using os.environ['CUDA_VISIBLE_DEVICES'] = '-1' to turn off the GPU was the only code change between the runs.
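One observation about the GPU numbers above, offered as a sanity check rather than a diagnosis: a loss of about 2.3026 is ln(10), which is what sparse categorical cross-entropy gives when the model assigns equal probability to each of the 10 CIFAR-10 classes, and 10% accuracy is chance level for 10 balanced classes. In other words, the GPU run looks like it learned nothing at all. The sketch below reproduces that arithmetic in plain Python; the helper name uniform_cross_entropy is made up for illustration.

```python
import math

def uniform_cross_entropy(num_classes):
    """Cross-entropy (in nats) of a uniform prediction over num_classes classes."""
    return -math.log(1.0 / num_classes)

chance_loss = uniform_cross_entropy(10)  # CIFAR-10 has 10 classes
chance_accuracy = 1.0 / 10               # always guessing one class on balanced data

# Compare with the GPU result reported above: 2.302645444869995 and 0.10000000149011612
print(chance_loss, chance_accuracy)
```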
Ok I got it working, thanks to #phe who suggested that my installation was faulty in a comment.
Here's what I did:
1. Uninstall CUDA
   - In Control Panel, I uninstalled everything related to CUDA 11.2.
   - I deleted the folder at C:\Program Files\NVIDIA GPU Computing Toolkit because the uninstallers didn't remove it.
2. Re-install CUDA and cuDNN (following this video)
   - Using this table, I determined which versions of CUDA and cuDNN to install: https://www.tensorflow.org/install/source#gpu
   - Download the appropriate CUDA and cuDNN versions.
   - Install CUDA.
   - Extract the downloaded cuDNN zip and copy its bin, include, and lib subfolders to where CUDA was installed (for me this was C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2). This should put the cuDNN files into CUDA's folders of the same names.
   - Add the CUDA libnvvp and bin folders to the PATH environment variable (for me C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin and C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\libnvvp).
3. Create a Python environment with TensorFlow
   - Install Anaconda.
   - Create an environment with an appropriate Python version (see the table from the first step of re-installing CUDA): conda create -n tf25 python=3.8
   - Activate the environment: conda activate tf25
   - Install TensorFlow using pip (don't use anaconda -- I think this is where my system got messed up): pip install tensorflow (specify a version if you don't want the most recent one)
   - Run your code using that environment.
4. Invent AGI (or don't if you want to prevent the apocalypse :)
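After a reinstall like this, a quick check that TensorFlow actually sees the GPU can save a wasted training run. The following is a minimal sketch and not part of the original fix: the function name gpu_report is made up, and it relies only on tf.__version__, tf.test.is_built_with_cuda() and tf.config.list_physical_devices('GPU'). It returns early if TensorFlow cannot be found in the active environment.

```python
import importlib.util

def gpu_report():
    """Summarize what TensorFlow in the active environment can see (hypothetical helper)."""
    if importlib.util.find_spec("tensorflow") is None:
        return {"tensorflow_installed": False}
    import tensorflow as tf
    return {
        "tensorflow_installed": True,
        "version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }

report = gpu_report()
print(report)
```

If the gpus list comes back empty inside the new environment, that would suggest the CUDA and cuDNN folders are either not on PATH or do not match the versions in the table. Note that a non-empty list alone does not prove training is correct, since in the question above the GPU was detected and still failed to learn.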
I am trying to run tensorflow with CPU support.
tensorflow:
Version: 1.14.0
Keras:
Version: 2.3.1
When I try to run the following piece of code :
from keras.preprocessing.image import ImageDataGenerator

def run_test_harness(trainX, trainY, testX, testY):
    # rescale pixel values from [0, 255] to [0, 1]
    datagen = ImageDataGenerator(rescale=1.0/255.0)
    train_it = datagen.flow(trainX, trainY, batch_size=1)
    test_it = datagen.flow(testX, testY, batch_size=1)
    # define_model() is defined elsewhere (not shown)
    model = define_model()
    history = model.fit_generator(train_it, steps_per_epoch=len(train_it),
        validation_data=test_it, validation_steps=len(test_it), epochs=1, verbose=0)
I get the following error as shown in image:
Image shows the error
I tried to configure bazel for the same but it was of no use. It would be helpful if someone could direct me to resources or help with the problem. Thank you
EDIT : (Warning messages)
WARNING:tensorflow:From /home/neha/valiance/kerascpu/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:4070: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
WARNING:tensorflow:From /home/neha/valiance/kerascpu/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2020-10-22 12:41:36.023849: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-10-22 12:41:36.326420: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2299965000 Hz
2020-10-22 12:41:36.327496: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5502350 executing computations on platform Host. Devices:
2020-10-22 12:41:36.327602: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2020-10-22 12:41:36.679930: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2020-10-22 12:41:36.890241: W tensorflow/core/framework/allocator.cc:107] Allocation of 3406823424 exceeds 10% of system memory.
^Z
[1]+ Stopped python3 model.py
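You should try running your code on Google Colab. I think your PC doesn't have enough resources for the task you are trying to run, even though you are using a batch_size of 1. The allocator warning in your log (an allocation of 3406823424 bytes exceeding 10% of system memory) points the same way.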
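To put the "Allocation of 3406823424 exceeds 10% of system memory" warning from the log in perspective, the sketch below converts that byte count to GiB and to a number of float32 values. It is plain arithmetic with variable names of my own choosing, and it does not identify which tensor TensorFlow was trying to allocate.

```python
ALLOCATION_BYTES = 3406823424  # taken from the allocator warning in the log
BYTES_PER_FLOAT32 = 4

allocation_gib = ALLOCATION_BYTES / 2**30
float32_values = ALLOCATION_BYTES // BYTES_PER_FLOAT32

print(f"{allocation_gib:.2f} GiB, {float32_values:,} float32 values")
```

A single allocation of more than 3 GiB does not shrink with the batch size, which is consistent with the suggestion that the machine is simply running out of memory.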
I can't seem to find an exact match for the question I am about to ask. I just started following a TensorFlow tutorial on YouTube and got stuck at the very beginning. I wrote the code below in my Spyder IDE:
import tensorflow as tf
a = tf.constant(2)
b = tf.constant(3)
x = tf.add(a,b)
#writer = tf.summary.FileWriter('./graphs', tf.get_default_graph())
with tf.Session() as sess:
    writer = tf.summary.FileWriter('./graphs', sess.graph)
    print(sess.run(x))
writer.close()
Then, in the Anaconda terminal, I activated my env (newly created, with all required packages installed, including Spyder), ran python tftuts.py, and got:
2018-10-05 11:50:49.431174: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
5
Then I typed tensorboard --logdir="./graphs" --port 6006 as suggested in tutorial I am watching.
Now, when I go to http://localhost:6006/ the page shows
I am on Win10, using python 3.6.6 in Anaconda env, tensorflow 1.10.0.
How can I solve this issue?
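The screenshot of the page is not reproduced above, so what follows is only a generic first check, not a diagnosis. TensorBoard can only display something if the directory passed to --logdir contains event files, whose names contain "tfevents". Because './graphs' is a relative path, it is resolved against the directory the command is run from, for both the script and the tensorboard command. The helper below (find_event_files is a made-up name) lists such files using only the standard library.

```python
import os

def find_event_files(logdir):
    """Return absolute paths of TensorBoard event files found under logdir."""
    matches = []
    for root, _dirs, files in os.walk(logdir):
        for name in files:
            if "tfevents" in name:
                matches.append(os.path.abspath(os.path.join(root, name)))
    return sorted(matches)

# Show where './graphs' actually points from the current working directory,
# and which event files (if any) it contains.
print(os.path.abspath("./graphs"))
print(find_event_files("./graphs"))
```

Running this from the same directory in which the tensorboard command is typed shows whether the two are looking at the same folder. An empty list means TensorBoard has nothing to load from that location.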
I train the model with the shell command:
python src/facenet_train.py \
--batch_size 15 \
--gpu_memory_fraction 0.25 \
--models_base_dir trained_model_2017_05_15_10_24 \
--pretrained_model trained_model_2017_05_15_10_24/20170515-121856/model-20170515-121856.ckpt-182784 \
--model_def models.nn2 \
--logs_base_dir logs \
--data_dir /data/user_set/training/2017_05_15_10_24 \
--lfw_pairs /data/user_set/lfw_pairs.txt \
--image_size 224 \
--lfw_dir /data/user_set/lfw \
--optimizer ADAM \
--max_nrof_epochs 1000 \
--learning_rate 0.00001
But I get error information like this when I use my own trained model:
2017-05-17 14:23:05.448285: W
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use SSE4.1 instructions, but these are
available on your machine and could speed up CPU computations.
2017-05-17 14:23:05.448318: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use SSE4.2 instructions, but these are
available on your machine and could speed up CPU computations.
2017-05-17 14:23:05.448324: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use AVX instructions, but these are
available on your machine and could speed up CPU computations.
2017-05-17 14:23:05.448329: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use AVX2 instructions, but these are
available on your machine and could speed up CPU computations.
2017-05-17 14:23:05.448334: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow
library wasn't compiled to use FMA instructions, but these are
available on your machine and could speed up CPU computations.
2017-05-17 14:23:05.674872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0
with properties:
name: Quadro M4000
major: 5 minor: 2 memoryClockRate (GHz) 0.7725
pciBusID 0000:03:00.0
Total memory: 7.93GiB
Free memory: 2.89GiB
2017-05-17 14:23:05.674917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-05-17 14:23:05.674935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-05-17 14:23:05.674957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating
TensorFlow device (/gpu:0) -> (device: 0, name: Quadro M4000, pci bus
id: 0000:03:00.0)
Traceback (most recent call last):
File "forward.py", line 21, in <module>
images_placeholder = tf.get_default_graph().get_tensor_by_name("input:0")
File "/home/chen/.pyenv/versions/anaconda3-4.2.0/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2563, in get_tensor_by_name
return self.as_graph_element(name, allow_tensor=True, allow_operation=False)
File "/home/chen/.pyenv/versions/anaconda3-4.2.0/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2414, in as_graph_element
return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
File "/home/chen/.pyenv/versions/anaconda3-4.2.0/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2456, in _as_graph_element_locked
"graph." % (repr(name), repr(op_name)))
KeyError: "The name 'input:0' refers to a Tensor which does not exist. The operation, 'input', does not exist in the graph."
Feature extraction code:
import tensorflow as tf
import facenet
w_MODEL_PATH_='/home/chen/demo_dir/facenet_tensorflow_train/trained_model_2017_05_15_10_24/20170515-121856'
with tf.Graph().as_default():
    with tf.Session() as sess:
        # load the model
        meta_file, ckpt_file = facenet.get_model_filenames(w_MODEL_PATH_)
        facenet.load_model(w_MODEL_PATH_, meta_file, ckpt_file)
        # print("model_path:", w_MODEL_PATH_,"meta_file:", meta_file,"ckpt_file:", ckpt_file)
        # Get input and output tensors
        # ops = tf.get_default_graph().get_operations()
        #
        # print(ops)
        images_placeholder = tf.get_default_graph().get_tensor_by_name("input:0")
        embeddings = tf.get_default_graph().get_tensor_by_name("embeddings:0")
        phase_train_placeholder = tf.get_default_graph().get_tensor_by_name("phase_train:0")
        image_size = images_placeholder.get_shape()[1]
        embedding_size = embeddings.get_shape()[1]
        # print(image_size)
        paths = ['one.png', 'two.png']
        # Run forward pass to calculate embeddings
        images = facenet.load_data(paths, do_random_crop=False, do_random_flip=False, image_size=image_size,
                                   do_prewhiten=True)
        # print("images:", idx, images)
        feed_dict = {images_placeholder: images, phase_train_placeholder: False}
        # print(idx,"embeddings:", embeddings)
        emb_array = sess.run(embeddings, feed_dict=feed_dict)
        # print(idx, "emb_array:", emb_array)
        print(emb_array)
I don't know how to use my own trained model, please help.
If you are talking about the last part (the KeyError), use this code to see which operations your model has, and pick the input and output names from that list:
for i in tf.get_default_graph().get_operations():
    print(i.name)
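The KeyError says that the operation 'input' does not exist. A tensor name such as 'input:0' is an operation name followed by an output index, so the fix is to find the operation name the trained graph really uses. Once the operation names have been printed, they can be searched for likely replacements. The sketch below does that on a plain list of strings; split_tensor_name and find_candidates are hypothetical helpers, and the sample names are invented, not taken from the model in the question.

```python
def split_tensor_name(tensor_name):
    """Split 'op_name:index' into (op_name, index); the index defaults to 0."""
    op_name, sep, index = tensor_name.rpartition(":")
    if not sep:
        return tensor_name, 0
    return op_name, int(index)

def find_candidates(op_names, keyword):
    """Return the operation names that contain keyword, case-insensitively."""
    return [name for name in op_names if keyword.lower() in name.lower()]

# Invented operation names, standing in for the printed list.
sample_ops = ["batch_join", "image_batch", "phase_train", "embeddings"]

op_name, index = split_tensor_name("input:0")
print(op_name, index)
print(find_candidates(sample_ops, "batch"))
```

With the real graph, the list would come from the loop above, for example by collecting i.name into a list instead of printing it, and the chosen name plus ":0" would replace "input:0" in get_tensor_by_name.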
If you are talking about the optimizations:
Those cpu_feature_guard messages are warnings, not errors. They appear because the TensorFlow binary you installed was not compiled with the SSE4.1, SSE4.2, AVX, AVX2 and FMA instructions that your CPU supports. To use them, you need to compile TensorFlow on your own machine. It is fairly easy to do.
You can read the documentation for the full list of options, but essentially you need to do a few steps:
https://www.tensorflow.org/install/install_sources
- git clone the tensorflow repo
- install bazel, the tensorflow build system
- configure tensorflow
- build tensorflow
- install tensorflow in your environment (Anaconda or virtualenv, if you are using Python)
That is it. Of course, other required libraries will need to be installed. It is pretty easy to do on Ubuntu.
Alternatively, you could try the conda-forge version of tensorflow-gpu if you are using Anaconda, but I cannot verify that it is also compiled with optimizations for your CPU.
https://conda-forge.org/
- install anaconda
- add the conda forge repo url
- update conda
- install tensorflow-gpu
So I'm trying to get into tensorflow and liking it so far.
Today I upgraded to cuda 8, cudnn 5.1 and tensorflow 0.12.1. Using a Maxwell Titan X GPU.
Using the following short code of loading the pretrained vgg16:
import numpy as np
import tensorflow as tf
from tensorflow.contrib import slim
from tensorflow.contrib.slim import nets
tf.reset_default_graph()
input_images = tf.placeholder(tf.float32, [None, 224, 224, 3], 'image')
preds = nets.vgg.vgg_16(input_images, is_training=False)[0]
saver = tf.train.Saver()
config = tf.ConfigProto(log_device_placement=True,
                        gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.5))
sess = tf.InteractiveSession(config=config)
saver.restore(sess, './vgg_16.ckpt')
_in = np.random.randn(16, 224, 224, 3).astype(np.float32)
I then time the forward pass :
%timeit sess.run(preds, feed_dict={input_images: _in})
I get 160ms per batch (forward pass only), which seems 2.5x slower than the respective configuration in torch according to this benchmark (and also slower than MatconvNet).
The operations seem correctly assigned to the gpu, and the cuda libraries properly found, what else am I missing?
Edit : cudnn and cuda properly found
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:04:00.0
Total memory: 11.92GiB
Free memory: 11.81GiB
Also, the feeding does not seem to be the problem, since replacing input_images by tf.random_uniform((16, 224, 224, 3), maxval=255) does not change the timing.
Edit 2: So I compared to the pytorch version running on the same machine and I get (batches of 16x224x224x3) :
Resnet-50 : pytorch 48ms vs tf 58 ms (OK)
VGG16 : pytorch 65ms vs tf 160ms (not OK)
Tested recently on cuda 9.0, tensorflow 1.9 and pytorch 0.4.1, the differences are now negligible for the same operations.
See the proper timing here.
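For comparisons like the ones above, how the timing is done matters: the first few GPU calls typically include one-off setup cost, so it is common to discard some warm-up runs and report a statistic over many repeats. The helper below is a generic sketch of that idea using only the standard library. time_call and its parameters are my own names, and this is not the method used in the linked timing.

```python
import statistics
import time

def time_call(fn, warmup=3, repeats=20):
    """Return (median, minimum) wall-clock seconds per call of fn()."""
    for _ in range(warmup):  # discard runs that may include one-off setup
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), min(samples)

# Stand-in workload; in the question this would be
# lambda: sess.run(preds, feed_dict={input_images: _in})
median_s, best_s = time_call(lambda: sum(range(100000)))
print(f"median {median_s * 1e3:.3f} ms, best {best_s * 1e3:.3f} ms")
```

sess.run returns NumPy arrays, so the call only returns once the result is available. PyTorch CUDA calls are asynchronous, so a fair comparison would also need torch.cuda.synchronize() inside the timed function on the PyTorch side.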