How to stop TFLearn from printing everything - tflearn

When I run TFLearn, my console prints every single training iteration, as shown below. Changing the snapshot step doesn't do anything. Is there a line of code I can delete somewhere, or a parameter that turns this off?
Training Step: 7790 | total loss: 0.00591 | time: 9.970s
| Adam | epoch: 001 | loss: 0.00591 - acc: 0.9988 -- iter: 03200/55000
Training Step: 7791 | total loss: 0.00540 | time: 10.025s
| Adam | epoch: 001 | loss: 0.00540 - acc: 0.9989 -- iter: 03264/55000
Training Step: 7792 | total loss: 0.00505 | time: 10.089s
| Adam | epoch: 001 | loss: 0.00505 - acc: 0.9990 -- iter: 03328/55000
Training Step: 7793 | total loss: 0.00480 | time: 10.155s
| Adam | epoch: 001 | loss: 0.00480 - acc: 0.9991 -- iter: 03392/55000
Training Step: 7794 | total loss: 0.00503 | time: 10.215s
| Adam | epoch: 001 | loss: 0.00503 - acc: 0.9992 -- iter: 03456/55000
Training Step: 7795 | total loss: 0.00973 | time: 10.274s
| Adam | epoch: 001 | loss: 0.00973 - acc: 0.9962 -- iter: 03520/55000
Training Step: 7796 | total loss: 0.00879 | time: 10.337s
| Adam | epoch: 001 | loss: 0.00879 - acc: 0.9965 -- iter: 03584/55000
Training Step: 7797 | total loss: 0.00824 | time: 10.406s
| Adam | epoch: 001 | loss: 0.00824 - acc: 0.9969 -- iter: 03648/55000
Training Step: 7798 | total loss: 0.00759 | time: 10.464s
| Adam | epoch: 001 | loss: 0.00759 - acc: 0.9972 -- iter: 03712/55000
Training Step: 7799 | total loss: 0.00690 | time: 10.523s
| Adam | epoch: 001 | loss: 0.00690 - acc: 0.9975 -- iter: 03776/55000
EDIT: it is probably a Windows/cmd issue.

As of right now there seems to be no way to disable this in TFLearn; your best bet is to edit the source code.
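If editing the source isn't appealing, one blunt workaround (a sketch only, assuming the progress display is written via sys.stdout) is to silence stdout around the fit() call:

import contextlib
import os
import sys

@contextlib.contextmanager
def silence_stdout():
    # Temporarily redirect stdout to os.devnull so nothing is printed.
    with open(os.devnull, "w") as devnull:
        saved = sys.stdout
        sys.stdout = devnull
        try:
            yield
        finally:
            sys.stdout = saved

# Hypothetical usage around a TFLearn fit() call:
# with silence_stdout():
#     model.fit(X, Y, n_epoch=1)

The downside is that everything else printed during training is silenced too.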

Related

Allocating Large Tensor on multiple GPUs using Distributed Learning in Keras

I am using TensorFlow distributed learning with the following commands:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
model = Basic_Model()
model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
The system has four 32 GB GPUs. The following is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 37C P0 65W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 38C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 33C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 39C P0 41W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But after running the script to create the model, I am getting the following error -
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape [131072,65536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]
A float32 tensor of shape [131072, 65536] would need 131072 * 65536 * 4 bytes, i.e. roughly 34.36 GB. There are four 32 GB GPUs, so why can't it be allocated?
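Spelled out in Python, the arithmetic is:

# float32 tensor of shape [131072, 65536]: 4 bytes per element
num_bytes = 131072 * 65536 * 4
print(num_bytes)            # 34359738368
print(num_bytes / 1e9)      # ~34.36 GB (decimal)
print(num_bytes / 1024**3)  # 32.0 GiB, more than any single card's free memory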
MirroredStrategy creates a copy of every variable in its scope on each GPU, so a single ~34 GB tensor is too large for any one device. What you may actually want is something closer to tf.distribute.experimental.CentralStorageStrategy. In terms of GPU memory, MirroredStrategy does not give you vram * num_of_gpus; in practice you are limited by the smallest single device, so Keras is working with 32 GB per replica, not 32 * 4 = 128 GB.
strategy = tf.distribute.experimental.CentralStorageStrategy()
dataset = ...  # some dataset
dataset = strategy.experimental_distribute_dataset(dataset)
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
Example:
Tensor A is [0, 1, 2, 3] and you have four GPUs. MirroredStrategy will load:
GPU0: [0, 1, 2, 3]
GPU1: [0, 1, 2, 3]
GPU2: [0, 1, 2, 3]
GPU3: [0, 1, 2, 3]
NOT
GPU0: [0]
GPU1: [1]
GPU2: [2]
GPU3: [3]
As you can see, MirroredStrategy requires every available device to be able to hold all of the data; therefore, you're limited to your smallest device when using this strategy.
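If you want to confirm the replication behaviour on your own machine, here is a minimal sketch (assuming TF 2.x; the small variable is just a stand-in):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)  # 4 on a 4-GPU box

with strategy.scope():
    v = tf.Variable(tf.zeros([2, 2]))

# Under MirroredStrategy the variable is mirrored: each GPU holds a full copy,
# which is why a single ~32 GiB tensor cannot fit on any one 32 GB device.
print(type(v).__name__)  # typically "MirroredVariable"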

Parent and Part Relationship Cost Calculation using SQL

I have a child/parent relationship table with costs, and I need to roll up the child part costs into the parent. Attached is an image of the table; the Reqd. column shows the cost rollup I need.
I have not been able to do it with SQL queries. Also, recursive queries won't help, since there are more than 100k line items in total.
Any suggestions or ideas would be very helpful.
I don't understand: you say you can't use a CTE or a SQL query, so what technology do you want to rely on? This is entirely doable in SQL with aggregates and joins.
EDIT:
Without your schema, try to adapt this query for your table :
SELECT j.ID, j.Parent_ID, SUM(t.Cost)
FROM MyTableCost as t
INNER join MyTableCost AS j on j.ID = t.Parent_ID
GROUP BY j.ID, j.Parent_ID;
One self-join seems to be enough.
But I don't see the logic for rolling up id 4.
select t0.parent_id, t0.part_id, t0.cost
, coalesce(sum(t1.cost), t0.cost) as reqd_cost
from your_weird_hierarchy_table t0
left join your_weird_hierarchy_table t1
on t1.parent_id = t0.part_id
group by t0.parent_id, t0.part_id, t0.cost;
parent_id | part_id | cost | reqd_cost
--------: | ------: | -----: | --------:
1 | 2 | 1.0000 | 4.0000
1 | 3 | 2.0000 | 9.0000
1 | 4 | 1.0000 | 1.0000
2 | 3 | 3.0000 | 9.0000
2 | 4 | 1.0000 | 1.0000
3 | 4 | 5.0000 | 5.0000
3 | 5 | 3.0000 | 3.0000
3 | 6 | 1.0000 | 1.0000
db<>fiddle here

Training an object detector on dataset layers of increasing ambiguity

I have a number of Raspberry Pi cameras focused on bird feeders,
continually running a TensorFlow Object Detection graph (SSD MNet2) to detect birds.
Over time I've built a dataset of 10k+ images across 11 species, retraining the graph frequently.
I intend to cap the dataset at 10k items (perhaps arbitrarily).
There is a flow of data through the dataset so that it continually improves.
New candidate detections are triaged by a judge (me) as follows:

- Add as new training/evaluation item.
  - The detection is judged representative of a category.
  - After adjustment, the image and detection can be added to the ground truth.
- Add as counter-example item.
  - The detection is false, but can be converted to an unclassified counter example.
  - After adjustment, the image and detection can be added to the ground truth.
- Discard item.
  - Not useful for training.
Also note that some existing items are retired when sufficiently better data becomes available.
To date, all the items in the ground truth are delivered to training with a weight of 1.0.
See: https://github.com/tensorflow/models/blob/master/research/object_detection/data_decoders/tf_example_decoder.py
def default_groundtruth_weights():
    # If no weights are supplied, every ground-truth box gets weight 1.0
    return tf.ones(
        [tf.shape(tensor_dict[fields.InputDataFields.groundtruth_boxes])[0]],
        dtype=tf.float32)
But treating every item as equally good is clearly not accurate.
I know by inspection that some of the items are not that good, but at any one time they're the best examples available.
Over time, bad items eventually get replaced with better ones.
Ranked training records
I have wondered about the impact on training and whether the situation would be improved by ranking the dataset, by some value of ideality,
and then training in successive ranks, so that the model initialises on the most ideal data and subsequently learns less and less ideal data.
What I'm imagining trying to avoid is the model paying too much attention to bad data and not enough to good data, especially during the initial epochs of training.
Where bad and good data mean how well the data items contribute to the veracity and visualization (via Lucid) of the trained models.
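As a concrete (hypothetical) sketch of what I mean by ranked delivery: sort the items by an ideality score and feed the best-ranked ones in the earliest stages of training.

# Hypothetical items: (image_id, ideality score in [0, 1])
items = [("img_0001", 0.9), ("img_0002", 0.5), ("img_0003", 0.7), ("img_0004", 0.8)]

# Rank from most to least ideal, then split into successive training stages.
ranked = sorted(items, key=lambda item: item[1], reverse=True)
stage_size = 2
stages = [ranked[i:i + stage_size] for i in range(0, len(ranked), stage_size)]
# Train on stages[0] first, then stages[0] + stages[1], and so on.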
Weighted Dataset
Setting a weight (between 0 and 1) on an item means the loss calculated for that item is reduced (by the weight factor);
I assume it means "pay less attention to this item by this much".
See: Class weights for balancing data in TensorFlow Object Detection API
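A minimal illustration of that interpretation (not the Object Detection API internals; the numbers are made up):

import tensorflow as tf

# Per-item losses and the ground-truth weights assigned to those items.
per_item_loss = tf.constant([0.8, 1.2, 0.5])
weights = tf.constant([1.0, 0.6, 0.9])

# Each item's loss is scaled by its weight before being combined, so an item
# with weight 0.6 contributes 40% less than a full-weight item.
weighted_loss = tf.reduce_sum(per_item_loss * weights) / tf.reduce_sum(weights)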
I've visited every item in my dataset to retrospectively set a weight.
I did this by running some recent models over the dataset images (admittedly, from which the models were trained) and then matching detections.
The weight given to each item was calculated by averaging scores from the model detections (and rounding to one decimal place to make bands).
The entire dataset was then reviewed to increase or reduce weights as judged necessary.
The results are shown in the following table:
| Class \ Weight bin | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | Total |
| ------------------ | --: | --: | --: | --: | --: | --: | --: | ----: |
| blackbird          |     |     |  34 |  84 | 212 | 305 | 115 |   750 |
| blue tit           |     |     |  47 |  94 | 211 | 435 | 241 |  1028 |
| collared dove      |     |     |  17 |  52 | 236 | 302 | 101 |   708 |
| dunnock            |     |     |  50 | 140 | 260 | 236 | 228 |   914 |
| goldfinch          |     |     |  60 | 103 | 220 | 392 | 164 |   939 |
| great tit          |     |  35 |  42 |  71 | 234 | 384 | 201 |   967 |
| mouse              |  40 |  29 |  35 |  50 |  87 | 142 |     |   383 |
| robin              |     |  43 |  44 |  97 | 175 | 207 |  52 |   618 |
| sparrow            |     |  31 |  51 |  75 | 278 | 475 | 220 |  1130 |
| starling           |     |  19 |  28 |  39 |  97 | 227 |  73 |   483 |
| wood pigeon        |     |     |  10 |  34 |  82 | 265 | 560 |   951 |
| Total              |     |     |     |     |     |     |     |  8871 |
The first training results look promising, in that the model is training well.
But I haven't reviewed the visualizations yet.
Is setting an appropriate weight on each dataset item equivalent to layering the delivery of ranked training records?
First, try removing the ambiguous data from the dataset, train the model, and compare its results with the previous model.
If that does not help, then go with class weights to balance the data.
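If you do go the class-weight route, one common recipe (a sketch only: inverse frequency, with counts taken from the table above) looks like this:

# Hypothetical per-class counts (three classes from the table above)
counts = {"sparrow": 1130, "mouse": 383, "starling": 483}
total = sum(counts.values())

# Inverse-frequency weights, normalised so the average weight is roughly 1.0
class_weight = {name: total / (len(counts) * n) for name, n in counts.items()}
print(class_weight)  # rarer classes (mouse) get weights above 1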

TensorFlow with 4-GPU doesn't speed up the training

My code first:
from sklearn.datasets.samples_generator import make_blobs
from matplotlib import pyplot
from numpy import where
from keras.utils import to_categorical
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from keras.utils import multi_gpu_model
X, y = make_blobs(n_samples=1000000, centers=3, n_features=3, cluster_std=2, random_state=2)
y = to_categorical(y)
n_train = 500000
trainX, testX = X[:n_train, :], X[n_train:, :]
trainY, testY = y[:n_train], y[n_train:]
model = Sequential()
model.add(Dense(50, input_dim=3, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(3, activation='softmax'))
p_model = multi_gpu_model(model, gpus=4)
opt = SGD(lr=0.01, momentum=0.9)
p_model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
history = p_model.fit(trainX, trainY, validation_data=(testX, testY), epochs=20, verbose=1, batch_size=32)
_, train_acc = p_model.evaluate(trainX, trainY, verbose=0)
_, test_acc = p_model.evaluate(testX, testY, verbose=0)
print("Train: %.3f, Test: %.3f" % (train_acc, test_acc))
The line below is what enables the 4 GPUs:
p_model = multi_gpu_model(model, gpus=4)
While training is running, I can see the following (are the GPUs fully utilized?):
(tf_gpu) [martin#A08-R32-I196-3-FZ2LTP2 mlm]$ nvidia-smi
Wed Jan 23 09:08:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:02:00.0 Off | 0 |
| N/A 29C P0 49W / 250W | 21817MiB / 22919MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:04:00.0 Off | 0 |
| N/A 34C P0 50W / 250W | 21817MiB / 22919MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P40 Off | 00000000:83:00.0 Off | 0 |
| N/A 28C P0 48W / 250W | 21817MiB / 22919MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P40 Off | 00000000:84:00.0 Off | 0 |
| N/A 36C P0 51W / 250W | 21817MiB / 22919MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 114918 C python 21807MiB |
| 1 114918 C python 21807MiB |
| 2 114918 C python 21807MiB |
| 3 114918 C python 21807MiB |
+-----------------------------------------------------------------------------+
However, compared to a single-GPU run, or even to running on my Mac desktop, it doesn't speed up at all. The total training takes about 20 minutes, almost the same as single-GPU training, and much slower than training on my personal Mac. Why is that?
Increase your batch_size beyond 32 and keep increasing it until you reach full GPU utilization. Yes, this might affect your model, but it significantly increases performance; you will have to find the sweet spot for that batch_size.
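For example, reusing the variables from the question (a sketch: 4096 is only a starting guess, and multi_gpu_model splits each batch across the GPUs, so batch_size=32 gives each P40 just 8 samples per step):

history = p_model.fit(trainX, trainY,
                      validation_data=(testX, testY),
                      epochs=20,
                      verbose=1,
                      batch_size=4096)  # roughly 1024 samples per GPU per step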

Specify gpu in Tensorflow code: /gpu:0 is always working?

I have 3 graphics cards in my workstation; one of them is a Quadro K620, and the other two are Titan Xs. I would like to run my TensorFlow code on one of the cards, so that the others stay idle for another task.
However, regardless of whether I set tf.device('/gpu:0') or tf.device('/gpu:1'), I found that the 1st Titan X is always busy, and I don't know why.
import argparse
import os
import time
import tensorflow as tf
import numpy as np
import cv2
from Dataset import Dataset
from Net import Net

FLAGS = None

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--foldername', type=str, default='./data-large/')
    parser.add_argument('--batch_size', type=int, default=100)
    parser.add_argument('--num_epoches', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.5)
    FLAGS = parser.parse_args()

    net = Net(FLAGS.batch_size, FLAGS.learning_rate)

    with tf.Graph().as_default():
        # Dataset is a class that encapsulates the input pipeline
        dataset = Dataset(foldername=FLAGS.foldername,
                          batch_size=FLAGS.batch_size,
                          num_epoches=FLAGS.num_epoches)
        images, labels = dataset.samples_train

        ## The following code defines the network and trains it
        with tf.device('/gpu:0'):  # <==== THIS LINE
            logits = net.inference(images)
            loss = net.loss(logits, labels)
            train_op = net.training(loss)

        init_op = tf.group(tf.initialize_all_variables(), tf.initialize_local_variables())
        sess = tf.Session()
        sess.run(init_op)

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)

        start_time = time.time()
        try:
            step = 0
            while not coord.should_stop():
                _, loss_value = sess.run([train_op, loss])
                step = step + 1
                if step % 100 == 0:
                    format_str = ('step %d, loss = %.2f, time: %.2f seconds')
                    print(format_str % (step, loss_value, (time.time() - start_time)))
                    start_time = time.time()
        except tf.errors.OutOfRangeError:
            print('done')
        finally:
            coord.request_stop()
            coord.join(threads)

        sess.close()
Regarding the line marked "<==== THIS LINE":
If I set tf.device('/gpu:0'), the monitor says:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K620 Off | 0000:03:00.0 On | N/A |
| 34% 45C P0 2W / 30W | 404MiB / 1993MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 39C P2 100W / 250W | 11691MiB / 12206MiB | 8% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:81:00.0 Off | N/A |
| 22% 43C P2 71W / 250W | 111MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
showing the 1st Titan X card is working.
If I set tf.device('/gpu:1'), the monitor says:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K620 Off | 0000:03:00.0 On | N/A |
| 34% 45C P0 2W / 30W | 411MiB / 1993MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 52C P2 73W / 250W | 11628MiB / 12206MiB | 12% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:81:00.0 Off | N/A |
| 22% 42C P2 71W / 250W | 11628MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
showing that both Titan X cards are busy, not just the 2nd Titan X.
Is there a reason for this, and how do I specify which GPU I want my program to run on?
Just a guess, but the default behavior for a tf.train.Optimizer object (which I expect is created in net.training(loss)) when you call minimize() is colocate_gradients_with_ops=False. This may lead to the backpropagation ops being placed on the default device, which will be /gpu:0.
To work out if this is happening, you can iterate over sess.graph_def and look for nodes that either have /gpu:0 in the NodeDef.device field, or have an empty device field (in which case they will be placed on /gpu:0 by default).
Another option for checking what devices are being used is to use the output_partition_graphs=True option when running your step. This shows what devices TensorFlow is actually using (instead of, in sess.graph_def, what devices your program is requesting), and should show exactly what nodes are running on /gpu:0.
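A sketch of both checks for TF 1.x (reusing net, images, labels and sess from the question; the optimizer class below is a placeholder, since the real one is created inside net.training(loss)):

# 1. Keep the backward pass on the same device as the forward ops by asking
#    the optimizer to colocate gradient ops with their forward ops.
with tf.device('/gpu:1'):
    logits = net.inference(images)
    loss = net.loss(logits, labels)
    optimizer = tf.train.GradientDescentOptimizer(0.5)  # placeholder optimizer
    train_op = optimizer.minimize(loss, colocate_gradients_with_ops=True)

# 2. After building the graph, see which devices one step actually runs on.
run_options = tf.RunOptions(output_partition_graphs=True)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)
for partition in run_metadata.partition_graphs:
    print({node.device for node in partition.node})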