Why does the cifar10 benchmark show slow performance on every n x 100 step? - tensorflow

I have been comparing performance between a source-built and a Google-provided .whl build of tensorflow-gpu. I have run dozens of benchmark tests, and I always see slow performance on every n x 100 step (0, 100, 200, ...). I cannot figure out the reason. Can any TensorFlow expert explain this?
I am running Ubuntu (18.04), Fedora (27, 28), and Windows, with CUDA 9.0/9.1/9.2.
I've tested with TF 1.6, 1.7, 1.8, and 1.9.
My GPU is a 1080 Ti (11 GB).
My CPU is an Intel 4690K with 32 GB of DRAM.
One sample run is attached.
Thank you very much in advance.
Dae-Chul Jo
dcjo00#gmail.com

It could be for a few different reasons:
1. Every 100 steps you are saving the model
2. Every 100 steps you are testing validation data
3. Every 100 steps you are saving logs to TensorBoard
These are my first guesses, in order of probability; if you provide code I could look into it more deeply.
Hope it helps! :)
EDIT: it ended up being reason 3: tf.train.MonitoredTrainingSession saves summaries every 100 steps by default.
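For reference, a minimal sketch of how that default could be changed (the checkpoint_dir path and the train_op name are placeholders on my side, not from the question):

import tensorflow as tf

# save_summaries_steps defaults to 100; raising it (or using save_summaries_secs
# instead) decouples the summary writes from the n x 100 steps.
with tf.train.MonitoredTrainingSession(
        checkpoint_dir="/tmp/cifar10_train",   # placeholder directory
        save_summaries_steps=1000,             # default is 100
        save_checkpoint_secs=600) as sess:
    while not sess.should_stop():
        sess.run(train_op)                     # your existing training op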

Related

Train / Test split % for Object Detection - what's the current recommendation?

Using the Tensorflow Object Detection API, what's the current recommendation / best practice around the train / test split percentage for labeled examples? I've seen a lot of conflicting info, anywhere from 70/30 to 95/5. Any recent real world experience is appreciated.
Traditional advice is ~70-75% training and the rest test data. More recent articles indeed suggest a different split. I read 95/2.5/2.5 (train / test / dev for hyperparameter tuning) a lot these days.
I guess your optimal split depends on the amount of available data and the bias/variance characteristics. Poor performance on the training data may be caused by underfitting and call for more training data. If your model fits well, or even overfits, you should be able to move some of the training data over to the test data.
If you're stuck in the middle, you may also consider cross validation as a computationally expensive but data friendly option.
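If you go the cross-validation route, here is a minimal sketch with scikit-learn (clf, X and y are placeholders for your model and data):

from sklearn.model_selection import KFold, cross_val_score

# 5-fold cross validation: every example is used for both training and evaluation,
# at the cost of fitting the model 5 times.
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42))
print(scores.mean(), scores.std())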
It depends on the size of the dataset as Andrew Ng suggests:
(train / dev-or-val / test)
If the size of the dataset is 100 to 10K ==> 60/20/20
If the size of the dataset is 1M to INF ==> 98/1/1 or 99.5/0.25/0.25
Note that these are not fixed and just suggestions.
The goal of the test set mentioned here is to give you an unbiased measurement of your model's performance. In some work it is OK to have only two sets (they then call it train/test, though the test set here actually works as a dev set; the ratio can be 70/30).
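As a concrete illustration of the big-data splits above, a two-stage split with scikit-learn (X and y are placeholders for your examples and labels):

from sklearn.model_selection import train_test_split

# First carve off 2% of the data, then split that 2% evenly into dev and test -> 98/1/1.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.02, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)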

Darkflow accurate on demo but not on code

I trained my own model with darkflow yolov2 for just one class, and the results are pretty good when running this on the terminal with a threshold configuration of 0.55
python3 flow --model cfg/yolov2-tiny-voc-1c.cfg --load 5250 --demo BARCELONA_WALK.mp4
but then I converted the checkpoint to .pb and .meta files to use in code,
and when I specify the threshold in the code like this:
options = {"model": "cfg/yolov2-tiny-voc-1c.cfg",
           "pbload": "built_graph/yolov2-tiny-voc-1c.pb",
           "metaload": "built_graph/yolov2-tiny-voc-1c.meta",
           "threshold": 0.55,
           "gpu": 0.9}
it detects nothing in my image samples. But when the threshold is 0.5 or lower it detects around 280 objects, and the ones with confidence greater than 0.5 number around 190. So why does the network not behave the same way in code as it does when running the demo from the terminal, if I'm using the same weights and the same threshold?
SOLVED!!! In my options I had to put "pbLoad" and "metaLoad" instead of "pbload" and "metaload". Too bad it didn't throw any errors, but anyway, I realized it might be the capitalization when reading this post. I hope it helps someone in the future!!
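For completeness, a sketch of the corrected options together with a typical darkflow call (the image path is just a placeholder):

from darkflow.net.build import TFNet
import cv2

# Note the capital L in pbLoad / metaLoad; the lowercase keys were silently ignored.
options = {"model": "cfg/yolov2-tiny-voc-1c.cfg",
           "pbLoad": "built_graph/yolov2-tiny-voc-1c.pb",
           "metaLoad": "built_graph/yolov2-tiny-voc-1c.meta",
           "threshold": 0.55,
           "gpu": 0.9}

tfnet = TFNet(options)
img = cv2.imread("sample.jpg")            # placeholder image path
predictions = tfnet.return_predict(img)   # detections above the 0.55 threshold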

Understanding tensorflow inter/intra parallelism threads

I would like to understand a little more about these two parameters: intra and inter op parallelism threads
session_conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)
I read this post which has a pretty good explanation: TensorFlow: inter- and intra-op parallelism configuration
But I am seeking confirmation and also asking new questions below. I am running my task on Keras 2.0.9 and TensorFlow 1.3.0:
when both are set to 1, does it mean that, on a computer with 4 cores for example, there will be only 1 thread shared by the four cores?
why does using 1 thread not seem to affect my task very much in terms of speed? My network has the following structure: dropout, conv1d, maxpooling, lstm, globalmaxpooling, dropout, dense. The post cited above says that if there are a lot of matrix multiplication and subtraction operations, a multi-thread setting can help. I do not know much about the math underneath, but I'd imagine there are quite a lot of such matrix operations in my model? However, changing both params from 0 to 1 only adds about a 1-minute slowdown to a 10-minute task.
why can multi-threading be a source of non-reproducible results? See Results not reproducible with Keras and TensorFlow in Python. This is the main reason I need to use single threads, as I am doing scientific experiments. Surely TensorFlow has been improving over time, so why has this not been addressed in a release?
Many thanks in advance
When both parameters are set to 1, there will be 1 thread running on 1 of the 4 cores. The core on which it runs might change but it will always be 1 at a time.
When running something in parallel there is always a trade-off between lost time on communication and gained time through parallelization. Depending on the used hardware and the specific task (like the size of the matrices) the speedup will change. Sometimes running something in parallel will be even slower than using one core.
For example, when using floats on a CPU, (a + b) + c will not be equal to a + (b + c) because of the floating-point precision. Using multiple parallel threads means that operations like a + b + c will not always be computed in the same order, leading to different results on each run. However, those differences are extremely small and will not affect the overall result in most cases. Completely reproducible results are usually only needed for debugging. Enforcing complete reproducibility would slow down multi-threading a lot.
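A tiny demonstration of that non-associativity (plain Python, just to illustrate the point):

# Floating-point addition is not associative, so the order in which threads
# combine partial sums changes the result slightly.
a, b, c = 0.1, 1e16, -1e16
print((a + b) + c)   # 0.0 -> the small term is swallowed by the huge one first
print(a + (b + c))   # 0.1 -> the huge terms cancel first, so the small term survives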
Answer to question 1 is "No".
Setting both parameters to 1 (intra_op_parallelism_threads=1, inter_op_parallelism_threads=1) will still generate N threads, where N is the number of cores. I've tested this multiple times on different versions of TensorFlow, and it is true even for the latest version. There are multiple questions on how to reduce the number of threads to 1, but with no clear answer. Some examples are:
How to stop TensorFlow from multi-threading
https://github.com/usnistgov/frvt/issues/12
Changing the number of threads in TensorFlow on Cifar10
Importing TensorFlow spawns threads
https://github.com/tensorflow/tensorflow/issues/13853
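For reference, the combination usually suggested in those links, with the caveat (as stated above) that it still may not bring the process down to a single thread:

import os
os.environ["OMP_NUM_THREADS"] = "1"   # must be set before TensorFlow is imported

import tensorflow as tf
from keras import backend as K

session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)
K.set_session(tf.Session(config=session_conf))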

TensorBoard doesn't show all data points

I was running a very long training (reinforcement learning with 20M steps) and writing a summary every 10k steps. Between step 4M and 6M, I saw 2 peaks in my TensorBoard scalar chart for game score; then I let it run and went to sleep. In the morning it was at about step 12M, but the peaks between step 4M and 6M that I had seen earlier had disappeared from the chart. I tried to zoom in and found out that TensorBoard (randomly?) skipped some of the data points. I also tried to export the data, but some data points, including the peaks, are missing in the exported .csv as well.
I looked for answers and found this in TensorFlow github page:
TensorBoard uses reservoir sampling to downsample your data so that it can be loaded into RAM. You can modify the number of elements it will keep per tag in tensorboard/backend/server.py.
Has anyone ever modified this server.py file? Where can I find the file, and if I installed TensorFlow from source, do I have to recompile it after modifying the file?
You don't have to change the source code for this, there is a flag called --samples_per_plugin.
Quoting from the help command
--samples_per_plugin: An optional comma separated list of plugin_name=num_samples pairs to explicitly
specify how many samples to keep per tag for that plugin. For unspecified plugins, TensorBoard
randomly downsamples logged summaries to reasonable values to prevent out-of-memory errors for long
running jobs. This flag allows fine control over that downsampling. Note that 0 means keep all
samples of that type. For instance, "scalars=500,images=0" keeps 500 scalars and all images. Most
users should not need to set this flag.
(default: '')
So if you want to have a slider of 100 images, use:
tensorboard --samples_per_plugin images=100
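The same flag covers the scalar charts from the question; for example, to keep every scalar point instead of a downsampled subset:
tensorboard --samples_per_plugin scalars=0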
The comment is out of date - it can actually be modified in tensorboard/backend/application.py, in the "Default Size Guidance". By default, it stores 1000 scalars. You can increase that limit arbitrarily, or set it to 0 to store every scalar.
You don't need to recompile TensorBoard, or even download it from source. You could just modify this file in your TensorBoard yourself.
If you install TensorFlow using pip in a virtualenv (Ubuntu, Mac), then within your virtualenv directory the path to application.py should be something like lib/python2.7/site-packages/tensorflow/tensorboard/backend. If you modify that file, you should get the new setting in your TensorBoard (when you run tensorboard from that virtualenv). If you're like me, you'll put in a print statement too, so you can be sure you're running the modified code :)

Word2Vec: Any way to train the model faster?

I use Gensim Word2Vec to train word sets from my database.
I have about 400,000 phrases (each phrase is short; 700 MB in total) in my PostgreSQL database.
This is how I train these data using Django ORM:
post_vector_list = []
for post in Post.objects.all():
    post_vector = my_tokenizer(post.category.name)
    post_vector.extend(my_tokenizer(post.title))
    post_vector.extend(my_tokenizer(post.contents))
    post_vector_list.append(post_vector)
word2vec_model = gensim.models.Word2Vec(post_vector_list, window=10, min_count=2, size=300)
But this job takes a lot of time and does not feel efficient.
In particular, the part that creates post_vector_list takes a lot of time and space.
I want to improve the training speed but have no idea how.
I'd appreciate your advice. Thanks.
To optimize such code, you need to collect good information about where the time is spent.
Is most of the time spent preparing post_vector_list?
If so, you will want to make sure my_tokenizer (whose code is not shown) is as efficient as possible. You may want to try to minimize the number of extend()s and append()s that are done on large lists. You might have to even take a look at your DB's configuration or options to speed up the DB-to-Object mapping started inside Post.objects.all().
Is most of the time spent in the call to Word2Vec()?
If so, other steps may help:
ensure you're using gensim's Cython-optimized routines – if not, you should be seeing a logged warning (and training will be up to 100X slower)
consider using a workers=4 or workers=8 optional argument to use more threads, if your machine has at least 4 or 8 CPU cores
consider using a larger min_count, which speeds training somewhat (and since vectors for words where there are only a few examples typically aren't very good anyway, doesn't lose much and can even improve the quality of the surviving words)
consider using a smaller window, since training takes longer for larger windows
consider using a smaller vector_size (previously called size), since training takes longer for larger-size vectors
consider using a more-aggressive (smaller) value for the optional sample argument, which randomly skips more of the most-frequent words. The default is 1e-04, but values of 1e-05 or 1e-06 (especially on larger corpuses) can offer additional speedup, and even often improve the final vectors (by spending relatively less training time on words with an excess of usage examples)
consider using a lower-than-default (5) value for the optional epochs parameter (previously called iter). (I wouldn't recommend this unless the corpus is very large – so it already has many redundant, equally-good examples of the same words throughout.)
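Pulling those suggestions together, a hedged sketch of the call (gensim 4.x parameter names assumed, i.e. vector_size/epochs rather than size/iter; the specific values are only examples):

import gensim

word2vec_model = gensim.models.Word2Vec(
    post_vector_list,
    vector_size=300,   # smaller vectors train faster
    window=5,          # smaller than the original 10, so faster
    min_count=5,       # drop rarer words; faster and often better surviving vectors
    sample=1e-05,      # more aggressive downsampling of very frequent words
    workers=8,         # assumes you have 8 CPU cores available
    epochs=5)          # the default; lower it only for very large corpora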
You could use a Python generator instead of loading all the data into a list. Gensim works with Python generators too. The code will look something like this:
class Post_Vectors(object):
    def __init__(self, Post):
        self.Post = Post

    def __iter__(self):
        # stream one tokenized post at a time instead of building a huge list
        for post in self.Post.objects.all():
            post_vector = my_tokenizer(post.category.name)
            post_vector.extend(my_tokenizer(post.title))
            post_vector.extend(my_tokenizer(post.contents))
            yield post_vector

post_vectors = Post_Vectors(Post)
word2vec_model = gensim.models.Word2Vec(post_vectors, window=10, min_count=2, size=300, workers=??)
For the gensim speedup, if you have a multi-core CPU, you could use the workers parameter. (By default it is 3)