I have a model deployed on Google Cloud ML Engine and it is working fine: I can run online predictions and get the right results. My issue is performance. Each prediction (inference) takes around 3.5 seconds, which is not good for my case. I'm using automatic scaling and my bucket is in us-central. My images are around 100 KB and I'm in Brazil (in the Cloud Console I can see a latency of about 1.5 seconds).
I already tried optimize_for_inference.py, but it didn't work (I can't generate the saved_model from the optimized graph. Is that even possible?).
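(For context, the kind of conversion I'm after is something like the sketch below; it assumes TF 1.x, and the file path and tensor names are just placeholders, not my real ones.)

```python
import tensorflow as tf

# Load the graph produced by optimize_for_inference / freeze_graph
# (file path and tensor names below are placeholders).
graph_def = tf.GraphDef()
with tf.gfile.GFile("optimized_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    inputs = graph.get_tensor_by_name("image_bytes:0")   # model input tensor
    outputs = graph.get_tensor_by_name("scores:0")        # model output tensor

    builder = tf.saved_model.builder.SavedModelBuilder("export/1")
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={"inputs": inputs}, outputs={"scores": outputs})

    with tf.Session(graph=graph) as sess:
        builder.add_meta_graph_and_variables(
            sess,
            [tf.saved_model.tag_constants.SERVING],
            signature_def_map={
                tf.saved_model.signature_constants
                  .DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature,
            })
    builder.save()
```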
I need to get results in 2 seconds at most. My question is: is that possible, or is it normal to get results in 3-4 seconds with Cloud ML Engine online prediction?
Thanks! Any ideas are welcome! If you need more information to help me, please add a comment.
Thanks again!
Yesterday I tried to transform a picture into an artistic style using CNNs, based on A Neural Algorithm of Artistic Style by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, using a recent Torch implementation, as explained here:
https://github.com/mbartoli/neural-animation
It started the conversion correctly; the problem is that the process is very time-consuming: after 1 hour of processing, a single picture was still not fully transformed, and I have to transform 1,615 pictures. What's the solution here? Can I use Google Cloud Platform to make this operation faster, or some other kind of cloud service? Using my home PC is not the right solution. If I can use cloud power, how do I configure everything? Let me know, thanks.
Google Cloud Platform (GCP) would seem to be a good fit here. If we boil it down to what you have ... you have a CPU-intensive application that takes a long time to run. Depending on the nature of the application, any single run may finish faster on a machine with more CPUs and/or more RAM. GCP allows YOU to choose the size of the machine on which your application runs, from very small to very large; the distinction is how much you are willing to pay. Remember, you only pay for what you use: if an application takes an hour to run on a machine priced at X per hour but 30 minutes on a machine priced at 2X per hour, the total cost is still only X, but you have your result in 30 minutes rather than an hour. You would switch the machine off after those 30 minutes to stop being charged.
Since you also said that you have MANY images to process, this is where you can take advantage of horizontal scale. Instead of one machine processing the images one after another, an hour each, you can create an array of machines where each machine processes one picture. So if you had 50 machines, at the end of one hour you would have 50 images processed instead of one.
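To make the horizontal-scale idea concrete, here is a rough Python sketch of splitting the 1,615 frames into one work list per VM (the directory name, worker count, and file naming are assumptions for illustration, nothing GCP-specific):

```python
import glob

# Split the full set of frames into one work list per worker VM.
# "frames/*.jpg" and the worker count are assumptions for illustration.
images = sorted(glob.glob("frames/*.jpg"))   # the 1,615 pictures to stylize
num_workers = 50

for worker_id in range(num_workers):
    batch = images[worker_id::num_workers]   # round-robin split of the frames
    with open("batch_%02d.txt" % worker_id, "w") as f:
        f.write("\n".join(batch))

# Each VM then runs the style-transfer script over its own batch_NN.txt list
# and writes the results back to a shared bucket.
```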
As for how to get all of this going ... I'm afraid that is a much bigger story, and one where a read of the GCP documentation will help tremendously. I suggest you read, have a play, and THEN, if you have specific questions, the community can try to provide specific answers.
I'm using TensorFlow-GPU 1.8 on Windows 10. For many projects I use tf.Estimator, which really works great. It takes care of a bunch of steps, including writing summaries for TensorBoard. But right now the 'events.out.tfevents' file is getting way too big and I am running into "out of space" errors. For that reason I want to disable summary writing, or at least reduce the number of summaries written.
Along the way I found out about the RunConfig you can pass when constructing a tf.Estimator. Apparently the parameter 'save_summary_steps' (which by default is 200) controls how often summaries are written out. Unfortunately, changing this parameter seems to have no effect at all: it neither disables summaries (using None) nor reduces the size of 'events.out.tfevents' (when choosing higher values, e.g. 3000).
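For completeness, this is roughly how I'm passing it (a minimal sketch; my_model_fn, my_input_fn, and the model_dir are placeholders):

```python
import tensorflow as tf

# Sketch of passing save_summary_steps via RunConfig
# (my_model_fn, my_input_fn and "model_dir" are placeholders).
run_config = tf.estimator.RunConfig(
    model_dir="model_dir",
    save_summary_steps=3000,   # expected: summaries only every 3000 steps (or None to disable)
)
estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)
estimator.train(input_fn=my_input_fn, steps=10000)
```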
I hope you guys can help me out here. Any help is appreciated.
Cheers,
Tobs.
I've observed the following behavior. It doesn't make sense to me so I hope we get a better answer:
When the input_fn gets data from a tf.data.TFRecordDataset, the number of steps between saved events is the minimum of save_summary_steps and (number of training examples divided by batch size). In other words, summaries are saved at least once per epoch.
When the input_fn gets data from tf.TextLineReader, it follows save_summary_steps as you'd expect and I can give it a large value for infrequent updates.
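For illustration, these are roughly the two input_fn styles being compared (file names and the feature spec below are placeholders, not my actual pipelines):

```python
import tensorflow as tf

def parse_example(serialized):
    # Placeholder feature spec; adjust to your records.
    features = tf.parse_single_example(serialized, {
        "x": tf.FixedLenFeature([10], tf.float32),
        "y": tf.FixedLenFeature([], tf.int64),
    })
    return {"x": features["x"]}, features["y"]

def tfrecord_input_fn():
    # tf.data pipeline: in my runs, summaries were written at least once per epoch.
    dataset = tf.data.TFRecordDataset("train.tfrecords")
    return dataset.map(parse_example).batch(32).repeat()

def textline_input_fn():
    # Queue-based tf.TextLineReader pipeline: save_summary_steps behaved as configured.
    filename_queue = tf.train.string_input_producer(["train.csv"])
    reader = tf.TextLineReader()
    _, line = reader.read(filename_queue)
    x1, x2, y = tf.decode_csv(line, record_defaults=[[0.0], [0.0], [0]])
    features, labels = tf.train.batch([tf.stack([x1, x2]), y], batch_size=32)
    return {"x": features}, labels
```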
As you might know, some detections take time with the Affectiva SDK (Linux in my case). For example, gender recognition might take around 2-3 seconds to output a result; this is correct and expected behavior, just as Affectiva mentions:
The ROC score of the classifier is 0.95 and the average length of time taken to reach a decision is 3.4 seconds
So I was wondering if it's possible to reduce this time somehow in the SDK. I understand that this could generate a lot of false positives, but I'm testing a scenario where faces disappear quite quickly. If this is not possible, I might have to switch to photo analysis instead...
best!
At the moment there is no way to tune gender detection times with video feeds. The only way to reduce your detection time is to use the Photo Detector:
http://developer.affectiva.com/v3_2/cpp/analyze-photo/
This issue is seen when training against my own dataset, which was converted to binary via data_convert_example.py. After a week of training, I get decode results that don't make sense when comparing the decode and ref files.
If anyone has been successful and gotten results similar to what is posted in the Textsum README using their own data, I would love to know what has worked for you: environment, TF build, number of articles.
I have not had luck with 0.11, but I have gotten some results with 0.9; however, the decode results are similar to those shown below, and I have no idea where they are even coming from.
I am currently running Ubuntu 16.04, TF 0.9, CUDA 7.5, and cuDNN 4. I tried TF 0.11 but was dealing with other issues, so I went back to 0.9. It does seem that the decode results are being generated from valid articles, but the reference file and decode file indices have NO correlation.
If anyone can provide any help or direction, it would be greatly appreciated. Otherwise, should I figure anything out, I will post here.
A few final questions. Regarding the vocab file referenced: does it need to be sorted by word frequency at all? I never did anything along these lines when generating it, and I wasn't sure whether this would throw something off as well.
Finally, when generating the data I assumed that the training articles should be broken down into smaller batches, so I separated the articles into multiple files of 100 articles each, named data-0, data-1, etc. I assume this was correct on my part? I also kept all the vocab in one file, which has not seemed to throw any errors.
Are the above assumptions correct as well?
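For reference, the sharding step was essentially something like this sketch (the record list and output directory are placeholders; only the data-0, data-1 naming matches what I described above):

```python
import os

SHARD_SIZE = 100  # articles per output file, per the assumption above

def write_shards(records, out_dir="data"):
    # records: already-converted binary examples (e.g. output of data_convert_example.py)
    for shard_num, start in enumerate(range(0, len(records), SHARD_SIZE)):
        path = os.path.join(out_dir, "data-%d" % shard_num)
        with open(path, "wb") as f:
            for record in records[start:start + SHARD_SIZE]:
                f.write(record)
```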
Below are some ref and decode results which you can see are quite odd and seem to have no correlation.
DECODE:
output=Wild Boy Goes About How I Can't Be Really Go For Love
output=State Department defends the campaign of Iran
output=John Deere sails profit - Business Insider
output=to roll for the Perseid meteor shower
output=Man in New York City in Germany
REFERENCE:
output=Battle Chasers: Nightwar Combines Joe Mad's Stellar Art With Solid RPG Gameplay
output=Obama Meets a Goal That Could Literally Destroy America
output=WOW! 10 stunning photos of presidents daughter Zahra Buhari
output=Koko the gorilla jams out on bass with Flea from Red Hot Chili Peppers
output=Brenham police officer refused service at McDonald's
Going to answer this one myself. It seems the issue here was a lack of training data. In the end I did sort my vocab file, although it seems this is not necessary; the reason I did it was to allow the end user to limit the vocab to something like 200k words should they wish.
The biggest reason for the problems above was simply a lack of data. When I ran the training in the original post, I was working with 40k+ articles. I thought this was enough, but clearly it wasn't, and this became even more evident when I got deeper into the code and gained a better understanding of what was going on. In the end I increased the number of articles to over 1.3 million and trained for about a week and a half on my 980GTX; with the average loss down to about 1.6 to 2.2, I was seeing MUCH better results.
I am learning this as I go, but I stopped at the above average loss because some reading I did stated that when you run "eval" against your "test" data, your average loss should be close to what you are seeing in training; when the two are far apart, it's a sign you may be getting close to over-fitting. Again, take this with a grain of salt, as I am still learning, but it seems to make sense logically to me.
One last note that I learned the hard way: make sure you upgrade to the latest 0.11 TensorFlow version. I originally trained using 0.9, but when I went to figure out how to export the model, I found that there was no export.py file in that repo. When I upgraded to 0.11, I found that the checkpoint file structure had changed in 0.11, and I needed to take another two weeks to train. So I would recommend just upgrading, as they have resolved a number of the problems I was seeing during the RC. I still did have to set is_tuple=false, but that aside, all has worked out well. Hope this helps someone.
Having read this article about a guy who uses TensorFlow to sort cucumbers into nine different classes, I was wondering if this type of process could be applied to a large number of classes. My idea would be to use it to identify Lego parts.
At the moment, a site like BrickLink describes more than 40,000 different parts, so it would be a bit different from the cucumber example, but I am wondering if it sounds suitable. There is no easy way to get hundreds of pictures for each part, but does the following process sound feasible:
take pictures of a part;
try to identify the part using TensorFlow;
if it does not identify the correct part, take more pictures and feed the neural network with them;
go on with the next part.
That way, each time we encounter a new piece we "teach" the network its reference so that it can be better recognized the next time. After hundreds of iterations monitored by a human, could we expect TensorFlow to be able to recognize the parts, or at least the most common ones?
My question might sound stupid, but I am not into neural networks, so any advice is welcome. At the moment I have not found any way to identify a Lego part based on pictures, and this "cucumber example" sounds promising, so I am looking for some feedback.
Thanks.
You can read about the work of Jacques Mattheij; he actually uses a customized version of Xception [1] running on https://keras.io/.
The introduction is Sorting 2 Metric Tons of Lego.
In Sorting 2 Tons of Lego, The Software Side, you can read:
The hard challenge to deal with next was to get a training set large enough to make working with 1000+ classes possible. At first this seemed like an insurmountable problem. I could not figure out how to make enough images and to label them by hand in acceptable time, even the most optimistic calculations had me working for 6 months or longer full-time in order to make a data set that would allow the machine to work with many classes of parts rather than just a couple.

In the end the solution was staring me in the face for at least a week before I finally clued in: it doesn’t matter. All that matters is that the machine labels its own images most of the time and then all I need to do is correct its mistakes. As it gets better there will be fewer mistakes. This very rapidly expanded the number of training images. The first day I managed to hand-label about 500 parts. The next day the machine added 2000 more, with about half of those labeled wrong. The resulting 2500 parts were the basis for the next round of training 3 days later, which resulted in 4000 more parts, 90% of which were labeled right! So I only had to correct some 400 parts, rinse, repeat… So, by the end of two weeks there was a dataset of 20K images, all labeled correctly.

This is far from enough, some classes are severely under-represented so I need to increase the number of images for those, perhaps I’ll just run a single batch consisting of nothing but those parts through the machine. No need for corrections, they’ll all be labeled identically.
A recent update is Sorting 2 Tons of Lego, Many Questions, Results.
[1] CHOLLET, François. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv preprint arXiv:1610.02357, 2016.
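As a starting point, a minimal Keras sketch of putting a new classification head on a pretrained Xception backbone might look like the following (the class count, image directory layout, and hyperparameters are assumptions, not details from the posts above):

```python
from keras.applications.xception import Xception, preprocess_input
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.preprocessing.image import ImageDataGenerator

num_classes = 1000   # roughly the "1000+ classes" mentioned in the quoted post

# Pretrained Xception backbone without its ImageNet classification head.
base = Xception(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
x = GlobalAveragePooling2D()(base.output)
predictions = Dense(num_classes, activation="softmax")(x)
model = Model(inputs=base.input, outputs=predictions)

# Freeze the backbone for the first rounds of training on a small labelled set.
for layer in base.layers:
    layer.trainable = False
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# One folder of images per part class, e.g. parts/3001/, parts/3020/, ...
gen = ImageDataGenerator(preprocessing_function=preprocess_input, validation_split=0.1)
train = gen.flow_from_directory("parts/", target_size=(299, 299), subset="training")
val = gen.flow_from_directory("parts/", target_size=(299, 299), subset="validation")
model.fit_generator(train, validation_data=val, epochs=5)
```

Once the frozen-backbone model is reasonable, a common next step is to unfreeze the top few Xception blocks and fine-tune with a lower learning rate.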
I have started this using IBM Watson's Visual Recognition.
I had six different bricks to recognize against the transport belt background.
I am actually thinking about TensorFlow, since I can have it running locally.
The codelab TensorFlow for Poets describes almost exactly what you want to achieve.
For a demo of the Watson version:
https://www.ibm.com/developerworks/community/blogs/ibmandgoogle/entry/Lego_bricks_recognition_with_Watosn_lego_and_raspberry_pi?lang=en