I am trying to deploy a Tomato Classification project on Heroku, but the deployment fails because the size of TensorFlow along with the other files exceeds the 500 MB limit. I also tried using tensorflow-cpu, but it does not seem to have keras.load_model(), which I need to load my model.
Is there any solution to overcome this problem?
Try using tensorflow 2.1.0; it should work.
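For what it's worth, the tensorflow-cpu 2.x wheels are full TensorFlow builds without GPU support, so tf.keras should still be available in them; a minimal sketch, where the version pin, model path, and input shape are assumptions:

# requirements.txt (assumed pin to keep the Heroku slug small):
#   tensorflow-cpu==2.4.1
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("tomato_model.h5")  # model path is an assumption
dummy_batch = np.zeros((1, 224, 224, 3))  # input shape is an assumption
print(model.predict(dummy_batch))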
I am trying to standardize our deployment workflow for machine vision systems, so we came up with the following workflow.
Deployment workflow
We wanted to create a prototype, so we followed this workflow. There is no problem with the GCP side whatsoever, but when we export the models we trained on Vertex AI, it gives us the three formats mentioned in the workflow:
SavedModel
TFLite
TFJS
We then tried to convert these models to ONNX, but failed with different errors.
SavedModel - always getting the same error regardless of the parameters, which is as follows:
Error in SavedModel
I tried to track down the error and found that the model does not even load inside TensorFlow itself (see the loading sketch below), which is weird since it was exported from GCP Vertex AI, which is built on TensorFlow.
TFLite - converts, but there are problems with the ONNX opset; with opset 15 the conversion succeeds, but then the NVIDIA TensorRT ONNX parser doesn't recognize the model during the ONNX-to-TRT conversion.
TFJS - not tried yet.
So we are blocked here due to these problems.
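As a first diagnostic, it may be worth confirming whether the exported SavedModel loads in plain TensorFlow at all before involving any converter; a minimal sketch (the export directory path is an assumption):

import tensorflow as tf

# Try to load the Vertex AI export directly; if this fails, the problem is
# upstream of any ONNX conversion.
model = tf.saved_model.load("exported_model/saved_model")  # path is an assumption
print(list(model.signatures.keys()))  # e.g. ['serving_default']

If that loads cleanly, the usual tf2onnx entry point is the command-line converter (python -m tf2onnx.convert --saved-model <dir> --output model.onnx --opset 13), though which opset works still depends on the ops in the graph.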
We can run the models exported from Vertex AI directly on the Jetson Nano, but the problem is that TF-TRT and TensorFlow are not memory-optimized on the GPU, so the system freezes after 3 to 4 hours of running.
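One mitigation worth trying for the memory side (not a fix for TF-TRT itself, just an assumption-laden sketch) is to stop TensorFlow from grabbing all of the Jetson's shared GPU memory up front, either by enabling memory growth or by capping the allocation; the 2048 MB cap below is an arbitrary assumption:

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Allocate GPU memory on demand instead of all at once.
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # Alternatively, cap the allocation (requires a recent TF 2.x; 2048 MB is an assumption):
    # tf.config.set_logical_device_configuration(
    #     gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=2048)])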
We tried this workflow with a Google Teachable Machine model once and it worked out well; all the steps work perfectly fine. So I am really confused about how to conclude on this workflow, since it works with a Teachable Machine model, which is created by Google, but not with a Vertex AI model, which is developed by the same company.
Or am I doing something wrong in this workflow?
For background: we are developing this workflow inside a C++ framework for a real-time application in an industrial environment.
I am trying to set up a deep learning VM on Google Cloud but I keep running into the same issue over and over again.
I follow all the steps: set up an N1-highmem-8 (8 vCPU, 52 GB memory) instance, add a single T4 GPU, and select the Deep Learning Image: TensorFlow 2.4 (M69, CUDA 11.0). That's it.
After that, I SSH into the VM, run the script that installs all the NVIDIA drivers, and... when I begin using it by simply running
from tensorflow.keras.layers import Input, Dense
i = Input((100,))
x = Dense(500)(i)
I keep getting failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error. At that point I haven't installed anything or done anything custom, just the vanilla image from GCP.
What is more concerning is that, even if I delete the VM and create a new one with the same config, sometimes the error won't happen immediately and sometimes it's there right off the bat.
Has anyone encountered this? I've googled around to see if anyone has faced this issue, and while I came across suggestions, all of them are old and have not worked for me. Moreover, the suggestions on the NVIDIA support forums tell me to reinstall everything, and the whole point of using a pre-built GCP image specifically for deep learning is so that I don't have to enter the hell of installing and resolving issues with NVIDIA drivers.
The issue is fixed in the M74 image, but you are using M69. Follow one of the two fixes provided in the Google Cloud public forum; you can mitigate the issue as follows:
Fix #1: Use the latest DLVM image (M74 or later) in a new VM instance. A fix has been released in the M74 DLVM image, so you will no longer be affected by this issue.
Fix #2: Patch your existing instance running images older than M74.
Run the following via an SSH session on the affected instance:
gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh
sudo service jupyter restart
This only needs to be done once, and does not need to be rerun each time the instance is rebooted.
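After applying either fix, a quick sanity check is to confirm that TensorFlow can actually see the GPU again; a minimal sketch, with nothing custom assumed:

import tensorflow as tf

# Should list the T4 and print a non-empty device name if CUDA initialized correctly.
print(tf.config.list_physical_devices("GPU"))
print(tf.test.gpu_device_name())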
I am using Google Cloud (4 CPU, 15 GB RAM) to host TensorFlow Serving (branch 0.5.1). The model is a pre-trained ResNet which I imported using Keras and converted to .pb format using SavedModelBuilder. I followed the TensorFlow Serving installation and compilation steps as mentioned in the installation docs, and did a bazel build using:
bazel build tensorflow_serving/...
Doing inference on an image from my local machine using a Python client gave me results in approximately 23 seconds. I was able to fine-tune this a bit by following the advice here, replacing the bazel build with the command below to use CPU optimizations. This brought the response time down to 12 seconds.
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma \
--copt=-msse4.2 //tensorflow_serving/model_servers:tensorflow_model_server
Other things I tried, which resulted in no difference in response times:
1. Increased from a 4-CPU to an 8-CPU machine
2. Tried on a Tesla K80 GPU + 4-CPU machine
I haven't tried batch optimization, as I am currently just testing with a single inference request. The configuration doesn't use Docker or Kubernetes.
I'd appreciate any pointers that can help bring down the inference times. Thanks!
Solved and closing this issue. I am now able to get sub-second prediction times. There were multiple problems.
One was the image upload/download time, which was playing a role.
The second was that when I was running on the GPU, TensorFlow Serving wasn't compiled with GPU support. The GPU issue got resolved using the approaches outlined in these links: https://github.com/tensorflow/serving/issues/318 and https://github.com/tensorflow/tensorflow/issues/4841
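In case it helps anyone measuring the same thing: one way to separate image upload/download time from the model compute is to time only the gRPC Predict call. A minimal sketch assuming a recent tensorflow-serving-api; the host, port, model name, and input tensor key are all assumptions:

import time
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("SERVER_IP:8500")  # host and port are assumptions
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "resnet"  # model name is an assumption
request.model_spec.signature_name = "serving_default"
request.inputs["input"].CopyFrom(  # input key is an assumption
    tf.make_tensor_proto(np.random.rand(1, 224, 224, 3).astype(np.float32)))

start = time.time()
response = stub.Predict(request, 10.0)  # 10-second timeout
print("Predict-only latency: %.3f s" % (time.time() - start))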
Environment info
Operating System: El Capitan, 10.11.1
I'm doing this tutorial: https://petewarden.com/2016/09/27/tensorflow-for-mobile-poets/
I'm trying to classify images using TensorFlow in an iOS app.
When I try to build my net using bazel:
bazel build tensorflow/examples/label_image:label_image
I get these errors:
https://gist.github.com/galharth/36b8f6eeb12f847ab120b2642083a732
From the related GitHub issue https://github.com/tensorflow/tensorflow/issues/6487, I think we narrowed it down to a lack of resources on the virtual machine. Bazel tends to get flaky with only 2 GB of RAM allocated to it.
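If giving the VM more RAM isn't an option, a hedged workaround is to limit how much Bazel tries to do in parallel; with the Bazel versions of that era, something like the following kept memory usage down (the exact resource numbers are assumptions):

bazel build --jobs=1 --local_resources=2048,1,1.0 tensorflow/examples/label_image:label_image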
Recently, TensorFlow added the distributed training module. What are the prerequisites for distributed training? I mean an environment like this:
tensorflow >= 0.8, Kubernetes, a shared file system, gcloud?
And they have released the example code:
Is there any way to run the TensorFlow cluster example when we only have HDFS and no shared file system? Where will the model files be stored?
Each computer will need to have TensorFlow installed (and in my experience, they should all be the same version; I had a few issues mixing versions 8 and 9).
Once that is set up, each computer will need access to the code it is to run (main.py for example). We use an NFS to share this, but you could just as easily git pull on each machine to get the latest copy of your code.
Then you just need to start them up. In our most basic setup we would just SSH into each machine, but if you have a cluster manager like Kubernetes, it may be different for you.
As for checkpoints, I believe only the chief worker writes the checkpoint files, if that's what your last question was asking; a rough sketch is below.
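For reference, here is a sketch of the old-style between-graph setup described above; the host names, ports, and HDFS checkpoint path are assumptions, and writing to an hdfs:// logdir requires a TensorFlow build with HDFS support:

import tensorflow as tf

# The same cluster spec is used on every machine; only job_name/task_index differ per process.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
job_name, task_index = "worker", 0  # set per machine (values here are assumptions)
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()
else:
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        global_step = tf.Variable(0, trainable=False, name="global_step")
        # ... build the model and train_op here ...

    # Only the chief (task 0) writes checkpoint files; logdir can be a shared or HDFS path.
    sv = tf.train.Supervisor(is_chief=(task_index == 0),
                             logdir="hdfs://namenode:9000/models/example")
    with sv.managed_session(server.target) as sess:
        pass  # run train_op in a loop here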
Let me know if you have further questions.