Distributed training fails on two machines: InvalidArgumentError - tensorflow
I have two machines, each with 4 GPUs. I use
with tf.device('/job:worker/replica:%d/task:%d/gpu:%d' % (FLAGS.replica_id, FLAGS.task_id, FLAGS.gpu_device_id)):
to place ops on a specific device, but it fails with this error log:
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'init_all_tables': Could not satisfy explicit device specification '/job:worker/replica:1/task:4/device:GPU:0' because no devices matching that specification are registered in this process; available devices: /job:ps/replica:0/task:0/cpu:0,
/job:worker/replica:0/task:0/cpu:0, /job:worker/replica:0/task:0/gpu:0, /job:worker/replica:0/task:0/gpu:1, /job:worker/replica:0/task:0/gpu:2, /job:worker/replica:0/task:0/gpu:3, /job:worker/replica:0/task:1/cpu:0, /job:worker/replica:0/task:1/gpu:0, /job:worker/replica:0/task:1/gpu:1, /job:worker/replica:0/task:1/gpu:2, /job:worker/replica:0/task:1/gpu:3, /job:worker/replica:0/task:2/cpu:0, /job:worker/replica:0/task:2/gpu:0, /job:worker/replica:0/task:2/gpu:1, /job:worker/replica:0/task:2/gpu:2, /job:worker/replica:0/task:2/gpu:3, /job:worker/replica:0/task:4/cpu:0, /job:worker/replica:0/task:4/gpu:0, /job:worker/replica:0/task:4/gpu:1, /job:worker/replica:0/task:4/gpu:2, /job:worker/replica:0/task:4/gpu:3, /job:worker/replica:0/task:5/cpu:0, /job:worker/replica:0/task:5/gpu:0, /job:worker/replica:0/task:5/gpu:1, /job:worker/replica:0/task:5/gpu:2, /job:worker/replica:0/task:5/gpu:3, /job:worker/replica:0/task:6/cpu:0, /job:worker/replica:0/task:6/gpu:0, /job:worker/replica:0/task:6/gpu:1, /job:worker/replica:0/task:6/gpu:2, /job:worker/replica:0/task:6/gpu:3, /job:worker/replica:0/task:7/cpu:0, /job:worker/replica:0/task:7/gpu:0, /job:worker/replica:0/task:7/gpu:1, /job:worker/replica:0/task:7/gpu:2, /job:worker/replica:0/task:7/gpu:3
It seems like TensorFlow can't find machine B? But both machines have exactly the same hardware and software configuration.
The start scripts:
# machine 10.10.102.28
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=0 \
--gpu_device_id=0 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=1 \
--gpu_device_id=1 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=2 \
--gpu_device_id=2 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=0 \
--task_id=3 \
--gpu_device_id=3 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
CUDA_VISIBLE_DEVICES='' ~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
# machine 10.10.102.29
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=4 \
--gpu_device_id=0 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=5 \
--gpu_device_id=1 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=6 \
--gpu_device_id=2 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
~/models/inception/bazel-bin/inception/imagenet_distributed_train \
--batch_size=32 \
--data_dir=/data1/imagenet1k \
--job_name='worker' \
--replica_id=1 \
--task_id=7 \
--gpu_device_id=3 \
--ps_hosts='10.10.102.28:2220' \
--worker_hosts='10.10.102.28:2221,10.10.102.28:2222,10.10.102.28:2223,10.10.102.29:2224,10.10.102.29:2221,10.10.102.29:2222,10.10.102.29:2223,10.10.102.29:2224' &
TL;DR: Don't ever use '/replica:%d' in your device specification.
The problem seems to be in your device string:
'/job:worker/replica:%d/task:%d/gpu:%d' % (FLAGS.replica_id, FLAGS.task_id, FLAGS.gpu_device_id)
The device specification '/replica:%d' is not supported in the open-source version of TensorFlow (but it is retained for some backwards compatibility reasons). The replica ID should be 0 for all tasks. You can solve this immediately by passing 0 as the --replica_id for each task, but you should really remove that flag from your version of the code.
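Concretely, the device string from the question just needs the replica component removed. A minimal sketch using the same flags (the --replica_id flag then becomes unnecessary):

device_str = '/job:worker/task:%d/gpu:%d' % (FLAGS.task_id, FLAGS.gpu_device_id)
with tf.device(device_str):
    pass  # build the per-worker model ops here, exactly as in the original script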
Related
Image Classification Graph model making wrong predictions
I'm using the make_image_classifier Python script to retrain a MobileNetV2 on a new set of images. My end goal is to make predictions in tfjs in the browser. This is exactly what I'm doing:

Step 1: Retrain the model

make_image_classifier \
  --image_dir input_data \
  --tfhub_module https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4 \
  --image_size 224 \
  --saved_model_dir ./trained_model \
  --labels_output_file class_labels.txt \
  --tflite_output_file new_mobile_model.tflite

Step 2: Convert the TF saved model to a graph model using tensorflowjs_converter

tensorflowjs_converter \
  --input_format=tf_saved_model \
  --output_format=tfjs_graph_model \
  --signature_name=serving_default \
  --saved_model_tags=serve \
  trained_model/ \
  web_model/

Step 3: Load the new model in the browser, preprocess an image input and ask the model to make a prediction

const model = tf.loadGraphModel('model.json').then(function(m){
    var img = document.getElementById("img");
    var processed = preprocessImage(img, "mobilenet");
    window.prediction = m.predict(processed);
    window.prediction.print();
});

function preprocessImage(image, modelName){
    let tensor = tf.browser.fromPixels(image)
        .resizeNearestNeighbor([224, 224])
        .toFloat();
    console.log('tensor pro', tensor);
    if (modelName == undefined) {
        return tensor.expandDims();
    }
    if (modelName == "mobilenet") {
        let offset = tf.scalar(127.5);
        console.log('offset', offset);
        return tensor.sub(offset)
            .div(offset)
            .expandDims();
    } else {
        throw new Error("Unknown Model error");
    }
}

I'm getting invalid results. I checked the predictions made by the initial model and they are correct, so either the conversion is not happening properly or I'm not preprocessing the image in the same manner as the initial script. Help.

P.S.: When running the converter I get the following message. Not sure if it's directly relevant to what I'm experiencing:

tensorflow/core/graph/graph_constructor.cc:750 Node 'StatefulPartitionedCall' has 71 outputs but the _output_shapes attribute specifies shapes for 605 outputs. Output shapes may be inaccurate.
make_image_classifier creates a SavedModel geared toward TensorFlow Lite. If you instead want to convert MobileNet to TensorFlow.js, the command to use is given in this answer. Instead of using make_image_classifier, you would need to use retrain.py, which can be downloaded with:

curl -LO https://github.com/tensorflow/hub/raw/master/examples/image_retraining/retrain.py
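As a rough sketch only (flag names from memory, so check python retrain.py --help for the exact spellings in your copy of the script), the retraining step could then look something like:

python retrain.py \
  --image_dir input_data \
  --tfhub_module https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4 \
  --saved_model_dir ./trained_model \
  --output_labels class_labels.txt

The existing tensorflowjs_converter command from the question can then be pointed at ./trained_model as before.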
An error occurred while training DeepLabv3+ using the CityScapes dataset: "data split name train not recognized"
I followed these steps:

1. CityScapes dataset preparation
2. Generate TFRecords of CityScapes
3. Download the pre-trained model
4. Run the official instruction:

python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=1000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size=513 \
    --train_crop_size=513 \
    --train_batch_size=1 \
    --dataset="cityscapes" \
    --tf_initial_checkpoint='/root/newP/official_tf/models-master/research/deeplab/backbone/deeplabv3_cityscapes_train/model.ckpt' \
    --train_logdir='/root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/train' \
    --dataset_dir='/root/dataset/cityscapesScripts/tfrecord'

An error occurred while training DeepLabv3+ with the CityScapes dataset: "data split name train not recognized". After debugging I found the cause: "train" no longer exists in _CITYSCAPES_INFORMATION.splits_to_sizes. The content in the code is:

_CITYSCAPES_INFORMATION = DatasetDescriptor(
    splits_to_sizes={'train_fine': 2975,
                     'train_coarse': 22973,
                     'trainval_fine': 3475,
                     'trainval_coarse': 23473,
                     'val_fine': 500,
                     'test_fine': 1525},
    num_classes=19,
    ignore_label=255,
)

I tried several of the others, "train_fine" and "train_coarse", and a new error occurred: "Total size of new array must be unchanged for image_pooling/weights lh_shape: [(1, 1, 2048, 256)], rh_shape: [(1, 1, 320, 256)]". What modifications should I make?
I found that the latest version of the pre-trained model had a problem; training ran fine when I was not using the pre-trained checkpoint. https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md
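For reference, a sketch of the command from the question adjusted accordingly: a split name that actually exists in splits_to_sizes, and no --tf_initial_checkpoint, so the shape mismatch from the incompatible pre-trained weights is avoided (training then starts from scratch):

python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=1000 \
    --train_split="train_fine" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size=513 \
    --train_crop_size=513 \
    --train_batch_size=1 \
    --dataset="cityscapes" \
    --train_logdir='/root/newP/official_tf/models-master/research/deeplab/exp/train_on_train_set/train' \
    --dataset_dir='/root/dataset/cityscapesScripts/tfrecord'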
Adding sqlite Qt5 plugin in Yocto
Following this answer, I'm trying to add the sqlite (sqlite3) Qt5 plugin that I forgot to enable during the last Yocto build. Here is what I did: under my own custom layer (meta-custom-layer/recipes-core) I added a file qtbase_%.bbappend. Inside I put:

PACKAGECONFIG_append = " sql-sqlite"
PACKAGECONFIG[sql-sqlite] = "-sql-sqlite,-no-sql-sqlite,sqlite3"

Then I deleted the tmp folder and issued bitbake qtbase. I didn't remove the sstate-cache because I added something rather than removed or changed anything. After parsing the recipes it successfully rebuilt the tmp folder, but I cannot find anything related to the requested plugin (it should be libqsqlite.so). I didn't understand the answer provided in the link above. What is the right method to add this plugin?

UPDATE

To be sure there is nothing else to tune, here are the contents of the image bb file:

SUMMARY = "blabla"
LICENSE = "Proprietary"

include recipes-st/images/st-image.inc

inherit core-image distro_features_check

CONFLICT_DISTRO_FEATURES = "x11 wayland"

IMAGE_LINGUAS = "en-us"

IMAGE_FEATURES += "splash package-management ssh-server-dropbear"

IMAGE_ROOTFS_MAXSIZE = ""

IMAGE_QT_MANDATORY_PART = " \
    qtbase \
    qtbase-plugins \
    qtbase-tools \
"

IMAGE_QT_OPTIONAL_PART = " \
    qtserialport \
"

CORE_IMAGE_EXTRA_INSTALL += " \
    systemd-networkd-configuration \
    \
    packagegroup-framework-tools-core-base \
    packagegroup-framework-tools-kernel-base \
    packagegroup-framework-tools-network-base \
    packagegroup-framework-tools-python2-base \
    packagegroup-framework-tools-python3-base \
    \
    packagegroup-framework-tools-core \
    packagegroup-framework-tools-kernel \
    packagegroup-framework-tools-network \
    packagegroup-framework-tools-python2 \
    packagegroup-framework-tools-python3 \
    \
    packagegroup-core-eclipse-debug \
    \
    ${IMAGE_QT_MANDATORY_PART} \
    ${IMAGE_QT_OPTIONAL_PART} \
"

and here are the contents of the RDEPENDS_${PN} variable in layers/meta-qt5/recipes-qt/packagegroups/packagegroup-qt5-toolchain-target.bb:

RDEPENDS_${PN} += " \
    packagegroup-core-standalone-sdk-target \
    libsqlite3-dev \
    qtbase-dev \
    qtbase-mkspecs \
    qtbase-plugins \
    qtbase-staticdev \
    qtconnectivity-dev \
    qtconnectivity-mkspecs \
    qtmqtt-dev \
    qtmqtt-mkspecs \
    qtserialport-dev \
    qtserialport-mkspecs \
    qtserialbus-dev \
    qtserialbus-mkspecs \
    qtsystems-dev \
    qtsystems-mkspecs \
    qttools-dev \
    qttools-mkspecs \
    qttools-staticdev \
    qtwebsockets-dev \
    qtwebsockets-mkspecs \
    qtwebchannel-dev \
    qtwebchannel-mkspecs \
"
The PACKAGECONFIG is already defined in qtbase:

PACKAGECONFIG[sql-sqlite] = "-sql-sqlite -system-sqlite,-no-sql-sqlite,sqlite3"

Your problem is most likely that you are redefining it (incorrectly, as you can see). You do not have to define a new PACKAGECONFIG; just enable it with:

PACKAGECONFIG_append = " sql-sqlite"
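So the entire qtbase_%.bbappend from the question can be reduced to a single line (the PACKAGECONFIG[sql-sqlite] definition should not be repeated there):

# meta-custom-layer/recipes-core/qtbase_%.bbappend
PACKAGECONFIG_append = " sql-sqlite"

After rebuilding qtbase, the sqlite SQL driver plugin (libqsqlite.so) should then be built and packaged.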
Converting my own image data to TFRecords
Now I am practicing converting my own image data to TFRecords for TensorFlow. I am really new to TensorFlow, so I just modified build_image_data.py, which I got from GitHub. This is part of the original code:

bazel-bin/inception/build_image_data \
  --train_directory="${TRAIN_DIR}" \
  --validation_directory="${VALIDATION_DIR}" \
  --output_directory="${OUTPUT_DIRECTORY}" \
  --labels_file="${LABELS_FILE}" \
  --train_shards=128 \
  --validation_shards=24 \
  --num_threads=8

And I replaced it with:

# convert the data.
bazel-bin/inception/build_image_data \
  --train_directory=("C:/Dataset/Training data")
  --validation_directory=("C:/Dataset/Test data")
  --output_directory=("C:/Dataset/Trf")
  --labels_file="C:/Dataset/Labels file"
  --train_shards=128
  --validation_shards=24
  --num_threads=8

But I got the following error:

File "<ipython-input-12-4e5ff554c85f>", line 90
    bazel-bin/inception/build_image_data --train_directory=("C:/Dataset/Training data")
    ^
SyntaxError: can't assign to operator

Could someone help me, please? Thanks.
Just remove the parentheses around the paths:

bazel-bin/inception/build_image_data \
  --train_directory="C:/Dataset/Training data" \
  --validation_directory="C:/Dataset/Test data" \
  --output_directory="C:/Dataset/Trf" \
  --labels_file="C:/Dataset/Labels file" \
  --train_shards=128 \
  --validation_shards=24 \
  --num_threads=8
How to send audio to the HIKVISION API for two-way audio?
I need help understanding why, when sending audio to the camera, it sounds ugly and very fast. The camera is configured with the G711 u-law audio codec. This is the process I am following:

I download a WAV audio file and convert it to the codec the camera is configured for; these are all the conversions I tried:

ffmpeg -i padrino.wav -acodec pcm_mulaw -ar 8000 -ac 1 -b:a 32k output.wav
ffmpeg -i padrino.wav -acodec pcm_mulaw -ar 8000 -ac 2 -b:a 32000 output.wav
ffmpeg -i padrino.wav -f mulaw -acodec pcm_mulaw -ac 1 output.wav
ffmpeg -i padrino.wav -ar 8000 -ac 1 -ab 64k -f mulaw output.ulaw

I turn on two-way audio; "data.xml" contains the XML that enables it:

curl -H "application/xml" -X PUT -d @data.xml USER:PASS@IPCAM/ISAPI/System/...hannels/1/open

Then I send the audio through curl:

curl -H "application/binary" -X PUT -d @output.ulaw USER:PASS@IPCAM/ISAPI/System/...ls/1/audioData

or

curl -H "application/binary" -X PUT -d @output.wav USER:PASS@IPCAM/ISAPI/System/...ls/1/audioData

The audio is heard on the camera, but as I explained at the beginning it sounds wrong: distorted and very fast. What am I doing wrong? Regards.
I have found out why this is: it has nothing to do with the encoding. I wrote a C# app to test this, and if you send the data at the rate expected (8000 samples per second), it plays correctly. I send the audio data in packets (currently 160 bytes; I'm experimenting with the optimum value, but it does not seem to matter much as long as the delay is correct) and delay for the appropriate amount of time before sending again, so that the correct number of samples is sent per second.
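To make the pacing concrete: G.711 u-law at 8000 Hz is one byte per sample, so a 160-byte packet covers 160 / 8000 = 0.02 s, and the sender should sleep roughly that long between packets. A minimal Python sketch of that idea (send is a placeholder for whatever socket or HTTP stream actually writes to the camera):

import time

SAMPLE_RATE = 8000            # u-law: 8000 one-byte samples per second
CHUNK = 160                   # bytes per packet, i.e. 20 ms of audio
DELAY = CHUNK / SAMPLE_RATE   # 0.02 s between packets

def stream(ulaw_data, send):
    # Pace the upload so the camera receives audio in real time.
    for i in range(0, len(ulaw_data), CHUNK):
        send(ulaw_data[i:i + CHUNK])
        time.sleep(DELAY)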
I found an interesting project on GitHub that helped me create this simple app that sends the audio to the camera using Python:

import urllib.request
import requests
import socket
import time


class SocketGrabber:
    """ A horrible hack, so as to allow us to recover
        the socket we still need from urllib """

    def __init__(self):
        self.sock = None

    def __enter__(self):
        self._temp = socket.socket.close
        socket.socket.close = lambda sock: self._close(sock)
        return self

    def __exit__(self, type, value, tb):
        socket.socket.close = self._temp
        if tb is not None:
            self.sock = None

    def _close(self, sock):
        if sock._closed:
            return
        if self.sock == sock:
            return
        if self.sock is not None:
            self._temp(self.sock)
        self.sock = sock


audio_file = "output.ulaw"
ip = "IPCAM"
username = "USER"
password = "PASS"
index = 1
base = f"http://{ip}"
chunksize = 128
sleep_time = 1.0 / 64

base_url = f"http://{username}:{password}@{ip}"
req = requests.put(
    f"{base_url}/ISAPI/System/TwoWayAudio/channels/{index}/open")

mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, [base], username, password)
auth = urllib.request.HTTPDigestAuthHandler(mgr)
opener = urllib.request.build_opener(auth)

audiopath = f"{base}/ISAPI/System/TwoWayAudio/channels/{index}/audioData"
with SocketGrabber() as sockgrab:
    req = urllib.request.Request(audiopath, method='PUT')
    resp = opener.open(req)
    output = sockgrab.sock


def frames_yield(ulaw_data, chunksize=128):
    for i in range(0, len(ulaw_data), chunksize):
        for x in [ulaw_data[i:i + chunksize]]:
            tosend = x + (b'\xff' * (chunksize - len(x)))
            time.sleep(sleep_time)
            yield tosend


with open(audio_file, 'rb') as file_obj:
    ulaw_data = file_obj.read()
    for dataframe in frames_yield(ulaw_data, chunksize):
        output.send(dataframe)