How to increase object tracking speed (YOLOv4 DeepSORT) - TensorFlow

I recently became interested in computer vision and went through a couple of tutorials in order to solve a few problems in my business. There are 4 buildings on the site, and the task is to count the number of people entering and leaving them, and also to account for the movement of service staff.
I tried using the following repository to solve these problems:
https://github.com/theAIGuysCode/yolov4-deepsort
It seems to me that this will solve my problem, but now there is the question of processing speed for the CCTV recordings. I tried running a ten-second video fragment and the script took 211 seconds to complete, which in my opinion is very slow.
What can I do to improve the processing speed?
Please tell me where to look for the answer.
Error when trying to install OpenVINO:
Building wheels for collected packages: tokenizers
Building wheel for tokenizers (pyproject.toml) ... error
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3.6 /home/baurzhan/.local/lib/python3.6/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmpr56p_xyt
cwd: /tmp/pip-install-pfzy6bre/tokenizers_dce0e65cae1e4e7c9325570d12cd6d63
Complete output (51 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/tokenizers
copying py_src/tokenizers/__init__.py -> build/lib.linux-x86_64-3.6/tokenizers
creating build/lib.linux-x86_64-3.6/tokenizers/models
copying py_src/tokenizers/models/__init__.py -> build/lib.linux-x86_64-3.6/tokenizers/models
creating build/lib.linux-x86_64-3.6/tokenizers/decoders
copying py_src/tokenizers/decoders/__init__.py -> build/lib.linux-x86_64-3.6/tokenizers/decoders
creating build/lib.linux-x86_64-3.6/tokenizers/normalizers
copying py_src/tokenizers/normalizers/__init__.py -> build/lib.linux-x86_64-3.6/tokenizers/normalizers
creating build/lib.linux-x86_64-3.6/tokenizers/pre_tokenizers
copying py_src/tokenizers/pre_tokenizers/__init__.py -> build/lib.linux-x86_64-3.6/tokenizers/pre_tokenizers
creating build/lib.linux-x86_64-3.6/tokenizers/processors
copying py_src/tokenizers/processors/__init__.py -> build/lib.linux-x86_64-3.6/tokenizers/processors
creating build/lib.linux-x86_64-3.6/tokenizers/trainers
copying py_src/tokenizers/trainers/__init__.py -> build/lib.linux-x86_64-3.6/tokenizers/trainers
creating build/lib.linux-x86_64-3.6/tokenizers/implementations
copying py_src/tokenizers/implementations/sentencepiece_unigram.py -> build/lib.linux-x86_64-3.6/tokenizers/implementations
copying py_src/tokenizers/implementations/base_tokenizer.py -> build/lib.linux-x86_64-3.6/tokenizers/implementations
copying py_src/tokenizers/implementations/bert_wordpiece.py -> build/lib.linux-x86_64-3.6/tokenizers/implementations
copying py_src/tokenizers/implementations/__init__.py -> build/lib.linux-x86_64-3.6/tokenizers/implementations
copying py_src/tokenizers/implementations/sentencepiece_bpe.py -> build/lib.linux-x86_64-3.6/tokenizers/implementations
copying py_src/tokenizers/implementations/byte_level_bpe.py -> build/lib.linux-x86_64-3.6/tokenizers/implementations
copying py_src/tokenizers/implementations/char_level_bpe.py -> build/lib.linux-x86_64-3.6/tokenizers/implementations
creating build/lib.linux-x86_64-3.6/tokenizers/tools
copying py_src/tokenizers/tools/visualizer.py -> build/lib.linux-x86_64-3.6/tokenizers/tools
copying py_src/tokenizers/tools/__init__.py -> build/lib.linux-x86_64-3.6/tokenizers/tools
copying py_src/tokenizers/__init__.pyi -> build/lib.linux-x86_64-3.6/tokenizers
copying py_src/tokenizers/models/__init__.pyi -> build/lib.linux-x86_64-3.6/tokenizers/models
copying py_src/tokenizers/decoders/__init__.pyi -> build/lib.linux-x86_64-3.6/tokenizers/decoders
copying py_src/tokenizers/normalizers/__init__.pyi -> build/lib.linux-x86_64-3.6/tokenizers/normalizers
copying py_src/tokenizers/pre_tokenizers/__init__.pyi -> build/lib.linux-x86_64-3.6/tokenizers/pre_tokenizers
copying py_src/tokenizers/processors/__init__.pyi -> build/lib.linux-x86_64-3.6/tokenizers/processors
copying py_src/tokenizers/trainers/__init__.pyi -> build/lib.linux-x86_64-3.6/tokenizers/trainers
copying py_src/tokenizers/tools/visualizer-styles.css -> build/lib.linux-x86_64-3.6/tokenizers/tools
running build_ext
running build_rust
error: can't find Rust compiler
If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.
To update pip, run:
pip install --upgrade pip
and then retry package installation.
If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
----------------------------------------
ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

Have you had a chance to measure the time spent in the different sections of your code, in order to find out where the bottleneck is?
There are ways to improve grabbing and capturing frames from the camera, ways to improve decoding a compressed frame (e.g. from H.264 to a raw format), and ways to avoid copying the pixel data from the GPU video codec to the inference engine via GPU zero-copy mechanisms.
Then there is the inference itself: there are many different pre-trained object/person/pedestrian detection models. Models can be optimized (e.g. via OpenVINO's tools) to run well on a specific accelerator (CPU, GPU, VPU, FPGA), for example by using INT8-optimized CPU instruction sets.
Models can also be compressed and checked for sparsity.
Finally there is the post-processing: filtering the many detected objects (by confidence, and for overlaps via NMS) and then the tracking.
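If you have not measured that yet, a minimal timing sketch for the loop could look like the following; detect() and track() are placeholders for whatever your detection and tracking calls actually are, not functions from the repo:

import time
import cv2

def detect(frame):
    # placeholder for your YOLOv4 inference call
    return []

def track(detections, frame):
    # placeholder for your DeepSORT update call
    return []

cap = cv2.VideoCapture("clip.mp4")  # hypothetical ten-second test clip
timings = {"grab": 0.0, "detect": 0.0, "track": 0.0}
frames = 0

while True:
    t0 = time.perf_counter()
    ok, frame = cap.read()              # grab + decode one frame
    if not ok:
        break
    t1 = time.perf_counter()
    detections = detect(frame)
    t2 = time.perf_counter()
    tracks = track(detections, frame)
    t3 = time.perf_counter()
    timings["grab"] += t1 - t0
    timings["detect"] += t2 - t1
    timings["track"] += t3 - t2
    frames += 1

if frames:
    for stage, total in timings.items():
        print(f"{stage}: {total / frames * 1000:.1f} ms/frame")

Whichever stage dominates is the one worth attacking first.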

DISCLAIMER: I am the creator and maintainer of https://github.com/mikel-brostrom/Yolov5_StrongSORT_OSNet.
I see no export capabilities in that repository to any CPU-friendly framework such as ONNX or OpenVINO. Achieving fast inference on CPU is conditional on having models that run fast on CPU.
You can easily achieve 10 FPS on CPU using my repo. Some sample results run on CPU (Intel® Core™ i7-8850H CPU @ 2.60GHz × 12) with OpenVINO models:
0: 640x640 2 persons, 1 chair, 1 refrigerator, Done. YOLO:(0.079s), StrongSORT:(0.025s)
0: 640x640 2 persons, 1 chair, 1 refrigerator, Done. YOLO:(0.080s), StrongSORT:(0.022s)
0: 640x640 2 persons, 1 chair, 1 refrigerator, Done. YOLO:(0.075s), StrongSORT:(0.022s)
0: 640x640 2 persons, 1 chair, 1 refrigerator, Done. YOLO:(0.078s), StrongSORT:(0.022s)
0: 640x640 2 persons, 1 chair, 1 refrigerator, Done. YOLO:(0.080s), StrongSORT:(0.022s)
0: 640x640 2 persons, 1 chair, 1 refrigerator, Done. YOLO:(0.112s), StrongSORT:(0.022s)
0: 640x640 2 persons, 1 chair, 1 refrigerator, Done. YOLO:(0.083s), StrongSORT:(0.022s)
0: 640x640 2 persons, 1 chair, 1 refrigerator, Done. YOLO:(0.078s), StrongSORT:(0.022s)
0: 640x640 2 persons, 1 chair, 1 refrigerator, Done. YOLO:(0.078s), StrongSORT:(0.024s)
0: 640x640 2 persons, 1 chair, 1 refrigerator, Done. YOLO:(0.085s), StrongSORT:(0.023s)
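As a rough illustration of why a CPU-friendly export helps (this is a generic sketch, not the exact pipeline from the repo; the model path and the 640x640 input size are assumptions), an exported ONNX detector can be run on CPU with ONNX Runtime in a few lines:

import numpy as np
import onnxruntime as ort

# Hypothetical exported detector file; confidence filtering and NMS still
# have to be applied to the raw output afterwards.
sess = ort.InferenceSession("yolov5m.onnx", providers=["CPUExecutionProvider"])

img = np.random.rand(1, 3, 640, 640).astype(np.float32)  # stand-in for a preprocessed frame
input_name = sess.get_inputs()[0].name
outputs = sess.run(None, {input_name: img})
print([o.shape for o in outputs])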

Related

The alternative for NCCL on Windows 10

I am on Windows 10 and am using multiple GPUs to train a machine learning model (a GAN); you can check the full code over here:
Here I get to the point where the gradients from the different GPU devices need to be summed, as follows:
if len(devices) > 1:
    with tf.name_scope('SumAcrossGPUs'), tf.device(None):
        for var_idx, grad_shape in enumerate(self._grad_shapes):
            g = [dev_grads[dev][var_idx][0] for dev in devices]
            if np.prod(grad_shape):  # nccl does not support zero-sized tensors
                g = tf.contrib.nccl.all_sum(g)
            for dev, gg in zip(devices, g):
                dev_grads[dev][var_idx] = (gg, dev_grads[dev][var_idx][1])
In this part I get an error regarding NCCL, which I noticed is not supported on Windows (it needs Linux), so I am stuck here. What is the workaround? How can I manage to use NCCL on Windows, or what is an alternative to the code above? Is there any simple way to do that? Thanks in advance.
Note: I have already checked some Stack Overflow questions, but no existing answer solves my problem.
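(No accepted answer is quoted here, but a commonly suggested workaround is to drop the NCCL all-reduce and do the reduction with plain TensorFlow ops: sum the per-GPU gradients on one device with tf.add_n and reuse the result on every device. Applied to the snippet above it would look roughly like this; it is slower than NCCL but does not require it.)

if len(devices) > 1:
    with tf.name_scope('SumAcrossGPUs'), tf.device(None):
        for var_idx, grad_shape in enumerate(self._grad_shapes):
            g = [dev_grads[dev][var_idx][0] for dev in devices]
            if np.prod(grad_shape):
                # Workaround sketch: sum on the first device instead of
                # tf.contrib.nccl.all_sum, then reuse the same summed tensor
                # for every device's gradient entry.
                with tf.device(devices[0]):
                    g_sum = tf.add_n(g)
                g = [g_sum] * len(devices)
            for dev, gg in zip(devices, g):
                dev_grads[dev][var_idx] = (gg, dev_grads[dev][var_idx][1])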

How to run TensorFlow on multiple nodes with several CPUs each

I want to run linear regression with TensorFlow on very large datasets. I have a cluster with 9 nodes and 36 CPUs each. What is the best way to distribute the computations across all the resources available?
According to this course https://www.coursera.org/learn/intro-tensorflow, the best way to use TensorFlow in a distributed setting is to use Estimators. So I wrote my code as suggested there and followed the instructions at https://www.tensorflow.org/deploy/distributed for the parallelisation. I then tried to run my script my_code.py (on a "small" dataset with 120 million data points and 2 feature columns, to test the code) on nodes 2 and 3 as follows:
python my_code.py \
    --ps_hosts=node1:2222 \
    --worker_hosts=node2:2222,node3:2222 \
    --job_name=worker \
    --task_index="i-2"
where i is the number of the node (either 2 or 3); on node 1 I do the same but with --job_name=ps and --task_index=0. However, this way it seems that only one CPU per node is used. Do I need to specify each CPU individually?
Thank you in advance.
As far as I understand, the best thing to do is to use all the CPUs on the same node together as a single worker, in order to make the most of the shared memory. So for example in the case above, one would have to specify manually only 9 workers and make sure that each of them corresponds to one node where all the 36 CPUs are used. The commands to do this depend on the specific cluster used.
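If only one core per node appears busy, it is also worth checking the session's threading settings before changing the cluster layout. A hedged TF 1.x sketch (the thread counts are just examples for a 36-core node, and the feature columns are placeholders standing in for the two columns mentioned in the question):

import tensorflow as tf

# Placeholder feature columns for the two features in the question.
feature_columns = [
    tf.feature_column.numeric_column("x1"),
    tf.feature_column.numeric_column("x2"),
]

# Example thread settings for a 36-core node; tune these for your workload.
session_config = tf.ConfigProto(
    intra_op_parallelism_threads=36,  # threads used inside a single op (e.g. a matmul)
    inter_op_parallelism_threads=2)   # ops allowed to run concurrently

run_config = tf.estimator.RunConfig(session_config=session_config)
estimator = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    config=run_config)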

TensorBoard doesn't show all data points

I was running a very long training (reinforcement learning with 20M steps) and writing a summary every 10k steps. Between step 4M and 6M, I saw 2 peaks in my TensorBoard scalar chart for game score, then I let it run and went to sleep. In the morning, it was running at about step 12M, but the peaks between step 4M and 6M that I saw earlier had disappeared from the chart. I tried to zoom in and found out that TensorBoard (randomly?) skipped some of the data points. I also tried to export the data, but some data points, including the peaks, are also missing from the exported .csv.
I looked for answers and found this in TensorFlow github page:
TensorBoard uses reservoir sampling to downsample your data so that it can be loaded into RAM. You can modify the number of elements it will keep per tag in tensorboard/backend/server.py.
Has anyone ever modified this server.py file? Where can I find the file and if I installed TensorFlow from source, do I have to recompile it after I modified the file?
You don't have to change the source code for this; there is a flag called --samples_per_plugin.
Quoting from the help output:
--samples_per_plugin: An optional comma separated list of plugin_name=num_samples pairs to explicitly
specify how many samples to keep per tag for that plugin. For unspecified plugins, TensorBoard
randomly downsamples logged summaries to reasonable values to prevent out-of-memory errors for long
running jobs. This flag allows fine control over that downsampling. Note that 0 means keep all
samples of that type. For instance, "scalars=500,images=0" keeps 500 scalars and all images. Most
users should not need to set this flag.
(default: '')
So if you want to have a slider of 100 images, use:
tensorboard --samples_per_plugin images=100
The comment is out of date - it can actually be modified in tensorboard/backend/application.py, in the "Default Size Guidance". By default, it stores 1000 scalars. You can increase that limit arbitrarily, or set it to 0 to store every scalar.
You don't need to recompile TensorBoard, or even download it from source. You could just modify this file in your TensorBoard yourself.
If you install TensorFlow using pip in virtualenv (ubuntu, mac), then within your virtualenv directory the path to application.py should be something like lib/python2.7/site-packages/tensorflow/tensorboard/backend. If you modify that file, you should get the new setting in your tensorboard (when you run tensorboard in that virtualenv). If you're like me, you'll put a print statement too so you can be sure that you're running modified code :)
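For reference, if you do go the source-editing route, the edit is a one-line change to that size-guidance dictionary. Very roughly (the exact key names, surrounding entries and default values differ between TensorBoard versions, so treat this purely as an illustration):

# tensorboard/backend/application.py -- location and contents vary by version
# (event_accumulator comes from TensorBoard's own imports in that file)
DEFAULT_SIZE_GUIDANCE = {
    # ... entries for other tag types (images, histograms, ...) ...
    event_accumulator.SCALARS: 0,  # default was 1000; 0 means keep every scalar
}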

Graph dependencies in tensorflow: how to validate that dependencies exist or not?

op1 = tf.image.random_brightness(placeholder_img3d_float32, max_delta=...)
op2 = tf.image.random_contrast(placeholder_img3d_float32, lower=..., upper=...)
op3 = tf.image.per_image_standardization(placeholder_img3d_float32)
If I defined these 3 ops, and then I run:
sess.run(op1, ...)
sess.run(op2, ...)
sess.run(op3, ...)
vs. running: sess.run([op1, op2, op3], ...)
Would I have executed all 3 ops 3 times? Or are they all independent, so that each of the 3 runs executed just the op I requested?
How should I validate graph dependency questions like this?
Update:
The TensorBoard graph of those 3 ops looks like there are no dependencies between them, but the local_placeholder shown in the top right has 5 outputs, at least one of which feeds each of the 3 ops here. Does that mean that when I feed the placeholder it will run the 3 ops, or is the lack of dependencies shown in the graph telling me that although the placeholder is common, there are no dependencies between the ops and only the op I call will be processed?
In a session you can ask to run all 3 operations at the same time, and TensorFlow will automatically resolve the dependencies.
Say your 3rd operation depends on the 2nd and the 2nd depends on the 1st: if you ask for the 3rd operation, the session will run the 1st operation first, fill in the remaining dependencies, and only then compute what you requested. Conversely, a sess.run only executes the ops its fetches actually depend on, so in your case each of the three calls runs just the op you asked for; sharing the placeholder does not make the other consumers run.
In the TensorBoard graph you can observe the dependencies nicely: each gray line shows data flowing between two operations, and dotted lines show control dependencies.
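If you want to validate this programmatically rather than from the picture, you can walk the graph yourself. A small TF 1.x sketch (the placeholder shape and the parameter values are made up for the example):

import tensorflow as tf

placeholder_img3d_float32 = tf.placeholder(tf.float32, shape=[None, None, 3])
op1 = tf.image.random_brightness(placeholder_img3d_float32, max_delta=0.2)
op2 = tf.image.random_contrast(placeholder_img3d_float32, lower=0.8, upper=1.2)
op3 = tf.image.per_image_standardization(placeholder_img3d_float32)

# Direct inputs of the op that produces each fetched tensor:
for t in (op1, op2, op3):
    print(t.op.name, "<-", [inp.name for inp in t.op.inputs])

# All ops that consume the placeholder (the "outputs" TensorBoard shows on it):
print([c.name for c in placeholder_img3d_float32.consumers()])

Having several consumers only means the placeholder could feed several ops; sess.run still computes only the subgraph needed for the fetches you pass.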

tensorflow distributed training w/ estimator + experiment framework

Hi, I have a weird situation when trying to use the Estimator + Experiment classes for distributed training.
Here's an example: https://gist.github.com/protoget/2cf2b530bc300f209473374cf02ad829
This is a simple case that uses:
- DNNClassifier from the TF official tutorial
- the Experiment framework
- 1 worker and 1 ps on the same host with different ports.
What happens is:
1) when I start the ps job, it looks good:
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9000}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:9001}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:9000
2) when I start the worker job, it silently exits, leaving no log at all.
Eagerly seeking help.
I had the same problem and I finally found the solution.
The problem is in config._environment:
config = {"cluster": {'ps': ['127.0.0.1:9000'],
'worker': ['127.0.0.1:9001']}}
if args.type == "worker":
config["task"] = {'type': 'worker', 'index': 0}
else:
config["task"] = {'type': 'ps', 'index': 0}
os.environ['TF_CONFIG'] = json.dumps(config)
config = run_config.RunConfig()
config._environment = run_config.Environment.CLOUD
Set config._environment to Environment.CLOUD.
Then you can have a distributed training system.
I hope it makes you happy :)
I have the same issue; I guess it is due to some internal TensorFlow code. I've already opened a question on SO for this: TensorFlow: minimalist program fails on distributed mode.
I also opened an issue: https://github.com/tensorflow/tensorflow/issues/8796.
There are two options to solve your issue. As this is due to your ClusterSpec having an implicit local environment, you could try setting another one (either google or cloud), but I cannot assure you that the rest of your work won't be impacted. So I preferred to have a glance at the code and try to fix it myself for local mode, which is what I explain below.
You'll find more precise explanations of why it fails in those posts. The fact is that Google has been pretty silent so far, so what I did was patch their source code (in tensorflow/contrib/learn/python/learn/experiment.py):
# Start the server, if needed. It's important to start the server before
# we (optionally) sleep for the case where no device_filters are set.
# Otherwise, the servers will wait to connect to each other before starting
# to train. We might as well start as soon as we can.
config = self._estimator.config
if (config.environment != run_config.Environment.LOCAL and
        config.environment != run_config.Environment.GOOGLE and
        config.cluster_spec and config.master):
    self._start_server()
(This part prevents the server from starting in local mode, which is what you get if you set no environment in your cluster spec, so you should simply comment out the condition config.environment != run_config.Environment.LOCAL and, and that should work.)