How do I perform sequence labeling in Caffe using LSTM

I have looked at the LRCN example (http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-sequences.pdf), which uses an LSTM for classification. For video classification, a majority vote is taken. Why is that? I would assume one waits until the end of a sequence?
In my toy example, binary counting, I fed in the labels in two different ways.
First, I labelled every timestep with the sequence label. Second, I labelled every timestep but the last with an ignore_label. For simplicity I used a sequence length of 50 and a batch size of 50 as well.
Both approaches lead to a network that, when I deploy it, produces the same output for every timestep.
Edit:
The toy example works if, instead of classifying a whole sequence, one predicts the next number. Then a label exists for each number. This is no solution for a real-world sequence classification task. Using the post by Karpathy (http://karpathy.github.io/2015/05/21/rnn-effectiveness/) I have created the following network:
name: "BasicLstm"
layer {
name: "data"
type: "HDF5Data"
top: "data"
top: "cont"
top: "label"
include {
phase: TRAIN
}
hdf5_data_param {
source: "./path_to_txt.txt"
batch_size: 2000
}
}
layer {
name: "lstm1"
type: "LSTM"
bottom: "data"
bottom: "cont"
top: "lstm1"
recurrent_param {
num_output: 5
weight_filler {
type: "uniform"
min: -0.08
max: 0.08
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "lstm2"
type: "LSTM"
bottom: "lstm1"
bottom: "cont"
top: "lstm2"
recurrent_param {
num_output: 4
weight_filler {
type: "uniform"
min: -0.08
max: 0.08
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "predict"
type: "InnerProduct"
bottom: "lstm2"
top: "predict"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
inner_product_param {
num_output: 39
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
axis: 2
}
}
layer {
name: "softmax_loss"
type: "SoftmaxWithLoss"
bottom: "predict"
bottom: "label"
top: "loss"
loss_weight: 20
softmax_param {
axis: 2
}
loss_param {
ignore_label: -1
}
}
The important parts of the solver (I have already played a little with lr_policy: "inv", but in the end I tried it with "fixed"):
net: "Basic.prototxt"
test_initialization: false
base_lr: 0.001
momentum: 0.9
lr_policy: "fixed"
display: 50
max_iter: 1000000
solver_mode: GPU
No sequence ranges over 2000 timesteps.
I have put 10 sequences side by side.
I embedded my data in a one-hot vector of size 132.
My data HDF5 file has these dimensions: XX*10*132*1
My data has a label at the end of each sequence. Every other label is -1 and will be ignored during backpropagation.
To increase efficiency I packed multiple short sequences together (each is below 2000 timesteps).
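A minimal sketch of how such a file could be packed (assuming h5py; the sizes and the class id are illustrative):
import h5py
import numpy as np

T, N, K = 2000, 10, 132                          # timesteps, parallel streams, one-hot size
data = np.zeros((T, N, K, 1), dtype=np.float32)  # one-hot encoded inputs
cont = np.ones((T, N), dtype=np.float32)         # 1 = sequence continues, 0 = sequence start
label = -np.ones((T, N), dtype=np.float32)       # -1 everywhere -> ignored by the loss

# For each short sequence of length L packed into stream n at row t0:
#   cont[t0, n] = 0                  # reset the LSTM state at the sequence start
#   label[t0 + L - 1, n] = class_id  # supervise only the last timestep
with h5py.File('train.h5', 'w') as f:
    f['data'] = data
    f['cont'] = cont
    f['label'] = label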
For classification I have used the Python interface. When I classify a sequence, I run the following:
import caffe
import matplotlib.pyplot as plt

# model/weight file names are placeholders; 'data' and 'cont'
# are assumed to be prepared as described above
net = caffe.Net('Basic_deploy.prototxt', 'Basic.caffemodel', caffe.TEST)

net.blobs['data'].reshape(726, 1, 132, 1)
net.blobs['cont'].reshape(726, 1)
net.blobs['data'].data[...] = data
net.blobs['cont'].data[...] = cont
output = net.forward()
output_prob = output['prob']  # assumes the deploy net ends in a Softmax layer named 'prob'
for i in range(726):
    plt.plot(output_prob[i][0])
In the image one can see that the same probabilities have been calculated for every timestep.

Related

First TensorFlow object detection model - from image to .pb (finally to MyriadX blob for OAK-D) - rookie questions

This is my first model; I'm new to Python and this is my second post on Stack Overflow, so please let me know if there is anything I should elaborate on, and keep in mind there could be an easy solution to my problems.
EDIT - I have found a way to test the model on images after training (when it's a .pb file) and this gives a great result, so I'm right now focusing the question on the last part: the conversion to IR and blob. EDIT
Problem to solve with the model:
The easiest way to explain it is with an example. Imagine a parking lot with 8 parking spots paired two by two (simple sketch over parking lot).
The idea is to know which parking spots are occupied and which are not. As the camera is located "from the side", I can't really mark the parking spots: if, for example, spot 6 is occupied, the ground that is spot 5 is not visible.
My plan is that when a car is coming down the road the car is labeled "car", and when the car parks it is re-labeled as "car_6" (if the car parks on parking spot 6).
Do I need to use some sort of hierarchy to make the change from "car" to "car_6" without problems?
(The parking lot is just an example to explain the problem, so there will be no problems with permissions to record videos.) The objects that will be detected in the model are about the same size as a car, but round.
After some advice I have heard that ssd_mobilenet_v2 is a good model for the problem. As speed is not really that important, I think all models would be able to get the job done, but I have now worked a bit with MobileNet and learned a bit, so I would like to stay with this model.
I'm using TensorFlow 2 on Windows 10.
What I have done so far:
Labeled (with labelImg) around 250 pictures (.xml), converted them to CSV and then to .record.
The pictures are 320*320 in JPG format.
The label map is a file created as a .txt and then renamed to .pbtxt (just by changing the file extension in Windows); a minimal example of the expected format is shown after this list.
(filepath) is used to make the paths a bit shorter; I have a very extensive directory system for this.
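For reference, a minimal label map in the expected .pbtxt format looks roughly like this (the class names here are illustrative; the ids must start at 1 and match the ids used when the .record files were built):
item {
  id: 1
  name: 'car'
}
item {
  id: 2
  name: 'car_1'
}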
This is my training command:
D:/(filepath)/object detection> python model_main_tf2.py --pipeline_config_path D:\(filepath)\pre-trained-models\ssd_mobilenet_v2_320x320_coco17_tpu-8\pipeline.config --model_dir D:\(filepath)\models\model_out
with the pipeline config:
model {
  ssd {
    num_classes: 9
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2_keras"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 0.005
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.029
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.9700000286102295
          center: true
          scale: true
          epsilon: 0.0010000000474974513
          train: true
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.03
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.9
            center: true
            scale: true
            epsilon: 1.0
            train: true
          }
        }
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        class_prediction_bias_init: 0.2
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 0.5
        iou_threshold: 0.6000000238418579
        max_detections_per_class: 100
        max_total_detections: 100
        use_static_shapes: false
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
          delta: 1.0
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 2.0
          alpha: 0.75
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
  }
}
train_config {
  batch_size: 4
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_rgb_to_gray {
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.008
          total_steps: 50000
          warmup_learning_rate: 0.002
          warmup_steps: 1000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "D:/Visual_Studio/TensorflowTake3/models/research/object_detection/downloaded_ssd_mobilenet_v2_320x320/checkpoint/ckpt-0"
  num_steps: 30000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  fine_tune_checkpoint_version: V2
}
train_input_reader {
  label_map_path: "D:/Visual_Studio/TensorflowTake3/label_map/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "D:/Visual_Studio/TensorflowTake3/record_file/tf_cropped/tf_cropped_train.record"
  }
}
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "D:/Visual_Studio/TensorflowTake3/label_map/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "D:/Visual_Studio/TensorflowTake3/record_file/tf_cropped/tf_cropped_test.record"
  }
}
(I hope the config is readable; I had some problems with the formatting.)
I have tried a number of different versions of this file. As I have a relatively small number of pictures, and the "simplicity" of the pictures is quite high, I have tried num_steps from 1000 up to 50000, with pretty much the same result.
I have changed the data_augmentation_options to random_horizontal_flip, in the hope of not changing the pictures too much; the model might benefit from that, as the final environment is very static.
The model will be applied on an OAK-D camera used inside, without much natural light, so the change in color for the final model will be minimal.
After training, this script is used to export:
python .\exporter_main_v2.py --input_type image_tensor --pipeline_config_path D:\(filepath)\ssd_mobilenet_v2_320x320_coco17_tpu-8\pipeline.config --trained_checkpoint_dir D:\(filepath)\model_out --output_directory D:\(filepath)\my_model
This is used for creating an IR:
python mo_tf.py ^
    --saved_model_dir D:\(filepath)\saved_model\ ^
    --data_type=FP16 ^
    --scale_values [255,255,255] ^
    --output_dir D:\(filepath)\20220324_night ^
    --tensorflow_use_custom_operations_config C:\Users\(filepath)\envs\openvino\Lib\site-packages\mo\extensions\front\tf\ssd_support_api_v2.4.json ^
    --tensorflow_object_detection_api_pipeline_config D:\(filepath)\model\pipeline.config ^
    --log_level DEBUG --tensorboard_logdir D:\(filepath)\logdir
For the final conversion I use the online blob converter at http://blobconverter.luxonis.com/ with model source "OpenVINO model", MyriadX compile params "-ip U8" (default) and 6 shaves.
When I test the model on the OAK-D I use a script from the DepthAI docs:
https://docs.luxonis.com/projects/api/en/develop/samples/MobileNet/video_mobilenet/
This runs the RGB camera in the OAK-D against video that I have captured of the final environment.
The problem i get when running:
most often have boundingboxes all over the screen and they are not "focusing" on anything special, the are just everywhere. in this case, when i change the n.setConfidenceThreshold(0.5) there is a point at 0.2 that either gives me no boundingboxes or to many boundingboxes not focusing on anything.
When i tried with 50000 steps i had a row of bounding boxes in the bottom of the screen (in a for the model "dead space") and not reacting at all on the cars coming into screen.
As i feel i´m stuck, not knowing what to do all ideas are welcome. I´m trying to research what to do but as i don´t know what the problem might be its tricky to know where to start.
The original photos captured for the labeling are in 16:9 aspect ratio. Would it be possible to keep this ratio throughout the process? If so, the positions of the different parking spaces would stay intact through the training, (hopefully) enhancing the model's ability to know, by geographical localization, where the different parking spots (and the road) are located.
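For what it's worth, the TF Object Detection API does offer a keep_aspect_ratio_resizer that could replace the fixed_shape_resizer in the config above; a sketch (the dimensions are illustrative, and how well SSD MobileNet handles this is something to verify):
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 320
    max_dimension: 640
    pad_to_max_dimension: true
  }
}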
I have found a way to test the model on images after training, and this gives a great result.
My idea in the long run is to get a working model, so I have something to "experiment" on and learn more.
I have read somewhere that someone with the same problem solved it by changing the label_map in the .record file. As I understand it, the .record file is more or less a file consisting of serialized tensors; would it be possible to check this file to see the order of the label_map, or have I understood this wrong?
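Inspecting the .record file directly should be possible; a minimal sketch (assuming TF 2 and the standard TF Object Detection API feature keys; the path is illustrative):
import tensorflow as tf

# Print the stored class names/ids for the first few examples.
dataset = tf.data.TFRecordDataset('tf_cropped_train.record')
for raw in dataset.take(5):
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    feats = example.features.feature
    print(feats['image/object/class/text'].bytes_list.value,
          feats['image/object/class/label'].int64_list.value)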
As mentioned above, all advice/ideas are welcome, and keep in mind I'm new to this. This means I could have made a simple mistake.
Sorry for this long post, and please let me know if there is anything I should elaborate on/explain further.
Best,
Martin
From an OpenVINO perspective you may refer to this Object Detection Python Demo, since that demo is rather similar to your aim. Models that are similar/close to this kind of implementation are also available.
The thing you need to figure out is how you are going to train the model to differentiate whether a spot is taken or not. You'll need to decide on the stimuli that differentiate those: for example, in IoT, an infrared sensor in the parking lot determines an empty/occupied lot through a threshold value. Meanwhile, in AI, it's up to you as the developer to decide how to determine that concept. This might give you some ideas (they even share their source code).
Another thing to note is that the OAK-D is not part of the OpenVINO Toolkit. Hence, for anything specific to it, you'll need to reach out to their discussion forum or community.

How to make the label size in QML according to the size of the text?

I can't make the height of the Label exactly the height of the font; there is a gap at the bottom. Can you please tell me how to remove it? (my label image)
Label
{
    id: autoLabel
    leftInset: 0
    topInset: 0
    bottomInset: 0
    background: Rectangle
    {
        anchors.bottomMargin: 0
        color: "white"
        border.width: 0
    }
    Text
    {
        id: autoText
        anchors.fill: autoLabel
        anchors.leftMargin: 0
        anchors.topMargin: 0
        anchors.bottomMargin: 0
        color: PendulumStyle.primaryTextColor
        text: "AUTO"
        font.family: PendulumStyle.fontFamily
        font.pixelSize: 35
        font.styleName: "Bold"
        padding: 0
    }
    width: autoText.contentWidth
    height: autoText.contentHeight
    x: mainRectangle.x + 30
    y: checkBox.y - checkBox.height / 2
}
The Label actually IS the size of your font. The text you're using just doesn't show it. Fonts have a concept of an ascent and descent above and below the baseline.
The ascent is the distance from the baseline to the top of the tallest character, and the descent is the distance from the baseline to the bottom of the lowest character. (Those might not be the technical definitions, but that's at least how I think of them; i.e. there may still be padding, etc.) So the total height of a font should be (ascent + descent).
In your case, you've used the word "AUTO". None of those characters go below the baseline. But the font height stays the same no matter what text you use.
If you still want your Rectangle to just fit around the word "AUTO", then it should just use the ascent height, and ignore the descent. To do that, QML provides a FontMetrics object that can help you.
Label
{
    id: autoLabel
    width: autoText.contentWidth
    height: fm.ascent
    FontMetrics {
        id: fm
        font: autoText.font
    }
    background: ...
    Text
    {
        id: autoText
        text: "AUTO"
        font.family: PendulumStyle.fontFamily
        font.pixelSize: 35
        font.styleName: "Bold"
    }
}

How to set the scrollbar position to max? It's not 1.0 as stated in the docs

UPDATE
Fixed by calculating the needed position from the maximum content.x position and a 0.0-1.0 factor:
Component.onCompleted:
{
    var maxContentX = content.width - frame.width
    var factor = 1.0 // 0, 0.5, 1.0
    var newPosition = (maxContentX * factor) / content.width
    hbar.position = newPosition
}
Qt docs, ScrollBar QML Type, position property:
"This property holds the position of the scroll bar, scaled to 0.0 - 1.0."
But the position never reaches 1.0, because it only ever holds (1 minus bar size).
I don't understand what the sense is of scaling the position between 0 and 1 but then making it directly related to the bar size.
Are there any meaningful usages for position values above (1 minus bar size)?
Is there a way to get the bar size?
Any idea how to calculate the correct mid/max values for the position?
I want to build a timer that toggles between 0%, 50% and 100% position.
Qt-Sample: Non-attached Scroll Bars
import QtQuick 2.15
import QtQuick.Controls 2.15

Item
{
    height: 300
    width: 300
    Column
    {
        Text { text: "vbar.position: " + vbar.position }
        Text { text: "hbar.position: " + hbar.position }
    }
    Rectangle {
        id: frame
        clip: true
        width: 160
        height: 160
        border.color: "black"
        anchors.centerIn: parent
        Text {
            id: content
            text: "ABC"
            font.pixelSize: 160
            x: -hbar.position * width
            y: -vbar.position * height
        }
        ScrollBar {
            id: vbar
            hoverEnabled: true
            active: hovered || pressed
            orientation: Qt.Vertical
            size: frame.height / content.height
            anchors.top: parent.top
            anchors.right: parent.right
            anchors.bottom: parent.bottom
        }
        ScrollBar {
            id: hbar
            hoverEnabled: true
            active: hovered || pressed
            orientation: Qt.Horizontal
            size: frame.width / content.width
            anchors.left: parent.left
            anchors.right: parent.right
            anchors.bottom: parent.bottom
        }
    }
}
I don't think I can answer the "why" part of your question. But to get the bar size, I believe you can use the ScrollBar's contentItem property. To get 0%, 50%, and 100%, you could probably do something like this (for a horizontal scrollbar):
// 0%
position: 0
// 50%
position: (width - contentItem.width) / width / 2
// 100%
position: (width - contentItem.width) / width

Tensorflow retrained graph in C# (Tensorflowsharp)

I'm just trying to use a retrained Inception model with TensorFlowSharp in Unity.
The retrained model was prepared with optimize_for_inference and works like a charm in Python.
But it is pretty inaccurate in C#.
The code works like this:
First I get the picture:
//webcamtexture transformed to picture in jpg
var pic = _texture.EncodeToJpg();
//added Picture to queue for the object detection thread
_detectedObjects.addTens(pic);
After that, a thread will handle each collected picture:
public void HandlePicture(byte[] picture)
{
    var tensor = ImageUtil.CreateTensorFromImageFile(picture);

    var runner = session.GetRunner();
    runner.AddInput(g_input, tensor).Fetch(g_output);
    var output = runner.Run();

    var bestIdx = 0;
    float best = 0;
    var result = output[0];
    var rshape = result.Shape;
    var probabilities = ((float[][])result.GetValue(jagged: true))[0];
    for (int r = 0; r < probabilities.Length; r++)
    {
        if (probabilities[r] > best)
        {
            bestIdx = r;
            best = probabilities[r];
        }
    }
    Debug.Log("Tensorflow thinks this is: " + labels[bestIdx] + " Prob : " + best * 100);
}
So my guess is:
1. It has something to do with retrained graphs (because I can't find any application/test where one is used and working).
2. It has something to do with how I handle the picture transform into a tensor?! (But if that is wrong I could use some help there; the code is further down, and a Python comparison is sketched after it.)
To transform the picture I'm also using a graph, as is done in the TensorFlowSharp example:
public static class ImageUtil
{
    // Convert the image in filename to a Tensor suitable as input to the Inception model.
    public static TFTensor CreateTensorFromImageFile(byte[] contents, TFDataType destinationDataType = TFDataType.Float)
    {
        // DecodeJpeg uses a scalar String-valued tensor as input.
        var tensor = TFTensor.CreateString(contents);

        TFGraph graph;
        TFOutput input, output;

        // Construct a graph to normalize the image
        ConstructGraphToNormalizeImage(out graph, out input, out output, destinationDataType);

        // Execute that graph to normalize this one image
        using (var session = new TFSession(graph))
        {
            var normalized = session.Run(
                inputs: new[] { input },
                inputValues: new[] { tensor },
                outputs: new[] { output });
            return normalized[0];
        }
    }

    // The inception model takes as input the image described by a Tensor in a very
    // specific normalized format (a particular image size, shape of the input tensor,
    // normalized pixel values etc.).
    //
    // This function constructs a graph of TensorFlow operations which takes as
    // input a JPEG-encoded string and returns a tensor suitable as input to the
    // inception model.
    private static void ConstructGraphToNormalizeImage(out TFGraph graph, out TFOutput input, out TFOutput output, TFDataType destinationDataType = TFDataType.Float)
    {
        // Some constants specific to the pre-trained model at:
        // https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip
        //
        // - The model was trained with images scaled to 224x224 pixels.
        // - The colors, represented as R, G, B in 1-byte each were converted to
        //   float using (value - Mean)/Scale.
        const int W = 299;
        const int H = 299;
        const float Mean = 128;
        const float Scale = 1;

        graph = new TFGraph();
        input = graph.Placeholder(TFDataType.String);
        output = graph.Cast(graph.Div(
            x: graph.Sub(
                x: graph.ResizeBilinear(
                    images: graph.ExpandDims(
                        input: graph.Cast(
                            graph.DecodeJpeg(contents: input, channels: 3), DstT: TFDataType.Float),
                        dim: graph.Const(0, "make_batch")),
                    size: graph.Const(new int[] { W, H }, "size")),
                y: graph.Const(Mean, "mean")),
            y: graph.Const(Scale, "scale")), destinationDataType);
    }
}
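For comparison with guess 2, this is roughly the preprocessing the standard TensorFlow label_image example applies to a graph retrained with retrain.py: a sketch assuming PIL/NumPy, where such Inception v3 graphs typically expect 299x299 input normalized with mean = std = 128 (note the C# code above uses Mean = 128 but Scale = 1, which is worth double-checking against the values the Python side used):
import numpy as np
from PIL import Image

def preprocess(path, size=299, mean=128.0, std=128.0):
    # Resize, then normalize each channel with (value - mean) / std.
    img = Image.open(path).convert('RGB').resize((size, size))
    arr = (np.asarray(img, dtype=np.float32) - mean) / std
    return arr[np.newaxis, ...]  # prepend the batch dimension: 1 x H x W x 3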

Tensorflow Serving Batching parameters

How do I do performance tuning of batching using max_batch_size, batch_timeout_micros, num_batch_threads and other parameters? I tried using these parameters with the query client; it doesn't work.
In the example below I have 100 images and I want to batch them in sizes of 10. The query runs for all images instead of 10.
bazel-bin/tensorflow_serving/example/demo_batch --server=localhost:9000 --max_batch_size=10
Also, for batch scheduling, how do I make it run every 10 secs after the first batch is done? Thanks.
I have met the same problem as you.
I checked the source code of TF Serving; these parameters live in a protobuf message defined in:
serving/tensorflow_serving/servables/tensorflow/session_bundle_config.proto
And I found the example config file in:
serving/tensorflow_serving/servables/tensorflow/testdata/batching_config.txt
I believe that if you follow the batching_config.txt format, the parameter config should work.
Hope it helps.
max_batch_size { value: 1024 }
batch_timeout_micros { value: 0 }
max_enqueued_batches { value: 1000000 }
num_batch_threads { value: 8 }
allowed_batch_sizes : 1
allowed_batch_sizes : 2
allowed_batch_sizes : 8
allowed_batch_sizes : 32
allowed_batch_sizes : 128
allowed_batch_sizes : 256
allowed_batch_sizes : 512
allowed_batch_sizes : 1024
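For reference, a batching parameters file in this format is typically passed to the standard model server at startup via --enable_batching and --batching_parameters_file (the paths and model name here are illustrative):
tensorflow_model_server --port=9000 --model_name=mymodel \
    --model_base_path=/models/mymodel \
    --enable_batching=true \
    --batching_parameters_file=/path/to/batching_config.txt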