3D image registration with supervised UNet - TensorFlow

As with normal image registration, my data consists of pre- and post-images plus the ground truth. My dataset is limited to a certain organ, and everything outside the organ is 0. I found that the UNet even makes predictions on slices that contain only 0s, and I am not sure why this happens. My model is a simple 7-layer UNet, nothing new at all.
My expectation is that the UNet should not make any prediction in the 0 region and should keep it 0.
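Nothing in the architecture forces the output to be exactly 0 outside the organ, so the network happily predicts there; if you want that guarantee, you have to build it in. A minimal sketch of one way to do so, assuming the input volume is exactly 0 outside the organ (mask_to_organ, pre_image, and unet_output are hypothetical names):

```python
import tensorflow as tf

def mask_to_organ(pre_image, unet_output):
    """Zero the prediction wherever the input is zero, so voxels outside
    the organ stay exactly 0 no matter what the UNet outputs."""
    organ_mask = tf.cast(tf.not_equal(pre_image, 0.0), unet_output.dtype)
    return unet_output * organ_mask
```

Applied as a final layer (e.g. via a Lambda or Multiply layer), this also keeps the loss from producing gradients in the background region, since the masked output there is constant.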

Related

Solving imbalanced classification on a video transcript dataset

I am currently working on a problem that requires segmenting a video lecture transcript based on the topics present in the video. My dataset consists of sentence-wise labels, where 1 indicates the beginning of a new segment (i.e. topic) and 0 indicates the same segment. Thus the problem can be framed as a binary classification problem where the model takes a sentence as input and makes a binary prediction on it. However, due to the very nature of the problem, the dataset is highly imbalanced (90% 0s and 10% 1s). As a consequence, while training, I have noticed that my model becomes biased and starts predicting all 0s.
I have tried resolving this issue by using class_weights in model.fit(). However, this hasn't been of much help. If I increase the penalty on the 1s class, my model starts predicting all 1s; if I lower it, the model goes back to predicting all 0s. Does anyone have ideas on how I should resolve this issue?
There are other oversampling and undersampling techniques (e.g. SMOTE), but I don't think they are suitable for my use case, since they would disrupt the continuity of the video transcript.
PS: I am sharing a screenshot of my model's architecture for reference.
Basically, the model takes BERT-tokenized input sentences and encodes them using the Universal Sentence Encoder. This encoding is then passed to a classification layer, which finally returns a tensor of shape [BATCH_SIZE, 1]. I am using BinaryCrossentropy as the loss function.
[Screenshot: model architecture]
[Screenshot: fitting the model using model.fit()]
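Since fixed class_weights just slide the model between all-0s and all-1s, one option worth trying (my suggestion, not something from the question) is focal loss, which down-weights examples the model already classifies easily instead of applying a constant per-class penalty. A minimal hand-rolled sketch, assuming the model ends in a sigmoid and outputs probabilities of shape [BATCH_SIZE, 1] as described above; the gamma and alpha values are assumptions to tune:

```python
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.9):
    """Focal loss: (1 - p_t)^gamma shrinks the loss on easy examples,
    while alpha weights the rare positive class more heavily."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # p_t: predicted probability of the true class
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        # alpha_t: class-balancing weight (alpha for 1s, 1 - alpha for 0s)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        return -tf.reduce_mean(alpha_t * (1.0 - p_t) ** gamma * tf.math.log(p_t))
    return loss

# `model` stands in for the architecture in the screenshots above.
model.compile(optimizer="adam",
              loss=binary_focal_loss(gamma=2.0, alpha=0.9),
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
```

Tracking precision and recall instead of accuracy also makes the all-0s failure mode visible during training.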

How to clip TensorFlow model predictions to a range?

I am using the TensorFlow website's tutorial on time series modeling with my own data. The models are pretty good, but my target value is always positive, and sometimes the model predicts a negative value. Is there a way to clip the model's predictions to a range?
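Two minimal options (a sketch, not the tutorial's code; model and x_test are placeholders for the tutorial objects): clip after prediction with tf.clip_by_value, or make negative outputs impossible by construction with a non-negative final activation.

```python
import tensorflow as tf

# Option 1: clip the trained model's predictions after the fact.
preds = model.predict(x_test)
preds = tf.clip_by_value(preds, clip_value_min=0.0, clip_value_max=1e9)

# Option 2: bake the constraint into the model so it is also active
# during training, e.g. append a ReLU (or softplus) to the output.
constrained = tf.keras.Sequential([
    model,
    tf.keras.layers.Activation("relu"),
])
```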

Keras Masking layer for LSTM input to mask features instead of timesteps

I gather that Masking layers in Keras are commonly used for handling inputs with varying timesteps. Based on the documentation, I understand that if all of the features for a given timestep equal the mask value, then that timestep is skipped in downstream layers.
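A tiny demo of that documented behaviour (the numbers are made up): only the timestep whose features all equal mask_value gets masked out; a timestep with just one matching feature is kept.

```python
import numpy as np
import tensorflow as tf

x = np.array([[[0.0, 0.0],     # all features == mask_value -> skipped
               [1.0, 0.0],     # only one feature is 0      -> kept
               [2.0, 3.0]]])   # shape (batch=1, timesteps=3, features=2)
masked = tf.keras.layers.Masking(mask_value=0.0)(x)
print(masked._keras_mask)      # [[False  True  True]]
```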
For my problem, I am instead interested in using masking for features, where the data input shape to the network is (batch_size, num_timesteps, num_features). Essentially, I want to predict a time series one step into the future with num_features features, but assuming that I won't always have all the features from the previous timestep to base my prediction on.
For example, one could predict RGB values one timestep into the future for a pixel in a video stream based on partial data from a previous timestep. At every timestep the output should be all of RGB, but at some timesteps you may get only RG, only RB, or only BG; you never know which partial data you'll have at each timestep to make your prediction. This is why I want some way to mark a feature as masked during training to accommodate this kind of prediction.
It may be that Masking in Keras is not the correct mechanism to achieve this. What is the correct type of network layer that would give me this behavior?

Detecting text in natural images

I wrote code in TensorFlow using a convolutional neural network to detect text in images. I used a TFRecords file to read the Street View Text dataset, then resized the images to a height and width of 128.
I used 9 conv layers with zero padding and three max_pool layers with a window size of (2×2) and a stride of 2. Since I use just three pooling layers, the last layer's shape is (16×16). The last conv layer has 256 filters.
I also used two fully connected regression layers (tf.nn.sigmoid) and tf.losses.mean_squared_error as the loss function.
My questions are:
Is this architecture enough for the detection process? I know there is something called NMS for detection. Also, what is the label in this case?
In general (this is not a rule, just based on my experience): start with a smaller net of 2 or 3 conv layers and see what happens. If you get good results, focus on the winning topology and tune the hyperparameters (learning rate, batch size, and so on); if you don't get good results at all, go deeper, meaning add conv layers, and evaluate again. Nine conv layers is really huge; your problem's complexity should be huge too, otherwise you will reach good accuracy but waste a lot of computing power and time for nothing. And by the way, use a pyramid form, meaning start wide and finish tiny; a sketch of such a small starting net is below.
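For illustration, a minimal starting net along those lines in Keras (every layer size here is an assumption; the 4-unit sigmoid head assumes the label is a single normalized box (x, y, w, h), matching the sigmoid + MSE regression setup in the question):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Pyramid form: the spatial size shrinks layer by layer while the
    # filter count grows.
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                           input_shape=(128, 128, 3)),
    tf.keras.layers.MaxPool2D(2),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(2),
    tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(2),
    tf.keras.layers.Flatten(),
    # One normalized bounding box per image: (x, y, w, h) in [0, 1].
    tf.keras.layers.Dense(4, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")
```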

patch-wise training and fully convolutional training in FCN

In the FCN paper, the authors discuss patch-wise training and fully convolutional training. What is the difference between the two?
Please refer to Section 4.4, attached below.
It seems to me that the training mechanism is as follows:
Assume the original image is M×M; then iterate over the M×M pixels to extract N×N patches (where N < M). The iteration stride can be some number like N/3 to generate overlapping patches. Moreover, assume each single image corresponds to 20 patches; then we can put these 20 patches, or 60 patches (if we want to use 3 images), into a single mini-batch for training. Is this understanding right? It seems to me that this so-called fully convolutional training is the same as patch-wise training.
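For concreteness, here is a small sketch of that extraction step with tf.image.extract_patches; M, N, the channel count, and the N // 3 stride are example values taken from the description above:

```python
import tensorflow as tf

M, N, C = 96, 32, 3
image = tf.random.uniform([1, M, M, C])           # one M×M input image
patches = tf.image.extract_patches(
    images=image,
    sizes=[1, N, N, 1],
    strides=[1, N // 3, N // 3, 1],               # overlapping patches
    rates=[1, 1, 1, 1],
    padding="VALID")                              # (1, rows, cols, N*N*C)
patches = tf.reshape(patches, [-1, N, N, C])      # a mini-batch of patches
print(patches.shape)                              # (49, 32, 32, 3) here
```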
The term "Fully Convolutional Training" just means replacing fully-connected layer with convolutional layers so that the whole network contains just convolutional layers (and pooling layers).
The term "Patchwise training" is intended to avoid the redundancies of full image training.
In semantic segmentation, given that you are classifying each pixel in the image, using the whole image adds a lot of redundancy to the input. A standard approach to avoid this when training segmentation networks is to feed the network batches of random patches (small image regions surrounding the objects of interest) from the training set instead of full images. This "patchwise sampling" ensures that the input has enough variance and is a valid representation of the training dataset (the mini-batch should have the same distribution as the training set). This technique also helps the network converge faster and balances the classes. In this paper, they claim that it is not necessary to use patch-wise training, and that if you want to balance the classes you can weight or sample the loss.
From a different perspective, the problem with full-image training in per-pixel segmentation is that the input image has a lot of spatial correlation. To fix this, you can either sample patches from the training set (patchwise training) or sample the loss across the whole image. That is why the subsection is called "Patchwise training is loss sampling".
So by "restricting the loss to a randomly sampled subset of its spatial terms excludes patches from the gradient computation." They tried this "loss sampling" by randomly ignoring cells from the last layer so the loss is not calculated over the whole image.