Convolutional neural networks and 3D images - tensorflow

I am interested in applying CNNs to 3D images (i.e. medical data).
Does TensorFlow already incorporate this functionality?

TensorFlow now supports 3D convolution and 3D pooling in the master branch.
You can use them with 5D tensors as input with shape: [batch_size, depth, height, width, channels].
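For concreteness, here is a minimal sketch of that 5D layout with the low-level ops (tf.nn.conv3d and tf.nn.max_pool3d); the tensor sizes and filter shapes are only illustrative:

```python
import tensorflow as tf

# A batch of 3D volumes, e.g. 2 scans of 16x64x64 voxels with 1 channel,
# following the [batch_size, depth, height, width, channels] layout noted above.
volumes = tf.random.normal([2, 16, 64, 64, 1])

# 3x3x3 filters mapping 1 input channel to 8 output channels.
filters = tf.random.normal([3, 3, 3, 1, 8])

conv = tf.nn.conv3d(volumes, filters, strides=[1, 1, 1, 1, 1], padding="SAME")
pooled = tf.nn.max_pool3d(conv, ksize=[1, 2, 2, 2, 1],
                          strides=[1, 2, 2, 2, 1], padding="SAME")

print(conv.shape)    # (2, 16, 64, 64, 8)
print(pooled.shape)  # (2, 8, 32, 32, 8)
```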

No, the current implementations are designed for 2D images (functions like nn.conv2d). They support multiple channels (e.g. RGB), and it is possible to express a 3D image as a multichannel 2D image (each z-slice becomes a channel), but this isn't always ideal. Additionally, this sort of approach requires substantial amounts of image data, which is typically difficult to come by in the medical domain.
Update: both TensorFlow and Theano (and hence Keras, Lasagne, etc.) now support 3D operations, as stated above. Note that 3D operations are much more computationally and memory intensive than comparable 2D operations.

TensorFlow implementations of 3D Convolutional Neural Networks are provided by the following open-source projects:
Lip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks
Using 3D Convolutional Neural Networks for Speaker Verification

If you want to use CNNs with 3D images, a possible alternative is to use this Caffe PR.
You will need to convert your data to HDF5 format.
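As a rough idea of the HDF5 conversion, here is a hedged sketch assuming Caffe's HDF5Data layer with the usual "data"/"label" top names (adjust dataset names and array layout to match your prototxt; the shapes below are hypothetical):

```python
import h5py
import numpy as np

# Hypothetical 3D volumes and labels; replace with your own medical data.
volumes = np.random.rand(10, 1, 32, 64, 64).astype(np.float32)  # N x C x D x H x W
labels = np.random.randint(0, 2, size=(10,)).astype(np.float32)

with h5py.File("train.h5", "w") as f:
    # Dataset names must match the top blob names of the HDF5Data layer
    # ("data" and "label" are the usual convention).
    f.create_dataset("data", data=volumes)
    f.create_dataset("label", data=labels)

# Caffe's HDF5Data layer takes a text file listing the .h5 files:
with open("train_h5_list.txt", "w") as f:
    f.write("train.h5\n")
```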

Related

Why does the Gramian Matrix work for VGG16 but not for EfficientNet or MobileNet?

A Neural Algorithm of Artistic Style uses the Gramian Matrix of the intermediate feature maps of the VGG16 classification network trained on ImageNet. Back then, that was probably a good choice because VGG16 was one of the best-performing classification networks. Nowadays, there are much more efficient classification networks that surpass VGG in classification performance while requiring fewer parameters and FLOPs, for example EfficientNet and MobileNetV2.
But when I tried this out in practice, the Gramian Matrix of VGG16 features does appear representative of the image style: its L2 distance between stylistically similar images is smaller than its L2 distance between stylistically unrelated images. For the Gramian Matrix calculated from EfficientNet and MobileNetV2 features, that does not appear to be the case: the L2 distances between very similar images and between very dissimilar images differ by only about 5%.
From the network structure, VGG, EfficientNet, and MobileNet all have convolutions with batch normalization and ReLU in between, so the building blocks are the same. Then which design decision is unique to VGG so that its Gramian Matrix captures the style, while EfficientNet's and MobileNet's do not?
I have since figured it out: the Gramian Matrix needs partially correlated features to work correctly. Newer networks are trained with a Dropout regularizer, which reduces the inter-feature correlation.
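For reference, here is a sketch of how such a style distance can be computed with Keras' pretrained VGG16; the choice of layer ("block3_conv3") and the 224x224 resize are my assumptions, not prescribed by the paper or the question:

```python
import tensorflow as tf

def gram_matrix(features):
    # features: (1, H, W, C) activation map -> (C, C) Gram matrix of channel co-activations.
    f = tf.reshape(features, (-1, features.shape[-1]))            # (H*W, C)
    return tf.matmul(f, f, transpose_a=True) / tf.cast(tf.shape(f)[0], tf.float32)

# Intermediate feature extractor; "block3_conv3" is one common style-layer choice.
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
extractor = tf.keras.Model(vgg.input, vgg.get_layer("block3_conv3").output)

def style_gram(image):
    # image: (H, W, 3) array with values in [0, 255]; resize and preprocess for VGG16.
    x = tf.image.resize(image, (224, 224))[tf.newaxis, ...]
    x = tf.keras.applications.vgg16.preprocess_input(x)
    return gram_matrix(extractor(x))

# Crude style distance between two images: L2 norm of the Gram matrix difference.
# dist = tf.norm(style_gram(img_a) - style_gram(img_b))
```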

How to classify sound using FFT and neural network? Should I use CNN or RNN?

I am doing a personal project for educational purpose to learn Keras and machine learning. For start, I would like to classify if a sound is a clap or stomp.
I am using a microcontroller that is sound-triggered and samples the sound at 20 µs intervals. The microcontroller sends this raw ADC data to the PC for processing in Python. I am currently taking 1000 points and computing the FFT with NumPy (using rfft and taking its absolute value).
Now I would like to feed the captured FFT spectra of claps and stomps as training data to a neural network that classifies them. I have been researching this all day; some articles say a Convolutional Neural Network should be used, and some say a Recurrent Neural Network.
Looking at Convolutional Neural Networks raised another question: should I be using Keras' 1D or 2D convolutions?
You need to process the FFT signals to classify whether the sound is a clap or a stomp.
For Convolutional Neural Networks (CNNs):
CNNs can extract features from fixed-length inputs. 1D CNNs with max-pooling work best on signal data (I have personally used them on accelerometer data).
You can use them if your input has a fixed length and significant features.
For Recurrent Neural Networks (RNNs):
These should be used when the signal has temporal structure.
For recognizing a clap, the temporal features could be thought of this way: a clap has an immediate loud sound followed by a soft sound as it ends. An RNN will learn these two features in sequence. Clapping is also a sequential action (it consists of various activities in sequence).
RNNs and LSTMs can be the best choice if they receive good features.
A hybrid ConvLSTM:
This network is a hybrid of CNNs and LSTMs (RNNs): the CNN extracts features, and the resulting sequence is learned by the LSTM, so the extracted features also carry temporal information.
This is quite easy to build if you are using Keras.
Tip:
Since this is audio classification, I also suggest using MFCCs to extract features.
I think you should try all three approaches and see which suits your data best (a minimal 1D CNN starting point is sketched below). Most probably RNNs and ConvLSTMs will work for your use case.
Hope it helps.
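As a concrete starting point for the 1D CNN route, here is a minimal Keras sketch that classifies |rfft| magnitude vectors (501 bins from 1000 samples) as clap vs. stomp; the layer sizes and kernel widths are assumptions to tune, not recommendations:

```python
import tensorflow as tf
from tensorflow.keras import layers

# rfft of 1000 samples yields 501 magnitude bins; treat them as a 1-D "signal"
# with a single channel. Binary output: clap vs. stomp.
n_bins = 501

model = tf.keras.Sequential([
    layers.Input(shape=(n_bins, 1)),
    layers.Conv1D(16, kernel_size=7, activation="relu"),
    layers.MaxPooling1D(pool_size=4),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Training would look like:
# model.fit(X_train[..., None], y_train, epochs=30, validation_split=0.2)
# where X_train has shape (num_examples, 501) of |rfft| values.
```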
Since the training/testing system is not an embedded system in this case, do take a look at VGGish (https://github.com/tensorflow/models/tree/master/research/audioset, which also links the paper and a dataset that includes clapping), which computes a set of features as described below:
VGGish was trained with audio features computed as follows:
All audio is resampled to 16 kHz mono.
A spectrogram is computed using magnitudes of the Short-Time Fourier Transform with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann window.
A mel spectrogram is computed by mapping the spectrogram to 64 mel bins covering the range 125-7500 Hz.
A stabilized log mel spectrogram is computed by applying log(mel-spectrum + 0.01), where the offset is used to avoid taking a logarithm of zero.
These features are then framed into non-overlapping examples of 0.96 seconds, where each example covers 64 mel bands and 96 frames of 10 ms each.
Note - clapping is already covered (https://research.google.com/audioset/dataset/clapping.html)
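If you want to reproduce features in that spirit without pulling in the full VGGish code, here is an approximate log-mel sketch using tf.signal. It follows the recipe above (25 ms window, 10 ms hop, 64 mel bins over 125-7500 Hz, log(mel + 0.01)), but it is an assumption-laden approximation and will not match VGGish's output exactly:

```python
import tensorflow as tf

def log_mel_spectrogram(waveform, sample_rate=16000):
    # waveform: 1-D float32 tensor of mono audio already resampled to 16 kHz.
    # 25 ms window / 10 ms hop at 16 kHz = 400 / 160 samples.
    stft = tf.signal.stft(waveform, frame_length=400, frame_step=160,
                          fft_length=512, window_fn=tf.signal.hann_window)
    spectrogram = tf.abs(stft)                                   # (frames, 257)
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=64, num_spectrogram_bins=spectrogram.shape[-1],
        sample_rate=sample_rate, lower_edge_hertz=125.0, upper_edge_hertz=7500.0)
    mel = tf.matmul(spectrogram, mel_matrix)                     # (frames, 64)
    return tf.math.log(mel + 0.01)                               # stabilized log-mel

# 0.96 s examples = 96 frames of 10 ms; split the log-mel into non-overlapping chunks.
# log_mel = log_mel_spectrogram(audio)
# examples = tf.signal.frame(log_mel, frame_length=96, frame_step=96, axis=0)
```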

training image datasets for object detection

Which version of YOLO-TensorFlow (a customised CNN like GoogLeNet) is preferred for traffic science?
If the training images are blurred or noisy, is that okay for training, or what steps should be taken when preparing the training-dataset images?
You may need to curate your own dataset using frames from a traffic camera, manually tagging images of cars where the passengers' seatbelts are or are not buckled, as this is a very specialized task. From there, you can do data augmentation, perhaps using the Keras ImageDataGenerator class (a short augmentation sketch follows below). If a human can identify a seatbelt in a blurred or noisy image, a model can learn from it. You can then use transfer learning from a pretrained CNN such as Inception (this is a helpful tutorial for how to do that), or train your own binary classifier on your tagged images, where the inputs are frames of traffic-camera video.
I'd suggest learning the basics of CNNs with these models first, and only then diving into a more complicated model like YOLO.
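As an illustration of the augmentation step mentioned above, here is a sketch with ImageDataGenerator; the directory layout (frames/{buckled,unbuckled}) and the augmentation ranges are hypothetical choices you would tune for your footage:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Hypothetical directory layout: frames/{buckled,unbuckled}/*.jpg
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=10,          # small rotations; traffic cameras are roughly fixed
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    brightness_range=(0.7, 1.3),
    horizontal_flip=True,
    validation_split=0.2,
)

train_gen = datagen.flow_from_directory(
    "frames", target_size=(224, 224), batch_size=32,
    class_mode="binary", subset="training")
val_gen = datagen.flow_from_directory(
    "frames", target_size=(224, 224), batch_size=32,
    class_mode="binary", subset="validation")
```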

What to expect from deep learning object detection on black and white pictures?

With TensorFlow, I want to train an object detection model on my own images, based on the ssd_inception_v2_coco model. The problem is that all my pictures are black and white. What performance can I expect? Should I try to colorize my B&W pictures first? Or, on the contrary, should I retrain the base network on "uncolorized" images? Are there general guidelines for processing B&W images in deep learning object detection?
I wouldn't go through the trouble of colorizing if you are planning on using a pretrained model. I would expect that explicitly colorizing your images as a pre-processing step would help very little (if at all) since in theory the features that a colorizing network learns can also be learned by the detection network.
If you plan to fine-tune a detection network that was pretrained on an RGB dataset, make sure you either (i) replace the first convolution in the network with a convolutional layer that expects a single-channel input, or (ii) pad your images with two all-zero channels (see the sketch below).
You may get slightly worse detection performance simply because you lose two thirds of the image's pixel information when using BW instead of RGB.
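Here is a small sketch of option (ii) above, plus the channel-replication variant some people use instead, assuming a (batch, H, W, 1) grayscale float tensor:

```python
import tensorflow as tf

# Option (ii): turn a single-channel (grayscale) batch into a three-channel batch
# so an RGB-pretrained detector can consume it, by padding with two all-zero channels.
def expand_grayscale(images):
    # images: (batch, H, W, 1) float tensor.
    zeros = tf.zeros_like(images)
    return tf.concat([images, zeros, zeros], axis=-1)   # (batch, H, W, 3)

# A common alternative is to replicate the gray channel instead of zero-padding:
def replicate_grayscale(images):
    return tf.image.grayscale_to_rgb(images)            # (batch, H, W, 3)
```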

Manipulating pretrained layers of convnet in Tensorflow

I am learning convolutional networks in TensorFlow. I wonder if there are any tutorials on using TF to investigate a pre-trained convnet model, like these excellent tutorials for Caffe: this and this. I mean, how to access middle layers, get their learned parameters and blobs, customize the input shape to accept arbitrary image sizes or batch sizes, etc.
It's not quite the same thing, but there's a codelab here that shows you how to remove the top layer of a pretrained network and train up a new one on your own data:
https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/index.html?index=..%2F..%2Findex#0
It might give you some ideas on how to approach this in TensorFlow.
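As a rough TensorFlow/Keras counterpart to those Caffe tutorials, here is a sketch that lists a pretrained model's layers and weights and exposes a middle layer's activations; VGG16 is used only as a stand-in for whatever pretrained model you care about:

```python
import tensorflow as tf

# Load a pretrained convnet; include_top=False leaves the spatial input size flexible.
model = tf.keras.applications.VGG16(weights="imagenet", include_top=False)

# List layers and the shapes of their learned parameters.
for layer in model.layers:
    print(layer.name, [w.shape for w in layer.get_weights()])

# Build a sub-model that returns a middle layer's activations ("blobs" in Caffe terms).
feature_model = tf.keras.Model(inputs=model.input,
                               outputs=model.get_layer("block4_conv3").output)

# Because the input shape is unspecified, arbitrary image sizes and batch sizes work:
activations = feature_model(tf.random.normal([1, 300, 400, 3]))
print(activations.shape)
```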