I have a question about tflearn and using a CNN with it. I have a classification problem with n data variables (float) and m classes. I tried to implement this by adapting the example at
https://github.com/tflearn/tflearn/blob/master/examples/nlp/cnn_sentence_classification.py to my own dataset. But that example uses an embedding, which doesn't work for me (my inputs are floats, so there are uncountably many possible input values). If I just drop the line network = tflearn.embedding(network, input_dim=10000, output_dim=128), I no longer have a 3-d tensor as input for the following conv_1d layer. Could anyone help me? How can I bring my data into the right shape to apply some convolutions?
Add an extra dimension at the end; conv_1d needs a channel dimension. Like so (I changed the second line; the first and third are copied for easier understanding):
network = input_data(shape=[None, 100], name='input')
network = tf.expand_dims(network, 2)
branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
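To see what the extra dimension does to the shape, here is a minimal NumPy sketch. NumPy stands in for the TensorFlow tensor, and the batch size of 4 is an arbitrary choice for illustration:

```python
import numpy as np

# A batch of 4 samples with 100 float features each: shape (batch, length).
batch = np.random.rand(4, 100).astype(np.float32)

# Add a trailing channel axis so the shape becomes (batch, length, channels).
with_channel = np.expand_dims(batch, 2)

print(batch.shape)         # (4, 100)
print(with_channel.shape)  # (4, 100, 1)
```

The values are unchanged; each feature simply becomes a length-1 channel vector, which is the layout a 1-d convolution expects.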
I am a newbie in ML. I have a set of timeseries data with Date and Temp cols., that I want to use for anomaly detection. I used the MinMax scaler on the data and I got an array normal_train_data with shape (200, 0).
Then I used the autoencoder which uses
keras.layers.Dense(128, activation ='sigmoid').
After that, when I call
history = model.fit(normal_train_data, normal_train_data, epochs= 50, batch_size=128, validation_data=(train_data_scaled[:,1:], train_data_scaled[:,1:]) ...)
I get the error:
ValueError: Dimensions must be equal but are 128 and 0 with input shapes: [?,128], [?,0].
As far as I understand, the input has shape (200, 0) and the output has shape (1, 128).
Can you help me fix this error, please? Thank you.
I tried to use tf.keras.layers.Flatten() in the encoder part. I am not sure if it's ok to use Dense layer or should I choose another.
I'm currently trying to train a custom model with tensorflow to detect 17 landmarks/keypoints on each of 2 hands shown in an image (fingertips, first knuckles, bottom knuckles, wrist, and palm), for 34 points (and therefore 68 total values to predict for x & y). However, I cannot get the model to converge, with the output instead being an array of points that are pretty much the same for every prediction.
I started off with a dataset that has images like this:
each annotated to have the red dots correlate to each keypoint. To expand the dataset to try to get a more robust model, I took photos of the hands with various backgrounds, angles, positions, poses, lighting conditions, reflectivity, etc, as exemplified by these further images:
I have about 3000 images created now, with the landmarks stored inside a csv as such:
I have a train-test split of .67 train and .33 test, with the images randomly assigned to each. I load the images with all 3 color channels, and scale both the color values & keypoint coordinates between 0 & 1.
I've tried a couple different approaches, each involving a CNN. The first keeps the images as they are, and uses a neural network model built as such:
model = Sequential()
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'same', activation = 'relu', input_shape = (225,400,3)))
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'same', activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2,2), strides = 2))
filters_convs = [(128, 2), (256, 3), (512, 3), (512,3)]
for n_filters, n_convs in filters_convs:
    for _ in np.arange(n_convs):
        model.add(Conv2D(filters = n_filters, kernel_size = (3,3), padding = 'same', activation = 'relu'))
    model.add(MaxPooling2D(pool_size = (2,2), strides = 2))
model.add(Dense(128, activation="relu"))
model.add(Dense(96, activation="relu"))
model.add(Dense(72, activation="relu"))
model.add(Dense(68, activation="sigmoid"))
opt = Adam(learning_rate=.0001)
model.compile(loss="mse", optimizer=opt, metrics=['mae'])
I've modified the various hyperparameters, yet nothing seems to make any noticeable difference.
The other thing I've tried is resizing the images to fit within a 224x224x3 array to use with a VGG-16 network, as such:
vgg = VGG16(weights="imagenet", include_top=False,
            input_tensor=Input(shape=(224, 224, 3)))
vgg.trainable = False
flatten = vgg.output
flatten = Flatten()(flatten)
points = Dense(256, activation="relu")(flatten)
points = Dense(128, activation="relu")(points)
points = Dense(96, activation="relu")(points)
points = Dense(68, activation="sigmoid")(points)
model = Model(inputs=vgg.input, outputs=points)
opt = Adam(learning_rate=.0001)
model.compile(loss="mse", optimizer=opt, metrics=['mae'])
This model has similar results to the first. No matter what I seem to do, I seem to get the same results, in that my mse loss minimizes around .009, with an mae around .07, no matter how many epochs I run:
Furthermore, when I run predictions based off the model it seems that the predicted output is basically the same for every image, with only slight variation between each. It seems the model predicts an array of coordinates that looks somewhat like what a splayed hand might, in the general areas hands might be most likely to be found. A catch-all solution to minimize deviation as opposed to a custom solution for each image. These images illustrate this, with the green being predicted points, and the red being the actual points for the left hand:
So, I was wondering what might be causing this, be it the model, the data, or both, because nothing I've tried with either modifying the model or augmenting the data seems to have done any good. I've even tried reducing the complexity to predict for one hand only, to predict a bounding box for each hand, and to predict a single keypoint, but no matter what I try, the results are pretty inaccurate.
Thus, any suggestions for what I could do to help the model converge to create more accurate & custom predictions for each image of hands it sees would be very greatly appreciated.
Usually, neural networks will have a very hard time to predict exact coordinates of landmarks. A better approach is probably a fully convolutional network. This would work as follows:
You omit the dense layers at the end and thus end up with an output of (m, n, n_filters) with m and n being the dimensions of your downsampled feature maps (since you use maxpooling at some earlier stage in the network they will be lower resolution than your input image).
You set n_filters for the last (output-)layer to the number of different landmarks you want to detect plus one more to indicate no landmark.
You remove some of the max pooling layers so that your final output has a fairly high resolution (so the m and n mentioned earlier are larger). Your output then has shape m x n x (n_landmarks+1), and each of the m x n vectors of dimension n_landmarks+1 indicates which landmark, if any, is present at the image position that corresponds to that cell of the m x n grid. The activation of your last (output) convolutional layer therefore needs to be a softmax, so that each vector represents class probabilities.
Now you can train your network to predict the landmarks locally without having to use dense layers.
This is a very simple architecture and for optimal results a more sophisticated architecture might be needed, but I think this should give you a first idea of a better approach than using the dense layers for the prediction.
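As a rough illustration of the target format described above, here is a NumPy sketch that turns normalized keypoints into an m x n x (n_landmarks+1) one-hot grid and decodes such a grid back into coordinates. The helper names, the grid size, and the choice of the last channel as the "no landmark" class are assumptions made for this example; they are not part of any library:

```python
import numpy as np

def keypoints_to_label_map(keypoints, grid_h, grid_w):
    """Encode (n_landmarks, 2) keypoints with x, y in [0, 1] as a one-hot
    grid of shape (grid_h, grid_w, n_landmarks + 1)."""
    n_landmarks = len(keypoints)
    background = n_landmarks  # assumed: last channel means "no landmark"
    labels = np.full((grid_h, grid_w), background, dtype=np.int64)
    for idx, (x, y) in enumerate(keypoints):
        col = min(int(x * grid_w), grid_w - 1)
        row = min(int(y * grid_h), grid_h - 1)
        labels[row, col] = idx
    # Index an identity matrix to get one one-hot vector per grid cell.
    return np.eye(n_landmarks + 1, dtype=np.float32)[labels]

def label_map_to_keypoints(probs):
    """Decode a (grid_h, grid_w, n_landmarks + 1) probability grid by taking,
    for each landmark channel, the center of its most likely cell."""
    grid_h, grid_w, n_classes = probs.shape
    n_landmarks = n_classes - 1
    flat = probs[..., :n_landmarks].reshape(-1, n_landmarks)
    rows, cols = np.unravel_index(flat.argmax(axis=0), (grid_h, grid_w))
    return np.stack([(cols + 0.5) / grid_w, (rows + 0.5) / grid_h], axis=1)

keypoints = np.array([[0.13, 0.27], [0.81, 0.64]])
target = keypoints_to_label_map(keypoints, grid_h=10, grid_w=20)
decoded = label_map_to_keypoints(target)
print(target.shape)  # (10, 20, 3)
```

A grid like target would be the training label for the softmax output, and the decoding step would be applied to the network's predicted probabilities. The recovered coordinates are only as precise as the grid resolution, which is one reason to keep m and n fairly large.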
As for why your network predicts the same values every time: this is probably because the architecture is not suited to the task, so the network cannot learn what you want it to learn. In that case, the network will just learn to predict a value that is fairly good for most of the images (so basically the "average" position of each landmark over all of your images).
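The "average" remark can be checked numerically: among all constant predictions, the per-coordinate mean of the targets gives the lowest MSE. The sketch below uses random stand-in targets (not the asker's data) purely to illustrate that property:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in targets: 1000 samples with 68 normalized coordinates each.
targets = rng.random((1000, 68))

def mse_of_constant(constant):
    """MSE obtained when the same vector is predicted for every sample."""
    return np.mean((targets - constant) ** 2)

mean_prediction = targets.mean(axis=0)
shifted_prediction = mean_prediction + 0.05  # any other constant does worse

print(mse_of_constant(mean_prediction) < mse_of_constant(shifted_prediction))  # True
```

A network that ignores its input and outputs this mean vector therefore sits at a plateau of the loss that is easy to reach and hard to leave.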
I want to increase the number of recurrent weights in an RNN or LSTM cell.
If you look at the code below you will see that the LSTM cell's input shape is (2,1), which means 2 timesteps and 1 feature.
%tensorflow_version 2.x
import tensorflow as tf
m = tf.keras.models.Sequential()
lstm = tf.keras.layers.LSTM(1, use_bias=False)
input = tf.keras.Input(shape=(2,1))
The output is
[array([[ 0.878217 , 0.89324415, 0.404307 , -1.0542995 ]], dtype=float32),
array([[-0.24181306, -0.341401 , 0.65207034, 0.63227856]], dtype=float32)]
4 weights for each feature, and 4 weights for previous outputs
Now if I change Input shape like this
input = tf.keras.Input(shape=(2,1))
then the output of get_weights function will be like this:
[array([[-0.9725287 , -0.90078545, 0.97881985, -0.9623983 ],
[-0.9644511 , 0.90705967, 0.05965471, 0.32613564]], dtype=float32),
array([[-0.24867296, -0.22346373, -0.6410606 , 0.69084513]], dtype=float32)]
Now my question is: how do I increase the number of weights in the second array, which keeps the (4,1) shape?
The idea is that I want the RNN or LSTM to take not only the previous output (moment t-1) but also earlier values (moments t-2, t-3, t-4).
Is there a way to do this in Keras with the TF backend?
I can't understand the change, I think you had a typo in your question, but:
Length - Time steps:
The number of time steps will never change the number of weights. The layer is "recurrent", meaning it will "loop" the time steps. It's not supposed to have different weights for each step.
The whole purpose of the layer is to apply the same operations over and over and over for each time step.
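A minimal NumPy sketch of this weight sharing, using a plain tanh RNN rather than a full LSTM to keep it short (the sizes and names are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, units = 1, 3

# One set of weights, created once and independent of the sequence length.
W = rng.normal(size=(input_dim, units))  # acts on the input at each step
U = rng.normal(size=(units, units))      # acts on the previous state

def run_rnn(sequence):
    """Loop over the time steps, reusing W and U at every step."""
    h = np.zeros(units)
    for x_t in sequence:
        h = np.tanh(x_t @ W + h @ U)
    return h

h_short = run_rnn(rng.normal(size=(2, input_dim)))   # 2 time steps
h_long = run_rnn(rng.normal(size=(50, input_dim)))   # 50 time steps

print(W.shape, U.shape)             # (1, 3) (3, 3)
print(h_short.shape, h_long.shape)  # (3,) (3,)
```

The same W and U process both the 2-step and the 50-step sequence, so the number of weights never depends on the length.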
Input features:
Input features are the last dimension of the input. They define one dimension of the weights.
Units = Output features:
Output features, also the last dimension of the output, are another dimension of the weights.
Two types of kernels
The LSTM layers have two groups of kernels:
What they call simply kernels - with shape=(input_dim, self.units * 4)
What they call recurrent kernels - with shape=(self.units, self.units * 4)
The first group acts on the input data, they have shape considering the input features and the output features.
The second group acts on inner states and have shapes considering only the output features (units).
From the source code:
self.kernel = self.add_weight(shape=(input_dim, self.units * 4),
self.recurrent_kernel = self.add_weight(
shape=(self.units, self.units * 4),
The last array in the list:
The last array in the list of weights holds the 4 recurrent kernels, each with shape (1, 1), grouped into one array.
You can increase the kernels with more input features. Transform Input((anything, 1)) into Input((anything, more)) for instance.
You can increase the kernels and the recurrent_kernels (and biases, when considered) with bigger output features. Transform LSTM(1, ...) into LSTM(more, ...)
Weights are independent of the length. It's even possible to have Input((None, 1)), meaning a variable length.
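The shape rules above can be written down as a tiny helper. The function name lstm_weight_shapes is hypothetical (it is not a Keras API); it only restates the kernel and recurrent kernel shapes quoted from the source code, plus a bias of length units * 4 when one is used:

```python
def lstm_weight_shapes(input_dim, units, use_bias=True):
    """Return the expected shapes of an LSTM layer's weights:
    kernel, recurrent kernel and (optionally) bias."""
    shapes = [
        (input_dim, units * 4),  # kernel: acts on the input features
        (units, units * 4),      # recurrent kernel: acts on the inner state
    ]
    if use_bias:
        shapes.append((units * 4,))
    return shapes

# One input feature, one unit: matches the first output in the question.
print(lstm_weight_shapes(1, 1, use_bias=False))  # [(1, 4), (1, 4)]
# Two input features: only the first array grows.
print(lstm_weight_shapes(2, 1, use_bias=False))  # [(2, 4), (1, 4)]
# Three units: both arrays grow.
print(lstm_weight_shapes(1, 3, use_bias=False))  # [(1, 12), (3, 12)]
```

Note that the number of time steps is not an argument at all, which is the point made above.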
Using more than just the last step
This should be automatic. LSTM layers are designed to have memory. The memory is an inner state that participates in all time steps. There are gates (the kernels) that decide how a new step will participate in this memory. Since all steps participate in the same memory, LSTM layer theoretically considers "all" time steps from the beginning.
So, you shouldn't really need to worry about this.
But if you do want this, there are maybe two ways. Don't know if they will bring any improvement, though.
One is to concatenate shifted inputs as features:
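import tensorflow as tf
from tensorflow.keras.layers import Lambda, LSTM

def pad_and_shift(x):
    steps = 3
    # Pad the front of the time axis so every shifted copy keeps the original length.
    paddings = tf.constant([[0, 0], [steps - 1, 0], [0, 0]])
    x = tf.pad(x, paddings)
    # One slice per lag, oldest first; the final slice is the unshifted input.
    to_concat = [x[:, i:i - steps + 1] for i in range(steps - 1)]
    to_concat.append(x[:, steps - 1:])
    return tf.concat(to_concat, axis=-1)

given_inputs = ....
out = Lambda(pad_and_shift)(given_inputs)
out = LSTM(units, ...)(out)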
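To make the effect of the shifting concrete, here is a NumPy re-implementation of the same idea (pad_and_shift_np is a name chosen for this sketch) applied to a tiny sequence with one feature:

```python
import numpy as np

def pad_and_shift_np(x, steps=3):
    """For input of shape (batch, length, features), return an array of shape
    (batch, length, features * steps) holding the values at t-2, t-1 and t."""
    length = x.shape[1]
    # Zero-pad the front of the time axis by steps - 1 positions.
    padded = np.pad(x, [(0, 0), (steps - 1, 0), (0, 0)])
    # Window i starts i positions into the padded sequence.
    shifted = [padded[:, i:i + length] for i in range(steps)]
    return np.concatenate(shifted, axis=-1)

x = np.arange(1, 5, dtype=np.float32).reshape(1, 4, 1)  # sequence 1, 2, 3, 4
out = pad_and_shift_np(x)
print(out[0])
# [[0. 0. 1.]
#  [0. 1. 2.]
#  [1. 2. 3.]
#  [2. 3. 4.]]
```

Each time step now carries its own value together with the two preceding ones as extra features, with zeros where no earlier value exists.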
The other involves editing the source code of the LSTM, which would be very complicated and probably not worth the effort.
Would the code below represent one or two layers? I'm confused because isn't there also supposed to be an input layer in a neural net?
input_layer = slim.fully_connected(input, 6000, activation_fn=tf.nn.relu)
output = slim.fully_connected(input_layer, num_output)
Does that contain a hidden layer? I'm just trying to be able to visualize the net. Thanks in advance!
You have a neural network with one hidden layer. In your code, input corresponds to the 'Input' layer in the above image. input_layer is what the image calls 'Hidden'. output is what the image calls 'Output'.
Remember that the "input layer" of a neural network isn't a traditional fully-connected layer since it's just raw data without an activation. It's a bit of a misnomer. Those neurons in the picture above in the input layer are not the same as the neurons in the hidden layer or output layer.
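A NumPy sketch of the same structure may help with the visualization. The sizes are shrunk (the question uses 6000 hidden units), and the output is left without an activation for simplicity, which is an assumption of this sketch rather than a statement about slim's defaults:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_output = 10, 6, 3

# "Input layer": just the raw data, with no weights and no activation.
x = rng.normal(size=(5, n_features))

# Hidden layer: the first fully connected layer, followed by ReLU.
W_hidden = rng.normal(size=(n_features, n_hidden))
b_hidden = np.zeros(n_hidden)
hidden = np.maximum(x @ W_hidden + b_hidden, 0.0)

# Output layer: the second fully connected layer.
W_out = rng.normal(size=(n_hidden, n_output))
b_out = np.zeros(n_output)
output = hidden @ W_out + b_out

print(x.shape, hidden.shape, output.shape)  # (5, 10) (5, 6) (5, 3)
```

There are only two weight matrices, one per fully_connected call, which is why the network counts as having a single hidden layer.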
From tensorflow-slim:
Furthermore, TF-Slim's slim.stack operator allows a caller to repeatedly apply the same operation with different arguments to create a stack or tower of layers. slim.stack also creates a new tf.variable_scope for each operation created. For example, a simple way to create a Multi-Layer Perceptron (MLP):
# Verbose way:
x = slim.fully_connected(x, 32, scope='fc/fc_1')
x = slim.fully_connected(x, 64, scope='fc/fc_2')
x = slim.fully_connected(x, 128, scope='fc/fc_3')
# Equivalent, TF-Slim way using slim.stack:
slim.stack(x, slim.fully_connected, [32, 64, 128], scope='fc')
So the network mentioned here is a [32, 64,128] network - a layer with a hidden size of 64.
import numpy as np
import tensorflow as tf

weights = tf.placeholder("float",[5,5,1,1])
imagein = tf.placeholder("float",[1,32,32,1])
conv = tf.nn.conv2d(imagein,weights,strides=[1,1,1,1],padding="SAME")
deconv = tf.nn.conv2d_transpose(conv, weights, [1,32,32,1], [1,1,1,1],padding="SAME")
dw = np.random.rand(5,5,1,1)
noise = np.random.rand(1,32,32,1)
sess = tf.InteractiveSession()
convolved = conv.eval(feed_dict={imagein: noise, weights: dw})
deconvolved = deconv.eval(feed_dict={imagein: noise, weights: dw})
I've been trying to figure out conv2d_transpose in order to reverse a convolution in Tensorflow. My understanding is that "deconvolved" should contain the same data as "noise" after applying a normal convolution and then its transpose, but "deconvolved" just contains some completely different image. Is there something wrong with my code, or is the theory incorrect?
There's a reason it's called conv2d_transpose rather than deconv2d: it isn't deconvolution. Convolution isn't an orthogonal transformation, so its inverse (deconvolution) isn't the same as its transpose (conv2d_transpose).
Your confusion is understandable: calling the transpose of convolution "deconvolution" has been standard neural network practice for years. I am happy that we were able to fix the name to be mathematically correct in TensorFlow; more details here:
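The difference is easy to see in one dimension, where a zero-padded ("SAME") convolution is just multiplication by a matrix. The following NumPy sketch builds that matrix for an arbitrary 3-tap kernel (the kernel values and the helper name conv_matrix are assumptions for illustration) and compares the transpose with the true inverse:

```python
import numpy as np

def conv_matrix(kernel, n):
    """Matrix A such that A @ x is the zero-padded, same-length
    sliding-window filtering of a length-n signal x with the kernel."""
    A = np.zeros((n, n))
    half = len(kernel) // 2
    for i in range(n):
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                A[i, j] = w
    return A

rng = np.random.default_rng(0)
A = conv_matrix([0.5, 1.0, 0.25], n=8)
signal = rng.random(8)

convolved = A @ signal
via_transpose = A.T @ convolved              # what a transposed convolution computes
via_inverse = np.linalg.solve(A, convolved)  # an actual deconvolution

print(np.allclose(via_transpose, signal))  # False
print(np.allclose(via_inverse, signal))    # True
```

Applying the transpose gives back an array of the original shape but not the original values; only the inverse recovers the signal, and only because this particular A happens to be invertible.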