Having trouble introducing a constant matrix to multiplication in Tensorflow - tensorflow

I am trying to implement a layer that is not fully connected. I have a matrix that specifies the connectivity I desire in the variable connectivity_matrix, which is a numpy array of ones and zeros.
The way I am currently trying to impliment the layer is by pairwise multiplying the weights, by this connectivity matrix F:
Is this the correct way to do this in tensorflow? Here is what I have so far
import numpy as np
import tensorflow as tf
import tflearn
num_input = 10
num_layer1 = 313
num_output = 700
# For example:
connectivity_matrix = np.array(np.random.choice([0, 1], size=(num_layer1, num_output)), dtype='float32')
input = tflearn.input_data(shape=[None, num_input])
# Here is where I specify the connectivity in tensorflow
connectivity = tf.constant(connectivity_matrix, shape=[num_layer1, num_output])
# One basic, fully connected layer
layer1 = tflearn.fully_connected(input, num_layer1, activation='relu')
# Here is where I want to have a non-fully connected layer
W = tf.Variable(tf.random_uniform([num_layer1, num_output]))
b = tf.Variable(tf.zeros([num_output]))
# so take a fully connected W, and do a pairwise multiplication with my tf_connectivity matrix
W_filtered = tf.mul(connectivity, W)
output = tf.matmul(layer1, W_filtered) + b

Masking out unwanted connections in each iteration should work, but I am not sure what the convergence properties are like. It may okay for a small enough learning rate?
Another approach would be to penalize unwanted weights in the cost function. You would use a mask matrix with 1's at unwanted connections, and 0's at wanted ones (or have a smoother transition). This would be multiplied by weights, squared/scaled and added to the cost function. This should converge more smoothly.
P.S.: If you've made progress on this, it would be great to hear your comments as I am also working on this problem.

Related

Why can't I classify my data perfectly on this simple problem using a NN?

I have a set of observations made of 10 features, each of these features being a real number in the interval (0,2). Say I wanted to train a simple neural network to classify whether the average of those features is above or below 1.0.
Unless I'm missing something, it should be enough with a two-layer network with one neuron on each layer. The activation functions would be a linear one (i.e. no activation function) on the first layer and a sigmoid on the output layer. An example of a NN with this architecture that would work is one that calculates the average on the first layer (i.e. all weights = 0.1 and bias=0) and asseses whether that is above or below 1.0 in the second layer (i.e. weight = 1.0 and bias = -1.0).
When I implement this using TensorFlow (see code below), I obviously get a very high accuracy quite quickly, but never get to 100% accuracy... I would like some help to understand conceptually why this is the case. I don't see why the backppropagation algorithm does not reach a set of optimal weights (may be this is related with the loss function I'm using, which has local minmums?). Also I would like to know whether a 100% accuracy is achievable if I use different activations and/or loss function.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
X = [np.random.random(10)*2.0 for _ in range(10000)]
X = np.array(X)
y = X.mean(axis=1) >= 1.0
y = y.astype('int')
train_ratio = 0.8
train_len = int(X.shape[0]*0.8)
X_train, X_test = X[:train_len,:], X[train_len:,:]
y_train, y_test = y[:train_len], y[train_len:]
def create_classifier(lr = 0.001):
classifier = tf.keras.Sequential()
classifier.add(tf.keras.layers.Dense(units=1))
classifier.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))#, input_shape=input_shape))
optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
metrics=[tf.keras.metrics.BinaryAccuracy()],
classifier.compile(optimizer=optimizer, loss=tf.keras.losses.BinaryCrossentropy(from_logits=False), metrics=metrics)
return classifier
classifier = create_classifier(lr = 0.1)
history = classifier.fit(X_train, y_train, batch_size=1000, validation_split=0.1, epochs=2000)
Ignoring the fact that a neural network is an odd approach for this problem, and answering your specific question - it looks like your learning rate might be too high which could explain the fluctuations around the optimal point.

tf.keras.layers.BatchNormalization with trainable=False appears to not update its internal moving mean and variance

I am trying to find out, how exactly does BatchNormalization layer behave in TensorFlow. I came up with the following piece of code which to the best of my knowledge should be a perfectly valid keras model, however the mean and variance of BatchNormalization doesn't appear to be updated.
From docs https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization
in the case of the BatchNormalization layer, setting trainable = False on the layer means that the layer will be subsequently run in inference mode (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch).
I expect the model to return a different value with each subsequent predict call.
What I see, however, are the exact same values returned 10 times.
Can anyone explain to me why does the BatchNormalization layer not update its internal values?
import tensorflow as tf
import numpy as np
if __name__ == '__main__':
np.random.seed(1)
x = np.random.randn(3, 5) * 5 + 0.3
bn = tf.keras.layers.BatchNormalization(trainable=False, epsilon=1e-9)
z = input = tf.keras.layers.Input([5])
z = bn(z)
model = tf.keras.Model(inputs=input, outputs=z)
for i in range(10):
print(x)
print(model.predict(x))
print()
I use TensorFlow 2.1.0
Okay, I found the mistake in my assumptions. The moving average is being updated during training not during inference as I thought. This makes perfect sense, as updating the moving averages during inference would likely result in an unstable production model (for example a long sequence of highly pathological input samples [e.g. such that their generating distribution differs drastically from the one on which the network was trained] could potentially bias the network and result in worse performance on valid input samples).
The trainable parameter is useful when you're fine-tuning a pretrained model and want to freeze some of the layers of the network even during training. Because when you call model.predict(x) (or even model(x) or model(x, training=False)), the layer automatically uses the moving averages instead of batch averages.
The code below demonstrates this clearly
import tensorflow as tf
import numpy as np
if __name__ == '__main__':
np.random.seed(1)
x = np.random.randn(10, 5) * 5 + 0.3
z = input = tf.keras.layers.Input([5])
z = tf.keras.layers.BatchNormalization(trainable=True, epsilon=1e-9, momentum=0.99)(z)
model = tf.keras.Model(inputs=input, outputs=z)
# a dummy loss function
model.compile(loss=lambda x, y: (x - y) ** 2)
# a dummy fit just to update the batchnorm moving averages
model.fit(x, x, batch_size=3, epochs=10)
# first predict uses the moving averages from training
pred = model(x).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
# outputs the same thing as previous predict
pred = model(x).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
# here calling the model with training=True results in update of moving averages
# furthermore, it uses the batch mean and variance as in training,
# so the result is very different
pred = model(x, training=True).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
# here we see again that the moving averages are used but they differ slightly after
# the previous call, as expected
pred = model(x).numpy()
print(pred.mean(axis=0))
print(pred.var(axis=0))
print()
In the end, I found that the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization) mentions this:
When performing inference using a model containing batch normalization, it is generally (though not always) desirable to use accumulated statistics rather than mini-batch statistics. This is accomplished by passing training=False when calling the model, or using model.predict.
Hopefully this will help someone with similar misunderstanding in the future.

Neural network only converges when data cloud is close to 0

I am new to tensorflow and am learning the basics at the moment so please bear with me.
My problem concerns strange non-convergent behaviour of neural networks when presented with the supposedly simple task of finding a regression function for a small training set consisting only of m = 100 data points {(x_1, y_1), (x_2, y_2),...,(x_100, y_100)}, where x_i and y_i are real numbers.
I first constructed a function that automatically generates a computational graph corresponding to a classical fully connected feedforward neural network:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import math
def neural_network_constructor(arch_list = [1,3,3,1],
act_func = tf.nn.sigmoid,
w_initializer = tf.contrib.layers.xavier_initializer(),
b_initializer = tf.zeros_initializer(),
loss_function = tf.losses.mean_squared_error,
training_method = tf.train.GradientDescentOptimizer(0.5)):
n_input = arch_list[0]
n_output = arch_list[-1]
X = tf.placeholder(dtype = tf.float32, shape = [None, n_input])
layer = tf.contrib.layers.fully_connected(
inputs = X,
num_outputs = arch_list[1],
activation_fn = act_func,
weights_initializer = w_initializer,
biases_initializer = b_initializer)
for N in arch_list[2:-1]:
layer = tf.contrib.layers.fully_connected(
inputs = layer,
num_outputs = N,
activation_fn = act_func,
weights_initializer = w_initializer,
biases_initializer = b_initializer)
Phi = tf.contrib.layers.fully_connected(
inputs = layer,
num_outputs = n_output,
activation_fn = tf.identity,
weights_initializer = w_initializer,
biases_initializer = b_initializer)
Y = tf.placeholder(tf.float32, [None, n_output])
loss = loss_function(Y, Phi)
train_step = training_method.minimize(loss)
return [X, Phi, Y, train_step]
With the above default values for the arguments, this function would construct a computational graph corresponding to a neural network with 1 input neuron, 2 hidden layers with 3 neurons each and 1 output neuron. The activation function is per default the sigmoid function. X corresponds to the input tensor, Y to the labels of the training data and Phi to the feedforward output of the neural network. The operation train_step performs one gradient-descent step when executed in the session environment.
So far, so good. If I now test a particular neural network (constructed with this function and the exact default values for the arguments given above) by making it learn a simple regression function for artificial data extracted from a sinewave, strange things happen:
Before training, the network seems to be a flat line. After 100.000 training iterations, it manages to partially learn the function, but only the part which is closer to 0. After this, it becomes flat again. Further training does not decrease the loss function anymore.
This get even stranger, when I take the exact same data set, but shift all x-values by adding 500:
Here, the network completely refuses to learn. I cannot understand why this is happening. I have tried changing the architecture of the network and its learning rate, but have observed similar effects: the closer the x-values of the data cloud are to the origin, the easier the network can learn. After a certain distance to the origin, learning stops completely. Changing the activation function from sigmoid to ReLu has only made things worse; here, the network tends to just converge to the average, no matter what position the data cloud is in.
Is there something wrong with my implementation of the neural-network-constructor? Or does this have something do do with initialization values? I have tried to get a deeper understanding of this problem now for quite a while and would greatly appreciate some advice. What could be the cause of this? All thoughts on why this behaviour is occuring are very much welcome!
Thanks,
Joker

Time series translation into Keras input

I have timeseries data (ECG). I have annotations for blocks of 30seconds.
each block has 1000 data points. We have 500 of those data blocks.
The target, the annotations are e.g. in range 1 to 5.
To be clear please see Figure
About X-DATA
How translate that into the Keras notation for input data [Samples,timesteps, features]?
My guess:
Samples=Blocks (500)
timesteps=values(1000)
features= ECG as itselve (1)
resulting in [500,1000,1]
About Y-Data(target)
My target or y data would result in
[500,1,1]
after one hot encoding it would be
[500,5,1]
The problem is that Keras expect the X and y data to be of same dimensions. But increasing my ydata to 1000 per timestep would not make sense to me.
Thanks for your help
p.s. cannot answer directly as I am with my parent in law. Thanks in advance
I think you're thinking about y incorrectly. From my understanding based on you're graph.
y actually is (500, 5) after one hot encoding. That is, for every block there is a single outcome.
Also there is no need for X and y to have the same dimensions in Keras (unless you have a seq2seq requirement which is not the case here).
What we do want is the model to give us a probability distribution over
the possible labels for each block, and that we'll achieve using a softmax
on the last (Dense) layer.
Here is how I simulated your problem:
import numpy as np
from keras.models import Model
from keras.layers import Dense, LSTM
# using eye doesn't capture one-hot but works for the example
series = np.random.rand(500, 1000, 1)
labels = np.eye(500, 5)
inp = Input(shape=(1000, 1))
lstm = LSTM(128)(inp)
out = Dense(5, activation='softmax')(lstm)
model = Model(inputs=[inp], outputs=[out])
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(series, labels)

Keras ML library: how to do weight clipping after gradient updates? TensorFlow backend

I'm trying to use Keras to implement part of an algorithm that requires weight clipping, i.e. limiting the weight values after a gradient update. I haven't found any solutions through web searches so far.
For background, this has to do with the WGANs algorithm:
https://arxiv.org/pdf/1701.07875.pdf
If you look at algorithm 1 on page 8, you'll see the following:
I've highlighted the lines that I'm trying to implement in Keras: after computing a gradient to use to update the weights in the network, I want to make sure that all the weights are clipped between some values [-c, c] that I can set.
How could I go about doing this in Keras?
For reference I am using the TensorFlow backend. I don't mind digging into things and adding messy quick-fixes for now.
While creating the optimizer object set param clipvalue. It will do precisely what you want.
# all parameter gradients will be clipped to
# a maximum value of 0.5 and
# a minimum value of -0.5.
rsmprop = RMSprop(clipvalue=0.5)
and then use this object to for model compiling
model.compile(loss='mse', optimizer=rsmprop)
For more reference check: here.
Also, I prefer to use clipnorm over clipvalue because with clipnorm the optimization remains stable. For example say you have 2 parameters and the gradients came out to be [0.1, 3]. By using clipvalue the gradients will become [0.1, 0.5] ie there are chances that the direction of steepest decent can get changed drastically. While clipnorm don't have similar problem as all the gradients will be appropriately scaled and the direction will be preserved and all the while ensuring the constraint on the magnitude of the gradient.
Edit: The question asks weights clipping not gradient clipping:
Gradiant clipping on weights is not part of keras code. But maxnorm on weights constraints is. Check here.
Having said that it can be easily implemented. Here is a very small example:
from keras.constraints import Constraint
from keras import backend as K
class WeightClip(Constraint):
'''Clips the weights incident to each hidden unit to be inside a range
'''
def __init__(self, c=2):
self.c = c
def __call__(self, p):
return K.clip(p, -self.c, self.c)
def get_config(self):
return {'name': self.__class__.__name__,
'c': self.c}
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(30, input_dim=100, W_constraint = WeightClip(2)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='rmsprop')
X = np.random.random((1000,100))
Y = np.random.random((1000,1))
model.fit(X,Y)
I have tested the running of the above code, but not the validity of the constraints. You can do so by getting the model weights after training using model.get_weights() or model.layers[idx].get_weights() and checking whether its abiding the constraints.
Note: The constrain is not added to all the model weights .. but just to the weights of the specific layer its used and also W_constraint adds constrain to W param and b_constraint to b (bias) param