tflearn multi layer perceptron with unexpected prediction - tensorflow

I would like to rebuild a MLP I implemented first with scikit-learn's MLPRegressor with tflearn.
sklearn.neural_network.MLPRegressor implementation:
train_data = pd.read_csv('train_data.csv', delimiter = ';', decimal = ',', header = 0)
test_data = pd.read_csv('test_data.csv', delimiter = ';', decimal = ',', header = 0)
X_train = np.array(train_data.drop(['output'], 1))
X_scaler = StandardScaler()
X_scaler.fit(X_train)
X_train = X_scaler.transform(X_train)
Y_train = np.array(train_data['output'])
clf = MLPRegressor(activation = 'tanh', solver='lbfgs', alpha=0.0001, hidden_layer_sizes=(3))
clf.fit(X_train, Y_train)
prediction = clf.predict(X_train)
The model worked and I got an accuracy of 0.85. Now I would like to build a similar MLP with tflearn. I started with the following code:
train_data = pd.read_csv('train_data.csv', delimiter = ';', decimal = ',', header = 0)
test_data = pd.read_csv('test_data.csv', delimiter = ';', decimal = ',', header = 0)
X_train = np.array(train_data.drop(['output'], 1))
X_scaler = StandardScaler()
X_scaler.fit(X_train)
X_train = X_scaler.transform(X_train)
Y_train = np.array(train_data['output'])
Y_scaler = StandardScaler()
Y_scaler.fit(Y_train)
Y_train = Y_scaler.transform(Y_train.reshape((-1,1)))
net = tfl.input_data(shape=[None, 6])
net = tfl.fully_connected(net, 3, activation='tanh')
net = tfl.fully_connected(net, 1, activation='sigmoid')
net = tfl.regression(net, optimizer='sgd', loss='mean_square', learning_rate=3.)
clf = tfl.DNN(net)
clf.fit(X_train, Y_train, n_epoch=200, show_metric=True)
prediction = clf.predict(X_train)
At some point I definitely configured something the wrong way because the prediction is way off. The range of Y_train is between 20 and 88 and the prediction shows numbers around 0.005. In the tflearn documentation I just found examples for classification.
UPDATE 1:
I realized that the regression layer uses by default 'categorical_crossentropy' as loss-function which is for classification problems. So I selected 'mean_square' instead. I also tried to normalize Y_train. The prediction still not even matches the range of Y_train. Any thoughts?
FINAL UPDATE:
Take a look at the accepted answer.

One step should be not to scale the output.
I am also working on regression problem and I scale only the inputs and it work fine with some neural networks. Although if I use tflearn I get wrong predictions.

I made a couple of actually really dumb mistakes.
First of all I scalled the output to the interval 0 to 1 but used in the output-layer the activatuion function tanh which delivers values from -1 to 1. So I had to use either an activation function that outputs values between 0 and 1 (like e.g. sigmoid) or linear without any scaling applied.
Secondly and most importantly, for my data I chose a pretty bad combination for learning rate and n_epoch. I didn't specify any learning rate and the default one is 0.1, I think. In any case it was too small (I end up using 3.0). At the same time the epoch count (10) was also far too small, with 200 it worked fine.
I also explicitly chose sgd as optimizer (default: adam), which turned out to work a lot better.
I updated the code in my question.

Related

VAE reconstruction loss (MSE) not decreasing, but KL Divergence is

I've been trying to create an LSTM VAE to reconstruct multivariate time-series data on Tensorflow. To start off I attempted to adapt (changed to Functional API, changed layers) the approach taken here and came up with the following code:
input_shape = 13
latent_dim = 2
prior = tfd.Independent(tfd.Normal(loc=tf.zeros(latent_dim), scale=1), reinterpreted_batch_ndims=1)
input_enc = Input(shape=[512, input_shape])
lstm1 = LSTM(latent_dim * 16, return_sequences=True)(input_enc)
lstm2 = LSTM(latent_dim * 8, return_sequences=True)(lstm1)
lstm3 = LSTM(latent_dim * 4, return_sequences=True)(lstm2)
lstm4 = LSTM(latent_dim * 2, return_sequences=True)(lstm3)
lstm5 = LSTM(latent_dim, return_sequences=True)(lstm4)
lat = Dense(tfpl.MultivariateNormalTriL.params_size(latent_dim))(lstm5)
reg = tfpl.MultivariateNormalTriL(latent_dim, activity_regularizer= tfpl.KLDivergenceRegularizer(prior, weight=1.0))(lat)
lstm6 = LSTM(latent_dim, return_sequences=True)(reg)
lstm7 = LSTM(latent_dim * 2, return_sequences=True)(lstm6)
lstm8 = LSTM(latent_dim * 4, return_sequences=True)(lstm7)
lstm9 = LSTM(latent_dim * 8, return_sequences=True)(lstm8)
lstm10 = LSTM(latent_dim * 16, return_sequences=True)(lstm9)
output_dec = TimeDistributed(Dense(input_shape))(lstm10)
enc = Model(input_enc, reg)
vae = Model(input_enc, output_dec)
vae.compile(optimizer='adam',
loss='mse',
metrics='mse'
)
es = callbacks.EarlyStopping(monitor='val_loss',
mode='min',
verbose=1,
patience=5,
restore_best_weights=True,
)
vae.fit(tf_train,
epochs=1000,
callbacks=[es],
validation_data=tf_val,
shuffle=True
)
By observing the MSE as a metric I've noticed that it does not change during training, only the divergence does down. Then I set the activity_regularizer argument to None and, indeed, the MSE did go down. So it seems that the KL Divergence is preventing the reconstruction error from being optimised for.
Why is that? Am I doing anything obviously wrong?
Any help greatly appreciated!
(I'm aware the latent dimension is rather small, I set it to two to easily visualise it, though this behaviour still occurs with larger latent dimensions, hence I don't think the problem lies there.)
Could it be that you are using an Autoencoder and in the loss there is a KL Divergence term? In a (Beta-) VAE the loss is Loss = MSE + beta * KL .
Since beta = 1 would be a normal VAE you could try to make beta smaller then one. This should give more wheight to the MSE and less to the KL divergence. This should help the reconstruction but is bad if you would like to have a disentangled latent space.

How to correctly ignore padded or missing timesteps at decoding time in multi-feature sequences with LSTM autonecoder

I am trying to learn a latent representation for text sequence (multiple features (3)) by doing reconstruction USING AUTOENCODER. As some of the sequences are shorter than the maximum pad length or a number of time steps I am considering (seq_length=15), I am not sure if reconstruction will learn to ignore the timesteps or not for calculating loss or accuracies.
I followed suggestions from this answer to crop the outputs but my losses are nan and several of accuracies as well.
input1 = keras.Input(shape=(seq_length,),name='input_1')
input2 = keras.Input(shape=(seq_length,),name='input_2')
input3 = keras.Input(shape=(seq_length,),name='input_3')
input1_emb = layers.Embedding(70,32,input_length=seq_length,mask_zero=True)(input1)
input2_emb = layers.Embedding(462,192,input_length=seq_length,mask_zero=True)(input2)
input3_emb = layers.Embedding(84,36,input_length=seq_length,mask_zero=True)(input3)
merged = layers.Concatenate()([input1_emb, input2_emb,input3_emb])
activ_func = 'tanh'
encoded = layers.LSTM(120,activation=activ_func,input_shape=(seq_length,),return_sequences=True)(merged) #
encoded = layers.LSTM(60,activation=activ_func,return_sequences=True)(encoded)
encoded = layers.LSTM(15,activation=activ_func)(encoded)
# Decoder reconstruct inputs
decoded1 = layers.RepeatVector(seq_length)(encoded)
decoded1 = layers.LSTM(60, activation= activ_func , return_sequences=True)(decoded1)
decoded1 = layers.LSTM(120, activation= activ_func , return_sequences=True,name='decoder1_last')(decoded1)
Decoder one has an output shape of (None, 15, 120).
input_copy_1 = layers.TimeDistributed(layers.Dense(70, activation='softmax'))(decoded1)
input_copy_2 = layers.TimeDistributed(layers.Dense(462, activation='softmax'))(decoded1)
input_copy_3 = layers.TimeDistributed(layers.Dense(84, activation='softmax'))(decoded1)
For each output, I am trying to crop the O padded timesteps as suggested by this answer. padding has 0 where actual input was missing (had zero due to padding) and 1 otherwise
#tf.function
def cropOutputs(x):
#x[0] is softmax of respective feature (time distributed) on top of decoder
#x[1] is the actual input feature
padding = tf.cast( tf.not_equal(x[1][1],0), dtype=tf.keras.backend.floatx())
print(padding)
return x[0]*tf.tile(tf.expand_dims(padding, axis=-1),tf.constant([1,x[0].shape[2]], tf.int32))
Applying crop function to all three outputs.
input_copy_1 = layers.Lambda(cropOutputs, name='input_copy_1', output_shape=(None, 15, 70))([input_copy_1,input1])
input_copy_2 = layers.Lambda(cropOutputs, name='input_copy_2', output_shape=(None, 15, 462))([input_copy_2,input2])
input_copy_3 = layers.Lambda(cropOutputs, name='input_copy_3', output_shape=(None, 15, 84))([input_copy_3,input3])
My logic is to crop timesteps of each feature (all 3 features for sequence have the same length, meaning they miss timesteps together). But for timestep, they have been applied softmax as per their feature size (70,462,84) so I have to zero out timestep by making a multi-dimensional mask array of zeros or ones equal to this feature size with help of mask padding, and multiply by respective softmax representation using this using multi-dimensional mask array.
I am not sure I am doing this right or not as I have Nan losses for these inputs as well as other accuracies have that I am learning jointly with this task (it happens only with this cropping thing).
If it helps someone, I end up cropping the padded entries from the loss directly (taking some keras code pointer from these answers).
#tf.function
def masked_cc_loss(y_true, y_pred):
mask = tf.keras.backend.all(tf.equal(y_true, masked_val_hotencoded), axis=-1)
mask = 1 - tf.cast(mask, tf.keras.backend.floatx())
loss = tf.keras.losses.CategoricalCrossentropy()(y_true, y_pred) * mask
return tf.keras.backend.sum(loss) / tf.keras.backend.sum(mask) # averaging by the number of unmasked entries

tensorflow keras model for solving simple equation

I'm trying to understand how to create a simple tensorflow 2.2 keras model that can predict a simple function value:
f(a, b, c, d) = a < b : max(a/2, c/3) : max (b/2, d/3)
I know this exact question can be reduced to a categorical classification but my intention is to find a good way to build a model that can estimate the value and build more and more functions based on that with a more and more complex conditions later on.
For start I am stumbled upon understanding why a simple function works that hard.
For using with tensorflow on a created model I have:
def generate_input(multiplier):
return np.random.rand(1024 * multiplier, 4) * 1000
def generate_output(input):
def compute_func(row):
return max(row[0]/2, row[2]/3) if row[0] < row[1] else max(row[1]/2, row[3]/3)
return np.apply_along_axis(compute_func, 1, input)
for epochs in range(0, 512):
# print('Generating data...')
train_input = generate_input(1000)
train_output = generate_output(train_input)
# print('Training...')
fit_history = model.fit(
train_input, train_output,
epochs=1,
batch_size=1024
)
I have tried with different models that are less or more complex but I still didn't got a good conversion.
For example a simple liniar one:
input = Input(shape=(4,))
layer = Dense(8, activation=tanh)(input)
layer = Dense(16, activation=tanh)(layer)
layer = Dense(32, activation=tanh)(layer)
layer = Dense(64, activation=tanh)(layer)
layer = Dense(128, activation=tanh)(layer)
layer = Dense(32, activation=tanh)(layer)
layer = Dense(8, activation=tanh)(layer)
output = Dense(1)(layer)
model = Model(inputs=input, outputs=output)
model.compile(optimizer=Adam(), loss=mean_squared_error)
Can you give point to the direction one should follow in order to solve this type of conditional functions?
Or do I miss some pre-processing?
In my honest opinion, you have a pretty deep model, and therefore, you do not have enough data to train. I do not think you will need that much deep architecture.
Your problem definition is not what I would have done. You actually do not desire to generate the max value at the output, but you want the max value to get selected, right? If it is the case, I would go with a multiclass classification instead of a regression problem in my design. That's saying, I would go with an output = Dense(4)(layer,activation=softmax) as the last layer and in my optimizer, I would use a categorical cross-entropy. Of course, in the output generation, you need to manage to return an array of 3 zeros and one 1, something like this:
def compute_func(row):
ret_value=[0,0,0,0]
if row[0] < row[1]:
if row[0] < row[2]:
ret_value[2]=1
else:
ret_value[0]=1
else:
if row[1]< row[3]:
ret_value[3]=1
else:
ret_value[1]=1
return ret_value

Tensorflow Quantum: PQC not optimizing

I have followed the tutorial available at: https://www.tensorflow.org/quantum/tutorials/mnist. I have modified this tutorial to the simplest example I could think of: an input set in which x increases linearly from 0 to 1 and y = x < 0.3. I then use a PQC with a single Rx gate with a symbol, and a readout using a Z gate.
When retrieving the optimized symbol and adjusting it manually, I can easily find a value that provides 100% accuracy, but when I let the Adam optimizer run, it converges to either always predict 1 or always predict -1. Does anybody spot what I do wrong? (and I apologize for not being able to break down the code to a smaller example)
import tensorflow as tf
import tensorflow_quantum as tfq
import cirq
import sympy
import numpy as np
# used to embed classical data in quantum circuits
def convert_to_circuit_cont(image):
"""Encode truncated classical image into quantum datapoint."""
values = np.ndarray.flatten(image)
qubits = cirq.GridQubit.rect(4, 1)
circuit = cirq.Circuit()
for i, value in enumerate(values):
if value:
circuit.append(cirq.rx(value).on(qubits[i]))
return circuit
# define classical dataset
length = 1000
np.random.seed(42)
# create a linearly increasing set for x from 0 to 1 in 1/length steps
x_train_sorted = np.asarray([[x/length] for x in range(0,length)], dtype=np.float32)
# p is used to shuffle x and y similarly
p = np.random.permutation(len(x_train_sorted))
x_train = x_train_sorted[p]
# y = x < 0.3 in {-1, 1} for Hinge loss
y_train_sorted = np.asarray([1 if (x/length)<0.30 else -1 for x in range(0,length)])
y_train = y_train_sorted[p]
# test == train for this example
x_test = x_train_sorted[:]
y_test = y_train_sorted[:]
# convert classical data into quantum circuits
x_train_circ = [convert_to_circuit_cont(x) for x in x_train]
x_test_circ = [convert_to_circuit_cont(x) for x in x_test]
x_train_tfcirc = tfq.convert_to_tensor(x_train_circ)
x_test_tfcirc = tfq.convert_to_tensor(x_test_circ)
# define the PQC circuit, consisting out of 1 qubit with 1 gate (Rx) and 1 parameter
def create_quantum_model():
data_qubits = cirq.GridQubit.rect(1, 1)
circuit = cirq.Circuit()
a = sympy.Symbol("a")
circuit.append(cirq.rx(a).on(data_qubits[0])),
return circuit, cirq.Z(data_qubits[0])
model_circuit, model_readout = create_quantum_model()
# Build the Keras model.
model = tf.keras.Sequential([
# The input is the data-circuit, encoded as a tf.string
tf.keras.layers.Input(shape=(), dtype=tf.string),
# The PQC layer returns the expected value of the readout gate, range [-1,1].
tfq.layers.PQC(model_circuit, model_readout),
])
# used for logging progress during optimization
def hinge_accuracy(y_true, y_pred):
y_true = tf.squeeze(y_true) > 0.0
y_pred = tf.squeeze(y_pred) > 0.0
result = tf.cast(y_true == y_pred, tf.float32)
return tf.reduce_mean(result)
# compile the model with Hinge loss and Adam, as done in the example. Have tried with various learning_rates
model.compile(
loss = tf.keras.losses.Hinge(),
optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
metrics=[hinge_accuracy])
EPOCHS = 20
BATCH_SIZE = 32
NUM_EXAMPLES = 1000
# fit the model
qnn_history = model.fit(
x_train_tfcirc, y_train,
batch_size=32,
epochs=EPOCHS,
verbose=1,
validation_data=(x_test_tfcirc, y_test),
use_multiprocessing=False)
results = model.predict(x_test_tfcirc)
results_mapped = [-1 if x<=0 else 1 for x in results[:,0]]
print(np.sum(np.equal(results_mapped, y_test)))
After 20 epochs of optimization, I get the following:
1000/1000 [==============================] - 0s 410us/sample - loss: 0.5589 - hinge_accuracy: 0.6982 - val_loss: 0.5530 - val_hinge_accuracy: 0.7070
This results in 700 samples out of 1000 predicted correctly. When looking at the mapped results, this is because all results are predicted as -1. When looking at the raw results, they linearly increase from -0.5484014 to -0.99996257.
When retrieving the weight with w = model.layers[0].get_weights(), subtracting 0.8, and setting it again with model.layers[0].set_weights(w), I get 920/1000 correct. Fine-tuning this process allows me to achieve 1000/1000.
Update 1:
I have also printed the update of the weight over the various epochs:
4.916246, 4.242602, 3.3765688, 2.6855211, 2.3405066, 2.206207, 2.1734586, 2.1656137, 2.1510274, 2.1634471, 2.1683235, 2.188944, 2.1510284, 2.1591303, 2.1632445, 2.1542525, 2.1677444, 2.1702878, 2.163104, 2.1635907
I set the weight to 1.36, a value which gives 908/1000 (as opposed to 700/100). The optimizer moves away from it:
1.7992111, 2.0727847, 2.1370323, 2.15711, 2.1686404, 2.1603785, 2.183334, 2.1563332, 2.156857, 2.169908, 2.1658351, 2.170673, 2.1575692, 2.1505954, 2.1561477, 2.1754034, 2.1545155, 2.1635509, 2.1464484, 2.1707492
One thing that I noticed is that the value for the hinge accuracy was 0.75 with the weight 1.36, which is higher than the 0.7 for 2.17. If this is the case, I am either in an unlucky part of the optimization landscape where the global minimum does not correspond to the minimum of the loss landscape, or the loss value is determined incorrectly. This is what I will be investigating next.
The minima of the Hinge loss function for this examples does not correspond with the maxima of number of correctly classified examples. Please see plot of these w.r.t. the value of the parameter. Given that the optimizer works towards the minima of the loss, not the maxima of the number of classified examples, the code (and framework/optimizer) do what they are supposed to do. Alternatively, one could use a different loss function to try to find a better fit. For example binarized l1 loss. This function would have the same global optimum, but would likely have a very flat landscape.

Applying Keras Model to give an output value for each row

I am learning keras and would like to understand how I can apply a classifier (sequential) to all rows in my data set and not just the x% left for test validation.
The confusion I am having is, when I define my data split, I will have a portion for train and test. How would I apply model to full data set to show me the predicted values for each row? The end goal I have is to produce an concatenate the predicted value for every customer in the data set.
dataset = pd.read_csv('BankCustomers.csv')
X = dataset.iloc[:, 3:13]
y = dataset.iloc[:, 13]
feature_train, feature_test, label_train, label_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
sc = StandardScaler()
feature_train = sc.fit_transform(feature_train)
feature_test = sc.transform(feature_test)
For completeness the classifier looks like below.
# Initialising the ANN
classifier = Sequential()
# Adding the input layer and the first hidden layer
classifier.add(Dense(activation="relu", input_dim=11, units=6, kernel_initializer="uniform"))
# Adding the second hidden layer
classifier.add(Dense(activation="relu", units=6, kernel_initializer="uniform"))
# Adding the output layer
classifier.add(Dense(activation="sigmoid", units=1, kernel_initializer="uniform"))
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# Fitting the ANN to the Training set
classifier.fit(feature_train, label_train, batch_size = 10, nb_epoch = 100)
The course I am doing will suggest ways to get accuracy and predictions for the test set like below, but not the full batch.
# Predicting the Test set results
label_pred = classifier.predict(feature_test)
label_pred = (label_pred > 0.5) # FALSE/TRUE depending on above or below 50%
cm = confusion_matrix(label_test, label_pred)
accuracy=accuracy_score(label_test,label_pred)
I tried concatenating the model applied to both training and test data, but i then was unsure how to ascertain which index matched up with the original data set (i.e. I don't know which of the 20% test data is relative to the original set).
I apologise in advance if this question is superfluous, I have been looking for answers on stack and via the course but so far no luck.
You can utilize pandas indexes to sort your results back to the original order.
Predict on each feature_train and feature_test (not sure why you'd want to predict on feature_train though.)
Add a new column to each feature_train and feature_test, which would contain the predictions
feature_train["predictions"] = pd.Series(classifier.predict(feature_train))
feature_test["predictions"] = pd.Series(classifier.predict(feature_test))
If you look at the indexes of each data frame above, you can see they're shuffled (because of the train_test_split).
You can now concatenate them, use sort_index, and retrieve the predictions column, which would have the predictions according to the order that appeared in your initial dataframe (X)
pd.concat([feature_train, feature_test], axis=0).sort_index()