Different results when using Manual KFold-Cross validation vs. KerasClassifier-KFold Cross Validation - tensorflow

I've been struggling to understand why two similar Kfold-cross validations result in two different averages.
When I use a manual KFold approach (with Tensorflow and Keras)
cvscores = []
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=3)
for train, test in kfold.split(X, y):
    model = create_baseline()
    model.fit(X[train], y[train], epochs=50, batch_size=32, verbose=0)
    scores = model.evaluate(X[test], y[test], verbose=0)
    #print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
I get
65.89% (+/- 3.77%)
When I use the KerasClassifier wrapper from scikit
estimator = KerasClassifier(build_fn=create_baseline, epochs=50, batch_size=32, verbose=0)
kfold = StratifiedKFold(n_splits=10,shuffle=True, random_state=3)
results = cross_val_score(estimator, X, y, cv=kfold, scoring='accuracy')
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
I get
63.82% (5.37%)
Additionally, when using KerasClassifier the following warning appears
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/wrappers/scikit_learn.py:241: Sequential.predict_classes (from tensorflow.python.keras.engine.sequential) is deprecated and will be removed after 2021-01-01.
Instructions for updating:
Please use instead:* `np.argmax(model.predict(x), axis=-1)`, if your model does multi-class classification (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype("int32")`, if your model does binary classification (e.g. if it uses a `sigmoid` last-layer activation).
Do the results differ because KerasClassifier uses predict_classes() while the manual Tensorflow/Keras approach uses just predict()? If so, which approach is more reasonable?
My model looks like this
def create_baseline():
    model = tf.keras.models.Sequential()
    model.add(Dense(8, activation='relu', input_shape=(12,)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
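On the predict_classes() question, here is a minimal NumPy sketch of the conversion the deprecation warning recommends for a sigmoid output. The probabilities and labels are made up for illustration. Thresholding at 0.5 turns predict() probabilities into the class labels that predict_classes() used to return. As far as I understand, the 'accuracy' metric that evaluate() reports for a single sigmoid output also thresholds at 0.5, so both code paths would be expected to score identical predictions the same way.

```python
import numpy as np

# Hypothetical sigmoid outputs, standing in for model.predict(X[test]).ravel()
probs = np.array([0.10, 0.49, 0.51, 0.90, 0.50])
y_true = np.array([0, 1, 1, 1, 0])

# Conversion suggested by the deprecation warning for binary classification
labels = (probs > 0.5).astype("int32")

# Accuracy computed on the thresholded labels
accuracy = float(np.mean(labels == y_true))
print(labels, accuracy)
```

If that holds, the gap between the two averages would have to come from somewhere other than the prediction method.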

The two CV results do not look too different; they are within each other's standard deviation.
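As a quick arithmetic check, using only the numbers reported in the question, the gap between the two means can be compared against each standard deviation:

```python
# Means and standard deviations as reported in the question (in percent)
manual_mean, manual_std = 65.89, 3.77
wrapper_mean, wrapper_std = 63.82, 5.37

# Absolute gap between the two cross-validation averages
gap = abs(manual_mean - wrapper_mean)

# True only if the gap is smaller than both standard deviations
within_both = gap < manual_std and gap < wrapper_std
print(f"gap = {gap:.2f} points, within both std devs: {within_both}")
```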
You fixed the seed for the StratifiedKFold class, which is good. However, there is additional randomness you should control, and it comes from the weight initialization. Make sure you initialize your model for each CV run with different weights, but use the same 10 initializations for both cross-validations, manual and automatic. You can pass an initializer to each layer; initializers take a seed argument as well. In general you should fix all possible seeds (np.random.seed(3), plus tf.random.set_seed(3) in TensorFlow 2 or tf.set_random_seed(3) in TensorFlow 1).
What happens if you run cross_val_score() or your manual version twice? Do you get the same results / numbers?
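One way to get the same initial weights in both experiments is sketched below, assuming TensorFlow 2.7 or later, where tf.keras.utils.set_random_seed is available. The helper create_seeded_baseline and the list fold_seeds are hypothetical names introduced here, not part of the original code. The model mirrors create_baseline from the question.

```python
import numpy as np
import tensorflow as tf

def create_seeded_baseline(seed):
    """Build the baseline model with reproducible initial weights."""
    # Seeds Python's random module, NumPy and TensorFlow in one call
    # (assumes TensorFlow 2.7 or later).
    tf.keras.utils.set_random_seed(seed)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(12,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

# Hypothetical convention: one seed per fold, reused by both CV variants
fold_seeds = [3 + fold for fold in range(10)]

first = create_seeded_baseline(fold_seeds[0]).get_weights()
again = create_seeded_baseline(fold_seeds[0]).get_weights()
other = create_seeded_baseline(fold_seeds[1]).get_weights()

# Same seed should reproduce the weights; a different seed should not
same_seed_matches = all(np.array_equal(a, b) for a, b in zip(first, again))
different_seed_differs = any(not np.array_equal(a, b) for a, b in zip(first, other))
print(same_seed_matches, different_seed_differs)
```

In the manual loop, fold i would call create_seeded_baseline(fold_seeds[i]). Seeding fixes the starting point only; batch shuffling during fit() and some GPU operations can still introduce run-to-run variation.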

Related

tf.keras Functional model gives different results on the same data

I have defined my Functional model like this:
base_model = VGG16(include_top=False, input_shape=(224,224,3), pooling='avg')
inputs = tf.keras.Input(shape=(224,224,3))
x = preprocess_input(inputs)
x = base_model(x, training=False)
x = tf.keras.layers.Dropout(0.2)(x, training=True)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
The problem is that when I call .evaluate() or .predict(), I get slightly different results every time on the exact same batch (with shuffle=False in my dataset and all the random seeds initialized).
I tried reconstructing the model without some of the layers and I found the culprit to be these 2 layers constructed by the line x=preprocess_input(inputs), which give randomness to the results:
[Image: model summary]
Note: preprocess_input is a vgg16 preprocessing function at tf.keras.applications.vgg16.preprocess_input.
However, if I redefine my Functional model as Sequential:
new_model = tf.keras.Sequential()
new_model.add(model.layers[0]) #input layer
new_model.add(tf.keras.layers.Lambda(preprocess_input))
new_model.add(model.layers[3]) #vgg16
new_model.add(model.layers[4]) #dropout
new_model.add(model.layers[5]) #dense
The problem is gone and I get consistent results from .evaluate() or .predict().
What could potentially cause the Functional model to behave like this?
EDIT
As xdurch0 pointed out, the dropout layer was at fault for the differing results. Because it is called with training=True, the Functional model applied dropout during .predict() and .evaluate() as well.
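To illustrate why a dropout layer that stays active at inference makes outputs vary between calls, here is a plain NumPy sketch of inverted dropout. It is a simplified stand-in, not Keras's implementation; the function name dropout and its signature are made up for this example.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: active only when training is True."""
    if not training:
        return x  # inference: values pass through unchanged
    keep_mask = rng.random(x.shape) >= rate  # keep each unit with probability 1 - rate
    return x * keep_mask / (1.0 - rate)      # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
features = np.ones(1000)

# Two "predictions" on the same input with dropout forced on
run_a = dropout(features, rate=0.2, training=True, rng=rng)
run_b = dropout(features, rate=0.2, training=True, rng=rng)

# The same input with dropout off, as during normal inference
eval_out = dropout(features, rate=0.2, training=False, rng=rng)

print(np.array_equal(run_a, run_b))        # a fresh random mask is drawn on every call
print(np.array_equal(eval_out, features))  # inference mode leaves the input untouched
```

Calling the layer as Dropout(0.2)(x, training=True) corresponds to the first case on every forward pass, including .predict() and .evaluate().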

Shouldn't same neural network weights produce same results?

So I am working with different deep learning frameworks as part of my research and have observed something weird (at least I cannot explain the cause of it).
I trained a fairly simple MLP model (on the MNIST dataset) in Tensorflow, extracted the trained weights, created the same model architecture in PyTorch, and applied the trained weights to the PyTorch model. My expectation is to get the same test accuracy from both the Tensorflow and PyTorch models, but this isn't the case. I get different results.
So my question is: if a model is trained to some optimal value, shouldn't the trained weights produce the same results every time testing is done on the same dataset (regardless of the framework used)?
PyTorch Model:
class Net(nn.Module):
    def __init__(self) -> None:
        super(Net, self).__init__()
        self.fc1 = nn.Linear(784, 24)
        self.fc2 = nn.Linear(24, 10)
    def forward(self, x: Tensor) -> Tensor:
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
Tensorflow Model:
def build_model() -> tf.keras.Model:
    # Build model layers
    model = models.Sequential()
    # Flatten Layer
    model.add(layers.Flatten(input_shape=(28,28)))
    # Fully connected layer
    model.add(layers.Dense(24, activation='relu'))
    model.add(layers.Dense(10))
    # compile the model
    model.compile(
        optimizer='sgd',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )
    # return newly built model
    return model
To extract weights from Tensorflow model and apply them to Pytorch model I use following functions:
Extract Weights:
def get_weights(model):
    # fetch latest weights
    weights = model.get_weights()
    # transpose weights
    t_weights = []
    for w in weights:
        t_weights.append(np.transpose(w))
    # return
    return t_weights
Apply Weights:
def set_weights(model, weights):
    """Set model weights from a list of NumPy ndarrays."""
    state_dict = OrderedDict(
        {k: torch.Tensor(v) for k, v in zip(model.state_dict().keys(), weights)}
    )
    # load into the model passed as argument (there is no `self` in a plain function)
    model.load_state_dict(state_dict, strict=True)
Providing the solution here in the answer section for the benefit of the community. From the comments:
If you are using the same weights in the same manner, then the results should be the same, though float rounding error should also be accounted for. Also, it doesn't matter whether the model is trained at all. You can think of your model architecture as a chain of matrix multiplications with element-wise nonlinearities in between. How big is the difference? Are you comparing model outputs, or metrics computed over the dataset? As a suggestion, initialize the model with some random values in Keras, do a forward pass for a single batch (paraphrased from jdehesa and Taras Sereda)
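The suggested single-batch check can be sketched without either framework. The NumPy code below uses random values as stand-ins for the trained parameters and relies on the layout conventions of the two libraries: a Keras Dense kernel has shape (in_features, out_features) and computes x @ kernel + bias, while nn.Linear stores its weight as (out_features, in_features) and computes x @ weight.T + bias. The function names are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for Keras Dense parameters: kernels are (in_features, out_features)
keras_weights = [
    rng.normal(scale=0.05, size=(784, 24)).astype(np.float32),  # first layer kernel
    rng.normal(scale=0.05, size=(24,)).astype(np.float32),      # first layer bias
    rng.normal(scale=0.05, size=(24, 10)).astype(np.float32),   # second layer kernel
    rng.normal(scale=0.05, size=(10,)).astype(np.float32),      # second layer bias
]

# Same transposition as get_weights(); np.transpose leaves 1-D biases unchanged
torch_weights = [np.transpose(w) for w in keras_weights]

def keras_style_forward(x, weights):
    k1, b1, k2, b2 = weights
    hidden = np.maximum(x @ k1 + b1, 0.0)    # Dense(24, relu): x @ kernel + bias
    return hidden @ k2 + b2                  # Dense(10): logits

def torch_style_forward(x, weights):
    w1, b1, w2, b2 = weights
    hidden = np.maximum(x @ w1.T + b1, 0.0)  # nn.Linear: x @ weight.T + bias
    return hidden @ w2.T + b2

batch = rng.normal(size=(4, 784)).astype(np.float32)  # one flattened batch
max_abs_diff = float(np.max(np.abs(
    keras_style_forward(batch, keras_weights) - torch_style_forward(batch, torch_weights)
)))
print(max_abs_diff)
```

If the raw outputs of the real models agree on a batch to within rounding error, the remaining accuracy gap would more likely come from how the test data is prepared or how the metric is computed than from the weights themselves.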

Tune a pre-existing model with Keras Tuner

I am looking at Keras Tuner as a way of doing hyperparameter optimization, but all of the examples I have seen show an entirely fresh model being defined. For example, from the Keras Tuner Hello World:
def build_model(hp):
    model = keras.Sequential()
    model.add(layers.Flatten(input_shape=(28, 28)))
    for i in range(hp.Int('num_layers', 2, 20)):
        model.add(layers.Dense(units=hp.Int('units_' + str(i), 32, 512, 32),
                               activation='relu'))
    model.add(layers.Dense(10, activation='softmax'))
    model.compile(
        optimizer=keras.optimizers.Adam(
            hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    return model
I already have a model that I would like to tune. Does that mean I have to rewrite it with the hyperparameters spliced into the body, as above, or can I simply pass the hyperparameters into the model at the top? For example, like this:
def build_model(hp):
    model = MyExistingModel(
        batch_size=hp['batch_size'],
        seq_len=hp['seq_len'],
        rnn_hidden_units=hp['hidden_units'],
        rnn_type='gru',
        num_rnn_layers=hp['num_rnn_layers']
    )
    optimizer = optimizer_factory['adam'](
        learning_rate=hp['learning_rate'],
        momentum=0.9,
    )
    model.compile(
        optimizer=optimizer,
        loss='sparse_categorical_crossentropy',
        metrics=['sparse_categorical_accuracy'],
    )
    return model
The above seems to work, as far as I can see. The model initialization args are all passed to the internal TF layers, through a HyperParameters instance, and accessed from there... although I'm not really sure how to pass it in... I think it can be done by predefining a HyperParameters object and passing it in to the tuner, so it then gets passed in to build_model:
hp = HyperParameters()
hp.Choice('learning_rate', [1e-1, 1e-3])
tuner = RandomSearch(
    build_model,
    max_trials=5,
    hyperparameters=hp,
    tune_new_entries=False,
    objective='val_accuracy')
Internally my model has two RNNs (LSTM or GRU) and an MLP. But I have yet to come across a Keras Tuner build_model that takes an existing model like this and simply passes in the hyperparameters. The model is quite complex, and I would like to avoid having to redefine it (as well as avoiding code duplication).
Indeed this is possible, as this GitHub issue makes clear...
However, rather than passing the hp object through the hyperparameters arg to the Tuner, I instead override the Tuner run_trial method in the manner suggested here.

Why I'm getting bad result with Keras vs random forest or knn?

I'm learning deep learning with Keras and trying to compare the results (accuracy) with machine learning algorithms from sklearn (e.g. random forest, k_neighbors).
It seems that with Keras I'm getting the worst results.
I'm working on a simple classification problem: the iris dataset.
My Keras code looks like this:
samples = datasets.load_iris()
X = samples.data
y = samples.target
df = pd.DataFrame(data=X)
df.columns = samples.feature_names
df['Target'] = y
# prepare data
X = df[df.columns[:-1]]
y = df[df.columns[-1]]
# hot encoding
encoder = LabelEncoder()
y1 = encoder.fit_transform(y)
y = pd.get_dummies(y1).values
# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
# build model
model = Sequential()
model.add(Dense(1000, activation='tanh', input_shape = ((df.shape[1]-1),)))
model.add(Dense(500, activation='tanh'))
model.add(Dense(250, activation='tanh'))
model.add(Dense(125, activation='tanh'))
model.add(Dense(64, activation='tanh'))
model.add(Dense(32, activation='tanh'))
model.add(Dense(9, activation='tanh'))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train)
score, acc = model.evaluate(X_test, y_test, verbose=0)
#results:
#score = 0.77
#acc = 0.711
I have tried adding layers, changing the number of units per layer, and changing the activation function (to relu), but the results never get higher than 0.85.
With sklearn random forest or k_neighbors I'm getting results (on the same dataset) above 0.95.
What am I missing?
With sklearn I got good results with little effort, while with Keras I made a lot of changes and still did not match the sklearn results. Why is that?
How can I get the same results with Keras?
In short, you need:
ReLU activations
Simpler model
Data normalization
More epochs
In detail:
The first issue here is that nowadays we almost never use activation='tanh' for the intermediate network layers. In such problems, we practically always use activation='relu'.
The second issue is that you have built quite a large Keras model, and it might very well be the case that, with only about 100 iris samples in your training set, you have too little data to effectively train such a large model. Try drastically reducing both the number of layers and the number of nodes per layer. Start simpler.
Large neural networks really thrive when we have lots of data, but with small datasets, like here, their expressiveness and flexibility may become a liability instead, compared with simpler algorithms like RF or k-nn.
The third issue is that, in contrast to tree-based models like Random Forests, neural networks generally require normalizing the data, which you don't do. The truth is that k-nn also requires normalized data, but in this special case, since all iris features are on the same scale, it does not affect the performance negatively.
Last but not least, you seem to run your Keras model for only one epoch (the default value if you don't specify anything in model.fit); this is somewhat equivalent to building a random forest with a single tree (which, BTW, is still much better than a single decision tree).
All in all, with the following changes in your code:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
model = Sequential()
model.add(Dense(150, activation='relu', input_shape = ((df.shape[1]-1),)))
model.add(Dense(150, activation='relu'))
model.add(Dense(y.shape[1], activation='softmax'))
model.fit(X_train, y_train, epochs=100)
and everything else as is, we get:
score, acc = model.evaluate(X_test, y_test, verbose=0)
acc
# 0.9333333373069763
We can do better: use slightly more training data and stratify them, i.e.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.20, # a few more samples for training
                                                    stratify=y)
And with the same model & training epochs you can get a perfect accuracy of 1.0 in the test set:
score, acc = model.evaluate(X_test, y_test, verbose=0)
acc
# 1.0
(Details might differ due to some randomness imposed by default in such experiments).
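To see what stratification does to the split, here is a small sketch using integer class labels rather than the one-hot matrix from the question. The random_state value is an added assumption, used only to make the shuffle repeatable. It counts the samples per class on each side of the split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 50 per class

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,   # 30 test samples, 120 training samples
    stratify=y,       # keep the class proportions equal in both parts
    random_state=0,   # assumption: fixed only for repeatability
)

# Number of samples of each class in the two parts
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Without stratify, the class counts in a 30-sample test set can drift away from an even split, which adds noise to the reported accuracy.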
Adding some dropout might help you improve accuracy. See Tensorflow's documentation for more information.
Essentially how you add a Dropout layer is just very similar to how you added those Dense() layers.
model.add(Dropout(0.2))
Note: The parameter 0.2 means that 20% of the layer's inputs are randomly set to zero during training, which reduces the interdependencies between units and thereby reduces overfitting.

Why did the Keras Sequential model give a different result compared to Model model?

I've tried a simple LSTM model in Keras to do sentiment analysis on the IMDB dataset, using both the Sequential model and the Model class, and it turns out the latter gives a worse result. Here's my code:
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
It gives an accuracy of around 0.6 in the first epoch, while the other code, which uses Model:
_input = Input(shape=[max_review_length], dtype='int32')
embedded = Embedding(
    input_dim=top_words,
    output_dim=embedding_size,
    input_length=max_review_length,
    trainable=False,
    mask_zero=False
)(_input)
lstm = LSTM(100, return_sequences=True)(embedded)
probabilities = Dense(2, activation='softmax')(lstm)
model = Model(_input, probabilities)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
gives 0.5 accuracy in the first epoch and never changes afterwards.
Is there any reason for that, or am I doing something wrong? Thanks in advance.
I see two main differences between your two models:
You have set the embedding layer of the second model to trainable=False, so the second model probably has far fewer parameters to optimize than the first.
The LSTM returns the whole sequence in the second model (return_sequences=True), so the output shape is different. I don't see how you can compare the two models; they are not doing the same thing.
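The shape difference can be checked directly. The sketch below builds two small models that differ only in return_sequences and prints their output shapes. The vocabulary size, sequence length and layer widths are made-up small values, not the ones from the question.

```python
import tensorflow as tf

vocab_size, seq_len = 50, 20  # small made-up sizes for illustration

def build(return_sequences):
    inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
    x = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8)(inputs)
    x = tf.keras.layers.LSTM(16, return_sequences=return_sequences)(x)
    # Dense acts on the last axis, so it is applied per time step for 3-D input
    outputs = tf.keras.layers.Dense(2, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

# One prediction per review versus one prediction per token
last_step_shape = tuple(build(return_sequences=False).output_shape)
per_step_shape = tuple(build(return_sequences=True).output_shape)
print(last_step_shape, per_step_shape)
```

With return_sequences=True the model emits one 2-way prediction per time step rather than one per review, so it no longer matches labels that have one entry per review.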