mlflow log model failed - tensorflow

I tried to log my Keras model with mlflow.tensorflow.log_model but got a NoCredentials error.
Is there any way to solve it?
My tracking server is hosted in a Kubernetes cluster. Parameters and metrics are logged successfully, but artifacts are not. I want to log and register my model with the tracking API. At first the log_model API returned a traceback saying "no module named boto3", so I installed boto3 with pip; now it returns a new traceback.
I host the tracking server on a Kubernetes cluster, not on AWS. Why does mlflow.tensorflow.log_model use boto3? Is there any way to change that?
tracking_url = "https://......"
mlflow.set_tracking_uri(tracking_url)
mlflow.set_experiment('test_mlflow')

def create_classifier():
    classifier = tf.keras.Sequential()
    classifier.add(tf.keras.layers.Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 12))
    classifier.add(tf.keras.layers.Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(tf.keras.layers.Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])
    return classifier

classifier = create_classifier()
history = classifier.fit(X_train, y_train, batch_size = batch_size, epochs = epochs, verbose = 1)
test_score, test_acc = classifier.evaluate(X_test, y_test, batch_size = batch_size)
tf.keras.models.save_model(classifier, model_save_path)

run_name = "sample-ann-run3"
with mlflow.start_run(run_name=run_name):
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("epochs", epochs)
    mlflow.log_metric("loss", test_score)
    mlflow.log_metric("accuracy", test_acc)
    mlflow.tensorflow.log_model(model=classifier, registered_model_name="sample-ann-1", artifact_path=model_save_path)
mlflow.tensorflow.log_model returned the following traceback:

Solved. I found that the client side needs the AWS access information added as environment variables in the Python file that runs the machine-learning code and registers the model with MLflow:
os.environ['AWS_ACCESS_KEY_ID'] = "<access_id>"
os.environ['AWS_SECRET_ACCESS_KEY'] = "<access_secret>"
os.environ["MLFLOW_S3_ENDPOINT_URL"] =

Related

mlflow ui doesn't show logged runs of my keras model

I created a sample Keras model and used mlflow.tensorflow.autolog() to track it. However, the logged runs do not appear in the MLflow UI.
mlflow.tensorflow.autolog()

# setting hyper parameters
batch_size = 10
epochs = 100
optimizer = 'adam'
loss = 'binary_crossentropy'

def create_classifier():
    classifier = tf.keras.Sequential()
    classifier.add(tf.keras.layers.Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 12))
    classifier.add(tf.keras.layers.Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(tf.keras.layers.Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])
    return classifier

classifier = create_classifier()
classifier.summary()

classifier.fit(X_train.to_numpy(), y_train.to_numpy(), batch_size = batch_size, epochs = epochs, verbose = 1)
score, acc = classifier.evaluate(X_train.to_numpy(), y_train.to_numpy(), batch_size = batch_size)

y_pred = classifier.predict(X_test.to_numpy())
y_pred = (y_pred > 0.5)
print('*' * 20)
score, acc = classifier.evaluate(X_test.to_numpy(), y_test.to_numpy(), batch_size = batch_size)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
I got the following warning when I call autolog:
The output when I fit the model with the training data:
What do I need to do to make the runs appear in the MLflow UI?
Note: the MLflow docs state that autologging is only compatible with TensorFlow >= 2.3, and my TensorFlow version is 2.10.1.
Update:
I found there are two mlruns directories in my project: one located where my Python code is, "ml-flow-project-example2/model/mlruns" (as you can see in the screen clip), and another at "ml-flow-project-example2/venv/Scripts/mlruns". The mlruns directory in venv is empty, but that is where mlflow.exe is located. If I move the directory "7c10db034cdd47dfbba12885da25ff0f" from /model/mlruns to venv/Scripts/mlruns, the run appears in the MLflow UI.
Is there any way to point the MLflow UI at the correct mlruns directory?
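One approach, as a sketch (the paths below are placeholders for this project layout): the mlflow ui command reads runs from the mlruns directory relative to where it is launched, so either launch it from the directory that contains your code's mlruns, or make the store location explicit in the training script and pass the same location to the UI via --backend-store-uri, so the runs never need to be moved:

import mlflow

# Hypothetical absolute path; adjust to wherever you want runs stored.
mlflow.set_tracking_uri("file:///absolute/path/to/ml-flow-project-example2/model/mlruns")
mlflow.tensorflow.autolog()

# Then launch the UI against the same store, e.g.:
#   mlflow ui --backend-store-uri file:///absolute/path/to/ml-flow-project-example2/model/mlruns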

Keras evaluation suddenly starts giving unreasonably low accuracy while loss stays the same

A couple of days ago I trained a ResNet in Colab and evaluated it with the following code:
model.compile(loss = "sparse_categorical_crossentropy",
              optimizer = optimizer,
              metrics = ["accuracy"])

checkpoint_cb = keras.callbacks.ModelCheckpoint(
    filepath = checkpoint_path,
    save_weights_only = False,
    monitor = 'val_pred_loss',
    save_best_only = True
)
tensorboard_cb = keras.callbacks.TensorBoard(tensorboard_path)
earlystopping_cb = keras.callbacks.EarlyStopping(patience = 6, monitor = "val_pred_loss", min_delta = 0.005)

history = model.fit(
    x = train_set,
    validation_data = val_set,
    validation_steps = 1629 // val_b_size,
    epochs = epochs,
    steps_per_epoch = steps_per_epoch,
    callbacks = [checkpoint_cb, tensorboard_cb, earlystopping_cb]
)

best_model = tf.keras.models.load_model(checkpoint_path)
test_set = test_set.batch(17)
print(best_model.evaluate(test_set))
The output was [0.42691513895988464, 0.8850889205932617]
The model does not have any custom components; it's a simple ResNet with new GAP and dense layers for classification. Upon rerunning the last three lines today I consistently get a nonsensical accuracy, [0.42691513895988464, 0.004352692514657974]. I initially thought that I had changed something in the script by mistake or messed up the file save and load, but the CE loss is the same. How is this possible?
Edit: the issue affects any loaded model; evaluating a trained net directly from RAM works as expected.
Here's how the model is defined:
base_model = keras.applications.ResNet50(
    include_top = False,
    weights = "imagenet",
    input_shape = (448, 448, 3),
)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)  # 14 x 14 x 2048 -> 2048
o = keras.layers.Dense(196, activation = "softmax")(avg)
model = keras.Model(inputs=base_model.input, outputs=[o])
Update: replacing the load_model statement with model.load_weights resolves the issue. I'd still like to know the reason.
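A plausible explanation, offered here as an assumption rather than something confirmed in the thread: when a model is compiled with the string metric "accuracy" alongside sparse_categorical_crossentropy, Keras resolves the string to the matching accuracy variant at compile time, but after save_model/load_model the metric can be restored as the wrong variant, so the loss is reproduced while the accuracy becomes meaningless. Under that assumption, re-compiling the loaded model with an explicit metric is a sketch of a workaround:

from tensorflow import keras

best_model = keras.models.load_model(checkpoint_path)

# Re-compile with an explicit sparse accuracy metric instead of the "accuracy" string,
# so integer class labels are scored correctly after loading (assumes integer labels).
best_model.compile(loss = "sparse_categorical_crossentropy",
                   optimizer = "adam",
                   metrics = [keras.metrics.SparseCategoricalAccuracy()])

print(best_model.evaluate(test_set))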

What is the difference between two code blocks built separately on the TensorFlow API and the Keras API? My results have a large gap

I am building a model to classify sequences. First I built the model using the Keras API. As we know, the Keras API wraps TensorFlow functions, but when I converted the Keras code to the TensorFlow API, I found that the results of the two frameworks are different. Below is the key code.
TensorFlow code:
x = tf.placeholder(tf.int32, shape=[None, time_steps], name='x_input')
y = tf.placeholder(tf.float32, shape=[None, num_classes], name='y_label')

# define the network structure
def rnn_model(x):
    x = tf.one_hot(x, api_vob_size)
    rnn_cell_fw = tf.nn.rnn_cell.BasicLSTMCell(rnn_size)
    rnn_cell_bw = tf.nn.rnn_cell.BasicLSTMCell(rnn_size)
    # feed the input into the RNN to get outputs and states; output shape is [batch_size, time_steps, rnn_size]
    outputs, states = tf.nn.bidirectional_dynamic_rnn(rnn_cell_fw, rnn_cell_bw, x, dtype=tf.float32)
    # take the output at the last time step; shape is [batch_size, rnn_size]
    outputs1 = tf.concat(outputs, 2)
    output = tf.transpose(outputs1, [1, 0, 2])[-1]
    # fully connected layer; final output shape is [batch_size, num_classes]
    fc_w = tf.Variable(tf.random_normal([2 * rnn_size, num_classes]))
    fc_b = tf.Variable(tf.random_normal([num_classes]))
    return tf.matmul(output, fc_w) + fc_b

# build the network
logits = rnn_model(x)
prediction = tf.nn.softmax(logits)

# define the loss function and optimizer
loss_op = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits, name='cross_entropy'))
optimizer = tf.train.AdamOptimizer(learning_rate=lr)
train_op = optimizer.minimize(loss_op, name='optimizer_min')
# Keras API
model = Sequential()
model.add(Bidirectional(LSTM(units=150), merge_mode='concat'))
model.add(Dense(9, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=64)
So why do the two code blocks give different results? Thank you for your answer!
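One difference that stands out in the snippets above (an observation, not an answer confirmed in the thread): the TensorFlow version optimizes sigmoid_cross_entropy_with_logits, while the Keras version uses categorical_crossentropy on top of a softmax; these are different objectives for a multi-class problem. A sketch of the TF1-style loss that would match the Keras setup, assuming y holds one-hot labels:

# Softmax cross-entropy corresponds to Keras' categorical_crossentropy on one-hot labels.
loss_op = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))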

AlreadyExistsError while training a network on colab

I'm trying to train an LSTM network on Google Colab. However, this error occurs:
AlreadyExistsError: Resource __per_step_116/training_4/Adam/gradients/bidirectional_4/while/ReadVariableOp/Enter_grad/ArithmeticOptimizer/AddOpsRewrite_Add/tmp_var/N10tensorflow19TemporaryVariableOp6TmpVarE
[[{{node training_4/Adam/gradients/bidirectional_4/while/ReadVariableOp/Enter_grad/ArithmeticOptimizer/AddOpsRewrite_Add/tmp_var}}]]
I don't know where the issue could be. This is the model of the network:
sl_model = keras.models.Sequential()
sl_model.add(keras.layers.Embedding(max_index + 1, hidden_size, mask_zero=True))
sl_model.add(keras.layers.Bidirectional(keras.layers.LSTM(hidden_size,
             activation='tanh', dropout=0.2, recurrent_dropout=0.2, return_sequences=True)))
sl_model.add(keras.layers.Bidirectional(keras.layers.LSTM(hidden_size,
             activation='tanh', dropout=0.2, recurrent_dropout=0.2, return_sequences=False)))
sl_model.add(keras.layers.Dense(max_length, activation='softsign'))

optimizer = keras.optimizers.Adam()
sl_model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['acc'])

batch_size = 128
epochs = 3

cbk = keras.callbacks.TensorBoard("logging/keras_model")
print("\nStarting training...")
sl_model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
             shuffle=True, validation_data=(x_dev, y_dev), callbacks=[cbk])
Thank you so much!
You need to restart your runtime -- this happens when you have defined multiple graphs in a single Jupyter (Colaboratory) runtime.
Calling tf.reset_default_graph() may also help, but depending on whether you are using eager execution and how you've defined your sessions, this may or may not work.
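As a sketch of what that reset could look like before rebuilding the model (this assumes the TF1-style graph mode in which this error shows up; keras.backend.clear_session() is an extra cleanup one could try):

import keras
import tensorflow as tf
from keras import backend as K

# Clear any previously built graphs/sessions before defining the model again.
K.clear_session()
tf.reset_default_graph()  # TF1-style API; under TF2 this is tf.compat.v1.reset_default_graph()

sl_model = keras.models.Sequential()
# ... rebuild and recompile the model as above ...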

Can KerasClassifier with a TF model work with sklearn cross_val_score when setting n_jobs=-1 and TF runs on a single GPU?

I have this sample code and it only runs with n_jobs=1.
The TensorFlow backend is running on a GPU.
When I run cross_val_score with n_jobs=-1, the program hangs and stops producing any output after printing four Epoch 1/100 lines (as I have a 4-core CPU, I assume it uses all 4 cores to do CV and each one tries to start a TF session on the GPU).
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dropout(0.3))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    # classifier.add(Dropout(0.3))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier

classifier = KerasClassifier(build_fn = build_classifier, batch_size = 100, epochs = 100, verbose = 0)
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = 1)
I have also tried to limit TF's GPU usage in this way, but n_jobs=-1 still won't work.
np.random.seed(123)
tf.set_random_seed(123)
config = tf.ConfigProto(inter_op_parallelism_threads=1)
config.gpu_options.per_process_gpu_memory_fraction = 0.1 # in my case this setting will use around 1G memory on GPU
set_session(tf.Session(config=config))
I have the same issue. I use the lines of code below:
# Configure the GPU to use all the memory
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 1.0
set_session(tf.Session(config=config))
def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier

classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs = 100)
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
mean = accuracies.mean()
variance = accuracies.std()
Then I removed n_jobs = -1, ran it again, and checked the GPU utilization using GPU-Z; below is a photo from the run.
Maybe your real question is that you don't see a performance improvement from using the GPU. To answer that, I ran the same code on the CPU and on the GPU.
In my experiments the GPU was on average at least 3x faster than the CPU. I believe it should take even less time, but this is the maximum performance I achieved.
You can also find some good discussions at Run Keras with GPU.
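If the goal is to let several worker processes share one GPU instead of each trying to claim it, a commonly used TF1-era option is allow_growth, sketched below as a general setting rather than something tested by the poster:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# Let each process allocate GPU memory on demand instead of reserving a fixed fraction up front.
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))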