I tried to log my Keras model with mlflow.tensorflow.log_model but got a NoCredentialsError. Is there any way to solve it?
My tracking server is hosted in a Kubernetes cluster. The parameters and metrics are logged successfully, but the artifacts are not. I want to log and register my model with the tracking API. At first the log_model call returned a traceback saying "no module named boto3", so I installed boto3 with pip; now it returns a new traceback.
I host the tracking server on a Kubernetes cluster, not on AWS. Why does mlflow.tensorflow.log_model use boto3? Is there any way to change it?
tracking_url = "https://......"
mlflow.set_tracking_uri(tracking_url)
mlflow.set_experiment('test_mlflow')

def create_classifier():
    classifier = tf.keras.Sequential()
    classifier.add(tf.keras.layers.Dense(units=6, kernel_initializer='uniform', activation='relu', input_dim=12))
    classifier.add(tf.keras.layers.Dense(units=6, kernel_initializer='uniform', activation='relu'))
    classifier.add(tf.keras.layers.Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
    classifier.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
    return classifier

classifier = create_classifier()
history = classifier.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1)
test_score, test_acc = classifier.evaluate(X_test, y_test, batch_size=batch_size)

tf.keras.models.save_model(classifier, model_save_path)

run_name = "sample-ann-run3"
with mlflow.start_run(run_name=run_name):
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("epochs", epochs)
    mlflow.log_metric("loss", test_score)
    mlflow.log_metric("accuracy", test_acc)
    mlflow.tensorflow.log_model(model=classifier, registered_model_name="sample-ann-1", artifact_path=model_save_path)
The mlflow.tensorflow.log_model call returned the following traceback (the NoCredentialsError mentioned above):
Solved. I found that the client side needs to add the AWS access information as environment variables in the Python file that runs the machine-learning code and registers the model with MLflow.

import os

os.environ['AWS_ACCESS_KEY_ID'] = "<access_id>"
os.environ['AWS_SECRET_ACCESS_KEY'] = "<access_secret>"
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "<endpoint_url>"
I created a sample Keras model and used mlflow.tensorflow.autolog() to track it. However, the logged runs do not appear in the MLflow UI.
mlflow.tensorflow.autolog()

# setting hyperparameters
batch_size = 10
epochs = 100
optimizer = 'adam'
loss = 'binary_crossentropy'

def create_classifier():
    classifier = tf.keras.Sequential()
    classifier.add(tf.keras.layers.Dense(units=6, kernel_initializer='uniform', activation='relu', input_dim=12))
    classifier.add(tf.keras.layers.Dense(units=6, kernel_initializer='uniform', activation='relu'))
    classifier.add(tf.keras.layers.Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
    classifier.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
    return classifier

classifier = create_classifier()
classifier.summary()

classifier.fit(X_train.to_numpy(), y_train.to_numpy(), batch_size=batch_size, epochs=epochs, verbose=1)
score, acc = classifier.evaluate(X_train.to_numpy(), y_train.to_numpy(), batch_size=batch_size)

y_pred = classifier.predict(X_test.to_numpy())
y_pred = (y_pred > 0.5)

print('*' * 20)
score, acc = classifier.evaluate(X_test.to_numpy(), y_test.to_numpy(), batch_size=batch_size)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
I got the following warning when I called autolog:
The output when I fit the model with the training data:
What do I need to do to make the run appear in the MLflow UI?
Note: the MLflow docs state that it is only compatible with TensorFlow >= 2.3, and my TensorFlow version is 2.10.1.
Update:
I found that there are two mlruns directories in my project: one located where my Python code is ("ml-flow-project-example2/model/mlruns", as you can see in the screen clip) and another at "ml-flow-project-example2/venv/Scripts/mlruns". The mlruns directory in venv is empty, but it is where mlflow.exe is located. If I move the run directory "7c10db034cdd47dfbba12885da25ff0f" from /model/mlruns to venv/Scripts/mlruns, the run appears in the MLflow UI.
Is there any way to point the MLflow UI at the correct mlruns directory?
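From what I understand, when no tracking URI is set, both autologging and the mlflow ui command default to an mlruns folder relative to the current working directory, which is why the two processes can end up looking at different places. One way to pin both sides to the same store is an explicit file: URI; the absolute path below is only an illustration, not my actual layout:

import mlflow

# Hypothetical absolute path; point it at the mlruns folder the runs were written to.
mlflow.set_tracking_uri("file:///path/to/ml-flow-project-example2/model/mlruns")

The UI can then be started against the same store with mlflow ui --backend-store-uri file:///path/to/ml-flow-project-example2/model/mlruns.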
A couple of days ago I trained a ResNet in Colab and evaluated it with the following code:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

checkpoint_cb = keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    save_weights_only=False,
    monitor='val_pred_loss',
    save_best_only=True
)
tensorboard_cb = keras.callbacks.TensorBoard(tensorboard_path)
earlystopping_cb = keras.callbacks.EarlyStopping(patience=6, monitor="val_pred_loss", min_delta=0.005)

history = model.fit(
    x=train_set,
    validation_data=val_set,
    validation_steps=1629 // val_b_size,
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    callbacks=[checkpoint_cb, tensorboard_cb, earlystopping_cb]
)

best_model = tf.keras.models.load_model(checkpoint_path)
test_set = test_set.batch(17)
print(best_model.evaluate(test_set))
The output was [0.42691513895988464, 0.8850889205932617].
The model does not have any custom components; it's a simple ResNet with new GAP and dense layers for classification. Upon rerunning the last three lines today, I consistently got a nonsensical accuracy: [0.42691513895988464, 0.004352692514657974]. I initially thought that I had changed something in the script by mistake or messed up the file save and load, but the cross-entropy loss is the same. How is this possible?
Edit: the issue affects any loaded model; evaluating a trained net directly from RAM works as expected.
Here's how the model is defined:
base_model = keras.applications.ResNet50(
    include_top=False,
    weights="imagenet",
    input_shape=(448, 448, 3),
)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)  # 14 x 14 x 2048 -> 2048
o = keras.layers.Dense(196, activation="softmax")(avg)
model = keras.Model(inputs=base_model.input, outputs=[o])
Update: replacing the load_model call with model.load_weights resolves the issue. I'd still like to know the reason.
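For completeness, here is a minimal sketch of the workaround from the update: rebuild the same architecture, restore only the weights from the checkpoint, and recompile before evaluating. It assumes checkpoint_path and test_set are the same objects as above and that the checkpoint format is accepted by load_weights; it is illustrative rather than the original notebook code:

import tensorflow as tf
from tensorflow import keras

# Rebuild the same architecture as in training; the ImageNet weights are
# not needed here because everything is overwritten by the checkpoint.
base_model = keras.applications.ResNet50(
    include_top=False,
    weights=None,
    input_shape=(448, 448, 3),
)
avg = keras.layers.GlobalAveragePooling2D()(base_model.output)
o = keras.layers.Dense(196, activation="softmax")(avg)
best_model = keras.Model(inputs=base_model.input, outputs=[o])

# Restore only the weights, then recompile so evaluate() reports loss
# and accuracy the same way as before.
best_model.load_weights(checkpoint_path)
best_model.compile(loss="sparse_categorical_crossentropy", metrics=["accuracy"])

print(best_model.evaluate(test_set))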
I am building a model to classify sequences. First I built the model using the Keras API. As we know, the Keras API wraps TensorFlow functions, but when I converted the Keras code to the TensorFlow API, I found that the results of the two frameworks are different. Below is the key code.
TensorFlow code:
x = tf.placeholder(tf.int32, shape=[None, time_steps], name='x_input')
y = tf.placeholder(tf.float32, shape=[None, num_classes], name='y_label')

# Define the network structure
def rnn_model(x):
    x = tf.one_hot(x, api_vob_size)
    rnn_cell_fw = tf.nn.rnn_cell.BasicLSTMCell(rnn_size)
    rnn_cell_bw = tf.nn.rnn_cell.BasicLSTMCell(rnn_size)
    # Feed the input into the RNN to get the outputs and states;
    # the output shape is [batch_size, time_steps, rnn_size]
    outputs, states = tf.nn.bidirectional_dynamic_rnn(rnn_cell_fw, rnn_cell_bw, x, dtype=tf.float32)
    # Take the output at the last time step; its shape is [batch_size, rnn_size]
    outputs1 = tf.concat(outputs, 2)
    output = tf.transpose(outputs1, [1, 0, 2])[-1]
    # Fully connected layer; the final output shape is [batch_size, num_classes]
    fc_w = tf.Variable(tf.random_normal([2 * rnn_size, num_classes]))
    fc_b = tf.Variable(tf.random_normal([num_classes]))
    return tf.matmul(output, fc_w) + fc_b

# Build the network
logits = rnn_model(x)
prediction = tf.nn.softmax(logits)

# Define the loss function and optimizer
loss_op = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits, name='cross_entropy'))
optimizer = tf.train.AdamOptimizer(learning_rate=lr)
train_op = optimizer.minimize(loss_op, name='optimizer_min')
# Keras API
model = Sequential()
model.add(Bidirectional(LSTM(units=150), merge_mode='concat'))
model.add(Dense(9, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=64)
So why do the two code blocks give different results? Thank you for your answers!
I'm trying to train an LSTM network on Google Colab. However, this error occurs:
AlreadyExistsError: Resource __per_step_116/training_4/Adam/gradients/bidirectional_4/while/ReadVariableOp/Enter_grad/ArithmeticOptimizer/AddOpsRewrite_Add/tmp_var/N10tensorflow19TemporaryVariableOp6TmpVarE
[[{{node training_4/Adam/gradients/bidirectional_4/while/ReadVariableOp/Enter_grad/ArithmeticOptimizer/AddOpsRewrite_Add/tmp_var}}]]
I don't know where the issue could be. This is the model of the network:
sl_model = keras.models.Sequential()
sl_model.add(keras.layers.Embedding(max_index + 1, hidden_size, mask_zero=True))
sl_model.add(keras.layers.Bidirectional(keras.layers.LSTM(hidden_size, activation='tanh',
                                                          dropout=0.2, recurrent_dropout=0.2,
                                                          return_sequences=True)))
sl_model.add(keras.layers.Bidirectional(keras.layers.LSTM(hidden_size, activation='tanh',
                                                          dropout=0.2, recurrent_dropout=0.2,
                                                          return_sequences=False)))
sl_model.add(keras.layers.Dense(max_length, activation='softsign'))

optimizer = keras.optimizers.Adam()
sl_model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['acc'])

batch_size = 128
epochs = 3

cbk = keras.callbacks.TensorBoard("logging/keras_model")
print("\nStarting training...")
sl_model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
             shuffle=True, validation_data=(x_dev, y_dev), callbacks=[cbk])
Thank you so much!
You need to restart your runtime -- this happens when you have multiple graphs defined in a single Jupyter (Colaboratory) runtime.
Calling tf.reset_default_graph() may also help, but depending on whether you are using eager execution and how you've defined your sessions, this may or may not work.
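As an illustration of the above, in a TF1-style notebook you can clear the accumulated state between model definitions instead of restarting; this is only a sketch, assuming graph mode (no eager execution) and the standalone Keras used in the question:

import tensorflow as tf
import keras

# Drop the Keras session and the default graph built up by earlier cells,
# then redefine and recompile the model in a fresh graph.
keras.backend.clear_session()
tf.reset_default_graph()

sl_model = keras.models.Sequential()
# ... re-add the layers and recompile as in the original snippet ...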
I have this sample code and it only runs with n_jobs=1.
The TensorFlow backend is running on a GPU.
When I run cross_val_score with n_jobs=-1, the program hangs after printing four "Epoch 1/100" lines and gives no further output (as I have a 4-core CPU, I assume it uses all 4 cores for CV and each one tries to start a TF session on the GPU).
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu', input_dim=11))
    classifier.add(Dropout(0.3))
    classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))
    # classifier.add(Dropout(0.3))
    classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
    classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return classifier

classifier = KerasClassifier(build_fn=build_classifier, batch_size=100, epochs=100, verbose=0)
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10, n_jobs=1)
I have also tried to limit the TF GPU usage in the following way, but n_jobs=-1 still won't work.
np.random.seed(123)
tf.set_random_seed(123)

config = tf.ConfigProto(inter_op_parallelism_threads=1)
config.gpu_options.per_process_gpu_memory_fraction = 0.1  # in my case this setting uses around 1 GB of GPU memory
set_session(tf.Session(config=config))
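A related variant sometimes tried in this situation (not something I have verified fixes n_jobs=-1) is to let each worker grow its GPU allocation on demand instead of reserving a fixed fraction; a minimal TF1-style sketch, assuming the standalone-Keras set_session import:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# Allocate GPU memory incrementally rather than claiming a fixed fraction
# up front; this can reduce contention when several CV workers share one
# GPU, though it does not guarantee that parallel jobs will work.
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))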
I have the same issue. I use the lines of code below:

# Configure the GPU to use all of its memory
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 1.0
set_session(tf.Session(config=config))
def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu', input_dim=11))
    classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))
    classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
    classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return classifier

classifier = KerasClassifier(build_fn=build_classifier, batch_size=10, epochs=100)
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
mean = accuracies.mean()
variance = accuracies.std()
Then I removed n_jobs=-1, ran it again, and checked the GPU utilization using GPU-Z; below is a photo from the run.
Maybe your real question is that you don't feel any performance improvement from using the GPU. To answer that, I ran the same code on both CPU and GPU.
In my experiments the GPU was on average about three times faster than the CPU (3:1). I believe it should take even less time, but this is the maximum performance achieved.
You can also find some good discussions in Run Keras with GPU.