Here is my code for distributed training via spark-tensorflow-distributor, which uses TensorFlow's MultiWorkerMirroredStrategy to train across multiple servers:
https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-distributor/spark_tensorflow_distributor/mirrored_strategy_runner.py
import sys
from spark_tensorflow_distributor import MirroredStrategyRunner
import mlflow.keras
mlflow.keras.autolog()
mlflow.log_param("learning_rate", 0.001)
import tensorflow as tf
import time
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

def train():
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    # tf.distribute.experimental.CollectiveCommunication.NCCL
    model = None
    with strategy.scope():
        data = load_breast_cancer()
        X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3)
        N, D = X_train.shape  # number of observations and variables
        from sklearn.preprocessing import StandardScaler
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
        model = tf.keras.models.Sequential([
            tf.keras.layers.Input(shape=(D,)),
            tf.keras.layers.Dense(1, activation='sigmoid')  # sigmoid output for binary classification
        ])
        model.compile(optimizer='adam',  # adaptive moment estimation
                      loss='binary_crossentropy',
                      metrics=['accuracy'])
        # Train the model
        r = model.fit(X_train, y_train, validation_data=(X_test, y_test))
        mlflow.keras.log_model(model, "mymodel")

MirroredStrategyRunner(num_slots=4, use_custom_strategy=True).run(train)
I notice that saving via mlflow.keras.log_model produces 4 models in the Databricks experiment, and none of the 4 models is a good predictor.
If I change num_slots from 4 to 1, only 1 model is saved in the Databricks experiment, and that model is a good predictor during inference.
My question is:
Do I need an extra step to merge the 4 models together to create 1 model that predicts as well as with num_slots = 1? Or am I doing something wrong? I was expecting only the chief node to save the model.
So, you do not want to call log_model in all 4 of the TensorFlow workers; you want to log it from just 1 of them. I believe you would use https://www.tensorflow.org/api_docs/python/tf/distribute/get_replica_context to figure out which worker you are, and perhaps only log if you are worker 0. That's what I do when using Horovod for a similar purpose.
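A minimal sketch of that check, assuming the workers can read their task index from the TF_CONFIG environment variable that MultiWorkerMirroredStrategy relies on (whether spark-tensorflow-distributor sets it exactly this way is an assumption):

import json, os

def is_chief():
    # TF_CONFIG describes this task's role in the cluster; worker index 0 acts
    # as the chief (assumption: spark-tensorflow-distributor sets TF_CONFIG per task).
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    return task.get("type", "chief") == "chief" or (
        task.get("type") == "worker" and task.get("index", 0) == 0
    )

# ...inside train(), after model.fit(...):
# if is_chief():
#     mlflow.keras.log_model(model, "mymodel")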
You do not merge the models; they are the same model in all 4 replicas. That's the point of what this is doing.
If the model is 'worse' than with 1 replica, I would suspect other subtler issues are at play. For example, with 4 workers, your batch size has changed unless you compensate for that. See https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#train_the_model for a discussion.
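As a rough sketch of that compensation (the per-worker batch size of 32 is just Keras's default and an assumption here):

# Scale the global batch size with the number of workers so each worker still
# sees the same per-worker batch size as the single-worker run.
per_worker_batch_size = 32
num_workers = 4  # matches num_slots above
global_batch_size = per_worker_batch_size * num_workers

r = model.fit(X_train, y_train,
              validation_data=(X_test, y_test),
              batch_size=global_batch_size)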
Related
I am trying to solve an exercise from a machine learning book, where a classifier should be trained on the CIFAR-10 dataset using TensorFlow and Keras. I have attached a code example. The code runs in a Jupyter notebook inside PyCharm.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import time
import os
import tensorflow as tf
from tensorflow import keras
tf.random.set_seed(42)
np.random.seed(42)
def build_model():
    # Build a model as instructed
    model = keras.Sequential()
    model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
    for i in range(20):
        model.add(keras.layers.Dense(100,
                                     activation=keras.activations.elu,
                                     kernel_initializer=tf.keras.initializers.HeNormal()))
    model.add(keras.layers.Dense(10, activation=keras.activations.softmax))
    return model
model = build_model()
# Load the CIFAR10 image dataset
cifar10 = keras.datasets.cifar10.load_data()
X_train = cifar10[0][0] / 255.
X_test = cifar10[1][0] / 255.
y_train = cifar10[0][1]
y_test = cifar10[1][1]
print(X_train.max())
from sklearn.model_selection import train_test_split
# Split the data
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1, shuffle= True)
model = build_model()
root_logdir = os.path.join(os.curdir, "my_logs_11-8")
def get_run_logdir():
    run_id = time.strftime("run_%Y_%m_%d_%H_%M_%S")
    return os.path.join(root_logdir, run_id)
run_logdir = get_run_logdir()
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
model.compile(optimizer=keras.optimizers.Nadam(learning_rate=5e-5), loss="sparse_categorical_crossentropy")
history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[tensorboard_cb],
                    batch_size=32)
I am already running into problems here:
I do not get past a few epochs (10-20), because with each epoch the Python process takes up more and more memory. My machine has 16 GB, of which 12 GB are usually free. When memory is full, my IDE (PyCharm) crashes and the Python process is killed. Is that due to a memory leak? What can I do to fix it?
For each epoch, Keras reports an estimated time for how long that epoch took. However, the time reported is much smaller than the wall time (24 s/epoch vs. ~60 s/epoch). The book I am following seems to reach much faster training on a much weaker machine. How can this be?
I have been going in circles with this problem for years: I want to forecast t+1 using the forecast at t+0 as one of my inputs.
All I can find is running my model one step at a time and manually inserting my last forecast into the input for the next one-step run... not efficient and impossible to train.
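Roughly, what I do today looks like this (a sketch; model, last_window, and horizon are placeholders, not my real code):

import numpy as np

# Predict one step, feed the prediction back into the input window, repeat.
preds = []
window = last_window.copy()              # shape: (input_size,)
for _ in range(horizon):
    yhat = model.predict(window[None, :])[0, 0]
    preds.append(yhat)
    window = np.concatenate([window[1:], [yhat]])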
I use Keras with TensorFlow. Thanks for any help!
I suggest RegressorChain/ClassifierChain from sklearn. As you specify, this model iterates the fit, at each step using the previous predictions as features for the new fit. Here is an example for a regression task:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from sklearn.multioutput import RegressorChain
n_sample = 1000
input_size = 20
X = np.random.uniform(0,1, (n_sample,input_size))
y = np.random.uniform(0, 1, (n_sample, 3))  # <== 3-step forecast

def create_model():
    global input_size
    model = Sequential([
        Dense(32, activation='relu', input_shape=(input_size,)),
        Dense(1)
    ])
    model.compile(optimizer='Adam', loss='mse')
    # increase the input dimension so each iteration includes the previous predictions
    input_size += 1  # <== important
    return model

model = tf.keras.wrappers.scikit_learn.KerasRegressor(build_fn=create_model, epochs=1,
                                                      batch_size=256, verbose=1)
chain = RegressorChain(model, order='random', random_state=42)
chain.fit(X, y)
chain.predict(X).shape
I'm running a rather simple Keras classification based on 30 features. What I don't understand yet is why the loss function becomes much more volatile when I increase the number of rows going into the model:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("cancer_classification.csv")
df = df.iloc[:50]
# split data
X = df.drop("benign_0__mal_1", axis=1).values
y = df["benign_0__mal_1"].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=101)
# scale
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # assign the scaled arrays back
X_test = scaler.transform(X_test)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
print(X_train.shape)
# ---> (50, 30)
model = Sequential()
model.add(Dense(30, activation="relu"))
model.add(Dense(15, activation="relu"))
model.add(Dense(5, activation="relu"))
# binary classification - so last layer has sigmoid activation function
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam")
# we will overfit to show what it looks like - so 1000 epochs
model.fit(X_train, y_train, epochs=1000, validation_data=(X_test, y_test))
# plotting it out - we leave out the first 10 epochs so we don't skew the chart with the high loss values at the beginning
loss_df = pd.DataFrame(model.history.history)
loss_df = loss_df.iloc[10:]
loss_df.plot()
plt.show()
The original idea was to visualize overfitting: loss keeps dropping while val_loss starts to rise. I wonder why using 500 rows creates such wild oscillations in the loss functions.
[Loss plots: 50 df rows vs. 500 df rows]
Try shuffling your data, and check the class distribution for the first 50 rows and for the 500 rows; I think this might be a consequence of not shuffling. It can also be caused by the features you are feeding into your model.
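For instance, a quick way to compare the class balance of the two slices (a sketch, assuming the same cancer_classification.csv and target column as above):

import pandas as pd

df = pd.read_csv("cancer_classification.csv")

# Class distribution of the first 50 rows vs. the first 500 rows
print(df["benign_0__mal_1"].iloc[:50].value_counts(normalize=True))
print(df["benign_0__mal_1"].iloc[:500].value_counts(normalize=True))

# Shuffle before slicing so any subset reflects the overall distribution
df = df.sample(frac=1, random_state=42).reset_index(drop=True)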
I'm using a Google Colab TPU to train a simple Keras model. Removing the distributed strategy and running the same program on the CPU is much faster than on the TPU. How is that possible?
import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# Load Iris dataset
x = load_iris().data
y = load_iris().target
# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)
# Convert train data type to use TPU
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')
# Specify a distributed strategy to use TPU
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)
# Use the strategy to create and compile a Keras model
with strategy.scope():
    model = Sequential()
    model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
    model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
    model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')
start = timeit.default_timer()
# Fit the Keras model on the dataset
model.fit(x_train, y_train, batch_size=20, epochs=20, validation_data=[x_val, y_val], verbose=0, steps_per_epoch=2)
print('\nTime: ', timeit.default_timer() - start)
Thank you for your question.
I think what's happening here is a matter of overhead -- since the TPU runs on a separate VM (accessible at grpc://$COLAB_TPU_ADDR), each call to run a model on the TPU incurs some amount of overhead as the client (the Colab notebook in this case) sends a graph to the TPU, which is then compiled and run. This overhead is small compared to the time it takes to run, e.g., ResNet50 for one epoch, but large compared to the time it takes to run a simple model like the one in your example.
For best results on TPU we recommend using tf.data.Dataset. I updated your example for TensorFlow 2.2:
%tensorflow_version 2.x
import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# Load Iris dataset
x = load_iris().data
y = load_iris().target
# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)
# Convert train data type to use TPU
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)
# Use the strategy to create and compile a Keras model
with strategy.scope():
    model = Sequential()
    model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
    model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
    model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')
start = timeit.default_timer()
# Fit the Keras model on the dataset
model.fit(train_dataset, epochs=20, validation_data=val_dataset)
print('\nTime: ', timeit.default_timer() - start)
This takes about 30 seconds to run, compared to ~1.3 seconds to run on CPU. We can substantially reduce the overhead here by repeating the dataset and running one long epoch rather than several small ones. I replaced the dataset setup with this:
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).repeat(20).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)
And replaced the fit call with this:
model.fit(train_dataset, validation_data=val_dataset)
This brings the runtime down to about 6 seconds for me. This is still slower than CPU, but that's not surprising for such a small model that can easily be run locally. In general, you'll see more benefit from using TPUs with larger models. I recommend looking through TensorFlow's official TPU guide, which presents a larger image classification model for the MNIST dataset.
This is probably due to the batch size you are using. In comparison to CPU and GPU, the training speed of a TPU is highly dependent on the batch size. Check the following site for more information:
https://cloud.google.com/tpu/docs/performance-guide
The Cloud TPU hardware is different from CPUs and GPUs. At a high level, CPUs can be characterized as having a low number of high performing threads. GPUs can be characterized as having a very high number of low performing threads. A Cloud TPU, with its 128 x 128 matrix unit, can be thought of as either a single, very powerful thread, which can perform 16K ops per cycle, or 128 x 128 tiny, simple threads that are connected in pipeline fashion. Correspondingly, when addressing memory, multiples of 8 (floats) are desirable, as well as multiples of 128 for operations targeting the matrix unit.
This means that the batch size should be a multiple of 128 per TPU core. Google Colab provides a TPU with 8 cores, so in the best case you should select a batch size of 128 * 8 = 1024.
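As a minimal sketch (assuming the TPUStrategy setup from the answer above), the batch size can be derived from the number of replicas instead of being hard-coded:

# strategy.num_replicas_in_sync is 8 on a Colab TPU, so this gives 128 * 8 = 1024
per_core_batch_size = 128
global_batch_size = per_core_batch_size * strategy.num_replicas_in_sync

train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                 .repeat()
                 .batch(global_batch_size, drop_remainder=True))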
I am trying to implement a Keras regression model on a dataset for learning purposes. I have taken the data from the Kaggle Loan Default Prediction challenge, and I am trying to predict whether a person will default on a loan or not.
The target column is imbalanced: the majority of the observations have "0" as their value. I have tried the following approaches to overcome this imbalance: (a) downsample the majority class, (b) upsample the minority class, (c) use the SMOTE algorithm. But these approaches do not seem to help, and the model's predictions are biased towards "0" since the majority class in the dataset is "0". I have used the resample method from sklearn for the downsampling and upsampling.
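For reference, the upsampling step I tried looks roughly like this (a sketch using sklearn.utils.resample; the binarisation of the 'loss' column into a default flag is an assumption about my preprocessing):

import pandas as pd
from sklearn.utils import resample

df = pd.read_csv('/train_v2.csv/train_v2.csv')
df['default'] = (df['loss'] > 0).astype(int)   # assumed binarisation of the target

majority = df[df['default'] == 0]
minority = df[df['default'] == 1]

# Upsample the minority class to the size of the majority class
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_upsampled])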
What different approaches can I try to overcome this problem, achieve good accuracy with my model on this data, and get realistic predictions from it? I am sharing my code:
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import L1L2
import pandas
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
import statsmodels.api as sm
from sklearn import preprocessing as pre
train = pandas.read_csv('/train_v2.csv/train_v2.csv')
# Defining the target column
train_loss = train.loss
# Defining the features for the model
train = train[['f527','f528','f271']]
# Defining the imputer function
imp = SimpleImputer()
# Fitting the imputation function to the training dataset
imp.fit(train)
train = imp.transform(train)
train=pre.StandardScaler().fit_transform(train)
# Splitting the data into Training and Testing samples
X_train, X_test, y_train, y_test = train_test_split(train, train_loss,
                                                     test_size=0.3, random_state=42)
# model with L1 and L2 regularization on the first layer
reg = L1L2(l1=0.01, l2=0.01)
model = Sequential()
model.add(Dense(13, kernel_initializer='normal', activation='relu',
                kernel_regularizer=reg, input_dim=X_train.shape[1]))
model.add(Dense(6, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))