Optimizing the number of optimum features

I am training a neural network using Keras. Every time I train my model, I use a slightly different set of features selected with tree-based feature selection via ExtraTreesClassifier(). After each training run, I compute the AUROC on my validation set and then loop back to train the model again with a different set of features. This process is very inefficient, and I want to select the optimum number of features using an optimization technique available in a Python library.
The function to be optimized is the cross-validation AUROC, which can only be calculated after training the model on the selected features. The features are selected via ExtraTreesClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto'). Here the objective function does not depend directly on the parameters to be optimized: the AUROC comes from training the neural network, and the neural network takes as input the features selected on the basis of their importance from ExtraTreesClassifier.
So in a way, the parameters over which I optimize the AUROC are n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', or some other variables in ExtraTreesClassifier. These are not directly related to the AUROC.

You should combine GridSearchCV and Pipeline.
Use Pipeline when you need to run a set of steps in sequence to find the optimal configuration.
For example, you have these steps to run:
1. Select KBest feature(s)
2. Use classifier DecisionTree or NaiveBayes
By combining GridSearchCV and Pipeline, you can select which features are best for a particular classifier, the best configuration for that classifier, and so on, based on the scoring criterion.
Example:
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
# Set your configuration options
param_grid = [{
    'classify': [DecisionTreeClassifier()],  # first option: use DT
    'kbest__k': range(1, 22),  # range of k in SelectKBest(k)
    # classifier-specific configs
    'classify__criterion': ('gini', 'entropy'),
    'classify__min_samples_split': range(2, 10),
    'classify__min_samples_leaf': range(1, 10)
},
{
    'classify': [GaussianNB()],  # second option: use NB
    'kbest__k': range(1, 22),  # range of k in SelectKBest(k)
}]
# DT is the default here, but GridSearchCV replaces it with each option in param_grid
pipe = Pipeline(steps=[("kbest", SelectKBest()), ("classify", DecisionTreeClassifier())])
# GridSearchCV does the heavy lifting here; this may take time, especially with more than one classifier to evaluate
grid = GridSearchCV(pipe, param_grid=param_grid, cv=10, scoring='f1')
grid.fit(features, labels)
# Print the best params so you can reuse the optimal setting later without rerunning the grid search
print(grid.best_params_)
# You can now use Pipeline again to wrap the steps with their best configs to build your model
pipe = Pipeline(steps=[("kbest", SelectKBest(k=12)), ("classify", DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2, min_samples_split=9))])
Hope this helps

The flow of my program is in two stages.
I am using sklearn's ExtraTreesClassifier along with SelectFromModel to select the most important features. Note that ExtraTreesClassifier takes many parameters, such as n_estimators, and SelectFromModel yields a different set of important features for different values of n_estimators. This means that I can optimize n_estimators to get the best features.
In the second stage, I train my Keras NN model on the features selected in the first stage. I am using AUROC as the score for the grid search, but this AUROC is calculated from the Keras neural network. I want to grid-search n_estimators in my ExtraTreesClassifier to optimize the AUROC of the Keras neural network. I know I have to use Pipeline, but I am confused about how to implement the two together.
I don't know where to put Pipeline in my code. I am getting an error which says: TypeError: estimator should be an estimator implementing 'fit' method, <function fs at 0x0000023A12974598> was passed
#################################################################################
# I concatenate the CV set and the train set so that I can select the most
# important features in both CV and train together.
##############################################################################
frames11 = [train_x_upsampled, cross_val_x_upsampled]
train_cv_x = pd.concat(frames11)
frames22 = [train_y_upsampled, cross_val_y_upsampled]
train_cv_y = pd.concat(frames22)
def fs(n_estimators):
m = ExtraTreesClassifier(n_estimators=n_estimators)
m.fit(train_cv_x,train_cv_y)
sel = SelectFromModel(m, prefit=True)
##################################################
# The code below gets the names of the selected important features
###################################################
feature_idx = sel.get_support()
feature_name = train_cv_x.columns[feature_idx]
feature_name =pd.DataFrame(feature_name)
X_new = sel.transform(train_cv_x)
X_new =pd.DataFrame(X_new)
######################################################################
# The important features selected are now in the data frame X_new. In the
# code below, I again divide the data into train and CV, but this time
# with only the important features selected.
####################################################################
train_selected_x = X_new.iloc[0:train_x_upsampled.shape[0], :]
cv_selected_x = X_new.iloc[train_x_upsampled.shape[0]:train_x_upsampled.shape[0]+cross_val_x_upsampled.shape[0], :]
train_selected_y = train_cv_y.iloc[0:train_x_upsampled.shape[0], :]
cv_selected_y = train_cv_y.iloc[train_x_upsampled.shape[0]:train_x_upsampled.shape[0]+cross_val_x_upsampled.shape[0], :]
train_selected_x=train_selected_x.values
cv_selected_x=cv_selected_x.values
train_selected_y=train_selected_y.values
cv_selected_y=cv_selected_y.values
##############################################################
# Now, with this new data, which contains only the important features,
# I train a neural network as below.
#########################################################
def create_model():
    n_x_new = train_selected_x.shape[1]
    model = Sequential()
    model.add(Dense(n_x_new, input_dim=n_x_new, kernel_initializer='glorot_normal', activation='relu'))
    model.add(Dense(10, kernel_initializer='glorot_normal', activation='relu'))
    model.add(Dropout(0.8))
    model.add(Dense(1, kernel_initializer='glorot_normal', activation='sigmoid'))
    optimizer = keras.optimizers.Adam(lr=0.001)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model  # KerasClassifier's build_fn must return the compiled model
seed = 7
np.random.seed(seed)
model = KerasClassifier(build_fn=create_model, epochs=20, batch_size=400, verbose=0)
n_estimators=[10,20,30]
param_grid = dict(n_estimators=n_estimators)
grid = GridSearchCV(estimator=fs, param_grid=param_grid,scoring='roc_auc',cv = PredefinedSplit(test_fold=my_test_fold), n_jobs=1)
grid_result = grid.fit(np.concatenate((train_selected_x, cv_selected_x), axis=0), np.concatenate((train_selected_y, cv_selected_y), axis=0))
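The TypeError above arises because GridSearchCV expects an estimator object with a fit method, not a plain function such as fs. One way around it is to wrap the feature selection and the model together in a Pipeline and let the grid search reach into the selector's parameters. The following is a minimal sketch under assumptions, not the asker's actual setup: the data is synthetic (make_classification), LogisticRegression stands in for the Keras model as the final step, and the step names "select" and "clf" are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary classification data (assumption: stands in for the real features).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Stage 1: tree-based feature selection. Stage 2: the model scored by AUROC.
pipe = Pipeline(steps=[
    ("select", SelectFromModel(ExtraTreesClassifier(random_state=0))),
    ("clf", LogisticRegression(max_iter=1000)),  # stand-in for the Keras model
])

# "<step>__estimator__<param>" addresses the ExtraTreesClassifier inside SelectFromModel.
param_grid = {"select__estimator__n_estimators": [10, 20, 30]}

grid = GridSearchCV(pipe, param_grid=param_grid, scoring="roc_auc", cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Because the selector is refit inside every cross-validation fold, the features are chosen from the training part of each fold only. Note that a Keras model used as the last step would need its input size to follow the number of selected features, which this sketch does not address.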

Related

Training with Dataset API and numpy array yields completely different results

I have a CNN regression model whose features have shape (2000, 3000, 1), where 2000 is the total number of samples and each sample is a (3000, 1) array. The batch size is 8, and 20% of the full dataset is used for validation.
However, zipping the features and labels into a tf.data.Dataset gives completely different scores from feeding the NumPy arrays in directly.
The tf.data.Dataset code looks like:
# Load features and labels
features = np.array(features) # shape is (2000, 3000, 1)
labels = np.array(labels) # shape is (2000,)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
# Training model
model.fit(train_dataset, validation_data=val_dataset,
batch_size=8, epochs=1000)
The numpy code looks like:
# Load features and labels
features = np.array(features) # exactly the same as previous
labels = np.array(labels) # exactly the same as previous
# Training model
model.fit(x=features, y=labels, shuffle=True, validation_split=0.2,
batch_size=8, epochs=1000)
Except for this, other code is exactly the same, for example
# Set global random seed
tf.random.set_seed(0)
np.random.seed(0)
# No preprocessing of feature at all
# Load model (exactly the same)
model = load_model()
# Compile model
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
loss=tf.keras.losses.MeanSquaredError(),
metrics=[tf.keras.metrics.mean_absolute_error, ],
)
The former method, via the tf.data.Dataset API, yields a mean absolute error (MAE) of around 10^-3 on both the training and validation sets, which looks quite suspicious, as the model doesn't have any dropout or regularization to prevent overfitting. On the other hand, feeding the NumPy arrays in directly gives a training MAE of around 0.1 and a validation MAE of around 1.
The low MAE of the tf.data.Dataset method looks very suspicious, but I couldn't find anything wrong with the code. I could also confirm that the number of training batches is 200 and the number of validation batches is 50, which suggested that I wasn't using the training set for validation.
I tried varying the global random seed and using different shuffle seeds, which didn't change the results much. Training was done on NVIDIA V100 GPUs, and I tried TensorFlow versions 2.9, 2.10 and 2.11, which didn't make much difference.
The problem lies in the default behaviour of the shuffle method of tf.data.Dataset, more specifically the reshuffle_each_iteration argument, which is True by default. That means that if I implement the following code:
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
model.fit(train_dataset, validation_data=val_dataset, batch_size=8, epochs=1000)
the dataset is actually reshuffled every epoch, though this is not apparent from the code. As a result, the validation data leaks into the training set (in fact, there is no distinction between the two sets, as the order is reshuffled every epoch).
So make sure to set reshuffle_each_iteration to False if you want to shuffle the dataset and then do a train-val split.
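An alternative that does not depend on the shuffle flag at all is to shuffle and split the arrays once, before any dataset is built. A minimal sketch, assuming NumPy arrays as in the question; the helper name split_train_val, the seed and the smaller feature shape are illustrative only.

```python
import numpy as np

def split_train_val(features, labels, val_fraction=0.2, seed=0):
    """Shuffle the sample order once, then cut it into disjoint train/val parts."""
    n = len(features)
    order = np.random.default_rng(seed).permutation(n)
    n_val = int(n * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (features[train_idx], labels[train_idx]), (features[val_idx], labels[val_idx])

# Toy data: labels are unique ids so that any overlap would be easy to detect.
features = np.zeros((2000, 30, 1), dtype=np.float32)
labels = np.arange(2000)

(x_train, y_train), (x_val, y_val) = split_train_val(features, labels)
overlap = set(y_train.tolist()) & set(y_val.tolist())
print(len(y_train), len(y_val), len(overlap))
```

Each part can then be passed to tf.data.Dataset.from_tensor_slices separately, with shuffle applied to the training dataset only, so reshuffling every epoch can no longer mix the two sets.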
UPDATE: TensorFlow confirmed this issue, and a warning will be added to future docs.
PS: It's a hard lesson for me, as I have been using the model for analysing the results for several months (as a graduating MPhil student).

Optimizing in tensorflow

Suppose I have a tensorflow graph implementing a classification model:
x = tf.placeholder(tf.float32, shape)
# [insert model here]
logits = tf.layers.dense(inputs=..., units=num_labels, activation=None)
Now suppose I want to optimize over the inputs using the Adam optimizer.
For instance, in order to find targeted adversarial examples, I would declare a variable to optimize over (initialized at some sample during execution), specify a target class different from the true class, compute the cross-entropy and minimize it.
var_to_optimize = tf.Variable(np.zeros(shape, dtype=np.float32))
tgt_label = tf.placeholder(tf.float32, shape=[num_labels])
xent = tf.nn.softmax_cross_entropy_with_logits_v2(labels=tgt_label, logits=logits)
I would then like to minimize the cross-entropy by perturbing the inputs
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
training_op = optimizer.minimize(xent, var_list=[var_to_optimize])
However, xent requires that I feed values for the input placeholder x. How do I link the model's logits with var_to_optimize?
The question I was trying to answer is essentially the following: how can one create two separate optimization procedures on the same tensorflow graph?
The tutorial in the following link describes how to do this: a tensorflow graph is defined that trains a neural network and then adds random noise (uniform across samples) optimized to induce misclassification of most samples.
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/12_Adversarial_Noise_MNIST.ipynb
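As for linking the logits to var_to_optimize: one common approach is to build the model on top of the variable instead of the placeholder x, so that the cross-entropy becomes a function of the variable. The sketch below illustrates that idea without TensorFlow, under loud assumptions: the "model" is a hypothetical fixed linear softmax classifier, the gradient is written out by hand, and plain gradient descent replaces Adam.

```python
import numpy as np

# Hypothetical frozen model: linear softmax classifier, 3 classes, 2 inputs.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)

def logits_fn(x):
    return W @ x + b

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(x, target):
    return -np.log(softmax(logits_fn(x))[target])

x_start = np.array([1.0, 0.5])       # the model assigns this sample to class 0
target = 2                           # class we want the model to predict instead
one_hot = np.eye(3)[target]

var_to_optimize = x_start.copy()     # the input itself is the optimized variable
for _ in range(200):
    # Gradient of the cross-entropy w.r.t. the input: W^T (softmax(logits) - one_hot)
    grad = W.T @ (softmax(logits_fn(var_to_optimize)) - one_hot)
    var_to_optimize -= 0.1 * grad    # the model weights W and b are never updated

print(np.argmax(logits_fn(x_start)), np.argmax(logits_fn(var_to_optimize)))
```

Only the input changes during the loop, which mirrors passing var_list=[var_to_optimize] to the optimizer in the question.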

How to create two graphs for train and validation?

When I read the TensorFlow guide on graphs and sessions (Graphs and Sessions), I found that it suggests creating two graphs for training and validation.
I think this is reasonable, and I want to use it because my training and validation models are different (for example, encoder-decoder mode or dropout). However, I don't know how to make the variables in the trained graph available to the test graph without using tf.saver().
When I create two graphs and create variables inside each graph, I find that these variables are totally different, as they belong to different graphs.
I have googled a lot, and I know there are questions about this problem, such as question1. But there is still no useful answer. Is there any code example, or does anyone know how to create two graphs for training and validation separately, such as:
def train_model():
    g_train = tf.Graph()
    with g_train.as_default():
        train_models
def validation_model():
    g_test = tf.Graph()
    with g_test.as_default():
        test_models
One easy way of doing that is to create a 'forward function' that defines the model and change behaviour based on extra parameters.
Here is an example:
def forward_pass(x, is_training, reuse=tf.AUTO_REUSE, name='model_forward_pass'):
    # Note the reuse attribute, as it tells the getter to either create the variables or fetch the existing weights
    with tf.variable_scope(name, reuse=reuse):
        x = tf.layers.conv(x, ...)
        ...
        x = tf.layers.dense(x, ...)
        x = tf.layers.dropout(x, rate, training=is_training)  # Note the is_training attribute
        ...
    return x
Now you can call the 'forward_pass' function anywhere in your code. You simply need to provide the is_training attribute to use the correct mode for dropout for example. The 'reuse' argument will automatically get the correct values for your weights as long as the 'name' of the 'variable_scope' is the same.
For example:
train_logits_model1 = forward_pass(x_train, is_training=True, name='model1')
# Graph is defined and dropout is used in training mode
test_logits_model1 = forward_pass(x_test, is_training=False, name='model1')
# Graph is reused but the dropout behaviour change to inference mode
train_logits_model2 = forward_pass(x_train2, is_training=True, name='model2')
# Name changed, model2 is added to the graph and dropout is used in training mode
To add to this answer: since you stated that you want two separate graphs, you could do that using an assign function:
train_graph = forward_pass(x, is_training=True, reuse=False, name='train_graph')
...
test_graph = forward_pass(x, is_training=False, reuse=False, name='test_graph')
...
train_vars = tf.get_collection('variables', 'train_graph/.*')
test_vars = tf.get_collection('variables','test_graph/.*')
test_assign_ops = []
for test, train in zip(test_vars, train_vars):
    test_assign_ops += [tf.assign(test, train)]
assign_op = tf.group(*test_assign_ops)
sess.run(assign_op)  # Replace the vars in test_graph with the ones from train_graph
I'm a big advocate of method 1, as it is much cleaner and reduces memory usage.
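The idea behind method 1 does not depend on TensorFlow: there is one set of weights, one forward function, and a flag that switches dropout on or off. A framework-free sketch in NumPy, where the layer sizes, the params dict and the rng argument are made up for illustration:

```python
import numpy as np

def forward_pass(x, params, is_training, rate=0.5, rng=None):
    """One hidden layer; dropout is applied only when is_training is True."""
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU hidden layer
    if is_training:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(h.shape) >= rate   # drop each unit with probability `rate`
        h = h * keep / (1.0 - rate)          # inverted dropout keeps the expected scale
    return h @ params["W2"] + params["b2"]

init = np.random.default_rng(0)
params = {                                   # a single set of weights shared by both modes
    "W1": init.normal(size=(4, 8)), "b1": np.zeros(8),
    "W2": init.normal(size=(8, 2)), "b2": np.zeros(2),
}
x = init.normal(size=(3, 4))

train_logits = forward_pass(x, params, is_training=True, rng=np.random.default_rng(1))
test_logits_a = forward_pass(x, params, is_training=False)
test_logits_b = forward_pass(x, params, is_training=False)
print(np.array_equal(test_logits_a, test_logits_b))
```

Both modes read the same params, so nothing has to be copied between a training model and a test model; only the dropout behaviour differs.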

Tensorflow variable value different on same training set

I built a neural network model in Python 3.6.
I'm trying to predict the price of condominiums based on attributes such as lat, lng, distance to public transport, year built, and so on.
I use the same training set for the model. However, each time I print out the values of the variables in the hidden layer, they are different.
testing_df_w_price = testing_df.copy()
testing_df.drop('PricePerSq',axis = 1, inplace = True)
training_df, testing_df = training_df.drop(['POID'], axis=1), testing_df.drop(['POID'], axis=1)
col_train = list(training_df.columns)
col_train_bis = list(training_df.columns)
col_train_bis.remove('PricePerSq')
mat_train = np.matrix(training_df)
mat_test = np.matrix(testing_df)
mat_new = np.matrix(training_df.drop('PricePerSq', axis = 1))
mat_y = np.array(training_df.PricePerSq).reshape((training_df.shape[0],1))
prepro_y = MinMaxScaler()
prepro_y.fit(mat_y)
prepro = MinMaxScaler()
prepro.fit(mat_train)
prepro_test = MinMaxScaler()
prepro_test.fit(mat_new)
train = pd.DataFrame(prepro.transform(mat_train),columns = col_train)
test = pd.DataFrame(prepro_test.transform(mat_test),columns = col_train_bis)
# List of features
COLUMNS = col_train
FEATURES = col_train_bis
LABEL = "PricePerSq"
# Columns for tensorflow
feature_cols = [tf.contrib.layers.real_valued_column(k) for k in FEATURES]
# Training set and Prediction set with the features to predict
training_set = train[COLUMNS]
prediction_set = train.PricePerSq
# Train and Test
x_train, x_test, y_train, y_test = train_test_split(training_set[FEATURES] , prediction_set, test_size=0.25, random_state=42)
y_train = pd.DataFrame(y_train, columns = [LABEL])
training_set = pd.DataFrame(x_train, columns = FEATURES).merge(y_train, left_index = True, right_index = True) # good
# Training for submission
training_sub = training_set[col_train] # good
# Same thing but for the test set
y_test = pd.DataFrame(y_test, columns = [LABEL])
testing_set = pd.DataFrame(x_test, columns = FEATURES).merge(y_test, left_index = True, right_index = True) # good
# Model
# tf.logging.set_verbosity(tf.logging.INFO)
tf.logging.set_verbosity(tf.logging.ERROR)
regressor = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols,
hidden_units=[int(len(col_train)+1/2)],
model_dir = "/tmp/tf_model")
for k in regressor.get_variable_names():
    print(k)
    print(regressor.get_variable_value(k))
Example of hidden layer value difference
The variables are initialized with random values when you construct the network. Since there are likely to be many local minima of your loss function, the fitted parameters will change every time you run the network.
In addition, even if your loss function were convex (with only one global minimum), the order of the variables is somewhat arbitrary. If, for example, you fit a network with 1 hidden layer of 2 hidden nodes, the parameters of node 1 in your first run might correspond to the parameters of node 2 in the next run, and vice versa.
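The point about node order can be checked numerically. The sketch below builds a tiny network with 1 hidden layer of 2 nodes in NumPy (the sizes, seed and tanh activation are arbitrary choices for illustration), swaps the two hidden nodes, and compares the outputs.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Network with a single tanh hidden layer."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                       # 5 samples, 3 input features
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(2, 1)), rng.normal(size=1)

# Swap hidden node 1 and hidden node 2: permute the columns of W1, the entries
# of b1 and the rows of W2 in the same way.
perm = [1, 0]
W1_p, b1_p, W2_p = W1[:, perm], b1[perm], W2[perm, :]

same_output = np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1_p, b1_p, W2_p, b2))
same_weights = np.allclose(W1, W1_p)
print(same_output, same_weights)
```

The two parameter sets differ element by element, yet they define the same function, which is one reason why comparing raw weights between runs says little about the model.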
In machine learning, the current "knowledge state" of your neural network is expressed through the weights of the connections in your graph. Broadly speaking, your whole network represents a high-dimensional function, and the task of learning means finding the global optimum of this function. The learning process changes the weights of the connections in your neural network according to the specified optimizer, which in your case is the default of tf.contrib.learn.DNNRegressor (the Adagrad optimizer). But there are other parameters that affect the final "knowledge state" of your model. These include, for instance (and I make no claim that the following list is complete):
The initial learning rate in your model
The learning rate schedule that adapts the learning rate over time
Any regularization and early stopping that may be defined
The initialization strategy used for weight initialization (e.g. He initialization or random initialization)
In addition (and this is perhaps the most important point for understanding why your weights are different after each retraining), you have to consider that you use a stochastic gradient descent algorithm during training. This means that for each optimization step the algorithm chooses a random subset of your whole training set. Therefore, an optimization step doesn't always point towards the global optimum of your high-dimensional function, but in the direction of steepest descent that could be computed from the randomly chosen subset. Because of this stochastic component in the optimization process, you will likely never reach the global optimum for your task. But with carefully chosen hyperparameters (and of course good data) you will reach a good approximate solution, which lies within a local optimum of the function and which can change every time you retrain the model.
So to conclude, don't look at the weights to judge the performance of your model, because they will be slightly different each time. Use a performance measure like the accuracy computed in a cross validation or a confusion matrix computed on the test set.
P.S. tf.contrib.learn.DNNRegressor is a deprecated function in the newest TensorFlow release, as you can see in the docs. Use tf.estimator.DNNRegressor instead.

Creating an image summary only for a subset of validation set images using Tensorflow Estimator API

I'm trying to add image summary operations to visualize how well my network manages to reconstruct inputs from the validation set. However, since there are too many images in the validation set I would only like to plot a small subset of them.
I managed to achieve this with a manual training loop, but I am struggling to achieve the same with the new TensorFlow Estimator/Experiment/Datasets API. Has anyone done something like this?
The Experiment and Estimator are high level TensorFlow APIs. Although you could probably solve your issue with a hook, if you want more control on what's happening during the training process, it may be easier not to use these APIs.
That said, you can still use the Dataset API which will bring you a lot of useful features.
To solve your problem with the Dataset API, you will need to switch between train and validation datasets in your training loop.
One way to do that is to use a feedable iterator. See here for more details:
https://www.tensorflow.org/programmers_guide/datasets
You can also see a full example switching between training and validation with the Dataset API in this notebook.
In brief, after having created your train_dataset and your val_dataset, your training loop could be something like this:
# Create TensorFlow Iterator objects
training_iterator = train_dataset.make_initializable_iterator()
val_iterator = val_dataset.make_initializable_iterator()
# Note: `handle` below is the string placeholder of the feedable iterator, assumed to be defined elsewhere
with tf.Session() as sess:
    # Initialize variables
    init = tf.global_variables_initializer()
    sess.run(init)
    # Create training data and validation data handles
    training_handle = sess.run(training_iterator.string_handle())
    validation_handle = sess.run(val_iterator.string_handle())
    for epoch in range(number_of_epochs):
        # Tell the training iterator to go to the beginning of the dataset
        sess.run(training_iterator.initializer)
        print("Starting epoch: ", epoch)
        # Iterate over the training dataset and train
        while True:
            try:
                sess.run(train_op, feed_dict={handle: training_handle})
            except tf.errors.OutOfRangeError:
                # End of epoch
                break
        # Tell the validation iterator to go to the beginning of the dataset
        sess.run(val_iterator.initializer)
        # Run validation on only 10 examples
        for i in range(10):
            my_value = sess.run(my_validation_op, feed_dict={handle: validation_handle})
            # Do whatever you want with my_value
            ...
I figured out a solution that uses Estimator/Experiment API.
First you need to modify your Dataset input to not only provide labels and features, but also some form of an identifier for each sample (in my case it was a filename). Then in the hyperparameters dictionary (params argument) you need to specify which of the validation samples you want to plot. You also will have to pass the model_dir in those parameters. For example:
params = tf.contrib.training.HParams(
model_dir=model_dir,
images_to_plot=["100307_EMOTION.nii.gz", "100307_FACE-SHAPE.nii.gz",
"100307_GAMBLING.nii.gz", "100307_RELATIONAL.nii.gz",
"100307_SOCIAL.nii.gz"]
)
learn_runner.run(
experiment_fn=experiment_fn,
run_config=run_config,
schedule="train_and_evaluate",
hparams=params
)
Having this set up you can create conditional Summary operations in your model_fn and an evaluation hook to include them in your outputs.
if mode == tf.contrib.learn.ModeKeys.EVAL:
    summaries = []
    for image_to_plot in params.images_to_plot:
        is_to_plot = tf.equal(tf.squeeze(filenames), image_to_plot)
        summary = tf.cond(is_to_plot,
                          lambda: tf.summary.image('predicted', predictions),
                          lambda: tf.summary.histogram("ignore_me", [0]),
                          name="%s_predicted" % image_to_plot)
        summaries.append(summary)
    evaluation_hooks = [tf.train.SummarySaverHook(
        save_steps=1,
        output_dir=os.path.join(params.model_dir, "eval"),
        summary_op=tf.summary.merge(summaries))]
else:
    evaluation_hooks = None
Note that the summaries have to be conditional: we are either plotting an image (computationally expensive) or saving a constant (computationally cheap). I opted for a histogram rather than a scalar for the dummy summaries to avoid cluttering my TensorBoard dashboard.
Finally, you need to pass the hook in the return object of your model_fn:
return tf.estimator.EstimatorSpec(
mode=mode,
predictions=predictions,
loss=loss,
train_op=train_op,
evaluation_hooks=evaluation_hooks
)
Please note that this only works when your batch size is 1 when evaluating the model (which should not be a problem).