How to predict on a test sequence using a distilbert model? - tensorflow

I'm trying to predict on a test sequence using ktrain with a DistilBERT model. My code looks like this:
trn, val, preproc = text.texts_from_array(x_train=x_train, y_train=y_train,
                                          x_test=x_test, y_test=y_test,
                                          class_names=train_b.target_names,
                                          preprocess_mode='distilbert',
                                          maxlen=350)
model = text.text_classifier('distilbert', train_data=trn, preproc=preproc, multilabel=True)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=64)
y_pred = learner.model.predict(val, verbose=0)
With the other ktrain models such as NBSVM, fastText, and BiGRU this is straightforward, because texts_from_array returns a NumPy array; with DistilBERT, however, it returns a TransformerDataset, so predicting on a sequence with learner.model.predict() is not possible and raises a Python index exception. I also cannot use the validate() method to generate a confusion matrix, since I have a multi-label classification problem. My question is: how can I test on a test sequence with ktrain using DistilBERT? I need this because my metric function is implemented with the sklearn.metrics library and expects the test and validation sequences in NumPy format.

You can use a Predictor instance as shown in the tutorial.
The Predictor simply uses the preproc object to transform the raw text into the format expected by the model and feeds this to the model.
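For example, a minimal sketch, assuming learner and preproc are the objects created above and x_test / y_test hold the raw texts and multi-hot labels (the 0.5 threshold is an arbitrary choice for this illustration):
import numpy as np
from sklearn.metrics import classification_report

# Wrap the trained model and the preprocessing object in a Predictor
predictor = ktrain.get_predictor(learner.model, preproc)

# predict() accepts raw text; return_proba=True returns per-class probabilities
y_proba = np.array(predictor.predict(x_test, return_proba=True))

# For a multi-label problem, threshold the probabilities to get binary predictions
y_pred = (y_proba >= 0.5).astype(int)

print(classification_report(y_test, y_pred))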

Related

How is data from tf.data generated and passed to the model

In the book Hands-On ML with Scikit-Learn, TensorFlow and Keras, the author explains how to use the Data API to manipulate, transform, and pass data to the model efficiently. He writes the following function:
def csv_reader_dataset(filepaths, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=5)
    dataset = dataset.shuffle(10000).repeat(1)
    return dataset.batch(batch_size).prefetch(1)
Then: train_set = csv_reader_dataset(train_filepaths)
and: model.fit(train_set, epochs=10)
What I don't understand is the part where he creates the actual train_set from the function: doesn't that mean he only has one batch of data? He says that we create the training set once and don't need to repeat it, as that will be taken care of by Keras, but I don't see how.
A tf.data.Dataset is like a blueprint for how to get your data. To read data from a dataset, you create an iterator over it. The same dataset can be used to create multiple iterators, each of which iterates over the whole dataset. So Keras only needs the one dataset: it can use it to iterate over your data as many times as it needs.
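A small illustration of that point (a toy in-memory dataset rather than the book's CSV pipeline):
import tensorflow as tf

# The dataset is a pipeline definition, not the data itself
dataset = tf.data.Dataset.range(5).shuffle(5).batch(2)

# Each for-loop creates a fresh iterator over the same dataset,
# so every "epoch" walks through all of the data again
for epoch in range(2):
    for batch in dataset:
        print(epoch, batch.numpy())
model.fit(train_set, epochs=10) does the same thing internally: it re-iterates the dataset once per epoch.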

Extract the output of the embedding layer

I am trying to build a regression model, for which I have a nominal variable with very high cardinality. I am trying to get the categorical embedding of the column.
Input:
df["nominal_column"]
Output:
the embeddings of the column.
I want to use the output of the embedding layer alone, since I need it as an input to my traditional regression model. Is there a way to extract just that output?
P.S I am not asking for code, any suggestion on the approach would be great.
If the embedding is part of the model and you train it, then you can use the functional API of Keras to get the output of any intermediate operation in your graph:
x = Input((number_of_categories,))
y = Embedding(parameters_of_your_embeddings)(x)
output = Rest_of_your_model()(y)
model = Model(inputs=[x], outputs=[output, y])
If you do this before training the model, you'll have to define a custom loss function that deals only with part of the output. The other way is to train the model with just one output, then create an identical model with two outputs and set the weights of the second model from the trained one.
If you just want the embedding matrix from your model, you can use the get_weights method of the embedding layer, which returns the weights as NumPy arrays.
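A minimal sketch of both routes, assuming a trained Keras model named model whose embedding layer is named 'nominal_embedding' (a hypothetical name) and whose input is the integer-encoded nominal column:
import numpy as np
from tensorflow import keras

# Route 1: a sub-model that outputs the embedding layer's activations
embedding_extractor = keras.Model(
    inputs=model.input,
    outputs=model.get_layer('nominal_embedding').output)
embeddings = embedding_extractor.predict(df["nominal_column"].values)

# Route 2: pull the full embedding matrix (one row per category)
embedding_matrix = model.get_layer('nominal_embedding').get_weights()[0]
category_vectors = embedding_matrix[df["nominal_column"].values]
Either set of per-category vectors can then be fed as features into the traditional regression model.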

xgboost.train probability output needed

XGBClassifier outputs probabilities if we use the predict_proba method; however, when I train the model using xgboost.train, I cannot figure out how to get probabilities as output. Here is a chunk of my code:
dtrain = xgb.DMatrix(X_train, label=y)
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
modelXG = xgb.train(param, dtrain, xgb_model='xgbmodel')
xgboost.train() returns an xgb.Booster object. For a classification problem, the xgb.Booster.predict() call returns probabilities rather than the labels you might expect if you are used to the .predict() methods of sklearn models. So the modelXG.predict(dtest) call will give you what you need.
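For example, a minimal sketch, assuming X_test is the test feature matrix and you also want hard labels (the 0.5 threshold is an arbitrary choice):
import numpy as np
import xgboost as xgb

dtest = xgb.DMatrix(X_test)

# With objective='binary:logistic', Booster.predict returns P(y=1) for each row
probs = modelXG.predict(dtest)

# Threshold the probabilities if class labels are needed
labels = (probs >= 0.5).astype(int)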

How to get the value of a tensor from a TensorFlow model

I am using the following implementation of the Seq2Seq model. Now, if I want to pass some inputs and get the corresponding values of the encoder's hidden state (self.encoder_last_state), how can I do it?
https://github.com/JayParks/tf-seq2seq/blob/master/seq2seq_model.py
You need to first assemble input_feed, similar to the predict routine. Once you have that, just execute sess.run over the required hidden layer.
To assemble the input_feed:
input_feed = self.check_feeds(encoder_inputs, encoder_inputs_length, decoder_inputs=None, decoder_inputs_length=None, decode=True)
input_feed[self.keep_prob_placeholder.name] = 1.0
sess.run over self.encoder_last_state:
encoder_last_state_activations = sess.run(self.encoder_last_state, input_feed)
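Putting the pieces together, a sketch assuming model is an instance of the Seq2SeqModel class from the linked repository and sess is a tf.Session in which the model's variables have already been restored:
# Build the feed dict in decode mode (no decoder inputs are needed)
input_feed = model.check_feeds(encoder_inputs, encoder_inputs_length,
                               decoder_inputs=None, decoder_inputs_length=None,
                               decode=True)
input_feed[model.keep_prob_placeholder.name] = 1.0

# Fetch the encoder's last hidden state for this batch of inputs
encoder_last_state = sess.run(model.encoder_last_state, input_feed)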

How to initialize a keras tensor employed in an API model

I am trying to implement a memory-augmented neural network, in which the memory and the read/write/usage weight vectors are updated according to a combination of their previous values. These weights are different from the classic weight matrices between layers that are automatically updated by the fit() function! My problem is the following: how can I correctly initialize these weights as Keras tensors and use them in the model? I explain it better with the following simplified example.
My API model is something like:
input = Input(shape=(5,6))
controller = LSTM(20, activation='tanh',stateful=False, return_sequences=True)(input)
write_key = Dense(4,activation='tanh')(controller)
read_key = Dense(4,activation='tanh')(controller)
w_w = Add()([w_u, w_r]) #<---- UPDATE OF WRITE WEIGHTS
to_write = Dot()([w_w, write_key])
M = Add()([M,to_write])
cos_sim = Dot()([M,read_key])
w_r = Lambda(lambda x: softmax(x,axis=1))(cos_sim) #<---- UPDATE OF READ WEIGHTS
w_u = Add()([w_u,w_r,w_w]) #<---- UPDATE OF USAGE WEIGHTS
retrieved_memory = Dot()([w_r,M])
controller_output = concatenate([controller,retrieved_memory])
final_output = Dense(6,activation='sigmoid')(controller_output)
You can see that, in order to compute w_w^t, I first need to have defined w_r^{t-1} and w_u^{t-1}. So, at the beginning, I have to provide a valid initialization for these vectors. What is the best way to do it? The initializations I would like to have are:
M = K.variable(numpy.zeros((10,4))) # MEMORY
w_r = K.variable(numpy.zeros((1,10))) # READ WEIGHTS
w_u = K.variable(numpy.zeros((1,10))) # USAGE WEIGHTS
But, analogously to what is said in #2486 (entron), these commands do not return a Keras tensor with all the needed metadata, so this raises the following error:
AttributeError: 'NoneType' object has no attribute 'inbound_nodes'
I also thought of using the old M, w_r and w_u as additional inputs at each iteration and, analogously, getting the same variables back as outputs to close the loop. But this means I would have to use the fit() function to train the model online with just the target as the final output (Model 1), and employ the predict() function on the model with all the secondary outputs (Model 2) to get the variables to use at the next iteration. I would also have to pass the weight matrices from Model 1 to Model 2 using get_weights() and set_weights(). As you can see, it becomes a little messy and too slow.
Do you have any suggestions for this problem?
P.S. Please, do not focus too much on the API model above because it is a simplified (almost meaningless) version of the complete one where I skipped several key steps.