Keras .predict with word embeddings back to string - tensorflow

I'm following the tutorial here: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html, using a different data set. I'm trying to predict the label for a new random string.
I'm doing labelling a bit different:
encoder = LabelEncoder()
encoder.fit(labels)
encoded_Y = encoder.transform(labels)
dummy_y = np_utils.to_categorical(encoded_Y)
And then trying to predict like:
string = "I am a cat"
query = tokenizer.texts_to_sequences(string)
query = pad_sequences(query, maxlen=50)
prediction = model.predict(query)
print(prediction)
I get back an array of arrays like below (perhaps the word embeddings?). What are those and how can I translate them back to a string?
[[ 0.03039312 0.02099193 0.02320454 0.02183384 0.01965107 0.01830118
0.0170384 0.01979697 0.01764384 0.02244077 0.0162186 0.02672437
0.02190582 0.01630476 0.01388928 0.01655456 0.011678 0.02256939
0.02161663 0.01649982 0.02086013 0.0161493 0.01821378 0.01440909
0.01879989 0.01217389 0.02032642 0.01405699 0.01393504 0.01957162
0.01818203 0.01698637 0.02639499 0.02102267 0.01956343 0.01588933
0.01635705 0.01391534 0.01587612 0.01677094 0.01908684 0.02032183
0.01798265 0.02017053 0.01600159 0.01576616 0.01373934 0.01596323
0.01386674 0.01532488 0.01638312 0.0172212 0.01432543 0.01893282
0.02020231]

Save the fitted labels in the encoder:
encoder = LabelEncoder()
encoder = encoder.fit(labels)
encoded_Y = encoder.transform(labels)
dummy_y = np_utils.to_categorical(encoded_Y)
Prediction will give you a class vector. And by using the inverse_transform you will get the label type as from your original input:
prediction = model.predict_classes(query)
label = encoder.inverse_transform(prediction)

Related

How do I make train and test set the same size using bag of words

I want to use bag of words to train a regression model. But, I do not want to have information leak between my train and test sets. So, that means I need to create a word vector on the train set separate from my test set. I ran this code
def bow (tokens, data):
tokens = tokens.apply(nltk.word_tokenize)
cvec = CountVectorizer(min_df = .01, max_df = .95, ngram_range=(1,2), tokenizer=lambda doc:doc, lowercase=False)
cvec.fit(tokens)
cvec_counts = cvec.transform(tokens)
cvec_counts_bow = cvec_counts.toarray()
vocab = cvec.get_feature_names()
bow_model = pd.DataFrame(cvec_counts_bow, columns=vocab)
return bow_model
X_train = bow(train['clean_text'], train)
X_test = bow(test['clean_text'], test)
vocab = list(X_train.columns)
But the shape of my data frames are X_train: (300, 730) and X_test (35, 1661)
There are more unique words in my test set than my train set -- because the train set is so small --- and the words do not match. There are words in X_train that are not in X_test and vice versa. I thought to create a vocab list from X_train then keep only the columns in X_test but that doesn't seem right.
How do I fit the vocabulary from the X_train onto X_test?
It's looking like you create different vocabs for train and test sets.
You can try to move next code from bow function and create just one vocab which is based on train set. Also you don't need to use nltk.word_tokenize because CountVectorizer already have tokenizer:
cvec = CountVectorizer(min_df = .01, max_df = .95, ngram_range=(1,2), lowercase=False)
cvec.fit(train['clean_text'])
vocab = cvec.get_feature_names()
print(vocab)
And then change bow function:
def bow (tokens, vocab, cvec):
cvec_counts = cvec.transform(tokens)
cvec_counts_bow = cvec_counts.toarray()
bow_model = pd.DataFrame(cvec_counts_bow, columns=vocab)
return bow_model
X_train = bow(train['clean_text'], vocab, cvec)
X_test = bow(test['clean_text'], vocab, cvec)
Or look at example how I did it.
Firstly, I create vocab using just train set. Also I use just more common words from this set and other set like '<UNK>'.
You can use next code for this:
def create_vocab(all_tokens, vocab_size):
token_counts = Counter(all_tokens)
token_counts = token_counts.most_common()[:vocab_size]
vocab_list = ['<UNK>'] + [token for token, _ in token_counts]
return vocab_list
Than after creating vocab I use it for bag of words like there:
def bag_of_words(tokens, vocab_list):
result = [0] * len(vocab_list)
for t in tokens:
if t not in vocab_list:
t = '<UNK>'
result[vocab_list.index(t)] += 1
return result
Using this code I create bag of words which include all words from my vocab (which was created using just train set) and '<UNK>' symbol for words which is not in my vocab.
There is an example how I run this code:
vocab_list = create_vocab(list(chain.from_iterable(tokens_train)), 5000)
bag_of_words_train = []
for token in tokens_train:
bag_of_words_train.append(bag_of_words(token, vocab_list))
bag_of_words_test = []
for token in tokens_test:
bag_of_words_test.append(bag_of_words(token, vocab_list))

How can i extract the encoded part of multi-modal autoencoder and convert the .h5 model to a numpy array?

I am making a deep multimodal autoencoder model which takes two inputs and produces a two outputs (which are the reconstructed inputs). The two inputs are with shape of (1000, 50) and (1000,60) respectively and the model has 3 hidden layers and aim to concatenate the two latent layer of input1 and input2.
I would like to extract the encoded part of my model and save the data as a numpy array.
here is the complete code of the model :
input_X = Input(shape=(X[0].shape))
dense_X = Dense(40,activation='relu')(input_X)
dense1_X = Dense(20,activation='relu')(dense_X)
latent_X= Dense(2,activation='relu')(dense1_X)
input_X1 = Input(shape=(X1[0].shape))
dense_X1 = Dense(40,activation='relu')(input_X1)
dense1_X1 = Dense(20,activation='relu')(dense_X1)
latent_X1= Dense(2,activation='relu')(dense1_X1)
Concat_X_X1 = concatenate([latent_X, latent_X1])
decoding_X = Dense(20,activation='relu')(Concat_X_X1)
decoding1_X = Dense(40,activation='relu')(decoding_X)
output_X = Dense(X[0].shape[0],activation='sigmoid')(decoding1_X)
decoding_X1 = Dense(20,activation='relu')(Concat_X_X1)
decoding1_X1 = Dense(40,activation='relu')(decoding_X1)
output_X1 = Dense(X1[0].shape[0],activation='sigmoid')(decoding1_X1)
multi_modal_autoencoder = Model([input_X, input_X1], [output_X, output_X1], name='multi_modal_autoencoder')
encoder = Model([input_X, input_X1], Concat_X_X1)
encoder.save('encoder.h5')
multi_modal_autoencoder.compile(optimizer=keras.optimizers.Adam(lr=0.001),loss='mse')
model = multi_modal_autoencoder.fit([X,X1], [X, X1], epochs=70, batch_size=150)
With h5py package you can get into your .h5 file and extract exactly what you want:
f = h5py.File('encoder.h5', 'r')
keys = list(f.keys())
values = f.get('some_key')
You can hierarchically use .get many times to go deeper into your .h5 file to extract what you need.

Mapping text data through huggingface tokenizer

I have my encode function that looks like this:
from transformers import BertTokenizer, BertModel
MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)
def encode(texts, tokenizer=tokenizer, maxlen=10):
# import pdb; pdb.set_trace()
inputs = tokenizer.encode_plus(
texts,
return_tensors='tf',
return_attention_masks=True,
return_token_type_ids=True,
pad_to_max_length=True,
max_length=maxlen
)
return inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]
I want to get my data encoded on the fly by doing this:
x_train = (tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values)
.map(encode))
However, this chucks the error:
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
Now from my understanding when I set a breakpoint inside encode it was because I was sending a non-numpy array. How do I get huggingface transformers to play nice with tensorflow strings as inputs?
If you need a dummy dataframe here it is:
df_train = pd.DataFrame({'comment_text': ['Today was a good day']*5})
What I tried
So I tried to use from_generator so that I can parse in the strings to the encode_plus function. However, this does not work with TPUs.
AUTO = tf.data.experimental.AUTOTUNE
def get_gen(df):
def gen():
for i in range(len(df)):
yield encode(df.loc[i, 'comment_text']) , df.loc[i, 'toxic']
return gen
shapes = ((tf.TensorShape([maxlen]), tf.TensorShape([maxlen]), tf.TensorShape([maxlen])), tf.TensorShape([]))
train_dataset = tf.data.Dataset.from_generator(
get_gen(df_train),
((tf.int32, tf.int32, tf.int32), tf.int32),
shapes
)
train_dataset = train_dataset.batch(BATCH_SIZE).prefetch(AUTO)
Version Info:
transformers.__version__, tf.__version__ => ('2.7.0', '2.1.0')
the tokenizer of bert works on a string, a list/tuple of strings or a list/tuple of integers. So, check is your data getting converted to string or not. To apply tokenizer on whole dataset I used Dataset.map, but this runs on graph mode. So, I need to wrap it in a tf.py_function. The tf.py_function will pass regular tensors (with a value and a .numpy() method to access it), to the wrapped python function. My data was getting converted to bytes after using py_function hence I applied tf.compat.as_str to convert bytes to string.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def encode(lang1, lang2):
lang1 = tokenizer.encode(tf.compat.as_str(lang1.numpy()), add_special_tokens=True)
lang2 = tokenizer.encode(tf.compat.as_str(lang2.numpy()), add_special_tokens=True)
return lang1, lang2
def tf_encode(pt, en):
result_pt, result_en = tf.py_function(func = encode, inp = [pt, en], Tout=[tf.int64, tf.int64])
result_pt.set_shape([None])
result_en.set_shape([None])
return result_pt, result_en
train_dataset = dataset3.map(tf_encode)
BUFFER_SIZE = 200
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE,
padded_shapes=(60, 60))
a,p = next(iter(train_dataset))
When you create the tensorflow dataset with: tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values)
tensorflow converts your strings into tensors of string type which is not an accepted input of of tokenizer.encode_plus. Like the error message says it only accepts a string, a list/tuple of strings or a list/tuple of integers. You can verify this by adding a print(type(texts)) inside your encode function (Output:<class 'tensorflow.python.framework.ops.Tensor'>).
I'm not sure what your follow up plan is and why you need a tf.data.Dataset, but you have to encode your input before you turn it into a tf.data.Dataset:
import tensorflow as tf
from transformers import BertTokenizer, BertModel
MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)
texts = ['Today was a good day', 'Today was a bad day',
'Today was a rainy day', 'Today was a sunny day',
'Today was a cloudy day']
#inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]
inputs = tokenizer.batch_encode_plus(
texts,
return_tensors='tf',
return_attention_masks=True,
return_token_type_ids=True,
pad_to_max_length=True,
max_length=10
)
dataset = tf.data.Dataset.from_tensor_slices((inputs['input_ids'],
inputs['attention_mask'],
inputs['token_type_ids']))
print(type(dataset))
Output:
<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
I had this exact error but my mistake was simple, I had a few NaNs in my texts.
So make sure to check if there are NaNs in your texts dataframe.

Error in prediction in svm classifier after one hot encoding

I have used one hot encoding to my dataset before training my SVM classifier.
which increased number of features in training set to 982. But during
prediction of test dataset which has 7 features i am getting error " X has 7
features per sample; expecting 982". I don't understand how to increase
number of features in test dataset.
My code is:
df = pd.read_csv('train.csv',header=None);
features = df.iloc[:,:-1].values
labels = df.iloc[:,-1].values
encode = LabelEncoder()
features[:,2] = encode.fit_transform(features[:,2])
features[:,3] = encode.fit_transform(features[:,3])
features[:,4] = encode.fit_transform(features[:,4])
features[:,5] = encode.fit_transform(features[:,5])
df1 = pd.DataFrame(features)
#--------------------------- ONE HOT ENCODING --------------------------------#
hotencode = OneHotEncoder(categorical_features=[2])
features = hotencode.fit_transform(features).toarray()
hotencode = OneHotEncoder(categorical_features=[14])
features = hotencode.fit_transform(features).toarray()
hotencode = OneHotEncoder(categorical_features=[37])
features = hotencode.fit_transform(features).toarray()
hotencode = OneHotEncoder(categorical_features=[466])
features = hotencode.fit_transform(features).toarray()
X = np.array(features)
y = np.array(labels)
clf = svm.LinearSVC()
clf.fit(X,y)
d_test = pd.read_csv('query.csv')
Z_test =np.array(d_test)
confidence = clf.predict(Z_test)
print("The query image belongs to Class ")
print(confidence)
######################### test dataset
query.csv
1 0.076 1 3232236298 2886732679 3128 60604
The short answer: you need to apply the same OHE transform (or LE+OHE in your case) on the test set.
For a good advice, see Scikit Learn OneHotEncoder fit and transform Error: ValueError: X has different shape than during fitting or How to deal with imputation and hot one encoding in pandas?

Which variables to pass to a tensor flow predictor for tf.feature_coloumns using wide and deep learning model?

I've been trying out the wide and deep learning example from the tensor flow site: https://www.tensorflow.org/tutorials/wide_and_deep
I can train and evaluate the model and even predict within that same process but I can't seem to figure out what input I need to pass into the predictor function when I try to do a prediction from a model that was saved and then reloaded via the predictor.from_saved_model function.
My feature columns and model look like this which runs fine:
term = tf.feature_column.categorical_column_with_vocabulary_list("term", unique_terms['term'].tolist())
name = tf.feature_column.categorical_column_with_vocabulary_list("name", unique_name['name'].tolist())
base_columns = [term, cust_name]
crossed_columns = [
tf.feature_column.crossed_column(["term", "cust_name"], hash_bucket_size=100000),
]
deep_columns = [
tf.feature_column.indicator_column(term),
tf.feature_column.indicator_column(cust_name),
]
model_dir = export_dir
search_model = tf.estimator.DNNLinearCombinedClassifier(
model_dir=model_dir,
linear_feature_columns=crossed_columns,
dnn_feature_columns=deep_columns,
dnn_hidden_units=[100, 50])
I saved the model like this:
feature_columns = crossed_columns + deep_columns
feature_spec = tf.feature_column.make_parse_example_spec(feature_columns)
export_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
servable_model_dir = export_dir
servable_model_path = search_model.export_savedmodel(servable_model_dir, export_input_fn)
And then I load it back from file like this:
predict_fn = predictor.from_saved_model(export_dir)
predictions = predict_fn({'X':[10]})
The predict_fn is expecting a dictionary with key "inputs" not "x" except I am not sure what the value of "inputs" should be. Can anyone help me out on this please?