Mapping text data through huggingface tokenizer - tensorflow

I have my encode function that looks like this:
from transformers import BertTokenizer, BertModel
MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)
def encode(texts, tokenizer=tokenizer, maxlen=10):
# import pdb; pdb.set_trace()
inputs = tokenizer.encode_plus(
texts,
return_tensors='tf',
return_attention_masks=True,
return_token_type_ids=True,
pad_to_max_length=True,
max_length=maxlen
)
return inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]
I want to get my data encoded on the fly by doing this:
x_train = (tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values)
.map(encode))
However, this chucks the error:
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
Now from my understanding when I set a breakpoint inside encode it was because I was sending a non-numpy array. How do I get huggingface transformers to play nice with tensorflow strings as inputs?
If you need a dummy dataframe here it is:
df_train = pd.DataFrame({'comment_text': ['Today was a good day']*5})
What I tried
So I tried to use from_generator so that I can parse in the strings to the encode_plus function. However, this does not work with TPUs.
AUTO = tf.data.experimental.AUTOTUNE
def get_gen(df):
def gen():
for i in range(len(df)):
yield encode(df.loc[i, 'comment_text']) , df.loc[i, 'toxic']
return gen
shapes = ((tf.TensorShape([maxlen]), tf.TensorShape([maxlen]), tf.TensorShape([maxlen])), tf.TensorShape([]))
train_dataset = tf.data.Dataset.from_generator(
get_gen(df_train),
((tf.int32, tf.int32, tf.int32), tf.int32),
shapes
)
train_dataset = train_dataset.batch(BATCH_SIZE).prefetch(AUTO)
Version Info:
transformers.__version__, tf.__version__ => ('2.7.0', '2.1.0')

the tokenizer of bert works on a string, a list/tuple of strings or a list/tuple of integers. So, check is your data getting converted to string or not. To apply tokenizer on whole dataset I used Dataset.map, but this runs on graph mode. So, I need to wrap it in a tf.py_function. The tf.py_function will pass regular tensors (with a value and a .numpy() method to access it), to the wrapped python function. My data was getting converted to bytes after using py_function hence I applied tf.compat.as_str to convert bytes to string.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def encode(lang1, lang2):
lang1 = tokenizer.encode(tf.compat.as_str(lang1.numpy()), add_special_tokens=True)
lang2 = tokenizer.encode(tf.compat.as_str(lang2.numpy()), add_special_tokens=True)
return lang1, lang2
def tf_encode(pt, en):
result_pt, result_en = tf.py_function(func = encode, inp = [pt, en], Tout=[tf.int64, tf.int64])
result_pt.set_shape([None])
result_en.set_shape([None])
return result_pt, result_en
train_dataset = dataset3.map(tf_encode)
BUFFER_SIZE = 200
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE,
padded_shapes=(60, 60))
a,p = next(iter(train_dataset))

When you create the tensorflow dataset with: tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values)
tensorflow converts your strings into tensors of string type which is not an accepted input of of tokenizer.encode_plus. Like the error message says it only accepts a string, a list/tuple of strings or a list/tuple of integers. You can verify this by adding a print(type(texts)) inside your encode function (Output:<class 'tensorflow.python.framework.ops.Tensor'>).
I'm not sure what your follow up plan is and why you need a tf.data.Dataset, but you have to encode your input before you turn it into a tf.data.Dataset:
import tensorflow as tf
from transformers import BertTokenizer, BertModel
MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)
texts = ['Today was a good day', 'Today was a bad day',
'Today was a rainy day', 'Today was a sunny day',
'Today was a cloudy day']
#inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]
inputs = tokenizer.batch_encode_plus(
texts,
return_tensors='tf',
return_attention_masks=True,
return_token_type_ids=True,
pad_to_max_length=True,
max_length=10
)
dataset = tf.data.Dataset.from_tensor_slices((inputs['input_ids'],
inputs['attention_mask'],
inputs['token_type_ids']))
print(type(dataset))
Output:
<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>

I had this exact error but my mistake was simple, I had a few NaNs in my texts.
So make sure to check if there are NaNs in your texts dataframe.

Related

How to avoid memory leakage in an autoregressive model within tensorflow

Recently, I am training a LSTM with attention mechanism for regressionin tensorflow 2.9 and I met an problem during training with model.fit():
At the beginning, the training time is okay, like 7s/step. However, it was increasing during the process and after several steps, like 1000, the value might be 50s/step. Here below is a part of the code for my model:
class AttentionModel(tf.keras.Model):
def __init__(self, encoder_output_dim, dec_units, dense_dim, batch):
super().__init__()
self.dense_dim = dense_dim
self.batch = batch
encoder = Encoder(encoder_output_dim)
decoder = Decoder(dec_units,dense_dim)
self.encoder = encoder
self.decoder = decoder
def call(self, inputs):
# Creat a tensor to record the result
tempt = list()
encoder_output, encoder_state = self.encoder(inputs)
new_features = np.zeros((self.batch, 1, 1))
dec_initial_state = encoder_state
for i in range(6):
dec_inputs = DecoderInput(new_features=new_features, enc_output=encoder_output)
dec_result, dec_state = self.decoder(dec_inputs, dec_initial_state)
tempt.append(dec_result.logits)
new_features = dec_result.logits
dec_initial_state = dec_state
result=tf.concat(tempt,1)
return result
In the official documents for tf.function, I notice: "Don't rely on Python side effects like object mutation or list appends".
Since I use a dynamic python list with append() to record the intermediate variables, I guess each time during training, a new tf.graph was added. Is the reason my training is getting slower and slower?
Additionally, what should I use instead of python list to avoid this? I have tried with a numpy.zeros matrix but it will lead to another problem:
tempt = np.zeros(shape=(1,6))
...
for i in range(6):
dec_inputs = DecoderInput(new_features=new_features, enc_output=encoder_output)
dec_result, dec_state = self.decoder(dec_inputs, dec_initial_state)
tempt[i]=(dec_result.logits)
...
Cannot convert a symbolic tf.Tensor (decoder/dense_3/BiasAdd:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported.

How to perform string find and replace on Tensorflow String Tensor?

I currently am using the Tensorflow dataset api to perform some augmentations to images at a specified path. The filename itself contains information that states whether to augment the file or not. So what I want to do is read in the files from the dataset and for each file, perform a find within the filename and if I find a specific substring, then set a bool flag and replace the substring with "".
The error I get is:
AttributeError: 'Tensor' object has no attribute 'find'
I can't perform a "find" on the tensor with dtype string entries because find is not a part of the Tensor, so I am trying to figure out how I can go about performing the above action. I have shared some code below that I think demonstrates what I am trying to do. Performance is important, so I would prefer to do this the correct way if anyone sees that I am going about doing this via the Dataset API incorrectly.
def preproc_img(filenames):
def parse_fn(filename):
augment_inst = False
if cfg.SPLIT_INTO_INST:
#*****************************************************
#*** THIS IS WHERE THE LOGIC IS CURRENTLY BREAKING ***
#*****************************************************
if filename.find('_data_augmentation') != -1:
augment_inst = True
filename = filename.replace('_data_augmentation', '')
image_string = tf.read_file(filename)
img = tf.image.decode_image(image_string, channels=3)
return dict(zip([filename], [img]))
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.map(parse_fn)
iterator = dataset.make_one_shot_iterator()
return iterator.get_next()
def perform_train():
if __name__ == '__main__':
filenames = helper.get_image_paths()
next_batch = preproc_img(filenames)
with tf.Session() as sess:
with sess .graph.as_default():
sess.run(tf.local_variables_initializer())
sess.run(tf.global_variables_initializer())
dat = sess.run(next_batch)
# I would now go about calling any of my tf op code below
You can use tf.regex_replace for replacing text in a tf.string tensor.
filename = tf.regex_replace(filename, "_data_augmentation", "")
For TF 2.0
filename = tf.strings.regex_replace(filename, "_data_augmentation", "")

Keras .predict with word embeddings back to string

I'm following the tutorial here: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html, using a different data set. I'm trying to predict the label for a new random string.
I'm doing labelling a bit different:
encoder = LabelEncoder()
encoder.fit(labels)
encoded_Y = encoder.transform(labels)
dummy_y = np_utils.to_categorical(encoded_Y)
And then trying to predict like:
string = "I am a cat"
query = tokenizer.texts_to_sequences(string)
query = pad_sequences(query, maxlen=50)
prediction = model.predict(query)
print(prediction)
I get back an array of arrays like below (perhaps the word embeddings?). What are those and how can I translate them back to a string?
[[ 0.03039312 0.02099193 0.02320454 0.02183384 0.01965107 0.01830118
0.0170384 0.01979697 0.01764384 0.02244077 0.0162186 0.02672437
0.02190582 0.01630476 0.01388928 0.01655456 0.011678 0.02256939
0.02161663 0.01649982 0.02086013 0.0161493 0.01821378 0.01440909
0.01879989 0.01217389 0.02032642 0.01405699 0.01393504 0.01957162
0.01818203 0.01698637 0.02639499 0.02102267 0.01956343 0.01588933
0.01635705 0.01391534 0.01587612 0.01677094 0.01908684 0.02032183
0.01798265 0.02017053 0.01600159 0.01576616 0.01373934 0.01596323
0.01386674 0.01532488 0.01638312 0.0172212 0.01432543 0.01893282
0.02020231]
Save the fitted labels in the encoder:
encoder = LabelEncoder()
encoder = encoder.fit(labels)
encoded_Y = encoder.transform(labels)
dummy_y = np_utils.to_categorical(encoded_Y)
Prediction will give you a class vector. And by using the inverse_transform you will get the label type as from your original input:
prediction = model.predict_classes(query)
label = encoder.inverse_transform(prediction)

use Featureunion in scikit-learn to combine two pandas columns for tfidf

While using this as a model for spam classification, I'd like to add an additional feature of the Subject plus the body.
I have all of my features in a pandas dataframe. For example, the subject is df['Subject'], the body is df['body_text'] and the spam/ham label is df['ham/spam']
I receive the following error:
TypeError: 'FeatureUnion' object is not iterable
How can I use both df['Subject'] and df['body_text'] as features all while running them through the pipeline function?
from sklearn.pipeline import FeatureUnion
features = df[['Subject', 'body_text']].values
combined_2 = FeatureUnion(list(features))
pipeline = Pipeline([
('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
('tfidf_transformer', TfidfTransformer()),
('classifier', MultinomialNB())])
pipeline.fit(combined_2, df['ham/spam'])
k_fold = KFold(n=len(df), n_folds=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold:
train_text = combined_2.iloc[train_indices]
train_y = df.iloc[test_indices]['ham/spam'].values
test_text = combined_2.iloc[test_indices]
test_y = df.iloc[test_indices]['ham/spam'].values
pipeline.fit(train_text, train_y)
predictions = pipeline.predict(test_text)
prediction_prob = pipeline.predict_proba(test_text)
confusion += confusion_matrix(test_y, predictions)
score = f1_score(test_y, predictions, pos_label='spam')
scores.append(score)
FeatureUnion was not meant to be used that way. It instead takes two feature extractors / vectorizers and applies them to the input. It does not take data in the constructor the way it is shown.
CountVectorizer is expecting a sequence of strings. The easiest way to provide it with that is to concatenate the strings together. That would pass both the text in both columns to the same CountVectorizer.
combined_2 = df['Subject'] + ' ' + df['body_text']
An alternative method would be to run CountVectorizer and optionally TfidfTransformer individually on each column, and then stack the results.
import scipy.sparse as sp
subject_vectorizer = CountVectorizer(...)
subject_vectors = subject_vectorizer.fit_transform(df['Subject'])
body_vectorizer = CountVectorizer(...)
body_vectors = body_vectorizer.fit_transform(df['body_text'])
combined_2 = sp.hstack([subject_vectors, body_vectors], format='csr')
A third option is to implement your own transformer that would extract a dataframe column.
class DataFrameColumnExtracter(TransformerMixin):
def __init__(self, column):
self.column = column
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
return X[self.column]
In that case you could use FeatureUnion on two pipelines, each containing your custom transformer, then CountVectorizer.
subj_pipe = make_pipeline(
DataFrameColumnExtracter('Subject'),
CountVectorizer()
)
body_pipe = make_pipeline(
DataFrameColumnExtracter('body_text'),
CountVectorizer()
)
feature_union = make_union(subj_pipe, body_pipe)
This feature union of pipelines will take the dataframe and each pipeline will process its column. It will produce the concatenation of term count matrices from the two columns given.
sparse_matrix_of_counts = feature_union.fit_transform(df)
This feature union can also be added as the first step in a larger pipeline.

Parsing `summary_str` byte string evaluated on tensorflow summary object

Currently tensorflow's tensorboard is not compatible with python3. Therefore and generally, I am looking for a way to print out the summary readouts once in 100 epochs.
Is there a function to parse the summary_str byte string produced in the following lines into a dictionary of floats?
summary_op = tf.merge_all_summaries()
summary_str = sess.run(summary_op, feed_dict=feed_dict)
You can get a textual representation of summary_str by parsing it into a tf.Summary protocol buffer as follows:
summary_proto = tf.Summary()
summary_proto.ParseFromString(summary_str)
print(summary_proto)
You can then convert it into a dictionary mapping string tags to floats:
summaries = {}
for val in summary_proto.value:
# Assuming all summaries are scalars.
summaries[val.tag] = val.simple_value