I'm new to NLP and have just started getting into TensorFlow. I'm curious as to why
imdb_sentences = []
train = tfds.as_numpy(tfds.load('imdb_reviews', split='train'))
for item in train:
    imdb_sentences.append(str(item['text']))
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)
returns no error and works fine but,
(x_train,y_train),(x_test,y_test) = tfds.as_numpy(tfds.load('imdb_reviews',split=['train','test'],batch_size=-1,as_supervised=True))
x_train=list(x_train)
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(x_train)
sequences = tokenizer.texts_to_sequences(x_train)
returns this:
TypeError: a bytes-like object is required, not 'dict'
Why is there a type mismatch if imdb_sentences and x_train contain the exact same data and both are of type list?
If you print the first element of imdb_sentences:
['b"This was an absolutely terrible movie. Don\'t be ..."']
This is the first element of x_train in the second scenario:
[b"This was an absolutely terrible movie. Don't be ..."]
See the difference?
Just transform the content of x_train to string, something like the code below, and it will work:
x_train=list(x_train)
x_train = [str(elem) for elem in x_train]
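To make the difference concrete, here is a small sketch in plain Python (no TensorFlow needed; the sample text is a shortened stand-in for a review). Note that str() on a bytes object keeps the b'...' wrapper as part of the resulting text, whereas .decode("utf-8") gives the underlying characters. Either produces a str the Tokenizer will accept; which one you want depends on whether the literal prefix should end up in the text.

```python
# A bytes value like the ones tfds.as_numpy yields for the 'text' feature
# (shortened, illustrative sample).
raw = b"This was an absolutely terrible movie."

as_str = str(raw)              # keeps the b'...' wrapper inside the string
decoded = raw.decode("utf-8")  # plain text without the wrapper

print(type(raw).__name__, type(as_str).__name__, type(decoded).__name__)
print(as_str)
print(decoded)
```

If the wrapper is unwanted, the list comprehension above could use elem.decode("utf-8") in place of str(elem).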
I am pretty new at this, so there might be something I am missing completely, but here is my problem: I am trying to create a Tokenizer class that uses the pretrained tokenizer models from Huggingface. I would then like to use this class in a larger transformer model to tokenize my input data. Here is the class code
from transformers import AutoTokenizer
from transformers import RobertaTokenizer
class Roberta(MyTokenizer):
    def build(self, *args, **kwargs):
        self.max_length = self.phd.max_length
        self.untokenized_data = self.questions + self.answers
    def tokenize_and_filter(self):
        # Initialize the tokenizer with a pretrained model
        Tokenizer = AutoTokenizer.from_pretrained('roberta')
        tokenized_inputs, tokenized_outputs = [], []
        inputs = Tokenizer(self.questions, padding=True)
        outputs = Tokenizer(self.answers, padding=True)
        tokenized_inputs = inputs['input_ids']
        tokenized_outputs = outputs['input_ids']
        return tokenized_inputs, tokenized_outputs
When I call the function tokenize_and_filter in my Transformer model as below
questions = self.get_tokenizer().tokenize_and_filter
answers = self.get_tokenizer().tokenize_and_filter
print(questions)
and I try to print the tokenized data, I get this message:
<bound method Roberta.tokenize_and_filter of <MyTokenizer.Roberta.Roberta object at 0x000002779A9E4D30>>
It appears that the function returns a method instead of a list or a tensor. I've tried passing return_tensors='tf', using the tokenizer.encode() method, using both AutoTokenizer and RobertaTokenizer, and using the batch_encode_plus() method; nothing seems to work.
Please help!
It seems this was a really stupid error on my part: I forgot to put parentheses when calling the function
questions = self.get_tokenizer().tokenize_and_filter
answers = self.get_tokenizer().tokenize_and_filter
should actually be
questions = self.get_tokenizer().tokenize_and_filter()
answers = self.get_tokenizer().tokenize_and_filter()
and it works this way :)
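For anyone hitting the same message, here is a minimal sketch of the difference between referencing a method and calling it. The Greeter class is a hypothetical stand-in for the tokenizer class above, not the original code.

```python
class Greeter:
    """Hypothetical stand-in for the tokenizer class in the question."""

    def tokenize_and_filter(self):
        return [1, 2, 3]


g = Greeter()
without_call = g.tokenize_and_filter   # the bound method object itself
with_call = g.tokenize_and_filter()    # the method's return value

print(callable(without_call))  # a method object can still be called later
print(with_call)
```

Printing without_call shows a "bound method ..." representation like the one in the question, because nothing has been executed yet.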
I have my encode function that looks like this:
from transformers import BertTokenizer, BertModel
MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)
def encode(texts, tokenizer=tokenizer, maxlen=10):
    # import pdb; pdb.set_trace()
    inputs = tokenizer.encode_plus(
        texts,
        return_tensors='tf',
        return_attention_masks=True,
        return_token_type_ids=True,
        pad_to_max_length=True,
        max_length=maxlen
    )
    return inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]
I want to get my data encoded on the fly by doing this:
x_train = (tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values)
.map(encode))
However, this chucks the error:
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
From what I could tell when I set a breakpoint inside encode, this happens because I am sending something that is not a NumPy array. How do I get Hugging Face transformers to play nice with TensorFlow strings as inputs?
If you need a dummy dataframe here it is:
df_train = pd.DataFrame({'comment_text': ['Today was a good day']*5})
What I tried
I tried to use from_generator so that I can pass plain Python strings to the encode_plus function. However, this does not work with TPUs.
AUTO = tf.data.experimental.AUTOTUNE
def get_gen(df):
    def gen():
        for i in range(len(df)):
            yield encode(df.loc[i, 'comment_text']), df.loc[i, 'toxic']
    return gen
shapes = ((tf.TensorShape([maxlen]), tf.TensorShape([maxlen]), tf.TensorShape([maxlen])), tf.TensorShape([]))
train_dataset = tf.data.Dataset.from_generator(
    get_gen(df_train),
    ((tf.int32, tf.int32, tf.int32), tf.int32),
    shapes
)
train_dataset = train_dataset.batch(BATCH_SIZE).prefetch(AUTO)
Version Info:
transformers.__version__, tf.__version__ => ('2.7.0', '2.1.0')
The BERT tokenizer works on a string, a list/tuple of strings, or a list/tuple of integers, so check whether your data is actually being converted to strings. To apply the tokenizer to the whole dataset I used Dataset.map, but that runs in graph mode, so I needed to wrap the call in tf.py_function. tf.py_function passes regular tensors (with a value and a .numpy() method to access it) to the wrapped Python function. After going through py_function my data arrived as bytes, so I applied tf.compat.as_str to convert the bytes to strings.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def encode(lang1, lang2):
    lang1 = tokenizer.encode(tf.compat.as_str(lang1.numpy()), add_special_tokens=True)
    lang2 = tokenizer.encode(tf.compat.as_str(lang2.numpy()), add_special_tokens=True)
    return lang1, lang2
def tf_encode(pt, en):
    result_pt, result_en = tf.py_function(func=encode, inp=[pt, en], Tout=[tf.int64, tf.int64])
    result_pt.set_shape([None])
    result_en.set_shape([None])
    return result_pt, result_en
train_dataset = dataset3.map(tf_encode)
BUFFER_SIZE = 200
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE,
                                                                padded_shapes=(60, 60))
a,p = next(iter(train_dataset))
When you create the TensorFlow dataset with tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values), TensorFlow converts your strings into tensors of string type, which is not an accepted input of tokenizer.encode_plus. As the error message says, it only accepts a string, a list/tuple of strings or a list/tuple of integers. You can verify this by adding print(type(texts)) inside your encode function (output: <class 'tensorflow.python.framework.ops.Tensor'>).
I'm not sure what your follow up plan is and why you need a tf.data.Dataset, but you have to encode your input before you turn it into a tf.data.Dataset:
import tensorflow as tf
from transformers import BertTokenizer, BertModel
MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)
texts = ['Today was a good day', 'Today was a bad day',
'Today was a rainy day', 'Today was a sunny day',
'Today was a cloudy day']
#inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]
inputs = tokenizer.batch_encode_plus(
    texts,
    return_tensors='tf',
    return_attention_masks=True,
    return_token_type_ids=True,
    pad_to_max_length=True,
    max_length=10
)
dataset = tf.data.Dataset.from_tensor_slices((inputs['input_ids'],
                                              inputs['attention_mask'],
                                              inputs['token_type_ids']))
print(type(dataset))
Output:
<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
I had this exact error but my mistake was simple, I had a few NaNs in my texts.
So make sure to check if there are NaNs in your texts dataframe.
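As a rough illustration of that check, here is a sketch in plain Python. The texts list and the is_missing helper are hypothetical; with a pandas column you would more typically use isna() to find the rows and dropna() or fillna("") to deal with them.

```python
import math

# Hypothetical sample: one entry is NaN, the way a missing text usually shows up.
texts = ["Today was a good day", float("nan"), "Today was a bad day"]


def is_missing(value):
    """Return True for None or float NaN, the usual stand-ins for missing text."""
    return value is None or (isinstance(value, float) and math.isnan(value))


bad_positions = [i for i, t in enumerate(texts) if is_missing(t)]
clean_texts = [t for t in texts if not is_missing(t)]

print(bad_positions)
print(clean_texts)
```

Only the entries in clean_texts are real strings, so only those are safe to hand to the tokenizer.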
Simple question, and I'm sure the answer is straightforward, but I'm really struggling to match the model's input shape with the tensor I'm fitting the model on.
This simple code
let tf = require('@tensorflow/tfjs-node');
let features = {
  x: [1,2,3,4,5,6,7,8,9],
  y: [1,2,3,4,5,6,7,8,9]
}
let tensorfeature = tf.tensor2d(Object.values(features))
console.log(tensorfeature.shape)
const model = tf.sequential();
model.add(tf.layers.dense(
  {
    inputShape: tensorfeature.shape,
    units: 1
  }
))
const optimizer = tf.train.sgd(0.005);
model.compile({optimizer: optimizer, loss: 'meanAbsoluteError'});
model.fit(tensorfeature,
  {epochs: 5}
)
Results in Error: Error when checking input: expected dense_Dense1_input to have 3 dimension(s). but got array with shape 2,9
I tried multiple things with reshape, slice, etc., with no luck. Can someone point out what exactly is wrong?
model.fit takes at least two parameters, x and y, which are either tensors or arrays of tensors. The config object is the third parameter.
Also, the feature tensor (tensorfeature) passed to model.fit should have one more dimension than the model's inputShape, because the first axis is the batch. Since tensorfeature.shape is used as the inputShape, tensorfeature must be expanded by one dimension before the model can be trained on it. This can be done with reshape or expandDims.
model.fit(tensorfeature.expandDims(0))
// or possibly
model.fit(tensorfeature.reshape([1, ...tensorfeature.shape]))
This shape mismatch between the model and the training data has been discussed here and there
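The "one dimension higher" rule can be sketched by reasoning about shapes alone. The snippet below is plain Python with hypothetical helper names; it does not call TensorFlow.js and only mimics the rank check behind the error message.

```python
def expected_fit_shape(input_shape):
    """Shape expected for training data: a batch axis (None), then input_shape."""
    return [None] + list(input_shape)


def expand_dims_shape(shape, axis=0):
    """Shape after inserting a new axis of size 1 at position `axis`."""
    shape = list(shape)
    return shape[:axis] + [1] + shape[axis:]


def is_compatible(data_shape, expected):
    """A data shape fits when ranks match and every non-None dimension agrees."""
    return len(data_shape) == len(expected) and all(
        e is None or e == d for d, e in zip(data_shape, expected)
    )


input_shape = [2, 9]                        # tensorfeature.shape in the question
expected = expected_fit_shape(input_shape)  # three dimensions, as in the error

print(is_compatible([2, 9], expected))                      # data passed as-is
print(is_compatible(expand_dims_shape([2, 9]), expected))   # after expandDims(0)
```

With inputShape set to [2, 9], the whole tensor counts as a single training example, which is why the batch axis has to be added in front.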
I am working on a sample project for the California housing price problem and am getting the above error while training my model.
Following this article
https://colab.research.google.com/notebooks/mlcc/first_steps_with_tensor_flow.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=firststeps-colab&hl=en#scrollTo=pDIxp6vcU809
In my case, the error was caused by passing:
my_feature = california_housing_dataframe["total_rooms"]
into:
ds = Dataset.from_tensor_slices((features, targets))
The solution is to pass:
my_feature = california_housing_dataframe[["total_rooms"]]
y_train = train_set.pop("satisfaction")
train_input_fn = make_input_fn(train_set, y_train) #make_input_fn is the input function
linear_est.train(train_input_fn) # train
The error for me was that I wrote y_train = "satisfaction" instead of
y_train = train_set.pop("satisfaction"). The pop function removes the specified column and returns it, so in this case the column is saved to the y_train variable. You then use it as the value the model learns to predict later on.
I am currently following the tutorial to retrain Inception for image classification:
https://cloud.google.com/blog/big-data/2016/12/how-to-train-and-classify-images-using-google-cloud-machine-learning-and-cloud-dataflow
However, when I make a prediction with the API, I only get the index of my class as the label. I would like the API to return a string with the actual class name instead, e.g. instead of
predictions:
- key: '0'
  prediction: 4
  scores:
  - 8.11998e-09
  - 2.64907e-08
  - 1.10307e-06
I would like to get:
predictions:
- key: '0'
  prediction: ROSES
  scores:
  - 8.11998e-09
  - 2.64907e-08
  - 1.10307e-06
Looking at the reference for the Google API it should be possible:
https://cloud.google.com/ml-engine/reference/rest/v1/projects/predict
I already tried changing the following in model.py:
outputs = {
    'key': keys.name,
    'prediction': tensors.predictions[0].name,
    'scores': tensors.predictions[1].name
}
tf.add_to_collection('outputs', json.dumps(outputs))
to
if tensors.predictions[0].name == 0:
    pred_name ='roses'
elif tensors.predictions[0].name == 1:
    pred_name ='tulips'
outputs = {
    'key': keys.name,
    'prediction': pred_name,
    'scores': tensors.predictions[1].name
}
tf.add_to_collection('outputs', json.dumps(outputs))
but this doesn't work.
My next idea was to change this part in the preprocess.py file, so that instead of getting the index I use the string label.
def process(self, row, all_labels):
    try:
        row = row.element
    except AttributeError:
        pass
    if not self.label_to_id_map:
        for i, label in enumerate(all_labels):
            label = label.strip()
            if label:
                self.label_to_id_map[label] = label #i
and
label_ids = []
for label in row[1:]:
    try:
        label_ids.append(label.strip())
        #label_ids.append(self.label_to_id_map[label.strip()])
    except KeyError:
        unknown_label.inc()
but this gives the error:
TypeError: 'roses' has type <type 'str'>, but expected one of: (<type 'int'>, <type 'long'>) [while running 'Embed and make TFExample']
Hence I thought that I should change something here in preprocess.py in order to allow strings:
example = tf.train.Example(features=tf.train.Features(feature={
    'image_uri': _bytes_feature([uri]),
    'embedding': _float_feature(embedding.ravel().tolist()),
}))
if label_ids:
    label_ids.sort()
    example.features.feature['label'].int64_list.value.extend(label_ids)
But I don't know how to change it appropriately, as I could not find something like str_list. Could anyone please help me out here?
Online prediction certainly allows this; the model itself needs to be updated to do the conversion from int to string.
Keep in mind that the Python code is just building a graph which describes what computation to do in your model -- you're not sending the Python code to online prediction, you're sending the graph you build.
That distinction is important because the changes you have made are in Python -- you don't yet have any inputs or predictions, so you won't be able to inspect their values. What you need to do instead is add the equivalent lookups to the graph that you're exporting.
You could modify the code like so:
labels = tf.constant(['cars', 'trucks', 'suvs'])
predicted_indices = tf.argmax(softmax, 1)
prediction = tf.gather(labels, predicted_indices)
And leave the inputs/outputs untouched from the original code
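To show what the argmax and gather steps compute, here is a plain-Python sketch of the same lookup. The scores and the local argmax helper are made up for illustration. It only mirrors the semantics; in the exported model the lookup still has to be expressed with graph ops such as tf.argmax and tf.gather, as above.

```python
# Illustrative stand-ins: the label order and per-example scores are made up.
labels = ["cars", "trucks", "suvs"]
softmax = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]


def argmax(row):
    """Index of the largest value in a row (the first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


predicted_indices = [argmax(row) for row in softmax]   # like tf.argmax(softmax, 1)
prediction = [labels[i] for i in predicted_indices]    # like tf.gather(labels, ...)

print(predicted_indices)
print(prediction)
```

The order of the labels constant has to match the order of the class ids used during training, otherwise the returned names will be wrong.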