Bert tokenizer won't work with tensor format (tensorflow)

This may be a silly question, but I'm new to TensorFlow. I have the following code, but the tokenizer won't use the strings inside the tensor.
import tensorflow as tf
docs = tf.data.Dataset.from_tensor_slices([['hagamos que esto funcione.'], ["por fin funciona!"]])
from transformers import AutoTokenizer, DataCollatorWithPadding
import numpy as np
checkpoint = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize(review):
    return tokenizer(review)
tokens = docs.map(tokenize)
I get the following output:
ValueError: in user code:
File "<ipython-input-54-3272cedfdcab>", line 13, in tokenize *
return tokenizer(review)
File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py", line 2429, in __call__ *
raise ValueError(
ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
My expected output is something like this:
tokenizer('esto al fin funciona!')
{'input_ids': [4, 1202, 1074, 1346, 4971, 1109, 5], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
Any idea how to make it work?

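As the error says, the tokenizer expects its input as a str, a List[str] or a List[List[str]]. Inside Dataset.map the function is traced in graph mode, so review arrives as a symbolic tf.Tensor rather than a Python string. Pass the plain Python list of strings to the tokenizer directly instead.
Please check the working code below.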
import tensorflow as tf
docs = ['hagamos que esto funcione.', "por fin funciona!"]
from transformers import AutoTokenizer, DataCollatorWithPadding
checkpoint = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize(review):
    return tokenizer(review)
tokens = tokenizer(docs)
The output of the above code is:
{'input_ids': [[4, 8700, 1041, 1202, 13460, 1008, 5], [4, 1076, 1346, 4971, 1109, 5]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
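If the goal is still a tf.data.Dataset, the two id lists above have different lengths (7 and 6) and would first need padding to a common length. The sketch below does that by hand with NumPy on the input_ids printed above. The helper name pad_batch and its pad_id=0 default are assumptions made for illustration; the real padding id would come from tokenizer.pad_token_id, and in practice passing padding=True to the tokenizer is probably the simpler route.

```python
import numpy as np

# input_ids copied from the tokenizer output above (two sentences, lengths 7 and 6).
input_ids = [[4, 8700, 1041, 1202, 13460, 1008, 5],
             [4, 1076, 1346, 4971, 1109, 5]]


def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length id lists into a rectangular array.

    pad_id=0 is a placeholder assumption; use the tokenizer's own padding id.
    Returns (padded_ids, attention_mask), both of shape (batch, max_len).
    """
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_id, dtype=np.int32)
    mask = np.zeros((len(sequences), max_len), dtype=np.int32)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq   # copy the real token ids
        mask[row, :len(seq)] = 1       # 1 marks real tokens, 0 marks padding
    return padded, mask


padded_ids, attention_mask = pad_batch(input_ids)
print(padded_ids.shape)  # (2, 7)
```

Rectangular arrays like these can then be handed to tf.data.Dataset.from_tensor_slices, for example as a dict with the keys 'input_ids' and 'attention_mask'.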

Related

How to create a Keras layer from tf.math.segment_sum

I would like to use the tf.math.segment_sum function in a Keras layer but I don't get the dimensions right.
As an example, I would like to sum the values of x_1 grouped by id in the dataframe df:
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3, 4, 4],
'x_1': [1, 0, 0, 0, 0, 1, 1, 1],
'target': [1, 1, 0, 0, 1, 1, 2, 2]})
The 'model' I created looks as follows:
input_ = tf.keras.Input((1,), name='X')
cid = tf.keras.Input(shape=(1,), dtype='int64', name='id')
summed = tf.keras.layers.Lambda(lambda x: tf.math.segment_sum(x[0], x[1]), name='segment_sum')([input_, cid])
model = tf.keras.Model(inputs=[input_, cid], outputs=[summed])
I get an error about the rank:
ValueError: Shape must be rank 1 but is rank 2 for 'segment_sum/SegmentSum' (op: 'SegmentSum') with input shapes: [?,1], [?,1].
What am I doing wrong here?

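I solved it using tf.gather. tf.math.segment_sum requires rank-1 segment ids, so the ids are flattened with tf.reshape first; tf.gather then maps each segment's sum back onto the rows that belong to it. The working code is as follows: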
input_ = tf.keras.Input((1,), name='X')
cid = tf.keras.Input(shape=(1,), dtype='int64', name='id')
summed = tf.keras.layers.Lambda(lambda x: tf.gather(tf.math.segment_sum(x[0], tf.reshape(x[1], (-1,))), x[1]), output_shape=(None,1), name='segment_sum')([input_, cid])
model = tf.keras.Model(inputs=[input_, cid], outputs=[summed])
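To see what the segment-sum-then-gather combination computes, here is a small NumPy stand-in using the id and x_1 columns from the question. It is only an illustration of the arithmetic, not the Keras layer itself: np.bincount with weights plays the role of tf.math.segment_sum, and integer indexing plays the role of tf.gather. Like tf.math.segment_sum, it assumes the ids are already sorted integers.

```python
import numpy as np

ids = np.array([1, 1, 2, 2, 3, 3, 4, 4])   # 'id' column from df
x_1 = np.array([1, 0, 0, 0, 0, 1, 1, 1])   # 'x_1' column from df

# Per-segment totals, indexed by id. Index 0 is an empty segment here
# because the ids start at 1. Stand-in for tf.math.segment_sum.
segment_totals = np.bincount(ids, weights=x_1)

# Indexing with the ids broadcasts each group's total back to its rows.
# Stand-in for tf.gather.
per_row_totals = segment_totals[ids]

print(segment_totals)
print(per_row_totals)
```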

How to do predictions with tf.dataset

I have an issue with tf.data datasets and Keras predict(). I don't know why the output array of predict() is longer than the original data. Here is a sketch:
Previously I used arrays: if I applied predict() to an array of length x, I got an output of length x. This is my expected behaviour.
I have a CSV of test data with a certain length (10000). Now I use
LABEL_COLUMN = 'label'
LABELS = [0, 1]
def get_dataset(file_path, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=1,  # Artificially small to make examples easier to show.
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset
to convert this to a tf.dataset.
val='data/test.csv'
val_data= get_dataset(val)
Now using
scores=bert_model.predict(val_data)
gives an output array that is much larger than the original CSV file (10000)...
I am really lost. I also wonder how Keras knows which "keys" of the dataset to use for predictions.
The structure of the first element of the dataset ("val[0]") looks like this:
({'input_ids': <tf.Tensor: shape=(15,), dtype=int32, numpy=
array([ 3, 2019, 479, 1169, 4013, 26918, 259, 4, 14576,
3984, 889, 648, 1610, 26918, 4])>, 'token_type_ids': <tf.Tensor: shape=(15,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1])>, 'attention_mask': <tf.Tensor: shape=(15,), dtype=int32, numpy=array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])>}, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
Why does my label column have no key named "label"? The first 3 keys all have their names, and the model was trained with these 3 columns.
I use the above structure, including the label column, as input for predict...
Any ideas? Is it due to the function that creates the dataset from a CSV?

Map RGB Semantic Maps to One Hot Encodings and vice versa in TensorFlow

The image below is a sample semantic map from the Cityscapes Dataset. It's provided in the form of an RGB image where each specific colour represents a class.
In some deep learning tasks, we would like to map this into a one hot encoding. For example, if it has 20 classes, then this image would be mapped from H x W x 3 to H x W x 20.
How do we do this in TensorFlow?

My solution is below. Looking forward to suggestions on how to make this more efficient or perhaps an answer that's more efficient.
import tensorflow as tf
import numpy as np
import scipy.misc
img = scipy.misc.imread('aachen_000000_000019_gtFine_color.png', mode = 'RGB')
palette = np.array(
[[128, 64, 128],
[244, 35, 232],
[ 70, 70, 70],
[102, 102, 156],
[190, 153, 153],
[153, 153, 153],
[250, 170, 30],
[220, 220, 0],
[107, 142, 35],
[152, 251, 152],
[ 70, 130, 180],
[220, 20, 60],
[255, 0, 0],
[ 0, 0, 142],
[ 0, 0, 70],
[ 0, 60, 100],
[ 0, 80, 100],
[ 0, 0, 230],
[119, 11, 32],
[ 0, 0, 0],
[255, 255, 255]], np.uint8)
semantic_map = []
for colour in palette:
    # True where all three channels of a pixel match this palette colour.
    class_map = tf.reduce_all(tf.equal(img, colour), axis=-1)
    semantic_map.append(class_map)
semantic_map = tf.stack(semantic_map, axis=-1)
# NOTE cast to tf.float32 because most neural networks operate in float32.
semantic_map = tf.cast(semantic_map, tf.float32)
magic_number = tf.reduce_sum(semantic_map)
print(semantic_map.shape)
palette = tf.constant(palette, dtype=tf.uint8)
class_indexes = tf.argmax(semantic_map, axis=-1)
# NOTE this operation flattens class_indexes
class_indexes = tf.reshape(class_indexes, [-1])
color_image = tf.gather(palette, class_indexes)
color_image = tf.reshape(color_image, [1024, 2048, 3])
sess = tf.Session()
# NOTE magic_number checks that there are only 1024*2048 1s in the entire
# 1024*2048*21 tensor.
magic_number_val = sess.run(magic_number)
assert magic_number_val == 1024*2048
color_image_val = sess.run(color_image)
scipy.misc.imsave('test.png', color_image_val)
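On the efficiency question, one option is to replace the Python loop over the palette with a single broadcasted comparison. The sketch below shows the idea in NumPy on a tiny made-up 2x2 image with a 3-colour palette (both hypothetical, chosen only to keep the example small); the same broadcasting pattern should carry over to tf.equal and tf.reduce_all.

```python
import numpy as np

# Hypothetical 3-colour palette and 2x2 image, for illustration only.
palette = np.array([[128, 64, 128],
                    [244, 35, 232],
                    [70, 70, 70]], dtype=np.uint8)
img = np.array([[[128, 64, 128], [70, 70, 70]],
                [[244, 35, 232], [128, 64, 128]]], dtype=np.uint8)

# Compare every pixel with every palette colour in one broadcasted step:
# (H, W, 1, 3) == (C, 3) -> (H, W, C, 3); a class matches when all 3 channels agree.
one_hot = (img[:, :, None, :] == palette).all(axis=-1).astype(np.float32)

# Inverse mapping: class index per pixel, then look the colour up in the palette.
class_indexes = one_hot.argmax(axis=-1)
restored = palette[class_indexes]

print(one_hot.shape)  # (2, 2, 3)
```

Indexing the palette with the class indexes also avoids the flatten-and-reshape round trip, since the lookup keeps the H x W layout.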

Tensorflow confusion matrix using one-hot code

I am doing multi-class classification with an RNN; here is my main RNN code:
def RNN(x, weights, biases):
    x = tf.unstack(x, input_size, 1)
    lstm_cell = rnn.BasicLSTMCell(num_unit, forget_bias=1.0, state_is_tuple=True)
    stacked_lstm = rnn.MultiRNNCell([lstm_cell]*lstm_size, state_is_tuple=True)
    outputs, states = tf.nn.static_rnn(stacked_lstm, x, dtype=tf.float32)
    return tf.matmul(outputs[-1], weights) + biases
logits = RNN(X, weights, biases)
prediction = tf.nn.softmax(logits)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(cost)
correct_pred = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
I have to classify all inputs into 6 classes, and each class is represented by a one-hot label as follows:
happy = [1, 0, 0, 0, 0, 0]
angry = [0, 1, 0, 0, 0, 0]
neutral = [0, 0, 1, 0, 0, 0]
excited = [0, 0, 0, 1, 0, 0]
embarrassed = [0, 0, 0, 0, 1, 0]
sad = [0, 0, 0, 0, 0, 1]
The problem is that I cannot print a confusion matrix using the tf.confusion_matrix() function.
Is there any way to print a confusion matrix using those labels?
If not, how can I convert the one-hot codes to integer indices only when I need to print the confusion matrix?

You cannot generate a confusion matrix by passing one-hot vectors as the labels and predictions arguments. You have to supply 1-D tensors containing the class indices directly. (The function is tf.confusion_matrix in TensorFlow 1.x and tf.math.confusion_matrix in 2.x.)
To convert one-hot vectors to plain class labels, use the argmax function:
label = tf.argmax(one_hot_tensor, axis = 1)
After that you can print your confusion_matrix like this:
import tensorflow as tf
num_classes = 2
prediction_arr = tf.constant([1, 1, 1, 1, 0, 0, 0, 0, 1, 1])
labels_arr = tf.constant([0, 1, 1, 1, 1, 1, 1, 1, 0, 0])
confusion_matrix = tf.confusion_matrix(labels_arr, prediction_arr, num_classes)
with tf.Session() as sess:
    print(confusion_matrix.eval())
Output:
[[0 3]
 [4 3]]
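For the six-class one-hot labels in the question, the whole path from one-hot to confusion matrix can be sketched in NumPy as below. The four samples and their scores are made up for illustration; the layout follows the output above, with rows as true classes and columns as predicted classes.

```python
import numpy as np

num_classes = 6
# Hypothetical one-hot labels and softmax-like scores for four samples.
one_hot_labels = np.eye(num_classes, dtype=np.int64)[[0, 1, 5, 5]]
scores = np.eye(num_classes)[[0, 2, 5, 1]]

true_idx = one_hot_labels.argmax(axis=1)   # one-hot -> integer class ids
pred_idx = scores.argmax(axis=1)           # highest score -> predicted class id

# Rows are true classes, columns are predicted classes.
confusion = np.zeros((num_classes, num_classes), dtype=np.int64)
np.add.at(confusion, (true_idx, pred_idx), 1)   # count each (true, predicted) pair
print(confusion)
```

The argmax is only applied at evaluation time, so training can keep using the one-hot labels unchanged.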

Keras Array Input Error

I get the following error:
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 6 arrays but instead got the following list of 3 arrays: [array([[ 0, 0, 0, ..., 18, 12, 1],
[ 0, 0, 0, ..., 18, 11, 1],
[ 0, 0, 0, ..., 18, 9, 1],
...,
[ 0, 0, 0, ..., 18, 15, 1],
[ 0, 0, 0, ..., 18, 9, ...
in my Keras model.
Is the model misinterpreting something?
This happens when I feed input to my model. The same input works perfectly well in another program.

It's impossible to diagnose your exact problem without more information.
I usually specify the input_shape parameter of the first layer based on the per-sample shape of my training data X (everything after the batch dimension).
e.g.
model = Sequential()
model.add(Dense(32, input_shape=X.shape[1:]))
I think you'll want X to look something like this:
[
[[ 0, 0, 0, ..., 18, 11, 1]],
[[ 0, 0, 0, ..., 18, 9, 1]],
....
]
So you could try reshaping it with the following line:
X = np.array([[sample] for sample in X])
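To make the effect of that reshaping line concrete, the sketch below applies it to a small made-up array (4 samples, 5 features; the data is hypothetical). It also shows an equivalent indexing form that avoids the Python loop.

```python
import numpy as np

# Hypothetical data: 4 samples with 5 features each.
X = np.arange(20).reshape(4, 5)

wrapped = np.array([[sample] for sample in X])   # the reshaping line from the answer
expanded = X[:, np.newaxis, :]                   # same result without a Python loop

print(X.shape, wrapped.shape)
```

Both forms insert a length-1 axis after the batch dimension, turning a (4, 5) array into a (4, 1, 5) array.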

The problem really comes from giving the wrong input to the network.
In my case the problem was that my custom image generator was passing the entire dataset as input rather than one batch of image-label pairs at a time. I had assumed that Keras's generator.flow(x, y, batch_size) already had a yield structure inside. However, the correct generator structure is as follows (with a separate yield):
def generator(batch_size):
    (images, labels) = utils.get_data(1000)  # gets 1000 samples from dataset
    labels = to_categorical(labels, 2)
    # Named datagen so it does not shadow the enclosing function's name.
    datagen = ImageDataGenerator(featurewise_center=True,
                                 featurewise_std_normalization=True,
                                 rotation_range=90.,
                                 width_shift_range=0.1,
                                 height_shift_range=0.1,
                                 zoom_range=0.2)
    datagen.fit(images)
    # Use the batch_size argument instead of a hard-coded value.
    gen = datagen.flow(images, labels, batch_size=batch_size)
    while True:
        x_batch, y_batch = next(gen)
        yield (x_batch, y_batch)
I realize the question is old but it might save some time for someone to find the issue.
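The yield-one-batch-at-a-time structure described above can be seen in isolation in the plain NumPy sketch below, with no augmentation. The function name batch_generator and the stand-in data are hypothetical and only illustrate the pattern of an endless generator that returns one (inputs, targets) pair per step.

```python
import numpy as np


def batch_generator(images, labels, batch_size):
    """Yield (x_batch, y_batch) tuples forever, one batch per call to next()."""
    n = len(images)
    while True:
        order = np.random.permutation(n)          # reshuffle on every pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield images[idx], labels[idx]


# Hypothetical stand-in data: 10 "images" of shape 4x4x3 with binary labels.
images = np.zeros((10, 4, 4, 3), dtype=np.float32)
labels = np.arange(10) % 2

gen = batch_generator(images, labels, batch_size=4)
x_batch, y_batch = next(gen)
print(x_batch.shape, y_batch.shape)
```

Each call to next() returns a single batch, so the model never receives the whole dataset at once; the last batch of a pass may be smaller than batch_size.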
