Convert a dataset of strings to arrays of byte values - tensorflow2.0

I'm trying to use wiki40b/en to train a neural network. I want to preprocess the data and also generate the expected output from it:
Remove tags
Convert string to an array of the values of its bytes
Add noise to the array (zero or modify some values)
Currently I have the following code:
def add_noise_to_sentence(sentence, noise=0.15):
    # How can I get the length of the sentence?
    mask = np.random.random_sample((len(sentence),)) > noise
    with_noise = sentence * mask
    return with_noise, mask

def preprocess_lang(text):
    text = tf.strings.regex_replace(text, "_START_ARTICLE_ | _START_PARAGRAPH_ | \n | <br> | <p> | </p> | <html> | </html> | <body> | </body>", " ")
    text = tf.io.decode_raw(text, tf.uint8)  # I thought it was going to cast to uint8[]
    # text = tf.strings.bytes_split(text)  # This didn't work either
    noise_text = add_noise_to_sentence(text)
    return noise_text, text
ds = tfds.load('wiki40b', split=["train"], shuffle_files=True)
ds = ds.map(lambda x: preprocess_lang(x["text"]))
ds = ds.cache()
ds = ds.batch(128)
ds = ds.prefetch(tf.data.AUTOTUNE)
How can I transform the string into an array of numbers (my model expects numbers as inputs) and apply the required transformations as well?
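For what it's worth, here is a minimal sketch of one possible direction, not a verified solution; it only uses standard TF ops and assumes the model can take a variable-length uint8 vector. tf.io.decode_raw does turn a scalar string into a 1-D uint8 tensor, and inside Dataset.map the graph-safe way to get its length is tf.shape rather than len():

import tensorflow as tf

def add_noise_to_sentence(sentence_bytes, noise=0.15):
    # sentence_bytes is a 1-D uint8 tensor; tf.shape works in graph mode where len() does not
    mask = tf.random.uniform(tf.shape(sentence_bytes)) > noise
    with_noise = sentence_bytes * tf.cast(mask, sentence_bytes.dtype)
    return with_noise, mask

def preprocess_lang(text):
    # same tag-stripping idea as above, shortened for brevity
    text = tf.strings.regex_replace(text, "_START_ARTICLE_|_START_PARAGRAPH_|\\n|<br>", " ")
    byte_values = tf.io.decode_raw(text, tf.uint8)  # 1-D tensor of byte values
    noisy, _ = add_noise_to_sentence(byte_values)
    return noisy, byte_values

Note that the resulting tensors are variable-length, so the later .batch(128) would need to become padded_batch.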

Related

Tensor to Dataframe for each sentence

For a 6-class sentence classification task, I have a list of sentences for which I retrieve the raw output values (logits) before the softmax is applied. Example list of sentences:
s = ['I like the weather today', 'The movie was very scary', 'Love is in the air']
I get the values the following way:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "Emanuel/bertweet-emotion-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    print(output.logits.detach().numpy())
    # returns [[-0.8390876 2.9480567 -0.5134539 0.70386493 -0.5019671 -2.619496 ]]
    #         [[-0.8847909 -0.9642067 -2.2108874 -0.43932158 4.3386173 -0.37383893]]
    #         [[-0.48750368 3.2949197 2.1660519 -0.6453249 -1.7101991 -2.817954 ]]
How do I create a data frame with columns sentence, class_1, class_2, class_3, class_4, class_5, class_6 to which I add values iteratively, or perhaps in a more optimal way, appending each new sentence and its logits? What would be the best way?
Expected output:
sentence class_1 class_2 class_3 ....
0 I like the weather today -0.8390876 2.9480567 -0.5134539 ....
1 The movie was very scary -0.8847909 -0.9642067 -2.2108874 ....
2 Love is in the air -0.48750368 3.2949197 2.1660519 ....
...
If I only had one sentence, I could transform it into a data frame like this, but I would still need to append the sentence somehow:
sentence = tokenizer("Love is in the air", return_tensors="pt")
output = model(sentence["input_ids"])
px = pd.DataFrame(output.logits.detach().numpy())
Maybe creating two separate data frames and then appending them would be one plausible way of doing this?
Save the model outputs in a list, then create the dataframe from a dict:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
import pandas as pd

model_name = "Emanuel/bertweet-emotion-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

outputs = []
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    outputs.append(output.logits.detach().numpy()[0])

# convert to one numpy array
outputs = np.array(outputs)

# create dataframe
obj = {"sentence": s}
for class_id in range(outputs.shape[1]):
    # get the data column for that class
    obj[f"class_{class_id}"] = outputs[:, class_id].tolist()
df = pd.DataFrame(obj)
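A possibly more compact variant, as a sketch assuming the same s, model and tokenizer as above: tokenize the whole list at once with padding and build the frame in one shot.

# batched version: tokenize all sentences at once instead of looping
inputs = tokenizer(s, padding=True, return_tensors="pt")
logits = model(**inputs).logits.detach().numpy()
df = pd.DataFrame(logits, columns=[f"class_{i}" for i in range(logits.shape[1])])
df.insert(0, "sentence", s)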
I managed to come up with a solution and I am posting it in case someone finds it useful.
The idea is to initialize a data frame and append the logits for every sentence while iterating:
absolute_vals = pd.DataFrame()
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    px = pd.DataFrame(output.logits.detach().numpy())
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    absolute_vals = pd.concat([absolute_vals, px], ignore_index=True)
absolute_vals
Returns:
sentence class_1 class_2 class_3 ....
0 I like the weather today -0.8390876 2.9480567 -0.5134539 ....
1 The movie was very scary -0.8847909 -0.9642067 -2.2108874 ....
2 Love is in the air -0.48750368 3.2949197 2.1660519 ....
...

Combine two lists into a PCollection

I'm using Apache Beam. When writing to TFRecord I need to include the ID of each item along with its text and embedding.
The tutorial works with just one list of texts, but I also have a list of IDs matching the list of texts, so I was wondering how I could pass the IDs to the following function:
def to_tf_example(entries):
    examples = []
    text_list, embedding_list = entries
    for i in range(len(text_list)):
        text = text_list[i]
        embedding = embedding_list[i]
        features = {
            # need to pass in ID here like so:
            'id': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[ids.encode('utf-8')])),
            'text': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
            'embedding': tf.train.Feature(
                float_list=tf.train.FloatList(value=embedding.tolist()))
        }
        example = tf.train.Example(
            features=tf.train.Features(
                feature=features)).SerializeToString(deterministic=True)
        examples.append(example)
    return examples
My first thought was just to include the IDs in the text column of my database and then extract them via slicing or a regex, but I was wondering whether there is a better way. I assume it involves converting to a PCollection, but I don't know where to start. Here is the pipeline:
with beam.Pipeline(args.runner, options=options) as pipeline:
    query_data = pipeline | 'Read data from BigQuery' >> beam.io.Read(
        beam.io.BigQuerySource(project='my-project', query=get_data(args.limit), use_standard_sql=True))
    # list of texts
    text = query_data | 'get list of text' >> beam.Map(lambda x: x['text'])
    # list of ids
    ids = query_data | 'get list of ids' >> beam.Map(lambda x: x['id'])
    ( text
      | 'Batch elements' >> util.BatchElements(
            min_batch_size=args.batch_size, max_batch_size=args.batch_size)
      | 'Generate embeddings' >> beam.Map(
            generate_embeddings, args.module_url, args.random_projection_matrix)
      | 'Encode to tf example' >> beam.FlatMap(to_tf_example)
      | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
            file_path_prefix='{0}'.format(args.output_dir),
            file_name_suffix='.tfrecords')
    )
    query_data | 'Convert to entity and write to datastore' >> beam.Map(
        lambda input_features: create_entity(input_features, args.kind))
I altered generate_embeddings to return List[int], List[string], List[List[float]] and then used the following function to pass the list of ids and text in:
def generate_embeddings_for_batch(batch, module_url, random_projection_matrix):
    embeddings = generate_embeddings([x['id'] for x in batch], [x['text'] for x in batch], module_url, random_projection_matrix)
    return embeddings
Here I'll assume generate_embeddings has the signature List[str], ... -> (List[str], List[List[float]])
What you want to do is avoid splitting your texts and ids into separate PCollections. So you might want to write something like
from typing import Iterable, List, Tuple

def generate_embeddings_for_batch(
        batch,
        module_url,
        random_projection_matrix) -> Iterable[Tuple[int, str, List[float]]]:
    embeddings = generate_embeddings(
        [x['text'] for x in batch], module_url, random_projection_matrix)
    # map each text back to its embedding
    text_to_embedding = dict(zip(*embeddings))
    for x in batch:
        yield x['id'], x['text'], text_to_embedding[x['text']]
From there you should be able to write to_tf_example.
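For completeness, here is a hedged sketch of what to_tf_example could look like once each element is an (id, text, embedding) tuple; it mirrors the features of the original function, and the str() around the id is an assumption about its type:

import tensorflow as tf

def to_tf_example(entry):
    item_id, text, embedding = entry
    features = {
        'id': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[str(item_id).encode('utf-8')])),
        'text': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
        'embedding': tf.train.Feature(
            float_list=tf.train.FloatList(value=list(embedding)))
    }
    return tf.train.Example(
        features=tf.train.Features(feature=features)).SerializeToString(deterministic=True)

Since this version returns one serialized example per element, it would be applied with beam.Map rather than beam.FlatMap in the pipeline.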
It would probably make sense to look at using TFX.

Numpy: Construct Slice A La Carte

Suppose I have the following:
# in pseudo code
# function input 1
chord = [0,1,17,35,47,0]
dims = [0,1,2,4,5,6]
x_axis = 3
t_axis = 7
# what I'd like to return
np.squeeze(arr[0,1,17,:,35,47,0,:])
# function input 2
chord = [0,3,4,5,6,7]
dims = [0,2,3,4,5,6]
x_axis = 1
t_axis = 7
# desired return
np.squeeze(arr[0,:,3,4,5,6,7,:])
How do I construct these numpy slices given input where I can arbitrarily specify a pair of axes and a chord coordinate?
I implemented a reflection-based solution:
def reflection_window(arr: np.ndarray, chord: list, dim0, dim1):
    var = "arr"
    bra = "["
    ket = "]"
    coord = [str(int(i)) for i in chord]
    coord.insert(dim0, ':')
    coord.insert(dim1, ':')
    chordstr = ','.join(coord)
    slicer = var + bra + chordstr + ket
    return eval(slicer)
Staying native to numpy is probably better, but since Python is a scripting language, it can make sense to treat it that way when necessary.
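For reference, here is a minimal sketch of a native-numpy alternative (the function name and argument order are mine, not from the question): build an index tuple with slice(None) at the two free axes instead of building a string for eval.

import numpy as np

def chord_window(arr, chord, dims, x_axis, t_axis):
    # start with a full slice on every axis, then pin the chord coordinates
    index = [slice(None)] * arr.ndim
    for d, c in zip(dims, chord):
        index[d] = c
    # x_axis and t_axis keep slice(None), matching the ':' positions in the question
    return np.squeeze(arr[tuple(index)])

With the first example's inputs (chord=[0,1,17,35,47,0], dims=[0,1,2,4,5,6]) this reproduces np.squeeze(arr[0,1,17,:,35,47,0,:]).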

How to read data of a multiple-input model using tf.data.TextLineDataset?

Model
I've created a model with multiple inputs, each of which can be an embedding index or continuous numbers. For example, there are three inputs named input1, input2 and input3, which are a fixed-length embedding index, a variable-length embedding index, and continuous numbers, respectively.
Data
The data file is organized as follows:
input1 input2 input3 label
1 1,2 0.51,0.62 2
All inputs are separated by a tab (\t).
Within the variable-length embedding index and the continuous-numbers inputs, values are separated by a comma (,).
Load Data
Now I want to load the training data from the data files, and I use tf.data.TextLineDataset for that purpose. But how can I convert the values of input2 and input3 into array tensors for training and eval? I've tried the map function of Dataset.
Code snippet
dataset = tf.data.TextLineDataset('file.tsv')
dataset = dataset.map(labeler)

def labeler(record):
    fields = tf.decode_csv(record, record_defaults=['0', '0', '0', 0], field_delim='\t')
    label = fields[-1]
    del fields[-1]
    data = dict()
    data['input1'] = tf.cast(fields[0], dtype=int64)
    # How to do with input2 and input3??
    data['input2'] = ??
    data['input3'] = ??
    return data, label
I'll answer this question myself. Here is the code of the labeler function:
def labeler(record):
    fields = tf.io.decode_csv(record,
                              record_defaults=['0'] * 4,
                              field_delim='\t',
                              select_cols=list(range(0, 4)))
    data = dict()
    data['input1'] = tf.strings.to_number(fields[0], out_type='int64')
    data['input2'] = tf.strings.to_number(tf.strings.split([fields[1]],
                                                           sep=',').values,
                                          out_type='int64')
    data['input3'] = tf.strings.to_number(tf.strings.split([fields[2]],
                                                           sep=',').values,
                                          out_type='float64')
    label = tf.strings.to_number(fields[-1], out_type='int64')
    return data, label
Notice:
If you try to batch the dataset above with the batch function, it will fail, because the dataset has a variable-length input field.
The way to solve this is to use the padded_batch function of the dataset. And since you have multiple inputs, you should set the shape for each input using a tuple that is passed to padded_batch. Here is the code:
shapes = ({'input1': [], 'input2': [None], 'input3': []}, [])
dataset = dataset.map(lambda ex: labeler(ex))
dataset = dataset.shuffle(1000).repeat(2).padded_batch(batch_size,
                                                       padded_shapes=shapes)
[] means no padding; [None] means pad to the longest record in that batch with 0.
Although this works, I don't know whether padding everything with zeros affects training quality. If you have any ideas, I'd be glad to hear them.
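One hedged suggestion regarding the padding concern (not verified for this exact model): if input2 feeds a Keras Embedding layer, mask_zero=True makes downstream layers ignore the padded positions; note that index 0 is then reserved for padding, so real ids must start at 1. The vocab_size and embed_dim names below are placeholders, not values from the question.

import tensorflow as tf

# placeholder sizes, not values from the question
vocab_size = 10000
embed_dim = 32

input2 = tf.keras.Input(shape=(None,), dtype='int64', name='input2')
embedded2 = tf.keras.layers.Embedding(input_dim=vocab_size,
                                      output_dim=embed_dim,
                                      mask_zero=True)(input2)  # padded zeros are masked out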

How to build the input pipeline for a Siamese Network in Tensorflow?

Currently, I am trying to implement the experiment in the paper Siamese Neural Networks for One-shot Image Recognition using TensorFlow.
The image set is Omniglot, in which each image can be loaded as a [105,105,1] array.
Since the input of a Siamese network is a pair of images of the same or different class, I need to preprocess the dataset as follows.
I transform the Omniglot dataset into an [n,20,105,105,1] numpy array, where n is the number of classes and each class has 20 example images of size [105,105,1].
Then I implement a function to return one pair of images:
from random import randint, sample
import numpy as np

def get_example(dataset):
    """
    get one pair of images
    :param dataset: the set, eg. training set
    :return: when label is 1, return a concatenated array of two imgs from the same character
             when label is 0, return a concatenated array of two imgs from different characters
    """
    # randint(0, x) generates 1 random number from 0 ~ x
    set_upper = len(dataset)
    set_lower = 0
    # sample(range(0, 20), 2) generates 2 random numbers from 0 ~ 19
    char_upper = 20
    char_lower = 0

    label = randint(0, 1)
    if label:
        # randomly select one character from the set
        char = randint(set_lower, set_upper - 1)
        rand_char = dataset[char]
        # randomly select two different images from the character
        a = b = 0
        while a == b:
            a, b = sample(range(char_lower, char_upper), 2)
        img_a = rand_char[a]
        img_b = rand_char[b]
    else:
        # randomly select two characters from the set
        c1, c2 = sample(range(set_lower, set_upper), 2)
        rand_char1 = dataset[c1]
        rand_char2 = dataset[c2]
        # randomly select two images from the two characters
        a, b = sample(range(char_lower, char_upper), 2)
        img_a = rand_char1[a]
        img_b = rand_char2[b]

    img_input = np.concatenate((img_a, img_b), axis=0)
    img_input = img_input[..., np.newaxis]
    return img_input, label
So here is my question: how do I group the images into batches, and how do I feed them into the model in TensorFlow?
You should be able to create a dataset as described in https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays and use standard tf.data.Dataset operations like shuffle and batch to achieve your goal.
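As a concrete (hedged) sketch of that suggestion, get_example can be wrapped in a generator and handed to tf.data.Dataset.from_generator. The name train_set, the number of pairs, and the output shape and dtype below are assumptions; adjust them to whatever get_example actually returns.

import tensorflow as tf

def pair_generator(dataset, num_pairs):
    # yield num_pairs random image pairs using the get_example function above
    for _ in range(num_pairs):
        yield get_example(dataset)

train_ds = tf.data.Dataset.from_generator(
    lambda: pair_generator(train_set, 10000),
    output_signature=(
        tf.TensorSpec(shape=(210, 105, 1), dtype=tf.float32),  # adjust to get_example's output
        tf.TensorSpec(shape=(), dtype=tf.int32)))
train_ds = train_ds.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)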