how to convert a MapDataset to tensor - tensorflow

i'm using tensorflow_io_bigquery_client in order to read data from bq.
each record has 4 integer elements.
but i need to transform it into tensor (4 integer each).
so i have something like:
tensorflow_io_bigquery_client = BigQueryClient()
read_session = tensorflow_io_bigquery_client.read_session(..)
dataset = read_session.parallel_read_rows()
dataset = dataset.map (transofrom_row)
#todo convert it into tensor (but i guess it should be an iterator or something like this because the dataset is hugh?
k_means_estimator = tf.compat.v1.estimator.experimental.KMeans(num_clusters = num_clusters, use_mini_batch = False, relative_tolerance = 1)
fit = k_means_estimator.train(input_fn=lambda: to_tensor(dataset), steps=1000)
# todo convert dataset into tensor

Related

How to index values in Tensorflow based on a Tensor?

I have three tensors given below in tensorflow v1:
mod_labels = [0,1,1,0,0]
feats = [[1,2,1], [3,2,6], [1,1,1], [9,8,4], [5,4,8]]
labels = [1,53,12,89,54]
I want to create four new tensors based on values from mod_labels as:
# Make new tensors for mod_labels=0
mod0_feats = [[1,2,1], [9,8,4], [5,4,8]]
mod0_labels = [1,89,54]
# Similarly make tensors for mod_labels=1
mod1_feats = [[3,2,6], [1,1,1]]
mod1_labels = [53,12]
I have tried using for loop to iterate over mod_labels but tensorflow does not allow to iterate over placeholders.
well, supposing those are tensors:
mod_labels = tf.convert_to_tensor([0,1,1,0,0])
feats = tf.convert_to_tensor([[1,2,1], [3,2,6], [1,1,1], [9,8,4], [5,4,8]])
labels = tf.convert_to_tensor([1,53,12,89,54])
zeros = tf.where(mod_labels == 0)
ones = tf.where(mod_labels == 1)
mod0_feats = tf.gather_nd(feats, zeros)
mod0_labels = tf.gather_nd(labels, zeros)
mod1_feats = tf.gather_nd(feats, ones)
mod1_labels = tf.gather_nd(labels, ones)

How can i extract the encoded part of multi-modal autoencoder and convert the .h5 model to a numpy array?

I am making a deep multimodal autoencoder model which takes two inputs and produces a two outputs (which are the reconstructed inputs). The two inputs are with shape of (1000, 50) and (1000,60) respectively and the model has 3 hidden layers and aim to concatenate the two latent layer of input1 and input2.
I would like to extract the encoded part of my model and save the data as a numpy array.
here is the complete code of the model :
input_X = Input(shape=(X[0].shape))
dense_X = Dense(40,activation='relu')(input_X)
dense1_X = Dense(20,activation='relu')(dense_X)
latent_X= Dense(2,activation='relu')(dense1_X)
input_X1 = Input(shape=(X1[0].shape))
dense_X1 = Dense(40,activation='relu')(input_X1)
dense1_X1 = Dense(20,activation='relu')(dense_X1)
latent_X1= Dense(2,activation='relu')(dense1_X1)
Concat_X_X1 = concatenate([latent_X, latent_X1])
decoding_X = Dense(20,activation='relu')(Concat_X_X1)
decoding1_X = Dense(40,activation='relu')(decoding_X)
output_X = Dense(X[0].shape[0],activation='sigmoid')(decoding1_X)
decoding_X1 = Dense(20,activation='relu')(Concat_X_X1)
decoding1_X1 = Dense(40,activation='relu')(decoding_X1)
output_X1 = Dense(X1[0].shape[0],activation='sigmoid')(decoding1_X1)
multi_modal_autoencoder = Model([input_X, input_X1], [output_X, output_X1], name='multi_modal_autoencoder')
encoder = Model([input_X, input_X1], Concat_X_X1)
encoder.save('encoder.h5')
multi_modal_autoencoder.compile(optimizer=keras.optimizers.Adam(lr=0.001),loss='mse')
model = multi_modal_autoencoder.fit([X,X1], [X, X1], epochs=70, batch_size=150)
With h5py package you can get into your .h5 file and extract exactly what you want:
f = h5py.File('encoder.h5', 'r')
keys = list(f.keys())
values = f.get('some_key')
You can hierarchically use .get many times to go deeper into your .h5 file to extract what you need.

How to read data of multiple input model using tf.data.TextlineDataset?

Model
I've created a model with multiple inputs which can be embedding index or continuous numbers. For example, there are three inputs whose name are input1, input2 and input3 specifically, and they are fixed length embedding index, variable length embedding index and continuous numbers.
Data
The format of data file is organized as follow:
input1 input2 input3 label
1 1,2 0.51,0.62 2
All inputs are separated by tab(\t).
Variable length embedding index and continuous numbers input values are separated by comma(,) .
Load Data
Now I want to load the train data from data files. And I use tf.data.TextLineDataset for that purpose. But how can I convert the value of input2 and input3 to a array tensor for training and eval? I've tried map function of Dataset.
Snipped code
dataset = tf.data.TextLineDataset('file.tsv')
dataset = dataset.map(labeler)
def labeler(record):
fields = tf.decode_csv(record, record_defaults=['0', '0', '0', 0], field_delim='\t')
label = fields[-1]
del fields[-1]
data = dict()
data['input1'] = tf.cast(fields[0], dtype=int64)
# How to do with input2 and input3??
data['input2'] = ??
data['input3'] = ??
return data, label
I'll answer this question myself, Here the code of function labeler:
def labeler(record):
fields = tf.io.decode_csv(record,
record_defaults=['0'] * 4,
field_delim='\t',
select_cols=list(range(0, 4)))
data = dict()
data['input1'] = tf.strings.to_number(fields[0], out_type='int64')
data['input2'] = tf.strings.to_number(tf.strings.split([fields[1]],
sep=',').values,
out_type='int64')
data['input3'] = tf.strings.to_number(tf.strings.split([fields[2]],
sep=',').values,
out_type='float64')
label = tf.strings.to_number(fields[-1], out_type='int64')
return data, label
Notice:
If you want to batch the dataset above using batch fuction, it will fail. Because the dataset has the variable length input field.
The method to solve this problem is to use padded_batch function of dataset. And as you have multiple input, you should set the shape for each input using tuple which will be passed to padded_batch. Here is the code:
shapes = ({'input1': [], 'input2': [None], 'input3': []}, [])
dataset = dataset.map(lambda ex: labeler(ex))
dataset = dataset.shuffle(1000).repeat(2).padded_batch(batch_size,
padded_shapes=shapes)
[] means no pad, [None] means pad to the longest record in that batch using 0.
Although this works, whether padded with all 0 affect the training effect is still unknown. If you have any idea, it's very pleasure to hear your voice.

Split a dataset created by Tensorflow dataset API in to Train and Test?

Does anyone know how to split a dataset created by the dataset API (tf.data.Dataset) in Tensorflow into Test and Train?
Assuming you have all_dataset variable of tf.data.Dataset type:
test_dataset = all_dataset.take(1000)
train_dataset = all_dataset.skip(1000)
Test dataset now has first 1000 elements and the rest goes for training.
You may use Dataset.take() and Dataset.skip():
train_size = int(0.7 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)
full_dataset = tf.data.TFRecordDataset(FLAGS.input_file)
full_dataset = full_dataset.shuffle()
train_dataset = full_dataset.take(train_size)
test_dataset = full_dataset.skip(train_size)
val_dataset = test_dataset.skip(val_size)
test_dataset = test_dataset.take(test_size)
For more generality, I gave an example using a 70/15/15 train/val/test split but if you don't need a test or a val set, just ignore the last 2 lines.
Take:
Creates a Dataset with at most count elements from this dataset.
Skip:
Creates a Dataset that skips count elements from this dataset.
You may also want to look into Dataset.shard():
Creates a Dataset that includes only 1/num_shards of this dataset.
Disclaimer I stumbled upon this question after answering this one so I thought I'd spread the love
Most of the answers here use take() and skip(), which requires knowing the size of your dataset before hand. This isn't always possible, or is difficult/intensive to ascertain.
Instead what you can do is to essentially slice the dataset up so that 1 every N records becomes a validation record.
To accomplish this, lets start with a simple dataset of 0-9:
dataset = tf.data.Dataset.range(10)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Now for our example, we're going to slice it so that we have a 3/1 train/validation split. Meaning 3 records will go to training, then 1 record to validation, then repeat.
split = 3
dataset_train = dataset.window(split, split + 1).flat_map(lambda ds: ds)
# [0, 1, 2, 4, 5, 6, 8, 9]
dataset_validation = dataset.skip(split).window(1, split + 1).flat_map(lambda ds: ds)
# [3, 7]
So the first dataset.window(split, split + 1) says to grab split number (3) of elements, then advance split + 1 elements, and repeat. That + 1 effectively skips the 1 element we're going to use in our validation dataset.
The flat_map(lambda ds: ds) is because window() returns the results in batches, which we don't want. So we flatten it back out.
Then for the validation data we first skip(split), which skips over the first split number (3) of elements that were grabbed in the first training window, so we start our iteration on the 4th element. The window(1, split + 1) then grabs 1 element, advances split + 1 (4), and repeats.
 
Note on nested datasets:
The above example works well for simple datasets, but flat_map() will generate an error if the dataset is nested. To address this, you can swap out the flat_map() with a more complicated version that can handle both simple and nested datasets:
.flat_map(lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds))
#ted's answer will cause some overlap. Try this.
train_ds_size = int(0.64 * full_ds_size)
valid_ds_size = int(0.16 * full_ds_size)
train_ds = full_ds.take(train_ds_size)
remaining = full_ds.skip(train_ds_size)
valid_ds = remaining.take(valid_ds_size)
test_ds = remaining.skip(valid_ds_size)
use code below to test.
tf.enable_eager_execution()
dataset = tf.data.Dataset.range(100)
train_size = 20
valid_size = 30
test_size = 50
train = dataset.take(train_size)
remaining = dataset.skip(train_size)
valid = remaining.take(valid_size)
test = remaining.skip(valid_size)
for i in train:
print(i)
for i in valid:
print(i)
for i in test:
print(i)
Now Tensorflow doesn't contain any tools for that.
You could use sklearn.model_selection.train_test_split to generate train/eval/test dataset, then create tf.data.Dataset respectively.
You can use shard:
dataset = dataset.shuffle() # optional
trainset = dataset.shard(2, 0)
testset = dataset.shard(2, 1)
See:
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shard
The upcoming TensorFlow 2.10.0 will have a tf.keras.utils.split_dataset function, see the rc3 release notes:
Added tf.keras.utils.split_dataset utility to split a Dataset object or a list/tuple of arrays into two Dataset objects (e.g. train/test).
In case size of the dataset is known:
from typing import Tuple
import tensorflow as tf
def split_dataset(dataset: tf.data.Dataset,
dataset_size: int,
train_ratio: float,
validation_ratio: float) -> Tuple[tf.data.Dataset, tf.data.Dataset, tf.data.Dataset]:
assert (train_ratio + validation_ratio) < 1
train_count = int(dataset_size * train_ratio)
validation_count = int(dataset_size * validation_ratio)
test_count = dataset_size - (train_count + validation_count)
dataset = dataset.shuffle(dataset_size)
train_dataset = dataset.take(train_count)
validation_dataset = dataset.skip(train_count).take(validation_count)
test_dataset = dataset.skip(validation_count + train_count).take(test_count)
return train_dataset, validation_dataset, test_dataset
Example:
size_of_ds = 1001
train_ratio = 0.6
val_ratio = 0.2
ds = tf.data.Dataset.from_tensor_slices(list(range(size_of_ds)))
train_ds, val_ds, test_ds = split_dataset(ds, size_of_ds, train_ratio, val_ratio)
A robust way to split dataset into two parts is to first deterministically map every item in the dataset into a bucket with, for example, tf.strings.to_hash_bucket_fast. Then you can split the dataset into two by filtering by the bucket. If you split your data into five buckets, you get 80-20 split assuming that the split is even.
As an example, assume that your dataset contains dictionaries with key filename. We split the data into five buckets based on this key. With this add_fold function, we add the key "fold" in the dictionaries:
def add_fold(buckets: int):
def add_(sample, label):
fold = tf.strings.to_hash_bucket(sample["filename"], num_buckets=buckets)
return {**sample, "fold": fold}, label
return add_
dataset = dataset.map(add_fold(buckets=5))
Now we can split the dataset into two disjoint datasets with Dataset.filter:
def pick_fold(fold: int):
def filter_fn(sample, _):
return tf.math.equal(sample["fold"], fold)
return filter_fn
def skip_fold(fold: int):
def filter_fn(sample, _):
return tf.math.not_equal(sample["fold"], fold)
return filter_fn
train_dataset = dataset.filter(skip_fold(0))
val_dataset = dataset.filter(pick_fold(0))
The key that you use for hashing should be one that captures the correlations in the dataset. For example, if your samples collected by the same person are correlated and you want all samples with the same collector end up in the same bucket (and the same split), you should use the collector name or ID as the hashing column.
Of course, you can skip the part with dataset.map and do the hashing and filtering in one filter function. Here's a full example:
dataset = tf.data.Dataset.from_tensor_slices([f"value-{i}" for i in range(10000)])
def to_bucket(sample):
return tf.strings.to_hash_bucket_fast(sample, 5)
def filter_train_fn(sample):
return tf.math.not_equal(to_bucket(sample), 0)
def filter_val_fn(sample):
return tf.math.logical_not(filter_train_fn(sample))
train_ds = dataset.filter(filter_train_fn)
val_ds = dataset.filter(filter_val_fn)
print(f"Length of training set: {len(list(train_ds.as_numpy_iterator()))}")
print(f"Length of validation set: {len(list(val_ds.as_numpy_iterator()))}")
This prints:
Length of training set: 7995
Length of validation set: 2005
Can't comment, but above answer has overlap and is incorrect. Set BUFFER_SIZE to DATASET_SIZE for perfect shuffle. Try different sized val/test size to verify. Answer should be:
DATASET_SIZE = tf.data.experimental.cardinality(full_dataset).numpy()
train_size = int(0.7 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)
full_dataset = full_dataset.shuffle(BUFFER_SIZE)
train_dataset = full_dataset.take(train_size)
test_dataset = full_dataset.skip(train_size)
val_dataset = test_dataset.take(val_size)
test_dataset = test_dataset.skip(val_size)

Keras .predict with word embeddings back to string

I'm following the tutorial here: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html, using a different data set. I'm trying to predict the label for a new random string.
I'm doing labelling a bit different:
encoder = LabelEncoder()
encoder.fit(labels)
encoded_Y = encoder.transform(labels)
dummy_y = np_utils.to_categorical(encoded_Y)
And then trying to predict like:
string = "I am a cat"
query = tokenizer.texts_to_sequences(string)
query = pad_sequences(query, maxlen=50)
prediction = model.predict(query)
print(prediction)
I get back an array of arrays like below (perhaps the word embeddings?). What are those and how can I translate them back to a string?
[[ 0.03039312 0.02099193 0.02320454 0.02183384 0.01965107 0.01830118
0.0170384 0.01979697 0.01764384 0.02244077 0.0162186 0.02672437
0.02190582 0.01630476 0.01388928 0.01655456 0.011678 0.02256939
0.02161663 0.01649982 0.02086013 0.0161493 0.01821378 0.01440909
0.01879989 0.01217389 0.02032642 0.01405699 0.01393504 0.01957162
0.01818203 0.01698637 0.02639499 0.02102267 0.01956343 0.01588933
0.01635705 0.01391534 0.01587612 0.01677094 0.01908684 0.02032183
0.01798265 0.02017053 0.01600159 0.01576616 0.01373934 0.01596323
0.01386674 0.01532488 0.01638312 0.0172212 0.01432543 0.01893282
0.02020231]
Save the fitted labels in the encoder:
encoder = LabelEncoder()
encoder = encoder.fit(labels)
encoded_Y = encoder.transform(labels)
dummy_y = np_utils.to_categorical(encoded_Y)
Prediction will give you a class vector. And by using the inverse_transform you will get the label type as from your original input:
prediction = model.predict_classes(query)
label = encoder.inverse_transform(prediction)