numpy method for tensors in TensorFlow 2.x and eager execution

Using TensorFlow 2.4.1 on Colab.
Running the code below:
import tensorflow as tf
from tensorflow.keras.datasets import cifar100
import numpy as np
(train_data, train_labels), (test_data, test_labels) = cifar100.load_data(label_mode='fine')
train_dataset = tf.data.Dataset.from_tensor_slices((train_data, train_labels))
for (train, label) in train_dataset.take(1):
    print(label)
    print(label.numpy()[0])
# tf.Tensor([19], shape=(1,), dtype=int64)
# 19
This is all fine, but when trying the same thing inside the filter method of a tf.data.Dataset object in the code below, the numpy method does not work:
def filter_classes(dataset, classes):
    def match_class(data, label):
        print(label)
        print(label.numpy()[0])
        return label.numpy()[0] in classes
    return dataset.filter(match_class)
cifar_classes = [0, 29, 99]
train_dataset = filter_classes(train_dataset, cifar_classes)
# Tensor("args_1:0", shape=(1,), dtype=int64)
# AttributeError: 'Tensor' object has no attribute 'numpy'
From reading some related questions, the error seems to be due to the latter tensor not being eagerly executed.
Does the name "args_1:0" in the printed tensor, rather than a concrete value, signify that the tensor has not been evaluated?
With the filter method, is it by design that the tensors within the dataset do not get evaluated eagerly?
Thanks.

tf.data.Dataset functions do not run in eager mode, for performance reasons. You can't use the .numpy method in a function passed to a tf.data.Dataset.
In your case, you can use a combination of tf.math.equal and tf.math.reduce_any to filter your dataset and keep only the desired classes. Note that because the dataset yields (data, label) pairs, the filter predicate receives both:
ds_filtered = train_dataset.filter(lambda data, label: tf.math.reduce_any(tf.equal(label, cifar_classes)))
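As a quick sanity check (a minimal sketch, assuming the train_dataset built in the question), iterating over the filtered dataset should only yield labels from cifar_classes. Outside the filter function you are back in eager mode, so .numpy() works again:
for data, label in ds_filtered.take(3):
    print(label.numpy()[0])  # each printed label is one of 0, 29, 99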

Related

Importing pre-trained embeddings into Tensorflow's Embedding Feature Column

I have a TF Estimator that uses Feature Columns at its input layer. One of these is an EmbeddingColumn which I have been initializing randomly (the default behaviour).
Now I would like to pre-train my embeddings in gensim and transfer the learned embeddings into my TF model. The embedding_column accepts an initializer argument which expects a callable that can be created using tf.contrib.framework.load_embedding_initializer.
However, that function expects a saved TF checkpoint, which I don't have, because I trained my embeddings in gensim.
The question is: how do I save gensim word vectors (which are numpy arrays) as a tensor in the TF checkpoint format so that I can use that to initialize my embedding column?
Figured it out! This worked in TensorFlow 1.14.0.
You first need to turn the embedding vectors into a tf.Variable, then use tf.train.Saver to save it in a checkpoint.
import tensorflow as tf
import numpy as np

ckpt_name = 'gensim_embeddings'
vocab_file = 'vocab.txt'
tensor_name = 'embeddings_tensor'

vocab = ['A', 'B', 'C']
embedding_vectors = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
], dtype=np.float32)

embeddings = tf.Variable(initial_value=embedding_vectors)
init_op = tf.global_variables_initializer()
saver = tf.train.Saver({tensor_name: embeddings})

with tf.Session() as sess:
    sess.run(init_op)
    saver.save(sess, ckpt_name)

# write the vocab file (one token per line, aligned with the embedding rows)
with open(vocab_file, 'w') as f:
    f.write('\n'.join(vocab))
To use this checkpoint to initialize an embedding feature column:
cat = tf.feature_column.categorical_column_with_vocabulary_file(
    key='cat', vocabulary_file=vocab_file)

embedding_initializer = tf.contrib.framework.load_embedding_initializer(
    ckpt_path=ckpt_name,
    embedding_tensor_name='embeddings_tensor',
    new_vocab_size=3,
    embedding_dim=3,
    old_vocab_file=vocab_file,
    new_vocab_file=vocab_file
)

emb = tf.feature_column.embedding_column(
    cat, dimension=3, initializer=embedding_initializer, trainable=False)
And we can test to make sure it has been initialized properly:
def test_embedding(feature_column, sample):
    feature_layer = tf.keras.layers.DenseFeatures(feature_column)
    print(feature_layer(sample).numpy())

tf.enable_eager_execution()
sample = {'cat': tf.constant(['B', 'A'], dtype=tf.string)}
test_embedding(emb, sample)
The output, as expected, is:
[[4. 5. 6.]
[1. 2. 3.]]
These are the embeddings for 'B' and 'A', respectively.
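For completeness, here is a hedged sketch of how the vocab list and embedding matrix above could be pulled out of an actual trained gensim model (assuming gensim 3.x, where KeyedVectors exposes index2word and vectors; the model path is hypothetical):
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load('my_word2vec.model')  # hypothetical path to a trained model
vocab = model.wv.index2word                 # tokens, row-aligned with the vectors
embedding_vectors = model.wv.vectors.astype(np.float32)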

Tensorflow Tensor out of CSV has no size?

I just can't get any dimensions (size, length) out of this damn tensor "datatens". Here is the code and the error message:
import tensorflow as tf
import numpy as np
import tflearn
import pandas as pd
from tensorflow import keras
from tflearn.data_utils import load_csv

file = 'some.csv'
record_defaults = [tf.float64]*18

data, label = load_csv(file, target_column=0, has_header=True,
                       categorical_labels=True, n_classes=50)

datatens = tf.data.Dataset.from_tensor_slices((data, label))
print(datatens.get_shape().as_list())
ERROR:
<TensorSliceDataset shapes: ((17,), (50,)), types: (tf.string, tf.float64)>
Traceback (most recent call last):
File "basic_class.m", line 44, in <module>
print(datatens.get_shape().as_list())
AttributeError: 'TensorSliceDataset' object has no attribute 'get_shape'
FOLLOW-UP:
After getting eager execution running, I'm curious why my tensor is integer instead of float. Here is the output of the advised code.
CODE:
print(tf.shape(data))
print(tf.shape(label))
OUTPUT:
Tensor("Shape:0", shape=(2,), dtype=int32)
Tensor("Shape_1:0", shape=(2,), dtype=int32)
When you call tf.data.Dataset.from_tensor_slices, you get a dataset, not a tensor. A dataset is essentially a container of tensors, and you can access its tensors in a few ways.
The simplest way is to call the dataset's make_one_shot_iterator method. This returns an iterator that cycles through the tensors. The best documentation on datasets and iterators is here.
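For example (a minimal sketch in TF 1.x graph mode, using the datatens dataset from the question):
iterator = datatens.make_one_shot_iterator()
next_data, next_label = iterator.get_next()

# per-element shapes are available on the tensors the iterator yields
print(next_data.get_shape().as_list())   # e.g. [17]

with tf.Session() as sess:
    print(sess.run(next_label))          # the first label row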
Are you sure you want to call tf.data.Dataset.from_tensor_slices? Aren't data and label already tensors?
EDIT:
If you want to validate the tensor containing labels, try this code:
import tensorflow as tf
import numpy as np
import tflearn
import pandas as pd
from tensorflow import keras
from tflearn.data_utils import load_csv
tf.enable_eager_execution()
file = 'some.csv'
record_defaults = [tf.float64]*18
data, label = load_csv(file, target_column=0, has_header=True,
                       categorical_labels=True, n_classes=50)
print(tf.shape(label))
Enabling eager execution is important because it lets you access the tensor without having to create and run a session. As for the follow-up: the int32 dtype you see belongs to the shape tensor returned by tf.shape, which always holds integer dimensions; the dtype of the data itself is unchanged.

Converting tokens to word vectors effectively with TensorFlow Transform

I would like to use TensorFlow Transform to convert tokens to word vectors during my training, validation and inference phase.
I followed this StackOverflow post and implemented the initial conversion from tokens to vectors. The conversion works as expected and I obtain a vector of dimension EMB_DIM for each token.
import numpy as np
import tensorflow as tf

tf.reset_default_graph()
EMB_DIM = 10

def load_pretrained_glove():
    tokens = ["a", "cat", "plays", "piano"]
    return tokens, np.random.rand(len(tokens), EMB_DIM)

# sample string
string_tensor = tf.constant(["plays", "piano", "unknown_token", "another_unknown_token"])

pretrained_vocab, pretrained_embs = load_pretrained_glove()
vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(pretrained_vocab),
    default_value=len(pretrained_vocab))
string_tensor = vocab_lookup.lookup(string_tensor)

# define the word embedding
pretrained_embs = tf.get_variable(
    name="embs_pretrained",
    initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
    shape=pretrained_embs.shape,
    trainable=False)
unk_embedding = tf.get_variable(
    name="unk_embedding",
    shape=[1, EMB_DIM],
    initializer=tf.random_uniform_initializer(-0.04, 0.04),
    trainable=False)

embeddings = tf.cast(tf.concat([pretrained_embs, unk_embedding], axis=0), tf.float32)
word_vectors = tf.nn.embedding_lookup(embeddings, string_tensor)

with tf.Session() as sess:
    tf.tables_initializer().run()
    tf.global_variables_initializer().run()
    print(sess.run(word_vectors))
When I refactor the code to run as a TFX Transform graph, I get the conversion error (TypeError) below.
import pprint
import tempfile

import numpy as np
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam.impl as beam_impl
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema

tf.reset_default_graph()
EMB_DIM = 10

def load_pretrained_glove():
    tokens = ["a", "cat", "plays", "piano"]
    return tokens, np.random.rand(len(tokens), EMB_DIM)

def embed_tensor(string_tensor, trainable=False):
    """Convert a list of strings into indices, then into EMB_DIM vectors."""
    pretrained_vocab, pretrained_embs = load_pretrained_glove()
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(pretrained_vocab),
        default_value=len(pretrained_vocab))
    string_tensor = vocab_lookup.lookup(string_tensor)

    pretrained_embs = tf.get_variable(
        name="embs_pretrained",
        initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
        shape=pretrained_embs.shape,
        trainable=trainable)
    unk_embedding = tf.get_variable(
        name="unk_embedding",
        shape=[1, EMB_DIM],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=False)

    embeddings = tf.cast(tf.concat([pretrained_embs, unk_embedding], axis=0), tf.float32)
    return tf.nn.embedding_lookup(embeddings, string_tensor)

def preprocessing_fn(inputs):
    input_string = tf.string_split(inputs['sentence'], delimiter=" ")
    return {'word_vectors': tft.apply_function(embed_tensor, input_string)}

raw_data = [{'sentence': 'This is a sample sentence'},]
raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema({
    'sentence': dataset_schema.ColumnSchema(
        tf.string, [], dataset_schema.FixedColumnRepresentation())
}))

with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (raw_data, raw_data_metadata) | beam_impl.AnalyzeAndTransformDataset(
            preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable
pprint.pprint(transformed_data)
Error Message
TypeError: Failed to convert object of type <class
'tensorflow.python.framework.sparse_tensor.SparseTensor'> to Tensor.
Contents: SparseTensor(indices=Tensor("StringSplit:0", shape=(?, 2),
dtype=int64), values=Tensor("hash_table_Lookup:0", shape=(?,),
dtype=int64), dense_shape=Tensor("StringSplit:2", shape=(2,),
dtype=int64)). Consider casting elements to a supported type.
Questions
Why would the TF Transform step require an additional conversion/casting?
Is this approach of converting tokens to word vectors feasible? The word vectors might take multiple gigabytes of memory. How does Apache Beam handle the vectors? If Beam runs in a distributed setup, would it require N times the vector memory, with N being the number of workers?
The SparseTensor-related error occurs because you are calling string_split, which returns a SparseTensor. Your test code does not call string_split, which is why the error only appears in your Transform code.
Regarding memory, you are correct: the embedding matrix must be loaded into each worker.
One cannot put a SparseTensor into the dictionary returned by the TFX Transform preprocessing function, in your case "preprocessing_fn". The reason is that a SparseTensor is not a Tensor; it is actually a small subgraph.
To fix your code, you can convert your SparseTensor into a Tensor. There are a number of ways to do so; I would recommend using tf.serialize_sparse for a regular SparseTensor and tf.serialize_many_sparse for a batched one.
To consume such a serialized Tensor in the Trainer, you would call tf.deserialize_many_sparse.
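A minimal sketch of that idea (the output key name is illustrative, and the surrounding imports from the question are assumed; this serializes the token SparseTensor produced by string_split, deferring the embedding lookup to the Trainer):
def preprocessing_fn(inputs):
    sparse_tokens = tf.string_split(inputs['sentence'], delimiter=" ")
    # serialize_many_sparse treats the first dimension as the batch dimension
    # and yields a dense tf.string tensor of shape [batch_size, 3]
    return {'tokens_serialized': tf.serialize_many_sparse(sparse_tokens)}
In the Trainer, the matching call is tf.deserialize_many_sparse(serialized, dtype=tf.string) to recover the SparseTensor.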

'DataFrame' object has no attribute 'train'

Please help me find what I am missing. Why do I always get this error:
'DataFrame' object has no attribute 'train'
# -*- coding: utf-8 -*-
import tensorflow as tf
from tensorflow.contrib import rnn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv("all.csv")
x = dataset.iloc[:, 1:51].values
y = dataset.iloc[:, 51].values

time_steps = 5
num_units = 128
n_input = 50
learning_rate = 0.001
n_classes = 2
batch_size = 5

# weights and biases of appropriate shape to accomplish above task
out_weights = tf.Variable(tf.random_normal([num_units, n_classes]))
out_bias = tf.Variable(tf.random_normal([n_classes]))

# defining placeholders
# input image placeholder
x = tf.placeholder("float", [None, time_steps, n_input])
# input label placeholder
y = tf.placeholder("float", [None, n_classes])

# processing the input tensor from [batch_size,n_steps,n_input] to
# "time_steps" number of [batch_size,n_input] tensors
input = tf.unstack(x, time_steps, 1)

# defining the network
lstm_layer = rnn.BasicLSTMCell(num_units, forget_bias=1)
outputs, _ = rnn.static_rnn(lstm_layer, input, dtype="float32")

# converting the last output of dimension [batch_size,num_units] to
# [batch_size,n_classes] by out_weight multiplication
prediction = tf.matmul(outputs[-1], out_weights) + out_bias

# loss function
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
# optimization
opt = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

# model evaluation
correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# initialize variables
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    iter = 1
    while iter < 800:
        batch_x, batch_y = dataset.train.next_batch(batch_size=batch_size)
        batch_x = batch_x.reshape((batch_size, time_steps, n_input))
        sess.run(opt, feed_dict={x: batch_x, y: batch_y})
        if iter % 10 == 0:
            acc = sess.run(accuracy, feed_dict={x: batch_x, y: batch_y})
            los = sess.run(loss, feed_dict={x: batch_x, y: batch_y})
            print("For iter ", iter)
            print("Accuracy ", acc)
            print("Loss ", los)
            print("__________________")
        iter = iter + 1
As the error states, your pandas DataFrame object has no attribute called train (and pandas provides no next_batch method either).
You probably followed a tutorial that used TensorFlow's helper methods to load the MNIST database. pandas returns a different object than the "DataSet" class you are expecting.
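One option is to mini-batch the numpy arrays yourself. A minimal sketch (assuming features and labels are the arrays sliced out of the DataFrame at the top of the question, with labels already one-hot encoded to shape [None, n_classes]):
import numpy as np

def next_batch(features, labels, batch_size):
    # sample a random mini-batch from the numpy arrays
    idx = np.random.choice(len(features), size=batch_size, replace=False)
    return features[idx], labels[idx]

batch_x, batch_y = next_batch(features, labels, batch_size=5)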

Cross-validation with Scikit Flow fails

I am using scikit-learn to evaluate my neural net, which is implemented in TensorFlow and wrapped in a TensorFlow estimator:
import tensorflow.contrib.learn as skflow
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
...

def my_model(X, y):
    ...
    ...
    return skflow.models.logistic_regression(h_drop, y)

def main():
    X_train, X_test, y_train, y_test = train_test_split(data, labels,
                                                        test_size=0.1, random_state=3)
    classifier = skflow.TensorFlowEstimator(model_fn=my_model, n_classes=2,
                                            batch_size=64, steps=5,
                                            optimizer='Adam', learning_rate=1e-4)
    classifier.fit(X_train, y_train)
    cross_val_score(classifier, data, y=labels, cv=10)
cross_val_score results in the following error:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator TensorFlowEstimator(steps=5, batch_size=64, continue_training=False, verbose=1, n_classes=2, learning_rate=0.0001, clip_gradients=5.0, class_weight=None, params=None, optimizer=Adam) does not.
When I define a scoring method as shown here:
from sklearn import metrics
cross_val_score(classifier, data, y=labels, cv=10, scoring=metrics.f1_score)
the following error occurs:
scoring value looks like it is a metric function rather than a scorer. A scorer should require an estimator as its first parameter. Please use make_scorer to convert a metric to a scorer.
When I use make_scorer as shown here:
cross_val_score(classifier, data, y=labels, cv=10,
                scoring=metrics.make_scorer(metrics.accuracy_score))
the following error occurs:
new_object = klass(**new_object_params)
TypeError: init() got an unexpected keyword argument 'params'
Any idea?