Use text as a feature column in TensorFlow's existing Estimators - tensorflow

I am trying to build a classifier with one of the existing Estimators to predict whether an article will be sold or not.
I tried a LinearClassifier, because I'm a beginner with TensorFlow and Python.
I have a dataset with price, category and size, which is perfect for numeric or categorical feature columns. But I also have a description of each article, only 3-6 words per article and around 6500 distinct words according to my analysis.
I tried a shared embedding with one categorical column per word, but that did not work. And when I add all 6500 columns directly to the model, it becomes very slow.
What is the best and easiest way to handle the description, ideally with a code example? Word order doesn't matter, but, for example, a branded article will sell better than a no-name one.
Many thanks for your answers.
Edit: I tried the approach from this post: Tensorflow pad sequence feature column.
But now tf.data.Dataset.from_tensor_slices((dict(dataframe), labels)) doesn't work:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
from six.moves import urllib
import tensorflow.compat.v2.feature_column as fc
from sklearn.feature_extraction.text import CountVectorizer
import tensorflow_hub as hub
from sklearn.model_selection import train_test_split
from tensorflow.python.framework.ops import disable_eager_execution
import itertools
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import text_to_word_sequence
dfall = pd.read_csv('./articles.csv')
# Build vocabulary
vocab_size = 6203
oov_tok = '<OOV>'
sentences = dfall['description'].to_list()
tokenizer = Tokenizer(num_words = vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
# if word_index is shorter than the default vocab_size, save the actual size
vocab_size=len(word_index)
print("vocab_size = word_index = ",len(word_index))
# Split sentences into tokens; here token = word
# text_to_word_sequence() has a good default filter for
# characters including basic punctuation, tabs, and newlines
dfall['description'] = dfall['description'].apply(text_to_word_sequence)
max_length = 9
# padding and truncating sentences
# do that directly with strings, without using tokenizer.texts_to_sequences()
# the feature_column will convert strings into numbers
dfall['description']=dfall['description'].apply(lambda x, N=max_length: (x + N * [''])[:N])
dfall['description']=dfall['description'].apply(lambda x, N=max_length: x[:N])
#dfall['description']=dfall['description'].apply(np.asarray)
dfall.head()
# Define method to create tf.data dataset from Pandas Dataframe
def df_to_dataset(dataframe, label_column, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    #labels = dataframe.pop(label_column)
    labels = dataframe[label_column]
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
# Split dataframe into train and validation sets
train_df, val_df = train_test_split(dfall, test_size=0.2)
print(len(train_df), 'train examples')
print(len(val_df), 'validation examples')
batch_size = 32
ds = df_to_dataset(dfall, 'sold',shuffle=False,batch_size=batch_size)
train_ds = df_to_dataset(train_df, 'sold', shuffle=False, batch_size=batch_size)
val_ds = df_to_dataset(val_df, 'sold', shuffle=False, batch_size=batch_size)
# and small batch for demo
example_batch = next(iter(ds))[0]
example_batch
# Helper methods to print example outputs for a defined feature_column
def demo(feature_column):
    feature_layer = tf.keras.layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch).numpy())
def seqdemo(feature_column):
    sequence_feature_layer = tf.keras.experimental.SequenceFeatures(feature_column)
    print(sequence_feature_layer(example_batch))
The output of dfall.head() is:
sold description category_id size_id gender price host_id lat long year month
0 1 [dünne, jacke, gepunktet, , , , , , ] 9 25 f 3.5 1 48.21534 11.29949 2019 3
1 1 [kleid, pudel, dunkelblau, gepunktet, , , , , ] 9 25 f 4.0 1 48.21534 11.29949 2019 3
2 0 [kleid, rosa, hum, hund, katze, , , , ] 9 24 f 4.0 1 48.21534 11.29949 2019 3
3 1 [kleid, hum, blau, elsa, und, anna, , , ] 9 24 f 4.0 1 48.21534 11.29949 2019 3
4 0 [kleid, blue, seven, lachsfarben, , , , , ] 9 23 f 4.5 1 48.21534 11.29949 2019 3
The result is
vocab_size = word_index = 6203
12482 train examples
3121 validation examples
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
c:\users\nibur\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\data\util\structure.py in normalize_element(element)
92 try:
---> 93 spec = type_spec_from_value(t, use_fallback=False)
94 except TypeError:
c:\users\nibur\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\data\util\structure.py in type_spec_from_value(element, use_fallback)
464
--> 465 raise TypeError("Could not build a TypeSpec for %r with type %s" %
466 (element, type(element).__name__))
TypeError: Could not build a TypeSpec for 0 [dünne, jacke, gepunktet, , , , , , ]
1 [kleid, pudel, dunkelblau, gepunktet, , , , , ]
2 [kleid, rosa, hum, hund, katze, , , , ]
3 [kleid, hum, blau, elsa, und, anna, , , ]
4 [kleid, blue, seven, lachsfarben, , , , , ]
...
15598 [gartenschuhe, pink, , , , , , , ]
15599 [sandalen, grau, blume, superfit, , , , , ]
15600 [turnschuhe, converse, grau, , , , , , ]
15601 [strickjacke, rosa, , , , , , , ]
15602 [bikinihose, schmetterling, , , , , , , ]
Name: description, Length: 15603, dtype: object with type Series
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-1-420304a651bd> in <module>
71
72 batch_size = 32
---> 73 ds = df_to_dataset(dfall, 'sold',shuffle=False,batch_size=batch_size)
74
75 train_ds = df_to_dataset(train_df, 'sold', shuffle=False, batch_size=batch_size)
<ipython-input-1-420304a651bd> in df_to_dataset(dataframe, label_column, shuffle, batch_size)
58 labels = dataframe[label_column]
59
---> 60 ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
61 if shuffle:
62 ds = ds.shuffle(buffer_size=len(dataframe))
c:\users\nibur\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py in from_tensor_slices(tensors)
638 Dataset: A `Dataset`.
639 """
--> 640 return TensorSliceDataset(tensors)
641
642 class _GeneratorState(object):
c:\users\nibur\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py in __init__(self, element)
2856 def __init__(self, element):
2857 """See `Dataset.from_tensor_slices()` for details."""
-> 2858 element = structure.normalize_element(element)
2859 batched_spec = structure.type_spec_from_value(element)
2860 self._tensors = structure.to_batched_tensor_list(batched_spec, element)
c:\users\nibur\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\data\util\structure.py in normalize_element(element)
96 # the value. As a fallback try converting the value to a tensor.
97 normalized_components.append(
---> 98 ops.convert_to_tensor(t, name="component_%d" % i))
99 else:
100 if isinstance(spec, sparse_tensor.SparseTensorSpec):
c:\users\nibur\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\framework\ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
1339
1340 if ret is None:
-> 1341 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
1342
1343 if ret is NotImplemented:
c:\users\nibur\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
319 as_ref=False):
320 _ = as_ref
--> 321 return constant(v, dtype=dtype, name=name)
322
323
c:\users\nibur\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\framework\constant_op.py in constant(value, dtype, shape, name)
259 ValueError: if called on a symbolic tensor.
260 """
--> 261 return _constant_impl(value, dtype, shape, name, verify_shape=False,
262 allow_broadcast=True)
263
c:\users\nibur\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
268 ctx = context.context()
269 if ctx.executing_eagerly():
--> 270 t = convert_to_eager_tensor(value, ctx, dtype)
271 if shape is None:
272 return t
c:\users\nibur\appdata\local\programs\python\python38\lib\site-packages\tensorflow\python\framework\constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
94 dtype = dtypes.as_dtype(dtype).as_datatype_enum
95 ctx.ensure_initialized()
---> 96 return ops.EagerTensor(value, ctx.device_name, dtype)
97
98
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
I already tried to use
dfall['description']=dfall['description'].apply(np.asarray)
but then I got
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
For anyone with the same problem, the solution is
tf.data.Dataset.from_tensor_slices((dataframe.to_dict(orient='list'), labels))
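As a minimal sketch of how that fix slots into the df_to_dataset helper above (to_dict(orient='list') turns each column into a plain Python list, so the padded list-of-words description column becomes a regular 2-D structure that from_tensor_slices can convert):
def df_to_dataset(dataframe, label_column, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop(label_column)
    # dict(dataframe) keeps each column as a pandas Series of Python lists,
    # which TF cannot convert; to_dict(orient='list') yields plain lists instead
    ds = tf.data.Dataset.from_tensor_slices((dataframe.to_dict(orient='list'), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    return ds.batch(batch_size)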

Unless there is a good reason to use TensorFlow, I would advise starting with a simple model first. Use scikit-learn and follow its tutorial on working with text data. It will show you techniques like bag-of-words (BoW) and TF-IDF embeddings.
For your particular problem, one really interesting thing to try is the following: you embed your article description using BoW or TF-IDF, and you embed the rest of your features as you would for regular tabular data. Then you concatenate the embeddings and feed the result to a linear classifier in scikit-learn, as sketched below.
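A minimal sketch of that idea, assuming the description column still holds raw strings (not the padded word lists from the question) and reusing the question's column names:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# TF-IDF for the free-text column, one-hot for categoricals, scaling for numerics;
# ColumnTransformer concatenates the three blocks into one feature matrix
preprocess = ColumnTransformer([
    ('text', TfidfVectorizer(), 'description'),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['category_id', 'size_id', 'gender']),
    ('num', StandardScaler(), ['price']),
])
model = Pipeline([('features', preprocess), ('clf', LogisticRegression(max_iter=1000))])
model.fit(train_df, train_df['sold'])
print('validation accuracy:', model.score(val_df, val_df['sold']))
Swap TfidfVectorizer for CountVectorizer if you want plain bag-of-words.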

Related

to_numpy() doesn't work on a pandas dataframe

I'm working with TensorFlow. Previously I used TensorFlow.js, but as you all know it has limited functionality. So, to create the model I started using numpy + pandas + tensorflow in VS Code + ipynb.
I have a dataframe "seqs":
[code, C, M, S, string_to_classified]
The string can be classified into three categories (non-exclusive): C, M and S.
So the label should be [C, M, S].
This code works and gives me a nice pd dataframe:
trainingData = pd.DataFrame()
trainingData['string_to_classified'] = seqs['string_to_classified'].apply(nucleoBits)
trainingData['label'] = seqs[['C', 'M', 'S']].values.tolist()
However, when I try this:
trainingDataSet = tf.data.Dataset.from_tensor_slices((trainingData['string_to_classified'].values, trainingData['label'].values))
I got
<ipython-input-89-897ad7666fa6> in <module>
----> 1 trainingDataSet = tf.data.Dataset.from_tensor_slices((trainingData['string_to_classified'].values, trainingData['label'].values))
2
c:\Users\Dua\anaconda3\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py in from_tensor_slices(tensors, name)
812 Dataset: A `Dataset`.
813 """
--> 814 return TensorSliceDataset(tensors, name=name)
815
816 class _GeneratorState(object):
c:\Users\Dua\anaconda3\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py in __init__(self, element, is_files, name)
4706 def __init__(self, element, is_files=False, name=None):
4707 """See `Dataset.from_tensor_slices()` for details."""
-> 4708 element = structure.normalize_element(element)
4709 batched_spec = structure.type_spec_from_value(element)
4710 self._tensors = structure.to_batched_tensor_list(batched_spec, element)
c:\Users\Dua\anaconda3\lib\site-packages\tensorflow\python\data\util\structure.py in normalize_element(element, element_signature)
124 dtype = getattr(spec, "dtype", None)
125 normalized_components.append(
--> 126 ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
127 return nest.pack_sequence_as(pack_as, normalized_components)
...
--> 102 return ops.EagerTensor(value, ctx.device_name, dtype)
103
104
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
PS: tfjs was simpler
const labels = tf.tensor3d(seqs.C, seqs.M, seqs.S)
and it was done
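No answer is shown here, but this is the same "Unsupported object type list" failure as above: from_tensor_slices cannot convert a pandas column whose cells are Python lists. One likely fix, assuming nucleoBits returns equal-length numeric lists, is to stack both columns into proper 2-D NumPy arrays first (a sketch, not tested against the original data):
import numpy as np
import tensorflow as tf

# Stack the per-row lists into dense 2-D arrays before building the Dataset
features = np.stack(trainingData['string_to_classified'].tolist())      # shape (n, seq_len)
labels = np.asarray(trainingData['label'].tolist(), dtype=np.float32)   # shape (n, 3) for [C, M, S]
trainingDataSet = tf.data.Dataset.from_tensor_slices((features, labels))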

ValueError: The two structures don't have the same sequence length. Input structure has length 1, while shallow structure has length 2

What is the solution to the following error in TensorFlow?
ValueError: The two structures don't have the same sequence length.
Input structure has length 1, while shallow structure has length 2.
I tried TensorFlow versions 2.9.1 and 2.4.0.
The toy example below reproduces the error.
import tensorflow as tf

d1 = tf.data.Dataset.range(10)
d1 = d1.map(lambda x: tf.cast([x], tf.float32))

def func1(x):
    y1 = 2.0 * x
    y2 = -3.0 * x
    return tuple([y1, y2])

d2 = d1.map(lambda x: tf.py_function(func1, [x], [tf.float32, tf.float32]))
d3 = d2.padded_batch(3, padded_shapes=(None,))
for x, y in d2.as_numpy_iterator():
    pass
The full error is:
ValueError Traceback (most recent call last)
~/Documents/pythonProject/tfProjects/asr/transformer/dataset.py in <module>
256 return tuple([y1, y2])
257 d2 = d1.map(lambda x: tf.py_function(func1, [x], [tf.float32, tf.float32]))
---> 258 d3 = d2.padded_batch(3, padded_shapes=(None,))
259 for x, y in d2.as_numpy_iterator():
260 pass
~/miniconda3/envs/jtf2/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py in padded_batch(self, batch_size, padded_shapes, padding_values, drop_remainder, name)
1887 padding_values,
1888 drop_remainder,
-> 1889 name=name)
1890
1891 def map(self,
~/miniconda3/envs/jtf2/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py in __init__(self, input_dataset, batch_size, padded_shapes, padding_values, drop_remainder, name)
5171
5172 input_shapes = get_legacy_output_shapes(input_dataset)
-> 5173 flat_padded_shapes = nest.flatten_up_to(input_shapes, padded_shapes)
5174
5175 flat_padded_shapes_as_tensors = []
~/miniconda3/envs/jtf2/lib/python3.7/site-packages/tensorflow/python/data/util/nest.py in flatten_up_to(shallow_tree, input_tree)
377 `input_tree`.
378 """
--> 379 assert_shallow_structure(shallow_tree, input_tree)
380 return list(_yield_flat_up_to(shallow_tree, input_tree))
381
~/miniconda3/envs/jtf2/lib/python3.7/site-packages/tensorflow/python/data/util/nest.py in assert_shallow_structure(shallow_tree, input_tree, check_types)
290 if len(input_tree) != len(shallow_tree):
291 raise ValueError(
--> 292 "The two structures don't have the same sequence length. Input "
293 f"structure has length {len(input_tree)}, while shallow structure "
294 f"has length {len(shallow_tree)}.")
ValueError: The two structures don't have the same sequence length. Input structure has length 1, while shallow structure has length 2.
The following modification of the padded_shapes argument resolves the error: d2 yields 2-tuples, so padded_shapes must be a matching 2-tuple of shapes.
import tensorflow as tf

d1 = tf.data.Dataset.range(10)
d1 = d1.map(lambda x: tf.cast([x], tf.float32))

def func1(x):
    y1 = 2.0 * x
    y2 = -3.0 * x
    return tuple([y1, y2])

d2 = d1.map(lambda x: tf.py_function(func1, [x], [tf.float32, tf.float32]))
d3 = d2.padded_batch(3, padded_shapes=([None], [None]))
for x, y in d2.as_numpy_iterator():
    pass
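Note that the final loop in both snippets iterates d2, which never exercises padded_batch; to see the fix in action, iterate d3 instead:
for y1, y2 in d3.as_numpy_iterator():
    print(y1.shape, y2.shape)  # each component is padded to the longest sequence in its batch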

tf.keras.layers.Concatenate() works with a list but fails on a tuple of tensors

This will work:
tf.keras.layers.Concatenate()([features['a'], features['b']])
While this:
tf.keras.layers.Concatenate()((features['a'], features['b']))
Results in:
TypeError: int() argument must be a string or a number, not 'TensorShapeV1'
Is that expected? If so, why does it matter what kind of sequence I pass?
Thanks,
Zach
EDIT (adding a code example):
import pandas as pd
import numpy as np
import tensorflow as tf  # missing from the original snippet

data = {
    'a': [1.0, 2.0, 3.0],
    'b': [0.1, 0.3, 0.2],
}

with tf.Session() as sess:
    ds = tf.data.Dataset.from_tensor_slices(data)
    ds = ds.batch(1)
    it = ds.make_one_shot_iterator()
    features = it.get_next()
    concat = tf.keras.layers.Concatenate()((features['a'], features['b']))
    try:
        while True:
            print(sess.run(concat))
    except tf.errors.OutOfRangeError:
        pass
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-135-0e1a45017941> in <module>()
6 features = it.get_next()
7
----> 8 concat = tf.keras.layers.Concatenate()((features['a'], features['b']))
9
10
google3/third_party/tensorflow/python/keras/engine/base_layer.py in __call__(self, inputs, *args, **kwargs)
751 # the user has manually overwritten the build method do we need to
752 # build it.
--> 753 self.build(input_shapes)
754 # We must set self.built since user defined build functions are not
755 # constrained to set self.built.
google3/third_party/tensorflow/python/keras/utils/tf_utils.py in wrapper(instance, input_shape)
148 tuple(tensor_shape.TensorShape(x).as_list()) for x in input_shape]
149 else:
--> 150 input_shape = tuple(tensor_shape.TensorShape(input_shape).as_list())
151 output_shape = fn(instance, input_shape)
152 if output_shape is not None:
google3/third_party/tensorflow/python/framework/tensor_shape.py in __init__(self, dims)
688 else:
689 # Got a list of dimensions
--> 690 self._dims = [as_dimension(d) for d in dims_iter]
691
692 @property
google3/third_party/tensorflow/python/framework/tensor_shape.py in as_dimension(value)
630 return value
631 else:
--> 632 return Dimension(value)
633
634
google3/third_party/tensorflow/python/framework/tensor_shape.py in __init__(self, value)
183 raise TypeError("Cannot convert %s to Dimension" % value)
184 else:
--> 185 self._value = int(value)
186 if (not isinstance(value, compat.bytes_or_text_types) and
187 self._value != value):
TypeError: int() argument must be a string or a number, not 'TensorShapeV1'
https://github.com/keras-team/keras/blob/master/keras/layers/merge.py#L329
The comment on the Concatenate class states that it requires a list.
This class calls the backend's concatenate function:
https://github.com/keras-team/keras/blob/master/keras/backend/tensorflow_backend.py#L2041
which also states it requires a list.
In TensorFlow, https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/python/ops/array_ops.py#L1034 also states it requires a list of tensors. Why? I don't know. In this function the tensors (a variable called "values") actually do get checked for being a list or tuple, but somewhere along the way you still get an error.
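So in practice the workaround is simply to pass a list, as the working form in the question already does:
# A list is treated as multiple inputs; a tuple is mistaken for a single shape
concat = tf.keras.layers.Concatenate()([features['a'], features['b']])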

tensorflow dataset shuffle examples instead of batches

How do I get a tensorflow dataset in batch mode to shuffle across all the samples? It is only shuffling the batches.
Below is a program that makes a dataset of 1000 items and goes through 10 epochs of it in batches of 5. I have shuffle() turned on. I can see that tensorflow groups the dataset into 200 batches of 5 examples each, and the shuffle is across those batches. I want each new batch to be a random sample of the original 1000 examples, not a sample of the 200 original batches.
That is, this program:
import numpy as np
import tensorflow as tf
import random

def rec2tfrec_example(rec):
    def _int64_feat(value):
        arr_value = np.empty([1], dtype=np.int64)
        arr_value[0] = value
        return tf.train.Feature(int64_list=tf.train.Int64List(value=arr_value))
    feat = {
        'uid': _int64_feat(rec['uid']),
    }
    return tf.train.Example(features=tf.train.Features(feature=feat)).SerializeToString()

def parse_example(tfrec_serialized_string):
    feat = {
        'uid': tf.FixedLenFeature([], tf.int64),
    }
    return tf.parse_example(tfrec_serialized_string, feat)

def write_tfrecs_to_file(fname, recs):
    recwriter = tf.python_io.TFRecordWriter(fname)
    for rec in recs:
        recwriter.write(bytes(rec))
    recwriter.close()

def check_shuffle(sess, tfrec_output_filename, data, N, batch_size):
    epochs = 10
    dataset = tf.data.TFRecordDataset(tfrec_output_filename) \
        .batch(batch_size) \
        .repeat(epochs) \
        .shuffle(2*N) \
        .map(parse_example, num_parallel_calls=2)
    tf_iter = dataset.make_initializable_iterator()
    get_next = tf_iter.get_next()
    sess.run(tf_iter.initializer)
    num_batches = N//batch_size
    for epoch in range(epochs):
        for batch in range(N//batch_size):
            tfres = sess.run(get_next)
            print("epoch=%4d batch=%d uid=%s" % (epoch, batch, tfres['uid']))

def main(N=1000, batch_size=5, tfrec_output_filename='tfrec_testing.tfrecords'):
    tf.reset_default_graph()
    data = [{'uid': uid} for uid in range(N)]
    tfrec_strings = [rec2tfrec_example(rec) for rec in data]
    write_tfrecs_to_file(tfrec_output_filename, tfrec_strings)
    with tf.Session() as sess:
        check_shuffle(sess, tfrec_output_filename, data, N, batch_size)

if __name__ == '__main__':
    main()
produces output like:
epoch= 9 batch=186 uid=[685 686 687 688 689]
epoch= 9 batch=187 uid=[235 236 237 238 239]
epoch= 9 batch=188 uid=[520 521 522 523 524]
epoch= 9 batch=189 uid=[135 136 137 138 139]
epoch= 9 batch=190 uid=[95 96 97 98 99]
epoch= 9 batch=191 uid=[290 291 292 293 294]
epoch= 9 batch=192 uid=[230 231 232 233 234]
epoch= 9 batch=193 uid=[215 216 217 218 219]
Ah, the order of batch and shuffle matters. If I set up the dataset like this:
dataset = tf.data.TFRecordDataset(tfrec_output_filename) \
    .shuffle(2*N) \
    .batch(batch_size) \
    .repeat(epochs) \
    .map(parse_example, num_parallel_calls=2)
with shuffle before batch, then it works: shuffle then sees individual records rather than already-formed batches of 5.

type mismatch using sparse_precision_at_k from tensorflow.metrics

I am working with a toy example to check how tensorflow.metrics.sparse_precision_at_k works.
From the documentation:
labels: int64 Tensor or SparseTensor with shape
[D1, ... DN, num_labels] or [D1, ... DN], where the latter implies
num_labels=1. N >= 1 and num_labels is the number of target classes for
the associated prediction. Commonly, N=1 and labels has shape
[batch_size, num_labels]. [D1, ... DN] must match predictions. Values
should be in range [0, num_classes), where num_classes is the last
dimension of predictions. Values outside this range are ignored.
predictions: Float Tensor with shape [D1, ... DN, num_classes] where
N >= 1. Commonly, N=1 and predictions has shape [batch size, num_classes].
The final dimension contains the logit values for each class. [D1, ... DN]
must match labels.
k: Integer, k for #k metric.
So I have written the following example accordingly:
import tensorflow as tf
import numpy as np
pred = np.asarray([[.8,.1,.1,.1],[.2,.9,.9,.9]]).T
print(pred.shape)
segm = [0,1,1,1]
segm = np.asarray(segm, np.float32)
print(segm.shape)
segm_tf = tf.Variable(segm, np.int64)
pred_tf = tf.Variable(pred, np.float32)
print("segm_tf", segm_tf.shape)
print("pred_tf", pred_tf.shape)
prec,_ = tf.metrics.sparse_precision_at_k(segm_tf, pred_tf, 1, class_id=1)
sess = tf.InteractiveSession()
tf.variables_initializer([prec, segm_tf, pred_tf])
However, I am getting an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-c6243802dedc> in <module>()
25 print("pred_tf", pred_tf.shape)
26
---> 27 prec,_ = tf.metrics.sparse_precision_at_k(segm_tf, pred_tf, 1, class_id=1)
28 sess = tf.InteractiveSession()
29 tf.variables_initializer([prec, segm_tf, pred_tf])
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in sparse_precision_at_k(labels, predictions, k, class_id, weights, metrics_collections, updates_collections, name)
2828 metrics_collections=metrics_collections,
2829 updates_collections=updates_collections,
-> 2830 name=scope)
2831
2832
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in _sparse_precision_at_top_k(labels, predictions_idx, k, class_id, weights, metrics_collections, updates_collections, name)
2726 tp, tp_update = _streaming_sparse_true_positive_at_k(
2727 predictions_idx=top_k_idx, labels=labels, k=k, class_id=class_id,
-> 2728 weights=weights)
2729 fp, fp_update = _streaming_sparse_false_positive_at_k(
2730 predictions_idx=top_k_idx, labels=labels, k=k, class_id=class_id,
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in _streaming_sparse_true_positive_at_k(labels, predictions_idx, k, class_id, weights, name)
1743 tp = _sparse_true_positive_at_k(
1744 predictions_idx=predictions_idx, labels=labels, class_id=class_id,
-> 1745 weights=weights)
1746 batch_total_tp = math_ops.to_double(math_ops.reduce_sum(tp))
1747
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in _sparse_true_positive_at_k(labels, predictions_idx, class_id, weights, name)
1689 name, 'true_positives', (predictions_idx, labels, weights)):
1690 labels, predictions_idx = _maybe_select_class_id(
-> 1691 labels, predictions_idx, class_id)
1692 tp = sets.set_size(sets.set_intersection(predictions_idx, labels))
1693 tp = math_ops.to_double(tp)
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in _maybe_select_class_id(labels, predictions_idx, selected_id)
1651 if selected_id is None:
1652 return labels, predictions_idx
-> 1653 return (_select_class_id(labels, selected_id),
1654 _select_class_id(predictions_idx, selected_id))
1655
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in _select_class_id(ids, selected_id)
1627 filled_selected_id = array_ops.fill(
1628 filled_selected_id_shape, math_ops.to_int64(selected_id))
-> 1629 result = sets.set_intersection(filled_selected_id, ids)
1630 return sparse_tensor.SparseTensor(
1631 indices=result.indices, values=result.values, dense_shape=ids_shape)
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/sets_impl.py in set_intersection(a, b, validate_indices)
191 intersections.
192 """
--> 193 a, b, _ = _convert_to_tensors_or_sparse_tensors(a, b)
194 return _set_operation(a, b, "intersection", validate_indices)
195
/home/ubuntu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/sets_impl.py in _convert_to_tensors_or_sparse_tensors(a, b)
82 b = sparse_tensor.convert_to_tensor_or_sparse_tensor(b, name="b")
83 if b.dtype.base_dtype != a.dtype.base_dtype:
---> 84 raise TypeError("Types don't match, %s vs %s." % (a.dtype, b.dtype))
85 if (isinstance(a, sparse_tensor.SparseTensor) and
86 not isinstance(b, sparse_tensor.SparseTensor)):
TypeError: Types don't match, <dtype: 'int64'> vs <dtype: 'float32'>.
Below is a simple example of using this metric.
sess = tf.Session()
predictions = tf.constant([[0.1, 0.3, 0.2, 0.4], [0.1, 0.2, 0.3, 0.4]],
                          dtype=tf.float32)
labels = tf.constant([3, 2], tf.int64)
precision_op, update_op = tf.metrics.sparse_precision_at_k(
    labels=labels,
    predictions=predictions,
    k=1,
    class_id=3)
sess.run(tf.local_variables_initializer())
print(sess.run(update_op))
This example prints 0.5 because our predictions picked class 3 for both examples and only one of them is correct.
The two returned ops (precision_op and update_op) can be confusing. Please read this guide: https://www.tensorflow.org/api_guides/python/contrib.metrics. It talks about "streaming" metrics, but the same logic applies to all metrics. Basically, update_op actually updates the internal variables using the examples/labels you gave, while precision_op is idempotent: it simply returns the current value of the metric. If you never call update_op, the current value of the metric is undefined, most likely nan.
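To see that streaming behavior, here is a short continuation of the example above: each extra run of update_op folds the same batch into the accumulated counts, while precision_op just reads the current value:
print(sess.run(update_op))     # 0.5 again: counts double, the ratio is unchanged
print(sess.run(precision_op))  # 0.5: an idempotent read of the metric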
Regarding your code, the shapes are not correct. In the simplest case, labels should just give the correct label for each example in the batch. In your case there are just two examples, so there should be just two labels. Also, you don't need to create variables yourself; sparse_precision_at_k does that for you.