Using TF Estimator with TFRecord generator - tensorflow

I am trying to create a simple NN that reads in a folder of tfrecords. Each record has a 1024-value 'mean_rgb' vector and a category label. The goal is a simple feed-forward NN that learns the categories based on this feature vector.
def generate(dir, shuffle, batch_size):
    def parse(serialized):
        features = {
            'mean_rgb': tf.FixedLenFeature([1024], tf.float32),
            'category': tf.FixedLenFeature([], tf.int64)
        }
        parsed_example = tf.parse_single_example(serialized=serialized, features=features)
        vrv = parsed_example['mean_rgb']
        label = parsed_example['category']
        d = dict(zip(['mean_rgb'], [vrv])), label
        return d

    dataset = tf.data.TFRecordDataset(dir).repeat(1)
    dataset = dataset.map(parse)
    if shuffle:
        dataset = dataset.shuffle(8000)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    next = iterator.get_next()
    print(next)
    return next

def batch_generator(dir, shuffle=False, batch_size=64):
    sess = K.get_session()
    while True:
        yield sess.run(generate(dir, shuffle, batch_size))

num_classes = 29
batch_size = 64
yt8m_train = [os.path.join(yt8m_dir_train, x) for x in read_all_file_names(yt8m_dir_train) if '.tfrecord' in x]
yt8m_test = [os.path.join(yt8m_dir_test, x) for x in read_all_file_names(yt8m_dir_test) if '.tfrecord' in x]
feature_columns = [tf.feature_column.numeric_column(k) for k in ['mean_rgb']]
#batch_generator(yt8m_test).__next__()

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[1024, 1024],
    n_classes=num_classes,
    model_dir=model_dir)

classifier.train(
    input_fn=lambda: generate(yt8m_train, True, batch_size))
However, I get the following error:
InvalidArgumentError (see above for traceback): Input to reshape is a
tensor with 65536 values, but the requested shape has 64
I am not sure why it sees the input as a 64x1024=65536 vector instead of a (64, 1024) vector. When I print the next item in the generator, I get
({'mean_rgb': <tf.Tensor: id=23, shape=(64, 1024), dtype=float32, numpy=
array([[ 0.9243997 , 0.28990048, -0.4130672 , ..., -0.096692 ,
0.27225342, 0.13346168],
[ 0.5853526 , 0.67050666, -0.24683481, ..., -0.6999033 ,
-0.4100128 , -0.00349384],
[ 0.49572858, 0.5231492 , -0.53445834, ..., 0.0449002 ,
0.10582132, -0.37333965],
...,
[ 0.5776026 , -0.07128889, -0.61762846, ..., 0.22194198,
0.61441416, -0.27355513],
[-0.01848815, 0.20132884, 1.1023484 , ..., 0.06496283,
0.29560333, 0.09157721],
[-0.25877073, -1.9552246 , 0.10309827, ..., 0.22032814,
-0.6812989 , -0.23649289]], dtype=float32)>}
which has the correct (64, 1024) shape.

The problem is in how the feature_columns work. I had a similar problem and solved it by doing a reshape; here is the part of my code that will help you understand.
Defining the feature_columns:
feature_columns = {
    'images': tf.feature_column.numeric_column('images', self.shape),
}
Then, to create the input for the model:
with tf.name_scope('input'):
    feature_columns = list(self._features_columns().values())
    input_layer = tf.feature_column.input_layer(
        features=features, feature_columns=feature_columns)
    input_layer = tf.reshape(
        input_layer,
        shape=(-1, self.parameters.size, self.parameters.size,
               self.parameters.channels))
Pay attention to the last part: I had to reshape the tensor, and the -1 lets TensorFlow figure out the batch size.

I believe the issue was that feature_columns = [tf.feature_column.numeric_column(k) for k in ['mean_rgb']] assumes the column is a scalar, when it is actually a 1024-dimensional vector. I had to add shape=1024 to the numeric_column call. I also had to remove the existing checkpoint of the saved model.
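For reference, a minimal sketch of the corrected column definition (numeric_column accepts an int or a tuple for shape):

feature_columns = [tf.feature_column.numeric_column('mean_rgb', shape=1024)]  # 1024-wide vector, not a scalar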

Related

how to add text preprocessing tokenization step into Tensorflow model

I have a TensorFlow SavedModel which includes saved_model.pb and a variables folder. The preprocessing step has not been incorporated into this model, which is why I need to do the preprocessing (tokenization etc.) before feeding the data to the model for prediction.
I am looking for an approach to incorporate the preprocessing step into the model itself. I have seen examples here and here, however they deal with image data.
Just to give an idea of how the training part was done, this is a portion of the training code (if you need the implementation of a function I have used here, please let me know; I did not include them, to keep the question understandable).
Training:
processor = IntentProcessor(FLAGS.data_path, FLAGS.test_data_path,
                            FLAGS.test_proportion, FLAGS.seed, FLAGS.do_early_stopping)
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
tokenizer = tokenization.FullTokenizer(
    vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
run_config = tf.estimator.RunConfig(
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps)
train_examples = None
num_train_steps = None
num_warmup_steps = None
if FLAGS.do_train:
    train_examples = processor.get_train_examples()
    num_iter_per_epoch = int(len(train_examples) / FLAGS.train_batch_size)
    num_train_steps = num_iter_per_epoch * FLAGS.num_train_epochs
    num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
    run_config = tf.estimator.RunConfig(
        model_dir=FLAGS.output_dir,
        save_checkpoints_steps=num_iter_per_epoch)
best_temperature = 1.0  # Initiate the best T value as 1.0; it will be
                        # updated during training
model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=len(processor.le.classes_),
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    best_temperature=best_temperature,
    seed=FLAGS.seed)
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    config=run_config)
# add parameters by passing a params variable
if FLAGS.do_train:
    train_features = convert_examples_to_features(
        train_examples, FLAGS.max_seq_length, tokenizer)
    train_labels = processor.get_train_labels()
    train_input_fn = input_fn_builder(
        features=train_features,
        is_training=True,
        batch_size=FLAGS.train_batch_size,
        seed=FLAGS.seed,
        labels=train_labels
    )
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
And this is the preprocessing that I use for the training:
LABEL_LIST = ['negative', 'neutral', 'positive']
INTENT_MAP = {i: LABEL_LIST[i] for i in range(len(LABEL_LIST))}
BATCH_SIZE = 1
MAX_SEQ_LEN = 70

def convert_examples_to_features(texts, max_seq_length, tokenizer):
    """Loads a data file into a list of InputBatches.
    texts is the list of input text
    """
    features = {}
    input_ids_list = []
    input_mask_list = []
    segment_ids_list = []
    for (ex_index, text) in enumerate(texts):
        tokens_a = tokenizer.tokenize(str(text))
        # Account for [CLS] and [SEP] with "- 2"
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[0:(max_seq_length - 2)]
        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            segment_ids.append(0)
        tokens.append("[SEP]")
        segment_ids.append(0)
        input_ids = tokenizer.convert_tokens_to_ids(tokens)
        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)
        # Zero-pad up to the sequence length.
        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)
        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length
        input_ids_list.append(input_ids)
        input_mask_list.append(input_mask)
        segment_ids_list.append(segment_ids)
    features['input_ids'] = np.asanyarray(input_ids_list)
    features['input_mask'] = np.asanyarray(input_mask_list)
    features['segment_ids'] = np.asanyarray(segment_ids_list)
    # tf.data.Dataset.from_tensor_slices needs to be passed numpy arrays, not
    # tensors, or the tensor graph (shape) should match
    return features
And inference would look like this:
def inference(texts, MODEL_DIR, VOCAB_FILE):
    if not isinstance(texts, list):
        texts = [texts]
    tokenizer = FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=False)
    features = convert_examples_to_features(texts, MAX_SEQ_LEN, tokenizer)
    predict_fn = predictor.from_saved_model(MODEL_DIR)
    response = predict_fn(features)
    return get_sentiment(response)

def preprocess(texts):
    if not isinstance(texts, list):
        texts = [texts]
    tokenizer = FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=False)
    features = convert_examples_to_features(texts, MAX_SEQ_LEN, tokenizer)
    return features

def get_sentiment(response):
    idx = response['intent'].tolist()
    outputs = []
    for i in range(0, len(idx)):
        outputs.append({
            "sentiment": INTENT_MAP.get(idx[i]),
            "confidence": response['prob'][i][idx[i]]
        })
    return outputs

sentence = 'The movie is ok'
inference(sentence, args.model_path, args.vocab_path)
And this is the implementation of model_fn_builder:
def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps, best_temperature, seed):
    """Returns multi-intents `model_fn` closure for Estimator"""

    def model_fn(features, labels, mode,
                 params):  # pylint: disable=unused-argument
        """The `model_fn` for Estimator."""
        tf.logging.info("*** Features ***")
        for name in sorted(features.keys()):
            tf.logging.info(
                "  name = %s, shape = %s" % (name, features[name].shape))
        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]
        is_training = (mode == tf.estimator.ModeKeys.TRAIN)
        (total_loss, per_example_loss, logits) = create_intent_model(
            bert_config, is_training, input_ids, input_mask, segment_ids,
            labels, num_labels, mode, seed)
        tvars = tf.trainable_variables()
        initialized_variable_names = None
        if init_checkpoint:
            (assignment_map,
             initialized_variable_names) = \
                modeling.get_assignment_map_from_checkpoint(
                    tvars, init_checkpoint)
            tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
        tf.logging.info("**** Trainable Variables ****")
        for var in tvars:
            init_string = ""
            if var.name in initialized_variable_names:
                init_string = ", *INIT_FROM_CKPT*"
            tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                            init_string)
        output_spec = None
        if mode == tf.estimator.ModeKeys.TRAIN:
            train_op = optimization.create_optimizer(
                total_loss, learning_rate, num_train_steps, num_warmup_steps)
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                train_op=train_op)
        elif mode == tf.estimator.ModeKeys.EVAL:
            def metric_fn(per_example_loss, labels, logits):
                predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
                accuracy = tf.metrics.accuracy(labels, predictions)
                loss = tf.metrics.mean(per_example_loss)
                return {
                    "eval_accuracy": accuracy,
                    "eval_loss": loss
                }
            eval_metrics = metric_fn(per_example_loss, labels, logits)
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                eval_metric_ops=eval_metrics)
        elif mode == tf.estimator.ModeKeys.PREDICT:
            predictions = {
                'intent': tf.argmax(logits, axis=-1, output_type=tf.int32),
                'prob': tf.nn.softmax(logits / tf.constant(best_temperature)),
                'logits': logits
            }
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                predictions=predictions)
        return output_spec

    return model_fn
And this is the implementation of create_intent_model:
def create_intent_model(bert_config, is_training, input_ids, input_mask,
                        segment_ids, labels, num_labels, mode, seed):
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=False,
        seed=seed
    )
    output_layer = model.get_pooled_output()
    hidden_size = output_layer.shape[-1].value
    with tf.variable_scope("loss"):
        output_weights = tf.get_variable(
            "output_weights", [num_labels, hidden_size],
            initializer=tf.truncated_normal_initializer(stddev=0.02, seed=seed))
        output_bias = tf.get_variable(
            "output_bias", [num_labels], initializer=tf.zeros_initializer())
        if is_training:
            # i.e., 0.1 dropout
            output_layer = tf.nn.dropout(output_layer, keep_prob=0.9, seed=seed)
        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        loss = None
        per_example_loss = None
        if mode == tf.estimator.ModeKeys.TRAIN or mode == \
                tf.estimator.ModeKeys.EVAL:
            log_probs = tf.nn.log_softmax(logits, axis=-1)
            one_hot_labels = tf.one_hot(labels, depth=num_labels,
                                        dtype=tf.float32)
            per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs,
                                              axis=-1)
            loss = tf.reduce_mean(per_example_loss)
    return loss, per_example_loss, logits
These are the TensorFlow-related libraries:
tensorboard==1.15.0
tensorflow-estimator==1.15.1
tensorflow-gpu==1.15.0
There is good documentation here; however, it uses the Keras API. Also, I don't know how I can incorporate a preprocessing layer there, even with the Keras API.
Again, my final goal is to incorporate the preprocessing step into the model-building phase, so that when I later load the model I can directly pass 'The movie is ok' to it.
I just need an idea of how to incorporate a preprocessing layer into this code, which is function-based.
Thanks in advance~
You can use the TextVectorization layer as follows. But to answer your question fully, I'd need to know what's in the model_fn_builder() function. I'll show how you can do this with the Keras model-building API.
class BertTextProcessor(tf.keras.layers.Layer):
    def __init__(self, max_length):
        super().__init__()
        self.max_length = max_length
        # Here I'm setting standardization to None;
        # by default this layer lowercases and removes punctuation,
        # i.e. tokens like [CLS] would become cls
        self.vectorizer = tf.keras.layers.TextVectorization(
            output_sequence_length=max_length, standardize=None)

    def call(self, inputs):
        inputs = "[CLS] " + inputs + " [SEP]"
        tok_inputs = self.vectorizer(inputs)
        return {
            "input_ids": tok_inputs,
            "input_mask": tf.cast(tok_inputs != 0, 'int32'),
            "segment_ids": tf.zeros_like(tok_inputs)
        }

    def adapt(self, data):
        data = "[CLS] " + data + " [SEP]"
        self.vectorizer.adapt(data)

    def get_config(self):
        return {
            "max_length": self.max_length
        }
Usage,
input_str = tf.constant(["movie is okay good plot very nice", "terrible movie bad actors not good"])
proc = BertTextProcessor(8)
# You need to call this so that the vectorizer layer learns the vocabulary
proc.adapt(input_str)
print(proc(input_str))
which outputs,
{'input_ids': <tf.Tensor: shape=(2, 10), dtype=int64, numpy=
array([[ 5, 2, 12, 9, 3, 8, 6, 11, 4, 0],
[ 5, 7, 2, 13, 14, 10, 3, 4, 0, 0]])>, 'input_mask': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]], dtype=int32)>, 'segment_ids': <tf.Tensor: shape=(2, 10), dtype=int64, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}
You can use this layer as an input for a Keras model as you would use any layer.
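For instance, here is a minimal sketch of wiring it into a functional-API model; the downstream encoder/classifier layers are omitted, and proc is assumed to be the adapted layer from above:

text_input = tf.keras.Input(shape=(), dtype=tf.string)
bert_inputs = proc(text_input)  # dict with input_ids, input_mask, segment_ids
# ... feed bert_inputs into your encoder/classifier layers here ...
preprocessing_model = tf.keras.Model(text_input, bert_inputs)
print(preprocessing_model(tf.constant(["movie is okay"])))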
You can also get the vocabulary using, proc.vectorizer.get_vocabulary() which returns,
['',
'[UNK]',
'movie',
'good',
'[SEP]',
'[CLS]',
'very',
'terrible',
'plot',
'okay',
'not',
'nice',
'is',
'bad',
'actors']
Alternative with tf-models-official
To get data in a format accepted by BERT, you can also use the tf-models-official library. Specifically, you can use the BertPackInputs object.
I recently updated code for one of my books and in Chapter 13/13.1_Spam_Classification you can see how it is used. The section Generating the correct input format for BERT shows how this could be done.
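As a rough sketch of that approach (the constructor arguments shown are from the tf-models-official docs as I recall them, and the token ids are made up; double-check the API for your installed version):

import tensorflow as tf
import tensorflow_models as tfm

# hypothetical ragged batch of wordpiece ids, as a BERT tokenizer would emit;
# 101/102/0 are the usual [CLS]/[SEP]/[PAD] ids in BERT vocabularies
token_ids = tf.ragged.constant([[1996, 3185, 2003, 7929], [6659, 3185]])
packer = tfm.nlp.layers.BertPackInputs(
    seq_length=8,
    start_of_sequence_id=101,
    end_of_segment_id=102,
    padding_id=0)
bert_inputs = packer([token_ids])
# returns a dict with 'input_word_ids', 'input_mask', 'input_type_ids'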
Edit: How to do this in tensorflow==1.15.0
In order to do this in TensorFlow 1.x you will need some reworking, as a lot of the functionality used in the original answer is missing. Here's an example of how you can do this; you will need to adapt this code to your specific use case/method.
lookup_layer = tf.lookup.StaticHashTable(
    tf.lookup.TextFileInitializer(
        "vocab.txt", tf.string, tf.lookup.TextFileIndex.WHOLE_LINE,
        tf.int64, tf.lookup.TextFileIndex.LINE_NUMBER, delimiter=" "),
    100
)
text = tf.constant(["bad film", "movie is okay good plot very nice", "terrible movie bad actors not good"])
text = "[CLS]" + text + "[SEP]"
text = tf.strings.split(text, result_type="RaggedTensor")
text_dense = text.to_tensor("[PAD]")
out = lookup_layer.lookup(text_dense)

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(out))

Why does the same Tensorflow model work with a list of arrays but doesn't work with tf.data.Dataset unbatched?

I have the following simple setup:
import tensorflow as tf

def make_mymodel():
    class MyModel(tf.keras.models.Model):
        def __init__(self):
            super(MyModel, self).__init__()
            self.model = tf.keras.Sequential([
                tf.keras.layers.Input(shape=(1, 2)),
                tf.keras.layers.Dense(1)
            ])

        def call(self, x):
            return self.model(x)

    mymodel = MyModel()
    return mymodel

model = make_mymodel()

X = [[[1, 1]],
     [[2, 2]],
     [[10, 10]],
     [[20, 20]],
     [[50, 50]]]
y = [1, 2, 10, 20, 50]

# ds_n_X = tf.data.Dataset.from_tensor_slices(X)
# ds_n_Y = tf.data.Dataset.from_tensor_slices(y)
# ds = tf.data.Dataset.zip((ds_n_X, ds_n_Y))
#
# for input, label in ds:
#     print(input.numpy(), label.numpy())

loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=False)
model.build((1, 2))
model.compile(optimizer='adam',
              loss=loss_fn)
model.summary()
model.fit(X, y, epochs=10)
print(model.predict([
    [[25, 25]]
]))
This works fine (although I get strange predictions), but when I uncomment the ds lines and change model.fit(X, y, epochs=10) to model.fit(ds, epochs=10), I get the following error:
Traceback (most recent call last):
File "example_dataset.py", line 51, in <module>
model.fit(ds, epochs=10)
...
ValueError: slice index 0 of dimension 0 out of bounds. for '{{node strided_slice}} = StridedSlice[Index=DT_INT32, T=DT_INT32, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1](Shape, strided_slice/stack, strided_slice/stack_1, strided_slice/stack_2)' with input shapes: [0], [1], [1], [1] and with computed input tensors: input[1] = <0>, input[2] = <1>, input[3] = <1>.
The error gets solved when I run model.fit(ds.batch(2), epochs=10) (I added a batch instruction to the dataset).
I expect to be able to use a list of arrays and tf.data.Dataset interchangeably but, for some reason, I need to add a batch dimension to the dataset in order to use tf.data.Dataset. Is this expected behavior or am I conceptually missing something?
This is because the model expects its input as (batch_dim, input_dim). So, for your data, each input to the model should have shape (None, 1, 2).
Let's explore the dimensions of your data as an array and as a dataset. When you define your input as an array, the shape is:
>>> print(np.array(X).shape)
(5, 1, 2)
This is compatible with what the model expects. But when you define a dataset from your array, the shape is:
>>> for input, label in ds.take(1):
...     print(input.numpy().shape)
(1, 2)
And this is incompatible with what the model expects. If we batch the data:
>>> ds = ds.batch(1)
>>> for input, label in ds.take(1):
...     print(input.numpy().shape)
(1, 1, 2)
Then it will be fine to pass the dataset to model.fit().
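Applied to the original script, a minimal sketch of the fix (using the variable names from the question):

ds = tf.data.Dataset.zip((ds_n_X, ds_n_Y)).batch(2)  # add the batch dimension
model.fit(ds, epochs=10)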

Tensorflow embeddings InvalidArgumentError: indices[18,16] = 11905 is not in [0, 11905) [[node sequential_1/embedding_1/embedding_lookup

I am using TF 2.2.0 and trying to create a Word2Vec CNN text classification model. However I tried it, there was always an issue with the model or the embedding layers. I could not find a clear solution on the internet, so I decided to ask here.
import multiprocessing

modelW2V = gensim.models.Word2Vec(filtered_stopwords_list, size=100, min_count=5, window=5, sg=0, iter=10, workers=multiprocessing.cpu_count() - 1)
model_save_location = "3000tweets_notbinary"
modelW2V.wv.save_word2vec_format(model_save_location)

word2vec = {}
with open('3000tweets_notbinary', encoding='UTF-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec

num_words = len(list(tokenizer.word_index))
embedding_matrix = np.random.uniform(-1, 1, (num_words, 100))
for word, i in tokenizer.word_index.items():
    if i < num_words:
        embedding_vector = word2vec.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
        else:
            embedding_matrix[i] = np.zeros((100,))
I created my word2vec weights with the code above and then converted them to an embedding_matrix, following many tutorials. Since a lot of words in the tokenizer vocabulary have no word2vec embedding, I assign a zero vector whenever there is no embedding. I then fed the data and this embedding to a tf sequential model.
seq_leng = max_tokens
vocab_size = num_words
embedding_dim = 100
filter_sizes = [3, 4, 5]
num_filters = 512
drop = 0.5
epochs = 5
batch_size = 32

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embedding_dim,
                              weights=[embedding_matrix],
                              input_length=max_tokens,
                              trainable=False),
    tf.keras.layers.Conv1D(num_filters, 7, activation="relu", padding="same"),
    tf.keras.layers.MaxPool1D(2),
    tf.keras.layers.Conv1D(num_filters, 7, activation="relu", padding="same"),
    tf.keras.layers.MaxPool1D(),
    tf.keras.layers.Dropout(drop),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(3, activation="softmax")
])

model.compile(loss="categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, epsilon=1e-06),
              metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
model.summary()
history = model.fit(x_train_pad, y_train2, batch_size=60, epochs=epochs, shuffle=True, verbose=1)
But when I run this code, TensorFlow gives me the following error at a random point during training, and I could not find any solution to it. I have tried adding + 1 to vocab_size, but when I do that I get a size-mismatch error which does not even let me compile the model. Can anyone please help me?
InvalidArgumentError: indices[18,16] = 11905 is not in [0, 11905)
[[node sequential_1/embedding_1/embedding_lookup (defined at <ipython-input-26-ef1b16cf85bf>:1) ]] [Op:__inference_train_function_1533]
Errors may have originated from an input operation.
Input Source operations connected to node sequential_1/embedding_1/embedding_lookup:
sequential_1/embedding_1/embedding_lookup/991 (defined at /usr/lib/python3.6/contextlib.py:81)
Function call stack:
train_function
I solved this issue. I was adding a new dimension to vocab_size (vocab_size + 1) as suggested by others, but then the sizes of the layer dimensions and the embedding matrix no longer matched. Adding a zero vector at the end of my embedding matrix solved the issue.
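In code, that fix looks roughly like this (variable names follow the question; the embedding dimension there is 100):

vocab_size = num_words + 1  # one extra row for the previously out-of-range index
embedding_matrix = np.vstack([embedding_matrix, np.zeros((1, 100))])  # zero row at the end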

Tensorflow: Input pipeline with sparse data for the SVM estimator

Introduction:
I am trying to train the TensorFlow SVM estimator tensorflow.contrib.learn.python.learn.estimators.svm with sparse data. There is sample usage with sparse data in the GitHub repo at tensorflow/contrib/learn/python/learn/estimators/svm_test.py#L167 (I am not allowed to post more links, so here is the relative path).
The svm estimator expects the parameters example_id_column and feature_columns, where the feature columns should be derived from the class FeatureColumn, such as tf.contrib.layers.feature_column.sparse_column_with_hash_bucket. See the GitHub repo at tensorflow/contrib/learn/python/learn/estimators/svm.py#L85 and the documentation at tensorflow.org at python/contrib.layers#Feature_columns.
Question:
How do I have to set up my input pipeline to format sparse data in such a way that I can use one of the tf.contrib.layers feature_columns as input for the svm estimator?
What would a dense input function with many features look like?
Background
The data that I use is the a1a dataset from the LIBSVM website. The dataset has 123 features (which would correspond to 123 feature_columns if the data were dense). I wrote a user op to read the data, like tf.decode_csv() but for the LIBSVM format. The op returns the labels as a dense tensor and the features as a sparse tensor. My input pipeline:
NUM_FEATURES = 123
batch_size = 200

# my op to parse the libsvm data
decode_libsvm_module = tf.load_op_library('./libsvm.so')

def input_pipeline(filename_queue, batch_size):
    with tf.name_scope('input'):
        reader = tf.TextLineReader(name="TextLineReader_")
        _, libsvm_row = reader.read(filename_queue, name="libsvm_row_")
        min_after_dequeue = 1000
        capacity = min_after_dequeue + 3 * batch_size
        batch = tf.train.shuffle_batch([libsvm_row], batch_size=batch_size,
                                       capacity=capacity,
                                       min_after_dequeue=min_after_dequeue,
                                       name="text_line_batch_")
        labels, sp_indices, sp_values, sp_shape = \
            decode_libsvm_module.decode_libsvm(records=batch,
                                               num_features=123,
                                               OUT_TYPE=tf.int64,
                                               name="Libsvm_decoded_")
        # Return the features as sparse tensor and the labels as dense
        return tf.SparseTensor(sp_indices, sp_values, sp_shape), labels
def input_fn(dataset_name):
    maybe_download()
    filename_queue_train = tf.train.string_input_producer([dataset_name],
                                                          name="queue_t_")
    features, labels = input_pipeline(filename_queue_train, batch_size)
    return {
        'example_id': tf.as_string(tf.range(1, 123, 1, dtype=tf.int64)),
        'features': features
    }, labels
This is what I tried so far:
with tf.Session().as_default() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    feature_column = tf.contrib.layers.sparse_column_with_hash_bucket(
        'features', hash_bucket_size=1000, dtype=tf.int64)
    svm_classifier = svm.SVM(feature_columns=[feature_column],
                             example_id_column='example_id',
                             l1_regularization=0.0,
                             l2_regularization=1.0)
    svm_classifier.fit(input_fn=lambda: input_fn(TRAIN),
                       steps=30)
    accuracy = svm_classifier.evaluate(
        input_fn=lambda: input_fn(features, labels),
        steps=1)['accuracy']
    print(accuracy)
    coord.request_stop()
    coord.join(threads)
    sess.close()
Here's an example, with made-up data, that works for me in TensorFlow 1.1.0-rc2. I think my comment was misleading; you're best off converting the ~100 binary features to real-valued features (tf.sparse_tensor_to_dense) and using a real_valued_column, since sparse_column_with_integerized_feature is hiding most of the useful information from the SVM Estimator.
import tensorflow as tf

batch_size = 10
num_features = 123
num_examples = 100

def input_fn():
    example_ids = tf.random_uniform(
        [batch_size], maxval=num_examples, dtype=tf.int64)
    # Construct a SparseTensor with features
    dense_features = (example_ids[:, None]
                      + tf.range(num_features, dtype=tf.int64)[None, :]) % 2
    non_zeros = tf.where(tf.not_equal(dense_features, 0))
    sparse_features = tf.SparseTensor(
        indices=non_zeros,
        values=tf.gather_nd(dense_features, non_zeros),
        dense_shape=[batch_size, num_features])
    features = {
        'some_sparse_features': tf.sparse_tensor_to_dense(sparse_features),
        'example_id': tf.as_string(example_ids)}
    labels = tf.equal(dense_features[:, 0], 1)
    return features, labels

svm = tf.contrib.learn.SVM(
    example_id_column='example_id',
    feature_columns=[
        tf.contrib.layers.real_valued_column(
            'some_sparse_features')],
    l2_regularization=0.1, l1_regularization=0.5)
svm.fit(input_fn=input_fn, steps=1000)

positive_example = lambda: {
    'some_sparse_features': tf.sparse_tensor_to_dense(
        tf.SparseTensor([[0, 0]], [1], [1, num_features])),
    'example_id': ['a']}
print(svm.evaluate(input_fn=input_fn, steps=20))
print(next(svm.predict(input_fn=positive_example)))

negative_example = lambda: {
    'some_sparse_features': tf.sparse_tensor_to_dense(
        tf.SparseTensor([[0, 0]], [0], [1, num_features])),
    'example_id': ['b']}
print(next(svm.predict(input_fn=negative_example)))
Prints:
{'accuracy': 1.0, 'global_step': 1000, 'loss': 1.0645389e-06}
{'logits': array([ 0.01612902], dtype=float32), 'classes': 1}
{'logits': array([ 0.], dtype=float32), 'classes': 0}
Since TensorFlow 1.5.0 there has been a built-in function to read LIBSVM data; refer to my answer here:
https://stackoverflow.com/a/56354308/3885491

Using Estimator for building an LSTM network

I am trying to build an LSTM network using an Estimator. My data looks like
X = [[1,2,3], [2,3,4], ... , [98,99,100]]
y = [2, 3, ... , 99]
I am using an Estimator:
regressor = learn.Estimator(model_fn=lstm_model,
                            params=model_params,
                            )
where the lstm_model function is
def lstm_model(features, targets, mode, params):
    def lstm_cells(layers):
        if isinstance(layers[0], dict):
            return [tf.nn.rnn_cell.BasicLSTMCell(layer['steps'], state_is_tuple=True) for layer in layers]
        return [tf.nn.rnn_cell.BasicLSTMCell(steps, state_is_tuple=True) for steps in layers]

    stacked_lstm = tf.nn.rnn_cell.MultiRNNCell(lstm_cells(params['rnn_layers']), state_is_tuple=True)
    output, layers = tf.nn.rnn(stacked_lstm, [features], dtype=tf.float32)
    return learn.models.linear_regression(output, targets)
and params are
model_params = {
    'steps': 1000,
    'learning_rate': 0.03,
    'batch_size': 24,
    'time_steps': 3,
    'rnn_layers': [{'steps': 3}],
    'dense_layers': [10, 10]
}
and then I do the fitting
regressor.fit(X, y)
The issue I am facing is that
output, layers = tf.nn.rnn(stacked_lstm, [features], dtype=tf.float32)
requires a sequence, but I am not sure how to split my features into a list of tensors. The shape of features inside the lstm_model function is (?, 3).
I have two questions: how do I do the training in batches, and how do I split 'features' so that
output, layers = tf.nn.rnn(stacked_lstm, [features], dtype=tf.float32)
doesn't throw an error? The error I am getting is
raise TypeError("%s that don't all match." % prefix)
TypeError: Tensors in list passed to 'values' of 'Concat' Op have types [float64, float32] that don't all match.
I am using tensorflow 0.12
I had to set the shape of features to (batch_size, time_step, 1) or (None, time_step, 1), and then unstack the features before they go into the RNN. Unstack the features along the "time_step" axis so that you get a list of tensors whose length is the number of time steps, where each tensor has shape (None, 1) or (batch_size, 1).
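A minimal sketch of that reshaping inside lstm_model, assuming time_steps = 3 as in the question's params (the tf.cast also addresses the float64/float32 mismatch from the error above):

features = tf.cast(features, tf.float32)     # avoid the float64/float32 Concat error
features = tf.reshape(features, [-1, 3, 1])  # (batch_size, time_steps, 1)
x = tf.unstack(features, num=3, axis=1)      # list of 3 tensors, each (batch_size, 1)
output, state = tf.nn.rnn(stacked_lstm, x, dtype=tf.float32)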