Saving and running wide_deep.py model - tensorflow

I've been playing around with the Tensorflow Wide and Deep tutorial using the census dataset.
The linear/wide tutorial states:
We will train a logistic regression model, and given an individual's information our model will output a number between 0 and 1
At the moment, I can't work out how to predict the output of an individual input (copied from the unit test):
TEST_INPUT_VALUES = {
'age': 18,
'education_num': 12,
'capital_gain': 34,
'capital_loss': 56,
'hours_per_week': 78,
'education': 'Bachelors',
'marital_status': 'Married-civ-spouse',
'relationship': 'Husband',
'workclass': 'Self-emp-not-inc',
'occupation': 'abc',
}
How can we predict and output whether this person is likely to earn <50k (0) or >=50k (1)?

The function is predict, but I didn't figure out how to input one example data directly (I tried numpy_input_fn and dict of tensors).
Instead, using the input function in wide_deep.py to write down the data into a temporary csv file then read it, the predict function can be used:
TEST_INPUT = ('18,Self-emp-not-inc,987,Bachelors,12,Married-civ-spouse,abc,'
'Husband,zyx,wvu,34,56,78,tsr,<=50K')
# Create temporary CSV file
input_csv = '/tmp/census_model/test.csv'
with tf.gfile.Open(input_csv, 'w') as temp_csv:
temp_csv.write(TEST_INPUT)
# restore model trained by wide_deep.py with same model_dir and model_type
model = wide_deep.build_estimator(FLAGS.model_dir, FLAGS.model_type)
pred_iter = model.predict(input_fn=lambda: wide_deep.input_fn(input_csv, 1, False, 1))
for pred in pred_iter:
# print(pred)
print(pred['classes'])
There are other attributes like probability, logits etc in pred.

Hookay, I can answer this now.. So if you want to evaluate the test set accuracy, you can follow the accepted answer, but if you want to make your own predictions, here are the steps.
First, construct a new input_fn, notice you need to alter the columns and the default column values since the label column wouldn't be there.
def parse_csv(value):
print('Parsing', data_file)
columns = tf.decode_csv(value, record_defaults=_PREDICT_COLUMNS_DEFAULTS)
features = dict(zip(_PREDICT_COLUMNS, columns))
return features
def predict_input_fn(data_file):
assert tf.gfile.Exists(data_file), ('%s not found. Please make sure the path is correct.' % data_file)
dataset = tf.data.TextLineDataset(data_file)
dataset = dataset.map(parse_csv, num_parallel_calls=5)
dataset = dataset.batch(1) # => This is very important to get the rank correct
iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
return features
Then you can just call it simply by
results = model.predict(
input_fn=lambda: predict_input_fn(data_file='test.csv')
)

Related

Run prediction from saved model in tensorflow 2.0

I have a saved model (a directory with model.pd and variables) and wanted to run predictions on a pandas data frame.
I've unsuccessfully tried a few ways to do this:
Attempt 1: Restore the estimator from the saved model
estimator = tf.estimator.LinearClassifier(
feature_columns=create_feature_cols(),
model_dir=path,
warm_start_from=path)
Where path is the directory that has a model.pd and variables folder. I got an error
ValueError: Tensor linear/linear_model/dummy_feature1/weights is not found in
gs://bucket/Trainer/output/2013/20191008T170504.583379-63adee0eaee0/serving_model_dir/export/1570554483/variables/variables
checkpoint {'linear/linear_model/dummy_feature1/weights': [1, 1], 'linear/linear_model/dummy_feature2/weights': [1, 1]
}
Attempt 2: Run prediction directly from the saved model by running
imported = tf.saved_model.load(path) # path is the directory that has a `model.pd` and variables folder
imported.signatures["predict"](example)
But has not successfully passed the argument - looks like the function is looking for a tf.example and I am not sure how to convert a data frame to tf.example.
My attempt to convert is below but got an error that df[f] is not a tensor:
for f in features:
example.features.feature[f].float_list.value.extend(df[f])
I've seen solutions on StackOverflow but they are all tensorflow 1.14. Greatly appreciate it if someone can help with tensorflow 2.0.
Considering you have your saved model present like this:
my_model
assets saved_model.pb variables
You can load your saved model using:
new_model = tf.keras.models.load_model('saved_model/my_model')
# Check its architecture
new_model.summary()
To perform prediction on a DataFrame you need to:
Wrap scalars into a list so as to have a batch dimension (models only process batches of data, not single samples)
Call convert_to_tensor on each feature
Example 1:
If you have values for the first test row as
sample = {
'Type': 'Cat',
'Age': 3,
'Breed1': 'Tabby',
'Gender': 'Male',
'Color1': 'Black',
'Color2': 'White',
'MaturitySize': 'Small',
'FurLength': 'Short',
'Vaccinated': 'No',
'Sterilized': 'No',
'Health': 'Healthy',
'Fee': 100,
'PhotoAmt': 2,
}
input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = new_model.predict(input_dict)
prob = tf.nn.sigmoid(predictions[0])
print(
"This particular pet had a %.1f percent probability "
"of getting adopted." % (100 * prob)
)
Example 2:
Or if you have multiple rows present in the same order as the train data
predict_dataset = tf.convert_to_tensor([
[5.1, 3.3, 1.7, 0.5,],
[5.9, 3.0, 4.2, 1.5,],
[6.9, 3.1, 5.4, 2.1]
])
# training=False is needed only if there are layers with different
# behavior during training versus inference (e.g. Dropout).
predictions = new_model(predict_dataset, training=False)
for i, logits in enumerate(predictions):
class_idx = tf.argmax(logits).numpy()
p = tf.nn.softmax(logits)[class_idx]
name = class_names[class_idx]
print("Example {} prediction: {} ({:4.1f}%)".format(i, name, 100*p))

preprocess images with tf.data.experimental.make_csv_dataset or with read_csv option

I am adding this summarization of my issue to make it easier to understand:
I want to do exactly what is done in the following tensorflow example:
https://www.tensorflow.org/guide/datasets
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
image_string = tf.read_file(filename)
image_decoded = tf.image.decode_jpeg(image_string)
image_resized = tf.image.resize_images(image_decoded, [28, 28])
return image_resized, label
# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])
# `labels[i]` is the label for the image in `filenames[i].
labels = tf.constant([0, 37, ...])
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
The only differences are: I read the data from CSV that has many more features and then I call the map method:
dataset = tf.data.experimental.make_csv_dataset(file_pattern=CSV_PATH_TRAIN,
batch_size=2,
header=True,
label_name = 'label').map(_parse_function)
How does my _parse_function need to look like? How do I access the image path features, updates it to be an image presentation and return a modified numeric matrix feature of the image without changing anything at the other features?
thanks,
eilalan
==================Here are my code tries:==================
My code reads a CSV with feature columns and label. One of the features is image path, the others are strings.
The image path need to be processed into image numbers matrix.
I have tried doing so with the following options. In both ways tf.read_file fails with the input dimension error.
My question is how to pass one image at a time into the map methods
def read_image_png_option_1(image_path, depth=3, scale=False):
"""Reads the image from image_path (tf.string tensor) [jpg image].
Cast the result to float32 and if scale=True scale it in [-1,1]
using scale_image. Otherwise the values are in [0,1]
Reuturn:
the decoded jpeg image, casted to float32
"""
image = tf.image.convert_image_dtype(
tf.image.decode_png(tf.read_file(image_path), channels=depth),
dtype=tf.float32)
if scale:
image = scale_image(image)
return image
def read_image_png_option_2(features, depth=3, scale=False):
"""Reads the image from image_path (tf.string tensor) [jpg image].
Cast the result to float32 and if scale=True scale it in [-1,1]
using scale_image. Otherwise the values are in [0,1]
Reuturn:
the decoded jpeg image, casted to float32
"""
image = tf.image.convert_image_dtype(
tf.image.decode_png(tf.read_file(features['image']), channels=depth),
dtype=tf.float32)
if scale:
image = scale_image(image)
features['image'] = image
return features
def make_input_fn(fileName,batch_size=8, perform_shuffle=True):
"""An input function for training """
def _input_fn():
def decode_csv(line):
print('line is ',line)
filename_col,label_col,gender_col,ethinicity = tf.decode_csv(line,
[[""]]*amount_of_columns_csv,
field_delim=",",
na_value='NA',
select_cols=None)
image_col = read_image_png_option_1(filename_col)
d = dict(zip(['image','label','gender','ethinicity'], [image_col,label_col,gender_col,ethinicity])), label
return d
## OPTION 1:
# filenames could be more than one
# dataset = tf.data.TextLineDataset(filenames=fileName).skip(1).batch(batch_size).map(decode_csv)
## OPTION 2:
dataset = tf.data.experimental.make_csv_dataset(file_pattern=CSV_PATH_TRAIN,
batch_size=2,
header=True,
label_name = 'label').map(read_image_png_option_2)
#select_columns=[0,1]) #[tf.string,tf.string,tf.string,tf.string])
if perform_shuffle:
dataset = dataset.shuffle(buffer_size=256)
return dataset
return _input_fn()
train_input_fn = lambda: make_input_fn(CSV_PATH_TRAIN)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=50)
eval_input_fn = lambda: make_input_fn(CSV_PATH_VAL)
eval_spec = tf.estimator.EvalSpec(eval_input_fn)
feature_columns = [tf.feature_column.numeric_column("image",shape=(224,224)), # here i need a pyhton method to transform
tf.feature_column.categorical_column_with_vocabulary_list("gender", ["ww","ee"]),
tf.feature_column.categorical_column_with_vocabulary_list("ethinicity",["xx","yy"])]
estimator = tf.estimator.DNNClassifier(feature_columns=feature_columns,hidden_units=[1024, 512, 256],warm_start_from=ws)
tf.estimator.train_and_evaluate(estimator, train_spec=train_spec, eval_spec=eval_spec)
Error for option 2:
ValueError: Shape must be rank 0 but is rank 1 for 'ReadFile' (op: 'ReadFile') with input shapes: [2].
Error for option 1:
ValueError: Shape must be rank 0 but is rank 1 for 'ReadFile' (op: 'ReadFile') with input shapes: [?].
Any help is appreciated.
Thanks
First you need to read the CSV file into dataset.
Then for each row in your CSV you can call your parse function.
def getInput(fileList):
# returns a dataset containing list of filenames
files = tf.data.Dataset.from_tensor_slices(fileList)
# Returs a dataset containing list of rows taken from all the files in file list.
# dataset is filled dynamically and not all entries are read at once
dataset = files.interleave(tf.data.TextLineDataset)
# call parse function for each row
# returned dataset will contain list of whatever the parse function is returning for the row
# we want the image path to be converted to decoded image in parse function
dataset = dataset.map(_parse_function, num_parallel_calls=8)
# return an iterator for the dataset which will be used to get elements.
return dataset.make_one_shot_iterator().get_next()
The parse function will be passed only one parameter that will be a single row from the CSV file. You need to decode the CSV and do further processing on each value.
Let's say you have 3 columns in your CSV each being a string.
def _parse_function(value):
columns_default = [[""], [""], [""]]
# this will be a tensor of columns in the row
columns = tf.decode_csv(value, record_defaults=columns_default,
field_delim=',')
col_names = ["label", "imagepath", "c3"]
features = dict(zip(col_names, columns))
for f, tensor in features.items():
# process imagepath to decoded image
if f == "imagepath":
image_string = tf.read_file(tensor)
image_decoded = tf.image.decode_jpeg(image_string)
image_resized = tf.image.resize_images(image_decoded, [28, 28])
features[f] = image_resized
labels = tf.equal(features.pop('label'), "1")
labels = tf.expand_dims(labels, 0)
return features, labels
Edit:
Explanation for comment:
Dataset object simply contains a list of elements. The elements can be tensors or a tuple of tensors etc. Tensor object can contain anything. It could represent a single feature, a single record or a batch of record. Further dataset API provide handy methods to manipulate the elements within.
If you are using dataset with another API like estimator then they expect the dataset elements to be in specific format which is what need to return from our input function for eg.
https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#train
I have edited my code block above to describe what dataset object at each step will contain.
From what I understand is that you have image path as one of the field in your CSV and you want to convert that path into an actual decoded image which you will use as one of the feature.
Since the image is going to be just one of the feature, you should not try to create a dataset using image files alone. Dataset object will include all your features at once.
So doing this would be incorrect:
files = tf.data.Dataset.from_tensor_slices(ds['imagepath'])
dataset = files.interleave(tf.data.TextLineDataset)
If you are using make_csv() function to read your csv then it will convert each row of your csv into one record where one record will contain list of all features, same as columns of csv.
So each element in the returned dataset should contain a single tensor containing all your features.
Here your image path will be one of the features. now you want to transform that image path to decoded image.
I suppose you can do it by applying a parse function to elements of dataset using map() function but it will be slightly tricky as all your features are already packed inside a single tensor.

How to read (decode) tfrecords with tf.data API

I have a custom dataset, that I then stored as tfrecord, doing
# toy example data
label = np.asarray([[1,2,3],
[4,5,6]]).reshape(2, 3, -1)
sample = np.stack((label + 200).reshape(2, 3, -1))
def bytes_feature(values):
"""Returns a TF-Feature of bytes.
Args:
values: A string.
Returns:
A TF-Feature.
"""
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[values]))
def labeled_image_to_tfexample(sample_binary_string, label_binary_string):
return tf.train.Example(features=tf.train.Features(feature={
'sample/image': bytes_feature(sample_binary_string),
'sample/label': bytes_feature(label_binary_string)
}))
def _write_to_tf_record():
with tf.Graph().as_default():
image_placeholder = tf.placeholder(dtype=tf.uint16)
encoded_image = tf.image.encode_png(image_placeholder)
label_placeholder = tf.placeholder(dtype=tf.uint16)
encoded_label = tf.image.encode_png(image_placeholder)
with tf.python_io.TFRecordWriter("./toy.tfrecord") as writer:
with tf.Session() as sess:
feed_dict = {image_placeholder: sample,
label_placeholder: label}
# Encode image and label as binary strings to be written to tf_record
image_string, label_string = sess.run(fetches=(encoded_image, encoded_label),
feed_dict=feed_dict)
# Define structure of what is going to be written
file_structure = labeled_image_to_tfexample(image_string, label_string)
writer.write(file_structure.SerializeToString())
return
However I cannot read it. First I tried (based on http://www.machinelearninguru.com/deep_learning/tensorflow/basics/tfrecord/tfrecord.html , https://medium.com/coinmonks/storage-efficient-tfrecord-for-images-6dc322b81db4 and https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564)
def read_tfrecord_low_level():
data_path = "./toy.tfrecord"
filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)
reader = tf.TFRecordReader()
_, raw_records = reader.read(filename_queue)
decode_protocol = {
'sample/image': tf.FixedLenFeature((), tf.int64),
'sample/label': tf.FixedLenFeature((), tf.int64)
}
enc_example = tf.parse_single_example(raw_records, features=decode_protocol)
recovered_image = enc_example["sample/image"]
recovered_label = enc_example["sample/label"]
return recovered_image, recovered_label
I also tried variations casting enc_example and decoding it, such as in Unable to read from Tensorflow tfrecord file However when I try to evaluate them my python session just freezes and gives no output or traceback.
Then I tried using eager execution to see what is happening, but apparently it is only compatible with tf.data API. However as far as I understand transformations on tf.data API are made on the whole dataset. https://www.tensorflow.org/api_guides/python/reading_data mentions that a decode function must be written, but doesn't give an example on how to do that. All the tutorials I have found are made for TFRecordReader (which doesn't work for me).
Any help (pinpointing what I am doing wrong/ explaining what is happening/ indications on how to decode tfrecords with tf.data API) is highly appreciated.
According to https://www.youtube.com/watch?v=4oNdaQk0Qv4 and https://www.youtube.com/watch?v=uIcqeP7MFH0 tf.data is the best way to create input pipelines, so I am highly interested on learning that way.
Thanks in advance!
I am not sure why storing the encoded png causes the evaluation to not work, but here is a possible way of working around the problem. Since you mentioned that you would like to use the tf.data way of creating input pipelines, I'll show how to use it with your toy example:
label = np.asarray([[1,2,3],
[4,5,6]]).reshape(2, 3, -1)
sample = np.stack((label + 200).reshape(2, 3, -1))
First, the data has to be saved to the TFRecord file. The difference from what you did is that the image is not encoded to png.
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
writer = tf.python_io.TFRecordWriter("toy.tfrecord")
example = tf.train.Example(features=tf.train.Features(feature={
'label_raw': _bytes_feature(tf.compat.as_bytes(label.tostring())),
'sample_raw': _bytes_feature(tf.compat.as_bytes(sample.tostring()))}))
writer.write(example.SerializeToString())
writer.close()
What happens in the code above is that the arrays are turned into strings (1d objects) and then stored as bytes features.
Then, to read the data back using the tf.data.TFRecordDataset and tf.data.Iterator class:
filename = 'toy.tfrecord'
# Create a placeholder that will contain the name of the TFRecord file to use
data_path = tf.placeholder(dtype=tf.string, name="tfrecord_file")
# Create the dataset from the TFRecord file
dataset = tf.data.TFRecordDataset(data_path)
# Use the map function to read every sample from the TFRecord file (_read_from_tfrecord is shown below)
dataset = dataset.map(_read_from_tfrecord)
# Create an iterator object that enables you to access all the samples in the dataset
iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
label_tf, sample_tf = iterator.get_next()
# Similarly to tf.Variables, the iterators have to be initialised
iterator_init = iterator.make_initializer(dataset, name="dataset_init")
with tf.Session() as sess:
# Initialise the iterator passing the name of the TFRecord file to the placeholder
sess.run(iterator_init, feed_dict={data_path: filename})
# Obtain the images and labels back
read_label, read_sample = sess.run([label_tf, sample_tf])
The function _read_from_tfrecord() is:
def _read_from_tfrecord(example_proto):
feature = {
'label_raw': tf.FixedLenFeature([], tf.string),
'sample_raw': tf.FixedLenFeature([], tf.string)
}
features = tf.parse_example([example_proto], features=feature)
# Since the arrays were stored as strings, they are now 1d
label_1d = tf.decode_raw(features['label_raw'], tf.int64)
sample_1d = tf.decode_raw(features['sample_raw'], tf.int64)
# In order to make the arrays in their original shape, they have to be reshaped.
label_restored = tf.reshape(label_1d, tf.stack([2, 3, -1]))
sample_restored = tf.reshape(sample_1d, tf.stack([2, 3, -1]))
return label_restored, sample_restored
Instead of hard-coding the shape [2, 3, -1], you could also store that too into the TFRecord file, but for simplicity I didn't do it.
I made a little gist with a working example.
Hope this helps!

Tensorflow dataset batching for complex data

I tried to follow the example in this link:
https://www.tensorflow.org/programmers_guide/datasets
but I am totally lost about how to run the session. I understand the first argument is the operations to run, and feed_dict is the placeholders (my understanding is the batches of the training or test dataset),
So, here is my code:
batch_size = 100
handle_mix = tf.placeholder(tf.float64, shape=[])
handle_src0 = tf.placeholder(tf.float64, shape=[])
handle_src1 = tf.placeholder(tf.float64, shape=[])
handle_src2 = tf.placeholder(tf.float64, shape=[])
handle_src3 = tf.placeholder(tf.float64, shape=[])
I create the dataset from mp4 tracks and stems, reading mixture and sources magnitudes, and pad them to be suitable to batching
dataset = tf.data.Dataset.from_tensor_slices(
{"x_mixed":padded_lbl, "y_src0": padded_src[0], "y_src1":
padded_src[1],"y_src2": padded_src[1], "y_src3": padded_src[1]})
dataset = dataset.shuffle(1000).repeat().batch(batch_size)
iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
from the example I should do:
next_element = iterator.get_next()
training_init_op = iterator.make_initializer(dataset)
for _ in range(20):
# Initialize an iterator over the training dataset.
sess.run(training_init_op)
for _ in range(100):
sess.run(next_element)
However, I have a loss, summaries, and optimiser operations and need to feed the data as batches, following another example as:
l, _, summary = sess.run([loss_fn, optimizer, summary_op], feed_dict= {handle_mix: batch_mix, handle_src0: batch_src0, handle_src1: batch_src1, handle_src2: batch_src2, handle_src3: batch_src3})
So I thought something like:
batch_mix, batch_src0, batch_src1, batch_src2, batch_src3 = data.train.next_batch(batch_size)
or maybe a separate run to fetch the batches first, then run the optimisation as above, such as:
batch_mix, batch_src0, batch_src1, batch_src2, batch_src3 = sess.run(next_element)
l, _, summary = sess.run([loss_fn, optimizer, summary_op], feed_dict={handle_mix: batch_mix, handle_src0: batch_src0, handle_src1: batch_src1, handle_src2: batch_src2, handle_src3: batch_src3})
This last attempt, returned string names of the batches as created in the tf.data.Dataset.from_tensor_slices ("x_mixed", "y_src0", ... etc) and failed to cast to tf.float64 placeholders in the session.
Can you please let me know how to create this dataset, there might be an error in the structure from tensor slices in the first place, then how to batch them,
thank you very much,
The issue is that you packed your data into a dict when creating the dataset from tensor slices. This will result in iterator.get_next() returning each batch as a dict as well. If we do something like
d = {"a": 1, "b": 2}
k1, k2 = d
we get k1 == "a" and k2 == "b" (or the other way around due to unordered dict keys). That is, your attempt at unpacking the result of sess.run(next_element) just gives you the dict keys whereas you are interested in the dict values (tensors). This should work instead:
next_element = iterator.get_next()
x_mixed = next_element["x_mixed"]
y_src0 = next_element["y_src0"]
...
If you then build your model based on the variables x_mixed etc, it should work fine. Note that with the tf.data API you don't need placeholders! Tensorflow will see that your model output requires e.g. x_mixed, which is gotten from iterator.get_next(), so it will simply execute this op whenever you try to sess.run() your loss function/optimizer etc. If you're more comfortable with placeholders you can of course keep using them, just remember to unpack the dict properly. This should be about right:
batch_dict = sess.run(next_element)
l, _, summary = sess.run([loss_fn, optimizer, summary_op], feed_dict={handle_mix: batch_dict["x_mixed"], ... })

TensorFlow input function for reading sparse data (in libsvm format)

I'm new to TensorFlow and trying to use the Estimator API for some simple classification experiments. I have a sparse dataset in libsvm format. The following input function works for small datasets:
def libsvm_input_function(file):
def input_function():
indexes_raw = []
indicators_raw = []
values_raw = []
labels_raw = []
i=0
for line in open(file, "r"):
data = line.split(" ")
label = int(data[0])
for fea in data[1:]:
id, value = fea.split(":")
indexes_raw.append([i,int(id)])
indicators_raw.append(int(1))
values_raw.append(float(value))
labels_raw.append(label)
i=i+1
indexes = tf.SparseTensor(indices=indexes_raw,
values=indicators_raw,
dense_shape=[i, num_features])
values = tf.SparseTensor(indices=indexes_raw,
values=values_raw,
dense_shape=[i, num_features])
labels = tf.constant(labels_raw, dtype=tf.int32)
return {"indexes": indexes, "values": values}, labels
return input_function
However, for a dataset of a few GB size I get the following error:
ValueError: Cannot create a tensor proto whose content is larger than 2GB.
How can I avoid this error? How should I write an input function to read medium-sized sparse datasets (in libsvm format)?
When use estimator, for libsvm data input, you can create dense index list, dense value list, then use feature_column.categorical_column_with_identity and feature_column.weighted_categorical_column to create feature column, finally, put feature columns to estimator. Maybe your input features length is variable, you can use padded_batch to handle it.
here some codes:
## here is input_fn
def input_fn(data_dir, is_training, batch_size):
def parse_csv(value):
## here some process to create feature_indices list, feature_values list and labels
return {"index": feature_indices, "value": feature_values}, labels
dataset = tf.data.Dataset.from_tensor_slices(your_filenames)
ds = dataset.flat_map(
lambda f: tf.data.TextLineDataset(f).map(parse_csv)
)
ds = ds.padded_batch(batch_size, ds.output_shapes, padding_values=(
{
"index": tf.constant(-1, dtype=tf.int32),
"value": tf.constant(0, dtype=tf.float32),
},
tf.constant(False, dtype=tf.bool)
))
return ds.repeat().prefetch(batch_size)
## create feature column
def build_model_columns():
categorical_column = tf.feature_column.categorical_column_with_identity(
key='index', num_buckets=your_feature_dim)
sparse_columns = tf.feature_column.weighted_categorical_column(
categorical_column=categorical_column, weight_feature_key='value')
dense_columns = tf.feature_column.embedding_column(sparse_columns, your_embedding_dim)
return [sparse_columns], [dense_columns]
## when created feature column, you can put them into estimator, eg. put dense_columns into DNN, and sparse_columns into linear model.
## for export savedmodel
def raw_serving_input_fn():
feature_spec = {"index": tf.placeholder(shape=[None, None], dtype=tf.int32),
"value": tf.placeholder(shape=[None, None], dtype=tf.float32)}
return tf.estimator.export.build_raw_serving_input_receiver_fn(feature_spec)
Another way, you can create your custom feature column, like this: _SparseArrayCategoricalColumn
I have been using tensorflow.contrib.libsvm. Here's an example (i am using eager execution with generators)
import os
import tensorflow as tf
import tensorflow.contrib.libsvm as libsvm
def all_libsvm_files(folder_path):
for file in os.listdir(folder_path):
if file.endswith(".libsvm"):
yield os.path.join(folder_path, file)
def load_libsvm_dataset(path_to_folder):
return tf.data.TextLineDataset(list(all_libsvm_files(path_to_folder)))
def libsvm_iterator(path_to_folder):
dataset = load_libsvm_dataset(path_to_folder)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
yield libsvm.decode_libsvm(tf.reshape(next_element, (1,)),
num_features=666,
dtype=tf.float32,
label_dtype=tf.float32)
libsvm_iterator gives you a feature-label pair back on each iteration, from multiple files inside a folder that you specify.