Using tensorflow dataset with stratified sampling

Given a tensorflow dataset
Train_dataset = tf.data.Dataset.from_tensor_slices((Train_Image_Filenames,Train_Image_Labels))
Train_dataset = Train_dataset.map(Parse_JPEG_Augmented)
...
I would like to stratify my batches to deal with class imbalance. I found tf.contrib.training.stratified_sample and thought I could use it in the following way:
Train_dataset_iter = Train_dataset.make_one_shot_iterator()
Train_dataset_Image_Batch,Train_dataset_Label_Batch = Train_dataset_iter.get_next()
Train_Stratified_Images,Train_Stratified_Labels = tf.contrib.training.stratified_sample(Train_dataset_Image_Batch,Train_dataset_Label_Batch,[1/Classes]*Classes,Batch_Size)
But it gives the following error, and I'm not sure this would let me keep the performance benefits of the tensorflow Dataset, as I might then have to pass Train_Stratified_Images and Train_Stratified_Labels via feed_dict?
File "/xxx/xxx/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/sampling_ops.py", line 192, in stratified_sample
    with ops.name_scope(name, 'stratified_sample', list(tensors) + [labels]):
File "/xxx/xxx/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 459, in __iter__
    "Tensor objects are only iterable when eager execution is "
TypeError: Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn.
What would be the "best practice" way of using dataset with stratified batches?

Below is a simple example to demonstrate the usage of sample_from_datasets (thanks @Agade for the idea).
import math
import tensorflow as tf
import numpy as np
def print_dataset(name, dataset):
    elems = np.array([v.numpy() for v in dataset])
    print("Dataset {} contains {} elements :".format(name, len(elems)))
    print(elems)

def combine_datasets_balanced(dataset_smaller, size_smaller, dataset_bigger, size_bigger, batch_size):
    # repeat the smaller dataset so that the 2 datasets are about the same size
    ds_smaller_repeated = dataset_smaller.repeat(count=int(math.ceil(size_bigger / size_smaller)))
    # each element of the result is drawn (without replacement) from the even
    # dataset with probability 0.5 or from the odd dataset with probability 0.5
    balanced_dataset = tf.data.experimental.sample_from_datasets([ds_smaller_repeated, dataset_bigger], weights=[0.5, 0.5])
    balanced_dataset = balanced_dataset.take(2 * size_bigger).batch(batch_size)
    return balanced_dataset
N, M = 3, 10
even = tf.data.Dataset.range(0, 2 * N, 2).repeat(count=int(math.ceil(M / N)))
odd = tf.data.Dataset.range(1, 2 * M, 2)
even_odd = combine_datasets_balanced(even, N, odd, M, 2)
print_dataset("even", even)
print_dataset("odd", odd)
print_dataset("even_odd", even_odd)
Output :
Dataset even contains 12 elements : # 12 = 4 x N (because of .repeat)
[0 2 4 0 2 4 0 2 4 0 2 4]
Dataset odd contains 10 elements :
[ 1 3 5 7 9 11 13 15 17 19]
Dataset even_odd contains 10 elements : # 10 = 2 x M / 2 (2xM because of .take(2 * M) and /2 because of .batch(2))
[[ 0 2]
[ 1 4]
[ 0 2]
[ 3 4]
[ 0 2]
[ 4 0]
[ 5 2]
[ 7 4]
[ 0 9]
[ 2 11]]
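To see why the repeat-then-sample recipe balances the classes, here is a plain-Python mimic of the sampling step (a sketch of the behaviour only, not TensorFlow's actual implementation; `sample_interleaved` is a hypothetical helper):

```python
import random

def sample_interleaved(datasets, weights, rng):
    # Mimic sample_from_datasets: at each step pick one of the (finite)
    # input iterators according to `weights`; an exhausted iterator is
    # dropped, which effectively renormalizes the remaining weights.
    iters = [iter(d) for d in datasets]
    alive = list(range(len(iters)))
    out = []
    while alive:
        i = rng.choices(alive, weights=[weights[j] for j in alive])[0]
        try:
            out.append(next(iters[i]))
        except StopIteration:
            alive.remove(i)
    return out

even = [0, 2, 4] * 4          # minority elements, repeated as in the example
odd = list(range(1, 20, 2))   # majority elements
merged = sample_interleaved([even, odd], [0.5, 0.5], random.Random(0))
```

Every element of both inputs appears exactly once in `merged`; only the order is random, so batches drawn from it are balanced in expectation.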

Related

How to efficiently filter a specific number of entries and concatenate them into a single tf.data.Dataset?

I have a huge TFRecord file with more than 4M entries. It is a very unbalanced dataset, containing many more entries of some labels and very few of others, compared to the whole dataset. I want to filter a limited number of entries for some of these labels in order to have a balanced dataset. Below you can see my attempt, but it takes more than 24 hours to filter 1k entries from each label (33 different labels).
import tensorflow as tf
tf.compat.as_str(
    bytes_or_text='str', encoding='utf-8'
)
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
    print("Device:", tpu.master())
    strategy = tf.distribute.TPUStrategy(tpu)
except:
    strategy = tf.distribute.get_strategy()
print("Number of replicas:", strategy.num_replicas_in_sync)
ignore_order = tf.data.Options()
ignore_order.experimental_deterministic = False
dataset = tf.data.TFRecordDataset('/test.tfrecord')
dataset = dataset.with_options(ignore_order)
features, feature_lists = detect_schema(dataset)
# Decoding TFRecord serialized data
def decode_data(serialized):
    X, y = tf.io.parse_single_sequence_example(
        serialized,
        context_features=features,
        sequence_features=feature_lists)
    return X['title'], y['subject']
dataset = dataset.map(lambda x: tf.py_function(func=decode_data, inp=[x], Tout=(tf.string, tf.string)))
# Filtering and concatenating the samples
def balanced_dataset(dataset, labels_list, sample_size=1000):
    datasets_list = []
    for label in labels_list:
        # Filtering the chosen labels
        locals()[label] = dataset.filter(
            lambda x, y: tf.greater(
                tf.reduce_sum(tf.cast(tf.equal(tf.constant(label, dtype=tf.int64), y), tf.float32)),
                tf.constant(0.)))
        # Appending a limited sample
        datasets_list.append(locals()[label].take(sample_size))
    concat_dataset = datasets_list[0]
    # Concatenating the datasets
    for dset in datasets_list[1:]:
        concat_dataset = concat_dataset.concatenate(dset)
    return concat_dataset
balanced_data = balanced_dataset(tabledataset, labels_list=list(decod_dic.values()), sample_size=1000)
One way to solve this is by using the group_by_window method, where window_size would be the sample size of each class (in your case 1k).
ds = ds.group_by_window(
    # Use the label as key
    key_func=lambda _, l: l,
    # Convert each window to a batch of sample_size
    reduce_func=lambda _, window: window.batch(sample_size),
    # Use sample_size as the window size
    window_size=sample_size)
This will form single-class batches of size sample_size. But there is one problem: there will be multiple batches of the same class, while you need only one batch per class.
To solve this, we add a count to each batch and then keep only the batches where count == 0, which fetches the first batch of every class.
Let's define an example:
labels = np.array(sum([[label]*repeat for label, repeat in zip([0, 1, 2], [100, 200, 15])], []))
features = np.arange(len(labels))
np.unique(labels, return_counts=True)
# (array([0, 1, 2]), array([100, 200, 15]))
# There are 3 labels, chosen for simplicity, and their counts are shown above.
sample_size = 15  # we choose to pick a sample of 15 from each class
We create a dataset from the above inputs,
ds = tf.data.Dataset.from_tensor_slices((features, labels))
We now modify reduce_func in the window function above to maintain the counter, so each batch will have 3 components (X_batch, y_batch, label_counter):
def reduce_func(x, y):
    # Read and increment the per-class counter stored in the lookup table
    z = table.lookup(x)
    table.insert(x, z + 1)
    return y.batch(sample_size).map(lambda a, b: (a, b, z))

# Group by window
ds = tf.data.Dataset.from_tensor_slices((features, labels))
ds = ds.group_by_window(
    # Use the label as key
    key_func=lambda _, l: l,
    # Convert each window to a batch of sample_size
    reduce_func=reduce_func,
    # Use sample_size as the window size
    window_size=sample_size)
The counter logic in reduce_func is implemented with a lookup table, where the counter is read and then updated. It is initialized as shown below:
n_classes = 3
keys = tf.range(0, n_classes, dtype=tf.int64)
vals = tf.zeros_like(keys, dtype=tf.int64)
table = tf.lookup.experimental.MutableHashTable(key_dtype=tf.int64,
                                                value_dtype=tf.int64,
                                                default_value=-1)
table.insert(keys, vals)
Now we keep only the batches where count == 0 and drop the count element to form (X, y) batch pairs:
ds = ds.filter(lambda x, y, count: count==0)
ds = ds.map(lambda x, y, count: (x, y))
Output:
for x, y in ds:
    print(x.numpy(), y.numpy())
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[100 101 102 103 104 105 106 107 108 109 110 111 112 113 114] [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[300 301 302 303 304 305 306 307 308 309 310 311 312 313 314] [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
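The end-to-end effect of the pipeline (group elements by label, keep only the first full window of each class) can be sanity-checked with a plain-Python equivalent; this sketches only the logic, not the TF internals, and `first_window_per_class` is a hypothetical helper:

```python
def first_window_per_class(features, labels, sample_size):
    # Collect elements per label; emit an (X_batch, y_batch) pair the first
    # time a label accumulates sample_size elements, then ignore that label.
    buf, done, out = {}, set(), []
    for x, y in zip(features, labels):
        if y in done:
            continue
        buf.setdefault(y, []).append(x)
        if len(buf[y]) == sample_size:
            out.append((buf.pop(y), [y] * sample_size))
            done.add(y)
    return out

labels = [0] * 100 + [1] * 200 + [2] * 15
features = list(range(len(labels)))
batches = first_window_per_class(features, labels, 15)
# batches reproduces the printed output: the first 15 elements of each class.
```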

Julia - Gurobi Callbacks on array of JuMP variables

In Gurobi and JuMP 0.21, it is well documented here how you would access a variable within a callback:
using JuMP, Gurobi, Test
model = direct_model(Gurobi.Optimizer())
@variable(model, 0 <= x <= 2.5, Int)
@variable(model, 0 <= y <= 2.5, Int)
@objective(model, Max, y)
cb_calls = Cint[]
function my_callback_function(cb_data, cb_where::Cint)
    # You can reference variables outside the function as normal
    push!(cb_calls, cb_where)
    # You can select where the callback is run
    if cb_where != GRB_CB_MIPSOL && cb_where != GRB_CB_MIPNODE
        return
    end
    # You can query a callback attribute using GRBcbget
    if cb_where == GRB_CB_MIPNODE
        resultP = Ref{Cint}()
        GRBcbget(cb_data, cb_where, GRB_CB_MIPNODE_STATUS, resultP)
        if resultP[] != GRB_OPTIMAL
            return  # Solution is something other than optimal.
        end
    end
    # Before querying `callback_value`, you must call:
    Gurobi.load_callback_variable_primal(cb_data, cb_where)
    x_val = callback_value(cb_data, x)
    y_val = callback_value(cb_data, y)
    # You can submit solver-independent MathOptInterface attributes such as
    # lazy constraints, user-cuts, and heuristic solutions.
    if y_val - x_val > 1 + 1e-6
        con = @build_constraint(y - x <= 1)
        MOI.submit(model, MOI.LazyConstraint(cb_data), con)
    elseif y_val + x_val > 3 + 1e-6
        con = @build_constraint(y + x <= 3)
        MOI.submit(model, MOI.LazyConstraint(cb_data), con)
    end
    if rand() < 0.1
        # You can terminate the callback as follows:
        GRBterminate(backend(model))
    end
    return
end
# You _must_ set this parameter if using lazy constraints.
MOI.set(model, MOI.RawParameter("LazyConstraints"), 1)
MOI.set(model, Gurobi.CallbackFunction(), my_callback_function)
optimize!(model)
@test termination_status(model) == MOI.OPTIMAL
@test primal_status(model) == MOI.FEASIBLE_POINT
@test value(x) == 1
@test value(y) == 2
i.e., you would use x_val = callback_value(cb_data, x). However, what should you do when you have an array of variables with specific indexes not starting at 1, i.e. my variables are not in a vector but are declared with:
@variable(m, x[i=1:n, j=i+1:n], Bin)
Should I loop over both dimensions of x and call callback_value multiple times? If so, the indexes for j will not be the same for every i, will they?
Use broadcasting:
x_val = callback_value.(Ref(cb_data), x)
Or just call callback_value(cb_data, x[i, j]) when you need the value.
For example:
using JuMP, Gurobi
model = Model(Gurobi.Optimizer)
@variable(model, 0 <= x[i=1:3, j=i+1:3] <= 2.5, Int)
function my_callback_function(cb_data)
    x_val = callback_value.(Ref(cb_data), x)
    display(x_val)
    for i = 1:3, j = i+1:3
        con = @build_constraint(x[i, j] <= floor(Int, x_val[i, j]))
        MOI.submit(model, MOI.LazyConstraint(cb_data), con)
    end
end
MOI.set(model, MOI.LazyConstraintCallback(), my_callback_function)
optimize!(model)
yields
julia> optimize!(model)
Gurobi Optimizer version 9.1.0 build v9.1.0rc0 (mac64)
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads
Optimize a model with 0 rows, 3 columns and 0 nonzeros
Model fingerprint: 0x5d543c3a
Variable types: 0 continuous, 3 integer (0 binary)
Coefficient statistics:
Matrix range [0e+00, 0e+00]
Objective range [0e+00, 0e+00]
Bounds range [2e+00, 2e+00]
RHS range [0e+00, 0e+00]
JuMP.Containers.SparseAxisArray{Float64,2,Tuple{Int64,Int64}} with 3 entries:
[1, 2] = -0.0
[2, 3] = -0.0
[1, 3] = -0.0
JuMP.Containers.SparseAxisArray{Float64,2,Tuple{Int64,Int64}} with 3 entries:
[1, 2] = 2.0
[2, 3] = 2.0
[1, 3] = 2.0
JuMP.Containers.SparseAxisArray{Float64,2,Tuple{Int64,Int64}} with 3 entries:
[1, 2] = 2.0
[2, 3] = 2.0
[1, 3] = 2.0
JuMP.Containers.SparseAxisArray{Float64,2,Tuple{Int64,Int64}} with 3 entries:
[1, 2] = 2.0
[2, 3] = -0.0
[1, 3] = -0.0
Presolve time: 0.00s
Presolved: 0 rows, 3 columns, 0 nonzeros
Variable types: 0 continuous, 3 integer (0 binary)
JuMP.Containers.SparseAxisArray{Float64,2,Tuple{Int64,Int64}} with 3 entries:
[1, 2] = -0.0
[2, 3] = -0.0
[1, 3] = -0.0
Found heuristic solution: objective 0.0000000
Explored 0 nodes (0 simplex iterations) in 0.14 seconds
Thread count was 8 (of 8 available processors)
Solution count 1: 0
Optimal solution found (tolerance 1.00e-04)
Best objective 0.000000000000e+00, best bound 0.000000000000e+00, gap 0.0000%
User-callback calls 31, time in user-callback 0.14 sec

return the top_k items of each row for sparse tensor

For a dense tensor, we can use tf.math.top_k to find the values and indices of the k largest entries in the last dimension.
For the sparse tensor, I would like to efficiently get the top n items of each row, without converting the sparse tensor to dense.
This was kind of tricky, but here is something that works (it assumes a 2D sparse tensor, although I think the same idea should work with more outer dimensions). The idea is to first sort the whole sparse tensor (without making it dense) and then slice the first columns. To do that, I needed something like np.lexsort, which as far as I know is not provided in TensorFlow as such - however, tf.sparse.reorder actually does something like a lexsort, so I made another intermediate sparse tensor to take advantage of that.
import tensorflow as tf
import numpy as np
np.random.seed(0)
# Input data
k = 3
r = np.random.randint(10, size=(6, 8))
r[np.random.rand(*r.shape) < .5] = 0
sp = tf.sparse.from_dense(r)
print(tf.sparse.to_dense(sp).numpy())
# [[0 0 0 0 0 0 3 0]
# [2 4 0 6 8 0 0 6]
# [7 0 0 1 5 9 8 9]
# [4 0 0 3 0 0 0 3]
# [8 1 0 3 3 7 0 1]
# [0 0 0 0 7 0 0 7]]
# List of value indices
n = tf.size(sp.values, out_type=sp.indices.dtype)
r = tf.range(n)
# Sort values
s = tf.dtypes.cast(tf.argsort(sp.values, direction='DESCENDING'), sp.indices.dtype)
# Find destination index of each sorted value
si = tf.scatter_nd(tf.expand_dims(s, 1), r, [n])
# Abuse sparse tensor functionality to do lexsort with column and destination index
sp2 = tf.sparse.SparseTensor(indices=tf.stack([sp.indices[:, 0], si], axis=1),
                             values=r,
                             dense_shape=[sp.dense_shape[0], n])
sp2 = tf.sparse.reorder(sp2)
# Build top-k result
row = sp.indices[:, 0]
# Make column indices
d = tf.dtypes.cast(row[1:] - row[:-1] > 0, r.dtype)
m = tf.pad(r[1:] * d, [[1, 0]])
col = r - tf.scan(tf.math.maximum, m)
# Get only up to k elements per row
m = col < k
row_m = tf.boolean_mask(row, m)
col_m = tf.boolean_mask(col, m)
idx_m = tf.boolean_mask(sp2.values, m)
# Make result
scatter_idx = tf.stack([row_m, col_m], axis=-1)
scatter_shape = [sp.dense_shape[0], k]
# Use -1 for rows with less than k values
# (0 is ambiguous)
values = tf.tensor_scatter_nd_update(-tf.ones(scatter_shape, sp.values.dtype),
                                     scatter_idx, tf.gather(sp.values, idx_m))
indices = tf.tensor_scatter_nd_update(-tf.ones(scatter_shape, sp.indices.dtype),
                                      scatter_idx, tf.gather(sp.indices[:, 1], idx_m))
print(values.numpy())
# [[ 3 -1 -1]
# [ 8 6 6]
# [ 9 9 8]
# [ 4 3 3]
# [ 8 7 3]
# [ 7 7 -1]]
print(indices.numpy())
# [[ 6 -1 -1]
# [ 4 3 7]
# [ 5 7 6]
# [ 0 3 7]
# [ 0 5 3]
# [ 4 7 -1]]
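The answer above mentions np.lexsort, which NumPy does provide directly; as a cross-check of the snippet's output, the same row-wise top-k can be computed in NumPy (a sketch for the 2D case; `sparse_topk_rows` is a hypothetical helper, not part of any library):

```python
import numpy as np

def sparse_topk_rows(rows, cols, vals, n_rows, k):
    # Sort entries by (row ascending, value descending); lexsort treats the
    # last key as primary and is stable, so ties keep their column order.
    order = np.lexsort((-vals, rows))
    r, c, v = rows[order], cols[order], vals[order]
    # Rank of each entry within its row: searchsorted on the sorted row
    # vector yields the index where each row's run starts.
    pos = np.arange(len(r)) - np.searchsorted(r, r)
    keep = pos < k
    # -1 marks missing entries, as in the TF snippet (0 would be ambiguous).
    out_v = -np.ones((n_rows, k), dtype=vals.dtype)
    out_c = -np.ones((n_rows, k), dtype=cols.dtype)
    out_v[r[keep], pos[keep]] = v[keep]
    out_c[r[keep], pos[keep]] = c[keep]
    return out_v, out_c

dense = np.array([[0, 0, 0, 0, 0, 0, 3, 0],
                  [2, 4, 0, 6, 8, 0, 0, 6],
                  [7, 0, 0, 1, 5, 9, 8, 9],
                  [4, 0, 0, 3, 0, 0, 0, 3],
                  [8, 1, 0, 3, 3, 7, 0, 1],
                  [0, 0, 0, 0, 7, 0, 0, 7]])
rows, cols = np.nonzero(dense)
values, indices = sparse_topk_rows(rows, cols, dense[rows, cols], 6, 3)
# values and indices match the TensorFlow result above, e.g. row 1 -> [8, 6, 6].
```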
EDIT: Here is an alternative possibility, which may work well if your tensor is very sparse in all rows. The idea is to "condense" all the sparse tensor values into the first columns (as the previous snippet already computed with col) and then turn that into a dense tensor and apply top-k as usual. The caveat is that the indices will refer to the condensed tensor, so you have to take yet another step if you want the right indices with respect to the initial sparse tensor.
import tensorflow as tf
import numpy as np
np.random.seed(0)
# Input data
k = 3
r = np.random.randint(10, size=(6, 8))
r[np.random.rand(*r.shape) < .8] = 0
sp = tf.sparse.from_dense(r)
print(tf.sparse.to_dense(sp).numpy())
# [[0 0 0 0 0 0 3 0]
# [0 4 0 6 0 0 0 0]
# [0 0 0 0 5 0 0 9]
# [0 0 0 0 0 0 0 0]
# [8 0 0 0 0 7 0 0]
# [0 0 0 0 7 0 0 0]]
# Build "condensed" sparse tensor
n = tf.size(sp.values, out_type=sp.indices.dtype)
r = tf.range(n)
# Make indices
row = sp.indices[:, 0]
d = tf.dtypes.cast(row[1:] - row[:-1] > 0, r.dtype)
m = tf.pad(r[1:] * d, [[1, 0]])
col = r - tf.scan(tf.math.maximum, m)
# At least as many columns as k
ncols = tf.maximum(tf.math.reduce_max(col) + 1, k)
sp2 = tf.sparse.SparseTensor(indices=tf.stack([row, col], axis=1),
                             values=sp.values,
                             dense_shape=[sp.dense_shape[0], ncols])
# Get in dense form
condensed = tf.sparse.to_dense(sp2)
# Top-k (indices do not correspond to initial sparse matrix)
values, indices = tf.math.top_k(condensed, k)
print(values.numpy())
# [[3 0 0]
# [6 4 0]
# [9 5 0]
# [0 0 0]
# [8 7 0]
# [7 0 0]]
# Now get the right indices
sp3 = tf.sparse.SparseTensor(indices=tf.stack([row, col], axis=1),
                             values=sp.indices[:, 1],
                             dense_shape=[sp.dense_shape[0], ncols])
condensed_idx = tf.sparse.to_dense(sp3)
actual_indices = tf.gather_nd(condensed_idx, tf.expand_dims(indices, axis=-1),
                              batch_dims=1)
print(actual_indices.numpy())
# [[6 0 0]
# [3 1 0]
# [7 4 0]
# [0 0 0]
# [0 5 0]
# [4 0 0]]
Not sure whether this would be faster or not though.
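The "condense" step assigns each nonzero a within-row column index via a running maximum; the same computation in NumPy (using the row indices of the nonzeros in the second example matrix) may make the tf.scan trick easier to follow:

```python
import numpy as np

# Sorted row indices of the nonzeros in the second example matrix.
row = np.array([0, 1, 1, 2, 2, 4, 4, 5])
r = np.arange(len(row))
# 1 wherever a new row starts, 0 otherwise.
d = (row[1:] - row[:-1] > 0).astype(r.dtype)
# Flat index of the start of the current row, carried forward as a running max.
m = np.concatenate([[0], r[1:] * d])
col = r - np.maximum.accumulate(m)
# col is [0, 0, 1, 0, 1, 0, 1, 0]: the position of each value within its row.
```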

Tensorflow: How to bucket my examples using the new Data API

I'm trying to group my training examples by their length: https://www.tensorflow.org/versions/r0.12/api_docs/python/contrib.training/bucketing
But I want to use the new Data API, so I'm wondering whether there is a way to do it.
Here is my code:
import tensorflow as tf
vocabulary = ["This", "is", "my", "first", "example",
              "the", "second", "one", "How", "to", "bucket",
              "examples", "using", "new", "Data", "API"]
data = ["This is my first example",
        "How to bucket my examples using the new Data API",
        "This is the second one",
        "How to bucket my examples using the new Data API"]
BATCH_SIZE = 2
lookup_table = tf.contrib.lookup.index_table_from_tensor(vocabulary)
dataset = tf.data.Dataset.from_tensor_slices(data)
def tokenize(x):
    words = tf.string_split([x], " ").values
    return words

def lookup(x):
    ids = lookup_table.lookup(x)
    return ids

bucket_boundaries = [5, 10]

def bucketing(x):
    return tf.contrib.training.bucket_by_sequence_length(
        input_length=10,
        tensors=[x],
        batch_size=1,
        bucket_boundaries=bucket_boundaries,
        dynamic_pad=True
    )
# dataset = (dataset
#            .map(tokenize)
#            .map(lookup)
#            # .padded_batch(BATCH_SIZE, padded_shapes=[?])
#            )
dataset = (dataset
           .map(tokenize)
           .map(lookup)
           .map(bucketing)
           )
iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()
init_op = tf.group(tf.global_variables_initializer(),
                   tf.tables_initializer(),
                   iterator.initializer)
sess = tf.Session()
sess.run(init_op)
for i in range(len(data)):
    batch = sess.run(next_batch)
    print(batch)
The expected output should be something like this:
[[0 1 2 3 4], [0 1 5 6 7]]
[[8 9 10 2 11 12 5 13 14 15], [8 9 10 2 11 12 5 13 14 15]]
The code above throws OutOfRangeError.
OutOfRangeError (see above for traceback): End of sequence
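Whatever the Data API route, the intended bucketing behaviour itself can be illustrated without TensorFlow: assign each tokenized sequence to a bucket by its length, then batch and pad within each bucket (a pure-Python sketch of the concept; `bucket_and_pad` is a hypothetical helper, not a TF API):

```python
import bisect

def bucket_and_pad(seqs, boundaries, batch_size, pad=0):
    # Group sequences into buckets delimited by the length boundaries,
    # then cut each bucket into batches padded to the batch's longest item.
    buckets = {}
    for s in seqs:
        buckets.setdefault(bisect.bisect_left(boundaries, len(s)), []).append(s)
    batches = []
    for _, group in sorted(buckets.items()):
        for i in range(0, len(group), batch_size):
            batch = group[i:i + batch_size]
            width = max(len(s) for s in batch)
            batches.append([s + [pad] * (width - len(s)) for s in batch])
    return batches

# Token ids of the four example sentences after the vocabulary lookup.
token_ids = [[0, 1, 2, 3, 4],
             [8, 9, 10, 2, 11, 12, 5, 13, 14, 15],
             [0, 1, 5, 6, 7],
             [8, 9, 10, 2, 11, 12, 5, 13, 14, 15]]
batches = bucket_and_pad(token_ids, [5, 10], 2)
# batches reproduces the expected output: the two length-5 sentences
# together, then the two length-10 sentences together.
```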

Multi-Target and Multi-Class prediction

I am relatively new to machine learning as well as tensorflow. I would like to train on the data so that predictions with 2 targets and multiple classes can be made. Is this something that can be done? I was able to implement the algorithm for 1 target, but I don't know how to do it for a second target as well.
An example dataset:
DayOfYear Temperature Flow Visibility
316 8 1 4
285 -1 1 4
326 8 2 5
323 -1 0 3
10 7 3 6
62 8 0 3
56 8 1 4
347 7 2 5
363 7 0 3
77 7 3 6
1 7 1 4
308 -1 2 5
364 7 3 6
If I train on (DayOfYear, Temperature, Flow) I can predict Visibility quite well. But I need to predict Flow as well somehow. I am pretty sure that Flow influences Visibility, so I am not sure how to go about that.
This is the implementation that I have
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import urllib
import numpy as np
import tensorflow as tf
# Data sets
TRAINING = "/ml_baetterich_learn.csv"
TEST = "/ml_baetterich_test.csv"
VALIDATION = "/ml_baetterich_validation.csv"
def main():
    # Load datasets.
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=TRAINING,
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=-1)
    test_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=TEST,
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=-1)
    validation_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=VALIDATION,
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=-1)
    # Specify that all features have real-value data
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=3)]
    # Build 3 layer DNN with 10, 20, 10 units respectively.
    classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                                hidden_units=[10, 20, 10],
                                                n_classes=9,
                                                model_dir="/tmp/iris_model")
    # Define the training inputs
    def get_train_inputs():
        x = tf.constant(training_set.data)
        y = tf.constant(training_set.target)
        return x, y
    # Fit model.
    classifier.fit(input_fn=get_train_inputs, steps=4000)
    # Define the test inputs
    def get_test_inputs():
        x = tf.constant(test_set.data)
        y = tf.constant(test_set.target)
        return x, y
    # Define the validation inputs
    def get_validation_inputs():
        x = tf.constant(validation_set.data)
        y = tf.constant(validation_set.target)
        return x, y
    # Evaluate accuracy.
    accuracy_test_score = classifier.evaluate(input_fn=get_test_inputs,
                                              steps=1)["accuracy"]
    accuracy_validation_score = classifier.evaluate(input_fn=get_validation_inputs,
                                                    steps=1)["accuracy"]
    print("\nValidation Accuracy: {0:0.2f}\nTest Accuracy: {1:0.2f}\n".format(accuracy_validation_score, accuracy_test_score))
    # Classify two new samples.
    def new_samples():
        return np.array(
            [[327, 8, 3],
             [47, 8, 0]], dtype=np.float32)
    predictions = list(classifier.predict_classes(input_fn=new_samples))
    print(
        "New Samples, Class Predictions: {}\n"
        .format(predictions))

if __name__ == "__main__":
    main()
Option 1: multi-headed model
You could use a multi-headed DNNEstimator model. This treats Flow and Visibility as two separate softmax classification targets, each with their own set of classes. I had to modify the load_csv_without_header helper function to support multiple targets (which could be cleaner, but is not the point here - feel free to ignore its details).
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
import csv
import collections
num_flow_classes = 4
num_visib_classes = 7
Dataset = collections.namedtuple('Dataset', ['data', 'target'])
def load_csv_without_header(fn, target_dtype, features_dtype, target_columns):
    with gfile.Open(fn) as csv_file:
        data_file = csv.reader(csv_file)
        data = []
        targets = {
            target_cols: []
            for target_cols in target_columns.keys()
        }
        for row in data_file:
            cols = sorted(target_columns.items(), key=lambda tup: tup[1], reverse=True)
            for target_col_name, target_col_i in cols:
                targets[target_col_name].append(row.pop(target_col_i))
            data.append(np.asarray(row, dtype=features_dtype))
        targets = {
            target_col_name: np.array(val, dtype=target_dtype)
            for target_col_name, val in targets.items()
        }
        data = np.array(data)
        return Dataset(data=data, target=targets)
feature_columns = [
    tf.contrib.layers.real_valued_column("", dimension=1),
    tf.contrib.layers.real_valued_column("", dimension=2),
]
head = tf.contrib.learn.multi_head([
    tf.contrib.learn.multi_class_head(
        num_flow_classes, label_name="Flow", head_name="Flow"),
    tf.contrib.learn.multi_class_head(
        num_visib_classes, label_name="Visibility", head_name="Visibility"),
])
classifier = tf.contrib.learn.DNNEstimator(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    model_dir="iris_model",
    head=head,
)
def get_input_fn(filename):
    def input_fn():
        dataset = load_csv_without_header(
            fn=filename,
            target_dtype=np.int,
            features_dtype=np.int,
            target_columns={"Flow": 2, "Visibility": 3}
        )
        x = tf.constant(dataset.data)
        y = {k: tf.constant(v) for k, v in dataset.target.items()}
        return x, y
    return input_fn

classifier.fit(input_fn=get_input_fn("tmp_train.csv"), steps=4000)
res = classifier.evaluate(input_fn=get_input_fn("tmp_test.csv"), steps=1)
print("Validation:", res)
Option 2: multi-labeled head
If you keep your CSV data separated by commas, and use the last column to hold all the classes a row might have (separated by some token such as a space), you can use the following code:
import numpy as np
import tensorflow as tf
all_classes = ["0", "1", "2", "3", "4", "5", "6"]
def k_hot(classes_col, all_classes, delimiter=' '):
    table = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(all_classes)
    )
    classes = tf.string_split(classes_col, delimiter)
    ids = table.lookup(classes)
    num_items = tf.cast(tf.shape(ids)[0], tf.int64)
    num_entries = tf.shape(ids.indices)[0]
    y = tf.SparseTensor(
        indices=tf.stack([ids.indices[:, 0], ids.values], axis=1),
        values=tf.ones(shape=(num_entries,), dtype=tf.int32),
        dense_shape=(num_items, len(all_classes)),
    )
    y = tf.sparse_tensor_to_dense(y, validate_indices=False)
    return y

def feature_engineering_fn(features, labels):
    labels = k_hot(labels, all_classes)
    return features, labels
feature_columns = [
    tf.contrib.layers.real_valued_column("", dimension=1),  # DayOfYear
    tf.contrib.layers.real_valued_column("", dimension=2),  # Temperature
]
classifier = tf.contrib.learn.DNNEstimator(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    model_dir="iris_model",
    head=tf.contrib.learn.multi_label_head(n_classes=len(all_classes)),
    feature_engineering_fn=feature_engineering_fn,
)
def get_input_fn(filename):
    def input_fn():
        dataset = tf.contrib.learn.datasets.base.load_csv_without_header(
            filename=filename,
            target_dtype="S100",  # strings of length up to 100 characters
            features_dtype=np.int,
            target_column=-1
        )
        x = tf.constant(dataset.data)
        y = tf.constant(dataset.target)
        return x, y
    return input_fn

classifier.fit(input_fn=get_input_fn("tmp_train.csv"), steps=4000)
res = classifier.evaluate(input_fn=get_input_fn("tmp_test.csv"), steps=1)
print("Validation:", res)
We are using DNNEstimator with a multi_label_head, which uses sigmoid cross-entropy rather than softmax cross-entropy as the loss function. This means that each of the output logits is passed through the sigmoid function, giving the probability that the data point belongs to that class; i.e. the classes are computed independently and are not mutually exclusive, as they are with softmax cross-entropy. This means that you could have between 0 and len(all_classes) classes set for each row in the training set and in the final predictions.
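The difference is easy to see numerically: for the same logits, softmax produces a single distribution over mutually exclusive classes, while the sigmoid scores each class independently (a small NumPy illustration):

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])
softmax = np.exp(logits) / np.exp(logits).sum()  # normalized: sums to exactly 1
sigmoid = 1.0 / (1.0 + np.exp(-logits))          # each entry is independent
# softmax.sum() == 1, so raising one class's score lowers the others'
# probabilities; the sigmoid probabilities can sum to anything in (0, 3).
```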
Also notice that the classes are represented as strings (and k_hot makes the conversion to token indices), so that you could use arbitrary class identifiers such as category UUIDs in e-commerce settings. If the categories in the 3rd and 4th column are different (Flow ID 1 != Visibility ID 1), you could prepend the column name to each class ID, e.g.
316,8,flow1 visibility4
285,-1,flow1 visibility4
326,8,flow2 visibility5
For a description of how k_hot works, see my other SO answer. I decided to use k_hot as a separate function (rather than defining it directly in feature_engineering_fn) because it's a distinct piece of functionality, and TensorFlow will probably soon have a similar utility function.
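What k_hot computes is just a multi-hot encoding; a plain-NumPy version of the same mapping (independent of TensorFlow; `k_hot_np` is a hypothetical helper) makes it concrete:

```python
import numpy as np

def k_hot_np(rows, all_classes, delimiter=' '):
    # Each row is a delimited string of class tokens; return a binary
    # indicator matrix with one column per known class.
    index = {c: i for i, c in enumerate(all_classes)}
    out = np.zeros((len(rows), len(all_classes)), dtype=np.int32)
    for r, cell in enumerate(rows):
        for token in cell.split(delimiter):
            out[r, index[token]] = 1
    return out

encoded = k_hot_np(["0 3", "5", "1 2 6"], ["0", "1", "2", "3", "4", "5", "6"])
# Row "0 3" sets columns 0 and 3, row "5" sets column 5, and so on.
```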
Note that if you now use the first two columns to predict the last two, your accuracy will certainly go down, as the last two columns are highly correlated and using one of them gives you a lot of information about the other. Actually, your code was using only the 3rd column, which was kind of a cheat anyway if the goal is to predict both the 3rd and 4th columns.