How do I set the PSO objective function to the estimated distribution function obtained from Gaussian process regression?

I am trying to create an estimated distribution function from the data in data.dat using Gaussian process regression and set it as the objective function of PSO, but I keep getting the error below.
I would like to create a function that returns LD when I pass x, but it doesn't work.
In the case of PSO, the positions of all particles are passed at once, so I split x into rows and return an LD for each.
I tried to collect the results in an empty array result; it might be wrong to start from empty data.
Code
import numpy as np
import pandas as pd
import pyswarms as ps
import GPy
# fix the random seed
np.random.seed(0)
# optimization conditions
n_particles = 5
n = n_particles
iters = 10
bounds = (np.array([10, 6, 2, 12, 38, 0, 3.6, 4, 8, 0]), np.array([18, 13, 7, 18, 42, 9, 9, 10, 18, 7.5]))  # (min, max)
# hyperparameters
options = {"c1": 0.5, "c2": 0.3, "w": 0.9}
# number of design-parameter dimensions
ndim = 10
Ndim = np.arange(ndim)
# import the data
data = pd.read_csv('data.dat', header=None, sep=" ")
data.columns = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"]
print(data.shape)  # (96, 12)
N = len(data)
# read the input values into X and the output values (LD) into fn
X = data.iloc[:, 1:11]
fn = data.loc[:, "12"]
# change the data type from 'list' to 'numpy.ndarray'
x = np.array(X, dtype=float)
Y = np.array(fn, dtype=float)
# reshape the output into a column vector
y = Y.reshape(N, 1)
# choose the kernel for the regression
kernel = GPy.kern.RBF(ndim, ARD=True)
# create a model from the data by Gaussian process regression with the kernel
m = GPy.models.GPRegression(x, y, kernel)
# define an objective function
def func2(x):
    print(x.shape)  # (5, 10): all particle positions at once
    result = np.empty((n, 1))  # allocate once, outside the loop
    for i in range(n):
        xi = x[i, :]
        xi = xi.reshape(1, len(xi))
        print(xi.shape)  # (1, 10)
        LD = m.predict(xi, include_likelihood=False)
        # print('LD =', LD)      # tuple (mean, variance)
        # print('LD0 =', LD[0])  # predicted mean
        result[i, 0] = LD[0]
    print(result.shape)  # (5, 1)
    return result
# start optimization
optimizer = ps.single.GlobalBestPSO(n_particles=n, dimensions=ndim, options=options, bounds=bounds)
cost, pos = optimizer.optimize(objective_func=func2, iters=iters)
Error
Traceback (most recent call last):
  File "e:\PSO\approx4pso.py", line 132, in <module>
    cost, pos = optimizer.optimize(objective_func=func2, iters=iters)
  File "C:\Users\taku_\anaconda3\lib\site-packages\pyswarms\single\global_best.py", line 210, in optimize
    self.swarm.pbest_pos, self.swarm.pbest_cost = compute_pbest(self.swarm)
  File "C:\Users\taku_\anaconda3\lib\site-packages\pyswarms\backend\operators.py", line 69, in compute_pbest
    new_pbest_pos = np.where(~mask_pos, swarm.pbest_pos, swarm.position)
  File "<__array_function__ internals>", line 180, in where
ValueError: operands could not be broadcast together with shapes (5,10,5) (5,10) (5,10)
How do I define the objective function? I don't have to stick to this approach, so if this code doesn't work, please let me know of other ways to define the function using Gaussian process regression.
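A likely fix (a minimal sketch, not a verified answer): pyswarms expects the objective function to return a 1-D array of per-particle costs with shape (n_particles,), whereas func2 returns shape (n, 1); that extra axis is what breaks the broadcast inside compute_pbest. Since GPy's m.predict accepts all rows at once and returns a (mean, variance) tuple, the loop can be dropped entirely:

def func2(x):
    # x: (n_particles, ndim) -- every particle's position at once
    mean, _ = m.predict(x, include_likelihood=False)  # mean: (n_particles, 1)
    # pyswarms wants a flat cost vector of shape (n_particles,)
    return mean.ravel()

Note that PSO minimizes the objective, so if the goal is to maximize the predicted LD, return -mean.ravel() instead.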

Related

tensorflow Exception encountered when calling layer (type CategoryEncoding)

I'm trying to code a layer that interfaces between a data set (numerical and categorical features) and a model, so the data can be fed in.
I can't understand the error I get when it comes to the categorical columns.
ValueError: Exception encountered when calling layer (type CategoryEncoding).
When output_mode is not 'int', maximum supported output rank is 2. Received
output_mode multi_hot and input shape (10, 7, 1), which would result in output rank 3.
From what I understand, the batch size should not have been counted in, but it is. And that seems to break.
Note that reproducing with only numerical features works fine.
Thank you for your help.
import tensorflow as tf
import pandas as pd
import numpy as np
# Simulate a data set of categorical and numerical values
# Configure simulation specifications: {feature: number of unique categories or None for numerical}
theSimSpecs = {'Cat1': 54, 'Cat2': 2, 'Cat3': 4, 'Num1': None, 'Num2': None}
# theSimSpecs = {'Num1': None, 'Num2': None}
# batch size and timesteps
theBatchSz, theTimeSteps = 10, 4
# Creation of the dataset as pandas.DataFrame
theDFs = []
for theFeature, theUniques in theSimSpecs.items():
    if theUniques is None:
        theDF = pd.DataFrame(np.random.random(size=theBatchSz * theTimeSteps), columns=[theFeature])
    else:
        theDF = pd.DataFrame(np.random.randint(low=0, high=theUniques, size=theBatchSz * theTimeSteps),
                             columns=[theFeature]).astype('category')
    theDFs.append(theDF)
theDF = pd.concat(theDFs, axis=1)
# code excerpt
# inventory of the categorical features' values ( None for the numerical)
theCatCodes = {theCol: (theDF[theCol].unique().tolist() if str(theDF[theCol].dtypes) == "category" else None)
for theCol in theDF.columns}
# Creation of the batched tensorflow.data.Dataset
theDS = tf.data.Dataset.from_tensor_slices(dict(theDF))
theDS = theDS.window(size=theTimeSteps, shift=1, stride=1, drop_remainder=True)
theDS = theDS.flat_map(lambda x: tf.data.Dataset.zip(x))
theDS = theDS.batch(batch_size=theTimeSteps, drop_remainder=True)
theDS = theDS.batch(batch_size=theBatchSz, drop_remainder=True)
# extracting one batch
theBatch = next(iter(theDS))
tf.print(theBatch)
# Creation of the components for the interface layer
theFeaturesInputs = {}
theFeaturesEncoded = {}
for theFeature, theCodes in theCatCodes.items():
    if theCodes is None:  # Pass-through for numerical features
        theNumInput = tf.keras.layers.Input(shape=[], dtype=tf.float32, name=theFeature)
        theFeaturesInputs[theFeature] = theNumInput
        theFeatureExp = tf.expand_dims(input=theNumInput, axis=-1)
        theFeaturesEncoded[theFeature] = theFeatureExp
    else:  # Process for categorical features
        theCatInput = tf.keras.layers.Input(shape=[], dtype=tf.int64, name=theFeature)
        theFeaturesInputs[theFeature] = theCatInput
        theFeatureExp = tf.expand_dims(input=theCatInput, axis=-1)
        theEncodingLayer = tf.keras.layers.CategoryEncoding(num_tokens=theSimSpecs[theFeature], name=f"{theFeature}_enc",
                                                            output_mode="multi_hot", sparse=False)
        theFeaturesEncoded[theFeature] = theEncodingLayer(theFeatureExp)
theStackedInputs = tf.concat(tf.nest.flatten(theFeaturesEncoded), axis=1)
theModel = tf.keras.Model(inputs=theFeaturesInputs, outputs=theStackedInputs)
theOutput = theModel(theBatch)
tf.print(theOutput)
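A possible workaround (a sketch only, not verified against the full example above): CategoryEncoding with output_mode="multi_hot" supports at most rank-2 output, so sequence-shaped categorical inputs of shape (batch, timesteps) overflow it. tf.one_hot works at any rank by appending a class axis, which keeps the timestep structure; the feature name and depth below are taken from theSimSpecs purely for illustration:

import tensorflow as tf

# (batch, timesteps) integer ids -> (batch, timesteps, num_tokens),
# sidestepping the rank-2 limit of multi_hot
theCatInput = tf.keras.layers.Input(shape=[None], dtype=tf.int64, name="Cat1")
theEncoded = tf.one_hot(theCatInput, depth=54)  # depth = theSimSpecs["Cat1"]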

Incremental PCA on big dataset, with large component demand

I am trying to find the first 200 principal components of a dataset of 846 images (2048x2048x3 RGB) with sklearn.decomposition.IncrementalPCA.
The data are read with cv2 and reshaped into a 2-D np array ([846, 2048x2048x3], float16).
To reduce the memory cost, I used partial_fit() and divided the original data into smaller chunks (batches) in both the partial_fit() and transform() steps, just like in this problem's solution:
Python PCA on Matrix too large to fit into memory
Now my code works well for relatively small computations, like computing 20 components for 200 images in the dataset; it produces correct results.
However, the task demands that I compute 200 components, which means my batch size must be larger than or at least equal to 200 (according to sklearn's documentation and the messages in the terminal when running the code).
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html
With such a big chunk size, I can finish setting up the IPCA model, but I always face a MemoryError during partial_fit().
What's more, there is another problem: I need to use inverse_transform later, and I am not sure whether I can apply it chunk by chunk as well. (In the code below I did not.)
What can I do to avoid this MemoryError? Or should I replace IncrementalPCA with some other method? (Any alternative should offer something like inverse_transform().)
The total memory I can access is 131,661,572 kB (~127 GB).
My code:
from sklearn.decomposition import PCA, IncrementalPCA
import numpy as np
import cv2
import os

folder_path = "./output_img"
input = []
for i in range(1, 847):
    if i % 10 == 0: print("loading", i, "th image")
    # if i == 60: continue  # special case, should be skipped
    image_path = folder_path + f"/{i}neutral.jpg"
    img = cv2.imread(image_path)
    input.append(img.reshape(-1))
print("Loaded all", i, "images")
# stack into a numpy matrix
all_image = np.stack(input, axis=0)
# cast to float16 to halve the memory footprint
all_image = all_image.astype(np.float16)
### shape: #_of_images x image_pixel_num (50331648 for the img_normals case)
# print(all_image)
# print(all_image.shape)
# PCA, keep 200 components
COM_NUM = 200
pca = IncrementalPCA(n_components=COM_NUM)
print("finished IPCA model set")
saving_path = "./principle847"
element_num = all_image.shape[0]  # how many elements (rows) we have in the dataset
chunk_size = 220  # how many elements we feed to IPCA at a time
for i in range(0, element_num // chunk_size):
    pca.partial_fit(all_image[i * chunk_size : (i + 1) * chunk_size])
    print("finished PCA fit:", i * chunk_size, "to", (i + 1) * chunk_size)
pca.partial_fit(all_image[(i + 1) * chunk_size : element_num])  # tail
print("finished PCA fit:", (i + 1) * chunk_size, "to", element_num)
for i in range(0, element_num // chunk_size):
    if i == 0:
        result = pca.transform(all_image[i * chunk_size : (i + 1) * chunk_size])
    else:
        tmp = pca.transform(all_image[i * chunk_size : (i + 1) * chunk_size])
        result = np.concatenate((result, tmp), axis=0)
    print("finished PCA transform:", i * chunk_size, "to", (i + 1) * chunk_size)
tmp = pca.transform(all_image[(i + 1) * chunk_size : element_num])  # tail
result = np.concatenate((result, tmp), axis=0)
print("finished PCA transform:", (i + 1) * chunk_size, "to", element_num)
result = pca.inverse_transform(result)
print("PCA mean:", pca.mean_)
mean_img = pca.mean_
mean_img = mean_img.reshape(2048, 2048, 3)
mean_img = mean_img.astype(np.uint8)
cv2.imwrite(os.path.join(saving_path, "mean.png"), mean_img)
result = result.reshape(-1, 2048, 2048, 3)
# result shape: #_of_components * 2048 * 2048 * 3
dst = result
# dst = result / np.linalg.norm(result, axis=(3), keepdims=True)
for j in range(0, COM_NUM):
    reconImage = dst[j]
    # reconImage = reconImage.reshape(4096, 4096, 3)
    reconImage = np.clip(reconImage, 0, 255)
    reconImage = reconImage.astype(np.uint8)
    cv2.imwrite(os.path.join(saving_path, "p" + str(j) + ".png"), reconImage)
    print("Saved", j + 1, "principle imgs")
The error goes like:
File "model_generate.py", line 36, in <module>
pca.partial_fit(all_image[i*chunk_size : (i+1)*chunk_size])
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/sklearn/decomposition/_incremental_pca.py", line 299, in partial_fit
U, V = svd_flip(U, V, u_based_decision=False)
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/sklearn/utils/extmath.py", line 538, in svd_flip
max_abs_rows = np.argmax(np.abs(v), axis=1)
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 1103, in argmax
return _wrapfunc(a, 'argmax', axis=axis, out=out)
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 56, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
MemoryError
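One direction that might help (a sketch under assumptions, not a tested fix): scikit-learn does not compute in float16 and converts such input to a wider dtype internally, so each 220-row chunk balloons well beyond its float16 footprint, on top of the full 846-image array already held in RAM. Reading each chunk from disk inside the loop keeps only chunk_size images resident at a time; the file-naming pattern is reused from the question:

import numpy as np
import cv2
from sklearn.decomposition import IncrementalPCA

def load_chunk(start, stop, folder_path="./output_img"):
    # load images start..stop-1 (1-based file names, as in the question)
    imgs = [cv2.imread(f"{folder_path}/{i}neutral.jpg").reshape(-1)
            for i in range(start, stop)]
    # float32 is preserved by scikit-learn, unlike float16
    return np.stack(imgs, axis=0).astype(np.float32)

pca = IncrementalPCA(n_components=200)
chunk_size = 220  # must stay >= n_components
for start in range(1, 847, chunk_size):
    stop = min(start + chunk_size, 847)
    if stop - start >= 200:  # partial_fit needs at least n_components rows
        pca.partial_fit(load_chunk(start, stop))

transform and inverse_transform are per-row linear maps (x ≈ mean_ + z @ components_), so both can be applied chunk by chunk in the same way.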

Tensorflow : Predict in Recurrent Neural Networks for Drawing Classification tutorial

I used the tutorial code from https://www.tensorflow.org/tutorials/recurrent_quickdraw and everything works fine until I tried to make a prediction instead of just evaluating.
I wrote a new input function for prediction, based on the code in create_dataset.py
def predict_input_fn():
    def parse_line(stroke_points):
        """Parse an ndjson line and return ink (as np array) and classname."""
        inkarray = json.loads(stroke_points)
        stroke_lengths = [len(stroke[0]) for stroke in inkarray]
        total_points = sum(stroke_lengths)
        np_ink = np.zeros((total_points, 3), dtype=np.float32)
        current_t = 0
        for stroke in inkarray:
            for i in [0, 1]:
                np_ink[current_t:(current_t + len(stroke[0])), i] = stroke[i]
            current_t += len(stroke[0])
            np_ink[current_t - 1, 2] = 1  # stroke_end
        # Preprocessing.
        # 1. Size normalization.
        lower = np.min(np_ink[:, 0:2], axis=0)
        upper = np.max(np_ink[:, 0:2], axis=0)
        scale = upper - lower
        scale[scale == 0] = 1
        np_ink[:, 0:2] = (np_ink[:, 0:2] - lower) / scale
        # 2. Compute deltas.
        np_ink = np_ink[1:, 0:2] - np_ink[0:-1, 0:2]
        np_ink = np_ink[1:, :]
        features = {}
        features["ink"] = tf.train.Feature(float_list=tf.train.FloatList(value=np_ink.flatten()))
        features["shape"] = tf.train.Feature(int64_list=tf.train.Int64List(value=np_ink.shape))
        f = tf.train.Features(feature=features)
        example = tf.train.Example(features=f)
        # t = tf.constant(np_ink)
        return example

    def parse_example(example):
        """Parse a single record which is expected to be a tensorflow.Example."""
        # feature_to_type = {
        #     "ink": tf.VarLenFeature(dtype=tf.float32),
        #     "shape": tf.FixedLenFeature((0, 2), dtype=tf.int64)
        # }
        feature_to_type = {
            "ink": tf.VarLenFeature(dtype=tf.float32),
            "shape": tf.FixedLenFeature([2], dtype=tf.int64)
        }
        example_proto = example.SerializeToString()
        parsed_features = tf.parse_single_example(example_proto, feature_to_type)
        parsed_features["ink"] = tf.sparse_tensor_to_dense(parsed_features["ink"])
        # parsed_features["shape"].set_shape((2))
        return parsed_features

    example = parse_line(FLAGS.predict_input_stroke_data)
    features = parse_example(example)
    dataset = tf.data.Dataset.from_tensor_slices(features)
    # Our inputs are variable length, so pad them.
    dataset = dataset.padded_batch(FLAGS.batch_size, padded_shapes=dataset.output_shapes)
    iterator = dataset.make_one_shot_iterator()
    next_feature_batch = iterator.get_next()
    return next_feature_batch, None  # In prediction, we have no labels
I modified the existing model_fn() function and added the following at the appropriate place:
predictions = tf.argmax(logits, axis=1)
if mode == tf.estimator.ModeKeys.PREDICT:
    preds = {
        "class_index": predictions,
        "probabilities": tf.nn.softmax(logits),
        'logits': logits
    }
    return tf.estimator.EstimatorSpec(mode, predictions=preds)
However, when I call the following code:
if (FLAGS.predict_input_stroke_data != None):
    # prepare_input_tfrecord_for_prediction()
    # predict_results = estimator.predict(input_fn=get_input_fn(
    #     mode=tf.estimator.ModeKeys.PREDICT,
    #     tfrecord_pattern=FLAGS.predict_input_temp_file,
    #     batch_size=FLAGS.batch_size))
    predict_results = estimator.predict(input_fn=predict_input_fn)
    for idx, prediction in enumerate(predict_results):
        type = prediction["class_ids"][0]  # Get the predicted class (index)
        print("Prediction Type: {}\n".format(type))
I get the following error. What is wrong in my code? Could anyone please help me? I have tried quite a few things to get the shape right, but I am unable to. I also tried to first write my stroke data as a TFRecord and then use the existing input_fn to read from the TFRecord; that gives me similar, though slightly different, errors.
File "/Users/farooq/.virtualenvs/tensor1.0/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 627, in call_cpp_shape_fn
require_shape_fn)
File "/Users/farooq/.virtualenvs/tensor1.0/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 691, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Shape must be rank 2 but is rank 1 for 'Slice' (op: 'Slice') with input shapes: [?], [2], [2].
I finally solved the problem by taking my input strokes and writing them to disk as a TFRecord. I also had to write the same input strokes batch_size times to the same TFRecord, otherwise I got the shape-mismatch errors. And then invoking predict worked.
The main addition for prediction was the following function
def create_tfrecord_for_prediction(batch_size, stoke_data, tfrecord_file):
    def parse_line(stoke_data):
        """Parse provided stroke data and ink (as np array) and classname."""
        inkarray = json.loads(stoke_data)
        stroke_lengths = [len(stroke[0]) for stroke in inkarray]
        total_points = sum(stroke_lengths)
        np_ink = np.zeros((total_points, 3), dtype=np.float32)
        current_t = 0
        for stroke in inkarray:
            if len(stroke[0]) != len(stroke[1]):
                print("Inconsistent number of x and y coordinates.")
                return None
            for i in [0, 1]:
                np_ink[current_t:(current_t + len(stroke[0])), i] = stroke[i]
            current_t += len(stroke[0])
            np_ink[current_t - 1, 2] = 1  # stroke_end
        # Preprocessing.
        # 1. Size normalization.
        lower = np.min(np_ink[:, 0:2], axis=0)
        upper = np.max(np_ink[:, 0:2], axis=0)
        scale = upper - lower
        scale[scale == 0] = 1
        np_ink[:, 0:2] = (np_ink[:, 0:2] - lower) / scale
        # 2. Compute deltas.
        # np_ink = np_ink[1:, 0:2] - np_ink[0:-1, 0:2]
        # np_ink = np_ink[1:, :]
        np_ink[1:, 0:2] -= np_ink[0:-1, 0:2]
        np_ink = np_ink[1:, :]
        features = {}
        features["ink"] = tf.train.Feature(float_list=tf.train.FloatList(value=np_ink.flatten()))
        features["shape"] = tf.train.Feature(int64_list=tf.train.Int64List(value=np_ink.shape))
        f = tf.train.Features(feature=features)
        ex = tf.train.Example(features=f)
        return ex

    if stoke_data is None:
        print("Error: Stroke data cannot be none")
        return
    example = parse_line(stoke_data)
    # Remove the file if it already exists
    if tf.gfile.Exists(tfrecord_file):
        tf.gfile.Remove(tfrecord_file)
    writer = tf.python_io.TFRecordWriter(tfrecord_file)
    for i in range(batch_size):
        writer.write(example.SerializeToString())
    writer.flush()
    writer.close()
Then in the main function you just have to invoke estimator.predict(), reusing the same input_fn=get_input_fn(...) argument, except pointing it at the temporarily created tfrecord_file.
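Presumably the invocation then looks like this (reconstructed from the commented-out block in the question, so the exact get_input_fn signature is an assumption):

# write the strokes batch_size times to a temporary TFRecord, then predict
create_tfrecord_for_prediction(FLAGS.batch_size, FLAGS.predict_input_stroke_data,
                               FLAGS.predict_input_temp_file)
predict_results = estimator.predict(input_fn=get_input_fn(
    mode=tf.estimator.ModeKeys.PREDICT,
    tfrecord_pattern=FLAGS.predict_input_temp_file,
    batch_size=FLAGS.batch_size))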
Hope this helps

Feed a Tensor of SparseTensors to estimators

To get started with TF, I wanted to learn a predictor of match outcomes for a game. There are three features: the 5 heroes on team 0, the 5 heroes on team 1, and the map. The winner is the label, 0 or 1. I want to represent the teams and the map as SparseTensors. Out of a possible 71 heroes, five will be selected. Likewise for maps: out of a possible 13, one will be selected.
import tensorflow as tf
import packunpack as source
import tempfile
from collections import namedtuple

GameRecord = namedtuple('GameRecord', 'team_0 team_1 game_map winner')

def parse(line):
    parts = line.rstrip().split("\t")
    return GameRecord(
        game_map = parts[1],
        team_0 = parts[2].split(","),
        team_1 = parts[3].split(","),
        winner = int(parts[4]))

def conjugate(record):
    return GameRecord(
        team_0 = record.team_1,
        team_1 = record.team_0,
        game_map = record.game_map,
        winner = 0 if record.winner == 1 else 1)

def sparse_team(team):
    indices = list(map(lambda x: [x], map(source.encode_hero, team)))
    return tf.SparseTensor(indices=indices, values=[1] * len(indices), dense_shape=[len(source.heroes_array)])

def sparse_map(map_name):
    return tf.SparseTensor(indices=[[source.encode_hero(map_name)]], values=[1], dense_shape=[len(source.maps_array)])

def make_input_fn(filename, shuffle=True, add_conjugate_games=True):
    def _fn():
        records = []
        with open(filename, "r") as raw:
            i = 0
            for line in raw:
                record = parse(line)
                records.append(record)
                if add_conjugate_games:
                    # since 0 and 1 are arbitrary team labels, learn and test the
                    # conjugate game whenever learning the original inference
                    records.append(conjugate(record))
        print("Making team 0")
        team_0s = tf.constant(list(map(lambda r: sparse_team(r.team_0), records)))
        print("Making team 1")
        team_1s = tf.constant(list(map(lambda r: sparse_team(r.team_1), records)))
        print("making maps")
        maps = tf.constant(list(map(lambda r: sparse_map(r.game_map), records)))
        print("Making winners")
        winners = tf.constant(list(map(lambda r: tf.constant([r.winner]), records)))
        return {
            "team_0": team_0s,
            "team_1": team_1s,
            "game_map": maps,
        }, winners
        # Please help me finish this function?
    return _fn

team_0 = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_vocabulary_list("team_0", source.heroes_array), len(source.heroes_array))
team_1 = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_vocabulary_list("team_1", source.heroes_array), len(source.heroes_array))
game_map = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_vocabulary_list("game_map", source.maps_array), len(source.maps_array))

model_dir = tempfile.mkdtemp()
m = tf.estimator.DNNClassifier(
    model_dir=model_dir,
    hidden_units=[1024, 512, 256],
    feature_columns=[team_0, team_1, game_map])

def main():
    m.train(input_fn=make_input_fn("tiny.txt"), steps=100)

if __name__ == "__main__":
    main()
This fails on team_0s = tf.constant(list(map(lambda r: sparse_team(r.team_0), records)))
It's very difficult to understand what tf wants me to return in my input_fn, because all of the examples I can find in the docs ultimately call out to a pandas or numpy helper function, and I'm not familiar with those frameworks. I thought that each dictionary value should be a Tensor containing all examples of a single feature. Each of my examples is a SparseTensor, and I want to simply embed them as their dense versions for the sake of the DNNClassifier.
I'm sure my mental model is horribly broken right now, and I appreciate any help setting it straight.
Error output:
python3 estimator.py
Making team 0
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_util.py", line 468, in make_tensor_proto
    str_values = [compat.as_bytes(x) for x in proto_values]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_util.py", line 468, in <listcomp>
    str_values = [compat.as_bytes(x) for x in proto_values]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
    (bytes_or_text,))
TypeError: Expected binary or unicode string, got <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7fe8b4d7aef0>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "estimator.py", line 79, in <module>
    main()
  File "estimator.py", line 76, in main
    m.train(input_fn=make_input_fn("tiny.txt"), steps = 100)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 302, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 709, in _train_model
    input_fn, model_fn_lib.ModeKeys.TRAIN)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 577, in _get_features_and_labels_from_input_fn
    result = self._call_input_fn(input_fn, mode)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 663, in _call_input_fn
    return input_fn(**kwargs)
  File "estimator.py", line 44, in _fn
    team_0s = tf.constant(list(map(lambda r: sparse_team(r.team_0), records)))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/constant_op.py", line 208, in constant
    value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_util.py", line 472, in make_tensor_proto
    "supported type." % (type(values), values))
TypeError: Failed to convert object of type <class 'list'> to Tensor. Contents: [<tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7fe8b4d7aef0>, <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7fe8b4d7af28>, <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7fe8b4d7af60>, <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7fe8b4d7aeb8> ... ]
Ultimately it wasn't necessary to convert my text representation into sparse vectors in my input_fn. Instead I had to tell the model to expect an input of an array of strings, which it understands how to convert into a "bag of words" or n-hot vector and how to embed as dense vectors.
import tensorflow as tf
import packunpack as source  # needed below for heroes_array / maps_array
import tempfile
import os
from collections import namedtuple

GameRecord = namedtuple('GameRecord', 'team_0 team_1 game_map winner')

def parse(line):
    parts = line.rstrip().split("\t")
    return GameRecord(
        game_map = parts[1],
        team_0 = parts[2].split(","),
        team_1 = parts[3].split(","),
        winner = int(parts[4]))

def conjugate(record):
    return GameRecord(
        team_0 = record.team_1,
        team_1 = record.team_0,
        game_map = record.game_map,
        winner = 0 if record.winner == 1 else 1)

def make_input_fn(filename, batch_size=128, shuffle=True, add_conjugate_games=True, epochs=1):
    def _fn():
        records = []
        with open(filename, "r") as raw:
            i = 0
            for line in raw:
                record = parse(line)
                records.append(record)
                if add_conjugate_games:
                    records.append(conjugate(record))
        team_0s = tf.constant(list(map(lambda r: r.team_0, records)))
        team_1s = tf.constant(list(map(lambda r: r.team_1, records)))
        maps = tf.constant(list(map(lambda r: r.game_map, records)))
        winners = tf.constant(list(map(lambda r: [r.winner], records)))
        return {
            "team_0": team_0s,
            "team_1": team_1s,
            "game_map": maps,
        }, winners
    return _fn

team_0 = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_vocabulary_list("team_0", source.heroes_array), dimension=len(source.heroes_array))
team_1 = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_vocabulary_list("team_1", source.heroes_array), dimension=len(source.heroes_array))
game_map = tf.feature_column.embedding_column(
    tf.feature_column.categorical_column_with_vocabulary_list("game_map", source.maps_array), dimension=len(source.maps_array))

model_dir = "DNNClassifierModel_00"
os.mkdir(model_dir)
m = tf.estimator.DNNClassifier(
    model_dir=model_dir,
    hidden_units=[1024, 512, 256],
    feature_columns=[team_0, team_1, game_map])

def main():
    m.train(input_fn=make_input_fn("training.txt"))
    results = m.evaluate(input_fn=make_input_fn("validation.txt"))
    print("model directory = %s" % model_dir)
    for key in sorted(results):
        print("%s: %s" % (key, results[key]))

if __name__ == "__main__":
    main()
Note that this code isn't perfect yet. I need to add in batching.
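A minimal batching sketch (assuming the same parse/conjugate helpers and feature dictionary as above) would route the tensors through tf.data instead of returning them directly:

def make_input_fn(filename, batch_size=128, shuffle=True, add_conjugate_games=True, epochs=1):
    def _fn():
        records = []
        with open(filename, "r") as raw:
            for line in raw:
                record = parse(line)
                records.append(record)
                if add_conjugate_games:
                    records.append(conjugate(record))
        features = {
            "team_0": tf.constant(list(map(lambda r: r.team_0, records))),
            "team_1": tf.constant(list(map(lambda r: r.team_1, records))),
            "game_map": tf.constant(list(map(lambda r: r.game_map, records))),
        }
        labels = tf.constant(list(map(lambda r: [r.winner], records)))
        # slice row-wise, then shuffle/repeat/batch
        dataset = tf.data.Dataset.from_tensor_slices((features, labels))
        if shuffle:
            dataset = dataset.shuffle(buffer_size=len(records))
        dataset = dataset.repeat(epochs).batch(batch_size)
        return dataset.make_one_shot_iterator().get_next()
    return _fn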

Multi-Target and Multi-Class prediction

I am relatively new to machine learning as well as TensorFlow. I would like to train on the data so that predictions with 2 targets and multiple classes can be made. Is this something that can be done? I was able to implement the algorithm for 1 target, but I don't know how to do it for a second target as well.
An example dataset:
DayOfYear Temperature Flow Visibility
316 8 1 4
285 -1 1 4
326 8 2 5
323 -1 0 3
10 7 3 6
62 8 0 3
56 8 1 4
347 7 2 5
363 7 0 3
77 7 3 6
1 7 1 4
308 -1 2 5
364 7 3 6
If I train on (DayOfYear, Temperature, Flow) I can predict Visibility quite well. But I need to predict Flow as well somehow. I am pretty sure that Flow influences Visibility, so I am not sure how to go about that.
This is the implementation that I have
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import urllib

import numpy as np
import tensorflow as tf

# Data sets
TRAINING = "/ml_baetterich_learn.csv"
TEST = "/ml_baetterich_test.csv"
VALIDATION = "/ml_baetterich_validation.csv"

def main():
    # Load datasets.
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=TRAINING,
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=-1)
    test_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=TEST,
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=-1)
    validation_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=VALIDATION,
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=-1)

    # Specify that all features have real-value data
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=3)]

    # Build 3 layer DNN with 10, 20, 10 units respectively.
    classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                                hidden_units=[10, 20, 10],
                                                n_classes=9,
                                                model_dir="/tmp/iris_model")

    # Define the training inputs
    def get_train_inputs():
        x = tf.constant(training_set.data)
        y = tf.constant(training_set.target)
        return x, y

    # Fit model.
    classifier.fit(input_fn=get_train_inputs, steps=4000)

    # Define the test inputs
    def get_test_inputs():
        x = tf.constant(test_set.data)
        y = tf.constant(test_set.target)
        return x, y

    # Define the validation inputs
    def get_validation_inputs():
        x = tf.constant(validation_set.data)
        y = tf.constant(validation_set.target)
        return x, y

    # Evaluate accuracy.
    accuracy_test_score = classifier.evaluate(input_fn=get_test_inputs,
                                              steps=1)["accuracy"]
    accuracy_validation_score = classifier.evaluate(input_fn=get_validation_inputs,
                                                    steps=1)["accuracy"]
    print("\nValidation Accuracy: {0:0.2f}\nTest Accuracy: {1:0.2f}\n".format(accuracy_validation_score, accuracy_test_score))

    # Classify two new samples.
    def new_samples():
        return np.array(
            [[327, 8, 3],
             [47, 8, 0]], dtype=np.float32)

    predictions = list(classifier.predict_classes(input_fn=new_samples))
    print(
        "New Samples, Class Predictions: {}\n"
        .format(predictions))

if __name__ == "__main__":
    main()
Option 1: multi-headed model
You could use a multi-headed DNNEstimator model. This treats Flow and Visibility as two separate softmax classification targets, each with their own set of classes. I had to modify the load_csv_without_header helper function to support multiple targets (which could be cleaner, but is not the point here - feel free to ignore its details).
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
import csv
import collections

num_flow_classes = 4
num_visib_classes = 7

Dataset = collections.namedtuple('Dataset', ['data', 'target'])

def load_csv_without_header(fn, target_dtype, features_dtype, target_columns):
    with gfile.Open(fn) as csv_file:
        data_file = csv.reader(csv_file)
        data = []
        targets = {
            target_cols: []
            for target_cols in target_columns.keys()
        }
        for row in data_file:
            cols = sorted(target_columns.items(), key=lambda tup: tup[1], reverse=True)
            for target_col_name, target_col_i in cols:
                targets[target_col_name].append(row.pop(target_col_i))
            data.append(np.asarray(row, dtype=features_dtype))
        targets = {
            target_col_name: np.array(val, dtype=target_dtype)
            for target_col_name, val in targets.items()
        }
        data = np.array(data)
        return Dataset(data=data, target=targets)

feature_columns = [
    tf.contrib.layers.real_valued_column("", dimension=1),
    tf.contrib.layers.real_valued_column("", dimension=2),
]

head = tf.contrib.learn.multi_head([
    tf.contrib.learn.multi_class_head(
        num_flow_classes, label_name="Flow", head_name="Flow"),
    tf.contrib.learn.multi_class_head(
        num_visib_classes, label_name="Visibility", head_name="Visibility"),
])

classifier = tf.contrib.learn.DNNEstimator(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    model_dir="iris_model",
    head=head,
)

def get_input_fn(filename):
    def input_fn():
        dataset = load_csv_without_header(
            fn=filename,
            target_dtype=np.int,
            features_dtype=np.int,
            target_columns={"Flow": 2, "Visibility": 3}
        )
        x = tf.constant(dataset.data)
        y = {k: tf.constant(v) for k, v in dataset.target.items()}
        return x, y
    return input_fn

classifier.fit(input_fn=get_input_fn("tmp_train.csv"), steps=4000)
res = classifier.evaluate(input_fn=get_input_fn("tmp_test.csv"), steps=1)
print("Validation:", res)
Option 2: multi-labeled head
If you keep your CSV data separated by commas, and keep the last column for all the classes a row might have (separated by some token such as space), you can use the following code:
import numpy as np
import tensorflow as tf

all_classes = ["0", "1", "2", "3", "4", "5", "6"]

def k_hot(classes_col, all_classes, delimiter=' '):
    table = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(all_classes)
    )
    classes = tf.string_split(classes_col, delimiter)
    ids = table.lookup(classes)
    num_items = tf.cast(tf.shape(ids)[0], tf.int64)
    num_entries = tf.shape(ids.indices)[0]
    y = tf.SparseTensor(
        indices=tf.stack([ids.indices[:, 0], ids.values], axis=1),
        values=tf.ones(shape=(num_entries,), dtype=tf.int32),
        dense_shape=(num_items, len(all_classes)),
    )
    y = tf.sparse_tensor_to_dense(y, validate_indices=False)
    return y

def feature_engineering_fn(features, labels):
    labels = k_hot(labels, all_classes)
    return features, labels

feature_columns = [
    tf.contrib.layers.real_valued_column("", dimension=1),  # DayOfYear
    tf.contrib.layers.real_valued_column("", dimension=2),  # Temperature
]

classifier = tf.contrib.learn.DNNEstimator(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    model_dir="iris_model",
    head=tf.contrib.learn.multi_label_head(n_classes=len(all_classes)),
    feature_engineering_fn=feature_engineering_fn,
)

def get_input_fn(filename):
    def input_fn():
        dataset = tf.contrib.learn.datasets.base.load_csv_without_header(
            filename=filename,
            target_dtype="S100",  # strings of length up to 100 characters
            features_dtype=np.int,
            target_column=-1
        )
        x = tf.constant(dataset.data)
        y = tf.constant(dataset.target)
        return x, y
    return input_fn

classifier.fit(input_fn=get_input_fn("tmp_train.csv"), steps=4000)
res = classifier.evaluate(input_fn=get_input_fn("tmp_test.csv"), steps=1)
print("Validation:", res)
We are using DNNEstimator with a multi_label_head, which uses sigmoid cross-entropy rather than softmax cross-entropy as the loss function. This means that each of the output units/logits is passed through the sigmoid function, which gives the likelihood of the data point belonging to that class; i.e. the classes are computed independently and are not mutually exclusive, as they are with softmax cross-entropy. This means that you could have anywhere between 0 and len(all_classes) classes set for each row in the training set and in the final predictions.
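To make the difference concrete, here is a small numeric illustration in plain NumPy (independent of the estimator code above):

import numpy as np

logits = np.array([2.0, -1.0, 0.5])
# sigmoid: each class gets an independent probability; they need not sum to 1
sigmoid = 1.0 / (1.0 + np.exp(-logits))          # approx. [0.88, 0.27, 0.62]
# softmax: probabilities are coupled and sum to 1 (mutually exclusive classes)
softmax = np.exp(logits) / np.exp(logits).sum()  # approx. [0.79, 0.04, 0.18]

With sigmoid cross-entropy, a row can legitimately activate two classes at once (e.g. one Flow class and one Visibility class), which is exactly what the space-separated label column encodes.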
Also notice that the classes are represented as strings (and k_hot makes the conversion to token indices), so that you could use arbitrary class identifiers such as category UUIDs in e-commerce settings. If the categories in the 3rd and 4th column are different (Flow ID 1 != Visibility ID 1), you could prepend the column name to each class ID, e.g.
316,8,flow1 visibility4
285,-1,flow1 visibility4
326,8,flow2 visibility5
For a description of how k_hot works, see my other SO answer. I decided to use k_hot as a separate function (rather than defining it directly in feature_engineering_fn) because it's a distinct piece of functionality, and TensorFlow will probably soon have a similar utility function.
Note that if you're now using the first two columns to predict the last two columns, your accuracy will certainly go down, as the last two columns are highly correlated and using one of them gives you a lot of information about the other. Actually, your code was using only the 3rd column, which was kind of a cheat anyway if the goal is to predict the 3rd and 4th columns.