Dask DataFrame - Prediction of Keras Model - tensorflow

I am working with Dask for the first time and trying to run predict() from a trained Keras model.
If I don't use Dask, the function works fine (i.e. pd.DataFrame() versus dd.DataFrame()). With Dask, the error is below. Is this not a common use case (aside from scoring a groupby, perhaps)?
def calc_HR_ind_dsk(grp):
    model = keras.models.load_model('/home/embedding_model.h5')
    topk = 10
    x = [grp['user'].values, grp['item'].values]
    pred_act = list(zip(model.predict(x)[:, 0], grp['respond'].values))
    top = sorted(pred_act, key=lambda x: -x[0])[0:topk]
    hit = sum([x[1] for x in top])
    return hit
import dask.dataframe as dd
# step 1 - read in data as a dask df. We could reference more than one file using the '*' wildcard
df = dd.read_csv('/home/test_coded_final.csv', dtype='int64')
results = df.groupby('user').apply(calc_HR_ind_dsk).compute()
TypeError: Cannot interpret feed_dict key as Tensor: Tensor Tensor("Placeholder_30:0", shape=(55188, 32), dtype=float32) is not an element of this graph.

I found the answer. It is an issue with Keras/TensorFlow: https://github.com/keras-team/keras/issues/2397
The code below worked, and using Dask shaved 50% off the time versus the standard pandas groupby.
#dask
model = keras.models.load_model('/home/embedding_model.h5')

#this part
import tensorflow as tf
global graph
graph = tf.get_default_graph()

def calc_HR_ind_dsk(grp):
    topk = 10
    x = [grp['user'].values, grp['item'].values]
    with graph.as_default():  # and this part, from https://github.com/keras-team/keras/issues/2397
        pred_act = list(zip(model.predict(x)[:, 0], grp['respond'].values))
    top = sorted(pred_act, key=lambda x: -x[0])[0:topk]
    hit = sum([x[1] for x in top])
    return hit

import dask.dataframe as dd
df = dd.read_csv('/home/test_coded_final.csv', dtype='int64')
results = df.groupby('user').apply(calc_HR_ind_dsk).compute()

Have a look at:
http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.groupby.DataFrameGroupBy.apply
Unlike pandas, many Dask functions that let you define your own custom op need the meta parameter. Without it, Dask tests your custom function on dummy data to infer the output schema, which can pass unexpected inputs to Keras that would not occur during the actual compute() call.
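For example, a minimal sketch of passing meta here; the output name 'hit' is an assumption (calc_HR_ind_dsk returns one number per group):
# meta describes the output (a per-group int64 Series), so Dask can build
# the task graph without probing calc_HR_ind_dsk on dummy data
results = (df.groupby('user')
             .apply(calc_HR_ind_dsk, meta=('hit', 'int64'))
             .compute())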

A different answer I wrote might help here (the use-case was using Dask with a pre-trained ML model to predict on 1,000,000 examples): https://stackoverflow.com/a/59015702/4900327

Related

K-Means in TensorFlow - Graph disconnected error

I am trying to write a function that runs KMeans on a dataset and outputs the cluster centroids. My aim is to use this in a custom Keras layer, so I am using TensorFlow's implementation of KMeans, which takes a tensor as the input dataset.
My problem, however, is that I can't make it work even as a standalone function. The problem comes from the fact that KMeans accepts a generator function that provides mini-batches instead of a plain tensor, but when I use a closure to do that, I get a graph disconnected error:
import tensorflow as tf  # version: 2.4.1
from tensorflow.compat.v1.estimator.experimental import KMeans

@tf.function
def KMeansCentroids(inputs, num_clusters, steps, use_mini_batch=False):
    # `inputs` is a 2D tensor
    def input_fn():
        # Each one of the lines below results in the same "Graph disconnected"
        # error. The tuples aren't really needed; they just keep the return
        # value consistent with the documentation.
        return (inputs, None)
        return (tf.data.Dataset.from_tensor_slices(inputs), None)
        return (tf.convert_to_tensor(inputs), None)

    kmeans = KMeans(
        num_clusters=num_clusters,
        use_mini_batch=use_mini_batch)
    kmeans.train(input_fn, steps=steps)  # This is where the error happens
    return kmeans.cluster_centers()
>>> x = tf.random.uniform((100, 2))
>>> c = KMeansCentroids(x, 5, 10)
The exact error is:
ValueError:
Tensor("strided_slice:0", shape=(), dtype=int32)
must be from the same graph as
Tensor("Equal:0", shape=(), dtype=bool)
(graphs are FuncGraph(name=KMeansCentroids, id=..) and <tensorflow.python.framework.ops.Graph object at ...>).
If I were to use a numpy dataset and convert it to a tensor inside the function, the code would work just fine.
Also, making input_fn() return tf.random.uniform((100, 2)) directly (ignoring the inputs argument) would work again. That's why I am guessing that TensorFlow doesn't support closures here, since it needs to build the computation graph at the beginning.
But I don't see how to work around that.
Could it be a version error due to KMeans being a compat.v1.experimental module?
Note that the documentation of KMeans states for the input_fn():
The function should construct and return one of the following:
A tf.data.Dataset object: Outputs of Dataset object must be a tuple (features, labels) with same constraints as below.
A tuple (features, labels): Where features is a tf.Tensor or a dictionary of string feature name to Tensor and labels is a Tensor or a dictionary of string label name to Tensor. Both features and labels are consumed by model_fn. They should satisfy the expectation of model_fn from inputs.
The problem you're facing is about using a tensor from outside the created graph. Basically, when you call the .train function, a new graph is created, consisting of the graph defined in input_fn and the graph defined in model_fn.
kmeans.train(input_fn, steps=steps)
After that, any tensors coming from outside these functions are treated as outsiders and are not part of this new graph. That's why you're getting a graph disconnected error when trying to use an outside tensor. To resolve this, you need to create the necessary tensors within these functions.
import tensorflow as tf
from tensorflow.compat.v1.estimator.experimental import KMeans

@tf.function
def KMeansCentroids(num_clusters, steps, use_mini_batch=False):
    def input_fn(batch_size):
        # Create the input tensors inside input_fn, so they belong to the
        # graph the estimator builds.
        pinputs = tf.random.uniform((100, 2))
        dataset = tf.data.Dataset.from_tensor_slices(pinputs)
        dataset = dataset.shuffle(1000).repeat()
        return dataset.batch(batch_size)

    kmeans = KMeans(
        num_clusters=num_clusters,
        use_mini_batch=use_mini_batch)
    kmeans.train(input_fn=lambda: input_fn(5),
                 steps=steps)
    return kmeans.cluster_centers()

c = KMeansCentroids(5, 10)
Here is some more info for reading. FYI, I tested your code with a few versions of TF > 2, and I don't think it's related to a version error.
Re-mentioning here for future readers. An alternative for using KMeans within Keras layers:
tf_kmeans.py
ClusteringLayer

Simple way to convert tensor to numpy array without eager mode in TF 2.2

I can't find a simple way to convert a tensor to a NumPy array without enabling eager mode, which gives a nice .numpy() method but also slows down my model training.
I'd be super grateful for your suggestions. For context, I'm writing a custom metric for my TensorFlow model that relies on a scikit-learn function, which only takes NumPy arrays.
I've tried wrapping the tensors with np.array(), which throws a NotImplementedError. I also gave sessions and .eval() a go, but didn't get them to work either, and they seemed like too much for this simple job.
My specific error:
NotImplementedError: Cannot convert a symbolic Tensor (model_17/dense_17/Sigmoid:0) to a numpy array.
# Custom metric
def accuracy_ml(y_true, y_pred):
    return accuracy_score(y_true, np.round(y_pred))  # ERROR here: feeding a tensor to an sklearn function

# Model
cnn = simple_model(input_shape=(224, 224, 3),
                   num_classes=10,
                   base_model=base_ResNet101)

lr = 1e-2
loss_fn = tf.keras.losses.BinaryCrossentropy()
metrics = [accuracy_ml]

cnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
            loss=loss_fn,
            metrics=metrics)

# Simple baseline eval that fails
validation_steps = 17
loss0, accuracy0 = cnn.evaluate(validation_batches, steps=validation_steps)
Wrapping my NumPy metric with tf.numpy_function() solved it. https://www.tensorflow.org/api_docs/python/tf/numpy_function
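For example, a minimal sketch of the wrapped metric; the cast to float32 (to match Tout) and the sklearn import are assumptions:
import numpy as np
import tensorflow as tf
from sklearn.metrics import accuracy_score

def accuracy_ml(y_true, y_pred):
    # tf.numpy_function executes the wrapped callable eagerly at run time,
    # so accuracy_score receives real NumPy arrays, not symbolic tensors
    return tf.numpy_function(
        lambda yt, yp: np.float32(accuracy_score(yt, np.round(yp))),
        [y_true, y_pred],
        tf.float32)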

Using the What-If Tool with XGBoost

I am trying to use the What-If Tool on my XGBoost model.
But at the link I am only able to find examples of XGBoost used through Google AI Platform. Is there any way we can use the What-If Tool on XGBoost without Google AI Platform?
I tried the functions that were used in the examples for TensorFlow and Keras, namely set_estimator_and_feature_spec and set_compare_custom_predict_fn:
bst = xgb.XGBClassifier(
    objective='reg:logistic'
)
bst.fit(x_train, y_train)

test_examples = df_to_examples(df_test)
config_builder = WitConfigBuilder(test_examples).set_custom_predict_fn(bst.predict)
WitWidget(config_builder)
When trying to run inference, an error message is displayed: cannot initialize DMatrix from a list, and I am unable to get past it.
After a lot of trial and error I finally got it to work with XGBoost using the following code:
# The first argument in my case is a Pandas DataFrame of all features plus the
# target 'label'; extract just the NumPy array with 'values', then convert
# that to a list. The second argument is the column names as a list.
# I use an sklearn Pipeline, so here I am just accessing the classifier step,
# which is an XGBClassifier instance.
# The What-If Tool expects a 2D array, where the 1st dimension is each sample
# and the 2nd dimension is the probability of each class, so use
# 'predict_proba' over 'predict'.
config_builder = (WitConfigBuilder(df_sample.values.tolist(), df_sample.columns.tolist())
                  .set_custom_predict_fn(clf['classifier'].predict_proba)
                  .set_target_feature('label')
                  .set_label_vocab(['No Churn', 'Churn']))
This eliminated the need to use their suggested helper functions, and it works out of the box with Pandas DataFrames and sklearn models.
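For completeness, a minimal notebook usage sketch (the height argument is optional):
from witwidget.notebook.visualization import WitWidget

# render the configured tool in a Jupyter notebook cell
WitWidget(config_builder, height=800)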

Dataframes, csv, and CNTK

I have been playing around with CNTK and am finding that models can only be trained using NumPy arrays. Is this correct?
This makes sense for image recognition etc.
How would I turn my tidy dataset (read in as a DataFrame using pandas) into a format that I can train a logistic regression with? I have tried to read it into a NumPy array:
np.genfromtxt("My.csv", delimiter=',', dtype=float)
and I have also tried to wrap the variable with:
np.array(MyVariable).astype('float32')
But I do not get the result I want to be able to feed a model.
I also cannot find anything in the tutorials about how to do ML on tabular DataFrames in CNTK.
Is it not supported?
CNTK 104 shows how to use pandas dataframes and numpy.
https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_104_Finance_Timeseries_Basic_with_Pandas_Numpy.ipynb
CNTK 106B shows how you could read data using csv files.
https://github.com/Microsoft/CNTK/blob/master/Tutorials/CNTK_106B_LSTM_Timeseries_with_IOT_Data.ipynb
Thanks for these links. This is how I ended up reading in the csv. It seemed to work, but Sayan, please correct as needed:
def generate_data_from_csv():
    # Try to find the data file locally. If it doesn't exist, report
    # "file does not exist"; if it does, report "using local file".
    data_path = os.path.join("MyPath")
    csv_file = os.path.join(data_path, "My.csv")
    if not os.path.exists(data_path):
        os.makedirs(data_path)
    if not os.path.exists(csv_file):
        print("file does not exist")
    else:
        print("using local file")
    df = pd.read_csv(csv_file, usecols=["predictor1", "predictor2",
                                        "predictor3", "predictor4",
                                        "dependent_variable"],
                     dtype=np.float32)
    return df
Then I saved that dataframe as training_data:
training_data = generate_data_from_csv()
I then turned that dataframe into NumPy arrays as follows:
training_features = np.asarray(training_data[["predictor1",
    "predictor2", "predictor3", "predictor4"]], dtype="float32")
training_labels = np.asarray(training_data[["dependent_variable"]],
    dtype="float32")
Then, to train the model, I used this code:
features, labels = training_features[:, [0, 1, 2, 3]], training_labels
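For illustration, a hedged sketch of feeding these arrays into a minimal CNTK logistic regression; everything beyond features and labels (the layer, learner, and schedule) is an assumption, not the tutorials' exact code:
import cntk as C

x = C.input_variable(4)   # four predictors
y = C.input_variable(1)   # one binary dependent variable

z = C.layers.Dense(1, activation=C.sigmoid)(x)        # logistic regression
loss = C.binary_cross_entropy(z, y)
learner = C.sgd(z.parameters, C.learning_parameter_schedule(0.1))
trainer = C.Trainer(z, (loss, None), [learner])

# float32 NumPy arrays feed directly into the trainer
trainer.train_minibatch({x: features, y: labels})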

writing a custom cost function in tensorflow

I'm trying to write my own cost function in TensorFlow; however, apparently I cannot 'slice' the tensor object?
import tensorflow as tf
import numpy as np

# Establish variables
x = tf.placeholder("float", [None, 3])
W = tf.Variable(tf.zeros([3, 6]))
b = tf.Variable(tf.zeros([6]))

# Establish model
y = tf.nn.softmax(tf.matmul(x, W) + b)

# Truth
y_ = tf.placeholder("float", [None, 6])

def angle(v1, v2):
    return np.arccos(np.sum(v1 * v2, axis=1))

def normVec(y):
    return np.cross(y[:, [0, 2, 4]], y[:, [1, 3, 5]])

angle_distance = -tf.reduce_sum(angle(normVec(y_), normVec(y)))

# This is the example code they give for cross entropy
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
I get the following error:
TypeError: Bad slice index [0, 2, 4] of type <type 'list'>
At present, TensorFlow can't gather on axes other than the first; support for that has been requested.
But for what you want to do in this specific situation, you can transpose, then gather 0, 2, 4, and then transpose back. It won't be crazy fast, but it works:
tf.transpose(tf.gather(tf.transpose(y), [0, 2, 4]))
This is a useful workaround for some of the limitations in the current implementation of gather.
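For instance, a hedged sketch of normVec rewritten this way in pure TF, assuming a build that provides tf.cross (later tf.linalg.cross):
def normVecTF(y):
    # gather columns 0,2,4 and 1,3,5 via transpose/gather/transpose,
    # then take the row-wise cross product in TensorFlow instead of NumPy
    even = tf.transpose(tf.gather(tf.transpose(y), [0, 2, 4]))
    odd = tf.transpose(tf.gather(tf.transpose(y), [1, 3, 5]))
    return tf.cross(even, odd)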
(But it is also correct that you can't use a numpy slice on a tensorflow node; you can run it and slice the output, and you also need to initialize those variables before you run.) You're mixing tf and np in a way that doesn't work.
x = tf.Something(...)
is a tensorflow graph object. Numpy has no idea how to cope with such objects.
foo = sess.run(x)
is back to an object python can handle.
You typically want to keep your loss calculation in pure tensorflow, so do the cross product and other functions in tf. You'll probably have to do the arccos the long way, as tf doesn't have a function for it.
I just realized that the following fails:
cross_entropy = -tf.reduce_sum(y_ * np.log(y))
You can't use numpy functions on tf objects, and the indexing may be different too.
I think you can use the "Wraps Python function" method, tf.py_func, in tensorflow; see its documentation.
And as for the people who answered "Why don't you just use tensorflow's built-in functions to construct it?": sometimes the cost function people are looking for cannot be expressed in tf's functions, or is extremely difficult to express.
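For instance, a minimal sketch with the TF 1.x-era tf.py_func; norm_vec_np is a hypothetical NumPy version of the question's normVec:
def norm_vec_np(v):
    # plain NumPy runs here at graph-execution time; cast so the output
    # matches the declared Tout
    return np.cross(v[:, [0, 2, 4]], v[:, [1, 3, 5]]).astype(np.float32)

norm_vec_op = tf.py_func(norm_vec_np, [y_], tf.float32)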
This is because you have not initialized your variables, so the graph does not hold your actual values right now (you can read more in my answer here).
Just do something like this:
def normVec(y):
    print(y)
    return np.cross(y[:, [0, 2, 4]], y[:, [1, 3, 5]])

t1 = normVec(y_)
# and comment out everything after it.
You will see that you do not have actual values yet, only the symbolic Tensor("Placeholder_1:0", shape=TensorShape([Dimension(None), Dimension(6)]), dtype=float32).
Try initializing your variables:
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
and evaluate your variable with sess.run(y). P.S. You have not fed your placeholders up to this point.