Feature computing using Featuretools - data-science

I am trying to compute features and define feature matrix on a dataset.
The code is as follows:
def compute_features(features, cutoff_time):
    np.random.shuffle(features)
    feature_matrix = ft.calculate_feature_matrix(features,
                                                 cutoff_time=cutoff_time,
                                                 approximate='36d',
                                                 verbose=True, entities=entities, relationships=relationships)
    print("Finishing computing...")
    feature_matrix, features = ft.encode_features(feature_matrix, features,
                                                  to_encode=["pickup_neighborhood", "dropoff_neighborhood"],
                                                  include_unknown=False)
    return feature_matrix
Up to this point, the code works fine. However, when I run the next line:
feature_matrix1 = compute_features(features, cutoff_time)
I get the following error:
KeyError: "['time'] not in index"

Related

Convert a TF-Agents ActorDistributionNetwork into a TensorFlow Lite model

I would like to convert the ActorDistributionNetwork from a trained PPOClipAgent into a TensorFlow Lite model for deployment. How should I accomplish this?
I have tried following this tutorial (see the section at the bottom on converting a policy to TFLite), but the network outputs a single action (the policy) rather than the density function over actions that I want.
I think perhaps something like this could work:
tf.compat.v2.saved_model.save(actor_net, saved_model_path, signatures=?)
... if I knew how to set the signatures parameter. That line of code executes without error when I omit the signatures parameter, but I get the following error on load (I assume because the signature is not set up correctly):
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)
File "/home/ais/salesmentor.ai/MDPSolver/src/solver/ppo_budget.py", line 336, in train_eval
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)
File "/home/ais/.local/lib/python3.9/site-packages/tensorflow/lite/python/lite.py", line 1275, in from_saved_model
raise ValueError("Only support a single signature key.")
ValueError: Only support a single signature key.
This appears to work. I won't accept the answer until I have completed an end-to-end test, though.
import os
import tensorflow as tf
def export_model(actor_net, observation_spec, saved_model_path):
    # Expose the logits of the action distribution under a single signature key.
    predict_signature = {
        'action_pred':
            tf.function(func=lambda x: actor_net(x, None, None)[0].logits,
                        input_signature=(tf.TensorSpec(shape=observation_spec.shape),)
                        )
    }
    tf.saved_model.save(actor_net, saved_model_path, signatures=predict_signature)
    # Convert to TensorFlow Lite model.
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path,
                                                         signature_keys=["action_pred"])
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,  # enable TensorFlow Lite ops.
        tf.lite.OpsSet.SELECT_TF_OPS  # enable TensorFlow ops.
    ]
    tflite_policy = converter.convert()
    with open(os.path.join(saved_model_path, 'policy.tflite'), 'wb') as f:
        f.write(tflite_policy)
The solution wraps the actor_net in a lambda because I was unable to figure out how to specify the signature with all three expected arguments. Through the lambda, I convert the function into using a single argument (a tensor). I expect to pass None to the other two arguments in my use case, so there is nothing lost in this approach.
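Since the exported signature returns logits rather than probabilities, one more step is needed to get a density over actions. The following is a minimal sketch in plain Python, assuming the action distribution is categorical (discrete actions), in which case a softmax over the logits gives the probabilities. The softmax helper and the example_logits values are made up for illustration; in practice the logits would come from running the 'action_pred' signature.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits standing in for the output of the exported model.
example_logits = [2.0, 1.0, 0.1]

# Probability of each action under a categorical distribution.
probabilities = softmax(example_logits)

# Index of the most likely action, if a single action is needed after all.
most_likely_action = max(range(len(probabilities)), key=probabilities.__getitem__)
```

If the actor network models continuous actions instead, the distribution has no logits and this sketch does not apply.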
I see you are using CartPole as the simulation environment, a DQN agent, and the model training and evaluation from the linked TF-Agents Checkpointer tutorial. To keep things simple, you need to understand the distributions involved and the limits of your model (fewer than 6 actions determined at a time).
A discrete distribution answers the question directly, whereas the links show how the DQN agent is implemented in TF-Agents.
In temp = tf.random.normal([10], 1, 0.2, tf.float32), the mean is 1 and the standard deviation is 0.2, so the 10 sampled values all lie close to 1 with a spread of about 0.2. With 10 actions to choose between, the chance that the result is the same action is 1 in 5 or 0.5.
The coefficients act as ladder steps; you can think of them as IF/ELSE or SWITCH conditions over ranges such as 0 to 5, 5 to 10, 10 to 15, and so on.
The product of the coefficient matrix and the random values selects 4 to 5 actions, sorted by priority and significance, and picks the ones with the most effect in each row.
The ArgMax ranges from 0 to 9, that is, actions 0 to 9 that respond to the covariances of the environment input.
Sample: to the point, random distributions and selective agents (we call it a selective agent; the questioner may have confused it with a neural-network DQN).
temp = tf.random.normal([10], 1, 0.2, tf.float32)
temp = np.asarray(temp) * np.asarray([ coefficient_0, coefficient_1, coefficient_2, coefficient_3, coefficient_4, coefficient_5, coefficient_6, coefficient_7, coefficient_8, coefficient_9 ])
temp = tf.nn.softmax(temp)
action = int(np.argmax(temp))

H2O | Extended Isolation Forest | model.explain() gives KeyError: 'response_column'

I have been struggling with this error for a few hours now, and I am still lost even after reading through the documentation.
I'm using H2O's Extended Isolation Forest (EIF), an unsupervised model, to detect anomalies in an unlabelled dataset. This works as intended, but for the project I'm working on, model explainability is extremely important. I discovered the explain function, which supposedly returns several explainability methods for a model. I'm particularly interested in the SHAP values from this function.
The documentation states
The main functions, h2o.explain() (global explanation) and h2o.explain_row() (local explanation) work for individual H2O models, as well a list of models or an H2O AutoML object. The h2o.explain() function generates a list of explanations.
Since the H2O models link brings me to a page which covers both supervised and unsupervised models I assume the explain function would work for both types of models.
The following code runs just fine:
import h2o
from h2o.estimators import H2OExtendedIsolationForestEstimator
h2o.init()
df_EIF = h2o.H2OFrame(df_EIF)
predictors = df_EIF.columns[0:37]
eif = H2OExtendedIsolationForestEstimator(ntrees=75, sample_size=500, extension_level=len(predictors) - 1)
eif.train(x=predictors, training_frame=df_EIF)
eif_result = eif.predict(df_EIF)
df_EIF['anomaly_score_EIF'] = eif_result['anomaly_score']
However, calling explain on the model (eif)
eif.explain(df_EIF)
gives me the following KeyError:
KeyError Traceback (most recent call last)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.py in <module>
----> 1 eif.explain(df_EIF)
2
3
4
5
C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in explain(models, frame, columns, top_n_features, include_explanations, exclude_explanations, plot_overrides, figsize, render, qualitative_colormap, sequential_colormap)
2895 plt = get_matplotlib_pyplot(False, raise_if_not_available=True)
2896 (is_aml, models_to_show, classification, multinomial_classification, multiple_models, targets,
-> 2897 tree_models_to_show, models_with_varimp) = _process_models_input(models, frame)
2898
2899 if top_n_features < 0:
C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in _process_models_input(models, frame)
2802 models_with_varimp = [model for model in models if _has_varimp(model)]
2803 tree_models_to_show = _get_tree_models(models, 1 if is_aml else float("inf"))
-> 2804 y = _get_xy(models_to_show[0])[1]
2805 classification = frame[y].isfactor()[0]
2806 multinomial_classification = classification and frame[y].nlevels()[0] > 2
C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in _get_xy(model)
1790 """
1791 names = model._model_json["output"]["original_names"] or model._model_json["output"]["names"]
-> 1792 y = model.actual_params["response_column"]
1793 not_x = [
1794 y,
KeyError: 'response_column'
From my understanding, this response column refers to the column you are trying to predict. However, since I'm dealing with an unlabelled dataset, this response column doesn't exist. Is there a way for me to bypass this error? Is it even possible to use the explain() function on unsupervised models? If so, how do I do this? If it is not possible, is there another way to extract the SHAP values of each variable from the model? shap.TreeExplainer also doesn't seem to work on an H2O model.
TL;DR: Is it possible to use the .explain() function from h2o on an (Extended) Isolation forest? If so how?
Unfortunately, the explain method in H2O-3 is supported only for the supervised algorithms.
What you could do is to use a surrogate model and look at explanations on it.
Basically, you'd fit a GBM (or DRF as those 2 models support the TreeSHAP) on the data + the prediction of the Extended Isolation Forest which would be the response.
Here is another approach to explaining the predictions of (E)IF: https://github.com/h2oai/h2o-tutorials/blob/master/tutorials/isolation-forest/interpreting_isolation-forest.ipynb
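The surrogate idea could be sketched roughly as follows. This is only an illustration under assumptions: the function name fit_surrogate, the ntrees and seed values, and the default column name (taken from the question's code) are my own choices, and the function needs a running H2O cluster and an H2OFrame that already contains the anomaly score column.

```python
def fit_surrogate(frame, predictors, score_col="anomaly_score_EIF", ntrees=50):
    """Fit a GBM that predicts the EIF anomaly score and return it with its
    per-row SHAP contributions.

    Assumes h2o.init() has been called and `frame` is an H2OFrame that
    already holds the anomaly score in `score_col`.
    """
    # Imported here so that merely defining the function does not need h2o.
    from h2o.estimators import H2OGradientBoostingEstimator

    surrogate = H2OGradientBoostingEstimator(ntrees=ntrees, seed=42)
    # The anomaly score plays the role of the response column.
    surrogate.train(x=list(predictors), y=score_col, training_frame=frame)
    # One column per predictor plus a BiasTerm column, one row per observation.
    contributions = surrogate.predict_contributions(frame)
    return surrogate, contributions
```

With the question's variables, the call would look like surrogate, contributions = fit_surrogate(df_EIF, predictors). Because the surrogate is a supervised model, surrogate.explain(df_EIF) should then have a response column to work with. Keep in mind that the explanations describe the surrogate's approximation of the anomaly score, not the Extended Isolation Forest itself.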

Broadcast error when using autofeat for automated feature engineering

When trying to use autofeat (https://github.com/cod3licious/autofeat) to automatically generate new features, I receive the following error:
operands could not be broadcast together with shapes (963,) (962,)
Minimal code:
model = AutoFeatRegression(n_jobs=-1,verbose=1,max_gb=1)
X_new = model.fit_transform(X,y)
where X and y are both pandas DataFrames.

Dataset API 'flat_map' method producing error for same code which works with 'map' method

I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. Using the flat_map method produces errors, but with the map method I am able to build the code and run it in a session. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not a bug and they asked me to post here.
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)
def _get_data_for_dataset(file_name, rows=100):
    print(file_name.decode())
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    X_data.astype('float32', copy=False)
    return X_data
dataset = tf.data.Dataset.from_tensor_slices(file_names)
dataset = dataset.flat_map(lambda file_name: tf.py_func(_get_data_for_dataset,
                                                        [file_name], tf.float64))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
I get the following error: map_func must return a Dataset object. The pipeline works without error when I use map, but it doesn't give the output I want. For example, if Pandas reads N rows from each of my CSV files, I want the pipeline to concatenate data from B files and give me an array with shape (N*B, 2). Instead, it gives me (B, N, 2), where B is the batch size. map adds another axis instead of concatenating along the existing axis. From what I understood in the documentation, flat_map is supposed to give a flattened output. In the documentation, both map and flat_map return type Dataset. So why does my code work with map and not with flat_map?
It would also be great if you could point me towards code where the Dataset API has been used with the Pandas module.
As mikkola points out in the comments, the Dataset.map() and Dataset.flat_map() expect functions with different signatures: Dataset.map() takes a function that maps a single element of the input dataset to a single new element, whereas Dataset.flat_map() takes a function that maps a single element of the input dataset to a Dataset of elements.
If you want each row of the array returned by _get_data_for_dataset() to become a separate element, you should use Dataset.flat_map() and convert the output of tf.py_func() to a Dataset, using Dataset.from_tensor_slices():
folder_name = './data/power_data/'
file_names = os.listdir(folder_name)
def _get_data_for_dataset(file_name, rows=100):
    df_input = pd.read_csv(os.path.join(folder_name, file_name.decode()),
                           usecols=['Wind_MWh', 'Actual_Load_MWh'], nrows=rows)
    X_data = df_input.as_matrix()
    return X_data.astype('float32', copy=False)
dataset = tf.data.Dataset.from_tensor_slices(file_names)
# Use `Dataset.from_tensor_slices()` to make a `Dataset` from the output of
# the `tf.py_func()` op.
dataset = dataset.flat_map(lambda file_name: tf.data.Dataset.from_tensor_slices(
    tf.py_func(_get_data_for_dataset, [file_name], tf.float32)))
dataset = dataset.batch(2)
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
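To make the difference between the two methods concrete, here is a minimal plain-Python sketch of the semantics, with no TensorFlow involved. The helpers load_rows, batch and shape are hypothetical stand-ins (load_rows plays the role of _get_data_for_dataset() and returns made-up numbers), so treat it as an illustration of the behaviour rather than of the Dataset API itself.

```python
from itertools import chain


def load_rows(file_name, rows=3):
    """Hypothetical stand-in for _get_data_for_dataset(): `rows` rows, 2 columns."""
    return [[float(i), float(i) * 10.0] for i in range(rows)]


def batch(elements, size):
    """Group consecutive elements into lists of at most `size` items."""
    elements = list(elements)
    return [elements[i:i + size] for i in range(0, len(elements), size)]


def shape(nested):
    """Return the dimensions of a regularly nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


file_names = ["a.csv", "b.csv"]  # B = 2 files, N = 3 rows each

# map-like: one element per file, so batching stacks whole files.
mapped = [load_rows(name) for name in file_names]
mapped_batch = batch(mapped, 2)[0]

# flat_map-like: every file becomes a sequence of rows, and the
# sequences are concatenated into one stream of rows.
flat = list(chain.from_iterable(load_rows(name) for name in file_names))
flat_batch = batch(flat, 6)[0]  # batch size N * B = 6

print(shape(mapped_batch))  # (2, 3, 2)
print(shape(flat_batch))    # (6, 2)
```

One consequence worth noting: after flat_map, each element is a single row, so batch(2) groups two rows rather than two files. To get an array of shape (N*B, 2), the batch size would need to be N*B.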

TensorFlow example, MemoryError while running text_classification_character_cnn.py

I'm trying to run https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification_character_cnn.py for learning, but I get an error message:
File "C:\Users\natlun\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\datasets\base.py", line 72, in load_csv_without_header
data = np.array(data)
MemoryError
I use a CPU installation of TensorFlow and Python 3.5. Any ideas how to solve the problem? Other scripts that use a CSV file for input work fine.
I was having the same issue. And after many hours of reading and googling (and seeing your unanswered question), and just comparing the example with other examples that do run, I noticed that
dbpedia = tf.contrib.learn.datasets.load_dataset(
    'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data, size='large')
should just be
dbpedia = tf.contrib.learn.datasets.load_dataset(
    'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
Based off of what I've read about numpy, I'd bet the "size='large'" parameter causes an over allocation to a numpy array (which throws the memory error).
Or, when you don't set that parameter perhaps the input data is truncated.
Or some other thing. Anyway, I hope this helps others attempting to run this useful example!
--- Update ---
Without "size='large'" the load_dataset function appears to create smaller training and test data sets (about 1/1000 the size).
After playing around with the example, I realized I could manually load and use the whole data set without getting the memory error (assuming it really is using the whole data set, as it appears to).
import csv
import pandas
# Prepare training and testing data
## This was the provided method for setting up the data.
# dbpedia = tf.contrib.learn.datasets.load_dataset(
#     'dbpedia', test_with_fake_data=FLAGS.test_with_fake_data)
# x_trainz = pandas.DataFrame(dbpedia.train.data)[1]
# y_trainz = pandas.Series(dbpedia.train.target)
# x_testz = pandas.DataFrame(dbpedia.test.data)[1]
# y_testz = pandas.Series(dbpedia.test.target)
## And this is my replacement.
x_train = []
y_train = []
x_test = []
y_test = []
# Column 0 holds the class label and column 2 the text.
with open("dbpedia_data/dbpedia_csv/train.csv", encoding='utf-8') as filex:
    reader = csv.reader(filex)
    for row in reader:
        x_train.append(row[2])
        y_train.append(int(row[0]))
with open("dbpedia_data/dbpedia_csv/test.csv", encoding='utf-8') as filex:
    reader = csv.reader(filex)
    for row in reader:
        x_test.append(row[2])
        y_test.append(int(row[0]))
x_train = pandas.Series(x_train)
y_train = pandas.Series(y_train)
x_test = pandas.Series(x_test)
y_test = pandas.Series(y_test)
The example now seems to be evaluating the whole training data set. However, the original code will probably need to be run once to download the data and put it in the correct sub-folders. Also, even while evaluating the whole data set, little memory is used (just a few hundred MB), which makes me think that the load_dataset function is broken in some way.