Link Probability prediction between two nodes using Machine Learning or Deep Learning where node to node mapping is given - tensorflow

Can someone please direct me to a tutorial provide a starting idea for the problem given below.
I have a mapping of Authors to co authors given as follows:
mapping
>>
{0: [2860, 3117],
1: [318, 1610, 1776, 1865, 2283, 2507, 3076, 3108, 3182, 3357, 3675, 4040],
2: [164, 413, 1448, 1650, 3119, 3238],
} # this is just sample
link_attributes.iloc[:5,:7]
>>
first id keyword_0 keyword_10 keyword_13 keyword_15 keyword_2
0 4 0 1.0 1.0 1.0 1.0
1 9 1 1.0 1.0 1.0 1.0
2 7 2 1.0 NaN 1.0 1.0
3 6 3 1.0 1.0 NaN 1.0
4 9 4 1.0 1.0 1.0 1.0
I have to predict the probability of having a link between a Source and Sink
For example if I am given a Source=13 and Sink=31 then I have to find the probability of having a link between 13 and 31. All the links are un-directed.

import json
import numpy
from keras import Sequential
from keras.layers import Dense
def get_keys(data, keys): # get all keys from json file
if isinstance(data, list):
for item in data:
get_keys(item, keys)
if isinstance(data, dict):
sub_keys = data.keys()
for sub_key in sub_keys:
keys.append(sub_key)
# get all keys, each key is a feature of instances
json_data = open("nodes.json") # read 4016 instances
jdata = json.load(json_data)
keys = []
get_keys(jdata, keys)
keys = set(keys)
print(set(keys))
def build_instance(json_object): # use to build instance from json object, ex: instance = [f0,f1,f2,f3,....f404]
features = []
features.append(json_object.get('id'))
for key in keys:
value = json_object.get(key)
if value is None:
value = 0
elif key == 'id':
continue
features.append(value)
return features
# read all instances and format them, each instance will be [f0,f1, f2,...], as i read from json file, each instance will have 405 features
instances = []
num_of_instances = 0
for item in jdata:
features = build_instance(item)
instances.append(features)
num_of_instances = num_of_instances + 1
print(num_of_instances)
# read "author_id - co author ids" file
traintxt = open('train.txt', 'r')
lines = traintxt.readlines()
au_vs_co_auth_list = []
for line in lines:
line = line.split('\t', 200)
print(line)
# convert value from string to int
string = line[0] # example line[0] = '14 445'
id_vs_coauthor = string.split(" ", 200)
id = id_vs_coauthor[0]
co_author = id_vs_coauthor[1]
line[0:1] = [int(id), int(co_author)]
for i in range(2, len(line)):
line[i] = int(line[i])
au_vs_co_auth_list.append(line)
print(len(au_vs_co_auth_list)) # we have 4016 authors
X_train = []
Y_train = []
generated_train_pairs = []
train_num = 30000 # choose 30000 random training instances
for i in range(train_num):
print(i)
index1 = numpy.random.randint(0, len(au_vs_co_auth_list), 1)[0]
co_authors_of_index1 = au_vs_co_auth_list[index1]
author_id_of_index_1 = au_vs_co_auth_list[index1][0]
if index1 % 2 == 0: # try to create a sample that two author is not related
index2 = numpy.random.randint(0, len(au_vs_co_auth_list), 1)[0]
author_id_of_index_2 = au_vs_co_auth_list[index2][0]
# make sure id1 != id2 and auth 1 and auth2 are not related
while (index1 == index2) or (author_id_of_index_2 in co_authors_of_index1):
index2 = numpy.random.randint(0, len(au_vs_co_auth_list), 1)[0]
author_id_of_index_2 = au_vs_co_auth_list[index2][0]
y = [0, 1] # [relative=FALSE,non-related = TRUE]
else: # try to create a sample that two author is related
author_id_of_index_2 = numpy.random.randint(1, len(co_authors_of_index1),size=1)[0]
y = [1, 0] # [relative=TRUE,non-related = FALSE]
x = instances[author_id_of_index_1][1:] + instances[author_id_of_index_2][
1:] # x = [feature1, feature2,...feature404',feature1', feature2',...feature404']
X_train.append(x)
Y_train.append(y)
X_train = numpy.asarray(X_train)
Y_train = numpy.asarray(Y_train)
print(X_train.shape)
print(Y_train.shape)
# now we have x_train, y_train, build model right now
model = Sequential()
model.add(Dense(512, input_shape=X_train[0].shape, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=512, epochs=3, verbose=2)
model.save("model.h5")
# now to predict probability of linking between two author ids
id1 = 11 # just random
id2 = 732 # just random
author1 = None
author2 = None
for item in jdata:
if item.get('id') == id1:
author1 = build_instance(item)
if item.get('id') == id2:
author2 = build_instance(item)
if author1 is not None and author2 is not None:
break
x_test = author1[1:] + author2[1:]
x_test = numpy.expand_dims(numpy.asarray(x_test), axis=0)
probability = model.predict(x_test)
print("author id ", id1, " and author id ", id2, end=" ")
if probability[0][1] > probability[0][0]:
print("Not related")
else:
print("Related")
print(probability)
Output:
author id 11 and author id 732 related

Before diving into how to find a solution, I recommend understand your data well, and spend a good part of your time digesting the problem and preparing a dataset.
So from the scenario you described it seems to me your problem is given two Nodes and their attributes predict if there is a link this can interpreted as a binary classification task. I will provide an initial minimalistic simple solution.
what confused me is that you mentioned you have only link_attributes.iloc[:5,:7] link_attributes but not node attributes. In the case you have node attributes it makes more sense because then we just make a combinations of pairs of nodes, and label the pairs wich are not connected as 0 or not_connected and the ones connected as 1 or connected.
So let's make a dataset. As I'm didn't exactly understand what the link attributes mean, let's generate some random data but we can adapt a better example if you edit your question with more details about your data.
About creating a Dataset
For every nodes in the mapping we will create 10 dummy random columns just for the sake of demonstrating.
Then we will create a list of all authors and coauthor called list_of_authors and generate pairs out of this calling it pair_of_authors.
for every pair of authors we will label them as linked or not linked using mapping, for that I created a function called check_if_pair_is_linked.
after this I will show how to create a simple baseline solution for the task. We will use scikit-learn with has a big list of easy to use models for classification.
Code
I folded the code and describe it in 3 major simple steps:
prepare your inputs to create a dataset (using mappings and attributes)
create dataset (for every pair of authors, label then as linked or not and concatenate their attributes)
Use sci-kit learn to fit, predict and evaluate a a model
import itertools
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn import svm
######## first preparing data to create a dataset
+-- 17 lines: ### we already have mappings {---------------------------------------------------------------------------------------------------
This part creates -----> mappings => {
0:[coauthor12,coauthor17231,...],
1:[...],
...,
732: [...]
}
i author_attributes => {
0:[a0_1,attr0_2,...,attr0_10],
1:[a1_1,attr1_2,...,attr1_10],
...,
732: [...]
}
#### Generating our Dataset and preparing dataset to the scikit-learn library(and most other) format
### The idea is generating pairs of authors regardless if they're linked or not and label if a pair is linked
+-- 24 lines: {--------------------------------------------------------------------------------------------------------------------------------
This part creates, a list of pairs of authors containing (attributes_of_both_authors, is_linked_label)
-----> dataset = [
([a0_1,...,a0_10,a1_1,...,a1_10],label_pair0_1)),
([a0_1,...,a0_10,a2_1,...,a2_10],label_pair1_2),
...
([a142_1,...,a142_10,a37_1,...,a37_10],label_pair142_37),
]
#### Training and evaluating a simple machine learning solution
+-- 12 lines: ---------------------------------------------------------------------------------------------------------------------------------This part outputs a report about the model after training the model with a training dataset and evaluate the model in a test dataset (I used the same train data and test data but dont ever do that in a real scenario)
-----> precision recall f1-score support
0 0.93 1.00 0.96 466
1 1.00 0.10 0.18 40
Solution:
import itertools
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn import svm
######## first preparing data to create a dataset
#### we already have mappings
def generate_random_author_attributes(mapping):
author_attributes = {}
for author in mapping.keys():
author_attributes[author] = np.random.random(10).tolist()
for coauthors in mapping.values():
for coauthor in coauthors:
if author_attributes.get(coauthor, False):
pass # check if this coauthor was alredy added
else:
author_attributes[coauthor] = np.random.random(10).tolist()
return author_attributes
mapping = {
0: [2860, 3117],
1: [318, 1610, 1776, 1865, 2283, 2507, 3076, 3108, 3182, 3357, 3675, 4040],
2: [164, 413, 1448, 1650, 3119, 3238],
}
#### hopefully you have attributes like this, for each author you have some attributes, I created 10 random values for each for demonstrating
author_attributes = generate_random_author_attributes(mapping)
#### Generating our Dataset and preparing dataset to the scikit-learn library(and most other) format
### The idea is generating pairs of authors regardless if they're linked or not and label if a pair is linked
def check_if_pair_is_linked(pair_of_authors, mapping):
''' return 1 if linked, 0 if not linked'''
coauthors_of_author0 = mapping.get(pair_of_authors[0],[])
coauthors_of_author1 = mapping.get(pair_of_authors[1],[])
if(pair_of_authors[1] in coauthors_of_author0) or (pair_of_authors[0] in coauthors_of_author1):
return 1
else:
return 0
def create_dataset(author_attributes, mapping):
list_of_all_authors = list(itertools.chain.from_iterable([coauthors for coauthors in mapping.values()]))
list_of_all_authors.extend(mapping.keys())
dataset = []
for pair_of_authors in itertools.permutations(list_of_all_authors, 2):
pair_label = check_if_pair_is_linked(pair_of_authors, mapping)
pair_attributes = author_attributes[pair_of_authors[0]] + author_attributes[pair_of_authors[1]]
dataset.append((pair_attributes,pair_label))
return dataset
dataset=create_dataset(author_attributes, mapping)
X_train = [pair_attributes for pair_attributes,_ in dataset]
y_train = [pair_label for _,pair_label in dataset]
#### Training and evaluating a simple machine learning solution
binary_classifier = svm.SVC()
binary_classifier.fit(X_train, y_train)
#### Checking if the model is good
X_test = X_train # never use you train data as test data
y_test = y_train
true_pairs_label = y_test
predicted_pairs_label = binary_classifier.predict(X_test)
print(classification_report(true_pairs_label, predicted_pairs_label))
OUTPUT
precision recall f1-score support
0 0.93 1.00 0.96 466
1 1.00 0.15 0.26 40
accuracy 0.93 506
macro avg 0.97 0.57 0.61 506
weighted avg 0.94 0.93 0.91 506

Related

How to perform custom operations in between keras layers?

I have one input and one output neural network and in between I need to perform small operation. I have two inputs (from the same distribution of either mean 0 or mean 1) which I need to fed to the neural network one at a time and compare the output of each input. After the comparison, I am finally generating the prediction of the model. The implementation is as follows:
from tensorflow import keras
import tensorflow as tf
import numpy as np
#define network
x1 = keras.Input(shape=(1), name="x1")
x2 = keras.Input(shape=(1), name="x2")
model = keras.layers.Dense(20)
model1 = keras.layers.Dense(1)
x11 = model1(model(x1))
x22 = model1(model(x2))
After this I need to perform following operations:
if x11>=x22:
Vm=x1
else:
Vm=x2
Finally I need to do:
out = Vm - 0.5
out= keras.activations.sigmoid(out)
model = keras.Model([x1,x2], out)
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
loss=tf.keras.losses.binary_crossentropy,
metrics=['accuracy']
)
model.summary()
tf.keras.utils.plot_model(model) #visualize model
I have normally distributed pair of data with same mean (mean 0 and mean 1 as generated below:
#Generating training dataset
from scipy.stats import skewnorm
n=1000 #sample each
s = 1 # scale to change o/p range
X1_0 = skewnorm.rvs(a = 0 ,loc=0, size=n)*s; X1_1 = skewnorm.rvs(a = 0 ,loc=1, size=n)*s #Skewnorm function
X2_0 = skewnorm.rvs(a = 0 ,loc=0, size=n)*s; X2_1 = skewnorm.rvs(a = 0 ,loc=1, size=n)*s #Skewnorm function
X1_train = list(X1_0) + list(X1_1) #append both data
X2_train = list(X2_0) + list(X2_1) #append both data
y_train = [x for x in (0,1) for i in range(0, n)] #make Y for above conditions
#reshape to proper format
X1_train = np.array(X1_train).reshape(-1,1)
X2_train = np.array(X2_train).reshape(-1,1)
y_train = np.array(y_train)
#train model
model.fit([X1_train, X2_train], y_train, epochs=10)
I am not been able to run the program if I include operation
if x11>=x22:
Vm=x1
else:
Vm=x2
in between layers. If I directly work with maximum of outputs as:
Vm = keras.layers.Maximum()([x11,x22])
The program is working fine. But I need to select either x1 or x2 based on the value of x11 and x22.
The problem might be due to the inclusion of the comparison operation while defining structure of the model where there is no value for x11 and x22 (I guess). I am totally new to all these stuffs and so I could not resolve this. I would greatly appreciate any help/suggestions. Thank you.
You can add this functionality via a Lambda layer.
Vm = tf.keras.layers.Lambda(lambda x: tf.where(x[0]>=x[1], x[2], x[3]))([x11, x22, x1, x2])

Is there a way to get statistics of weights obtained from Tensorflow?

I am interested in developing a logit-based choice model using Tensorflow.
I am fairly new to this tool, so I was wondering if there is a way to get the statistics (i.e., the p-value) of the weights obtained from Tensorflow , just like someone would get from Stata or SPSS.
The code does run, but cannot be sure if the model is valid unless I can compare the p-values of the variables from the estimation result from STATA.
The data structure is simple; it's a form of a survey, where a respondent chooses an alternative out of 4 options, each with different feature levels (a.k.a. a conjoint analysis).
(I am trying something new; that's why I am not using pylogit of xlogit packages.)
Below is the code I wrote:
mport numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
np.random.seed(0)
tf.random.set_seed(0)
variables = pd.read_excel('file.xls')
target_vars = ['A','B','C','D','E']
X = pd.DataFrame()
for i in target_vars:
X[i]=variables[i]
y = variables['choice']
X_tn, X_te, y_tn, y_te = train_test_split(X, y, random_state=0)
n_feat = X_tn.shape[1]
epo = 100
model = Sequential()
model.add(Dense(1, input_dim=n_feat, activation='sigmoid'))
model.add(Dense(1))
model.compile(loss = 'mean_squared_error',
optimizer = 'adam',
metrics = ['mean_squared_error'])
hist = model.fit(X_tn, y_tn, epochs=epo, batch_size=4)
model.summary()
model.get_weights()
some other optional questions only if you are familiar with discrete choice models...
i) the original dataset is a conjoint survey with 4 alternatives at each choice situation - that's why I put batch_size=4. Am I doing it right?
ii) have I set the epoch too large?
First of all your question is about p-value significant where they are significant againts all input data in scopes !
The idea is you may applied many of the functions or custom functions but avtivation layer is asynchornize or fairly chances based on your target.
( 1 ) You can have model with 2-classes, 4-classes or 10 classes output to perform simiarlities significant or maximum, minumum, average maximum or last changes based on your selected function.
( 2 ) Prediction is a result from your input and none sigficant, significant relationship learning develop.
( 3 ) Compares of them possibile by make it into same ranges of expectation otherwises it is value for it subset.
sample output:
F:\temp\Python>python test_read_excel.py
0 1 2 3 4 5
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 0 1 0 0 0
3 0 0 0 1 0 0
4 0 0 0 0 1 0
5 0 0 0 0 0 1
(6, 6)
none significant:
[array([[-0.6489598]], dtype=float32), array([-0.0998228], dtype=float32), array([[1.7546983e-05]], dtype=float32), array([-3.6847262e-06], dtype=float32)]
** sample code **
variables = pd.read_excel('F:\\temp\\20220305\\Book 2.xlsx', index_col=None, header=None)
list_of_X = [ ]
list_of_Y = [ ]
for i in range(np.asarray(variables).shape[0]):
for j in range(np.asarray(variables).shape[1]):
if variables[j][i] == "X" :
print('found: ' + str(i) + ":" + str(j))
list_of_X.append(i)
list_of_Y.append(1)
else :
list_of_X.append(i)
list_of_Y.append(0)
list_of_X = np.reshape(list_of_X, (1, 36, 1))
list_of_Y = np.reshape(list_of_Y, (1, 36, 1))
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Model Initialize
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
model = tf.keras.models.Sequential([
tf.keras.layers.InputLayer(input_shape=(36, 1)),
tf.keras.layers.Dense(1 , activation='sigmoid' ),
])
model.add(tf.keras.layers.Dense(1))
model.summary()
model.compile(loss = 'mean_squared_error',
optimizer = 'adam',
metrics = ['mean_squared_error'])
history = model.fit(list_of_X, list_of_Y, epochs=1000, batch_size=4)

deep feature synthesis depth for transformation primitives | featuretools

I am trying to use the featuretools library to make new features on a simple dataset, however, whenever I try to use a bigger max_depth, nothing happens... Here is my code so far:
# imports
import featuretools as ft
# creating the EntitySet
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data', dataframe=data, make_index=True, index='index')
# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data', max_depth=3,
trans_primitives=['add_numeric', 'multiply_numeric'])
When I look at the features created, I get the basic things f1*f2 and f1+f2, but I would like more complex engineered features like f2*(f1+f2) or f1+(f2+f1). I thought increasing max_depth would do this but apparently not.
How could I do this, if at all?
I have managed to answer my own question, so I'll post it here.
You can create deeper features by running "Deep Feature Synthesis" on already generated features. Here is an example:
# imports
import featuretools as ft
# creating the EntitySet
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data', dataframe=data, make_index=True, index='index')
# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data',
trans_primitives=['add_numeric','multiply_numeric'])
# creating an EntitySet from the new features
deep_es = ft.EntitySet()
deep_es.entity_from_dataframe(entity_id='data', index='index', dataframe=feature_matrix)
# Run deep feature synthesis with transformation primitives
deep_feature_matrix, deep_feature_defs=ft.dfs(entityset=deep_es, target_entity='data',
trans_primitives=['add_numeric','multiply_numeric'])
Now, looking at the columns of deep_feature_matrix here is what we see (assuming a dataset with 2 features):
"f1", "f2", "f1+f2", "f1*f2", "f1+f1*f2", "f1+f1+f2", "f1*f2+f1+f2", "f1*f2+f2", "f1+f2+f2", "f1*f1*f2", "f1*f1+f2", "f1*f2*f1+f2", "f1*f2*f2", "f1+f2*f2"
I have also made a function that automatically does this (includes a full docstring):
def auto_feature_engineering(X, y, selection_percent=0.1, selection_strategy="best", num_depth_steps=2, transformatives=['divide_numeric', 'multiply_numeric']):
"""
Automatically perform deep feature engineering and
feature selection.
Parameters
----------
X : pd.DataFrame
Data to perform automatic feature engineering on.
y : pd.DataFrame
Target variable to find correlations of all
features at each depth step to perform feature
selection, y is not needed if selection_percent=1.
selection_percent : float, optional
Defines what percent of all the new features to
keep for the next depth step.
selection_strategy : {'best', 'random'}, optional
Strategy used for feature selection, if 'best',
it will select the best features for the next depth
step, if 'random', it will select features at random.
num_depth_steps : integer, optional
The number of depth steps. Every depth step, the model
generates brand new features from the features made in
the last step, then selects a percent of these new
features.
transformatives : list, optional
List of all possible transformations of the data to use
when feature engineering, you can find the full list
of possible transformations as well as what each one
does using the following code:
`ft.primitives.list_primitives()[ft.primitives.list_primitives()["type"]=="transform"]`
make sure to `import featuretools as ft`.
Returns
-------
pd.DataFrame
a dataframe of the brand new features.
"""
from sklearn.feature_selection import mutual_info_classif
selected_feature_df = X.copy()
for i in range(num_depth_steps):
# Perform feature engineering
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data', dataframe=selected_feature_df,
make_index=True, index='index')
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data', trans_primitives=transformatives)
# Remove features that are the same
feature_corrs = feature_matrix.corr()[list(feature_matrix.keys())[0]]
existing_corrs = []
good_keys = []
for key in feature_corrs.to_dict().keys():
if feature_corrs[key] not in existing_corrs:
existing_corrs.append(feature_corrs[key])
good_keys.append(key)
feature_matrix = feature_matrix[good_keys]
# Remove illegal features
legal_features = list(feature_matrix.columns)
for feature in list(feature_matrix.columns):
raw_feature_list = []
for j in range(len(feature.split(" "))):
if j%2==0:
raw_feature_list.append(feature.split(" ")[j])
if len(raw_feature_list) > i+2: # num_depth_steps = 1, means max_num_raw_features_in_feature = 2
legal_features.remove(feature)
feature_matrix = feature_matrix[legal_features]
# Perform feature selection
if int(selection_percent)!=1:
if selection_strategy=="best":
corrs = mutual_info_classif(feature_matrix.reset_index(drop=True), y)
corrs = pd.Series(corrs, name="")
selected_corrs = corrs[corrs>=corrs.quantile(1-selection_percent)]
selected_feature_df = feature_matrix.iloc[:, list(selected_corrs.keys())].reset_index(drop=True)
elif selection_strategy=="random":
selected_feature_df = feature_matrix.sample(frac=(selection_percent), axis=1).reset_index(drop=True)
else:
raise Exception("selection_strategy can be either 'best' or 'random', got '"+str(selection_strategy)+"'.")
else:
selected_feature_df = feature_matrix.reset_index(drop=True)
if num_depth_steps!=1:
rename_dict = {}
for col in list(selected_feature_df.columns):
rename_dict[col] = "("+col+")"
selected_feature_df = selected_feature_df.rename(columns=rename_dict)
if num_depth_steps!=1:
rename_dict = {}
for feature_name in list(selected_feature_df.columns):
rename_dict[feature_name] = feature_name[int(num_depth_steps-1):-int(num_depth_steps-1)]
selected_feature_df = selected_feature_df.rename(columns=rename_dict)
return selected_feature_df
Here is an example of using it:
# Imports
>>> import seaborn as sns
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.preprocessing import OrdinalEncoder
# Load the penguins dataset
>>> penguins = sns.load_dataset("penguins")
>>> penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
# Fill in NaN values of features using the distribution of the feature
>>> for feature in ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "sex"]:
... s = penguins[feature].value_counts(normalize=True)
... dist = penguins[feature].value_counts(normalize=True).values
... missing = penguins[feature].isnull()
... penguins.loc[missing, feature] = np.random.choice(s.index, size=len(penguins[missing]),p=s.values)
# Make X and y
>>> X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
>>> y = penguins[["sex"]]
# Encode "sex" so that "Male" is 1 and "Female" is 0
>>> ord_enc = OrdinalEncoder()
>>> y = pd.DataFrame(ord_enc.fit_transform(y).astype(np.int8), columns=["sex"])
# Generate new dataset with more features
>>> penguins_with_more_features = auto_feature_engineering(X, y, selection_percent=1.)
# Correlations of the raw features
>>> find_correlations(X, y)
body_mass_g 0.422959
bill_depth_mm 0.353526
bill_length_mm 0.342109
flipper_length_mm 0.246944
Name: sex, dtype: float64
# Top 10% correlations of new features
>>> summarize_corr_series(find_top_percent(find_correlations(penguins_with_more_features, y), 0.1))
(flipper_length_mm / bill_depth_mm) / (body_mass_g): 0.7241123396175027
(bill_depth_mm * body_mass_g) / (flipper_length_mm): 0.7237223914820166
(bill_depth_mm * body_mass_g) * (bill_depth_mm): 0.7222108721971968
(bill_depth_mm * body_mass_g): 0.7202272416625914
(bill_depth_mm * body_mass_g) * (flipper_length_mm): 0.6425813490692588
(bill_depth_mm * bill_length_mm) * (body_mass_g): 0.6398235593646668
(bill_depth_mm * flipper_length_mm) * (flipper_length_mm): 0.6360645935216128
(bill_depth_mm * flipper_length_mm): 0.6083364815975281
(bill_depth_mm * body_mass_g) * (body_mass_g): 0.5888925994060027
In this example, we would like to predict the gender of penguins given their attributes body_mass_g, bill_depth_mm, bill_length_mm and flipper_length_mm.
You might notice these other mysterious functions I used in the example, namely find_correlations, summarize_corr_series and find_top_percent. These are other convenient functions I made to help summarize the results from auto_feature_engineering. Here is the code to them (note they haven't been documented):
def summarize_corr_series(feature_corr_series):
max_feature_name_size = 0
for key in feature_corr_series.to_dict().keys():
if len(key) > max_feature_name_size:
max_feature_name_size = len(key)
max_new_feature_corr = feature_corr_series.max()
for key in feature_corr_series.to_dict().keys():
whitespace = []
for i in range(max_feature_name_size-len(key)):
whitespace.append(" ")
whitespace = "".join(whitespace)
print(key+": "+whitespace+str(abs(feature_corr_series[key])))
def find_top_percent(series, percent):
return series[series>series.quantile(1-percent)]
def find_correlations(X, y):
return abs(pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1).corr())[y.columns[0]].drop(y.columns[0]).sort_values(ascending=False)
It is really unfortunate that featuretools does not easily support this use case since it appears to be quite common. The best way I've found to do this is to create the first order features you want using the dfs function and then add the second order features you want manually.
For instance the MWE below (using the iris dataset) performs the AddNumeric primitive using dfs and then applies the DivideNumeric primitive to the newly created features using only the original features (and avoids the same base feature appearing multiple times in a transformed feature).
import numpy as np
import pandas as pd
import sklearn
import featuretools as ft
iris = sklearn.datasets.load_iris()
data = pd.DataFrame(
data= np.c_[iris['data'],
iris['target']],
columns= iris['feature_names'] + ['target']
)
ignore_cols = ['target']
entity_set = ft.EntitySet(id="iris")
entity_set.entity_from_dataframe(
entity_id="iris_main",
dataframe=data,
index="index",
)
new_features = ft.dfs(
entityset=entity_set,
target_entity="iris_main",
trans_primitives=["add_numeric"],
features_only=True,
primitive_options={
"add_numeric": {
"ignore_variables": {"iris_main": ignore_cols},
},
},
)
transformed_features = [i for i in new_features if isinstance(i, ft.feature_base.feature_base.TransformFeature)]
original_features = [i for i in new_features if isinstance(i, ft.feature_base.feature_base.IdentityFeature) and i.get_name() not in ignore_cols]
depth_two_features = []
for trans_feat in transformed_features:
for orig_feat in original_features:
if orig_feat.get_name() not in [i.get_name() for i in trans_feat.base_features]:
feat = ft.Feature([trans_feat, orig_feat], primitive=ft.primitives.DivideNumeric)
depth_two_features.append(feat)
data = ft.calculate_feature_matrix(
features= original_features + transformed_features + depth_two_features,
entityset=entity_set,
verbose=True,
)
The benefit of this approach is that it gives you more fine grained control to customise this how you want and avoids the computational cost of creating unnecessary features you don't want.

Crossing string lists in Tensorflow tf.feature_column.crossed_column

I have two features post_tags and user_tags. These are padded strings of N and M words respectivley.
So for example we might have post_tag being "the-oscars brad-pitt xyzpadxyz xyzpadxyz xyzpadxyz". So this post has been tagged as oscars and brad pitt related.
Then for a user_tag we might have "brad-pitt universal sag the-academy-awards xyzpadxyz xyzpadxyz xyzpadxyz xyzpadxyz xyzpadxyz xyzpadxyz"
user_tag here is a sample of tags from last 20 posts user last consumed say.
So post_tag is always 5 tags long and user_tag is always 10 tags long.
I'm splitting each string into a tensor as part of dataset api processing like this:
features['post_tag'] = tf.string_split([features['post_tag']])
features['post_tag'] = tf.sparse_tensor_to_dense(features['post_tag'], default_value=PADWORD)
features['user_tag'] = tf.string_split([features['post_tag']])
features['user_tag'] = tf.sparse_tensor_to_dense(features['post_tag'], default_value=PADWORD)
I have a vocab file for the tags and am feeding each feature into the deep features of a wide and deep canned estimator.
user_tag = tf.feature_column.categorical_column_with_vocabulary_file(
key='user_tag',
vocabulary_file='{}/tagvocab.csv'.format(INPUT_DIR)
)
user_tag_embed = tf.feature_column.embedding_column(
categorical_column = user_tag ,
dimension = USER_TAG_EMBEDDING_SIZE
)
deep.append(user_tag_embed)
post_tag = tf.feature_column.categorical_column_with_vocabulary_file(
key='post_tag',
vocabulary_file='{}/tagvocab.csv'.format(INPUT_DIR)
)
post_tag _embed = tf.feature_column.embedding_column(
categorical_column = post_tag ,
dimension = POST_TAG_EMBEDDING_SIZE
)
deep.append(post_tag_embed)
This works however what i really want to do is to do a cross of post_tag and user_tag. But if i do something like this:
user_post_tag_cross = tf.feature_column.crossed_column(
keys = [user_tag, post_tag],
hash_bucket_size = 25
)
wide.append(user_post_tag_cross)
I'm getting this error:
InvalidArgumentError (see above for traceback): Expected D2 of index to be 2 got 3 at position 0
I can see this is coming from this line in the tf code
I have a feeling maybe crossing tensors like this might not be possible or it expecting just two strings. I have related features that are post_tag_first and user_tag_first which are just one random tag and just being passed in as strings. Then if i do:
user_post_first_tag_cross = tf.feature_column.crossed_column(
keys = ['user_tag_first', 'post_tag_first'],
hash_bucket_size = 25
)
wide.append(user_post_first_tag_cross)
It works and trains fine. So really i'm just trying to understand what the best way would be to do a cross of two tensors of tags.
Anyone any ideas - i'd imagine it's something someone might have dealt with before as sort of tag data like this is pretty common, or it could be search terms and document terms, keywords, etc.
Update - small reproducible example
Here is a small runable example that sets up what i'm looking to do.
import tensorflow as tf
import numpy as np
# set up tag vocab
tag_vocab = ['brad-pitt','usa','donald-trump','tpb','tensorflow','xyzpadxyz','UNK']
# make data
# train data
user_tag_train = np.random.choice(tag_vocab,50).reshape((5, 10)) # 5 example rows with 10 tags per user
post_tag_train = np.random.choice(tag_vocab,25).reshape((5, 5)) # 5 example rows with 5 tags per post
x_train = np.array([1., 2., 3., 4., 5.]) # just another example numeric feature
y_train = np.array([0., -1., -2., -3., -4]) # outcome
# eval data
user_tag_eval = np.random.choice(tag_vocab,50).reshape((5, 10)) # 5 example rows with 10 tags per user
post_tag_eval = np.random.choice(tag_vocab,25).reshape((5, 5)) # 5 example rows with 5 tags per post
x_eval = np.array([2., 5., 8., 1., 5.])
y_eval = np.array([-1.01, -4.1, -7, 0., 9.])
# define feature cols
x_num = tf.feature_column.numeric_column("x", shape=[1])
user_tag_cat = tf.feature_column.categorical_column_with_vocabulary_list(
key = 'user_tag',
vocabulary_list = tag_vocab)
user_tag_embed = tf.feature_column.embedding_column(
categorical_column = user_tag_cat,
dimension = 3
)
post_tag_cat = tf.feature_column.categorical_column_with_vocabulary_list(
key = 'post_tag',
vocabulary_list = tag_vocab)
post_tag_embed = tf.feature_column.embedding_column(
categorical_column = post_tag_cat,
dimension = 2
)
user_post_tag_cross = tf.feature_column.crossed_column(
keys = [ user_tag_cat, post_tag_cat ],
hash_bucket_size = 5
)
#user_post_tag_embed_cross = tf.feature_column.crossed_column(
# keys = [ user_tag_embed, post_tag_embed ],
# hash_bucket_size = 5
#)
feature_columns = [
x_num,
user_tag_cat,
post_tag_cat,
user_tag_embed,
post_tag_embed,
user_post_tag_cross,
#user_post_tag_embed_cross,
]
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)
input_fn = tf.estimator.inputs.numpy_input_fn(
{
"x": x_train,
"user_tag": user_tag_train,
"post_tag": post_tag_train,
},
y_train,
batch_size=5,
num_epochs=None,
shuffle=True
)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
{
"x": x_train,
"user_tag": user_tag_train,
"post_tag": post_tag_train,
},
y_train,
batch_size=5,
num_epochs=1000,
shuffle=False
)
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
{
"x": x_eval,
"user_tag": user_tag_eval,
"post_tag": post_tag_eval
},
y_eval,
batch_size=5,
num_epochs=1000,
shuffle=False
)
estimator.train(input_fn=input_fn, steps=1000)
train_metrics = estimator.evaluate(input_fn=train_input_fn)
eval_metrics = estimator.evaluate(input_fn=eval_input_fn)
print("\n\ntrain metrics: %r"% train_metrics)
print("eval metrics: %r"% eval_metrics)
The above code runs, but if you uncomment the parts relating to user_post_tag_embed_cross you get (as expected):
ValueError: Unsupported key type. All keys must be either string, or categorical column except _HashedCategoricalColumn. Given: _EmbeddingColumn(categorical_column=_VocabularyListCategoricalColumn(key='user_tag', vocabulary_list=('brad-pitt', 'usa', 'donald-trump', 'tpb', 'tensorflow', 'xyzpadxyz', 'UNK'), dtype=tf.string, default_value=-1, num_oov_buckets=0), dimension=3, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x0000021C9AA14860>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True)
In reality i don't think i'd be able to do user_post_tag_cross as the tag vocab is ~40k (i was surprised it worked here as i think this is what led to the error i linked to above - maybe its related to when you try cross two cats with big vocab's).
I think ideally what i'd want to do is just cross the embeddings in some way. If i put both into a embedding with same dimension then is there a way to cross that somehow using feature_columns() or some other approach?

Multi-Target and Multi-Class prediction

I am relatively new to machine learning as well as tensorflow. I would like to train the data so that predictions with 2 targets and multiple classes could be made. Is this something that can be done? I was able to implement the algorithm for 1 target but don't know how I need to do it for a second target as well.
An example dataset:
DayOfYear Temperature Flow Visibility
316 8 1 4
285 -1 1 4
326 8 2 5
323 -1 0 3
10 7 3 6
62 8 0 3
56 8 1 4
347 7 2 5
363 7 0 3
77 7 3 6
1 7 1 4
308 -1 2 5
364 7 3 6
If I train (DayOfYear Temperature Flow) I can predict the Visibility quite well. But I need to predict Flow as well somehow. I am pretty sure that Flow will influence Visibility so I am not sure how to go with that.
This is the implementation that I have
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import urllib
import numpy as np
import tensorflow as tf
# Data sets
TRAINING = "/ml_baetterich_learn.csv"
TEST = "/ml_baetterich_test.csv"
VALIDATION = "/ml_baetterich_validation.csv"
def main():
# Load datasets.
training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
filename=TRAINING,
target_dtype=np.int,
features_dtype=np.int,
target_column=-1)
test_set = tf.contrib.learn.datasets.base.load_csv_without_header(
filename=TEST,
target_dtype=np.int,
features_dtype=np.int,
target_column=-1)
validation_set = tf.contrib.learn.datasets.base.load_csv_without_header(
filename=VALIDATION,
target_dtype=np.int,
features_dtype=np.int,
target_column=-1)
# Specify that all features have real-value data
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=3)]
# Build 3 layer DNN with 10, 20, 10 units respectively.
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
hidden_units=[10, 20, 10],
n_classes=9,
model_dir="/tmp/iris_model")
# Define the training inputs
def get_train_inputs():
x = tf.constant(training_set.data)
y = tf.constant(training_set.target)
return x, y
# Fit model.
classifier.fit(input_fn=get_train_inputs, steps=4000)
# Define the test inputs
def get_test_inputs():
x = tf.constant(test_set.data)
y = tf.constant(test_set.target)
return x, y
# Define the test inputs
def get_validation_inputs():
x = tf.constant(validation_set.data)
y = tf.constant(validation_set.target)
return x, y
# Evaluate accuracy.
accuracy_test_score = classifier.evaluate(input_fn=get_test_inputs,
steps=1)["accuracy"]
accuracy_validation_score = classifier.evaluate(input_fn=get_validation_inputs,
steps=1)["accuracy"]
print ("\nValidation Accuracy: {0:0.2f}\nTest Accuracy: {1:0.2f}\n".format(accuracy_validation_score,accuracy_test_score))
# Classify two new flower samples.
def new_samples():
return np.array(
[[327,8,3],
[47,8,0]], dtype=np.float32)
predictions = list(classifier.predict_classes(input_fn=new_samples))
print(
"New Samples, Class Predictions: {}\n"
.format(predictions))
if __name__ == "__main__":
main()
Option 1: multi-headed model
You could use a multi-headed DNNEstimator model. This treats Flow and Visibility as two separate softmax classification targets, each with their own set of classes. I had to modify the load_csv_without_header helper function to support multiple targets (which could be cleaner, but is not the point here - feel free to ignore its details).
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
import csv
import collections
num_flow_classes = 4
num_visib_classes = 7
Dataset = collections.namedtuple('Dataset', ['data', 'target'])
def load_csv_without_header(fn, target_dtype, features_dtype, target_columns):
with gfile.Open(fn) as csv_file:
data_file = csv.reader(csv_file)
data = []
targets = {
target_cols: []
for target_cols in target_columns.keys()
}
for row in data_file:
cols = sorted(target_columns.items(), key=lambda tup: tup[1], reverse=True)
for target_col_name, target_col_i in cols:
targets[target_col_name].append(row.pop(target_col_i))
data.append(np.asarray(row, dtype=features_dtype))
targets = {
target_col_name: np.array(val, dtype=target_dtype)
for target_col_name, val in targets.items()
}
data = np.array(data)
return Dataset(data=data, target=targets)
feature_columns = [
tf.contrib.layers.real_valued_column("", dimension=1),
tf.contrib.layers.real_valued_column("", dimension=2),
]
head = tf.contrib.learn.multi_head([
tf.contrib.learn.multi_class_head(
num_flow_classes, label_name="Flow", head_name="Flow"),
tf.contrib.learn.multi_class_head(
num_visib_classes, label_name="Visibility", head_name="Visibility"),
])
classifier = tf.contrib.learn.DNNEstimator(
feature_columns=feature_columns,
hidden_units=[10, 20, 10],
model_dir="iris_model",
head=head,
)
def get_input_fn(filename):
def input_fn():
dataset = load_csv_without_header(
fn=filename,
target_dtype=np.int,
features_dtype=np.int,
target_columns={"Flow": 2, "Visibility": 3}
)
x = tf.constant(dataset.data)
y = {k: tf.constant(v) for k, v in dataset.target.items()}
return x, y
return input_fn
classifier.fit(input_fn=get_input_fn("tmp_train.csv"), steps=4000)
res = classifier.evaluate(input_fn=get_input_fn("tmp_test.csv"), steps=1)
print("Validation:", res)
Option 2: multi-labeled head
If you keep your CSV data separated by commas, and keep the last column for all the classes a row might have (separated by some token such as space), you can use the following code:
import numpy as np
import tensorflow as tf
all_classes = ["0", "1", "2", "3", "4", "5", "6"]
def k_hot(classes_col, all_classes, delimiter=' '):
table = tf.contrib.lookup.index_table_from_tensor(
mapping=tf.constant(all_classes)
)
classes = tf.string_split(classes_col, delimiter)
ids = table.lookup(classes)
num_items = tf.cast(tf.shape(ids)[0], tf.int64)
num_entries = tf.shape(ids.indices)[0]
y = tf.SparseTensor(
indices=tf.stack([ids.indices[:, 0], ids.values], axis=1),
values=tf.ones(shape=(num_entries,), dtype=tf.int32),
dense_shape=(num_items, len(all_classes)),
)
y = tf.sparse_tensor_to_dense(y, validate_indices=False)
return y
def feature_engineering_fn(features, labels):
labels = k_hot(labels, all_classes)
return features, labels
feature_columns = [
tf.contrib.layers.real_valued_column("", dimension=1), # DayOfYear
tf.contrib.layers.real_valued_column("", dimension=2), # Temperature
]
classifier = tf.contrib.learn.DNNEstimator(
feature_columns=feature_columns,
hidden_units=[10, 20, 10],
model_dir="iris_model",
head=tf.contrib.learn.multi_label_head(n_classes=len(all_classes)),
feature_engineering_fn=feature_engineering_fn,
)
def get_input_fn(filename):
def input_fn():
dataset = tf.contrib.learn.datasets.base.load_csv_without_header(
filename=filename,
target_dtype="S100", # strings of length up to 100 characters
features_dtype=np.int,
target_column=-1
)
x = tf.constant(dataset.data)
y = tf.constant(dataset.target)
return x, y
return input_fn
classifier.fit(input_fn=get_input_fn("tmp_train.csv"), steps=4000)
res = classifier.evaluate(input_fn=get_input_fn("tmp_test.csv"), steps=1)
print("Validation:", res)
We are using DNNEstimator with a multi_label_head, which uses sigmoid crossentropy rather than softmax crossentropy as a loss function. This means that each of the output units/logits are passed through the sigmoid function, which gives the likelihood of the data point belonging to that class, i.e. the classes are computed independently and are not mutually exclusive as they are with softmax crossentropy. This means that you could have between 0 and len(all_classes) classes set for each row in the training set and final predictions.
Also notice that the classes are represented as strings (and k_hot makes the conversion to token indices), so that you could use arbitrary class identifiers such as category UUIDs in e-commerce settings. If the categories in the 3rd and 4th column are different (Flow ID 1 != Visibility ID 1), you could prepend the column name to each class ID, e.g.
316,8,flow1 visibility4
285,-1,flow1 visibility4
326,8,flow2 visibility5
For a description of how k_hot works, see my other SO answer. I decided to use k_hot as a separate function (rather than define it directly in feature_engineering_fn because it's a distinct piece of functionality, and probably TensorFlow will soon have a similar utility function.
Note that if you're now using the first two columns to predict the last two columns, your accuraccy will certainly go down, as the last two columns are highly correlated and using one of them will give you a lot of information about the other. Actually, your code was using only the 3rd column, which was kind of a cheat anyway if the goal is to predict the 3rd and 4th columns.