How to get each individual tree's prediction in xgboost? - xgboost

xgboost.Booster.predict only returns either the prediction of the whole ensemble or the predicted leaf of each tree. How can I get the prediction value of each individual tree?

Recent versions of xgboost provide a slicing API, so Raul's answer, while valid, is more complicated than necessary.
To get individual predictions, all you need to do is iterate over the booster object.
individual_preds = []
for tree_ in model.get_booster():
    individual_preds.append(
        tree_.predict(xgb.DMatrix(X))
    )
Note, however, that these individual predictions are probabilities, not additive contributions, so summing them does not give the final prediction. To recover it, transform them back into log-odds and then sum:
import numpy as np
from scipy.special import expit as sigmoid, logit as inverse_sigmoid
individual_preds = np.vstack(individual_preds)
individual_logits = inverse_sigmoid(individual_preds)
final_logits = individual_logits.sum(axis=0)
final_preds = sigmoid(final_logits)
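To see why probabilities cannot simply be summed, here is a small self-contained sketch in plain Python. The per-tree probabilities are made-up numbers, not output from a real model, and the sketch assumes each probability was produced by applying the sigmoid to that tree's raw score alone (zero intercept).

```python
import math


def sigmoid(z):
    """Logistic function: log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the logistic function: probability -> log-odds."""
    return math.log(p / (1.0 - p))


# Hypothetical per-tree probabilities for ONE sample (illustrative values only).
per_tree_probs = [0.36, 0.40, 0.45]

# Map each probability back to log-odds, where the tree outputs are additive.
per_tree_logits = [logit(p) for p in per_tree_probs]

# Sum in log-odds space, then squash once to get the ensemble probability.
final_prob = sigmoid(sum(per_tree_logits))

# Summing the probabilities directly gives a different (and invalid) number.
naive_sum = sum(per_tree_probs)

print(final_prob, naive_sum)
```

The naive sum exceeds 1 here, which is one quick sign that probabilities are the wrong space to add in.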
Fully reproducible example, replicating Raul's efforts
import numpy as np
import xgboost as xgb
from sklearn import datasets
from scipy.special import expit as sigmoid, logit as inverse_sigmoid
# Load data
iris = datasets.load_iris()
X, y = iris.data, (iris.target == 1).astype(int)
# Fit a model
model = xgb.XGBClassifier(
    n_estimators=10,
    max_depth=10,
    use_label_encoder=False,
    objective='binary:logistic'
)
model.fit(X, y)
booster_ = model.get_booster()
# Extract individual predictions
individual_preds = []
for tree_ in booster_:
    individual_preds.append(
        tree_.predict(xgb.DMatrix(X))
    )
individual_preds = np.vstack(individual_preds)
# Aggregate individual predictions into final predictions
individual_logits = inverse_sigmoid(individual_preds)
final_logits = individual_logits.sum(axis=0)
final_preds = sigmoid(final_logits)
# Verify correctness
xgb_preds = booster_.predict(xgb.DMatrix(X))
np.testing.assert_almost_equal(final_preds, xgb_preds)

The xgboost.core.Booster has two methods that allow you to do so.
First, xgboost.core.Booster.predict with the parameter pred_leaf set to True gives you the predicted leaf indices. Then it is just a matter of looking up the scores of those leaves.
To get the leaf scores, we resort to the method xgboost.core.Booster.dump_model, which dumps the structure of the tree ensemble as plain text or JSON. The dump contains the leaf scores.
Below I show an example.
First, train an xgboost model on the Iris dataset.
import os
import json
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import datasets
# Load data
iris = datasets.load_iris()
X, y = iris.data, iris.target
y = (y == 1).astype(int)
# Fit a model
n_estimators = 10
max_depth = 10
model = xgb.XGBClassifier(
    n_estimators=n_estimators,
    max_depth=max_depth,
    min_child_weight=1)
model.fit(X, y)
booster = model.get_booster()
Then get the predicted leaf indices.
pred_leaf_index = booster.predict(
    xgb.DMatrix(X),
    pred_leaf=True
).reshape(X.shape[0], n_estimators)
To get the leaf scores, we need to dump the model as a JSON file. The resulting dump contains the tree structure.
# Dump the model and load the dump
model_json_path = '/tmp/model.json'
booster.dump_model(model_json_path, dump_format='json')
with open(model_json_path, 'r') as f:
    model_dict = json.loads(f.read())
Now, the following is perhaps the most complex part of this process. The following functions extract the leaf scores, first for each tree and then for the entire ensemble:
def get_tree_leaf_scores(tree):
    """Retrieve a single tree leaf scores.

    Parameters
    ----------
    tree : dict
        A dictionary representing a single xgboost decision tree
        (one item of the dump generated by `booster.dump_model`).

    Returns
    -------
    leafs : list
        Each item of the list is the left and right final leafs of
        the final branch of a tree.
    """
    if 'leaf' in tree:
        return tree
    else:
        branch_0 = get_tree_leaf_scores(tree['children'][0])
        branch_1 = get_tree_leaf_scores(tree['children'][1])
        if not isinstance(branch_0, list):
            branch_0 = [branch_0]
        if not isinstance(branch_1, list):
            branch_1 = [branch_1]
        return branch_0 + branch_1
def get_trees_leaf_as_dataframe(model_dict):
    """Retrieve the tree ensemble leaf scores.

    Parameters
    ----------
    model_dict : dict
        The dictionary from loading the dump resulting from:
        `xgboost.core.Booster.dump_model`

    Returns
    -------
    trees_leaf_df : pandas.DataFrame
        Tree/node ids with their leaf score.
    """
    # Get tree nodes
    trees_leaf_df = []
    for tree_idx, tree in enumerate(model_dict):
        tree_leafs = get_tree_leaf_scores(tree)
        tree_leafs = pd.DataFrame(tree_leafs)
        tree_leafs['treeid'] = tree_idx
        trees_leaf_df.append(tree_leafs)
    trees_leaf_df = pd.concat(
        trees_leaf_df
    ).sort_values(['treeid', 'nodeid'])
    trees_leaf_df['id'] = \
        trees_leaf_df.apply(
            lambda x: '%s-%s' % (int(x['treeid']), int(x['nodeid'])), axis=1)
    trees_leaf_df = trees_leaf_df[
        ['treeid', 'nodeid', 'id', 'leaf']
    ].set_index('id')
    return trees_leaf_df
Here is how you get the leaf scores as a DataFrame:
trees_leaf_df = get_trees_leaf_as_dataframe(model_dict)
trees_leaf_df.head()
Out[1]:
nodeid leaf treeid id
0 1 -0.555556 0 0-1
4 4 -0.528000 0 0-4
3 6 -0.120000 0 0-6
1 7 0.150000 0 0-7
2 8 0.550000 0 0-8
At this point we are ready to get the model predicted leaf scores, with the help of the following function:
def get_pred_leaf_scores(pred_leaf_index, trees_leaf_df):
    """
    Return
    ------
    The predicted leaf scores.
    """
    tree_ids = range(0, n_estimators)
    pred_leaf_scores = []
    for single_instance_pred_leafs in pred_leaf_index:
        tree_node_id_predictions = [
            '%s-%s' % (treeid, nodeid)
            for treeid, nodeid in zip(tree_ids, single_instance_pred_leafs)]
        single_instance_pred_leaf_scores = trees_leaf_df.loc[
            tree_node_id_predictions]['leaf'].values
        pred_leaf_scores.append(single_instance_pred_leaf_scores)
    pred_leaf_scores = pd.DataFrame(pred_leaf_scores)
    return pred_leaf_scores
pred_leaf_scores = get_pred_leaf_scores(pred_leaf_index, trees_leaf_df)
pred_leaf_scores
Out[2]:
0 1 2 ... 7 8 9
0 -0.555556 -0.434605 -0.373621 ... -0.248634 -0.231758 -0.215499
1 -0.555556 -0.434605 -0.373621 ... -0.248634 -0.231758 -0.215499
2 -0.555556 -0.434605 -0.373621 ... -0.248634 -0.231758 -0.215499
3 -0.555556 -0.434605 -0.373621 ... -0.248634 -0.231758 -0.215499
4 -0.555556 -0.434605 -0.373621 ... -0.248634 -0.231758 -0.215499
.. ... ... ... ... ... ... ...
145 -0.528000 -0.410725 -0.374272 ... -0.072375 -0.236201 -0.058543
146 -0.528000 -0.410725 -0.374272 ... -0.024406 -0.236201 -0.185685
147 -0.528000 -0.410725 -0.374272 ... -0.072375 -0.236201 -0.058543
148 -0.528000 -0.410725 -0.374272 ... -0.250879 -0.236201 -0.215589
149 -0.528000 -0.410725 -0.374272 ... -0.072375 -0.236201 -0.058543
[150 rows x 10 columns]
If you want to make sure that the leaf scores yield the same probability
predictions, do the following:
def from_leafs_scores_to_proba(pred_leaf_scores):
    """Convert per-tree leaf scores into positive-class probabilities."""
    # Sum the leaf scores of all trees to get the logit (log-odds).
    logit = pred_leaf_scores.sum(axis=1)
    # Apply the logistic function to get the positive-class probability.
    pos_class_probability = 1 / (1 + np.exp(-logit))
    return pos_class_probability
y_scores_from_leafs = from_leafs_scores_to_proba(pred_leaf_scores)
y_scores_from_leafs.values[:10]
Out[9]:
array([0.03715579, 0.03715579, 0.03715579, 0.03715579, 0.03715579,
0.03715579, 0.03715579, 0.03715579, 0.03715579, 0.03715579])
y_scores = model.predict_proba(X)[:, 1]
y_scores[:10]
Out[10]:
array([0.03715578, 0.03715578, 0.03715578, 0.03715578, 0.03715578,
0.03715578, 0.03715578, 0.03715578, 0.03715578, 0.03715578],
dtype=float32)
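The leaf lookup above can also be sketched without pandas. The snippet below assumes the JSON dump has already been loaded into a list of nested dicts shaped like the ones dump_model(..., dump_format='json') writes (leaf nodes carry nodeid and leaf, split nodes carry children). The two tiny trees and the leaf indices are hand-written stand-ins, not real model output, and leaf_scores is a hypothetical helper name.

```python
import math


def leaf_scores(tree):
    """Map nodeid -> leaf score for one dumped tree (nested dict)."""
    if "leaf" in tree:
        return {tree["nodeid"]: tree["leaf"]}
    scores = {}
    for child in tree["children"]:
        scores.update(leaf_scores(child))
    return scores


# Hand-written stand-in for a loaded JSON dump with two trees.
model_dict = [
    {"nodeid": 0, "children": [
        {"nodeid": 1, "leaf": -0.5},
        {"nodeid": 2, "leaf": 0.25},
    ]},
    {"nodeid": 0, "children": [
        {"nodeid": 1, "leaf": -0.4},
        {"nodeid": 2, "children": [
            {"nodeid": 3, "leaf": 0.1},
            {"nodeid": 4, "leaf": 0.3},
        ]},
    ]},
]

# One lookup table per tree.
tables = [leaf_scores(tree) for tree in model_dict]

# Stand-in for one row of the pred_leaf output: the leaf hit in each tree.
pred_leaf_row = [2, 4]

# Per-tree scores for that sample, then the probability via the logistic
# function, mirroring from_leafs_scores_to_proba above.
per_tree = [tables[t][leaf] for t, leaf in enumerate(pred_leaf_row)]
prob = 1.0 / (1.0 + math.exp(-sum(per_tree)))

print(per_tree, prob)
```

Keeping one dict per tree avoids building the string ids ('treeid-nodeid') that the DataFrame version uses as its index.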

A much simpler solution is this.
In Python, you can dump the trees as a list of strings.
Example:
m = xgb.XGBClassifier(max_depth=2, n_estimators=3).fit(X, y)
m.get_booster().get_dump()
This is what you'll get:
booster[0]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<18.0417] yes=3,no=4,missing=4
        3:leaf=-0.0965415
        4:leaf=-0.0679503
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0992546
        6:leaf=-0.0984374
booster[1]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<16.8917] yes=3,no=4,missing=4
        3:leaf=-0.0928132
        4:leaf=-0.0676056
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0945284
        6:leaf=-0.0937463
booster[2]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<18.175] yes=3,no=4,missing=4
        3:leaf=-0.0878571
        4:leaf=-0.0610089
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0904395
        6:leaf=-0.0896808
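If only the leaf values are needed, the text dump can be parsed with a regular expression. This is a minimal sketch that assumes each leaf line has the form <nodeid>:leaf=<value>, as in the output above. The dump list is a shortened copy of that output (first two trees) rather than a live get_dump() call, and parse_leaves is a hypothetical helper name.

```python
import re

# Matches lines such as "3:leaf=-0.0965415", ignoring leading indentation.
LEAF_RE = re.compile(
    r"^\s*(\d+):leaf=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)", re.MULTILINE
)


def parse_leaves(tree_text):
    """Return {nodeid: leaf_value} for one tree of the text dump."""
    return {int(nid): float(val) for nid, val in LEAF_RE.findall(tree_text)}


# One string per tree, copied from the dump shown above.
dump = [
    "0:[sincelastrun<23.2917] yes=1,no=2,missing=2\n"
    "\t1:[sincelastrun<18.0417] yes=3,no=4,missing=4\n"
    "\t\t3:leaf=-0.0965415\n"
    "\t\t4:leaf=-0.0679503\n"
    "\t2:[sincelastrun<695.025] yes=5,no=6,missing=6\n"
    "\t\t5:leaf=-0.0992546\n"
    "\t\t6:leaf=-0.0984374\n",
    "0:[sincelastrun<23.2917] yes=1,no=2,missing=2\n"
    "\t1:[sincelastrun<16.8917] yes=3,no=4,missing=4\n"
    "\t\t3:leaf=-0.0928132\n"
    "\t\t4:leaf=-0.0676056\n"
    "\t2:[sincelastrun<695.025] yes=5,no=6,missing=6\n"
    "\t\t5:leaf=-0.0945284\n"
    "\t\t6:leaf=-0.0937463\n",
]

leaves = [parse_leaves(tree_text) for tree_text in dump]
print(leaves)
```

Combined with the leaf indices from predict(..., pred_leaf=True), these per-tree tables give each tree's raw score for a sample.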

Related

How to efficiently filter a specific number of entries and concatenate them into a single tf.data.Dataset?

I have a huge TFRecord file with more than 4M entries. It is a very unbalanced dataset: some labels have many more entries than others. I want to select a limited number of entries for each of these labels in order to get a balanced dataset. Below, you can see my attempt, but it takes more than 24 hours to filter 1k entries from each label (33 different labels).
import tensorflow as tf
tf.compat.as_str(
    bytes_or_text='str', encoding='utf-8'
)
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
    print("Device:", tpu.master())
    strategy = tf.distribute.TPUStrategy(tpu)
except:
    strategy = tf.distribute.get_strategy()
print("Number of replicas:", strategy.num_replicas_in_sync)
ignore_order = tf.data.Options()
ignore_order.experimental_deterministic = False
dataset = tf.data.TFRecordDataset('/test.tfrecord')
dataset = dataset.with_options(ignore_order)
features, feature_lists = detect_schema(dataset)
#Decodings TFRecord serialized data
def decode_data(serialized):
    X, y = tf.io.parse_single_sequence_example(
        serialized,
        context_features=features,
        sequence_features=feature_lists)
    return X['title'], y['subject']
dataset = dataset.map(lambda x: tf.py_function(func=decode_data, inp=[x], Tout=(tf.string, tf.string)))
#Filtering and concatenating the samples
def balanced_dataset(dataset, labels_list, sample_size=1000):
    datasets_list = []
    for label in labels_list:
        #Filtering the chosen labels
        locals()[label] = dataset.filter(lambda x, y: tf.greater(tf.reduce_sum(tf.cast(tf.equal(tf.constant(label, dtype=tf.int64), y), tf.float32)), tf.constant(0.)))
        #appending a limited sample
        datasets_list.append(locals()[label].take(sample_size))
    concat_dataset = datasets_list[0]
    #concatenating the datasets
    for dset in datasets_list[1:]:
        concat_dataset = concat_dataset.concatenate(dset)
    return concat_dataset
balanced_data = balanced_dataset(tabledataset, labels_list=list(decod_dic.values()), sample_size=1000)
One way to solve this is to use the group_by_window method, where window_size is the sample size for each class (in your case 1k).
ds = ds.group_by_window(
    # Use label as key
    key_func=lambda _, l: l,
    # Convert each window to a sample_size
    reduce_func=lambda _, window: window.batch(sample_size),
    # Use window size as sample_size
    window_size=sample_size)
This forms batches of size sample_size, each containing a single class. There is one problem, though: there will be multiple batches of the same class, while you need just one batch per class.
To solve this, we add a counter to each batch and then keep only the batches with count==0, which gives the first batch of every class.
Let's define an example:
labels = np.array(sum([[label]*repeat for label, repeat in zip([0, 1, 2], [100, 200, 15])], []))
features = np.arange(len(labels))
np.unique(labels, return_counts=True)
#(array([0, 1, 2]), array([100, 200, 15]))
# There are 3 labels chosen for simplicity and each of their counts are shown along.
sample_size = 15 # we choose to pick sample of 15 from each class
We create a dataset from the above inputs,
ds = tf.data.Dataset.from_tensor_slices((features, labels))
In the above window function we modify the reduce_func to make the counter, so the batch will have 3 elements (X_batch, y_batch, label_counter) :
def reduce_func(x, y):
    #class_count[y] += 1
    z = table.lookup(x)
    table.insert(x, z+1)
    return y.batch(sample_size).map(lambda a,b: (a, b, z))
# Group by window
ds = tf.data.Dataset.from_tensor_slices((features, labels))
ds = ds.group_by_window(
    # Use label as key
    key_func=lambda _, l: l,
    # Convert each window to a sample_size
    reduce_func=reduce_func,
    # Use window size as sample_size
    window_size=sample_size)
The counter logic in reduce_func is implemented with a lookup table, from which the counter is read and in which it is updated. It is initialized as shown below:
n_classes = 3
keys = tf.range(0,n_classes, dtype=tf.int64)
vals = tf.zeros_like(keys, dtype=tf.int64)
table = tf.lookup.experimental.MutableHashTable(key_dtype=tf.int64,
                                                value_dtype=tf.int64,
                                                default_value=-1)
table.insert(keys, vals)
Now we keep only the batches where count==0 and drop the count element to form (X, y) batch pairs:
ds = ds.filter(lambda x, y, count: count==0)
ds = ds.map(lambda x, y, count: (x, y))
Output:
for x, y in ds:
    print(x.numpy(), y.numpy())
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[100 101 102 103 104 105 106 107 108 109 110 111 112 113 114] [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[300 301 302 303 304 305 306 307 308 309 310 311 312 313 314] [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
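The same idea of keeping only the first sample_size elements of each class can be sketched in plain Python for intuition. This is not tf.data code: it swaps the windowing for a per-label counter inside a generator, reuses the toy labels from the example above, and take_per_label is a hypothetical helper name.

```python
from collections import defaultdict


def take_per_label(pairs, sample_size):
    """Yield at most sample_size (feature, label) pairs per label, in one pass."""
    seen = defaultdict(int)
    for feature, label in pairs:
        if seen[label] < sample_size:
            seen[label] += 1
            yield feature, label


# Same toy data as above: 100 zeros, 200 ones, 15 twos.
labels = [0] * 100 + [1] * 200 + [2] * 15
features = range(len(labels))
sample_size = 15

balanced = list(take_per_label(zip(features, labels), sample_size))

# Record the first feature kept for each label, to compare with the output above.
first_per_label = {}
for feature, label in balanced:
    first_per_label.setdefault(label, feature)

print(len(balanced), first_per_label)
```

The point of both versions is that the data is read once, instead of once per label as in the filter-and-concatenate approach from the question.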

Is there a way to get statistics of weights obtained from Tensorflow?

I am interested in developing a logit-based choice model using Tensorflow.
I am fairly new to this tool, so I was wondering if there is a way to get the statistics (i.e., the p-values) of the weights obtained from Tensorflow, just like one would get from Stata or SPSS.
The code does run, but I cannot be sure whether the model is valid unless I can compare the p-values of the variables with the estimation results from Stata.
The data structure is simple; it's a form of a survey, where a respondent chooses an alternative out of 4 options, each with different feature levels (a.k.a. a conjoint analysis).
(I am trying something new; that's why I am not using the pylogit or xlogit packages.)
Below is the code I wrote:
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
np.random.seed(0)
tf.random.set_seed(0)
variables = pd.read_excel('file.xls')
target_vars = ['A','B','C','D','E']
X = pd.DataFrame()
for i in target_vars:
    X[i]=variables[i]
y = variables['choice']
X_tn, X_te, y_tn, y_te = train_test_split(X, y, random_state=0)
n_feat = X_tn.shape[1]
epo = 100
model = Sequential()
model.add(Dense(1, input_dim=n_feat, activation='sigmoid'))
model.add(Dense(1))
model.compile(loss = 'mean_squared_error',
              optimizer = 'adam',
              metrics = ['mean_squared_error'])
hist = model.fit(X_tn, y_tn, epochs=epo, batch_size=4)
model.summary()
model.get_weights()
Some other optional questions, only if you are familiar with discrete choice models:
i) The original dataset is a conjoint survey with 4 alternatives in each choice situation; that's why I set batch_size=4. Am I doing it right?
ii) Have I set the number of epochs too large?
First of all, your question is about p-value significance, where the weights are significant against all the input data in scope.
The idea is that you may apply many functions or custom functions, but the activation layer is asynchronous or fairly chance-based, depending on your target.
( 1 ) You can have a model with a 2-class, 4-class or 10-class output to perform similarity significance, or maximum, minimum, average maximum or last changes, based on your selected function.
( 2 ) A prediction is a result of your input and of the non-significant and significant relationships that learning develops.
( 3 ) Comparing them is possible by putting them into the same range of expectation; otherwise it is a value for its subset.
sample output:
F:\temp\Python>python test_read_excel.py
0 1 2 3 4 5
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 0 1 0 0 0
3 0 0 0 1 0 0
4 0 0 0 0 1 0
5 0 0 0 0 0 1
(6, 6)
none significant:
[array([[-0.6489598]], dtype=float32), array([-0.0998228], dtype=float32), array([[1.7546983e-05]], dtype=float32), array([-3.6847262e-06], dtype=float32)]
** sample code **
variables = pd.read_excel('F:\\temp\\20220305\\Book 2.xlsx', index_col=None, header=None)
list_of_X = [ ]
list_of_Y = [ ]
for i in range(np.asarray(variables).shape[0]):
    for j in range(np.asarray(variables).shape[1]):
        if variables[j][i] == "X" :
            print('found: ' + str(i) + ":" + str(j))
            list_of_X.append(i)
            list_of_Y.append(1)
        else :
            list_of_X.append(i)
            list_of_Y.append(0)
list_of_X = np.reshape(list_of_X, (1, 36, 1))
list_of_Y = np.reshape(list_of_Y, (1, 36, 1))
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Model Initialize
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
model = tf.keras.models.Sequential([
    tf.keras.layers.InputLayer(input_shape=(36, 1)),
    tf.keras.layers.Dense(1 , activation='sigmoid' ),
])
model.add(tf.keras.layers.Dense(1))
model.summary()
model.compile(loss = 'mean_squared_error',
              optimizer = 'adam',
              metrics = ['mean_squared_error'])
history = model.fit(list_of_X, list_of_Y, epochs=1000, batch_size=4)
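Keras itself does not report standard errors or p-values for fitted weights. As a hedged sketch of what a package such as Stata computes for a logit coefficient, the snippet below derives the Wald z statistic and its two-sided p-value from a coefficient and its standard error. The numbers are invented for illustration, wald_p_value is a hypothetical helper name, and obtaining the standard error (for example from the inverse Hessian of the log-likelihood at the optimum) is assumed to happen elsewhere.

```python
import math


def wald_p_value(coef, std_err):
    """Return (z, p) for the null hypothesis that the coefficient is zero."""
    z = coef / std_err
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)),
    # written with the complementary error function.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Illustrative values only, not taken from any fitted model.
z, p = wald_p_value(coef=0.98, std_err=0.5)
print(z, p)
```

This only makes sense when the network is a plain logit model trained with the matching likelihood, so that the weights are maximum-likelihood estimates; with a hidden layer and a mean-squared-error loss, as in the code above, the weights have no such interpretation.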

Link Probability prediction between two nodes using Machine Learning or Deep Learning where node to node mapping is given

Can someone please direct me to a tutorial or provide a starting idea for the problem given below?
I have a mapping of authors to co-authors, given as follows:
mapping
>>
{0: [2860, 3117],
1: [318, 1610, 1776, 1865, 2283, 2507, 3076, 3108, 3182, 3357, 3675, 4040],
2: [164, 413, 1448, 1650, 3119, 3238],
} # this is just sample
link_attributes.iloc[:5,:7]
>>
first id keyword_0 keyword_10 keyword_13 keyword_15 keyword_2
0 4 0 1.0 1.0 1.0 1.0
1 9 1 1.0 1.0 1.0 1.0
2 7 2 1.0 NaN 1.0 1.0
3 6 3 1.0 1.0 NaN 1.0
4 9 4 1.0 1.0 1.0 1.0
I have to predict the probability of a link between a Source and a Sink.
For example, if I am given Source=13 and Sink=31, then I have to find the probability of a link between 13 and 31. All the links are undirected.
import json
import numpy
from keras import Sequential
from keras.layers import Dense
def get_keys(data, keys):  # get all keys from json file
    if isinstance(data, list):
        for item in data:
            get_keys(item, keys)
    if isinstance(data, dict):
        sub_keys = data.keys()
        for sub_key in sub_keys:
            keys.append(sub_key)
# get all keys, each key is a feature of instances
json_data = open("nodes.json") # read 4016 instances
jdata = json.load(json_data)
keys = []
get_keys(jdata, keys)
keys = set(keys)
print(set(keys))
def build_instance(json_object):  # use to build instance from json object, ex: instance = [f0,f1,f2,f3,....f404]
    features = []
    features.append(json_object.get('id'))
    for key in keys:
        value = json_object.get(key)
        if value is None:
            value = 0
        elif key == 'id':
            continue
        features.append(value)
    return features
# read all instances and format them, each instance will be [f0,f1, f2,...], as i read from json file, each instance will have 405 features
instances = []
num_of_instances = 0
for item in jdata:
    features = build_instance(item)
    instances.append(features)
    num_of_instances = num_of_instances + 1
print(num_of_instances)
# read "author_id - co author ids" file
traintxt = open('train.txt', 'r')
lines = traintxt.readlines()
au_vs_co_auth_list = []
for line in lines:
    line = line.split('\t', 200)
    print(line)
    # convert value from string to int
    string = line[0]  # example line[0] = '14 445'
    id_vs_coauthor = string.split(" ", 200)
    id = id_vs_coauthor[0]
    co_author = id_vs_coauthor[1]
    line[0:1] = [int(id), int(co_author)]
    for i in range(2, len(line)):
        line[i] = int(line[i])
    au_vs_co_auth_list.append(line)
print(len(au_vs_co_auth_list))  # we have 4016 authors
X_train = []
Y_train = []
generated_train_pairs = []
train_num = 30000 # choose 30000 random training instances
for i in range(train_num):
    print(i)
    index1 = numpy.random.randint(0, len(au_vs_co_auth_list), 1)[0]
    co_authors_of_index1 = au_vs_co_auth_list[index1]
    author_id_of_index_1 = au_vs_co_auth_list[index1][0]
    if index1 % 2 == 0:  # try to create a sample that two author is not related
        index2 = numpy.random.randint(0, len(au_vs_co_auth_list), 1)[0]
        author_id_of_index_2 = au_vs_co_auth_list[index2][0]
        # make sure id1 != id2 and auth 1 and auth2 are not related
        while (index1 == index2) or (author_id_of_index_2 in co_authors_of_index1):
            index2 = numpy.random.randint(0, len(au_vs_co_auth_list), 1)[0]
            author_id_of_index_2 = au_vs_co_auth_list[index2][0]
        y = [0, 1]  # [relative=FALSE,non-related = TRUE]
    else:  # try to create a sample that two author is related
        author_id_of_index_2 = numpy.random.randint(1, len(co_authors_of_index1),size=1)[0]
        y = [1, 0]  # [relative=TRUE,non-related = FALSE]
    x = instances[author_id_of_index_1][1:] + instances[author_id_of_index_2][
        1:]  # x = [feature1, feature2,...feature404',feature1', feature2',...feature404']
    X_train.append(x)
    Y_train.append(y)
X_train = numpy.asarray(X_train)
Y_train = numpy.asarray(Y_train)
print(X_train.shape)
print(Y_train.shape)
# now we have x_train, y_train, build model right now
model = Sequential()
model.add(Dense(512, input_shape=X_train[0].shape, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(2, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=512, epochs=3, verbose=2)
model.save("model.h5")
# now to predict probability of linking between two author ids
id1 = 11 # just random
id2 = 732 # just random
author1 = None
author2 = None
for item in jdata:
    if item.get('id') == id1:
        author1 = build_instance(item)
    if item.get('id') == id2:
        author2 = build_instance(item)
    if author1 is not None and author2 is not None:
        break
x_test = author1[1:] + author2[1:]
x_test = numpy.expand_dims(numpy.asarray(x_test), axis=0)
probability = model.predict(x_test)
print("author id ", id1, " and author id ", id2, end=" ")
if probability[0][1] > probability[0][0]:
    print("Not related")
else:
    print("Related")
print(probability)
Output:
author id 11 and author id 732 related
Before diving into how to find a solution, I recommend understanding your data well and spending a good part of your time digesting the problem and preparing a dataset.
From the scenario you described, it seems to me that your problem is: given two nodes and their attributes, predict whether there is a link between them. This can be interpreted as a binary classification task. I will provide an initial, minimal solution.
What confused me is that you mention having only link attributes (link_attributes.iloc[:5,:7]) but no node attributes. If you have node attributes, things make more sense, because then we can simply build all pairs of nodes and label the pairs which are not connected as 0 (not_connected) and the connected ones as 1 (connected).
So let's make a dataset. As I didn't exactly understand what the link attributes mean, let's generate some random data; we can adapt this into a better example if you edit your question with more details about your data.
About creating a Dataset
For every node in the mapping we will create 10 dummy random columns, just for the sake of demonstration.
Then we will create a list of all authors and coauthors called list_of_authors and generate pairs out of it, calling them pair_of_authors.
For every pair of authors we will label them as linked or not linked using the mapping; for that I created a function called check_if_pair_is_linked.
After this I will show how to create a simple baseline solution for the task. We will use scikit-learn, which has a big list of easy-to-use models for classification.
Code
I folded the code and describe it in 3 major simple steps:
Prepare your inputs to create a dataset (using mappings and attributes)
Create the dataset (for every pair of authors, label them as linked or not and concatenate their attributes)
Use scikit-learn to fit, predict and evaluate a model
import itertools
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn import svm
######## first preparing data to create a dataset
+-- 17 lines: ### we already have mappings {----------
This part creates -----> mappings => {
    0:[coauthor12,coauthor17231,...],
    1:[...],
    ...,
    732: [...]
}
author_attributes => {
    0:[a0_1,attr0_2,...,attr0_10],
    1:[a1_1,attr1_2,...,attr1_10],
    ...,
    732: [...]
}
#### Generating our Dataset and preparing dataset to the scikit-learn library(and most other) format
### The idea is generating pairs of authors regardless if they're linked or not and label if a pair is linked
+-- 24 lines: {----------
This part creates a list of pairs of authors containing (attributes_of_both_authors, is_linked_label)
-----> dataset = [
    ([a0_1,...,a0_10,a1_1,...,a1_10],label_pair0_1),
    ([a0_1,...,a0_10,a2_1,...,a2_10],label_pair1_2),
    ...
    ([a142_1,...,a142_10,a37_1,...,a37_10],label_pair142_37),
]
#### Training and evaluating a simple machine learning solution
+-- 12 lines: ----------
This part outputs a report about the model after training it on a training dataset and evaluating it on a test dataset (I used the same data for training and testing, but don't ever do that in a real scenario)
----->        precision    recall  f1-score   support
           0       0.93      1.00      0.96       466
           1       1.00      0.10      0.18        40
Solution:
import itertools
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn import svm
######## first preparing data to create a dataset
#### we already have mappings
def generate_random_author_attributes(mapping):
    author_attributes = {}
    for author in mapping.keys():
        author_attributes[author] = np.random.random(10).tolist()
    for coauthors in mapping.values():
        for coauthor in coauthors:
            if author_attributes.get(coauthor, False):
                pass  # check if this coauthor was already added
            else:
                author_attributes[coauthor] = np.random.random(10).tolist()
    return author_attributes
mapping = {
    0: [2860, 3117],
    1: [318, 1610, 1776, 1865, 2283, 2507, 3076, 3108, 3182, 3357, 3675, 4040],
    2: [164, 413, 1448, 1650, 3119, 3238],
}
#### hopefully you have attributes like this, for each author you have some attributes, I created 10 random values for each for demonstrating
author_attributes = generate_random_author_attributes(mapping)
#### Generating our Dataset and preparing dataset to the scikit-learn library(and most other) format
### The idea is generating pairs of authors regardless if they're linked or not and label if a pair is linked
def check_if_pair_is_linked(pair_of_authors, mapping):
    ''' return 1 if linked, 0 if not linked'''
    coauthors_of_author0 = mapping.get(pair_of_authors[0],[])
    coauthors_of_author1 = mapping.get(pair_of_authors[1],[])
    if(pair_of_authors[1] in coauthors_of_author0) or (pair_of_authors[0] in coauthors_of_author1):
        return 1
    else:
        return 0
def create_dataset(author_attributes, mapping):
    list_of_all_authors = list(itertools.chain.from_iterable([coauthors for coauthors in mapping.values()]))
    list_of_all_authors.extend(mapping.keys())
    dataset = []
    for pair_of_authors in itertools.permutations(list_of_all_authors, 2):
        pair_label = check_if_pair_is_linked(pair_of_authors, mapping)
        pair_attributes = author_attributes[pair_of_authors[0]] + author_attributes[pair_of_authors[1]]
        dataset.append((pair_attributes,pair_label))
    return dataset
dataset=create_dataset(author_attributes, mapping)
X_train = [pair_attributes for pair_attributes,_ in dataset]
y_train = [pair_label for _,pair_label in dataset]
#### Training and evaluating a simple machine learning solution
binary_classifier = svm.SVC()
binary_classifier.fit(X_train, y_train)
#### Checking if the model is good
X_test = X_train  # never use your training data as test data
y_test = y_train
true_pairs_label = y_test
predicted_pairs_label = binary_classifier.predict(X_test)
print(classification_report(true_pairs_label, predicted_pairs_label))
OUTPUT
              precision    recall  f1-score   support
           0       0.93      1.00      0.96       466
           1       1.00      0.15      0.26        40
    accuracy                           0.93       506
   macro avg       0.97      0.57      0.61       506
weighted avg       0.94      0.93      0.91       506
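Because the question states that all links are undirected, each pair only needs to appear once. The sketch below is a variation on create_dataset, not the answer's code: it uses itertools.combinations over a de-duplicated author list so that (a, b) and (b, a) are not both generated. build_pairs and is_linked are hypothetical helper names, and the mapping is the sample from the question.

```python
import itertools

# Sample mapping from the question: author -> list of co-authors.
mapping = {
    0: [2860, 3117],
    1: [318, 1610, 1776, 1865, 2283, 2507, 3076, 3108, 3182, 3357, 3675, 4040],
    2: [164, 413, 1448, 1650, 3119, 3238],
}


def is_linked(a, b, mapping):
    """Return 1 if a and b are co-authors in either direction, else 0."""
    return int(b in mapping.get(a, []) or a in mapping.get(b, []))


def build_pairs(mapping):
    """Return [((a, b), label)] with each unordered pair listed once."""
    authors = sorted(set(mapping) | {c for cs in mapping.values() for c in cs})
    return [((a, b), is_linked(a, b, mapping))
            for a, b in itertools.combinations(authors, 2)]


pairs = build_pairs(mapping)
n_linked = sum(label for _, label in pairs)
print(len(pairs), n_linked)
```

This halves the dataset compared with itertools.permutations. If the feature vector is built by concatenating the two authors' attributes, the classifier will still see a fixed order within each pair, so an order-independent combination (for example an element-wise sum or absolute difference) may be worth considering.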

How to show the class distribution in Dataset object in Tensorflow

I am working on a multi-class classification task using my own images.
filenames = [] # a list of filenames
labels = [] # a list of labels corresponding to the filenames
full_ds = tf.data.Dataset.from_tensor_slices((filenames, labels))
This full dataset will be shuffled and split into train, valid and test dataset
full_ds_size = len(filenames)
full_ds = full_ds.shuffle(buffer_size=full_ds_size*2, seed=128) # seed is used for reproducibility
train_ds_size = int(0.64 * full_ds_size)
valid_ds_size = int(0.16 * full_ds_size)
train_ds = full_ds.take(train_ds_size)
remaining = full_ds.skip(train_ds_size)
valid_ds = remaining.take(valid_ds_size)
test_ds = remaining.skip(valid_ds_size)
Now I am struggling to understand how each class is distributed in train_ds, valid_ds and test_ds. An ugly solution is to iterate over all the elements in the dataset and count the occurrences of each class. Is there a better way to solve it?
My ugly solution:
import collections
def get_class_distribution(dataset):
    class_distribution = {}
    for element in dataset.as_numpy_iterator():
        label = element[1]
        if label in class_distribution.keys():
            class_distribution[label] += 1
        else:
            class_distribution[label] = 1  # the first occurrence counts as 1
    # sort dict by key
    class_distribution = collections.OrderedDict(sorted(class_distribution.items()))
    return class_distribution
train_ds_class_dist = get_class_distribution(train_ds)
valid_ds_class_dist = get_class_distribution(valid_ds)
test_ds_class_dist = get_class_distribution(test_ds)
print(train_ds_class_dist)
print(valid_ds_class_dist)
print(test_ds_class_dist)
The answer below assumes:
there are five classes.
labels are integers from 0 to 4.
each dataset element is a dict with a 'label' key.
It can be modified to suit your needs.
Define a counter function:
def count_class(counts, batch, num_classes=5):
    labels = batch['label']
    for i in range(num_classes):
        cc = tf.cast(labels == i, tf.int32)
        counts[i] += tf.reduce_sum(cc)
    return counts
Use the reduce operation:
initial_state = dict((i, 0) for i in range(5))
counts = train_ds.reduce(initial_state=initial_state,
                         reduce_func=count_class)
print([(k, v.numpy()) for k, v in counts.items()])
A solution inspired by user650654's answer, using only TensorFlow primitives (tf.unique_with_counts instead of a for loop).
In theory, this should perform better and scale better to large datasets, batches or class counts.
num_classes = 5
#tf.function
def count_class(counts, batch):
    y, _, c = tf.unique_with_counts(batch[1])
    return tf.tensor_scatter_nd_add(counts, tf.expand_dims(y, axis=1), c)
counts = train_ds.reduce(
    initial_state=tf.zeros(num_classes, tf.int32),
    reduce_func=count_class)
print(counts.numpy())
A similar and simpler version with numpy, which actually performed better for my simple use case:
count = np.zeros(num_classes, dtype=np.int32)
for _, labels in train_ds:
    y, _, c = tf.unique_with_counts(labels)
    count[y.numpy()] += c.numpy()
print(count)
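For small datasets where a Python loop is acceptable, the counting itself can be sketched with collections.Counter. The label list below is a made-up stand-in for the labels you would collect by iterating the dataset (for example via as_numpy_iterator()), so no TensorFlow is needed to run it.

```python
from collections import Counter

# Made-up labels standing in for the second component of each dataset element.
labels = [0, 2, 1, 0, 4, 2, 0, 3, 1, 0]
num_classes = 5

# Counter tallies occurrences; classes that never appear default to 0.
counts = Counter(labels)
distribution = [counts.get(i, 0) for i in range(num_classes)]

# Relative frequencies are often easier to compare across splits of different size.
fractions = [c / len(labels) for c in distribution]

print(distribution, fractions)
```

Comparing the fractions for train_ds, valid_ds and test_ds shows at a glance whether the random split kept the classes in similar proportions.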

Multi-Target and Multi-Class prediction

I am relatively new to machine learning as well as TensorFlow. I would like to train a model so that it can predict 2 targets, each with multiple classes. Is this something that can be done? I was able to implement the algorithm for 1 target, but I don't know how to do it for a second target as well.
An example dataset:
DayOfYear Temperature Flow Visibility
316 8 1 4
285 -1 1 4
326 8 2 5
323 -1 0 3
10 7 3 6
62 8 0 3
56 8 1 4
347 7 2 5
363 7 0 3
77 7 3 6
1 7 1 4
308 -1 2 5
364 7 3 6
If I train (DayOfYear Temperature Flow) I can predict the Visibility quite well. But I need to predict Flow as well somehow. I am pretty sure that Flow will influence Visibility so I am not sure how to go with that.
This is the implementation that I have
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import urllib
import numpy as np
import tensorflow as tf
# Data sets
TRAINING = "/ml_baetterich_learn.csv"
TEST = "/ml_baetterich_test.csv"
VALIDATION = "/ml_baetterich_validation.csv"
def main():
    # Load datasets.
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=TRAINING,
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=-1)
    test_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=TEST,
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=-1)
    validation_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=VALIDATION,
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=-1)
    # Specify that all features have real-value data
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=3)]
    # Build 3 layer DNN with 10, 20, 10 units respectively.
    classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                                hidden_units=[10, 20, 10],
                                                n_classes=9,
                                                model_dir="/tmp/iris_model")
    # Define the training inputs
    def get_train_inputs():
        x = tf.constant(training_set.data)
        y = tf.constant(training_set.target)
        return x, y
    # Fit model.
    classifier.fit(input_fn=get_train_inputs, steps=4000)
    # Define the test inputs
    def get_test_inputs():
        x = tf.constant(test_set.data)
        y = tf.constant(test_set.target)
        return x, y
    # Define the validation inputs
    def get_validation_inputs():
        x = tf.constant(validation_set.data)
        y = tf.constant(validation_set.target)
        return x, y
    # Evaluate accuracy.
    accuracy_test_score = classifier.evaluate(input_fn=get_test_inputs,
                                              steps=1)["accuracy"]
    accuracy_validation_score = classifier.evaluate(input_fn=get_validation_inputs,
                                                    steps=1)["accuracy"]
    print("\nValidation Accuracy: {0:0.2f}\nTest Accuracy: {1:0.2f}\n".format(accuracy_validation_score, accuracy_test_score))
    # Classify two new samples.
    def new_samples():
        return np.array(
            [[327, 8, 3],
             [47, 8, 0]], dtype=np.float32)
    predictions = list(classifier.predict_classes(input_fn=new_samples))
    print(
        "New Samples, Class Predictions: {}\n"
        .format(predictions))
if __name__ == "__main__":
    main()
Option 1: multi-headed model
You could use a multi-headed DNNEstimator model. This treats Flow and Visibility as two separate softmax classification targets, each with their own set of classes. I had to modify the load_csv_without_header helper function to support multiple targets (which could be cleaner, but is not the point here - feel free to ignore its details).
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
import csv
import collections
num_flow_classes = 4
num_visib_classes = 7
Dataset = collections.namedtuple('Dataset', ['data', 'target'])
def load_csv_without_header(fn, target_dtype, features_dtype, target_columns):
    with gfile.Open(fn) as csv_file:
        data_file = csv.reader(csv_file)
        data = []
        targets = {
            target_cols: []
            for target_cols in target_columns.keys()
        }
        for row in data_file:
            cols = sorted(target_columns.items(), key=lambda tup: tup[1], reverse=True)
            for target_col_name, target_col_i in cols:
                targets[target_col_name].append(row.pop(target_col_i))
            data.append(np.asarray(row, dtype=features_dtype))
    targets = {
        target_col_name: np.array(val, dtype=target_dtype)
        for target_col_name, val in targets.items()
    }
    data = np.array(data)
    return Dataset(data=data, target=targets)
feature_columns = [
    tf.contrib.layers.real_valued_column("", dimension=1),
    tf.contrib.layers.real_valued_column("", dimension=2),
]
head = tf.contrib.learn.multi_head([
    tf.contrib.learn.multi_class_head(
        num_flow_classes, label_name="Flow", head_name="Flow"),
    tf.contrib.learn.multi_class_head(
        num_visib_classes, label_name="Visibility", head_name="Visibility"),
])
classifier = tf.contrib.learn.DNNEstimator(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    model_dir="iris_model",
    head=head,
)
def get_input_fn(filename):
    def input_fn():
        dataset = load_csv_without_header(
            fn=filename,
            target_dtype=np.int,
            features_dtype=np.int,
            target_columns={"Flow": 2, "Visibility": 3}
        )
        x = tf.constant(dataset.data)
        y = {k: tf.constant(v) for k, v in dataset.target.items()}
        return x, y
    return input_fn
classifier.fit(input_fn=get_input_fn("tmp_train.csv"), steps=4000)
res = classifier.evaluate(input_fn=get_input_fn("tmp_test.csv"), steps=1)
print("Validation:", res)
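The one non-obvious detail in the modified loader is that it pops the target columns in descending index order. A minimal pure-Python sketch of that step, using a hypothetical helper named split_row (not part of the answer's code) and the first row of the example dataset:

```python
def split_row(row, target_columns):
    """Split one CSV row into (features, targets) given {name: column index}."""
    features = list(row)  # copy so the caller's row is left untouched
    targets = {}
    # Pop the highest index first so the remaining indices stay valid.
    by_index_desc = sorted(target_columns.items(), key=lambda kv: kv[1], reverse=True)
    for name, index in by_index_desc:
        targets[name] = features.pop(index)
    return features, targets

# First row of the example dataset: DayOfYear, Temperature, Flow, Visibility.
row = ["316", "8", "1", "4"]
features, targets = split_row(row, {"Flow": 2, "Visibility": 3})
print(features)  # ['316', '8']
print(targets)   # {'Visibility': '4', 'Flow': '1'}
```

Popping in ascending order would remove column 2 first and shift Visibility to index 2, so the later pop at index 3 would fail; sorting in reverse avoids that.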
Option 2: multi-labeled head
If you keep your CSV data separated by commas, and keep the last column for all the classes a row might have (separated by some token such as space), you can use the following code:
import numpy as np
import tensorflow as tf
all_classes = ["0", "1", "2", "3", "4", "5", "6"]
def k_hot(classes_col, all_classes, delimiter=' '):
    table = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(all_classes)
    )
    classes = tf.string_split(classes_col, delimiter)
    ids = table.lookup(classes)
    num_items = tf.cast(tf.shape(ids)[0], tf.int64)
    num_entries = tf.shape(ids.indices)[0]
    y = tf.SparseTensor(
        indices=tf.stack([ids.indices[:, 0], ids.values], axis=1),
        values=tf.ones(shape=(num_entries,), dtype=tf.int32),
        dense_shape=(num_items, len(all_classes)),
    )
    y = tf.sparse_tensor_to_dense(y, validate_indices=False)
    return y
def feature_engineering_fn(features, labels):
    labels = k_hot(labels, all_classes)
    return features, labels
feature_columns = [
    tf.contrib.layers.real_valued_column("", dimension=1),  # DayOfYear
    tf.contrib.layers.real_valued_column("", dimension=2),  # Temperature
]
classifier = tf.contrib.learn.DNNEstimator(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    model_dir="iris_model",
    head=tf.contrib.learn.multi_label_head(n_classes=len(all_classes)),
    feature_engineering_fn=feature_engineering_fn,
)
def get_input_fn(filename):
    def input_fn():
        dataset = tf.contrib.learn.datasets.base.load_csv_without_header(
            filename=filename,
            target_dtype="S100",  # strings of length up to 100 characters
            features_dtype=np.int,
            target_column=-1
        )
        x = tf.constant(dataset.data)
        y = tf.constant(dataset.target)
        return x, y
    return input_fn
classifier.fit(input_fn=get_input_fn("tmp_train.csv"), steps=4000)
res = classifier.evaluate(input_fn=get_input_fn("tmp_test.csv"), steps=1)
print("Validation:", res)
We are using DNNEstimator with a multi_label_head, which uses sigmoid cross-entropy rather than softmax cross-entropy as its loss function. This means that each output unit/logit is passed through the sigmoid function, which gives the likelihood of the data point belonging to that class, i.e. the classes are computed independently and are not mutually exclusive as they are with softmax cross-entropy. As a result, you can have between 0 and len(all_classes) classes set for each row, both in the training set and in the final predictions.
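The difference between the two output activations can be seen in a few lines of plain Python. This is only a sketch with made-up logits, not what the estimator computes internally:

```python
import math

def softmax(logits):
    """Mutually exclusive class probabilities that sum to 1."""
    peak = max(logits)  # shift by the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Independent per-class probabilities; no constraint on their sum."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Hypothetical logits for three classes (illustrative values only).
logits = [2.0, 1.5, -3.0]

soft = softmax(logits)
sig = sigmoid(logits)

print(round(sum(soft), 6))     # 1.0
print([p > 0.5 for p in sig])  # [True, True, False]
```

With softmax the classes compete for a total probability of 1, while with sigmoid two classes can both exceed 0.5 for the same row, which is what a multi-label head needs.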
Also notice that the classes are represented as strings (and k_hot makes the conversion to token indices), so that you could use arbitrary class identifiers such as category UUIDs in e-commerce settings. If the categories in the 3rd and 4th column are different (Flow ID 1 != Visibility ID 1), you could prepend the column name to each class ID, e.g.
316,8,flow1 visibility4
285,-1,flow1 visibility4
326,8,flow2 visibility5
For a description of how k_hot works, see my other SO answer. I decided to keep k_hot as a separate function (rather than defining it directly in feature_engineering_fn) because it's a distinct piece of functionality, and TensorFlow will probably soon have a similar utility function.
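As a rough illustration of what k_hot produces, here is a pure-Python sketch of the same encoding. The helper name k_hot_rows and the four-entry vocabulary are assumptions for this example; the TensorFlow version above does the equivalent with a lookup table and a sparse tensor.

```python
def k_hot_rows(label_strings, all_classes, delimiter=" "):
    """Turn delimiter-separated class strings into k-hot rows (lists of 0/1)."""
    index_of = {name: i for i, name in enumerate(all_classes)}
    rows = []
    for labels in label_strings:
        row = [0] * len(all_classes)
        for name in labels.split(delimiter):
            row[index_of[name]] = 1  # mark every class present in this row
        rows.append(row)
    return rows

# Hypothetical vocabulary using the prefixed IDs from the example above.
all_classes = ["flow1", "flow2", "visibility4", "visibility5"]
encoded = k_hot_rows(["flow1 visibility4", "flow2 visibility5"], all_classes)
print(encoded)  # [[1, 0, 1, 0], [0, 1, 0, 1]]
```

Each row ends up with one 1 per class it carries, which is the label layout a multi-label head trains against.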
Note that if you're now using the first two columns to predict the last two columns, your accuracy will certainly go down, as the last two columns are highly correlated and using one of them will give you a lot of information about the other. Actually, your code was using only the 3rd column, which was kind of a cheat anyway if the goal is to predict the 3rd and 4th columns.
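The correlation is easy to check on the 13 sample rows from the question. The sketch below only looks at those rows, so the full dataset may behave differently:

```python
# (DayOfYear, Temperature, Flow, Visibility) rows from the example dataset.
rows = [
    (316, 8, 1, 4), (285, -1, 1, 4), (326, 8, 2, 5), (323, -1, 0, 3),
    (10, 7, 3, 6), (62, 8, 0, 3), (56, 8, 1, 4), (347, 7, 2, 5),
    (363, 7, 0, 3), (77, 7, 3, 6), (1, 7, 1, 4), (308, -1, 2, 5),
    (364, 7, 3, 6),
]

# Set of differences between the two target columns across all rows.
offsets = {visibility - flow for _, _, flow, visibility in rows}
print(offsets)  # {3}
```

In this sample Visibility is always Flow + 3, so a model that sees Flow as an input can predict Visibility trivially.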