I have some training data (TRAIN) and some test data (TEST).
Each row of each dataframe contains an observed class (X) and some columns of binary (Y). BernoulliNB predicts the probability of X given Y in the test data based on the training data. I am trying to look up the probability of the observed class of each row in the test data (Pr).
Edit: I used Antoine Zambelli's advice to fix the code:
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()
# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
'Y1': [1,1,0,0],
'Y4': [1,0,0,0]})
# Test Data
TEST = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
'Y1': [1,1,0,1,0,1,0,0,0],
'Y2': [1,0,1,0,1,0,1,0,1],
'Y3': [1,1,0,1,1,0,0,0,0],
'Y4': [1,1,0,1,1,0,0,0,0]})
# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns)-set(TRAIN.columns)
for i in diff_cols:
TRAIN[i] = 0
# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST ['X']
df_Tr_Y = TRAIN .drop('X', axis=1)
df_Te_Y = TEST .drop('X', axis=1)
# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)
# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)
# Rename the columns after the classes of X
df_R.columns = BNB.classes_
df_S = df_R .join(TEST)
# Look up the predicted probability of the observed X
# Skip X's that are not in the training data
def get_lu(df):
def lu(i, j):
return df.get(j, {}).get(i, np.nan)
return lu
df_S['Pr'] = [*map(get_lu(df_R), df_S .T, df_S .X)]
This seemed to work, giving me the result (df_S):
This correctly gives a "NaN" for the first 2 rows because the training data contains no information about classes X=5 or X=0.
Ok, there's a couple issues here. I have a full working example below, but first those issues. Mainly the assertion that "This correctly gives a "NaN" for the first 2 rows".
This ties back to the way classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and not included in the training data, the algorithm doesn't know that. It is only going to look at the feature data and then try to predict the label from those. So it can't return nan (or 5, or anything not in the training set) - that nan is coming from your work going from df_R to df_S.
This leads to the second issue which is the line df_Te_Y = TEST .iloc[ : , 1 : ], that line should be df_Te_Y = TEST .iloc[ : , 2 : ], so that it does not include the label data. Label data only appears in the training set. The predicted labels will only ever be drawn from the set of labels that appear in the training data.
Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd
BNB = BernoulliNB()
# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0], 'X3': [0,0,0,0], 'X4': [1,0,0,0]})
# Test Data
test_df = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
'X1': [1,1,0,1,0,1,0,0,0],
'X2': [1,0,1,0,1,0,1,0,1],
'X3': [1,1,0,1,1,0,0,0,0],
'X4': [1,1,0,1,1,0,0,0,0]})
X = train_df.drop('Y', axis=1) # Known training data - all but 'Y' column.
Y = train_df['Y'] # Known training labels - just the 'Y' column.
X_te = test_df.drop('Y', axis=1) # Test data.
Y_te = test_df['Y'] # Only used to measure accuracy of prediction - if desired.
Ar_R = BNB.fit(X, Y).predict_proba(X_te) # Can be combined to a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_ # Rename as per class labels.
# Columns are class labels and Rows are observations.
# Each entry is a probability of that observation being assigned to that class label.
print(df_R)
predicted_labels = df_R.idxmax(axis=1).values # For each row, take the column with the highest prob in that row.
print(predicted_labels) # [1 1 3 1 3 2 3 3 3]
print(accuracy_score(Y_te, predicted_labels)) # Percent accuracy of prediction.
print(BNB.fit(X, Y).predict(X_te)) # [1 1 3 1 3 2 3 3 3], can be used in one line if predicted_label is all we want.
# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So probabilities have changed.
I recommend reviewing some tutorials or other material on clustering algorithms if this doesn't make sense after reading the code.
I would like to whiten each image in a batch. The code I have to do so is this:
def whiten(self, x):
shape = x.shape
x = K.batch_flatten(x)
mn = K.mean(x, 0)
std = K.std(x, 0) + K.epsilon()
r = (x - mn) / std
r = K.reshape(x, (-1,shape[1],shape[2],shape[3]))
return r
#
where x is (?, 320,320,1). I am not keen on the reshape function with a -1 arg. Is there a cleaner way to do this?
Let's see what the -1 does. From the Tensorflow documentation (Because the documentation from Keras is scarce compared to the one from Tensorflow):
If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant.
So what this means:
from keras import backend as K
X = tf.constant([1,2,3,4,5])
K.reshape(X, [-1, 5])
# Add one more dimension, the number of columns should be 5, and keep the number of elements to be constant
# [[1 2 3 4 5]]
X = tf.constant([1,2,3,4,5,6])
K.reshape(X, [-1, 3])
# Add one more dimension, the number of columns should be 3
# For the number of elements to be constant the number of rows should be 2
# [[1 2 3]
# [4 5 6]]
I think it is simple enough. So what happens in your code:
# Let's assume we have 5 images, 320x320 with 3 channels
X = tf.ones((5, 320, 320, 3))
shape = X.shape
# Let's flat the tensor so we can perform the rest of the computation
flatten = K.batch_flatten(X)
# What this did is: Turn a nD tensor into a 2D tensor with same 0th dimension. (Taken from the documentation directly, let's see that below)
flatten.shape
# (5, 307200)
# So all the other elements were squeezed in 1 dimension while keeping the batch_size the same
# ...The rest of the stuff in your code is executed here...
# So we did all we wanted and now we want to revert the tensor in the shape it had previously
r = K.reshape(flatten, (-1, shape[1],shape[2],shape[3]))
r.shape
# (5, 320, 320, 3)
Besides, I can't think of a cleaner way to do what you want to do. If you ask me, your code is already clear enough.
I met the usage of the reduce_mean with the vector as the second arguments. I looked through sensor flow manual but can't find the corresponding example. The codes are below:
tf.reduce_mean(train, [0,1,2]
where train is at size batchsize x H x L x 2
I also played with some experiments but can't figure out how this second vector input will be processed
tensor = tf.constant([[[2,2,4],[2,2,0]],[[2,2,0],[2,2,0]]])
trainenergy = tf.reduce_mean(tensor, [0,1,2])
Output = 1
tensor = tf.constant([[[2,2,4],[2,2,0]],[[2,2,0],[2,2,0]]])
trainenergy = tf.reduce_mean(tensor, [0])
Output = [[2 2 2]
[2 2 0]]
tensor = tf.constant([[[2,2,4],[2,2,0]],[[2,2,0],[2,2,0]]])
trainenergy = tf.reduce_mean(tensor, [0,1])
Output = [2 2 1]
Just figure out tf.reduce_mean(train, [0,1,2]) if the second argument is the vector. It will reduce the dimension as the order of the element is the vector. For example, the [0,1,2] will reduce along the axis of 0,1,2
Thanks for your help tensorflow community!
I have a question regarding understanding and visualizing the output of the estimator's evaluate function.
I have a DNNClassifier and have trained it on data with 10 output ranges predictions can go into.
After training and running
accuracy = classifier.evaluate(input_fn = test_input_fn)['accuracy']
I see my accuracy as 33.8%. Which who knows how good that is. (Probably not good)
How can I see the output of each of the comparisons?
As the test_data is ran I would like to see what the estimate is, and what the actual value is. Basically a side by side of y and y'.
something like: [0 0 0 0 0 0 0 0 1] vs [0 0 0 0 0 0 0 0 1 0] 'false'
Rather than just seeing the aggregated overall accuracy.
Thanks!
So in the event that someone reads the question above, and understands what I was trying to do (view the output of predictions), I have a solution.
The solution is to utilize the .predict() method.
A good example is here:
https://www.tensorflow.org/get_started/estimator#classify_new_samples
My code ended up looking like:
predict_input_fn = tf.estimator.inputs.numpy_input_fn(
x = {"x": np.array(predict_set.data)},
num_epochs = 1,
shuffle = False)
predictions = list(classifier.predict(input_fn=predict_input_fn))
print("\n Predictions:")
print(len(predictions))
for p in predictions:
print(int(p['classes'][0]))
which outputs the predictions in a column which I can copy / paste into some spread sheet program to examine my data.
I'm new to Tensorflow (And neural networks) and i am working on a simple classification problem. Would like to ask 2 questions.
Say i have 120 labels of a permutation of [1,2,3,4,5]. Is it really necessary for me to One-Hot encode before feeding it into my graph? If yes, should i encode before feeding into tensorflow?
And if i do One-Hot encode, the softmax prediction will give [0.001 0.202 0.321……0.002 0.0003 0.0004]. Running arg_max will produce the right index. How would i get tensorflow to return me the correct label instead of a one-hot result?
Thank you.
So your input are 120 labels in {1, 2, 3, 4, 5} (each of which can be either digit from 1 to 5)?
# Your input, a 1D tensor of 120 elements from 1-5.
# Better shift your label space to 0-4 instead.
labels = labels - 1
# Now convert to a 2D tensor of 120 x 5 onehot labels.
onehot_labels = tf.one_hot(labels, 5)
# Now some computations.
....
# You end up with some onehot_output
# of the same shape as your labels (120x5).
# As you said, arg_max will give you the index of the result,
# which is a 1D index label of 120 elements.
output = tf.argmax(onehot_output, axis=1).
# You might want to shift back to {1,2,3,4,5}.
output = output + 1