Scikit Learn: RandomForest: clf.predict works with float, but not clf.score - pandas

I'm working on a classification problem. The labels I am trying to predict:
df3['relevance'].unique()
array([ 3. , 2.5 , 2.33, 2.67, 2. , 1. , 1.67, 1.33, 1.25,
2.75, 1.75, 1.5 , 2.25])
When I call predict using the features I've made, it works OK:
clf = RandomForestClassifier()
clf.fit(df3[features], df3['relevance'])
pd.crosstab(clf.predict(df3[features]), df3['relevance'])
But when I call clf.score:
clf.score(df3[features], df3['relevance'])
I get
ValueError: continuous is not supported
Should I be classifying the relevance label I am trying to predict as another data type? Thanks for any help.

The issue you are facing is most likely caused by your relevance column being made up of continuous numbers.
I would suggest switching over to the RandomForestRegressor() if you are trying to predict continuous numbers. Otherwise, convert your variables into 1s and 0s based on some threshold value.

Simply encode labels as integers and everything will work well. Floats suggest regression.
In particular you can use LabelEncoder http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
>>> from sklearn.ensemble import RandomForestClassifier as RF
>>> import numpy as np
>>> X = np.array([[0], [1], [1.2]])
>>> y = [0.5, 1.2, -0.1]
>>> clf = RF()
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
>>> print(clf.score(X, y))
Traceback (most recent call last):
[.....]
ValueError: continuous is not supported
>>> y = [0, 1, 2]
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
>>> print(clf.score(X, y))
1.0
Alternatively, compute the score yourself, since for a classifier it is simply the fraction of correct predictions:
print(np.mean(clf.predict(X) == y))
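To make the "encode labels as integers" advice concrete, here is a minimal NumPy-only sketch. It assumes a small made-up relevance array and uses np.unique as a stand-in for what LabelEncoder does; the predictions are hypothetical and only there to show the accuracy arithmetic.

```python
import numpy as np

# Float-valued relevance labels, similar to those in the question (made-up sample).
relevance = np.array([3.0, 2.5, 2.33, 2.67, 2.0, 1.0, 2.5, 3.0])

# return_inverse=True maps each distinct float to an integer class id
# (0, 1, 2, ...), which is the same idea LabelEncoder implements.
classes, y_encoded = np.unique(relevance, return_inverse=True)

# Hypothetical predictions standing in for clf.predict(X):
# copy the true ids and make exactly one of them wrong.
y_pred = y_encoded.copy()
y_pred[0] = (y_pred[0] + 1) % len(classes)

# Accuracy is the fraction of predictions that match the true class ids.
accuracy = np.mean(y_pred == y_encoded)

# Class ids can be mapped back to the original float labels.
decoded = classes[y_encoded]

print(y_encoded, accuracy)
```

The integer ids are what you would pass to fit and score; classes[...] recovers the original relevance values when you need to report them.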

Related

'key of type tuple not found and not a MultiIndex' while generating ROC for multi-class classification

I am trying to generate a ROC curve using XGBoost for a multi-class classification problem, but I get 'key of type tuple not found and not a MultiIndex' every time.
Classification:
from xgboost import XGBClassifier
from xgboost import plot_tree
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from itertools import cycle
from sklearn.metrics import roc_auc_score
model = XGBClassifier()
model = model.fit(x_train, y_train)
print('Accuracy:', model.score(x_test,y_test))
score=cross_val_score(model,X,y,cv=5)
print(score)
print('CV Score:',np.mean(score))
y_pred1=model.predict(x_test)
Generating ROC:
n_classes = 5
fpr = dict()
tpr = dict()
roc_auc = dict()
lw=2
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred1[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
colors = cycle(['blue', 'red', 'green', 'yellow', 'pink'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()
Out:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-34-14f08a1b6222> in <module>
5 lw=2
6 for i in range(n_classes):
----> 7 fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred1[:, i])
8 roc_auc[i] = auc(fpr[i], tpr[i])
9 colors = cycle(['blue', 'red', 'green', 'yellow', 'pink'])
2 frames
/usr/local/lib/python3.8/dist-packages/pandas/core/series.py in _get_values_tuple(self, key)
1014
1015 if not isinstance(self.index, MultiIndex):
-> 1016 raise KeyError("key of type tuple not found and not a MultiIndex")
1017
1018 # If key is contained, would have returned by now
KeyError: 'key of type tuple not found and not a MultiIndex'
Q: Why is it returning a multi-index error even after I have 5 classes in my dataframe?

Using tf extract_image_patches for input to a CNN?

I want to extract patches from my original images to use them as input for a CNN.
After a little research I found a way to extract patches with
tensorflow.compat.v1.extract_image_patches.
Since these need to be reshaped to "image format" I implemented a method reshape_image_patches to reshape them and store the reshaped patches in an array.
image_patches2 = []
def reshape_image_patches(image_patches, sess, ksize_rows, ksize_cols):
    a = sess.run(tf.shape(image_patches))
    nr, nc = a[1], a[2]
    for i in range(nr):
        for j in range(nc):
            patch = tf.reshape(image_patches[0,i,j,], [ksize_rows, ksize_cols, 3])
            image_patches2.append(patch)
    return image_patches2
How can I use this in combination with Keras generators to make these patches the input of my CNN?
Edit 1:
I have tried the approach in Load tensorflow images and create patches
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
dataset = tf.keras.preprocessing.image_dataset_from_directory(
<directory>,
label_mode=None,
seed=1,
subset='training',
validation_split=0.1,
image_size=(900, 900))
get_patches = lambda x: (tf.reshape(
tf.image.extract_patches(
x,
sizes=[1, 16, 16, 1],
strides=[1, 8, 8, 1],
rates=[1, 1, 1, 1],
padding='VALID'), (111*111, 16, 16, 3)))
dataset = dataset.map(get_patches)
fig = plt.figure()
plt.subplots_adjust(wspace=.1, hspace=.2)
images = next(iter(dataset))
for index, image in enumerate(images):
    ax = plt.subplot(2, 2, index + 1)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.imshow(image)
plt.show()
In line: images = next(iter(dataset)) I get the error: InvalidArgumentError: Input to reshape is a tensor with 302800896 values, but the requested shape has 9462528
[[{{node Reshape}}]]
Does somebody know how to fix this?
tf.reshape does not change the order or the total number of elements in a tensor. As the error states, you are trying to reduce the total number of elements from 302800896 to 9462528. The tf.reshape call in question is the one inside your lambda function.
In the example below, I have recreated your scenario by passing 2 as the shape argument to tf.reshape. That shape cannot accommodate all the elements of the original tensor, so the call throws the same error.
Code -
%tensorflow_version 2.x
import tensorflow as tf
t1 = tf.Variable([1,2,2,4,5,6])
t2 = tf.reshape(t1, 2)
Output -
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-3-0ff1d701ff22> in <module>()
3 t1 = tf.Variable([1,2,2,4,5,6])
4
----> 5 t2 = tf.reshape(t1, 2)
3 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: Input to reshape is a tensor with 6 values, but the requested shape has 2 [Op:Reshape]
tf.reshape can change the arrangement of the elements, but the total number of elements must remain the same. So the fix here is to change the shape to [2,3] -
Code -
%tensorflow_version 2.x
import tensorflow as tf
t1 = tf.Variable([1,2,2,4,5,6])
t2 = tf.reshape(t1, [2,3])
print(t2)
Output -
tf.Tensor(
[[1 2 2]
[4 5 6]], shape=(2, 3), dtype=int32)
To solve your problem, either extract patches (tf.image.extract_patches) whose total size matches the shape you pass to tf.reshape, or change the tf.reshape shape to match the size of the extracted patches.
I also suggest looking into other tf.image functionality such as tf.image.central_crop and tf.image.crop_and_resize.
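The two numbers in the error message can be checked with plain arithmetic. The sketch below uses only figures from the question (900x900 RGB images, 16x16 patches, stride 8, padding='VALID'). The interpretation of the leftover factor as a batch dimension is an assumption, based on image_dataset_from_directory batching its output with a default batch_size of 32.

```python
# Numbers taken from the question: 900x900 RGB images, 16x16 patches, stride 8.
image_size, patch, stride, channels = 900, 16, 8, 3

# With padding='VALID', the number of patch positions along one axis is
# floor((image_size - patch) / stride) + 1.
patches_per_axis = (image_size - patch) // stride + 1

# Number of values extract_patches produces for a single image.
values_per_image = patches_per_axis ** 2 * patch * patch * channels

# The error message reports this many values in the tensor being reshaped.
values_in_error = 302800896

# Assumption: the extra factor is the batch dimension of the dataset.
implied_batch_size = values_in_error // values_per_image

print(patches_per_axis, values_per_image, implied_batch_size)
# prints: 111 9462528 32
```

If that reading is right, the requested shape (111*111, 16, 16, 3) only accounts for one image. A target shape with a free leading dimension, such as (-1, 16, 16, 3), would let tf.reshape infer the number of patches across the whole batch.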

Tensorflow, tf.multinomial, get the associated probabilities failed

I am trying to use tf.multinomial to sample, and I want to get the probability associated with each sampled value. Here is my example code:
In [1]: import tensorflow as tf
In [2]: tf.enable_eager_execution()
In [3]: probs = tf.constant([[0.5, 0.2, 0.1, 0.2], [0.6, 0.1, 0.1, 0.1]], dtype=tf.float32)
In [4]: idx = tf.multinomial(probs, 1)
In [5]: idx # print the indices
Out[5]:
<tf.Tensor: id=43, shape=(2, 1), dtype=int64, numpy=
array([[3],
[2]], dtype=int64)>
In [6]: probs[tf.range(probs.get_shape()[0]), tf.squeeze(idx)]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-56ef51f84ca2> in <module>
----> 1 probs[tf.range(probs.get_shape()[0]), tf.squeeze(idx)]
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\ops\array_ops.py in _slice_helper(tensor, slice_spec, var)
616 new_axis_mask |= (1 << index)
617 else:
--> 618 _check_index(s)
619 begin.append(s)
620 end.append(s + 1)
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\ops\array_ops.py in _check_index(idx)
514 # TODO(slebedev): IndexError seems more appropriate here, but it
515 # will break `_slice_helper` contract.
--> 516 raise TypeError(_SLICE_TYPE_ERROR + ", got {!r}".format(idx))
517
518
TypeError: Only integers, slices (`:`), ellipsis (`...`), tf.newaxis (`None`) and scalar tf.int32/tf.int64 tensors are valid indices, got <tf.Tensor: id=7, shape=(2,), dtype=int32, numpy=array([3, 2])>
The expected result I want is [0.2, 0.1] as indicated by idx.
But in Numpy, this method works as answered in https://stackoverflow.com/a/23435869/5046896
How can I fix it?
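You can use tf.gather_nd to pick the sampled entry from each row. Note that tf.multinomial interprets its first argument as unnormalized log-probabilities (logits), so pass tf.log(probs) if the samples should follow probs.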
>>> import tensorflow as tf
>>> tf.enable_eager_execution()
>>> probs = tf.constant([[0.5, 0.2, 0.1, 0.2], [0.6, 0.1, 0.1, 0.1]], dtype=tf.float32)
>>> idx = tf.multinomial(probs, 1)
>>> row_indices = tf.range(probs.get_shape()[0], dtype=tf.int64)
>>> full_indices = tf.stack([row_indices, tf.squeeze(idx)], axis=1)
>>> rs = tf.gather_nd(probs, full_indices)
Alternatively, you can use tf.distributions.Multinomial. The advantage is that you do not need to know the batch size, as the code above does, so it also works with a varying batch size when the batch dimension is set to None. Here is a simple example:
multinomial = tf.distributions.Multinomial(
    total_count=tf.constant(1, dtype=tf.float32),  # sample one for each record in the batch, that is [1, batch_size]
    probs=probs)
sampled_actions = multinomial.sample()  # sample one action for data in the batch (one-hot per row)
predicted_actions = tf.argmax(sampled_actions, axis=-1)
action_probs = sampled_actions * probs  # the one-hot sample keeps only the sampled entry of probs
action_probs = tf.reduce_sum(action_probs, axis=-1)
I prefer the latter one because it is flexible and elegant.
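For readers who want to see what the stacked (row, column) index pairs select, here is a NumPy-only sketch of the same lookup. It is not TensorFlow code. The sampled indices are fixed to the values shown in the question instead of being drawn at random, so the result is reproducible.

```python
import numpy as np

probs = np.array([[0.5, 0.2, 0.1, 0.2],
                  [0.6, 0.1, 0.1, 0.1]], dtype=np.float32)

# Sampled column per row, fixed to the values from the question's output.
idx = np.array([[3], [2]])

# Pair every row number with its sampled column: [[0, 3], [1, 2]].
rows = np.arange(probs.shape[0])
full_indices = np.stack([rows, idx.squeeze(axis=1)], axis=1)

# Look up one entry per (row, column) pair, as gather_nd does with such pairs.
picked = probs[full_indices[:, 0], full_indices[:, 1]]

print(full_indices.tolist(), picked)
```

Each row of full_indices addresses exactly one element of probs, which is why the result has one probability per batch entry.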

Tensorflow dynamic_rnn propagates nans for batch size greater than 1

Hoping someone can help me understand an issue I have been having using LSTMs with dynamic_rnn in Tensorflow. As per this MWE, when I have a batch size of 1 with sequences that are incomplete (I pad the short tensors with nan's as opposed to zeros to highlight) everything operates as normal, the nan's in the short sequences are ignored as expected...
import tensorflow as tf
import numpy as np
batch_1 = np.random.randn(1, 10, 8)
batch_2 = np.random.randn(1, 10, 8)
batch_1[6:] = np.nan # lets make a short batch in batch 1 second sample of length 6 by padding with nans
seq_lengths_batch_1 = [6]
seq_lengths_batch_2 = [10]
tf.reset_default_graph()
input_vals = tf.placeholder(shape=[1, 10, 8], dtype=tf.float32)
lengths = tf.placeholder(shape=[1], dtype=tf.int32)
cell = tf.nn.rnn_cell.LSTMCell(num_units=5)
outputs, states = tf.nn.dynamic_rnn(cell=cell, dtype=tf.float32, sequence_length=lengths, inputs=input_vals)
last_relevant_value = states.h
fake_loss = tf.reduce_mean(last_relevant_value)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(fake_loss)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
_, fl, lrv = sess.run([optimizer, fake_loss, last_relevant_value], feed_dict={input_vals: batch_1, lengths: seq_lengths_batch_1})
print(fl, lrv)
_, fl, lrv = sess.run([optimizer, fake_loss, last_relevant_value], feed_dict={input_vals: batch_2, lengths: seq_lengths_batch_2})
print(fl, lrv)
sess.close()
which outputs properly populated values of the ilk....
0.00659429 [[ 0.11608966 0.08498846 -0.02892204 -0.01945034 -0.1197343 ]]
-0.080244 [[-0.03018401 -0.18946587 -0.19128899 -0.10388547 0.11360413]]
However, when I increase my batch size to 3, for example, the first batch executes correctly but then the second batch somehow causes nans to start propagating:
import tensorflow as tf
import numpy as np
batch_1 = np.random.randn(3, 10, 8)
batch_2 = np.random.randn(3, 10, 8)
batch_1[1, 6:] = np.nan
batch_2[0, 8:] = np.nan
seq_lengths_batch_1 = [10, 6, 10]
seq_lengths_batch_2 = [8, 10, 10]
tf.reset_default_graph()
input_vals = tf.placeholder(shape=[3, 10, 8], dtype=tf.float32)
lengths = tf.placeholder(shape=[3], dtype=tf.int32)
cell = tf.nn.rnn_cell.LSTMCell(num_units=5)
outputs, states = tf.nn.dynamic_rnn(cell=cell, dtype=tf.float32, sequence_length=lengths, inputs=input_vals)
last_relevant_value = states.h
fake_loss = tf.reduce_mean(last_relevant_value)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(fake_loss)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
_, fl, lrv = sess.run([optimizer, fake_loss, last_relevant_value], feed_dict={input_vals: batch_1, lengths: seq_lengths_batch_1})
print(fl, lrv)
_, fl, lrv = sess.run([optimizer, fake_loss, last_relevant_value], feed_dict={input_vals: batch_2, lengths: seq_lengths_batch_2})
print(fl, lrv)
sess.close()
giving
0.0533635 [[ 0.33622459 -0.0284576 0.11914439 0.14402215 -0.20783389]
[ 0.20805927 0.17591488 -0.24977767 -0.03432769 0.2944448 ]
[-0.04508523 0.11878576 0.07287208 0.14114542 -0.24467923]]
nan [[ nan nan nan nan nan]
[ nan nan nan nan nan]
[ nan nan nan nan nan]]
I have found this behavior quite strange, as I expected all values after the sequence lengths to be ignored as happens with a batch size of 1 but doesn't work with a batch size of 2 or more.
Obviously, nans do not get propagated if I use 0 as my padding value, but this doesn't inspire me with any confidence that dynamic_rnn is functioning as I am expecting it to.
I should also mention that if I remove the optimisation step the issue doesn't occur, so now I'm properly confused. After a day of trying many different permutations, I can't see why batch size would make any difference here.
I did not trace it down to the exact operation but here is what I believe to be the case.
Why aren't values beyond sequence_length ignored? They are ignored in the sense that they are multiplied by 0 (they are masked out) when doing some operations. Mathematically, the result is always a zero, so they should have no effect. Unfortunately, nan * 0 = nan. So, if you give nan values in your examples, they propagate. You might wonder why TensorFlow does not ignore them completely, but only masks them. The reason is performance on modern hardware. It is much easier to do operations on a large regular shape with a bunch of zeros than on several small shapes (that you get from decomposing an irregular shape).
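The nan * 0 point is easy to check outside TensorFlow. The NumPy sketch below is only an illustration of the arithmetic, not of how dynamic_rnn is implemented internally; the values and mask are made up.

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan, np.nan])  # last two steps are padding
mask = np.array([1.0, 1.0, 0.0, 0.0])          # 1 inside the sequence, 0 beyond it

# Masking by multiplication: nan * 0 is still nan, so the padding leaks through.
masked_by_multiply = values * mask

# Masking by selection: padded positions are replaced, never multiplied.
masked_by_select = np.where(mask > 0, values, 0.0)

total_multiply = masked_by_multiply.sum()
total_select = masked_by_select.sum()

print(total_multiply, total_select)
```

With any finite padding value the two approaches agree, which is why zero padding hides the effect.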
Why does it only happen on the second batch? In the first batch, the loss and last hidden state are computed using the original variable values. They are fine. Because you also do the optimizer update in the sess.run(), variables get updated and become nan in the first call. In the second call, the nans from variables spread to loss and hidden state.
How can I be confident that the values beyond sequence_length are really masked out? I modified your example to reproduce the issue but also made it deterministic.
import tensorflow as tf
import numpy as np
batch_1 = np.ones((3, 10, 2))
batch_1[1, 7:] = np.nan
seq_lengths_batch_1 = [10, 7, 10]
tf.reset_default_graph()
input_vals = tf.placeholder(shape=[3, 10, 2], dtype=tf.float32)
lengths = tf.placeholder(shape=[3], dtype=tf.int32)
cell = tf.nn.rnn_cell.LSTMCell(num_units=3, initializer=tf.constant_initializer(1.0))
init_state = tf.nn.rnn_cell.LSTMStateTuple(*[tf.ones([3, c]) for c in cell.state_size])
outputs, states = tf.nn.dynamic_rnn(cell=cell, dtype=tf.float32, sequence_length=lengths, inputs=input_vals,
initial_state=init_state)
last_relevant_value = states.h
fake_loss = tf.reduce_mean(last_relevant_value)
optimizer = tf.train.AdamOptimizer(learning_rate=0.1).minimize(fake_loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1):
        _, fl, lrv = sess.run([optimizer, fake_loss, last_relevant_value],
                              feed_dict={input_vals: batch_1, lengths: seq_lengths_batch_1})
    print("VARIABLES:", sess.run(tf.trainable_variables()))
    print("LOSS and LAST HIDDEN:", fl, lrv)
If you replace the np.nan in batch_1[1, 7:] = np.nan with any finite number (e.g. try -1e6, 1e6 or 0), you will see that the values you get are the same. You can also run the loop for more iterations. As a further sanity check, if you set seq_lengths_batch_1 to something "wrong", e.g. [10, 8, 10], you can see that the value you use in batch_1[1, 7:] = np.nan now affects the output.

Randomly selecting elements from a tensor in Tensorflow

Given a tensor whose shape is Nx2, how is it possible to select k elements from this tensor akin to np.random.choice (with equal probability) ? Another point to note is that the value of N dynamically changes during execution. Meaning to say that I'm dealing with a dynamically-sized tensor.
You can just wrap np.random.choice as a tf.py_func. See for example this answer. In your case, you need to flatten your tensor so it is an array of length 2*N:
import numpy as np
import tensorflow as tf
a = tf.placeholder(tf.float32, shape=[None, 2])
size = tf.placeholder(tf.int32)
y = tf.py_func(lambda x, s: np.random.choice(x.reshape(-1),s), [a, size], tf.float32)
with tf.Session() as sess:
    print(sess.run(y, {a: np.random.rand(4,2), size:5}))
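The flattening above samples individual scalars. If "elements" is instead meant as whole rows of the Nx2 tensor (an assumption about the question, not something it states), the function wrapped by tf.py_func could sample row indices. A NumPy-only sketch of that variation, with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an N x 2 tensor; N may differ from call to call.
points = np.arange(12, dtype=np.float32).reshape(6, 2)
k = 3

# Draw k distinct row indices with equal probability, then index with them.
row_ids = rng.choice(points.shape[0], size=k, replace=False)
sampled_rows = points[row_ids]

print(sampled_rows.shape)
```

Because the index range is read from points.shape[0] at call time, the same function works when N changes between executions.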
I had a similar problem, where I wanted to subsample points from a pointcloud for an implementation of PointNet. My input dimension was [None, 2048, 3], and I was subsampling down to [None, 1024, 3] using the following custom layer:
import tensorflow as tf
from tensorflow.keras.layers import Layer
class SubSample(Layer):
    def __init__(self,num_samples):
        super(SubSample, self).__init__()
        self.num_samples=num_samples
    def build(self, input_shape):
        self.shape = input_shape #[None,2048,3]
    def call(self, inputs, training=None):
        k = tf.random.uniform([self.shape[1],]) #[2048,]
        bl = tf.argsort(k)<self.num_samples #[2048,]
        res = tf.boolean_mask(inputs, bl, axis=1) #[None,1024,3]
        # Reshape needed so that channel shape is passed when `run_eagerly=False`, otherwise it returns `None`
        return tf.reshape(res,(-1,self.num_samples,self.shape[-1])) #[None,1024,3]
SubSample(1024)(tf.random.uniform((64,2048,3))).shape
>>> TensorShape([64, 1024, 3])
As far as I can tell, this works for TensorFlow 2.5.0
Note that this isn't directly an answer to the question at hand, but the answer that I was looking for when I stumbled across this question.
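The masking step in the layer relies on the fact that argsort of random values is a random permutation of 0..n-1, so comparing it with num_samples marks exactly num_samples positions. A small NumPy sketch of just that step, with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_samples = 8, 3

# argsort of uniform random values is a random permutation of 0..n-1 ...
k = rng.uniform(size=n)
permutation = np.argsort(k)

# ... so exactly num_samples of its entries are smaller than num_samples.
mask = permutation < num_samples

# Applying the mask keeps num_samples rows, in their original order.
points = np.arange(n * 3).reshape(n, 3)
subsampled = points[mask]

print(mask.sum(), subsampled.shape)
```

Note that the same mask is applied to every item in the batch, so all items keep the same point positions within one call.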