Getting 'Dataset is empty, or contains only positive or negative samples' when using XGBoost rank:pairwise, eval_metric: auc - xgboost

When I run the XGBoost ranking demo with 2 samples per group and eval_metric=auc, it prints the warning 'Dataset is empty, or contains only positive or negative samples'.
I have tried modifying the dtarget for the training and validation groups many times and found it has no effect; the problem occurs only when I set 2 samples per group in dgroup, such as [2, 2, 2]. I don't know where the problem is.
My XGBoost params are:
xgb_rank_params1 = {
    'booster': 'gbtree',
    'eta': 0.1,
    'gamma': 1.0,
    'min_child_weight': 0.1,
    'objective': 'rank:pairwise',
    'eval_metric': 'auc',
    'max_depth': 6,
    'num_boost_round': 10,
    'save_period': 0
}
The data-preparation code is:
import numpy as np
from xgboost import DMatrix, train

n_group = 3
n_choice = 2
dtrain = np.random.uniform(0, 100, [n_group * n_choice, 2])
dtarget = [1, 0, 1, 0, 1, 0]
# **problem here: when n_choice = 2 samples are set for every group**
dgroup = np.array([n_choice for i in range(n_group)]).flatten()
# assemble the training data; the group assignment is very important here!
xgbTrain = DMatrix(dtrain, label=dtarget)
xgbTrain.set_group(dgroup)
# generate eval data
dtrain_eval = np.random.uniform(0, 100, [n_group * n_choice, 2])
xgbTrain_eval = DMatrix(dtrain_eval, label=dtarget)
xgbTrain_eval.set_group(dgroup)
evallist = [(xgbTrain, 'train'), (xgbTrain_eval, 'eval')]
rankModel = train(xgb_rank_params1, xgbTrain, num_boost_round=20, evals=evallist)
The output says:
[15:54:52] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/metric/auc.cc:330: Dataset is empty, or contains only positive or negative samples.
[0] train-auc:nan eval-auc:nan
[15:54:52] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/metric/auc.cc:330: Dataset is empty, or contains only positive or negative samples.
[15:54:52] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/metric/auc.cc:330: Dataset is empty, or contains only positive or negative samples.
[1] train-auc:nan eval-auc:nan

Related

Converting a Segmented Ground Truth to a Contour Image efficiently with Numpy

Suppose I have a segmented image as a Numpy array, where each entry in the image is a number from 1, ..., C, C+1, where C is the number of segmentation classes and class C+1 is some background class. I want to find an efficient way to convert this to a contour image (a binary image where a contour pixel has value 1 and the rest have value 0), such that any pixel that has a neighbor of a different class in its 8-neighbourhood (or 4-neighbourhood) is a contour pixel.
The inefficient way would be something like:
def isValidLocation(i, j, image_height, image_width):
    if i < 0:
        return False
    if i > image_height - 1:
        return False
    if j < 0:
        return False
    if j > image_width - 1:
        return False
    return True

def get8Neighbourhood(i, j, image_height, image_width):
    nbd = []
    for height_offset in [-1, 0, 1]:
        for width_offset in [-1, 0, 1]:
            if isValidLocation(i + height_offset, j + width_offset, image_height, image_width):
                nbd.append((i + height_offset, j + width_offset))
    return nbd

def getContourImage(seg_image):
    seg_image_height = seg_image.shape[0]
    seg_image_width = seg_image.shape[1]
    contour_image = np.zeros([seg_image_height, seg_image_width], dtype=np.uint8)
    for i in range(seg_image_height):
        for j in range(seg_image_width):
            nbd = get8Neighbourhood(i, j, seg_image_height, seg_image_width)
            for (m, n) in nbd:
                if seg_image[m][n] != seg_image[i][j]:
                    contour_image[i][j] = 1
                    break
    return contour_image
I'm looking for a more efficient, "vectorized" way of achieving this, as I need to be able to compute it at run time on batches of 8 images at a time in a deep learning context. Any insights appreciated. A visual example is below: the first image is the original image overlaid on the ground-truth segmentation mask (not the best segmentation, admittedly), and the second is the output of my code, which looks good but is far too slow: it takes about 10 seconds per image on an Intel 9900K CPU.
Image credit: the SUN RGBD dataset.
This might work but it might have some limitations which I cannot be sure of without testing on the actual data, so I'll be relying on your feedback.
import numpy as np
from scipy import ndimage
import matplotlib.pyplot as plt
# some sample data with few rectangular segments spread out
seg = np.ones((100, 100), dtype=np.int8)
seg[3:10, 3:10] = 20
seg[24:50, 40:70] = 30
seg[55:80, 62:79] = 40
seg[40:70, 10:20] = 50
plt.imshow(seg)
plt.show()
Now to find the contours, we will convolve the image with a kernel which should give 0 values when convolved within the same segment of the image and <0 or >0 values when convolved over image regions with multiple segments.
# kernel for convolving
k = np.array([[1, -1, -1],
              [1,  0, -1],
              [1,  1, -1]])
convolved = ndimage.convolve(seg, k)
# contour pixels
non_zeros = np.argwhere(convolved != 0)
plt.scatter(non_zeros[:, 1], non_zeros[:, 0], c='r', marker='.')
plt.show()
As you can see, on this sample data the kernel has a small limitation: it misses two contour pixels due to the symmetric nature of the data (which I think would be a rare case in actual segmentation outputs).
For better understanding, this is the scenario (it occurs at the top-left and bottom-right corners of the rectangles) where the kernel convolution fails to identify the contour, i.e. misses one pixel:
[ 1,  1,  1]
[ 1,  1,  1]
[ 1, 20, 20]
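As a quick sanity check (a minimal sketch, not part of the original answer), you can verify that the kernel response at the centre of this neighbourhood is exactly 0, so the pixel is not flagged even though it borders another segment:
import numpy as np
from scipy import ndimage

# the problematic neighbourhood from above
patch = np.array([[1,  1,  1],
                  [1,  1,  1],
                  [1, 20, 20]])
k = np.array([[1, -1, -1],
              [1,  0, -1],
              [1,  1, -1]])
# the response at the centre pixel sums to 0, so this contour pixel is missed
print(ndimage.convolve(patch, k)[1, 1])  # 0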
Based on sai's idea I came up with this snippet, which yielded the same result much, much faster than my original code. It runs in 0.039 seconds, which compared to roughly 8-10 seconds for the original is quite a speed-up!
import numpy as np
from scipy import ndimage

# build the 8 filters, each comparing the centre pixel to one neighbour
filters = []
for i in [0, 1, 2]:
    for j in [0, 1, 2]:
        filter = np.zeros([3, 3], dtype=int)
        if i == 1 and j == 1:
            pass
        else:
            filter[i][j] = -1
            filter[1][1] = 1
            filters.append(filter)

def getCountourImage2(seg_image):
    convolved_images = []
    for filter in filters:
        convolved_image = ndimage.correlate(seg_image, filter, mode='reflect')
        convolved_images.append(convolved_image)
    summed = np.add.reduce(convolved_images)
    seg_image = np.where(summed != 0, 255, 0)
    return seg_image
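For completeness, a minimal usage sketch (assuming the seg sample array and the matplotlib import from the snippets above):
contour = getCountourImage2(seg)
plt.imshow(contour, cmap='gray')
plt.show()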

Finding loss mask of variable length in keras tensorflow

I am trying to build a loss function that captures the functionality below, masking the output values once 'end of sequence' is encountered.
Given a tensor of shape [BatchSize, MaxSequenceLength, OutputNodes]
Consider the below example
batch size = 3
Max Sequence Length=4
OutputNodes = 3
predicted = [[[0.1, 0.3, 0.2], [0.4, 0.6, 0.8], [0.5, 0.2, 0.3], [0.0, 0.0, 0.99]],
             [[0.1, 0.3, 0.2], [0.4, 0.9, 0.8], [0.5, 0.2, 0.9], [0.4, 0.6, 0.8]],
             [[0.1, 0.3, 0.2], [0.4, 0.9, 0.8], [0.5, 0.2, 0.1], [0.4, 0.6, 0.1]]]
I am dedicating the last output node to symbolise the 'end of sequence' (EOS), here node = 2. Nodes are labelled 0, 1 and 2.
Based on the predicted value, I have to return a mask which tries to find the first occurrence of EOS.
In the above example,
first row has following sequence (argmax) => 1, 2, 0, 2
second row has following sequence => 1, 1, 2, 2
third row has following sequence => 1, 1, 0, 1
So my mask should be
[[1, 0, 0, 0],
 [1, 1, 0, 0],
 [1, 1, 1, 1]]
The mask will ensure that the values after the EOS are ignored, i.e. not considered in calculating the loss.
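For context, a masked loss of this kind would typically be applied like the following NumPy sketch (the per-timestep losses here are hypothetical, not from the question):
import numpy as np

# hypothetical per-timestep losses, shape [batch size, max sequence length]
per_step_loss = np.array([[0.5, 0.2, 0.9, 0.1],
                          [0.3, 0.4, 0.8, 0.2],
                          [0.6, 0.1, 0.2, 0.3]])
mask = np.array([[1, 0, 0, 0],
                 [1, 1, 0, 0],
                 [1, 1, 1, 1]], dtype=float)
# steps after the first EOS contribute nothing to the loss
masked_loss = (per_step_loss * mask).sum() / mask.sum()
print(masked_loss)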
Below is the code snippet I tried:
sequence_cluster_asign = keras.backend.argmax(sequence_values, axis=-1)
loss_mask = []
for seq in K.tf.unstack(sequence_cluster_asign):
    # append EOS to make sure tf.where is not empty
    seq = tf.concat([seq, endOfSequenceTensor], axis=0)
    endOfSequenceLocation = K.tf.where(K.tf.equal(seq, endOfSequence))[0][0]
    loss_mask.append(tf.sequence_mask(endOfSequenceLocation, max_decoder_seq_length, dtype=tf.float32))
final_mask = K.stack(loss_mask)
Error encountered : ValueError: Cannot infer num from shape (?,?)
If you want to get the mask described in your question, you can use the following method.
import tensorflow as tf
import keras
from keras import backend as K

sequence_values = K.placeholder(shape=(None, 4, 3))
sequence_cluster_asign = keras.backend.argmax(sequence_values, axis=-1)

# keras version
result = K.cast(K.less(sequence_cluster_asign, sequence_values.get_shape().as_list()[-1] - 1), dtype='int32')
result = K.cumprod(result, axis=-1)

# tensorflow version
# result = tf.cast(tf.less(sequence_cluster_asign, sequence_values.get_shape().as_list()[-1] - 1), dtype=tf.int32)
# result = tf.cumprod(result, axis=-1)

predicted = [[[0.1, 0.3, 0.2], [0.4, 0.6, 0.8], [0.5, 0.2, 0.3], [0.0, 0.0, 0.99]],
             [[0.1, 0.3, 0.2], [0.4, 0.9, 0.8], [0.5, 0.2, 0.9], [0.4, 0.6, 0.8]],
             [[0.1, 0.3, 0.2], [0.4, 0.9, 0.8], [0.5, 0.2, 0.1], [0.4, 0.6, 0.1]]]

with tf.Session() as sess:
    print(result.eval(feed_dict={sequence_values: predicted}))
[[1 0 0 0]
 [1 1 0 0]
 [1 1 1 1]]
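The trick is that K.less marks every timestep whose argmax is not the EOS node (index 2) with 1, and the running product along the time axis (cumprod) then zeroes out everything from the first EOS onwards. The same logic can be sanity-checked in plain NumPy (a sketch, not part of the original answer):
import numpy as np

argmax = np.array([[1, 2, 0, 2],
                   [1, 1, 2, 2],
                   [1, 1, 0, 1]])
not_eos = (argmax < 2).astype(np.int32)  # 1 where the step is not the EOS node
mask = np.cumprod(not_eos, axis=-1)      # zeros from the first EOS onwards
print(mask)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 1]]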

Sketch_RNN, ValueError: Cannot feed value of shape

I get the following error:
ValueError: Cannot feed value of shape (1, 251, 5) for Tensor u'vector_rnn_1/Placeholder_1:0', which has shape '(1, 117, 5)'
when running code from here
https://github.com/tensorflow/magenta-demos/blob/master/jupyter-notebooks/Sketch_RNN.ipynb
The error occurs in this method:
def encode(input_strokes):
    strokes = to_big_strokes(input_strokes).tolist()
    strokes.insert(0, [0, 0, 1, 0, 0])
    seq_len = [len(input_strokes)]
    draw_strokes(to_normal_strokes(np.array(strokes)))
    return sess.run(eval_model.batch_z, feed_dict={eval_model.input_data: [strokes], eval_model.sequence_lengths: seq_len})[0]
I have to mention I trained my own model following the instructions here:
https://github.com/tensorflow/magenta/tree/master/magenta/models/sketch_rnn
Can someone help me understand and solve this issue?
Thanks
In my case, the problem was caused by the to_big_strokes() function. If you do not modify to_big_strokes() in sketch_rnn/utils.py, it will by default pad the input_strokes sequence to a length of 250.
All you need to do is modify the max_len parameter in that function. Change that value to the maximum sequence length of your own dataset (21 for me), as in the line marked with "change" below.
def to_big_strokes(stroke, max_len=21):  # change: 250 -> 21
    """Converts from stroke-3 to stroke-5 format and pads to given length."""
    # (But does not insert special start token).
    result = np.zeros((max_len, 5), dtype=float)
    l = len(stroke)
    assert l <= max_len
    result[0:l, 0:2] = stroke[:, 0:2]
    result[0:l, 3] = stroke[:, 2]
    result[0:l, 2] = 1 - result[0:l, 3]
    result[l:, 4] = 1
    return result
The problem was that the strokes size did not match the array size expected by the model, so adapting the strokes array fixed the issue.
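If you are unsure what value to use, a small sketch like this computes the maximum sequence length of your own dataset (train_strokes is a hypothetical list of stroke-3 arrays, as produced by the sketch_rnn data-loading code):
max_len = max(len(stroke) for stroke in train_strokes)
print(max_len)  # pass this as max_len to to_big_strokes()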

Given a dataframe with N elements, how can I make m smaller dataframes such that the size of each is some fraction of N?

I have a dataset (call it Data) with ~25000 instances that I want to split into a train set, development set, and test set. I want it to be such that,
train set = 0.7*Data
development set = 0.1*Data
test set = 0.2*Data
When making the split, I want the instances to be randomly sampled and NOT REPEATED between the 3 sets. This is why I can't use something like,
train_set = Data.sample(frac=0.7)
dev_set = Data.sample(frac=0.1)
test_set = Data.sample(frac=0.2)
where instances from Data may be repeated across the sets. Is there a built-in function that I am missing, or could you help me write a function for doing this?
I will use an array to demonstrate an example of what I am looking for.
A = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
splits = [0.7, 0.1, 0.2]
def splitFunction(data, array_of_splits):
    # I need your help here
splits = splitFunction(A, splits)
#output
[[1, 3, 8, 9, 6, 7, 2], [4], [5, 0]]
Thank you in advance!
from random import shuffle

def splitFunction(data, array_of_splits):
    data_copy = data[:]  # copy the data if you don't want to change the original array
    shuffle(data_copy)   # randomises the data
    splits = []
    startIndex = 0
    for val in array_of_splits:
        endIndex = startIndex + int(val * len(data))
        split = data_copy[startIndex:endIndex]
        startIndex = endIndex
        splits.append(split)
    return splits
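Since the question is about DataFrames rather than plain lists, a pandas equivalent could look like the following sketch (split_dataframe is a hypothetical helper built on df.sample and np.split, not part of the original answer):
import numpy as np

def split_dataframe(df, fractions=(0.7, 0.1, 0.2), seed=0):
    # shuffle once, then cut at the cumulative fraction boundaries
    shuffled = df.sample(frac=1, random_state=seed)
    boundaries = (np.cumsum(fractions)[:-1] * len(df)).astype(int)
    return np.split(shuffled, boundaries)

# usage, assuming the Data DataFrame from the question:
# train_set, dev_set, test_set = split_dataframe(Data)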

Tensorflow: When using slim.dataset.Dataset, is there a way to map label ID values to other values?

dataset = slim.dataset.Dataset(...)
provider = slim.dataset_data_provider.DatasetDataProvider(dataset, ...)
image, labels = provider.get(['image', 'label'])
Let's say, for an example in a dataset A, the labels could be [1, 2, 1, 3]. However, for some reason (e.g. due to a dataset B), I would like to map the label IDs to other values. The mapping could be like below.
# {old_label: target_label}
mapping = {0: 0, 1: 2, 2: 2, 3: 2, 4: 2, 5: 3, 6: 1}
For now, I can think of two ways:
-- tf.data.Dataset seems to have a map(map_func) function that every example passes through, which could be the solution. However, I am more familiar with slim.dataset.Dataset. Is there a similar trick for slim.dataset.Dataset?
-- I was wondering if I can simply apply some mapping function to a label tensor, such as:
new_labels = tf.map_fn(lambda x: x+1, labels, dtype=tf.int32)
# labels = [1 2 1 3] --> new_labels = [2 3 2 4]. This works.
new_labels = tf.map_fn(lambda x: mapping[x], labels, dtype=tf.int32)
# I wished but this does not work!
However, the second version, which is what I need, does not work. Could anyone please advise?
I think you can try tf.contrib.lookup:
keys = list(mapping.keys())
values = [mapping[k] for k in keys]
table = tf.contrib.lookup.HashTable(
    tf.contrib.lookup.KeyValueTensorInitializer(keys, values, key_dtype=tf.int64, value_dtype=tf.int64), -1
)
new_labels = table.lookup(labels)

sess = tf.Session()
sess.run(table.init)
print(sess.run(new_labels))
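As a quick self-contained check (a sketch with hard-coded labels, using the same TF 1.x contrib API as above), the mapping sends labels [1, 2, 1, 3] to [2, 2, 2, 2]:
import tensorflow as tf

mapping = {0: 0, 1: 2, 2: 2, 3: 2, 4: 2, 5: 3, 6: 1}
labels = tf.constant([1, 2, 1, 3], dtype=tf.int64)
keys = list(mapping.keys())
values = [mapping[k] for k in keys]
table = tf.contrib.lookup.HashTable(
    tf.contrib.lookup.KeyValueTensorInitializer(keys, values, key_dtype=tf.int64, value_dtype=tf.int64), -1
)
new_labels = table.lookup(labels)
with tf.Session() as sess:
    sess.run(table.init)
    print(sess.run(new_labels))  # [2 2 2 2]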