HDBSCAN inference labels not equal to labels generated during train period - hierarchical-clustering

I am trying to use HDBSCAN to generate clusters. I also want to save the model and use it to assign cluster labels to new data.
The problem is that whenever I apply HDBSCAN to data larger than 9754 rows,
the inferred labels do not match the original ones. As a sanity check, I ran inference on the training data,
so the data used to fit HDBSCAN is the same as the inference data.
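import hdbscan
import numpy as np
# df_emb, min_cluster_size and cluster_selection_epsilon are defined elsewhere.
emb_size = 10000
val_size = 10
emb = np.array(df_emb.iloc[:emb_size, 1:-1])
emb_val = np.array(df_emb.iloc[:val_size, 1:-1])  # first rows of the training data
clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                            cluster_selection_epsilon=cluster_selection_epsilon,
                            min_samples=1, gen_min_span_tree=True, prediction_data=True)
clusterer.fit(emb)
labels = clusterer.labels_
score = clusterer.relative_validity_
# approximate_predict returns a (labels, strengths) tuple, hence label_val[0] below.
label_val = hdbscan.approximate_predict(clusterer, emb_val)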
Here is my code. I expected label_val[0] == labels[:10], but this is what I actually got:
labels[:10] = array([-1, -1, 8, 15, -1, -1, 41, -1, 6, 44])
label_val[0] = array([-1, -1, 9, 16, -1, -1, 42, -1, 7, 45])
So the same data yields different labels depending on whether the labels are generated during training or during inference. What am I missing here?
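In the output above, every non-noise label in label_val[0] is exactly one higher than the matching entry in labels[:10], so the two results may describe the same clusters under different ids. A minimal sketch for checking that, assuming plain Python lists of labels and -1 for noise; label_mapping is a hypothetical helper, not part of hdbscan, and it does not explain why the ids would shift:

```python
def label_mapping(train_labels, predicted_labels, noise=-1):
    """Return {train_label: predicted_label} if the two labelings agree up to
    a one-to-one renaming of cluster ids, otherwise None.
    Noise points must be noise in both labelings."""
    forward, backward = {}, {}
    for a, b in zip(train_labels, predicted_labels):
        if (a == noise) != (b == noise):
            return None  # one side calls the point noise, the other does not
        if a == noise:
            continue
        # setdefault stores the first pairing seen and returns it afterwards,
        # so any later conflicting pairing is detected here.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return None
    return forward


# The ten values printed in the question.
labels_head = [-1, -1, 8, 15, -1, -1, 41, -1, 6, 44]
label_val_head = [-1, -1, 9, 16, -1, -1, 42, -1, 7, 45]

mapping = label_mapping(labels_head, label_val_head)
print(mapping)
```

If the function returns a mapping for the full arrays, the partition is the same and only the numbering differs; if it returns None, some points really were assigned to different clusters.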

Related

Keras Sequential with multiple inputs

Given 3 arrays as input to the network, it should learn what links the data in the 1st, 2nd, and 3rd arrays.
In particular:
1st array contains integer numbers (e.g. 2, 3, 5, 6, 7)
2nd array contains integer numbers (e.g. 3, 2, 4, 6, 2)
3rd array contains integer numbers that are the results of an operation between the data in the 1st and 2nd arrays (e.g. 6, 6, 20, 36, 14).
As the example data above shows, the operation is a multiplication, so the network should learn this, giving:
model.predict(11,2) = 22.
Here's the code I've used:
import logging
import numpy as np
import tensorflow as tf
primo = np.array([2, 3, 5, 6, 7])
secondo = np.array([3, 2, 4, 6, 2])
risu = np.array([6, 6, 20, 36, 14])
l0 = tf.keras.layers.Dense(units=1, input_shape=[1])
model = tf.keras.Sequential([l0])
input1 = tf.keras.layers.Input(shape=(1, ), name="Pri")
input2 = tf.keras.layers.Input(shape=(1, ), name="Sec")
merged = tf.keras.layers.Concatenate(axis=1)([input1, input2])
dense1 = tf.keras.layers.Dense(
    2,
    input_dim=2,
    activation=tf.keras.activations.sigmoid,
    use_bias=True)(merged)
output = tf.keras.layers.Dense(
    1,
    activation=tf.keras.activations.relu,
    use_bias=True)(dense1)
model = tf.keras.models.Model([input1, input2], output)
model.compile(
    loss="mean_squared_error",
    optimizer=tf.keras.optimizers.Adam(0.1))
model.fit([primo, secondo], risu, epochs=500, verbose = False, batch_size=16)
print(model.predict(11, 2))
My questions are:
Is it correct to concatenate the 2 inputs as I did? I don't understand whether, concatenated this way, the network knows that input1 and input2 are 2 different pieces of data.
I'm not able to get model.predict() working; every attempt results in an error.
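Your model has two inputs, each with shape (None, 1), so model.predict needs a list of two arrays, one per input, each with a batch dimension. You can add that dimension with np.expand_dims: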
print(model.predict([np.expand_dims(np.array(11), 0), np.expand_dims(np.array(2), 0)]))
Output:
[[20.316557]]
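For more than one sample at a time, it may be easier to reshape each input into a column explicitly. A small sketch, assuming NumPy is available; as_column is a hypothetical helper name, and the predict call is left as a comment because it needs the trained model from the question:

```python
import numpy as np


def as_column(values):
    """Return values as a float32 array of shape (n_samples, 1)."""
    return np.asarray(values, dtype="float32").reshape(-1, 1)


# One sample per input, matching Input(shape=(1,)) for "Pri" and "Sec".
primo_batch = as_column([11])
secondo_batch = as_column([2])
print(primo_batch.shape, secondo_batch.shape)

# With the model from the question, the call would then be:
# model.predict([primo_batch, secondo_batch])
```

The same helper works for several samples, for example as_column([2, 3, 5]) for the first input and as_column([3, 2, 4]) for the second.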

Getting 'Dataset is empty, or contains only positive or negative samples' when using Xgboost rank:pairwise, eval_metric: auc

When I run the xgboost rank demo with 2 samples in every group and eval_metric=auc, it shows the warning 'Dataset is empty, or contains only positive or negative samples'.
I have tried many times to modify dtarget for the training and validation groups, and it has no effect. The problem occurs only when I set 2 samples for every group in dgroup, such as [2,2,2]. I don't know where the problem is.
My xgboost params are:
xgb_rank_params1 = {
    'booster': 'gbtree',
    'eta': 0.1,
    'gamma': 1.0,
    'min_child_weight': 0.1,
    'objective': 'rank:pairwise',
    'eval_metric': 'auc',
    'max_depth': 6,
    'num_boost_round': 10,
    'save_period': 0
}
The data preparation code is:
import numpy as np
from xgboost import DMatrix, train

n_group = 3
n_choice = 2
dtrain = np.random.uniform(0, 100, [n_group * n_choice, 2])
dtarget = [1, 0, 1, 0, 1, 0]
# **problem here: occurs when n_choice = 2 samples are set for every group**
dgroup = np.array([n_choice for i in range(n_group)]).flatten()
# build the training DMatrix; very important here!
xgbTrain = DMatrix(dtrain, label=dtarget)
xgbTrain.set_group(dgroup)
# generate eval data
dtrain_eval = np.random.uniform(0, 100, [n_group * n_choice, 2])
xgbTrain_eval = DMatrix(dtrain_eval, label=dtarget)
xgbTrain_eval.set_group(dgroup)
evallist = [(xgbTrain, 'train'), (xgbTrain_eval, 'eval')]
rankModel = train(xgb_rank_params1, xgbTrain, num_boost_round=20, evals=evallist)
The output is:
[15:54:52] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/metric/auc.cc:330: Dataset is empty, or contains only positive or negative samples.
[0] train-auc:nan eval-auc:nan
[15:54:52] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/metric/auc.cc:330: Dataset is empty, or contains only positive or negative samples.
[15:54:52] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/metric/auc.cc:330: Dataset is empty, or contains only positive or negative samples.
[1] train-auc:nan eval-auc:nan
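For reference, the wording of the warning matches the general definition of AUC, which compares positive samples against negative ones and has nothing to compare when one of the two classes is missing. The sketch below is a generic pairwise AUC in plain Python, not XGBoost's ranking implementation; pairwise_auc is a hypothetical name, and how XGBoost handles small groups internally is not shown in the question:

```python
import math


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a win. Returns NaN when no such pair exists."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    n_pairs = len(positives) * len(negatives)
    if n_pairs == 0:
        return float("nan")  # only positives, only negatives, or empty input
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / n_pairs


mixed = pairwise_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])  # 3 of 4 pairs correct
one_class = pairwise_auc([1, 1], [0.2, 0.8])              # no negative sample
single_pair = pairwise_auc([1, 0], [0.7, 0.3])            # a group of 2 has one pair
print(mixed, one_class, single_pair)
```

Note that a group with one positive and one negative sample contains a single pair, so an AUC computed on that group alone can only be 0, 0.5 or 1.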

Sketch_RNN , ValueError: Cannot feed value of shape

I get the following error:
ValueError: Cannot feed value of shape (1, 251, 5) for Tensor u'vector_rnn_1/Placeholder_1:0', which has shape '(1, 117, 5)'
when running code from here
https://github.com/tensorflow/magenta-demos/blob/master/jupyter-notebooks/Sketch_RNN.ipynb
The error occurs in this method:
def encode(input_strokes):
    strokes = to_big_strokes(input_strokes).tolist()
    strokes.insert(0, [0, 0, 1, 0, 0])
    seq_len = [len(input_strokes)]
    draw_strokes(to_normal_strokes(np.array(strokes)))
    return sess.run(eval_model.batch_z, feed_dict={eval_model.input_data: [strokes], eval_model.sequence_lengths: seq_len})[0]
I have to mention I trained my own model following the instructions here:
https://github.com/tensorflow/magenta/tree/master/magenta/models/sketch_rnn
Can someone help me understand and solve this issue?
Thanks
Regards
In my case, the problem was caused by the to_big_strokes() function. If you do not modify to_big_strokes() in sketch_rnn/utils.py, it pads the input_strokes sequence to a length of 250 by default.
All you need to do is change the max_len parameter of that function to the maximum sequence length of your own dataset, which is 21 for me, as shown in the line marked "change" below.
def to_big_strokes(stroke, max_len=21):  # change: 250 -> 21
    """Converts from stroke-3 to stroke-5 format and pads to given length."""
    # (But does not insert special start token).
    result = np.zeros((max_len, 5), dtype=float)
    l = len(stroke)
    assert l <= max_len
    result[0:l, 0:2] = stroke[:, 0:2]
    result[0:l, 3] = stroke[:, 2]
    result[0:l, 2] = 1 - result[0:l, 3]
    result[l:, 4] = 1
    return result
The problem was that the size of the strokes array did not match the array size expected by the algorithm.
Adapting the strokes array to that size fixed the issue.
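The numbers in the error message fit this explanation: the default padding of 250 plus the one start token inserted by encode() gives 251, while the placeholder expects 117. A dependency-free sketch of the same padding logic, assuming the placeholder length equals the padded length plus one start token; pad_to_big_strokes and encoder_input are hypothetical names that mirror, but do not replace, the functions in sketch_rnn/utils.py:

```python
def pad_to_big_strokes(stroke3, max_len):
    """Plain-list analogue of to_big_strokes: stroke-3 to stroke-5, padded."""
    if len(stroke3) > max_len:
        raise ValueError("sequence is longer than max_len")
    # Each real point becomes [dx, dy, pen_down, pen_lifted, end_of_sketch].
    rows = [[dx, dy, 1 - lifted, lifted, 0] for dx, dy, lifted in stroke3]
    # Padding rows only set the end-of-sketch flag.
    rows.extend([0, 0, 0, 0, 1] for _ in range(max_len - len(stroke3)))
    return rows


def encoder_input(stroke3, placeholder_len):
    """Pad so that, after the start token is added, the length matches
    the length the model's input placeholder expects (an assumption)."""
    rows = pad_to_big_strokes(stroke3, placeholder_len - 1)
    rows.insert(0, [0, 0, 1, 0, 0])  # same start token as in encode()
    return rows


sample = [(5, 0, 0), (0, 5, 0), (-5, 0, 1)]  # made-up three-point sketch

default_len = len(pad_to_big_strokes(sample, 250)) + 1  # length fed with the default
fixed_len = len(encoder_input(sample, 117))             # length the placeholder wants
print(default_len, fixed_len)
```

Under that assumption, the model in the question would need max_len=116 rather than 250, in the same way the answer above needed 21.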

Does `tf.data.Dataset.repeat()` buffer the entire dataset in memory?

Looking at this code example from the TF documentation:
filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
Does the dataset.repeat(num_epochs) require that the entire dataset be loaded into memory? Or is it re-initializing the dataset(s) that came before it when it receives an end-of-dataset exception?
The documentation is ambiguous about this point.
Based on this simple test, it appears that repeat does not buffer the dataset; it must be re-initializing the upstream datasets.
n = tf.data.Dataset.range(5).shuffle(buffer_size=5).repeat(2).make_one_shot_iterator().get_next()
[sess.run(n) for _ in range(10)]
Out[83]: [2, 0, 3, 1, 4, 3, 1, 0, 2, 4]
Logic suggests that if repeat were buffering its input, the same random shuffle pattern would have been repeated in this simple experiment.
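The behaviour inferred from the test can be pictured with ordinary Python generators. This is only an analogy in plain Python, not TensorFlow code, and shuffled_range and repeat are hypothetical names: repeat asks the upstream source for a fresh iterator on every epoch instead of storing the elements it has seen.

```python
import random


def shuffled_range(n, rng):
    """Upstream 'dataset': yields 0..n-1 in a newly drawn random order."""
    items = list(range(n))
    rng.shuffle(items)
    yield from items


def repeat(make_iterator, count):
    """Re-create the upstream iterator for each epoch; nothing is cached here."""
    for _ in range(count):
        yield from make_iterator()


rng = random.Random(0)
values = list(repeat(lambda: shuffled_range(5, rng), 2))
first_epoch, second_epoch = values[:5], values[5:]
print(first_epoch, second_epoch)
```

Each epoch contains every element exactly once, and because the shuffle runs again for every epoch, the two orders are drawn independently, which is the pattern seen in the Out[83] result above.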

TensorFlow: How to get Intermediate value of a variable in tf.while_loop()?

I need to fetch the intermediate values of a tensor in tf.while_loop(), but it only gives me the last returned value.
For example, I have a tensor x with 3 pages and shape 3*2*4. I want to fetch one page at a time and calculate the running total sum, plus the sum, mean, max and min of each page. I defined the condition and body functions and want to use tf.while_loop() to calculate these results. The source code is below.
import tensorflow as tf
x = tf.constant([[[41, 8, 48, 82],
                  [9, 56, 67, 23]],
                 [[95, 89, 44, 54],
                  [11, 33, 29, 1]],
                 [[34, 9, 5, 70],
                  [14, 35, 18, 17]]], dtype=tf.int32)
def cond(out, count, x):
    return count < 3
def body(out, count, x):
    outTemp = tf.slice(x, [count, 0, 0], [1, -1, -1])
    count += 1
    outPack = tf.unpack(out)
    outPack[0] += tf.reduce_sum(outTemp)
    outPack[1] = tf.reduce_sum(outTemp)
    outPack[2] = tf.reduce_mean(outTemp)
    outPack[3] = tf.reduce_max(outTemp)
    outPack[4] = tf.reduce_min(outTemp)
    out = tf.pack(outPack)
    return out, count, x
out = tf.Variable(tf.constant([0, 0, 0, 0, 0]))  # total sum, page sum, mean, max, min
count = tf.Variable(tf.constant(0))
result = tf.while_loop(cond, body, [out, count, x])
init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run(x))
    print(sess.run(result)[0])
When I run the program, it only gives me the value returned by the last iteration, so I only get the results for the last page.
So the question is: how can I get the results for each page, and how can I get intermediate values from tf.while_loop()?
Thank you.
To get the "intermediate value" of any variable, you can use the tf.Print op, which is an identity operation with the side effect of printing a message whenever that variable is evaluated.
As an example,
x = tf.Print(x, [x], "Value of x is: ")
can be placed on any line where you want the value to be reported.
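When checking what is printed on each iteration, it can help to have the expected per-page values at hand. The following is a plain-Python reference for the data in the question, not TensorFlow code; per_page_stats is a hypothetical helper, and the integer division is an assumption about how a mean over int32 values would be reported:

```python
pages = [
    [[41, 8, 48, 82], [9, 56, 67, 23]],
    [[95, 89, 44, 54], [11, 33, 29, 1]],
    [[34, 9, 5, 70], [14, 35, 18, 17]],
]


def per_page_stats(pages):
    """Return one (total_sum, page_sum, mean, max, min) tuple per page,
    where total_sum is the running sum over all pages seen so far."""
    total = 0
    history = []
    for page in pages:
        flat = [value for row in page for value in row]
        page_sum = sum(flat)
        total += page_sum
        # Assumption: integer division, as for a mean over integer values.
        mean = page_sum // len(flat)
        history.append((total, page_sum, mean, max(flat), min(flat)))
    return history


history = per_page_stats(pages)
for step, stats in enumerate(history):
    print("page", step, stats)
```

The last tuple corresponds to the single result the loop in the question returns, and the earlier tuples are the intermediate values that the question asks for.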