Implementation of Hierarchical Sampling for Active Learning - numpy

I am working on a personal machine learning project where I am attempting to classify data into two classes that are extremely imbalanced. I am initially trying to implement the approach proposed in Hierarchical Sampling for Active Learning by S. Dasgupta, which exploits the cluster structure of the dataset to aid the active learner.
However, I am struggling to implement the algorithm proposed in the paper. This is what I have written so far, but I am unsure how to continue:
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

data_dist = pdist(X)           # pairwise distances between the rows of X
data_link = linkage(data_dist) # hierarchical (agglomerative) linkage matrix
The data is stored in X and the correct classification in y. Sample dataset:
X = np.array([[0.3,0.7],[0.5,0.5] ,[0.2,0.8], [0.1,0.9]])
y = np.array([[0], [1], [0], [1]])
(Note the actual dataset is approximately 500 times larger)

Hierarchical Sampling for Active Learning as proposed by S. Dasgupta is now implemented in libact, a Python active learning library. For the source code, see this link.
Example (from the documentation):
from libact.query_strategies import UncertaintySampling
from libact.query_strategies.multiclass import HierarchicalSampling

sub_qs = UncertaintySampling(
    dataset, method='sm', model=SVM(decision_function_shape='ovr'))

qs = HierarchicalSampling(
    dataset,  # Dataset object
    dataset.get_num_of_labels(),
    active_selecting=True,
    subsample_qs=sub_qs
)
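To connect this to the X and y arrays from the question, here is a minimal sketch (untested, and assuming libact's Dataset / make_query / update interface; the number of initially labelled points and the class list passed to HierarchicalSampling are illustrative):
import numpy as np
from libact.base.dataset import Dataset
from libact.query_strategies.multiclass import HierarchicalSampling

X = np.array([[0.3, 0.7], [0.5, 0.5], [0.2, 0.8], [0.1, 0.9]])
y = np.array([0, 1, 0, 1])

# Mark most entries as unlabelled (None); the active learner will ask for them.
n_initial = 2
y_partial = [int(label) if i < n_initial else None for i, label in enumerate(y)]
dataset = Dataset(X, y_partial)

qs = HierarchicalSampling(
    dataset,
    [0, 1],              # assumed here: the list of distinct class labels
    active_selecting=True,
)

# Active-learning loop: query the most informative point, then reveal its label.
for _ in range(2):
    ask_id = qs.make_query()
    dataset.update(ask_id, int(y[ask_id]))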

Related

H2O | Extended Isolation Forest | model.explain() gives KeyError: 'response_column'

I have been struggling with this error for a few hours now, but seem lost even after reading through the documentation.
I'm using H2O's Extended Isolation Forest (EIF), an unsupervised model, to detect anomalies in an unlabelled dataset. It is working as intended, but for the project I'm working on, model explainability is extremely important. I discovered the explain() function, which supposedly returns several explainability methods for a model. I'm particularly interested in the SHAP values from this function.
The documentation states
The main functions, h2o.explain() (global explanation) and h2o.explain_row() (local explanation) work for individual H2O models, as well as a list of models or an H2O AutoML object. The h2o.explain() function generates a list of explanations.
Since the H2O models link brings me to a page which covers both supervised and unsupervised models, I assumed the explain function would work for both types of models.
The following code runs just fine:
import h2o
from h2o.estimators import H2OExtendedIsolationForestEstimator
h2o.init()
df_EIF = h2o.H2OFrame(df_EIF)
predictors = df_EIF.columns[0:37]
eif = H2OExtendedIsolationForestEstimator(ntrees=75, sample_size=500, extension_level=len(predictors) - 1)
eif.train(x=predictors, training_frame=df_EIF)
eif_result = eif.predict(df_EIF)
df_EIF['anomaly_score_EIF'] = eif_result['anomaly_score']
However, when trying to call explain() on the model (eif):
eif.explain(df_EIF)
I get the following KeyError:
KeyError Traceback (most recent call last)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.py in <module>
----> 1 eif.explain(df_EIF)
2
3
4
5
C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in explain(models, frame, columns, top_n_features, include_explanations, exclude_explanations, plot_overrides, figsize, render, qualitative_colormap, sequential_colormap)
2895 plt = get_matplotlib_pyplot(False, raise_if_not_available=True)
2896 (is_aml, models_to_show, classification, multinomial_classification, multiple_models, targets,
-> 2897 tree_models_to_show, models_with_varimp) = _process_models_input(models, frame)
2898
2899 if top_n_features < 0:
C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in _process_models_input(models, frame)
2802 models_with_varimp = [model for model in models if _has_varimp(model)]
2803 tree_models_to_show = _get_tree_models(models, 1 if is_aml else float("inf"))
-> 2804 y = _get_xy(models_to_show[0])[1]
2805 classification = frame[y].isfactor()[0]
2806 multinomial_classification = classification and frame[y].nlevels()[0] > 2
C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in _get_xy(model)
1790 """
1791 names = model._model_json["output"]["original_names"] or model._model_json["output"]["names"]
-> 1792 y = model.actual_params["response_column"]
1793 not_x = [
1794 y,
KeyError: 'response_column'
From my understanding, this response column refers to the column you are trying to predict. However, since I'm dealing with an unlabelled dataset, this response column doesn't exist. Is there a way for me to bypass this error? Is it even possible to use the explain() function on unsupervised models? If so, how do I do this? If it is not possible, is there another way to extract the SHAP values of each variable from the model? shap.TreeExplainer also doesn't seem to work on an H2O model.
TL;DR: Is it possible to use the .explain() function from H2O on an (Extended) Isolation Forest? If so, how?
Unfortunately, the explain method in H2O-3 is supported only for supervised algorithms.
What you can do is fit a surrogate model and look at the explanations for that.
Basically, you'd fit a GBM (or DRF, as those two models support TreeSHAP) on the data, using the prediction of the Extended Isolation Forest as the response.
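A rough sketch of that surrogate idea, reusing the frame and column names from the question (the ntrees value is arbitrary, and the exact calls may need adjusting to your H2O version):
from h2o.estimators import H2OGradientBoostingEstimator

# Fit a supervised surrogate: the EIF anomaly score becomes the response.
gbm = H2OGradientBoostingEstimator(ntrees=100)
gbm.train(x=predictors, y='anomaly_score_EIF', training_frame=df_EIF)

# explain() works on the supervised surrogate ...
gbm.explain(df_EIF)

# ... and TreeSHAP contributions can be pulled directly from the GBM.
contributions = gbm.predict_contributions(df_EIF)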
Here is another approach to explaining the predictions of an (E)IF: https://github.com/h2oai/h2o-tutorials/blob/master/tutorials/isolation-forest/interpreting_isolation-forest.ipynb

How can I find the optimal number of topics in LDA with scikit-learn?

I'm computing topic models through scikit-learn with this script (I start from a dataframe "df" which has one document per row in the column "Text"):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Applying LDA
# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=int(0.9*len(df)), min_df=int(0.01*len(df)), token_pattern=r'\w+|\$[\d\.]+|\S+')
# apply transformation
tf = vectorizer.fit_transform(df.Text).toarray()
# tf_feature_names tells us what word each column in the matrix represents
tf_feature_names = vectorizer.get_feature_names()

number_of_topics = 6
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)
I'm interested in comparing models with different numbers of topics (roughly from 2 to 20) through a coherence measure. How can I do it?
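scikit-learn doesn't ship a topic-coherence metric, so one either computes coherence externally (e.g. with gensim's CoherenceModel) or falls back on what LatentDirichletAllocation provides out of the box. A rough sketch of the latter, reusing the tf matrix from above and comparing held-out perplexity across topic counts (the split and range are illustrative; lower perplexity is better):
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

tf_train, tf_test = train_test_split(tf, test_size=0.2, random_state=0)

perplexities = {}
for k in range(2, 21):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(tf_train)
    perplexities[k] = lda.perplexity(tf_test)  # lower = better fit on held-out docs

best_k = min(perplexities, key=perplexities.get)
print(best_k, perplexities[best_k])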

One Hot Encoding of large dataset

I want to build a recommendation system using association rules, with the apriori algorithm implemented in the mlxtend library. My sales data contains about 36 million transactions and 50k unique products.
I tried sklearn's OneHotEncoder and pandas get_dummies(), but both give an out-of-memory error because they cannot create a frame of shape (36 million, 50k):
MemoryError: Unable to allocate 398. GiB for an array with shape (36113798, 50087) and data type uint8
Is there any other solution?
Like you, I too had an out-of-memory error with mlxtend at first, but the following small changes fixed the problem completely.
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
te = TransactionEncoder()
# te_ary = te.fit(itemSetList).transform(itemSetList)   # dense version (runs out of memory)
# df = pd.DataFrame(te_ary, columns=te.columns_)
fitted = te.fit(itemSetList)
te_ary = fitted.transform(itemSetList, sparse=True)                   # sparse output keeps memory usage low
df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)   # sparse DataFrame
# now you can call mlxtend's fpgrowth() followed by association_rules()
You should also use fpgrowth instead of apriori on big transaction datasets: apriori explicitly generates candidate itemsets, which becomes very expensive, while fpgrowth builds an FP-tree and finds the same frequent itemsets far more efficiently. The mlxtend library supports both apriori and fpgrowth.
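For reference, a minimal sketch of calling fpgrowth on the sparse frame built above (the min_support and confidence thresholds are placeholders you would tune for your data):
from mlxtend.frequent_patterns import fpgrowth, association_rules

frequent_itemsets = fpgrowth(df, min_support=0.001, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)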
I think a good solution would be to use embeddings instead of one-hot encoding for your problem. In addition, I recommend that you split your dataset into smaller subsets to further avoid the memory consumption problems.
You should also consult this thread : https://datascience.stackexchange.com/questions/29851/one-hot-encoding-vs-word-embeding-when-to-choose-one-or-another

Why does keras (SGD) optimizer.minimize() not reach global minimum in this example?

I'm in the process of completing a TensorFlow tutorial via DataCamp and am transcribing/replicating the code examples I am working through in my own Jupyter notebook.
Here are the original instructions from the coding problem:
I'm running the following snippet of code but am not able to arrive at the same result generated within the tutorial, which I have confirmed are the correct values via a connected scatterplot of x vs. loss_function(x) (shown a bit further below).
# imports
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import Variable, keras

def loss_function(x):
    import math
    return 4.0*math.cos(x-1)+np.divide(math.cos(2.0*math.pi*x),x)

# Initialize x_1 and x_2
x_1 = Variable(6.0, np.float32)
x_2 = Variable(0.3, np.float32)

# Define the optimization operation
opt = keras.optimizers.SGD(learning_rate=0.01)

for j in range(100):
    # Perform minimization using the loss function and x_1
    opt.minimize(lambda: loss_function(x_1), var_list=[x_1])
    # Perform minimization using the loss function and x_2
    opt.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())
I drew a quick connected scatterplot to confirm (successfully) that the loss function I'm using reproduces the graph provided by the example (seen in the screenshot above):
# Generate loss_function(x) values for given range of x-values
losses = []
for p in np.linspace(0.1, 6.0, 60):
    losses.append(loss_function(p))

# Define x,y coordinates
x_coordinates = list(np.linspace(0.1, 6.0, 60))
y_coordinates = losses

# Plot
plt.scatter(x_coordinates, y_coordinates)
plt.plot(x_coordinates, y_coordinates)
plt.title('Plot of Input values (x) vs. Losses')
plt.xlabel('x')
plt.ylabel('loss_function(x)')
plt.show()
Here are the resulting global and local minima, respectively, as per the DataCamp environment:
4.38 is the correct global minimum, and 0.42 indeed corresponds to the first local minimum on the graph's right-hand side (when starting from x_2 = 0.3).
And here are the results from my environment, both of which move in the opposite direction from where they should be heading when minimizing the loss value:
I've spent the better part of the last 90 minutes trying to sort out why my results disagree with those of the DataCamp console, and why the optimizer fails to minimize this loss in this simple toy example.
I appreciate any suggestions that you might have after you've run the provided code in your own environments, many thanks in advance!!!
As it turned out, the difference in outputs arose from the default precision of tf.divide() (vs. np.divide()) and tf.cos() (vs. math.cos()) -- the operations specified in (my transcribed, "custom") definition of loss_function().
The loss_function() had been predefined in the body of the tutorial, and when I inspected it using the inspect package (via inspect.getsourcelines(loss_function)) in order to redefine it in my own environment, the output of that inspection didn't clearly indicate that tf.divide and tf.cos had been used instead of their NumPy/math counterparts (which my version of the code had used).
The actual difference is quite small, but is apparently sufficient to push the optimizer in the opposite direction (away from the two respective minima).
After swapping in tf.divide() and tf.cos() (as seen below) I was able to arrive at the same results as seen in the DC console.
Here is the code for the loss_function that leads back to the same results as seen in the console (screenshot):
def loss_function(x):
    import math
    return 4.0*tf.cos(x-1)+tf.divide(tf.cos(2.0*math.pi*x),x)

Trouble with KNN on OpenCV, new_samples.type() == CV_32F when training

I am trying to set up a simple KNN implementation with a three-class dataset, but whenever I execute the train function I keep getting the error: (-215:Assertion failed) new_samples.type() == CV_32F in function 'cv::ml::Impl::train'.
I have tried reshaping the responses array in many different ways, from a 1 x n matrix to a single list, since most of the errors seemed to come from that part of the code. I am following this tutorial. I can get it to work with two classes by defining my own data, just as I do with three classes, but I can't manage to train with three classes.
import numpy as np
import cv2 as cv
classA=([(10,1,1),(9,2,2),(11,1,2),(8,3,2),(7,2,3),(8,5,4),(9,3,4),(6,6,5),(8,6,6),(9,7,7)])
classB=([(5,1,20),(5,2,19),(5,1,21),(4,2,18),(4,1,19),(6,3,20),(6,2,19),(4,4,18),(4,5,21),(6,4,19)])
classC=([(5,14,10),(6,13,9),(4,12,11),(6,11,9),(6,7,12),(7,6,13),(7,7,10), (7,8,11),(8,8,12),(7,6,11)])
points = classA + classB + classC
responses = ([0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2])
# Using numpy's array?
points_np = np.asarray(points)
responses_np = np.asarray(responses).reshape((30,1))
#print(points_np)
#print(responses_np)
knn = cv.ml.KNearest_create()
knn.train(points_np, cv.ml.ROW_SAMPLE, responses_np)
I know both the sample and response data should follow a similar structure so the function can associate each point with a class, but I think my issue is with the type of structure I am using for the responses variable. How should I shape or set the responses variable so that it is readable by the train function?
As indicated in the assertion, the data type for the samples must be CV_32F, which stands for 32-bit float, so cast the arrays before training:
points_np = np.asarray(points).astype(np.float32)
responses_np = np.asarray(responses).reshape((30,1)).astype(np.float32)
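With both arrays cast to float32, knn.train() should succeed, and you can sanity-check the model with findNearest on a made-up query point (the values below are purely illustrative):
knn = cv.ml.KNearest_create()
knn.train(points_np, cv.ml.ROW_SAMPLE, responses_np)

newcomer = np.array([[5, 13, 10]], dtype=np.float32)  # hypothetical query point
ret, results, neighbours, dist = knn.findNearest(newcomer, k=3)
print(results)  # predicted class label for the query point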