silhouette_score of 0.36 for k-means clustering

Is a silhouette_score of 0.36 good for k-means clustering on 167,650 data points with 6 clusters, after removing outliers and replacing NaN with 0?
The k-means model uses the following hyperparameters: random_state=50, init='random', algorithm='auto'.
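For reference, a minimal sketch (not from the original post) of how such a score might be computed with scikit-learn; the feature matrix X is assumed to be the preprocessed data, and the sample_size subsampling is my addition to keep the pairwise silhouette computation tractable:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X is assumed: the preprocessed feature matrix (outliers removed, NaN -> 0),
# roughly 167,650 rows.
kmeans = KMeans(n_clusters=6, random_state=50, init='random')  # algorithm='auto' in the question; newer scikit-learn calls this 'lloyd'
labels = kmeans.fit_predict(X)

# silhouette_score compares each point with every other point, so subsampling
# keeps the computation manageable on ~167k rows.
score = silhouette_score(X, labels, sample_size=20000, random_state=50)
print(score)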


sklearn classification report

I am training an ELECTRA model with TensorFlow on a multi-label task. The ROC performance for each individual label is:
AUROC per tag
morality_binary: 0.8840802907943726
emotion_binary: 0.8690611124038696
positive_binary: 0.9115268588066101
negative_binary: 0.9200447201728821
care_binary: 0.9266915321350098
fairness_binary: 0.8638730645179749
authority_binary: 0.8471786379814148
sanctity_binary: 0.9040042757987976
harm_binary: 0.9046630859375
injustice_binary: 0.8968375325202942
betrayal_binary: 0.846387505531311
subversion_binary: 0.7741811871528625
degradation_binary: 0.9601025581359863
But when I run the sklearn classification report:
import numpy as np
from sklearn.metrics import classification_report

THRESHOLD = 0.5
y_pred = predictions.numpy()
y_true = labels.numpy()
upper, lower = 1, 0
y_pred = np.where(y_pred > THRESHOLD, upper, lower)
print(classification_report(
    y_true,
    y_pred,
    target_names=LABEL_COLUMNS,
    zero_division=0
))
... five of the labels turn out with an F1-score of 0:
                    precision  recall  f1-score  support
   morality_binary       0.72    0.73      0.73      347
    emotion_binary       0.66    0.73      0.69      303
   positive_binary       0.71    0.76      0.73      242
   negative_binary       0.70    0.62      0.65      141
       care_binary       0.67    0.60      0.63      141
   fairness_binary       0.55    0.53      0.54      166
  authority_binary       0.00    0.00      0.00       49
   sanctity_binary       0.00    0.00      0.00       23
       harm_binary       0.48    0.32      0.39       50
  injustice_binary       0.62    0.56      0.59       97
   betrayal_binary       0.00    0.00      0.00       30
 subversion_binary       0.00    0.00      0.00        8
degradation_binary       0.00    0.00      0.00       10
Can someone explain to me how this is possible? I can understand a low f-score, but 0?
I assume 0 is negative and 1 is positive.
AUROC measures the area under the ROC curve as a summary of how well a classifier performs (a score of 0.5 corresponds to a random, coin-flip model). To draw the ROC curve, you compute two values at many different threshold values for separating positive from negative examples:
y-axis: True positive rate (TPR) - the fraction of the positive examples that the model predicts as positive.
x-axis: False positive rate (FPR) - the fraction of the negative examples that the model predicts as positive.
TPR is also called recall. We calculate this using the following formula:
TPR = True positives / (True positives + False Negatives)
= True positives / All positives
So the only way TPR (recall) can be 0 is if TP is 0. This means that precision will also be 0, as we calculate precision using the following formula:
Precision = True positives / (True positives + False positives)
Which will also result in 0 if and only if TP is equal to 0.
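For a concrete illustration (the prediction counts here are invented; only the support of 23 comes from the report above): suppose that at threshold 0.5 the model predicts no positive examples at all for sanctity_binary. Then TP = 0, FP = 0 and FN = 23, so recall = 0 / (0 + 23) = 0 and precision = 0 / (0 + 0), which zero_division=0 reports as 0 - hence the rows of 0.00.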
The classification report, however, only evaluates a single point on that curve: you picked one threshold value (0.5) in your code to turn the probabilities into 0 or 1. At that particular threshold the model apparently predicts no positives for those five labels, so TP = 0 and precision, recall and F1 are all 0. A single threshold is not a representation of the full ROC curve or the AUROC measure.
I suggest you take a look at the ROC curve and try different values for your classification threshold. The resulting AUROC values suggest that your model performs better than a random one in general, so you should be able to find a good threshold.
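For illustration, a minimal sketch (not part of the original answer) of inspecting per-label thresholds with scikit-learn; it assumes the predictions, labels and LABEL_COLUMNS objects from the question's code and uses Youden's J statistic as one simple heuristic:
import numpy as np
from sklearn.metrics import roc_curve

y_prob = predictions.numpy()   # predicted probabilities, shape (n_samples, n_labels)
y_true = labels.numpy()

for i, name in enumerate(LABEL_COLUMNS):
    fpr, tpr, thresholds = roc_curve(y_true[:, i], y_prob[:, i])
    # Youden's J = TPR - FPR; its maximum is one common way to pick a threshold.
    best = thresholds[np.argmax(tpr - fpr)]
    print(f"{name}: suggested threshold ~ {best:.2f}")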

tfidf from word counts

I have a categorical variable with large cardinality (1000+ distinct values). Each of these values can occur repeatedly in each train/test instance.
Although this is not really text data it seems to have similar properties and I would like to treat this as a text classification problem.
My starting point is a dataframe listing the number of occurrences of each "word" in each "document", e.g.
{'Word1': {0: '1',
           1: '3',
           2: '0',
           3: '0',
           4: '0'},
 'Word2': {0: '0',
           1: '2',
           2: '0',
           3: '0',
           4: '0'}}
I would like to apply tfidf transformation to these "word" counts. How can I do that?
sklearn.feature_extraction.text.TfidfVectorizer seems to expect a sequence of strings or a file as an input which it preprocesses and tokenizes. None of this is necessary in this case as I already have the "word" counts.
So how to get the tfidf transformation of these counts?
I had a similar situation where I was trying to recreate TF-IDF from word counts. Try the code below; it worked for me.
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The dog ate a sandwich and I ate a sandwich",
          "The wizard transfigured a sandwich"]
vectorizer = TfidfVectorizer(stop_words='english')
tfidfs = vectorizer.fit_transform(corpus)

# Column names in the order of the fitted vocabulary indices.
columns = [k for (v, k) in sorted((v, k)
                                  for k, v in vectorizer.vocabulary_.items())]
tfidfs = pd.DataFrame(tfidfs.todense(),
                      columns=columns)
#    ate   dog   sandwich  transfigured  wizard
# 0  0.75  0.38  0.54      0.00          0.00
# 1  0.00  0.00  0.45      0.63          0.63

# Inverse of the fitted IDF weights per term.
df = (1 / pd.DataFrame([vectorizer.idf_], columns=columns))
#    ate   dog   sandwich  transfigured  wizard
# 0  0.71  0.71  1.0       0.71          0.71

# Raw term counts per document, restricted to the fitted vocabulary.
corp = [txt.lower().split() for txt in corpus]
corp = [[w for w in d if w in vectorizer.vocabulary_] for d in corp]
tfs = pd.DataFrame([Counter(d) for d in corp]).fillna(0).astype(int)
#    ate  dog  sandwich  transfigured  wizard
# 0    2    1         2             0       0
# 1    0    0         1             1       1

# The first document's TF-IDF vector: term counts times IDF, then L2-normalized.
tfidf0 = tfs.iloc[0] * (1. / df)
tfidf0 = tfidf0 / np.linalg.norm(tfidf0)
#    ate       dog       sandwich  transfigured  wizard
# 0  0.754584  0.377292  0.536893  0.0           0.0

tfidf1 = tfs.iloc[1] * (1. / df)
tfidf1 = tfidf1 / np.linalg.norm(tfidf1)
#    ate  dog  sandwich  transfigured  wizard
# 0  0.0  0.0  0.449436  0.631667     0.631667
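Since the question already has the counts, a more direct route (my suggestion, not part of the answer above) may be sklearn's TfidfTransformer, which accepts a term-count matrix instead of raw text. The counts frame below mirrors the one in the question:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# Count matrix: rows are "documents", columns are "words".
counts = pd.DataFrame({'Word1': [1, 3, 0, 0, 0],
                       'Word2': [0, 2, 0, 0, 0]})

transformer = TfidfTransformer()            # defaults: smooth_idf=True, norm='l2'
tfidf = transformer.fit_transform(counts)   # sparse matrix of TF-IDF weights

tfidf_df = pd.DataFrame(tfidf.toarray(), columns=counts.columns)
print(tfidf_df)
Note that the question's counts are stored as strings ('1', '3', ...), so they would need an .astype(int) first.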

Series of if statements applied to data frame

I have a question on how to do this task. I want to group the numbers in the 'PD' column of my data frame, which ranges from .001 to 1, into a new column named 'Grouping': values with .9 < PD < .91 map to .91 (i.e. the column returns .91), values with .91 <= PD < .92 map to .92, and so on up to values with .99 <= PD <= 1 mapping to 1. What I have been doing is writing each if statement manually and then merging the result with the base data frame. Can anyone help me with a more efficient way of doing this? I am still in the early stages of using Python, so sorry if the question seems easy. Thank you for your time.
Let your data look like this
>>> df = pd.DataFrame({'PD': np.arange(0.001, 1, 0.001), 'data': np.random.randint(10, size=999)})
>>> df.head()
PD data
0 0.001 6
1 0.002 3
2 0.003 5
3 0.004 9
4 0.005 7
Then cut off the last decimal place of the PD column. This is a bit tricky, since you run into rounding issues when doing it without a str conversion. E.g.
>>> df['PD'] = df['PD'].apply(lambda x: float('{:.3f}'.format(x)[:-1]))
>>> df.tail()
PD data
994 0.99 1
995 0.99 3
996 0.99 2
997 0.99 1
998 0.99 0
Now you can use the pandas-groupby. Do with data whatever you want, e.g.
>>> df.groupby('PD').agg(lambda x: ','.join(map(str, x)))
data
PD
0.00 6,3,5,9,7,3,6,8,4
0.01 3,5,7,0,4,9,7,1,7,1
0.02 0,0,9,1,5,4,1,6,7,3
0.03 4,4,6,4,6,5,4,4,2,1
0.04 8,3,1,4,6,5,0,6,0,5
[...]
Note that the first row is one item shorter due to missing 0.000 in my sample.
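If the goal is specifically the asker's hundredth-wide buckets in a 'Grouping' column, a vectorized alternative worth considering (my addition, not part of the answer above) is np.ceil; pd.cut with explicit bin edges would work as well:
import numpy as np
import pandas as pd

df = pd.DataFrame({'PD': np.arange(0.001, 1, 0.001)})

# Round PD up to the next hundredth, e.g. 0.905 -> 0.91 and 0.91 -> 0.91.
# The round(6) guards against floating-point noise such as 0.91 * 100 == 91.00000000000001.
df['Grouping'] = np.ceil(df['PD'].mul(100).round(6)).div(100)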

Why does a table I create from a Pandas dataframe have extra decimal places?

I have a dataframe which I want to use to create a table. The dataframe contains data in float form which I have already rounded to two decimal places. Here is a dataframe which is a subset of the dataframe I am working on:
          Band   R^2
0  Band2 Train  0.37
1  Band3 Train  0.50
2  Band4 Train  0.19
3   Band2 Test  0.41
4   Band3 Test  0.53
5   Band4 Test  0.12
As you can see all data in the R^2 column are rounded.
I have written the following simple code to create a table which I intend to export as a png so I can embed it in a LaTeX document. Here is the code:
import matplotlib.pyplot as plt
from pandas.plotting import table  # assuming pandas' table helper is the one used here

ax1 = plt.subplot(111, frameon=False)
ax1.xaxis.set_visible(False)
ax1.yaxis.set_visible(False)
ax1.set_frame_on(False)
myTable = table(ax1, df)
myTable.auto_set_font_size(False)
myTable.set_fontsize(13)
myTable.scale(1.2, 3.5)
Here is the table (rendered image not shown here):
Can anybody explain why the values in the R^2 column are displayed with more than two decimal places?
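No answer is attached here, but one hedged workaround (my addition): the rounded values are still binary floats, and the table helper appears to render them at full precision, so formatting the column as fixed two-decimal strings before building the table sidesteps the issue. A minimal sketch, assuming df is the frame above and the pandas.plotting.table helper from the code:
import matplotlib.pyplot as plt
from pandas.plotting import table

df_display = df.copy()
df_display['R^2'] = df_display['R^2'].map('{:.2f}'.format)  # fixed two-decimal strings

ax1 = plt.subplot(111, frameon=False)
ax1.xaxis.set_visible(False)
ax1.yaxis.set_visible(False)
ax1.set_frame_on(False)
table(ax1, df_display)
plt.savefig('table.png', bbox_inches='tight')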

Shifting a Pandas column, and then take the mean of the next 3 values (post_shift)

I have a dataframe, df which looks like this
Open High Low Close Volume
Date
2007-03-22 2.65 2.95 2.64 2.86 176389
2007-03-23 2.87 2.87 2.78 2.78 63316
2007-03-26 2.83 2.83 2.51 2.52 54051
2007-03-27 2.61 3.29 2.60 3.28 589443
2007-03-28 3.65 4.10 3.60 3.80 1114659
2007-03-29 3.91 3.91 3.33 3.57 360501
2007-03-30 3.70 3.88 3.66 3.71 185787
I'm attempting to create a new column, that first shifts the Open column 3 rows (df.Open.shift(-3)) and then takes the average of itself and the next 2 values.
So for example the above dataframe's Open column would be shifted -3 rows and look something like this:
Date
2007-03-22 2.610
2007-03-23 3.650
2007-03-26 3.910
2007-03-27 3.700
2007-03-28 3.710
2007-03-29 3.710
2007-03-30 3.500
I then want to take the forward mean of the next 3 values (including itself), row by row.
So the first value would be 2.610 (the value itself) + 3.650 + 3.910 (the next two values), divided by 3.
Then we take the next value, 3.650, and do the same, creating a column of values.
At first I tried something like :
df['Avg'] =df.Open.shift(-3).iloc[0:3].mean()
But this doesn't iterate through all the values of Open.shift
This next loop seems to work but is very slow, and I was told it's bad practice to use for loops in Pandas.
for i in range(0, len(df.Open)):
    df['Avg'][i] = df.Open.shift(-3).iloc[i:i+4].mean()
I tried thinking of ways to use apply:
df.Open.shift(-3).apply(loc[0:4].mean())
df.Open.shift(-3).apply(lambda x: x[0:4].mean())
but these seem to give errors such as
TypeError: 'float' object is not subscriptable
I can't think of an elegant way of doing this.
Thank you.
You can use a pandas rolling mean (pd.rolling_mean in older pandas; in current pandas it is the .rolling(...).mean() method). Since the rolling window looks backward, applying it to the shifted column gives 2.61 (the value itself) for the first row and 3.13 (the mean of rows 0 and 1) for the second. To handle that, you can use shift(-2) to shift the values by a further 2 rows.
df.Open.shift(-3).rolling(window=3, min_periods=1).mean().shift(-2)
output:
open
date
2007-03-22 3.390000
2007-03-23 3.753333
2007-03-26 3.773333
2007-03-27 3.706667
2007-03-28 3.640000
2007-03-29 NaN
2007-03-30 NaN
numpy solution
As promised
NOTE: HUGE CAVEAT
This is an advanced technique and is not recommended for any beginner!!!
Using this might actually shave your poodle bald by accident. BE CAREFUL!
as_strided
from numpy.lib.stride_tricks import as_strided
import numpy as np
import pandas as pd
# I didn't have your full data for all dates
# so I created my own array
# You should be able to just do
# o = df.Open.values
o = np.array([2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70, 3.71, 3.71, 3.50])
# because we shift 3 rows, I trim with 3:
# because it'll be rolling 3 period mean
# add two np.nan at the end
# this makes the strides cleaner.. sortof
# whatever, I wanted to do it
o = np.append(o[3:], [np.nan] * 2)
# strides are the size of the chunk of memory
# allocated to each array element. there will
# be a stride for each numpy dimension. for
# a one dimensional array, I only want the first
s = o.strides[0]
# it gets fun right here
as_strided(o, (len(o) - 2, 3), (s, s))
# arguments:
#   o               - the object (array) whose memory we re-view
#   (len(o) - 2, 3) - shape of the new view: one row per 3-element window
#   (s, s)          - strides: step s bytes through the memory chunk to reach
#                     the next row and s bytes to reach the next column,
#                     so consecutive rows overlap
[[ 2.61 3.65 3.91]
[ 3.65 3.91 3.7 ]
[ 3.91 3.7 3.71]
[ 3.7 3.71 3.71]
[ 3.71 3.71 3.5 ]
[ 3.71 3.5 nan]
[ 3.5 nan nan]]
Now we just take the mean. All together
o = np.array([2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70, 3.71, 3.71, 3.50])
o = np.append(o[3:], [np.nan] * 2)
s = o.strides[0]
as_strided(o, (len(o) - 2, 3), (s, s)).mean(1)
array([ 3.39 , 3.75333333, 3.77333333, 3.70666667, 3.64 ,
nan, nan])
You can wrap it in a pandas Series; the index slice just needs the same length as the number of windows, len(o) - 2:
pd.Series(
    as_strided(o, (len(o) - 2, 3), (s, s)).mean(1),
    df.index[:len(o) - 2],
)
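As a side note (my addition, not part of the answer above): recent NumPy versions provide numpy.lib.stride_tricks.sliding_window_view, which builds the same overlapping windows without hand-computed strides. A minimal sketch assuming the same df as in the question:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Shift by 3 rows and pad the tail, as above.
o = np.append(df.Open.values[3:], [np.nan] * 2)
windows = sliding_window_view(o, 3)   # shape: (len(o) - 2, 3), a read-only view
avg = pd.Series(windows.mean(axis=1), index=df.index[:len(o) - 2])
print(avg)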