Maximum score in WordNet-based similarity - wordnet

Some similarity scores, such as shortest path and WuP, range between 0 and 1, so the similarity between car and automobile is 1. Other measures, such as LCh, give
lch( car, automobile ) = 3.6889
I want to know the maximum score for these measures. Is 3.6889 considered to be the maximum value? Does that mean the LCh score ranges between 0 and 3.6889?
In addition, I get the following for the other measures:
jcn( car, automobile ) = 12876699.5
res( car, automobile ) = 9.3679
lesk( car, automobile ) = 9519

It seems that 3.6375861597263857 is the maximum for lch_similarity (I can't get 3.6889...). According to the documentation, lch_similarity has the following properties:
Leacock Chodorow Similarity:
Return a score denoting how similar two word senses are, based on the
shortest path that connects the senses (as above) and the maximum depth
of the taxonomy in which the senses occur. The relationship is given as
-log(p/2d) where p is the shortest path length and d is the taxonomy
depth.
...
:return: A score denoting the similarity of the two ``Synset`` objects,
normally greater than 0. None is returned if no connecting path
could be found. If a ``Synset`` is compared with itself, the
maximum score is returned, which varies depending on the taxonomy
depth.
Given that rock_hind.n.01 is at the deepest level (19) in the WordNet taxonomy and that change.n.06 is at the shallowest level (2), we can experiment with varying depths:
>>> from nltk.corpus import wordnet as wn
>>> rock = wn.synset('rock_hind.n.01')
>>> change = wn.synset('change.n.06')
>>> rock.lch_similarity(rock)
3.6375861597263857
>>> change.lch_similarity(change)
3.6375861597263857
>>> change.lch_similarity(rock)
0.7472144018302211
>>> rock.lch_similarity(change)
0.7472144018302211
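Where these two maxima come from can be checked directly from the formula quoted above. This is only a rough sketch, assuming NLTK evaluates -log((p + 1) / (2 * d)), where p is the shortest path length in edges and d is the maximum depth of the noun taxonomy:
import math
d = 19                           # assumed maximum depth of the WordNet noun taxonomy in NLTK
print(-math.log(1 / (2 * d)))    # 3.6375861597263857, the self-similarity value above
# The 3.6889 figure is consistent with a depth of 20, e.g. an implementation
# that counts path length in nodes rather than edges:
print(-math.log(1 / (2 * 20)))   # 3.6888794541139363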
Similar experiments can be made for the other measures, where the ranges seem quite a bit larger:
>>> from nltk.corpus import wordnet_ic, genesis
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')
>>> genesis_ic = wn.ic(genesis, False, 0.0)
>>> rock.res_similarity(rock, brown_ic) # res_similarity, brown
1e+300
>>> rock.res_similarity(change, brown_ic)
-0.0
>>> rock.res_similarity(rock, semcor_ic) # res_similarity, semcor
1e+300
>>> rock.res_similarity(change, semcor_ic)
-0.0
>>> rock.res_similarity(rock, genesis_ic) # res_similarity, genesis
1e+300
>>> rock.res_similarity(change, genesis_ic)
-0.08306855877006339
>>> change.res_similarity(rock, genesis_ic)
-0.08306855877006339
>>> rock.jcn_similarity(rock, brown_ic) # jcn, brown - results are identical with semcor and genesis
1e+300
>>> rock.jcn_similarity(change, brown_ic)
1e-300
>>> change.jcn_similarity(rock, brown_ic)
1e-300

Related

Statistics and Pandas: What normalization means in value_counts() in Pandas

The question is not about coding but about understanding what normalize means in terms of statistics and the correlation of data.
This is an example of what I am doing.
Without normalization:
plt.subplot(111)
plt.plot(df['alcoholism'].value_counts(), marker='o')
plt.plot(df.query('no_show =="Yes"')['alcoholism'].value_counts(), color='black')
plt.show();
With normalization:
plt.subplot(111)
plt.plot(df['alcoholism'].value_counts(normalize=True), marker='o')
plt.plot(df.query('no_show =="Yes"')['alcoholism'].value_counts(normalize=True), color='black')
plt.show();
Which one correlates the values better, with or without normalization? Or is this the wrong idea altogether?
I am new to data and pandas, so excuse my bad code, chaining, commenting, style :)
As you can see when you normalize (second plot), the plotted values on each line sum to 1. Normalizing gives you the rate of occurrence of each value instead of the raw number of occurrences.
Here's what the doc says:
normalize : bool, default False
    Return proportions rather than frequencies.
value_counts() probably returns something like:
0 110000
1 1000
dtype: int64
and value_counts(normalize=True) probably returns something like:
0 0.990991
1 0.009009
dtype: float64
In other words, the relation between the normalized and non-normalized counts can be checked as:
>>> counts = df['alcoholism'].value_counts()
>>> rate = df['alcoholism'].value_counts(normalize=True)
>>> np.allclose(rate, counts / counts.sum())
True
where np.allclose allows a proper comparison of two series of floating-point numbers.
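For a self-contained illustration, here is a toy Series (the values below are made up and simply stand in for df['alcoholism']):
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])   # hypothetical 0/1 column

counts = s.value_counts()               # raw frequencies:  0 -> 8, 1 -> 2
rate = s.value_counts(normalize=True)   # proportions:      0 -> 0.8, 1 -> 0.2

print(np.allclose(rate, counts / counts.sum()))   # True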

Matplotlib boxplot select method to calculate the quartile values

Using boxplot from matplotlib.pyplot, the quartile values are calculated with the median included. Can this be changed so that the median is NOT included?
For example, consider the ordered data set
2, 3, 4, 5, 6, 7, 8
If the median is NOT included, then Q1=3 and Q3=7. However, boxplot includes the median value, i.e. 5, and generates the figure below
Is it possible to change this behavior and NOT include the median in the calculation of the quartiles? This should correspond to Method 1 as described on the Wikipedia page Quartile. The code to generate the figure is listed below.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
data = [2, 3, 4, 5, 6, 7, 8]
fig = plt.figure(figsize=(6,1))
ax = fig.add_axes([0.1,0.25,0.8,0.8])
bp = ax.boxplot(data, '',
                vert=False,
                positions=[0.4],
                widths=[0.3])
ax.set_xlim([0,9])
ax.set_ylim([0,1])
ax.xaxis.set_major_locator(MultipleLocator(1))
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.yaxis.set_ticks([])
ax.grid(which='major',axis='x',lw=0.1)
plt.show()
The question is motivated by the fact that several educational resources around the internet do not calculate the quartiles the way matplotlib's boxplot does by default. For example, in the online course "Statistics and probability" from Khan Academy, the quartiles are calculated as described in Method 1 on the Wikipedia page Quartile, while boxplot employs Method 2.
Consider an example from Khan Academy's course "Statistics and probability", section "Comparing range and interquartile range (IQR)". The daily high temperatures recorded in Paradise, MI over 7 days are 16, 24, 26, 26, 26, 27, and 28 degrees Celsius. Describe the data with a boxplot and calculate the IQR.
The result of using the default settings in boxplot and the one presented by Prof. Khan are very different; see the figure below.
The IQR found by matplotlib is 1.5, and that calculated by Prof. Khan is 3.
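As a quick check of where matplotlib's 1.5 comes from, the sketch below assumes the default quartiles are the same linear-interpolation percentiles that np.percentile computes:
import numpy as np

temps = [16, 24, 26, 26, 26, 27, 28]
q1, q3 = np.percentile(temps, [25, 75])   # default linear interpolation
print(q1, q3, q3 - q1)                    # 25.0 26.5 1.5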
As pointed out in the comments by @JohanC, boxplot cannot directly be configured to follow Method 1; it requires a customized function. Therefore, neglecting the calculation of outliers, I updated the code to calculate the quartiles according to Method 1 and thus be comparable with the Khan Academy course. The code is listed below; it is not very pythonic, so suggestions are welcome.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
from matplotlib.ticker import MultipleLocator
def median(x):
    """
    x - input a list of numbers
    Returns the midpoint number, for example
    in a list with odd numbers
    [1, 2, 3, 4, 5] returns 3
    for a list with even numbers the algebraic mean is returned, e.g.
    [1, 2, 3, 4] returns 2.5
    """
    if len(x) & 1:
        # Odd number of elements in list, e.g. x = [1, 2, 3] returns 2
        index_middle = int((len(x) - 1) / 2)
        median = x[index_middle]
    else:
        # Even number of elements in list, e.g. x = [-1, 2] returns 0.5
        index_lower = int(len(x) / 2 - 1)
        index_upper = int(len(x) / 2)
        median = (x[index_lower] + x[index_upper]) / 2
    return median

def method_1_quartiles(x):
    """
    x - list of numbers
    """
    x.sort()
    N = len(x)
    if N & 1:
        # Odd number of elements
        index_middle = int((N - 1) / 2)
        lower = x[0:index_middle]       # up to but not including the median
        upper = x[index_middle + 1:N]
        Q1 = median(lower)
        Q2 = x[index_middle]
        Q3 = median(upper)
    else:
        # Even number of elements
        index_lower = int(N / 2)
        lower = x[0:index_lower]
        upper = x[index_lower:N]
        Q1 = median(lower)
        Q2 = (x[index_lower - 1] + x[index_lower]) / 2
        Q3 = median(upper)
    return Q1, Q2, Q3
data = [16,24,26, 26, 26,27,28]
fig = plt.figure(figsize=(6,1))
ax = fig.add_axes([0.1,0.25,0.8,0.8])
stats = cbook.boxplot_stats(data,)[0]
Q1_default = stats['q1']
Q3_default = stats['q3']
stats['whislo']=min(data)
stats['whishi']=max(data)
IQR_default = Q3_default - Q1_default
Q1, Q2, Q3 = method_1_quartiles(data)
IQR = Q3-Q1
stats['q1'] = Q1
stats['q3'] = Q3
print(f"IQR: {IQR}")
ax.bxp([stats],vert=False,manage_ticks=False,widths=[0.3],positions=[0.4],showfliers=False)
ax.set_xlim([15,30])
ax.set_ylim([0,1])
ax.xaxis.set_major_locator(MultipleLocator(1))
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.yaxis.set_ticks([])
ax.grid(which='major',axis='x',lw=0.1)
plt.show()
The generated graph is shown below.
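Continuing from the code above, a quick sanity check of method_1_quartiles against the two hand-worked examples:
print(method_1_quartiles([2, 3, 4, 5, 6, 7, 8]))          # (3, 5, 7)
print(method_1_quartiles([16, 24, 26, 26, 26, 27, 28]))   # (24, 26, 27), so IQR = 3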

Count number of unique colours in image [duplicate]

This question already has answers here: Most dominant color in RGB image - OpenCV / NumPy / Python
I am trying to count the number of unique colours in an image. I have some code that I think should work; however, when I run it on an image it says I have 252 different colours out of a possible 16,777,216. That seems wrong: given that the image is BGR, shouldn't there be many more different colours (thousands, not hundreds)?
import cv2
import imutils
import numpy as np

def count_colours(src):
    unique, counts = np.unique(src, return_counts=True)
    print(counts.size)
    return counts.size

src = cv2.imread('../../images/di8.jpg')
src = imutils.resize(src, height=300)
count_colours(src)  # outputs 252 different colours!? only?
Is that value correct? And if not how can I fix my function count_colours()?
Source image:
Edit: is this correct?
def count_colours(src):
    unique, counts = np.unique(src.reshape(-1, src.shape[-1]), axis=0, return_counts=True)
    return counts.size
If you look at the uniques you are getting back, I'm pretty sure you'll find they are scalars.
You need to use the axis keyword:
>>> import numpy as np
>>> from scipy.misc import face
>>>
>>> img = face()
>>> np.unique(img.reshape(-1, img.shape[-1]), axis=0, return_counts=True)
(array([[ 0, 0, 5],
[ 0, 0, 7],
[ 0, 0, 9],
...,
[255, 248, 255],
[255, 249, 255],
[255, 252, 255]], dtype=uint8), array([1, 2, 2, ..., 1, 1, 1]))
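The number of unique colours is then just the number of rows in the first array returned. Continuing from the example above:
colours, counts = np.unique(img.reshape(-1, img.shape[-1]), axis=0, return_counts=True)
print(len(colours))   # number of distinct colour triples in the image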
The comment by @Edeki Okoh is correct. You need to find a way to take the color channels into account. There is probably a much cleaner solution, but a hacky way to do this would be something like the following. Each color channel has values from 0 to 255, so we add 1 in order to make sure that it gets multiplied. Blue will represent the last three digits, green the middle three and red the first three. Now every value represents a unique color.
b, g, r = cv2.split(src)
# cast up from uint8 first so the sums and products below cannot overflow
b, g, r = np.int32(b), np.int32(g), np.int32(r)
shiftet_im = b + 1000 * (g + 1) + 1000 * 1000 * (r + 1)
The resulting image should have one channel with each value representing a unique color combination.
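Continuing from that snippet, the count itself would then be something like:
n_colours = np.unique(shiftet_im).size   # number of distinct packed colour values
print(n_colours)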
I think you only counted a single channel, e.g. the R-value, out of the full RGB channels; that's why you have only 252 discrete values.
In theory R, G and B can each have 256 discrete states.
256 * 256 * 256 = 16777216
means in total you can have 16777216 possible colors.
My suggestion is to convert the RGB uchar CV_8UC3 image into a single 32-bit data structure such as CV_32SC1.
Given the image below as input:
# my test small-size text image, for which I can count the number of states by hand
import cv2
import numpy as np
image = cv2.imread('/home/usr/naneDownloads/vuQ9y.png')  # change here
b, g, r = cv2.split(image)
out_in_32U_2D = (np.int32(b) << 16) + (np.int32(g) << 8) + np.int32(r)  # bit-wise shift by 8 per channel; parentheses needed because + binds tighter than <<
out_in_32U_1D = out_in_32U_2D.reshape(-1)  # convert to 1D
np.unique(out_in_32U_1D)
array([-2147483648, -2080374784, -1073741824, -1006632960, 0,
14336, 22528, 30720, 58368, 91136,
123904, 237568, 368640, 499712, 966656,
1490944, 2015232, 3932160, 6029312, 8126464,
15990784, 24379392, 32768000, 65011712, 67108864,
98566144, 132120576, 264241152, 398458880, 532676608,
536870912, 805306368, 1073741824, 1140850688, 1342177280,
1610612736, 1879048192], dtype=int32)
len(np.unique(out_in_32U_1D))
37  # correct for my test writing paper when compared with my manual counting
The code here should be able to provide you with what you need.

Difficulty with numpy broadcasting

I have two 2d point clouds (oldPts and newPts) which I wish to combine. They are mx2 and nx2 numpy integer arrays, with m and n of order 2000. newPts contains many duplicates or near duplicates of points in oldPts and I need to remove these before combining.
So far I have used the histogram2d function to produce a 2d representation of oldPts (H). I then compare each newPt to an NxN area of H and, if it is empty, I accept the point. This last part I am currently doing with a Python loop, which I would like to remove. Can anybody show me how to do this with broadcasting, or perhaps suggest a completely different method of going about the problem? The working code is below.
npzfile = np.load(path+datasetNo+'\\temp.npz')
arrs = npzfile.files
oldPts = npzfile[arrs[0]]
newPts = npzfile[arrs[1]]
# remove all the negative values
oldPts = oldPts[oldPts.min(axis=1)>=0,:]
newPts = newPts[newPts.min(axis=1)>=0,:]
# round to integers
oldPts = np.around(oldPts).astype(int)
newPts = newPts.astype(int)
# put the oldPts into 2d array
H, xedg, yedg = np.histogram2d(oldPts[:, 0], oldPts[:, 1],
                               bins=[xMax, yMax],
                               range=[[0, xMax], [0, yMax]])
finalNewList = []
N = 5
for pt in newPts:
    if not H[max(0, pt[0] - N):min(xMax, pt[0] + N),
             max(0, pt[1] - N):min(yMax, pt[1] + N)].any():
        finalNewList.append(pt)
finalNew = np.array(finalNewList)
A clean way to do this is to compute the distance between every pair of 2D points and then accept only the new points that are "different enough" from every old point, using scipy.spatial.distance.cdist:
import numpy as np
oldPts = np.random.randn(1000,2)
newPts = np.random.randn(2000,2)
from scipy.spatial.distance import cdist
dist = cdist(oldPts, newPts)
print(dist.shape) # (1000, 2000)
okIndex = np.min(dist, axis=0) > 5   # new point is farther than 5 from every old point
print(np.sum(okIndex))               # number of accepted new points
finalNew = newPts[okIndex, :]
print(finalNew.shape)
Above I use a Euclidean distance of 5 as the threshold for "too close": any point in newPts that's farther than 5 from all points in oldPts is accepted into finalNew. You will have to look at the range of values in dist to find a good threshold, but your histogram can guide you in picking the best one.
(One good way to visualize dist is to use matplotlib.pyplot.imshow(dist).)
This is a more refined version of what you were doing with the histogram. In fact, you ought to be able to get essentially the same answer as the histogram approach by passing metric='chebyshev' to cdist, since the square NxN window corresponds to the Chebyshev (maximum-coordinate) distance, assuming your histogram bin widths are the same in both dimensions and using 5 again as the threshold.
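A rough sketch of that variant, under the same assumptions as above (the oldPts/newPts arrays already defined and a threshold of 5):
dist_cheb = cdist(oldPts, newPts, metric='chebyshev')   # max absolute coordinate difference
keep = np.min(dist_cheb, axis=0) >= 5                   # no old point inside the square window
finalNew_cheb = newPts[keep, :]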
(PS. If you're interested in another useful function in scipy.spatial.distance, check out my answer that uses pdist to find unique rows/columns in an array.)

The random number generator in numpy

I am using numpy.random.randn and numpy.random.rand to generate random numbers. I am confused about the difference between random.randn and random.rand.
The main difference between the two is mentioned in the docs; see the documentation for rand and for randn.
For numpy.random.rand, you get random values generated from a uniform distribution over [0, 1).
But for numpy.random.randn you get random values generated from a normal distribution with mean 0 and variance 1.
Just a small example.
>>> import numpy as np
>>> np.random.rand(10)
array([ 0.63067838, 0.61371053, 0.62025104, 0.42751699, 0.22862483,
0.75287427, 0.90339087, 0.06643259, 0.17352284, 0.58213108])
>>> np.random.randn(10)
array([ 0.19972981, -0.35193746, -0.62164336, 2.22596365, 0.88984545,
-0.28463902, 1.00123501, 1.76429108, -2.5511792 , 0.09671888])
>>>
As you can see, rand gives me values within 0-1,
whereas randn gives me values with mean == 0 and variance == 1.
To explain further, let me generate a large enough sample:
>>> a = np.random.rand(100)
>>> b = np.random.randn(100)
>>> np.mean(a)
0.50570149531258946
>>> np.mean(b)
-0.010864958465191673
>>>
You can see that the mean of a, which was generated using rand, is close to 0.50. The mean of b, on the other hand, is close to 0.0, since it was generated using randn.
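A larger sample also makes the difference in spread visible (a rough check; the exact numbers will vary from run to run):
import numpy as np

a = np.random.rand(100000)    # uniform on [0, 1)
b = np.random.randn(100000)   # standard normal

print(a.mean(), a.std())   # roughly 0.5 and 0.289 (sqrt(1/12))
print(b.mean(), b.std())   # roughly 0.0 and 1.0
print(a.min(), a.max())    # always inside [0, 1)
print(b.min(), b.max())    # tails extend well beyond [0, 1]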
You can also convert rand numbers into randn numbers in Python by applying the percent point function (ppf) of the standard normal distribution N(0, 1). This is a well-known method: pushing uniform (0, 1) random variables through the ppf (inverse CDF) of a desired distribution yields random variables with that distribution.
In Python we can visualize that process as follows:
from numpy.random import rand
import matplotlib.pyplot as plt
from scipy.stats import norm
u = rand(100000) # uniformly distributed rvs
z = norm.ppf(u) # ~ N(0,1) rvs
plt.hist(z,bins=100)
plt.show()