How to calculate the LSA word score as seen in "LSA Intro AI Seminar"

If you check http://www.cs.nmsu.edu/~mmartin/LSA_Intro_AI_Seminar.ppt, they show a calculated score for each word on Slide 25.
I have not been able to find out how this score is calculated.
I recently completed an LSA implementation and can reproduce all the other results in this PPT, but not Slide 25.
The reason I ask is that I would like to use this score to indicate the 'top reasons' why a document scored high.

OK; I decided to write to the author of the PowerPoint (Dr. Melanie J. Martin), and she pointed me to page 406 of the Deerwester et al. paper it is based on: the "X hat" section.
To test it, I wrote out the example to see if I could reproduce the same values, and it worked :)
Test code:
static void XHatTest()
{
    // T (term matrix), S (diagonal singular values) and D (document matrix)
    // are the reduced two-factor SVD from Deerwester et al., p. 406.
    double[,] testT =
    {
        { 0.22, -0.11 },
        { 0.20, -0.07 },
        { 0.24,  0.04 },
        { 0.40,  0.06 },
        { 0.64, -0.17 },
        { 0.27,  0.11 },
        { 0.27,  0.11 },
        { 0.30, -0.14 },
        { 0.21,  0.27 },
        { 0.01,  0.49 },
        { 0.04,  0.62 },
        { 0.03,  0.45 }
    };
    double[,] testD =
    {
        {  0.20, 0.61,  0.46,  0.54, 0.28, 0.00, 0.02, 0.02, 0.08 },
        { -0.06, 0.17, -0.13, -0.23, 0.11, 0.19, 0.44, 0.62, 0.53 }
    };
    double[,] testS =
    {
        { 3.34, 0    },
        { 0,    2.54 }
    };
    Matrix A = new Matrix(testT), B = new Matrix(testD), C = new Matrix(testS);
    // X hat = T * S * D' (testD already holds D transposed, i.e. 2 x 9).
    Matrix Result = A * C * B;
    for (int row = 0; row < Result.NoRows; row++)
    {
        for (int col = 0; col < Result.NoCols; col++)
        {
            Console.Write(Math.Round(Result[row, col], 2) + " ");
        }
        Console.WriteLine();
    }
}
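For the 'top reasons' goal, the same X hat = T * S * D' reconstruction can be sorted per document column, since each entry of X hat is a word's score in a document. Here is a minimal NumPy sketch of that idea (my own illustration; the random X and the variable names are placeholders, not the seminar's data):
import numpy as np

# Placeholder term-document matrix (12 terms x 9 documents).
X = np.random.rand(12, 9)

# Rank-2 truncated SVD, as in Deerwester et al.
T, s, D = np.linalg.svd(X, full_matrices=False)
T, s, D = T[:, :2], s[:2], D[:2, :]

# X hat = T * S * D'; entry [w, d] is word w's score in document d.
X_hat = T @ np.diag(s) @ D

# "Top reasons": the three highest-scoring words for document 0.
doc = 0
top = np.argsort(X_hat[:, doc])[::-1][:3]
print(top, X_hat[top, doc])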


SMOTE adds many rows with 0 values to dataframe

Please help me: I cannot understand why X_synthetic_df contains hundreds of rows with 0 values. All of the rows have normal values until row 1745; from that row on, every row contains nothing but zeros.
def nearest_neighbors(nominal_columns, numeric_columns, df, row, k):
    def distance(row1, row2):
        distance = 0
        for col in nominal_columns:
            if row1[col] != row2[col]:
                distance += 1
        for col in numeric_columns:
            distance += (row1[col] - row2[col])**2
        return distance**0.5
    distances = []
    for i in range(len(df)):
        r = df.iloc[i]
        if r.equals(row):
            continue
        d = distance(row, r)
        if d != 0:
            distances.append((d, i))
    distances.sort()
    nearest = [i for d, i in distances[:k]]
    return nearest

def smotenc(X, y, nominal_cols, numeric_cols, k=5, seed=None):
    minority_class = y[y == 1]
    majority_class = y[y == 0]
    minority_samples = X[y == 1]
    minority_target = y[y == 1]
    n_synthetic_samples = len(majority_class) - len(minority_class)
    synthetic_samples = np.zeros((n_synthetic_samples, X.shape[1]))
    if seed is not None:
        np.random.seed(seed)
    for i in range(len(minority_samples)):
        nn = nearest_neighbors(nominal_cols, numeric_cols, minority_samples, minority_samples.iloc[i], k=k)
        for j in range(min(k, n_synthetic_samples - i*k)):
            nn_idx = int(np.random.choice(a=nn))
            diff = minority_samples.iloc[nn_idx] - minority_samples.iloc[i]
            print(diff)
            if (diff == 0).all():
                continue
            synthetic_sample = minority_samples.iloc[i] + np.random.rand() * diff
            synthetic_samples[(i*k) + j, :] = synthetic_sample
    X_resampled = pd.concat([X[y == 1], pd.DataFrame(synthetic_samples, columns=X.columns)], axis=0)
    y_resampled = np.concatenate((y[y == 1], [1] * n_synthetic_samples))
    return X_resampled, y_resampled

minority_features = df_nominal.columns.get_indexer(df_nominal.columns)
synthetic = smotenc(check_x.head(3000), check_y.head(3000), nominal_cols, numeric_cols, seed=None)
X_synthetic_df = synthetic[0]
X_synthetic_df = pd.DataFrame(X_synthetic_df, columns=X.columns)
I want a dataframe with n synthetic samples, where n is the difference between the number of majority-class and minority-class samples.
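One property of the code above worth checking: synthetic_samples is preallocated with np.zeros, and a row only stops being zero if the loop actually writes to it. Rows are skipped whenever (diff == 0).all() triggers continue, and no index at or above len(minority_samples) * k is ever written, so if n_synthetic_samples is larger than that, the trailing rows survive as all-zero output. A tiny sketch of just the indexing, with made-up sizes:
import numpy as np

n_synthetic = 10      # pretend majority - minority = 10
n_minority, k = 2, 3  # only 2 minority rows, 3 neighbours each

written = np.zeros(n_synthetic, dtype=bool)
for i in range(n_minority):
    for j in range(min(k, n_synthetic - i * k)):
        written[i * k + j] = True  # mirrors synthetic_samples[(i*k)+j, :] = ...

print(np.flatnonzero(~written))  # [6 7 8 9] -- never written, stay zero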

np.array([0, 0, 0])[[1, 2, 2]] += [4, 5, 6] does not accumulate

All arrays in the following are NumPy arrays.
I have an array of numbers, say a = [4, 5, 6], which I want to add into an accumulation array, say s = [0, 0, 0], while controlling which number goes where. For instance, I want
s[1] += a[0],
s[2] += a[1], and then
s[2] += a[2].
So I set up an auxiliary index array i = [1, 2, 2] and hoped that s[i] += a would work. But it doesn't: s[2] ends up receiving only a[2], as if s[i] += a were implemented as
t = [0, 0, 0]; t[i] = a; s += t.
I would like to know whether there is a way to achieve my version of "s[i] += a" without for loops in pure Python, as I have heard those are much slower.
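For what it's worth, NumPy ships an unbuffered variant for exactly this case: np.add.at(s, i, a) applies the addition once per occurrence of each index, which is the semantics asked for above. A quick check:
import numpy as np

s = np.array([0, 0, 0])
a = np.array([4, 5, 6])
i = np.array([1, 2, 2])

# np.add.at is unbuffered, so the repeated index 2 accumulates both a[1]
# and a[2]; equivalent to s[1] += 4; s[2] += 5; s[2] += 6.
np.add.at(s, i, a)
print(s)  # [ 0  4 11]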

Why does indexing a string inside of a recursive call yield a different result?

In my naive implementation of an edit-distance finder, I have to check whether the last characters of two strings match:
ulong editDistance(const string a, const string b) {
    if (a.length == 0)
        return b.length;
    if (b.length == 0)
        return a.length;
    const auto delt = a[$ - 1] == b[$ - 1] ? 0 : 1;
    import std.algorithm : min;
    return min(
        editDistance(a[0 .. $ - 1], b[0 .. $ - 1]) + delt,
        editDistance(a, b[0 .. $ - 1]) + 1,
        editDistance(a[0 .. $ - 1], b) + 1
    );
}
This yields the expected results, but if I replace delt with its definition, it always returns 1 on non-empty strings:
ulong editDistance(const string a, const string b) {
    if (a.length == 0)
        return b.length;
    if (b.length == 0)
        return a.length;
    //const auto delt = a[$ - 1] == b[$ - 1] ? 0 : 1;
    import std.algorithm : min;
    return min(
        editDistance(a[0 .. $ - 1], b[0 .. $ - 1]) + a[$ - 1] == b[$ - 1] ? 0 : 1, //delt,
        editDistance(a, b[0 .. $ - 1]) + 1,
        editDistance(a[0 .. $ - 1], b) + 1
    );
}
Why does this result change?
The operators have different precedence than you expect. In const auto delt = a[$ - 1] == b[$ - 1] ? 0 : 1; there is no ambiguity, but in editDistance(a[0 .. $ - 1], b[0 .. $ - 1]) + a[$ - 1] == b[$ - 1] ? 0 : 1 the surrounding expression changes how it parses.
Simplifying:
auto tmp = editDistance(a[0 .. $ - 1], b[0 .. $ - 1]);
return min(
    tmp + a[$ - 1] == b[$ - 1] ? 0 : 1,
    // ...
);
The interesting part is parsed as (tmp + a[$ - 1]) == b[$ - 1] ? 0 : 1, because + binds tighter than ==; tmp + a[$ - 1] is not equal to b[$ - 1], so that argument evaluates to 1. The solution is to wrap the comparison in parentheses:
editDistance(a[0 .. $ - 1], b[0 .. $ - 1]) + (a[$ - 1] == b[$ - 1] ? 0 : 1)
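The same trap exists in other languages; Python's conditional expression, for example, also binds looser than +, so an unparenthesized match penalty swallows the whole sum. A toy illustration (not the D code):
tmp = 3
last_a, last_b = "x", "y"

# Parsed as (tmp + 0) if (last_a == last_b) else 1 -- the whole sum
# becomes the "then" branch, so a mismatch collapses the result to 1.
print(tmp + 0 if last_a == last_b else 1)    # 1

# Parenthesized, only the penalty is conditional: tmp + 1 here.
print(tmp + (0 if last_a == last_b else 1))  # 4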

Using libSVM programmatically

I have started using libSVM (Java: https://github.com/cjlin1/libsvm) programmatically and wrote the following code to test it:
svm_parameter param = new svm_parameter();
// default values
param.svm_type = svm_parameter.C_SVC;
param.kernel_type = svm_parameter.RBF;
param.degree = 3;
param.gamma = 0;
param.coef0 = 0;
param.nu = 0.5;
param.cache_size = 40;
param.C = 1;
param.eps = 1e-3;
param.p = 0.1;
param.shrinking = 1;
param.probability = 0;
param.nr_weight = 0;
param.weight_label = new int[0];
param.weight = new double[0];

svm_problem prob = new svm_problem();
prob.l = 4;
prob.y = new double[prob.l];
prob.x = new svm_node[prob.l][2];
for (int i = 0; i < prob.l; i++)
{
    prob.x[i][0] = new svm_node();
    prob.x[i][1] = new svm_node();
    prob.x[i][0].index = 1;
    prob.x[i][1].index = 2;
    prob.x[i][0].value = (i % 2 != 0) ? -1 : 1;
    prob.x[i][1].value = (i / 2 % 2 == 0) ? -1 : 1;
    prob.y[i] = (prob.x[i][0].value == 1 && prob.x[i][1].value == 1) ? 1 : -1;
    System.out.println("X = [ " + prob.x[i][0].value + ", " + prob.x[i][1].value + " ] \t -> " + prob.y[i]);
}

svm_model model = svm.svm_train(prob, param);
int test_length = 4;
for (int i = 0; i < test_length; i++)
{
    svm_node[] x_test = new svm_node[2];
    x_test[0] = new svm_node();
    x_test[1] = new svm_node();
    x_test[0].index = 1;
    x_test[0].value = (i % 2 != 0) ? -1 : 1;
    x_test[1].index = 2;
    x_test[1].value = (i / 2 % 2 == 0) ? -1 : 1;
    double d = svm.svm_predict(model, x_test);
    System.out.println("X[0] = " + x_test[0].value + " X[1] = " + x_test[1].value + "\t\t\t Y = "
        + ((x_test[0].value == 1 && x_test[1].value == 1) ? 1 : -1) + "\t\t\t The prediction = " + d);
}
Since I am testing on the same training data, I'd expect to get 100% accuracy, but the output that I get is the following:
X = [ 1.0, -1.0 ] -> -1.0
X = [ -1.0, -1.0 ] -> -1.0
X = [ 1.0, 1.0 ] -> 1.0
X = [ -1.0, 1.0 ] -> -1.0
*
optimization finished, #iter = 1
nu = 0.5
obj = -20000.0, rho = 1.0
nSV = 2, nBSV = 2
Total nSV = 2
X[0] = 1.0 X[1] = -1.0 Y = -1 The prediction = -1.0
X[0] = -1.0 X[1] = -1.0 Y = -1 The prediction = -1.0
X[0] = 1.0 X[1] = 1.0 Y = 1 The prediction = -1.0
X[0] = -1.0 X[1] = 1.0 Y = -1 The prediction = -1.0
We can see that the following prediction is erroneous:
X[0] = 1.0 X[1] = 1.0 Y = 1 The prediction = -1.0
Does anyone know what the mistake in my code is?
You're using the radial basis function kernel (param.kernel_type = svm_parameter.RBF), which depends on gamma. With param.gamma = 0, the kernel exp(-gamma * ||u - v||^2) evaluates to 1 for every pair of points, so the model cannot separate anything. Setting param.gamma = 1 should yield 100% accuracy.
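If you want to sanity-check the gamma fix outside of Java, here is a rough equivalent of the same 2x2 toy problem using scikit-learn's SVC (my own translation; the decision values will differ from libSVM's Java defaults, but the labels should match):
from sklearn.svm import SVC
import numpy as np

# Same four training points as the Java example: label 1 only when
# both features are 1.
X = np.array([[1.0, -1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])
y = np.array([-1, -1, 1, -1])

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print(clf.predict(X))  # expected: [-1 -1  1 -1]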

How to create a Cartesian product [duplicate]

I have a list of integers, a = [0, ..., n]. I want to generate all possible combinations of k elements from a, i.e., the Cartesian product of a with itself k times. Note that n and k are both changeable at runtime, so this needs to be at least a somewhat adjustable function.
So if n were 3 and k were 2:
a = [0, 1, 2, 3]
k = 2
desired = [(0,0), (0, 1), (0, 2), ..., (2,3), (3,0), ..., (3,3)]
In Python I would use the itertools.product() function:
for p in itertools.product(a, repeat=2):
    print(p)
What's an idiomatic way to do this in Go?
My initial guess is a closure that returns a slice of integers, but it doesn't feel very clean.
For example,
package main

import "fmt"

// nextProduct returns a function that, on each call, yields the next
// element of the Cartesian product a^r, and an empty slice once the
// products are exhausted.
func nextProduct(a []int, r int) func() []int {
    p := make([]int, r)
    x := make([]int, len(p))
    return func() []int {
        p := p[:len(x)]
        for i, xi := range x {
            p[i] = a[xi]
        }
        // Odometer-style increment of the index vector x.
        for i := len(x) - 1; i >= 0; i-- {
            x[i]++
            if x[i] < len(a) {
                break
            }
            x[i] = 0
            if i <= 0 {
                x = x[0:0]
                break
            }
        }
        return p
    }
}

func main() {
    a := []int{0, 1, 2, 3}
    k := 2
    np := nextProduct(a, k)
    for {
        product := np()
        if len(product) == 0 {
            break
        }
        fmt.Println(product)
    }
}
Output:
[0 0]
[0 1]
[0 2]
[0 3]
[1 0]
[1 1]
[1 2]
[1 3]
[2 0]
[2 1]
[2 2]
[2 3]
[3 0]
[3 1]
[3 2]
[3 3]
The code to find the next product in lexicographic order is simple: starting from the right, find the first value that won't roll over when you increment it, increment that and zero the values to the right.
package main

import "fmt"

func main() {
    n, k := 5, 2
    ix := make([]int, k)
    for {
        fmt.Println(ix)
        j := k - 1
        for ; j >= 0 && ix[j] == n-1; j-- {
            ix[j] = 0
        }
        if j < 0 {
            return
        }
        ix[j]++
    }
}
I've changed "n" to mean the set is [0, 1, ..., n-1] rather than [0, 1, ..., n] as given in the question, because the latter is confusing: it has n+1 elements.
Just follow the answer to Implement Ruby style Cartesian product in Go; you can run it at http://play.golang.org/p/NR1_3Fsq8F:
package main

import "fmt"

// NextIndex sets ix to the lexicographically next value,
// such that for each i>0, 0 <= ix[i] < lens.
func NextIndex(ix []int, lens int) {
    for j := len(ix) - 1; j >= 0; j-- {
        ix[j]++
        if j == 0 || ix[j] < lens {
            return
        }
        ix[j] = 0
    }
}

func main() {
    a := []int{0, 1, 2, 3}
    k := 2
    lens := len(a)

    r := make([]int, k)
    for ix := make([]int, k); ix[0] < lens; NextIndex(ix, lens) {
        for i, j := range ix {
            r[i] = a[j]
        }
        fmt.Println(r)
    }
}