Understanding this application of a Naive Bayes Classifier - bayesian

I'm a little confused with this example I've been following online. Please correct me if anything is wrong before I get to my question! I know Bayes theorem is this:
P(A|B) = P(B|A) * P(A) / P(B)
In the example I'm looking at, classifying is being done on text documents. The text documents are all either "terrorism" or "entertainment", so:
Prior probability for either, i.e. P(A) = 0.5
There are six documents with word frequencies like so:
The example goes on to break down the frequency of these words in relation to each class, applying Laplace estimation:
So to my understanding each of these numbers represents the P(B|A), i.e. the probability of that word appearing given a particular class (either terrorism or entertainment).
Now a new document arrives, with this breakdown:
The example calculates the probability of this new text document relating to terrorism by doing this:
P(Terrorism | W) = P(Terrorism) x P(kill | Terrorism) x P(bomb | Terrorism) x P(kidnap | Terrorism) x P(music | Terrorism) x P(movie | Terrorism) x P(TV | Terrorism)
which works out as:
0.5 x 0.2380 x 0.1904 x 0.3333 x 0.0476 x 0.0952 x 0.0952
Again, up to now I think I'm following. The P(Terrorism | W) is P(A|B), P(Terrorism) = P(A) = 0.5, and P(B|A) is all the "terrorism" results in the table above multiplied together.
But to apply it to this new document, the example calculates each of the P(B|A) above to the power of the new frequency. So the above calculation becomes:
0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0476^0 x 0.0952^0 x 0.0952^1
From there they do a few sums which I get and find the answer. My question is:
Where in the formula does it say to apply the new frequency as a power to the current P(B|A)?
Is this just something statistical I don't know about? Is this universal or just a particular example of how to do it? I'm asking because all the examples I find are slightly different, using slightly different keywords and terms and I'm finding it just a tad confusing!

First of all, the formula
P(Terrorism | W) = P(Terrorism) x P(kill | Terrorism) x P(bomb | Terrorism) x P(kidnap | Terrorism) x P(music | Terrorism) x P(movie | Terrorism) x P(TV | Terrorism)
isn't quite right. You need to divide that by P(W). But you hint that this is taken care of later when it says that "they do a few sums", so we can move on to your main question.
Traditionally when doing Naive Bayes on text classification, you only look at the existence of words, not their counts. Of course you need the counts to estimate P(word | class) at train time, but at test time P("music" | Terrorism) typically means the probability that the word "music" is present at least once in a Terrorism document.
It looks like what the implementation you are dealing with is doing is it's trying to take into account P("occurrences of kill" = 2 | Terrorism) which is different from P("at least 1 occurrence of kill" | Terrorism). So why do they end up raising probabilities to powers? It looks like their reasoning is that P("kill" | Terrorism) (which they estimated at train time) represents the probability of an arbitrary word in a Terrorism document to be "kill". So by simplifying assumption, the probability of a second arbitrary word in a Terrorism document to be "kill" is also P("kill" | Terrorism).
This leaves a slight problem for the case where a word does not occur in a document. With this scheme, the corresponding probability is raised to the 0th power, so it simply goes away; in effect, it approximates P("occurrences of music" = 0 | Terrorism) = 1. Strictly speaking this is false in general, since it would imply that P("occurrences of music" > 0 | Terrorism) = 0.
But for real-world examples where you have long documents and thousands or tens of thousands of words, most words don't occur in most documents. So instead of bothering with accurately calculating all those probabilities (which would be computationally expensive), they are basically swept under the rug, because for the vast majority of cases it wouldn't change the classification outcome anyway. Also note that on top of being computationally intensive, the accurate approach is numerically unstable: if you multiply thousands or tens of thousands of numbers less than 1 together, you will underflow and get 0; if you do it in log space, you are still adding tens of thousands of numbers together, which has to be handled delicately from a numerical-stability point of view.
So the "raising it to a power" scheme inherently removes unnecessary fluff, decreasing computational cost, increasing numerical stability, and still yields nearly identical results.
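As an illustration, here is a minimal sketch of that "count as a power" scheme, done in log space for stability; the per-word probabilities and the new document's counts are the ones quoted above, and this only computes the unnormalised numerator (you would still divide by P(W), i.e. compare against the analogous Entertainment score):
import math

# Word probabilities estimated at train time (the Terrorism column of the table),
# and the word counts of the new document.
p_word_given_terrorism = {"kill": 0.2380, "bomb": 0.1904, "kidnap": 0.3333,
                          "music": 0.0476, "movie": 0.0952, "TV": 0.0952}
new_doc_counts = {"kill": 2, "bomb": 1, "kidnap": 2, "music": 0, "movie": 0, "TV": 1}

def unnormalised_score(prior, p_word_given_class, counts):
    # Multiplying many numbers < 1 underflows, so sum logs instead.
    log_score = math.log(prior)
    for word, count in counts.items():
        # A count of 0 contributes nothing -- the "swept under the rug" case.
        log_score += count * math.log(p_word_given_class[word])
    return math.exp(log_score)

score = unnormalised_score(0.5, p_word_given_terrorism, new_doc_counts)
# score reproduces 0.5 x 0.2380^2 x 0.1904^1 x 0.3333^2 x 0.0952^1 from the question.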
I hope the NSA doesn't think I'm a terrorist for having used the word Terrorism so much in this answer :S

Related

Self-Attention Explainability of the Output Score Matrix

I am learning about attention models, and following along with Jay Alammar's amazing blog tutorial on The Illustrated Transformer. He gives a great walkthrough for how the attention scores are calculated, but I get a bit lost at a certain point, and am not seeing how the attention score Z matrix he explains is used to interpret strength of associations between different words within an input sequence.
He mentions that given some input matrix X, with shape N x D, where N is the number of elements in an input sequence and D is the input dimensionality, we multiply X with three separate weight matrices of shape D x d, where d is some lower dimensionality that represents the projected space of the query, key, and value matrices.
The query and key matrices are dotted, divided by a scaling factor (usually the square root of the projected dimensionality), and then run through a softmax function. This produces a weight matrix of size N x N, which is multiplied by the value matrix to get an output Z of shape N x d, which Jay says
That concludes the self-attention calculation. The resulting vector is
one we can send along to the feed-forward neural network.
He shows a screenshot of this calculation in the blog post.
However, this is where I'm confused. Z is N x d, and I don't understand what I'm supposed to do with this matrix from an interpretability standpoint. As far as I understand, for a particular sequence element (i.e. the word "cats" in the sequence "I love pets, especially cats"), self-attention is supposed to score other parts of the sequence highly when they are relevant to or strongly associated with that word embedding. I'd therefore expect Z to be N x N, so that I could select Z[i,j] and say that, for the i-th word in the sequence, the j-th word relates or associates with it this or that much.
In fact, wouldn't it make much more sense to use only the softmax output of the weights (without multiplying them by the value matrix), since it is already N x N? In essence, how is Jay determining the strength of these associations with the word "it" in this particular sequence?
What he is showing there is an N-by-1 relationship: N values giving the strength of association between the word "it" and each token of the sequence. Those values come from the N x N softmax weight matrix (the attention weights), not from Z; the visualization for "it" is the row of that matrix corresponding to "it".
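As a sanity check, here is a small NumPy sketch (random matrices, arbitrary shapes, not Jay's code) showing where the N x N weight matrix appears and how one row of it is read off for a single token:
import numpy as np

N, D, d = 5, 16, 8                  # sequence length, input dim, projected dim (arbitrary)
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))         # stand-in for the embedded input sequence
W_q = rng.normal(size=(D, d))
W_k = rng.normal(size=(D, d))
W_v = rng.normal(size=(D, d))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)                         # N x N
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)         # row-wise softmax, still N x N
Z = weights @ V                                       # N x d, what gets fed onward

# Interpretability lives in `weights`, not in Z:
# weights[i, j] is how strongly token i attends to token j, so one row of it
# gives the N association strengths for a single word such as "it".
row_for_one_token = weights[-1]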

tensorflow crossed column feature with vocabulary list for cross terms

How would I make a crossed_column with a vocabulary list for the crossed terms? That is, suppose that I have two categorical columns
animal [dog, cat, puma, other]
food [pizza, salad, quinoa, other]
and now I want to make the crossed column, animal x food - but I've done some frequency counts of the training data (in spark before exporting tfrecords for training tensorflow models), and puma x quinoa only showed up once, and cat x quinoa never showed up. So I don't want to generate features for them, I don't think I have enough training examples to learn what their weights should be. What I'd like is for both of them to get absorbed in the "other x other" feature -- the thought that I'll learn some kind of average weight for a feature that covers all the infrequent terms.
It doesn't look like I can do that with tf.feature_column.crossed_column -- any idea how I would do this kind of thing in tensorflow?
Or, should I not worry about it? If I cross all the features I'd get 20, but there are only 18 that I think are important - so maybe set the hash map size to 18, or less, causing collisions? Then include the first-order columns, animal and food, so the model can figure out what it is looking at? This is the approach I'm getting from reading the docs. I like it because it is simpler, but I am concerned about model accuracy.
I think what I really want is some kind of sparse table lookup, rather than hashing the cross -- imagine I have
column A - integer Ids, 1 to 10,000
column B - integer Ids, 1 to 10,000
column C - integer Ids, 1 to 10,000
and there are only 1 million of the 1 trillion possible crosses between A,B,C that I want to make features for -- all the rest will go into the 1 million + 1 other x other x other feature, how would I do that in tensorflow?
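Roughly, what I'm imagining is something like the sketch below, assuming I precompute the cross as a single string column during the Spark step (the key name and vocabulary here are made up for illustration):
import tensorflow as tf

# The frequent crosses I actually want features for; everything else should
# collapse into one shared "other x other" bucket.
frequent_crosses = [
    "dog|pizza", "dog|salad", "cat|pizza",   # ... extended to the ~18 frequent pairs
]

animal_x_food = tf.feature_column.categorical_column_with_vocabulary_list(
    key="animal_x_food",                 # precomputed as "animal|food" upstream
    vocabulary_list=frequent_crosses,
    num_oov_buckets=1)                   # rare/unseen pairs all land in one bucket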

Genetic algorithm for optimization in game playing agent heuristic evaluation function

This is in response to an answer given in this question:
How to create a good evaluation function for a game?, particularly by #David (it is the first answer).
Background: I am using a genetic algorithm to optimize the hyper parameters in a game playing agent that is using minimax / alpha beta pruning (with iterative deepening). In particular, I would like to optimize the heuristic (evaluation) function parameters using a genetic algorithm. The evaluation function I use is:
f(w) = w * num_my_moves - (1-w) * num_opponent_moves
The only parameter to optimize is w in [0,1].
Here's how I programmed the genetic algorithm:
Create a random population of say 100 agents
Let them play 1000 games at random with replacement.
Let the parents be the top performing agents with some poorer performing agents mixed in for genetic diversity.
Randomly breed some parents to create children. Breeding process: we define a child to be the average of the weights of its parents,
i.e. childWeight = 0.5*(father.w + mother.w)
The new population is formed by the parents and the newly created children.
Randomly mutate 1% of the population as follows: newWeight = agent.w + random.uniform(-0.01, 0.01), and account for the trivial border cases (i.e. less than zero and greater than one) appropriately.
Evolve 10 times (i.e. repeat for new population)
My question: Please evaluate the steps above, particularly the breeding and mutation steps. Does anyone have a better way to breed (rather than trivially averaging the parent weights), and does anyone have a better way to mutate, rather than just adding random.uniform(-0.01, 0.01)?
It looks like you're not actually applying a genetic algorithm to your agents, but rather just simple evolution directly on the phenotype/weights. I suggest you try introducing a genetic representation of your weights, and evolve this genome instead. An example would be to represent your weights as a binary string, and apply evolution to each bit of the string, meaning there is a likelihood that each bit gets mutated. These are called point mutations. There are many other mutations you can apply, but this would do as a start.
What you will notice is that your agents don't get stuck in local minima as much because sometimes a small genetic change can vastly change the phenotype/weights.
Ok, that might sound complicated, it's not really. Let me give you an example:
Say you have a weight of 42 in base 10. This would be 101010 in binary. Now you have implemented a 1% mutation rate on each bit of the binary representation. Let's say the last bit is flipped. Then we have 101011 in binary, or 43 in decimal. Not such a big change. Doing the same with the second bit on the other hand gives you 111010 in binary or 58 decimal. Notice the big jump. This is what we want, and lets your agent population search a larger part of the solution space faster.
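A minimal sketch of this kind of point mutation, using the 6-bit encoding and the 1% per-bit rate from the example:
import random

def point_mutate(genome, rate=0.01):
    # Flip each bit of a binary-string genome with probability `rate`.
    return ''.join(bit if random.random() > rate else str(1 - int(bit))
                   for bit in genome)

genome = format(42, '06b')          # '101010'
mutated = point_mutate(genome)      # occasionally e.g. '111010', i.e. 58 -- a big jump
weight = int(mutated, 2)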
With regard to breeding, you can try crossover. Let's assume you have many weights, each with a genetic encoding. If you represent the whole genome (all the binary data) as one long binary string, you can combine sections of the two parents' genomes. An example, again. The following are the "father" and "mother" genomes and phenotypes:
Weight Name: W1 W2 W3 W4 W5
Father Phenotype: 43 15 34 17 14
Father Genome: 101011 001111 100010 010001 001110
Mother Genome: 100110 100111 011001 010100 101000
Mother Phenotype: 38 39 25 20 40
What you can do is draw arbitrary lines through both genomes at the same place, and assign the segments arbitrarily to the child. This is a version of crossover.
Weight Name: W1 W2 W3 W4 W5
Father Genome: 101011 00.... ...... .....1 001110
Mother Genome: ...... ..0111 011001 01010. ......
Child Genome: 101011 000111 011001 010101 001110
Child Phenotype: 43 7 25 21 14
Here the first 8 and the last 7 bits come from the father, and the middle comes from the mother. Notice how weight W1 and W5 are entirely from the father, and W3 is entirely from the mother. While W2 and W4 are combinations. W4 had hardly any change, while W2 has changed drastically.
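For illustration, here is a sketch of the simplest version of this (a single cut point), using the father and mother genomes from the table above; the two-cut variant shown above works the same way:
import random

def crossover(father, mother):
    # Single-point crossover on two equal-length binary genome strings.
    cut = random.randrange(1, len(father))
    return father[:cut] + mother[cut:]

father = "101011" + "001111" + "100010" + "010001" + "001110"
mother = "100110" + "100111" + "011001" + "010100" + "101000"
child = crossover(father, mother)
# Decode the child genome back into five 6-bit weights:
child_weights = [int(child[i:i+6], 2) for i in range(0, len(child), 6)]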
I hope this gives you some insight in how to do genetic algorithms. That said, I recommend using a modern library instead of implementing it yourself, unless you are doing it to learn.
Edit: More on handling the weights/binary representation:
If you need fractions, you can do this by separating the numerator and denominator as different weights, or by making one of them a constant (e.g., 42 and 10 gives 4.2).
Greater-than-zero constraints come for free. To actually get negative numbers you need to negate your weights.
A less-than-1 constraint you can get by dividing the weight by the maximum possible value for that bit-string length. In the examples above you have 6 bits, which gives a maximum of 63. If after mutation you get the binary string 101010, i.e. 42 in base 10, you compute 42/63, getting 0.667, and the value can only ever reach 1.0, at 63/63.
Two weights' sum equal to 1? If you get 101010 and 001000 for W1 and W2, it gives 42 and 8, then you can go W1_scaled = W1 / (W1 + W2) = 0.84 and W2_scaled = W2 / (W1 + W2) = 0.16. This should give you W1_scaled + W2_scaled = 1 always.
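Putting those last two points into code, as a small sketch using the 6-bit strings from the examples:
def to_unit_interval(bits):
    # 6-bit string -> weight in [0, 1]; 63 ('111111') is the largest 6-bit value.
    return int(bits, 2) / 63.0

w = to_unit_interval("101010")                         # 42/63 = 0.667, as above

# Two weights constrained to sum to 1, as in the W1/W2 example:
w1_raw, w2_raw = int("101010", 2), int("001000", 2)    # 42 and 8
w1_scaled = w1_raw / (w1_raw + w2_raw)                 # 0.84
w2_scaled = w2_raw / (w1_raw + w2_raw)                 # 0.16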
Since I was mentioned.
Rather than averaging the parent weights, I picked random numbers using the parent weights as a min/max. I additionally found I had to widen the range slightly (by roughly sqrt(2), compensating for the reduction in standard deviation you get when averaging two uniform random numbers, though I probably wasn't exact) to resist the pull toward the average. Otherwise the population converges toward the average and can't escape.
So if the parents' weights were 0.1 and 0.2, it might pick a random number between 0.08 and 0.22 for the child weight.
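A rough sketch of that rule (the sqrt(2) widening factor and the clamp to [0, 1] are taken from the description above, so treat the exact numbers as approximate):
import random

WIDEN = 2 ** 0.5   # rough compensation for the reduced spread when combining parents

def breed(father_w, mother_w):
    mid = (father_w + mother_w) / 2
    half = abs(father_w - mother_w) / 2 * WIDEN
    child_w = random.uniform(mid - half, mid + half)
    return min(1.0, max(0.0, child_w))    # keep the weight inside [0, 1]

breed(0.1, 0.2)   # something roughly in the 0.08 to 0.22 range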
Late edit: A more accepted, studied, understood approach that I didn't know at the time is something called "Differential Evolution".

sampling 2-dimensional surface: how many sample points along X & Y axes?

I have a set of the first 25 Zernike polynomials. A few are shown below in the Cartesian co-ordinate system:
z2 = 2*x
z3 = 2*y
z4 = sqrt(3)*(2*x^2+2*y^2-1)
:
:
z24 = sqrt(14)*(15*(x^2+y^2)^2-20*(x^2+y^2)+6)*(x^2-y^2)
I am not using 1st since it is piston; so I have these 24 two-dim ANALYTICAL functions expressed in X-Y Cartesian co-ordinate system. All are defined over unit circle, as they are orthogonal over unit circle. The problem which I am describing here is relevant to other 2D surfaces also apart from Zernike Polynomials.
Suppose that origin (0,0) of the XY co-ordinate system and the centre of the unit circle are same.
Next, I take linear combination of these 24 polynomials to build a 2D wavefront shape. I use 24 random input coefficients in this combination.
w(x,y) = sum_over_i a_i*z_i (i=2,3,4,....24)
a_i = random coefficients
z_i = zernike polynomials
Up to this point, everything is the analytical part, which can be done on paper.
Now comes the discretization!
I know that when you want to reconstruct a signal (1-D/2-D), your sampling frequency should be at least twice the maximum frequency present in the signal (the Nyquist-Shannon principle).
Here the signal is w(x,y) as mentioned above, which is nothing but a simple 2-D function of x & y. I want to represent it on a computer now. Obviously I cannot take all the infinitely many points from -1 to +1 along the x-axis, and the same for the y-axis.
I have to take a finite no. of data points (which are called sample points, or just samples) on this analytical 2-D surface w(x,y).
I am measuring x & y in metres, and -1 <= x <= +1; -1 <= y <= +1.
e.g. If I divide my x-axis from -1 to 1 into 50 sample points, then dx = 2/50 = 0.04 metres. Same for the y-axis. Now my sampling frequency is 1/dx, i.e. 25 samples per metre. Same for the y-axis.
But I took 50 samples arbitrarily; I could have taken 10 samples or 1000 samples. That is the crux of the matter here: how many sample points? How will I determine this number?
There is one theorem (the Nyquist-Shannon theorem) mentioned above which says that if I want to reconstruct w(x,y) faithfully, I must sample it on both axes so that my sampling frequency (i.e. no. of samples per metre) is at least twice the maximum frequency present in w(x,y). This is nothing but finding the power spectrum of w(x,y). The idea is that any function in the space domain can also be represented in the spatial-frequency domain, which is nothing but taking the Fourier transform of the function! This tells us how many (spatial) frequencies are present in your function w(x,y) and what the maximum frequency among them is.
Now my question is how to find out this maximum frequency in my case. I can not use MATLAB fft2() or any other tool, since that would mean I have already taken samples across the wavefront!! Obviously the remaining option is to find it analytically! But that is time consuming and difficult, since I have 24 polynomials and I would have to use the continuous Fourier transform, i.e. go for pen and paper.
Any help will be appreciated.
Thanks
Key Assumptions
You want to use the "Nyquist-Shannon" theorem to determine sampling frequency
Obviously the remaining option is to find it analytically! But that is time
consuming and difficult since I have 21 polynomials & I have to use the
continuous Fourier transform, i.e. do it analytically.
Given the assumption I have made (and noting that consideration of other mathematical techniques is out of scope for StackOverflow), you have no option but to calculate the continuous Fourier Transform.
However, I believe you haven't considered all the options for calculating the transform beyond a laborious paper exercise, e.g.:
Numerical approximation of the continuous F.T. using code
Symbolic Integration e.g. Wolfram Alpha
Surely a numerical approximation of the Fourier Transform will be adequate for your solution?
I am assuming this is for coursework or research rather than production, so all you really care about as a physicist is the quickest solution that is accurate within the scope of your problem.
So to conclude, IMHO, don't waste time searching for a more mathematically elegant solution or trick, and just solve the problem with one of the above methods.
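As a concrete example of the first option, a rough numerical sketch might look like the following; the grid size, cutoff threshold, and the particular combination of Zernike terms are placeholders you would replace with your own:
import numpy as np

def w(x, y):
    # Placeholder wavefront: a_2*z2 + a_3*z3 + a_4*z4 with all coefficients set to 1.
    return 2*x + 2*y + np.sqrt(3)*(2*x**2 + 2*y**2 - 1)

n = 1024                                   # deliberately much finer than any sampling you plan
x = np.linspace(-1, 1, n)
xx, yy = np.meshgrid(x, x)
field = np.where(xx**2 + yy**2 <= 1, w(xx, yy), 0.0)   # defined on the unit circle only

spectrum = np.abs(np.fft.fftshift(np.fft.fft2(field)))
freqs = np.fft.fftshift(np.fft.fftfreq(n, d=x[1] - x[0]))   # cycles per metre
fx, fy = np.meshgrid(freqs, freqs)

significant = spectrum > 1e-3 * spectrum.max()       # arbitrary cutoff; the hard circular
f_max = np.sqrt(fx**2 + fy**2)[significant].max()    # edge broadens the spectrum, so tune it
# Nyquist then suggests sampling at more than 2*f_max samples per metre along each axis.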

Normal Distribution function

edit
So based on the answers so far (thanks for taking your time) I'm getting the sense that I'm probably NOT looking for a Normal Distribution function. Perhaps I'll try to re-describe what I'm looking to do.
Let's say I have an object that returns a number from 0 to 10, and that number controls "speed". However, instead of 10 being the top speed, I need 5 to be the top speed, and anything lower or higher would slow down accordingly (with easing, thus the bell curve).
I hope that's clearer ;/
-original question
These are the times I wish I remembered something from math class.
I'm trying to figure out how to write a function in Obj-C where I define the boundaries, e.g. (0 - 10), and then if x = foo, y = ? ... where x runs something like 0,1,2,3,4,5,6,7,8,9,10 and y runs 0,1,2,3,4,5,4,3,2,1,0, but only on a curve.
Something like the attached image.
I tried googling for Normal Distribution but it's way over my head. I was hoping to find some site that lists useful algorithms like these, but wasn't very successful.
So can anyone help me out here ? And if there is some good sites which shows useful mathematical functions, I'd love to check them out.
TIA!!!
-added
I'm not looking for a random number. I'm looking for, e.g.: if x=0 y should be 0, if x=5 y should be 5, if x=10 y should be 0 ... and all those other not-so-obvious in-between numbers.
(Image: http://dizy.cc/slider.gif)
Okay, your edit really clarifies things. You're not looking for anything to do with the normal distribution, just a nice smooth little ramp function. The one Paul provides will do nicely, but is tricky to modify for other values. It can be made a little more flexible (my code examples are in Python, which should be very easy to translate to any other language):
def quarticRamp(x, b=10, peak=5):
    if not 0 <= x <= b:
        raise ValueError  # or return 0
    return peak*x*x*(x-b)*(x-b)*16/(b*b*b*b)
Parameter b is the upper bound for the region you want to have a slope on (10, in your example), and peak is how high you want it to go (5, in the example).
Personally I like a quadratic spline approach, which is marginally cheaper computationally and has a different curve to it (this curve is really nice to use in a couple of special applications that don't happen to matter at all for you):
def quadraticSplineRamp(x, a=0, b=10, peak=5):
    if not a <= x <= b:
        raise ValueError  # or return 0
    if x > (b+a)/2:
        x = a + b - x          # mirror the right half onto the left
    z = 2*(x-a)/(b-a)          # normalise position to [0, 1]
    if z > 0.5:
        return peak * (1 - 2*(z-1)*(z-1))
    else:
        return peak * (2*z*z)
This is similar to the other function, but takes a lower bound a (0 in your example). The logic is a little more complex because it's a somewhat-optimized implementation of a piecewise function.
The two curves have slightly different shapes; you probably don't care what the exact shape is, and so could pick either. There are an infinite number of ramp functions meeting your criteria; these are two simple ones, but they can get as baroque as you want.
The thing you want to plot is the probability density function (pdf) of the normal distribution. You can find it on the mighty Wikipedia.
Luckily, the pdf for a normal distribution is not difficult to implement - some of the other related functions are considerably worse because they require the error function.
To get a plot like you showed, you want a mean of 5 and a standard deviation of about 1.5. The median is obviously the centre, and figuring out an appropriate standard deviation given the left & right boundaries isn't particularly difficult.
A function to calculate the y value of the pdf given the x coordinate, standard deviation and mean might look something like:
#include <math.h>

double normal_pdf(double x, double mean, double std_dev) {
    return 1.0/(sqrt(2*M_PI)*std_dev) *
           exp(-(x-mean)*(x-mean)/(2*std_dev*std_dev));
}
A normal distribution is never equal to 0. Please make sure that what you want to plot is indeed a normal distribution.
If you're only looking for this bell shape (with the tangent behaviour and everything), you can use the following formula:
x^2*(x-10)^2 for x between 0 and 10
0 elsewhere
(Divide by 125 if you need to have your peak at 5.)
double bell(double x) {
    if ((x < 10) && (x > 0))
        return x*x*(x-10.)*(x-10.)/125.;
    else
        return 0.;
}
Well, there's good old Wikipedia, of course. And Mathworld.
What you want is a random number generator for "generating normally distributed random deviates". Since Objective C can call regular C libraries, you either need a C-callable library like the GNU Scientific Library, or for this, you can write it yourself following the description here.
Try simulating rolls of dice by generating random numbers between 1 and 6. If you add up the results of 5 independent dice rolls, you'll get a surprisingly good approximation to the normal distribution. You can roll more dice if you'd like, and you'll get a better approximation.
Here's an article that explains why this works. It's probably more mathematical detail than you want, but you could show it to someone to justify your approach.
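A quick sketch of the dice idea (in Python rather than Objective-C, just to show the mechanics):
import random

def approx_normal_sample(n_dice=5):
    # The sum of several independent dice rolls is approximately normal (central limit theorem).
    return sum(random.randint(1, 6) for _ in range(n_dice))

samples = [approx_normal_sample() for _ in range(10000)]
# The mean is about n_dice * 3.5; histogram `samples` to see the bell shape emerge.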
If what you want is the value of the probability density function, p(x), of a normal (Gaussian) distribution of mean mu and standard deviation sigma at x, the formula is
p(x) = exp( -((x-mu)^2)/(2*sigma^2) ) / (sigma * sqrt(2*pi))
where pi is the area of a circle divided by the square of its radius (approximately 3.14159...). Using the C standard library math.h, this is:
#include <math.h>

double normal_pdf(double x, double mu, double sigma) {
    double n = sigma * sqrt(2 * M_PI);                      // normalization factor
    double p = exp( -pow(x-mu, 2) / (2 * pow(sigma, 2)) );  // unnormalized pdf
    return p / n;
}
Of course, you can do the same in Objective-C.
For reference, see the Wikipedia or MathWorld articles.
It sounds like you want to write a function that yields a curve of a specific shape. Something like y = f(x), for x in [0:10]. You have a constraint on the max value of y, and a general idea of what you want the curve to look like (somewhat bell-shaped, y=0 at the edges of the x range, y=5 when x=5). So roughly, you would call your function iteratively with the x range, with a step that gives you enough points to make your curve look nice.
So you really don't need random numbers, and this has nothing to do with probability unless you want it to (as in, you want your curve to look like the outline of a normal distribution or something along those lines).
If you have a clear idea of what function will yield your desired curve, the code is trivial - a function to compute f(x) and a for loop to call it the desired number of times for the desired values of x. Plot the x,y pairs and you're done. So that's your algorithm - call a function in a for loop.
The contents of the routine implementing the function will depend on the specifics of what you want the curve to look like. If you need help on functions that might return a curve resembling your sample, I would direct you to the reading material in the other answers. :) However, I suspect that this is actually an assignment of some sort, and that you have been given a function already. If you are actually doing this on your own to learn, then I again echo the other reading suggestions.
y=-1*abs(x-5)+5