NumPy Difference Between np.average() and np.mean() [duplicate]

This question already has answers here:
np.mean() vs np.average() in Python NumPy?
(5 answers)
Closed 4 years ago.
NumPy has two different functions for calculating an average:
np.average()
and
np.mean()
Since it is unlikely that NumPy would include a redundant feature, there must be a nuanced difference.
This was a concept I was very unclear on when starting data analysis in Python, so I decided to write a detailed self-answer here, as I am sure others are struggling with it too.

Short Answer:
'Mean' and 'Average' are two different things. People use them interchangeably, but they shouldn't. np.mean() gives you the arithmetic mean, whereas np.average() returns the arithmetic mean if you don't pass other parameters, but can also be used to take a weighted average.
Long Answer and Background:
Statistics:
Since NumPy is mostly used for working with data sets, it is important to understand the mathematical concept that causes this confusion. In simple mathematics and everyday life we use the words 'Average' and 'Mean' interchangeably, when this is not the case.
Mean: Commonly refers to the 'Arithmetic Mean', or the sum of a collection of numbers divided by the number of numbers in the collection.¹
Average: 'Average' can refer to many different calculations, of which the 'Arithmetic Mean' is one. Others include 'Median', 'Mode', 'Weighted Mean', 'Interquartile Mean' and many others.²
What This Means For NumPy:
Back to the topic at hand. Since NumPy is normally used in applications related to mathematics, it needs to be a bit more precise about the difference between average() and mean() than tools like Excel, which use AVERAGE() as the function for finding the 'Arithmetic Mean'.
np.mean()
In NumPy, np.mean() will allow you to calculate the 'Arithmetic Mean' across a specified axis.
Here's how you would use it:
myArray = np.array([[3, 4], [5, 6]])
np.mean(myArray)
There are also parameters for changing which dtype is used and which axis the function should compute along (the default is the flattened array).
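For instance, a minimal sketch of those parameters (using the array defined above; the commented results assume exactly that input):
import numpy as np

myArray = np.array([[3, 4], [5, 6]])

np.mean(myArray)                    # 4.5 -> mean of the flattened array
np.mean(myArray, axis=0)            # array([4., 5.]) -> column means
np.mean(myArray, axis=1)            # array([3.5, 5.5]) -> row means
np.mean(myArray, dtype=np.float64)  # force the type used for the computation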
np.average()
np.average() on the other hand allows you to take a 'Weighted Mean' in which different numbers in your array may have a different weight. For example, in the documentation we can see:
>>> data = list(range(1, 5))
>>> data
[1, 2, 3, 4]
>>> np.average(data)
2.5
>>> np.average(range(1,11), weights=range(10,0,-1))
4.0
For the last call, if you took an unweighted average you would expect the answer to be 5.5 (the plain mean of 1 through 10). However, it ends up being 4.0 because of the weights we applied to it.
If you don't have a good handle on what a 'weighted mean' is, let's try to simplify it:
Consider this a very elementary summary of our 'weighted mean'; it isn't perfectly rigorous, but it should let you visualize what we're discussing.
A mean is the sum of all numbers divided by how many numbers there are. This means they all have an equal weight, or are counted once. For our sample this means:
(1+2+3+4+5+6+7+8+9+10)/10 = 5.5
A weighted mean counts numbers at different weights. Because the weights in our example run from 10 down to 1, we can visualize the calculation as if each value were simply repeated as many times as its weight:
(1+1+1+1+1+1+1+1+1+1 + 2+2+2+2+2+2+2+2+2 + 3+3+3+3+3+3+3+3 + 4+4+4+4+4+4+4 + 5+5+5+5+5+5 + 6+6+6+6+6 + 7+7+7+7 + 8+8+8 + 9+9 + 10) / 55 = 220/55 = 4.0
Even though the actual data set contains only one instance of the number 1, we're counting it at ten times its normal weight. This can also go the other way: a fractional weight counts a number at, say, one third of its normal contribution.
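As a quick sanity check (a small sketch; the arrays simply reproduce the documentation example above), the expansion matches what np.average() computes:
import numpy as np

values = np.arange(1, 11)        # 1 .. 10
weights = np.arange(10, 0, -1)   # 10 .. 1

manual = (values * weights).sum() / weights.sum()    # 220 / 55 = 4.0
print(manual, np.average(values, weights=weights))   # 4.0 4.0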
If you don't provide a weights parameter to np.average(), it will simply give you the equal-weighted average across the flattened array, which is equivalent to np.mean().
Why Would I Ever Use np.mean()?
If np.average() can be used to find the flat arithmetic mean, you may be asking yourself "why would I ever use np.mean()?" np.mean() allows for a few useful parameters that np.average() does not. One of the key ones is the dtype parameter, which lets you set the data type used in the computation.
For example the NumPy docs give us this case:
Single precision:
>>> a = np.zeros((2, 512*512), dtype=np.float32)
>>> a[0, :] = 1.0
>>> a[1, :] = 0.1
>>> np.mean(a)
0.546875
Based on the calculation above it looks like our average is 0.546875, but if we set the dtype parameter to np.float64 we get a different result:
>>> np.mean(a, dtype=np.float64)
0.55000000074505806
The actual average is 0.55000000074505806.
Now, if you round both of these to two significant digits you get 0.55 in both cases. Where this accuracy becomes important is when you keep performing operations on that number, especially when dealing with very large (or very small) numbers that need high accuracy.
For example:
((((0.55000000074505806*184.6651)^5)+0.666321)/46.778) =
231,044,656.404611
((((0.55*184.6651)^5)+0.666321)/46.778) =
231,044,654.839687
Even in simpler equations you can end up being off by a few decimal places and that can be relevant in:
Scientific simulations: Due to lengthy equations, multiple steps and a high degree of accuracy needed.
Statistics: The difference between a few percentage points of accuracy can be crucial (for example in medical studies).
Finance: Continually being off by even a few cents in large financial models or when tracking large amounts of capital (banking/private equity) could result in hundreds of thousands of dollars in errors by the end of the year.
Important Word Distinction
Lastly, purely as a matter of interpretation: when you are asked to find the 'Average' of a dataset, a different kind of average may represent the data better. For example, np.median() may give a more representative summary than np.average() when there are outliers, so it is important to know the statistical difference.
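As a small illustration (the numbers here are made up purely for the example), one large outlier drags the mean far more than the median:
import numpy as np

data = np.array([30, 32, 35, 38, 40, 1000])   # one extreme outlier
print(np.mean(data))     # ~195.83 -- pulled up by the outlier
print(np.median(data))   # 36.5    -- closer to the 'typical' value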

Related

Can't correctly factorize a polynomial whose coefficients have decimals (non-integer), how can Maxima do it?

I'm trying to create a function that calculates the z-transform of a transfer function using the residues method. For that I need the factors of the characteristic equation and the powers of those factors, so I tried to factorize polynomials with non-integer coefficients, but after trying everything I read I couldn't make Maxima factorize those polynomials the way I need.
To give an example, I have this characteristic equation: "s·(s^2+0.1·s)". The factors should be "s^2" and "s + 0.1", but Maxima always gives me "(s^2·(10·s + 1))/10".
Why am I pointing this out? Maxima treats the output expression as a list, so I can take its length and separate the factors by their positions to measure the powers of the factors and do what I need. But since Maxima gives me the result shown above, the dimension of the list is different, which makes my function work differently and possibly produce errors.
That result is what Maxima gives no matter whether I use factor, gfactor, expand or any other function I found. I know it happens because Maxima rationalizes the polynomial before working with it, but I don't need that behaviour; I only need the pure factors. How can I get the result I want?
Thanks in advance for the help.

Explained variance calculation

My questions are specific to https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.
1) I don't understand why you square the eigenvalues here: https://github.com/scikit-learn/scikit-learn/blob/55bf5d9/sklearn/decomposition/pca.py#L444
2) Also, explained_variance is not computed for new transformed data, only for the original data used to compute the eigenvectors. Is that not normally done?
pca = PCA(n_components=2, svd_solver='full')
pca.fit(X)
pca.transform(Y)
In this case, won't you separately calculate explained variance for data Y as well? For that purpose, I think we would have to use point 3 instead of the eigenvalues.
3) Explained variance can also be computed by taking the variance of each axis in the transformed space and dividing by the total variance. Is there any reason that is not done here?
Answers to your questions:
1) The square roots of the eigenvalues of the scatter matrix (e.g. XX.T) are the singular values of X (see here: https://math.stackexchange.com/a/3871/536826). So you square them. Important: the initial matrix X should be centered (data has been preprocessed to have zero mean) in order for the above to hold.
2) Yes this is the way to go. explained_variance is computed based on the singular values. See point 1.
3) It's the same but in the case you describe you HAVE to project the data and then do additional computations. No need for that if you just compute it using the eigenvalues / singular values (see point 1 again for the connection between these two).
Finally, keep in mind that not everyone really wants to project the data. Someone can get just the eigenvalues and immediately estimate the explained variance WITHOUT projecting the data. So that's the gold-standard way to do it.
EDIT 1:
Answer to edited Point 2
No. PCA is an unsupervised method. It only transforms the X data, not the Y (labels).
Again, the explained variance can be computed quickly, easily, and with half a line of code using the eigenvalues/singular values, OR, as you said, using the projected data, e.g. by estimating the covariance of the projected data; the variances of the PCs will then be on the diagonal.
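A minimal sketch tying points 1-3 together (the toy data, shapes and seed are assumptions, not from the question); it checks that the singular-value route and the projected-variance route both reproduce scikit-learn's explained_variance_:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 5) @ rng.randn(5, 5)      # toy correlated data

pca = PCA(n_components=2, svd_solver='full').fit(X)

# Route 1: singular values of the *centered* data, squared and divided by (n - 1)
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
var_from_s = s[:2] ** 2 / (X.shape[0] - 1)

# Route 2: variance of each axis of the projected data
Z = pca.transform(X)
var_from_proj = Z.var(axis=0, ddof=1)

print(np.allclose(var_from_s, pca.explained_variance_))     # True
print(np.allclose(var_from_proj, pca.explained_variance_))  # True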

What is the significance of t-stats value while applying ttest_ind on two pandas series?

What conclusion can be drawn from the resulting t-statistic when ttest_ind is applied to two independent series?
As you can read here, scipy.stats.ttest_ind has two outputs:
The calculated t-statistic.
The two-tailed p-value.
Very intuitively, you can read the t-statistic as a normalized difference of averages of the two samples, taking their variances and sizes into account:
The larger the samples, the more meaningful the difference of averages is, because we have more evidence for it.
The larger the variances, the less meaningful the difference of averages is, because the absolute difference could be due to randomness alone.
The higher the absolute value of the t-statistic, the more significant the difference.
The p-value makes this intuition more explicit: it is the probability of observing a difference of averages at least this extreme if the true difference were zero. If the p-value is below a threshold, e.g. 0.05, we say that the difference is not zero (it is statistically significant).
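A short usage sketch (the two samples here are synthetic and only meant to illustrate reading the two outputs):
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=50)   # sample 1
b = rng.normal(loc=0.5, scale=1.0, size=50)   # sample 2, shifted mean

t_stat, p_value = ttest_ind(a, b)
print(t_stat, p_value)
if p_value < 0.05:
    print("The difference of means is statistically significant.")
else:
    print("No evidence that the means differ.")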

Emulating fixed precision in python

For a university course in numerical analysis we are transitioning from Maple to a combination of Numpy and Sympy for various illustrations of the course material. This is because the students already learn Python the semester before.
One of the difficulties we have is in emulating fixed precision in Python. Maple allows the user to specify a decimal precision (say 10 or 20 digits) and from then on every calculation is made with that precision so you can see the effect of rounding errors. In Python we tried some ways to achieve this:
Sympy has a rounding function to a specified number of digits.
Mpmath supports custom precision.
This is however not what we're looking for. These options calculate the exact result and round the exact result to the specified number of digits. We are looking for a solution that does every intermediate calculation in the specified precision. Something that can show, for example, the rounding errors that can happen when dividing two very small numbers.
The best solution so far seems to be the custom data types in NumPy. Using float16, float32 and float64 we were able to at least give an indication of what could go wrong. The problem here is that we always need to use arrays of one element and that we are limited to these three data types.
Does anything more flexible exist for our purpose? Or is the very thing we're looking for hidden somewhere in the mpmath documentation? Of course there are workarounds by wrapping every element of a calculation in a rounding function but this obscures the code to the students.
You can use decimal. There are several ways to use it, for example via localcontext or getcontext.
Example with getcontext from documentation:
>>> from decimal import *
>>> getcontext().prec = 6
>>> Decimal(1) / Decimal(7)
Decimal('0.142857')
Example of localcontext usage:
>>> from decimal import Decimal, localcontext
>>> with localcontext() as ctx:
... ctx.prec = 4
... print(Decimal(1) / Decimal(3))
...
0.3333
To reduce typing you can abbreviate the constructor (example from documentation):
>>> import decimal
>>> D = decimal.Decimal
>>> D('1.23') + D('3.45')
Decimal('4.68')
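To see the "every intermediate calculation at the specified precision" behaviour the question asks about, here is a tiny sketch: at 4 digits the intermediate division already loses information, so multiplying back does not recover 1:
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 4
    third = Decimal(1) / Decimal(3)   # Decimal('0.3333') -- already rounded
    print(third * 3)                  # Decimal('0.9999'), not 1

print((Decimal(1) / Decimal(3)) * 3)  # default 28 digits: Decimal('0.9999999999999999999999999999')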

Need help generating discrete random numbers from distribution

I searched the site but did not find exactly what I was looking for... I want to generate a discrete random number from a normal distribution.
For example, if I have a range with a minimum of 4, a maximum of 10, and an average of 7, what code or function call (Objective-C preferred) would I need to return a number in that range? Naturally, due to the normal distribution, more of the returned numbers should cluster around the average of 7.
As a second example, can the bell curve/distribution be skewed toward one end or the other? Let's say I need to generate a random number in a range with a minimum of 4 and a maximum of 10, and I want the majority of the numbers returned to center around 8, with a natural fall-off based on a skewed bell curve.
Any help is greatly appreciated....
Anthony
What do you need this for? Can you do it the craps player's way?
Generate two random integers in the range of 2 to 5 (inclusive, of course) and add them together. Or flip a coin (0,1) six times and add 4 to the result.
Summing multiple dice produces a normal distribution (a "bell curve"), while eliminating high or low throws can be used to skew the distribution in various ways.
The key is you are going for discrete numbers (and I hope you mean integers by that). Multiple dice throws famously generate a normal distribution. In fact, I think that's how we were first introduced to the Gaussian curve in school.
Of course the more throws, the more closely you approximate the bell curve. Rolling a single die gives a flat line. Rolling two dice just creates a ramp up and down that isn't terribly close to a bell. Six coin flips gets you closer.
So consider this...
If I understand your question correctly, you only have seven possible outcomes--the integers (4,5,6,7,8,9,10). You can set up an array of seven probabilities to approximate any distribution you like.
Many frameworks and libraries have this built-in.
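For instance, a sketch using NumPy (the asker wanted Objective-C, but the idea translates directly; the seven weights here are made up and merely peak at 8):
import numpy as np

outcomes = np.arange(4, 11)                       # the integers 4..10
weights = np.array([1, 2, 4, 6, 8, 5, 2], float)  # assumed shape, peaking at 8
probs = weights / weights.sum()                   # normalize to probabilities

rng = np.random.default_rng()
samples = rng.choice(outcomes, size=1000, p=probs)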
Also, just like TokenMacGuy said, a normal distribution isn't characterized by the interval it's defined on, but rather by two parameters: mean μ and standard deviation σ. With these two parameters you can confine a certain quantile of the distribution to a certain interval, so that, say, 95% of all points fall in that interval. But restricting it completely to any interval other than (−∞, ∞) is impossible.
There are several methods to generate normally distributed values from uniform random values (which is what most random or pseudorandom number generators produce):
The Box-Muller transform is probably the easiest, although not exactly fast to compute. Depending on how many numbers you need it should be sufficient, though, and it is definitely very easy to write.
Another option is Marsaglia's polar method, which is usually faster.¹
A third method is the Ziggurat algorithm which is considerably faster to compute but much more complex to program. In applications that really use a lot of random numbers it may be the best choice, though.
As general advice, though: don't write it yourself if you have access to a library that already generates normally distributed random numbers for you.
For skewing your distribution I'd just use a regular normal distribution, choosing μ and σ appropriately for one side of your curve, then determine on which side of your desired mean a point fell and stretch it appropriately to fit your desired distribution.
For generating only integers I'd suggest you just round towards the nearest integer when the random number happens to fall within your desired interval and reject it if it doesn't (drawing a new random number then). This way you won't artificially skew the distribution (such as you would if you were clamping the values at 4 or 10, respectively).
¹ In testing with deliberately bad random number generators (yes, worse than RANDU) I've noticed that the polar method can end up in an endless loop, rejecting every sample. That won't happen with random numbers that fulfill the usual statistical expectations, though.
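A sketch of the round-and-reject approach described above (the μ and σ values here are assumptions, chosen to skew the mass toward 8):
import numpy as np

def discrete_skewed(lo=4, hi=10, mu=8.0, sigma=1.5, rng=None):
    # Round a normal draw to the nearest integer and reject anything
    # outside [lo, hi], redrawing until a value falls inside the range.
    rng = rng or np.random.default_rng()
    while True:
        n = int(round(rng.normal(mu, sigma)))
        if lo <= n <= hi:
            return n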
Yes, there are sophisticated mathematical solutions, but for "simple but practical" I'd go with Nosredna's comment. For a simple Java solution:
Random random = new Random();

public int bell7()
{
    int n = 4;
    for (int x = 0; x < 6; ++x)
        n += random.nextInt(2);
    return n;
}
If you're not a Java person, Random.nextInt(n) returns a random integer between 0 and n-1. I think the rest should be similar to what you'd see in any programming language.
If the range was large, then instead of nextInt(2)'s I'd use a bigger number in there so there would be fewer iterations through the loop, depending on frequency of call and performance requirements.
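If it helps, a rough Python equivalent of the same idea (the function name is mine, not from the answer):
import random

def bell7():
    # Six coin flips added to 4: an integer in 4..10 that clusters around 7.
    n = 4
    for _ in range(6):
        n += random.randint(0, 1)
    return n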
Dan Dyer and Jay are exactly right. What you really want is a binomial distribution, not a normal distribution. The shape of a binomial distribution looks a lot like a normal distribution, but it is discrete and bounded whereas a normal distribution is continuous and unbounded.
Jay's code generates a binomial distribution with 6 trials and a 50% probability of success on each trial. If you want to "skew" your distribution, simply change the line that decides whether to add 1 to n so that the probability is something other than 50%.
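A sketch of that tweak using NumPy's binomial sampler (the 2/3 probability is just an assumed example that shifts the mean to about 8):
import numpy as np

rng = np.random.default_rng()

symmetric = rng.binomial(6, 0.5, size=1000) + 4   # centered on 7, range 4..10
skewed = rng.binomial(6, 2/3, size=1000) + 4      # mass shifted toward 8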
The normal distribution is not described by its endpoints. Normally it's described by its mean (which you have given as 7) and its standard deviation. An important feature is that it is possible to get a value far outside the expected range from this distribution, although such values become vanishingly rare the further you get from the mean.
The usual means of getting a value from a distribution is to generate a random value from a uniform distribution, which is quite easily done with, for example, rand(), and then feed it to the inverse of the cumulative distribution function (the quantile function), which maps probabilities back to values of the distribution. For the normal distribution, the CDF you need to invert is
F(x) = 0.5 * (1 + erf( (x - μ)/(σ * sqrt(2.0)) ))
where erf() is the error function, which may be computed from its Taylor series:
erf(z) = (2/sqrt(π)) * Σ_{n=0}^{∞} (-1)^n * z^(2n+1) / (n! * (2n+1))
I'll leave it as an exercise to translate this into C.
If you prefer not to engage in the exercise, you might consider using the GNU Scientific Library, which, among many other features, can generate random numbers from many common distributions, of which the Gaussian distribution (hint) is one.
Obviously, all of these functions return floating point values. You will have to use some rounding strategy to convert to a discrete value. A useful (but naive) approach is to simply downcast to integer.