Calculate and return the average of positive, negative, and neutral - dataframe

I have the following dataframe (posted as an image, not reproduced here):
I am trying to add three columns that count the occurrences of 1, -1, and 0 (positive, negative, and neutral, per se). After that, I want to calculate the average sentiment of each user's posts. Any help with appending these averages would be great.
So far I tried the solution below:
def mean_positive(L):
    # Get all positive numbers into another list
    pos_only = [x for x in L if x > 0]
    if pos_only:
        return sum(pos_only) / len(pos_only)
    raise ValueError('No positive numbers in input')
Thank you.
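Since the dataframe is only shown as an image, here is a minimal sketch of one way to do this, assuming a `user` column and a `sentiment` column coded as -1/0/1 (both column names and the sample data are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'b', 'b', 'b'],
                   'sentiment': [1, -1, 0, 1, 1]})

# one column per sentiment value, counting occurrences per user
counts = pd.crosstab(df['user'], df['sentiment'])

# average sentiment of each user's posts
avg = df.groupby('user')['sentiment'].mean()

# attach the averages to the per-sentiment counts
summary = counts.assign(average=avg)
print(summary)
```

`pd.crosstab` produces one column per distinct sentiment value, so no manual loop over 1/-1/0 is needed.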


Weird numpy matrix values

When I want to calculate the determinant of a matrix using np.linalg.det(mat1), or calculate its inverse, it gives weird output values. For example, it gives 1.11022302e-16 instead of 0.
I tried to round the number for the determinant, but I couldn't do the same for the matrix elements.
The computation uses floating-point numbers, which are not exact, so a result can come out very close to zero without being exactly zero.
You can define a delta that determines whether a result is close enough, and then compute the absolute distance between the result and the expected value.
Maybe like this (wrapped in a function, since bare return statements are not valid at the top level):
import math
import numpy as np

def round_to_int_if_close(res, delta=0.0001):
    if abs(math.floor(res) - res) < delta:
        return math.floor(res)
    if abs(math.ceil(res) - res) < delta:
        return math.ceil(res)
    return res

res = round_to_int_if_close(np.linalg.det(mat))  # mat is the matrix from the question
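As an alternative sketch of the same idea, numpy's built-in tolerance helper np.isclose can do the closeness test (the matrix below is a made-up singular example, not the one from the question):

```python
import numpy as np

mat = np.array([[1., 2., 3.],
                [4., 5., 6.],
                [7., 8., 9.]])  # singular, so the determinant is mathematically 0

res = np.linalg.det(mat)
# snap to the nearest integer when the result is within tolerance of it
if np.isclose(res, round(res), atol=1e-4):
    res = round(res)
print(res)
```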

python df - highlight values in absolute terms but display negative values

I want to highlight top 5 values in a dataframe in absolute terms. However, in my output, I still want to see the negative values.
I am using this for principal component analysis where strongest factor loadings are considered in absolute terms (e.g., .95, -.93, .89, -.83). We also need to know whether the values are positive or negative.
My current function:
def highlight_top3(s):
    is_large = s.nlargest(3).values
    return ['background-color: lightgreen' if v in is_large else '' for v in s]

loadings.iloc[:, 0:3].style.apply(highlight_top3)
I could do the following but the negative values disappear:
def highlight_top3(s):
    is_large = s.nlargest(3).values
    return ['background-color: lightgreen' if v in is_large else '' for v in s]

loadings.abs().iloc[:, 0:3].style.apply(highlight_top3)

Calculating auto covariance in pandas

Following on the answer provided by @pltrdy in this thread:
https://stackoverflow.com/a/27164416/14744492
How do you convert the pandas.Series.autocorr(), which calculates lag-N (default=1) autocorrelation on Series, into autocovariances?
Sadly, the command pandas.Series.autocov() is not implemented in pandas.
What .autocorr(k) calculates is the (Pearson) correlation coefficient for lag k. But we know that, for a series x, that coefficient for lag k is:
\rho_k = \frac{Cov(x_{t}, x_{t-k})}{Var(x)}
Then, to get autocovariance, you multiply autocorrelation by the variance:
def autocov_series(x, lag=1):
    return x.autocorr(lag=lag) * x.var()

Note that Series.autocorr takes only the lag as an argument (not the series again), and that Series.var uses ddof=1 by default, so the sample variance is divided by N - 1, where N == x.size (giving an unbiased estimate of the population variance).
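As a quick sanity check on made-up data (the series below is an assumption, not from the question), the result should match computing the lag-1 Pearson correlation by hand and scaling it by the variance:

```python
import numpy as np
import pandas as pd

def autocov_series(x, lag=1):
    # autocovariance = lag-k autocorrelation times the series variance
    return x.autocorr(lag=lag) * x.var()

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=200))

a = s.to_numpy()
manual = np.corrcoef(a[:-1], a[1:])[0, 1] * s.var()
print(autocov_series(s, lag=1), manual)
```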

How to find most similar numerical arrays to one array, using Numpy/Scipy?

Let's say I have a list of 5 words:
[this, is, a, short, list]
Furthermore, I can classify some text by counting the occurrences of the words from the list above and representing these counts as a vector:
N = [1,0,2,5,10] # 1x this, 0x is, 2x a, 5x short, 10x list found in the given text
In the same way, I classify many other texts (count the 5 words per text, and represent them as counts - each row represents a different text which we will be comparing to N):
M = [[1,0,2,0,5],
     [0,0,0,0,0],
     [2,0,0,0,20],
     [4,0,8,20,40],
     ...]
Now, I want to find the top 1 (2, 3, etc.) rows from M that are most similar to N. Or in simple words, the texts most similar to my initial text.
The challenge is, just checking the distances between N and each row from M is not enough, since for example row M4 [4,0,8,20,40] is very different by distance from N, but still proportional (by a factor of 4) and therefore very similar. For example, the text in row M4 can be just 4x as long as the text represented by N, so naturally all counts will be 4x as high.
What is the best approach to solve this problem (of finding the most 1,2,3 etc similar texts from M to the text in N)?
Generally speaking, the most widely used technique for bag-of-words similarity (i.e., your arrays) is the cosine similarity measure. This maps your bag of n (here 5) words to an n-dimensional space, and each array is a point (essentially also a position vector) in that space. The most similar vectors (/points) are the ones with the smallest angle to your text N in that space (this automatically takes care of proportional ones, since they are close in angle). Therefore, here is code for it (assuming M and N are numpy arrays of the shapes introduced in the question):
import numpy as np

# use per-row norms (axis=1); rows of all zeros get similarity -1 to avoid 0/0
norms = np.linalg.norm(M, axis=1) * np.linalg.norm(N)
cos = np.divide(M @ N, norms, out=np.full(len(M), -1.0), where=norms > 0)
cos_sim = M[np.argmax(cos)]
which gives output [ 4 0 8 20 40] for your inputs.
You can normalise your row counts to remove the length effect, as you discussed. Row normalisation of M can be done as M / M.sum(axis=1)[:, np.newaxis]. The residual can then be calculated as the sum of the squared differences between the normalised N and each normalised row of M. The row with the minimum difference (ignoring NaN or inf values, obtained when a row sum is 0) is then the most similar.
Here is an example:
import numpy as np
N = np.array([1,0,2,5,10])
M = np.array([[1,0,2,0,5],
              [0,0,0,0,0],
              [2,0,0,0,20],
              [4,0,8,20,40]])
# sqrt of sum of normalised square differences
similarity = np.sqrt(np.sum((M / M.sum(axis=1)[:, np.newaxis] - N / np.sum(N))**2, axis=1))
# replace NaN values (from rows summing to 0) with a value larger than
# an existing similarity so that argmin never selects them
similarity[np.isnan(similarity)] = similarity[0] + 1
result = M[similarity.argmin()]
result
>>> array([ 4, 0, 8, 20, 40])
You could then use np.argsort(similarity)[:n] to get the n most similar rows.

How do I calculate the sum efficiently?

Given an integer n such that (1<=n<=10^18)
We need to calculate f(1)+f(2)+f(3)+f(4)+....+f(n).
f(x) is given as :-
Say, x = 1112222333,
then f(x)=1002000300.
Whenever we see a contiguous run of the same digit, we keep its first digit and replace all the following digits of the run with zeros.
Formally, f(x) = the sum over all runs of (first digit of the run * 10^i), where i is the zero-based position, counted from the right, of that first digit.
f(x) = 1*10^9 + 2*10^6 + 3*10^2 = 1002000300.
In x = 1112222333:
the element at index 9 is 1, and so on.
We follow zero-based indexing. :-)
For x = 1234:
the element at index 0 is 4, at index 1 is 3, at index 2 is 2, and at index 3 is 1.
How to calculate f(1)+f(2)+f(3)+....+f(n)?
I want to generate an algorithm which calculates this sum efficiently.
There is nothing to calculate.
Multiplying each digit by its position's place value just yields the same number back.
So all you want to do is end up with 0s on each repeated number.
I.e., let's populate some static values in an array, in pseudocode:
$As[1]='0'
$As[2]='00'
$As[3]='000'
...etc
$As[18]='000000000000000000'
These are the "results" of 10^index.
Given a value n of 1234,
1&000 + 2&00 + 3&0 + 4
results in 1234.
So, if you are putting this on a chip, then probably your most efficient method is to do a bitwise XOR between each register and the next one up the line as a single operation.
Then you will have 0s in all the spots you care about, and you just retrieve the values from the registers that hold a 1.
In code, I think it would be most efficient to do the following:
$n = arbitrary value, e.g. 11223334
$x = $n*10
$zeros = ($x-$n)/10
Okay, yeah, we can just do bit shifting to get a value like 100200300400, etc.
To approach this problem, it could help to begin with one digit numbers and see what sum you get.
I mean like this:
Let's say we define F(k) as the sum f(0) + f(1) + ... + f(10^k - 1), i.e., over all k-digit strings including leading zeros. Then we have:
F(1) = 45                 # = 9*10/2, the sum of the digits 1..9
F(2) = F(1)*9 + F(1)*100  # F(1)*9 is the part that comes from the last digit,
                          # because for each of the 10 possible digits in the
                          # first position we have 9 digits in the last
                          # (both can't be equal, so one out of ten
                          # becomes zero). F(1)*100 comes from the leading
                          # digit, which is multiplied by 100 (10 because we
                          # add the second digit and another factor of 10
                          # because the digit appears ten times in that position).
If you now continue with this scheme, for k>=1 in general you get
F(k+1)= F(k)*100+10^(k-1)*45*9
The rest is probably straightforward.
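A sketch of this scheme in code, checked against brute force for small k (the helper f below implements the digit-zeroing rule from the question):

```python
def f(x):
    # zero out every digit that equals the digit immediately to its left
    s = str(x)
    return int(''.join(d if i == 0 or d != s[i - 1] else '0'
                       for i, d in enumerate(s)))

def F(k):
    # F(k) = f(0) + f(1) + ... + f(10^k - 1), built from the recurrence
    # F(k+1) = F(k)*100 + 10^(k-1)*45*9
    total = 45  # F(1)
    for j in range(1, k):
        total = total * 100 + 45 * 9 * 10 ** (j - 1)
    return total

# brute-force check for small k
for k in (1, 2, 3):
    assert F(k) == sum(f(x) for x in range(10 ** k))
print(f(1112222333), F(3))
```

Extending this to an arbitrary upper bound n (rather than 10^k - 1) still needs a digit-DP style argument, but the recurrence handles the full blocks of equal length.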
Can you tell me which HackerRank task this is? I guess it's one of the Project Euler tasks, right?