python df - highlight values in absolute terms but display negative values - dataframe

I want to highlight top 5 values in a dataframe in absolute terms. However, in my output, I still want to see the negative values.
I am using this for principal component analysis where strongest factor loadings are considered in absolute terms (e.g., .95, -.93, .89, -.83). We also need to know whether the values are positive or negative.
My current function:
def highlight_top3(s):
    is_large = s.nlargest(3).values
    return ['background-color: lightgreen' if v in is_large else '' for v in s]
loadings.iloc[:, 0:3].style.apply(highlight_top3)
I could do the following but the negative values disappear:
def highlight_top3(s):
    is_large = s.nlargest(3).values
    return ['background-color: lightgreen' if v in is_large else '' for v in s]
loadings.abs().iloc[:, 0:3].style.apply(highlight_top3)
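One possible approach, sketched under assumptions: rank each column by absolute value, but apply the styles to the original, signed dataframe. The name highlight_top_abs and the small loadings table below are made up for illustration; only s.abs(), nlargest and Styler.apply come from pandas itself.

```python
import pandas as pd

def highlight_top_abs(s, n=3, color="lightgreen"):
    """Return one CSS string per cell, marking the n largest |values| in s."""
    # Rank by absolute value, but remember the index labels of the winners
    top_labels = s.abs().nlargest(n).index
    return [f"background-color: {color}" if label in top_labels else ""
            for label in s.index]

# Hypothetical loadings table, invented for this example
loadings = pd.DataFrame(
    {"PC1": [0.95, -0.93, 0.10, 0.89, -0.20],
     "PC2": [-0.83, 0.05, 0.40, -0.10, 0.77]},
    index=["v1", "v2", "v3", "v4", "v5"],
)

# Styles computed for one column; the signed values themselves are untouched
pc1_styles = highlight_top_abs(loadings["PC1"])

# In a notebook, the same function can be passed to the Styler:
# loadings.style.apply(highlight_top_abs, n=3)
```

Because the function only reads s.abs() and never replaces the data, the rendered table would still show -.93 and -.83 with their signs. Ties are resolved by nlargest's default (first occurrence wins).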

Related

Weird numpy matrix values

When I calculate the determinant of a matrix using np.linalg.det(mat1), or calculate its inverse, the output contains odd values. For example, it gives 1.11022302e-16 instead of 0.
I tried rounding the determinant, but I couldn't do the same for the matrix elements.
The computation probably isn't exact (it uses floating-point numbers), so a multiplication or division can give a result that is very close to zero but not equal to it.
You can define a delta that decides whether the result is close enough, and then compute the absolute distance between the result and the expected value.
Maybe like this:
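import math
import numpy as np

def snapped_det(mat, delta=0.0001):
    # Snap the determinant to the nearest integer if it is within delta of it
    res = np.linalg.det(mat)
    if abs(math.floor(res) - res) < delta:
        return math.floor(res)
    if abs(math.ceil(res) - res) < delta:
        return math.ceil(res)
    return res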
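For the matrix elements, which the question could not round, a minimal sketch using np.round and np.isclose follows. The noisy array is made up to mimic the kind of values the question describes, and the 10-decimal cutoff is an arbitrary choice, not a recommendation.

```python
import numpy as np

# Made-up matrix with floating-point noise around 0 and 1
noisy = np.array([[1.0, 1.11022302e-16],
                  [-2.22044605e-16, 0.9999999999999999]])

# Option 1: round every element to a fixed number of decimals
cleaned = np.round(noisy, 10)

# Option 2: only zero out elements that are close to 0, leave the rest alone
zeroed = np.where(np.isclose(noisy, 0.0), 0.0, noisy)
```

np.isclose uses an absolute tolerance of 1e-8 by default; pass atol explicitly if the matrix entries are themselves very small.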

Calculate and return the average of positive, negative, and neutral

I have the following dataframe:
I am trying to add three columns that hold the number of instances of -1, 0, and 1 (negative, neutral, and positive, respectively). After that, I want to calculate the average sentiment of each user's posts. Any help with appending these averages would be great.
So far I tried the solution below:
def mean_positive(L):
    # Get all positive numbers into another list
    pos_only = [x for x in L if x > 0]
    if pos_only:
        return sum(pos_only) / len(pos_only)
    raise ValueError('No positive numbers in input')
Thank you.
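The original dataframe is not shown, so the following is only a sketch under assumed structure: one row per post, with hypothetical columns named user and sentiment (coded -1, 0, 1). It counts each sentiment value per user with pd.crosstab and appends the mean sentiment.

```python
import pandas as pd

# Hypothetical data: column names and values are assumptions
df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "sentiment": [1, -1, 1, 0, 1],
})

# Count how often each sentiment value occurs for each user
counts = (
    pd.crosstab(df["user"], df["sentiment"])
    .reindex(columns=[-1, 0, 1], fill_value=0)  # keep all three columns
    .rename(columns={-1: "negative", 0: "neutral", 1: "positive"})
)

# Average sentiment of each user's posts, aligned on the user index
counts["avg_sentiment"] = df.groupby("user")["sentiment"].mean()
```

If the real data stores one user per row with posts spread across columns, the same idea would need a reshape first (for example with melt).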

Plotting an exponential function given one parameter

I'm fairly new to Python, so bear with me. I have generated some data with a very large number of points, stored in the variable vals. I have then plotted a histogram of these values, limited so that only values between 104 and 155 are taken into account. This has been done as follows:
bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2.
plt.errorbar(bin_centres, bin_heights, np.sqrt(bin_heights), fmt=',', capsize=2)
plt.xlabel(r"$m_{\gamma\gamma} (GeV)$")
plt.ylabel("Number of entries")
plt.show()
Giving the above plot:
My next step is to take into account values from vals which are less than 120. I have done this as follows:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 GeV set
I need to plot a curve on the same plot as the histogram, which follows the form B(x) = Ae^(-x/λ)
I then estimated a value of λ using the maximum likelihood estimator formula:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 GeV set
#print(background_data)
N_background=len(background_data)
print(N_background)
sigma_background_data=sum(background_data)
print(sigma_background_data)
lamb = (sigma_background_data)/(N_background) #maximum likelihood estimator for lambda
print('lambda estimate is', lamb)
where lamb = λ. I got a value of roughly lamb = 27.75, which I know is correct. I now need to get an estimate for A.
I have been advised to do this as follows:
Given a value of λ, find A by scaling the PDF to the data such that the area beneath
the scaled PDF has equal area to the data
I'm not quite sure what this means, or how I'd go about trying to do this. PDF means probability density function. I assume an integration will have to take place, so to get the area under the data (vals), I have done this:
data_area= integrate.cumtrapz(background_data, x=None, dx=1.0)
print(data_area)
plt.plot(background_data, data_area)
However, this gives me an error
ValueError: x and y must have same first dimension, but have shapes (981555,) and (981554,)
I'm not sure how to fix it. The end result should be something like:
See the cumtrapz docs:
Returns: ... If initial is None, the shape is such that the axis of integration has one less value than y. If initial is given, the shape is equal to that of y.
So you can either pass an initial value, like
data_area = integrate.cumtrapz(background_data, x=None, dx=1.0, initial = 0.0)
or discard the first value of the background_data:
plt.plot(background_data[1:], data_area)
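The question also asks how to find A. One reading of the advice quoted there, sketched under assumptions: take the area under the histogram (sum of bin heights times bin width) over a background-only window, set it equal to the integral of A*exp(-x/lamb) over the same window, and solve for A. The synthetic vals, the window 104 to 120 and the bin count below are assumptions for illustration, not the asker's data.

```python
import numpy as np

rng = np.random.default_rng(0)
lamb = 27.75                        # estimate quoted in the question
lo, hi, n_bins = 104.0, 120.0, 10   # assumed background-only window

# Synthetic stand-in for vals (assumption: purely exponential background)
vals = rng.exponential(lamb, size=200_000)

bin_heights, bin_edges = np.histogram(vals, range=(lo, hi), bins=n_bins)
bin_width = bin_edges[1] - bin_edges[0]

# Area under the histogram: each bar contributes height * width
data_area = bin_heights.sum() * bin_width

# Integral of A*exp(-x/lamb) from lo to hi is
# A * lamb * (exp(-lo/lamb) - exp(-hi/lamb)); equate to data_area, solve for A
A = data_area / (lamb * (np.exp(-lo / lamb) - np.exp(-hi / lamb)))

# Curve evaluated at the bin centres, ready for plt.plot(bin_centres, curve)
bin_centres = (bin_edges[:-1] + bin_edges[1:]) / 2
curve = A * np.exp(-bin_centres / lamb)
```

The resulting A depends on the bin width, because the histogram heights are counts per bin rather than densities; changing bins changes A accordingly.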

Numpy returning False even though both arrays are the same?

From my understanding of numpy, the np.equal([x, prod]) command compares the arrays element by element and returns True for each if they are equal. But every time I execute the command, it returns False for the first comparison. On the other hand, if I copy-paste the two arrays into the command, it returns True for both, as you can see in the screenshot. So, why is there a difference between the two?
You cannot reliably compare floating-point numbers for exact equality, because they are only approximations of the real values. When you compare hardcoded values, they are equal because both are approximated in exactly the same way. But once you apply mathematical operations to them, rounding errors can accumulate, so an exact equality check is no longer reliable.
For example, this
a = 0
for i in range(10):
    a += 1/10
print(a)
print(a == 1)
will give you 0.9999999999999999 and False, even though (1/10) * 10 = 1.
To compare floating-point values, you need to compare the two values against a small delta value. In other words, check if they're just a really small value apart. For example
a = 0
for i in range(10):
    a += 1/10
delta = 0.00000001
print(a)
print(abs(a - 1) < delta)
will give you True.
For numpy, you can use numpy.isclose to get a mask or numpy.allclose if you only want a True or False value.
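A short sketch of the difference between np.equal, np.isclose and np.allclose. The arrays are made-up examples of values that differ only by rounding error; the default tolerances (rtol=1e-05, atol=1e-08) are used.

```python
import numpy as np

# Results of arithmetic vs. the values one would expect on paper
x = np.array([0.1 + 0.2, 1.1 + 2.2])
expected = np.array([0.3, 3.3])

exact = np.equal(x, expected)          # element-wise exact comparison
mask = np.isclose(x, expected)         # element-wise, within a tolerance
all_close = np.allclose(x, expected)   # single bool for the whole array
```

Here exact is False for both elements, while mask is True for both and all_close is True.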

How to add magnitude or value to a vector in Python?

I am using this function to calculate the distance between two word2vec vectors a and b, each of size 300. The distance between 'hot' and 'cold' comes out as 1.
How do I add this value (1) to a vector? I thought I could simply use new_vec=model['hot']+1, but then dist(new_vec,model['hot']) gives 17.
import numpy
def dist(a,b):
    return numpy.linalg.norm(a-b)
a=model['hot']
c=a+1
dist(a,c)
17
I expected dist(a,c) will give me back 1!
You should review what the norm is. In the case of numpy, the default is to use the L-2 norm (a.k.a the Euclidean norm). When you add 1 to a vector, the call is to add 1 to all of the elements in the vector.
>> vec1 = np.random.normal(0,1,size=300)
>> print(vec1[:5])
... [ 1.18469795 0.04074346 -1.77579852 0.23806222 0.81620881]
>> vec2 = vec1 + 1
>> print(vec2[:5])
... [ 2.18469795 1.04074346 -0.77579852 1.23806222 1.81620881]
Now, your call to norm is saying sqrt( (a1-b1)**2 + (a2-b2)**2 + ... + (aN-bN)**2 ) where N is the length of the vector and a is the first vector and b is the second vector (and ai being the ith element in a). Since (a1-b1)**2 == (a2-b2)**2 == ... == (aN-bN)**2 == 1 we expect this sum to produce N which in your case is 300. So sqrt(300) = 17.3 is the expected answer.
>> print(np.linalg.norm(vec1-vec2))
... 17.320508075688775
To answer the question, "How to add a value to a vector": you have done this correctly. If you'd like to add a value to a specific element, you can do vec2[ix] += value, where ix indexes the element you wish to change. If you want to add the same value to every element so that the new vector ends up at distance 1 from the original, add np.sqrt(1/300): the distance is then sqrt(300 * (1/300)) = 1.
Also possibly relevant is a more commonly used distance metric for word2vec vectors: the cosine distance which measures the angle between two vectors.
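Both points can be checked with a small sketch. The random vec1 stands in for a word2vec vector, and cosine_distance is a hypothetical helper written for this example (defined as 1 minus the cosine similarity), not a function from the asker's model.

```python
import numpy as np

rng = np.random.default_rng(0)
vec1 = rng.normal(0, 1, size=300)   # stand-in for model['hot']

# Adding sqrt(1/N) to every element moves the vector by Euclidean distance 1
offset = np.sqrt(1 / vec1.size)
vec2 = vec1 + offset
dist = np.linalg.norm(vec1 - vec2)

def cosine_distance(a, b):
    """1 - cos(angle between a and b); 0 means same direction."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

same_direction = cosine_distance(vec1, 2 * vec1)   # scaling leaves the angle unchanged
opposite = cosine_distance(vec1, -vec1)            # opposite direction
```

Unlike the Euclidean distance, the cosine distance ignores vector length, which is one reason it is often preferred for word embeddings.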