Why is the correlation one when values differ? - pandas

I have a dataframe book_matrix with users as rows, books as columns, and ratings as values. When I use corrwith() to compute the correlation between 'The Lord of the Rings' and 'The Silmarillion' the result is 1.0, but the values are clearly different.
The non-null values [10, 3] and [10, 9] have correlation 1.0. I would expect them to be exactly the same when the correlation is equal to one. How can this happen?

Correlation measures how strongly the values follow a linear relationship with one another; it does not require the values to be equal. Here's an illustration:
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 4],
                    "B": [5, 8, 4, 3],
                    "C": [10, 4, 9, 3]})
df2 = pd.DataFrame({"A": [2, 4, 6, 8],
                    "B": [-5, -8, -4, -3],
                    "C": [4, 3, 8, 5]})
df1.corrwith(df2, axis=0)

A    1.000000
B   -1.000000
C    0.395437
dtype: float64
So you can see that [1, 2, 3, 4] and [2, 4, 6, 8] have correlation 1.0.
The next column, [5, 8, 4, 3] and [-5, -8, -4, -3], has a perfect negative correlation of -1.0.
In the last column, [10, 4, 9, 3] and [4, 3, 8, 5] are only somewhat correlated at 0.395437, because both exhibit a high-low-high-low sequence but with different vertical scaling.
So in your case, the books 'The Lord of the Rings' and 'The Silmarillion' only have 2 ratings each, and both pairs follow a high-low sequence. With only two points, any relationship is perfectly linear (two points always lie on a straight line), so the correlation can only come out as 1.0, -1.0, or undefined. The same thing happens even with more data points, as long as they keep the same vertical scaling factor:
df1 = pd.DataFrame({"A": [10, 3, 10, 3, 10, 3],
                    "B": [10, 3, 10, 3, 10, 3]})
df2 = pd.DataFrame({"A": [10, 9, 10, 9, 10, 9],
                    "B": [10, 10, 10, 9, 9, 9]})
df1.corrwith(df2, axis=0)

A    1.000000
B    0.333333
dtype: float64
So you can see that [10, 3, 10, 3, 10, 3] and [10, 9, 10, 9, 10, 9] are also perfectly correlated at 1.0.
But if I rearrange the sequence a little, [10, 3, 10, 3, 10, 3] and [10, 10, 10, 9, 9, 9] are no longer perfectly correlated; the correlation drops to 0.333333.
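To see concretely why two points always give ±1, here is a minimal Pearson correlation computed by hand for the two ratings in the question (a sketch using plain NumPy):

import numpy as np

# ratings from the question; with only two points the deviations from the mean
# are always +d and -d, so the numerator and denominator cancel out
x = np.array([10, 3])    # 'The Lord of the Rings'
y = np.array([10, 9])    # 'The Silmarillion'

dx = x - x.mean()        # [ 3.5, -3.5]
dy = y - y.mean()        # [ 0.5, -0.5]
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(r)                 # 1.0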
So going forward, you need more data, and more variations in the data! Hope that helps 😎

For loop to obtain sum and mean on np 3d array

I have the following array
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]],
                [[7, 8, 9], [10, 11, 12]]])
I want to go through each element and sum on axis 0, so I do:
lst = []
for x in arr:
    for y in np.sum(x, axis=0):
        lst.append(y)
where now the lst is
[5, 7, 9, 17, 19, 21]
However I want the output to be in the following form:
[[5, 7, 9], [17, 19, 21]]
to then take the mean along axis 0, namely (5+17)/2 and so on. The final output should look like
[11., 13., 15.]
How can I do this? Is it possible to write this whole operation in a compact form, such as a list comprehension?
Update: To get the final output I can do:
np.mean(np.reshape(lst, (len(arr),-1)),axis=0)
Yet I am sure there is a Pythonic way of doing this
In [5]: arr = np.array([[[1, 2, 3], [4, 5, 6]],
   ...:                 [[7, 8, 9], [10, 11, 12]]])
In [7]: arr
Out[7]:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])
The for loop iterates on the 1st dimension, as though the array were a list of arrays:
In [8]: for x in arr: print(x)
[[1 2 3]
 [4 5 6]]
[[ 7  8  9]
 [10 11 12]]
list(arr) also makes a list (but it is slower than arr.tolist()).
One common way of iterating on other dimensions is to use an index:
In [10]: for i in range(2): print(arr[:,i])
[[1 2 3]
 [7 8 9]]
[[ 4  5  6]
 [10 11 12]]
You could also transpose the array placing the desired axis first.
But you don't need to iterate
In [13]: arr.sum(axis=1)
Out[13]:
array([[ 5,  7,  9],
       [17, 19, 21]])
In [14]: arr.sum(axis=1).mean(axis=0)
Out[14]: array([11., 13., 15.])
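If you do want the compact list-comprehension form the question asks about, a rough equivalent (a sketch; the vectorized call above is the idiomatic way) is:

np.mean([x.sum(axis=0) for x in arr], axis=0)    # array([11., 13., 15.])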

How to delete rows from a column which have matching values in a list - Pandas

I am finding outliers from a column and storing them in a list. Now I want to delete all the values from the column which are present in my list.
How can I achieve this?
This is my function for finding outliers
outlier = []
def detect_outliers(data):
    threshold = 3
    m = np.mean(data)
    st = np.std(data)
    for i in data:
        # calculating the z-score value
        z_score = (i - m) / st
        # if the z-score is greater than the threshold value, then it's an outlier
        if np.abs(z_score) > threshold:
            outlier.append(i)
    return outlier
This is my column in data frame
df_train_11.AMT_INCOME_TOTAL
import numpy as np, pandas as pd

df = pd.DataFrame(np.random.rand(10, 5))
outlier_list = []
def detect_outliers(data):
    threshold = 0.5
    for i in data:
        # calculating z-score values
        z_score = (df.loc[:, i] - np.mean(df.loc[:, i])) / np.std(df.loc[:, i])
        outliers = np.abs(z_score) > threshold
        outlier_list.append(df.index[outliers].tolist())
    return outlier_list

outlier_list = detect_outliers(df)
[[1, 2, 4, 5, 6, 7, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 4, 8],
[0, 1, 3, 4, 6, 8],
[0, 1, 3, 5, 6, 8, 9]]
This way, you get the outliers of each column. outlier_list[0] gives you [1, 2, 4, 5, 6, 7, 9] which means that the rows 1,2,etc are outliers for column 0.
EDIT
Shorter answer:
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]
This will filter the DataFrame based on only ONE column (e.g. 'B'), keeping the rows where that column is within three standard deviations of its mean.
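To actually delete the matching rows the question asks about, once the outlier values are collected in a list (using the question's detect_outliers, which returns values rather than row indices), a minimal sketch with isin would be:

outliers = detect_outliers(df_train_11.AMT_INCOME_TOTAL)
# keep only the rows whose value is NOT in the outlier list
df_train_11 = df_train_11[~df_train_11.AMT_INCOME_TOTAL.isin(outliers)]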

Numpy: Fast approach to extract rows from an array using a 2D index array

I have 2 arrays a and b:
N,D,V,W = 2,3,4,5
a = np.random.randint(0,V,N*D).reshape(N,D)
a
array([[2, 3, 3],
[2, 0, 3]])
b = np.random.randint(0,10,V*W).reshape(V,W)
b
array([[0, 1, 0, 5, 5],
[0, 3, 6, 8, 7],
[8, 8, 9, 0, 9],
[4, 6, 3, 3, 1]])
What I need to do is to replace every element of array a with a row from array b using the array a element value as the row index of array b.
At the moment I'm doing it this way which works fine:
b[a.ravel(),:].reshape(*a.shape,-1)
array([[[8, 8, 9, 0, 9],
        [4, 6, 3, 3, 1],
        [4, 6, 3, 3, 1]],

       [[8, 8, 9, 0, 9],
        [0, 1, 0, 5, 5],
        [4, 6, 3, 3, 1]]])
However it seems this approach is a bit slow.
I tested it with:
N,D,V,W = 20000,64,100,256
and it took an average of 674 ms on my laptop (8 cores, 16 GB RAM).
Can someone please recommend a faster yet still simple approach?
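For what it's worth (a sketch, not a benchmarked claim): NumPy's advanced indexing already does the ravel/reshape in one step, since indexing b with the 2D integer array a returns a result of shape (N, D, W) directly. It may be worth timing against the version above:

out = b[a]    # same values and shape as b[a.ravel(), :].reshape(*a.shape, -1)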

Splitting a number and assigning to elements in a row in a numpy array

How do I place a list of numbers into a 2D numpy array, where the second dimension of the array is equal to the number of digits of the largest number in that list? I also want the elements that don't belong to the original number to be zero in each row of the returned array.
Example:
From the list a = range(0,1001), how to get the numpy array of the below form:
[[0,0,0,0],
[0,0,0,1],
[0,0,0,2],
...
[0,9,9,8],
[0,9,9,9],
[1,0,0,0]]
Please note how each number is placed right-aligned at the end of its row in a np.zeros((1001, 4)) array.
NB: A Pythonic, vectorized implementation is expected.
Broadcasting again!
def split_digits(a):
    N = int(np.log10(np.max(a)) + 1)    # No. of digits
    r = 10**np.arange(N, -1, -1)        # 10-powered range array
    return (np.asarray(a)[:, None] % r[:-1]) // r[1:]
Sample runs -
In [224]: a = range(0,1001)
In [225]: split_digits(a)
Out[225]:
array([[0, 0, 0, 0],
[0, 0, 0, 1],
[0, 0, 0, 2],
...,
[0, 9, 9, 8],
[0, 9, 9, 9],
[1, 0, 0, 0]])
In [229]: a = np.random.randint(0,1000000,(7))
In [230]: a
Out[230]: array([431921, 871855, 636144, 541186, 410562, 89356, 476258])
In [231]: split_digits(a)
Out[231]:
array([[4, 3, 1, 9, 2, 1],
[8, 7, 1, 8, 5, 5],
[6, 3, 6, 1, 4, 4],
[5, 4, 1, 1, 8, 6],
[4, 1, 0, 5, 6, 2],
[0, 8, 9, 3, 5, 6],
[4, 7, 6, 2, 5, 8]])
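To see why the modulo / floor-division pair in split_digits produces digits, here is a minimal worked sketch for a single number (987 is my own example, not from the answer):

import numpy as np

r = 10**np.arange(3, -1, -1)    # [1000, 100, 10, 1]
987 % r[:-1]                    # [987, 87, 7] -> strip everything above each place value
987 % r[:-1] // r[1:]           # [9, 8, 7]    -> keep just the digit at each place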
Another concept using pandas str
def pir(a):
    z = int(np.log10(np.max(a)))
    s = pd.Series(a.astype(str))
    zfilled = s.str.zfill(z + 1).sum()
    a_ = np.array(list(zfilled)).reshape(-1, z + 1)
    return a_.astype(int)
Using #Divakar's random array
a = np.random.randint(0,1000000,(7))
array([ 57190, 29950, 392317, 592062, 460333, 639794, 983647])
pir(a)
array([[0, 5, 7, 1, 9, 0],
[0, 2, 9, 9, 5, 0],
[3, 9, 2, 3, 1, 7],
[5, 9, 2, 0, 6, 2],
[4, 6, 0, 3, 3, 3],
[6, 3, 9, 7, 9, 4],
[9, 8, 3, 6, 4, 7]])

Split last dimension of arrays into lower dimensional arrays

Assume we have an array with NxMxD shape. I want to get a list with D NxM arrays.
The correct way of doing it would be:
np.dsplit(myarray, D)
However, this returns D NxMx1 arrays.
I can achieve the desired result by doing something like:
[myarray[..., i] for i in range(D)]
Or:
[np.squeeze(subarray) for subarray in np.dsplit(myarray, D)]
However, I feel like it is a bit redundant to need to perform an additional operation. Am I missing any numpy function that returns the desired result?
Try myarray.swapaxes(1,2).swapaxes(1,0):
>>> import numpy as np
>>> a = np.arange(24).reshape(2,3,4)
>>> a
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
>>> [a[:,:,i] for i in range(4)]
[array([[ 0,  4,  8],
        [12, 16, 20]]),
 array([[ 1,  5,  9],
        [13, 17, 21]]),
 array([[ 2,  6, 10],
        [14, 18, 22]]),
 array([[ 3,  7, 11],
        [15, 19, 23]])]
>>> a.swapaxes(1,2).swapaxes(1,0)
array([[[ 0,  4,  8],
        [12, 16, 20]],

       [[ 1,  5,  9],
        [13, 17, 21]],

       [[ 2,  6, 10],
        [14, 18, 22]],

       [[ 3,  7, 11],
        [15, 19, 23]]])
Edit: As pointed out by ajcr (thanks again), the transpose command is more convenient since the two swaps can be done in one step by using
myarray.transpose(2,0,1)
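If you literally want a list of D NxM arrays rather than a single rearranged array, wrapping the transposed result in list() gives that, since iterating an array walks its first axis (shown here on the example array above):

list(a.transpose(2,0,1))    # four 2x3 arrays; the first is [[0, 4, 8], [12, 16, 20]]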
np.dsplit uses np.array_split, the core of which is:
sub_arys = []
sary = _nx.swapaxes(ary, axis, 0)
for i in range(Nsections):
    st = div_points[i]; end = div_points[i + 1]
    sub_arys.append(_nx.swapaxes(sary[st:end], axis, 0))
with axis=-1, this is equivalent to:
[x[...,i:(i+1)] for i in np.arange(x.shape[-1])] # or
[x[...,[i]] for i in np.arange(x.shape[-1])]
which accounts for the singleton dimension.
So there's nothing wrong or inefficient about your
[x[...,i] for i in np.arange(x.shape[-1])]
Actually, in quick time tests, any use of dsplit is slow. Its generality costs. So adding squeeze is relatively cheap.
But by accepting the other answer, it looks like you are really looking for an array of the correct shape, rather than a list of arrays. For many operations that makes sense. split is more useful when the subarrays have more than one 'row' along the split axis, or even an uneven number of 'rows'.
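As a small illustration of that last point (my own example, not from the thread), np.array_split is what handles sections that don't divide the axis evenly, which a plain indexing comprehension doesn't give you for free:

np.array_split(np.arange(10), 3)
# [array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]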