How to merge values with consecutive DatetimeIndex entries? - pandas

The original Series is
2019-02-09 23:04:33 [9]
2019-02-09 23:04:34 [7, 10]
2019-02-09 23:05:41 [0, 10, 11, 13, 15, 16]
2019-02-09 23:05:42 [0, 11, 13, 14, 15, 16]
2019-02-09 23:07:41 [12, 16]
2019-02-09 23:09:42 [1, 3, 15]
How can I merge the values whose DatetimeIndex entries are consecutive? The output should be
2019-02-09 23:04:33 [7, 9, 10]
2019-02-09 23:05:41 [0, 10, 11, 13, 14, 15, 16]
2019-02-09 23:07:41 [12, 16]
2019-02-09 23:09:42 [1, 3, 15]

Use a custom lambda function that flattens the lists, removes duplicates with set, and sorts the result; GroupBy.first supplies the index value for each group:
f = lambda x: sorted(set([z for y in x for z in y]))
df = s.reset_index(name='a')
# consecutive datetimes by 1 second
s1 = df['index'].diff().dt.total_seconds().ne(1).cumsum()
s = (df.groupby(s1)
       .agg(i=('index', 'first'), a=('a', f))
       .set_index('i')['a'])
print(s)
i
2019-02-09 23:04:33 [7, 9, 10]
2019-02-09 23:05:41 [0, 10, 11, 13, 14, 15, 16]
2019-02-09 23:07:41 [12, 16]
2019-02-09 23:09:42 [1, 3, 15]
Name: a, dtype: object
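To make the grouping step easier to follow, here is a minimal self-contained sketch that rebuilds the Series from the question and prints the group ids before merging. The names idx, gaps, group_id, firsts, merged and result are only illustrative, and the explicit loop is a swapped-in alternative to the agg call above, not a replacement for it.

```python
import pandas as pd

# Rebuild the Series from the question (datetime index, list values).
idx = pd.to_datetime([
    "2019-02-09 23:04:33", "2019-02-09 23:04:34",
    "2019-02-09 23:05:41", "2019-02-09 23:05:42",
    "2019-02-09 23:07:41", "2019-02-09 23:09:42",
])
s = pd.Series(
    [[9], [7, 10], [0, 10, 11, 13, 15, 16],
     [0, 11, 13, 14, 15, 16], [12, 16], [1, 3, 15]],
    index=idx,
)

# Gap in seconds to the previous timestamp (NaN for the first row).
gaps = s.index.to_series().diff().dt.total_seconds()
# A new group starts wherever the gap is not exactly 1 second.
group_id = gaps.ne(1).cumsum()
print(group_id.tolist())

# Merge each group: keep the first timestamp, union and sort the values.
firsts, merged = [], []
for _, chunk in s.groupby(group_id.to_numpy()):
    firsts.append(chunk.index[0])
    merged.append(sorted({v for lst in chunk for v in lst}))

result = pd.Series(merged, index=firsts)
print(result)
```

Rows that share a group id are exactly the runs of timestamps one second apart, which is why the cumulative sum of the "not 1 second" flags works as a grouping key.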

Related

Why is the correlation one when values differ?

I have a dataframe book_matrix with users as rows, books as columns, and ratings as values. When I use corrwith() to compute the correlation between 'The Lord of the Rings' and 'The Silmarillion' the result is 1.0, but the values are clearly different.
The non-null values [10, 3] and [10, 9] have correlation 1.0. I would expect them to be exactly the same when the correlation is equal to one. How can this happen?
Correlation measures how closely two sets of values follow a linear relationship. A correlation of 1.0 means one set is an increasing linear function of the other, not that the values are equal. Here's an illustration:
import pandas as pd
df1 = pd.DataFrame({"A":[1, 2, 3, 4],
                    "B":[5, 8, 4, 3],
                    "C":[10, 4, 9, 3]})
df2 = pd.DataFrame({"A":[2, 4, 6, 8],
                    "B":[-5, -8, -4, -3],
                    "C":[4, 3, 8, 5]})
df1.corrwith(df2, axis=0)
A 1.000000
B -1.000000
C 0.395437
dtype: float64
So you can see that [1, 2, 3, 4] and [2, 4, 6, 8] have correlation 1.0.
The next column, [5, 8, 4, 3] and [-5, -8, -4, -3], has a perfect negative correlation of -1.0.
In the last column, [10, 4, 9, 3] and [4, 3, 8, 5] are only somewhat correlated at 0.395437: both follow a high-low-high-low sequence, but the sizes of the swings are not proportional.
So in your case, 'The Lord of the Rings' and 'The Silmarillion' have only 2 ratings each, and both pairs go from high to low. Two distinct points always lie on a straight line, so the correlation of two pairs can only be 1.0 or -1.0. The same happens with more data points as long as the swings stay proportional:
df1 = pd.DataFrame({"A": [10, 3, 10, 3, 10, 3],
                    "B": [10, 3, 10, 3, 10, 3]})
df2 = pd.DataFrame({"A": [10, 9, 10, 9, 10, 9],
                    "B": [10, 10, 10, 9, 9, 9]})
df1.corrwith(df2, axis=0)
A 1.000000
B 0.333333
dtype: float64
So you can see that [10, 3, 10, 3, 10, 3] and [10, 9, 10, 9, 10, 9] are also perfectly correlated at 1.0.
But if I rearrange the sequence a little, [10, 3, 10, 3, 10, 3] and [10, 10, 10, 9, 9, 9] are no longer perfectly correlated; the correlation drops to 0.333333.
So going forward, you need more data, and more variation in the data! Hope that helps 😎
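The two-point case from the question can be checked directly. This is a small sketch using np.corrcoef (a swapped-in way to compute the same Pearson coefficient, not the corrwith call from the question); the variable names lotr and silm are only illustrative.

```python
import numpy as np

# The two overlapping ratings from the question.
lotr = np.array([10, 3])
silm = np.array([10, 9])

# Both pairs go from high to low, so the two points line up perfectly.
r_same = np.corrcoef(lotr, silm)[0, 1]

# Reversing one pair makes them move in opposite directions.
r_opposite = np.corrcoef(lotr, silm[::-1])[0, 1]

print(round(r_same, 6), round(r_opposite, 6))
```

With only two overlapping ratings, the coefficient says nothing about how similar the ratings are; it only records whether the two pairs move in the same direction.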

How to get a specific output for NumPy array slicing?

x is an array of shape (n_dim, n_row, n_col) holding the first n natural numbers.
b is a boolean array of shape (2,) with elements True, False.
def array_slice(n,n_dim,n_row,n_col):
    x = np.arange(0,n).reshape(n_dim,n_row,n_col)
    b = np.full((2,),True)
    print(x[b])
    print(x[b,:,1:3])
Expected output:
[[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]]
[[[ 1 2]
[ 6 7]
[11 12]]]
My output:
[[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]
[[15 16 17 18 19]
[20 21 22 23 24]
[25 26 27 28 29]]]
[[[ 1 2]
[ 6 7]
[11 12]]
[[16 17]
[21 22]
[26 27]]]
An example:
In [83]: x= np.arange(24).reshape(2,3,4)
In [84]: b = np.full((2,),True)
In [85]: x
Out[85]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
In [86]: b
Out[86]: array([ True, True])
With two True values, b selects both planes of the 1st dimension:
In [87]: x[b]
Out[87]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
A b with a mix of True and False:
In [88]: b = np.array([True, False])
In [89]: b
Out[89]: array([ True, False])
In [90]: x[b]
Out[90]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]]])
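Applied to the function from the question, the same idea might look like the sketch below. The call array_slice(30, 2, 3, 5) is an assumption inferred from the expected output shown in the question, and returning the two results instead of printing them inside the function is a change made here for illustration.

```python
import numpy as np

def array_slice(n, n_dim, n_row, n_col):
    # First n natural numbers arranged as (n_dim, n_row, n_col).
    x = np.arange(n).reshape(n_dim, n_row, n_col)
    # Explicit mask: keep the first plane, drop the second.
    # (np.full((2,), True) would give [True, True] and keep both.)
    b = np.array([True, False])
    return x[b], x[b, :, 1:3]

# Assumed arguments: 30 numbers in 2 planes of 3 rows and 5 columns.
first, sliced = array_slice(30, 2, 3, 5)
print(first)
print(sliced)
```

Note that the mask has a fixed length of 2, so this sketch only works when n_dim is 2.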

Getting A Correlation Column Based on Two Columns with A List Value

I have the following dataset:
df = pd.DataFrame({'A': [[10, 11, 12], [13, 14, 15]],
'B': [[17, 18, 12], [21, 22, 13]]})
df
A B
0 [10, 11, 12] [17, 18, 12]
1 [13, 14, 15] [21, 22, 13]
Now I want to create a new column Correlation based on the A and B columns using scipy.stats.pearsonr method. I'm trying this:
# Creating a function for correlation
def correlation(row):
    correlation, p_value = stats.pearsonr(row['A'], row['B'])
    return correlation
# Applying the function
df['Correlation'] = df.apply(correlation, axis = 1)
df
A B Correlation
0 [10, 11, 12] [17, 18, 12] -0.777714
1 [13, 14, 15] [21, 22, 13] -0.810885
If I have many columns, the above script is not ideal. Can I use stats.pearsonr directly in a lambda to get the same result?
Any suggestions would be appreciated. Thanks!
I recommend using zip in a list comprehension:
df['out'] = [stats.pearsonr(x, y)[0] for x, y in zip(df.A, df.B)]
df
Out[163]:
A B out
0 [10, 11, 12] [17, 18, 12] -0.777714
1 [13, 14, 15] [21, 22, 13] -0.810885
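If there are several column pairs to process, the zip approach can be wrapped in a loop. The sketch below is one possible arrangement: the pairs list, the pearson helper, and the corr_A_B column name are hypothetical, and np.corrcoef is swapped in for stats.pearsonr (it yields the same Pearson coefficient, without the p-value).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [[10, 11, 12], [13, 14, 15]],
                   'B': [[17, 18, 12], [21, 22, 13]]})

def pearson(x, y):
    # np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r(x, y).
    return np.corrcoef(x, y)[0, 1]

# Hypothetical list of column pairs to correlate row by row.
pairs = [('A', 'B')]
for left, right in pairs:
    df[f'corr_{left}_{right}'] = [
        pearson(x, y) for x, y in zip(df[left], df[right])
    ]

print(df)
```

Adding another pair then only means appending one more tuple to pairs.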

For loop to obtain sum and mean on np 3d array

I have the following array
arr = np.array([[[1, 2, 3], [4, 5, 6]],
[[7, 8, 9], [10, 11, 12]]])
I want to go through each element and sum on axis 0, so I do:
lst = []
for x in arr:
    for y in np.sum(x,axis=0):
        lst.append(y)
where now the lst is
[5, 7, 9, 17, 19, 21]
However I want the output to be in the following form:
[[5, 7, 9], [17, 19, 21]]
to then take the mean of its axis 0 namely (5+17)/2 and so on. The final output should look like
[11., 13., 15.]
How can I do this? Is it possible to write the whole operation compactly as a list comprehension?
Update: To get the final output I can do:
np.mean(np.reshape(lst, (len(arr),-1)),axis=0)
Yet I am sure there is a more Pythonic way of doing this.
In [5]: arr = np.array([[[1, 2, 3], [4, 5, 6]],
...: [[7, 8, 9], [10, 11, 12]]])
In [7]: arr
Out[7]:
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
The for loop iterates over the 1st dimension, as though it were a list of arrays:
In [8]: for x in arr:print(x)
[[1 2 3]
[4 5 6]]
[[ 7 8 9]
[10 11 12]]
list(arr) also makes a list of those arrays (but it is slower than arr.tolist(), which converts all the way down to nested Python lists).
One common way of iterating on other dimensions is to use an index:
In [10]: for i in range(2):print(arr[:,i])
[[1 2 3]
[7 8 9]]
[[ 4 5 6]
[10 11 12]]
You could also transpose the array placing the desired axis first.
But you don't need to iterate:
In [13]: arr.sum(axis=1)
Out[13]:
array([[ 5, 7, 9],
[17, 19, 21]])
In [14]: arr.sum(axis=1).mean(axis=0)
Out[14]: array([11., 13., 15.])
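For comparison, here is a short sketch of the list-comprehension form the question asked about next to the vectorized form. The names per_block, mean_from_list and mean_vectorized are only illustrative.

```python
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]],
                [[7, 8, 9], [10, 11, 12]]])

# List-comprehension form: one summed row per block along the first axis.
per_block = [x.sum(axis=0) for x in arr]
mean_from_list = np.mean(per_block, axis=0)

# Vectorized form: sum over axis 1, then average over axis 0.
mean_vectorized = arr.sum(axis=1).mean(axis=0)

print(per_block)
print(mean_from_list, mean_vectorized)
```

Both forms give the same numbers. The vectorized one avoids the Python-level loop, which tends to matter more as the array grows.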

Select elements of a numpy array based on the elements of a second array

Consider a numpy array A of shape (7,6)
A = array([[0, 1, 2, 3, 5, 8],
[4, 100, 6, 7, 8, 7],
[8, 9, 10, 11, 5, 4],
[12, 13, 14, 15, 1, 2],
[1, 3, 5, 6, 4, 8],
[12, 23, 12, 24, 4, 3],
[1, 3, 5, 7, 89, 0]])
together with a second numpy array r of the same shape, which holds the radius (distance) of each element of A from a central point at index (3,2), where r = 0:
r = array([[3, 3, 3, 3, 3, 4],
[2, 2, 2, 2, 2, 3],
[2, 1, 1, 1, 2, 3],
[2, 1, 0, 1, 2, 3],
[2, 1, 1, 1, 2, 3],
[2, 2, 2, 2, 2, 3],
[3, 3, 3, 3, 3, 4]])
I would like to pick out all the elements of A located where r equals 1, i.e. [9,10,11,15,4,6,5,13], then all the elements of A located where r equals 2, and so on. Is there some numpy function to do that?
Thank you
You can select a section of A with something like A[r == 1]. To get all the sections as a list, you could do [A[r == i] for i in range(r.max() + 1)]. This works, but it may be inefficient if r contains large values, because r == i has to be computed for every i.
Alternatively, sort A based on r first, then split the sorted array at the right places. That looks something like this:
r_flat = r.ravel()
order = r_flat.argsort()
A_sorted = A.ravel()[order]
r_sorted = r_flat[order]
edges = r_sorted.searchsorted(np.arange(r_sorted[-1] + 1), 'right')
sections = []
start = 0
for end in edges:
    sections.append(A_sorted[start:end])
    start = end
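A self-contained version of the sort-and-split idea, using the arrays from the question, could look like the sketch below. Two details are choices made here rather than part of the snippet above: kind='stable' is passed to argsort so that elements within each radius keep their row-major order, and np.split replaces the explicit loop.

```python
import numpy as np

A = np.array([[0, 1, 2, 3, 5, 8],
              [4, 100, 6, 7, 8, 7],
              [8, 9, 10, 11, 5, 4],
              [12, 13, 14, 15, 1, 2],
              [1, 3, 5, 6, 4, 8],
              [12, 23, 12, 24, 4, 3],
              [1, 3, 5, 7, 89, 0]])
r = np.array([[3, 3, 3, 3, 3, 4],
              [2, 2, 2, 2, 2, 3],
              [2, 1, 1, 1, 2, 3],
              [2, 1, 0, 1, 2, 3],
              [2, 1, 1, 1, 2, 3],
              [2, 2, 2, 2, 2, 3],
              [3, 3, 3, 3, 3, 4]])

r_flat = r.ravel()
# A stable sort keeps row-major order inside each radius group.
order = r_flat.argsort(kind='stable')
A_sorted = A.ravel()[order]
r_sorted = r_flat[order]

# Right edge of each radius value 0..r.max() in the sorted array.
edges = r_sorted.searchsorted(np.arange(r_sorted[-1] + 1), 'right')
# Cut at every edge except the last one, which is the end of the array.
sections = np.split(A_sorted, edges[:-1])

for radius, values in enumerate(sections):
    print(radius, values)
```

Here sections[i] holds the elements of A where r equals i, so the comparison r == i is never computed for each radius separately.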
I get a different answer to the one you were expecting (3 not 4 from the 4th row) and the order is slightly different (strictly row then column), but:
>>> A
array([[ 0, 1, 2, 3, 5, 8],
[ 4, 100, 6, 7, 8, 7],
[ 8, 9, 10, 11, 5, 4],
[ 12, 13, 14, 15, 1, 2],
[ 1, 3, 5, 6, 4, 8],
[ 12, 23, 12, 24, 4, 3],
[ 1, 3, 5, 7, 89, 0]])
>>> r
array([[3, 3, 3, 3, 3, 4],
[2, 2, 2, 2, 2, 3],
[2, 1, 1, 1, 2, 3],
[2, 1, 0, 1, 2, 3],
[2, 1, 1, 1, 2, 3],
[2, 2, 2, 2, 2, 3],
[3, 3, 3, 3, 3, 4]])
>>> A[r==1]
array([ 9, 10, 11, 13, 15, 3, 5, 6])
Alternatively, you can get column then row ordering by transposing both arrays:
>>> A.T[r.T==1]
array([ 9, 13, 3, 10, 5, 11, 15, 6])