Efficient way to compare values between two cells and assign a value based on a condition in NumPy / pandas

The objective is to count how frequently two nodes have the same value.
Say, for example, we have a vector
pd.DataFrame([0,4,1,1,1],index=['A','B','C','D','E'])
as below
0
A 0
B 4
C 1
D 1
E 1
The element Nij is equal to 1 if nodes i and j have the same value and equal to zero otherwise.
N is then
A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 1 1
D 0 0 1 1 1
E 0 0 1 1 1
This simple example can be extended to 2D. For example, here we create an array of shape (4, 5):
A B C D E
0 0 0 0 0 0
1 0 4 1 1 1
2 0 1 1 2 2
3 0 3 2 2 2
Similarly, we go row by row and set the element Nij to 1 if nodes i and j have the same value and to zero otherwise. At each row iteration, we accumulate the cell values.
The frequency is then equal to
A B C D E
A 4.0 1.0 1.0 1.0 1.0
B 1.0 4.0 2.0 1.0 1.0
C 1.0 2.0 4.0 3.0 3.0
D 1.0 1.0 3.0 4.0 4.0
E 1.0 1.0 3.0 4.0 4.0
Based on this, the following code is proposed. However, the current implementation uses 3 for-loops and an if-else statement.
I am curious whether the code below can be enhanced further, or whether there is a built-in method within pandas or NumPy that can be used to achieve the same objective.
import numpy as np

arr = [[0, 0, 0, 0, 0],
       [0, 4, 1, 1, 1],
       [0, 1, 1, 2, 2],
       [0, 3, 2, 2, 2]]
arr = np.array(arr)

npart = len(arr[:, 0])   # number of rows
m = len(arr[0, :])       # number of columns
X = np.zeros(shape=(m, m), dtype=np.double)
for i in range(npart):
    for k in range(m):
        for p in range(m):
            # Check whether the pair of columns has the same value in this row
            if arr[i, k] == arr[i, p]:
                X[k, p] = X[k, p] + 1
            else:
                X[k, p] = X[k, p] + 0
Output:
4.00000,1.00000,1.00000,1.00000,1.00000
1.00000,4.00000,2.00000,1.00000,1.00000
1.00000,2.00000,4.00000,3.00000,3.00000
1.00000,1.00000,3.00000,4.00000,4.00000
1.00000,1.00000,3.00000,4.00000,4.00000
P.S. The index A, B, C, D, E and the use of pandas are for clarification purposes only.

With numpy, you can use broadcasting:
1D
a = np.array([0,4,1,1,1])
(a==a[:, None])*1
output:
array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 1, 1],
       [0, 0, 1, 1, 1],
       [0, 0, 1, 1, 1]])
2D
a = np.array([[0, 0, 0, 0, 0],
              [0, 4, 1, 1, 1],
              [0, 1, 1, 2, 2],
              [0, 3, 2, 2, 2]])
(a.T == a.T[:,None]).sum(2)
output:
array([[4, 1, 1, 1, 1],
       [1, 4, 2, 1, 1],
       [1, 2, 4, 3, 3],
       [1, 1, 3, 4, 4],
       [1, 1, 3, 4, 4]])
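Since the A–E labels are only used for presentation, the broadcasting result can be wrapped back into a labelled DataFrame. A small sketch reusing the 2D result above:
import numpy as np
import pandas as pd

a = np.array([[0, 0, 0, 0, 0],
              [0, 4, 1, 1, 1],
              [0, 1, 1, 2, 2],
              [0, 3, 2, 2, 2]])
labels = ['A', 'B', 'C', 'D', 'E']

# same broadcasting trick, wrapped in a labelled frequency table
freq = pd.DataFrame((a.T == a.T[:, None]).sum(2), index=labels, columns=labels)
print(freq)
This reproduces the frequency table from the question, with the A–E labels on both axes.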

Related

Add a new column when the values in 2 other columns are the same

I want to add a new index in a new column e when b and c are the same.
At the same time, I need to respect the limit sum(d) <= 20: if the total of d for rows with the same b and c exceeds 20, then a new index should be given.
The example input data is below:
a  b  c   d
0  0  2   9
1  2  1  10
2  1  0   9
3  1  0  11
4  2  1   9
5  0  1  15
6  2  0   9
7  1  0   8
I sort by b and c first to make the comparison easier, but then I got KeyError: 0 at temporary_size += df.loc[df[i], 'd'].
I hope the result looks like this:
a  b  c   d  e
5  0  1  15  1
0  0  2   9  2
2  1  0   9  3
3  1  0  11  3
7  1  0   8  4
6  2  0   9  5
1  2  1  10  6
4  2  1   9  6
and here is my code:
import pandas as pd

d = {'a': [0, 1, 2, 3, 4, 5, 6, 7], 'b': [0, 2, 1, 1, 2, 0, 2, 1], 'c': [2, 1, 0, 0, 1, 1, 0, 0], 'd': [9, 10, 9, 11, 9, 15, 9, 8]}
df = pd.DataFrame(data=d)
print(df)
df.sort_values(['b', 'c'], ascending=[True, True], inplace=True, ignore_index=True)
e_id = 0
total_size = 20
temporary_size = 0
for i in range(0, len(df.index) - 1):
    if df.loc[i, 'b'] == df.loc[i + 1, 'b'] and df.loc[i, 'c'] != df.loc[i + 1, 'c']:
        temporary_size = temporary_size + df.loc[i, 'd']
        if temporary_size <= total_size:
            df.loc['e', i] = e_id  # note: row/column labels are swapped here, so this adds a row labelled 'e' instead of filling column 'e'
        else:
            df.loc[i, 'e'] = e_id
            temporary_size = temporary_size + df.loc[i, 'd']
            e_id += 1
    else:
        df.loc[i, 'e'] = e_id
        temporary_size = temporary_size + df.loc[i, 'd']
print(df)
In the end, I can't get the new column e in my dataframe.
Thanks for any help!
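Since no answer is attached to this question, here is a minimal sketch of one way to produce the expected column e, assuming the intended rule is: rows sharing the same (b, c) pair get the same index, and a new index starts whenever the running sum of d within that pair would exceed 20.
import pandas as pd

d = {'a': [0, 1, 2, 3, 4, 5, 6, 7],
     'b': [0, 2, 1, 1, 2, 0, 2, 1],
     'c': [2, 1, 0, 0, 1, 1, 0, 0],
     'd': [9, 10, 9, 11, 9, 15, 9, 8]}
df = pd.DataFrame(data=d).sort_values(['b', 'c'], ignore_index=True)

limit = 20        # assumed maximum sum of d per index
e_id = 0          # running index for column e
running = 0       # running sum of d under the current index
prev_key = None   # (b, c) of the previous row
e = []
for i in range(len(df)):
    key = (df.loc[i, 'b'], df.loc[i, 'c'])
    if key != prev_key or running + df.loc[i, 'd'] > limit:
        # new (b, c) pair, or the limit would be exceeded: start a new index
        e_id += 1
        running = 0
    running += df.loc[i, 'd']
    e.append(e_id)
    prev_key = key
df['e'] = e
print(df)
With the example data this gives e = 1, 2, 3, 3, 4, 5, 6, 6 in sorted order, matching the expected output above.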

How to choose 2D diagonals of a 3D NumPy array

I define an array as:
XRN = np.array([[[0,1,0,1,0,1,0,1,0,1],
                 [0,1,1,0,0,1,0,1,0,1],
                 [0,1,0,0,1,1,0,1,0,1],
                 [0,1,0,1,0,0,1,1,0,1]],
                [[0,1,0,1,0,1,1,0,0,1],
                 [0,1,0,1,0,1,0,1,1,0],
                 [1,1,1,0,0,0,0,1,0,1],
                 [0,1,0,1,0,0,1,1,0,1]],
                [[0,1,0,1,0,1,1,1,0,0],
                 [0,1,0,1,1,1,0,1,0,0],
                 [0,1,0,1,1,0,0,1,0,1],
                 [0,1,0,1,0,0,1,1,0,1]]])
print(XRN.shape,XRN)
XRN_LEN = XRN.shape[1]
I can obtain the sum over the inner matrices with:
XRN_UP = XRN.sum(axis=1)
print("XRN_UP",XRN_UP.shape,XRN_UP)
XRN_UP (3, 10) [[0 4 1 2 1 3 1 4 0 4]
[1 4 1 3 0 2 2 3 1 3]
[0 4 0 4 2 2 2 4 0 2]]
I want to get the sum of all diagonals with the same shape (3,10)
I tested the code :
RIGHT = [XRN.diagonal(i,axis1=0,axis2=1).sum(axis=1) for i in range(XRN_LEN)]
np_RIGHT = np.array(RIGHT)
print("np_RIGHT=",np_RIGHT.shape,np_RIGHT)
but got
np_RIGHT= (4, 10) [[0 3 0 3 1 2 0 3 1 2]
[1 3 2 1 0 1 1 3 0 3]
[0 2 0 1 1 1 1 2 0 2]
[0 1 0 1 0 0 1 1 0 1]]
I checked all values for axis1 and axis2 but never got the shape (3, 10). How can I do this?
axis1 axis2 shape
0 1 (4,10)
0 2 (4,4)
1 0 (4,10)
1 2 (4,3)
2 0 (4,4)
2 1 (4,3)
If I understand correctly, you want to sum all possible diagonals on the three elements separately. If that's the case, then you must apply np.diagonal on axis1=1 and axis2=2. This way, you end up with 10 diagonals per element which you sum down to 10 values per element. There are 3 elements, so the resulting shape is (10, 3):
>>> np.array([XRN.diagonal(i, 1, 2).sum(1) for i in range(XRN.shape[-1])])
array([[2, 3, 2],
       [2, 1, 2],
       [1, 1, 2],
       [3, 2, 3],
       [2, 2, 2],
       [2, 2, 2],
       [2, 3, 3],
       [2, 2, 2],
       [1, 0, 0],
       [1, 1, 0]])
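If the (3, 10) orientation asked for in the question is preferred, transposing the stacked result should be enough. A small sketch reusing XRN from above:
# sum each diagonal of the inner (4, 10) matrices, then transpose so the
# result has one row of diagonal sums per outer element, i.e. shape (3, 10)
RIGHT = np.array([XRN.diagonal(i, 1, 2).sum(1) for i in range(XRN.shape[-1])]).T
print(RIGHT.shape)  # (3, 10)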

Pandas index clause across multiple columns in a multi-column header

I have a data frame with multi-column headers.
import pandas as pd
headers = pd.MultiIndex.from_tuples([("A", "u"), ("A", "v"), ("B", "x"), ("B", "y")])
f = pd.DataFrame([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], columns = headers)
f
A B
u v x y
0 1 1 0 1
1 1 0 0 0
2 0 0 1 1
3 1 0 1 0
I want to select the rows in which any of the A columns or any of the B columns are true.
I can do so explicitly.
f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
A B
u v x y
0 1 1 0 1
1 1 0 0 0
3 1 0 1 0
f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
A B
u v x y
0 1 1 0 1
2 0 0 1 1
3 1 0 1 0
I want to write a function select(f, top_level_name) where the indexing clause applies to all the columns under the same top level name such that
select(f, "A") == f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
select(f, "B") == f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
I want this function to work with arbitrary numbers of sub-columns with arbitrary names.
How do I write select?
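No answer is attached to this question; here is a minimal sketch of select, assuming the intended condition is "any sub-column under the top-level name is true", which is what the explicit | examples compute:
def select(frame, top_level_name):
    # take all sub-columns under the given top-level name,
    # cast them to bool and keep rows where any of them is true
    block = frame[top_level_name].astype(bool)
    return frame[block.any(axis=1)]

# usage with the frame f defined above
print(select(f, "A"))
print(select(f, "B"))
Because frame[top_level_name] returns all sub-columns under that name, this works for an arbitrary number of arbitrarily named sub-columns.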

How to create new list column values from groupby

My goal is to create a new column c_list that contains a list after a groupby (without using merge): df['c_list'] = df.groupby('a').agg({'c':lambda x: list(x)})
import pandas as pd

df = pd.DataFrame(
    {'a': ['x', 'y', 'y', 'x'],
     'b': [2, 0, 0, 0],
     'c': [8, 2, 5, 6]}
)
df
Initial dataframe
a b c
0 x 2 8
1 y 0 2
2 y 0 5
3 x 0 6
Looking for:
a b c d
0 x 2 8 [6, 8]
1 y 0 2 [2, 5]
2 y 0 5 [2, 5]
3 x 0 6 [6, 8]
Try with transform
df['d']=df.groupby('a').c.transform(lambda x : [x.values.tolist()]*len(x))
0 [8, 6]
1 [2, 5]
2 [2, 5]
3 [8, 6]
Name: c, dtype: object
Or
df['d']=df.groupby('a').c.agg(list).reindex(df.a).values
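As a side note, the one-liner in the question does not align because the aggregation is indexed by the 'a' values rather than by the original row index. A map-based sketch that realigns it per row, equivalent to the reindex approach above:
# assuming df from the question above
df['d'] = df['a'].map(df.groupby('a')['c'].agg(list))
print(df)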

Create tensors where all elements up to a given index are 1s, the rest are 0s

I have a placeholder lengths = tf.placeholder(tf.int32, [10]). Each of the 10 values assigned to this placeholder are <= 25. I now want to create a 2-dimensional tensor, called masks, of shape [10, 25], where each of the 10 vectors of length 25 has the first n elements set to 1, and the rest set to 0 - with n being the corresponding value in lengths.
What is the easiest way to do this using TensorFlow's built in methods?
For example:
lengths = [4, 6, 7, ...]
-> masks = [[1, 1, 1, 1, 0, 0, 0, 0, ..., 0],
            [1, 1, 1, 1, 1, 1, 0, 0, ..., 0],
            [1, 1, 1, 1, 1, 1, 1, 0, ..., 0],
            ...
           ]
You can reshape lengths to a (10, 1) tensor, then compare it with another sequence of indices 0, 1, 2, ..., 24, which due to broadcasting results in True where the index is smaller than the length and False otherwise; then you can cast the boolean result to 1 and 0:
import tensorflow as tf

lengths = tf.constant([4, 6, 7])
n_features = 25

masks = tf.cast(tf.range(n_features) < tf.reshape(lengths, (-1, 1)), tf.int8)
with tf.Session() as sess:
    print(sess.run(masks))
#[[1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# [1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# [1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
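As an aside, TensorFlow also provides tf.sequence_mask, which builds exactly this kind of mask directly. A minimal sketch in the same TF 1.x style as the answer above:
import tensorflow as tf

lengths = tf.constant([4, 6, 7])
# first lengths[i] positions are True, the rest False, then cast to 0/1
masks = tf.cast(tf.sequence_mask(lengths, maxlen=25), tf.int8)
with tf.Session() as sess:
    print(sess.run(masks))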