Efficient way to compare values between two cells and assign a value based on a condition in NumPy / pandas

The objective is to count how frequently two nodes have the same value.
Say, for example, we have a vector
pd.DataFrame([0,4,1,1,1],index=['A','B','C','D','E'])
as below
0
A 0
B 4
C 1
D 1
E 1
The element Nij is equal to 1 if nodes i and j have the same value and equal to zero otherwise.
N is then
A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 1 1
D 0 0 1 1 1
E 0 0 1 1 1
This simple example can be extended to 2D. For example, here we create an array of shape (4, 5):
A B C D E
0 0 0 0 0 0
1 0 4 1 1 1
2 0 1 1 2 2
3 0 3 2 2 2
Similarly, we go row by row and set the element Nij to 1 if nodes i and j have the same value and to zero otherwise. At each row iteration, we accumulate the cell values.
The frequency is then equal to
A B C D E
A 4.0 1.0 1.0 1.0 1.0
B 1.0 4.0 2.0 1.0 1.0
C 1.0 2.0 4.0 3.0 3.0
D 1.0 1.0 3.0 4.0 4.0
E 1.0 1.0 3.0 4.0 4.0
Based on this, the following code is proposed. However, the current implementation uses 3 for-loops and an if-else statement.
I am curious whether the code below can be enhanced further, or whether there is a built-in method within pandas or NumPy that can be used to achieve the same objective.
import numpy as np

arr = [[0, 0, 0, 0, 0],
       [0, 4, 1, 1, 1],
       [0, 1, 1, 2, 2],
       [0, 3, 2, 2, 2]]
arr = np.array(arr)

npart = len(arr[:, 0])   # number of rows
m = len(arr[0, :])       # number of columns
X = np.zeros(shape=(m, m), dtype=np.double)
for i in range(npart):
    for k in range(m):
        for p in range(m):
            # Check whether the pair of columns has the same value in this row
            if arr[i, k] == arr[i, p]:
                X[k, p] = X[k, p] + 1
            else:
                X[k, p] = X[k, p] + 0
Output:
4.00000,1.00000,1.00000,1.00000,1.00000
1.00000,4.00000,2.00000,1.00000,1.00000
1.00000,2.00000,4.00000,3.00000,3.00000
1.00000,1.00000,3.00000,4.00000,4.00000
1.00000,1.00000,3.00000,4.00000,4.00000
P.S. The index A, B, C, D, E and the use of pandas are for clarification purposes only.

With numpy, you can use broadcasting:
1D
a = np.array([0,4,1,1,1])
(a==a[:, None])*1
output:
array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 1, 1],
       [0, 0, 1, 1, 1],
       [0, 0, 1, 1, 1]])
2D
a = np.array([[0, 0, 0, 0, 0],
              [0, 4, 1, 1, 1],
              [0, 1, 1, 2, 2],
              [0, 3, 2, 2, 2]])
(a.T == a.T[:,None]).sum(2)
output:
array([[4, 1, 1, 1, 1],
       [1, 4, 2, 1, 1],
       [1, 2, 4, 3, 3],
       [1, 1, 3, 4, 4],
       [1, 1, 3, 4, 4]])
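Since the A–E labels are only used for presentation, the broadcasting result can be wrapped back into a labelled DataFrame. A small sketch reusing the 2D result above:
import numpy as np
import pandas as pd

a = np.array([[0, 0, 0, 0, 0],
              [0, 4, 1, 1, 1],
              [0, 1, 1, 2, 2],
              [0, 3, 2, 2, 2]])
labels = ['A', 'B', 'C', 'D', 'E']

# same broadcasting trick, wrapped in a labelled frequency table
freq = pd.DataFrame((a.T == a.T[:, None]).sum(2), index=labels, columns=labels)
print(freq)
This reproduces the frequency table from the question, with the A–E labels on both axes.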

Related

Add a new column when the values in 2 other columns are the same

I want to add a new index in a new column e when b and c are the same.
At the same time, I need to respect the limit sum(d) <= 20: if the total of d for rows with the same b and c exceeds 20, then a new index should be given.
The example input data is below:
a  b  c   d
0  0  2   9
1  2  1  10
2  1  0   9
3  1  0  11
4  2  1   9
5  0  1  15
6  2  0   9
7  1  0   8
I sort by b and c first to make the comparison easier, but then I got KeyError: 0 at temporary_size += df.loc[df[i], 'd'].
I hope the result looks like this:
a  b  c   d  e
5  0  1  15  1
0  0  2   9  2
2  1  0   9  3
3  1  0  11  3
7  1  0   8  4
6  2  0   9  5
1  2  1  10  6
4  2  1   9  6
and here is my code:
import pandas as pd

d = {'a': [0, 1, 2, 3, 4, 5, 6, 7], 'b': [0, 2, 1, 1, 2, 0, 2, 1], 'c': [2, 1, 0, 0, 1, 1, 0, 0], 'd': [9, 10, 9, 11, 9, 15, 9, 8]}
df = pd.DataFrame(data=d)
print(df)
df.sort_values(['b', 'c'], ascending=[True, True], inplace=True, ignore_index=True)
e_id = 0
total_size = 20
temporary_size = 0
for i in range(0, len(df.index) - 1):
    if df.loc[i, 'b'] == df.loc[i + 1, 'b'] and df.loc[i, 'c'] != df.loc[i + 1, 'c']:
        temporary_size = temporary_size + df.loc[i, 'd']
        if temporary_size <= total_size:
            df.loc['e', i] = e_id  # note: row/column labels are swapped here, so this adds a row labelled 'e' instead of filling column 'e'
        else:
            df.loc[i, 'e'] = e_id
            temporary_size = temporary_size + df.loc[i, 'd']
            e_id += 1
    else:
        df.loc[i, 'e'] = e_id
        temporary_size = temporary_size + df.loc[i, 'd']
print(df)
In the end, I can't get the new column e in my dataframe.
Thanks for any help!
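Since no answer is attached to this question, here is a minimal sketch of one way to produce the expected column e, assuming the intended rule is: rows sharing the same (b, c) pair get the same index, and a new index starts whenever the running sum of d within that pair would exceed 20.
import pandas as pd

d = {'a': [0, 1, 2, 3, 4, 5, 6, 7],
     'b': [0, 2, 1, 1, 2, 0, 2, 1],
     'c': [2, 1, 0, 0, 1, 1, 0, 0],
     'd': [9, 10, 9, 11, 9, 15, 9, 8]}
df = pd.DataFrame(data=d).sort_values(['b', 'c'], ignore_index=True)

limit = 20        # assumed maximum sum of d per index
e_id = 0          # running index for column e
running = 0       # running sum of d under the current index
prev_key = None   # (b, c) of the previous row
e = []
for i in range(len(df)):
    key = (df.loc[i, 'b'], df.loc[i, 'c'])
    if key != prev_key or running + df.loc[i, 'd'] > limit:
        # new (b, c) pair, or the limit would be exceeded: start a new index
        e_id += 1
        running = 0
    running += df.loc[i, 'd']
    e.append(e_id)
    prev_key = key
df['e'] = e
print(df)
With the example data this gives e = 1, 2, 3, 3, 4, 5, 6, 6 in sorted order, matching the expected output above.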

How to choose 2D diagonals of a 3D NumPy array

I define an array as:
XRN = np.array([[[0,1,0,1,0,1,0,1,0,1],
                 [0,1,1,0,0,1,0,1,0,1],
                 [0,1,0,0,1,1,0,1,0,1],
                 [0,1,0,1,0,0,1,1,0,1]],
                [[0,1,0,1,0,1,1,0,0,1],
                 [0,1,0,1,0,1,0,1,1,0],
                 [1,1,1,0,0,0,0,1,0,1],
                 [0,1,0,1,0,0,1,1,0,1]],
                [[0,1,0,1,0,1,1,1,0,0],
                 [0,1,0,1,1,1,0,1,0,0],
                 [0,1,0,1,1,0,0,1,0,1],
                 [0,1,0,1,0,0,1,1,0,1]]])
print(XRN.shape,XRN)
XRN_LEN = XRN.shape[1]
I can obtain the sum over the inner matrices with:
XRN_UP = XRN.sum(axis=1)
print("XRN_UP",XRN_UP.shape,XRN_UP)
XRN_UP (3, 10) [[0 4 1 2 1 3 1 4 0 4]
[1 4 1 3 0 2 2 3 1 3]
[0 4 0 4 2 2 2 4 0 2]]
I want to get the sum of all diagonals with the same shape (3,10)
I tested the code :
RIGHT = [XRN.diagonal(i,axis1=0,axis2=1).sum(axis=1) for i in range(XRN_LEN)]
np_RIGHT = np.array(RIGHT)
print("np_RIGHT=",np_RIGHT.shape,np_RIGHT)
but got
np_RIGHT= (4, 10) [[0 3 0 3 1 2 0 3 1 2]
[1 3 2 1 0 1 1 3 0 3]
[0 2 0 1 1 1 1 2 0 2]
[0 1 0 1 0 0 1 1 0 1]]
I checked all values for axis1 and axis2 but never got the shape (3, 10). How can I do this?
axis1 axis2 shape
0 1 (4,10)
0 2 (4,4)
1 0 (4,10)
1 2 (4,3)
2 0 (4,4)
2 1 (4,3)
If I understand correctly, you want to sum all possible diagonals on the three elements separately. If that's the case, then you must apply np.diagonal on axis1=1 and axis2=2. This way, you end up with 10 diagonals per element which you sum down to 10 values per element. There are 3 elements, so the resulting shape is (10, 3):
>>> np.array([XRN.diagonal(i, 1, 2).sum(1) for i in range(XRN.shape[-1])])
array([[2, 3, 2],
       [2, 1, 2],
       [1, 1, 2],
       [3, 2, 3],
       [2, 2, 2],
       [2, 2, 2],
       [2, 3, 3],
       [2, 2, 2],
       [1, 0, 0],
       [1, 1, 0]])
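If the (3, 10) orientation asked for in the question is preferred, transposing the stacked result should be enough. A small sketch reusing XRN from above:
# sum each diagonal of the inner (4, 10) matrices, then transpose so the
# result has one row of diagonal sums per outer element, i.e. shape (3, 10)
RIGHT = np.array([XRN.diagonal(i, 1, 2).sum(1) for i in range(XRN.shape[-1])]).T
print(RIGHT.shape)  # (3, 10)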

Pandas index clause across multiple columns in a multi-column header

I have a data frame with multi-column headers.
import pandas as pd
headers = pd.MultiIndex.from_tuples([("A", "u"), ("A", "v"), ("B", "x"), ("B", "y")])
f = pd.DataFrame([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], columns = headers)
f
A B
u v x y
0 1 1 0 1
1 1 0 0 0
2 0 0 1 1
3 1 0 1 0
I want to select the rows in which any of the A columns or any of the B columns are true.
I can do so explicitly.
f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
A B
u v x y
0 1 1 0 1
1 1 0 0 0
3 1 0 1 0
f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
A B
u v x y
0 1 1 0 1
2 0 0 1 1
3 1 0 1 0
I want to write a function select(f, top_level_name) where the indexing clause applies to all the columns under the same top level name such that
select(f, "A") == f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
select(f, "B") == f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
I want this function to work with arbitrary numbers of sub-columns with arbitrary names.
How do I write select?
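No answer is attached to this question; here is a minimal sketch of select, assuming the intended condition is "any sub-column under the top-level name is true", which is what the explicit | examples compute:
def select(frame, top_level_name):
    # take all sub-columns under the given top-level name,
    # cast them to bool and keep rows where any of them is true
    block = frame[top_level_name].astype(bool)
    return frame[block.any(axis=1)]

# usage with the frame f defined above
print(select(f, "A"))
print(select(f, "B"))
Because frame[top_level_name] returns all sub-columns under that name, this works for an arbitrary number of arbitrarily named sub-columns.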

How to create new list column values from groupby

My goal is to create a new column c_list that contains a list after a groupby (without using merge): df['c_list'] = df.groupby('a').agg({'c':lambda x: list(x)})
import pandas as pd

df = pd.DataFrame(
    {'a': ['x', 'y', 'y', 'x'],
     'b': [2, 0, 0, 0],
     'c': [8, 2, 5, 6]}
)
df
Initial dataframe
a b c
0 x 2 8
1 y 0 2
2 y 0 5
3 x 0 6
Looking for:
a b c d
0 x 2 8 [6, 8]
1 y 0 2 [2, 5]
2 y 0 5 [2, 5]
3 x 0 6 [6, 8]
Try with transform
df['d']=df.groupby('a').c.transform(lambda x : [x.values.tolist()]*len(x))
0 [8, 6]
1 [2, 5]
2 [2, 5]
3 [8, 6]
Name: c, dtype: object
Or
df['d']=df.groupby('a').c.agg(list).reindex(df.a).values
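As a side note, the one-liner in the question does not align because the aggregation is indexed by the 'a' values rather than by the original row index. A map-based sketch that realigns it per row, equivalent to the reindex approach above:
# assuming df from the question above
df['d'] = df['a'].map(df.groupby('a')['c'].agg(list))
print(df)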

Create tensors where all elements up to a given index are 1s, the rest are 0s

I have a placeholder lengths = tf.placeholder(tf.int32, [10]). Each of the 10 values assigned to this placeholder are <= 25. I now want to create a 2-dimensional tensor, called masks, of shape [10, 25], where each of the 10 vectors of length 25 has the first n elements set to 1, and the rest set to 0 - with n being the corresponding value in lengths.
What is the easiest way to do this using TensorFlow's built in methods?
For example:
lengths = [4, 6, 7, ...]
-> masks = [[1, 1, 1, 1, 0, 0, 0, 0, ..., 0],
            [1, 1, 1, 1, 1, 1, 0, 0, ..., 0],
            [1, 1, 1, 1, 1, 1, 1, 0, ..., 0],
            ...
           ]
You can reshape lengths to a (10, 1) tensor, then compare it with another sequence of indices 0, 1, 2, ..., 24, which due to broadcasting results in True where the index is smaller than the length and False otherwise; then you can cast the boolean result to 1 and 0:
import tensorflow as tf

lengths = tf.constant([4, 6, 7])
n_features = 25

masks = tf.cast(tf.range(n_features) < tf.reshape(lengths, (-1, 1)), tf.int8)
with tf.Session() as sess:
    print(sess.run(masks))
#[[1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# [1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# [1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
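As an aside, TensorFlow also provides tf.sequence_mask, which builds exactly this kind of mask directly. A minimal sketch in the same TF 1.x style as the answer above:
import tensorflow as tf

lengths = tf.constant([4, 6, 7])
# first lengths[i] positions are True, the rest False, then cast to 0/1
masks = tf.cast(tf.sequence_mask(lengths, maxlen=25), tf.int8)
with tf.Session() as sess:
    print(sess.run(masks))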