Pandas - Row mask and 2D ndarray assignment - pandas

I'm having some problems with pandas. I think I'm not using it properly, and I would like some help doing it right.
I have a mask for the rows of a dataframe; the mask is a simple list of Boolean values.
I would like to assign a 2D array to a new or existing column, one row of the array per selected row.
mask = some_row_mask()
my2darray = some_operation(dataframe.loc[mask, column])
dataframe.loc[mask, new_or_exist_column] = my2darray
# Also tried this
dataframe.loc[mask, new_or_exist_column] = [f for f in my2darray]
Example data:
dataframe = pd.DataFrame({'Fun': ['a', 'b', 'a'], 'Data': [10, 20, 30]})
mask = dataframe['Fun']=='a'
my2darray = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
column = 'Data'
new_or_exist_column = 'NewData'
Expected output
  Fun  Data          NewData
0   a    10  [0, 1, 2, 3, 4]
1   b    20              NaN
2   a    30  [4, 3, 2, 1, 0]
dataframe[mask] and my2darray both have exactly the same number of rows, but it always ends with:
ValueError: Must have equal len keys and value when setting with ndarray.
Thanks for your help!
EDIT - In context:
To add some detail: this is meant for filling in folds step by step. At each step I compute and set some values from a sub-part of the dataframe.
Instead of this, as suggested by Parth:
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
I changed it to this:
dataframe.loc[mask, out] = pd.Series([f for f in features], index=mask[mask==True].index)
Otherwise, all values that were already set are overwritten with NaN.
I forgot to mention this in the original question.
Thanks!

Try this:
dataframe[new_or_exist_column]=np.nan
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
It will give the desired output:
  Fun  Data          NewData
0   a    10  [0, 1, 2, 3, 4]
1   b    20              NaN
2   a    30  [4, 3, 2, 1, 0]
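For reference, here is a self-contained sketch of the same idea using the example data from the question. It wraps the rows of the 2D array in a Series labelled with the index of the masked rows, so pandas aligns on the index and leaves the other rows as NaN. The name `values` is only illustrative. Note that assigning to the whole column this way replaces anything already stored in rows outside the mask.

```python
import pandas as pd

dataframe = pd.DataFrame({'Fun': ['a', 'b', 'a'], 'Data': [10, 20, 30]})
mask = dataframe['Fun'] == 'a'
my2darray = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]

# One list per selected row, labelled with the index of the rows where mask is True.
values = pd.Series(list(my2darray), index=mask[mask].index, dtype=object)

# Column assignment aligns on the index; rows missing from `values` become NaN.
dataframe['NewData'] = values
print(dataframe)
```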

Related

drop columns according to header value

I have this dataframe with multiple headers
name, 00590BL, 01090BL, 01100MS, 02200MS
lat, 613297, 626278, 626323, 616720
long, 5185127, 5188418, 5188431, 5181393
elv, 1833, 1915, 1915, 1499
1956-01-01, 1, 2, 2, -2
1956-01-02, 2, 3, 3, -1
1956-01-03, 3, 4, 4, 0
1956-01-04, 4, 5, 5, 1
1956-01-05, 5, 6, 6, 2
I read this as
dfr = pd.read_csv(f_name,
                  skiprows = 0,
                  header = [0,1,2,3],
                  index_col = 0,
                  parse_dates = True
                  )
I would like to remove the columns 01090BL and 01100MS. The idea, in the main program, is to have a list of the columns that I want to remove and then drop them. I have therefore done the following:
2bremoved = ['01090BL', '01100MS']
dfr = dfr.drop(2bremoved, axis=1, inplace=True)
but I get the following error:
PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
/usr/lib/python3/dist-packages/pandas/core/frame.py:4906: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I have thus done the following:
aa = dfr.drop(2bremoved, axis=1, inplace=True,level = 0)
but I get an empty dataframe. What am I missing?
thanks
Don't use inplace=True when assigning the output: with inplace=True, drop modifies the dataframe itself and returns None. Also, a variable name cannot start with a digit in Python:
to_remove = ['01090BL', '01100MS']
aa = dfr.drop(to_remove, axis=1, level=0)
Output:
name       00590BL 02200MS
lat         613297  616720
long       5185127 5181393
elv           1833    1499
1956-01-01       1      -2
1956-01-02       2      -1
1956-01-03       3       0
1956-01-04       4       1
1956-01-05       5       2
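A runnable sketch of the same fix, in case it helps. It builds the frame directly with a MultiIndex instead of reading the CSV (an assumption made only to keep the example self-contained), drops by the 'name' level, and also shows why assigning the result of an inplace=True call leaves you with None. The names `kept`, `in_place` and `returned` are illustrative.

```python
import pandas as pd

# Four header levels, as in the question's file.
columns = pd.MultiIndex.from_tuples(
    [('00590BL', 613297, 5185127, 1833),
     ('01090BL', 626278, 5188418, 1915),
     ('01100MS', 626323, 5188431, 1915),
     ('02200MS', 616720, 5181393, 1499)],
    names=['name', 'lat', 'long', 'elv'])
index = pd.date_range('1956-01-01', periods=5)
data = [[1, 2, 2, -2], [2, 3, 3, -1], [3, 4, 4, 0], [4, 5, 5, 1], [5, 6, 6, 2]]
dfr = pd.DataFrame(data, index=index, columns=columns)

to_remove = ['01090BL', '01100MS']

# Without inplace, drop returns a new dataframe and leaves dfr untouched.
kept = dfr.drop(columns=to_remove, level='name')

# With inplace=True the frame is modified and the call itself returns None.
in_place = dfr.copy()
returned = in_place.drop(columns=to_remove, level='name', inplace=True)

print(kept)
print(returned)
```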

tensor cumulative addition

Suppose I have the following two tensors
count = torch.tensor([5, 3], dtype = torch.long)
label = torch.tensor([1,1,0,0,2,0,0,1], dtype = torch.long)
I want to add a value k = 3 to label according to count. With
count = torch.tensor([5, 3], dtype = torch.long)
we add 0 to the first 5 elements of label and 3 to elements 6 to 8 of label, so the result should look like
torch.tensor([1, 1, 0, 0, 2, 3, 3, 4], dtype = torch.long)
How can this be made to work in the general case?
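One possible sketch, under the assumption that the i-th group (of size count[i]) should get i * k added, which matches the example (0 for the first group, 3 for the second). It expands the per-group offsets with torch.repeat_interleave. The function name `add_group_offsets` is hypothetical, and count is assumed to sum to the length of label.

```python
import torch

def add_group_offsets(label, count, k):
    # Offset for group i is i * k (assumption inferred from the example).
    offsets = torch.arange(count.numel(), dtype=label.dtype, device=label.device) * k
    # Repeat each group's offset count[i] times so it lines up with label.
    return label + torch.repeat_interleave(offsets, count)

count = torch.tensor([5, 3], dtype=torch.long)
label = torch.tensor([1, 1, 0, 0, 2, 0, 0, 1], dtype=torch.long)
result = add_group_offsets(label, count, k=3)
print(result)
```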

How can I speed up this function in Python?

I am trying to figure out a way to speed up this function. I am trying to do all pairwise comparisons between the rows and columns of a dataframe (pairwise_df) and store the result. The comparison requires two numpy arrays of continuous values taken from another dataframe (df).
pairwise_df = pd.DataFrame(index = ['insert1', 'insert2', 'insert3'], columns = ['insert1', 'insert2', 'insert3'])
df = pd.DataFrame(data = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
[2, 3, 4, 5, 7, 9, 10, 1, 2, 3]], index = ['insert1', 'insert2', 'insert3'], columns = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
for row in list(pairwise_df.index.values):
    for col in list(pairwise_df):
        pairwise_df.at[row, col] = cosine_sim(np.array(df.loc[row]), np.array(df.loc[col]))
This works, but takes about 18 minutes to run on a 2000 x 2000 dataframe. I'm sure there are ways to speed this up, but my programming experience is minimal.
The cosine_sim function is here, but the function used will vary so it doesn't matter too much:
def cosine_sim(x, y):
    dot = np.dot(x, y)
    norma = np.linalg.norm(x)
    normb = np.linalg.norm(y)
    cos = dot / (norma * normb)
    return cos
Thanks!
You can avoid loops to compute cosine similarity by creating the array of all combinations using np.tile and np.reshape. The trick here is to use np.einsum to replace the dot product.
m = df.values
x = np.tile(m, m.shape[0]).reshape(-1, m.shape[1])
y = np.tile(m.T, m.shape[0]).T
c = np.einsum('ij,ij->i', x, y) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
>>> c.reshape(-1, m.shape[0])
array([[1.        , 0.57142857, 0.75283826],
       [0.57142857, 1.        , 0.74102903],
       [0.75283826, 0.74102903, 1.        ]])
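The tiled arrays above hold one row per pair, so their size grows with the square of the number of rows. For cosine similarity specifically, an alternative sketch that avoids building those arrays is to normalise each row to unit length and take a single matrix product. This is a different technique from the tile/einsum one above and only applies when the comparison function is cosine similarity; the names `unit` and `sim` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
          [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
          [2, 3, 4, 5, 7, 9, 10, 1, 2, 3]],
    index=['insert1', 'insert2', 'insert3'],
    columns=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

m = df.to_numpy(dtype=float)

# Scale every row to unit length; the dot product of two unit rows is their cosine.
unit = m / np.linalg.norm(m, axis=1, keepdims=True)
sim = unit @ unit.T

# Put the result back into a labelled dataframe, as pairwise_df was in the question.
pairwise_df = pd.DataFrame(sim, index=df.index, columns=df.index)
print(pairwise_df)
```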

Comparing a NumPy array to another

I have 2 NumPy arrays, such as:
correct = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
predicted = np.array([1, 1, 2, 1, 1, 1, 2, 2, 1, 2])
I would like to create 2 new arrays, which contain the indices of 1's incorrectly predicted as something else and the indices of 2's incorrectly predicted as something else, respectively. Desired result:
incorrect_ones = [2]
incorrect_twos = [5, 8]
There just has to be some NumPy way to achieve this... Any ideas?
Thanks.
Calculate the boolean conditions and find the indexes of the locations of the True values:
np.where((correct == 1) & (predicted == 2))[0]
# array([2])
np.where((correct == 2) & (predicted == 1))[0]
# array([5, 8])
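Since the question asks for labels predicted as "something else", a slightly more general sketch compares against the label itself rather than a specific wrong value, so it still works with more than two classes. The helper name `misclassified` is hypothetical.

```python
import numpy as np

correct = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
predicted = np.array([1, 1, 2, 1, 1, 1, 2, 2, 1, 2])

def misclassified(correct, predicted, label):
    # Indices where the true value is `label` but the prediction is anything else.
    return np.flatnonzero((correct == label) & (predicted != label))

incorrect_ones = misclassified(correct, predicted, 1)
incorrect_twos = misclassified(correct, predicted, 2)
print(incorrect_ones, incorrect_twos)
```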

update values in dataframe

I have a dataframe in which the second column holds an array in each row. I have another dataframe with 2 columns, whose values should be used to update the arrays in the first dataframe.
I have already tried the update, explode, map and assign methods.
df = pd.DataFrame({'Account': ['A1','A2','A3']})
groups = [['g1','g2'],['g3','g4'],['g1','g2','g3']]  # plain list: the rows have different lengths
df["Group"] = groups
key_values = pd.DataFrame({'ID': ['1','2','3','4','5'],'Group': ['g1','g2','g3','g4','g5']})
keys = key_values.set_index('Group')['ID']
ag = df.explode('Group')
Setup
m = key_values.set_index('Group')['ID']
Option 1
explode + map
f = df.explode('Group')
res = f['Group'].map(m).groupby(level=0).agg(list)
0       [1, 2]
1       [3, 4]
2    [1, 2, 3]
Name: Group, dtype: object
Option 2
List comprehension + map
res = [[*map(m.get, el)] for el in df['Group']]
[['1', '2'], ['3', '4'], ['1', '2', '3']]
To assign it back:
df.assign(Group=res)
  Account      Group
0      A1     [1, 2]
1      A2     [3, 4]
2      A3  [1, 2, 3]
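Putting Option 1 together as one self-contained sketch, with the question's data written as plain lists (the names `exploded`, `ids` and `result` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Account': ['A1', 'A2', 'A3'],
                   'Group': [['g1', 'g2'], ['g3', 'g4'], ['g1', 'g2', 'g3']]})
key_values = pd.DataFrame({'ID': ['1', '2', '3', '4', '5'],
                           'Group': ['g1', 'g2', 'g3', 'g4', 'g5']})

# Lookup Series: group name -> ID.
m = key_values.set_index('Group')['ID']

# One row per list element; the original row label is repeated in the index.
exploded = df['Group'].explode()

# Map each group to its ID, then collect the IDs back into one list per original row.
ids = exploded.map(m).groupby(level=0).agg(list)

result = df.assign(Group=ids)
print(result)
```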
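Another way: first convert the lists to strings and replace the group names in them, then convert the strings back to lists using ast. Note that with regex=True the replacement matches substrings, so a name like g1 would also match inside g10.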
import ast
df['keys']=df.astype(str).replace(to_replace=list(key_values['Group']),value=list(key_values['ID']),regex=True)['Group']
df['keys']=df['keys'].apply(lambda x: ast.literal_eval(x))
print(df)
  Account         Group       keys
0      A1      [g1, g2]     [1, 2]
1      A2      [g3, g4]     [3, 4]
2      A3  [g1, g2, g3]  [1, 2, 3]