update values in dataframe - numpy

I have a dataframe in which the second column contains arrays. I have another dataframe with 2 columns, from which the values have to be mapped into the first dataframe.
I have already tried the update, explode, map, and assign methods.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Account': ['A1','A2','A3']})
groups = np.array([['g1','g2'],['g3','g4'],['g1','g2','g3']], dtype=object)
df["Group"] = groups.tolist()
key_values = pd.DataFrame({'ID': ['1','2','3','4','5'],'Group': ['g1','g2','g3','g4','g5']})
keys = key_values.set_index('Group')['ID']
ag = df.explode('Group')

Setup
m = key_values.set_index('Group')['ID']
Option 1
explode + map
f = df.explode('Group')
res = f['Group'].map(m).groupby(level=0).agg(list)
explode repeats the original index for each list element, so grouping on level=0 re-collects the mapped IDs into one list per row:
0       [1, 2]
1       [3, 4]
2    [1, 2, 3]
Name: Group, dtype: object
Option 2
List comprehension + map
res = [[*map(m.get, el)] for el in df['Group']]
[['1', '2'], ['3', '4'], ['1', '2', '3']]
To assign it back:
df.assign(Group=res)
  Account      Group
0      A1     [1, 2]
1      A2     [3, 4]
2      A3  [1, 2, 3]

First convert the lists to strings and do the replacement on the string form. Then convert them back to lists using ast.literal_eval:
import ast
df['keys'] = (df.astype(str)
                .replace(to_replace=list(key_values['Group']),
                         value=list(key_values['ID']), regex=True)['Group'])
df['keys'] = df['keys'].apply(ast.literal_eval)
print(df)
  Account         Group       keys
0      A1      [g1, g2]     [1, 2]
1      A2      [g3, g4]     [3, 4]
2      A3  [g1, g2, g3]  [1, 2, 3]
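One caveat with this approach: replace(..., regex=True) substitutes substrings, so overlapping group names (say g1 and g10) would be corrupted. A plain dict lookup per element avoids the string round-trip entirely (a small sketch using the same frames as above):
# Build a dict from the key table and map every list element through it;
# unknown groups come back as None rather than raising.
mapping = dict(zip(key_values['Group'], key_values['ID']))
df['keys'] = df['Group'].apply(lambda groups: [mapping.get(g) for g in groups])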

How to show rows with data which are not equal?

I have two tables
import pandas as pd
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df1 = pd.DataFrame(np.array([[1, 2, 4], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
print(df1.equals(df2))
I want to compare them. I want the same result I would get from df.compare(df1), or at least something close to it. I can't use that function, as my interpreter states that 'DataFrame' object has no attribute 'compare'.
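Worth knowing: DataFrame.compare was added in pandas 1.1.0, so that error just means the installed pandas predates it. For reference, on 1.1+ the built-in does this directly (a minimal sketch; output layout approximate):
# pandas >= 1.1.0 only: shows just the differing cells,
# with 'self' (df1) and 'other' (df2) sub-columns.
print(df1.compare(df2))
#       c
#    self other
# 0     4     3
If upgrading is not an option, the approaches below work on older versions.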
First approach:
Let's compare value by value:
In [1183]: eq_df = df1.eq(df2)
In [1196]: eq_df
Out[1196]:
      a     b      c
0  True  True  False
1  True  True   True
2  True  True   True
Then let's reduce it down to see which rows are equal for all columns:
from functools import reduce
In [1285]: eq_ser = reduce(np.logical_and, (eq_df[c] for c in eq_df.columns))
In [1288]: eq_ser
Out[1288]:
0    False
1     True
2     True
dtype: bool
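As an aside, the same reduction is available as a built-in method, which is equivalent to the reduce above:
In [1290]: eq_ser = eq_df.all(axis=1)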
Now we can print out the rows which are not equal:
In [1310]: df1[~eq_ser]
Out[1310]:
   a  b  c
0  1  2  4
In [1316]: df2[~eq_ser]
Out[1316]:
   a  b  c
0  1  2  3
Second approach:
from collections import namedtuple
from typing import Tuple

def diff_dataframes(
    df1, df2, compare_cols=None
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Given two dataframes and column(s) to compare, return three dataframes with rows:
    - common between the two dataframes
    - found only in the left dataframe
    - found only in the right dataframe
    """
    df1 = df1.fillna(pd.NA)
    df = df1.merge(df2.fillna(pd.NA), how="outer", on=compare_cols, indicator=True)
    df_both = df.loc[df["_merge"] == "both"].drop(columns="_merge")
    df_left = df.loc[df["_merge"] == "left_only"].drop(columns="_merge")
    df_right = df.loc[df["_merge"] == "right_only"].drop(columns="_merge")
    tup = namedtuple("df_diff", ["common", "left", "right"])
    return tup(df_both, df_left, df_right)
Usage:
In [1366]: b, l, r = diff_dataframes(df1, df2)
In [1371]: l
Out[1371]:
   a  b  c
0  1  2  4
In [1372]: r
Out[1372]:
   a  b  c
3  1  2  3
Third approach:
In [1440]: eq_ser = df1.eq(df2).sum(axis=1).eq(len(df1.columns))
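This builds the same boolean Series as the reduce above: a row is True when its count of equal cells equals the number of columns. It is used the same way (a quick sketch):
In [1441]: df1[~eq_ser]
Out[1441]:
   a  b  c
0  1  2  4
Note that all the eq-based approaches treat NaN as unequal to NaN.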

Pandas - Row mask and 2d ndarray assignment

I've got some problems with pandas; I think I'm not using it properly, and I need some help to do it right.
I have a mask for the rows of a dataframe, which is a simple list of Boolean values.
I would like to assign a 2D array to a new or existing column.
mask = some_row_mask()
my2darray = some_operation(dataframe.loc[mask, column])
dataframe.loc[mask, new_or_exist_column] = my2darray
# Also tried this
dataframe.loc[mask, new_or_exist_column] = [f for f in my2darray]
Example data:
dataframe = pd.DataFrame({'Fun': ['a', 'b', 'a'], 'Data': [10, 20, 30]})
mask = dataframe['Fun']=='a'
my2darray = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
column = 'Data'
new_or_exist_column = 'NewData'
Expected output
  Fun  Data          NewData
0   a    10  [0, 1, 2, 3, 4]
1   b    20              NaN
2   a    30  [4, 3, 2, 1, 0]
dataframe[mask] and my2darray both have exactly the same number of rows, but it always ends with:
ValueError: Must have equal len keys and value when setting with an ndarray
Thanks for your help!
EDIT - In context:
To add some precision: this was for filling folds step by step; I compute and set values from sub-parts of the dataframe.
Instead of this, from Parth's answer:
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
I changed to this:
dataframe.loc[mask, new_or_exist_column] = pd.Series([f for f in my2darray], index=mask[mask==True].index)
Otherwise, values that were already set get overwritten with NaN.
I had forgotten to mention this. Thanks!
Try this:
dataframe[new_or_exist_column]=np.nan
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
It will give the desired output:
  Fun  Data          NewData
0   a    10  [0, 1, 2, 3, 4]
1   b    20              NaN
2   a    30  [4, 3, 2, 1, 0]
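For reference, a self-contained version of the same idea; dataframe.index[mask] is equivalent to mask[mask==True].index and makes the label alignment explicit:
import pandas as pd

dataframe = pd.DataFrame({'Fun': ['a', 'b', 'a'], 'Data': [10, 20, 30]})
mask = dataframe['Fun'] == 'a'
my2darray = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
# Wrapping the rows in a Series keyed by the masked row labels makes pandas
# align by label instead of trying to broadcast a 2-D block.
dataframe['NewData'] = pd.Series(list(my2darray), index=dataframe.index[mask])
print(dataframe)
The direct ndarray assignment fails because pandas treats a 2-D value as a block whose shape must match the selection; a Series of list objects sidesteps that.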

Weighted mean pandas

I'm calculating a weighted mean for many columns using pandas. In some cases the weights can sum to zero, so I use np.ma.average:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(dict([('ID', [1, 1, 1]),('HeightA', [1, 2, 3]), ('WeightA', [0, 0, 0]),('HeightB', [2, 4, 6]), ('WeightB', [1, 2, 4])]))
>>> df
   ID  HeightA  WeightA  HeightB  WeightB
0   1        1        0        2        1
1   1        2        0        4        2
2   1        3        0        6        4
wmA = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightA"])
wmB = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightB"])
f = {'HeightA':wmA,'HeightB':wmB}
df2 = df.groupby('ID')[['HeightA', 'HeightB']].agg(f)
This works, but I have many height and weight columns, so I don't want to write a lambda function for each one. So I try:
def givewm(data, weightcolumn):
    return np.ma.average(data, weights=data.loc[data.index, weightcolumn])
f = {'HeightA': givewm(df, 'WeightA'), 'HeightB': givewm(df, 'WeightB')}
df2 = df.groupby('ID')[['HeightA', 'HeightB']].agg(f)
This gives the error: builtins.TypeError: Axis must be specified when shapes of a and weights differ.
How can I write a function that returns the weighted mean, taking the weight column name as input?
Use nested functions (a closure), solution from GitHub. The failed attempt calls givewm immediately, whereas agg needs a callable; the closure binds the weight column name and returns a function that agg applies to each group:
df = pd.DataFrame.from_dict(dict([('ID', [1, 1, 1]),
                                  ('HeightA', [1, 2, 3]),
                                  ('WeightA', [10, 20, 30]),
                                  ('HeightB', [2, 4, 6]),
                                  ('WeightB', [1, 2, 4])]))
print (df)
   ID  HeightA  WeightA  HeightB  WeightB
0   1        1       10        2        1
1   1        2       20        4        2
2   1        3       30        6        4
def givewm(weightcolumn):
    def f1(x):
        return np.ma.average(x, weights=df.loc[x.index, weightcolumn])
    return f1
f = {'HeightA':givewm('WeightA'),'HeightB':givewm('WeightB')}
df2 = df.groupby('ID').agg(f)
print (df2)
HeightA HeightB
ID
1 2.333333 4.857143
Verify solution:
wmA = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightA"])
wmB = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightB"])
f = {'HeightA':wmA,'HeightB':wmB}
df2 = df.groupby('ID')[['HeightA', 'HeightB']].agg(f)
print (df2)
HeightA HeightB
ID
1 2.333333 4.857143
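Since the original goal was many height/weight pairs, the same closure scales with a dict comprehension (a sketch; the pair list here stands in for the real column names):
# Build the aggregation spec programmatically, one entry per pair.
pairs = [('HeightA', 'WeightA'), ('HeightB', 'WeightB')]
f = {h: givewm(w) for h, w in pairs}
df2 = df.groupby('ID').agg(f)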

Looping through each item in a numpy array?

I'm trying to access each item in a numpy 2D array.
I'm used to something like this in Python [[...], [...], [...]]
for row in data:
    for col in data:
        print(data[row][col])
but now, I have a data_array = np.array(features)
How can I iterate through it the same way?
Try np.ndenumerate:
>>> a = np.array([[1, 2], [3, 4]])
>>> for (i,j), value in np.ndenumerate(a):
... print(i, j, value)
...
0 0 1
0 1 2
1 0 3
1 1 4
Make a small 2d array, and a nested list from it:
In [241]: A=np.arange(6).reshape(2,3)
In [242]: alist= A.tolist()
In [243]: alist
Out[243]: [[0, 1, 2], [3, 4, 5]]
One way of iterating on the list:
In [244]: for row in alist:
     ...:     for item in row:
     ...:         print(item)
     ...:
0
1
2
3
4
5
The same works for the array:
In [245]: for row in A:
     ...:     for item in row:
     ...:         print(item)
     ...:
0
1
2
3
4
5
Neither is good if you want to modify elements, but for crude iteration over all elements both work.
With the array I can easily treat it as 1d:
In [246]: [i for i in A.flat]
Out[246]: [0, 1, 2, 3, 4, 5]
I could also iterate with nested indices
In [247]: [A[i,j] for i in range(A.shape[0]) for j in range(A.shape[1])]
Out[247]: [0, 1, 2, 3, 4, 5]
In general it is better to work with arrays without iteration; I give these iteration examples to clear up some confusion.
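For instance, elementwise work that would otherwise be a double loop is normally a single array expression:
In [248]: A * 10
Out[248]:
array([[ 0, 10, 20],
       [30, 40, 50]])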
If you want to access a single item in a numpy 2D array features, you can use features[row_index, column_index]. If you want to iterate through the whole array by index, loop over the index ranges rather than over the values:
for row in range(data.shape[0]):
    for col in range(data.shape[1]):
        print(data[row, col])

Aggregate/Remove duplicate rows in DataFrame based on swapped index levels

Sample input
import pandas as pd
df = pd.DataFrame([
    ['A', 'B', 1, 5],
    ['B', 'C', 2, 2],
    ['B', 'A', 1, 1],
    ['C', 'B', 1, 3]],
    columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
Which looks like this:
              value
from to type
A    B  1         5
B    C  2         2
     A  1         1
C    B  1         3
Goal
I now want to remove "duplicate" rows from this in the following sense: for each row with an arbitrary index (from, to, type), if there exists a row (to, from, type), the value of the second row should be added to the first row and the second row be dropped. In the example above, the row (B, A, 1) with value 1 should be added to the first row and dropped, leading to the following desired result.
Sample result
              value
from to type
A    B  1         6
B    C  2         2
C    B  1         3
This is my best try so far. It feels unnecessarily verbose and clunky:
# aggregate val of rows with (from,to,type) == (to,from,type)
df2 = df.reset_index()
df3 = df2.rename(columns={'from': 'to', 'to': 'from'})
df_both = df.join(df3.set_index(['from', 'to', 'type']), rsuffix='_b').sum(axis=1)
# then remove the second, i.e. the (to,from,t) row
rows_to_keep = []
rows_to_remove = []
for a, b, t in df_both.index:
    if (b, a, t) in df_both.index and (b, a, t) not in rows_to_keep:
        rows_to_keep.append((a, b, t))
        rows_to_remove.append((b, a, t))
df_final = df_both.drop(rows_to_remove)
df_final
Especially the second "de-duplication" step feels very unpythonic. (How) can I improve these steps?
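One way both steps could be collapsed (a sketch I'm adding for comparison, not from the original answers): normalize each (from, to) pair into sorted order as the grouping key, sum the values within each key, and keep the first-seen orientation for the final index labels.
import numpy as np

d = df.reset_index()
# Orientation-independent key: the (from, to) pair in sorted order.
pair = np.sort(d[['from', 'to']].to_numpy(), axis=1)
d['lo'], d['hi'] = pair[:, 0], pair[:, 1]
res = (d.groupby(['lo', 'hi', 'type'], sort=False, as_index=False)
         .agg({'from': 'first', 'to': 'first', 'value': 'sum'})
         [['from', 'to', 'type', 'value']]
         .set_index(['from', 'to', 'type']))
print(res)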
Not sure how much better this is, but it's certainly different:
import pandas as pd
from collections import Counter

df = pd.DataFrame([
    ['A', 'B', 1, 5],
    ['B', 'C', 2, 2],
    ['B', 'A', 1, 1],
    ['C', 'B', 1, 3]],
    columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
ls = list(df.to_records())
ls2 = []
for l in ls:
    # repeat each (from, to, type) key once per unit of value
    i = 0
    while i < l[3]:
        ls2.append(list(l)[:3])
        i += 1
# sort only the (from, to) pair so swapped orientations collapse onto one
# key; sorting the whole record would compare str with int and fail
counted = Counter((*sorted(entry[:2]), entry[2]) for entry in ls2)
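To finish the idea (a sketch of the missing last step, not part of the original answer): rebuild a frame from the counts. Note the keys were normalized to sorted order, so the original from/to orientation of rows like (C, B, 1) is not preserved.
res = pd.DataFrame(
    [(a, b, t, v) for (a, b, t), v in counted.items()],
    columns=['from', 'to', 'type', 'value'],
).set_index(['from', 'to', 'type'])
print(res)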