Compare two dataframes with varying index lengths and multiple occurrences - pandas

df = pd.DataFrame({"_id": [1, 2, 3, 4], "names_e": ["emil", "emma", "enton", "emma"]})
df2 = pd.DataFrame({"id": [1, 3, 4], "name": ["emma", "emma", "emma"]})
#df2 = df2.set_index("id", drop=False)
#df = df.set_index("_id", drop=False)
df[(df['_id']==df2["id"]) & (df['names_e'] == df2["name"])] #-> Can only compare identically-labeled Series objects
#df[[x for x in (df2["name"] == df["names_e"].values)]] #->'Lengths must match to compare'
#df[[x for x in (df2["name"] == df["names_e"])]] # ->Can only compare identically-labeled Series objects
I'm trying to build the intersection of two dataframes based on the name column and the unique identifier id. The expected result would only include id: 4 and name: 'emma', but I keep running into the errors shown above.

Let us do an inner merge, renaming the columns of df so the keys match:
df2.merge(df.rename(columns={'_id': 'id', 'names_e': 'name'}))
id name
0 4 emma
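If you would rather not rename, merge also accepts the key columns explicitly via left_on/right_on; a minimal sketch on the frames from the question:

import pandas as pd

df = pd.DataFrame({"_id": [1, 2, 3, 4], "names_e": ["emil", "emma", "enton", "emma"]})
df2 = pd.DataFrame({"id": [1, 3, 4], "name": ["emma", "emma", "emma"]})

# Inner merge on both keys: only rows that agree on id AND name survive.
res = df.merge(df2, left_on=["_id", "names_e"], right_on=["id", "name"])
print(res)  # one row: _id=4, names_e='emma', id=4, name='emma'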

Numpy fancy indexing with 2D array - explanation

I am (re)building up my knowledge of numpy, having used it a little while ago.
I have a question about fancy indexing with multidimensional (in this case 2D) arrays.
Given the following snippet:
>>> a = np.arange(12).reshape(3,4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> i = np.array([[0, 1],   # indices for the first dim of a
...               [1, 2]])
>>> j = np.array([[2, 1],   # indices for the second dim
...               [3, 3]])
>>>
>>> a[i,j]   # i and j must have equal shape
array([[ 2,  5],
       [ 7, 11]])
Could someone explain, in simple English, the logic applied to produce these results? Ideally the explanation would also apply to 3D and higher-rank arrays being used to index an array.
Conceptually (in terms of restrictions placed on "rows" and "columns"), what does it mean to index using a 2D array?
It means you are constructing a 2D array R such that R = A[B, C], where the value at R[i, j] is A[B[i, j], C[i, j]].
So the item at R[0,0] is the item in A with row index B[0,0] and column index C[0,0]; the item at R[0,1] is the item in A with row index B[0,1] and column index C[0,1], and so on.
So in this specific case:
>>> b = a[i,j]
>>> b
array([[ 2,  5],
       [ 7, 11]])
b[0,0] = 2 since i[0,0] = 0 and j[0,0] = 2, and thus a[0,2] = 2.
b[0,1] = 5 since i[0,1] = 1 and j[0,1] = 1, and thus a[1,1] = 5.
b[1,0] = 7 since i[1,0] = 1 and j[1,0] = 3, and thus a[1,3] = 7.
b[1,1] = 11 since i[1,1] = 2 and j[1,1] = 3, and thus a[2,3] = 11.
So you can say that i determines the "row indices" and j determines the "column indices". This concept holds in more dimensions as well: the first "indexer" determines the indices along the first axis, the second "indexer" the indices along the second axis, and so on.
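To make the rule concrete, here is a small sketch (pure illustration, using the arrays from the question) that checks the element-wise lookup:

import numpy as np

a = np.arange(12).reshape(3, 4)
i = np.array([[0, 1], [1, 2]])
j = np.array([[2, 1], [3, 3]])

# The result takes the shape of the indexers; each output element is an
# independent lookup: R[m, n] == a[i[m, n], j[m, n]].
R = a[i, j]
for m in range(i.shape[0]):
    for n in range(i.shape[1]):
        assert R[m, n] == a[i[m, n], j[m, n]]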

Pandas - Row mask and 2D ndarray assignment

I've got some problems with pandas; I think I'm not using it properly, and I need some help to do it right.
I have a mask for the rows of a dataframe; the mask is a simple list of Boolean values.
I would like to assign a 2D array to a new or existing column.
mask = some_row_mask()
my2darray = some_operation(dataframe.loc[mask, column])
dataframe.loc[mask, new_or_exist_column] = my2darray
# Also tried this
dataframe.loc[mask, new_or_exist_column] = [f for f in my2darray]
Example data:
dataframe = pd.DataFrame({'Fun': ['a', 'b', 'a'], 'Data': [10, 20, 30]})
mask = dataframe['Fun']=='a'
my2darray = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
column = 'Data'
new_or_exist_column = 'NewData'
Expected output
Fun Data NewData
0 a 10 [0, 1, 2, 3, 4]
1 b 20 NaN
2 a 30 [4, 3, 2, 1, 0]
dataframe[mask] and my2darray both have exactly the same number of rows, but it always ends with:
ValueError: Must have equal len keys and value when setting with an ndarray.
Thanks for your help!
EDIT - In context:
A few precisions: this is used for filling folds step by step; I compute and set some values from a sub-part of the dataframe.
So instead of this, according to Parth:
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
I changed to this:
dataframe.loc[mask, out] = pd.Series([f for f in features], index=mask[mask==True].index)
Otherwise, values that were already set get overwritten with NaN.
I had missed giving that information initially.
Thanks!
Try this:
dataframe[new_or_exist_column]=np.nan
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
It will give the desired output:
Fun Data NewData
0 a 10 [0, 1, 2, 3, 4]
1 b 20 NaN
2 a 30 [4, 3, 2, 1, 0]
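For reference, a minimal runnable sketch of this fix on the example data; the pd.Series wrapper is what lets the list of rows align with the masked index positions:

import pandas as pd

dataframe = pd.DataFrame({'Fun': ['a', 'b', 'a'], 'Data': [10, 20, 30]})
mask = dataframe['Fun'] == 'a'
my2darray = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]

# Wrap the rows in a Series indexed by the True positions of the mask, so
# pandas aligns each inner list with one row instead of broadcasting.
dataframe['NewData'] = pd.Series(my2darray, index=mask[mask].index)
print(dataframe)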

Update values in a dataframe

I have a dataframe in which the second column is an array. I have another dataframe with 2 columns, whose values have to be used to update the first dataframe.
I have already tried the update, explode, map, and assign methods.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Account': ['A1','A2','A3']})
groups = np.array([['g1','g2'],['g3','g4'],['g1','g2','g3']], dtype=object)
df["Group"] = groups.tolist()
key_values = pd.DataFrame({'ID': ['1','2','3','4','5'],'Group': ['g1','g2','g3','g4','g5']})
keys = key_values.set_index('Group')['ID']
ag = df.explode('Group')
Setup
m = key_values.set_index('Group')['ID']
Option 1
explode + map
f = df.explode('Group')
res = f['Group'].map(m).groupby(level=0).agg(list)
0 [1, 2]
1 [3, 4]
2 [1, 2, 3]
Name: Group, dtype: object
Option 2
List comprehension + map
res = [[*map(m.get, el)] for el in df['Group']]
[['1', '2'], ['3', '4'], ['1', '2', '3']]
To assign it back:
df.assign(Group=res)
Account Group
0 A1 [1, 2]
1 A2 [3, 4]
2 A3 [1, 2, 3]
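For completeness, a self-contained sketch of Option 1, assuming the Group column holds plain Python lists:

import pandas as pd

df = pd.DataFrame({'Account': ['A1', 'A2', 'A3']})
df['Group'] = [['g1', 'g2'], ['g3', 'g4'], ['g1', 'g2', 'g3']]
key_values = pd.DataFrame({'ID': ['1', '2', '3', '4', '5'],
                           'Group': ['g1', 'g2', 'g3', 'g4', 'g5']})

m = key_values.set_index('Group')['ID']             # group label -> ID
f = df.explode('Group')                             # one row per group label
res = f['Group'].map(m).groupby(level=0).agg(list)  # map, then re-collect per original row
print(df.assign(Group=res))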
First convert the column to strings and use replace on it. Then you can convert the strings back to lists using ast:
import ast
df['keys']=df.astype(str).replace(to_replace=list(key_values['Group']),value=list(key_values['ID']),regex=True)['Group']
df['keys']=df['keys'].apply(lambda x: ast.literal_eval(x))
print(df)
Account Group keys
0 A1 [g1, g2] [1, 2]
1 A2 [g3, g4] [3, 4]
2 A3 [g1, g2, g3] [1, 2, 3]

Slicing a Pandas Series when the index elements are not the default (don't start with 0)

I created a Pandas Series in Python 3.7, providing both 'data' and 'index': the data is a list of six elements, and the index list starts from 3 rather than from 0.
I want to slice the series.
import pandas as pd
li_a = [[1,2],[3,4],[5,6],[7,8],(9,10),(11,12)]
li_c = [3,4,5,6,7,8]
ser1 = pd.Series(data=li_a,index=li_c)
So ser1[3] outputs [1,2], i.e. the first element of the Series.
I expected the output of ser1[3:] to be the entire Series, but the output was
6 [7, 8]
7 (9, 10)
8 (11, 12)
dtype: object
It works that way because an integer slice selects by row position, not by index label:
print(ser1[3:])
output:
6 [7, 8]
7 (9, 10)
8 (11, 12)
If you want rows starting from a specific index label, you need to use loc:
print(ser1.loc[3:])
output:
3 [1, 2]
4 [3, 4]
5 [5, 6]
6 [7, 8]
7 (9, 10)
8 (11, 12)
Edited: changed iloc to loc:
loc gets rows (or columns) with particular labels from the index.
Your full code (I have also changed your if __name__ line):
import numpy as np
import pandas as pd

def main():
    arr = np.arange(10, 16)
    index1 = np.arange(3, 9)
    ser1 = pd.Series(data=arr, index=index1)
    print(ser1)
    print(ser1.loc[3:])

if __name__ == "__main__":
    main()
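For contrast, a short sketch of the two access styles on the Series from the question:

import pandas as pd

li_a = [[1,2],[3,4],[5,6],[7,8],(9,10),(11,12)]
li_c = [3,4,5,6,7,8]
ser1 = pd.Series(data=li_a, index=li_c)

print(ser1.iloc[3:])  # positional: skips the first three rows, labels 6, 7, 8 remain
print(ser1.loc[3:])   # label-based: starts at label 3, i.e. the whole Series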

Managing MultiIndex Dataframe objects

So, yet another problem with grouped DataFrames that I'm getting confused over...
I have defined an aggregation dictionary as:
aggregations_level_1 = {
    'A': {
        'mean': 'mean',
    },
    'B': {
        'mean': 'mean',
    },
}
And now I have two grouped DataFrames that I have aggregated using the above, then joined:
grouped_top = df1.groupby(['group_lvl']).agg(aggregations_level_1)
grouped_bottom = df2.groupby(['group_lvl']).agg(aggregations_level_1)
Joining these:
df3 = grouped_top.join(grouped_bottom, how='left', lsuffix='_top_10', rsuffix='_low_10')
A_top_10 A_low_10 B_top_10 B_low_10
mean mean mean mean
group_lvl
a 3.711413 14.515901 3.711413 14.515901
b 4.024877 14.442106 3.694689 14.209040
c 3.694689 14.209040 4.024877 14.442106
Now, if I call index and columns I have:
print df3.index
>> Index([u'a', u'b', u'c'], dtype='object', name=u'group_lvl')
print df3.columns
>> MultiIndex(levels=[[u'A_top_10', u'A_low_10', u'B_top_10', u'B_low_10'], [u'mean']],
labels=[[0, 1, 2, 3], [0, 0, 0, 0]])
So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?
How do I slice and call this? Say I would like to have only A_top_10, A_low_10 for all a,b,c?
Only A_top_10, B_top_10 for a and c?
I am pretty confused so any overall help would be great!
You need slicers, but first sort the columns with sort_index, otherwise you get:
UnsortedIndexError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
df = df.sort_index(axis=1)
idx = pd.IndexSlice
df1 = df.loc[:, idx[['A_low_10', 'A_top_10'], :]]
print (df1)
A_low_10 A_top_10
mean mean
group_lvl
a 14.515901 3.711413
b 14.442106 4.024877
c 14.209040 3.694689
And:
idx = pd.IndexSlice
df2 = df.loc[['a','c'], idx[['A_top_10', 'B_top_10'], :]]
print (df2)
A_top_10 B_top_10
mean mean
group_lvl
a 3.711413 3.711413
c 3.694689 4.024877
EDIT:
So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?
I think that's very close; it is more accurate to say that the DataFrame has a MultiIndex in its columns.
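For illustration, a self-contained sketch of the slicer pattern on a made-up frame with the same column layout (the numbers are placeholders, not the data from the question):

import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['A_top_10', 'A_low_10', 'B_top_10', 'B_low_10'], ['mean']])
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=pd.Index(['a', 'b', 'c'], name='group_lvl'),
                  columns=cols)

df = df.sort_index(axis=1)   # lexsort the columns first, or slicing raises UnsortedIndexError
idx = pd.IndexSlice
print(df.loc[:, idx[['A_low_10', 'A_top_10'], :]])           # all rows, two top-level columns
print(df.loc[['a', 'c'], idx[['A_top_10', 'B_top_10'], :]])  # selected rows and columns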