Issue merging two DataFrames with pandas by summing element by element [duplicate]

I'm trying to merge two DataFrames, summing the values of a column.
>>> print(df1)
   id name  weight
0   1    A       0
1   2    B      10
2   3    C      10
>>> print(df2)
   id name  weight
0   2    B      15
1   3    C      10
I need to sum the weight values during the merge for rows with matching values in the common columns. This is what I tried:
merge = pd.merge(df1, df2, how='inner')
The output should look like the following:
   id name  weight
1   2    B      25
2   3    C      20

This solution also works if you want to sum more than one column. Assume the DataFrames
>>> df1
   id name  weight  height
0   1    A       0       5
1   2    B      10      10
2   3    C      10      15
>>> df2
   id name  weight  height
0   2    B      25      20
1   3    C      20      30
You can concatenate them and group by the key columns:
>>> pd.concat([df1, df2]).groupby(['id', 'name']).sum().reset_index()
   id name  weight  height
0   1    A       0       5
1   2    B      35      30
2   3    C      30      45
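Note that concat keeps keys that appear in only one frame (id 1 above), unlike the inner merge in the question. If you need inner-merge semantics, one way (a minimal sketch, assuming the frames above) is to restrict the result to the shared keys afterwards:
# keys present in both frames
common = df1[['id', 'name']].merge(df2[['id', 'name']])
# sum everything, then keep only the shared keys
summed = pd.concat([df1, df2]).groupby(['id', 'name'], as_index=False).sum()
result = summed.merge(common)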

Another option is to merge on the key columns and then sum the two resulting weight columns row-wise:
In [41]: pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1)
Out[41]:
id  name
2   B       25
3   C       20
dtype: int64
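This returns a Series; to get back a regular DataFrame with a weight column, a small sketch reusing the expression above:
merged = pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1)
merged.to_frame('weight').reset_index()  # Series -> DataFrame with a 'weight' column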

If you set the common columns as the index, you can just sum the two dataframes, much simpler than merging:
In [30]: df1 = df1.set_index(['id', 'name'])
In [31]: df2 = df2.set_index(['id', 'name'])
In [32]: df1 + df2
Out[32]:
         weight
id name
1  A        NaN
2  B       25.0
3  C       20.0
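If you'd rather treat keys missing from one frame as zero instead of NaN, DataFrame.add with fill_value does that (a minimal sketch, continuing the session above):
In [33]: df1.add(df2, fill_value=0)
Out[33]:
         weight
id name
1  A        0.0
2  B       25.0
3  C       20.0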

Related

creating a new dataframe from 3 other dataframes but columns must have specific order without specifying the name of the columns

Assume we have 3 dataframes named df1, df2, df3. Each of these dataframes has 100 rows and 15 columns. I want to create a new dataframe that has the first column of df1, then the first column of df2, then the first column of df3; then the second column of df1, the second column of df2, the second column of df3; and so on, until all 15 columns of each of the three dataframes are included. For example:
df1
  A  B  C ... O
  1  1  1 ... 1
  1  1  1 ... 1
  ...
df2
  A  B  C ... O
  2  2  2 ... 2
  2  2  2 ... 2
  ...
df3
  A  B  C ... O
  3  3  3 ... 3
  3  3  3 ... 3
  ...
The expected output should be something like the following:
dfnew
  A_df1  A_df2  A_df3  B_df1  B_df2  B_df3 ... O_df1  O_df2  O_df3
      1      2      3      1      2      3 ...     1      2      3
      1      2      3      1      2      3 ...     1      2      3
  ...
My issue is that I cannot use the names of the columns to specify them. For example, I know how to do it like this:
# create a list of the dataframes
dfs = [df1, df2, df3]
# concatenate the dataframes along the columns axis (axis=1)
dfnew = pd.concat(dfs, axis=1)
# specify the column names for the new dataframe
# (note: pd.concat has no columns argument, so they are assigned afterwards)
column_names = ["column1", "column2", ..., "column15"]
dfnew.columns = column_names
but I cannot hard-code the column names because they will change every time. Plus, it seems like there should be a faster way than hard-coding them, perhaps using .loc.
Example
data1 = {'A': {0: 1, 1: 1}, 'B': {0: 1, 1: 1}, 'C': {0: 1, 1: 1}}
df1 = pd.DataFrame(data1)
df2 = df1.replace(1, 2).copy()
df3 = df1.replace(1, 3).copy()
df1
   A  B  C
0  1  1  1
1  1  1  1
df2
   A  B  C
0  2  2  2
1  2  2  2
df3
   A  B  C
0  3  3  3
1  3  3  3
Code
dfs = (pd.concat([df1, df2, df3], axis=1, keys=['df1', 'df2', 'df3'])
.sort_index(level=1, axis=1).swaplevel(0, 1, axis=1))
dfs
    A           B           C
  df1 df2 df3 df1 df2 df3 df1 df2 df3
0   1   2   3   1   2   3   1   2   3
1   1   2   3   1   2   3   1   2   3
dfs.set_axis(dfs.columns.map('_'.join), axis=1)
   A_df1  A_df2  A_df3  B_df1  B_df2  B_df3  C_df1  C_df2  C_df3
0      1      2      3      1      2      3      1      2      3
1      1      2      3      1      2      3      1      2      3
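An alternative that skips the MultiIndex entirely, a sketch assuming the same three frames: suffix each frame's columns with its name, concatenate, and reorder so same-named columns sit together.
names = ['df1', 'df2', 'df3']
frames = [df1, df2, df3]
# suffix columns: A -> A_df1, and so on, then concatenate side by side
out = pd.concat([f.add_suffix(f'_{n}') for f, n in zip(frames, names)], axis=1)
# reorder: all suffixed variants of each original column, grouped together
order = [f'{col}_{n}' for col in df1.columns for n in names]
dfnew = out[order]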

pandas grouping based on conditions on other columns

import pandas as pd
df = pd.DataFrame({'id': ['A', 'B', 'A', 'A', 'B', 'B', 'B'],
                   'num': [2, 4, 3, 5, 6, 7, 8]})
df
  id  num
0  A    2
1  B    4
2  A    3
3  A    5
4  B    6
5  B    7
6  B    8
I would like to group by the id column but apply a condition on the num column: the final DataFrame should have two columns, is_even_count and is_odd_count, where is_even_count counts the even numbers in num within each group and is_odd_count counts the odd ones.
The desired output is:
   is_even_count  is_odd_count
A              1             2
B              3             1
How can I do this in pandas?
Use modulo 2, compare with 1, and map the Boolean result to the column labels:
d = {True: 'is_odd_count', False: 'is_even_count'}
df = df.groupby(['id', (df['num'] % 2 == 1).map(d)]).size().unstack(fill_value=0)
print(df)
num  is_even_count  is_odd_count
id
A                1             2
B                3             1
Another solution with crosstab:
df = pd.crosstab(df['id'], (df['num'] % 2 == 1).map(d))
Alternative with numpy.where:
import numpy as np
a = np.where(df['num'] % 2 == 1, 'is_odd_count', 'is_even_count')
df = pd.crosstab(df['id'], a)
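Another option, a sketch using named aggregation (available since pandas 0.25), which avoids the label-mapping step:
out = df.groupby('id')['num'].agg(
    is_even_count=lambda s: s.mod(2).eq(0).sum(),  # count of even values per group
    is_odd_count=lambda s: s.mod(2).eq(1).sum(),   # count of odd values per group
)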

Re-index to insert missing rows in a multi-indexed dataframe

I have a MultiIndexed DataFrame with three levels of indices. I would like to expand my third level to contain all values in a given range, but only for the existing values in the two upper levels.
For example, assume the first level is name, the second level is date and the third level is hour. I would like to have rows for all 24 possible hours (even if some are currently missing), but only for the already existing names and dates. The values in new rows can be filled with zeros.
So a simple example input would be:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,1,1,3],[2,2,1,4], [3,3,2,5]], columns=['A', 'B', 'C','val'])
>>> df.set_index(['A', 'B', 'C'], inplace=True)
>>> df
val
A B C
1 1 1 3
2 2 1 4
3 3 2 5
If the required values for C are [1, 2, 3], the desired output would be:
       val
A B C
1 1 1    3
    2    0
    3    0
2 2 1    4
    2    0
    3    0
3 3 1    0
    2    5
    3    0
I know how to achieve this using groupby with a custom function applied to each group, but I was wondering if there is a cleaner way of doing this with reindex (I couldn't make it work for the MultiIndex case, but perhaps I'm missing something).
Use:
partial_indices = [ i[0:2] for i in df.index.values ]
C_reqd = [1, 2, 3]
final_indices = [j+(i,) for j in partial_indices for i in C_reqd]
index = pd.MultiIndex.from_tuples(final_indices, names=['A', 'B', 'C'])
df2 = pd.DataFrame(pd.Series(0, index), columns=['val'])
df2.update(df)
Output:
df2
         val
A B C
1 1 1    3.0
    2    0.0
    3    0.0
2 2 1    4.0
    2    0.0
    3    0.0
3 3 1    0.0
    2    5.0
    3    0.0
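Since the question asks about reindex specifically: the same result can be had by building the full index and reindexing with fill_value. A sketch, assuming the df above:
# unique (A, B) pairs from the existing index, crossed with the required C values
pairs = df.index.droplevel('C').unique()
full = pd.MultiIndex.from_tuples(
    [(a, b, c) for a, b in pairs for c in [1, 2, 3]],
    names=['A', 'B', 'C'])
df.reindex(full, fill_value=0)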

Python Pandas: How to take categorical average of a column?

For a given dataframe as follows:
1  a   10
2  a   20
3  a   30
4  b   10
5  b  100
where the first column is the index, the second is a categorical value, and the third is a number, I want the categorical mean over the second column, which should look something like this:
a  20
b  55
The value for a is calculated as
(10+20+30)/3 = 20
The value for b is calculated as
(10+100)/2 = 55
I think you can use groupby with mean and reset_index:
print(df)
   a  b    c
0  1  a   10
1  2  a   20
2  3  a   30
3  4  b   10
4  5  b  100
df1 = df.groupby('b')['c'].mean().reset_index()
print(df1)
   b     c
0  a  20.0
1  b  55.0
print(df1.c.max())
55.0
print(df1.c.min())
20.0
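If several statistics are wanted at once, the same grouping works with agg; a small sketch on the same frame:
# mean, min, and max of column c per category in b, as one DataFrame
print(df.groupby('b')['c'].agg(['mean', 'min', 'max']))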

how to selectively filter elements in pandas group

I want to selectively remove elements of a pandas group based on their properties within the group.
Here's an example: remove all elements except the row with the highest value in the 'A' column
>>> dff = pd.DataFrame({'A': [0, 2, 4, 1, 9, 2, 3, 10], 'B': list('aabbbbcc'), 'C': list('lmnopqrt')})
>>> dff
    A  B  C
0   0  a  l
1   2  a  m
2   4  b  n
3   1  b  o
4   9  b  p
5   2  b  q
6   3  c  r
7  10  c  t
>>> grped = dff.groupby('B')
>>> grped.groups
{'a': [0, 1], 'c': [6, 7], 'b': [2, 3, 4, 5]}
Apply a custom function/method to the groups (sort within each group on column 'A', filter elements):
>>> yourGenius(grped, 'A').reset_index()
which returns the dataframe:
    A  B  C
0   2  a  m
1   9  b  p
2  10  c  t
Maybe there is a compact way to do this with a lambda function or .filter()? Thanks
If you want to select one row per group, you could use groupby/agg
to return index values and select the rows using loc.
For example, to group by B and then select the row with the highest A value:
In [171]: dff
Out[171]:
    A  B  C
0   0  a  l
1   2  a  m
2   4  b  n
3   1  b  o
4   9  b  p
5   2  b  q
6   3  c  r
7  10  c  t
[8 rows x 3 columns]
In [172]: dff.loc[dff.groupby('B')['A'].idxmax()]
Out[172]:
    A  B  C
1   2  a  m
4   9  b  p
7  10  c  t
Another option (suggested by jezrael), which in practice is faster for a wide range of DataFrames, is
dff.sort_values(by=['A'], ascending=False).drop_duplicates('B')
If you wish to select many rows per group, you could use groupby/apply with a function that returns sub-DataFrames for
each group. apply will then try to merge these sub-DataFrames for you.
For example, to select every row except the last from each group:
In [216]: df = pd.DataFrame(np.arange(15).reshape(5,3), columns=list('ABC'), index=list('vwxyz')); df['A'] %= 2; df
Out[216]:
   A   B   C
v  0   1   2
w  1   4   5
x  0   7   8
y  1  10  11
z  0  13  14
In [217]: df.groupby(['A']).apply(lambda grp: grp.iloc[:-1]).reset_index(drop=True, level=0)
Out[217]:
   A  B  C
v  0  1  2
x  0  7  8
w  1  4  5
Another way is to use groupby/apply to return a Series of index values. Again apply will try to join the Series into one Series. You could then use df.loc to select rows by index value:
In [218]: df.loc[df.groupby(['A']).apply(lambda grp: pd.Series(grp.index[:-1]))]
Out[218]:
   A  B  C
v  0  1  2
x  0  7  8
w  1  4  5
I don't think groupby/filter will do what you wish, since
groupby/filter filters whole groups. It doesn't allow you to select particular rows from each group.
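As a vectorized alternative to the apply-based approaches above, a sketch using the same df: cumcount(ascending=False) numbers rows from each group's end, so 0 marks the last row of each group and can be filtered out without apply.
# keep every row except the last one of each group; note this preserves
# the original row order (v, w, x) rather than the grouped order above
df[df.groupby('A').cumcount(ascending=False) > 0]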