Update dataframe column with values from another dataframe by index - pandas

I have two DataFrames.
One of them contains: item id, name, quantity and price.
Another: item id, name and quantity.
The problem is to update the names and quantities in the first DataFrame with the values from the second DataFrame, matched by item id. Also, the first DataFrame does not contain every item id, so I need to take into account only those rows from the second DataFrame whose id is also in the first one.
DataFrame 1
In [1]: df1
Out[1]:
id name quantity price
0 10 X 10 15
1 11 Y 30 20
2 12 Z 20 15
3 13 X 15 10
4 14 X 12 15
DataFrame 2
In [2]: df2
Out[2]:
id name quantity
0 10 A 3
1 12 B 3
2 13 C 6
I've tried to use apply to iterate through the rows and modify the column values by condition, like this:
def modify(row):
    row['name'] = df2[df2['id'] == row['id']]['name'].get_values()[0]
    row['quantity'] = df2[df2['id'] == row['id']]['quantity'].get_values()[0]

df1.apply(modify, axis=1)
But it has no effect: DataFrame 1 is still the same.
I am expecting something like this first:
In [1]: df1
Out[1]:
id name quantity price
0 10 A 3 15
1 11 Y 30 20
2 12 B 3 15
3 13 C 6 10
4 14 X 12 15
After that, I want to drop the rows that were not modified, to get:
In [1]: df1
Out[1]:
id name quantity price
0 10 A 3 15
1 12 B 3 15
2 13 C 6 10

Using update, with id set as the index on both DataFrames:
df1=df1.set_index('id')
df1.update(df2.set_index('id'))
df1=df1.reset_index()
Out[740]:
id name quantity price
0 10 A 3.0 15
1 11 Y 30.0 20
2 12 B 3.0 15
3 13 C 6.0 10
4 14 X 12.0 15
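
The output above still contains the unmatched ids 11 and 14, and quantity is displayed as float. Below is a minimal sketch of one way to finish the job, assuming the two frames from the question. The names `updated` and `matched` are made up for illustration, and the explicit cast back to int64 is an assumption about the desired dtype.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [10, 11, 12, 13, 14],
                    'name': ['X', 'Y', 'Z', 'X', 'X'],
                    'quantity': [10, 30, 20, 15, 12],
                    'price': [15, 20, 15, 10, 15]})
df2 = pd.DataFrame({'id': [10, 12, 13],
                    'name': ['A', 'B', 'C'],
                    'quantity': [3, 3, 6]})

# Align both frames on id, then overwrite matching cells in place.
updated = df1.set_index('id')
updated.update(df2.set_index('id'))

# update may upcast quantity to float, so cast it back explicitly.
updated = updated.astype({'quantity': 'int64'}).reset_index()

# Keep only the rows whose id was present in df2 (the modified ones).
matched = updated[updated['id'].isin(df2['id'])].reset_index(drop=True)
print(matched)
```

Filtering with isin after the update keeps the two steps from the question (update first, then drop the untouched rows) separate and easy to check.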

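# Inner merge keeps only the ids present in both frames
new_df = df1.merge(df2, on='id')
# Drop the old values from df1 and keep the ones coming from df2
new_df.drop(['name_x','quantity_x'], inplace=True, axis=1)
new_df.columns = ['id','price','name','quantity']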
Output
id price name quantity
0 10 15 A 3
1 12 15 B 3
2 13 10 C 6
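
A possible variation on the merge approach, sketched under the assumption that the frames are exactly those from the question: dropping the stale columns from df1 before merging avoids the `_x`/`_y` suffixes and the manual renaming. This is an alternative, not what the answer above does.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [10, 11, 12, 13, 14],
                    'name': ['X', 'Y', 'Z', 'X', 'X'],
                    'quantity': [10, 30, 20, 15, 12],
                    'price': [15, 20, 15, 10, 15]})
df2 = pd.DataFrame({'id': [10, 12, 13],
                    'name': ['A', 'B', 'C'],
                    'quantity': [3, 3, 6]})

# Remove the columns that will be replaced, then inner-merge on id.
# The default how='inner' keeps only ids found in both frames.
new_df = df1.drop(columns=['name', 'quantity']).merge(df2, on='id')
print(new_df)
```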

Related

Create a new pandas DataFrame Column with a groupby

I have a dataframe and I'd like to group by a column value and then do a calculation to create a new column. Below is the set up data:
import pandas as pd
df = pd.DataFrame({
'Red' : [1,2,3,4,5,6,7,8,9,10],
'Groups':['A','B','A','A','B','C','B','C','B','C'],
'Blue':[10,20,30,40,50,60,70,80,90,100]
})
df.groupby('Groups').apply(print)
What I want to do is create a 'Total' column in the original dataframe. For the first record of each group, 'Total' should be zero; otherwise 'Total' should be the 'Blue' value of the current row minus the 'Red' value of the previous row in the same group.
I tried to do this with the function below, but it does not work.
def funct(group):
    count = 0
    lst = []
    for info in group:
        if count == 0:
            lst.append(0)
            count += 1
        else:
            num = group.iloc[count]['Blue'] - group.iloc[count-1]['Red']
            lst.append(num)
            count += 1
    group['Total'] = lst
    return group

df = df.join(df.groupby('Groups').apply(funct))
The code works for the first group but then errors out.
The desired outcome is:
df_final = pd.DataFrame({
'Red' : [1,2,3,4,5,6,7,8,9,10],
'Groups':['A','B','A','A','B','C','B','C','B','C'],
'Blue':[10,20,30,40,50,60,70,80,90,100],
'Total':[0,0,29,37,48,0,65,74,83,92]
})
df_final
df_final.groupby('Groups').apply(print)
Thank you for the help!
For each group, calculate the difference between Blue and the shifted Red (Red at the previous row of the group):
df['Total'] = (df.groupby('Groups')
                 .apply(lambda g: g.Blue - g.Red.shift().fillna(g.Blue))
                 .reset_index(level=0, drop=True))
df
Red Groups Blue Total
0 1 A 10 0.0
1 2 B 20 0.0
2 3 A 30 29.0
3 4 A 40 37.0
4 5 B 50 48.0
5 6 C 60 0.0
6 7 B 70 65.0
7 8 C 80 74.0
8 9 B 90 83.0
9 10 C 100 92.0
Or, as @anky commented, you can avoid apply by shifting the Red column within each group first:
df['Total'] = (df.Blue - df.Red.groupby(df.Groups).shift()).fillna(0, downcast='infer')
df
Red Groups Blue Total
0 1 A 10 0
1 2 B 20 0
2 3 A 30 29
3 4 A 40 37
4 5 B 50 48
5 6 C 60 0
6 7 B 70 65
7 8 C 80 74
8 9 B 90 83
9 10 C 100 92
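
The `downcast` argument of fillna is deprecated in recent pandas releases, so here is a hedged sketch of the same group-wise shift with an explicit integer cast instead. It assumes the setup data from the question and that every Total is a whole number, which holds for this data.

```python
import pandas as pd

df = pd.DataFrame({
    'Red': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Groups': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'C', 'B', 'C'],
    'Blue': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# Red of the previous row within the same group (NaN for the first row).
prev_red = df.groupby('Groups')['Red'].shift()

# First rows of each group become 0; cast back to int after filling.
df['Total'] = (df['Blue'] - prev_red).fillna(0).astype(int)
print(df)
```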

pandas: get top n including the duplicates of a sorted column

I have some data like the table below. It is sorted by the score column and then by the cat column:
score cat
18 B
18 A
17 A
16 B
16 A
15 B
14 B
13 A
12 A
10 B
9 B
I want to get the top 5 scores, including the duplicates, and also add the rank, i.e.:
rank score cat
1 18 B
1 18 A
2 17 A
3 16 B
3 16 A
4 15 B
5 14 B
How can I get this using pandas?
Since the DataFrame is already sorted by score, try factorize:
df['rnk'] = df.score.factorize()[0]+1
out = df[df['rnk'] <= 5]
out
score cat rnk
0 18 B 1
1 18 A 1
2 17 A 2
3 16 B 3
4 16 A 3
5 15 B 4
6 14 B 5
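
The factorize approach relies on the rows already being sorted by score. As a sketch of an alternative that does not depend on row order, a dense rank gives the same numbering. The column name `rank` and the variable `top5` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'score': [18, 18, 17, 16, 16, 15, 14, 13, 12, 10, 9],
    'cat': ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
})

# Dense ranking: ties share a rank and no rank numbers are skipped.
df['rank'] = df['score'].rank(method='dense', ascending=False).astype(int)

# Keep every row whose score is among the five highest distinct scores.
top5 = df[df['rank'] <= 5]
print(top5)
```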

Keep the second entry in a dataframe

Below is an example dataset and the desired output.
ID number
1 50
1 49
1 48
2 47
2 40
2 31
3 60
3 51
3 42
Example output
1 49
2 40
3 51
I want to keep the second entry for every group in my dataset. I have already grouped the rows by ID, but now I want to keep the second entry for each ID and then remove all the other duplicates of that ID.
Use GroupBy.nth with 1 to get the second row of each group, because Python counts from 0:
df1 = df.groupby('ID', as_index=False).nth(1)
print (df1)
ID number
1 1 49
4 2 40
7 3 51
Another solution uses GroupBy.cumcount as a per-group counter and filters by boolean indexing:
df1 = df[df.groupby('ID').cumcount() == 1]
Details:
print (df.groupby('ID').cumcount())
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
dtype: int64
EDIT: Solution for the second maximal value: sort first and then take the second row. The values have to be unique within each group:
df = (df.sort_values(['ID','number'], ascending=[True, False])
        .groupby('ID', as_index=False)
        .nth(1))
print (df)
ID number
1 1 49
4 2 40
7 3 51
If you want the second maximal value when duplicates exist, add DataFrame.drop_duplicates:
print (df)
ID number
0 1 50 <-first max
1 1 50 <-first max
2 1 48 <-second max
3 2 47
4 2 40
5 2 31
6 3 60
7 3 51
8 3 42
df3 = (df.drop_duplicates(['ID','number'])
         .sort_values(['ID','number'], ascending=[True, False])
         .groupby('ID', as_index=False)
         .nth(1))
print (df3)
ID number
2 1 48
4 2 40
7 3 51
If that is the case, we can use duplicated + drop_duplicates:
df=df[df.duplicated('ID')].drop_duplicates('ID')
ID number
1 1 49
4 2 40
7 3 51
A more flexible solution with cumcount:
df[df.groupby('ID').cumcount()==1].copy()
ID number
1 1 49
4 2 40
7 3 51
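
To make the comparison concrete, here is a small self-contained sketch that rebuilds the example data and checks that the cumcount filter and the duplicated + drop_duplicates chain select the same rows. The variable names `by_count` and `by_dup` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'number': [50, 49, 48, 47, 40, 31, 60, 51, 42],
})

# cumcount numbers the rows inside each ID group starting at 0,
# so == 1 selects the second row of every group.
by_count = df[df.groupby('ID').cumcount() == 1]

# duplicated marks every row after the first per ID;
# drop_duplicates then keeps the first of those, i.e. the second row.
by_dup = df[df.duplicated('ID')].drop_duplicates('ID')

print(by_count)
print(by_count.equals(by_dup))
```

With cumcount the position is just a number, so changing `== 1` to `== 2` would pick the third row instead, which the duplicated chain cannot do as directly.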

Split a column by element and create new ones with pandas

Goal: I want to split a single column by its elements (not by the strings inside the cells) and, from that division, create new columns, where each element becomes the title of a new column and the values from the other column fill that column.
Is there a way of doing that with pandas? Thanks in advance.
Example:
[IN]:
A 1
A 2
A 6
A 99
B 7
B 8
B 19
B 18
[OUT]:
A B
1 7
2 8
6 19
99 18
Just an alternative, if the input data has 2 columns:
print(df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1=pd.DataFrame(df.groupby('col1')['col2'].apply(list).to_dict())
print(df1)
A B
0 1 7
1 2 8
2 6 19
3 99 18
Use Series.str.split with GroupBy.cumcount as a counter, then reshape with DataFrame.set_index and Series.unstack:
print (df)
col
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1 = df['col'].str.split(expand=True)
g = df1.groupby(0).cumcount()
df2 = df1.set_index([0, g])[1].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
If the input data has 2 columns:
print (df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
g = df.groupby('col1').cumcount()
df2 = df.set_index(['col1', g])['col2'].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
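
The same reshape can also be written with DataFrame.pivot. The following is a sketch assuming the two-column input shown above; the helper column name `row` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'col2': [1, 2, 6, 99, 7, 8, 19, 18],
})

# Number the rows inside each col1 group so that the n-th A and the
# n-th B end up on the same output row.
wide = (df.assign(row=df.groupby('col1').cumcount())
          .pivot(index='row', columns='col1', values='col2')
          .rename_axis(index=None, columns=None))
print(wide)
```

If the groups had different lengths, the shorter columns would be padded with NaN rather than raising an error.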

Python Pandas: How to take categorical average of a column?

For a given DataFrame as follows:
1 a 10
2 a 20
3 a 30
4 b 10
5 b 100
where column 1 is the index, column 2 is a categorical value and column 3 is a number, I want the mean of column 3 for each category in column 2, which should look something like this:
a 20
b 55
The value for a is calculated as
(10+20+30)/3 = 20
The value for b is calculated as
(10+100)/2 = 55
I think you can use groupby with mean and reset_index:
print(df)
a b c
0 1 a 10
1 2 a 20
2 3 a 30
3 4 b 10
4 5 b 100
df1 = df.groupby('b')['c'].mean().reset_index()
print(df1)
b c
0 a 20
1 b 55
print(df1.c.max())
55
print(df1.c.min())
20
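
For reference, here is a self-contained sketch of the same computation that can be run as is. It assumes the column names a, b and c used in the answer and uses `as_index=False` in place of the trailing reset_index. Note that mean returns floating-point values.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': ['a', 'a', 'a', 'b', 'b'],
    'c': [10, 20, 30, 10, 100],
})

# Average of column c for each category in column b.
means = df.groupby('b', as_index=False)['c'].mean()
print(means)
```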