How to merge two pandas DataFrames correctly - pandas

I have 2 dataframes
df1
Code Sales Store
A 10 alpha
B 5 beta
C 4 gamma
B 3 alpha
df2
Code Unit_Price
A 2
B 3
C 4
D 5
E 6
I want to do two things here.
First, I want to check that all unique codes in df1 are present in df2.
Second, I want to merge these two DataFrames by code.
df3 should look like:
Code Sales Store unit_price
A 10 alpha 2
B 5 beta 3
C 4 gamma 4
B 3 alpha 3
I did
df3 = df1.merge(df2,on='Code',how='left')
I am not sure if this is right; I would appreciate your time and effort in helping me with this.

Use numpy.setdiff1d to check membership of the unique values of the columns:
print (np.setdiff1d(df1['Code'].unique(), df2['Code'].unique()))
[]
print (np.setdiff1d(df2['Code'].unique(), df1['Code'].unique()))
['D' 'E']
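If a simple yes/no answer is enough, `Series.isin` is an alternative sketch to setdiff1d, using the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'Code': ['A', 'B', 'C', 'B'],
                    'Sales': [10, 5, 4, 3],
                    'Store': ['alpha', 'beta', 'gamma', 'alpha']})
df2 = pd.DataFrame({'Code': ['A', 'B', 'C', 'D', 'E'],
                    'Unit_Price': [2, 3, 4, 5, 6]})

# True only if every code in df1 also appears in df2
all_present = df1['Code'].isin(df2['Code']).all()
print(all_present)  # True
```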
Your solution is good, especially if you need to add more columns, like:
print (df2)
Code Unit_Price col
0 A 2 7
1 B 3 2
2 C 4 1
3 D 5 0
4 E 6 3
df3 = df1.merge(df2,on='Code',how='left')
print (df3)
Code Sales Store Unit_Price col
0 A 10 alpha 2 7
1 B 5 beta 3 2
2 C 4 gamma 4 1
3 B 3 alpha 3 2
If you need to add only one column, you can use map with a Series, which should be faster:
df1['unit_price'] = df1['Code'].map(df2.set_index('Code')['Unit_Price'])
print (df1)
Code Sales Store unit_price
0 A 10 alpha 2
1 B 5 beta 3
2 C 4 gamma 4
3 B 3 alpha 3
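As a further sketch, the membership check can also be folded into the merge itself via the indicator parameter, using the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'Code': ['A', 'B', 'C', 'B'],
                    'Sales': [10, 5, 4, 3],
                    'Store': ['alpha', 'beta', 'gamma', 'alpha']})
df2 = pd.DataFrame({'Code': ['A', 'B', 'C', 'D', 'E'],
                    'Unit_Price': [2, 3, 4, 5, 6]})

# indicator=True adds a _merge column marking where each row came from;
# any 'left_only' value means a code in df1 was missing from df2
df3 = df1.merge(df2, on='Code', how='left', indicator=True)
print((df3['_merge'] == 'left_only').any())  # False: every code matched
```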

Related

Using groupby() and cut() in pandas

I have a dataframe and for each group I want to label values: if the value is less than the group mean the label is 1, and if the value is more than the group mean the label is 2.
The input dataframe is:
groups num1
0 a 2
1 a 5
2 a NaN
3 b 10
4 b 4
5 b 0
6 b 7
7 c 2
8 c 4
9 c 1
Here the mean values for groups a, b, c are 3.5, 5.25 and 2.33 respectively, and the output dataframe is:
groups out
0 a 1
1 a 2
2 a NaN
3 b 2
4 b 1
5 b 1
6 b 2
7 c 1
8 c 2
9 c 1
I want to use pandas.cut, and maybe pandas.groupby and pandas.apply as well.
Also, how can I skip null values here?
Thanks in advance
cut is not really pertinent here. Use groupby.transform('mean') and numpy.where:
df['out'] = np.where(df['num1'].lt(df.groupby('groups')['num1']
                                     .transform('mean')),
                     1, 2)
Output (as new column "out" for clarity):
groups num1 out
0 a 2 1
1 a 5 2
2 a NaN 2
3 b 10 2
4 b 4 1
5 b 0 1
6 b 7 2
7 c 2 1
8 c 4 2
9 c 1 1
I really want cut
OK, but it is neither particularly nice nor performant:
(df.groupby('groups')['num1']
.transform(lambda g: pd.cut(g, [-np.inf, g.mean(), np.inf], labels=[1, 2]))
)
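On the question's point about skipping null values: np.where labels NaN rows as 2, because the comparison evaluates to False there. One sketch is to mask the result wherever num1 is missing, so NaN stays NaN as in the desired output:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'groups': list('aaabbbbccc'),
                   'num1': [2, 5, np.nan, 10, 4, 0, 7, 2, 4, 1]})

# per-group mean (NaN is skipped by mean automatically)
mean = df.groupby('groups')['num1'].transform('mean')
# np.where gives the 1/2 labels, then mask restores NaN where num1 was NaN
df['out'] = pd.Series(np.where(df['num1'].lt(mean), 1, 2),
                      index=df.index).mask(df['num1'].isna())
print(df)
```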

Compute lagged means per name and round in pandas

I need to compute lagged means per group in my dataframe. This is how my df looks:
name value round
0 a 5 3
1 b 4 3
2 c 3 2
3 d 1 2
4 a 2 1
5 c 1 1
0 c 1 3
1 d 4 3
2 b 3 2
3 a 1 2
4 b 5 1
5 d 2 1
I would like to compute lagged means for column value per name and round. That is, for name a in round 3 I need to have value_mean = 1.5 (because (1+2)/2). And of course, there will be nan values when round = 1.
I tried this:
df['value_mean'] = df.groupby('name').expanding().mean().groupby('name').shift(1)['value'].values
but it gives nonsense:
name value round value_mean
0 a 5 3 NaN
1 b 4 3 5.0
2 c 3 2 3.5
3 d 1 2 NaN
4 a 2 1 4.0
5 c 1 1 3.5
0 c 1 3 NaN
1 d 4 3 3.0
2 b 3 2 2.0
3 a 1 2 NaN
4 b 5 1 1.0
5 d 2 1 2.5
Any idea, how can I do this, please? I found this, but it seems not relevant for my problem: Calculate the mean value using two columns in pandas
You can do that as follows:
# sort the values so the running statistics accumulate in round order
df.sort_values(['name', 'round'], inplace=True)
df.reset_index(drop=True, inplace=True)
# create a grouper to calculate the running count
# and running sum as the basis of the average
grouper = df.groupby('name')
ser_sum = grouper['value'].cumsum()
ser_count = grouper['value'].cumcount() + 1
ser_mean = ser_sum.div(ser_count)
ser_same_name = df['name'] == df['name'].shift(1)
# finally you just have to set the first entry
# in each name-group to NaN (this usually would
# set the entries for each name and round=1 to NaN)
df['value_mean'] = ser_mean.shift(1).where(ser_same_name, np.nan)
# if you want to see the intermediate products,
# you can uncomment the following lines
#df['sum'] = ser_sum
#df['count'] = ser_count
df
Output:
name value round value_mean
0 a 2 1 NaN
1 a 1 2 2.0
2 a 5 3 1.5
3 b 5 1 NaN
4 b 3 2 5.0
5 b 4 3 4.0
6 c 1 1 NaN
7 c 3 2 1.0
8 c 1 3 2.0
9 d 2 1 NaN
10 d 1 2 2.0
11 d 4 3 1.5
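As a more compact sketch on the same sorted data, the running mean and the lag can also be expressed per group in a single transform (expanding().mean() is the running mean, shift() makes each row see only earlier rounds):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'a', 'c',
                            'c', 'd', 'b', 'a', 'b', 'd'],
                   'value': [5, 4, 3, 1, 2, 1, 1, 4, 3, 1, 5, 2],
                   'round': [3, 3, 2, 2, 1, 1, 3, 3, 2, 2, 1, 1]})

df = df.sort_values(['name', 'round']).reset_index(drop=True)
# expanding mean within each group, shifted so each row only
# sees the mean of the previous rounds of the same name
df['value_mean'] = (df.groupby('name')['value']
                      .transform(lambda s: s.expanding().mean().shift()))
print(df)
```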

How to concatenate a dictionary of pandas DataFrames into a single DataFrame?

I have three DataFrames, each containing a single row
dfA = pd.DataFrame( {'A':[3], 'B':[2], 'C':[1], 'D':[0]} )
dfB = pd.DataFrame( {'A':[9], 'B':[3], 'C':[5], 'D':[1]} )
dfC = pd.DataFrame( {'A':[3], 'B':[4], 'C':[7], 'D':[8]} )
for instance dfA is
A B C D
0 3 2 1 0
I organize them in a dictionary:
data = {'row_1': dfA, 'row_2': dfB, 'row_3': dfC}
I want to concatenate them into a single DataFrame
ans = pd.concat(data)
which returns
A B C D
row_1 0 3 2 1 0
row_2 0 9 3 5 1
row_3 0 3 4 7 8
whereas I want to obtain this
A B C D
row_1 3 2 1 0
row_2 9 3 5 1
row_3 3 4 7 8
That is to say I want to "drop" an index column.
How do I do this?
Use DataFrame.reset_index with the second level and parameter drop=True:
df = ans.reset_index(level=1, drop=True)
print (df)
A B C D
row_1 3 2 1 0
row_2 9 3 5 1
row_3 3 4 7 8
You can reset the index:
pd.concat(data).reset_index(level=-1,drop=True)
Output:
A B C D
row_1 3 2 1 0
row_2 9 3 5 1
row_3 3 4 7 8
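A sketch of the same idea with droplevel, which removes the inner (all-zero) level of the MultiIndex that pd.concat builds from the dictionary keys:

```python
import pandas as pd

dfA = pd.DataFrame({'A': [3], 'B': [2], 'C': [1], 'D': [0]})
dfB = pd.DataFrame({'A': [9], 'B': [3], 'C': [5], 'D': [1]})
dfC = pd.DataFrame({'A': [3], 'B': [4], 'C': [7], 'D': [8]})
data = {'row_1': dfA, 'row_2': dfB, 'row_3': dfC}

# droplevel(1) discards the inner integer level, keeping only the keys
ans = pd.concat(data).droplevel(1)
print(ans)
```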

Transform dataframe columns and keep originals

If I have a dataframe with 3 columns A, B, C, is it possible to get, for example, the rank of these columns while keeping the originals?
so df
A B C
3 4 2
1 2 3
df.add_suffix('_Rank')=df.rank(axis=1)
A B C A_Rank B_Rank C_Rank
3 4 2 2 3 1
1 2 3 1 2 3
Use join with add_suffix on the right side:
df = df.join(df.rank(axis=1).add_suffix('_Rank'))
Or append _Rank to the column names for the new columns:
df[df.columns + '_Rank'] = df.rank(axis=1)
print (df)
A B C A_Rank B_Rank C_Rank
0 3 4 2 2.0 3.0 1.0
1 1 2 3 1.0 2.0 3.0
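The same result can also be sketched with pd.concat along the columns axis, which generalizes to transformations other than rank:

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1], 'B': [4, 2], 'C': [2, 3]})

# concatenate the original columns and the suffixed ranks side by side
out = pd.concat([df, df.rank(axis=1).add_suffix('_Rank')], axis=1)
print(out)
```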

Renaming column of one dataframe by extracting from combination of series and dataframe column names

In the line below, I am renaming the columns of the pnlsummary dataframe from the column names of three series (totalheldmw, totalcost and totalsellprofit) and one dataframe (totalheldprofit).
The difficulty I have is iterating over the column names of the dataframe. I have manually assigned the names, as you can see below. I would suppose there is an efficient way of iterating over the column names of the dataframe. Please advise.
pnlsummary.columns = [totalheldmw.name[0], totalcost.name[0], totalsellprofit.name[0],
                      totalheldprofit.columns[0], totalheldprofit.columns[1],
                      totalheldprofit.columns[2], totalheldprofit.columns[3]]
I think you need to create a list from the constants and then append the column names converted to a list (your original code indexes four totalheldprofit columns, so the slice should be [0:4]):
pnlsummary.columns = ([totalheldmw.name[0], totalcost.name[0], totalsellprofit.name[0]] +
                      totalheldprofit.columns[0:4].astype(str).tolist())
Sample:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df.columns = ['a','s','d'] + df.columns[0:3].tolist()
print (df)
a s d A B C
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
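A sketch with hypothetical stand-ins (the column names p1..p4, the fixed label strings, and the 7-column pnlsummary are invented for illustration, since the question's objects are not shown): the whole column list of the dataframe can be appended in one go, without indexing each entry:

```python
import pandas as pd

# hypothetical stand-ins for the question's objects
totalheldprofit = pd.DataFrame([[1, 2, 3, 4]],
                               columns=['p1', 'p2', 'p3', 'p4'])
pnlsummary = pd.DataFrame([[0] * 7])

# prepend the fixed names, then take every column name of the dataframe
fixed = ['totalheldmw', 'totalcost', 'totalsellprofit']
pnlsummary.columns = fixed + totalheldprofit.columns.astype(str).tolist()
print(pnlsummary.columns.tolist())
```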