Compare Pandas dataframes and add column - pandas

I have two dataframes as below:
df1
A
A1
A2
A3
A1
A2
A3
A4
df2
A C
A1 C1
A2 C2
A3 C3
A4 C4
Each value of column 'A' is mapped to a value in column 'C' of df2.
I want to add a new column 'B' to df1 whose values come from column 'C' of df2.
The final df1 should look like this
df1
A B
A1 C1
A2 C2
A3 C3
A1 C1
A2 C2
A3 C3
A4 C4
I can loop over df2 and add the values to df1, but it's time-consuming because the data is huge.
for index, row in df2.iterrows():
    df1.loc[df1.A.isin([row['A']]), 'B'] = row['C']
Can someone help me understand how I can solve this without looping over df2?
Thanks

You can use map by Series:
df1['B'] = df1.A.map(df2.set_index('A')['C'])
print (df1)
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
It is the same as map with a dict:
d = df2.set_index('A')['C'].to_dict()
print (d)
{'A4': 'C4', 'A3': 'C3', 'A2': 'C2', 'A1': 'C1'}
df1['B'] = df1.A.map(d)
print (df1)
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
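Two properties of the map approach are worth making explicit: the lookup Series needs a unique index, and any key in df1.A that has no match in df2.A becomes NaN. A minimal sketch, assuming the question's data plus one hypothetical unmatched key 'A9' that is added here purely for illustration:

```python
import pandas as pd

# Question's data, plus a hypothetical key 'A9' that is absent from df2.
df1 = pd.DataFrame({"A": ["A1", "A2", "A3", "A1", "A2", "A3", "A4", "A9"]})
df2 = pd.DataFrame({"A": ["A1", "A2", "A3", "A4"],
                    "C": ["C1", "C2", "C3", "C4"]})

# Build the lookup Series: index is the key column, values are the targets.
lookup = df2.set_index("A")["C"]

# map needs one value per key, so check uniqueness before using it.
assert lookup.index.is_unique

df1["B"] = df1["A"].map(lookup)
print(df1)  # the 'A9' row gets NaN in column B
```

If missing keys should not end up as NaN, one option is to chain .fillna(...) after map with whatever default suits the data.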
Timings:
len(df1)=7:
In [161]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
1000 loops, best of 3: 1.73 ms per loop
In [162]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 873 µs per loop
len(df1)=70k:
In [164]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
100 loops, best of 3: 12.8 ms per loop
In [165]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
100 loops, best of 3: 6.05 ms per loop

IIUC, you can just merge and rename the column:
df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
In [103]:
df1 = pd.DataFrame({'A':['A1','A2','A3','A1','A2','A3','A4']})
df2 = pd.DataFrame({'A':['A1','A2','A3','A4'], 'C':['C1','C2','C3','C4']})
merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
merged
Out[103]:
A B
0 A1 C1
1 A2 C2
2 A3 C3
3 A1 C1
4 A2 C2
5 A3 C3
6 A4 C4
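Unlike map, merge returns a new dataframe and leaves df1 untouched. It can also check that df2 has at most one row per key. A small sketch using the question's data; the validate argument is an addition that is not part of the answer above:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2", "A3", "A1", "A2", "A3", "A4"]})
df2 = pd.DataFrame({"A": ["A1", "A2", "A3", "A4"],
                    "C": ["C1", "C2", "C3", "C4"]})

# validate="many_to_one" raises MergeError if a key is duplicated in df2,
# which would otherwise silently add rows to the result.
merged = (
    df1.merge(df2, on="A", how="left", validate="many_to_one")
       .rename(columns={"C": "B"})
)
print(merged)
```

A left merge keeps the row order of df1, so merged lines up with df1 row for row here.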

Based on the searchsorted method, here are three approaches with different indexing schemes. They assume df2.A is sorted and that every value in df1.A occurs in df2.A:
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].values
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].reset_index(drop=True)
df1['B'] = df2.C.values[df2.A.searchsorted(df1.A)]
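searchsorted only gives correct positions when the keys are sorted, and it cannot tell a missing key from a present one. The following is a sketch of one way to guard against both, assuming an unsorted df2 and a hypothetical unmatched key 'A9' (both invented for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2", "A3", "A1", "A2", "A3", "A4", "A9"]})
# df2 deliberately unsorted to show why the sort step matters.
df2 = pd.DataFrame({"A": ["A3", "A1", "A4", "A2"],
                    "C": ["C3", "C1", "C4", "C2"]})

# Sort the lookup table by key, then work on plain NumPy arrays.
df2_sorted = df2.sort_values("A")
keys = df2_sorted["A"].to_numpy(dtype=object)
vals = df2_sorted["C"].to_numpy(dtype=object)
queries = df1["A"].to_numpy(dtype=object)

# Insertion positions; a key larger than all others yields len(keys),
# so clip before indexing.
pos = np.clip(np.searchsorted(keys, queries), 0, len(keys) - 1)

# A position is a real match only if the key found there equals the query.
found = keys[pos] == queries
df1["B"] = np.where(found, vals[pos], None)
print(df1)
```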

Related

Sorting columns of multiindex dataframe

I have a very large multi index dataframe with about 500 columns and each column has 2 sub columns.
The dataframe df looks as:
              B2          B5          B3
bkt           A1    A2    A2    A1    Z2    C1
Date
2019-06-11   0.8   0.2  -6.0  -0.8  -4.1  -0.6
2019-06-12   0.8   0.2  -6.9  -1.6  -5.3  -1.2
df.columns
MultiIndex(levels=[['B2', 'B5', 'B3', .....], ['A1', 'A2' ......]],
labels=[[1, 1, ....], [1, 0, ....]],
names=[None, 'bkt'])
I am trying to sort only the column names and keep the values as they are within each column, to get the following desired output:
              B2          B3          B5
bkt           A1    A2    C1    Z2    A1    A2
Date
2019-06-11  ..
2019-06-12  ..
.. represents the values from the original dataframe. I just didn't retype them.
Setup
df = pd.DataFrame([
    [.8, .2, -6., -.8, -4.1, -.6],
    [.8, .2, -6.9, -1.6, -5.3, -1.2]
],
    pd.date_range('2019-06-11', periods=2, name='Date'),
    pd.MultiIndex.from_arrays([
        'B2 B2 B5 B5 B3 B3'.split(),
        'A1 A2 A2 A1 Z2 C1'.split()
    ], names=[None, 'bkt'])
)
Use sort_index and assign the sorted columns back:
df.columns=df.sort_index(axis=1,level=[0,1],ascending=[True,False]).columns
And, as piR points out, we do not need to create a copy of df; we can just modify the columns directly:
df.columns=df.columns.sort_values(ascending=[True, False])
This should be done using sort_index to move both the column names and data:
df.sort_index(axis=1, level=[0, 1], ascending=[True, False], inplace=True)
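The two styles of answer above behave differently: assigning sorted labels to df.columns renames the columns in place, while sort_index moves each column's data together with its label. A short sketch on the Setup data, sorting both levels ascending for simplicity (an assumption, the answers above use ascending=[True, False]):

```python
import pandas as pd

df = pd.DataFrame(
    [[0.8, 0.2, -6.0, -0.8, -4.1, -0.6],
     [0.8, 0.2, -6.9, -1.6, -5.3, -1.2]],
    index=pd.date_range("2019-06-11", periods=2, name="Date"),
    columns=pd.MultiIndex.from_arrays(
        ["B2 B2 B5 B5 B3 B3".split(), "A1 A2 A2 A1 Z2 C1".split()],
        names=[None, "bkt"],
    ),
)

# Option 1: reorder the columns; every label keeps its own data.
moved = df.sort_index(axis=1)

# Option 2: overwrite the labels only; the data stays where it was,
# so a label can end up on top of another column's values.
relabeled = df.copy()
relabeled.columns = df.columns.sort_values()

print(moved[("B3", "C1")].tolist())      # original (B3, C1) values
print(relabeled[("B3", "C1")].tolist())  # values that used to be (B5, A2)
```

Which one is wanted depends on whether the labels are meant to stay attached to their data.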

Split a dataframe into multiple dataframes

I have a dataframe that I split using groupby. I understand this splits the dataframe into multiple dataframes. How can I get back those individual dataframes, based on the groups, and name them accordingly? So if I said df.groupby(['A','B'])
and A has the value A1 and B has the values B1-B4, I want to get back those 4 dataframes called df_A1B1, df_A1B2, ..., df_A1B4?
This can be done with locals(), but it is not recommended:
variables = locals()
for i,j in df.groupby(['A','B']):
    variables["df_{0[0]}{0[1]}".format(i)] = j
df_01
Out[332]:
A B C
0 0 1 a-1524112-124
Using a dict is the right way:
{"df_{0[0]}{0[1]}".format(i) : j for i,j in df.groupby(['A','B'])}
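To make the dict approach concrete, here is a minimal sketch on a small made-up frame (the column values and the name frames are illustrative assumptions, not from the question):

```python
import pandas as pd

df = pd.DataFrame({"A": ["a1", "a2"] * 4,
                   "B": ["b1", "b2", "b3", "b4"] * 2,
                   "val": range(8)})

# One entry per (A, B) group; the key plays the role of the variable name.
frames = {f"df_{a}{b}": group for (a, b), group in df.groupby(["A", "B"])}

print(sorted(frames))     # names of the available sub-dataframes
print(frames["df_a1b1"])  # rows where A == 'a1' and B == 'b1'
```

Looking a group up by key keeps the namespace clean and still works when the group names are only known at run time.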
Offering an alternate solution, using pandas.DataFrame.xs and some exec magic:
df = pd.DataFrame({'A': ['a1', 'a2']*4,
                   'B': ['b1', 'b2', 'b3', 'b4']*2,
                   'val': [i for i in range(8)]
                   })
df
# A B val
# 0 a1 b1 0
# 1 a2 b2 1
# 2 a1 b3 2
# 3 a2 b4 3
# 4 a1 b1 4
# 5 a2 b2 5
# 6 a1 b3 6
# 7 a2 b4 7
for i in df.set_index(['A', 'B']).index.unique().tolist():
    exec("df_{}{}".format(i[0], i[1]) + " = df.set_index(['A','B']).xs(i)")
df_a1b1
# val
# A B
# a1 b1 0
# b1 4

pandas dataframe group by and agg

I am new to IPython and I am trying to do something with dataframe grouping. I have a dataframe like below:
df_test = pd.DataFrame({"A": range(4), "B": ["B1", "B2", "B1", "B2"], "C": ["C1", "C1", np.nan, "C2"]})
df_test
A B C
0 0 B1 C1
1 1 B2 C1
2 2 B1 NaN
3 3 B2 C2
I would like to achieve following things:
1) group by B, but turn B1 and B2 into columns under a top-level column B instead of using them as row index labels; B1 and B2 hold the counts
2) aggregate columns A and C with something like {'C':['count'],'A':['sum']}
       B
   A  B1  B2  C
0  6   2   2  3
How can I do this? Thanks
You are applying a separate action to each column. You can hack this by aggregating A and C, taking the value counts of B separately, and then combining the results:
ac = df_test.agg({'A':'sum', 'C':'count'})
b = df_test['B'].value_counts()
pd.concat([ac, b]).sort_index().to_frame().T
A B1 B2 C
0 6 2 2 3

Split value from a data.frame and create additional row to store its component

In R, I have a data frame called df such as the following:
A B C D
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5 - 7
a4 b4 c4 2.5
I want to split the value of the third row and D column by the dash and create another row for the second value retaining the other values for that row.
So I want this:
A B C D
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5
a3 b3 c3 7
a4 b4 c4 2.5
Any idea how this can be achieved?
Ideally, I would also want to create an extra column to specify whether the value I split is either a minimum or maximum.
So this:
A B C D E
a1 b1 c1 2.5
a2 b2 c2 3.5
a3 b3 c3 5 min
a3 b3 c3 7 max
a4 b4 c4 2.5
Thanks.
One option would be to use sub to paste 'min' and 'max' into the 'D' column where - is found, and then use cSplit to split the 'D' column.
library(splitstackshape)
df1$D <- sub('(\\d+) - (\\d+)', '\\1,min - \\2,max', df1$D)
res <- cSplit(cSplit(df1, 'D', ' - ', 'long'), 'D', ',')[is.na(D_2), D_2 := '']
setnames(res, 4:5, LETTERS[4:5])
res
# A B C D E
#1: a1 b1 c1 2.5
#2: a2 b2 c2 3.5
#3: a3 b3 c3 5.0 min
#4: a3 b3 c3 7.0 max
#5: a4 b4 c4 2.5
Here's a dplyrish way:
DF %>%
  group_by(A,B,C) %>%
  do(data.frame(D = as.numeric(strsplit(as.character(.$D), " - ")[[1]]))) %>%
  mutate(E = if (n()==2) c("min","max") else "")
A B C D E
(fctr) (fctr) (fctr) (dbl) (chr)
1 a1 b1 c1 2.5
2 a2 b2 c2 3.5
3 a3 b3 c3 5.0 min
4 a3 b3 c3 7.0 max
5 a4 b4 c4 2.5
Dplyr has a policy against expanding rows, as far as I can tell, so the ugly
do(data.frame(... .$ ...))
construct is required. If you are open to data.table, it's arguably simpler here:
library(data.table)
setDT(DF)[,{
  D = as.numeric(strsplit(as.character(D)," - ")[[1]])
  list(D = D, E = if (length(D)==2) c("min","max") else "")
}, by=.(A,B,C)]
A B C D E
1: a1 b1 c1 2.5
2: a2 b2 c2 3.5
3: a3 b3 c3 5.0 min
4: a3 b3 c3 7.0 max
5: a4 b4 c4 2.5
We can use tidyr::separate_rows. I altered the input to include a negative value to make it more general:
df <- read.table(header=TRUE,stringsAsFactors=FALSE,text=
"A B C D
a1 b1 c1 -2.5
a2 b2 c2 3.5
a3 b3 c3 '5 - 7'
a4 b4 c4 2.5")
library(dplyr)
library(tidyr)
df %>%
mutate(E="", E = replace(E, grepl("[^^]-",D), "min - max")) %>%
separate_rows(D,E,sep = "[^^]-", convert = TRUE)
#> A B C D E
#> 1 a1 b1 c1 -2.5
#> 2 a2 b2 c2 3.5
#> 3 a3 b3 c3 5.0 min
#> 4 a3 b3 c3 7.0 max
#> 5 a4 b4 c4 2.5

Multiple group-by with one common variable with pandas?

I want to mark duplicate values within an ID group. For example
ID A B
i1 a1 b1
i1 a1 b2
i1 a2 b2
i2 a1 b2
should become
ID A An B Bn
i1 a1 2 b1 1
i1 a1 2 b2 2
i1 a2 1 b2 2
i2 a1 1 b2 1
Basically, An and Bn count multiplicity within each ID group. How can I do this in pandas? I've found groupby, but it was quite messy to put everything together. I also tried an individual groupby for ID, A and for ID, B. Maybe there is a way to pre-group by ID first and then handle all the other variables? (There are many variables and I have very many rows!)
"Also I tried individual groupby for ID, A and ID, B"
I think this is a straightforward way to tackle it. As you suggest, you can group by each pair separately and then compute the size of the groups. Use transform so you can easily add the results to the original dataframe:
df['An'] = df.groupby(['ID','A'])['A'].transform(np.size)
df['Bn'] = df.groupby(['ID','B'])['B'].transform(np.size)
print(df)
ID A B An Bn
0 i1 a1 b1 2 1
1 i1 a1 b2 2 2
2 i1 a2 b2 1 2
3 i2 a1 b2 1 1
Of course, with lots of columns you could do:
for col in ['A','B']:
    df[col + 'n'] = df.groupby(['ID',col])[col].transform(np.size)
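For reference, the same loop as a self-contained Python 3 sketch on the question's data. It passes the string "size" to transform instead of np.size; that substitution is an assumption on my part and not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["i1", "i1", "i1", "i2"],
                   "A": ["a1", "a1", "a2", "a1"],
                   "B": ["b1", "b2", "b2", "b2"]})

for col in ["A", "B"]:
    # transform broadcasts each group's size back onto the group's rows,
    # so the result lines up with the original index.
    df[col + "n"] = df.groupby(["ID", col])[col].transform("size")

print(df)
```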
The duplicated method can also be used to give you something similar, but it will mark observations within a group after the first as duplicates:
for col in ['A','B']:
    df[col + 'n'] = df.duplicated(['ID',col])
print(df)
ID A B An Bn
0 i1 a1 b1 False False
1 i1 a1 b2 True False
2 i1 a2 b2 False True
3 i2 a1 b2 False False
EDIT: Improving performance for large data. I tried this on a large dataset (4 million rows), and it was significantly faster when I avoided transform and used something like the following (though it is much less elegant):
for col in ['A','B']:
    x = df.groupby(['ID',col]).size()
    df.set_index(['ID',col],inplace=True)
    df[col + 'n'] = x
    df.reset_index(inplace=True)
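The same idea (compute the group sizes once, then align them back by key) can also be written without changing the index in place, by joining the sizes on the key columns. This is only a sketch of an alternative; I have not benchmarked it against the loop above:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["i1", "i1", "i1", "i2"],
                   "A": ["a1", "a1", "a2", "a1"],
                   "B": ["b1", "b2", "b2", "b2"]})

for col in ["A", "B"]:
    # Series of group sizes, indexed by (ID, col) and named e.g. 'An'.
    counts = df.groupby(["ID", col]).size().rename(col + "n")
    # Left join on the key columns; the row order of df is preserved.
    df = df.join(counts, on=["ID", col])

print(df)
```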