Pandas - Groupby three columns with cumsum or cumcount [duplicate]

I need to create a new "identifier" column with a unique value for each combination of two columns. The same identifier should be used when ID and phase are the same (e.g. r1 and ph1), but a new, unique value should be added to the column when the combination changes (e.g. r1 and ph2).
df
ID phase side values
r1 ph1 l 12
r1 ph1 r 34
r1 ph2 l 93
s4 ph3 l 21
s3 ph2 l 88
s3 ph2 r 54
...
I would need a new column (idx) like so:
new_df
ID phase side values idx
r1 ph1 l 12 1
r1 ph1 r 34 1
r1 ph2 l 93 2
s4 ph3 l 21 3
s3 ph2 l 88 4
s3 ph2 r 54 4
...
I've tried applying code from this question but could not find a way to increment the values in idx.

Try groupby with ngroup() + 1, and use sort=False to ensure groups are numbered in the order they appear in the DataFrame:
df['idx'] = df.groupby(['ID', 'phase'], sort=False).ngroup() + 1
df:
ID phase side values idx
0 r1 ph1 l 12 1
1 r1 ph1 r 34 1
2 r1 ph2 l 93 2
3 s4 ph3 l 21 3
4 s3 ph2 l 88 4
5 s3 ph2 r 54 4
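An alternative sketch (not from the original answer) that produces the same numbering with pd.factorize, which also codes keys in order of first appearance; the DataFrame below simply reproduces the sample rows:

import pandas as pd

df = pd.DataFrame({'ID': ['r1', 'r1', 'r1', 's4', 's3', 's3'],
                   'phase': ['ph1', 'ph1', 'ph2', 'ph3', 'ph2', 'ph2'],
                   'side': ['l', 'r', 'l', 'l', 'l', 'r'],
                   'values': [12, 34, 93, 21, 88, 54]})

# factorize assigns 0-based codes in order of first appearance, so add 1
df['idx'] = pd.factorize(list(zip(df['ID'], df['phase'])))[0] + 1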

Related

Merging dataframes on multi-index AND column

TL;DR: Merge two dataframes based on their multi-indices and a column that they share.
The two multi-index dataframes (call them dfA and dfB) do not have unique indices and are of different shapes. However, in both, level 0 of the index specifies the group and level 1 specifies the material. Both dfA and dfB have a column called SR.
The correct identification would therefore involve both indices and the value of the SR column.
import pandas as pd
import numpy as np
tupA = [('G1','M1')]*3 + [('G1','M2')] + [('G2','M3')]*2
indA = pd.MultiIndex.from_tuples(tupA, names = ['Group', 'Material'])
dfA = pd.DataFrame({'SR': [3, 5, 10, 3, 5, 15],
                    'ValA': [1, 2, 1, 4, 5, 6]},
                   index=indA)
tupB = [('G1','M1')]*2 + [('G1','M2')] + [('G2','M3')]
indB = pd.MultiIndex.from_tuples(tupB, names = ['Group', 'Material'])
dfB = pd.DataFrame({'SR': [3, 5, 3, 5],
                    'ValB': [2, 4, 5, 6]},
                   index=indB)
print(dfA,'\n', dfB)
Yields:
SR ValA
Group Material
G1 M1 3 1
M1 5 2
M1 10 1
M2 3 4
G2 M3 5 5
M3 15 6
SR ValB
Group Material
G1 M1 3 2
M1 5 4
M2 3 5
G2 M3 5 6
The Task:
To merge the two dataframes based on the multi-index and the column SR. Where an SR entry in dfA has no match in dfB, the merged value should be np.nan.
Desired Output:
The merged dataframe should be something like this:
SR ValA ValB
Group Material
G1 M1 3 1 2.0
M1 5 2 4.0
M1 10 1 NaN
M2 3 4 5.0
G2 M3 5 5 6.0
M3 15 6 NaN
It has all the rows of dfA but for those SR values which are not in dfB it has NaN.
Attempt at Solution:
I tried a number of left and outer joins but I can't get the NaN. The documentation does have an example which gives NaN but it's not using multi-indices.
Would appreciate some help.
You can use the merge function with your three keys passed to the on argument:
dfA.merge(dfB, on=["Group", "Material", "SR"], how="left")
Output:
SR ValA ValB
Group Material
G1 M1 3 1 2.0
M1 5 2 4.0
M1 10 1 NaN
M2 3 4 5.0
G2 M3 5 5 6.0
M3 15 6 NaN
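If your pandas version does not support passing index level names to on, or you prefer to handle the index explicitly, a roughly equivalent sketch resets the index, merges on ordinary columns, and restores the MultiIndex afterwards:

merged = (dfA.reset_index()
             .merge(dfB.reset_index(), on=['Group', 'Material', 'SR'], how='left')
             .set_index(['Group', 'Material']))
print(merged)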

Keep all columns after sum and groupby, including empty values

I have the following dataframe:
source name cost other_c other_b
a      a       7      dd      33
b      a       6      gg      44
c      c       3      ee      55
b      a       2
d      b      21      qw      21
e      a      16      aq
c      c      10              55
I am summing cost grouped by source and name with:
new_df = df.groupby(['source', 'name'], as_index=False)['cost'].sum()
but it is dropping the remaining 6 columns in my dataframe. Is there a way to keep the rest of the columns? I'm not looking to add a new column, just to carry over the columns from the original dataframe.
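One way to do this (a sketch, assuming the columns are exactly those shown above) is to aggregate cost while carrying the first value of every other column with agg; alternatively, transform keeps every original row and attaches the per-group total as a new column:

# one row per (source, name): sum cost and keep the first value of the other columns
new_df = df.groupby(['source', 'name'], as_index=False).agg(
    {'cost': 'sum', 'other_c': 'first', 'other_b': 'first'})

# or keep every original row and attach the per-group total instead
df['cost_sum'] = df.groupby(['source', 'name'])['cost'].transform('sum')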

finding rows with one difference in DataFrame

I have a data set where a number of rows are nearly identical, meaning they have the same values for all fields except column C.
A B C D ..... Z
0 50 'Ohio' 'Rep' 3 45
1 50 'Ohio' 'Dem' 3 45
2 40 'Kansas' 'Dem' 34 1
3 30 'Kansas' 'Dem' 45 2
4 55 'Texas' 'Rep' 2 7
....
38 55 'Texas' 'Dem' 2 7
I would like to identify all rows that are identical except for column C, but within column C I only want to find combinations of 'Rep' and 'Dem'. So I don't want 2 identical rows with column C being, for instance, 'Rep' and 'Rep'.
A B C D ......Z
0 50 'Ohio' 'Rep' 3 45
1 50 'Ohio' 'Dem' 3 45
4 55 'Texas' 'Rep' 2 7
38 55 'Texas' 'Dem' 2 7
I have used the duplicated method on all columns (but C) and that provides all the rows that are identical. However, it does not guarantee that each duplicated row with 'Rep' is paired with exactly one duplicated row with 'Dem'.
Get all columns except C with difference and convert to a list, then sort_values by column C and, grouping by those columns, convert C to a tuple per group. Finally join back to the original, compare with ('Dem', 'Rep') and filter with boolean indexing:
cols = df.columns.difference(['C']).tolist()
s = df.sort_values('C').groupby(cols)['C'].apply(tuple).rename('m') == ('Dem','Rep')
df = df[df.join(s, on=cols)['m']]
Another solution is to compare by sets; because groups with repeated values like Rep,Dem,Dem are possible, chain the condition with a size check:
g = df.groupby(cols)['C']
m1 = g.transform('size') == 2
m2 = g.transform(lambda x: set(x) == set(['Rep','Dem']))
df = df[m1 & m2]
print (df)
A B C D Z
0 50 'Ohio' Rep 3 45
1 50 'Ohio' Dem 3 45
4 55 'Texas' Rep 2 7
38 55 'Texas' Dem 2 7
You can use duplicated with the argument keep set to False to create a mask for rows that are duplicated after dropping column C, and use isin to filter the rows that have either of ['Rep','Dem'] in them:
mask = df.drop(['C'], axis = 1).duplicated(keep=False)
df[mask][df['C'].isin(['Rep','Dem'])].drop_duplicates()
A B C D Z
0 50 'Ohio' 'Rep' 3 45
1 50 'Ohio' 'Dem' 3 45
4 55 'Texas' 'Rep' 2 7
5 55 'Texas' 'Dem' 2 7
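For reference, a self-contained sketch of the size-plus-set check from the first answer, built from the sample rows in the question (only columns A, B, C, D and Z are reproduced):

import pandas as pd

df = pd.DataFrame({'A': [50, 50, 40, 30, 55, 55],
                   'B': ['Ohio', 'Ohio', 'Kansas', 'Kansas', 'Texas', 'Texas'],
                   'C': ['Rep', 'Dem', 'Dem', 'Dem', 'Rep', 'Dem'],
                   'D': [3, 3, 34, 45, 2, 2],
                   'Z': [45, 45, 1, 2, 7, 7]},
                  index=[0, 1, 2, 3, 4, 38])

cols = df.columns.difference(['C']).tolist()           # every column except C
g = df.groupby(cols)['C']
m1 = g.transform('size') == 2                          # exactly two rows share the other columns
m2 = g.transform(lambda x: set(x) == {'Rep', 'Dem'})   # one 'Rep' and one 'Dem'
print(df[m1 & m2])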

What is the R equivalent of SQL "SELECT * FROM table GROUP BY c1, c2"?

I want to reduce my data frame (EDIT: in a cpu-efficient way) to rows with unique values of the pair c3, c4, while keeping all columns. In other words I want to transform my data frame
> df <- data.frame(c1=seq(7), c2=seq(4, 10), c3=c("A", "B", "B", "C", "B", "A", "A"), c4=c(1, 2, 3, 3, 2, 2, 1))
c1 c2 c3 c4
1 1 4 A 1
2 2 5 B 2
3 3 6 B 3
4 4 7 C 3
5 5 8 B 2
6 6 9 A 2
7 7 10 A 1
to the data frame
c1 c2 c3 c4
1 1 4 A 1
2 2 5 B 2
3 3 6 B 3
4 4 7 C 3
6 6 9 A 2
where the values of c1 and c2 could be any value which occurs for a unique pair of c3, c4. Also the order of the resulting data frame is not of importance.
EDIT: My data frame has around 250 000 rows and 12 columns and should be grouped by 2 columns – therefore I need a CPU-efficient solution.
Working but unsatisfactory alternative
I solved this problem with
> library(sqldf)
> sqldf("Select * from df Group By c3, c4")
but in order to speed up and parallelize my program I have to eliminate the calls to sqldf.
EDIT: Currently the sqldf solution clocks at 3.5 seconds. I consider this a decent time. The problem is that I cannot start various queries in parallel therefore I am searching for an alternative way.
Not working attempts
duplicated()
> df[duplicated(df, by=c("c3", "c4")),]
[1] c1 c2 c3 c4
<0 rows> (or 0-length row.names)
This selects fully duplicated rows rather than rows where only columns c3 and c4 are duplicated.
aggregate()
> aggregate(df, by=list(df$c3, df$c4))
Error in match.fun(FUN) : argument "FUN" is missing, with no default
aggregate requires a function to apply to all rows with the same values of c3 and c4.
data.table's by
> library(data.table)
> dt <- data.table(df)
> dt[,list(c1, c2) ,by=list(c3, c4)]
c3 c4 c1 c2
1: A 1 1 4
2: A 1 7 10
3: B 2 2 5
4: B 2 5 8
5: B 3 3 6
6: C 3 4 7
7: A 2 6 9
does not kick out the rows which have non-unique values of c3 and c4, whereas
> dt[ ,length(c1), by=list(c3, c4)]
c3 c4 V1
1: A 1 2
2: B 2 2
3: B 3 1
4: C 3 1
5: A 2 1
does discard the values of c1 and c2, reducing each group to a single value as specified by the passed function length.
Here is a data.table solution.
library(data.table)
setkey(setDT(df),c3,c4) # convert df to a data.table and set the keys.
df[,.SD[1],by=list(c3,c4)]
# c3 c4 c1 c2
# 1: A 1 1 4
# 2: A 2 6 9
# 3: B 2 2 5
# 4: B 3 3 6
# 5: C 3 4 7
The SQL you propose seems to extract the first row having a given combination of (c3,c4) - I assume that's what you want.
EDIT: Response to OP's comments.
The result you cite seems really odd. The benchmarks below, on a dataset with 12 columns and 2.5e5 rows, show that the data.table solution runs in about 25 milliseconds without setting keys, and in about 7 milliseconds with keys set.
set.seed(1) # for reproducible example
df <- data.frame(c3=sample(LETTERS[1:10], 2.5e5, replace=TRUE),
                 c4=sample(1:10, 2.5e5, replace=TRUE),
                 matrix(sample(1:10, 2.5e6, replace=TRUE), ncol=10))
library(data.table)
DT.1 <- as.data.table(df)
DT.2 <- as.data.table(df)
setkey(DT.2,c3,c4)
f.nokeys <- function() DT.1[,.SD[1],by=list(c3,c4)]
f.keys <- function() DT.2[,.SD[1],by=list(c3,c4)]
library(microbenchmark)
microbenchmark(f.nokeys(),f.keys(),times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# f.nokeys() 23.73651 24.193129 24.609179 25.747767 26.181288 10
# f.keys() 5.93546 6.207299 6.395041 6.733803 6.900224 10
In what ways is your dataset different from this one??
Drawback (maybe): All solutions sort the result by group variables.
Using aggregate
Solution mentioned by Martin: aggregate(. ~ c3 + c4, df, head, 1)
My old solution:
> aggregate(df,by=list(df$c3,df$c4),FUN=head,1)
Group.1 Group.2 c1 c2 c3 c4
1 A 1 1 4 A 1
2 A 2 6 9 A 2
3 B 2 2 5 B 2
4 B 3 3 6 B 3
5 C 3 4 7 C 3
> aggregate(df,by=list(df$c3,df$c4),FUN=head,1)[,-(1:2)]
c1 c2 c3 c4
1 1 4 A 1
2 6 9 A 2
3 2 5 B 2
4 3 6 B 3
5 4 7 C 3
Using ddply
> require(plyr)
Loading required package: plyr
> ddply(df, ~ c3 + c4, head, 1)
c1 c2 c3 c4
1 1 4 A 1
2 6 9 A 2
3 2 5 B 2
4 3 6 B 3
5 4 7 C 3
Some dplyr options:
library(dplyr)
group_by(df, c3, c4) %>% filter(row_number() == 1)
group_by(df, c3, c4) %>% slice(1)
group_by(df, c3, c4) %>% do(head(.,1))
group_by(df, c3, c4) %>% summarise_each(funs(first))
group_by(df, c3, c4) %>% summarise_each(funs(.[1]))
group_by(df, c3, c4) %>% summarise_each(funs(head(.,1)))
group_by(df, c3, c4) %>% distinct()
Here's a dplyr-only benchmark:
library(microbenchmark)
set.seed(99)
df <- data.frame(matrix(sample(500, 25e4*12, replace = TRUE), ncol = 12))
dim(df)
microbenchmark(
f1 = {group_by(df, X1, X2) %>% filter(row_number() == 1)},
f2 = {group_by(df, X1, X2) %>% summarise_each(funs(first))},
f3 = {group_by(df, X1, X2) %>% summarise_each(funs(.[1]))},
f4 = {group_by(df, X1, X2) %>% summarise_each(funs(head(., 1)))},
f5 = {group_by(df, X1, X2) %>% distinct()},
times = 10
)
Unit: milliseconds
expr min lq median uq max neval
f1 498 505 509 527 615 10
f2 726 766 794 815 823 10
f3 1485 1504 1545 1571 1639 10
f4 25170 25668 26027 26188 26406 10
f5 618 622 631 653 675 10
I excluded the version with do(head(.,1)) since it's just not a very good option and takes too long.
You can use interaction and duplicated:
subset(df, !duplicated(interaction(c3, c4)))
# c1 c2 c3 c4
# 1 1 4 A 1
# 2 2 5 B 2
# 3 3 6 B 3
# 4 4 7 C 3
# 6 6 9 A 2
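As an aside for readers coming from the pandas questions above (not part of the original R thread), the same "one arbitrary row per (c3, c4) pair, all columns kept" operation in pandas is a drop_duplicates on a subset of columns:

import pandas as pd

df = pd.DataFrame({'c1': range(1, 8),
                   'c2': range(4, 11),
                   'c3': ['A', 'B', 'B', 'C', 'B', 'A', 'A'],
                   'c4': [1, 2, 3, 3, 2, 2, 1]})

# keep the first row for each unique (c3, c4) pair; all columns are retained
print(df.drop_duplicates(subset=['c3', 'c4']))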

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.ix[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is, how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing this in a way that leaves klmn1 unchanged, and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of your klmn1 dataframe to the column L, then your dataframe will automatically align the indices with any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
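For the second approach the question mentions (a new dataframe klmn11 with an updated M column, leaving klmn1 untouched), a sketch using map to look up the per-L group means:

klmn11 = klmn1.copy()
klmn11['M'] = klmn11['M'] - klmn11['L'].map(m0)   # m0 is indexed by the L values 'a' and 'b'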
Option #1:
df1.subtract(df2, fill_value=0)
Option #2:
df1.subtract(df2, fill_value=None)