I have a dataframe like:
col1 customer
1    a
3    a
1    b
2    b
3    b
5    b
I want the logic to be like this:
col1 customer col2
1    a        1
3    a        1
1    b        1
2    b        2
3    b        3
5    b        3
As you can see, if a customer's values in col1 increase consecutively by 1, keep them as-is; otherwise, repeat the last consecutive value, which is 3 here.
I tried using df.shift() but got stuck.
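For reference, a minimal construction of the example frame shown above (a sketch, matching the displayed data):

import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 1, 2, 3, 5],
                   'customer': ['a', 'a', 'b', 'b', 'b', 'b']})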
Further Example:
col1
1
1
1
3
5
8
10
He should be given a value of 1, because that is the last consistent value for him.
Update
If you have more than one customer (group), you can use this version:
import numpy as np
# Keep the value where it increments the previous row by exactly 1;
# otherwise fall back to the previous (shifted) value.
inc_count = lambda x: np.where(x.diff(1) == 1, x, x.shift(fill_value=x.iloc[0]))
df['col2'] = df.groupby('customer')['col1'].transform(inc_count)
print(df)
# Output
   col1 customer  col2
0     1        a     1
1     3        a     1
2     1        b     1
3     2        b     2
4     3        b     3
5     5        b     3
Maybe you want to increment a counter whenever a row's value is exactly one more than the previous row's value:
# Same as df['col1'].diff().eq(1).cumsum().add(1)
df['col2'] = df['col1'].eq(df['col1'].shift()+1).cumsum().add(1)
print(df)
# Output
   col1 customer  col2
0     1        a     1
1     3        a     1
2     1        b     1
3     2        b     2
4     3        b     3
5     5        b     3
Or for each customer:
inc_count = lambda x: x.eq(x.shift()+1).cumsum().add(1)
df['col2'] = df.groupby('customer')['col1'].transform(inc_count)
print(df)
# Output
   col1 customer  col2
0     1        a     1
1     3        a     1
2     1        b     1
3     2        b     2
4     3        b     3
5     5        b     3
With reference to the related question Pandas groupby with categories with redundant nan:
import pandas as pd
df = pd.DataFrame({"TEAM":[1,1,1,1,2,2,2], "ID":[1,1,2,2,8,4,5], "TYPE":["A","B","A","B","A","A","A"], "VALUE":[1,1,1,1,1,1,1]})
df["TYPE"] = df["TYPE"].astype("category")
df = df.groupby(["TEAM", "ID", "TYPE"]).sum()
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
     4  A         0
        B         0
     5  A         0
        B         0
     8  A         0
        B         0
2    1  A         0
        B         0
     2  A         0
        B         0
     4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0
Expected output
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0
I tried using astype("category") for TYPE. However, it outputs the Cartesian product of every category with every group.
What you want is a little unusual, but we can get there with a pivot table:
out = df.pivot_table(index=['TEAM', 'ID'],
                     columns=['TYPE'],
                     values=['VALUE'],
                     aggfunc='sum',
                     observed=True,  # the key when working with categoricals;
                                     # also worth trying with the groupby from the post you linked
                     fill_value=0).stack()
print(out)
Output:
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0
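For comparison, a sketch (assuming df is the flat frame from the question, before any groupby): passing observed=True to groupby itself drops every unobserved category combination, including the zero B rows that pivot_table keeps via fill_value:

# Only the seven observed (TEAM, ID, TYPE) combinations remain;
# the B rows with VALUE 0 are dropped entirely.
print(df.groupby(["TEAM", "ID", "TYPE"], observed=True).sum())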
Here is one way to do it, based on the data you shared: reset the index, then use groupby to keep only the groups whose sum is greater than 0, meaning at least one of the categories A or B is non-zero. Finally, set the index back:
df.reset_index(inplace=True)
(df[df.groupby(['TEAM', 'ID'])['VALUE']
    .transform(lambda x: x.sum() > 0)]
 .set_index(['TEAM', 'ID', 'TYPE']))
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0
My dataframe
import pandas as pd
df = pd.DataFrame({"TEAM":[1,1,1,1,2,2,2], "ID":[1,1,2,2,8,4,5], "TYPE":["A","B","A","B","A","A","A"], "VALUE":[1,1,1,1,1,1,1]})
df = df.groupby(["TEAM", "ID", "TYPE"]).sum()
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
     5  A         1
     8  A         1
In the above, I would like to ungroup TEAM so that it repeats at the same level as ID.
Expected output
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
1    2  A         1
        B         1
2    4  A         1
2    5  A         1
2    8  A         1
If you want to treat the TEAM and ID columns at the same level, you can create a new column by combining them and group on that new column.
import pandas as pd
df = pd.DataFrame({"TEAM":[1,1,1,1,2,2,2], "ID":[1,1,2,2,8,4,5], "TYPE":["A","B","A","B","A","A","A"], "VALUE":[1,1,1,1,1,1,1]})
df["TEAM_ID"] = df["TEAM"].astype(str) + "_" + df["ID"].astype(str)
df = df[["TEAM_ID", "TYPE", "VALUE"]].groupby(["TEAM_ID", "TYPE"]).sum()
print(df)
Output:
              VALUE
TEAM_ID TYPE
1_1     A         1
        B         1
1_2     A         1
        B         1
2_4     A         1
2_5     A         1
2_8     A         1
and then split the combined values back out when needed, as in the sketch below.
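A sketch of that splitting step (assuming the TEAM_ID format above; the helper names are just for illustration):

out = df.reset_index()
# Recover the original TEAM and ID columns from the combined key
out[["TEAM", "ID"]] = out["TEAM_ID"].str.split("_", expand=True).astype(int)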
Or, to keep the repeated TEAM and ID values, group without selecting a subset of columns:
df = df.groupby(["TEAM_ID", "TYPE"]).sum()
              TEAM  ID  VALUE
TEAM_ID TYPE
1_1     A        1   1      1
        B        1   1      1
1_2     A        1   2      1
        B        1   2      1
2_4     A        2   4      1
2_5     A        2   5      1
2_8     A        2   8      1
I have a sample dataframe df with columns as:
   a  b  c  a  a  b  b  c  c
0  2  2  1  2  2  1  1  2  2
1  2  2  2  2  2  1  2  1  2
...
I want to remove only the duplicate columns named 'a' and keep the others as they are.
The expected output is:
   a  b  c  b  b  c  c
0  2  2  1  1  1  2  2
1  2  2  2  1  2  1  2
Here is a general solution to drop any duplicates of a column, no matter where those columns are in the dataframe and what their content is.
First we collect the column indexes of every occurrence of the given column name after the first. Then we subtract these indexes from the set of all column indexes and keep the remaining columns:
to_drop = 'a'
dup = [i for i, v in enumerate(df.columns) if v == to_drop][1:]
# sorted() keeps the original column order (a bare set is unordered)
df = df.iloc[:, sorted(set(range(len(df.columns))) - set(dup))]
Result:
   a  b  c  b  b  c  c
0  2  2  1  1  1  2  2
1  2  2  2  1  2  1  2
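An alternative sketch using Index.duplicated, which marks every occurrence of a label after its first, so column order is preserved without index arithmetic:

# Keep a column unless it is a repeated occurrence of 'a'
df = df.loc[:, ~(df.columns.duplicated() & (df.columns == to_drop))]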
df = df.T.reset_index().drop_duplicates().set_index('index').T
df.columns.name = None  # drop the leftover 'index' label on the columns
Explanation
Since the duplicated a columns contain identical values, we can simply transpose and reset the index:
df.T.reset_index()
  index  0  1
0     a  2  2
1     b  2  2
2     c  1  2
3     b  1  1
4     b  1  2
5     c  2  1
6     c  2  2
Applying drop_duplicates on the dataframe above removes only the duplicate rows. This also works when more than one column contains duplicated values.
Output
   a  b  c  b  b  c  c
0  2  2  1  1  1  2  2
1  2  2  2  1  2  1  2
I am trying to count by group; see the input and output below.
input:
df = pd.DataFrame()
df['col1'] = ['a','a','a','a','b','b','b']
df['col2'] = [4,4,5,5,6,7,8]
df['col3'] = [1,1,1,1,1,1,1]
output:
   col4
0     2
1     2
2     2
3     2
4     1
5     1
6     1
I tried playing around with groupby and count:
s = df.groupby(['col1','col2'])['col3'].sum()
and the output I got was
a  4    2
   5    2
b  6    1
   7    1
   8    1
How do I add it as a column on the main df? Thanks very much!
Use transform with len or 'size':
df['count'] = df.groupby(['col1','col2'])['col3'].transform(len)
print (df)
  col1  col2  col3  count
0    a     4     1      2
1    a     4     1      2
2    a     5     1      2
3    a     5     1      2
4    b     6     1      1
5    b     7     1      1
6    b     8     1      1
df['count'] = df.groupby(['col1','col2'])['col3'].transform('size')
print (df)
  col1  col2  col3  count
0    a     4     1      2
1    a     4     1      2
2    a     5     1      2
3    a     5     1      2
4    b     6     1      1
5    b     7     1      1
6    b     8     1      1
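One subtlety: 'size' counts rows including NaN values, while 'count' counts only non-NaN values, so the two can differ. A small sketch (the tmp frame is just for illustration):

import pandas as pd
import numpy as np

tmp = pd.DataFrame({'g': ['a', 'a'], 'v': [1, np.nan]})
print(tmp.groupby('g')['v'].transform('size'))   # 2, 2  (NaN row is counted)
print(tmp.groupby('g')['v'].transform('count'))  # 1, 1  (NaN row is not)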
But column col3 is not necessary; you can use col1 or col2:
df = pd.DataFrame()
df['col1'] = ['a','a','a','a','b','b','b']
df['col2'] = [4,4,5,5,6,7,8]
df['count'] = df.groupby(['col1','col2'])['col1'].transform(len)
df['count1'] = df.groupby(['col1','col2'])['col2'].transform(len)
print (df)
  col1  col2  count  count1
0    a     4      2       2
1    a     4      2       2
2    a     5      2       2
3    a     5      2       2
4    b     6      1       1
5    b     7      1       1
6    b     8      1       1
Try this:
df['count'] = df.groupby(['col1','col2'])['col3'].transform(sum)
print (df)
  col1  col2  col3  count
0    a     4     1      2
1    a     4     1      2
2    a     5     1      2
3    a     5     1      2
4    b     6     1      1
5    b     7     1      1
6    b     8     1      1
What is the opposite of the pivot function in Pandas?
For example I have
a  b  c
1  1  2
2  2  3
3  1  2
What I want
a  newcol  newcol2
1  b       1
1  c       2
2  b       2
2  c       3
3  b       1
3  c       2
Use pd.melt: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
import pandas as pd
df = pd.DataFrame({'a':[1,2,3],'b':[1,2,1],'c':[2,3,2]})
pd.melt(df,id_vars=['a'])
Out[8]:
   a variable  value
0  1        b      1
1  2        b      2
2  3        b      1
3  1        c      2
4  2        c      3
5  3        c      2
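If you also want the exact column names from the question, melt accepts var_name and value_name parameters to rename the output columns:

pd.melt(df, id_vars=['a'], var_name='newcol', value_name='newcol2')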