Unmerge specific group in multiindex groupby dataframe - pandas

My dataframe
import pandas as pd
df = pd.DataFrame({"TEAM":[1,1,1,1,2,2,2], "ID":[1,1,2,2,8,4,5], "TYPE":["A","B","A","B","A","A","A"], "VALUE":[1,1,1,1,1,1,1]})
df = df.groupby(["TEAM", "ID", "TYPE"]).sum()
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
     5  A         1
     8  A         1
In the output above, I would like TEAM to be shown on the same level as ID, i.e. repeated for every ID group.
Expected output
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
1    2  A         1
        B         1
2    4  A         1
2    5  A         1
2    8  A         1

If you want to treat the TEAM and ID columns as a single level, you can combine them into a new column and group on that instead.
import pandas as pd
df = pd.DataFrame({"TEAM":[1,1,1,1,2,2,2], "ID":[1,1,2,2,8,4,5], "TYPE":["A","B","A","B","A","A","A"], "VALUE":[1,1,1,1,1,1,1]})
df["TEAM_ID"] = df["TEAM"].astype(str) + "_" + df["ID"].astype(str)
df = df[["TEAM_ID", "TYPE", "VALUE"]].groupby(["TEAM_ID", "TYPE"]).sum()
print(df)
Output:
             VALUE
TEAM_ID TYPE
1_1     A        1
        B        1
1_2     A        1
        B        1
2_4     A        1
2_5     A        1
2_8     A        1
and then split the values back apart whenever they are needed.
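A minimal sketch of that split, assuming the "_" separator used above (out is just a hypothetical name for the result):
out = df.reset_index()
out[["TEAM", "ID"]] = out["TEAM_ID"].str.split("_", expand=True).astype(int)
print(out)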
OR
df = df.groupby(["TEAM_ID", "TYPE"]).sum()
to keep the TEAM and ID values alongside VALUE. (Each (TEAM_ID, TYPE) group here contains a single row, so summing leaves TEAM and ID unchanged; with several rows per group they would be added up as well.)
              TEAM  ID  VALUE
TEAM_ID TYPE
1_1     A        1   1      1
        B        1   1      1
1_2     A        1   2      1
        B        1   2      1
2_4     A        2   4      1
2_5     A        2   5      1
2_8     A        2   8      1
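If the goal is only to see the repeated TEAM values, note that the grouped frame from the question already stores them on every row; pandas merely sparsifies the printed MultiIndex. A minimal display-only sketch, assuming the original ungrouped df from the question (this repeats every index level, so it prints slightly more than the expected output, and it does not change the data):
pd.set_option("display.multi_sparse", False)  # show repeated index labels instead of blanks
print(df.groupby(["TEAM", "ID", "TYPE"]).sum())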

Related

Pandas shift logic

I have a dataframe like:
col1 customer
1 a
3 a
1 b
2 b
3 b
5 b
I want the logic to be like this:
col1 customer col2
1 a 1
3 a 1
1 b 1
2 b 2
3 b 3
5 b 3
As you can see, while the customer's col1 values are consecutive, the counter keeps increasing; once the sequence breaks, the last reached number is repeated, which is 3 here.
I tried using df.shift() but got stuck.
Further Example:
col1
1
1
1
3
5
8
10
this customer should be given a value of 1 on every row, because that is the last consecutive value reached!
Update
If your data has more than one group (months, customers, ...), you can use this grouped version:
import numpy as np

# Where col1 continues a +1 run, keep it; otherwise fall back to the previous
# row's col1 value (using the group's first value for the first row).
inc_count = lambda x: np.where(x.diff(1) == 1, x, x.shift(fill_value=x.iloc[0]))
df['col2'] = df.groupby('customer')['col1'].transform(inc_count)
print(df)
# Output
col1 customer col2
0 1 a 1
1 3 a 1
2 1 b 1
3 2 b 2
4 3 b 3
5 5 b 3
Maybe you want to increment a counter whenever the current row value directly follows the previous one (i.e. is exactly one greater):
# Same as df['col1'].diff().eq(1).cumsum().add(1)
df['col2'] = df['col1'].eq(df['col1'].shift()+1).cumsum().add(1)
print(df)
# Output
col1 customer col2
0 1 a 1
1 3 a 1
2 1 b 1
3 2 b 2
4 3 b 3
5 5 b 3
Or for each customer:
inc_count = lambda x: x.eq(x.shift() + 1).cumsum().add(1)
df['col2'] = df.groupby('customer')['col1'].transform(inc_count)
print(df)
# Output
col1 customer col2
0 1 a 1
1 3 a 1
2 1 b 1
3 2 b 2
4 3 b 3
5 5 b 3
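To see how the ungrouped counter is built, here is a small breakdown with hypothetical helper columns (illustration only, same df as above):
tmp = df[['col1', 'customer']].copy()
tmp['prev_plus_1'] = tmp['col1'].shift() + 1           # previous col1 value plus one
tmp['continues'] = tmp['col1'].eq(tmp['prev_plus_1'])  # True when col1 continues a +1 run
tmp['col2'] = tmp['continues'].cumsum().add(1)         # count the continuations, starting at 1
print(tmp)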

Pandas groupby of specific categorical column

With reference to Pandas groupby with categories with redundant nan
import pandas as pd
df = pd.DataFrame({"TEAM":[1,1,1,1,2,2,2], "ID":[1,1,2,2,8,4,5], "TYPE":["A","B","A","B","A","A","A"], "VALUE":[1,1,1,1,1,1,1]})
df["TYPE"] = df["TYPE"].astype("category")
df = df.groupby(["TEAM", "ID", "TYPE"]).sum()
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
     4  A         0
        B         0
     5  A         0
        B         0
     8  A         0
        B         0
2    1  A         0
        B         0
     2  A         0
        B         0
     4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0
Expected output
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0
I tried to use astype("category") for TYPE. However, it outputs the full Cartesian product of the group keys, including combinations that never occur in the data.
What you want is a little unusual, but we can force it there with a pivot table:
out = df.pivot_table(index=['TEAM', 'ID'],
                     columns=['TYPE'],
                     values=['VALUE'],
                     aggfunc='sum',
                     observed=True,   # the key when working with categoricals;
                                      # also worth trying with the groupby from the post you linked
                     fill_value=0).stack()
print(out)
Output:
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0
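As the comment above hints, observed=True is also available on groupby itself. A hedged sketch, assuming df is the raw categorical frame from the question; note this drops unobserved combinations entirely, so the B rows with 0 from the expected output do not appear:
print(df.groupby(["TEAM", "ID", "TYPE"], observed=True).sum())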
Here is one way to do it, based on the data you shared:
Reset the index, then use groupby to keep only the groups whose VALUE sum is greater than 0, i.e. where at least one of the categories A or B is non-zero. Finally, restore the index:
df.reset_index(inplace=True)
(df[df.groupby(['TEAM', 'ID'])['VALUE']
    .transform(lambda x: x.sum() > 0)]
 .set_index(['TEAM', 'ID', 'TYPE']))
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0

change all values in a dataframe with other values from another dataframe

I just started learning pandas.
I have 2 dataframes.
The first one is
val num
0 1 0
1 2 1
2 3 2
3 4 3
4 5 4
and the second one is
0 1 2 3
0 1 2 3 4
1 5 3 2 2
2 2 5 3 2
I want to change my second dataframe so that each of its values is looked up in the val column of the first dataframe and replaced by the corresponding value from the num column of dataframe 1. In the end I need to get the following dataframe:
0 1 2 3
0 0 1 2 3
1 4 2 1 1
2 1 4 2 1
How do I do that in pandas?
You can use DataFrame.replace() to do this:
df2.replace(df1.set_index('val')['num'])
Explanation:
1. Set the val column of the first DataFrame as the index. This changes how the matching is performed in step 3.
2. Convert the first DataFrame to a Series, by sub-setting to the index and the num column. It looks like this:
val
1 0
2 1
3 2
4 3
5 4
Name: num, dtype: int64
3. Use DataFrame.replace() to do the replacement in the second DataFrame. It looks up each value from the second DataFrame, finds a matching index in the Series, and replaces it with the value from the Series.
Full reproducible example:
import pandas as pd
import io
s = """ val num
0 1 0
1 2 1
2 3 2
3 4 3
4 5 4"""
df1 = pd.read_csv(io.StringIO(s), sep=r"\s+")
s = """ 0 1 2 3
0 1 2 3 4
1 5 3 2 2
2 2 5 3 2"""
df2 = pd.read_csv(io.StringIO(s), sep=r"\s+")
print(df2.replace(df1.set_index('val')['num']))
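Worth knowing: values that have no match in the mapping are left untouched by replace. A tiny hedged check with a hypothetical probe frame:
probe = pd.DataFrame({"x": [1, 99]})
print(probe.replace(df1.set_index('val')['num']))  # 1 -> 0, but 99 has no match and stays 99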
Create the mapping dict, then replace:
mpd = dict(zip(df1.val,df1.num))
df2.replace(mpd, inplace=True)
0 1 2 3
0 0 1 2 3
1 4 2 1 1
2 1 4 2 1
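Another hedged sketch of the same mapping, pushing every cell through the dict with stack/unstack (assumes the original, un-replaced df2 and the mpd dict from above; unlike replace, map turns unmapped values into NaN):
out = df2.stack().map(mpd).unstack()
print(out)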

How to remove one specific duplicate named column in columns of a dataframe?

I have a sample dataframe df with columns as:
a b c a a b b c c
0 2 2 1 2 2 1 1 2 2
1 2 2 2 2 2 1 2 1 2
. . .
. . .
I want to remove only the duplicate columns named 'a' and keep the others as they are.
The expected output is:
a b c b b c c
0 2 2 1 1 1 2 2
1 2 2 2 1 2 1 2
Here is a general solution to drop any duplicates of a column, no matter where these columns are in the dataframe and what the content of these columns is.
First we get all column indexes for the given column name and drop the first occurrence. Then we subtract these indexes from the full set of indexes and keep the remaining columns:
to_drop = 'a'
dup = [i for i, v in enumerate(df.columns) if v == to_drop][1:]
# sort to preserve the original column order after the set difference
df = df.iloc[:, sorted(set(range(len(df.columns))) - set(dup))]
Result:
a b c b b c c
0 2 2 1 1 1 2 2
1 2 2 2 1 2 1 2
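A more compact hedged alternative built on Index.duplicated, which flags every occurrence of a label after its first; combined with a name check, it drops only the later 'a' columns:
df = df.loc[:, ~(df.columns.duplicated() & (df.columns == to_drop))]
print(df)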
df = df.T.reset_index().drop_duplicates().set_index('index').T
df.columns.name = None
Explanation:
Since the duplicated a columns contain identical values, we can simply transpose the frame and reset the index:
df.T.reset_index()
index 0 1
0 a 2 2
1 b 2 2
2 c 1 2
3 b 1 1
4 b 1 2
5 c 2 1
6 c 2 2
Applying drop_duplicates on the frame above removes only the duplicate rows (the repeated a columns). This also serves the purpose when more than one column carries duplicated values.
Output
a b c b b c c
0 2 2 1 1 1 2 2
1 2 2 2 1 2 1 2

pandas grouping based on conditions on other columns

import pandas as pd
df = pd.DataFrame(columns=['A','B'])
df['A']=['A','B','A','A','B','B','B']
df['B']=[2,4,3,5,6,7,8]
df
A B
0 A 2
1 B 4
2 A 3
3 A 5
4 B 6
5 B 7
6 B 8
df.columns=['id','num']
df
id num
0 A 2
1 B 4
2 A 3
3 A 5
4 B 6
5 B 7
6 B 8
I would like to apply groupby on the id column, but with a condition on the num column.
I want two columns, is_even_count and is_odd_count, in the final data frame, where is_even_count counts only the even numbers in the num column after grouping and is_odd_count counts only the odd ones.
My output should be:
is_even_count is_odd_count
A 1 2
B 3 1
How can I do this in pandas?
Use modulo division by 2, compare with 1, and map the resulting boolean to labels:
d = {True:'is_odd_count', False:'is_even_count'}
df = df.groupby(['id', (df['num'] % 2 == 1).map(d)]).size().unstack(fill_value=0)
print(df)
num  is_even_count  is_odd_count
id
A                1             2
B                3             1
Another solution with crosstab:
df = pd.crosstab(df['id'], (df['num'] % 2 == 1).map(d))
Alternative with numpy.where:
import numpy as np

a = np.where(df['num'] % 2 == 1, 'is_odd_count', 'is_even_count')
df = pd.crosstab(df['id'], a)
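Another hedged option, if you prefer to skip the label mapping: named aggregations on the grouped num column (assumes the original df with id and num columns; the result column names here are my choice, matching the expected output):
out = df.groupby('id')['num'].agg(
    is_even_count=lambda s: s.mod(2).eq(0).sum(),  # count even values per group
    is_odd_count=lambda s: s.mod(2).eq(1).sum(),   # count odd values per group
)
print(out)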