How to unpivot table from boolean form - pandas

I have a table like this, where the type (A, B, C) is represented in boolean (one-hot) form:
ID     A  B  C
One    1  0  0
Two    0  0  1
Three  0  1  0
I want to have the table like:
ID     Type
One    A
Two    C
Three  B
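For reference, a minimal sketch reconstructing the example input (values taken from the tables above):
import pandas as pd

df = pd.DataFrame({'ID': ['One', 'Two', 'Three'],
                   'A': [1, 0, 0],
                   'B': [0, 0, 1],
                   'C': [0, 1, 0]})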

You can melt, then select the rows equal to 1 with loc, using pop to drop the intermediate value column in the same step:
out = df.melt('ID', var_name='Type').loc[lambda d: d.pop('value').eq(1)]
Output:
ID Type
0 One A
5 Three B
7 Two C
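Note that this scrambles the original row order (see the index values 0, 5, 7 above). If you want it back, one sketch, assuming pandas >= 1.1, keeps the original index through melt and sorts on it afterwards:
out = (df.melt('ID', var_name='Type', ignore_index=False)
         .loc[lambda d: d.pop('value').eq(1)]
         .sort_index())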

Alternatively, with np.where (note that y indexes into the sliced columns, so the column lookup must skip 'ID'):
import numpy as np

x, y = np.where(df.iloc[:, 1:])
out = pd.DataFrame({'ID': df.loc[x, 'ID'], 'Type': df.columns[1:][y]})
Output:
ID Type
0 One A
1 Two C
2 Three B

You can also use the pd.from_dummies constructor, added in pandas 1.5.
Note that this also preserves the original order of your ID column.
df['Type'] = pd.from_dummies(df.loc[:, 'A':'C'])
print(df)
ID A B C Type
0 One 1 0 0 A
1 Two 0 0 1 C
2 Three 0 1 0 B
print(df[['ID', 'Type']])
ID Type
0 One A
1 Two C
2 Three B

Related

Multimatch join in pandas

I am looking to join two data frames on one column and, if there are multiple matches, append the joined results to another column.
NB. using a different example as yours is not reproducible.
You can normalize the case with str.upper, then split and explode, map the values, and groupby.agg to join them back as a string:
mapper = df2.set_index('name')['ID'].astype(str)
df1['ID'] = (df1['name']
             .str.upper().str.split(',')
             .explode()
             .map(mapper)
             .groupby(level=0).agg(','.join)
             )
Or, with a list comprehension:
mapper = df2.set_index('name')['ID'].astype(str)
# upper-case each name so lowercase entries match the mapper keys
df1['ID'] = [','.join([mapper[x] for x in s.upper().split(',') if x in mapper])
             for s in df1['name']]
Output:
name ID
0 A 1
1 b 2
2 A,B 1,2
3 C,a 3,1
4 D 4
Used input:
# df1
name
0 A
1 b
2 A,B
3 C,a
4 D
# df2
name ID
0 A 1
1 B 2
2 C 3
3 D 4
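For completeness, a sketch reconstructing the used input shown above:
import pandas as pd

df1 = pd.DataFrame({'name': ['A', 'b', 'A,B', 'C,a', 'D']})
df2 = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'ID': [1, 2, 3, 4]})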

How to check pair of string values in a column, after grouping the dataframe using ID column?

I have a dataframe containing two columns, ID and Code; the Flag column shown below is the desired output:
ID Code Flag
1 A 0
1 C 1
1 B 1
2 A 0
2 B 1
3 A 0
4 C 0
Within each ID, if Code 'A' exists together with 'B' or 'C', then those 'B'/'C' rows should be flagged 1.
I tried groupby('ID') with filter(), but it did not give the correct result. Could anyone please help?
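A minimal sketch reconstructing the input (the Flag column above is the expected output):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 4],
                   'Code': ['A', 'C', 'B', 'A', 'B', 'A', 'C']})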
You can do the following:
First use groupby('ID') and concatenate the codes using 'sum' to create a helper column. Then assign the value 1 where a row's Code is B or C and the helper column contains an A:
# concatenate all codes per ID into a helper column
df['s'] = df.groupby('ID').Code.transform('sum')
df['Flag'] = 0
# flag B/C rows whose group also contains an A
df.loc[((df.Code == 'B') | (df.Code == 'C')) & df.s.str.contains('A'), 'Flag'] = 1
df = df.drop(columns='s')
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
You can use boolean masks: a direct one for B/C and a per-group one for A, then combine them and convert to integer:
# is the Code a B or C?
m1 = df['Code'].isin(['B', 'C'])
# is there also an 'A' in the same group?
m2 = df['Code'].eq('A').groupby(df['ID']).transform('any')
# if both are True, flag 1
df['Flag'] = (m1&m2).astype(int)
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0

How to transform dataframe column containing list of values in to its own individual column with count of occurrence?

I have a frame like this
presence_data = pd.DataFrame({
    "id": ["id1", "id2"],
    "presence": [
        ["A", "B", "C", "A"],
        ["G", "A", "B", "I", "B"],
    ]
})
id   presence
id1  [A, B, C, A]
id2  [G, A, B, I, B]
I want to transform the above into something like this:
id   A  B  C  G  I
id1  2  1  1  0  0
id2  1  2  0  1  1
Currently, I have an approach where I iterate over the rows and over the values in the presence column, then create/update new columns with counts based on the values encountered. I want to see if there is a better way.
Edited based on feedback from Henry Ecker in the comments; might as well have the better answer here:
You can use DataFrame.explode() to turn every list element into its own row, and then pd.crosstab() to count the occurrences.
df = presence_data.explode('presence')
pd.crosstab(index=df['id'],columns=df['presence'])
This gave me the following:
presence A B C G I
id
id1 2 1 1 0 0
id2 1 2 0 1 1
from collections import Counter

(presence_data
 .set_index('id')
 .presence
 .map(Counter)
 .apply(pd.Series)
 .fillna(0, downcast='infer')
 .reset_index()
)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1
Speed-wise it is hard to say: it is usually more efficient to work with Python-native data structures within Python, yet this solution involves a lot of method calls, which are themselves relatively expensive.
Alternatively, you can create a new dataframe (and reduce the number of method calls):
(pd.DataFrame(map(Counter, presence_data.presence),
              index=presence_data.id)
 .fillna(0, downcast='infer')
 .reset_index()
)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1
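Since the relative speed is hard to predict, here is a minimal benchmarking sketch with timeit (the variant functions are illustrative wrappers of the answers above):
import timeit
from collections import Counter

import pandas as pd

presence_data = pd.DataFrame({
    "id": ["id1", "id2"],
    "presence": [["A", "B", "C", "A"], ["G", "A", "B", "I", "B"]],
})

def crosstab_variant():
    # explode the lists to rows, then count occurrences per id
    e = presence_data.explode('presence')
    return pd.crosstab(index=e['id'], columns=e['presence'])

def counter_variant():
    # build the frame directly from per-row Counters
    return (pd.DataFrame(map(Counter, presence_data.presence),
                         index=presence_data.id)
            .fillna(0)
            .reset_index())

for fn in (crosstab_variant, counter_variant):
    print(fn.__name__, timeit.timeit(fn, number=100))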
You can use apply and value_counts. First we use the lists in your presence column to create new columns. We can then use axis=1 to get the row value counts.
df = (pd.DataFrame(presence_data['presence'].tolist(), index=presence_data['id'])
        .apply(pd.Series.value_counts, axis=1)
        .fillna(0)
        .astype(int))
print(df)
A B C G I
id
id1 2 1 1 0 0
id2 1 2 0 1 1
You can run this afterwards if you want id as a column rather than the index.
df.reset_index(inplace=True)
print(df)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1

Pandas merge conflict rows by counts?

A conflict row is when two rows have the same feature but different labels, like this:
feature label
a 1
a 0
Now, I want to merge these conflict rows into a single label chosen by their counts. If I have more 'a 1' rows, then a will be labeled 1. Otherwise, a should be labeled 0.
I can find these conflicts with df1 = df.groupby('feature', as_index=False).nunique(); df1 = df1[df1['label'] == 2], and their value counts with df2 = df.groupby('feature')['label'].value_counts().reset_index(name='counts').
But how can I find these conflict rows and their counts in one DataFrame (df_conflict = ?), and then merge them by counts (df_merged = merge(df))?
Let's take df = pd.DataFrame({"feature": ['a','a','b','b','a','c','c','d'], 'label': [1,0,0,1,1,0,0,1]}) as an example.
feature label
0 a 1
1 a 0
2 b 0
3 b 1
4 a 1
5 c 0
6 c 0
7 d 1
df_conflict should be :
feature label counts
a 1 2
a 0 1
b 0 1
b 1 1
And df_merged will be:
feature label
a 1
b 0
c 0
d 1
I think you need to first filter for groups with more than one unique value, using DataFrameGroupBy.nunique with GroupBy.transform, before SeriesGroupBy.value_counts:
df1 = df[df.groupby('feature')['label'].transform('nunique').gt(1)]
df_conflict = df1.groupby('feature')['label'].value_counts().reset_index(name='count')
print (df_conflict)
feature label count
0 a 1 2
1 a 0 1
2 b 0 1
3 b 1 1
For the second, get each feature's label by maximum occurrences:
df_merged = df.groupby('feature')['label'].agg(lambda x: x.value_counts().index[0]).reset_index()
print (df_merged)
feature label
0 a 1
1 b 0
2 c 0
3 d 1
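An equivalent sketch using Series.mode; note that tie-breaking is arbitrary in both approaches (mode()[0] picks the smallest label on an exact tie):
df_merged = (df.groupby('feature')['label']
               .agg(lambda x: x.mode()[0])
               .reset_index())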

Extract rows with maximum values in pandas dataframe

We can use .idxmax to get the index of the maximum value of a dataframe (df). My problem is that I have a df with several columns (more than 10), and one of the columns has identifiers with repeated values. I need to extract, for each identifier, the maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to get it by using df.groupby(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!
Close! Group by the id, then use the value column and return the max for each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
If the OP wants to broadcast these values back onto the frame, just create a transform and assign.
In [17]: df['max'] = df.groupby('id')['value'].transform(lambda x: x.max())
In [18]: df
Out[18]:
id value max
0 a 0 0
1 b 1 1
2 b 1 1
3 c 0 2
4 c 2 2
5 c 1 2
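If the goal is the deduplicated two-column frame from the question, a sketch with as_index=False gives it directly:
out = df.groupby('id', as_index=False)['value'].max()
With the example data, out has one row per id: a 0, b 1, c 2.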