How to unpivot table from boolean form - pandas

I have a table like this, where the type (A, B, C) is represented in boolean (one-hot) form:
ID     A  B  C
One    1  0  0
Two    0  0  1
Three  0  1  0
I want to have the table like:
ID     Type
One    A
Two    C
Three  B
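For reference, a minimal sketch reconstructing the example input (values taken from the tables above):
import pandas as pd

df = pd.DataFrame({'ID': ['One', 'Two', 'Three'],
                   'A': [1, 0, 0],
                   'B': [0, 0, 1],
                   'C': [0, 1, 0]})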

You can melt, then select the rows equal to 1 with loc, using pop to drop the intermediate value column in the same step:
out = df.melt('ID', var_name='Type').loc[lambda d: d.pop('value').eq(1)]
Output:
ID Type
0 One A
5 Three B
7 Two C
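Note that this scrambles the original row order (see the index values 0, 5, 7 above). If you want it back, one sketch, assuming pandas >= 1.1, keeps the original index through melt and sorts on it afterwards:
out = (df.melt('ID', var_name='Type', ignore_index=False)
         .loc[lambda d: d.pop('value').eq(1)]
         .sort_index())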

Alternatively, with np.where (note that y indexes into the sliced columns, so the column lookup must skip 'ID'):
import numpy as np

x, y = np.where(df.iloc[:, 1:])
out = pd.DataFrame({'ID': df.loc[x, 'ID'], 'Type': df.columns[1:][y]})
Output:
ID Type
0 One A
1 Two C
2 Three B

You can also use the pd.from_dummies constructor, added in pandas 1.5.
Note that this also preserves the original order of your ID column.
df['Type'] = pd.from_dummies(df.loc[:, 'A':'C'])
print(df)
ID A B C Type
0 One 1 0 0 A
1 Two 0 0 1 C
2 Three 0 1 0 B
print(df[['ID', 'Type']])
ID Type
0 One A
1 Two C
2 Three B

Related

Multimatch join in pandas

I am looking to join two data frames on one column and, if there are multiple matches, append the joined results to another column.
NB. using a different example as yours is not reproducible.
You can normalize the case with str.upper, then split and explode, map the values, and groupby.agg to join them back as a string:
mapper = df2.set_index('name')['ID'].astype(str)
df1['ID'] = (df1['name']
             .str.upper().str.split(',')
             .explode()
             .map(mapper)
             .groupby(level=0).agg(','.join)
             )
Or, with a list comprehension:
mapper = df2.set_index('name')['ID'].astype(str)
# upper-case each name so lowercase entries match the mapper keys
df1['ID'] = [','.join([mapper[x] for x in s.upper().split(',') if x in mapper])
             for s in df1['name']]
Output:
name ID
0 A 1
1 b 2
2 A,B 1,2
3 C,a 3,1
4 D 4
Used input:
# df1
name
0 A
1 b
2 A,B
3 C,a
4 D
# df2
name ID
0 A 1
1 B 2
2 C 3
3 D 4
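For completeness, a sketch reconstructing the used input shown above:
import pandas as pd

df1 = pd.DataFrame({'name': ['A', 'b', 'A,B', 'C,a', 'D']})
df2 = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'ID': [1, 2, 3, 4]})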

How to check pair of string values in a column, after grouping the dataframe using ID column?

I have a dataframe containing two columns, ID and Code; the Flag column shown below is the desired output:
ID Code Flag
1 A 0
1 C 1
1 B 1
2 A 0
2 B 1
3 A 0
4 C 0
Within each ID, if Code 'A' exists together with 'B' or 'C', then those 'B'/'C' rows should be flagged 1.
I tried groupby('ID') with filter(), but it did not give the correct result. Could anyone please help?
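A minimal sketch reconstructing the input (the Flag column above is the expected output):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 4],
                   'Code': ['A', 'C', 'B', 'A', 'B', 'A', 'C']})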
You can do the following:
First use groupby('ID') and concatenate the codes using 'sum' to create a helper column. Then assign the value 1 where a row's Code is B or C and the helper column contains an A:
# concatenate all codes per ID into a helper column
df['s'] = df.groupby('ID').Code.transform('sum')
df['Flag'] = 0
# flag B/C rows whose group also contains an A
df.loc[((df.Code == 'B') | (df.Code == 'C')) & df.s.str.contains('A'), 'Flag'] = 1
df = df.drop(columns='s')
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0
You can use boolean masks: a direct one for B/C and a per-group one for A, then combine them and convert to integer:
# is the Code a B or C?
m1 = df['Code'].isin(['B', 'C'])
# is there also an 'A' in the same group?
m2 = df['Code'].eq('A').groupby(df['ID']).transform('any')
# if both are True, flag 1
df['Flag'] = (m1&m2).astype(int)
Output:
ID Code Flag
0 1 A 0
1 1 C 1
2 1 B 1
3 2 A 0
4 2 B 1
5 3 A 0
6 4 C 0

How to transform dataframe column containing list of values in to its own individual column with count of occurrence?

I have a frame like this
presence_data = pd.DataFrame({
    "id": ["id1", "id2"],
    "presence": [
        ["A", "B", "C", "A"],
        ["G", "A", "B", "I", "B"],
    ]
})
id   presence
id1  [A, B, C, A]
id2  [G, A, B, I, B]
I want to transform the above into something like this:
id   A  B  C  G  I
id1  2  1  1  0  0
id2  1  2  0  1  1
Currently, I have an approach where I iterate over the rows and over the values in the presence column, then create/update new columns with counts based on the values encountered. I want to see if there is a better way.
Edited based on feedback from Henry Ecker in the comments; might as well have the better answer here:
You can use DataFrame.explode() to turn every list element into its own row, and then pd.crosstab() to count the occurrences.
df = presence_data.explode('presence')
pd.crosstab(index=df['id'],columns=df['presence'])
This gave me the following:
presence A B C G I
id
id1 2 1 1 0 0
id2 1 2 0 1 1
from collections import Counter

(presence_data
 .set_index('id')
 .presence
 .map(Counter)
 .apply(pd.Series)
 .fillna(0, downcast='infer')
 .reset_index()
)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1
Speed-wise it is hard to say: it is usually more efficient to work with Python-native data structures within Python, yet this solution involves a lot of method calls, which are themselves relatively expensive.
Alternatively, you can create a new dataframe (and reduce the number of method calls):
(pd.DataFrame(map(Counter, presence_data.presence),
              index=presence_data.id)
 .fillna(0, downcast='infer')
 .reset_index()
)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1
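Since the relative speed is hard to predict, here is a minimal benchmarking sketch with timeit (the variant functions are illustrative wrappers of the answers above):
import timeit
from collections import Counter

import pandas as pd

presence_data = pd.DataFrame({
    "id": ["id1", "id2"],
    "presence": [["A", "B", "C", "A"], ["G", "A", "B", "I", "B"]],
})

def crosstab_variant():
    # explode the lists to rows, then count occurrences per id
    e = presence_data.explode('presence')
    return pd.crosstab(index=e['id'], columns=e['presence'])

def counter_variant():
    # build the frame directly from per-row Counters
    return (pd.DataFrame(map(Counter, presence_data.presence),
                         index=presence_data.id)
            .fillna(0)
            .reset_index())

for fn in (crosstab_variant, counter_variant):
    print(fn.__name__, timeit.timeit(fn, number=100))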
You can use apply and value_counts. First we use the lists in your presence column to create new columns. We can then use axis=1 to get the row value counts.
df = (pd.DataFrame(presence_data['presence'].tolist(), index=presence_data['id'])
        .apply(pd.Series.value_counts, axis=1)
        .fillna(0)
        .astype(int))
print(df)
A B C G I
id
id1 2 1 1 0 0
id2 1 2 0 1 1
You can run this afterwards if you want id as a column rather than the index.
df.reset_index(inplace=True)
print(df)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1

Pandas merge conflict rows by counts?

A conflict row is when two rows have the same feature but different labels, like this:
feature label
a 1
a 0
Now, I want to merge these conflict rows into a single label chosen by their counts. If I have more 'a 1' rows, then a will be labeled 1. Otherwise, a should be labeled 0.
I can find these conflicts with df1 = df.groupby('feature', as_index=False).nunique(); df1 = df1[df1['label'] == 2], and their value counts with df2 = df.groupby('feature')['label'].value_counts().reset_index(name='counts').
But how can I find these conflict rows and their counts in one DataFrame (df_conflict = ?), and then merge them by counts (df_merged = merge(df))?
Let's take df = pd.DataFrame({"feature": ['a','a','b','b','a','c','c','d'], 'label': [1,0,0,1,1,0,0,1]}) as an example.
feature label
0 a 1
1 a 0
2 b 0
3 b 1
4 a 1
5 c 0
6 c 0
7 d 1
df_conflict should be :
feature label counts
a 1 2
a 0 1
b 0 1
b 1 1
And df_merged will be:
feature label
a 1
b 0
c 0
d 1
I think you need to first filter for groups with more than one unique value, using DataFrameGroupBy.nunique with GroupBy.transform, before SeriesGroupBy.value_counts:
df1 = df[df.groupby('feature')['label'].transform('nunique').gt(1)]
df_conflict = df1.groupby('feature')['label'].value_counts().reset_index(name='count')
print (df_conflict)
feature label count
0 a 1 2
1 a 0 1
2 b 0 1
3 b 1 1
For the second, get each feature's label by maximum occurrences:
df_merged = df.groupby('feature')['label'].agg(lambda x: x.value_counts().index[0]).reset_index()
print (df_merged)
feature label
0 a 1
1 b 0
2 c 0
3 d 1
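An equivalent sketch using Series.mode; note that tie-breaking is arbitrary in both approaches (mode()[0] picks the smallest label on an exact tie):
df_merged = (df.groupby('feature')['label']
               .agg(lambda x: x.mode()[0])
               .reset_index())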

Extract rows with maximum values in pandas dataframe

We can use .idxmax to get the index of the maximum value of a dataframe (df). My problem is that I have a df with several columns (more than 10), and one of the columns has identifiers with repeated values. I need to extract, for each identifier, the maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to get it by using df.groupby(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!
Close! Group by the id, then use the value column and return the max for each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
If the OP wants to broadcast these values back onto the frame, just create a transform and assign.
In [17]: df['max'] = df.groupby('id')['value'].transform(lambda x: x.max())
In [18]: df
Out[18]:
id value max
0 a 0 0
1 b 1 1
2 b 1 1
3 c 0 2
4 c 2 2
5 c 1 2
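If the goal is the deduplicated two-column frame from the question, a sketch with as_index=False gives it directly:
out = df.groupby('id', as_index=False)['value'].max()
With the example data, out has one row per id: a 0, b 1, c 2.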