How to transform a dataframe column containing lists of values into individual columns with counts of occurrences? - pandas

I have a dataframe like this:
import pandas as pd
presence_data = pd.DataFrame({
    "id": ["id1", "id2"],
    "presence": [
        ["A", "B", "C", "A"],
        ["G", "A", "B", "I", "B"],
    ],
})
id   presence
id1  [A, B, C, A]
id2  [G, A, B, I, B]
I want to transform the above into something like this:
id   A  B  C  G  I
id1  2  1  1  0  0
id2  1  2  0  1  1
Currently, I have an approach where I iterate over the rows, iterate over the values in the presence column, and then create/update new columns with counts based on the values encountered. I want to see if there is a better way.
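For reference, a minimal sketch of that row-by-row approach (assuming the presence_data frame above) might look like this:
# Sketch of the current iterative approach: loop over rows, then over list values.
counts = pd.DataFrame({"id": presence_data["id"]})
for row_idx, values in presence_data["presence"].items():
    for value in values:
        if value not in counts.columns:
            counts[value] = 0            # create the column on first sight
        counts.loc[row_idx, value] += 1  # bump the count for this row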

Edited based on feedback from Henry Ecker in the comments; might as well have the better answer here:
You can use DataFrame.explode() to turn everything within the lists into separate rows, and then use pd.crosstab() to count the occurrences.
df = presence_data.explode('presence')
pd.crosstab(index=df['id'],columns=df['presence'])
This gave me the following:
presence A B C G I
id
id1 2 1 1 0 0
id2 1 2 0 1 1
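As a small variant (a sketch on the same exploded frame), value_counts plus unstack produces the same table:
df.groupby('id')['presence'].value_counts().unstack(fill_value=0)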

from collections import Counter
(presence_data
 .set_index('id')
 .presence
 .map(Counter)
 .apply(pd.Series)
 .fillna(0, downcast='infer')
 .reset_index()
)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1
Speed-wise it is hard to say; it is usually more efficient to deal with Python-native data structures within Python, yet this solution has a lot of method calls, which are themselves relatively expensive (a quick timeit sketch follows below).
Alternatively, you can create a new dataframe (and reduce the number of method calls):
(pd.DataFrame(map(Counter, presence_data.presence),
              index=presence_data.id)
 .fillna(0, downcast='infer')
 .reset_index()
)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1
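If speed matters, a quick timeit sketch can compare the two approaches (assuming the presence_data frame from the question; results on such a tiny example will not generalise, so run it on a representative slice of the real data):
import timeit
from collections import Counter
import pandas as pd
def explode_crosstab():
    tmp = presence_data.explode('presence')
    return pd.crosstab(index=tmp['id'], columns=tmp['presence'])
def counter_frame():
    return (pd.DataFrame(map(Counter, presence_data.presence),
                         index=presence_data.id)
            .fillna(0, downcast='infer')
            .reset_index())
print(timeit.timeit(explode_crosstab, number=1000))
print(timeit.timeit(counter_frame, number=1000))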

You can use apply and value_counts. First we use the lists in your presence column to create new columns. We can then use axis=1 to get the row value counts.
df = (pd.DataFrame(presence_data['presence'].tolist(), index=presence_data['id'])
      .apply(pd.Series.value_counts, axis=1)
      .fillna(0)
      .astype(int))
print(df)
A B C G I
id
id1 2 1 1 0 0
id2 1 2 0 1 1
You can use this afterwards if you want to have id as a column rather than the index.
df.reset_index(inplace=True)
print(df)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1

Related

How to unpivot table from boolean form

I have a table like this, where the type (A, B, C) is represented in boolean form:
ID     A  B  C
One    1  0  0
Two    0  0  1
Three  0  1  0
I want the table to look like this:
ID     Type
One    A
Two    C
Three  B
You can melt, then select the rows equal to 1 with loc, using pop to remove the intermediate value column:
out = df.melt('ID', var_name='Type').loc[lambda d: d.pop('value').eq(1)]
output:
ID Type
0 One A
5 Three B
7 Two C
You can do:
import numpy as np
x, y = np.where(df.iloc[:, 1:])
out = pd.DataFrame({'ID': df.loc[x, 'ID'], 'Type': df.columns[1:][y]})
Output:
ID Type
0 One A
1 Two C
2 Three B
You can also use the pd.from_dummies constructor, which was added in pandas 1.5.
Note that this preserves the original order of your ID column.
df['Type'] = pd.from_dummies(df.loc[:, 'A':'C'])
print(df)
ID A B C Type
0 One 1 0 0 A
1 Two 0 0 1 C
2 Three 0 1 0 B
print(df[['ID', 'Type']])
ID Type
0 One A
1 Two C
2 Three B

Pandas merge conflict rows by counts?

A conflict occurs when two rows have the same feature but different labels, like this:
feature label
a 1
a 0
Now, I want to merge these conflicting rows into a single label based on their counts. If there are more a 1 rows, then a will be labeled 1; otherwise, a should be labeled 0.
I can find these conflicts with df1 = df.groupby('feature', as_index=False).nunique(); df1 = df1[df1['label'] == 2], and their value counts with df2 = df.groupby("feature")["label"].value_counts().reset_index(name="counts").
But how do I find these conflict rows and their counts in one DataFrame (df_conflict = ?), and then merge them by counts (df_merged = merge(df))?
Lets take df = pd.DataFrame({"feature":['a','a','b','b','a','c','c','d'],'label':[1,0,0,1,1,0,0,1]}) as example.
feature label
0 a 1
1 a 0
2 b 0
3 b 1
4 a 1
5 c 0
6 c 0
7 d 1
df_conflict should be :
feature label counts
a 1 2
a 0 1
b 0 1
b 1 1
And df_merged will be:
feature label
a 1
b 0
c 0
d 1
I think you need to first filter for groups with more than one unique value, using GroupBy.transform with nunique, before SeriesGroupBy.value_counts:
df1 = df[df.groupby('feature')['label'].transform('nunique').gt(1)]
df_conflict = df1.groupby('feature')['label'].value_counts().reset_index(name='count')
print (df_conflict)
feature label count
0 a 1 2
1 a 0 1
2 b 0 1
3 b 1 1
For the second part, get each feature's label by maximum occurrences:
df_merged = df.groupby('feature')['label'].agg(lambda x: x.value_counts().index[0]).reset_index()
print (df_merged)
feature label
0 a 1
1 b 0
2 c 0
3 d 1

pandas series exact match with a list of lists

I have a dataframe of the form:
ID | COL
1 A
1 B
1 C
1 D
2 A
2 C
2 D
3 A
3 B
3 C
I also have a list of lists containing sequences, for example seq = [['A','B','C'], ['A','C','D']].
I am trying to count the number of IDs in the dataframe where COL matches exactly an entry in seq. I am currently doing it the following way:
df.groupby('ID')['COL'].apply(lambda x: x.reset_index(drop = True).equals(pd.Series(vs))).reset_index()['COL'].count()
iterating over vs, where vs is a list from seq.
Expected output:
ID | is_in_seq
1 0
2 1
3 1
Since the sequence in COL for ID 1 is ABCD, which is not a sequence in seq, the value against it is 0.
Questions:
1.) Is there a vectorized way of doing this operation? The approach I've outlined above takes a lot of time even for a single entry from seq, given that there can be up to 30-40 values in COL per ID and maintaining the order in COL is critical.
IIUC:
You will only ever produce a zero or a one, because you'll be checking whether the group as a whole (and there is only one whole) is in seq. If seq is unique (I'm assuming it is), then the group is either in seq or it isn't.
The first step is to make seq a set of tuples:
seq = set(map(tuple, seq))
The second step is to produce an aggregated pandas object that contains tuples:
tups = df.groupby('ID')['COL'].agg(tuple)
tups
ID
1 (A, B, C, D)
2 (A, C, D)
3 (A, B, C)
Name: COL, dtype: object
Step three, we can use isin:
tups.isin(seq).astype(int).reset_index(name='is_in_seq')
ID is_in_seq
0 1 0
1 2 1
2 3 1
IIUC, use groupby.sum to get a string with the complete sequence per ID. Then use map with ''.join and Series.isin to check for matches:
new_df = (df.groupby('ID')['COL']
.sum()
.isin(map(''.join, seq))
#.isin(list(map(''.join, seq)))  # if a list is necessary
.astype(int)
.reset_index(name = 'is_in_seq')
)
print(new_df)
ID is_in_seq
0 1 0
1 2 1
2 3 1
Detail
df.groupby('ID')['COL'].sum()
ID
1 ABCD
2 ACD
3 ABC
Name: COL, dtype: object

how to split one column into many columns and count the frequency

Here is the question I have in mind. Given a table:
Id type
0 1 [a,b]
1 2 [c]
2 3 [a,d]
I want to convert it into the form of:
Id a b c d
0 1 1 1 0 0
1 2 0 0 1 0
2 3 1 0 0 1
I need a very efficient way to convert a large table. Any comment is welcome.
====================================
I have received several good answers, and really appreciate your help.
Now a new question comes along: my laptop's memory is insufficient for generating the whole dataframe with pd.get_dummies.
Is there any way to generate a sparse vector row by row and stack them together? (See the sparse sketch after the answers below.)
Try this
>>> df
Id type
0 1 [a, b]
1 2 [c]
2 3 [a, d]
>>> df2 = pd.DataFrame([x for x in df['type'].apply(
... lambda item: dict(map(
... lambda x: (x,1),
... item))
... ).values]).fillna(0)
>>> df2.join(df)
a b c d Id type
0 1 1 0 0 1 [a, b]
1 0 0 1 0 2 [c]
2 1 0 0 1 3 [a, d]
It basically converts the list of lists to a list of dicts and constructs a DataFrame from that:
[ ['a', 'b'], ['c'], ['a', 'd'] ] # list of list
[ {'a':1, 'b':1}, {'c':1}, {'a':1, 'd':1} ] # list of dict
Then make a DataFrame out of this.
Try this:
pd.get_dummies(df.type.apply(lambda x: pd.Series([i for i in x])))
To explain:
df.type.apply(lambda x: pd.Series([i for i in x]))
gets you a column for each index position in your lists. You can then use get_dummies to get the count of each value:
pd.get_dummies(df.type.apply(lambda x: pd.Series([i for i in x])))
outputs:
a c b d
0 1 0 1 0
1 0 1 0 0
2 1 0 0 1
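Regarding the memory concern in the edit: a minimal sketch (assuming scipy is installed and using the same df with 'Id' and list-valued 'type' columns) builds the counts as a scipy sparse matrix row by row and wraps it in a sparse-backed DataFrame, so the dense table is never materialised:
from collections import Counter
from scipy import sparse
import pandas as pd
cols = sorted({v for row in df['type'] for v in row})   # all distinct values
col_idx = {c: i for i, c in enumerate(cols)}
mat = sparse.lil_matrix((len(df), len(cols)), dtype='int64')
for i, values in enumerate(df['type']):
    for v, n in Counter(values).items():
        mat[i, col_idx[v]] = n          # fill only the non-zero cells
out = pd.DataFrame.sparse.from_spmatrix(mat.tocsr(), index=df['Id'], columns=cols)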

Extract rows with maximum values in pandas dataframe

We can use .idxmax to get the index of the maximum value of a dataframe (df). My problem is that I have a df with several columns (more than 10), and one column has identifiers with repeated values. I need to extract, for each identifier, the maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to get it by using df.groupby(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!
Close! Groupby the id, then use the value column; return the max for each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
If the OP wants to put these values back onto the frame, just create a transform and assign:
In [17]: df['max'] = df.groupby('id')['value'].transform(lambda x: x.max())
In [18]: df
Out[18]:
id value max
0 a 0 0
1 b 1 1
2 b 1 1
3 c 0 2
4 c 2 2
5 c 1 2
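If the goal is instead to keep one full original row per id at its maximum value (a sketch, assuming the original df before the max column was added), groupby plus idxmax also works:
# Keep the row holding the maximum value for each id.
out = df.loc[df.groupby('id')['value'].idxmax()]
# -> one row per id: (a, 0) at index 0, (b, 1) at index 1, (c, 2) at index 4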