We can use .idxmax to get the index of the maximum value in a DataFrame (df). My problem is that I have a df with several columns (more than 10), and one of the columns contains identifiers that repeat. For each identifier, I need to extract the row with the maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to do this with df.groupby(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!
Close! Group by id, then select the value column and return the max of each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
If the OP wants to put these maxima back into the frame, just use transform and assign the result.
In [17]: df['max'] = df.groupby('id')['value'].transform(lambda x: x.max())
In [18]: df
Out[18]:
id value max
0 a 0 0
1 b 1 1
2 b 1 1
3 c 0 2
4 c 2 2
5 c 1 2
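If whole rows are wanted rather than just the maximum per id, the idxmax idea from the question can be combined with groupby. The following is a minimal sketch, assuming the six-row frame from the question and a unique index; on ties it keeps the first matching row.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"id": ["a", "b", "b", "c", "c", "c"],
                   "value": [0, 1, 1, 0, 2, 1]})

# For each id, idxmax returns the index label of the first row holding the maximum
idx = df.groupby("id")["value"].idxmax()

# Select those rows; all other columns of the frame come along
result = df.loc[idx].reset_index(drop=True)
print(result)
```

Because the selection goes through df.loc, this also works when the frame has many more columns than the two shown here.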
Related
I have a table like this, where the type (A, B, C) is represented in boolean form:
ID A B C
One 1 0 0
Two 0 0 1
Three 0 1 0
I want a table like this:
ID Type
One A
Two C
Three B
You can melt and select the rows with 1 with loc while using pop to remove the intermediate values:
out = df.melt('ID', var_name='Type').loc[lambda d: d.pop('value').eq(1)]
output:
ID Type
0 One A
5 Three B
7 Two C
You can do:
x, y = np.where(df.iloc[:, 1:])
# y indexes the sliced columns, so look the names up in df.columns[1:]
out = pd.DataFrame({'ID': df.loc[x, 'ID'], 'Type': df.columns[1:][y]})
Output:
ID Type
0 One A
1 Two C
2 Three B
You can also use the pd.from_dummies constructor here, which was added in pandas version 1.5.
Note that this preserves the original order of your ID column.
df['Type'] = pd.from_dummies(df.loc[:, 'A':'C'])
print(df)
ID A B C Type
0 One 1 0 0 A
1 Two 0 0 1 C
2 Three 0 1 0 B
print(df[['ID', 'Type']])
ID Type
0 One A
1 Two C
2 Three B
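On pandas versions older than 1.5, where pd.from_dummies is not available, a similar result can be sketched with idxmax along the columns. This assumes every row has exactly one 1 among A, B and C; the frame below is rebuilt from the table in the question.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({"ID": ["One", "Two", "Three"],
                   "A": [1, 0, 0],
                   "B": [0, 0, 1],
                   "C": [0, 1, 0]})

# For each row, idxmax(axis=1) returns the label of the first column
# holding the row maximum, i.e. the column containing the 1
out = df[["ID"]].assign(Type=df[["A", "B", "C"]].idxmax(axis=1))
print(out)
```

A row with no 1 at all would silently be labelled A with this approach, so it is only a reasonable choice when the one-hot assumption holds.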
Conflict rows are rows that have the same feature but different labels, like this:
feature label
a 1
a 0
Now, I want to merge these conflict rows into a single label chosen by count. If a appears more often with label 1, then a will be labeled 1. Otherwise, a should be labeled 0.
I can find these conflicts with df1 = df.groupby('feature', as_index=False).nunique(); df1 = df1[df1['label'] == 2], and their value counts with df2 = df.groupby("feature")["label"].value_counts().reset_index(name="counts").
But how do I find these conflict rows and their counts in one DataFrame (df_conflict = ?), and then merge them by counts (df_merged = merge(df))?
Let's take df = pd.DataFrame({"feature":['a','a','b','b','a','c','c','d'],'label':[1,0,0,1,1,0,0,1]}) as an example.
feature label
0 a 1
1 a 0
2 b 0
3 b 1
4 a 1
5 c 0
6 c 0
7 d 1
df_conflict should be :
feature label counts
a 1 2
a 0 1
b 0 1
b 1 1
And df_merged will be:
feature label
a 1
b 0
c 0
d 1
I think you first need to keep only the groups with more than one unique label, using SeriesGroupBy.nunique with GroupBy.transform, and then apply SeriesGroupBy.value_counts:
df1 = df[df.groupby('feature')['label'].transform('nunique').gt(1)]
df_conflict = df1.groupby('feature')['label'].value_counts().reset_index(name='count')
print (df_conflict)
feature label count
0 a 1 2
1 a 0 1
2 b 0 1
3 b 1 1
Second, get the most frequent label for each feature:
df_merged = df.groupby('feature')['label'].agg(lambda x: x.value_counts().index[0]).reset_index()
print (df_merged)
feature label
0 a 1
1 b 0
2 c 0
3 d 1
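The value_counts approach relies on the order in which tied labels come back. A minimal sketch that makes the tie rule from the question explicit (1 only if it is strictly more frequent, otherwise 0) is shown below; it assumes the labels are strictly 0 or 1, which is not a general-purpose majority vote.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"feature": ["a", "a", "b", "b", "a", "c", "c", "d"],
                   "label": [1, 0, 0, 1, 1, 0, 0, 1]})

# With 0/1 labels the group mean is the share of 1s;
# strictly more than half means label 1, anything else (ties included) means 0
df_merged = (df.groupby("feature")["label"]
               .mean()
               .gt(0.5)
               .astype(int)
               .reset_index())
print(df_merged)
```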
I have a dataframe of the form :
ID | COL
1 A
1 B
1 C
1 D
2 A
2 C
2 D
3 A
3 B
3 C
I also have a list of lists containing sequences, for example seq = [[A,B,C],[A,C,D]].
I am trying to count the number of IDs in the dataframe whose COL values exactly match an entry in seq. I am currently doing it the following way:
df.groupby('ID')['COL'].apply(lambda x: x.reset_index(drop = True).equals(pd.Series(vs))).reset_index()['COL'].count()
iterating over vs, where vs is a list from seq.
Expected Output :-
ID | is_in_seq
1 0
2 1
3 1
Since the sequence in COL for ID 1 is ABCD, which is not a sequence in seq, the value against it is 0.
Questions:-
1.) Is there a vectorized way of doing this operation? The approach I've outlined above takes a lot of time even for a single entry from seq, since there can be up to 30 - 40 values in COL per ID, and maintaining the order in COL is critical.
IIUC:
You will only ever produce a zero or a one. Because you'll be checking if the group as a whole (and there is only one whole) is in seq. If seq is unique (I'm assuming it is) then you'll only ever have the group in seq or not.
First step is to make seq a set of tuples
seq = set(map(tuple, seq))
Second step is to produce an aggregated pandas object that contains tuples
tups = df.groupby('ID')['COL'].agg(tuple)
tups
ID
1 (A, B, C, D)
2 (A, C, D)
3 (A, B, C)
Name: COL, dtype: object
Step three, we can use isin
tups.isin(seq).astype(int).reset_index(name='is_in_seq')
ID is_in_seq
0 1 0
1 2 1
2 3 1
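The three steps above can be put together as a self-contained sketch. It assumes the COL values are strings, rebuilds the question's frame by hand, and tests membership with plain Python `in` via map rather than isin; the name seq_set is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question, with COL values as strings
df = pd.DataFrame({"ID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "COL": list("ABCD") + list("ACD") + list("ABC")})
seq = [["A", "B", "C"], ["A", "C", "D"]]

# Step 1: hashable tuples in a set give constant-time membership tests
seq_set = set(map(tuple, seq))

# Step 2: one tuple per ID, preserving the row order within each group
tups = df.groupby("ID")["COL"].agg(tuple)

# Step 3: flag the IDs whose full, ordered sequence appears in seq
result = (tups.map(lambda t: t in seq_set)
              .astype(int)
              .reset_index(name="is_in_seq"))
print(result)
```

Since tuples compare element by element, an ID with the same values in a different order would not match, which is the order sensitivity the question asks for.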
IIUC, use groupby.sum to get a string with the complete sequence for each ID. Then use map with ''.join and Series.isin to check for matches:
new_df = (df.groupby('ID')['COL']
.sum()
.isin(map(''.join, seq))
# .isin(list(map(''.join, seq)))  # use this instead if a list is necessary
.astype(int)
.reset_index(name = 'is_in_seq')
)
print(new_df)
ID is_in_seq
0 1 0
1 2 1
2 3 1
Detail
df.groupby('ID')['COL'].sum()
ID
1 ABCD
2 ACD
3 ABC
Name: COL, dtype: object
I have a dataframe
A B C
1 2 3
2 3 4
3 8 7
I want to take only the rows where the sequence 3, 4 occurs in column C (in this scenario, the first two rows).
What will be the best way to do so?
You can use rolling for a general solution that works with any pattern:
pat = np.asarray([3,4])
N = len(pat)
mask= (df['C'].rolling(window=N , min_periods=N)
.apply(lambda x: (x==pat).all(), raw=True)
.mask(lambda x: x == 0)
.bfill(limit=N-1)
.fillna(0)
.astype(bool))
df = df[mask]
print (df)
A B C
0 1 2 3
1 2 3 4
Explanation:
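Use rolling.apply to test each window of N values against the pattern.
Replace 0s with NaNs using mask.
Use bfill with limit=N-1 to fill the NaNs in the N-1 rows before each match, so every row of a matched window is flagged.
Replace the remaining NaNs with 0 using fillna.
Finally, cast to bool with astype.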
Use shift
In [1085]: s = df.eq(3).any(axis=1) & df.shift(-1).eq(4).any(axis=1)
In [1086]: df[s | s.shift(fill_value=False)]
Out[1086]:
A B C
0 1 2 3
1 2 3 4
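The shift idea can also be restricted to column C and generalised to a pattern of any length. The following is a rough sketch under those assumptions; the names starts and mask are illustrative and not part of either answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 8], "C": [3, 4, 7]})
pat = [3, 4]

col = df["C"]

# starts[i] is True when the pattern begins at row i of column C
starts = pd.Series(True, index=df.index)
for offset, wanted in enumerate(pat):
    starts &= col.shift(-offset).eq(wanted)

# Flag every row covered by a match, not only its first row
mask = pd.Series(False, index=df.index)
for offset in range(len(pat)):
    mask |= starts.shift(offset, fill_value=False)

result = df[mask]
print(result)
```

Unlike the any(axis=1) version, this only looks at column C, so a 3 in column B followed by a 4 in column A would not count as a match.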
I have a DataFrame like this
DataFrame({"key":["a","b","c","d","e"], "value": [5,4,3,2,1]})
I am mainly interested in rows "a", "b" and "c". I want to merge everything else into an "others" row like this
key value
0 a 5
1 b 4
2 c 3
3 others 3
How can this be done?
First create a dataframe without d and e:
df2 = df[df.key.isin(["a","b","c"])]
Then find the value that you want the "others" row to have (using the sum function in this example):
val = df.loc[~df["key"].isin(["a","b","c"]), "value"].sum()
Finally, append this row to the second df. DataFrame.append was removed in pandas 2.0, so use pd.concat and assign the result back:
df2 = pd.concat([df2, pd.DataFrame([{"key": "others", "value": val}])], ignore_index=True)
df2 is now:
key value
0 a 5
1 b 4
2 c 3
3 others 3
I have found a way to do it. Not sure if it is the best way.
In [3]: key_map = {"a":"a", "b":"b", "c":"c"}
In [4]: data['key1'] = data['key'].map(lambda k: key_map.get(k, "others"))
In [5]: data.groupby("key1").sum()
Out[5]:
value
key1
a 5
b 4
c 3
others 3
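The same grouping can be sketched without a lookup dictionary by using Series.where, which keeps the keys of interest and replaces everything else. This is a minimal sketch on the frame from the question; the names keep and grouped_key are illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"key": ["a", "b", "c", "d", "e"],
                   "value": [5, 4, 3, 2, 1]})
keep = ["a", "b", "c"]

# Keep the key where it is in `keep`; replace every other key with "others"
grouped_key = df["key"].where(df["key"].isin(keep), "others")

# sort=False preserves first-appearance order, so "others" comes last here
result = df.groupby(grouped_key, sort=False)["value"].sum().reset_index()
print(result)
```

This leaves the original frame untouched, whereas the map approach above adds a helper column to it.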