Delete duplicated rows per id according to their values in Pandas - pandas

I have a dataframe like this:
id col1
2 T
2 T
4 R
4 T
6 G
6 G
I want to deduplicate it this way:
If I have T and T for the same id, I want to keep both lines.
If I have G or R and G or R for the same id, I want to keep both lines.
If I have T and (G or R) for the same id, I just want to keep the line with T (delete one of the two lines).
I want this result :)
id col1
2 T
2 T
4 T
6 G
6 G
Thank you :)

Use boolean indexing for filtering:
m1 = df['col1'].eq('T')
m2 = m1.groupby(df['id']).transform('sum').ne(1)
df = df[m1 | m2]
print (df)
id col1
0 2 T
1 2 T
3 4 T
4 6 G
5 6 G
Explanation:
Compare col1 with 'T' using eq (==):
m1 = df['col1'].eq('T')
print (m1)
0 True
1 True
2 False
3 True
4 False
5 False
Name: col1, dtype: bool
Count the True values per group with transform and sum:
print (m1.groupby(df['id']).transform('sum'))
0 2.0
1 2.0
2 1.0
3 1.0
4 0.0
5 0.0
Name: col1, dtype: float64
Compare for not equal to 1 with ne (!=):
m2 = m1.groupby(df['id']).transform('sum').ne(1)
print (m2)
0 True
1 True
2 False
3 False
4 True
5 True
Name: col1, dtype: bool
And chain both masks together with | for bitwise OR:
print (m1 | m2)
0 True
1 True
2 False
3 True
4 True
5 True
Name: col1, dtype: bool
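For reference, here is a minimal end-to-end sketch of the solution; the DataFrame constructor is my reconstruction from the sample data in the question, not part of the original post:

import pandas as pd

# reconstructed from the sample data shown above
df = pd.DataFrame({'id':   [2, 2, 4, 4, 6, 6],
                   'col1': ['T', 'T', 'R', 'T', 'G', 'G']})

m1 = df['col1'].eq('T')                           # rows holding 'T'
m2 = m1.groupby(df['id']).transform('sum').ne(1)  # ids without exactly one 'T'
print (df[m1 | m2])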

Related

Pandas groupby when group criteria repeat

A B C
a b 1
c d 1
e f 2
g h 2
i j 2
K l 1
J K 1
L M 1
I have a dataset that looks something like this. I want to group them based on C. The data is sequential and I want to give unique ids to each group. How can I achieve this?
The classic trick is to flag non-equality between successive rows (True where the value changes), then take a cumulative sum, which turns the Trues into increasing group numbers that are carried forward.
Use shift and ne, then cumsum to form the grouper, and ngroup to get the group ID:
grouper = df['C'].ne(df['C'].shift()).cumsum()
df['group'] = df.groupby(grouper).ngroup()
Or with diff and ne, then cumsum:
grouper = df['C'].diff().ne(0).cumsum()
Output:
A B C group
0 a b 1 0
1 c d 1 0
2 e f 2 1
3 g h 2 1
4 i j 2 1
5 K l 1 2
6 J K 1 2
7 L M 1 2
Intermediates of the logic to construct the grouper:
C non-eq implicit int cumsum
0 1 True 1 1
1 1 False 0 1
2 2 True 1 2
3 2 False 0 2
4 2 False 0 2
5 1 True 1 3
6 1 False 0 3
7 1 False 0 3
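A minimal runnable sketch of the same steps; the DataFrame constructor below is my reconstruction from the sample data:

import pandas as pd

# reconstructed from the sample data shown above
df = pd.DataFrame({'A': list('acegiKJL'),
                   'B': list('bdfhjlKM'),
                   'C': [1, 1, 2, 2, 2, 1, 1, 1]})

# True where C changes from the previous row; cumsum labels each run
grouper = df['C'].ne(df['C'].shift()).cumsum()
df['group'] = df.groupby(grouper).ngroup()
print (df)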

pandas: change all rows with Type X if one Type X row has Result = 1

Here is a simple pandas df:
>>> df
Type Var1 Result
0 A 1 NaN
1 A 2 NaN
2 A 3 NaN
3 B 4 NaN
4 B 5 NaN
5 B 6 NaN
6 C 1 NaN
7 C 2 NaN
8 C 3 NaN
9 D 4 NaN
10 D 5 NaN
11 D 6 NaN
The object of the exercise is: if column Var1 equals 3 in any row, set Result = 1 for all rows of that Type.
This finds the rows with 3 in Var1 and sets Result to 1,
df['Result'] = df['Var1'].apply(lambda x: 1 if x == 3 else 0)
but I can't figure out how to then catch all the same Type and make them 1. In this case it should be all the As and all the Cs. Doesn't have to be a one-liner.
Any tips please?
Create a boolean mask, and to map True/False to 1/0, convert the values to integers:
df['Result'] = df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']).astype(int)
# alternative (assumes import numpy as np)
df['Result'] = np.where(df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']), 1, 0)
print (df)
Type Var1 Result
0 A 1 1
1 A 2 1
2 A 3 1
3 B 4 0
4 B 5 0
5 B 6 0
6 C 1 1
7 C 2 1
8 C 3 1
9 D 4 0
10 D 5 0
11 D 6 0
Details:
Get all Type values that match the condition:
print (df.loc[df['Var1'].eq(3), 'Type'])
2 A
8 C
Name: Type, dtype: object
Test the original column Type against the filtered types:
print (df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']))
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 True
9 False
10 False
11 False
Name: Type, dtype: bool
Or use GroupBy.transform with any to test whether at least one value matches; this solution is slower for larger DataFrames:
df['Result'] = df['Var1'].eq(3).groupby(df['Type']).transform('any').astype(int)
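A self-contained sketch of the isin approach; the DataFrame constructor is my reconstruction from the sample data:

import numpy as np
import pandas as pd

# reconstructed from the sample data shown above
df = pd.DataFrame({'Type': list('AAABBBCCCDDD'),
                   'Var1': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
                   'Result': np.nan})

# Types that have at least one Var1 == 3, broadcast back to every row
df['Result'] = df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']).astype(int)
print (df)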

How to plot values of my columns above a certain threshold?

I've been stuck with this problem for a while. I have a dataset which looks more or less like this:
Students Subject Mark
1 M F 7 4 3 7
2 I 5 6
3 M F I S 2 3 0
4 M 2 2
5 F M I 5 1
6 I M F 6 2 3
7 I M 7
Now, I want to create a barplot using pandas and seaborn showing how many students:
Have 3 or more letters in the column "Subject"
Have at least one 3 in the column "Mark"
Have both things
I tried with:
n_subject = dataset['Subject'].str.count('\w+')
dataset['NumberSubjects']= n_subject
n_over = dataset[dataset.n_subject >= 3.0]
But it does not work and I'm stuck. I'm sure it is a very basic problem but I don't know what to do.
3 or more subjects:
df["Subject"].str.count("\w+") >= 3
Has one or more marks of 3:
df["Mark"].str.count("3") >= 1
Both:
(df["Subject"].str.count("\w+") >= 3) & (df["Mark"].str.count("3") >= 1)
Boolean representation:
Students Subject Mark one two three
0 1 M F 7 4 3 7 False True False
1 2 I 5 6 False False False
2 3 M F I S 2 3 0 True True True
3 4 M 2 2 False False False
4 5 F M I 5 1 True False False
5 6 I M F 6 2 3 True True True
6 7 I M 7 False False False
I am not really sure what the barplot should represent (a summary of Mark?), but here is what you need for filtering purposes. Also, counting characters directly would count the spaces too, which is why I split on whitespace; there are multiple ways of handling this. I am just giving you an idea of what to do and how.
>>> m1 = df.Subject.apply(lambda x: len(x.split()) >= 3)
>>> m2 = df.Mark.str.contains('3')
>>> m3 = m1|m2
>>> df[m1]
Students Subject Mark
2 3 M F I S 2 3 0
4 5 F M I 5 1
5 6 I M F 6 2 3
>>> df[m2]
Students Subject Mark
0 1 M F 7 4 3 7
2 3 M F I S 2 3 0
5 6 I M F 6 2 3
>>> df[m3]
Students Subject Mark
0 1 M F 7 4 3 7
2 3 M F I S 2 3 0
4 5 F M I 5 1
5 6 I M F 6 2 3
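To get from the masks to a barplot, a hedged sketch: count how many students satisfy each condition and plot the counts. The DataFrame constructor and the counts/plot layout are my assumptions, not part of either answer; pandas' .plot.bar() uses matplotlib under the hood, and seaborn's barplot would work the same way on the counts.

import pandas as pd

# reconstructed from the sample data shown above
df = pd.DataFrame({'Students': [1, 2, 3, 4, 5, 6, 7],
                   'Subject':  ['M F', 'I', 'M F I S', 'M', 'F M I', 'I M F', 'I M'],
                   'Mark':     ['7 4 3 7', '5 6', '2 3 0', '2 2', '5 1', '6 2 3', '7']})

m1 = df['Subject'].str.count(r'\w+') >= 3   # 3 or more subjects
m2 = df['Mark'].str.count('3') >= 1         # at least one mark of 3
counts = pd.Series({'3+ subjects': m1.sum(),
                    'has a 3': m2.sum(),
                    'both': (m1 & m2).sum()})
counts.plot.bar()   # or: sns.barplot(x=counts.index, y=counts.values)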

Mask minimum values per group in a `pd.DataFrame`

Given a pd.DataFrame containing different time series in different groups, I want to create a mask over all rows that indicates, per group, at which time points the minimum of value is reached with respect to type 0:
For example, given the pd.DataFrame:
>>> df
group type time value
0 A 0 0 4
1 A 0 1 5
2 A 1 0 6
3 A 1 1 7
4 B 0 0 11
5 B 0 1 10
6 B 1 0 9
7 B 1 1 8
In group A the minimum for type 0 is reached at time point 0. For group B the minimum for type 0 is reached at time point 1. Therefore, the resulting column should look like:
is_min
0 True
1 False
2 True
3 False
4 False
5 True
6 False
7 True
I have created a version that seems very cumbersome, first finding out the minima locations and then constructing the final column:
def get_minima(df):
    type_mask = df.type == 0
    min_value = df[type_mask].value.min()
    value_mask = df.value == min_value
    return df[type_mask & value_mask].time.max()

min_ts = df.groupby('group').apply(get_minima)
df['is_min'] = df.apply(lambda row: min_ts[row.group] == row.time, axis=1)
IIUC, you can try with groupby + apply and min:
df['is_min'] = (df.groupby(['group','type'])['value']
                  .apply(lambda x: x == x.min()))
The same with transform + min to get the per-group minimum and eq to create the desired mask:
df['is_min'] = (df.groupby(['group','type'])['value']
                  .transform('min').eq(df['value']))
Output:
df
group type time value is_min
0 A 0 0 4 True
1 A 0 1 5 False
2 A 1 0 6 True
3 A 1 1 7 False
4 B 0 0 11 False
5 B 0 1 10 True
6 B 1 0 9 False
7 B 1 1 8 True
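A runnable version of the transform approach; the DataFrame constructor is my reconstruction from the sample data:

import pandas as pd

# reconstructed from the sample data shown above
df = pd.DataFrame({'group': list('AAAABBBB'),
                   'type':  [0, 0, 1, 1, 0, 0, 1, 1],
                   'time':  [0, 1, 0, 1, 0, 1, 0, 1],
                   'value': [4, 5, 6, 7, 11, 10, 9, 8]})

# broadcast the per-(group, type) minimum and compare it with each value
df['is_min'] = (df.groupby(['group', 'type'])['value']
                  .transform('min').eq(df['value']))
print (df)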
You can remove the rows with an excluding merge: sort the values, subset to only type == 0, and drop_duplicates to get the group/time pairs you need to exclude. Then merge with an indicator to exclude them.
m = (df.sort_values('value').query('type == 0').drop_duplicates('group')
       .drop(columns=['type', 'value']))
# group time
#0 A 0
#5 B 1
df = (df.merge(m, how='outer', indicator=True).query('_merge == "left_only"')
        .drop(columns='_merge'))
group type time value
2 A 0 1 5
3 A 1 1 7
4 B 0 0 11
5 B 1 0 9
If you separately need the mask and don't want to automatically query to subset the rows, map the indicator:
df = df.merge(m, how='outer', indicator='is_min')
df['is_min'] = df['is_min'].map({'left_only': False, 'both': True})
group type time value is_min
0 A 0 0 4 True
1 A 1 0 6 True
2 A 0 1 5 False
3 A 1 1 7 False
4 B 0 0 11 False
5 B 1 0 9 False
6 B 0 1 10 True
7 B 1 1 8 True
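And a runnable version of the merge approach on the same reconstructed frame (again, the constructor is my assumption):

import pandas as pd

df = pd.DataFrame({'group': list('AAAABBBB'),
                   'type':  [0, 0, 1, 1, 0, 0, 1, 1],
                   'time':  [0, 1, 0, 1, 0, 1, 0, 1],
                   'value': [4, 5, 6, 7, 11, 10, 9, 8]})

# group/time of the type-0 minimum per group
m = (df.sort_values('value').query('type == 0').drop_duplicates('group')
       .drop(columns=['type', 'value']))

df = df.merge(m, how='outer', indicator='is_min')
df['is_min'] = df['is_min'].map({'left_only': False, 'both': True})
print (df)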

Pandas: fetch rows that continuously have a similar value

I have a dataframe like so:
id time status
-- ---- ------
a 1 T
a 2 F
b 1 T
b 2 T
a 3 T
a 4 T
b 3 F
b 4 T
b 5 T
I would like to fetch the ids that continuously have the status 'T' for a certain threshold number of times (say 2 in this case).
Thus the fetched rows would be...
id time status
-- ---- ------
b 1 T
b 2 T
a 3 T
a 4 T
b 4 T
b 5 T
I can think of an iterative solution. What I am looking for is something more pandas/sql like. I think an order by id and then time followed by a group by first by id and then status should work, but I'd like to be sure.
Compare the values to 'T' with Series.eq, build labels for consecutive runs with Series.shift and Series.cumsum, then count the run lengths with Series.value_counts and map them back to the original rows with Series.map. Compare the counts with Series.ge and finally filter with boolean indexing, chaining both masks with bitwise AND:
N = 2
m1 = df['status'].eq('T')
g = df['status'].ne(df['status'].shift()).cumsum()
m2 = g.map(g.value_counts()).ge(N)
df = df[m1 & m2]
print (df)
id time status
2 b 1 T
3 b 2 T
4 a 3 T
5 a 4 T
7 b 4 T
8 b 5 T
Details:
print (df.assign(m1=m1, g=g, counts=g.map(g.value_counts()), m2=m2))
id time status m1 g counts m2
0 a 1 T True 1 1 False
1 a 2 F False 2 1 False
2 b 1 T True 3 4 True
3 b 2 T True 3 4 True
4 a 3 T True 3 4 True
5 a 4 T True 3 4 True
6 b 3 F False 4 1 False
7 b 4 T True 5 2 True
8 b 5 T True 5 2 True
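Putting it together, a minimal runnable sketch; the DataFrame constructor is my reconstruction from the sample data:

import pandas as pd

# reconstructed from the sample data shown above
df = pd.DataFrame({'id':     list('aabbaabbb'),
                   'time':   [1, 2, 1, 2, 3, 4, 3, 4, 5],
                   'status': list('TFTTTTFTT')})

N = 2
m1 = df['status'].eq('T')                           # rows with status 'T'
g = df['status'].ne(df['status'].shift()).cumsum()  # labels for consecutive runs
m2 = g.map(g.value_counts()).ge(N)                  # runs of length >= N
print (df[m1 & m2])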