Drop row based on condition - pandas

I have this DataFrame:
pd.DataFrame(
{'name': ['Adam', 'Adam', 'Adam', 'Bill', 'Bill', 'Charlie', 'Charlie', 'Charlie', 'Charlie'],
'message': ['start', 'stuck', 'finish', 'start', 'stuck', 'start', 'stuck', 'finish', 'finish']}
)
and I want to drop the rows with message "stuck" for every name that also has a "finish" message:
pd.DataFrame(
{'name': ['Adam', 'Adam', 'Bill', 'Bill', 'Charlie', 'Charlie', 'Charlie'],
'message': ['start', 'finish', 'start', 'stuck', 'start', 'finish', 'finish']}
)
Bill never "finished", so his "stuck" row remains.

To find out whether each student has finished, group by student and use any; since we want the result back in the original shape of the DataFrame, we use groupby.transform:
>>> sf = df['message'].eq('finish').groupby(df['name']).transform('any')
>>> sf
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
8 True
Name: message, dtype: bool
From there it’s easy to remove messages that are stuck from students that have not finished yet:
>>> df[~sf | df['message'].ne('stuck')]
name message
0 Adam start
2 Adam finish
3 Bill start
4 Bill stuck
5 Charlie start
7 Charlie finish
8 Charlie finish
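Putting the two steps together, here is a self-contained version of this approach using the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Adam', 'Adam', 'Adam', 'Bill', 'Bill',
             'Charlie', 'Charlie', 'Charlie', 'Charlie'],
    'message': ['start', 'stuck', 'finish', 'start', 'stuck',
                'start', 'stuck', 'finish', 'finish'],
})

# True for every row whose name has at least one "finish" message
finished = df['message'].eq('finish').groupby(df['name']).transform('any')

# Keep a row unless its name has finished AND the message is "stuck"
result = df[~finished | df['message'].ne('stuck')]
```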

This will work:
df[~((df.name.isin(df[df.message=="finish"]['name'])) & (df.message=='stuck'))]
Output:
name message
0 Adam start
2 Adam finish
3 Bill start
4 Bill stuck
5 Charlie start
7 Charlie finish
8 Charlie finish
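The same logic as a self-contained sketch, with the intermediate Series given a name for readability:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Adam', 'Adam', 'Adam', 'Bill', 'Bill',
             'Charlie', 'Charlie', 'Charlie', 'Charlie'],
    'message': ['start', 'stuck', 'finish', 'start', 'stuck',
                'start', 'stuck', 'finish', 'finish'],
})

# Names that have at least one "finish" row
finished_names = df.loc[df['message'] == 'finish', 'name']

# Drop rows that are "stuck" AND whose name appears among the finishers
result = df[~(df['name'].isin(finished_names) & (df['message'] == 'stuck'))]
```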


Inputting first and last name to output a value in Pandas Dataframe

I am trying to create an input function that returns a value for the corresponding first and last name.
For this example, I'd like to be able to enter "Emily" and "Bell" and return "attempts: 3".
Here's my code so far:
import pandas as pd
import numpy as np
data = {
'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'lastname': ['Thompson','Wu', 'Downs','Hunter','Bell','Cisneros', 'Becker', 'Sims', 'Gallegos', 'Horne'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no',
'yes', 'yes', 'no', 'no', 'yes']
}
data
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
fname = input()
lname = input()
print(f"{fname} {lname}'s number of attempts: {???}")
I thought there would be specific documentation for this, but I can't find any in the pandas DataFrame documentation. I am assuming it's pretty simple, but I can't find it.
fname = input()
lname = input()
# use loc to filter the row and then capture the value from attempts columns
print(f"{fname} {lname}'s number of attempts:{df.loc[df['name'].eq(fname) & df['lastname'].eq(lname)]['attempts'].squeeze()}")
Emily
Bell
Emily Bell's number of attempts:2
Alternatively, to avoid mismatches due to case:
fname = input().lower()
lname = input().lower()
print(f"{fname} {lname}'s number of attempts:{df.loc[(df['name'].str.lower() == fname) & (df['lastname'].str.lower() == lname)]['attempts'].squeeze()}")
emily
BELL
emily bell's number of attempts:2
Try this:
df[(df['name'] == fname) & (df['lastname'] == lname)]['attempts'].squeeze()
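One caveat the answers above skip: .squeeze() only returns a scalar when exactly one row matches; with no match it returns an empty Series. A sketch that guards against that (the helper lookup_attempts is mine, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anastasia', 'Dima', 'Emily'],
    'lastname': ['Thompson', 'Wu', 'Bell'],
    'attempts': [1, 3, 2],
})

def lookup_attempts(df, fname, lname):
    """Return the attempts value for a (first, last) name, or None if absent."""
    match = df.loc[df['name'].eq(fname) & df['lastname'].eq(lname), 'attempts']
    # squeeze() on a one-element Series gives a scalar; guard the other cases
    return match.squeeze() if len(match) == 1 else None

result = lookup_attempts(df, 'Emily', 'Bell')
```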

Assign Random Number between two value conditionally

I have a dataframe:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44']})
I wish to replace the values in the AGE column with integers between two values. So, for example, I wish to replace each '35-44' value with a random integer between 35 and 44.
I tried:
df.loc[df["AGE"]== '35-44', 'AGE'] = random.randint(35, 44)
But it picks the same value for each row. I would like it to randomly pick a different value for each row.
I get:
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['38', '20-34', '38', '38', '45-70', '38']})
But I would like to get something like the following. I don't much care about how the values are distributed as long as they are in the range that I assign
df = pd.DataFrame({
'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
'Line': ['Revenues','EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
'AGE': ['36', '20-34', '39', '38', '45-70', '45']})
The code
random.randint(35, 44)
produces a single random value, making the statement analogous to:
df.loc[df["AGE"] == '35-44', 'AGE'] = 38  # some constant
We need a collection of values that is the same length as the values to fill. We can use np.random.randint instead:
import numpy as np
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = np.random.randint(35, 44, m.sum())
(Series.sum is used to "count" the number of True values in the Series, since True is 1 and False is 0.)
df:
Prod Line AGE
0 abc Revenues 40
1 qrt EBT 20-34
2 xyz Expenses 41
3 xam Revenues 35
4 asc EBT 45-70
5 yat Expenses 36
*Reproducible with np.random.seed(26)
Naturally, using the filter on both sides of the assignment with apply would also work:
import random
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = df.loc[m, 'AGE'].apply(lambda _: random.randint(35, 44))
df:
Prod Line AGE
0 abc Revenues 36
1 qrt EBT 20-34
2 xyz Expenses 37
3 xam Revenues 43
4 asc EBT 45-70
5 yat Expenses 44
*Reproducible with random.seed(28)
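One subtlety worth noting: np.random.randint's upper bound is exclusive, so np.random.randint(35, 44, ...) draws from 35-43, while random.randint(35, 44) includes 44. With NumPy's newer Generator API you can make the bound inclusive explicitly (a sketch on a reduced frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'AGE': ['35-44', '20-34', '35-44']})
rng = np.random.default_rng()

m = df['AGE'] == '35-44'
# endpoint=True makes the high bound inclusive, matching random.randint(35, 44)
df.loc[m, 'AGE'] = rng.integers(35, 44, size=m.sum(), endpoint=True)
```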

Pandas: How to filter rows in dataframe which is not equal to the combination of columns in other dataframe?

Below are the two dataframes. I am trying to filter out the rows in df_2 whose (Name_1, Name_2) combination appears in df_count. How can I achieve this?
import pandas as pd
df_1 = pd.DataFrame({'Name_1':['tom', 'jack', 'tom', 'jack', 'tom', 'nick', 'tom', 'jack', 'tom', 'jack'],
'Name_2':['sam', 'sam', 'ruby', 'sam','sam', 'jack', 'ruby', 'sam','ruby', 'sam']})
df_count = df_1.groupby(['Name_1','Name_2']).size().reset_index().rename(columns={0:'count'}).sort_values(['count'], ascending = False)
df_count = df_count.head(2)
df_count = df_count[['Name_1','Name_2']]
df_2 = pd.DataFrame({'Name_1':['tom', 'nick', 'tom', 'jack', 'tom', 'nick', 'tom', 'jack'],
'Name_2':['sam', 'mike', 'ruby', 'sam', 'sam', 'jack', 'ruby', 'sam'],
'Salary':[200, 500, 1000, 7000, 100, 300, 1200, 900],
'Currency':['AUD', 'CAD', 'JPY', 'USD', 'GBP', 'CAD', 'INR', 'USD']})
pd.merge(df_2,df_count, indicator=True, how='outer').query('_merge=="left_only"').drop('_merge', axis=1)
Output:
Name_1 Name_2 Salary Currency
0 tom sam 200 AUD
1 tom sam 100 GBP
2 nick mike 500 CAD
7 nick jack 300 CAD
Answer taken from here.
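An equivalent anti-join without the indicator column: build the (Name_1, Name_2) pairs and test membership with a MultiIndex. A sketch; df_count here is hard-coded to the two most frequent pairs from the question's data:

```python
import pandas as pd

df_2 = pd.DataFrame({'Name_1': ['tom', 'nick', 'tom', 'jack', 'tom', 'nick', 'tom', 'jack'],
                     'Name_2': ['sam', 'mike', 'ruby', 'sam', 'sam', 'jack', 'ruby', 'sam'],
                     'Salary': [200, 500, 1000, 7000, 100, 300, 1200, 900]})
df_count = pd.DataFrame({'Name_1': ['jack', 'tom'], 'Name_2': ['sam', 'ruby']})

# Pairs to exclude, as plain tuples
pairs = list(df_count[['Name_1', 'Name_2']].itertuples(index=False, name=None))

# Keep rows whose (Name_1, Name_2) pair is NOT in df_count
mask = pd.MultiIndex.from_frame(df_2[['Name_1', 'Name_2']]).isin(pairs)
result = df_2[~mask]
```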

Numbering rows in pandas dataframe

I have a DataFrame that looks like:
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'number': [0, 29, 0, 52, 0, 0]})
And I am working on a solution to number the 0 values in the number column.
My code, which is not working right:
for i, row in df.iterrows():
    df.loc[df['number'] == 0, 'number'] = i + 1
df
This code replaces every 0 with 1, but it must replace the first 0 with 1, the second 0 with 2, etc.
I would like a solution based on an iteration method.
Note: the numbers 29, 52, etc. must not be changed.
Try np.where on a Boolean mask of the 0 values in df, then replace them with the cumsum of the mask to enumerate:
m = df['number'].eq(0)
df['number'] = np.where(m, m.cumsum(), df['number'])
Or use Series.mask
m = df['number'].eq(0)
df['number'] = df['number'].mask(m, m.cumsum())
df:
first_name last_name number
0 John Smith 1
1 Jane Doe 29
2 Marry Jackson 2
3 Victoria Smith 52
4 Gabriel Brown 3
5 Layla Martinez 4
m.cumsum():
0 1
1 1
2 2
3 2
4 3
5 4
Name: number, dtype: int32
Complete Working Example:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
'number': [0, 29, 0, 52, 0, 0]
})
m = df['number'].eq(0)
df['number'] = np.where(m, m.cumsum(), df['number'])
print(df)
Try via boolean masking and the loc accessor:
mask = df['number'] == 0  # boolean mask
df.loc[mask, 'number'] = mask.cumsum()
OR
via the where() method:
df['number'] = df.where(~mask, mask.cumsum(), axis=0)['number']
OR
via boolean masking and the assign() method:
df[mask] = df[mask].assign(number=mask.cumsum())
Output of df:
first_name last_name number
0 John Smith 1
1 Jane Doe 29
2 Marry Jackson 2
3 Victoria Smith 52
4 Gabriel Brown 3
5 Layla Martinez 4
Alternative via replace and fillna (note np.nan, lowercase):
df.number = df.number.replace(0, np.nan).fillna(df.number.eq(0).cumsum())
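A quick self-contained check of that replace/fillna one-liner. It works because the right-hand side is evaluated before the assignment, so df.number.eq(0) still sees the original zeros; the result becomes float because of the intermediate NaNs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'number': [0, 29, 0, 52, 0, 0]})

# Zeros -> NaN, then fill each NaN with the running count of zeros so far
df.number = df.number.replace(0, np.nan).fillna(df.number.eq(0).cumsum())
```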

Fill column in df.A based on comparison values in df.A and df.B

So I have this code:
import pandas as pd
import numpy as np
frame1 = {'Season': ['S19', 'S20', 'S21',
'S19', 'S20', 'S21',
'S19', 'S20', 'S21'],
'DateFrom': ['2019-01-01', '2020-01-01', '2021-01-01',
'2019-01-01', '2020-01-01', '2021-01-01',
'2019-01-01', '2020-01-01', '2021-01-01'],
'DateTo': ['2019-12-30', '2020-12-30', '2021-12-30',
'2019-12-30', '2020-12-30', '2021-12-30',
'2019-12-30', '2020-12-30', '2021-12-30'],
'Currency': ['EUR', 'EUR', 'EUR',
'USD', 'USD', 'USD',
'MAD', 'MAD', 'MAD'],
'Rate': [1, 2, 3, 4, 5, 6, 7, 8, 9]
}
df1 = pd.DataFrame(data=frame1)
frame2 = {'Room': ['Double', 'Single', 'SeaView'],
'Season': ['S20', 'S20', 'S19'],
'DateFrom': ['2020-05-01', '2020-07-05', '2019-03-25'],
'Currency': ['EUR', 'MAD', 'USD'],
'Rate': [0, 0, 0]
}
df2 = pd.DataFrame(data=frame2)
df1[['DateFrom', 'DateTo']] = df1[['DateFrom', 'DateTo']].apply(pd.to_datetime)
df2[['DateFrom']] = df2[['DateFrom']].apply(pd.to_datetime)
print(df1.dtypes)
print(df2.dtypes)
df2['Rate'] = np.where((
df2['Season'] == df1['Season'] &
df2['Currency'] == df1['Currency'] &
(df2['DateFrom'] > df1['DateFrom'] & df2['DateFrom'] < df1['DateTo'])
), df1['Rates'], 'MissingData')
print(df2)
What I am trying to achieve is to fill Rate values in df2 with Rate values from df1 based on conditions where:
df2.Season == df1.Season &
df2.Currency == df1.Currency &
df2.DateFrom must be between df1.DateFrom and df1.DateTo
So my result in 'Rates' should be 2,8,4
I was hoping the code above would work, but it doesn't; I am getting this error:
"TypeError: unsupported operand type(s) for &: 'str' and 'str'"
Any help making it work would be appreciated.
You can first merge then compare:
out = df1.merge(df2[['Season','Currency','DateFrom']],on=['Season','Currency'],
suffixes=('','_y'))
out = (out[out['DateFrom_y'].between(out['DateFrom'],out['DateTo'])]
.reindex(columns=df1.columns).copy())
print(out)
Season DateFrom DateTo Currency Rate
0 S20 2020-01-01 2020-12-30 EUR 2
1 S19 2019-01-01 2019-12-30 USD 4
2 S20 2020-01-01 2020-12-30 MAD 8
EDIT per comments:
out = df1.merge(df2,on=['Season','Currency'],suffixes=('','_y'))
out = (out[out['DateFrom_y'].between(out['DateFrom'],out['DateTo'])]
.reindex(columns=df2.columns).copy())
Room Season DateFrom Currency Rate
0 Double S20 2020-01-01 EUR 2
1 SeaView S19 2019-01-01 USD 4
2 Single S20 2020-01-01 MAD 8
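The question also wanted missing combinations flagged as 'MissingData'. A left merge keeps every df2 row, so the gaps can be filled afterwards; this sketch uses reduced frames and assumes df1 has at most one row per (Season, Currency) pair:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Season': ['S19', 'S20'],
    'DateFrom': pd.to_datetime(['2019-01-01', '2020-01-01']),
    'DateTo': pd.to_datetime(['2019-12-30', '2020-12-30']),
    'Currency': ['USD', 'EUR'],
    'Rate': [4, 2],
})
df2 = pd.DataFrame({
    'Room': ['Double', 'Attic'],
    'Season': ['S20', 'S21'],
    'DateFrom': pd.to_datetime(['2020-05-01', '2021-02-01']),
    'Currency': ['CHF', 'CHF'],
})
df2.loc[0, 'Currency'] = 'EUR'  # Double is priced in EUR

# Left merge keeps unmatched df2 rows (their df1 columns become NaN/NaT)
out = df2.merge(df1, on=['Season', 'Currency'], how='left', suffixes=('', '_y'))

# Invalidate rates whose booking date falls outside the season window;
# NaT comparisons are False, so unmatched rows are invalid automatically
valid = out['DateFrom'].between(out['DateFrom_y'], out['DateTo'])
out['Rate'] = out['Rate'].where(valid, 'MissingData')
out = out[df2.columns.tolist() + ['Rate']]
```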