Matching conditions in columns - pandas

I am trying to match conditions so that if text is present in both columns A and B and a 0 is in column C, the code should return 'new' in column C (overwriting the 0). Example dataframe below:
import pandas as pd
df = pd.DataFrame({"A":['something',None,'filled',None], "B":['test','test','test',None], "C":['rt','0','0','0']})
I have tried the following, however it only seems to accept the last condition, so any '0' entries in column C become 'new' regardless of None values in columns A or B (in this example I only expect 'new' to appear on row 2).
import numpy as np
conditions = [(df['A'] is not None) & (df['B'] is not None) & (df['C'] == '0')]
values = ['new']
df['C'] = np.select(conditions, values, default=df["C"])
Appreciate any help!

You will need to use .isna() and filter for the rows where it is not NaN/None (using ~), as below:
conditions = [~(df['A'].isna()) & ~(df['B'].isna()) & (df['C'] == '0')]
output:
           A     B    C
0  something  test   rt
1       None  test    0
2     filled  test  new
3       None  None    0

Use Series.notna to test for None or NaN values:
conditions = [df['A'].notna() & df['B'].notna() & (df['C'] == '0')]
Or:
conditions = [df[['A','B']].notna().all(axis=1) & (df['C'] == '0')]
values = ['new']
df['C'] = np.select(conditions, values, default=df["C"])
print (df)
           A     B    C
0  something  test   rt
1       None  test    0
2     filled  test  new
3       None  None    0

Use
mask = df[['A', 'B']].notna().all(1) & df['C'].eq('0')
df.loc[mask, 'C'] = 'new'
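As for why the original attempt behaved that way: df['A'] is a Series object, so the identity test df['A'] is not None is a single Python True, and the whole condition collapses to just df['C'] == '0'. A quick check:

print(df['A'] is not None)             # True - the Series object itself is never None
print(True & True & (df['C'] == '0'))  # what the original condition actually evaluates to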

Related

Numpy.select and assign new column in df with condition from values of two columns

df = df.assign[test = np.select[df.trs = 'iw' & df.rp == 'yu'],[1,0],'null']
I want that if df.trs == 'iw' and df.rp == 'yu', the new column should be 0, else 1, only for the rows fulfilling the condition, not for every row.
I tried np.select with a condition array, but I am not getting the desired output.
You don't need numpy.select; a simple boolean operation is sufficient:
df['test'] = (df['trs'].eq('iw') & df['rp'].eq('yu')).astype(int)
If you really want to use numpy, this would require numpy.where:
df['test'] = np.where(df['trs'].eq('iw') & df['rp'].eq('yu'), 1, 0)
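A quick check on a toy frame (the trs/rp column names are taken from the question; the values here are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'trs': ['iw', 'xx', 'iw'], 'rp': ['yu', 'yu', 'zz']})
df['test'] = (df['trs'].eq('iw') & df['rp'].eq('yu')).astype(int)
print(df)
#   trs  rp  test
# 0  iw  yu     1
# 1  xx  yu     0
# 2  iw  zz     0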

to_string(index = False) results in a non-empty string even when the dataframe is empty

I am doing the following in my Python script, and I want to hide the index column when I print the dataframe, so I used .to_string(index = False) and then len() to check whether the result is empty. However, when I call to_string() on an empty dataframe, len() doesn't return zero. If I print procinject1 it says "Empty DataFrame". Any help to fix this would be greatly appreciated.
procinject1 = dfmalfind[dfmalfind["Hexdump"].str.contains("MZ") == True].to_string(index=False)
if len(procinject1) == 0:
    print(Fore.GREEN + "[✓]No MZ header detected in malfind preview output")
else:
    print(Fore.RED + "[!]MZ header detected within malfind preview (Process Injection indicator)")
    print(procinject1)
That's the expected behaviour of a Pandas DataFrame.
In your case, procinject1 stores the string representation of the dataframe, which is non-empty even if the corresponding dataframe is empty.
For example, check the below code snippet, where I create an empty dataframe df and check its string representation:
import pandas as pd

df = pd.DataFrame()
print(df.to_string(index = False))
print(df.to_string(index = True))
For both index = False and index = True cases, the output will be the same, which is given below (and that is the expected behaviour). So your corresponding len() will always return non-zero.
Empty DataFrame
Columns: []
Index: []
But if you use a non-empty dataframe, then the outputs for index = False and index = True cases will be different as given below:
data = [{'A': 10, 'B': 20, 'C':30}, {'A':5, 'B': 10, 'C': 15}]
df = pd.DataFrame(data)
print(df.to_string(index = False))
print(df.to_string(index = True))
Then the outputs for index = False and index = True respectively will be:
 A   B   C
10  20  30
 5  10  15

   A   B   C
0  10  20  30
1   5  10  15
Since pandas handles empty dataframes differently, to solve your problem, you should first check whether your dataframe is empty or not, using pandas.DataFrame.empty.
Then if the dataframe is actually non-empty, you could print the string representation of that dataframe, while keeping index = False to hide the index column.
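A minimal sketch of that fix applied to the original snippet (the dfmalfind and Fore names are assumed from the question):

matches = dfmalfind[dfmalfind["Hexdump"].str.contains("MZ") == True]
if matches.empty:
    print(Fore.GREEN + "[✓]No MZ header detected in malfind preview output")
else:
    print(Fore.RED + "[!]MZ header detected within malfind preview (Process Injection indicator)")
    print(matches.to_string(index=False))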

Python3 to speed up computation over a dataframe

I have a dataframe (df) like the following:
id  date        t_slot  dayofweek  label
1   2021-01-01  2       0          1
1   2021-01-02  3       1          0
2   2021-01-01  4       6          1
.......
The dataframe is very large (6 million rows). t_slot takes values 1 to 6 and dayofweek takes values 0 to 6.
For each row, I want to get the rate at which label is 1 within the past 3 months before that row's date, for the same id, split four ways:
- when t_slot is 1 to 4 and dayofweek is 0 to 4;
- when t_slot is 5 to 6 and dayofweek is 0 to 4;
- when t_slot is 1 to 4 and dayofweek is 5 to 6;
- when t_slot is 5 to 6 and dayofweek is 5 to 6.
I have used a loop to compute the rates, but it is very slow. Do you have a faster way to compute it? My code is copied below:
def get_time_slot_rate(df):
    import numpy as np
    if len(df) == 0:
        return np.nan, np.nan, np.nan, np.nan
    else:
        work = df.loc[df['dayofweek'] < 5]
        weekend = df.loc[df['dayofweek'] >= 5]
        if len(work) == 0:
            work_14, work_56 = np.nan, np.nan
        else:
            work_14 = len(work.loc[(work['time_slot'] < 5) * (work['label'] == 1)]) / len(work)
            work_56 = len(work.loc[(work['time_slot'] > 5) * (work['label'] == 1)]) / len(work)
        if len(weekend) == 0:
            weekend_14, weekend_56 = np.nan, np.nan
        else:
            weekend_14 = len(weekend.loc[(weekend['time_slot'] < 5) * (weekend['label'] == 1)]) / len(weekend)
            weekend_56 = len(weekend.loc[(weekend['time_slot'] > 5) * (weekend['label'] == 1)]) / len(weekend)
    return work_14, work_56, weekend_14, weekend_56
import datetime as d_t

lst_id = list(df['id'])
lst_date = list(df['date'])
lst_t14_work = []
lst_t56_work = []
lst_t14_weekend = []
lst_t56_weekend = []
for i in range(len(lst_id)):
    if i % 100 == 0:
        print(i)
    d_date = lst_date[i]
    dt = d_t.datetime.strptime(d_date, '%Y-%m-%d')
    month_step = relativedelta(months=3)
    pre_date = str(dt - month_step).split(' ')[0]
    df_s = df.loc[(df['easy_id'] == lst_easy[i])
                  & ((df['delivery_date'] >= pre_date)
                     & (df['delivery_date'] < d_date))].reset_index(drop=True)
    work_14_rate, work_56_rate, weekend_14_rate, weekend_56_rate = get_time_slot_rate(df_s)
    lst_t14_work.append(work_14_rate)
    lst_t56_work.append(work_56_rate)
    lst_t14_weekend.append(weekend_14_rate)
    lst_t56_weekend.append(weekend_56_rate)
I could only fix your function and it's completely untested, but here we go:
Import only once by putting the imports at the top of your .py.
try/except blocks are more efficient than if/else statements.
True and False equal 1 and 0 respectively in Python.
Don't multiply boolean selectors; combine them with & and invert them with ~.
Create as few copies as possible.
import numpy as np

def get_time_slot_rate(df):
    # much faster than counting
    if df.empty:
        return np.nan, np.nan, np.nan, np.nan
    # assuming df['label'] is either 0 or 1; cast to bool so .loc selects rows by mask, not by label
    df = df.loc[df['label'].astype(bool)]
    # create boolean selectors to be inverted with '~'
    weekdays = df['dayofweek'] < 5        # dayofweek 0-4
    slot_selector = df['time_slot'] < 5   # t_slot 1-4; ~slot_selector covers 5-6
    # cast to a plain int so dividing by zero raises ZeroDivisionError
    weekday_count = int(weekdays.sum())
    try:
        work_14 = len(df.loc[weekdays & slot_selector]) / weekday_count
        work_56 = len(df.loc[weekdays & ~slot_selector]) / weekday_count
    except ZeroDivisionError:
        work_14 = work_56 = np.nan
    weekend_count = int((~weekdays).sum())
    try:
        weekend_14 = len(df.loc[~weekdays & slot_selector]) / weekend_count
        weekend_56 = len(df.loc[~weekdays & ~slot_selector]) / weekend_count
    except ZeroDivisionError:
        weekend_14 = weekend_56 = np.nan
    return work_14, work_56, weekend_14, weekend_56
The rest of your script doesn't really make sense, see my comments:
for i in range(len(lst_id)):
    if i % 100 == 0:
        print(i)
    d_date = lst_date[i]
    # what is d_t ?
    dt = d_t.datetime.strptime(d_date, '%Y-%m-%d')
    month_step = relativedelta(months=3)
    pre_date = str(dt - month_step).split(' ')[0]
    df_s = df.loc[(df['easy_id'] == lst_easy[i])
                  & (df['delivery_date'] >= pre_date)
                  & (df['delivery_date'] < d_date)].reset_index(drop=True)
    # is it df or df_s ?
    work_14_rate, work_56_rate, weekend_14_rate, weekend_56_rate = get_time_slot_rate(df)
If your date column is a datetime object, then you can compare dates directly (no need for strings).
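A minimal sketch of that last point (the delivery_date column name is taken from the question's code; the cutoff date here is hypothetical):

import pandas as pd

# parse once, then compare Timestamps directly instead of formatted strings
df['delivery_date'] = pd.to_datetime(df['delivery_date'])
d_date = pd.Timestamp('2021-04-01')          # hypothetical row date
pre_date = d_date - pd.DateOffset(months=3)  # 3 months earlier, no strptime/str round-trip
window = df[(df['delivery_date'] >= pre_date) & (df['delivery_date'] < d_date)]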

Subset two consecutive event occurrence in pandas

I'm trying to get a subset of my data whenever there is a consecutive occurrence of two events, in that order. The events are time-stamped. So every time there are continuous 2's and then continuous 3's, I want to subset that stretch into a dataframe and append it to a dictionary. The following code does that, but I have to apply it to a very large dataframe of more than 20 million observations, and it is extremely slow using iterrows. How can I make this fast?
df = pd.DataFrame({'Date': [101,102,103,104,105,106,107,108,109,110,111,
                            112,113,114,115,116,117,118,119,120,121,122],
                   'Event': [1,1,2,2,2,3,3,1,3,2,2,3,1,2,3,2,3,2,2,3,3,3]})
dfb = pd.DataFrame(columns=df.columns)
C = {}
f1 = 0
for index, row in df.iterrows():
    if (row['Event'] == 2) & (3 not in dfb['Event'].values):
        dfb = dfb.append(row)
        f1 = 1
    elif (row['Event'] == 3) & (f1 == 1):
        dfb = dfb.append(row)
    elif 3 in dfb['Event'].values:
        f1 = 0
        C[str(dfb.iloc[0, 0])] = dfb
        del dfb
        dfb = pd.DataFrame(columns=df.columns)
        if row['Event'] == 2:
            dfb = dfb.append(row)
            f1 = 1
    else:
        f1 = 0
        del dfb
        dfb = pd.DataFrame(columns=df.columns)
Edit: The desired output is basically a dictionary of the subsets shown in the image: https://i.stack.imgur.com/ClWZs.png
If you want to accelerate this, you should vectorize your code. You could try it like this (df is the same as in your code):
vec = df.copy()
vec['Event_y'] = vec['Event'].shift(1).fillna(0).astype(int)
vec['Same_Flag'] = float('nan')
vec.loc[(vec['Event_y'] == vec['Event']) & (vec['Event'] != 1), 'Same_Flag'] = 1
vec.dropna(inplace=True)
vec.loc[:, ('Date', 'Event')]
Output is:
    Date  Event
3    104      2
4    105      2
6    107      3
10   111      2
18   119      2
20   121      3
21   122      3
I think that's close to what you need. You could improve based on that.
I don't understand why dates 104, 105, 107 are not counted.
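For the actual grouping into a dictionary, here is a sketch of a run-based approach: shift/cumsum labels each maximal run of equal events, and each run of 2's that is immediately followed by a run of 3's becomes one dictionary entry keyed by its first Date (matching the str(dfb.iloc[0, 0]) keying in the question). The remaining loop is over runs, not rows, so it stays cheap even on large frames.

import pandas as pd

df = pd.DataFrame({'Date': [101,102,103,104,105,106,107,108,109,110,111,
                            112,113,114,115,116,117,118,119,120,121,122],
                   'Event': [1,1,2,2,2,3,3,1,3,2,2,3,1,2,3,2,3,2,2,3,3,3]})

# label maximal runs of equal consecutive Event values
run_id = df['Event'].ne(df['Event'].shift()).cumsum()
run_event = df.groupby(run_id)['Event'].first()

C = {}
for rid, ev in run_event.items():
    # a run of 2's immediately followed by a run of 3's forms one block
    if ev == 2 and run_event.get(rid + 1) == 3:
        block = df[run_id.isin([rid, rid + 1])]
        C[str(block.iloc[0, 0])] = block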

vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

How is this working?
I know the intuition behind it: given the movie dataset (loaded into "md" using pandas), we are finding the rows where 'vote_count' is not null and converting those values to int.
But I am not understanding the syntax.
md[md['vote_count'].notnull()] returns a filtered view of your md dataframe containing only the rows where vote_count is not NULL; its 'vote_count' column, cast to int, is what gets assigned to the variable vote_counts. This is Boolean Indexing.
# Assume this dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))
df.loc[2, 'B'] = np.nan
When you do df['B'].notnull(), it returns a boolean vector which can be used to filter your data to the rows where the value is True:
df['B'].notnull()
0     True
1     True
2    False
3     True
4     True
Name: B, dtype: bool
df[df['B'].notnull()]
          A         B         C
0 -0.516625 -0.596213 -0.035508
1  0.450260  1.123950 -0.317217
3  0.405783  0.497761 -1.759510
4  0.307594 -0.357566  0.279341
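And mirroring the original md line on a small made-up frame, so the full expression can be seen end to end:

import numpy as np
import pandas as pd

md = pd.DataFrame({'vote_count': [5.0, np.nan, 12.0]})
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
print(vote_counts)
# 0     5
# 2    12
# Name: vote_count, dtype: int64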