Drop rows from a pandas dataframe

I need to drop some rows from a pandas dataframe aa based on a query as follows:
aa.loc[(aa['_merge'] == 'right_only') & (aa['Context Interpretation'] == 'Topsoil')]
How do I drop this selection from the dataframe aa?

You can add ~ to invert the boolean mask:
out = aa.loc[~((aa['_merge'] == 'right_only') & (aa['Context Interpretation'] == 'Topsoil'))]
Or
idx = aa.index[(aa['_merge'] == 'right_only') & (aa['Context Interpretation'] == 'Topsoil')]
out = aa.drop(idx)
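For a quick self-contained check, here is a minimal sketch on made-up data (in practice the _merge column would come from pd.merge(..., indicator=True)); both approaches drop the same rows:
import pandas as pd

# toy stand-in for aa; '_merge' would normally be produced by
# pd.merge(..., indicator=True)
aa = pd.DataFrame({
    '_merge': ['both', 'right_only', 'right_only', 'left_only'],
    'Context Interpretation': ['Topsoil', 'Topsoil', 'Subsoil', 'Topsoil'],
})

mask = (aa['_merge'] == 'right_only') & (aa['Context Interpretation'] == 'Topsoil')
out = aa.loc[~mask]             # keep everything except the selection
same = aa.drop(aa.index[mask])  # identical result via drop
print(out.equals(same))         # True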

Related

numpy.select and assign new column in df with condition from values of two columns

df = df.assign[test = np.select[df.trs = 'iw' & df.rp == 'yu'],[1,0],'null']
I want: if df.trs == 'iw' and df.rp == 'yu', then the new column should be 1, else 0, set only by the condition-fulfilling rows, not every row.
I tried np.select with a condition array, but I am not getting the desired output.
You don't need numpy.select; a simple boolean operation is sufficient:
df['test'] = (df['trs'].eq('iw') & df['rp'].eq('yu')).astype(int)
If you really want to use numpy, this would require numpy.where:
df['test'] = np.where(df['trs'].eq('iw') & df['rp'].eq('yu'), 1, 0)
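For a quick check, here is a minimal sketch on made-up data (column names trs/rp taken from the question):
import pandas as pd
import numpy as np

df = pd.DataFrame({'trs': ['iw', 'iw', 'xx'], 'rp': ['yu', 'zz', 'yu']})

# boolean mask cast to int: 1 where both conditions hold, else 0
df['test'] = (df['trs'].eq('iw') & df['rp'].eq('yu')).astype(int)
print(df)
#   trs  rp  test
# 0  iw  yu     1
# 1  iw  zz     0
# 2  xx  yu     0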

How to group this Time column into intervals and assign simpler values in a new column

This is the dataframe (posted as an image) where, from the Time column, I am trying to make some group intervals and assign simple values in the Time_group column.
This is the code I am trying:
for i in sales_df['Time'].str[:2]:
    if (i == '10') & (i == '11') & (i == '12'):
        sales_df['Time_group'] = 1
    if (i == '13') & (i == '14') & (i == '15'):
        sales_df['Time_group'] = 2
    if (i == '16') & (i == '17') & (i == '18'):
        sales_df['Time_group'] = 3
    if (i == '19') & (i == '20') & (i == '21'):
        sales_df['Time_group'] = 4
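Note that the loop compares i to three different strings with & (they can never all be equal at once) and assigns the whole Time_group column on every match. A minimal vectorized sketch, assuming Time holds strings like '13:45' (as suggested by the str[:2] usage) and the intervals follow the groups above:
import pandas as pd

# made-up sample; 'Time' format assumed from the question
sales_df = pd.DataFrame({'Time': ['10:05', '14:30', '17:15', '20:45']})

hour = sales_df['Time'].str[:2].astype(int)
# (9, 12] -> 1, (12, 15] -> 2, (15, 18] -> 3, (18, 21] -> 4
sales_df['Time_group'] = pd.cut(hour, bins=[9, 12, 15, 18, 21],
                                labels=[1, 2, 3, 4]).astype(int)
print(sales_df)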

Matching conditions in columns

I am trying to match conditions so that if text is present in both columns A and B and a 0 is in column C, the code should return 'new' in column C (overwriting the 0). Example dataframe below:
import pandas as pd
df = pd.DataFrame({"A":['something',None,'filled',None], "B":['test','test','test',None], "C":['rt','0','0','0']})
I have tried the following, however it only seems to accept the last condition, so that any '0' entries in column C become 'new' regardless of None in columns A or B (in this example I only expect 'new' to appear on row 2).
import numpy as np
conditions = [(df['A'] is not None) & (df['B'] is not None) & (df['C'] == '0')]
values = ['new']
df['C'] = np.select(conditions, values, default=df["C"])
Appreciate any help!
Your condition df['A'] is not None tests the whole Series object against None, which is always True. You will need to use .isna() and filter where it is not NaN/None (using ~) as below:
conditions = [~(df['A'].isna()) & ~(df['B'].isna()) & (df['C'] == '0')]
output:
A B C
0 something test rt
1 None test 0
2 filled test new
3 None None 0
Use Series.notna to test for None or NaN:
conditions = [df['A'].notna() & df['B'].notna() & (df['C'] == '0')]
Or:
conditions = [df[['A','B']].notna().all(axis=1) & (df['C'] == '0')]
values = ['new']
df['C'] = np.select(conditions, values, default=df["C"])
print (df)
A B C
0 something test rt
1 None test 0
2 filled test new
3 None None 0
Use boolean indexing with .loc to overwrite in place:
mask = df[['A', 'B']].notna().all(axis=1) & df['C'].eq('0')
df.loc[mask, 'C'] = 'new'
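Applied to the example frame from the question, this gives:
import pandas as pd

df = pd.DataFrame({"A": ['something', None, 'filled', None],
                   "B": ['test', 'test', 'test', None],
                   "C": ['rt', '0', '0', '0']})

mask = df[['A', 'B']].notna().all(axis=1) & df['C'].eq('0')
df.loc[mask, 'C'] = 'new'   # overwrite only the matching zeros
print(df)
#            A     B    C
# 0  something  test   rt
# 1       None  test    0
# 2     filled  test  new
# 3       None  None    0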

Multiple Comparison of Different Indexes Pandas Dataframe

New to Python/Pandas. I am trying to iterate through a dataframe and check for duplicates. If a duplicate is found, compare the duplicate's 'BeginTime' at index to 'BeginTime' at index + 1. If true, assign a new time to a different dataframe. When I run the code, the first duplicate should produce a new time of 'Grab & Go', but I think my comparison statement isn't right: I get '1130' as the new time for the first duplicate.
import pandas as pd
df = pd.DataFrame({'ID': [97330, 97330, 95232, 95232, 95232],
                   'BeginTime': [1135, 1255, 1135, 1255, 1415]})
Expected Output:
ID NewTime
97330 Grab & Go
95232 Grab & Go
# iterate through df
for index, row in df.iterrows():
    # check for duplicates in the ID field, comparing index to index + 1
    if df.loc[index, 'ID'] == df.shift(1).loc[index, 'ID']:
        # if a duplicate, compare 'BeginTime' of index to 'BeginTime' of index + 1;
        # if true, assign a new time to a different df
        if df.loc[index, 'BeginTime'] == 1135 and df.shift(1).loc[index, 'BeginTime'] == 1255:
            dfnew['NewTime'] = 'Grab & Go'
            print('Yes, a duplicate')
        elif df.loc[index, 'BeginTime'] == 1255:
            dfnew['NewTime'] = '1130'
            print('Yes, a duplicate')
        else:
            print('No, not a duplicate')
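As a loop-free sketch, assuming the rule implied by the expected output (an ID gets 'Grab & Go' when a BeginTime of 1135 is immediately followed by 1255 within that ID, and '1130' otherwise), groupby can build dfnew directly; the helper new_time here is hypothetical:
import pandas as pd

df = pd.DataFrame({'ID': [97330, 97330, 95232, 95232, 95232],
                   'BeginTime': [1135, 1255, 1135, 1255, 1415]})

def new_time(times):
    # hypothetical helper: True when 1135 is immediately followed by 1255
    pairs = zip(times, times[1:])
    return 'Grab & Go' if any(a == 1135 and b == 1255 for a, b in pairs) else '1130'

dfnew = (df.groupby('ID', sort=False)['BeginTime']
           .agg(lambda s: new_time(s.tolist()))
           .rename('NewTime')
           .reset_index())
print(dfnew)
#       ID    NewTime
# 0  97330  Grab & Go
# 1  95232  Grab & Go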

Joining Multiple Dataframes with and conditions in Pyspark

I want to implement the below SQL join condition using PySpark dataframes.
select *
FROM tableA A, tableC C, tableB B
WHERE A.sno = C.sno AND A.sno = B.sno
AND A.sdt = C.sdt AND A.sdt = B.sdt
AND A.sid = C.sid AND A.sid = B.sid
I have tried the below code (df_A0, df_C0, df_B0 are 3 different dataframes):
join_data = df_A0.join(df_C0, (df_A0.sno===df_C0.sno).join(df_B0, (df_A0.sno===df_B0.sno)) & \
(df_A0.sdt === df_C0.sdt) & (df_A0.sdt === df_B0.sdt) & (df_A0.sid === df_C0.sid) & \
df_A0.sid = df_B0.sid,how='inner'))
but when I execute this it shows an invalid syntax error.
Can someone please guide me on how to write it with PySpark dataframes.
You can try like below; note the comparisons use == (Python has no === operator) and each join carries its own set of conditions:
join_data = df_A0.alias("df_A0").join(df_C0.alias("df_C0"), (df_A0.sno == df_C0.sno) & (df_A0.sdt == df_C0.sdt) & (df_A0.sid == df_C0.sid),"inner")\
.join(df_B0.alias("df_B0"), (df_A0.sno == df_B0.sno) & (df_A0.sdt == df_B0.sdt) & (df_A0.sid == df_B0.sid), "inner")
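A minimal runnable variant, assuming the three frames share the key columns sno, sdt, sid (the data here is made up); passing the list of column names deduplicates the join keys in the result, unlike the explicit == conditions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# tiny made-up frames with the question's key columns
df_A0 = spark.createDataFrame([(1, "2020-01-01", "x")], ["sno", "sdt", "sid"])
df_C0 = spark.createDataFrame([(1, "2020-01-01", "x")], ["sno", "sdt", "sid"])
df_B0 = spark.createDataFrame([(1, "2020-01-01", "x")], ["sno", "sdt", "sid"])

join_data = (df_A0.join(df_C0, ["sno", "sdt", "sid"], "inner")
                  .join(df_B0, ["sno", "sdt", "sid"], "inner"))
join_data.show()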