I need to drop some rows from a pandas dataframe aa based on a query as follows:
aa.loc[(aa['_merge'] == 'right_only') & (aa['Context Interpretation'] == 'Topsoil')]
How do I drop this selection from the dataframe aa?
You can add '~' to negate the boolean mask:
out = aa.loc[~((aa['_merge'] == 'right_only') & (aa['Context Interpretation'] == 'Topsoil'))]
Or
idx = aa.index[(aa['_merge'] == 'right_only') & (aa['Context Interpretation'] == 'Topsoil')]
out = aa.drop(idx)
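For illustration, both approaches on a tiny hypothetical frame (the column names come from the question, the data here is made up):

```python
import pandas as pd

# toy stand-in for aa
aa = pd.DataFrame({
    '_merge': ['both', 'right_only', 'right_only'],
    'Context Interpretation': ['Topsoil', 'Topsoil', 'Ditch'],
})

mask = (aa['_merge'] == 'right_only') & (aa['Context Interpretation'] == 'Topsoil')
out = aa.loc[~mask]              # keep everything except the selection
same = aa.drop(aa.index[mask])   # equivalent via drop

print(out.equals(same))  # True: only the middle row is removed
```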
df = df.assign[test = np.select[df.trs = 'iw' & df.rp == 'yu'],[1,0],'null']
I want the following: if df.trs == 'iw' and df.rp == 'yu', then the new column should be 1, else 0, so only the condition-fulfilling rows are marked, not every row.
I tried np.select with a condition array, but I'm not getting the desired output.
You don't need numpy.select, a simple boolean operator is sufficient:
df['test'] = (df['trs'].eq('iw') & df['rp'].eq('yu')).astype(int)
If you really want to use numpy, this would require numpy.where:
df['test'] = np.where(df['trs'].eq('iw') & df['rp'].eq('yu'), 1, 0)
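For illustration, on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'trs': ['iw', 'iw', 'xx'], 'rp': ['yu', 'zz', 'yu']})

# 1 where both conditions hold, 0 everywhere else
df['test'] = np.where(df['trs'].eq('iw') & df['rp'].eq('yu'), 1, 0)
print(df['test'].tolist())  # [1, 0, 0]
```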
This is my dataframe (posted as an image). From the Time column I am trying to make some group intervals and assign simple values in a Time_group column.
This is the code I am trying:
for i in sales_df['Time'].str[:2]:
    if (i == '10') & (i == '11') & (i == '12'):
        sales_df['Time_group'] = 1
    if (i == '13') & (i == '14') & (i == '15'):
        sales_df['Time_group'] = 2
    if (i == '16') & (i == '17') & (i == '18'):
        sales_df['Time_group'] = 3
    if (i == '19') & (i == '20') & (i == '21'):
        sales_df['Time_group'] = 4
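As written, each `if` can never be true, since a single value i cannot equal '10', '11' and '12' at once (the `&` would need to be `|`), and each assignment overwrites the whole column on every iteration. A vectorized sketch with np.select, assuming Time holds strings such as '10:30' (the sample data here is hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical sample; the real sales_df comes from the question's data
sales_df = pd.DataFrame({'Time': ['10:15', '13:40', '16:05', '20:30']})

hour = sales_df['Time'].str[:2].astype(int)
conditions = [
    hour.between(10, 12),
    hour.between(13, 15),
    hour.between(16, 18),
    hour.between(19, 21),
]
sales_df['Time_group'] = np.select(conditions, [1, 2, 3, 4], default=0)
print(sales_df['Time_group'].tolist())  # [1, 2, 3, 4]
```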
I am trying to match conditions so that if text is present in both columns A and B and a 0 is in column C, the code should return 'new' in column C (overwriting the 0). Example dataframe below:
import pandas as pd
df = pd.DataFrame({"A":['something',None,'filled',None], "B":['test','test','test',None], "C":['rt','0','0','0']})
I have tried the following; however, it only seems to accept the last condition, so any '0' entry in column C becomes 'new' regardless of None in columns A or B. (In this example I only expect 'new' to appear in row 2.)
import numpy as np
conditions = [(df['A'] is not None) & (df['B'] is not None) & (df['C'] == '0')]
values = ['new']
df['C'] = np.select(conditions, values, default=df["C"])
Appreciate any help!
You will need to use .isna() and filter where it is not NaN/None (negating with ~), as below:
conditions = [~(df['A'].isna()) & ~(df['B'].isna()) & (df['C'] == '0')]
output:
A B C
0 something test rt
1 None test 0
2 filled test new
3 None None 0
Use Series.notna to test for None or NaN:
conditions = [df['A'].notna() & df['B'].notna() & (df['C'] == '0')]
Or:
conditions = [df[['A','B']].notna().all(axis=1) & (df['C'] == '0')]
values = ['new']
df['C'] = np.select(conditions, values, default=df["C"])
print (df)
A B C
0 something test rt
1 None test 0
2 filled test new
3 None None 0
Use:
mask = df[['A', 'B']].notna().all(axis=1) & df['C'].eq('0')
df.loc[mask, 'C'] = 'new'
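For illustration, applying this in-place assignment to the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({"A": ['something', None, 'filled', None],
                   "B": ['test', 'test', 'test', None],
                   "C": ['rt', '0', '0', '0']})

# rows where both A and B are filled and C is '0'
mask = df[['A', 'B']].notna().all(axis=1) & df['C'].eq('0')
df.loc[mask, 'C'] = 'new'
print(df['C'].tolist())  # ['rt', '0', 'new', '0']
```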
New to Python/Pandas. I am trying to iterate through a dataframe and check for duplicates. If a duplicate is found, compare the duplicate's 'BeginTime' at index with 'BeginTime' at index + 1. If true, assign a new time in a different dataframe. When I run the code, the first duplicate should produce a new time of 'Grab & Go', but I think my comparison statement isn't right: I get '1130' as the new time for the first duplicate.
import pandas as pd
df = pd.DataFrame({'ID': [97330, 97330, 95232, 95232, 95232],
'BeginTime': [1135, 1255, 1135, 1255, 1415]})
Expected Output:
ID NewTime
97330 Grab & Go
95232 Grab & Go
# iterate through df
for index, row in df.iterrows():
    # check for duplicates in the ID field, comparing index to index + 1
    if df.loc[index, 'ID'] == df.shift(1).loc[index, 'ID']:
        # if a duplicate, compare 'BeginTime' at index with 'BeginTime' of the
        # duplicate's other row; if true, assign a new time to a different df
        if df.loc[index, 'BeginTime'] == 1135 and df.shift(1).loc[index, 'BeginTime'] == 1255:
            dfnew['NewTime'] = 'Grab & Go'
            print('Yes, a duplicate')
        elif df.loc[index, 'BeginTime'] == 1255:
            dfnew['NewTime'] = '1130'
            print('Yes, a duplicate')
    else:
        print('No, not a duplicate')
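Note that df.shift(1) looks at the *previous* row, not the next one, which is why the 1255 branch fires first and yields '1130'. A vectorized sketch of the likely intent, comparing each row with the next row of the same ID (assuming dfnew should end up with one row per qualifying ID):

```python
import pandas as pd

df = pd.DataFrame({'ID': [97330, 97330, 95232, 95232, 95232],
                   'BeginTime': [1135, 1255, 1135, 1255, 1415]})

# for each row, the BeginTime of the NEXT row within the same ID
next_time = df.groupby('ID')['BeginTime'].shift(-1)
mask = (df['BeginTime'] == 1135) & (next_time == 1255)

dfnew = df.loc[mask, ['ID']].copy()
dfnew['NewTime'] = 'Grab & Go'
print(dfnew)
#       ID    NewTime
# 0  97330  Grab & Go
# 2  95232  Grab & Go
```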
I want to implement the below SQL join condition using PySpark dataframes.
SELECT *
FROM tableA A, tableC C, tableB B
WHERE A.sno = C.sno AND A.sno = B.sno
  AND A.sdt = C.sdt AND A.sdt = B.sdt
  AND A.sid = C.sid AND A.sid = B.sid
I have tried the code below (df_A0, df_C0 and df_B0 are three different dataframes):
join_data = df_A0.join(df_C0, (df_A0.sno===df_C0.sno).join(df_B0, (df_A0.sno===df_B0.sno)) & \
(df_A0.sdt === df_C0.sdt) & (df_A0.sdt === df_B0.sdt) & (df_A0.sid === df_C0.sid) & \
df_A0.sid = df_B0.sid,how='inner'))
but when I execute this, it shows an invalid syntax error.
Can someone please guide me on how to write it with PySpark dataframes?
You can try something like below:
join_data = df_A0.alias("df_A0").join(df_C0.alias("df_C0"), (df_A0.sno == df_C0.sno) & (df_A0.sdt == df_C0.sdt) & (df_A0.sid == df_C0.sid),"inner")\
.join(df_B0.alias("df_B0"), (df_A0.sno == df_B0.sno) & (df_A0.sdt == df_B0.sdt) & (df_A0.sid == df_B0.sid), "inner")