Empty cells when using an apply function - pandas

So I am trying to calculate a value for a new column from one of two columns, based on which one has data available. This is the code I have right now. It doesn't seem to notice when there is no data present and always goes to the "else" statement. My dataframe is an imported Excel file. Thanks for any advice!
def create_sulfide_col(row):
    if row["Sulphate-S(HCL Leachable)_%S"] is None:
        val = row["Total-S_%S"] - row["Sulphate-S(HCL Leachable)_%S"]
    else:
        val = row["Total-S_%S"] - row["Sulphate-S_%S"]
    return val

df["Sulphide-S(calc)-C_%S"] = df.apply(lambda row: create_sulfide_col(row), axis='columns')

The check fails because pandas reads empty Excel cells as NaN, not None, so row[...] is None is never True. This can be done using numpy.where together with isna():
import numpy as np
df['newcol'] = np.where(df["Sulphate-S(HCL Leachable)_%S"].isna(), df["Total-S_%S"] - df["Sulphate-S_%S"], df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"])
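A minimal, self-contained sketch of the same idea (the values here are invented for illustration; only the column names come from the question):
import numpy as np
import pandas as pd

# toy frame standing in for the imported Excel file
df = pd.DataFrame({
    "Total-S_%S": [1.0, 2.0],
    "Sulphate-S(HCL Leachable)_%S": [0.4, np.nan],  # empty Excel cell becomes NaN
    "Sulphate-S_%S": [0.3, 0.5],
})
# fall back to Sulphate-S_%S on rows where the HCL-leachable value is missing
df["Sulphide-S(calc)-C_%S"] = np.where(
    df["Sulphate-S(HCL Leachable)_%S"].isna(),
    df["Total-S_%S"] - df["Sulphate-S_%S"],
    df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"],
)
print(df)  # row 0 uses the HCL-leachable column, row 1 falls back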

Related

Using pandas udf to return a full column containing average

This is very weird. I tried using a pandas UDF on a Spark dataframe, and it works only if I do a select and return one value, which is the average of the column. But if I try to fill the whole column with that value, it doesn't work. The following works:
@pandas_udf(DoubleType())
def avg(col):
    cl = np.average(col)
    return cl

df.select(avg('col'))
This works and returns a dataframe of one row containing the average of the column. But the following doesn't work:
df.withColumn('avg', F.lit(avg(col)))
Why? If avg(col) is a value, why can't I use it to fill the column with lit()? It does work when I return a constant number, like the following example:
@pandas_udf(DoubleType())
def avg(col):
    return 5

df.withColumn('avg', avg(col))
I also tried returning a series, and that didn't work either:
@pandas_udf(DoubleType())
def avg(col):
    cl = np.average(col)
    return pd.Series([cl] * len(col))

df.withColumn('avg', avg(col))
This doesn't work either, but it does work if I use a constant instead of cl. So basically, how can I return a full column containing the average, so the whole column is filled up with that value?
lit is evaluated on the driver and is not executed on the data on the executors. The best way to achieve this is to define a window spec covering the entire dataset and call the built-in aggregate function over that window. This eliminates the need for an extra UDF:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn('avg', F.avg('col').over(windowSpec))
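If you do want to go through lit(), the trick is to finish the aggregation before building the column expression. A minimal sketch, assuming the column is literally named 'col':
from pyspark.sql import functions as F

# run the aggregation as a job and pull the single result back to the driver
avg_val = df.agg(F.avg('col')).first()[0]
# avg_val is now a plain Python float, which lit() can broadcast to every row
df = df.withColumn('avg', F.lit(avg_val))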
Type cast it to float().
I am not sure what you are trying to achieve here. The UDF is called for each row, so inside the UDF, "col" represents an individual cell value; it does not represent the entire column.
If your column is of type array/list:
df = spark.createDataFrame(
    [
        [[1.0, 2.0, 3.0, 4.0]],
        [[5.0, 6.0, 7.0, 8.0]],
    ],
    ["num"]
)

@F.udf(returnType=DoubleType())
def avg(col):
    import numpy as np
    return float(np.average(col))

df = df.withColumn("avg", avg("num"))
+--------------------+---+
| num|avg|
+--------------------+---+
|[1.0, 2.0, 3.0, 4.0]|2.5|
|[5.0, 6.0, 7.0, 8.0]|6.5|
+--------------------+---+
But if your column is a scalar type like double/float, then the average of it via UDF will always return the same column value:
df = spark.createDataFrame(
    [[1.0], [2.0], [3.0], [4.0]],
    ["num"]
)

@F.udf(returnType=DoubleType())
def avg(col):
    import numpy as np
    return float(np.average(col))

df = df.withColumn("avg", avg("num"))
+---+---+
|num|avg|
+---+---+
|1.0|1.0|
|2.0|2.0|
|3.0|3.0|
|4.0|4.0|
+---+---+
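As an aside, if the goal is the per-row average of an array column like "num" above, newer Spark versions can do it without a Python UDF at all. A sketch assuming Spark 3.1+, where pyspark.sql.functions.aggregate accepts Python lambdas:
from pyspark.sql import functions as F

# sum the array with the built-in higher-order function, then divide by its length
df = df.withColumn(
    "avg",
    F.aggregate("num", F.lit(0.0), lambda acc, x: acc + x) / F.size("num"),
)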

How to deal with nested data in pandas dataframe via "for loop"?

I have nested data in a pandas dataframe, and I want to flatten the column "names" using the pd.DataFrame() function. When I attempt to flatten it via a for loop, it produces 5 separate dataframes, whereas I expect a single dataframe with all the values listed. I have already tried the concat and append methods, but they gave me no clue on how to move forward. Any help/comment is welcome, thanks so much. Here is my for loop:
x = df['names'].iloc[0:4]
name_data = pd.DataFrame(x)
data_row = []
for data in x:
    data_row = pd.DataFrame(data)
    st.write(data_row)
If I understand correctly, you want to concat the 5 tables in the example images above into a single table and show the result on streamlit. All you have to do is:
1. change data_row = pd.DataFrame(data) to data_row += [pd.DataFrame(data)]
2. after the for loop finishes, concat all the dataframes in data_row into one dataframe using data_row = pd.concat(data_row)
3. then show the result table with streamlit using st.write(data_row)
Here is an example tackling your problem:
df = pd.DataFrame({
    'names': [[{'name': 'a'}, {'name': 'b'}], [{'name': 'c'}]]
})
x = df['names'].iloc[0:2]
data_row = []
for data in x:
    data_row += [pd.DataFrame(data)]
data_row = pd.concat(data_row)
st.write(data_row)
Or you can build a list of dictionaries and create the dataframe from it, as in the example below:
data_row = []
for data in x:
    data_row += data
data_row = pd.DataFrame(data_row)
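As a side note, if each cell of "names" is a list of flat dicts like in the toy data above, pandas can do the flattening in one call. A sketch under that assumption:
import pandas as pd

df = pd.DataFrame({
    'names': [[{'name': 'a'}, {'name': 'b'}], [{'name': 'c'}]]
})
# explode turns each list element into its own row,
# json_normalize then expands the dicts into columns
data_row = pd.json_normalize(df['names'].explode().tolist())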

python - if-else in a for loop processing one column

I want to loop through a column and convert it into a processed series. Below is an example dataframe with two rows and four columns:
import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils

data = [['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome', 1, 'ac nephritis nephrotic syndrome'],
        ['sternocleidomastoid contracture', 'sternocleidomastoid contracture', 0, "NA"]]

# Create the pandas DataFrame
df_diagnosis = pd.DataFrame(data, columns=['diagnosis_name', 'diagnosis_name_edited', 'is_spell_corrected', 'spell_corrected_value'])
I want to use the spell_corrected_value column if the is_spell_corrected column is more than 1; otherwise, use diagnosis_name_edited. At the moment, I have the following code that directly uses the diagnosis_name_edited column. How do I turn this into an if-else/lambda check on the is_spell_corrected column?
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['diagnosis_name_edited'].astype(str)) # characters (generator)
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series) #
Thank you.
If I get you right, try out this fast solution using numpy.where:
import numpy as np

df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1, df_diagnosis['spell_corrected_value'], df_diagnosis['diagnosis_name_edited'])
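To then get the processed series the question is after, a sketch that maps rapidfuzz's default_process over the selected values (reusing the 'new_column' name from the answer above):
from rapidfuzz import utils as rapid_utils

# apply the same preprocessing to whichever column np.where picked per row
unmapped_processed_diagnosis = df_diagnosis['new_column'].astype(str).map(rapid_utils.default_process)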

How to fill a pandas dataframe in a list comprehension?

I need to fill a pandas dataframe in a list comprehension. Rows satisfying the criteria are appended to the dataframe, yet at the end the dataframe is empty. Is there a way to resolve this? In my real code I'm doing many other calculations; this is a simplified example that reproduces the issue.
import pandas as pd

main_df = pd.DataFrame(columns=['a', 'b', 'c', 'd'])
main_df = main_df.append({'a': 'a1', 'b': 'b1', 'c': 'c1', 'd': 'd1'}, ignore_index=True)
main_df = main_df.append({'a': 'a2', 'b': 'b2', 'c': 'c2', 'd': 'd2'}, ignore_index=True)
main_df = main_df.append({'a': 'a3', 'b': 'b3', 'c': 'c3', 'd': 'd3'}, ignore_index=True)
main_df = main_df.append({'a': 'a4', 'b': 'b4', 'c': 'c4', 'd': 'd4'}, ignore_index=True)
print(main_df)

sub_df = pd.DataFrame()
df_columns = main_df.columns.values

def search_using_list_comprehension(row, sub_df, df_columns):
    if row[0] == 'a1' or row[0] == 'a2':
        dict = {a: b for a, b in zip(df_columns, row)}
        print('dict: ', dict)
        sub_df = sub_df.append(dict, ignore_index=True)
        print('sub_df.shape: ', sub_df.shape)

[search_using_list_comprehension(row, sub_df, df_columns) for row in main_df.values]
print(sub_df)
print(sub_df.shape)
The problem is that you define an empty frame with sub_df = pd.DataFrame(), then reuse the same name as a function parameter, and the list comprehension always passes in that same, still-empty sub_df. The frame you append to inside the function is local to the function only. Another issue is shadowing Python's built-in dict with a user-defined variable; don't do this. Here is what can be changed in your code to make it work, though I would strongly advise against it:
import pandas as pd

df_columns = main_df.columns.values
sub_df = pd.DataFrame(columns=df_columns)

def search_using_list_comprehension(row):
    global sub_df
    if row[0] == 'a1' or row[0] == 'a2':
        my_dict = {a: b for a, b in zip(df_columns, row)}
        print('dict: ', my_dict)
        sub_df = sub_df.append(my_dict, ignore_index=True)
        print('sub_df.shape: ', sub_df.shape)

[search_using_list_comprehension(row) for row in main_df.values]
print(sub_df)
print(sub_df.shape)
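The idiomatic way needs no comprehension with side effects at all (note also that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0). A sketch using only the names from the question, selecting the matching rows with a boolean mask:
# keep only the rows whose 'a' column is 'a1' or 'a2'
sub_df = main_df[main_df['a'].isin(['a1', 'a2'])].reset_index(drop=True)
print(sub_df)
print(sub_df.shape)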

Extracting value and creating new column out of it

I would like to extract a certain section of a URL residing in a column of a pandas dataframe and make that a new column. This
ref = df['REFERRERURL']
ref.str.findall("\\d\\d\\/(.*?)(;|\\?)", flags=re.IGNORECASE)
returns me a Series with tuples in it. How can I take out only one part of each tuple before the Series is created, so I can simply turn it into a column? Sample data for REFERRERURL is
http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....
In this example I am interested in creating a column that only has 'someproduct_step2' in it.
Thanks,
In [25]: df = DataFrame([['http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....']],columns=['A'])
In [26]: df['A'].str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE).apply(lambda x: Series(x[0][0],index=['first']))
Out[26]:
first
0 someproduct_step2
In 0.11.1, here is a neat way of doing this as well:
In [34]: df.replace({ 'A' : "http:.+\d\d\/(.*?)(;|\\?).*$"}, { 'A' : r'\1'} ,regex=True)
Out[34]:
A
0 someproduct_step2
This also worked:
import re

def extract(x):
    res = re.findall("\\d\\d\\/(.*?)(;|\\?)", x)
    if res:
        return res[0][0]

session['RU_2'] = session['REFERRERURL'].apply(extract)
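In newer pandas the same thing fits in one vectorized call with Series.str.extract, which returns the first capture group per row. A sketch reusing the one-column frame from the first answer:
import re

# non-capturing group for the delimiter, so only the middle part is captured
df['first'] = df['A'].str.extract(r"\d\d/(.*?)(?:;|\?)", flags=re.IGNORECASE, expand=False)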