PySpark dataframe Pandas UDF returns empty dataframe

I'm trying to apply a pandas_udf to my PySpark dataframe for some filtering, following the groupby('Key').apply(UDF) method. To use the pandas_udf I defined an output schema and have a condition on the column Number. As a simplified example, I wish to return only the ID of the rows with an odd Number.
This brings up a problem: sometimes there is no odd Number in a group, so the UDF returns an empty dataframe, which conflicts with the schema's requirement that Number be an int.
Is there a way to solve this and combine only the odd-Number rows into a new dataframe?
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("ID", StringType()),
    StructField("Number", IntegerType())
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def get_odd(df):
    # keep only the rows whose Number is odd
    odd = df.loc[df['Number'] % 2 == 1]
    return odd[['ID', 'Number']]

I came across this issue with empty DataFrames in some groups. I solved it by checking for an empty DataFrame and returning a DataFrame with the defined schema:
if df_out.empty:
    # change the schema as needed
    return pd.DataFrame({'fullVisitorId': pd.Series([], dtype='str'),
                         'time': pd.Series([], dtype='datetime64[ns]'),
                         'total_transactions': pd.Series([], dtype='int')})
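Folding that guard into the get_odd UDF from the question might look like the sketch below (sdf is a hypothetical name for the source PySpark dataframe, and the empty-frame dtypes may need adjusting to match the Spark schema exactly):

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# schema as declared in the question above
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def get_odd(df):
    odd = df.loc[df['Number'] % 2 == 1]
    if odd.empty:
        # return an empty frame whose columns and dtypes still match the schema
        return pd.DataFrame({'ID': pd.Series([], dtype='str'),
                             'Number': pd.Series([], dtype='int')})
    return odd[['ID', 'Number']]

# result = sdf.groupby('Key').apply(get_odd)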


Aggregating multiple data types in pandas groupby

I have a data frame with rows that are mostly translations of other rows, e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator), and I'm trying to merge the rows together based on the identifier match. In some columns the Arabic row doesn't contain a translation but rather the same English value (e.g. for the language column both records might have ['ger'], which becomes ['ger', 'ger']), so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
    lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"[‘81055/vdc_100000000094.0x000093’] ",[‘ara’],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', ‘الكواكب’, 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","[‘المُلكية العامة’, ‘Public Domain’]","[‘كلاوديوس بطلميوس (بطليمو)’,’Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
If you cannot control the input values, you need to fix them first.
Something like this: here, I am converting the string values in subject_name_namePart to lists of strings.
from ast import literal_eval

# wrap bare strings as single-element list literals
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
# parse every cell into an actual Python list
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)
Then you can explode and aggregate:
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
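Run end to end on a cut-down version of the sample data above, the two steps might look like this (a sketch: the toy frame carries only the key and the problem column, and the wrapping trick assumes the bare strings contain no single quotes):

import pandas as pd
from ast import literal_eval

df = pd.DataFrame({
    'location_shelfLocator': ['81055/vdc_100000000094.0x000093'] * 2,
    'subject_name_namePart': [
        "كلاوديوس بطلميوس (بطليمو)",  # bare string
        "['Claudius Ptolemaeus (Ptolemy)', \"'Abd al-Raḥmān ibn 'Umar Ṣūfī\"]",  # stringified list
    ],
})

# normalize: wrap bare strings, then parse every cell into a real list
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)

out = (df.explode('subject_name_namePart')
         .groupby('location_shelfLocator')
         .agg(lambda x: x.unique().tolist()))
print(out)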

pandas: split pandas columns of unequal length list into multiple columns

I have a dataframe with one column containing lists of unequal length, which I want to split into multiple columns (the item values will become the column names). An example is given below.
I have done this through iterrows, iterating through the rows and examining the list in each row. It seems workable since my dataframe has only a few rows, but I wonder if there are any cleaner methods.
I have also tried additional_df = pd.DataFrame(venue_df.location.values.tolist())
However, the list breaks down as shown below.
Thanks for your help.
Can you try this code? It is built assuming venue_df.location contains the lists you have shown in the cells.
# 1 if the venue appears in the row's list, else 0
venue_df['school'] = venue_df.location.apply(lambda x: int('school' in x))
venue_df['office'] = venue_df.location.apply(lambda x: int('office' in x))
venue_df['home'] = venue_df.location.apply(lambda x: int('home' in x))
venue_df['public_area'] = venue_df.location.apply(lambda x: int('public_area' in x))
Hope this helps!
First, let's explode your location column so we can work toward your wanted end result:
s = venue_df['location'].explode()
Then let's use crosstab on that series to get the end result (crosstab needs separate index and columns arguments, hence s.index):
import pandas as pd
pd.crosstab(s.index, s)
I didn't test it out because I don't know your base df.
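A quick check of that idea on toy data (the frame below is invented, since the base df was not shown):

import pandas as pd

venue_df = pd.DataFrame({
    'location': [['school', 'home'], ['office'], ['home', 'public_area']]
})

s = venue_df['location'].explode()
one_hot = pd.crosstab(s.index, s)
print(one_hot)
# location  home  office  public_area  school
# 0            1       0            0       1
# 1            0       1            0       0
# 2            1       0            1       0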

Pandas Dataframe: How to get the cell instead of its value

I have a task to compare two dataframes with the same column names but different sizes; call them previous and current. I am trying to get the difference between previous and current in the Quantity and Booked columns and highlight it in yellow. The common key between the two dataframes is the 'SN' column.
I have coded the following:
for idx, rows in df_n.iterrows():
    if rows["Quantity"] == rows['Available'] + rows['Booked']:
        continue
    else:
        rows["Quantity"] = rows["Quantity"] - rows['Available'] - rows['Booked']
        df_n.loc[idx, 'Quantity'].style.applymap('background-color: yellow')
        # pdb.set_trace()
        if (df_o['Booked'][df_o['SN'] == rows["SN"]] != rows['Booked']).bool():
            df_n.loc[idx, 'Booked'].style.apply('background-color: yellow')
I realise I have a few problems here and need some help.
df_n.loc[idx, 'Quantity'] returns a value instead of a DataFrame. How can I get a DataFrame from one cell? Do I have to use pd.DataFrame(data=df_n.loc[idx, 'Quantity'], index=idx, columns='Quantity')? Will this create a copy or update the reference?
How do I compare the SN of both dataframes? I'm looking for a better way to compare. One thing I could think of is to set the index on both dataframes and, when finished with them, reset it back.
My dataframes: [previous dataframe and current dataframe shown as screenshots in the original post]
To create a DataFrame from one cell you can try: df_n.loc[idx, ['Quantity']].to_frame().T
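For instance, on a toy frame (note the result is a copy, not a view, so writing to it will not update df_n):

import pandas as pd

df_n = pd.DataFrame({'Quantity': [5, 8], 'Booked': [1, 2]})

cell = df_n.loc[0, ['Quantity']].to_frame().T
print(type(cell))  # <class 'pandas.core.frame.DataFrame'>
print(cell)        # a one-row DataFrame holding just that cell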
For your second question, you can use df_n.merge(df_o, on='SN') to merge the dataframes and compare the columns.
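A sketch of how the merge could feed the highlighting (the frames and the Styler usage here are illustrative, not from the original post):

import pandas as pd

# hypothetical previous/current frames sharing the 'SN' key
df_o = pd.DataFrame({'SN': [1, 2], 'Quantity': [10, 5], 'Booked': [2, 1]})
df_n = pd.DataFrame({'SN': [1, 2], 'Quantity': [10, 7], 'Booked': [3, 1]})

merged = df_n.merge(df_o, on='SN', suffixes=('_new', '_old'))

def highlight_diff(row):
    styles = [''] * len(row)
    for col in ('Quantity', 'Booked'):
        if row[f'{col}_new'] != row[f'{col}_old']:
            # paint the changed "new" cell yellow
            styles[merged.columns.get_loc(f'{col}_new')] = 'background-color: yellow'
    return styles

styled = merged.style.apply(highlight_diff, axis=1)
# styled.to_excel('diff.xlsx')  # or render in a notebook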

Split and merge nested DataFrame in Python

I have a dataframe which has two columns, one of which contains another dataframe in each cell. It looks like below:
I want to have a dataframe with 3 columns, containing "Date_Region", "transformed_weight" and "Barcode", which would replicate each "Date_Region" row as many times as the length of its "Weights-Barcode" dataframe. The final dataframe should look like below:
This will do:
pd.concat(
    iter(final_df.apply(
        lambda row: row['Weights-Barcode'].assign(
            Date_Region=row['Date_Region'],
        ),
        axis=1,
    )),
    ignore_index=True,
)[['Date_Region', 'transformed_weight', 'Barcode']]
From the inside out:
final_df.apply(..., axis=1) will call the lambda function on each row.
The lambda function uses assign() to return the nested DataFrame from that row with a Date_Region column added, taking its value from the outer row.
Calling iter(...) on the resulting Series yields an iterable of DataFrames, each already including the added column.
Finally, pd.concat(...) concatenates them all together. I'm using ignore_index=True here to just reindex everything (your index doesn't seem meaningful, and not ignoring it would leave you with duplicates).
Lastly, I reorder the columns so the added Date_Region column becomes the leftmost one.
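A self-contained reproduction (the nested frames and values below are invented to mirror the question's structure):

import pandas as pd

inner1 = pd.DataFrame({'transformed_weight': [1.2, 3.4], 'Barcode': ['A1', 'A2']})
inner2 = pd.DataFrame({'transformed_weight': [5.6], 'Barcode': ['B1']})

final_df = pd.DataFrame({'Date_Region': ['2020-01_EU', '2020-01_US']})
final_df['Weights-Barcode'] = pd.Series([inner1, inner2])

flat = pd.concat(
    iter(final_df.apply(
        lambda row: row['Weights-Barcode'].assign(Date_Region=row['Date_Region']),
        axis=1,
    )),
    ignore_index=True,
)[['Date_Region', 'transformed_weight', 'Barcode']]
print(flat)
#   Date_Region  transformed_weight Barcode
# 0  2020-01_EU                 1.2      A1
# 1  2020-01_EU                 3.4      A2
# 2  2020-01_US                 5.6      B1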

pandas DataFrame.corr returns a one-by-one DF

I can't figure out what is wrong with the way I use the df.corr() function.
For a DF with 2 columns it returns only a 1×1 resulting DF.
In:
merged_df[['Citable per Capita','Citations']].corr()
Out:
[screenshot: a 1×1 resulting DF]
What can be the problem here? I expected to see as many rows and columns as there were columns in the original DF.
I found the problem: it was the wrong dtype of the first column's values.
To change the type of all the columns, use:
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
Note that apply returns a new DataFrame rather than modifying df in place, which is why the reassignment is necessary here.
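A minimal reproduction of the failure and the fix (column names borrowed from the question, data invented; note that recent pandas versions raise on non-numeric columns instead of silently dropping them, and errors='ignore' is deprecated there):

import pandas as pd

merged_df = pd.DataFrame({
    'Citable per Capita': ['0.5', '1.2', '0.9'],  # numbers stored as strings
    'Citations': [10, 25, 18],
})

# the string column is silently dropped, hence the 1x1 result
print(merged_df[['Citable per Capita', 'Citations']].corr().shape)  # (1, 1)

merged_df = merged_df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
print(merged_df[['Citable per Capita', 'Citations']].corr().shape)  # (2, 2)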