Aggregating DataFrame string columns expected to be the same - pandas

I am calling DataFrame.agg on a dataframe with various numeric and string columns. For string columns, I want the result of the aggregation to be (a) the value of an arbitrary row if every row has that same string value or (b) an error otherwise.
I could write a custom aggregation function to do this, but is there a canonical way to approach this?

You can test for numeric dtype and apply an aggregate such as sum; for string columns, return the first value if all values are the same, otherwise raise an error:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['s', 's3'], 'b': [5, 6]})

def f(x):
    if np.issubdtype(x.dtype, np.number):
        return x.sum()
    if x.eq(x.iat[0]).all():
        return x.iat[0]
    raise ValueError('not same strings values')

s = df.agg(f)
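A more concise variant (my sketch, not from the answer above) expresses the same-value check with Series.nunique, shown here on toy data where the string column does agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['s', 's'], 'b': [5, 6]})

def agg_same_or_sum(x):
    # Sum numeric columns; for others, require a single unique value.
    if np.issubdtype(x.dtype, np.number):
        return x.sum()
    if x.nunique(dropna=False) == 1:
        return x.iat[0]
    raise ValueError(f"column {x.name!r} has differing string values")

s = df.agg(agg_same_or_sum)
# s['a'] == 's', s['b'] == 11
```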

Related

Aggregating multiple data types in pandas groupby

I have a data frame with rows that are mostly translations of other rows e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator) and I'm trying to merge the rows together based on the identifier match. In some columns the Arabic doesn't contain a translation, but the same English value (e.g. for the language column both records might have ['ger'] which becomes ['ger', 'ger']) so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
    lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"['81055/vdc_100000000094.0x000093']",['ara'],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب', 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","['المُلكية العامة', 'Public Domain']","['كلاوديوس بطلميوس (بطليمو)', 'Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
If you cannot control the input, you need to normalize it first.
Something like this: here I convert the string values in subject_name_namePart to lists of strings.
from ast import literal_eval

# Wrap bare strings in list syntax so every cell parses as a list
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)
Then you can explode and aggregate:
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
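Putting the normalization and the explode/groupby steps together, here is a minimal runnable sketch on hypothetical toy data (the column values are made up, not the question's CSV):

```python
import pandas as pd
from ast import literal_eval

# Toy data standing in for the question's CSV (values are hypothetical).
df = pd.DataFrame({
    'location_shelfLocator': ['loc1', 'loc1'],
    'subject_name_namePart': ["Ptolemy", "['Ptolemy', 'al-Sufi']"],
})

# Wrap bare strings so every cell parses as a list, then parse.
mask = df['subject_name_namePart'].str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df['subject_name_namePart'].transform(literal_eval)

# Explode the lists into rows, then collect unique values per group.
out = (df.explode('subject_name_namePart')
         .groupby('location_shelfLocator')
         .agg(lambda x: x.unique().tolist()))
# out.loc['loc1', 'subject_name_namePart'] == ['Ptolemy', 'al-Sufi']
```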

Why pandas fillna function turns non empty values to empty values?

I'm trying to fill empty values with the element with max count after grouping the dataframe. Here is my code.
def fill_with_maxcount(x):
    try:
        return x.value_counts().index.tolist()[0]
    except Exception as e:
        return np.NaN

df_all["Surname"] = df_all.groupby(['HomePlanet','CryoSleep','Destination']).Surname.apply(lambda x: x.fillna(fill_with_maxcount(x)))
If an error occurs in the try block, the function returns np.NaN. I also tried logging inside fill_with_maxcount, but no exception is ever raised.
Before running these lines there are 294 NaN values. Afterwards the count has increased to 857, which means non-empty values were turned into NaN. I can't figure out why. In my experiments with print statements, the function returns a non-empty value (a string), so the problem must be in the DataFrame's apply or fillna. I have used this same method elsewhere without any problem.
Can someone give me a suggestion. Thank you
Finally found it after some testing with the code.
df_all.groupby(['HomePlanet','CryoSleep','Destination']).Surname.apply(lambda x : x.fillna(fill_with_maxcount(x)))
The above part returns a Series with filled values. However, rows where any of the grouping fields are empty are excluded from the groups, so those indexes come back as null. That Series is then assigned directly to the Surname column, so those values become null too.
As the solution I changed the code as the following.
def fill_with_maxcount(x):
    try:
        return x.value_counts().index.tolist()[0]
    except Exception as e:
        return np.NaN

def replace_only_null(x, z):
    for i in range(len(x)):
        # x[i] == np.NaN is always False (NaN != NaN), so test with pd.isna
        if x[i] is None or pd.isna(x[i]):
            yield z[i]
        else:
            yield x[i]

result_1 = df_all.groupby(['HomePlanet','CryoSleep','Destination']).Surname.apply(lambda x: x.fillna(fill_with_maxcount(x)))
replaced = pd.Series(np.array(list(replace_only_null(df_all.Surname, result_1))))
df_all.Surname = replaced
The replace_only_null function compares the result with the current Surname column and replaces only null values with the result retrieved from fill_with_maxcount.
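An alternative worth noting: pandas 1.1+ lets groupby keep NaN-keyed groups via dropna=False, which avoids losing those rows without a second pass. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Toy frame: one grouping key is NaN, which groupby drops by default.
df = pd.DataFrame({
    'HomePlanet': ['Earth', 'Earth', np.nan],
    'Surname': ['Smith', np.nan, 'Jones'],
})

def fill_with_maxcount(x):
    try:
        return x.value_counts().index.tolist()[0]
    except Exception:
        return np.nan

# dropna=False keeps the NaN-keyed group, so its rows are not nulled out;
# transform keeps the result aligned with the original index.
filled = (df.groupby('HomePlanet', dropna=False)['Surname']
            .transform(lambda x: x.fillna(fill_with_maxcount(x))))
# filled.tolist() == ['Smith', 'Smith', 'Jones']
```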

Confusion about modifying column in dataframe with pandas

I'm working on a Bangaluru House Price Data csv from Kaggle. There is a column called 'total_sqft'. In this column, there are values that are a range of numbers (e.g.: 1000-1500), and I want to identify all those entries. I created this function to do so:
def is_float(x):
    try:
        float(x)
    except:
        return False
    return True
I applied it to the column:
df3[~df3['total_sqft'].apply(is_float)]
This works, but I don't understand why this doesn't:
df3['total_sqft'] = ~df3['total_sqft'].apply(is_float)
This just returns 'False' for everything instead of the actual entries
Answer from comment:
In the first version you are selecting the rows where the applied function is true. In the second you are overwriting the column with the boolean values themselves. Tilde means negation, by the way.
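To make the difference concrete, a small sketch with hypothetical values:

```python
import pandas as pd

df3 = pd.DataFrame({'total_sqft': ['1000', '1000-1500', '2100']})

def is_float(x):
    try:
        float(x)
    except ValueError:
        return False
    return True

mask = ~df3['total_sqft'].apply(is_float)  # boolean Series: True for ranges
ranges = df3[mask]                         # selects the '1000-1500' row
# df3['total_sqft'] = mask would instead overwrite the column with booleans
```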

Remove a specific string value from the whole dataframe without specifying the column or row

I have a dataframe that has some cells with the value "?". This value causes an error ("Could not convert string to float: '?'") whenever I try to use the mutual information metric.
I already found a solution by simply using:
df.replace("?",0,inplace=True)
And it worked. But I'm wondering: if I wanted to remove the whole row when one of its cells has the value "?", how can I do that?
Note that I don't know which columns contain this value; it's spread across different columns, and that's why I can't use df.drop.
You can check for each cell if they are equal to "?" and then get a boolean series over rows that contain that character in any one of their cells. Then get the indices of rows that gave True and drop them:
has_ques_mark = df.eq("?").any(axis=1) # a boolean series
inds = has_ques_mark[has_ques_mark].index # row indices where above is True
new_df = df.drop(inds)
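The mask and index steps can also be collapsed into one boolean-indexing line; a minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '?', '3'], 'b': ['x', 'y', '?']})

# Keep only rows where no cell equals "?"
clean = df[~df.eq('?').any(axis=1)]
# clean has a single row: a='1', b='x'
```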
You can do it the following way:
df.drop(df.loc[df['column_name'] == "?"].index, inplace=True)
or in a slightly simpler syntax but maybe a bit less performant:
df = df.loc[df['column_name'] != "?"]

PySpark dataframe Pandas UDF returns empty dataframe

I'm trying to apply a pandas_udf to my PySpark dataframe for some filtering, following the groupby('Key').apply(UDF) method. To use the pandas_udf I defined an output schema and have a condition on the column Number. As an example, the simplified idea here is that I wish only to return the ID of the rows with odd Number.
This now brings up a problem that sometimes there is no odd Number in a group therefore the UDF just returns an empty dataframe, which is in conflict with the defined schema to return an int for Number.
Is there a way to solve this problem and only output and combine all the odd Number rows as a new dataframe?
schema = StructType([
    StructField("Key", StringType()),
    StructField("Number", IntegerType())
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def get_odd(df):
    odd = df.loc[df['Number'] % 2 == 1]
    return odd[['ID', 'Number']]
I came across this issue with empty DataFrames in some groups. I solved it by checking for an empty DataFrame and returning a DataFrame with the schema defined:
if df_out.empty:
    # change the schema as needed
    return pd.DataFrame({'fullVisitorId': pd.Series([], dtype='str'),
                         'time': pd.Series([], dtype='datetime64[ns]'),
                         'total_transactions': pd.Series([], dtype='int')})
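The schema-matching idea can be exercised without Spark, since the grouped-map body is plain pandas. A sketch adapted to the Key/Number schema above (the column selection is illustrative):

```python
import pandas as pd

def get_odd(pdf):
    # Return only odd-numbered rows; if none qualify, return an
    # empty frame whose dtypes still match the declared schema.
    odd = pdf.loc[pdf['Number'] % 2 == 1, ['Key', 'Number']]
    if odd.empty:
        return pd.DataFrame({'Key': pd.Series([], dtype='str'),
                             'Number': pd.Series([], dtype='int64')})
    return odd

out = get_odd(pd.DataFrame({'Key': ['a', 'a'], 'Number': [2, 4]}))
# out is empty, but 'Number' keeps an integer dtype
```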