Why does pandas' fillna function turn non-empty values into empty values? - pandas

I'm trying to fill empty values with the element with max count after grouping the dataframe. Here is my code.
import numpy as np

def fill_with_maxcount(x):
    try:
        # Most frequent (mode) value in the group
        return x.value_counts().index.tolist()[0]
    except Exception as e:
        return np.NaN

df_all["Surname"] = df_all.groupby(['HomePlanet','CryoSleep','Destination']).Surname.apply(lambda x: x.fillna(fill_with_maxcount(x)))
If an error occurs inside the try/except, the function returns np.NaN. I also tried logging the error inside fill_with_maxcount, but no exception is ever raised.
Before these lines run there are 294 NaN values; after the execution the count has increased to 857, which means the code has turned non-empty values into NaN. I can't figure out why. I did some experiments using print statements: the function returns a non-empty value (a string), so the problem should be with the DataFrame's apply or fillna function. But I have used this same method in other places without any problem.
Can someone give me a suggestion? Thank you.

Finally found it after some testing with the code.
df_all.groupby(['HomePlanet','CryoSleep','Destination']).Surname.apply(lambda x : x.fillna(fill_with_maxcount(x)))
The above part returns a Series with the filled values. However, rows where any of the fields used for grouping are empty are never passed to the function: groupby drops NaN keys by default, so those indexes are simply missing from the returned Series. When that Series is assigned directly to the Surname column, the missing indexes become null too.
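A tiny reproduction with hypothetical data shows the effect:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', np.nan], 'val': ['x', np.nan, 'y']})
filled = df.groupby('key').val.apply(lambda s: s.fillna('x'))
df['val'] = filled  # row 2 has a NaN key, so it is missing from `filled`
print(df)           # and its 'y' is overwritten with NaN on assignment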
As the solution I changed the code as the following.
def fill_with_maxcount(x):
    try:
        return x.value_counts().index.tolist()[0]
    except Exception as e:
        return np.NaN

def replace_only_null(x, z):
    for i in range(len(x)):
        # x[i] == np.NaN is always False (NaN never compares equal), so
        # detect missing values with pd.isna; rows dropped by the groupby
        # are absent from z, so the membership check guards the lookup
        if pd.isna(x[i]) and i in z.index:
            yield z[i]
        else:
            yield x[i]

result_1 = df_all.groupby(['HomePlanet','CryoSleep','Destination']).Surname.apply(lambda x: x.fillna(fill_with_maxcount(x)))
replaced = pd.Series(np.array(list(replace_only_null(df_all.Surname, result_1))))
df_all.Surname = replaced
The replace_only_null function compares the result with the current Surname column and replaces only the null values with the result retrieved by applying fill_with_maxcount.
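For reference, a shorter sketch of the same idea, assuming pandas >= 1.1 (which added the dropna parameter to groupby): keeping the NaN-keyed rows in a group of their own means their indexes survive into the result, so nothing is overwritten. Untested against the original data:

df_all["Surname"] = (
    df_all.groupby(['HomePlanet', 'CryoSleep', 'Destination'], dropna=False)
          .Surname
          .apply(lambda x: x.fillna(fill_with_maxcount(x)))
)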

Related

Aggregating multiple data types in pandas groupby

I have a data frame with rows that are mostly translations of other rows, e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator), and I'm trying to merge the rows together based on the identifier match. In some columns the Arabic row doesn't contain a translation but the same English value (e.g. for the language column both records might have ['ger'], which becomes ['ger', 'ger']), so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
    lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"[‘81055/vdc_100000000094.0x000093’] ",[‘ara’],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', ‘الكواكب’, 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","[‘المُلكية العامة’, ‘Public Domain’]","[‘كلاوديوس بطلميوس (بطليمو)’,’Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
If you don't have control over the input values, you need to fix them somehow.
Something like this: here I am converting the string values in subject_name_namePart into lists of strings.
from ast import literal_eval

# Rows whose value is a bare string rather than a list literal
mask = df.subject_name_namePart.str[0] != '['
# Wrap the bare strings in list syntax, then parse every cell into a real list
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)
Then, you can do (explode) + aggregation.
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
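If other columns can also mix scalars and lists, a more generic sketch applied to the original (pre-explode) frame avoids the per-column fixing by flattening inside the aggregation. Assumptions: agg_unique is a hypothetical helper, and each cell is either a scalar or a real Python list (not a stringified one):

def agg_unique(series):
    # Flatten scalar and list cells, de-duplicating while keeping order
    out = []
    for cell in series:
        for item in (cell if isinstance(cell, list) else [cell]):
            if item not in out:
                out.append(item)
    return out

df_merged = df.groupby('location_shelfLocator').agg(agg_unique)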

Aggregating DataFrame string columns expected to be the same

I am calling DataFrame.agg on a dataframe with various numeric and string columns. For string columns, I want the result of the aggregation to be (a) the value of an arbitrary row if every row has that same string value or (b) an error otherwise.
I could write a custom aggregation function to do this, but is there a canonical way to approach this?
You can test for numeric columns and apply an aggregate function like sum; for string columns, return the first value if all values are the same, otherwise raise an error:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['s', 's3'], 'b': [5, 6]})

def f(x):
    if np.issubdtype(x.dtype, np.number):
        return x.sum()
    else:
        if x.eq(x.iat[0]).all():
            return x.iat[0]
        else:
            raise ValueError('not same strings values')

s = df.agg(f)
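As written, this sample raises the ValueError, because column a holds two different strings. A quick sketch of the success case, with hypothetical data:

df_ok = pd.DataFrame({'a': ['s', 's'], 'b': [5, 6]})
print(df_ok.agg(f))
# a     s
# b    11
# dtype: object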

Selecting two sets of columns from a dataFrame with all rows

I have a dataFrame with 28 columns (features) and 600 rows (instances). I want to select all rows, but only columns from 0-12 and 16-27. Meaning that I don't want to select columns 12-15.
I wrote the following code, but it doesn't work: it throws a syntax error at the : in 0:12 and 16:. Can someone help me understand why?
X = df.iloc[:,[0:12,16:]]
I know there are other ways for selecting these rows, but I am curious to learn why this one does not work, and how I should write it to work (if there is a way).
For now, I have written it as:
X = df.iloc[:,0:12]
X = X + df.iloc[:,16:]
This seems to return an incorrect result: I have already treated the NaN values of df, but when I use this code, X includes lots of NaNs!
Thanks for your feedback in advance.
You can use np.r_ to concatenate the slices. Note that np.r_ needs explicit stop values, so the open-ended 16: has to be spelled out (the frame has 28 columns):
x = df.iloc[:, np.r_[0:12, 16:28]]
iloc has these allowed inputs (from the docs):
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
What you're passing to iloc in X = df.iloc[:,[0:12,16:]] is neither a list of integers nor a slice of ints; slice syntax like 0:12 is only valid directly inside square brackets, so a list literal of slices is a syntax error. You need to convert those slices into a single list of integers, and a convenient way to do that is the numpy.r_ function.
X = df.iloc[:, np.r_[0:13, 16:28]]
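As for the workaround in the question: df1 + df2 is element-wise addition, which aligns on both index and columns and fills every non-overlapping label with NaN; that is where the unexpected NaNs came from. To place the two slices side by side positionally, a concat sketch works instead:

import pandas as pd

# Concatenate the two column slices along the columns (same row index)
X = pd.concat([df.iloc[:, 0:12], df.iloc[:, 16:]], axis=1)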

Converting multiple types of values to integer values

I have a Pandas DF in which I want to convert column values to integer values. The information should be stored in meters but can be stored in kilometers as well, resulting in the following possible values:
23145 (correct)
23145.0 (.0 should be removed)
101.1 (should be multiplied *1000)
47,587 (should be multiplied *1000)
'No value known'
I tried different options for converting the data types, but I always seem to break the existing integers, and I cannot check for them correctly because the column's dtype is 'object'. Sometimes faulty values or strings block the conversion as well.
Any ideas how to check whether a value is already an integer (and leave it alone), remove the .0 where applicable, and multiply where applicable?
I also have some other columns of integers (e.g. the number 22321323) where a .0 is randomly appended (e.g. 22321323.0). How can I convert these values so they don't include the .0?
If you use .apply on the column, you should be able to convert these values very easily by casing on their type. For example:
import pandas as pd

def convert(x):
    if isinstance(x, int):
        return x
    elif isinstance(x, float):
        return int(x)
    else:
        # Defaults to 0 when not convertible
        print(x)
        return 0

df = pd.DataFrame({'col': [23145, 23145.0, 'No value known']})
df['col'] = df['col'].apply(convert)
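A fuller sketch that also handles the kilometre cases from the question. Assumption (suggested by the examples, not stated outright): floats with a fractional part and comma-decimal strings are kilometres, and anything unparseable becomes 0:

def convert_metres(x):
    if isinstance(x, int):
        return x                      # already metres
    if isinstance(x, float):
        # 23145.0 -> 23145, but 101.1 is kilometres -> 101100
        return int(x) if x.is_integer() else int(x * 1000)
    if isinstance(x, str):
        try:
            # '47,587' uses a comma as the decimal separator
            return int(float(x.replace(',', '.')) * 1000)
        except ValueError:
            return 0                  # e.g. 'No value known'
    return 0

print([convert_metres(v) for v in [23145, 23145.0, 101.1, '47,587', 'No value known']])
# [23145, 23145, 101100, 47587, 0]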

Pandas get_dummies for a column of lists where a cell may have no value in that column

I have a column in a dataframe where all the values are lists (usually a list of one item per row), so I would like to use get_dummies to one-hot encode all the values. However, a few rows may have no value in that column. I have seen it originally as a NaN, and I have also replaced that NaN with an empty list, but in either case I do not see 0s and 1s in the get_dummies result; instead each generated column is blank (I would expect each generated column to be 0).
How do I get get_dummies to work with an empty list?
# create column from dict where value will be a list
X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
# line to replace nan in sponsor_list column with empty list
X.loc[X['sponsor_list'].isnull(),['sponsor_list']] = X.loc[X['sponsor_list'].isnull(),'sponsor_list'].apply(lambda x: [])
# use of get_dummies to encode the sponsor_list column
X = pd.concat([X, pd.get_dummies(X.sponsor_list.apply(pd.Series).stack()).sum(level=0)], axis=1)
Example:
111th-congress_senate-bill_3695.txt  False  ['Menendez,_Robert_[D-NJ].txt']
112th-congress_house-bill_3630.txt   False  []
111th-congress_senate-bill_852.txt   False  ['Vitter,_David_[R-LA].txt']
114th-congress_senate-bill_2832.txt  False  ['Isakson,_Johnny_[R-GA].txt']
107th-congress_senate-bill_535.txt   False  ['Bingaman,_Jeff_[D-NM].txt']
I want to one-hot encode the third column. The data item in the 2nd row has no person associated with it, so I need that row to be encoded with all 0s. The reason the third column needs to be a list is that I need to do this to a related column as well, where a row can have [0, n] values and n can be 5, 10, or even 20.
from sklearn.preprocessing import MultiLabelBinarizer

X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
X.loc[X['sponsor_list'].isnull(), ['sponsor_list']] = X.loc[X['sponsor_list'].isnull(), 'sponsor_list'].apply(lambda x: [])
mlb = MultiLabelBinarizer()
X = X.join(pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                        columns=mlb.classes_,
                        index=X.index))
I used a MultiLabelBinarizer to capture what I was trying to do. I still replace NaN with an empty list before applying it, but then fit_transform creates the 0/1 values, which can result in no 1s in a row, or many 1s in a row.
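A minimal demonstration with hypothetical data: an empty list yields an all-zero row, and multiple entries yield multiple 1s:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

X = pd.DataFrame({'sponsor_list': [['a.txt'], [], ['a.txt', 'b.txt']]})
mlb = MultiLabelBinarizer()
out = pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                   columns=mlb.classes_, index=X.index)
print(out)
#    a.txt  b.txt
# 0      1      0
# 1      0      0
# 2      1      1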