Pandas DataFrame: sort_values by an index with empty strings - pandas

I have a pandas DataFrame with a multi-level index, and I want to sort by one of the index levels. That level holds float values, but occasionally also a few empty strings, which I want to be treated as NaN.
import pandas as pd
df = pd.DataFrame(dict(x=[1,2,3,4]), index=[1,2,3,''])
df.index.name = 'i'
df.sort_values('i')
TypeError: '<' not supported between instances of 'str' and 'int'
One way to solve the problem is to replace the empty strings with nan, do the sort, and then replace nan with empty strings again.
I am wondering if there is any way to tweak sort_values so that it treats empty strings as NaN.
Why are there empty strings in the first place?
In my application, the data that is read has missing values, which are read as np.nan. But np.nan values cause problems with groupby, so they are replaced with empty strings. I wish we had a constant like nan that is treated like an empty string by groupby and like nan in numeric operations.

I am wondering if there is any way to tweak sort_values so that it treats empty strings as NaN.
In pandas, missing values are not empty strings; they are only written out as empty strings when a DataFrame with missing values is saved.
By the way, the main problem is the mixed values: numbers together with strings (the empty values). It is best to convert all strings to numeric to avoid it.
You can replace the empty strings with missing values using rename:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(x=[1,2,3,4]), index=[1,2,3,''])
df.index.name = 'i'
df = df.rename({'':np.nan})
df = df.sort_values('i')
print (df)
     x
i
1.0  1
2.0  2
3.0  3
NaN  4
If the original data cannot be changed, a possible solution is to get the positions of the sorted values with Index.argsort and reorder the rows with DataFrame.iloc:
df = df.iloc[df.rename({'':np.nan}).index.argsort()]
print (df)
   x
i
1  1
2  2
3  3
   4
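If pandas 1.1 or later is available, another option is the key argument of sort_index, which sorts by a transformed copy of the index and leaves the original labels (including the empty string) untouched. A minimal sketch, assuming a single-level index and shuffled sample data that is not from the question:

```python
import pandas as pd

# Sample data (an assumption for illustration): numeric labels plus one empty string.
df = pd.DataFrame(dict(x=[3, 4, 1, 2]), index=[3, '', 1, 2])
df.index.name = 'i'

# The key function receives the index and returns the values to sort by.
# pd.to_numeric(..., errors='coerce') turns the empty string into NaN,
# and NaN is placed last by default (na_position='last').
out = df.sort_index(key=lambda idx: pd.to_numeric(idx, errors='coerce'))
print(out)
```

The empty string stays in the index of the result, so nothing has to be replaced back after sorting.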

Related

Convert a float column with nan to int pandas

I am trying to convert a float pandas column with nans to int format, using apply.
I would like to use something like this:
df.col = df.col.apply(to_integer)
where the function to_integer is given by
def to_integer(x):
    if np.isnan(x):
        return np.nan
    else:
        return int(x)
However, when I attempt to apply it, the column remains the same.
How could I achieve this without having to use the standard technique of dtypes?
You can't have NaN in an int column: NaN is a float (unless you use an object dtype, which is not a good idea since you'll lose many vectorized capabilities).
You can, however, use the new nullable integer type (which uses NA for missing values).
Conversion can be done with convert_dtypes:
import pandas as pd
df = pd.DataFrame({'col': [1, 2, None]})
df = df.convert_dtypes()
# type(df.at[0, 'col'])
# numpy.int64
# type(df.at[2, 'col'])
# pandas._libs.missing.NAType
output:
    col
0     1
1     2
2  <NA>
Not sure how you would achieve this without using dtypes. Sometimes when loading data, you may have a column that contains mixed dtypes. Loading a column with one dtype and attempting to turn it into mixed dtypes is not possible though (at least, not that I know of).
So I will echo what @mozway said and suggest you use nullable integer data types,
e.g.
df['col'] = df['col'].astype('Int64')
(note the capital I)
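To make the behaviour concrete, here is a small sketch on made-up data (the Series below is an assumption, not the asker's column). It shows that apply leaves the column as float64, because a result mixing Python ints and NaN is inferred as float again, while astype('Int64') gives a nullable integer column:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan], name='col')

def to_integer(x):
    # Same idea as in the question: keep NaN, cast everything else to int.
    return np.nan if np.isnan(x) else int(x)

# The ints are converted back to float because NaN is a float.
applied = s.apply(to_integer)
print(applied.dtype)

# Nullable integer dtype: integers are kept, the missing value becomes <NA>.
nullable = s.astype('Int64')
print(nullable)
```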

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and others are arrays. I simply want to drop all of the rows that contain an array (object datatype, I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe look like. The values that show up like a list are the ones I want to remove. The dataset is a couple million rows, so I just need to write code that removes all of the array-like values in that specific dataframe column, if that makes sense.
df = df[df.origin_airport_ID.str.contains(',') == False]
Next time, please consider giving us a data sample as text instead of a figure. It's easier for us to test your example.
Original data:
    ITIN_ID             ORIGIN_AIRPORT_ID
0  20194146                         10397
1  20194147                         10397
2  20194148                         10397
3  20194149  [10397, 10398, 10399, 10400]
4  20194150                         10397
In your case, you can use the pandas function pd.to_numeric:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted to a number with NaN (Not a Number), so we get:
    ITIN_ID  ORIGIN_AIRPORT_ID
0  20194146            10397.0
1  20194147            10397.0
2  20194148            10397.0
3  20194149                NaN
4  20194150            10397.0
To remove these rows, now just use .dropna:
df = df.dropna().astype('int')
Which results in your desired DataFrame:
    ITIN_ID  ORIGIN_AIRPORT_ID
0  20194146              10397
1  20194147              10397
2  20194148              10397
4  20194150              10397
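Putting the two steps together on a small sample. This is only a sketch: it assumes the list-like cells are stored as strings (as they would be after reading a CSV), and it passes subset= to dropna so that only missing values in ORIGIN_AIRPORT_ID cause a row to be dropped, which is a small variation on the code above:

```python
import pandas as pd

# Sample rows modelled on the data above; the list-like cell is assumed to be a string.
df = pd.DataFrame({
    'ITIN_ID': [20194146, 20194147, 20194148, 20194149, 20194150],
    'ORIGIN_AIRPORT_ID': ['10397', '10397', '10397',
                          '[10397, 10398, 10399, 10400]', '10397'],
})

# Cells that cannot be parsed as a single number become NaN.
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')

# Drop only the rows where this column is NaN, then restore an integer dtype.
df = (df.dropna(subset=['ORIGIN_AIRPORT_ID'])
        .astype({'ORIGIN_AIRPORT_ID': 'int64'}))
print(df)
```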

Parse dictionary inside dataframe

One column of my df holds either (1) a nested dictionary or (2) NaN as its value.
The dicts have 2 key-value pairs, like this one:
{'value': '1', 'info': {....}}
I only want to get the value of "value"; the value of "info" is not useful. Where the cell is NaN, it can stay NaN.
What is the easiest way to achieve this?
BTW, I tried df_september_p1['that_column_name']==np.nan
and df_september_p1['that_column_name']=='nan',
which yield the same Boolean values. The weird thing is that I can see the 2nd row has NaN as its value, but the result is False for the 2nd row… I don't get why.
You can use Series.str.get, which works well with dictionaries and with missing values (NaN):
df_september_p1['val'] = df_september_p1['that_column_name'].str.get('value')
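A small sketch on made-up data (the frame and its values are assumptions for illustration). It also shows why the == np.nan comparison from the question returns False everywhere: NaN never compares equal to anything, including itself, so missing values should be tested with isna() instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one dict cell and one missing cell.
df = pd.DataFrame({
    'that_column_name': [{'value': '1', 'info': {'a': 1}}, np.nan],
})

# For dict elements, .str.get looks up the key; NaN elements stay NaN.
df['val'] = df['that_column_name'].str.get('value')

# NaN != NaN, so an equality test never finds missing values...
eq_mask = df['that_column_name'] == np.nan
# ...while isna() does.
na_mask = df['that_column_name'].isna()
print(df['val'], eq_mask, na_mask, sep='\n')
```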

pandas valid null values

I am looking for the list of valid null values that pandas fillna() method will replace, e.g. 'NaN', 'NA', 'NULL', 'NaT'. I could not find it in the documentation.
The fillna method only replaces actual missing values, represented as NaN, NaT or None, and not strings ('NaN' or any other string).
Before using fillna you can check what will be replaced in a column COL of your dataframe df using isnull():
df.loc[df['COL'].isnull()]
will show you the subset of your dataframe for which the column 'COL' is NaN/NaT/None.
You can replace strings with NaN using replace. Say you have strings like "NAN":
from numpy import nan
df = df.replace('NAN', nan)
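The difference can be checked on a toy column (the data and the 'missing' fill value below are assumptions for illustration). The string 'NAN' survives fillna until it has been converted to a real NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL': ['a', np.nan, None, 'NAN']})

# isnull() flags only the real missing values, not the string 'NAN'.
mask = df['COL'].isnull()

# fillna therefore leaves the string 'NAN' in place.
filled = df.fillna('missing')

# Converting the placeholder string to NaN first makes fillna catch it too.
cleaned = df.replace('NAN', np.nan).fillna('missing')
print(mask.tolist())
print(filled['COL'].tolist())
print(cleaned['COL'].tolist())
```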

Convert floats to ints in pandas dataframe

I have a pandas dataframe with a column ‘distance’ and it is of datatype ‘float64’.
Distance
14.827379
0.754254
0.2284546
1.833768
I want to convert these numbers to whole numbers (14,0,0,1). I tried with this but I get the error “ValueError: Cannot convert NA to integer”.
df['distance(kmint)'] = result['Distance'].astype('int')
Any help would be appreciated!!
I filtered out the NaN's from the dataframe using this:
result = result[np.isfinite(result['distance(km)'])]
Then, I was able to convert from float to int.
An alternative approach would be to convert the NaN values as part of your data import and cleaning process. A more general solution could involve specifying which values count as NaN in the read_table call by setting the na_values argument. What you want to make sure of is that there isn't some malformed data, like 1.5km, in one of your fields that is getting picked up as a NaN value.
pandas.read_table(..., na_values=None, keep_default_na=True, na_filter=True, ....)
Subsequently, once the dataframe is populated and the NaN values are identified properly, you can use the fillna method to substitute in zeros or the values that you identified as your distances.
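As a rough sketch of that workflow, the snippet below reads a tiny in-memory file. The file contents and the 'unknown' marker are made up for illustration; only read_table, na_values and fillna come from the answer.

```python
import io
import pandas as pd

# Hypothetical raw file: 'unknown' marks a missing distance.
raw = io.StringIO("Distance\n14.827379\nunknown\n0.754254\n")

# na_values adds extra strings to the default set that is parsed as NaN,
# so the column is read as float instead of object.
result = pd.read_table(raw, na_values=['unknown'])

# Once NaN is identified properly, substitute a value of your choice.
result['Distance'] = result['Distance'].fillna(0)
print(result)
```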
Finally, it would probably be best to use notnull rather than isfinite to filter the rows before converting them to integers.
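A minimal sketch of the filter-then-convert approach, using the distances from the question plus one NaN added for illustration (the distance_int column name is hypothetical). Note that astype('int') truncates toward zero rather than rounding:

```python
import numpy as np
import pandas as pd

result = pd.DataFrame(
    {'Distance': [14.827379, 0.754254, np.nan, 0.2284546, 1.833768]}
)

# Keep only the rows that have a real value; copy so the new column
# is added to an independent frame.
result = result[result['Distance'].notnull()].copy()

# With no NaN left, the cast to int succeeds and drops the fractional part.
result['distance_int'] = result['Distance'].astype('int')
print(result)
```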