How to retain NaN values using pandas factorize()?

I have a Pandas data frame with several columns, with some columns comprising categorical entries. I convert (or, encode) these entries to numerical values using factorize() as follows:
for column in df.select_dtypes(['category']):
    df[column] = df[column].factorize(na_sentinel=None)[0]
The columns have several NaN entries, so I let na_sentinel=None to retain the NaN entries. However, the NaN values are not retained (they get converted to numerical entries), which is not what I desire. My Pandas version is 1.3.5. Is there something I am missing?

By default, factorize encodes NaN values as -1, so they stay identifiable in the result even though the codes are integers. Passing na_sentinel=None does the opposite: NaN is treated as an ordinary value and receives its own non-negative code, which is what you are seeing. You probably want to keep the default, which is:
na_sentinel=-1
See
https://pandas.pydata.org/docs/reference/api/pandas.factorize.html
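As a minimal sketch of the point above (the sample Series and the variable names are made up for illustration), the default -1 codes can be mapped back to NaN after encoding, so the missing entries stay missing:

```python
import numpy as np
import pandas as pd

# Hypothetical sample column with one missing entry.
s = pd.Series(["a", np.nan, "b", "a"])

# With the default sentinel, missing values are coded as -1.
codes, uniques = pd.factorize(s)

# Turn the -1 codes back into NaN; the result becomes a float column.
encoded = pd.Series(codes, index=s.index).where(codes != -1)
print(encoded)
```

The same idea should carry over to the loop in the question by applying the where step to each encoded column.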

Related

Trying to assign string values to column. But all I get is Nan. What to do?

I'm trying to update a Pandas df column with a column from another df, which changes daily.
What I mean to do is to transplant what's in this:
Daily schedule of workers for all year
To this:
Schedule of workers for today, column in red
I'd like to do it every day until July. It usually worked quite well. In 2023, the calendar was made with slight changes in the format, and I can't make Pandas read the data as I'd like.
I cannot actually assign a column from one DataFrame to a column of the other. The code is accepted by Python, but all I get is NaN, not the strings I hoped for. All the values in the other column are strings. What am I doing wrong?
Thanks!
Here's my code:
from datetime import datetime
today = datetime.today().strftime("%d/%m/%Y")
df["status"] = df_diario[today].astype('str')
df["status"]
nome
Ad NaN
Al NaN
An NaN
Ca NaN
Cl NaN
Da NaN
El NaN
Ga NaN
Hu NaN
Jo NaN
Jo NaN
Jo NaN
Jo NaN
Le NaN
Lu NaN
Lu NaN
Lui NaN
Ma NaN
Mar NaN
Mau NaN
Om NaN
Pa NaN
Pau NaN
Pe NaN
Ro inativo
Ro NaN
Ro NaN
Ron NaN
Vi NaN
Name: status, dtype: object
What is the variable today that you use as the DataFrame accessor? Also, please format your question so that others can read it clearly and help you better.
However, if you look at your output, there is one line that is not NaN: it is inativo. It could be that the two DataFrames have different indices. Assigning a column this way aligns on the index, so both DataFrames need identical indices.
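A small sketch of the alignment behavior described in this comment, using made-up names and statuses rather than the asker's real data:

```python
import pandas as pd

# Hypothetical target frame and source column; only the label "Ro" is shared.
df = pd.DataFrame({"x": [1, 2, 3]}, index=["Ad", "Al", "Ro"])
other = pd.Series(["ativo", "ativo", "inativo"], index=["Zz", "Yy", "Ro"])

# Assignment aligns on index labels, so unmatched rows become NaN.
df["status"] = other

# Assigning the raw values ignores labels and fills by position instead.
df["status_by_position"] = other.to_numpy()
print(df)
```

Positional assignment is only safe when both objects are known to be in the same row order.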
Found the answer! It actually was in MS Excel. If you type some text that looks like a date, it will automatically define it in a date format. For this reason, I could not transplant my column as I'd like.
Pandas "inherits" the date format from the Excel spreadsheet, so to speak. It would import some of my dates as datetime objects, unrecognizable by the code I had written. It didn't import all of them as such because I had done the 2022 table with Python itself, from Jan 24th on. Because I had manually typed 10/01/2023, in this year's table, Pandas interpreted it as datetime and thus my code didn't work. To prevent the mess, I had to type an apostrophe before the date in an Excel cell.
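An alternative to editing the spreadsheet could be to normalize the column labels after loading. This is only a sketch under the assumption that some labels arrive as datetime objects and others as "dd/mm/YYYY" strings; the frame contents and the helper name normalize_label are hypothetical:

```python
from datetime import datetime

import pandas as pd

# Hypothetical schedule: one label parsed as a datetime, one kept as text.
df_diario = pd.DataFrame(
    [["ativo", "inativo"]],
    columns=[datetime(2023, 1, 10), "11/01/2023"],
)

def normalize_label(label):
    # Convert datetime-like labels to the string format used for lookups.
    if isinstance(label, datetime):
        return label.strftime("%d/%m/%Y")
    return label

df_diario.columns = [normalize_label(c) for c in df_diario.columns]
print(list(df_diario.columns))
```

After this step, a lookup such as df_diario["10/01/2023"] no longer depends on how Excel stored the header cell.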

Nan columns not dropping

I have a dataset that contains some NaN values. I tried this to drop them, but they still show up:
df['string_tweet'].dropna(inplace=True)
df['string_tweet']
This is the output:
113 apc started let ’ finish started
235 upon vote katsina , apc government left state ...
1796 two people contesting office , one person win ...
1798 deji said peter obi jumping church church.na d...
1850 amnesia set , lem say deleting incriminating p...
...
378726 nan
378727 nan
378728 nan
378729 nan
378730 nan
Name: string_tweet, Length: 63664, dtype: object
Please check the length and the row labels; they do not correspond.
If you have proper NaN values, use the subset argument to work on the whole dataframe:
df.dropna(subset=['string_tweet'], inplace=True)
If your dataframe includes "nan" strings, as suggested by @99_m4n, you may filter them out using:
df = df[df['string_tweet']!='nan']
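To illustrate the two cases above, here is a short sketch on made-up data that contains both a real NaN and the literal string "nan":

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing a real missing value and the text "nan".
df = pd.DataFrame({"string_tweet": ["apc started", np.nan, "nan", "two people"]})

# Step 1: drop rows where the value is a real NaN.
df = df.dropna(subset=["string_tweet"])

# Step 2: drop rows where the value is the literal string "nan".
df = df[df["string_tweet"] != "nan"]
print(df)
```

Literal "nan" strings typically appear when a column with missing values has been converted with astype(str) earlier in the pipeline.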
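My guess is that the nan entries shown are not recognized as missing values. Try converting the column before dropping the NaN (note that astype(float) only works if every entry is numeric; it raises a ValueError on text such as these tweets):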
df['string_tweet']=df['string_tweet'].astype(float)

While pre-processing I have a huge number of columns with NaN values. Is there a way to replace the NaN in all columns with "Zero" or 'N'?

For example, I collected all the column names in a list. A sample: ['ib_home_market_value', 'ib_comm_involve_don_cultural', 'ib_comm_involve_political', 'ib_home_furnishings', 'ib_magazines', 'ib_womens_apparel']
In the same way, I have 200+ columns.
Total rows: 10 lakh (1,000,000).
Sample value counts for ib_comm_involve_don_cultural: Y = 309639, NaN = 690361.
I need to do the same for all columns, changing NaN to either 'Zero' or 'N'. I need a function that replaces the NaN values in all the columns.
I am doing the preprocessing for a clustering model.
Sorry for the unreadable code. This is what I tried, applying fillna:
for i in list1:
    df1[i] = df1[i].fillna('N')
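One possible way to avoid the loop is to pass fillna a dict that maps each column to its replacement. The sketch below uses a tiny made-up frame, and the split into flag columns (filled with 'N') and numeric columns (filled with 0) is an assumption about the data, not something stated in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column sample; the real frame has 200+ columns.
df1 = pd.DataFrame({
    "ib_comm_involve_don_cultural": ["Y", np.nan, np.nan],
    "ib_home_market_value": [1.0, np.nan, 3.0],
})

# Assumed split: which columns get 'N' and which get 0.
flag_cols = ["ib_comm_involve_don_cultural"]
numeric_cols = ["ib_home_market_value"]

# Build one column-to-value mapping and fill everything in a single call.
fill_values = {**{c: "N" for c in flag_cols}, **{c: 0 for c in numeric_cols}}
df1 = df1.fillna(value=fill_values)
print(df1)
```

With 200+ columns, the two lists could be derived from df1.select_dtypes instead of being typed by hand.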

Convert floats to ints in pandas dataframe

I have a pandas dataframe with a column ‘distance’ and it is of datatype ‘float64’.
Distance
14.827379
0.754254
0.2284546
1.833768
I want to convert these numbers to whole numbers (14,0,0,1). I tried with this but I get the error “ValueError: Cannot convert NA to integer”.
df['distance(kmint)'] = result['Distance'].astype('int')
Any help would be appreciated!!
I filtered out the NaN's from the dataframe using this:
result = result[np.isfinite(result['distance(km)'])]
Then, I was able to convert from float to int.
An alternative approach would be to handle the NaN values as part of your data import and cleaning process. A more general solution could involve specifying which values count as NaN in the read_table call by setting the na_values parameter. What you want to make sure of is that there isn't some malformed data, like 1.5km, in one of your fields that is getting picked up as a NaN value.
pandas.read_table(..., na_values=None, keep_default_na=True, na_filter=True, ....)
Subsequently, once the dataframe is populated and the NaN values are identified properly, you can use the fillna method to substitute in zeros or the values that you identified as your distances.
Finally, it would probably be best to use notnull rather than isfinite when filtering the rows before converting them to integers.
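The two routes discussed in these answers could look roughly like this. The sample values follow the question, with one NaN added for illustration; the column name distance_int is made up, and the nullable "Int64" dtype is an option not mentioned in the original answers:

```python
import numpy as np
import pandas as pd

# Sample distances from the question plus one hypothetical missing value.
result = pd.DataFrame({"Distance": [14.827379, 0.754254, np.nan, 1.833768]})

# Route 1: keep only non-null rows, then truncate to plain integers.
valid = result[result["Distance"].notnull()].copy()
valid["distance_int"] = valid["Distance"].astype(int)

# Route 2: keep every row by using the nullable integer dtype.
# Truncate first, since a float with a fractional part cannot be cast safely.
result["distance_int"] = np.trunc(result["Distance"]).astype("Int64")
print(valid)
print(result)
```

Route 2 keeps the missing row as a missing value instead of dropping it or filling it with zero.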

What is the functionality of the filling method when reindexing?

When reindexing, say, 1-minute data to daily data (e.g. an index for daily prices at 16:00), if there is no 1-minute data for the 16:00 timestamp on a given day, we would want to forward fill from the last non-null 1-minute data. In the following case, there is no 1-minute data before 16:00 on the 13th, and the last 1-minute data comes from the 10th.
When using reindex with method='ffill', wouldn't one expect the following code to fill in the value on the 13th at 16:00? Inspecting daily1 shows that it is missing however.
import pandas as pd
import numpy as np
hf_index = pd.date_range(start='2013-05-09 9:00', end='2013-05-13 23:59', freq='1min')
hf_prices = np.random.rand(len(hf_index))
hf = pd.DataFrame(hf_prices, index=hf_index)
hf.loc['2013-05-10 18:00':'2013-05-13 18:00', :] = np.nan
hf.plot()
ind_daily = pd.date_range(start='2013-05-09 16:00', end='2013-05-13 16:00', freq='B')
print(ind_daily.values)
daily1 = hf.reindex(index=ind_daily, method='ffill')
To fill as one (or rather I) would expect, I need to do this:
daily2 = daily1.ffill()
If this is the case, what is the fill method in reindex actually doing? It is not clear to me from the pandas documentation alone. It seems to me I should not need the line above.
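For what it is worth, one way to get the fill described at the top of the question seems to be forward filling the 1-minute data before reindexing. The sketch below rebuilds the setup with deterministic prices (a running counter instead of random numbers, and a column named price, both chosen for illustration):

```python
import numpy as np
import pandas as pd

# Same 1-minute index as in the question, with a counter as the price.
hf_index = pd.date_range("2013-05-09 09:00", "2013-05-13 23:59", freq="1min")
hf = pd.DataFrame({"price": np.arange(len(hf_index), dtype=float)}, index=hf_index)
hf.loc["2013-05-10 18:00":"2013-05-13 18:00", "price"] = np.nan

ind_daily = pd.date_range("2013-05-09 16:00", "2013-05-13 16:00", freq="B")

# reindex alone: the 16:00 label on the 13th exists and holds NaN, so NaN is kept.
daily1 = hf.reindex(ind_daily, method="ffill")

# Forward fill the source first, then reindex: the 13th picks up the last
# non-null 1-minute value, which is 17:59 on the 10th.
daily_filled = hf.ffill().reindex(ind_daily, method="ffill")
print(daily1)
print(daily_filled)
```

Note that this differs from daily1.ffill(), which would carry forward the 16:00 value of the 10th rather than the last 1-minute observation.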
I am posting my comment from GitHub here as well:
The current behavior makes more sense in my opinion. NaN values can be valid, "actual" values in some scenarios, and an actual NaN value should be distinguished from a NaN that appears only because the index changed. If I have a dataframe like this:
A B C
1 1.242 NaN 0.110
3 NaN -0.185 -0.209
5 -0.581 1.483 NaN
and I want to keep all NaN as NaN, it makes much more sense to have:
df.reindex( [2, 4, 6], method='ffill' )
A B C
2 1.242 NaN 0.110
4 NaN -0.185 -0.209
6 -0.581 1.483 NaN
In other words, take whatever value is there (NaN or not) and carry it forward until the next available index. Reindexing should not enforce a mandatory fillna on the data.
This is completely different from
df.reindex( [2, 4, 6], method=None )
which produces
A B C
2 NaN NaN NaN
4 NaN NaN NaN
6 NaN NaN NaN
Here is an example:
np.nan can simply mean "not applicable". Say I have hourly data, and on weekends some calculations do not apply, so I store NaN in those columns during the weekends. If I then reindex to a finer index, say every minute, and reindex filled over NaN, it would pick the last value from Friday and spread it over the whole weekend. That is wrong.
When reindexing a dataframe, a NaN value can be an actual, valid observation that you want to keep as is.
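The three-row frame from this comment can be used to reproduce the behaviors being compared. This is a sketch; the variable names carried, no_fill and filled are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1.242, np.nan, -0.581],
        "B": [np.nan, -0.185, 1.483],
        "C": [0.110, -0.209, np.nan],
    },
    index=[1, 3, 5],
)

# Each new label takes the whole previous row, including its NaN entries.
carried = df.reindex([2, 4, 6], method="ffill")

# Without a method, labels that do not exist in the source are all NaN.
no_fill = df.reindex([2, 4, 6])

# If filling over NaN is wanted, it can be requested explicitly beforehand.
filled = df.ffill().reindex([2, 4, 6], method="ffill")
print(carried, no_fill, filled, sep="\n\n")
```

Keeping the two steps separate leaves the choice of whether NaN is a real observation to the caller.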