I have a dataset that contains some NaN values. I tried this to drop them, but they still show up:
df['string_tweet'].dropna(inplace=True)
df['string_tweet']
This is the output:
113 apc started let ’ finish started
235 upon vote katsina , apc government left state ...
1796 two people contesting office , one person win ...
1798 deji said peter obi jumping church church.na d...
1850 amnesia set , lem say deleting incriminating p...
...
378726 nan
378727 nan
378728 nan
378729 nan
378730 nan
Name: string_tweet, Length: 63664, dtype: object
Please check the length and the row numbers; they do not correspond.
If you have proper NaN values, use the subset argument to work on the whole dataframe:
df.dropna(subset=['string_tweet'], inplace=True)
If your dataframe includes "nan" strings as suggested by @99_m4n, you may filter them out using:
df = df[df['string_tweet']!='nan']
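A minimal sketch of both cases handled together, assuming a toy frame (the sample rows are made up; only the column name string_tweet comes from the question). It keeps only the rows whose value is neither a real NaN nor the literal string "nan":

```python
import numpy as np
import pandas as pd

# Toy data: one real NaN and one literal "nan" string (illustrative values only)
df = pd.DataFrame(
    {"string_tweet": ["apc started", np.nan, "nan", "two people contesting"]}
)

# A row is "missing" if it holds a real NaN or the text "nan"
is_missing = df["string_tweet"].isna() | (df["string_tweet"] == "nan")

# Build a new frame instead of mutating a column selection in place
cleaned = df[~is_missing]
print(cleaned)
```

Assigning the filtered result back to a variable avoids relying on inplace=True applied to a single selected column.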
I guess that the nan values shown are of type numpy.ndarray; try converting your column before dropping the NaN values.
df['string_tweet']=df['string_tweet'].astype(float)
I'm trying to update a Pandas df column with a column from another df, which changes daily.
What I mean to do is to transplant what's in this:
Daily schedule of workers for all year
To this:
Schedule of workers for today, column in red
I'd like to do it every day until July. It usually worked quite well. In 2023, the calendar was made with slight changes in the format, and I can't make Pandas read the data as I'd like.
I cannot actually assign a column from one DataFrame to a column of the other. The code is accepted by Python, but all I get is NaN, not the strings I hoped for. All the values in the other column are strings. What am I doing wrong?
Thanks!
Here's my code:
from datetime import datetime
today = datetime.today().strftime("%d/%m/%Y")
df["status"] = df_diario[today].astype('str')
df["status"]
nome
Ad NaN
Al NaN
An NaN
Ca NaN
Cl NaN
Da NaN
El NaN
Ga NaN
Hu NaN
Jo NaN
Jo NaN
Jo NaN
Jo NaN
Le NaN
Lu NaN
Lu NaN
Lui NaN
Ma NaN
Mar NaN
Mau NaN
Om NaN
Pa NaN
Pau NaN
Pe NaN
Ro inativo
Ro NaN
Ro NaN
Ron NaN
Vi NaN
Name: status, dtype: object
What is the variable today that you used to index the DataFrame? Also, please format your question so that others can read it clearly and help you better.
However, if you check your output, there is one row that is not NaN: it is inativo. It could be that the two DataFrames have different indices. Column assignment aligns on the index, so for an assignment like this to work, both DataFrames need identical indices.
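A small sketch of the index-alignment effect described above, using made-up frames (the labels and status values are hypothetical). Assigning a Series aligns on index labels, so labels missing from the source become NaN; converting to a NumPy array assigns by position instead:

```python
import pandas as pd

# Hypothetical frames: only the label "Ro" exists in both indexes
df = pd.DataFrame({"x": [1, 2, 3]}, index=["Ad", "Al", "Ro"])
other = pd.DataFrame(
    {"status": ["ativo", "inativo", "ferias"]}, index=["Ro", "Xx", "Yy"]
)

# Label-based assignment: "Ad" and "Al" have no match, so they become NaN
df["aligned"] = other["status"]

# Position-based assignment: the index is ignored (lengths must match)
df["positional"] = other["status"].to_numpy()

print(df)
```

Positional assignment is only safe when both frames are known to list the same rows in the same order.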
Found the answer! It actually was in MS Excel. If you type some text that looks like a date, it will automatically define it in a date format. For this reason, I could not transplant my column as I'd like.
Pandas "inherits" the date format from the Excel spreadsheet, so to speak. It would import some of my dates as datetime objects, unrecognizable by the code I had written. It didn't import all of them as such because I had done the 2022 table with Python itself, from Jan 24th on. Because I had manually typed 10/01/2023, in this year's table, Pandas interpreted it as datetime and thus my code didn't work. To prevent the mess, I had to type an apostrophe before the date in an Excel cell.
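As an alternative to editing the spreadsheet, one could normalize the column labels after import. This is only a sketch under assumptions: the frame below is built by hand to mimic a sheet where one header was read as a datetime and another as text, and label_to_text is a hypothetical helper, not part of pandas.

```python
from datetime import datetime

import pandas as pd

# Hand-built stand-in for the imported sheet: one datetime header, one text header
df_diario = pd.DataFrame(
    [["ativo", "ativo"], ["inativo", "ativo"]],
    index=["Ad", "Al"],
    columns=[datetime(2023, 1, 10), "11/01/2023"],
)

def label_to_text(label):
    """Hypothetical helper: render datetime-like labels as dd/mm/YYYY text."""
    if isinstance(label, datetime):  # pd.Timestamp is a datetime subclass
        return label.strftime("%d/%m/%Y")
    return str(label)

# After this, every column can be looked up with a formatted date string
df_diario.columns = [label_to_text(c) for c in df_diario.columns]
print(df_diario["10/01/2023"])
```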
I've got a 3-column dataset with 7100 rows.
data.isna().sum() shows that one column contains 117 NaN values, the others 0.
data.isnull().sum() shows also 117 for one and 0 for the other columns.
data.dropna(inplace=True) drops 351 rows. Can anyone explain this to me? Am I doing anything wrong?
Edit:
I now examined the deleted rows. There are 351 rows deleted, where dropped.isna().sum().sum() shows a total of 117 NaN values.
dropped[~dropped['description'].isna()] shows an empty table. So the result seems to be correct as far as I can see.
Now I'm just curious how the difference in counting occurs.
Sadly I'm not able/allowed to provide a data sample.
data.isna().sum() returns the number of NaN values in each column, and data.dropna() drops every row that contains at least one NaN. To inspect the affected rows, create a subset, for example nan_rows = dataframe[dataframe.columnNameWithNanValues.isna()], and then look at its shape.
Next, use .dropna() without the inplace=True argument to drop NAN and Null values.
Found the solution. Pretty simple...
I've got three columns, and one column contains 117 NaN values. Dropping 117 rows of 3 columns each removes a total of 351 fields. Since I used df.size to measure what was deleted, which counts fields and not rows, I got 351 "deleted fields", which is totally correct.
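The difference between counting rows and counting fields can be reproduced on a small made-up frame (3 columns, 3 NaN values in one column; only the column name description comes from the question):

```python
import numpy as np
import pandas as pd

# Toy frame: 10 rows, 3 columns, NaN only in "description"
data = pd.DataFrame(
    {
        "a": range(10),
        "b": range(10),
        "description": ["x"] * 7 + [np.nan] * 3,
    }
)

clean = data.dropna()

rows_dropped = len(data) - len(clean)     # counts rows
fields_dropped = data.size - clean.size   # counts cells (rows * columns)

print(rows_dropped, fields_dropped)
```

Here fields_dropped is exactly rows_dropped times the number of columns, which matches the 117 versus 351 observation.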
I have a Pandas data frame with several columns, with some columns comprising categorical entries. I convert (or, encode) these entries to numerical values using factorize() as follows:
for column in df.select_dtypes(['category']):
df[column] = df[column].factorize(na_sentinel=None)[0]
The columns have several NaN entries, so I set na_sentinel=None to retain the NaN entries. However, the NaN values are not retained (they get converted to numerical entries), which is not what I want. My pandas version is 1.3.5. Is there something I am missing?
Factorize converts NaN values by default to -1. The NaN values are retained in this way since the NaN values can be identified by the -1. You would probably want to keep the default which is:
na_sentinel=-1
See:
https://pandas.pydata.org/docs/reference/api/pandas.factorize.html
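A short sketch of relying on the default sentinel and then turning it back into NaN, assuming a toy categorical Series. It deliberately avoids passing na_sentinel at all, since (as far as I know) that argument was deprecated in pandas 1.5 and later removed in favor of use_na_sentinel:

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing entry
s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

# Default behavior: missing values are encoded as -1
codes, uniques = pd.factorize(s)

# Restore NaN wherever the sentinel appears (the result becomes float)
encoded = pd.Series(codes, index=s.index).where(codes != -1)

print(codes)
print(encoded)
```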
Why do simple DataFrame op DataFrame operations result in a union'ed DataFrame? Pandas documentation mentions unionizing because of alignment issues. I don't see any alignment issues with df1 and df2. Aren't alignment issues about different shapes, different dtypes, or different indexes?
df1 = pd.DataFrame([[1,2],[3,4]],columns=list('AB'))
df2 = pd.DataFrame([[5,6],[7,8]],columns=list('CD'))
>> df1*df2
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
Another source of alignment issues is non-matching column names. Here, alignment requires identical column names. Either make the column names the same or use .values. Using .values on just the right-hand DataFrame will retain the DataFrame type.
>> df1*df2.values
A B
0 5 12
1 21 32
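For comparison, a sketch of the other option mentioned above: giving the right-hand frame the same column labels before multiplying. The variable name df2_renamed is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list("CD"))

# No shared column labels: the result has the union of columns, all NaN
unioned = df1 * df2

# Relabel a copy of df2 so its columns match df1, then multiply label by label
df2_renamed = df2.copy()
df2_renamed.columns = df1.columns
product = df1 * df2_renamed

print(unioned)
print(product)
```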
When reindexing, say, 1-minute data to daily data (e.g. an index for daily prices at 16:00), there may be no 1-minute data for the 16:00 timestamp on a given day, in which case we would want to forward fill from the last non-null 1-minute value. In the following case, there is no 1-minute data before 16:00 on the 13th, and the last 1-minute value comes from the 10th.
When using reindex with method='ffill', wouldn't one expect the following code to fill in the value on the 13th at 16:00? Inspecting daily1 shows that it is missing, however.
import pandas as pd
import numpy as np
hf_index = pd.date_range(start='2013-05-09 9:00', end='2013-05-13 23:59', freq='1min')
hf_prices = np.random.rand(len(hf_index))
hf = pd.DataFrame(hf_prices, index=hf_index)
hf.loc['2013-05-10 18:00':'2013-05-13 18:00', :] = np.nan  # .ix was removed from pandas; .loc does label-based slicing
hf.plot()
ind_daily = pd.date_range(start='2013-05-09 16:00', end='2013-05-13 16:00', freq='B')
print(ind_daily.values)
daily1 = hf.reindex(index=ind_daily, method='ffill')
To fill as one (or rather I) would expect, I need to do this:
daily2 = daily1.ffill()
If this is the case, what is the fill method in reindex actually doing? It is not clear to me just from the pandas documentation. It seems to me I should not have to do the above line.
I am posting my GitHub comment here as well:
The current behavior, in my opinion, makes more sense. NaN values can be valid "actual" values in some scenarios. An actual NaN value in the data should be treated differently from a NaN that appears only because the index changed. If I have a dataframe like this:
A B C
1 1.242 NaN 0.110
3 NaN -0.185 -0.209
5 -0.581 1.483 NaN
and I want to keep all NaN as NaN, it makes much more sense to have:
df.reindex( [2, 4, 6], method='ffill' )
A B C
2 1.242 NaN 0.110
4 NaN -0.185 -0.209
6 -0.581 1.483 NaN
That is, just take whatever value there is (NaN or not) and fill forward until the next available index. Reindexing should not enforce a mandatory fillna on the data.
This is completely different from
df.reindex( [2, 4, 6], method=None )
which produces
A B C
2 NaN NaN NaN
4 NaN NaN NaN
6 NaN NaN NaN
Here is an example:
np.nan can simply mean "not applicable". Say I have hourly data, and on weekends some calculations are just not applicable, so I fill those columns with NaN during the weekends. If reindexing to a finer index, say every minute, also filled over NaN values, it would pick the last value from Friday and fill it in for the whole weekend. This would be wrong.
In reindexing a dataframe, forward fill means just taking whatever value there is (NaN or not) and filling forward until the next available index. A NaN value can be an actual valid observation which you want to keep as is.
Reindexing should not enforce a mandatory fillna on the data.
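The distinction can be sketched on a frame shaped like the one above (two of its columns, same values). The names label_fill and value_fill are made up for illustration; the second variant shows one way to get value-based filling when it is wanted, by forward filling the data before reindexing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1.242, np.nan, -0.581], "B": [np.nan, -0.185, 1.483]},
    index=[1, 3, 5],
)

# reindex with method="ffill" copies the row at the previous label,
# including any NaN that row already contains
label_fill = df.reindex([2, 4, 6], method="ffill")

# To also fill over existing NaN values, forward fill the data first
value_fill = df.ffill().reindex([2, 4, 6], method="ffill")

print(label_fill)
print(value_fill)
```

In label_fill, row 4 keeps the NaN in column A taken from row 3; in value_fill, the same cell carries forward the last non-null value from row 1.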