pandas dropna drops more than isna counts

I've got a 3-column dataset with 7100 rows.
data.isna().sum() shows that one column contains 117 NaN values and the other two contain 0.
data.isnull().sum() also shows 117 for that column and 0 for the others.
data.dropna(inplace=True) drops 351 rows. Can anyone explain this to me? Am I doing anything wrong?
Edit:
I have now examined the deleted rows. 351 rows were deleted, yet dropped.isna().sum().sum() shows a total of 117 NaN values for them.
dropped[~dropped['description'].isna()] shows an empty table. So the result seems to be correct as far as I can see.
Now I'm just curious how the difference in counting occurs.
Sadly I'm not able/allowed to provide a data sample.

data.isna().sum() returns the number of NaN values per column of your dataframe, and data.dropna() drops every row that contains at least one NaN. You can check the affected rows explicitly by creating a subset, for example: nan_rows = dataframe[dataframe.columnNameWithNanValues.isna()], and then looking at its shape.
Next, use .dropna() without the inplace=True argument to drop the NaN and null rows.
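A minimal sketch of that check, reusing the data frame and the description column mentioned in the question (names taken from the question, not verified):

nan_rows = data[data['description'].isna()]   # rows where the column is NaN
print(nan_rows.shape)                         # (117, 3): 117 rows, 3 columns each

clean = data.dropna()                         # no inplace=True
print(len(data) - len(clean))                 # 117 rows removed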

Found the solution, pretty simple...
I've got three columns, and one column contains 117 NaN values. 117 dropped rows times 3 columns make a total of 351 deleted fields. Since I used df.size to measure how much was deleted, which counts fields (cells) rather than rows, I got 351 "deleted fields", which is totally correct.
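To illustrate the df.size vs. row-count difference on a toy frame (not the original data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, np.nan, 6.0], 'c': [7, 8, 9]})

print(df.isna().sum().sum())          # 1 NaN value in one column
print(df.size - df.dropna().size)     # 3 "deleted fields": 1 dropped row x 3 columns
print(len(df) - len(df.dropna()))     # 1 dropped row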

Related

Nan columns not dropping

I have a dataset that contains some NaN values. I tried this to drop them, but they still show up:
df['string_tweet'].dropna(inplace=True)
df['string_tweet']
this is the output
113 apc started let ’ finish started
235 upon vote katsina , apc government left state ...
1796 two people contesting office , one person win ...
1798 deji said peter obi jumping church church.na d...
1850 amnesia set , lem say deleting incriminating p...
...
378726 nan
378727 nan
378728 nan
378729 nan
378730 nan
Name: string_tweet, Length: 63664, dtype: object
Please check the length and the rows; they do not correspond.
If you have proper NaN values, use the subset argument to work on the whole dataframe:
df.dropna(subset=['string_tweet'], inplace=True)
If your dataframe includes "nan" strings as suggested by #99_m4n, you may filter them out using:
df = df[df['string_tweet']!='nan']
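If you are not sure which case applies, a quick check on a toy frame (a sketch, not the asker's data) separates real NaN from literal 'nan' strings:

import numpy as np
import pandas as pd

df = pd.DataFrame({'string_tweet': ['hello', np.nan, 'nan', 'world']})
s = df['string_tweet']

print(s.isna().sum())                                        # 1 real NaN
print(s.dropna().astype(str).str.lower().eq('nan').sum())    # 1 literal 'nan' string

cleaned = df.dropna(subset=['string_tweet'])
cleaned = cleaned[cleaned['string_tweet'].str.lower() != 'nan']
print(cleaned)                                               # only 'hello' and 'world' remain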
I guess that the pictured nan values are of type numpy.ndarray; try converting your column before dropping the NaN:
df['string_tweet']=df['string_tweet'].astype(float)

How to retain NaN values using pandas factorize()?

I have a Pandas data frame with several columns, with some columns comprising categorical entries. I convert (or, encode) these entries to numerical values using factorize() as follows:
for column in df.select_dtypes(['category']):
    df[column] = df[column].factorize(na_sentinel=None)[0]
The columns have several NaN entries, so I set na_sentinel=None to retain the NaN entries. However, the NaN values are not retained (they get converted to numerical entries), which is not what I want. My pandas version is 1.3.5. Is there something I am missing?
factorize() converts NaN values to -1 by default. The NaN values are retained in the sense that they can still be identified by the -1. You would probably want to keep the default, which is:
na_sentinel=-1
see
https://pandas.pydata.org/docs/reference/api/pandas.factorize.html
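For illustration, a small sketch (not the asker's data) showing the default -1 sentinel and how to get NaN back afterwards if needed:

import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan, 'a'], dtype='category')

codes, uniques = s.factorize()        # default na_sentinel=-1
print(codes)                          # [ 0  1 -1  0] -> NaN is marked as -1

encoded = pd.Series(codes, index=s.index).replace(-1, np.nan)
print(encoded)                        # 0.0, 1.0, NaN, 0.0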

Fast remove element of list if contained by pandas dataframe

I have a list of strings, and two separate pandas dataframes. One of the dataframes contains NaNs. I am trying to find a fast way of checking if any item in the list is contained in either of the dataframes, and if so, to remove it from the list.
Currently, I do this with list comprehension. I first concatenate the two dataframes. I then loop through the list, and using an if statement check if it is contained in the concatenated dataframe values.
patches = [patch for patch in patches if not patch in bad_patches.values]
The first 5 elements of my list of strings:
patches[1:5]
['S2A_MSIL2A_20170613T101031_11_52',
'S2A_MSIL2A_20170717T113321_35_89',
'S2A_MSIL2A_20170613T101031_12_39',
'S2A_MSIL2A_20170613T101031_11_77']
An example of one of my dataframes, with the second being the same but containing fewer rows. Note that the first row contains patches[2].
cloud_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89
1 S2A_MSIL2A_20170717T113321_39_84
2 S2B_MSIL2A_20171112T114339_0_13
3 S2B_MSIL2A_20171112T114339_0_52
4 S2B_MSIL2A_20171112T114339_0_53
The concatenated dataframe:
bad_patches = pd.concat([cloud_patches, snow_patches], axis=1)
bad_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89 S2B_MSIL2A_20170831T095029_27_76
1 S2A_MSIL2A_20170717T113321_39_84 S2B_MSIL2A_20170831T095029_27_85
2 S2B_MSIL2A_20171112T114339_0_13 S2B_MSIL2A_20170831T095029_29_75
3 S2B_MSIL2A_20171112T114339_0_52 S2B_MSIL2A_20170831T095029_30_75
4 S2B_MSIL2A_20171112T114339_0_53 S2B_MSIL2A_20170831T095029_30_78
and the tail, showing the NaNs of one column:
bad_patches.tail()
61702 NaN S2A_MSIL2A_20180228T101021_43_6
61703 NaN S2A_MSIL2A_20180228T101021_43_8
61704 NaN S2A_MSIL2A_20180228T101021_43_11
61705 NaN S2A_MSIL2A_20180228T101021_43_13
61706 NaN S2A_MSIL2A_20180228T101021_43_16
Column headers are all (poorly) named 0.
The second element of patches should be removed, as it's contained in the first row of bad_patches. My method does work but takes absolutely ages. bad_patches is 60,000 rows and the length of patches is variable. Right now, for 1,000 patches, it takes 2.04 seconds, but I need to scale up to 500k patches, so I'm hoping there is a faster way. Thanks!
I would create a set with the values from cloud_patches and snow_patches. Then also create a set of patches:
patch_set = set(cloud_patches[0]).union(set(snow_patches[0]))
patches = set(patches)
Now you just subtract all values in patch_set from the values in patches, and you will be left with only values in patches that do not show up in cloud_patches nor snow_patches:
cleaned_list = list(patches - patch_set)
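Putting it together (a sketch, assuming the column label 0 from the question and that the order of patches does not need to be preserved; an order-preserving variant is commented):

bad = set(cloud_patches[0]).union(snow_patches[0])   # O(1) membership tests

cleaned_list = list(set(patches) - bad)

# order-preserving alternative:
# cleaned_list = [p for p in patches if p not in bad]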

While pre-processing I have a huge count of columns with NaN values! Is there any possible way to replace the NaN in all columns with 'Zero' or 'N'?

For example, I have converted all the columns to a list: sample['ib_home_market_value', 'ib_comm_involve_don_cultural', 'ib_comm_involve_political', 'ib_home_furnishings', 'ib_magazines', 'ib_womens_apparel']
Similarly, I have 200+ columns.
Total rows - 10L
Sample value counts for ib_comm_involve_don_cultural: Y - 309639, NaN - 690361
Similarly, I need to do this for all columns, changing NaN to either 'Zero' or 'N'. I need a function to change the NaN values in all columns.
I am doing preprocessing for a clustering model.
Sorry for the unreadable code; this is what I tried, applying fillna:
for i in list1:
    df1[i].fillna('N', inplace=True)
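For reference, a sketch that does the same thing without the explicit loop, using the df1 and list1 names from the attempt above:

df1[list1] = df1[list1].fillna('N')

# or, to fill every column in the frame at once:
# df1 = df1.fillna('N')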

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled in for one row but blank in the next, even though I know the location has not changed; because it is blank, it gets captured as a unique row.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
Method referenced here in the documentation.
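For illustration, a minimal sketch with the sample rows from the question (dates kept as strings, extra columns omitted):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':         ['xyz987', 'xyz987', 'xyz987'],
    'start_date': ['2016-03-11', '2016-04-03', '2016-04-22'],
    'is_current': ['Expired', 'Expired', 'Current'],
    'location':   ['CA', np.nan, 'CA'],
})

filled = df.groupby('id').ffill()       # forward fill within each id
print(filled['location'].tolist())      # ['CA', 'CA', 'CA']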
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
github/jreback: this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group.
Here's an easy way to do this.
url: https://github.com/pandas-dev/pandas/issues/11296
According to jreback's answer, when you do a groupby, ffill() is not optimized, but cumsum() is. Try this:
df = df.sort_values('id')
df.ffill() * (1 - df.isnull().astype(int)).groupby('id').cumsum().applymap(lambda x: None if x == 0 else 1)