While pre-processing I have a huge number of columns with NaN values! Is there any possible way to replace the NaN values in all columns with 'Zero' or 'N'? - pandas

For example, I have converted all the columns to a list: sample['ib_home_market_value','ib_comm_involve_don_cultural','ib_comm_involve_political','ib_home_furnishings', 'ib_magazines','ib_womens_apparel']
I have 200+ similar columns.
Total rows: 10L (1,000,000)
Sample [ib_comm_involve_don_cultural]: Y - 309639, NaN - 690361
Similarly, I need to change the NaN values to either 'Zero' or 'N' for all columns. I need a function that changes the NaN values in all the columns.
I am doing preprocessing for a clustering model:

Sorry for the unreadable code; this is what I tried, applying fillna:
for i in list1:
    df1[i].fillna('N', inplace=True)
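As an alternative to the loop, a minimal sketch (assuming list1 holds the column names and df1 is the DataFrame) is to fill all of the listed columns in one call:
# Fill NaN in the listed columns in a single call ('Zero'/0 would work the same way)
df1[list1] = df1[list1].fillna('N')

# Or fill NaN across the entire frame at once
df1 = df1.fillna('N')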

Related

pandas dropna drops more than isna counts

I've got a 3-column dataset with 7100 rows.
data.isna().sum() shows that one column contains 117 NaN values, the others 0.
data.isnull().sum() shows also 117 for one and 0 for the other columns.
data.dropna(inplace=True) drops 351 rows. Can anyone explain this to me? Am I doing anything wrong?
Edit:
I have now examined the deleted rows (stored in dropped). 351 rows were deleted, and dropped.isna().sum().sum() shows a total of 117 NaN values.
dropped[~dropped['description'].isna()] shows an empty table. So the result seems to be correct as far as I can see.
Now I'm just curious how the difference in counting occurs.
Sadly I'm not able/allowed to provide a data sample.
data.isna().sum() returns the number of NaN values per column of your dataframe, and data.dropna() drops every row that contains a NaN value. You can specifically check the rows with NaN values by creating a subset, for example nan_rows = dataframe[dataframe.columnNameWithNanValues.isna()], and then inspecting the shape of that subset.
Next, use .dropna() without the inplace=True argument to drop the NaN and null rows.
Found the solution, pretty simple...
I've got three columns, and one column contains 117 NaN values. 117 values across 3 columns are a total of 351 fields to be deleted. Since I used df.size to measure the deleted size, which counts fields and not rows, I got 351 "deleted fields", which is totally correct.
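A small sketch of the counting difference, using hypothetical data just to illustrate fields vs. rows:
import numpy as np
import pandas as pd

# 3 columns, 117 NaN values in a single column
df = pd.DataFrame({'a': range(7100),
                   'b': range(7100),
                   'c': [np.nan] * 117 + list(range(7100 - 117))})

print(len(df) - len(df.dropna()))    # 117 dropped rows
print(df.size - df.dropna().size)    # 351 dropped fields (117 rows x 3 columns)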

How to retain NaN values using pandas factorize()?

I have a Pandas data frame with several columns, with some columns comprising categorical entries. I convert (or, encode) these entries to numerical values using factorize() as follows:
for column in df.select_dtypes(['category']):
    df[column] = df[column].factorize(na_sentinel=None)[0]
The columns have several NaN entries, so I set na_sentinel=None to retain the NaN entries. However, the NaN values are not retained (they get converted to numerical entries), which is not what I want. My pandas version is 1.3.5. Is there something I am missing?
factorize() converts NaN values to -1 by default. The NaN values are retained in the sense that they can still be identified by the -1 sentinel. You would probably want to keep the default, which is:
na_sentinel=-1
See
https://pandas.pydata.org/docs/reference/api/pandas.factorize.html
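If real NaN values are needed after encoding, one possible workaround (a sketch, not taken from the answer above, assuming the default sentinel of -1) is to map the sentinel back to NaN afterwards; note this forces the columns to float dtype, since integer columns cannot hold NaN:
import numpy as np
import pandas as pd

for column in df.select_dtypes(['category']):
    codes = df[column].factorize()[0]                                # NaN becomes -1 by default
    df[column] = pd.Series(codes, index=df.index).replace(-1, np.nan)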

How to efficiently filter a DataFrame based on an ID column which corresponds to a second DataFrame containing conditions for each ID?

I have a DataFrame with one ID column and two data columns X, Y containing numeric values. For each ID there are several rows of data.
I have a second DataFrame with the same ID column and two numeric columns specifying the lower and upper limit of the X values for each ID.
I want to use the second DataFrame to filter the first DataFrame down to the rows whose X values lie within the X_min-X_max range of the specific ID.
I can solve this by looping over the second DataFrame and filtering the groupby(ID) elements of the first DF, but that is slow for a large number of IDs. Is there an efficient way to solve this?
Example code with the data in df, the ranges in df_ranges and the expected result in df_result. The real data frame is obviously a lot bigger.
import pandas as pd
x=[2.1,2.2,2.6,2.4,2.8,3.5,2.8,3.2]
y=[3.1,3.5,3.4,2.7,2.1,2.7,4.1,4.3]
ID=[0]*4+[0.1]*4
x_min=[2.0,3.0]
x_max=[2.5,3.4]
IDs=[0,0.1]
df=pd.DataFrame({'ID':ID,'X':x,'Y':y})
df_ranges=pd.DataFrame({'ID':IDs,'X_min':x_min,'X_max':x_max})
df_result=df.iloc[[0,1,3,7],:]
Possible Solution:
def filter_ranges(grp, df_ranges):
    x_min = df_ranges.loc[df_ranges.ID == grp.name, 'X_min'].values[0]
    x_max = df_ranges.loc[df_ranges.ID == grp.name, 'X_max'].values[0]
    return grp.loc[(grp.X >= x_min) & (grp.X <= x_max), :]

target_df_grp = df.groupby('ID').apply(filter_ranges, df_ranges=df_ranges)
Try this:
merged = df.merge(df_ranges, on='ID')
target_df = merged[(merged.X>=merged.X_min)&(merged.X<=merged.X_max)][['ID', 'X', 'Y']] # Here, desired filter is applied.
print(target_df) will give:
    ID    X    Y
0  0.0  2.1  3.1
1  0.0  2.2  3.5
3  0.0  2.4  2.7
7  0.1  3.2  4.3
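If the original index of df needs to be preserved (as in df_result above), one possible variation, assuming the column names from the example, is to carry the index through the merge:
# Keep df's original row labels through the merge
target_df = (
    df.reset_index()
      .merge(df_ranges, on='ID')
      .query('X >= X_min and X <= X_max')
      .set_index('index')[['ID', 'X', 'Y']]
)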

Fast removal of list elements that are contained in a pandas dataframe

I have a list of strings, and two separate pandas dataframes. One of the dataframes contains NaNs. I am trying to find a fast way of checking if any item in the list is contained in either of the dataframes, and if so, to remove it from the list.
Currently, I do this with list comprehension. I first concatenate the two dataframes. I then loop through the list, and using an if statement check if it is contained in the concatenated dataframe values.
patches = [patch for patch in patches if not patch in bad_patches.values]
The first 5 elements of my list of strings:
patches[1:5]
['S2A_MSIL2A_20170613T101031_11_52',
'S2A_MSIL2A_20170717T113321_35_89',
'S2A_MSIL2A_20170613T101031_12_39',
'S2A_MSIL2A_20170613T101031_11_77']
An example of one of my dataframes, with the second being the same but containing fewer rows. Note the first row contains patches[2].
cloud_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89
1 S2A_MSIL2A_20170717T113321_39_84
2 S2B_MSIL2A_20171112T114339_0_13
3 S2B_MSIL2A_20171112T114339_0_52
4 S2B_MSIL2A_20171112T114339_0_53
The concatenated dataframe:
bad_patches = pd.concat([cloud_patches, snow_patches], axis=1)
bad_patches.head()
0 S2A_MSIL2A_20170717T113321_35_89 S2B_MSIL2A_20170831T095029_27_76
1 S2A_MSIL2A_20170717T113321_39_84 S2B_MSIL2A_20170831T095029_27_85
2 S2B_MSIL2A_20171112T114339_0_13 S2B_MSIL2A_20170831T095029_29_75
3 S2B_MSIL2A_20171112T114339_0_52 S2B_MSIL2A_20170831T095029_30_75
4 S2B_MSIL2A_20171112T114339_0_53 S2B_MSIL2A_20170831T095029_30_78
and the tail, showing the NaNs of one column:
bad_patches.tail()
61702 NaN S2A_MSIL2A_20180228T101021_43_6
61703 NaN S2A_MSIL2A_20180228T101021_43_8
61704 NaN S2A_MSIL2A_20180228T101021_43_11
61705 NaN S2A_MSIL2A_20180228T101021_43_13
61706 NaN S2A_MSIL2A_20180228T101021_43_16
Column headers are all (poorly) named 0.
The second element of patches should be removed as it's contained in the first row of bad_patches. My method does work but takes absolutely ages. bad_patches has 60,000 rows and the length of patches is variable. Right now, for a length of 1000 patches, it takes 2.04 seconds, but I need to scale up to 500k patches, so I am hoping there is a faster way. Thanks!
I would create a set with the values from cloud_patches and snow_patches. Then also create a set of patches:
patch_set = set(cloud_patches[0]).union(set(snow_patches[0]))
patches = set(patches)
Now you just subtract all values in patch_set from the values in patches, and you will be left with only the values in patches that show up in neither cloud_patches nor snow_patches:
cleaned_list = list(patches - patch_set)
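If the original ordering of patches matters (the set difference above does not preserve it), a possible alternative sketch is to keep the list comprehension but test membership against the set, which is a constant-time lookup per element:
patch_set = set(cloud_patches[0]).union(set(snow_patches[0]))
cleaned_list = [patch for patch in patches if patch not in patch_set]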

Pandas integer columns: remove last three digits

A pandas df as an example, with all columns integers but some containing NaN.
raw
capitalSurplus 188883000
totalLiab 2589242000
totalStockholderEquity 6740732000
minorityInterest 27549000
otherCurrentLiab 40412000
totalAssets 9357523000
endDate 1483142400
commonStock 5818867000
retainedEarnings 732982000
otherLiab 746117000
otherAssets 6034000
totalCurrentLiabilities 436539000
propertyPlantEquipment 9135741000
totalCurrentAssets 212758000
longTermInvestments 2990000
netTangibleAssets 6740732000
netReceivables 201288000
longTermDebt 1406586000
accountsPayable 396127000
otherCurrentAssets NAN
P.S. The df is transposed.
The expected result is that the last three digits ('000') are removed from all columns, including the NaN columns,
while endDate is kept unchanged:
endDate 1483142400
If the NAN is not a np.nan, you can replace it using df.replace.
After that, I renamed the columns to A and B using df.columns = ['A','B'].
Then you can just do the below using floordiv(), which is a built-in method:
df.B.update(df[df.A!='endDate']['B'].floordiv(1000))
This will remove the last three zeros from every row except the endDate row and update column B at the respective indices.
Alternatively you can also use // to remove the last 3 zeros as shown below:
df.B.update(df[df.A!='endDate']['B'] // 1000)
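Putting the pieces together, a minimal end-to-end sketch, assuming the frame has already been transposed into the two columns A (names) and B (values) described above, and that the missing values are the literal string 'NAN':
import numpy as np
import pandas as pd

# Hypothetical two-column frame matching the layout above
df = pd.DataFrame({
    'A': ['capitalSurplus', 'endDate', 'otherCurrentAssets'],
    'B': [188883000, 1483142400, 'NAN'],
})

df['B'] = df['B'].replace('NAN', np.nan)           # literal 'NAN' -> real NaN
df['B'] = pd.to_numeric(df['B'])                   # NaN forces float dtype
df.B.update(df[df.A != 'endDate']['B'] // 1000)    # strip the trailing '000'; NaN rows stay NaN
print(df)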