Does pandas.describe() exclude missing value rows? - pandas

The pandas.describe() function generates descriptive statistics that summarize the dataset, excluding NaN values. But does the exclusion here mean that the total count (i.e., the number of rows of a variable) varies, or is it fixed?
For example, I calculate the mean by using describe() for a df with missing values:
varA
1
1
1
1
NaN
Is the mean = 4/5 or 4/4 here?
And how does it apply to other results in describe? For example, the standard deviation, quartiles?
Thanks!

As ayhan pointed out, in the current 0.21 release NaN values are excluded from all summary statistics provided by pandas.DataFrame.describe().
With NaN:
import numpy as np
import pandas as pd

data_with_nan = list(range(20)) + [np.nan] * 20
df = pd.DataFrame(data=data_with_nan, columns=['col1'])
df.describe()
col1
count 20.00000
mean 9.50000
std 5.91608
min 0.00000
25% 4.75000
50% 9.50000
75% 14.25000
max 19.00000
Without:
data_without_nan = list(range(20))
df = pd.DataFrame(data=data_without_nan, columns=['col1'])
df.describe()
col1
count 20.00000
mean 9.50000
std 5.91608
min 0.00000
25% 4.75000
50% 9.50000
75% 14.25000
max 19.00000
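Applied to the varA example from the question (a quick check with the same four 1s and one NaN), the count drops to 4 and the mean is 4/4 = 1.0, not 4/5 = 0.8. The standard deviation and the quartiles are likewise computed over the 4 non-NaN values only:
varA = pd.Series([1, 1, 1, 1, np.nan], name='varA')
varA.describe()
# count    4.0   <- the NaN row is not counted
# mean     1.0   <- 4/4, i.e. the sum divided by the non-NaN count
# std      0.0   <- computed over the 4 non-NaN values
# ...             (min, quartiles and max are all 1.0 here)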

Related

Filter pandas dataframe to keep outliers only for every column

I have the following dataframe:
year   var1   var2   var3   …
1970   14.52  2.88   20510  …
1970   12.36  5.5    22320  …
1970   11.85  3.12   21640  …
1970   18.30  6.3    25200  …
For each and every column (vari), I would like to keep outliers only (like with df[vari].quantile(0.99)). Every column has a different meaning, so the boundary condition is column dependent.
I found many similar questions, but most of them deal with either a single column to filter, a common outlier boundary across all columns, or result in a dataframe with values that respect the condition for every column, whereas I need them for any column.
I need to plot every column separately so I have to keep rows that have at least an outlier in one of the columns.
My idea is to replace non-outlier values in each column by NaN but I can't find a simple way to filter my data.
Expected output is like:
year   var1   var2   var3   …
1970   14.52  NaN    NaN    …
1970   NaN    5.5    24500  …
1970   NaN    NaN    NaN    …
1970   18.30  6.3    25200  …
I tried something using .apply and lambda with no success:
outliers = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() > 3).any(axis=1)]
This seems to return rows that have an outlier in every column.
I also tried something less elegant by looping over columns, but the execution fails (probably due to erroneous logic… code below for the record):
for col in df.columns:
    filtered_col = df[col].apply(lambda x: np.NaN if (x <= df[col].quantile(0.99) and x >= df[col].quantile(0.01)) else x)
    df[col] = filtered_col
Do you have any idea how I can tackle this issue?
Your code works just fine, except that it masks the year column as well. So maybe your problem is just choosing the correct columns:
var_cols = df.columns[1:]
for col in var_cols:
    mm, MM = df[col].quantile([.01, .99])
    df[col] = df[col].where(df[col].between(mm, MM))
Output:
year var1 var2 var3
0 1970 14.52 NaN NaN
1 1970 12.36 5.50 22320.0
2 1970 NaN 3.12 21640.0
3 1970 NaN NaN NaN
You can also do it without the for loop:
mm = df[var_cols].quantile(.01)
MM = df[var_cols].quantile(.99)
df[var_cols] = df[var_cols].where(df[var_cols].le(MM) & df[var_cols].ge(mm))
For mean and std:
mean, std = df[var_cols].mean(), df[var_cols].std()
mm, MM = mean - 3*std, mean + 3*std
df[var_cols] = df[var_cols].where(df[var_cols].le(MM) & df[var_cols].ge(mm))
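If, as mentioned in the question, you also need to drop the rows where every var column ended up as NaN before plotting, a short follow-up step (my assumption, not part of the original answer) could be:
# Keep only rows that still have at least one non-NaN value in the var columns
df = df.dropna(subset=var_cols, how='all')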

How to perform calculations on pandas groupby dataframes

I have a dataframe which, for the sake of showing a minimal example, I simplified to this:
df = pd.DataFrame({'pnl':[+10,+23,+15,-5,+20],'style':['obs','obs','obs','bf','bf']})
I would like to do the following:
Group the dataframe by style
Count the positive entries of pnl and divide by the total number of entries of that same style.
For example, style 'bf' has 2 entries, one positive and one negative, so 1/2 (total) = 0.5.
This should yield the following result:
style win_rate
bf 0.5
obs 1
dtype: float64
I thought of having a list of the groups, iterating over them and building a new df... but that seems like an antipattern to me. I am pretty sure there is an easier / more Pythonic solution.
Thanks.
You can do a groupby of the df['pnl'].gt(0) series by df['style'] and calculate the mean:
In [14]: df['pnl'].gt(0).groupby(df['style']).mean()
Out[14]:
style
bf 0.5
obs 1.0
Name: pnl, dtype: float64
You can try pd.crosstab, which uses groupby in the background, to get the percentage of both positive and non-positive entries:
pd.crosstab(df['style'], df['pnl'].gt(0), normalize='index')
Output:
pnl False True
style
bf 0.5 0.5
obs 0.0 1.0
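If you want the result as a DataFrame with an explicit win_rate column, matching the layout shown in the question, a minimal sketch (using the same df as above) could be:
# Flag the winning entries, then take the per-style mean of that flag
win_rate = (df.assign(win=df['pnl'].gt(0))
              .groupby('style')['win']
              .mean()
              .rename('win_rate')
              .reset_index())
print(win_rate)
#   style  win_rate
# 0    bf       0.5
# 1   obs       1.0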

Series.replace cannot use dict-like to_replace and non-None value [duplicate]

I've got a pandas DataFrame filled mostly with real numbers, but there are a few NaN values in it as well.
How can I replace the NaNs with the averages of the columns they are in?
This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the NaNs directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict, however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
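For completeness, a minimal sketch of the dict variant mentioned above (same df as in the example):
# Passing the column means as an explicit {column: value} dict
df.fillna(df.mean().to_dict())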
Try:
sub2['income'].fillna((sub2['income'].mean()), inplace=True)
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Apply per column: fill each column's NaN values with the mean of that column
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Although the code below does the job, its performance takes a big hit once you deal with a DataFrame of 100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() all over the DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), where I ran it selectively, i.e. only on the variables which had NaN values:
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())

#------------------------------------------------
#----(Code 2) Selective Treatment----------------
# df.isnull().any(axis=0) gives a True/False flag (a Boolean Series), which,
# when applied to df.columns[], identifies the variables with NaN values
for i in df.columns[df.isnull().any(axis=0)]:   # apply only to variables with NaN values
    df[i].fillna(df[i].mean(), inplace=True)
Below is the performance I observed as I kept increasing the number of records in the DataFrame.
DataFrame with ~100k records
Code 1: 22.06 Seconds
Code 2: 0.03 Seconds
DataFrame with ~200k records
Code 1: 180.06 Seconds
Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
Code 1: code kept running endlessly
Code 2: 0.40 Seconds
DataFrame with ~13 Million records
Code 1: --did not even try, after seeing performance on 1.6 Mn records--
Code 2: 3.20 Seconds
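As a further tweak (a sketch I have not benchmarked, building on the same idea as Code 2), the selective fill can also be written without a Python-level loop:
# Fill only the columns that actually contain NaN values, in one vectorized call
nan_cols = df.columns[df.isnull().any(axis=0)]
df[nan_cols] = df[nan_cols].fillna(df[nan_cols].mean())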
Apologies for the long answer! Hope this helps!
If you want to impute missing values with the mean and go column by column, this will only impute with the mean of that column. It might be a little more readable.
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Read the data from a csv file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values

# To calculate the mean, use the imputer class
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
Directly use df.fillna(df.mean()) to fill all the null values with the mean.
If you want to fill the null values of a specific column with the mean of that column (here Item_Weight is the column name), assign the filled column back:
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mean())
If you want to fill the null values with some string instead (here Outlet_Size is the column name):
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:
Use method .fillna():
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.
You should be careful when using the mean: if you have outliers, it is more advisable to use the median.
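A minimal sketch of that median variant, using the same nr_items column as above:
# Replace NaN with the column median instead of the mean (more robust to outliers)
median_value = df['nr_items'].median()
df['nr_item_ave'] = df['nr_items'].fillna(median_value)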
Another option besides those above is:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than previous responses for mean, but it could be shorter if you desire to replace nulls by some other column function.
Using the sklearn SimpleImputer class:
import numpy as np
from sklearn.impute import SimpleImputer

missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:, 1:3])
x[:, 1:3] = missingvalues.transform(x[:, 1:3])
Note: in recent versions, SimpleImputer (imported from sklearn.impute) replaces the old Imputer class; the missing_values parameter takes np.nan instead of 'NaN', and there is no axis parameter.
I use this method to fill missing values with the average of a column:
fill_mean = lambda col: col.fillna(col.mean())
df = df.apply(fill_mean, axis=0)
You can also use value_counts to get the most frequent values. This would work on different datatypes.
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
See the pandas value_counts API reference for details.

Sum multiple columns according to NaN status of other columns

I'm trying to get the sum of four columns based on the NaN status of four other columns
My dataframe looks like this:
A B C D A_pct B_pct C_pct D_pct
100 NaN NaN 100 5% 95% 0% 0%
100 NaN 250 100 1% 84% 15% 0%
I want to sum A_pct through D_pct wherever the corresponding column among A-D is not NaN.
Hence, for the first row, the result would be 5%; for the second row, the result would be 16%
What I've done is as follows:
Pct_Sum = (df.loc[(df["A"].notna()) & (df["B"].notna()) & (df["C"].notna()) & df["D"].notna()),
df[["A_pct","B_pct","C_pct","D_pct"]]].sum(axis=1))
However, this returns the ValueError: "Cannot index with multidimensional key"
Please can you steer me in the right direction to correct this?
Thank you!
try:
df.apply(lambda x: sum(x[:4].notnull().values * (x[4:].str[:-1].astype(float)/100).values), axis=1)
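A column-wise alternative (just a sketch, not from the original answer, assuming the pct columns hold strings like '5%' as in the sample) that avoids the per-row apply:
vals = df[['A', 'B', 'C', 'D']]
# Strip the '%' sign and convert to fractions
pcts = df[['A_pct', 'B_pct', 'C_pct', 'D_pct']].apply(
    lambda s: s.str.rstrip('%').astype(float)) / 100
# Mask the percentages whose companion value column is NaN, then sum per row
pct_sum = pcts.where(vals.notna().values).sum(axis=1)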

Series of if statements applied to data frame

I have a question on how to do this task. I want to group a series of numbers in my data frame; the numbers come from the column 'PD', which ranges from .001 to 1. What I want to do is group those with .9 < 'PD' < .91 to .91 (i.e. return a value of .91), .91 <= 'PD' < .92 to .92, ..., .99 <= 'PD' <= 1 to 1, in a column named 'Grouping'. What I have been doing is writing each if statement manually and then merging it with the base data frame. Can anyone please help me with a more efficient way of doing this? I am still in the early stages of using Python, so sorry if the question seems easy. Thank you for answering and for your time.
Say your data looks like this:
>>> df = pd.DataFrame({'PD': np.arange(0.001, 1, 0.001), 'data': np.random.randint(10, size=999)})
>>> df.head()
PD data
0 0.001 6
1 0.002 3
2 0.003 5
3 0.004 9
4 0.005 7
Then cut off the last decimal of the PD column. This is a bit tricky, since you get a lot of rounding issues when doing it without string conversion. E.g.
>>> df['PD'] = df['PD'].apply(lambda x: float('{:.3f}'.format(x)[:-1]))
>>> df.tail()
PD data
994 0.99 1
995 0.99 3
996 0.99 2
997 0.99 1
998 0.99 0
Now you can use pandas groupby and do whatever you want with the data, e.g.
>>> df.groupby('PD').agg(lambda x: ','.join(map(str, x)))
data
PD
0.00 6,3,5,9,7,3,6,8,4
0.01 3,5,7,0,4,9,7,1,7,1
0.02 0,0,9,1,5,4,1,6,7,3
0.03 4,4,6,4,6,5,4,4,2,1
0.04 8,3,1,4,6,5,0,6,0,5
[...]
Note that the first group is one item shorter due to the missing 0.000 in my sample.
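If the goal is specifically the 'Grouping' column described in the question (the upper edge of each 0.01-wide bin), a hedged alternative sketch using pd.cut could look like the following; treat the exact edge handling (e.g. where a PD of exactly 0.91 lands) as an assumption to adjust:
import numpy as np
import pandas as pd

df = pd.DataFrame({'PD': np.arange(0.001, 1, 0.001)})

# Bin edges 0.00, 0.01, ..., 1.00; each value is labelled with its upper edge.
# right=False makes bins half-open [lower, upper), so PD == 0.91 maps to 0.92.
edges = np.round(np.arange(0.0, 1.001, 0.01), 2)
df['Grouping'] = pd.cut(df['PD'], bins=edges, right=False,
                        labels=edges[1:]).astype(float)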