Plot histogram on non-distributed data - pandas

I have a pandas series like
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
10 4.0
11 58.0
12 20.0
13 39.0
14 14.0
15 55.0
16 2.0
17 NaN
While trying to plot a histogram with

plt.hist(train_df['Age'])

I get the following error:
ValueError: max must be larger than min in range parameter
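A likely cause is the NaN entries in the series: older matplotlib versions raise this ValueError when computing the histogram range over data containing NaN. A minimal sketch of a fix (assuming train_df is the DataFrame holding the series above) is to drop the missing values before plotting:

import matplotlib.pyplot as plt

# Drop NaN values so matplotlib can compute a valid (min, max) range
plt.hist(train_df['Age'].dropna())
plt.show()

Alternatively, train_df['Age'].plot.hist() filters out the NaN values internally.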

Related

Grouping columns with same values without aggregating columns with different values in pandas

I am currently using pandas to summarize my data. I have data listed like this (the real data has tens of thousands of entries).
A  B  Intensity  Area
3  4  20.2       55
3  4  20.7       23
3  4  30.2       17
3  4  51.8       80
5  6  79.6       46
5  6  11.9       77
5  7  56.7       19
5  7  23.4       23
I would like to group the columns (A & B) together and list all the Intensity and Area values without aggregating them (e.g. calculating mean, median, mode, etc.):
Intensity
A,B
3,4  20.2  20.7  30.2  51.8
5,6  79.6  11.9  NaN   NaN
5,7  56.7  23.4  NaN   NaN

Area
A,B
3,4  55  23   17   80
5,6  46  77   NaN  NaN
5,7  19  23   NaN  NaN
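For reference, the example frame can be reconstructed like this (a sketch; the answers below assume it is named df):

import pandas as pd

df = pd.DataFrame({
    'A': [3, 3, 3, 3, 5, 5, 5, 5],
    'B': [4, 4, 4, 4, 6, 6, 7, 7],
    'Intensity': [20.2, 20.7, 30.2, 51.8, 79.6, 11.9, 56.7, 23.4],
    'Area': [55, 23, 17, 80, 46, 77, 19, 23],
})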
Here is one way to do it:

# Melt the wide layout to long, bringing Area and Intensity values in as rows
df2 = df.melt(id_vars=['A', 'B'])
# Concatenate A and B into a single column
df2['A,B'] = df2['A'].astype(str) + ',' + df2['B'].astype(str)
# Drop A and B
df2.drop(columns=['A', 'B'], inplace=True)
# Create a sequence number within each group to become the result's columns
df2['seq'] = df2.assign(seq=1).groupby(['variable', 'A,B'])['seq'].cumsum()
# Pivot, then format the result set
df2 = (df2.pivot(index=['variable', 'A,B'], columns='seq', values='value')
          .reset_index()
          .rename_axis(columns=None)
          .rename(columns={'variable': ''}))
df2
              A,B     1     2     3     4
0       Area  3,4  55.0  23.0  17.0  80.0
1       Area  5,6  46.0  77.0   NaN   NaN
2       Area  5,7  19.0  23.0   NaN   NaN
3  Intensity  3,4  20.2  20.7  30.2  51.8
4  Intensity  5,6  79.6  11.9   NaN   NaN
5  Intensity  5,7  56.7  23.4   NaN   NaN
Alternatively, you can use:

df['class'] = df['A'].astype(str) + ',' + df['B'].astype(str)

def convert_values(col_name):
    # Collect each group's values into a list, then expand the lists into
    # columns; shorter groups are padded with NaN automatically
    grouped = df[[col_name, 'class']].groupby('class').agg(list)
    dfx = pd.DataFrame(grouped[col_name].to_list(), index=grouped.index).reset_index()
    dfx.index = [col_name] * len(dfx)
    return dfx

df1 = convert_values('Intensity')
df2 = convert_values('Area')
final = pd.concat([df1, df2])
print(final)
'''
          class     0     1     2     3
Intensity   3,4  20.2  20.7  30.2  51.8
Intensity   5,6  79.6  11.9   NaN   NaN
Intensity   5,7  56.7  23.4   NaN   NaN
Area        3,4  55.0  23.0  17.0  80.0
Area        5,6  46.0  77.0   NaN   NaN
Area        5,7  19.0  23.0   NaN   NaN
'''
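A slightly more direct route to the same wide table, sketched here under the same df assumption, is to number the rows inside each group with groupby.cumcount and unstack that counter into columns:

# Melt to long form, then number rows within each (variable, A,B) group
long = df.melt(id_vars=['A', 'B'])
long['A,B'] = long['A'].astype(str) + ',' + long['B'].astype(str)
long['seq'] = long.groupby(['variable', 'A,B']).cumcount() + 1
# Unstack the counter so each group's values spread across columns
out = long.set_index(['variable', 'A,B', 'seq'])['value'].unstack('seq')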

Adding columns with null values in pandas dataframe [duplicate]

When summing two pandas columns, I want to ignore NaN values when only one of the two values is NaN. However, when NaN appears in both columns, I want to keep NaN in the output (instead of 0.0).
Initial dataframe:
Surf1 Surf2
0 0
NaN 8
8 15
NaN NaN
16 14
15 7
Desired output:
Surf1 Surf2 Sum
0 0 0
NaN 8 8
8 15 23
NaN NaN NaN
16 14 30
15 7 22
Tried code:

The code below ignores NaN values, but when taking the sum of two NaN values it gives 0.0 in the output. I want to keep NaN in that particular case, to keep these empty values separate from values that are actually 0 after summing.
import pandas as pd
import numpy as np

data = pd.DataFrame({"Surf1": [10, np.nan, 8, np.nan, 16, 15],
                     "Surf2": [22, 8, 15, np.nan, 14, 7]})
print(data)
data.loc[:, 'Sum'] = data.loc[:, ['Surf1', 'Surf2']].sum(axis=1)
print(data)
From the documentation for pandas.DataFrame.sum:

By default, the sum of an empty or all-NA Series is 0.

>>> pd.Series([]).sum()  # min_count=0 is the default
0.0

This can be controlled with the min_count parameter. For example, if you'd like the sum of an empty series to be NaN, pass min_count=1.

Change your code to

data.loc[:, 'Sum'] = data.loc[:, ['Surf1', 'Surf2']].sum(axis=1, min_count=1)
output
Surf1 Surf2
0 10.0 22.0
1 NaN 8.0
2 8.0 15.0
3 NaN NaN
4 16.0 14.0
5 15.0 7.0
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You could mask the result by doing:

df.sum(axis=1).mask(df.isna().all(axis=1))
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
You can do:

df['Sum'] = df.dropna(how='all').sum(axis=1)
Output:
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You can use min_count: this will sum the row when there is at least one non-null value, and return null if all values are null.

df['SUM'] = df.sum(min_count=1, axis=1)
df.sum(min_count=1, axis=1)
Out[199]:
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
I think all the solutions listed above work only for the cases where it is the first column value that is missing. If you have cases where the first column value is non-missing but the second column value is missing, try:

df['sum'] = df['Surf1']
df.loc[df['Surf2'].notnull(), 'sum'] = df['Surf1'].fillna(0) + df['Surf2']
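For what it's worth, the min_count=1 approach above already handles a missing value in either column; a quick sketch with hypothetical data covering both patterns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Surf1': [np.nan, 5, np.nan],
                   'Surf2': [3, np.nan, np.nan]})
# min_count=1: a row sums as long as at least one value is present
print(df.sum(axis=1, min_count=1))
# 0    3.0   <- only Surf1 missing
# 1    5.0   <- only Surf2 missing
# 2    NaN   <- both missing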

Use condition in a dataframe to replace values in another dataframe with nan

I have a dataframe that contains concentration values for a set of samples as follows:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       20            20
A       30       23       20            nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       87            54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
I have a second dataframe that contains the proportion of concentration values that are not missing in the first dataframe:
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       0.75     1        0.5           0.25
B       1        0.75     0.25          1
C       1        0        0             1
I would like to replace values in the first dataframe with nan when the corresponding proportion in the second dataframe is 0.5 or less. Hence, the resulting dataframe would look like the one below. Any help would be great!
Sample  Ethanol  Acetone  Formaldehyde  Methane
A       20       20       nan           nan
A       30       23       nan           nan
A       20       23       nan           nan
A       nan      20       nan           nan
B       21       46       nan           54
B       23       74       nan           54
B       23       67       nan           53
B       23       nan      nan           33
C       23       nan      nan           66
C       22       nan      nan           88
C       22       nan      nan           90
C       22       nan      nan           88
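For reproducibility, the two frames can be constructed like this (a sketch; the answer below refers to the concentrations as df1 and the proportions as df2):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Sample': list('AAAABBBBCCCC'),
    'Ethanol': [20, 30, 20, np.nan, 21, 23, 23, 23, 23, 22, 22, 22],
    'Acetone': [20, 23, 23, 20, 46, 74, 67] + [np.nan] * 5,
    'Formaldehyde': [20, 20, np.nan, np.nan, 87] + [np.nan] * 7,
    'Methane': [20, np.nan, np.nan, np.nan, 54, 54, 53, 33, 66, 88, 90, 88],
})
df2 = pd.DataFrame({
    'Sample': ['A', 'B', 'C'],
    'Ethanol': [0.75, 1, 1],
    'Acetone': [1, 0.75, 0],
    'Formaldehyde': [0.5, 0.25, 0],
    'Methane': [0.25, 1, 1],
})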
Is this what you are looking for?
>>> df2.set_index('Sample').mask(lambda x: x <= 0.5) \
.mul(df1.set_index('Sample')).reset_index()
Sample Ethanol Acetone Formaldehyde Methane
0 A 15.0 20.00 NaN NaN
1 A 22.5 23.00 NaN NaN
2 A 15.0 23.00 NaN NaN
3 A NaN 20.00 NaN NaN
4 B 21.0 34.50 NaN 54.0
5 B 23.0 55.50 NaN 54.0
6 B 23.0 50.25 NaN 53.0
7 B 23.0 NaN NaN 33.0
8 C 23.0 NaN NaN 66.0
9 C 22.0 NaN NaN 88.0
10 C 22.0 NaN NaN 90.0
11 C 22.0 NaN NaN 88.0
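Note that the multiplication above scales the surviving concentrations by the proportions (e.g. 20 × 0.75 = 15.0), whereas the desired output keeps the original values. If you only want to blank out the low-coverage entries, a mask-based variant (a sketch under the same df1/df2 naming) leaves the concentrations untouched:

# Boolean mask per Sample: True where the proportion is 0.5 or less
low = df2.set_index('Sample') <= 0.5
# Align the mask to df1's repeated Sample index and blank out those cells
out = df1.set_index('Sample').mask(low).reset_index()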

How to add 1 to previous data if NaN in pandas

I was wondering if it is possible to fill missing values in a pandas DataFrame / Series by adding 1 (or n) to the previous value.
For example:
1
10
nan
15
25
nan
nan
nan
30
Would return:
1
10
11
15
25
26
27
28
30
Thank you,
Use .ffill + the result of a groupby.cumcount to determine n
df[0].ffill() + df.groupby(df[0].notnull().cumsum()).cumcount()
0 1.0
1 10.0
2 11.0
3 15.0
4 25.0
5 26.0
6 27.0
7 28.0
8 30.0
dtype: float64
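For completeness, a minimal reproduction (a sketch, assuming the values live in column 0 of a DataFrame):

import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1, 10, np.nan, 15, 25, np.nan, np.nan, np.nan, 30]})
# Forward-fill the last known value, then add how many rows have passed
# since that value (0 for non-NaN rows; 1, 2, ... inside a NaN run)
result = df[0].ffill() + df.groupby(df[0].notnull().cumsum()).cumcount()
print(result)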

Compute a sequential rolling mean in pandas as array function?

I am trying to calculate a rolling mean on a dataframe with NaNs in pandas, but pandas seems to reset the window when it meets a NaN. Here's some code as an example...
import numpy as np
from pandas import *
foo = DataFrame(np.arange(0.0,13.0))
foo['1'] = np.arange(13.0,26.0)
foo.ix[4:6,0] = np.nan
foo.ix[4:7,1] = np.nan
bar = rolling_mean(foo, 4)
This gives a rolling mean that resets the window after each NaN, rather than just skipping over the NaNs:
bar =
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.5 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 8.5 NaN
11 9.5 22.5
12 10.5 23.5
I have found an ugly iterate/dropna() workaround that gives the right answer:
def sparse_rolling_mean(df_data, window):
    f_data = DataFrame(np.nan, index=df_data.index, columns=df_data.columns)
    for i in f_data.columns:
        f_data.ix[:, i] = rolling_mean(df_data.ix[:, i].dropna(), window)
    return f_data

bar = sparse_rolling_mean(foo, 4)
bar
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.50 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 3.25 NaN
8 5.00 16.5
9 6.75 18.5
10 8.50 20.5
11 9.50 22.5
12 10.50 23.5
Does anyone know if it is possible to do this as an array function?
Many thanks in advance.
You may do:
>>> def sparse_rolling_mean(ts, window):
... return rolling_mean(ts.dropna(), window).reindex_like(ts)
...
>>> foo.apply(sparse_rolling_mean, args=(4,))
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.50 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 3.25 NaN
8 5.00 16.5
9 6.75 18.5
10 8.50 20.5
11 9.50 22.5
12 10.50 23.5
[13 rows x 2 columns]
You can control what gets NaN'd out with the min_periods argument:
In [12]: rolling_mean(foo, 4,min_periods=1)
Out[12]:
0 1
0 0.0 13.0
1 0.5 13.5
2 1.0 14.0
3 1.5 14.5
4 2.0 15.0
5 2.5 15.5
6 3.0 16.0
7 7.0 NaN
8 7.5 21.0
9 8.0 21.5
10 8.5 22.0
11 9.5 22.5
12 10.5 23.5
[13 rows x 2 columns]
You can do this if you want results everywhere except where the original was NaN:
In [27]: rolling_mean(foo, 4,min_periods=1)[foo.notnull()]
Out[27]:
0 1
0 0.0 13.0
1 0.5 13.5
2 1.0 14.0
3 1.5 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 7.0 NaN
8 7.5 21.0
9 8.0 21.5
10 8.5 22.0
11 9.5 22.5
12 10.5 23.5
[13 rows x 2 columns]
Your expected output is a bit odd, as the first 3 rows should have values.
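A side note for readers on current pandas: rolling_mean and .ix were removed long ago. A sketch of the same workaround in the modern API (assuming pandas >= 1.0):

import numpy as np
import pandas as pd

foo = pd.DataFrame(np.arange(0.0, 13.0))
foo['1'] = np.arange(13.0, 26.0)
foo.loc[4:6, 0] = np.nan
foo.loc[4:7, '1'] = np.nan

# Per column: drop NaNs, take the rolling mean over the remaining values,
# then reindex back so the result lines up with the original positions
def sparse_rolling_mean(ts, window):
    return ts.dropna().rolling(window).mean().reindex_like(ts)

bar = foo.apply(sparse_rolling_mean, args=(4,))

Similarly, rolling_mean(foo, 4, min_periods=1) becomes foo.rolling(4, min_periods=1).mean().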