How to plot values from the DataFrame?

I'm trying to plot the values from the A column against the index (of the DataFrame), but it doesn't let me. How can I do it?
INDEX is the index from the DataFrame and not the declared variable.

You need to plot column A only; by default in Series.plot the index is used for x and the values for y:
#line is default method, so omitted
Test['A'].plot(style='o')
Another solution is to use reset_index to convert the index into a column and then DataFrame.plot:
Test.reset_index().plot(x='index', y='A', style='o')
Sample:
Test=pd.DataFrame({'A':[3.0,4,5,10], 'B':[3.0,4,5,9]})
print (Test)
A B
0 3.0 3.0
1 4.0 4.0
2 5.0 5.0
3 10.0 9.0
Test['A'].plot(style='o')
print (Test.reset_index())
index A B
0 0 3.0 3.0
1 1 4.0 4.0
2 2 5.0 5.0
3 3 10.0 9.0
Test.reset_index().plot(x='index', y='A', style='o')
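For completeness, a minimal runnable sketch of the first approach (pandas plotting uses matplotlib under the hood, so plt.show() is needed when running as a script rather than in a notebook):

import pandas as pd
import matplotlib.pyplot as plt

Test = pd.DataFrame({'A': [3.0, 4, 5, 10], 'B': [3.0, 4, 5, 9]})

# Series.plot uses the index for x and the values for y by default
Test['A'].plot(style='o')
plt.show()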

Related

Average of certain values in pandas dataframe with if condition

index  column 1  column 2
1      1         1.2
2      1.2       1.5
3      2.2       2.5
4      3         3.1
5      3.3       3.5
6      3.6       3.8
7      3.9       4.0
8      4.0       4.0
9      4.0       4.1
10     4.1       4.0
I created a moving average with df.rolling(). But I just want the average of the "constant" values (here around 4), i.e. the values that no longer change by more than 10%.
My first approach was to try it with an if condition, but my attempts to create an average of only certain values in the column failed.
Does anyone have ideas?
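One possible interpretation, as a sketch: keep only the rows whose value changes by at most 10% from the previous row and average those. The column name and the exact threshold rule are assumptions taken from the question, not a definitive solution:

import pandas as pd

df = pd.DataFrame({
    'column 1': [1, 1.2, 2.2, 3, 3.3, 3.6, 3.9, 4.0, 4.0, 4.1],
    'column 2': [1.2, 1.5, 2.5, 3.1, 3.5, 3.8, 4.0, 4.0, 4.1, 4.0],
})

# rows whose relative change from the previous row is at most 10%
stable = df['column 1'].pct_change().abs() <= 0.10
print(df.loc[stable, 'column 1'].mean())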

How to represent the column with max NaN values in a pandas df?

I can show it with df.isnull().sum() and get the max value with df.isnull().sum().max(), but can someone tell me how to get the column name with the most NaNs?
Thank you all!
Use Series.idxmax with DataFrame.loc to filter the column with the most missing values:
df.loc[:, df.isnull().sum().idxmax()]
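If you only need the column name itself rather than its values, the same idxmax call can be used on its own:

df.isnull().sum().idxmax()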
If you need to select multiple columns when several share the maximum, compare the Series with its max value:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, np.nan, 5, np.nan, 4],
    'C': [7, 8, 9, np.nan, 2, np.nan],
    'D': [1, np.nan, 5, 7, 1, 0]
})
print (df)
A B C D
0 a 4.0 7.0 1.0
1 b 5.0 8.0 NaN
2 c NaN 9.0 5.0
3 d 5.0 NaN 7.0
4 e NaN 2.0 1.0
5 f 4.0 NaN 0.0
# count missing values per column
s = df.isnull().sum()
# keep only the columns whose count equals the maximum
df = df.loc[:, s.eq(s.max())]
print (df)
B C
0 4.0 7.0
1 5.0 8.0
2 NaN 9.0
3 5.0 NaN
4 NaN 2.0
5 4.0 NaN

Python/Pandas fill NaN values [closed]

Out[2]:
GROUP A B
0 1.0 2.0 5.0
1 1.0 5.0 7.0
2 2.0 3.0 6.0
3 2.0 NaN NaN
4 2.0 NaN NaN
5 2.0 8.0 4.0
Desired output:
Out[2]:
GROUP A B
0 1.0 2.0 5.0
1 1.0 5.0 7.0
2 2.0 3.0 6.0
3 2.0 6.0 7.0
4 2.0 7.0 8.0
5 2.0 8.0 4.0
Try:
# note: this answer calls the two value columns START and END (A and B in the question)
# label consecutive runs of the same GROUP value
blocks = df['GROUP'].ne(df['GROUP'].shift()).cumsum()
# fill missing END values by adding 1 per NaN row to the running total within each block
df['END'] = df['END'].fillna(df.fillna(1).groupby(blocks)['END'].cumsum())
# a missing START becomes the previous row's END
df['START'] = df['START'].fillna(df['END'].shift())
There is no built-in vectorized solution for your case, but you can solve it by iterating and processing each NaN section at a time.
# initialize starting and ending values
df['START'] = df['START'].mask(df['START'].isna(), df['END'].shift())
df['END'] = df['END'].mask(df['END'].isna(), df['START'].shift(-1))

while df['END'].isna().any():
    i = df['END'].loc[df['END'].isna()].index[0]           # get idx of first NaN
    k = df['END'].loc[i:].loc[~df['END'].isna()].index[0]  # get idx of next valid
    if df.loc[i, 'GROUP'] != df.loc[k, 'GROUP']:
        # you did not specify what to do in case a group started or ended in NaN
        # this will replace with a temp string and later replace back to NaN
        df.loc[i:k, 'START':'END'] = 'temp'
        continue
    n = k - i + 1
    start = df.loc[i, 'START']  # get value
    end = df.loc[k, 'END']      # get value
    delta = (end - start) / n
    df.loc[i:k, 'START':'END'] = [
        [start + row * delta, start + (row + 1) * delta]
        for row in range(n)
    ]

df = df.replace('temp', np.nan)
Output
GROUP START END
0 1 2.0 5.0
1 1 5.0 7.0
2 2 3.0 6.0
3 2 6.0 7.0
4 2 7.0 8.0
5 2 8.0 4.0
Notice that some error handling would be needed to account for the first or last row of the dataframe being NaN.

Pandas: using SimpleImputer transforms dataframe into a series?

I've got a dataframe with some NaNs that I'd like to fill with the column mean values. It's all good, but after applying the code below the dataframe seems to have been changed to a series: all values suddenly have many places after the decimal point, and the column names of the original dataframe have been lost and replaced with 0, 1, 2. I know I can recreate/reset all of this, but is it possible to use SimpleImputer without changing the underlying structure/type of the data?
from sklearn.impute import SimpleImputer
import numpy as np

impute = SimpleImputer(missing_values=np.nan, strategy='mean')
impute.fit(dfn)
dfn_mean = impute.transform(dfn)
I think you can use a pandas-only solution with DataFrame.fillna and mean, where non-numeric columns are omitted by default:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, np.nan, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, np.nan, 4],
    'F': list('aaabbb')
})
df = df.fillna(df.mean())
print (df)
A B C D E F
0 a 4 7.0 1 5.0 a
1 b 5 8.0 3 3.0 a
2 c 4 9.0 5 6.0 a
3 d 5 4.0 7 9.0 b
4 e 5 6.2 1 5.4 b
5 f 4 3.0 0 4.0 b
Your solution should be changed to process only the float columns, selected with DataFrame.select_dtypes:
from sklearn.impute import SimpleImputer
impute = SimpleImputer(missing_values=np.nan,strategy='mean')
c = df.select_dtypes(np.floating).columns
df[c] = impute.fit_transform(df[c])
print (df)
A B C D E F
0 a 4 7.0 1 5.0 a
1 b 5 8.0 3 3.0 a
2 c 4 9.0 5 6.0 a
3 d 5 4.0 7 9.0 b
4 e 5 6.2 1 5.4 b
5 f 4 3.0 0 4.0 b
Or all numeric columns, but then integer columns are converted to floats:
from sklearn.impute import SimpleImputer
impute = SimpleImputer(missing_values=np.nan,strategy='mean')
c = df.select_dtypes(np.number).columns
df[c] = impute.fit_transform(df[c])
print (df)
A B C D E F
0 a 4.0 7.0 1.0 5.0 a
1 b 5.0 8.0 3.0 3.0 a
2 c 4.0 9.0 5.0 6.0 a
3 d 5.0 4.0 7.0 9.0 b
4 e 5.0 6.2 1.0 5.4 b
5 f 4.0 3.0 0.0 4.0 b
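Alternatively, if you want to keep SimpleImputer for the whole frame, the ndarray it returns can be wrapped back into a DataFrame. A minimal sketch, assuming dfn is the all-numeric dataframe from the question:

from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

impute = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform returns a plain ndarray, so restore the column names and index
dfn_mean = pd.DataFrame(impute.fit_transform(dfn),
                        columns=dfn.columns, index=dfn.index)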

Pandas: replace outliers in all columns with nan

I have a data frame with 3 columns, for example:
c1,c2,c3
10000,1,2
1,3,4
2,5,6
3,1,122
4,3,4
5,5,6
6,155,6
I want to replace the outliers in all the columns which are outside 2 sigma. Using the below code, I can create a dataframe without the outliers.
df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 2).all(axis=1)]
c1,c2,c3
1,3,4
2,5,6
4,3,4
5,5,6
I can find the outliers for each column separately and replace them with "nan", but that would not be the best way, as the number of lines of code grows with the number of columns. There must be a better way of doing this. Maybe the boolean output from the above command could be used and the "TRUE" entries replaced with "nan".
Any suggestions, many thanks.
pandas
Use pd.DataFrame.mask
df.mask(df.sub(df.mean()).div(df.std()).abs().gt(2))
c1 c2 c3
0 NaN 1.0 2.0
1 1.0 3.0 4.0
2 2.0 5.0 6.0
3 3.0 1.0 NaN
4 4.0 3.0 4.0
5 5.0 5.0 6.0
6 6.0 NaN 6.0
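Note that DataFrame.mask keeps the original shape: the flagged cells are replaced with NaN while every row is kept, unlike the boolean row filtering in the question, which drops whole rows.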
numpy
v = df.values
# cells more than 2 standard deviations from their column mean
mask = np.abs((v - v.mean(0)) / v.std(0)) > 2
# rebuild the DataFrame with flagged cells replaced by NaN
pd.DataFrame(np.where(mask, np.nan, v), df.index, df.columns)
c1 c2 c3
0 NaN 1.0 2.0
1 1.0 3.0 4.0
2 2.0 5.0 6.0
3 3.0 1.0 NaN
4 4.0 3.0 4.0
5 5.0 5.0 6.0
6 6.0 NaN 6.0
lb = df.quantile(0.01)
ub = df.quantile(0.99)
df_new = df[(df < ub) & (df > lb)]
df_new
I am using a quantile-based method to detect outliers. First it calculates the lower and upper bounds of the df using the quantile function. Then, based on the condition that all values should lie between the lower and upper bounds, it returns a new df with the outlier values replaced by NaN.