Outlier detection with z-score, but with a dataframe

I wrote some code for outlier detection in Python, using the z-score method. You can see my data and my code below.
import numpy as np
import pandas as pd
from scipy import stats

data = [5, 10, 15, 20, 25, 30, 36, 22]
data.append(180)
data = pd.DataFrame(data, columns=["Data"])
z = np.abs(stats.zscore(data))
print(z)
print(np.where(z > 1.5))
I wrote this code to detect outliers. I actually wanted to get the indices of the values with a z-score higher than 1.5, but I think something is wrong with the output.
Data
0 0.649600
1 0.551506
2 0.453412
3 0.355318
4 0.257224
5 0.159130
6 0.041417
7 0.316080
8 2.783688
(array([8], dtype=int64), array([0], dtype=int64))
The 8th element's z-score is higher than 1.5 and it appears in the output, which is fine, but the 0th element's z-score is only 0.65, so why does 0 appear as well? What am I doing wrong?

You could do something like this:
import numpy as np
from scipy import stats
data = [5, 10, 15, 20, 25, 30, 36, 22]
data.append(180)
z = stats.zscore(data)
np.where(z > 1.5)[0]
output:
array([8])
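The second array in the question's output is the reason 0 appears: z there is a one-column DataFrame, so np.where on it returns a pair of (row indices, column indices), and the 0 is just the position of the "Data" column, not row 0. A minimal sketch that keeps the DataFrame but still yields plain row indices (assuming the data from the question):
import numpy as np
import pandas as pd
from scipy import stats

data = [5, 10, 15, 20, 25, 30, 36, 22, 180]
df = pd.DataFrame(data, columns=["Data"])

# zscore on the single column gives a 1-D array, so np.where returns
# only row indices instead of a (rows, columns) pair
z = np.abs(stats.zscore(df["Data"]))
print(np.where(z > 1.5)[0])  # [8]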

Related

How to calculate pearsonr (and correlation significance) with pandas groupby?

I would like to do a groupby correlation using pandas and pearsonr.
Currently I have:
df = pd.DataFrame(np.random.randint(0,10,size=(1000, 4)), columns=list('ABCD'))
df.groupby(['A','B'])[['C','D']].corr().unstack().iloc[:,1]
However I would like to calculate the correlation significance using pearsonr (scipy package) like this:
from scipy.stats import pearsonr
corr,pval= pearsonr(df['C'],df['D'])
How do I combine the groupby with the pearsonr, something like this:
corr,val=df.groupby(['A','B']).agg(pearsonr(['C','D']))
If I understand correctly, you need to perform Pearson's test between C and D for every combination of A and B.
To carry out this task you need to groupby(['A','B']) as you have already done. Your grouped dataframe is now a "set" of dataframes (one dataframe for each A,B combination), so you can apply stats.pearsonr to each of these dataframes through the apply method. To get two distinct columns for the test statistic (r, the correlation coefficient) and for the p-value, you can also wrap the output of pearsonr in a pd.Series.
from scipy import stats
df.groupby(['A','B']).apply(lambda d:pd.Series(stats.pearsonr(d.C, d.D), index=["corr", "pval"]))
The output is:
corr pval
A B
0 0 -0.318048 0.404239
1 0.750380 0.007804
2 -0.536679 0.109723
3 -0.160420 0.567917
4 -0.479591 0.229140
.. ... ...
9 5 0.218743 0.602752
6 -0.114155 0.662654
7 0.053370 0.883586
8 -0.436360 0.091069
9 -0.047767 0.882804
[100 rows x 2 columns]
Another piece of advice I can give you is to adjust the p-values to avoid false positives, since you are repeating the test several times:
corr_df["qval"] = p_adjust_bh(corr_df.pval)
I used the p_adjust_bh function from here (answer from @Eric Talevich).
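For reference, that helper implements a Benjamini-Hochberg adjustment; a minimal sketch of it looks roughly like this (see the linked answer for the original):
import numpy as np

def p_adjust_bh(p):
    # Benjamini-Hochberg FDR correction: scale each p-value by n/rank
    # (with p-values ranked in descending order) and enforce monotonicity
    p = np.asarray(p, dtype=float)
    by_descend = p.argsort()[::-1]
    by_orig = by_descend.argsort()
    steps = float(len(p)) / np.arange(len(p), 0, -1)
    q = np.minimum(1, np.minimum.accumulate(steps * p[by_descend]))
    return q[by_orig]
With that in scope, the corr_df["qval"] line above works as written.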

Pandas rounding

I have the following sample dataframe:
Market Value
0 282024800.37
1 317460884.85
2 1260854026.24
3 320556927.27
4 42305412.79
I am trying to round the values in this dataframe to the nearest whole number. Desired output:
Market Value
282024800
317460885
1260854026
320556927
42305413
I tried:
df.values.round()
and the result was
Market Value
282025000.00
317461000.00
1260850000.00
320557000.00
42305400.00
What am I doing wrong?
Thanks
This might be more appropriate as a comment, but I'm putting it here for proper formatting.
I can't reproduce your result. With numpy 1.18.1 and Pandas 1.1.0,
df.round().astype('int')
gives me:
Market Value
0 282024800
1 317460885
2 1260854026
3 320556927
4 42305413
The only thing I can think of is that you may have a 32-bit system, where
df.astype('float32').round().astype('int')
gives me
Market Value
0 282024800
1 317460896
2 1260854016
3 320556928
4 42305412
The following will keep your data intact as floats but will have it display/print to the nearest int.
Big caveat: this can only be applied to ALL dataframes at once (it is a pandas-wide option), not just to a single dataframe.
pd.set_option("display.precision", 0)
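A small illustration of the trade-off (the frame below is just a hypothetical stand-in for the question's data): the option changes how every dataframe prints, while the stored floats keep their decimals.
import pandas as pd

df = pd.DataFrame({"Market Value": [282024800.37, 317460884.85]})
pd.set_option("display.precision", 0)  # affects printing of ALL dataframes
print(df)             # printing now uses the pandas-wide 0-decimal precision
print(df.iloc[0, 0])  # the stored value is still 282024800.37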
If you like @noah's solution but don't want to have to change the settings back after you output something, you can use the following helper function:
import pandas as pd
from contextlib import contextmanager

@contextmanager
def temp_pandas_options(options):
    # temporarily set pandas options, restoring the old values on exit
    seen_options = set()
    old_values = {}
    if isinstance(options, dict):
        options_pairs = list(options.items())
    else:
        options_pairs = options
    for option, value in options_pairs:
        assert option not in seen_options, f"Already saw option {option}"
        seen_options.add(option)
        old_values[option] = pd.get_option(option)
        pd.set_option(option, value)
    yield
    for option, old_value in old_values.items():
        pd.set_option(option, old_value)
Then you can run
with temp_pandas_options({'display.float_format': '{:.0f}'.format}):
    print(market_value_df)
and get
Market value
0 282024800
1 317460885
2 1260854026
3 320556927
4 42305413

Pandas fill cells in a column with NaN values, derive the value from other cells in the row

I have a dataframe:
a b c
0 1 2 3
1 1 1 1
2 3 7 NaN
3 2 3 5
...
I want to fill column "three" in place (update the values) where the values are NaN, using a machine learning algorithm.
I don't know how to do it in place. Sample code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

df = pd.DataFrame([range(3), [1, 5, np.NaN], [2, 2, np.NaN], [4, 5, 9], [2, 5, 7]], columns=['a', 'b', 'c'])
x = []
y = []
for row in df.iterrows():
    index, data = row
    if not pd.isnull(data['c']):
        x.append(data[['a', 'b']].tolist())
        y.append(data['c'])
model = LinearRegression()
model.fit(x, y)
# this line does not do it in place.
df[~df.c.notnull()].assign(c=lambda x: model.predict(x[['a', 'b']]))
But this gives me a copy of the dataframe. The only option I have left is using a for loop; however, I don't want to do that. I think there should be a more pythonic way of doing it with pandas. Can someone please help? Or is there any other way of doing this?
You'll have to do something like:
df.loc[pd.isnull(df['three']), 'three'] = _result of model_
This modifies the dataframe df directly.
This way you first filter the dataframe to keep the slice you want to modify (pd.isnull(df['three'])), then from that slice you select the column you want to modify (three).
On the right-hand side of the equals sign, it expects an array/list/series with the same number of rows as the filtered dataframe (in your example, one row).
You may have to adjust depending on what your model returns exactly.
EDIT
You probably need to do something like this:
pred = model.predict(df[['a', 'b']])
df['pred'] = model.predict(df[['a', 'b']])
df.loc[pd.isnull(df['c']), 'c'] = df.loc[pd.isnull(df['c']), 'pred']
Note that a significant part of the issue comes from the way you are using scikit-learn in your example. You need to pass the whole dataset to the model when you predict.
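If you prefer not to add a pred column at all, a minimal variation of the same idea (assuming the df and fitted model from the question) would be:
mask = pd.isnull(df['c'])
# predict only for the rows where 'c' is missing and write them back in place
df.loc[mask, 'c'] = model.predict(df.loc[mask, ['a', 'b']])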
The simplest way is to transpose first, then forward fill / backward fill at your convenience:
df.T.ffill().bfill().T

How to find ngram frequency of a column in a pandas dataframe?

Below is the input pandas dataframe I have.
I want to find the frequency of unigrams & bigrams. A sample of what I am expecting is shown below.
How can I do this using nltk or scikit-learn?
I wrote the code below, which takes a string as input. How do I extend it to a series/dataframe?
import nltk
from nltk.collocations import *

desc = 'john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.ngram_fd.viewitems()
If your data is like
import pandas as pd

df = pd.DataFrame([
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid'], columns=['description'])
You could use the CountVectorizer from the sklearn package:
from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['frequency'])
Which gives you:
frequency
good 3
pathetic 1
average movie 1
movie bad 1
watch 1
good movie 1
watch good 1
good acting 2
must 1
movie good 1
pathetic avoid 1
bad acting 1
average 1
must watch 1
acting 3
bad 1
movie 2
avoid 2
EDIT
fit will just "train" your vectorizer: it splits the words of your corpus and creates a vocabulary from them. Then transform can take a new document and create a vector of frequencies based on the vectorizer's vocabulary.
Here your training set is the same set you want frequencies for, so you can do both at the same time (fit_transform). Because you have 5 documents, it creates a matrix of 5 vectors. Since you want one global vector, you have to take the sum.
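For illustration, a quick sketch of the two-step form, where the new document is just made up:
# fit learns the vocabulary from the existing corpus...
word_vectorizer.fit(df['description'])
# ...and transform counts n-grams of any new text against that same vocabulary
new_counts = word_vectorizer.transform(['good acting, must watch'])
print(new_counts.toarray())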
EDIT 2
For big dataframes, you can speed up the frequencies computation by using:
frequencies = sum(sparse_matrix).data
or
frequencies = sparse_matrix.sum(axis=0).T
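Note that sparse_matrix.sum(axis=0) gives back a 1 x n matrix rather than a flat array, so if you want to rebuild the same frequency table you may need something along these lines (a sketch, reusing the objects defined above):
import numpy as np

# flatten the 1 x n summed matrix into a plain 1-D array of counts
frequencies = np.asarray(sparse_matrix.sum(axis=0)).ravel()
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['frequency'])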

Log values by SFrame column

Please, can anybody tell me how I can take the logarithm of every value in an SFrame (graphlab) or DataFrame (pandas) column, without iterating through the whole length of the column?
I am especially interested in functionality similar to the GroupBy aggregators, but for the log function. I couldn't find it myself...
Important: I am not interested in a for-loop iteration over the whole length of the column. I am only interested in a specific function which transforms all values of the whole column to their log values.
I'm also very sorry if this function is in the manual. Please just give me a link...
numpy provides implementations of a wide range of basic mathematical transformations. You can use them on all data structures that build on numpy's ndarray.
import pandas as pd
import numpy as np
data = pd.Series([np.exp(1), np.exp(2), np.exp(3)])
np.log(data)
Outputs:
0 1
1 2
2 3
dtype: float64
This example is for pandas data types, but it works for all data structures that are based on numpy arrays.
The same "apply" pattern works for SFrames as well. You could do:
import graphlab
import math
sf = graphlab.SFrame({'a': [1, 2, 3]})
sf['b'] = sf['a'].apply(lambda x: math.log(x))
@cel
I think in my case it would also be possible to use the following pattern.
import numpy
import pandas
import graphlab
df
a b c
1 1 1
1 2 3
2 1 3
....
df['log c'] = df.groupby('a')['c'].apply(lambda x: numpy.log(x))
For an SFrame (an sf object instead of a df) it could look a little different:
logvals = numpy.log(sf['c'])
log_sf = graphlab.SFrame(logvals)
sf = sf.join(log_sf, how = 'outer')
With numpy the code fragment is probably a little bit too long, but it works...
The main problem is of course time performance. I hoped I could find some specific function to minimize my time...