print(n1)
print(n2)
print(type(n1), type(n2))
print(scipy.stats.spearmanr(n1, n2))
print(n1.corr(n2, method="spearman"))
0 2317.0
1 2293.0
2 1190.0
3 972.0
4 1391.0
Name: r6000, dtype: float64
0.0 2317.0
1.0 2293.0
3.0 1190.0
4.0 972.0
5.0 1391.0
Name: 6000, dtype: float64
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
SpearmanrResult(correlation=0.9999999999999999, pvalue=1.4042654220543672e-24)
0.7999999999999999
The problem is that scipy was reporting a different correlation value than pandas.
Edit to add:
The issue is the indexes are off. Pandas does automatic intrinsic data alignment, but scipy doesn't. I've answered it below.
Pandas doesn't have a function that calculates p-values, so it is better to use SciPy to calculate the correlation, since it gives you both the p-value and the correlation coefficient. The other alternative is to calculate the p-value yourself... using SciPy. Note one thing: if you calculate the correlation on a sample of your data with pandas, there is a high risk that the correlation changes when you change your sample. This is why you need the p-value.
I made a copy and called reset_index() on the series before correlating them. That fixed it.
The issue is pandas' intrinsic automatic data alignment based on the indexes.
The scipy library doesn't do automatic data alignment; it most likely just converts the input to a numpy array.
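Here is a minimal sketch of the fix, with the two series reconstructed from the printed output above:
import pandas as pd
import scipy.stats

# Series reconstructed from the output above; note the mismatched indexes.
n1 = pd.Series([2317.0, 2293.0, 1190.0, 972.0, 1391.0], name="r6000")
n2 = pd.Series([2317.0, 2293.0, 1190.0, 972.0, 1391.0],
               index=[0.0, 1.0, 3.0, 4.0, 5.0], name="6000")

# reset_index(drop=True) discards the old index, so both libraries now see
# the values in the same positional order.
n1 = n1.reset_index(drop=True)
n2 = n2.reset_index(drop=True)

print(scipy.stats.spearmanr(n1, n2))     # position-based, ignores the index
print(n1.corr(n2, method="spearman"))    # pandas now aligns positionally too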
I am trying to assign value to a column for all rows selected based on a condition. Solutions for achieving this are discussed in several questions like this one.
The standard solutions are of the following form:
df.loc[row_mask, cols] = assigned_val
Unfortunately, this standard solution takes forever. In fact, in my case, I didn't manage to get even one assignment complete.
Update: More info about my dataframe: I have ~2 Million rows in my dataframe and I am trying to update the value of one column in my dataframe for rows that are selected based on a condition. On average, the selection condition is satisfied by ~10 rows.
Is it possible to speed up this assignment operation? Also, are there any general guidelines for multiple assignments with pandas in general.
I believe the difference between .loc and .at is what you're looking for; .at is meant to be faster, based on this answer.
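For example, a minimal sketch with made-up column names and condition (.at assigns one scalar at a time, so you loop over the matching index labels):
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.randint(0, 100, 1000),
                   "B": np.random.randint(0, 100, 1000)})

# Loop over the index labels of the rows that satisfy the condition
# and assign the new value one scalar at a time with .at.
for idx in df[df["A"] < 5].index:
    df.at[idx, "B"] = 0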
You could give np.where a try.
Here is a simple example of np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df['B'] = np.where(df['B']< 50, 100000, df['B'])
The question "np.where() do nothing if condition fails" has another example.
In your case, it might be
df[col] = np.where(df[col]==row_condition, assigned_val, df[col])
I was thinking it might be a little quicker because it is going straight to numpy instead of going through pandas to the underlying numpy mechanism. This article talks about Pandas vs Numpy on large datasets: https://towardsdatascience.com/speed-testing-pandas-vs-numpy-ffbf80070ee7#:~:text=Numpy%20was%20faster%20than%20Pandas,exception%20of%20simple%20arithmetic%20operations.
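If you want to compare, here's a rough timing sketch with hypothetical column names and condition; the actual numbers will depend entirely on your data:
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.randint(0, 100, 2_000_000),
                   "B": np.random.randint(0, 100, 2_000_000)})

start = time.perf_counter()
df.loc[df["A"] < 5, "B"] = 0                     # standard .loc assignment
print("loc:     ", time.perf_counter() - start)

start = time.perf_counter()
df["B"] = np.where(df["A"] < 5, 0, df["B"])      # np.where alternative
print("np.where:", time.perf_counter() - start)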
I have created a dask dataframe from geopandas futures that each yield a pandas dataframe following the example here: https://gist.github.com/mrocklin/e7b7b3a65f2835cda813096332ec73ca
daskdf = dd.from_delayed(lazy_dataframes, meta=lazy_dataframes[0].compute())
All dtypes seem reasonable
daskdf.dtypes
left float64
bottom float64
right float64
top float64
score object
label object
height float64
area float64
geometry geometry
shp_path object
geo_index object
Year int64
Site object
dtype: object
but the dask groupby operation fails:
daskdf.groupby(['Site']).height.mean().compute()
...
"/Users/ben/miniconda3/envs/crowns/lib/python3.7/site-packages/dask/dataframe/utils.py", line 577, in _nonempty_series
data = np.array([entry, entry], dtype=dtype)
builtins.TypeError: data type not understood
whereas pandas has no problem with the same process on the same data.
daskdf.compute().groupby(['Site']).height.mean()
Site
SOAP 15.102355
Name: height, dtype: float64
What might be happening with the metadata types that could cause this? As I scale my workflow, I would like to perform distributed operations on persisted data.
The problem is the 'geometry' dtype, which comes from geopandas. My pandas dataframes came from loading a shapefile with geopandas.read_file(). Future users beware: drop this column when creating a dask dataframe. I know there was a dask-geopandas attempt some time ago. This was harder to track down because the statement
daskdf.groupby(['Site']).height.mean().compute()
does not involve the geometry column. Dask must check the dtypes of all columns, not just the ones used in an operation. Be careful!
Dropping the geometry column yields the expected result.
daskdf = daskdf.drop(columns="geometry")
daskdf.groupby(['Site']).height.mean().compute()
Tagging with geopandas in hopes future users can find this.
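Here is a minimal sketch of the workaround; shapefile_paths is a placeholder list of paths, not something from the question:
import dask
import dask.dataframe as dd
import geopandas
import pandas as pd

@dask.delayed
def load_frame(path):
    gdf = geopandas.read_file(path)
    # Drop the geopandas-specific 'geometry' column so dask's metadata
    # inference only sees plain pandas dtypes.
    return pd.DataFrame(gdf.drop(columns="geometry"))

lazy_dataframes = [load_frame(p) for p in shapefile_paths]
daskdf = dd.from_delayed(lazy_dataframes, meta=lazy_dataframes[0].compute())
daskdf.groupby(["Site"]).height.mean().compute()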
I'm currently working on a regression problem and facing some issues with model performance. In the hope of getting better performance, I have some outliers that I'd like to remove.
Problem: Remove outliers from a dataframe containing different types.
The DF looks like:
df.dtypes
CONTRACT_TYPE object
CONTRACT_COC object
ORIGINATION_DATE datetime64[ns]
MATURITY_DATE datetime64[ns]
ORIGINAL_TERM float64
REMAINING_TERM int64
INTEREST_RATE_INTERNAL float64
INTEREST_RATE_FUNDING float64
However, after trying the code shown below without success, and the z-score approach as well, I'm asking for some help.
# Computing IQR
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df_out = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
To summarize, I'd like to see in the plots (scatter, boxplot) a more 'normal' distribution, with fewer outliers or none at all.
Please, do not hesitate if you need more information.
First of all, I assume that your data distribution is Normal.
Here is a great strategy for removing outliers.
Make a pandas DataFrame with all the numeric features that have outliers.
Use sklearn.preprocessing.StandardScaler on your DataFrame. It standardizes features by removing the mean and scaling to unit variance. The implementation is as easy as follows:
# Imports needed for this snippet
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Declare sklearn StandardScaler
standard_scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
# Fitting
standard_scaler.fit(x_train_df)
# Transforming (transform returns a numpy array, so wrap it back into a DataFrame)
x_train_normal_scaled_df = pd.DataFrame(
    standard_scaler.transform(x_train_df), columns=x_train_df.columns)
# Fitting and transforming together
x_train_normal_scaled_df = pd.DataFrame(
    standard_scaler.fit_transform(x_train_df), columns=x_train_df.columns)
# Inverting the transformed data back
x_train_df = pd.DataFrame(
    standard_scaler.inverse_transform(x_train_normal_scaled_df),
    columns=x_train_df.columns)
print(x_train_normal_scaled_df.describe())
x_train_normal_scaled_df.plot()
You should find out how much of your data consists of outliers; the empirical rule of the normal distribution can help here.
In practice, I keep the data within three standard deviations of the mean as my main data and treat anything outside that range as an outlier. For a normal distribution, that range contains about 99.73% of the data.
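As a minimal sketch (df being the dataframe from your question), the three-standard-deviation filter applied only to the numeric columns could look like this:
import numpy as np

# Restrict the filter to the numeric columns of the question's dataframe.
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Keep rows whose numeric values all lie within 3 standard deviations of the mean.
z = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
df_out = df[(z.abs() <= 3).all(axis=1)]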
I have two numpy arrays, one of which contains about 1% NaNs.
a = np.array([-2, 5, np.nan, 6])
b = np.array([2,3,1,0])
I'd like to compute the mean squared error of a and b using sklearn's mean_squared_error.
So my question is, what's the pythonic way of removing all NaNs from a while at the same time deleting all corresponding entries from b as efficiently as possible?
You can simply use vanilla NumPy's np.nanmean for this purpose:
In [136]: np.nanmean((a-b)**2)
Out[136]: 18.666666666666668
If this didn't exist, or you really wanted to use the sklearn method, you could create a mask to index the NaNs:
In [148]: mask = ~np.isnan(a)
In [149]: mean_squared_error(a[mask], b[mask])
Out[149]: 18.666666666666668
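For reference, a self-contained version of the snippets above:
import numpy as np
from sklearn.metrics import mean_squared_error

a = np.array([-2, 5, np.nan, 6], dtype=float)
b = np.array([2, 3, 1, 0], dtype=float)

print(np.nanmean((a - b) ** 2))              # 18.666..., NaN entries ignored

mask = ~np.isnan(a)                          # True where a is not NaN
print(mean_squared_error(a[mask], b[mask]))  # 18.666..., same entries dropped from b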
I've been rewriting a matlab/octave program into numpy and ran across a difference in some resultant values.
This occurs with both the percentile/prctile and the standard-deviation functions.
In Numpy:
import matplotlib.mlab as ml
import numpy
>>> t = numpy.linspace(0,100, 100)
>>> numpy.percentile(t,95)
95.0
>>> numpy.std(t)
29.157646512850626
>>> ml.prctile(t,95)
95.000000000000014
In Octave:
octave:1> t = linspace(0,100,100)';
octave:2> prctile(t,95)
ans = 95.454545
octave:3> std(t)
ans = 29.304537
Although the array values of 't' are the same, the results differ more than I would expect.
In the numpy help(numpy.std) they specifically mention that the algorithm is:
std = sqrt(mean(abs(x - x.mean())**2))
I implemented that in Octave and got the exact answer numpy gives, so it seems the standard-deviation functions differ.
But why/how? And which is correct? (if there is such a thing)
And even prctile/percentile?
Just in case it matters, I'm on Linux (aptosid)...
GNU Octave, version 3.6.2
numpy.version '1.6.2rc1'
Numpy simply uses a different algorithm when the percentile lies between two data points. Octave, Matlab and R always center it exactly between two points when needed (I believe); numpy does a bit more than that. If you check http://en.wikipedia.org/wiki/Percentile you will see there are a couple of ways to calculate percentiles.
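As a sketch with newer NumPy (>= 1.22), the different definitions are exposed through the method argument of np.percentile; 'hazen' is the midpoint-based definition that appears to match Octave/Matlab here, though that mapping is my assumption:
import numpy as np

t = np.linspace(0, 100, 100)
print(np.percentile(t, 95))                   # default 'linear' -> 95.0
print(np.percentile(t, 95, method="hazen"))   # midpoint definition -> ~95.4545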
It seems like Octave assumes ddof=1, at least by default, and numpy uses 0 by default:
>>> numpy.std(t, ddof=0)
29.157646512850633
>>> numpy.std(t, ddof=1)
29.304537349375785
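A small sketch makes the ddof difference explicit (ddof is subtracted from the divisor):
import numpy as np

t = np.linspace(0, 100, 100)
n = t.size

# ddof=0: divide by n (population formula, numpy's default)
print(np.sqrt(np.sum((t - t.mean()) ** 2) / n))        # 29.157...
# ddof=1: divide by n - 1 (sample formula, Octave/Matlab's default std)
print(np.sqrt(np.sum((t - t.mean()) ** 2) / (n - 1)))  # 29.304...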