print multiple random permutations - numpy

I am trying to do multiple permutations. From code:
# generate random Gaussian values
from numpy.random import seed
from numpy.random import randn
# seed random number generator
seed(1)
# generate some Gaussian values
values = randn(100)
print(values)
But now I would like to generate, for example, 20 permutations (of values).
With code:
import numpy as np
import random
from itertools import permutations
result = np.random.permutation(values)
print(result)
I can only observe one permutation (or "manually" get others). I wish I had many permutations (20 or more) and so automatically calculate the Durbin-Watson statistic for each permutation (from values).
from statsmodels.stats.stattools import durbin_watson
sm.stats.durbin_watson(np.random.permutation(values))
How can I do?

To get 20 permutations out of some collection, intialize the itertools.permutations iterator and then use next() to take the first twenty:
import numpy as np
import itertools as it
x = np.random.random(100) # 100 random floats
p = it.permutations(x) # an iterator which produces permutations (DON'T TRY TO CALCULATE ALL OF THEM)
first_twenty_permutations = [next(p) for _ in range(20)]
Of course, these won't be random permutations (i.e., they are calculated in an organized manner, try with it.permutations("abcdef") and you'll see what I mean). If you need random permutations, you can use np.random.permutation much in the same way:
[np.random.permutation(x) for _ in range(20)]
To then calculate the Durbin Watson statistic:
permutations = np.array([np.random.permutation(x) for _ in range(20)])
np.apply_along_axis(durbin_watson, axis=1, arr=permutations)

Related

Fastest way to find nearest nonzero value in array from columns in pandas dataframe

I am looking for the nearest nonzero cell in a numpy 3d array based on the i,j,k coordinates stored in a pandas dataframe. My solution below works, but it is slower than I would like. I know my optimization skills are lacking, so I am hoping someone can help me find a faster option.
It takes 2 seconds to find the nearest non-zero for a 100 x 100 x 100 binary array, and I have hundreds of files, so any speed enhancements would be much appreciated!
a=np.random.randint(0,2,size=(100,100,100))
# df with i,j,k of interest
df=pd.DataFrame(np.random.randint(100,size=(100,3)).tolist(),
columns=['i','j','k'])
def find_nearest(a,df):
import numpy as np
import pandas as pd
import time
t0=time.time()
nzi = np.nonzero(a)
for i,r in df.iterrows():
dist = ((r['k'] - nzi[0])**2 + \
(r['i'] - nzi[1])**2 + \
(r['j'] - nzi[2])**2)
nidx = dist.argmin()
df.loc[i,['nk','ni','nj']]=(nzi[0][nidx],
nzi[1][nidx],
nzi[2][nidx])
print(time.time()-t0)
return(df)
The problem that you are trying to solve looks like a nearest-neighbor search.
The worst-case complexity of the current code is O(n m) with n the number of point to search and m the number of neighbour candidates. With n = 100 and m = 100**3 = 1,000,000, this means about hundreds of million iterations. To solve this efficiently, one can use a better algorithm.
The common way to solve this kind of problem consists in putting all elements in a BSP-Tree data structure (such as Quadtree or Octree. Such a data structure helps you to locate the nearest elements near a location in a O(log(m)) time. As a result, the overall complexity of this method is O(n log(m))! SciPy already implement KD-trees.
Vectorization generally also help to speed up the computation.
def find_nearest_fast(a,df):
from scipy.spatial import KDTree
import numpy as np
import pandas as pd
import time
t0=time.time()
candidates = np.array(np.nonzero(a)).transpose().copy()
tree = KDTree(candidates, leafsize=1024, compact_nodes=False)
searched = np.array([df['k'], df['i'], df['j']]).transpose()
distances, indices = tree.query(searched)
nearestPoints = candidates[indices,:]
df[['nk', 'ni', 'nj']] = nearestPoints
print(time.time()-t0)
return df
This implementation is 16 times faster on my machine. Note the results differ a bit since there are multiple nearest points for a given input point (with the same distance).

matplotlib - seaborn - the numbers on the correlation plots are not readable

The plot below shows the correlation for one column. The problem is that the numbers are not readable, because there are many columns in it.
How is it possible to show only 5 or 6 most important columns and not all of them with very low importance?
plt.figure(figsize=(20,3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:].T, annot=True,
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
You can limit the cells shown via .iloc[1:7]. If you also want to show the highest negative values, you could create a second plot with .iloc[-6:]. To have both together, you could use numpy's slicing function and write .iloc[np.r_[1:4, -3:0]].
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.rand(7, 27), columns=['price'] + [*'abcdefghijklmnopqrstuvwxyz'])
plt.figure(figsize=(20, 3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:7].T,
annot=True, annot_kws={'rotation':90, 'size': 20},
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
plt.show()
annot can also be a list of labels. Using this, you can define a string matrix that you use to display the desired numbers and set the others to an empty string.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns; sns.set_theme()
import pandas as pd
from string import ascii_letters
# generate random data
rs = np.random.RandomState(33)
df = pd.DataFrame(data=rs.normal(size=(100, 26)),
columns=list(ascii_letters[26:]))
importance_index = 5 # until which idx to hide values
data = df.corr()[['A']].sort_values('A', ascending=False).iloc[1:].T
labels = data.astype(str) # make a str-copy
labels.iloc[0,:importance_index] = ' ' # mask columns that you want to hide
sns.heatmap(data, annot=labels, cmap='Spectral_r', vmax=0.9, vmin=-0.31, fmt='', annot_kws={'rotation':90})
plt.show()
The output on some random data:
This works but it has its limits, particulary with setting fmt='' (can't use it to conveniently format decimals anymore, need to do it manually now). I would also question whether your approach is even the best one to take here. I think consistency in plots is quite important. I would rather evaluate if we can't rotate the heatmap labels (I've included it above) or leave them out completely since it is technically redundant due to the color-coding. Alternatively, you could only plot the cells with the "important" values.

What is the difference between doing a regression with a dataframe and ndarray?

I would like to know why would I need to convert my dataframe to ndarray when doing a regression, since I get the same result for intercept and coef when I do not convert it?
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline
# import data and create dataframe
!wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumption.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# Split train/ test data
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
# Modeling
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
**# if I use the dataframe, train[['ENGINESIZE']] for 'x', and train[['CO2EMISSIONS']] for 'y'
below, I get the same result**
regr.fit (train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
Thank you very much!
So df is the loaded dataframe, cdf is another frame with selected columns, and train is selected rows.
train[['ENGINESIZE']] is a 1 column dataframe (I believe train['ENGINESIZE'] would be a pandas Series).
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though
np.asanyarray(train[['ENGINESIZE']])
is supposed to do the same thing.
Digging down through the regr.fit code I see that it calls sklearn.utils.check_X_y which in turn calls sklearn.tils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a nice array from the dataframe, there's no harm in do that either. Either way the fit is done with arrays, derived from the dataframe.

Using SMOTE with NaN values

Is there a way one can use SMOTE with NaNs?
Here is a dummy prog to try using SMOTE in presence of NaN values
# Imports
from collections import Counter
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import Imputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.combine import SMOTEENN
# Load data
bc = load_breast_cancer()
X, y = bc.data, bc.target
# Initial number of samples per class
print('Number of samples for both classes: {} and {}.'.format(*Counter(y).values()))
# SMOTEd class distribution
print('Dataset has %s missing values.' % np.isnan(X).sum())
_, y_resampled = SMOTE().fit_sample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_resampled).values()))
# Generate artificial missing values
X[X > 1.0] = np.nan
print('Dataset has %s missing values.' % np.isnan(X).sum())
#_, y_resampled = make_pipeline(Imputer(), SMOTE()).fit_sample(X, y)
sm = SMOTE(ratio = 'auto',k_neighbors = 5, n_jobs = -1)
smote_enn = SMOTEENN(smote = sm)
x_train_res, y_train_res = smote_enn.fit_sample(X, y)
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_resampled).values()))
I get the following output/error:
Number of samples for both classes: 212 and 357.
Dataset has 0 missing values.
Number of samples for both classes: 357 and 357.
Dataset has 6051 missing values.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
You already include the answer. Notice that fit_resample is used instead of fit_sample. You should use the make_pipeline as follows:
# Imports
import numpy as np
from collections import Counter
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from imblearn.combine import SMOTEENN
# Load data
bc = load_breast_cancer()
X, y = bc.data, bc.target
X[X > 1.0] = np.nan
# Over-sampling
smote = SMOTE(ratio='auto',k_neighbors=5, n_jobs=-1)
smote_enn = make_pipeline(SimpleImputer(), SMOTEENN(smote=smote))
_, y_res = smote_enn.fit_resample(X, y)
# Class distribution
print('Number of samples for both classes: {} and {}.'.format(*Counter(y_res).values()))
Check also your imbalanced-learn version.
Generally no, SMOTE is preparing a data set for further model fitting.
Usual models (like random forest, etc.) do not work with NAin the label variable, because what are you actually predicting here? The same goes for NAin the predictor variables where most algorithms either do not work or simply ignore cases with NA.
So the error is pretty much by design because you cannot and should not have missing values in your training data set for the algorithm and logically you do not want to "balance" cases with missing values, you only want to SMOTE cases with valid labels.
If you feel that missing labels still represent valid information that should be balanced (e.g. you actually want to oversample the NAclass because you think it is underpreresented), then it should not be a missing value but rather a defined value called "Unknown" or something else, indicating a known class with the characteristic of "NA" but I do not really see any research questions where this makes sense.
Update 1:
Another way to go is to impute the missing values first so that you actually have three steps in fitting your model:
Imputing missing values (using MICE or similar)
SMOTE to balance training set
Fit algorithm/model

How to simulate random returns with numpy

What is a quick way to simulate random returns. I'm aware of numpy.random. However, that doesn't guide me towards how to model asset returns.
I've tried:
import numpy as np
r = np.random.rand(100)
But this doesn't feel accurate. How are others dealing doing this?
I'd suggest one of two approaches:
One: Assume returns are normally distributed with mean equal to 0.1% and stadard deviation about 1%. This looks like:
import numpy as np
np.random.seed(314)
r = np.random.randn(100) / 100 + 0.001
seed(314) sets the random number generator at a specific point so that if we both use the same seed, we should see the same results.
randn pulls from the normal distribution.
I'd also recommend using pandas. It's a library that implements a DataFrame object similar to R
import pandas as pd
df = pd.DataFrame(r)
You can then plot the cumulative returns like this:
df.add(1).cumprod().plot()
Two:
The second way is to assume returns are log normally distributed. That means the log(r) is normal. In this scenario, we pull normally distributed random numbers and then use those values as the exponent of e. It looks like this:
r = np.exp(np.random.randn(100) / 100 + 0.001) - 1
If you plot it, it looks like this:
pd.DataFrame(r).add(1).cumprod().plot()