statsmodels: IntegrationWarning: The maximum number of subdivisions (50) has been achieved

Trying to plot a CDF with seaborn, I ran into this warning:
../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
args=endog)[0] for i in range(1, gridsize)]
Some minutes after pressing the return key
../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The integral is probably divergent, or slowly convergent.
args=endog)[0] for i in range(1, gridsize)]
Code:
plt.figure()
plt.title('my distribution')
plt.ylabel('CDF')
plt.xlabel('x-labelled')
sns.kdeplot(data,cumulative=True)
plt.show()
If it could be of help:
print(len(data))
4360700
Sample data:
print(data[:10])
[ 0.00362846 0.00123409 0.00013711 -0.00029235 0.01515175 0.02780404
0.03610236 0.03410224 0.03887933 0.0307084 ]
I have no idea what the subdivisions are. Is there a way to increase the limit?

A kde plot is created by summing one Gaussian bell curve for every data point. Summing 4 million curves creates memory and performance problems, which might cause some functions to fail. The exact error message can be very cryptic.
The easiest way to work around the problem is to subsample the data: for a more or less smooth distribution, the kde (and the cumulative kde, or cdf) will look very similar whether the data is subsampled or not. Subsampling every 100th entry is easy using slicing: data[::100].
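Slicing with a fixed stride assumes the array has no ordering or periodic pattern. If that is uncertain, a random subsample is a possible alternative. The sketch below is only illustrative: the normal sample stands in for the real data, and the 1% fraction and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed only for reproducibility
data = rng.normal(size=1_000_000)     # stand-in for the real array

# Draw 1% of the points at random, without replacement.
n_sub = data.size // 100
subsample = rng.choice(data, size=n_sub, replace=False)

print(subsample.size)
```

The resulting subsample can be passed to sns.kdeplot in place of data[::100].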
Alternatively, with that many data, the "real" cdf can be drawn by plotting the sorted data versus N evenly spaced numbers from 0 to 1. (Where N is the number of data points.)
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
N = 1000000
data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
sns.kdeplot(data[::100], cumulative=True, color='g', label='cumulative kde')
q = np.linspace(0, 1, data.size)
data.sort()
plt.plot(data, q, ':r', lw=2, label='cdf from sorted data')
plt.legend()
plt.show()
Note that, in a similar though slightly more involved way, you can draw a "more honest" kde from the differences of a large enough array of sorted data. np.interp interpolates the quantiles onto a regularly spaced x-axis. As the raw differences are rather jagged, some smoothing is needed.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm
N = 1000000
data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
sns.kdeplot(data[::100], cumulative=False, color='g', label='kde')
p = np.linspace(0, 1, data.size)
data.sort()
x = np.linspace(data.min(), data.max(), 1000)
y = np.interp(x, data, p)
# use lowess filter to smoothen the curve
lowess = sm.nonparametric.lowess(np.diff(y) * 1000 / (data.max() - data.min()), (x[:-1] + x[1:]) / 2, frac=0.05)
plt.plot(lowess[:, 0], lowess[:, 1], '-r', label='smoothed diff of sorted data')
# plt.plot((x[:-1]+x[1:])/2,
# np.convolve(np.diff(y), np.ones(20)/20, mode='same')*1000/(data.max() - data.min()),
# label='test np.diff')
plt.legend()
plt.show()
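As for what the "subdivisions" are: the warning text is the one emitted by scipy.integrate.quad, which the statsmodels line in the traceback appears to call once per grid point to integrate the density. quad splits the integration interval adaptively, and its limit parameter (default 50) caps the number of subintervals. The sketch below only shows that parameter on a toy integrand; it is not the statsmodels code path, and whether raising the limit would help with 4 million points is doubtful.

```python
import numpy as np
from scipy import integrate

# Toy integrand (cos on [0, pi/2]); the KDE from the question is not used here.
# limit=200 raises the cap on adaptive subintervals from its default of 50.
value, abs_err = integrate.quad(np.cos, 0.0, np.pi / 2, limit=200)

print(value, abs_err)
```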

Related

polynomial fitting of a signal and plotting the fitted signal

I am trying to find a polynomial that fits my function (signal). I am using numpy.polynomial.polynomial.Polynomial.fit to compute the coefficients. After generating the coefficients, I want to put them back into the polynomial equation, get the corresponding y-values, and plot them on the graph. But I am not getting what I expect (the orange line). What am I doing wrong here?
Thanks.
import math
def getYValueFromCoeff(f,coeff_list): # low to high order
    y_plot_values=[]
    for j in range(len(f)):
        item_list= []
        for i in range(len(coeff_list)):
            item= (coeff_list[i])*((f[j])**i)
            item_list.append(item)
        y_plot_values.append(sum(item_list))
    print(len(y_plot_values))
    return y_plot_values
from numpy.polynomial import Polynomial as poly
import numpy as np
import matplotlib.pyplot as plt
no_of_coef= 10
#original signal
x = np.linspace(0, 0.01, 10)
period = 0.01
y = np.sin(np.pi * x / period)
#poly fit
test1= poly.fit(x,y,no_of_coef)
coeffs= test1.coef
#print(test1.coef)
coef_y= getYValueFromCoeff(x, test1.coef)
#print(coef_y)
plt.plot(x,y)
plt.plot(x, coef_y)

If you check out the documentation, consider the two properties poly.domain and poly.window. To avoid numerical issues, the range poly.domain = [x.min(), x.max()] of the independent variable (x) passed to fit() is normalized to poly.window = [-1, 1]. This means the coefficients you get from poly.coef apply to this normalized range. You can change this behaviour (sacrificing numerical stability) by setting the window equal to the domain, which will make your curves match:
...
test1 = poly.fit(x, y, deg=no_of_coef, window=[x.min(), x.max()])
...
But unless you have a good reason to do that, I'd stick to the default behaviour of fit().
As a side note: Evaluating polynomials or lists of coefficients is already implemented in numpy, e.g. using directly
coef_y = test1(x)
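or alternatively, for coefficients that apply to the unscaled x, using numpy.polynomial.polynomial.polyval (coefficients from low to high order) or the older np.polyval (coefficients from high to low order).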
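If the coefficients in terms of the original x are needed while keeping the default, numerically stable fit, one option is to call convert() on the fitted polynomial. The sketch below is a minimal illustration; the degree of 3 and the 50 sample points are assumptions chosen to keep the fit well conditioned, not the values from the question.

```python
import numpy as np
from numpy.polynomial import Polynomial
from numpy.polynomial import polynomial as P

x = np.linspace(0, 0.01, 50)
y = np.sin(np.pi * x / 0.01)

fit = Polynomial.fit(x, y, deg=3)   # fit.coef refers to the window [-1, 1]
unscaled = fit.convert()            # same polynomial, coefficients in terms of x

y_direct = fit(x)                        # evaluation with the domain mapping applied
y_manual = P.polyval(x, unscaled.coef)   # plain sum of c[i] * x**i, low to high order

print(np.max(np.abs(y_direct - y_manual)))
```

Feeding unscaled.coef into a hand-written evaluator such as getYValueFromCoeff should then reproduce the fitted curve, up to rounding.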

I always like to see original solutions to problems, and I urge you to keep pursuing that, as it is the best way to learn how to fit functions programmatically. I also want to provide a solution that is more tailored towards a standard numpy implementation. Your custom function is nearly right. The only issue is that np.polyfit returns the coefficients from high to low order, while you were counting up in powers from 0 to the highest power. Counting down from the highest power to 0 lets your function give the correct result. Notice how your function overlays perfectly with np.polyval.
import numpy as np
import matplotlib.pyplot as plt
def getYValueFromCoeff(f,coeff_list): # high to low order, as returned by np.polyfit
    y_plot_values=[]
    for j in range(len(f)):
        item_list= []
        for i in range(len(coeff_list)):
            item= (coeff_list[i])*((f[j])**(len(coeff_list)-i-1))
            item_list.append(item)
        y_plot_values.append(sum(item_list))
    print(len(y_plot_values))
    return y_plot_values
no_of_coef = 10
#original signal
x = np.linspace(0, 0.01, 10)
period = 0.01
y = np.sin(np.pi * x / period)
#poly fit
coeffs = np.polyfit(x,y,no_of_coef)
coef_y = np.polyval(coeffs,x)
COEF_Y = getYValueFromCoeff(x,coeffs)
plt.figure()
plt.plot(x,y)
plt.plot(x, coef_y)
plt.plot(x, COEF_Y)
plt.legend(['Original Function', 'Fitted Function', 'Custom Fitting'])
plt.show()

Here's the simple way of doing it if you didn't know that already...
import math
from numpy.polynomial import Polynomial as poly
import numpy as np
import matplotlib.pyplot as plt
no_of_coef= 10
#original signal
x = np.linspace(0, 0.01, 10)
period = 0.01
y = np.sin(np.pi * x / period)
#poly fit
test1= poly.fit(x,y,no_of_coef)
plt.plot(x, y, 'r', label='original y')
x = np.linspace(0, 0.01, 1000)
plt.plot(x, test1(x), 'b', label='y_fit')
plt.legend()
plt.show()

Cluster groups continuously instead of discrete - python

I'm trying to cluster a group of points in a probabilistic manner. Using below, I have a single set of xy points, which are recorded in X and Y. I want to cluster into groups using a reference point, which is displayed in X2 and Y2.
With the help of an answer, the current approach is to measure the distance from the reference point and group using k-means. Although it provides a method to cluster using the reference point, the hard cutoff and the fixed k clusters make it somewhat unsuitable when dealing with numerous datasets. For instance, the number of clusters needed for this example is probably 3, but a separate example may differ. I'd have to manually go through and alter k every time.
Given the non-probabilistic nature of k-means, a separate option could be GMM. Is it possible to account for the reference point when modelling? Looking at the output below, the underlying model isn't clustering the way I'm hoping for.
If I look at the probability that each point is within a group, it's not clustered as I'd hoped. I also run into the same problem of manually altering the number of components. Because the points are distributed randomly, using “AIC” or “BIC” to select the appropriate number of clusters doesn't work. There is no optimal number.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import mixture
from sklearn.cluster import KMeans
df = pd.DataFrame({
'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
'X2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
'Y2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
})
k-means:
# distance of each point from the reference point (X2, Y2)
df['distance'] = np.sqrt((df['X'] - df['X2'])**2 + (df['Y'] - df['Y2'])**2)
model = KMeans(n_clusters = 2)
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T)
df['group'] = model.labels_
plt.scatter(df['X'], df['Y'], c = model.labels_, cmap = 'bwr', marker = 'o', s = 5)
plt.scatter(df['X2'], df['Y2'], c ='k', marker = 'o', s = 5)
GMM:
Y_sklearn = df[['X','Y']].values
gmm = mixture.GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
gmm.fit(Y_sklearn)
labels = gmm.predict(Y_sklearn)
df['group'] = labels
plt.scatter(Y_sklearn[:, 0], Y_sklearn[:, 1], c=labels, s=5, cmap='viridis');
plt.scatter(df['X2'], df['Y2'], c='red', marker = 'x', edgecolor = 'k', s = 5, zorder = 10)
proba = pd.DataFrame(gmm.predict_proba(Y_sklearn).round(2)).reset_index(drop = True)
df_pred = pd.concat([df, proba], axis = 1)

In my opinion, if you want to define clusters as "regions where points are close to each other", you should use DBSCAN.
This clustering algorithm finds clusters by looking at regions where points are close to each other (i.e. dense regions), and are separated from other clusters by regions where points are less dense.
This algorithm can categorize points as noise (outliers). Outliers are labelled -1.
They are points that do not belong to any cluster.
Here is some code to perform DBSCAN clustering, and to insert the cluster labels as a new categorical column in the original Y_sklearn DataFrame. It also prints how many clusters and how many outliers are found.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
Y_sklearn = df.loc[:, ["X", "Y"]].copy()
n_points = Y_sklearn.shape[0]
dbs = DBSCAN()
labels_clusters = dbs.fit_predict(Y_sklearn)
#Number of found clusters (outliers are not considered a cluster).
n_clusters = labels_clusters.max() + 1
print(f"DBSCAN found {n_clusters} clusters in dataset with {n_points} points.")
#Number of found outliers (possibly no outliers found).
n_outliers = np.count_nonzero((labels_clusters == -1))
if n_outliers:
    print(f"{n_outliers} outliers were found.\n")
else:
    print("No outliers were found.\n")
#Add cluster labels as a new column to original DataFrame.
Y_sklearn["cluster"] = labels_clusters
#Setting `cluster` column to Categorical dtype makes seaborn function properly treat
#cluster labels as categorical, and not numerical.
Y_sklearn["cluster"] = Y_sklearn["cluster"].astype("category")
If you want to plot the results, I suggest you use Seaborn. Here is some code to plot the points of Y_sklearn DataFrame, and color them by the cluster they belong to. I also define a new color palette, which is just the default Seaborn color palette, but where outliers (with label -1) will be in black.
import matplotlib.pyplot as plt
import seaborn as sns
name_palette = "tab10"
palette = sns.color_palette(name_palette)
if n_outliers:
    color_outliers = "black"
    palette.insert(0, color_outliers)
sns.set_palette(palette)
fig, ax = plt.subplots()
sns.scatterplot(data=Y_sklearn,
                x="X",
                y="Y",
                hue="cluster",
                ax=ax,
                )
Using default hyperparameters, the DBSCAN algorithm finds no cluster in the data you provided: all points are considered outliers, because there is no region where points are significantly more dense. Is that your whole dataset, or is it just a sample? If it is a sample, the whole dataset will have many more points, and DBSCAN is more likely to find some high-density regions.
Or you can try tweaking the hyperparameters, min_samples and eps in particular. If you want to "force" the algorithm to find more clusters, you can decrease min_samples (default is 5) or increase eps (default is 0.5). Of course, the optimal hyperparameter values depend on the specific dataset, and eps in particular depends on the scale of your coordinates, but the default values are often a reasonable starting point. So, if the algorithm considers all points in your dataset to be outliers even after reasonable tuning, it suggests that there are no "natural" clusters!
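To see how sensitive the result is to these two hyperparameters, a small sweep can help. The sketch below reuses the X and Y values from the question; the eps values and min_samples=3 are illustrative guesses, not tuned recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

df = pd.DataFrame({
    'X': [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
    'Y': [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
})
points = df[['X', 'Y']].to_numpy()

results = {}
for eps in (0.5, 3.0, 6.0):   # neighbourhood radius, in the units of X and Y
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(points)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks outliers
    n_outliers = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_outliers)
    print(f"eps={eps}: {n_clusters} clusters, {n_outliers} outliers")
```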

Do you mean density estimation? You can model your data as a Gaussian mixture and then get the probability that a point belongs to each component. You can use sklearn.mixture.GaussianMixture for that. By changing the number of components you control how many clusters you get. The metric to cluster on is the Euclidean distance from the reference point, so the GMM will predict which cluster each data point should be assigned to.
Since your metric is 1d, you will get a set of Gaussian distributions, i.e. a set of means and variances. So you can easily calculate the probability that any point is in a certain cluster, just by calculating how far it is from the reference point and putting that value into the normal distribution pdf formula.
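The calculation described above can be sketched as follows. This assumes the reference point (0, 0) from the question and three components (both are illustrative choices); the membership probability additionally weights each pdf by the component's mixing weight and normalizes across components, which should agree with what predict_proba returns.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

X = np.array([-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0])
Y = np.array([0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0])

# 1d feature: distance of each point from the reference point (0, 0)
dist = np.sqrt(X**2 + Y**2).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(dist)

means = gmm.means_.ravel()                  # one mean per component
stds = np.sqrt(gmm.covariances_.ravel())    # one standard deviation per component

# weight * pdf for every (point, component) pair, shape (n_points, 3)
weighted_pdf = gmm.weights_ * norm.pdf(dist, loc=means, scale=stds)
membership = weighted_pdf / weighted_pdf.sum(axis=1, keepdims=True)

print(membership.round(2)[:5])
```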
To make the image clearer, I'm changing the reference point to (-5, 5) and selecting number of clusters = 4. To get the best number of clusters, use some metric that minimizes total variance and penalizes growth in the number of mixture components, for example argmin(model.covariances_.sum()*num_clusters).
import pandas as pd
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
df = pd.DataFrame({
'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],
})
ref_X, ref_Y = -5, 5
dist = np.sqrt((df.X-ref_X)**2 + (df.Y-ref_Y)**2)
n_mix = 4
gmm = GaussianMixture(n_mix)
model = gmm.fit(dist.values.reshape(-1,1))
x = np.linspace(-35., 35.)
y = np.linspace(-30., 30.)
X, Y = np.meshgrid(x, y)
XX = np.sqrt((X.ravel() - ref_X)**2 + (Y.ravel() - ref_Y)**2)
Z = model.score_samples(XX.reshape(-1,1))
Z = Z.reshape(X.shape)
# plot grid points probabilities
plt.set_cmap('plasma')
plt.contourf(X, Y, Z, 40)
plt.scatter(df.X, df.Y, c=model.predict(dist.values.reshape(-1,1)), edgecolor='black')
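The selection metric mentioned above could be tried out roughly like this. It is a sketch of that one suggestion only: the candidate range of 1 to 5 components and the fixed random_state are assumptions, and the metric itself is a heuristic rather than an established criterion.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0])
Y = np.array([0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0])

ref_X, ref_Y = -5, 5
dist = np.sqrt((X - ref_X)**2 + (Y - ref_Y)**2).reshape(-1, 1)

scores = {}
for k in range(1, 6):
    candidate = GaussianMixture(n_components=k, random_state=0).fit(dist)
    # total variance across components, penalized by the number of components
    scores[k] = candidate.covariances_.sum() * k

best_k = min(scores, key=scores.get)
print(scores, best_k)
```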
P.S. score_samples() returns log-likelihoods; use np.exp() to convert them to likelihood (density) values.

Taking your centre point of 0,0 we can calculate the Euclidean distance from this point to all points in your df.
df['distance'] = np.sqrt(df['X']**2 + df['Y']**2)
If you have a centre point other than zero it would be:
df['distance'] = np.sqrt((centre_point_x - df['X'])**2 + (centre_point_y - df['Y'])**2)
Using your data and chart as before, we can plot this and see the distance metric increasing as we move away from the centre.
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(df['X'], df['Y'], c = df['distance'], cmap = 'viridis', marker = 'o', s = 30)
ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])
plt.show()
K-means
We can now feed this distance data into K-means as you did before, but this time using the distances plus an array of zeros. The zeros are needed because this k-means requires a 2d array, while we only want to split the 1d array of distance data, so the zeros act as 'filler'.
model = KMeans(n_clusters = 2) #choose how many clusters
# create this 2d array for the KMeans model
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T) # transformed array because the above code produces
# data with 27 columns and 2 rows but we want it the other way round
df['group'] = model.labels_ # put the labels into the dataframe
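As a possible alternative to the zero 'filler', the 1d distances can be reshaped into a single-column 2d array. The sketch below uses a handful of made-up distances as a stand-in for df['distance'].values; n_init and random_state are set only to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up distances standing in for df['distance'].values
distance = np.array([0.5, 1.0, 2.0, 3.2, 20.0, 25.0, 28.3, 32.0])

# reshape(-1, 1) turns shape (n,) into (n, 1): n samples, one feature
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(distance.reshape(-1, 1))

print(labels)
```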
Then we can plot the results
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(df['X'], df['Y'], c = df['group'], cmap = 'viridis', marker = 'o', s = 30)
ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])
plt.show()
With three clusters we get the following result:
Other clustering methods
Check out SKlearn's clustering page for more options. I experimented with DBSCAN with some good results but it depends on what you are trying to achieve exactly. Check out the table underneath their example charts to see how they each compare.

How to change a seaborn histogram plot to work for hours of the day?

I have a pandas dataframe with lots of time intervals of varying start times and lengths. I am interested in the distribution of start times over 24 hours, so I have another column, Hour, containing just the hour of the start time. I have plotted a histogram using seaborn to look at the distribution, but the x-axis starts at 0 and runs to 24. Is there a way to make it run from 8 to 8, wrapping around from 23 to 0, so it gives a better visualisation of my data from a time perspective? Thanks in advance.
sns.distplot(df2['Hour'], bins = 24, kde = False).set(xlim=(0,23))

If you want to have a custom order of x-values on your bar plot, I'd suggest using matplotlib directly and plot your histogram simply as a bar plot with width=1 to get rid of padding between bars.
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
# prepare sample data
dates = pd.date_range(
    start=datetime(2020, 1, 1),
    end=datetime(2020, 1, 7),
    freq="H")
random_dates = np.random.choice(dates, 1000)
df = pd.DataFrame(data={"date":random_dates})
df["hour"] = df["date"].dt.hour
# set your preferred order of hours
hour_order = [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5,6,7]
# calculate frequencies of each hour and sort them
plot_df = (
    df["hour"]
    .value_counts()
    .rename_axis("hour", axis=0)
    .reset_index(name="freq")
    .set_index("hour")
    .loc[hour_order]
    .reset_index())
# day / night colour split
day_mask = ((8 <= plot_df["hour"]) & (plot_df["hour"] <= 20))
plot_df["color"] = np.where(day_mask, "skyblue", "midnightblue")
# actual plotting - note that you have to cast hours as strings
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(111)
ax.bar(
    x=plot_df["hour"].astype(str),
    height=plot_df["freq"],
    color=plot_df["color"], width=1)
ax.set_xlabel('Hour')
ax.set_ylabel('Frequency')
plt.show()
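The hard-coded hour_order list can also be generated with modular arithmetic, which makes the starting hour a single parameter. The sketch below works on plain numpy arrays; the random hours are a stand-in for the Hour column, and start = 8 matches the question.

```python
import numpy as np

rng = np.random.default_rng(0)
hours = rng.integers(0, 24, size=1000)      # stand-in for df2['Hour']

counts = np.bincount(hours, minlength=24)   # counts[h] = number of rows with hour h

start = 8
order = (np.arange(24) + start) % 24        # 8, 9, ..., 23, 0, 1, ..., 7
counts_from_start = counts[order]           # equivalent to np.roll(counts, -start)

print(order)
```

Here order.astype(str) can serve as the bar labels and counts_from_start as the bar heights.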

Multiple plots in a single window

I need to draw many such rows (for a0 .. a128) in a single window. I've searched FacetGrid, PairGrid and all around but couldn't find a way. Only regplot has a similar ax argument, but it doesn't plot histograms. My data is 128 real-valued features with a label column [0, 1]. I need the graphs to be shown from my Python code as a separate application on Linux.
Also, is there a way to scale this histogram to show relative values on Y such that the right curve is not skewed?
g = sns.FacetGrid(df, col="Result")
g.map(plt.hist, "a0", bins=20)
plt.show()

Just a simple example using matplotlib. The code is not optimized (ugly, but simple plot-indexing):
import numpy as np
import matplotlib.pyplot as plt
N = 5
data = np.random.normal(size=(N*N, 1000))
f, axarr = plt.subplots(N, N) # maybe you want sharex=True, sharey=True
pi = [0,0]
for i in range(data.shape[0]):
    if pi[1] == N:
        pi[0] += 1 # next row
        pi[1] = 0 # first column again
    # density=True normalizes each histogram; the older normed=True
    # argument was removed in Matplotlib 3.1
    axarr[pi[0], pi[1]].hist(data[i], density=True)
    pi[1] += 1
plt.show()
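On the question of relative values on the Y axis: a density-normalized histogram divides each count by the sample size and the bin width, so two groups of very different sizes end up on a comparable scale. The sketch below checks this with np.histogram; the two normal samples are stand-ins for the two label groups, and the sizes and bin edges are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.normal(size=200)        # stand-in for the less frequent label
large = rng.normal(size=20_000)     # stand-in for the more frequent label

bins = np.linspace(-5, 5, 21)       # 20 bins shared by both groups
dens_small, edges = np.histogram(small, bins=bins, density=True)
dens_large, _ = np.histogram(large, bins=bins, density=True)

# with density=True the bar areas of each histogram sum to 1
area_small = np.sum(dens_small * np.diff(edges))
area_large = np.sum(dens_large * np.diff(edges))

print(area_small, area_large)
```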

inverse of FFT not the same as original function

I don't understand why ifft(fft(myFunction)) is not the same as my function. It seems to be the same shape but a factor of 2 out (ignoring the constant y-offset). All the documentation I can see says there is some normalisation that fft doesn't do, but that ifft should take care of that. Here's some example code below; you can see where I've bodged the factor of 2 to give me the right answer. Thanks for any help, it's driving me nuts.
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt
def fourier_series(x, y, wn, n=None):
    # get FFT
    myfft = fftp.fft(y, n)
    # kill higher freqs above wavenumber wn
    myfft[wn:] = 0
    # make new series
    y2 = fftp.ifft(myfft).real
    # find constant y offset
    myfft[1:]=0
    c = fftp.ifft(myfft)[0]
    # remove c, apply factor of 2 and re apply c
    y2 = (y2-c)*2 + c
    plt.figure(num=None)
    plt.plot(x, y, x, y2)
    plt.show()
if __name__=='__main__':
    x = np.array([float(i) for i in range(0,360)])
    y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
    fourier_series(x, y, 3, 360)

You're removing half the spectrum when you do myfft[wn:] = 0. The negative frequencies are those in the top half of the array and are required.
You have a second fudge to get your results which is taking the real part to find y2: y2 = fftp.ifft(myfft).real (fftp.ifft(myfft) has a non-negligible imaginary part due to the asymmetry in the spectrum).
Fix it with myfft[wn:-wn] = 0 instead of myfft[wn:] = 0, and remove the fudges. So the fixed code looks something like:
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt
def fourier_series(x, y, wn, n=None):
    # get FFT
    myfft = fftp.fft(y, n)
    # kill higher freqs above wavenumber wn
    myfft[wn:-wn] = 0
    # make new series
    y2 = fftp.ifft(myfft)
    plt.figure(num=None)
    plt.plot(x, y, x, y2)
    plt.show()
if __name__=='__main__':
    x = np.array([float(i) for i in range(0,360)])
    y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
    fourier_series(x, y, 3, 360)
It's really worth paying attention to the interim arrays that you create when doing signal processing. Usually there are clues as to what is going wrong that should direct you to the problem. In this case, taking the real part masked the problem and made your task more difficult.
Just to add another quick point: Sometimes taking the real part of the resultant array is exactly the correct thing to do. It's often the case that you end up with an imaginary part to the signal output which is just down to numerical errors in the input to the inverse FFT. Typically this manifests itself as very small imaginary values, so taking the real part is basically the same array.
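For real-valued input, another way to avoid accidentally deleting the negative frequencies is to use a real FFT, which stores only the non-negative half of the spectrum. The sketch below swaps in np.fft.rfft and np.fft.irfft for the scipy.fftpack calls used above, so it is a variation on the fix rather than the same code.

```python
import numpy as np

n = 360
x = np.arange(n, dtype=float)
y = np.sin(2 * np.pi / n * x) + np.sin(2 * 2 * np.pi / n * x) + 5

wn = 3
spectrum = np.fft.rfft(y)          # n // 2 + 1 bins, non-negative frequencies only
spectrum[wn:] = 0                  # safe here: there is no negative half to destroy
y2 = np.fft.irfft(spectrum, n=n)   # real output, no factor of 2 needed

print(np.max(np.abs(y - y2)))
```

Because the test signal only contains wavenumbers 0, 1 and 2, y2 should match y up to rounding.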

You are killing the negative frequencies between 0 and -wn.
I think what you mean to do is to set myfft to 0 for all frequencies outside [-wn, wn].
Change the following line:
myfft[wn:] = 0
to:
myfft[wn:-wn] = 0
