Linear regression to fit a power-law in Python - matplotlib

I have two data sets, index_list and freq_list, which I plot in a log-log plot with plt.loglog(index_list, freq_list). Now I'm trying to fit a power law a*x^(-b) with linear regression. I expect the fitted curve to follow the original closely, but the following code seems to output a similar curve that is mirrored.
I suspect I am using curve_fit badly.
Why is the fitted curve mirrored, and how can I get it to properly fit my initial curve?
Using this data
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
f = open("input.txt", "r")
index_list = []
freq_list = []
index = 0
for line in f:
    split_line = line.split()
    freq_list.append(int(split_line[1]))
    index_list.append(index)
    index += 1
plt.loglog(index_list, freq_list)
def power_law(x, a, b):
    return a * np.power(x, -b)
popt, pcov = curve_fit(power_law, index_list, freq_list)
plt.plot(index_list, power_law(freq_list, *popt))
plt.show()

The code below makes the following changes:
For the scipy functions to work, it is best that both index_list and freq_list are numpy arrays, not Python lists. Also, so the powers don't overflow too rapidly, these arrays should be of float type (not int).
As 0 to a negative power causes a divide-by-zero problem, it makes sense to start the index_list with 1.
Due to the powers, an overflow can occur even for floats. Therefore, it makes sense to add bounds to curve_fit. In particular, b should be limited to stay below about 50 (the largest x value is about 100000, and power(100000, b) overflows when b is e.g. 100). Setting initial values (p0=...) also helps to direct the fitting process.
Drawing a plot with index_list as x and power_law(freq_list, ...) as y would generate a very weird curve. It is necessary that the same x is used for the plot and for the function.
Note that calling plt.loglog() changes both axes of the plot to logarithmic. All subsequent plots on the same axes will continue to use the logarithmic scale.
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import pandas as pd
import numpy as np
def power_law(x, a, b):
    return a * np.power(x, -b)
df = pd.read_csv("https://norvig.com/google-books-common-words.txt", delim_whitespace=True, header=None)
index_list = df.index.to_numpy(dtype=float) + 1
freq_list = df[1].to_numpy(dtype=float)
plt.loglog(index_list, freq_list, label='given data')
popt, pcov = curve_fit(power_law, index_list, freq_list, p0=[1, 1], bounds=[[1e-3, 1e-3], [1e20, 50]])
plt.plot(index_list, power_law(index_list, *popt), label='power law')
plt.legend()
plt.show()
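Alternatively, since the title mentions linear regression: taking logs turns the power law into a straight line, log(y) = log(a) - b*log(x), so ordinary least squares on the logged data also recovers a and b. A minimal sketch, assuming the same word-frequency data as above (note that this minimizes error in log space, which can weight the tail differently than curve_fit does):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("https://norvig.com/google-books-common-words.txt", delim_whitespace=True, header=None)
index_list = df.index.to_numpy(dtype=float) + 1
freq_list = df[1].to_numpy(dtype=float)
# Fit log(y) = log(a) - b*log(x) with ordinary least squares.
slope, intercept = np.polyfit(np.log(index_list), np.log(freq_list), 1)
a, b = np.exp(intercept), -slope
plt.loglog(index_list, freq_list, label='given data')
plt.loglog(index_list, a * np.power(index_list, -b), label='log-log linear fit')
plt.legend()
plt.show()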

Evaluation of log density for various values of `mean`

I can evaluate the log probability density of a multivariate normal by doing
import numpy as np
import scipy.stats
scipy.stats.multivariate_normal.logpdf([0,0], mean = np.zeros(2), cov = np.eye(2))
Now, I'm interested in evaluating the log density of the point [0,0] over a variety of values of mean. Here is what I have tried
import numpy as np
import scipy.stats
grid = np.linspace(-2,2,51)
x,y = np.meshgrid(grid,grid)
scipy.stats.multivariate_normal.logpdf([0,0], mean = np.stack([x,y], axis = -1), cov = np.eye(2))
This results in an error: ValueError: Array 'mean' must be a vector of length 5202.
How can I evaluate the log density of a multivariate normal over a variety of values of mean?
As your error suggests, logpdf expects a 1D array for the mean argument.
Since your covariance matrix is 2x2, you should pass a length-2 vector as mean.
If you want to evaluate the density for multiple mean values, you can use a for loop after flattening x and y as follows:
import numpy as np
import scipy.stats
grid = np.linspace(-2,2,51)
x,y = np.meshgrid(grid,grid)
x,y = x.flatten(), y.flatten()
res = []
for i in range(len(x)):
    x_i, y_i = x[i], y[i]
    res.append(scipy.stats.multivariate_normal.logpdf([0, 0], mean=[x_i, y_i], cov=np.eye(2)))
You can also use a list comprehension in place of the for loop:
res = [scipy.stats.multivariate_normal.logpdf([0, 0], mean=[x_i, y_i], cov=np.eye(2)) for x_i, y_i in zip(x, y)]
To visualize the result you can use matplotlib.pyplot:
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(x,y,c=res)
plt.show()
But I don't see the point of trying to evaluate the multivariate gaussian logpdf over several mean values.
In the case of a multivariate normal distribution, the argument x and the mean m play symmetric roles, as you can see in the exponential term: (x-m)^T Sigma^(-1) (x-m).
What you are doing is equivalent to evaluating the logpdf of a multivariate gaussian with mean [0,0] and covariance eye(2) at each of those mean points.
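That said, if the loop is too slow, the same symmetry gives a vectorized alternative: evaluate the logpdf at all grid points with a fixed zero mean in a single call, since multivariate_normal.logpdf accepts an array of points. A minimal sketch:
import numpy as np
import scipy.stats
grid = np.linspace(-2, 2, 51)
x, y = np.meshgrid(grid, grid)
# (2601, 2) array whose rows are the candidate mean vectors.
points = np.stack([x, y], axis=-1).reshape(-1, 2)
# By symmetry, logpdf([0,0]; mean=m, cov=I) == logpdf(m; mean=[0,0], cov=I),
# so one vectorized call over the grid replaces the loop.
res = scipy.stats.multivariate_normal.logpdf(points, mean=np.zeros(2), cov=np.eye(2))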

Matplotlib streamplot with streamlines that don't break or end

I'd like to make a streamplot with lines that don't stop when they get too close together. I'd rather each streamline be calculated in both directions until it hits the edge of the window. The result is there'd be some areas where they'd all jumble up. But that's what I want.
Is there any way to do this in matplotlib? If not, is there another tool I could use for this that interfaces with python/numpy?
import numpy as np
import matplotlib.pyplot as plt
Y,X = np.mgrid[-10:10:.01, -10:10:.01]
U, V = Y**2, X**2
plt.streamplot(X,Y, U,V, density=1)
plt.show(block=False)
Ok, I've figured out I can get mostly what I want by turning up the density a lot and using custom start points. I'm still interested if there is a better or alternate way to do this.
Here's my solution. Doesn't it look so much better?
import numpy as np
import matplotlib.pyplot as plt
Y,X = np.mgrid[-10:10:.01, -10:10:.01]
y,x = Y[:,0], X[0,:]
U, V = Y**2, X**2
stream_points = np.array(list(zip(np.arange(-9, 9, .5), -np.arange(-9, 9, .5))))
plt.streamplot(x, y, U, V, start_points=stream_points, density=35)
plt.show(block=False)
Edit: By the way, there seems to be some bug in streamplot such that start_points keyword only works if you use 1d arrays for the grid data. See Python Matplotlib Streamplot providing start points
As of Matplotlib version 3.6.0, an optional parameter broken_streamlines has been added for disabling streamline breaks.
Adding it to your snippet produces the following result:
import numpy as np
import matplotlib.pyplot as plt
Y,X = np.mgrid[-10:10:.01, -10:10:.01]
U, V = Y**2, X**2
plt.streamplot(X,Y, U,V, density=1, broken_streamlines=False)
plt.show(block=False)
Note
This parameter just extends the streamlines which were originally drawn (as in the question). This means that the streamlines in the modified plot above are much more uneven than the result obtained in the other answer, with custom start_points. The density of streamlines on any stream plot does not represent the magnitude of U or V at that point, only their direction. See the documentation for the density parameter of matplotlib.pyplot.streamplot for more details on how streamline start points are chosen by default, when they aren't specified by the optional start_points parameter.
For accurate streamline density, consider using matplotlib.pyplot.contour, but be aware that contour does not show arrows.
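As a minimal sketch of that contour alternative, assuming the streamfunction psi = x**3 - y**3 used in the example below (which corresponds, up to sign convention, to U, V = Y**2, X**2): evenly spaced levels of psi trace the streamlines with a density that reflects the flow, just without arrows.
import numpy as np
import matplotlib.pyplot as plt
Y, X = np.mgrid[-10:10:0.01, -10:10:0.01]
# Contours of the streamfunction are the streamlines (up to sign convention).
psi = X**3 - Y**3
plt.contour(X, Y, psi, levels=30)
plt.show()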
Choosing start points automatically
It may not always be easy to choose a set of good starting points automatically. However, if you know the streamfunction corresponding to the flow you wish to plot you can use matplotlib.pyplot.contour to produce a contour plot (which can be hidden from the output), and then extract a suitable starting point from each of the plotted contours.
In the following example, psi_expression is the streamfunction corresponding to the flow. When modifying this example for your own needs, make sure to update both the line defining psi_expression, as well as the one defining U and V. Ensure these both correspond to the same flow.
The density of the streamlines can be altered by changing contour_levels. Here, the contours are uniformly distributed.
import numpy as np
import matplotlib.pyplot as plt
import sympy as sy
x, y = sy.symbols("x y")
psi_expression = x**3 - y**3
psi_function = sy.lambdify((x, y), psi_expression)
Y, X = np.mgrid[-10:10:0.01, -10:10:0.01]
psi_evaluated = psi_function(X, Y)
U, V = Y**2, X**2
contour_levels = np.linspace(np.amin(psi_evaluated), np.amax(psi_evaluated), 30)
# Draw a temporary contour plot.
temp_figure = plt.figure()
contour_plot = plt.contour(X, Y, psi_evaluated, contour_levels)
plt.close(temp_figure)
points_list = []
# Iterate over each contour.
for collection in contour_plot.collections:
    # Iterate over each segment in this contour.
    for path in collection.get_paths():
        middle_point = path.vertices[len(path.vertices) // 2]
        points_list.append(middle_point)
# Reshape python list into numpy array of coords.
stream_points = np.reshape(np.array(points_list), (-1, 2))
plt.streamplot(X, Y, U, V, density=1, start_points=stream_points, broken_streamlines=False)
plt.show(block=False)

Locally weighted smoothing for binary valued random variable

I have a random variable as follows:
f(x) = 1 with probability g(x)
f(x) = 0 with probability 1-g(x)
where 0 < g(x) < 1.
Assume g(x) = x. Let's say I am observing this variable without knowing the function g and obtained 200 samples as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic
samples = np.ndarray(shape=(200, 2))  # renamed from "list" to avoid shadowing the builtin
g = np.random.rand(200)
for i in range(len(g)):
    samples[i] = (g[i], np.random.choice([0, 1], p=[1 - g[i], g[i]]))
print(samples)
plt.plot(samples[:, 0], samples[:, 1], 'o')
Plot of 0s and 1s
Now, I would like to retrieve the function g from these points. The best I could come up with is to draw a histogram and use the mean statistic:
bin_means, bin_edges, bin_number = binned_statistic(samples[:, 0], samples[:, 1], statistic='mean', bins=10)
plt.hlines(bin_means, bin_edges[:-1], bin_edges[1:], lw=2)
Histogram mean statistics
Instead, I would like to have a continuous estimation of the generating function.
I guess it is about kernel density estimation but I could not find the appropriate pointer.
This is straightforward with seaborn, without explicitly fitting an estimator:
import seaborn as sns
g = sns.lmplot(x= , y= , y_jitter=.02, logistic=True)
Plug in x = your exogenous variable and, analogously, y = your dependent variable. y_jitter jitters the points for better visibility if you have a lot of data points. logistic=True is the main point here: it will give you the logistic regression line for the data.
Seaborn is basically tailored around matplotlib and works great with pandas, in case you want to extend your data to a DataFrame.
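For a concrete, runnable sketch tying this back to the simulated data from the question (the column names are arbitrary, and logistic=True needs statsmodels installed):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Simulate the question's data: outcome is 1 with probability g.
rng = np.random.default_rng(0)
g = rng.random(200)
outcome = (rng.random(200) < g).astype(int)
df = pd.DataFrame({"g": g, "outcome": outcome})
# Logistic regression line with jittered points for visibility.
sns.lmplot(x="g", y="outcome", data=df, y_jitter=0.02, logistic=True)
plt.show()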

histogram2d heatmap manipulation

I created a heatmap from a scatterplot of CSV values using the code I found in a different Stack Overflow thread: Generate a heatmap in MatPlotLib using a scatter data set.
This works, but I'd like to edit the colours and smooth between bins. I've read https://matplotlib.org/examples/color/colormaps_reference.html, but my level of n00b is preventing swift progress. Is my current code amenable to easy manipulation for interpolation between bins (smoothing), or at least a colour change, or do I need to create my heatmap in a different way to gain more control? (The heatmap will represent how often a space is used over time, based on the x, y values of a tracked item.)
Thanks, any help much appreciated.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import csv
with open('myfile.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    y = []
    x = []
    for row in readCSV:
        x.append(float(row[0]))
        y.append(float(row[1]))
print(x, y)
heatmap, xedges, yedges = np.histogram2d(x,y,bins=20)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
plt.clf()
plt.imshow(heatmap.T, extent=extent)
plt.show()
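A minimal sketch of one way to adapt that imshow call for smoothing and a different colour scheme (random data stands in for the CSV values here; 'hot' and 'gaussian' are just one choice of colormap and interpolation):
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=1000)
heatmap, xedges, yedges = np.histogram2d(x, y, bins=20)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
# origin='lower' keeps the y axis oriented like the scatter data,
# cmap selects the colour scheme, and interpolation smooths between
# bins when rendering.
plt.imshow(heatmap.T, extent=extent, origin='lower', cmap='hot', interpolation='gaussian')
plt.colorbar()
plt.show()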

heatmap for positive and negative values [duplicate]

I am trying to make a filled contour for a dataset. It should be fairly straightforward:
plt.contourf(x, y, z, cmap=matplotlib.cm.RdBu)
However, what do I do if my dataset is not symmetric about 0? Let's say I want to go from blue (negative values) to 0 (white), to red (positive values). If my dataset goes from -8 to 3, then the white part of the color bar, which should be at 0, is in fact slightly negative. Is there some way to shift the color bar?
First off, there's more than one way to do this:
1. Pass an instance of TwoSlopeNorm (named DivergingNorm before Matplotlib 3.2) as the norm kwarg.
2. Use the colors kwarg to contourf and manually specify the colors.
3. Use a discrete colormap constructed with matplotlib.colors.from_levels_and_colors.
The simplest way is the first option. It is also the only option that allows you to use a continuous colormap.
The reason to use the first or third option is that they will work for any type of matplotlib plot that uses a colormap (e.g. imshow, scatter, etc.).
The third option constructs a discrete colormap and normalization object from specific colors. It's basically identical to the second option, but it will a) work with other types of plots than contour plots, and b) avoid having to manually specify the number of contours.
As an example of the first option (I'll use imshow here because it makes more sense than contourf for random data, but contourf would have identical usage other than the interpolation option.):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm
data = np.random.random((10,10))
data = 10 * (data - 0.8)
fig, ax = plt.subplots()
im = ax.imshow(data, norm=TwoSlopeNorm(0), cmap=plt.cm.seismic, interpolation='none')
fig.colorbar(im)
plt.show()
As an example of the third option (notice that this gives a discrete colormap instead of a continuous colormap):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import from_levels_and_colors
data = np.random.random((10,10))
data = 10 * (data - 0.8)
num_levels = 20
vmin, vmax = data.min(), data.max()
midpoint = 0
levels = np.linspace(vmin, vmax, num_levels)
midp = np.mean(np.c_[levels[:-1], levels[1:]], axis=1)
vals = np.interp(midp, [vmin, midpoint, vmax], [0, 0.5, 1])
colors = plt.cm.seismic(vals)
cmap, norm = from_levels_and_colors(levels, colors)
fig, ax = plt.subplots()
im = ax.imshow(data, cmap=cmap, norm=norm, interpolation='none')
fig.colorbar(im)
plt.show()
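For completeness, a minimal sketch of the second option with hand-picked levels and one color per interval (contourf requires len(colors) == len(levels) - 1; the particular levels and hex colors here are arbitrary choices for data in roughly the -8 to 2 range):
import numpy as np
import matplotlib.pyplot as plt
data = np.random.random((10, 10))
data = 10 * (data - 0.8)
# Levels straddle zero so the pale band sits exactly at 0.
levels = [-8, -4, -1, 0, 1, 2]
colors = ['#08306b', '#4292c6', '#c6dbef', '#fcbba1', '#cb181d']
fig, ax = plt.subplots()
cf = ax.contourf(data, levels=levels, colors=colors)
fig.colorbar(cf)
plt.show()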