Numpy: find mean coordinate of points along line - numpy

I have a bunch of points in a 2D space which all reside on a line (polygon). How can I compute the mean coordinate of these points on the line?
I don't mean the centroid of the points in the 2D space (as @rth initially proposed in his answer), but the mean location of the points along the line on which they reside. So basically, I could transform the line to a 1D axis, compute the mean location in 1D, and transform the location of the mean back into the 2D space.
Maybe these are exactly the necessary steps, but I think (or hope) that there is a function in numpy/scipy which allows me to do this in one step.

Edit: The approach you describe in the question is indeed probably the simplest way for solving this problem.
Here is an implementation that calculates the positions of the vertices along the line in 1D, takes their mean, and finally calculates the corresponding 2D position with parametric interpolation:
import numpy as np
from scipy.interpolate import splprep, splev
vert = np.random.randn(1000, 2) # vertices definition here
# calculate the Euclidean distances between consecutive vertices
# equivalent to a for loop with
# dl[i] = ((vert[i+1, 0] - vert[i, 0])**2 + (vert[i+1,1] - vert[i,1])**2)**0.5
dl = (np.diff(vert, axis=0)**2).sum(axis=1)**0.5
# pad with 0, so dl.shape[0] == vert.shape[0] for convenience
dl = np.insert(dl, 0, 0.0)
l = np.cumsum(dl) # 1D coordinates along the line
l_mean = np.mean(l) # mean in the line coordinates
# calculate the coordinate of l_mean in 2D space
# with parametric B-spline interpolation
tck, _ = splprep(x=vert.T, u=l, k=3, s=0) # s=0 requests interpolation rather than a smoothing fit
res = splev(l_mean, tck)
print(res)
Edit2: Assuming now that you have a high resolution set of points for your path vert_full and some approximate measurements vert_1, vert_2, etc, what you could do is the following.
Project each point of vert_1, etc. onto the exact path. Assuming that vert_full has many more data points than vert_1, we can simply look for the nearest neighbours of vert_1 in vert_full:
from scipy.spatial import cKDTree
tr = cKDTree(vert_full)
d, idx = tr.query(vert_1, k=1)
vert_1_proj = vert_full[idx] # this gives the projected coordinates onto vert_full
# I have not actually run this, so it might require minor changes
Use the above mean calculation with the new vert_1_proj vector.
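The same 1D-mean idea can also be sketched with piecewise-linear interpolation instead of a B-spline, which keeps the result exactly on the polyline. This is only a minimal sketch: the function name mean_point_along_polyline is made up for illustration, and it assumes the vertices are already ordered along the path.

```python
import numpy as np

def mean_point_along_polyline(vert):
    """Return the 2D point at the mean arc-length position of the vertices.

    Hypothetical helper; `vert` is an (n, 2) array of vertices ordered
    along the path.
    """
    vert = np.asarray(vert, dtype=float)
    # lengths of the segments between consecutive vertices
    seg = np.linalg.norm(np.diff(vert, axis=0), axis=1)
    # 1D coordinate (arc length) of every vertex, starting at 0
    l = np.concatenate(([0.0], np.cumsum(seg)))
    l_mean = l.mean()
    # map the 1D mean back to 2D, one coordinate at a time
    x = np.interp(l_mean, l, vert[:, 0])
    y = np.interp(l_mean, l, vert[:, 1])
    return np.array([x, y])

# L-shaped path: arc lengths are [0, 1, 3], so the mean arc length is 4/3
example = mean_point_along_polyline([[0, 0], [1, 0], [1, 2]])
print(example)
```

With vert_1_proj from the projection step, the call would be mean_point_along_polyline(vert_1_proj), provided the projected points are sorted along the path first.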

Meanwhile I've found the answer to my question, although using Shapely instead of Numpy.
import numpy as np
from shapely.geometry import LineString, Point
# lists of points as (x,y) tuples
path_xy = [...]
points_xy = [...] # should be on or near path
path = LineString(path_xy) # create path object
pts = [Point(p) for p in points_xy] # create point objects
dist = [path.project(p) for p in pts] # distances along path
mean_dist = np.mean(dist) # mean distance along path
mean = path.interpolate(mean_dist) # mean point
mean_xy = (mean.x,mean.y)
This works perfectly!
(That is also why I have to accept it as the answer, though I highly appreciate @rth's help!)

Related

Central Limit Theorem: Sample means do not follow a normal distribution

The Problem
Good evening.
I am learning about the Central Limit Theorem. As practice, I ran simulations in an attempt to find the mean of a fair die (I know, a toy problem).
I took 4000 samples, and in each sample I rolled a die 50 times (screenshot of the code at the bottom). For each of these 4000 samples I computed the mean. Then, I plotted these 4000 sample means in a histogram (with bin size 0.03) using matplotlib.
Here is the result:
Question
Why aren't the sample means normally distributed given that the conditions for CLT (sample size >= 30) were respected?
Specifically, why does the histogram look like two normal distributions superimposed on top of each other? More intriguingly, why does the "outer" distribution look "discrete" with empty spaces occurring at regular intervals?
It almost seems like the result is off in a systematic way.
All help is greatly appreciated. I am very lost.
Supplementary Code
The code I used to generate the 4000 sample means.
"""
Take multiple samples of dice rolls. For
each sample, compute the sample mean.
With the sample means, plot a histogram.
By the Central Limit Theorem, the sample
means should be normally distributed.
"""
import random

sample_means = []
num_samples = 4000
for i in range(num_samples):
    # Large enough for CLT to hold
    num_rolls = 50
    sample = []
    for j in range(num_rolls):
        observation = random.randint(1, 6)
        sample.append(observation)
    sample_mean = sum(sample) / len(sample)
    sample_means.append(sample_mean)
When num_rolls equals 50, each possible mean will be a fraction with denominator 50. So, in reality, you are looking at a discrete distribution.
To create a histogram of a discrete distribution, the bin boundaries are best placed midway between the possible values. The possible means are 0.02 apart, so with a step size of 0.03 some bins cover two possible values while their neighbors cover only one, and the bins covering two values get roughly double the count. Moreover, some bin boundaries coincide with possible values, and due to subtle floating-point rounding it becomes unpredictable which bin such a value lands in.
Here is some code to illustrate what is going on:
from matplotlib import pyplot as plt
import numpy as np
import random
sample_means = []
num_samples = 4000
for i in range(num_samples):
    num_rolls = 50
    sample = []
    for j in range(num_rolls):
        observation = random.randint(1, 6)
        sample.append(observation)
    sample_mean = sum(sample) / len(sample)
    sample_means.append(sample_mean)
fig, axs = plt.subplots(2, 2, figsize=(14, 8))
random_y = np.random.rand(len(sample_means))
for (ax0, ax1), step in zip(axs, [0.03, 0.02]):
    bins = np.arange(3.01, 4, step)
    ax0.hist(sample_means, bins=bins)
    ax0.set_title(f'step={step}')
    ax0.vlines(bins, 0, ax0.get_ylim()[1], ls=':', color='r') # show the bin boundaries in red
    ax1.scatter(sample_means, random_y, s=1) # show the sample means with a random y
    ax1.vlines(bins, 0, 1, ls=':', color='r') # show the bin boundaries in red
    ax1.set_xticks(np.arange(3, 4, 0.02))
    ax1.set_xlim(3.0, 3.3) # zoom in on a region to better see the bins
    ax1.set_title('bin boundaries between values' if step == 0.02 else 'chaotic bin boundaries')
plt.show()
PS: Note that the code would run much, much faster if instead of Python lists, the code would work completely with numpy.
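As a rough sketch of what a pure-NumPy version could look like (the seed and the names rolls and edges are illustrative choices, not from the original code), all rolls can be drawn at once and the bin edges placed exactly halfway between the possible means:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
num_samples, num_rolls = 4000, 50

# one row per sample, one column per roll; the upper bound 7 is exclusive
rolls = rng.integers(1, 7, size=(num_samples, num_rolls))
sample_means = rolls.mean(axis=1)

# every possible mean is k / num_rolls for an integer sum k in [50, 300],
# so put the edges at (k - 0.5) / num_rolls: one possible value per bin
edges = (np.arange(num_rolls, 6 * num_rolls + 2) - 0.5) / num_rolls
counts, _ = np.histogram(sample_means, bins=edges)

print(len(counts), counts.sum())
```

Because each bin then covers exactly one attainable mean, the alternating pattern disappears; passing bins=edges to plt.hist should give the same bell shape as the step=0.02 panel above.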

Why the point size using sns.lmplot is different when I used plt.scatter?

I want to draw a scatterplot of x against y in which the size of each point depends on a numeric variable and the color of each point depends on a categorical variable.
First, I was trying this with plt.scatter:
Graph 1
Next, I tried the same thing with lmplot, but the point sizes differ from those in the first graph.
I think the two graphs should be identical. Why aren't they?
Graph 2
Your question is not very descriptive, but I guess you want to control the size of the markers. The seaborn documentation has more detail.
Here is a starting point for you.
A numeric variable can also be assigned to size to apply a semantic mapping to the areas of the points:
import seaborn as sns
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="size", size="size")
For seaborn scatterplot:
df = sns.load_dataset("anscombe")
sp = sns.scatterplot(x="x", y="y", hue="dataset", data=df)
And to change the size of the points you use the s parameter.
sp = sns.scatterplot(x="x", y="y", hue="dataset", data=df, s=100)

Which dx do I choose for np.gradient argument?

I won't be too specific, but I have a graph of E vs. T (T being the independent quantity).
I want the derivative of E with respect to T. I am unsure which dx spacing I should choose.
Details:
T = 10**(np.arange(-1,1.5,0.05)) (I.e the spacing is not equal)
E is a function of T.
Questions:
Which spacing do I use?
My thoughts:
I think I take the spacing of T i.e np.gradient(Energy, dx = T) ??
For non-uniform spacing, pass in the array of absolute positions (not the differences between them); gradient will use it to calculate the spacing at each point. The array is passed positionally (there is no dx keyword), so in your case just pass in T: np.gradient(E, T).
Here's an example, as a test, where the blue is the curve and red is the calculated gradients.
import numpy as np
import matplotlib.pyplot as plt
T = 10**(np.arange(-1,1.5,0.05))
E = T**2
gradients = np.gradient(E, T)
plt.plot(T, E, '-o') # plot the curve
for i, g in enumerate(gradients): # plot the gradients at each point
    plt.plot([T[i], T[i]+1], [E[i], E[i]+g], 'r')
plt.show()
Here's the line from the docs that's of interest:
N arrays to specify the coordinates of the values along each dimension
of F. The length of the array must match the size of the corresponding
dimension
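A quick numerical check can make the difference visible. The sketch below reuses the T and E from the example and compares the result with the analytic derivative 2T; the names exact, interior_ok and dE_unit are illustrative. For a quadratic, the interior second-order differences are exact, while the two end points use one-sided first-order differences by default.

```python
import numpy as np

T = 10 ** np.arange(-1, 1.5, 0.05)  # non-uniformly spaced positions
E = T ** 2
exact = 2 * T                       # analytic dE/dT

# positions passed as the second positional argument
dE_dT = np.gradient(E, T)
interior_ok = np.allclose(dE_dT[1:-1], exact[1:-1])

# without coordinates, np.gradient assumes unit spacing between samples,
# which is the derivative with respect to the index, not with respect to T
dE_unit = np.gradient(E)

print(interior_ok)
print(np.abs(dE_dT - exact).max(), np.abs(dE_unit - exact).max())
```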

How to create volume from point cloud in spherical coordinates?

I have two sets of discrete points in spherical coordinates, each representing top and bottom surfaces of an object.
I am trying to create a volume from these points in order to separate the points that lie inside the object from those that lie outside. Any suggestions on where to look or which library to use?
The blue and red points represent the top and bottom surfaces. The red points are generated by shifting the top surface radially downwards by a constant distance.
If I am right, the blue and red surfaces are meshed (and watertight). So for every point you can draw the line from the sphere center and look for intersections with the mesh. This is done by finding the two triangles such that the line pierces them (this can be done by looking at the angular coordinates only, using a point-in-triangle formula), then finding the intersection points. Then it is an easy matter to classify the point as before the red surface, after the blue or in between.
Exhaustive search for the triangles can be costly. You can speed it up for instance using a hierarchy of bounding boxes or similar device.
Here is a custom, tinkered-together method that may work, provided that the average distance between points on the original surface is much smaller than both the thickness of the volume and the irregularities of the surface contour. In other words, there must be many points describing the blue surface.
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import KDTree
# Generate a test surface:
theta = np.linspace(3, 1, 38)
phi = np.zeros_like(theta)
r = 1 + 0.1*np.sin(8*theta)
surface_points = np.stack((r, theta, phi), axis=1) # n x 3 array
# Generate test points:
x_span, y_span = np.linspace(-1, 0.7, 26), np.linspace(0.1, 1.2, 22)
x_grid, y_grid = np.meshgrid(x_span, y_span)
r_test = np.sqrt(x_grid**2 + y_grid**2).ravel()
theta_test = np.arctan2(y_grid, x_grid).ravel()
phi_test = np.zeros_like(theta_test)
test_points = np.stack((r_test, theta_test, phi_test), axis=1) # n x 3 array
# Determine if the test points are in the volume:
volume_thickness = 0.2 # Distance between the two surfaces
# Angular threshold used to decide, for a test point, whether the line
# from the origin to the point goes through the surface
angle_threshold = 0.05
# Get the nearest point: (replace the interpolation)
get_nearest_points = KDTree(surface_points[:, 1:]) # keep only the angles
# This is based on the Cartesian distance between angle pairs,
# and is therefore not entirely valid for the angle between points on a sphere.
# It could be better to project the points onto a unit sphere and convert
# all coordinates to a Cartesian frame in order to do the nearest-point search...
distance, idx = get_nearest_points.query(test_points[:, 1:])
go_through = distance < angle_threshold
nearest_surface_radius = surface_points[idx, 0]
is_in_volume = (go_through) & (nearest_surface_radius > test_points[:, 0]) \
    & (nearest_surface_radius - volume_thickness < test_points[:, 0])
not_in_volume = np.logical_not(is_in_volume)
# Graph:
plt.figure(figsize=(10, 7))
plt.polar(test_points[is_in_volume, 1], test_points[is_in_volume, 0], '.r',
          label='in volume')
plt.polar(test_points[not_in_volume, 1], test_points[not_in_volume, 0], '.k',
          label='not in volume', alpha=0.2)
plt.polar(test_points[go_through, 1], test_points[go_through, 0], '.g',
          label='go through', alpha=0.2)
plt.polar(surface_points[:, 1], surface_points[:, 0], '.b',
          label='surface')
plt.xlim([0, np.pi])
plt.grid(False)
plt.legend()
plt.show()
The resulting graph, for the 2D case, is:
The idea is to find, for each test point, the nearest point on the surface, considering only the direction and not the radius. Once this "same direction" point is found, it is possible to test both whether the test point is inside the volume along the radial direction (using volume_thickness) and whether it is close enough in direction to the surface (using angle_threshold).
I think it would be better to mesh the (non-convex) blue surface and perform a proper interpolation, but I don't know of a SciPy method for this.
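The unit-sphere variant mentioned in the code comments could be sketched roughly as follows. The helper names unit_vectors and nearest_by_direction are made up for illustration, and the sketch assumes theta is the polar angle measured from the z-axis and phi the azimuth, which may differ from the convention of your data. The chord length returned by the KD-tree is converted back to an angle, so the threshold stays in radians.

```python
import numpy as np
from scipy.spatial import KDTree

def unit_vectors(theta, phi):
    """Directions on the unit sphere (assumed: theta polar, phi azimuth)."""
    theta, phi = np.asarray(theta, dtype=float), np.asarray(phi, dtype=float)
    return np.stack((np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)), axis=-1)

def nearest_by_direction(surface_angles, test_angles, angle_threshold):
    """For each test direction, find the surface point closest in direction.

    Both inputs are (n, 2) arrays of (theta, phi). Returns the index of the
    nearest surface point and whether it lies within angle_threshold radians.
    """
    tree = KDTree(unit_vectors(surface_angles[:, 0], surface_angles[:, 1]))
    chord, idx = tree.query(unit_vectors(test_angles[:, 0], test_angles[:, 1]))
    # chord length between unit vectors -> angle between the directions
    angle = 2 * np.arcsin(np.clip(chord / 2, 0.0, 1.0))
    return idx, angle < angle_threshold

surface_angles = np.array([[0.5, 0.0], [1.0, 3.0]])
test_angles = np.array([[0.5, 2 * np.pi],        # same direction as surface point 0
                        [1.0, 3.0 - 2 * np.pi],  # same direction as surface point 1
                        [2.5, 1.0]])             # far from both
idx, go_through = nearest_by_direction(surface_angles, test_angles, 0.05)
print(idx, go_through)
```

Unlike a KD-tree built on the raw (theta, phi) pairs, this treats phi = 0 and phi = 2π as the same direction and does not distort distances near the poles.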

Contour plotting orbitals in pyquante2 using matplotlib

I'm currently writing line and contour plotting functions for my PyQuante quantum chemistry package using matplotlib. I have some great functions that evaluate basis sets along a (npts,3) array of points, e.g.
from somewhere import basisset, line
bfs = basisset(h2) # Generate a basis set
points = line((0,0,-5),(0,0,5)) # Create a line in 3d space
bfmesh = bfs.mesh(points)
for i in range(bfmesh.shape[1]):
    plot(bfmesh[:,i])
This is fast because it evaluates all of the basis functions at once, and I got some great help from stackoverflow here and here to make them extra-nice.
I would now like to update this to do contour plotting as well. The slow way I've done this in the past is to create two 1D vectors using linspace(), mesh these into a 2D grid using meshgrid(), and then iterate over all xyz points, evaluating each one:
xvals = np.linspace(0,10)
yvals = np.linspace(0,20)
z = 0
# meshgrid returns arrays of shape (len(yvals), len(xvals))
f = np.empty((len(yvals), len(xvals)), dtype=float)
for i, x in enumerate(xvals):
    for j, y in enumerate(yvals):
        f[j, i] = bf(x,y,z)
X,Y = np.meshgrid(xvals,yvals)
contourplot(X,Y,f)
(this isn't real code -- may have done something dumb)
What I would like to do is to generate the mesh in more or less the same way I do in the contour plot example, "unravel" it to a (npts,3) list of points, evaluate the basis functions using my new fast routines, then "re-ravel" it back to X,Y matrices for plotting with contourplot.
The problem is that I don't have anything that I can simply call .ravel() on: I have the 1D arrays xvals and yvals, their 2D versions X and Y, and the single z value.
Can anyone think of a nice, pythonic way to do this?
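One possible way to do the unravel/re-ravel round trip described above is sketched below. The names grid_to_points and evaluate_on_points are hypothetical; evaluate_on_points is only a stand-in for a fast routine that takes an (npts, 3) array, such as the bfs.mesh call shown earlier.

```python
import numpy as np

def grid_to_points(xvals, yvals, z=0.0):
    """Build the 2D grid and the matching (npts, 3) list of xyz points."""
    X, Y = np.meshgrid(xvals, yvals)
    # ravel X and Y the same way and pad with a constant z column
    points = np.column_stack((X.ravel(), Y.ravel(), np.full(X.size, z)))
    return X, Y, points

def evaluate_on_points(points):
    """Stand-in for a vectorised routine that accepts an (npts, 3) array."""
    return np.sin(np.sqrt((points ** 2).sum(axis=1)))

xvals = np.linspace(0, 10)
yvals = np.linspace(0, 20)
X, Y, points = grid_to_points(xvals, yvals, z=0.0)

# evaluate on the flat list, then restore the grid shape for contour plotting
f = evaluate_on_points(points).reshape(X.shape)
print(points.shape, f.shape)
```

Since ravel() and reshape() use the same row-major order, f[j, i] corresponds to (X[j, i], Y[j, i]), so f can be passed straight to a contour call together with X and Y. A routine that returns one column per basis function would need a reshape per column instead.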
If you can express f as a function of X and Y, you could avoid the Python for-loops this way:
import matplotlib.pyplot as plt
import numpy as np
def bf(x, y):
    return np.sin(np.sqrt(x**2+y**2))
xvals = np.linspace(0,10)
yvals = np.linspace(0,20)
X, Y = np.meshgrid(xvals,yvals)
f = bf(X,Y)
plt.contour(X,Y,f)
plt.show()
yields