I am using an array (51x51x181) to do 3D interpolation in Python (and I can calculate any point in between if needed).
I need to reduce the size of the array and would like to do this with the least amount of error possible.
Below is an example, with the error function I would like to improve on. The number of values in the array should stay the same; however, the Angles and Shifts in the example do not have to be equally spaced.
import numpy as np
from scipy.interpolate import RegularGridInterpolator
import itertools

# coarse grid and the data to interpolate from
Angles = np.linspace(0, 360, 10)
Shifts = np.linspace(0, 100, 10)
Data = np.sin(np.deg2rad(Angles[:, None] + Shifts[None, :]))
interp = RegularGridInterpolator((Angles, Shifts), Data, bounds_error=False, fill_value=None)

def errorfunc():
    # RMS error of the interpolant against the exact function on a dense grid
    Angles = np.linspace(0, 360, 50)
    Shifts = np.linspace(0, 100, 50)
    Function_Results = np.sin(np.deg2rad(Angles[:, None] + Shifts[None, :]).flatten())
    Data_interp = interp(np.array(list(itertools.product(Angles, Shifts))))
    Error = np.sqrt(np.mean(np.square(Function_Results - Data_interp)))
    return Error
I could not find a suitable optimizer in scipy (I tried a few, with poor results). Is there a standard way to do this?
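For illustration only: one way to feed such an error function to a generic optimizer is to optimize positive node spacings instead of raw node positions, so the grid always stays strictly increasing. The nodes_from_increments helper and the choice of Nelder-Mead below are just assumptions for the sketch, not a standard scipy recipe:

import numpy as np
from scipy.interpolate import RegularGridInterpolator
from scipy.optimize import minimize

def nodes_from_increments(raw, lo, hi):
    # Map unconstrained numbers to strictly increasing nodes in [lo, hi],
    # with both end points pinned to the domain boundaries.
    w = np.exp(raw)                                    # positive spacings
    edges = np.concatenate(([0.0], np.cumsum(w)))
    return lo + (hi - lo) * edges / edges[-1]

def rms_error(params):
    angles = nodes_from_increments(params[:9], 0.0, 360.0)   # 10 angle nodes
    shifts = nodes_from_increments(params[9:], 0.0, 100.0)   # 10 shift nodes
    data = np.sin(np.deg2rad(angles[:, None] + shifts[None, :]))
    interp = RegularGridInterpolator((angles, shifts), data,
                                     bounds_error=False, fill_value=None)
    # dense reference grid, same role as in errorfunc() above
    A, S = np.linspace(0, 360, 50), np.linspace(0, 100, 50)
    AA, SS = np.meshgrid(A, S, indexing='ij')
    exact = np.sin(np.deg2rad(AA + SS))
    approx = interp(np.column_stack([AA.ravel(), SS.ravel()]))
    return np.sqrt(np.mean(np.square(exact.ravel() - approx)))

x0 = np.zeros(18)          # all-zero increments reproduce the equally spaced grid
res = minimize(rms_error, x0, method='Nelder-Mead')
print(res.fun, nodes_from_increments(res.x[:9], 0.0, 360.0))

Starting from all-zero increments reproduces the equally spaced grid, so the optimizer can only improve on the current error.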
I am looking for the nearest nonzero cell in a numpy 3d array based on the i,j,k coordinates stored in a pandas dataframe. My solution below works, but it is slower than I would like. I know my optimization skills are lacking, so I am hoping someone can help me find a faster option.
It takes 2 seconds to find the nearest non-zero cells for a 100 x 100 x 100 binary array, and I have hundreds of files, so any speed enhancement would be much appreciated!
import numpy as np
import pandas as pd
import time

a = np.random.randint(0, 2, size=(100, 100, 100))
# df with i,j,k of interest
df = pd.DataFrame(np.random.randint(100, size=(100, 3)).tolist(),
                  columns=['i', 'j', 'k'])

def find_nearest(a, df):
    t0 = time.time()
    nzi = np.nonzero(a)
    for i, r in df.iterrows():
        # squared distance from (k, i, j) to every non-zero cell
        dist = ((r['k'] - nzi[0])**2 +
                (r['i'] - nzi[1])**2 +
                (r['j'] - nzi[2])**2)
        nidx = dist.argmin()
        df.loc[i, ['nk', 'ni', 'nj']] = (nzi[0][nidx],
                                         nzi[1][nidx],
                                         nzi[2][nidx])
    print(time.time() - t0)
    return df
The problem that you are trying to solve looks like a nearest-neighbor search.
The worst-case complexity of the current code is O(n m), with n the number of points to search and m the number of neighbour candidates. With n = 100 and m = 100**3 = 1,000,000, this means about a hundred million iterations. To solve this efficiently, one can use a better algorithm.
The common way to solve this kind of problem is to put all elements in a BSP-tree data structure (such as a quadtree or an octree). Such a data structure lets you locate the nearest elements around a location in O(log(m)) time. As a result, the overall complexity of this method is O(n log(m))! SciPy already implements KD-trees.
Vectorization generally also helps to speed up the computation.
def find_nearest_fast(a, df):
    from scipy.spatial import KDTree
    import numpy as np
    import time
    t0 = time.time()
    # coordinates of all non-zero cells, one (k, i, j) row per candidate
    candidates = np.array(np.nonzero(a)).transpose().copy()
    # build the KD-tree once, then answer all queries against it
    tree = KDTree(candidates, leafsize=1024, compact_nodes=False)
    # same (k, i, j) ordering as in the original loop
    searched = np.array([df['k'], df['i'], df['j']]).transpose()
    distances, indices = tree.query(searched)
    nearestPoints = candidates[indices, :]
    df[['nk', 'ni', 'nj']] = nearestPoints
    print(time.time() - t0)
    return df
This implementation is 16 times faster on my machine. Note that the results can differ slightly, since a given input point may have several nearest candidates at the same distance.
I have a 3d numpy array with the shape (128,128,384). Let's call this array "S". This array only contains binary values, either 0s or 1s.
Now I want to get a 3d plot of this array such that I have a grid of indices (x,y,z), and for every entry of S that is 1, a point should be plotted at the corresponding indices in the 3d grid. For example, if there is a 1 at S[120,50,36], I should get a dot at that point in the grid.
So far I have tried many methods, but the only one I got working iterates over the entire array and uses a scatter plot, which is extremely slow and hence useless in my case. Here is a snippet of my code:
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
for i in range(0, 128):
    for j in range(0, 128):
        for k in range(0, 384):
            if S[i, j, k] == 1:
                ax.scatter(i, j, k, zdir='z', marker='o')
Please suggest any method that would be faster than this.
Also, please note that I am not trying to plot the entries in my array. The entries in my array are only a condition that tells me whether I should plot a point at the corresponding indices.
Thank you very much
You can use numpy.where.
In your example, remove the for loops and just use:
i, j, k = np.where(S==1)
ax.scatter(i,j,k,zdir='z', marker='o')
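A self-contained variant of the same idea, with a sparse random array standing in for S (the 0.999 threshold is arbitrary, just to keep the scatter light):

import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib
import matplotlib.pyplot as plt

# sparse random stand-in for your binary array S
S = (np.random.rand(128, 128, 384) > 0.999).astype(int)

i, j, k = np.nonzero(S)        # same indices as np.where(S == 1) for a 0/1 array
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(i, j, k, zdir='z', marker='o', s=1)
plt.show()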
The method plt.hist() in pyplot has a way to create a 'step-like' plot style when calling
plt.hist(data, histtype='step')
but the 'ordinary' methods that plot raw data without processing (plt.plot(), plt.scatter(), etc.) apparently do not have style options to obtain the same result. My goal is to plot a given set of points using that style, without making a histogram of these points.
Is that achievable with standard library methods for plotting a given 2-D set of points?
I also think that there is at least one hack (generating a fake distribution which would have histogram equal to our data) and a 'low-level' solution to draw each segment manually, but none of these ways seems favorable.
Maybe you are looking for drawstyle="steps".
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
data = np.cumsum(np.random.randn(10))
plt.plot(data, drawstyle="steps")
plt.show()
Note that this is slightly different from histograms, because the lines do not go to zero at the ends.
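If you do want the outline to drop to the baseline at both ends, as histtype='step' does, newer matplotlib versions (3.4+) also offer plt.stairs, which takes the values plus bin-like edges; a small sketch with made-up edges:

import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt

data = np.cumsum(np.random.randn(10))
edges = np.arange(len(data) + 1)   # hypothetical bin edges, one more than the values
plt.stairs(data, edges)            # step outline that returns to the baseline at the ends
plt.show()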
There is a huge difference between pandas "isin" and numpy "in1d" in terms of efficiency. After some research I've noticed that the type of the data and of the values passed as a parameter to the "in" method has a huge impact on the run time. In any case, the numpy implementation seems to suffer much less from this problem.
What's going on here?
import timeit
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, (10**6), dtype='int8'), columns=['A'])
vals = np.array([5, 7], dtype='int64')

f = lambda: df['A'].isin(vals)
g = lambda: np.in1d(df['A'], vals)

print('pandas:', timeit.timeit(stmt='f()', setup='from __main__ import f', number=10)/10)
print('numpy :', timeit.timeit(stmt='g()', setup='from __main__ import g', number=10)/10)
>>
pandas: 0.0541711091995
numpy : 0.000645089149475
Numpy and pandas use different algorithms for isin. For some cases numpy's version is faster and for some pandas'. For your test case numpy seems to be faster.
Pandas' version, however, has a better asymptotic running time and will win for bigger datasets.
Let's assume that there are n elements in the data series (df in your example) and m elements in the query (vals in your example).
Usually, numpy's algorithm does the following:
Use np.unique(..) to find all unique elements in the series. This is done via sorting, i.e. O(n*log(n)); there might be N <= n unique elements.
For every query element, use binary search to look up whether it is in the series, i.e. O(m*log(N)) overall.
This leads to an overall running time of O(n*log(n) + m*log(N)).
There are some hard-coded optimizations in place for the case when vals has only a few elements, and for these cases numpy really shines.
Pandas does something different:
Populate a hash map (wrapped khash functionality) in order to find all unique elements, which takes O(n).
Look up every query in the hash map in O(1), i.e. O(m) overall.
Overall, the running time is O(n) + O(m), which is much better than numpy's.
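Schematically, the two strategies look roughly like this (a simplified sketch, not the actual library code):

import numpy as np

def membership_sort_based(haystack, needles):
    # sort/unique once, then binary-search every query element
    uniq = np.unique(haystack)                 # O(n log n), N unique values
    pos = np.searchsorted(uniq, needles)       # O(m log N)
    pos[pos == len(uniq)] = 0                  # keep indices in range
    return uniq[pos] == needles

def membership_hash_based(haystack, needles):
    # build a hash table once, then O(1) membership tests
    table = set(haystack.tolist())             # O(n)
    return np.fromiter((x in table for x in needles), bool, count=len(needles))

Both return a boolean mask for the query elements; which one wins depends on n, m and the constant factors discussed below.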
However, for smaller inputs constant factors, and not the asymptotic behavior, are what counts, and there numpy is just way better. There are also other considerations, like memory consumption (which is higher for pandas), that might play a role.
But if we take a bigger query set, the situation is completely different:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,(10**6),dtype='int8'),columns=['A'])
vals = np.array([5,7],dtype='int64')
vals2 = np.random.randint(0,10,(10**6),dtype='int64')
And now:
%timeit df['A'].isin(vals)      # 17.0 ms
%timeit df['A'].isin(vals2)     # 16.8 ms
%timeit np.in1d(df['A'], vals)  # 1.36 ms
%timeit np.in1d(df['A'], vals2) # 82.9 ms
Numpy really loses ground as the number of queries grows. It can also be seen that building the hash map is the bottleneck for pandas, not the queries.
In the end it doesn't make much sense (even though I just did!) to evaluate the performance for only one input size; it should be done for a range of input sizes, as there are some surprises to be discovered!
E.g., fun fact: if you take
df = pd.DataFrame(np.random.randint(0, 10, (10**6 + 1), dtype='int8'), columns=['A'])
i.e. 10^6 + 1 elements instead of 10^6, pandas falls back to numpy's algorithm (which is not clever in my opinion) and becomes better for small inputs but worse for big ones:
%timeit df['A'].isin(vals) # 6ms was 17.0 ms
%timeit df['A'].isin(vals2) # 100ms was 16.8 ms
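To see where the crossover actually happens, it helps to sweep both sizes; a rough benchmark sketch (the sizes below are arbitrary):

import timeit
import numpy as np
import pandas as pd

for n in (10**4, 10**5, 10**6):
    for m in (10, 10**3, 10**5):
        df = pd.DataFrame({'A': np.random.randint(0, 10, n, dtype='int8')})
        vals = np.random.randint(0, 10, m, dtype='int64')
        t_pd = timeit.timeit(lambda: df['A'].isin(vals), number=5) / 5
        t_np = timeit.timeit(lambda: np.in1d(df['A'], vals), number=5) / 5
        print(f"n={n:>8} m={m:>7}  pandas {t_pd*1e3:7.2f} ms   numpy {t_np*1e3:7.2f} ms")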
In pandas/seaborn:
sns.distplot(combo['resubmits'], kde=False, bins=8)
plt.savefig("g1.png")
Makes a very pretty histogram. I want to include a textual "legend" showing the mean, stdev, n, etc. as numbers in a box. You would think this is so common that there is a semi-automatic way to do it, but I can't find one.
There is a feature request for that.
However, note that using matplotlib.pyplot.axvline, you can easily do it yourself for now.
from matplotlib import pyplot as plt
plt.axvline(x)
where x = combo['resubmits'].mean(). Note that axvline's optional second and third arguments (ymin, ymax) are fractions of the axes height between 0 and 1, not data values, so by default the line spans the full height of the plot.
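For the textual box itself, a simple hand-rolled option is ax.text with a bbox placed in axes coordinates. A sketch with made-up sample data standing in for combo['resubmits']:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

resubmits = pd.Series(np.random.poisson(3, size=500))   # stand-in for combo['resubmits']

ax = sns.distplot(resubmits, kde=False, bins=8)
stats = "n = {}\nmean = {:.2f}\nstd = {:.2f}".format(
    resubmits.count(), resubmits.mean(), resubmits.std())
ax.text(0.95, 0.95, stats, transform=ax.transAxes,       # axes coordinates: top right corner
        ha='right', va='top', bbox=dict(boxstyle='round', facecolor='white'))
ax.axvline(resubmits.mean(), color='k', linestyle='--')  # mark the mean as well
plt.savefig("g1.png")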