Given
df = pd.DataFrame({'x': [np.array(['1', '2.3']), np.array(['30', '99'])]},
                  index=pd.date_range('2020-01-01', '2020-01-02', freq='D'))
I would like to filter for np.array(['1', '2.3']). I can do
df[df['x'].apply(lambda x: np.array_equal(x, np.array(['1', '2.3'])))]
but is this the fastest way to do it?
EDIT:
Let's assume that all the elements inside the numpy arrays are strings, even though it's not good practice!
The DataFrame can grow to 500k rows, and each numpy array can hold up to 10 values.
You can rely on a list comprehension for performance:
df[np.array([np.array_equal(x, np.array(['1', '2.3'])) for x in df['x'].values])]
Performance via timeit (on my system, currently using 4 GB RAM):
%timeit -n 2000 df[np.array([np.array_equal(x, np.array(['1', '2.3'])) for x in df['x'].values])]
#output:
425 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
%timeit -n 2000 df[df['x'].apply(lambda x: np.array_equal(x, np.array(['1', '2.3'])))]
#output:
875 µs ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
My suggestion would be to do the following:
import numpy as np
mat = np.stack([np.array(["a","b","c"]),np.array(["d","e","f"])])
In reality these would be the arrays from the column of your dataframe; make sure they are stacked into a single 2-D numpy array.
Then do:
matching_rows = (np.array(["a","b","c"]) == mat).all(axis=1)
This gives you an array of booleans indicating where the matching rows are located.
So you can then filter your rows like this:
df[matching_rows]
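Applied to the DataFrame from the question, a minimal sketch of the same idea could look like this (it assumes every array in column 'x' has the same length, otherwise np.stack will raise):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [np.array(['1', '2.3']), np.array(['30', '99'])]},
                  index=pd.date_range('2020-01-01', '2020-01-02', freq='D'))

# stack the per-row arrays into one 2-D array, then compare in a single vectorized pass
mat = np.stack(df['x'].tolist())
matching_rows = (np.array(['1', '2.3']) == mat).all(axis=1)
df[matching_rows]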
I have two geodataframes or geoseries, both consisting of thousands of points.
My requirement is to append (merge) both geodataframes and drop duplicate points.
In other words: output = all points of gdf1 + the points of gdf2 that do not intersect with the points of gdf1.
I tried as:
output = geopandas.overlay(gdf1, gdf2, how='symmetric_difference')
However, it is very slow.
Do you know a faster way of doing it?
Here is another way of combining the dataframes using pandas, with timings compared against geopandas:
import pandas as pd
import numpy as np
data1 = np.random.randint(-100, 100, size=10000)
data2 = np.random.randint(-100, 100, size=10000)
df1 = pd.concat([-pd.Series(data1, name="longitude"), pd.Series(data1, name="latitude")], axis=1)
df1['geometry'] = df1.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
df2 = pd.concat([-pd.Series(data2, name="longitude"), pd.Series(data2, name="latitude")], axis=1)
df2['geometry'] = df2.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
df1 = df1.set_index(["longitude", "latitude"])
df2 = df2.set_index(["longitude", "latitude"])
%timeit pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
112 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This seems a lot faster than using geopandas:
import geopandas as gp
gdf1 = gp.GeoDataFrame(
    df1, geometry=gp.points_from_xy(df1.index.get_level_values("longitude"), df1.index.get_level_values("latitude")))
gdf2 = gp.GeoDataFrame(
    df2, geometry=gp.points_from_xy(df2.index.get_level_values("longitude"), df2.index.get_level_values("latitude")))
%timeit gp.overlay(gdf1, gdf2, how='symmetric_difference')
29 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But maybe you need some kind of optimisation, as mentioned here.
The concat expression selects the non-matching indexes from each dataframe and then combines them. A small example:
df1 = pd.DataFrame([1,2,3,4],columns=['col1']).set_index("col1")
df2 = pd.DataFrame([3,4,5,6],columns=['col1']).set_index("col1")
pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
col1
1
2
5
6
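If you need the result to stay a GeoDataFrame and you want exactly what the question states (all of gdf1 plus the gdf2 points not already in gdf1), a sketch of the same isin idea keyed on the geometry itself might look like this (it assumes "intersect" means exact coordinate equality, so the WKB bytes of equal points match):
import pandas as pd

# hashable key per point: its WKB representation (assumption: exact equality is enough)
key1 = gdf1.geometry.apply(lambda g: g.wkb)
key2 = gdf2.geometry.apply(lambda g: g.wkb)

# all of gdf1 plus the gdf2 points whose geometry does not appear in gdf1
output = pd.concat([gdf1, gdf2[~key2.isin(key1)]])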
If I have an image made of uint16s and want to compute a histogram over all 16-bit values, i.e. a vector x of 0..65535 that contains the intensity values, and a vector y that is the number of samples that have each value, is there a vectorized numpy / linear algebra way to compute this?
I did it the obvious way with Numpy, and using your image dimensions on my Mac, it takes 300ms. I then did the same thing with OpenCV and it is 33x faster at 9ms!
#!/usr/bin/env python3
import cv2
import numpy as np
# Dimensions - height, width
h, w = 2160, 2560
# Known image, channel0=1, channel1=3, channel2=5, channel3=65535
R = np.zeros((h,w,4), dtype=np.uint16)
R[...,0] = 1
R[...,1] = 3
R[...,2] = 5
R[...,3] = 65535
def npHistogram(R):
    """Generate histogram using Numpy"""
    H, _ = np.histogram(R, bins=65536, range=(0, 65536))
    return H
def OpenCVHistogram(R):
    """Generate histogram using OpenCV"""
    H = cv2.calcHist([R.ravel()], [0], None, [65536], [0, 65536])
    return H
A = npHistogram(R)
B = OpenCVHistogram(R)
#%timeit npHistogram(R)
#%timeit OpenCVHistogram(R)
Results
Using IPython, I got these timings
%timeit npHistogram(R)
300 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit OpenCVHistogram(R)
9.02 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Keywords: Python, histogram, slow, Numpy, np.histogram, speedup, OpenCV, image processing.
Ok, if OpenCV is too big a dependency for you to get 9ms processing time instead of 300ms, how about Numba? This runs in 10ms.
#!/usr/bin/env python3
import numpy as np
from numba import jit
# Dimensions - height, width
h, w = 2160, 2560
# Known image, channel0=1, channel1=3, channel2=5, channel3=65535
R = np.zeros((h,w,4), dtype=np.uint16)
R[...,0] = 1
R[...,1] = 3
R[...,2] = 5
R[...,3] = 65535
@jit(nopython=True, nogil=True)
def NumbaHistogram(pixels):
    """Histogram of uint16 image"""
    H = np.zeros(65536, dtype=np.int32)
    for i in range(len(pixels)):
        H[pixels[i]] += 1
    return H
#%timeit q = NumbaHistogram(R.ravel())
Results
%timeit NumbaHistogram(R.ravel())
10.4 ms ± 54.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
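If you want to stay in pure NumPy without OpenCV or Numba, np.bincount is usually much faster than np.histogram on integer data, since it just counts values directly. A sketch (not timed on the same machine as the numbers above):
import numpy as np

def BincountHistogram(R):
    """Histogram of a uint16 image via np.bincount"""
    # minlength guarantees one bin per possible uint16 value, even if unused
    return np.bincount(R.ravel(), minlength=65536)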
I want to create new columns based on the elements of column Col1, which is of type set. Each element has a corresponding column name that is stored in a dict. Here is the full code:
import numpy as np
import pandas as pd
np.random.seed(123)
N = 10**4 #number of rows in the dataframe
df = pd.DataFrame({'Cnt': np.random.randint(2,10,N)})
# generate lists of random length
def f(x):
    return set(np.random.randint(101,120,x))
df['Col1'] = df['Cnt'].apply(f)
# dictionary with column names for each element in list
d = {'Item_1':101, 'Item_2':102, 'Item_3':103, 'Item_4':104, 'Item_5':105, 'Item_6':106, 'Item_7':107, 'Item_8':108,
'Item_9':109, 'Item_10':110, 'Item_11':111, 'Item_12':112, 'Item_13':113, 'Item_14':114, 'Item_15':115, 'Item_16':116,
'Item_17':117, 'Item_18':118, 'Item_19':119, 'Item_20':120}
def elem_in_set(x, e):
    return 1 if e in x else 0

def create_columns(input_data, d):
    df = input_data.copy()
    for k, v in d.items():
        df[k] = df.apply(lambda x: elem_in_set(x['Col1'], v), axis=1)
    return df
%timeit create_columns(df, d)
#5.05 s ± 78.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The problem is that the production dataframe has about 400k rows, and my solution does not scale well at all - I'm looking at around 10 minutes on my machine. The column containing all elements (Col1) could be type list instead of set, but that doesn't improve performance.
Is there a faster solution to this?
I made a small change to the apply inside your create_columns. It seems to work much faster now.
import numpy as np
import pandas as pd
np.random.seed(123)
N = 10**4 #number of rows in the dataframe
df = pd.DataFrame({'Cnt': np.random.randint(2,10,N)})
# generate lists of random length
def f(x):
    return set(np.random.randint(101,120,x))
df['Col1'] = df['Cnt'].apply(f)
# dictionary with column names for each element in list
d = {'Item_1':101, 'Item_2':102, 'Item_3':103, 'Item_4':104, 'Item_5':105, 'Item_6':106, 'Item_7':107, 'Item_8':108,
'Item_9':109, 'Item_10':110, 'Item_11':111, 'Item_12':112, 'Item_13':113, 'Item_14':114, 'Item_15':115, 'Item_16':116,
'Item_17':117, 'Item_18':118, 'Item_19':119, 'Item_20':120}
def create_columns(input_data, d):
    df = input_data.copy()
    for k, v in d.items():
        df[k] = df.Col1.apply(lambda x: 1 if v in x else 0)
    return df
%timeit create_columns(df, d)
#191 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
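If a per-column apply is still too slow on 400k rows, another sketch is to build all indicator columns in one pass with a plain list comprehension and join the result back; the helper name create_columns_flags is hypothetical, and it assumes the dict's order is the column order you want:
def create_columns_flags(input_data, d):
    # one 0/1 row per set in Col1, one column per dictionary entry
    flags = pd.DataFrame(
        [[1 if v in s else 0 for v in d.values()] for s in input_data['Col1']],
        columns=list(d.keys()),
        index=input_data.index)
    return input_data.join(flags)

create_columns_flags(df, d)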
I want to calculate the row-wise dot product of two matrices of the same dimension as fast as possible. This is the way I am doing it:
import numpy as np
a = np.array([[1,2,3], [3,4,5]])
b = np.array([[1,2,3], [1,2,3]])
result = np.array([])
for row1, row2 in zip(a, b):
    result = np.append(result, np.dot(row1, row2))
print(result)
and of course the output is:
[ 14.  26.]
The straightforward way to do that is:
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
np.sum(a*b, axis=1)
which avoids the python loop and is faster in cases like:
def npsumdot(x, y):
    return np.sum(x*y, axis=1)

def loopdot(x, y):
    result = np.empty((x.shape[0]))
    for i in range(x.shape[0]):
        result[i] = np.dot(x[i], y[i])
    return result
timeit npsumdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 1.58 s per loop
Check out numpy.einsum for another method:
In [52]: a
Out[52]:
array([[1, 2, 3],
       [3, 4, 5]])
In [53]: b
Out[53]:
array([[1, 2, 3],
       [1, 2, 3]])
In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])
Looks like einsum is a bit faster than inner1d:
In [94]: %timeit inner1d(a,b)
1000000 loops, best of 3: 1.8 us per loop
In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop
In [96]: a = random.randn(10, 100)
In [97]: b = random.randn(10, 100)
In [98]: %timeit inner1d(a,b)
100000 loops, best of 3: 2.89 us per loop
In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop
Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.
Played around with this and found inner1d the fastest. That function, however, is internal, so a more robust approach is to use
numpy.einsum("ij,ij->i", a, b)
Even better is to align your memory such that the summation happens in the first dimension, e.g.,
a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)
numpy.einsum("ij,ij->j", a, b)
For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.
Note also that the transposed summation is much faster than its untransposed equivalent.
The plot was created with perfplot (a small project of mine):
import numpy
from numpy.core.umath_tests import inner1d
import perfplot
def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    aT = numpy.ascontiguousarray(a.T)
    bT = numpy.ascontiguousarray(b.T)
    return (a, b), (aT, bT)

b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(1, 25)],
    kernels=[
        lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
        lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
        lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
        lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
        lambda data: inner1d(data[0][0], data[0][1]),
    ],
    labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
    xlabel="len(a), len(b)",
)
b.save("out1.png")
b.save("out2.png", relative_to=3)
You'll do better avoiding the append, but I can't think of a way to avoid the python loop. A custom Ufunc perhaps? I don't think numpy.vectorize will help you here.
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
result = np.empty((2,))
for i in range(2):
    result[i] = np.dot(a[i], b[i])
print(result)
EDIT
Based on this answer, it looks like inner1d might work if the vectors in your real-world problem are 1D.
from numpy.core.umath_tests import inner1d
inner1d(a,b) # array([14, 26])
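Note that numpy.core.umath_tests is a private module and inner1d has been deprecated and removed in newer NumPy releases. If you want a similar single-call row-wise dot product on a current NumPy, a sketch using matmul on reshaped operands gives the same result:
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5]])
b = np.array([[1, 2, 3], [1, 2, 3]])

# treat each row pair as a (1, n) @ (n, 1) matrix product, then drop the singleton axes
np.matmul(a[:, None, :], b[:, :, None])[:, 0, 0]   # array([14, 26])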
I came across this answer and re-verified the results with Numpy 1.14.3 running in Python 3.5. For the most part the answers above hold true on my system, although I found that for very large matrices (see example below), all but one of the methods are so close to one another that the performance difference is meaningless.
For smaller matrices, I found that einsum was the fastest by a considerable margin, up to a factor of two in some cases.
My large matrix example:
import numpy as np
from numpy.core.umath_tests import inner1d
a = np.random.randn(100, 1000000) # 800 MB each
b = np.random.randn(100, 1000000) # pretty big.
def loop_dot(a, b):
    result = np.empty((a.shape[0],))
    for i, (row1, row2) in enumerate(zip(a, b)):
        result[i] = np.dot(row1, row2)
    return result
%timeit inner1d(a, b)
# 128 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.einsum('ij,ij->i', a, b)
# 121 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(a*b, axis=1)
# 411 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop_dot(a, b) # note the function call took negligible time
# 123 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So einsum is still the fastest on very large matrices, but by a tiny amount. It appears to be a statistically significant (tiny) amount though!