Dropping Duplicate Points - pandas

I have two GeoDataFrames (or GeoSeries), both consisting of thousands of points.
My requirement is to append (merge) both GeoDataFrames and drop duplicate points.
In other words, output = all points of gdf1 + the points of gdf2 that do not intersect with points of gdf1.
I tried:
output = geopandas.overlay(gdf1, gdf2, how='symmetric_difference')
However, it is very slow.
Do you know a faster way of doing it?

Here is another way of combining dataframes using pandas, along with timings, versus geopandas:
import pandas as pd
import numpy as np
data1 = np.random.randint(-100, 100, size=10000)
data2 = np.random.randint(-100, 100, size=10000)
df1 = pd.concat([-pd.Series(data1, name="longitude"), pd.Series(data1, name="latitude")], axis=1)
df1['geometry'] = df1.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
df2 = pd.concat([-pd.Series(data2, name="longitude"), pd.Series(data2, name="latitude")], axis=1)
df2['geometry'] = df2.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
df1 = df1.set_index(["longitude", "latitude"])
df2 = df2.set_index(["longitude", "latitude"])
%timeit pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
112 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This seems a lot faster than using geopandas:
import geopandas as gp
gdf1 = gp.GeoDataFrame(
    df1, geometry=gp.points_from_xy(df1.index.get_level_values("longitude"),
                                    df1.index.get_level_values("latitude")))
gdf2 = gp.GeoDataFrame(
    df2, geometry=gp.points_from_xy(df2.index.get_level_values("longitude"),
                                    df2.index.get_level_values("latitude")))
%timeit gp.overlay(gdf1, gdf2, how='symmetric_difference')
29 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But maybe you need some kind of optimisation, as mentioned here.
The approach checks for the non-matching indexes of each df and then combines them:
df1 = pd.DataFrame([1,2,3,4],columns=['col1']).set_index("col1")
df2 = pd.DataFrame([3,4,5,6],columns=['col1']).set_index("col1")
pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
col1
1
2
5
6
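Note that the requirement in the question (keep all of gdf1, add only the gdf2 points not already in gdf1) is not quite a symmetric difference. Assuming the (longitude, latitude)-indexed frames built above, a minimal sketch of that variant with the same isin trick would be (exact coordinate matches only, not spatial intersection):
import pandas as pd
# keep every row of df1, append only the df2 rows whose coordinate index
# does not already appear in df1 (sketch, not the timed answer above)
combined = pd.concat([df1, df2[~df2.index.isin(df1.index)]])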

Related

Date Countdown with pandas

I'm trying to calculate the difference between a date and today in months.
Here is what I have so far:
import pandas as pd
import numpy as np
from datetime import date
def calc_date_countdown(df):
    today = date.today()
    df['countdown'] = df['date'].apply(lambda x: (x - today) / np.timedelta64(1, 'M'))
    df['countdown'] = df['countdown'].astype(int)
    return df
Any pointers on what I'm doing wrong or maybe a more efficient way of doing it?
When I run it on my dataset, this is the error I'm getting: TypeError: unsupported operand type(s) for -: 'Timestamp' and 'datetime.date'
Using apply is not very efficient here, because this can be done as an array operation.
See the below example:
from datetime import date, datetime
def per_array(df):
    df['months'] = ((pd.to_datetime(date.today()) - df['date']) / np.timedelta64(1, 'M')).astype(int)
    return df

def using_apply(df):
    today = date.today()
    df['months'] = df['date'].apply(lambda x: (x - pd.to_datetime(today)) / np.timedelta64(1, 'M'))
    df['months'] = df['months'].astype(int)
    return df
df = pd.DataFrame({'date': [pd.to_datetime(f"2023-0{i}-01") for i in range(1,8)]})
print(df)
# date
# 0 2023-01-01
# 1 2023-02-01
# 2 2023-03-01
# 3 2023-04-01
# 4 2023-05-01
# 5 2023-06-01
# 6 2023-07-01
Timing it:
%%timeit
per_array(df)
195 µs ± 5.14 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
using_apply(df)
384 µs ± 3.22 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
As you can see, it is around twice as fast to not use apply.
import pandas as pd

def calc_date_countdown(df):
    today = pd.Timestamp.today()
    df['countdown'] = df['date'].apply(lambda x: (x - today).days // 30)
    return df
This should work as long as the date column in your dataframe contains Timestamp objects (note that dividing the day count by 30 only approximates months). If it doesn't, you may need to convert it using pd.to_datetime() before running the function.
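For example, if the date column arrives as strings, a quick (hypothetical) conversion before calling the function above could look like:
sample = pd.DataFrame({'date': ['2025-01-15', '2025-06-01']})   # hypothetical input
sample['date'] = pd.to_datetime(sample['date'])                 # ensure Timestamp dtype
calc_date_countdown(sample)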

apply calculation in pandas column with groupby

What could be wrong in the code below?
a) I need to group by the AREA column and apply a mathematical formula across other columns.
b) Also, if I have another column, let's say a date, that needs to be added to the groupby, how would it fit into the command below?
df3 = dataset.groupby('AREA')(['col1']+['col2']).sum()
(The example table was provided as an image in the original post.)
I think you can sum the columns before grouping, for better performance:
dataset['new'] = dataset['col1']+dataset['col2']
df3 = dataset.groupby('AREA', as_index=False)['new'].sum()
But your solution is possible in lambda function:
df3 = (dataset.groupby('AREA')
              .apply(lambda x: (x['col1'] + x['col2']).sum())
              .reset_index(name='SUM'))
Performance:
np.random.seed(123)
N = 100000
dataset = pd.DataFrame({'AREA': np.random.randint(1000, size=N),
                        'col1': np.random.randint(10, size=N),
                        'col2': np.random.randint(10, size=N)})
#print (dataset)
In [24]: %%timeit
...: dataset['new'] = dataset['col1']+dataset['col2']
...: df3 = dataset.groupby('AREA', as_index=False)['new'].sum()
...:
7.64 ms ± 50.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [25]: %%timeit
...: df3 = (dataset.groupby('AREA')
...: .apply(lambda x: (x['col1']+x['col2']).sum())
...: .reset_index(name='SUM'))
...:
368 ms ± 5.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
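Regarding part b) of the question, which the timings above do not cover: adding another key such as a date column to the grouping is just a matter of passing a list of column names. A sketch, assuming a hypothetical 'DATE' column in the data:
dataset['new'] = dataset['col1'] + dataset['col2']
# group by both keys; 'DATE' is illustrative and not present in the benchmark data above
df3 = dataset.groupby(['AREA', 'DATE'], as_index=False)['new'].sum()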

Do I need the left table indexed for efficiency when left merging with another indexed DataFrame?

I want to do a left merge to join two pandas DataFrame:
merged_df = left_df.merge(right_df, how='left', left_on='id', right_index=True)
left_df is not indexed; it only has an id column, but right_df is indexed.
I have not indexed left_df since it changes continuously, but would the merge be faster if the left DataFrame were also indexed? In my case the merge is done very frequently, and so far the left DataFrame has up to 60k rows and the right up to 1000.
I have not checked pandas' merge code, but since a left merge keeps all rows of the left DataFrame, I am not sure whether indexing it would speed the merge up.
Let's just test it with fake data:
import pandas as pd
import numpy as np
# df1: 60k rows, not indexed
df1 = pd.DataFrame(data={'a': np.random.randint(0, 100, 60_000),
                         'b': np.random.randint(0, 100, 60_000)})
# df2: 1k rows, indexed
df2 = pd.DataFrame(data={'c': np.random.randint(0, 100, 1000)},
                   index=np.random.randint(0, 100, 1000))
Join performance:
%timeit pd.merge(df1, df2, left_on='a', right_index=True, how='left')
55.4 ms ± 6.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.merge(df1.set_index('a'), df2, left_index=True, right_index=True, how='left')
49.8 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This already shows better performance when joining on the index. However, I am also setting the index inside the join itself, which only needs to be done once if you have multiple joins. Let's look at how the time splits between the two operations:
%time df1.set_index('a', inplace=True)
Wall time: 936 µs
%timeit pd.merge(df1, df2, left_index=True, right_index=True, how='left')
48 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
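In the asker's terms, the pattern would therefore be to pay the set_index cost once, keep the indexed frame in sync as left_df changes, and reuse it for the frequent merges. A sketch using the names from the question:
left_indexed = left_df.set_index('id')   # one-off cost (about 1 ms for 60k rows above)
# every subsequent merge then joins index-to-index
merged_df = left_indexed.merge(right_df, left_index=True, right_index=True, how='left')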

Optimization - split a column of type set into multiple columns

I want to create new columns based on the elements of column Col1, which is of type set. Each element has a corresponding column name that is stored in a dict. Here is the full code:
import numpy as np
import pandas as pd
np.random.seed(123)
N = 10**4 #number of rows in the dataframe
df = pd.DataFrame({'Cnt': np.random.randint(2,10,N)})
# generate lists of random length
def f(x):
    return set(np.random.randint(101, 120, x))
df['Col1'] = df['Cnt'].apply(f)
# dictionary with column names for each element in list
d = {'Item_1': 101, 'Item_2': 102, 'Item_3': 103, 'Item_4': 104, 'Item_5': 105,
     'Item_6': 106, 'Item_7': 107, 'Item_8': 108, 'Item_9': 109, 'Item_10': 110,
     'Item_11': 111, 'Item_12': 112, 'Item_13': 113, 'Item_14': 114, 'Item_15': 115,
     'Item_16': 116, 'Item_17': 117, 'Item_18': 118, 'Item_19': 119, 'Item_20': 120}

def elem_in_set(x, e):
    return 1 if e in x else 0

def create_columns(input_data, d):
    df = input_data.copy()
    for k, v in d.items():
        df[k] = df.apply(lambda x: elem_in_set(x['Col1'], v), axis=1)
    return df
%timeit create_columns(df, d)
#5.05 s ± 78.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The problem is that the production dataframe has about 400k rows, and my solution does not scale well at all: I'm looking at around 10 minutes on my machine. The column containing all the elements (Col1) could be of type list instead of set, but that doesn't improve performance.
Is there a faster solution to this?
I made a small change to the apply call in your create_columns. It seems to work much faster now.
import numpy as np
import pandas as pd
np.random.seed(123)
N = 10**4 #number of rows in the dataframe
df = pd.DataFrame({'Cnt': np.random.randint(2,10,N)})
# generate lists of random length
def f(x):
    return set(np.random.randint(101, 120, x))
df['Col1'] = df['Cnt'].apply(f)
# dictionary with column names for each element in list
d = {'Item_1': 101, 'Item_2': 102, 'Item_3': 103, 'Item_4': 104, 'Item_5': 105,
     'Item_6': 106, 'Item_7': 107, 'Item_8': 108, 'Item_9': 109, 'Item_10': 110,
     'Item_11': 111, 'Item_12': 112, 'Item_13': 113, 'Item_14': 114, 'Item_15': 115,
     'Item_16': 116, 'Item_17': 117, 'Item_18': 118, 'Item_19': 119, 'Item_20': 120}

def create_columns(input_data, d):
    df = input_data.copy()
    for k, v in d.items():
        df[k] = df.Col1.apply(lambda x: 1 if v in x else 0)
    return df
%timeit create_columns(df, d)
#191 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
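If that is still too slow at 400k rows, a further option (not part of the answer above, just a sketch) is to build all the indicator columns in one pass with a comprehension and attach them with a single concat:
def create_columns_concat(input_data, d):
    # one 0/1 list per item, built directly from Col1, then attached all at once
    indicators = pd.DataFrame(
        {k: [1 if v in s else 0 for s in input_data['Col1']] for k, v in d.items()},
        index=input_data.index)
    return pd.concat([input_data, indicators], axis=1)

create_columns_concat(df, d)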

Vectorized way of calculating row-wise dot product two matrices with Scipy

I want to calculate the row-wise dot product of two matrices of the same dimension as fast as possible. This is the way I am doing it:
import numpy as np
a = np.array([[1,2,3], [3,4,5]])
b = np.array([[1,2,3], [1,2,3]])
result = np.array([])
for row1, row2 in zip(a, b):
    result = np.append(result, np.dot(row1, row2))
print(result)
and of course the output is:
[ 14.  26.]
A straightforward way to do that is:
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
np.sum(a*b, axis=1)
which avoids the python loop and is faster in cases like:
def npsumdot(x, y):
    return np.sum(x * y, axis=1)

def loopdot(x, y):
    result = np.empty((x.shape[0],))
    for i in range(x.shape[0]):
        result[i] = np.dot(x[i], y[i])
    return result
timeit npsumdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 1.58 s per loop
Check out numpy.einsum for another method:
In [52]: a
Out[52]:
array([[1, 2, 3],
       [3, 4, 5]])
In [53]: b
Out[53]:
array([[1, 2, 3],
       [1, 2, 3]])
In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])
Looks like einsum is a bit faster than inner1d:
In [94]: %timeit inner1d(a,b)
1000000 loops, best of 3: 1.8 us per loop
In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop
In [96]: a = random.randn(10, 100)
In [97]: b = random.randn(10, 100)
In [98]: %timeit inner1d(a,b)
100000 loops, best of 3: 2.89 us per loop
In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop
Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.
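A minimal way to run such a check on your own installation (the array sizes here are arbitrary) might be:
import timeit
import numpy as np

a = np.random.rand(10_000, 50)
b = np.random.rand(10_000, 50)
env = {'np': np, 'a': a, 'b': b}

# time each row-wise dot product variant with the stdlib timeit module
for label, stmt in [('einsum', "np.einsum('ij,ij->i', a, b)"),
                    ('sum', 'np.sum(a * b, axis=1)')]:
    t = timeit.timeit(stmt, globals=env, number=100)
    print(f'{label}: {t / 100 * 1e3:.3f} ms per call')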
Playing around with this, I found inner1d the fastest. That function, however, is internal, so a more robust approach is to use
numpy.einsum("ij,ij->i", a, b)
Even better is to align your memory such that the summation happens in the first dimension, e.g.,
a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)
numpy.einsum("ij,ij->j", a, b)
For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.
Note also that the transposed summation is much faster than its untransposed equivalent.
The plot was created with perfplot (a small project of mine)
import numpy
from numpy.core.umath_tests import inner1d
import perfplot
def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    aT = numpy.ascontiguousarray(a.T)
    bT = numpy.ascontiguousarray(b.T)
    return (a, b), (aT, bT)
b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(1, 25)],
    kernels=[
        lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
        lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
        lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
        lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
        lambda data: inner1d(data[0][0], data[0][1]),
    ],
    labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
    xlabel="len(a), len(b)",
)
b.save("out1.png")
b.save("out2.png", relative_to=3)
You'll do better avoiding the append, but I can't think of a way to avoid the Python loop. A custom ufunc, perhaps? I don't think numpy.vectorize will help you here.
import numpy as np
a = np.array([[1, 2, 3], [3, 4, 5]])
b = np.array([[1, 2, 3], [1, 2, 3]])
result = np.empty((2,))
for i in range(2):
    result[i] = np.dot(a[i], b[i])
print(result)
EDIT
Based on this answer, it looks like inner1d might work if the vectors in your real-world problem are 1D.
from numpy.core.umath_tests import inner1d
inner1d(a,b) # array([14, 26])
I came across this answer and re-verified the results with NumPy 1.14.3 running in Python 3.5. For the most part the answers above hold true on my system, although I found that for very large matrices (see the example below), all but one of the methods are so close to one another that the performance difference is meaningless.
For smaller matrices, I found that einsum was the fastest by a considerable margin, up to a factor of two in some cases.
My large matrix example:
import numpy as np
from numpy.core.umath_tests import inner1d
a = np.random.randn(100, 1000000)  # 800 MB each
b = np.random.randn(100, 1000000)  # pretty big.
def loop_dot(a, b):
    result = np.empty((a.shape[0],))   # one dot product per row pair
    for i, (row1, row2) in enumerate(zip(a, b)):
        result[i] = np.dot(row1, row2)
    return result
%timeit inner1d(a, b)
# 128 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.einsum('ij,ij->i', a, b)
# 121 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(a*b, axis=1)
# 411 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop_dot(a, b) # note the function call took negligible time
# 123 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So einsum is still the fastest on very large matrices, but by a tiny amount. It appears to be a statistically significant (tiny) amount though!