Optimization - split a column of type set into multiple columns - pandas

I want to create new columns based on the elements of column Col1, which is of type set. Each element has a corresponding column name that is stored in a dict. Here is the full code:
import numpy as np
import pandas as pd
np.random.seed(123)
N = 10**4 #number of rows in the dataframe
df = pd.DataFrame({'Cnt': np.random.randint(2,10,N)})
# generate lists of random length
def f(x):
    return set(np.random.randint(101,120,x))
df['Col1'] = df['Cnt'].apply(f)
# dictionary with column names for each element in list
d = {'Item_1':101, 'Item_2':102, 'Item_3':103, 'Item_4':104, 'Item_5':105, 'Item_6':106, 'Item_7':107, 'Item_8':108,
'Item_9':109, 'Item_10':110, 'Item_11':111, 'Item_12':112, 'Item_13':113, 'Item_14':114, 'Item_15':115, 'Item_16':116,
'Item_17':117, 'Item_18':118, 'Item_19':119, 'Item_20':120}
def elem_in_set(x, e):
    return 1 if e in x else 0

def create_columns(input_data, d):
    df = input_data.copy()
    for k, v in d.items():
        df[k] = df.apply(lambda x: elem_in_set(x['Col1'], v), axis=1)
    return df
%timeit create_columns(df, d)
#5.05 s ± 78.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The problem is that the production dataframe has about 400k rows, and my solution does not scale well at all - I'm looking at around 10 minutes on my machine. The column containing all elements (Col1) could be type list instead of set, but that doesn't improve performance.
Is there a faster solution to this?

I made a small change to the apply inside your create_columns: apply over the Col1 Series directly instead of row-wise over the whole DataFrame with axis=1. It runs much faster now.
import numpy as np
import pandas as pd
np.random.seed(123)
N = 10**4 #number of rows in the dataframe
df = pd.DataFrame({'Cnt': np.random.randint(2,10,N)})
# generate lists of random length
def f(x):
    return set(np.random.randint(101,120,x))
df['Col1'] = df['Cnt'].apply(f)
# dictionary with column names for each element in list
d = {'Item_1':101, 'Item_2':102, 'Item_3':103, 'Item_4':104, 'Item_5':105, 'Item_6':106, 'Item_7':107, 'Item_8':108,
'Item_9':109, 'Item_10':110, 'Item_11':111, 'Item_12':112, 'Item_13':113, 'Item_14':114, 'Item_15':115, 'Item_16':116,
'Item_17':117, 'Item_18':118, 'Item_19':119, 'Item_20':120}
def create_columns(input_data, d):
    df = input_data.copy()
    for k, v in d.items():
        df[k] = df.Col1.apply(lambda x: 1 if v in x else 0)
    return df

%timeit create_columns(df, d)
#191 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
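If you want to drop the per-item apply entirely, a more pandas-native sketch is to explode Col1 and build every indicator column in one shot with get_dummies. This builds on the df and d defined above and assumes every set is non-empty (true for the generated data, since Cnt >= 2); whether it actually beats the Series apply depends on your data, so time it yourself.
def create_columns_dummies(input_data, d):
    out = input_data.copy()
    # one row per (original row, set element) pair
    exploded = out['Col1'].explode()
    # 0/1 indicator matrix: one column per distinct element, collapsed back to one row per original index
    indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
    # map the element values (101..120) back to the Item_* names from d
    indicators = indicators.rename(columns={v: k for k, v in d.items()})
    return out.join(indicators)
%timeit create_columns_dummies(df, d)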

Related

Date Countdown with pandas

I'm trying to calculate the difference between a date and today, in months.
Here is what I have so far:
import pandas as pd
import numpy as np
from datetime import date
def calc_date_countdown(df):
    today = date.today()
    df['countdown'] = df['date'].apply(lambda x: (x - today) / np.timedelta64(1, 'M'))
    df['countdown'] = df['countdown'].astype(int)
    return df
Any pointers on what I'm doing wrong or maybe a more efficient way of doing it?
When I run on my dataset, this is the error I'm getting: TypeError: unsupported operand type(s) for -: 'Timestamp' and 'datetime.date'
Using apply is not very efficient here, since this can be done as an array operation.
See the below example:
from datetime import date, datetime
def per_array(df):
    df['months'] = ((pd.to_datetime(date.today()) - df['date']) / np.timedelta64(1, 'M')).astype(int)
    return df

def using_apply(df):
    today = date.today()
    df['months'] = df['date'].apply(lambda x: (x - pd.to_datetime(today)) / np.timedelta64(1, 'M'))
    df['months'] = df['months'].astype(int)
    return df
df = pd.DataFrame({'date': [pd.to_datetime(f"2023-0{i}-01") for i in range(1,8)]})
print(df)
# date
# 0 2023-01-01
# 1 2023-02-01
# 2 2023-03-01
# 3 2023-04-01
# 4 2023-05-01
# 5 2023-06-01
# 6 2023-07-01
Timing it:
%%timeit
per_array(df)
195 µs ± 5.14 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
using_apply(df)
384 µs ± 3.22 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
As you can see, it is around twice as fast to not use apply.
import pandas as pd

def calc_date_countdown(df):
    today = pd.Timestamp.today()
    df['countdown'] = df['date'].apply(lambda x: (x - today).days // 30)
    return df
This should work as long as your date column in the dataframe is a Timestamp object. If it's not, you may need to convert it using pd.to_datetime() before running the function.
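If you need calendar months rather than a 30-day or average-month approximation, a vectorized sketch using the year and month components also avoids apply (it assumes df['date'] is already datetime64, as in the example above):
import pandas as pd

def countdown_calendar_months(df):
    today = pd.Timestamp.today()
    df['countdown'] = (df['date'].dt.year - today.year) * 12 + (df['date'].dt.month - today.month)
    return df
This counts whole calendar months and ignores the day of the month; adjust it if partial months matter to you.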

Removing items in one 2d numpy array from another

I have a function, foo, which returns an np array containing every possible combination of np.arange(n) when k numbers are removed.
import numpy as np
from itertools import combinations
def foo(n, k):
    return np.array([np.delete(np.arange(n), i) for i in combinations(range(n), k)])
The output of this function is correct, but the list comprehension it uses means a longer processing time when larger numbers are involved. Is there a more efficient solution to this using pure numpy?
I have tried using np.delete with idx as the key (a 2d array that contains the values to remove on each row), along with a broadcasted np.arange without success:
import numpy as np
from itertools import combinations
k = 2
n = 15
idx = np.array([i for i in combinations(range(n),k)])
arr = np.broadcast_to(np.arange(n), (idx.shape[0],n))
res = np.delete(arr, idx, axis=1)
This code produces an empty array.
"Is there a more efficient solution to this using pure numpy?" . No.
itertools is efficient. Instead of deleting k elements, choose (n-k) elements.
import numpy as np
from itertools import combinations

def foo_new(n, k):
    return list(combinations(np.arange(n), n - k))

def foo_old(n, k):  # your function
    return np.array([np.delete(np.arange(n), i) for i in combinations(range(n), k)])
# In [5]: %timeit foo_new(25,5)
# 3.77 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# In [6]: %timeit foo_old(25,5)
# 151 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
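If you do want to stay closer to your broadcast attempt, it can be repaired with a boolean mask instead of np.delete (np.delete drops the given column indices from every row rather than per row, so with index pairs covering every column the result comes out empty). A sketch of the mask approach; I have not benchmarked it against the itertools version, so treat its performance as an open question:
import numpy as np
from itertools import combinations

def foo_mask(n, k):
    idx = np.array(list(combinations(range(n), k)))   # one row of k indices to drop per combination
    mask = np.ones((len(idx), n), dtype=bool)
    mask[np.arange(len(idx))[:, None], idx] = False   # mark the dropped positions in each row
    arr = np.broadcast_to(np.arange(n), mask.shape)
    return arr[mask].reshape(len(idx), n - k)         # keep the surviving values, row by row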

pandas vectorize function using two dataframes

I have the following operation:
import pandas as pd
import numpy as np
def some_calc(x, y):
    x = x.set_index('Cat')
    y = y.set_index('Cat')
    y = np.sqrt(y['data_point2'])
    vec = pd.DataFrame(x['data_point1'] * y)
    grid = np.random.rand(len(x), len(x))
    result = vec.dot(vec.T).mul(grid).sum().sum()
    return result
sample_size = 100
cats = ['a','b','c','d']
df1 = pd.DataFrame({'Cat': [cats[np.random.randint(4)] for _ in range(sample_size)],
                    'data_point1': np.random.rand(sample_size),
                    'data_point2': np.random.rand(sample_size)})
df2 = df1.groupby('Cat').sum().reset_index()
I would like to run some_calc across each of the df2 rows using their relative data points from df1.
The code below works well:
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]), axis=1)
(I reset the index in df2 because I don't know how to apply across indices.
Also, I'm passing both Cat as the index field and data_point as vectors to some_calc because without an index v.dot(v.T) will crunch the dot product into one single number. This errors with .mul() because I need the full MxM matrix as opposed to a float value. I might be doing something wrong here though...)
I'm currently exploring how I can vectorize the above so that when sample_size grows I will not be hampered by a slow down in the calculation.
I saw in previous threads that you can toggle raw=True so that the input is passed as an np.array rather than a pd.Series.
df2['ApplyRaw'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                                y=df1[df1['Cat']==x['Cat']]['Cat','data_point2']), axis=1, raw=True)
However, it throws an error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I tried omitting Cat from the argument but still the same issue.
Are there any code improvements or tricks I can employ that allow me to vectorize the above?
Or do I have to amend some_calc?
I'm not sure if it's possible to vectorize your function since it's a bit complex. However, some_calc itself and how it is called can be optimized.
What
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]), axis=1)
does is basically the same as a groupby. So instead of creating these groups and applying the function on them, use groupby + apply. Simplifying the some_calc function as well, we get:
def some_calc(df):
    x = df['data_point1'].values
    y = np.sqrt(df['data_point2'].values)
    vec = (x * y).reshape(-1, 1)
    grid = np.random.rand(len(x), len(x))
    result = (vec @ vec.T * grid).sum().sum()
    return result
apply = df1.groupby('Cat').apply(some_calc)
apply.name = 'Apply'
df2.merge(apply, left_on='Cat', right_index=True)
The final merge is just to add the results to the df2 dataframe.
Timings:
# original
20.5 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# above code
3.62 ms ± 668 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
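One caveat when comparing the two versions: grid is regenerated randomly inside some_calc, so the numbers will differ between runs and between implementations. If you want to verify that the refactor matches the original, a hedged tweak is to pass in a seeded generator (the rng parameter and the some_calc_seeded name are my additions, not part of the answer above):
import numpy as np

def some_calc_seeded(df, rng):
    x = df['data_point1'].to_numpy()
    y = np.sqrt(df['data_point2'].to_numpy())
    vec = (x * y).reshape(-1, 1)
    grid = rng.random((len(x), len(x)))  # reproducible random grid
    return (vec @ vec.T * grid).sum()

rng = np.random.default_rng(0)
df1.groupby('Cat').apply(lambda g: some_calc_seeded(g, rng))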

Dropping Duplicate Points

I have two geodataframes or geoseries, both consists of thousands of points.
My requirement is to append (merge) both geodataframes and drop duplicate points.
In other words, output = gdf1 all points + gdf2 points that do not intersect with gdf1 points
I tried as:
output = geopandas.overlay(gdf1, gdf2, how='symmetric_difference')
However, it is very slow.
Do you know a faster way of doing it?
Here is another way of combining the dataframes using plain pandas, with timings against geopandas:
import pandas as pd
import numpy as np
data1 = np.random.randint(-100, 100, size=10000)
data2 = np.random.randint(-100, 100, size=10000)
df1 = pd.concat([-pd.Series(data1, name="longitude"), pd.Series(data1, name="latitude")], axis=1)
df1['geometry'] = df1.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
df2 = pd.concat([-pd.Series(data2, name="longitude"), pd.Series(data2, name="latitude")], axis=1)
df2['geometry'] = df2.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
df1 = df1.set_index(["longitude", "latitude"])
df2 = df2.set_index(["longitude", "latitude"])
%timeit pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
112 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This seems a lot faster than using geopandas
import geopandas as gp
gdf1 = gp.GeoDataFrame(
    df1, geometry=gp.points_from_xy(df1.index.get_level_values("longitude"), df1.index.get_level_values("latitude")))
gdf2 = gp.GeoDataFrame(
    df2, geometry=gp.points_from_xy(df2.index.get_level_values("longitude"), df2.index.get_level_values("latitude")))
%timeit gp.overlay(gdf1, gdf2, how='symmetric_difference')
29 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But maybe you need some kind of optimisations as mentioned here.
The approach checks each dataframe's index for labels that are not in the other and then concatenates what remains. Here it is on a small example:
df1 = pd.DataFrame([1,2,3,4],columns=['col1']).set_index("col1")
df2 = pd.DataFrame([3,4,5,6],columns=['col1']).set_index("col1")
pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
col1
1
2
5
6
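If you would rather stay in geopandas, a spatial join is usually much cheaper than overlay when all you need is "gdf2 points that do not intersect gdf1". A sketch, assuming a reasonably recent geopandas (older versions spell the predicate keyword as op):
import geopandas as gp
import pandas as pd

# left join gdf2 against the gdf1 geometries; unmatched rows get NaN in index_right
joined = gp.sjoin(gdf2, gdf1[['geometry']], how='left', predicate='intersects')
gdf2_only = joined[joined['index_right'].isna()].drop(columns='index_right')
output = pd.concat([gdf1, gdf2_only])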

Vectorized way of calculating row-wise dot product two matrices with Scipy

I want to calculate the row-wise dot product of two matrices of the same dimension as fast as possible. This is the way I am doing it:
import numpy as np
a = np.array([[1,2,3], [3,4,5]])
b = np.array([[1,2,3], [1,2,3]])
result = np.array([])
for row1, row2 in a, b:
    result = np.append(result, np.dot(row1, row2))
print result
and of course the output is:
[ 26. 14.]
A straightforward way to do that is:
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
np.sum(a*b, axis=1)
which avoids the python loop and is faster in cases like:
def npsumdot(x, y):
    return np.sum(x * y, axis=1)

def loopdot(x, y):
    result = np.empty((x.shape[0]))
    for i in range(x.shape[0]):
        result[i] = np.dot(x[i], y[i])
    return result
timeit npsumdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 1.58 s per loop
Check out numpy.einsum for another method:
In [52]: a
Out[52]:
array([[1, 2, 3],
[3, 4, 5]])
In [53]: b
Out[53]:
array([[1, 2, 3],
[1, 2, 3]])
In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])
Looks like einsum is a bit faster than inner1d:
In [94]: %timeit inner1d(a,b)
1000000 loops, best of 3: 1.8 us per loop
In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop
In [96]: a = random.randn(10, 100)
In [97]: b = random.randn(10, 100)
In [98]: %timeit inner1d(a,b)
100000 loops, best of 3: 2.89 us per loop
In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop
Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.
Played around with this and found inner1d the fastest. That function however is internal, so a more robust approach is to use
numpy.einsum("ij,ij->i", a, b)
Even better is to align your memory such that the summation happens in the first dimension, e.g.,
a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)
numpy.einsum("ij,ij->j", a, b)
For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.
Note also that the transposed summation is much faster than its untransposed equivalent.
The plot was created with perfplot (a small project of mine)
import numpy
from numpy.core.umath_tests import inner1d
import perfplot
def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    aT = numpy.ascontiguousarray(a.T)
    bT = numpy.ascontiguousarray(b.T)
    return (a, b), (aT, bT)
b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(1, 25)],
    kernels=[
        lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
        lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
        lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
        lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
        lambda data: inner1d(data[0][0], data[0][1]),
    ],
    labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
    xlabel="len(a), len(b)",
)
b.save("out1.png")
b.save("out2.png", relative_to=3)
You'll do better avoiding the append, but I can't think of a way to avoid the python loop. A custom Ufunc perhaps? I don't think numpy.vectorize will help you here.
import numpy as np

a = np.array([[1,2,3],[3,4,5]])
b = np.array([[1,2,3],[1,2,3]])
result = np.empty((2,))
for i in range(2):
    result[i] = np.dot(a[i], b[i])
print(result)
EDIT
Based on this answer, it looks like inner1d might work if the vectors in your real-world problem are 1D.
from numpy.core.umath_tests import inner1d
inner1d(a,b) # array([14, 26])
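Note that numpy.core.umath_tests is a private module and that import may fail on recent NumPy releases. If it does, an einsum-based stand-in (my own sketch, not a NumPy API) behaves the same for this use case:
import numpy as np

def inner1d(x, y):
    # inner product over the last axis, broadcasting over the rest
    return np.einsum('...i,...i->...', x, y)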
I came across this answer and re-verified the results with Numpy 1.14.3 running in Python 3.5. For the most part the answers above hold true on my system, although I found that for very large matrices (see example below), all but one of the methods are so close to one another that the performance difference is meaningless.
For smaller matrices, I found that einsum was the fastest by a considerable margin, up to a factor of two in some cases.
My large matrix example:
import numpy as np
from numpy.core.umath_tests import inner1d
a = np.random.randn(100, 1000000) # 800 MB each
b = np.random.randn(100, 1000000) # pretty big.
def loop_dot(a, b):
    result = np.empty((a.shape[0],))
    for i, (row1, row2) in enumerate(zip(a, b)):
        result[i] = np.dot(row1, row2)
    return result
%timeit inner1d(a, b)
# 128 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.einsum('ij,ij->i', a, b)
# 121 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(a*b, axis=1)
# 411 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop_dot(a, b) # note the function call took negligible time
# 123 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So einsum is still the fastest on very large matrices, but by a tiny amount. It appears to be a statistically significant (tiny) amount though!
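For a quick sanity check on whatever NumPy version you are running today, the readily available approaches can be compared directly; the shapes below are my own choice, not the ones benchmarked above:
import numpy as np

a = np.random.rand(100000, 50)
b = np.random.rand(100000, 50)

r1 = np.einsum('ij,ij->i', a, b)                      # einsum row-wise dot
r2 = (a * b).sum(axis=1)                              # elementwise product + sum
r3 = np.matmul(a[:, None, :], b[:, :, None]).ravel()  # batched (1x50) @ (50x1) products

assert np.allclose(r1, r2) and np.allclose(r1, r3)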