Apply calculation in pandas column with groupby

What could be wrong in the code below?
a) I need to group by the AREA column and apply a mathematical formula across other columns:
b) Also, if I have another column, say a date, that needs to be added to the groupby, how would it fit into the command below?
df3 = dataset.groupby('AREA')(['col1']+['col2']).sum()
The table is shown in an image in the original post (not reproduced here).

I think you can sum the columns before grouping for better performance:
dataset['new'] = dataset['col1']+dataset['col2']
df3 = dataset.groupby('AREA', as_index=False)['new'].sum()
But your approach is also possible with a lambda function:
df3 = (dataset.groupby('AREA')
              .apply(lambda x: (x['col1']+x['col2']).sum())
              .reset_index(name='SUM'))
Performance:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 100000
dataset = pd.DataFrame({'AREA': np.random.randint(1000, size=N),
                        'col1': np.random.randint(10, size=N),
                        'col2': np.random.randint(10, size=N)})
#print (dataset)
In [24]: %%timeit
    ...: dataset['new'] = dataset['col1']+dataset['col2']
    ...: df3 = dataset.groupby('AREA', as_index=False)['new'].sum()
    ...:
7.64 ms ± 50.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [25]: %%timeit
    ...: df3 = (dataset.groupby('AREA')
    ...:               .apply(lambda x: (x['col1']+x['col2']).sum())
    ...:               .reset_index(name='SUM'))
    ...:
368 ms ± 5.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
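For part (b), you can pass a list of column names to groupby. A minimal sketch, assuming the extra column is called DATE (adjust the name to your data):
dataset['new'] = dataset['col1'] + dataset['col2']
df3 = dataset.groupby(['AREA', 'DATE'], as_index=False)['new'].sum()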

Related

Dropping Duplicate Points

I have two GeoDataFrames (or GeoSeries), both consisting of thousands of points.
My requirement is to append (merge) both GeoDataFrames and drop duplicate points.
In other words, output = all points of gdf1 + the points of gdf2 that do not intersect points of gdf1.
I tried:
output = geopandas.overlay(gdf1, gdf2, how='symmetric_difference')
However, it is very slow.
Do you know a faster way of doing it?
Here is another way of combining dataframes using pandas, along with timings, versus geopandas:
import pandas as pd
import numpy as np
data1 = np.random.randint(-100, 100, size=10000)
data2 = np.random.randint(-100, 100, size=10000)
df1 = pd.concat([-pd.Series(data1, name="longitude"), pd.Series(data1, name="latitude")], axis=1)
df1['geometry'] = df1.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
df2 = pd.concat([-pd.Series(data2, name="longitude"), pd.Series(data2, name="latitude")], axis=1)
df2['geometry'] = df2.apply(lambda x: (x['latitude'], x['longitude']), axis=1)
df1 = df1.set_index(["longitude", "latitude"])
df2 = df2.set_index(["longitude", "latitude"])
%timeit pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
112 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This seems a lot faster than using geopandas:
import geopandas as gp
gdf1 = gp.GeoDataFrame(
    df1, geometry=gp.points_from_xy(df1.index.get_level_values("longitude"),
                                    df1.index.get_level_values("latitude")))
gdf2 = gp.GeoDataFrame(
    df2, geometry=gp.points_from_xy(df2.index.get_level_values("longitude"),
                                    df2.index.get_level_values("latitude")))
%timeit gp.overlay(gdf1, gdf2, how='symmetric_difference')
29 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But maybe you need some kind of optimisation, as mentioned here.
The expression filters each DataFrame down to the indexes that do not appear in the other and then concatenates the two. A small example:
df1 = pd.DataFrame([1,2,3,4],columns=['col1']).set_index("col1")
df2 = pd.DataFrame([3,4,5,6],columns=['col1']).set_index("col1")
pd.concat([df1[~df1.index.isin(df2.index)],df2[~df2.index.isin(df1.index)]])
col1
1
2
5
6
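Note that the concat above keeps only the points that appear in exactly one of the two frames (a symmetric difference). If you literally need all of gdf1 plus only the gdf2 points not already present in gdf1, a small variation of the same index trick should work (a sketch, assuming both frames are indexed by their coordinates as above):
output = pd.concat([gdf1, gdf2[~gdf2.index.isin(gdf1.index)]])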

Do I need the left table indexed for efficiency when left merging with another indexed DataFrame?

I want to do a left merge to join two pandas DataFrames:
merged_df = left_df.merge(right_df, how='left', left_on='id', right_index=True)
left_df is not indexed; it only has an id column, while right_df is indexed.
I have not indexed left_df since it changes continuously, but would the merge be faster if the left DataFrame were also indexed? In my case the merge is done very frequently, and so far the left DataFrame has up to 60k rows and the right up to 1,000.
I have not checked pandas' merge code, but since a left merge keeps all rows of the left DataFrame, I am not sure whether indexing it would speed up the merge.
Let's just test it with fake data:
import pandas as pd
import numpy as np
# df1: 60k rows, not indexed
df1 = pd.DataFrame(data={'a': np.random.randint(0, 100, 60_000),
                         'b': np.random.randint(0, 100, 60_000)})
# df2: 1k rows, indexed
df2 = pd.DataFrame(data={'c': np.random.randint(0, 100, 1000)},
                   index=np.random.randint(0, 100, 1000))
Join performance:
%timeit pd.merge(df1, df2, left_on='a', right_index=True, how='left')
55.4 ms ± 6.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.merge(df1.set_index('a'), df2, left_index=True, right_index=True, how='left')
49.8 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This already shows better performance when joining on the index. However, I am also setting the index inside the timed join, which only needs to be done once if you run multiple joins. Let's see the time split between the two operations:
%time df1.set_index('a', inplace=True)
Wall time: 936 µs
%timeit pd.merge(df1, df2, left_index=True, right_index=True, how='left')
48 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
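If you run many joins against the same right frame, it may also help to sort both indexes once, since pandas can take a faster alignment path when an index is monotonic (a hedged suggestion, not benchmarked here):
df1 = df1.sort_index()   # df1 already has 'a' as its index at this point
df2 = df2.sort_index()
merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how='left')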

Multiply array indices with numbers

Is there any short numpy command for the operation done in the for loops below?
import numpy as np
a= np.array([1.0,2.0,3.0,4.0,5.0,6.0])
b= np.array([10.0,20.0,30.0])
c= np.array([100.0,200.0,300.0,900.0])
y=np.linspace(0,2,50)
m=np.array([0.2,0.1,0.3])
A,C,B,Y = np.meshgrid(a,c,b,y,indexing="ij")
print(Y)
for i in range(0, len(a)):
    for j in range(0, len(c)):
        for k in range(0, len(b)):
            Y[i][j][k] = Y[i][j][k]*m[k]
print("--------")
print(Y)
Abstractly, I have $Y_{ijkl}$ and I want to multiply $Y_{ij0l}$ by $m_0$, $Y_{ij1l}$ by $m_1$, and so on.
Many thanks in advance!
To remove the loop, you just need einsum here.
np.einsum('ijkl,k->ijkl', Y, m)
Or just broadcasted multiplication:
Y * m[:, None]
However, if you don't want to create the meshgrid in the first place, you can broadcast y directly, which makes this more memory efficient:
np.einsum(
    "ijkl,k->ijkl",
    np.broadcast_to(y, a.shape + c.shape + b.shape + y.shape),
    m,
)
or:
np.broadcast_to(y, a.shape + c.shape + b.shape + y.shape) * m[:, None]
If you need A, C, B as well, you can continue with your current approach.
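As a quick sanity check (a small sketch using the arrays defined in the question), the broadcasted product matches the loop result:
A, C, B, Y = np.meshgrid(a, c, b, y, indexing="ij")
looped = Y.copy()
for i in range(len(a)):
    for j in range(len(c)):
        for k in range(len(b)):
            looped[i][j][k] = looped[i][j][k] * m[k]
vectorised = np.broadcast_to(y, a.shape + c.shape + b.shape + y.shape) * m[:, None]
print(np.allclose(looped, vectorised))   # True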
Performance
In [44]: %%timeit
    ...: np.einsum(
    ...:     "ijkl,k->ijkl",
    ...:     np.broadcast_to(y, (a.shape[0], c.shape[0], b.shape[0], y.shape[0])),
    ...:     m,
    ...: )
    ...:
21.1 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [45]: %%timeit
    ...: A, C, B, Y = np.meshgrid(a, c, b, y, indexing="ij")
    ...: for i in range(0, len(a)):
    ...:     for j in range(0, len(c)):
    ...:         for k in range(0, len(b)):
    ...:             Y[i][j][k] = Y[i][j][k]*m[k]
    ...:
420 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

How to raise a matrix to the power of elements in an array that is increasing in an ascending order?

Currently I have a C matrix generated by:
import numpy as np

def c_matrix(n):
    exp = np.exp(1j*np.pi/n)
    exp_n = np.array([[exp, 0], [0, exp.conj()]], dtype=complex)
    c_matrix = np.array([exp_n**i for i in range(1, n, 1)], dtype=complex)
    return c_matrix
What this does is basically generate a list of numbers from 1 to n-1 using a list comprehension, then return a list of the matrix exp_n raised element-wise to those ascending powers, i.e.
exp_n**[1, 2, ..., n-1] = [exp_n**1, exp_n**2, ..., exp_n**(n-1)]
So I was wondering if there's a more numpythonic way of doing it (in order to make use of NumPy's broadcasting ability), like:
exp_n**np.arange(1, n) = np.array([exp_n**1, exp_n**2, ..., exp_n**(n-1)])
You're speaking of a Vandermonde matrix. NumPy has numpy.vander:
def c_matrix_vander(n):
    exp = np.exp(1j*np.pi/n)
    exp_n = np.array([[exp, 0], [0, exp.conj()]], dtype=complex)
    return np.vander(exp_n.ravel(), n, increasing=True)[:, 1:].swapaxes(0, 1).reshape(n-1, 2, 2)
Performance
In [184]: %timeit c_matrix_vander(10_000)
849 µs ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [185]: %timeit c_matrix(10_000)
41.5 ms ± 549 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Validation
>>> np.isclose(c_matrix(10_000), c_matrix_vander(10_000)).all()
True
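If you prefer to stay closer to the exp_n**np.arange(1, n) idea from the question, plain broadcasting also works here, because the off-diagonal zeros stay zero under element-wise powers (a sketch, separate from the numpy.vander approach above):
def c_matrix_broadcast(n):
    exp = np.exp(1j*np.pi/n)
    exp_n = np.array([[exp, 0], [0, exp.conj()]], dtype=complex)
    powers = np.arange(1, n)                      # exponents 1 .. n-1
    # element-wise power with broadcasting -> shape (n-1, 2, 2)
    return exp_n[None, :, :] ** powers[:, None, None]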

What is maybe_convert_objects good for?

I'm profiling the timing of one of my functions and I see that I spend a lot of time on pandas DataFrame creation - I'm talking about 2.5 seconds to construct a DataFrame with 1000 columns and 10k rows:
import numpy as np
from pandas import DataFrame

def test(size):
    samples = []
    for r in range(10000):
        a, b = np.random.randint(1, 100, size=2)   # beta parameters must be > 0
        data = np.random.beta(a, b, size=size)
        samples.append(data)
    return DataFrame(samples, dtype=np.float64)
Running %prun -l 4 test(1000) shows that most of the time is spent in maybe_convert_objects (the profiler output was posted as an image that is not included here).
Is there any way I can avoid this check? It really does not seem necessary here. I tried to find out about this method and ways to bypass it, but didn't find anything online.
pandas must introspect each row because you are passing it a list of arrays. Here are some more efficient methods in this case.
In [27]: size=1000
In [28]: samples = []
    ...: for r in range(10000):
    ...:     data = np.random.beta(1, 1, size=size)
    ...:     samples.append(data)
    ...:
In [29]: np.asarray(samples).shape
Out[29]: (10000, 1000)
# original
In [30]: %timeit DataFrame(samples)
2.29 s ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# numpy is less flexible on the conversion, but in this case
# it is fine
In [31]: %timeit DataFrame(np.asarray(samples))
30.9 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# you should probably just do this
In [32]: samples = np.random.beta(1,1, size=(10000, 1000))
In [33]: %timeit DataFrame(samples)
74.4 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
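If you need the per-row random (a, b) parameters from the original test, you can still pre-allocate a single array and build the DataFrame once at the end; this is a sketch of the same idea, not a benchmarked drop-in replacement:
import numpy as np
from pandas import DataFrame

def test_fast(size, n_rows=10000):
    rows = np.empty((n_rows, size), dtype=np.float64)
    for r in range(n_rows):
        a, b = np.random.randint(1, 100, size=2)   # beta parameters must be > 0
        rows[r] = np.random.beta(a, b, size=size)
    # one contiguous ndarray -> no per-row type introspection in the constructor
    return DataFrame(rows)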