Find rows where field A is a substring of field B - pandas

There are 10 million records with two columns, keyword and string.
I want to find the rows where the value in the keyword column appears within the string column:
test_df = pd.DataFrame({'keyword1': ['day', 'night', 'internet', 'day', 'night', 'internet'],
                        'string1': ['today is a good day', 'I like this', 'youtube',
                                    'sunday', 'what is this', 'internet']})
test_df
My first attempt was to use .apply, but it's slow:
test_df[test_df.apply(lambda x: True if x['keyword1'] in x['string1'] else False,axis=1)]
Because there are 10 million different strings but a much smaller number of keywords (on the order of 10 thousand), I'm thinking it may be more efficient to group by keyword:
test_df.groupby('keyword1',group_keys=False).apply(lambda x: x[x['string1'].str.contains(x.loc[x.index[0],'keyword1'])])
Supposedly, this approach runs only 10k iterations rather than 10M, but it is only slightly faster (about 10%). I'm not sure why: is the per-iteration overhead small, or does groupby add a cost of its own?
My question is: is there a better way to perform this job?

One idea is to create a mask with GroupBy.transform and compare against x.name; regex=False should also improve performance. But there are still a lot of groups here (10k), so groupby is the bottleneck:
mask = (test_df.groupby('keyword1')['string1']
               .transform(lambda x: x.str.contains(x.name, regex=False)))
df = test_df[mask]
print (df)
keyword1 string1
0 day today is a good day
3 day sunday
5 internet internet
Another idea is to use a list comprehension, though I'm not sure it is faster at 10M rows:
test_df[[x in y for x, y in test_df[['keyword1','string1']].to_numpy()]]
Some tests with sample data - but there are only a few groups here, so groupby is really fast:
#6k data
test_df = pd.concat([test_df] * 1000, ignore_index=True)
In [49]: %timeit test_df[test_df.groupby('keyword1', sort=False)['string1'].transform(lambda x :x.str.contains(x.name, regex=False))]
5.84 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [50]: %timeit test_df[[x in y for x, y in test_df[['keyword1','string1']].to_numpy()]]
9.46 ms ± 47.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [51]: %timeit test_df.groupby('keyword1',group_keys=False).apply(lambda x: x[x['string1'].str.contains(x.loc[x.index[0],'keyword1'])])
11.7 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit test_df[test_df.apply(lambda x: True if x['keyword1'] in x['string1'] else False,axis=1)]
138 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
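Another direction worth trying at 10M rows (not benchmarked above, just a sketch): compute the group positions once via GroupBy.indices, then run one vectorized str.contains per unique keyword against only the rows carrying it, filling a single boolean mask. The helper name is hypothetical:
import numpy as np
import pandas as pd

def filter_by_keyword(df):
    # one str.contains call per unique keyword (~10k calls),
    # each restricted to the rows that actually carry that keyword
    keep = np.zeros(len(df), dtype=bool)
    for kw, pos in df.groupby('keyword1').indices.items():
        keep[pos] = df['string1'].iloc[pos].str.contains(kw, regex=False).to_numpy()
    return df[keep]

print(filter_by_keyword(test_df))
Whether this beats the transform approach depends on how much of the cost is group bookkeeping versus the string matching itself.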

Related

What is the difference between specifying the rows vs slice in iloc?

What is the difference between specifying the rows vs slice in the following example:
df.iloc[[3, 4], 1]
vs
df.iloc[3:4, 1]
A slice a:b means consecutive positions, and as usual in Python the stop position is excluded, so df.iloc[3:4, 1] returns only row 3 (the slice equivalent of [[3, 4]] is 3:5). Specifying positions as a list, by contrast, allows indexing by an arbitrary sequence such as [3, 5, 4, 1].
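A small illustration of that difference on a toy frame with the default RangeIndex (just for demonstration):
import pandas as pd

df = pd.DataFrame({"X": range(10)})

df.iloc[3:5, 0]            # a slice: consecutive positions 3 and 4, stop excluded
df.iloc[[3, 5, 4, 1], 0]   # a list: arbitrary positions, in any order
df.iloc[[3, 3, 3], 0]      # lists may even repeat a position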
There is also a difference in performance: slicing works many times faster than passing a list of positions. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"X":np.random.random((1000,)),
})
%%timeit
df.iloc[range(100),:]
Out:
177 µs ± 5.1 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df.iloc[:100, :]
Out:
22.4 µs ± 828 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Efficiently filter DataFrame by looking for NumPy array match in row

Given
df = pd.DataFrame({'x': [np.array(['1', '2.3']), np.array(['30', '99'])]},
                  index=pd.date_range('2020-01-01', '2020-01-02', freq='D'))
I would like to filter for np.array(['1', '2.3']). I can do
df[df['x'].apply(lambda x: np.array_equal(x, np.array(['1', '2.3'])))]
but is this the fastest way to do it?
EDIT:
Let's assume that all the elements inside the numpy array are strings, even though it's not good practice!
The DataFrame can be up to 500k rows long, and each numpy array can contain up to 10 values.
You can rely on a list comprehension for performance:
df[np.array([np.array_equal(x, np.array(['1', '2.3'])) for x in df['x'].values])]
Performance via timeit (on my system, currently using 4 GB RAM):
%timeit -n 2000 df[np.array([np.array_equal(x, np.array(['1', '2.3'])) for x in df['x'].values])]
#output:
425 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
%timeit -n 2000 df[df['x'].apply(lambda x: np.array_equal(x, np.array(['1', '2.3'])))]
#output:
875 µs ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
My suggestion would be to do the following:
import numpy as np
mat = np.stack([np.array(["a","b","c"]),np.array(["d","e","f"])])
In reality this would be the actual data from the column of your DataFrame, stacked into a single numpy array.
Then do:
matching_rows = (np.array(["a","b","c"]) == mat).all(axis=1)
This gives you an array of booleans indicating where the matches are located.
So you can then filter your rows like this:
df[matching_rows]
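Tying that back to the question's frame, a sketch (assuming every array in 'x' has the same length, so they can be stacked into one 2-D matrix):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [np.array(['1', '2.3']), np.array(['30', '99'])]},
                  index=pd.date_range('2020-01-01', '2020-01-02', freq='D'))

target = np.array(['1', '2.3'])
mat = np.stack(df['x'].to_list())            # shape (n_rows, array_length)
matching_rows = (mat == target).all(axis=1)  # boolean mask per row
print(df[matching_rows])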

Difference of complexity in ordering and sorting?

I'm trying to understand the complexity of numpy array indexing here.
Given a 1-D numpy array A and b = numpy.argsort(A),
what is the difference in time complexity between np.sort(A) and A[b]?
np.sort(A) would be O(n log n), while A[b] should be O(n)?
Under the hood, argsort performs a sort, which again has complexity O(n log n).
You can actually specify the algorithm, as described here.
To conclude: while A[b] on its own is linear, you cannot use this to beat the general complexity of sorting, because you still have to determine b (by sorting).
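A quick sanity check that the two routes produce the same result, so the question is only about cost, not correctness:
import numpy as np

A = np.random.random(1000)
b = np.argsort(A)
assert np.array_equal(np.sort(A), A[b])   # identical output, different cost profile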
Do a simple timing:
In [233]: x = np.random.random(100000)
In [234]: timeit np.sort(x)
6.79 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [235]: timeit x[np.argsort(x)]
8.42 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [236]: %%timeit b = np.argsort(x)
...: x[b]
...:
235 µs ± 694 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [237]: timeit np.argsort(x)
8.08 ms ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Timing only one size doesn't give O complexity, but it reveals the relative significance of the different steps.
If you don't need the argsort, then sort directly. If you already have b use it rather than sorting again.
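Reusing b is exactly where argsort pays off, for example when several arrays need to be reordered by the same key; a small sketch:
import numpy as np

keys = np.random.random(1000)
values = np.random.random(1000)

b = np.argsort(keys)         # pay the O(n log n) sort once
keys_sorted = keys[b]        # O(n) gather
values_sorted = values[b]    # reuse the same permutation, also O(n)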
Here is a visual comparison to see it better:
#sort
def m1(A, b):
    return np.sort(A)

#compute argsort and then index
def m2(A, b):
    return A[np.argsort(A)]

#index with precomputed argsort
def m3(A, b):
    return A[b]

A = [np.random.rand(n) for n in [10, 100, 1000, 10000]]
Runtime on a log-log scale:

My Jupyter notebook takes many minutes before giving any output (it keeps running) when I run this particular piece of code

for j in range(len(datelist)):
    tempmax.append(df.where(df['Date']==datelist[j])['Data_Value'].max())
    tempmin.append(df.where(df['Date']==datelist[j])['Data_Value'].min())
print(tempmax)
When I run this piece of code, my Jupyter notebook keeps running for around 10 minutes before producing any output.
First of all, you can speed things up by skipping DataFrame.where():
Speed comparison:
df = pd.DataFrame()
df['a'] = range(16000)
df['b'] = range(16000)
%timeit df.where(df['a']==2)['b'].max()
>>> 6.31 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df[df['a']==2]['b'].max()
>>> 777 µs ± 8.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Your new code should run roughly 10x as fast:
for j in range(len(datelist)):
    tempmax.append(df[df['Date']==datelist[j]]['Data_Value'].max())
    tempmin.append(df[df['Date']==datelist[j]]['Data_Value'].min())
It's also possible that using DataFrame.agg() will speed things up even more:
for j in range(len(datelist)):
    search = df[df['Date']==datelist[j]].agg(['max','min'])['Data_Value']
    tempmax.append(search['max'])
    tempmin.append(search['min'])
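An alternative not shown above: since the loop computes one max and one min per date, a single groupby pass over the whole frame avoids the Python-level loop over datelist entirely. A sketch, assuming df has 'Date' and 'Data_Value' columns and datelist holds values of 'Date':
# one pass over the whole frame instead of one scan per date
stats = df.groupby('Date')['Data_Value'].agg(['min', 'max'])
stats = stats.reindex(datelist)   # keep the original order of datelist
tempmin = stats['min'].tolist()
tempmax = stats['max'].tolist()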

What is maybe_convert_objects good for?

I'm profiling the timing of one of my functions and I see that a lot of time is spent on pandas DataFrame creation - about 2.5 seconds to construct a DataFrame with 1000 columns and 10k rows:
def test(size):
    samples = []
    for r in range(10000):
        a, b = np.random.randint(1, 100, size=2)
        data = np.random.beta(a, b, size=size)
        samples.append(data)
    return DataFrame(samples, dtype=np.float64)
Running %prun -l 4 test(1000) shows that most of the time is spent in maybe_convert_objects (profiler output not shown).
Is there any way I can avoid this check? It really doesn't seem necessary here. I tried to find out what this method does and how to bypass it, but couldn't find anything online.
pandas must introspect each row because you are passing it a list of arrays. Here are some more efficient methods in this case.
In [27]: size=1000
In [28]: samples = []
    ...: for r in range(10000):
    ...:     data = np.random.beta(1, 1, size=size)
    ...:     samples.append(data)
    ...:
In [29]: np.asarray(samples).shape
Out[29]: (10000, 1000)
# original
In [30]: %timeit DataFrame(samples)
2.29 s ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# numpy is less flexible on the conversion, but in this case
# it is fine
In [31]: %timeit DataFrame(np.asarray(samples))
30.9 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# you should probably just do this
In [32]: samples = np.random.beta(1,1, size=(10000, 1000))
In [33]: %timeit DataFrame(samples)
74.4 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
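Folded back into the shape of the original test function, a sketch that keeps per-row (a, b) parameters but fills a preallocated 2-D float array, so the DataFrame constructor never sees a list of 10k separate objects (the 1..99 range for a and b is an assumption here, since np.random.beta requires strictly positive parameters):
import numpy as np
from pandas import DataFrame

def test(size):
    samples = np.empty((10000, size), dtype=np.float64)   # single contiguous block
    for r in range(10000):
        a, b = np.random.randint(1, 100, size=2)           # beta needs a, b > 0
        samples[r] = np.random.beta(a, b, size=size)
    return DataFrame(samples)                               # no per-row introspection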