What is the difference between specifying the rows vs slice in iloc? - pandas

What is the difference between specifying the rows as a list vs. a slice in the following examples:
df.iloc[[3, 4], 1]
vs
df.iloc[3:4, 1]

A slice a:b selects consecutive positions and, as usual in Python, excludes the end point, so df.iloc[3:4, 1] returns only position 3, while df.iloc[[3, 4], 1] returns positions 3 and 4. Specifying positions as a list also allows indexing by an arbitrary sequence such as [3, 5, 4, 1].
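A minimal sketch illustrating both points (the DataFrame here is a stand-in, not the one from the question):
import pandas as pd
import numpy as np

df = pd.DataFrame({"X": np.random.random(10)})

# slice: consecutive positions, end excluded, so rows 3 and 4 need 3:5
print(df.iloc[3:5, 0])

# list: arbitrary order and repeats are allowed
print(df.iloc[[4, 3, 3], 0])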

There is also a performance difference: slicing is many times faster than indexing with a sequence of positions. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"X":np.random.random((1000,)),
})
%%timeit
df.iloc[range(100),:]
Out:
177 µs ± 5.1 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df.iloc[:100, :]
Out:
22.4 µs ± 828 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Related

Custom Vectorization in numpy for string arrays

I am trying to apply vectorization with a custom function on numpy string arrays.
Example:
import numpy
test_array = numpy.char.array(["sample1-sample","sample2-sample"])
numpy.char.array(test_array.split('-'))[:,0]
Output:
chararray([b'sample1', b'sample2'], dtype='|S7')
But these are built-in functions. Is there any other way to achieve vectorization with custom functions, for example with the following one?
def custom(text):
    return text[0]
numpy doesn't implement fast string methods (as it does for numeric dtypes). So the np.char code is more for convenience than performance.
In [124]: alist=["sample1-sample","sample2-sample"]
In [125]: arr = np.array(alist)
In [126]: carr = np.char.array(alist)
A straightforward list comprehension versus your code:
In [127]: [item.split('-')[0] for item in alist]
Out[127]: ['sample1', 'sample2']
In [128]: np.char.array(carr.split('-'))[:,0]
Out[128]: chararray([b'sample1', b'sample2'], dtype='|S7')
In [129]: timeit [item.split('-')[0] for item in alist]
664 ns ± 32.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [130]: timeit np.char.array(carr.split('-'))[:,0]
20.5 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For the simple task of clipping the strings, there is a fast numpy way - using a shorter dtype:
In [131]: [item[0] for item in alist]
Out[131]: ['s', 's']
In [132]: carr.astype('S1')
Out[132]: chararray([b's', b's'], dtype='|S1')
But assuming that's just an example, not your real-world custom action, I suggest using lists.
The np.char documentation recommends using the np.char functions with an ordinary array instead of np.char.array. The functionality is basically the same. Using the arr created above:
In [140]: timeit np.array(np.char.split(arr, '-').tolist())[:,0]
13.8 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
np.char functions often produce string dtype arrays, but split creates an object dtype array of lists:
In [141]: np.char.split(arr, '-')
Out[141]:
array([list(['sample1', 'sample']), list(['sample2', 'sample'])],
dtype=object)
Object dtype arrays are essentially lists.
In [145]: timeit [item[0] for item in np.char.split(arr, '-').tolist()]
9.08 µs ± 27.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Your code is relatively slow because it takes time to convert this array of lists into a new chararray.
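For completeness, a small sketch (not from the answer above) of wrapping the question's custom function with np.vectorize; np.vectorize is essentially a Python-level loop, so it is a convenience rather than a speedup over the list comprehension:
import numpy as np

def custom(text):
    return text[0]

alist = ["sample1-sample", "sample2-sample"]
arr = np.array(alist)

vec_custom = np.vectorize(custom)
print(vec_custom(arr))                    # array(['s', 's'], dtype='<U1')
print([custom(item) for item in alist])   # plain list comprehension, typically at least as fast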

Efficiently filter DataFrame by looking for NumPy array match in row

Given
df = pd.DataFrame({'x': [np.array(['1', '2.3']), np.array(['30', '99'])]},
                  index=pd.date_range('2020-01-01', '2020-01-02', freq='D'))
I would like to filter for np.array(['1', '2.3']). I can do
df[df['x'].apply(lambda x: np.array_equal(x, np.array(['1', '2.3'])))]
but is this the fastest way to do it?
EDIT:
Let's assume that all the elements inside the numpy array are strings, even though it's not good practice!
The DataFrame length can go up to 500k rows, and the number of values in each numpy array can go up to 10.
You can rely on a list comprehension for performance:
df[np.array([np.array_equal(x, np.array(['1', '2.3'])) for x in df['x'].values])]
Performance via timeit (on my system, currently with 4 GB RAM):
%timeit -n 2000 df[np.array([np.array_equal(x, np.array(['1', '2.3'])) for x in df['x'].values])]
#output:
425 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
%timeit -n 2000 df[df['x'].apply(lambda x: np.array_equal(x, np.array(['1', '2.3'])))]
#output:
875 µs ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
My suggestion would be to do the following:
import numpy as np
mat = np.stack([np.array(["a","b","c"]),np.array(["d","e","f"])])
In reality this would be the actual data from the column of your DataFrame; make sure it is stacked into a single 2-D numpy array.
Then do:
matching_rows = (np.array(["a","b","c"]) == mat).all(axis=1)
This gives an array of booleans indicating where the matches are located.
So you can then filter your rows like this:
df[matching_rows]
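Putting that suggestion together with the DataFrame from the question, a minimal end-to-end sketch; it assumes every array in 'x' has the same length, which np.stack requires:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [np.array(['1', '2.3']), np.array(['30', '99'])]},
                  index=pd.date_range('2020-01-01', '2020-01-02', freq='D'))

target = np.array(['1', '2.3'])
mat = np.stack(df['x'].to_numpy())           # shape (n_rows, array_length)
matching_rows = (mat == target).all(axis=1)  # boolean mask per row
print(df[matching_rows])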

Find rows where field A is a substring of field B

There are 10 million records with two columns, keyword and string.
I want to find rows where the keyword column appears in its string column:
test_df = pd.DataFrame({'keyword1': ['day', 'night', 'internet', 'day', 'night', 'internet'],
                        'string1': ['today is a good day', 'I like this', 'youtube', 'sunday', 'what is this', 'internet']})
test_df
My first attempt was to use .apply, but it's slow.
test_df[test_df.apply(lambda x: True if x['keyword1'] in x['string1'] else False,axis=1)]
Because there are 10 million different strings but a much smaller number of keywords (on the order of 10 thousand), I'm thinking it might be more efficient to group by keyword.
test_df.groupby('keyword1',group_keys=False).apply(lambda x: x[x['string1'].str.contains(x.loc[x.index[0],'keyword1'])])
Supposedly, this approach only needs about 10k iterations rather than 10M, but it is only slightly faster (about 10%). I'm not sure why: is the per-iteration overhead small, or does groupby add its own cost?
My question is: is there a better way to perform this job?
One idea is to create a mask with GroupBy.transform, comparing against x.name inside the lambda; regex=False should also improve performance. But there are still a lot of groups here (10k), so groupby is the bottleneck:
mask = (test_df.groupby('keyword1')['string1']
               .transform(lambda x: x.str.contains(x.name, regex=False)))
df = test_df[mask]
print (df)
keyword1 string1
0 day today is a good day
3 day sunday
5 internet internet
Another idea is to use a list comprehension, though I'm not sure it will be faster with 10M rows:
test_df[[x in y for x, y in test_df[['keyword1','string1']].to_numpy()]]
Some tests with sample data; note there are only a few distinct groups here, so groupby is really fast:
#6k rows
test_df = pd.concat([test_df] * 1000, ignore_index=True)
In [49]: %timeit test_df[test_df.groupby('keyword1', sort=False)['string1'].transform(lambda x :x.str.contains(x.name, regex=False))]
5.84 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [50]: %timeit test_df[[x in y for x, y in test_df[['keyword1','string1']].to_numpy()]]
9.46 ms ± 47.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [51]: %timeit test_df.groupby('keyword1',group_keys=False).apply(lambda x: x[x['string1'].str.contains(x.loc[x.index[0],'keyword1'])])
11.7 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit test_df[test_df.apply(lambda x: True if x['keyword1'] in x['string1'] else False,axis=1)]
138 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
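As a variant (not timed above), the list comprehension can also iterate over the two columns directly with zip, which avoids building the intermediate 2-D object array from to_numpy():
# same result as the to_numpy() version above
test_df[[k in s for k, s in zip(test_df['keyword1'], test_df['string1'])]]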

Difference in complexity between ordering and sorting?

I'm trying to understand the complexity of numpy array indexing here.
Given a 1-D numpy array A and b = numpy.argsort(A),
what's the difference in time complexity between np.sort(A) and A[b]?
For np.sort(A) it would be O(n log n), while A[b] should be O(n)?
Under the hood, argsort does a sort, which again has complexity O(n log n).
You can actually choose the sorting algorithm via the kind argument of np.sort/np.argsort (see the numpy documentation).
To conclude: while A[b] by itself is linear, you cannot use it to beat the general complexity of sorting, because you still have to determine b (by sorting).
Do a simple timing:
In [233]: x = np.random.random(100000)
In [234]: timeit np.sort(x)
6.79 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [235]: timeit x[np.argsort(x)]
8.42 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [236]: %%timeit b = np.argsort(x)
...: x[b]
...:
235 µs ± 694 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [237]: timeit np.argsort(x)
8.08 ms ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Timing only one size doesn't give O complexity, but it reveals the relative significance of the different steps.
If you don't need the argsort, then sort directly. If you already have b use it rather than sorting again.
Here is a visual comparison to see it better:
# sort directly
def m1(A, b):
    return np.sort(A)

# compute argsort and then index
def m2(A, b):
    return A[np.argsort(A)]

# index with a precomputed argsort
def m3(A, b):
    return A[b]

A = [np.random.rand(n) for n in [10, 100, 1000, 10000]]
Runtime on a log-log scale (plot not reproduced here).
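Since the plot itself isn't included, here is a minimal sketch for reproducing the comparison directly with timeit (same functions and array sizes as in the snippet above; plotting is left out):
import numpy as np
from timeit import timeit

for n in [10, 100, 1000, 10000]:
    A = np.random.rand(n)
    b = np.argsort(A)
    # time each method with the same inputs
    t1 = timeit(lambda: m1(A, b), number=1000)
    t2 = timeit(lambda: m2(A, b), number=1000)
    t3 = timeit(lambda: m3(A, b), number=1000)
    print(n, t1, t2, t3)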

What is maybe_convert_objects good for?

I'm profiling the timing of one of my functions and I see that a lot of the time is spent on pandas DataFrame creation - I'm talking about 2.5 seconds to construct a DataFrame with 1000 columns and 10k rows:
import numpy as np
from pandas import DataFrame

def test(size):
    samples = []
    for r in range(10000):
        # note: beta requires a, b > 0, so draw from [1, 100)
        a, b = np.random.randint(1, 100, size=2)
        data = np.random.beta(a, b, size=size)
        samples.append(data)
    return DataFrame(samples, dtype=np.float64)
Running %prun -l 4 test(1000) shows that most of the time goes into maybe_convert_objects (profiler output not reproduced here).
Is there any way I can avoid this check? It really doesn't seem necessary here. I tried to find out about this method and ways to bypass it, but didn't find anything online.
maybe_convert_objects is the routine pandas uses to infer a suitable dtype when it is handed Python objects. pandas must introspect each row because you are passing it a list of arrays. Here are some more efficient methods for this case.
In [27]: size=1000
In [28]: samples = []
    ...: for r in range(10000):
    ...:     data = np.random.beta(1, 1, size=size)
    ...:     samples.append(data)
    ...:
In [29]: np.asarray(samples).shape
Out[29]: (10000, 1000)
# original
In [30]: %timeit DataFrame(samples)
2.29 s ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# numpy is less flexible on the conversion, but in this case
# it is fine
In [31]: %timeit DataFrame(np.asarray(samples))
30.9 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# you should probably just do this
In [32]: samples = np.random.beta(1,1, size=(10000, 1000))
In [33]: %timeit DataFrame(samples)
74.4 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
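Tying the answer back to the original function, a sketch (an assumption, not from the answer above) of a rewritten test() that keeps a per-row (a, b) pair but draws everything in one vectorized call; it relies on numpy broadcasting the parameter arrays against size, and uses randint(1, 100) because beta requires a, b > 0:
import numpy as np
import pandas as pd

def test_fast(size, n_rows=10000):
    # one (a, b) pair per row, shaped as column vectors so they broadcast over `size`
    a = np.random.randint(1, 100, size=(n_rows, 1))
    b = np.random.randint(1, 100, size=(n_rows, 1))
    samples = np.random.beta(a, b, size=(n_rows, size))  # single 2-D draw
    return pd.DataFrame(samples, dtype=np.float64)

df = test_fast(1000)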