Sorting very large 1D arrays - numpy

I'm about to try out PyTables for the first time and I need to write my data to the HDF5 file one time step at a time. I'll have over 100,000 time steps. When I'm done, I would like to sort my 100,000+ x 6 array by column 2, i.e., everything is currently sorted by time but now I need the array in order of decreasing rain rate (col 2). I'm unsure how to even begin here. I know that having the entire array in memory is unwise. Any ideas how to do this fast and efficiently?
Appreciate any advice.

I know that having the entire array in memory is unwise.
You might be overthinking it. A 100K x 6 array of float64 takes just ~5MB of RAM. On my computer, sorting such an array takes about 27ms:
In [37]: a = np.random.rand(100000, 6)
In [38]: %timeit a[a[:,1].argsort()]
10 loops, best of 3: 27.2 ms per loop

Unless you have a very old computer, you should put the entire array in memory. Assuming single-precision floats, it will only take 100000*6*4./2**20 ≈ 2.29 MB, and twice that for doubles. You can use numpy's sort or argsort for sorting. For example, you can get the sorting indices from your second column:
import numpy as np
a = np.random.normal(0, 1, size=(100000,6))
idx = a[:, 1].argsort()
And then use these to index the columns you want, or the whole array:
b = a[idx]
You can even use different types of sort and check their speed:
In [33]: %timeit idx = a[:, 1].argsort(kind='quicksort')
100 loops, best of 3: 12.6 ms per loop
In [34]: %timeit idx = a[:, 1].argsort(kind='mergesort')
100 loops, best of 3: 14.4 ms per loop
In [35]: %timeit idx = a[:, 1].argsort(kind='heapsort')
10 loops, best of 3: 21.4 ms per loop
So you see that for an array of this size it doesn't really matter.
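Applied back to the original problem, here is a minimal sketch (assuming the full table is read from the HDF5 file into memory, and using column index 1 for the rain rate, as in the answer above) for sorting in decreasing order:
import numpy as np

# stand-in for the (100000+, 6) array read back from the PyTables/HDF5 file
data = np.random.rand(100000, 6)

# argsort is ascending, so reverse the index array for a decreasing sort
order = data[:, 1].argsort()[::-1]
data_sorted = data[order]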

Related

Pandas replace .apply(lambda x: with fast solution e.g. using numpy arrays

I am trying to speed up a ranking function that I use to process millions of rows with hundreds of factors. I have provided a sample MCVE below:
import random
import numpy as np
import pandas as pd

to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1': np.random.randn(1000), 'var_2': np.random.randn(1000), 'var_3': np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1, df.shape[0] + 1)).split(',')
My ranking code is as follows:
import pandas as pd
import numpy as np
import bottleneck as bn
%timeit ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category']).apply(lambda x: x[to_rank].apply(lambda x: bn.nanrankdata(x) * 100 / len(x) - 1))
10 loops, best of 3: 106 ms per loop
With my data, this takes about 30-40 seconds. I gather that .apply(lambda x: ...) has big overheads, including a loop, dtype detection, and shape analysis, and I am using it twice to loop over a multi-index, which is probably doubly inefficient. I have read that this can be vectorized using Series/numpy arrays (e.g. https://tomaugspurger.github.io/modern-4-performance.html), but I am struggling to implement it myself; indeed, most similar questions about applying a function over a multi-index seem to use .apply(lambda x: ...), so I suspect others could also benefit from speeding up their code.
You can define a function and use transform, although the time taken is not that much better (only about twice as fast):
def nanrankdata_len(x):
    return bn.nanrankdata(x) * 100 / len(x) - 1

%timeit ranked = df.groupby(['date_id', 'category']).transform(nanrankdata_len)
#-> 10 loops, best of 3: 55.5 ms per loop
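For reference, pandas' built-in GroupBy.rank can express the same percentile-style ranking without any Python-level lambda. A hedged sketch (equivalent up to NaN handling, since len(x) in the original counts NaNs while pct=True ranks only the non-NaN values):
# rank within each (date_id, category) group as a fraction, then rescale
ranked = df.groupby(['date_id', 'category'])[to_rank].rank(pct=True) * 100 - 1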

The fastest way to get filtered data checking substring value within ndarray

I have a big array of data:
>>> len(b)
6636849
>>> print(b)
[['60D19E9E-4E2C-11E2-AA9A-52540027E502' '100015361']
['60D19EB6-4E2C-11E2-AA9A-52540027E502' '100015385']
['60D19ECE-4E2C-11E2-AA9A-52540027E502' '100015409']
...,
['8CC90633-447E-11E6-B010-005056A76B49' '106636785']
['F8C74244-447E-11E6-B010-005056A76B49' '106636809']
['F8C7425C-447E-11E6-B010-005056A76B49' '106636833']]
I need to get the filtered dataset, i.e. everything whose second string contains (or starts with) '106'. Something like the following code, but with a substring operation instead of a math operation:
>>> len(b[b[:,1] > '10660600'])
30850
I don't think numpy is well suited for this type of operation. You can do it simply using basic python operations. Here it is with some sample data a:
import random  # for the test data

a = []
for i in range(10000):
    a.append(["".join(random.sample('abcdefg', 3)), "".join(random.sample('01234567890', 8))])

answer = [i for i in a if i[1].find('106') != -1]
Keep in mind that startswith is going to be a lot faster than find, because find has to look for matching substrings in all positions.
It's not too clear why you need to do this with such a large list/array in the first place, and there might be a better solution that avoids putting these values in the list to begin with.
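Since the question mentions "starting with", the startswith variant of the same list comprehension (my sketch) would be:
# keep only rows whose second field begins with '106'
answer = [i for i in a if i[1].startswith('106')]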
Here's a simple pandas solution
import pandas as pd
df = pd.DataFrame(b, columns=['1st String', '2nd String'])
df_filtered = df[df['2nd String'].str.contains('106')]
This gives you
In [29]: df_filtered
Out[29]:
1st String 2nd String
3 8CC90633-447E-11E6-B010-005056A76B49 106636785
4 F8C74244-447E-11E6-B010-005056A76B49 106636809
5 F8C7425C-447E-11E6-B010-005056A76B49 106636833
Update: Timing Results
Using Benjamin's list a as the test sample:
In [20]: %timeit [i for i in a if i[1].find('106') != -1]
100 loops, best of 3: 2.2 ms per loop
In [21]: %timeit df[df['2nd String'].str.contains('106')]
100 loops, best of 3: 5.94 ms per loop
So it looks like Benjamin's answer is actually about 3x faster. This surprises me since I was under the impression that the operation in pandas is vectorized; presumably the .str methods still loop over Python string objects under the hood. Moreover, the speed ratio does not change when a is 100 times longer.
Look at the functions in the np.char submodule:
data = [['60D19E9E-4E2C-11E2-AA9A-52540027E502', '100015361'],
['60D19EB6-4E2C-11E2-AA9A-52540027E502', '100015385'],
['60D19ECE-4E2C-11E2-AA9A-52540027E502', '100015409'],
['8CC90633-447E-11E6-B010-005056A76B49', '106636785'],
['F8C74244-447E-11E6-B010-005056A76B49', '106636809'],
['F8C7425C-447E-11E6-B010-005056A76B49', '106636833']]
data = np.array([r[1] for r in data], dtype=str)
idx = np.char.startswith(data, '106')
print(idx)
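Applied to the original two-column array b from the question (assuming it holds strings, as printed), the same kind of mask can select whole rows; a minimal sketch:
# boolean mask over the second column, then fancy-index the rows
mask = np.char.startswith(b[:, 1], '106')
filtered = b[mask]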

Dynamic axis indexing of Numpy ndarray

I want to obtain the 2D slice in a given direction of a 3D array where the direction (or the axis from where the slice is going to be extracted) is given by another variable.
Assuming idx is the index of the 2D slice in a 3D array, and direction is the axis from which that 2D slice is taken, the initial approach would be:
if direction == 0:
    return A[idx, :, :]
elif direction == 1:
    return A[:, idx, :]
else:
    return A[:, :, idx]
I'm pretty sure there must be a way of doing this without doing conditionals, or at least, not in raw python. Does numpy have a shortcut for this?
The best solution I've found so far (for doing it dynamically) relies on the transpose operation:
# for 3 dimensions [0, 1, 2] and direction == 1 --> [1, 0, 2]
tr = [direction] + list(range(A.ndim))
del tr[direction + 1]
return np.transpose(A, tr)[idx]
But I wonder if there is a better/easier/faster function for this, since for 3D the transpose code looks almost as awkward as the three if/elif branches. It generalizes better to N dimensions, and the larger N gets the better it looks by comparison, but for 3D it is much the same.
Transpose is cheap (timewise). There are numpy functions that use it to move the operational axis (or axes) to a known location - usually the front or end of the shape list. tensordot is one that comes to mind.
Other functions construct an indexing tuple. They may start with a list or array for ease of manipulation, and then turn it into a tuple for application. For example
I = [slice(None)]*A.ndim
I[axis] = idx
A[tuple(I)]
np.apply_along_axis does something like that. It's instructive to look at the code for functions like this.
I imagine the writers of the numpy functions worried most about whether it works robustly, secondly about speed, and lastly whether it looks pretty. You can bury all kinds of ugly code in a function!
tensordot ends with
at = a.transpose(newaxes_a).reshape(newshape_a)
bt = b.transpose(newaxes_b).reshape(newshape_b)
res = dot(at, bt)
return res.reshape(olda + oldb)
where the preceding code calculated newaxes_a/newaxes_b and newshape_a/newshape_b.
apply_along_axis constructs a (0...,:,0...) index tuple
i = zeros(nd, 'O')
i[axis] = slice(None, None)
i.put(indlist, ind)
....arr[tuple(i.tolist())]
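Wrapping the indexing-tuple idea into a small helper (the name dynamic_slice is mine) gives a generic N-D version; np.take(A, idx, axis=direction) does much the same thing, but returns a copy rather than a view:
import numpy as np

def dynamic_slice(A, idx, axis):
    # build a full slice on every axis, then pick a single index on the chosen one
    I = [slice(None)] * A.ndim
    I[axis] = idx
    return A[tuple(I)]

A = np.arange(7 * 8 * 9).reshape(7, 8, 9)
assert (dynamic_slice(A, 2, 1) == A[:, 2, :]).all()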
To index a dimension dynamically, you can use swapaxes, as shown below:
a = np.arange(7 * 8 * 9).reshape((7, 8, 9))
axis = 1
idx = 2
np.swapaxes(a, 0, axis)[idx]
Runtime comparison
Natural (non-dynamic) method:
%timeit a[:, idx, :]
300 ns ± 1.58 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
swapaxes:
%timeit np.swapaxes(a, 0, axis)[idx]
752 ns ± 4.54 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Index with a comprehension (wrapped in a tuple, since indexing with a plain list of slices is deprecated in recent NumPy versions):
%timeit a[tuple(idx if i == axis else slice(None) for i in range(a.ndim))]
This is Python, so you could simply use eval() like this:
def get_by_axis(a, idx, axis):
    indexing_list = a.ndim * [':']
    indexing_list[axis] = str(idx)
    expression = f"a[{', '.join(indexing_list)}]"
    return eval(expression)
Obviously, only do this if you do not accept input from untrusted users.

New column to pandas dataframe according to values from other column

I have a column in my data frame with string data. I need to create a new column of integers, one for each unique string. I will use this column as the second level of a multiindex. The code below does the trick, but I was wondering if there could be a more efficient solution in Pandas for it?
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3, 4],
                   'c2': ['a', 'a', 'b', 'b']})

for i, e in enumerate(df.c2.unique()):
    df.loc[df.c2 == e, 'c3'] = i
for i, e in enumerate(df.c2.unique()):
    df.loc[df.c2 == e, 'c3'] = i
can be replaced with
df['c3'] = pd.Categorical(df['c2']).codes
Even for this small DataFrame, using Categorical is (about 4x) quicker:
In [33]: %%timeit
    ...: for i, e in enumerate(df.c2.unique()):
    ...:     df.loc[df.c2 == e, 'c3'] = i
1000 loops, best of 3: 1.07 ms per loop

In [35]: %timeit pd.Categorical(df['c2']).codes
1000 loops, best of 3: 264 µs per loop
The improvement in speed will increase with the number of unique elements in df['c2'], since the Python for-loop's relative inefficiency becomes more apparent with more iterations.
For example, if
import string
import numpy as np
import pandas as pd
N = 10000
df = pd.DataFrame({'c1': np.arange(N),
                   'c2': np.random.choice(list(string.ascii_letters), size=N)})
then using Categorical is (about 56x) quicker:
In [53]: %%timeit
    ....: for i, e in enumerate(df.c2.unique()):
    ....:     df.loc[df.c2 == e, 'c3'] = i
10 loops, best of 3: 58.2 ms per loop
In [54]: %timeit df['c3'] = pd.Categorical(df['c2']).codes
1000 loops, best of 3: 1.04 ms per loop
The benchmarks above were done with IPython's %timeit "magic function".
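As an aside, pd.factorize produces similar integer codes and, unlike Categorical (whose categories are sorted), numbers them in order of first appearance, matching the original loop over df.c2.unique(); a small sketch:
# integer code per unique value of c2, in order of first appearance
df['c3'] = pd.factorize(df['c2'])[0]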

split data frame based on integer index

In pandas, how do I split a Series/DataFrame into two Series/DataFrames, with the odd rows in one and the even rows in the other? Right now I am using
rng = range(0, n, 2)
odd_rows = df.iloc[rng]
This is pretty slow.
Use slice:
In [11]: s = pd.Series([1, 2, 3, 4])

In [12]: s.iloc[::2]  # even
Out[12]:
0    1
2    3
dtype: int64

In [13]: s.iloc[1::2]  # odd
Out[13]:
1    2
3    4
dtype: int64
Here are some comparisons:
In [100]: df = pd.DataFrame(np.random.randn(100000, 10))
A simple method (though I think range makes this slow); it works regardless of the index (e.g. it does not have to be a numeric index):
In [96]: %timeit df.iloc[range(0,len(df),2)]
10 loops, best of 3: 21.2 ms per loop
The following require an Int64Index that is range based (which is easy to get, just reset_index()).
In [107]: %timeit df.iloc[(df.index % 2).astype(bool)]
100 loops, best of 3: 5.67 ms per loop
In [108]: %timeit df.loc[(df.index % 2).astype(bool)]
100 loops, best of 3: 5.48 ms per loop
Using take (make sure to give it index positions):
In [98]: %timeit df.take(df.index % 2)
100 loops, best of 3: 3.06 ms per loop
Same as above, but with no conversion of negative indices:
In [99]: %timeit df.take(df.index % 2,convert=False)
100 loops, best of 3: 2.44 ms per loop
The winner is @AndyHayden's solution; note it only works on a single dtype:
In [118]: %timeit DataFrame(df.values[::2],index=df.index[::2])
10000 loops, best of 3: 63.5 us per loop
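Putting it together, a minimal sketch of the slice-based approach applied directly to a DataFrame (no index array needed):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 10))

even_rows = df.iloc[::2]   # rows 0, 2, 4, ...
odd_rows = df.iloc[1::2]   # rows 1, 3, 5, ...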