How to bin a numerical pandas Series into n groups of approximately the same size without qcut? - pandas

I would like to split my series into exactly n groups (assuming there are at least n distinct values in the series), where the group sizes are approximately equal.
The code needs to be generic, so I cannot know the distribution of the data in advance, hence using pd.cut with pre-defined bins is not an option for me.
I tried pd.qcut, and pd.cut with pd.Series.quantile, but both fall short when some value is repeated very often in the series.
For instance, if I want exactly 3 groups:
series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 4, 4, 4, 4])
pd.qcut(series, q=3, duplicates="drop")
creates only 2 categories: Categories (2, interval[float64]): [(0.999, 3.0] < (3.0, 4.0]], whereas I would like to get something like [(0.999, 1.0] < (1.0, 3.0] < (3.0, 4.0]].
Is there any way to do this easily with pandas' built-in methods?
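One workaround (not a single built-in call) is to rank the values first, so ties become distinct, and then apply pd.qcut to the ranks. A minimal sketch, with the caveat that runs of identical values may be split across groups, so you get exact group sizes rather than the clean value intervals shown above:
import pandas as pd

series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 4, 4, 4, 4])

# rank(method='first') assigns distinct ranks 1..n even to tied values,
# so qcut can always cut them into exactly 3 near-equal groups.
groups = pd.qcut(series.rank(method='first'), q=3, labels=False)
print(series.groupby(groups).size())  # group sizes: 5, 4, 5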

Related

How to plot outliers with regard to unique ids

I have item_code column in my data and another column, sales, which represents sales quantity for the particular item.
The data can have a particular item id many times; other columns tell these entries apart.
I want to plot only the outlier sales for each item (because data has thousands of different item ids, plotting every entry can be difficult).
Since I'm very new to this, what is the right way and tool to do this?
You can use pandas. You should choose a method to detect outliers, but here is an example:
If you want to get outliers across all sales (not per group), you can use apply with a function (for example, a lambda) to flag the outlier rows.
import numpy as np
import pandas as pd
%matplotlib inline

df = pd.DataFrame({'item_id': [1, 1, 2, 1, 2, 1, 2],
                   'sales': [0, 2, 30, 3, 30, 30, 55]})

# Keep only rows whose sales lie more than one standard deviation from the mean
df[df.apply(lambda x: np.abs(x.sales - df.sales.mean()) / df.sales.std() > 1, axis=1)
   ].set_index('item_id').plot(style='.', color='red')
In this example we generated a data sample and searched for points whose distance from the mean exceeds one standard deviation (you can try another method). We then plot them, where y is the sales quantity and x is the item id. This method detects the points 0 and 55. If you want to search for outliers within groups, you can group the data first.
df.groupby('item_id').apply(lambda data: data.loc[
    data.apply(lambda x: np.abs(x.sales - data.sales.mean()) / data.sales.std() > 1, axis=1)
]).set_index('item_id').plot(style='.', color='red')
In this example we get the points 30 and 55, because 0 isn't an outlier within the group where item_id = 1, but 30 is.
Is this what you want to do? I hope it helps you get started.

Baffled by numpy's transpose

Let's take a very simple case: an array with shape (2,3,4), ignoring the values.
>>> a.shape
(2, 3, 4)
When we transpose it and print the dimensions:
>>> a.transpose([1,2,0]).shape
(3, 4, 2)
So I'm saying: take axis index 2 and make it the first, then take axis index 0 and make it the second and finally take axis index 1 and make it the third. I should get (4,2,3), right?
Well, I thought perhaps I don't understand the logic fully. So I read the documentation, and it says:
Use transpose(a, argsort(axes)) to invert the transposition of tensors
when using the axes keyword argument.
So I tried
>>> c = np.transpose(a, [1,2,0])
>>> c.shape
(3, 4, 2)
>>> np.transpose(a, np.argsort([1,2,0])).shape
(4, 2, 3)
and got yet another, completely different shape!
Could someone please explain this? Thanks.
In [259]: a = np.zeros((2,3,4))
In [260]: idx = [1,2,0]
In [261]: a.transpose(idx).shape
Out[261]: (3, 4, 2)
What this has done is take the a.shape[1] dimension and put it first; a.shape[2] comes second, and a.shape[0] third:
In [262]: np.array(a.shape)[idx]
Out[262]: array([3, 4, 2])
transpose without an argument completely reverses the axis order. It's an extension of the familiar 2d transpose (rows become columns, columns become rows):
In [263]: a.transpose().shape
Out[263]: (4, 3, 2)
In [264]: a.transpose(2,1,0).shape
Out[264]: (4, 3, 2)
And the do-nothing transpose:
In [265]: a.transpose(0,1,2).shape
Out[265]: (2, 3, 4)
You have an initial axes order and a final one; describing the swap can be hard to visualize if you don't regularly work with lists of size 3 or larger.
Some people find it easier to use swapaxes, which interchanges just two axes. rollaxis is yet another way.
I prefer to use transpose since it can do anything the others can, so I only have to develop an intuition for one tool.
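For comparison, a quick sketch of those alternatives on the same array (moveaxis is the modern replacement for rollaxis):
import numpy as np

a = np.zeros((2, 3, 4))
print(a.swapaxes(0, 2).shape)       # (4, 3, 2), same as a.transpose()
print(np.moveaxis(a, 0, -1).shape)  # (3, 4, 2), same as a.transpose(1, 2, 0)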
The argsort comment operates this way:
In [278]: a.transpose(idx).transpose(np.argsort(idx)).shape
Out[278]: (2, 3, 4)
That is, apply it to the result of one transpose to get back the original order.
np.argsort([1,2,0]) returns the array [2, 0, 1].
So
np.transpose(a, np.argsort([1,2,0])).shape
acts like
np.transpose(a, [2,0,1]).shape
not
np.transpose(a, [1,2,0]).shape
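As a quick self-contained check of the inversion identity:
import numpy as np

a = np.zeros((2, 3, 4))
idx = [1, 2, 0]
inv = np.argsort(idx)                          # array([2, 0, 1])
print(a.transpose(idx).shape)                  # (3, 4, 2)
print(a.transpose(idx).transpose(inv).shape)   # (2, 3, 4), the original shape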

Get indices for values of one array in another array

I have two 1D-arrays containing the same set of values, but in a different (random) order. I want to find the list of indices, which reorders one array according to the other one. For example, my 2 arrays are:
ref = numpy.array([5,3,1,2,3,4])
new = numpy.array([3,2,4,5,3,1])
and I want the list order for which new[order] == ref.
My current idea is:
def find(val):
    return numpy.argmin(numpy.absolute(ref - val))

order = sorted(range(new.size), key=lambda x: find(new[x]))
However, this only works as long as no values are repeated. In my example, 3 appears twice, and I get new[order] = [5 3 3 1 2 4]. The second 3 is placed directly after the first one, because my function find() does not track which 3 I am currently looking for.
So I could add something to deal with this, but I have a feeling there might be a better solution out there. Maybe in some library (NumPy or SciPy)?
Edit about the duplicate: The linked solution assumes that the arrays are ordered, or, for the "unordered" solution, returns duplicate indices. I need each index to appear only once in order. Which one comes first, however, is not important (nor can it be determined from the data provided).
What I get with sort_idx = A.argsort(); order = sort_idx[np.searchsorted(A,B,sorter = sort_idx)] is: [3, 0, 5, 1, 0, 2]. But what I am looking for is [3, 0, 5, 1, 4, 2].
Given ref, new which are shuffled versions of each other, we can get the unique indices that map ref to new using the sorted version of both arrays and the invertibility of np.argsort.
Start with:
i = np.argsort(ref)
j = np.argsort(new)
Now ref[i] and new[j] both give the sorted version of the arrays, which is the same for both. You can invert the first sort by doing:
k = np.argsort(i)
Now ref is just new[j][k], or new[j[k]]. Since all the operations are shuffles using unique indices, the final index j[k] is unique as well. j[k] can be computed in one step with
order = np.argsort(new)[np.argsort(np.argsort(ref))]
From your original example:
>>> ref = np.array([5, 3, 1, 2, 3, 4])
>>> new = np.array([3, 2, 4, 5, 3, 1])
>>> order = np.argsort(new)[np.argsort(np.argsort(ref))]
>>> order
array([3, 0, 5, 1, 4, 2])
>>> new[order] # Should give ref
array([5, 3, 1, 2, 3, 4])
This is probably not any faster than the more general solutions to the similar question on SO, but it does guarantee unique indices as you requested. A further optimization would be to replace np.argsort(i) with something like the argsort_unique function in this answer. I would go one step further and just compute the inverse of the sort:
def inverse_argsort(a):
    fwd = np.argsort(a)
    inv = np.empty_like(fwd)
    inv[fwd] = np.arange(fwd.size)
    return inv
order = np.argsort(new)[inverse_argsort(ref)]
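A quick check, reusing the inverse_argsort helper above, that the result maps new onto ref with each index used exactly once:
import numpy as np

ref = np.array([5, 3, 1, 2, 3, 4])
new = np.array([3, 2, 4, 5, 3, 1])

order = np.argsort(new)[inverse_argsort(ref)]
print(order)                                # [3 0 5 1 4 2]
print(np.array_equal(new[order], ref))      # True
print(len(np.unique(order)) == order.size)  # True: no duplicate indices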

Numpy rebinning a 2D array

I am looking for a fast way to do numerical binning of a 2D numpy array. By binning I mean calculating submatrix averages or cumulative values. For example, x = numpy.arange(16).reshape(4, 4) would be split into 4 submatrices of 2x2 each, giving numpy.array([[2.5, 4.5], [10.5, 12.5]]), where 2.5 = numpy.average([0, 1, 4, 5]), etc.
How can I perform such an operation in an efficient way? I don't really have any idea how to do this...
Many thanks...
You can use a higher dimensional view of your array and take the average along the extra dimensions:
In [12]: a = np.arange(36).reshape(6, 6)
In [13]: a
Out[13]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
In [14]: a_view = a.reshape(3, 2, 3, 2)
In [15]: a_view.mean(axis=3).mean(axis=1)
Out[15]:
array([[ 3.5, 5.5, 7.5],
[ 15.5, 17.5, 19.5],
[ 27.5, 29.5, 31.5]])
In general, if you want bins of shape (a, b) for an array of shape (rows, cols), you should reshape it with .reshape(rows // a, a, cols // b, b). Note also that the order of the .mean calls is important: a_view.mean(axis=1).mean(axis=3) will raise an error, because a_view.mean(axis=1) has only three dimensions. a_view.mean(axis=1).mean(axis=2) will work fine, but it makes it harder to understand what is going on.
As is, the above code only works if you can fit an integer number of bins inside your array, i.e. if a divides rows and b divides cols. There are ways to deal with other cases, but you will have to define the behavior you want then.
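Wrapping the reshape trick into a small helper (rebin_mean is a hypothetical name; it assumes the bin shape divides the array shape exactly, per the caveat above):
import numpy as np

def rebin_mean(a, bin_rows, bin_cols):
    # Assumes bin_rows divides a.shape[0] and bin_cols divides a.shape[1].
    rows, cols = a.shape
    return (a.reshape(rows // bin_rows, bin_rows, cols // bin_cols, bin_cols)
             .mean(axis=3)
             .mean(axis=1))

x = np.arange(16).reshape(4, 4)
print(rebin_mean(x, 2, 2))  # [[ 2.5  4.5]
                            #  [10.5 12.5]]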
See the SciPy Cookbook on rebinning, which provides this snippet:
import numpy as np

def rebin(a, *args):
    '''rebin ndarray data into a smaller ndarray of the same rank whose dimensions
    are factors of the original dimensions. eg. An array with 6 columns and 4 rows
    can be reduced to have 6, 3, 2 or 1 columns and 4, 2 or 1 rows.
    example usages:
    >>> a = np.random.rand(6, 4); b = rebin(a, 3, 2)
    >>> a = np.random.rand(6); b = rebin(a, 2)
    '''
    shape = a.shape
    lenShape = len(shape)
    factor = np.asarray(shape) // np.asarray(args)
    # Build an expression like 'a.reshape(args[0],factor[0],...).sum(1).sum(2).../factor[0]...'
    evList = ['a.reshape('] + \
             ['args[%d],factor[%d],' % (i, i) for i in range(lenShape)] + \
             [')'] + ['.sum(%d)' % (i + 1) for i in range(lenShape)] + \
             ['/factor[%d]' % i for i in range(lenShape)]
    print(''.join(evList))
    return eval(''.join(evList))
I assume that you want to know how to build a function that performs well in general and does something with arrays, just like numpy.reshape in your example. So if performance really matters and you're already using numpy, you can write your own C code for that, like numpy does. For example, the implementation of arange is written entirely in C. Almost everything in numpy that matters in terms of performance is implemented in C.
However, before doing so you should try to implement the code in Python and see if the performance is good enough. Try to make the Python code as efficient as possible. If it still doesn't suit your performance needs, go the C route.
You may read about that in the docs.

Matching an array to a row in Numpy

I have an array 'A' of shape(50,3) and another array 'B' of shape (1,3).
Actually this B is a row in A. So I need to find its row location.
I used np.where(A == B), but it gives the matched locations element-wise. For example, below is the result I got:
>>> np.where(A == B)
(array([ 3, 3, 3, 30, 37, 44]), array([0, 1, 2, 1, 2, 0]))
Actually B is the 4th row in A (in my case). But the above result gives (3,0), (3,1), (3,2) and others, which are element-wise matches.
Instead, I need the single answer '3', which I would get if B were searched for in A as a whole; it should also exclude partial matches like (30,1), (37,2), ...
How can I do this in Numpy?
Thank you.
You can specify the axis:
numpy.where((A == B).all(axis=1))
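A small self-contained check with made-up data (the arrays here are hypothetical stand-ins for the (50, 3) A and (1, 3) B described above):
import numpy as np

A = np.arange(150).reshape(50, 3)
B = A[3].reshape(1, 3)  # B is the 4th row of A

# (A == B) broadcasts to shape (50, 3); .all(axis=1) keeps only full-row matches
rows = np.where((A == B).all(axis=1))[0]
print(rows)  # [3]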