Pandas groupby when only one value groupedby - pandas

I want to compute the cumulative count of a given variable, so I expect the following code to work:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_records({'x': [0, 1, 0, 1, 1]})
df2 = pd.DataFrame.from_records({'x': [0, 0, 0, 0, 0]})
result = df.groupby('x').apply(lambda x: pd.Series(np.arange(len(x)), index=x.index)).reset_index(level=0, drop=True).sort_index()
assert (result == [0, 0, 1, 1, 2]).all()
result2 = df2.groupby('x').apply(lambda x: pd.Series(np.arange(len(x)))).reset_index(level=0, drop=True).sort_index()
assert (result2 == [0, 1, 2, 3, 4]).all()
The first assert passes, but the second one does not.
Why?

This seems to be an open issue.
See BUG: inconsistent return format of Dataframe group apply function.
A workaround can be:
assert (result2.values == [0, 1, 2, 3, 4]).all()
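For the stated goal, a cumulative count within each group, pandas also provides GroupBy.cumcount, which returns a Series aligned with the original index whether there is one group or several. A minimal sketch reusing the two frames from the question (the cumulative_count helper name is only for illustration, not part of pandas):

```python
import pandas as pd

def cumulative_count(frame, column):
    """Number each row within its group, starting at 0, keeping the original index."""
    return frame.groupby(column).cumcount()

df = pd.DataFrame({'x': [0, 1, 0, 1, 1]})
df2 = pd.DataFrame({'x': [0, 0, 0, 0, 0]})

result = cumulative_count(df, 'x')    # several groups
result2 = cumulative_count(df2, 'x')  # a single group
```

This sidesteps apply entirely, so the result has the same format in both cases.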

Related

get elements in one array while not in other array along with axis 0 [duplicate]

I have 2 2d numpy arrays A and B
I want to remove all the rows in A which appear in B.
I tried something like this:
A[~np.isin(A, B)]
but isin keeps the shape of A, whereas I need one boolean value per row to filter it.
EDIT: something like this
A = np.array([[3, 0, 4],
[3, 1, 1],
[0, 5, 9]])
B = np.array([[1, 1, 1],
[3, 1, 1]])
.....
A = np.array([[3, 0, 4],
[0, 5, 9]])
Probably not the most performant solution, but it does exactly what you want. You can view A and B with a structured dtype in which each element spans one whole row. The arrays must be contiguous first, which ascontiguousarray ensures:
Av = np.ascontiguousarray(A).view(np.dtype([('', A.dtype, A.shape[1])])).ravel()
Bv = np.ascontiguousarray(B).view(Av.dtype).ravel()
Now you can apply np.isin directly:
>>> np.isin(Av, Bv)
array([False, True, False])
According to the docs, invert=True is faster than negating the output of isin, so you can do
A[np.isin(Av, Bv, invert=True)]
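Try the following. It uses matrix multiplication to encode each row as a single integer, so that rows can be compared with np.isin. The encoding is unique only if the weights form a mixed-radix system, so they are built with a cumulative product (this assumes non-negative integer entries):
import numpy as np
A = np.array([[3, 0, 4],
[3, 1, 1],
[0, 5, 9]])
B = np.array([[1, 1, 1],
[3, 1, 1]])
# per-column base: one more than the largest value in either array
arr_max = np.maximum(A.max(0) + 1, B.max(0) + 1)
# mixed-radix weights (1, b0, b0*b1, ...) so each row maps to a unique integer
weights = np.cumprod(np.append(1, arr_max[:-1]))
print (A[~np.isin(A.dot(weights), B.dot(weights))])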
Output:
[[3 0 4]
[0 5 9]]
This is certainly not the most performant solution, but it is relatively easy to read:
A = np.array([row for row in A if row not in B])
Edit:
I found that this code does not work correctly (for NumPy arrays, row in B is true as soon as any single element matches), but this does:
A = np.array([row for row in A if not any(np.equal(B, row).all(1))])
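The row-by-row comparison in the last answer can also be written with broadcasting, which avoids the Python loop. A minimal sketch, assuming both arrays are 2-D with the same number of columns (the remove_rows name is only for illustration):

```python
import numpy as np

def remove_rows(A, B):
    """Return the rows of A that do not appear as a row of B."""
    # Compare every row of A with every row of B: shape (len(A), len(B), ncols).
    equal = A[:, None, :] == B[None, :, :]
    # A row of A matches if all of its columns equal those of some row of B.
    in_B = equal.all(axis=2).any(axis=1)
    return A[~in_B]

A = np.array([[3, 0, 4],
              [3, 1, 1],
              [0, 5, 9]])
B = np.array([[1, 1, 1],
              [3, 1, 1]])
filtered = remove_rows(A, B)
```

The intermediate boolean array has len(A) * len(B) * ncols elements, so this trades memory for speed and suits moderately sized inputs.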

Loop through arrays manually

I am trying to replicate how zip works with a simple example, and I want the output to be an array.
I have the following data
s = (2, 2)
array = np.zeros(s)
x = np.array([1, 0, 1, 0, 1, 1, 1, 1])
y = np.array([1, 0, 0, 0, 1, 0, 1, 1])
What I want is a 2x2 matrix as output, built like this:
for i, j in zip(x, y):
    array[i][j] += 1
This outputs
[[2 0]
[2 4]]
I tried to obtain the same result without using zip, but I get a (1, 1) tuple instead:
for i in range(len(x)):
    array = x[i], y[i]
This outputs: (1, 1)
for i in range(len(x)):
    array[x[i]][y[i]] += 1
This will do the same as
for i, j in zip(x, y):
    array[i][j] += 1
This makes the transformation clear for you:
for idx in range(len(x)):
    i = x[idx]
    j = y[idx]
    array[i][j] += 1
print(array)
Output:
[[2 0]
[2 4]]
import numpy as np
array = np.zeros((2, 2), dtype=int)
x = np.array([1, 0, 1, 0, 1, 1, 1, 1])
y = np.array([1, 0, 0, 0, 1, 0, 1, 1])
# empirically what zip does
[(x[i], y[i]) for i in range(sorted([len(x), len(y)])[0])]
#proof
print( list(zip(x, y)) == [(x[i], y[i]) for i in range(sorted([len(x), len(y)])[0])] )
True
Why use sorted?
Since zip stops at the end of the shorter list, we need to iterate over the range of the shorter length; sorted([1, 2])[0] is 1 (the smaller of the two).
So even if I had y = np.array([1, 0, 0, 0, 1, 0, 1, 1, 1]), this would still return the correct zip of the two.
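The same idea can be expressed with min instead of sorted(...)[0]. A minimal sketch of a zip look-alike for two indexable sequences (manual_zip is a hypothetical name used only here):

```python
import numpy as np

def manual_zip(a, b):
    """Pair up elements by position, stopping at the shorter input, like zip."""
    n = min(len(a), len(b))
    return [(a[i], b[i]) for i in range(n)]

x = np.array([1, 0, 1, 0, 1, 1, 1, 1])
y = np.array([1, 0, 0, 0, 1, 0, 1, 1, 1])  # one element longer than x
pairs = manual_zip(x, y)
```

Unlike the built-in zip, this sketch needs len() and indexing, so it does not work on arbitrary iterators.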
To do it in a vectorized way, work on array.ravel() instead. For a contiguous array, ravel() returns a view, so np.add.at updates array in place:
np.add.at(array.ravel(),
np.ravel_multi_index(np.array([x,y]), s),
np.repeat(1, len(x)))
So you can check out after running it that array has changed:
>>> array
array([[2, 0],
[2, 4]])
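np.add.at also accepts a tuple of index arrays, so the flattening step is optional. A minimal sketch with the data from the question (the counts name is only for illustration):

```python
import numpy as np

x = np.array([1, 0, 1, 0, 1, 1, 1, 1])
y = np.array([1, 0, 0, 0, 1, 0, 1, 1])

counts = np.zeros((2, 2), dtype=int)
# Unbuffered in-place add: every (row, column) pair is counted,
# including pairs that occur more than once.
np.add.at(counts, (x, y), 1)
```

A plain counts[x, y] += 1 would not give the same result, because buffered fancy-index assignment applies the increment only once per distinct index pair.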

return the entire row which has the max value in numpy - python

val = np.array([[1, 3], [2, 5], [0, 6], [1, 2] ])
print(np.max(val))
6
I also want to print the row [0, 6]. With axis, np.max returns values from the other rows as well, and argmax doesn't return the row index.
One way is to use np.where, which returns the indices where the condition is true:
r,_ = np.where(val == np.max(val))
val[r]
Output:
array([[0, 6]])
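Another option is to convert the flat index returned by argmax into a (row, column) pair with np.unravel_index. A minimal sketch using the array from the question:

```python
import numpy as np

val = np.array([[1, 3], [2, 5], [0, 6], [1, 2]])

flat_index = np.argmax(val)  # position of the maximum in the flattened array
row, col = np.unravel_index(flat_index, val.shape)
max_row = val[row]           # the whole row that contains the maximum
```

argmax reports only the first occurrence of the maximum, whereas the np.where approach returns every position where it occurs.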

plot a groupby object with bokeh

Consider the following MWE.
from pandas import DataFrame
from bokeh.plotting import figure
data = dict(x = [0,1,2,0,1,2],
y = [0,1,2,4,5,6],
g = [1,1,1,2,2,2])
df = DataFrame(data)
p = figure()
p.line( 'x', 'y', source=df[ df.g == 1 ] )
p.line( 'x', 'y', source=df[ df.g == 2 ] )
Ideally, I would like to compress the last two lines into one:
p.line( 'x', 'y', source=df.groupby('g') )
(Real life examples have a large and variable number of groups.) Is there any concise way to do this?
I just found out that the following works
gby = df.groupby('g')
gby.apply( lambda d: p.line( 'x', 'y', source=d ) )
(it has some drawbacks, though).
Any better idea?
I couldn't come up with a df.groupby solution, so I used df.loc, but maybe multi_line is what you are after:
from pandas import DataFrame
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
data = dict(x = [0, 1, 2, 0, 1, 2],
y = [0, 1, 2, 4, 5, 6],
g = [1, 1, 1, 2, 2, 2])
df = DataFrame(data, index = data['g'])
dfs = [DataFrame(df.loc[i].values, columns = df.columns) for i in df['g'].unique()]
source = ColumnDataSource(dict(x = [df['x'].values for df in dfs], y = [df['y'].values for df in dfs]))
p = figure()
p.multi_line('x', 'y', source = source)
show(p)
This is Tony's solution slightly simplified.
import pandas as pd
from bokeh.plotting import figure
data = dict(x = [0, 1, 2, 0, 1, 2],
y = [0, 1, 2, 4, 5, 6],
g = [1, 1, 1, 2, 2, 2])
df = pd.DataFrame(data)
####################### So far as in the OP
gby = df.groupby('g')
p = figure()
x = [list( sdf['x'] ) for i,sdf in gby]
y = [list( sdf['y'] ) for i,sdf in gby]
p.multi_line( x, y )
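The two list comprehensions can also be replaced by a groupby aggregation that collects each group's values into a list. A minimal pandas-only sketch (the xs and ys names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame(dict(x=[0, 1, 2, 0, 1, 2],
                       y=[0, 1, 2, 4, 5, 6],
                       g=[1, 1, 1, 2, 2, 2]))

grouped = df.groupby('g')
# One inner list per group, ordered by the sorted group keys.
xs = grouped['x'].apply(list).tolist()
ys = grouped['y'].apply(list).tolist()
```

The resulting lists of lists have the same structure as x and y above, so they can be passed to p.multi_line in the same way.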
from pandas import DataFrame
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
data = dict(x = [0, 1, 2, 0, 1, 2],
y = [0, 1, 2, 4, 5, 6],
g = [1, 1, 1, 2, 2, 2])
df = DataFrame(data)
plt = figure()
for i, group in df.groupby(['g']):
    source = ColumnDataSource(group)
    plt.line('x','y', source=source, legend_group='g')
show(plt)

Substitute entries of numpy array with numpy arrays

I have a numpy array A of shape (s1, ..., sm) with integer entries and a dictionary D with integers as keys and numpy arrays of shape (t,) as values. I would like to evaluate the dictionary on every entry of the array A to get a new array B of shape (s1, ..., sm, t).
For example
D={1:[0,1],2:[1,0]}
A=np.array([1,2,1])
The output should be
array([[0,1],[1,0],[0,1]])
Motivation: I have an array with indexes of unit vectors as entries and I need to transform it into an array with the vectors as entries.
If you can rename your keys to be 0-indexed, you might use direct array querying on your unit vectors:
>>> units = np.array([D[1], D[2]])
>>> B = units[A - 1] # -1 because 0 indexed: 1 -> 0, 2 -> 1
>>> B
array([[0, 1],
[1, 0],
[0, 1]])
And similarly for any shape:
>>> A = np.random.randint(0, 2, (10, 11, 12))  # random 0/1 indices; random_integers is deprecated
>>> A.shape
(10, 11, 12)
>>> B = units[A]
>>> B.shape
(10, 11, 12, 2)
You can learn more about advanced indexing in the NumPy documentation.
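The shape rule behind this lookup is that indexing with an integer array replaces the first axis of units by the full shape of the index array. A small sketch with a 2-D index array (the values are made up for illustration):

```python
import numpy as np

units = np.array([[0, 1],
                  [1, 0]])           # row k is the vector for index k
A = np.array([[0, 1, 0],
              [1, 1, 0]])            # shape (2, 3); entries index rows of units

B = units[A]                         # integer-array indexing along axis 0
```

Here B.shape equals A.shape + units.shape[1:], that is (2, 3, 2), and B[i, j] is units[A[i, j]].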
>>> np.asarray([D[key] for key in A])
array([[0, 1],
[1, 0],
[0, 1]])
Here's an approach using np.searchsorted to locate, for each entry of A, the position of the matching key, and then indexing the dictionary values with those positions. In Python 3, D.keys() and D.values() are views, so they are converted to arrays first, and searchsorted requires the keys to be sorted:
keys = np.array(sorted(D))
idx = np.searchsorted(keys, A)
out = np.asarray([D[k] for k in sorted(D)])[idx]
Sample run -
In [45]: A
Out[45]: array([1, 2, 1])
In [46]: D
Out[46]: {1: [0, 1], 2: [1, 0]}
In [47]: keys = np.array(sorted(D))
...: idx = np.searchsorted(keys, A)
...: out = np.asarray([D[k] for k in sorted(D)])[idx]
...:
In [48]: out
Out[48]:
array([[0, 1],
[1, 0],
[0, 1]])
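searchsorted silently returns a neighbouring position for values that are not keys, so it can help to validate the lookup. A minimal sketch of a helper that works for A of any shape and arbitrary integer keys (substitute is a hypothetical name, not a NumPy function):

```python
import numpy as np

def substitute(A, D):
    """Replace each integer in A by the vector that D maps it to."""
    sorted_keys = sorted(D)
    keys = np.array(sorted_keys)                  # searchsorted needs sorted keys
    vals = np.array([D[k] for k in sorted_keys])  # row i belongs to keys[i]
    idx = np.searchsorted(keys, A)                # position of each entry among keys
    idx = np.clip(idx, 0, len(keys) - 1)          # keep out-of-range entries indexable
    if not np.array_equal(keys[idx], A):
        raise KeyError("A contains values that are not keys of D")
    return vals[idx]

D = {1: [0, 1], 2: [1, 0]}
A = np.array([[1, 2, 1],
              [2, 2, 1]])
B = substitute(A, D)
```

With these inputs B has shape (2, 3, 2), that is the shape of A followed by the length of the vectors in D.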