Is there an easy way to group columns in a Pandas DataFrame?

I am trying to use Pandas to represent motion-capture data, which has T measurements of the (x, y, z) locations of each of N markers. For example, with T=3 and N=4, the raw CSV data looks like:
T,Ax,Ay,Az,Bx,By,Bz,Cx,Cy,Cz,Dx,Dy,Dz
0,1,2,1,3,2,1,4,2,1,5,2,1
1,8,2,3,3,2,9,9,1,3,4,9,1
2,4,5,7,7,7,1,8,3,6,9,2,3
This is really simple to load into a DataFrame, and I've learned a few tricks that are easy (converting marker data to z-scores, or computing velocities, for example).
One thing I'd like to do, though, is convert the "flat" data shown above into a format that has a hierarchical index on the column (marker), so that there would be N columns at level 0 (one for each marker), and each one of those would have 3 columns at level 1 (one each for x, y, and z).
   A        B        C        D
   x  y  z  x  y  z  x  y  z  x  y  z
0  1  2  1  3  2  1  4  2  1  5  2  1
1  8  2  3  3  2  9  9  1  3  4  9  1
2  4  5  7  7  7  1  8  3  6  9  2  3
I know how to do this by loading up the flat file and then manipulating the Series objects directly, perhaps by using append or by creating a new DataFrame with a manually-created MultiIndex.
As a Pandas learner, it feels like there must be a way to do this with less effort, but it's hard to discover. Is there an easier way?

In your case, you basically just need to manipulate the column names.
Starting with your original DataFrame (and a tiny index manipulation):
from io import StringIO  # on Python 2: from StringIO import StringIO

import pandas as pd

a = pd.read_csv(StringIO('T,Ax,Ay,Az,Bx,By,Bz,Cx,Cy,Cz,Dx,Dy,Dz\n'
                         '0,1,2,1,3,2,1,4,2,1,5,2,1\n'
                         '1,8,2,3,3,2,9,9,1,3,4,9,1\n'
                         '2,4,5,7,7,7,1,8,3,6,9,2,3'))
a.set_index('T', inplace=True)
So that:
>> a
   Ax  Ay  Az  Bx  By  Bz  Cx  Cy  Cz  Dx  Dy  Dz
T
0   1   2   1   3   2   1   4   2   1   5   2   1
1   8   2   3   3   2   9   9   1   3   4   9   1
2   4   5   7   7   7   1   8   3   6   9   2   3
Then simply create a list of tuples for your columns, and use MultiIndex.from_tuples:
# split each two-character label like 'Ax' into the tuple ('A', 'x')
a.columns = pd.MultiIndex.from_tuples([(c[0], c[1]) for c in a.columns])
>> a
   A        B        C        D
   x  y  z  x  y  z  x  y  z  x  y  z
T
0  1  2  1  3  2  1  4  2  1  5  2  1
1  8  2  3  3  2  9  9  1  3  4  9  1
2  4  5  7  7  7  1  8  3  6  9  2  3
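With the hierarchical columns in place, per-marker and per-coordinate selection each become a single indexing operation. A small usage sketch, continuing with the a built above:
# all three coordinates of marker A
a['A']
# the x coordinate of every marker, via a cross-section on column level 1
a.xs('x', axis=1, level=1)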

Related

select top values for each datapoint in dataset [duplicate]

I have the following sample df:
idd x y
0 1 2 3
1 1 3 4
2 1 5 6
3 2 7 10
4 2 9 8
5 3 11 12
6 3 13 14
7 3 15 16
8 3 17 18
I want to group by "idd", find the min of x and y within each group, and store them in a new df along with "idd".
In the above df, I expect xmin for idd=1 to be 2 and ymin to be 3; for idd=2, xmin should be 7 and ymin should be 8; and so on.
Expecting df:
idd xmin ymin
0 1 2 3
1 2 7 8
2 3 11 12
Code tried:
for group in df.groupby("idd"):
    box = [df['x'].min(), df['y'].min()]
but it finds the min of x and y of the whole column, not per "idd".
Here's a slightly different approach, without rename:
df = df.groupby('idd').min().add_suffix('min').reset_index()
idd xmin ymin
0 1 2 3
1 2 7 8
2 3 11 12
You can use groupby and then take min for each group.
df.groupby('idd').min().reset_index().rename(columns={'x':'xmin','y':'ymin'})
Out[105]:
idd xmin ymin
0 1 2 3
1 2 7 8
2 3 11 12
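If you also want the output column names set in the same step, named aggregation (available since pandas 0.25) does it without a separate rename; a minimal sketch:
df.groupby('idd').agg(xmin=('x', 'min'), ymin=('y', 'min')).reset_index()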

Using groupby() and cut() in pandas

I have a dataframe, and for each group I want to label its values: if a value is less than the group mean the label is 1, and if it is more than the group mean the label is 2.
input data frame is
groups num1
0 a 2
1 a 5
2 a NaN
3 b 10
4 b 4
5 b 0
6 b 7
7 c 2
8 c 4
9 c 1
Here the mean values for groups a, b, and c are 3.5, 5.25, and 2.33 respectively, and the output data frame is:
groups out
0 a 1
1 a 2
2 a NaN
3 b 2
4 b 1
5 b 1
6 b 2
7 c 1
8 c 2
9 c 1
I want to use pandas.cut, and maybe pandas.groupby and pandas.apply as well.
Also, how can I skip null values here?
Thanks in advance.
cut is not really pertinent here. Use groupby.transform('mean') and numpy.where:
df['out'] = np.where(df['num1'].lt(df.groupby('groups')['num1']
                                   .transform('mean')),
                     1, 2)
Output (as new column "out" for clarity):
groups num1 out
0 a 2 1
1 a 5 2
2 a NaN 2
3 b 10 2
4 b 4 1
5 b 0 1
6 b 7 2
7 c 2 1
8 c 4 2
9 c 1 1
I really want cut
OK, but it's neither nice nor performant:
(df.groupby('groups')['num1']
   .transform(lambda g: pd.cut(g, [-np.inf, g.mean(), np.inf], labels=[1, 2]))
)
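On the "skip null values" part of the question: lt evaluates to False for NaN, so the np.where version labels NaN rows as 2 (see row 2 of the output above). A minimal sketch of one way to keep them as NaN instead, under the same column names:
import numpy as np

group_mean = df.groupby('groups')['num1'].transform('mean')
df['out'] = np.where(df['num1'].lt(group_mean), 1, 2)
# put NaN back where num1 is missing (this upcasts 'out' to float)
df.loc[df['num1'].isna(), 'out'] = np.nan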

Pandas running sum

I have a pandas dataframe and it is something like this:
x y
1 0
2 1
3 2
4 0 <<<< Reset
5 1
6 2
7 3
8 0 <<<< Reset
9 1
10 2
The x values could be anything; they are not meaningful for this question. The y values increment, reset, and increment again. I need a third column (z) that numbers the groups, incrementing each time the y values reset.
I cannot guarantee that the reset will be to zero; any value that is less than the previous one should indicate a reset.
x y z
1 0 0
2 1 0
3 2 0
4 0 1 <<<< Incremented by 1
5 1 1
6 2 1
7 3 1
8 0 2 <<<< Incremented by 1
9 1 2
10 2 2
So, to produce z, I understand what needs to be done; I'm just not familiar with the syntax. My solution would be to first assign z as a sparse column of 0s and 1s, where everything is zero except for a 1 wherever y[ix] < y[ix-1], indicating that the y counter has been reset. Then a cumulative running sum would be performed on the z column, meaning that z[ix] = sum(z[0], z[1], ..., z[ix]).
I'd appreciate some help with the syntax for assigning column z, if someone has a moment.
Based on your logic:
#general case
df['z'] = df['y'].diff().lt(0).cumsum()
# or equivalently
# df['z'] = df['y'].lt(df['y'].shift()).cumsum()
Output:
x y z
0 1 0 0
1 2 1 0
2 3 2 0
3 4 0 1
4 5 1 1
5 6 2 1
6 7 3 1
7 8 0 2
8 9 1 2
9 10 2 2
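Unpacking that one-liner into the two steps the question describes (a sketch, using the example data above):
import pandas as pd

df = pd.DataFrame({'x': range(1, 11),
                   'y': [0, 1, 2, 0, 1, 2, 3, 0, 1, 2]})

# step 1: True wherever y dropped relative to the previous row (a reset)
reset = df['y'].diff().lt(0)
# step 2: the running count of resets is exactly the group number z
df['z'] = reset.cumsum()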
Using ne(1) (note: this assumes y always increments by exactly 1 between resets):
df.y.diff().ne(1).cumsum().sub(1)
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 2
8 2
9 2
Name: y, dtype: int32

Apply an element-wise function on a pandas dataframe with index and column values as inputs

I often have this need, and I can't seem to find the way to do it efficiently.
Let's say I have a pandas DataFrame object and I want the value of each element (i,j) to be equal to f(index[i], columns[j]).
Using applymap, the index and column values for each element are lost.
What is the best way to do it?
It depends on what you are trying to do specifically.
clever hack
using pd.Panel.apply
It works because it iterates over each series along the major and minor axes; its name will be the (index, column) tuple we need. (Note: pd.Panel was removed in pandas 0.25, so this hack only runs on older versions.)
df = pd.DataFrame(index=range(5), columns=range(5))
def f1(x):
    n = x.name
    return n[0] + n[1] ** 2
pd.Panel(dict(A=df)).apply(f1, 0)
0 1 2 3 4
0 0 1 4 9 16
1 1 2 5 10 17
2 2 3 6 11 18
3 3 4 7 12 19
4 4 5 8 13 20
example 1
Here is one such use case and a possible solution for it:
df = pd.DataFrame(index=range(5), columns=range(5))
f = lambda x: x[0] + x[1]
s = df.stack(dropna=False)
s.loc[:] = s.index.map(f)
s.unstack()
0 1 2 3 4
0 0 1 2 3 4
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
4 4 5 6 7 8
or this will do the same thing
df.stack(dropna=False).to_frame().apply(lambda x: f(x.name), 1).unstack()
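Another compact route on the same example: apply a function column by column, since each column Series carries its label in .name and the row labels in .index (a sketch for the numeric f above):
df.apply(lambda col: pd.Series(col.index + col.name, index=col.index))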
example 2
df = pd.DataFrame(index=list('abcd'), columns=list('xyz'))
v = df.values
c = df.columns.values
i = df.index.values
# repeat/tile so the flattened (row, column) pairs come out in row-major order
pd.DataFrame(
    (i.repeat(len(c)) + np.tile(c, len(i))).reshape(v.shape),
    i, c
)
    x   y   z
a  ax  ay  az
b  bx  by  bz
c  cx  cy  cz
d  dx  dy  dz
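Since pd.Panel no longer exists in modern pandas, here is a present-day sketch of the same idea for a numeric f (assuming f(i, j) = i + j, as in example 1), building the values from label grids with numpy:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(5), columns=range(5))

# grids of row and column labels with the same shape as df
ii, jj = np.meshgrid(df.index, df.columns, indexing='ij')
out = pd.DataFrame(ii + jj, index=df.index, columns=df.columns)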

Python Pandas groupby-apply strange behavior

Can anyone help me understand why there is different behavior between the two calls to apply below? Thank you.
In [34]: df
Out[34]:
A B C
0 1 0 0
1 1 7 4
2 2 9 8
3 2 2 4
4 2 2 1
5 3 3 3
6 3 3 2
7 3 5 7
In [35]: g = df.groupby('A')
In [36]: g.apply(max)
Out[36]:
A B C
A
1 1 7 4
2 2 9 8
3 3 5 7
In [37]: g.apply(lambda x: max(x))
Out[37]:
A
1 C
2 C
3 C
dtype: object
Short answer - you probably just want
df.groupby('A').max()
Longer answer - max is the generic Python builtin that finds the max of any iterable. Because iterating over a DataFrame yields its column labels, the plain Python max just finds the lexicographically "largest" column label (here 'C'), which is what happens in your second case.
In the first case - pandas has intercept logic that turns things like g.apply(sum) into g.sum(), so the builtin is replaced by the pandas column-wise reduction per group.
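To see both behaviors side by side, a small sketch (hypothetical data, same shape as the question's):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [0, 7, 9], 'C': [0, 4, 8]})

# iterating a DataFrame yields its column labels, so the builtin max
# returns the lexicographically largest label, not any data value
print(max(df))                 # 'C'

# the intercepted, pandas-aware version computes per-group column maxima
print(df.groupby('A').max())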