Change the order of index in pandas DataFrame with multiindex - pandas

I am trying to find a straightforward way to change the order of values in a pandas DataFrame MultiIndex. To illustrate what I mean, suppose we have a DataFrame with a MultiIndex defined as follows:
index = pd.MultiIndex(levels=[[u'C', u'D', u'M'], [u'C', u'D', u'M']],
                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
                      names=[u'level0', u'level1'])
df = pd.DataFrame(np.random.randint(10, size=(9, 3)), index=index, columns=['C', 'M', 'D'])
So we have a nine-row DataFrame df indexed by this MultiIndex.
What I am trying to do is change the sequence of the MultiIndex from "C D M" (which is ordered alphabetically) to "C M D" in both level0 and level1. I have tried to use df.reindex, but have not found an easy way to achieve this.
Jezrael gave an answer below which gives the correct display:
L = list('CMD')
mux = pd.MultiIndex.from_product([L, L], names=df.index.names)
df = df.reindex(mux)
print (df)
However, what I need is that the levels of the index are in the order of "C M D" as well. If we check df.index, we get the following:
MultiIndex(levels=[[u'C', u'D', u'M'], [u'C', u'D', u'M']],
           codes=[[0, 0, 0, 2, 2, 2, 1, 1, 1], [0, 2, 1, 0, 2, 1, 0, 2, 1]],
           names=[u'level0', u'level1'])
Note the "levels" are still in the order of "C D M". What I want is such that when I use df.unstack(), I still get the index in the order of "C M D". Sorry for not making this clear.

Use reindex with a new MultiIndex.from_product:
np.random.seed(2018)
index = pd.MultiIndex(levels=[[u'C', u'D', u'M'], [u'C', u'D', u'M']],
                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
                      names=[u'level0', u'level1'])
df = pd.DataFrame(np.random.randint(10, size=(9, 3)),
                  index=index, columns=['C', 'M', 'D'])
print (df)
C M D
level0 level1
C C 6 2 9
D 5 4 6
M 9 9 7
D C 9 6 6
D 1 0 6
M 5 6 7
M C 0 7 8
D 7 9 4
M 8 1 2
L = list('CMD')
mux = pd.MultiIndex.from_product([L, L], names=df.index.names)
df = df.reindex(mux)
print (df)
C M D
level0 level1
C C 6 2 9
M 9 9 7
D 5 4 6
M C 0 7 8
M 8 1 2
D 7 9 4
D C 9 6 6
M 5 6 7
D 1 0 6
EDIT:
If you need to set the ordering, create an ordered CategoricalIndex and then simply sort_index:
L = pd.CategoricalIndex(list('CDM'), ordered=True, categories=list('CMD'))
df.index = pd.MultiIndex.from_product([L, L], names=df.index.names)
df = df.sort_index()
print (df)
C M D
level0 level1
C C 6 2 9
M 9 9 7
D 5 4 6
M C 0 7 8
M 8 1 2
D 7 9 4
D C 9 6 6
M 5 6 7
D 1 0 6
Check unstack for the new ordering:
print (df.unstack())
C M D
level1 C M D C M D C M D
level0
C 6 9 5 2 9 4 9 7 6
M 0 8 7 7 1 9 8 2 4
D 9 5 1 6 6 0 6 7 6
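To see the difference end-to-end, here is a minimal self-contained sketch. It rebuilds the frame with MultiIndex.from_product (sidestepping the version-sensitive labels=/codes= constructor), and the assertions just restate what the outputs above show: plain reindex keeps the levels alphabetical, while the CategoricalIndex route carries "C M D" through to unstack:

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame: 3x3 product index, columns C M D.
index = pd.MultiIndex.from_product([list('CDM'), list('CDM')],
                                   names=['level0', 'level1'])
df = pd.DataFrame(np.random.randint(10, size=(9, 3)),
                  index=index, columns=list('CMD'))

# Plain reindex reorders the rows, but the levels stay alphabetical:
mux = pd.MultiIndex.from_product([list('CMD')] * 2, names=df.index.names)
assert list(df.reindex(mux).index.levels[0]) == ['C', 'D', 'M']

# An ordered CategoricalIndex bakes "C M D" into the levels instead,
# so unstack() (which follows level order) keeps that order too:
cat = pd.CategoricalIndex(list('CDM'), ordered=True, categories=list('CMD'))
df.index = pd.MultiIndex.from_product([cat, cat], names=df.index.names)
df = df.sort_index()
assert list(df.unstack().index) == ['C', 'M', 'D']
```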

Related

How to find min and max values of a dataframe within x rolling rows without a loop?

I have this dataframe that looks like that:
df = pd.DataFrame(
    [
        [5, 8],
        [8, 10],
        [3, 15],
        [16, 20],
        [12, 21],
        [5, 9],
        [10, 12],
        [20, 22],
        [4, 10],
        [7, 13],
        [9, 15],
        [6, 9],
    ],
    columns=list("lh"),
)
I would like to know the min and max within the x previous and x following rows for each row, but without using a loop, as the DataFrame is quite large and it takes a long time.
I have this function that works:
def pivotid(df1, l, n1, n2):  # n1, n2: rows before and after candle l
    if l - n1 < 0 or l + n2 >= len(df1):
        return 0
    pividlow = 1
    pividhigh = 1
    for i in range(l - n1, l + n2 + 1):
        if df1.l[l] > df1.l[i]:
            pividlow = 0
        if df1.h[l] < df1.h[i]:
            pividhigh = 0
    if pividlow and pividhigh:
        return 3
    elif pividlow:
        return 1
    elif pividhigh:
        return 2
    else:
        return 0
Here's how I call it:
df['pivot'] = df.apply(lambda x: pivotid(df, x.name, 2, 2), axis=1)
Here's the expected result:
l h pivot
0 5 8 0
1 8 10 0
2 3 15 1
3 16 20 0
4 12 21 2
5 5 9 1
6 10 12 0
7 20 22 2
8 4 10 1
9 7 13 0
10 9 15 0
11 6 9 0
Do you think that there's a way to achieve that without using a for loop with pandas?
With the DataFrame you provided, here is one way to do it using pandas rolling:
df["pivot"] = (
    (df["l"] == df["l"].rolling(window=5, center=True).min()).astype(int)
    + (df["h"] == df["h"].rolling(window=5, center=True).max()).astype(int) * 2
)
Then:
print(df)
# Output
l h pivot
0 5 8 0
1 8 10 0
2 3 15 1
3 16 20 0
4 12 21 2
5 5 9 1
6 10 12 0
7 20 22 2
8 4 10 1
9 7 13 0
10 9 15 0
11 6 9 0
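The window=5 above corresponds to n1 = n2 = 2. When the before/after counts differ, center=True no longer lines up, but a trailing window shifted back by n2 generalizes the same idea. A sketch (pivot_vec is a hypothetical helper name, written under the assumption that the edge rows should stay 0, as in the original function):

```python
import pandas as pd

df = pd.DataFrame(
    [[5, 8], [8, 10], [3, 15], [16, 20], [12, 21], [5, 9],
     [10, 12], [20, 22], [4, 10], [7, 13], [9, 15], [6, 9]],
    columns=list("lh"),
)

def pivot_vec(frame, n1, n2):
    # rolling(w).min() at row i covers rows [i-w+1, i]; shifting the result
    # up by n2 re-centres the window on [i-n1, i+n2], which also works
    # when n1 != n2. NaNs at the edges compare False, giving 0 there.
    w = n1 + n2 + 1
    low = frame["l"].eq(frame["l"].rolling(w).min().shift(-n2))
    high = frame["h"].eq(frame["h"].rolling(w).max().shift(-n2))
    # Same encoding as pivotid: 1 = low pivot, 2 = high pivot, 3 = both.
    return low.astype(int) + high.astype(int) * 2

df["pivot"] = pivot_vec(df, 2, 2)
```

With n1 = n2 = 2 this reproduces the output above exactly.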

groupby and count on multiple columns of dataframe

I have a df
df = pd.DataFrame([
    [1, 1, 'A', 10],
    [4, 1, 'A', 6],
    [7, 2, 'A', 3],
    [2, 2, 'A', 4],
    [6, 2, 'B', 9],
    [5, 2, 'B', 7],
    [5, 1, 'B', 12],
    [5, 1, 'B', 4],
    [5, 2, 'C', 9],
    [5, 1, 'C', 3],
    [5, 1, 'C', 4],
    [5, 2, 'C', 7]
    ],
    index=['A'] * 12,
    columns=['A', 'B', 'C', 'D'])
I can count the number of non-zero values of column D grouped by column A using:
df['countTrans'] = df['D'].ne(0).groupby(df['A']).transform('sum')
where the output is:
df:
A B C D countTrans
A 1 1 A 10 1.0
A 4 1 A 6 1.0
A 7 2 A 3 1.0
A 2 2 A 4 1.0
A 6 2 B 9 1.0
A 5 2 B 7 7.0
A 5 1 B 12 7.0
A 5 1 B 4 7.0
A 5 2 C 9 7.0
A 5 1 C 3 7.0
A 5 1 C 4 7.0
A 5 2 C 7 7.0
However, I would like to group not only by column A but also by column B.
I have tried variants of:
df['countTrans'] = df['D'].ne(0).groupby(df['A'], df['B']).transform('sum')
df['countTrans'] = df['D'].ne(0).groupby(df['A','B']).transform('sum')
without success.
My desired output would look like:
df:
A B C D countTrans
A 1 1 A 10 1.0
A 4 1 A 6 1.0
A 7 2 A 3 1.0
A 2 2 A 4 1.0
A 6 2 B 9 1.0
A 5 2 B 7 3.0
A 5 1 B 12 4.0
A 5 1 B 4 4.0
A 5 2 C 9 3.0
A 5 1 C 3 4.0
A 5 1 C 4 4.0
A 5 2 C 7 3.0
A possible solution is to pass a list of Series:
df['countTrans'] = df['D'].ne(0).groupby([df['A'], df['B']]).transform('sum')
print (df)
A B C D countTrans
A 1 1 A 10 1
A 4 1 A 6 1
A 7 2 A 3 1
A 2 2 A 4 1
A 6 2 B 9 1
A 5 2 B 7 3
A 5 1 B 12 4
A 5 1 B 4 4
A 5 2 C 9 3
A 5 1 C 3 4
A 5 1 C 4 4
A 5 2 C 7 3
Or create a helper column with DataFrame.assign (cleaner in my opinion):
df['countTrans'] = df.assign(E = df['D'].ne(0)).groupby(['A','B'])['E'].transform('sum')
#similar solution with overwrite D
#df['countTrans'] = df.assign(D = df['D'].ne(0)).groupby(['A','B'])['D'].transform('sum')
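Both forms compute the same thing: group a boolean mask by the (A, B) pairs and broadcast each group's sum back to the original rows. A self-contained check on the question's data (with the second row's D set to 0 here so that ne(0) actually excludes a row, and E as a throwaway helper column name):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 'A', 10], [4, 1, 'A', 0], [7, 2, 'A', 3], [2, 2, 'A', 4],
     [6, 2, 'B', 9], [5, 2, 'B', 7], [5, 1, 'B', 12], [5, 1, 'B', 4],
     [5, 2, 'C', 9], [5, 1, 'C', 3], [5, 1, 'C', 4], [5, 2, 'C', 7]],
    index=['A'] * 12, columns=list('ABCD'))

# Grouping a Series by a list of Series vs. grouping the frame by names:
s1 = df['D'].ne(0).groupby([df['A'], df['B']]).transform('sum')
s2 = df.assign(E=df['D'].ne(0)).groupby(['A', 'B'])['E'].transform('sum')
assert s1.tolist() == s2.tolist()
```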

Passing Tuple to a function via apply

I am trying to run the below function, which takes two points:
pointA = (2, 3)
pointB = (4, 5)
def Somefunc(pointA, pointB):
    x = pointA[0] + pointB[1]
    return x
Now, when I try to create a separate column based on this function, it throws errors like cannot convert the series to <class 'float'>, so I tried this:
df['T']=df.apply(Somefunc((df['A'].apply(lambda x: float(x)),df['B'].apply(lambda x: float(x))),\
(df['C'].apply(lambda x: float(x)),df['D'].apply(lambda x: float(x)))),axis=0))
Sample DataFrame below:
A B C D
1 2 3 5
2 4 7 8
4 7 9 0
Any help will be appreciated.
This is the best guess I can make as to what you're trying to do:
df['T']=df.apply(lambda row: [(row['A'],row['B']),(row['C'],row['D'])],axis=1)
Edit: to apply your function:
df['T'] = df.apply(lambda row: SomeFunc((row['A'],row['B']),(row['C'],row['D'])),axis=1)
That being said, the same result can be achieved much more quickly and idiomatically like so:
>>> df
A B C D
0 2 7 3 3
1 3 1 5 7
2 2 0 6 2
3 3 9 5 9
4 0 2 3 7
>>> df['T']=df.apply(tuple,axis=1)
>>> df
A B C D T
0 2 7 3 3 (2, 7, 3, 3)
1 3 1 5 7 (3, 1, 5, 7)
2 2 0 6 2 (2, 0, 6, 2)
3 3 9 5 9 (3, 9, 5, 9)
4 0 2 3 7 (0, 2, 3, 7)
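For this particular Somefunc, which only ever reads pointA[0] and pointB[1], the row-wise apply reduces to plain column arithmetic. A sketch on the question's sample frame (somefunc is just a lower-case stand-in for the question's function):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 4], 'B': [2, 4, 7],
                   'C': [3, 7, 9], 'D': [5, 8, 0]})

def somefunc(point_a, point_b):
    return point_a[0] + point_b[1]

# Row-wise apply, packing the columns into the two tuples the function expects:
df['T'] = df.apply(lambda r: somefunc((r['A'], r['B']), (r['C'], r['D'])), axis=1)

# Since only pointA[0] and pointB[1] are used, this is just column A plus
# column D, vectorized with no Python-level loop:
assert df['T'].equals(df['A'] + df['D'])
```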

Pairwise minima of elements of a pandas Series

Input:
numbers = pandas.Series([3,5,8,1], index=["A","B","C","D"])
A 3
B 5
C 8
D 1
Expected output (pandas DataFrame):
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1
Current (working) solution:
pairwise_mins = pandas.DataFrame(index=numbers.index)

def calculate_mins(series, index):
    to_return = numpy.minimum(series, series[index])
    return to_return

for col in numbers.index:
    pairwise_mins[col] = calculate_mins(numbers, col)
I suspect there must be a better, shorter, vectorized solution. Could anyone help me with it?
Use the outer method that NumPy ufuncs provide, here with numpy.minimum:
n = numbers.to_numpy()
np.minimum.outer(n, n)
array([[3, 3, 3, 1],
[3, 5, 5, 1],
[3, 5, 8, 1],
[1, 1, 1, 1]], dtype=int64)
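To get back a labelled DataFrame matching the expected output, the raw array can be wrapped with the Series' index on both axes; a small sketch (result is a hypothetical variable name):

```python
import numpy as np
import pandas as pd

numbers = pd.Series([3, 5, 8, 1], index=list("ABCD"))
n = numbers.to_numpy()

# minimum.outer(n, n)[i, j] == min(n[i], n[j]); labelling both axes with
# the Series' index reproduces the expected square frame.
result = pd.DataFrame(np.minimum.outer(n, n),
                      index=numbers.index, columns=numbers.index)
```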
This can be done with broadcasting (note that the 2D indexing has to happen on the underlying array, not the Series itself):
pd.DataFrame(np.where(numbers.values[:, None] < numbers.values,
                      numbers.values[:, None],
                      numbers.values),
             index=numbers.index,
             columns=numbers.index)
Output:
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1
Use np.broadcast_to and np.clip:
a = numbers.values
pd.DataFrame(np.broadcast_to(a, (a.size, a.size)).T.clip(max=a),
             columns=numbers.index,
             index=numbers.index)
Out[409]:
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1

Separate aggregated data in different rows [duplicate]

This question already has answers here:
How can I replicate rows of a Pandas DataFrame?
(10 answers)
Closed 11 months ago.
I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n':  [1, 2, 3],
    'v':  [10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'v':  [10, 13, 13, 8, 8, 8]
})
Is this possible?
You can use Index.repeat to get repeated index values based on the column then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
You could use set_index and repeat:
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
    f.id, f.n, f.v,
    'A',  1,   10,
    'B',  2,   13,
    'C',  3,   8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
Not the best solution, but I want to share this: you could also combine DataFrame.reindex() with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the .index.
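All of the approaches above build the same frame; a quick self-contained check combining the Index.repeat answer with the clean-up step:

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C'], 'n': [1, 2, 3], 'v': [10, 13, 8]})

# Repeat each row label n times, select those rows, then drop the count
# column and renumber the index:
out = (df.loc[df.index.repeat(df['n'])]
         .drop(columns='n')
         .reset_index(drop=True))
```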