Pandas: use several rows for a column MultiIndex in a DataFrame

When you load a CSV in pandas you can easily specify which rows to use for the column MultiIndex, like so:
import pandas
from io import StringIO
df = """a | X | X | Y | Y | Z | Z
b | C | N | C | N | C | N
c | i | i | i | j | j | j
d | 3 | 10 | 4 | 98 | 81 | 0"""
df = StringIO(df.replace(' ',''))
df = pandas.read_csv(df, sep="|", header=[0,1,2])
>>> df
   a  X      Y      Z
   b  C   N  C   N  C   N
   c  i   i  i   j  j   j
0  d  3  10  4  98  81  0
But how do you produce the same result from a DataFrame already in memory? How do you simply specify which set of rows should be used for the column index?
Without, of course, going through this hack:
>>> df
0 1 2 3 4 5 6
0 a X X Y Y Z Z
1 b C N C N C N
2 c i i i j j j
3 d 3 10 4 98 81 0
path = '~/test/temp.csv'
df.to_csv(path, header=None, index=None)
df = pandas.read_csv(path, header=[0,1,2])
Or even this hack:
>>> df
0 1 2 3 4 5 6
0 a X X Y Y Z Z
1 b C N C N C N
2 c i i i j j j
3 d 3 10 4 98 81 0
df = df.transpose().set_index([0,1,2]).transpose()
I tried using this method, but it does not accept an axis parameter:
df.set_index(['a', 'b', 'c'], axis=1)

Your alternative solution can be tightened up a bit:
df = df.T.set_index([0,1,2]).T
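One small follow-up, not part of the original answer: after the double transpose, the new column MultiIndex keeps the old row labels (0, 1, 2) as its level names. Assuming those names are unwanted, they can be cleared with rename_axis:
df = df.T.set_index([0, 1, 2]).T.rename_axis([None, None, None], axis=1)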
Another solution without transpose:
df.columns = pd.MultiIndex.from_tuples(df.iloc[:3].apply(tuple))
df = df.iloc[3:].reset_index(drop=True)
print (df)
   a  X      Y      Z
   b  C   N  C   N  C   N
   c  i   i  i   j  j   j
0  d  3  10  4  98  81  0
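For completeness, a variant of the same idea (a sketch assuming pandas >= 0.24, where MultiIndex.from_frame was added) builds the index straight from the transposed header rows; the level names come out as 0, 1, 2 rather than blank:
df.columns = pd.MultiIndex.from_frame(df.iloc[:3].T)
df = df.iloc[3:].reset_index(drop=True)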

dataframe sorting by sum of values

I have the following df:
df = pd.DataFrame({'from':['A','A','A','B','B','C','C','C'],'to':['J','C','F','C','M','Q','C','J'],'amount':[1,1,2,12,13,5,5,1]})
and I wish to sort it in such a way that the 'from' value with the highest total amount comes first. So in this example, 'from' B has 12+13 = 25, so B is first in the list. Then comes C with 11, and then A with 4.
One way to do it is like this:
df['temp'] = df.groupby(['from'])['amount'].transform('sum')
df.sort_values(by=['temp'], ascending=False)
but I'm just adding another column. Wonder if there's a better way?
I think your method is good and explicit.
A variant without the temporary column could be (the key parameter of sort_values requires pandas >= 1.1):
df.sort_values(by='from', ascending=False,
               key=lambda x: df['amount'].groupby(x).transform('sum'))
output:
from to amount
3 B C 12
4 B M 13
5 C Q 5
6 C C 5
7 C J 1
0 A J 1
1 A C 1
2 A F 2
In your case, you can do it with argsort:
out = df.iloc[(-df.groupby(['from'])['amount'].transform('sum')).argsort()]
Out[53]:
from to amount
3 B C 12
4 B M 13
5 C Q 5
6 C C 5
7 C J 1
0 A J 1
1 A C 1
2 A F 2
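Yet another variant, not from the original answers: compute the group order once, then pull the rows out label by label. The names order and out are illustrative only:
# total amount per 'from', largest first
order = df.groupby('from')['amount'].sum().sort_values(ascending=False).index
# .loc on a non-unique index returns all rows per label, in the given order
out = df.set_index('from').loc[order].reset_index()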

Group by year and get count and total count [duplicate]

I have this simple dataframe df:
df = pd.DataFrame({'c':[1,1,1,2,2,2,2],'type':['m','n','o','m','m','n','n']})
My goal is to count the values of type for each c, and then add a column with the size of each c group. So, starting with:
In [27]: g = df.groupby('c')['type'].value_counts().reset_index(name='t')
In [28]: g
Out[28]:
c type t
0 1 m 1
1 1 n 1
2 1 o 1
3 2 m 2
4 2 n 2
The first problem is solved. Then I can also:
In [29]: a = df.groupby('c').size().reset_index(name='size')
In [30]: a
Out[30]:
c size
0 1 3
1 2 4
How can I add the size column directly to the first dataframe? So far I used map as:
In [31]: a.index = a['c']
In [32]: g['size'] = g['c'].map(a['size'])
In [33]: g
Out[33]:
c type t size
0 1 m 1 3
1 1 n 1 3
2 1 o 1 3
3 2 m 2 4
4 2 n 2 4
which works, but is there a more straightforward way to do this?
Use transform to add a column back from a groupby aggregation; transform returns a Series with its index aligned to the original df:
In [123]:
g = df.groupby('c')['type'].value_counts().reset_index(name='t')
g['size'] = df.groupby('c')['type'].transform('size')
g
Out[123]:
c type t size
0 1 m 1 3
1 1 n 1 3
2 1 o 1 3
3 2 m 2 4
4 2 n 2 4
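A caveat worth noting: the assignment above works because g's default integer index (0-4) happens to line up with the first rows of df, which carry the right group sizes. A sketch that joins the sizes explicitly avoids relying on that coincidence:
g = df.groupby('c')['type'].value_counts().reset_index(name='t')
# map each row's c to its group size instead of relying on index alignment
g['size'] = g['c'].map(df.groupby('c').size())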
Another solution with transform len:
df['size'] = df.groupby('c')['type'].transform(len)
print(df)
c type size
0 1 m 3
1 1 n 3
2 1 o 3
3 2 m 4
4 2 m 4
5 2 n 4
6 2 n 4
Another solution with Series.map and Series.value_counts:
df['size'] = df['c'].map(df['c'].value_counts())
print (df)
c type size
0 1 m 3
1 1 n 3
2 1 o 3
3 2 m 4
4 2 m 4
5 2 n 4
6 2 n 4
You can create the groupby object once and use it multiple times:
g = df.groupby('c')['type']
df = g.value_counts().reset_index(name='counts')
df['size'] = g.transform('size')
or
g.value_counts().reset_index(name='counts').assign(size=g.transform('size'))
Output:
c type counts size
0 1 m 1 3
1 1 n 1 3
2 1 o 1 3
3 2 m 2 4
4 2 n 2 4
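Since the per-group counts must sum to the group size, the size column can also be derived from the counts themselves, avoiding a second pass over df (a sketch along the same lines, not from the original answers):
g = (df.groupby('c')['type']
       .value_counts()
       .reset_index(name='t')
       .assign(size=lambda d: d.groupby('c')['t'].transform('sum')))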

Merge two matrices (dataframes) into one with interleaved columns

I have two dataframes like these:
df1:
   a  b  c
0  1  2  3
1  2  3  4
2  3  4  5
df2:
   x  y  z
0  T  T  F
1  F  T  T
2  F  T  F
I want to merge these matrices so that their columns interleave one by one, like this:
df:
   a  x  b  y  c  z
0  1  T  2  T  3  F
1  2  F  3  T  4  T
2  3  F  4  T  5  F
What's your idea? How can we merge, append, or concat them?
I used this code; it works dynamically:
df = pd.DataFrame()
for i in range(0, 6):
    if i % 2 == 0:
        # even output columns come from df1
        j = i / 2
        df.loc[:, i] = df1.iloc[:, int(j)]
    else:
        # odd output columns come from df2
        j = (i - 1) / 2
        df.loc[:, i] = df2.iloc[:, int(j)]
And it works correctly!
Try:
df = pd.concat([df1, df2], axis=1)
df = df[['a','x','b','y','c','z']]
Prints:
   a  x  b  y  c  z
0  1  T  2  T  3  F
1  2  F  3  T  4  T
2  3  F  4  T  5  F
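If the column names are not known in advance, the same concat approach can be generalized by interleaving the two column lists (a sketch assuming df1 and df2 have the same number of columns):
df = pd.concat([df1, df2], axis=1)
# interleave the two column lists: a, x, b, y, c, z
order = [c for pair in zip(df1.columns, df2.columns) for c in pair]
df = df[order]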

Python Select N number of rows dataframe

I have a dataframe with 2 columns, and I want to select N rows of column B per unique value of column A.
A B
0 A
0 B
0 I
0 D
1 A
1 F
1 K
1 L
2 R
For each unique number in column A, give me N random rows from column B; if N == 2 then the resulting dataframe would look like the one below. If a value of column A has fewer than N rows, return all of its rows.
A B
0 A
0 D
1 F
1 K
2 R
Use DataFrame.sample per group in GroupBy.apply, testing the length of each group with if-else:
N = 2
df1 = df.groupby('A').apply(lambda x: x.sample(N) if len(x) >= N else x).reset_index(drop=True)
print (df1)
A B
0 0 I
1 0 D
2 1 A
3 1 K
4 2 R
Or:
N = 2
df1 = df.groupby('A', group_keys=False).apply(lambda x: x.sample(N) if len(x) >= N else x)
print (df1)
A B
0 0 A
3 0 D
5 1 F
6 1 K
8 2 R
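A variant without apply, not from the original answer: shuffle the whole frame once, then take the first N rows of each group; groups smaller than N just yield all their rows. random_state is only there to make the sample repeatable:
N = 2
df1 = df.sample(frac=1, random_state=0).groupby('A').head(N).sort_index()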

Create pandas dataframe by repeating one row with new multiindex

In Pandas I have a series and a multi-index:
s = pd.Series([1,2,3,4], index=['w', 'x', 'y', 'z'])
idx = pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']])
What is the best way for me to create a DataFrame that has idx as index and s as the value for each row, preserving the index of s as columns?
df =
w x y z
a c 1 2 3 4
d 1 2 3 4
b c 1 2 3 4
d 1 2 3 4
Use the pd.DataFrame constructor followed by assign:
pd.DataFrame(index=idx).assign(**s)
w x y z
a c 1 2 3 4
d 1 2 3 4
b c 1 2 3 4
d 1 2 3 4
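The same broadcasting is available through a plain dict, which also works when the labels of s are not valid Python identifiers (a small sketch, not part of the original answer):
# scalar dict values broadcast down the supplied index
pd.DataFrame(dict(s), index=idx)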
You can use numpy.repeat along the first axis to duplicate the data, and then the DataFrame constructor:
arr = np.repeat(s.values[None, :], len(idx), axis=0)
df = pd.DataFrame(arr, index=idx, columns=s.index)
print (df)
     w  x  y  z
a c  1  2  3  4
  d  1  2  3  4
b c  1  2  3  4
  d  1  2  3  4
Timings:
np.random.seed(123)
s = pd.Series(np.random.randint(10, size=1000))
s.index = s.index.astype(str)
idx = pd.MultiIndex.from_product([np.random.randint(10, size=250), ['a','b','c', 'd']])
In [32]: %timeit (pd.DataFrame(np.repeat(s.values[None, :], len(idx), axis=0), index=idx, columns=s.index))
100 loops, best of 3: 3.94 ms per loop
In [33]: %timeit (pd.DataFrame(index=idx).assign(**s))
1 loop, best of 3: 332 ms per loop
In [34]: %timeit pd.DataFrame([s]*len(idx),idx,s.index)
10 loops, best of 3: 82.9 ms per loop
Use [s]*len(idx) as data, idx as index and s.index as columns to reconstruct a df.
pd.DataFrame([s]*len(idx), idx, s.index)
Out[56]:
w x y z
a c 1 2 3 4
d 1 2 3 4
b c 1 2 3 4
d 1 2 3 4
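One more sketch, not among the original answers: numpy.broadcast_to avoids materializing the repeated rows until the final copy:
# broadcast_to returns a read-only view, so copy() to get a writable frame
arr = np.broadcast_to(s.values, (len(idx), len(s)))
df = pd.DataFrame(arr.copy(), index=idx, columns=s.index)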