transforming data frame in ipython a little like transpose - pandas

Suppose I have a data frame like the following in pandas:
a 1 11
a 3 12
a 20 13
b 2 14
b 4 15
I want to generate a resulting data frame like this:
V1 1 2 3 4 20
a 11 NaN 12 NaN 13
b NaN 14 NaN 15 NaN
How can I get this transformation?
Thank you.

You can use pivot:
import pandas as pd
df = pd.DataFrame({'col1': ['a','a','a','b','b'],
                   'col2': [1,3,20,2,4],
                   'col3': [11,12,13,14,15]})
print(df.pivot(index='col1', columns='col2'))
Output:
col3
col2 1 2 3 4 20
col1
a 11 NaN 12 NaN 13
b NaN 14 NaN 15 NaN
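The output above carries an extra col3 level in the column header because no values column was named. A minimal sketch, assuming only col3 is wanted, that passes values='col3' to get a flat header (the name wide is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b'],
                   'col2': [1, 3, 20, 2, 4],
                   'col3': [11, 12, 13, 14, 15]})

# Naming the values column yields single-level columns (1, 2, 3, 4, 20)
wide = df.pivot(index='col1', columns='col2', values='col3')
print(wide)
```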

Related

regroup uneven number of rows pandas df

I need to regroup a df from the above format in the one below but it fails and the output shape is (unique number of IDs, 2). Is there a more obvious solution?
You can use groupby and pivot:
(df.assign(n=df.groupby('ID').cumcount().add(1))
.pivot(index='ID', columns='n', values='Value')
.add_prefix('val_')
.reset_index()
)
Example input:
df = pd.DataFrame({'ID': [7,7,8,11,12,18,22,22,22],
'Value': list('abcdefghi')})
Output:
n ID val_1 val_2 val_3
0 7 a b NaN
1 8 c NaN NaN
2 11 d NaN NaN
3 12 e NaN NaN
4 18 f NaN NaN
5 22 g h i
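To see why this works, it may help to look at the helper column on its own. A small sketch using the same example input: cumcount numbers the rows inside each ID group, and those numbers become the column labels of the pivot.

```python
import pandas as pd

df = pd.DataFrame({'ID': [7, 7, 8, 11, 12, 18, 22, 22, 22],
                   'Value': list('abcdefghi')})

# Position of each row within its ID group, starting at 1
n = df.groupby('ID').cumcount().add(1)
print(df.assign(n=n))

# Each (ID, n) pair is unique, so pivot can spread n across the columns
wide = df.assign(n=n).pivot(index='ID', columns='n', values='Value')
print(wide)
```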

How to groupby a dataframe with two level header and generate box plot?

Now I have a dataframe like below (original dataframe):
Equipment   A   B   C
        1  10  10  10
        1  11  11  11
        2  12  12  12
        2  13  13  13
        3  14  14  14
        3  15  15  15
And I want to transform the dataframe like below (transformed dataframe):
 1  -  -  2  -  -  3  -  -
 A  B  C  A  B  C  A  B  C
10 10 10 12 12 12 14 14 14
11 11 11 13 13 13 15 15 15
How can I make such groupby transformation with two level header by Pandas?
Additionally, I want to use the transformed dataframe to generate box plot, and the whole box plot is divided into three parts (i.e. 1,2,3), and each part has three box plots (i.e. A,B,C). Can I use the transformed dataframe in Image 2 without any processing? Or can I realize the box plotting only by the original dataframe?
Thank you so much.
Try:
g = df.groupby('Equipment')[df.columns[1:]].apply(lambda x: x.reset_index(drop=True).T).T
g:
Equipment 1 2 3
A B C A B C A B C
0 10 10 10 12 12 12 14 14 14
1 11 11 11 13 13 13 15 15 15
Explanation:
grp = df.groupby('Equipment')[df.columns[1:]]
grp.apply(print)
A B C
0 10 10 10
1 11 11 11
A B C
2 12 12 12
3 13 13 13
A B C
4 14 14 14
5 15 15 15
You can see that each Equipment group (1, 2, 3) keeps its original index labels: 0 1, 2 3 and 4 5.
That is why I used reset_index(drop=True): it relabels the rows of every group as 0 1.
Without reset_index:
df.groupby('Equipment')[df.columns[1:]].apply(lambda x: x.T)
0 1 2 3 4 5
Equipment
1 A 10.0 11.0 NaN NaN NaN NaN
B 10.0 11.0 NaN NaN NaN NaN
C 10.0 11.0 NaN NaN NaN NaN
2 A NaN NaN 12.0 13.0 NaN NaN
B NaN NaN 12.0 13.0 NaN NaN
C NaN NaN 12.0 13.0 NaN NaN
3 A NaN NaN NaN NaN 14.0 15.0
B NaN NaN NaN NaN 14.0 15.0
C NaN NaN NaN NaN 14.0 15.0
Look at the values in columns (2, 3) and (4, 5): they should sit in columns (0, 1) instead, which is what reset_index(drop=True) achieves:
0 1
Equipment
1 A 10 11
B 10 11
C 10 11
2 A 12 13
B 12 13
C 12 13
3 A 14 15
B 14 15
C 14 15
Experiment with the code to see what happens at each step.
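As an alternative to groupby plus apply, the same two-level layout can probably be reached with cumcount and unstack. This is a sketch under the assumption that the original frame has exactly the columns Equipment, A, B and C; the names row and wide are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Equipment': [1, 1, 2, 2, 3, 3],
    'A': [10, 11, 12, 13, 14, 15],
    'B': [10, 11, 12, 13, 14, 15],
    'C': [10, 11, 12, 13, 14, 15],
})

# Number the rows inside each Equipment group (0, 1), then move
# Equipment from the row index into the column header.
wide = (df.assign(row=df.groupby('Equipment').cumcount())
          .set_index(['row', 'Equipment'])
          .unstack('Equipment')   # columns: (A/B/C, Equipment)
          .swaplevel(axis=1)      # columns: (Equipment, A/B/C)
          .sort_index(axis=1))    # group the columns by Equipment
print(wide)
```

For the box plot, the original frame may already be enough: df.groupby('Equipment').boxplot(column=['A', 'B', 'C']) should draw one panel per equipment with a box for each of A, B and C, assuming matplotlib is installed.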

Make all values after a label have the same value of that label

I have a data frame:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 10, size=(5, 2)), columns=['col1', 'col2'])
Which generates the following frame:
col1 col2
0 6 3
1 7 4
2 6 9
3 2 6
4 7 4
I want to replace all values from row 2 onward with the values in row 1. So I type:
df.loc[2:] = df.loc[1:1]
But the resulting frame is filled with nan:
col1 col2
0 6.0 3.0
1 7.0 4.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
I know I can use fillna(method='ffill') to get what I want, but why did the broadcasting not work, and why is the result NaN? Expected result:
col1 col2
0 6 3
1 7 4
2 7 4
3 7 4
4 7 4
Edit: pandas version 0.24.2
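df.loc[1:1] is not empty: label-based slices include both endpoints, so it is a one-row DataFrame labelled 1. Assigning a DataFrame through .loc aligns on index labels, and rows 2 to 4 have no matching label on the right-hand side, hence the NaN. Assign the raw values instead so that they broadcast: df.loc[2:] = df.loc[1].to_numpy().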
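A short sketch of the difference between assigning a labelled one-row frame and assigning plain values. The frame is written out by hand with the values shown in the question, and the NaN case uses a float copy so that the missing values fit the column dtype (an assumption made for the demo, not part of the question).

```python
import pandas as pd

df = pd.DataFrame({'col1': [6, 7, 6, 2, 7], 'col2': [3, 4, 9, 6, 4]})

# A label slice includes both endpoints: this is one row, not an empty frame
one_row = df.loc[1:1]
print(one_row.shape)

# Assigning a DataFrame aligns on index labels. Labels 2, 3 and 4 do not
# exist in one_row, so those rows are filled with NaN.
aligned = df.astype(float)
aligned.loc[2:] = one_row
print(aligned)

# A plain NumPy array carries no labels, so it broadcasts across the rows
fixed = df.copy()
fixed.loc[2:] = df.loc[1].to_numpy()
print(fixed)
```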

Filling data into empty dataframe from other dataframes

I have an empty dataframe as:
import pandas as pd
df = pd.DataFrame(columns = ['A', 'B', 'C', 'D'])
I have another dataframe as:
df1 =
A D B
20181010 12 13
20181010 14 13
20181010 5 13
20181010 7 13
I want to fill df with data from df1 to get another dataframe as:
A B C D
20181010 13 NaN 12
20181010 13 NaN 14
20181010 13 NaN 5
20181010 13 NaN 7
df1 is missing column C so it gets filled with NaN. Other versions of df1 has other missing columns.
I am not sure how to populate df with data from df1
By using reindex
df1.reindex(columns=df.columns)
Out[92]:
A B C D
0 20181010 13 NaN 12
1 20181010 13 NaN 14
2 20181010 13 NaN 5
3 20181010 13 NaN 7
In this case pd.concat will do:
df = pd.concat((df,df1))
>>> df
A B C D
0 20181010 13 NaN 12
1 20181010 13 NaN 14
2 20181010 13 NaN 5
3 20181010 13 NaN 7
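Since other versions of df1 may lack different columns, the reindex call can be wrapped in a small helper. This is only a sketch: conform, TARGET_COLUMNS and the second frame df2 are hypothetical names and data invented for illustration.

```python
import pandas as pd

TARGET_COLUMNS = ['A', 'B', 'C', 'D']

def conform(frame, columns=TARGET_COLUMNS):
    """Return frame with exactly `columns`, in that order; absent ones become NaN."""
    return frame.reindex(columns=columns)

df1 = pd.DataFrame({'A': [20181010] * 4, 'D': [12, 14, 5, 7], 'B': [13] * 4})
# Made-up second input that lacks B and D instead of C
df2 = pd.DataFrame({'C': [1, 2], 'A': [20181011, 20181011]})

out1 = conform(df1)
out2 = conform(df2)
print(out1)
print(out2)
```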

easy multidimensional numpy ndarray to pandas dataframe method?

Having a 4-D numpy.ndarray, e.g.
myarr = np.random.rand(10,4,3,2)
dims={'time':range(1,11),'sub':range(1,5),'cond':['A','B','C'],'measure':['meas1','meas2']}
But with possible higher dimensions. How can I create a pandas.dataframe with multiindex, just passing the dimensions as indexes, without further manual adjustments (reshaping the ndarray into 2D shape)?
I can't wrap my head around the reshaping, not even really in 3 dimensions quite yet, so I'm searching for an 'automatic' method if possible.
What would be a function to which to pass the column/row indexes and create a dataframe? Something like:
df=nd2df(myarr,dim2row=[0,1],dim2col=[2,3],rowlab=['time','sub'],collab=['cond','measure'])
And end up with something like:
meas1 meas2
A B C A B C
sub time
1 1
2
3
.
.
2 1
2
...
If it is not possible/feasible to do it automatized, an explanation that is less terse than the Multiindexing manual is appreciated.
I can't even get it right when I don't care about the order of the dimensions, e.g. I would expect this to work:
a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
index=pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])
pd.DataFrame(a.reshape(2*3*1,2*2),index)
gives:
ValueError: Shape of passed values is (4, 6), indices imply (4, 24)
You're getting the error because you've reshaped the ndarray to 6x4 while applying an index that enumerates all 24 combinations, i.e. one meant for a single column of values. The following setup gets the toy example working:
a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
index = pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])
pd.DataFrame(a.reshape(24, 1),index=index)
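From that single-column layout, chosen index levels can be moved into the columns with unstack. A minimal sketch using the same toy array; which levels to unstack (meas and cond here) is an arbitrary choice for the demo.

```python
import numpy as np
import pandas as pd

a = np.arange(24).reshape((3, 2, 2, 2))
iterables = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]
index = pd.MultiIndex.from_product(iterables,
                                   names=['time', 'sub', 'meas', 'cond'])

# One value per index entry; ravel() walks the array in the same
# (row-major) order in which from_product enumerates the labels
s = pd.Series(a.ravel(), index=index)

# Move the last two levels into the column header
wide = s.unstack(['meas', 'cond'])
print(wide)
```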
Solution
Here's a generic DataFrame creator that should get the job done:
def produce_df(rows, columns, row_names=None, column_names=None):
    """rows is a list of lists that will be used to build a MultiIndex
    columns is a list of lists that will be used to build a MultiIndex"""
    row_index = pd.MultiIndex.from_product(rows, names=row_names)
    col_index = pd.MultiIndex.from_product(columns, names=column_names)
    return pd.DataFrame(index=row_index, columns=col_index)
Demonstration
Without named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']])
1 2
3 4 3 4
a c NaN NaN NaN NaN
d NaN NaN NaN NaN
b c NaN NaN NaN NaN
d NaN NaN NaN NaN
With named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']],
row_names=['alpha1', 'alpha2'], column_names=['number1', 'number2'])
number1 1 2
number2 3 4 3 4
alpha1 alpha2
a c NaN NaN NaN NaN
d NaN NaN NaN NaN
b c NaN NaN NaN NaN
d NaN NaN NaN NaN
From the structure of your data,
names=['sub','time','measure','cond'] #ind1,ind2,col1,col2
labels=[[1,2,3],[1,2],['meas1','meas2'],list('ABC')]
A straightforward way to your goal:
index = pd.MultiIndex.from_product(labels,names=names)
data=np.arange(index.size) # or myarr.flatten()
df=pd.DataFrame(data,index=index)
df22=df.reset_index().pivot_table(values=0,index=names[:2],columns=names[2:])
"""
measure meas1 meas2
cond A B C A B C
sub time
1 1 0 1 2 3 4 5
2 6 7 8 9 10 11
2 1 12 13 14 15 16 17
2 18 19 20 21 22 23
3 1 24 25 26 27 28 29
2 30 31 32 33 34 35
"""
I still don't know how to do it directly, but here is an easy-to-follow step by step way:
# Create 4D-array
a=np.arange(24).reshape((3,2,2,2))
# Set only one row index
rowiter=[[1,2,3]]
row_ind=pd.MultiIndex.from_product(rowiter, names=[u'time'])
# put the rest of the dimensions into columns
coliter=[[1,2],['m1','m2'],['A','B']]
col_ind=pd.MultiIndex.from_product(coliter, names=[u'sub',u'meas',u'cond'])
ncols=np.prod([len(coliter[x]) for x in range(len(coliter))])
b=pd.DataFrame(a.reshape(len(rowiter[0]),ncols),index=row_ind,columns=col_ind)
print(b)
# Reshape columns to rows as pleased:
b=b.stack('sub')
# switch levels and order in rows (level goes from inner to outer):
c=b.swaplevel(0,1,axis=0).sort_index(level=0,axis=0)
To check the correct assignment of dimensions:
print(a[:,0,0,0])
[ 0 8 16]
print(a[0,:,0,0])
[0 4]
print(a[0,0,:,0])
[0 2]
print(b)
meas m1 m2
cond A B A B
time sub
1 1 0 1 2 3
2 4 5 6 7
2 1 8 9 10 11
2 12 13 14 15
3 1 16 17 18 19
2 20 21 22 23
print(c)
meas m1 m2
cond A B A B
sub time
1 1 0 1 2 3
2 8 9 10 11
3 16 17 18 19
2 1 4 5 6 7
2 12 13 14 15
3 20 21 22 23
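The steps above can be folded into one helper along the lines of the nd2df function asked for in the question. This is a sketch, not an established pandas API: the signature and the argument names row_dims, col_dims, labels and names are assumptions chosen for illustration. It transposes the array so the row axes come first, flattens it to 2D, and builds both MultiIndexes from the per-axis labels.

```python
import numpy as np
import pandas as pd

def nd2df(arr, row_dims, col_dims, labels, names):
    """Hypothetical helper: lay out an N-D array as a 2-D DataFrame.

    row_dims / col_dims list the axes that go to rows / columns (outer first);
    labels[i] and names[i] describe axis i of arr.
    """
    # Bring the row axes to the front, followed by the column axes
    moved = np.transpose(arr, list(row_dims) + list(col_dims))
    n_rows = int(np.prod([arr.shape[d] for d in row_dims]))
    flat = moved.reshape(n_rows, -1)
    row_index = pd.MultiIndex.from_product([labels[d] for d in row_dims],
                                           names=[names[d] for d in row_dims])
    col_index = pd.MultiIndex.from_product([labels[d] for d in col_dims],
                                           names=[names[d] for d in col_dims])
    return pd.DataFrame(flat, index=row_index, columns=col_index)

a = np.arange(24).reshape((3, 2, 2, 2))
labels = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]
names = ['time', 'sub', 'meas', 'cond']

# Rows: sub (outer), time (inner); columns: meas, cond
df = nd2df(a, row_dims=[1, 0], col_dims=[2, 3], labels=labels, names=names)
print(df)
```

With row_dims=[1, 0] the result has the same layout as the frame c printed above; swapping the order to [0, 1] gives the layout of b.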