easy multidimensional numpy ndarray to pandas dataframe method? - numpy

Having a 4-D numpy.ndarray, e.g.
myarr = np.random.rand(10,4,3,2)
dims = {'time': range(1, 11), 'sub': range(1, 5), 'cond': ['A', 'B', 'C'], 'measure': ['meas1', 'meas2']}
But possibly with higher dimensions. How can I create a pandas.DataFrame with a MultiIndex, just passing the dimensions as indexes, without further manual adjustment (reshaping the ndarray into 2-D shape)?
I can't wrap my head around the reshaping, not even really in 3 dimensions quite yet, so I'm searching for an 'automatic' method if possible.
What would be a function to which to pass the column/row indexes and create a dataframe? Something like:
df=nd2df(myarr,dim2row=[0,1],dim2col=[2,3],rowlab=['time','sub'],collab=['cond','measure'])
And end up with something like:
          meas1        meas2
          A  B  C      A  B  C
sub time
1   1
    2
    3
    .
    .
2   1
    2
...
If it is not possible/feasible to do it automatically, an explanation that is less terse than the MultiIndex documentation is appreciated.
I can't even get it right when I don't care about the order of the dimensions, e.g. I would expect this to work:
a = np.arange(24).reshape((3, 2, 2, 2))
iterables = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]
index = pd.MultiIndex.from_product(iterables, names=['time', 'sub', 'meas', 'cond'])
pd.DataFrame(a.reshape(2*3, 2*2), index)
gives:
ValueError: Shape of passed values is (4, 6), indices imply (4, 24)

You're getting the error because you've reshaped the ndarray to 6x4 but are applying an index that expects all 24 dimension combinations as rows. The following setup gets the toy example working:
a = np.arange(24).reshape((3, 2, 2, 2))
iterables = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]
index = pd.MultiIndex.from_product(iterables, names=['time', 'sub', 'meas', 'cond'])
pd.DataFrame(a.reshape(24, 1), index=index)
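From there, the column dimensions can be unstacked out of the row index to get the two-level column layout the question sketches (a short follow-up sketch; the level names are the ones defined above):
df = pd.DataFrame(a.reshape(24, 1), index=index)
# move the 'meas' and 'cond' levels from the row index into the columns
df2 = df[0].unstack(['meas', 'cond'])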
Solution
Here's a generic DataFrame creator that should get the job done:
def produce_df(rows, columns, row_names=None, column_names=None):
    """rows is a list of lists that will be used to build a MultiIndex
    columns is a list of lists that will be used to build a MultiIndex"""
    row_index = pd.MultiIndex.from_product(rows, names=row_names)
    col_index = pd.MultiIndex.from_product(columns, names=column_names)
    return pd.DataFrame(index=row_index, columns=col_index)
Demonstration
Without named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']])
      1         2
      3    4    3    4
a c NaN  NaN  NaN  NaN
  d NaN  NaN  NaN  NaN
b c NaN  NaN  NaN  NaN
  d NaN  NaN  NaN  NaN
With named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']],
           row_names=['alpha1', 'alpha2'], column_names=['number1', 'number2'])
number1          1         2
number2          3    4    3    4
alpha1 alpha2
a      c       NaN  NaN  NaN  NaN
       d       NaN  NaN  NaN  NaN
b      c       NaN  NaN  NaN  NaN
       d       NaN  NaN  NaN  NaN
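produce_df returns an empty frame; to load the question's myarr into it, a single reshape should line up, assuming the row dimensions come first in the array's axis order (a sketch, not part of the original answer):
df = produce_df([range(1, 11), range(1, 5)], [['A', 'B', 'C'], ['meas1', 'meas2']],
                row_names=['time', 'sub'], column_names=['cond', 'measure'])
# rows iterate over (time, sub), columns over (cond, measure), matching myarr's axes
df.loc[:, :] = myarr.reshape(len(df.index), len(df.columns))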

From the structure of your data,
names = ['sub', 'time', 'measure', 'cond']   # ind1, ind2, col1, col2
labels = [[1, 2, 3], [1, 2], ['meas1', 'meas2'], list('ABC')]
A straightforward way to your goal:
index = pd.MultiIndex.from_product(labels, names=names)
data = np.arange(index.size)   # or myarr.flatten()
df = pd.DataFrame(data, index=index)
df22 = df.reset_index().pivot_table(values=0, index=names[:2], columns=names[2:])
"""
measure meas1 meas2
cond A B C A B C
sub time
1 1 0 1 2 3 4 5
2 6 7 8 9 10 11
2 1 12 13 14 15 16 17
2 18 19 20 21 22 23
3 1 24 25 26 27 28 29
2 30 31 32 33 34 35
"""

I still don't know how to do it directly, but here is an easy-to-follow, step-by-step way:
# Create a 4-D array
a = np.arange(24).reshape((3, 2, 2, 2))
# Set only one row index
rowiter = [[1, 2, 3]]
row_ind = pd.MultiIndex.from_product(rowiter, names=['time'])
# Put the rest of the dimensions into columns
coliter = [[1, 2], ['m1', 'm2'], ['A', 'B']]
col_ind = pd.MultiIndex.from_product(coliter, names=['sub', 'meas', 'cond'])
ncols = np.prod([len(c) for c in coliter])
b = pd.DataFrame(a.reshape(len(rowiter[0]), ncols), index=row_ind, columns=col_ind)
print(b)
# Reshape columns to rows as pleased:
b = b.stack('sub')
# Switch levels and order in rows (level 0 is the outermost):
c = b.swaplevel(0, 1, axis=0).sort_index(level=0, axis=0)
To check the correct assignment of dimensions:
print(a[:,0,0,0])
[ 0 8 16]
print(a[0,:,0,0])
[0 4]
print(a[0,0,:,0])
[0 2]
print(b)
meas      m1      m2
cond       A   B   A   B
time sub
1    1     0   1   2   3
     2     4   5   6   7
2    1     8   9  10  11
     2    12  13  14  15
3    1    16  17  18  19
     2    20  21  22  23
print(c)
meas      m1      m2
cond       A   B   A   B
sub time
1   1      0   1   2   3
    2      8   9  10  11
    3     16  17  18  19
2   1      4   5   6   7
    2     12  13  14  15
    3     20  21  22  23
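For the record, the direct route this answer says it is missing does exist for this example: transpose the array so the row dimensions (sub, time) come first, then a single reshape matches both MultiIndexes (a sketch reproducing c above):
row_ind = pd.MultiIndex.from_product([[1, 2], [1, 2, 3]], names=['sub', 'time'])
col_ind = pd.MultiIndex.from_product([['m1', 'm2'], ['A', 'B']], names=['meas', 'cond'])
# a's axes are (time, sub, meas, cond); swap time and sub to match the row order
c2 = pd.DataFrame(a.transpose(1, 0, 2, 3).reshape(6, 4), index=row_ind, columns=col_ind)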

Related

python rolling product on non-adjacent row

I would like to calculate a rolling product over non-adjacent rows, such as the product of the values in every fifth row, as shown in the photo (the result in a blue cell is the product of the data in the blue cells, etc.).
The best I can do now is the following:
temp = pd.DataFrame([range(20)]).transpose()
df = temp.copy()
df['shift1'] = temp.shift(5)
df['shift2'] = temp.shift(10)
df['shift3'] = temp.shift(15)
result = df.product(axis=1)
However, this is cumbersome, as I want to change the row step dynamically.
Can anyone tell me if there is a better way to do this?
Thank you
You can use groupby.cumprod/groupby.prod with the row position modulo 5 as the grouper:
import numpy as np
import pandas as pd

# wrap in a Series so that .duplicated() is available below
m = pd.Series(np.arange(len(df)) % 5, index=df.index)
# option 1
df['result'] = df.groupby(m)['data'].cumprod()
# option 2
df.loc[~m.duplicated(keep='last'), 'result2'] = df.groupby(m)['data'].cumprod()
# or
# df.loc[~m.duplicated(keep='last'),
#        'result2'] = df.groupby(m)['data'].prod().to_numpy()
output:
    data  result  result2
0      0       0      NaN
1      1       1      NaN
2      2       2      NaN
3      3       3      NaN
4      4       4      NaN
5      5       0      NaN
6      6       6      NaN
7      7      14      NaN
8      8      24      NaN
9      9      36      NaN
10    10       0      NaN
11    11      66      NaN
12    12     168      NaN
13    13     312      NaN
14    14     504      NaN
15    15       0      0.0
16    16    1056   1056.0
17    17    2856   2856.0
18    18    5616   5616.0
19    19    9576   9576.0
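Since the asker wants to change the row step dynamically, the modulo grouper is easy to parametrize; a hedged sketch (step=5 reproduces the result column above):
def product_every_nth(s, step):
    """Cumulative product over rows i, i+step, i+2*step, ... for each offset i."""
    g = pd.Series(np.arange(len(s)) % step, index=s.index)
    return s.groupby(g).cumprod()

df['result'] = product_every_nth(df['data'], step=5)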

Pandas groupby calculation using values from different rows based on other column

I have the following dataframe, where observations are grouped in pairs. NaN here represents a different product traded in a pair with A. I want to group by transaction and compute A/NaN so that the value of every NaN row can be expressed in units of A.
transaction  name  value  ...many other columns
          1     A      3
          1   NaN      5
          2   NaN      7
          2     A      6
          3     A      4
          3   NaN      3
          4     A     10
          4   NaN      9
          5     C      8
          5     A      6
..
Thus the desired df would be
transaction  name  value  new_column  ...many other columns
          1     A      3         NaN
          1   NaN      6         0.5
          2   NaN      7      0.8571
          2     A      6         NaN
          3     A      4       1.333
          3   NaN      3         NaN
          4     A     10       1.111
          4   NaN      9         NaN
          5     C      8        0.75
          5     A      6         NaN
...
First filter the rows that have A, set transaction as their index, and then divide the rows with missing names by the A values mapped onto them via Series.map:
m = df['name'].ne('A')
s = df[~m].set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print(df)
   transaction name  value  new_column
0            1    A      3         NaN
1            1  NaN      5    0.600000
2            2  NaN      7    0.857143
3            2    A      6         NaN
4            3    A      4         NaN
5            3  NaN      3    1.333333
6            4    A     10         NaN
7            4  NaN      9    1.111111
8            5  NaN      8    0.750000
9            5    A      6         NaN
EDIT: If there are multiple A values per group, not only one, a possible solution is to remove the duplicates first:
print(df)
    transaction name  value
0             1    A      3
1             1    A      4
2             1  NaN      5
3             2  NaN      7
4             2    A      6
5             3    A      4
6             3  NaN      3
7             4    A     10
8             4  NaN      9
9             5    C      8
10            5    A      6
# s = df[~m].set_index('transaction')['value']
# df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
# print(df)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
m = df['name'].ne('A')
print(df[~m].drop_duplicates(['transaction', 'name']))
    transaction name  value
0             1    A      3
4             2    A      6
5             3    A      4
7             4    A     10
10            5    A      6
s = df[~m].drop_duplicates(['transaction','name']).set_index('transaction')['value']
df.loc[m, 'new_column'] = df.loc[m, 'transaction'].map(s) / df.loc[m, 'value']
print(df)
    transaction name  value  new_column
0             1    A      3         NaN  <- two A rows in group 1
1             1    A      4         NaN  <- two A rows in group 1
2             1  NaN      5    0.600000
3             2  NaN      7    0.857143
4             2    A      6         NaN
5             3    A      4         NaN
6             3  NaN      3    1.333333
7             4    A     10         NaN
8             4  NaN      9    1.111111
9             5    C      8    0.750000
10            5    A      6         NaN
Assuming there are only two values per transaction, you can use agg and divide the first and last element by each other:
df.loc[df['name'].isna(), 'new_column'] = (
    df.sort_values(by='name')
      .groupby('transaction')['value']
      .agg(f='first', l='last')
      .agg(lambda x: x['f'] / x['l'], axis=1)
)
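A transform-based variant sidesteps the index alignment of assigning an aggregated series back; a sketch (with several A rows per transaction it uses the first one, matching the drop_duplicates edit above):
# broadcast the A row's value across each transaction, then take the ratio on non-A rows
a_val = df['value'].where(df['name'].eq('A')).groupby(df['transaction']).transform('first')
df['new_column'] = (a_val / df['value']).where(df['name'].ne('A'))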

Any function/attribute in dataframe similar to attribute 'remove' or 'pop'

Is there any attribute/function for a DataFrame, similar to 'remove' or 'pop', to remove the first occurrence of duplicate indexes in a DataFrame?
Dataframe:
     a  b   c    d
100  1  2   3  NaN
200  4  5   6  NaN
100  7  9  10  NaN
Desired output (after the desired command):
     a  b   c    d
200  4  5   6  NaN
100  7  9  10  NaN
Try boolean indexing with duplicated and keep='last':
>>> df[~df.index.duplicated(keep='last')]
     a  b   c    d
200  4  5   6  NaN
100  7  9  10  NaN
Edit:
df.iloc[np.where(df.index.duplicated(keep='last'))]
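If a label can repeat more than twice and only its first occurrence should go, keep='last' would drop all but the last; a hedged alternative counts positions within each label:
# drop only the first occurrence of each duplicated index label,
# keeping later duplicates as well as unique labels
first_dup = df.index.duplicated(keep=False) & (df.groupby(level=0).cumcount() == 0).to_numpy()
df[~first_dup]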

Selecting Columns to fill based on numbers of items in row

So I have 4 columns in a dataframe: W, X, Y, Z.
I have a CSV file whose rows have 4, 3, or 2 items each.
I am using:
frame = pd.read_csv("file_example.csv", names = [ 'W', 'X', 'Y', 'Z'])
Is there a way to make it so that rows with 3 items fill in W, X, and Z, skipping over Y and leaving it NaN? And similarly, make it so that rows with 2 items fill in W and Z, skipping over X and Y? As it is now, it just fills in the first columns it comes across.
In other words, is there a way to pick and choose which columns a row will fill up based on the number of items in the row?
Thanks.
Edit:
Input (corresponding to the output):
2,seafood,21418
2,stews,24126
2,seafood,23287
2,sandwiches,17429
and
4,6237
4,30815
4,5321
4,49248
Trying the method below, I put in 100 test lines each of 4-, 3-, and 2-item rows.
Sample part of the output:
3-item lines:
2  seafood     21418.0  21418
2  stews       24126.0  24126
2  seafood     23287.0  23287
2  sandwiches  17429.0  17429
2-item lines:
4  6237   NaN  6237
4  30815  NaN  30815
4  5321   NaN  5321
4  49248  NaN  49248
The Z column is filling correctly, but the NaNs are not masking over.
Edit 2: I had not assigned the new dataframe to a variable. The solution works.
import numpy as np
import pandas as pd

df = pd.read_csv('test.csv', names=['W', 'X', 'Y', 'Z'])
df
Out:
    W       X           Y       Z
0  10    Blue  20160809.0   203.0
1  12     Red  20160810.0  4578.0
2   9     Red      3094.0     NaN
3  15  Yellow       109.0     NaN
4   1      86         NaN     NaN
5   5    9384         NaN     NaN
6  56    3490         NaN     NaN
Record the positions of NaNs:
nans = df.isnull().values
Fill Z column:
df['Z'] = df['Z'].fillna(df['Y'].fillna(df['X']))
Shift NaNs to the left:
df.mask(np.roll(nans, -1), np.nan)
Out:
    W       X           Y     Z
0  10    Blue  20160809.0   203
1  12     Red  20160810.0  4578
2   9     Red         NaN  3094
3  15  Yellow         NaN   109
4   1     NaN         NaN    86
5   5     NaN         NaN  9384
6  56     NaN         NaN  3490
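Putting the asker's Edit 2 together: the result of mask must be assigned back, or the shift is lost (a short recap of the steps above):
nans = df.isnull().values                           # remember where the gaps were
df['Z'] = df['Z'].fillna(df['Y'].fillna(df['X']))   # pull each row's last value into Z
df = df.mask(np.roll(nans, -1), np.nan)             # shift the NaN pattern one cell left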

transforming data frame in ipython a little like transpose

Suppose I have a data frame like the following in pandas:
a   1  11
a   3  12
a  20  13
b   2  14
b   4  15
I want to generate a resulting data frame like this:
V1    1    2    3    4   20
a    11  NaN   12  NaN   13
b   NaN   14  NaN   15  NaN
How can I get this transformation?
Thank you.
You can use pivot:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b'],
                   'col2': [1, 3, 20, 2, 4],
                   'col3': [11, 12, 13, 14, 15]})
print(df.pivot(index='col1', columns='col2'))
Output:
      col3
col2     1    2    3    4   20
col1
a       11  NaN   12  NaN   13
b      NaN   14  NaN   15  NaN
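An equivalent spelling via set_index and unstack, in case pivot's argument order is hard to remember (a sketch):
print(df.set_index(['col1', 'col2'])['col3'].unstack('col2'))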