I am new to Pandas and I am stuck at this specific problem where I have 2 DataFrames in Pandas, e.g.
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
>>> df2
A B
0 Nan 0.05
1 Nan 0.05
2 0.16 Nan
3 0.16 Nan
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2 i.e.
>>> df3
A B
0 Nan 9
1 Nan 6
2 3 Nan
3 4 Nan
I am talking about dfs with 10,000 rows each so I can't do this manually. Also indices and columns are the exact same in each case. I also have no NaN values in df1.
As far as I understand df.update() will either overwrite all values including NaN or update only those that are NaN.
You can use boolean masking using DataFrame.notna.
# df2 = df2.astype(float) # This needed if your dtypes are not floats.
m = df2.notna()
df1[m]
A B
0 NaN 9.0
1 NaN 6.0
2 3.0 NaN
3 4.0 NaN
I need to regroup a df from the above format in the one below but it fails and the output shape is (unique number of IDs, 2). Is there a more obvious solution?
You can use groupby and pivot:
(df.assign(n=df.groupby('ID').cumcount().add(1))
.pivot(index='ID', columns='n', values='Value')
.add_prefix('val_')
.reset_index()
)
Example input:
df = pd.DataFrame({'ID': [7,7,8,11,12,18,22,22,22],
'Value': list('abcdefghi')})
Output:
n ID val_1 val_2 val_3
0 7 a b NaN
1 8 c NaN NaN
2 11 d NaN NaN
3 12 e NaN NaN
4 18 f NaN NaN
5 22 g h i
I think my toy example below is self-explanatory. Basically, I can plot a line based on 5 values, yet if I interpolate NaNs the resulting line plot is empty. I would expect that matplotlib would still be able to connect the discrete existing points in my data (which are all still present).
a = pd.DataFrame([1,2,3,4,5], index=range(0, 10, 2), columns=['value'])
print(a)
value
0 1
2 2
4 3
6 4
8 5
a.plot()
b = pd.DataFrame([np.NaN]*5, index=range(1, 11, 2), columns=['value'])
print(pd.concat([a, b]).sort_index())
value
0 1.0
1 NaN
2 2.0
3 NaN
4 3.0
5 NaN
6 4.0
7 NaN
8 5.0
9 NaN
pd.concat([a, b]).sort_index().plot()
I created a dataframe df1:
df1 = pd.read_csv('FBK_var_conc_1.csv', names = ['Cycle', 'SQ'])
df1 = df1['SQ'].copy()
df1 = df1.to_frame()
df1.head(n=10)
SQ
0 2430.0
1 2870.0
2 2890.0
3 3270.0
4 3350.0
5 3520.0
6 26900.0
7 26300.0
8 28400.0
9 3230.0
And then created a second dataframe df2, that I want to fill with the row values of df 1:
df2 = pd.DataFrame()
for x in range(12):
y='Experiment %d' % (x+1)
df2[y]= df1.iloc[3*x:3*x+3]
df2
I get the column names from Experiment 1 - Experiment 12 in df2 and the first column i filled with the right values, but all following columns are filled with N/A.
> Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment
> 10 Experiment 11 Experiment 12
> 0 2430.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
> 1 2870.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
> 2 2890.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I've been looking at this for the last 2 hours but can't figure out why the columns after column 1 aren't filled with values.
Desired output:
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
2430 3270 26900 3230 2940 243000 256000 249000 2880 26100 3890 33400
2870 3350 26300 3290 3180 242000 254000 250000 3390 27900 3730 30700
2890 3520 28400 3090 3140 253000 260000 237000 3510 27400 3760 29600
I found the issue.
I had to use .values
So the final line of the loop has to be:
df2[y] = df1.iloc[3*x:3*x+3].values
and I get the right output
Having a 4-D numpy.ndarray, e.g.
myarr = np.random.rand(10,4,3,2)
dims={'time':1:10,'sub':1:4,'cond':['A','B','C'],'measure':['meas1','meas2']}
But with possible higher dimensions. How can I create a pandas.dataframe with multiindex, just passing the dimensions as indexes, without further manual adjustments (reshaping the ndarray into 2D shape)?
I can't wrap my head around the reshaping, not even really in 3 dimensions quite yet, so I'm searching for an 'automatic' method if possible.
What would be a function to which to pass the column/row indexes and create a dataframe? Something like:
df=nd2df(myarr,dim2row=[0,1],dim2col=[2,3],rowlab=['time','sub'],collab=['cond','measure'])
And and up with something like:
meas1 meas2
A B C A B C
sub time
1 1
2
3
.
.
2 1
2
...
If it is not possible/feasible to do it automatized, an explanation that is less terse than the Multiindexing manual is appreciated.
I can't even get it right when I don't care about the order of the dimensions, e.g. I would expect this to work:
a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])
pd.DataFrame(a.reshape(2*3*1,2*2),index)
gives:
ValueError: Shape of passed values is (4, 6), indices imply (4, 24)
You're getting the error because you've reshaped the ndarray as 6x4 and applying an index intended to capture all dimensions in a single series. The following is a setup to get the pet example working:
a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
index = pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])
pd.DataFrame(a.reshape(24, 1),index=index)
Solution
Here's a generic DataFrame creator that should get the job done:
def produce_df(rows, columns, row_names=None, column_names=None):
"""rows is a list of lists that will be used to build a MultiIndex
columns is a list of lists that will be used to build a MultiIndex"""
row_index = pd.MultiIndex.from_product(rows, names=row_names)
col_index = pd.MultiIndex.from_product(columns, names=column_names)
return pd.DataFrame(index=row_index, columns=col_index)
Demonstration
Without named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']])
1 2
3 4 3 4
a c NaN NaN NaN NaN
d NaN NaN NaN NaN
b c NaN NaN NaN NaN
d NaN NaN NaN NaN
With named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']],
row_names=['alpha1', 'alpha2'], column_names=['number1', 'number2'])
number1 1 2
number2 3 4 3 4
alpha1 alpha2
a c NaN NaN NaN NaN
d NaN NaN NaN NaN
b c NaN NaN NaN NaN
d NaN NaN NaN NaN
From the structure of your data,
names=['sub','time','measure','cond'] #ind1,ind2,col1,col2
labels=[[1,2,3],[1,2],['meas1','meas2'],list('ABC')]
A straightforward way to your goal:
index = pd.MultiIndex.from_product(labels,names=names)
data=arange(index.size) # or myarr.flatten()
df=pd.DataFrame(data,index=index)
df22=df.reset_index().pivot_table(values=0,index=names[:2],columns=names[2:])
"""
measure meas1 meas2
cond A B C A B C
sub time
1 1 0 1 2 3 4 5
2 6 7 8 9 10 11
2 1 12 13 14 15 16 17
2 18 19 20 21 22 23
3 1 24 25 26 27 28 29
2 30 31 32 33 34 35
"""
I still don't know how to do it directly, but here is an easy-to-follow step by step way:
# Create 4D-array
a=np.arange(24).reshape((3,2,2,2))
# Set only one row index
rowiter=[[1,2,3]]
row_ind=pd.MultiIndex.from_product(rowiter, names=[u'time'])
# put the rest of dimenstion into columns
coliter=[[1,2],['m1','m2'],['A','B']]
col_ind=pd.MultiIndex.from_product(coliter, names=[u'sub',u'meas',u'cond'])
ncols=np.prod([len(coliter[x]) for x in range(len(coliter))])
b=pd.DataFrame(a.reshape(len(rowiter[0]),ncols),index=row_ind,columns=col_ind)
print(b)
# Reshape columns to rows as pleased:
b=b.stack('sub')
# switch levels and order in rows (level goes from inner to outer):
c=b.swaplevel(0,1,axis=0).sortlevel(0,axis=0)
To check the correct assignment of dimensions:
print(a[:,0,0,0])
[ 0 8 16]
print(a[0,:,0,0])
[0 4]
print(a[0,0,:,0])
[0 2]
print(b)
meas m1 m2
cond A B A B
time sub
1 1 0 1 2 3
2 4 5 6 7
2 1 8 9 10 11
2 12 13 14 15
3 1 16 17 18 19
2 20 21 22 23
print(c)
meas m1 m2
cond A B A B
sub time
1 1 0 1 2 3
2 8 9 10 11
3 16 17 18 19
2 1 4 5 6 7
2 12 13 14 15
3 20 21 22 23