Selecting Columns to fill based on numbers of items in row - pandas

So I have 4 columns in a dataframe: W, X, Y, Z.
I have a CSV file that has some rows with some having 4 items, 3 items, and 2 items per row.
I am using:
frame = pd.read_csv("file_example.csv", names = [ 'W', 'X', 'Y', 'Z'])
Is there a way to make it so that the rows with 3 fill in W X and Z skipping over Y and leaving then NAN? And similarly make it so that the rows with 2 items fill in W and Z, skipping over X and Y? As it is now it just fills in the first columns it comes across.
In other words, is there a way to pick and choose which columns a row will fill up based on the number of items in the row?
Thanks.
Edit:
Input (corresponding to the output):
2,seafood,21418
2,stews,24126
2,seafood,23287
2,sandwiches,17429
and
4,6237
4,30815
4,5321
4,49248
Trying the method below, I put 100 test lines each of 4,3,2 item rows.
Sample part of output:
3 item line:
2 seafood 21418.0 21418
2 stews 24126.0 24126
2 seafood 23287.0 23287
2 sandwiches 17429.0 17429
2 item line:
4 6237 NaN 6237
4 30815 NaN 30815
4 5321 NaN 5321
4 49248 NaN 49248
The z is filling correctly, but the NaNs are not masking over.
Edit 2: Did not assign the new dataframe to a variable. Solution works.

import numpy as np
import pandas as pd
df = pd.read_csv('test.csv', names=['W', 'X', 'Y', 'Z'])
df
Out:
W X Y Z
0 10 Blue 20160809.0 203.0
1 12 Red 20160810.0 4578.0
2 9 Red 3094.0 NaN
3 15 Yellow 109.0 NaN
4 1 86 NaN NaN
5 5 9384 NaN NaN
6 56 3490 NaN NaN
Record the positions of NaNs:
nans = df.isnull().values
Fill Z column:
df['Z'] = df['Z'].fillna(df['Y'].fillna(df['X']))
Shift NaNs to the left:
df.mask(np.roll(nans, -1), np.nan)
Out:
W X Y Z
0 10 Blue 20160809.0 203
1 12 Red 20160810.0 4578
2 9 Red NaN 3094
3 15 Yellow NaN 109
4 1 NaN NaN 86
5 5 NaN NaN 9384
6 56 NaN NaN 3490

Related

How to keep all values from a dataframe except where NaN is present in another dataframe?

I am new to Pandas and I am stuck at this specific problem where I have 2 DataFrames in Pandas, e.g.
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
>>> df2
A B
0 Nan 0.05
1 Nan 0.05
2 0.16 Nan
3 0.16 Nan
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2 i.e.
>>> df3
A B
0 Nan 9
1 Nan 6
2 3 Nan
3 4 Nan
I am talking about dfs with 10,000 rows each so I can't do this manually. Also indices and columns are the exact same in each case. I also have no NaN values in df1.
As far as I understand df.update() will either overwrite all values including NaN or update only those that are NaN.
You can use boolean masking using DataFrame.notna.
# df2 = df2.astype(float) # This needed if your dtypes are not floats.
m = df2.notna()
df1[m]
A B
0 NaN 9.0
1 NaN 6.0
2 3.0 NaN
3 4.0 NaN

regroup uneven number of rows pandas df

I need to regroup a df from the above format in the one below but it fails and the output shape is (unique number of IDs, 2). Is there a more obvious solution?
You can use groupby and pivot:
(df.assign(n=df.groupby('ID').cumcount().add(1))
.pivot(index='ID', columns='n', values='Value')
.add_prefix('val_')
.reset_index()
)
Example input:
df = pd.DataFrame({'ID': [7,7,8,11,12,18,22,22,22],
'Value': list('abcdefghi')})
Output:
n ID val_1 val_2 val_3
0 7 a b NaN
1 8 c NaN NaN
2 11 d NaN NaN
3 12 e NaN NaN
4 18 f NaN NaN
5 22 g h i

Why does interpolating NaNs result in an empty plot?

I think my toy example below is self-explanatory. Basically, I can plot a line based on 5 values, yet if I interpolate NaNs the resulting line plot is empty. I would expect that matplotlib would still be able to connect the discrete existing points in my data (which are all still present).
a = pd.DataFrame([1,2,3,4,5], index=range(0, 10, 2), columns=['value'])
print(a)
value
0 1
2 2
4 3
6 4
8 5
a.plot()
b = pd.DataFrame([np.NaN]*5, index=range(1, 11, 2), columns=['value'])
print(pd.concat([a, b]).sort_index())
value
0 1.0
1 NaN
2 2.0
3 NaN
4 3.0
5 NaN
6 4.0
7 NaN
8 5.0
9 NaN
pd.concat([a, b]).sort_index().plot()

Create new dataframe columns from old dataframe rows using for loop --> N/A values

I created a dataframe df1:
df1 = pd.read_csv('FBK_var_conc_1.csv', names = ['Cycle', 'SQ'])
df1 = df1['SQ'].copy()
df1 = df1.to_frame()
df1.head(n=10)
SQ
0 2430.0
1 2870.0
2 2890.0
3 3270.0
4 3350.0
5 3520.0
6 26900.0
7 26300.0
8 28400.0
9 3230.0
And then created a second dataframe df2, that I want to fill with the row values of df 1:
df2 = pd.DataFrame()
for x in range(12):
y='Experiment %d' % (x+1)
df2[y]= df1.iloc[3*x:3*x+3]
df2
I get the column names from Experiment 1 - Experiment 12 in df2 and the first column i filled with the right values, but all following columns are filled with N/A.
> Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment
> 10 Experiment 11 Experiment 12
> 0 2430.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
> 1 2870.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
> 2 2890.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I've been looking at this for the last 2 hours but can't figure out why the columns after column 1 aren't filled with values.
Desired output:
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
2430 3270 26900 3230 2940 243000 256000 249000 2880 26100 3890 33400
2870 3350 26300 3290 3180 242000 254000 250000 3390 27900 3730 30700
2890 3520 28400 3090 3140 253000 260000 237000 3510 27400 3760 29600
I found the issue.
I had to use .values
So the final line of the loop has to be:
df2[y] = df1.iloc[3*x:3*x+3].values
and I get the right output

easy multidimensional numpy ndarray to pandas dataframe method?

Having a 4-D numpy.ndarray, e.g.
myarr = np.random.rand(10,4,3,2)
dims={'time':1:10,'sub':1:4,'cond':['A','B','C'],'measure':['meas1','meas2']}
But with possible higher dimensions. How can I create a pandas.dataframe with multiindex, just passing the dimensions as indexes, without further manual adjustments (reshaping the ndarray into 2D shape)?
I can't wrap my head around the reshaping, not even really in 3 dimensions quite yet, so I'm searching for an 'automatic' method if possible.
What would be a function to which to pass the column/row indexes and create a dataframe? Something like:
df=nd2df(myarr,dim2row=[0,1],dim2col=[2,3],rowlab=['time','sub'],collab=['cond','measure'])
And and up with something like:
meas1 meas2
A B C A B C
sub time
1 1
2
3
.
.
2 1
2
...
If it is not possible/feasible to do it automatized, an explanation that is less terse than the Multiindexing manual is appreciated.
I can't even get it right when I don't care about the order of the dimensions, e.g. I would expect this to work:
a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])
pd.DataFrame(a.reshape(2*3*1,2*2),index)
gives:
ValueError: Shape of passed values is (4, 6), indices imply (4, 24)
You're getting the error because you've reshaped the ndarray as 6x4 and applying an index intended to capture all dimensions in a single series. The following is a setup to get the pet example working:
a=np.arange(24).reshape((3,2,2,2))
iterables=[[1,2,3],[1,2],['m1','m2'],['A','B']]
index = pd.MultiIndex.from_product(iterables, names=['time','sub','meas','cond'])
pd.DataFrame(a.reshape(24, 1),index=index)
Solution
Here's a generic DataFrame creator that should get the job done:
def produce_df(rows, columns, row_names=None, column_names=None):
"""rows is a list of lists that will be used to build a MultiIndex
columns is a list of lists that will be used to build a MultiIndex"""
row_index = pd.MultiIndex.from_product(rows, names=row_names)
col_index = pd.MultiIndex.from_product(columns, names=column_names)
return pd.DataFrame(index=row_index, columns=col_index)
Demonstration
Without named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']])
1 2
3 4 3 4
a c NaN NaN NaN NaN
d NaN NaN NaN NaN
b c NaN NaN NaN NaN
d NaN NaN NaN NaN
With named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']],
row_names=['alpha1', 'alpha2'], column_names=['number1', 'number2'])
number1 1 2
number2 3 4 3 4
alpha1 alpha2
a c NaN NaN NaN NaN
d NaN NaN NaN NaN
b c NaN NaN NaN NaN
d NaN NaN NaN NaN
From the structure of your data,
names=['sub','time','measure','cond'] #ind1,ind2,col1,col2
labels=[[1,2,3],[1,2],['meas1','meas2'],list('ABC')]
A straightforward way to your goal:
index = pd.MultiIndex.from_product(labels,names=names)
data=arange(index.size) # or myarr.flatten()
df=pd.DataFrame(data,index=index)
df22=df.reset_index().pivot_table(values=0,index=names[:2],columns=names[2:])
"""
measure meas1 meas2
cond A B C A B C
sub time
1 1 0 1 2 3 4 5
2 6 7 8 9 10 11
2 1 12 13 14 15 16 17
2 18 19 20 21 22 23
3 1 24 25 26 27 28 29
2 30 31 32 33 34 35
"""
I still don't know how to do it directly, but here is an easy-to-follow step by step way:
# Create 4D-array
a=np.arange(24).reshape((3,2,2,2))
# Set only one row index
rowiter=[[1,2,3]]
row_ind=pd.MultiIndex.from_product(rowiter, names=[u'time'])
# put the rest of dimenstion into columns
coliter=[[1,2],['m1','m2'],['A','B']]
col_ind=pd.MultiIndex.from_product(coliter, names=[u'sub',u'meas',u'cond'])
ncols=np.prod([len(coliter[x]) for x in range(len(coliter))])
b=pd.DataFrame(a.reshape(len(rowiter[0]),ncols),index=row_ind,columns=col_ind)
print(b)
# Reshape columns to rows as pleased:
b=b.stack('sub')
# switch levels and order in rows (level goes from inner to outer):
c=b.swaplevel(0,1,axis=0).sortlevel(0,axis=0)
To check the correct assignment of dimensions:
print(a[:,0,0,0])
[ 0 8 16]
print(a[0,:,0,0])
[0 4]
print(a[0,0,:,0])
[0 2]
print(b)
meas m1 m2
cond A B A B
time sub
1 1 0 1 2 3
2 4 5 6 7
2 1 8 9 10 11
2 12 13 14 15
3 1 16 17 18 19
2 20 21 22 23
print(c)
meas m1 m2
cond A B A B
sub time
1 1 0 1 2 3
2 8 9 10 11
3 16 17 18 19
2 1 4 5 6 7
2 12 13 14 15
3 20 21 22 23