Insert index to level in multiindex dataframe

Is it possible to add a new label to one level of a MultiIndex DataFrame?
For example, I am trying to add 'new_index' to level 1 ('second') under every 'first' label, with NaN values.
#Sample data
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = df.set_index([['one', 'two', 'three'], [1, 2, 3]])
df.index.names = ['first', 'second']
df
#Output
              A  B  C
first second
one   1       1  4  7
two   2       2  5  8
three 3       3  6  9
#Desired Output
                   A    B    C
first second
one   1            1    4    7
      new_index  NaN  NaN  NaN
two   2            2    5    8
      new_index  NaN  NaN  NaN
three 3            3    6    9
      new_index  NaN  NaN  NaN
Thank you very much.

This is what I found.
df = df.unstack("second").stack(level=0)
df["new_index"] = "NA"
df = df.stack().unstack(level=1)
df
#output
                          A           B           C
first second
one   1          1.00000000  4.00000000  7.00000000
      new_index          NA          NA          NA
three 3          3.00000000  6.00000000  9.00000000
      new_index          NA          NA          NA
two   2          2.00000000  5.00000000  8.00000000
      new_index          NA          NA          NA
Because "NA" here is just the string "NA", this is not a rigorous answer.
Replacing it with np.nan makes 'new_index' disappear, because DataFrame.stack() drops NaN values by default.
Any other idea?
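One possible approach, offered as a sketch rather than a definitive answer, avoids stack() entirely: flatten the index, append one filler row per existing row, interleave the two blocks by position, and rebuild the MultiIndex. The names flat, filler, combined, order and result are illustrative, and the sketch assumes each 'first' label appears only once, as in the sample data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.index = pd.MultiIndex.from_arrays(
    [['one', 'two', 'three'], [1, 2, 3]], names=['first', 'second'])

# Work on a flat copy so the index levels become ordinary columns.
flat = df.reset_index()

# One filler row per existing row; A, B and C are absent, so concat fills NaN.
filler = flat[['first']].assign(second='new_index')
combined = pd.concat([flat, filler], ignore_index=True)

# Interleave by position: original row 0, filler row 0, original row 1, ...
n = len(flat)
order = np.arange(2 * n).reshape(2, n).T.ravel()

result = combined.iloc[order].set_index(['first', 'second'])
print(result)
```

Because the filler rows hold real NaN values, A, B and C become float columns. The 'second' level ends up with mixed integer and string labels, so sorting that level may not be possible.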

Related

Setting multiple columns at once gives a "Not in index" error

import pandas as pd
df = pd.DataFrame(
[
[5, 2],
[3, 5],
[5, 5],
[8, 9],
[90, 55]
],
columns = ['max_speed', 'shield']
)
df.loc[(df.max_speed > df.shield), ['stat', 'delta']] \
= 'overspeed', df['max_speed'] - df['shield']
I am setting multiple columns using .loc as above, but in some cases I get a "Not in index" error. Am I doing something wrong?
Create a list of tuples, one per True value in the mask, pairing the repeated scalar 'overspeed' with the filtered difference Series:
m = (df.max_speed > df.shield)
s = df['max_speed'] - df['shield']
df.loc[m, ['stat', 'delta']] = list(zip(['overspeed'] * m.sum(), s[m]))
print(df)
   max_speed  shield       stat  delta
0          5       2  overspeed    3.0
1          3       5        NaN    NaN
2          5       5        NaN    NaN
3          8       9        NaN    NaN
4         90      55  overspeed   35.0
Another idea with helper DataFrame:
df.loc[m, ['stat', 'delta']] = pd.DataFrame({'stat':'overspeed', 'delta':s})[m]
Details:
print(list(zip(['overspeed'] * m.sum(), s[m])))
[('overspeed', 3), ('overspeed', 35)]
print (pd.DataFrame({'stat':'overspeed', 'delta':s})[m])
stat delta
0 overspeed 3
4 overspeed 35
The simplest option is to assign each column separately:
df.loc[m, 'stat'] = 'overspeed'
df.loc[m, 'delta'] = df['max_speed'] - df['shield']
print(df)
   max_speed  shield       stat  delta
0          5       2  overspeed    3.0
1          3       5        NaN    NaN
2          5       5        NaN    NaN
3          8       9        NaN    NaN
4         90      55  overspeed   35.0
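A further variation, not taken from the answers above, is to build each new column in full and blank out the rows outside the mask with Series.where. This is only a sketch; it sidesteps .loc assignment to columns that do not exist yet.

```python
import pandas as pd

df = pd.DataFrame(
    [[5, 2], [3, 5], [5, 5], [8, 9], [90, 55]],
    columns=['max_speed', 'shield'],
)

m = df['max_speed'] > df['shield']

# Series.where keeps values where the mask is True and puts NaN elsewhere.
df['stat'] = pd.Series('overspeed', index=df.index).where(m)
df['delta'] = (df['max_speed'] - df['shield']).where(m)
print(df)
```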

Drop a column based on the existence of another column

I'm actually trying to figure out how to drop a column based on the existence of another column. Here is my problem :
I start with this DataFrame. Each "X" column is associated with a "Y" column using a number. (X_1,Y_1 / X_2,Y_2 ...)
Index X_1 X_2 Y_1 Y_2
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
I drop the columns containing NaN values using df.dropna(axis=1). The result I get is this DataFrame :
Index X_1 X_2 Y_1
1 4 0 A
2 7 0 A
3 6 0 B
4 2 0 B
5 8 0 A
The problem is that I want to delete the "X" column associated to the "Y" column that just got dropped. I would like to use a condition that basically says :
"If Y_2 is not in the DataFrame, drop the X_2 column"
I used a for loop combined with an if, but it doesn't seem to work. Any ideas ?
Thanks and have a good day.
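The condition quoted above can be sketched directly. This assumes the simplified X_n / Y_n column names from the question; orphans and result are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X_1': [4, 7, 6, 2, 8],
    'X_2': [0, 0, 0, 0, 0],
    'Y_1': ['A', 'A', 'B', 'B', 'A'],
    'Y_2': [np.nan] * 5,
}, index=pd.Index([1, 2, 3, 4, 5], name='Index'))

# Drop every column that is entirely NaN (here: Y_2).
df = df.dropna(axis=1, how='all')

# "If Y_n is not in the DataFrame, drop the X_n column."
orphans = [c for c in df.columns
           if c.startswith('X_') and 'Y_' + c[2:] not in df.columns]
result = df.drop(columns=orphans)
print(result)
```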
Setup
>>> df
CHA_COEXPM1_COR CHA_COEXPM2_COR CHA_COFMAT1_COR CHA_COFMAT2_COR
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
Solution
Identify the columns having NaN values in any row
Group the identified columns using the numeric identifier and transform using any
Filter the columns using the boolean mask created in the previous step
m = df.isna().any()
m = m.groupby(m.index.str.extract(r'(\d+)_', expand=False)).transform('any')
Result
>>> df.loc[:, ~m]
CHA_COEXPM1_COR CHA_COFMAT1_COR
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
Slightly modified example to be closer to actual DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'X_V1_C': {0: 4, 1: 7, 2: 6, 3: 2, 4: 8},
'X_V2_C': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'Y_V1_C': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'A'},
'Y_V2_C': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
})
Index X_V1_C X_V2_C Y_V1_C Y_V2_C
0 1 4 0 A NaN
1 2 7 0 A NaN
2 3 6 0 B NaN
3 4 2 0 B NaN
4 5 8 0 A NaN
set_index on any columns to be "saved"
Extract the numbers from the columns and create a MultiIndex
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
0 1 2 1 2 # Numbers Extracted From Columns
X_V1_C X_V2_C Y_V1_C Y_V2_C
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
Find the groups that contain an all-NaN column: apply DataFrame.isna, then all on axis=0 (per column), then any within each level=0 group (the number that was extracted), and negate the result so that True means "keep":
col_mask = ~df.isna().all(axis=0).groupby(level=0).any()
0
1 True # Keep 1 Group
2 False # Don't Keep 2 Group
dtype: bool
Filter the DataFrame with the mask using loc, then droplevel on the added number level
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
All Together
df = df.set_index('Index')
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
col_mask = ~df.isna().all(axis=0).groupby(level=0).any()
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
df:
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
drop nas
df.dropna(axis=1, inplace=True)
compute suffixes and columns with both suffixes
suffixes = [i[2:] for i in df.columns]
cols = [c for c in df.columns if suffixes.count(c[2:]) == 2]
filter columns
df[cols]
full code:
df = df.set_index('Index').dropna(axis=1)
suffixes = [i[2:] for i in df.columns]
df[[c for c in df.columns if suffixes.count(c[2:]) == 2]]

How to join pandas dataframes which have a multiindex

Problem Description
I have a dataframe with a multi-index that is three levels deep (0, 1, 2) and I'd like to join this dataframe with another dataframe which is indexed by level 2 of my original dataframe.
In code, I'd like to turn:
pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
and
pd.DataFrame(['c', 'e']).transpose().set_index(0)
into
pd.DataFrame(['a', 'b', 'c', 'd', 'e']).transpose().set_index([0, 1, 2])
What I've tried
I've tried using swaplevel and then join. Didn't work, though some of the error messages suggested that if only I could set on properly this might work.
I tried concat, but couldn't get this to work either. Not sure it can't work though...
Notes:
I have seen this question in which the answer seems to dodge the question (while solving the problem).
pandas will naturally do this for you if the names of the index levels line up. You can rename the index of the second dataframe and join accordingly.
d1 = pd.DataFrame(['a', 'b', 'c', 'd']).transpose().set_index([0, 1, 2])
d2 = pd.DataFrame(['c', 'e']).transpose().set_index(0)
d1.join(d2.rename_axis(2))
       3  1
0 1 2
a b c  d  e
More Comprehensive Example
d1 = pd.DataFrame([
[1, 2],
[3, 4],
[5, 6],
[7, 8]
], pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']], names=['One', 'Two']))
d2 = pd.DataFrame([
list('abcdefg')
], ['Y'], columns=list('ABCDEFG'))
d3 = pd.DataFrame([
list('hij')
], ['A'], columns=list('HIJ'))
d1.join(d2.rename_axis('Two')).join(d3.rename_axis('One'))
         0  1    A    B    C    D    E    F    G    H    I    J
One Two
A   X    1  2  NaN  NaN  NaN  NaN  NaN  NaN  NaN    h    i    j
    Y    3  4    a    b    c    d    e    f    g    h    i    j
B   X    5  6  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
    Y    7  8    a    b    c    d    e    f    g  NaN  NaN  NaN
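If relying on matching index names feels too implicit, a more explicit route, sketched here as an alternative and not part of the answer above, is to flatten both frames, merge on the shared key, and restore the MultiIndex. The column names p, q and extra are made up for illustration.

```python
import pandas as pd

d1 = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    index=pd.MultiIndex.from_product([['A', 'B'], ['X', 'Y']],
                                     names=['One', 'Two']),
    columns=['p', 'q'],
)
d2 = pd.DataFrame({'extra': ['y_only']}, index=['Y'])

# Implicit route: name d2's index after the level it should match.
joined = d1.join(d2.rename_axis('Two'))

# Explicit route: turn index levels into columns, merge, set the index again.
merged = (
    d1.reset_index()
      .merge(d2.rename_axis('Two').reset_index(), on='Two', how='left')
      .set_index(['One', 'Two'])
)
print(merged)
```

A left merge keeps the row order of d1, so the rebuilt index lines up with the original one.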

In pandas, how can all columns that do not contain at least one NaN be dropped from a DataFrame?

I have a DataFrame in which some columns have NaN values. I want to drop all columns that do not have at least one NaN value in them.
I am able to identify the NaN values by creating a DataFrame filled with Boolean values (True in place of NaN values, False otherwise):
data.isnull()
Then, I am able to identify the columns that contain at least one NaN value by creating a series of column names with associated Boolean values (True if the column contains at least one NaN value, False otherwise):
data.isnull().any(axis = 0)
When I attempt to use this series to drop the columns that do not contain at least one NaN value, I run into a problem: the columns that do not contain NaN values are dropped:
data = data.loc[:, data.isnull().any(axis = 0)]
How should I do this?
Consider the dataframe df
df = pd.DataFrame([
[1, 2, None],
[3, None, 4],
[5, 6, None]
], columns=list('ABC'))
df
A B C
0 1 2.0 NaN
1 3 NaN 4.0
2 5 6.0 NaN
IIUC:
pandas
dropna with thresh parameter
df.dropna(axis=1, thresh=2)
A B
0 1 2.0
1 3 NaN
2 5 6.0
loc + boolean indexing
df.loc[:, df.isnull().sum() < 2]
A B
0 1 2.0
1 3 NaN
2 5 6.0
I used the sample DF from @piRSquared's answer.
If you want "to drop the columns that do not contain at least one NaN value":
In [19]: df
Out[19]:
A B C
0 1 2.0 NaN
1 3 NaN 4.0
2 5 6.0 NaN
In [26]: df.loc[:, df.isnull().any()]
Out[26]:
B C
0 2.0 NaN
1 NaN 4.0
2 6.0 NaN
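To make the direction of the mask explicit, here is a small sketch on the same sample frame showing both selections side by side. The names has_nan, with_nan and without_nan are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    [1, 2, None],
    [3, None, 4],
    [5, 6, None]
], columns=list('ABC'))

# True for every column that holds at least one NaN.
has_nan = df.isnull().any()

with_nan = df.loc[:, has_nan]      # keeps B and C, drops the complete column A
without_nan = df.loc[:, ~has_nan]  # the inverse: keeps only A

print(with_nan)
print(without_nan)
```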

pandas set_index with NA and None values seem to be not working

I am trying to index a pandas DataFrame using columns with occasional NA and None in them. This seems to be failing. In the example below, df0 has (None,e) combination on index 3, but df1 has (NaN,e). Any suggestions?
import pandas as pd
import numpy as np
df0 = pd.DataFrame({'k1':['4',np.nan,'6',None,np.nan], 'k2':['a','d',np.nan,'e',np.nan], 'v':[1,2,3,4,5]})
df1 = df0.copy().set_index(['k1','k2'])
>>> df0
Out[3]:
k1 k2 v
0 4 a 1
1 NaN d 2
2 6 NaN 3
3 None e 4
4 NaN NaN 5
>>> df1
Out[4]:
        v
k1  k2
4   a   1
NaN d   2
6   NaN 3
NaN e   4
    NaN 5
Edit: I see the point--so this is the expected behavior.
This is expected behaviour, the None value is being converted to NaN and as the value is duplicated it isn't being shown:
In [31]:
df1.index
Out[31]:
MultiIndex(levels=[['4', '6'], ['a', 'd', 'e']],
labels=[[0, -1, 1, -1, -1], [0, 1, -1, 2, -1]],
names=['k1', 'k2'])
From the above you can see that -1 is the code used for NaN values. The display behaves the same way for any repeated value; if your df were like the following, the output would show the same behaviour:
In [34]:
df0 = pd.DataFrame({'k1':['4',np.nan,'6',1,1], 'k2':['a','d',np.nan,'e',np.nan], 'v':[1,2,3,4,5]})
df1 = df0.copy().set_index(['k1','k2'])
df1
Out[34]:
        v
k1  k2
4   a   1
NaN d   2
6   NaN 3
1   e   4
    NaN 5
You can see that 1 is repeated for the last two rows
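As a quick check, the following sketch rebuilds the original frame and inspects the index codes directly instead of relying on the printed repr. It assumes a pandas version that exposes MultiIndex.codes; the variable names k1_codes and k1_missing are illustrative.

```python
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'k1': ['4', np.nan, '6', None, np.nan],
                    'k2': ['a', 'd', np.nan, 'e', np.nan],
                    'v': [1, 2, 3, 4, 5]})
df1 = df0.set_index(['k1', 'k2'])

# Integer codes for level 'k1'; -1 marks a missing label.
k1_codes = df1.index.codes[0]

# None and NaN both come back as missing in the index.
k1_missing = df1.index.get_level_values('k1').isna()

print(list(k1_codes))
print(list(k1_missing))
```

Rows 1, 3 and 4 share the code -1, which is why the row that held None is shown as NaN and why the repeated label is left blank in the display.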