Assigning to a slice from another DataFrame requires matching column names? - pandas

If I want to set (replace) part of a DataFrame with values from another, I should be able to assign to a slice (as in this question) like this:
df.loc[rows, cols] = df2
Not so in this case; it nulls out the slice instead:
In [32]: df
Out[32]:
A B
0 1 -0.240180
1 2 -0.012547
2 3 -0.301475
In [33]: df2
Out[33]:
C
0 x
1 y
2 z
In [34]: df.loc[:,'B']=df2
In [35]: df
Out[35]:
A B
0 1 NaN
1 2 NaN
2 3 NaN
But it does work with just a column (Series) from df2, which is not an option if I want multiple columns:
In [36]: df.loc[:,'B']=df2['C']
In [37]: df
Out[37]:
A B
0 1 x
1 2 y
2 3 z
Or if the column names match:
In [47]: df3
Out[47]:
B
0 w
1 a
2 t
In [48]: df.loc[:,'B']=df3
In [49]: df
Out[49]:
A B
0 1 w
1 2 a
2 3 t
Is this expected? I don't see any explanation for it in the docs or on Stack Overflow.

Yes, this is expected. Label alignment is one of the core features of pandas. When you use df.loc[:,'B'] = df2, pandas needs to align the two DataFrames:
df.align(df2)
Out:
(   A         B   C
 0  1 -0.240180 NaN
 1  2 -0.012547 NaN
 2  3 -0.301475 NaN,

     A   B  C
 0 NaN NaN  x
 1 NaN NaN  y
 2 NaN NaN  z)
The above shows the tuple returned by align: how each DataFrame looks once aligned (the first element is df and the second is df2). If your df2 also had a column named B with values [1, 2, 3], it would become:
df.align(df2)
Out:
(   A         B   C
 0  1 -0.240180 NaN
 1  2 -0.012547 NaN
 2  3 -0.301475 NaN,

     A  B  C
 0 NaN  1  x
 1 NaN  2  y
 2 NaN  3  z)
Since the B columns are aligned, your assignment would result in:
df.loc[:,'B'] = df2
df
Out:
A B
0 1 1
1 2 2
2 3 3
When you use a Series, the alignment happens on a single axis (the index, in your example). Since the indices match exactly, there is no problem and the values from df2['C'] are assigned to df['B'].
You can either rename the labels before the alignment or use a data structure that doesn't have labels (a numpy array, a list, a tuple...).
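For example, a minimal sketch of the renaming approach, using the df and df2 from the question:
# Rename df2's column so the labels line up before assigning
df.loc[:, 'B'] = df2.rename(columns={'C': 'B'})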

You can use the underlying NumPy array:
df.loc[:,'B'] = df2.values
df
A B
0 1 x
1 2 y
2 3 z
Pandas indexing is always sensitive to the labeling of both rows and columns. In this case, your rows check out, but your columns do not (B != C).
Using the underlying NumPy array makes the operation index-insensitive.
The reason this does work when df2 is a Series is that a Series has no concept of columns; the only alignment is on the rows, which match.
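The same trick extends to multiple columns. A minimal sketch, assuming df2 has two columns X and Y (hypothetical names) whose values should land in df's A and B:
# Hypothetical columns X and Y; .values strips the labels, so placement is positional
df.loc[:, ['A', 'B']] = df2[['X', 'Y']].values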

Related

Drop a column based on the existence of another column

I'm actually trying to figure out how to drop a column based on the existence of another column. Here is my problem:
I start with this DataFrame. Each "X" column is associated with a "Y" column by a number (X_1/Y_1, X_2/Y_2, ...).
Index X_1 X_2 Y_1 Y_2
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
I drop the NaN values using df.dropna(axis=1). The result I get is this DataFrame:
Index X_1 X_2 Y_1
1 4 0 A
2 7 0 A
3 6 0 B
4 2 0 B
5 8 0 A
The problem is that I want to delete the "X" column associated with the "Y" column that just got dropped. I would like to use a condition that basically says:
"If Y_2 is not in the DataFrame, drop the X_2 column"
I used a for loop combined with an if, but it doesn't seem to work. Any ideas?
Thanks and have a good day.
Setup
>>> df
CHA_COEXPM1_COR CHA_COEXPM2_COR CHA_COFMAT1_COR CHA_COFMAT2_COR
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
Solution
1. Identify the columns having NaN values in any row
2. Group the identified columns using the numeric identifier and transform using any
3. Filter the columns using the boolean mask created in the previous step
# Flag columns that contain NaN in any row, then spread each flag across all
# columns sharing the same numeric identifier (expand=False yields a plain
# positional key rather than a Series that would need index alignment)
m = df.isna().any()
m = m.groupby(m.index.str.extract(r'(\d+)_', expand=False)).transform('any')
Result
>>> df.loc[:, ~m]
CHA_COEXPM1_COR CHA_COFMAT1_COR
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
Slightly modified example to be closer to actual DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'X_V1_C': {0: 4, 1: 7, 2: 6, 3: 2, 4: 8},
    'X_V2_C': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
    'Y_V1_C': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'A'},
    'Y_V2_C': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
})
Index X_V1_C X_V2_C Y_V1_C Y_V2_C
0 1 4 0 A NaN
1 2 7 0 A NaN
2 3 6 0 B NaN
3 4 2 0 B NaN
4 5 8 0 A NaN
1. set_index on any columns to be "saved"
2. Extract the numbers from the columns and create a MultiIndex:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
0      1      2      1      2   # numbers extracted from the columns
  X_V1_C X_V2_C Y_V1_C Y_V2_C
Index
1      4      0      A    NaN
2      7      0      A    NaN
3      6      0      B    NaN
4      2      0      B    NaN
5      8      0      A    NaN
3. Check which groups have all-NaN columns: DataFrame.isna, then all over axis=0 (down each column), then any relative to level=0 (the number that was extracted):
col_mask = ~df.isna().all(axis=0).any(level=0)
0
1 True # Keep 1 Group
2 False # Don't Keep 2 Group
dtype: bool
4. Filter the DataFrame with the mask using loc, then droplevel on the added number level:
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
All Together
df = df.set_index('Index')
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
col_mask = ~df.isna().all(axis=0).any(level=0)
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
df:
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
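Note: in newer pandas versions the level= argument to any/all was deprecated and later removed; a sketch of the same mask built with groupby instead:
col_mask = ~df.isna().all(axis=0).groupby(level=0).any()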
Drop the NaN columns:
df.dropna(axis=1, inplace=True)
Compute the suffixes, and keep only the columns whose suffix appears twice:
suffixes = [i[2:] for i in df.columns]
cols = [c for c in df.columns if suffixes.count(c[2:]) == 2]
Filter the columns:
df[cols]
Full code:
df = df.set_index('Index').dropna(axis=1)
suffixes = [i[2:] for i in df.columns]
df[[c for c in df.columns if suffixes.count(c[2:]) == 2]]

Construct DataFrame from list of dicts

I'm trying to construct a pandas DataFrame from a list of dicts.
List of dicts:
a = [{'1': 'A'},
     {'2': 'B'},
     {'3': 'C'}]
Pass list of dicts into pd.DataFrame():
df = pd.DataFrame(a)
Actual results:
1 2 3
0 A NaN NaN
1 NaN B NaN
2 NaN NaN C
Passing column names doesn't help either:
pd.DataFrame(a, columns=['Key', 'Value'])
Actual results:
Key Value
0 NaN NaN
1 NaN NaN
2 NaN NaN
Expected results:
Key Value
0 1 A
1 2 B
2 3 C
Try this:
from collections import ChainMap
data = dict(ChainMap(*a))   # merge the single-item dicts into one dict
pd.DataFrame(data.items(), columns=['Key', 'Value'])
Output:
Key Value
0 1 A
1 2 B
2 3 C
Something like this with a list comprehension:
pd.DataFrame([(x, y) for i in a for x, y in i.items()], columns=['Key', 'Value'])
Key Value
0 1 A
1 2 B
2 3 C
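Another minimal sketch along the same lines: merge the one-item dicts with a plain dict comprehension, then build the frame from the items:
merged = {k: v for d in a for k, v in d.items()}
pd.DataFrame(merged.items(), columns=['Key', 'Value'])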

Re-index to insert missing rows in a multi-indexed dataframe

I have a MultiIndexed DataFrame with three levels of indices. I would like to expand my third level to contain all values in a given range, but only for the existing values in the two upper levels.
For example, assume the first level is name, the second level is date and the third level is hour. I would like to have rows for all 24 possible hours (even if some are currently missing), but only for the already existing names and dates. The values in new rows can be filled with zeros.
So a simple example input would be:
>>> import pandas as pd
>>> df = pd.DataFrame([[1,1,1,3],[2,2,1,4], [3,3,2,5]], columns=['A', 'B', 'C','val'])
>>> df.set_index(['A', 'B', 'C'], inplace=True)
>>> df
       val
A B C
1 1 1    3
2 2 1    4
3 3 2    5
If the required values for C are [1, 2, 3], the desired output would be:
       val
A B C
1 1 1    3
    2    0
    3    0
2 2 1    4
    2    0
    3    0
3 3 1    0
    2    5
    3    0
I know how to achieve this using groupby and applying a defined function to each group, but I was wondering if there is a cleaner way of doing this with reindex (I couldn't make it work for the MultiIndex case, but perhaps I'm missing something).
Use:
partial_indices = [i[0:2] for i in df.index.values]   # existing (A, B) pairs
C_reqd = [1, 2, 3]
final_indices = [j + (i,) for j in partial_indices for i in C_reqd]
index = pd.MultiIndex.from_tuples(final_indices, names=['A', 'B', 'C'])
df2 = pd.DataFrame(pd.Series(0, index), columns=['val'])
df2.update(df)   # overwrite the zeros with the existing values
Output
df2
       val
A B C
1 1 1  3.0
    2  0.0
    3  0.0
2 2 1  4.0
    2  0.0
    3  0.0
3 3 1  0.0
    2  5.0
    3  0.0
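Since the question asks about reindex: a minimal sketch that builds the full index from the existing (A, B) pairs and the required C values, then reindexes in one step:
pairs = df.index.droplevel('C').unique()   # existing (A, B) pairs
full_index = pd.MultiIndex.from_tuples([ab + (c,) for ab in pairs for c in [1, 2, 3]],
                                       names=['A', 'B', 'C'])
df.reindex(full_index, fill_value=0)       # missing rows filled with 0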

Find multiple strings in a given column

I'm not sure whether this is easy to do.
I have 2 DataFrames. In the first one (df1) there is a column with texts ('Texts'), and in the second one there are 2 columns: one with some short texts ('SubString') and one with a score ('Score').
What I want is to sum up all the scores associated with the 'SubString' entries in the second DataFrame whenever that substring occurs in the 'Texts' column of the first DataFrame.
For example, if I have a dataframe like this:
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Texts': ['this is a string',
              'here we have another string',
              'this one is completly different',
              'one more',
              'this is one more',
              'and the last one'],
    'c': ['C', 'C', 'C', 'C', 'C', 'C'],
    'd': ['D', 'D', 'D', 'D', 'NaN', 'NaN']
}, columns=['ID', 'Texts', 'c', 'd'])
df1
Out[2]:
ID Texts c d
0 1 this is a string C D
1 2 here we have another string C D
2 3 this one is completly different C D
3 4 one more C D
4 5 this is one more C NaN
5 6 and the last one C NaN
And another dataframe like this:
df2 = pd.DataFrame({
    'SubString': ['This', 'one', 'this is', 'is one'],
    'Score': [0.5, 0.2, 0.75, -0.5]
}, columns=['SubString', 'Score'])
df2
Out[3]:
SubString Score
0 This 0.50
1 one 0.20
2 this is 0.75
3 is one -0.50
I want to get something like this:
df1['Score'] = 0.0
for index1, row1 in df1.iterrows():
    score = 0
    for index2, row2 in df2.iterrows():
        if row2['SubString'] in row1['Texts']:
            score += row2['Score']
    df1.at[index1, 'Score'] = score   # set_value was removed in recent pandas
df1
Out[4]:
ID Texts c d Score
0 1 this is a string C D 0.75
1 2 here we have another string C D 0.00
2 3 this one is completly different C D -0.30
3 4 one more C D 0.20
4 5 this is one more C NaN 0.45
5 6 and the last one C NaN 0.20
Is there a less garbled and faster way to do it?
Thanks!
Option 1
In [691]: np.array([np.where(df1.Texts.str.contains(x.SubString), x.Score, 0)
                    for _, x in df2.iterrows()]).sum(axis=0)
Out[691]: array([ 0.75, 0. , -0.3 , 0.2 , 0.45, 0.2 ])
Option 2
In [674]: df1.Texts.apply(lambda x: df2.Score[df2.SubString.apply(lambda y: y in x)].sum())
Out[674]:
0 0.75
1 0.00
2 -0.30
3 0.20
4 0.45
5 0.20
Name: Texts, dtype: float64
Note: apply doesn't get rid of loops, it just hides them.
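If you want to avoid iterating over df1 entirely, the containment test can be vectorized with str.contains. A minimal sketch (regex=False assumes the substrings are literal text rather than patterns; it still loops over df2's substrings in Python):
df1['Score'] = sum(df1.Texts.str.contains(s, regex=False) * sc
                   for s, sc in zip(df2.SubString, df2.Score))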

pandas set_index with NA and None values seem to be not working

I am trying to index a pandas DataFrame using columns that contain occasional NA and None values. This seems to be failing. In the example below, df0 has the (None, e) combination at index 3, but df1 shows (NaN, e). Any suggestions?
import pandas as pd
import numpy as np
df0 = pd.DataFrame({'k1':['4',np.nan,'6',None,np.nan], 'k2':['a','d',np.nan,'e',np.nan], 'v':[1,2,3,4,5]})
df1 = df0.copy().set_index(['k1','k2'])
>>> df0
Out[3]:
     k1   k2  v
0     4    a  1
1   NaN    d  2
2     6  NaN  3
3  None    e  4
4   NaN  NaN  5
>>> df1
Out[4]:
         v
k1  k2
4   a    1
NaN d    2
6   NaN  3
NaN e    4
    NaN  5
Edit: I see the point; so this is the expected behavior.
This is expected behaviour: the None value is converted to NaN, and because the value is duplicated it isn't shown:
In [31]:
df1.index
Out[31]:
MultiIndex(levels=[['4', '6'], ['a', 'd', 'e']],
labels=[[0, -1, 1, -1, -1], [0, 1, -1, 2, -1]],
names=['k1', 'k2'])
From the above you can see that -1 is used as the code for NaN values. As for the display, if your df were like the following, the output would show the same behaviour:
In [34]:
df0 = pd.DataFrame({'k1':['4',np.nan,'6',1,1], 'k2':['a','d',np.nan,'e',np.nan], 'v':[1,2,3,4,5]})
df1 = df0.copy().set_index(['k1','k2'])
df1
Out[34]:
        v
k1  k2
4   a   1
NaN d   2
6   NaN 3
1   e   4
    NaN 5
You can see that 1 is repeated for the last two rows, so the display leaves the duplicated value blank.
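If you need the missing keys to stay visible and distinct in the index, one option is to fill them with a sentinel before indexing; a minimal sketch (the 'missing' label is an arbitrary choice):
df1 = df0.fillna({'k1': 'missing', 'k2': 'missing'}).set_index(['k1', 'k2'])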