Replace NaN inside a column using another DataFrame and a condition - pandas

I have a DataFrame like below, where NaN appears only in column 'Type 2':
id  Type 1  Type 2
 0       a       b
 1       c       d
 2       e     NaN
 3       g       f
 4       i       h
 5       j       k
 6       l     NaN
 7       m     NaN
 8       o       p
 9       x       y
 9       z     NaN
And the other one is an ordered DataFrame like below:
id  Type 1  Type 2
 0       a      o1
 1       b      o2
 2       c      o3
 3       d      o4
..     ...     ...
23       x     o24
24       y     o25
25       z     o26
I want to fill the NaN values in column 'Type 2' with the corresponding 'Type 2' values from the second DataFrame, matched on 'Type 1'.
I know I should use fillna(), but I do not know how to add a condition like the one I described above.
Thank you
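One way to do this (a sketch, assuming the lookup DataFrame's 'Type 1' values are unique, and using a small excerpt of the data above; the o5/o12/o13/o26 values follow the a→o1 … z→o26 ordering shown in the question) is to turn the second DataFrame into a 'Type 1' → 'Type 2' Series and pass it through map() into fillna():

```python
import numpy as np
import pandas as pd

# First DataFrame: 'Type 2' contains NaN
df1 = pd.DataFrame({
    'Type 1': ['a', 'c', 'e', 'g', 'i', 'j', 'l', 'm', 'o', 'x', 'z'],
    'Type 2': ['b', 'd', np.nan, 'f', 'h', 'k', np.nan, np.nan, 'p', 'y', np.nan],
})

# Second DataFrame: the ordered lookup table (only the keys needed here)
df2 = pd.DataFrame({
    'Type 1': ['e', 'l', 'm', 'z'],
    'Type 2': ['o5', 'o12', 'o13', 'o26'],
})

# Build a 'Type 1' -> 'Type 2' mapping and use it to fill only the gaps
mapping = df2.set_index('Type 1')['Type 2']
df1['Type 2'] = df1['Type 2'].fillna(df1['Type 1'].map(mapping))
```

fillna() only touches the NaN positions, so the existing 'Type 2' values are left alone.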

Related

Pandas: new column where value is based on a specific value within subgroup

I have a dataframe where I want to create a new column ("NewValue") that takes, for each row, the "Value" from the row in the same "Group" where SubGroup == 'A'.
Group SubGroup Value NewValue
0 1 A 1 1
1 1 B 2 1
2 2 A 3 3
3 2 C 4 3
4 3 B 5 NaN
5 3 C 6 NaN
Can this be achieved using a groupby / transform function?
Use Series.map with a DataFrame filtered by boolean indexing:
df['NewValue'] = df['Group'].map(df[df.SubGroup.eq('A')].set_index('Group')['Value'])
print (df)
Group SubGroup Value NewValue
0 1 A 1 1.0
1 1 B 2 1.0
2 2 A 3 3.0
3 2 C 4 3.0
4 3 B 5 NaN
5 3 C 6 NaN
Alternatively, use a left join via DataFrame.merge after renaming the column:
df1 = df.loc[df.SubGroup.eq('A'),['Group','Value']].rename(columns={'Value':'NewValue'})
df = df.merge(df1, how='left')
print (df)
Group SubGroup Value NewValue
0 1 A 1 1.0
1 1 B 2 1.0
2 2 A 3 3.0
3 2 C 4 3.0
4 3 B 5 NaN
5 3 C 6 NaN
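Since the question explicitly mentions groupby / transform, that route works too (a sketch, assuming at most one SubGroup 'A' row per Group): hide the non-'A' values with where() and spread the surviving value across each group with transform('first'), which skips NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': [1, 1, 2, 2, 3, 3],
    'SubGroup': ['A', 'B', 'A', 'C', 'B', 'C'],
    'Value': [1, 2, 3, 4, 5, 6],
})

# Keep 'Value' only where SubGroup is 'A', then broadcast the remaining
# value over each group; groups with no 'A' row stay NaN
df['NewValue'] = (df['Value']
                  .where(df['SubGroup'].eq('A'))
                  .groupby(df['Group'])
                  .transform('first'))
```

This gives the same 1.0 / 1.0 / 3.0 / 3.0 / NaN / NaN column as the map-based answer.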

Pandas Merge(): Appending data from merged columns and replace null values (Extension from question https://stackoverflow.com/questions/68471939)

I'd like to merge two tables while replacing the null value in one column from one table with the non-null values from the same labelled column from another table.
The code below is an example of the tables to be merged:
import numpy as np
import pandas as pd

# Table 1 (has rows with missing values)
a=['x','x','x','y','y','y']
b=['z', 'z', 'z' ,'w', 'w' ,'w' ]
c=[1 for x in a]
d=[2 for x in a]
e=[3 for x in a]
f=[4 for x in a]
g=[1,1,1,np.nan, np.nan, np.nan]
table_1=pd.DataFrame({'a':a, 'b':b, 'c':c, 'd':d, 'e':e, 'f':f, 'g':g})
table_1
a b c d e f g
0 x z 1 2 3 4 1.0
1 x z 1 2 3 4 1.0
2 x z 1 2 3 4 1.0
3 y w 1 2 3 4 NaN
4 y w 1 2 3 4 NaN
5 y w 1 2 3 4 NaN
# Table 2 (new table to be merged to table_1; its values in column 'g' should replace the null values in the same column in table_1, while keeping the values in the other non-null rows)
a=['y', 'y', 'y']
b=['w', 'w', 'w']
g=[2,2,2]
table_2=pd.DataFrame({'a':a, 'b':b, 'g':g})
table_2
a b g
0 y w 2
1 y w 2
2 y w 2
This is the code I use for merging the 2 tables, and the output I get:
merged_table=pd.merge(table_1, table_2, on=['a', 'b'], how='left')
merged_table
Current output:
a b c d e f g_x g_y
0 x z 1 2 3 4 1.0 NaN
1 x z 1 2 3 4 1.0 NaN
2 x z 1 2 3 4 1.0 NaN
3 y w 1 2 3 4 NaN 2.0
4 y w 1 2 3 4 NaN 2.0
5 y w 1 2 3 4 NaN 2.0
6 y w 1 2 3 4 NaN 2.0
7 y w 1 2 3 4 NaN 2.0
8 y w 1 2 3 4 NaN 2.0
9 y w 1 2 3 4 NaN 2.0
10 y w 1 2 3 4 NaN 2.0
11 y w 1 2 3 4 NaN 2.0
Desired output:
a b c d e f g
0 x z 1 2 3 4 1.0
1 x z 1 2 3 4 1.0
2 x z 1 2 3 4 1.0
3 y w 1 2 3 4 2.0
4 y w 1 2 3 4 2.0
5 y w 1 2 3 4 2.0
There are some problems you have to solve:
The 'g' column type in tables 1 and 2: it should be float, so use DataFrame.astype({'column_name': 'type'}) on both tables;
Indexes: you are allowed to insert the data by index, because the other columns of table_1 contain the same data ('y w 1 2 3 4'). Therefore, filter the NaN values out of table 1's 'g' column with ind = table_1[pd.isnull(table_1['g'])], then create a new Series from table_2's 'g' values whose index covers those NaN positions: pd.Series(table_2['g'].to_list(), index=ind.index).
Try this solution:
table_1=table_1.astype({'a':'str','b':'str','g':'float'})
table_2=table_2.astype({'a':'str','b':'str','g':'float'})
ind=table_1[pd.isnull(table_1['g'])]
table_1.loc[ind.index,'g']=pd.Series(table_2['g'].to_list(),index=ind.index)
This produces the desired output shown above.
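A merge-based route is also possible if the lookup rows are deduplicated first, so the left join cannot multiply rows (a sketch on a trimmed-down version of the tables, keeping only the join keys and 'g'):

```python
import numpy as np
import pandas as pd

table_1 = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y'],
                        'b': ['z', 'z', 'z', 'w', 'w', 'w'],
                        'g': [1, 1, 1, np.nan, np.nan, np.nan]})
table_2 = pd.DataFrame({'a': ['y', 'y', 'y'],
                        'b': ['w', 'w', 'w'],
                        'g': [2, 2, 2]})

# Deduplicate the lookup table so each left row matches at most once,
# then fill the gaps in g_x from g_y and tidy up the columns
merged = table_1.merge(table_2.drop_duplicates(), on=['a', 'b'],
                       how='left', suffixes=('_x', '_y'))
merged['g'] = merged['g_x'].fillna(merged['g_y'])
merged = merged.drop(columns=['g_x', 'g_y'])
```

The drop_duplicates() call is what prevents the 12-row blow-up seen in the question's output.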

How to update the value in a column based on another value in the same column where both rows have the same value in another column?

The dataframe df is given:
ID I J K
0 10 1 a 1
1 10 2 b nan
2 10 3 c nan
3 11 1 f 0
4 11 2 b nan
5 11 3 d nan
6 12 1 b 1
7 12 2 d nan
8 12 3 c nan
For each unique value in ID, look at the row where I == 3: if its J value is 'c', then set K = 1 in that ID's row where I == 1, else set K = 0. The other values in K do not matter. In other words, the values of K in rows 0, 3, and 6 are determined by the values of J in rows 2, 5, and 8, respectively.
Try:
IDs = df.loc[df.I.eq(3) & df.J.eq("c"), "ID"]
df["K"] = np.where(df["ID"].isin(IDs) & df.I.eq(1), 1, 0)
df["K"] = np.where(df.I.eq(1), df.K, np.nan) # <-- if you want other values NaNs
print(df)
Prints:
ID I J K
0 10 1 a 1.0
1 10 2 b NaN
2 10 3 c NaN
3 11 1 f 0.0
4 11 2 b NaN
5 11 3 d NaN
6 12 1 b 1.0
7 12 2 d NaN
8 12 3 c NaN
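A groupby variant of the same idea, in case you'd rather avoid the intermediate IDs list (a sketch under the same assumptions as the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [10, 10, 10, 11, 11, 11, 12, 12, 12],
                   'I':  [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'J':  list('abcfbdbdc')})

# Per ID, check whether the I == 3 row has J == 'c', then write the
# result (as 1/0) only into the I == 1 rows; everything else is NaN
has_c = (df['I'].eq(3) & df['J'].eq('c')).groupby(df['ID']).transform('any')
df['K'] = np.where(df['I'].eq(1), has_c.astype(int), np.nan)
```

transform('any') broadcasts the per-group check back to every row, which makes the final np.where a single vectorized step.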

Conditionally ffill() with pandas

I want to forward fill a pandas series conditionally based on the last valid index in the series. For example, say we have this series:
import pandas as pd
ser = pd.Series(['a', 'b', 'b', pd.NA, 'c', pd.NA, pd.NA, 'd', pd.NA])
ser
0 a
1 b
2 b
3 <NA>
4 c
5 <NA>
6 <NA>
7 d
8 <NA>
I would like to ffill() the series only if the last valid index was not 2. This is the desired result:
0 a
1 b
2 b
3 <NA>
4 c
5 c
6 c
7 d
8 d
I came up with this way which works, but does not seem like a great answer. Is there a more elegant way of doing this?
ffilled = ser.ffill()
shifted = ser.shift(1)
result = ffilled.loc[(~pd.isna(ser)) | (shifted != 'b')]
result
0 a
1 b
2 b
4 c # -> index 3 does not get forward filled
5 c
6 c
7 d
8 d
Concatenating this result back with the original would insert a NaN at index 3, so this works but making two intermediary versions of the series doesn't seem like a great solution.
You can also do this simply with boolean masking:
result=ser[~(ser.index==3)].ffill()
Finally, use the reindex() method:
result=result.reindex(ser.index)
Now if you print result you will get your expected output:
0 a
1 b
2 b
3 NaN
4 c
5 c
6 c
7 d
8 d
And if you want <NA> in place of the NaN values (note that this fills with the literal string '<NA>', which merely prints the same way):
result.fillna('<NA>',inplace=True)
Now printing result gives exactly the series you want:
0 a
1 b
2 b
3 <NA>
4 c
5 c
6 c
7 d
8 d
nan can be a bit clunky with this kind of thing. I would try this:
# generate fill values
fvals = ser.ffill().where(ser.index != 3)
ser.fillna(fvals)
output:
0 a
1 b
2 b
3 NaN
4 c
5 c
6 c
7 d
8 d
dtype: object
The other answers to this question hard-code the removal of a specific index, which isn't usable for unseen series. I realized that the shifted version of the series used in the question is not actually necessary, so I went with this:
result = ffilled.loc[(~pd.isna(ser)) | (ffilled != 'b')]
result
0 a
1 b
2 b
4 c
5 c
6 c
7 d
8 d
This also addresses a bug that the question's approach had, which is that multiple NA values after a 'b' value would not get omitted.
@AnuragDabas's use of reindex is a good way of putting a NaN back in at index 3, which can then be filled in as desired with fillna().
result = result.reindex(ser.index)
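For reference, the whole rule can also be written in one pass with mask(), blanking out any filled position whose fill value came from a 'b' (a sketch of the same approach; it keeps the original index directly, with no intermediate boolean indexing or reindex):

```python
import pandas as pd

ser = pd.Series(['a', 'b', 'b', pd.NA, 'c', pd.NA, pd.NA, 'd', pd.NA])

# Forward fill, then knock back out every filled slot that would
# have been filled from a 'b'
filled = ser.ffill()
result = filled.mask(ser.isna() & filled.eq('b'))
```

Because mask() only touches positions that were originally NA and whose forward-filled value is 'b', runs of several NAs after a 'b' are handled correctly as well.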

How can I append a Series to an existing DataFrame row with pandas?

I want to take a series and append it to an existing dataframe row. For example:
df
A B C
0 2 3 4
1 5 6 7
2 7 8 9
series
0 x
1 y
2 z
-->
A B C D E F
0 2 3 4 x y z
1 5 6 7 ...
2 7 8 9 ...
I want to do this using a for loop, appending a different series to each row of the dataframe. The series may have different lengths. Is there an easy way to accomplish this?
Use loc and the series's index as the column names:
lst = [
[2,3,4],
[5,6,7],
[7,8,9]
]
df = pd.DataFrame(lst, columns=list("ABC"))
print(df)
###
A B C
0 2 3 4
1 5 6 7
2 7 8 9
s1 = pd.Series(list("xyz"))
s1.index = list("DEF")
print(s1)
###
D x
E y
F z
dtype: object
s2 = pd.Series(list("abcd"))
s2.index = list("GHIJ")
print(s2)
###
G a
H b
I c
J d
dtype: object
for idx, s in enumerate([s1, s2]):
    df.loc[idx, s.index] = s.values
print(df)
###
A B C D E F G H I J
0 2 3 4 x y z NaN NaN NaN NaN
1 5 6 7 NaN NaN NaN a b c d
2 7 8 9 NaN NaN NaN NaN NaN NaN NaN
Try this:
df['D'], df['E'], df['F'] = series.tolist()
And now:
print(df)
Gives:
A B C D E F
0 2 3 4 x y z
1 5 6 7 x y z
2 7 8 9 x y z
Edit:
If you are not sure how many extra values there are, try:
from string import ascii_uppercase as letters
new_cols = [letters[i + len(df.columns)] for i, v in enumerate(series)]
df = df.assign(**dict(zip(new_cols, series.tolist())))
print(df)
Output:
A B C D E F
0 2 3 4 x y z
1 5 6 7 x y z
2 7 8 9 x y z
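If you are not sure how many extra values there are, the same idea can also be written as a short loop that names each new column after the next unused capital letter (a minimal sketch; it assumes the total column count stays under 26, and that broadcasting one value per column, as in the answer above, is what you want):

```python
import pandas as pd
from string import ascii_uppercase

df = pd.DataFrame([[2, 3, 4], [5, 6, 7], [7, 8, 9]], columns=list('ABC'))
series = pd.Series(['x', 'y', 'z'])

# Pick the next unused capital letters as column names, then broadcast
# each series value down its own new column
start = len(df.columns)
for offset, value in enumerate(series):
    df[ascii_uppercase[start + offset]] = value
```

This copes with a series of any length (up to the 26-letter limit) without hard-coding 'D', 'E', 'F'.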