Pandas Merge(): Appending data from merged columns and replacing null values

I'd like to merge two tables while replacing the null values in one table with the non-null values from another.
The code below is an example of the tables to be merged:
import numpy as np
import pandas as pd

# Table 1 (has rows with missing values)
a=['x','x','x','y','y','y']
b=['z', 'z', 'z', 'w', 'w', 'w']
c=[1, 1, 1, np.nan, np.nan, np.nan]
table_1=pd.DataFrame({'a':a, 'b':b, 'c':c})
table_1
table_1
a b c
0 x z 1.0
1 x z 1.0
2 x z 1.0
3 y w NaN
4 y w NaN
5 y w NaN
# Table 2 (to be appended to table_1; its values in column 'c' should replace the NaNs in the same column of table_1)
a=['y', 'y', 'y']
b=['w', 'w', 'w']
c=[2,2,2]
table_2=pd.DataFrame({'a':a, 'b':b, 'c':c})
table_2
a b c
0 y w 2
1 y w 2
2 y w 2
This is the code I use for merging the two tables, and the output I get:
# Merging the two tables
merged_table=pd.merge(table_1, table_2, on=['a', 'b'], how='left')
merged_table
Current output (I don't understand why the number of rows has increased):
a b c_x c_y
0 x z 1.0 NaN
1 x z 1.0 NaN
2 x z 1.0 NaN
3 y w NaN 2.0
4 y w NaN 2.0
5 y w NaN 2.0
6 y w NaN 2.0
7 y w NaN 2.0
8 y w NaN 2.0
9 y w NaN 2.0
10 y w NaN 2.0
11 y w NaN 2.0
Desired output (to replace the null values in the 'c' column in table_1 with the numeric values from table_2):
a b c
0 x z 1.0
1 x z 1.0
2 x z 1.0
3 y w 2.0
4 y w 2.0
5 y w 2.0

Try:
out = pd.concat([table_1, table_2]).dropna(subset=['c']).reset_index(drop=True)
(DataFrame.append also worked here, but it was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the safe choice.)
Output of out:
a b c
0 x z 1.0
1 x z 1.0
2 x z 1.0
3 y w 2.0
4 y w 2.0
5 y w 2.0
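For what it's worth, the row count grows because merging on ['a', 'b'] is a many-to-many join here: each of the three ('y', 'w') rows in table_1 matches all three ('y', 'w') rows in table_2, giving 3 × 3 = 9 such rows. A sketch of a merge-based route that avoids the blow-up, assuming table_2 carries a single 'c' value per ('a', 'b') key (deduplicate first, then fill the gaps):

```python
import numpy as np
import pandas as pd

table_1 = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y'],
                        'b': ['z', 'z', 'z', 'w', 'w', 'w'],
                        'c': [1, 1, 1, np.nan, np.nan, np.nan]})
table_2 = pd.DataFrame({'a': ['y', 'y', 'y'],
                        'b': ['w', 'w', 'w'],
                        'c': [2, 2, 2]})

# one lookup row per key makes the join many-to-one,
# so table_1's six rows are preserved
lookup = table_2.drop_duplicates(subset=['a', 'b'])
out = table_1.merge(lookup, on=['a', 'b'], how='left', suffixes=('', '_new'))
out['c'] = out['c'].fillna(out.pop('c_new'))
```

pop removes the helper column in the same step that supplies the fill values, so out keeps exactly the original three columns.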

Related

group dataframe if the column has the same value in consecutive order

let's say I have a dataframe that looks like below:
I want to assign my assets to one group if the treatments are consecutive. If there are at most two consecutive assets without treatment in between, they can still be assigned to the same group. However, if there are more than two assets without treatment, those assets (without treatment) get an empty group, and the next assets with treatment are assigned to a new group.
You can use a rolling check of whether there was at least one Y in the last N+1 rows.
I am providing two options, depending on whether or not it's important not to label the leading/trailing Ns:
import pandas as pd

# Treatment sequence taken from the output table below
df = pd.DataFrame({'Treatment': list('YYYNNYYYNNNYYYYN')})

# maximal number of days without treatment
# to remain in same group
N = 2
m = df['Treatment'].eq('Y')
group = m.rolling(N+1, min_periods=1).max().eq(0)
group = (group & ~group.shift(fill_value=False)).cumsum().add(1)
df['group'] = group

# don't label leading/trailing N
m1 = m.groupby(group).cummax()
m2 = m[::-1].groupby(group).cummax()
df['group2'] = group.where(m1&m2)
print(df)
To handle the last NaNs separately:
m3 = ~m[::-1].cummax()
df['group3'] = group.where(m1&m2|m3)
Output:
Treatment group group2 group3
0 Y 1 1.0 1.0
1 Y 1 1.0 1.0
2 Y 1 1.0 1.0
3 N 1 1.0 1.0
4 N 1 1.0 1.0
5 Y 1 1.0 1.0
6 Y 1 1.0 1.0
7 Y 1 1.0 1.0
8 N 1 NaN NaN
9 N 1 NaN NaN
10 N 2 NaN NaN
11 Y 2 2.0 2.0
12 Y 2 2.0 2.0
13 Y 2 2.0 2.0
14 Y 2 2.0 2.0
15 N 2 NaN 2.0
Other example for N=1:
Treatment group group2 group3
0 Y 1 1.0 1.0
1 Y 1 1.0 1.0
2 Y 1 1.0 1.0
3 N 1 NaN NaN
4 N 2 NaN NaN
5 Y 2 2.0 2.0
6 Y 2 2.0 2.0
7 Y 2 2.0 2.0
8 N 2 NaN NaN
9 N 3 NaN NaN
10 N 3 NaN NaN
11 Y 3 3.0 3.0
12 Y 3 3.0 3.0
13 Y 3 3.0 3.0
14 Y 3 3.0 3.0
15 N 3 NaN 3.0

Create columns in python data frame based on existing column-name and column-values

I have a dataframe in pandas:
import pandas as pd
# assign data of lists.
data = {'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
        'Employment': ['R', 'U', 'E', 'R', 'U', 'E', 'R', 'U', 'E', 'R', 'U', 'E'],
        'Age': ['Y', 'M', 'O', 'Y', 'M', 'O', 'Y', 'M', 'O', 'Y', 'M', 'O']}
# Create DataFrame
df = pd.DataFrame(data)
df
What I want is to create for each category of each existing column a new column with the following format:
Gender_M -> for when the gender equals M
Gender_F -> for when the gender equal F
Employment_R -> for when employment equals R
Employment_U -> for when employment equals U
and so on...
So far, I have created the below code:
for i in range(len(df.columns)):
    curent_column = list(df.columns)[i]
    col_df_array = df[curent_column].unique()
    for j in range(col_df_array.size):
        new_col_name = str(list(df.columns)[i]) + "_" + col_df_array[j]
        for index, row in df.iterrows():
            if(row[curent_column] == col_df_array[j]):
                df[new_col_name] = row[curent_column]
The problem is that even though I have managed to create successfully the column names, I am not able to get the correct column values.
For example the column Gender should be as below:
data2 = {'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
         'Gender_M': ['M', 'na', 'M', 'na', 'M', 'na', 'M', 'na', 'M', 'na', 'M', 'na'],
         'Gender_F': ['na', 'F', 'na', 'F', 'na', 'F', 'na', 'F', 'na', 'F', 'na', 'F']}
df2 = pd.DataFrame(data2)
Just to say, the na can be anything such as blanks or dots or NAN.
You're looking for pd.get_dummies.
>>> pd.get_dummies(df)
Gender_F Gender_M Employment_E Employment_R Employment_U Age_M Age_O Age_Y
0 0 1 0 1 0 0 0 1
1 1 0 0 0 1 1 0 0
2 0 1 1 0 0 0 1 0
3 1 0 0 1 0 0 0 1
4 0 1 0 0 1 1 0 0
5 1 0 1 0 0 0 1 0
6 0 1 0 1 0 0 0 1
7 1 0 0 0 1 1 0 0
8 0 1 1 0 0 0 1 0
9 1 0 0 1 0 0 0 1
10 0 1 0 0 1 1 0 0
11 1 0 1 0 0 0 1 0
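One caveat with newer pandas: since 2.0, get_dummies returns boolean columns by default, so the table above would display True/False instead of 0/1; passing dtype=int restores the integer form (a minimal sketch on a shortened version of the data):

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F'],
                   'Employment': ['R', 'U'],
                   'Age': ['Y', 'M']})
# force 0/1 columns regardless of the pandas version's default dtype
dummies = pd.get_dummies(df, dtype=int)
```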
If you are trying to get the data in a format like your df2 example, I believe this is what you are looking for.
ndf = pd.get_dummies(df)
df.join(ndf.mul(ndf.columns.str.split('_').str[-1]))
Output:
Old Answer
df[['Gender']].join(pd.get_dummies(df[['Gender']]).mul(df['Gender'], axis=0).replace('', np.nan))
Output:
Gender Gender_F Gender_M
0 M NaN M
1 F F NaN
2 M NaN M
3 F F NaN
4 M NaN M
5 F F NaN
6 M NaN M
7 F F NaN
8 M NaN M
9 F F NaN
10 M NaN M
11 F F NaN
If you are okay with 0s and 1s in your new columns, then using get_dummies (as suggested by @richardec) should be the most straightforward.
However, if you want a specific letter in each of your new columns, then another method is to loop through the current columns and the specific categories within each column, and create a new column from this information using apply.
for col in data.keys():
    categories = list(df[col].unique())
    for category in categories:
        df[f"{col}_{category}"] = df[col].apply(lambda x: category if x == category else float("nan"))
Result:
>>> df
Gender Employment Age Gender_M Gender_F Employment_R Employment_U Employment_E Age_Y Age_M Age_O
0 M R Y M NaN R NaN NaN Y NaN NaN
1 F U M NaN F NaN U NaN NaN M NaN
2 M E O M NaN NaN NaN E NaN NaN O
3 F R Y NaN F R NaN NaN Y NaN NaN
4 M U M M NaN NaN U NaN NaN M NaN
5 F E O NaN F NaN NaN E NaN NaN O
6 M R Y M NaN R NaN NaN Y NaN NaN
7 F U M NaN F NaN U NaN NaN M NaN
8 M E O M NaN NaN NaN E NaN NaN O
9 F R Y NaN F R NaN NaN Y NaN NaN
10 M U M M NaN NaN U NaN NaN M NaN
11 F E O NaN F NaN NaN E NaN NaN O
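The same result can also be had without apply by masking with Series.where, which keeps a value only where the condition holds; a minimal sketch on a shortened version of the question's data:

```python
import pandas as pd

data = {'Gender': ['M', 'F', 'M'],
        'Employment': ['R', 'U', 'E'],
        'Age': ['Y', 'M', 'O']}
df = pd.DataFrame(data)

# snapshot the column list first, since the loop adds columns
for col in df.columns.tolist():
    for category in df[col].unique():
        # keep the value where it matches the category, NaN elsewhere
        df[f"{col}_{category}"] = df[col].where(df[col].eq(category))
```

where is vectorized per column, so this avoids the row-by-row lambda of the apply version.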

Pandas Merge(): Appending data from merged columns and replace null values (Extension from question https://stackoverflow.com/questions/68471939)

I'd like to merge two tables, replacing the null values in one column of the first table with the non-null values from the same column in the second.
The code below is an example of the tables to be merged:
import numpy as np
import pandas as pd

# Table 1 (has rows with missing values)
a=['x','x','x','y','y','y']
b=['z', 'z', 'z', 'w', 'w', 'w']
c=[1 for x in a]
d=[2 for x in a]
e=[3 for x in a]
f=[4 for x in a]
g=[1, 1, 1, np.nan, np.nan, np.nan]
table_1=pd.DataFrame({'a':a, 'b':b, 'c':c, 'd':d, 'e':e, 'f':f, 'g':g})
table_1
table_1
a b c d e f g
0 x z 1 2 3 4 1.0
1 x z 1 2 3 4 1.0
2 x z 1 2 3 4 1.0
3 y w 1 2 3 4 NaN
4 y w 1 2 3 4 NaN
5 y w 1 2 3 4 NaN
# Table 2 (to be merged with table_1; its values in column 'g' should replace the NaNs in the same column of table_1, while keeping the other non-null rows)
a=['y', 'y', 'y']
b=['w', 'w', 'w']
g=[2,2,2]
table_2=pd.DataFrame({'a':a, 'b':b, 'g':g})
table_2
a b g
0 y w 2
1 y w 2
2 y w 2
This is the code I use for merging the two tables, and the output I get:
merged_table=pd.merge(table_1, table_2, on=['a', 'b'], how='left')
merged_table
Current output:
a b c d e f g_x g_y
0 x z 1 2 3 4 1.0 NaN
1 x z 1 2 3 4 1.0 NaN
2 x z 1 2 3 4 1.0 NaN
3 y w 1 2 3 4 NaN 2.0
4 y w 1 2 3 4 NaN 2.0
5 y w 1 2 3 4 NaN 2.0
6 y w 1 2 3 4 NaN 2.0
7 y w 1 2 3 4 NaN 2.0
8 y w 1 2 3 4 NaN 2.0
9 y w 1 2 3 4 NaN 2.0
10 y w 1 2 3 4 NaN 2.0
11 y w 1 2 3 4 NaN 2.0
Desired output:
a b c d e f g
0 x z 1 2 3 4 1.0
1 x z 1 2 3 4 1.0
2 x z 1 2 3 4 1.0
3 y w 1 2 3 4 2.0
4 y w 1 2 3 4 2.0
5 y w 1 2 3 4 2.0
There are two problems to solve:
First, the 'g' column type: it should be float in both tables, so cast both with DataFrame.astype({'column_name': 'type'}).
Second, the indexes. Inserting by index is safe here because the other columns of table_1 already hold the same data ('y w 1 2 3 4'). So we select the rows of table_1 where 'g' is NaN, ind = table_1[pd.isnull(table_1['g'])], and build a new Series from table_2['g'] carrying those indexes: pd.Series(table_2['g'].to_list(), index=ind.index).
Try this solution:
table_1=table_1.astype({'a':'str','b':'str','g':'float'})
table_2=table_2.astype({'a':'str','b':'str','g':'float'})
ind=table_1[pd.isnull(table_1['g'])]
table_1.loc[ind.index,'g']=pd.Series(table_2['g'].to_list(),index=ind.index)
Here is the output.
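An alternative sketch that does not depend on the two tables lining up row by row, assuming table_2 holds one 'g' value per ('a', 'b') key: deduplicate table_2, left-merge, and fill the NaNs:

```python
import numpy as np
import pandas as pd

table_1 = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'y'],
                        'b': ['z', 'z', 'z', 'w', 'w', 'w'],
                        'c': 1, 'd': 2, 'e': 3, 'f': 4,
                        'g': [1, 1, 1, np.nan, np.nan, np.nan]})
table_2 = pd.DataFrame({'a': ['y', 'y', 'y'], 'b': ['w', 'w', 'w'], 'g': [2, 2, 2]})

# one row per key keeps the left join from duplicating table_1's rows
lookup = table_2.drop_duplicates(subset=['a', 'b'])
out = table_1.merge(lookup, on=['a', 'b'], how='left', suffixes=('', '_new'))
out['g'] = out['g'].fillna(out.pop('g_new'))
```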

How can I append a Series to an existing DataFrame row with pandas?

I want to take a series and append it to an existing dataframe row. For example:
df
A B C
0 2 3 4
1 5 6 7
2 7 8 9
series
0 x
1 y
2 z
-->
A B C D E F
0 2 3 4 x y z
1 5 6 7 ...
2 7 8 9 ...
I want to do this using a for loop, appending a different series to each row of the dataframe. The series may have different lengths. Is there an easy way to accomplish this?
Use loc with the Series's index as the column names:
import pandas as pd

lst = [
    [2, 3, 4],
    [5, 6, 7],
    [7, 8, 9],
]
df = pd.DataFrame(lst, columns=list("ABC"))
print(df)
###
A B C
0 2 3 4
1 5 6 7
2 7 8 9
s1 = pd.Series(list("xyz"))
s1.index = list("DEF")
print(s1)
###
D x
E y
F z
dtype: object
s2 = pd.Series(list("abcd"))
s2.index = list("GHIJ")
print(s2)
###
G a
H b
I c
J d
dtype: object
for idx, s in enumerate([s1, s2]):
    df.loc[idx, s.index] = s.values
print(df)
###
A B C D E F G H I J
0 2 3 4 x y z NaN NaN NaN NaN
1 5 6 7 NaN NaN NaN a b c d
2 7 8 9 NaN NaN NaN NaN NaN NaN NaN
Try this:
df['D'], df['E'], df['F'] = s.tolist()
And now:
print(df)
Gives:
A B C D E F
0 2 3 4 x y z
1 5 6 7 x y z
2 7 8 9 x y z
Edit:
If you are not sure how many extra values there are, try:
from string import ascii_uppercase as letters
df = df.assign(**dict(zip(
    [letters[i + len(df.columns)] for i, v in enumerate(series)],
    series.tolist(),
)))
print(df)
Output:
A B C D E F
0 2 3 4 x y z
1 5 6 7 x y z
2 7 8 9 x y z
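Combining the two answers, a sketch that derives each appended series's column labels inside the loop from the next unused uppercase letters (the letter-based naming is just an assumption carried over from the example):

```python
from string import ascii_uppercase as letters

import pandas as pd

df = pd.DataFrame([[2, 3, 4], [5, 6, 7], [7, 8, 9]], columns=list("ABC"))
rows = [pd.Series(list("xyz")), pd.Series(list("abcd"))]  # different lengths

for idx, s in enumerate(rows):
    start = len(df.columns)                        # first unused letter position
    s.index = list(letters[start:start + len(s)])
    # create the new columns (filled with NaN), then set this row's values
    df = df.reindex(columns=[*df.columns, *s.index])
    df.loc[idx, s.index] = s.values
```

Rows that receive no series for a given batch of columns are left as NaN, matching the first answer's output.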

Transforming pandas dataframe: unpivot

I have the following dataframe :
commune nuance_1 votes_1 nuance_2 votes_2 nuance_3 votes_3
A X 12 Y 20 Z 5
B X 10 Y 5 NaN NaN
C Z 7 X 2 NaN NaN
and I would like to obtain after transformation :
commune nuance votes
A X 12
A Y 20
A Z 5
B X 10
B Y 5
C Z 7
C X 2
Is there a way to do this (a sort of unpivot)?
You can use pd.wide_to_long here:
out = (pd.wide_to_long(df, ['nuance', 'votes'], 'commune', 'j', sep='_')
         .dropna(how='all')
         .sort_index()
         .droplevel(1)
         .reset_index())
print(out)
commune nuance votes
0 A X 12.0
1 A Y 20.0
2 A Z 5.0
3 B X 10.0
4 B Y 5.0
5 C Z 7.0
6 C X 2.0
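If wide_to_long feels opaque, the same unpivot can be sketched by stacking the (nuance_i, votes_i) column pairs with pd.concat (data reconstructed from the question's table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'commune': ['A', 'B', 'C'],
    'nuance_1': ['X', 'X', 'Z'], 'votes_1': [12, 10, 7],
    'nuance_2': ['Y', 'Y', 'X'], 'votes_2': [20, 5, 2],
    'nuance_3': ['Z', np.nan, np.nan], 'votes_3': [5, np.nan, np.nan],
})

# one frame per (nuance_i, votes_i) pair, renamed to common column names
parts = [
    df[['commune', f'nuance_{i}', f'votes_{i}']]
      .rename(columns={f'nuance_{i}': 'nuance', f'votes_{i}': 'votes'})
    for i in (1, 2, 3)
]
out = (pd.concat(parts)
         .dropna(subset=['nuance'])
         .sort_values('commune', kind='mergesort')  # stable sort keeps pair order
         .reset_index(drop=True))
```

The stable sort groups the rows by commune while preserving the 1, 2, 3 order of the original column pairs within each commune.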