Join two dataframes with partially matching columns - pandas

I want to join two dataframes and get the result below. I have tried many ways, but they all fail.
I want only the values in df2['A'] that contain a value from df1['A']. What do I need to change in my code?
I want:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
import pandas as pd

df1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
})
df2 = pd.DataFrame({
    "A": ["A0_link0", "A1_link1", "A2_link2", "A3_link3", "A4_link4", "An_linkn"],
    "B": ["B0_link0", "B1_link1", "B2_link2", "B3_link3", "B4_link4", "Bn_linkn"],
})
result = pd.concat([df1, df2], ignore_index=True, join="inner", sort=False)
print(result)

Create an intermediate dataframe and map:
d = (df2.assign(key=df2['A'].str.extract(r'([^_]+)', expand=False))
        .set_index('key'))
df1['A'].map(d['A'])
Output:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
Name: A, dtype: object
Or use merge if you want several columns from df2: df1.merge(d, left_on='A', right_index=True).
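A minimal sketch of that merge variant, reusing the d built above (the overlapping column name A gets the usual _x/_y suffixes):
result = df1.merge(d, left_on='A', right_index=True)
print(result)
  A_x       A_y         B
0  A0  A0_link0  B0_link0
1  A1  A1_link1  B1_link1
2  A2  A2_link2  B2_link2
3  A3  A3_link3  B3_link3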

You can set the An prefix as the index on both frames and use pd.concat on columns:
result = (pd.concat([df1.set_index(df1['A']),
                     df2.set_index(df2['A'].str.split('_').str[0])],
                    axis=1, join="inner", sort=False)
            .reset_index(drop=True))
print(result)
    A         A         B
0  A0  A0_link0  B0_link0
1  A1  A1_link1  B1_link1
2  A2  A2_link2  B2_link2
3  A3  A3_link3  B3_link3

You can also filter df2 directly, keeping the rows whose prefix before the underscore appears in df1.A:
df2.A.loc[df2.A.str.split('_', expand=True).iloc[:, 0].isin(df1.A)]
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3

Related

How to use method chaining in pandas to aggregate a DataFrame?

I want to aggregate a pandas DataFrame using method chaining. I don't know how to start with the DataFrame and just refer to it when aggregating (using method chaining). Consider the following example that illustrates my intention:
Having this data:
import pandas as pd

my_df = pd.DataFrame({
    'name': ['john', 'diana', 'rachel', 'chris'],
    'favorite_color': ['red', 'blue', 'green', 'red']
})
my_df
#>      name favorite_color
#> 0   john            red
#> 1  diana           blue
#> 2 rachel          green
#> 3  chris            red
and I want to end up with this summary table:
#>    total_people  total_ppl_who_like_red
#> 0             4                       2
Of course there are so many ways to do it. One way, for instance, would be to build a new DataFrame:
desired_output_via_building_new_df = pd.DataFrame({
    'total_people': [len(my_df)],
    'total_ppl_who_like_red': [my_df.favorite_color.eq('red').sum()]
})
desired_output_via_building_new_df
#>    total_people  total_ppl_who_like_red
#> 0             4                       2
However, I'm looking for a way to use "method chaining"; starting with my_df and working my way forward. Something along the lines of
# pseudo-code; not really working
my_df.agg({
    'total_people': lambda x: len(x),
    'total_ppl_who_like_red': lambda x: x.favorite_color.eq('red').sum()
})
For inspiration, here is the equivalent R/dplyr code:
library(dplyr, warn.conflicts = FALSE)
my_df <- data.frame(
  name = c("john", "diana", "rachel", "chris"),
  favorite_color = c("red", "blue", "green", "red")
)
my_df |>
  summarise(total_people = n(),
            total_ppl_who_like_red = sum(favorite_color == "red"))
## in the context of `summarise()`, both `n()` and `sum()` refer to `my_df`,
## because we start with `my_df` and pipe it "forward" to `summarise()`
#>   total_people total_ppl_who_like_red
#> 1            4                      2
Solution for processing one Series:
df = my_df.favorite_color.apply({
    'total_people': lambda x: x.count(),
    'total_ppl_who_like_red': lambda x: x.eq('red').sum()
}).to_frame(name=0).T
print(df)
   total_people  total_ppl_who_like_red
0             4                       2
A general solution for processing the whole DataFrame is DataFrame.pipe: pandas then passes the entire input DataFrame to the function, whereas apply or agg would process each column separately:
df = (my_df.pipe(lambda x: pd.Series({'total_people': len(x),
                                      'total_ppl_who_like_red': x.favorite_color.eq('red').sum()}))
           .to_frame(name=0).T)
print(df)
   total_people  total_ppl_who_like_red
0             4                       2
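The pattern extends to any number of aggregations. my_df2 is not defined above; a hypothetical version of my_df with an added age column (values assumed here so that max_age comes out as 41) could be:
my_df2 = my_df.assign(age=[25, 41, 33, 18])  # hypothetical ages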
df = (my_df2.pipe(lambda x: pd.Series({'total_people': len(x),
                                       'total_ppl_who_like_red': x.favorite_color.eq('red').sum(),
                                       'max_age': x.age.max()}))
            .to_frame(name=0).T)
print(df)
   total_people  total_ppl_who_like_red  max_age
0             4                       2       41
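Another chainable option, not from the answers above: collapse all rows into a single group and use named aggregation, a common workaround:
out = (my_df
       .assign(is_red=my_df.favorite_color.eq('red'))
       .groupby(lambda _: 0)  # every row lands in one group labelled 0
       .agg(total_people=('name', 'size'),
            total_ppl_who_like_red=('is_red', 'sum')))
print(out)
   total_people  total_ppl_who_like_red
0             4                       2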

The pandas isin() function but returning the actual values, not just a boolean

I have an NumPy array of good animals, and a DataFrame of people with a list of animals they own.
import numpy as np
import pandas as pd

good_animals = np.array(['Owl', 'Dragon', 'Shark', 'Cat', 'Unicorn', 'Penguin'])
data = {
    'People': [1, 2, 3, 4, 5],
    'Animals': [['Owl'], ['Owl', 'Dragon'], ['Dog', 'Human'], ['Unicorn', 'Pitbull'], []],
}
df = pd.DataFrame(data)
I want to add another column to my DataFrame, showing all the good animals that person owns.
The following gives me a Series showing whether or not each animal is a good animal.
df['Animals'].apply(lambda x: np.isin(x, good_animals))
But I want to see the actual good animals, not just booleans.
You can use set intersection on each list:
df['new'] = df['Animals'].apply(lambda x: list(set(x).intersection(good_animals)))
print(df)
   People             Animals            new
0       1               [Owl]          [Owl]
1       2       [Owl, Dragon]  [Dragon, Owl]
2       3        [Dog, Human]             []
3       4  [Unicorn, Pitbull]      [Unicorn]
4       5                  []             []
If duplicated values are possible or order matters, use a list comprehension:
s = set(good_animals)
df['new'] = df['Animals'].apply(lambda x: [y for y in x if y in s])
print(df)
   People             Animals            new
0       1               [Owl]          [Owl]
1       2       [Owl, Dragon]  [Owl, Dragon]
2       3        [Dog, Human]             []
3       4  [Unicorn, Pitbull]      [Unicorn]
4       5                  []             []
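For larger frames, a sketch of an explode-based alternative (same output; order is preserved, and the NaN produced by empty lists is converted back at the end):
s = df['Animals'].explode()
df['new'] = (s[s.isin(good_animals)]
             .groupby(level=0).agg(list)
             .reindex(df.index)
             .apply(lambda x: x if isinstance(x, list) else []))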
I'm not sure I understood your question fully. Why are you using np.array? You can try this:
import pandas as pd

good_animals = ['Owl', 'Dragon', 'Shark', 'Cat', 'Unicorn', 'Penguin']
df_dict = {
    'People': ["1", "2", "3", "4", "5"],
    'Animals': [['Owl'], ['Owl', 'Dragon'], ['Dog', 'Human'], ['Unicorn', 'Pitbull'], []],
    'Good_animals': [None, None, None, None, None]
}
df = pd.DataFrame(df_dict)
for x in range(df.shape[0]):
    # .loc avoids the chained-assignment pitfalls of writing through .iloc on a column
    df.loc[x, 'Good_animals'] = ', '.join(y for y in df.Animals.iloc[x] if y in good_animals)
The result:
   People             Animals Good_animals
0       1               [Owl]          Owl
1       2       [Owl, Dragon]  Owl, Dragon
2       3        [Dog, Human]
3       4  [Unicorn, Pitbull]      Unicorn
4       5                  []

How to show rows with data which are not equal?

I have two tables
import pandas as pd
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df1 = pd.DataFrame(np.array([[1, 2, 4], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
print(df1.equals(df2))
I want to compare them and get the same result as df.compare(df1), or at least something close to it. I can't use that function because my compiler states that 'DataFrame' object has no attribute 'compare'.
First approach:
Let's compare value by value:
In [1183]: eq_df = df1.eq(df2)
In [1196]: eq_df
Out[1196]:
      a     b      c
0  True  True  False
1  True  True   True
2  True  True   True
Then let's reduce it down to see which rows are equal for all columns
from functools import reduce
In [1285]: eq_ser = reduce(np.logical_and, (eq_df[c] for c in eq_df.columns))
In [1288]: eq_ser
Out[1288]:
0    False
1     True
2     True
dtype: bool
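Equivalently, without functools, DataFrame.all collapses the comparison across columns:
eq_ser = eq_df.all(axis=1)  # True only where every column matched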
Now we can print out the rows which are not equal
In [1310]: df1[~eq_ser]
Out[1310]:
   a  b  c
0  1  2  4

In [1316]: df2[~eq_ser]
Out[1316]:
   a  b  c
0  1  2  3
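Note that DataFrame.compare itself was only added in pandas 1.1, so on a recent enough pandas the built-in does this directly:
df1.compare(df2)  # shows the differing cells as 'self' vs 'other'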
Second approach:
from collections import namedtuple
from typing import Tuple

def diff_dataframes(
    df1, df2, compare_cols=None
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Given two dataframes and column(s) to compare, return three dataframes with rows:
    - common between the two dataframes
    - found only in the left dataframe
    - found only in the right dataframe
    """
    df1 = df1.fillna(pd.NA)
    df = df1.merge(df2.fillna(pd.NA), how="outer", on=compare_cols, indicator=True)
    df_both = df.loc[df["_merge"] == "both"].drop(columns="_merge")
    df_left = df.loc[df["_merge"] == "left_only"].drop(columns="_merge")
    df_right = df.loc[df["_merge"] == "right_only"].drop(columns="_merge")
    tup = namedtuple("df_diff", ["common", "left", "right"])
    return tup(df_both, df_left, df_right)
Usage:
In [1366]: b, l, r = diff_dataframes(df1, df2)
In [1371]: l
Out[1371]:
   a  b  c
0  1  2  4

In [1372]: r
Out[1372]:
   a  b  c
3  1  2  3
Third approach: count the equal cells per row and mark a row as equal when the count matches the number of columns:
In [1440]: eq_ser = df1.eq(df2).sum(axis=1).eq(len(df1.columns))

Pandas dataframe append to column containing list

I have a pandas dataframe with one column that contains an empty list in each cell.
I need to duplicate the dataframe, and append it at the bottom of the original dataframe, but with additional information in the list.
Here is a minimal code example:
df_main = pd.DataFrame([['a', []], ['b', []]], columns=['letter', 'mylist'])
df_main
  letter mylist
0      a     []
1      b     []

df_copy = df_main.copy()
for index, row in df_copy.iterrows():
    row.mylist = row.mylist.append(1)
pd.concat([df_copy, df_main], ignore_index=True)
result:
  letter mylist
0      a   None
1      b   None
2      a    [1]
3      b    [1]
As you can see, the problem is that the empty list [] was replaced by None.
Just to make sure, this is what I would like to have:
  letter mylist
0      a     []
1      b     []
2      a    [1]
3      b    [1]
How can I achieve that?
The list append method returns None, which is why None ends up in the final dataframe. You can use the + operator and reassign instead:
import pandas as pd

df_main = pd.DataFrame([['a', []], ['b', []]], columns=['letter', 'mylist'])
df_copy = df_main.copy()
for index, row in df_copy.iterrows():
    row.mylist = row.mylist + [1]
pd.concat([df_main, df_copy], ignore_index=True).head()
Output of this block of code:
  letter mylist
0      a     []
1      b     []
2      a    [1]
3      b    [1]
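A sketch that avoids iterrows entirely: build the copy's column with a list comprehension, so every row gets a brand-new list:
df_copy = df_main.assign(mylist=[lst + [1] for lst in df_main['mylist']])
pd.concat([df_main, df_copy], ignore_index=True)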
A workaround would be to create a temporary column mylist2 filled with empty lists via np.empty((len(df), 0)).tolist(), use np.where() to change the None values of mylist to an empty list, and then drop the temporary column.
import numpy as np
import pandas as pd

df_main = pd.DataFrame([['a', []], ['b', []]], columns=['letter', 'mylist'])
df_copy = df_main.copy()
for index, row in df_copy.iterrows():
    row.mylist = row.mylist.append(1)
df = (pd.concat([df_copy, df_main], ignore_index=True)
        # a lambda is needed here: the name df does not exist yet in this expression
        .assign(mylist2=lambda d: np.empty((len(d), 0)).tolist()))
df['mylist'] = np.where(df['mylist'].isnull(), df['mylist2'], df['mylist'])
df = df.drop('mylist2', axis=1)
df
Out[1]:
  letter mylist
0      a     []
1      b     []
2      a    [1]
3      b    [1]
Not only does the list append method return None, as noted in the first answer, but df_main and df_copy also hold pointers to the same list objects. So after:
for index, row in df_copy.iterrows():
    row.mylist.append(1)
both dataframes have updated lists with one element. For your code to work as expected, you can create new lists after you copy the dataframe:
df_copy = df_main.copy()
for index, row in df_copy.iterrows():
    row.mylist = []
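A quick sketch that shows the shared references (object identity survives the copy):
df_copy = df_main.copy()
print(df_main.loc[0, 'mylist'] is df_copy.loc[0, 'mylist'])  # True: .copy() does not clone the lists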
This question is another great example of why we should not put mutable objects in a dataframe.

Concatenating 2 dataframes vertically with empty row in middle

I have a multiindex dataframe df1 as:
node     A1     A2
bkt      B1     B2
Month
1      0.15  -0.83
2      0.06  -0.12

df1.columns
MultiIndex([('A1', 'B1'),
            ('A2', 'B2')],
           names=['node', 'bkt'])
and another similar multiindex dataframe df2 as:
node     A1     A2
bkt      B1     B2
Month
1     -0.02  -0.15
2         0      0
3     -0.01  -0.01
4     -0.06  -0.11
I want to concat them vertically so that the resulting dataframe df3 looks like the following:
df3 = pd.concat([df1, df2], axis=0)
While concatenating I want to introduce two blank rows between df1 and df2. In addition, I want to introduce two strings, Basis Mean and Basis P25, in df3 as shown below.
print(df3)
Basis Mean
node     A1     A2
bkt      B1     B2
Month
1      0.15  -0.83
2      0.06  -0.12

Basis P25
node     A1     A2
bkt      B1     B2
Month
1     -0.02  -0.15
2         0      0
3     -0.01  -0.01
4     -0.06  -0.11
I don't know whether there is any way of doing the above.
I don't think what you are describing is really a concatenation.
The following could already do the trick:
print('Basis Mean')
print(df1.to_string())
print('\n')
print('Basis P25')
print(df2.to_string())
This isn't usually how DataFrames are used, but perhaps you wish to append rows of empty strings in between df1 and df2, along with rows containing your titles?
# DataFrame.append was removed in pandas 2.0; building the filler rows as
# small DataFrames and concatenating once does the same job
title_mean = pd.DataFrame([["Basis", "Mean"]], columns=df1.columns)
blank = pd.DataFrame([[""] * len(df1.columns)] * 2, columns=df1.columns)
title_p25 = pd.DataFrame([["Basis", "P25"]], columns=df1.columns)
df3 = pd.concat([title_mean, df1, blank, title_p25, df2], axis=0)
The author clarified in a comment that they want to make it easy to print to an Excel file. That can be achieved using pd.ExcelWriter.
Below is an example of how to do it.
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
import pandas as pd

@dataclass
class SaveTask:
    df: pd.DataFrame
    header: Optional[str]
    extra_pd_settings: Optional[Dict[str, Any]] = None

def fill_xlsx(
    save_tasks: List[SaveTask],
    writer: pd.ExcelWriter,
    sheet_name: str = "Sheet1",
    n_rows_between_blocks: int = 2,
) -> None:
    current_row = 0
    for save_task in save_tasks:
        extra_pd_settings = save_task.extra_pd_settings or {}
        if "startrow" in extra_pd_settings:
            raise ValueError(
                "You should not use parameter 'startrow' in extra_pd_settings"
            )
        save_task.df.to_excel(
            writer,
            sheet_name=sheet_name,
            startrow=current_row + 1,
            **extra_pd_settings
        )
        worksheet = writer.sheets[sheet_name]
        worksheet.write(current_row, 0, save_task.header)
        has_header = extra_pd_settings.get("header", True)
        current_row += (
            1 + save_task.df.shape[0] + n_rows_between_blocks + int(has_header)
        )

if __name__ == "__main__":
    # INPUTS
    df1 = pd.DataFrame(
        {"hello": [1, 2, 3, 4], "world": [0.55, 1.12313, 23.12, 0.0]}
    )
    df2 = pd.DataFrame(
        {"foo": [3, 4]},
        index=pd.MultiIndex.from_tuples([("foo", "bar"), ("baz", "qux")]),
    )
    # Xlsx creation
    writer = pd.ExcelWriter("test.xlsx", engine="xlsxwriter")
    fill_xlsx(
        [
            SaveTask(
                df1,
                "Hello World Table",
                {"index": False, "float_format": "%.3f"},
            ),
            SaveTask(df2, "Foo Table with MultiIndex"),
        ],
        writer,
    )
    writer.close()  # .save() was removed in favor of .close() in pandas 2.0
As an extra bonus, pd.ExcelWriter lets you save data to different sheets of the same Excel file and choose their names right from Python code.
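A minimal sketch of that multi-sheet usage, with hypothetical sheet names:
with pd.ExcelWriter("report.xlsx", engine="xlsxwriter") as writer:
    fill_xlsx([SaveTask(df1, "Hello World Table")], writer, sheet_name="first")
    fill_xlsx([SaveTask(df2, "Foo Table")], writer, sheet_name="second")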