The pandas isin() function but returning the actual values, not just a boolean - pandas

I have a NumPy array of good animals, and a DataFrame of people with a list of animals each person owns.
good_animals = np.array(['Owl', 'Dragon', 'Shark', 'Cat', 'Unicorn', 'Penguin'])
data = {
    'People': [1, 2, 3, 4, 5],
    'Animals': [['Owl'], ['Owl', 'Dragon'], ['Dog', 'Human'], ['Unicorn', 'Pitbull'], []],
}
df = pd.DataFrame(data)
I want to add another column to my DataFrame, showing all the good animals that person owns.
The following gives me a Series showing whether or not each animal is a good animal.
df['Animals'].apply(lambda x: np.isin(x, good_animals))
But I want to see the actual good animals, not just booleans.
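One way to get values rather than booleans is to index with the mask that np.isin produces; a minimal sketch on a single person's list (the `owned` variable is illustrative, not from the question):

import numpy as np

good_animals = np.array(['Owl', 'Dragon', 'Shark', 'Cat', 'Unicorn', 'Penguin'])
owned = np.array(['Unicorn', 'Pitbull'])  # one person's animals, as an array

# The boolean mask from np.isin can index back into the owned array,
# recovering the matching values instead of True/False.
mask = np.isin(owned, good_animals)
print(owned[mask])  # ['Unicorn']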

You can use set intersection:
df['new'] = df['Animals'].apply(lambda x: list(set(x).intersection(good_animals)))
print (df)
   People             Animals            new
0       1               [Owl]          [Owl]
1       2       [Owl, Dragon]  [Dragon, Owl]
2       3        [Dog, Human]             []
3       4  [Unicorn, Pitbull]      [Unicorn]
4       5                  []             []
If duplicated values are possible, or if order is important, use a list comprehension:
s = set(good_animals)
df['new'] = df['Animals'].apply(lambda x: [y for y in x if y in s])
print (df)
   People             Animals            new
0       1               [Owl]          [Owl]
1       2       [Owl, Dragon]  [Owl, Dragon]
2       3        [Dog, Human]             []
3       4  [Unicorn, Pitbull]      [Unicorn]
4       5                  []             []
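If sorted, de-duplicated output is acceptable, np.intersect1d is another option; note it sorts the result, unlike the list comprehension (a sketch with the question's data):

import numpy as np
import pandas as pd

good_animals = np.array(['Owl', 'Dragon', 'Shark', 'Cat', 'Unicorn', 'Penguin'])
df = pd.DataFrame({
    'People': [1, 2, 3, 4, 5],
    'Animals': [['Owl'], ['Owl', 'Dragon'], ['Dog', 'Human'], ['Unicorn', 'Pitbull'], []],
})

# Cast each row's list to a string array so empty rows don't become float arrays;
# np.intersect1d returns the sorted unique elements common to both arrays.
df['new'] = df['Animals'].apply(
    lambda x: np.intersect1d(np.asarray(x, dtype='U'), good_animals).tolist())
print(df['new'].tolist())  # [['Owl'], ['Dragon', 'Owl'], [], ['Unicorn'], []]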

I'm not sure I understood your question correctly. Why are you using np.array? You can try this:
import pandas as pd

good_animals = ['Owl', 'Dragon', 'Shark', 'Cat', 'Unicorn', 'Penguin']
df_dict = {
    'People': ["1", "2", "3", "4", "5"],
    'Animals': [['Owl'], ['Owl', 'Dragon'], ['Dog', 'Human'], ['Unicorn', 'Pitbull'], []],
    'Good_animals': [None, None, None, None, None]
}
df = pd.DataFrame(df_dict)
for x in range(df.shape[0]):
    # .loc assignment avoids the chained-indexing warning that assigning via .iloc on a column raises
    df.loc[x, 'Good_animals'] = ', '.join(y for y in df.Animals.iloc[x] if y in good_animals)
The result:
  People             Animals Good_animals
0      1               [Owl]          Owl
1      2       [Owl, Dragon]  Owl, Dragon
2      3        [Dog, Human]
3      4  [Unicorn, Pitbull]      Unicorn
4      5                  []
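The explicit loop with row-wise assignment above can also be written as a single apply, which sidesteps positional assignment entirely; a sketch with the same data:

import pandas as pd

good_animals = ['Owl', 'Dragon', 'Shark', 'Cat', 'Unicorn', 'Penguin']
df = pd.DataFrame({
    'People': ["1", "2", "3", "4", "5"],
    'Animals': [['Owl'], ['Owl', 'Dragon'], ['Dog', 'Human'], ['Unicorn', 'Pitbull'], []],
})

# Build the comma-joined string for every row in one pass.
good = set(good_animals)
df['Good_animals'] = df['Animals'].apply(
    lambda animals: ', '.join(a for a in animals if a in good))
print(df['Good_animals'].tolist())  # ['Owl', 'Owl, Dragon', '', 'Unicorn', '']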


How to use method chaining in pandas to aggregate a DataFrame?

I want to aggregate a pandas DataFrame using method chaining. I don't know how to start with the DataFrame and just refer to it when aggregating (using method chaining). Consider the following example that illustrates my intention:
Having this data:
import pandas as pd
my_df = pd.DataFrame({
    'name': ['john', 'diana', 'rachel', 'chris'],
    'favorite_color': ['red', 'blue', 'green', 'red']
})
my_df
#>      name favorite_color
#> 0    john            red
#> 1   diana           blue
#> 2  rachel          green
#> 3   chris            red
and I want to end up with this summary table:
#> total_people total_ppl_who_like_red
#> 0 4 2
Of course there are so many ways to do it. One way, for instance, would be to build a new DataFrame:
desired_output_via_building_new_df = pd.DataFrame({
    'total_people': [len(my_df)],
    'total_ppl_who_like_red': [my_df.favorite_color.eq('red').sum()]
})
desired_output_via_building_new_df
#> total_people total_ppl_who_like_red
#> 0 4 2
However, I'm looking for a way to use "method chaining"; starting with my_df and working my way forward. Something along the lines of
# pseudo-code; not really working
my_df.agg({
    'total_people': lambda x: len(x),
    'total_ppl_who_like_red': lambda x: x.favorite_color.eq('red').sum()
})
I can only borrow inspiration from R/dplyr code:
library(dplyr, warn.conflicts = FALSE)
my_df <- data.frame(
  name = c("john", "diana", "rachel", "chris"),
  favorite_color = c("red", "blue", "green", "red")
)
my_df |>
  summarise(total_people = n(),  ## in the context of `summarise()`, both `n()` and `sum()`
            total_ppl_who_like_red = sum(favorite_color == "red"))  ## refer to `my_df` because we start with `my_df` and pipe it "forward" to `summarise()`
#> total_people total_ppl_who_like_red
#> 1 4 2
Solution for processing one Series:
df = my_df.favorite_color.apply({
    'total_people': lambda x: x.count(),
    'total_ppl_who_like_red': lambda x: x.eq('red').sum()
}).to_frame(name=0).T
print (df)
total_people total_ppl_who_like_red
0 4 2
General solution for processing a whole DataFrame with DataFrame.pipe - pipe hands pandas the input DataFrame itself, whereas apply or agg process each column separately:
df = (my_df.pipe(lambda x: pd.Series({'total_people': len(x),
                                      'total_ppl_who_like_red': x.favorite_color.eq('red').sum()}))
           .to_frame(name=0).T)
print (df)
total_people total_ppl_who_like_red
0 4 2
The same pattern extends to more aggregations; here my_df2 is assumed to be my_df with an additional age column:
df = my_df2.pipe(lambda x: pd.Series({'total_people': len(x),
                                      'total_ppl_who_like_red': x.favorite_color.eq('red').sum(),
                                      'max_age': x.age.max()}).to_frame(name=0).T)
print (df)
total_people total_ppl_who_like_red max_age
0 4 2 41
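A variant sketch that skips the to_frame(name=0).T round-trip by building a one-row DataFrame directly inside pipe (same my_df as above):

import pandas as pd

my_df = pd.DataFrame({
    'name': ['john', 'diana', 'rachel', 'chris'],
    'favorite_color': ['red', 'blue', 'green', 'red']
})

# pipe hands the whole DataFrame to the lambda, so the chain still starts from my_df.
out = my_df.pipe(lambda x: pd.DataFrame({
    'total_people': [len(x)],
    'total_ppl_who_like_red': [x.favorite_color.eq('red').sum()],
}))
print(out)
#    total_people  total_ppl_who_like_red
# 0             4                       2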

Join two DataFrames with special column matching

I want to join two DataFrames and get the result below. I tried many ways, but they all fail.
I want only the values in df2['A'] that contain a value from df1['A']. What do I need to change in my code?
I want:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
import pandas as pd
df1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
})
df2 = pd.DataFrame({
    "A": ["A0_link0", "A1_link1", "A2_link2", "A3_link3", "A4_link4", "An_linkn"],
    "B": ["B0_link0", "B1_link1", "B2_link2", "B3_link3", "B4_link4", "Bn_linkn"],
})
result = pd.concat([df1, df2], ignore_index=True, join= "inner", sort=False)
print(result)
Create an intermediate dataframe and map:
d = (df2.assign(key=df2['A'].str.extract(r'([^_]+)'))
.set_index('key'))
df1['A'].map(d['A'])
Output:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
Name: A, dtype: object
Or use merge if you want several columns from df2: df1.merge(d, left_on='A', right_index=True).
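Spelled out, the merge variant could look like this (a sketch; the suffixes argument is only needed because both frames have a column named A):

import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"]})
df2 = pd.DataFrame({
    "A": ["A0_link0", "A1_link1", "A2_link2", "A3_link3", "A4_link4", "An_linkn"],
    "B": ["B0_link0", "B1_link1", "B2_link2", "B3_link3", "B4_link4", "Bn_linkn"],
})

# Key df2 by the prefix before the underscore, then merge df1 against that index.
d = df2.assign(key=df2['A'].str.split('_').str[0]).set_index('key')
out = df1.merge(d, left_on='A', right_index=True, suffixes=('', '_df2'))
print(out['A_df2'].tolist())  # ['A0_link0', 'A1_link1', 'A2_link2', 'A3_link3']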
You can set the index to the An prefix and use pd.concat on columns:
result = (pd.concat([df1.set_index(df1['A']),
                     df2.set_index(df2['A'].str.split('_').str[0])],
                    axis=1, join="inner", sort=False)
            .reset_index(drop=True))
print(result)
    A         A         B
0  A0  A0_link0  B0_link0
1  A1  A1_link1  B1_link1
2  A2  A2_link2  B2_link2
3  A3  A3_link3  B3_link3
df2.A.loc[df2.A.str.split('_', expand=True).iloc[:, 0].isin(df1.A)]
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
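A slightly shorter variant of the same idea, using the .str accessor's element access instead of expand=True (a sketch):

import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"]})
df2 = pd.DataFrame({
    "A": ["A0_link0", "A1_link1", "A2_link2", "A3_link3", "A4_link4", "An_linkn"],
    "B": ["B0_link0", "B1_link1", "B2_link2", "B3_link3", "B4_link4", "Bn_linkn"],
})

# .str[0] takes the first token of each split, with no intermediate DataFrame.
out = df2.loc[df2['A'].str.split('_').str[0].isin(df1['A']), 'A']
print(out.tolist())  # ['A0_link0', 'A1_link1', 'A2_link2', 'A3_link3']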

How can I flatten the output dataframe of pandas crosstab from two series x and y into a series?

I have the following series x and y:
x = pd.Series(['a', 'b', 'a', 'c', 'c'], name='x')
y = pd.Series([1, 0, 1, 0, 0], name='y')
I call pd.crosstab to get the following dataframe as output:
pd.crosstab(x, y)
Output:
y  0  1
x
a  0  2
b  1  0
c  2  0
I want to transform this into a single series as follows:
x_a_y_0 0
x_a_y_1 2
x_b_y_0 1
x_b_y_1 0
x_c_y_0 2
x_c_y_1 0
For a specific dataframe like this one, I can construct this by visual inspection:
pd.Series(
    dict(
        x_a_y_0=0,
        x_a_y_1=2,
        x_b_y_0=1,
        x_b_y_1=0,
        x_c_y_0=2,
        x_c_y_1=0
    )
)
But given arbitrary series x and y, how do I generate the corresponding final output?
Use DataFrame.stack, then change the MultiIndex with map:
s = pd.crosstab(x, y).stack()
s.index = s.index.map(lambda x: f'x_{x[0]}_y_{x[1]}')
print (s)
x_a_y_0 0
x_a_y_1 2
x_b_y_0 1
x_b_y_1 0
x_c_y_0 2
x_c_y_1 0
dtype: int64
It is also possible to pass s.index.names, thanks @SeaBean:
s.index = s.index.map(lambda x: f'{s.index.names[0]}_{x[0]}_{s.index.names[1]}_{x[1]}')
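Putting that together as a runnable sketch, with the labels derived from the index names so it adapts to any series names:

import pandas as pd

x = pd.Series(['a', 'b', 'a', 'c', 'c'], name='x')
y = pd.Series([1, 0, 1, 0, 0], name='y')

# stack() turns the crosstab into a Series with a (x, y) MultiIndex;
# the level names then label each flattened entry.
s = pd.crosstab(x, y).stack()
names = s.index.names
s.index = s.index.map(lambda t: f'{names[0]}_{t[0]}_{names[1]}_{t[1]}')
print(s['x_a_y_1'])  # 2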

Weighted mean pandas

I'm calculating a weighted mean for many columns using pandas. In some cases the weights can sum to zero, so I use np.ma.average:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(dict([('ID', [1, 1, 1]),
                                  ('HeightA', [1, 2, 3]),
                                  ('WeightA', [0, 0, 0]),
                                  ('HeightB', [2, 4, 6]),
                                  ('WeightB', [1, 2, 4])]))
>>df
   ID  HeightA  WeightA  HeightB  WeightB
0   1        1        0        2        1
1   1        2        0        4        2
2   1        3        0        6        4
wmA = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightA"])
wmB = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightB"])
f = {'HeightA':wmA,'HeightB':wmB}
df2 = df.groupby(['ID'])[['HeightA', 'HeightB']].agg(f)
This works, but I have many columns of heights and weights, so I don't want to write a lambda function for each one. So I try:
def givewm(data, weightcolumn):
    return np.ma.average(data, weights=data.loc[data.index, weightcolumn])

f = {'HeightA': givewm(df, 'WeightA'), 'HeightB': givewm(df, 'WeightB')}
df2 = df.groupby(['ID'])[['HeightA', 'HeightB']].agg(f)
This gives the error: builtins.TypeError: Axis must be specified when shapes of a and weights differ.
How can I write a function that returns the weighted mean, taking the weight column name as input?
Use nested functions (a closure); solution from GitHub:
df = pd.DataFrame.from_dict(dict([('ID', [1, 1, 1]),
                                  ('HeightA', [1, 2, 3]),
                                  ('WeightA', [10, 20, 30]),
                                  ('HeightB', [2, 4, 6]),
                                  ('WeightB', [1, 2, 4])]))
print (df)
   ID  HeightA  WeightA  HeightB  WeightB
0   1        1       10        2        1
1   1        2       20        4        2
2   1        3       30        6        4
def givewm(weightcolumn):
    def f1(x):
        return np.ma.average(x, weights=df.loc[x.index, weightcolumn])
    return f1
f = {'HeightA':givewm('WeightA'),'HeightB':givewm('WeightB')}
df2 = df.groupby('ID').agg(f)
print (df2)
     HeightA   HeightB
ID
1   2.333333  4.857143
Verify solution:
wmA = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightA"])
wmB = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightB"])
f = {'HeightA':wmA,'HeightB':wmB}
df2 = df.groupby(['ID'])[['HeightA', 'HeightB']].agg(f)
print (df2)
     HeightA   HeightB
ID
1   2.333333  4.857143
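As a sanity check, a weighted mean is sum(value * weight) / sum(weight); for HeightB that is (2*1 + 4*2 + 6*4) / (1 + 2 + 4) = 34/7 ≈ 4.857143:

# Weighted mean of HeightB computed by hand, matching the groupby result above.
heights = [2, 4, 6]
weights = [1, 2, 4]
wm = sum(h * w for h, w in zip(heights, weights)) / sum(weights)
print(round(wm, 6))  # 4.857143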

How to use the values of one column to access values in another column?

How can I use the values of one column to access values in another?
import numpy
import pandas
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=[['Value']])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
So how do I use 'bleh' to access the corresponding 'Value' for each row?
df.Value.iloc[df['bleh']]
Edit:
Thanks to @ScottBoston. My DataFrame constructor had one layer of [] too many.
The correct answer is:
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
df['idx_int'] = range(df.shape[0])
df['haa'] = df['idx_int'] - df.bleh.values
df['newcol'] = df.Value.iloc[df['haa'].values].values
Try:
df['Value'].tolist()
Output:
[-1.0856306033005612,
0.9973454465835858,
0.28297849805199204,
-1.506294713918092,
-0.5786002519685364,
1.651436537097151,
-2.426679243393074,
-0.42891262885617726,
1.265936258705534,
-0.8667404022651017]
Your dataframe constructor still needs to be fixed.
Are you looking for:
df.set_index('bleh')
output:
         Value
bleh
0    -1.085631
1     0.997345
2     0.282978
1    -1.506295
4    -0.578600
0     1.651437
0    -2.426679
4    -0.428913
1     1.265936
7    -0.866740
If so, your dataframe constructor has an extra set of [] in it.
np.random.seed(123)
df = pd.DataFrame((np.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: np.random.randint(0, x + 1, 1)[0])
The columns parameter in DataFrame takes a list, not a list of lists.
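The difference shows up in the columns attribute: a list of lists produces MultiIndex columns, which is why attribute access like df.Value misbehaved (a sketch):

import numpy as np
import pandas as pd

data = np.random.normal(0, 1, 10)

# columns=['Value'] gives a flat Index; columns=[['Value']] gives a MultiIndex.
flat = pd.DataFrame(data, columns=['Value'])
nested = pd.DataFrame(data, columns=[['Value']])
print(type(flat.columns).__name__)    # Index
print(type(nested.columns).__name__)  # MultiIndex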