How to method chaining in pandas to aggregate a DataFrame? - pandas

I want to aggregate a pandas DataFrame using method chaining. I don't know how to start with the DataFrame and just refer to it when aggregating (using method chaining). Consider the following example that illustrates my intention:
Having this data:
import pandas as pd
my_df = pd.DataFrame({
'name': ['john', 'diana', 'rachel', 'chris'],
'favorite_color': ['red', 'blue', 'green', 'red']
})
my_df
#> name favorite_color
#> 0 john red
#> 1 diana blue
#> 2 rachel green
#> 3 chris red
and I want to end up with this summary table:
#> total_people total_ppl_who_like_red
#> 0 4 2
Of course there are so many ways to do it. One way, for instance, would be to build a new DataFrame:
desired_output_via_building_new_df = pd.DataFrame({
'total_people': [len(my_df)],
'total_ppl_who_like_red': [my_df.favorite_color.eq('red').sum()]
})
desired_output_via_building_new_df
#> total_people total_ppl_who_like_red
#> 0 4 2
However, I'm looking for a way to use "method chaining"; starting with my_df and working my way forward. Something along the lines of
# pseudo-code; not really working
my_df.agg({
'total_people': lambda x: len(x),
'total_ppl_who_like_red': lambda x: x.favorite_color.eq('red').sum()
})
I can only borrow inspiration from R/dplyr code:
library(dplyr, warn.conflicts = FALSE)
my_df <-
data.frame(name = c("john", "diana", "rachel", "chris"),
favorite_color = c("red", "blue", "green", "red")
)
my_df |>
summarise(total_people = n(), ## in the context of `summarise()`,
total_ppl_who_like_red = sum(favorite_color == "red")) ## both `n()` and `sum()` refer to `my_df` because we start with `my_df` and pipe it "forward" to `summarise()`
#> total_people total_ppl_who_like_red
#> 1 4 2

Solution for processing one Series:
df = my_df.favorite_color.apply({
'total_people': lambda x: x.count(),
'total_ppl_who_like_red': lambda x: x.eq('red').sum()
}).to_frame(name=0).T
print (df)
total_people total_ppl_who_like_red
0 4 2
General solution for processing DataFrame with DataFrame.pipe - then pandas processing input DataFrame, if use apply or agg processing columns separately:
df = (my_df.pipe(lambda x: pd.Series({'total_people': len(x),
'total_ppl_who_like_red':
x.favorite_color.eq('red').sum()}))
.to_frame(name=0).T)
print (df)
total_people total_ppl_who_like_red
0 4 2
df = my_df2.pipe(lambda x: pd.Series({'total_people': len(x),
'total_ppl_who_like_red':
x.favorite_color.eq('red').sum(),
'max_age':x.age.max()
}).to_frame(name=0).T)
print (df)
total_people total_ppl_who_like_red max_age
0 4 2 41

Related

The pandas isin() function but returning the actual values, not just a boolean

I have an NumPy array of good animals, and a DataFrame of people with a list of animals they own.
good_animals = np.array(['Owl', 'Dragon', 'Shark', 'Cat', 'Unicorn', 'Penguin'])
data = {
> 'People': [1, 2, 3, 4, 5],
> 'Animals': [['Owl'], ['Owl', 'Dragon'], ['Dog', 'Human'], ['Unicorn', 'Pitbull'], []],
> }
df = pd.DataFrame(data)
I want to add another column to my DataFrame, showing all the good animals that person owns.
The following gives me a Series showing whether or not each animal is a good animal.
df['Animals'].apply(lambda x: np.isin(x, good_animals))
But I want to see the actual good animals, not just booleans.
You can use intersection of sets from lists:
df['new'] = df['Animals'].apply(lambda x: list(set(x).intersection(good_animals)))
print (df)
People Animals new
0 1 [Owl] [Owl]
1 2 [Owl, Dragon] [Dragon, Owl]
2 3 [Dog, Human] []
3 4 [Unicorn, Pitbull] [Unicorn]
4 5 [] []
If possible duplciated values or if order is important use list comprehension:
s = set(good_animals)
df['new'] = df['Animals'].apply(lambda x: [y for y in x if y in s])
print (df)
People Animals new
0 1 [Owl] [Owl]
1 2 [Owl, Dragon] [Owl, Dragon]
2 3 [Dog, Human] []
3 4 [Unicorn, Pitbull] [Unicorn]
4 5 [] []
I`m not very sure if I understood well your questions. Why are you using np.array? You can try this:
good_animals = ['Owl', 'Dragon', 'Shark', 'Cat', 'Unicorn', 'Penguin']
import pandas as pd
df_dict = {
'People':["1","2","3","4","5"],
'Animals':[['Owl'],['Owl', 'Dragon'], ['Dog', 'Human'], ['Unicorn', 'Pitbull'],[]],
'Good_animals': [None, None, None,None,None]
}
df = pd.DataFrame(df_dict)
for x in range(df.shape[0]):
row = x
df.Good_animals.iloc[x] = ', ' .join([y for y in df.Animals.iloc[row] if y in good_animals])
The result:
People Animals Good_animals
0 1 [Owl] Owl
1 2 [Owl, Dragon] Owl, Dragon
2 3 [Dog, Human]
3 4 [Unicorn, Pitbull] Unicorn
4 5 []

Join 2 data frame with special columns matching new

I want to join two dataframes and get result as below. I tried many ways, but it fails.
I want only texts on df2 ['A'] which contain text on df1 ['A']. What do I need to change in my code?
I want:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
import pandas as pd
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
})
df2 = pd.DataFrame(
{ "A": ["A0_link0", "A1_link1", "A2_link2", "A3_link3", "A4_link4", 'An_linkn'],
"B" : ["B0_link0", "B1_link1", "B2_link2", "B3_link3", "B4_link4", 'Bn_linkn']
})
result = pd.concat([df1, df2], ignore_index=True, join= "inner", sort=False)
print(result)
Create an intermediate dataframe and map:
d = (df2.assign(key=df2['A'].str.extract(r'([^_]+)'))
.set_index('key'))
df1['A'].map(d['A'])
Output:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
Name: A, dtype: object
Or merge if you want several columns from df2 (df1.merge(d, left_on='A', right_index=True))
You can set the index as An and pd.concat on columns
result = (pd.concat([df1.set_index(df1['A']),
df2.set_index(df2['A'].str.split('_').str[0])],
axis=1, join="inner", sort=False)
.reset_index(drop=True))
print(result)
A A B
0 A0 A0_link0 B0_link0
1 A1 A1_link1 B1_link1
2 A2 A2_link2 B2_link2
3 A3 A3_link3 B3_link3
df2.A.loc[df2.A.str.split('_',expand=True).iloc[:,0].isin(df1.A)]
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3

Pandas create new column base on groupby and apply lambda if statement

I have the issue with groupby and apply
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'b'], 'B': np.r_[1:8]})
I want to create a column C for each group take value 1 if B > z_score=2 and 0 otherwise. The code:
from scipy import stats
df['C'] = df.groupby('A').apply(lambda x: 1 if np.abs(stats.zscore(x['B'], nan_policy='omit')) > 2 else 0, axis=1)
However, I am unsuccessful with code and cannot figure out the issue
Use GroupBy.transformwith lambda, function, then compare and for convert True/False to 1/0 convert to integers:
from scipy import stats
s = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit')))
df['C'] = (s > 2).astype(int)
Or use numpy.where:
df['C'] = np.where(s > 2, 1, 0)
Error in your solution is per groups:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: 1 if np.abs(stats.zscore(x, nan_policy='omit')) > 2 else 0)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
If check gotcha in pandas docs:
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not.
So if use one of solutions instead if-else:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))
print (df)
A
a [0, 0, 0]
b [0, 0, 0, 0]
Name: B, dtype: object
but then need convert to column, for avoid this problems is used groupby.transform.
You can use groupby + apply a function that finds the z-scores of each item in each group; explode the resulting list; use gt to create a boolean series and convert it to dtype int
df['C'] = df.groupby('A')['B'].apply(lambda x: stats.zscore(x, nan_policy='omit')).explode(ignore_index=True).abs().gt(2).astype(int)
Output:
A B C
0 a 1 0
1 a 2 0
2 a 3 0
3 b 4 0
4 b 5 0
5 b 6 0
6 b 7 0

Weighted mean pandas

Im calculating weighted mean for many columns using pandas. In some cases weight can sum to zero so i use np.ma.average:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(dict([('ID', [1, 1, 1]),('HeightA', [1, 2, 3]), ('WeightA', [0, 0, 0]),('HeightB', [2, 4, 6]), ('WeightB', [1, 2, 4])]))
>>df
ID HeightA WeightA HeightB WeightB
0 1 1 0 2 1
1 1 2 0 4 2
2 1 3 0 6 4
wmA = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightA"])
wmB = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightB"])
f = {'HeightA':wmA,'HeightB':wmB}
df2 = df.groupby(['ID'])['HeightA','HeightB'].agg(f)
This works but i have many columns of height and weights so i dont want to have to write a lambda function for each one so i try:
def givewm(data,weightcolumn):
return np.ma.average(data, weights=data.loc[data.index, weightcolumn])
f = {'HeightA':givewm(df,'WeightA'),'HeightB':givewm(df,'WeightB')}
df2 = df.groupby(['ID'])['HeightA','HeightB'].agg(f)
Which give error: builtins.TypeError: Axis must be specified when shapes of a and weights differ.
How can i write a function which returns weighted mean with weight column name as input?
Use double nested functions, solution from github:
df = (pd.DataFrame.from_dict(dict([('ID', [1, 1, 1]),
('HeightA', [1, 2, 3]),
('WeightA', [10, 20, 30]),
('HeightB', [2, 4, 6]),
('WeightB', [1, 2, 4])])))
print (df)
ID HeightA WeightA HeightB WeightB
0 1 1 10 2 1
1 1 2 20 4 2
2 1 3 30 6 4
def givewm(weightcolumn):
def f1(x):
return np.ma.average(x, weights=df.loc[x.index, weightcolumn])
return f1
f = {'HeightA':givewm('WeightA'),'HeightB':givewm('WeightB')}
df2 = df.groupby('ID').agg(f)
print (df2)
HeightA HeightB
ID
1 2.333333 4.857143
Verify solution:
wmA = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightA"])
wmB = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightB"])
f = {'HeightA':wmA,'HeightB':wmB}
df2 = df.groupby(['ID'])['HeightA','HeightB'].agg(f)
print (df2)
HeightA HeightB
ID
1 2.333333 4.857143

Defining a function to play a graph from CSV data - Python panda

I am trying to play around with data analysis, taking in data from a simple CSV file I have created with random values in it.
I have defined a function that should allow the user to type in a value3 then from the dataFrame, plot a bar graph. The below:
def analysis_currency_pair():
x=raw_input("what currency pair would you like to analysie ? :")
print type(x)
global dataFrame
df1=dataFrame
df2=df1[['currencyPair','amount']]
df2 = df2.groupby(['currencyPair']).sum()
df2 = df2.loc[x].plot(kind = 'bar')
When I call the function, the code returns my question, along with giving the output of the currency pair. However, it doesn't seem to put x (the value input by the user) into the later half of the function, and so no graph is produced.
Am I doing something wrong here?
This code works when we just put the value in, and not within a function.
I am confused!
I think you need rewrite your function with two parameters: x and df, which are passed to function analysis_currency_pair:
import pandas as pd
df = pd.DataFrame({"currencyPair": pd.Series({1: 'EURUSD', 2: 'EURGBP', 3: 'CADUSD'}),
"amount": pd.Series({1: 2, 2: 2, 3: 3.5}),
"a": pd.Series({1: 7, 2: 8, 3: 9})})
print df
# a amount currencyPair
#1 7 2.0 EURUSD
#2 8 2.0 EURGBP
#3 9 3.5 CADUSD
def analysis_currency_pair(x, df1):
print type(x)
df2=df1[['currencyPair','amount']]
df2 = df2.groupby(['currencyPair']).sum()
df2 = df2.loc[x].plot(kind = 'bar')
#raw input is EURUSD or EURGBP or CADUSD
pair=raw_input("what currency pair would you like to analysie ? :")
analysis_currency_pair(pair, df)
Or you can pass string to function analysis_currency_pair:
import pandas as pd
df = pd.DataFrame({"currencyPair": [ 'EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
"amount": [ 1, 2, 3, 4, 5],
"amount1": [ 5, 4, 3, 2, 1]})
print df
# amount amount1 currencyPair
#0 1 5 EURUSD
#1 2 4 EURGBP
#2 3 3 CADUSD
#3 4 2 EURUSD
#4 5 1 EURGBP
def analysis_currency_pair(x, df1):
print type(x)
#<type 'str'>
df2=df1[['currencyPair','amount']]
df2 = df2.groupby(['currencyPair']).sum()
print df2
# amount
#currencyPair
#CADUSD 3
#EURGBP 7
#EURUSD 5
df2 = df2.loc[x].plot(kind = 'bar')
analysis_currency_pair('CADUSD', df)