Filter dataframe based on condition before groupby - pandas

Suppose I have a dataframe like this (sample dataframe created below):
import pandas as pd
import numpy as np
data = {
'gender': np.random.choice(['m', 'f'], size=100),
'vaccinated': np.random.choice([0, 1], size=100),
'got sick': np.random.choice([0, 1], size=100)
}
df = pd.DataFrame(data)
and I want to see, by gender, what proportion of vaccinated people got sick.
I've tried something like this:
df.groupby('gender').agg(lambda group: sum((group['vaccinated'] == 1) & (group['got sick'] == 1))
                                       / sum(group['got sick'] == 1))
but this doesn't work because agg operates at the series level. The same applies to transform. apply doesn't work either, but I'm not clear why, or how apply operates on groupby objects.
Any ideas how to accomplish this with a single line of code?

You could first filter for the vaccinated people and then group by gender and calculate the proportion of people that got sick.
df[df.vaccinated == 1].groupby("gender").agg({"got sick":"mean"})
Output:
got sick
gender
f 0.548387
m 0.535714
In this case the proportions are calculated from the random sample data created above, so your numbers will differ.
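If you need the same numbers on every run, one option (not part of the original question) is to seed NumPy before generating the sample data, for example:
import numpy as np
# fix the random seed so the generated sample data, and therefore the
# proportions computed above, are reproducible across runs
np.random.seed(0)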

The docs for GroupBy.apply state that the function is applied "group-wise". This means that the function is called on each group separately as a data frame.
That is, df.groupby(c).apply(f) is conceptually equivalent to:
results = {}
for val in df[c].unique():
    group = df.loc[df[c] == val]
    result = f(group)
    results[val] = result
pd.concat(results)
We can use this understanding to apply your custom aggregation function, using a top-level def just to make the code easier to read:
def calc_vax_sick_frac(group):
    vaccinated = group['vaccinated'] == 1
    sick = group['got sick'] == 1
    return (vaccinated & sick).sum() / sick.sum()
(
df
.groupby('gender')
.apply(calc_vax_sick_frac)
)
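If you want the result as a named column rather than an anonymous Series, one option (a small sketch, not part of the original answer; the column name is just an example) is to rename it and reset the index:
(
    df
    .groupby('gender')
    .apply(calc_vax_sick_frac)
    .rename('vax_sick_frac')   # example name, pick whatever fits
    .reset_index()
)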

Related

Joining Multiple Data Frames

I am wondering if there is a way in Julia DataFrames to join multiple data frames in one go. For example:
using DataFrames
employer = DataFrame(
ID = Array{Int64}([01,02,03,04,05,09,11,20]),
name = Array{String}(["Matthews","Daniella", "Kofi", "Vladmir", "Jean", "James", "Ayo", "Bill"])
)
salary = DataFrame(
ID = Array{Int64}([01,02,03,04,05,06,08,23]),
amount = Array{Int64}([2050,3000,3500,3500,2500,3400,2700,4500])
)
hours = DataFrame(
ID = Array{Int64}([01,02,03,04,05,08,09,23]),
time = Array{Int64}([40,40,40,40,40,38,45,50])
)
# I tried putting them in an array but of course that results in an error
empSalHrs = innerjoin([employer,salary,hours], on = :ID)
# In python you can achieve this using
import pandas as pd
from functools import reduce
df = reduce(lambda l,r : pd.merge(l,r, on = "ID"), [employer, salary, hours])
Is there a similar way to do this in Julia?
You were almost there. As described in the DataFrames.jl manual, you just need to pass more than one dataframe as an argument.
using DataFrames
employer = DataFrame(
ID = [01,02,03,04,05,09,11,20],
name = ["Matthews","Daniella", "Kofi", "Vladmir", "Jean", "James", "Ayo", "Bill"])
salary = DataFrame(
ID = [01,02,03,04,05,06,08,23],
amount = [2050,3000,3500,3500,2500,3400,2700,4500])
hours = DataFrame(
ID = [01,02,03,04,05,08,09,23],
time = [40,40,40,40,40,38,45,50]
)
empSalHrs = innerjoin(employer,salary,hours, on = :ID)
If for some reason you need to put your dataframes in a Vector, you can use splatting to achieve the same result:
empSalHrs = innerjoin([employer,salary,hours]..., on = :ID)
Also, note that I've slightly changed the definitions of the dataframes. Since Array{Int} is an abstract type, it shouldn't be used for variable declarations, because it is bad for performance. That may not matter in this particular scenario, but it's better to build good habits from the start. Instead of Array{Int} one can use
Array{Int, 1}([1, 2, 3, 4])
Vector{Int}([1, 2, 3, 4])
Int[1, 2, 3]
[1, 2, 3]
The last one works because Julia can infer the element type of the container on its own in this simple case.

How can I add values from pandas group to new Dataframe after a function?

I am trying to separate a Dataframe into groups, run each group through a function, and have the return value from the first row of each group placed into a new Dataframe.
When I try the code below, I can print out the information I want, but when I try to add it to the new Dataframe, it only shows the values for the last group.
How can I add the values from each group into the new Dataframe?
Thanks,
Here is what I have so far:
import pandas as pd
import numpy as np
#Build random dataframe
df = pd.DataFrame(np.random.randint(0,40,size=10),
columns=["Random"],
index=pd.date_range("20200101", freq='6h',periods=10))
df["Random2"] = np.random.randint(70,100,size=10)
df["Random3"] = 2
df.index =df.index.map(lambda t: t.strftime('%Y-%m-%d'))
df.index.name = 'Date'
df.reset_index(inplace=True)
#Setup groups by date
df = df.groupby(['Date']).apply(lambda x: x.reset_index())
df.drop(["index","Date"],axis=1,inplace = True)
#Create new dataframe for newValue
df2 = pd.DataFrame(index=(df.index)).unstack()
#random function for an example
def any_func(df):
    df["Value"] = df["Random"] * df["Random2"] / df["Random3"]
    return df["Value"]
#loop by unique group name
for date in df.index.get_level_values('Date').unique():
    #I can print the data I want
    print(any_func(df.loc[date])[0])
    #But when I add it to a new dataframe, it only shows the value from the last group
    df2["newValue"] = any_func(df.loc[date])[0]
df2
Unrelated, but try modifying your any_func to take advantage of vectorized operations where possible.
Now if I understand you correctly:
new_value = df['Random'] * df['Random2'] / df['Random3']
# df carries a (Date, row-number) MultiIndex after the groupby/reset_index above,
# so .loc[:, 0] keeps the first row of each date group
df2['New Value'] = new_value.loc[:, 0]
This line of code gave me the desired outcome. I just needed to set the index using the "date" variable when I created the column, not when I created the Dataframe.
df2.loc[date, "newValue"] = any_func(df.loc[date])[0]

How to apply *multiple* functions to pandas groupby apply?

I have a dataframe which shall be grouped, and then several functions shall be applied to each group. Normally, I would do this with groupby().agg() (cf. Apply multiple functions to multiple groupby columns), but the functions I'm interested in do not take one column as input but multiple columns.
I learned that, when I have one function that has multiple columns as input, I need apply (cf. Pandas DataFrame aggregate function using multiple columns).
But what do I need, when I have multiple functions that have multiple columns as input?
import pandas as pd
df = pd.DataFrame({'x':[2, 3, -10, -10], 'y':[10, 13, 20, 30], 'id':['a', 'a', 'b', 'b']})
def mindist(data): #of course these functions are more complicated in reality
    return min(data['y'] - data['x'])

def maxdist(data):
    return max(data['y'] - data['x'])
I would expect something like df.groupby('id').apply([mindist, maxdist])
    min  max
id
a     8   10
b    30   40
(achieved with pd.DataFrame({'mindist': df.groupby('id').apply(mindist), 'maxdist': df.groupby('id').apply(maxdist)}) - which obviously isn't very handy if I have a dozen functions to apply to the grouped dataframe). Initially I thought this OP had the same question, but he seems to be fine with aggregate, meaning his functions take only one column as input.
For this specific issue, how about grouping after taking the difference?
(df['y'] - df['x']).groupby(df['id']).agg(['min', 'max'])
More generically, you could probably do something like
df.groupby('id').apply(lambda x: pd.Series({'min': mindist(x), 'max': maxdist(x)}))
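With the sample frame from the question, either of these reproduces the expected table:
    min  max
id
a     8   10
b    30   40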
IIUC you want to use several functions within the same group. In this case you should return a pd.Series. In the following toy example I want to sum columns A and B and take the mean, and sum columns C and D and take the std.
import pandas as pd
df = pd.util.testing.makeDataFrame().head(10)
df["key"] = ["key1"] * 5 + ["key2"] * 5
def fun(x):
    m = (x["A"] + x["B"]).mean()
    s = (x["C"] + x["D"]).std()
    return pd.Series({"meanAB": m, "stdCD": s})
df.groupby("key").apply(fun)
Update
Which in your case became
import pandas as pd
df = pd.DataFrame({'x':[2, 3, -10, -10],
'y':[10, 13, 20, 30],
'id':['a', 'a', 'b', 'b']})
def mindist(data): #of course these functions are more complicated in reality
    return min(data['y'] - data['x'])

def maxdist(data):
    return max(data['y'] - data['x'])

def fun(data):
    return pd.Series({"maxdist": maxdist(data),
                      "mindist": mindist(data)})
df.groupby('id').apply(fun)
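If you really do have a dozen such functions, one way to avoid writing fun by hand (a sketch, not from the original answer) is to keep the functions in a dict and build the combining function from it:
funcs = {'mindist': mindist, 'maxdist': maxdist}  # extend with further functions as needed

def combined(group):
    # evaluate every registered function on the group and collect the results
    return pd.Series({name: f(group) for name, f in funcs.items()})

df.groupby('id').apply(combined)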

get less correlated variable names

I have a dataset (50 columns, 100 rows).
I also have 50 variable names, 0, 1, 2, ..., 49, for the 50 columns.
I have to find the less correlated variables, say with correlation < 0.7.
I tried as follows:
import os, glob, time, numpy as np, pandas as pd
data = np.random.randint(1,99,size=(100, 50))
dataframe = pd.DataFrame(data)
print (dataframe.shape)
codes = np.arange(50).astype(str)
dataframe.columns = codes
corr = dataframe.corr()
corr = corr.unstack().sort_values()
print (corr)
corr = corr.values
indices = np.where(corr < 0.7)
print (indices)
res = codes[indices[0]].tolist() + codes[indices[1]].tolist()
print (len(res))
res = list(set(res))
print (len(res))
The result is 50 (all variables!), which is unexpected.
How to solve this problem, guys?
As mentioned in the comments, your question is somewhat ambiguous. First, there is the possibility that no column pair is correlated. Second, the unstacking doesn't make sense, because it creates an index array that you can't directly use on your 2D array. Third (which should really be first, but I was blind to it): as @AmiTavory mentioned, there is no point in "correlating names".
The correlation procedure per se works, as you can see in the following example:
import numpy as np
import pandas as pd
A = np.arange(100).reshape(25, 4)
#random order in column 2, i.e. a low correlation to the first columns
np.random.shuffle(A[:,2])
#flip column 3 to create a negative correlation with the first columns
A[:,3] = np.flipud(A[:,3])
#column 1 is unchanged, therefore positively correlated to column 0
df = pd.DataFrame(A)
print(df)
#establish a correlation matrix
corr = df.corr()
#retrieve index of pairs below a certain value
#use only the upper triangle with np.triu to filter for symmetric solutions
#use np.abs to take also negative correlation into account
res = np.argwhere(np.triu(np.abs(corr.values) < 0.7))
print(res)
Output:
[[0 2]
[1 2]
[2 3]]
As expected, column 2 is the only one that is not correlated with any other, meaning that all the other columns are correlated with each other.
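Since the question ultimately asks for variable names rather than indices, a possible follow-up (assuming the corr and res objects from above) is to map each index pair back to column labels:
# translate each low-correlation index pair into the corresponding column names
name_pairs = [(corr.columns[i], corr.columns[j]) for i, j in res]
print(name_pairs)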

Extracting value and creating new column out of it

I would like to extract certain section of a URL, residing in a column of a Pandas Dataframe and make that a new column. This
import re
ref = df['REFERRERURL']
ref.str.findall("\\d\\d\\/(.*?)(;|\\?)", flags=re.IGNORECASE)
returns me a Series with tuples in it. How can I take out only one part of that tuple before the Series is created, so I can simply turn that into a column? Sample data for referrerurl is
http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....
In this example I am interested in creating a column that only has 'someproduct_step2' in it.
Thanks,
In [25]: df = DataFrame([['http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....']],columns=['A'])
In [26]: df['A'].str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE).apply(lambda x: Series(x[0][0],index=['first']))
Out[26]:
first
0 someproduct_step2
In 0.11.1, here is a neat way of doing this as well:
In [34]: df.replace({ 'A' : "http:.+\d\d\/(.*?)(;|\\?).*$"}, { 'A' : r'\1'} ,regex=True)
Out[34]:
A
0 someproduct_step2
This also worked
def extract(x):
    res = re.findall("\\d\\d\\/(.*?)(;|\\?)", x)
    if res: return res[0][0]
session['RU_2'] = session['REFERRERURL'].apply(extract)
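A more concise alternative (not in the original answers, assuming the same sample frame as in the accepted answer) is Series.str.extract, which returns only the capture group:
# non-capturing group for the delimiter so only the wanted fragment is returned
df['A'].str.extract(r"\d\d/(.*?)(?:;|\?)", expand=False)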