Stratified Sampling with different sizes - pandas

I am trying to write a function for stratified sampling that takes a dataframe (created with the faker module), a strata column, a sample size and a random seed. I want the number of samples drawn from each stratum to vary based on user input. This is my code for creating the data:
import pandas as pd
import numpy as np
import random as rn#generating random numbers
from faker import Faker
fake = Faker()
frame_fake = pd.DataFrame( [{"region":
fake.random_number(1,fix_len=True),
"district": fake.random_number(2,fix_len=True),
"enum_area": fake.random_number(5,fix_len=True),
"hhs": fake.random_number(3),
"pop": fake.random_number(4),
"area": fake.random_number(1)} for x in range(100)])
# check for and remove duplicates from enum area (should be unique)
# before any further analysis
mask= frame_fake.duplicated('enum_area', keep='last')
duplicates = frame_fake[mask]
# print(duplicates)
# drop all except last
frame_fake = frame_fake.drop_duplicates('enum_area',
keep='last').sort_values(by='enum_area',ascending=True)
# reset index to have them sequentially after sorting by enum_area and
# drop the old index column
frame_fake = frame_fake.reset_index().drop('index',axis=1)
frame_fake
This is the code for sampling:
def stratified_custom(data,strata,sample_size, seed=None):
    # for this part, we sample 5 enum areas in each strata/region
    # we groupby strata and use the transform method with 'count' parameter
    # to get strata sizes
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    # map input sample size to each strata
    data['strat_sample_size'] = data[strata].map(sample_size)
    # groupby strata, get sample size per stratum, cast to int and reset
    # index.
    smp_size = data.groupby(strata)['strat_sample_size'].unique().astype(int).reset_index()
    # groupby strata and select sample per stratum based on the sample size
    # for that strata
    sample = (data.groupby(strata, group_keys=False)
              .apply(lambda x: x.sample(smp_size,random_state=seed)))
    # probability of inclusion
    sample['inclusion_prob'] = sample['strat_sample_size']/sample['strat_size']
    return sample
s_size={1:7,2:5,3:5,4:5,5:5,6:5,7:5,8:5,9:8} #pass in strata and sample
# size as dict. (key, values)
(stratified_custom(data=frame_fake,strata='region',sample_size=s_size,
seed=99).sort_values(by=['region','enum_area'],ascending=True))
I however receive this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
I can't figure out what this error is talking about. Any help is appreciated.
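A likely cause, offered as a reading of the code above rather than a confirmed diagnosis: smp_size is a DataFrame (the result of reset_index()), and it is passed as the n argument of x.sample(...). pandas then has to evaluate a DataFrame in a boolean context, which is what produces this message. The minimal sketch below reproduces the same error outside the function; the frame and its values are made up for illustration.

```python
import pandas as pd

# Stand-in for smp_size: a small DataFrame of per-stratum sample sizes.
smp_size = pd.DataFrame({"region": [1, 2], "strat_sample_size": [7, 5]})

try:
    # Any truth test on a whole DataFrame is ambiguous, so pandas refuses it.
    if smp_size:
        pass
except ValueError as exc:
    message = str(exc)

print(message)
```

Passing a single integer per group to sample() avoids the problem.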

After much research, I stumbled upon this post https://stackoverflow.com/a/58794577/14198137 and implemented this in my code to not only sample based on varying sample sizes but also with fixed ones using the same function. Here is my code for the data:
import pandas as pd
import numpy as np
import random as rn
from faker import Faker
Faker.seed(99)
fake = Faker()
frame_fake = pd.DataFrame( [{"region":
fake.random_number(1,fix_len=True),"district":
fake.random_number(2,fix_len=True),"enum_area":
fake.random_number(5,fix_len=True), "hhs":
fake.random_number(3),"pop":
fake.random_number(4),"area":
rn.randint(1,2)} for x in range(100)])
frame_fake = frame_fake.drop_duplicates('enum_area',keep='last').sort_values(by='enum_area',ascending=True)
frame_fake = frame_fake.reset_index().drop('index',axis=1)
Here is the updated code for stratified sampling which now works.
def stratified_custom(data,strata,sample_size, seed=None):
    data = data.copy()
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    try:
        data['strat_sample_size'] = data[strata].map(sample_size)
        smp_size = data.set_index(strata)['strat_sample_size'].to_dict()
        strat2_sample = (data.groupby(strata, group_keys=False).apply(lambda x: x.sample(smp_size[x.name],random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size']/strat2_sample['strat_size']
        return strat2_sample
    except:
        data['strat_sample_size'] = sample_size
        strat2_sample = (data.groupby(strata, group_keys=False).apply(lambda x: x.sample(sample_size,random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size']/strat2_sample['strat_size']
        return strat2_sample
s_size={1:3,2:9,3:5,4:5,5:5,6:5,7:5,8:5,9:8}
variablesize = (stratified_custom(data=frame_fake,strata='region',sample_size=s_size, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
variablesize
fixedsize = (stratified_custom(data=frame_fake,strata='region',sample_size=3, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
fixedsize
The output of variable sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
0 2 65 10566 ... 10 9 0.9
15 2 22 25560 ... 10 9 0.9
The output of fixed sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
38 2 74 48408 ... 10 3 0.3
43 2 15 56365 ... 10 3 0.3
I was however wondering if there is a better way of achieving this?
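One possible tidier variant, sketched under assumptions rather than offered as the definitive answer: branch on the type of sample_size instead of relying on a bare except, and sample each group in an explicit loop. The name stratified_sample and the small demo frame are hypothetical; the demo uses NumPy instead of faker so that it is self-contained.

```python
import numpy as np
import pandas as pd


def stratified_sample(data, strata, sample_size, seed=None):
    """Sample each stratum; sample_size is an int or a {stratum: n} dict."""
    data = data.copy()
    data["strat_size"] = data.groupby(strata)[strata].transform("count")
    if isinstance(sample_size, dict):
        # Variable sizes: look up the size for each row's stratum.
        data["strat_sample_size"] = data[strata].map(sample_size)
    else:
        # Fixed size: the same n for every stratum.
        data["strat_sample_size"] = sample_size
    parts = []
    for _, group in data.groupby(strata):
        n = int(group["strat_sample_size"].iloc[0])
        parts.append(group.sample(n=n, random_state=seed))
    sample = pd.concat(parts)
    sample["inclusion_prob"] = sample["strat_sample_size"] / sample["strat_size"]
    return sample


# Hypothetical demo data: three regions with 10, 12 and 8 enumeration areas.
frame = pd.DataFrame({
    "region": np.repeat([1, 2, 3], [10, 12, 8]),
    "enum_area": np.arange(30),
})
variable = stratified_sample(frame, "region", {1: 3, 2: 5, 3: 2}, seed=99)
fixed = stratified_sample(frame, "region", 3, seed=99)
print(variable.groupby("region").size())
print(fixed.groupby("region").size())
```

A stratum missing from the dict would yield a NaN sample size and fail at the int() conversion, so a real version would probably want to validate the keys first.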


Error about unmatched parenthesis running simple imputer

dput(head("CustomTransformerData.csv"))
Here's what I'm trying to do:
Applies the SimpleImputer class to the data, where the strategy is set to mean. The name of this step should be "imputer".
Here's the code I'm using:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
fileName = "CustomTransformerData.csv"
custom_transform = pd.read_csv("CustomTransformerData.csv")
data_num = custom_transform.drop(['x3'], axis = 1) #created the df for numerical data
data_cat = custom_transform.drop(['x1', 'x2', 'x4', 'x5'], axis = 1) #created the df for categorical data
#importing sklearn
from sklearn.base import BaseEstimator,TransformerMixin
##creating the transformer
class Assignment4Transformer(BaseEstimator, TransformerMixin):
    def __init__(self, drop_x4 = True, y = None):
        self.drop_x4 = drop_x4 #flag to drop the x4 column
    def fit_transform(self, data, y=None):
        return self
from sklearn.pipeline import Pipeline #importing the pipeline
from sklearn.impute import SimpleImputer #importing the SimpleImputer
from sklearn.preprocessing import StandardScaler #importing the preprocessor
def transform(self, data): #starting the function to determine x4
#not adding the x3 categorical data
if self.drop_x4: #a flag to catch and drop x4, giving a new index
data = np.delete(data, 2, axis=1)
return np.c_[data, new_col]
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')]) # this is where I encounter the below error
File "/var/folders/5v/f6glw1515sqbvblc482qs47c0000gn/T/ipykernel_42484/2823414947.py", line 1
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')])
^
SyntaxError: closing parenthesis ']' does not match opening parenthesis '('
Alternatively, I tried this as well, which did not error, but then the next code errored:
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
('attribs_adder', Assignment4Transformer()),])
std_scaler= StandardScaler(num_pipeline)
# Splitting the independent and dependent variables
std_scaler = data_num.data
response = data_num.target
# standardization
scale = object.fit_transform(data_num)
TypeError Traceback (most recent call last)
/var/folders/5v/f6glw1515sqbvblc482qs47c0000gn/T/ipykernel_42484/1423714864.py in <module>
----> 1 std_scaler= StandardScaler(num_pipeline)
2
3 # Splitting the independent and dependent variables
4 std_scaler = data_num.data
5 response = data_num.target
TypeError: __init__() takes 1 positional argument but 2 were given
So I'm not sure if going the second route was truly correct, and I just need help with this portion:
Applies the custom Assignment4Transformer class to the data. Make sure that your custom transformer uses the default argument where you drop the x4 column. The name of this step should be "custom_trans".
Applies the StandardScaler class to the data. The name of this step should be "std_scaler".
Data (since it doesn't appear to be carrying through):
x1 x2 x3 x4 x5
1 1.5 2.354152979 COLD 593 0.75
2 2.5 3.31404772 WARM 340 2.083333333
3 3.5 4.021604459 COLD 551 4.083333333
4 4.5 COLD 2368 6.75
5 5.5 5.847601001 WARM 2636 10.08333333
6 6.5 7.229910044 WARM 2779 14.08333333
7 7.5 7.997255234 HOT 1057 18.75
8 8.5 9.203946542 COLD 819 24.08333333
9 9.5 10.33534766 WARM 3349
10 10.5 11.11214192 HOT 3235 36.75
11 11.5 11.75961084 WARM 216 44.08333333
12 12.5 12.62909577 WARM 2529 52.08333333
13 13.5 14.08258887 COLD 1735 60.75
14 14.5 14.65767801 HOT 1254 70.08333333
15 15.5 HOT 1245 80.08333333
16 16.6 17.18411403 WARM 310 90.75
17 17.5 17.80077555 HOT 201 102.0833333
18 18.5 18.57886101 HOT 1767 114.0833333
In your first code block...
This
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')])
Should be this
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean'))])
I added a closing parenthesis after SimpleImputer(strategy='mean')
In your second code block...
StandardScaler is a class and needs to be instantiated before it can be used.
In your code here:
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
('attribs_adder', Assignment4Transformer()),])
std_scaler= StandardScaler(num_pipeline)
Here you pass num_pipeline to the StandardScaler constructor, which accepts no positional arguments. Instead, create the scaler with StandardScaler() and then either call .fit or .fit_transform on your data, or add it to the pipeline as a step, like so:
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
('attribs_adder', Assignment4Transformer()),])
std_scaler= StandardScaler()
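# option 1: apply the scaler directly to some data
std_scaler.fit_transform(some_data)
# option 2: add it to the pipeline (pick ONE of the positions below,
# since step names must be unique)
# as the first step
num_pipeline.steps.insert(0, ("scale", std_scaler))
# or as the last step
num_pipeline.steps.append(("scale", std_scaler))
# or at some position in between
num_pipeline.steps.insert(1, ("scale", std_scaler))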
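Putting the pieces together, the three named steps from the assignment could be declared in one go. This is only a sketch: DropColumnTransformer is a hypothetical stand-in for Assignment4Transformer (its x4_index parameter is an assumption), and the small array is made-up numeric data with columns in the order x1, x2, x4, x5.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DropColumnTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: drops the x4 column when drop_x4 is True."""

    def __init__(self, drop_x4=True, x4_index=2):
        self.drop_x4 = drop_x4
        self.x4_index = x4_index

    def fit(self, X, y=None):
        # Nothing to learn; fit must return self so the pipeline can chain.
        return self

    def transform(self, X):
        X = np.asarray(X)
        if self.drop_x4:
            return np.delete(X, self.x4_index, axis=1)
        return X


num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("custom_trans", DropColumnTransformer()),
    ("std_scaler", StandardScaler()),
])

# Made-up numeric rows (x1, x2, x4, x5) with two missing values.
X = np.array([
    [1.5, 2.35, 593.0, 0.75],
    [2.5, np.nan, 340.0, 2.08],
    [3.5, 4.02, 551.0, np.nan],
    [4.5, 5.10, 2368.0, 6.75],
])
X_scaled = num_pipeline.fit_transform(X)
print(X_scaled.shape)
```

Note that the transformer defines fit and transform separately; TransformerMixin then supplies fit_transform, so there is no need to override it.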

Specific calculations for unique column values in DataFrame

I want to make a beta calculation in my dataframe, where beta = Σ(daily returns - mean daily return) * (daily market returns - mean market return) / Σ (daily market returns - mean market return)**2
But I want my beta calculation to apply to specific firms. In my dataframe, each firm has an ID code number (specified in column 1), and I want each ID code to be associated with its unique beta.
I tried groupby, loc and a for loop, but it always seems to return an error, since the beta calculation is quite long and requires many parentheses when inserted.
Any idea how to solve this problem? Thank you!
Dataframe:
index ID price daily_return mean_daily_return_per_ID daily_market_return mean_daily_market_return date
0 1 27.50 0.008 0.0085 0.0023 0.03345 01-12-2012
1 2 33.75 0.0745 0.0745 0.00458 0.0895 06-12-2012
2 3 29,20 0.00006 0.00006 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 0.005125 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 0.0085 0.0846 0.04345 04-05-2014
5 4 22.75 0.00539 0.005125 0.0003 0.0006
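I assume the following form of your equation is what you intended:
$$\beta = \frac{\sum_i (r_i - \bar{r})(m_i - \bar{m})}{\sum_i (m_i - \bar{m})^2}$$
where r_i is the daily return, m_i is the daily market return, and \bar{r} and \bar{m} are their means within a group.
Then the following should compute the beta value for each group identified by ID.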
Method 1: Creating our own function to output beta
import pandas as pd
import numpy as np
# beta_data.csv is a csv version of the sample data frame you
# provided.
df = pd.read_csv("./beta_data.csv")
def beta(daily_return, daily_market_return):
    """
    Returns the beta calculation for two pandas columns of equal length.
    Will return NaN for columns that have just one row each. Adjust
    this function to account for groups that have only a single value.
    """
    mean_daily_return = np.sum(daily_return) / len(daily_return)
    mean_daily_market_return = np.sum(daily_market_return) / len(daily_market_return)
    num = np.sum(
        (daily_return - mean_daily_return)
        * (daily_market_return - mean_daily_market_return)
    )
    denom = np.sum((daily_market_return - mean_daily_market_return) ** 2)
    return num / denom
# group by the column ID, select the two desired columns with a list
# (double brackets), then 'apply' the function we created above
betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
Method 2: Using pandas' builtin statistical functions
Notice that beta as stated above is just the covariance of DR and
DMR divided by the variance of DMR. Therefore we can write the above
program much more concisely as follows.
import pandas as pd
import numpy as np
df = pd.read_csv("./beta_data.csv")
def beta(dr, dmr):
    """
    dr: daily_return (pandas columns)
    dmr: daily_market_return (pandas columns)
    TODO: Fix the divide-by-zero errors etc.
    """
    num = dr.cov(dmr)
    denom = dmr.var()
    return num / denom
# select the two columns with a list (double brackets) before applying
betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
The output in both cases is:
ID
1 0.012151
2 NaN
3 NaN
4 -0.883333
dtype: float64
IDs 2 and 3 yield NaN because they have only a single row each. You should modify the function beta to accommodate these corner cases.
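One way to make that corner case explicit is sketched below. The name safe_beta is hypothetical, and returning NaN for groups with fewer than two rows (or zero market variance) is an assumption about the desired behaviour; the numbers are taken from the sample table in the question.

```python
import numpy as np
import pandas as pd

# Sample rows from the question (ID, daily return, daily market return).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 1, 4],
    "daily_return": [0.008, 0.0745, 0.00006, 0.00486, 0.009, 0.00539],
    "daily_market_return": [0.0023, 0.00458, 0.0582, 0.0009, 0.0846, 0.0003],
})


def safe_beta(group):
    """Beta = cov(return, market) / var(market), or NaN when undefined."""
    if len(group) < 2:
        return np.nan  # a single observation has no variance
    market = group["daily_market_return"]
    market_var = market.var()
    if market_var == 0:
        return np.nan  # constant market return: division by zero
    return group["daily_return"].cov(market) / market_var


betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(safe_beta)
print(betas)
```

If a different fallback is wanted for single-row firms (for example 0 or dropping them), only the two early returns need to change.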
Maybe you can start like this?
id_list = list(set(df["ID"].values.tolist()))
for firm_id in id_list:
    new_df = df.loc[df["ID"] == firm_id]

Why my lower outliers are not showing in Box plot

Dataset
store id,revenue ,profit
101,779183,281257
101,144829,838451
101,766465,757565
101,353297,261071
101,1615461,275760
101,246731,949229
101,951518,301016
101,444669,430583
Code
import pandas as pd
import seaborn as sns
dummies = pd.read_csv('1.csv')
dummies.sort_values(by=['revenue'], inplace=True)
fea = dummies[['storeid']]
lab = dummies[['revenue']]
param = 'revenue'
qv1 = lab[param].quantile(0.25)
qv2 = lab[param].quantile(0.5)
qv3 = lab[param].quantile(0.75)
qv_limit = 1.5 * (qv3 - qv1)
un_outliers_mask = (lab[param] > qv3 + qv_limit) | (lab[param] < qv1 - qv_limit)
un_outliers_data = lab[param][un_outliers_mask]
un_outliers_name = fea[un_outliers_mask]
un_outliers_data
#41 54437
# 44 89269
# 40 1942989
# 6 1951518
dummies.boxplot(by='storeid', column=['revenue'], grid=False)
un_outliers_data contains both the higher and the lower outliers, but the plot displays only the higher ones.
un_outliers_data holds the global outliers, i.e. the quartiles are computed over the complete revenue data in the dummies dataframe. Your box plot, however, groups the data by storeid and then calculates the median, quartiles, etc. within each group.
You will see the expected outliers (un_outliers_data) if you plot the ungrouped column instead: dummies['revenue'].plot(kind='box')
Example:
Consider the below small dataset:
store id,revenue
101, 10
102, 190
103, 200
104, 210
105, 300
It should be clear that revenue = 10 and revenue = 300 are outliers in the combined data, but neither is an outlier if you look only at the data for store id 101 or 105, respectively.
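The same point can be checked numerically. The sketch below applies the 1.5 * IQR rule from the question once to the whole column and once per store; the helper name iqr_outliers is hypothetical.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    limit = k * (q3 - q1)
    return values[(values < q1 - limit) | (values > q3 + limit)]


df = pd.DataFrame({
    "storeid": [101, 102, 103, 104, 105],
    "revenue": [10, 190, 200, 210, 300],
})

# Global view: quartiles computed over every row.
global_outliers = iqr_outliers(df["revenue"])

# Grouped view: quartiles computed separately for each store.
per_store_counts = {
    store: len(iqr_outliers(revenue))
    for store, revenue in df.groupby("storeid")["revenue"]
}

print(global_outliers.tolist())
print(per_store_counts)
```

With one row per store, each group's quartiles collapse onto its single value, so no point can fall outside the whiskers.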

Custom transformer that splits dates into new column

I am following the sklearn_pandas walk through found on the sklearn_pandas README on github and am trying to modify the DateEncoder() custom transformer example to do 2 additional things:
1. Convert string type columns to datetime while taking the date format as a parameter
2. Append the original column names when spitting out the new columns. E.g. if the input column is Date1, then the outputs are Date1_year, Date1_month, Date1_day.
Here is my attempt (with a rather rudimentary understanding of sklearn pipelines):
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn_pandas import DataFrameMapper
class DateEncoder(TransformerMixin):
    '''
    Specify date format using python strftime formats
    '''
    def __init__(self, date_format='%Y-%m-%d'):
        self.date_format = date_format
    def fit(self, X, y=None):
        self.dt = pd.to_datetime(X, format=self.date_format)
        return self
    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)
data = pd.DataFrame({'dates1': ['2001-12-20','2002-10-21','2003-08-22','2004-08-23',
'2004-07-20','2007-12-21','2006-12-22','2003-04-23'],
'dates2' : ['2012-12-20','2009-10-21','2016-08-22','2017-08-23',
'2014-07-20','2011-12-21','2014-12-22','2015-04-23']})
DATE_COLS = ['dates1', 'dates2']
Mapper = DataFrameMapper([(i, DateEncoder(date_format='%Y-%m-%d')) for i in DATE_COLS], input_df=True, df_out=True)
test = Mapper.fit_transform(data)
But on runtime, I get the following error:
AttributeError: Can only use .dt accessor with datetimelike values
Why am I getting this error and how to fix it?
Also, any help with renaming the new columns as mentioned above, using the original column names (Date1_year, Date1_month, Date1_day), would be greatly appreciated!
I know this is late, but if you're still interested in a way to do this while renaming the columns with the custom transformer...
I used the approach of adding the method get_feature_names to the custom transformer inside a pipeline with the ColumnTransformer (overview). You can then use the .named_steps attribute to access the pipeline's step and then get to get_feature_names and then get the column_names, which ultimately holds the names of the custom column names to be used. This way you can retrieve column names similar to the approach in this SO post.
I had to run this with a pipeline because when I attempted to do it as a standalone custom transformer it went badly wrong (so I won't post that incomplete attempt here) - though you may have better luck.
Here is the raw code showing the pipeline
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
data2 = pd.DataFrame(
{"dates1": ["2001-12-20", "2002-10-21", "2003-08-22", "2004-08-23",
"2004-07-20", "2007-12-21", "2006-12-22", "2003-04-23"
], "dates2": ["2012-12-20", "2009-10-21", "2016-08-22", "2017-08-23",
"2014-07-20", "2011-12-21", "2014-12-22", "2015-04-23"]})
DATE_COLS = ['dates1', 'dates2']
pipeline = Pipeline([
('transform', ColumnTransformer([
('datetimes', Pipeline([
('formatter', DateFormatter()), ('encoder', DateEncoder()),
]), DATE_COLS),
])),
])
data3 = pd.DataFrame(pipeline.fit_transform(data2))
data3_names = (
pipeline.named_steps['transform']
.named_transformers_['datetimes']
.named_steps['encoder']
.get_feature_names()
)
data3.columns = data3_names
print(data2)
print(data3)
The output is
dates1 dates2
0 2001-12-20 2012-12-20
1 2002-10-21 2009-10-21
2 2003-08-22 2016-08-22
3 2004-08-23 2017-08-23
4 2004-07-20 2014-07-20
5 2007-12-21 2011-12-21
6 2006-12-22 2014-12-22
7 2003-04-23 2015-04-23
dates1_year dates1_month dates1_day dates2_year dates2_month dates2_day
0 2001 12 20 2012 12 20
1 2002 10 21 2009 10 21
2 2003 8 22 2016 8 22
3 2004 8 23 2017 8 23
4 2004 7 20 2014 7 20
5 2007 12 21 2011 12 21
6 2006 12 22 2014 12 22
7 2003 4 23 2015 4 23
The custom transformers are here (skipping DateFormatter, since it is identical to yours)
class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        dfs = []
        self.column_names = []
        for column in X:
            dt = X[column].dt
            # Assign custom column names
            newcolumnnames = [column+'_'+col for col in ['year', 'month', 'day']]
            df_dt = pd.concat([dt.year, dt.month, dt.day], axis=1)
            # Append DF to list to assemble list of DFs
            dfs.append(df_dt)
            # Append single DF's column names to blank list
            self.column_names.append(newcolumnnames)
        # Horizontally concatenate list of DFs
        dfs_dt = pd.concat(dfs, axis=1)
        return dfs_dt
    def get_feature_names(self):
        # Flatten list of column names
        self.column_names = [c for sublist in self.column_names for c in sublist]
        return self.column_names
Rationale for DateEncoder
The loop over pandas columns allows the datetime attributes to be extracted from each datetime column. In the same loop, the custom column names are constructed. These are then added to a blank list under self.column_names which is returned in the method get_feature_names (though it has to be flattened before assigning to a dataframe).
For this particular case, you could potentially skip sklearn_pandas.
Details
sklearn = 0.20.0
pandas = 0.23.4
numpy = 1.15.2
python = 2.7.15rc1
I was able to break the data format conversion and date splitter into two separate transformers and it worked.
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn_pandas import DataFrameMapper
data2 = pd.DataFrame({'dates1': ['2001-12-20','2002-10-21','2003-08-22','2004-08-23',
'2004-07-20','2007-12-21','2006-12-22','2003-04-23'],
'dates2' : ['2012-12-20','2009-10-21','2016-08-22','2017-08-23',
'2014-07-20','2011-12-21','2014-12-22','2015-04-23']})
class DateFormatter(TransformerMixin):
    def fit(self, X, y=None):
        # stateless transformer
        return self
    def transform(self, X):
        # assumes X is a DataFrame
        Xdate = X.apply(pd.to_datetime)
        return Xdate
class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)
DATE_COLS = ['dates1', 'dates2']
datemult = DataFrameMapper(
[ (i,[DateFormatter(),DateEncoder()]) for i in DATE_COLS ]
, input_df=True, df_out=True)
df = datemult.fit_transform(data2)
This code outputs:
Out[4]:
dates1_0 dates1_1 dates1_2 dates2_0 dates2_1 dates2_2
0 2001 12 20 2012 12 20
1 2002 10 21 2009 10 21
2 2003 8 22 2016 8 22
3 2004 8 23 2017 8 23
4 2004 7 20 2014 7 20
5 2007 12 21 2011 12 21
6 2006 12 22 2014 12 22
7 2003 4 23 2015 4 23
However, I am still looking for a way to rename the new columns while applying the DateEncoder() transformer, e.g. dates1_0 --> dates1_year and dates2_1 --> dates2_month. I'd be happy to select that as the solution.
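For what it's worth, the AttributeError in the original attempt arises because fit stores the converted values in self.dt, while transform reads X.dt from the still-unconverted string column. One possible way to get both the conversion and the renamed columns is to do everything inside transform and return a DataFrame with explicit names. The sketch below is an assumption-laden illustration, not the sklearn_pandas way: the class name NamedDateEncoder is hypothetical, and it expects a DataFrame of date strings rather than a single column.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class NamedDateEncoder(BaseEstimator, TransformerMixin):
    """Split each date column into <name>_year, <name>_month, <name>_day."""

    def __init__(self, date_format="%Y-%m-%d"):
        self.date_format = date_format

    def fit(self, X, y=None):
        # Stateless: the conversion happens in transform.
        return self

    def transform(self, X):
        parts = {}
        for column in X.columns:
            # Convert here, so .dt is used on datetime values, not strings.
            dt = pd.to_datetime(X[column], format=self.date_format).dt
            parts[f"{column}_year"] = dt.year
            parts[f"{column}_month"] = dt.month
            parts[f"{column}_day"] = dt.day
        return pd.DataFrame(parts, index=X.index)


data = pd.DataFrame({
    "dates1": ["2001-12-20", "2002-10-21"],
    "dates2": ["2012-12-20", "2009-10-21"],
})
encoded = NamedDateEncoder(date_format="%Y-%m-%d").fit_transform(data)
print(encoded)
```

Because the output is already a named DataFrame, no separate get_feature_names step is needed when the transformer is used on its own.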

Add column of .75 quantile based off groupby

I have a df with dates as the index and a column called scores. I want to keep the df as it is but add a column that gives the 0.7 quantile of scores for that day. The quantile should use the midpoint interpolation method and be rounded to the nearest whole number.
I've outlined one approach you could take, below.
Note that to round a value to the nearest whole number you should use Python's built-in round() function. See round() in the Python documentation for details.
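As a compact variant of the same idea, groupby(...).transform can broadcast each day's quantile back onto every row, so no separate mapping step is needed. The sketch assumes the layout described in the question (dates in the index, a column named scores); the values and the new column name quantile_0.7 are made up for illustration, and Series.round is used here in place of the built-in round().

```python
import pandas as pd

# Made-up data: three scores on the first day, two on the second.
df = pd.DataFrame(
    {"scores": [10, 20, 34, 5, 9]},
    index=pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"]
    ),
)
df.index.name = "date"

# One quantile per day, repeated for every row of that day, then rounded.
df["quantile_0.7"] = (
    df.groupby(level="date")["scores"]
    .transform(lambda s: s.quantile(0.7, interpolation="midpoint"))
    .round()
    .astype(int)
)
print(df)
```

The longer walkthrough that follows builds the same kind of column step by step with an explicit mapping.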
import pandas as pd
import numpy as np
# set random seed for reproducibility
np.random.seed(748)
# initialize base example dataframe
df = pd.DataFrame({"date":np.arange(10),
"score":np.random.uniform(size=10)})
duplicate_dates = np.random.choice(df.index, 5)
df_dup = pd.DataFrame({"date":np.random.choice(df.index, 5),
"score":np.random.uniform(size=5)})
# finish compiling example data
df = pd.concat([df, df_dup], ignore_index=True)
# calculate 0.7 quantile result with specified parameters
result = df.groupby("date").quantile(q=0.7, interpolation='midpoint')
# print resulting dataframe
# contains one unique 0.7 quantile value per date
print(result)
"""
0.7 score
date
0 0.585087
1 0.476404
2 0.426252
3 0.363376
4 0.165013
5 0.927199
6 0.575510
7 0.576636
8 0.831572
9 0.932183
"""
# to apply the resulting quantile information to
# a new column in our original dataframe `df`
# we can apply a dictionary to our "date" column
# create dictionary
mapping = result.to_dict()["score"]
# apply to `df` to produce desired new column
df["quantile_0.7"] = [mapping[x] for x in df["date"]]
print(df)
"""
date score quantile_0.7
0 0 0.920895 0.585087
1 1 0.476404 0.476404
2 2 0.380771 0.426252
3 3 0.363376 0.363376
4 4 0.165013 0.165013
5 5 0.927199 0.927199
6 6 0.340008 0.575510
7 7 0.695818 0.576636
8 8 0.831572 0.831572
9 9 0.932183 0.932183
10 7 0.457455 0.576636
11 6 0.650666 0.575510
12 6 0.500353 0.575510
13 0 0.249280 0.585087
14 2 0.471733 0.426252
"""