Custom transformer that splits dates into new column - pandas

I am following the sklearn_pandas walk through found on the sklearn_pandas README on github and am trying to modify the DateEncoder() custom transformer example to do 2 additional things:
Convert string type columns to datetime while taking the date format as a parameter
Prepend the original column name to each new column it outputs. E.g.: if the input column is Date1, the outputs are Date1_year, Date1_month, Date1_day.
Here is my attempt (with a rather rudimentary understanding of sklearn pipelines):
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn_pandas import DataFrameMapper

class DateEncoder(TransformerMixin):
    '''
    Specify date format using python strftime formats
    '''
    def __init__(self, date_format='%Y-%m-%d'):
        self.date_format = date_format

    def fit(self, X, y=None):
        self.dt = pd.to_datetime(X, format=self.date_format)
        return self

    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)

data = pd.DataFrame({'dates1': ['2001-12-20', '2002-10-21', '2003-08-22', '2004-08-23',
                                '2004-07-20', '2007-12-21', '2006-12-22', '2003-04-23'],
                     'dates2': ['2012-12-20', '2009-10-21', '2016-08-22', '2017-08-23',
                                '2014-07-20', '2011-12-21', '2014-12-22', '2015-04-23']})
DATE_COLS = ['dates1', 'dates2']

Mapper = DataFrameMapper([(i, DateEncoder(date_format='%Y-%m-%d')) for i in DATE_COLS],
                         input_df=True, df_out=True)
test = Mapper.fit_transform(data)
But at runtime, I get the following error:
AttributeError: Can only use .dt accessor with datetimelike values
Why am I getting this error, and how do I fix it?
Also, any help with renaming the new columns with the original column names as mentioned above (Date1_year, Date1_month, Date1_day) would be greatly appreciated!

I know this is late, but if you're still interested in a way to do this while renaming the columns with the custom transformer...
I used the approach of adding a get_feature_names method to the custom transformer and running it inside a pipeline with ColumnTransformer. You can then use the .named_steps attribute to reach the pipeline step, call its get_feature_names, and retrieve column_names, which holds the custom column names to be used. This way you can recover the column names, similar to the approach in this SO post.
I had to run this inside a pipeline because my attempt to do it as a standalone custom transformer went badly wrong (so I won't post that incomplete attempt here), though you may have better luck.
Here is the raw code showing the pipeline:
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

data2 = pd.DataFrame(
    {"dates1": ["2001-12-20", "2002-10-21", "2003-08-22", "2004-08-23",
                "2004-07-20", "2007-12-21", "2006-12-22", "2003-04-23"],
     "dates2": ["2012-12-20", "2009-10-21", "2016-08-22", "2017-08-23",
                "2014-07-20", "2011-12-21", "2014-12-22", "2015-04-23"]})
DATE_COLS = ['dates1', 'dates2']

pipeline = Pipeline([
    ('transform', ColumnTransformer([
        ('datetimes', Pipeline([
            ('formatter', DateFormatter()),
            ('encoder', DateEncoder()),
        ]), DATE_COLS),
    ])),
])

data3 = pd.DataFrame(pipeline.fit_transform(data2))
data3_names = (
    pipeline.named_steps['transform']
    .named_transformers_['datetimes']
    .named_steps['encoder']
    .get_feature_names()
)
data3.columns = data3_names
print(data2)
print(data3)
The output is
dates1 dates2
0 2001-12-20 2012-12-20
1 2002-10-21 2009-10-21
2 2003-08-22 2016-08-22
3 2004-08-23 2017-08-23
4 2004-07-20 2014-07-20
5 2007-12-21 2011-12-21
6 2006-12-22 2014-12-22
7 2003-04-23 2015-04-23
dates1_year dates1_month dates1_day dates2_year dates2_month dates2_day
0 2001 12 20 2012 12 20
1 2002 10 21 2009 10 21
2 2003 8 22 2016 8 22
3 2004 8 23 2017 8 23
4 2004 7 20 2014 7 20
5 2007 12 21 2011 12 21
6 2006 12 22 2014 12 22
7 2003 4 23 2015 4 23
The custom transformers are here (skipping DateFormatter, since it is identical to yours):
class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dfs = []
        self.column_names = []
        for column in X:
            dt = X[column].dt
            # Assign custom column names
            newcolumnnames = [column + '_' + col for col in ['year', 'month', 'day']]
            df_dt = pd.concat([dt.year, dt.month, dt.day], axis=1)
            # Append DF to list to assemble list of DFs
            dfs.append(df_dt)
            # Append single DF's column names to blank list
            self.column_names.append(newcolumnnames)
        # Horizontally concatenate list of DFs
        dfs_dt = pd.concat(dfs, axis=1)
        return dfs_dt

    def get_feature_names(self):
        # Flatten list of column names
        self.column_names = [c for sublist in self.column_names for c in sublist]
        return self.column_names
Rationale for DateEncoder
The loop over the pandas columns allows the datetime attributes to be extracted from each datetime column. In the same loop, the custom column names are constructed and appended to a blank list under self.column_names, which is returned by the get_feature_names method (though it has to be flattened before being assigned to a dataframe).
For this particular case, you could potentially skip sklearn_pandas.
Details
sklearn = 0.20.0
pandas = 0.23.4
numpy = 1.15.2
python = 2.7.15rc1

I was able to break the date format conversion and the date splitting into two separate transformers and it worked. This also explains the original error: DataFrameMapper passes the raw string column to transform, and since the original DateEncoder only converted to datetime in fit (storing the result as self.dt without ever applying it in transform), X.dt was called on plain strings.
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn_pandas import DataFrameMapper

data2 = pd.DataFrame({'dates1': ['2001-12-20', '2002-10-21', '2003-08-22', '2004-08-23',
                                 '2004-07-20', '2007-12-21', '2006-12-22', '2003-04-23'],
                      'dates2': ['2012-12-20', '2009-10-21', '2016-08-22', '2017-08-23',
                                 '2014-07-20', '2011-12-21', '2014-12-22', '2015-04-23']})

class DateFormatter(TransformerMixin):
    def fit(self, X, y=None):
        # stateless transformer
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xdate = X.apply(pd.to_datetime)
        return Xdate

class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)

DATE_COLS = ['dates1', 'dates2']

datemult = DataFrameMapper(
    [(i, [DateFormatter(), DateEncoder()]) for i in DATE_COLS],
    input_df=True, df_out=True)
df = datemult.fit_transform(data2)
This code outputs:
Out[4]:
dates1_0 dates1_1 dates1_2 dates2_0 dates2_1 dates2_2
0 2001 12 20 2012 12 20
1 2002 10 21 2009 10 21
2 2003 8 22 2016 8 22
3 2004 8 23 2017 8 23
4 2004 7 20 2014 7 20
5 2007 12 21 2011 12 21
6 2006 12 22 2014 12 22
7 2003 4 23 2015 4 23
However, I am still looking for a way to rename the new columns while applying the DateEncoder() transformer. E.g.: dates1_0 --> dates1_year and dates2_2 --> dates2_day. I'd be happy to select that as the solution.
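In the meantime, one workaround (a sketch, not part of sklearn_pandas itself) is to rename the columns after fit_transform: DataFrameMapper suffixes the new columns positionally as _0/_1/_2, so those suffixes can be mapped back to year/month/day:
# map DataFrameMapper's positional suffixes back to date parts
suffix_map = {'0': 'year', '1': 'month', '2': 'day'}
df.columns = [col.rsplit('_', 1)[0] + '_' + suffix_map[col.rsplit('_', 1)[1]]
              for col in df.columns]
print(df.columns)  # dates1_year, dates1_month, dates1_day, dates2_year, ...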


Error about unmatched parenthesis running simple imputer

Here's what I'm trying to do:
Applies the SimpleImputer class to the data, where the strategy is set to mean. The name of this step should be "imputer".
Here's the code I'm using:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

fileName = "CustomTransformerData.csv"
custom_transform = pd.read_csv("CustomTransformerData.csv")

data_num = custom_transform.drop(['x3'], axis=1)  # created the df for numerical data
data_cat = custom_transform.drop(['x1', 'x2', 'x4', 'x5'], axis=1)  # created the df for categorical data

# importing sklearn
from sklearn.base import BaseEstimator, TransformerMixin

## creating the transformer
class Assignment4Transformer(BaseEstimator, TransformerMixin):
    def __init__(self, drop_x4=True, y=None):
        self.drop_x4 = drop_x4  # flag to drop the x4 column

    def fit_transform(self, data, y=None):
        return self

    def transform(self, data):  # starting the function to determine x4
        # not adding the x3 categorical data
        if self.drop_x4:  # a flag to catch and drop x4, giving a new index
            data = np.delete(data, 2, axis=1)
        return np.c_[data, new_col]

from sklearn.pipeline import Pipeline  # importing the pipeline
from sklearn.impute import SimpleImputer  # importing the SimpleImputer
from sklearn.preprocessing import StandardScaler  # importing the preprocessor

num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')])  # this is where I encounter the below error
File "/var/folders/5v/f6glw1515sqbvblc482qs47c0000gn/T/ipykernel_42484/2823414947.py", line 1
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')])
^
SyntaxError: closing parenthesis ']' does not match opening parenthesis '('
Alternatively, I tried this as well, which did not error, but then the next code errored:
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                         ('attribs_adder', Assignment4Transformer()),])
std_scaler = StandardScaler(num_pipeline)

# Splitting the independent and dependent variables
std_scaler = data_num.data
response = data_num.target

# standardization
scale = object.fit_transform(data_num)
TypeError                                 Traceback (most recent call last)
/var/folders/5v/f6glw1515sqbvblc482qs47c0000gn/T/ipykernel_42484/1423714864.py in <module>
----> 1 std_scaler= StandardScaler(num_pipeline)
      2
      3 # Splitting the independent and dependent variables
      4 std_scaler = data_num.data
      5 response = data_num.target

TypeError: __init__() takes 1 positional argument but 2 were given
So I'm not sure if going the second route was truly correct, and I just need help with this portion:
Applies the custom Assignment4Transformer class to the data. Make sure that your custom transformer uses the default argument where you drop the x4 column. The name of this step should be "custom_trans".
Applies the StandardScaler class to the data. The name of this step should be "std_scaler".
Data (since it doesn't appear to be carrying through):
    x1    x2           x3    x4    x5
1   1.5   2.354152979  COLD  593   0.75
2   2.5   3.31404772   WARM  340   2.083333333
3   3.5   4.021604459  COLD  551   4.083333333
4   4.5                COLD  2368  6.75
5   5.5   5.847601001  WARM  2636  10.08333333
6   6.5   7.229910044  WARM  2779  14.08333333
7   7.5   7.997255234  HOT   1057  18.75
8   8.5   9.203946542  COLD  819   24.08333333
9   9.5   10.33534766  WARM  3349
10  10.5  11.11214192  HOT   3235  36.75
11  11.5  11.75961084  WARM  216   44.08333333
12  12.5  12.62909577  WARM  2529  52.08333333
13  13.5  14.08258887  COLD  1735  60.75
14  14.5  14.65767801  HOT   1254  70.08333333
15  15.5               HOT   1245  80.08333333
16  16.6  17.18411403  WARM  310   90.75
17  17.5  17.80077555  HOT   201   102.0833333
18  18.5  18.57886101  HOT   1767  114.0833333
In your first code block...
This:
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')])
should be this:
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean'))])
I added a closing parenthesis after SimpleImputer(strategy='mean').
In your second code block...
StandardScaler is a class and needs to be instantiated before it can be used.
In your code here:
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
('attribs_adder', Assignment4Transformer()),])
std_scaler= StandardScaler(num_pipeline)
You pass your num_pipeline to the class, but you should instead instantiate std_scaler on its own, then either call .fit or .fit_transform on data, or add it to the pipeline like so:
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                         ('attribs_adder', Assignment4Transformer()),])
std_scaler = StandardScaler()

# applying to data
std_scaler.fit_transform(some_data)

# adding to pipeline at 0th step
num_pipeline.steps.insert(0, ("scale", std_scaler))

# last step
num_pipeline.steps.extend([("scale", std_scaler)])

# some other step???
num_pipeline.steps.insert(1, ("scale", std_scaler))
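Since the assignment asks for the specific step names "imputer", "custom_trans", and "std_scaler", here is a minimal sketch of the full pipeline (assuming your Assignment4Transformer is fixed so that fit returns self and transform does the actual work):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # step name per assignment: "imputer"
    ('custom_trans', Assignment4Transformer()),   # step name per assignment: "custom_trans"
    ('std_scaler', StandardScaler()),             # step name per assignment: "std_scaler"
])

# fit_transform runs all three steps on the numerical dataframe
data_num_prepared = num_pipeline.fit_transform(data_num)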

Problems with list creation and the use of .unique()

Here is the exercise: we have to use the boxplot() method of matplotlib.pyplot. If I understood correctly, in this exercise we create a list of lists and then display those lists as a box-and-whisker plot.
For me, the difficulty is understanding how the list of lists is created. I tried to create one with random numbers, but without success, because the commands from the exercise do not work.
The given solution is:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df['Month'] = df.Month.apply(lambda x: x[3:])
print(df.head(3))

l = list()
print("shape = ", df.Month.shape)
for i in df.Month.unique():
    l.append(df[df['Month'] == i]['Turnover'])
print(l[0:3])

plt.boxplot(l);
plt.xticks(range(1, 13), df.Month.unique());
plt.show();
That produces:
Month Product1 Product2 Returns Turnover Month
0 01-Jan 266 355 0 25285 Jan
1 01-Feb 145 204 6 14255 Feb
2 01-March 183 196 11 15225 March
shape = (36,)
[0 25285
12 15700
24 17490
Name: Turnover, dtype: int64, 1 14255
13 19660
25 29665
Name: Turnover, dtype: int64, 2 15225
14 15360
26 22815
Name: Turnover, dtype: int64]
I don't understand by what mechanism the loop creates a succession of tables.
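For what it's worth, each iteration builds a boolean mask and uses it to select the matching rows' Turnover values as a pandas Series, e.g.:
mask = df['Month'] == 'Jan'           # boolean Series, True where Month is 'Jan'
jan_turnover = df[mask]['Turnover']   # Series of Turnover values for January
So l ends up as a list of twelve Series, one per month, which is the list-of-lists shape that plt.boxplot() accepts.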
I tried to recreate an example to do the same thing with numbers.
k = list()
nbre = np.random.choice(11, 40)
NBR = pd.DataFrame(nbre)
print("shape =", NBR.shape)
for n in NBR.unique():
    k.append(n)
print(k)
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
Why does the following work?
k = list()
nbre = np.random.choice(11, 40)
for n in pd.unique(nbre):
    k.append(n)
print(k)
On the other hand,
k = list()
nbr = np.random.choice(11, 40)
nbr = pd.DataFrame(nbr)
# print(nbr)
for n in nbr.unique():
    k.append(n)
print(k)
does not work... I was thinking of creating a data frame from which I could build a list of lists of random numbers using .unique(), but it fails.
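Note that .unique() is a Series method, not a DataFrame method, which is why pd.unique(nbre) on the 1-D array works while nbr.unique() on the DataFrame fails. A minimal sketch of the working per-column call:
import numpy as np
import pandas as pd

nbr = pd.DataFrame(np.random.choice(11, 40))
k = list(nbr[0].unique())  # select column 0 as a Series first, then .unique() works
print(k)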

Stratified Sampling with different sizes

I am trying to create a function for stratified sampling which takes in a dataframe created using the faker module, along with the strata, the sample size, and a random seed. For the sample size, I want the number of samples in each stratum to vary based on user input. This is my code for creating the data:
import pandas as pd
import numpy as np
import random as rn  # generating random numbers
from faker import Faker

fake = Faker()

frame_fake = pd.DataFrame([{"region": fake.random_number(1, fix_len=True),
                            "district": fake.random_number(2, fix_len=True),
                            "enum_area": fake.random_number(5, fix_len=True),
                            "hhs": fake.random_number(3),
                            "pop": fake.random_number(4),
                            "area": fake.random_number(1)} for x in range(100)])

# check for and remove duplicates from enum area (should be unique)
# before any further analysis
mask = frame_fake.duplicated('enum_area', keep='last')
duplicates = frame_fake[mask]
# print(duplicates)

# drop all except last
frame_fake = frame_fake.drop_duplicates('enum_area',
                                        keep='last').sort_values(by='enum_area', ascending=True)

# reset index to have them sequentially after sorting by enum_area and
# drop the old index column
frame_fake = frame_fake.reset_index().drop('index', axis=1)
frame_fake
This is the code for sampling:
def stratified_custom(data, strata, sample_size, seed=None):
    # for this part, we sample 5 enum areas in each strata/region
    # we groupby strata and use the transform method with 'count' parameter
    # to get strata sizes
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    # map input sample size to each strata
    data['strat_sample_size'] = data[strata].map(sample_size)
    # groupby strata, get sample size per stratum, cast to int and reset
    # index
    smp_size = data.groupby(strata)['strat_sample_size'].unique().astype(int).reset_index()
    # groupby strata and select sample per stratum based on the sample size
    # for that strata
    sample = (data.groupby(strata, group_keys=False)
              .apply(lambda x: x.sample(smp_size, random_state=seed)))
    # probability of inclusion
    sample['inclusion_prob'] = sample['strat_sample_size'] / sample['strat_size']
    return sample
s_size = {1: 7, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5, 9: 8}  # pass in strata and sample size as dict (key, values)

(stratified_custom(data=frame_fake, strata='region', sample_size=s_size,
                   seed=99).sort_values(by=['region', 'enum_area'], ascending=True))
However, I receive this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I can't figure out what this error is talking about. Any help is appreciated.
After much research, I stumbled upon this post https://stackoverflow.com/a/58794577/14198137 and implemented it in my code to sample not only with varying sample sizes but also with fixed ones, using the same function. (The original error, for what it's worth, came from passing the whole smp_size DataFrame as the n argument of x.sample(), which pandas cannot evaluate as a single value.) Here is my code for the data:
import pandas as pd
import numpy as np
import random as rn
from faker import Faker

Faker.seed(99)
fake = Faker()

frame_fake = pd.DataFrame([{"region": fake.random_number(1, fix_len=True),
                            "district": fake.random_number(2, fix_len=True),
                            "enum_area": fake.random_number(5, fix_len=True),
                            "hhs": fake.random_number(3),
                            "pop": fake.random_number(4),
                            "area": rn.randint(1, 2)} for x in range(100)])

frame_fake = frame_fake.drop_duplicates('enum_area', keep='last').sort_values(by='enum_area', ascending=True)
frame_fake = frame_fake.reset_index().drop('index', axis=1)
Here is the updated code for stratified sampling which now works.
def stratified_custom(data, strata, sample_size, seed=None):
    data = data.copy()
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    try:
        data['strat_sample_size'] = data[strata].map(sample_size)
        smp_size = data.set_index(strata)['strat_sample_size'].to_dict()
        strat2_sample = (data.groupby(strata, group_keys=False)
                         .apply(lambda x: x.sample(smp_size[x.name], random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
    except:
        data['strat_sample_size'] = sample_size
        strat2_sample = (data.groupby(strata, group_keys=False)
                         .apply(lambda x: x.sample(sample_size, random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
s_size={1:3,2:9,3:5,4:5,5:5,6:5,7:5,8:5,9:8}
variablesize = (stratified_custom(data=frame_fake,strata='region',sample_size=s_size, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
variablesize
fixedsize = (stratified_custom(data=frame_fake,strata='region',sample_size=3, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
fixedsize
The output of variable sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
0 2 65 10566 ... 10 9 0.9
15 2 22 25560 ... 10 9 0.9
The output of fixed sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
38 2 74 48408 ... 10 3 0.3
43 2 15 56365 ... 10 3 0.3
However, I was wondering: is there a better way of achieving this?
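One possible simplification (a sketch, not a definitive rewrite) is to normalize a fixed integer sample size into a per-stratum dict up front, which removes the need for the try/except and the bare except:
def stratified_custom2(data, strata, sample_size, seed=None):
    data = data.copy()
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    # normalize a fixed (int) sample size into a per-stratum dict
    if not isinstance(sample_size, dict):
        sample_size = {k: sample_size for k in data[strata].unique()}
    data['strat_sample_size'] = data[strata].map(sample_size)
    sample = (data.groupby(strata, group_keys=False)
              .apply(lambda x: x.sample(sample_size[x.name], random_state=seed)))
    sample['inclusion_prob'] = sample['strat_sample_size'] / sample['strat_size']
    return sample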

Pandas split ages by group

I'm quite new to pandas and need a bit of help. I have a column with ages and need to make groups of these:
Young people: age ≤ 30
Middle-aged people: 30 < age ≤ 60
Old people: 60 < age
Here is the code, but it gives me an error:
def get_num_people_by_age_category(dataframe):
    young, middle_aged, old = (0, 0, 0)
    dataframe["age"] = pd.cut(x=dataframe['age'], bins=[30, 31, 60, 61], labels=["young", "middle_aged", "old"])
    return young, middle_aged, old

ages = get_num_people_by_age_category(dataframe)
print(dataframe)
Code below gets the age groups using pd.cut().
# Import libraries
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'age': [1, 20, 30, 31, 50, 60, 61, 80, 90]  # np.random.randint(1,100,50)
})

# Function: Copy-pasted from question and modified
def get_num_people_by_age_category(df):
    df["age_group"] = pd.cut(x=df['age'], bins=[0, 30, 60, 100], labels=["young", "middle_aged", "old"])
    return df

# Call function
df = get_num_people_by_age_category(df)
Output
print(df)
age age_group
0 1 young
1 20 young
2 30 young
3 31 middle_aged
4 50 middle_aged
5 60 middle_aged
6 61 old
7 80 old
8 90 old
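If you also need the counts that the original get_num_people_by_age_category was meant to return, one sketch is to tally the new column with value_counts():
counts = df["age_group"].value_counts()
young, middle_aged, old = counts["young"], counts["middle_aged"], counts["old"]
print(young, middle_aged, old)  # 3 3 3 for the example data above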

Add column of .75 quantile based off groupby

I have a df with date as the index and also a column called scores. Now I want to keep the df as it is, but add a column which gives the 0.7 quantile of the scores for that day. The quantile method would need to be midpoint, and the result rounded to the nearest whole number.
I've outlined one approach you could take, below.
Note that to round a value to the nearest whole number you should use Python's built-in round() function. See round() in the Python documentation for details.
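For example (note that Python 3 rounds exact halves to the nearest even number, so round(0.5) gives 0):
round(0.576636)  # 1
round(0.165013)  # 0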
import pandas as pd
import numpy as np

# set random seed for reproducibility
np.random.seed(748)

# initialize base example dataframe
df = pd.DataFrame({"date": np.arange(10),
                   "score": np.random.uniform(size=10)})
duplicate_dates = np.random.choice(df.index, 5)
df_dup = pd.DataFrame({"date": np.random.choice(df.index, 5),
                       "score": np.random.uniform(size=5)})

# finish compiling example data
df = df.append(df_dup, ignore_index=True)

# calculate 0.7 quantile result with specified parameters
result = df.groupby("date").quantile(q=0.7, axis=0, interpolation='midpoint')

# print resulting dataframe
# contains one unique 0.7 quantile value per date
print(result)
"""
0.7 score
date
0 0.585087
1 0.476404
2 0.426252
3 0.363376
4 0.165013
5 0.927199
6 0.575510
7 0.576636
8 0.831572
9 0.932183
"""
# to apply the resulting quantile information to
# a new column in our original dataframe `df`
# we can apply a dictionary to our "date" column
# create dictionary
mapping = result.to_dict()["score"]
# apply to `df` to produce desired new column
df["quantile_0.7"] = [mapping[x] for x in df["date"]]
print(df)
"""
date score quantile_0.7
0 0 0.920895 0.585087
1 1 0.476404 0.476404
2 2 0.380771 0.426252
3 3 0.363376 0.363376
4 4 0.165013 0.165013
5 5 0.927199 0.927199
6 6 0.340008 0.575510
7 7 0.695818 0.576636
8 8 0.831572 0.831572
9 9 0.932183 0.932183
10 7 0.457455 0.576636
11 6 0.650666 0.575510
12 6 0.500353 0.575510
13 0 0.249280 0.585087
14 2 0.471733 0.426252
"""