I am trying to create a function for stratified sampling that takes in a dataframe created with the faker module, along with the strata, sample size, and a random seed. For the sample size, I want the number of samples in each stratum to vary based on user input. This is my code for creating the data:
import pandas as pd
import numpy as np
import random as rn  # generating random numbers
from faker import Faker
fake = Faker()
frame_fake = pd.DataFrame(
    [{"region": fake.random_number(1, fix_len=True),
      "district": fake.random_number(2, fix_len=True),
      "enum_area": fake.random_number(5, fix_len=True),
      "hhs": fake.random_number(3),
      "pop": fake.random_number(4),
      "area": fake.random_number(1)} for x in range(100)])
# check for and remove duplicates from enum area (should be unique)
# before any further analysis
mask= frame_fake.duplicated('enum_area', keep='last')
duplicates = frame_fake[mask]
# print(duplicates)
# drop all except last
frame_fake = frame_fake.drop_duplicates('enum_area',
keep='last').sort_values(by='enum_area',ascending=True)
# reset index to have them sequentially after sorting by enum_area and
# drop the old index column
frame_fake = frame_fake.reset_index().drop('index',axis=1)
frame_fake
This is the code for sampling:
def stratified_custom(data, strata, sample_size, seed=None):
    # for this part, we sample 5 enum areas in each stratum/region
    # we groupby strata and use the transform method with the 'count'
    # parameter to get strata sizes
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    # map the input sample size to each stratum
    data['strat_sample_size'] = data[strata].map(sample_size)
    # groupby strata, get the sample size per stratum, cast to int and
    # reset the index
    smp_size = data.groupby(strata)['strat_sample_size'].unique().astype(int).reset_index()
    # groupby strata and select a sample per stratum based on the sample
    # size for that stratum
    sample = (data.groupby(strata, group_keys=False)
                  .apply(lambda x: x.sample(smp_size, random_state=seed)))
    # probability of inclusion
    sample['inclusion_prob'] = sample['strat_sample_size'] / sample['strat_size']
    return sample
s_size = {1: 7, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5, 9: 8}  # strata and sample size as a dict (key: stratum, value: sample size)
(stratified_custom(data=frame_fake,strata='region',sample_size=s_size,
seed=99).sort_values(by=['region','enum_area'],ascending=True))
However, I receive this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
I can't figure out what this error is talking about. Any help is appreciated.
After much research, I stumbled upon this post https://stackoverflow.com/a/58794577/14198137 and adapted it in my code so that the same function can sample with varying sample sizes as well as fixed ones. Here is my code for the data:
import pandas as pd
import numpy as np
import random as rn
from faker import Faker
Faker.seed(99)
fake = Faker()
frame_fake = pd.DataFrame(
    [{"region": fake.random_number(1, fix_len=True),
      "district": fake.random_number(2, fix_len=True),
      "enum_area": fake.random_number(5, fix_len=True),
      "hhs": fake.random_number(3),
      "pop": fake.random_number(4),
      "area": rn.randint(1, 2)} for x in range(100)])
frame_fake = frame_fake.drop_duplicates('enum_area', keep='last').sort_values(by='enum_area', ascending=True)
frame_fake = frame_fake.reset_index().drop('index',axis=1)
Here is the updated code for stratified sampling which now works.
def stratified_custom(data, strata, sample_size, seed=None):
    data = data.copy()
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    try:
        # sample_size is a dict: map a per-stratum sample size onto each row
        data['strat_sample_size'] = data[strata].map(sample_size)
        smp_size = data.set_index(strata)['strat_sample_size'].to_dict()
        strat2_sample = (data.groupby(strata, group_keys=False)
                             .apply(lambda x: x.sample(smp_size[x.name], random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
    except:
        # sample_size is a single int: use the same size for every stratum
        data['strat_sample_size'] = sample_size
        strat2_sample = (data.groupby(strata, group_keys=False)
                             .apply(lambda x: x.sample(sample_size, random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
s_size={1:3,2:9,3:5,4:5,5:5,6:5,7:5,8:5,9:8}
variablesize = (stratified_custom(data=frame_fake,strata='region',sample_size=s_size, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
variablesize
fixedsize = (stratified_custom(data=frame_fake,strata='region',sample_size=3, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
fixedsize
The output of variable sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
0 2 65 10566 ... 10 9 0.9
15 2 22 25560 ... 10 9 0.9
The output of fixed sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
38 2 74 48408 ... 10 3 0.3
43 2 15 56365 ... 10 3 0.3
However, I was wondering if there is a better way of achieving this?
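For example, one possible simplification (a sketch only; stratified_custom2 is just an illustrative name, and I have not validated it beyond the example data) would be to normalize sample_size to a per-stratum dictionary up front, so a single code path handles both the dict and the fixed-size case without try/except:

def stratified_custom2(data, strata, sample_size, seed=None):
    data = data.copy()
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    # If a single int was passed, expand it into {stratum: size, ...}
    if not isinstance(sample_size, dict):
        sample_size = {k: sample_size for k in data[strata].unique()}
    data['strat_sample_size'] = data[strata].map(sample_size)
    sample = (data.groupby(strata, group_keys=False)
                  .apply(lambda x: x.sample(sample_size[x.name], random_state=seed)))
    sample['inclusion_prob'] = sample['strat_sample_size'] / sample['strat_size']
    return sample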
My dataframe (df) has some NaN entries in the new column 's_score', which I can exclude by using func(x).
That is, the execution of document_path_similarity() leads to some NaNs, preventing the execution of most_similar_docs() (if I don't use func(x) first).
D1 and D2 are columns of df containing string data.
df
Quality D1 D2
0 1 Ms Stewart, the chief executive... Ms Stewart, 61, its chief executive
1 1 After more than two years' det... After more than two years in
def most_similar_docs():
    def func(x):
        try:
            return document_path_similarity(x['D1'], x['D2'])
        except:
            return np.nan
    df['s_score'] = df.apply(func, axis=1)
Is there a way to rewrite this code as a one liner?
My attempts, such as those below, lead to ValueError: max() arg is an empty sequence or to a SyntaxError.
df['s_scores'] = df.apply(lambda x: document_path_similarity(x.D1, x.D2),axis=1)
paraphrases['s_scores'] = paraphrases.apply(lambda x: document_path_similarity(x.D1, x.D2),axis=1 if np.isnan(x))
I don't think there is anything wrong with your pandas code. What I did find is that similarity_score() is failing because it's trying to take the max of an empty list. I forced the list to be non-empty by appending a zero score. This is the first time I've looked at this library, so please don't assume my patch is a good-quality patch.
import io
df = pd.read_csv(io.StringIO("""  Quality  D1                                   D2
0  1        Ms Stewart, the chief executive...   Ms Stewart, 61, its chief executive
1  1        After more than two years' det...    After more than two years in """), sep=r"\s\s+", engine="python")
def similarity_score(s1, s2):
    list1 = []
    for a in s1:
        # patch: + [0] at the end so we never take the max of an empty list
        list1.append(max([i.path_similarity(a) for i in s2 if i.path_similarity(a) is not None] + [0]))
    output = sum(list1) / len(list1)
    return output
df = df.assign(
    s_scores=lambda x: x.apply(lambda r: document_path_similarity(r.D1, r.D2), axis=1)
)
print(df.to_string(index=False))
output
Quality D1 D2 s_scores
1 Ms Stewart, the chief executive... Ms Stewart, 61, its chief executive 0.838889
1 After more than two years' det... After more than two years in 0.912500
How do I select columns a and b from df, and save them into a new dataframe df1?
index  a  b  c
1      2  3  4
2      3  4  5
Unsuccessful attempt:
df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']
The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which variables you want to slice out, you can just select only those columns by passing a list into the __getitem__ syntax (the []'s).
df1 = df[['a', 'b']]
Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:
df1 = df.iloc[:, 0:2] # Remember that Python does not slice inclusive of the ending index.
Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).
Sometimes, however, there are indexing conventions in Pandas that don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. When this happens, changing what you think is the sliced object can sometimes alter the original object. This can happen with the second way of indexing, so you can append the .copy() method to get a regular copy. Always good to be on the lookout for this.
df1 = df.iloc[:, 0:2].copy()  # To avoid the case where changing df1 also changes df
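As a minimal sketch of the difference (using the column names from the question):

sliced = df.iloc[:, 0:2]              # may share data with df
independent = df.iloc[:, 0:2].copy()  # always an independent copy
independent.iloc[0, 0] = 99           # changes 'independent' only, never df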
To use iloc, you need to know the column positions (or indices). As the column positions may change, instead of hard-coding the indices, you can use iloc together with the get_loc function of the DataFrame's columns attribute to obtain column indices.
{df.columns.get_loc(c): c for c in df.columns}
Now you can use this dictionary to access columns through their names, and then use iloc.
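For example, a small sketch of that idea, assuming the columns 'a' and 'b' from the question:

positions = [df.columns.get_loc(c) for c in ['a', 'b']]  # look up positions by name
df1 = df.iloc[:, positions]                              # then slice positionally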
As of version 0.11.0, columns can be sliced in the manner you tried using the .loc indexer:
df.loc[:, 'C':'E']
is equivalent to
df[['C', 'D', 'E']] # or df.loc[:, ['C', 'D', 'E']]
and returns columns C through E.
A demo on a randomly generated DataFrame:
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
columns=list('ABCDEF'),
index=['R{}'.format(i) for i in range(100)])
df.head()
Out:
A B C D E F
R0 99 78 61 16 73 8
R1 62 27 30 80 7 76
R2 15 53 80 27 44 77
R3 75 65 47 30 84 86
R4 18 9 41 62 1 82
To get the columns from C to E (note that unlike integer slicing, E is included in the columns):
df.loc[:, 'C':'E']
Out:
C D E
R0 61 16 73
R1 30 80 7
R2 80 27 44
R3 47 30 84
R4 41 62 1
R5 5 58 0
...
The same works for selecting rows based on labels. Get the rows R6 to R10 from those columns:
df.loc['R6':'R10', 'C':'E']
Out:
C D E
R6 51 27 31
R7 83 19 18
R8 11 67 65
R9 78 27 29
R10 7 16 94
.loc also accepts a Boolean array so you can select the columns whose corresponding entry in the array is True. For example, df.columns.isin(list('BCD')) returns array([False, True, True, True, False, False], dtype=bool) - True if the column name is in the list ['B', 'C', 'D']; False, otherwise.
df.loc[:, df.columns.isin(list('BCD'))]
Out:
B C D
R0 78 61 16
R1 27 30 80
R2 53 80 27
R3 65 47 30
R4 9 41 62
R5 78 5 58
...
Assuming your column names (df.columns) are ['index', 'a', 'b', 'c'], the data you want is in the third and fourth columns. If you don't know their names when your script runs, you can do this:
newdf = df[df.columns[2:4]] # Remember, Python is zero-offset! The "third" entry is at slot two.
As EMS points out in his answer, df.ix slices columns a bit more concisely, but the .columns slicing interface might be more natural, because it uses the vanilla one-dimensional Python list indexing/slicing syntax.
Warning: 'index' is a bad name for a DataFrame column. That same label is also used for the real df.index attribute, an Index array. So your column is returned by df['index'] and the real DataFrame index is returned by df.index. An Index is a special kind of Series optimized for lookup of its elements' values. For df.index it's for looking up rows by their label. That df.columns attribute is also a pd.Index array, for looking up columns by their labels.
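To make the distinction concrete (the question's frame has a column literally labelled 'index'):

df['index']   # the column that happens to be labelled 'index'
df.index      # the DataFrame's row-label Index, a different object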
In the latest version of Pandas there is an easy way to do exactly this. Column names (which are strings) can be selected in whatever manner you like.
columns = ['b', 'c']
df1 = pd.DataFrame(df, columns=columns)
In [39]: df
Out[39]:
index a b c
0 1 2 3 4
1 2 3 4 5
In [40]: df1 = df[['b', 'c']]
In [41]: df1
Out[41]:
b c
0 3 4
1 4 5
With Pandas, you can select columns in a few ways.
By column names:
dataframe[['column1', 'column2']]
By position, using iloc with column index numbers:
dataframe.iloc[:, [1, 2]]
With loc, column names can be used like this:
dataframe.loc[:, ['column1', 'column2']]
You can use the pandas.DataFrame.filter method to either filter or reorder columns like this:
df1 = df.filter(['a', 'b'])
This is also very useful when you are chaining methods.
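For instance, a short sketch of such a chain (the follow-up calls are purely illustrative):

df1 = (df.filter(['a', 'b'])         # keep only columns 'a' and 'b'
         .rename(columns=str.upper)  # illustrative next step in the chain
         .reset_index(drop=True))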
You could provide a list of columns to be dropped and get back the DataFrame with only the columns you need, using the drop() function on a Pandas DataFrame. For example,
colsToDrop = ['a']
df.drop(colsToDrop, axis=1)
would return a DataFrame with just the columns b and c.
The drop method is documented here.
I found this method to be very useful:
# iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]
More details can be found here.
Starting with 0.21.0, using .loc or [] with a list with one or more missing labels is deprecated in favor of .reindex. So, the answer to your question is:
df1 = df.reindex(columns=['b','c'])
In prior versions, using .loc[list-of-labels] would work as long as at least one of the keys was found (otherwise it would raise a KeyError). This behavior is deprecated and now shows a warning message. The recommended alternative is to use .reindex().
Read more at Indexing and Selecting Data.
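A quick sketch of the difference in behaviour (assuming df has columns 'a', 'b' and 'c', but no column 'x'):

df.reindex(columns=['b', 'x'])   # returns column 'b' plus an all-NaN column 'x'
# df[['b', 'x']]                 # raises KeyError on current pandas versions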
You can use Pandas.
I create the DataFrame:
import pandas as pd
df = pd.DataFrame([[1, 2,5], [5,4, 5], [7,7, 8], [7,6,9]],
index=['Jane', 'Peter','Alex','Ann'],
columns=['Test_1', 'Test_2', 'Test_3'])
The DataFrame:
Test_1 Test_2 Test_3
Jane 1 2 5
Peter 5 4 5
Alex 7 7 8
Ann 7 6 9
To select one or more columns by name:
df[['Test_1', 'Test_3']]
Test_1 Test_3
Jane 1 5
Peter 5 5
Alex 7 8
Ann 7 9
You can also use:
df.Test_2
And you get column Test_2:
Jane 2
Peter 4
Alex 7
Ann 6
You can also select columns and rows using .loc. This is called "slicing". Notice that I take from column Test_1 to Test_3:
df.loc[:, 'Test_1':'Test_3']
The "Slice" is:
Test_1 Test_2 Test_3
Jane 1 2 5
Peter 5 4 5
Alex 7 7 8
Ann 7 6 9
And if you just want Peter and Ann from columns Test_1 and Test_3:
df.loc[['Peter', 'Ann'], ['Test_1', 'Test_3']]
You get:
Test_1 Test_3
Peter 5 5
Ann 7 9
If you want to get one element by row index and column name, you can do it just like df['b'][0]. It is as simple as you can imagine.
Or you can use df.ix[0,'b'] - mixed usage of index and label.
Note: Since v0.20, ix has been deprecated in favour of loc / iloc.
df[['a', 'b']]             # Select all rows of columns 'a' and 'b'
df.loc[0:10, ['a', 'b']]   # Rows with labels 0 to 10, columns 'a' and 'b'
df.loc[0:10, 'a':'b']      # Rows with labels 0 to 10, columns 'a' through 'b'
df.iloc[0:10, 3:5]         # Rows 0 to 10 and columns 3 to 5, by position
df.iloc[3, 3:5]            # Row 3, columns 3 to 5, by position
Try to use pandas.DataFrame.get (see the documentation):
import pandas as pd
import numpy as np
dates = pd.date_range('20200102', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df.get(['A', 'C'])
One different and easy approach: iterating over rows.
Using iterrows:
df1 = pd.DataFrame()  # Creating an empty dataframe
for index, i in df.iterrows():
    df1.loc[index, 'A'] = df.loc[index, 'A']
    df1.loc[index, 'B'] = df.loc[index, 'B']
df1.head()
The different approaches discussed in the previous answers are based on the assumption that either the user knows column indices to drop or subset on, or the user wishes to subset a dataframe using a range of columns (for instance between 'C' : 'E').
pandas.DataFrame.drop() is certainly an option to subset data based on a list of columns defined by the user (though you have to be cautious to always work on a copy of the dataframe, and the inplace parameter should not be set to True!!).
Another option is to use df.columns.difference(), which does a set difference on the column names and returns an Index containing the desired columns. Here is the solution:
df = pd.DataFrame([[2,3,4], [3,4,5]], columns=['a','b','c'], index=[1,2])
columns_for_differencing = ['a']
df1 = df.copy()[df.columns.difference(columns_for_differencing)]
print(df1)
The output would be:
b c
1 3 4
2 4 5
You can also use df.pop():
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
... ('parrot', 'bird', 24.0),
... ('lion', 'mammal', 80.5),
... ('monkey', 'mammal', np.nan)],
... columns=('name', 'class', 'max_speed'))
>>> df
name class max_speed
0 falcon bird 389.0
1 parrot bird 24.0
2 lion mammal 80.5
3 monkey mammal NaN
>>> df.pop('class')
0 bird
1 bird
2 mammal
3 mammal
Name: class, dtype: object
>>> df
name max_speed
0 falcon 389.0
1 parrot 24.0
2 lion 80.5
3 monkey NaN
Please use df.pop(c), where c is the name of the column you want to remove and return.
I've seen several answers on that, but one remained unclear to me. How would you select those columns of interest?
The answer to that is that if you have them gathered in a list, you can just reference the columns using the list.
Example
print(extracted_features.shape)
print(extracted_features)
(63,)
['f000004' 'f000005' 'f000006' 'f000014' 'f000039' 'f000040' 'f000043'
'f000047' 'f000048' 'f000049' 'f000050' 'f000051' 'f000052' 'f000053'
'f000054' 'f000055' 'f000056' 'f000057' 'f000058' 'f000059' 'f000060'
'f000061' 'f000062' 'f000063' 'f000064' 'f000065' 'f000066' 'f000067'
'f000068' 'f000069' 'f000070' 'f000071' 'f000072' 'f000073' 'f000074'
'f000075' 'f000076' 'f000077' 'f000078' 'f000079' 'f000080' 'f000081'
'f000082' 'f000083' 'f000084' 'f000085' 'f000086' 'f000087' 'f000088'
'f000089' 'f000090' 'f000091' 'f000092' 'f000093' 'f000094' 'f000095'
'f000096' 'f000097' 'f000098' 'f000099' 'f000100' 'f000101' 'f000103']
Here extracted_features is a list/NumPy array specifying 63 columns. The original dataset has 103 columns, and I would like to extract exactly those 63, so I would use
dataset[extracted_features]
And you will end up with the dataset restricted to those 63 columns.
This is something you would use quite often in machine learning (more specifically, in feature selection). I would like to discuss other ways too, but I think that has already been covered by other Stack Overflow users.
To exclude some columns you can drop them in the column index. For example:
A B C D
0 1 10 100 1000
1 2 20 200 2000
Select all except two:
df[df.columns.drop(['B', 'D'])]
Output:
A C
0 1 100
1 2 200
You can also use the method truncate to select middle columns:
df.truncate(before='B', after='C', axis=1)
Output:
B C
0 10 100
1 20 200
To select multiple columns, extract and view them as follows: df is the previously named data frame. Create a new data frame df1 and select the columns A to D that you want to extract and view.
df1 = pd.DataFrame(data_frame, columns=['Column A', 'Column B', 'Column C', 'Column D'])
df1
All required columns will show up!
Just use this function:
def get_slize(dataframe, start_row, end_row, start_col, end_col):
    assert len(dataframe) > end_row and start_row >= 0
    assert len(dataframe.columns) > end_col and start_col >= 0
    list_of_indexes = list(dataframe.columns)[start_col:end_col]
    ans = dataframe.iloc[start_row:end_row][list_of_indexes]
    return ans
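A hypothetical call with the two-row example frame from the question (note that the asserts make both end indices exclusive and require them to stay strictly inside the frame):

df1 = get_slize(df, start_row=0, end_row=1, start_col=1, end_col=3)  # row 0, columns 'a' and 'b'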
I think this is the easiest way to reach your goal.
import pandas as pd
cols = ['a', 'b']
df1 = pd.DataFrame(df, columns=cols)   # select by name
df1 = df.iloc[:, 0:2]                  # or select by position
I am following the sklearn_pandas walkthrough found in the sklearn_pandas README on GitHub and am trying to modify the DateEncoder() custom transformer example to do two additional things:
Convert string type columns to datetime while taking the date format as a parameter
Append the original column names when spitting out the new columns. E.g., if the input column is Date1, then the outputs should be Date1_year, Date1_month, Date1_day.
Here is my attempt (with a rather rudimentary understanding of sklearn pipelines):
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn_pandas import DataFrameMapper
class DateEncoder(TransformerMixin):
    '''
    Specify date format using python strftime formats
    '''
    def __init__(self, date_format='%Y-%m-%d'):
        self.date_format = date_format

    def fit(self, X, y=None):
        self.dt = pd.to_datetime(X, format=self.date_format)
        return self

    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)

data = pd.DataFrame({'dates1': ['2001-12-20', '2002-10-21', '2003-08-22', '2004-08-23',
                                '2004-07-20', '2007-12-21', '2006-12-22', '2003-04-23'],
                     'dates2': ['2012-12-20', '2009-10-21', '2016-08-22', '2017-08-23',
                                '2014-07-20', '2011-12-21', '2014-12-22', '2015-04-23']})

DATE_COLS = ['dates1', 'dates2']
Mapper = DataFrameMapper([(i, DateEncoder(date_format='%Y-%m-%d')) for i in DATE_COLS], input_df=True, df_out=True)
test = Mapper.fit_transform(data)
But on runtime, I get the following error:
AttributeError: Can only use .dt accessor with datetimelike values
Why am I getting this error and how to fix it?
Also, any help with renaming the new columns as mentioned above using the original column names (Date1_year, Date1_month, Date1_day) would be greatly appreciated!
I know this is late, but if you're still interested in a way to do this while renaming the columns with the custom transformer...
I used the approach of adding a get_feature_names method to the custom transformer, inside a pipeline with the ColumnTransformer (overview). You can then use the .named_steps attribute to access the pipeline's steps, call get_feature_names, and retrieve column_names, which holds the custom column names to be used. This way you can retrieve column names similarly to the approach in this SO post.
I had to run this with a pipeline because when I attempted to do it as a standalone custom transformer it went badly wrong (so I won't post that incomplete attempt here) - though you may have better luck.
Here is the raw code showing the pipeline:
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
data2 = pd.DataFrame(
    {"dates1": ["2001-12-20", "2002-10-21", "2003-08-22", "2004-08-23",
                "2004-07-20", "2007-12-21", "2006-12-22", "2003-04-23"],
     "dates2": ["2012-12-20", "2009-10-21", "2016-08-22", "2017-08-23",
                "2014-07-20", "2011-12-21", "2014-12-22", "2015-04-23"]})

DATE_COLS = ['dates1', 'dates2']

pipeline = Pipeline([
    ('transform', ColumnTransformer([
        ('datetimes', Pipeline([
            ('formatter', DateFormatter()),
            ('encoder', DateEncoder()),
        ]), DATE_COLS),
    ])),
])
data3 = pd.DataFrame(pipeline.fit_transform(data2))
data3_names = (
pipeline.named_steps['transform']
.named_transformers_['datetimes']
.named_steps['encoder']
.get_feature_names()
)
data3.columns = data3_names
print(data2)
print(data3)
The output is
dates1 dates2
0 2001-12-20 2012-12-20
1 2002-10-21 2009-10-21
2 2003-08-22 2016-08-22
3 2004-08-23 2017-08-23
4 2004-07-20 2014-07-20
5 2007-12-21 2011-12-21
6 2006-12-22 2014-12-22
7 2003-04-23 2015-04-23
dates1_year dates1_month dates1_day dates2_year dates2_month dates2_day
0 2001 12 20 2012 12 20
1 2002 10 21 2009 10 21
2 2003 8 22 2016 8 22
3 2004 8 23 2017 8 23
4 2004 7 20 2014 7 20
5 2007 12 21 2011 12 21
6 2006 12 22 2014 12 22
7 2003 4 23 2015 4 23
The custom transformers are here (skipping DateFormatter, since it is identical to yours)
class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dfs = []
        self.column_names = []
        for column in X:
            dt = X[column].dt
            # Assign custom column names
            newcolumnnames = [column + '_' + col for col in ['year', 'month', 'day']]
            df_dt = pd.concat([dt.year, dt.month, dt.day], axis=1)
            # Append DF to list to assemble list of DFs
            dfs.append(df_dt)
            # Append single DF's column names to blank list
            self.column_names.append(newcolumnnames)
        # Horizontally concatenate list of DFs
        dfs_dt = pd.concat(dfs, axis=1)
        return dfs_dt

    def get_feature_names(self):
        # Flatten list of column names
        self.column_names = [c for sublist in self.column_names for c in sublist]
        return self.column_names
Rationale for DateEncoder
The loop over pandas columns allows the datetime attributes to be extracted from each datetime column. In the same loop, the custom column names are constructed. These are then added to a blank list under self.column_names which is returned in the method get_feature_names (though it has to be flattened before assigning to a dataframe).
For this particular case, you could potentially skip sklearn_pandas.
Details
sklearn = 0.20.0
pandas = 0.23.4
numpy = 1.15.2
python = 2.7.15rc1
I was able to break the data format conversion and date splitter into two separate transformers and it worked.
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn_pandas import DataFrameMapper
data2 = pd.DataFrame({'dates1': ['2001-12-20', '2002-10-21', '2003-08-22', '2004-08-23',
                                 '2004-07-20', '2007-12-21', '2006-12-22', '2003-04-23'],
                      'dates2': ['2012-12-20', '2009-10-21', '2016-08-22', '2017-08-23',
                                 '2014-07-20', '2011-12-21', '2014-12-22', '2015-04-23']})

class DateFormatter(TransformerMixin):
    def fit(self, X, y=None):
        # stateless transformer
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xdate = X.apply(pd.to_datetime)
        return Xdate

class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)

DATE_COLS = ['dates1', 'dates2']

datemult = DataFrameMapper(
    [(i, [DateFormatter(), DateEncoder()]) for i in DATE_COLS],
    input_df=True, df_out=True)

df = datemult.fit_transform(data2)
This code outputs:
Out[4]:
dates1_0 dates1_1 dates1_2 dates2_0 dates2_1 dates2_2
0 2001 12 20 2012 12 20
1 2002 10 21 2009 10 21
2 2003 8 22 2016 8 22
3 2004 8 23 2017 8 23
4 2004 7 20 2014 7 20
5 2007 12 21 2011 12 21
6 2006 12 22 2014 12 22
7 2003 4 23 2015 4 23
However, I am still looking for a way to rename the new columns while applying the DateEncoder() transformer, e.g. dates1_0 --> dates1_year and dates2_2 --> dates2_day. I'd be happy to select that as the solution.
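Since the mapper already returns a plain DataFrame, one low-tech workaround is to rename after fit_transform rather than inside the transformer (a sketch only; suffix_map is just an illustrative name):

suffix_map = {'0': 'year', '1': 'month', '2': 'day'}  # map numeric suffix back to attribute name
df.columns = [name.rsplit('_', 1)[0] + '_' + suffix_map[name.rsplit('_', 1)[1]]
              for name in df.columns]
# dates1_0 -> dates1_year, dates2_2 -> dates2_day, etc.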