Problems with list creation and the use of .unique() - pandas

Here is the exercise: we have to use the boxplot() method of matplotlib.pyplot. If I understood correctly, in this exercise we create a list of lists and then display those lists as a box-and-whisker plot.
For me, the difficulty is understanding how the list of lists is created. I tried to build one myself with random numbers, but without success, because the commands from the exercise do not work on my data.
The given solution is:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df['Month'] = df.Month.apply(lambda x: x[3:])
print(df.head(3))
l = list()
print("shape = ", df.Month.shape)
for i in df.Month.unique():
    l.append(df[df['Month'] == i]['Turnover'])
print(l[0:3])
plt.boxplot(l);
plt.xticks(range(1, 13), df.Month.unique());
plt.show();
That produces:
Month Product1 Product2 Returns Turnover Month
0 01-Jan 266 355 0 25285 Jan
1 01-Feb 145 204 6 14255 Feb
2 01-March 183 196 11 15225 March
shape = (36,)
[0 25285
12 15700
24 17490
Name: Turnover, dtype: int64, 1 14255
13 19660
25 29665
Name: Turnover, dtype: int64, 2 15225
14 15360
26 22815
Name: Turnover, dtype: int64]
I don't understand by what mechanism the loop creates a succession of
tables.
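If I spell out one iteration by hand on a tiny made-up DataFrame (my own sketch, not the exercise data), I think this is what each pass of the loop appends to the list:
import pandas as pd
# hypothetical miniature version of the exercise data
df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Jan', 'Feb'],
                   'Turnover': [25285, 14255, 15700, 19660]})
# one iteration of the loop, for i = 'Jan'
mask = df['Month'] == 'Jan'         # boolean Series: True for the 'Jan' rows
subset = df[mask]['Turnover']       # the Turnover values of those rows, as a Series
print(subset)
# 0    25285
# 2    15700
# Name: Turnover, dtype: int64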
I tried to recreate an example to do the same thing with numbers.
k = list()
nbre = np.random.choice(11,40)
NBR = pd.DataFrame(nbre)
print("shape =",NBR.shape)
for n in NBR.unique():
    k.append(n)
print(k)
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
Why does the following work?
k = list()
nbre = np.random.choice(11,40)
for n in pd.unique(nbre):
    k.append(n)
print(k)
On the other hand,
k = list()
nbr = np.random.choice(11,40)
nbr = pd.DataFrame(nbr)
#print(nbr)
for n in nbr.unique():
    k.append(n)
print(k)
Does not work. I was hoping to create a DataFrame of random numbers from which I could then build a list of lists using .unique(), but it fails.
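My guess (untested on my side) is that .unique() only exists on a Series, not on the whole DataFrame, so it would have to be called on a single column, something like:
k = list()
nbr = pd.DataFrame(np.random.choice(11, 40), columns=['value'])
for n in nbr['value'].unique():                 # .unique() on one column (a Series)
    k.append(nbr[nbr['value'] == n]['value'])   # one Series per unique value, as in the exercise
print(k[0:3])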
Regards, Atapalou

Related

How can I create a bar chart with ranges of values

I have a data frame with, among other things, a user id and an age. I need to produce a bar chart of the number of users that fall within ranges of ages. What's throwing me is that there is really no upper bound for the age range. The specific ranges I'm trying to plot are age <= 25, 25 < age <= 75 and age > 75.
I'm relatively new to Pandas and plotting, and I'm sure this is a simple thing for more experienced data wranglers. Any assistance would be greatly appreciated.
You'll need to use the pandas.cut method to do this, and you can supply custom bins and labels!
from pandas import DataFrame, cut
from numpy.random import default_rng
from numpy import arange
from matplotlib.pyplot import show
# Make some dummy data
rng = default_rng(0)
df = DataFrame({'id': arange(100), 'age': rng.normal(50, scale=20, size=100).clip(min=0)})
print(df.head())
id age
0 0 52.514604
1 1 47.357903
2 2 62.808453
3 3 52.098002
4 4 39.286613
# Use pandas.cut to bin all of the ages & assign
# these bins to a new column to demonstrate how it works
## bins are (0, 25], (25, 75], (75, inf] (pandas.cut is right-inclusive by default)
df['bin'] = cut(df['age'], [0, 25, 75, float('inf')], labels=['under 25', '25 up to 75', '75 or older'])
print(df.head())
id age bin
0 0 52.514604 25 up to 75
1 1 47.357903 25 up to 75
2 2 62.808453 25 up to 75
3 3 52.098002 25 up to 75
4 4 39.286613 25 up to 75
# Get the value_counts of those bins and plot!
df['bin'].value_counts().sort_index().plot.bar()
show()
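A quick check that the default right-inclusive bins match the requested ranges (age <= 25, 25 < age <= 75, age > 75); this is my own sketch, separate from the code above:
from pandas import Series, cut
edges = Series([25.0, 25.5, 75.0, 75.5])
print(cut(edges, [0, 25, 75, float('inf')],
          labels=['under 25', '25 up to 75', '75 or older']))
# 25.0 -> under 25, 25.5 -> 25 up to 75, 75.0 -> 25 up to 75, 75.5 -> 75 or older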

Stratified Sampling with different sizes

I am trying to create a function for stratified sampling which takes in a dataframe created using the faker module, along with strata, sample size and a random seed. For the sample size, I want the number of samples in each stratum to vary based on user input. This is my code for creating the data:
import pandas as pd
import numpy as np
import random as rn  # generating random numbers
from faker import Faker

fake = Faker()
frame_fake = pd.DataFrame([{"region": fake.random_number(1, fix_len=True),
                            "district": fake.random_number(2, fix_len=True),
                            "enum_area": fake.random_number(5, fix_len=True),
                            "hhs": fake.random_number(3),
                            "pop": fake.random_number(4),
                            "area": fake.random_number(1)} for x in range(100)])
# check for and remove duplicates from enum_area (should be unique)
# before any further analysis
mask = frame_fake.duplicated('enum_area', keep='last')
duplicates = frame_fake[mask]
# print(duplicates)
# drop all except last
frame_fake = frame_fake.drop_duplicates('enum_area', keep='last').sort_values(by='enum_area', ascending=True)
# reset index to have them sequentially after sorting by enum_area and
# drop the old index column
frame_fake = frame_fake.reset_index().drop('index', axis=1)
frame_fake
This is the code for sampling:
def stratified_custom(data, strata, sample_size, seed=None):
    # for this part, we sample 5 enum areas in each strata/region
    # we groupby strata and use the transform method with 'count' parameter
    # to get strata sizes
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    # map input sample size to each strata
    data['strat_sample_size'] = data[strata].map(sample_size)
    # groupby strata, get sample size per stratum, cast to int and reset index
    smp_size = data.groupby(strata)['strat_sample_size'].unique().astype(int).reset_index()
    # groupby strata and select sample per stratum based on the sample size
    # for that strata
    sample = (data.groupby(strata, group_keys=False)
                  .apply(lambda x: x.sample(smp_size, random_state=seed)))
    # probability of inclusion
    sample['inclusion_prob'] = sample['strat_sample_size'] / sample['strat_size']
    return sample

s_size = {1: 7, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5, 9: 8}  # pass in strata and sample sizes as a dict (key, value)
(stratified_custom(data=frame_fake, strata='region', sample_size=s_size,
                   seed=99).sort_values(by=['region', 'enum_area'], ascending=True))
I however receive this error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
I can't figure out what this error is talking about. Any help is appreciated.
After much research, I stumbled upon this post (https://stackoverflow.com/a/58794577/14198137) and implemented it in my code so that the same function can sample not only with varying sample sizes per stratum but also with a fixed one. Here is my code for the data:
import pandas as pd
import numpy as np
import random as rn
from faker import Faker

Faker.seed(99)
fake = Faker()
frame_fake = pd.DataFrame([{"region": fake.random_number(1, fix_len=True),
                            "district": fake.random_number(2, fix_len=True),
                            "enum_area": fake.random_number(5, fix_len=True),
                            "hhs": fake.random_number(3),
                            "pop": fake.random_number(4),
                            "area": rn.randint(1, 2)} for x in range(100)])
frame_fake = frame_fake.drop_duplicates('enum_area', keep='last').sort_values(by='enum_area', ascending=True)
frame_fake = frame_fake.reset_index().drop('index', axis=1)
Here is the updated code for stratified sampling which now works.
def stratified_custom(data, strata, sample_size, seed=None):
    data = data.copy()
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    try:
        # sample_size is a dict keyed by stratum: map a per-stratum size to each row
        data['strat_sample_size'] = data[strata].map(sample_size)
        smp_size = data.set_index(strata)['strat_sample_size'].to_dict()
        strat2_sample = (data.groupby(strata, group_keys=False)
                             .apply(lambda x: x.sample(smp_size[x.name], random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
    except:
        # sample_size is a single int: use the same size for every stratum
        data['strat_sample_size'] = sample_size
        strat2_sample = (data.groupby(strata, group_keys=False)
                             .apply(lambda x: x.sample(sample_size, random_state=seed)))
        strat2_sample['inclusion_prob'] = strat2_sample['strat_sample_size'] / strat2_sample['strat_size']
        return strat2_sample
s_size={1:3,2:9,3:5,4:5,5:5,6:5,7:5,8:5,9:8}
variablesize = (stratified_custom(data=frame_fake,strata='region',sample_size=s_size, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
variablesize
fixedsize = (stratified_custom(data=frame_fake,strata='region',sample_size=3, seed=99).sort_values(by=['region','enum_area'],ascending=True)).head()
fixedsize
The output for the variable sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
0 2 65 10566 ... 10 9 0.9
15 2 22 25560 ... 10 9 0.9
The output for the fixed sample size:
region district enum_area ... strat_size strat_sample_size inclusion_prob
5 1 60 14737 ... 5 3 0.6
26 1 42 34017 ... 5 3 0.6
68 1 31 72092 ... 5 3 0.6
38 2 74 48408 ... 10 3 0.3
43 2 15 56365 ... 10 3 0.3
I was however wondering if there is a better way of achieving this?
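For comparison, one variant I have been toying with (only a sketch; normalizing a plain int into a per-stratum dict up front is my own choice, not taken from the post above) avoids the try/except entirely:
def stratified_custom2(data, strata, sample_size, seed=None):
    data = data.copy()
    # accept either a single int or a dict keyed by stratum
    if not isinstance(sample_size, dict):
        sample_size = {s: sample_size for s in data[strata].unique()}
    data['strat_size'] = data.groupby(strata)[strata].transform('count')
    data['strat_sample_size'] = data[strata].map(sample_size)
    sample = (data.groupby(strata, group_keys=False)
                  .apply(lambda g: g.sample(sample_size[g.name], random_state=seed)))
    sample['inclusion_prob'] = sample['strat_sample_size'] / sample['strat_size']
    return sample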

Counter calling in pandas?

I want to use the Counter values inside a pandas data frame.
Effort so far:
from __future__ import unicode_literals
import spacy,en_core_web_sm
from collections import Counter
import pandas as pd
nlp = en_core_web_sm.load()
c = Counter(([token.pos_ for token in nlp('The cat sat on the mat.')]))
sbase = sum(c.values())
for el, cnt in c.items():
    el, '{0:2.2f}%'.format((100.0 * cnt) / sbase)
df = pd.DataFrame.from_dict(c, orient='index').reset_index()
print df
Current Output:
index 0
0 NOUN 2
1 VERB 1
2 DET 2
3 ADP 1
4 PUNCT 1
Expected Output:
The below inside dataframe:
(u'NOUN', u'28.57%')
(u'VERB', u'14.29%')
(u'DET', u'28.57%')
(u'ADP', u'14.29%')
(u'PUNCT', u'14.29%')
How can I get el and cnt into the data frame?
This is a follow-up to an earlier question where I wanted the percentage POS distribution listed:
Percentage Count Verb, Noun using Spacy?
My understanding was that I need to put el and cnt in place of c below:
df = pd.DataFrame.from_dict(c, orient='index').reset_index()
I can only fix your output, since I do not have the original data:
(df['0']/df['0'].sum()).map("{0:.2%}".format)
Out[827]:
0 28.57%
1 14.29%
2 28.57%
3 14.29%
4 14.29%
Name: 0, dtype: object
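If the goal is to get el and cnt into the data frame directly (rather than post-processing the counts), one way, as a sketch reusing the c and sbase already defined in the question, would be:
rows = [(el, '{0:2.2f}%'.format((100.0 * cnt) / sbase)) for el, cnt in c.items()]
df = pd.DataFrame(rows, columns=['pos', 'percent'])
print(df)
#      pos percent
# 0   NOUN  28.57%
# 1   VERB  14.29%
# ...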

Custom transformer that splits dates into new column

I am following the sklearn_pandas walkthrough found in the sklearn_pandas README on GitHub and am trying to modify the DateEncoder() custom transformer example to do two additional things:
Convert string-type columns to datetime, taking the date format as a parameter
Prepend the original column name when spitting out the new columns, e.g. if the input column is Date1, the outputs are Date1_year, Date1_month, Date1_day.
Here is my attempt (with a rather rudimentary understanding of sklearn pipelines):
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn_pandas import DataFrameMapper

class DateEncoder(TransformerMixin):
    '''
    Specify date format using python strftime formats
    '''
    def __init__(self, date_format='%Y-%m-%d'):
        self.date_format = date_format

    def fit(self, X, y=None):
        self.dt = pd.to_datetime(X, format=self.date_format)
        return self

    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)

data = pd.DataFrame({'dates1': ['2001-12-20', '2002-10-21', '2003-08-22', '2004-08-23',
                                '2004-07-20', '2007-12-21', '2006-12-22', '2003-04-23'],
                     'dates2': ['2012-12-20', '2009-10-21', '2016-08-22', '2017-08-23',
                                '2014-07-20', '2011-12-21', '2014-12-22', '2015-04-23']})
DATE_COLS = ['dates1', 'dates2']
Mapper = DataFrameMapper([(i, DateEncoder(date_format='%Y-%m-%d')) for i in DATE_COLS],
                         input_df=True, df_out=True)
test = Mapper.fit_transform(data)
But on runtime, I get the following error:
AttributeError: Can only use .dt accessor with datetimelike values
Why am I getting this error, and how do I fix it?
Also, any help with renaming the new columns based on the original column names (Date1_year, Date1_month, Date1_day), as mentioned above, would be greatly appreciated!
I know this is late, but if you're still interested in a way to do this while renaming the columns with the custom transformer...
I used the approach of adding a get_feature_names method to the custom transformer, used inside a pipeline with the ColumnTransformer (overview). You can then use the .named_steps attribute to reach the pipeline's encoder step, call get_feature_names on it, and retrieve column_names, which ultimately holds the custom column names to be used. This way you can retrieve column names in a manner similar to this SO post.
I had to run this with a pipeline because when I attempted to do it as a standalone custom transformer it went badly wrong (so I won't post that incomplete attempt here) - though you may have better luck.
Here is the raw code showing the pipeline:
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

data2 = pd.DataFrame(
    {"dates1": ["2001-12-20", "2002-10-21", "2003-08-22", "2004-08-23",
                "2004-07-20", "2007-12-21", "2006-12-22", "2003-04-23"],
     "dates2": ["2012-12-20", "2009-10-21", "2016-08-22", "2017-08-23",
                "2014-07-20", "2011-12-21", "2014-12-22", "2015-04-23"]})
DATE_COLS = ['dates1', 'dates2']

pipeline = Pipeline([
    ('transform', ColumnTransformer([
        ('datetimes', Pipeline([
            ('formatter', DateFormatter()),
            ('encoder', DateEncoder()),
        ]), DATE_COLS),
    ])),
])

data3 = pd.DataFrame(pipeline.fit_transform(data2))
data3_names = (
    pipeline.named_steps['transform']
    .named_transformers_['datetimes']
    .named_steps['encoder']
    .get_feature_names()
)
data3.columns = data3_names
print(data2)
print(data3)
The output is
dates1 dates2
0 2001-12-20 2012-12-20
1 2002-10-21 2009-10-21
2 2003-08-22 2016-08-22
3 2004-08-23 2017-08-23
4 2004-07-20 2014-07-20
5 2007-12-21 2011-12-21
6 2006-12-22 2014-12-22
7 2003-04-23 2015-04-23
dates1_year dates1_month dates1_day dates2_year dates2_month dates2_day
0 2001 12 20 2012 12 20
1 2002 10 21 2009 10 21
2 2003 8 22 2016 8 22
3 2004 8 23 2017 8 23
4 2004 7 20 2014 7 20
5 2007 12 21 2011 12 21
6 2006 12 22 2014 12 22
7 2003 4 23 2015 4 23
The custom transformers are here (skipping DateFormatter, since it is identical to yours)
class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dfs = []
        self.column_names = []
        for column in X:
            dt = X[column].dt
            # Assign custom column names
            newcolumnnames = [column + '_' + col for col in ['year', 'month', 'day']]
            df_dt = pd.concat([dt.year, dt.month, dt.day], axis=1)
            # Append DF to list to assemble list of DFs
            dfs.append(df_dt)
            # Append single DF's column names to blank list
            self.column_names.append(newcolumnnames)
        # Horizontally concatenate list of DFs
        dfs_dt = pd.concat(dfs, axis=1)
        return dfs_dt

    def get_feature_names(self):
        # Flatten list of column names
        self.column_names = [c for sublist in self.column_names for c in sublist]
        return self.column_names
Rationale for DateEncoder
The loop over pandas columns allows the datetime attributes to be extracted from each datetime column. In the same loop, the custom column names are constructed. These are then added to a blank list under self.column_names which is returned in the method get_feature_names (though it has to be flattened before assigning to a dataframe).
For this particular case, you could potentially skip sklearn_pandas.
Details
sklearn = 0.20.0
pandas = 0.23.4
numpy = 1.15.2
python = 2.7.15rc1
I was able to break the data format conversion and date splitter into two separate transformers and it worked.
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn_pandas import DataFrameMapper

data2 = pd.DataFrame({'dates1': ['2001-12-20', '2002-10-21', '2003-08-22', '2004-08-23',
                                 '2004-07-20', '2007-12-21', '2006-12-22', '2003-04-23'],
                      'dates2': ['2012-12-20', '2009-10-21', '2016-08-22', '2017-08-23',
                                 '2014-07-20', '2011-12-21', '2014-12-22', '2015-04-23']})

class DateFormatter(TransformerMixin):
    def fit(self, X, y=None):
        # stateless transformer
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xdate = X.apply(pd.to_datetime)
        return Xdate

class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)

DATE_COLS = ['dates1', 'dates2']
datemult = DataFrameMapper(
    [(i, [DateFormatter(), DateEncoder()]) for i in DATE_COLS],
    input_df=True, df_out=True)
df = datemult.fit_transform(data2)
This code outputs:
Out[4]:
dates1_0 dates1_1 dates1_2 dates2_0 dates2_1 dates2_2
0 2001 12 20 2012 12 20
1 2002 10 21 2009 10 21
2 2003 8 22 2016 8 22
3 2004 8 23 2017 8 23
4 2004 7 20 2014 7 20
5 2007 12 21 2011 12 21
6 2006 12 22 2014 12 22
7 2003 4 23 2015 4 23
However, I am still looking for a way to rename the new columns while applying the DateEncoder() transformer, e.g. dates1_0 --> dates1_year and dates2_2 --> dates2_day. I'd be happy to select that as the solution.
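In the meantime, a post-hoc workaround (just a sketch; it renames the mapper's numeric suffixes after the fact rather than inside the transformer) would be:
# map DataFrameMapper's numeric suffixes back to the datetime attribute names
suffix_to_name = {'0': 'year', '1': 'month', '2': 'day'}
df = df.rename(columns=lambda c: c.rsplit('_', 1)[0] + '_' + suffix_to_name[c.rsplit('_', 1)[1]])
print(df.columns.tolist())
# ['dates1_year', 'dates1_month', 'dates1_day', 'dates2_year', 'dates2_month', 'dates2_day']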

pandas faster series of lists unrolling for one-hot encoding?

I'm reading from a database that has many array-type columns, for which pd.read_sql gives me a dataframe whose columns are dtype=object, containing lists.
I'd like an efficient way to find which rows have arrays containing some element:
s = pd.Series(
[[1,2,3], [1,2], [99], None, [88,2]]
)
print s
..
0 [1, 2, 3]
1 [1, 2]
2 [99]
3 None
4 [88, 2]
I'm building one-hot-encoded feature tables for an ML application, and I'd like to end up with tables like:
contains_1 contains_2 contains_3 contains_88
0 1 ...
1 1
2 0
3 nan
4 0
...
I can unroll a series of arrays like so:
s2 = s.apply(pd.Series).stack()
0 0 1.0
1 2.0
2 3.0
1 0 1.0
1 2.0
2 0 99.0
4 0 88.0
1 2.0
which gets me to the point of being able to find the elements meeting some test:
>>> print s2[(s2==2)].index.get_level_values(0)
Int64Index([0, 1, 4], dtype='int64')
Woot! This step:
s.apply(pd.Series).stack()
produces a great intermediate data structure (s2) that's fast to iterate over for each category. However, the apply step is jaw-droppingly slow (many tens of seconds for a single column of 500k rows with lists of tens of items), and I have many columns.
Update: It seems likely that having the data in a series of lists to begin with is itself quite slow. Performing the unroll on the SQL side seems tricky (I have many columns that I want to unroll). Is there a way to pull the array data into a better structure?
import numpy as np
import pandas as pd
import cytoolz

s0 = s.dropna()
v = s0.values.tolist()                                   # list of lists
i = s0.index.values                                      # original row labels
l = [len(x) for x in v]                                  # length of each inner list
c = list(cytoolz.concat(v))                              # all elements, flattened (materialized so len() works)
n = np.append(0, np.array(l[:-1])).cumsum().repeat(l)    # start offset of each row, repeated per element
k = np.arange(len(c)) - n                                # position of each element within its original list
s1 = pd.Series(c, [i.repeat(l), k])                      # MultiIndex: (original row, position)
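To get from the unrolled structure (s1 above) to the indicator table sketched in the question, one option (my sketch; pd.get_dummies here names the columns from the raw element values with a 'contains' prefix) is:
hot = pd.get_dummies(s1.astype(int), prefix='contains')  # one indicator row per list element
hot = hot.groupby(level=0).max()    # collapse back to one row per original row
hot = hot.reindex(s.index)          # rows that were None come back as NaN
print(hot)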
UPDATE: What worked for me...
def unroll(s):
    s = s.dropna()
    v = s.values.tolist()
    c = pd.Series(x for x in cytoolz.concat(v))  # 16 seconds!
    i = s.index
    lens = np.array([len(x) for x in v])  # s.apply(len) is slower
    n = np.append(0, lens[:-1]).cumsum().repeat(lens)
    k = np.arange(sum(lens)) - n
    s = pd.Series(c)
    s.index = [i.repeat(lens), k]
    s = s.dropna()
    return s
It should be possible to replace:
s = pd.Series(c)
s.index = [i.repeat(lens), k]
with:
s = pd.Series(c, index=[i.repeat(lens), k])
But this doesn't work (even though it is said to be OK here).
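A possible explanation (my reading of the pandas Series constructor, not verified against the versions used above): when the data argument is itself a Series, a supplied index is used to align (reindex) the data rather than to relabel it, so the new MultiIndex labels match nothing and every value becomes NaN. Building from the underlying array should behave like the two-step version:
s = pd.Series(c.values, index=[i.repeat(lens), k])  # relabels instead of aligning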