Consider I have the following dataframe:
Survived Pclass Sex Age Fare
0 0 3 male 22.0 7.2500
1 1 1 female 38.0 71.2833
2 1 3 female 26.0 7.9250
3 1 1 female 35.0 53.1000
4 0 3 male 35.0 8.0500
I used the get_dummies() function to create dummy variables. The code and output are as follows:
one_hot = pd.get_dummies(dataset, columns=['Sex'])
This will return:
Survived Pclass Age Fare Sex_female Sex_male
0 0 3 22 7.2500 0 1
1 1 1 38 71.2833 1 0
2 1 3 26 7.9250 1 0
3 1 1 35 53.1000 1 0
4 0 3 35 8.0500 0 1
What I would like to have is a single column for Sex having the values 0 or 1 instead of 2 columns.
Interestingly, when I used get_dummies() on a different dataframe, it worked just like I wanted.
For the following dataframe:
Category Message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup final...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
With the code:
one_hot = pd.get_dummies(dataset, columns = ['Category'])
It returns:
Message ... Category_spam
0 Go until jurong point, crazy.. Available only ... ... 0
1 Ok lar... Joking wif u oni... ... 0
2 Free entry in 2 a wkly comp to win FA Cup fina... ... 1
3 U dun say so early hor... U c already then say... ... 0
4 Nah I don't think he goes to usf, he lives aro... ... 0
Why does get_dummies() work differently on these two dataframes?
How can I make sure I get the second output every time?
get_dummies() isn't actually behaving differently: in your second output the Category_ham column exists as well, it is just hidden behind the ... column truncation. get_dummies() always creates one indicator column per category. If what you want is a single 0/1 column, here are a few ways to do it:
# using sklearn's LabelEncoder (note: it assigns codes alphabetically, so female=0, male=1)
from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()
df['Sex_encoded'] = lbl.fit_transform(df['Sex'])
# using only pandas, with an explicit mapping (male=0, female=1, as in the output below)
df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})
Survived Pclass Sex Age Fare Sex_encoded
0 0 3 male 22.0 7.2500 0
1 1 1 female 38.0 71.2833 1
2 1 3 female 26.0 7.9250 1
3 1 1 female 35.0 53.1000 1
4 0 3 male 35.0 8.0500 0
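If you'd rather stay with get_dummies(), it also takes a drop_first parameter that keeps a single indicator column per variable. Categories are dropped in sorted order, so the surviving column here is Sex_male (1 = male, 0 = female), the opposite coding of the map above:
# drops the first category ('female'), keeping only Sex_male
one_hot = pd.get_dummies(dataset, columns=['Sex'], drop_first=True)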
I have created a set of 4 clusters using k-means, but I'd like to relabel the clusters in ascending order so that the analysis is output in a predictable way every time the script is executed.
The resulting df with the clusters is something like:
customer_id recency frequency monetary_value recency_cluster \
0 44792907512250289 21 1 43.76 0
1 4277896431638207047 443 1 73.13 1
2 1509512561185834874 559 1 37.50 1
3 -8259919882769629944 437 1 34.38 1
4 8269311313560571571 133 2 324.78 0
5 6521698907264712834 311 1 6.32 3
6 9102795320443090762 340 1 174.99 3
7 6203217338400763719 39 1 77.50 0
8 7633758030510673403 625 1 95.26 2
9 -2417721548925747504 644 1 76.84 2
frequency_cluster monetary_value_cluster
0 1 0
1 1 0
2 1 0
3 1 0
4 0 1
5 1 0
6 1 1
7 1 0
8 1 0
9 1 0
The recency clusters are not ordered by the data. For example, I'd like recency cluster 0 to be the one with the minimum value = 1.0 (currently recency cluster 1).
recency_cluster count mean std min 25% 50% 75% max
0 17609.0 700.900960 56.895995 609.0 651.0 697.0 749.0 807.0
1 16458.0 102.692672 62.952229 1.0 47.0 101.0 159.0 210.0
2 17166.0 515.971746 56.592490 418.0 466.0 517.0 567.0 608.0
3 18634.0 317.599227 58.852980 211.0 269.0 319.0 367.0 416.0
Using something like:
rfm_df.groupby('recency_cluster')['recency'].transform('min')
will return a column with the min value of each cluster:
0 1
1 418
2 418
3 418
4 1
...
69862 609
69863 1
69864 211
69865 609
69866 211
I guess there's got to be a way to convert these categories [1, 211, 418, 609] into [0, 1, 2, 3] in order to get the desired result, but I can't come up with a solution.
Or maybe there's a better approach to the problem.
Edit: I did this and I think it's working:
rfm_df['recency_normalized_cluster'] = rfm_df.groupby('recency_cluster')['recency'].transform('min').astype('category').cat.codes
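That works because .astype('category') sorts the distinct minimum values and .cat.codes numbers them in that order, so the cluster with the smallest minimum gets code 0. A self-contained sketch of the same idea on toy data (the alt column is an illustrative alternative, not from the original):
import pandas as pd

# toy data with arbitrary cluster labels
rfm_df = pd.DataFrame({
    'recency':         [21, 443, 559, 133, 625],
    'recency_cluster': [0,  1,   1,   0,   2],
})

# per-row minimum recency of the row's cluster
cluster_min = rfm_df.groupby('recency_cluster')['recency'].transform('min')

# category codes follow the sorted order of the distinct minima,
# so the cluster with the smallest minimum becomes 0
rfm_df['recency_normalized_cluster'] = cluster_min.astype('category').cat.codes

# equivalent alternative: a dense rank shifted to start at 0
rfm_df['alt'] = cluster_min.rank(method='dense').astype(int) - 1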
I have a dataframe
text label title version
0 Alice is in Seattle SA 1
1 Alice is in wonderland. Portlang SA 2
2 Mallory has done the task. Gotland sometitle 4
3 Mallory has done the task. california sometitle 4
4 Mallory has california sometitle 2
5 Bob is final. Portland some different title 3
6 Mallory has done Portland sometitle 3
The final result I want is the highest-version text for each title, together with the corresponding labels, but the labels should be pivoted out as columns.
Here is the final result:
text Seattle Portlang Gotland california Portland title
0 Alice is in wonderland. 0 1 0 0 0 SA
1 Mallory has done the task. 0 0 1 1 0 sometitle
2 Bob is final. 0 0 0 0 1 some different title
Thanks in advance,
Use pivot_table. First mask out every row that doesn't have the highest version for its title, then pivot your dataframe:
out = (
df.assign(dummy=1)
.mask(df.groupby('title')['version'].rank(method='dense', ascending=False) > 1)
.pivot_table('dummy', ['title', 'text'], 'label', fill_value=0)
.reset_index()
.rename_axis(columns=None)
)
Output:
>>> out
title text Gotland Portland Portlang california
0 SA Alice is in wonderland. 0 0 1 0
1 some different title Bob is final. 0 1 0 0
2 sometitle Mallory has done the task. 1 0 0 1
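The key step is mask(): dense-ranking version within each title gives the highest version rank 1, every other row is replaced by NaN, and pivot_table simply ignores NaN rows. Broken out (a sketch; the intermediate names are illustrative):
# rank versions within each title; the highest version gets rank 1
rank = df.groupby('title')['version'].rank(method='dense', ascending=False)

# rows that are not the latest version become all-NaN and are then
# dropped by pivot_table; fill_value=0 fills the holes left by labels
# that never occur for a given (title, text) pair
latest_only = df.assign(dummy=1).mask(rank > 1)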
I have a df
name cars
john honda,kia
tom honda,kia,nissan
jack toyota
johnny honda,kia
tommy honda,kia,nissan
jacky toyota
What is the best way, using pandas, to add a column per car to the existing df, with 1 if the car is present and 0 otherwise, so that it looks like this:
name cars honda kia nissan toyota
john honda,kia 1 1 0 0
tom honda,kia,nissan 1 1 1 0
jack toyota 0 0 0 1
johnny honda,kia 1 1 0 0
tommy honda,kia,nissan 1 1 1 0
jacky toyota 0 0 0 1
I tried using np.where with multiple conditions as described here, but I don't think it's the right approach.
That’s exactly what pd.Series.str.get_dummies does; just join its result to your dataframe without the cars column:
>>> df.drop(columns=['cars']).join(df['cars'].str.get_dummies(sep=','))
name honda kia nissan toyota
0 john 1 1 0 0
1 tom 1 1 1 0
2 jack 0 0 0 1
3 johnny 1 1 0 0
4 tommy 1 1 1 0
5 jacky 0 0 0 1
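Note that the expected output in the question keeps the cars column; in that case, skip the drop and join the dummies straight onto df:
>>> df.join(df['cars'].str.get_dummies(sep=','))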
I have the below dataframe, which contains sample values:
df = pd.DataFrame([["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]], columns=["city_1", "city_2", "id"])
city_1 city_2 id
London Cambridge 20
Cambridge London 10
Liverpool London 30
I need the output dataframe below, which is built by combining the 2 city columns and then one-hot encoding them:
id London Cambridge Liverpool
20 1 1 0
10 1 1 0
30 1 0 1
Currently, I am using the below code, which works on one column at a time. Could you please advise if there is a pythonic way to get the above output?
output_df = pd.get_dummies(df, columns=['city_1', 'city_2'])
which results in
id city_1_Cambridge city_1_London ... and so on
You can pass the prefix_sep and prefix parameters to get_dummies and then use max if you want only 1 or 0 values (dummy/indicator columns), or sum if you need to count the 1 values:
output_df = (pd.get_dummies(df, columns=['city_1', 'city_2'], prefix_sep='', prefix='')
.max(axis=1, level=0))
print(output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
Or, if you want to process all columns except id, first convert the excluded column(s) to the index with DataFrame.set_index, then use get_dummies with max, and finally add DataFrame.reset_index:
output_df = (pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
.max(axis=1, level=0)
.reset_index())
print(output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
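One caveat, as an assumption about your pandas version: the level argument of DataFrame.max was deprecated in pandas 1.3 and removed in 2.0, so on recent pandas the same column-level reduction can be written with a groupby instead, for example:
output_df = (pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='')
               .T.groupby(level=0).max()  # collapse duplicate column names
               .T
               .reset_index())
# recent pandas returns booleans from get_dummies; append .astype(int) for 0/1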
I have read many online articles on feature hashing of categorical variables for machine learning. Unfortunately, I still couldn't grasp the concept and understand how it works. I will illustrate my confusion through the sample dataset and hashing function below that I grabbed from another site:
>>>data
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 New York 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
5 1.8 Oregon 2003
>>> def hash_col(df, col, N):
...     cols = [col + "_" + str(i) for i in range(N)]
...     def xform(x):
...         tmp = [0 for i in range(N)]
...         tmp[hash(x) % N] = 1
...         return pd.Series(tmp, index=cols)
...     df[cols] = df[col].apply(xform)
...     return df.drop(col, axis=1)
The calls below print the transformed output for different numbers of dimensions (in other words, different numbers of hashed features):
>>> print(hash_col(data, 'state',4))
pop year state_0 state_1 state_2 state_3
0 1.5 2000 0 0 1 0
1 1.7 2001 0 0 1 0
2 3.6 2002 0 0 0 1
3 2.4 2001 0 1 0 0
4 2.9 2002 0 1 0 0
5 1.8 2003 0 0 0 1
>>> print(hash_col(data, 'state',5))
pop year state_0 state_1 state_2 state_3 state_4
0 1.5 2000 1 0 0 0 0
1 1.7 2001 1 0 0 0 0
2 3.6 2002 1 0 0 0 0
3 2.4 2001 0 0 1 0 0
4 2.9 2002 0 0 1 0 0
5 1.8 2003 0 0 0 0 1
>>> print(hash_col(data, 'state',6))
pop year state_0 state_1 state_2 state_3 state_4 state_5
0 1.5 2000 0 0 0 0 1 0
1 1.7 2001 0 0 0 0 1 0
2 3.6 2002 0 0 0 0 0 1
3 2.4 2001 0 0 0 1 0 0
4 2.9 2002 0 0 0 1 0 0
5 1.8 2003 0 0 0 0 0 1
What I can't understand is what each of the state_0, state_1, state_2, etc. columns represents. Also, since there are 4 unique states in my dataset (Ohio, New York, Nevada, Oregon), why are all the 1s allocated to just 3 state_n columns instead of 4, as in one-hot encoding? For example, when I set the number of dimensions to 6, the 1s landed in state_3, state_4 and state_5, but there were none in state_0, state_1 and state_2. Any feedback would be greatly appreciated!
Feature hashing is typically used when you don't know all the possible values of a categorical variable. Because of this, we can't create a static mapping from categorical values to columns. So a hash function is used to determine which column each categorical value corresponds to.
This is not the best use case for it, because we know there are exactly 50 states and could just use one-hot encoding.
A hash function will also have collisions, where different inputs are mapped to the same output. That is what's happening here: two different state names hash to the same column index after the modulus operation.
One way to alleviate collisions is to make your feature space (the number of columns) larger than the number of possible categorical values.
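Two practical notes, both beyond what the question showed. First, Python's built-in hash() is salted per process for strings, so hash_col will put the 1s in different columns on every run unless PYTHONHASHSEED is fixed. Second, the standard implementation of this technique is sklearn's FeatureHasher, which uses a stable hash; a minimal sketch on the question's state column:
from sklearn.feature_extraction import FeatureHasher

# data is the dataframe from the question; hash each state name into
# 6 columns. alternate_sign=False keeps the output 0/1 like one-hot
# encoding, and collisions (two states sharing a column) are expected.
hasher = FeatureHasher(n_features=6, input_type='string', alternate_sign=False)
hashed = hasher.transform([[s] for s in data['state']])
print(hashed.toarray())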