Encode categorical features with multiple categories per example - sklearn - pandas

I'm working on a movie dataset that contains genre as a feature. Examples in the dataset may belong to multiple genres at the same time, so each contains a list of genre labels.
The data looks like this:
   movieId                                             genres
0        1  [Adventure, Animation, Children, Comedy, Fantasy]
1        2                     [Adventure, Children, Fantasy]
2        3                                  [Comedy, Romance]
3        4                           [Comedy, Drama, Romance]
4        5                                           [Comedy]
I want to vectorize this feature. I have tried LabelEncoder and OneHotEncoder, but they can't seem to handle these lists directly.
I could vectorize this manually, but I have other similar features that contain too many categories. For those I'd prefer some way to use the FeatureHasher class directly.
Is there some way to get these encoder classes to work on such a feature? Or is there a better way to represent such a feature that will make encoding easier? I'd gladly welcome any suggestions.

This SO question has some impressive answers. On your example data, the last answer by Teoretic (using sklearn.preprocessing.MultiLabelBinarizer) is 14 times faster than the solution by Paulo Alves (and both are faster than the accepted answer!):
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)
result = pd.concat([df['movieId'], encoded], axis=1)
# Increase max columns to print the entire resulting DataFrame
pd.options.display.max_columns = 50
result
   movieId  Adventure  Animation  Children  Comedy  Drama  Fantasy  Romance
0        1          1          1         1       1      0        1        0
1        2          1          0         1       0      0        1        0
2        3          0          0         0       1      0        0        1
3        4          0          0         0       1      1        0        1
4        5          0          0         0       1      0        0        0
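For the other high-cardinality features, the question mentions FeatureHasher, and with input_type='string' it accepts an iterable of lists of strings, so the genre-style lists can be hashed directly. A minimal sketch (the n_features value is an arbitrary choice for illustration):
from sklearn.feature_extraction import FeatureHasher

# Each element of df['genres'] is a list of strings, which is exactly
# what input_type='string' expects; n_features=8 is arbitrary here.
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform(df['genres'])  # scipy.sparse matrix, shape (5, 8)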

Related

Groupby and multiindexes - how to organize data with irregular sizes?

I am trying to organize 3D data collected from several participants, with a different number of samples per participant. Each participant has a unique session and seat index in the experiment. For each participant i, I have a 3D array composed of N_i images (height × width).
I first tried creating a Dataset of participants, but I ended up with many NaNs because participants have different numbers of samples along the same dimension (the sample dim). I then switched to a single DataArray containing all my participants' data concatenated along one dimension I call depth. This dimension is then associated with a multiindex coordinate combining the session, seat and sample coordinates:
<xarray.DataArray (depth: 52, height: 4, width: 4)>
array([[[0.92337111, 0.86505447, 0.08541727, 0.74850848],
        [0.02336959, 0.0495726 , 0.98745956, 0.58831929],
        [0.62128185, 0.7732787 , 0.27716268, 0.83634779],
        [0.08146719, 0.35851012, 0.44170263, 0.74338872]],
       ...
       [[0.4365896 , 0.23527988, 0.86891853, 0.94486637],
        [0.20884748, 0.81012315, 0.61542411, 0.76706922],
        [0.33391262, 0.88955315, 0.25329999, 0.35803887],
        [0.49586615, 0.94767265, 0.40868892, 0.42393425]]])
Coordinates:
  * height   (height) int64 0 1 2 3
  * width    (width) int64 0 1 2 3
  * depth    (depth) MultiIndex
  - session  (depth) int64 0 0 0 0 0 0 0 0 0 0 0 1 1 ... 3 3 3 3 3 3 3 3 3 3 3 3
  - seat     (depth) int64 0 0 0 0 0 1 1 1 1 1 1 0 0 ... 0 0 0 0 0 1 1 1 1 1 1 1
  - sample   (depth) int64 0 1 2 3 4 0 1 2 3 4 5 0 1 ... 1 2 3 4 5 0 1 2 3 4 5 6
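For reference, a minimal sketch (with made-up sizes and coordinate values) of how such a multiindexed DataArray can be built:
import numpy as np
import xarray as xr

# Four depth entries with per-entry session/seat/sample labels (made up)
da = xr.DataArray(np.random.rand(4, 4, 4),
                  dims=('depth', 'height', 'width'),
                  coords={'session': ('depth', [0, 0, 0, 1]),
                          'seat':    ('depth', [0, 0, 1, 0]),
                          'sample':  ('depth', [0, 1, 0, 0])})
# Fold the three coordinates into a single MultiIndex on 'depth'
da = da.set_index(depth=['session', 'seat', 'sample'])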
However, I find this solution not really usable, for several reasons:
Each time I want to perform a groupby, I have to reset the index and recreate one with the coordinates I want to group by, since xarray does not support grouping by multiple coordinates on the same dim:
da = da.reset_index('depth')
da = da.set_index(depth=['session', 'seat'])
da.groupby('depth').mean()
The result of the code above is not perfect, as it does not preserve the multiindex level names:
<xarray.DataArray (depth: 8, height: 4, width: 4)>
array([[[0.47795382, 0.67322777, 0.12946181, 0.48983815],
        [0.33895882, 0.46772217, 0.62886196, 0.55970122],
        [0.57370573, 0.47272117, 0.31529004, 0.63230245],
        [0.63230284, 0.5352105 , 0.65805407, 0.65274841]],
       ...
       [[0.55672404, 0.37963945, 0.57334768, 0.64853806],
        [0.46608072, 0.39506509, 0.66339553, 0.71447367],
        [0.58989461, 0.66066485, 0.53271228, 0.43036214],
        [0.44163921, 0.54990042, 0.4229631 , 0.5941268 ]]])
Coordinates:
  * height         (height) int64 0 1 2 3
  * width          (width) int64 0 1 2 3
  * depth          (depth) MultiIndex
  - depth_level_0  (depth) int64 0 0 1 1 2 2 3 3
  - depth_level_1  (depth) int64 0 1 0 1 0 1 0 1
I can use sel only on fully indexed data (i.e. with session, seat and sample all in the depth index), so I end up re-indexing my data again and again.
I find using hvplot on such a DataArray not really straightforward (skipping the details here for easier reading of this already long post).
Is there something I am missing? Is there a better way to organize my data? I tried to create multiple indexes on the same dim for convenience, but without success.

Dataframe apply set is not removing duplicate values

My dataset can sometimes include duplicates in one concatenated column like this:
                      Total
0  Thriller,Satire,Thriller
1    Horror,Thriller,Horror
2    Mystery,Horror,Mystery
3  Adventure,Horror,Horror
When doing this:
df['Total'].str.split(",").apply(set)
I get
                 Total
0   {Thriller, Satire}
1   {Horror, Thriller}
2    {Mystery, Horror}
3  {Adventure, Horror}
And after encoding it with
df['Total'].str.get_dummies(sep=",")
I get a header looking like this
{'Horror {'Mystery {'Thriller ... Horror Thriller'}
Instead of
Horror Mystery Thriller
How do I get rid of the curly brackets when using Pandas dataframe?
The method Series.str.get_dummies also works nicely with duplicates.
So omit the code that builds the sets of unique values:
df['Total'] = df['Total'].str.split(",").apply(set)
And use only:
df1 = df['Total'].str.get_dummies(sep=",")
print (df1)
   Adventure  Horror  Mystery  Satire  Thriller
0          0       0        0       1         1
1          0       1        0       0         1
2          0       1        1       0         0
3          1       1        0       0         0
But if you need to remove the duplicates first, chain in Series.str.join:
df1 = df['Total'].str.split(",").apply(set).str.join(',').str.get_dummies(sep=",")
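Since get_dummies records only presence or absence, both routes give the same indicator matrix here; a quick check (a sketch with the example data inlined):
import pandas as pd

df = pd.DataFrame({'Total': ['Thriller,Satire,Thriller',
                             'Horror,Thriller,Horror',
                             'Mystery,Horror,Mystery',
                             'Adventure,Horror,Horror']})
direct = df['Total'].str.get_dummies(sep=',')
dedup = (df['Total'].str.split(',').apply(set)
                    .str.join(',').str.get_dummies(sep=','))
assert direct.equals(dedup)  # duplicates never change presence/absence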

Pandas group by one hot encoded columns

My Pandas data frame looks like this (basically one-hot encoded columns):
MovieID  Action  Adventure  Animation  Childrens  Comedy  Crime  Documentary  rating
      1       0          0          1          1       1      0            0       4
      2       1          0          0          0       1      0            0       5
      3       0          0          0          0       0      1            0       2
      4       0          0          0          0       0      0            0       4
      5       0          0          0          1       1      0            0       7
What I want to do is group by the different movie genres (Action, Adventure, Animation, etc.) and count how many times a rating was given for each genre.
Expected output:
Genre      Number of times rated
Action     1
Adventure  0
Animation  1
Childrens  2
Comedy     3
......
Genre Action was rated 1 time, adventure 0 times etc.
Code until now:
number_of_ratings = data.groupby(['Action']).agg({"rating": "count"})
Is there a way to select all genre columns at once? It does not seem ideal to type out all the genres (there are many more).
Does it handle the fact that some of the movies belong to multiple genres?
Thank you in advance!
Sounds like we can try
output = df.drop(['MovieID', 'rating'], axis=1).sum()
Action         1
Adventure      0
Animation      1
Childrens      2
Comedy         3
Crime          1
Documentary    0
dtype: int64
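This works because every movie in the example carries a rating, so summing the indicator columns counts rated movies per genre. If unrated movies (NaN in rating) could appear, a variant sketch that counts only rated rows:
# Hypothetical variant: only count rows that actually carry a rating
genre_cols = df.columns.difference(['MovieID', 'rating'])
output = df.loc[df['rating'].notna(), genre_cols].sum()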

Converting a pandas crosstab into a stacked dataframe (a regular table)

Given a pandas crosstab, how do you convert that into a stacked dataframe?
Assume you have a stacked dataframe. First we convert it into a crosstab; now I would like to revert back to the original stacked dataframe. I searched for a problem statement that addresses this requirement, but could not find one that hits the mark. In case I have missed any, please leave a note in the comment section.
I would like to document the best practice here, so thank you for your support.
I know that pandas.DataFrame.stack() would be the best approach, but one needs to be careful about the "level" the stacking is applied to.
Input: Crosstab:
Label  a  b  c  d  r
ID
1      0  1  0  0  0
2      1  1  0  1  1
3      1  0  0  0  1
4      1  0  0  1  0
6      1  0  0  0  0
7      0  0  1  0  0
8      1  0  1  0  0
9      0  1  0  0  0
Output: Stacked DataFrame:
    ID Label
0    1     b
1    2     a
2    2     b
3    2     d
4    2     r
5    3     a
6    3     r
7    4     a
8    4     d
9    6     a
10   7     c
11   8     a
12   8     c
13   9     b
Step-by-step Explanation:
First, let's make a function that creates our data. Note that it randomly generates the stacked dataframe, so the final output may differ from what I have shown below.
Helper Function: Make the Stacked And Crosstab DataFrames
import numpy as np
import pandas as pd

# Make stacked dataframe
def _create_df():
    """
    This dataframe will be used to create a crosstab
    """
    B = np.array(list('abracadabra'))
    A = np.arange(len(B))
    AB = list()
    for i in range(20):
        a = np.random.randint(1, 10)
        b = np.random.randint(1, 10)
        AB += [(a, b)]
    AB = np.unique(np.array(AB), axis=0)
    AB = np.unique(np.array(list(zip(A[AB[:, 0]], B[AB[:, 1]]))), axis=0)
    AB_df = pd.DataFrame({'ID': AB[:, 0], 'Label': AB[:, 1]})
    return AB_df

original_stacked_df = _create_df()

# Make crosstab
crosstab_df = pd.crosstab(original_stacked_df['ID'],
                          original_stacked_df['Label']).reindex()
What to expect?
You would expect a function to regenerate the stacked dataframe from the crosstab. I will provide my own solution in the answer section; if you can suggest something better, that would be great.
Other References:
Closest stackoverflow discussion: pandas stacking a dataframe
Misleading stackoverflow question-topic: change pandas crossstab dataframe into plain table format:
You can just do stack:
df[df.astype(bool)].stack().reset_index().drop(columns=0)
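Unpacked step by step (a sketch, with df standing for the crosstab above):
mask = df.astype(bool)    # 0 -> False, 1 -> True
kept = df[mask]           # cells that are False become NaN
long = kept.stack()       # drops the NaNs, leaving one row per (ID, Label)
out = long.reset_index().drop(columns=0)  # drop the all-ones value column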
The following produces the desired outcome.
def crosstab2stacked(crosstab):
    stacked = crosstab.stack(dropna=True).reset_index()
    stacked = stacked[stacked.replace(0, np.nan)[0].notnull()].drop(columns=[0])
    # Renumber the rows 0..n-1 so the result lines up with the original
    return stacked.reset_index(drop=True)
# Make original dataframe
original_stacked_df = _create_df()
# Make crosstab dataframe
crosstab_df = pd.crosstab(original_stacked_df['ID'],
                          original_stacked_df['Label']).reindex()
# Reconstruct stacked dataframe
recon_stacked_df = crosstab2stacked(crosstab=crosstab_df)
Check if original == reconstructed:
np.all(original_stacked_df == recon_stacked_df)
Output: True
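A stricter check (a sketch) using pandas' own testing helper, which also verifies the index and, assuming dtypes survive the round trip, the column dtypes:
pd.testing.assert_frame_equal(original_stacked_df, recon_stacked_df)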

Select rows of dataframe based on column values

Problem
I am working on a machine learning project which aims to see on what kind of raw data (text) the classifiers tend to make mistakes and on what kind of data they have no consensus.
Now I have a dataframe with labels, prediction results of 2 classifiers and text data. I am wondering if there is a simple way I could select rows based on some set operations of those columns with predictions or labels.
Data might look like
   score                                             review  svm_pred  dnn_pred
0      0  I went and saw this movie last night after bei...         0         1
1      1  Actor turned director Bill Paxton follows up h...         1         1
2      1  As a recreational golfer with some knowledge o...         0         1
3      1  I saw this film in a sneak preview, and it is ...         1         1
4      1  Bill Paxton has taken the true story of the 19...         1         1
5      1  I saw this film on September 1st, 2005 in Indi...         1         1
6      1  Maybe I'm reading into this too much, but I wo...         0         1
7      1  I felt this film did have many good qualities....         1         1
8      1  This movie is amazing because the fact that th...         1         1
9      0  "Quitting" may be as much about exiting a pre-...         1         1
For example, I want to select the rows where both classifiers make mistakes; here, index 9 would be returned.
A made-up MWE data example is provided here:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3), columns=["score", "svm_pred", "dnn_pred"])
which returns
   score  svm_pred  dnn_pred
0      0         1         0
1      0         0         1
2      0         0         0
3      1         0         0
4      0         0         1
5      0         1         1
6      1         0         1
7      0         1         1
8      1         1         1
9      1         1         1
What I Have Done
I know I could list all possible combinations (000, 001, etc.). However:
This is not doable when I want to compare more classifiers.
This will not work for a multi-class classification problem.
Could someone help me, thank you in advance.
Why This Question is Not a Duplicate
The existing answers only consider the case where the number of columns is limited. However, in my application the number of predictions given by the classifiers (i.e. columns) could be large, which makes the existing answers not quite applicable.
At the same time, this is the first time I have seen the pd.Series.ne function used for this particular purpose, which might shed some light for people with similar confusion.
Create a helper Series of the "number of incorrect classifiers" that you can do logical operations on. This assumes the true score is in the first column and the prediction values are in the columns after it; you may need to update the slicing indices accordingly.
s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)
Example Usage:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3),
                  columns=["score", "svm_pred", "dnn_pred"])
s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)
# Return rows where all classifiers got it right
df[s.eq(0)]
   score  svm_pred  dnn_pred
2      0         0         0
8      1         1         1
9      1         1         1
# Return rows where exactly 1 classifier got it wrong
df[s.eq(1)]
   score  svm_pred  dnn_pred
0      0         1         0
1      0         0         1
4      0         0         1
6      1         0         1
# Return rows where all classifiers got it wrong
df[s.eq(2)]
   score  svm_pred  dnn_pred
3      1         0         0
5      0         1         1
7      0         1         1
You can use set operations on the selection of rows:
# returns indexes of those rows where score is equal to svm prediction and dnn prediction
df[(df['score'] == df['svm_pred']) & (df['score'] == df['dnn_pred'])].index
# returns indexes of those rows where both predictions are wrong
df[(df['score'] != df['svm_pred']) & (df['score'] != df['dnn_pred'])].index
# returns indexes of those rows where either prediction is wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])].index
If you are not only interested in the index, but the complete row, omit the last part:
# returns rows where either prediction is wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])]
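With many classifiers, the same ne-based idea scales without listing columns by hand; a sketch, assuming the prediction columns all end in '_pred' (a hypothetical naming convention):
pred_cols = [c for c in df.columns if c.endswith('_pred')]
wrong = df[pred_cols].ne(df['score'], axis=0).sum(axis=1)

all_wrong = df[wrong == len(pred_cols)]                  # every classifier wrong
no_consensus = df[wrong.between(1, len(pred_cols) - 1)]  # classifiers disagree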