scikit-learn - vectorizing both integer and string features at the same time - pandas

Is there a way of applying one-hot encoding to both strings and integers at the same time?
DictVectorizer is used for strings and OneHotEncoder is used for integers. Is there something that combines the two, i.e. treats all feature values as categorical regardless of their type?
For Example: I have a pandas DataFrame, some of the columns are integers and some are strings:
>>> df
a b c d
0 2 0 w K
1 0 1 f K
2 1 2 y L
3 0 0 f M
All columns are actually categorical. There's no meaning for some of them being integers.
Now if I use a DictVectorizer like this:
vectorizer = DictVectorizer(sparse=False)
df_dict = df.T.to_dict().values()
vectorizer.fit_transform(df_dict)
I get a nice big matrix for columns 'c' and 'd', but the values in 'a' and 'b' stay exactly the same. I need them to be one-hot encoded as well.
One option is of course applying the str function to 'a' and 'b', but that is both implicit (the original data really is integers) and inefficient (it iterates over entire columns, which might be quite big, just to perform a wasteful conversion).
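For reference, a minimal sketch of that workaround (casting the integer columns to strings before building the dicts; df and DictVectorizer as above):
from sklearn.feature_extraction import DictVectorizer
# Cast the integer columns so DictVectorizer treats them as categorical too
df_str = df.copy()
df_str[['a', 'b']] = df_str[['a', 'b']].astype(str)
vectorizer = DictVectorizer(sparse=False)
encoded = vectorizer.fit_transform(df_str.T.to_dict().values())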
Is there a simple way of doing this?
Thanks

Looks like get_dummies is what you want. It will take any column and convert it into a set of indicator (dummy) columns, one per category.
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.get_dummies.html
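For example, a minimal sketch using the DataFrame from the question (get_dummies treats every column passed via columns= as categorical, whatever its dtype):
import pandas as pd
df = pd.DataFrame({'a': [2, 0, 1, 0], 'b': [0, 1, 2, 0],
                   'c': ['w', 'f', 'y', 'f'], 'd': ['K', 'K', 'L', 'M']})
# One indicator column per (column, value) pair, e.g. a_0, a_1, a_2, ..., d_M
dummies = pd.get_dummies(df, columns=['a', 'b', 'c', 'd'])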

Related

Pandas dataframe select rows where a list-column contains a specific set of elements

This is a follow-up to the following post: Pandas dataframe select rows where a list-column contains any of a list of strings
I want to be able to select rows that contain the exact pair of strings from the selection list (where selection= ['cat', 'dog']).
starting df:
molecule species
0 a [dog]
1 b [horse, pig]
2 c [cat, dog]
3 d [cat, horse, pig]
4 e [chicken, pig]
df I want:
molecule species
2 c [cat, dog]
I tried the following and it returned only the column labels.
df[pd.DataFrame(df.species.tolist()).isin(selection).all(1)]
One way to do it:
df['joined'] = df.species.str.join(sep=',')
selection = ['cat,dog']
filtered = df.loc[df.joined.isin(selection)]
This won't find cases with different sorting (i.e. 'dog,cat' or 'horse,cat,pig'), but if that is not an issue then it works fine.
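If the ordering does matter for your data, a small tweak (a sketch, not part of the original answer) is to sort each list before joining so the comparison becomes order-independent:
df['joined'] = df.species.map(lambda lst: ','.join(sorted(lst)))
selection = [','.join(sorted(['cat', 'dog']))]  # 'cat,dog'
filtered = df.loc[df.joined.isin(selection)]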
This approach works regardless of element order; it selects the rows whose species list contains only elements of the selection list:
import numpy as np
import pandas as pd
selection = ['cat', 'dog']
mols = pd.DataFrame({'molecule': ['a', 'b', 'c', 'd', 'e'],
                     'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'], ['cat', 'horse', 'pig'], ['chicken', 'pig']]})
# True where every species in the row appears in the selection list
mask = pd.Series([all(w in selection for w in mols.species.values[k]) for k in mols.index])
mols.loc[np.where(mask.map({True: 1, False: 0}) == 1)[0]]
If you want to find any rows that have at least the elements in the list (and could have others as well), use:
mask = pd.Series([all(w in mols.species.values[k] for w in selection) for k in mols.index])
mols.loc[np.where(mask.map({True: 1, False: 0}) == 1)[0]]
This is an interesting application of matrices as selectors. Multiply the transposed mols by the vector of zeroes and ones that indicates which rows of mols fit your criteria:
selector = pd.Series([all(w in mols.species.values[k] for w in selection) for k in mols.index]).map({True: 1, False: 0})
mols.to_numpy().T.dot(selector)
Another (more readable) solution is to assign mols a column holding the condition, map it to 0 and 1, and query mols for the rows where that column equals 1.
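A sketch of that variant, reusing the mols and selection defined above (the column name match_flag is just illustrative):
# Flag the rows where every element of selection is present, then query on the flag
mols['match_flag'] = mols.species.map(lambda lst: all(w in lst for w in selection)).map({True: 1, False: 0})
mols.query('match_flag == 1')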

Creating a CategoricalDtype from an int column in Dask

dask.__version__ = 2.5.0
I have a table with columns containing uint16 codes in the range 0, ..., n, plus a bunch of lookup tables containing the mappings from these 'codes' to their 'categories'.
My question: is there a way to make these integer columns 'categorical' without parsing the data or first replacing the codes with the categories?
Ideally, I want Dask to keep the values as they are, accept them as category codes, and accept the categories that I tell Dask belong to those codes.
dfp = pd.DataFrame({'c01': np.random.choice(np.arange(3),size=10), 'v02': np.random.randn(10)})
dfd = dd.from_pandas(dfp, npartitions=2)
mdt = pd.CategoricalDtype(list('abc'), ordered=True)
dfd.c01 = dfd.c01.map_partitions(lambda s: pd.Categorical.from_codes(s, dtype=mdt), meta='category')
dfd.dtypes
The above does not work; the dtype is 'O' (it seems to have replaced the ints with strings). I can subsequently do the following (which seems to do the trick):
dfd.c01 = dfd.c01.astype('category')
But that seems inefficient for big data sets.
Any pointers are much appreciated.
Some context: I have a big dataset (>500M rows) with many columns containing a limited number of strings, the perfect use case for the categorical dtype. The data gets extracted from a Teradata DW using Parallel Transporter, meaning it produces a delimited UTF-8 file. To make this process faster, I categorize the data on the Teradata side, and I just need to create the categorical dtype from the codes on the Dask side of the fence.
As long as you have an upper bound on the largest integer, which you call n (equal to 3 here), the following will work.
In [33]: dfd.c01.astype('category').cat.set_categories(np.arange(len(mdt.categories))).cat.rename_categories(list(mdt.categories))
Out[33]:
Dask Series Structure:
npartitions=2
0 category[known]
5 ...
9 ...
Name: c01, dtype: category
Dask Name: cat, 10 tasks
which gives the following when computed:
Out[34]:
0 b
1 b
2 c
3 c
4 a
5 c
6 a
7 a
8 a
9 a
Name: c01, dtype: category
Categories (3, object): [a, b, c]
The basic idea is to make an intermediate Categorical whose categories are the codes (0, 1, ... n) and then move from those numerical categories to the actual one (a, b, c).
We have an open issue for making this nicer https://github.com/dask/dask/issues/2829
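As a plain-pandas illustration of that two-step idea (a sketch, separate from the Dask code above):
import numpy as np
import pandas as pd
mdt = pd.CategoricalDtype(list('abc'), ordered=True)
codes = pd.Series(np.random.choice(np.arange(3), size=10))
# Step 1: make the codes 0..n-1 the categories; step 2: rename them to the real labels
as_codes = codes.astype('category').cat.set_categories(np.arange(len(mdt.categories)))
labelled = as_codes.cat.rename_categories(list(mdt.categories))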

Pipelining pandas: create columns that depend on freshly created ones

Let's say you have the following DataFrame
df=pd.DataFrame({'A': [1, 2]})
Now I want to construct the column B = A + 1, then the column C = A + 2, and D = B + C. These calculations are only here for simplicity; typically, I want to apply e.g. nonlinear transformations, normalizations, etc.
what one could do is the following:
df.assign(**{'B': lambda x: x['A'] + 1, 'C': lambda x: x['A'] + 2})\
  .assign(**{'D': lambda x: x['B'] + x['C']})
However, this is obviously a bit annoying, especially if you have a large number of preprocessing steps in a pipeline. Putting both dictionaries together (even in an OrderedDict) fails.
Is there a way to obtain a similar result faster or more elegantly?
Additionally, the same problem occurs if you want to add a column that uses e.g. the sum of a just-defined column. As far as I know, this will always require two assign calls.
You can do this with eval:
df.eval("""
B= A+1
C= A+2
D = B+C""", inplace=False)
Out[625]:
A B C D
0 1 2 3 5
1 2 3 4 7
If you want to use an aggregated value within the expression:
df.eval('B=A.max()',inplace=True)
df
Out[647]:
A B
0 1 2
1 2 2
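Since the question mentions nonlinear transformations, note that the eval result also chains with assign for steps that eval cannot express (a sketch; the squared column E is purely illustrative):
import pandas as pd
df = pd.DataFrame({'A': [1, 2]})
result = df.eval("""
B = A + 1
C = A + 2
D = B + C""", inplace=False).assign(E=lambda x: x['D'] ** 2)  # nonlinear step on the fresh column D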

Extract different rows from two numpy 2D arrays

I generated a new matrix B (50, 40) of random rows taken from a matrix A (100, 40):
B = A[np.random.randint(0,100,size=50)] # it works fine.
Now I want to take the rows from A that aren't in matrix B.
C = A not in B # pseudocode.
This should do the job:
import numpy as np
A=np.random.randint(5,size=[100,40])
l=np.random.choice(100, size=50, replace=False)
B = A[l]
C= A[np.setdiff1d(np.arange(0,100),l)]
l stores the indices of the selected rows, and for C you take the complement of l; C is then the required matrix.
Note that I set l=np.random.choice(100, size=50, replace=False) to avoid replacement. If you use np.random.randint(0,100,size=50) you may get repeated rows as the same number is selected at random.
Inspired by this question: Check whether each row of a matrix is in another matrix [Python]. First get the indices of A's rows that exist in B, then take the difference from the full set of A's indices, and finally select rows using that difference.
index = np.argwhere((B[:,None,:] == A[:,:]).all(-1))[:, 1]
C = A[np.setdiff1d(np.arange(100), index)]
The numpy_indexed package (disclaimer: I am its author) has efficient vectorized functionality for all these kinds of operations.
import numpy_indexed as npi
C = npi.difference(A, B)

groupby on sparse matrix in pandas: filling them first

I have a pandas DataFrame df with shape (1000000,3) as follows:
id cat team
1 'cat1' A
1 'cat2' A
2 'cat3' B
3 'cat1' A
4 'cat3' B
4 'cat1' B
Then I dummify with respect to the cat column in order to get ready for a machine learning classification.
df2 = pandas.get_dummies(df,columns=['cat'], sparse=True)
But when I try to do:
df2.groupby(['id','team']).sum()
It gets stuck and the computation never ends. So instead of grouping right away, I try:
df2 = df2.fillna(0)
But it does not work and the DataFrame is still full of NaN values. Why doesn't the fillna() function fill my DataFrame as it should?
In other words, how can a pandas sparse matrix I got from get_dummies be filled with 0 instead of NaN?
I also tried:
df2 = pandas.get_dummies(df,columns=['cat'], sparse=True).to_sparse(fill_value=0)
This time, df2 is well filled with 0, but when I try:
print df2.groupby(['id','team']).sum()
I get:
C:\Anaconda\lib\site-packages\pandas\core\groupby.pyc in loop(labels, shape)
3545 for i in range(1, nlev):
3546 stride //= shape[i]
-> 3547 out += labels[i] * stride
3548
3549 if xnull: # exclude nulls
ValueError: operands could not be broadcast together with shapes (1205800,) (306994,) (1205800,)
My solution was to do:
df2 = pandas.DataFrame(np.nan_to_num(df2.as_matrix()))
df2.groupby(['id','team']).sum()
And it works, but it takes a lot of memory. Can someone help me find a better solution, or at least understand why I can't easily fill a sparse matrix with zeros? And why it seems impossible to use groupby() followed by sum() on a sparse matrix?
I think your problem is due to mixing of dtypes. But you could get around it like this. First, provide only the relevant column to get_dummies() rather than the whole dataframe:
df2 = pd.get_dummies(df['cat']).to_sparse(0)
After that, you can add other variables back, but everything needs to be numeric. A pandas sparse dataframe is just a wrapper around a sparse (and homogeneous-dtype) numpy array.
df2['id'] = df['id']
'cat1' 'cat2' 'cat3' id
0 1 0 0 1
1 0 1 0 1
2 0 0 1 2
3 1 0 0 3
4 0 0 1 4
5 1 0 0 4
For non-numeric types, you could do the following:
df2['team'] = df['team'].astype('category').cat.codes
This groupby seems to work OK:
df2.groupby('id').sum()
'cat1' 'cat2' 'cat3'
id
1 1 1 0
2 0 0 1
3 1 0 0
4 1 0 1
An additional but possibly important point for memory management is that you can often save substantial memory with categoricals rather than string objects (perhaps you are already doing this though):
df['cat2'] = df['cat'].astype('category')
df[['cat','cat2']].memory_usage()
cat 48
cat2 30
Not much saving here for the small example dataframe, but it could make a substantial difference in your actual dataframe.
I was tackling a similar problem before. What I did was apply the groupby operation first and follow it up with get_dummies().
This worked for me because groupby is very slow after the creation of thousands of dummified columns (in my case), especially on sparse dataframes; it basically gave up for me. Grouping over the columns first and then dummifying made it work.
df = pd.DataFrame(df.groupby(['id','team'])['cat'].unique())
df.columns = ['cat']
df.reset_index(inplace=True)
df = df[['id','team']].join(df['cat'].str.join('|').str.get_dummies().add_prefix('CAT_'))
Hope this helps out someone!