Creating a CategoricalDtype from an int column in a Dask DataFrame

dask.__version__ = 2.5.0
I have a table whose columns contain many uint16 codes in the range 0, ..., n, plus a bunch of lookup tables containing the mappings from these 'codes' to their 'categories'.
My question: is there a way to make these integer columns categorical without parsing the data or first replacing the codes with the categories?
Ideally, Dask would keep the values as they are, accept them as category codes, and accept the categories I tell it belong to those codes.
dfp = pd.DataFrame({'c01': np.random.choice(np.arange(3),size=10), 'v02': np.random.randn(10)})
dfd = dd.from_pandas(dfp, npartitions=2)
mdt = pd.CategoricalDtype(list('abc'), ordered=True)
dfd.c01 = dfd.c01.map_partitions(lambda s: pd.Categorical.from_codes(s, dtype=mdt), meta='category')
dfd.dtypes
The above does not work: the dtype is 'O' (it seems to have replaced the ints with strings). I can subsequently do the following (which seems to do the trick):
dfd.c01 = dfd.c01.astype('category')
But that seems inefficient for big data sets.
Any pointers are much appreciated.
Some context: I have a big dataset (>500M rows) with many columns containing a limited number of strings: the perfect use case for the categorical dtype. The data gets extracted from a Teradata DW using Parallel Transporter, which produces a delimited UTF-8 file. To make this process faster, I categorize the data on the Teradata side, so I only need to create the categorical dtype from the codes on the Dask side of the fence.

As long as you have an upper bound on the largest integer, which you call n (equal to 3 here), the following will work.
In [33]: dfd.c01.astype('category').cat.set_categories(np.arange(len(mdt.categories))).cat.rename_categories(list(mdt.categories))
Out[33]:
Dask Series Structure:
npartitions=2
0 category[known]
5 ...
9 ...
Name: c01, dtype: category
Dask Name: cat, 10 tasks
which gives the following when computed:
Out[34]:
0 b
1 b
2 c
3 c
4 a
5 c
6 a
7 a
8 a
9 a
Name: c01, dtype: category
Categories (3, object): [a, b, c]
The basic idea is to make an intermediate Categorical whose categories are the codes (0, 1, ..., n) and then move from those numerical categories to the actual ones (a, b, c).
We have an open issue for making this nicer: https://github.com/dask/dask/issues/2829
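If you would rather build the categorical directly from the codes with map_partitions, as attempted in the question, one sketch that should work (assuming pandas >= 0.24, where Categorical.from_codes accepts a dtype= argument) is to return a Series and hand dask an explicit meta that carries the full dtype, so the categories are known up front:
import numpy as np
import pandas as pd
import dask.dataframe as dd

dfp = pd.DataFrame({'c01': np.random.choice(np.arange(3), size=10),
                    'v02': np.random.randn(10)})
dfd = dd.from_pandas(dfp, npartitions=2)
mdt = pd.CategoricalDtype(list('abc'), ordered=True)

# Wrap the Categorical in a Series (keeping the partition's index) and pass an
# explicit empty Series as meta so the resulting dtype has known categories.
dfd['c01'] = dfd['c01'].map_partitions(
    lambda s: pd.Series(pd.Categorical.from_codes(s, dtype=mdt), index=s.index),
    meta=pd.Series([], dtype=mdt),
)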

Related

Selecting two sets of columns from a dataFrame with all rows

I have a DataFrame with 28 columns (features) and 600 rows (instances). I want to select all rows, but only columns 0-12 and 16-27, meaning that I don't want to select columns 13-15.
I wrote the following code, but it doesn't work and throws a syntax error at : in 0:12 and 16:. Can someone help me understand why?
X = df.iloc[:,[0:12,16:]]
I know there are other ways for selecting these rows, but I am curious to learn why this one does not work, and how I should write it to work (if there is a way).
For now, I have written it as:
X = df.iloc[:,0:12]
X = X + df.iloc[:,16:]
This seems to return an incorrect result: I have already treated the NaN values of df, yet when I use this code, X includes lots of NaNs!
Thanks for your feedback in advance.
You can use np.r_ to concatenate the slices. Note that, unlike iloc itself, np.r_ needs explicit stop values, so the open-ended 16: has to be spelled out:
X = df.iloc[:, np.r_[0:13, 16:28]]
iloc has these allowed inputs (from the docs):
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
What you're writing in X = df.iloc[:,[0:12,16:]] is neither a list of integers nor a slice of ints, and in fact it never reaches iloc at all: slice notation with a colon is only valid directly inside indexing brackets, so putting 0:12 inside a list literal is a syntax error, which is exactly what you're seeing. You need to turn those slices into a single array of integers, and a convenient way to do that is numpy's np.r_ function. (As an aside, your fallback X = df.iloc[:,0:12] + df.iloc[:,16:] adds the two frames element-wise, aligned on index and columns, rather than concatenating them; columns that appear in only one operand come out as all NaN, which is where the NaNs you saw come from.)
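For reference, np.r_ simply turns that slice notation into one flat array of integer positions, which is a valid iloc indexer:
import numpy as np

np.r_[0:13, 16:28]
# -> the integers 0..12 followed by 16..27, i.e. 25 column positions in one array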
X = df.iloc[:, np.r_[0:13, 16:28]]

Pipelining pandas: create columns that depend on freshly created ones

Let's say you have the following DataFrame
df=pd.DataFrame({'A': [1, 2]})
Now I want to construct the column B = A + 1, then the column C = A + 2, and D = B + C. These calculations are only here for simplicity; typically I want to apply, e.g., nonlinear transformations, normalizations, etc.
What one could do is the following:
df.assign(**{'B': lambda x: x['A'] + 1, 'C': lambda x: x['A'] + 2})\
.assign(**{'D':lambda x: x['B']+ x['C']})
However, this is obviously a bit annoying, especially if you have a large number of preprocessing steps in a pipeline. Putting both dictionaries together (even in an OrderedDict) fails.
Is there a way to obtain a similar result faster or more elegantly?
Additionally, the same problem occurs if you want to add a column that uses, e.g., the sum of a just-defined column. As far as I know, this will always require two assign calls.
You can use eval:
df.eval("""
B= A+1
C= A+2
D = B+C""", inplace=False)
Out[625]:
A B C D
0 1 2 3 5
1 2 3 4 7
If you want to use an aggregation of an existing column within the expression, that works too:
df.eval('B=A.max()',inplace=True)
df
Out[647]:
A B
0 1 2
1 2 2
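For completeness: on Python 3.6+ with pandas 0.23+, a single assign call should also cover the dependent column, because later keyword arguments may refer to columns created earlier in the same call. A sketch mirroring the question's example:
df = pd.DataFrame({'A': [1, 2]})
# D can see B and C because assign evaluates the kwargs in order.
df.assign(B=lambda x: x['A'] + 1,
          C=lambda x: x['A'] + 2,
          D=lambda x: x['B'] + x['C'])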

Extract different rows from two numpy 2D arrays

I generated a new matrix B (50, 40) of randomly chosen rows from a matrix A (100, 40):
B = A[np.random.randint(0,100,size=50)] # it works fine.
Now, I want to take the rows from A that aren't in matrix B.
C = A not in B # pseudocode.
This should do the job:
import numpy as np

A = np.random.randint(5, size=[100, 40])
l = np.random.choice(100, size=50, replace=False)
B = A[l]
C = A[np.setdiff1d(np.arange(100), l)]
l stores the selected rows, and for C you take the complement of l. Then C is the required matrix.
Note that I set l=np.random.choice(100, size=50, replace=False) to avoid replacement. If you use np.random.randint(0,100,size=50) you may get repeated rows, because the same index can be drawn more than once.
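If the complement reads more naturally as a mask, the same selection can be written with a boolean index:
mask = np.ones(100, dtype=bool)   # start with every row of A selected
mask[l] = False                   # drop the rows that went into B
C = A[mask]                       # the remaining 50 rows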
Inspired by this question: Check whether each row of a matrix is in another matrix [Python]. First get the indices of the rows of A that exist in B, then take the difference from the full set of A's row indices, and finally select the rows using that difference.
# Broadcast-compare every row of B against every row of A: (50, 1, 40) == (100, 40)
# yields a (50, 100, 40) boolean array; .all(-1) marks full-row matches.
index = np.argwhere((B[:, None, :] == A).all(-1))[:, 1]
C = A[np.setdiff1d(np.arange(100), index)]
The numpy_indexed package (disclaimer: I am its author) has efficient vectorized functionality for all these kinds of operations.
import numpy_indexed as npi
C = npi.difference(A, B)

Conditional join result count in a large dataframe

I have a data set of about 100M rows (4 GB) containing two lists like these:
Seed
a
r
apple
hair
brush
tree
Phrase
apple tree
hair brush
I want to get the count of unique matched 'Phrase's for each unique 'Seed'. For example, the seed 'a' is contained in both 'apple tree' and 'hair brush', so its 'Phrases_matched_count' should be 2. Matches are just partial matches (i.e. a 'string contains' match; it does not need to be a regex or anything complex).
Seed Phrases_matched_count
a 2
r 2
apple 1
hair 1
brush 1
tree 1
I have been trying to find a way to do this using Apache Pig (on a small Amazon EMR cluster) and Python pandas (the data set just about fits in memory), but I can't find a way that avoids either looping through every row for each unique 'seed', which would take very long, or taking a cross product of the tables, which would use too much memory.
Any ideas?
This can be done with the built-in str.contains, though I'm not sure how well it scales to a large amount of data. (Note that str.contains treats its pattern as a regular expression by default; pass regex=False if you only want a literal substring match.)
# Test data
seed = pd.Series(['a','r', 'apple', 'hair', 'brush', 'tree'])
phrase = pd.Series(['apple tree', 'hair brush'])
# Creating a DataFrame with seeds as index and phrases as columns
df = pd.DataFrame(index=seed, columns=phrase)
# Checking if each seed is contained in each phrase
df = df.apply(lambda x: x.index.str.contains(x.name), axis=1)
# Getting the result
df.sum(axis=1)
# The result
a 2
r 2
apple 1
hair 1
brush 1
tree 1
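For reference, the same seed-by-phrase boolean table can be spelled out with a plain substring test; note that, like any cross product, it materialises one cell per (seed, phrase) pair, so it will not fit in memory at the 100M-row scale described in the question:
bool_df = pd.DataFrame({p: [s in p for s in seed] for p in phrase}, index=seed)
bool_df.sum(axis=1)  # same counts as above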

scikit-learn - vectorizing both integer and string features at the same time

Is there a way of applying one-hot encoding to both strings and integers at the same time?
DictVectorizer is used for strings, OneHotEncoder is used for integers. Is there something that kind of combines them (treat all feature values as categorical regardless of their type)?
For Example: I have a pandas DataFrame, some of the columns are integers and some are strings:
>>> df
a b c d
0 2 0 w K
1 0 1 f K
2 1 2 y L
3 0 0 f M
All columns are actually categorical. There's no meaning for some of them being integers.
Now if I use a DictVectorizer like this:
vectorizer = DictVectorizer(sparse=False)
df_dict = df.T.to_dict().values()
vectorizer.fit_transform(df_dict)
I get a nice big matrix for columns 'c' and 'd', but the values in 'a' and 'b' stay exactly the same. I need them to get the same treatment.
One option is of course applying the str function to 'a' and 'b', but that is both implicit (the original data is always integers) and inefficient (iterating over the whole column, which might be quite big, just to apply a wasteful conversion).
Is there a simple way of doing this?
Thanks
Looks like get_dummies is what you want. It will take any column and convert it into one indicator (dummy) column per category.
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.get_dummies.html
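For example, a minimal sketch on the frame above. By default get_dummies only encodes object/categorical columns, so the integer columns have to be listed explicitly via columns=:
import pandas as pd

df = pd.DataFrame({'a': [2, 0, 1, 0],
                   'b': [0, 1, 2, 0],
                   'c': ['w', 'f', 'y', 'f'],
                   'd': ['K', 'K', 'L', 'M']})

# Every column named in columns= is one-hot encoded, regardless of its dtype.
encoded = pd.get_dummies(df, columns=['a', 'b', 'c', 'd'])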