dask - CountVectorizer returns "ValueError('Cannot infer dataframe metadata with a `dask.delayed` argument')" - dataframe

I have a Dask Dataframe with the following content:
X_trn y_trn
0 java repeat task every random seconds p m alre... LQ_CLOSE
1 are java optionals immutable p d like to under... HQ
2 text overlay image with darkened opacity react... HQ
3 ternary operator in swift is so picky p questi... HQ
4 hide show fab with scale animation p m using c... HQ
I am trying to use CountVectorizer from the dask_ml library. When I pass my X_trn to fit_transform, I get the error: ValueError: Cannot infer dataframe metadata with a `dask.delayed` argument.
vectorizer = CountVectorizer()
countMatrix = vectorizer.fit_transform(training['X_trn'])

This answer will probably come too late for the original author, but it may still help others. The answer is actually in the documentation; I also overlooked it at first:
The Dask-ML implementation currently requires that raw_documents is a
dask.bag.Bag of documents (lists of strings).
This apparently innocuous sentence is your problem: you are passing a dask.dataframe column and not a dask.bag.Bag of documents.
import dask.bag as db
corpus = db.from_sequence(training['X_trn'], npartitions=2)
And then, you can pass it to the vectorizer as you were doing:
X = vectorizer.fit_transform(corpus)
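For completeness, a minimal end-to-end sketch (CountVectorizer imported from dask_ml; npartitions=2 is an arbitrary choice):
import dask.bag as db
from dask_ml.feature_extraction.text import CountVectorizer

# wrap the raw documents in a bag, as the documentation requires
corpus = db.from_sequence(training['X_trn'], npartitions=2)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # lazy dask result; compute when needed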

Related

A more efficient way of creating an NxM array in Python

In Python, I need to create an NxM matrix in which the ij entry has the value i^2 + j^2.
I'm currently constructing it with two for loops, but the array is quite big, the computation time is long, and I need to build it several times. Is there a more efficient way to construct such a matrix, perhaps using NumPy?
You can use broadcasting in NumPy; see the official documentation for details. For example,
import numpy as np
N = 3; M = 4  # whatever values you'd like
a = (np.arange(N)**2).reshape((-1, 1))  # reshape to a column vector
b = np.arange(M)**2
print(a + b)  # broadcasting applied
Instead of np.arange(), you can use np.array([...some array...]) to customize the values.
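Applied directly to the original i^2 + j^2 question, np.ogrid builds the two broadcastable vectors in one step (a sketch; N and M are arbitrary here):
import numpy as np

N, M = 3, 4
i, j = np.ogrid[:N, :M]  # i has shape (N, 1), j has shape (1, M)
matrix = i**2 + j**2     # broadcasting produces the full (N, M) array
print(matrix)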

pandas apply for performance

I have a pandas function that runs inference over a 10k-row CSV of strings:
account messages
0 th_account Forgot to tell you Evan went to sleep a little...
1 th_account Hey I heard your buying a house I m getting ri...
2 th_account They re releasing a 16 MacBook
3 th_account 5 cups of coffee today I may break the record
4 th_account Apple Store Items in order W544414717 were del...
The function takes about 17 seconds to run.
I'm working on a text classifier and was wondering if there is a quicker way to write it
def _predict(messages):
    results = []
    for message in messages:
        message = vectorizer.transform([message])
        message = message.toarray()
        results.append(model.predict(message))
    return results

df["pred"] = _predict(df.messages.values)
the vectorizer is a TfidfVectorizer and model is a GaussianNB model from sklearn.
I need to loop through every message in the CSV and perform a prediction to be shown in a new column.
You can try the built-in pandas apply. It saves you from managing the loop and the results list yourself, but under the hood it still iterates in Python, so it may remain slow.
def _predict(message):
    """message is a single string from the messages column;
    each row of the dataframe yields one result."""
    message = vectorizer.transform([message])
    message = message.toarray()
    return model.predict(message)[0]  # predict returns an array of length 1

df["pred"] = df["messages"].apply(_predict)
You can run the following on a small sample to evaluate the timing:
df["messages"].head().apply(_predict)
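A likely faster alternative (my suggestion, not part of the original answer) is to skip the per-row loop entirely: TfidfVectorizer.transform accepts the whole column at once, and the model can then predict on the full matrix in a single call.
X = vectorizer.transform(df["messages"])  # sparse matrix of shape (n_rows, n_features)
df["pred"] = model.predict(X.toarray())   # GaussianNB requires a dense array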

image from [3,M,N] to [M,N,3]

I have a ndarray representing an image with different channels like this:
image = (8,100,100) where 8=channels, 100x100 the actual image per channel
I am interested in extracting the RGB components of that image:
imageRGB = np.take(image, [4,2,1], axis = 0)
in this way I have an array of (3,100,100) with the RGB components.
However, I need to visualize it, so I need an array of (100,100,3). I think it's quite straightforward, but all the methods I have tried do not work.
numpy einsum is a good tool for this.
Official document: https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html
import numpy as np
imageRGB = np.random.randint(0,5,size=(3,100,101))
# set the last dim to 101 just to make stuff more clear
imageRGB.shape
# (3,100,101)
imageRGB_reshape = np.einsum('kij->ijk',imageRGB)
imageRGB_reshape.shape
# (100,101,3)
In my opinion it's the clearest way to write and read.
Wow, thank you! I had never thought to use Einstein summation; it actually works very well.
Just out of curiosity, is it possible to build it manually?
For example:
R = image[4,:,:]
G = image[2,:,:]
B = image[1,:,:]
imageRGB = ???
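One manual way (my sketch, not from the original thread) is to stack the three channel slices along a new last axis; np.moveaxis on the selected channels is equivalent:
import numpy as np

image = np.random.randint(0, 256, size=(8, 100, 100))  # stand-in data

R = image[4, :, :]
G = image[2, :, :]
B = image[1, :, :]
imageRGB = np.stack([R, G, B], axis=-1)  # shape (100, 100, 3)

# equivalent one-liner: select the channels, then move axis 0 to the end
imageRGB_alt = np.moveaxis(image[[4, 2, 1]], 0, -1)
assert (imageRGB == imageRGB_alt).all()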

change scientific notation abbreviation of y axis units to a string

First, I would like to apologize, as I know I am not asking this question correctly (which is why I can't find what is likely a simple answer).
I have a graph
As you can see, above the y axis it says 1e11, meaning that the units are in hundreds of billions. I would like the graph to read 100 Billion instead of 1e11.
I am not sure what such notation is called (in matplotlib it is the axis "offset text").
To be clear, I am not asking to change the whole y axis to plain number values like other questions; I only want to change the top 1e11 to be more readable for those who are less mathematical.
ax.get_yaxis().get_major_formatter().set_scientific(False)
results in an undesired result
from matplotlib.ticker import FuncFormatter

def billions(x, pos):
    # pos is the tick position; FuncFormatter passes it even if unused
    return '$%1.1fB' % (x * 1e-9)

formatter = FuncFormatter(billions)
ax.yaxis.set_major_formatter(formatter)
located from https://matplotlib.org/examples/pylab_examples/custom_ticker1.html
produces the desired result.
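For reference, here is a self-contained version of the same approach (the plotted values are made up so that the axis reaches about 1e11):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

def billions(x, pos):
    return '$%1.1fB' % (x * 1e-9)

fig, ax = plt.subplots()
ax.plot(np.arange(10), np.arange(10) * 1e10)
ax.yaxis.set_major_formatter(FuncFormatter(billions))
plt.show()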

pandas access axis by user-defined name

I am wondering whether there is any way to access the axes of pandas containers (DataFrame, Panel, etc...) by a user-defined name instead of an integer or "index", "columns", "minor_axis", etc...
For example, with the following data container:
from pandas import DataFrame
from numpy.random import randn

df = DataFrame(randn(3, 2), columns=['c1', 'c2'], index=['i1', 'i2', 'i3'])
df.index.name = 'myaxis1'
df.columns.name = 'myaxis2'
I would like to do this:
df.sum(axis='myaxis1')
df.xs('c1', axis='myaxis2') # cross section
Also very useful would be:
df.reshape(['myaxis2','myaxis1'])
(in this case not so relevant, but it could become so if the dimension increases)
The reason is that I work a lot with multi-dimensional arrays of varying dimensions, like "time", "variable", "percentile", etc., and the same piece of code is often applied to objects which can be a DataFrame, a Panel, or even a Panel4D or a DataFrame with a MultiIndex. For now I often test the shape of the object, or the general settings of the script, in order to know which axis is the relevant one for computing a sum or mean. But I think it would be much more convenient to forget about how the container is implemented in detail (DataFrame, Panel, etc.) and simply think about the nature of the problem (say I want to average over time; I do not want to think about whether I am working in "probabilistic" mode with several percentiles, or in "deterministic" mode with a single time series).
Writing this post I have (re)discovered the very useful axes attribute. The above code could be translated into:
nms = [ax.name for ax in df.axes]
axid1 = nms.index('myaxis1')
axid2 = nms.index('myaxis2')
df.sum(axis=axid1)
df.xs('c1', axis=axid2) # cross section
and the "reshape" feature (does not apply to 3-d case though...):
newshape = ['myaxis2','myaxis1']
axid = [nms.index(nm) for nm in newshape]
df.swapaxes(*axid)
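The lookup can be wrapped in a small helper (hypothetical, just to generalize the snippets above):
def axis_id(obj, name):
    # return the positional axis of a pandas object whose .name matches
    return [ax.name for ax in obj.axes].index(name)

df.sum(axis=axis_id(df, 'myaxis1'))
df.xs('c1', axis=axis_id(df, 'myaxis2'))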
Well, I have to admit that I found these solutions while writing this post (and this is already very convenient), but they could be generalized to account for DataFrames (or other containers) with MultiIndex axes, searching across all axes and labels...
In my opinion it would be a major improvement to the user-friendliness of pandas (ok, forgetting about the actual structure could have a performance cost, but the user worried about performance can be careful in how he/she organizes the data).
What do you think?
This is still experimental, but look at this page:
http://pandas.pydata.org/pandas-docs/dev/dsintro.html#panelnd-experimental
import pandas
import numpy as np
from pandas.core import panelnd

MyPanel4D = panelnd.create_nd_panel_factory(
    klass_name='MyPanel4D',
    axis_orders=['axis4', 'axis3', 'axis2', 'axis1'],
    axis_slices={'axis3': 'items',
                 'axis2': 'major_axis',
                 'axis1': 'minor_axis'},
    slicer='Panel',
    stat_axis=2)

mp4d = MyPanel4D(np.random.rand(5, 4, 3, 2))
print(mp4d)
Results in this
<class 'pandas.core.panelnd.MyPanel4D'>
Dimensions: 5 (axis4) x 4 (axis3) x 3 (axis2) x 2 (axis1)
Axis4 axis: 0 to 4
Axis3 axis: 0 to 3
Axis2 axis: 0 to 2
Axis1 axis: 0 to 1
Here's the caveat: when you slice it like mp4d[0], you get back a Panel, unless you create a hierarchy of custom objects (unfortunately that will have to wait for 0.12-dev; support for 'renaming' Panel/DataFrame is non-trivial and there haven't been any requests for it).
So for higher-dimensional objects you can impose your own name structure. The axis aliasing should work like you are suggesting, but I think there are some bugs there.