CountVectorizer not working in ColumnTransformer - pandas

Combining CountVectorizer() with ColumnTransformer() gives me an error. Here is a reproduced case:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Create a sample data frame
df = pd.DataFrame({
    'corpus': ['This is the first document.', 'This document is the second document.', 'And this is the third one.',
               'Is this the first document?', 'I have the fourth document'],
    'word_length': [27, 37, 26, 27, 26]
})
text_feature = ["corpus"]
count_transformer = CountVectorizer()
# Create the ColumnTransformer
ct = ColumnTransformer(transformers=[
    ("count", count_transformer, text_feature)],
    remainder='passthrough')
ct.fit_transform(df)
The output says:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 5
I tried the code below, which does the job but doesn't scale as easily as ColumnTransformer():
np.c_[count_transformer.fit_transform(df["corpus"]).toarray(), df["word_length"].values]
The result is the numpy array below:
array([[ 0,  1,  1,  0,  0,  1,  0,  0,  1,  0,  1, 27],
       [ 0,  2,  0,  0,  0,  1,  0,  1,  1,  0,  1, 37],
       [ 1,  0,  0,  0,  0,  1,  1,  0,  1,  1,  1, 26],
       [ 0,  1,  1,  0,  0,  1,  0,  0,  1,  0,  1, 27],
       [ 0,  1,  0,  1,  1,  0,  0,  0,  1,  0,  0, 26]], dtype=int64)
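A likely cause, offered as a sketch rather than a definitive diagnosis: when the column selector is a list such as ["corpus"], ColumnTransformer hands the transformer a 2D DataFrame, and CountVectorizer then iterates over its column names and sees a single "document". Passing the column as a plain string selects a 1D Series instead. The example below assumes that documented scikit-learn behavior and reuses the data from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    'corpus': ['This is the first document.', 'This document is the second document.',
               'And this is the third one.', 'Is this the first document?',
               'I have the fourth document'],
    'word_length': [27, 37, 26, 27, 26],
})

# Note the scalar string "corpus" instead of the list ["corpus"]:
# a string selects a 1D Series, which is what CountVectorizer expects.
ct = ColumnTransformer(
    transformers=[("count", CountVectorizer(), "corpus")],
    remainder='passthrough',
)
result = ct.fit_transform(df)

# The stacked result may be sparse or dense depending on its density.
dense = result.toarray() if hasattr(result, "toarray") else result
print(dense.shape)
```

The word counts come first and the passed-through word_length column is appended last, matching the layout of the np.c_ workaround.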

Related

Numpy array of u32 to binary bool array

I have an array:
a = np.array([[3701687786, 458299110, 2500872618, 3633119408],
[2377269574, 2599949379, 717229868, 137866584]], dtype=np.uint32)
How do I convert this to an array of shape (2, 128) where 128 represents the binary value for all the numbers in each row (boolean dtype) ?
I can sort of get the bytes by using list comprehension:
b''.join(struct.pack('I', x) for x in a[0])
But this is not quite right. How can I do this using numpy?
You can do bitwise and with the powers of 2:
vals = (a[...,None] & 2**np.arange(32, dtype='uint32')[None,None,::-1])
out = (vals>0).astype(int).reshape(a.shape[0],-1)
# test
out[0,:32]
Output:
array([1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,
0, 1, 1, 1, 1, 0, 1, 0, 1, 0])
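An alternative sketch, not taken from the answer above: np.unpackbits works on uint8 data, so the values can be stored big-endian, viewed as bytes, and unpacked. This assumes the bits should appear most significant first, as in the output shown, and it returns the boolean dtype the question asked for.

```python
import numpy as np

a = np.array([[3701687786, 458299110, 2500872618, 3633119408],
              [2377269574, 2599949379, 717229868, 137866584]], dtype=np.uint32)

# Store each value most-significant byte first, then reinterpret as raw bytes.
as_bytes = a.astype('>u4').view(np.uint8)            # shape (2, 16)

# Unpack every byte into 8 bits (most significant bit first by default).
bits = np.unpackbits(as_bytes, axis=1).astype(bool)  # shape (2, 128)

print(bits[0, :32].astype(int))
```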

How to print a specific value from value_counts()?

import pandas as pd
data = {'qtd': [0, 1, 4, 0, 1, 3, 1, 3, 0, 0,
3, 1, 3, 0, 1, 1, 0, 0, 1, 3,
0, 1, 0, 0, 1, 0, 1, 0, 0, 1,
0, 1, 1, 1, 1, 3, 0, 3, 0, 0,
2, 0, 0, 2, 0, 0, 2, 0, 0, 2,
0, 2, 0, 0, 2, 0, 0, 2, 0, 0,
2, 0, 0, 2, 0, 0, 2, 0, 0, 1,
1, 1, 1, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 1, 1,
1, 1, 1, 1, 1]
}
df = pd.DataFrame (data, columns = ['qtd'])
Counting
df['qtd'].value_counts()
0 43
1 34
2 10
3 7
4 1
Name: qtd, dtype: int64
What I want is to print a phrase: "The total with zero occurrences is 43"
I tried .head(1), but it shows more than I want.
Does this solve your problem? The [0] looks up the count stored under the index label 0, that is, the number of zeros in the column. Because value_counts() sorts by frequency, it also happens to be the first row here.
print('The total with zero occurrences is:', df['qtd'].value_counts()[0])
The output of the code above will be:
The total with zero occurrences is: 43
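As a small sketch of the difference between looking up a label and looking up a position in the result of value_counts(): the Series below is a made-up stand-in for df['qtd'], and the names qtd, counts and zero_count are only illustrative. Series.get returns a default instead of raising when the value never occurs.

```python
import pandas as pd

# Hypothetical stand-in for df['qtd']: three zeros, two ones, one four.
qtd = pd.Series([0, 0, 0, 1, 1, 4], name='qtd')
counts = qtd.value_counts()

zero_count = counts.get(0, 0)   # label lookup: how many times 0 occurs
missing = counts.get(2, 0)      # 2 never occurs, so the default 0 is returned
most_common = counts.iloc[0]    # positional lookup: count of the most frequent value

print(f"The total with zero occurrences is {zero_count}")
```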
I am not sure this is what you want, but it may be helpful:
import inflect
e = inflect.engine()
(df['qtd'].map(e.number_to_words).radd("The total with ").add(" occurrences is ")
.value_counts().astype(str).reset_index().agg(':'.join,1))
0 The total with zero occurrences is :43
1 The total with one occurrences is :34
2 The total with two occurrences is :10
3 The total with three occurrences is :7
4 The total with four occurrences is :1
dtype: object

Counting zeros in a rolling - numpy array (including NaNs)

I am trying to find a way to count zeros in a rolling window over a numpy array.
With pandas I can do it using:
df['demand'].apply(lambda x: (x == 0).rolling(7).sum()).fillna(0)
or
df['demand'].transform(lambda x: x.rolling(7).apply(lambda x: 7 - np.count_nonzero(x))).fillna(0)
In numpy, using the code from here:
def rolling_window(a, window_size):
    shape = (a.shape[0] - window_size + 1, window_size) + a.shape[1:]
    print(shape)
    strides = (a.strides[0],) + a.strides
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
arr = np.asarray([10, 20, 30, 5, 6, 0, 0, 0])
np.count_nonzero(rolling_window(arr==0, 7), axis=1)
Output:
array([2, 3])
However, I also need the first 6 positions, which are NaN in the pandas result, filled with zeros:
Expected output:
array([0, 0, 0, 0, 0, 0, 2, 3])
I think an efficient approach would be 1D convolution -
def sum_occurences_windowed(arr, W):
    K = np.ones(W, dtype=int)
    out = np.convolve(arr==0,K)[:len(arr)]
    out[:W-1] = 0
    return out
Sample run -
In [42]: arr
Out[42]: array([10, 20, 30, 5, 6, 0, 0, 0])
In [43]: sum_occurences_windowed(arr,W=7)
Out[43]: array([0, 0, 0, 0, 0, 0, 2, 3])
Timings on varying length arrays and window of 7
Including count_rolling from @Quang Hoang's post.
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
import benchit
funcs = [sum_occurences_windowed, count_rolling]
in_ = {n:(np.random.randint(0,5,(n)),7) for n in [10,20,50,100,200,500,1000,2000,5000]}
t = benchit.timings(funcs, in_, multivar=True, input_name='Length')
t.plot(logx=True, save='timings.png')
Extending to generic n-dim arrays
from scipy.ndimage import convolve1d
def sum_occurences_windowed_ndim(arr, W, axis=-1):
    K = np.ones(W, dtype=int)
    out = convolve1d((arr==0).astype(int),K,axis=axis,origin=-(W//2))
    out.swapaxes(axis,0)[:W-1] = 0
    return out
So, on a 2D array, for counting along each row, use axis=1 and for cols, axis=0 and so on.
Sample run -
In [155]: np.random.seed(0)
In [156]: a = np.random.randint(0,3,(3,10))
In [157]: a
Out[157]:
array([[0, 1, 0, 1, 1, 2, 0, 2, 0, 0],
[0, 2, 1, 2, 2, 0, 1, 1, 1, 1],
[0, 1, 0, 0, 1, 2, 0, 2, 0, 1]])
In [158]: sum_occurences_windowed_ndim(a, W=7)
Out[158]:
array([[0, 0, 0, 0, 0, 0, 3, 2, 3, 3],
[0, 0, 0, 0, 0, 0, 2, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 4, 3, 4, 3]])
# Verify with earlier 1D solution
In [159]: np.vstack([sum_occurences_windowed(i,7) for i in a])
Out[159]:
array([[0, 0, 0, 0, 0, 0, 3, 2, 3, 3],
[0, 0, 0, 0, 0, 0, 2, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 4, 3, 4, 3]])
Let's test out our original 1D input array -
In [187]: arr
Out[187]: array([10, 20, 30, 5, 6, 0, 0, 0])
In [188]: sum_occurences_windowed_ndim(arr, W=7)
Out[188]: array([0, 0, 0, 0, 0, 0, 2, 3])
I would modify the function as follows:
def count_rolling(a, window_size):
    shape = (a.shape[0] - window_size + 1, window_size) + a.shape[1:]
    strides = (a.strides[0],) + a.strides
    rolling = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    out = np.zeros_like(a)
    out[window_size-1:] = (rolling == 0).sum(1)
    return out
arr = np.asarray([10, 20, 30, 5, 6, 0, 0, 0])
count_rolling(arr,7)
Output:
array([0, 0, 0, 0, 0, 0, 2, 3])
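On NumPy 1.20 or later, the same idea can probably be written without computing strides by hand, using numpy.lib.stride_tricks.sliding_window_view. This is a sketch for 1D input only, and the function name count_zeros_rolling is made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def count_zeros_rolling(arr, window):
    """Count zeros in each trailing window; incomplete windows are reported as 0."""
    arr = np.asarray(arr)
    out = np.zeros(arr.shape[0], dtype=int)
    if arr.shape[0] >= window:
        # Read-only view of shape (n - window + 1, window); no data is copied.
        windows = sliding_window_view(arr == 0, window)
        out[window - 1:] = windows.sum(axis=1)
    return out

arr = np.asarray([10, 20, 30, 5, 6, 0, 0, 0])
result = count_zeros_rolling(arr, 7)
print(result)
```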

Fill values in numpy array that are between a certain value

Let's say I have an array that looks like this:
a = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0])
I want to fill the values that are between 1's with 1's.
So this would be the desired output:
a = np.array([0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0])
I have taken a look into this answer, which yields the following:
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1])
I am sure this answer is really close to the output I want. However, despite countless attempts, I can't adapt the code to work the way I want, as I am not that proficient with numpy arrays.
Any help is much appreciated!
Try this
b = ((a == 1).cumsum() % 2) | a
Out[10]:
array([0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0], dtype=int32)
From @Paul Panzer: use ufunc.accumulate with bitwise_xor
b = np.bitwise_xor.accumulate(a)|a
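To see why the parity trick works, it may help to look at the intermediate arrays. The sketch below uses the array from the question; the names ones_seen, inside, filled and xor_filled are only illustrative. It relies on the 1s coming in opening and closing pairs.

```python
import numpy as np

a = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0])

# Running count of 1s seen so far, including the current position.
ones_seen = (a == 1).cumsum()

# An odd count means an opening 1 has been seen but not yet its closing 1.
inside = ones_seen % 2

# The closing 1 itself has an even count, so OR with a to keep it.
filled = inside | a

# Same result with a running XOR, which tracks the parity directly.
xor_filled = np.bitwise_xor.accumulate(a) | a

print(filled)
```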
Try this:
import numpy as np
num_lst = np.array(
[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0])
i = 0
while i < len(num_lst):  # Iterate through the list
    if num_lst[i]:  # Check if element is 1 at i-th position
        if not num_lst[i+1]:  # Check if next element is 0
            num_lst[i+1] = 1  # Change next element to 1
            i += 1  # Continue through loop
        else:  # Check if next element is 1
            i += 2  # Skip next element
    else:
        i += 1  # Continue through loop
print(num_lst)
This is probably not the most elegant way to do this, but it works for the sample input. Basically, we loop through the list to find any 1s. When we find an element that is 1, we check if the next element is 0. If it is, then we change the next element to 1. If the next element is 1, that means we should stop changing 0s to 1s, so we jump over that element and proceed with the iteration. Note that this assumes every opening 1 has a matching closing 1: if the last 1 in the array is unmatched, num_lst[i+1] eventually raises an IndexError.

Is there a way to slice out multiple 2D numpy arrays from one 2D numpy array in one batch operation?

I have a numpy array heatmap of shape (img_height, img_width) and another array bboxes of shape (K, 4), where K is a number of bounding boxes.
Each bounding box is defined like so: [x_top_left, y_top_left, width, height].
Here's an example of such array:
bboxes = np.array([
[0, 0, 4, 7],
[3, 4, 3, 4],
[7, 2, 3, 7]
])
heatmap is initially filled with zeros.
What I need to do is to put the value 1 in the region covered by each bounding box.
The resulting heatmap should be:
heatmap = np.array([
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
[1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0],
[0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
])
Important things to note:
axis 0 corresponds to image height
axis 1 corresponds to image width
I've already solved this using python for loop, like so:
for bbox in bboxes:
    # rows: y_top_left : y_top_left + height, columns: x_top_left : x_top_left + width
    heatmap[bbox[1] : bbox[1] + bbox[3], bbox[0] : bbox[0] + bbox[2]] = 1
I would like to avoid using python for loops (if it's possible) and be able to do something like this:
heatmap[bboxes[:,1] : bboxes[:,1] + bboxes[:,3], bboxes[:,0]:bboxes[:,0] + bboxes[:,2]] = 1
Is there a way of doing such multiple slicing in numpy?
I am aware of numpy integer array indexing, but to generate such indices I am also unable to avoid python for loops.
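Slices cannot take arrays as bounds, but one loop-free possibility is to compare broadcast row and column coordinates against the box edges. The sketch below assumes a 10 x 11 image, as in the example, and builds a temporary boolean array of shape (K, img_height, img_width), so for many boxes or large images the plain loop may well be cheaper.

```python
import numpy as np

bboxes = np.array([
    [0, 0, 4, 7],
    [3, 4, 3, 4],
    [7, 2, 3, 7],
])
img_height, img_width = 10, 11  # assumed from the example heatmap

# Split the columns and reshape each to (K, 1, 1) for broadcasting.
x, y, w, h = (col[:, None, None] for col in bboxes.T)

rows = np.arange(img_height)[None, :, None]  # shape (1, H, 1)
cols = np.arange(img_width)[None, None, :]   # shape (1, 1, W)

# inside[k, r, c] is True when pixel (r, c) lies within box k.
inside = (rows >= y) & (rows < y + h) & (cols >= x) & (cols < x + w)

# A pixel is set if any box covers it.
heatmap = inside.any(axis=0).astype(int)
print(heatmap)
```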