How to print a specific piece of information from value_counts()? - pandas

import pandas as pd
data = {'qtd': [0, 1, 4, 0, 1, 3, 1, 3, 0, 0,
                3, 1, 3, 0, 1, 1, 0, 0, 1, 3,
                0, 1, 0, 0, 1, 0, 1, 0, 0, 1,
                0, 1, 1, 1, 1, 3, 0, 3, 0, 0,
                2, 0, 0, 2, 0, 0, 2, 0, 0, 2,
                0, 2, 0, 0, 2, 0, 0, 2, 0, 0,
                2, 0, 0, 2, 0, 0, 2, 0, 0, 1,
                1, 1, 1, 1, 0, 1, 0, 1, 0, 1,
                0, 1, 0, 1, 0, 1, 0, 1, 1, 1,
                1, 1, 1, 1, 1]
        }
df = pd.DataFrame(data, columns=['qtd'])
Counting
df['qtd'].value_counts()
0 43
1 34
2 10
3 7
4 1
Name: qtd, dtype: int64
What I want is to print a phrase: "The total with zero occurrences is 43"
I tried .head(1), but it shows more than I want.

Does this solve your problem? The [0] selects the entry whose index label is 0, that is, the count for the value 0 in your column. It is a label lookup, not a positional one; here 0 also happens to be the most frequent value, so it is listed first.
print('The total with zero occurrences is:', df['qtd'].value_counts()[0])
The output of the code above will be:
The total with zero occurrences is: 43
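Because value_counts() returns a Series indexed by the distinct values, the count for a given value can be read by label. The sketch below assumes a smaller, made-up column (not the question's data) and illustrative names (counts, zero_count, missing_count); it uses .loc for an explicit label lookup and .get with a default for values that never occur.

```python
import pandas as pd

# Small illustrative column (assumption: not the data from the question).
df = pd.DataFrame({'qtd': [0, 1, 4, 0, 1, 3, 0]})

counts = df['qtd'].value_counts()

# .loc makes it explicit that 0 is an index label (the counted value),
# not a position in the result.
zero_count = counts.loc[0]

# .get returns a default instead of raising KeyError when a value never occurs.
missing_count = counts.get(7, 0)

message = f"The total with zero occurrences is {zero_count}"
print(message)
```

Using .get is mainly useful when the value being asked about may be absent from the column.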

I am not sure this is what you want, but it may be helpful:
import inflect
e = inflect.engine()
(df['qtd'].map(e.number_to_words).radd("The total with ").add(" occurrences is ")
    .value_counts().astype(str).reset_index().agg(':'.join, axis=1))
0 The total with zero occurrences is :43
1 The total with one occurrences is :34
2 The total with two occurrences is :10
3 The total with three occurrences is :7
4 The total with four occurrences is :1
dtype: object

Related

CountVectorizer not working in ColumnTransformer

Combining CountVectorizer() with ColumnTransformer() gives me an error. Here is a reproduced case:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Create a sample data frame
df = pd.DataFrame({
    'corpus': ['This is the first document.', 'This document is the second document.', 'And this is the third one.',
               'Is this the first document?', 'I have the fourth document'],
    'word_length': [27, 37, 26, 27, 26]
})
text_feature = ["corpus"]
count_transformer = CountVectorizer()
# Create the ColumnTransformer
ct = ColumnTransformer(transformers=[
    ("count", count_transformer, text_feature)],
    remainder='passthrough')
ct.fit_transform(df)
The output says:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 5
I tried the code below, which does the job but doesn't scale as easily as ColumnTransformer():
import numpy as np
np.c_[count_transformer.fit_transform(df["corpus"]).toarray(), df["word_length"].values]
The result is the numpy array below:
array([[ 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 27],
       [ 0, 2, 0, 0, 0, 1, 0, 1, 1, 0, 1, 37],
       [ 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 26],
       [ 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 27],
       [ 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 26]], dtype=int64)
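One likely cause of the ValueError above: CountVectorizer expects a 1-D sequence of strings, while a list selector such as ["corpus"] hands it a 2-D, single-column frame. Passing the column name as a plain string makes ColumnTransformer select a 1-D column instead. A minimal sketch under that assumption, reusing the question's data (the names text_column, result and dense are illustrative, not from the question):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    'corpus': ['This is the first document.',
               'This document is the second document.',
               'And this is the third one.',
               'Is this the first document?',
               'I have the fourth document'],
    'word_length': [27, 37, 26, 27, 26],
})

# A scalar string (not a list) selects the column as a 1-D Series,
# which is the input shape CountVectorizer works with.
text_column = "corpus"

ct = ColumnTransformer(
    transformers=[("count", CountVectorizer(), text_column)],
    remainder='passthrough',
)
result = ct.fit_transform(df)

# The result may be sparse or dense depending on its density,
# so convert defensively before inspecting it.
dense = result.toarray() if hasattr(result, "toarray") else np.asarray(result)
print(dense.shape)
```

The token-count columns come first and the passed-through word_length column comes last, matching the layout of the np.c_ workaround.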

Numpy array of u32 to binary bool array

I have an array:
a = np.array([[3701687786, 458299110, 2500872618, 3633119408],
              [2377269574, 2599949379, 717229868, 137866584]], dtype=np.uint32)
How do I convert this to an array of shape (2, 128) where 128 represents the binary value for all the numbers in each row (boolean dtype) ?
I can sort of get the bytes by using list comprehension:
b''.join(struct.pack('I', x) for x in a[0])
But this is not quite right. How can I do this using numpy?
You can do bitwise and with the powers of 2:
vals = a[..., None] & 2**np.arange(32, dtype='uint32')[None, None, ::-1]
# (vals > 0) is already boolean; astype(int) is only used to print 0/1
out = (vals > 0).astype(int).reshape(a.shape[0], -1)
# test
out[0,:32]
Output:
array([1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,
0, 1, 1, 1, 1, 0, 1, 0, 1, 0])
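As an alternative to masking with powers of 2, the same bits can be obtained with np.unpackbits. This is a sketch of a different technique than the answer above, assuming the bits are wanted most significant first; the names as_bytes and bits are illustrative.

```python
import numpy as np

a = np.array([[3701687786, 458299110, 2500872618, 3633119408],
              [2377269574, 2599949379, 717229868, 137866584]], dtype=np.uint32)

# Store each value big-endian so the most significant byte comes first,
# then reinterpret every 4-byte integer as 4 separate bytes: shape (2, 16).
as_bytes = a.astype('>u4').view(np.uint8)

# Expand each byte into 8 bits (most significant bit first): shape (2, 128).
bits = np.unpackbits(as_bytes, axis=1).astype(bool)

print(bits.shape)
print(bits[0, :32].astype(int))
```

The explicit big-endian conversion keeps the result independent of the machine's native byte order.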

Fill values in numpy array that are between a certain value

Let's say I have an array that looks like this:
a = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0])
I want to fill the values that are between 1's with 1's.
So this would be the desired output:
a = np.array([0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0])
I have taken a look into this answer, which yields the following:
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1])
I am sure this answer is really close to the output I want. However, despite countless attempts, I can't adapt that code to work the way I want, as I am not that proficient with NumPy arrays.
Any help is much appreciated!
Try this
b = ((a == 1).cumsum() % 2) | a
Out[10]:
array([0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0], dtype=int32)
From @Paul Panzer: use ufunc.accumulate with bitwise_xor
b = np.bitwise_xor.accumulate(a)|a
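Both one-liners above rely on the same idea: the running count of 1s is odd while a pair is open and even once it has been closed, and OR-ing with the original array puts the closing 1 back. A commented sketch of the cumsum variant, assuming the 1s always come in open/close pairs (the function name fill_between_ones is illustrative):

```python
import numpy as np

def fill_between_ones(a):
    """Fill the gaps between paired 1s with 1s (assumes 1s come in pairs)."""
    a = np.asarray(a)
    # Number of 1s seen so far, including the current position.
    ones_seen = (a == 1).cumsum()
    # Odd count: an opening 1 has been seen but not yet its closing partner.
    inside = ones_seen % 2
    # The closing 1 makes the count even, so OR with `a` restores it.
    return inside | a

a = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
              1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0])
b = fill_between_ones(a)
print(b)
```

If the array contains an odd number of 1s, everything after the last unmatched 1 is filled as well.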
Try this:
import numpy as np
num_lst = np.array(
[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0])
i = 0
while i < len(num_lst) - 1:  # Stop one short so num_lst[i+1] is always valid
    if num_lst[i]:  # Check if element is 1 at i-th position
        if not num_lst[i+1]:  # Check if next element is 0
            num_lst[i+1] = 1  # Change next element to 1
            i += 1  # Continue through loop
        else:  # Next element is 1
            i += 2  # Skip next element
    else:
        i += 1  # Continue through loop
print(num_lst)
This is probably not the most elegant way to execute this, but it should work. Basically, we loop through the list to find any 1s. When we find an element that is 1, we check if the next element is 0. If it is, then we change the next element to 1. If the next element is 1, that means we should stop changing 0s to 1s, so we jump over that element and proceed with the iteration.

Vectorize this for loop in numpy

I am trying to compute matrix z (defined below) in python with numpy.
Here's my current solution (using 1 for loop)
z = np.zeros((n, k))
for i in range(n):
    v = pi * (1 / math.factorial(x[i])) * np.exp(-1 * lamb) * (lamb ** x[i])
    numerator = np.sum(v)
    c = v / numerator
    z[i, :] = c
return z
Is it possible to completely vectorize this computation? I need to do this computation for thousands of iterations, and matrix operations in NumPy are much faster than huge for loops.
Here is a vectorized version of E. It replaces the for-loop and scalar arithmetic with NumPy broadcasting and array-based arithmetic:
def alt_E(x):
    x = x[:, None]
    z = pi * (np.exp(-lamb) * (lamb**x)) / special.factorial(x)
    denom = z.sum(axis=1)[:, None]
    z /= denom
    return z
I ran em.py to get a sense for the typical size of x, lamb, pi, n and k. On data of this size,
alt_E is about 120x faster than E:
In [32]: %timeit E(x)
100 loops, best of 3: 11.5 ms per loop
In [33]: %timeit alt_E(x)
10000 loops, best of 3: 94.7 µs per loop
In [34]: 11500/94.7
Out[34]: 121.43611404435057
This is the setup I used for the benchmark:
import math
import numpy as np
import scipy.special as special
def alt_E(x):
    x = x[:, None]
    z = pi * (np.exp(-lamb) * (lamb**x)) / special.factorial(x)
    denom = z.sum(axis=1)[:, None]
    z /= denom
    return z
def E(x):
    z = np.zeros((n, k))
    for i in range(n):
        v = pi * (1 / math.factorial(x[i])) * \
            np.exp(-1 * lamb) * (lamb ** x[i])
        numerator = np.sum(v)
        c = v / numerator
        z[i, :] = c
    return z
n = 576
k = 2
x = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5])
lamb = np.array([ 0.84835141, 1.04025989])
pi = np.array([ 0.5806958, 0.4193042])
assert np.allclose(alt_E(x), E(x))
By the way, E could also be calculated using scipy.stats.poisson:
import scipy.stats as stats
pois = stats.poisson(mu=lamb)
def alt_E2(x):
    z = pi * pois.pmf(x[:,None])
    denom = z.sum(axis=1)[:, None]
    z /= denom
    return z
but this does not turn out to be faster, at least for arrays of this length:
In [33]: %timeit alt_E(x)
10000 loops, best of 3: 94.7 µs per loop
In [102]: %timeit alt_E2(x)
1000 loops, best of 3: 278 µs per loop
For larger x, alt_E2 is faster:
In [104]: x = np.random.random(10000)
In [106]: %timeit alt_E(x)
100 loops, best of 3: 2.18 ms per loop
In [105]: %timeit alt_E2(x)
1000 loops, best of 3: 643 µs per loop
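The broadcasting step in alt_E can be checked on a small input: x[:, None] has shape (n, 1), lamb and pi have shape (k,), so their combination has shape (n, k), one row per observation. The sketch below is NumPy-only and swaps scipy.special.factorial for a factorial lookup table, which is an assumption made here to avoid the SciPy dependency; the function names posterior_loop and posterior_broadcast are illustrative.

```python
import math
import numpy as np

def posterior_loop(x, lamb, pi):
    """Row-by-row reference, mirroring the loop-based E."""
    z = np.zeros((len(x), len(lamb)))
    for i in range(len(x)):
        v = pi * np.exp(-lamb) * lamb ** x[i] / math.factorial(int(x[i]))
        z[i, :] = v / v.sum()
    return z

def posterior_broadcast(x, lamb, pi):
    """Same computation with broadcasting instead of a Python loop."""
    # Lookup table of 0! .. max(x)!, used in place of scipy.special.factorial
    # (an assumption made to keep this sketch NumPy-only).
    fact = np.array([math.factorial(i) for i in range(int(x.max()) + 1)],
                    dtype=float)
    col = x[:, None]                                   # shape (n, 1)
    z = pi * np.exp(-lamb) * lamb ** col / fact[col]   # (n, 1) with (k,) -> (n, k)
    return z / z.sum(axis=1, keepdims=True)            # normalize each row

x = np.array([0, 1, 2, 5])                 # small integer counts
lamb = np.array([0.84835141, 1.04025989])
pi = np.array([0.5806958, 0.4193042])

z = posterior_broadcast(x, lamb, pi)
print(z.shape)
print(z.sum(axis=1))
```

The lookup table only works for non-negative integer x; for other inputs scipy.special.factorial or scipy.stats.poisson, as used above, are the safer choices.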

Transform a matrix made of binomial vectors to ranges for consecutive zeros

I am trying to figure out how to do this transformation symbolically in Theano on a matrix of undetermined size.
From:
[[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1],
.
.
]
To:
[[1, 2, 3, 0, 1, 2, 3, 4, 5, 0, 0, 1, 0, 1, 2, 3, 0],
[1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 0, 0, 0, 0, 0, 0],
.
.
]
So for every run of consecutive 0s I want an increasing range, and whenever I hit a 1 the range resets.
Here's one way to do it, using inefficient scans:
import theano
import theano.tensor as tt
def inner_step(x_t_t, y_t_tm1):
    # Reset to 0 on a 1, otherwise extend the previous count by one
    return tt.switch(x_t_t, 0, y_t_tm1 + 1)
def outer_step(x_t):
    # Scan along one row, starting the count at 0
    return theano.scan(inner_step, sequences=[x_t], outputs_info=[0])[0]
def compile():
    x = tt.bmatrix()
    # Scan over the rows of the matrix
    y = theano.scan(outer_step, sequences=[x])[0]
    return theano.function([x], y)
def main():
    f = compile()
    data = [[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1],
            [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]]
    print(f(data))
main()
When run, this prints:
[[1 2 3 0 1 2 3 4 5 0 0 1 0 1 2 3 0]
[1 2 3 4 5 6 7 8 0 1 2 0 0 0 0 0 0]]
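For checking the scan-based result outside Theano, the same recurrence (reset to 0 on a 1, otherwise previous count plus one) can be written in plain NumPy without a loop. This is a reference sketch, not a symbolic Theano solution, and the name zero_run_lengths is illustrative: each output is the distance from the current column to the most recent 1 in that row.

```python
import numpy as np

def zero_run_lengths(x):
    """For each row, count consecutive 0s, resetting to 0 at every 1."""
    x = np.asarray(x)
    # 1-based column positions, broadcast across all rows.
    positions = np.arange(1, x.shape[1] + 1)
    # Position of the most recent 1 at or before each column (0 if none yet).
    last_one = np.maximum.accumulate(np.where(x == 1, positions, 0), axis=1)
    # Distance to that 1 is the length of the current run of 0s.
    return positions - last_one

data = [[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]]
result = zero_run_lengths(data)
print(result)
```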