Related
In François Chollet's Deep Learning with Python, appears this function:
def vectorize_sequences(sequences, dimension=10000):
results = np.zeros((len(sequences), dimension))
for i, sequence in enumerate(sequences):
results[i, sequence] = 1.
return results
I understand what this function does. This function is asked about in this quesion and in this question as well, also mentioned here, here, here, here, here & here. Despite being so wide-spread, this vectorization is, according to Chollet's book is done "manually for maximum clarity." I am interested whether there is a standard, not "manual" way of doing it.
Is there a standard Keras / Tensorflow / Scikit-learn / Pandas / Numpy implementation of a function which behaves very similarly to the function above?
Solution with MultiLabelBinarizer
Assuming sequences is an array of integers with maximum possible value upto dimension-1, we can use MultiLabelBinarizer from sklearn.preprocessing to replicate the behaviour of the function vectorize_sequences
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=range(dimension))
mlb.fit_transform(sequences)
Solution with Numpy broadcasting
Assuming sequences is an array of integers with maximum possible value upto dimension-1
(np.array(sequences)[:, :, None] == range(dimension)).any(1).view('i1')
Worked out example
>>> sequences
[[4, 1, 0],
[4, 0, 3],
[3, 4, 2]]
>>> dimension = 10
>>> mlb = MultiLabelBinarizer(classes=range(dimension))
>>> mlb.fit_transform(sequences)
array([[1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
[1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0, 0, 0, 0, 0]])
>>> (np.array(sequences)[:, :, None] == range(dimension)).any(1).view('i1')
array([[0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 0, 0, 0, 0, 0]])
I'm a noob in python!
I'd like to get sequences and anomaly together like this:
and sort only normal sequence.(if a value of anomaly column is 0, it's a normal sequence)
turn normal sequences to numpy array (without anomaly column)
each row(Sequence) is one session. so in this case their are 6 independent sequences.
each element represent some specific activity.
'''
sequence = np.array([[5, 1, 1, 0, 0, 0],
[5, 1, 1, 0, 0, 0],
[5, 1, 1, 0, 0, 0],
[5, 1, 1, 0, 0, 0],
[5, 1, 1, 0, 0, 0],
[5, 1, 1, 300, 200, 100]])
anomaly = np.array((0,0,0,0,0,1))
'''
i got these two variables and have to sort only normal sequences.
Here is the code i tried:
'''
# sequence to dataframe
empty_df = pd.DataFrame(columns = ['Sequence'])
empty_df.reset_index()
for i in range(sequence.shape[0]):
empty_df = empty_df.append({"Sequence":sequence[i]},ignore_index = True) #
#concat anomaly
anomaly_df = pd.DataFrame(anomaly)
df = pd.concat([empty_df,anomaly_df],axis = 1)
df.columns = ['Sequence','anomaly']
df
'''
I didn't want to use pd.DataFrame because it gives me this:
pd.DataFrame(sequence)
anyways, after making df, I tried to sort normal sequences
#sorting normal seq
normal = df[df['anomaly'] == 0]['Sequence']
# back to numpy. only sequence column.
normal = normal.to_numpy()
normal.shape
'''
and this numpy gives me different shape1 from the variable sequence.
sequence.shape: (6,6) normal.shape =(5,)
I want to have (5,6). Tried reshape but didn't work..
Can someone help me with this?
If there are any unspecific explanation from my question, plz leave a comment. I appreciate it.
I am not quite sure of what you need but here you could do:
import pandas as pd
df = pd.DataFrame({'sequence':sequence.tolist(), 'anomaly':anomaly})
df
sequence anomaly
0 [5, 1, 1, 0, 0, 0] 0
1 [5, 1, 1, 0, 0, 0] 0
2 [5, 1, 1, 0, 0, 0] 0
3 [5, 1, 1, 0, 0, 0] 0
4 [5, 1, 1, 0, 0, 0] 0
5 [5, 1, 1, 300, 200, 100] 1
Convert it into list then create an array.
Try:
normal = df.loc[df['anomaly'].eq(0), 'Sequence']
normal = np.array(normal.tolist())
print(normal.shape)
# (5,6)
I have a matrix of size (456, 456). I would like to make it of size (460, 460) but adding a frame of two zeros all around it.
Here is an example with a smaller matrix. I would like to transform matrixsmall into matrixbig. What is the best way to do it? The original code operates on lots of data to it would be great to have an efficient solution. Thank you in advance for your help!
import numpy as np
matrixsmall = np.array([[1,2],[2,1]])
matrixbig = np.array([[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 1, 2, 0, 0],
[0, 0, 2, 1, 0, 0],
[0 ,0 ,0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]])
np.pad(matrixsmall, (2,2), "constant", constant_values=(0,0))
will do the trick
I have a array from my teacher, he gave me an array like below:
array contains 0,1,none
[[1, 1, 0, 0, none, 0, 1], [1, 0, 0, 0, none,0, 1], [1, 1, none,0, 1, 0, none], [1,1,1,0,none, 0, 0], [1, 1,0, none, 0, 0,1]]
and asked me to duplicate the array ten times but however each column must have similar percent distribution say each column have no more than 8% percent compared with the origin array.
how should i achieve the goal?
I want to do a voxel-based measurement of spherical objects, represented in a numpy array. Because of the sampling, these spheres are represented as a group of cubes (because they are sampled in the array). I want to do a simulation of the introduced error by this grid-restriction. Is there any way to paint a 3D sphere in a numpy grid to run my simulations on? (So basically, a sphere of unit length one, would be one point in the array)
Or is there another way of calculating the error introduced by sampling?
In 2-D that seems to be easy...
The most direct approach is to create a bounding box array, holding at each point the distance to the center of the sphere:
>>> radius = 3
>>> r2 = np.arange(-radius, radius+1)**2
>>> dist2 = r2[:, None, None] + r2[:, None] + r2
>>> volume = np.sum(dist2 <= radius**2)
>>> volume
123
The 2D case is easier to visualize:
>>> dist2 = r2[:, None] + r2
>>> (dist2 <= radius**2).astype(np.int)
array([[0, 0, 0, 1, 0, 0, 0],
[0, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 0],
[0, 0, 0, 1, 0, 0, 0]])
>>> np.sum(dist2 <= radius**2)
29