numpy array to data frame and vice versa - pandas

I'm a noob in Python!
I'd like to get the sequences and the anomaly column together, like this:
and keep only the normal sequences (if the value of the anomaly column is 0, it's a normal sequence),
then turn the normal sequences into a numpy array (without the anomaly column).
Each row (sequence) is one session, so in this case there are 6 independent sequences.
Each element represents some specific activity.
'''
sequence = np.array([[5, 1, 1, 0, 0, 0],
                     [5, 1, 1, 0, 0, 0],
                     [5, 1, 1, 0, 0, 0],
                     [5, 1, 1, 0, 0, 0],
                     [5, 1, 1, 0, 0, 0],
                     [5, 1, 1, 300, 200, 100]])
anomaly = np.array((0, 0, 0, 0, 0, 1))
'''
I got these two variables and have to keep only the normal sequences.
Here is the code I tried:
'''
# sequence to dataframe
empty_df = pd.DataFrame(columns=['Sequence'])
empty_df.reset_index()
for i in range(sequence.shape[0]):
    empty_df = empty_df.append({"Sequence": sequence[i]}, ignore_index=True)
# concat anomaly
anomaly_df = pd.DataFrame(anomaly)
df = pd.concat([empty_df, anomaly_df], axis=1)
df.columns = ['Sequence', 'anomaly']
df
'''
I didn't want to use pd.DataFrame(sequence) directly, because it gives me a DataFrame with six separate columns instead of a single 'Sequence' column.
Anyway, after making df, I tried to select the normal sequences:
'''
# selecting normal sequences
normal = df[df['anomaly'] == 0]['Sequence']
# back to numpy, only the Sequence column
normal = normal.to_numpy()
normal.shape
'''
This numpy array has a different shape from the variable sequence:
sequence.shape is (6, 6), but normal.shape is (5,).
I want (5, 6). I tried reshape, but it didn't work.
Can someone help me with this?
If anything in my question is unclear, please leave a comment. I appreciate it.

I am not quite sure what you need, but here is what you could do:
import pandas as pd
df = pd.DataFrame({'sequence':sequence.tolist(), 'anomaly':anomaly})
df
sequence anomaly
0 [5, 1, 1, 0, 0, 0] 0
1 [5, 1, 1, 0, 0, 0] 0
2 [5, 1, 1, 0, 0, 0] 0
3 [5, 1, 1, 0, 0, 0] 0
4 [5, 1, 1, 0, 0, 0] 0
5 [5, 1, 1, 300, 200, 100] 1

Convert it to a list, then create an array.
Try:
normal = df.loc[df['anomaly'].eq(0), 'Sequence']
normal = np.array(normal.tolist())
print(normal.shape)
# (5,6)
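Putting the two answers together, a minimal end-to-end sketch (reusing the variable names from the question) could look like this:

import numpy as np
import pandas as pd

sequence = np.array([[5, 1, 1, 0, 0, 0],
                     [5, 1, 1, 0, 0, 0],
                     [5, 1, 1, 0, 0, 0],
                     [5, 1, 1, 0, 0, 0],
                     [5, 1, 1, 0, 0, 0],
                     [5, 1, 1, 300, 200, 100]])
anomaly = np.array([0, 0, 0, 0, 0, 1])

# store each row as a list in a single 'Sequence' column
df = pd.DataFrame({'Sequence': sequence.tolist(), 'anomaly': anomaly})

# keep only the normal rows, then rebuild a 2-D array from the lists
normal = np.array(df.loc[df['anomaly'] == 0, 'Sequence'].tolist())
print(normal.shape)  # (5, 6)

If the DataFrame is only an intermediate step, note that plain boolean indexing on the arrays themselves, sequence[anomaly == 0], gives the same (5, 6) result directly.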

Related

Check every 4 values and change values accordingly in an np array

Hi there, I have a np array with zeros and ones. I would like to check every 4 values, and if there is at least one (1) among them, set all four values to (1). Else leave all of them at zero.
Do you know how to do it? Thanks, here is a sample:
np = [ 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 0 ]
np_corrected = [ 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 ]
Many thanks, hope the question is now clear!
Probably not the shortest solution, but it definitely works and is fast:
reshape
create a mask
apply the mask and get the result:
import numpy as np
array = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0])
array
groups = array.reshape(-1, 4) # group every 4 consecutive elements into a row of 4
groups
mask = groups.sum(axis=1)>0 # identify groups with at least one '1'
mask
np.logical_or(groups.T, mask).T.astype(int).flatten()
# swap rows and columns in groups, apply mask, swap back,
# replace True/False with 1/0 and restore original shape
Returns (in Jupyter notebook):
array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0])
array([[0, 0, 0, 0],
       [1, 1, 1, 1],
       [0, 0, 1, 0],
       [0, 0, 0, 0]])
array([False, True, True, False])
array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
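A slightly shorter variant of the same idea (a sketch, not part of the original answer): compute one flag per group of 4 and expand it back to the original length with np.repeat.

import numpy as np

array = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

groups = array.reshape(-1, 4)            # one row per block of 4
flags = groups.any(axis=1).astype(int)   # 1 if the block contains at least one 1
result = np.repeat(flags, 4)             # expand each flag back over its block
print(result)
# [0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0]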

Standard implementation of vectorize_sequences

In François Chollet's Deep Learning with Python, appears this function:
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results
I understand what this function does. This function is asked about in this question and in this question as well, also mentioned here, here, here, here, here & here. Despite being so widespread, this vectorization is, according to Chollet's book, done "manually for maximum clarity." I am interested in whether there is a standard, not "manual", way of doing it.
Is there a standard Keras / Tensorflow / Scikit-learn / Pandas / Numpy implementation of a function which behaves very similarly to the function above?
Solution with MultiLabelBinarizer
Assuming sequences is an array of integers with maximum possible value up to dimension-1, we can use MultiLabelBinarizer from sklearn.preprocessing to replicate the behaviour of the function vectorize_sequences:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=range(dimension))
mlb.fit_transform(sequences)
Solution with Numpy broadcasting
Assuming sequences is an array of integers with maximum possible value up to dimension-1:
(np.array(sequences)[:, :, None] == range(dimension)).any(1).view('i1')
Worked out example
>>> sequences
[[4, 1, 0],
 [4, 0, 3],
 [3, 4, 2]]
>>> dimension = 10
>>> mlb = MultiLabelBinarizer(classes=range(dimension))
>>> mlb.fit_transform(sequences)
array([[1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]])
>>> (np.array(sequences)[:, :, None] == range(dimension)).any(1).view('i1')
array([[1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]], dtype=int8)
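For reference, a small sketch that checks the broadcasting replacement against the book's manual function, reusing the worked example above:

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

sequences = [[4, 1, 0], [4, 0, 3], [3, 4, 2]]
dimension = 10

manual = vectorize_sequences(sequences, dimension)
broadcast = (np.array(sequences)[:, :, None] == range(dimension)).any(1).view('i1')
print(np.array_equal(manual.astype('i1'), broadcast))  # True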

how to get row indices where row slice contains a single value (0)

With the numpy array
arr = np.array([[1, 1, 0, 0, 0, 1], [1, 1, 0, 0, 1, 1], [1, 1, 0, 0, 0, 1]])
I would like to get the indices of all rows where the row slice 2:5 contains all zeros.
In the above example, it should return rows 0 and 2.
I tried:
zero_indices = np.where(not np.any(arr[:,2:5]))
but it doesn't seem to work.
I'm trying to do this over a large array with several million rows.
Try this
np.nonzero((~arr[:,2:5].astype(bool)).all(1))[0]
Out[133]: array([0, 2], dtype=int32)
Or
np.nonzero((arr[:,2:5] == 0).all(1))[0]
Out[139]: array([0, 2], dtype=int32)
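Just to spell out why the original attempt did not work: without an axis argument, np.any collapses the whole 2-D slice into a single boolean, so the per-row information is lost before np.where ever sees it. Keeping the reduction per row with axis=1 makes an np.where version work as intended (a sketch):

import numpy as np

arr = np.array([[1, 1, 0, 0, 0, 1],
                [1, 1, 0, 0, 1, 1],
                [1, 1, 0, 0, 0, 1]])

# reduce per row (axis=1) instead of over the whole slice
zero_indices = np.where(~arr[:, 2:5].any(axis=1))[0]
print(zero_indices)  # [0 2]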

Compare all values in one column with all values in another column and return indexes

I am interested in comparing all values from one dataframe column with all values from a second column, and then generating a list or a subset df with values from a third column that are adjacent to the first column's hits. Hopefully this example will explain it better:
For a simplified example, say I generate the following pandas dataframe:
fake_df = pd.DataFrame({'m': [100, 120, 101, 200, 201, 501, 350, 420, 525, 500],
                        'n': [10.0, 11.0, 10.2, 1.0, 2.0, 1.1, 3.0, 1.0, 2.0, 1.0],
                        'mod': [101.001, 121.001, 102.001, 201.001, 202.001, 502.001, 351.001, 421.001, 526.001, 501.001]})
print(fake_df)
What I am interested in doing is finding all values in column 'm' that are within 0.1 of any value in column 'mod', and returning the values in column 'n' that correspond to those column 'm' hits. So for the above code, the return would be:
10.2, 2.0, 1.1
(since 101, 201 and 501 all have close hits in column 'mod').
I have found ways to compare across the same row, but not like above. Is there a way to do this in pandas without extensive loops?
Thanks!
I don't know of such a method in pandas, but when you extend your scope to include numpy, two options come to mind.
Easy/Expensive Method
If you can live with N**2 memory overhead, you can do numpy broadcasting to
find out all "adjacent" elements in one step:
In [25]: fake_df = pd.DataFrame({'m': [100, 120, 101, 200, 201, 501, 350, 420, 525, 500],
    ...:                         'n': [10.0, 11.0, 10.2, 1.0, 2.0, 1.1, 3.0, 1.0, 2.0, 1.0],
    ...:                         'mod': [101.001, 121.001, 102.001, 201.001, 202.001, 502.001, 351.001, 421.001, 526.001, 501.001]})
In [26]: mvals = fake_df['m'].values
In [27]: modvals = fake_df['mod'].values
In [28]: is_close = np.abs(mvals - modvals[:, np.newaxis]) <= 0.1; is_close.astype(int)
Out[28]:
array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]])
As we care only about 'mod' values that have adjacent 'm's, aggregate over axis=0:
In [29]: is_close.any(axis=0).astype(int)
Out[29]: array([0, 0, 1, 0, 1, 1, 0, 0, 0, 0])
Or otherwise
In [30]: fake_df.loc[is_close.any(axis=0), 'n']
Out[30]:
2 10.2
4 2.0
5 1.1
Name: n, dtype: float64
Efficient/Complex Method
To find adjacent elements in less than O(N**2) without any hashing/rounding
tricks, you have to do some sorting:
In [103]: modvals_sorted = np.sort(modvals)
In [104]: next_indices = np.searchsorted(modvals_sorted, mvals)
You have indices of the next elements, but they may point beyond the original array, so you need one extra NaN at the end to avoid an IndexError. The same logic applies to the previous elements, which are at next_indices - 1: to avoid indexing before the first element we must prepend one NaN, too. Note the + 1 that arises because one NaN has been added at the beginning.
In [105]: modvals_sorted_plus = np.r_[np.nan, modvals_sorted, np.nan]
In [106]: nexts = modvals_sorted_plus[next_indices + 1]
In [107]: prevs = modvals_sorted_plus[(next_indices - 1) + 1]
Now it's trivial. Note that we already have prevs <= mvals <= nexts, so we don't need to use np.abs. Also, all missing elements are NaN, and comparing with them yields False, which doesn't alter the result of the .any() reduction.
In [108]: adjacent = np.c_[prevs, mvals, nexts]; adjacent
Out[108]:
array([[     nan,  100.   ,  101.001],
       [ 102.001,  120.   ,  121.001],
       [     nan,  101.   ,  101.001],
       [ 121.001,  200.   ,  201.001],
       [ 121.001,  201.   ,  201.001],
       [ 421.001,  501.   ,  501.001],
       [ 202.001,  350.   ,  351.001],
       [ 351.001,  420.   ,  421.001],
       [ 502.001,  525.   ,  526.001],
       [ 421.001,  500.   ,  501.001]])
In [109]: (np.diff(adjacent, axis=1) <= 0.1).any(axis=1)
Out[109]: array([False, False, True, False, True, True, False, False, False, False], dtype=bool)
In [110]: mask = (np.diff(adjacent, axis=1) <= 0.1).any(axis=1)
In [112]: fake_df.loc[mask, 'n']
Out[112]:
2 10.2
4 2.0
5 1.1
Name: n, dtype: float64
Try the following:
# I assume all arrays involved to be or to be converted to numpy arrays
import numpy as np
m = np.array([100,120,101,200,201,501,350,420,525,500])
n = np.array([10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0])
mod = np.array([101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001])
res = []
# for each entry in mod, look in m for "close" values
for i in range(len(mod)):
    # for each hit, store the entry from n in the result list
    res.extend(n[np.fabs(mod[i] - m) <= 0.1])
# cast result to numpy array
res = np.array(res)
print(res)
The output is
[ 10.2 2. 1.1]
I'll be making use of numpy (imported as np), which pandas uses under the hood. np.isclose returns a boolean indexer: for each value of the iterable, there's a True or False value corresponding to the value m being within atol of each value of df["mod"].
>>> for i, m in df["m"].items():
...     indices = np.isclose(m, df["mod"], atol=0.1)
...     if any(indices):
...         print(df["n"][i])
Using the DataFrame you gave produces the output:
10.2
2.0
1.1
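The same np.isclose idea can also be applied in one shot, without the Python-level loop, by broadcasting 'm' against 'mod' (a sketch reusing the fake_df from the question):

import numpy as np

# pairwise |m - mod| comparison via broadcasting, then reduce over the 'mod' axis
close = np.isclose(fake_df['m'].values[:, None], fake_df['mod'].values, atol=0.1)
print(fake_df.loc[close.any(axis=1), 'n'])
# 2    10.2
# 4     2.0
# 5     1.1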

Drawing/sampling a sphere in a 3D numpy grid

I want to do a voxel-based measurement of spherical objects, represented in a numpy array. Because of the sampling, these spheres are represented as a group of cubes (because they are sampled in the array). I want to do a simulation of the introduced error by this grid-restriction. Is there any way to paint a 3D sphere in a numpy grid to run my simulations on? (So basically, a sphere of unit length one, would be one point in the array)
Or is there another way of calculating the error introduced by sampling?
In 2-D that seems to be easy...
The most direct approach is to create a bounding box array, holding at each point the distance to the center of the sphere:
>>> radius = 3
>>> r2 = np.arange(-radius, radius+1)**2
>>> dist2 = r2[:, None, None] + r2[:, None] + r2
>>> volume = np.sum(dist2 <= radius**2)
>>> volume
123
The 2D case is easier to visualize:
>>> dist2 = r2[:, None] + r2
>>> (dist2 <= radius**2).astype(int)
array([[0, 0, 0, 1, 0, 0, 0],
       [0, 1, 1, 1, 1, 1, 0],
       [0, 1, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 1, 1],
       [0, 1, 1, 1, 1, 1, 0],
       [0, 1, 1, 1, 1, 1, 0],
       [0, 0, 0, 1, 0, 0, 0]])
>>> np.sum(dist2 <= radius**2)
29
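To actually paint such a sphere into a larger grid (rather than just its tight bounding box), the same distance test works with np.ogrid; a minimal sketch, where the grid size, center and radius are made-up values:

import numpy as np

grid_shape = (32, 32, 32)   # assumed grid size
center = (16, 16, 16)       # assumed sphere center, in voxel coordinates
radius = 5                  # assumed radius, in voxels

zz, yy, xx = np.ogrid[:grid_shape[0], :grid_shape[1], :grid_shape[2]]
dist2 = (zz - center[0])**2 + (yy - center[1])**2 + (xx - center[2])**2

grid = np.zeros(grid_shape, dtype=np.uint8)
grid[dist2 <= radius**2] = 1   # voxels inside (or on) the sphere

print(grid.sum())              # number of voxels in the sampled sphere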