Converting a string Series to a numeric one - pandas

I'm doing
X = data['x'].apply(lambda h: [int(h[i:i + 2], 16) for i in (0, 2, 4)])
where x holds strings of hex colors, and I'd like to map them to RGB arrays (3 values each). After that, X has dtype='object', and X.values is a numpy array of arrays.
My final goal is to make this a 3 × n numpy array and use it with sklearn.cluster.KMeans. What is the best way to achieve this?

After creating X, you can split the data into 3 columns like this:
X = data['x'].apply(lambda h: [int(h[i:i + 2], 16) for i in (0, 2, 4)])
data[['R','G','B']] = pd.DataFrame(X.values.tolist(), index=X.index)
so that
data[['R','G','B']]
has the result in three columns for further processing.
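If the end goal is the array for KMeans (which expects samples as rows, i.e. n × 3), here is a minimal sketch building on the X from above; the cluster count is an arbitrary placeholder, not from the question:

import numpy as np
from sklearn.cluster import KMeans

arr = np.vstack(X.values)  # stacks the per-row [R, G, B] lists into an (n, 3) array
kmeans = KMeans(n_clusters=5).fit(arr)  # n_clusters=5 is a placeholder

The data[['R','G','B']] frame from above yields the same (n, 3) array via .to_numpy().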

Related

Merge masked selection of array with original array

I'm facing a problem with an assignment at the moment.
So I have an array which contains 400 2D points, i.e. an array of shape 400 × 2.
Then I have a mask that selects m points (rows) that I want to compute some changes on.
As per the assignment I'm supposed to store the points that I want to change in an array of shape m × 2.
Then I do my changes on this resulting array. But after the changes I want to insert the newly computed values into my original array at the original indices, and I just have no clue how to do that.
So I basically have:
orig (400 × 2)
mask (400 × 1) (boolean mask selecting the rows to edit)
change (m × 2) (just the changes I want to add)
changed (m × 2) (the original values + the change (with a factor applied) added together)
How do I transform my change or changed arrays with the mask so that I can add/insert the changes into my original array?
Look at this example with 4 rows.
The principle is that the same mask that extracts rows from orig can also be used to write the sub-array back to its original places.
import numpy as np
x = np.array([[1,2],[3,4],[5,6],[7,8]])
print(x)
mask_ix = np.array([True,False, True, False])
masked = x[mask_ix, :]
masked = masked * 10  # the change
print(masked)
x[mask_ix] = masked  # write the changed rows back into x at the masked positions
print(x)
x = [[1 2]
     [3 4]
     [5 6]
     [7 8]]
masked = [[10 20]
          [50 60]]
x = [[10 20]
     [ 3  4]
     [50 60]
     [ 7  8]]
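Applied to the asker's setup, the same pattern handles the "original values + scaled change" update in one step. A sketch with made-up data, reusing the names from the question (orig, mask, change; the factor is arbitrary):

import numpy as np

orig = np.random.rand(400, 2)           # 400 2D points
mask = np.random.rand(400) > 0.5        # boolean mask over the rows
change = np.random.rand(mask.sum(), 2)  # one change row per selected point (m × 2)
factor = 0.1                            # arbitrary scaling factor

# changed = original values + scaled change, written straight back via the mask
orig[mask] = orig[mask] + factor * change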

How to find common members of matrices in Numpy

I have a 2D matrix A and a vector B. I want to find all row indices of elements in A that are also contained in B.
A = np.array([[1,9,5], [8,4,9], [4,9,3], [6,7,5]], dtype=int)
B = np.array([2, 4, 8, 10, 12, 18], dtype=int)
My current solution compares A to one element of B at a time, but that is horribly slow:
res = np.array([], dtype=int)
for i in range(B.shape[0]):
    cres, _ = (B[i] == A).nonzero()
    res = np.append(res, cres)
res = np.unique(res)
The following Matlab statement would solve my issue:
find(any(reshape(any(reshape(A, prod(size(A)), 1) == B, 2),size(A, 1),size(A, 2)), 2))
However, comparing a row and a column vector in Numpy does not create a Boolean intersection matrix as it does in Matlab.
Is there a proper way to do this in Numpy?
We can use np.isin for masking.
To get all the row numbers, it would be -
np.where(np.isin(A,B).T)[1]
If you need them split based on each element's occurrence -
[np.flatnonzero(i) for i in np.isin(A,B).T if i.any()]
The posted MATLAB code seems to be doing broadcasting, so an equivalent would be -
np.where(B[:,None,None]==A)[1]
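As a quick check with the sample arrays (a minimal sketch; the any() reduction is a variant for when each matching row should be reported only once):

import numpy as np

A = np.array([[1, 9, 5], [8, 4, 9], [4, 9, 3], [6, 7, 5]], dtype=int)
B = np.array([2, 4, 8, 10, 12, 18], dtype=int)

mask = np.isin(A, B)                     # True where an element of A occurs in B
print(np.where(mask.T)[1])               # [1 2 1] -- row 1 matches twice (8 and 4)
print(np.flatnonzero(mask.any(axis=1)))  # [1 2] -- each matching row once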

How to split a cell which contains nested array in a pandas DataFrame

I have a pandas DataFrame which contains 610 rows, and every row contains a nested list of coordinate pairs; it looks like this:
[1377778.4800000004, 6682395.377599999] is one coordinate pair.
I want to unnest every row, so that instead of one row containing a list of coordinates I will have one row for every coordinate pair.
I've tried s.apply(pd.Series).stack() from this question: Split nested array values from Pandas Dataframe cell over multiple rows, but unfortunately that didn't work.
Any ideas? Many thanks in advance!
Here is my new answer to your problem. I used reduce to flatten your nested array and then itertools.chain to turn everything into a 1d list. After that I reshaped the list into a 2d array, which lets you convert it to the dataframe you need. I tried to be as generic as possible. Please let me know if there are any problems.
# libraries
import operator
from functools import reduce
from itertools import chain

import numpy as np
import pandas as pd

# flatten the lists of lists using reduce, then turn everything into a
# 1d list using itertools.chain
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat, geometry_list)))

# reshape the 1d coordinate list to 2d and convert it to a dataframe
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)))
df.columns = ['X', 'Y']
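For instance, with a doubly nested input like the question's, where geometry_list is a hypothetical stand-in for the DataFrame column (one list of coordinate pairs per row):

# hypothetical input: two rows, each holding a list of coordinate pairs
geometry_list = [
    [[1377778.4800000004, 6682395.377599999]],
    [[6582395.377599999, 2577778.4800000004]],
]
# running the snippet above on this input yields one X/Y row per pair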
One thing you can do is use numpy. It lets you perform a lot of list/array operations in a fast and efficient way, including "unnesting" (reshaping) lists. Then you only have to convert the result to a pandas dataframe.
For example,
import numpy as np
import pandas as pd

# your list
coordinate_list = [[[1377778.4800000004, 6682395.377599999], [6582395.377599999, 2577778.4800000004], [6582395.377599999, 2577778.4800000004]]]

# convert the list to an array
coordinate_array = np.array(coordinate_list)

# print the shape of the array
print(coordinate_array.shape)

# reshape the array into pairs of coordinates
reshaped_array = np.reshape(coordinate_array, (3, 2))
df = pd.DataFrame(reshaped_array)
df.columns = ['X', 'Y']
The output will be a DataFrame with an X and a Y column, one coordinate pair per row. Let me know if there is something I am missing.
import pandas as pd
import numpy as np

data = np.arange(500).reshape([250, 2])
cols = ['coord']
new_data = []
for item in data:
    new_data.append([item])

df = pd.DataFrame(data=new_data, columns=cols)
print(df.head())

def expand(row):
    row['x'] = row.coord[0]
    row['y'] = row.coord[1]
    return row

df = df.apply(expand, axis=1)
df.drop(columns='coord', inplace=True)
print(df.head())
RESULT
    coord
0  [0, 1]
1  [2, 3]
2  [4, 5]
3  [6, 7]
4  [8, 9]

   x  y
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
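For completeness, on pandas 0.25+ there is also Series.explode, which unnests one level per call. A sketch, assuming a coords Series shaped like the question's column:

import pandas as pd

coords = pd.Series([[[1377778.4800000004, 6682395.377599999],
                     [6582395.377599999, 2577778.4800000004]]])
s = coords.explode()  # one row per coordinate pair, each pair still a list
df = pd.DataFrame(s.tolist(), index=s.index, columns=['X', 'Y'])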

How to efficiently compute an L2 distance between rows of two arrays using only basic numpy operations? [duplicate]

I have 2 lists of points as numpy.ndarrays, where each row is the coordinate of a point, like:
a = np.array([[1,0,0],[0,1,0],[0,0,1]])
b = np.array([[1,1,0],[0,1,1],[1,0,1]])
Here I want to calculate the euclidean distance between all pairs of points from the 2 lists: for each point p_a in a, I want the distance between it and every point p_b in b. So the result is
d = np.array([[1,sqrt(3),1],[1,1,sqrt(3)],[sqrt(3),1,1]])
How to use matrix multiplication in numpy to compute the distance matrix?
Using direct numpy broadcasting, you can do this:
dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
Alternatively, scipy has a routine that will compute this slightly more efficiently (particularly for large matrices):
from scipy.spatial.distance import cdist
dist = cdist(a, b)
I would avoid solutions that depend on factoring-out matrix products (of the form A^2 + B^2 - 2AB), because they can be numerically unstable due to floating point roundoff errors.
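As a quick sanity check with the arrays from the question (a minimal sketch; scipy is assumed to be installed):

import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
b = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])

dist_bc = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
print(np.allclose(dist_bc, cdist(a, b)))  # True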
To compute the squared euclidean distance between each pair of rows, Xi from X and Yj from Y, we need to find:
(Xik - Yjk)**2 = Xik**2 + Yjk**2 - 2*Xik*Yjk
and then sum along k to get the squared distance at the corresponding point as dist(Xi, Yj).
Distributing the sum over the three terms, it reduces to:
dist(Xi, Yj) = sum_k(Xik**2) + sum_k(Yjk**2) - 2*sum_k(Xik*Yjk)
Bringing in matrix-multiplication for the last part, we would have all the distances, like so -
dist = sum_rows(X**2) + sum_rows(Y**2) - 2*matrix_multiplication(X, Y.T)
Hence, putting it into NumPy terms, we end up with the euclidean distances for our case, with a and b as the inputs, like so -
np.sqrt((a**2).sum(1)[:,None] + (b**2).sum(1) - 2*a.dot(b.T))
Leveraging np.einsum, we could replace the first two summation-reductions with -
np.einsum('ij,ij->i',a,a)[:,None] + np.einsum('ij,ij->i',b,b)
More info could be found on eucl_dist package's wiki page (disclaimer: I am its author).
If you have two 1-dimensional arrays, x and y, you can convert them into matrices with repeating columns, transpose, and apply the distance formula. This assumes that x and y hold the coordinates of the same set of points (coordinate pairs). The result is a symmetric distance matrix.
import numpy as np

x = [1, 2, 3]
y = [4, 5, 6]
xx = np.repeat(x, 3, axis=0).reshape(3, 3)
yy = np.repeat(y, 3, axis=0).reshape(3, 3)
dist = np.sqrt((xx - xx.T)**2 + (yy - yy.T)**2)
dist
Out[135]:
array([[0.        , 1.41421356, 2.82842712],
       [1.41421356, 0.        , 1.41421356],
       [2.82842712, 1.41421356, 0.        ]])
L2 distance = (a^2 + b^2 - 2ab)^0.5
a = np.random.randn(5, 3)
b = np.random.randn(2, 3)
a2 = np.sum(np.square(a), axis=1)[..., None]  # shape (5, 1)
b2 = np.sum(np.square(b), axis=1)[None, ...]  # shape (1, 2)
ab = -2 * np.dot(a, b.T)                      # shape (5, 2)
dist = np.sqrt(a2 + b2 + ab)
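As noted above, this factored form can dip slightly below zero from roundoff before the square root; clamping at zero is a common guard (a sketch):

dist = np.sqrt(np.maximum(a2 + b2 + ab, 0))  # avoid NaN from tiny negative values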

how to avoid split and sum of pieces in pytorch or numpy

I want to split a long vector into smaller unequal pieces, do a summation on each piece and gather the results into a new vector.
I need to do this in pytorch but I am also interested to see how this is done with numpy.
This can easily be accomplished by splitting the vector.
import torch

sizes = [3, 7, 5, 9]
X = torch.ones(sum(sizes))
Y = torch.tensor([s.sum() for s in torch.split(X, sizes)])
or with np.ones and np.split.
Is there a more efficient way to do this?
Edit:
Inspired by the first comment:
# with X as a numpy array: X = np.ones(sum(sizes))
indices = np.cumsum([0] + sizes)[:-1]
Y = np.add.reduceat(X, indices.tolist())
solves it for numpy. I am still looking for a solution with pytorch.
index_add_ is your friend!
import torch

# inputs
sizes = torch.tensor([3, 7, 5, 9], dtype=torch.long)
total = int(sizes.sum())  # total length of x
x = torch.ones(total)
# prepare an index vector for the summation (which element of y each element of x is summed into)
ind = torch.zeros(total, dtype=torch.long)
ind[torch.cumsum(sizes, dim=0)[:-1]] = 1
ind = torch.cumsum(ind, dim=0)
# prepare the output
y = torch.zeros(len(sizes))
# do the actual summation
y.index_add_(0, ind, x)
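Since x is all ones here, each sum equals its piece's size, which makes the result easy to verify:

print(y)  # tensor([3., 7., 5., 9.])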