lost dimension in numpy array acquired from a dataframe - pandas

I have a dataframe (you may see the image of the dataframe from the provided link).
df.shape, type(df), type(df.iloc[0, 0])
>>>((10292, 5), pandas.core.frame.DataFrame, numpy.ndarray)
Each value in this dataframe is a NumPy array of 55 data points, i.e. of shape (55,).
df.iloc[0,0]
>>>array([64.75, 65.62, 64.21, 64.62, 63.94, 62.63, 62.24, 62.65, 62.47,
63.17, 63.46, 63.75, 65.41, 65.35, 65.68, 65.97, 66.6 , 66.45,
66.11, 65.48, 64.22, 63.54, 62.81, 63.58, 62.46, 61.23, 62.26,
61.13, 61.68, 61.36, 61.93, 61.48, 61.92, 62.43, 63.37, 62.59,
63.33, 63.52, 63.23, 62.52, 63.03, 63.61, 63.83, 63.7 , 63.94,
65.14, 66. , 66.65, 65.87, 64.93, 65.84, 64.75, 65.5 , 65.7 ,
66.83])
When I convert the whole dataframe to a NumPy array, NumPy does not recognize the 3rd dimension.
X = np.array(df)
X.shape
>>>(10292, 5)
X.shape, type(X)
>>>((10292, 5), numpy.ndarray)
X[0].shape, type(X[0])
>>>((5,), numpy.ndarray)
X[0, 0].shape, type(X[0, 0])
>>>((55,), numpy.ndarray)
I expect (and desire) to get:
X.shape, X[0, 0, 0]
>>>((10292, 5, 55), 64.75)
To access the data, X[0][0][0] or X[0, 0][0] works, but that does not meet my needs: I want to access the data with X[0, 0, 0].
I tried np.vstack and np.expand_dims without success. How can I convert the data to shape (10292, 5, 55)?
Thank you,
Evrim

I suppose you have to create a new array with the desired dimensions and traverse your matrix of arrays (X):
result = np.zeros(X.shape + X[0, 0].shape)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        result[i, j] = X[i, j]
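A loop-free variant of the same idea is sketched below. It assumes every cell holds a 1-D array of the same length; the toy object array X (shape (3, 2), cells of length 4) stands in for np.array(df) and is made up for illustration.

```python
import numpy as np

# Stand-in for X = np.array(df): a (3, 2) object array whose cells are
# equal-length 1-D float arrays (toy sizes, assumed for illustration).
X = np.empty((3, 2), dtype=object)
for i in range(3):
    for j in range(2):
        X[i, j] = np.arange(4, dtype=float) + 10 * i + j

# Flatten the cells in row-major order, stack them into shape (6, 4),
# then restore the leading (3, 2) axes.
result = np.stack(list(X.ravel())).reshape(X.shape + (-1,))

print(result.shape)      # (3, 2, 4)
print(result[2, 1, 3])   # 24.0
```

On the original frame the same approach would presumably read np.stack(list(df.to_numpy().ravel())).reshape(df.shape + (-1,)), which should give shape (10292, 5, 55) as long as all cells really have 55 entries.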

Related

Is there a numpy function like np.fill(), but for arrays as fill value?

I'm trying to build an array of some given shape in which all elements are given by another array. Is there a function in numpy which does that efficiently, similar to np.full(), or any other elegant way, without simply employing for loops?
Example: Let's say I want an array with shape (dim1,dim2) filled with a given, constant scalar value. Numpy has np.full() for this:
my_array = np.full((dim1,dim2),value)
I'm looking for an analogous way of doing this, but I want the array to be filled with another array of shape (filldim1,filldim2). A brute-force way would be this:
my_array = np.array([])
for i in range(dim1):
    for j in range(dim2):
        my_array = np.append(my_array,fill_array)
my_array = my_array.reshape((dim1,dim2,filldim1,filldim2))
EDIT
I was being stupid. np.full() does accept an array as the fill value if the shape is extended accordingly:
my_array = np.full((dim1,dim2,filldim1,filldim2),fill_array)
Thanks for pointing that out, @Arne!
You can use np.tile:
>>> shape = (2, 3)
>>> fill_shape = (4, 5)
>>> fill_arr = np.random.randn(*fill_shape)
>>> arr = np.tile(fill_arr, [*shape, 1, 1])
>>> arr.shape
(2, 3, 4, 5)
>>> np.all(arr[0, 0] == fill_arr)
True
Edit: better answer, as suggested by @Arne, directly using np.full:
>>> arr = np.full([*shape, *fill_shape], fill_arr)
>>> arr.shape
(2, 3, 4, 5)
>>> np.all(arr[0, 0] == fill_arr)
True
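np.full accepts an array here because the fill value is broadcast against the requested shape. The sketch below illustrates that, and also shows np.broadcast_to, a different technique from the answers above that returns a read-only view instead of a copy. The sizes and the arange fill values are made up for illustration.

```python
import numpy as np

shape = (2, 3)
fill_arr = np.arange(20.0).reshape(4, 5)
target_shape = shape + fill_arr.shape      # (2, 3, 4, 5)

# np.full broadcasts fill_arr over the leading axes and copies the data.
full = np.full(target_shape, fill_arr)

# np.broadcast_to yields the same values as a read-only view, without copying.
view = np.broadcast_to(fill_arr, target_shape)

print(full.shape, view.shape)       # (2, 3, 4, 5) (2, 3, 4, 5)
print(np.array_equal(full, view))   # True
print(view.flags.writeable)         # False
```

If the result has to be modified afterwards, the np.full copy (or view.copy()) is the safer choice.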

Move for loop into numpy single expression when calling polyfit

Fairly new to numpy/python here, trying to figure out some less C-like, more numpy-like coding styles.
Background
I've got some code that takes a fixed set of x values and multiple sets of corresponding y values, and tries to find which set of y values is the "most linear".
It does this by looping over each set of y values, calculating and storing the residual from a straight-line fit of those y's against the x's, and then, once the loop has finished, finding the index of the minimum residual.
...sorry, this might make a bit more sense with the code below.
import numpy as np
import numpy.polynomial.polynomial as poly
# set of x values
xs = [1,22,33,54]
# multiple sets of y values for each of the x values in 'xs'
ys = np.array([[1, 22, 3, 4],
               [2, 3, 1, 5],
               [3, 2, 1, 1],
               [34, 23, 5, 4],
               [23, 24, 29, 33],
               [5, 19, 12, 3]])
# array to store the residual from a linear fit of each of the y's against x
residuals = np.empty(ys.shape[0])
# loop through the ys's and calculate the residual of a linear fit for each
for i in range(ys.shape[0]):
    _, stats = poly.polyfit(xs, ys[i], 1, full=True)
    residuals[i] = stats[0][0]
# the 'most linear' of the ys's is at np.argmin:
print('most linear at', np.argmin(residuals))
Question
I'd like to know if it's possible to "numpy'ize" that into a single expression, something like
residuals = get_residuals(xs, ys)
I've tried the following, but no luck (it always passes the full arrays in, not row by row):
# ------ ok try to do it without a loop --------
def wrap(x, y):
    _, stats = poly.polyfit(x, y, 1, full=True)
    return stats[0][0]
res = wrap(xs, ys) # <- fails as passes ys as full 2D array
res = wrap(np.broadcast_to(xs, ys.shape), ys) # <- fails as passes both as 2D arrays
Could anyone give any tips on how to numpy'ize that?
From the numpy.polynomial.polynomial.polyfit docs (not to be confused with numpy.polyfit, which is not interchangeable):
x : array_like, shape (M,)
y : array_like, shape (M,) or (M, K)
Your ys needs to be transposed so that its first dimension matches the length of xs:
def wrap(x, y):
    _, stats = poly.polyfit(x, y.T, 1, full=True)
    return stats[0]
res = wrap(xs, ys)
res
Out[]: array([284.57337884, 5.54709898, 0.41399317, 91.44641638,
6.34982935, 153.03515358])
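Putting the transposed call together with the original goal, a self-contained sketch might look like this. get_residuals is the name the question proposed; the cross-check against the row-by-row loop is added here only to show that both approaches agree.

```python
import numpy as np
import numpy.polynomial.polynomial as poly

xs = np.array([1, 22, 33, 54])
ys = np.array([[1, 22, 3, 4],
               [2, 3, 1, 5],
               [3, 2, 1, 1],
               [34, 23, 5, 4],
               [23, 24, 29, 33],
               [5, 19, 12, 3]])

def get_residuals(x, y_rows):
    """Sum of squared residuals of a degree-1 fit for each row of y_rows."""
    # polyfit expects y with shape (M, K): one column per data set.
    _, stats = poly.polyfit(x, y_rows.T, 1, full=True)
    return stats[0]

vectorized = get_residuals(xs, ys)

# Row-by-row version from the question, kept for comparison.
looped = np.array([poly.polyfit(xs, row, 1, full=True)[1][0][0] for row in ys])

print(np.allclose(vectorized, looped))          # True
print('most linear at', np.argmin(vectorized))  # most linear at 2
```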

How to sort a multi-dimensional tensor using the returned indices of tf.nn.top_k?

I have two multi-dimensional tensors a and b, and I want to sort both by the values of a.
I found that tf.nn.top_k can sort a tensor and return the indices that were used to sort the input. How can I use the indices returned by tf.nn.top_k(a, k=2) to sort b?
For example,
import tensorflow as tf
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
# How to sort b into
# sorted_b[0, 0, 0, :] = b[0, 0, indices[0, 0, 0], :]
# sorted_b[0, 0, 1, :] = b[0, 0, indices[0, 0, 1], :]
# sorted_b[0, 1, 0, :] = b[0, 1, indices[0, 1, 0], :]
# ...
Update
Combining tf.gather_nd with tf.meshgrid can be one solution. For example, the following code is tested on Python 3.5 with TensorFlow 1.0.0-rc0:
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
shape_a = tf.shape(a)
auxiliary_indices = tf.meshgrid(*[tf.range(d) for d in (tf.unstack(shape_a[:(a.get_shape().ndims - 1)]) + [k])], indexing='ij')
sorted_b = tf.gather_nd(b, tf.stack(auxiliary_indices[:-1] + [indices], axis=-1))
However, I wonder if there is a solution which is more readable and doesn't need to create auxiliary_indices above.
Your code has a problem:
b = tf.reshape(tf.range(60), (2, 5, 3, 7))
TensorFlow cannot reshape a tensor with 60 elements to shape [2, 5, 3, 7] (210 elements).
Also, you can't sort a rank-4 tensor (b) using the indices of a rank-3 tensor.
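The indexing the question asks for, sorted_b[i, j, n, :] = b[i, j, indices[i, j, n], :], can be illustrated outside TensorFlow. The sketch below is a NumPy analogue, not TensorFlow code: np.argsort stands in for tf.nn.top_k and np.take_along_axis does the gather. In newer TensorFlow releases, tf.gather(b, indices, batch_dims=2) should be the corresponding call, though that is not tested here.

```python
import numpy as np

a = np.arange(30).reshape(2, 5, 3)
b = np.arange(210).reshape(2, 5, 3, 7)
k = 2

# Indices of the k largest entries along the last axis of a, largest first
# (NumPy stand-in for the indices returned by tf.nn.top_k).
indices = np.argsort(-a, axis=-1)[..., :k]             # shape (2, 5, 2)
sorted_a = np.take_along_axis(a, indices, axis=-1)      # shape (2, 5, 2)

# Add a trailing axis so the indices broadcast over b's last dimension.
sorted_b = np.take_along_axis(b, indices[..., None], axis=2)

print(sorted_b.shape)                                              # (2, 5, 2, 7)
print(np.array_equal(sorted_b[0, 1, 0], b[0, 1, indices[0, 1, 0]]))  # True
```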

Is there a Julia analogue to numpy.argmax?

In Python, there is numpy.argmax:
In [7]: a = np.random.rand(5,3)
In [8]: a
Out[8]:
array([[ 0.00108039, 0.16885304, 0.18129883],
[ 0.42661574, 0.78217538, 0.43942868],
[ 0.34321459, 0.53835544, 0.72364813],
[ 0.97914267, 0.40773394, 0.36358753],
[ 0.59639274, 0.67640815, 0.28126232]])
In [10]: np.argmax(a,axis=1)
Out[10]: array([2, 1, 2, 0, 1])
Is there a Julia analogue to NumPy's argmax? I only found indmax, which only accepts a vector, not a two-dimensional array as np.argmax does.
The fastest implementation will usually be findmax (which allows you to reduce over multiple dimensions at once, if you wish):
julia> a = rand(5, 3)
5×3 Array{Float64,2}:
0.867952 0.815068 0.324292
0.44118 0.977383 0.564194
0.63132 0.0351254 0.444277
0.597816 0.555836 0.32167
0.468644 0.336954 0.893425
julia> mxval, mxindx = findmax(a; dims=2)
([0.8679518267243425; 0.9773828942695064; … ; 0.5978162823947759; 0.8934254589671011], CartesianIndex{2}[CartesianIndex(1, 1); CartesianIndex(2, 2); … ; CartesianIndex(4, 1); CartesianIndex(5, 3)])
julia> mxindx
5×1 Array{CartesianIndex{2},2}:
CartesianIndex(1, 1)
CartesianIndex(2, 2)
CartesianIndex(3, 1)
CartesianIndex(4, 1)
CartesianIndex(5, 3)
According to the Numpy documentation, argmax provides the following functionality:
numpy.argmax(a, axis=None, out=None)
Returns the indices of the maximum values along an axis.
I doubt a single Julia function does that, but combining mapslices and argmax is just the ticket:
julia> a = [ 0.00108039 0.16885304 0.18129883;
0.42661574 0.78217538 0.43942868;
0.34321459 0.53835544 0.72364813;
0.97914267 0.40773394 0.36358753;
0.59639274 0.67640815 0.28126232] :: Array{Float64,2}
julia> mapslices(argmax,a,dims=2)
5x1 Array{Int64,2}:
3
2
3
1
2
Of course, because Julia's array indexing is 1-based (whereas Numpy's array indexing is 0-based), each element of the resulting Julia array is offset by 1 compared to the corresponding element in the resulting Numpy array. You may or may not want to adjust that.
If you want to get a vector rather than a 2D array, you can simply tack [:] at the end of the expression:
julia> b = mapslices(argmax,a,dims=2)[:]
5-element Array{Int64,1}:
3
2
3
1
2
To add to jub0bs's answer: argmax in Julia 1.0+ mirrors the behavior of np.argmax, with the dims keyword replacing axis, except that it returns a CartesianIndex instead of the index along the given dimension:
julia> a = [ 0.00108039 0.16885304 0.18129883;
0.42661574 0.78217538 0.43942868;
0.34321459 0.53835544 0.72364813;
0.97914267 0.40773394 0.36358753;
0.59639274 0.67640815 0.28126232] :: Array{Float64,2}
julia> argmax(a, dims=2)
5×1 Array{CartesianIndex{2},2}:
CartesianIndex(1, 3)
CartesianIndex(2, 2)
CartesianIndex(3, 3)
CartesianIndex(4, 1)
CartesianIndex(5, 2)
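To relate the Julia results back to NumPy, here is a small Python sketch using the matrix from the question. It converts the 0-based column indices from np.argmax into 1-based (row, column) pairs, which should line up with the CartesianIndex values above.

```python
import numpy as np

a = np.array([[0.00108039, 0.16885304, 0.18129883],
              [0.42661574, 0.78217538, 0.43942868],
              [0.34321459, 0.53835544, 0.72364813],
              [0.97914267, 0.40773394, 0.36358753],
              [0.59639274, 0.67640815, 0.28126232]])

# 0-based column index of the maximum in each row.
cols0 = np.argmax(a, axis=1)

# 1-based (row, column) pairs, comparable to Julia's CartesianIndex output.
pairs = [(i + 1, int(c) + 1) for i, c in enumerate(cols0)]

print(cols0.tolist())   # [2, 1, 2, 0, 1]
print(pairs)            # [(1, 3), (2, 2), (3, 3), (4, 1), (5, 2)]
```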

Numpy : resize array

I have two NumPy arrays whose sizes are 994 and 1000. When I perform the operation below:
X * Y
I get the error "ValueError: operands could not be broadcast together with shapes (994) (1000)".
As a fix, I am trying to pad trailing zeros onto the smaller array with the method below:
padzero = 0
if(bw.size > w.size):
    padzero = bw.size - w.size
    w = np.pad(w,padzero, 'constant', constant_values=0)
if(bw.size < w.size):
    padzero = w.size - bw.size
    bw = np.pad(bw,padzero, 'constant', constant_values=0)
But now the issue is that if the size difference is 6, then 12 zeros get padded onto the array, when it should be exactly six in my case.
I tried many ways to achieve this, but none resolved the issue. If I try the way below:
bw = np.pad(bw,padzero/2, 'constant', constant_values=0)
I get:
ValueError: Unable to create correctly shaped tuple from 3.0
How can I fix the issue?
a = np.array([1, 2, 3])
To insert zeros at the front:
np.pad(a,(2,0),'constant', constant_values=0)
array([0, 0, 1, 2, 3])
To insert zeros at the back:
np.pad(a,(0,2),'constant', constant_values=0)
array([1, 2, 3, 0, 0])
Front and back:
np.pad(a,(1,1),'constant', constant_values=0)
array([0, 1, 2, 3, 0])
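Applied to the question: a single integer pad width pads that many zeros on both sides, which is why a difference of 6 produced 12 zeros. Passing a (before, after) tuple pads only one end. A minimal sketch follows; pad_to_same_length is a hypothetical helper name, not part of NumPy, and the all-ones arrays are made-up test data.

```python
import numpy as np

def pad_to_same_length(w, bw):
    """Zero-pad the shorter 1-D array at the end so both have equal length.

    Hypothetical helper for illustration, not a NumPy function.
    """
    diff = bw.size - w.size
    if diff > 0:
        # w is shorter: append diff zeros, none at the front.
        w = np.pad(w, (0, diff), 'constant', constant_values=0)
    elif diff < 0:
        # bw is shorter: append -diff zeros, none at the front.
        bw = np.pad(bw, (0, -diff), 'constant', constant_values=0)
    return w, bw

w = np.ones(994)
bw = np.ones(1000)
w, bw = pad_to_same_length(w, bw)

print(w.shape, bw.shape)   # (1000,) (1000,)
print((w * bw).sum())      # 994.0
```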