Vectors vs ndarrays in pandas/numpy - pandas

I know that a 4-D vector has shape (4, 1): it lives in 4-dimensional space, but its ndim is 2. For an ndarray to have 4 dimensions, its shape should be something like (2, 3, 4, 5).
So, does the concept of dimension differ between vectors and matrices (or arrays)? I'm trying to understand this from a mathematical perspective and how it carries over to pandas/numpy programming.

The dimensionality of a mathematical object is usually determined by the number of independent parameters in that particular object. For example, a 4-D vector is mathematically 4 dimensional because it contains 4 independent elements (unless some relation between them has been specified). Such a vector, if represented as a column vector in numpy, would have a shape (4, 1) because it has 4 rows and 1 column. The transpose of this vector, a row vector, has shape (1, 4) because it has 1 row and 4 columns. A plain 1-D numpy array with 4 elements has shape (4,) and ndim 1; it is neither a row nor a column vector, although numpy prints it like a row.
Note however, that the column vector and row vector are dimensionally equivalent mathematically. Both have 4 dimensions.
For a 3 x 3 matrix, the most general mathematical dimension is 9, because it has 9 independent elements in general. The shape of a corresponding numpy array would be (3, 3). If you're looking for the total number of elements in any numpy array, ndarray.size is the way to go.
ndarray.ndim, however, yields the number of axes in a numpy array. That is, the number of directions along which values are placed (sloppy terminology!). So for the 3 x 3 matrix, ndim yields 2. For an array of shape (3, 7, 2, 1), ndim would yield 4. But, as we already discussed, the mathematical dimension would generally be 3 x 7 x 2 x 1 = 42 (so this array lives in a 42-dimensional space, even though the numpy array has just 4 axes). As you might've already noticed, ndarray.size is just the product of the numbers in ndarray.shape.
Note that these are not just concepts of programming. We are used to saying "2-D matrices" in mathematics, but that is not to be confused with the space in which the matrices reside.
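As a small sketch of the distinction above (the variable names are only illustrative), the following prints shape, ndim and size for a column vector, its transpose, a plain 1-D array, a 3 x 3 matrix and an array of shape (3, 7, 2, 1):

```python
import numpy as np

col = np.arange(4).reshape(4, 1)   # column vector: 4 rows, 1 column
row = col.T                        # row vector: 1 row, 4 columns
flat = np.arange(4)                # plain 1-D array, neither row nor column
mat3 = np.eye(3)                   # 3 x 3 matrix
big = np.zeros((3, 7, 2, 1))       # array with 4 axes

for name, arr in [("col", col), ("row", row), ("flat", flat),
                  ("mat3", mat3), ("big", big)]:
    # shape: length of each axis, ndim: number of axes, size: number of elements
    print(name, arr.shape, arr.ndim, arr.size)
```

In every case size is the product of the entries of shape, while ndim is just the length of shape.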

Related

How can I reconstruct original matrix from SVD components with following shapes?

I am trying to reconstruct the following matrix of shape (256 x 256 x 2) with SVD components as
U.shape = (256, 256, 256)
s.shape = (256, 2)
vh.shape = (256, 2, 2)
I have already tried the methods from the numpy and scipy documentation to reconstruct the original matrix, but they failed multiple times. I think a 3D matrix may need a different way of reconstruction.
I am using numpy.linalg.svd for the decomposition.
From np.linalg.svd's documentation:
"... If a has more than two dimensions, then broadcasting rules apply, as explained in :ref:routines.linalg-broadcasting. This means that SVD is
working in "stacked" mode: it iterates over all indices of the first
a.ndim - 2 dimensions and for each combination SVD is applied to the
last two indices."
This means that you only need to handle the s matrix (or tensor in general case) to obtain the right tensor. More precisely, what you need to do is pad s appropriately and then take only the first 2 columns (or generally, the number of rows of vh which should be equal to the number of columns of the returned s).
Here is working code with an example for your case:
import numpy as np
mat = np.random.randn(256, 256, 2) # Your matrix of dim 256 x 256 x2
u, s, vh = np.linalg.svd(mat) # Get the decomposition
# Pad the singular values' arrays, obtain diagonal matrix and take only first 2 columns:
s_rep = np.apply_along_axis(lambda _s: np.diag(np.pad(_s, (0, u.shape[1]-_s.shape[0])))[:, :_s.shape[0]], 1, s)
mat_reconstructed = u @ s_rep @ vh
mat_reconstructed equals mat up to floating-point precision error.
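An alternative sketch, not part of the answer above: passing full_matrices=False makes np.linalg.svd return reduced factors, so the singular values can be broadcast onto the columns of u without padding. A smaller stack of shape (6, 6, 2) is assumed here only to keep the example fast; the same lines should work for (256, 256, 2).

```python
import numpy as np

rng = np.random.default_rng(0)
mat = rng.standard_normal((6, 6, 2))   # stack of six 6 x 2 matrices

# Reduced SVD: u is (6, 6, 2), s is (6, 2), vh is (6, 2, 2)
u, s, vh = np.linalg.svd(mat, full_matrices=False)

# Scale each column of u by its singular value, then multiply by vh
mat_reconstructed = (u * s[..., None, :]) @ vh

print(u.shape, s.shape, vh.shape)
print(np.allclose(mat, mat_reconstructed))
```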

How to create a 3-D array by multiplying vectors from two 2-D matrices

I have two 2-D matrices which have a shared axis.
I want to get a 3-D array that holds the results of every pairwise multiplication made between all the combinations of vectors from each matrix along that shared axis.
What is the best way to achieve this? (assuming that the matrices are big)
As an illustration, let's say I have 100 technicians and 1000 customers.
For each of these individuals I have a 1-D array with ones and zeros representing their availability on each day of the week.
That's a 7x100 matrix for the technicians, a 7x1000 matrix for the customers.
import numpy as np
technicians = np.random.randint(low=0,high=2,size=(7,100))
customers = np.random.randint(low=0,high=2,size=(7,1000))
result = solution(technicians, customers)
result.shape # (7,100,1000)
I want to find for each technician-customer couple the days they are both available.
If I perform a pairwise multiplication between each combination of technician availability and customer availability, I get a 1-D array that shows for each couple whether they are both available on each day. Together they create the 3-D array I'm aiming for, shaped something like 7x100x1000.
Thanks!
Try
ans = technicians.reshape((7, 1, 100)) * customers.reshape((7, 1000, 1))
This makes use of NumPy broadcasting.
General Broadcasting Rules: When operating on two arrays, NumPy
compares their shapes element-wise. It starts with the trailing
dimensions, and works its way forward. Two dimensions are compatible
when
(1) they are equal, or (2) one of them is 1
Now, we are matching the shape of technicians and customers as
technician : 7 x 1 x 100
customers : 7 x 1000 x 1
Result (3d array): 7 x 1000 x 100
using reshape. Then, we can apply elementwise multiplication with *.
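If the axis order from the question, (7, 100, 1000), is wanted instead, a minimal sketch is to put the length-1 axes the other way round. Indexing with None is used here in place of reshape, which is only a stylistic choice, and the random data is generated with default_rng purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
technicians = rng.integers(0, 2, size=(7, 100))    # days x technicians
customers = rng.integers(0, 2, size=(7, 1000))     # days x customers

# (7, 100, 1) * (7, 1, 1000) broadcasts to (7, 100, 1000)
result = technicians[:, :, None] * customers[:, None, :]

print(result.shape)
# result[:, i, j] holds the days on which technician i and customer j are both free
print(result[:, 3, 5])
```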

NumPy indexing ambiguity in 3D arrays [duplicate]

I have the following minimal example:
a = np.zeros((5,5,5))
a[1,1,:] = [1,1,1,1,1]
print(a[1,:,range(4)])
I would expect as output an array with 5 rows and 4 columns, where we have ones on the second row. Instead it is an array with 4 rows and 5 columns with ones on the second column. What is happening here, and what can I do to get the output I expected?
This is an example of mixed basic and advanced indexing, as discussed in https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
The slice dimension has been appended to the end.
With one scalar index this is a marginal case for the ambiguity described there. It's been discussed in previous SO questions and one or more bug/issues.
Numpy sub-array assignment with advanced, mixed indexing
In this case you can replace the range with a slice, and get the expected order:
In [215]: a[1,:,range(4)].shape
Out[215]: (4, 5) # slice dimension last
In [216]: a[1,:,:4].shape
Out[216]: (5, 4)
In [219]: a[1][:,[0,1,3]].shape
Out[219]: (5, 3)
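The same comparison as a runnable sketch, using the array from the question. It also shows that transposing the mixed-indexing result gives the layout the question expected:

```python
import numpy as np

a = np.zeros((5, 5, 5))
a[1, 1, :] = 1

mixed = a[1, :, range(4)]      # scalar + slice + list: the slice axis ends up last
sliced = a[1, :, :4]           # basic indexing only: axes keep their order
two_step = a[1][:, range(4)]   # index the first axis, then pick the columns

print(mixed.shape)    # (4, 5)
print(sliced.shape)   # (5, 4)
print(np.array_equal(mixed.T, sliced))
```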

PCA sklearn - Which dimension does it take

Does sklearn PCA consider the columns of the dataframe as the vectors to reduce or the rows as vectors to reduce ?
Because when doing this:
import pandas as pd
from sklearn.decomposition import PCA
df = pd.DataFrame([[1,-21,45,3,4],[4,5,89,-5,6],[7,-4,58,1,19],[10,11,74,20,12],[13,14,15,45,78]])  # 5 rows, 5 columns
pca = PCA(n_components=3)
pca.fit(df)
df_pcs = pd.DataFrame(data=pca.components_, index=df.index)
I get the following error:
ValueError: Shape of passed values is (5, 3), indices imply (5, 5)
Rows represent samples and columns represent features. PCA reduces the dimensionality of the data, i.e. the number of features, so it works on the columns.
If you are talking about vectors, it treats each row as a single feature vector and reduces its size.
If you have a dataframe of shape, say, [100, 6] and n_components is set to 3, your output will have shape [100, 3].
# You need this
df_pcs=pca.transform(df)
# This produces an error because the shapes don't match.
df_pcs=pd.DataFrame(data=pca.components_, index = df.index)
pca.components_ is an array of shape [3, 5], so it has 3 rows, while your index parameter uses df.index, which has shape [5,]. Hence the error. pca.components_ represents a completely different thing from the transformed data.
According to documentation:-
components_ : array, [n_components, n_features]
Principal axes in feature space, representing the
directions of maximum variance in the data.
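A short sketch of the two shapes side by side. The column labels "a" to "e" and the PC1..PC3 names are made up for illustration; the data is the frame from the question:

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame(
    [[1, -21, 45, 3, 4], [4, 5, 89, -5, 6], [7, -4, 58, 1, 19],
     [10, 11, 74, 20, 12], [13, 14, 15, 45, 78]],
    columns=list("abcde"),   # hypothetical feature names
)

pca = PCA(n_components=3)
pca.fit(df)

# Transformed samples: one row per original row, one column per component
scores = pd.DataFrame(pca.transform(df), index=df.index,
                      columns=["PC1", "PC2", "PC3"])

# Principal axes: one row per component, one column per original feature
components = pd.DataFrame(pca.components_, columns=df.columns,
                          index=["PC1", "PC2", "PC3"])

print(scores.shape)       # (5, 3)
print(components.shape)   # (3, 5)
```

df.index fits the transformed samples because they keep one row per sample; the components are indexed by component and labelled by feature instead.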

Why does numpy.dot(a, b) raise an error while numpy.dot(b, a) produces output?

I'm trying to understand why numpy's dot function behaves as it does:
t1 = np.array( [1, 0] )
t2 = np.array( [ [7,6],
[7,6],
[7,6],
[7,6]] )
np.dot(t1, t2) fails because the shapes do not fit for matrix multiplication:
ValueError: shapes (2,) and (4,2) not aligned: 2 (dim 0) != 4 (dim 0)
That is right, and I can understand it. But why does np.dot(t2, t1) produce output instead of failing in the same way as np.dot(t1, t2)? It seems the different order of the arguments is interpreted differently.
[7 7 7 7]
Thanks.
Please refer to the documentation:
Function raises ValueError:
If the last dimension of a is not the same size as the second-to-last dimension of b.
Notice you are not only working with 1D arrays:
In [6]: t1.ndim
Out[6]: 1
In [7]: t2.ndim
Out[7]: 2
So, t2 is a 2D array.
You can also see this in the output of t2.shape: (4,2) indicates two dimensions, whereas (2,) is one dimension.
The behaviour of np.dot is different for 1D and 2D arrays (from the docs):
For 2-D arrays it is equivalent to matrix multiplication, and for 1-D
arrays to inner product of vectors
That is the reason you get different results: you are mixing 1D and 2D arrays. np.dot(t2, t1) sums over the last axis of t2 (length 2) and the only axis of t1 (length 2), so it works as a matrix-vector product. np.dot(t1, t2) sums over the only axis of t1 (length 2) and the second-to-last axis of t2 (length 4), so the lengths do not match and it fails.
In the matrix multiplication case (refer to the docs): if the second argument is 1-D, it is promoted to a matrix by appending a 1 to its dimensions, and after matrix multiplication the appended 1 is removed. In simple words, t2 has shape (4,2) and t1 has shape (2,). Since t1 is 1D, its shape is treated as (2,1), and after the multiplication the 1 is removed. Hence, if you store the output of the dot product, you can check that its shape is (4,).
t = np.dot(t2,t1)
t.shape
Out[57]: (4,)
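A runnable sketch of both argument orders with the arrays from the question. The last call, which transposes t2, is an addition for illustration of how the failing order can be made to line up:

```python
import numpy as np

t1 = np.array([1, 0])
t2 = np.array([[7, 6],
               [7, 6],
               [7, 6],
               [7, 6]])

# (4, 2) with (2,): last axis of t2 matches t1, result has shape (4,)
ok = np.dot(t2, t1)
print(ok, ok.shape)

# (2,) with (4, 2): t1 has length 2 but the second-to-last axis of t2 has length 4
try:
    np.dot(t1, t2)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# Transposing t2 to (2, 4) makes the lengths agree again
also_ok = np.dot(t1, t2.T)
print(also_ok, also_ok.shape)
```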