Implementing custom matrix multiplication-like operations in numpy - pandas

I want to implement an operation on two matrices that is similar to matrix multiplication in that each element of the resulting matrix is a function of the ith row of the first matrix and the jth column of the second matrix.
I would like to be able to do this using numpy and/or pandas using vectorized computations.
In other words:
How do I implement $A \otimes B = C$,
where $C_{ij} = \sum_k f(a_{ik}, b_{kj})$, in numpy and/or pandas?
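A minimal sketch of one possible implementation (assuming f is a vectorized binary function such as np.multiply or np.minimum; the (n, k, m) intermediate array means this trades memory for generality):

import numpy as np

def generalized_matmul(A, B, f):
    # A has shape (n, k), B has shape (k, m).
    # Broadcast A's rows against B's columns, apply f elementwise,
    # then reduce over the shared k axis.
    return f(A[:, :, None], B[None, :, :]).sum(axis=1)

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
# With f = np.multiply this reduces to ordinary matrix multiplication.
assert np.array_equal(generalized_matmul(A, B, np.multiply), A @ B)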

Related

How to vectorize this operation in numpy?

I have a 2d array s and I want to calculate the differences between every pair of rows, i.e. d[i, j] = s[i] - s[j].
Since this cannot be written as a single matrix multiplication, I was wondering what the proper way to vectorize it is?
You can use broadcasting for that: d = s[:, None, :] - s[None, :, :]. Note that None creates a new dimension of length 1, and numpy then implicitly performs the broadcasting between the two arrays.
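A small runnable sketch of that answer (assuming s has shape (N, M) and d[i, j] should hold s[i] - s[j]):

import numpy as np

s = np.arange(6.0).reshape(3, 2)    # three rows of length two
d = s[:, None, :] - s[None, :, :]   # shape (3, 3, 2)
# d[i, j] holds the rowwise difference s[i] - s[j]
assert np.array_equal(d[1, 2], s[1] - s[2])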

A more efficient way of creating an NxM array in Python

In Python, I need to create an NxM matrix in which the (i, j) entry has value i^2 + j^2.
I'm currently constructing it with two for loops, but the array is quite big, the computation time is long, and I need to perform it several times. Is there a more efficient way of constructing such a matrix, perhaps using numpy?
You can use broadcasting in numpy; see the official documentation for details. For example,
import numpy as np
N = 3; M = 4  # whatever values you'd like
a = (np.arange(N)**2).reshape((-1, 1))  # reshape into a column vector
b = np.arange(M)**2                     # stays a 1d row
print(a + b)  # broadcasting produces the full N x M table
Instead of using np.arange(), you can use np.array([...some array...]) to supply custom values.
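For comparison, a sketch of the same table built in a single call with np.add.outer, which performs the equivalent broadcasting internally:

import numpy as np

N, M = 3, 4
grid = np.add.outer(np.arange(N)**2, np.arange(M)**2)
# grid[i, j] == i**2 + j**2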

How can I combine multiple numpy arrays into a single memoryview for cython?

I have a list of varying size that contains numpy arrays with the same data type and shape. I would like to process this data using a function written in Cython without copying the data. Both memoryviews and the Python buffer protocol seem to support this kind of data using indirect for the first dimension. So I was hoping that something like this could work:
%%cython
from cython.view cimport indirect
def test(list a):
    cdef double[::indirect, :] x
    x = a
    x[0, 0] = 42
Unfortunately, it doesn't.
Is there a way to convert this list of numpy arrays into such a memoryview?
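A workaround sketch, not an answer to the indirect-memoryview question itself (it assumes you control the allocation): create the arrays as rows of one contiguous block up front, so a plain 2D typed memoryview covers all of them with no later copies:

import numpy as np

n_arrays, length = 5, 10
block = np.empty((n_arrays, length), dtype=np.float64)  # one contiguous buffer
views = [block[i] for i in range(n_arrays)]             # zero-copy row views
# In Cython, an ordinary double[:, :] memoryview over block now reaches
# every array without needing the indirect first dimension.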

Huge sparse dataframe to scipy sparse matrix without dense transform

I have data with more than 1 million rows and 30 columns; one of the columns is user_id (with more than 1500 distinct users).
I want to one-hot-encode this column and use the data in ML algorithms (xgboost, FFM, scikit-learn). But given the huge number of rows and unique user values, the matrix will be ~1 million x 1500, so it has to be kept in a sparse format (otherwise the data kills all the RAM).
For me, the convenient way to work with the data is through a pandas DataFrame, which now also supports a sparse format:
df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
This works pretty fast and has a small footprint in RAM. But to work with scikit-learn algorithms and xgboost, it's necessary to transform the DataFrame into a sparse matrix.
Is there any way to do this other than iterating through the columns and hstacking them into one scipy sparse matrix?
I tried df.as_matrix() and df.values, but both first transform the data to dense, which raises a MemoryError :(
P.S.
The same applies to getting a DMatrix for xgboost.
UPDATE:
So I came up with the following solution (I'd be thankful for optimization suggestions):
import pandas as pd
from scipy import sparse

def sparse_df_to_sparse_matrix(sparse_df):
    index_list = sparse_df.index.values.tolist()
    matrix_columns = []
    sparse_matrix = None
    for column in sparse_df.columns:
        sps_series = sparse_df[column]
        # Give the series a (row, column) MultiIndex so to_coo() can build one column
        sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
        curr_sps_column, rows, cols = sps_series.to_coo()
        if sparse_matrix is not None:
            sparse_matrix = sparse.hstack([sparse_matrix, curr_sps_column])
        else:
            sparse_matrix = curr_sps_column
        matrix_columns.extend(cols)
    return sparse_matrix, index_list, matrix_columns
And the following code produces the sparse DataFrame:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
full_sparse_df = one_hot_df.to_sparse(fill_value=0)
With this I created a sparse matrix of 1.1 million rows x 1150 columns. But during creation it still uses a significant amount of RAM (~10 GB, right at the edge of my 12 GB).
I don't know why, because the resulting sparse matrix uses only 300 MB (after loading from HDD). Any ideas?
You should be able to use the experimental .to_coo() method in pandas [1] in the following way:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
one_hot_df, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()
This method, instead of taking a DataFrame (rows / columns), takes a Series with the rows and columns in a MultiIndex (this is why you need the .stack() method). This Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So, you need to use the .to_sparse() method before calling .to_coo().
The Series returned by .stack(), even if it's not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).
[1] http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
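A tiny end-to-end illustration of that recipe (hedged: it assumes a pandas version that still ships SparseSeries and .to_sparse(); this experimental API was removed in pandas 1.0):

import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 2], 'type': ['a', 'b', 'a']})
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
coo, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()
print(coo.shape)  # a scipy.sparse COO matrix, built without a dense intermediate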
Does my answer from a few months back help?
Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory
It was accepted but I didn't get any further feedback.
I'm familiar with the scipy sparse formats and their inputs, but don't know much about pandas sparse.

einsum on a sparse matrix

It seems numpy's einsum function does not work with scipy.sparse matrices. Are there alternatives to do the sorts of things einsum can do with sparse matrices?
In response to @eickenberg's answer: The particular einsum I want to do is numpy.einsum("ki,kj->ij", A, A), the sum of the outer products of the rows.
A restriction of scipy.sparse matrices is that they represent linear operators and are thus kept two dimensional, which leads to the question: Which operation are you seeking to do?
All einsum operations on a pair of 2D matrices are very easy to write without einsum using dot, transpose and pointwise operations, provided that the result does not exceed two dimensions.
So if you need a specific operation on a number of sparse matrices, it is probable that you can write it without einsum.
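For instance, a sketch of that correspondence on small dense arrays (the same identities carry over to the sparse operators):

import numpy as np

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
assert np.allclose(np.einsum("ij,jk->ik", A, B), A.dot(B))  # matrix product
assert np.allclose(np.einsum("ij,ij->ij", A, A), A * A)     # pointwise product
assert np.allclose(np.einsum("ij->ji", A), A.T)             # transpose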
UPDATE: A specific way to implement np.einsum("ki, kj -> ij", A, A) is A.T.dot(A). In order to convince yourself, please try the following example:
import numpy as np
rng = np.random.RandomState(42)
a = rng.randn(3, 3)
b = rng.randn(3, 3)
the_einsum_ab = np.einsum("ki, kj -> ij", a, b)
the_a_transpose_times_b = a.T.dot(b)
# We write a test in order to assert equality
from numpy.testing import assert_array_equal
assert_array_equal(the_einsum_ab, the_a_transpose_times_b) # This passes, so equality
This result is slightly more general. Now if you use b = a you obtain your specific result.
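Applied to an actual scipy.sparse matrix (a sketch; .T and .dot stay sparse throughout, so no dense intermediate is formed):

import numpy as np
from scipy import sparse

A = sparse.random(5, 3, density=0.4, format='csr', random_state=0)
gram = A.T.dot(A)  # sparse result, equal to einsum("ki,kj->ij") on the dense copies
dense_ref = np.einsum("ki,kj->ij", A.toarray(), A.toarray())
assert np.allclose(gram.toarray(), dense_ref)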
einsum translates the index string into a calculation using the C version of np.nditer. http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html is a nice introduction to nditer. Note especially the Cython example at the end.
https://github.com/hpaulj/numpy-einsum/blob/master/einsum_py.py is a Python simulation of the einsum.
scipy.sparse has its own code (ultimately in C) to perform the basic operations: summation, matrix multiplication, etc. Sparse matrices have their own data structures; depending on the format, they can be lists, dictionaries, or a set of numpy arrays. Numpy notation can be used because sparse implements the appropriate __xxx__ methods.
A sparse matrix is a matrix, a 2d array object. A sparse einsum could be written, but it would end up using the sparse matrix multiplication, not nditer. So at best it would be a notational convenience.
Sparse csr_matrix.dot is:
def dot(self, other):
    """Ordinary dot product
    ...
    """
    return self * other
import numpy as np
from scipy import sparse

A = sparse.csr_matrix([[1, 2], [3, 4]])
A.dot(A.T).A                    # sparse product, densified with .A
(A * A.T).A                     # the same product via the * operator
A.__mul__(A.T).A                # the dunder behind *
A.T.__rmul__(A).A               # __rmul__(other) computes other * self, i.e. A * A.T
np.einsum('ij,kj', A.A, A.A)    # dense einsum equivalent
# array([[ 5, 11],
#        [11, 25]])