Convert Breeze Matrix to NumPy Array

Is it possible to convert a Breeze dense matrix to a NumPy array using Spark?
I have a Breeze DenseMatrix that I want to convert to a NumPy array.

Here is a way that works correctly but is slow / inefficient (it creates multiple copies). I used the Zeppelin spark and pyspark interpreters (I guess Toree should also work):
In Spark:
%spark
import breeze.linalg._
import breeze.numerics._
z.put("matrix", DenseMatrix.eye[Double](4));
z.get("matrix")
Then in Python:
%pyspark
import numpy as np
def breeze2numpy(breeze_matrix):
    # copying the JVM-side data into a Python list is the expensive step
    data = list(breeze_matrix.copy().data())
    # Breeze stores DenseMatrix data column-major, hence order='F'
    return np.array(data).reshape(breeze_matrix.rows(), breeze_matrix.cols(), order='F')
breeze2numpy(z.z.get("matrix"))
This works but will be impractical for big datasets (because of the multiple copies involved via a Python list). It would be nice to have a zero-copy method using Python's buffer protocol, like there is for C++ Eigen matrix --> NumPy array conversion.
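One hedged idea for at least cutting out the per-element Python list (the function breeze_bytes_to_numpy and the raw byte handoff below are illustrative assumptions, not a tested Zeppelin/py4j API) is to move the matrix contents across as a single byte array and view it with np.frombuffer:

import numpy as np

# Hypothetical sketch: assumes the JVM side hands over the matrix contents as
# one raw little-endian float64 byte array (e.g. a byte-serialized copy of
# DenseMatrix.data). np.frombuffer then views those bytes directly instead of
# building a per-element Python list.
def breeze_bytes_to_numpy(raw_bytes, rows, cols):
    flat = np.frombuffer(raw_bytes, dtype='<f8')  # read-only view, no extra copy here
    # Breeze DenseMatrix is column-major, hence order='F'
    return flat.reshape(rows, cols, order='F')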

Related

Any way to use ":" to access a row for a 1D NumPy array

import numpy as np
A = np.array([1, 2, 3])
Is there any way to achieve A[1, :]? In MATLAB this is fine.
If you want to treat your NumPy array as a 2-dimensional array like in MATLAB, you have to tell it explicitly, by creating a new array using np.newaxis.
import numpy as np
A = np.array([1, 2, 3])
print(A)
B = A[np.newaxis, :]
print(B)
# Here you go
print(B[0, :])
Side note:
I wrote B[0, :], not B[1, :], because Python array indices are 0-based, not 1-based like MATLAB's.
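As a small addition not in the original answer, reshape(1, -1) produces the same one-row 2-D view without np.newaxis, so the MATLAB-style row slice works the same way:

import numpy as np

A = np.array([1, 2, 3])
B = A.reshape(1, -1)   # 1 row, as many columns as needed
print(B[0, :])         # [1 2 3]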

NumPy broadcasting comparison reports "'bool' object has no attribute 'sum'" error when dealing with a large dataframe

I use NumPy broadcasting to get the difference matrix from a pandas dataframe. I find that when dealing with a large dataframe, it reports a "'bool' object has no attribute 'sum'" error, while with a small dataframe it runs fine.
I have posted the two CSV files at the following links:
large file
small file
import numpy as np
import pandas as pd

df_small = pd.read_csv(r'test_small.csv', index_col='Key')
df_small.fillna(0, inplace=True)
a_small = df_small.to_numpy()
# broadcast each row against every other row and count differing columns
matrix = pd.DataFrame((a_small != a_small[:, None]).sum(2), index=df_small.index, columns=df_small.index)
print(matrix)
When running this, I get the difference matrix.
When I switch to the large file, it reports the "'bool' object has no attribute 'sum'" error quoted above. Does anybody know why this happens?
EDIT: The NumPy version is 1.19.5:
np.__version__
'1.19.5'
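The error message itself means that the broadcast comparison returned a single Python bool rather than a boolean array before .sum(2) was called. One hedged workaround, which also keeps the intermediate memory bounded, is to build the difference matrix in row chunks; the function name difference_matrix and chunk_size below are illustrative, not from the original post:

import numpy as np
import pandas as pd

# Hedged sketch: compute the pairwise difference-count matrix in row chunks so
# the full (n, n, m) boolean intermediate is never materialized at once.
def difference_matrix(df, chunk_size=500):
    a = df.to_numpy()
    n = a.shape[0]
    out = np.empty((n, n), dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # compare chunk_size rows against all rows, count differing columns
        out[start:stop] = (a[start:stop, None, :] != a[None, :, :]).sum(axis=2)
    return pd.DataFrame(out, index=df.index, columns=df.index)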

Implementing pythonic statistical functions on spark dataframes

I have very large datasets in spark dataframes that are distributed across the nodes.
I can compute simple statistics like mean, stdev, skewness, kurtosis, etc. using the Spark functions in pyspark.sql.functions.
If I want to use advanced statistical tests like Jarque-Bera (JB) or Shapiro-Wilk (SW), I use Python libraries like SciPy, since the standard Apache PySpark libraries don't have them. But in order to do that, I have to convert the Spark dataframe to pandas, which means forcing all the data onto the master node, like so:
import scipy.stats as stats
pandas_df=spark_df.toPandas()
JBtest=stats.jarque_bera(pandas_df)
SWtest=stats.shapiro(pandas_df)
I have multiple features, and each feature ID corresponds to a dataset on which I want to perform the test statistic.
My question is:
Is there a way to apply these pythonic functions on a spark dataframe while the data is still distributed across the nodes, or do I need to create my own JB/SW test statistic functions in spark?
Thank you for any valuable insight
You should be able to define a vectorized user-defined function that wraps the pandas function (https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html), like this:
from pyspark.sql.functions import pandas_udf, PandasUDFType
import scipy.stats as stats
@pandas_udf('double', PandasUDFType.SCALAR)
def vector_jarque_bera(x):
    return stats.jarque_bera(x)
# test:
spark_df.withColumn('y', vector_jarque_bera(spark_df['x']))
Note that the vectorized function takes a column as its argument and returns a column.
(NB: the @pandas_udf decorator is what transforms the pandas function defined right below it into a vectorized function. Each element of the returned vector is itself a scalar, which is why the argument PandasUDFType.SCALAR is passed.)
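One caveat worth flagging about the sketch above: scipy.stats.jarque_bera returns a (statistic, p-value) pair rather than one value per row, so if you want a single JB statistic per feature group, a grouped-aggregate pandas UDF (available in Spark 2.4+) may be a better fit. This is a hedged variant, and the column names feature_id and x are placeholders:

from pyspark.sql.functions import pandas_udf, PandasUDFType
import scipy.stats as stats

# Hedged variant: one JB statistic per group, computed while the data stays
# distributed. 'feature_id' and 'x' are illustrative column names.
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def jb_statistic(x):
    return float(stats.jarque_bera(x)[0])  # keep only the test statistic

result = spark_df.groupBy('feature_id').agg(jb_statistic(spark_df['x']).alias('jb'))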

Feature request: NumPy Reader

I have a large collection of NumPy arrays saved on disk. I would like to read them efficiently and concurrently with the training. I can't load them all into memory at once - the data set is too large.
Additionally, it would be nice to apply some user-defined transforms on the fly. It would also be nice to be able to read them from C++, not just Python.
I believe CNTK does not have this capability now; am I correct?
Currently, we don't have a built-in NumPy reader. However, you have multiple options:
Read the NumPy data in batches and feed them to the trainer. Here is an example that reads images into NumPy arrays and feeds them to the trainer (a minimal batching sketch also follows below):
https://github.com/Microsoft/FERPlus
What is the data inside your NumPy array? Can you convert it to a format readable by one of the CNTK readers?
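A minimal sketch of the first option, assuming plain .npy files on disk; the file pattern, batch size, and the commented trainer feed are illustrative assumptions, not CNTK API guarantees:

import glob
import numpy as np

# Lazily iterate over .npy files with memory-mapped loading so the whole
# dataset never has to fit in RAM; each yielded batch is a regular ndarray.
def npy_batches(pattern, batch_size=256):
    for path in sorted(glob.glob(pattern)):
        arr = np.load(path, mmap_mode='r')           # on-disk view, not a full load
        for start in range(0, arr.shape[0], batch_size):
            yield np.ascontiguousarray(arr[start:start + batch_size])

# Illustrative feed into a CNTK trainer (variable names assumed):
# for batch in npy_batches('data/*.npy'):
#     trainer.train_minibatch({features: batch})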

Fastest way to load multiple numpy arrays into spark rdd?

I'm new to Spark. In my application, I would like to create an RDD from many NumPy arrays. Each NumPy array has shape (10,000, 5,000). Currently, I'm trying the following:
rdd_list = []
for np_array in np_arrays:
    pandas_df = pd.DataFrame(np_array)
    spark_df = sqlContext.createDataFrame(pandas_df)  ## SLOW STEP
    rdd_list.append(spark_df.rdd)
big_rdd = sc.union(rdd_list)
All of the steps are fast except converting the pandas dataframe to a Spark dataframe, which is very slow. If I use a subset of the NumPy array, such as (10,000, 500), it takes a couple of minutes to convert it to a Spark dataframe. But if I use the full NumPy array (10,000, 5,000), it just hangs.
Is there anything I can do to speed up my workflow? Or should I be doing this in a completely different way? (FYI, I'm kind of stuck with the initial numpy arrays.)
For my application I used the ArrayRDD class from the sparkit-learn project to load NumPy arrays into Spark RDDs. I had no complaints, but your mileage may vary.
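For reference, here is a minimal sketch of the ArrayRDD route following the sparkit-learn README; the partition count and block size are illustrative, and an existing SparkContext sc is assumed:

import numpy as np
from splearn.rdd import ArrayRDD

# Parallelize the rows and let ArrayRDD re-block them into NumPy arrays on
# the executors; bsize controls how many rows end up in each block.
np_array = np.random.rand(10000, 5000)       # stand-in for one of the arrays
rdd = sc.parallelize(np_array, 100)          # each RDD element is one row
X = ArrayRDD(rdd, bsize=1000)                # re-blocked into (1000, 5000) ndarrays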