Fastest way to load multiple numpy arrays into spark rdd?

I'm new to Spark. In my application, I would like to create an RDD from many numpy arrays. Each numpy array is (10,000, 5,000). Currently, I'm trying the following:
rdd_list = []
for np_array in np_arrays:
    pandas_df = pd.DataFrame(np_array)
    spark_df = sqlContext.createDataFrame(pandas_df)  # SLOW STEP
    rdd_list.append(spark_df.rdd)
big_rdd = sc.union(rdd_list)
All of the steps are fast except converting the Pandas dataframe to a Spark dataframe, which is very slow. If I use a subset of the numpy array, such as (10,000, 500), it takes a couple of minutes to convert it to a Spark dataframe. But if I use the full numpy array (10,000, 5,000), it just hangs.
Is there anything I can do to speed up my workflow? Or should I be doing this in a completely different way? (FYI, I'm kind of stuck with the initial numpy arrays.)

For my application I had used the class ArrayRDD from the sparkit-learn project for loading numpy arrays into spark RDDs. I had no complaints but your mileage may vary.
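Another option, offered as a minimal sketch rather than a benchmarked solution, is to skip the pandas and Spark DataFrame round trip and parallelize blocks of rows directly. The helper name iter_row_blocks and the block size are illustrative assumptions, and the Spark call is shown only as a comment because it needs a live SparkContext (sc).

```python
import numpy as np


def iter_row_blocks(arrays, rows_per_block=1000):
    """Yield contiguous row blocks from each 2-D array (hypothetical helper)."""
    for arr in arrays:
        for start in range(0, arr.shape[0], rows_per_block):
            # Basic slicing returns a view, so no data is copied here.
            yield arr[start:start + rows_per_block]


# Small stand-ins for the (10,000, 5,000) arrays in the question.
np_arrays = [np.random.rand(25, 4) for _ in range(3)]
blocks = list(iter_row_blocks(np_arrays, rows_per_block=10))

# With a live SparkContext, each block would become one RDD element:
# big_rdd = sc.parallelize(blocks, numSlices=len(blocks))
```

Each RDD element is then a numpy block rather than a Row, so any per-row work would happen inside map or flatMap on the workers.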

Related

joblib.Memory and pandas.DataFrame inputs

I've been finding that joblib.Memory.cache results in unreliable caching when using dataframes as inputs to the decorated functions. Playing around, I found that joblib.hash results in inconsistent hashes, at least in some cases. If I understand correctly, joblib.hash is used by joblib.Memory, so this is probably the source of the problem.
Problems seem to occur when new columns are added to a dataframe followed by a copy, or when a dataframe is saved to and loaded from disk. The following example contrasts the inconsistent hashes obtained from dataframes with the consistent hashes obtained from the equivalent numpy data.
import pandas as pd
import joblib
df = pd.DataFrame({'A':[1,2,3],'B':[4.,5.,6.], })
df.index.name='MyInd'
df['B2'] = df['B']**2
df_copy = df.copy()
df_copy.to_csv("df.csv")
df_fromfile = pd.read_csv('df.csv').set_index('MyInd')
print("DataFrame Hashes:")
print(joblib.hash(df))
print(joblib.hash(df_copy))
print(joblib.hash(df_fromfile))
def _to_tuple(df):
    return (df.values, df.columns.values, df.index.values, df.index.name)
print("Equivalent Numpy Hashes:")
print(joblib.hash(_to_tuple(df)))
print(joblib.hash(_to_tuple(df_copy)))
print(joblib.hash(_to_tuple(df_fromfile)))
results in output:
DataFrame Hashes:
4e9352c1ffc14fb4bb5b1a5ad29a3def
2d149affd4da6f31bfbdf6bd721e06ef
6843f7020cda9d4d3cbf05dfc47542d4
Equivalent Numpy Hashes:
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
The "Equivalent Numpy Hashes" output is the behavior I'd like. I'm guessing the problem is due to some kind of complex internal metadata that DataFrames utilize. Is there any canonical way to use joblib.Memory.cache on pandas DataFrames so it will cache based upon the data values only?
A "good enough" workaround would be if there is a way a user can tell joblib.Memory.cache to utilize something like my _to_tuple function above for specific arguments.
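One way such a workaround might look, as a sketch under stated assumptions: Memory.cache accepts an ignore list of argument names to leave out of the cache key, so a thin wrapper can pass a value-based key alongside the dataframe and have joblib hash only the key. The names value_key, _column_sums_cached and column_sums are hypothetical, and the cache directory is a throwaway temporary folder.

```python
import tempfile

import joblib
import pandas as pd

# Throwaway cache directory for the example.
memory = joblib.Memory(tempfile.mkdtemp(), verbose=0)


def _to_tuple(df):
    return (df.values, df.columns.values, df.index.values, df.index.name)


def value_key(df):
    """Hash only the values, labels and index name (hypothetical helper)."""
    return joblib.hash(_to_tuple(df))


# `ignore` leaves `df` out of the cache key, so only `key` is hashed.
@memory.cache(ignore=["df"])
def _column_sums_cached(key, df):
    return df.sum().to_dict()


def column_sums(df):
    """Public wrapper: derive the key from the data, then call the cached function."""
    return _column_sums_cached(value_key(df), df)


df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})
df.index.name = "MyInd"
df["B2"] = df["B"] ** 2
df_copy = df.copy()

first = column_sums(df)
second = column_sums(df_copy)  # same value_key as df, so the stored result can be reused
```

The trade-off is that every cached function needs a wrapper like column_sums, and the key is only as good as what _to_tuple captures (dtypes, for example, are not compared separately).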

Numpy broadcasting comparison report "'bool' object has no attribute 'sum'" error when dealing with large dataframe

I use numpy broadcasting to get the difference matrix from a pandas dataframe. With a large dataframe it reports the error "'bool' object has no attribute 'sum'", while with a small dataframe it runs fine.
I post the two csv files in the following links:
large file
small file
import numpy as np
import pandas as pd
df_small = pd.read_csv(r'test_small.csv',index_col='Key')
df_small.fillna(0,inplace=True)
a_small = df_small.to_numpy()
matrix = pd.DataFrame((a_small != a_small[:, None]).sum(2), index=df_small.index, columns=df_small.index)
print(matrix)
When running this, I get the difference matrix.
When I switch to the large file, it reports the error quoted in the title. Does anybody know why this happens?
EDIT:The numpy version is 1.19.5
np.__version__
'1.19.5'
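One thing worth checking, stated as an assumption rather than a diagnosis: the broadcast a != a[:, None] builds an intermediate boolean array of shape (n, n, m), which grows quadratically with the number of rows. The sketch below computes the same count matrix in blocks of rows so the intermediate stays bounded; diff_count_matrix and block_rows are hypothetical names.

```python
import numpy as np


def diff_count_matrix(a, block_rows=256):
    """Count differing columns between every pair of rows, one block of rows at a time."""
    n = a.shape[0]
    out = np.empty((n, n), dtype=np.int64)
    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        # Intermediate shape is (stop - start, n, m) instead of (n, n, m).
        out[start:stop] = (a[start:stop, None, :] != a[None, :, :]).sum(axis=2)
    return out


a_demo = np.array([[1, 2, 3], [1, 0, 3], [4, 5, 6]])
counts = diff_count_matrix(a_demo, block_rows=2)
```

Wrapping the result as pd.DataFrame(diff_count_matrix(a_small), index=df_small.index, columns=df_small.index) would mirror the original one-liner.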

Create Spark DataFrame from Pandas DataFrames inside RDD

I'm trying to convert a Pandas DataFrame on each worker node (an RDD where each element is a Pandas DataFrame) into a Spark DataFrame across all worker nodes.
Example:
import pandas as pd
def read_file_and_process_with_pandas(filename):
    data = pd.read_csv(filename)
    # Some additional operations using pandas functionality.
    # Here the data is a pandas dataframe, and I am using some datetime
    # indexing which isn't available for spark dataframes.
    return data
filelist = ['file1.csv', 'file2.csv', 'file3.csv']
rdd = sc.parallelize(filelist)
rdd = rdd.map(read_file_and_process_with_pandas)
The previous operations work, so I have an RDD of Pandas DataFrames. How can I convert this then into a Spark DataFrame after I'm done with the Pandas processing?
I tried doing rdd = rdd.map(spark.createDataFrame), but when I do something like rdd.take(5), I get the following error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o103.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is there a way to convert Pandas DataFrames in each worker node into a distributed DataFrame?
See this question: https://stackoverflow.com/a/51231046/7964197
I've had to deal with the same problem, which seems quite common (reading many files using pandas, e.g. excel/pickle/any other non-spark format, and converting the resulting RDD into a spark dataframe).
The supplied code adds a new method on the SparkSession that uses pyarrow to convert the pd.DataFrame objects into arrow record batches, which are then directly converted to a pyspark.DataFrame object:
spark_df = spark.createFromPandasDataframesRDD(prdd) # prdd is an RDD of pd.DataFrame objects
For large amounts of data, this is orders of magnitude faster than converting to an RDD of Row() objects.
A Pandas dataframe cannot be converted directly to an RDD.
You can create a Spark DataFrame from a Pandas dataframe:
spark_df = context.createDataFrame(pandas_df)
Reference: Introducing DataFrames in Apache Spark for Large Scale Data Science
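The PicklingError in the question arises because spark.createDataFrame belongs to the SparkSession, which lives on the driver and cannot be shipped into a map that runs on the workers. A simple, though slower, alternative to the Arrow route is to flatten each pandas DataFrame into plain row tuples on the workers and call createDataFrame once on the driver. This is a minimal sketch: pandas_to_rows and to_spark_dataframe are hypothetical helpers, and it assumes every file yields the same columns.

```python
import pandas as pd


def pandas_to_rows(pdf):
    """Turn one pandas DataFrame into a list of plain tuples, one per row."""
    return list(pdf.itertuples(index=False, name=None))


def to_spark_dataframe(spark, pandas_rdd, columns):
    """Hypothetical helper: flatten an RDD of pandas DataFrames into one Spark DataFrame.

    Called on the driver; only `pandas_to_rows` is sent to the workers.
    """
    rows = pandas_rdd.flatMap(pandas_to_rows)
    return spark.createDataFrame(rows, schema=list(columns))


# Local check of the per-element step (no Spark needed).
sample = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
rows = pandas_to_rows(sample)
```

With the RDD from the question, the call would be to_spark_dataframe(spark, rdd, column_names), where column_names is whatever the pandas processing produces.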

Big numpy array to spark dataframe

I have a big numpy array. Its shape is (800,224,224,3), which means that there are 800 images (224 * 224) with 3 channels. For distributed deep learning in Spark, I want to change the 'numpy array' to a 'spark dataframe'.
My method is:
1. Change the numpy array to csv
2. Load the csv and make a spark dataframe with 150528 columns (224*224*3)
3. Use VectorAssembler to create a vector of all columns (features)
4. Reshape the output of step 3
However, the third step failed, possibly because the computation is too heavy.
In order to make a vector from this:
+------+------+
|col_1 | col_2|
+------+------+
|0.1434|0.1434|
|0.1434|0.1451|
|0.1434|0.1467|
|0.3046|0.3046|
|0.3046|0.3304|
|0.3249|0.3046|
|0.3249|0.3304|
|0.3258|0.3258|
|0.3258|0.3263|
|0.3258|0.3307|
+------+------+
to this:
+-------------+
| feature |
+-------------+
|0.1434,0.1434|
|0.1434,0.1451|
|0.1434,0.1467|
|0.3046,0.3046|
|0.3046,0.3304|
|0.3249,0.3046|
|0.3249,0.3304|
|0.3258,0.3258|
|0.3258,0.3263|
|0.3258,0.3307|
+-------------+
But the number of columns is very large...
I also tried to convert the numpy array to an rdd directly, but I got an 'out of memory' error. On a single machine, my job works well with this numpy array.
You should be able to convert the numpy array directly to a Spark dataframe, without going through a csv file. You could try something like the below code:
from pyspark.ml.linalg import Vectors
num_rows = 800
arr = map(lambda x: (Vectors.dense(x), ), numpy_arr.reshape(num_rows, -1))
df = spark.createDataFrame(arr, ["features"])
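The reshape in the snippet above is what replaces the csv and VectorAssembler steps: it turns each image into one flat row of 224 * 224 * 3 = 150528 values. A small numpy-only sketch of that step, using a 4-image stand-in for the 800-image array (the variable names are illustrative):

```python
import numpy as np

# Small stand-in for the (800, 224, 224, 3) array in the question.
images = np.random.rand(4, 224, 224, 3).astype(np.float32)

# One row per image, one column per pixel value: 224 * 224 * 3 = 150528.
flat = images.reshape(images.shape[0], -1)

# The original layout can be recovered later with another reshape.
restored = flat.reshape(-1, 224, 224, 3)
```

Each row of flat is what would be passed to Vectors.dense, so the dataframe ends up with a single vector column instead of 150528 scalar columns.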
You can also do this, which I find most convenient:
import numpy as np
import pandas as pd
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)
array = np.linspace(0, 10)
df_spark = sqlContext.createDataFrame(pd.DataFrame(array))
df_spark.show()
The only downside is that pandas needs to be installed.
If the out-of-memory error occurs on a worker node, increase the worker memory from its default value of 1 GB using the spark.executor.memory setting. If the error occurs in the driver, try increasing the driver memory instead, as suggested by @pissall. Also, try to identify the proper fraction of memory (spark.memory.fraction) to be used for keeping RDDs in memory.

Convert Breeze Matrix to Numpy Array

Is it possible to convert a breeze dense matrix to numpy array using spark?
I have here a breeze dense matrix I want to convert to numpy array.
Here is a way that works correctly but is slow and inefficient (it creates multiple copies). I used the Zeppelin spark and pyspark interpreters (I guess Toree should also be possible):
In spark:
%spark
import breeze.linalg._
import breeze.numerics._
z.put("matrix", DenseMatrix.eye[Double](4));
z.get("matrix")
Then in python:
%pyspark
import numpy as np
def breeze2numpy(breeze_matrix):
    # Copy the column-major data buffer into a Python list, then into numpy.
    data = list(breeze_matrix.copy().data())
    return np.array(data).reshape(breeze_matrix.rows(), breeze_matrix.cols(), order='F')
breeze2numpy(z.z.get("matrix"))
This works but will be impractical for big datasets (because of the multiple copies involved via a Python list). It would be nice to have a zero-copy method using Python's buffer protocol, like there is for C++ Eigen matrix --> numpy array.
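The order='F' argument matters because the flat data buffer is laid out column by column, which is what the reshape in breeze2numpy relies on. A numpy-only sketch of that step, with illustrative values standing in for the buffer of a 2 x 3 matrix:

```python
import numpy as np

rows, cols = 2, 3
# Flat buffer in column-major order for the matrix [[1, 2, 3], [4, 5, 6]]
# (illustrative values, not taken from a real Breeze object).
flat = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]

# order="F" fills the first index fastest, i.e. one column at a time.
matrix = np.asarray(flat, dtype=np.float64).reshape((rows, cols), order="F")
```

Reshaping the same buffer with the default order='C' would instead give [[1, 4, 2], [5, 3, 6]], which is not the original matrix.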