Create Spark DataFrame from Pandas DataFrames inside RDD

I'm trying to convert a Pandas DataFrame on each worker node (an RDD where each element is a Pandas DataFrame) into a Spark DataFrame across all worker nodes.
Example:
import pandas as pd

def read_file_and_process_with_pandas(filename):
    # Read the file into a pandas DataFrame (pd.read_csv, since the files are CSVs).
    data = pd.read_csv(filename)
    """
    some additional operations using pandas functionality
    here the data is a pandas dataframe, and I am using some datetime
    indexing which isn't available for spark dataframes
    """
    return data
filelist = ['file1.csv','file2.csv','file3.csv']
rdd = sc.parallelize(filelist)
rdd = rdd.map(read_file_and_process_with_pandas)
The previous operations work, so I have an RDD of Pandas DataFrames. How can I then convert this into a Spark DataFrame once I'm done with the Pandas processing?
I tried doing rdd = rdd.map(spark.createDataFrame), but when I do something like rdd.take(5), I get the following error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o103.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is there a way to convert Pandas DataFrames in each worker node into a distributed DataFrame?

See this answer: https://stackoverflow.com/a/51231046/7964197
I've had to deal with the same problem, which seems quite common (reading many files with pandas, e.g. Excel/pickle/any other non-Spark format, and converting the resulting RDD into a Spark DataFrame).
The supplied code adds a new method on the SparkSession that uses pyarrow to convert the pd.DataFrame objects into Arrow record batches, which are then converted directly into a pyspark.DataFrame object:
spark_df = spark.createFromPandasDataframesRDD(prdd) # prdd is an RDD of pd.DataFrame objects
For large amounts of data, this is orders of magnitude faster than converting to an RDD of Row() objects.
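For reference, here is a minimal sketch of the slower, Row-based baseline that this answer improves on; prdd and column_names are assumptions (prdd being the RDD of pd.DataFrame objects from the question, column_names their shared list of column names):
# Slower baseline for comparison: flatten each pandas DataFrame into plain
# Python rows and let Spark build the distributed DataFrame from those.
# Assumptions: prdd is an RDD of pandas DataFrames with identical columns,
# and column_names is that shared list of column names.
def pandas_df_to_rows(pdf):
    # .values.tolist() converts numpy scalar types into plain Python values,
    # which Spark's schema inference can handle.
    return pdf.values.tolist()

row_rdd = prdd.flatMap(pandas_df_to_rows)
spark_df = spark.createDataFrame(row_rdd, schema=column_names)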

Pandas DataFrames cannot be converted to an RDD directly. You can, however, create a Spark DataFrame from a Pandas DataFrame:
spark_df = context.createDataFrame(pandas_df)
Reference: Introducing DataFrames in Apache Spark for Large Scale Data Science
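A minimal, self-contained sketch of that conversion (the SparkSession setup and the toy data are illustrative assumptions):
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# A small pandas DataFrame used purely for illustration.
pandas_df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Create a distributed Spark DataFrame from the local pandas DataFrame.
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()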

Related

How will data redistribute when a pandas dataframe is converted into a spark dataframe?

If I call sparkSession.createDataFrame() to convert a pandas dataframe into a spark dataframe, how will my data be distributed among executors? Are there potential data skew issues?
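One way to inspect how the converted data actually ended up distributed (a sketch; spark, pandas_df and the conversion itself are assumed to be in place):
# Convert and then look at the partition layout of the result.
spark_df = spark.createDataFrame(pandas_df)

# How many partitions the data was split into.
print(spark_df.rdd.getNumPartitions())

# Rows per partition; a very uneven list here would point to skew.
print(spark_df.rdd.glom().map(len).collect())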

RDD vs Pandas Dataframe vs Direct Read to create Spark DataFrame

To create a Spark DataFrame, we can read directly from raw data, pass an RDD, or pass a Pandas DataFrame.
I experimented with these three methods:
Spark: Standalone Mode
using pyspark.sql module
Method 1: Read a text/csv file into Pandas and pass the Pandas DataFrame to create a Spark DataFrame.
df3=spark.createDataFrame(pandas_df)
Method 2: Create an RDD by passing the text file to sc.textFile, then use this RDD to create a Spark DataFrame.
df3=spark.createDataFrame(RDD_list, StringType())
Method 3: Read directly from the raw data to create a Spark DataFrame.
df3=spark.read.text("Data/bookpage.txt")
What I have observed:
The number of default partitions differs in the three cases:
Method 1 (Pandas): 8 (I have 8 cores)
Method 2 (RDD): 2
Method 3 (direct raw read): 1
Conversion paths:
Method 1: Raw data => Pandas DF => Spark DataFrame
Method 2: Raw data => RDD => Spark DataFrame
Method 3: Raw data => Spark DataFrame
Questions:
Which method is more efficient?
Since everything in Spark is implemented at the RDD level, does creating the RDD explicitly in Method 2 make it more efficient?
Why are there different default partitions for the same data? (A sketch for reproducing these counts follows below.)
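A quick way to reproduce the partition-count observation for all three methods (a sketch; spark, sc, pandas_df and the file path mirror the setup described above):
# Method 1: pandas -> Spark DataFrame
df1 = spark.createDataFrame(pandas_df)
print(df1.rdd.getNumPartitions())    # 8 in the setup above (matches the cores)

# Method 2: RDD -> Spark DataFrame
rdd_lines = sc.textFile("Data/bookpage.txt")
print(rdd_lines.getNumPartitions())  # 2 (sc.textFile's default minimum)

# Method 3: direct read -> Spark DataFrame
df3 = spark.read.text("Data/bookpage.txt")
print(df3.rdd.getNumPartitions())    # 1 for a small single file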

Can spark dataframe (scala) be converted to dataframe in pandas (python)

The DataFrame is created using the Scala API for Spark:
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
I want to convert this to a Pandas DataFrame.
PySpark provides .toPandas() to convert a Spark DataFrame to Pandas, but there is no equivalent for Scala (that I can find).
Please help me in this regard.
To convert a Spark DataFrame into a Pandas DataFrame, you can set spark.sql.execution.arrow.enabled to true, create the DataFrame with Spark, and then convert it to a Pandas DataFrame using Arrow.
Enable Arrow: spark.conf.set("spark.sql.execution.arrow.enabled", "true")
Create the DataFrame using Spark like you did:
val someDF = spark.createDataFrame(...)
Convert it to a Pandas DataFrame:
result_pdf = someDF.select("*").toPandas()
The conversion above runs through Arrow because spark.sql.execution.arrow.enabled is set to true.
Hope this helps!
In Spark, a DataFrame is just an abstraction over data; the most common sources of that data are files on a file system. When you convert a PySpark DataFrame to Pandas, PySpark simply converts its own abstraction over the data into the abstraction of another Python framework. You cannot do this conversion between Spark and Pandas in Scala, because Pandas is a Python library for working with data while Spark is not, and you would run into difficulties integrating Python and Scala. The simplest thing you can do here (sketched below) is:
Write the DataFrame to the file system from Scala Spark.
Read the data from the file system using Pandas.
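A sketch of that two-step round trip, with the Scala write shown as a comment and the read done with pandas (the shared path is an assumption, and pd.read_parquet needs pyarrow or fastparquet installed):
import pandas as pd

# Step 1 (Scala side): write the DataFrame to a shared location, e.g.
#   someDF.write.mode("overwrite").parquet("/shared/path/some_df.parquet")

# Step 2 (Python side): read the written files back with pandas.
pandas_df = pd.read_parquet("/shared/path/some_df.parquet")
print(pandas_df.head())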

Fastest way to load multiple numpy arrays into spark rdd?

I'm new to Spark. In my application, I would like to create an RDD from many numpy arrays. Each numpy array is (10,000, 5,000). Currently, I'm trying the following:
rdd_list = []
for np_array in np_arrays:
    pandas_df = pd.DataFrame(np_array)
    spark_df = sqlContext.createDataFrame(pandas_df)  ## SLOW STEP
    rdd_list.append(spark_df.rdd)
big_rdd = sc.union(rdd_list)
All of the steps are fast, except that converting the Pandas dataframe to a Spark dataframe is very slow. If I use a subset of the numpy array, such as (10,000, 500), it takes a couple of minutes to convert it to a Spark dataframe. But if I use the full numpy array (10,000, 5,000), it just hangs.
Is there anything I can do to speed up my workflow? Or should I be doing this in a completely different way? (FYI, I'm kind of stuck with the initial numpy arrays.)
For my application, I used the ArrayRDD class from the sparkit-learn project to load numpy arrays into Spark RDDs. I had no complaints, but your mileage may vary.
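Another thing worth trying, since sqlContext.createDataFrame(pandas_df) is the slow step, is the Arrow-based conversion mentioned in the answers above (a sketch; a SparkSession named spark is assumed, and on Spark 3.x the config key is spark.sql.execution.arrow.pyspark.enabled instead):
# Enable Arrow-based conversion between pandas and Spark DataFrames
# (key shown is for Spark 2.3/2.4; Spark 3.x uses
#  "spark.sql.execution.arrow.pyspark.enabled").
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pandas_df = pd.DataFrame(np_array)           # one (10000, 5000) array from the question
spark_df = spark.createDataFrame(pandas_df)  # conversion should now go through Arrow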

How to collect spark dataframe at each executor node?

My application reads a large parquet file and performs some data extractions to arrive at a smallish spark dataframe object. All the contents of this dataframe must be present at each executor node for the next phase of the computation. I know that I can do this by collect-broadcast, as in this pyspark snippet:
sc = pyspark.SparkContext()
sqlc = HiveContext(sc)
# --- register hive tables and generate spark dataframe
spark_df = sqlc.sql('sql statement')
# collect spark dataframe contents into a Pandas dataframe at the driver
global_df = spark_df.toPandas()
# broadcast Pandas dataframe to all the executor nodes
sc.broadcast(global_df)
I was just wondering: is there a more efficient method for doing this? It would seem that this pattern makes the driver node into a bottleneck.
It depends on what you need to do with your small dataframe. If you need to join it with the large one, Spark can optimize such joins by broadcasting the small dataframe automatically. The maximum size of a dataframe that can be broadcast is configured by the spark.sql.autoBroadcastJoinThreshold option, as described in the documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
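If the automatic broadcast does not kick in (for example because the small DataFrame's estimated size is above that threshold), you can also request it explicitly with a join hint (a sketch; large_df, small_df and the join key are assumptions):
from pyspark.sql.functions import broadcast

# Explicitly mark the small DataFrame for broadcasting in the join, so each
# executor receives a full copy without a driver-side collect/broadcast step.
result = large_df.join(broadcast(small_df), on="key", how="inner")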