How to collect spark dataframe at each executor node? - apache-spark-sql

My application reads a large parquet file and performs some data extractions to arrive at a smallish spark dataframe object. All the contents of this dataframe must be present at each executor node for the next phase of the computation. I know that I can do this by collect-broadcast, as in this pyspark snippet
import pyspark
from pyspark.sql import HiveContext

sc = pyspark.SparkContext()
sqlc = HiveContext(sc)
# --- register hive tables and generate a Spark dataframe
spark_df = sqlc.sql('sql statement')
# collect the Spark dataframe contents into a Pandas dataframe at the driver
global_df = spark_df.toPandas()
# broadcast the Pandas dataframe to all the executor nodes
bc_df = sc.broadcast(global_df)
I was just wondering: is there a more efficient method for doing this? It would seem that this pattern makes the driver node into a bottleneck.

It depends on what you need to do with your small dataframe. If you need to join it with the large one, then Spark can optimize such joins by broadcasting the small dataframe automatically. The maximum size of a dataframe that can be broadcast is configured by the spark.sql.autoBroadcastJoinThreshold option, as described in the documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
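For cases where the automatic threshold does not kick in, the join can also be hinted explicitly with pyspark.sql.functions.broadcast. A minimal sketch, assuming a large dataframe big_df, a small one small_df, and a shared column key (all names illustrative):
from pyspark.sql.functions import broadcast

# Hypothetical dataframes: big_df is large, small_df is the small result.
# broadcast() asks Spark to ship small_df to every executor so the join
# runs locally on each node without shuffling big_df.
joined = big_df.join(broadcast(small_df), on="key", how="inner")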

Related

Why does pySpark crash when using Apache Arrow for string types?

In an attempt to get some outlier plots on large datasets I need to convert a Spark DataFrame to pandas. Turning to Apache Arrow, a simple run crashes my pyspark console when casting a column to string (it works fine without the cast). Why?
Using Python version 3.8.9 (default, Apr 10 2021 15:47:22)
Spark context Web UI available at http://6d0b1018a45a:4040
Spark context available as 'sc' (master = local[*], app id = local-1621164597906).
SparkSession available as 'spark'.
>>> import time
>>> from pyspark.sql.functions import rand
>>> from pyspark.sql import functions as F
>>> spark = SparkSession.builder.appName("Console_Test").getOrCreate()
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/05/16 11:31:03 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
>>> a_df = spark.range(1 << 25).toDF("id").withColumn("x", rand())
>>> a_df = a_df.withColumn("id", F.col("id").cast("string"))
>>> start_t = time.time()
>>> a_pd = a_df.toPandas()
Killed
Additionally, I noticed that options such as spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000") are seemingly without effect, as the web UI shows significantly more than 5000 records being assigned to the tasks.
Any indication on how to resolve the pyspark console crash, or more directly how to render large scatter plots, would be highly appreciated. I have (unsuccessfully) tried to find a way to apply Table.to_pandas(split_blocks=True, self_destruct=True) but could not obtain the Table structure from a Spark DataFrame.
You are trying to convert 33.5 million (2^25) rows into a Pandas dataframe. This will lead to an OutOfMemoryError, as all data will be transferred to the Spark driver.
A way to find outliers would be to calculate the histogram for the column x and then filter down a_df to the relevant bins in Spark before creating the Pandas dataframe:
hist = a_df.select("x").rdd.flatMap(lambda x: x).histogram(10) #create 10 bins
hist is a tuple of two arrays: the first array contains the bin boundaries and the second array contains the number of elements in each bin:
([1.7855041778425118e-08,
0.1000000152099446,
0.20000001256484742,
0.30000000991975023,
0.40000000727465307,
0.5000000046295558,
0.6000000019844587,
0.6999999993393615,
0.7999999966942644,
0.8999999940491672,
0.99999999140407],
[3355812,
3356891,
3352364,
3352438,
3357564,
3356213,
3354933,
3355144,
3357241,
3355832])
rand creates uniformly distributed random numbers, so the histogram in this case is not very interesting. But for real-world distributions, the histogram will be useful.
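As an illustration of the filtering step, a minimal sketch (the bin chosen below is arbitrary; in practice you would pick the bins that look like outliers in hist):
from pyspark.sql import functions as F

# Keep only the rows that fall into the chosen bin before converting to
# pandas, so only a small subset is pulled to the driver.
lower, upper = hist[0][8], hist[0][9]   # boundaries of one example bin
outlier_pd = a_df.filter((F.col("x") >= lower) & (F.col("x") < upper)).toPandas()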

What is the fastest way to return one row from a big pyspark dataframe or koalas dataframe in databricks?

I have a big dataframe (20 million rows, 35 columns) in Koalas on a Databricks notebook. I have performed some transform and join (merge) operations on it using Python, such as:
mdf.path_info = mdf.path_info.transform(modify_path_info)
x = mdf[['providerid','domain_name']].groupby(['providerid']).apply(domain_features)
mdf = ks.merge( mdf, x[['domain_namex','domain_name_grouped']], left_index=True, right_index=True)
x = mdf.groupby(['providerid','uid']).apply(userspecificdetails)
mmdf = mdf.merge(x[['providerid','uid',"date_last_purch","lifetime_value","age"]], how="left", on=['providerid','uid'])
After these operations, I want to display some rows of the dataframe to verify the result. I am trying to print/display as little as 1-5 rows of this big dataframe, but because of Spark's lazy evaluation, each of the print commands below starts 6-12 Spark jobs and runs forever, after which the cluster goes into an unusable state and nothing happens.
mdf.head()
display(mdf)
mdf.take([1])
mdf.iloc[0]
I also tried converting it into a Spark dataframe and then trying:
df = mdf.to_spark()
df.show(1)
df.rdd.takeSample(False, 1, seed=0)
df.first()
The cluster configuration I am using is 8worker_4core_8gb, meaning each worker and the driver node have 8.0 GB memory, 4 cores, 0.5 DBU, on Databricks Runtime Version 7.0 (includes Apache Spark 3.0.0, Scala 2.12).
Can someone please suggest a faster (ideally the fastest) way to get/print one row of the big dataframe, one that does not have to process all 20 million rows of the dataframe?
As you write, because of lazy evaluation Spark will perform your transformations first and only then show the one line. What you can do is reduce the size of your input data and do the transformations on a much smaller dataset, e.g.:
https://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
df.sample(False, 0.1, seed=0)
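Applied to the Koalas dataframe from the question, a hedged sketch might look like this (the 1% fraction and the random_state are arbitrary illustrations):
# Sample the input down before running the expensive transformations, so a
# verification run only touches a small fraction of the 20 million rows.
mdf_small = mdf.sample(frac=0.01, random_state=0)
# ... run the transform/merge pipeline on mdf_small instead of mdf, then:
mdf_small.head()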
You could cache the computation result after you convert to a Spark dataframe and then call the action.
df = mdf.to_spark()
# caches the result so the action called after this will use this cached
# result instead of re-computing the DAG
df.cache()
df.show(1)
You may want to free up the memory used for caching with:
df.unpersist()

Create Spark DataFrame from Pandas DataFrames inside RDD

I'm trying to convert a Pandas DataFrame on each worker node (an RDD where each element is a Pandas DataFrame) into a Spark DataFrame across all worker nodes.
Example:
import pandas as pd

def read_file_and_process_with_pandas(filename):
    data = pd.read_csv(filename)
    """
    some additional operations using pandas functionality
    here the data is a pandas dataframe, and I am using some datetime
    indexing which isn't available for spark dataframes
    """
    return data

filelist = ['file1.csv', 'file2.csv', 'file3.csv']
rdd = sc.parallelize(filelist)
rdd = rdd.map(read_file_and_process_with_pandas)
The previous operations work, so I have an RDD of Pandas DataFrames. How can I then convert this into a Spark DataFrame after I'm done with the Pandas processing?
I tried doing rdd = rdd.map(spark.createDataFrame), but when I do something like rdd.take(5), I get the following error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o103.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Is there a way to convert Pandas DataFrames in each worker node into a distributed DataFrame?
See this answer: https://stackoverflow.com/a/51231046/7964197
I've had to deal with the same problem, which seems quite common (reading many files with pandas, e.g. Excel/pickle/any other non-Spark format, and converting the resulting RDD into a Spark dataframe).
The supplied code adds a new method on the SparkSession that uses pyarrow to convert the pd.DataFrame objects into Arrow record batches, which are then converted directly into a pyspark.sql.DataFrame object:
spark_df = spark.createFromPandasDataframesRDD(prdd) # prdd is an RDD of pd.DataFrame objects
For large amounts of data, this is orders of magnitude faster than converting to an RDD of Row() objects.
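For smaller data, or to avoid the pyarrow-based helper, a plain-PySpark sketch of the slower approach the answer compares against (flattening each pandas DataFrame into Row objects) would be:
from pyspark.sql import Row

# Flatten the RDD of pandas DataFrames into an RDD of Row objects and let
# Spark infer the schema. Uses only standard APIs, but is much slower than
# the Arrow-based conversion for large amounts of data.
rows_rdd = rdd.flatMap(lambda pdf: [Row(**rec) for rec in pdf.to_dict("records")])
spark_df = spark.createDataFrame(rows_rdd)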
Pandas dataframes cannot be converted directly to an RDD.
You can create a Spark DataFrame from a Pandas dataframe:
spark_df = context.createDataFrame(pandas_df)
Reference: Introducing DataFrames in Apache Spark for Large Scale Data Science

Better solution to use any message broker for spark dataframe

I'm running a tagging algorithm on a Mongo field and, based on the result, I am adding a new field to that document. As my collection has around 1 million documents, updating and inserting take a lot of time.
Sample data:
{id:'a1',content:'some text1'}
{id:'a2',content:'some text2'}
python code:
docs = db.col.find({})
for doc in docs:
    out = do_operation(doc['content'])  # do_operation is my algorithm
    doc['tag'] = out
    db.col.update({'id': doc['id']}, {'$set': {'Tag_flag': True}})
    db.col2.insert(doc)
I have also used Spark dataframes to increase speed, but the Spark dataframes take too much memory and throw a memory error.
(Configuration: 4 cores and 16 GB RAM on a single-node Hadoop cluster.)
df = ...  # loading mongo data into a dataframe
df1 = df.withColumn('tag', df.content)
output = []
for doc in df.rdd.collect():
    out = do_operation(doc['content'])
    output.append(out)
df2 = spark.createDataFrame(output)
final_df = df1.join(df2, df1._id == df2._id, 'inner')
# and finally inserting this dataframe into a new collection
I need to optimize my Spark code so that I can speed it up while using less memory.
Can I use a message broker like Kafka, RabbitMQ, or Redis between Mongo and Spark? Will it be helpful?
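A hedged sketch of one way to avoid the memory-heavy df.rdd.collect() loop above, assuming do_operation is picklable and returns a string, is to run it as a UDF on the executors:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical sketch: wrap do_operation in a UDF so the tagging runs on the
# executors row by row instead of collecting every row to the driver.
tag_udf = udf(do_operation, StringType())  # assumes do_operation returns a string
df_tagged = df.withColumn('tag', tag_udf(df.content))
# df_tagged can then be written to the new collection without the extra join.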

Is a dataframe created using the toPandas() method distributed across the Spark cluster?

I am reading a CSV through
data = sc.textFile("filename")
df = sqlContext.createDataFrame(data.map(lambda line: line.split(",")))  # parse lines into columns
pdf = df.toPandas()
Now, is pdf distributed across the Spark cluster, or does it reside in the environment of the host machine?
No.
As the PySpark source code for DataFrame.toPandas() says:
.. note:: This method should only be used if the resulting Pandas's DataFrame is expected
to be small, as all the data is loaded into the driver's memory.
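If the collected pandas object does need to be available on the executors afterwards, it has to be shipped there explicitly, e.g. as a broadcast variable like in the first question above. A minimal sketch, assuming pdf fits in driver and executor memory:
# Explicitly ship the driver-local pandas DataFrame to every executor.
bc_pdf = sc.broadcast(pdf)
# Inside tasks running on the executors, access it via bc_pdf.value.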