Pyspark: Serialized task exceeds max allowed. Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values - dataframe

I'm doing calculations on a cluster, and at the end, when I ask for summary statistics on my Spark dataframe with df.describe().show(), I get an error:
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
In my Spark configuration I already tried to increase the aforementioned parameter:
spark = (SparkSession
    .builder
    .appName("TV segmentation - dataprep for scoring")
    .config("spark.executor.memory", "25G")
    .config("spark.driver.memory", "40G")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "12")
    .config("spark.driver.maxResultSize", "3g")
    .config("spark.kryoserializer.buffer.max.mb", "2047mb")
    .config("spark.rpc.message.maxSize", "1000mb")
    .getOrCreate())
I also tried to repartition my dataframe using:
dfscoring=dfscoring.repartition(100)
but still I keep on getting the same error.
My environment: Python 3.5, Anaconda 5.0, Spark 2
How can I avoid this error?

I ran into the same problem and solved it.
The cause is spark.rpc.message.maxSize, which defaults to 128 (the value is in MiB). You can change it when launching a Spark client. I work in pyspark and set the value to 1024, so I start it like this:
pyspark --master yarn --conf spark.rpc.message.maxSize=1024
That solved it.
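To choose a value, it can help to work backwards from the error message. The sketch below is not part of the answer above. It assumes the setting is read as a whole number of MiB (which matches the 134217728-byte default shown in the error) and that Spark rejects values above 2047. The helper name and the 1.5x headroom factor are illustrative choices, not anything Spark provides.

```python
import math
import re

MIB = 1024 * 1024
MAX_ALLOWED_MIB = 2047  # assumption: upper bound Spark accepts for this setting


def required_max_size_mib(error_message, headroom=1.5):
    """Suggest a spark.rpc.message.maxSize value (in MiB) for a failed task.

    headroom is a hypothetical safety factor applied to the reported task size.
    """
    match = re.search(r"was (\d+) bytes", error_message)
    if match is None:
        raise ValueError("no serialized task size found in the message")
    task_bytes = int(match.group(1))
    needed = math.ceil(task_bytes * headroom / MIB)
    return min(needed, MAX_ALLOWED_MIB)


message = (
    "Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: "
    "spark.rpc.message.maxSize (134217728 bytes)."
)

print(required_max_size_mib(message, headroom=1.0))  # 132 (bare minimum)
print(required_max_size_mib(message))                # 197 (with 1.5x headroom)
```

For the task in the question, anything from 132 upwards would fit, so the 1024 used above leaves plenty of room.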

I had the same issue and it wasted a day of my life that I am never getting back. I am not sure why this is happening, but here is how I made it work for me.
Step 1: Make sure that PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are set correctly.
It turned out that Python on the workers (2.6) was a different version from Python on the driver (3.6), so check that both environment variables point to the interpreter you expect.
I fixed it by simply switching my kernel from Python 3 Spark 2.2.0 to Python Spark 2.3.1 in Jupyter. You may have to set it up manually. Here is how to make sure your PySpark is set up correctly: https://mortada.net/3-easy-steps-to-set-up-pyspark.html
Step 2: If that doesn't work, try working around it.
The kernel switch worked for DFs that I hadn't added any columns to (spark_df -> panda_df -> back_to_spark_df), but it didn't work on the DFs where I had added 5 extra columns. What I tried instead, and what worked, was the following:
# 1. Select only the new columns:
df_write = df[['hotel_id','neg_prob','prob','ipw','auc','brier_score']]
# 2. Convert this DF into Spark DF:
df_to_spark = spark.createDataFrame(df_write)
df_to_spark = df_to_spark.repartition(100)
df_to_spark.registerTempTable('df_to_spark')
# 3. Join it to the rest of your data:
final = df_to_spark.join(data,'hotel_id')
# 4. Then write the final DF.
final.write.saveAsTable('schema_name.table_name',mode='overwrite')
Hope that helps!
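For Step 1, one way to keep the two interpreters in sync from inside a notebook is to set both variables before any SparkContext is created. This is only a sketch: the helper name is made up, and defaulting to sys.executable assumes the same path exists on every worker (true in local mode, not necessarily on a cluster, where a path valid on all nodes should be passed instead). PYSPARK_DRIVER_PYTHON mainly matters when launching through pyspark or spark-submit.

```python
import os
import sys


def pin_pyspark_python(interpreter=sys.executable):
    """Point both PySpark interpreter variables at the same Python.

    Must run before the SparkContext is created to have any effect.
    """
    os.environ["PYSPARK_PYTHON"] = interpreter
    os.environ["PYSPARK_DRIVER_PYTHON"] = interpreter
    names = ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")
    return {name: os.environ[name] for name in names}


settings = pin_pyspark_python()
print(settings)
```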

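I had the same problem, but using Watson Studio. My solution was:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
# stop the running context so the new setting is picked up on restart
sc.stop()
configura = SparkConf().set('spark.rpc.message.maxSize', '256')
sc = SparkContext.getOrCreate(conf=configura)
spark = SparkSession.builder.getOrCreate()
I hope it helps someone.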

I faced the same issue while converting a Spark DataFrame to a pandas DataFrame.
I am working on Azure Databricks. First, check the current value of the setting in the Spark config:
spark.conf.get("spark.rpc.message.maxSize")
Then increase the limit:
spark.conf.set("spark.rpc.message.maxSize", "500")

For those looking for a PySpark-based way of doing this in an AWS Glue script, the code snippet below might be useful:
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark import SparkConf
myconfig=SparkConf().set('spark.rpc.message.maxSize','256')
#SparkConf can be directly used with its .set property
sc = SparkContext(conf=myconfig)
glueContext = GlueContext(sc)
..
..

Related

What is the fastest way to return one row from a big pyspark dataframe or koalas dataframe in databricks?

I have a big dataframe(20 Million rows, 35 columns) in koalas on a databricks notebook. I have performed some transform and join(merge) operations on it using python such as:
mdf.path_info = mdf.path_info.transform(modify_path_info)
x = mdf[['providerid','domain_name']].groupby(['providerid']).apply(domain_features)
mdf = ks.merge( mdf, x[['domain_namex','domain_name_grouped']], left_index=True, right_index=True)
x = mdf.groupby(['providerid','uid']).apply(userspecificdetails)
mmdf = mdf.merge(x[['providerid','uid',"date_last_purch","lifetime_value","age"]], how="left", on=['providerid','uid'])
After these operations, I want to display some rows of the dataframe to verify the result. I am trying to print as few as 1-5 rows of this big dataframe, but because of Spark's lazy evaluation, each print command starts 6-12 Spark jobs and runs forever, after which the cluster goes into an unusable state and nothing happens.
mdf.head()
display(mdf)
mdf.take([1])
mdf.iloc[0]
I also tried converting it into a Spark dataframe and then calling:
df = mdf.to_spark()
df.show(1)
df.rdd.takeSample(False, 1, seed=0)
df.first()
The cluster configuration I am using is 8worker_4core_8gb, meaning each worker and driver node is 8.0 GB Memory, 4 Cores, 0.5 DBU on the Databricks Runtime Version: 7.0 (includes Apache Spark 3.0.0, Scala 2.12)
Can someone please suggest the fastest way to get/print one row of the big dataframe without waiting for all 20 million rows to be processed?
As you wrote, because of lazy evaluation Spark will perform your transformations first and then show the one row. What you can do is reduce the size of your input data and run the transformations on a much smaller dataset, e.g.:
https://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
df.sample(False, 0.1, seed=0)
You could cache the computation result after you convert to spark dataframe and then call the action.
df = mdf.to_spark()
# caches the result so the action called after this will use this cached
# result instead of re-computing the DAG
df.cache()
df.show(1)
You may want to free up the memory used for caching with:
df.unpersist()
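As a plain-Python analogy for the lazy-evaluation point made in the answers above (this is not Spark code, and Spark's real planner is more sophisticated), the sketch below counts how many input rows a lazy pipeline touches before it can hand back its first result. The function names are illustrative. A row-wise step can stop after one row, while a group-by, like the groupby/merge steps in the question, has to see every row first.

```python
processed = []  # records every row the pipeline actually touches


def transform(rows):
    """Row-by-row step, like a map: yields results as it goes."""
    for row in rows:
        processed.append(row)
        yield row * 2


def group_by_remainder(rows):
    """Group-by step: must consume every input row before emitting anything."""
    groups = {}
    for row in rows:
        groups.setdefault(row % 3, []).append(row)
    yield from groups.items()


# Only a row-wise step: asking for one row touches one input row.
first_row = next(transform(range(1000)))
rows_touched_without_groupby = len(processed)

processed.clear()

# With a group-by in the pipeline, one output row needs all input rows.
first_group = next(group_by_remainder(transform(range(1000))))
rows_touched_with_groupby = len(processed)

print(rows_touched_without_groupby, rows_touched_with_groupby)  # 1 1000
```

This is why sampling the input before the transformations, or caching the result once it has been computed, tends to help more than changing which "show one row" call is used.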

How to run a pandas-Koalas program using spark-submit (Windows)?

I have a sample program that builds a pandas dataframe and converts it to a Koalas dataframe. Now I want to execute it on a Spark cluster (Windows standalone). When I run it from the command prompt as
spark-submit --master local hello.py
I get the error ModuleNotFoundError: No module named 'databricks'.
import pandas as pd
from databricks import koalas as ks
workbook_loc = r"c:\2020\Book1.xlsx"
df = pd.read_excel(workbook_loc, sheet_name='Sheet1')
kdf = ks.from_pandas(df)
print(kdf)
What should I change so that I can make use of Spark cluster features? My actual program, written in pandas, does many things, and I want to use the Spark cluster to see performance improvements.
You should install koalas via the cluster's admin UI (Libraries/PyPI). If you run pip install koalas on the cluster, it won't work.
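The error means that the Python interpreter launched by spark-submit cannot find the package that provides databricks.koalas (published on PyPI as koalas). A quick way to confirm which interpreter is in use and whether it can see the package is a check like the one below. The helper name is made up for illustration. Running it through spark-submit rather than directly is what makes the result meaningful.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if this interpreter can locate the given module."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "databricks") is missing.
        return False


print("interpreter:", sys.executable)
print("koalas available:", is_importable("databricks.koalas"))
```

If this prints False, installing koalas into the interpreter shown on the first line is the likely fix.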

How to run pure pandas code in spark and see activity from spark webUI?

Does anyone have an idea how to run a pandas program on a Spark standalone cluster machine (Windows)? The program was developed using PyCharm and pandas.
The issue is that I am able to run it from the command prompt using spark-submit --master spark://sparkcas1:7077 project.py and get results, but I do not see any activity in the Spark web UI (:7077): no workers, and nothing under Running Applications or Completed Applications.
In the pandas program I added only one extra statement, "from pyspark import SparkContext":
import pandas as pd
from pyspark import SparkContext
# read the Excel workbook from a local path
workbook_loc = r"c:\2020\Book1.xlsx"
df = pd.read_excel(workbook_loc, sheet_name='Sheet1')
# print the dataframe
print(df)
What could be the issue?
Pandas code runs only on the driver; no workers are involved. So there is no point in running plain pandas code inside Spark.
If you are using Spark 3.0, you can run your pandas-style code in a distributed way by converting the Spark dataframe to a Koalas dataframe.

pyspark memory consumption is very low

I am using anaconda python and installed pyspark on top of it. In the pyspark program, I am using the dataframe as the data structure. The program goes like this:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("test").getOrCreate()
sdf = spark_session.read.orc("../data/")
sdf.createOrReplaceTempView("data")
df = spark_session.sql("select field1, field2 from data group by field1")
df.write.csv("result.csv")
While this works, it is slow and the memory usage is very low (~2 GB), even though much more physical memory is installed.
I tried to increase the memory usage by:
from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '16g')
But it does not seem to help at all.
Is there any way to speed up the program? In particular, how can I fully utilize the system memory?
Thanks!
You can either use configuration for your session:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
conf.set('spark.executor.memory', '16g')
spark_session = SparkSession.builder \
    .config(conf=conf) \
    .appName('test') \
    .getOrCreate()
Or run the script with spark-submit:
spark-submit --conf spark.executor.memory=16g yourscript.py
You should probably also set spark.driver.memory to something reasonable.
Hope this helps, good luck!
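A small sketch of assembling such a command line in Python follows. The helper name and the 8g driver value are illustrative assumptions. The Spark documentation notes that in client mode the driver memory should be given on the command line (--driver-memory) or in the defaults file rather than set from inside the application. If the script runs in local mode (an assumption; the question does not say), the driver setting is probably the one that matters, since the work happens inside the driver process.

```python
import shlex


def spark_submit_command(script, conf=None, driver_memory=None):
    """Build the argument list for a spark-submit call."""
    args = ["spark-submit"]
    if driver_memory is not None:
        args += ["--driver-memory", driver_memory]
    # Sort the settings so the generated command is stable.
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


command = spark_submit_command(
    "yourscript.py",
    conf={"spark.executor.memory": "16g"},
    driver_memory="8g",  # hypothetical value, size it to your machine
)
print(" ".join(shlex.quote(arg) for arg in command))
# spark-submit --driver-memory 8g --conf spark.executor.memory=16g yourscript.py
```

The returned list can be passed directly to subprocess.run if the job should be launched from Python.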

Creating a Dataframe with a schema in the Spark workers side, in a Spark Streaming app

I have developed a spark streaming app where I have data stream of json strings.
sc = SparkContext("local[*]", "appname")
sc.setLogLevel("WARN")
sqlContext = sql.SQLContext(sc)
#batch width in time
stream = StreamingContext(sc, 5)
stream.checkpoint("checkpoint")
# mqtt setup
brokerUrl = "tcp://localhost:1883"
topic = "test"
# mqtt stream
DS = MQTTUtils.createStream(stream, brokerUrl, topic)
# transform DStream to be able to read json as a dict
jsonDS = DS.map(lambda v: json.loads(v))
#create SQL-like rows from the json
sqlDS = jsonDS.map(lambda x: Row(a=x["a"], b=x["b"], c=x["c"], d=x["d"]))
#in each batch do something
sqlDS.foreachRDD(doSomething)
# run
stream.start()
stream.awaitTermination()
def doSomething(time, rdd):
    data = rdd.toDF().toPandas()
This code above is working as expected: I receive some jsons in a stringified manner and I can convert each batch to a dataframe, also converting it to a Pandas DataFrame.
So far so good.
The problem comes if I want to add a different schema to the DataFrame.
The method toDF() assumes a schema=None in the following function: sqlContext.createDataFrame(rdd, schema).
If I try to access sqlContext from inside doSomething(), it is obviously not defined. If I try to make it available there through a global variable, I get the typical error that it cannot be serialized.
I have also read the sqlContext can only be used in the Spark Driver and not in the workers.
So the question is: how is the toDF() working in the first place, as it needs the sqlContext? And how can I add a schema to it (hopefully without changing the source)?
Creating the DataFrame in the driver doesn't seem to be an option because I cannot serialize it to the workers.
Maybe I am not seeing this properly.
Thanks a lot in advance!
Answering my own question...
define the following:
def getSparkSessionInstance(sparkConf):
    if ("sparkSessionSingletonInstance" not in globals()):
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]
and then, inside doSomething(), just call:
spark = getSparkSessionInstance(rdd.context.getConf())
Taken from the "DataFrame and SQL Operations" section of the Spark Streaming Programming Guide.
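The helper above is a lazily created singleton: the session is built the first time it is asked for and reused on every later batch. The sketch below shows the same pattern without Spark so that it can be run anywhere. FakeSession and get_singleton are hypothetical stand-ins, not part of any Spark API.

```python
def get_singleton(key, factory):
    """Create the object on first use and return the same one afterwards."""
    if key not in globals():
        globals()[key] = factory()
    return globals()[key]


class FakeSession:
    """Hypothetical stand-in for SparkSession so the sketch runs without Spark."""

    created = 0  # counts how many sessions were ever built

    def __init__(self):
        FakeSession.created += 1


def do_something(batch):
    # Looked up at call time, so the session is never captured in a closure.
    session = get_singleton("sessionSingletonInstance", FakeSession)
    return session, len(batch)


first, _ = do_something([1, 2, 3])
second, _ = do_something([4, 5])
print(first is second, FakeSession.created)  # True 1
```

With the real session in hand inside doSomething(), the schema from the question can then be supplied explicitly through spark.createDataFrame(rdd, schema) instead of rdd.toDF().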