pyspark memory consumption is very low - dataframe

I am using Anaconda Python and installed PySpark on top of it. In the PySpark program, I use a dataframe as the data structure. The program goes like this:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("test").getOrCreate()
sdf = spark_session.read.orc("../data/")
sdf.createOrReplaceTempView("data")
df = spark_session.sql("select field1, field2 from data group by field1")
df.write.csv("result.csv")
While this works, it is slow, and memory usage is very low (~2 GB) even though much more physical memory is installed.
I tried to increase the memory usage by:
from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '16g')
But it does not seem to help at all.
Are there any ways to speed up the program? In particular, how can I fully utilize the system memory?
Thanks!

You can either set the configuration on your session:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.executor.memory', '16g')
spark_session = SparkSession.builder \
.config(conf=conf) \
.appName('test') \
.getOrCreate()
Or run the script with spark-submit:
spark-submit --conf spark.executor.memory=16g yourscript.py
You should also probably set the spark.driver.memory to something reasonable.
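For example, a minimal sketch of the driver-memory suggestion (the 16g value is only illustrative), either added to the SparkConf above:
conf.set('spark.driver.memory', '16g')
or passed on the command line; note that spark.driver.memory may not take effect if it is set after the driver JVM has already started, so spark-submit is the safer place for it:
spark-submit --conf spark.executor.memory=16g --conf spark.driver.memory=16g yourscript.py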
Hope this helps, good luck!

Related

Memory Leak - After every request hit on Flask API running in a container

I have a Flask app running in a container on EC2. On starting the container, docker stats reported memory usage close to 48 MB. After making the first API call (reading a 2 GB file from S3), the usage rose to 5.72 GB and did not go down even after the call completed.
Each request increases the usage by roughly twice the file size, and after a few requests the server starts giving a memory error.
Also, when running the same Flask app outside the container, we do not see any such increase in memory usage.
Output of "docker stats <container_id>" before hitting the API-
Output of "docker stats <container_id>" after hitting the API
Flask app (app.py) contains-
import os
import json
import pandas as pd
import flask

app = flask.Flask(__name__)

@app.route('/uploadData', methods=['POST'])
def test():
    json_input = flask.request.args.to_dict()
    s3_path = json_input['s3_path']
    # reading file directly from s3 - without downloading
    df = pd.read_csv(s3_path)
    print(df.head(5))
    # clearing df
    df = None
    return json_input

@app.route('/healthcheck', methods=['GET'])
def HealthCheck():
    return "Success"

if __name__ == '__main__':
    app.run(host="0.0.0.0", port='8898')
The Dockerfile contains-
FROM python:3.7.10
RUN apt-get update -y && apt-get install -y python-dev
# We copy just the requirements.txt first to leverage Docker cache
COPY . /app_abhi
WORKDIR /app_abhi
EXPOSE 8898
RUN pip3 install flask boto3 pandas fsspec s3fs
CMD [ "python","-u", "app.py" ]
I tried reading the file directly from S3 as well as downloading it first and then reading it, but neither made a difference.
Any leads in getting this memory utilization down to the initial consumption would be a great help!
You can try the following possible solutions:
Update the dtypes of the columns:
Pandas (by default) tries to infer the dtype of each column when it creates a dataframe, and some inferred types cause large memory allocations. You can reduce this by specifying smaller dtypes for such columns, e.g. downcast integer columns to np.int8 and float columns to np.float16. Refer to this: Pandas/Python memory spike while reading 3.2 GB file
Read data in chunks:
You can read the data in chunks of a given size, perform the required processing on each chunk, and then move on to the next one, so the whole dataset is never held in memory at once (a combined sketch follows this list). Reading in chunks can be slower than reading everything in one go, but it is memory efficient.
Try a new library: Dask DataFrame is used in situations where pandas is commonly needed but fails due to data size or computation speed, although you might not find every built-in pandas operation in Dask. https://docs.dask.org/en/latest/dataframe.html
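A minimal sketch combining the dtype and chunking suggestions above (the path, column names, and dtypes are illustrative, not taken from the question):
import numpy as np
import pandas as pd

s3_path = "s3://bucket/large_file.csv"                 # hypothetical path
dtypes = {"user_id": np.int32, "score": np.float32}    # explicit dtypes avoid defaulting to int64/float64
# (the CSV parser does not accept float16 directly; downcast further with astype after loading if needed)

total_rows = 0
# chunksize makes read_csv return an iterator of smaller dataframes
for chunk in pd.read_csv(s3_path, dtype=dtypes, chunksize=100_000):
    total_rows += len(chunk)   # replace with your real per-chunk processing
print(total_rows)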
The memory growth is almost certainly caused by constructing the dataframe.
df = None doesn't return that memory to the operating system, though it does return memory to the heap managed within the process. There's an explanation for that in How do I release memory used by a pandas dataframe?
I had a similar problem (see question Google Cloud Run: script requires little memory, yet reaches memory limit)
Finally, I was able to solve it by adding
import gc
...
gc.collect()
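Applied to the Flask handler from the question, the change amounts to something like this sketch (del drops the last reference to the dataframe before collecting; it is not strictly required but makes the intent explicit):
import gc
import flask
import pandas as pd

app = flask.Flask(__name__)

@app.route('/uploadData', methods=['POST'])
def test():
    json_input = flask.request.args.to_dict()
    df = pd.read_csv(json_input['s3_path'])
    print(df.head(5))
    del df         # drop the last reference to the dataframe
    gc.collect()   # ask the garbage collector to reclaim the memory right away
    return json_input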

How to run a pandas-Koalas program using spark-submit (Windows)?

I have a pandas dataframe (sample program below), converted to a Koalas dataframe, and I now want to execute it on a Spark cluster (Windows standalone). When I try it from the command prompt as
spark-submit --master local hello.py, I get the error ModuleNotFoundError: No module named 'databricks'.
import pandas as pd
from databricks import koalas as ks
workbook_loc = "c:\\2020\Book1.xlsx"
df = pd.read_excel(workbook_loc, sheet_name='Sheet1')
kdf = ks.from_pandas(df)
print(kdf)
What should I change so that I can make use of the Spark cluster's features? My actual program, written in pandas, does many things, and I want to use the Spark cluster to see performance improvements.
You should install koalas via the cluster's admin UI (Libraries/PyPI); just running pip install koalas on the cluster won't work.

How to run pure pandas code in spark and see activity from spark webUI?

Does anyone have an idea how to run a pandas program on a Spark standalone cluster machine (Windows)? The program was developed using PyCharm and pandas.
The issue is that I can run it from the command prompt using spark-submit --master spark://sparkcas1:7077 project.py and get results, but I do not see any worker activity, nor Running Applications / Completed Applications status, in the Spark web UI at :7077.
In the pandas program I included only one extra statement, from pyspark import SparkContext:
import pandas as pd
from pyspark import SparkContext
# reading csv file from url
workbook_loc = "c:\\2020\Book1.xlsx"
df = pd.read_excel(workbook_loc, sheet_name='Sheet1')
# converting to dict
print(df)
What could be the issue?
Pandas code runs only on the driver; no workers are involved, so there is no point in running plain pandas code through Spark.
If you are using Spark 3.0 you can run your pandas code in a distributed way by converting the Spark dataframe to a Koalas dataframe, as in the sketch below.
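A minimal sketch of that conversion (assumes Spark 3.x with the koalas package installed on the cluster; the input file is a hypothetical placeholder):
from databricks import koalas as ks   # importing koalas adds .to_koalas() to Spark dataframes
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-on-spark").getOrCreate()
sdf = spark.read.csv("data.csv", header=True, inferSchema=True)   # hypothetical input file

kdf = sdf.to_koalas()   # distributed dataframe with a pandas-like API
print(kdf.head())       # executed by the Spark workers, so it shows up in the web UI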

Pyspark: Serialized task exceeds max allowed. Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values

I'm doing calculations on a cluster, and at the end, when I request summary statistics on my Spark dataframe with df.describe().show(), I get an error:
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
In my Spark configuration I already tried to increase the aforementioned parameter:
spark = (SparkSession
.builder
.appName("TV segmentation - dataprep for scoring")
.config("spark.executor.memory", "25G")
.config("spark.driver.memory", "40G")
.config("spark.dynamicAllocation.enabled", "true")
.config("spark.dynamicAllocation.maxExecutors", "12")
.config("spark.driver.maxResultSize", "3g")
.config("spark.kryoserializer.buffer.max.mb", "2047mb")
.config("spark.rpc.message.maxSize", "1000mb")
.getOrCreate())
I also tried to repartition my dataframe using:
dfscoring=dfscoring.repartition(100)
but still I keep on getting the same error.
My environment: Python 3.5, Anaconda 5.0, Spark 2
How can I avoid this error?
I was in the same trouble, then I solved it.
The cause is spark.rpc.message.maxSize, whose default is 128 MB. You can change it when launching a Spark client; I work in pyspark and set the value to 1024, so I write:
pyspark --master yarn --conf spark.rpc.message.maxSize=1024
That solved it.
I had the same issue and it wasted a day of my life that I am never getting back. I am not sure why this is happening, but here is how I made it work for me.
Step 1: Make sure that PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
It turned out that Python on the workers (2.6) was a different version than on the driver (3.6). Check that the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are set correctly.
I fixed it by simply switching my kernel from Python 3 Spark 2.2.0 to Python Spark 2.3.1 in Jupyter. You may have to set it up manually. Here is how to make sure your PySpark is set up correctly: https://mortada.net/3-easy-steps-to-set-up-pyspark.html
Step 2: If that doesn't work, try working around it.
The kernel switch worked for DFs that I hadn't added any columns to (spark_df -> pandas_df -> back to spark_df), but it didn't work on the DFs where I had added 5 extra columns. So what I tried, and what worked, was the following:
# 1. Select only the new columns:
df_write = df[['hotel_id','neg_prob','prob','ipw','auc','brier_score']]
# 2. Convert this DF into Spark DF:
df_to_spark = spark.createDataFrame(df_write)
df_to_spark = df_to_spark.repartition(100)
df_to_spark.registerTempTable('df_to_spark')
# 3. Join it to the rest of your data:
final = df_to_spark.join(data,'hotel_id')
# 4. Then write the final DF.
final.write.saveAsTable('schema_name.table_name',mode='overwrite')
Hope that helps!
I had the same problem, but using Watson Studio. My solution was:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

sc.stop()
configura = SparkConf().set('spark.rpc.message.maxSize', '256')
sc = SparkContext.getOrCreate(conf=configura)
spark = SparkSession.builder.getOrCreate()
I hope it helps someone...
I faced the same issue while converting a Spark DF to a pandas DF.
I am working on Azure Databricks; first, check the value currently set in the Spark config using the line below -
spark.conf.get("spark.rpc.message.maxSize")
Then we can increase the value -
spark.conf.set("spark.rpc.message.maxSize", "500")
For those folks who are looking for a PySpark-based way of doing this in an AWS Glue script, the code snippet below might be useful:
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark import SparkConf
myconfig=SparkConf().set('spark.rpc.message.maxSize','256')
# SparkConf can be used directly with its .set method
sc = SparkContext(conf=myconfig)
glueContext = GlueContext(sc)
..
..

Is there a pyarrow equivalent of the chunksize argument in pandas.read_csv?

I am looking to process a large file (5 GB) in RAM but am getting an out-of-memory error. Is there a way to process a Parquet file in chunks, like there is with pandas.read_csv?
import pyarrow.parquet as pq

def main():
    df = pq.read_table('./data/train.parquet').to_pandas()

main()
There is not yet, but there are issues open about adding this option (see https://issues.apache.org/jira/browse/ARROW-3771, among others). Note that memory use will be significantly improved in the upcoming 0.12 release.
In the meantime, you can use pyarrow.parquet.ParquetFile and its read_row_group method to read one row group at a time.
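A minimal sketch of that row-group approach (assumes the file was written with more than one row group):
import pyarrow.parquet as pq

pf = pq.ParquetFile('./data/train.parquet')
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i).to_pandas()   # only one row group is held in memory at a time
    print(len(chunk))                          # replace with your real per-chunk processing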