Creating a DataFrame with a schema on the Spark worker side, in a Spark Streaming app - pandas

I have developed a Spark Streaming app that receives a data stream of JSON strings.
import json
from pyspark import SparkContext, sql
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
# Spark 1.x location of MQTTUtils (assumption; later versions ship it via Apache Bahir)
from pyspark.streaming.mqtt import MQTTUtils

# defined before it is passed to foreachRDD below
def doSomething(time, rdd):
    data = rdd.toDF().toPandas()

sc = SparkContext("local[*]", "appname")
sc.setLogLevel("WARN")
sqlContext = sql.SQLContext(sc)
# batch width in time (seconds)
stream = StreamingContext(sc, 5)
stream.checkpoint("checkpoint")
# mqtt setup
brokerUrl = "tcp://localhost:1883"
topic = "test"
# mqtt stream
DS = MQTTUtils.createStream(stream, brokerUrl, topic)
# transform DStream to be able to read json as a dict
jsonDS = DS.map(lambda v: json.loads(v))
# create SQL-like rows from the json
sqlDS = jsonDS.map(lambda x: Row(a=x["a"], b=x["b"], c=x["c"], d=x["d"]))
# in each batch do something
sqlDS.foreachRDD(doSomething)
# run
stream.start()
stream.awaitTermination()
The code above works as expected: I receive JSON documents as strings, and I can convert each batch to a Spark DataFrame and then to a Pandas DataFrame.
So far so good.
The problem comes if I want to add a different schema to the DataFrame.
The method toDF() assumes a schema=None in the following function: sqlContext.createDataFrame(rdd, schema).
If I try to access sqlContext from inside doSomething(), it is obviously not defined. If I try to make it available there through a global variable, I get the typical error that it cannot be serialized.
I have also read the sqlContext can only be used in the Spark Driver and not in the workers.
So the question is: how is the toDF() working in the first place, as it needs the sqlContext? And how can I add a schema to it (hopefully without changing the source)?
Creating the DataFrame in the driver doesn't seem to be an option because I cannot serialize it to the workers.
Maybe I am not seeing this properly.
Thanks a lot in advance!

Answering my own question...
Define the following:
def getSparkSessionInstance(sparkConf):
    if ("sparkSessionSingletonInstance" not in globals()):
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]
and then, from inside doSomething(), just call:
spark = getSparkSessionInstance(rdd.context.getConf())
Taken from the "DataFrame and SQL Operations" section of the Spark Streaming programming guide.
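To connect this back to the original question, here is a minimal sketch of how a session obtained this way could be used to attach an explicit schema to each batch. The name do_something, the helper to_tuple, and the choice of StringType for all four fields are assumptions made for illustration; the real field types are not given in the question.

```python
# Field names come from the question; treating every field as a string is an
# assumption made only for this sketch.
FIELD_NAMES = ["a", "b", "c", "d"]


def to_tuple(record):
    """Order one parsed record (dict or Row) to match FIELD_NAMES."""
    return tuple(record[name] for name in FIELD_NAMES)


def do_something(time, rdd):
    """Hypothetical foreachRDD callback: build a DataFrame with an explicit schema."""
    # pyspark is imported lazily so the pure-Python helper above can be used
    # without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType, StructField, StructType

    if rdd.isEmpty():
        return None

    schema = StructType(
        [StructField(name, StringType(), True) for name in FIELD_NAMES]
    )
    # getOrCreate() returns the already running session when there is one.
    spark = SparkSession.builder.config(conf=rdd.context.getConf()).getOrCreate()
    df = spark.createDataFrame(rdd.map(to_tuple), schema)
    return df.toPandas()
```

It would be registered the same way as before, for example with sqlDS.foreachRDD(do_something). Converting each record to a tuple in schema order avoids relying on how Row orders its keyword fields.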

Related

using Dask to load many CSV files with different columns

I have many CSV files saved in AWS S3 with the same first set of columns and a lot of optional columns. I don't want to download them one by one and then use pd.concat to combine them, since this takes a lot of time and everything has to fit into the computer's memory. Instead, I'm trying to use Dask to load and sum up all of these files, with missing optional columns treated as zeros.
If all columns were the same I could use:
import dask.dataframe as dd
addr = "s3://SOME_BASE_ADDRESS/*.csv"
df = dd.read_csv(addr)
df.groupby(["index"]).sum().compute()
but it doesn't work with files that don't have the same number of columns, since Dask assumes the columns of the first file apply to all files:
File ".../lib/python3.7/site-packages/pandas/core/internals/managers.py", line 155, in set_axis
'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 64 elements, new values have 62 elements
According to this thread I can either read all headers in advance (for example by writing them out as I produce and save all of the small CSVs) or use something like this:
df = dd.concat([dd.read_csv(f) for f in filelist])
I wonder whether this solution is actually faster or better than just using pandas directly. In general, I'd like to know what the best (mainly fastest) way to tackle this issue is.
It might be a good idea to use delayed to standardize dataframes before converting them to a dask dataframe (whether this is optimal for your use case is difficult to judge).
import pandas as pd
import dask.dataframe as dd
from dask import delayed
list_files = [...] # create a list of files inside s3 bucket
list_cols_to_keep = ['col1', 'col2']
@delayed
def standard_csv(file_path):
    df = pd.read_csv(file_path)
    df = df[list_cols_to_keep]
    # add any other standardization routines, e.g. dtype conversion
    return df
ddf = dd.from_delayed([standard_csv(f) for f in list_files])
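The snippet above keeps only the shared columns. If the optional columns should instead be kept and treated as zeros, as the question asks, one possible variation is to read the headers first, build the union of all column names, and reindex every file to that union. This is a sketch under assumptions: the function names are made up for illustration, and read_header uses the built-in open(), so it expects local paths (for S3 the header would have to be read through something like s3fs).

```python
import csv

import pandas as pd


def read_header(path):
    """Return the column names from the first line of a local CSV file."""
    with open(path, newline="") as fh:
        return next(csv.reader(fh))


def all_columns(paths):
    """Union of the column names of all files, in first-seen order."""
    seen = []
    for path in paths:
        for col in read_header(path):
            if col not in seen:
                seen.append(col)
    return seen


def standard_csv(path, columns):
    """Read one CSV and add every missing column, filled with 0."""
    df = pd.read_csv(path)
    return df.reindex(columns=columns, fill_value=0)


def load_all(paths):
    """Build a Dask DataFrame from files with differing columns."""
    # dask is imported lazily so the helpers above work without it.
    import dask.dataframe as dd
    from dask import delayed

    columns = all_columns(paths)
    parts = [delayed(standard_csv)(path, columns) for path in paths]
    return dd.from_delayed(parts)
```

After that, load_all(list_files).groupby(["index"]).sum().compute() should behave like the single-schema case. Note that fill_value only fills the columns that reindex adds; missing values already present in a file stay as NaN.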
I ended up giving up on Dask since it was too slow, and instead used aws s3 sync to download the data and multiprocessing.Pool to read and concatenate the files:
import glob
import os
from multiprocessing import Pool

import pandas as pd
from tqdm import tqdm

# download:
def sync_outputs(out_path):
    local_dir_path = "/tmp/outputs/"
    os.makedirs(local_dir_path, exist_ok=True)
    cmd = f'aws s3 sync {out_path} {local_dir_path} > /tmp/null' # the last part is to avoid prints
    os.system(cmd)
    return local_dir_path

# concat:
def read_csv(path):
    return pd.read_csv(path, index_col=0)

def read_csvs_parallel(local_paths):
    with Pool(os.cpu_count()) as p:
        csvs = list(tqdm(p.imap(read_csv, local_paths), desc='reading csvs', total=len(local_paths)))
    return csvs

# all together:
def concat_csvs_parallel(out_path):
    local_dir_path = sync_outputs(out_path)
    # list the downloaded files so they can be read in parallel
    local_paths = glob.glob(os.path.join(local_dir_path, "*.csv"))
    csvs = read_csvs_parallel(local_paths)
    df = pd.concat(csvs)
    return df
aws s3 sync downloaded about 1000 files (~1KB each) in about 30 seconds, and reading them with multiprocessing (8 cores) took 3 seconds. This was much faster than also downloading the files using multiprocessing (almost 2 minutes for 1000 files).

Pyspark: Serialized task exceeds max allowed. Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values

I'm doing calculations on a cluster, and at the end, when I ask for summary statistics on my Spark DataFrame with df.describe().show(), I get an error:
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
In my Spark configuration I already tried to increase the aforementioned parameter:
spark = (SparkSession
         .builder
         .appName("TV segmentation - dataprep for scoring")
         .config("spark.executor.memory", "25G")
         .config("spark.driver.memory", "40G")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.maxExecutors", "12")
         .config("spark.driver.maxResultSize", "3g")
         .config("spark.kryoserializer.buffer.max.mb", "2047mb")
         .config("spark.rpc.message.maxSize", "1000mb")
         .getOrCreate())
I also tried to repartition my dataframe using:
dfscoring=dfscoring.repartition(100)
but still I keep on getting the same error.
My environment: Python 3.5, Anaconda 5.0, Spark 2
How can I avoid this error?
I had the same problem and solved it.
The cause is that spark.rpc.message.maxSize defaults to 128 (MiB). You can change it when launching a Spark client. I work in pyspark and set the value to 1024, like this:
pyspark --master yarn --conf spark.rpc.message.maxSize=1024
That solved it.
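To see where the numbers in the error message come from, here is a small arithmetic sketch. As far as I know the setting is read as a plain integer number of MiB (no unit suffix, as in the command above) and Spark rejects values above 2047. The helper name and the 1.5x headroom factor are arbitrary assumptions for illustration.

```python
import math

MIB = 1024 * 1024
DEFAULT_MAX_SIZE_MIB = 128   # default of spark.rpc.message.maxSize
HARD_LIMIT_MIB = 2047        # largest value Spark accepts for this setting


def required_max_size_mib(task_bytes, headroom=1.5):
    """Suggest a spark.rpc.message.maxSize value (in MiB) for a task size.

    headroom is an arbitrary safety factor chosen for this sketch.
    """
    needed = math.ceil(task_bytes * headroom / MIB)
    if needed > HARD_LIMIT_MIB:
        raise ValueError("task is too large for any allowed maxSize")
    return max(needed, DEFAULT_MAX_SIZE_MIB)


task_bytes = 137500581  # size reported in the error message

print(DEFAULT_MAX_SIZE_MIB * MIB)         # 134217728, the limit in the error
print(math.ceil(task_bytes / MIB))        # 132, the bare minimum in MiB
print(required_max_size_mib(task_bytes))  # 197, with 1.5x headroom
```

So a value such as 256 or 1024, passed as a bare number before the Spark context starts, is comfortably above what this particular task needs.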
I had the same issue and it wasted a day of my life that I am never getting back. I am not sure why this is happening, but here is how I made it work for me.
Step 1: Make sure that PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
It turned out that the Python version on the workers (2.6) was different from the one on the driver (3.6). You should check that the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I fixed it by simply switching my kernel from Python 3 Spark 2.2.0 to Python Spark 2.3.1 in Jupyter. You may have to set it up manually. Here is how to make sure your PySpark is set up correctly: https://mortada.net/3-easy-steps-to-set-up-pyspark.html
Step 2: If that doesn't work, try working around it.
The kernel switch worked for DFs that I hadn't added any columns to (spark_df -> panda_df -> back_to_spark_df), but it didn't work on the DFs where I had added 5 extra columns. What I tried, and what worked, was the following:
# 1. Select only the new columns:
df_write = df[['hotel_id','neg_prob','prob','ipw','auc','brier_score']]
# 2. Convert this DF into Spark DF:
df_to_spark = spark.createDataFrame(df_write)
df_to_spark = df_to_spark.repartition(100)
df_to_spark.registerTempTable('df_to_spark')
# 3. Join it to the rest of your data:
final = df_to_spark.join(data,'hotel_id')
# 4. Then write the final DF.
final.write.saveAsTable('schema_name.table_name',mode='overwrite')
Hope that helps!
I had the same problem, but using Watson Studio. My solution was:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
sc.stop()
configura=SparkConf().set('spark.rpc.message.maxSize','256')
sc=SparkContext.getOrCreate(conf=configura)
spark = SparkSession.builder.getOrCreate()
I hope it helps someone...
I faced the same issue while converting a Spark DF to a pandas DF.
I am working on Azure Databricks. First, check the current value in the Spark config:
spark.conf.get("spark.rpc.message.maxSize")
Then increase it:
spark.conf.set("spark.rpc.message.maxSize", "500")
For those looking for a PySpark-based way of doing this in an AWS Glue script, the code snippet below might be useful:
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark import SparkConf
myconfig=SparkConf().set('spark.rpc.message.maxSize','256')
#SparkConf can be directly used with its .set property
sc = SparkContext(conf=myconfig)
glueContext = GlueContext(sc)
..
..

Better solution to use any message broker for spark dataframe

I'm running a tagging algorithm on a field of my Mongo documents and, based on the result, adding a new field to each document. As my collection holds around 1 million documents, the updates and insertions take a very long time.
Sample data:
{id:'a1',content:'some text1'}
{id:'a2',content:'some text2'}
python code:
docs = db.col.find({})
for doc in docs:
    out = do_operation(doc['content'])  # do_operation is my algorithm
    doc["tag"] = out
    db.col.update_one({'id': doc['id']}, {'$set': {'Tag_flag': True}})
    db.col2.insert_one(doc)
I have also tried Spark DataFrames to speed things up, but they use a lot of memory and throw a memory error.
(Configuration: 4 cores and 16 GB RAM on a single Hadoop cluster)
df = ...  # load the Mongo data into a DataFrame
df1 = df.withColumn('tag', df.content)
output = []
for doc in df.rdd.collect():
    out = do_operation(doc['content'])
    output.append(out)
df2 = spark.createDataFrame(output)
final_df = df1.join(df2, df1._id == df2._id, 'inner')
# and finally inserting this dataframe into a new collection.
I need to optimize my Spark code so that it runs faster with less memory.
Can I use a message broker like Kafka, RabbitMQ or Redis between Mongo and Spark?
Would it be helpful?
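Independently of the broker question, one thing that may already help with the plain PyMongo loop above is to stop issuing one update and one insert per document and send them in batches instead. The following is only a sketch: chunked and tag_collection are hypothetical names, the batch size of 1000 is an arbitrary assumption, and db and do_operation are the objects from the question.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items from any iterable (e.g. a cursor)."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def tag_collection(db, do_operation, batch_size=1000):
    """Tag documents batch by batch, using one bulk call per batch."""
    # pymongo is imported lazily so chunked() can be used without it.
    from pymongo import UpdateOne

    for batch in chunked(db.col.find({}), batch_size):
        updates = []
        for doc in batch:
            doc["tag"] = do_operation(doc["content"])
            updates.append(
                UpdateOne({"id": doc["id"]}, {"$set": {"Tag_flag": True}})
            )
        # One round trip per batch instead of one per document.
        db.col.bulk_write(updates, ordered=False)
        db.col2.insert_many(batch, ordered=False)
```

Only one batch of documents is held in memory at a time, which also avoids the collect() of the whole collection used in the Spark attempt.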

How do I convert multiple Pandas DFs into a single Spark DF?

I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:
file_list_rdd = sc.emptyRDD()
for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)
I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas DataFrame) tuples. This allows me to use Pandas to read the Excel file and to manipulate the files so that they're uniform before making them into a Spark DataFrame.
How do I take an RDD of (file path, Pandas DF) tuples and turn it into a single Spark DF? I'm aware of functions that can do a single transformation, but not one that can do several.
My first attempt was something like this:
sqlCtx = SQLContext(sc)
def convert_pd_df_to_spark_df(item):
    return sqlCtx.createDataFrame(item[0][1])
processed_excel_rdd.map(convert_pd_df_to_spark_df)
I'm guessing that didn't work because sqlCtx isn't distributed with the computation (it's a guess because the stack trace doesn't make much sense to me).
Thanks in advance for taking the time to read :).
This can be done by converting to Arrow RecordBatches, which Spark > 2.3 can process into a DF very efficiently.
https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
This snippet monkey-patches spark to include a createFromPandasDataframesRDD method.
The createFromPandasDataframesRDD method accepts an RDD of pandas DFs (it assumes they all have the same columns) and returns a single Spark DF.
I solved this by writing a function like this:
def pd_df_to_row(rdd_row):
    key = rdd_row[0]
    pd_df = rdd_row[1]
    rows = list()
    for index, series in pd_df.iterrows():
        # Takes a row of a df, exports it as a dict, and then passes an unpacked-dict into the Row constructor
        row_dict = {str(k):v for k,v in series.to_dict().items()}
        rows.append(Row(**row_dict))
    return rows
You can invoke it by calling something like:
processed_excel_rdd = processed_excel_rdd.flatMap(pd_df_to_row)
processed_excel_rdd now holds a collection of Spark Row objects. You can now say:
processed_excel_rdd.toDF()
There's probably something more efficient than the Series-> dict-> Row operation, but this got me through.
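One possibly cheaper variant of the Series -> dict -> Row step is to let pandas export all records at once with to_dict(orient="records") instead of looping with iterrows(). This is a sketch, not a measured improvement; the function names are made up for illustration.

```python
import pandas as pd


def pd_df_to_records(rdd_row):
    """Turn one (path, pandas DataFrame) tuple into a list of plain dicts."""
    _key, pd_df = rdd_row
    # to_dict(orient="records") exports every row in one call and keeps
    # each column's own dtype, unlike iterrows(), which builds a Series
    # per row and may upcast mixed-type rows.
    return [
        {str(k): v for k, v in record.items()}
        for record in pd_df.to_dict(orient="records")
    ]


def pd_df_to_rows(rdd_row):
    """Same as above, but wrapped in Spark Row objects."""
    # pyspark is imported lazily so pd_df_to_records works without Spark.
    from pyspark.sql import Row

    return [Row(**record) for record in pd_df_to_records(rdd_row)]
```

It would be used the same way as before, for example processed_excel_rdd.flatMap(pd_df_to_rows).toDF().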
Why not make a list of the dataframes or filenames and then call union in a loop? Something like this:
If pandas dataframes:
dfs = [df1, df2, df3, df4]
sdf = None
for df in dfs:
    if sdf is not None:
        sdf = sdf.union(spark.createDataFrame(df))
    else:
        sdf = spark.createDataFrame(df)
If filenames:
names = [name1, name2, name3, name4]
sdf = None
for name in names:
    if sdf is not None:
        sdf = sdf.union(spark.createDataFrame(pd.read_excel(name)))
    else:
        sdf = spark.createDataFrame(pd.read_excel(name))
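The same loop can be written more compactly with functools.reduce. A minimal sketch follows; union_all is a hypothetical helper name, and it works on any objects that expose a union method, which includes Spark DataFrames.

```python
from functools import reduce


def union_all(frames):
    """Fold a sequence of frames into one by calling .union pairwise."""
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one frame to union")
    return reduce(lambda left, right: left.union(right), frames)


# Possible usage with the names from the answer above (requires Spark):
# sdf = union_all(spark.createDataFrame(df) for df in dfs)
# sdf = union_all(spark.createDataFrame(pd.read_excel(name)) for name in names)
```

Keep in mind that DataFrame.union matches columns by position, not by name, so the pandas DataFrames need the same column order (Spark 2.3 and later also offer unionByName).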

Is a DataFrame created using the toPandas() method distributed across the Spark cluster?

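I am reading a CSV through
data = sc.textFile("filename")
# Sparksql is the SQLContext; each line is split into fields first
Df = Sparksql.createDataFrame(data.map(lambda line: line.split(",")))
Pdf = Df.toPandas()
Now, is Pdf distributed across the Spark cluster, or does it reside in the environment of the host machine?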
No.
As it says in the PySpark source code of DataFrame:
.. note:: This method should only be used if the resulting Pandas's DataFrame is expected
to be small, as all the data is loaded into the driver's memory.