Better solution to use any message broker for spark dataframe - redis

I'm running a tagging algorithm on a field of each MongoDB document and, based on the result, adding a new field to that document. The collection holds around 1 million documents, so the updates and inserts take a very long time.
Sample data:
{id:'a1',content:'some text1'}
{id:'a2',content:'some text2'}
Python code:
docs = db.col.find({})
for doc in docs:
    out = do_operation(doc['content'])  # do_operation is my algorithm
    doc["tag"] = out
    db.col.update_one({'id': doc['id']}, {'$set': {'Tag_flag': True}})
    db.col2.insert_one(doc)
I have also tried Spark dataframes to speed this up, but they use too much memory and throw a memory error.
(Configuration: 4 cores and 16 GB RAM on a single Hadoop cluster)
df = ...  # load the MongoDB data into a dataframe
df1 = df.withColumn('tag', df.content)
output = []
for doc in df.rdd.collect():  # collect() pulls every row onto the driver
    out = do_operation(doc['content'])
    output.append(out)
df2 = spark.createDataFrame(output)
final_df = df1.join(df2, df1._id == df2._id, 'inner')
# and finally inserting this dataframe into a new collection.
I need to optimize my Spark code so that it runs faster and uses less memory.
Can I use a message broker such as Kafka, RabbitMQ or Redis between MongoDB and Spark?
Would that help?
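Independently of any broker, the plain Python loop above makes two round trips to MongoDB per document. A minimal sketch, assuming the writes can be grouped: the hypothetical helper `batched` below splits any iterable of documents into fixed-size lists, so each list could go out in one bulk call (pymongo provides `insert_many` and `bulk_write` for this). The `do_operation` shown here is only a stand-in for the real algorithm, and the in-memory `docs` list stands in for the cursor.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def do_operation(text):
    # Stand-in for the real tagging algorithm (assumption for this sketch).
    return text.upper()


# Stand-in for db.col.find({}); a real cursor is iterable in the same way.
docs = [{"id": "a%d" % i, "content": "some text%d" % i} for i in range(1, 6)]

batches = []
for chunk in batched(docs, 2):
    # Build tagged copies of the documents in this chunk.
    tagged = [dict(doc, tag=do_operation(doc["content"])) for doc in chunk]
    # With pymongo, this is where one bulk write per chunk would go,
    # e.g. db.col2.insert_many(tagged).
    batches.append(tagged)

print([len(b) for b in batches])
```

Whether this helps enough depends on how much of the time goes into `do_operation` itself rather than into the database calls.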

Related

What is the fastest way to return one row from a big pyspark dataframe or koalas dataframe in databricks?

I have a big dataframe(20 Million rows, 35 columns) in koalas on a databricks notebook. I have performed some transform and join(merge) operations on it using python such as:
mdf.path_info = mdf.path_info.transform(modify_path_info)
x = mdf[['providerid','domain_name']].groupby(['providerid']).apply(domain_features)
mdf = ks.merge( mdf, x[['domain_namex','domain_name_grouped']], left_index=True, right_index=True)
x = mdf.groupby(['providerid','uid']).apply(userspecificdetails)
mmdf = mdf.merge(x[['providerid','uid',"date_last_purch","lifetime_value","age"]], how="left", on=['providerid','uid'])
After these operations, I want to display a few rows of the dataframe to verify the result. I am trying to print as few as 1-5 rows of this big dataframe, but because of Spark's lazy evaluation, every print command starts 6-12 Spark jobs and runs forever, after which the cluster goes into an unusable state and nothing more happens.
mdf.head()
display(mdf)
mdf.take([1])
mdf.iloc[0]
I also tried converting it into a Spark dataframe and then running:
df = mdf.to_spark()
df.show(1)
df.rdd.takeSample(False, 1, seed=0)
df.first()
The cluster configuration I am using is 8worker_4core_8gb, meaning each worker and driver node is 8.0 GB Memory, 4 Cores, 0.5 DBU on the Databricks Runtime Version: 7.0 (includes Apache Spark 3.0.0, Scala 2.12)
Can someone please suggest a faster, or ideally the fastest, way to get/print one row of the big dataframe without waiting for all 20 million rows to be processed?
As you write, because of lazy evaluation Spark will perform your transformations first and only then show the one line. What you can do is reduce the size of your input data and run the transformations on a much smaller dataset, e.g. with sample():
https://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
df.sample(False, 0.1, seed=0)
You could cache the computation result after you convert to spark dataframe and then call the action.
df = mdf.to_spark()
# caches the result so the action called after this will use this cached
# result instead of re-computing the DAG
df.cache()
df.show(1)
You may want to free up the memory used for caching with:
df.unpersist()

How to parallelize groupby() in dask?

I tried:
df.groupby('name').agg('count').compute(num_workers=1)
df.groupby('name').agg('count').compute(num_workers=4)
They take the same time. Why does num_workers have no effect?
Thanks
By default, Dask dataframes run on the multi-threaded scheduler, and for work that holds Python's GIL this effectively means a single processor on your computer, however many threads you request. (Note that Dask is still useful if your data does not fit in memory.)
If you want to use several processors to compute your operation, you have to use a different scheduler:
import time
from dask import dataframe as dd
from dask.distributed import LocalCluster, Client
df = dd.read_csv("data.csv")
def group(num_workers):
    start = time.time()
    res = df.groupby("name").agg("count").compute(num_workers=num_workers)
    end = time.time()
    return res, end - start
print(group(4))  # default threaded scheduler
clust = LocalCluster()
clt = Client(clust, set_as_default=True)
print(group(4))  # now runs on the local cluster's worker processes
Here, I create a local cluster using 4 parallel processes (because I have a quad-core machine) and then set a default scheduling client that will use this local cluster to perform the Dask operations. With a two-column CSV file of 1.5 GB, the standard groupby takes around 35 seconds on my laptop, whereas the multiprocess one takes only around 22 seconds.
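To see why the choice of scheduler matters, it helps to picture a parallel group-by count as three steps: split the rows into chunks, count each chunk independently, then merge the partial counts. The sketch below does this in plain Python, not with Dask's own machinery; `count_chunk`, `grouped_count` and their parameters are hypothetical names. The executor is a parameter in much the same way Dask's scheduler is: with a thread pool, pure-Python counting still contends for the GIL, whereas a process pool can spread the chunks over several cores.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(names):
    """Count the occurrences of each name within one chunk."""
    return Counter(names)


def grouped_count(names, n_chunks, executor_cls=ThreadPoolExecutor):
    """Split `names` into chunks, count each chunk in a pool, merge the results."""
    size = max(1, -(-len(names) // n_chunks))  # ceiling division, at least 1
    chunks = [names[i:i + size] for i in range(0, len(names), size)]
    total = Counter()
    with executor_cls(max_workers=n_chunks) as pool:
        # Each chunk is counted independently; partial counts are merged here.
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return dict(total)


names = ["ann", "bob", "ann", "cid", "bob", "ann"]
result = grouped_count(names, n_chunks=4)
print(result)
```

Passing `executor_cls=ProcessPoolExecutor` (imported from `concurrent.futures`, and called under an `if __name__ == "__main__":` guard) switches the same code to separate processes.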

Creating a Dataframe with a schema in the Spark workers side, in a Spark Streaming app

I have developed a Spark Streaming app that receives a data stream of JSON strings.
import json
from pyspark import SparkContext, sql
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
# MQTTUtils comes from the MQTT connector matching your Spark version (import not shown in the original)
def doSomething(time, rdd):
    data = rdd.toDF().toPandas()
sc = SparkContext("local[*]", "appname")
sc.setLogLevel("WARN")
sqlContext = sql.SQLContext(sc)
# batch interval in seconds
stream = StreamingContext(sc, 5)
stream.checkpoint("checkpoint")
# mqtt setup
brokerUrl = "tcp://localhost:1883"
topic = "test"
# mqtt stream
DS = MQTTUtils.createStream(stream, brokerUrl, topic)
# transform DStream to be able to read json as a dict
jsonDS = DS.map(lambda v: json.loads(v))
# create SQL-like rows from the json
sqlDS = jsonDS.map(lambda x: Row(a=x["a"], b=x["b"], c=x["c"], d=x["d"]))
# in each batch do something
sqlDS.foreachRDD(doSomething)
# run
stream.start()
stream.awaitTermination()
This code above is working as expected: I receive some jsons in a stringified manner and I can convert each batch to a dataframe, also converting it to a Pandas DataFrame.
So far so good.
The problem comes if I want to add a different schema to the DataFrame.
The method toDF() assumes a schema=None in the following function: sqlContext.createDataFrame(rdd, schema).
If I try to access sqlContext from inside doSomething(), obviously it is not defined. If I try to make it available there through a global variable, I get the typical error that it cannot be serialized.
I have also read that the sqlContext can only be used in the Spark driver and not in the workers.
So the question is: how does toDF() work in the first place, given that it needs the sqlContext? And how can I add a schema to it (hopefully without changing the source)?
Creating the DataFrame in the driver doesn't seem to be an option because I cannot serialize it to the workers.
Maybe I am not seeing this properly.
Thanks a lot in advance!
Answering my own question...
define the following:
from pyspark.sql import SparkSession
def getSparkSessionInstance(sparkConf):
    if ("sparkSessionSingletonInstance" not in globals()):
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]
and then, inside the function passed to foreachRDD (here doSomething), just call:
spark = getSparkSessionInstance(rdd.context.getConf())
Taken from the "DataFrame and SQL Operations" section of the Spark Streaming Programming Guide.
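The helper works because `globals()` inside a function refers to the namespace of the module that defines it, so the session is built on the first call and reused on every later one. A minimal sketch of that lazy-singleton behaviour without Spark; `FakeSession`, `get_session_instance` and `do_something` are hypothetical stand-ins, not Spark APIs.

```python
class FakeSession:
    """Hypothetical stand-in for SparkSession; counts how often it is built."""

    created = 0

    def __init__(self, conf):
        FakeSession.created += 1
        self.conf = conf


def get_session_instance(conf):
    # Build the session only if no earlier call has stored one.
    if "sessionSingletonInstance" not in globals():
        globals()["sessionSingletonInstance"] = FakeSession(conf)
    return globals()["sessionSingletonInstance"]


def do_something(batch):
    # Called once per batch, like the function passed to foreachRDD.
    session = get_session_instance({"app": "demo"})
    return session, len(batch)


first, _ = do_something([1, 2, 3])
second, _ = do_something([4, 5])
print(first is second, FakeSession.created)
```

Both calls return the same object, and the constructor runs only once, which is the property the real helper relies on for each batch.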

How to collect spark dataframe at each executor node?

My application reads a large parquet file and performs some data extractions to arrive at a smallish spark dataframe object. All the contents of this dataframe must be present at each executor node for the next phase of the computation. I know that I can do this by collect-broadcast, as in this pyspark snippet
import pyspark
from pyspark.sql import HiveContext
sc = pyspark.SparkContext()
sqlc = HiveContext(sc)
# --- register hive tables and generate spark dataframe
spark_df = sqlc.sql('sql statement')
# collect spark dataframe contents into a Pandas dataframe at the driver
global_df = spark_df.toPandas()
# broadcast Pandas dataframe to all the executor nodes;
# keep the handle, tasks read the data through global_df_bc.value
global_df_bc = sc.broadcast(global_df)
I was just wondering: is there a more efficient method for doing this? It would seem that this pattern makes the driver node into a bottleneck.
It depends on what you need to do with your small dataframe. If you need to join it with a large one, Spark can optimize such joins by broadcasting the small dataframe automatically. The maximum size of a dataframe that can be broadcast is set by the spark.sql.autoBroadcastJoinThreshold option, as described in the documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options

Is a dataframe created using the toPandas() method distributed across the Spark cluster?

I am reading a CSV through
data=sc.textFile("filename")
Df = Sparksql.create dataframe()
Pdf = Df.toPandas()
Is Pdf now distributed across the Spark cluster, or does it reside on the host machine?
No.
As it says in the PySpark source code of DataFrame:
.. note:: This method should only be used if the resulting Pandas's DataFrame is expected
to be small, as all the data is loaded into the driver's memory.