Querying Redshift via spark-redshift is not fast - apache-spark-sql

I connect to Redshift from pyspark using spark-redshift, i.e.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sparkConf = SparkConf()
sc = SparkContext(conf=sparkConf)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'AWS_KEY_ID')
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'AWS_KEY')
sql_context = SQLContext(sc)
# setConf (not getConf) is needed to actually change the setting
sql_context.setConf("spark.sql.shuffle.partitions", u"5")
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://example.redshift.amazonaws.com:5439/agcdb?user=user&password=pwd") \
    .option("dbtable", "table_name") \
    .option('forward_spark_s3_credentials', True) \
    .option("tempdir", "s3n://bucket") \
    .load()
When I compare the run time of a query (e.g. one returning ~300k rows) in pyspark with running the same query directly on Redshift, I see no difference.
I read that spark.sql.shuffle.partitions should be lowered from its default of 200 depending on the size of the dataframe.
What are the important configurations I should check/ which people saw making a difference?
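For reference, a minimal sketch of actually changing spark.sql.shuffle.partitions up front when the context is built; note it only affects stages that shuffle data (joins, aggregations), not a plain load like the one above, and the value 5 is just a placeholder:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# set the shuffle partition count before creating the context;
# it can also be changed later with sql_context.setConf(...)
sparkConf = SparkConf().set("spark.sql.shuffle.partitions", "5")
sc = SparkContext(conf=sparkConf)
sql_context = SQLContext(sc)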

Related

Pyspark df.count(): why does it work with only one executor?

I am trying to read data from Kafka and want to get a count of the records.
It takes a long time because it runs with only one executor. How can I increase the number of executors?
spark = SparkSession.builder.appName('oracle_read_test') \
.config("spark.driver.memory", "30g") \
.config("spark.driver.maxResultSize", "64g") \
.config("spark.executor.cores", "10") \
.config("spark.executor.instances", "15") \
.config('spark.executor.memory', '30g') \
.config('num-executors', '20') \
.config('spark.yarn.executor.memoryOverhead', '32g') \
.config("hive.exec.dynamic.partition", "true") \
.config("orc.compress", "ZLIB") \
.config("hive.merge.smallfiles.avgsize", "40000000") \
.config("hive.merge.size.per.task", "209715200") \
.config("dfs.blocksize", "268435456") \
.config("hive.metastore.try.direct.sql", "true") \
.config("spark.sql.orc.enabled", "true") \
.config("spark.dynamicAllocation.enabled", "false") \
.config("spark.sql.sources.partitionOverwriteMode","dynamic") \
.getOrCreate()
df = spark.read.format("kafka") \
.option("kafka.bootstrap.servers","localhost:9092") \
.option("includeHeaders","true") \
.option("subscribe","test") \
.load()
df.count()
How many partitions does your topic have? If only one, then you cannot have more executors.
Otherwise, --num-executors exists as a flag to spark-submit.
Also, this code only counts the records returned in one batch, not the entire topic. Counting the entire topic would take even longer.
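As a rough illustration of both points, assuming the topic really has more than one partition and you are on Spark 2.4+ (which added a minPartitions option to the Kafka source), and reusing the localhost:9092 / test settings from the question:
df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .option("minPartitions", "20") \
    .load()

# or repartition after the load so the count runs in parallel
print(df.repartition(20).count())

# and request the executors at submit time:
# spark-submit --num-executors 20 your_script.py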

Structured streaming: multiple rows to pandas udf

I'm writing a structured streaming job that receives data from eventhubs.
After some preparation, I apply a pandas_udf to each row to create a new column holding a prediction from a pickled model.
I'm experiencing a serious problem: sometimes the input to the pandas_udf is a group of rows rather than a single row (as I expected), which leads to this error:
RuntimeError: Result vector from pandas_udf was not the required length: expected 2, got 1
This happens because the pandas_udf receives more than one row (in this case 2).
How is this possible? Shouldn't .withColumn be applied row by row?
Here is my code:
dfInt = spark \
    .readStream \
    .load() \
    .selectExpr("cast (body as string) as json") \
    .select(from_json("json", schema).alias("data")) \
    .withColumn("k", expr("uuid()")) \
    .select("k", explode("data.features").alias("feat")) \
    .select("feat.*", "k") \
    .groupBy("k") \
    .agg(*expressions) \
    .drop("k") \
    .na.drop() \
    .withColumn("prediction", predict(F.struct([col(x) for x in features])))
The pandas_udf is the following:
@pandas_udf(FloatType())
def predict(x):
    return pd.Series(pickle_model.predict_proba(x)[0][1])
Actually, the problem seems to arise before the withColumn call with the UDF, because more rows are coming out of the previous step.
The groupBy aggregation returns a single row, because the key I group on is unique.
Do you know what the reason for this is?
In this case you are using a SCALAR pandas_udf, which takes a pandas.Series as input and returns a pandas.Series of the same size. I don't know the exact internals, but my understanding is that each executor converts your column (F.struct([col(x) for x in features])) into a pandas.Series for the DataFrame partition it is currently processing and applies the function to that series. A partition consists of many rows, therefore you cannot assume the series has length one. You need to make sure you keep the predicted probabilities for all of your rows. You can probably do something like this (assuming you are indeed only interested in keeping the probability of class 1):
@pandas_udf(FloatType())
def predict(x):
    return pd.Series(pickle_model.predict_proba(x)[:, 1])
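To see the batching behaviour in isolation, here is a small self-contained sketch (the column names f1/f2 are made up, not from the question): the series handed to a SCALAR pandas_udf holds one entry per row of the batch, and the returned series must have the same length.
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], ["f1", "f2"])

@pandas_udf(DoubleType())
def add_cols(a, b):
    # a and b arrive as pandas Series covering the whole batch, not single values
    return a + b  # element-wise, so the result keeps the batch length

df.withColumn("s", add_cols("f1", "f2")).show()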

pyspark memory consumption is very low

I am using Anaconda Python and installed pyspark on top of it. In the pyspark program, I am using a dataframe as the data structure. The program goes like this:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("test").getOrCreate()
sdf = spark_session.read.orc("../data/")
sdf.createOrReplaceTempView("data")
df = spark_session.sql("select field1, field2 from data group by field1")
df.write.csv("result.csv")
While this works, it is slow and the memory usage is very low (~2 GB), even though much more physical memory is installed.
I tried to increase the memory usage by:
from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '16g')
But it does not seem to help at all.
Are there any ways to speed up the program? In particular, how can I fully utilize the system memory?
Thanks!
You can either set the configuration when you build your session:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.executor.memory', '16g')
spark_session = SparkSession.builder \
    .config(conf=conf) \
    .appName('test') \
    .getOrCreate()
Or run the script with spark-submit:
spark-submit --conf spark.executor.memory=16g yourscript.py
You should also probably set the spark.driver.memory to something reasonable.
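For example, both settings can go in the builder; note that in client mode spark.driver.memory is usually better passed on the command line (spark-submit --driver-memory 16g), since the driver JVM may already be running by the time the config is applied. The 16g values below are just placeholders:
from pyspark.sql import SparkSession

spark_session = SparkSession.builder \
    .appName('test') \
    .config('spark.executor.memory', '16g') \
    .config('spark.driver.memory', '16g') \
    .getOrCreate()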
Hope this helps, good luck!

Creating a DataFrame with a schema on the Spark worker side, in a Spark Streaming app

I have developed a Spark Streaming app in which I have a data stream of JSON strings.
import json
from pyspark import SparkContext, sql
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
# MQTT support comes from the external spark-streaming-mqtt package
from pyspark.streaming.mqtt import MQTTUtils

sc = SparkContext("local[*]", "appname")
sc.setLogLevel("WARN")
sqlContext = sql.SQLContext(sc)
#batch width in time
stream = StreamingContext(sc, 5)
stream.checkpoint("checkpoint")
# mqtt setup
brokerUrl = "tcp://localhost:1883"
topic = "test"
# mqtt stream
DS = MQTTUtils.createStream(stream, brokerUrl, topic)
# transform DStream to be able to read json as a dict
jsonDS = DS.map(lambda v: json.loads(v))
#create SQL-like rows from the json
sqlDS = jsonDS.map(lambda x: Row(a=x["a"], b=x["b"], c=x["c"], d=x["d"]))
#in each batch do something
sqlDS.foreachRDD(doSomething)
# run
stream.start()
stream.awaitTermination()
def doSomething(time, rdd):
    # note: in the actual script this must be defined before the foreachRDD call above
    data = rdd.toDF().toPandas()
The code above works as expected: I receive JSON as strings and can convert each batch to a Spark DataFrame, and also to a pandas DataFrame.
So far so good.
The problem comes if I want to add a different schema to the DataFrame.
The method toDF() defaults to schema=None in the underlying call sqlContext.createDataFrame(rdd, schema).
If I try to access sqlContext from inside doSomething(), it is obviously not defined. If I try to make it available there with a global variable, I get the typical error that it cannot be serialized.
I have also read the sqlContext can only be used in the Spark Driver and not in the workers.
So the question is: how does toDF() work in the first place, given that it needs the sqlContext? And how can I add a schema to it (hopefully without changing the source)?
Creating the DataFrame in the driver doesn't seem to be an option because I cannot serialize it to the workers.
Maybe I am not seeing this properly.
Thanks a lot in advance!
Answering my own question...
define the following:
from pyspark.sql import SparkSession

def getSparkSessionInstance(sparkConf):
    if "sparkSessionSingletonInstance" not in globals():
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]
and then from the worker just call:
spark = getSparkSessionInstance(rdd.context.getConf())
Taken from the DataFrame and SQL Operations section of the Spark Streaming programming guide.
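For the schema part of the question, the same singleton can then be used inside doSomething() to call createDataFrame with an explicit schema instead of relying on toDF(). A sketch with a made-up all-StringType schema matching the Row(a, b, c, d) fields above; adjust the types to the actual JSON payload:
from pyspark.sql.types import StructType, StructField, StringType

# hypothetical schema for illustration only
mySchema = StructType([
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
    StructField("c", StringType(), True),
    StructField("d", StringType(), True),
])

def doSomething(time, rdd):
    if rdd.isEmpty():
        return
    spark = getSparkSessionInstance(rdd.context.getConf())
    df = spark.createDataFrame(rdd, schema=mySchema)
    data = df.toPandas()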

Pyspark DataFrame does not quote data at save

I am trying to save a file to HDFS using the com.databricks.spark.csv package, but it does not quote my data, even though I set the quote option.
What am I doing wrong?
df.write.format('com.databricks.spark.csv').mode('overwrite').option("header", "false").option("quote","\"").save(output_path)
I am running it with --packages com.databricks:spark-csv_2.10:1.5.0
output:
john,doo,male
expected:
"john","doo","male"
In Spark >= 2.X you should use the option quoteAll:
df.write \
.format('com.databricks.spark.csv') \
.mode('overwrite') \
.option("header", "false") \
.option("quoteAll","true") \
.save(output_path)
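If you are on Spark 2.x anyway, the built-in CSV writer should accept the same option, so the external package is not needed; a sketch, reusing the output_path from the question:
df.write \
    .mode('overwrite') \
    .option("header", "false") \
    .option("quoteAll", "true") \
    .csv(output_path)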