count and collect operations are taking too much time on an empty spark dataframe

I am creating an empty Spark data frame with spark.createDataFrame([], schema) and then adding rows from lists, but accessing the data frame (count, collect) takes much longer than usual on this dataframe.
dataframe.count() on 1000 rows of a data frame created from CSV files takes 300 ms, but on the empty data frame created from the schema it takes 4 seconds.
Where is this difference coming from?
from pyspark.sql.types import StructType, StructField, FloatType, StringType

schema = StructType([StructField('Average_Power', FloatType(), True),
                     StructField('Average_Temperature', FloatType(), True),
                     StructField('ClientId', StringType(), True)])
df = spark.createDataFrame([], schema)
df.count()
Is there a more optimized way to create an empty Spark data frame?

No, this is the recommended way in PySpark.
There is a certain overhead in starting up a Spark app or running things from a notebook. The data volume here is tiny, but the overhead of running a Spark app etc. is relatively large compared to such a small volume.
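For reference, a minimal sketch of the two usual ways to build an empty DataFrame in PySpark (spark is assumed to be an existing SparkSession). Neither avoids the fixed cost of scheduling a Spark job when an action such as count() runs, which is where the few seconds go.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, FloatType, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField('Average_Power', FloatType(), True),
                     StructField('Average_Temperature', FloatType(), True),
                     StructField('ClientId', StringType(), True)])

# Option 1: empty list plus explicit schema (as in the question).
df_empty = spark.createDataFrame([], schema)

# Option 2: empty RDD plus explicit schema; behaves the same for count()/collect().
df_empty_rdd = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

df_empty.count()  # still launches a job, so the per-action overhead applies either way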

Related

write large pyspark dataframe to s3 very slow

This question is related to my previous question at aggregate multiple columns in sql table as json or array.
I am posting some updates/follow-up questions here because I ran into a new problem.
I would like to query a table in a Presto database from PySpark Hive and create a PySpark dataframe based on it. I then need to save the dataframe to S3 quickly and read it back from S3 efficiently as Parquet (or any other format, as long as it can be read/written fast).
In order to keep the size as small as possible, I have aggregated some columns into a JSON object.
The original table (> 10^9 rows; some columns, e.g. obj_desc, may contain more than 30 English words):
id | cat_name  | cat_desc      | obj_name | obj_desc    | obj_num
1  | furniture | living office | desk     | 4 corners   | 1.5
1  | furniture | living office | chair    | 4 legs      | 0.8
1  | furniture | restroom      | tub      | white wide  | 2.7
1  | cloth     | fashion       | T-shirt  | black large | 1.1
I aggregated these columns into a JSON object as follows:
from pyspark.sql import functions as F

agg_cols = ['cat_name', 'cat_desc', 'obj_name', 'obj_desc', 'obj_num']  # they are all strings
df_temp = df.withColumn("cat_obj_metadata", F.to_json(F.struct([x for x in agg_cols]))).drop(*agg_cols)
df_temp_agg = df_temp.groupBy('id').agg(F.collect_list('cat_obj_metadata').alias('cat_obj_metadata'))
df_temp_agg.cache()
df_temp_agg.printSchema()
# df_temp_agg.count()  # this took a very long time and never returned a result, so I am not sure how large it is
df_temp_agg.repartition(1024)  # not sure what the optimal value should be; note that repartition returns a new DataFrame, so as written this line has no effect
df_temp_agg.write.parquet(s3_path, mode='overwrite')  # this ran for a long time (> 12 hours) without returning
I am working on a cluster of 4 m4.4xlarge nodes, and none of the cores look busy.
I also checked the S3 bucket: no folder is created at "s3_path".
For other, smaller dataframes, I can see "s3_path" being created when "write.parquet()" is run. But for this large dataframe, no folders or files are created at "s3_path".
Because df_temp_agg.write.parquet() never returns, I am not sure what errors could be happening on the Spark cluster or on S3.
Could anybody help me with this? Thanks.

When do I need to persist a dataframe

I am curious to know when I need to persist my dataframe in Spark and when I don't. Cases:
If I read data from a file, do I need to persist it when I apply repeated counts, like:
val df = spark.read.json("file:///root/Download/file.json")
df.count
df.count
Do I need to persist df? In my understanding, Spark should keep df in memory after the first count and reuse it for the second count. The file has 4 records, but when I check in practice, Spark reads the file again and again. So why doesn't Spark store it in memory?
My second question: is spark.read an action or a transformation?
DataFrames are immutable by design, so every transformation on them creates a new data frame. A Spark pipeline generally involves multiple transformations, leading to multiple data frames being created. If Spark stored all of these data frames, the memory requirement would be huge. So Spark leaves the responsibility of persisting data frames to the user. Whichever data frame you plan on reusing, you can persist it and later unpersist it when done.
I don't think we can define spark.read as an action or a transformation. Actions and transformations are applied to a data frame. To tell the difference, remember that a transformation returns a new dataframe, while an action returns a value (or values).
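To make the first point concrete, a minimal sketch of the persist/unpersist pattern, written in PySpark for consistency with the rest of this page (the Scala calls are the same); the file path is the one from the question and is assumed to exist:

df = spark.read.json("file:///root/Download/file.json")  # lazy: nothing is read yet

df.persist()    # or df.cache(); marks df for caching when it is first materialised
df.count()      # first action: reads the file and fills the cache
df.count()      # second action: served from the cached data, no re-read
df.unpersist()  # release the cached blocks once df is no longer reused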

Using dask to read data from Hive

I am using the as_pandas utility from impala.util to read data fetched from Hive into a dataframe. However, I think pandas will not be able to handle a large amount of data, and it will also be slower. I have been reading about Dask, which provides excellent functionality for reading large data files. How can I use it to efficiently fetch data from Hive?
def as_dask(cursor):
    """Return a DataFrame out of an impyla cursor.

    This will pull the entire result set into memory. For richer pandas-
    like functionality on distributed data sets, see the Ibis project.

    Parameters
    ----------
    cursor : `HiveServer2Cursor`
        The cursor object that has a result set waiting to be fetched.

    Returns
    -------
    DataFrame
    """
    import pandas as pd
    import dask
    import dask.dataframe as dd

    names = [metadata[0] for metadata in cursor.description]
    dfs = dask.delayed(pd.DataFrame.from_records)(cursor.fetchall(),
                                                  columns=names)
    return dd.from_delayed(dfs).compute()
There is no current straight-forward way to do this. You would do well to see the implementation of dask.dataframe.read_sql_table and similar code in intake-sql - you will probably want a way to partition your data, and have each of your workers fetch one partition via a call to delayed(). dd.from_delayed and dd.concat could then be used to stitch the pieces together.
-edit-
Your function has the delayed idea back to front. You are delaying and then immediately materialising the data within a function that operates on a single cursor - it can't be parallelised, and it will break your memory if the data is big (which is the reason you are trying this).
Let's suppose you can form a set of 10 queries, where each query gets a different part of the data; do not use OFFSET, use a condition on some column that is indexed by Hive.
You want to do something like:
queries = [SQL_STATEMENT.format(i) for i in range(10)]
def query_to_df(query):
    cursor = impyla.execute(query)  # shorthand for obtaining an impyla cursor that has run this query
    return pd.DataFrame.from_records(cursor.fetchall())
Now you have a function that returns a partition and has no dependence on global objects - it only takes as input a string.
parts = [dask.delayed(query_to_df)(q) for q in queries]
df = dd.from_delayed(parts)
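Putting the pieces together, here is a sketch of the whole pattern; the host, port, table, and partitioning column are hypothetical, and impala.dbapi.connect is used to obtain a cursor per partition:

import dask
import dask.dataframe as dd
import pandas as pd
from impala.dbapi import connect  # impyla

# Hypothetical: partition on an indexed integer column part_id with values 0..9.
SQL_STATEMENT = "SELECT * FROM my_table WHERE part_id = {}"
queries = [SQL_STATEMENT.format(i) for i in range(10)]

def query_to_df(query):
    # Each task opens its own connection, so no non-serialisable object is shared.
    conn = connect(host='hive-host', port=21050)  # hypothetical host/port
    cursor = conn.cursor()
    cursor.execute(query)
    names = [metadata[0] for metadata in cursor.description]
    return pd.DataFrame.from_records(cursor.fetchall(), columns=names)

parts = [dask.delayed(query_to_df)(q) for q in queries]
df = dd.from_delayed(parts)  # lazy dask dataframe; call .compute() only when needed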

spark works too slowly on a dataframe generated by the mongodb spark connector

I used the MongoDB Spark connector to generate a dataframe from MongoDB:
val df1 = df.filter(df("dev.app").isNotNull).select("dev.app").limit(100)
It's a big collection, so I limited the rows to 100.
when I use
df1.show()
it works fast.
But when I use
df1.count
to see the first row of df1,
it is too slow.
Can anybody give me some suggestions?
I think you should try to tweak the spark.sql.shuffle.partitions configuration. You may have very little data, but you are creating too many partitions; by default it is 200.
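For reference, a minimal sketch of how that setting can be lowered, shown in PySpark (the same spark.conf.set call exists on the Scala SparkSession); the value 8 is only illustrative, not a recommendation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Lower the number of shuffle partitions from the default of 200; tune the
# value to the actual data volume instead of using this illustrative 8.
spark.conf.set("spark.sql.shuffle.partitions", "8")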

How can I find the size of each Row in an Apache Spark SQL dataframe and discard rows larger than a threshold size in kilobytes

I am new to Apache Spark SQL in Scala.
How can I find the size of each Row in an Apache Spark SQL dataframe and discard the rows whose size exceeds a threshold in kilobytes? I am looking for a Scala solution.
This is actually kind of a tricky problem. Spark SQL uses columnar data storage, so thinking about individual row sizes isn't super natural. We can of course call .rdd on the DataFrame; from there you can filter the resulting RDD using techniques such as those from Calculate size of Object in Java to determine the object size, and then take your RDD of Rows and convert it back to a DataFrame using your SQLContext.
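A rough sketch of that drop-to-RDD idea, written in PySpark for consistency with the rest of this page (the question asks for Scala, but the pattern translates directly). The size estimate below uses the length of each row's pickled dictionary as a cheap proxy for object size, which is an approximation rather than the exact in-memory footprint; threshold_kb and the input df are assumptions:

import pickle

threshold_kb = 64  # hypothetical threshold in kilobytes

def row_size_kb(row):
    # Approximate the row's size by the length of its serialised form.
    return len(pickle.dumps(row.asDict())) / 1024.0

small_rows = df.rdd.filter(lambda row: row_size_kb(row) <= threshold_kb)
df_filtered = spark.createDataFrame(small_rows, df.schema)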