Spark group by several fields several times on the same RDD - optimization

My data is stored in csv format, and the headers are given in the column_names variable.
I wrote the following code to read it into a python dictionary RDD
rdd = sc.textFile(hdfs_csv_dir)\
    .map(lambda x: x.split(','))\
    .filter(lambda row: len(row) == len(column_names))\
    .map(lambda row: dict([(column, row[index]) for index, column in enumerate(column_names)]))
Next, I wrote a function that counts the combinations of column values given the column names
import operator
def count_by(rdd, cols=[]):
    '''
    Equivalent to:
    SELECT col1, col2, COUNT(*) FROM MX3 GROUP BY col1, col2;
    But the number of columns can be more than 2
    '''
    counts = rdd.map(lambda x: (','.join([str(x[c]) for c in cols]), 1))\
        .reduceByKey(operator.add)\
        .map(lambda t: t[0].split(',') + [t[1]])\
        .collect()
    return counts
I am running count_by several times, with many different parameters, on the same rdd.
What is the best way to optimize the query and make it run faster?

First, you should cache the RDD (by calling cachedRdd = rdd.cache()) before passing it multiple times into count_by, to prevent Spark from loading it from disk for each operation. Operating on a cached RDD means the data will be loaded into memory upon first use (the first call to count_by) and then read from memory for the following calls.
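A minimal sketch of that pattern (count_by and rdd are the names from the question; the column names are illustrative):
cached_rdd = rdd.cache()
counts_ab = count_by(cached_rdd, ['col1', 'col2'])  # first call loads the data into memory
counts_ac = count_by(cached_rdd, ['col1', 'col3'])  # later calls are served from the cache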
You should also consider using the Spark DataFrame API instead of the lower-level RDD API, since:
You seem to articulate your intentions using SQL, and the DataFrame API allows you to actually use such a dialect.
When using DataFrames, Spark can perform some extra optimizations, since it has a better understanding of what you are trying to do and can design the best way to achieve it. A SQL-like dialect is declarative: you only say what you want, not how to get it, which gives Spark more freedom to optimize.
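A hedged sketch of the DataFrame route (hdfs_csv_dir and column_names are from the question; it assumes every row really has len(column_names) fields, and the grouped columns are illustrative):
df = spark.read.csv(hdfs_csv_dir).toDF(*column_names)
df.cache()

def count_by_df(df, cols):
    # equivalent to SELECT col1, col2, COUNT(*) FROM t GROUP BY col1, col2
    return df.groupBy(*cols).count().collect()

counts = count_by_df(df, ['col1', 'col2'])
The same query can also be written literally as SQL after df.createOrReplaceTempView('mx3') and spark.sql('SELECT col1, col2, COUNT(*) FROM mx3 GROUP BY col1, col2').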

Related

Improving for loop over groupby keys of a pandas dataframe and concatenating

I am currently in a scenario in which I have a (really inefficient) for loop that processes a pandas dataframe df with a complicated aggregation function (say complicated_function_1) over a list of groupby keys (groupby_keys), as follows:
final_result = []
for groupby_key in groupby_keys:
    small_df = complicated_function_1(df.loc[df['groupby_key'] == groupby_key])
    final_result.append(small_df)
df_we_want = pd.concat(final_result)
I am trying to see if there is a more efficient way of dealing with this, rather than having to use a for loop, especially if there are many groupby_keys.
Is it possible to convert all this into a single function for me to just groupby and agg/concatenate or pipe? Or is this procedure doomed to be constrained to the for loop? I have tried multiple combinations but have been getting an assortment of errors (have not really been able to pin down something specific).
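A hedged sketch of the single-call route the question is asking about (it assumes complicated_function_1 takes a sub-dataframe and returns a dataframe, exactly as in the loop, and that only the keys listed in groupby_keys should be kept):
df_we_want = (
    df[df['groupby_key'].isin(groupby_keys)]
    .groupby('groupby_key', group_keys=False)
    .apply(complicated_function_1)
)
group_keys=False keeps the result shaped like the pd.concat of the per-key frames rather than adding the key to the index.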

Output Dataframe to CSV File using Repartition and Coalesce

Currently I am working on a single-node Hadoop setup, and I wrote a job that outputs a sorted dataframe with only one partition to a single CSV file. I discovered several different outcomes depending on how I used repartition.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks instead of globally.
Then, I tried dropping the repartition call, but the output contained only part of the records. I realized that without repartition Spark outputs 200 CSV files instead of 1, even though I am working on a one-partition dataframe.
Next, I tried placing repartition(1), repartition(1, "column of partition"), and repartition(20) before orderBy, yet the output remained the same: 200 CSV files.
So I used coalesce(1) before orderBy, and the problem was fixed.
I do not understand why working on a single-partition dataframe requires repartition or coalesce, and how the aforesaid steps affect the output. I would be grateful if someone could elaborate a little.
Spark has relevant parameters here:
spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform operations like the sort in your case, Spark triggers what is called a shuffle operation:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
That will split your dataframe into spark.sql.shuffle.partitions partitions.
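A quick hedged way to see this in practice (some_col is an illustrative column name):
sorted_df = df.orderBy("some_col")  # the sort triggers a shuffle
sorted_df.rdd.getNumPartitions()    # typically spark.sql.shuffle.partitions, i.e. 200 by default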
I also struggled with the same problem as you and did not find any elegant solution.
Spark generally doesn't have a great concept of ordered data, because all your data is split across multiple partitions, and every time you call an operation that requires a shuffle your ordering is changed.
For this reason, you're better off only sorting your data in Spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger.
As Miroslav points out, your data gets shuffled between partitions every time you trigger what's called a shuffle stage (things like grouping, join, or window operations).
You can set the number of shuffle partitions in the Spark config; the default is 200.
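For example (a hedged sketch; the value 1 is just an illustration for a small single-node job):
spark.conf.set("spark.sql.shuffle.partitions", "1")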
Calling repartition before a group-by operation is kind of pointless, because Spark needs to repartition your data again to execute the group-by.
Coalesce operations sometimes get pushed into the shuffle stage by Spark, so maybe that's why it worked. Either that, or because you called it after the groupby operation.
A good way to understand what's going on with your query is to start using the Spark UI; it's normally available at http://localhost:4040
More info here https://spark.apache.org/docs/3.0.0-preview/web-ui.html
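For reference, a hedged sketch of one common way to get a single sorted CSV file (df, sort_col and out_path are illustrative names; coalescing to one partition also limits the parallelism of the final stage):
df.orderBy("sort_col") \
    .coalesce(1) \
    .write.option("header", True).csv(out_path)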

Is there an idiomatic way to cache Spark dataframes?

I have a large parquet dataset that I am reading with Spark. Once read, I filter for a subset of rows which are used in a number of functions that apply different transformations:
The following is similar but not exact logic to what I'm trying to accomplish:
from pyspark.sql.functions import col, lit

df = spark.read.parquet(file)
special_rows = df.filter(col('special') > 0)
# Thinking about adding the following line
special_rows.cache()

def f1(df):
    new_df_1 = df.withColumn('foo', lit(0))
    return new_df_1

def f2(df):
    new_df_2 = df.withColumn('foo', lit(1))
    return new_df_2

new_df_1 = f1(special_rows)
new_df_2 = f2(special_rows)
output_df = new_df_1.union(new_df_2)
output_df.write.parquet(location)
Because a number of functions might be using this filtered subset of rows, I'd like to cache or persist it in order to potentially speed up execution speed / memory consumption. I understand that in the above example, there is no action called until my final write to parquet.
My question is, do I need to insert some sort of call to count(), for example, in order to trigger the caching, or will Spark, during that final write to parquet call, be able to see that this dataframe is being used in f1 and f2 and cache the dataframe itself?
If yes, is this an idiomatic approach? Does this mean in production and large scale Spark jobs that rely on caching, random operations that force an action on the dataframe pre-emptively are frequently used, such as a call to count?
there is no action called until my final write to parquet.
and
Spark during that final write to parquet call will be able to see that this dataframe is being used in f1 and f2 and will cache the dataframe itself.
are correct. If you do output_df.explain(), you will see the query plan, which will show that what you said is correct.
Thus, there is no need to do special_rows.cache(). Generally, cache is only necessary if you intend to reuse the dataframe after forcing Spark to calculate something, e.g. after write or show. If you see yourself intentionally calling count(), you're probably doing something wrong.
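A hedged sketch of the case where cache does pay off (the later filter and its threshold are purely illustrative):
special_rows.cache()
output_df.write.parquet(location)               # first action; materialises the cache
special_rows.filter(col('special') > 5).show()  # later reuse is served from the cache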
You might want to repartition after running special_rows = df.filter(col('special') > 0). There can be a large number of empty partitions after running a filtering operation, as explained here.
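A hedged sketch of that suggestion (the partition count is an arbitrary illustration):
special_rows = df.filter(col('special') > 0).repartition(200)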
Computing new_df_1 will populate the cache of special_rows, which is then reused by new_df_2 in new_df_1.union(new_df_2). That's not necessarily a performance optimization. Caching is expensive. I've seen caching slow down a lot of computations, even when it's being used in a textbook manner (i.e. caching a DataFrame that gets reused several times downstream).
Counting does not necessarily make sure the data is cached. Counts avoid scanning rows whenever possible. They'll use the Parquet metadata when they can, which means they don't cache all the data like you might expect.
You can also "cache" data by writing it to disk. Something like this:
df.filter(col('special') > 0).repartition(500).write.parquet("some_path")
special_rows = spark.read.parquet("some_path")
To summarize, yes, the DataFrame will be cached in this example, but it's not necessarily going to make your computation run any faster. It might be better to have no cache or to "cache" by writing data to disk.

Performance Between SQL and withColumn

Suppose I create the following dataframe:
import numpy as np
import pandas as pd

dt = pd.DataFrame(np.array([[1, 5], [2, 12], [4, 17]]), columns=['a', 'b'])
df = spark.createDataFrame(dt)
I want to create a third column, c, that is the sum of these two columns. I have the following two ways to do so.
The withColumn() method in Spark:
df1 = df.withColumn('c', df.a + df.b)
Or using sql:
df.createOrReplaceTempView('mydf')
df2 = spark.sql('select *, a + b as c from mydf')
While both yield the same results, which method is computationally faster?
Also, how does sql compare to a spark user defined function?
While both yield the same results, which method is computationally faster?
Look at the execution plans:
df1.explain()
#== Physical Plan ==
#*(1) Project [a#0L, b#1L, (a#0L + b#1L) AS c#4L]
#+- Scan ExistingRDD[a#0L,b#1L]
df2.explain()
#== Physical Plan ==
#*(1) Project [a#0L, b#1L, (a#0L + b#1L) AS c#8L]
#+- Scan ExistingRDD[a#0L,b#1L]
Since these are the same, the two methods are identical.
Generally speaking, there is no computational advantage of using either withColumn or spark-sql over the other. If the code is written properly, the underlying computations will be identical.
There may be some cases where it's easier to express something using spark-sql, for example if you wanted to use a column value as a parameter to a spark function.
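One hedged illustration (the dataframe, the column names, and the claim that repeat() only accepts a literal count in many PySpark versions are assumptions made for the example):
from pyspark.sql import functions as F

df4 = spark.createDataFrame([('ab', 2), ('xyz', 3)], ['s', 'n'])
# F.repeat(F.col('s'), n) expects a literal integer count in many PySpark versions,
# while the SQL expression form happily takes a column for the count:
df4 = df4.withColumn('s_repeated', F.expr('repeat(s, n)'))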
Also, how does sql compare to a spark user defined function?
Take a look at this post: Spark functions vs UDF performance?

Using dask to read data from Hive

I am using the as_pandas utility from impala.util to read data fetched from Hive into a dataframe. However, I think pandas will not be able to handle a large amount of data and will also be slower. I have been reading about dask, which provides excellent functionality for reading large data files. How can I use it to efficiently fetch data from Hive?
def as_dask(cursor):
    """Return a DataFrame out of an impyla cursor.

    This will pull the entire result set into memory. For richer pandas-
    like functionality on distributed data sets, see the Ibis project.

    Parameters
    ----------
    cursor : `HiveServer2Cursor`
        The cursor object that has a result set waiting to be fetched.

    Returns
    -------
    DataFrame
    """
    import pandas as pd
    import dask
    import dask.dataframe as dd

    names = [metadata[0] for metadata in cursor.description]
    dfs = dask.delayed(pd.DataFrame.from_records)(cursor.fetchall(),
                                                  columns=names)
    return dd.from_delayed(dfs).compute()
There is no current straightforward way to do this. You would do well to look at the implementation of dask.dataframe.read_sql_table and similar code in intake-sql: you will probably want a way to partition your data, and have each of your workers fetch one partition via a call to delayed(). dd.from_delayed and dd.concat could then be used to stitch the pieces together.
-edit-
Your function has the delayed idea back to front. You are delaying and then immediately materialising the data within a function that operates on a single cursor, so it can't be parallelised and it will break your memory if the data is big (which is the reason you are trying this).
Let's suppose you can form a set of 10 queries, where each query gets a different part of the data; do not use OFFSET, use a condition on some column that is indexed by Hive.
You want to do something like:
queries = [SQL_STATEMENT.format(i) for i in range(10)]

def query_to_df(query):
    cursor = impyla.execute(query)
    return pd.DataFrame.from_records(cursor.fetchall())
Now you have a function that returns a partition and has no dependence on global objects - it only takes as input a string.
parts = [dask.delayed(query_to_df)(q) for q in queries]
df = dd.from_delayed(parts)
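A hedged usage sketch on top of that (it assumes the usual import pandas as pd / import dask.dataframe as dd; the column names are hypothetical, and meta is an optional from_delayed argument that spells out the expected schema so dask does not have to infer it):
meta = pd.DataFrame({'id': pd.Series(dtype='int64'),
                     'value': pd.Series(dtype='object')})
df = dd.from_delayed(parts, meta=meta)
print(len(df))  # triggers the 10 queries in parallel and counts the rows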