Suppose I create the following dataframe:
import numpy as np
import pandas as pd

dt = pd.DataFrame(np.array([[1, 5], [2, 12], [4, 17]]), columns=['a', 'b'])
df = spark.createDataFrame(dt)
I want to create a third column, c, that is the sum of these two columns. I have the following two ways to do so.
The withColumn() method in Spark:
df1 = df.withColumn('c', df.a + df.b)
Or using SQL:
df.createOrReplaceTempView('mydf')
df2 = spark.sql('select *, a + b as c from mydf')
While both yield the same results, which method is computationally faster?
Also, how does SQL compare to a Spark user-defined function?
While both yield the same results, which method is computationally faster?
Look at the execution plans:
df1.explain()
#== Physical Plan ==
#*(1) Project [a#0L, b#1L, (a#0L + b#1L) AS c#4L]
#+- Scan ExistingRDD[a#0L,b#1L]
df2.explain()
#== Physical Plan ==
#*(1) Project [a#0L, b#1L, (a#0L + b#1L) AS c#8L]
#+- Scan ExistingRDD[a#0L,b#1L]
Since these plans are the same, the two methods will be executed identically.
Generally speaking, there is no computational advantage to using either withColumn or spark-sql over the other. If the code is written properly, the underlying computations will be identical.
There may be some cases where it's easier to express something using spark-sql, for example if you wanted to use a column value as a parameter to a spark function.
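For instance, here is a rough sketch of that case, using hypothetical columns start_date and n_months that are not part of the example above: in SQL you can feed one column in as the parameter of a function applied to another column directly, while with the DataFrame API you may need expr(), since (depending on the Spark version) the built-in function may only accept a literal for that parameter.
from pyspark.sql import functions as F
# SQL form: the number of months to add comes straight from a column
spark.sql('select *, add_months(start_date, n_months) as end_date from mydf')
# DataFrame form: expr() embeds the same SQL fragment as a column expression
df.withColumn('end_date', F.expr('add_months(start_date, n_months)'))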
Also, how does SQL compare to a Spark user-defined function?
Take a look at this post: Spark functions vs UDF performance?
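The short version of that comparison, sketched against the df defined above: a native column expression is optimised by Catalyst and stays inside the JVM, while a Python UDF has to ship rows out to a Python worker and back, so it is usually slower for the same logic.
from pyspark.sql import functions as F
from pyspark.sql.types import LongType
# Native expression: handled entirely by Catalyst/Tungsten
df.withColumn('c', df.a + df.b)
# Python UDF computing the same sum: pays (de)serialisation overhead per row
add_udf = F.udf(lambda a, b: a + b, LongType())
df.withColumn('c', add_udf('a', 'b'))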
Related
I am currently in a scenario in which I have a (really inefficient) for loop that attempts to process a pandas dataframe df with a complicated aggregation function (say complicated_function) over some groupby keys (stored in groupby_keys) as follows:
final_result = []
for groupby_key in groupby_keys:
    small_df = complicated_function(df.loc[df['groupby_key'] == groupby_key])
    final_result.append(small_df)
df_we_want = pd.concat(final_result)
I am trying to see if there is a more efficient way of dealing with this, rather than having to use a for loop, especially if there are many groupby_keys.
Is it possible to convert all of this into a single function so I can just groupby and agg/concatenate or pipe? Or is this procedure doomed to be constrained to the for loop? I have tried multiple combinations but have been getting an assortment of errors (I have not really been able to pin down anything specific).
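For reference, this is a sketch of the kind of groupby rewrite I have in mind (assuming complicated_function takes a sub-dataframe and returns a dataframe, as in the loop above, and that the column really is named 'groupby_key'):
# groupby + apply calls complicated_function once per key and concatenates
# the per-group results, replacing the explicit loop
df_we_want = (
    df[df['groupby_key'].isin(groupby_keys)]   # keep only the keys of interest
      .groupby('groupby_key', group_keys=False)
      .apply(complicated_function)
)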
I have a large parquet dataset that I am reading with Spark. Once read, I filter for a subset of rows which are used in a number of functions that apply different transformations.
The following is similar, though not identical, to the logic I'm trying to accomplish:
from pyspark.sql.functions import col, lit

df = spark.read.parquet(file)
special_rows = df.filter(col('special') > 0)
# Thinking about adding the following line
special_rows.cache()
def f1(df):
    new_df_1 = df.withColumn('foo', lit(0))
    return new_df_1

def f2(df):
    new_df_2 = df.withColumn('foo', lit(1))
    return new_df_2
new_df_1 = f1(special_rows)
new_df_2 = f2(special_rows)
output_df = new_df_1.union(new_df_2)
output_df.write.parquet(location)
Because a number of functions might use this filtered subset of rows, I'd like to cache or persist it to potentially improve execution time / memory consumption. I understand that in the above example, there is no action called until my final write to parquet.
My question is: do I need to insert some sort of call, count() for example, in order to trigger the caching, or will Spark during that final write to parquet call be able to see that this dataframe is being used in f1 and f2 and cache the dataframe itself?
If yes, is this an idiomatic approach? Does this mean that production and large-scale Spark jobs which rely on caching frequently use operations that pre-emptively force an action on the dataframe, such as a call to count()?
there is no action called until my final write to parquet.
and
Spark during that final write to parquet call will be able to see that this dataframe is being used in f1 and f2 and will cache the dataframe itself.
are correct. If you run output_df.explain(), you will see the query plan, which confirms this.
Thus, there is no need to do special_rows.cache(). Generally, cache is only necessary if you intend to reuse the dataframe after forcing Spark to calculate something, e.g. after write or show. If you see yourself intentionally calling count(), you're probably doing something wrong.
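As a sketch of the case where cache does pay off (the output path here is a placeholder, reusing special_rows from the question): the cached DataFrame feeds two separate actions, so the second action reads from memory instead of re-reading and re-filtering the parquet files.
special_rows = df.filter(col('special') > 0).cache()
# Action 1: triggers the computation and fills the cache
special_rows.write.parquet('first_location')
# Action 2: reuses the cached rows instead of recomputing the filter
special_rows.groupBy('special').count().show()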
You might want to repartition after running special_rows = df.filter(col('special') > 0). There can be a large number of empty partitions after running a filtering operation, as explained here.
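A minimal sketch of that, where the partition count is just a placeholder you would tune for your data:
# collapse the many (possibly empty) post-filter partitions into fewer ones
special_rows = df.filter(col('special') > 0).repartition(10)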
Caching special_rows means it will be computed once for new_df_1 and then reused by new_df_2 in new_df_1.union(new_df_2). That's not necessarily a performance optimization, though. Caching is expensive. I've seen caching slow down a lot of computations, even when it's being used in a textbook manner (i.e. caching a DataFrame that gets reused several times downstream).
Counting does not necessarily ensure the data is cached. Counts avoid scanning rows whenever possible: they'll use the Parquet metadata when they can, which means they may not cache all the data like you might expect.
You can also "cache" data by writing it to disk. Something like this:
df.filter(col('special') > 0).repartition(500).write.parquet("some_path")
special_rows = spark.read.parquet("some_path")
To summarize, yes, the DataFrame will be cached in this example, but it's not necessarily going to make your computation run any faster. It might be better to have no cache or to "cache" by writing data to disk.
I'm just wondering if this Spark code
val df = spark.sql("select * from db.table").filter(col("field") === value)
is as efficient as this one:
val df = spark.sql("select * from db.table where field=value")
In the first block, are we loading all the Hive data into RAM, or is Spark smart enough to filter those values in Hive during the execution of the generated DAG?
Thanks in advance!
Whether we apply the filter through DataFrame functions or through Spark SQL on a dataframe or its view, both will result in the same physical plan (the plan according to which a Spark job is actually executed across the cluster).
The reason behind this is Apache Spark's Catalyst optimiser. It is a built-in feature of Spark which turns input SQL queries or DataFrame transformations into an optimised logical plan and then a cost-optimised physical plan.
You can also have a look at this Databricks link to understand it more clearly. Further, we can check this physical plan using the .explain function (caution: .explain's output should be read bottom-up, since its last line represents the start of the physical plan and its first line represents the end).
You don't use the same functions, but internally it's the same.
You can use explain() to check the logical plan:
spark.sql("select * from db.table").filter(col("field") === value).explain()
spark.sql("select * from db.table where field=value").explain()
In the first case you use a mix of Spark SQL and the Dataset API with .filter(col("field") === value); in the second case you use pure SQL.
Problem Statement
I have a large (e.g. 2GB) dataframe indexed by short strings (e.g. FAM227A, PVT1, TAT). It represents a database.
I also have a list of short "query" strings.
I'd like to compute some statistics on the slice of the dataframe defined by my list of query strings. Essentially, I'd like to query the "database" and then do some statistics. I'm wondering what the pandas way of doing this is. I can think of a couple of ways:
database.merge(query, how='right', left_index=True, right_on=0)
something involving loc like database.loc[query]
something involving where
something else?
My requirements are that this be maximally fast and ideally not copy the data (i.e. return a view); a particularity of my case is that my queries will often be for > 90% of the keys in the DB.
Example data
db = pd.DataFrame(['FAM227A', 'PVT1', 'TAT'], columns=['name']).set_index('name')
q = ['FAM227A']
# ... your query operation here ...
# ... returns a dataframe with the info from DB about the keys from query
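For illustration, a sketch of what option 2 (loc) would look like against the toy data (note that .loc raises a KeyError if any query label is missing from the index, hence the intersection variant):
# all query keys present in the index: plain label lookup
result = db.loc[q]
# if some query keys might be missing, intersect with the index first
result = db.loc[db.index.intersection(q)]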
My data is stored in csv format, and the headers are given in the column_names variable.
I wrote the following code to read it into an RDD of Python dictionaries:
rdd=sc.textFile(hdfs_csv_dir)\
.map(lambda x: x.split(','))\
.filter(lambda row: len(row)==len(column_names))\
.map(lambda row: dict([(column,row[index]) for index,column in enumerate(column_names)]))
Next, I wrote a function that counts the combinations of column values given the column names
import operator

def count_by(rdd, cols=[]):
    '''
    Equivalent to:
        SELECT col1, col2, COUNT(*) FROM MX3 GROUP BY col1, col2;
    but the number of columns can be more than 2.
    '''
    counts = rdd.map(lambda x: (','.join([str(x[c]) for c in cols]), 1))\
                .reduceByKey(operator.add)\
                .map(lambda t: t[0].split(',') + [t[1]])\
                .collect()
    return counts
I am running count_by several times, with a lot of different parameters, on the same rdd.
What is the best way to optimize the query and make it run faster?
First, you should cache the RDD (by calling cachedRdd = rdd.cache()) before passing it multiple times into count_by, to prevent Spark from loading it from disk for each operation. Operating on a cached RDD means the data will be loaded into memory upon first use (the first call to count_by), then read from memory for the following calls.
You should also consider using the Spark DataFrame API instead of the lower-level RDD API, since:
You seem to articulate your intentions using SQL, and the DataFrame API lets you actually use such a dialect.
When using DataFrames, Spark can perform some extra optimizations since it has a better understanding of what you are trying to do, so it can design the best way to achieve it. The SQL-like dialect is declarative: you only say what you want, not how to get it, which gives Spark more freedom to optimize.
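As a hedged sketch of what the DataFrame route might look like (it assumes the CSV files carry no header row, since the headers come from column_names, that the rows are well-formed, and count_by_df / the column names are placeholders):
# Read the CSV into a DataFrame, assign the column names, and cache it for reuse
df = spark.read.csv(hdfs_csv_dir).toDF(*column_names).cache()
def count_by_df(df, cols):
    # DataFrame equivalent of count_by: GROUP BY the given columns and count
    return df.groupBy(*cols).count()
counts = count_by_df(df, ['col1', 'col2']).collect()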