Avoiding shuffle on GROUP BY in Spark SQL [duplicate]

I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles.
Context:
I am running Spark 2.0.1 with a local SparkSession. I have a CSV dataset that I save as a Parquet file on disk like so:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col
val df0 = spark
  .read
  .format("csv")
  .option("header", true)
  .option("delimiter", ";")
  .option("inferSchema", false)
  .load("SomeFile.csv")
// hash-partition the rows into 42 partitions by card number
val df = df0.repartition(42, col("numerocarte"))
df.write
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .option("inferSchema", false)
  .save("SomeFile.parquet")
I am creating 42 partitions by column numerocarte. This should group multiple numerocarte to same partition. I don't want to do partitionBy("numerocarte") at the write time because I don't want one partition per card. It would be millions of them.
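To make the difference between the two approaches concrete, here is a minimal plain-Python sketch of hash partitioning: many distinct keys are mapped onto a fixed number of partitions, and every occurrence of a key lands in the same one. The `zlib.crc32` hash is only a stand-in chosen for illustration (it is not the hash Spark uses), and the card IDs are made up.

```python
import zlib

NUM_PARTITIONS = 42  # same count as in the question


def partition_for(card_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition index. crc32 is a stand-in, not Spark's hash."""
    return zlib.crc32(card_id.encode("utf-8")) % num_partitions


def hash_partition(card_ids, num_partitions: int = NUM_PARTITIONS):
    """Group keys by the partition index they hash to."""
    partitions = {}
    for card_id in card_ids:
        index = partition_for(card_id, num_partitions)
        partitions.setdefault(index, []).append(card_id)
    return partitions


# Hypothetical card numbers: many distinct keys, few partitions.
cards = [f"card-{i}" for i in range(10_000)]
partitions = hash_partition(cards)
print(len(partitions), "partitions hold", len(set(cards)), "distinct cards")
```

With partitionBy at write time, by contrast, each distinct key would get its own directory, which is what the question wants to avoid.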
After that in another script I read this SomeFile.parquet parquet file and do some operations on it. In particular I am running a window function on it where the partitioning is done on the same column that the parquet file was repartitioned by.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df2 = spark.read
.format("parquet")
.option("header", true)
.option("inferSchema", false)
.load("SomeFile.parquet")
val w = Window.partitionBy(col("numerocarte"))
.orderBy(col("SomeColumn"))
df2.withColumn("NewColumnName",
  sum(col("dollars")).over(w))
After the read I can see that the repartition worked as expected: DataFrame df2 has 42 partitions, and each of them contains different cards.
Questions:
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
If it knows, then there will be no shuffle in the window function. True?
If it does not know, it will do a shuffle in the window function. True?
If it does not know, how do I tell Spark the data is already partitioned by the right column?
How can I check a partitioning key of DataFrame? Is there a command for this? I know how to check number of partitions but how to see partitioning key?
When I print number of partitions in a file after each step, I have 42 partitions after read and 200 partitions after withColumn which suggests that Spark repartitioned my DataFrame.
If I have two different tables repartitioned with the same column, would the join use that information?

Does Spark know that the dataframe df2 is partitioned by column numerocarte?
It does not.
If it does not know, how do I tell Spark the data is already partitioned by the right column?
You don't. Just because you save data that has been shuffled does not mean it will be loaded with the same splits.
How can I check a partitioning key of DataFrame?
There is no partitioning key once you have loaded the data, but you can check queryExecution for a Partitioner.
In practice:
If you want to support efficient pushdowns on the key, use the partitionBy method of DataFrameWriter.
If you want limited support for join optimizations, use bucketBy with a metastore and persistent tables.
See How to define partitioning of DataFrame? for detailed examples.

I am answering my own question to record, for future reference, what worked.
Following the suggestion of @user8371915, bucketBy works!
I am saving my DataFrame df:
df.write
.bucketBy(250, "userid")
.saveAsTable("myNewTable")
Then when I need to load this table:
val df2 = spark.sql("SELECT * FROM myNewTable")
val w = Window.partitionBy("userid")
val df3 = df2.withColumn("newColumnName", sum(col("someColumn")).over(w))
df3.explain
I confirm that when I run window functions on df2 partitioned by userid, there is no shuffle! Thanks @user8371915!
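As a rough illustration of why no shuffle is needed here, the plain-Python sketch below computes a per-user window sum bucket by bucket. It assumes, as bucketing by userid is meant to guarantee, that all rows for a given userid already sit in the same bucket. The data and the `add_window_sum` helper are hypothetical and are not Spark code.

```python
from collections import defaultdict


def add_window_sum(bucket):
    """Attach the per-user total to every row, using only rows from this bucket."""
    totals = defaultdict(float)
    for userid, amount in bucket:
        totals[userid] += amount
    return [(userid, amount, totals[userid]) for userid, amount in bucket]


# Hypothetical pre-bucketed data: every row for a given userid lives in one bucket.
buckets = [
    [("u1", 10.0), ("u3", 5.0), ("u1", 2.5)],
    [("u2", 7.0), ("u2", 1.0)],
]

# Each bucket is processed independently; no row has to move between buckets.
result = [add_window_sum(bucket) for bucket in buckets]
for bucket in result:
    print(bucket)
```

If the rows for one userid were spread over several buckets, the totals computed this way would be wrong, which is why an engine that cannot rely on the layout has to redistribute the rows first.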
Some things I learned while investigating it
myNewTable looks like a normal parquet file but it is not. You could read it normally with spark.read.format("parquet").load("path/to/myNewTable") but the DataFrame created this way will not keep the original partitioning! You must use spark.sql select to get correctly partitioned DataFrame.
You can look inside the table with spark.sql("describe formatted myNewTable").collect.foreach(println). This will tell you what columns were used for bucketing and how many buckets there are.
Window functions and joins that take advantage of partitioning often also require a sort. You can sort the data in your buckets at write time using .sortBy(), and the sort will also be preserved in the Hive table: df.write.bucketBy(250, "userid").sortBy("someColumnName").saveAsTable("myNewTable")
When working in local mode, the table myNewTable is saved to a spark-warehouse folder in my local Scala SBT project. When saving in cluster mode with Mesos via spark-submit, it is saved to the Hive warehouse. For me it was located in /user/hive/warehouse.
When doing spark-submit you need to add two options to your SparkSession: .config("hive.metastore.uris", "thrift://address-to-your-master:9083") and .enableHiveSupport(). Otherwise the Hive tables you created will not be visible.
If you want to save your table to a specific database, do spark.sql("USE your database") before bucketing.
Update 05-02-2018
I encountered some problems with Spark bucketing and the creation of Hive tables. Please refer to the question, replies, and comments in Why is Spark saveAsTable with bucketBy creating thousands of files?

Related

Dask not recovering partitions from simple (non-Hive) Parquet files

I have a two-part question about Dask+Parquet. I am trying to run queries on a dask dataframe created from a partitioned Parquet file as so:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import fastparquet
##### Generate random data to Simulate Process creating a Parquet file ######
test_df = pd.DataFrame(data=np.random.randn(10000, 2), columns=['data1', 'data2'])
test_df['time'] = pd.bdate_range('1/1/2000', periods=test_df.shape[0], freq='1S')
# some grouping column
test_df['name'] = np.random.choice(['jim', 'bob', 'jamie'], test_df.shape[0])
##### Write to partitioned parquet file, hive and simple #####
fastparquet.write('test_simple.parquet', data=test_df, partition_on=['name'], file_scheme='simple')
fastparquet.write('test_hive.parquet', data=test_df, partition_on=['name'], file_scheme='hive')
# now check partition sizes. Only Hive version works.
assert test_df.name.nunique() == dd.read_parquet('test_hive.parquet').npartitions # works.
assert test_df.name.nunique() == dd.read_parquet('test_simple.parquet').npartitions # !!!!FAILS!!!
My goal here is to be able to quickly filter and process individual partitions in parallel using dask, something like this:
df = dd.read_parquet('test_hive.parquet')
df.map_partitions(<something>) # operate on each partition
I'm fine with using the Hive-style Parquet directory, but I've noticed it takes significantly longer to operate on compared to directly reading from a single parquet file.
Can someone tell me the idiomatic way to achieve this? Still fairly new to Dask/Parquet so apologies if this is a confused approach.
Maybe it wasn't clear from the docstring, but partitioning by value simply doesn't happen for the "simple" file type, which is why it only has one partition.
As for speed, reading the data in one single function call is fastest when the data are so small - especially if you intend to do any operation such as nunique which will require a combination of values from different partitions.
In Dask, every task incurs an overhead, so unless the amount of work being done by the call is large compared to that overhead, you can lose out. In addition, disk access is not generally parallelisable, and some parts of the computation may not be able to run in parallel in threads if they hold the GIL. Finally, the partitioned version contains more parquet metadata to be parsed.
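The trade-off described above can be put into a toy cost model. This is only a back-of-the-envelope sketch: it assumes the work splits evenly across workers and that each task adds a fixed, serially paid overhead. The overhead and workload figures are invented for illustration and are not measurements of Dask.

```python
def estimated_runtime(total_work_s, n_partitions, n_workers, per_task_overhead_s):
    """Toy model: compute time is divided across workers, overhead is paid per task."""
    parallelism = min(n_partitions, n_workers)
    compute = total_work_s / parallelism
    overhead = per_task_overhead_s * n_partitions  # assumed to be paid serially
    return compute + overhead


OVERHEAD_S = 0.005  # hypothetical per-task overhead

# Tiny workload: splitting into more partitions costs more than it saves.
small_single = estimated_runtime(0.01, n_partitions=1, n_workers=3, per_task_overhead_s=OVERHEAD_S)
small_split = estimated_runtime(0.01, n_partitions=3, n_workers=3, per_task_overhead_s=OVERHEAD_S)

# Large workload: the same split pays off.
large_single = estimated_runtime(60.0, n_partitions=1, n_workers=3, per_task_overhead_s=OVERHEAD_S)
large_split = estimated_runtime(60.0, n_partitions=3, n_workers=3, per_task_overhead_s=OVERHEAD_S)

print(small_single, small_split)
print(large_single, large_split)
```

Under these assumptions the single-partition read wins for the small dataset and loses for the large one, which matches the advice to keep partitions big enough.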
>>> len(dd.read_parquet('test_hive.parquet').name.nunique())
12
>>> len(dd.read_parquet('test_simple.parquet').name.nunique())
6
TL;DR: make sure your partitions are big enough to keep dask busy.
(note: the set of unique values is already apparent from the parquet metadata, it shouldn't be necessary to load the data at all; but Dask doesn't know how to do this optimisation since, after all, some of the partitions may contain zero rows)

Finding the latest version of each row using Dask with Parquet files and partition_on?

How can I make sure that I am able to retain the latest version of a row (based on unique constraints) with Dask using Parquet files and partition_on?
The most basic use case is that I want to query a database for all rows where updated_at > yesterday and partition the data based on the created_at_date (meaning that there can be multiple dates which have been updated, and these files already exist most likely).
└───year=2019
└───month=2019-01
2019-01-01.parquet
2019-01-02.parquet
So I want to be able to combine my new results from the latest query and the old results on disk, and then retain the latest version of each row (id column).
I currently have Airflow operators handling the following logic with Pandas and it achieves my goal. I was hoping to accomplish the same thing with Dask without so much custom code though:
Partition data based on specified columns and save files for each partition (a common example would be using the date or month column to create files 2019-01-01.parquet or 2019-12.parquet)
Example:
df_dict = {k: v for k, v in df.groupby(partition_columns)}
Loop through each partition and check if the file name exists. If there is already a file with the same name, read that file as a separate dataframe and concat the two dataframes
Example:
part = df_dict[partition]
part = pd.concat([part, existing], sort=False, ignore_index=True, axis='index')
Sort the dataframes and drop duplicates based on a list of specified columns (unique constraints sorted by file_modified_timestamp or updated_at columns typically to retain the latest version of each row)
Example:
part = part.sort_values([sort_columns], ascending=True).drop_duplicates(unique_constraints, keep='last')
The end result is that my partitioned file (2019-01-01.parquet) has now been updated with the latest values.
I can't think of a way to use the existing parquet methods of a dataframe to do what you are after, but assuming your dask dataframe is reasonably partitioned, you could do the exact same set of steps within a map_partitions call. This means you pass the constituent pandas dataframes to the function, which acts on them. So long as the data in each partition is non-overlapping, you will do ok.
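The per-partition step (combine old and new rows, sort by the version column, keep the last row per key) can be sketched in plain Python as follows. This is an illustration of the logic only, not Dask or pandas code; the `keep_latest` helper, the column names and the sample rows are assumptions made for the example.

```python
def keep_latest(existing_rows, new_rows, key="id", version="updated_at"):
    """Merge two lists of dict rows, keeping the row with the largest version per key."""
    latest = {}
    # sorted() is stable, so on equal versions a row from new_rows wins.
    for row in sorted(existing_rows + new_rows, key=lambda r: r[version]):
        latest[row[key]] = row  # later versions overwrite earlier ones
    return sorted(latest.values(), key=lambda r: r[key])


# Hypothetical contents of one partition file and of the latest query result.
existing = [
    {"id": 1, "updated_at": "2019-01-01", "value": "a"},
    {"id": 2, "updated_at": "2019-01-01", "value": "b"},
]
new = [
    {"id": 2, "updated_at": "2019-01-03", "value": "b2"},
    {"id": 3, "updated_at": "2019-01-03", "value": "c"},
]

merged = keep_latest(existing, new)
for row in merged:
    print(row)
```

The same function shape (rows of one partition in, deduplicated rows out) is what would be handed to map_partitions, provided no key appears in more than one partition.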

How to define partitions to Dataframe in pyspark?

Suppose I read a parquet file as a Dataframe in pyspark, how can I specify how many partitions it must be?
I read the parquet file like this -
df = sqlContext.read.format('parquet').load('/path/to/file')
How may I specify the number of partitions to be used?

Parallelize pyspark 2.2.0 dataframe partitioned write to S3

I am starting to work with pyspark and have run into a bottleneck that I created with my own code:
I'm "grouping by" pyspark 2.2.0 dataframe into partitions by drive_id
and writing each partition (group) into its own location on S3.
I need it to define Athena table on S3 location partitioned by drive_id - this allows me to read data very efficiently if queried by drive_id.
#df is spark dataframe
g=df.groupBy(df.drive_id)
rows=sorted(g.count().collect())
#each row is a partition
for row in rows:
    w=df.where((col("drive_id") == row.drive_id))
    w.write.mode('append').parquet("s3n://s3bucket/parquet/drives/"+str(table)+"/drive_id="+str(row.drive_id) )
The problem is that the loop makes processing serial and writes drive partitions only one by one.
Obviously this doesn't scale well, because a single partition write task is quite small and parallelizing it alone doesn't gain much.
How do I replace the loop with a single write command that writes all partitions to their different locations in a single operation?
This operation should parallelize to run on spark workers, not driver.
I figured out the answer, and it is surprisingly simple.
dataframe.write.parquet has an optional parameter partitionBy(names_of_partitioning_columns).
So there is no need for the "group by" and no need for the loop.
Using the single line:
df.write.partitionBy("drive_id").parquet("s3n://s3bucket/dir")
creates partitions in the standard Hive format "s3n://s3bucket/dir/drive_id=123"
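For readers unfamiliar with that layout, here is a small plain-Python sketch of how rows map to Hive-style partition directories. It only mimics the directory naming. The `hive_partition_paths` helper and the sample rows are hypothetical, and nothing is actually written to S3.

```python
from collections import defaultdict


def hive_partition_paths(rows, base_path, partition_column):
    """Group rows by partition value and return {directory: rows stored there}."""
    layout = defaultdict(list)
    for row in rows:
        directory = f"{base_path}/{partition_column}={row[partition_column]}"
        # The partition value is encoded in the path, so it is dropped from the stored row.
        stored = {k: v for k, v in row.items() if k != partition_column}
        layout[directory].append(stored)
    return dict(layout)


# Hypothetical rows; only drive_id comes from the question.
rows = [
    {"drive_id": 123, "speed": 10},
    {"drive_id": 456, "speed": 20},
    {"drive_id": 123, "speed": 30},
]

layout = hive_partition_paths(rows, "s3n://s3bucket/dir", "drive_id")
for directory, stored_rows in layout.items():
    print(directory, stored_rows)
```

A query that filters on drive_id then only needs to look at the matching directory, which is what makes the partitioned table efficient to read.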

How do I use a compressed columnar store in Spark SQL?

Objective:
I'd like to use Spark on a sparse dataset. I understand that SparkSQL now supports columnar data stores (I believe via SchemaRDD). I've been told that compression of the columnar store is implemented but currently turned off by default.
How can I make sure that Spark stores my dataset as a compressed, in-memory, columnar store?
What I've Tried:
At the Spark Summit, someone told me that I have to turn on compression as follows:
conf.set("spark.sql.inMemoryStorage.compressed", "true")
However, doing so doesn't seem to make any difference in my memory footprint.
The following are snippets of my test code:
case class Record(i: Int, j: Int)
...
val conf = new SparkConf().setAppName("Simple Application")
conf.set("spark.sql.inMemoryStorage.compressed", "true")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val records = // create an RDD of 1M Records
val table = createSchemaRDD(records)
table.cache
In one case, I create records so that all the values of i and j are unique. In this case, I see that 89.4MB are used.
In a second case, I create records so that most of the values of i and j are 0. (Roughly 99.9% of the entries are 0). In this case, I see that 43.0MB are used.
I expected a much higher compression ratio. Is there something I should do differently?
Thanks for the help.
The setting you want to use in Spark 1.0.2 is:
spark.sql.inMemoryColumnarStorage.compressed
Just set it to "true". I do it in my conf/spark-defaults.conf.
Just verified that this yields smaller memory footprint.
sqlContext.cacheTable is needed. .cache will not cache the table with the in-memory columnar store.
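To see why a mostly-zero column should shrink a lot once columnar compression is actually in effect, here is a plain-Python sketch of run-length encoding. It is used purely as an illustration of one scheme a columnar store can apply. The thread does not say which encoding Spark picks for a given column, and the sample columns are made up.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical columns: one with all-unique values, one that is almost all zeros.
unique_column = list(range(1000))
sparse_column = [0] * 500 + [7] + [0] * 499

unique_encoded = run_length_encode(unique_column)
sparse_encoded = run_length_encode(sparse_column)

print(len(unique_encoded), "runs for the unique column")
print(len(sparse_encoded), "runs for the sparse column")
```

The unique column gains nothing from this scheme, while the sparse one collapses to a handful of runs, so a footprint that only halves suggests compression was not being applied to the cached data.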