Where does DataFrameWriter.bucketBy() store the data? - apache-spark-sql

I'm trying to use the DataFrameWriter.bucketBy() method to bucket the output by the given columns. But where exactly will the output data be stored?
Is it stored in memory, or is it possible to store it in the file system?
Code:
>>> (df.write.format('parquet')
... .bucketBy(100, 'year', 'month')
... .mode("overwrite")
... .saveAsTable('bucketed_table'))

saveAsTable always writes the DataFrame to the file system as a table under the warehouse location (HDFS on a typical cluster); it is not kept in memory.
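A quick way to see where the table landed (a minimal sketch, assuming an active SparkSession named spark and the bucketed_table written above): the files live under the warehouse directory on the configured file system, e.g. HDFS on a cluster or a local spark-warehouse folder in local mode.
>>> spark.conf.get('spark.sql.warehouse.dir')  # warehouse root used for managed tables
>>> spark.sql('DESCRIBE FORMATTED bucketed_table').show(truncate=False)  # the Location row shows the storage path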

Related

Concat Pandas DF with CSV File

I want to concat 2 data-frames into one df and save it as one CSV, considering that the first dataframe is in a CSV file and huge, so I don't want to load it into memory. I tried df.to_csv with append mode, but it doesn't behave like pd.concat with regard to differing columns (comparing and combining columns). Does anyone know how to concat a CSV and a df? Basically the CSV and the df can have different columns, so the output CSV should have only one header along with all columns and the proper respective rows.
You can use Dask DataFrame to do this operation lazily. It will still load your data into memory, but only in small chunks at a time. Make sure to keep the partition size (blocksize) reasonable, based on your overall memory capacity.
import dask.dataframe as dd
ddf1 = dd.read_csv("data1.csv", blocksize=25e6)
ddf2 = dd.read_csv("data2.csv", blocksize=25e6)
new_ddf = dd.concat([ddf1, ddf2])
new_ddf.to_csv("combined_data.csv", single_file=True)  # single_file=True writes one CSV with a single header
API docs: read_csv, concat, to_csv

Convert multiple CSVs to single partitioned parquet dataset

I have a set of CSV files, one per year of data, with a YEAR column in each. I want to convert them into a single parquet dataset, partitioned by year, for later use in pandas. The problem is that the dataframe with all years combined is too large to fit in memory. Is it possible to write parquet partitions iteratively, one by one?
I am using fastparquet as the engine.
Simplified code example. This code blows up memory usage and crashes.
import pandas as pd

df = []
for year in range(2000, 2020):
    df.append(pd.read_csv(f'{year}.csv'))
df = pd.concat(df)
df.to_parquet('all_years.pq', partition_cols=['YEAR'])
I tried to write years one by one, like so.
for year in range(2000, 2020):
    df = pd.read_csv(f'{year}.csv')
    df.to_parquet('all_years.pq', partition_cols=['YEAR'])
The data files are all there in their respective YEAR=XXXX directories, but when I try to read such a dataset, I only get the last year. Maybe it is possible to fix the parquet metadata after writing separate partitions?
I think I found a way to do it using the fastparquet.writer.merge() function. Parquet files are written one by one for each year, leaving out the YEAR column and giving them appropriate names, and then the merge() function creates the top-level _metadata file.
The code below is a gist, as I leave out many details from my concrete use case.
import pandas as pd
import fastparquet

years = range(2000, 2020)
for year in years:
    df = pd.read_csv(f'{year}.csv').drop(columns=['YEAR'])
    df.to_parquet(f'all_years.pq/YEAR={year}')
fastparquet.writer.merge([f'all_years.pq/YEAR={y}' for y in years])

df_all = pd.read_parquet('all_years.pq')

How to write filenames based on a dask dataframe column?

I have a dask dataframe that I would like to save to S3. Each row in the dataframe has a "timestamp" column. I would like to partition the paths in S3 based on the dates in that timestamp column, so the output in S3 looks like this:
s3://....BUCKET_NAME/data/date=2019-01-01/part1.json.gz
s3://....BUCKET_NAME/data/date=2019-01-01/part2.json.gz
...
...
s3://....BUCKET_NAME/data/date=2019-05-01/part1.json.gz
Is this possible in dask? I can only find the name_function argument for the output, which expects an integer as input, and setting the column as an index doesn't add the index to the output filenames.
It's actually easy to achieve with partition_on, as long as you are happy to save it as parquet. You should rename your folder from data to data.parquet if you want to read it back with dask.
df.to_parquet("s3://BUCKET_NAME/data.parquet/", partition_on=["timestamp"])
Not sure if it's the only or optimal way, but you should be able to do it with groupby-apply, as in:
df.groupby('timestamp').apply(write_partition)
where write_partition is a function that takes a Pandas dataframe for a single timestamp and writes it to S3. Make sure you check the docs of apply as there are some gotchas (providing meta, full shuffling if the groupby column is not in the index, function called once per partition-group pair instead of once per group).
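A hedged sketch of that groupby-apply route, grouping on a derived date column so each group maps to one date directory (the bucket name and part file name are placeholders, write_partition is a hypothetical helper, df is the dask dataframe from the question, and writing to s3:// paths assumes s3fs is installed):
def write_partition(pdf):
    # pdf is a pandas DataFrame holding the rows for one date value
    date = pdf['date'].iloc[0]
    pdf.drop(columns=['date']).to_json(
        f's3://BUCKET_NAME/data/date={date}/part1.json.gz',
        orient='records', lines=True, compression='gzip')
    return len(pdf)

# assumes 'timestamp' is a datetime column
df['date'] = df['timestamp'].dt.strftime('%Y-%m-%d')
# meta tells dask what apply returns; the groupby shuffles the data so each date ends up together
df.groupby('date').apply(write_partition, meta=(None, 'int64')).compute()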

Avoiding shuffle on GROUP BY in Spark SQL [duplicate]

I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles.
Context:
Running Spark 2.0.1 with a local SparkSession. I have a CSV dataset that I am saving as a parquet file on my disk like so:
val df0 = spark
  .read
  .format("csv")
  .option("header", true)
  .option("delimiter", ";")
  .option("inferSchema", false)
  .load("SomeFile.csv")
val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)
df.write
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .option("inferSchema", false)
  .save("SomeFile.parquet")
I am creating 42 partitions by column numerocarte. This should group multiple numerocarte values into the same partition. I don't want to use partitionBy("numerocarte") at write time because I don't want one partition per card; there would be millions of them.
After that in another script I read this SomeFile.parquet parquet file and do some operations on it. In particular I am running a window function on it where the partitioning is done on the same column that the parquet file was repartitioned by.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df2 = spark.read
  .format("parquet")
  .option("header", true)
  .option("inferSchema", false)
  .load("SomeFile.parquet")

val w = Window.partitionBy(col("numerocarte"))
  .orderBy(col("SomeColumn"))

df2.withColumn("NewColumnName", sum(col("dollars")).over(w))
After reading the data back I can see that the repartition worked as expected: DataFrame df2 has 42 partitions and each of them contains different cards.
Questions:
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
If it knows, then there will be no shuffle in the window function. True?
If it does not know, it will do a shuffle in the window function. True?
If it does not know, how do I tell Spark the data is already partitioned by the right column?
How can I check a partitioning key of DataFrame? Is there a command for this? I know how to check number of partitions but how to see partitioning key?
When I print the number of partitions after each step, I have 42 partitions after the read and 200 partitions after withColumn, which suggests that Spark repartitioned my DataFrame.
If I have two different tables repartitioned with the same column, would the join use that information?
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
It does not.
If it does not know, how do I tell Spark the data is already partitioned by the right column?
You don't. Just because you save data that has been shuffled does not mean it will be loaded with the same splits.
How can I check a partitioning key of DataFrame?
There is no partitioning key once you have loaded the data, but you can check queryExecution for the Partitioner.
In practice:
If you want to support efficient pushdowns on the key, use the partitionBy method of DataFrameWriter.
If you want limited support for join optimizations, use bucketBy with the metastore and persistent tables.
See How to define partitioning of DataFrame? for detailed examples.
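To check in practice whether the window will shuffle, you can inspect the physical plan for an Exchange node. A minimal PySpark sketch (the table name bucketed_cards is hypothetical; the columns are the ones from the question):
from pyspark.sql import Window
from pyspark.sql import functions as F

# hypothetical table, previously saved with df.write.bucketBy(42, 'numerocarte').saveAsTable('bucketed_cards')
df2 = spark.table('bucketed_cards')
w = Window.partitionBy('numerocarte').orderBy('SomeColumn')
df2.withColumn('NewColumnName', F.sum('dollars').over(w)).explain()
# if the bucketing metadata is used, the plan shows no Exchange (shuffle) before the Window node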
I am answering my own question for future reference, to record what worked.
Following the suggestion of @user8371915, bucketBy works!
I am saving my DataFrame df:
df.write
.bucketBy(250, "userid")
.saveAsTable("myNewTable")
Then when I need to load this table:
val df2 = spark.sql("SELECT * FROM myNewTable")
val w = Window.partitionBy("userid")
val df3 = df2.withColumn("newColumnName", sum(col("someColumn")).over(w))
df3.explain
I confirm that when I run window functions on df2 partitioned by userid there is no shuffle! Thanks @user8371915!
Some things I learned while investigating this:
myNewTable looks like a normal parquet file, but it is not. You could read it normally with spark.read.format("parquet").load("path/to/myNewTable"), but the DataFrame created this way will not keep the original partitioning! You must use a spark.sql select to get a correctly partitioned DataFrame.
You can look inside the table with spark.sql("describe formatted myNewTable").collect.foreach(println). This will tell you what columns were used for bucketing and how many buckets there are.
Window functions and joins that take advantage of partitioning often also require a sort. You can sort the data in your buckets at write time using .sortBy(), and the sort will also be preserved in the Hive table: df.write.bucketBy(250, "userid").sortBy("someColumnName").saveAsTable("myNewTable")
When working in local mode, the table myNewTable is saved to a spark-warehouse folder in my local Scala SBT project. When saving in cluster mode with Mesos via spark-submit, it is saved to the Hive warehouse. For me it was located in /user/hive/warehouse.
When doing spark-submit you need to add two options to your SparkSession: .config("hive.metastore.uris", "thrift://address-to-your-master:9083") and .enableHiveSupport(). Otherwise the Hive tables you created will not be visible (see the PySpark sketch after this list).
If you want to save your table to a specific database, run spark.sql("USE your_database") before bucketing.
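For reference, a PySpark sketch of the same flow with the metastore options from the list above wired in (the metastore address, database, column and table names are placeholders, as in the Scala snippets):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName('bucketed-write')
    .config('hive.metastore.uris', 'thrift://address-to-your-master:9083')  # placeholder address
    .enableHiveSupport()
    .getOrCreate())

spark.sql('USE your_database')  # optional: target a specific database first

(df.write
    .bucketBy(250, 'userid')
    .sortBy('someColumnName')
    .saveAsTable('myNewTable'))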
Update 05-02-2018
I encountered some problems with Spark bucketing and the creation of Hive tables. Please refer to the question, replies and comments in Why is Spark saveAsTable with bucketBy creating thousands of files?

How to define partitions for a DataFrame in pyspark?

Suppose I read a parquet file as a DataFrame in pyspark; how can I specify how many partitions it should have?
I read the parquet file like this -
df = sqlContext.read.format('parquet').load('/path/to/file')
How may I specify the number of partitions to be used?
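A minimal sketch of one common approach (not from the original thread): control the partition count after loading, with repartition() or coalesce():
df = sqlContext.read.format('parquet').load('/path/to/file')

# repartition() does a full shuffle into exactly N partitions;
# coalesce() only merges existing partitions (no shuffle) and can only reduce the count
df = df.repartition(10)
# df = df.coalesce(10)

print(df.rdd.getNumPartitions())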