Spark DataFrame order preservation: does calling save on an orderBy DataFrame preserve ordering? - apache-spark-sql

I ran some test cases from a spark shell. The statements I executed were of the form:
read.orderBy($"p_int".asc).write.format("com.databricks.spark.csv").save("file:///tmp/output.txt")
The content in the output directory always seems to be sorted. However, I cannot find any documentation in Spark that relates to guarantees provided by the DataFrameWriter in terms of preserving partition order or row order.
The question is: can I always expect the data in the target file to be sorted? Please also add a link to the relevant documentation.

If you coalesce to 1 partition before saving, the output will be sorted. Be careful though: when reading the .csv back in Spark, if spark.default.parallelism in your Spark config is more than 1, the ordering will be lost.
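A minimal PySpark sketch of that approach (the dataframe name df and the paths are placeholders; p_int is the column from the question):
# sort globally, collapse to a single partition, then write one CSV part file
(df.orderBy("p_int")
   .coalesce(1)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv("file:///tmp/output_sorted"))

# caveat from the answer above: reading the CSV back may re-split it into
# several partitions, so the global ordering is no longer guaranteed
df_back = spark.read.option("header", "true").csv("file:///tmp/output_sorted")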

Related

How to overcome the 2GB limit for a single column value in Spark

I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, with the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example...
# set path to file
jsonFilePath = '/mnt/datalake/jsonfiles/filename.json'
# read file to dataframe
# entitySchema is a schema struct previously extracted from a sample file
rawdf = spark.read.option("multiline", "true").schema(entitySchema).format("json").load(jsonFilePath)
# rawdf contains a single row of file_name, timestamp_created, and obj_array
# obj_array is an array field containing the entire data payload (>2GB)
explodeddf = rawdf.selectExpr("file_name", "timestamp_created", "explode(obj_array) as data")
# this column explosion fails due to obj_array exceeding 2GB
When you hit limits like this you need to re-frame the problem. Spark is choking on 2 GB in a single column, and that's a pretty reasonable choke point. Why not write your own custom data reader that emits records in the way you deem reasonable? (Likely the best solution if you want to leave the files as-is.)
You could probably read all the records in with a simple text read and then "paint" in the columns afterwards. You could use SQL tricks with window functions/lag to try to expand and fill rows.
You could also do file-level cleaning/formatting to make the data more manageable for the out-of-the-box tools to work with, as sketched below.
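A rough sketch of that file-level pre-chunking idea in plain Python, assuming the top-level JSON object fits in memory on the machine doing the preprocessing (field names come from the question; the chunk size is arbitrary):
import json

# split the oversized obj_array into smaller JSON files that Spark can explode comfortably
def split_json_payload(in_path, out_dir, chunk_size=50000):
    with open(in_path) as f:
        doc = json.load(f)  # single object: file_name, timestamp_created, obj_array
    items = doc["obj_array"]
    for i in range(0, len(items), chunk_size):
        chunk = dict(doc, obj_array=items[i:i + chunk_size])
        with open(f"{out_dir}/part_{i // chunk_size:05d}.json", "w") as out:
            json.dump(chunk, out)

# afterwards, point spark.read.json at out_dir and explode(obj_array) per chunk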

Output Dataframe to CSV File using Repartition and Coalesce

Currently, I am working on a single-node Hadoop setup and I wrote a job to output a sorted dataframe with only one partition to a single CSV file. I discovered several different outcomes when using repartition in different ways.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks rather than overall.
Then I tried discarding the repartition call, but the output contained only a part of the records. I realized that without repartition Spark will output 200 CSV files instead of 1, even though I am working on a one-partition dataframe.
Thus, what I did next was place repartition(1), repartition(1, "column of partition"), and repartition(20) before orderBy. Yet the output remained the same, with 200 CSV files.
So I used the coalesce(1) function before orderBy, and the problem was fixed.
I do not understand why working with a single-partition dataframe requires repartition or coalesce, or how the aforesaid operations affect the output. Grateful if someone can elaborate a little.
Spark has relevant parameters here:
spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform operations like sort, as in your case, it triggers what is called a shuffle operation:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
That will split your dataframe into spark.sql.shuffle.partitions partitions.
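As a small illustration (df and the column name are placeholders), you can see the effect by checking the partition count after a sort:
# default is 200 shuffle partitions unless configured otherwise
print(spark.conf.get("spark.sql.shuffle.partitions"))

sorted_df = df.orderBy("some_column")    # orderBy triggers a shuffle
# at most spark.sql.shuffle.partitions partitions (can be fewer for small data)
print(sorted_df.rdd.getNumPartitions())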
I also struggled with the same problem as you do and did not find any elegant solution.
Spark generally doesn't have a great concept of ordered data, because all your data is split across multiple partitions. And every time you call an operation that requires a shuffle, your ordering will be changed.
For this reason, you’re better off only sorting your data in spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger.
As Miroslav points out, your data gets shuffled between partitions every time you trigger what's called a shuffle stage (things like grouping, join, or window operations).
You can set the number of shuffle partitions in the Spark config - the default is 200.
Calling repartition before a group-by operation is kind of pointless, because Spark needs to repartition your data again to execute the groupBy.
Coalesce operations sometimes get pushed into the shuffle stage by Spark, so maybe that's why it worked. Either that, or because you called it after the groupBy operation.
A good way to understand what’s going on with your query is to start using the spark UI - it’s normally available at http://localhost:4040
More info here https://spark.apache.org/docs/3.0.0-preview/web-ui.html
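For completeness, a sketch of the config-based route mentioned above (the value and path are only examples, not recommendations):
# lower the number of shuffle partitions for this session
spark.conf.set("spark.sql.shuffle.partitions", "1")

# with a single shuffle partition, the sort itself should produce one part file
df.orderBy("my_column").write.mode("overwrite").csv("/tmp/sorted_single_file")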

Pyspark - identical dataframe filter operation gives different output

I'm facing a particularly bizarre issue when firing filter queries on a Spark dataframe. Here's the filter command I'm trying to run (shared as a screenshot in the original post):
As you can see, I'm trying to run the same command multiple times. Each time it gives a different number of rows. It is actually meant to return 6 records, but it ends up showing a random number of records every time.
FYI, the underlying data source (from which I'm creating the dataframe) is an Avro file in a Hadoop data lake.
This query only gives me consistent results if I cache the dataframe. But this is not always possible for me because the dataframe might be very huge and hence would choke up memory resources if I cache it.
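A minimal sketch of that caching workaround (df and the filter condition are placeholders):
# materialize the dataframe once so repeated filters see the same snapshot
df.cache()
df.count()  # force the cache to be populated

# repeated runs of the same filter should now return a stable count
df.filter(df["some_column"] == "some_value").count()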
What might be the possible reasons for this random behavior? Any advice on how to fix it?
Many thanks :)

Add dataset parameters into column to use them in BigQuery later with DataPrep

I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and store the results in tables of Google BigQuery. The structure on GCS looks something like this:
//source/user/me/datasets/{month}/2017-01-31-file.csv
//source/user/me/datasets/{month}/2017-02-28-file.csv
//source/user/me/datasets/{month}/2017-03-31-file.csv
We can create a dataset with parameters as outlined on this page. This all works fine and I have been able to import it properly.
However, in this BigQuery table (the output), I have no means of extracting only the rows that contain, for instance, a particular month parameter.
How could I therefore add these Dataset Parameters (here: {month}) into my BigQuery table using DataPrep?
While the original answers were true at the time of posting, there was an update rolled out last week that added a number of features not specifically addressed in the release notes—including another solution for this question.
In addition to SOURCEROWNUMBER() (which can now also be expressed as $sourcerownumber), there's now also a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage.
There are a number of caveats here, such as it not returning a value for BigQuery sources and not being available if you pivot, join, or unnest... but in your scenario, you could easily bring it into a column and do any needed matching or dropping using it.
NOTE: If your data source sample was created before this feature, you'll need to create a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here:
https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
There is currently no access to the data source location or parameter match values within the flow. Only the data in the dataset is available to you (except SOURCEROWNUMBER()).
Partial Solution
One method I have been using to mimic parameter insertion into the eventual table is to have multiple dataset imports, one per parameter, and then union these before running your transformations into a final table.
For each known parameter, have a recipe that fills a column with that parameter for its dataset, and then union the results of each of these.
Obviously, this is only so scalable, i.e. it works if you know the set of parameter values that will match. Once you get to the granularity of a timestamp in the source file, there is no way this is feasible.
In this example just the year value is the filtered parameter.
Longer Solution (An aside)
The alternative I eventually skated to was to define the Dataflow jobs using Dataprep, use these as Dataflow templates, and then run an orchestration function that ran the Dataflow job (not Dataprep) and amended the input AND output parameters via the API. Then a BigQuery transformation job did the final roll-up append.
Worth it if the flow is pretty settled, but not for ad hoc work; it all depends on your scale.

How to preserve Google Cloud Storage rows order in compressed files

We've created a query in BigQuery that returns SKUs and correlations between them. Something like:
sku_0,sku_1,0.023
sku_0,sku_2,0.482
sku_0,sku_3,0.328
sku_1,sku_0,0.023
sku_1,sku_2,0.848
sku_1,sku_3,0.736
The result has millions of rows and we export it to Google Cloud Storage which results in several compressed files.
These files are downloaded and we have a Python application that loops through them to make some calculations using the correlations.
We then tried to make use of the fact that our first column of SKUs is already ordered, so that we would not have to apply this ordering inside our application.
But we found that the files we get from GCS change the order in which the SKUs appear.
It looks like the files are created by several processes reading the results and saving them in different files, which breaks the ordering we wanted to maintain.
As an example, if we have 2 files created, the first file would look something like that:
sku_0,sku_1,0.023
sku_0,sku_3,0.328
sku_1,sku_2,0.0848
And the second file:
sku_0,sku_2,0.482
sku_1,sku_0,0.328
sku_1,sku_3,0.736
This is an example of what it looks like when two processes read the results and each one saves its current row to a separate file, which breaks the ordering of the column.
So we looked for some flag that we could use to force the preservation of the ordering, but couldn't find any so far.
Is there some way we could force the order in these GCS files to be preserved? Or is there some workaround?
Thanks in advance,
As far as I know there is no flag to maintain order.
As a workaround you can rethink your data output to use the NESTED type, make sure that the rows you want to group together are converted into NESTED records, and export to JSON.
is there some workaround?
As an option, you can move your processing logic from Python to BigQuery, thus eliminating the need to move data out of BigQuery to GCS.