Saving single parquet file on spark at only one partition while using multi partitions - dataframe

I'm trying to save my dataframe as a single Parquet file instead of multiple part files (I'm not sure of the exact term for this, pardon me - English isn't my first language).
Here is how I made my df:
val df = spark.read.format("jdbc").option("url", "jdbc:postgresql://ipaddr:port/address").option("dbtable", "tablename").load()
and I have 2 IP addresses where I run the master and worker servers.
For example:
ip: 10.10.10.1 runs 1 master server
ip: 10.10.10.2 runs 2 worker servers (and these workers work for ip1)
and I'm trying to save my df as a Parquet file only on the master server (ip1).
Normally I would save the file with (I use Spark on the master server using Scala):
df.write.parquet("originals.parquet")
then the df gets saved as Parquet on both servers (obviously, since it's Spark).
But now I'm trying to save the df as one single file, while keeping the multi-process execution for better speed, yet saving the file on only one of the servers.
so I've tried using
df.coalesce(1).write.parquet("test.parquet")
and
df.coalesce(1).write.format("parquet").mode("append").save("testestest.parquet")
but the result for both is the same as the original write.parquet: the df gets saved on both servers.
I guess it's because of my lack of understanding of Spark and of how coalesce works.
I was told that using coalesce or repartition would let me save the file on only one server, but I want to know why it is still being saved on both servers. Is this how it's supposed to work, and did I misunderstand while studying the use of coalesce? Or is it the way I wrote the Scala query that keeps coalesce from working effectively?
I've also found that pandas can save the df into one file, but my df is very, very large and I also want a fast result, so I don't think I should use pandas for a df this big (correct me if I'm wrong, please).
Also, I don't quite understand how people explain that 'repartition' and 'coalesce' are different because 'coalesce' minimizes the movement of the data; can somebody explain it to me in an easier way, please?
To summarize: why is my use of coalesce / repartition to save a Parquet file onto only one partition not working? (Is saving the file on one partition not possible at all? I just realized that maybe using coalesce/repartition saves 1 Parquet file in EACH partition and NOT on ONE partition as I want.)

Related

How do I use awswrangler to read only the first few N rows of a parquet file stored in S3?

I am trying to use awswrangler to read an arbitrarily large Parquet file stored in S3 into a pandas dataframe, but limiting my query to the first N rows due to the file's size (and my poor bandwidth).
I cannot see how to do it, or whether it is even possible, without relocating the file.
Could I use chunked=INTEGER and abort after reading the first chunk, say? If so, how?
I have come across this incomplete solution (for the last N rows ;) ) using pyarrow - Read last N rows of S3 parquet table - but a time-based filter would not be ideal for me, and the accepted solution doesn't even get to the end of the story (helpful as it is).
Or is there another way that doesn't involve first downloading the file (which I could probably have done by now)?
Thanks!
You can do that with awswrangler using S3 Select. For example:
import awswrangler as wr

df = wr.s3.select_query(
    sql="SELECT * FROM s3object s limit 5",
    path="s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)
would return only 5 rows from the S3 object.
This is not possible with other read methods, because those pull the entire object locally before reading it. With S3 Select, the filtering is done on the server side instead.

Output Dataframe to CSV File using Repartition and Coalesce

Currently, I am working on a single-node Hadoop setup, and I wrote a job to output a sorted dataframe with only one partition to one single CSV file. I discovered several different outcomes when using repartition in different ways.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks instead of in an overall manner.
Then, I tried discarding the repartition call, but the output was only a part of the records. I realized that without repartition, Spark will output 200 CSV files instead of 1, even though I am working on a one-partition dataframe.
So what I did next was place repartition(1), repartition(1, "column of partition"), and repartition(20) before orderBy. Yet the output remained the same, with 200 CSV files.
Then I used coalesce(1) before orderBy, and the problem was fixed.
I do not understand why working on a single-partition dataframe requires repartition or coalesce, or how the aforesaid processes affect the output. I'd be grateful if someone could elaborate a little.
Spark has two relevant parameters here: spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform an operation like sort, as in your case, it triggers what is called a shuffle operation:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
That will split your dataframe into spark.sql.shuffle.partitions partitions.
I also struggled with the same problem as you did and did not find any elegant solution.
Spark generally doesn't have a great concept of ordered data, because all your data is split across multiple partitions. And every time you call an operation that requires a shuffle, your ordering will be changed.
For this reason, you’re better off only sorting your data in spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger.
As Miroslav points out, your data gets shuffled between partitions every time you trigger what's called a shuffle stage (things like grouping, join, or window operations).
You can set the number of shuffle partitions in the Spark config - the default is 200.
Calling repartition before a groupBy operation is kind of pointless, because Spark needs to repartition your data again to execute the groupBy.
Coalesce operations sometimes get pushed into the shuffle stage by Spark, so maybe that's why it worked. Either that, or because you called it after the groupBy operation.
A good way to understand what's going on with your query is to use the Spark UI - it's normally available at http://localhost:4040
More info here: https://spark.apache.org/docs/3.0.0-preview/web-ui.html

Updating Parquet datasets where the schema changes over time

I have a single Parquet file that I have been building incrementally every day for several months. The file is around 1.1 GB now, and when read into memory it approaches my PC's memory limit. So I would like to split it up into several files, based on the year and month combination (i.e. Data_YYYYMM.parquet.snappy), that will all be in one directory.
My current process reads in the daily CSV that I need to append, reads in the historical Parquet file with pyarrow and converts it to pandas, concats the new and historical data in pandas (pd.concat([df_daily_csv, df_historical_parquet])), and then writes back to a single Parquet file. Every few weeks the schema of the data can change (i.e. a new column). With my current method this is not an issue, since the concat in pandas can handle the different schemas and I overwrite it each time.
By switching to this new setup, I am worried about having inconsistent schemas between months and then being unable to read in data over multiple months. I have tried this already and got errors due to non-matching schemas. I thought I might be able to specify this with the schema parameter in pyarrow.parquet.Dataset. From the docs it looks like it takes a pyarrow.parquet.Schema. When I try using this I get AttributeError: module 'pyarrow.parquet' has no attribute 'Schema'. I also tried taking the schema of a pyarrow Table (table.schema) and passing that to the schema parameter, but got an error message (sorry, I forget the exact error right now and can't connect to my workstation to reproduce it - I will update with this info when I can).
I've seen some mention of schema normalization in the context of the broader Arrow/Datasets project, but I'm not sure whether my use case fits what that covers; also, the Datasets feature is experimental, so I don't want to use it in production.
I feel like this is a pretty common use case, and I wonder if I am missing something or if Parquet isn't meant for schemas that change over time like mine does. I've considered inspecting the schema of the new file, comparing it against the historical one, and, if there is a change, deserializing, updating the schema, and reserializing every file in the dataset, but I'm really hoping to avoid that.
So my questions are:
Will using a pyarrow Parquet Dataset (or something else in the pyarrow API) allow me to read in all of the data from multiple Parquet files even if the schemas differ? To be specific, my expectation is that the new column would be appended and the values from before this column was available would be null. If so, how do you do this?
If the answer to 1 is no, is there another method or library for handling this?
Some resources I've been going through:
https://arrow.apache.org/docs/python/dataset.html
https://issues.apache.org/jira/browse/ARROW-2659
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset

pandas to_parquet: cleaning up an existing directory before writing

I would like to understand how to write afresh into an existing parquet store.
I am currently writing a pandas dataframe to a parquet directory as follows:
df = pandas.DataFrame({...})
df.to_parquet('/datastore/data1/', engine='pyarrow', partition_cols=['date'])
However, if I read this data back, add a few columns, and write it again, it gets written into a new file in the same sub-directories (i.e. /datastore/data1/date1/).
How can I delete the original data before writing into it? (Or should I just delete the whole directory structure prior to writing?) I would like to think there is a simpler way of doing this, rather than remembering to call a remove before every to_parquet.
Any suggestions would be helpful. Thanks!

How to preserve Google Cloud Storage rows order in compressed files

We've created a query in BigQuery that returns SKUs and correlations between them. Something like:
sku_0,sku_1,0.023
sku_0,sku_2,0.482
sku_0,sku_3,0.328
sku_1,sku_0,0.023
sku_1,sku_2,0.848
sku_1,sku_3,0.736
The result has millions of rows and we export it to Google Cloud Storage which results in several compressed files.
These files are downloaded and we have a Python application that loops through them to make some calculations using the correlations.
We then tried to make use of the fact that our first column of SKUs is already ordered, so we wouldn't have to apply this ordering inside our application.
But then we found that the files we get from GCS change the order in which the SKUs appear.
It looks like the files are created by several processes reading the results and saving them in different files, which breaks the ordering we wanted to maintain.
As an example, if we have 2 files created, the first file would look something like that:
sku_0,sku_1,0.023
sku_0,sku_3,0.328
sku_1,sku_2,0.848
And the second file:
sku_0,sku_2,0.482
sku_1,sku_0,0.023
sku_1,sku_3,0.736
This is an example of what it looks like when two processes read the results and each one saves its current row to a separate file, which changes the relative order of the rows.
So we looked for some flag that we could use to force the preservation of the ordering but couldn't find any so far.
Is there some way we could use to force the order in these GCS files to be preserved? Or is there some workaround?
Thanks in advance,
As far as I know, there is no flag to maintain order.
As a workaround you can rethink your data output to use the NESTED type: make sure the rows you want to group together are converted into NESTED rows, and then export to JSON.
is there some workaround?
As an option, you can move your processing logic from Python to BigQuery, thus eliminating the need to move data out of BigQuery to GCS.
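Another client-side workaround (my own suggestion, assuming each exported shard is itself sorted, as in the example above): merge the sorted shards back together while streaming with Python's `heapq.merge`, instead of re-sorting everything. The shard contents below mirror the two example files from the question.

```python
import heapq

# Each GCS export shard is internally sorted by the SKU columns,
# so a streaming k-way merge restores the global order without a full re-sort.
shard_a = ["sku_0,sku_1,0.023", "sku_0,sku_3,0.328", "sku_1,sku_2,0.848"]
shard_b = ["sku_0,sku_2,0.482", "sku_1,sku_0,0.023", "sku_1,sku_3,0.736"]

def sort_key(line: str):
    # Order by the two SKU columns, matching the query's ordering
    first, second, _ = line.split(",")
    return (first, second)

merged = list(heapq.merge(shard_a, shard_b, key=sort_key))
print(merged[0])  # sku_0,sku_1,0.023
```

In the real application the shard lists would be replaced by lazy line iterators over the downloaded files, so the merge never holds more than one line per shard in memory.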