pandas to_parquet: cleaning up an existing directory before writing

I would like to understand how to write afresh into an existing parquet store.
I am currently writing a pandas dataframe to a parquet directory as follows:
import pandas

df = pandas.DataFrame({...})
df.to_parquet('/datastore/data1/', engine='pyarrow', partition_cols=['date'])
However, if I read this data back, add a few columns, and write it again, the new data gets written as additional files inside the same sub-directories (e.g. /datastore/data1/date1/).
How can I delete the original data before writing into it? (or should I just delete the whole directory structure prior to writing?). I would like to think there is a simpler way of doing this, rather than remembering to call a remove before every to_parquet.
Any suggestions would be helpful. Thanks!
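A minimal sketch of the "delete the whole directory structure prior to writing" approach mentioned above, wrapped in a small helper so the cleanup is not forgotten before each to_parquet call; the helper name is hypothetical, and the path and partition column are simply the ones from the question:

import shutil
from pathlib import Path

def overwrite_parquet(df, path, partition_cols):
    # hypothetical helper: wipe the existing dataset directory, then rewrite it
    target = Path(path)
    if target.exists():
        shutil.rmtree(target)  # removes the old files and partition sub-directories
    df.to_parquet(path, engine='pyarrow', partition_cols=partition_cols)

# overwrite_parquet(df, '/datastore/data1/', ['date'])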

Related

How to append data to a Parquet file when saving a dataframe from Polars

I have a Polars df and I want to save it to a Parquet file. And then the next df too, and the next.
The code df.write_parquet("path.parquet") only rewrites the file each time. How can I do this in Polars?
Polars does not support appending to Parquet files, and most tools do not; see for example this SO post.
Your best bet would be to cast the dataframe to an Arrow table using .to_arrow(), and use pyarrow.dataset.write_dataset. In particular, see the comment on the existing_data_behavior parameter. Still, that requires organizing your data in partitions, which effectively means you have a separate parquet file per partition, stored in the same directory. So each df you have becomes its own parquet file, and you abstract away from that on the read. Polars does not support writing partitions as far as I'm aware, though there is support for reading them; see the source argument in pl.read_parquet.
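A hedged sketch of that approach (directory name and data are placeholders): each frame is converted to Arrow and written as its own file into a shared dataset directory with write_dataset, using a unique basename_template so earlier files are kept, and everything is read back as one dataset afterwards:

import uuid
import polars as pl
import pyarrow.dataset as ds

df = pl.DataFrame({"batch": ["2024-01-01"], "value": [1.0]})  # placeholder frame

ds.write_dataset(
    df.to_arrow(),
    "datastore/appended",  # placeholder directory shared by all writes
    format="parquet",
    # unique file name per write, so previous files are left in place
    basename_template=f"part-{uuid.uuid4().hex}-{{i}}.parquet",
    # don't error just because the directory already contains data
    existing_data_behavior="overwrite_or_ignore",
)

combined = pl.read_parquet("datastore/appended/*.parquet")  # read everything back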

Saving a single parquet file in Spark on only one partition while using multiple partitions

I'm trying to save my dataframe as a single parquet file instead of multiple part files (I don't know the exact term for this, pardon my English).
here is how I made my df
val df = spark.read.format("jdbc").option("url","jdbc:postgresql://ipaddr:port/address").option("header","true").load()
and I have 2 IP addresses where I run different master / worker servers.
For example,
ip : 10.10.10.1 runs 1 master server
ip : 10.10.10.2 runs 2 worker servers (and these worker servers work for ip1)
and I'm trying to save my df as a parquet file only on the master server (ip1).
Normally I would save the file with (I use Spark on the master server using Scala)
df.write.parquet("originals.parquet")
then the df gets saved as parquet on both servers (obviously, since it's Spark)
but now I'm trying to save the df as one single file, while keeping the multi-server processing for better speed, yet saving the file on only one of the servers
so I've tried using
df.coalesce(1).write.parquet("test.parquet")
and
df.coalesce(1).write.format("parquet").mode("append").save("testestest.parquet")
but the result for both is the same as the original write.parquet, with the df being saved on both servers.
I guess it's because of my lack of understanding of how Spark and the coalesce function work.
I was told that using coalesce or partition would help me save the file on only one server, but I want to know why it is still being saved on both servers. Is that how it's supposed to work, and did I simply misunderstand the use of coalesce while studying it? Or is it the way I wrote the Scala query that made coalesce not work effectively?
I've also found that pandas is okay for saving a df into one file, but my df is very large and I also want a fast result, so I don't think I'm supposed to use pandas for big files like mine (correct me if I'm wrong, please).
Also, I don't quite understand how people explain that 'partition' and 'coalesce' are different because 'coalesce' minimizes the movement of the files; can somebody explain it to me in an easier way, please?
To summarize: why is my use of coalesce / partition to save a parquet file into only one partition not working? (Is saving the file on one partition not possible at all? I just realized that maybe using coalesce/partition saves 1 parquet file in EACH partition AND NOT in ONE partition as I want.)
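For the coalesce vs. partition part, here is a small PySpark sketch (the Scala API behaves the same way); it assumes an existing SparkSession and a DataFrame df, and the output paths are placeholders:

# coalesce(n) only merges existing partitions (no full shuffle), so
# coalesce(1) produces a single part file; it does not control which
# machine that file lands on - that depends on the filesystem the output
# path points to, since the file is still written by an executor.
df.coalesce(1).write.mode("overwrite").parquet("/shared/output/coalesced.parquet")

# repartition(n) performs a full shuffle and, unlike coalesce, can also
# increase the number of partitions; repartition(1) likewise yields one
# part file, at the cost of shuffling all the data first.
df.repartition(1).write.mode("overwrite").parquet("/shared/output/repartitioned.parquet")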

Updating Parquet datasets where the schema changes over time

I have a single parquet file that I have been building incrementally every day for several months. The file size is around 1.1GB now, and when read into memory it approaches my PC's memory limit. So I would like to split it up into several files based on the year-month combination (i.e. Data_YYYYMM.parquet.snappy) that will all be in one directory.
My current process reads in the daily csv that I need to append, reads in the historical parquet file with pyarrow and converts it to pandas, concats the new and historical data in pandas (pd.concat([df_daily_csv, df_historical_parquet])) and then writes back to a single parquet file. Every few weeks the schema of the data can change (i.e. a new column). With my current method this is not an issue, since the concat in pandas can handle the different schemas and I am overwriting the file each time.
By switching to this new setup I am worried about having inconsistent schemas between months and then being unable to read in data over multiple months. I have already tried this and gotten errors due to non-matching schemas. I thought I might be able to specify this with the schema parameter in pyarrow.parquet.Dataset. From the doc it looks like it takes a pyarrow.parquet.Schema. When I try using this I get AttributeError: module 'pyarrow.parquet' has no attribute 'Schema'. I also tried taking the schema of a pyarrow Table (table.schema) and passing that to the schema parameter, but got an error message (sorry, I forget the exact error right now and can't connect to my workstation to reproduce it - I will update with this info when I can).
I've seen some mention of schema normalization in the context of the broader Arrow/Datasets project, but I'm not sure if my use case fits what that covers, and the Datasets feature is experimental, so I don't want to use it in production.
I feel like this is a pretty common use case, and I wonder if I am missing something or if parquet isn't meant for schema changes over time like I'm experiencing. I've considered inspecting the schema of each new file, comparing it against the historical schema, and, if there is a change, deserializing, updating the schema, and reserializing every file in the dataset, but I'm really hoping to avoid that.
So my questions are:
Will using a pyarrow parquet Dataset (or something else in the pyarrow API) allow me to read in all of the data from multiple parquet files even if the schema is different? To be specific, my expectation is that the new column would be appended and the values from before this column was available would be null. If so, how do you do this?
If the answer to 1 is no, is there another method or library for handling this?
Some resources I've been going through.
https://arrow.apache.org/docs/python/dataset.html
https://issues.apache.org/jira/browse/ARROW-2659
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset
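Regarding question 1, a hedged pyarrow sketch (the directory name is a placeholder, and this relies on the Datasets API that the question is wary of): build one schema covering every column seen across the monthly files and scan with it, so columns missing from older files come back as nulls:

import pyarrow as pa
import pyarrow.dataset as ds

monthly = ds.dataset("datastore/monthly", format="parquet")  # placeholder directory

# collect the file-level schemas and merge them into one superset schema
unified = pa.unify_schemas([frag.physical_schema for frag in monthly.get_fragments()])

# re-open the dataset with the unified schema; files that lack a column
# yield nulls for it when the dataset is scanned
table = ds.dataset("datastore/monthly", format="parquet", schema=unified).to_table()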

How to preserve Google Cloud Storage rows order in compressed files

We've created a query in BigQuery that returns SKUs and correlations between them. Something like:
sku_0,sku_1,0.023
sku_0,sku_2,0.482
sku_0,sku_3,0.328
sku_1,sku_0,0.023
sku_1,sku_2,0.848
sku_1,sku_3,0.736
The result has millions of rows and we export it to Google Cloud Storage which results in several compressed files.
These files are downloaded and we have a Python application that loops through them to make some calculations using the correlations.
We then tried to make use of the fact that our first column of SKUs is already ordered, so we would not have to apply this ordering inside our application.
But then we found that the files we get from GCS change the order in which the SKUs appear.
It looks like the files are created by several processes reading the results and saving them in different files, which breaks the ordering we wanted to maintain.
As an example, if we have 2 files created, the first file would look something like that:
sku_0,sku_1,0.023
sku_0,sku_3,0.328
sku_1,sku_2,0.848
And the second file:
sku_0,sku_2,0.482
sku_1,sku_0,0.023
sku_1,sku_3,0.736
This is an example of what it looks like when two processes read the results and each one saves its current row to a separate file, which changes the ordering of the column.
So we looked for some flag that we could use to force the preservation of the ordering but couldn't find any so far.
Is there some way we could use to force the order in these GCS files to be preserved? Or is there some workaround?
Thanks in advance,
As far as I know, there is no flag to maintain order.
As a workaround, you can rethink your data output to use the NESTED type, make sure that the rows you want to group together are converted into NESTED rows, and export to JSON.
is there some workaround?
As an option - you can move your processing logic from Python to BigQuery, thus eliminating the need to move data out of BigQuery to GCS.
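If the NESTED-type workaround is an option, here is a hedged sketch of what the grouping query might look like, run from Python with the BigQuery client (the table and column names are placeholders); because each SKU's correlations travel inside a single nested row, their order survives the export:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT
  sku_0,
  ARRAY_AGG(STRUCT(sku_1, correlation) ORDER BY sku_1) AS correlations
FROM `project.dataset.sku_correlations`
GROUP BY sku_0
"""
job = client.query(sql)  # export the resulting table to GCS as JSON afterwards
job.result()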

ETL file loading: files created today, or files not already loaded?

I need to automate a process to load new data files into a database. My question is about the best way to determine which files are "new" in an automated fashion.
Files are retrieved from a directory that is synced nightly, so the list of files keeps growing. I don't have the option to wipe out files that I have already retrieved.
New records are stored in a raw data table that has a field indicating the filename where each record originated, so I could compare all filenames currently in the directory with filenames already in the raw data table, and process only those filenames that aren't in common.
Or I could use timestamps that are in the filenames, and process only those files that were created since the last time the import process was run.
I am leaning toward using the first approach since it seems less prone to error, but I haven't had much luck finding whether this is actually true. What are the pitfalls of determining new files in this manner, by comparing all filenames with the filenames already in the database?
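A minimal sketch of the first approach described above (comparing directory contents against filenames already recorded in the raw data table); the table, column, and connection details are placeholders:

import os
import sqlite3  # stand-in for whatever database holds the raw data table

def find_new_files(incoming_dir, conn):
    # filenames already loaded, according to the raw data table
    loaded = {row[0] for row in conn.execute("SELECT DISTINCT filename FROM raw_data")}
    # filenames currently present in the synced directory
    on_disk = set(os.listdir(incoming_dir))
    # only files on disk but not yet in the table need to be processed
    return sorted(on_disk - loaded)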
File name comparison:
If you have millions of files, then comparison might not be what you are looking for.
You must also be sure that the files in the said folder never get deleted.
Get filenames by date:
Since these files are retrieved once a day, the date can guarantee accuracy (even if they were created milliseconds apart).
This will be efficient if there are many files.
Note that Pentaho gives the modified date, not the created date.
To do either of the above, you can use the following Pentaho step.
Configuration of the Get File Names step:
File/Directory: give the folder path that contains the files.
Wildcard (RegExp): .*\.* to get all files, or .*\.pdf to get a specific format.