I have a dask dataframe that I would like to save to S3. Each row in the dataframe has a "timestamp" column. I would like to partition the paths in S3 based on the dates in that timestamp column, so the output in S3 looks like this:
s3://....BUCKET_NAME/data/date=2019-01-01/part1.json.gz
s3://....BUCKET_NAME/data/date=2019-01-01/part2.json.gz
...
...
s3://....BUCKET_NAME/data/date=2019-05-01/part1.json.gz
Is this possible in dask? I can only find the name_function argument for the output files, which expects an integer as input, and setting the column as the index doesn't add the index to the output filenames.
It's actually easy to achieve, as long as you are happy to save it as Parquet, using partition_on. You should rename your folder from data to data.parquet if you want to read it back with dask.
df.to_parquet("s3://BUCKET_NAME/data.parquet/", partition_on=["timestamp"])
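Note that partition_on=["timestamp"] creates one folder per distinct timestamp value. If you want one folder per date, as in the layout above, one option is to derive a date column first. A minimal sketch, assuming the timestamp column has a datetime dtype (the output will be Parquet rather than json.gz):
import dask.dataframe as dd

# derive a string date column from the timestamp so each day gets its own folder
df["date"] = df["timestamp"].dt.date.astype(str)

# writes files under s3://BUCKET_NAME/data.parquet/date=2019-01-01/ and so on
df.to_parquet("s3://BUCKET_NAME/data.parquet/", partition_on=["date"])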
Not sure if it's the only or optimal way, but you should be able to do it with groupby-apply, as in:
df.groupby('timestamp').apply(write_partition)
where write_partition is a function that takes a Pandas dataframe for a single timestamp and writes it to S3. Make sure you check the docs of apply as there are some gotchas (providing meta, full shuffling if the groupby column is not in the index, function called once per partition-group pair instead of once per group).
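A rough sketch of what write_partition could look like, grouping by a derived date column rather than the raw timestamp so each day gets one folder (the function body, paths, and meta value are assumptions, not part of the original answer):
import uuid

def write_partition(pdf):
    # pdf is a pandas DataFrame holding the rows for a single date
    date = pdf["date"].iloc[0]
    # unique suffix, since apply may be called once per partition-group pair
    pdf.to_json(
        f"s3://BUCKET_NAME/data/date={date}/part-{uuid.uuid4().hex}.json.gz",
        orient="records", lines=True, compression="gzip",
    )
    return 0

# group by day rather than by exact timestamp
df["date"] = df["timestamp"].dt.date.astype(str)
df.groupby("date").apply(write_partition, meta=("written", "int64")).compute()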
Related
I want to save some Polars dataframes into one file, and later I want to read data back from that file, filtered by a timestamp (datetime) column. I don't need to load the whole file into memory, only the filtered part.
I see that the Polars API lists Feather/IPC and Parquet as file formats that can do this in theory, but I don't know how to read these files in Polars with a filter on the date.
With Pandas I previously used the HDF5 format and it was very straightforward, but I have no experience with these formats that are new to me. Maybe you can help me do this in the most effective way.
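For reference, a minimal sketch of a filtered Parquet read in Polars using a lazy scan, which only loads the data matching the predicate (file and column names are illustrative):
from datetime import datetime
import polars as pl

# lazy scan: the filter is pushed down, so only matching data is read from disk
lf = pl.scan_parquet("frames.parquet")
df = lf.filter(pl.col("timestamp") >= datetime(2019, 1, 1)).collect()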
I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, with the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example...
# set path to file
jsonFilePath = '/mnt/datalake/jsonfiles/filename.json'
# read file to dataframe
# entitySchema is a schema struct previously extracted from a sample file
rawdf = spark.read.option("multiline", "true").schema(entitySchema).format("json").load(jsonFilePath)
# rawdf contains a single row of file_name, timestamp_created, and obj_array
# obj_array is an array field containing the entire data payload (>2GB)
explodeddf = rawdf.selectExpr("file_name", "timestamp_created", "explode(obj_array) as data")
# this column explosion fails due to obj_array exceeding 2GB
When you hit limits like this, you need to reframe the problem. Spark is choking on 2 GB in a single column, and that's a pretty reasonable choke point. Why not write your own custom data reader (a presentation layer) that emits records in the way that you deem reasonable? (Likely the best solution if you want to leave the files as they are.)
You could probably read all the records in with a simple text read and then "paint" in the columns afterwards. You could use SQL tricks with window/lag functions to try to expand and fill the rows.
You could also do file-level cleaning/formatting to make the data more manageable for the out-of-the-box tools to work with, as in the sketch below.
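For instance, a rough sketch of the file-level splitting approach, assuming the file is one JSON object whose obj_array key holds the big array and that it fits in local memory (a streaming parser such as ijson could avoid that); the helper name and paths are hypothetical:
import json

def split_payload(in_path, out_dir, chunk_size=50_000):
    # load the single large JSON document once
    with open(in_path) as f:
        doc = json.load(f)

    records = doc["obj_array"]  # the oversized array from the question

    # write JSON Lines chunks that Spark can read directly, no explode() needed
    for i in range(0, len(records), chunk_size):
        out_path = f"{out_dir}/part-{i // chunk_size:05d}.jsonl"
        with open(out_path, "w") as out:
            for rec in records[i:i + chunk_size]:
                out.write(json.dumps(rec) + "\n")

split_payload("/dbfs/mnt/datalake/jsonfiles/filename.json", "/dbfs/mnt/datalake/jsonfiles/split")
# afterwards, spark.read.json(".../split") yields one row per object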
Currently, I am working on a single-node Hadoop setup, and I wrote a job to output a sorted dataframe with only one partition to a single CSV file. I discovered several different outcomes when using repartition in different ways.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks rather than globally.
Then I tried discarding the repartition call, but the output was only a part of the records. I realized that without repartition, Spark outputs 200 CSV files instead of 1, even though I am working on a one-partition dataframe.
So what I did next was to place repartition(1), repartition(1, "column of partition"), and repartition(20) before orderBy. Yet the output remained the same: 200 CSV files.
Then I used coalesce(1) before orderBy, and the problem was fixed.
I do not understand why a dataframe with a single partition still needs repartition or coalesce, and how the aforementioned operations affect the output. I would be grateful if someone could elaborate a little.
Spark has two relevant parameters here: spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform an operation like the sort in your case, it triggers what is called a shuffle operation:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
That shuffle will split your dataframe into spark.sql.shuffle.partitions partitions.
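Concretely, you can inspect and lower that setting before the sort (the column and output path here are just placeholders):
# default is 200, which is why a sort produces 200 output files
print(spark.conf.get("spark.sql.shuffle.partitions"))

# make the shuffle triggered by orderBy produce a single partition
spark.conf.set("spark.sql.shuffle.partitions", "1")

df.orderBy("some_column").write.csv("/tmp/sorted_output", header=True)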
I also struggled with the same problem as you do and did not find any elegant solution.
Spark generally doesn’t have a great concept of ordered data, because all your data is split across multiple partitions. And every time you call an operation that requires a shuffle, your ordering will be changed.
For this reason, you’re better off only sorting your data in spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger.
As Miroslav points out, your data gets shuffled between partitions every time you trigger what’s called a shuffle stage (things like grouping, join, or window operations).
You can set the number of shuffle partitions in the Spark config - the default is 200.
Calling repartition before a group-by operation is kind of pointless, because Spark needs to repartition your data again to execute the groupBy.
Coalesce operations sometimes get pushed into the shuffle stage by Spark, so maybe that’s why it worked. Either that, or because you called it after the group-by operation.
A good way to understand what’s going on with your query is to start using the spark UI - it’s normally available at http://localhost:4040
More info here https://spark.apache.org/docs/3.0.0-preview/web-ui.html
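For reference, the usual pattern for getting a single sorted CSV is to coalesce after the sort rather than before it (a sketch with placeholder names); as noted above, funnelling everything through one task will break once the dataset gets large:
(
    df.orderBy("some_column")   # global sort, triggers a shuffle
      .coalesce(1)              # merge the sorted partitions into a single one
      .write.mode("overwrite")
      .csv("/tmp/single_sorted_csv", header=True)
)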
How can I make sure that I am able to retain the latest version of each row (based on unique constraints) with Dask, using Parquet files and partition_on?
The most basic use case is that I want to query a database for all rows where updated_at > yesterday and partition the data based on the created_at_date (meaning that there can be multiple dates which have been updated, and these files already exist most likely).
└───year=2019
    └───month=2019-01
            2019-01-01.parquet
            2019-01-02.parquet
So I want to be able to combine my new results from the latest query and the old results on disk, and then retain the latest version of each row (id column).
I currently have Airflow operators handling the following logic with Pandas and it achieves my goal. I was hoping to accomplish the same thing with Dask without so much custom code though:
Partition the data based on specified columns and save files for each partition (a common example would be using the date or month column to create files like 2019-01-01.parquet or 2019-12.parquet).
Example:
df_dict = {k: v for k, v in df.groupby(partition_columns)}
Loop through each partition and check if the file name exists. If there is already a file with the same name, read that file as a separate dataframe and concat the two dataframes
Example:
part = df_dict[partition]
part = pd.concat([part, existing], sort=False, ignore_index=True, axis='index')
Sort the dataframes and drop duplicates based on a list of specified columns (unique constraints sorted by file_modified_timestamp or updated_at columns typically to retain the latest version of each row)
Example:
part = part.sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraints, keep='last')
The end result is that my partitioned file (2019-01-01.parquet) has now been updated with the latest values.
I can't think of a way to use the existing parquet methods of a dataframe to do what you are after, but assuming your dask dataframe is reasonably partitioned, you could do the exact same set of steps within a map_partitions call. This means you pass the constituent pandas dataframes to the function, which acts on them. So long as the data in each partition is non-overlapping, you will do ok.
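A sketch of what that map_partitions call might look like, reusing the steps from the question; it assumes each Dask partition holds exactly one created_at_date, and the helper name, paths, and columns are illustrative:
import os
import pandas as pd

def upsert_partition(pdf, out_dir, unique_constraints, sort_columns):
    if pdf.empty:
        return pdf

    # one file per partition, named after that partition's date
    date = pdf["created_at_date"].iloc[0]
    path = os.path.join(out_dir, f"{date}.parquet")

    # merge with whatever already exists on disk for this date
    if os.path.exists(path):
        existing = pd.read_parquet(path)
        pdf = pd.concat([pdf, existing], sort=False, ignore_index=True)

    # keep only the latest version of each row
    pdf = (pdf.sort_values(sort_columns, ascending=True)
              .drop_duplicates(unique_constraints, keep="last"))
    pdf.to_parquet(path, index=False)
    return pdf

ddf.map_partitions(
    upsert_partition,
    out_dir="/data/partitions",
    unique_constraints=["id"],
    sort_columns=["updated_at"],
    meta=ddf._meta,
).compute()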
I have a pandas dataframe containing 100 million tweets.
I have extracted the URLs from the data and currently store them as a list in a pandas column:
[image: Dataframe]
I want to run analysis on these URLs (like sorting by domain name, finding out what type of user posted which domains).
Is it possible to store it like this:
[image: Custom]
where the URL column is a pandas Series of dynamic size so I can process it easily? Otherwise, what would be the best way to store the URLs for efficiency and speed when applying pandas operations?
Yes, if you concatenate the strings with \n, like 'url1\nurl2\nurl3'.
If you have a list of URLs, you can use join:
listurl = ['url1','url2','url3']
print('\n'.join(listurl))
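If you also want to run the analysis mentioned in the question (sorting by domain, which user posted which domains), another option is to keep the list column and explode it on demand; a sketch assuming columns named user and urls:
from urllib.parse import urlparse
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "b"],
    "urls": [["http://x.com/1", "http://y.com/2"], ["http://x.com/3"]],
})

# one row per URL; the user value is repeated for each of that user's URLs
exploded = df.explode("urls", ignore_index=True)
exploded["domain"] = exploded["urls"].map(lambda u: urlparse(u).netloc)

# e.g. how many links each user posted per domain
counts = exploded.groupby(["user", "domain"]).size()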