Pyspark write a DataFrame to csv files in S3 with a custom name - amazon-s3

I am writing files to an S3 bucket with code such as the following:
df.write.format('csv').option('header','true').mode("append").save("s3://filepath")
This outputs to the S3 bucket as several files as desired, but each part has a long file name such as:
part-00019-tid-5505901395380134908-d8fa632e-bae4-4c7b-9f29-c34e9a344680-236-1-c000.csv
Is there a way to write this as a custom file name, preferably in the PySpark write function? Such as:
part-00019-my-output.csv

You can't do that with Spark alone. The long random suffix exists to guarantee uniqueness, so nothing gets overwritten when many executors write files to the same location at the same time.
You'd have to use the AWS SDK to rename those files afterwards.
P.S.: If you want a single CSV file, you can use coalesce, but the file name is still not deterministic.
df.coalesce(1).write.format('csv')...
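As a rough sketch of the rename step (the bucket name, prefix, and target file name below are placeholders), boto3 can copy the part file Spark produced to a predictable key and delete the original, since S3 has no native rename:
import boto3
s3 = boto3.client('s3')
bucket = 'my-bucket'      # placeholder bucket name
prefix = 'output/csv/'    # the directory Spark wrote to
# Assumes coalesce(1) was used, so there is a single part file to rename.
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', []):
    key = obj['Key']
    if key.split('/')[-1].startswith('part-') and key.endswith('.csv'):
        s3.copy_object(Bucket=bucket,
                       CopySource={'Bucket': bucket, 'Key': key},
                       Key=prefix + 'my-output.csv')
        s3.delete_object(Bucket=bucket, Key=key)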

Related

How to filter s3 path while reading data from s3 using pyspark

I have an S3 folder structure like this:
bucketname/20211127123456/.parquet files
bucketname/20211127456789/.parquet files
bucketname/20211126123455/.parquet files
bucketname/20211126746352/.parquet files
bucketname/20211124123455/.parquet files
bucketname/20211124746352/.parquet files
Basically for each day there are two folders and inside that I have multiple parquet files which I want to read.
Let's say I want to read all files from the folders for 27th and 26th Nov.
Right now I have a boto3 function that gives me a Python list of the complete S3 paths of all parquet files whose path contains 20211126 or 20211127, and I pass that list to spark.read. Is there a better way to achieve this?
Yes, you should be partitioning your data based on date. Then your Spark queries only need to include a date predicate, and only the files for that date are read.
Here's an example of how that works with Athena; it will work with Glue and Spark too.
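As a rough PySpark sketch (the partition column event_date and the bucket path are assumptions, not from the question), the data can be written partitioned by date and then read back with a filter that only scans the matching partition folders:
from pyspark.sql import functions as F
# Write partitioned by a date column (event_date is a hypothetical column name).
df.write.mode('append').partitionBy('event_date').parquet('s3://bucketname/data/')
# Reading with a filter on the partition column only scans the matching folders,
# e.g. s3://bucketname/data/event_date=20211126/ and .../event_date=20211127/
df_days = (spark.read.parquet('s3://bucketname/data/')
           .where(F.col('event_date').isin('20211126', '20211127')))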

Copy and Merge files to another S3 bucket

I have a source bucket where small 5KB JSON files will be inserted every second.
I want to use AWS Athena to query the files by using an AWS Glue Datasource and crawler.
For better query performance AWS Athena recommends larger file sizes.
So I want to copy the files from the source bucket to bucket2 and merge them.
I am planning to use S3 events to put a message in AWS SQS for each file created. A Lambda will then be invoked with a batch of x SQS messages, read the data in those files, combine them, and save the result to the destination bucket. bucket2 will then be the source for the AWS Glue crawler.
Will this be the best approach or am I missing something?
Instead of receiving a 5KB JSON file every second in Amazon S3, the best situation would be to receive this data via Amazon Kinesis Data Firehose, which can automatically combine data based on either size or time period. It would output fewer, larger files.
You could also achieve this with a slight change to your current setup:
When a file is uploaded to S3, trigger an AWS Lambda function
The Lambda function reads the file and sends it to Amazon Kinesis Data Firehose
Kinesis Firehose then batches the data by size or time
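A minimal sketch of that Lambda (the delivery stream name is a placeholder, and the event is assumed to be a standard S3 put notification):
import boto3
s3 = boto3.client('s3')
firehose = boto3.client('firehose')
def handler(event, context):
    # Triggered by an S3 put event; forward each new object to Kinesis Data Firehose.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        firehose.put_record(
            DeliveryStreamName='my-delivery-stream',   # placeholder stream name
            Record={'Data': body + b'\n'}
        )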
Alternatively, you could use Amazon Athena to read data from multiple S3 objects and output it into a new table that uses Snappy-compressed Parquet files. This file format is very efficient for querying. However, your issue is that the files arrive every second, so it is difficult to query the incoming files in batches (that is, to know which files have been loaded and which have not). A kludge could be a script that does the following:
Create an external table in Athena that points to a batching directory (eg batch/)
Create an external table in Athena that points to the final data (eg final/)
Have incoming files come into incoming/
At regular intervals, trigger a Lambda function that will list the objects in incoming/, copy them to batch/ and delete those source objects from incoming/ (any objects that arrive during this copy process will be left for the next batch)
In Athena, run INSERT INTO final SELECT * FROM batch
Delete the contents of the batch/ directory
This will append the data into the final table in Athena, in a format that is good for querying.
However, the Kinesis Firehose option is simpler, even if you need to trigger Lambda to send the files to the Firehose.
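A rough sketch of that batching script as a Lambda (bucket, prefixes, database, and query output location are all placeholders), using boto3 for the object moves and for submitting the Athena query:
import boto3
s3 = boto3.client('s3')
athena = boto3.client('athena')
bucket = 'my-bucket'   # placeholder
def batch_and_load(event, context):
    # Move everything currently in incoming/ to batch/ ...
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix='incoming/'):
        for obj in page.get('Contents', []):
            key = obj['Key']
            s3.copy_object(Bucket=bucket,
                           CopySource={'Bucket': bucket, 'Key': key},
                           Key='batch/' + key[len('incoming/'):])
            s3.delete_object(Bucket=bucket, Key=key)
    # ... then append it to the final table via Athena.
    athena.start_query_execution(
        QueryString='INSERT INTO final SELECT * FROM batch',
        QueryExecutionContext={'Database': 'my_database'},   # placeholder
        ResultConfiguration={'OutputLocation': 's3://%s/athena-results/' % bucket}
    )
    # Emptying batch/ should wait until the query has completed (omitted here).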
You can probably achieve that using Glue itself. Have a look here: https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.md
Here is what I think would be simpler:
Have an input folder input/ where the small 5KB/1KB files land, and a data/ folder that holds the merged JSON files with a maximum size of 200MB.
Have a Lambda that runs every minute, reads a set of files from input/, and appends them to the last file in data/ (e.g. implemented in Go or Java).
The Lambda (with max concurrency of 1) copies a set of 5KB files from input/ plus the current file from data/ into its /tmp directory, merges them, uploads the merged file back to data/, and deletes the processed files from input/.
Whenever the file size crosses 200MB, start a new file in data/.
The advantage here is that at any instant, the complete data set is simply the union of input/ and data/.
With a few tweaks here and there you can expose a view on top of the input/ and data/ folders that presents a de-duplicated snapshot of the final data.
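A rough sketch of that merge Lambda (shown in Python rather than Go/Java to match the rest of the thread; the bucket name, prefixes, and the assumption of newline-delimited JSON are all mine):
import boto3
s3 = boto3.client('s3')
bucket = 'my-bucket'                  # placeholder
MAX_SIZE = 200 * 1024 * 1024          # 200MB threshold for rolling to a new file
def merge_handler(event, context):
    incoming = [o for o in s3.list_objects_v2(Bucket=bucket, Prefix='input/').get('Contents', [])
                if o['Key'].endswith('.json')]
    existing = s3.list_objects_v2(Bucket=bucket, Prefix='data/').get('Contents', [])
    # Append to the most recent data/ file, or start a new one if it is too big.
    latest = max(existing, key=lambda o: o['Key']) if existing else None
    if latest and latest['Size'] < MAX_SIZE:
        merged = s3.get_object(Bucket=bucket, Key=latest['Key'])['Body'].read()
        target_key = latest['Key']
    else:
        merged = b''
        target_key = 'data/part-%s.json' % context.aws_request_id   # new rolling file
    for obj in incoming:
        body = s3.get_object(Bucket=bucket, Key=obj['Key'])['Body'].read()
        merged += body.rstrip(b'\n') + b'\n'
    s3.put_object(Bucket=bucket, Key=target_key, Body=merged)
    # Only remove the small files once the merged object has been written.
    for obj in incoming:
        s3.delete_object(Bucket=bucket, Key=obj['Key'])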

creating a single parquet file in s3 pyspark job

I have written a PySpark program that reads data from Cassandra and writes it to AWS S3. Before writing to S3 I have to do repartition(1) or coalesce(1), as this creates one single file; otherwise it creates multiple parquet files in S3.
Using repartition(1) or coalesce(1) has performance issues, and I feel creating one big partition is not a good option with huge data.
What are the ways to create one single file in S3 without compromising on performance?
coalesce(1) or repartition(1) will put all your data on 1 partition (with an extra shuffle step when you use repartition compared to coalesce). In that case, only 1 worker has to write all your data, which is the reason for the performance issues you already identified.
That is the only way you can use Spark to write 1 file on S3. Currently, there is no other way using just Spark.
Using Python (or Scala), you can do some other things. For example, you write all your files with Spark without changing the number of partitions, and then:
you acquire your files with Python
you concatenate your files into one
you upload that one file to S3.
It works well for CSV, not so well for non-sequential file types.
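A rough sketch of that post-processing step for CSV output (the bucket and prefix are placeholders; it assumes the part files were written with a header, and plain concatenation like this only suits text formats such as CSV):
import boto3
s3 = boto3.client('s3')
bucket = 'my-bucket'       # placeholder
prefix = 'output/csv/'     # the directory Spark wrote the part files to
parts = sorted(obj['Key'] for obj in
               s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', [])
               if obj['Key'].endswith('.csv'))
lines = []
for i, key in enumerate(parts):
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8').splitlines()
    lines.extend(body if i == 0 else body[1:])   # keep the header only once
s3.put_object(Bucket=bucket, Key=prefix + 'merged.csv',
              Body='\n'.join(lines).encode('utf-8'))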

save a csv file into s3 bucket from pyspark dataframe

I would like to save the content of a spark dataframe into a csv file in s3 bucket:
df_country.repartition(1).write.csv('s3n://bucket/test/csv/a',sep=",",header=True,mode='overwrite')
the problem is that it creates a file with a name like: part-00000-fc644e84-7579-48.
Is there any way to fix the name of this file, for example test.csv?
Thanks
Best
This is not possible, since every partition in the job will create its own file and must follow a strict naming convention to avoid conflicts. The recommended solution is to rename the file after it is created.
Also, if you know you are only writing one file per path (e.g. s3n://bucket/test/csv/a), then it doesn't really matter what the file is called; simply read in all the contents of that unique directory name.
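For example (a sketch using the path from the question), Spark reads every part file under the directory regardless of what it is called:
df_country = spark.read.csv('s3n://bucket/test/csv/a', sep=',', header=True)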
Sources:
1. Specifying the filename when saving a DataFrame as a CSV
2. Spark dataframe save in single file on hdfs location

Naming a Parquet File in Glue JOB

How to assign a predefined name to the parquet files in an AWS Glue job?
For example, after my job runs, parquet files get stored in a specific folder with names like:
part-00000-fc95461f-00da-437a-9396-93c7ea473720.snappy.parquet,
part-00000-tc95431f-00ds-437b-9396-93c7ea473720.snappy.parquet
I want the files to be stored with a predefined or structured name like:
part-00000-12Jan2018.snappy.parquet,
part-00000-13Jan2018.snappy.parquet
etc.
Due to the nature of how Spark works, we can't name the files to our liking at present.
An alternate approach is to rename the files as soon as they are written to S3 / the data lake.
I found these answers to be helpful.
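A hedged sketch of such a rename step, run after the job finishes (the bucket, prefix, and date format are placeholders; it uses the same copy-and-delete trick shown earlier, since S3 cannot rename in place):
import boto3
from datetime import datetime
s3 = boto3.client('s3')
bucket = 'my-bucket'      # placeholder
prefix = 'glue-output/'   # where the Glue job wrote the parquet files
stamp = datetime.utcnow().strftime('%d%b%Y')   # e.g. 12Jan2018
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', []):
    name = obj['Key'].split('/')[-1]
    if name.startswith('part-') and name.endswith('.parquet'):
        part_id = '-'.join(name.split('-')[:2])          # e.g. part-00000
        new_key = prefix + part_id + '-' + stamp + '.snappy.parquet'
        s3.copy_object(Bucket=bucket,
                       CopySource={'Bucket': bucket, 'Key': obj['Key']},
                       Key=new_key)
        s3.delete_object(Bucket=bucket, Key=obj['Key'])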