Save a CSV file into an S3 bucket from a PySpark DataFrame

I would like to save the contents of a Spark DataFrame as a CSV file in an S3 bucket:
df_country.repartition(1).write.csv('s3n://bucket/test/csv/a',sep=",",header=True,mode='overwrite')
The problem is that it creates a file with a name like part-00000-fc644e84-7579-48.
Is there any way to set the name of this file, for example test.csv?
Thanks
Best

This is not possible with the Spark writer itself: every partition in the job creates its own file, and the file names must follow a strict convention to avoid naming conflicts. The recommended solution is to rename the file after it is created.
Also, if you know you are only writing one file per path (e.g. s3n://bucket/test/csv/a), then it doesn't really matter what the file is called; simply read in all the contents of that unique directory.
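One way to do the rename after the write is to copy the part file to the desired key and then delete the original, since S3 has no native rename operation. The sketch below assumes boto3 is the AWS SDK in use. The function name rename_single_part_file and the example keys are made up for illustration, and the client is passed in as an argument so that any object offering the same three methods will work.

```python
def rename_single_part_file(s3_client, bucket, prefix, new_key):
    """Copy the only part-* object under `prefix` to `new_key`, then delete it.

    `s3_client` is expected to behave like boto3.client("s3").
    Returns the key of the original part file.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # Keep only Spark part files; this skips markers such as _SUCCESS.
    part_keys = [
        obj["Key"]
        for obj in response.get("Contents", [])
        if obj["Key"].rsplit("/", 1)[-1].startswith("part-")
    ]
    if len(part_keys) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_keys)}")

    old_key = part_keys[0]
    # S3 "rename" = copy to the new key, then delete the old key.
    s3_client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    s3_client.delete_object(Bucket=bucket, Key=old_key)
    return old_key


# Example usage (assumes boto3 is installed and credentials are configured):
# import boto3
# rename_single_part_file(boto3.client("s3"), "bucket", "test/csv/a/", "test/csv/test.csv")
```

This only makes sense after repartition(1) or coalesce(1), when exactly one part file exists under the prefix.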
Sources:
1. Specifying the filename when saving a DataFrame as a CSV
2. Spark dataframe save in single file on hdfs location

Related

Parquet file with more than one schema

I am used to Parquet files with a single schema. I came across a file which seemingly has more than one schema. I used pandas to convert it to a CSV file. The result looks something like this:
table-1,table-2,table-3
0, {data for table-1} {data for table-2} {data for table-3}
I read the parquet file format and it looks like a single parquet file has a single schema.
Does parquet support more than one schema in a single file?
No, the Parquet format only supports a single schema per file. This schema is written into the footer of the file and accounts for all sections of the file. You could probably reread the CSV file into pandas and save that as a Parquet file, but ultimately you will be better off when you save each table as a separate file. The latter should also be much more performant and space-efficient.

PySpark: write a DataFrame to CSV files in S3 with a custom name

I am writing files to an S3 bucket with code such as the following:
df.write.format('csv').option('header','true').mode("append").save("s3://filepath")
This outputs to the S3 bucket as several files as desired, but each part has a long file name such as:
part-00019-tid-5505901395380134908-d8fa632e-bae4-4c7b-9f29-c34e9a344680-236-1-c000.csv
Is there a way to write this as a custom file name, preferably in the PySpark write function? Such as:
part-00019-my-output.csv
You can't do that with Spark alone. The long random suffix makes every file name unique, so nothing is duplicated or overwritten when many executors write files to the same location.
You'd have to use the AWS SDK to rename those files.
P.S. If you want a single CSV file, you can use coalesce(1), but the file name still cannot be chosen.
df.coalesce(1).write.format('csv')...
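Before renaming with the AWS SDK, you need the target name for each part file. As a rough sketch (the helper name custom_part_name and the "out/" prefix are invented here, not part of Spark or any SDK), the function below keeps the part index and the extension and swaps the random middle section for a label of your choice:

```python
import re

# Matches the "part-<index>" prefix that Spark puts on every output file.
_PART_PREFIX = re.compile(r"^part-\d+")


def custom_part_name(key, label):
    """Map a Spark part-file key to '<dir>/part-<index>-<label><ext>'.

    Returns None for keys that are not part files (for example _SUCCESS).
    """
    directory, _, filename = key.rpartition("/")
    match = _PART_PREFIX.match(filename)
    if match is None:
        return None

    # Treat everything from the first dot onwards as the extension,
    # so ".csv" and ".snappy.parquet" are both preserved.
    dot = filename.find(".")
    extension = filename[dot:] if dot != -1 else ""

    new_name = f"{match.group(0)}-{label}{extension}"
    return f"{directory}/{new_name}" if directory else new_name


old_key = "out/part-00019-tid-5505901395380134908-d8fa632e-bae4-4c7b-9f29-c34e9a344680-236-1-c000.csv"
print(custom_part_name(old_key, "my-output"))  # out/part-00019-my-output.csv
```

Because the part index is kept, the new names stay unique within one write. With mode("append"), a second run using the same label would produce clashing names, so the label would need to differ per run.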

Are there any problems with saving Parquet as a single file and no directory

I am currently working on a Pyspark application to output daily delta extracts as parquet. These files are to be a single partition (the natural partition will be on the date the data is created/updated, which is how they are being built).
I was planning to then take the outputted parquet folder and files, rename the actual parquet file itself, move it to another location and discard the original *.parquet directory including its _SUCCESS and *.crc files.
While I have tested reading files produced using the above scenario with Spark and Pandas, I am unsure whether this will cause issues with other applications that we may introduce in the future.
Can anyone see any actual issue (apart from the processing/coding effort) with the above approach?
Thanks
If you have a single Parquet file and rename it, the renamed file is still a valid Parquet file.
If you combine several Parquet files into one by simply concatenating them, the combined file will not be a valid Parquet file.
If you do need to combine several Parquet files into one, it is better to create the single file with Spark (using repartition) and write it to the table.
(or)
You can also use parquet-tools-**.jar to merge multiple Parquet files into one Parquet file.
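The rename-and-move step from the question could look roughly like the sketch below. It assumes the Spark output directory sits on a local (or locally mounted) filesystem and holds exactly one part file. The function name extract_single_parquet and the example paths are hypothetical.

```python
import glob
import os
import shutil


def extract_single_parquet(output_dir, destination):
    """Move the only part-*.parquet file out of `output_dir` to `destination`,
    then delete `output_dir` together with its _SUCCESS and .crc files."""
    out_abs = os.path.abspath(output_dir)
    dest_abs = os.path.abspath(destination)
    # Refuse to move the file into the directory that is about to be deleted.
    if os.path.commonpath([out_abs, dest_abs]) == out_abs:
        raise ValueError("destination must be outside the Spark output directory")

    # The hidden .crc files start with a dot, so this pattern does not match them.
    part_files = glob.glob(os.path.join(out_abs, "part-*.parquet"))
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")

    os.makedirs(os.path.dirname(dest_abs), exist_ok=True)
    shutil.move(part_files[0], dest_abs)
    shutil.rmtree(out_abs)
    return destination


# Example usage (paths are placeholders):
# extract_single_parquet("/data/tmp/extract.parquet", "/data/final/2018-01-12.parquet")
```

The file contents are not touched, so the moved file is byte-for-byte the one Spark wrote.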

Naming a Parquet file in a Glue job

How can I assign a predefined name to Parquet files in an AWS Glue job?
For example, after my job runs, a Parquet file gets stored in the specified folder with a name like:
part-00000-fc95461f-00da-437a-9396-93c7ea473720.snappy.parquet,
part-00000-tc95431f-00ds-437b-9396-93c7ea473720.snappy.parquet
I want the files to be stored with a predefined or structured name like:
part-00000-12Jan2018.snappy.parquet,
part-00000-13Jan2018.snappy.parquet
etc.
Because of how Spark works, we can't name the files to our liking at present.
An alternative approach is to rename the files as soon as they are written to S3 or the data lake.
I found these answers to be helpful.
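For the rename step, the date-stamped target name from the question can be built with a small helper. This is only a sketch: dated_part_name is a made-up function, and the ".snappy.parquet" suffix is taken from the example file names above.

```python
from datetime import date

# Spelled out explicitly so the result does not depend on the system locale.
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def dated_part_name(part_index, run_date, suffix=".snappy.parquet"):
    """Build a name such as 'part-00000-12Jan2018.snappy.parquet'."""
    stamp = f"{run_date.day:02d}{_MONTHS[run_date.month - 1]}{run_date.year}"
    return f"part-{part_index:05d}-{stamp}{suffix}"


print(dated_part_name(0, date(2018, 1, 12)))  # part-00000-12Jan2018.snappy.parquet
```

If the job writes more than one part file per day, keeping the part index in the name (as above) avoids two files ending up with the same name.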

Cannot load backup data from GCS to BigQuery

My backup table has 3 files: 2 ending with .backup_info and one folder with another folder containing 10 CSV files. What would be the format of the URL that specifies the backup file location?
I'm trying the URL below, and every time I get a file not found error.
gs://bucket_name/name_of_the_file_which_ended_with_backup_info.info
When you go to look at your file from your backup, it should have a structure like this:
Buckets/app-id-999999999-backups
And the filenames should look like:
2017-08-20T02:05:19Z_app-id-999999999_data.json.gz
Therefore the path will be:
gs://app-id-999999999-backups/2017-08-20T02:05:19Z_app-id-999999999_data.json.gz
Make sure you do not include the word "Buckets"; I am guessing that is the source of the confusion.
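To make the rule concrete, here is a small sketch that turns the path as displayed in the console into a gs:// URI by dropping the leading "Buckets" label. The helper name gcs_uri is invented for this example, and it assumes "Buckets" only ever appears as that leading label.

```python
def gcs_uri(console_path, object_name):
    """Build a gs:// URI from the path as displayed in the console.

    A leading 'Buckets' component is a label in the UI, not part of the
    bucket name, so it is dropped.
    """
    parts = [p for p in console_path.strip("/").split("/") if p]
    if parts and parts[0] == "Buckets":
        parts = parts[1:]
    if not parts:
        raise ValueError("no bucket name found in console path")
    return "gs://" + "/".join(parts + [object_name])


print(gcs_uri("Buckets/app-id-999999999-backups",
              "2017-08-20T02:05:19Z_app-id-999999999_data.json.gz"))
# gs://app-id-999999999-backups/2017-08-20T02:05:19Z_app-id-999999999_data.json.gz
```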