Naming a Parquet File in an AWS Glue Job - amazon-s3

How do I assign a predefined name to the parquet files written by an AWS Glue job?
For example, after my job runs, parquet files get stored in a specific folder with names like:
part-00000-fc95461f-00da-437a-9396-93c7ea473720.snappy.parquet,
part-00000-tc95431f-00ds-437b-9396-93c7ea473720.snappy.parquet
I want the files to be stored with a predefined or structured name like:
part-00000-12Jan2018.snappy.parquet,
part-00000-13Jan2018.snappy.parquet
etc.

Due to the way Spark writes its output, we can't name the files to our liking at present.
An alternative approach is to rename the files as soon as they are written to S3 / the data lake.
I found these answers to be helpful.
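As a rough sketch of that rename-after-write approach using boto3 (the bucket name, prefix, and date-based naming pattern below are assumptions for illustration, not something the Glue job produces by itself):

import boto3
from datetime import datetime

s3 = boto3.client("s3")
bucket = "my-bucket"              # hypothetical bucket the Glue job writes to
prefix = "glue-output/"           # hypothetical output prefix
date_tag = datetime.utcnow().strftime("%d%b%Y")   # e.g. 12Jan2018

# List the part files the job just wrote, copy each one to a predictable
# date-based name, then delete the original auto-generated object.
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for i, obj in enumerate(resp.get("Contents", [])):
    key = obj["Key"]
    if not key.endswith(".snappy.parquet"):
        continue
    new_key = "{}part-{:05d}-{}.snappy.parquet".format(prefix, i, date_tag)
    s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=new_key)
    s3.delete_object(Bucket=bucket, Key=key)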

Related

Parquet file with more than one schema

I am used to parquet files with a single schema. I came across a file which seemingly has more than one schema. I used pandas to convert it to a CSV file. The result is something like this:
table-1,table-2,table-3
0, {data for table-1} {data for table-2} {data for table-3}
I read about the Parquet file format, and it looks like a single parquet file has a single schema.
Does Parquet support more than one schema in a single file?
No, the Parquet format only supports a single schema per file. This schema is written into the footer of the file and accounts for all sections of the file. You could probably re-read the CSV file into pandas and save that as a Parquet file, but ultimately you will be better off saving each table as a separate file. The latter should also be much more performant and space-efficient.
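As a small illustration of the "one file per table" suggestion using pandas (the column names and file paths below are hypothetical, and to_parquet needs pyarrow or fastparquet installed):

import pandas as pd

# Hypothetical: the combined CSV produced earlier, one column per table.
combined = pd.read_csv("combined.csv")

# Write each table's data to its own parquet file, each with its own schema.
for table_name in ["table-1", "table-2", "table-3"]:
    pd.DataFrame({table_name: combined[table_name]}).to_parquet(table_name + ".parquet")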

How to dynamically create a table in Snowflake, getting the schema from a parquet file stored in AWS

Could you help me load a couple of parquet files into Snowflake?
I've got about 250 parquet files stored in an AWS stage.
250 files = 250 different tables.
I'd like to load them into Snowflake tables dynamically.
So, I need to:
Get the schema from the parquet file. I've read that I could get the schema from a parquet file using parquet-tools (Apache).
Create a table using the schema from the parquet file.
Load the data from the parquet file into this table.
Could anyone help me do that? Is there an efficient way to accomplish it (using the Snowflake GUI, for example)? I can't find one.
Thanks.
If the schema of the files is the same, you can put them in a single stage and use the INFER_SCHEMA function. This will give you the schema of the parquet files.
https://docs.snowflake.com/en/sql-reference/functions/infer_schema.html
In case the files all have different schemas, then I'm afraid you will have to infer the schema of each file separately.
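As a rough sketch of that per-file approach using the Snowflake Python connector (the stage, file format, connection parameters, and table-naming rule below are all assumptions; INFER_SCHEMA and CREATE TABLE ... USING TEMPLATE are the documented Snowflake features doing the work):

import os
import snowflake.connector

# Hypothetical connection parameters.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cur = conn.cursor()

# Assumes a parquet file format and a stage already exist, e.g.:
#   CREATE FILE FORMAT my_parquet_format TYPE = PARQUET;
#   CREATE STAGE my_stage URL = 's3://...' ...;
for row in cur.execute("LIST @my_stage").fetchall():
    file_name = os.path.basename(row[0])            # row[0] is the staged file path
    table_name = file_name.split(".")[0].replace("-", "_").upper()

    # Create the table from the schema INFER_SCHEMA detects for this one file.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS {t}
        USING TEMPLATE (
            SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
            FROM TABLE(INFER_SCHEMA(
                LOCATION => '@my_stage/{f}',
                FILE_FORMAT => 'my_parquet_format'))
        )""".format(t=table_name, f=file_name))

    # Load the file into the table that was just created from it.
    cur.execute("""
        COPY INTO {t}
        FROM @my_stage/{f}
        FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE""".format(t=table_name, f=file_name))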

PySpark: write a DataFrame to CSV files in S3 with a custom name

I am writing files to an S3 bucket with code such as the following:
df.write.format('csv').option('header','true').mode("append").save("s3://filepath")
This outputs to the S3 bucket as several files as desired, but each part has a long file name such as:
part-00019-tid-5505901395380134908-d8fa632e-bae4-4c7b-9f29-c34e9a344680-236-1-c000.csv
Is there a way to write this as a custom file name, preferably in the PySpark write function? Such as:
part-00019-my-output.csv
You can't do that with Spark alone. The long random identifiers at the end are there to make sure there is no duplication and no overwriting when many executors try to write files to the same location.
You'd have to use the AWS SDK to rename those files.
P.S.: If you want a single CSV file, you can use coalesce, but the file name is still not deterministic.
df.coalesce(1).write.format('csv')...
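If you go the coalesce route, one commonly used workaround (not part of the answer above; the paths and names here are hypothetical) is to rename the single part file afterwards through Hadoop's FileSystem API, reachable from PySpark via the JVM gateway:

# Assumes an existing SparkSession `spark` and an S3 connector (s3a here)
# configured on the cluster.
out_dir = "s3a://bucket/output/"

df.coalesce(1).write.format("csv").option("header", "true").mode("overwrite").save(out_dir)

Path = spark._jvm.org.apache.hadoop.fs.Path
fs = Path(out_dir).getFileSystem(spark._jsc.hadoopConfiguration())

# Find the single auto-named part file and rename it to the custom name.
part_file = fs.globStatus(Path(out_dir + "part-*.csv"))[0].getPath()
fs.rename(part_file, Path(out_dir + "part-00019-my-output.csv"))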

Are there any problems with saving parquet as a single file and not a directory

I am currently working on a PySpark application to output daily delta extracts as parquet. These files are to be a single partition (the natural partition will be on the date the data is created/updated, which is how they are being built).
I was then planning to take the output parquet folder and files, rename the actual parquet file itself, move it to another location, and discard the original *.parquet directory, including its _SUCCESS and *.crc files.
While I have tested reading files produced using the above scenario with Spark and Pandas, I am unsure whether this will cause issues with other applications that we may introduce in the future.
Can anyone see any actual issue (apart from the processing/coding effort) with the above approach?
Thanks
If you have a single parquet file and rename that file to a new filename, the new file will still be a valid parquet file.
If you take two or more parquet files and simply concatenate them into one, the combined file will not be a valid parquet file.
If you need to combine several parquet files into one, it is better to create a single file with Spark (using repartition) and write that out.
(or)
You can also use parquet-tools-**.jar to merge multiple parquet files into one parquet file.
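A minimal sketch of the repartition route mentioned above (paths are hypothetical):

# Write the daily extract as a single part file; the output directory will
# still contain _SUCCESS and .crc entries, which can be cleaned up afterwards.
df.repartition(1).write.mode("overwrite").parquet("s3a://bucket/daily_extract/")

# Reading the output back verifies the single-file extract.
df_check = spark.read.parquet("s3a://bucket/daily_extract/")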

Save a CSV file into an S3 bucket from a PySpark dataframe

I would like to save the contents of a Spark dataframe into a CSV file in an S3 bucket:
df_country.repartition(1).write.csv('s3n://bucket/test/csv/a',sep=",",header=True,mode='overwrite')
The problem is that it creates a file with a name like: part-00000-fc644e84-7579-48.
Is there any way to fix the name of this file, for example to test.csv?
Thanks
Best
This is not possible, since every partition in the job will create its own file and must follow a strict convention to avoid naming conflicts. The recommended solution is to rename the file after it is created.
Also, if you know you are only writing one file per path, e.g. s3n://bucket/test/csv/a, then it doesn't really matter what the name of the file is: simply read in all the contents of that unique directory name.
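For instance (path as in the question; whether to use s3n or s3a depends on the cluster's Hadoop configuration):

# Read every CSV part file under the directory; the individual part file
# names never need to be known.
df_country = spark.read.option("header", "true").csv("s3n://bucket/test/csv/a")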
Sources:
1. Specifying the filename when saving a DataFrame as a CSV
2. Spark dataframe save in single file on hdfs location