I am used to Parquet files with a single schema. I came across a file which seemingly has more than one schema. I used pandas to convert it to a CSV file. The result looks something like this:
table-1,table-2,table-3
0, {data for table-1} {data for table-2} {data for table-3}
I read the Parquet file format documentation, and it looks like a single Parquet file has a single schema.
Does Parquet support more than one schema in a single file?
No, the Parquet format only supports a single schema per file. This schema is written into the footer of the file and covers all sections of the file. You could probably reread the CSV file into pandas and save that as a Parquet file, but ultimately you will be better off if you save each table as a separate file. The latter should also be much more performant and space-efficient.
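To make the "one footer, one schema" point concrete, here is a minimal sketch that locates the footer of a Parquet file using only the standard library. It relies on the documented layout of an unencrypted Parquet file (a 4-byte `PAR1` marker at both ends, with a 4-byte little-endian footer length just before the trailing marker). The function name `footer_length` is made up for this illustration, and the sketch does not decode the schema itself.

```python
import struct

MAGIC = b"PAR1"  # 4-byte marker at the start and end of an unencrypted Parquet file


def footer_length(data: bytes) -> int:
    """Return the size in bytes of the single footer metadata block.

    `data` is the complete content of a Parquet file. The footer holds the
    file metadata, including the one schema that applies to the whole file.
    """
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (PAR1 marker missing)")
    # The 4 bytes before the trailing marker store the footer length.
    (length,) = struct.unpack("<I", data[-8:-4])
    if length > len(data) - 12:
        raise ValueError("footer length is larger than the file")
    return length
```

For a real file you could call it as `footer_length(open(path, "rb").read())`; for large files it would be better to seek to the last 8 bytes instead of reading everything.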
Related
Could you help me load a set of Parquet files into Snowflake?
I have about 250 Parquet files stored in an AWS stage.
250 files = 250 different tables.
I'd like to dynamically load them into Snowflake tables.
So, I need:
Get the schema from the Parquet file. I've read that I could get the schema from a Parquet file using parquet-tools (Apache).
Create a table using the schema from the Parquet file.
Load the data from the Parquet file into this table.
Could anyone help me with how to do that? Is there an efficient way to do it (using the Snowflake GUI, for example)? I can't find one.
Thanks.
If the files all have the same schema, you can put them in a single stage and use the INFER_SCHEMA function. This will give you the schema of the Parquet files.
https://docs.snowflake.com/en/sql-reference/functions/infer_schema.html
If the files all have different schemas, then I'm afraid you have to infer the schema for each file separately.
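For the case of one table per file, the per-file inference can at least be scripted. The sketch below only generates the SQL text; it follows the CREATE TABLE ... USING TEMPLATE pattern shown on the INFER_SCHEMA documentation page linked above, plus a COPY INTO with MATCH_BY_COLUMN_NAME. The stage name, the named file format, and both helper functions are hypothetical, and the statements should be checked against your own account before running them.

```python
from pathlib import PurePosixPath
from typing import List


def table_name_for(file_key: str) -> str:
    """Derive a table name from a staged file path (assumed naming rule)."""
    stem = PurePosixPath(file_key).stem
    # Replace anything that is not a letter or digit so the name stays unquoted.
    cleaned = "".join(ch if ch.isalnum() else "_" for ch in stem)
    return cleaned.upper()


def load_statements(stage: str, file_key: str, file_format: str) -> List[str]:
    """Build the CREATE TABLE and COPY INTO statements for one staged file."""
    table = table_name_for(file_key)
    location = f"@{stage}/{file_key}"
    create = (
        f"CREATE TABLE {table} USING TEMPLATE (\n"
        f"  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))\n"
        f"  FROM TABLE(INFER_SCHEMA(LOCATION => '{location}', "
        f"FILE_FORMAT => '{file_format}'))\n"
        f")"
    )
    copy = (
        f"COPY INTO {table} FROM {location}\n"
        f"  FILE_FORMAT = (FORMAT_NAME = '{file_format}')\n"
        f"  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
    )
    return [create, copy]


# Example with made-up names: one staged file becomes one table.
for statement in load_statements("my_stage", "exports/orders-2024.parquet",
                                 "my_parquet_format"):
    print(statement + ";")
```

Looping this over the 250 file names and sending each statement through a Snowflake client of your choice would cover all three steps from the question; a file name that starts with a digit would need extra handling in `table_name_for`.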
I am currently working on a PySpark application that outputs daily delta extracts as Parquet. Each extract is to be a single partition (the natural partition is the date the data was created or updated, which is how the extracts are being built).
I was planning to then take the outputted parquet folder and files, rename the actual parquet file itself, move it to another location and discard the original *.parquet directory including its _SUCCESS and *.crc files.
While I have tested reading files produced using the above scenario with Spark and Pandas, I am unsure whether this will cause issues with other applications that we may introduce in the future.
Can anyone see any actual issue (apart from the processing/coding effort) with the above approach?
Thanks
If you have a single Parquet file and rename it, the renamed file is still a valid Parquet file.
If you combine several Parquet files into one yourself (for example by concatenating their bytes), the combined file will not be a valid Parquet file.
If you need to combine several Parquet files into one, it's better to let Spark create the single file (using repartition) and write it to the table.
(or)
You can also use parquet-tools-**.jar to merge multiple parquet files into one parquet file.
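The rename-and-discard step described in the question can be sketched with the standard library as follows. This assumes the Spark output directory sits on a local (or locally mounted) filesystem and contains exactly one part file; the function name and the paths in the usage note are made up for illustration.

```python
import shutil
from pathlib import Path


def promote_single_part(spark_output_dir: str, destination: str) -> Path:
    """Move the only part-*.parquet file out of a Spark output directory.

    The directory itself, including _SUCCESS and the hidden *.crc files,
    is deleted afterwards.
    """
    out_dir = Path(spark_output_dir)
    parts = sorted(out_dir.glob("part-*.parquet"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    dest = Path(destination)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Only the name and location change; the file bytes are untouched.
    shutil.move(str(parts[0]), str(dest))
    # Remove what is left: _SUCCESS, .crc checksum files, and the directory.
    shutil.rmtree(out_dir)
    return dest
```

A call such as `promote_single_part("out/extract.parquet", "final/2024-01-01.parquet")` would leave only the renamed file behind. The destination must not be inside the directory being removed.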
I would like to save the content of a Spark DataFrame to a CSV file in an S3 bucket:
df_country.repartition(1).write.csv('s3n://bucket/test/csv/a',sep=",",header=True,mode='overwrite')
The problem is that it creates a file with a name like part-00000-fc644e84-7579-48.
Is there any way to fix the name of this file, for example to test.csv?
Thanks
Best
This is not possible since every partition in the job will create its own file and must follow a strict convention to avoid naming conflicts. The recommended solution is to rename the file after it is created.
Also, if you know you are only writing one file per path (for example s3n://bucket/test/csv/a), then it doesn't really matter what the file is called: simply read in all the contents of that unique directory.
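As a rough sketch of "rename the file after it is created" on S3: S3 has no rename operation, so the usual approach is to copy the object to the new key and delete the original. The function below expects a boto3 S3 client to be passed in and uses its `list_objects_v2`, `copy_object`, and `delete_object` calls; the function name and the example bucket and keys are assumptions for illustration.

```python
def rename_single_s3_object(s3, bucket: str, prefix: str, new_key: str) -> str:
    """Copy the only part-* object under `prefix` to `new_key`, then delete it.

    `s3` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    Returns the key of the original object.
    """
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # Keep only Spark part files; ignore markers such as _SUCCESS.
    keys = [
        obj["Key"]
        for obj in response.get("Contents", [])
        if obj["Key"].rsplit("/", 1)[-1].startswith("part-")
    ]
    if len(keys) != 1:
        raise ValueError(f"expected exactly one part file, found {len(keys)}")
    old_key = keys[0]
    s3.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    s3.delete_object(Bucket=bucket, Key=old_key)
    return old_key


# Hypothetical usage:
# import boto3
# rename_single_s3_object(boto3.client("s3"), "bucket",
#                         "test/csv/a/", "test/csv/test.csv")
```

Note that a single `copy_object` call only works for objects up to 5 GB; larger files would need a multipart copy.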
Sources:
1. Specifying the filename when saving a DataFrame as a CSV
2. Spark dataframe save in single file on hdfs location
How can I assign a predefined name to the Parquet files written by an AWS Glue job?
For example after my job runs a parquet file gets stored in the specific folder with a name like:
part-00000-fc95461f-00da-437a-9396-93c7ea473720.snappy.parquet,
part-00000-tc95431f-00ds-437b-9396-93c7ea473720.snappy.parquet
I want the files to be stored with a predefined or structured name, like:
part-00000-12Jan2018.snappy.parquet,
part-00000-13Jan2018.snappy.parquet
etc.
Due to the nature of how Spark works, we can't name the files to our liking at present.
An alternative approach is to rename the files as soon as they are written to S3 or the data lake.
I found these answers to be helpful.
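If you go the rename route, the target name from the question can be computed from the original part file name and the run date. This is only a sketch of the naming step, with a hypothetical helper `dated_part_name`; the actual rename on S3 still has to be done as a copy followed by a delete.

```python
from datetime import date

# Fixed English month abbreviations, so the result does not depend on the locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def dated_part_name(original_name: str, run_date: date) -> str:
    """Replace the random id in a Spark part file name with a date stamp."""
    # Split at the first dot: "part-00000-<id>" and e.g. "snappy.parquet".
    stem, _, extension = original_name.partition(".")
    pieces = stem.split("-")
    if len(pieces) < 2 or pieces[0] != "part":
        raise ValueError(f"not a Spark part file name: {original_name}")
    part_index = pieces[1]
    stamp = f"{run_date.day:02d}{MONTHS[run_date.month - 1]}{run_date.year}"
    suffix = f".{extension}" if extension else ""
    return f"part-{part_index}-{stamp}{suffix}"
```

Keeping the part index in the name avoids collisions if a job ever writes more than one file for the same date.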
I have a small file (2 MB). I created an external Hive table over this file (stored as textfile). I created another table (stored as ORC) and copied the data from the first table into it. When I checked the size of the data in the ORC table, it was more than 2 MB.
ORC is a compressed file format, so shouldn't the data size be smaller?
As of Hive 0.14, users can request an efficient merge of small ORC files together by issuing a CONCATENATE command on their table or partition. The files will be merged at the stripe level without reserialization.
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
It's because your source file is too small. ORC has a complex structure with internal indexes, headers, footers, and a postscript, and the compression codecs add some structures of their own.
See this for details: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileFormat
All these supporting structures can consume more space than the data itself. For such a small file you really do not need to store min/max values for columns, Bloom filters, etc., since the whole file may fit in memory. The best storage for this case is an uncompressed text file. You can also try just gzipping your source file and checking its size. A very small gzipped file may be bigger than the uncompressed one. The bigger the file, the more you benefit from compression and from using ORC.
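The gzip effect is easy to check with a few lines of Python. This sketch only demonstrates gzip's fixed overhead on made-up sample data, not ORC itself, but the same reasoning applies: fixed-size supporting structures dominate when the payload is tiny.

```python
import gzip


def gzip_overhead(payload: bytes) -> int:
    """Compressed size minus original size; positive means gzip made it bigger."""
    return len(gzip.compress(payload)) - len(payload)


# A tiny payload: the gzip header and trailer alone outweigh any savings.
tiny = b"id,name\n1,a\n"

# A larger, repetitive payload: compression pays off.
large = b"id,name\n" + b"".join(b"%d,row-%d\n" % (i, i) for i in range(50000))

print("tiny :", gzip_overhead(tiny))   # positive: the output is larger
print("large:", gzip_overhead(large))  # negative: the output is smaller
```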