parquet file codec conversion - hive

I have a parquet file whose compression codec is BROTLI. BROTLI is not supported by Trino,
so I need to convert it to a supported codec such as GZIP or SNAPPY. The conversion doesn't seem straightforward, or at least I could not find any Python library that does it. Please share your ideas or strategies for this codec conversion.

You should be able to do this with pyarrow. It can read brotli-compressed Parquet files.
import pyarrow.parquet as pq
# Read the brotli-compressed file and write it back out (snappy by default).
table = pq.read_table(<filename>)
pq.write_table(table, <filename>)
This will save it as a snappy-compressed file by default. You can specify different compression schemes using the compression keyword argument.
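For example, a minimal sketch that rewrites the file with an explicitly chosen codec (the file names here are placeholders):
import pyarrow.parquet as pq

table = pq.read_table("data_brotli.parquet")
# Any codec Trino can read works here, e.g. "snappy" or "gzip".
pq.write_table(table, "data_snappy.parquet", compression="snappy")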

Related

Control the compression level when writing Parquet files using Polars in Rust

I found that by default polars' output Parquet files are around 35% larger than Parquet files output by Spark (on the same data). Spark uses snappy compression by default, and it doesn't help if I switch ParquetCompression to Snappy in polars. I wonder whether this is because polars uses a more conservative compression level. Is there any way to control the compression level of Parquet files in polars? I checked the polars docs; it seems that only Zstd accepts a ZstdLevel (I'm not even sure whether that is a compression level).
Below is my code to write a DataFrame to a Parquet file using snappy compression.
use polars::prelude::*;
use std::fs::File;
use std::io::BufWriter;

let f = File::create("j.parquet").expect("Unable to create the file j.parquet!");
let bfw = BufWriter::new(f);
let pw = ParquetWriter::new(bfw).with_compression(ParquetCompression::Snappy);
pw.finish(&mut df).expect("Unable to write j.parquet!");
This is not (yet) possible in Rust polars. It will likely land in the next release of arrow2, and then we can implement it in polars as well.
If you want that functionality in Python polars, you can leverage pyarrow for this purpose; polars has zero-copy interop with pyarrow.
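In Python that route looks roughly like this (a minimal sketch assuming polars' to_arrow() and pyarrow's write_table; the zstd level shown is just an example):
import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame({"a": [1, 2, 3]})
# Zero-copy conversion to an Arrow table, then write with an explicit codec and level.
arrow_table = df.to_arrow()
pq.write_table(arrow_table, "j.parquet", compression="zstd", compression_level=10)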

How can I read many large .7z files containing many CSV files?

I have many .7z files, each containing many large CSV files (more than 1 GB). How can I read these in Python (especially with pandas and Dask data frames)? Should I change the compression format to something else?
I believe you should be able to open the file using
import lzma
import pandas as pd

with lzma.open("myfile.7z", "r") as f:
    df = pd.read_csv(f, ...)
This is, strictly speaking, meant for the xz file format, but may work for 7z also. If not, you will need to use libarchive.
For use with Dask, you can do the above for each file with dask.delayed.
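A minimal sketch of that approach (assuming lzma.open can actually decode your archives, as above; the file names are placeholders):
import lzma

import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def load_csv(path):
    # Decompress one file and parse it with pandas.
    with lzma.open(path, "rt") as f:
        return pd.read_csv(f)

paths = ["myfiles.1.7z", "myfiles.2.7z"]  # placeholder file names
df = dd.from_delayed([load_csv(p) for p in paths])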
dd.read_csv directly also allows you to specify storage_options={'compression': 'xz'}; however, random access within a file is likely to be inefficient at best, so you should add blocksize=None to force one partition per file:
df = dd.read_csv('myfiles.*.7z', storage_options={'compression': 'xz'},
                 blocksize=None)

Writing Spark RDD as Gzipped file in Amazon s3

I have an output RDD in my Spark code, written in Python. I want to save it to Amazon S3 as a gzipped file. I have tried the following functions.
The function below correctly saves the output RDD to S3, but not in gzipped format.
output_rdd.saveAsTextFile("s3://<name-of-bucket>/")
The function below returns an error: TypeError: saveAsHadoopFile() takes at least 3 arguments (3 given)
output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/",
compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)
Please guide me with the correct way to do this.
You need to specify the output format as well.
Try this:
output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/", "org.apache.hadoop.mapred.TextOutputFormat", compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
You can use any of the Hadoop-supported compression codecs:
gzip: org.apache.hadoop.io.compress.GzipCodec
bzip2: org.apache.hadoop.io.compress.BZip2Codec
LZO: com.hadoop.compression.lzo.LzopCodec
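For example, switching to bzip2 only changes the codec class (a sketch; depending on your PySpark version, saveAsTextFile may also accept a compressionCodecClass argument directly):
output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/",
                            "org.apache.hadoop.mapred.TextOutputFormat",
                            compressionCodecClass="org.apache.hadoop.io.compress.BZip2Codec")

# Possible shortcut for plain text output (check your PySpark version):
output_rdd.saveAsTextFile("s3://<name-of-bucket>/",
                          compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")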

MIME-type for tgz

I found several definitions for a gzipped tarball (.tgz or .tar.gz). Which is the correct one, and is there one?
application/x-gtar (Wikipedia, Some bugtracker)
application/x-tar-gz (Forum, Python)
I didn't find a match in the official list.
Try the following RegExp for all compressed-file MIME types:
application/(zip|gzip|tgz|tar|tar+gzip|x-tgz|x-tar|x-gtar|x-tar-gz|x-gzip|x-tar+gzip|x-xz-compressed-tar)
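A quick sanity check of that pattern in Python (the sample MIME strings are just examples):
import re

MIME_PATTERN = re.compile(
    r"application/(zip|gzip|tgz|tar|tar+gzip|x-tgz|x-tar|x-gtar"
    r"|x-tar-gz|x-gzip|x-tar+gzip|x-xz-compressed-tar)"
)

# Both MIME types mentioned above match the pattern.
print(bool(MIME_PATTERN.fullmatch("application/x-gtar")))    # True
print(bool(MIME_PATTERN.fullmatch("application/x-tar-gz")))  # True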

Does libhdfs c/c++ api support read/write compressed file

I have found somebody saying, around 2010, that libhdfs does not support reading/writing gzip files.
I downloaded the newest hadoop-2.0.4 and read hdfs.h. There are no compression arguments there either.
Now I am wondering whether it supports reading compressed files by now.
If not, how can I write a patch for libhdfs to make it work?
Thanks in advance.
Best Regards
Haiti
As far as I know, libhdfs only uses JNI to access HDFS. If you are familiar with the HDFS Java API, libhdfs is just a wrapper around org.apache.hadoop.fs.FSDataInputStream, so it cannot read compressed files directly.
I guess that you want to access files in HDFS from C/C++. If so, you can use libhdfs to read the raw file and a C/C++ compression library to decompress the content; the compressed data is in the same format as anywhere else. For example, if the files are compressed with LZO, you can use the LZO library to decompress them.
But if the file is a SequenceFile, you may need to use JNI to access it, since that is a Hadoop-specific format. I have seen Impala do similar work before, but it's not out of the box.
Thanks for the reply. Using libhdfs to read the raw file and then zlib to inflate the content works. The files are gzip-compressed. I used code like this:
z_stream gzip_stream;
gzip_stream.zalloc = (alloc_func)0;
gzip_stream.zfree = (free_func)0;
gzip_stream.opaque = (voidpf)0;
/* buf holds the raw bytes read via libhdfs; buf1 receives the inflated data. */
gzip_stream.next_in = buf;
gzip_stream.avail_in = readlen;
gzip_stream.next_out = buf1;
gzip_stream.avail_out = 4096 * 4096;
/* 16 + MAX_WBITS tells zlib to expect a gzip header. */
ret = inflateInit2(&gzip_stream, 16 + MAX_WBITS);
if (ret != Z_OK) {
    printf("inflate init error\n");
}
ret = inflate(&gzip_stream, Z_NO_FLUSH);
ret = inflateEnd(&gzip_stream);
printf("the buf \n%s\n", buf1);
return buf;