Control the compression level when writing Parquet files using Polars in Rust

I found that by default polars' output Parquet files are around 35% larger than Parquet files output by Spark (on the same data). Spark uses snappy compression by default, and it doesn't help if I switch ParquetCompression to Snappy in polars. Is this because polars uses a more conservative compression setting? Is there any way to control the compression level of Parquet files in polars? I checked the polars docs, and it seems that only Zstd accepts a ZstdLevel (I'm not even sure whether that is a compression level).
Below is my code to write a DataFrame to a Parquet file using the snappy compression.
use polars::prelude::*;
use std::fs::File;
use std::io::BufWriter;

let f = File::create("j.parquet").expect("Unable to create the file j.parquet!");
let bfw = BufWriter::new(f);
let pw = ParquetWriter::new(bfw).with_compression(ParquetCompression::Snappy);
pw.finish(&mut df).expect("Unable to write the Parquet file!");

This is not (yet) possible in Rust polars. It will likely land in the next release of arrow2, and then we can implement it in polars as well.
If you want that functionality in Python polars, you can leverage pyarrow for this purpose; polars has zero-copy interop with pyarrow.
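For example, a minimal sketch of that pyarrow route from Python, assuming an existing polars DataFrame df. Note that snappy itself has no tunable level; codecs such as gzip, brotli and zstd do accept a compression_level in pyarrow.

import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame({"a": [1, 2, 3]})  # placeholder data

# Zero-copy conversion to a pyarrow Table, then write Parquet with an
# explicitly chosen codec and (where the codec supports it) a level.
table = df.to_arrow()
pq.write_table(table, "j.parquet", compression="zstd", compression_level=10)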

Related

parquet file codec conversion

I have a parquet file whose compression codec is BROTLI. BROTLI is not supported by Trino, so I need to convert it to a supported codec (GZIP, SNAPPY, ...). The conversion doesn't seem straightforward, or at least I could not find any Python library which does it. Please share your ideas or strategies for this codec conversion.
You should be able to do this with pyarrow. It can read brotli-compressed Parquet files.
import pyarrow.parquet as pq

table = pq.read_table(<filename>)   # reads the brotli-compressed file
pq.write_table(table, <filename>)   # writes snappy-compressed by default
This will save it as a snappy-compressed file by default. You can specify different compression schemes using the compression keyword argument.
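For instance (file names here are placeholders):

import pyarrow.parquet as pq

table = pq.read_table("data_brotli.parquet")                    # brotli-compressed input
pq.write_table(table, "data_gzip.parquet", compression="gzip")  # re-encode with a codec Trino supports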

pypi sas7bdat to_data_frame taking too long for large data (5 GB)

I have a 5 GB SAS file, and the requirement is to create a Parquet file in Hadoop. I am using the SAS7BDAT library with the following approach, which takes more than 5 hours to create the pandas DataFrame when running PySpark in client mode. Curious to know if there is a better way of doing the same.
I know the saurfang package is available and is more efficient in this case, but we do not want to use any third-party software.
import sas7bdat

f = sas7bdat.SAS7BDAT(str(source_file))
pandas_df = f.to_data_frame()
spark_df = spark.createDataFrame(pandas_df)
del pandas_df
spark_df.write.save(dest_file, format='parquet', mode='Overwrite')
Please use Spark to read the file, not Pandas
https://github.com/saurfang/spark-sas7bdat/blob/master/README.md#python-api
Add this to your packages
saurfang:spark-sas7bdat:2.1.0-s_2.11
Note, I've not personally used this; I only searched for "SAS 7B DAT + Spark". If you have issues, please report them here:
https://github.com/saurfang/spark-sas7bdat/issues/
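For reference, a sketch of what that README's Python API looks like (untested; it assumes the job was launched with --packages saurfang:spark-sas7bdat:2.1.0-s_2.11 and reuses source_file/dest_file from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sas-to-parquet").getOrCreate()

# Read the SAS file directly with the spark-sas7bdat data source,
# skipping the intermediate pandas DataFrame entirely.
spark_df = spark.read.format("com.github.saurfang.sas.spark").load(str(source_file))
spark_df.write.save(dest_file, format='parquet', mode='Overwrite')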

How can I read many large .7z files containing many CSV files?

I have many .7z files, each containing several large CSV files (more than 1 GB each). How can I read these in Python (especially into pandas and Dask DataFrames)? Should I change the compression format to something else?
I believe you should be able to open the file using
import lzma
import pandas as pd

with lzma.open("myfile.7z", "r") as f:
    df = pd.read_csv(f, ...)
This is, strictly speaking, meant for the xz file format, but may work for 7z also. If not, you will need to use libarchive.
For use with Dask, you can do the above for each file with dask.delayed.
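A sketch of that dask.delayed approach (assuming the lzma trick above actually works on your archives; the file names are placeholders):

import lzma

import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def load_one(path):
    # Decompress one archive and parse its CSV content with pandas.
    with lzma.open(path, "r") as f:
        return pd.read_csv(f)

paths = ["myfiles.0.7z", "myfiles.1.7z"]  # placeholder list of archives
df = dd.from_delayed([load_one(p) for p in paths])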
dd.read_csv directly also allows you to specify storage_options={'compression': 'xz'}; however, random access within a file is likely to be inefficient at best, so you should add blocksize=None to force one partition per file:
df = dd.read_csv('myfiles.*.7z', storage_options={'compression': 'xz'},
                 blocksize=None)

Compressing StringIO data to read with pandas?

I have been using pandas' pd.read_sql_query to read a decent chunk of data into memory each day in order to process it (adding columns, calculations, etc. to about 1 GB of data). This has caused my computer to freeze a few times, so today I tried using psql to create a .csv file. I then compressed that file (.xz) and read it with pandas.
Overall, it was a lot smoother, and it made me think about automating the process. Is it possible to skip saving a .csv.xz file and instead copy the data directly into memory, while still compressing it (ideally)?
from io import StringIO

buf = StringIO()
from_curs = from_conn.cursor()
from_curs.copy_expert("COPY table where row_date = '2016-10-17' TO STDOUT WITH CSV HEADER", buf)
# (is it possible to compress this?)
buf.seek(0)
# (read the buf with pandas to process it)
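For what it's worth, here is a minimal sketch of one way to do that: gzip-compress the COPY output into an in-memory buffer and let pandas decompress it when reading. It assumes a psycopg2 connection from_conn as above; whether it actually helps depends on where the memory pressure comes from.

import gzip
from io import BytesIO

import pandas as pd

raw = BytesIO()
# Compress the COPY output on the fly instead of holding plain CSV text in memory.
with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
    from_curs = from_conn.cursor()
    from_curs.copy_expert("COPY table where row_date = '2016-10-17' TO STDOUT WITH CSV HEADER", gz)

raw.seek(0)
# pandas reads straight from the compressed buffer.
df = pd.read_csv(gzip.GzipFile(fileobj=raw, mode="rb"))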

BZip2 compressed input for Apache Flink

I have a wikipedia dump compressed with bzip2 (downloaded from http://dumps.wikimedia.org/enwiki/), but I don't want to unpack it: I want to process it while decompressing on the fly.
I know that it's possible to do this in plain Java (see e.g. Java - Read BZ2 file and uncompress/parse on the fly), but I was wondering how to do it in Apache Flink. What I probably need is something like https://github.com/whym/wikihadoop, but for Flink, not Hadoop.
It is possible to read compressed files in the following formats in Apache Flink:
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DefaultCodec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.Lz4Codec
org.apache.hadoop.io.compress.SnappyCodec
As you can see from the package names, Flink does this using Hadoop's InputFormats.
This is an example of reading .gz files using Flink's Scala API (you need at least Flink 0.8.1):
def main(args: Array[String]) {
  val env = ExecutionEnvironment.getExecutionEnvironment

  // Use Hadoop's TextInputFormat, which picks the compression codec from the file extension.
  val job = new JobConf()
  val hadoopInput = new TextInputFormat()
  FileInputFormat.addInputPath(job, new Path("/home/robert/Downloads/cawiki-20140407-all-titles.gz"))
  val lines = env.createHadoopInput(hadoopInput, classOf[LongWritable], classOf[Text], job)

  lines.print
  env.execute("Read gz files")
}
Apache Flink only has built-in support for .deflate files. Adding support for more compression codecs is easy to do, but it hasn't been done yet.
Using Hadoop InputFormats with Flink doesn't cause any performance loss; Flink has built-in serialization support for Hadoop's Writable types.