BZip2 compressed input for Apache Flink

I have a wikipedia dump compressed with bzip2 (downloaded from http://dumps.wikimedia.org/enwiki/), but I don't want to unpack it: I want to process it while decompressing on the fly.
I know that it's possible to do this in plain Java (see e.g. Java - Read BZ2 file and uncompress/parse on the fly), but I was wondering how to do it in Apache Flink. What I probably need is something like https://github.com/whym/wikihadoop, but for Flink instead of Hadoop.

It is possible to read compressed files in the following formats in Apache Flink:
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DefaultCodec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.Lz4Codec
org.apache.hadoop.io.compress.SnappyCodec
As you can see from the package names, Flink does this using Hadoop's InputFormats.
This is an example of reading gz files using Flink's Scala API (you need at least Flink 0.8.1):
import org.apache.flink.api.scala._
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

def main(args: Array[String]) {
  val env = ExecutionEnvironment.getExecutionEnvironment
  // Hadoop's TextInputFormat picks the codec from the file extension,
  // so a .bz2 file (e.g. the Wikipedia dump) is read exactly the same way
  val job = new JobConf()
  val hadoopInput = new TextInputFormat()
  FileInputFormat.addInputPath(job, new Path("/home/robert/Downloads/cawiki-20140407-all-titles.gz"))
  // wrap the Hadoop input format as a Flink DataSet of (LongWritable, Text) pairs
  val lines = env.createHadoopInput(hadoopInput, classOf[LongWritable], classOf[Text], job)
  lines.print
  env.execute("Read gz files")
}
Apache Flink itself has built-in support only for .deflate files. Adding support for more compression codecs is easy to do, but hasn't been done yet.
Using Hadoop's InputFormats with Flink doesn't cause any performance loss, since Flink has built-in serialization support for Hadoop's Writable types.

Related

Control the compression level when writing Parquet files using Polars in Rust

I found that by default polars' output Parquet files are around 35% larger than Parquet files output by Spark (on the same data). Spark uses snappy for compression by default, and it doesn't help if I switch ParquetCompression to snappy in polars. I wonder whether this is because polars uses a more conservative compression level. Is there any way to control the compression level of Parquet files in polars? I checked the polars docs; it seems that only Zstd accepts a ZstdLevel (I'm not even sure whether that is a compression level).
Below is my code to write a DataFrame to a Parquet file using snappy compression.
use polars::prelude::*;
use std::{fs::File, io::BufWriter};

// `df` is an existing mutable DataFrame
let f = File::create("j.parquet").expect("Unable to create the file j.parquet!");
let bfw = BufWriter::new(f);
let pw = ParquetWriter::new(bfw).with_compression(ParquetCompression::Snappy);
pw.finish(&mut df).expect("Failed to write j.parquet");
This is not (yet) possible in rust polars. It will likely be in the next release of arrow2, and then we can implement it in polars as well.
If you want that functionality in Python polars, you can leverage pyarrow for this purpose; polars has zero-copy interop with pyarrow.
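For the Python route, here is a minimal sketch of the pyarrow detour (the DataFrame contents and the output file names are just placeholders, and it assumes the pyarrow package is installed):

import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})  # placeholder data

# Zero-copy conversion to an Arrow table, then let pyarrow choose the codec
# and, where the codec supports it, the compression level.
table = df.to_arrow()
pq.write_table(table, "j.parquet", compression="snappy")
pq.write_table(table, "j_zstd.parquet", compression="zstd", compression_level=10)

Note that snappy itself has no tunable compression level, so the size gap versus Spark's output is more likely due to row-group sizes and encoding defaults than to the codec.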

ORC format - PIG - dependent libraries

I understand that to write ORC format with snappy compression from a Pig script I should use OrcStorage('-c SNAPPY').
What SET command or libraries do I need to include to enable storing the result dataset in ORC format?
Check which Pig version you are using. OrcStorage is available from Pig 0.14 as a built-in function.
Check the examples:
https://pig.apache.org/docs/r0.14.0/func.html#OrcStorage
UPDATE
This Pig script works fine:
data = LOAD 'SO/date.txt' USING PigStorage(' ') AS (ts:chararray);
STORE data INTO 'orc/snappy' using OrcStorage('-c SNAPPY');
data_orc = LOAD 'orc/snappy' using OrcStorage('-c SNAPPY');
DUMP data_orc;
Registering the kryo jar with REGISTER doesn't help: it isn't referenced directly from the Pig script, so it would be optimized out. Since OrcStorage uses kryo via reflection, you have to add the jar to the classpath instead, like this:
pig -latest -useHCatalog -cp ./kryo-2.24.0.jar orc.pig

Apache Spark to S3 upload performance Issue

I'm seeing a major performance issue when Apache Spark uploads its results to S3. As per my understanding, it goes through these steps:
The output of the final stage is written to a _temp/ table in HDFS and then moved into the "_temporary" folder inside the specific S3 folder.
Once the whole process is done, Spark completes the saveAsTextFile stage and the files inside the "_temporary" folder in S3 are moved into the main folder. This takes a long time (approximately 1 minute per file, average size 600 MB BZ2) and is not logged in the usual stderr log.
I'm using Apache Spark 1.0.1 with Hadoop 2.2 on AWS EMR.
Has anyone encountered this issue ?
Update 1
How can I increase the number of threads that perform this move process?
This was fixed with SPARK-3595 (https://issues.apache.org/jira/browse/SPARK-3595), which was incorporated in builds 1.1.0.e and later (see https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark).
I use the code below. It uploads around 60 GB of gz files to S3 in 4-6 minutes.
// ctx is the JavaSparkContext, counts is a JavaPairRDD of (Text, Text)
// use "," as the key/value separator in the text output
ctx.hadoopConfiguration().set("mapred.textoutputformat.separator", ",");
counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class,
        TextOutputFormat.class);
Make sure that you create more output files; a larger number of smaller files will make the upload faster.
API details
saveAsHadoopFile[F <: org.apache.hadoop.mapred.OutputFormat[_, _]](path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], codec: Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]): Unit
Output the RDD to any Hadoop-supported file system, compressing with the supplied codec.
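The same idea as a rough PySpark sketch (the counts RDD, the S3 path, and the partition count are placeholders, and the compressionCodecClass argument needs a reasonably recent Spark release):

from pyspark import SparkContext

sc = SparkContext(appName="s3-upload-sketch")
counts = sc.parallelize([("a", 1), ("b", 2)])  # stand-in for the real result RDD

codec = "org.apache.hadoop.io.compress.GzipCodec"
(counts
    .repartition(200)                      # more, smaller part files upload in parallel
    .map(lambda kv: "%s,%s" % kv)          # format (key, value) pairs as CSV lines
    .saveAsTextFile("s3n://my-bucket/output", compressionCodecClass=codec))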

Celery result encoding in Redis

I am calling tasks in celery with a RabbitMQ broker on an Ubuntu box, but just getting set up using Redis as the result backend. I can find task results, but they look like ""\x80\x02}q\x01(U\x06statusq\x02U\aSUCCESSq\x03U\ttracebackq\x04NU\x06resultq\x05}q\x06(X\x06\x00\x00\x00result}q\a(X\x06\x00\x00\x00statusK\x01X\r\x00\x00\x00total_resultsM\xf4\x01X\a\x00\x00\x00matches]q\b(}q\t(X\a\x00\x00\x00players]q\n(}q\x0b(X\a\x00\x00\x00hero_idK\x15X\n\x00\x00\x00account_idI4294967295\nX\x0b\x00\x00\x00player_slotK\x00u}q\x0c(X\a\x00\x00\x00hero_idK\x0cX\n\x00\x00\x00account_idI4294967295\nX\x0b\x00\x00\x00player_slotK\x01u}q\r(X\a\x00\x00\x00hero_idK\x1bX\n\x00\x00\x00account_i...."
My default celery encoding is ASCII, and Redis does not appear to have an encoding specified in its base conf.
utils.encoding.default_encoding()
'ascii'
How should I go about turning this text into something meaningful? I cannot tell how this is encoded on sight; any suggested decodings to try?
The result is serialized with pickle by default (see task serializers); what you are looking at is the raw pickle payload. You can inspect it manually with:
import pickle

# the raw bytes as stored in Redis (truncated here)
raw = b"\x80\x02}q..."
obj = pickle.loads(raw)
print(obj)
pickle is generally fine unless you are operating in a polyglot environment, in which case JSON or msgpack are good alternatives.
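If you would rather have human-readable results in Redis, a small configuration sketch switching the serializers to JSON (these are the old-style uppercase setting names used by Celery 3.x; newer releases use lowercase names such as result_serializer, and the Redis URL is a placeholder):

# celeryconfig.py (sketch)
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"  # adjust to your Redis instance
CELERY_TASK_SERIALIZER = "json"
CELERY_RESULT_SERIALIZER = "json"   # store results as JSON instead of pickle
CELERY_ACCEPT_CONTENT = ["json"]    # reject pickled messages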

Does libhdfs c/c++ api support read/write compressed file

I found some discussion from around 2010 saying that libhdfs does not support reading/writing gzip files.
I downloaded the newest hadoop-2.0.4 and read hdfs.h. There are no compression-related arguments.
Now I am wondering whether it supports reading compressed files by now.
If it doesn't, how can I make a patch for libhdfs so that it works?
As far as I know, libhdfs only uses JNI to access HDFS. If you are familiar with the HDFS Java API, libhdfs is just a wrapper around org.apache.hadoop.fs.FSDataInputStream, so it cannot read compressed files directly.
I guess that you want to access files in HDFS from C/C++. If so, you can use libhdfs to read the raw file and a C/C++ compression library to decompress the content; the compressed file format is the same. For example, if the files are compressed with lzo, you can use the lzo library to decompress them.
But if the file is a sequence file, you may need to use JNI to access it, since sequence files are a Hadoop-specific format. I have seen Impala do similar work before, but it's not out-of-the-box.
Thanks for the reply. Using libhdfs to read the raw file and then zlib to inflate the content works. The files are gzip-compressed; I used code like this:
/* buf holds `readlen` bytes of the compressed file read from HDFS via libhdfs;
 * buf1 is a 4096 * 4096 byte output buffer for the decompressed data. */
z_stream gzip_stream;
gzip_stream.zalloc = (alloc_func)0;
gzip_stream.zfree = (free_func)0;
gzip_stream.opaque = (voidpf)0;
gzip_stream.next_in = (Bytef *)buf;
gzip_stream.avail_in = readlen;
gzip_stream.next_out = (Bytef *)buf1;
gzip_stream.avail_out = 4096 * 4096;

/* 16 + MAX_WBITS tells zlib to expect a gzip header */
ret = inflateInit2(&gzip_stream, 16 + MAX_WBITS);
if (ret != Z_OK) {
    printf("inflate init error\n");
}
ret = inflate(&gzip_stream, Z_NO_FLUSH);
ret = inflateEnd(&gzip_stream);

/* NUL-terminate before printing, then return the decompressed buffer */
buf1[gzip_stream.total_out] = '\0';
printf("the buf \n%s\n", buf1);
return buf1;