Does the libhdfs C/C++ API support reading/writing compressed files?

I found a discussion from around 2010 saying that libhdfs does not support reading/writing gzip files.
I downloaded the newest hadoop-2.0.4 and read hdfs.h; there are no compression-related arguments there either.
Now I am wondering whether it supports reading compressed files.
If it does not, how can I make a patch for libhdfs to make it work?
Thanks in advance.

As far as I know, libhdfs only uses JNI to access HDFS. If you are familiar with the HDFS Java API, libhdfs is just a wrapper around org.apache.hadoop.fs.FSDataInputStream, so it cannot read compressed files directly.
I guess you want to access files in HDFS from C/C++. If so, you can use libhdfs to read the raw file and then use a C/C++ compression library to decompress the content; the compressed file format is the same as it would be on a local disk. For example, if the files are compressed with LZO, you can use the LZO library to decompress them.
But if the file is a SequenceFile, you may need to use JNI to access it, since that is a Hadoop-specific format. I have seen Impala do similar work before, but it is not out of the box.
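For illustration, here is a minimal sketch of the "read the raw bytes with libhdfs" part; the path /tmp/example.gz and the use of the default configured NameNode are assumptions, not details from the question:

/* Sketch: read the still-compressed bytes of a gzip file out of HDFS. */
#include <fcntl.h>
#include <stdio.h>
#include "hdfs.h"

int main(void) {
    hdfsFS fs = hdfsConnect("default", 0);                  /* default configured NameNode */
    if (fs == NULL) { fprintf(stderr, "hdfsConnect failed\n"); return 1; }

    hdfsFile in = hdfsOpenFile(fs, "/tmp/example.gz", O_RDONLY, 0, 0, 0);
    if (in == NULL) { fprintf(stderr, "hdfsOpenFile failed\n"); return 1; }

    char buf[65536];
    tSize n;
    while ((n = hdfsRead(fs, in, buf, sizeof(buf))) > 0) {
        /* buf holds n bytes of raw gzip data; hand them to zlib (or the LZO
         * library) here instead of echoing them, which is just a placeholder. */
        fwrite(buf, 1, (size_t)n, stdout);
    }

    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);
    return 0;
}

From there the buffered bytes can be fed to whatever decompression library matches the file format, as described above.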

Thanks for the reply. Using libhdfs to read the raw file and then zlib to inflate the content does work. The file was gzip-compressed. I used code like this:
z_stream gzip_stream;
gzip_stream.zalloc = (alloc_func)0;
gzip_stream.zfree = (free_func)0;
gzip_stream.opaque = (voidpf)0;
gzip_stream.next_in = buf;            /* raw gzip bytes read via hdfsRead() */
gzip_stream.avail_in = readlen;
gzip_stream.next_out = buf1;          /* output buffer for the inflated data */
gzip_stream.avail_out = 4096 * 4096;
/* 16 + MAX_WBITS tells zlib to expect a gzip header rather than a raw deflate stream */
ret = inflateInit2(&gzip_stream, 16 + MAX_WBITS);
if (ret != Z_OK) {
    printf("inflate init error\n");
}
ret = inflate(&gzip_stream, Z_NO_FLUSH);
ret = inflateEnd(&gzip_stream);
printf("the buf \n%s\n", buf1);
return buf1;                          /* return the decompressed data, not the raw input */
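Note that a single inflate() call like this only works when all of the compressed input is already in buf and the whole decompressed result fits in buf1. For larger files the usual pattern is to loop until zlib returns Z_STREAM_END; a rough sketch, where fill_input(), handle_output(), BUF_SIZE and OUT_SIZE are hypothetical placeholders for another hdfsRead() call, for whatever consumes the inflated bytes, and for the sizes of buf and buf1:

/* Streaming variant: keep feeding zlib until it reports the end of the stream. */
int zret = Z_OK;
while (zret != Z_STREAM_END) {
    if (gzip_stream.avail_in == 0) {
        gzip_stream.next_in = buf;
        gzip_stream.avail_in = fill_input(buf, BUF_SIZE);   /* e.g. another hdfsRead() */
        if (gzip_stream.avail_in == 0) break;               /* truncated stream */
    }
    gzip_stream.next_out = buf1;
    gzip_stream.avail_out = OUT_SIZE;
    zret = inflate(&gzip_stream, Z_NO_FLUSH);
    if (zret != Z_OK && zret != Z_STREAM_END) break;        /* e.g. Z_DATA_ERROR */
    handle_output(buf1, OUT_SIZE - gzip_stream.avail_out);  /* consume inflated bytes */
}
inflateEnd(&gzip_stream);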

Related

Control the compression level when writing Parquet files using Polars in Rust

I found that by default polars' output Parquet files are around 35% larger than Parquet files output by Spark (on the same data). Spark uses snappy for compression by default, and it doesn't help if I switch ParquetCompression to snappy in polars. I wonder whether this is because polars uses a more conservative compression level. Is there any way to control the compression level of Parquet files in polars? I checked the polars docs, and it seems that only Zstd accepts a ZstdLevel (I'm not even sure whether that is a compression level).
Below is my code to write a DataFrame to a Parquet file using snappy compression:
use std::fs::File;
use std::io::BufWriter;
use polars::prelude::*;

let f = File::create("j.parquet").expect("Unable to create the file j.parquet!");
let bfw = BufWriter::new(f);
let pw = ParquetWriter::new(bfw).with_compression(ParquetCompression::Snappy);
pw.finish(&mut df).expect("Failed to write j.parquet!");
This is not (yet) possible in Rust polars. It will likely land in the next release of arrow2, and then we can implement it in polars as well.
If you want that functionality in Python polars, you can leverage pyarrow for this purpose; polars has zero-copy interop with pyarrow.

Why does extracting an archive in Flutter show files not in the archive that are prefixed with ._?

I have a tar + gzipped file I download and decompress/extract in a Flutter app. The extraction code looks like this:
final gzDecoder = GZipDecoder();
final tar = await gzDecoder.decodeBytes(file.readAsBytesSync());
final tarDecoder = TarDecoder();
final archive = tarDecoder.decodeBytes(tar);
for (final file in archive) {
  print(file);
  ...
When I print out all the files in the archive like above, I see things like:
./question_7815.mp3
./._question_7814.mp3
where the original archive only has ./question_7815.mp3 (not the file prefixed with ._).
Furthermore, when printing the file size (print(file.size)) I see that the files prefixed with ._ are not the same size, so they do in fact appear to be different files, and they are much smaller.
Anyone know why this happens and potentially how to prevent it?
That's the AppleDouble format, so that tar file almost certainly came from a Mac originally. The underscore file contains extended attribute information. You don't necessarily need to prevent it: you can just ignore those files, or exclude them during extraction. It is also possible to leave them out when tarring on the Mac side, using the --no-mac-metadata option to tar.

How does numpy handle mmaps over npz files?

I have a case where I would like to open a compressed numpy file using mmap mode, but can't seem to find any documentation about how it will work under the covers. For example, will it decompress the archive in memory and then mmap it? Will it decompress on the fly?
The documentation is absent for that configuration.
The short answer, based on looking at the code, is that archiving and compression, whether using np.savez or gzip, is not compatible with accessing files in mmap_mode. It's not just a matter of how it is done, but whether it can be done at all.
Relevant bits in the np.load function:
elif isinstance(file, gzip.GzipFile):
    fid = seek_gzip_factory(file)
...
if magic.startswith(_ZIP_PREFIX):
    # zip-file (assume .npz)
    # Transfer file ownership to NpzFile
    tmp = own_fid
    own_fid = False
    return NpzFile(fid, own_fid=tmp)
...
if mmap_mode:
    return format.open_memmap(file, mode=mmap_mode)
Look at np.lib.npyio.NpzFile. An npz file is a ZIP archive of .npy files. It loads a dictionary-like object, and only loads the individual variables (arrays) when you access them (e.g. obj[key]). There's no provision in its code for opening those individual files in mmap_mode.
It's pretty obvious that a file created with np.savez cannot be accessed as a mmap. The ZIP archiving and compression is not the same as the gzip compression handled earlier in np.load.
But what of a single array saved with np.save and then gzipped? Note that format.open_memmap is called with file, not fid (which might be a gzip file).
More details on open_memmap are in np.lib.npyio.format. Its first test is that file must be a string, not an existing file object fid. It ends up delegating the work to np.memmap. I don't see any provision in that function for gzip.

How can I allow more file extensions with Drupal file uploads?

I've got a module that has to let users upload files and everything works as long as the files are in the standard array of allowed extensions. I've tried using file_validate_extensions, but this doesn't seem to change anything.
This is the code I'm using to upload now (the docx extension is added to the standard Drupal allowed ones, but it doesn't seem to get picked up):
$fid = $form_state['values']['attachment'];
$file = file_load($fid);
if ($file != null) {
  file_validate_extensions($file, "jpg jpeg gif png txt doc xls pdf ppt pps odt ods odp docx");
  $file->status = FILE_STATUS_PERMANENT;
  file_save($file);
}
I just looked at the Drupal API, and it seems that you can use the function file_save_upload (with $validators as an array of allowed extensions); this gets the file in a temporary state. Then you have to call file_save to make it permanent.

Convert pdf from version 1.1 to 1.4 (or higher)

How can I convert PDF files from version 1.1 to 1.4 (or higher)?
I need some sort of command-line tool for batch conversion, or some API, so that I can convert several documents dynamically.
Use the Ghostscript tool:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -o output.pdf input.pdf
PDF 1.1 is forward compatible with PDF 1.4. Everything in PDF 1.1 will work in PDF 1.4; that's guaranteed by the spec. Let's assume you have some justifiable reason why this is not good enough for you (for example, a non-spec-compliant tool that consumes PDF and explodes on any file with a version less than 1.4).
We can focus on the main syntactic differences between versions.
All PDF files have a header somewhere in the first 1024 bytes. In most cases it's the very first line, but that's not guaranteed (I'm looking at you, Ghostscript!). The header looks like this in PDF 1.1:
%PDF-1.1
in PDF 1.4, it looks like this:
%PDF-1.4
So in theory, all you need is a tool that will look in the first 1024 bytes of a file for "%PDF-1.1" and change it to "%PDF-1.4". You could use sed, perl, etc. to do something like that for you. You could write it in C, and you would be tempted to do something like this:
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PDFHEADERSIZE 1024

bool ChangeFileToNewPdfVersion(char *file)
{
    char *replacePoint = NULL;
    FILE *fp = fopen(file, "r+b");   /* "rw" is not a valid mode; open for read/update */
    char buf[PDFHEADERSIZE + 1];
    buf[PDFHEADERSIZE] = '\0';
    if (fp == NULL) return false;
    if (fread(buf, 1, PDFHEADERSIZE, fp) != PDFHEADERSIZE) { fclose(fp); return false; }
    fseek(fp, 0, SEEK_SET);
    if ((replacePoint = strstr(buf, "%PDF-1.1")) == NULL) { fclose(fp); return false; }
    replacePoint[7] = '4';
    if (fwrite(buf, 1, PDFHEADERSIZE, fp) != PDFHEADERSIZE) { fclose(fp); return false; }
    fflush(fp);
    fclose(fp);
    return true;
}
which will work in most sane cases. It will not work if the file starts, for example, with 0 bytes, which would serve as null terminators in the block of data.
A better choice (really) would be to cobble together a simple state machine that finds "%PDF-1." by reading 1 byte at a time until it either finds the marker or passes offset 1017 (1024 less the header length), then reads the next byte; if that byte is a '1', it seeks back one byte and writes a '4'.
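A rough sketch of that idea, using an illustrative function name (BumpPdfHeaderVersion) rather than anything standard:

/* Patch "%PDF-1.1" to "%PDF-1.4" by scanning one byte at a time, so embedded
 * zero bytes in the first 1024 bytes cannot cut the search short. */
#include <stdbool.h>
#include <stdio.h>

bool BumpPdfHeaderVersion(const char *path)
{
    const char marker[] = "%PDF-1.";            /* the 7 characters before the minor digit */
    size_t matched = 0;
    FILE *fp = fopen(path, "r+b");
    if (fp == NULL) return false;

    for (long pos = 0; pos < 1024; pos++) {
        int c = fgetc(fp);
        if (c == EOF) break;
        if (matched == sizeof(marker) - 1) {
            /* c is the minor version digit right after "%PDF-1." */
            if (c == '1') {
                fseek(fp, -1L, SEEK_CUR);       /* step back onto the digit */
                fputc('4', fp);
                fflush(fp);
                fclose(fp);
                return true;
            }
            break;                               /* some other version; leave it alone */
        }
        matched = (c == marker[matched]) ? matched + 1
                                         : (c == marker[0] ? 1 : 0);
    }
    fclose(fp);
    return false;
}

Because it compares byte by byte instead of calling strstr, zero bytes in the header region don't stop the search.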
The only other thing you would need to worry about is that PDF 1.4 suggests that the document catalog should contain a Version key with the file version. Since this is defined as optional in the spec, you are safe to ignore it.
So this will solve your problem. I do not, however, believe that you should need to do this. Really.
You should take some time to read part of the PDF spec, specifically section I.2 about version numbers and compatibility.
I just had this problem, trying to submit some PDFs to a financial institution. "We only support PDF 1.4 or newer." Apparently our HP scanner creates version 1.3 PDFs.
I opened the PDF file with Notepad++ and changed the 3 to a 4 and saved it. It was that simple.
It's the very first part of the file and it's in plain text.
Another option for a small number of PDF files is to open them in Chrome or another browser and then save as PDF or print to PDF. In my case, using Chrome, it saved to a newer PDF version and the bank accepted it.