I have a gzipped file that I need to read and decompress in my application. I just read through the zlib manual, and it appears that the zlib functions are able to operate via memory buffers, but the gzip interface is all file-based.
What is the most common method of dealing with gzipped files like this? Do I need to handle the gzip file format myself, pull out the deflated data, and pass it to the zlib functions?
Note: The file-based interface will not work because the file is in an archive on a read-only medium, so I can't extract the file first and use the gzip functions from zlib. This is an embedded Linux system.
You need to "read through" the zlib manual again, this time reading through it. inflateInit2() has an option to decompress gzip streams: add 16 to windowBits to decode gzip only, or add 32 to detect a zlib or gzip header automatically.
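As a rough illustration of the windowBits option, here is a minimal sketch using Python's zlib module, which wraps the same library. The helper name gunzip_buffer is made up for this example; the C equivalent would pass the same value to inflateInit2().

```python
import gzip
import zlib


def gunzip_buffer(compressed: bytes) -> bytes:
    """Decompress a complete gzip stream that is already held in memory."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    # instead of the default zlib wrapper.
    return zlib.decompress(compressed, 16 + zlib.MAX_WBITS)


if __name__ == "__main__":
    original = b"data read from the archive\n" * 100
    blob = gzip.compress(original)  # stand-in for bytes read from the medium
    assert gunzip_buffer(blob) == original
```

No temporary file is involved: the compressed bytes go in as a buffer and the decompressed bytes come back as a buffer.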
Can I concatenate two already gzipped files (using gzip) and then gunzip them?
As of today, I download the gzipped files from remote servers, gunzip them individually and then cat them to merge.
Looking to make things faster here by merging the gzipped files and then gunzipping.
Yes. The concatenation of gzip streams is also a valid gzip stream. The result of gunzipping is the concatenation of the uncompressed data.
You could have just tried it.
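Trying it takes only a few lines. A minimal sketch in Python, where the two byte strings stand in for the downloaded files:

```python
import gzip

# Two independently compressed gzip streams (placeholders for the real files)
part_a = gzip.compress(b"first file\n")
part_b = gzip.compress(b"second file\n")

# Plain byte concatenation, the same thing `cat a.gz b.gz` does
merged = part_a + part_b

# gzip.decompress reads every member in the stream, one after another
result = gzip.decompress(merged)
```

Here result should equal the two uncompressed inputs joined together, in order.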
I can decompress a small GZip file in memory but there are memory limitations on the cloud box that this will run on. I can get around this by doing it in chunks (~32 k). Is there an easy way to split up a GZip compressed file without reading through it?
Thanks,
Marc
Yes, you can use zlib to read a gzip file in chunks. No, you cannot split a gzip file without decoding it.
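A sketch of the chunked approach, assuming Python's zlib bindings and a single-member gzip stream. The function name iter_gunzip and the 32 KiB chunk size are illustrative choices, not anything from the question's code.

```python
import gzip
import io
import zlib

CHUNK = 32 * 1024  # read size, matching the ~32 k mentioned in the question


def iter_gunzip(fileobj, chunk_size=CHUNK):
    """Yield decompressed pieces from a single-member gzip stream,
    reading chunk_size compressed bytes at a time."""
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + ... = gzip format
    while not decomp.eof:
        raw = fileobj.read(chunk_size)
        if not raw:
            break
        while raw:
            # max_length caps the output of each call; input that was not
            # processed yet is kept in unconsumed_tail and fed back in.
            out = decomp.decompress(raw, chunk_size)
            if out:
                yield out
            raw = decomp.unconsumed_tail
    # Emit whatever output is still buffered inside the decompressor
    tail = decomp.flush()
    if tail:
        yield tail


if __name__ == "__main__":
    data = bytes(range(256)) * 4096
    source = io.BytesIO(gzip.compress(data))
    assert b"".join(iter_gunzip(source)) == data
```

Only one input chunk and one output piece need to be in memory at a time, but the stream still has to be decoded from the start; there is no way to jump into the middle of it.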
When I try to upload an uncompressed JSON file, it works fine; but when I try a gzipped version of the same JSON file, the job fails with a lexical error caused by a failure to parse the JSON content.
I gzipped the json file with the gzip command from Mac OSX 10.8 and I have set the sourceFormat to: "NEWLINE_DELIMITED_JSON".
Did I do something incorrectly, or should a gzipped JSON file be processed differently?
I believe that it is not possible to submit binary data (such as the compressed file) using the multipart/related request. However, if you don't want to use uncompressed data, you may be able to use resumable upload.
What language are you coding in? The python jobs.insert() api takes a media upload parameter, which you should be able to give a filename to in order to do resumable upload (which sends your job metadata and new table data as separate streams). I was able to use this to upload a compressed file.
This is what bq.py uses, so you could look at the source code here.
If you aren't using python, the googleapis client libraries for other languages should have similar functionality.
You can upload gzipped files to Google Cloud Storage, and BigQuery will be able to ingest them with a load job:
https://developers.google.com/bigquery/loading-data-into-bigquery#loaddatagcs
How do I create a gzip stream using zlib? Is any code available?
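Here is the gzip file format. What you need is to output the member header, followed by the raw deflate data that zlib produces (without the zlib header and checksum), followed by the trailer holding the CRC-32 and length of the uncompressed data. Alternatively, deflateInit2() can write the gzip header and trailer for you if you add 16 to windowBits.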
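To make the layout concrete, here is a hedged sketch in Python that builds a gzip member by hand and, for comparison, lets zlib write the wrapper itself. The function names are invented for this example, and the header uses the simplest legal values (no flags, zero timestamp, unknown OS).

```python
import gzip
import struct
import zlib


def make_gzip_member(data: bytes, level: int = 6) -> bytes:
    """Assemble header + raw deflate + trailer manually."""
    # 10-byte header: magic 1f 8b, CM=8 (deflate), FLG=0, MTIME=0,
    # XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, 0, 0, 0, 255)
    # Negative wbits selects raw deflate: no zlib header, no Adler-32
    comp = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = comp.compress(data) + comp.flush()
    # Trailer: CRC-32 of the uncompressed data, then its length mod 2**32,
    # both little-endian
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF,
                          len(data) & 0xFFFFFFFF)
    return header + body + trailer


def make_gzip_with_zlib(data: bytes, level: int = 6) -> bytes:
    """Let zlib emit the gzip header and trailer (wbits = 16 + MAX_WBITS)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    return comp.compress(data) + comp.flush()


if __name__ == "__main__":
    payload = b"example payload\n" * 50
    assert gzip.decompress(make_gzip_member(payload)) == payload
    assert gzip.decompress(make_gzip_with_zlib(payload)) == payload
```

The second function is usually the simpler route; the first is mainly useful for seeing which bytes belong to the gzip wrapper and which belong to deflate.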
I hope that's enough...
Is it possible to send pre-compressed files that are contained within an EAR file? More specifically, the JSP and JS files within the WAR file. I am using Apache HTTP as the web server, and although it is simple to turn on the deflate module and set it up to use a pre-compressed version of the files, I would like to apply this to files that are contained within an EAR file that is deployed to JBoss. The reason is that the content is quite static, and compressing it on the fly each time is quite costly in terms of CPU time.
Quite frankly, I am not entirely familiar with how JBoss deploys these EAR files and 'serves' them. The gist of what I want to do is pre-compress the files contained inside the war so that when they are requested they are sent back to the client with gzip for Content-Encoding.
In theory, you could compress them before packaging them in the EAR, and then serve them up with a custom controller which adds the HTTP header to the response that tells the client they're compressed, but that seems like a lot of effort to go to.
When you say that on-the-fly compression is quite costly, have you actually measured it? Have you tried requesting a large number of uncompressed pages, measured the CPU usage, then tried it again with compressed pages? I think you may be over-estimating the impact. It uses quite low-intensity stream compression, designed to use little CPU.
You need to be very sure that you have a real performance problem before going to such lengths to mitigate it.
I don't frequent this site often and I seem to have left this thread hanging. Sorry about that. I did succeed in getting compression applied to my JavaScript and CSS files. What I did was precompress them with gzip in the ant build process. I then had to spoof the name to get rid of the gzip extension: I had foo.js and compressed it into foo.js.gzip, then renamed foo.js.gzip to foo.js, and this is the file that gets packaged into the WAR file. That handles the precompression part. To get this file served up properly, we just have to tell the browser that the file is compressed, via the Content-Encoding header of the HTTP response. This was done via an output filter that is applied to files matching the *.js extension (some Java/JBoss, WEB-INF/web.xml, if it helps; I'm not too familiar with this, so sorry guys).