How can I decompress a gzip stream with zlib? - gzip

Gzip format files (created with the gzip program, for example) use the "deflate" compression method, the same method zlib uses. However, when zlib is asked to inflate a gzip-compressed file, the library returns a Z_DATA_ERROR.
How can I use zlib to decompress a gzip file?

To decompress a gzip format file with zlib, call inflateInit2 with the windowBits parameter as 16+MAX_WBITS, like this:
inflateInit2(&stream, 16+MAX_WBITS);
If you don't do this, zlib will complain about a bad stream format. By default, zlib expects a zlib header on inflate and does not recognise the different gzip header unless you tell it to. This has been documented in the zlib.h header file since version 1.2.1, but it is not in the zlib manual. From the header file:
windowBits can also be greater than 15 for optional gzip decoding. Add
32 to windowBits to enable zlib and gzip decoding with automatic header
detection, or add 16 to decode only the gzip format (the zlib format will
return a Z_DATA_ERROR). If a gzip stream is being decoded, strm->adler is
a crc32 instead of an adler32.

python
The zlib library supports:
RFC 1950 (zlib compressed format)
RFC 1951 (deflate compressed format)
RFC 1952 (gzip compressed format)
The Python zlib module supports these as well.
choosing windowBits
zlib can decompress all of these formats:
to (de-)compress deflate format, use wbits = -zlib.MAX_WBITS
to (de-)compress zlib format, use wbits = zlib.MAX_WBITS
to (de-)compress gzip format, use wbits = zlib.MAX_WBITS | 16
See the documentation at http://www.zlib.net/manual.html#Advanced (section inflateInit2).
examples
test data:
>>> deflate_compress = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
>>> zlib_compress = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS)
>>> gzip_compress = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)
>>>
>>> text = '''test'''
>>> deflate_data = deflate_compress.compress(text) + deflate_compress.flush()
>>> zlib_data = zlib_compress.compress(text) + zlib_compress.flush()
>>> gzip_data = gzip_compress.compress(text) + gzip_compress.flush()
>>>
obvious test for zlib:
>>> zlib.decompress(zlib_data)
'test'
test for deflate:
>>> zlib.decompress(deflate_data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check
>>> zlib.decompress(deflate_data, -zlib.MAX_WBITS)
'test'
test for gzip:
>>> zlib.decompress(gzip_data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check
>>> zlib.decompress(gzip_data, zlib.MAX_WBITS|16)
'test'
the data is also compatible with gzip module:
>>> import gzip
>>> import StringIO
>>> fio = StringIO.StringIO(gzip_data)
>>> f = gzip.GzipFile(fileobj=fio)
>>> f.read()
'test'
>>> f.close()
automatic header detection (zlib or gzip)
adding 32 to windowBits will trigger header detection
>>> zlib.decompress(gzip_data, zlib.MAX_WBITS|32)
'test'
>>> zlib.decompress(zlib_data, zlib.MAX_WBITS|32)
'test'
using gzip instead
For gzip data with a gzip header you can use the gzip module directly; but remember that under the hood, gzip uses zlib.
fh = gzip.open('abc.gz', 'rb')
cdata = fh.read()
fh.close()

The structures of zlib and gzip streams differ: zlib follows RFC 1950 and gzip follows RFC 1952, so they have different headers, but the body of each is the same and follows RFC 1951 (deflate).
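The different headers are visible in the bytes themselves — a small sketch (the exact zlib first byte depends on the window size; 0x78 corresponds to the default 32K window):

```python
import zlib

# zlib container (RFC 1950)
zlib_data = zlib.compress(b"test")

# gzip container (RFC 1952)
gzip_c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)
gzip_data = gzip_c.compress(b"test") + gzip_c.flush()

print(zlib_data[:2].hex())  # zlib header (CMF/FLG bytes); 0x78 = deflate, 32K window
print(gzip_data[:2].hex())  # gzip magic number: 1f 8b

assert zlib_data[0] == 0x78
assert gzip_data[:2] == b"\x1f\x8b"
```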

Related

How to decode a .csv .gzip file containing tweets?

I'm trying to do a twitter sentiment analysis and my dataset is a couple of .csv.gzip files.
This is what I did to combine them all into one dataframe.
(I'm using google colab, if that has anything to do with the error, filename or something)
apr_files = [file[9:] for file in csv_collection if re.search(r"04+", file)]
apr_files
Output:
['0428_UkraineCombinedTweetsDeduped.csv.gzip',
'0430_UkraineCombinedTweetsDeduped.csv.gzip',
'0401_UkraineCombinedTweetsDeduped.csv.gzip']
temp_list = []
for file in apr_files:
    print(f"Reading in {file}")
    # unzip and read in the csv file as a dataframe
    temp = pd.read_csv(file, compression="gzip", header=0, index_col=0)
    # append dataframe to temp list
    temp_list.append(temp)
Error:
Reading in 0428_UkraineCombinedTweetsDeduped.csv.gzip
Reading in 0430_UkraineCombinedTweetsDeduped.csv.gzip
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2882: DtypeWarning: Columns (15) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code_obj, self.user_global_ns, self.user_ns)
Reading in 0401_UkraineCombinedTweetsDeduped.csv.gzip
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-26-5cba3ca01b1e> in <module>()
3 print(f"Reading in {file}")
4 # unzip and read in the csv file as a dataframe
----> 5 tmp_df = pd.read_csv(file, compression="gzip", header=0, index_col=0)
6 # append dataframe to temp list
7 tmp_df_list.append(tmp_df)
8 frames
/usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 8048: invalid start byte
I assumed that this error might be because the tweets contain many kinds of characters (emoji, non-English characters, etc.).
I switched to Jupyter Notebook, and it worked fine there.
As of now, I don't know what the issue with Google Colab was, though.
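For completeness, the loop in the question collects frames in temp_list but never combines them; one way to finish the job is pd.concat. A sketch, using small in-memory gzipped CSVs as stand-ins for the tweet files:

```python
import gzip
import io

import pandas as pd

def make_gzipped_csv(text: str) -> io.BytesIO:
    """Build an in-memory gzipped CSV, standing in for a .csv.gzip file on disk."""
    return io.BytesIO(gzip.compress(text.encode("utf-8")))

parts = [
    make_gzipped_csv("id,text\n1,hello\n"),
    make_gzipped_csv("id,text\n2,world\n"),
]

# Same read_csv call as in the question, then concatenate into one dataframe.
frames = [
    pd.read_csv(buf, compression="gzip", header=0, index_col=0)
    for buf in parts
]
df = pd.concat(frames)
```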

Pandas: UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

Hi community, I want to open a CSV using pandas and perform analysis on it. Please help, as I am not able to open the CSV itself. I tried opening it with UTF-8, Latin-1, and ISO-8859-1 encodings; none of them worked.
CODE:
csv_file3 = 'COVID-19-geographic-disbtribution-worldwide.csv'
with open(csv_file3, 'rt') as f:
    data = csv.reader(f)
    j = 0
    for row in data:
        j += 1
ERROR:
Traceback (most recent call last):
File "analysisofcases.py", line 87, in <module>
for row in data:
File "/usr/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte
This is the CSV that I want to open.
This is my code and the error when I ran the code. Please check and see what the problem is.
Try this; check the standard encodings as well.
data = pd.read_csv("COVID-19-geographic-disbtribution-worldwide.csv", encoding = 'unicode_escape', engine ='python')
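When the correct encoding is unknown, a small stdlib-only fallback loop can find one that decodes without error — a sketch (note that a clean decode does not guarantee a correct interpretation; latin-1 in particular accepts any byte sequence):

```python
import tempfile

def sniff_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file without error."""
    for enc in candidates:
        try:
            with open(path, encoding=enc) as f:
                f.read()
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings decoded the file")

# Demo: a CSV containing a Latin-1 byte (0xE9, "é") that is invalid UTF-8.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write("région,cas\n1,2\n".encode("latin-1"))

encoding = sniff_encoding(tmp.name)  # falls through "utf-8" to a single-byte encoding
```

The result can then be passed as the encoding= argument to pd.read_csv.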

Pandas read_csv failing on gzipped file with UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I am trying to process a series of .gz (gzipped) files. I would swear that they were reading successfully earlier when I first started debugging other parts of the code, but I can't swear to that. I switched to an uncompressed test file, so I could see what was causing some of the type conversions to fail. Once I got that debugged and I went to try processing the real gzipped files, I started getting errors. I would appreciate any ideas on what the problem might be and/or how to go about investigating it further.
I have stripped it down to the following code:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
filename = './small_test.csv.gz'
names = ['string_var','int_var','float_var','date_var']
types = {'string_var': 'string','int_var':'int64','float_var':'float64','date_var':'string'}
with open(filename) as csvfile:
    print(filename)
    # df = pd.read_csv(csvfile, names=names, header=0, dtype=types)
    # df = pd.read_csv(csvfile, compression='gzip')
    df = pd.read_csv(csvfile)
print(df.info(verbose=True))
I have tried just specifying the file and defaulting everything, specifying the file and the compression, and doing what I really need to do, which is specifying the names and types as well. I have also tried all those combinations on my full data set. They all fail in the same way with the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I found other questions on stackoverflow suggesting it was an encoding problem. I have the proper .gz extension that read_csv uses to infer, and I also explicitly specified it. The stack trace (below) shows it is getting into the gzip routine. The file -I command properly identifies the compressed file as gzip:
small_test.csv.gz: application/x-gzip; charset=binary
and the text file as ASCII:
small_test.csv: text/plain; charset=us-ascii
so that doesn't appear to be the problem.
Based on the above, I also tried encoding='ascii' and encoding='us-ascii'. They failed in the same way.
There was another question where the file didn't have the .gz extension, so pandas tried to read the gzipped data as uncompressed, but that is not my issue. If I unzip the file, it works fine; if I re-zip it, it fails again. gzcat and gzip work just fine on all the files, so I don't think it is a corruption issue.
In case it is useful, here is the test file:
"string_var","int_var","float_var","date_var"
a,1,1.0,"2020-01-01 21:20:19"
b,2,2.0,"2019-10-31 00:00:00"
c,3,3.0,"1969-06-22 12:00:00"
And finally, this is the entire stack trace:
Traceback (most recent call last):
File "./test_read_csv.py", line 14, in <module>
df = pd.read_csv(csvfile,compression='gzip',encoding='us-ascii')
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 529, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 719, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2062, in pandas._libs.parsers.raise_parser_error
File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 463, in read
if not self._read_gzip_header():
File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 406, in _read_gzip_header
magic = self._fp.read(2)
File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 91, in read
self.file.read(size-self._length+read)
File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Well, after digging through the Pandas code with a ton of help from my colleague, we figured this out. Here is the short version: if you want to open a gzipped file and pass it to read_csv(), you have to open it in binary AND specify the compression:
with open(filename, 'rb') as csvfile:
    df = pd.read_csv(csvfile, compression='gzip')
Letting read_csv() do the open also works:
read_csv(filename) #filename is a string ending in .gz
The primary problem is that I did not open the file in binary. Since I did not, csvfile had a default encoding of UTF-8. So, here are the scenarios:
with open(filename) as csvfile:  # not binary
    read_csv(csvfile): Pandas uses a text parser, which fails because the file is gzipped.
    read_csv(csvfile, compression='gzip'): This is what I worked on most. It did get down into gzip (which was what was so confusing) and then called read_header, but since the file handle was set to UTF-8, it again used the text reader and failed.
with open(filename, 'rb') as csvfile:  # binary
    read_csv(csvfile): This still fails, this time because the default for compression is 'infer', BUT if you read the docs closely, 'infer' only works if the input is path-like. It infers from the file extension, which it didn't have because it was passed a file handle, not a string path. This ends up identical to the read_csv(csvfile) case above where the file wasn't opened in binary.
    read_csv(csvfile, compression='gzip'): This is what works. The file handle is binary, so Pandas doesn't run it through a UTF-8 reader, and it is explicitly told the data is gzipped, so it calls the gzip library.
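These scenarios can be reproduced with nothing but the standard library — the gzip magic bytes 1f 8b are not valid UTF-8, which is exactly why the text-mode handle fails. A sketch:

```python
import gzip
import os
import tempfile

# Write a small gzipped "CSV" to disk.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv.gz") as tmp:
    tmp.write(gzip.compress(b"a,b\n1,2\n"))
path = tmp.name

# Text mode: the UTF-8 decoder chokes on the 0x8b in the gzip magic number.
decode_failed = False
try:
    with open(path, encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError:
    # 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
    decode_failed = True

# Binary mode plus explicit decompression works.
with open(path, "rb") as f:
    content = gzip.open(f).read()

os.remove(path)
```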
I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing encoding="utf-8" with encoding = "ISO-8859-1" will solve the problem.
df = pd.read_csv(csv_file_or_csv.gz_file, encoding = "ISO-8859-1")

How to download and open only the first block of a bzip2 from S3?

I have a large bzip2 compressed file on S3 and I'm only interested in its first line. How can I read the first line(s) without downloading and decompressing the entire file?
import boto3
import io
import bz2

s3 = boto3.resource('s3')
s3_object = s3.Object("bucket-name", "path/file.bz2")
f_bz2 = s3_object.get(Range=f"bytes=0-100000")["Body"].read()
io_bz2 = io.BytesIO(f_bz2)
lines = []
with bz2.BZ2File(io_bz2, "r") as f:
    while True:
        lines.append(f.readline())
The compression block size for bzip2 ranges between 100 kB and 900 kB. The code above assumes 100 kB.
At the end, an exception is thrown:
EOFError: Compressed file ended before the end-of-stream marker was reached
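One way to make the snippet above robust is to catch the EOFError and keep whatever lines decoded cleanly before the byte range ran out — a sketch (whether any lines come back depends on whether the downloaded range covers at least the first complete bzip2 block):

```python
import bz2
import io

def read_first_lines(raw: bytes, max_lines: int = 1) -> list:
    """Read up to max_lines from a possibly truncated bzip2 byte range."""
    lines = []
    try:
        with bz2.BZ2File(io.BytesIO(raw)) as f:
            for _ in range(max_lines):
                line = f.readline()
                if not line:
                    break
                lines.append(line)
    except EOFError:
        # The byte range ended before the end-of-stream marker;
        # keep whatever decompressed cleanly.
        pass
    return lines
```

With the S3 ranged GET from the example above, raw would be the bytes read from the response body.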

Trouble reading NCES IPEDS csv file with pandas

I ran into trouble downloading and reading CSV files provided by the US Department of Education National Center for Education Statistics. Below is code that should run for folks who might be interested in helping me troubleshoot.
import requests, zipfile, io
import pandas as pd
# First example shows that the code can work. Works fine on years 2005
# and earlier.
url = 'https://nces.ed.gov/ipeds/datacenter/data/HD2005_Data_Stata.zip'
r_zip_file_2005 = requests.get(url, stream=True)
z_zip_file_2005 = zipfile.ZipFile(io.BytesIO(r_zip_file_2005.content))
z_zip_file_2005.extractall('.')
csv_2005_df = pd.read_csv('hd2005_data_stata.csv')
# Second example shows that something changed in the CSV files after
# 2005 (or seems to have changed).
url = 'https://nces.ed.gov/ipeds/datacenter/data/HD2006_Data_Stata.zip'
r_zip_file_2006 = requests.get(url, stream=True)
z_zip_file_2006 = zipfile.ZipFile(io.BytesIO(r_zip_file_2006.content))
z_zip_file_2006.extractall('.')
csv_2006_df = pd.read_csv('hd2006_data_stata.csv')
For 2006 Python raises:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 18: invalid start byte
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-26-b26a150e37ee> in <module>()
----> 1 csv_2006_df = pd.read_csv('hd2006_data_stata.csv')
Any tips on how to overcome this?
Only took 7 months... I figured out my answer. It wasn't rocket science.
csv_2006_df = pd.read_csv('hd2006_data_stata.csv',
                          encoding='ISO-8859-1')