How to stream and read a .tar.gz file from S3 with boto3?

On S3 there is a JSON file with the following format:
{"field1": "...", "field2": "...", ...}
{"field1": "...", "field2": "...", ...}
{"field1": "...", "field2": "...", ...}
The file is compressed in .tar.gz format, and its uncompressed size is ~30 GB, so I would like to read it in a streaming fashion.
Using the aws cli, I managed to locally do so with the following command:
aws s3 cp s3://${BUCKET_NAME}/${FILE_NAME}.tar.gz - | gunzip -c -
However, I would like to do it natively in Python 3.8.
Merging various solutions found online, I tried the following strategies:
1. Uncompressing the in-memory file [not working]
import boto3, gzip, json
from io import BytesIO
s3 = boto3.resource('s3')
key = 'FILE_NAME.tar.gz'
streaming_iterator = s3.Object('BUCKET_NAME', key).get()['Body'].iter_lines()
first_line = next(streaming_iterator)
gzipline = BytesIO(first_line)
gzipline = gzip.GzipFile(fileobj=gzipline)
print(gzipline.read())
Which raises
EOFError: Compressed file ended before the end-of-stream marker was reached
2. Using the external library smart_open [partially working]
import boto3
from smart_open import open

for line in open(
    f's3://${BUCKET_NAME}/${FILE_NAME}.tar.gz',
    mode="rb",
    transport_params={"client": boto3.client('s3')},
    encoding="raw_unicode_escape",
    compression=".gz"
):
    print(line)
This second solution works reasonably well for ASCII characters, but for some reason it turns non-ASCII characters into garbage; e.g.,
input: \xe5\x9b\xbe\xe6\xa0\x87\xe3\x80\x82
output: å\x9b¾æ\xa0\x87ã\x80\x82
expected output: 图标。
This leads me to think that the encoding I specified is wrong, but I have literally tried every encoding listed on this page, and the only ones that don't raise an exception are raw_unicode_escape, unicode_escape and palmos (?), yet they all produce garbage.
Any suggestion is welcome; thanks in advance.

The return from a call to get_object() is a StreamingBody object, which, as the name implies, allows you to read from the object in a streaming fashion. However, boto3 does not support seeking on this file object.
While you can pass this object to a tarfile.open call, you need to be careful. There are two caveats. First, you need to tell tarfile that you're passing it a non-seekable streaming object by using the | character in the mode string. Second, you can't do anything that would trigger a seek, such as getting a list of files first and then operating on those files.
Putting it all together is fairly straightforward: you just need to open the object using boto3, then process each file in the tar archive in turn:
import json
import tarfile

import boto3

# Use boto3 to read the object from S3
s3 = boto3.client('s3')
resp = s3.get_object(Bucket='example-bucket', Key='path/to/example.tar.gz')
obj = resp['Body']

# Open the tar file; the "|" is important, as it instructs
# tarfile that the fileobj is non-seekable
with tarfile.open(fileobj=obj, mode='r|gz') as tar:
    # Enumerate the tar file members as we extract data
    for member in tar:
        with tar.extractfile(member) as f:
            # Read each row in turn and decode it
            for row in f:
                row = json.loads(row)
                # Just print out the filename and results in this demo
                print(member.name, row)
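Given the ~30 GB of uncompressed data, you will likely want to write each member out to disk as JSON lines instead of printing it. A minimal variation on the above (the local output naming is just an illustration):
import json
import os
import tarfile

import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='example-bucket', Key='path/to/example.tar.gz')['Body']

with tarfile.open(fileobj=obj, mode='r|gz') as tar:
    for member in tar:
        if not member.isfile():
            continue
        # Hypothetical output path next to the script; adjust as needed
        out_path = os.path.basename(member.name) + '.jsonl'
        with tar.extractfile(member) as src, open(out_path, 'w', encoding='utf-8') as dst:
            for raw in src:
                record = json.loads(raw)  # validate each line as JSON
                dst.write(json.dumps(record, ensure_ascii=False) + '\n')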

Related

Julia load dataframe from s3 csv file

I'm having trouble finding an example to follow online for this simple use case:
Load a CSV file from an S3 object location into a Julia DataFrame.
Here is what I tried that didn't work:
using AWSS3, DataFrames, CSV
filepath = S3Path("s3://muh-bucket/path/data.csv")
CSV.File(filepath) |> DataFrames # fails
# but I am able to stat the file
stat(filepath)
#=
Status( mode = -rw-rw-rw-,
...etc
size = 2141032 (2.0M),
blksize = 4096 (4.0K),
blocks = 523,
mtime = 2021-09-01T23:55:26,
...etc
=#
I can also read the file to a string object locally:
data_as_string = String(AWSS3.read(filepath));
#"column_1\tcolumn_2\tcolumn_3\t...etc..."
My AWS config is in order, I can access the object from julia locally.
How do I get this into a DataFrame?
Thanks to help from the nice people on the Julia Slack channel (#data).
bytes = AWSS3.read(S3Path("s3://muh-bucket/path/data.csv"))
typeof(bytes)
# Vector{UInt8} (alias for Array{UInt8, 1})
df = CSV.read(bytes, DataFrame)
Bingo, I'm in business. The CSV.jl maintainer mentions that S3Path types used to work when passed to CSV.read, so perhaps this will be even simpler in the future.
Helpful SO post for getting AWS configs in order

Using pandas to open Excel files stored in GCS from command line

The following code snippet is from a Google tutorial; it simply prints the names of the files in a given GCP bucket:
from google.cloud import storage

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    # bucket_name = "your-bucket-name"
    storage_client = storage.Client()
    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        print(blob.name)

list_blobs('sn_project_data')
Now from the command line I can run:
$ python path/file.py
And in my terminal the files in said bucket are printed out. Great, it works!
However, this isn't quite my goal. I'm looking to open a file and act upon it. For example:
df = pd.read_excel(filename)
print(df.iloc[0])
However, when I pass the path to the above, the error returned reads "invalid file path." So I'm sure there is some sort of GCP specific function call to actually access these files...
What command(s) should I run?
Edit: This video https://www.youtube.com/watch?v=ED5vHa3fE1Q shows a trick for opening files, using StringIO in the process, but it doesn't support Excel files, so it's not an effective solution.
read_excel() does not support a Google Cloud Storage file path as of now, but it can read data as bytes.
pandas.read_excel(io, sheet_name=0, header=0, names=None,
index_col=None, usecols=None, squeeze=False, dtype=None, engine=None,
converters=None, true_values=None, false_values=None, skiprows=None,
nrows=None, na_values=None, keep_default_na=True, na_filter=True,
verbose=False, parse_dates=False, date_parser=None, thousands=None,
comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True,
storage_options=None)
Parameters: io : str, bytes, ExcelFile, xlrd.Book, path object, or
file-like object
What you can do is use the blob object and its download_as_bytes() method to get the object's contents as bytes.
Download the contents of this blob as a bytes object.
For this example I just used a random sample xlsx file and read the 1st sheet:
from google.cloud import storage
import pandas as pd
bucket_name = "your-bucket-name"
blob_name = "SampleData.xlsx"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
data_bytes = blob.download_as_bytes()
df = pd.read_excel(data_bytes)
print(df)
Test done.

How to read parquet files from aws s3 bucket and save them as jsons in jupyter

I am working in a Jupyter notebook with Python. I am trying to read all the parquet files within a folder in an AWS S3 bucket and save them as JSONs in a folder in my Jupyter directory. I have the following code, but I believe it is just reading them, and I would like to save them as JSONs. Thank you!
bucketname = 'my-bucket'
bucket = response.Bucket(bucketname)
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
If I understand your question correctly, you want to download each file to your file system instead of loading it in memory. Here is an example code snippet that does the job.
bucketname = 'my-bucket'
bucket = response.Bucket(bucketname)
for obj in bucket.objects.all():
    obj.Object().download_file('<specify-the-local-filename>')
You can find the docs here.
The parquet pip module will do just that: https://pypi.org/project/parquet/. They have an example as well, copied here for quick reference:
import parquet
import json

## assuming parquet file with two rows and three columns:
## foo bar baz
## 1   2   3
## 4   5   6

with open("test.parquet") as fo:
    # prints:
    # {"foo": 1, "bar": 2}
    # {"foo": 4, "bar": 5}
    for row in parquet.DictReader(fo, columns=['foo', 'bar']):
        print(json.dumps(row))
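To tie the two parts together for the original goal, here is a minimal sketch that reads each parquet object from S3 and writes it out locally as JSON lines; it assumes pandas with a parquet engine (pyarrow or fastparquet) is installed, and the bucket and prefix names are placeholders:
import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')  # placeholder bucket name

for obj in bucket.objects.filter(Prefix='my-folder/'):  # placeholder prefix
    if not obj.key.endswith('.parquet'):
        continue
    body = obj.get()['Body'].read()
    df = pd.read_parquet(io.BytesIO(body))
    # Write one JSON object per line into the notebook's working directory
    out_name = obj.key.rsplit('/', 1)[-1].replace('.parquet', '.json')
    df.to_json(out_name, orient='records', lines=True)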

(InternalError) when calling the SelectObjectContent operation in boto3

I have a series of JSON files that need to be split into multiple files to reduce their size. One issue is that the files are extracted using a third-party tool and arrive as a JSON object on a single line.
I can use S3 Select to process a small file (say around 300 MB uncompressed), but when I try to use a larger file, say 1 GB uncompressed (90 MB gzip compressed), I get the following error:
[ERROR] EventStreamError: An error occurred (InternalError) when calling the SelectObjectContent operation: We encountered an internal error. Please try again.
The query that I am trying to run is:
select count(*) as rowcount from s3object[*][*] s
I can't run the query from the console because the file is larger than 128 MB, but the code that performs the operation is as follows:
def execute_select_query(bucket, key, query):
    """
    Runs a query against an object in S3.
    """
    if key.endswith("gz"):
        compression = "GZIP"
    else:
        compression = "NONE"
    LOGGER.info("Running query |%s| against s3://%s/%s", query, bucket, key)
    return S3_CLIENT.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,
        InputSerialization={"JSON": {"Type": "DOCUMENT"}, "CompressionType": compression},
        OutputSerialization={'JSON': {}},
    )
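For reference, the caller then consumes the returned event stream roughly like this (a sketch of the usual pattern rather than the exact code):
def read_query_results(response):
    """Collect the record payloads from a select_object_content response."""
    payload = b""
    for event in response["Payload"]:
        if "Records" in event:
            payload += event["Records"]["Payload"]
    return payload.decode("utf-8")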

In Google Colab I get IOPub data rate exceeded

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.
Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
An IOPub error usually occurs when you try to print a large amount of data to the console. Check your print statements: if you're trying to print a file that exceeds 10 MB, it's likely that this caused the error. Try to read and print smaller portions of the file/data, as in the sketch below.
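For example, a minimal sketch that previews only the first few lines instead of dumping the whole file (the filename is just a placeholder):
# Print only a small preview instead of the whole file
with open("large_output.txt") as f:  # placeholder filename
    for i, line in enumerate(f):
        if i >= 20:  # stop after 20 lines
            print("... (truncated)")
            break
        print(line.rstrip())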
I faced this issue while reading a file from Google Drive to Colab.
I used this link https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/io.ipynb
and the problem was in this block of code
# Download the file we just uploaded.
#
# Replace the assignment below with your file ID
# to download a different file.
#
# A file ID looks like: 1uBtlaggVyWshwcyP6kEI-y_W3P8D26sz
file_id = 'target_file_id'

import io
from googleapiclient.http import MediaIoBaseDownload

request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()

downloaded.seek(0)
# Remove this print statement
# print('Downloaded file contents are: {}'.format(downloaded.read()))
I had to remove the last print statement since it exceeded the 10 MB limit in the notebook: print('Downloaded file contents are: {}'.format(downloaded.read()))
Your file will still be downloaded and you can read it in smaller chunks or read a portion of the file.
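A minimal sketch of reading the downloaded buffer in fixed-size chunks rather than all at once (the chunk size is arbitrary):
downloaded.seek(0)
total = 0
while True:
    chunk = downloaded.read(1024 * 1024)  # read 1 MB at a time
    if not chunk:
        break
    total += len(chunk)  # process each chunk here instead of printing the whole file
print('Read {} bytes in chunks'.format(total))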
The above answer is correct: I just commented out the print statement and the error went away. I'm keeping this here in case someone finds it useful. Suppose you are reading a CSV file from Google Drive: just import pandas and add pd.read_csv(downloaded), and it will work just fine.
file_id = 'FILEID'

import io
import pandas as pd
from googleapiclient.http import MediaIoBaseDownload

request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()

downloaded.seek(0)
pd.read_csv(downloaded)
Maybe this will help:
IOPub Error on Google Colaboratory in Jupyter Notebook (via sv1997)
The IOPub error occurs in Colab because you are trying to display very large output on the console itself (e.g. with print() statements).
The IOPub error may be related to a print() call, so delete or comment out the print statement; that may resolve the error.
%cd darknet
!sed -i 's/OPENCV=0/OPENCV=1/' Makefile
!sed -i 's/GPU=0/GPU=1/' Makefile
!sed -i 's/CUDNN=0/CUDNN=1/' Makefile
!sed -i 's/CUDNN_HALF=0/CUDNN_HALF=1/' Makefile
!apt update
!apt-get install libopencv-dev
It's important to update your Makefile, and also to keep your input file name correct.