How to download a CSV file from an S3 bucket into a numpy array

I have a csv file in an AWS S3 bucket. How do I download the CSV and assign it to a numpy array?
[Using python 3.6/boto3]
I've tried various forms including:
s3 = boto3.resource('s3', region_name=region)
obj = s3.Object(bucket, key)
with io.BytesIO(obj.get()["Body"].read()) as f:
    # rewind the file
    f.seek(0)
    arr_data = numpy.load(f)
arr_data = numpy.genfromtxt('https://BUCKETNAME.s3-eu-west-1.amazonaws.com/folder/infile.csv',dtype='str',delimiter=',')
This also doesn't work
Essentially I'm trying to replicate in S3:
arr_data = np.genfromtxt('path...input.csv',dtype='str',delimiter=',')

I was able to convert a csv to a numpy array using pandas in-between... not sure if that's what you're looking for. But here's how I did it:
import pandas as pd
import numpy as np
data_location = 's3://<path>'
data = pd.read_csv(data_location)
data_numpy = data.value.values.reshape(-1, 1)  # 'value' is a column in my CSV; data.values gives the whole frame
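If you want to skip pandas entirely, np.genfromtxt also accepts a file-like object, so one option is to wrap the S3 object's body in a text buffer yourself. A minimal sketch, assuming the same placeholder bucket/key as in the question:
import io
import boto3
import numpy as np

s3 = boto3.resource('s3', region_name='eu-west-1')
obj = s3.Object('BUCKETNAME', 'folder/infile.csv')

# Decode the object body to text and hand the buffer to genfromtxt
with io.StringIO(obj.get()['Body'].read().decode('utf-8')) as f:
    arr_data = np.genfromtxt(f, dtype='str', delimiter=',')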

Related

Write CSV to HDFS from stream with pyarrow upload

I am trying to save a Pandas DataFrame to HDFS in CSV format using pyarrow's upload method, but the saved CSV file is empty. The code example is below.
import io
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({"x": [1, 2, 3]})
buf = io.StringIO()
df.to_csv(buf)
hdfs = pa.hdfs.connect()
hdfs.upload("path/to/hdfs/test.csv", buf)
When I check the contents of test.csv on HDFS it is empty. What did I do wrong? Thanks.
You need to call buf.seek(0) before uploading.
Basically you need to rewind to the beginning of the buffer, otherwise HDFS thinks there's nothing to upload:
>>> buf.read()
''
>>> buf.seek(0)
0
>>> buf.read()
',x\n0,1\n1,2\n2,3\n'
>>> buf.read()
''
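Putting that together, a minimal sketch of the corrected upload (same DataFrame as above; pa.hdfs.connect() assumes a configured HDFS/libhdfs environment):
import io
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3]})

buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)  # rewind so upload() reads from the start of the buffer

hdfs = pa.hdfs.connect()
hdfs.upload("path/to/hdfs/test.csv", buf)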

AWS S3 and Sagemaker: No such file or directory

I have created an S3 bucket 'testshivaproject' and uploaded an image to it. When I try to access it in a SageMaker notebook, it throws the error 'No such file or directory'.
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np
# Define IAM role
role = get_execution_role()
my_region = boto3.session.Session().region_name # set the region of the instance
print("success :"+my_region)
Output: success :us-east-2
role
Output: 'arn:aws:iam::847047967498:role/service-role/AmazonSageMaker-ExecutionRole-20190825T121483'
bucket = 'testprojectshiva2'
data_key = 'ext_image6.jpg'
data_location = 's3://{}/{}'.format(bucket, data_key)
print(data_location)
Output: s3://testprojectshiva2/ext_image6.jpg
test = load_img(data_location)
Output: No such file or directory
Similar questions have been raised (Load S3 Data into AWS SageMaker Notebook), but I did not find a solution there.
Thanks for using Amazon SageMaker!
I sort of guessed from your description, but are you trying to use the Keras load_img function to load images directly from your S3 bucket?
Unfortunately, the load_img function is designed to only load files from disk, so passing an s3:// URL to that function will always raise a FileNotFoundError.
It's common to first download images from S3 before using them, so you can use boto3 or the AWS CLI to download the file before calling load_img.
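For example, a minimal sketch of that first option with boto3's download_file (the local path and the Keras import are illustrative assumptions):
import boto3
from tensorflow.keras.preprocessing.image import load_img  # or keras.preprocessing.image

s3 = boto3.client('s3')

# Download the object to local disk, then load it as usual
local_path = '/tmp/ext_image6.jpg'
s3.download_file(bucket, data_key, local_path)
test = load_img(local_path)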
Alternatively, since the load_img function simply creates a PIL Image object, you can create the PIL object directly from the data in S3 using boto3, and not use the load_img function at all.
In other words, you could do something like this:
import boto3
from io import BytesIO
from PIL import Image

s3 = boto3.client('s3')
test = Image.open(BytesIO(
    s3.get_object(Bucket=bucket, Key=data_key)['Body'].read()
))
Hope this helps you out in your project!
You may use the following code to pull a CSV file into SageMaker.
import pandas as pd
bucket='your-s3-bucket'
data_key = 'your.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
df = pd.read_csv(data_location)
An alternative way to format the data_location variable:
data_location = f's3://{bucket}/{data_key}'
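Note that reading an s3:// path directly with pd.read_csv relies on the s3fs package being available in the notebook kernel; if it is missing, pandas will raise an error asking you to install it.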

AWS Sagemaker: AttributeError: module 'pandas' has no attribute 'core'

Let me prefix this by saying I'm very new to tensorflow and even newer to AWS Sagemaker.
I have some tensorflow/keras code that I wrote and tested on a local dockerized Jupyter notebook and it runs fine. In it, I import a csv file as my input.
I use Sagemaker to spin up a jupyter notebook instance with conda_tensorflow_p36. I modified the pandas.read_csv() code to point to my input file, now hosted on a S3 bucket.
So I changed this line of code from
import pandas as pd
data = pd.read_csv("/input.csv", encoding="latin1")
to this
import pandas as pd
data = pd.read_csv("https://s3.amazonaws.com/my-sagemaker-bucket/input.csv", encoding="latin1")
and I get this error
AttributeError: module 'pandas' has no attribute 'core'
I'm not sure if it's a permissions issue. I read that as long as the bucket name contains the string "sagemaker", the notebook should have access to it.
Pull your data from S3, for example:
import boto3
import io
import pandas as pd
# Set below parameters
bucket = '<bucket name>'
key = 'data/training/iris.csv'
endpointName = 'decision-trees'
# Pull our data from S3
s3 = boto3.client('s3')
f = s3.get_object(Bucket=bucket, Key=key)
# Make a dataframe
shape = pd.read_csv(io.BytesIO(f['Body'].read()), header=None)
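If you then need a numpy array rather than a DataFrame (as in the original question), a short follow-up using the shape DataFrame from the snippet above:
# .to_numpy() returns the DataFrame's values as an ndarray
arr = shape.to_numpy()
print(arr.shape)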

Generating a NetCDF from a text file

Using Python can I open a text file, read it into an array, then save the file as a NetCDF?
The following script I wrote was not successful.
import os
import pandas as pd
import numpy as np
import PIL.Image as im

path = r'C:\path\to\data'  # raw string so backslashes are not treated as escapes
grb = []
for fn in os.listdir(path):
    file = os.path.join(path, fn)
    if os.path.isfile(file):
        df = pd.read_table(file, skiprows=6)
        grb.append(df)
df2 = np.array(grb)
#imarray = im.fromarray(df2) ##cannot handle this data type
#imarray.save('Save_Array_as_TIFF.tif')
I once used xray or xarray (they renamed themselves) to get a NetCDF file into an ASCII dataframe... I just googled and apparently they have a to_netcdf function.
Import xarray; it lets you work with labelled arrays much like pandas DataFrames. The to_netcdf method lives on xarray objects, so convert the DataFrame first.
So give this a try:
df.to_xarray().to_netcdf(file_path)
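A minimal end-to-end sketch along those lines, reusing the path and skiprows value from the question (the output filename is just an illustration):
import os
import pandas as pd

path = r'C:\path\to\data'  # placeholder path from the question

# Read every text file in the directory into one DataFrame
frames = []
for fn in os.listdir(path):
    file = os.path.join(path, fn)
    if os.path.isfile(file):
        frames.append(pd.read_table(file, skiprows=6))

df = pd.concat(frames, ignore_index=True)

# Convert to an xarray Dataset and write it out as NetCDF
df.to_xarray().to_netcdf('output.nc')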

Loading multiple files from Google Cloud Storage into a single Pandas Dataframe

I have been trying to write a function that loads multiple files from a Google Cloud Storage bucket into a single Pandas Dataframe, however I cannot seem to make it work.
import pandas as pd
from google.datalab import storage
from io import BytesIO

def gcs_loader(bucket_name, prefix):
    bucket = storage.Bucket(bucket_name)
    df = pd.DataFrame()
    for shard in bucket.objects(prefix=prefix):
        fp = shard.uri
        %gcs read -o $fp -v tmp
        df.append(read_csv(BytesIO(tmp))
    return df
When I try to run it, it says:
undefined variable referenced in command line: $fp
Sure, here's an example:
https://colab.research.google.com/notebook#fileId=0B7I8C_4vGdF6Ynl1X25iTHE4MGc
This notebook shows the following:
Creates two random CSVs
Uploads both CSV files to a GCS bucket
Uses the GCS Python API to iterate over the files in the bucket, and
Merges each file into a single Pandas DataFrame.
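For reference, here is a sketch of the same idea using the google-cloud-storage client directly instead of the Datalab %gcs magic (the bucket and prefix are placeholders, and download_as_bytes assumes a recent google-cloud-storage release):
import io
import pandas as pd
from google.cloud import storage

def gcs_loader(bucket_name, prefix):
    client = storage.Client()
    frames = []
    # Iterate over every object under the prefix and read it as CSV
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        frames.append(pd.read_csv(io.BytesIO(blob.download_as_bytes())))
    # Concatenate all shards into a single DataFrame
    return pd.concat(frames, ignore_index=True)

df = gcs_loader('my-bucket', 'shards/')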