Not able to read an HDF5 file in S3 from a SageMaker notebook instance - amazon-s3

My directory structure looks like this: bucket-name/training/file.hdf5
I tried reading this file in a SageMaker notebook instance with this code cell:
import h5py
bucket = 'bucket-name'
data_key = 'training/file.hdf5'
data_location = 's3://{}/{}'.format(bucket, data_key)
hf = h5py.File(data_location, 'r')
But it gives me this error:
Unable to open file (unable to open file: name = 's3://bucket-name/training/file.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
I have also tried pd.read_hdf(data_location) but was not successful.
Reading a CSV file into a dataframe from the same key does not throw an error.
Any help is appreciated. Thanks

Thanks for asking the question here!
Your file is on the remote storage service Amazon S3. The string data_location is not the name of a local file, so your data reader cannot open it. In order to read the S3 file, you have two options:
Use a library that can read files directly from S3. It seems that h5py can do that if you specify driver='ros3'.
Alternatively, you can bring the file from S3 to your machine and then read it locally. For example, use the AWS CLI to copy the file down: aws s3 cp s3://<your bucket>/<your file on s3> /home/ec2-user/SageMaker/, and then h5py.File('/home/ec2-user/SageMaker/your-file-name.hdf5', 'r') should work.
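For instance, here is a minimal sketch of the second option using boto3 instead of the AWS CLI (the bucket and key are the ones from the question; the local destination path is an assumption):
import boto3
import h5py
bucket = 'bucket-name'
data_key = 'training/file.hdf5'
local_path = '/home/ec2-user/SageMaker/file.hdf5'  # hypothetical local destination
# Download the object to the notebook instance's local disk,
# then open it with h5py as an ordinary local file.
s3 = boto3.client('s3')
s3.download_file(bucket, data_key, local_path)
with h5py.File(local_path, 'r') as hf:
    print(list(hf.keys()))  # list the top-level groups/datasets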

Related

How to upload large files from local pc to DBFS?

I am trying to learn Spark SQL in Databricks and want to work with the Yelp dataset; however, the file is too large to upload to DBFS from the UI. Thanks, Filip
There are several approaches to that:
Use Databricks CLI's dbfs command to upload local data to DBFS.
Download the dataset directly from the notebook, for example by using %sh wget URL, and unpack the archive to DBFS (either by using /dbfs/path/... as the destination, or by using the dbutils.fs.cp command to copy files from the driver node to DBFS; see the sketch after this list).
Upload the files to AWS S3, Azure Data Lake Storage, Google Cloud Storage, or similar, and access the data there.
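A minimal sketch of the dbutils.fs.cp step from the second option, assuming the archive has already been downloaded to the driver node (both paths below are placeholders):
local_file = 'file:/tmp/yelp_dataset.tar'              # assumed driver-node location of the download
dbfs_dest = 'dbfs:/FileStore/tables/yelp_dataset.tar'  # assumed DBFS destination
# Copy from the driver node's local filesystem into DBFS, then verify.
dbutils.fs.cp(local_file, dbfs_dest)
display(dbutils.fs.ls('dbfs:/FileStore/tables'))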
Upload the file you want to load in Databricks to Google Drive, then download it from there:
from urllib.request import urlopen
from shutil import copyfileobj
my_url = 'paste your url here'
my_filename = 'give your filename'
file_path = '/FileStore/tables'  # location to which you want to move the downloaded file
# Download the file from Google Drive to the Databricks driver node
with urlopen(my_url) as in_stream, open(my_filename, 'wb') as out_file:
    copyfileobj(in_stream, out_file)
# Check where the file has been downloaded
# (in my case it is file:/databricks/driver)
display(dbutils.fs.ls('file:/databricks/driver'))
# Move the file to the desired location:
# dbutils.fs.mv(downloaded_location, desired_location)
dbutils.fs.mv("file:/databricks/driver/my_file", file_path)
I hope this helps

Cannot Locate the file in AWS EMR Notebook

I have been trying to use some .txt and .csv files in an EMR Notebook, but I cannot locate them.
I am trying to read via:
with open('file.txt', 'r') as f:
    notes = f.read()
Things I tried:
Uploaded the file using the JupyterHub UI. I can see the file, but I can't read it from that path. I also checked the file using the JupyterHub terminal.
Tried to read from S3 (lots of people got it working this way):
with open('s3://<repo>/file.txt', 'r') as f:
Copied the file to HDFS on the master node (in the cluster) using both hdfs dfs and hadoop fs. The file is present in both directories.
However, I have no clue how I can reach the file in EMR Notebook.
Any ideas?
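As in the SageMaker question above, the built-in open() only accepts local paths, so the s3:// attempt cannot work directly; a minimal sketch of reading the object through boto3 instead (the bucket and key names below are placeholders, not from the question):
import boto3
bucket = 'your-bucket'   # placeholder
key = 'file.txt'         # placeholder
# Fetch the object from S3 and read its contents into memory.
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=key)
notes = obj['Body'].read().decode('utf-8')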

How to read parquet files from tar.bz2 using pyspark on EMR Notebook?

I'm trying to read parquet (snappy) files in tar.bz2 format from an S3 bucket, but receive this error:
"java.io.IOException: Could not read footer for file"
when using:
df = spark.read.parquet("s3://bucket/folder/filename.tar.bz2")
I have downloaded one of the tar.bz2 files and verified that it is valid parquet within the underlying tar directory.
I also looked into using S3 Select in the boto3 package, but it looks like BZIP is currently only supported for CSV and JSON files, not Parquet. I have also attempted to decompress the files and write them back to S3 like this:
sc.install_pypi_package("boto3")
import boto3
from io import BytesIO
import tarfile
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket,Key='folder/filename.tar.bz2')
wholefile = obj['Body'].read()
fileobj = BytesIO(wholefile)
tar = tarfile.open(fileobj=fileobj)
s3.upload_fileobj(Fileobj=tar, Bucket=bucket, Key='folder/filename.parquet')
However, this simply yields an "Interpreter died:" error message without any additional information (possibly due to insufficient memory in the cluster?). Is there a step that I'm missing to decompress the tar archive?
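One possible missing step (an assumption on my part, not a verified fix): upload_fileobj expects a file-like object, while tarfile.open returns a TarFile, so each member would need to be extracted and uploaded individually. A rough sketch, reusing the bucket variable from the snippet above:
import tarfile
from io import BytesIO
import boto3
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key='folder/filename.tar.bz2')
fileobj = BytesIO(obj['Body'].read())
# Walk the archive and upload each extracted parquet member as its own S3 object.
with tarfile.open(fileobj=fileobj, mode='r:bz2') as tar:
    for member in tar.getmembers():
        if member.isfile() and member.name.endswith('.parquet'):
            extracted = tar.extractfile(member)  # file-like object for this member
            s3.upload_fileobj(Fileobj=extracted, Bucket=bucket, Key='folder/' + member.name)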

What is the path for a bootstrapped file for a Pig job running in Amazon EMR

I bootstrap a data file in my EMR job. The bootstrapping succeeds and the file is copied to the /home/hadoop/contents/ folder with the right permissions.
However, when I try to access it in the Pig script like below:
userdidstopick = load '/home/hadoop/contents/UserIdsToPick.txt' AS (uid:chararray);
I get an error that the input path does not exist:
hdfs://10.183.166.176:9000/home/hadoop/contents/UserIdsToPick.txt
When running Ruby jobs, the bootstrapped file was always accessible under the /home/hadoop/contents/ folder and everything worked for me.
Is it different for Pig?
By default, Pig on EMR is configured to read from HDFS instead of the local filesystem, which is why the error shows an HDFS path.
There are two ways to solve this:
Either copy the file to S3 and load it directly from S3:
userdidstopick = load 's3_bucket_location/UserIdsToPick.txt' AS (uid:chararray);
Or you can first copy the file to HDFS (instead of the local filesystem), and then use the same path you are using today.
I would prefer the first option.

Why is my file not uploading to S3 using Node.js?

I am using this library
https://github.com/nuxusr/Node.js---Amazon-S3
for uploading files to S3.
In test-s3-upload.js I had commented out most of the tests because they were giving errors. As my goal is to upload a file to S3, I kept only the testUploadFileToBucket() test, and running node test.js reports ok.
But when I check in S3Fox, the uploaded file is not shown.
Why is the file not uploaded?
Use knox instead. https://github.com/learnboost/knox
Have a look at this project and especially the bin/amazon-s3-upload.js file so you can see how we're doing it using AwsSum:
https://github.com/appsattic/node-awssum-scripts/
https://github.com/appsattic/node-awssum/
It takes a bucket name and a filename and will stream the file up to S3 for you:
$ ./amazon-s3-upload.js -b your-bucket -f the-file.txt
Hope that helps. :)