Cannot locate a file in an AWS EMR Notebook

I have been trying to use some .txt, .csv files in an EMR Notebook but I cannot locate them.
I am trying to read via:
with open('file.txt', 'r') as f:
    notes = f.read()
Things I tried:
Uploaded the file via the JupyterHub UI. I can see the file, but I can't read it from that path. I also checked the file using the JupyterHub terminal.
Tried to read from S3 (lots of people seem to have gotten it working this way):
with open('s3://<repo>/file.txt', 'r') as f:
Copied the file to HDFS on the master node using both hdfs dfs and hadoop fs; the file shows up in both listings.
However, I have no clue how I can reach the file from the EMR Notebook.
Any ideas?
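As the EMR Notebook answers further down explain, the kernel typically runs on the attached cluster, so a plain open() on a local path or on an 's3://...' string will not find the file. A minimal sketch of reading it through Spark from S3 instead, assuming a PySpark kernel where spark is predefined and that the file has been uploaded to a bucket the cluster can read (the bucket name is a placeholder):

# Read the file through Spark from S3; the bucket name is a placeholder
df = spark.read.text('s3://your-bucket/file.txt')       # use spark.read.csv(...) for the .csv files
notes = '\n'.join(row.value for row in df.collect())    # gather the lines back into one string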

Related

How to upload large files from local PC to DBFS?

I am trying to learn Spark SQL in Databricks and want to work with the Yelp dataset; however, the file is too large to upload to DBFS from the UI. Thanks, Filip
There are several approaches to that:
Use the Databricks CLI's dbfs command to upload local data to DBFS.
Download the dataset directly from the notebook, for example with %sh wget URL, and unpack the archive to DBFS (either by using /dbfs/path/... as the destination, or by using the dbutils.fs.cp command to copy files from the driver node to DBFS); see the sketch after this list.
Upload the files to AWS S3, Azure Data Lake Storage, Google Cloud Storage, or similar, and access the data there.
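A minimal sketch of the second approach (download onto the driver node, then copy into DBFS); the URL and file names below are placeholders, not part of the original answer:

import urllib.request

local_path = '/tmp/yelp_dataset.tgz'                    # driver-node local filesystem
dbfs_path = 'dbfs:/FileStore/tables/yelp_dataset.tgz'   # destination in DBFS

# Download the archive onto the driver node
urllib.request.urlretrieve('https://example.com/yelp_dataset.tgz', local_path)

# Copy from the driver's local filesystem into DBFS so the data persists
dbutils.fs.cp('file:' + local_path, dbfs_path)

# Verify it landed in DBFS
display(dbutils.fs.ls('dbfs:/FileStore/tables/'))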
Upload the file you want to load into Databricks to Google Drive, then download it from there:
from urllib.request import urlopen
from shutil import copyfileobj

my_url = 'paste your url here'
my_filename = 'give your filename'
file_path = '/FileStore/tables'  # location to which you want to move the downloaded file

# Download the file from Google Drive to the Databricks driver node
with urlopen(my_url) as in_stream, open(my_filename, 'wb') as out_file:
    copyfileobj(in_stream, out_file)

# Check where the file was downloaded; in my case it is the driver's local filesystem
display(dbutils.fs.ls('file:/databricks/driver'))

# Move the file to the desired location:
# dbutils.fs.mv(downloaded_location, desired_location)
dbutils.fs.mv("file:/databricks/driver/my_file", file_path)
I hope this helps

Not able to read an HDF5 file present in S3 from a SageMaker notebook instance

My directory structure looks like this: bucket-name/training/file.hdf5
I tried reading this file in a SageMaker notebook instance with this code cell:
import h5py

bucket = 'bucket-name'
data_key = 'training/file.hdf5'
data_location = 's3://{}/{}'.format(bucket, data_key)
hf = h5py.File(data_location, 'r')
But it gives me error:
Unable to open file (unable to open file: name = 's3://bucket-name/training/file.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
I have also tried pd.read_hdf(data_location) but was not successful.
Reading a CSV file into a dataframe from the same location does not throw an error.
Any help is appreciated. Thanks
Thanks for asking the question here!
Your file is on the remote storage service Amazon S3. The string data_location is not the name of a local file, hence your data reader cannot open it. In order to read the S3 file, you have 2 options:
Use a library that can read files from S3 directly. It seems that h5py can do that if you specify driver='ros3'.
Alternatively, bring the file from S3 to your machine and then read it from there. For example, use the AWS CLI to copy the file down: aws s3 cp s3://<your bucket>/<your file on s3> /home/ec2-user/SageMaker/ . After that, h5py.File('/home/ec2-user/SageMaker/your-file-name.hdf5', 'r') should work; a sketch with boto3 follows below.
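A minimal sketch of the second option, assuming boto3 is available in the notebook; the bucket and key are the ones from the question, and the local filename is a placeholder:

import boto3
import h5py

bucket = 'bucket-name'
data_key = 'training/file.hdf5'
local_path = '/home/ec2-user/SageMaker/file.hdf5'  # local copy on the notebook instance

# Download the object from S3 to the notebook instance's local disk
s3 = boto3.client('s3')
s3.download_file(bucket, data_key, local_path)

# Now h5py can open it as an ordinary local file
with h5py.File(local_path, 'r') as hf:
    print(list(hf.keys()))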

PySpark EMR Notebook - Unable to save file to EMR environment

This seems really basic but I can't seem to figure it out. I am working in a PySpark Notebook on EMR and have taken a PySpark dataframe and converted it to a pandas dataframe using toPandas().
Now, I would like to save this dataframe to the local environment using the following code:
movie_franchise_counts.to_csv('test.csv')
But I keep getting a permission error:
[Errno 13] Permission denied: 'test.csv'
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/pandas/core/generic.py", line 3204, in to_csv
formatter.save()
File "/usr/local/lib64/python3.6/site-packages/pandas/io/formats/csvs.py", line 188, in save
compression=dict(self.compression_args, method=self.compression),
File "/usr/local/lib64/python3.6/site-packages/pandas/io/common.py", line 428, in get_handle
f = open(path_or_buf, mode, encoding=encoding, newline="")
PermissionError: [Errno 13] Permission denied: 'test.csv'
Any help would be much appreciated.
When you are running PySpark in an EMR Notebook you are connecting to the EMR cluster via Apache Livy. Therefore all your variables and dataframes are stored on the cluster, and when you run df.to_csv('file.csv') you are trying to save the CSV on the cluster, not in your local environment. I've struggled a bit, but this worked for me:
Store your PySpark dataframe as temporary view: df.createOrReplaceTempView("view_name")
Load SparkMagic: %load_ext sparkmagic.magics
Select from the view and use SparkMagic to load the output into a local dataframe (the -o flag):
%%sql -o df_local --maxrows 10
SELECT * FROM view_name
Now you have your data in the local pandas dataframe df_local and can save it with df_local.to_csv('file.csv').
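A rough consolidation of the steps above, one group per notebook cell (the dataframe and view names are placeholders, not from the original post; note that the %%sql cell magic has to be the first line of its cell):

# cell 1: runs on the cluster
df.createOrReplaceTempView("movie_counts_view")

# cell 2: load the magics
%load_ext sparkmagic.magics

%%sql -o df_local --maxrows 10
SELECT * FROM movie_counts_view  -- cell 3: pull the view into a local pandas dataframe

# cell 4: runs locally
df_local.to_csv('file.csv')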
It depends on where exactly your kernel is running, i.e., whether it runs locally or on the remote cluster. In the case of EMR Notebooks, for EMR release labels 5.30 and 5.32+ (except 6.0 and 6.1), all kernels run remotely on the attached EMR cluster, so when you try to save a file it is actually saved on the cluster, where you may not have access to that directory. For release labels other than those mentioned above, kernels run locally, and with those your code would be able to save the file locally.
I believe the best way to do it would be to save to S3 directly from the PySpark dataframe, like this:
df.repartition(1).write.mode('overwrite').csv('s3://s3-bucket-name/folder/', header=True)
Note: you don't need a file name here, since PySpark will create a file with a generated name such as part-00000-d129fe1-7721-41cd-a97e-36e076ea470e-c000.csv

Google Colab unzipped file won't appear

I have downloaded some datasets via the Kaggle API into Colab. However, after unzipping them they do not appear in my directory and I cannot read them with pandas.
The files were reported as successfully unzipped, and I then unzipped them again because I couldn't find them. However, they do not appear in the directory, as I mentioned.
Furthermore, pd.read_csv can't read either the CSV files that don't show up or the csv.zip files that do show up (using the compression='zip' argument).
I get
FileNotFoundError: File b'/data/train.csv' does not exist
FileNotFoundError: [Errno 2] No such file or directory: 'data/train.csv.zip'
Any idea what's going on?
Try unzipping them individually, like:
!unzip train.csv.zip
then do
train = pd.read_csv('train.csv', nrows=6000000, dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})
I got this from this GitHub repo, which you can follow step by step, or just import into Colab and then swap in your own data:
https://github.com/llSourcell/Kaggle_Earthquake_challenge/blob/master/Earthquake_Challenge.ipynb
You can import .ipynb notebooks by searching for them in Colab.
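If you prefer to stay in Python rather than shelling out, a minimal sketch of the same unzip-then-read flow, assuming the archive really sits at data/train.csv.zip as in the error message:

import zipfile
import pandas as pd

# Extract the archive into the data/ directory so that data/train.csv actually exists
with zipfile.ZipFile('data/train.csv.zip') as zf:
    zf.extractall('data')

train = pd.read_csv('data/train.csv')
print(train.shape)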

What is the path for a bootstrapped file for a Pig job running in Amazon EMR

I bootstrap a data file in my EMR job. The bootstrapping succeeds and the file is copied to the /home/hadoop/contents/ folder with the right permissions.
However when I try to access it in the Pig script like below:
userdidstopick = load '/home/hadoop/contents/UserIdsToPick.txt' AS (uid:chararray);
I get an error that the input path does not exist:
hdfs://10.183.166.176:9000/home/hadoop/contents/UserIdsToPick.txt
When running Ruby jobs, the bootstrapped file was always accessible under the /home/hadoop/contents/ folder and everything worked for me.
Is it different for Pig?
By default, Pig on EMR is configured to access HDFS rather than the local filesystem; the error shows the HDFS path it tried.
There are 2 ways to solve this:
Either copy the file to S3 and load it directly from S3:
userdidstopick = load 's3_bucket_location/UserIdsToPick.txt' AS (uid:chararray);
Or you can first copy the file to HDFS (instead of the local filesystem), and then use the same path you are using today.
I would prefer the first option.