How to upload large files from local pc to DBFS? - apache-spark-sql

I am trying to learn Spark SQL in Databricks and want to work with the Yelp dataset; however, the file is too large to upload to DBFS from the UI. Thanks, Filip

There are several approaches to that:
Use the Databricks CLI's dbfs command to upload local data to DBFS.
Download the dataset directly from a notebook, for example by using %sh wget URL, and unpack the archive to DBFS (either by using /dbfs/path/... as the destination, or by using the dbutils.fs.cp command to copy files from the driver node to DBFS).
Upload the files to AWS S3, Azure Data Lake Storage, Google Cloud Storage, or similar, and access the data there.
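The second approach above relies on DBFS being FUSE-mounted at /dbfs on the driver, so moving a downloaded file into DBFS is just a plain file copy. A minimal sketch (assuming the code runs on a Databricks driver node; the paths and the helper name are hypothetical):

```python
import os
import shutil

def copy_to_dbfs(local_path, dbfs_path):
    """Copy a file from the driver's local disk into DBFS via the /dbfs mount."""
    os.makedirs(os.path.dirname(dbfs_path), exist_ok=True)
    shutil.copy(local_path, dbfs_path)
    return dbfs_path

# Example (hypothetical paths):
# copy_to_dbfs('/tmp/yelp_dataset.tar', '/dbfs/FileStore/tables/yelp_dataset.tar')
```

The same copy could equally be done with dbutils.fs.cp using file:/ and dbfs:/ URIs; the /dbfs mount just lets ordinary Python file APIs do it.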

Upload the file you want to load in Databricks to Google Drive.
from urllib.request import urlopen
from shutil import copyfileobj

my_url = 'paste your url here'
my_filename = 'give your filename'
file_path = '/FileStore/tables'  # DBFS location to move the downloaded file to

# Download the file from Google Drive to the Databricks driver node
with urlopen(my_url) as in_stream, open(my_filename, 'wb') as out_file:
    copyfileobj(in_stream, out_file)

# Check where the file has been downloaded;
# in my case it is the driver's working directory
display(dbutils.fs.ls('file:/databricks/driver'))

# Move the file to the desired DBFS location:
# dbutils.fs.mv(downloaded_location, desired_location)
dbutils.fs.mv('file:/databricks/driver/' + my_filename, file_path)
I hope this helps

Related

Cannot Locate the file in AWS EMR Notebook

I have been trying to use some .txt and .csv files in an EMR Notebook, but I cannot locate them.
I am trying to read via:
with open('file.txt', 'r') as f:
    notes = f.read()
Things I tried:
Uploaded the file using the JupyterHub UI. I can see the file, but I can't read it from that path. I also checked the file using the JupyterHub terminal.
Tried to read from s3 (lots of people got it working in this way):
with open('s3://<repo>/file.txt', 'r') as f:
Copied the file to HDFS on the master node (in the cluster) using both hdfs dfs and hadoop fs. The file is present in both directories.
However, I have no clue how I can reach the file in EMR Notebook.
Any ideas?

Not able to read HDF5 file present in S3 in sagemaker notebook instance

My directory structure looks like this: bucket-name/training/file.hdf5
I tried reading this file in a SageMaker notebook instance with this code cell:
import h5py

bucket = 'bucket-name'
data_key = 'training/file.hdf5'
data_location = 's3://{}/{}'.format(bucket, data_key)
hf = h5py.File(data_location, 'r')
But it gives me error:
Unable to open file (unable to open file: name = 's3://bucket-name/training/file.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
I have also tried pd.read_hdf(data_location) but was not successful.
Reading a CSV file into a dataframe from the same key does not throw an error.
Any help is appreciated. Thanks
Thanks for asking the question here!
Your file is on the remote storage service Amazon S3. The string data_location is not the name of a local file, so your data reader cannot open it. In order to read the S3 file, you have 2 options:
use a library that can read files from S3. It seems that h5py can do that, if you specify driver='ros3'
alternatively, you can bring the file from S3 to your machine and then read it locally. For example, use the AWS CLI to copy the file down with aws s3 cp s3://<your bucket>/<your file on s3> /home/ec2-user/SageMaker/, after which h5py.File('/home/ec2-user/SageMaker/your-file-name.hdf5', 'r') should work
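The second option can be sketched as below. The helper name and the default SageMaker directory are assumptions for illustration; the bucket and key come from the question, and the commented-out boto3/h5py calls are the actual download-and-open step:

```python
import os

def s3_uri_to_local(uri, dest_dir='/home/ec2-user/SageMaker'):
    """Split an s3:// URI into (bucket, key) and pick a local download path."""
    bucket, _, key = uri[len('s3://'):].partition('/')
    return bucket, key, os.path.join(dest_dir, os.path.basename(key))

# bucket, key, local_path = s3_uri_to_local('s3://bucket-name/training/file.hdf5')
# boto3.client('s3').download_file(bucket, key, local_path)
# hf = h5py.File(local_path, 'r')
```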

Is there a way to read the images for my convolutional neural network directly from my desktop?

I'm training a convolutional neural network using Google Colaboratory. I have my data (images) stored in Google Drive, and I'm able to use it correctly. However, sometimes the process of reading the images is too slow or does not work (other times the process is faster and I have no problem reading the images). To read the images from Google Drive I use:
import glob
from os import path
from google.colab import drive

drive.mount('/content/drive')
!unzip -u "/content/drive/My Drive/the folder/files.zip"
IMAGE_PATH = '/content/drive/My Drive/the folder'
file_paths = glob.glob(path.join(IMAGE_PATH, '*.png'))
Sometimes this works, and other times it does not or is too slow.
Either way, I would like to read my data from a folder on my desktop without using Google Drive, but I'm not able to do this.
I'm trying the following:
IMAGE_PATH = 'C:/Users/path/to/my/folder'
file_paths = glob.glob(path.join(IMAGE_PATH, '*.png'))
But I get an error saying that the directory/file does not exist.
Google Colab cannot directly access your local machine's dataset because it runs on a separate virtual machine in the cloud. You need to upload the dataset to Google Drive, then load it into Google Colab's runtime for model building.
For that you need to follow the steps given below:
Create a zip file of your large dataset and then upload this file in your Google Drive.
Now, open Google Colab with the same Google ID, mount the Google Drive using the code below, and authorize access to the drive:
from google.colab import drive
drive.mount('/content/drive')
Your uploaded zip file will be available in the mounted drive under /content/drive/MyDrive/ (visible in the left pane).
To read the dataset in Google Colab, unzip the folder and extract its contents into the /tmp folder using the code below.
import zipfile
import os
# Open the zip file in read mode and extract its contents into /tmp
with zipfile.ZipFile('/content/drive/MyDrive/train.zip', 'r') as zip_ref:
    zip_ref.extractall('/tmp')
You can check the extracted files in the /tmp/train folder.
Finally, build the path of your dataset to use it in Google Colab's runtime environment.
train_dataset = os.path.join('/tmp/train/')  # dataset root
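The extract-then-glob steps above can be wrapped into one helper. This is a sketch (the function name and paths are illustrative, not part of the original answer):

```python
import glob
import os
import zipfile

def extract_and_list(zip_path, out_dir, pattern='*.png'):
    """Extract an archive and return the sorted paths of matching files inside it."""
    with zipfile.ZipFile(zip_path, 'r') as zf:
        zf.extractall(out_dir)
    return sorted(glob.glob(os.path.join(out_dir, '**', pattern), recursive=True))

# Example (hypothetical paths):
# file_paths = extract_and_list('/content/drive/MyDrive/train.zip', '/tmp')
```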

How to read parquet files from tar.bz2 using pyspark on EMR Notebook?

I'm trying to read in parquet (snappy) files in a tar.bz2 format from an S3 bucket, but receive this error:
"java.io.IOException: Could not read footer for file"
when using:
df = spark.read.parquet("s3://bucket/folder/filename.tar.bz2")
I have downloaded one of the tar.bz2 files and verified that it is valid parquet within the underlying tar directory.
I also looked into using S3 Select via the boto3 package, but it looks like BZIP2 is currently only supported for CSV and JSON files, not Parquet. I have also attempted to unzip the files back into S3 like this:
sc.install_pypi_package("boto3")
import boto3
from io import BytesIO
import tarfile

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key='folder/filename.tar.bz2')
wholefile = obj['Body'].read()
fileobj = BytesIO(wholefile)
tar = tarfile.open(fileobj=fileobj)
s3.upload_fileobj(Fileobj=tar, Bucket=bucket, Key='folder/filename.parquet')
However, this simply yields an "Interpreter died:" error message without any additional information (possibly due to insufficient memory in the cluster?). Is there a step I'm missing to decompress the tar archive?
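One likely issue in the snippet above is that the TarFile object itself is passed to upload_fileobj, rather than the bytes of the parquet files inside the archive. A hedged sketch of extracting each member first (the helper name is hypothetical; the commented-out upload reuses the question's s3, bucket, and wholefile names):

```python
import tarfile
from io import BytesIO

def extract_members(tar_bytes):
    """Yield (name, data) for each regular file in a tar.bz2 archive held in memory."""
    with tarfile.open(fileobj=BytesIO(tar_bytes), mode='r:bz2') as tar:
        for member in tar.getmembers():
            if member.isfile():
                yield member.name, tar.extractfile(member).read()

# for name, data in extract_members(wholefile):
#     s3.upload_fileobj(Fileobj=BytesIO(data), Bucket=bucket, Key='folder/' + name)
```

Note that for very large archives, streaming members one at a time (as this generator does) keeps peak memory closer to one member's size rather than the whole archive's extracted size.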

Save files/pictures in Google Colaboratory

At the moment I work with 400+ images and upload them with:
from google.colab import files
uploaded = files.upload()
This works fine, but I have to re-upload all the images every time I leave my Colaboratory session. It's pretty annoying because the upload takes 5-10 minutes.
Any possibilities to prevent this? It seems like Colaboratory is saving the files only temporarily.
I have to use Google Colaboratory because I need their GPU.
Thanks in advance :)
As far as I know, there is no way to permanently store data on a Google Colab VM, but there are faster ways to upload data on Colab than files.upload().
For example you can upload your images on Google Drive once and then 1) mount Google Drive directly in your VM or 2) use PyDrive to download your images on your VM. Both of these options should be way faster than uploading your images from your local drive.
Mounting Drive in your VM
Mount Google Drive:
from google.colab import drive
drive.mount('/gdrive')
Print the contents of foo.txt located in the root directory of Drive:
with open('/gdrive/foo.txt') as f:
    for line in f:
        print(line)
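Reads through the Drive FUSE mount can still be slow for many small files, so one common pattern is to copy the dataset onto the VM's local disk once and read from there. A minimal sketch (the helper name and the /content paths are assumptions for illustration):

```python
import os
import shutil

def stage_locally(drive_dir, local_dir):
    """Copy a dataset from the mounted Drive onto the VM's local disk once."""
    if not os.path.exists(local_dir):
        shutil.copytree(drive_dir, local_dir)
    return local_dir

# Example (hypothetical paths):
# data_dir = stage_locally('/gdrive/My Drive/images', '/content/images')
```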
Using PyDrive
Take a look at the first answer to this question.
First of all, mount your Google Drive:
# Load the Drive helper and mount
from google.colab import drive
# This will prompt for authorization.
drive.mount('/content/drive')
The result is:
Mounted at /content/drive
To check the mounted directory, run this command:
# After executing the cell above, Drive
# files will be present in "/content/drive/My Drive".
!ls "/content/drive/My Drive"
The result is something like this:
07_structured_data.ipynb Sample Excel file.xlsx
BigQuery recipes script.ipynb
Colab Notebooks TFGan tutorial in Colab.txt
Copy of nima colab.ipynb to_upload (1).ipynb
created.txt to_upload (2).ipynb
Exported DataFrame sheet.gsheet to_upload (3).ipynb
foo.txt to_upload.ipynb
Pickle + Drive FUSE example.ipynb variables.pickle
Sample Excel file.gsheet