Saving Matplotlib Output to Blob Storage on Databricks - matplotlib

I'm trying to write matplotlib figures to Azure Blob Storage using the method provided here:
Saving Matplotlib Output to DBFS on Databricks.
However, when I replace the path in the code with
path = 'wasbs://test@someblob.blob.core.windows.net/'
I get this error:
[Errno 2] No such file or directory: 'wasbs://test@someblob.blob.core.windows.net/'
I don't understand the problem.

As per my research, you cannot save Matplotlib output to Azure Blob Storage directly.
You may follow the steps below to save Matplotlib output to Azure Blob Storage:
Step 1: Save the figure to the Databricks File System (DBFS) first, then copy it to Azure Blob Storage.
Saving Matplotlib output to the Databricks File System (DBFS): we use the following command to save the output to DBFS: plt.savefig('/dbfs/myfolder/Graph1.png')
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'fruits': ['apple', 'banana'], 'count': [1, 2]})
plt.close()
df.set_index('fruits', inplace=True)
df.plot.bar()
plt.savefig('/dbfs/myfolder/Graph1.png')  # /dbfs/... is the local file API path for DBFS
Step 2: Copy the file from the Databricks File System to Azure Blob Storage.
There are two methods to copy the file from DBFS to Azure Blob Storage.
Method 1: Access Azure Blob storage directly
Access Azure Blob Storage directly by setting spark.conf.set and copy the file from DBFS to Blob Storage.
spark.conf.set("fs.azure.account.key.<storage-account-name>.blob.core.windows.net", "<storage-account-key>")
Use dbutils.fs.cp to copy the file from DBFS to Azure Blob Storage:
dbutils.fs.cp('dbfs:/myfolder/Graph1.png', 'wasbs://<container>@<storage-account-name>.blob.core.windows.net/Azure')
Method 2: Mount Azure Blob storage containers to DBFS
You can mount a Blob storage container or a folder inside a container to Databricks File System (DBFS). The mount is a pointer to a Blob storage container, so the data is never synced locally.
dbutils.fs.mount(
  source = "wasbs://sampledata@chepra.blob.core.windows.net/Azure",
  mount_point = "/mnt/chepra",
  extra_configs = {"fs.azure.sas.sampledata.chepra.blob.core.windows.net": dbutils.secrets.get(scope = "azurestorage", key = "azurestoragekey")})
Use dbutils.fs.cp to copy the file to the Azure Blob Storage container:
dbutils.fs.cp('dbfs:/myfolder/Graph1.png', 'dbfs:/mnt/chepra/Graph1.png')
By following Method 1 or Method 2 you can successfully save the output to Azure Blob Storage.
For more details, refer to "Databricks - Azure Blob Storage".
Hope this helps. Do let us know if you have any further queries.

You can write with .savefig() directly to Azure Blob Storage; you just need to mount the blob container first.
The following works for me, where I had mounted the blob container as /mnt/mydatalakemount:
plt.savefig('/dbfs/mnt/mydatalakemount/plt.png')
or
fig.savefig('/dbfs/mnt/mydatalakemount/fig.png')
Documentation on mounting a blob container is here.
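For completeness, a minimal sketch of both steps together, assuming an account key kept in a Databricks secret scope; the container, storage account, scope, and key names below are placeholders, not taken from the answers above:
# Mount the container once (placeholder names throughout).
dbutils.fs.mount(
    source="wasbs://<container>@<storage-account>.blob.core.windows.net",
    mount_point="/mnt/mydatalakemount",
    extra_configs={
        "fs.azure.account.key.<storage-account>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope>", key="<key>")
    },
)

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(['apple', 'banana'], [1, 2])

# The /dbfs prefix exposes the mount to the local file API that savefig uses.
fig.savefig('/dbfs/mnt/mydatalakemount/fig.png')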

This is what I also came up with so far. To reload the image from blob storage and display it as a PNG in a Databricks notebook again, I use the following code:
from io import BytesIO
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
blob_path = ...
dbfs_path = ...
dbutils.fs.cp(blob_path, dbfs_path)
with open(dbfs_path, "rb") as f:
    im = BytesIO(f.read())
img = mpimg.imread(im)
imgplot = plt.imshow(img)
display(imgplot.figure)

I didn't succeed using dbutils, which I could not get set up correctly.
But I did succeed by mounting the file shares to a Linux path, like this:
https://learn.microsoft.com/en-us/azure/azure-functions/scripts/functions-cli-mount-files-storage-linux

Related

KeyError: 'ETag' while trying to load data from S3 to Sagemaker

I unloaded a 500 MB file from Redshift into S3. Instead of saving it as a single file, S3 split it into several chunks, and now I am trying to access it from S3 in AWS SageMaker. While trying to read the file using pd.read_csv and dask.dataframe.read_csv I get KeyError: 'ETag'.
I'm a newbie to AWS, please do help me.
Are you trying to import using a bucket name with /'s in it? The top-level bucket is read in with
import boto3
import pandas as pd
s3 = boto3.resource("s3")
my_bucket = s3.Bucket("data-bucket-named")
and then the subfolders can be read in with:
subfolders = "subfolder1/subfolder2/subfolder3"
csvs = []
for object_summary in my_bucket.objects.filter(Prefix=subfolders):
    key = object_summary.key
    if key.endswith(".csv"):
        csvs.append(key)
all_data = pd.DataFrame()
for file in csvs:
    df = pd.read_csv(f's3://data-bucket-named/{file}')
    all_data = pd.concat([all_data, df])
Hope that helps.
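Alternatively, since Redshift UNLOAD writes several part files under a single prefix, you could read them all in one call with dask's glob support. A sketch, assuming the same hypothetical bucket and prefix as above and s3fs installed:
import dask.dataframe as dd

# Read every part file under the prefix; dask expands the wildcard via s3fs.
ddf = dd.read_csv("s3://data-bucket-named/subfolder1/subfolder2/subfolder3/*.csv")
all_data = ddf.compute()  # materialize as a single pandas DataFrame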

issue with reading csv file from AWS S3 with boto3

I have a csv file with the following columns:
Name Adress/1 Adress/2 City State
When I try to read this csv file from local disk I have no issue.
But when I try to read it from S3 with the code below, I get an error when I use io.StringIO.
When I use io.BytesIO, each record displays as one column. Though the file is ','-separated, some columns do contain '\n' or '\t'. I believe these are causing the issue.
I used AWS Wrangler with no issue, but my requirement is to read this csv file with boto3.
import io
import pandas as pd
import boto3
s3 = boto3.resource('s3', aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
my_bucket = s3.Bucket(AWS_S3_BUCKET)
csv_obj = my_bucket.Object(key=key).get().get('Body').read().decode('utf16')
data = io.BytesIO(csv_obj)  # io.StringIO(csv_obj)
sdf = pd.read_csv(data, delimiter=sep, names=cols, header=None, skiprows=1)
print(sdf)
Any suggestion please?
Try get_object():
obj = boto3.client('s3').get_object(Bucket=AWS_S3_BUCKET, Key=key)
data = io.StringIO(obj['Body'].read().decode('utf-8'))
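The resulting StringIO can then be passed to pandas exactly like a local file. A sketch reusing sep and cols from the question (note the question decoded with 'utf16'; use whichever encoding actually matches the file):
import pandas as pd

# data is the io.StringIO object built from the S3 object's body above.
sdf = pd.read_csv(data, delimiter=sep, names=cols, header=None, skiprows=1)
print(sdf)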

Using pandas to open Excel files stored in GCS from command line

The following code snippet is from a Google tutorial; it simply prints the names of the files in a given GCS bucket:
from google.cloud import storage

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    # bucket_name = "your-bucket-name"
    storage_client = storage.Client()
    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        print(blob.name)

list_blobs('sn_project_data')
Now from the command line I can run:
$ python path/file.py
And in my terminal the files in said bucket are printed out. Great, it works!
However, this isn't quite my goal. I'm looking to open a file and act upon it. For example:
df = pd.read_excel(filename)
print(df.iloc[0])
However, when I pass the path to the above, the error returned reads "invalid file path." So I'm sure there is some sort of GCP specific function call to actually access these files...
What command(s) should I run?
Edit: This video https://www.youtube.com/watch?v=ED5vHa3fE1Q shows a trick to open files and uses StringIO in the process, but that approach doesn't support Excel files, so it's not an effective solution.
read_excel() does not support a Google Cloud Storage file path as of now, but it can read data in bytes.
pandas.read_excel(io, sheet_name=0, header=0, names=None,
index_col=None, usecols=None, squeeze=False, dtype=None, engine=None,
converters=None, true_values=None, false_values=None, skiprows=None,
nrows=None, na_values=None, keep_default_na=True, na_filter=True,
verbose=False, parse_dates=False, date_parser=None, thousands=None,
comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True,
storage_options=None)
Parameters: io : str, bytes, ExcelFile, xlrd.Book, path object, or
file-like object
What you can do is use the blob object and use download_as_bytes() to convert the object into bytes.
Download the contents of this blob as a bytes object.
For this example I just used a random sample xlsx file and read the 1st sheet:
from google.cloud import storage
import pandas as pd
bucket_name = "your-bucket-name"
blob_name = "SampleData.xlsx"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
data_bytes = blob.download_as_bytes()
df = pd.read_excel(data_bytes)
print(df)
Test done.

Loading local file using Cloud Functions to BigQuery returns OSError: [Errno 30] Read-only file system

I am trying to load a Dataframe into BigQuery. I do this as follows:
# Prepare temp file to stream from local file
temp_file = table_name + '-' + str(timestamp_in_ms())
df.to_csv(temp_file, index=None, header=True)
# Define job_config
job_config = bigquery.LoadJobConfig()
job_config.schema = schema
job_config.skip_leading_rows = 1
job_config.source_format = bigquery.SourceFormat.CSV
# Create job to load data into table
with open(temp_file, "r+b") as source_file:
    load_job = client.load_table_from_file(source_file, dataset_ref.table(table_name), job_config=job_config)
This works fine in local development, however when I deploy the Cloud Function it returns the following error:
OSError: [Errno 30] Read-only file system: '{temp_file}'
This happens on the line with open(temp_file, "r+b") as source_file:
Why can it not read local files on the Cloud Function temporary storage? What went wrong?
Probably you didn't specify the /tmp folder.
Local Disk
Cloud Functions provides access to a local disk mount point (/tmp), known as a "tmpfs" volume, in which data written to the volume is stored in memory. There is no specific fee associated with this; however, writing data to the /tmp mountpoint will consume memory resources provisioned for the function.
As explained on: https://cloud.google.com/functions/pricing
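A minimal sketch of the fix, keeping the rest of the question's snippet unchanged and only redirecting the temporary CSV to /tmp:
import os

# /tmp is the only writable location in the Cloud Functions runtime.
temp_file = os.path.join('/tmp', table_name + '-' + str(timestamp_in_ms()))
df.to_csv(temp_file, index=None, header=True)

with open(temp_file, "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file, dataset_ref.table(table_name), job_config=job_config
    )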

Unzip a file to s3

I am looking for a simple way to extract a zip/gzip present in an S3 bucket to the same bucket location and delete the parent zip/gzip file after extraction.
I am unable to achieve this with any of the APIs currently.
I have tried native boto, pyfilesystem (fs), and s3fs.
The source and destination links seem to be an issue for these functions.
(Using Python 2.x/3.x & Boto 2.x)
I see there is an API for node.js (unzip-to-s3) to do this job, but none for Python.
A couple of implementations I can think of:
1. A simple API to extract the zip file within the same bucket.
2. Use S3 as a filesystem and manipulate data.
3. Use a data pipeline to achieve this.
4. Transfer the zip to EC2, extract, and copy back to S3.
Option 4 would be the least preferred, to minimise the architecture overhead of the EC2 add-on.
I need support in getting this feature implemented, with integration with Lambda at a later stage. Any pointers to these implementations are greatly appreciated.
Thanks in Advance,
Sundar.
You could try https://www.cloudzipinc.com/ that unzips/expands several different formats of archives from S3 into a destination in your bucket. I used it to unzip components of a digital catalog into S3.
Solved by using an EC2 instance: copy the S3 files to a local directory on EC2, extract them, and copy that directory back to the S3 bucket.
Sample code to unzip to a local directory on the EC2 instance:
import os
import tarfile
import zipfile

def s3Unzip(srcBucket, dst_dir):
    '''
    function to decompress the s3 bucket contents to the local machine
    Args:
        srcBucket (string): source bucket name
        dst_dir (string): destination location in the local/ec2 local file system
    Returns:
        None
    '''
    # s3Conn and logger_s3 are defined elsewhere in the original script
    s3 = s3Conn
    try:
        os.mkdir(dst_dir)
        print("local directories created")
    except Exception:
        logger_s3.warning("Exception in creating local directories to extract zip file / folder already existing")
    bucket = s3.lookup(srcBucket)
    for key in bucket:
        # download each object, then pick an opener based on its extension
        path = os.path.join(dst_dir, key.name)
        key.get_contents_to_filename(path)
        if path.endswith('.zip'):
            opener, mode = zipfile.ZipFile, 'r'
        elif path.endswith('.tar.gz') or path.endswith('.tgz'):
            opener, mode = tarfile.open, 'r:gz'
        elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
            opener, mode = tarfile.open, 'r:bz2'
        else:
            raise ValueError('unsupported format')
        cwd = os.getcwd()
        os.chdir(dst_dir)
        try:
            file = opener(path, mode)
            try:
                file.extractall()
            finally:
                file.close()
            logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
        except Exception as e:
            logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
        os.chdir(cwd)
    s3.close()
Sample code to upload to a MySQL instance:
Use the "LOAD DATA LOCAL INFILE" query to upload to MySQL directly.
def upload(file_path, timeformat):
    '''
    function to upload csv file data to mysql rds
    Args:
        file_path (string): local file path
        timeformat (string): datetime format used by str_to_date
    Returns:
        None
    '''
    for file in file_path:
        try:
            con = connect()  # connect() is defined elsewhere in the original script
            cursor = con.cursor()
            qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (col1, col2, col3, @datetime, col4) set datetime = str_to_date(@datetime, '%s');""" % (file, timeformat)
            cursor.execute(qry)
            con.commit()
            logger_rds.info("Loading file:" + file)
        except Exception:
            logger_rds.error("Exception in uploading " + file)
            # Rollback in case there is any error
            con.rollback()
        cursor.close()
        # disconnect from server
        con.close()
Lambda function:
You can use a Lambda function where you read zipped files into a buffer, gzip the individual files, and re-upload them to S3. Then you can either archive the original files or delete them using boto.
You can also set an event-based trigger that runs the Lambda automatically every time there is a new zipped file in S3. Here's a full tutorial for exactly this: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
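For reference, a minimal sketch of such a handler, not the tutorial's exact code: it assumes plain .zip uploads, reads the archive into memory, gzips each member, and writes the results back under a prefix named after the original key.
import gzip
import io
import zipfile

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by an S3 ObjectCreated event for a newly uploaded .zip file.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Read the whole zip archive into memory.
    zipped = io.BytesIO(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

    with zipfile.ZipFile(zipped) as archive:
        for name in archive.namelist():
            # Gzip each member and upload it under a prefix named after the archive.
            buf = io.BytesIO()
            with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
                gz.write(archive.read(name))
            buf.seek(0)
            s3.put_object(Bucket=bucket, Key=f"{key}/{name}.gz", Body=buf)

    # Delete the original archive once everything has been re-uploaded.
    s3.delete_object(Bucket=bucket, Key=key)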