How to read a large text file stored in S3 from a SageMaker Jupyter notebook? - amazon-s3

I have a large (approx. 25 MB) CSV file stored in S3. It contains two columns: each cell of the first column contains a file reference, and each cell of the second column contains a large (500 to 1000 word) body of text. There are several thousand rows in this CSV.
I want to read it from a SageMaker Jupyter notebook and save it as a list of strings in memory, which I will then use in my NLP models.
I am using the following code:
import boto3
import pandas as pd
from io import StringIO

def load_file(bucket, key, sep=','):
    client = boto3.client('s3')
    obj = client.get_object(Bucket=bucket, Key=key)
    data = obj['Body'].read().decode('utf-8')
    text = open(data)
    string_io = StringIO(data)
    return pd.read_csv(string_io, sep=sep)

file = load_file('bucket', 'key', sep=',')
I am getting the following error:
OSError: [Errno 36] File name too long:

25 MB is relatively small, so you shouldn't have any problem with that. There are a number of different methods you can use within a SageMaker notebook instance. Since a SageMaker notebook has an AWS execution role, it handles credentials for you automatically, which makes using the AWS CLI easy. This example copies the file to the notebook's local filesystem, after which you can access it locally (relative to the notebook):
!aws s3 cp s3://$bucket/$key ./
You can find other examples of ingesting data into SageMaker notebooks, in both Studio and notebook instances, in this tutorial hosted on GitHub.
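If you'd rather not copy the file locally, note that the error in the posted code comes from text = open(data), which passes the entire decoded file contents to open() as if it were a filename. A minimal in-memory sketch with boto3 and pandas (the bucket and key names are placeholders):

import boto3
import pandas as pd
from io import StringIO

def load_file(bucket, key, sep=','):
    # Fetch the object and decode the body entirely in memory
    client = boto3.client('s3')
    obj = client.get_object(Bucket=bucket, Key=key)
    data = obj['Body'].read().decode('utf-8')
    # Wrap the string in a file-like buffer and hand it to pandas
    return pd.read_csv(StringIO(data), sep=sep)

df = load_file('my-bucket', 'path/to/file.csv')
texts = df.iloc[:, 1].tolist()  # list of strings from the second column, for the NLP models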

Related

Kubeflow Error in handling large input file: The node was low on resource: ephemeral-storage

In Kubeflow, when the input file size is really large (60 GB), I get 'The node was low on resource: ephemeral-storage.' It looks like Kubeflow uses the /tmp folder to store the files. I have the following questions:
What is the best way to exchange really large files? How can I avoid the ephemeral-storage issue?
Are all the InputPath and OutputPath files stored in Kubeflow's MinIO instance? If yes, how can we purge the data from MinIO?
When data is passed from one stage of the workflow to the next, does Kubeflow download the file from MinIO, copy it to the /tmp folder, and pass the InputPath to the function?
Is there a better way to pass a pandas DataFrame between different stages of the workflow? Currently I export the DataFrame as CSV to the OutputPath of one operation and reload it from the InputPath in the next stage (see the sketch after the code below).
Is there a way to use a different volume for file exchange instead of ephemeral storage? If yes, how can I configure it?
import pandas as pd

print("text_path:", text_path)
pd_df = pd.read_csv(text_path)
print(pd_df)

with open(text_path, 'r') as reader:
    for line in reader:
        print(line, end='')
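The CSV-through-OutputPath/InputPath pattern described in the question looks roughly like the sketch below using the kfp SDK; the component and column names are made up for illustration and are not taken from the original pipeline:

from kfp.components import InputPath, OutputPath, create_component_from_func

def produce_df(out_csv: OutputPath('CSV')):
    # Write the DataFrame to the path Kubeflow allocates for this output
    import pandas as pd
    pd.DataFrame({'text': ['a', 'b', 'c']}).to_csv(out_csv, index=False)

def consume_df(in_csv: InputPath('CSV')):
    # Reload the DataFrame from the materialized input file
    import pandas as pd
    print(pd.read_csv(in_csv))

produce_op = create_component_from_func(produce_df, packages_to_install=['pandas'])
consume_op = create_component_from_func(consume_df, packages_to_install=['pandas'])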

Merge small files from S3 to create a 10 Mb file

I am new to MapReduce. I have an S3 bucket that gets 3000 files every minute. I am trying to use MapReduce to merge these files into files between 10 and 100 MB in size. The Python code will use mrjob and will run on AWS EMR. mrjob's documentation says mapper_raw can be used to pass entire files to the mapper.
from mrjob.job import MRJob

class MRCrawler(MRJob):
    def mapper_raw(self, wet_path, wet_uri):
        from warcio.archiveiterator import ArchiveIterator
        with open(wet_path, 'rb') as f:
            for record in ArchiveIterator(f):
                ...
Is there a way to limit it to reading only 5000 files in one run, and to delete those files after the reducer saves the results to S3, so that the same files are not picked up in the next run?
You can do it as follows:
Configure S3 event notifications on the bucket to send messages to an SQS queue.
Have a Lambda function triggered on a cron schedule that reads the events from the SQS queue and copies the relevant files into a staging folder; you can configure this Lambda to read only 5000 messages at a time (see the sketch below).
Do all your processing on top of the staging folder, and once your Spark job on EMR is done, clean out the staging folder.
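A minimal sketch of that Lambda step with boto3; the queue URL, bucket, and staging prefix are placeholders, and since SQS returns at most 10 messages per receive call, the 5000-file batch is drained in a loop:

import json
import boto3

sqs = boto3.client('sqs')
s3 = boto3.client('s3')

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/new-files'  # placeholder
STAGING_BUCKET = 'my-bucket'   # placeholder
STAGING_PREFIX = 'staging/'    # placeholder
MAX_FILES = 5000

def handler(event, context):
    copied = 0
    while copied < MAX_FILES:
        # SQS hands back at most 10 messages per call
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get('Messages', [])
        if not messages:
            break
        for msg in messages:
            body = json.loads(msg['Body'])
            for record in body.get('Records', []):
                bucket = record['s3']['bucket']['name']
                key = record['s3']['object']['key']
                # Copy the new object into the staging prefix
                s3.copy_object(
                    Bucket=STAGING_BUCKET,
                    Key=STAGING_PREFIX + key,
                    CopySource={'Bucket': bucket, 'Key': key},
                )
                copied += 1
            # Remove the message so the same file is not picked up again
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])
    return {'copied': copied}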

How to load images in Google Colab notebook using Tensorflow from mounted Google drive

In a Google Colab notebook, I have my Google drive mounted and can see my files.
I'm trying to load a zipped directory that has two folders with several picture files in each.
I followed the example on the TensorFlow site for loading pictures, but it uses a remote location.
Here's the site - https://www.tensorflow.org/tutorials/load_data/images
Here's the code from the example that works:
data_root_orig = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    fname='flower_photos', untar=True)
data_root = pathlib.Path(data_root_orig)
print(data_root)
Here's the revised code where I tried to reference the zipped directory from the mounted Google drive:
data_root_orig = tf.keras.utils.get_file(
    origin='/content/gdrive/My Drive/TrainingPictures/',
    fname='TrainingPictures_Car', untar=True)
data_root = pathlib.Path(data_root_orig)
print(data_root)
I get this error:
ValueError: unknown url type: '/content/gdrive/My Drive/TrainingPictures/'
It's obviously expecting a URL instead of the path I've provided.
I would like to know how I can load the zipped directory as provided from the Google drive.
In this case there is no need to use tf.keras.utils.get_file(); a plain path is enough.
Here are two ways to do that.
First: !unzip -q '/content/gdrive/My Drive/TrainingPictures/TrainingPictures_Car.zip'
It will be unzipped into '/content/'.
import pathlib
data = pathlib.Path('/content/folders_inside_zip')
count = len(list(data.glob('*/*.jpg')))
count
Second, if the archive is already unzipped in Google Drive:
import pathlib
data = pathlib.Path('/content/gdrive/My Drive/TrainingPictures/')
count = len(list(data.glob('*.jpg')))
count
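Once the archive is unzipped to a local path (or readable from the mounted Drive path), the images can be loaded for TensorFlow without get_file(). A rough sketch, assuming the unzipped directory contains one sub-folder per class; the path and image size below are placeholders:

import tensorflow as tf

data_dir = '/content/TrainingPictures_Car'  # placeholder: the unzipped directory

# Builds a tf.data.Dataset from a directory of class sub-folders
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    image_size=(224, 224),
    batch_size=32)

print(train_ds.class_names)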
In my case it actually worked by removing all imports and libraries and just setting the path as a string. The file has to be uploaded into Google Colab.
content_path = "cat.jpg"
For me it worked with file:///content/(filename)

How do I save csv file to AWS S3 with specified name from AWS Glue DF?

I am trying to generate a file from a DataFrame that I have created in AWS Glue, and I want to give it a specific name. Most answers on Stack Overflow use filesystem modules, but this particular CSV file is generated in S3. I also want to name the file while generating it, not rename it after it is generated. Is there any way to do that?
I have tried using df.save('s3://PATH/filename.csv'), which actually generates a new directory in S3 named filename.csv and then writes part-*.csv files inside that directory:
df.repartition(1).write.mode('append').format('csv').option("header", "true").save('s3://PATH')
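One common workaround, not from the original post: if the data fits in driver memory, convert the Spark DataFrame to pandas and upload the CSV to an exact S3 key with boto3, which bypasses Spark's part-file naming entirely (the bucket and key below are placeholders):

import io
import boto3

# If df is a Glue DynamicFrame, convert it first: df = dyf.toDF()
pdf = df.toPandas()              # Spark DataFrame -> pandas DataFrame

buffer = io.StringIO()
pdf.to_csv(buffer, index=False)

s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket',           # placeholder bucket
              Key='exports/filename.csv',   # exact object name you want
              Body=buffer.getvalue())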

Convert excel subsheet into individual CSV in Azure storage cloud using Powershell

I am writing a PowerShell script which should do the following things:
Take an Excel file from Azure blob storage folder 'A' as input
Extract the subsheets from the Excel file and convert them into individual CSVs
Transfer those CSVs to the same blob storage in folder 'B'
I am able to do steps 1 and 2, but after step 2 I have an Excel worksheet object which I have to transfer to blob storage folder 'B'. This is where I am not able to proceed. For copying a file to a blob there are two methods:
1. Start-AzureStorageBlobCopy - this cmdlet can only copy a blob, but as I said, I have a worksheet object (see below for a better understanding):
$wb = $E.Workbooks.Open($sf)
foreach ($ws in $wb.Worksheets)
I mean, I have $ws, which is a worksheet object from the Excel file.
2. Set-AzureStorageBlobContent - this cmdlet requires a local file path, which means it can only upload a file to a blob from a local directory.
Can anyone suggest me the correct method to tackle this situation? Any help would be appreciated.