Kubeflow error when handling a large input file: "The node was low on resource: ephemeral-storage" - kubeflow-pipelines

In Kubeflow, when the input file size is really large (60 GB), I am getting 'The node was low on resource: ephemeral-storage.' It looks like Kubeflow is using the /tmp folder to store the files. I have the following questions:
What is the best way to exchange really large files? How can I avoid the ephemeral-storage issue?
Are all the InputPath and OutputPath files stored in the MinIO instance of Kubeflow? If yes, how can we purge the data from MinIO?
When data is passed from one stage of the workflow to the next, does Kubeflow download the file from MinIO, copy it to the /tmp folder, and pass the InputPath to the function?
Is there a better way to pass a pandas dataframe between different stages of the workflow? Currently I export the pandas dataframe as CSV to the OutputPath of the operation and reload it from the InputPath in the next stage.
Is there a way to use a different volume for file exchange instead of ephemeral storage? If yes, how can I configure it?
import pandas as pd

# text_path is the InputPath argument passed into the component
print("text_path:", text_path)

pd_df = pd.read_csv(text_path)
print(pd_df)

with open(text_path, 'r') as reader:
    for line in reader:
        print(line, end='')
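On question 5, one common way to keep large intermediate files off the node's ephemeral storage is to mount a PersistentVolumeClaim into the steps that exchange data. The sketch below assumes the KFP v1 SDK; the volume size, container images, and produce.py / consume.py scripts are placeholders, not part of the original pipeline.

# A minimal sketch, assuming the KFP v1 SDK; names and sizes are placeholders.
from kfp import dsl

@dsl.pipeline(name='large-file-pipeline')
def large_file_pipeline():
    # Create a PVC that both steps mount at /data, so large files bypass
    # the node's ephemeral storage entirely.
    vop = dsl.VolumeOp(
        name='create-volume',
        resource_name='large-data-pvc',
        size='100Gi',
        modes=dsl.VOLUME_MODE_RWO,
    )

    produce = dsl.ContainerOp(
        name='produce',
        image='python:3.9',
        command=['python', 'produce.py', '--out', '/data/large.csv'],
        pvolumes={'/data': vop.volume},
    )

    consume = dsl.ContainerOp(
        name='consume',
        image='python:3.9',
        command=['python', 'consume.py', '--in', '/data/large.csv'],
        pvolumes={'/data': produce.pvolume},  # reuse the same volume and enforce ordering
    )

On question 4, exchanging the dataframe as Parquet (df.to_parquet(output_path) and pd.read_parquet(input_path)) is typically faster and considerably smaller on disk than CSV, although it does not by itself remove the ephemeral-storage limit.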

Related

Write CSV file in append mode on Azure Databricks

I want to write a CSV file in append mode on Azure Databricks. The code below works fine on my local machine (Jupyter notebook).
df = pd.read_csv("/dbfs/mnt/dev/tmp/ml_p/csv_append.csv")
df+6
(screenshot of the output: https://i.stack.imgur.com/sXsgH.png)
When I opened the same CSV file and tried to save it after performing the operation, I got: OSError: [Errno 95] Operation not supported
with open('/dbfs/mnt/dev/tmp/ml_p/csv_append.csv', 'a') as f:
    (df + 6).to_csv(f, header=False)
Is there another way to write the CSV file in append mode? Or can I achieve the same thing using PySpark?
There are some limitations on what operations can be done with files on DBFS (especially via the /dbfs mount point), and you have hit one of them. The workaround is to copy the file from DBFS to the local file system, modify it the same way you do now, and then upload it back. Copying the file can be done with dbutils.fs commands, like:
dbutils.fs.cp("dbfs:/mnt/dev/tmp/ml_p/csv_append.csv", "file:/tmp/csv_append.csv")
df = pd.read_csv("/tmp/csv_append.csv")
df+6
with open('/tmp/csv_append.csv', 'a') as f:
    (df + 6).to_csv(f, header=False)
dbutils.fs.mv("file:/tmp/csv_append.csv","dbfs:/mnt/dev/tmp/ml_p/csv_append.csv")
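On the "can I achieve the same using PySpark" part: Spark's writer can append, but it appends by adding part files under a directory rather than to a single CSV. A minimal sketch, assuming a Databricks notebook where spark is already defined and using a placeholder output directory:

# Hypothetical PySpark alternative: append via the DataFrame writer.
# Note: this writes a directory of part files, not a single csv_append.csv.
sdf = spark.createDataFrame(df + 6)          # convert the pandas result
(sdf.write
    .mode("append")
    .option("header", "false")
    .csv("dbfs:/mnt/dev/tmp/ml_p/csv_append_parts"))  # placeholder directory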

How to read large text file stored in S3 from sagemaker jupyter notebook?

I have a large (approx. 25 MB) CSV file stored in S3. It contains two columns: each cell of the first column contains a file reference, and each cell of the second column contains a large (500 to 1,000 word) body of text. There are several thousand rows in this CSV.
I want to read it from a SageMaker Jupyter notebook and keep it in memory as a list of strings, which I will then use in my NLP models.
I am using the following code:
import boto3
import pandas as pd
from io import StringIO

def load_file(bucket, key, sep=','):
    client = boto3.client('s3')
    obj = client.get_object(Bucket=bucket, Key=key)
    data = obj['Body'].read().decode('utf-8')
    text = open(data)  # open() treats the 25 MB file contents as a path -> Errno 36
    string_io = StringIO(data)
    return pd.read_csv(string_io, sep=sep)

file = load_file("bucket", 'key', sep=',')
I am getting the following error:
OSError: [Errno 36] File name too long:
25 MB is relatively small, so you shouldn't have any problem with that. There are a number of different methods you can use within a SageMaker notebook instance. Since a SageMaker notebook has an AWS execution role, it automatically handles credentials for you, which makes using the AWS CLI easy. This example copies the file to the notebook's local filesystem, after which you can access it locally (relative to the notebook):
!aws s3 cp s3://$bucket/$key ./
You can find other examples of ingesting data into SageMaker Notebooks in both Studio and notebook instances in this tutorial hosted on GitHub.
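As a side note, the Errno 36 in the question comes from the stray text = open(data) line, which passes the entire 25 MB file contents to open() as if it were a filename. A minimal corrected version of the helper, with placeholder bucket and key values:

import boto3
import pandas as pd
from io import StringIO

def load_file(bucket, key, sep=','):
    # Fetch the object and parse it directly from memory; no temp file needed.
    client = boto3.client('s3')
    obj = client.get_object(Bucket=bucket, Key=key)
    data = obj['Body'].read().decode('utf-8')
    return pd.read_csv(StringIO(data), sep=sep)

df = load_file('my-bucket', 'path/to/file.csv')   # placeholder bucket/key
texts = df.iloc[:, 1].tolist()                    # second column as a list of strings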

How to use spark toLocalIterator to write a single file in local file system from cluster

I have a PySpark job that writes my resulting dataframe to the local filesystem. Currently it runs in local mode, so I am doing coalesce(1) to get a single file, as below:
file_format = 'avro'  # will be dynamic, e.g. avro, json, csv, etc.
df.coalesce(1).write.format(file_format).save('file:///pyspark_data/output')
But I see a lot of memory issues (OOM) and it takes a long time as well. So I want to run this job with master as yarn and mode as client, and to write the resulting df into a single file on the local system I would need to use toLocalIterator, which yields Rows. How can I stream these Rows into a file of the required format (json/avro/csv/parquet and so on)?
file_format = 'avro'
for row in df.toLocalIterator():
    # write the data into a single file
    pass
You get the OOM error because you try to retrieve all the data into a single partition with coalesce(1).
I don't recommend using toLocalIterator, because you would have to rewrite a custom writer for every format and you wouldn't get parallel writing.
Your first solution is a good one:
df.write.format(file_format).save('file:///pyspark_data/output')
If you use Hadoop, you can then merge all the output parts into one file on the local filesystem this way (it works for CSV; you can try it for other formats):
hadoop fs -getmerge <HDFS src> <FS destination>
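For completeness, a minimal sketch of the two steps together (output paths are placeholders; the merge step is safe for CSV/text output, as noted above):

import subprocess

# 1. Let Spark write in parallel, without coalesce(1).
df.write.format("csv").mode("overwrite").save("hdfs:///pyspark_data/output")

# 2. Merge the part files into a single local file.
subprocess.run(
    ["hadoop", "fs", "-getmerge", "/pyspark_data/output", "/pyspark_data/output.csv"],
    check=True,
)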

Merge small files from S3 to create a 10 Mb file

I am new to MapReduce. I have an S3 bucket that gets 3,000 files every minute. I am trying to use MapReduce to merge these files into a file between 10 and 100 MB in size. The Python code will use mrjob and will run on AWS EMR. mrjob's documentation says mapper_raw can be used to pass entire files to the mapper.
class MRCrawler(MRJob):
    def mapper_raw(self, wet_path, wet_uri):
        from warcio.archiveiterator import ArchiveIterator
        with open(wet_path, 'rb') as f:
            for record in ArchiveIterator(f):
                ...
Is there a way to limit it to read only 5,000 files in one run, and to delete those files after the reducer saves the results to S3, so that the same files are not picked up in the next run?
You can do it as follows:
configure S3 event notifications on the bucket so that new objects publish messages to an SQS queue;
have a Lambda triggered on a schedule (cron) that reads the events from SQS and copies the relevant files into a staging folder; you can configure this Lambda to read only 5,000 messages at a time (a rough sketch follows this list);
do all your processing on top of the staging folder and, once you're done with your Spark job on EMR, clean the staging folder.
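A rough sketch of the Lambda copy step, assuming S3 event notifications delivered to SQS; the queue URL, bucket name, and staging/ prefix are placeholders:

import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-files-queue"  # placeholder
BUCKET = "my-bucket"                                                            # placeholder
MAX_FILES = 5000

def handler(event, context):
    copied = 0
    while copied < MAX_FILES:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            # Each SQS message body carries the S3 event notification records.
            for rec in json.loads(msg["Body"]).get("Records", []):
                key = rec["s3"]["object"]["key"]
                s3.copy_object(
                    Bucket=BUCKET,
                    CopySource={"Bucket": BUCKET, "Key": key},
                    Key="staging/" + key.rsplit("/", 1)[-1],
                )
                copied += 1
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    return {"copied": copied}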

Reading csv file from s3 using pyarrow

I want to read a CSV file located in an S3 bucket using pyarrow and convert it to Parquet in another bucket.
I am facing a problem reading the CSV file from S3. I tried the code below but it failed. Does pyarrow support reading CSV from S3?
from pyarrow import csv
s3_input_csv_path='s3://bucket1/0001.csv'
table=csv.read_csv(s3_input_csv_path)
This throws the error:
"errorMessage": "Failed to open local file 's3://bucket1/0001.csv', error: No such file or directory",
I know we can read the CSV file using boto3, use pandas to convert it into a dataframe, and finally convert it to Parquet using pyarrow. But with that approach pandas also has to be added to the package, which pushes the package size beyond Lambda's 250 MB limit when combined with pyarrow.
Try passing a file handle to pyarrow.csv.read_csv instead of an S3 file path.
Note that future versions of pyarrow will have built-in S3 support, but I am not sure of the timeline (and any answer I provide here will quickly grow out of date, given the nature of Stack Overflow).
import pyarrow.parquet as pq
from pyarrow import csv
from s3fs import S3FileSystem

s3 = S3FileSystem()  # or S3FileSystem(key=ACCESS_KEY_ID, secret=SECRET_ACCESS_KEY)

s3_input_csv_path = "s3://bucket1/0001.csv"
# Open a file handle through s3fs and let pyarrow.csv parse it
with s3.open(s3_input_csv_path, 'rb') as f:
    table = csv.read_csv(f)
print(table)

s3_output_path = "s3://bucket2/0001"
# Writing the table to the other bucket as Parquet
pq.write_to_dataset(table=table,
                    root_path=s3_output_path,
                    filesystem=s3)
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
Example of CSV read:
import awswrangler as wr
df = wr.s3.read_csv(path="s3://...")
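Since the end goal is Parquet in another bucket, AWS Data Wrangler can also handle the write side. A short sketch with placeholder paths:

import awswrangler as wr

df = wr.s3.read_csv(path="s3://bucket1/0001.csv")           # placeholder input path
wr.s3.to_parquet(df=df, path="s3://bucket2/0001.parquet")   # placeholder output path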
It's not possible as of now. But here is a workaround: we can load the data into pandas and convert it to a pyarrow table.
import pandas as pd
import pyarrow as pa

# pandas reads s3:// paths via s3fs under the hood
df = pd.read_csv("s3://your_csv_file.csv", nrows=10)  # reading only the first 10 rows
pa.Table.from_pandas(df)
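To finish the workaround and land the result as Parquet in the second bucket, the table can be written through s3fs; a minimal sketch with placeholder bucket names:

import pyarrow.parquet as pq
from s3fs import S3FileSystem

s3 = S3FileSystem()
table = pa.Table.from_pandas(df)
with s3.open("s3://bucket2/0001.parquet", "wb") as f:   # placeholder output path
    pq.write_table(table, f)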