Reading csv file from s3 using pyarrow - pandas

I want to read a CSV file located in an S3 bucket using pyarrow and convert it to Parquet in another bucket. I am facing a problem reading the CSV file from S3. I tried the code below but it failed. Does pyarrow support reading CSV from S3?
from pyarrow import csv
s3_input_csv_path='s3://bucket1/0001.csv'
table=csv.read_csv(s3_input_csv_path)
This throws the following error:
"errorMessage": "Failed to open local file 's3://bucket1/0001.csv', error: No such file or directory",
I know we can read the CSV file using boto3, then use pandas to convert it into a data frame, and finally convert to Parquet using pyarrow. But in that approach pandas also has to be added to the package, which pushes the package size beyond the 250 MB Lambda limit when bundled together with pyarrow.

Try passing a file handle to pyarrow.csv.read_csv instead of an S3 file path.
Note that future editions of pyarrow will have built-in S3 support but I am not sure of the timeline (and any answer I provide here will grow quickly out of date with the nature of StackOverflow).

from pyarrow import csv
import pyarrow.parquet as pq
from s3fs import S3FileSystem
s3 = S3FileSystem()  # or S3FileSystem(key=ACCESS_KEY_ID, secret=SECRET_ACCESS_KEY)
s3_input_csv_path = "s3://bucket1/0001.csv"
# Open a file handle on S3 and hand it to pyarrow's CSV reader
with s3.open(s3_input_csv_path, "rb") as f:
    table = csv.read_csv(f)
print(table)
# Writing the table as Parquet to another bucket
s3_output_path = "s3://bucket2/0001/"
pq.write_to_dataset(table=table,
                    root_path=s3_output_path,
                    filesystem=s3)

AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
Example of CSV read:
import awswrangler as wr
df = wr.s3.read_csv(path="s3://...")
Reference
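Since the question also needs the data written out as Parquet to a second bucket, awswrangler can handle that step as well; a minimal sketch (bucket names and paths are illustrative):
import awswrangler as wr
df = wr.s3.read_csv(path="s3://bucket1/0001.csv")
# Write the frame back out as a Parquet dataset in the target bucket
wr.s3.to_parquet(df=df, path="s3://bucket2/0001/", dataset=True)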

It's not possible as of now. But here is a workaround: we can load the data into pandas and cast it to a pyarrow table.
import pandas as pd
import pyarrow as pa
df = pd.read_csv("s3://your_csv_file.csv", nrows=10)  # reading 10 lines
table = pa.Table.from_pandas(df)
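The table can then be written out as Parquet to the destination bucket, for example via s3fs as in the earlier answer; a rough sketch (the output path is illustrative):
import pyarrow.parquet as pq
from s3fs import S3FileSystem
s3 = S3FileSystem()
# Write the pyarrow table into the second bucket as a Parquet dataset
pq.write_to_dataset(table=table, root_path="s3://bucket2/0001/", filesystem=s3)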

Related

Problem reading pandas csv file into Python

I have a very elementary CSV reading program which does not work:
import pandas as pd
# Reading the tips.csv file
data = pd.read_csv('tips.csv')
The error messages are long and end with tips.csv not found.
Is your csv file in the same folder?
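One quick way to check is to print the directory the script is actually running from, and pass the full path if the file lives elsewhere; a small sketch (the example path is only a placeholder):
import os
import pandas as pd
print(os.getcwd())  # this is the folder 'tips.csv' is resolved against
# If the file is not in that folder, point pandas at its full path instead
data = pd.read_csv('/full/path/to/tips.csv')  # placeholder path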

Kubeflow Error in handling large input file: The node was low on resource: ephemeral-storage

In Kubeflow, when the input file size is really large (60 GB), I am getting 'The node was low on resource: ephemeral-storage.' It looks like Kubeflow is using the /tmp folder to store the files. I have the following questions:
What is the best way to exchange really large files? How do I avoid the ephemeral-storage issue?
Are all the InputPath and OutputPath files stored in the MinIO instance of Kubeflow? If yes, how can we purge the data from MinIO?
When data is passed from one stage of the workflow to the next, does Kubeflow download the file from MinIO, copy it to the /tmp folder, and pass the InputPath to the function?
Is there a better way to pass a pandas dataframe between different stages of the workflow? Currently I export the pandas dataframe as CSV to the OutputPath of the operation and reload it from the InputPath in the next stage.
Is there a way to use a different volume for file exchange instead of ephemeral storage? If yes, how can I configure it? (A possible approach is sketched after the code below.)
import pandas as pd
print("text_path:", text_path)
pd_df = pd.read_csv(text_path)
print(pd_df)
with open(text_path, 'r') as reader:
    for line in reader:
        print(line, end='')
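On the last question, one way to avoid filling the node's ephemeral /tmp is to mount a persistent volume and keep large intermediate files there. A minimal sketch, assuming the KFP v1 SDK; the resource name, size, image, and command are illustrative and not taken from the question:
from kfp import dsl
@dsl.pipeline(name='large-file-pipeline')
def pipeline():
    # Provision a PVC so large intermediate files land on a dedicated volume
    # instead of the node's ephemeral storage under /tmp
    vop = dsl.VolumeOp(
        name='create-volume',
        resource_name='shared-data',  # illustrative name
        size='100Gi',
        modes=dsl.VOLUME_MODE_RWO,
    )
    step = dsl.ContainerOp(
        name='producer',
        image='python:3.8',  # illustrative image
        command=['sh', '-c', 'echo "col1,col2" > /data/output.csv'],
    ).add_pvolumes({'/data': vop.volume})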

Can you use xr.open_mfdataset when reading files from S3 via s3fs?

I'm trying to read multiple netCDF files at once using xr.open_mfdataset from an S3 bucket, using s3fs. Is this possible?
I tried the code below, which works with xr.open_dataset for a single file, but doesn't work for multiple files:
import s3fs
import xarray as xr
fs = s3fs.S3FileSystem(anon=False)
s3path = 's3://my-bucket/wind_data*'
store = s3fs.S3Map(root=s3path, s3=s3fs.S3FileSystem(), check=False)
data = xr.open_mfdataset(store, combine='by_coords')
I'm not sure exactly what S3Map does; the s3fs documentation isn't specific on this.
However, I was able to create a working implementation of this within a Jupyter environment using S3FileSystem.glob() and S3FileSystem.open().
Here's a code sample:
import s3fs
import xarray as xr
s3 = s3fs.S3FileSystem(anon=False)
# This generates a list of strings with filenames
s3path = 's3://your-bucket/your-folder/file_prefix*'
remote_files = s3.glob(s3path)
# Iterate through remote_files to create a fileset
fileset = [s3.open(file) for file in remote_files]
# This works
data = xr.open_mfdataset(fileset, combine='by_coords')

pypi sas7bdat to_data_frame taking too long for large data(5 GB)

I have a 5 GB SAS file and the requirement is to create a Parquet file in Hadoop. I am using the SAS7BDAT library with the following approach, which takes more than 5 hours to create the pandas dataframe when running pyspark in client mode. Curious to know if there is a better way of doing the same.
I know there is the saurfang package available which is more efficient in this case, but we do not want to use any 3rd party software.
f = sas7bdat.SAS7BDAT(str(source_file))
pandas_df = f.to_data_frame()
spark_df = spark.createDataFrame(pandas_df)
del pandas_df
spark_df.write.save(dest_file,format='parquet', mode='Overwrite')
Please use Spark to read the file, not Pandas
https://github.com/saurfang/spark-sas7bdat/blob/master/README.md#python-api
Add this to your packages
saurfang:spark-sas7bdat:2.1.0-s_2.11
Note, I've not personally used this; I only searched for "SAS 7B DAT + Spark". If you have issues, please report them here:
https://github.com/saurfang/spark-sas7bdat/issues/
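Based on the README linked above, the read looks roughly like the snippet below; the format string comes from that README, and source_file/dest_file reuse the variables from the question:
# Submit with: --packages saurfang:spark-sas7bdat:2.1.0-s_2.11
spark_df = spark.read.format('com.github.saurfang.sas.spark').load(str(source_file))
spark_df.write.save(dest_file, format='parquet', mode='Overwrite')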

How to read tabular data on s3 in pyspark?

I have some tab separated data on s3 in a directory s3://mybucket/my/directory/.
Now, I am telling pyspark that I want to use \t as the delimiter to read in just one file like this:
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext, Row
from pyspark.sql.types import *
from datetime import datetime
from pyspark.sql.functions import col, date_sub, log, mean, to_date, udf, unix_timestamp
from pyspark.sql.window import Window
from pyspark.sql import DataFrame
sc = SparkContext()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
indata_creds = sqlContext.read.load('s3://mybucket/my/directory/onefile.txt').option("delimiter", "\t")
But it is telling me: assertion failed: No predefined schema found, and no Parquet data files or summary files found under s3://mybucket/my/directory/onefile.txt
How do I tell pyspark that this is a tab-delimited file and not a parquet file?
Or, is there an easier way to do read in these files in the entire directory all at once?
Thanks.
EDIT: I am using pyspark version 1.6.1.
The files are on s3, so I am not able to use the usual:
indata_creds = sqlContext.read.text('s3://mybucket/my/directory/')
because when I try that, I get java.io.IOException: No input paths specified in job
Anything else I can try?
Since you're using Apache Spark 1.6.1, you need spark-csv to use this code:
indata_creds = sqlContext.read.format('com.databricks.spark.csv').option('delimiter', '\t').load('s3://mybucket/my/directory/onefile.txt')
That should work!
Another option is, for example, this answer: instead of splitting by the comma, you could split by tabs and then load the RDD into a dataframe (a rough sketch follows below). However, the first option is easier and already loads it into a dataframe.
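A rough sketch of that RDD route, assuming every line has the same number of tab-separated fields (the column names are placeholders):
rdd = sc.textFile('s3://mybucket/my/directory/onefile.txt')
split_rdd = rdd.map(lambda line: tuple(line.split('\t')))
# Replace the placeholder column names with the real ones (or pass a full schema)
indata_creds = sqlContext.createDataFrame(split_rdd, ['col1', 'col2', 'col3'])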
As for the alternative in your comment, I wouldn't convert it to Parquet files. There is no need for it unless your data is really huge and compression is necessary.
For your second question in the comment, yes, it is possible to read the entire directory. Spark supports regex/glob, so you could do something like this:
indata_creds = sqlContext.read.format('com.databricks.spark.csv').option('delimiter', '\t').load('s3://mybucket/my/directory/*.txt')
By the way, why are you not using 2.x.x? It's also available on AWS.
The actual problem was that I needed to add my AWS keys to my spark-env.sh file.
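For reference, credentials can also be set from the PySpark side via the Hadoop configuration instead of spark-env.sh; a sketch assuming the s3a connector is in use (the property names differ for the s3/s3n schemes, and the key values are placeholders):
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder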