How to read csv files from s3 bucket using Pyspark (in macos)? - amazon-s3

I am trying to read csv df from s3 bucket , but facing issues. Can you let me know where am I masking mistakes here ?
conf=SparkConf()
conf.setMaster('local')
conf.setAppName('sparkbasic')
sc = SparkContext.getOrCreate(conf=conf)
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "abc")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xyz")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "mybucket/path/fileeast-1.redshift.amazonaws.com")
from pyspark.sql import SparkSession
sc = SparkSession.builder.appName('sparkbasic').getOrCreate()
This is the code where I get the error
csvDf = sc.read.csv("s3a://bucket/path/file/*.csv")
This is the error I get , I tried links given in stackoverflow answers , but nothing worked me so far
ava.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Maybe you can have a look to S3Fs
Given your details, maybe a configuration like that could work:
import s3fs
fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': 'fileeast-1.redshift.amazonaws.com',
"aws_access_key_id": "abc",
"aws_secret_access_key": "xyz"})
To check if you manage to interact with s3, you can try the following command (NB: change somefile.csv to an existing one)
fs.info('s3://bucket/path/file/somefile.csv')
Note that in fs.info we start the path with s3. If you do not encounter an error, you might hope the following command works:
csvDf = sc.read.csv("s3a://bucket/path/file/*.csv")
This time you have the path begins by s3a

Related

issue with reading csv file from AWS S3 with boto3

I have a csv file with the following columns:
Name Adress/1 Adress/2 City State
When I try to read this csv file from local disk I have no issue.
But when I try to read it from S3 with the below code I get error when I use io.StringIO.
When I use io.BytesIO each record displays as one column. Though the file is a ',' separated some column do contain '/n' or '/t' in it. I believe these causing the issue.
I used AWS Wrangler with no issue. But my requirement is to read this csv file with boto3
import pandas as pd
import boto3
s3 = boto3.resource('s3', aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
my_bucket = s3.Bucket(AWS_S3_BUCKET)
csv_obj=my_bucket.Object(key=key).get().get('Body').read().decode('utf16')
data= io.BytesIO(csv_obj) #io.StringIO(csv_obj)
sdf = pd.read_csv(data,delimiter=sep,names=cols, header=None,skiprows=1)
print(sdf)
Any suggestion please?
try get_object():
obj = boto3.client('s3').get_object(Bucket=AWS_S3_BUCKET, Key=key)
data = io.StringIO(obj['Body'].read().decode('utf-8'))

Using pandas to open Excel files stored in GCS from command line

The following code snippet is from a Google tutorial, it simply prints the names of files on GCP in a given bucket:
from google.cloud import storage
def list_blobs(bucket_name):
"""Lists all the blobs in the bucket."""
# bucket_name = "your-bucket-name"
storage_client = storage.Client()
# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)
for blob in blobs:
print(blob.name)
list_blobs('sn_project_data')
No from the command line I can run:
$ python path/file.py
And in my terminal the files in said bucket are printed out. Great, it works!
However, this isn't quite my goal. I'm looking to open a file and act upon it. For example:
df = pd.read_excel(filename)
print(df.iloc[0])
However, when I pass the path to the above, the error returned reads "invalid file path." So I'm sure there is some sort of GCP specific function call to actually access these files...
What command(s) should I run?
Edit: This video https://www.youtube.com/watch?v=ED5vHa3fE1Q shows a trick to open files and needs to use StringIO in the process. But it doesn't support excel files, so it's not an effective solution.
read_excel() does not support google cloud storage file path as of now but it can read data in bytes.
pandas.read_excel(io, sheet_name=0, header=0, names=None,
index_col=None, usecols=None, squeeze=False, dtype=None, engine=None,
converters=None, true_values=None, false_values=None, skiprows=None,
nrows=None, na_values=None, keep_default_na=True, na_filter=True,
verbose=False, parse_dates=False, date_parser=None, thousands=None,
comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True,
storage_options=None)
Parameters: io : str, bytes, ExcelFile, xlrd.Book, path object, or
file-like object
What you can do is use the blob object and use download_as_bytes() to convert the object into bytes.
Download the contents of this blob as a bytes object.
For this example I just used a random sample xlsx file and read the 1st sheet:
from google.cloud import storage
import pandas as pd
bucket_name = "your-bucket-name"
blob_name = "SampleData.xlsx"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
data_bytes = blob.download_as_bytes()
df = pd.read_excel(data_bytes)
print(df)
Test done:

reading paritionned dataset in aws s3 with pyarrow doesn't add partition columns

i'm trying to read a partitionned dataset in aws s3, it looks like :
MyDirectory--code=1--file.parquet
--code=2--another.parquet
--code=3--another.parquet
i created a file_list containing the path to all the files in the directory then executed
df = pq.ParquetDataset(file_list, filesystem=fs).read().to_pandas()
everything works except that the partition column code doesn't exist in the dataframe df.
i tried it also using one path to MyDirectory insted of file_list, but found an error
"Found files in an intermediate directory: s3://bucket/Mydirectoty", i can't find any answer online.
Thank you!
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
This snippet should work:
import awswrangler as wr
# Write
wr.s3.to_parquet(
df=df,
path="s3://...",
mode="overwrite",
dataset=True,
database="my_databse", # Optional, only if you want it available on Athena/Glue Catalog
table="my_table",
partition_cols=["PARTITION_COL_NAME"])
# READ
df = wr.s3.read_parquet(path="s3://...", dataset=True)
If you're happy with other tools you can give dask a try. Assume all the data you want to read is in s3://folder you can just use
import dask.dataframe as dd
storage_options = {
'key': your_key,
'secret': your_secret}
df = dd.read_parquet("s3://folder",
storage_options=storage_options)

How to Convert Many CSV files to Parquet using AWS Glue

I'm using AWS S3, Glue, and Athena with the following setup:
S3 --> Glue --> Athena
My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data.
Since I'm using Athena, I'd like to convert the CSV files to Parquet. I'm using AWS Glue to do this right now. This is the current process I'm using:
Run Crawler to read CSV files and populate Data Catalog.
Run ETL job to create Parquet file from Data Catalog.
Run a Crawler to populate Data Catalog using Parquet file.
The Glue job only allows me to convert one table at a time. If I have many CSV files, this process quickly becomes unmanageable. Is there a better way, perhaps a "correct" way, of converting many CSV files to Parquet using AWS Glue or some other AWS service?
I had the exact same situation where I wanted to efficiently loop through the catalog tables catalogued by crawler which are pointing to csv files and then convert them to parquet. Unfortunately there is not much information available in the web yet. That's why I have written a blog in LinkedIn explaining how I have done it. Please have a read; specially point #5. Hope that helps. Please let me know your feedback.
Note: As per Antti's feedback, I am pasting the excerpt solution from my blog below:
Iterating through catalog/database/tables
The Job Wizard comes with option to run predefined script on a data source. Problem is that the data source you can select is a single table from the catalog. It does not give you option to run the job on the whole database or a set of tables. You can modify the script later anyways but the way to iterate through the database tables in glue catalog is also very difficult to find. There are Catalog APIs but lacking suitable examples. The github example repo can be enriched with lot more scenarios to help developers.
After some mucking around, I came up with the script below which does the job. I have used boto3 client to loop through the table. I am pasting it here if it comes to someone’s help. I would also like to hear from you if you have a better suggestion
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
client = boto3.client('glue', region_name='ap-southeast-2')
databaseName = 'tpc-ds-csv'
print '\ndatabaseName: ' + databaseName
Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables['TableList']
for table in tableList:
tableName = table['Name']
print '\n-- tableName: ' + tableName
datasource0 = glueContext.create_dynamic_frame.from_catalog(
database="tpc-ds-csv",
table_name=tableName,
transformation_ctx="datasource0"
)
datasink4 = glueContext.write_dynamic_frame.from_options(
frame=datasource0,
connection_type="s3",
connection_options={
"path": "s3://aws-glue-tpcds-parquet/"+ tableName + "/"
},
format="parquet",
transformation_ctx="datasink4"
)
job.commit()
Please refer to EDIT for updated info.
S3 --> Athena
Why not you use CSV format directly with Athena?
https://docs.aws.amazon.com/athena/latest/ug/supported-format.html
CSV is one of the supported formats. Also to make it efficient, you can compress multiple CSV files for faster loading.
Supported compression,
https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html
Hope it helps.
EDIT:
Why Parquet format is more helpful than CSV?
https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and
S3 --> Glue --> Athena
More details on CSV to Parquet conversion,
https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/
I'm not a big fan of Glue, nor creating schemas from data
Here's how to do it in Athena, which is dramatically faster than Glue.
This is for the CSV files:
create table foo (
id int,
name string,
some date
)
row format delimited
fields terminated by ','
location 's3://mybucket/path/to/csvs/'
This is for the parquet files:
create table bar
with (
external_location = 's3://mybucket/path/to/parquet/',
format = 'PARQUET'
)
as select * from foo
You don't need to create that path for parquet, even if you use partitioning
you can convert either JSON or CSV files into parquet directly, without importing it to the catalog first.
This is for the JSON files - the below code would convert anything hosted at the rawFiles directory
import sys
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
## #params: [JOB_NAME] args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sparkContext = SparkContext()
glueContext = GlueContext(sparkContext)
spark = glueContext.spark_session
job = Job(glueContext) job.init(args['JOB_NAME'], args)
s3_json_path = 's3://rawFiles/'
s3_parquet_path = 's3://convertedFiles/'
output = spark.read.load(s3_json_path, format='json')
output.write.parquet(s3_parquet_path)
job.commit()
Sounds like in your step 1 you are crawling the individual csv file (e.g some-bucket/container-path/file.csv), but if you instead set your crawler to look at a path level instead of a file level (e.g some-bucket/container-path/) and all your csv files are uniform then the crawler should only create a single external table instead of an external table per file and you’ll be able to extract the data from all of the files at once.

PySpark - Spark clusters EC2 - unable to save to S3

I have set up a spark cluster with a master and 2 slaves (I'm using Spark Standalone). The cluster is working well with some of the examples but not my application. My application workflow is that, it will read the csv -> extract each line in the csv along with the header -> convert to JSON -> save to S3. Here is my code:
def upload_func(row):
f = row.toJSON()
f.saveAsTextFile("s3n://spark_data/"+ row.name +".json")
print(f)
print(row.name)
if __name__ == "__main__":
spark = SparkSession \
.builder \
.appName("Python Spark SQL data source example") \
.getOrCreate()
df = spark.read.csv("sample.csv", header=True, mode="DROPMALFORMED")
df.rdd.map(upload_func)
I have also export the AWS_Key_ID and AWS_Secret_Key into the ec2 environment. However with the above code, my application does not work. Below are the issues:
The JSON files are not saved in S3, I have tried run the application few times and also reload the S3 page but no data. The application completed without any error in the log. Also, the print(f) and print(row.name) are not printed out in the log. What do I need to fix to get the JSON save on S3 and is there anyway for me to print on the log for debug purpose?
Currently I need to put the csv file in the worker node so the application can read the csv file. How can I put the file in another place, let say the master node and when the application runs, it will split the csv file to all the worker nodes so they can do the upload parallel as a distributed system?
Help is really appreciated. Thanks for your help in advance.
UPDATED
After putting Logger to debug, I have identified the issue that the map function upload_func() is not being called or the application could not get inside this function (Logger printed messages before and after function call). Please help if you know the reason why?
you need to force the map to be evaluated; spark will only execute work on demand.
df.rdd.map(upload_func).count() should do it