PySpark Writing DataFrame Partitions to S3

I've been trying to partition and write a Spark DataFrame to S3, but I get an error.
df.write.partitionBy("year","month").mode("append")\
.parquet('s3a://bucket_name/test_folder/')
Error message is:
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception:
Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxxxxx,
AWS Error Code: SignatureDoesNotMatch,
AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.
However, when I simply write without partitioning, it does work:
df.write.mode("append").parquet('s3a://bucket_name/test_folder/')
What could be causing this problem?

I resolved this problem by upgrading from aws-java-sdk:1.7.4 to aws-java-sdk:1.11.199 and hadoop-aws:2.7.7 to hadoop-aws:3.0.0 in my spark-submit.
I set this in my python file using:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.11.199,org.apache.hadoop:hadoop-aws:3.0.0 pyspark-shell'
But you can also provide them as arguments to spark-submit directly.
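If you would rather keep everything in Python, the same package coordinates can also be supplied through the spark.jars.packages config when the session is built. A minimal sketch, assuming no SparkContext has been started yet in the process:
from pyspark.sql import SparkSession

# spark.jars.packages only takes effect if it is set before the first
# SparkContext/SparkSession is created in this process
spark = (SparkSession.builder
         .appName("s3a-partitioned-write")
         .config("spark.jars.packages",
                 "com.amazonaws:aws-java-sdk:1.11.199,"
                 "org.apache.hadoop:hadoop-aws:3.0.0")
         .getOrCreate())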
I had to rebuild Spark providing my own version of Hadoop 3.0.0 to avoid dependency conflicts.
You can read some of my speculation as to the root cause here: https://stackoverflow.com/a/51917228/10239681

Related

AWS S3 Connection in druid

I have set up a clustered Druid deployment with the configuration described in the Druid documentation:
https://druid.apache.org/docs/latest/tutorials/cluster.html
I am using AWS S3 for deep storage. The following is a snippet of my common configuration file:
druid.extensions.loadList=["druid-datasketches", "mysql-metadata-storage", "druid-s3-extensions", "druid-orc-extensions", "druid-lookups-cached-global"]
# For S3:
druid.storage.type=s3
druid.storage.bucket=bucket-name
druid.storage.baseKey=druid/segments
#druid.storage.disableAcl=true
druid.storage.sse.type=s3
#druid.s3.accessKey=...
#druid.s3.secretKey=...
# For S3:
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=bucket-name
druid.indexer.logs.s3Prefix=druid/stage/indexing-logs
While running any ingestion task, I am getting an Access Denied error:
Java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: ; S3 Extended Request ID: ), S3 Extended Request ID:
at org.apache.druid.storage.s3.S3DataSegmentPusher.push(S3DataSegmentPusher.java:103) ~[?:?]
at org.apache.druid.segment.realtime.appenderator.AppenderatorImpl.lambda$mergeAndPush$4(AppenderatorImpl.java:791) ~[druid-server-0.19.0.jar:0.19.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:87) ~[druid-core-0.19.0.jar:0.19.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:115) ~[druid-core-0.19.0.jar:0.19.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:105) ~[druid-core-0.19.0.jar:0.19.0]
I am using S3 for two purposes:
1. To read data from S3 and ingest it. This connection is working fine and data is being read from the S3 location.
2. For deep storage. This is where I am getting the error.
I am using the profile information authentication method to provide the S3 credentials, so I have already configured the AWS CLI with the appropriate credentials. Also, the S3 data is encrypted with AES256, so I have added druid.storage.sse.type=s3 to the config file.
Can someone help me out here, as I am not able to debug the issue?
You asked how to approach debugging this. Normally I would:
1. SSH onto the EC2 instance and run aws sts get-caller-identity. This will tell you which principal your requests are sent from. Then I would confirm that principal has the S3 access that is expected.
2. I would confirm that I can write to the bucket in your configuration (see the boto3 sketch after these steps):
druid.storage.type=s3
druid.storage.bucket=<bucket-name>
druid.storage.baseKey=druid/segments
3. I would try some of the other auth methods, such as exporting the keys into the environment (mentioned in the third option), since that is a simple test. Then I would run step 1 again to confirm my principal reflects those keys, and then I would try running your code again.
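A minimal boto3 sketch of steps 1 and 2, assuming the bucket and baseKey from the configuration in the question (the test key name is made up):
import boto3

# Step 1: which principal are the requests actually signed as?
print(boto3.client("sts").get_caller_identity()["Arn"])

# Step 2: can that principal write under the deep-storage prefix with SSE-S3?
boto3.client("s3").put_object(
    Bucket="bucket-name",              # bucket from the question's config
    Key="druid/segments/_write_test",  # made-up test key under baseKey
    Body=b"test",
    ServerSideEncryption="AES256",
)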

S3 Access error with Emr with Spark 2.4.4 and Scala

I am trying to access an S3 file in Spark on EMR using Scala code and am getting the error below.
EMR Configuration (screenshot not reproduced here)
Scala Code
val hadoopConf = sparkContext.hadoopConfiguration
if (baseDirectory.startsWith("s3:")) {
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", jobProperties.getAs[String](S3_ACCESS_KEY_ID))
hadoopConf.set("fs.s3.awsSecretAccessKey", jobProperties.getAs[String](S3_SECRET_ACCESS_KEY))
}
org.apache.hadoop.fs.FileSystem.get(new java.net.URI(baseDirectory), hadoopConf)
ERROR
20/03/28 15:18:06 ERROR Client: Application diagnostics message: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: s3n://r10x-tlog/occ/gzip/test_$folder$
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:449)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at
I have checked the Spark and Hadoop jars (Hadoop libraries screenshot not reproduced here).
Could you please help?
Instead of s3n, please use s3; s3a and s3n are not supported on EMR.
Also make sure your EMR_IAM_Role has access to that S3 bucket.
You should use EMRFS instead of s3a or s3n because it is the native implementation for using S3 as a file system.
Using EMRFS, you don't need to supply credentials to use S3; you just need to grant the permissions to EMR_EC2_DefaultRole.
Just give the necessary permissions for list-objects and get-object to your access key and you are good to go.
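To illustrate the EMRFS point, here is a minimal sketch in PySpark (Python rather than the Scala above), assuming the cluster's instance profile already has read access to the bucket; the path is a placeholder:
from pyspark.sql import SparkSession

# On EMR, EMRFS signs s3:// requests with the cluster's instance profile,
# so no access keys are set in hadoopConfiguration here.
spark = SparkSession.builder.appName("emrfs-read").getOrCreate()
df = spark.read.parquet("s3://your-bucket/your-prefix/")  # hypothetical path
df.show(5)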

Pyspark not using TemporaryAWSCredentialsProvider

I'm trying to read files from S3 with PySpark using temporary session credentials, but I keep getting the error:
Received error response: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: null, AWS Request ID: XXXXXXXX, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: XXXXXXX
I think the issue might be that the S3A connection needs to use org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider in order to pull in the session token in addition to the standard access key and secret key. But even after setting the fs.s3a.aws.credentials.provider configuration variable, it still attempts to authenticate with BasicAWSCredentialsProvider. Looking at the logs, I see:
DEBUG AWSCredentialsProviderChain:105 - Loading credentials from BasicAWSCredentialsProvider
I've followed the directions here to add the necessary configuration values, but they do not seem to make any difference. Here is the code I'm using to set it up:
import os
import sys
import pyspark
from pyspark.sql import SQLContext
from pyspark.context import SparkContext
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.11.83,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
sc = SparkContext()
sc.setLogLevel("DEBUG")
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
sc._jsc.hadoopConfiguration().set("fs.s3a.session.token", os.environ.get("AWS_SESSION_TOKEN"))
sql_context = SQLContext(sc)
Why is TemporaryAWSCredentialsProvider not being used?
Which Hadoop version are you using?
S3A STS support was added in Hadoop 2.8.0, and this was the exact error message I got on Hadoop 2.7.
Wafle is right, it's 2.8+ only.
But you might be able to get away with setting the AWS_ environment variables and having the session secrets picked up that way: AWS environment variable support has long been in there, and I think it will pick up AWS_SESSION_TOKEN.
See AWS docs
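For reference, a minimal variant of the setup from the question, assuming you can move to a Hadoop 2.8+ build of hadoop-aws (the version below is illustrative and must match your Spark build's Hadoop version):
import os
from pyspark.context import SparkContext

# hadoop-aws 2.8+ includes TemporaryAWSCredentialsProvider with session-token support
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.hadoop:hadoop-aws:2.8.5 pyspark-shell'
sc = SparkContext()
conf = sc._jsc.hadoopConfiguration()
conf.set("fs.s3a.aws.credentials.provider",
         "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
conf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
conf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
conf.set("fs.s3a.session.token", os.environ["AWS_SESSION_TOKEN"])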

Server Side Encryption in to_csv function

I'm getting this error while using to_csv("s3://mys3bucket/result.csv"):
Exception: [Errno Write Failed: mys3bucket/result.csv/2489.part]
An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
It may have been caused by Dask not using server-side encryption. Please tell me how I can make it use SSE, or some other method to successfully write the file to the S3 bucket.
Expanding on #user32185's comment, for controlling SSE via Dask, your call should look something like this:
df.to_csv("s3://mys3bucket/result.csv",
          storage_options={'s3_additional_kwargs':
                           {'ServerSideEncryption': 'AES256'}})
where the specifics of SSE with s3fs are detailed here. Note that you may also require other keywords from the same docs page, for credentials, storage zone, etc. The parameters are passed to the S3FileSystem constructor, and you can delve into the boto docs to see what everything means.
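A slightly fuller, self-contained sketch, assuming the bucket from the question and credentials picked up from the default boto chain; note that Dask writes one CSV part per partition, so a globbed path is usually clearer than a single file name:
import pandas as pd
import dask.dataframe as dd

# Tiny example frame; in practice df is whatever you computed
df = dd.from_pandas(pd.DataFrame({"a": [1, 2, 3]}), npartitions=1)
df.to_csv("s3://mys3bucket/result-*.csv",
          storage_options={
              # credentials can also be passed here ('key', 'secret', 'token'),
              # otherwise the default boto credential chain is used
              "s3_additional_kwargs": {"ServerSideEncryption": "AES256"},
          })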

Redshift Spectrum / The bucket you are attempting to access must be addressed using the specified endpoint

I created a parquet file in S3 and an external table pointing to it in Redshift Spectrum. Both my S3 bucket and Redshift cluster are in us-west-2, and I specified the region option when creating the schema.
Queries run smoothly in Athena.
Yet when I run from Redshift client, I get this error:
Amazon Invalid operation: S3 Query Exception (Fetch)
Details:
error: S3 Query Exception (Fetch)
code: 15001
context: Task failed due to an internal error.
HTTP response error code: 301 Message: PermanentRedirect The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
x-amz-request-id: XXXX
query: XXXXX
location: dory_util.cpp:689
process: query0_40 [pid=XXX]
-----------------------------------------------;
AWS has acknowledged the issue and released a patch overnight.
Please make sure that your Redshift cluster is running at least version 1.0.14016 in us-east-2 or us-west-2, and 1.0.1407 in us-east-1. To apply the patch immediately, move your cluster's maintenance window closer to the current time and day to pick it up at your convenience.
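If you prefer to move the maintenance window from code rather than the console, here is a minimal boto3 sketch (the cluster identifier and window values are placeholders):
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")
# Pull the preferred maintenance window forward so the patch is applied sooner
redshift.modify_cluster(
    ClusterIdentifier="my-cluster",                     # placeholder identifier
    PreferredMaintenanceWindow="sat:03:00-sat:03:30",   # ddd:hh24:mi-ddd:hh24:mi, UTC
)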