Server Side Encryption in to_csv function - amazon-s3

I'm getting this error while using to_csv("s3://mys3bucket/result.csv")
Exception: [Errno Write Failed: mys3bucket/result.csv/2489.part]
An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
It may have been caused by Dask not using server-side encryption. Please tell me how I can make it use SSE, or some other method, to successfully write the file to the S3 bucket.

Expanding on #user32185's comment, for controlling SSE via Dask, your call should look something like this:
to_csv("s3://mys3bucket/result.csv",
storage_option={'s3_additional_kwargs':
{'ServerSideEncryption': 'AES256'}})
where the specifics of SSE with s3fs are detailed here. Note that you may also require other keywords from the same docs page, for credentials, storage zone, etc. The parameters are passed to the S3FileSystem constructor, and you can delve into the boto docs to see what everything means.
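For a fuller picture, here is a sketch of the same call with the optional credential and region keywords spelled out (the credentials, region, and input file are placeholders; only s3_additional_kwargs is needed for SSE):

import dask.dataframe as dd

df = dd.read_csv("input-*.csv")  # placeholder input; any Dask DataFrame works
df.to_csv(
    "s3://mys3bucket/result.csv",  # Dask writes one part file per partition under this path
    storage_options={
        "key": "YOUR_ACCESS_KEY",      # optional: explicit credentials
        "secret": "YOUR_SECRET_KEY",
        "client_kwargs": {"region_name": "eu-west-2"},  # optional: bucket region
        "s3_additional_kwargs": {"ServerSideEncryption": "AES256"},  # SSE-S3
    },
)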

Related

Which error codes for SFTP errors in Mulesoft 3?

I have developed an application in Mule 3 to transform data and then upload it as a file to an SFTP location. I have included handling for all the common errors, such as the HTTP 400 series and 500, but what is the proper status code for handling SFTP failures, for example on file upload, connection, or permission errors?
I have searched a lot on the internet and the more I search the more I get lost.
Does anyone have experience with this?
Thanks
If you are asking for a table for mapping error codes between SFTP and HTTP, there is no standard for it. These are completely different protocols. You have to define your own mapping. Most of them will probably be 5xx in HTTP, with authentication errors probably 403.
Not sure which connector version you use, but if you open the documentation of the SFTP connector, for example https://docs.mulesoft.com/sftp-connector/1.4/sftp-documentation, you can see that it describes the errors each operation can throw; the copy operation, for instance, documents the errors it may raise.
Base your error-handling logic on those errors. The HTTP connector throws errors in the same way, but in the HTTP namespace. If needed, you can also remap errors to a different, new namespace and implement your logic based on the remapped errors.
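To make "define your own mapping" concrete, here is a purely illustrative sketch in Python; the SFTP error identifiers are assumptions based on the connector documentation, and the HTTP codes are suggestions you should adapt to your own API contract:

# Illustrative only: there is no standard SFTP-to-HTTP mapping.
SFTP_TO_HTTP_STATUS = {
    "SFTP:ACCESS_DENIED": 403,        # permission problems on the remote path
    "SFTP:ILLEGAL_PATH": 404,         # target path invalid or missing
    "SFTP:FILE_ALREADY_EXISTS": 409,  # conflict when writing a file
    "SFTP:CONNECTIVITY": 503,         # connection could not be established
    "SFTP:RETRY_EXHAUSTED": 504,      # gave up after the configured retries
}

def http_status_for(sftp_error_type: str) -> int:
    # Fall back to a generic server error for anything unmapped.
    return SFTP_TO_HTTP_STATUS.get(sftp_error_type, 500)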

dms s3 source endpoint connection fails

Getting the connection error below when trying to validate the S3 source endpoint of DMS.
Test Endpoint failed: Application-Status: 1020912, Application-Message: Failed to connect to database.
I followed all the steps listed in the links below, but maybe I am still missing something...
https://aws.amazon.com/premiumsupport/knowledge-center/dms-connection-test-fail-s3/
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.S3.html
The role associated with the endpoint does have access to the endpoint's S3 bucket, and DMS is listed as a trusted entity.
I got this same error when trying to use S3 as a target.
The one thing not mentioned in the documentation, and which turned out to be the root cause for my error, is that the DMS Replication Instance and the Bucket need to be in the same region.
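If you want to double-check this programmatically, a minimal boto3 sketch (bucket name is a placeholder; assumes default credentials) comparing the bucket's region with each replication instance's region could look like this:

import boto3

s3 = boto3.client("s3")
dms = boto3.client("dms")

bucket = "my-dms-source-bucket"  # placeholder: the bucket behind your endpoint
# get_bucket_location returns None for us-east-1
bucket_region = s3.get_bucket_location(Bucket=bucket)["LocationConstraint"] or "us-east-1"

for inst in dms.describe_replication_instances()["ReplicationInstances"]:
    # The region is the fourth field of the instance ARN: arn:aws:dms:<region>:...
    instance_region = inst["ReplicationInstanceArn"].split(":")[3]
    status = "OK" if instance_region == bucket_region else "REGION MISMATCH"
    print(inst["ReplicationInstanceIdentifier"], instance_region, status)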

PySpark Writing DataFrame Partitions to S3

I've been trying to partition and write a Spark DataFrame to S3, and I get an error.
df.write.partitionBy("year","month").mode("append")\
.parquet('s3a://bucket_name/test_folder/')
Error message is:
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception:
Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxxxxx,
AWS Error Code: SignatureDoesNotMatch,
AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.
However, when I simply write without partitioning it does work.
df.write.mode("append").parquet('s3a://bucket_name/test_folder/')
What could be causing this problem?
I resolved this problem by upgrading from aws-java-sdk:1.7.4 to aws-java-sdk:1.11.199 and hadoop-aws:2.7.7 to hadoop-aws:3.0.0 in my spark-submit.
I set this in my python file using:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.11.199,org.apache.hadoop:hadoop-aws:3.0.0 pyspark-shell'
But you can also provide them as arguments to spark-submit directly.
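For example, the equivalent spark-submit invocation (the script name is a placeholder) would be:

spark-submit \
  --packages com.amazonaws:aws-java-sdk:1.11.199,org.apache.hadoop:hadoop-aws:3.0.0 \
  your_job.py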
I had to rebuild Spark providing my own version of Hadoop 3.0.0 to avoid dependency conflicts.
You can read some of my speculation as to the root cause here: https://stackoverflow.com/a/51917228/10239681

Does s4cmd support Signature Version 4? I am unable to upload files to an S3 bucket (London)

I am trying to upload files to an S3 bucket in London, i.e. eu-west-2. s4cmd is not working.
s4cmd put /home/username/Documents/file-1.json s3://[BUCKETNAME]/file-1.json
The error when I run this command is:
[Exception] An error occurred (400) when calling the HeadObject operation: Bad Request
[Thread Failure] An error occurred (400) when calling the HeadObject operation: Bad Request
S3cmd works, but it is slow. s4cmd works for the US Standard region, but it does not work for the London region.
Thanks in advance.
The aws s3 cp command in the AWS Command-Line Interface (CLI) uses multi-part upload to fully utilize available bandwidth, so it should give you pretty much the best speed possible.
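For reference, an upload of the file from the question with the CLI (assuming configured credentials) would look like this; passing --region eu-west-2 pins the request to London, which only accepts Signature Version 4:

aws s3 cp /home/username/Documents/file-1.json s3://[BUCKETNAME]/file-1.json --region eu-west-2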

error when ingesting with Watson Discovery Service

Trying to ingest a 7 MB JSON file with Watson Discovery Service. When using the WDS tooling interface to ingest, the interface indicates successful ingestion, but the document then appears to have failed. The error returned when using the API was: "Your request could not be processed because of a problem on the server". The error does not really help troubleshoot the problem. Any ideas? How do we troubleshoot these problems?
Thank you
The interface indicates successful ingestion because the process is asynchronous; it actually means that the document was uploaded and submitted for ingestion.
Check whether your input JSON file has a top-level array. Currently this type of JSON file is not supported, so it might be the cause of your issue.
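A quick way to check (the file name is a placeholder) is to load the file and inspect the top-level type:

import json

with open("my_document.json") as f:  # placeholder file name
    doc = json.load(f)

if isinstance(doc, list):
    print("Top-level array: split it into one JSON object per document before ingesting.")
else:
    print("Top-level object: this structure should be accepted for ingestion.")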