Is there any difference in performance when we connect to S3 via S3 API versus via Hadoop Filesystem [closed] - amazon-s3

I want to create a Java utility to read S3 bucket information.
We can connect to S3 either through the native S3 APIs or through the Hadoop filesystem layer.
Approach 1: Using S3 APIs
// credentials are a placeholder here; region is assumed to be defined elsewhere (e.g. "US_EAST_1")
AWSCredentials credentials = new BasicAWSCredentials("XXXXXXXXXXX", "XXXXXXXXXXX");
AmazonS3 s3client = AmazonS3ClientBuilder
        .standard()
        .withCredentials(new AWSStaticCredentialsProvider(credentials))
        .withRegion(Regions.valueOf(region))
        .build();
Approach 2: Using the Hadoop filesystem:
configuration.set("fs.s3a.access.key","XXXXXXXXXXX");
configuration.set("fs.s3a.secret.key","XXXXXXXXXXX");
configuration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem");
configuration.set("fs.s3a.endpoint","http://127.0.0.1:8080");
UserGroupInformation.setConfiguration(configuration);
fileSystem = new Path("s3a://"+ bucketName).getFileSystem(configuration);
When should we use which approach, and which one is more efficient for reading data?
In my observation, the filesystem route is slower, but I have not found any documentation that explains the performance difference.

Performance shouldn't be the only factor. If you want higher performance, or at least stronger consistency guarantees for file operations, look into S3Guard.
But if you have to create a Java client that will only ever talk to S3, and never needs to integrate with the Hadoop ecosystem or other Hadoop-compatible filesystems (HDFS, GCS, ADLS, etc.), then you should use the plain AWS SDK.
If you're trying to benchmark against a mocked S3 service (or MinIO) on 127.0.0.1, that's not a fair comparison to a real S3 service.
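If it helps, here is a minimal sketch of reading the same bucket listing both ways (AWS SDK v1, as in the question). The bucket name, region, and credentials are placeholders, and this is only an illustration of the two call paths, not a benchmark:
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBucketBothWays {
    public static void main(String[] args) throws Exception {
        String bucket = "my-bucket"; // placeholder bucket name

        // Approach 1: plain AWS SDK -- one ListObjects request per page of up to 1000 keys.
        BasicAWSCredentials credentials = new BasicAWSCredentials("XXXXXXXXXXX", "XXXXXXXXXXX");
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .withRegion(Regions.US_EAST_1)
                .build();
        for (S3ObjectSummary summary : s3.listObjects(bucket).getObjectSummaries()) {
            System.out.println(summary.getKey() + "  " + summary.getSize());
        }

        // Approach 2: Hadoop S3A -- the same kind of S3 requests underneath, wrapped in the
        // Hadoop FileSystem abstraction (extra configuration and metadata mapping).
        Configuration conf = new Configuration();
        conf.set("fs.s3a.access.key", "XXXXXXXXXXX");
        conf.set("fs.s3a.secret.key", "XXXXXXXXXXX");
        FileSystem fs = new Path("s3a://" + bucket + "/").getFileSystem(conf);
        for (FileStatus status : fs.listStatus(new Path("s3a://" + bucket + "/"))) {
            System.out.println(status.getPath() + "  " + status.getLen());
        }
    }
}
Both loops end up issuing the same kind of LIST requests to S3; the S3A route just adds the filesystem layer on top, which is the price you pay for Hadoop compatibility.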

Related

Rclone Compression | zip | rar and data transfer [closed]

I have been using rclone to back up Google Drive data to AWS S3 cloud storage. I have multiple Google Drive accounts whose backups go to AWS S3, each containing a different number of documents.
I want to compress those documents into a single zip file, which then needs to be copied to S3.
Is there any way to achieve this?
I referred to the link below, but it doesn't give complete steps to accomplish the task.
https://rclone.org/compress/
Any suggestion would be appreciated.
Rclone can't compress the files itself, but you can use a small script to zip or rar the files and then use rclone to back the archive up to AWS.
If this is OK, I can explain the details here.
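For what it's worth, a minimal sketch of that idea in Java is below. The source path, archive path, destination bucket, and the rclone remote name "s3remote" are all placeholders; the remote is assumed to have already been set up with rclone config:
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipAndUpload {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path sourceDir = Paths.get("/data/gdrive-backup");    // placeholder: local copy of the Drive data
        Path zipFile = Paths.get("/data/gdrive-backup.zip");  // placeholder archive path

        // 1. Zip the directory tree into a single archive.
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile));
             Stream<Path> files = Files.walk(sourceDir)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    zos.putNextEntry(new ZipEntry(sourceDir.relativize(file).toString()));
                    Files.copy(file, zos);
                    zos.closeEntry();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }

        // 2. Hand the archive to rclone; "s3remote" is a remote configured beforehand
        //    with rclone config, and "my-backup-bucket" is a placeholder bucket.
        Process p = new ProcessBuilder("rclone", "copy", zipFile.toString(), "s3remote:my-backup-bucket")
                .inheritIO()
                .start();
        System.exit(p.waitFor());
    }
}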

Migrate Redis data to Amazon DynamoDB [closed]

I'm looking for a library or a command-line tool that can help me migrate data from Redis to Amazon DynamoDB. Does anyone know a tool or library that can do the job?
Thanks!
I would suggest you have a look at redis-rdb-tools to extract data from Redis.
This package can dump the content of the Redis database as a JSON file. You can then use any loader tool provided by Amazon to feed their database (or write your own).
For instance, the AWS command line interface supports feeding DynamoDB with JSON:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.CLI.html
http://docs.aws.amazon.com/cli/latest/reference/dynamodb/index.html
You may have to transform the JSON file in order to use the AWS CLI commands, though.
Amazon's recommended way to bulk-load data into DynamoDB is Amazon EMR (i.e. map/reduce jobs): http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html
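As a rough sketch of the loading half of that pipeline (AWS SDK for Java v1, document API): the code below assumes the redis-rdb-tools export has already been transformed into a file with one JSON object per line whose fields match the table's key schema; the file name and table name are placeholders.
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RedisDumpToDynamo {
    public static void main(String[] args) throws Exception {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
        Table table = new DynamoDB(client).getTable("migrated-redis-data"); // placeholder table name

        // dump.jsonl: one JSON object per line, produced by transforming the
        // redis-rdb-tools JSON export (which emits a single JSON array).
        for (String line : Files.readAllLines(Paths.get("dump.jsonl"))) {
            if (line.trim().isEmpty()) {
                continue;
            }
            table.putItem(Item.fromJSON(line)); // each object must include the table's key attributes
        }
    }
}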

Why is my amazon s3 slow? [closed]

Is there a way to check that my files are already on the edge servers for my users to load from? Or does Amazon S3 take time to spread your files around the world? How long does it take, and can I receive a notification when it's done?
After I uploaded a file, I immediately tested the load speed by asking users in far-away places (like Japan). They said it was actually slower than my current hosting in the US. That's odd, because Amazon does have an edge server in Tokyo, so Amazon S3 should be faster, right?
Before I created my bucket, I set the region to US Standard. Is that why? If so, is there a way to make your files work around the world?
Thank you for your time.
As you already said, your S3 bucket is situated in a specific location, for example us-east, europe, us-west, etc. This is the place where your files are physically stored; they are not distributed geographically. Users from other places in the world will experience latency when requesting data from these buckets.
What you are looking for is Amazon's CloudFront CDN. You can specify an origin (your S3 bucket in this case), and your files will then be distributed to all the Amazon CloudFront edge locations worldwide. Check out their FAQ and the list of edge locations.

Amazon S3 & CloudFront high costs [closed]

We're uploading and serving/streaming media (pictures, videos) using Amazon S3 for storage combined with CloudFront for serving it. The site gets only light use, but the Amazon costs come to $3,000 per month, and according to the report 90% of the costs originate from the S3 service.
I've heard that the cloud can be expensive if you don't code the right way. Now my questions:
What is the right way? And where should I pay more attention, to the way I upload files or to the way I serve them?
Has anyone else had to deal with unexpectedly high costs? If yes, what was the cause?
We have an almost identical model: we stream (RTMP) from S3 and CloudFront. We have thousands of files and a decent load, but our monthly bill for S3 is around $50 (negligible compared to your figure). Firstly, you should raise your charges with AWS technical support; they always give a good response and also suggest better ways to utilize resources. Secondly, if you do live streaming, where you divide the file into chunks and stream them one by one instead of streaming or downloading the whole file, it can be more effective in terms of I/O when users are not watching the whole video but just part of it. Also, you can try to utilize caching at the application level.
Another way to get a better picture of what's going on in your buckets: Qloudstat.

Using amazon s3 with cloudfront as a CDN [closed]

I would like to serve user-uploaded content (pictures, videos, and other files) from a CDN. Using Amazon S3 with CloudFront seems like a reasonable way to go. My only question is about the speed of the file system. My plan was to host user media under a URI like cdn.mycompany.com/u/u/i/d/uuid.jpg.
I don't have any prior experience with S3 or CDNs, and I was just wondering whether this strategy would scale well to handle a large amount of user-uploaded content, and if there might be a more conventional way to accomplish this.
You will never have problems dealing with scale on CloudFront; it's an enterprise-grade beast.
Disclaimer: unless you're Google.
It is an excellent choice. Especially for streaming video and audio, CloudFront is priceless.
My customers use my plugin to display private streaming video and audio; one of them even has 8,000 videos in one bucket without problems.
My question stemmed from a misunderstanding of S3 buckets as a conventional file system. I was concerned that having too many files in the same directory would create overhead in finding a file. However, it turns out that S3 buckets are implemented more like a hash map, so this overhead doesn't actually exist. See here for details: Max files per directory in S3