I am trying to figure out whether it is possible, possibly with the help of other software (such as MinIO, Portworx, Veeam, etc.), to take a snapshot (and eventually restore it later) of the content stored on an IBM Cloud Object Storage instance that is used as the persistence layer for an OpenShift cluster through its S3-compatible API endpoints.

Please check out this link and see if it provides what you are looking for.

In the end I found this in the official IBM Cloud documentation, which more or less achieves what I asked here: it explains how to sync two buckets with each other, even across different regions, so that both the data and the backups live on S3 storage.
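The idea of a one-way bucket sync can be sketched in a few lines. This is only an illustration of the concept, not the procedure from the IBM documentation: the function name `plan_one_way_sync` and the use of key-to-ETag dictionaries as bucket listings are assumptions made for this example.

```python
def plan_one_way_sync(source, backup):
    """Decide what a one-way sync from `source` to `backup` would do.

    Both arguments are hypothetical listings: dicts mapping object key -> ETag.
    Returns (to_copy, to_delete), each a sorted list of keys.
    """
    # Copy objects that are missing from the backup or whose ETag differs.
    to_copy = sorted(k for k, etag in source.items() if backup.get(k) != etag)
    # Objects that exist only in the backup; a backup job may choose to keep them.
    to_delete = sorted(k for k in backup if k not in source)
    return to_copy, to_delete


if __name__ == "__main__":
    source = {"a.txt": "etag-1", "b.txt": "etag-2", "c.txt": "etag-3"}
    backup = {"a.txt": "etag-1", "b.txt": "old-etag", "d.txt": "etag-4"}
    print(plan_one_way_sync(source, backup))
```

Note that ETags of multipart uploads may differ between buckets even for identical content, so a real tool would typically also compare size or modification time.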


AWS S3 alternatives for private cloud

Right now we have a requirement to migrate from AWS to a private data center. We need to find a potential alternative to AWS S3 for storage.
Currently S3 is used in the following way:
Overall storage size is 10TB;
Min/Avg/Max object size is 0.5/2/100 MB;
We have N app instances that simultaneously write and read objects, at approximately 50 writes/sec and 30 reads/sec;
This storage should be redundant (Highly Available), Fault Tolerant, Scalable;
A naive implementation could store this data in:
Simple NFS storage, with some replication functionality added;
A NoSQL DB (for example Cassandra). However, Cassandra will require a number of instances to support this storage (it's not recommended to store more than 1 TB on a single Cassandra node; see Cassandra capacity planning).
What solution would you recommend for such a scenario?
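A rough back-of-the-envelope estimate for the Cassandra option, using the figures above. The replication factor of 3 is an assumption (it is not stated in the question), and the helper names are made up for this sketch.

```python
import math


def cassandra_nodes_needed(data_tb, replication_factor, max_tb_per_node):
    """Minimum node count so that no node stores more than max_tb_per_node."""
    total_tb = data_tb * replication_factor
    return math.ceil(total_tb / max_tb_per_node)


def average_throughput_mb_s(ops_per_sec, avg_object_mb):
    """Average data rate implied by an operation rate and a mean object size."""
    return ops_per_sec * avg_object_mb


if __name__ == "__main__":
    # 10 TB of data, assumed replication factor 3, at most 1 TB per node.
    print("nodes:", cassandra_nodes_needed(10, 3, 1))
    # 50 writes/sec and 30 reads/sec at an average object size of 2 MB.
    print("write MB/s:", average_throughput_mb_s(50, 2))
    print("read MB/s:", average_throughput_mb_s(30, 2))
```

Under these assumptions the 1 TB per node guideline alone implies about 30 nodes, while the average traffic is around 100 MB/s of writes and 60 MB/s of reads.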
Using MinIO is your best bet if you want private cloud storage. It is AWS S3 compatible, meaning that applications that use AWS S3 can be migrated to MinIO seamlessly. They have a tutorial on how to connect the AWS CLI to a MinIO server, and you can test it against the publicly hosted MinIO server. Please refer to AWS CLI with MinIO Server.
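Because MinIO speaks the S3 API, an existing S3 client can usually be pointed at it just by overriding the endpoint. A minimal sketch with boto3 (a third-party package that must be installed separately); the endpoint URL and the credentials below are placeholders, not real values, and the helper names are made up for this example.

```python
def minio_client_kwargs(endpoint_url, access_key, secret_key):
    """Build the keyword arguments for boto3.client() targeting an S3-compatible server."""
    return {
        "service_name": "s3",
        "endpoint_url": endpoint_url,  # overrides the default AWS endpoint
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }


def make_minio_client(endpoint_url, access_key, secret_key):
    """Create an S3 client that talks to MinIO instead of AWS."""
    import boto3  # imported lazily so the helper above works without boto3

    return boto3.client(**minio_client_kwargs(endpoint_url, access_key, secret_key))


# Example usage (assumes a MinIO server listening on localhost:9000):
#   s3 = make_minio_client("http://localhost:9000", "ACCESS_KEY", "SECRET_KEY")
#   print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```

The rest of the application code (uploads, downloads, listings) should not need to change, which is what makes the migration straightforward.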
You can build a highly available storage system using MinIO's distributed setup. Beware that dynamic expansion is not a feature of the MinIO distributed setup. If you want to expand your cluster, you end up spinning up a new cluster with your desired number of servers/disks and then migrating your data from the old one to the new one.
I find it much easier to use than HDFS. In addition, a lot of technologies outside the Hadoop ecosystem lack HDFS integration. For example, Docker Registry lacks a built-in HDFS storage driver. However, it has an S3 driver, so you can use MinIO as its object storage.
There are a bunch of options for an S3-compatible private cloud service. If you like open source solutions, OpenStack and Cassandra, mentioned above, are good ones. Note that no matter what you use, you will probably end up setting up a cloud with multiple nodes; this is the unavoidable price of redundancy and availability. There are some good and economical commercial products as well, such as the one from Cloudian.
If you need an object store, I can recommend Elliptics (in English).
As far as I know, it has no limit on disk storage.
For Cassandra, we use SSDs (for better performance) of less than 200-500 GB each. The ring size depends on your requirements (read/write latency, replication rate, time to live).
50 writes/sec, 30 reads/sec
This is really quite easy for Cassandra, judging by our own setup.
In that case it depends more on the time to live of your objects.
More generally, for distributed network storage you could also look at GlusterFS.
You can use OpenStack Swift.
Swift is a highly available, distributed, eventually consistent object/blob store. Organizations can use Swift to store lots of data efficiently, safely, and cheaply.

Is there any EMRFS SDK to access S3 from an EC2 machine?

I've read this and I conclude that EMRFS is only available if I am using an AWS EMR machine.
I am asking this because I am interested in EMRFS's read-after-write consistency for S3.
I would just like to add new input to this question: there is an in-progress community effort that provides a consistent S3 model in Hadoop: S3Guard: Improved Consistency for S3A.
As the description says:
This issue proposes S3Guard, a new feature of S3A, to provide an option for a stronger consistency model than what is currently offered. The solution coordinates with a strongly consistent external store to resolve inconsistencies caused by the S3 eventual consistency model.
For more information, please refer to the design doc.
This will be part of the Hadoop distribution in the next release, probably Hadoop 3.0.
UPDATE: Steve has kindly backported it to Hadoop 2.9.
It would take a bit more manual configuration, but you can get a similar setup as EMRFS + EMR's consistent view by using the existing open-source NativeS3FileSystem alongside Netflix's s3mper, which uses the same DynamoDB-backed configuration as EMRFS.
If you are looking just for read after write consistency then you can just use S3 as-is (all regions support read after write consistency) with EMR. The catch is for US-Standard buckets just set in EMR and use that same endpoint on all non-EMR applications.

Amazon S3 vs Windows Azure Blob Storage

I went through the docs of both Azure and Amazon S3, but I am confused about a few things:
Both of these try to solve the same problem, i.e. storage in the cloud. My question is: when should I use which? That is, when is Azure preferred and when is Amazon S3 preferred? I googled hard and couldn't find any substantial resource on this. I would really appreciate it if someone could enlighten me.
I want to consider the following parameters as the basis for choosing my cloud provider:
1) Latency
2) Scalability
3) Size of each file
4) Cost & Performance
5) Files which can be accessed quite randomly.
These are a few parameters I have considered. It would be great if you could suggest additional parameters to consider.
There are many studies online. You should evaluate it by yourself based on your workload and scenario.
One of the many reports says that Azure is good at small files:
As I understand it, this is because Azure Blob is designed to be a unified storage, so it is optimized for small block access. S3, by contrast, originates from web storage.
On the other hand, S3 is better at scaling up, since Azure has a per-account limit.

Monitoring AWS account spends

I am planning to build a dashboard to monitor AWS expenditure. After googling, I realized AWS has no API that developers can hook into to build an app that gets the real-time data. Is there any way to achieve this? Kindly help me out.
I believe you are looking to monitor current AWS usage.
1) AWS provides an option for this through "AWS programmatic billing access".
Once you enable it, AWS will upload a CSV file of your current usage every few hours to the specified S3 bucket.
You then need to write a program, using the AWS S3 SDK for your favourite programming language, to download and parse the CSV file and get the real-time data.
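Once the CSV file has been downloaded from the bucket, the parsing step could look roughly like this. It is only a sketch: the column names `ProductName` and `TotalCost` are assumptions and should be checked against the header of your actual billing file, and the sample data is made up.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def cost_per_service(csv_text, service_column="ProductName", cost_column="TotalCost"):
    """Sum the cost column per service from billing CSV text.

    Rows with an empty service or cost field (e.g. summary lines) are skipped.
    A non-numeric cost value raises decimal.InvalidOperation.
    """
    totals = defaultdict(Decimal)  # missing keys start at Decimal(0)
    for row in csv.DictReader(io.StringIO(csv_text)):
        service = (row.get(service_column) or "").strip()
        cost = (row.get(cost_column) or "").strip()
        if not service or not cost:
            continue
        totals[service] += Decimal(cost)
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample data; the last row stands in for a summary line.
    sample = (
        "ProductName,TotalCost\n"
        "Amazon Simple Storage Service,1.25\n"
        "Amazon Elastic Compute Cloud,10.00\n"
        "Amazon Simple Storage Service,0.75\n"
        ",12.00\n"
    )
    for service, total in sorted(cost_per_service(sample).items()):
        print(service, total)
```

Using `Decimal` rather than `float` avoids rounding drift when many small charges are added up.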
Newvem has a very good set of how-to guides for working with AWS.
2) As mentioned by Mike, AWS also provides a way to get billing alerts using CloudWatch.
I hope the above helps.
Taral Shah
If you're looking to monitor actual spending data, @Taral is correct. AWS Programmatic Billing Access is the best tool for recording the data. However, you don't actually have to write a dashboard tool to view it; there are a lot of them out there already.
The company I work for, Cloudability, automatically imports all of your AWS Detailed Billing data and lets you build out all of the AWS spending and usage reports you need without ever having to write any code or mess with any spreadsheets.
The first thing is to enable detailed billing export to an S3 bucket (see here).
Then I wrote a simplistic server in Python (BSD licenced) that retrieves your detailed bill and breaks it down per service type and usage type (see it on this GitHub repo).
Thus you can check anytime what your costs are and which services cost you the most etc.
If you tag your EC2 instances, S3 buckets etc, they will also show up on a dedicated line.
CloudWatch has an "estimated billing" API, which will get you most of the way there. See this ServerFault question for more detail:
If you are looking for more detail you will need to download your CSV-formatted bill and parse it. But your question is too generic to provide any specifically useful answer. Even this will not be real time though.
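For the CloudWatch route mentioned above, the estimated charges are published as the `EstimatedCharges` metric in the `AWS/Billing` namespace (in `us-east-1`, and only once billing alerts are enabled on the account). A hedged sketch with boto3, a third-party package; the helper names, the 24-hour lookback and the 6-hour period are choices made for this example.

```python
from datetime import datetime, timedelta, timezone


def estimated_charges_request(now=None, lookback_hours=24):
    """Build the parameters for CloudWatch get_metric_statistics."""
    now = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "StartTime": now - timedelta(hours=lookback_hours),
        "EndTime": now,
        "Period": 21600,  # seconds; the metric is only updated a few times a day
        "Statistics": ["Maximum"],
    }


def latest_estimated_charges(datapoints):
    """Return the value of the most recent datapoint, or None if there are none."""
    if not datapoints:
        return None
    return max(datapoints, key=lambda p: p["Timestamp"])["Maximum"]


def fetch_estimated_charges():
    """Query CloudWatch; needs boto3 and AWS credentials to be configured."""
    import boto3  # imported lazily so the helpers above work without boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    response = cloudwatch.get_metric_statistics(**estimated_charges_request())
    return latest_estimated_charges(response["Datapoints"])
```

As noted above, this is an estimate that lags behind actual usage, so it is not real time either.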

AWS S3 standard vs. reduced redundancy storage?

Does anyone know of any real-world analysis on data loss using these two AWS S3 storage options? I know from the AWS docs (via Quora) that one is 99.999999999% guaranteed and the other is only 99.99% guaranteed, but I'm looking for data from a non-AWS source.
Anecdotes or something more thorough would both be great. I apologize if this isn't the right SE site for this question. Feel free to suggest a place to migrate it.
I guess it depends on the data you're storing if you really need 99.999999999% level of durability …
If you keep copies of your data locally and are just using S3 as a convenient place to store data that is actively being accessed by services within the AWS infrastructure, RRS might be the right choice for you :)
In my case, I keep fresh files on the normal durability level until I have created a local backup, and then move them to RRS, which saves quite a bit of money.
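To put the two durability figures in perspective, here is a small worked calculation. It treats the stated durability as an independent per-object probability of surviving one year, which is a simplifying assumption; the function names are made up for this sketch.

```python
from fractions import Fraction


def annual_loss_rate(durability_percent):
    """Convert a durability percentage string such as '99.99' into a loss probability."""
    return 1 - Fraction(durability_percent) / 100


def expected_objects_lost_per_year(num_objects, durability_percent):
    """Expected number of objects lost per year under the stated durability."""
    return num_objects * annual_loss_rate(durability_percent)


if __name__ == "__main__":
    for label, durability in [("standard", "99.999999999"), ("RRS", "99.99")]:
        lost = expected_objects_lost_per_year(10_000, durability)
        print(f"{label}: {lost} objects lost per year out of 10,000")
```

With 10,000 objects, 99.99% works out to an expected loss of one object per year, while 99.999999999% works out to one object every 10,000,000 years. Exact fractions are used because subtracting such a value from 1 in floating point loses precision.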