What's Amazon Web Services *native* offering is closest to Apache Kudu? - sql

I am looking for a native offering, such as any of the RDS solutions, Elastic Cache, Amazon Redshift, not something that I would have to host myself.
From the Apache Kudu: https://kudu.apache.org/ :
Kudu provides a combination of fast inserts/updates and efficient columnar
scans to enable multiple real-time analytic workloads across a single storage
layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the
flexibility to address a wider variety of use cases without exotic workarounds.
As I understand it, Kudu is a columnar distributed storage engine for tabular data that allows for fast scans and ad-hoc analytical queries but ALSO allows for random updates and inserts. Every table has a primary key that you can use to find and update single records...

Second answer after question was revised.
The answer is Amazon EMR running Apache Kudu.
Amazon EMR is Amazon's service for Hadoop. Apache Kudu is a package that you install on Hadoop along with many others to process "Big Data".
If you are looking for a managed service for only Apache Kudu, then there is nothing. Apache Kudu is an open source tool that sits on top of Hadoop and is a companion to Apache Impala. On AWS both require Amazon EMR running Hadoop version 2.x or greater.

Related

apache flink write data to separate hive cluster

With apache flink is it possible to write to a hive cluster such that the cluster is able to distribute the data among his nodes?
Example as described here seems to indicate data is intended to a HDFS on the apache flink node itself. But what options exist if you intend to have the HDFS on a separate cluster and not on the flink worker nodes?
Please bear with me, I am totally new to this topic and I could get something conceptually completely wrong.
Yes, you can read from and write to Hive using Flink. There's an overview available at https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/table/hive/hive_read_write/

Hortonworks schema registry cluster mode

I'm using Hortonworks schema registry with NIFI and things are working fine. I have installed Hortonworks schema registry on a single node and I'm afraid if that machine goes down what will happen to my NIFI flows. I have seen in Hortonworks schema registry architecture that we can use mysql, PostgreSql and In-Memory storage for storing schema. AFAIK none of them are distributed system. Is there any way to achieve cluster mode for high availability?
Sure, you can do active-active or active-passive replication for MySQL and Postgres, but that is left up to you to implement, as Hortonworks will likely forward you to the respective documentation on each tool, and that is the reason why the documentation for these tools doesn't guide you towards these design decisions in itself, as you should be aware of the drawbacks when having a SPoF
The Schema Registry itself is just a web-app, so you could put it behing your favorite reverse proxy, or within a container orchestrator, such as Docker support in HDP 3.x

minio: What is the cluster architecture of minio.io object storage server?

I have searched minio.io for hours but id dosn't provide any good information about clustering, dose it has rings and instance are connected? or mini is just for single isolated machine. And for running a cluster we have to run many isolated instance of it and the our app choose to which instance we write?
if yes:
When I write a file to a bucket does minio replicate it between multi server?
I is it like amazon s3, or openstack swift that support of storing multi copy of object in different servers (and not multi disk on the same machine).
Here is the document for distributed minio: https://docs.minio.io/docs/distributed-minio-quickstart-guide
From what I can tell, minio does not support clustering with automatic replication across multiple servers, balancing, etcetera.
However, the minio documentation does say how you can set up one minio server to mirror another one:
https://gitlab.gioxa.com/opensource/minio/blob/1983925dcfc88d4140b40fc807414fe14d5391bd/docs/setup-replication-between-two-sites-running-minio.md
Minio also Introduced Continuous Availability and Active-Active Bucket Replication. CheckoutTheir active-active Replication Guide

How does Apache Hive work on Amazon?

I'm looking to figure out the mechanics of Apache Hive hosted by Amazon. I'm assuming, it substitutes HDFS with S3 and Hadoop MapReduce with EMR. Are my assumptions correct?
You mostly correct. I would say that it is most convenient way to run Hive on amazon is
replacing HDFS with S3. It is practical since data is living on S3 and we can run Hadoop / Hive cluster on demand. Some drawback is slow write performance - so doing data transformations will be slow. Doing aggregations - is mostly fine
In the same time there are other configurations:
Build HDFS over local drives.
Build HDFS over EBS volumes.
Each one with their trade offs.

Amazon web services: Where to start

I am a recent grad and wanted to learn about doing web application using AWS. I have gone through the documentation and ran their sample Travel Log application Successfully.
But still I am not clear about the terminologies used. can anyone explain me the difference between Amazon Simple Storage Service (Amazon S3), Amazon Elastic Compute Cloud (Amazon EC2), Amazon SimpleDB in simple words.
I am looking to come up with a web app that has a signin page and people posting some text there. may i know what services of amazon would be required for me to build this app.
Thanks
Amazon Simple Storage Service (S3) is for load static content , maybe images, videos, or something you want to save, You could think of it like a hard drive for storage.
Amazon Elastic Compute Cloud: ( EC2) basically is your Virtual Operative System, you can install whatever OS you want (Debian, Ubuntu, Fedora, Centos, Windows Server, Suse enterprise). ( if your application uses server side processing this will be its home)
Amazon Simple DB, is a no-sql database system, that you could use for your aplications, and Amazon gives you as a service, but if you want to use something more, you could install yours on EC2, or use RDS for Database server (MySql for example)
If you want to know more, there are some books, like: "programming Amazon EC2" or see Amazon screencast at http://www.youtube.com/user/AmazonWebServices or its presentation on http://www.slideshare.net/AmazonWebServices
Amazon Simple Storage Service (Amazon S3)
Amazon S3 (Simple Storage Service) is a scalable, high-speed, low-cost web-based service designed for online backup and archiving of data and application programs. It allows to upload, store, and download any type of files up to 5 TB in size. This service allows the subscribers to access the same systems that Amazon uses to run its own web sites. The subscriber has control over the accessibility of data, i.e. privately/publicly accessible.
Amazon Elastic Compute Cloud (Amazon EC2)
Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the Amazon Web Services (AWS) cloud. Using Amazon EC2 eliminates your need to invest in hardware up front, so you can develop and deploy applications faster. You can use Amazon EC2 to launch as many or as few virtual servers as you need, configure security and networking, and manage storage. Amazon EC2 enables you to scale up or down to handle changes in requirements or spikes in popularity, reducing your need to forecast traffic.
Amazon SimpleDB
Amazon SimpleDB is a highly available NoSQL data store that offloads the work of database administration. Developers simply store and query data items via web services requests and Amazon SimpleDB does the rest.
Unbound by the strict requirements of a relational database, Amazon SimpleDB is optimized to provide high availability and flexibility, with little or no administrative burden. Behind the scenes, Amazon SimpleDB creates and manages multiple geographically distributed replicas of your data automatically to enable high availability and data durability. The service charges you only for the resources actually consumed in storing your data and serving your requests. You can change your data model on the fly, and data is automatically indexed for you. With Amazon SimpleDB, you can focus on application development without worrying about infrastructure provisioning, high availability, software maintenance, schema and index management, or performance tuning.
For more information, go through these:
https://aws.amazon.com/simpledb/
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
https://www.tutorialspoint.com/amazon_web_services/amazon_web_services_s3.htm
Amazon S3 is used for storage of files. It is basically like the hard drives like on your system you use C or D your files. If you are developing any application you can use S3 for storing the static files or any backup files.
Amazon EC2 is exactly like your physical machine. Only difference is EC2 is on cloud. You can install and run software, applications store files exactly you do on your physical machines.
Amazon Simple DB is a a database on cloud. you can integrate it with your application and make queries.