With Apache Flink, is it possible to write to a Hive cluster in such a way that the cluster is able to distribute the data among its nodes?
The example described here seems to indicate that the data is meant to go to HDFS on the Apache Flink node itself. But what options exist if you intend to have HDFS on a separate cluster rather than on the Flink worker nodes?
Please bear with me; I am totally new to this topic and may have something conceptually wrong.
Yes, you can read from and write to Hive using Flink. There's an overview available at https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/table/hive/hive_read_write/
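A minimal sketch of what a write could look like through the Hive connector, assuming flink-connector-hive and the Hive dependencies are on the classpath; the catalog name, hive-conf-dir, and table names below are placeholders:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class WriteToHive {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Catalog name, database, and hive-conf-dir are assumptions;
        // hive-site.xml in that directory must point at the remote metastore.
        HiveCatalog hive = new HiveCatalog("myhive", "default", "/etc/hive/conf");
        tEnv.registerCatalog("myhive", hive);
        tEnv.useCatalog("myhive");

        // Rows written this way land in the Hive table's warehouse location
        // (typically HDFS on the Hive cluster), not on the Flink workers' disks.
        tEnv.executeSql(
            "INSERT INTO some_hive_table SELECT id, name FROM some_source_table");
    }
}
```

Regarding the separate-cluster concern: the hive-site.xml you point the HiveCatalog at determines which metastore and which HDFS the data ends up on, so in general the rows are written to the table's location on that remote cluster and distributed across its DataNodes, rather than stored locally on the Flink worker nodes.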
Related
I am new to Flink and going through the documentation. I found out that Redis can be used as a sink (through Apache Bahir), but there is no mention of using Redis Streams as a source for streaming data. Is it possible to add a Redis stream as a source to Flink?
I tried googling it but could not find any answers. If this question has already been asked, please provide the link, as I am not able to find it.
I want to transfer data from Kafka to HDFS with Confluent, and I have successfully run the experiments from the quickstart in CLI mode.
Now I intend to deploy the Confluent Platform in a production environment. Is there a detailed tutorial about distributed deployment?
Also, there are many topics in Kafka, such as register_info, video_play_info, video_like_info, video_repost_info, and so on.
I need to process the messages with different converters and transfer them to different Hive tables.
What should I do?
I need to process messages with different converters and transfer them to different Hive tables
Run bin/connect-distributed etc/kafka/connect-distributed.properties
Create individual JSON files for each HDFS Connector
POST them to the REST endpoint of Kafka Connect (see the sketch below)
Distributed mode is documented here
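As a rough sketch of one such definition and POST, assuming Confluent's HDFS Sink Connector with Hive integration enabled; the host names, ports, URLs, and flush.size are placeholders, and the topic name is taken from the question:

```sh
# Hypothetical example: one HDFS Sink Connector per topic, each with its own
# converter and its own target Hive database/table. Hosts, ports, and sizes are placeholders.
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "hdfs-sink-register-info",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "topics": "register_info",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "1000",
    "format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "hive.integration": "true",
    "hive.metastore.uris": "thrift://hive-metastore:9083",
    "hive.database": "default"
  }
}'
```

Repeat the POST with a different name, topics value, converter, and Hive database for video_play_info, video_like_info, and so on.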
I am looking for a native offering, such as any of the RDS solutions, ElastiCache, or Amazon Redshift, not something that I would have to host myself.
From the Apache Kudu site (https://kudu.apache.org/):
Kudu provides a combination of fast inserts/updates and efficient columnar
scans to enable multiple real-time analytic workloads across a single storage
layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the
flexibility to address a wider variety of use cases without exotic workarounds.
As I understand it, Kudu is a columnar distributed storage engine for tabular data that allows for fast scans and ad-hoc analytical queries but ALSO allows for random updates and inserts. Every table has a primary key that you can use to find and update single records...
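To make the primary-key point concrete, here is a rough sketch using the Kudu Java client; the master address, table name, and column names are invented for illustration:

```java
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.Upsert;

public class KuduUpsertExample {
    public static void main(String[] args) throws Exception {
        // Master address and table/column names are placeholders.
        KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            KuduTable table = client.openTable("users");
            KuduSession session = client.newSession();

            // Because every Kudu table has a primary key, a single record can be
            // upserted (inserted or updated in place) rather than rewritten in bulk.
            Upsert upsert = table.newUpsert();
            PartialRow row = upsert.getRow();
            row.addLong("user_id", 42L);      // primary key column
            row.addString("name", "alice");   // non-key column being updated
            session.apply(upsert);
            session.close();
        } finally {
            client.shutdown();
        }
    }
}
```

The upsert targets a single row by its primary key, which is the behaviour that distinguishes Kudu from append-only storage such as plain HDFS files.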
Second answer, after the question was revised.
The answer is Amazon EMR running Apache Kudu.
Amazon EMR is Amazon's service for Hadoop. Apache Kudu is a package that you install on Hadoop along with many others to process "Big Data".
If you are looking for a managed service for only Apache Kudu, then there is nothing. Apache Kudu is an open source tool that sits on top of Hadoop and is a companion to Apache Impala. On AWS both require Amazon EMR running Hadoop version 2.x or greater.
I've been doing some experiments with Apache Ignite and I've started to look into WAN replication. By this I mean there would be 2 (or more) data centres each running an Ignite cluster. There would be some caches that I would like kept in sync between the two data centres.
Does Apache Ignite support this? If so, how is it configured? I can't find any mention of this in the documentation.
At the moment, Ignite does not support caches spanning multiple clusters (nor cache mirroring). If, however, you mean there is only one Ignite cluster consisting of nodes located in different data centers (WAN), that would be possible, though it would most likely be inefficient, since you would have to use replicated mode (sketched below).
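For illustration, a minimal sketch of that fallback, a single WAN-stretched cluster with a REPLICATED cache, might look like this; the cache name is a placeholder, and note this is not true cross-cluster replication:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ReplicatedCacheExample {
    public static void main(String[] args) {
        // Cache name is a placeholder. Every node in the single, WAN-stretched
        // cluster keeps a full copy of a REPLICATED cache.
        CacheConfiguration<Integer, String> cacheCfg =
            new CacheConfiguration<>("sharedCache");
        cacheCfg.setCacheMode(CacheMode.REPLICATED);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setCacheConfiguration(cacheCfg);

        try (Ignite ignite = Ignition.start(cfg)) {
            ignite.cache("sharedCache").put(1, "value");
        }
    }
}
```

Every node, regardless of which data center it sits in, holds a full copy of the cache, which is why writes over the WAN tend to be slow.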
GridGain provides asynchronous WAN replication on top of Ignite as part of their paid solution: https://www.gridgain.com/docs/latest/administrators-guide/data-center-replication/configuring-replication
If I deploy Apache Drill in an existing Hadoop cluster, does Apache Drill have any negative influence on the other members of the Hadoop ecosystem in that cluster?
It won't have any negative impact on the other members of the ecosystem, but it will hog a lot of memory on the node. Make sure you have enough memory before installing Drill.
It's a Hadoop-compatible component. Any negative influence only happens if you don't provide enough resources for all the members to operate.
You can co-locate Drill with HDFS on the same node or in the same cluster to get the best performance.