Is it possible to access contents of spark.sql output via JDBC, like %sql interpreter does?
You just need to setup Spark's thriftserver as described in the Spark's documentation. Zeppelin is just consumer of the Spark's execution, and doesn't expose data itself.
If you really need to extract specific information from paragraph, you can use Notebook API.
Related
With apache flink is it possible to write to a hive cluster such that the cluster is able to distribute the data among his nodes?
Example as described here seems to indicate data is intended to a HDFS on the apache flink node itself. But what options exist if you intend to have the HDFS on a separate cluster and not on the flink worker nodes?
Please bear with me, I am totally new to this topic and I could get something conceptually completely wrong.
Yes, you can read from and write to Hive using Flink. There's an overview available at https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/table/hive/hive_read_write/
I was wondering how I would tackle the following on AWS? - or whether it was not possible?
Transient EMR Cluster for some bulk Spark processing
When that cluster terminates, then and only then use a Glue Job to do some limited processing
I am not convinced AWS Glue Triggers will help over environments.
Or could one say, well just keep on in the EMR Cluster, it's not a good use case? Glue can write to SAP Hana with appropriate Connector and Redshift Spectrum is common use case to load Redshift via Glue job with Redshift Spectrum.
You can use "Run a job" service integration using AWS Step Functions. Step functions supports both EMR and Glue integration.
Please refer to the link for details.
Having spoken to Amazon on this aspect, they indicate that Airflow via MWAA is the preferred option now.
Why do Apache Hive needs Apache Thrift? On the Thrift's site it says that it can compile in multiple languages, but I can't understand where does it fits and why do Hive need it.
Thanks
Cited from safaribooksonline:
Chapter 16. Hive Thrift Service
Hive has an optional component known as HiveServer or HiveThrift that
allows access to Hive over a single port. Thrift is a software
framework for scalable cross-language services development. See
http://thrift.apache.org/ for more details. Thrift allows clients
using languages including Java, C++, Ruby, and many others, to
programmatically access Hive remotely.
The CLI is the most common way to access Hive. However, the design of
the CLI can make it difficult to use programmatically. The CLI is a
fat client; it requires a local copy of all the Hive components and
configuration as well as a copy of a Hadoop client and its
configuration. Additionally, it works as an HDFS client, a MapReduce
client, and a JDBC client (to access the metastore). Even with the
proper client installation, having all of the correct network access
can be difficult, especially across subnets or datacenters.
Couldn't have said it better. Emphasis mine.
https://cwiki.apache.org/confluence/display/Hive/HiveServer
HiveServer is an optional service that allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results. HiveServer is built on Apache ThriftTM (http://thrift.apache.org/), therefore it is sometimes called the Thrift server although this can lead to confusion because a newer service named HiveServer2 is also built on Thrift.
For more details on how to connect to hive server(thrift server) see the link above.
i want to write stream data to accumulo!. There is any API for accumulo to write data. It is possible in python instead of java?
See the BatchWriter with you instantiate via Connector. The Accumulo Thrift Proxy enables non-Java clients to interact with Accumulo.
Good day,i have a need to use hive just the way we could use mysql. I want to find a way in which I can host it online so that people in different places can communicate to one hive service. Thanks in advance.
This is the functionality that Hive Server 2 provides
https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2
It exposes Hive as a thrift web service, and there are JDBC and ODBC drivers available.
You can also put Apache Knox in front of it, in order to have more options for authentication and authorization https://knox.apache.org/
If you are using a common distribution such as Hortonworks or Cloudera, Hive Server 2 will probably be installed automatically when you install Hive.