We have a 10-node AWS EMR cluster running EMR 5.5.0 with Spark 2.1.0.
We use PySpark with Spark SQL to generate summary data, which takes the form of a PySpark DataFrame, and we want to write that DataFrame into a Couchbase database.
Does the Couchbase Spark Connector have support for PySpark? If it does, could you please share information on how to write data into Couchbase using PySpark?
At the moment, Couchbase does not have support for PySpark.
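One workaround is to skip the connector entirely and write from the Python side with the Couchbase Python SDK inside foreachPartition, so each executor opens its own connection. A minimal sketch, assuming the 2.x Python SDK (the couchbase package) is installed on every executor, a bucket named summary exists, and the DataFrame has an id column to use as the document key:

```python
# Sketch: write a PySpark DataFrame to Couchbase via the Python SDK,
# since the Spark Connector has no PySpark bindings.
# Assumes: couchbase Python SDK 2.x on every executor, a bucket named
# "summary", and an "id" column to use as the document key.
from couchbase.bucket import Bucket

def write_partition(rows):
    # One connection per partition, not per row
    bucket = Bucket('couchbase://cb-host/summary')  # hypothetical host/bucket
    for row in rows:
        doc = row.asDict()
        bucket.upsert(str(doc.pop('id')), doc)

summary_df.rdd.foreachPartition(write_partition)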
I have a large number of Hive SQL scripts that I want to import into Azure Synapse Analytics to run as Spark SQL notebooks.
Is there any easy way to do this?
Synapse notebooks are JSON files with a lot of extra information around the cell contents.
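Since the notebook format is just JSON, one low-tech approach is to generate importable notebooks directly: Synapse Studio can import Jupyter .ipynb files, and a %%sql cell in a Synapse notebook runs Spark SQL. A minimal sketch (the hive_scripts folder, the naive semicolon split, and the one-statement-per-cell layout are all assumptions, and Hive SQL may still need dialect tweaks to run as Spark SQL):

```python
# Sketch: turn a directory of .sql scripts into importable .ipynb notebooks.
# Each statement becomes one code cell prefixed with the %%sql magic.
import json
import pathlib

def sql_to_notebook(sql_path, out_dir):
    # Naive split on ";" -- breaks on semicolons inside strings or comments.
    text = pathlib.Path(sql_path).read_text()
    statements = [s.strip() for s in text.split(';') if s.strip()]
    cells = [{'cell_type': 'code',
              'metadata': {},
              'execution_count': None,
              'outputs': [],
              'source': ['%%sql\n', stmt]} for stmt in statements]
    nb = {'nbformat': 4, 'nbformat_minor': 2, 'metadata': {}, 'cells': cells}
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(exist_ok=True)
    out = out_dir / (pathlib.Path(sql_path).stem + '.ipynb')
    out.write_text(json.dumps(nb, indent=1))

for script in pathlib.Path('hive_scripts').glob('*.sql'):  # hypothetical folder
    sql_to_notebook(script, 'notebooks')
```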
On HDFS, a Hive ORC ACID table poses no issue for Hive MERGE.
On S3 it is not possible.
For Azure HDInsight, it is not clear to me from the docs whether such a table on Azure Blob Storage is possible. Seeking confirmation or otherwise.
I am pretty sure it is a no-go. See the update I gave on the answer, however.
According to the official Azure HDInsight documentation (Azure HDInsight 4.0 overview): as far as I know, Hive MERGE requires MapReduce, but HDInsight does not support MapReduce for Hive, so it's also not possible.
UPDATE by question poster
HDInsight 4.0 doesn't support MapReduce for Apache Hive; use Apache Tez instead. So with Tez it will still work, and per https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-version-release, Spark with Hive 3 and the Hive Warehouse Connector are also options.
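For reference, this is the kind of statement at issue; MERGE only works against a transactional (ACID) target table. A minimal sketch submitted through a HiveServer2 endpoint with pyhive (the host, database, tables, and columns are all hypothetical):

```python
# Sketch: run a Hive MERGE through HiveServer2 with pyhive.
# The target must be an ACID table, i.e. one created
# STORED AS ORC with TBLPROPERTIES ('transactional'='true').
# Host, database, table, and column names are hypothetical.
from pyhive import hive

MERGE_SQL = """
MERGE INTO warehouse.user_summary AS t
USING warehouse.user_summary_staging AS s
ON t.user_id = s.user_id
WHEN MATCHED THEN UPDATE SET event_count = s.event_count
WHEN NOT MATCHED THEN INSERT VALUES (s.user_id, s.event_count)
"""

conn = hive.connect(host='hiveserver2-host', port=10000, username='hive')
cursor = conn.cursor()
cursor.execute(MERGE_SQL)
```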
I want to take data from Hive tables that are in two different clusters. How can I do that?
Apache Kylin runs on top of Spark SQL, Spark SQL runs on top of Hive, and Hive runs on top of YARN.
I don't think Kylin can take data from multiple clusters. However, you can use tools like Sqoop (or Spark itself, as sketched below) to consolidate the data into one cluster; then you can run Kylin over all of it.
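A minimal PySpark sketch of that consolidation step, assuming the remote NameNode is reachable from the local cluster and the remote table's data files are ORC (the addresses, paths, and table names are hypothetical):

```python
# Sketch: copy a table from a remote cluster into the local Hive
# warehouse so Kylin can build cubes over a single cluster.
# Remote NameNode address, paths, and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the remote table's files directly over HDFS...
remote = spark.read.orc('hdfs://remote-namenode:8020/warehouse/db.db/events')

# ...and register them as a table in the local metastore.
remote.write.mode('overwrite').saveAsTable('db.events')
```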
As part of my Spark job I am storing the output in a Hive table on HDInsight. I now want to expose the data to COTS tools that can consume an OData feed, such as Tableau. I was wondering if anyone has pointers on how this can be accomplished?
It's easy to do if the data is stored in Hive.
An HDInsight Spark cluster has a Thrift server set up that allows BI tools like Tableau and Power BI to process the data via Spark.
See:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-bi-tools/
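Before pointing a BI tool at it, you can sanity-check the endpoint yourself: the Spark Thrift server speaks the HiveServer2 protocol, so pyhive can query it. A small sketch (host, port, credentials, and table name are assumptions; on HDInsight the public endpoint typically goes through the cluster gateway rather than a raw port 10000, as the article above describes):

```python
# Sketch: verify the Hive table is visible through the Spark Thrift
# server, the same HiveServer2-compatible endpoint the BI tools'
# ODBC drivers use. Host, port, username, and table are hypothetical.
from pyhive import hive

conn = hive.connect(host='spark-thrift-host', port=10000, username='admin')
cursor = conn.cursor()
cursor.execute('SELECT * FROM summary_table LIMIT 5')
for row in cursor.fetchall():
    print(row)
```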
I am planning to deploy a Spark cluster. Spark supports many storage backends such as HDFS, S3, HBase, Cassandra, Hive, etc.
Since I am not migrating from Hadoop to Spark, I have no existing 'big data' storage and am still trying to figure out which one would be the best choice.
What is the best way to store data to get the most out of Spark?
My use case is tracking user-behavior data and using Spark as the ETL layer to build a data warehouse and other data products.
One thing that came to mind is running HDFS on each worker node, so data is co-located with compute, just as a typical Hadoop deployment does.
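For what it's worth, the ETL step itself looks much the same whichever backend is chosen; in a sketch like the one below (paths, schema, and partition column are hypothetical), the choice mostly changes the URI scheme the job reads from and writes to (hdfs://, s3a://, and so on):

```python
# Sketch of the ETL step described above: read raw behavior events and
# write them out as partitioned Parquet on the chosen storage backend.
# Paths, schema, and the partition column are hypothetical; swapping
# the output URI is the main backend-specific change.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.json('hdfs:///raw/user_events/')
(events
 .withColumn('event_date', F.to_date('timestamp'))
 .write.mode('append')
 .partitionBy('event_date')
 .parquet('hdfs:///warehouse/user_events/'))
```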