Issue connecting to Elasticsearch from a Spark DataFrame

I am receiving the error message below while trying to connect to Elasticsearch from Spark.
Code used:
spark.read.format("org.elasticsearch.spark.sql")
.option("es.nodes", "xxxxx")
.option("es.port", "80")
.option("es.nodes.wan.only", "true")
.option("pushdown", "true")
.option("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.load("indexname/doc")
Error message:
java.lang.ClassNotFoundException: Failed to find data source: org.elasticsearch.spark.sql. Please find packages at http://spark.apache.org/third-party-projects.html
Using
Databricks Runtime Version 7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)
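For reference, this "Failed to find data source" error usually just means the Elasticsearch Spark connector jar is not attached to the cluster. A minimal sketch of the read once the connector is installed, assuming the Maven coordinate org.elasticsearch:elasticsearch-spark-30_2.12 (pick the artifact and version that match your Spark/Scala/Elasticsearch releases) has been added as a cluster library:

// Assumes the ES-Hadoop connector (e.g. org.elasticsearch:elasticsearch-spark-30_2.12)
// is already on the cluster classpath; without it Spark cannot resolve the data source.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "xxxxx")            // Elasticsearch host(s), as in the question
  .option("es.port", "80")
  .option("es.nodes.wan.only", "true")
  .option("pushdown", "true")
  .load("indexname/doc")

df.printSchema()                          // quick check that the index mapping is being read

Note that spark.serializer is a Spark configuration property rather than a read option, so it is normally set on the session or cluster configuration instead of in .option(...).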

Related

Not able to connect to Redis Version 7.0.5 Using Lettuce

I recently upgraded from Redis version 3 to Redis version 7.0.5, which is deployed on AWS.
Now when I try to connect to it using a Lettuce pool, I get the exception below:
org.springframework.data.redis.RedisSystemException: Unknown redis exception; nested exception is org.springframework.data.redis.connection.PoolException: Could not get a resource from the pool; nested exception is com.lambdaworks.redis.RedisException: java.lang.UnsupportedOperationException
Below are the dependency versions I am using:
org.springframework.data : spring-data-keyvalue : 1.2.7.RELEASE
biz.paluch.redis : lettuce : 4.4.5.FINAL
org.springframework.data : spring-data-commons : 1.13.7.RELEASE
org.springframework.data : spring-data-redis : 1.8.7.RELEASE
Please help me get this sorted out, as I am stuck here.
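As a debugging aid (not a fix), here is a minimal sketch that talks to the server with the same com.lambdaworks Lettuce 4.x client but outside the Spring pool; the host and port below are placeholders. If this also fails with UnsupportedOperationException, the incompatibility lies between the old client and Redis 7 itself rather than in the pool configuration:

import com.lambdaworks.redis.RedisClient

object RedisPing extends App {
  // Placeholder endpoint; substitute the AWS Redis endpoint from your configuration.
  val client = RedisClient.create("redis://my-redis-host:6379")
  val connection = client.connect()       // StatefulRedisConnection[String, String]
  try println(connection.sync().ping())   // expect "PONG" when the server accepts this client
  finally {
    connection.close()
    client.shutdown()
  }
}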

Error when trying to start HiveServer2: NullPointerException in ThriftBinaryCLIService

When I start hiveserver2 with the following command:
hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=INFO,console
I receive the following error before the program exits:
2022-09-12T14:46:53,713 ERROR [Thrift Server] transport.TServerSocket: Could not set socket timeout.
java.net.SocketException: Socket is closed
at java.net.ServerSocket.setSoTimeout(ServerSocket.java:666) ~[?:1.8.0_292]
at org.apache.thrift.transport.TServerSocket.listen(TServerSocket.java:117) ~[hive-exec-3.1.3.jar:3.1.3]
at org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:146) ~[hive-exec-3.1.3.jar:3.1.3]
at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:169) ~[hive-service-3.1.3.jar:3.1.3]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Hive Session ID = 56c28481-2b0c-4712-808d-ff7ccf31b543
Hive Session ID = 9771e219-095c-4524-b34a-b8e05c335fc0
2022-09-12T14:48:03,871 ERROR [Thrift Server] thrift.ThriftCLIService: Exception caught by ThriftBinaryCLIService. Exiting.
java.lang.NullPointerException: null
at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:169) ~[hive-service-3.1.3.jar:3.1.3]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Here is a brief explanation of my setup:
I am using vagrant and VirtualBox to create a "virtual" cluster.
This is very loosely based on this repository (it hasn't been updated in a while, so I have had to make many changes to get it to work): https://github.com/njvijay/vagrant-jilla-hadoop
I have created 5 nodes (1 name node and 4 data nodes). The name node also runs YARN, Hive, Pig, Spark, MySQL, Python, etc.
I am using Ubuntu 14.04.6, Hadoop 2.10.1, Hive 3.1.3, Spark 3.3.0 and Pig 0.15.
It seems that there may be a compatibility issue between Hadoop 2 and Spark 3. I was able to resolve the error after updating Hadoop, Hive and Spark to the latest versions.
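As a rough way to confirm what a node is actually running before and after such an upgrade, a small check from spark-shell (both calls are standard Spark/Hadoop APIs; nothing here is specific to this cluster):

// Print the Spark version and the Hadoop client version bundled with Spark;
// comparing these with the Hadoop and Hive versions installed on the nodes
// helps spot the kind of 2.x / 3.x mismatch described above.
println(s"Spark:  ${org.apache.spark.SPARK_VERSION}")
println(s"Hadoop: ${org.apache.hadoop.util.VersionInfo.getVersion}")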

Using kafka-handler 4.0 in HDP 3.1.4

We are using HDP 3.1.4 and want to use kafka-handler for Kafka and Hive integration.
Many of our Kafka topics are serialized with Avro using the Confluent Schema Registry, and the default kafka-handler shipped with HDP 3.1.4 does not support Avro messages serialized by Confluent, which contain a magic byte. Newer versions of kafka-handler provide properties such as 'avro.serde.type'='skip' and 'avro.serde.skip.bytes'='5' to handle the magic byte.
So we wanted to replace kafka-handler with our own compiled kafka-handler 4.0.
We simply started by replacing the default HDP kafka-handler with kafka-handler 4.0, but we faced this error:
MetaException(message:java.lang.NoSuchFieldError AVRO_SERDE_TYPE)
AVRO_SERDE_TYPE was added in hive-serde 4.0, and we found that the error occurs because kafka-handler depends on hive-exec, and hive-exec bundles the old hive-serde classes shipped with HDP.
So we had to replace hive-serde too, and we continued by shading hive-serde inside kafka-handler:
<relocation>
  <pattern>org.apache.hadoop.hive.serde2</pattern>
  <shadedPattern>org.shaded.apache.hadoop.hive.serde2_shaded</shadedPattern>
</relocation>
But then we face this exception:
SQL Error [40000] [42000]: Error while compiling
statement: FAILED: ClassCastException class org.apache.hadoop.hive.kafka.KafkaSerDe
So how should we handle this issue?
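Not an answer, but a sketch of a classpath probe that often helps with this kind of NoSuchFieldError / ClassCastException: print which jar each of the conflicting classes is actually loaded from on the HiveServer2 host. The class names below come from the error messages above; org.apache.hadoop.hive.serde2.AbstractSerDe is only an assumption about which type the failing cast involves.

// Prints the jar each class is loaded from, to reveal duplicated or shaded
// copies of the hive-serde classes on the Hive classpath.
Seq(
  "org.apache.hadoop.hive.kafka.KafkaSerDe",
  "org.apache.hadoop.hive.serde2.AbstractSerDe"
).foreach { name =>
  val location = Class.forName(name).getProtectionDomain.getCodeSource.getLocation
  println(s"$name -> $location")
}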

Use Saiku with Apache Hive

Have you ever used Saiku to do data analysis on a big data platform (Hadoop)? My recent work requires integrating some legacy BI tools with Hadoop to support common OLAP queries over HDFS/HBase.
I found a solution implemented with Phoenix and HBase here, which bridges Saiku and HBase through Phoenix's SQL dialect, and it worked. However, this method can only handle data inside HBase through the HBase API; it cannot launch any MapReduce-style job when building the data cube. I would prefer a more big-data-friendly alternative, such as going through Apache Hive.
Saiku is based on Mondrian. My version of Saiku uses Mondrian-4.0.0.0-SNAPSHOT.jar, which I found can already work well with Hive, and there are many Hive 0.13 jars in Saiku's lib directory. So I thought a simple hive2 datasource configuration would work. I started a hiveserver2 on the namenode of my HDFS cluster and added the following datasource to Saiku:
Name: hive2
Connection Type: Mondrian
URL: jdbc:hive2://localhost:10000/default
Schema: /datasources/movie.xml
Jdbc Driver: org.apache.hive.jdbc.HiveDriver
Username: ubuntu
Password: XXXX
Saiku did connect to hiveserver2 successfully but failed to load the datasource. I found the following error in the Saiku log:
name:hive2
driver:mondrian.olap4j.MondrianOlap4jDriver
url:jdbc:mondrian:Jdbc=jdbc:hive2://localhost:10000/default;Catalog=mondrian:///datasources/movie.xml;JdbcDrivers=org.apache.hive.jdbc.HiveDriver
12:41:48,110 WARN [RolapSchema] Model is in legacy format
12:41:50,464 ERROR [SecurityAwareConnectionManager] Error connecting: hive2
mondrian.olap.MondrianException: Mondrian Error:Internal error: while quoting identifier
at mondrian.resource.MondrianResource$_Def0.ex(MondrianResource.java:992)
at mondrian.olap.Util.newInternal(Util.java:2543)
at mondrian.spi.impl.JdbcDialectImpl.deduceIdentifierQuoteString(JdbcDialectImpl.java:245)
at mondrian.spi.impl.JdbcDialectImpl.<init>(JdbcDialectImpl.java:146)
at mondrian.spi.DialectManager$DialectManagerImpl$1.createDialect(DialectManager.java:210)
...
Caused by: java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveDatabaseMetaData.getIdentifierQuoteString(HiveDatabaseMetaData.java:342)
at org.apache.commons.dbcp.DelegatingDatabaseMetaData.getIdentifierQuoteString(DelegatingDatabaseMetaData.java:306)
at mondrian.spi.impl.JdbcDialectImpl.deduceIdentifierQuoteString(JdbcDialectImpl.java:238)
... 99 more
I looked into the Hive 0.13 source and found that getIdentifierQuoteString isn't implemented yet and simply throws an exception:
public String getIdentifierQuoteString() throws SQLException {
  throw new SQLException("Method not supported");
}
So far I am puzzled. Is it practical to use Saiku with Hive at all? It has Hive 0.13 jars in its lib directory, yet it cannot load a simple Hive datasource. Should I simply modify the Hive source? I found that in the newly released Hive 1.0 this function is implemented by simply returning an empty string.
Does anyone have a good idea? Thanks!
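One way to narrow this down outside Saiku is to reproduce the exact call Mondrian makes with a plain JDBC program against the same hiveserver2 (standard java.sql API; the URL and credentials are the ones from the datasource above). With the Hive 0.13 driver this should throw "Method not supported", while a driver from Hive 1.0 or later returns normally:

import java.sql.DriverManager

object QuoteStringCheck extends App {
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "ubuntu", "XXXX")
  try {
    // The same metadata call Mondrian's deduceIdentifierQuoteString relies on.
    println(s"identifier quote string: '${conn.getMetaData.getIdentifierQuoteString}'")
  } finally conn.close()
}

If that confirms the failing call, one practical route might be to put a newer hive-jdbc driver (1.0 or later) on Saiku's classpath, since the fix you found is already in that driver.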

Apache Hive error: "Merging of credentials not supported in this version of hadoop"

I am using Hadoop 1.2.1, HBase 0.94.14 and Hive 1.0.0. There are three datanodes in my cluster and also three regionservers. I have to import some data from HBase to Hive. I have configured Hive successfully, but when I run a command to count the number of rows in a Hive table, it gives the following:
ERROR [main]: exec.Task (SessionState.java:printError(833)) - Job Submission failed with exception 'java.lang.RuntimeException(java.io.IOException: Merging of credentials not supported in this version of hadoop)'
java.lang.RuntimeException: java.io.IOException: Merging of credentials not supported in this version of hadoop
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureJobConf(HBaseStorageHandler.java:485)
at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobConf(PlanUtils.java:856)
at org.apache.hadoop.hive.ql.plan.MapWork.configureJobConf(MapWork.java:540)
I have changed the Hive version to 0.14, but I get the same error.
What is the solution?
Note: I cannot upgrade Hadoop.
Although your version of Hive is current, it is not the source of your error. You need to upgrade your Hadoop version to 2.4.0 or above.
The error originates from here: https://github.com/apache/hive/blob/3b6825b5b61e943e8e41743f5cbf6d640e0ebdf5/shims/0.20S/src/main/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java#L579
The 0.20S shim (used for Hadoop 1.x) simply throws this exception whenever Hive asks it to merge credentials, which is what HBaseStorageHandler.configureJobConf needs to do, so the failure occurs regardless of the Hive version as long as you stay on Hadoop 1.x.