Error while creating Hive table from MapR-DB (HBase-like) - Hive

I have created a MapR-DB table from Spark code:
import org.apache.spark.{SparkConf, SparkContext}
import com.mapr.db.spark._  // MapR-DB OJAI connector; provides saveToMapRDB

case class MyLog(count: Int, message: String)
val conf = new SparkConf().setAppName("Launcher").setMaster("local[2]")
val sc = new SparkContext(conf)
val data = Seq(MyLog(3, "monmessage"))
val log_rdd = sc.parallelize(data)
log_rdd.saveToMapRDB("/tables/tablelog", createTable = true, idFieldPath = "message")
When I print this record from the Spark code, I get the following in the console:
{"_id":"monmessage","count":3,"message":"monmessage"}
I would like to create a Hive table so I can run SELECT or other queries on this table, so I tried this:
CREATE EXTERNAL TABLE mapr_table_2(count int, message string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "message")
TBLPROPERTIES("hbase.table.name" = "/tables/tablelog");
but I get:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Error: the HBase columns mapping contains a badly formed column family, column qualifier specification.)
I took the CREATE TABLE query from this link:
http://maprdocs.mapr.com/home/Hive/HiveAndMapR-DBIntegration-GettingStarted.html
By the way, I don't understand what I need to put in the "hbase.columns.mapping" property.
Do you have any idea how to create the table? Thanks.

Related

Query SQL Server table in Azure Databricks

I am using the code below to query a SQL Server table hr.employee in my Azure SQL Server database using Azure Databricks. I am new to Spark SQL and trying to learn the nuances one step at a time.
Azure Databricks:
%scala
val jdbcHostname = dbutils.widgets.get("hostName")
val jdbcPort = 1433
val jdbcDatabase = dbutils.widgets.get("database")
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
%scala
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
%scala
val employee = spark.read.jdbc(jdbcUrl, "hr.Employee", connectionProperties)
%scala
spark.sql("select * from employee")
%sql
select * from employee
employee.select("col1","col2").show()
I get the error below. I'm not sure what I am doing wrong; I tried a couple of variations as well, with no luck so far.
Error:
';' expected but integer literal found.
command-922779590419509:26: error: not found: value %
%sql
command-922779590419509:27: error: not found: value select
select * from employee
command-922779590419509:27: error: not found: value from
select * from employee
command-922779590419509:16: error: not found: value %
%scala
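The errors suggest that the %sql and %scala magic lines, and the SQL text after them, were compiled as Scala inside a single cell; in Databricks a magic command is only recognized on the first line of its own cell. A minimal sketch of the usual pattern (an assumption, not from the original thread, targeting Spark 2.x on Databricks) would be to register the DataFrame as a temp view so SQL can see it:

// Scala cell: read the table and register it as a temp view
val employee = spark.read.jdbc(jdbcUrl, "hr.Employee", connectionProperties)
employee.createOrReplaceTempView("employee")

// Query it from Scala...
spark.sql("select * from employee").show()

// ...or put the SQL in a separate cell that starts with the magic:
// %sql
// select * from employee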

Spark SQL pushdown query issue with MAX function in MS SQL

I want to execute the aggregate function MAX on a table's ID column residing in MS SQL. I am using Spark SQL 1.6 and the JDBC pushdown-query approach, as I don't want Spark SQL to pull all the data to the Spark side and do the MAX(ID) calculation there. But when I execute the code below I get the exception below, whereas if I use SELECT * FROM in the code it works as expected.
Code:
def getMaxID(sqlContext: SQLContext, tableName: String) = {
  val pushdown_query = s"(SELECT MAX(ID) FROM ${tableName}) as t"
  val maxID = sqlContext.read
    .jdbc(url = getJdbcProp(sqlContext.sparkContext).toString, table = pushdown_query, properties = getDBConnectionProperties(sqlContext.sparkContext))
    .head().getLong(0)
  maxID
}
Exception:
Exception in thread "main" java.sql.SQLException: No column name was specified for column 1 of 't'.
at net.sourceforge.jtds.jdbc.SQLDiagnostic.addDiagnostic(SQLDiagnostic.java:372)
at net.sourceforge.jtds.jdbc.TdsCore.tdsErrorToken(TdsCore.java:2988)
at net.sourceforge.jtds.jdbc.TdsCore.nextToken(TdsCore.java:2421)
at net.sourceforge.jtds.jdbc.TdsCore.getMoreResults(TdsCore.java:671)
at net.sourceforge.jtds.jdbc.JtdsStatement.executeSQLQuery(JtdsStatement.java:505)
at net.sourceforge.jtds.jdbc.JtdsPreparedStatement.executeQuery(JtdsPreparedStatement.java:1029)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:124)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
This exception is not related to Spark; you have to provide an alias for the column:
val pushdown_query = s"(SELECT MAX(ID) AS max_id FROM ${tableName}) as t"
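Applied to the function from the question (a sketch reusing its getJdbcProp and getDBConnectionProperties helpers), the fix might look like this; SQL Server requires every column of a derived table to have a name, which is what the alias provides:

import org.apache.spark.sql.SQLContext

def getMaxID(sqlContext: SQLContext, tableName: String): Long = {
  // Alias the aggregate so the derived table "t" exposes a named column
  val pushdownQuery = s"(SELECT MAX(ID) AS max_id FROM ${tableName}) AS t"
  sqlContext.read
    .jdbc(url = getJdbcProp(sqlContext.sparkContext).toString,
          table = pushdownQuery,
          properties = getDBConnectionProperties(sqlContext.sparkContext))
    .head()
    .getLong(0)
}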

Spark SQL SQLContext

I'm trying to select data from an MSSQL database via SQLContext.sql in a Spark application.
The connection works, but I'm not able to select data from the table, because it always fails on the table name.
Here is my code:
val prop=new Properties()
val url2="jdbc:jtds:sqlserver://servername;instance=MSSQLSERVER;user=sa;password=Pass;"
prop.setProperty("user","username")
prop.setProperty("driver" , "net.sourceforge.jtds.jdbc.Driver")
prop.setProperty("password","mypassword")
val test=sqlContext.read.jdbc(url2,"[dbName].[dbo].[Table name]",prop)
sqlContext.sql("""
SELECT *
FROM 'dbName.dbo.Table name'
""")
I tried the table name without the quotes ('), and as [dbName].[dbo].[Table name], but it's still the same:
Exception in thread "main" java.lang.RuntimeException: [3.14] failure:
``union'' expected but `.' found
dependencies:
// https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.1" //%"provided"
// https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.6.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.1" //%"provided"
I think the problem in your code is that the query you pass to sqlContext has no access to the original table in the source database. It only has access to the tables saved within the SQL context, for example with df.write.saveAsTable() or with df.registerTempTable() (df.createTempView in Spark 2+).
So, in your specific case, I can suggest a couple of options:
1) If you want the query to be executed on the source database with the exact syntax of your database's SQL, you can pass the query to the "dbtable" argument:
val query = "SELECT * FROM dbName.dbo.TableName"
val df = sqlContext.read.jdbc(url2, s"($query) AS subquery", prop)
df.show
Note that the query needs to be in parentheses, because it will be passed to a "FROM" clause, as specified in the docs:
dbtable: The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.
2) If you don't need to run the query on the source database, you can just pass the table name and then create a temp view in the sqlContext:
val table = sqlContext.read.jdbc(url2, "dbName.dbo.TableName", prop)
table.registerTempTable("temp_table")
val df = sqlContext.sql("SELECT * FROM temp_table")
// or sqlContext.table("temp_table")
df.show()

Spark: Key not found

I am trying to insert into a Hive table from Spark using the following syntax:
tranl1.write.mode("overwrite").partitionBy("t_date").insertInto("tran_spark_part")
Note: tranl1 is a DataFrame; I created it to load data from Oracle:
val tranl1 = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:userid/pwd#localhost:portid","dbtable" -> "(select a.*,to_char(to_date(trunc(txn_date,'DD'),'dd-MM-yy')) as t_date from table a WHERE TXN_DATE >= TRUNC(sysdate,'DD'))"))
My table in Hive:
create table tran_spark_part(id string, amount string, creditaccount string,
  creditbankname string, creditvpa string, customerid string, debitaccount string,
  debitbankname string, debitvpa string, irc string, refid string, remarks string,
  rrn string, status string, txn_date string, txnid string, type string,
  expiry_date string, approvalnum string, msgid string, seqno string, upirc string,
  reversal string, trantype string)
partitioned by (date1 string);
However, when I run
tranl1.write.mode("overwrite").partitionBy("t_date").insertInto("tran_spark_part")
it gives the error:
java.util.NoSuchElementException: key not found: date1
Please help me understand what I am missing or doing wrong.
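No fix is given in the thread, but note that the table is partitioned by date1 while the DataFrame column is named t_date; insertInto resolves partition columns by the names defined on the target table. A minimal sketch, assuming that mismatch is the cause (the dynamic-partition settings are an additional assumption, commonly needed when inserting into a partitioned Hive table):

// Hypothetical fix, not from the original thread
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

tranl1.withColumnRenamed("t_date", "date1")  // match the table's partition column
  .write
  .mode("overwrite")
  .insertInto("tran_spark_part")             // no partitionBy: insertInto uses the table's own partitioning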

How to get the partition column name of a Hive table

I'm developing a Unix script where I'll be dealing with Hive tables partitioned by either column A or column B. I'd like to find out which column a table is partitioned on so that I can do subsequent operations on those partitions.
Is there any property in Hive which returns the partition column directly?
I'm thinking I'll have to do a SHOW CREATE TABLE and somehow extract the partition column name from it, if there isn't any other way.
Maybe not the best, but one more approach is to use the describe command.
Create table:
create table employee ( id int, name string ) PARTITIONED BY (city string);
Command:
hive -e 'describe formatted employee' | awk '/Partition/ {p=1}; p; /Detailed/ {p=0}'
Output:
# Partition Information
# col_name data_type comment
city string
You can improve it as per your need.
One more option, which I didn't explore, is querying the metastore repository tables to get the partition column information for a table.
Through the Scala/Java API, we can connect to the Hive metastore and get the partition column names:
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

val conf = new Configuration()
conf.set("hive.metastore.uris", "thrift://hdppmgt02.domain.com:9083")
val hiveConf = new HiveConf(conf, classOf[HiveConf])
val metastoreClient = new HiveMetaStoreClient(hiveConf)
// db and tbl are the database and table names to inspect
metastoreClient.getTable(db, tbl).getPartitionKeys.asScala.foreach(x => println("Keys : " + x))
# use python pyhive:
import hive_client

def get_partition_column(table_name):
    # hc = hive connection
    hc = hive_client.HiveClient()
    cur = hc.query("desc " + table_name)
    return cur[len(cur) - 1][0]
#################
# hive_client.py
from pyhive import hive

default_encoding = 'utf-8'
host_name = 'localhost'
port = 10000
database = "xxx"

class HiveClient:
    def __init__(self):
        self.conn = hive.Connection(host=host_name, port=port, username='hive', database=database)

    def query(self, sql):
        cursor = self.conn.cursor()
        #with self.conn.cursor() as cursor:
        cursor.execute(sql)
        return cursor.fetchall()

    def execute(self, sql):
        #with self.conn.cursor() as cursor:
        cursor = self.conn.cursor()
        cursor.execute(sql)

    def close(self):
        self.conn.close()
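The same lookup can be done through the Java metastore API; client below is a HiveMetaStoreClient, as in the Scala example above: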
List<String> parts = new ArrayList<>();
try {
    List<FieldSchema> partitionKeys = client.getTable(dbName, tableName).getPartitionKeys();
    for (FieldSchema partition : partitionKeys) {
        parts.add(partition.getName());
    }
} catch (Exception e) {
    throw new RuntimeException("Fail to get Hive partitions", e);
}