Spark: Key not found - apache-spark-sql

I am trying to insert into a Hive table from Spark using the following syntax:
tranl1.write.mode("overwrite").partitionBy("t_date").insertInto("tran_spark_part")
Note: tranl1 is a DataFrame; I created it to load data from Oracle.
val tranl1 = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:oracle:thin:userid/pwd#localhost:portid",
  "dbtable" -> "(select a.*, to_char(to_date(trunc(txn_date,'DD'),'dd-MM-yy')) as t_date from table a WHERE TXN_DATE >= TRUNC(sysdate,'DD'))"))
My table in Hive:
create table tran_spark_part(
  id string, amount string, creditaccount string, creditbankname string,
  creditvpa string, customerid string, debitaccount string, debitbankname string,
  debitvpa string, irc string, refid string, remarks string, rrn string,
  status string, txn_date string, txnid string, type string, expiry_date string,
  approvalnum string, msgid string, seqno string, upirc string, reversal string,
  trantype string)
partitioned by (date1 string);
However, when I run
tranl1.write.mode("overwrite").partitionBy("t_date").insertInto("tran_spark_part")
it gives the error:
java.util.NoSuchElementException: key not found: date1
Please help me understand what I am missing or doing wrong.
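A likely cause, for reference: the table is partitioned by date1, while the DataFrame only has t_date, and insertInto resolves partitions against the table's own partition column. A minimal sketch of one way to align the two (an illustration, not a verified fix for this exact setup):
// Sketch: rename the DataFrame's partition column to match the table's
// partition column (date1), keep it as the last column, and let insertInto
// resolve columns by position instead of using partitionBy.
val aligned = tranl1.withColumnRenamed("t_date", "date1")
aligned.write.mode("overwrite").insertInto("tran_spark_part")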

Related

Flink Window Aggregation using TUMBLE failing on TIMESTAMP

We have a table A in a database. We are loading that table into Flink using the Flink SQL JdbcCatalog.
Here is how we are loading the data:
val catalog = new JdbcCatalog("my_catalog", "database_name", username, password, url)
streamTableEnvironment.registerCatalog("my_catalog", catalog)
streamTableEnvironment.useCatalog("my_catalog")
val query = "select timestamp, count from A"
val sourceTable = streamTableEnvironment.sqlQuery(query)
streamTableEnvironment.createTemporaryView("innerTable", sourceTable)
val aggregationQuery = "select window_end, sum(count) from TABLE(TUMBLE(TABLE innerTable, DESCRIPTOR(timestamp), INTERVAL '10' minutes)) group by window_end"
It throws the following error:
Exception in thread "main" org.apache.flink.table.api.ValidationException: SQL validation failed. The window function TUMBLE(TABLE table_name, DESCRIPTOR(timecol), datetime interval[, datetime interval]) requires the timecol is a time attribute type, but is TIMESTAMP(6).
In short, we want to apply a windowed aggregation on an already existing column. How can we do that?
Note: this is batch processing.
Timestamp columns used as time attributes in Flink SQL must be either TIMESTAMP(3) or TIMESTAMP_LTZ(3), and the column must also be marked as ROWTIME.
Add this line to your code
sourceTable.printSchema();
and check the result. The column should be marked as ROWTIME as shown below.
(
`deviceId` STRING,
`dataStart` BIGINT,
`recordCount` INT,
`time_Insert` BIGINT,
`time_Insert_ts` TIMESTAMP(3) *ROWTIME*
)
You can find my sample below.
Table tableCpuDataCalculatedTemp = tableEnv.fromDataStream(streamCPUDataCalculated, Schema.newBuilder()
.column("deviceId", DataTypes.STRING())
.column("dataStart", DataTypes.BIGINT())
.column("recordCount", DataTypes.INT())
.column("time_Insert", DataTypes.BIGINT())
.column("time_Insert_ts", DataTypes.TIMESTAMP(3))
.watermark("time_Insert_ts", "time_Insert_ts")
.build());
The watermark method is what marks it as ROWTIME.
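With the ROWTIME column in place, here is a minimal sketch (in Scala, using the sample schema above; the view name cpu_data is made up) of how the windowed aggregation from the question would then validate:
// Sketch: register the table from the sample above and run the TUMBLE window
// TVF over the ROWTIME column time_Insert_ts.
tableEnv.createTemporaryView("cpu_data", tableCpuDataCalculatedTemp)
val windowed = tableEnv.sqlQuery(
  """SELECT window_start, window_end, SUM(recordCount) AS total
    |FROM TABLE(
    |  TUMBLE(TABLE cpu_data, DESCRIPTOR(time_Insert_ts), INTERVAL '10' MINUTES))
    |GROUP BY window_start, window_end""".stripMargin)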

Write a csv to a partitioned Hive table using Spark org.apache.spark.SparkException: Requested partitioning does not match the table

I have an existing Hive table:
CREATE TABLE form_submit (
  form_id String,
  submitter_name String)
PARTITIONED BY (submission_date String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS ORC;
I have a csv of raw data, which I read using
val session = SparkSession.builder()
.enableHiveSupport()
.config("spark.hadoop.hive.exec.dynamic.partition", "true")
.config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate()
val dataframe = session
.read
.option("header", "true")
.csv(hdfsPath)
I then perform some manipulations on this data, using a series of withColumn and drop statements, to make sure that the format matches the table format.
I then try to write it like so:
formattedDataframe.write
.mode(SaveMode.Append)
.format("hive")
.partitionBy("submission_date")
.saveAsTable(tableName)
I'm not using insertInto, because the columns in the dataframe end up in a bad order, and I wouldn't want to rely on column order anyway.
I run it as a Spark job and get an exception:
Exception in thread "main" org.apache.spark.SparkException: Requested partitioning does not match the form_submit table:
Requested partitions:
Table partitions: "submission_date"
What am I doing wrong? Didn't I choose the partitioning by calling partitionBy?
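A common workaround, sketched here rather than verified against this exact setup: make the DataFrame's column order match the table, with the partition column last, and write with insertInto, which resolves columns by position for an existing Hive table.
import org.apache.spark.sql.functions.col
// Sketch: take the column order from the table itself (the catalog lists the
// partition column last) and insert by position.
val tableCols = session.table(tableName).columns
formattedDataframe
  .select(tableCols.map(col): _*)
  .write
  .mode(SaveMode.Append)
  .insertInto(tableName)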

Error while creating Hive table from MapR-DB (HBase-like)

I have created a MapR-DB table in Spark code:
case class MyLog(count: Int, message: String)
val conf = new SparkConf().setAppName("Launcher").setMaster("local[2]")
val sc = new SparkContext(conf)
val data = Seq(MyLog(3, "monmessage"))
val log_rdd = sc.parallelize(data)
log_rdd.saveToMapRDB("/tables/tablelog",createTable = true, idFieldPath = "message")
When I print this record from the Spark code, I get the following in the console:
{"_id":"monmessage","count":3,"message":"monmessage"}
I would like to create a Hive table so I can run SELECT or other queries on this table, so I tried this:
CREATE EXTERNAL TABLE mapr_table_2(count int, message string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "message")
TBLPROPERTIES("hbase.table.name" = "/tables/tablelog");
but I get:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Error: the HBase columns mapping contains a badly formed column family, column qualifier specification.)
I took the CREATE TABLE query from this link:
http://maprdocs.mapr.com/home/Hive/HiveAndMapR-DBIntegration-GettingStarted.html
By the way, I don't understand what I need to put in the line:
"hbase.columns.mapping" =
Do you have any idea how to create the table? Thanks.
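For reference, the HBase storage handler expects one mapping entry per Hive column: ":key" for the row key and "family:qualifier" for everything else. Below is a sketch of what a well-formed statement could look like, assuming a binary (HBase-style) table with a column family named cf; a JSON table created with saveToMapRDB may need MapR's JSON storage handler instead.
-- Sketch only: the mapping entries line up with the column list in order;
-- the column family name "cf" is an assumption.
CREATE EXTERNAL TABLE mapr_table_2 (message string, count int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:count")
TBLPROPERTIES ("hbase.table.name" = "/tables/tablelog");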

SparkSQL errors when using SQL DATE function

In Spark I am trying to execute SQL queries on a temporary table derived from a data frame that I manually built by reading a CSV file and converting the columns into the right data types.
Specifically, the table I'm talking about is the LINEITEM table from the TPC-H specification [1]. Unlike what is stated in the specification, I am using TIMESTAMP rather than DATE because I've read that Spark does not support the DATE type.
In my single scala source file, after creating the data frame and registering a temporary table called "lineitem", I am trying to execute the following query:
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01 00:00:00');")
When I submit the packaged jar using spark-submit, I get the following error:
Exception in thread "main" java.lang.RuntimeException: [1.75] failure: ``union'' expected but `;' found
When I omit the semicolon and do the same thing, I get the following error:
Exception in thread "main" java.util.NoSuchElementException: key not found: date
Spark version is 1.4.0.
Does anyone have an idea what's the problem with these queries?
[1] http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpch2.17.1.pdf
SQL queries passed to SQLContext.sql shouldn't be delimited with a semicolon - this is the source of your first problem.
The DATE UDF expects a date in the YYYY-MM-DD form, so DATE('1998-12-01 00:00:00') evaluates to null. As long as the timestamp can be cast to DATE, the correct query string looks like this:
"SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01')"
DATE is a Hive UDF. That means you have to use a HiveContext, not a standard SQLContext - this is the source of your second problem.
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // where sc is a SparkContext
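Putting the two pieces together, the corrected query from above would then be run as:
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE date(l.shipdate) <= date('1998-12-01')")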
In Spark >= 1.5 it is also possible to use the to_date function:
import org.apache.spark.sql.functions.{lit, to_date}
df.where(to_date($"shipdate") <= to_date(lit("1998-12-01")))
Please try the Hive function CAST(expression AS datatype).
It converts an expression from one datatype to another.
e.g. CAST('2016-06-17 00.00.000' AS DATE) will convert a String to a Date.
In your case:
val res = sqlContext.sql("SELECT * FROM lineitem l WHERE CAST(l.shipdate as DATE) <= CAST('1998-12-01 00:00:00' AS DATE)")
Supported datatype conversions are listed in Hive Casting Dates.

Reading partitioned parquet file into Spark results in fields in incorrect order

For a table with
create table mytable (
..
)
partitioned by (my_part_column String)
We are executing a Hive SQL query as follows:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
data = hc.sql("select * from my_table limit 10")
The values read back show "my_part_column" as the FIRST item in each row instead of the last one.
It turns out this is a known bug, fixed in Spark 1.3.0 and 1.2.1:
https://issues.apache.org/jira/browse/SPARK-5049