Error reading Parquet file with SparkSQL and HiveContext - apache-spark-sql

I am trying to read a table from Hive (well, it is Impala) stored in parquet format. I use Spark 1.3.0 and HiveContext.
The schema of the table is:
(a,DoubleType)
(b,DoubleType)
(c,IntegerType)
(d,StringType)
(e,DecimalType(18,0))
My code is:
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(conf)
val hc = new HiveContext(sc)
import hc.implicits._
import hc.sql

val df: DataFrame = hc.table(mytable)
The trace log error is:
16/03/31 11:33:34 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, cloudera-smm-2.desa.taiif.aeat): java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to org.apache.spark.sql.types.Decimal
at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.toDouble(Decimal.scala:330)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5.apply(Cast.scala:361)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5.apply(Cast.scala:361)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:426)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:105)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
...
It seems that the Decimal format is not being cast properly. Any ideas?

The problem was that Spark SQL was converting the metastore Parquet table to use its own built-in Parquet support instead of the Hive SerDe defined in the existing Hive metastore.
You should set this property to false:
hc.setConf("spark.sql.hive.convertMetastoreParquet", "false")
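For completeness, a minimal sketch of the whole fix in context (same placeholder table name as above):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf())
val hc = new HiveContext(sc)

// Fall back to the Hive SerDe instead of Spark SQL's built-in Parquet support,
// so the DECIMAL(18,0) column is read according to the metastore schema.
hc.setConf("spark.sql.hive.convertMetastoreParquet", "false")

val df = hc.table("mytable")
df.printSchema()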

Related

Using unionByName to merge 2 DFs with different columns not working with Spark 3.2.1

Hi, I am trying to merge 2 DFs with different columns in Spark, and I came across
unionByName, which has an allowMissingColumns option.
I am facing issues while using it; here is my piece of code:
import spark.implicits._
val data = Seq(("1","[unassigned]"))
val DefaultResponseDF = data.toDF("try_number","assessment_item_response_xid")
val data2 = Seq(("2"))
val DefaultResponseDF2 = data2.toDF("try_number")
DefaultResponseDF.unionByName(DefaultResponseDF2, allowMissingColumns=True).show()
When I run this on Spark 3.2.1 and Scala 2.12 in a Databricks cluster, I get this error:
error: not found: value True
DefaultResponseDF.unionByName(DefaultResponseDF2, allowMissingColumns=True).show()
^
Let me know if I am missing anything in using this. I believe it is available from Spark 3.1 onwards and I am on Spark 3.2, so that can't be the issue.
Let me know if anyone has faced this before.
Never mind, it was because Scala's boolean literal is lowercase: True does not exist, it should be true.
DefaultResponseDF.unionByName(DefaultResponseDF2, allowMissingColumns = true).show()
This works.
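Putting it together, a minimal self-contained sketch of the working call (assuming Spark 3.1+, where allowMissingColumns was introduced):
import spark.implicits._

val df1 = Seq(("1", "[unassigned]")).toDF("try_number", "assessment_item_response_xid")
val df2 = Seq("2").toDF("try_number")

// Scala boolean literals are lowercase (true), unlike Python's True.
// The column missing from df2 is filled with null in the union result.
df1.unionByName(df2, allowMissingColumns = true).show()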

How to avoid needing to restart the Spark session after overwriting an external table

I have an Azure data lake external table, and want to remove all rows from it. I know that the 'truncate' command doesn't work for external tables, and BTW I don't really want to re-create the table (but might consider that option for certain flows). Anyway, the best I've gotten to work so far is to create an empty data frame (with a defined schema) and overwrite the folder containing the data, e.g.:
from pyspark.sql.types import *

data = []
schema = StructType(
    [
        StructField('Field1', IntegerType(), True),
        StructField('Field2', StringType(), True),
        StructField('Field3', DecimalType(18, 8), True)
    ]
)
sdf = spark.createDataFrame(data, schema)
# sdf.printSchema()
# display(sdf)
sdf.write.format("csv").option('header', True).mode("overwrite").save("/db1/table1")
This mostly works, except that if I go to select from the table, it will fail with the below error:
Error: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 13) (vm-9cb62393 executor 2): java.io.FileNotFoundException: Operation failed: "The specified path does not exist."
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I tried running 'refresh' on the table but the error persisted. Restarting the spark session fixes it, but that's not ideal. Is there a correct way for me to be doing this?
UPDATE: I don't have it working yet, but at least I now have a function that dynamically clears the table:
from pyspark.sql.types import *
from pyspark.sql.types import _parse_datatype_string

def empty_table(database_name, table_name):
    data = []
    schema = StructType()
    for column in spark.catalog.listColumns(table_name, database_name):
        datatype_string = _parse_datatype_string(column.dataType)
        schema.add(column.name, datatype_string, True)
    sdf = spark.createDataFrame(data, schema)
    path = "/{}/{}".format(database_name, table_name)
    sdf.write.format("csv").mode("overwrite").save(path)

How to read tables from one location and write data to a table on another cluster

I read table statistics from one metastore by starting the Spark application with hive.metastore.uris set. However, I need to write the data to another Hive.
I've tried clearing the active and default sessions and building another session with the new metastore URI, but Spark keeps trying to write to the table in the first Hive.
val spark = SparkSession.builder()
  .appName(appName)
  .enableHiveSupport()
  .config("hive.metastore.uris", FIRST_METASTORE)
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .config("spark.sql.caseSensitive", "false")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()

val df = spark.sql("DESCRIBE FORMATTED source_table")

SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()

val spark2 = SparkSession.builder()
  .appName(appName)
  .enableHiveSupport()
  .config("hive.metastore.uris", NEW_METASTORE)
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .config("spark.sql.caseSensitive", "false")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()

SparkSession.setDefaultSession(spark2)
SparkSession.setActiveSession(spark2)

df.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .insertInto("other_cluster_table")
As I said, I would expect the DataFrame to be written to the table location in the new metastore and catalog, but it doesn't happen. This is because the DataFrameWriter interface resolves the target table from df.sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName) when inserting into an existing table, but how can I deal with it?
After reading about multiple SparkContexts, I solved this by writing the parquet directly to namenode/directory/to/partition/ and then adding the partition to the table using beeline.
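A rough sketch of that workaround (the path, partition column and values below are placeholders, not from the original post):
import org.apache.spark.sql.SaveMode

// Write the parquet files straight to the other cluster's HDFS location,
// bypassing the second metastore entirely.
df.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .save("hdfs://other-namenode:8020/warehouse/other_cluster_table/dt=2020-01-01")

// Then register the partition from beeline on the other cluster, e.g.:
//   ALTER TABLE other_cluster_table
//   ADD IF NOT EXISTS PARTITION (dt='2020-01-01')
//   LOCATION '/warehouse/other_cluster_table/dt=2020-01-01';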

Calculate median and average using a Hadoop Spark 1.6 DataFrame, Failed to start database 'metastore_db'

spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
1. using SQLContext
~~~~~~~~~~~~~~~~~~~~
import org.apache.spark.sql.SQLContext

val sqlctx = new SQLContext(sc)
import sqlctx._

val df = sqlctx.read.format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .option("header", "true")
  .load("/user/cloudera/data.csv")

df.select(avg($"col1")).show() // this works fine

sqlctx.sql("select percentile_approx(balance,0.5) as median from port_bank_table").show()
// or
sqlctx.sql("select percentile(balance,0.5) as median from port_bank_table").show()
// both are not working, getting the below error
org.apache.spark.sql.AnalysisException: undefined function percentile_approx; line 0 pos 0
at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:65)
2. using HiveContext
~~~~~~~~~~~~~~~~~~~~
So I tried using HiveContext:
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala> val hivectx = new HiveContext(sc)
18/01/09 22:51:06 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
hivectx: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@5be91161
scala> import hivectx._
import hivectx._
getting the below error
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@be453c4,
see the next exception for details.
I can't find any percentile_approx or percentile function among Spark's aggregation functions. It does not seem like this functionality is built into Spark DataFrames. For more information, please follow this: How to calculate Percentile of column in a DataFrame in spark?
I hope it helps you.
I don't think so; it should work. For that, you should save the DataFrame as a table using saveAsTable. Then you will be able to run your query using HiveContext.
someDF.write.mode(SaveMode.Overwrite)
  .format("parquet")
  .saveAsTable("Table_name")
// In my case "mode" is working as mode("Overwrite")
hivectx.sql("select avg(col1) as median from Table_name").show()
It will work.
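For the median itself, once the DataFrame has been saved as a table, the Hive UDAF should resolve through HiveContext. A minimal sketch, assuming the saveAsTable call above succeeded and reusing the col1 column from the example (the earlier "undefined function percentile_approx" error came from plain SQLContext, which has no Hive function registry):
// percentile_approx is a Hive UDAF, so it is only available through HiveContext.
hivectx.sql("select percentile_approx(col1, 0.5) as median from Table_name").show()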

How to auto-update a %spark.sql result in Zeppelin for a structured streaming query

I'm running Structured Streaming (Spark 2.1.0 with Zeppelin 0.7) on data coming from Kafka, and I'm trying to visualize the streaming result with spark.sql, as below:
%spark2
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession
  .builder()
  .appName("Spark structured streaming Kafka example")
  .master("yarn")
  .getOrCreate()
import spark.implicits._

val inputstream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "n11.hdp.com:6667,n12.hdp.com:6667,n13.hdp.com:6667,n10.hdp.com:6667,n9.hdp.com:6667")
  .option("subscribe", "st")
  .load()

val stream = inputstream.selectExpr("CAST(value AS STRING)").as[(String)].select(
    expr("(split(value, ','))[0]").cast("string").as("pre_post_paid"),
    expr("(split(value, ','))[1]").cast("double").as("DataUpload"),
    expr("(split(value, ','))[2]").cast("double").as("DataDowndownload"))
  .filter("DataUpload is not null and DataDowndownload is not null")
  .groupBy("pre_post_paid").agg(sum("DataUpload") + sum("DataDowndownload") as "size")

val query = stream.writeStream
  .format("memory")
  .outputMode("complete")
  .queryName("test")
  .start()
After it is running, I query "test" as below:
%sql
select *
from test
It updates only when I run it manually. My question is: how can I make it update as new data is processed (streaming visualization), as in this example:
Insights Without Tradeoffs: Using Structured Streaming in Apache Spark
Replace the lines
%sql
select *
from test
with
%spark
spark.table("test").show()
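If the paragraph should keep refreshing on its own, one simple (if crude) option is to poll the in-memory sink from the same %spark paragraph. A minimal sketch, assuming it runs in the interpreter that started the "test" query above; the loop count and interval are arbitrary:
%spark
// Re-read the memory-sink table a few times, pausing between reads,
// so the output reflects newly processed micro-batches.
(1 to 10).foreach { _ =>
  spark.table("test").show()
  Thread.sleep(5000) // wait 5 seconds before the next poll
}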