Spark SQL saveAsTable is not compatible with Hive when partition is specified

Kind of an edge case: when saving a parquet table in Spark SQL with a partition,
// schema definition
final StructType schema = DataTypes.createStructType(Arrays.asList(
DataTypes.createStructField("time", DataTypes.StringType, true),
DataTypes.createStructField("accountId", DataTypes.StringType, true),
...));
DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);
df.coalesce(1)
.write()
.mode(SaveMode.Append)
.format("parquet")
.partitionBy("year")
.saveAsTable("tblclick8partitioned");
Spark warns:
Persisting partitioned data source relation into Hive metastore in
Spark SQL specific format, which is NOT compatible with Hive
In Hive:
hive> describe tblclick8partitioned;
OK
col array<string> from deserializer
Time taken: 0.04 seconds, Fetched: 1 row(s)
Obviously the schema is not correct; however, if I use saveAsTable in Spark SQL without a partition, the table can be queried without problem.
The question is: how can I make a parquet table in Spark SQL compatible with Hive, including the partition info?

That's because DataFrame.saveAsTable creates RDD partitions but not Hive partitions; the workaround is to create the table via HQL before calling DataFrame.saveAsTable. An example from SPARK-14927 looks like this:
hc.sql("create external table tmp.partitiontest1(val string) partitioned by (year int)")
Seq(2012 -> "a", 2013 -> "b", 2014 -> "c").toDF("year", "val")
.write
.partitionBy("year")
.mode(SaveMode.Append)
.saveAsTable("tmp.partitiontest1")

A solution is to create the table with Hive and then save the data with ...partitionBy("year").insertInto("default.mytable").
In my experience, creating the table in Hive and then using ...partitionBy("year").saveAsTable("default.mytable") did not work. This is with Spark 1.6.2.
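A minimal PySpark sketch of that create-the-table-first-then-insertInto approach (database, table, and column names are placeholders; it assumes a SparkSession built with Hive support, and note that newer Spark versions expect insertInto without partitionBy, matching columns by position with the partition column last):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# allow dynamic partition inserts
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# create the partitioned table in Hive first, so the metastore holds a Hive-compatible definition
spark.sql("CREATE TABLE IF NOT EXISTS default.mytable (val STRING) "
          "PARTITIONED BY (year INT) STORED AS PARQUET")

# insertInto matches columns by position, with the partition column last
df = spark.createDataFrame([("a", 2012), ("b", 2013), ("c", 2014)], ["val", "year"])
df.write.mode("append").insertInto("default.mytable")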

Related

In hive, how to generate dynamic table name in hql?

I want to generate a dynamic table name in HQL, which I run using beeline.
In DB2, I can implement this requirement using ||.
For example, I use the year to generate the table name: 'as400.trxfintrx_' || year(current date). But how can I implement this in Hive's HQL?
If I understand it correctly, you want the table name to be parameterized.
For that you can use Hive variables:
create table dbName.table1_${hivevar:yearMonthDate}
(
c1 int,
c2 int
)
stored as orc
tblproperties('orc.compress'='ZLIB');
$ hive -f test_create_table.hql --hivevar yearMonthDate=20190215
OK
Time taken: 1.149 seconds
$ hive
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> use dbname;
OK
Time taken: 0.726 seconds
hive> desc table1_20190215;
OK
c1 int
c2 int
Time taken: 0.302 seconds, Fetched: 2 row(s)
You can refer to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution
From the beeline terminal, you cannot define a function to compute the parameter value and then use it in your queries; the value has to be passed in from outside.
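If you run the script through beeline rather than the hive CLI, the variable can be passed the same way; roughly like this (the JDBC URL is just a placeholder):
$ beeline -u jdbc:hive2://<host>:10000 --hivevar yearMonthDate=20190215 -f test_create_table.hql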
Hope this helps

Apache Zeppelin - Can't load a dataframe from a HIVE table using SparkR

I need to load a dataframe from a Hive table, and for that I followed the instructions from the Apache Spark 2.3 docs (https://spark.apache.org/docs/latest/sparkr.html). I'm doing this in a Zeppelin notebook.
Can someone please explain how to create a dataframe using SparkR? Or what I'm doing wrong? Any answer is appreciated.
Documentation
Queries can be expressed in HiveQL.
results <- sql("FROM src SELECT key, value")
My code:
sp_df <- sql("SELECT * FROM sparkr_test")
Results of my code:
head(sp_df)
[1] "SELECT * FROM sparkr_test"
Where is your data located, and have you registered the source data as a table? You need to run something like:
sql("CREATE TABLE IF NOT EXISTS sparkr_test (column1 INT, column2 STRING ...) USING hive")
sql("LOAD DATA LOCAL INPATH 'path/to/data/data.txt' INTO TABLE sparkr_test")
before you can query the table.
I had the same issue and solved it by specifying the library, so that SparkR's sql function is the one being called:
SparkR::sql("select * from mytable")

Can hive tables that contain DATE type columns be queried using impala?

Every time I try to select a DATE type field in Impala from a table created in Hive, I get AnalysisException: Unsupported type 'DATE'.
Are there any workarounds?
UPDATE: here is an example of a CREATE TABLE schema from Hive and an Impala query.
Schema:
CREATE TABLE myschema.mytable(day_dt date,
event string)
PARTITIONED BY (day_id int)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Impala query
select b.day_dt
from myschema.mytable b;
Impala doesn't have a DATE datatype, whereas Hive does, so you will get AnalysisException: Unsupported type 'DATE' when you access it from Impala. A quick fix would be to store that date value as a string column in Hive and access it however you want from Impala.
If the data is stored as text, it may work to create a new external Hive table that points to the same HDFS location as the existing table, but with a schema that declares day_dt as STRING instead of DATE.
This really is a workaround: it may only suit some use cases, and you'd at least need to run MSCK REPAIR TABLE on the external Hive table whenever a new partition is added.
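A rough HiveQL sketch of that, using the schema from the question (the new table name and the LOCATION path are placeholders, and the row format would have to match whatever delimiters the original table uses):
CREATE EXTERNAL TABLE myschema.mytable_str(
day_dt string,
event string)
PARTITIONED BY (day_id int)
STORED AS TEXTFILE
LOCATION 'hdfs://<namenode>/path/to/myschema.db/mytable';

-- pick up the partition directories that already exist
MSCK REPAIR TABLE myschema.mytable_str;
On the Impala side you would then run INVALIDATE METADATA myschema.mytable_str before querying the new table.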

SparkSql get float type field value null from hive table

I create and import a Hive table with Sqoop and use PySpark to get the data. The table is composed of one string field, one int field, and several float fields. I can get all of the data via a Hue Hive SQL query, but when I query the table with PySpark SQL, the non-float fields are displayed correctly while the float fields always show null values.
(Screenshots not reproduced here: the Hue Hive SQL results, the Zeppelin PySpark output, and the details of the Hive table.)
I finally found the cause. Since I imported these tables from MySQL via Sqoop, the original column names were uppercase, and in Hive they were converted to all lowercase automatically. That is why the values of the converted fields could not be retrieved by Spark SQL (Hue Hive queries return the data normally, so it might be a Spark bug). I had to convert the uppercase field names to lowercase by specifying the --query option in the Sqoop command, i.e. --query 'select MMM as mmm from table...'
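For illustration, the relevant part of the Sqoop command might look like the following (connection string, credentials, column names, and paths are placeholders; Sqoop's --query mode requires the $CONDITIONS token plus either --split-by or -m 1, and --target-dir when combined with --hive-import):
$ sqoop import --connect jdbc:mysql://<host>/<db> --username <user> -P \
  --query 'SELECT MMM AS mmm, NNN AS nnn FROM mytable WHERE $CONDITIONS' \
  --split-by mmm --target-dir /tmp/mytable_lowercase \
  --hive-import --hive-table default.mytable_lowercase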

How to know location about partition in hive?

If I write a hive sql like
ALTER TABLE tbl_name ADD PARTITION (dt=20131023) LOCATION 'hdfs://path/to/tbl_name/dt=20131023';
How can I look up this partition's location later? I found there is some data in the location, but I can't query it with Hive SQL like:
SELECT data FROM tbl_name where dt=20131023;
Do a describe on the partition instead of the full table.
This will show the linked location if it's an external table.
describe formatted tbl_name partition (dt='20131023')
show table extended like 'tbl_name' partition (dt='20131023');
Show Tables/Partitions Extended
SHOW TABLE EXTENDED will list information for all tables matching the given regular expression. Users cannot use regular expression for table name if a partition specification is present. This command's output includes basic table information and file system information like totalNumberFiles, totalFileSize, maxFileSize, minFileSize, lastAccessTime, and lastUpdateTime. If partition is present, it will output the given partition's file system information instead of table's file system information.
If you have multiple nested partitions, the syntax is:
describe formatted table_name partition (day=123,hour=2);
If you want to know the location of files you're reading, use
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE FROM <table> WHERE <part_name> = '<part_key>'
Then you get
hdfs:///user/hive/warehouse/<db>/<table>/<part_name>=<part_key>/000000_0.snappy, 0
hdfs:///user/hive/warehouse/<db>/<table>/<part_name>=<part_key>/000000_1.snappy, 0
This is the format of the command I use to get the exact HDFS location of a specific partition in a specific table:
show table extended like flight_context_fused_record partition(date_key='20181013', partition_id='P-DUK2nESsv', custom_partition_1='ZMP');
In the command above, the partition spec consists of three separate fields. Your example may have more or less.
See results below. Notice the "location:" field shows the HDFS folder location.
hive (nva_test)> show table extended like flight_context_fused_record partition(date_key='20181013', partition_id='P-DUK2nESsv', custom_partition_1='ZMP');
OK
tableName:flight_context_fused_record
owner:nva-prod
location:hdfs://hdp1-ha/tmp/vfisher/cms-context-acquisition-2019-06-13/FlightContextFusedRecord/2018/10/13/ZMP/P-DUK2nESsv
inputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
columns:struct columns { string primary_key, string facility, string position, i32 dalr_channel, i64 start_time_unix_millis, i64 end_time_unix_millis, string foreign_key_to_audio_segment, struct<on_frequency_flight_list:list<struct<acid:string,ac_type:string>>,transfer_list:list<struct<primary_key:string,acid:string,data_id:string,ac_type:string,from_facility:string,from_position:string,transition_time:i64,transition_time_start:i64,transtition_time_end:i64,to_facility:string,to_position:string,source:string,source_info:string,source_time:i64,confidence:double,confidence_description:string,uuid:string>>,source_list:list<string>,domain:string,domains:list<string>> flight_context}
partitioned:true
partitionColumns:struct partition_columns { i32 date_key, string partition_id, string custom_partition_1}
totalNumberFiles:1
totalFileSize:247075687
maxFileSize:247075687
minFileSize:247075687
lastAccessTime:1561122938361
lastUpdateTime:1561071155639
The generic form of the command (taking out my specific values and putting in argument specifiers) looks like this:
show table extended like <your table name here> partition(<your partition spec here>);
You can simply do this:
DESC FORMATTED tablename PARTITION (yr_no='y2019');
OR
DESC EXTENDED tablename PARTITION (yr_no='y2019');
You can get the location of the Hive partitions on HDFS by running any of the following Hive commands.
DESCRIBE FORMATTED tbl_name PARTITION(dt=20131023);
SHOW TABLE EXTENDED LIKE tbl_name PARTITION(dt=20131023);
Alternatively, you can also get it by running the HDFS list command:
hdfs dfs -ls <your Hive store location>/<tablename>
Link: Hive show or list all partitions
Thanks,
NNK
You can get this info via the Hive Metastore Thrift protocol, e.g. with the hmsclient library:
Hive cli:
hive> create table test_table_with_partitions(f1 string, f2 int) partitioned by (dt string);
OK
Time taken: 0.127 seconds
hive> alter table test_table_with_partitions add partition(dt=20210504) partition(dt=20210505);
OK
Time taken: 0.152 seconds
Python cli:
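First the client needs to be constructed, roughly like this (host and port are placeholders for your metastore's Thrift endpoint, which usually listens on port 9083):
>>> from hmsclient import hmsclient
>>> client = hmsclient.HMSClient(host='hdfs.master.host', port=9083)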
>>> with client as c:
... partition = c.get_partition_by_name(db_name='default',
tbl_name='test_table_with_partitions',
part_name='dt=20210504')
...
>>> partition.sd.location
'hdfs://hdfs.master.host:8020/user/hive/warehouse/test_table_with_partitions/dt=20210504'