Case Sensitive Partition Column in hive - hive

I need to create partition on a column with an UPPERCASE Column Name. However, Hive converts all the column Names to lowercase implicitly. I am not able to get data in my select * query. I cannot change the Folder name to lowercase. Below is the Create query I am using:
CREATE EXTERNAL TABLE db.ALERT_DS_Hive_Test
PARTITIONED BY (PROC_ID String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('casesensitive'='PROC_ID')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/sam/hive-test/ALERT_DS.avro/'
TBLPROPERTIES ('avro.schema.url'='/user/sam/hive-test/ALERT_DS.avsc');

Related

Created external table but it's empty

I want to create an external table from a .csv file I uploaded to the server earlier.
In Bline (shell for Hive), I tried running this script:
CREATE EXTERNAL TABLE c_fink_category_mapping (
trench_code string,
fink_code string
)
row format delimited fields terminated by '\073' stored as textfile
location '/appl/trench/dev/data/in/main/daily_wf/fink_category_mapping'
TABLEPROPERTIES ('serialization.null.format' = '')
;
which creates the table w/o any error byt the table itself is empty.
Help would be appreciated.
My textfile is populated with data.
First, check if the location path is correct.
Then try with this configuration:
CREATE EXTERNAL TABLE c_fink_category_mapping (
trench_code string,
fink_code string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='"',
'separatorChar'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/appl/trench/dev/data/in/main/daily_wf/fink_category_mapping';
response provided above seems to be correct:
CREATE EXTERNAL TABLE c_fink_category_mapping (
trench_code string,
fink_code string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='"',
'separatorChar'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/appl/trench/dev/data/in/main/daily_wf/fink_category_mapping';
This will create the table using a comma as the delimiter, which should correctly parse the data in your CSV file and populate the table with the data from the file. You can also specify a different delimiter character, such as '\t', if that is more appropriate for your data.

How can I create an EXTERNAL table with HIVE format in databricks

I am having a external table with below format in hive.
CREATE EXTERNAL TABLE cs_mbr_prov(
key struct<inid:string,......>,
memkey string,
ob_id string,
.....
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping'=' :key,ci:MEMKEY, .....',
'serialization.format'='1')
I want to create same type of table in Azure Databricks where my Input and Output are in parquet format.
As per the official Doc I created and reproduced the table with Input and Output are in parquet format.
Sample code:
CREATE EXTERNAL TABLE `vams`(
`country` string,
`count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'dbfs:/FileStore/'
TBLPROPERTIES (
'totalSize'='2335',
'numRows'='240',
'rawDataSize'='2095',
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'transient_lastDdlTime'='1418173653')
Reference:
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-table-hiveformat

Cloudera - Hive/Impala Show Create Table - Error with the syntax

I'm making some automatic processes to create tables on Cloudera Hive.
For that I am using the show create table statement that me give (for example) the following ddl:
CREATE TABLE clsd_core.factual_player ( player_name STRING, number_goals INT ) PARTITIONED BY ( player_name STRING ) WITH SERDEPROPERTIES ('serialization.format'='1') STORED AS PARQUET LOCATION 'hdfs://nameservice1/factual_player'
What I need is to run the ddl on a different place to create a table with the same name.
However, when I run that code I return the following error:
Error while compiling statement: FAILED: ParseException line 1:123 missing EOF at 'WITH' near ')'
And I remove manually this part "WITH SERDEPROPERTIES ('serialization.format'='1')" it was able to create the table with success.
Is there a better function to retrieves the tables ddls without the SERDE information?
First issue in your DDL is that partitioned column should not be listed in columns spec, only in the partitioned by. Partition is the folder with name partition_column=value and this column is not stored in the table files, only in the partition directory. If you want partition column to be in the data files, it should be named differently.
Second issue is that SERDEPROPERTIES is a part of SERDE specification, If you do not specify SERDE, it should be no SERDEPROPERTIES. See this manual: StorageFormat andSerDe
Fixed DDL:
CREATE TABLE factual_player (number_goals INT)
PARTITIONED BY (player_name STRING)
STORED AS PARQUET
LOCATION 'hdfs://nameservice1/factual_player';
STORED AS PARQUET already implies SERDE, INPUTFORMAT and OUPPUTFORMAT.
If you want to specify SERDE with it's properties, use this syntax:
CREATE TABLE factual_player(number_goals int)
PARTITIONED BY (player_name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format'='1') --I believe you really do not need this
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://nameservice1/factual_player'

Parsing nested xml file return null data in hive

I am parsing a nested xml file using hivexml serde but it returns null while we select the data from hive table.
Sample xml file is xml data.
Query which i created for parsing the xml.
CREATE EXTERNAL TABLE IF NOT EXISTS abc ( mail string, Type string, Id bigint, Date string, LId bigint, value string)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.OptOutEmail"="/Re/mail/text()",
"column.xpath.OptOutType"="/Re/Type/text()",
"column.xpath.SurveyId"="/Re/Id/text()",
"column.xpath.RequestedDate"="/Re/Date/text()",
"column.xpath.EmailListId"="/Re/Lists/LId/text()",
"column.xpath.Description"="/Re/Lists/value/text()")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/abc/xyz'
TBLPROPERTIES ("xmlinput.start"="<Out>","xmlinput.end"= "</Out>");
Please can someone help.
Try with the below query. I have loaded the data into the table from a local path.
CREATE EXTERNAL TABLE IF NOT EXISTS xmlList ( mail string, Type string, Id
bigint, Dated string, LId bigint, value string)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.mail"="/Re/mail/text()",
"column.xpath.Type"="/Re/Type/text()",
"column.xpath.Id"="/Re/Id/text()",
"column.xpath.Dated"="/Re/Dated/text()",
"column.xpath.LId"="/Re/Lists/List/LId/text()",
"column.xpath.value"="/Re/Lists/List/value/text()")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES ("xmlinput.start"="<Re>","xmlinput.end"= "</Re>");

Issue in Spark not able to read S3 subfolders of a hive table

I have 3 non-partitioned tables in Hive.
drop table default.test1;
CREATE EXTERNAL TABLE `default.test1`(
`c1` string,
`c2` string,
`c3` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://bucket_name/dev/sri/sri_test1/';
drop table default.test2;
CREATE EXTERNAL TABLE `default.test2`(
`c1` string,
`c2` string,
`c3` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://bucket_name/dev/sri/sri_test2/';
drop table default.test3;
CREATE EXTERNAL TABLE `default.test3`(
`c1` string,
`c2` string,
`c3` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://bucket_name/dev/sri/sri_test3/';`
`
--INSERT:
insert into default.test1 values("a","b","c");
insert into default.test2 values("d","e","f");
insert overwrite default.test3 select * from default.test1 union all default.test2;
`
If I look in s3 two subfolders have been created because of the union all operation.
aws s3 ls s3://bucket_name/dev/sri/sri_test3/`;
PRE 1/
PRE 2/
Now the issue is that if I try to read the default.test3 table in pyspark and create dataframe.
df = spark.sql("select * from default.test3")
df.count()
0
How to fix this issue?