Timestamp datatype not supported in Hive when reading a Parquet file

I have created a partitioned external table in Hive that stores files in Parquet format. The table has a timestamp column, and when I load data, that column comes back as NULL.
Here is the create-table query:
CREATE EXTERNAL TABLE abc(
timestamp1 timestamp,
tagname string,
value string,
quality bigint,
own string)
PARTITIONED BY (
etldate string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'adl://refdatalakeprod.azuredatalakestore.net/iconic'
TBLPROPERTIES (
'PARQUET.COMPRESS'='SNAPPY');
Any suggestions, please?
Thanks in advance.

The premise of your question is off: the data in that column is not of timestamp type, it is of string type. I think you need to check your data.
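One way to confirm this (a sketch, assuming the Parquet files really do store the value as a string, and that the strings follow Hive's default timestamp format) is to redeclare the column as a string and cast it explicitly on read:

```sql
-- Redeclare the column as string so the Parquet bytes become readable.
-- (Assumes the Parquet writer actually stored the value as a string.)
ALTER TABLE abc CHANGE timestamp1 timestamp1 string;

-- Then cast on read; CAST returns NULL for any row whose format is not
-- 'yyyy-MM-dd HH:mm:ss[.fffffffff]', which also helps diagnose the data.
SELECT CAST(timestamp1 AS timestamp) AS ts, tagname, value
FROM abc
LIMIT 10;
```

If the CAST also returns NULL, the stored strings are not in a format Hive can parse, and the data itself needs fixing upstream.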

Related

Hive table with Avro Schema

I have created a Hive external table with an Avro schema (complex types) and partition columns. After adding the required partition files, a select query returns NULL for every column except the partition columns. The Avro schema contains array and struct types.
Here is the DDL,
CREATE EXTERNAL TABLE mytable PARTITIONED BY(date int, city string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/cloud/location'
TBLPROPERTIES ('avro.schema.url'='cloud location for schema file');
I tried giving the schema file directly, and also tried a schema literal in TBLPROPERTIES.
The select query returns NULL for all the columns.
Any suggestions for fixing this issue? Is there anything missing in this scenario?
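A mismatch between the table's partitions and the actual file locations, or between the reader's and writer's Avro field names, typically shows up as exactly this all-NULL behaviour. A diagnostic sketch (the partition values and paths below are placeholders for your layout, not known values):

```sql
-- Make sure each partition actually points at the directory holding
-- the Avro files (placeholder values shown):
ALTER TABLE mytable ADD IF NOT EXISTS
  PARTITION (`date`=20210801, city='london')
  LOCATION '/cloud/location/20210801/london';

-- Then compare the columns Hive derived from avro.schema.url against
-- the schema the files were written with; a field-name mismatch is
-- read as NULL rather than raising an error:
DESCRIBE FORMATTED mytable;
```

Avro resolves fields by name, so even one renamed field in the reader schema silently becomes NULL instead of failing.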

Hive table from HBase with a column containing Avro

I was able to create an external Hive table with just one column containing Avro data stored in HBase, using the following query:
CREATE EXTERNAL TABLE test_hbase_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,familyTest:columnTest",
"familyTest.columnTest.serialization.type" = "avro",
"familyTest.columnTest.avro.schema.url" = "hdfs://path/person.avsc")
TBLPROPERTIES (
"hbase.table.name" = "otherTest",
"hbase.mapred.output.outputtable" = "hbase_avro_table",
"hbase.struct.autogenerate"="true");
What I wish to do is create a table with the same Avro schema plus other columns containing strings or integers, but I was not able to do that and didn't find any example. Can anyone help me? Thank you.
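One direction worth trying (an untested sketch; familyTest:stringCol and familyTest:intCol are hypothetical HBase columns standing in for your extra fields) is to extend hbase.columns.mapping with additional entries alongside the Avro-typed one, since the per-column SerDe properties only apply to the column they name:

```sql
-- Sketch only: the extra mapped columns are hypothetical examples.
CREATE EXTERNAL TABLE test_hbase_avro_mixed
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,familyTest:columnTest,familyTest:stringCol,familyTest:intCol",
  "familyTest.columnTest.serialization.type" = "avro",
  "familyTest.columnTest.avro.schema.url" = "hdfs://path/person.avsc")
TBLPROPERTIES (
  "hbase.table.name" = "otherTest",
  "hbase.mapred.output.outputtable" = "hbase_avro_table",
  "hbase.struct.autogenerate" = "true");
```

Only the familyTest:columnTest column is deserialized as Avro; the other mapped columns fall back to HBase's default string handling.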

Define nested items while creating table in Hive

I am trying to create an external Hive table using a CSV file as input.
This is what my data looks like:
xxx|2021-08-14 07:10:41.080|[{"sub1","90"},{"sub2","95"}]
I am creating the table using the SQL below:
CREATE EXTERNAL TABLE mydb.mytable (
Name string,
Last_upd_timestamp timestamp,
subjects array<struct<sub_code:string,sub_marks:string>>)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('collection.delim'=',','field.delim'='|','serialization.format'='|')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://nameservice1/myinputfile'
When I try the above, the table is created, but the subjects column comes out as:
[{"sub_code":"[{\"sub1\",\"90\"},{\"sub2\",\"95\"}]","sub_marks":null}]
Not sure what I am doing wrong here. I would highly appreciate it if someone could help me create the table so that it produces the expected output.
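The JSON-style punctuation in the third field ({, }, ") is not something LazySimpleSerDe can split into a struct; it only understands one flat delimiter per nesting level. A sketch of what does work, assuming you can write the file with plain nested delimiters instead (e.g. sub1:90,sub2:95 in the third field):

```sql
-- Works if the input line looks like:
--   xxx|2021-08-14 07:10:41.080|sub1:90,sub2:95
-- '|' splits top-level fields, ',' splits array elements,
-- ':' splits struct members at the next nesting level.
CREATE EXTERNAL TABLE mydb.mytable (
  name string,
  last_upd_timestamp timestamp,
  subjects array<struct<sub_code:string,sub_marks:string>>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/myinputfile';
```

If the JSON-ish format cannot be changed, the usual alternatives are to read the third field as a plain string and parse it with get_json_object, or to use a JSON SerDe for the whole file.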

How to rename a column when creating an external table in Athena based on Parquet files in S3?

Does anybody know how to rename a column when creating an external table in Athena based on Parquet files in S3?
The Parquet files I'm trying to load have both a column named export_date as well as an export_date partition in the s3 structure.
An example file path is: 's3://bucket_x/path/to/data/export_date=2020-08-01/platform=platform_a'
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`export_date` DATE,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
So what I would like to do is rename the export_date column to export_date_exp. The AWS documentation indicates that:
To make Parquet read by index, which will allow you to rename
columns, you must create a table with parquet.column.index.access
SerDe property set to true.
https://docs.amazonaws.cn/en_us/athena/latest/ug/handling-schema-updates-chapter.html#parquet-read-by-name
But the following code does not load any data in the export_date_exp column:
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`export_date_exp` DATE,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ( 'parquet.column.index.access'='true')
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
This question has been asked already, but did not receive an answer:
How to rename AWS Athena columns with parquet file source?
I am asking again because the documentation explicitly says it is possible.
As a side note: in my particular use case I can simply not load the export_date column, as I've learned that reading Parquet by name does not require you to declare every column. Since I don't need the export_date column, this avoids the conflict with the partition name.
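That side note can be turned into a concrete workaround (a sketch under the assumption that export_date really isn't needed as a data column): keep the default read-by-name behaviour and just leave the conflicting column out of the DDL entirely.

```sql
-- Untested sketch: read by name (the default) and omit the data
-- column that clashes with the partition key; Parquet does not
-- require every file column to be declared in the table.
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
  `column_1` string,
  `column_3` DATE,
  `column_4` bigint,
  `column_5` string)
PARTITIONED BY (
  `export_date` string,
  `platform` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://bucket_x/path/to/data'
TBLPROPERTIES ('parquet.compression'='GZIP');
```

Queries can still filter and select export_date, since it is available as the partition key.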

How to avoid partition names in folder paths when creating partitions via Hive

We have a table (table1) which is partitioned on year, month, and day.
I created a table similar to table1, with the same partitions but stored as ORC. I am trying to insert
data into its partitions using the statements below, but the data ends up in folders whose names include the partition keys.
How can I make sure the folders don't have partition names in them?
create external table table1_orc(
col1 string,
col2 string,
col3 int)
PARTITIONED BY (
`year` string,
`month` string,
`day` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION '/base_path_orc/';
set hive.exec.dynamic.partition=true;
insert overwrite table table1_orc partition(year,month,day) select * from table1 where year = '2015' and month = '10' and day = '01';
Path to table1 in hdfs - /base_path/2015/10/01/data.csv
Path to orc table in hdfs (current output) -/base_path_orc/year=2015/month=10/day=01/000000_0
Desired output - /base_path_orc/2015/10/01/000000_0
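Hive's dynamic-partition insert always writes key=value directory names; that naming is how Hive later maps directories back to partitions. To get custom paths like /base_path_orc/2015/10/01/, one approach (a sketch, untested) is to register each partition with an explicit LOCATION first and then load it statically:

```sql
-- Point the partition at the desired custom directory up front:
ALTER TABLE table1_orc ADD IF NOT EXISTS
  PARTITION (year='2015', month='10', day='01')
  LOCATION '/base_path_orc/2015/10/01/';

-- A static-partition insert then writes into that location:
INSERT OVERWRITE TABLE table1_orc PARTITION (year='2015', month='10', day='01')
SELECT col1, col2, col3 FROM table1
WHERE year = '2015' AND month = '10' AND day = '01';
```

The trade-off is that each partition must be added explicitly; dynamic partitioning cannot target per-partition custom locations.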