Hive table with Avro Schema

I have created a Hive external table with an Avro schema (complex types) and partition columns. After adding the required partition files, a select query returns null for every column except the partition columns. The Avro schema contains array and struct types.
Here is the DDL:
CREATE EXTERNAL TABLE mytable PARTITIONED BY(date int, city string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/cloud/location'
TBLPROPERTIES ('avro.schema.url'='cloud location for schema file');
I tried pointing to the schema file directly and also tried a schema literal in the TBLPROPERTIES.
Either way, the select query returns null for all the columns.
Any suggestions for fixing this issue? Is there anything missing in this scenario?

Related

BigQuery external table over GCS path with partitions

I have some data stored in GCS bucket in the following path:
gcs://my-bucket/my_data/subfolder1/subfolder2/**.csv.gz
I intend to create an external table mapped to my_data and want it to be partitioned by the different levels of subfolders. Note that subfolder1 and subfolder2 don't have a Hive partition prefix, i.e., they are not in the format prefix=value.
If I were to write some pseudocode in Athena syntax, it would be something like this:
CREATE EXTERNAL TABLE `my_data`(
--Column specs go here---
)
PARTITIONED BY (
`partition_0` string,
`partition_1` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'gcs://my-bucket/my-data/'
TBLPROPERTIES (...)
As a result of the pseudocode, the table will consist of two partition columns in addition to the columns defined in the column spec:
partition_0
partition_1
Queries filtering on these two columns will then benefit from partition pruning.
Could anyone please advise whether this is possible in BigQuery? If yes, how should I go about it in SQL?
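Not a confirmed answer, but one approach that may be worth trying: as far as I know, BigQuery's Hive-partitioning support expects key=value folder names, so for a plain folder layout a fallback is to define an ordinary external table over the whole prefix and derive the two levels from the _FILE_NAME pseudo-column. A minimal sketch, assuming a hypothetical dataset mydataset, placeholder columns col_a and col_b, and the gs:// URI scheme that BigQuery uses:

```sql
-- Hypothetical dataset name and column list; replace with the real column specs.
CREATE EXTERNAL TABLE mydataset.my_data (
  col_a STRING,
  col_b STRING
)
OPTIONS (
  format = 'CSV',
  compression = 'GZIP',
  -- A single trailing wildcard; BigQuery URIs use gs:// rather than gcs://.
  uris = ['gs://my-bucket/my_data/*']
);

-- Derive the folder levels from the file path.
-- _FILE_NAME must be given an alias when it is selected.
SELECT
  REGEXP_EXTRACT(_FILE_NAME, r'my_data/([^/]+)/') AS partition_0,
  REGEXP_EXTRACT(_FILE_NAME, r'my_data/[^/]+/([^/]+)/') AS partition_1,
  col_a,
  col_b
FROM mydataset.my_data;
```

This gives the two extra columns, but they are computed at query time rather than being real partition columns, so whether filters on them reduce the files scanned should be verified before relying on it.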

Hive table from HBase with a column containing Avro

I was able to create an external Hive table with just one column containing Avro data stored in HBase, using the following query:
CREATE EXTERNAL TABLE test_hbase_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,familyTest:columnTest",
"familyTest.columnTest.serialization.type" = "avro",
"familyTest.columnTest.avro.schema.url" = "hdfs://path/person.avsc")
TBLPROPERTIES (
"hbase.table.name" = "otherTest",
"hbase.mapred.output.outputtable" = "hbase_avro_table",
"hbase.struct.autogenerate"="true");
What I want to do is create a table with the same Avro column plus other columns containing strings or integers, but I was not able to do that and didn't find any example. Can anyone help me? Thank you.

Define nested items while creating table in Hive

I am trying to create an external Hive table using a CSV file as input.
My data looks like this:
xxx|2021-08-14 07:10:41.080|[{"sub1","90"},{"sub2","95"}]
I am creating the table using the SQL below:
CREATE EXTERNAL TABLE mydb.mytable (
Name string,
Last_upd_timestamp timestamp,
subjects array<struct<sub_code:string,sub_marks:string>>
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('collection.delim'=',','field.delim'='|','serialization.format'='|')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://nameservice1/myinputfile';
When I try the above, the table is created, but the subjects column looks like this:
[{"sub_code":"[{\"sub1\",\"90\"},{\"sub2\",\"95\"}]","sub_marks":null}]
I'm not sure what I am doing wrong here. I would highly appreciate it if someone could show me how to create the table so that the column is parsed as expected.
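A possible explanation, offered as a sketch rather than a confirmed answer: LazySimpleSerDe splits on delimiters only, so it does not understand the brackets, braces and quotes in the third field. With a delimited layout, array items are separated by the collection delimiter and, for a struct nested inside an array, the struct fields are separated by the map-key delimiter. Assuming the input can be rewritten to the hypothetical layout xxx|2021-08-14 07:10:41.080|sub1:90,sub2:95 (the choice of ':' is an assumption), a definition along these lines should parse it:

```sql
-- Assumes each input line looks like:
--   xxx|2021-08-14 07:10:41.080|sub1:90,sub2:95
-- i.e. no brackets, braces or quotes in the nested field (hypothetical layout).
CREATE EXTERNAL TABLE mydb.mytable (
  Name string,
  Last_upd_timestamp timestamp,
  subjects array<struct<sub_code:string,sub_marks:string>>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'            -- top-level columns
  COLLECTION ITEMS TERMINATED BY ','  -- elements of the array
  MAP KEYS TERMINATED BY ':'          -- fields of each struct inside the array
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/myinputfile';
```

If the file has to stay in its current form, another option would be to load the third field as a plain string and parse it at query time.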

Data appears as null in a Redshift external table while working correctly in Athena

I'm trying to run the following simple query on Redshift Spectrum:
select * from company.vehicles where vehicle_id is not null
It returns 0 rows (all of the rows in the table are null). However, when I run the same query on Athena it works fine and returns results. I tried MSCK REPAIR, but both Athena and Redshift use the same metastore, so it shouldn't matter.
I also don't see any errors.
The format of the files is orc.
The create table query is:
CREATE EXTERNAL TABLE `vehicles`(
`vehicle_id` bigint,
`parent_id` bigint,
`client_id` bigint,
`assets_group` int,
`drivers_group` int)
PARTITIONED BY (
`dt` string,
`datacenter` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'classification'='orc',
'compressionType'='none')
Any idea?
How did you create your external table?
For Spectrum, you have to explicitly specify which values should be treated as null.
Add the parameter 'serialization.null.format'='' to the TABLE PROPERTIES of your external table in Spectrum so that every column value of '' is treated as NULL:
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');
Alternatively, you can set the SERDEPROPERTIES when creating the external table, which will make it recognize NULL values automatically.
Eventually it turned out to be a bug in redshift. In order to fix it, we needed to run the following command:
ALTER TABLE table_name SET TABLE PROPERTIES ('orc.schema.resolution'='position');
I had a similar problem and found this solution.
In my case I had external tables, created with Athena, pointing to an S3 bucket that contained heavily nested JSON data. To access them with Redshift I ran SET json_serialization_enable TO true; before my queries to make the nested JSON columns queryable. This led to some columns being NULL when the JSON exceeded a size limit. As the documentation states:
If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.

Timestamp datatype not supported in Hive when reading a Parquet file

I have created a partitioned external table in Hive that stores Parquet files. The table has a timestamp column, and when I load data the timestamp column comes back as null.
Create table query:
CREATE EXTERNAL TABLE abc(
timestamp1 timestamp,
tagname string,
value string,
quality bigint,
own string)
PARTITIONED BY (
etldate string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'adl://refdatalakeprod.azuredatalakestore.net/iconic'
TBLPROPERTIES (
'PARQUET.COMPRESS'='SNAPPY');
Any suggestions, please?
Thanks in advance.
The premise of your question is wrong. The column is not a timestamp type; it is a string type. I think you need to check your data.
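If that diagnosis is right and the Parquet files really store the value as a string, one way to work with it is to declare the column as string and cast it when querying. A minimal sketch, assuming a hypothetical table name abc_str, a hypothetical partition value, and values in a format Hive can cast (such as yyyy-MM-dd HH:mm:ss):

```sql
-- Same layout as the original table, but the first column is declared as string.
CREATE EXTERNAL TABLE abc_str(
  timestamp1 string,
  tagname string,
  value string,
  quality bigint,
  own string)
PARTITIONED BY (
  etldate string)
STORED AS PARQUET
LOCATION
  'adl://refdatalakeprod.azuredatalakestore.net/iconic';

-- Register the existing partition directories with the metastore.
MSCK REPAIR TABLE abc_str;

-- Cast at query time; the result is NULL for any value that is not
-- in a timestamp format Hive recognises.
SELECT CAST(timestamp1 AS timestamp) AS ts, tagname, value
FROM abc_str
WHERE etldate = '2020-01-01';  -- hypothetical partition value
```

Checking the actual column type in the Parquet file schema first would confirm whether this is the cause.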