Hive accesses ORC external table by position rather than column name

I'm using a Hive 3.1.0 cluster on HDInsight 4.0.
The ORC and Parquet files, containing the same data, were created using Spark with schema (a string, b int, c string).
create external table a_st_b_int_d_st_orc(a string, b int, d string) stored as orc location <path_to_spark_created_files>
select * from a_st_b_int_d_st_orc;
+----+----+------+
| a | b | d |
+----+----+------+
| 1 | 2 | abc |
| 2 | 3 | bcd |
+----+----+------+
create external table a_st_b_int_d_st_parquet(a string, b int, d string) stored as parquet
location <path_to_spark_created_files>
select * from a_st_b_int_d_st_parquet;
+----+----+-------+
| a | b | d |
+----+----+-------+
| 1 | 2 | NULL |
| 2 | 3 | NULL |
+----+----+-------+
The default behavior of Hive's native ORC reader is to map metastore column names to ORC file columns by position.
There were JIRAs created to map columns by name, but they were reverted as well.
The behavior for Parquet can be configured using parquet.column.index.access, the default being column resolution by name.
In Presto we can also specify hive.orc.use-column-names=true.
How do I turn off this default positional ORC behavior in Hive?
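For reference, a hedged sketch of how the two name-vs-position switches mentioned above are usually applied; the session-level SET is an assumption (the property can also go in hive-site.xml or table properties), and this still leaves the ORC question itself open:
-- Hive: make the Parquet reader resolve columns by position instead of by name
set parquet.column.index.access=true;
-- Presto: in the hive catalog properties file, tell the ORC reader to map by name
-- hive.orc.use-column-names=true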

Related

Create external table from csv on HDFS, all values come with quotes

I have a CSV file on HDFS and I am trying to create an Impala table from it. The table gets created, but all the values come wrapped in double quotes (").
CREATE external TABLE abc.def
(
name STRING,
title STRING,
last STRING,
pno STRING
)
row format delimited fields terminated by ','
location 'hdfs:pathlocation'
tblproperties ("skip.header.line.count"="1") ;
The output is
name title last pno
"abc" "mr" "xyz" "1234"
"rew" "ms" "pre" "654"
I just want to create a table from the CSV file without the quotes. Please guide me on where I am going wrong.
Regards,
R
One way to do this is to create a staging table that loads the file with the quotes, and then use CTAS (CREATE TABLE AS SELECT) to create the final table, cleaning the fields with the replace function.
As an example
CREATE TABLE quote_stage(
id STRING,
name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
+-----+----------+
| id | name |
+-----+----------+
| "1" | "pepe" |
| "2" | "ana" |
| "3" | "maria" |
| "4" | "ramon" |
| "5" | "lucia" |
| "6" | "carmen" |
| "7" | "alicia" |
| "8" | "pedro" |
+-----+----------+
CREATE TABLE t_quote
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT replace(id,'"','') AS id, replace(name,'"','') AS name FROM quote_stage;
+----+--------+
| id | name |
+----+--------+
| 1 | pepe |
| 2 | ana |
| 3 | maria |
| 4 | ramon |
| 5 | lucia |
| 6 | carmen |
| 7 | alicia |
| 8 | pedro |
+----+--------+
Hope this helps.
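Not part of the original answer, but a possible alternative sketch: in Hive the quotes can be stripped at read time with OpenCSVSerde, so no staging table is needed. Note that this SerDe reads every column as STRING, Impala's support for custom SerDes differs, and the table name def_csv below is hypothetical:
CREATE EXTERNAL TABLE abc.def_csv (
name STRING,
title STRING,
last STRING,
pno STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
LOCATION 'hdfs:pathlocation'
TBLPROPERTIES ("skip.header.line.count"="1");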

Insert overwrite on partitioned table is not deleting the existing data

I am trying to run insert overwrite over a partitioned table.
The select query of insert overwrite omits one partition completely. Is it the expected behavior?
Table definition
CREATE TABLE `cities_red`(
`cityid` int,
`city` string)
PARTITIONED BY (
`state` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'auto.purge'='true',
'last_modified_time'='1555591782',
'transient_lastDdlTime'='1555591782');
Table Data
+--------------------+------------------+-------------------+--+
| cities_red.cityid | cities_red.city | cities_red.state |
+--------------------+------------------+-------------------+--+
| 13 | KARNAL | HARYANA |
| 13 | KARNAL | HARYANA |
| 1 | Nagpur | MH |
| 22 | Mumbai | MH |
| 22 | Mumbai | MH |
| 755 | BPL | MP |
| 755 | BPL | MP |
| 10 | BANGLORE | TN |
| 10 | BANGLORE | TN |
| 10 | BANGLORE | TN |
| 10 | BANGLORE | TN |
| 12 | NOIDA | UP |
| 12 | NOIDA | UP |
+--------------------+------------------+-------------------+--+
Queries
insert overwrite table cities_red partition (state) select * from cities_red where city !='NOIDA';
It does not delete any data from the table
insert overwrite table cities_red partition (state) select * from cities_red where city !='Mumbai';
It removes the expected 2 rows from the table.
Is this an expected behavior from Hive in case of partitioned tables?
Yes, this is expected behavior.
INSERT OVERWRITE TABLE ... PARTITION ... SELECT overwrites only the partitions that exist in the dataset returned by the SELECT.
In your example, the partition state=UP has records with city='NOIDA' only. The filter where city !='NOIDA' removes the entire state=UP partition from the returned dataset, and this is why it is not being rewritten.
The filter city !='Mumbai' does not filter out an entire partition; the state=MH partition is partially returned, and this is why it is overwritten with the filtered data.
It works as designed. Consider the scenario where you need to overwrite only selected partitions, which is quite normal for incremental partition loads: you want to be able to overwrite just those partitions without touching the others, since rewriting unchanged partitions unnecessarily can be very expensive to recover from.
If you still want to drop partitions and modify data in existing partitions, you can drop and recreate the table (you may need an intermediate table for this) and then load the partitions into it.
Alternatively, calculate separately which partitions you need to drop and execute ALTER TABLE ... DROP PARTITION, as sketched below.
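For illustration, a short sketch using the cities_red table from this question (standard Hive DDL; the partition value is just the one from the example above):
-- drop the partition whose data should disappear entirely
ALTER TABLE cities_red DROP IF EXISTS PARTITION (state='UP');
-- then insert overwrite only the partitions that actually changed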

How to explode map datatype in Hive OR how to give multiple aliases in Hive

Suppose I query :
select explode(map_column_name) as exploded from table_name
I get this error:
The number of aliases in the AS clause does not match the number of
columns output by the UDTF, expected 2 aliases but got 1
I googled the error and learned that to give more than one alias, we use the stack function.
How do I use the stack function along with the explode function so that I can explode the map datatype and also give 2 aliases at a time?
Kindly bear with me as I am a beginner and learning Hive.
With default column names
select explode(map) from table_name
With aliases
select explode(map) as (mykey,myval) from table_name
Demo
With default column names
select explode (map('A',1,'B',2,'C',3))
;
+-----+-------+
| key | value |
+-----+-------+
| A | 1 |
| B | 2 |
| C | 3 |
+-----+-------+
With aliases
select explode (map('A',1,'B',2,'C',3)) as (mykey,myvalue)
;
+-------+---------+
| mykey | myvalue |
+-------+---------+
| A | 1 |
| B | 2 |
| C | 3 |
+-------+---------+
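Beyond the original answer: if you also need other columns from the table alongside the exploded key/value pairs, the usual pattern is LATERAL VIEW (a sketch; table_name and map_column_name are the placeholders from the question):
select t.*, m.mykey, m.myval
from table_name t
lateral view explode(map_column_name) m as mykey, myval;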

hive - show table's column details only

I have created a Hive partitioned table, and when I run DESCRIBE on it I see partition information as well as the table column details. If I want to see only the table column details, what command can I use?
create table t1 (x int, y int, s string) partitioned by (z date) stored as sequencefile;
describe t1;
+--------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+--+
| x | int | |
| y | int | |
| s | string | |
| z | date | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| z | date | |
+--------------------------+-----------------------+-----------------------+--+
Can the last 5 rows be avoided?
| NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| z | date | |
Also, what does this NULL | NULL row mean?
What you're looking for is this configuration parameter:
set hive.display.partition.cols.separately=false
From hive documentation:
In Hive 0.10.0 and earlier, no distinction is made between partition columns and non-partition columns while displaying columns for DESCRIBE TABLE. From Hive 0.12.0 onwards, they are displayed separately.
In Hive 0.13.0 and later, the configuration parameter hive.display.partition.cols.separately lets you use the old behavior, if desired (HIVE-6689). For an example, see the test case in the patch for HIVE-6689.
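A minimal sketch of the parameter in use with the table from this question (the exact rendering depends on the client; the NULL | NULL rows in the earlier output appear to be blank separator lines that beeline renders as NULL, an observation rather than documented behavior):
set hive.display.partition.cols.separately=false;
describe t1;
-- expected: only the four column rows (x, y, s, z) are listed,
-- without the separate "# Partition Information" block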

BigQuery query to find the column names of a table

I need a query to find the column names of a table (table metadata) in BigQuery, like the following query in SQL:
SELECT column_name,data_type,data_length,data_precision,nullable FROM all_tab_cols where table_name ='EMP';
BigQuery now supports INFORMATION_SCHEMA.
Suppose you have a dataset named MY_PROJECT.MY_DATASET and a table named MY_TABLE; then you can run the following query:
SELECT column_name
FROM MY_PROJECT.MY_DATASET.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'MY_TABLE'
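Since the question also asks for data types and nullability, a hedged extension of the same query (INFORMATION_SCHEMA.COLUMNS exposes these as data_type and is_nullable) would be:
SELECT column_name, data_type, is_nullable
FROM MY_PROJECT.MY_DATASET.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'MY_TABLE'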
Yes, you can get table metadata using INFORMATION_SCHEMA.
One of the examples in the linked documentation retrieves metadata from the INFORMATION_SCHEMA.COLUMN_FIELD_PATHS view for the commits table in the github_repos dataset; you just have to:
Open the BigQuery web UI in the GCP Console.
Enter the following standard SQL query in the Query editor box. INFORMATION_SCHEMA requires standard SQL syntax. Standard SQL is the default syntax in the GCP Console.
SELECT
*
FROM
`bigquery-public-data`.github_repos.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE
table_name = "commits"
AND (column_name = "author" OR column_name = "difference")
Note: INFORMATION_SCHEMA view names are case-sensitive.
Click Run.
The results should look like the following
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| table_name | column_name | field_path | data_type | description |
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| commits | author | author | STRUCT<name STRING, email STRING, time_sec INT64, tz_offset INT64, date TIMESTAMP> | NULL |
| commits | author | author.name | STRING | NULL |
| commits | author | author.email | STRING | NULL |
| commits | author | author.time_sec | INT64 | NULL |
| commits | author | author.tz_offset | INT64 | NULL |
| commits | author | author.date | TIMESTAMP | NULL |
| commits | difference | difference | ARRAY<STRUCT<old_mode INT64, new_mode INT64, old_path STRING, new_path STRING, old_sha1 STRING, new_sha1 STRING, old_repo STRING, new_repo STRING>> | NULL |
| commits | difference | difference.old_mode | INT64 | NULL |
| commits | difference | difference.new_mode | INT64 | NULL |
| commits | difference | difference.old_path | STRING | NULL |
| commits | difference | difference.new_path | STRING | NULL |
| commits | difference | difference.old_sha1 | STRING | NULL |
| commits | difference | difference.new_sha1 | STRING | NULL |
| commits | difference | difference.old_repo | STRING | NULL |
| commits | difference | difference.new_repo | STRING | NULL |
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
For newbies like me, the above follows this general syntax:
select * from project_name.dataset_name.INFORMATION_SCHEMA.COLUMNS where table_catalog = 'project_name' and table_schema = 'dataset_name' and table_name = 'table_name'
Update: This is now possible! See the INFORMATION_SCHEMA docs and the answers below.
Answer, circa 2012:
It's not currently possible to retrieve table metadata (i.e. column names and types) via a query, though this isn't the first time it's been requested.
Is there a reason you need to do this as a query? Table metadata is available via the tables API.
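As a sketch of that non-query route, the bq CLI can dump a table's schema directly (the project, dataset, and table names below are placeholders):
bq show --schema --format=prettyjson my_project:my_dataset.my_table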
Actually, it is possible to do this using SQL. To do so, you need to query the logging table for the last log entry of this particular table being created.
For example, assuming the table is loaded/created daily:
CREATE TEMP FUNCTION jsonSchemaStringToArray(jsonSchema String)
RETURNS ARRAY<STRING> AS ((
SELECT
SPLIT(
REGEXP_REPLACE(REPLACE(LTRIM(jsonSchema,'{ '),'"fields": [',''), r'{[^{]+"name": "([^\"]+)"[^}]+}[, ]*', '\\1,')
,',')
));
WITH valid_schema_columns AS (
WITH array_output AS (SELECT
jsonSchemaStringToArray(jsonSchema) AS column_names
FROM (
SELECT
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.schemaJson AS jsonSchema
, ROW_NUMBER() OVER (ORDER BY metadata.timestamp DESC) AS record_count
FROM `realself-main.bigquery_logging.cloudaudit_googleapis_com_data_access_20170101`
WHERE
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.tableId = '<table_name>'
AND
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.datasetId = '<schema_name>'
AND
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.createDisposition = 'CREATE_IF_NEEDED'
) AS t
WHERE
t.record_count = 1 -- grab the latest entry
)
-- this is actually what UNNESTS the array into standard rows
SELECT
valid_column_name
FROM array_output
LEFT JOIN UNNEST(column_names) AS valid_column_name
)
SELECT valid_column_name
FROM valid_schema_columns
You can also check columns by querying the table through the bq CLI, which is easy and simple:
bq query --use_legacy_sql=false 'select Hour, sum(column1) as column1 from `project_id.dataset.table_name` where Date(Hour) = "2020-06-10"'