Impala insert vs hive insert - hive

When I tried to insert integer values into a column in a parquet table with Hive command, values are not getting insert and shows as null. But when used impala command it is working. But the partition size reduces with impala insert. Also number of rows in the partitions (show partitions) show as -1. What is the reason for this?
CREATE TABLE `TEST.LOGS`(
`recordtype` string,
`recordstatus` string,
`recordnumber` string,
`starttime` string,
`endtime` string,
`acctsessionid` string,
`subscriberid` string,
`framedip` string,
`servicename` string,
`totalbytes` int,
`rxbytes` int,
`txbytes` int,
`time` int,
`plan` string,
`tcpudp` string,
`intport` string)
PARTITIONED BY (`ymd` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://dev-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
TBLPROPERTIES (
'transient_lastDdlTime'='1634390569')
Insert Statement
Hive
sudo -u hdfs hive -e 'insert into table TEST.LOGS partition (ymd="20220221") select * from TEMP.LOGS;'
Impala
impala-shell --ssl -i xxxxxxxxxxx:21000 -q 'insert into table TEST.LOGS partition (ymd="20220221") select * from TEMP.LOGS;'

When I tried to insert integer values into a column in a parquet table with Hive command, values are not getting insert and shows as null.
Could you pls share your exact insert statement and table definition for precise answer? If i have to guess, this may be because of difference in implicit data type conversion by hive and impala.
HIVE - If you set hive.metastore.disallow.incompatible.col.type.changes to false, the types of columns in Metastore can be changed from any type to any other type. After such a type change, if the data can be shown correctly with the new type, the data will be displayed. Otherwise, the data will be displayed as NULL. As per documentation forward conversion works(int> bigint) whereas backward (big int > small int) doesnt and produces null.
Impala - it supports a limited set of implicit casts to avoid undesired results from unexpected casting behavior. Impala does perform implicit casts among the numeric types, when going from a smaller or less precise type to a larger or more precise one. For example, Impala will implicitly convert a SMALLINT to a BIGINT.
Also number of rows in the partitions (show partitions) show as -1 -
Please run compute stats table_name to fix this issue.

Related

How to create external table on parquet file

I have a parquet file on gcp storage. File converted from simple json {"id":1,"name":"John"}.
Could you help me write the correct script? Is it possible to do that without schema?
create external table test (
id string,
name string
)
row format delimited
fields terminated by '\;'
stored as ?????
location '??????'
tblproperties ('skip.header.line.count'='1');
Hive is , as sql databases, working in a write-in schema architecture so you cannot create a table using HQL without using a schema ( not like other cases for NoSql like Hbase for example or others). I advise you to use a Hive version >= 0.14, it is easier:
CREATE TABLE table_name (
string1 string,
string2 string,
int1 int,
boolean1 boolean,
long1 bigint,
float1 float,
double1 double,
inner_record1 struct,
enum1 string,
array1 array,
map1 map,
union1 uniontype,
fixed1 binary,
null1 void,
unionnullint int,
bytes1 binary)
PARTITIONED BY (ds string);

Cloudera - Hive/Impala Show Create Table - Error with the syntax

I'm making some automatic processes to create tables on Cloudera Hive.
For that I am using the show create table statement that me give (for example) the following ddl:
CREATE TABLE clsd_core.factual_player ( player_name STRING, number_goals INT ) PARTITIONED BY ( player_name STRING ) WITH SERDEPROPERTIES ('serialization.format'='1') STORED AS PARQUET LOCATION 'hdfs://nameservice1/factual_player'
What I need is to run the ddl on a different place to create a table with the same name.
However, when I run that code I return the following error:
Error while compiling statement: FAILED: ParseException line 1:123 missing EOF at 'WITH' near ')'
And I remove manually this part "WITH SERDEPROPERTIES ('serialization.format'='1')" it was able to create the table with success.
Is there a better function to retrieves the tables ddls without the SERDE information?
First issue in your DDL is that partitioned column should not be listed in columns spec, only in the partitioned by. Partition is the folder with name partition_column=value and this column is not stored in the table files, only in the partition directory. If you want partition column to be in the data files, it should be named differently.
Second issue is that SERDEPROPERTIES is a part of SERDE specification, If you do not specify SERDE, it should be no SERDEPROPERTIES. See this manual: StorageFormat andSerDe
Fixed DDL:
CREATE TABLE factual_player (number_goals INT)
PARTITIONED BY (player_name STRING)
STORED AS PARQUET
LOCATION 'hdfs://nameservice1/factual_player';
STORED AS PARQUET already implies SERDE, INPUTFORMAT and OUPPUTFORMAT.
If you want to specify SERDE with it's properties, use this syntax:
CREATE TABLE factual_player(number_goals int)
PARTITIONED BY (player_name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format'='1') --I believe you really do not need this
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://nameservice1/factual_player'

Amazon Athena returning "mismatched input 'partitioned' expecting {, 'with'}" error when creating partitions

I'd like to use this query to create a partitioned table in Amazon Athena:
CREATE TABLE IF NOT EXISTS
testing.partitioned_test(order_id bigint, name string, car string, country string)
PARTITIONED BY (year int)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS 'PARQUET'
LOCATION 's3://testing-imcm-into/partitions'
Unfortunately I don't get the error message which tells me the following:
line 3:2: mismatched input 'partitioned' expecting {, 'with'}
The quotes around 'PARQUET' seemed to be causing a problem.
Try this:
CREATE EXTERNAL TABLE IF NOT EXISTS
partitioned_test (order_id bigint, name string, car string, country string)
PARTITIONED BY (year int)
STORED AS PARQUET
LOCATION 's3://testing-imcm-into/partitions/'

hive record inserted but then get a error

I create a table in hive:
CREATE TABLE `test3`.`shop_dim` (
`shop_id` bigint,
`shop_name` string,
`shop_company_id` bigint,
`shop_url1` string,
`shop_url2` string,
`sid` string,
`shop_open_duration` string,
`date_modified` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' WITH SERDEPROPERTIES ("path"="hdfs://myhdfs/warehouse/tablespace/managed/hive/test3.db/shop_dim")
STORED AS PARQUET
TBLPROPERTIES ('COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"date_modified\":\"true\",\"shop_company_id\":\"true\",\"shop_id\":\"true\",\"shop_name\":\"true\",\"shop_open_duration\":\"true\",\"shop_url1\":\"true\",\"shop_url2\":\"true\",\"sid\":\"true\"}}', 'bucketing_version'='2', 'numFiles'='12', 'numRows'='12', 'rawDataSize'='96', 'spark.sql.create.version'='2.3.0', 'spark.sql.sources.provider'='parquet', 'spark.sql.sources.schema.numParts'='1', 'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"Shop_id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Shop_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Shop_company_id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Shop_url1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Shop_url2\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"sid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Shop_open_duration\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Date_modified\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}}]}', 'totalSize'='17168')
GO
then I insert a record use below sql:
insert into test3.shop_dim values(11,'aaa',22,'11113','2222','sid','opend',unix_timestamp())
I can see the record is inserted,but waited for a long time,there is error:
>[Error] Script lines: 1-2 --------------------------
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.StatsTask
[Executed: 2018-10-24 下午12:00:03] [Execution: 0ms]
I use aqua studio as a tool.Why this error occur?
This issue can happen if the values being inserted are not matching to the expected type.
In your case "date_modified" column is of timestamp type, but unix_timestamp() will return bigint (current Unix timestamp in seconds).
If you execute the query
select unix_timestamp();
Output would be like : 1558547043
Instead, you need to use current_timestamp.
select current_timestamp;
Output would be like : 2019-05-22 17:50:18.803
You can refer Hive manual for in-built date functions at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
Below given hive setting can help resolve org.apache.hadoop.hive.ql.exec.StatsTask (state=08S01,code=1)
set hive.stats.column.autogather=false; or set hive.stats.autogather=false
set hive.optimize.sort.dynamic.partition=true;

Error while creating table in Hive

I am new to hadoop. I need a help regarding error encountered in Hive while creating a new table. I have gone through this Hive FAILED: ParseException line 2:0 cannot recognize input near ''macaddress'' 'CHAR' '(' in column specification
My question: Is it necessary to write a location of the table in the script? because I am writing table location at starting and I am afraid about writing the location because it should not disturb my rest of the databases by any mulfunction operation.
Here is my query:
CREATE TABLE meta_statistics.tank_items (
shop_offers_history_before bigint,
shop_offers_temp bigint,
videos_distinct_temp bigint,
deleted_temp bigint,
t_stamp timestamp )
CLUSTERED BY (
tank_items_id)
INTO 8 BUCKETS
ROW FORMAT SERDE
TBLPROPERTIES (transactional=true)
STORED AS ORC;
The error I am getting is-
ParseException line 1:3 cannot recognize input near 'TBLPROPERTIES'
'(' 'transactional'
What would be the other possibilities of errors and how can I remove those?
There is a syntax error in your create query. Error which you have shared says that hive cannot recognize input near 'TBLPROPERTIES'.
Solution:
As per hive syntax, the key value passed in TBLPROPERTIES should be in double quotes. it should be like this: TBLPROPERTIES ("transactional"="true")
So if I correct your query it will be:
CREATE TABLE meta_statistics.tank_items (
shop_offers_history_before bigint,
shop_offers_temp bigint,
videos_distinct_temp bigint,
deleted_temp bigint,
t_stamp timestamp
) CLUSTERED BY (tank_items_id) INTO 8 BUCKETS
ROW FORMAT SERDE TBLPROPERTIES ("transactional"="true") STORED AS ORC;
Execute above query, then if you get any other syntax error them make sure that the order of STORED AS , CLUSTERED BY , TBLPROPERTIES is as per the hive syntax.
Refer this for more details:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
1) ROW FORMAT SERDE -> you should pass some serde
2) TBLPROPERTIES key value should be in double quotes
3) if you give CLUSTERED BY value should be there in the columns given
replace as follows
CREATE TABLE meta_statistics.tank_items ( shop_offers_history_before bigint, shop_offers_temp bigint, videos_distinct_temp bigint, deleted_temp bigint, t_stamp timestamp ) CLUSTERED BY (shop_offers_history_before) INTO 8 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS ORC TBLPROPERTIES ("transactional"="true");
hope this helps