Presto datatype mismatch issue in Hive ORC table - hive

I am trying to query my Hive ORC table through Presto. In Hive it works fine, but in Presto I can access every column except lowrange, which fails with the error below:
Error: Query 20220322_135856_00076_a33ec failed: Error opening Hive split hdfs://.....filename.orc
(offset=0, length=24216): Malformed ORC file. Cannot read SQL type varchar from ORC stream .lowrange
of type LONG [hdfs://.....filename.orc.orc]
I have set the below property in Presto before starting the query:
set hive1.orc.use-column-names=true
where hive1 is my catalog name.
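For reference, a sketch of the two usual forms of this setting in a Presto/Trino-style Hive connector (the exact file name and the underscore-style session property name are assumptions on my part, so check the docs for your version):
# etc/catalog/hive1.properties (static catalog configuration)
hive.orc.use-column-names=true
-- or per session from the CLI (underscores instead of dots/dashes):
SET SESSION hive1.orc_use_column_names = true;
Either way the connector maps ORC columns by name rather than by position, which only helps when the column names inside the file match the table definition.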
I have also tried changing the Hive table's datatype for this column to DOUBLE, BIGINT, and INT, but nothing worked.
Can someone help me resolve this error?
Table Description:
+-------------------------------+---------------------------------------------------------+-----------------------+--+
| col_name | data_type | comment |
+-------------------------------+---------------------------------------------------------+-----------------------+--+
| # col_name | data_type | comment |
| | NULL | NULL |
| lowrange | string | |
| type | string | |
| processed_date | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| type | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | test | NULL |
| Owner: | hdfs | NULL |
| CreateTime: | Tue Mar 22 08:28:49 UTC 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://......../user/hdfs/test/ | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | EXTERNAL | TRUE |
| | skip.header.line.count | 1 |
| | transient_lastDdlTime | 1647937729 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | field.delim | , |
| | serialization.format | , |
+-------------------------------+---------------------------------------------------------+-----------------------+--+
Sample Data:
lowrange type processed_date
1234567890001212 01 20220323
1234567890001213 01 20220323
Table Create Statement:
CREATE EXTERNAL TABLE `table1`(
`lowrange` string,
`processed_date` string)
PARTITIONED BY (
`type` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://......./user/hdfs/test'
TBLPROPERTIES (
'skip.header.line.count'='1')
Update: I dropped the existing table and created a new one with the column's datatype changed from STRING to BIGINT. I am now able to select data from the table, but when I try to perform an lpad operation it again shows the same kind of issue.
Logic I want to apply on the field: lpad(lowrange, 13, '9')
Error: Unexpected parameters (bigint, integer, varchar(1)) for
function lpad. Expected: lpad(varchar(x), bigint, varchar(y))
Then I tried to cast the BIGINT to VARCHAR using the query below:
Updated logic: lpad(cast(lowrange as varchar), 13, '9')
Error:
Malformed ORC file. Cannot read SQL type bigint
from ORC stream .lowrange of type STRING
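One way to see what is physically inside the ORC file, independent of what the metastore claims (a diagnostic sketch, assuming the Hive CLI is available on the cluster):
hive --orcfiledump hdfs://.....filename.orc
The dump's type line shows the schema the writer used; as long as it records lowrange as a long while the table declares string (or the reverse), the reader will keep failing with this "Cannot read SQL type ... from ORC stream" mismatch, whatever the table DDL says.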

Related

How to determine whether the name of an Impala object corresponds to a view

Is there a way in Impala to determine whether an object name returned by SHOW TABLES corresponds to a table or a view since:
this statement only returns the object names, without their type
SHOW CREATE VIEW is just an alias for SHOW CREATE TABLE (same result, no view/table distinction)
DESCRIBE does not give any clue about the type of the item
Ideally I'd like to list all the tables + views and their types using a single operation, not one call to retrieve the tables + views and then another call for each name to determine its type.
(please note the question is about Impala, not Hive)
You can use DESCRIBE FORMATTED to find out the type of an object:
impala-shell> CREATE TABLE table2(
id INT,
name STRING
);
impala-shell> CREATE VIEW view2 AS SELECT * FROM table2;
impala-shell> DESCRIBE FORMATTED table2;
+------------------------------+--------------------------------------------------------------------+----------------------+
| name | type | comment |
+------------------------------+--------------------------------------------------------------------+----------------------+
| Retention: | 0 | NULL |
| Location: | hdfs://quickstart.cloudera:8020/user/hive/warehouse/test.db/table2 | NULL |
| Table Type: | MANAGED_TABLE | NULL |
+------------------------------+--------------------------------------------------------------------+----------------------+
impala-shell> DESCRIBE FORMATTED view2;
+------------------------------+-------------------------------+----------------------+
| name | type | comment |
+------------------------------+-------------------------------+----------------------+
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Table Type: | VIRTUAL_VIEW | NULL |
| Table Parameters: | NULL | NULL |
| | transient_lastDdlTime | 1601632695 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
+------------------------------+-------------------------------+----------------------+
For a table the type is Table Type: MANAGED_TABLE, and for a view it is Table Type: VIRTUAL_VIEW.
Another way is to query the metastore database (if you have access to it) to inspect the Impala (or Hive) metadata:
mysql> use metastore;
mysql> select * from TBLS;
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+------------------------------------------------------------+---------------------------+----------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+------------------------------------------------------------+---------------------------+----------------+
| 9651 | 1601631971 | 9331 | 0 | anonymous | 0 | 27996 | table1 | MANAGED_TABLE | NULL | NULL | NULL |
| 9652 | 1601632121 | 9331 | 0 | anonymous | 0 | 27997 | view1 | VIRTUAL_VIEW | SELECT `table1`.`id`, `table1`.`name` FROM `test`.`table1` | SELECT * FROM table1 | NULL |
| 9653 | 1601632676 | 9331 | 0 | cloudera | 0 | 27998 | table2 | MANAGED_TABLE | NULL | NULL | NULL |
| 9654 | 1601632695 | 9331 | 0 | cloudera | 0 | 27999 | view2 | VIRTUAL_VIEW | SELECT * FROM test.table2 | SELECT * FROM test.table2 | NULL |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+------------------------------------------------------------+---------------------------+----------------+
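To get the single-operation listing the question asks for, the type is already in TBLS.TBL_TYPE; a sketch of one such query (assuming a MySQL-backed metastore and a database named test):
mysql> SELECT d.NAME, t.TBL_NAME, t.TBL_TYPE FROM TBLS t JOIN DBS d ON d.DB_ID = t.DB_ID WHERE d.NAME = 'test';
This returns every table and view together with MANAGED_TABLE, EXTERNAL_TABLE, or VIRTUAL_VIEW in a single call.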

Can't INSERT INTO postgres column

I have a table with the following schema:
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
----------------------------+-----------------------------+-----------+----------+----------------------------------------+----------+--------------+-------------
id | integer | | not null | nextval('test_table_id_seq'::regclass) | plain | |
orig_filename | text | | not null | | extended | |
file_extension | text | | not null | | extended | |
created_date | date | | not null | | plain | |
last_modified_date | date | | not null | | plain | |
upload_timestamp_utc | timestamp without time zone | | not null | | plain | |
uploaded_by | text | | not null | | extended | |
file_size_in_bytes | integer | | not null | | plain | |
original_containing_folder | text | | not null | | extended | |
file_data | bytea | | not null | | extended | |
source_shortname | text | | | | extended | |
Indexes:
"test_table_pkey" PRIMARY KEY, btree (id)
I appended the source_shortname column after building the table. I now want to INSERT values into the columns.
When I run this command:
INSERT INTO test_table(source_shortname) VALUES('name');
I get this error:
ERROR: null value in column "orig_filename" violates not-null constraint
DETAIL: Failing row contains (31, null, null, null, null, null, null, null, null, null, name).
I didn't set the source_shortname column to "not null" so I'm not sure why it's throwing that error. Particularly because there are only 28 rows and this seems to throw an error on row 31.
An INSERT always creates an entire new row; any column you don't list gets its default value, which is NULL unless a default is defined (here only id has one, from its sequence). Since the other columns are declared NOT NULL, you need to supply a value for every one of them, for example with placeholders that match each column's type:
INSERT INTO test_table(
orig_filename
,file_extension
,created_date
,last_modified_date
,upload_timestamp_utc
,uploaded_by
,file_size_in_bytes
,original_containing_folder
,file_data
,source_shortname)
VALUES
('file.bin', 'bin', '2022-01-01', '2022-01-01', '2022-01-01 00:00:00', 'someone', 0, '/some/folder', '', 'name');
Or you can run an UPDATE instead, since you just added that column and want to populate it for the existing rows:
UPDATE test_table
SET source_shortname = 'name'
WHERE source_shortname IS NULL
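And if you want future partial inserts like the original one to succeed, each NOT NULL column would need its own default; a sketch with made-up defaults (pick values that make sense for your data):
ALTER TABLE test_table ALTER COLUMN orig_filename SET DEFAULT '';
ALTER TABLE test_table ALTER COLUMN file_size_in_bytes SET DEFAULT 0;
-- ...repeat for the remaining NOT NULL columns, after which
INSERT INTO test_table(source_shortname) VALUES('name');
would no longer violate the NOT NULL constraints.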

Changing the default value for one column in MySQL 5.7 gives an error about another column

Since MySQL 5.7, my customers have been running into more problems. One of the issues I found is the following:
When I run a query, I get an error message about a completely different column. That is really strange; who can explain this?
mysql> ALTER TABLE advertisement ALTER COLUMN local_name set default 'x';
ERROR 1067 (42000): Invalid default value for 'end_time'
The table which created this error is following:
mysql> show columns from advertisement;
+----------------+--------------+------+-----+---------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| local_name | varchar(64) | NO | | | |
| chinese_name | varchar(64) | NO | | | |
| image | varchar(128) | NO | | | |
| top1 | varchar(128) | NO | | | |
| center1 | varchar(128) | NO | | | |
| bottom1 | varchar(128) | NO | | | |
| bottom2 | varchar(128) | NO | | | |
| bottom3 | varchar(128) | NO | | | |
| top_colour1 | varchar(16) | NO | | | |
| center_colour1 | varchar(16) | NO | | | |
| bottom_colour1 | varchar(16) | NO | | | |
| bottom_colour2 | varchar(16) | NO | | | |
| bottom_colour3 | varchar(16) | NO | | | |
| start_time | timestamp | NO | | CURRENT_TIMESTAMP | |
| end_time | timestamp | NO | | 0000-00-00 00:00:00 | |
| hour | varchar(64) | NO | | | |
| status | smallint(6) | NO | | 0 | |
+----------------+--------------+------+-----+---------------------+----------------+
That is because of the server SQL mode NO_ZERO_DATE.
From the reference: NO_ZERO_DATE - In strict mode, don't allow '0000-00-00' as a valid date. You can still insert zero dates with the IGNORE option. When not in strict mode, the date is accepted but a warning is generated.
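The error names end_time because the ALTER statement re-validates the whole table definition, and end_time's old zero default no longer passes the NO_ZERO_DATE check. Two possible ways around it (a sketch, not from the original answer; note that MODIFY rebuilds the table and will also complain if existing rows still hold zero dates):
mysql> ALTER TABLE advertisement MODIFY end_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP;
mysql> ALTER TABLE advertisement ALTER COLUMN local_name SET DEFAULT 'x';
or relax the mode for the current session only:
mysql> SET SESSION sql_mode = (SELECT REPLACE(@@sql_mode, 'NO_ZERO_DATE', ''));
mysql> ALTER TABLE advertisement ALTER COLUMN local_name SET DEFAULT 'x';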

Will data get deleted on dropping an internal table created with a LOCATION clause in Hive?

In Hive, if I create an internal table using the LOCATION clause in the table creation statement (specifying a location other than Hive's default), then on dropping that table, will the data be deleted from the specified location just like it is when the data sits in Hive's default location?
Yes, it will delete the location even if it is not Hive's default location.
Let's assume I have a test_tmp table in the default database stored in the /user/yashu/test5 directory.
hive> desc formatted test_tmp;
+-------------------------------+-------------------------------------------------------------+-----------------------+--+
| col_name | data_type | comment |
+-------------------------------+-------------------------------------------------------------+-----------------------+--+
| # col_name | data_type | comment |
| | NULL | NULL |
| id | int | |
| name | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | shu | NULL |
| CreateTime: | Fri Mar 23 03:42:15 EDT 2018 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://nn1.com/user/yashu/test5 | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | numFiles | 1 |
| | totalSize | 12 |
| | transient_lastDdlTime | 1521790935 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | field.delim | , |
| | serialization.format | , |
+-------------------------------+-------------------------------------------------------------+-----------------------+--+
The Hadoop directory contains one .txt file in the test5 directory:
bash$ hadoop fs -ls /user/yashu/test5/
Found 1 items
-rw-r--r-- 3 hdfs hdfs 12 2018-03-23 03:42 /user/yashu/test5/test.txt
Hive table data:
select * from test_tmp;
+--------------+----------------+--+
| test_tmp.id | test_tmp.name |
+--------------+----------------+--+
| 1 | bar |
| 2 | foo |
+--------------+----------------+--+
Once I drop the table in Hive, the test5 directory is also dropped from HDFS:
hive> drop table test_tmp;
bash$ hadoop fs -ls /user/yashu/test5/
ls: `/user/yashu/test5/': No such file or directory
So dropping an internal table in Hive removes the directory (location) the table points to, even when the table is not at Hive's default location.
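For contrast (my addition, not part of the original answer): if the files must survive a DROP TABLE, declare the table EXTERNAL; Hive then removes only the metastore entry and leaves the directory alone.
hive> CREATE EXTERNAL TABLE test_tmp_ext (id int, name string)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '/user/yashu/test5';
hive> DROP TABLE test_tmp_ext;
bash$ hadoop fs -ls /user/yashu/test5/
After the drop, test.txt would still be listed.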

pgAdmin Crashing on Query

I'm trying to perform the below SQL on the listed table/materialized view:
SELECT e.subject_id, e.hadm_id, e.icustay_id, e.itemid, e.charttime, e.value, e.valuenum
FROM MIMICIII.chartevents as e
INNER JOIN MIMICIII.adults_with_sepsis as aas1
ON aas1.subject_id = e.subject_id
INNER JOIN MIMICIII.adults_with_sepsis as aas2
ON aas2.hadm_id = e.hadm_id
However, pgAdmin crashes (Windows error: pgAdmin has stopped responding) after about a minute of runtime. I believe the issue has to do with how the chartevents data is stored: it is split across several tables such as chartevents_2, shown below.
I was hoping I could just perform the join on chartevents and then it would join all of the chartevents tables. As this is incorrect, how should I go about performing this join?
As well, I have provided the code for how chartevents is formed below.
Table:
Table "mimiciii.chartevents_2"
Column | Type | Modifiers | Storage | Stats target | Description
--------------+--------------------------------+-----------+----------+--------------+-------------
row_id | integer | not null | plain | |
subject_id | integer | not null | plain | |
hadm_id | integer | | plain | |
icustay_id | integer | | plain | |
itemid | integer | | plain | |
charttime | timestamp(0) without time zone | | plain | |
storetime | timestamp(0) without time zone | | plain | |
cgid | integer | | plain | |
value | character varying(255) | | extended | |
valuenum | double precision | | plain | |
valueuom | character varying(50) | | extended | |
warning | integer | | plain | |
error | integer | | plain | |
resultstatus | character varying(50) | | extended | |
stopped | character varying(50) | | extended | |
Materialized View:
Materialized view "mimiciii.adults_with_sepsis"
Column | Type | Modifiers | Storage | Stats target | Description
------------------------+---------+-----------+---------+--------------+-------------
subject_id | integer | | plain | |
hadm_id | integer | | plain | |
infection | integer | | plain | |
explicit_severe_sepsis | integer | | plain | |
explicit_septic_shock | integer | | plain | |
organ_dysfunction | integer | | plain | |
mech_vent | integer | | plain | |
angus | integer | | plain | |
Chart Events Code:
--------------------------------------------------------
-- DDL for Table CHARTEVENTS
--------------------------------------------------------
DROP TABLE IF EXISTS CHARTEVENTS CASCADE;
CREATE TABLE CHARTEVENTS
( ROW_ID INT NOT NULL,
SUBJECT_ID INT NOT NULL,
HADM_ID INT,
ICUSTAY_ID INT,
ITEMID INT,
CHARTTIME TIMESTAMP(0),
STORETIME TIMESTAMP(0),
CGID INT,
VALUE VARCHAR(255),
VALUENUM DOUBLE PRECISION,
VALUEUOM VARCHAR(50),
WARNING INT,
ERROR INT,
RESULTSTATUS VARCHAR(50),
STOPPED VARCHAR(50),
CONSTRAINT chartevents_rowid_pk PRIMARY KEY (ROW_ID)
);
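For what it's worth, the chartevents_2 table shown above is typically one child of a PostgreSQL inheritance hierarchy, and a plain query against the parent chartevents scans all children automatically. A diagnostic sketch (pg_inherits is the standard catalog; the schema name comes from the post):
SELECT inhrelid::regclass AS child_table
FROM pg_inherits
WHERE inhparent = 'mimiciii.chartevents'::regclass;
If that lists the chartevents_N tables, the join in the question already covers all of them, and the freeze is more likely pgAdmin struggling to render a very large result set than the join logic itself.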