How to determine whether the name of an Impala object corresponds to a view

Is there a way in Impala to determine whether an object name returned by SHOW TABLES corresponds to a table or a view, since:
this statement only returns the object names, without their type
SHOW CREATE VIEW is just an alias for SHOW CREATE TABLE (same result, no view/table distinction)
DESCRIBE does not give any clue about the type of the item
Ideally I'd like to list all the tables + views and their types using a single operation, not one to retrieve the tables + views and then another call for each name to determine the type of the object.
(please note the question is about Impala, not Hive)

You can use DESCRIBE FORMATTED to find out the type of an object:
impala-shell> CREATE TABLE table2(
id INT,
name STRING
);
impala-shell> CREATE VIEW view2 AS SELECT * FROM table2;
impala-shell> DESCRIBE FORMATTED table2;
+------------------------------+--------------------------------------------------------------------+----------------------+
| name | type | comment |
+------------------------------+--------------------------------------------------------------------+----------------------+
| Retention: | 0 | NULL |
| Location: | hdfs://quickstart.cloudera:8020/user/hive/warehouse/test.db/table2 | NULL |
| Table Type: | MANAGED_TABLE | NULL |
+------------------------------+--------------------------------------------------------------------+----------------------+
impala-shell> DESCRIBE FORMATTED view2;
+------------------------------+-------------------------------+----------------------+
| name | type | comment |
+------------------------------+-------------------------------+----------------------+
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Table Type: | VIRTUAL_VIEW | NULL |
| Table Parameters: | NULL | NULL |
| | transient_lastDdlTime | 1601632695 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
+------------------------------+-------------------------------+----------------------+
For the table, the relevant row is Table Type: MANAGED_TABLE; for the view, it is Table Type: VIRTUAL_VIEW.
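If you fetch the DESCRIBE FORMATTED result from a client program, picking out that row could look roughly like the sketch below. It assumes each result row arrives as a (name, type, comment) tuple, as in the output above; the helper names table_type and is_view are made up for illustration and are not part of any Impala client API.

```python
def table_type(describe_rows):
    """Return the value of the 'Table Type:' row, or None if it is absent."""
    for name, value, _comment in describe_rows:
        # Cells may be padded with spaces, so compare the stripped label.
        if name is not None and name.strip() == "Table Type:":
            return value.strip() if value is not None else None
    return None


def is_view(describe_rows):
    """True when the DESCRIBE FORMATTED rows describe a view."""
    return table_type(describe_rows) == "VIRTUAL_VIEW"


# Rows copied from the DESCRIBE FORMATTED output shown above.
rows_table2 = [
    ("Retention:", "0", None),
    ("Table Type:", "MANAGED_TABLE", None),
]
rows_view2 = [
    ("Protect Mode:", "None", None),
    ("Retention:", "0", None),
    ("Table Type:", "VIRTUAL_VIEW", None),
]

print(is_view(rows_table2), is_view(rows_view2))
```

This still costs one DESCRIBE FORMATTED call per object name, so it does not meet the "single operation" wish in the question.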
Another way is to query the metastore database directly (if you have access to it), which holds the metadata used by Impala (and Hive):
mysql> use metastore;
mysql> select * from TBLS;
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+------------------------------------------------------------+---------------------------+----------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+------------------------------------------------------------+---------------------------+----------------+
| 9651 | 1601631971 | 9331 | 0 | anonymous | 0 | 27996 | table1 | MANAGED_TABLE | NULL | NULL | NULL |
| 9652 | 1601632121 | 9331 | 0 | anonymous | 0 | 27997 | view1 | VIRTUAL_VIEW | SELECT `table1`.`id`, `table1`.`name` FROM `test`.`table1` | SELECT * FROM table1 | NULL |
| 9653 | 1601632676 | 9331 | 0 | cloudera | 0 | 27998 | table2 | MANAGED_TABLE | NULL | NULL | NULL |
| 9654 | 1601632695 | 9331 | 0 | cloudera | 0 | 27999 | view2 | VIRTUAL_VIEW | SELECT * FROM test.table2 | SELECT * FROM test.table2 | NULL |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+------------------------------------------------------------+---------------------------+----------------+
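To get names and types for one database in a single query, TBLS can be joined to the metastore's DBS table. The sketch below only illustrates the shape of that query: it runs against an in-memory SQLite stand-in holding a cut-down copy of the rows above, and the DBS row mapping DB_ID 9331 to the name test is an assumption inferred from the view text. Against the real MySQL metastore you would run the same SELECT through your MySQL client or driver (the placeholder syntax may differ from the ? used here).

```python
import sqlite3

# Lists every table and view of one database together with its type.
LIST_TYPES_SQL = """
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE
FROM TBLS t
JOIN DBS d ON d.DB_ID = t.DB_ID
WHERE d.NAME = ?
ORDER BY t.TBL_NAME
"""


def list_object_types(conn, db_name):
    """Return (db_name, object_name, object_type) rows for db_name."""
    return conn.execute(LIST_TYPES_SQL, (db_name,)).fetchall()


# In-memory stand-in for the metastore, reduced to the columns used here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER,
                   TBL_NAME TEXT, TBL_TYPE TEXT);
INSERT INTO DBS VALUES (9331, 'test');  -- assumed mapping, see lead-in
INSERT INTO TBLS VALUES
    (9651, 9331, 'table1', 'MANAGED_TABLE'),
    (9652, 9331, 'view1',  'VIRTUAL_VIEW'),
    (9653, 9331, 'table2', 'MANAGED_TABLE'),
    (9654, 9331, 'view2',  'VIRTUAL_VIEW');
""")

objects = list_object_types(conn, "test")
for row in objects:
    print(row)
```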

Presto datatype Mismatch Issue In Hive ORC Table

I am trying to query my Hive ORC table from Presto. In Hive it works fine. In Presto I can access all the columns except lowrange, which fails with the error below:
error : Query 20220322_135856_00076_a33ec failed: Error opening Hive split hdfs://.....filename.orc
(offset=0, length=24216): Malformed ORC file. Cannot read SQL type varchar from ORC stream .lowrange
of type LONG [hdfs://.....filename.orc.orc]
I have set below property in presto before starting the query:
set hive1.orc.use-column-names=true
where hive1 is my catalog name.
I have also tried changing the Hive table's data type for this column to Double, BigInt and Int, but nothing worked.
Can someone help me resolve the error?
Table Description:
+-------------------------------+---------------------------------------------------------+-----------------------+--+
| col_name | data_type | comment |
+-------------------------------+---------------------------------------------------------+-----------------------+--+
| # col_name | data_type | comment |
| | NULL | NULL |
| lowrange | string | |
| type | string | |
| processed_date | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| type | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | test | NULL |
| Owner: | hdfs | NULL |
| CreateTime: | Tue Mar 22 08:28:49 UTC 2022 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://......../user/hdfs/test/ | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | EXTERNAL | TRUE |
| | skip.header.line.count | 1 |
| | transient_lastDdlTime | 1647937729 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | field.delim | , |
| | serialization.format | , |
+-------------------------------+---------------------------------------------------------+-----------------------+--+
Sample Data:
lowrange type processed_date
1234567890001212 01 20220323
1234567890001213 01 20220323
Table Create Statement:
CREATE EXTERNAL TABLE `table1`(
`lowrange` string,
`processed_date` string)
PARTITIONED BY (
`type` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://......./user/hdfs/test'
TBLPROPERTIES (
'skip.header.line.count'='1')
Update: I dropped the existing table and created a new one with the column's data type changed from String to BigInt in Hive. I can now select data from the table, but when I try to apply lpad the problem comes back.
Logic I want to apply on Field : lpad(lowrange ,13,'9')
Error: Unexpected parameters (bigint, integer, varchar(1)) for
function lpad. Expected: lpad(varchar(x), bigint, varchar(y))
Then I tried casting the bigint to varchar using the query below:
Updated Logic : lpad(cast(lowrange as varchar),13,'9')
Error:
Malformed ORC file. Cannot read SQL type bigint
from ORC stream .lowrange of type STRING

Will data get deleted on dropping internal table using location clause during its creation from hive?

In Hive, if I create an internal table using the LOCATION clause (pointing to a location other than Hive's default location) in the table creation statement, will dropping that table delete the data from the specified location, just as it does when the data is in Hive's default location?
Yes, it will delete the data at that location even if it is not Hive's default location.
Let's assume I have a table test_tmp in the default database, stored in the /user/yashu/test5 directory.
hive> desc formatted test_tmp;
+-------------------------------+-------------------------------------------------------------+-----------------------+--+
| col_name | data_type | comment |
+-------------------------------+-------------------------------------------------------------+-----------------------+--+
| # col_name | data_type | comment |
| | NULL | NULL |
| id | int | |
| name | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| Owner: | shu | NULL |
| CreateTime: | Fri Mar 23 03:42:15 EDT 2018 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://nn1.com/user/yashu/test5 | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | numFiles | 1 |
| | totalSize | 12 |
| | transient_lastDdlTime | 1521790935 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.TextInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | field.delim | , |
| | serialization.format | , |
+-------------------------------+-------------------------------------------------------------+-----------------------+--+
The HDFS directory test5 contains one .txt file:
bash$ hadoop fs -ls /user/yashu/test5/
Found 1 items
-rw-r--r-- 3 hdfs hdfs 12 2018-03-23 03:42 /user/yashu/test5/test.txt
Hive table data
select * from test_tmp;
+--------------+----------------+--+
| test_tmp.id | test_tmp.name |
+--------------+----------------+--+
| 1 | bar |
| 2 | foo |
+--------------+----------------+--+
Once I drop the table in Hive, the directory test5 is also removed from HDFS:
hive> drop table test_tmp;
bash$ hadoop fs -ls /user/yashu/test5/
ls: `/user/yashu/test5/': No such file or directory
So when we drop an internal table in Hive, Hive also deletes the directory (location) the table points to, even if that location is not the default one.

db2, roll up unknown number of rows from case statement result

I am trying to write a query that concatenates some rows into a single column based on the result of a case statement, in DB2 v9.5.
Each ContractId can have a variable number of rows as well.
Given I have the following table structure
Table1
+------------+------------+------+
| ContractId | Reference | Code |
+------------+------------+------+
| 12 | P123456789 | A |
| 12 | A987654321 | B |
| 12 | 9995559971 | C |
| 12 | 3215654778 | D |
| 13 | abcdef | A |
| 15 | asdfa | B |
| 37 | 282jd | B |
| 89 | asdf82 | C |
+------------+------------+------+
I would like to get the output of the result like so
+-------------+-----------------------+------------------------------------+
| ContractId | Reference with Code A | Other References |
+-------------+-----------------------+------------------------------------+
| 12 | P123456789 | A987654321, 9995559971, 3215654778 |
| 13 | abcdef | asdfa, 282jd, asdf82 |
+-------------+-----------------------+------------------------------------+
I've tried queries like
select t1.contractId,
max(case when t1.code = 'A' then t1.reference end) as "reference with code a",
max(case when t1.code in ('B','C','D') then t1.reference end) as "other references"
from table1 t1
group by t1.contractId
however, this is still giving me an output like
+-------------+-----------------------+------------------+
| ContractId | Reference with Code A | Other References |
+-------------+-----------------------+------------------+
| 12 | P123456789 | null |
| 12 | null | A987654321 |
| 12 | null | 9995559971 |
| 12 | null | 3215654778 |
+-------------+-----------------------+------------------+
I've also attempted using some of the XML aggregate functions but can't seem to get them to format the output the way I want.
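For reference, the shape of the result being asked for (one row per ContractId, the code A reference in one column, all other references concatenated in another) can be sketched as below. This is not DB2 code: it runs on an in-memory SQLite database and uses SQLite's GROUP_CONCAT, so on DB2 v9.5 the concatenating aggregate would have to be swapped for whatever that version offers (the XML aggregate route mentioned above is one candidate). The aliases ref_code_a and other_refs are made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table1 (ContractId INTEGER, Reference TEXT, Code TEXT)"
)
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [
        (12, "P123456789", "A"),
        (12, "A987654321", "B"),
        (12, "9995559971", "C"),
        (12, "3215654778", "D"),
        (13, "abcdef", "A"),
        (15, "asdfa", "B"),
        (37, "282jd", "B"),
        (89, "asdf82", "C"),
    ],
)

# Conditional aggregation: MAX picks the single code-A reference,
# GROUP_CONCAT joins every non-A reference and skips the NULLs.
ROLLUP_SQL = """
SELECT ContractId,
       MAX(CASE WHEN Code = 'A' THEN Reference END) AS ref_code_a,
       GROUP_CONCAT(CASE WHEN Code <> 'A' THEN Reference END, ', ')
           AS other_refs
FROM Table1
GROUP BY ContractId
ORDER BY ContractId
"""

rollup = conn.execute(ROLLUP_SQL).fetchall()
for row in rollup:
    print(row)
```

SQLite does not promise any particular order for the concatenated values, so treat the order inside other_refs as unspecified.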

SQL count results on left join

I'm trying to get the total count from a joined table, grouped by an id that appears on multiple rows. Here's my example below -
Table 1:
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| project_id | int(11) | NO | | NULL | |
| token | varchar(32) | NO | | NULL | |
| email | varchar(255) | NO | | NULL | |
| status | char(1) | NO | | 0 | |
| permissions | varchar(255) | YES | | NULL | |
| created | datetime | NO | | NULL | |
| modified | datetime | NO | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
Table 2:
+------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(32) | NO | | NULL | |
| account_id | int(11) | NO | | NULL | |
| created | datetime | NO | | NULL | |
| modified | datetime | NO | | NULL | |
| active | tinyint(1) | YES | | 1 | |
+------------+-------------+------+-----+---------+----------------+
I have this statement so far -
SELECT account_id, (SELECT COUNT(invitations.id)
FROM invitations WHERE invitations.project_id = projects.id) AS inv_count
FROM projects order by account_id;
And here's a sample of the results:
+------------+-----------+
| account_id | inv_count |
+------------+-----------+
| 1 | 0 |
| 2 | 2 |
| 2 | 0 |
| 3 | 4 |
| 3 | 0 |
| 3 | 4 |
| 3 | 0 |
| 4 | 6 |
| 4 | 3 |
| 4 | 3 |
| 4 | 5 |
| 4 | 3 |
| 4 | 9 |
| 5 | 6 |
| 5 | 0 |
| 5 | 4 |
| 5 | 2 |
| 5 | 2 |
How do I get account_id to show once and the sum of inv_count to show as 1 line? So I should see -
+------------+-----------+
| account_id | inv_count |
+------------+-----------+
| 1 | 0 |
| 2 | 2 |
| 3 | 8 |
You only need to put your query in a derived table (and name it, say tmp) and then group by the account_id:
SELECT account_id,
SUM(inv_count) AS inv_count
FROM
( SELECT account_id,
(SELECT COUNT(invitations.id)
FROM invitations
WHERE invitations.project_id = projects.id
) AS inv_count
FROM projects
) AS tmp
GROUP BY account_id
ORDER BY account_id ;
To simplify it further, you can convert the inline subquery to a LEFT JOIN. This way, no derived table is needed. I've also added aliases and removed the ORDER BY. MySQL versions before 8.0 sort implicitly when you use GROUP BY, so it's not needed there (unless you want to order by some other expression, different from the one you group by). From MySQL 8.0 on that implicit sort is gone, so add ORDER BY p.account_id back if you rely on the order:
SELECT
p.account_id,
COUNT(i.id) AS inv_count
FROM
projects AS p
LEFT JOIN
invitations AS i
ON i.project_id = p.id
GROUP BY
p.account_id ;
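A quick way to convince yourself that the two forms agree is to run both on a small data set. The sketch below does that on an in-memory SQLite database rather than MySQL, with the tables cut down to the columns the queries touch and with made-up rows chosen to reproduce the first three accounts of the sample output. An explicit ORDER BY is added to both queries so the comparison does not depend on any implicit ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY, account_id INTEGER NOT NULL);
CREATE TABLE invitations (id INTEGER PRIMARY KEY,
                          project_id INTEGER NOT NULL);
-- Account 1 has one project, account 2 has two, account 3 has four.
INSERT INTO projects (id, account_id) VALUES
    (1, 1), (2, 2), (3, 2), (4, 3), (5, 3), (6, 3), (7, 3);
-- Project 2 has 2 invitations, projects 4 and 6 have 4 each.
INSERT INTO invitations (project_id) VALUES
    (2), (2), (4), (4), (4), (4), (6), (6), (6), (6);
""")

# Form 1: correlated subquery wrapped in a derived table, then summed.
DERIVED_SQL = """
SELECT account_id, SUM(inv_count) AS inv_count
FROM (SELECT account_id,
             (SELECT COUNT(invitations.id)
              FROM invitations
              WHERE invitations.project_id = projects.id) AS inv_count
      FROM projects) AS tmp
GROUP BY account_id
ORDER BY account_id
"""

# Form 2: LEFT JOIN; COUNT(i.id) ignores the NULLs of unmatched projects.
JOIN_SQL = """
SELECT p.account_id, COUNT(i.id) AS inv_count
FROM projects AS p
LEFT JOIN invitations AS i ON i.project_id = p.id
GROUP BY p.account_id
ORDER BY p.account_id
"""

derived = conn.execute(DERIVED_SQL).fetchall()
joined = conn.execute(JOIN_SQL).fetchall()
print(derived)
print(joined)
```

Counting i.id rather than * is what keeps accounts without invitations at 0 in the LEFT JOIN form.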

Selecting multiple "most recent by timestamp" in mysql

I have a table containing log entries for various servers. I need to create a view with the most recent (by time) log entry for each idServer.
mysql> describe serverLog;
+----------+-----------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+-----------+------+-----+-------------------+----------------+
| idLog | int(11) | NO | PRI | NULL | auto_increment |
| idServer | int(11) | NO | MUL | NULL | |
| time | timestamp | NO | | CURRENT_TIMESTAMP | |
| text | text | NO | | NULL | |
+----------+-----------+------+-----+-------------------+----------------+
mysql> select * from serverLog;
+-------+----------+---------------------+------------+
| idLog | idServer | time | text |
+-------+----------+---------------------+------------+
| 1 | 1 | 2009-12-01 15:50:27 | log line 2 |
| 2 | 1 | 2009-12-01 15:50:32 | log line 1 |
| 3 | 3 | 2009-12-01 15:51:43 | log line 3 |
| 4 | 1 | 2009-12-01 10:20:30 | log line 0 |
+-------+----------+---------------------+------------+
What makes this difficult (for me) is:
Entries for earlier dates/times may be inserted later, so I can't just rely on idLog.
timestamps are not unique, so I need to use idLog as a tiebreaker for "latest".
I can get the result I want using a subquery, but I can't put a subquery into a view. Also, I hear that subquery performance sucks in MySQL.
mysql> SELECT * FROM (
SELECT * FROM serverLog ORDER BY time DESC, idLog DESC
) q GROUP BY idServer;
+-------+----------+---------------------+------------+
| idLog | idServer | time | text |
+-------+----------+---------------------+------------+
| 2 | 1 | 2009-12-01 15:50:32 | log line 1 |
| 3 | 3 | 2009-12-01 15:51:43 | log line 3 |
+-------+----------+---------------------+------------+
What is the correct way to write my view?
I recommend using:
CREATE OR REPLACE VIEW vw_your_view AS
SELECT t.*
FROM SERVERLOG t
JOIN (SELECT sl.idserver,
MAX(sl.time) AS max_time
FROM SERVERLOG sl
GROUP BY sl.idserver) x ON x.idserver = t.idserver
AND x.max_time = t.time
Never define an ORDER BY in a VIEW, because there's no guarantee that the order you specify is the one needed every time the view is used.
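One thing to watch with the join on MAX(time): if two rows for the same idServer share the latest timestamp, both come back. The sketch below shows that on an in-memory SQLite database (not MySQL), using the sample rows from the question plus one hypothetical extra row (idLog 5) that ties with idLog 3. It also shows a NOT EXISTS variant, which is a different technique from the answer above, that uses idLog as the tiebreaker the question asks for. The view names are made up, and whether your MySQL version accepts each form inside a view is something to check yourself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE serverLog (
    idLog    INTEGER PRIMARY KEY,
    idServer INTEGER NOT NULL,
    time     TEXT NOT NULL,
    text     TEXT NOT NULL
);
INSERT INTO serverLog VALUES
    (1, 1, '2009-12-01 15:50:27', 'log line 2'),
    (2, 1, '2009-12-01 15:50:32', 'log line 1'),
    (3, 3, '2009-12-01 15:51:43', 'log line 3'),
    (4, 1, '2009-12-01 10:20:30', 'log line 0'),
    -- Hypothetical extra row: same timestamp as idLog 3 (a tie).
    (5, 3, '2009-12-01 15:51:43', 'log line 4');

-- Join on the per-server maximum time: ties yield several rows.
CREATE VIEW vw_latest_join AS
SELECT t.*
FROM serverLog t
JOIN (SELECT sl.idServer, MAX(sl.time) AS max_time
      FROM serverLog sl
      GROUP BY sl.idServer) x
  ON x.idServer = t.idServer AND x.max_time = t.time;

-- Keep a row only if no other row of the same server is newer,
-- or equally new with a higher idLog.
CREATE VIEW vw_latest_tiebreak AS
SELECT t.*
FROM serverLog t
WHERE NOT EXISTS (
    SELECT 1
    FROM serverLog n
    WHERE n.idServer = t.idServer
      AND (n.time > t.time
           OR (n.time = t.time AND n.idLog > t.idLog))
);
""")

join_ids = [r[0] for r in conn.execute(
    "SELECT idLog FROM vw_latest_join ORDER BY idLog")]
tiebreak_ids = [r[0] for r in conn.execute(
    "SELECT idLog FROM vw_latest_tiebreak ORDER BY idLog")]
print(join_ids, tiebreak_ids)
```

Note that the ORDER BY here sits in the queries that read the views, not in the view definitions, in line with the advice above.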