Does Hive have any API to check a SQL statement without executing it?

I am looking for an API that checks a submitted SQL statement without executing it. When my users submit a SQL statement, I need to tell them in real time whether it is correct; then I need to save the statement and execute it later. Does the Hive API have this feature?

EXPLAIN <query to be checked> would meet this requirement of yours.
Hive provides an EXPLAIN command that shows the execution plan for a query. When a query is prefixed with EXPLAIN, it is first checked for syntax and semantic errors, and then the execution plan is returned as the result. This way, users can check whether the query they have written is correct and, from the plan, whether it will execute efficiently. (No actual execution of the query happens here.)
To read more about EXPLAIN, you can refer HERE.
Sample output of a query with EXPLAIN:
EXPLAIN SELECT * FROM test_table;
STAGE DEPENDENCIES:
Stage-0 is a root stage

STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: test_table
Statistics: Num rows: 1 Data size: 15812 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int), name (type: string), email_preferences (type: struct<email_format:string,frequency:string,categories:struct<promos:boolean,surveys:boolean>>), addresses (type: map<string,struct<street_1:string,street_2:string,city:string,state:string,zip_code:string>>), orders (type: array<struct<order_id:string,order_date:string,items:array<struct<product_id:int,sku:string,name:string,price:double,qty:int>>>>)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 1 Data size: 15812 Basic stats: COMPLETE Column stats: NONE
ListSink
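EXPLAIN also fails fast on an invalid query, which is what makes it usable as a real-time validity check. A hypothetical session (missing_col is an assumed non-existent column; the exact error text varies by Hive version):
-- Semantic analysis rejects the reference; nothing is executed.
EXPLAIN SELECT missing_col FROM test_table;
-- FAILED: SemanticException [Error 10004]: Invalid table alias or column reference 'missing_col'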
Hope it helps!

Related

Does spark-sql query plan indicate which table partitions are used?

By looking at spark-sql plans, is there a way I can tell if a particular table (hive/iceberg) partition is being used or not?
For example, we have a table that has 3 partitions, let's say A=A_VAL, B=B_VAL, C=C_VAL. By looking at the plan is there a way I can tell if
the partitions are used fully (all 3 partitions used)
the partitions are used only partially (maybe only 1 or 2 of the partitions are used; for example, partition A is used but not B or C)
If spark-sql plans do not provide this information, is there any way I can get this information?
You can use the code below to print the (logical and physical) plans.
from pyspark.sql import SparkSession
# create a SparkSession (with older Spark versions, sqlContext.sql works the same way)
spark = SparkSession.builder.getOrCreate()
# create a df using your sql
df = spark.sql("SELECT field1 AS f1, field2 AS f2 FROM table1")
# use explain to see the plans; without the argument you get only the physical plan
df.explain(True)
== Parsed Logical Plan ==
...
== Analyzed Logical Plan ==
...
== Optimized Logical Plan ==
...
== Physical Plan ==
...
EDIT: I ran explain for my table and posted an excerpt below, which shows that Hive chooses only a few partitions (folders) rather than going through all of them. You should be able to see similar output.
Here the table is partitioned on part_col.
The query used to generate this plan: explain extended select * from mytab where part_col in (10,50).
Sorry, I do not have Spark installed, so I can't test the Spark side.
Path -> Alias:
hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=10.0 [tmp]
hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=50.0 [tmp]
Path -> Partition:
hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=10.0
Partition
base file name: part_col=10.0
input format: org.apache.hadoop.mapred.TextInputFormat
...
hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=50.0
Partition
base file name: part_col=50.0
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
partition values:
college_marks 50.0
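For the Spark side, the physical plan of a file-based scan shows the pruning directly; a sketch, reusing mytab and part_col from above (the exact plan layout varies by Spark version):
-- In Spark SQL, look at the FileScan node of the physical plan
EXPLAIN EXTENDED SELECT * FROM mytab WHERE part_col IN (10, 50);
-- A line such as "PartitionFilters: [part_col IN (10,50)]" on the scan node
-- means only the matching partition directories are read.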

Create a view with Filter parameter

I'm creating a view in Hive which unions two tables and has a lot of data. Is there a way to pass a filter parameter to the view in Hive so that it is applied to the underlying tables as well?
I have
CREATE VIEW abc
AS
SELECT * FROM
(SELECT * FROM table_a
UNION
SELECT * FROM table_b) temp;
If I run something like SELECT * FROM abc WHERE day='2018-10-22'
it should return the union on the selected date only, like
SELECT * FROM table_a WHERE day='2018-10-22' UNION
SELECT * FROM table_b WHERE day='2018-10-22'
How do I create a view that does this?
There is no need to add the filter explicitly for optimization purposes: the query optimizer can push the predicate down. Take a look at this:
CREATE TABLE `t5`(`a` string);
CREATE TABLE `t6`(`a` string);
CREATE VIEW v1
AS
SELECT * FROM
(
SELECT * FROM t5
UNION ALL
SELECT * from t6
) temp;
This is the explain of the query select * from v1 where a = "b". As you can see, there are two independent table scans, and the predicate is applied to each. It would be really disappointing if Hive pulled all the data at this point and filtered at the end :)
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t5
filterExpr: (a = 'b') (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: (a = 'b') (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: 'b' (type: string)
outputColumnNames: _col0
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableScan
alias: t6
filterExpr: (a = 'b') (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: (a = 'b') (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: 'b' (type: string)
outputColumnNames: _col0
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
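The same pushdown applies to the original day filter. A quick way to verify it on the real view (a sketch, assuming table_a and table_b both have a day column):
EXPLAIN SELECT * FROM abc WHERE day = '2018-10-22';
-- Each TableScan should show filterExpr: (day = '2018-10-22'),
-- i.e. the predicate is evaluated at the scans, before the UNION.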

Hive map join: Hive picking the bigger table to store in cache

I have the below properties set.
set hive.auto.convert.join=true;
set hive.optimize.ppd=true;
Table A has 25 million records. Table B has 44 million records, but there are conditions in the where clause that filter Table B; after applying the filters, the number of records comes down to 2 million.
Instead of using table B as the map-join (cached) side, Hive chooses table A: 25 million records are cached onto all the data nodes.
Below is the query used
select col1,col2,col3,col4
from table_A a
join
table_B c
on
a.account_number=c.account_number and c.ins_date between '$date_6' and '$date_cur'
What should be done to make sure HIVE caches table B?
Plan after including the STREAMTABLE hint on the larger table:
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
b
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
b
TableScan
alias: b
Statistics: Num rows: 23894045 Data size: 7048743275 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
condition expressions:
0 {cm_mac_fin} {wan_mac} {restart} {reboot} {day_id}
1 {division} {region}
keys:
0 cm_mac_fin (type: string)
1 mac (type: string)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 2599797 Data size: 678547017 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {cm_mac_fin} {wan_mac} {restart} {reboot} {day_id}
1 {mac} {division} {region}
keys:
0 cm_mac_fin (type: string)
1 mac (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: int), _col4 (type: date), _col8 (type: string), _col9 (type: string), _col10 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Execution mode: vectorized
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Plan after including the MAPJOIN hint on the smaller table:
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
b
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
b
TableScan
alias: b
Statistics: Num rows: 23894045 Data size: 7048743275 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
condition expressions:
0 {cm_mac_fin} {wan_mac} {restart} {reboot} {day_id}
1 {division} {region}
keys:
0 cm_mac_fin (type: string)
1 mac (type: string)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 2599797 Data size: 678547017 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {cm_mac_fin} {wan_mac} {restart} {reboot} {day_id}
1 {mac} {division} {region}
keys:
0 cm_mac_fin (type: string)
1 mac (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: int), _col4 (type: date), _col8 (type: string), _col9 (type: string), _col10 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 26283450 Data size: 7753617770 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Execution mode: vectorized
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Hive internally uses multiple factors to determine the cached table and the streamed table for joins:
It converts queries to map joins based on the configuration flags (hive.auto.convert.join.noconditionaltask, hive.auto.convert.join.noconditionaltask.size, hive.mapjoin.smalltable.filesize).
The size configuration lets the user control what size of table can fit in memory.
If n tables participate in the join, then n-1 of them have to fit in memory for the map-join optimization to take effect.
When n=2 and hive.auto.convert.join is set to true, Hive goes for a map join and caches the table that is smaller than hive.mapjoin.smalltable.filesize.
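For example, a minimal sketch of the relevant settings (the sizes here are illustrative assumptions; tune them to your cluster):
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
-- combined small-table size allowed for a single map-join task (~100 MB here)
SET hive.auto.convert.join.noconditionaltask.size=100000000;
-- on-disk size below which a table counts as "small" (~25 MB here)
SET hive.mapjoin.smalltable.filesize=25000000;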
In your case, you may explicitly specify the cached table instead of letting Hive determine it:
select /*+MAPJOIN(c)*/ col1,col2,col3,col4
from table_A a
join
table_B c
on
a.account_number=c.account_number and c.ins_date between '$date_6' and '$date_cur'
Move your where clause into a CTE before the join.
WITH b as (
SELECT account_number, col1, col2, col3, col4
FROM table_B
WHERE ins_date between '$date_6' and '$date_cur'
)
SELECT a.col1, a.col2, a.col3, a.col4
FROM table_A a join b
on a.account_number = b.account_number;
This way b, which is the right side of the join, has only 2 million records and therefore gets loaded into RAM.
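To confirm which side Hive actually hash-caches, you can prefix the rewritten query with EXPLAIN (a sketch; operator names vary by Hive version):
EXPLAIN
WITH b as (
SELECT account_number, col1, col2, col3, col4
FROM table_B
WHERE ins_date between '$date_6' and '$date_cur'
)
SELECT a.col1, a.col2, a.col3, a.col4
FROM table_A a join b
on a.account_number = b.account_number;
-- The alias listed under "Map Reduce Local Work" / "HashTable Sink Operator"
-- is the cached side; after the rewrite it should be b.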

Execution time select * vs select count(*)

I'm trying to measure execution time of a query, but I have a feeling that my results are wrong.
Before every query I execute: sync; echo 3 > /proc/sys/vm/drop_caches
My server log file results are:
2014-02-08 14:28:30 EET LOG: duration: 32466.103 ms statement: select * from partsupp
2014-02-08 14:32:48 EET LOG: duration: 9785.503 ms statement: select count(*) from partsupp
Shouldn't select count(*) take more time to execute since it makes more operations?
To output all the results from select * I need 4 minutes (not 32 seconds, as indicated by the server log). I understand that the client has to output a lot of data and will be slow, but what about the server's log? Does it count output operations too?
I also used explain analyze and the results are (as expected):
select *: Total runtime: 13254.733 ms
select count(*): Total runtime: 13463.294 ms
I have run it many times and the results are similar.
What exactly does the log measure?
Why is there such a big difference for the select * query between explain analyze and the server's log, if it doesn't count I/O operations?
What is the difference between the log measurement and explain analyze?
I have a dedicated server with Ubuntu 12.04 and PostgreSQL 9.1
Thank you!
Any aggregate function has some small overhead, but on the other hand SELECT * sends a lot of data to the client, depending on the number and size of the columns.
The log measurement is the total query time. It can be similar to EXPLAIN ANALYZE, but often it is significantly faster, because EXPLAIN ANALYZE collects execution times (and execution statistics) for every subnode of the execution plan, which is usually significant overhead. On the other hand, EXPLAIN ANALYZE has no overhead from transporting data from server to client.
The first query asks for all rows in a table. Therefore, the entire table must be read.
The second query only asks how many rows there are. The database can answer this by reading the entire table, but it can also answer it by reading any index it has for that table. Since an index is smaller than the table, that would be faster. In practice, nearly all tables have indexes (because a primary key constraint creates an index, too).
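You can check which path the planner actually takes with EXPLAIN, reusing the table from the question (note that index-only scans only arrived in PostgreSQL 9.2, so on 9.1 count(*) still has to scan the heap):
EXPLAIN SELECT count(*) FROM partsupp;
-- On 9.1 this shows an Aggregate over a Seq Scan; on 9.2+ with a usable
-- index and a recently vacuumed table it can become an Index Only Scan.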
select * = select all data, all columns included
select count(*) = count how many rows there are
For example, take this table:
name | id | address
---- | -- | -------
s    | 12 | abc
x    | 14 | cc
y    | 15 | vv
z    | 16 | ll
select * will display the whole table
select count(*) will display the total number of rows = 4

how does a SQL query work?

How does a SQL query work?
How does it get compiled?
Is the from clause compiled first to see if the table exists?
How does it actually retrieve data from the database?
How and in what format are the tables stored in a database?
I am using phpMyAdmin; is there any way I can peek into the files where the data is stored?
I am using MySQL
SQL execution order:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> ORDER BY -> LIMIT
A SQL query mainly works in three phases:
1) Row filtering - Phase 1: done by the FROM, WHERE, GROUP BY, and HAVING clauses.
2) Column filtering: columns are filtered by the SELECT clause.
3) Row filtering - Phase 2: done by the DISTINCT, ORDER BY, and LIMIT clauses.
Here I will explain with an example. Suppose we have a students table as follows:
id_ | name_    | marks | section_
1   | Julia    | 88    | A
2   | Samantha | 68    | B
3   | Maria    | 10    | C
4   | Scarlet  | 78    | A
5   | Ashley   | 63    | B
6   | Abir     | 95    | D
7   | Jane     | 81    | A
8   | Jahid    | 25    | C
9   | Sohel    | 90    | D
10  | Rahim    | 80    | A
11  | Karim    | 81    | B
12  | Abdullah | 92    | D
Now we run the following sql query:
select section_,sum(marks) from students where id_<10 GROUP BY section_ having sum(marks)>100 order by section_ LIMIT 2;
Output of the query is:
section_ | sum
A        | 247
B        | 131
But how did we get this output?
I have explained the query step by step below:
1. FROM, WHERE clause execution
Since the FROM clause works first, from students where id_<10 eliminates the rows whose id_ is greater than or equal to 10. The following rows remain after executing from students where id_<10:
id_ | name_    | marks | section_
1   | Julia    | 88    | A
2   | Samantha | 68    | B
3   | Maria    | 10    | C
4   | Scarlet  | 78    | A
5   | Ashley   | 63    | B
6   | Abir     | 95    | D
7   | Jane     | 81    | A
8   | Jahid    | 25    | C
9   | Sohel    | 90    | D
2. GROUP BY clause execution
Now the GROUP BY clause comes in: after executing GROUP BY section_, the rows are grouped as below:
id_ | name_    | marks | section_
9   | Sohel    | 90    | D
6   | Abir     | 95    | D
1   | Julia    | 88    | A
4   | Scarlet  | 78    | A
7   | Jane     | 81    | A
2   | Samantha | 68    | B
5   | Ashley   | 63    | B
3   | Maria    | 10    | C
8   | Jahid    | 25    | C
3. HAVING clause execution
having sum(marks)>100 eliminates groups. The sum(marks) of group D is 185, of group A is 247, of group B is 131, and of group C is 35. Group C's sum is not greater than 100, so group C is eliminated. Now the table looks like this:
id_ | name_    | marks | section_
9   | Sohel    | 90    | D
6   | Abir     | 95    | D
1   | Julia    | 88    | A
4   | Scarlet  | 78    | A
7   | Jane     | 81    | A
2   | Samantha | 68    | B
5   | Ashley   | 63    | B
4. SELECT clause execution
select section_,sum(marks) only decides which columns to print: here, the section_ and sum(marks) columns.
section_ | sum
D        | 185
A        | 247
B        | 131
5. ORDER BY clause execution
order by section_ sorts the rows by section_ in ascending order.
section_ | sum
A        | 247
B        | 131
D        | 185
6. LIMIT clause execution
LIMIT 2; prints only the first 2 rows.
section_ | sum
A        | 247
B        | 131
This is how we got our final output.
Well...
First you have a syntax check, followed by the generation of an expression tree. At this stage you can also test whether elements exist and "line up" (i.e., the referenced fields actually exist within the table). This is the first step; any error here and you just tell the submitter to fix the query.
Then you have... analysis. A SQL query is different from a program in that it does not say HOW to do something, just WHAT THE RESULT IS: set-based logic. So a query analyzer kicks in (ranging from bad to good depending on the product; Oracle's was poor for a long time, while DB2 had the most sensitive ones, even measuring disk speed) to decide how best to approach this result. This is a really complicated beast: it may try dozens or hundreds of approaches to find the one it believes to be fastest (cost-based, essentially driven by statistics).
Then that plan gets executed.
The query analyzer, by the way, is where you see huge differences. I'm not sure about MySQL. SQL Server (Microsoft) shines not because it has the best analyzer (though it has one of the good ones), but because it has really nice visual tools to SHOW the query plan and to compare the analyzer's estimates against the actual row counts (if they differ too much, table statistics may be off, so the analyzer THINKS a large table is small). They present that nicely, visually.
DB2 had a great optimizer for some time, measuring (as I already said) disk speed to feed into its estimates. Oracle went "left to right" (no real analysis) for a long time and relied on user-provided query hints (a poor approach). I think MySQL was VERY primitive at the start too; I'm not sure where it is now.
Table format in the database, etc.: that is really something you should not care about. It is documented (clearly, especially for an open-source database), but why should you care? I have done SQL work for nearly 15 years or so and never had that need, and that includes quite high-end work in some areas. Unless you are trying to build a database file repair tool, it makes no sense to bother.
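Since the question mentions MySQL: you can watch the analyzer's decision for any statement with EXPLAIN (a sketch; some_table and its primary key id are assumptions, and the output columns vary by MySQL version):
EXPLAIN SELECT * FROM some_table WHERE id = 42;
-- Shows the chosen access path (e.g. type=const via the PRIMARY key)
-- without executing the statement.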
The order of SQL statement clause execution:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
My answer is specific to the Oracle database, which provides tutorials pertaining to your questions. When the SQL engine processes any SQL query/statement, it first starts parsing, and within parsing it performs three checks: syntax, semantic, and shared pool. To learn how these checks work, follow the link below.
Once parsing is done, the execution plan is produced. But the database engine is smart: it checks whether this SQL statement has already been parsed (a soft parse), and if so it jumps directly to the execution plan; otherwise it dives deep and optimizes the query (a hard parse). During a hard parse, a component called the row source generator turns the optimizer's output into an iterative execution plan. See the SQL query processing stages below.
Note: before execution, it also performs bind operations for the variables' values, and once the query is executed it performs a fetch to obtain the records and store them in the result set. So in short, the order is:
PARSE -> BIND -> EXECUTE -> FETCH
And for in-depth details, this tutorial is waiting for you.
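To watch the parse/optimize stage in isolation, Oracle also lets you generate the plan without executing or fetching anything (a sketch; employees and the :dept bind variable are hypothetical):
-- Parsing and optimization happen here; no rows are executed or fetched.
EXPLAIN PLAN FOR SELECT * FROM employees WHERE department_id = :dept;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);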
This may be helpful to someone.
If you're using SSMS for SQL Server and want to know where your data files are stored, you can use this query:
SELECT
mdf.database_id,
mdf.name,
mdf.physical_name as data_file,
ldf.physical_name as log_file,
db_size = CAST((mdf.size * 8.0)/1024 AS DECIMAL(8,2)),
log_size = CAST((ldf.size * 8.0 / 1024) AS DECIMAL(8,2))
FROM (SELECT * FROM sys.master_files WHERE type_desc = 'ROWS' ) mdf
JOIN (SELECT * FROM sys.master_files WHERE type_desc = 'LOG' ) ldf
ON mdf.database_id = ldf.database_id