Does spark-sql query plan indicate which table partitions are used? - apache-spark-sql

By looking at spark-sql plans, is there a way to tell whether a particular table (Hive/Iceberg) partition is being used?
For example, we have a table that has 3 partitions, let's say A=A_VAL, B=B_VAL, C=C_VAL. By looking at the plan, is there a way to tell whether
the partitions are used fully (all 3 partitions used)
the partitions are used only partially (maybe only 1 or 2 of the partitions are used, for example partition A is used but not B or C)
If spark-sql plans do not provide this information, is there any other way I can get it?

You can use the code below to print the logical and physical plans.
from pyspark.sql import SparkSession
# reuse the active session, or create one
spark = SparkSession.builder.getOrCreate()
# create a DataFrame from your SQL
df = spark.sql("SELECT field1 AS f1, field2 AS f2 FROM table1")
# explain() without an argument prints only the physical plan; explain(True) also prints the logical plans
df.explain(True)
== Parsed Logical Plan ==
...
== Analyzed Logical Plan ==
...
== Optimized Logical Plan ==
...
== Physical Plan ==
...
EDIT: I ran explain for mytable and posted an excerpt below. It shows that Hive reads only a few partitions (folders) and does not go through all of them. You should be able to see similar output.
Here the table is partitioned on part_col.
The query used to generate this output: explain extended select * from mytab where part_col in (10,50).
Sorry, I do not have Spark installed, so I can't test it there.
Path -> Alias:
  hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=10.0 [tmp]
  hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=50.0 [tmp]
Path -> Partition:
  hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=10.0
    Partition
      base file name: part_col=10.0
      input format: org.apache.hadoop.mapred.TextInputFormat
      ...
  hdfs://namenode:8020/user/hive/warehouse/tmp/part_col=50.0
    Partition
      base file name: part_col=50.0
      input format: org.apache.hadoop.mapred.TextInputFormat
      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      partition values:
        college_marks 50.0
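In Spark's own physical plan, file-based scans print a PartitionFilters: [...] entry, and the partition columns that appear inside it are the ones Spark prunes on. The following is only a minimal sketch, assuming the plan text has already been captured as a string. The sample plan is hand-written for illustration (not real output), and partition_filters / partition_columns_used are hypothetical helper names. Iceberg and Hive SerDe scans print their filters differently, so this covers the file-source case only.

```python
import re

MARKER = "PartitionFilters: ["


def partition_filters(plan_text):
    """Return the text inside every 'PartitionFilters: [...]' entry of a plan."""
    found = []
    start = plan_text.find(MARKER)
    while start != -1:
        begin = start + len(MARKER)
        depth, pos = 1, begin
        # Walk forward until the opening bracket is balanced again.
        while pos < len(plan_text) and depth > 0:
            if plan_text[pos] == "[":
                depth += 1
            elif plan_text[pos] == "]":
                depth -= 1
            pos += 1
        if depth != 0:
            break  # truncated plan text: stop rather than guess
        found.append(plan_text[begin:pos - 1])
        start = plan_text.find(MARKER, pos)
    return found


def partition_columns_used(plan_text, partition_cols):
    """Return the partition columns referenced in any PartitionFilters entry."""
    filters = " ".join(partition_filters(plan_text))
    # Spark prints attribute references as name#id, for example A#11.
    return [
        col for col in partition_cols
        if re.search(r"\b" + re.escape(col) + r"#\d+", filters, re.IGNORECASE)
    ]


# Hand-written sample shaped like a Spark file-scan line (an assumption, not real output).
SAMPLE_PLAN = (
    "== Physical Plan ==\n"
    "FileScan parquet db.t[x#10,A#11,B#12,C#13] Batched: true, DataFilters: [], "
    "Format: Parquet, PartitionFilters: [isnotnull(A#11), (A#11 = A_VAL)], "
    "PushedFilters: [], ReadSchema: struct<x:int>\n"
)

# Only A shows up in the partition filters, so B and C are not pruned on.
print(partition_columns_used(SAMPLE_PLAN, ["A", "B", "C"]))
```

An empty PartitionFilters: [] entry would suggest that no partition pruning is applied to that scan.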

Related

Does hive have any api to check a sql but not execute it?

I am looking for an API that checks a submitted SQL statement without executing it. When my users submit SQL, I need to tell them in real time whether it is correct; then I need to save the SQL and execute it in the future. Does the Hive API have such a feature?
EXPLAIN <query to be checked> would meet this requirement.
Hive provides an EXPLAIN command that shows the execution plan for a query. When a query is run with EXPLAIN at the start of it, the query is first checked for syntax errors and then the execution plan is returned as the result. This way, users can check whether the query they have written is correct and, from the execution plan, whether it is efficient. (The query itself is not executed.)
To read more about EXPLAIN, you can refer HERE
Sample output of a query with EXPLAIN:
EXPLAIN SELECT * FROM test_table;
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: test_table
          Statistics: Num rows: 1 Data size: 15812 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: int), name (type: string), email_preferences (type: struct<email_format:string,frequency:string,categories:struct<promos:boolean,surveys:boolean>>), addresses (type: map<string,struct<street_1:string,street_2:string,city:string,state:string,zip_code:string>>), orders (type: array<struct<order_id:string,order_date:string,items:array<struct<product_id:int,sku:string,name:string,price:double,qty:int>>>>)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4
            Statistics: Num rows: 1 Data size: 15812 Basic stats: COMPLETE Column stats: NONE
            ListSink
Hope it helps!
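The same "prefix the statement with EXPLAIN and see whether it compiles" idea can be tried locally. The sketch below uses Python's built-in sqlite3 purely as a stand-in engine (an assumption; it is not Hive, and check_sql is a hypothetical helper). With Hive you would send EXPLAIN <query> through whichever client you already use and treat an error response as "invalid".

```python
import sqlite3


def check_sql(conn, sql):
    """Compile `sql` through EXPLAIN without running it; return (ok, error_message)."""
    try:
        conn.execute("EXPLAIN " + sql).fetchall()
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (id INTEGER, name TEXT)")
conn.execute("INSERT INTO test_table VALUES (1, 'a')")

print(check_sql(conn, "SELECT * FROM test_table"))  # valid statement
print(check_sql(conn, "SELEC * FROM test_table"))   # syntax error is reported
print(check_sql(conn, "SELECT * FROM missing"))     # unknown table is reported
print(check_sql(conn, "DELETE FROM test_table"))    # valid, and nothing is deleted

# The row inserted above is still there, because EXPLAIN only compiles the statement.
print(conn.execute("SELECT COUNT(*) FROM test_table").fetchone()[0])
```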

Execution time select * vs select count(*)

I'm trying to measure execution time of a query, but I have a feeling that my results are wrong.
Before every query I execute: sync; echo 3 > /proc/sys/vm/drop_caches
My server log file results are:
2014-02-08 14:28:30 EET LOG: duration: 32466.103 ms statement: select * from partsupp
2014-02-08 14:32:48 EET LOG: duration: 9785.503 ms statement: select count(*) from partsupp
Shouldn't select count(*) take more time to execute since it makes more operations?
To output all the results from select * I need 4 minutes (not 32 seconds, as indicated by server log). I understand that the client has to output a lot of data and it will be slow, but what about the server's log? Does it count output operations too?
I also used explain analyze and the results are (as expected):
select *: Total runtime: 13254.733 ms
select count(*): Total runtime: 13463.294 ms
I have run it many times and the results are similar.
What exactly does the log measure?
Why is there such a big difference for the select * query between explain analyze and the server's log, although it doesn't count I/O operations?
What is the difference between log measurement and explain analyze?
I have a dedicated server with Ubuntu 12.04 and PostgreSQL 9.1
Thank you!
Any aggregate function has some small overhead, but on the other hand SELECT * sends a lot of data to the client, depending on the number and size of the columns.
The log measurement is the total query time. It can be similar to EXPLAIN ANALYZE, but often it is significantly faster, because EXPLAIN ANALYZE collects the execution time (and execution statistics) for every node of the execution plan, and that is usually a significant overhead. On the other hand, EXPLAIN ANALYZE has no overhead from transporting data from the server to the client.
The first query asks for all rows in a table. Therefore, the entire table must be read.
The second query only asks for how many rows there are. The database can answer this by reading the entire table, but it can also answer it by reading any index it has for that table. Since the index is smaller than the table, doing that would be faster. In practice, nearly all tables have indexes (because a primary key constraint creates an index, too). In PostgreSQL this needs an index-only scan, which exists only from version 9.2 onwards, so on 9.1 count(*) still has to read the table.
select * = select all data, all columns included
select count(*) = count how many rows there are
For example, take this table:
name | id | address
---------------------
s | 12 | abc
x | 14 | cc
y | 15 | vv
z | 16 | ll
select * will display the whole table
select count(*) will display the total number of rows = 4
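The difference in how much data travels to the client can be seen with a small sketch. It uses Python's built-in sqlite3 and a hypothetical table name, people, filled with the four example rows; both are assumptions for illustration, not PostgreSQL itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, id INTEGER, address TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("s", 12, "abc"), ("x", 14, "cc"), ("y", 15, "vv"), ("z", 16, "ll")],
)

# SELECT * hands every row and every column to the client.
all_rows = conn.execute("SELECT * FROM people").fetchall()
# SELECT count(*) hands back exactly one row with one value.
count_rows = conn.execute("SELECT count(*) FROM people").fetchall()

print(len(all_rows), "rows returned by SELECT *")
print(len(count_rows), "row returned by SELECT count(*):", count_rows[0][0])
```

On a large table the first result grows with the table, while the second stays a single row, which is why client-side timings for select * can be far higher than the server-side duration.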

Optimising Postgresql Queries

I have two tables in my PostgreSQL database that I need to query. Table 1 has about 140 million records and table 2 has around 50 million records.
Table 1 has the following structure:
tr_id bigint NOT NULL, # this is the primary key
query_id numeric(20,0), # indexed column
descrip_id numeric(20,0) # indexed column
and table 2 has the following structure
query_pk bigint # this is the primary key
query_id numeric(20,0) # indexed column
query_token numeric(20,0)
The sample db of table1 would be
1 25 96
2 28 97
3 27 98
4 26 99
The sample db of table2 would be
1 25 9554
2 25 9456
3 25 9785
4 25 9514
5 26 7412
6 26 7433
7 27 545
8 27 5789
9 27 1566
10 28 122
11 28 1456
I would prefer queries that work on blocks of tr_id, in ranges of 10,000, as this is my requirement.
I would like to get the output in the following form
25 {9554,9456,9785,9514}
26 {7412,7433}
27 {545,5789,1566}
28 {122,1456}
I tried in the following manner
select query_id,
array_agg(query_token)
from sch.table2
where query_id in (select query_id
from sch.table1
where tr_id between 90001 and 100000)
group by query_id
This query takes about 121346 ms, and when some 4 such queries are fired at once it takes even longer. Can you please help me optimise it?
I have a machine which runs on windows 7 with i7 2nd gen proc with 8GB of RAM.
The following is my postgresql configuration
shared_buffers = 1GB
effective_cache_size = 5000MB
work_mem = 2000MB
What should I do to optimise it.
Thanks
EDIT: it would be great if the results were ordered in the following way
25 {9554,9456,9785,9514}
28 {122,1456}
27 {545,5789,1566}
26 {7412,7433}
i.e. according to the order in which the query_id values appear in table1 when ordered by tr_id. If this is computationally expensive, I could do the ordering in the client code instead, but I am not sure how efficient that would be.
Thanks
Query
I expect a JOIN to be much faster than the IN condition you have at present:
SELECT t2.query_id
,array_agg(t2.query_token) AS tokens
FROM t1
JOIN t2 USING (query_id)
WHERE t1.tr_id BETWEEN 1 AND 10000
GROUP BY t1.tr_id, t2.query_id
ORDER BY t1.tr_id;
This also sorts the results as requested. query_token remains unsorted per query_id.
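To check the shape and ordering of the result on the sample data from the question, here is a small sketch with Python's built-in sqlite3. It is only an approximation of the PostgreSQL query: SQLite has no array_agg, so group_concat stands in for it, and the tokens are sorted in Python only to make the printed output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (tr_id INTEGER PRIMARY KEY, query_id INTEGER, descrip_id INTEGER);
    CREATE TABLE t2 (query_pk INTEGER PRIMARY KEY, query_id INTEGER, query_token INTEGER);
    INSERT INTO t1 VALUES (1,25,96),(2,28,97),(3,27,98),(4,26,99);
    INSERT INTO t2 VALUES (1,25,9554),(2,25,9456),(3,25,9785),(4,25,9514),
                          (5,26,7412),(6,26,7433),(7,27,545),(8,27,5789),
                          (9,27,1566),(10,28,122),(11,28,1456);
""")

# Same structure as the JOIN query above; group_concat replaces array_agg.
rows = conn.execute("""
    SELECT t2.query_id, group_concat(t2.query_token) AS tokens
    FROM t1
    JOIN t2 ON t2.query_id = t1.query_id
    WHERE t1.tr_id BETWEEN 1 AND 10000
    GROUP BY t1.tr_id, t2.query_id
    ORDER BY t1.tr_id
""").fetchall()

# group_concat returns a comma-separated string; turn it into a sorted list of ints.
result = [(qid, sorted(int(tok) for tok in tokens.split(","))) for qid, tokens in rows]
for qid, tokens in result:
    print(qid, tokens)
```

The query_id values come out as 25, 28, 27, 26, which is the order of tr_id in table 1.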
Indexes
Obviously you need indexes on t1.tr_id and t2.query_id.
You obviously have that one already:
CREATE INDEX t2_query_id_idx ON t2 (query_id);
A multicolumn index on t1 may improve performance (you'll have to test):
CREATE INDEX t1_tr_id_query_id_idx ON t1 (tr_id, query_id);
Server configuration
If this is a dedicated database server, you can raise the setting for effective_cache_size some more.
@Frank already gave advice on work_mem. I quote the manual:
Note that for a complex query, several sort or hash operations might
be running in parallel; each operation will be allowed to use as much
memory as this value specifies before it starts to write data into
temporary files. Also, several running sessions could be doing such
operations concurrently. Therefore, the total memory used could be
many times the value of work_mem;
It should be just big enough to be able to sort your queries in RAM. 10 MB is more than plenty to hold 10000 of your rows at a time. Set it higher, if you have queries that need more at a time.
With 8 GB on a dedicated database server, I would be tempted to set shared_buffers to at least 2 GB.
shared_buffers = 2GB
effective_cache_size = 7000MB
work_mem = 10MB
More advice on performance tuning in the Postgres Wiki.

How can I speed up queries that are looking for the root node of a transitive closure?

I have a historical transitive closure table that represents a tree.
create table TRANSITIVE_CLOSURE
(
CHILD_NODE_ID number not null enable,
ANCESTOR_NODE_ID number not null enable,
DISTANCE number not null enable,
FROM_DATE date not null enable,
TO_DATE date not null enable,
constraint TRANSITIVE_CLOSURE_PK unique (CHILD_NODE_ID, ANCESTOR_NODE_ID, DISTANCE, FROM_DATE, TO_DATE)
);
Here's some sample data:
CHILD_NODE_ID | ANCESTOR_NODE_ID | DISTANCE
--------------------------------------------
1 | 1 | 0
2 | 1 | 1
2 | 2 | 0
3 | 1 | 2
3 | 2 | 1
3 | 3 | 0
Unfortunately, my current query for finding the root node causes a full table scan:
select *
from transitive_closure tc
where
distance = 0
and not exists (
select null
from transitive_closure tci
where tc.child_node_id = tci.child_node_id
and tci.distance <> 0
);
On the surface, it doesn't look too expensive, but as I approach 1 million rows, this particular query is starting to get nasty... especially when it's part of a view that grabs the adjacency tree for legacy support.
Is there a better way to find the root node of a transitive closure? I would like to rewrite all of our old legacy code, but I can't... so I need to build the adjacency list somehow. Getting everything except the root node is easy, so is there a better way? Am I thinking about this problem the wrong way?
Query plan on a table with 800k rows.
OPERATION OBJECT_NAME OPTIONS COST
SELECT STATEMENT 2301
HASH JOIN RIGHT ANTI 2301
Access Predicates
TC.CHILD_NODE_ID=TCI.CHILD_NODE_ID
TABLE ACCESS TRANSITIVE_CLOSURE FULL 961
Filter Predicates
TCI.DISTANCE = 1
TABLE ACCESS TRANSITIVE_CLOSURE FULL 962
Filter Predicates
DISTANCE=0
How long does the query take to execute, and how long do you want it to take? (You usually do not want to use the cost for tuning. Very few people know what the explain plan cost really means.)
On my slow desktop the query only took 1.5 seconds for 800K rows. And then 0.5 seconds after the data was in memory. Are you getting something significantly worse, or will this query be run very frequently?
I don't know what your data looks like, but I'd guess that a full table scan will always be best for this query. Assuming that your hierarchical data is relatively shallow, i.e. there are many distances of 0 and 1 but very few distances of 100, the most important column will not be very distinct. This means that any of the index entries for distance will point to a large number of blocks. It will be much cheaper to read the whole table at once using multi-block reads than to read a large amount of it one block at a time.
Also, what do you mean by historical? Can you store the results of this query in a materialized view?
Another possible idea is to use analytic functions. This replaces the second table scan with a sort. This approach is usually faster, but for me this query actually takes longer, 5.5 seconds instead of 1.5. But maybe it will do better in your environment.
select * from
(
select
max(case when distance <> 0 then 1 else 0 end)
over (partition by child_node_id) has_non_zero_distance
,transitive_closure.*
from transitive_closure
)
where distance = 0
and has_non_zero_distance = 0;
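Before comparing timings, it is worth confirming that a rewrite returns the same roots as the original query. A minimal sketch on the sample data, using Python's built-in sqlite3 as a stand-in for Oracle (an assumption) and leaving out the date columns for brevity; find_roots is a hypothetical helper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table transitive_closure ("
    "child_node_id integer not null, "
    "ancestor_node_id integer not null, "
    "distance integer not null)"
)
conn.executemany(
    "insert into transitive_closure values (?, ?, ?)",
    [(1, 1, 0), (2, 1, 1), (2, 2, 0), (3, 1, 2), (3, 2, 1), (3, 3, 0)],
)

# The original anti-join: nodes that only have their own distance-0 row.
NOT_EXISTS_SQL = """
    select tc.child_node_id
    from transitive_closure tc
    where tc.distance = 0
      and not exists (
          select null
          from transitive_closure tci
          where tci.child_node_id = tc.child_node_id
            and tci.distance <> 0)
"""

# The analytic-function rewrite: one pass plus a per-node window aggregate.
WINDOW_SQL = """
    select child_node_id from (
        select child_node_id, distance,
               max(case when distance <> 0 then 1 else 0 end)
                   over (partition by child_node_id) as has_non_zero_distance
        from transitive_closure
    )
    where distance = 0
      and has_non_zero_distance = 0
"""


def find_roots(sql):
    """Run a root-finding query and return the sorted root node ids."""
    return sorted(row[0] for row in conn.execute(sql))


print(find_roots(NOT_EXISTS_SQL))
if sqlite3.sqlite_version_info >= (3, 25, 0):  # window functions need SQLite 3.25+
    print(find_roots(WINDOW_SQL))
```

Both forms return node 1 for the sample rows, so any difference between them on the real table is one of cost, not of result.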
Can you try adding an index on distance and child_node_id, or changing the order of these columns in the existing unique index? I think it should then be possible for the outer query to access the table through the index on distance, while the inner query needs access only to the index.
Add ONE root node from which all your current root nodes are descended. Then you would simply query the children of your one root. Problem solved.

how does a SQL query work?

How does a SQL query work?
How does it get compiled?
Is the from clause compiled first to see if the table exists?
How does it actually retrieve data from the database?
How and in what format are the tables stored in a database?
I am using phpmyadmin, is there any way I can peek into the files where data is stored?
I am using MySQL
SQL execution order:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> ORDER BY -> LIMIT
A SQL query mainly works in three phases.
1) Row filtering, phase 1: done by the FROM, WHERE, GROUP BY and HAVING clauses.
2) Column filtering: columns are filtered by the SELECT clause.
3) Row filtering, phase 2: done by the DISTINCT, ORDER BY and LIMIT clauses.
I will explain this with an example. Suppose we have a students table as follows:
id_ | name_ | marks | section_
--------------------------------
1 | Julia | 88 | A
2 | Samantha | 68 | B
3 | Maria | 10 | C
4 | Scarlet | 78 | A
5 | Ashley | 63 | B
6 | Abir | 95 | D
7 | Jane | 81 | A
8 | Jahid | 25 | C
9 | Sohel | 90 | D
10 | Rahim | 80 | A
11 | Karim | 81 | B
12 | Abdullah | 92 | D
Now we run the following sql query:
select section_,sum(marks) from students where id_<10 GROUP BY section_ having sum(marks)>100 order by section_ LIMIT 2;
Output of the query is:
section_ | sum
----------------
A | 247
B | 131
But how did we get this output?
The query is explained step by step below.
1. FROM, WHERE clause execution
The FROM clause works first, so from students where id_<10 eliminates the rows whose id_ is greater than or equal to 10. The following rows remain after executing from students where id_<10:
id_ | name_ | marks | section_
--------------------------------
1 | Julia | 88 | A
2 | Samantha | 68 | B
3 | Maria | 10 | C
4 | Scarlet | 78 | A
5 | Ashley | 63 | B
6 | Abir | 95 | D
7 | Jane | 81 | A
8 | Jahid | 25 | C
9 | Sohel | 90 | D
2. GROUP BY clause execution
Next comes the GROUP BY clause. After executing GROUP BY section_, the rows are grouped as below:
id_ | name_ | marks | section_
--------------------------------
9 | Sohel | 90 | D
6 | Abir | 95 | D
1 | Julia | 88 | A
4 | Scarlet | 78 | A
7 | Jane | 81 | A
2 | Samantha | 68 | B
5 | Ashley | 63 | B
3 | Maria | 10 | C
8 | Jahid | 25 | C
3. HAVING clause execution
having sum(marks)>100 eliminates groups. sum(marks) of group D is 185, of group A is 247, of group B is 131, and of group C is 35. Group C's sum is not greater than 100, so group C is eliminated. The table now looks like this:
id_ | name_ | marks | section_
--------------------------------
9 | Sohel | 90 | D
6 | Abir | 95 | D
1 | Julia | 88 | A
4 | Scarlet | 78 | A
7 | Jane | 81 | A
2 | Samantha | 68 | B
5 | Ashley | 63 | B
4. SELECT clause execution
select section_,sum(marks) only decides which columns to print. Here it prints the section_ and sum(marks) columns.
section_ | sum
----------------
D | 185
A | 247
B | 131
5. ORDER BY clause execution
order by section_ sorts the rows in ascending order.
section_ | sum
----------------
A | 247
B | 131
D | 185
6. LIMIT clause execution
LIMIT 2; returns only the first 2 rows.
section_ | sum
----------------
A | 247
B | 131
This is how we got our final output.
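The six steps can also be replayed in plain Python, one statement per clause, and cross-checked against a real engine. This is only a sketch of the logical order; an actual database is free to execute the clauses differently as long as the result is the same. The built-in sqlite3 module is used here as a stand-in for MySQL (an assumption).

```python
import sqlite3
from collections import defaultdict

STUDENTS = [
    (1, "Julia", 88, "A"), (2, "Samantha", 68, "B"), (3, "Maria", 10, "C"),
    (4, "Scarlet", 78, "A"), (5, "Ashley", 63, "B"), (6, "Abir", 95, "D"),
    (7, "Jane", 81, "A"), (8, "Jahid", 25, "C"), (9, "Sohel", 90, "D"),
    (10, "Rahim", 80, "A"), (11, "Karim", 81, "B"), (12, "Abdullah", 92, "D"),
]

# 1. FROM students WHERE id_ < 10
rows = [r for r in STUDENTS if r[0] < 10]

# 2. GROUP BY section_
groups = defaultdict(list)
for _id, _name, marks, section in rows:
    groups[section].append(marks)

# 3. HAVING sum(marks) > 100
kept = {section: marks for section, marks in groups.items() if sum(marks) > 100}

# 4. SELECT section_, sum(marks)
selected = [(section, sum(marks)) for section, marks in kept.items()]

# 5. ORDER BY section_
ordered = sorted(selected, key=lambda pair: pair[0])

# 6. LIMIT 2
result = ordered[:2]
print(result)

# Cross-check: run the original query in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id_ INTEGER, name_ TEXT, marks INTEGER, section_ TEXT)"
)
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?)", STUDENTS)
sql_result = conn.execute(
    "select section_, sum(marks) from students where id_ < 10 "
    "group by section_ having sum(marks) > 100 order by section_ limit 2"
).fetchall()
print(sql_result == result)
```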
Well...
First you have a syntax check, followed by the generation of an expression tree. At this stage you can also test whether elements exist and "line up" (i.e. fields do exist WITHIN the table). This is the first step; any error here and you just tell the submitter to get real.
Then you have... analysis. A SQL query is different from a program in that it does not say HOW to do something, just WHAT THE RESULT IS. Set-based logic. So a query analyzer comes in to decide how best to approach this result (depending on the product these range from bad to good: Oracle had crappy ones for a long time, DB2 the most sensitive ones, even measuring disc speed). This is a really complicated beast; it may try dozens or hundreds of approaches to find the one it believes to be fastest (cost based, basically some statistics).
Then that gets executed.
The query analyzer, by the way, is where you see huge differences. Not sure about MySQL. SQL Server (Microsoft) shines not because it has the best one (though it is one of the good ones), but because it has really nice visual tools to SHOW the query plan and compare the analyzer's estimates to the real needs (if they differ too much, table statistics may be off, so the analyzer THINKS a large table is small). They present that nicely visually.
DB2 had a great optimizer for some time, measuring (as I already said) disc speed to put it into its estimates. Oracle went "left to right" (no real analysis) for a long time, and took user-provided query hints (crap approach). I think MySQL was VERY primitive too at the start; not sure where it is now.
Table format in the database etc.: that is really something you should not care about. It is documented (clearly, especially for an open source database), but why should you care? I have done SQL work for nearly 15 years or so and never had that need. And that includes doing quite high-end work in some areas. Unless you are trying to build a database file repair tool... it makes no sense to bother.
The order of SQL statement clause execution:
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
My answer is specific to Oracle Database, which provides tutorials pertaining to your questions. When the database engine processes a SQL statement, it first parses it, and during parsing it performs three checks: syntax, semantic and shared pool. To learn how these checks work, follow the link below.
Once parsing is done, the engine moves on to the execution plan. It first checks whether this SQL statement has already been parsed (soft parse); if so, it goes straight to the existing execution plan, otherwise it digs in and optimizes the query (hard parse). During a hard parse it also uses a piece of software called the row source generator, which takes the plan received from the optimizer and produces an iterative execution plan. See the SQL query processing stages below.
Note: before execution, the engine also performs bind operations for the values of variables, and once the query is executed it performs a fetch to obtain the records and store them in the result set. So in short, the order is:
PARSE -> BIND -> EXECUTE -> FETCH
And for in depth details, this tutorial is waiting for you.
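The bind and fetch stages are visible in most client APIs. The sketch below uses Python's built-in sqlite3 rather than Oracle (an assumption made so that it runs anywhere), with a hypothetical emp table. The statement text contains a ? placeholder, a value is bound on each execute call, and the rows are then fetched. Reusing the same statement text lets the driver reuse its cached prepared statement, which is loosely comparable to a soft parse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# Statement text with one bind placeholder; the text is identical on every call.
sql = "SELECT name FROM emp WHERE id = ?"

results = []
cur = conn.cursor()
for emp_id in (1, 2):
    cur.execute(sql, (emp_id,))      # parse (or reuse), bind emp_id, execute
    results.append(cur.fetchall())   # fetch the rows of the result set

print(results)
```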
This may be helpful to someone.
If you're using SSMS for Sql Server and want to know where your data files are stored, you can use this query
SELECT
mdf.database_id,
mdf.name,
mdf.physical_name as data_file,
ldf.physical_name as log_file,
db_size = CAST((mdf.size * 8.0)/1024 AS DECIMAL(8,2)),
log_size = CAST((ldf.size * 8.0 / 1024) AS DECIMAL(8,2))
FROM (SELECT * FROM sys.master_files WHERE type_desc = 'ROWS' ) mdf
JOIN (SELECT * FROM sys.master_files WHERE type_desc = 'LOG' ) ldf
ON mdf.database_id = ldf.database_id
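The size * 8.0 / 1024 expressions work because sys.master_files reports size as a number of 8 KB pages. A small sketch of the same arithmetic, with pages_to_mb as a hypothetical helper name:

```python
PAGE_KB = 8  # SQL Server data and log files are counted in 8 KB pages


def pages_to_mb(pages):
    """Convert a page count, as reported in sys.master_files.size, to megabytes."""
    return round(pages * PAGE_KB / 1024, 2)


# 1024 pages * 8 KB = 8192 KB, which is 8 MB.
print(pages_to_mb(1024))
```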