Optimize a Delta table used for queries that use GROUP BY - apache-spark-sql

I am working with a Delta table.
When it comes to optimising my Delta table, I have learned a few things.
Partition by: is beneficial when the column the table is partitioned on is used in a "where" condition.
i.e. in a previous step. Let's say this is table A:
df.write.partitionBy("column_1").format("delta").mode("overwrite").save("path")
I will use the Delta table like this in the future.
%sql
select
column1,
column2,
...
from TableA
where column1 = "XX"
Bloom filters: are beneficial for joins between tables. Imagine column_1 is our id column, so you will use it in this way:
%sql
CREATE BLOOMFILTER INDEX ON TABLE TableA FOR COLUMNS(column_1)
OPTIMIZE TableA ZORDER BY (column_1)
select
column1,
column2,
...
from TableA a
INNER JOIN TableB b
ON a.column_1 = b.column_1
But how can I optimise a table that will be queried with GROUP BY, where optimisation techniques such as Partition By or Bloom Filter do not apply? i.e. the table will be used as follows:
%sql
select
column1,
column2,
...
from TableA
group by (column1, column2)
Thanks in advance

OPTIMIZE ... ZORDER BY (column1, column2) may help by colocating related values in a smaller number of files, making queries over the data more efficient (see the docs).
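A minimal sketch of that command applied to the question's table (assuming column1 and column2 are the grouping columns):
%sql
OPTIMIZE TableA ZORDER BY (column1, column2)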

In addition to ZORDER as suggested, a few more things:
You can also fine-tune the file sizes.
Apply compaction (OPTIMIZE) periodically, especially if you're streaming or dealing with small batch sizes. I guess this is implicit if you're using ZORDER as the other post suggested.
The same page also has many other suggestions, including specific ones if you're trying to improve interactive query performance.
For interactive queries, also note that a "SQL Warehouse" has its Spark configuration set up for optimal interactive performance, which will be better than a general-purpose cluster.
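As a hedged sketch of the file-size tuning (the table properties below are Databricks-specific assumptions and may not be available on open-source Delta; the size value is only illustrative):
%sql
-- Aim compaction at smaller files and let writes compact automatically
ALTER TABLE TableA SET TBLPROPERTIES (
  'delta.targetFileSize' = '128mb',
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);
-- Periodic compaction plus clustering on the GROUP BY columns
OPTIMIZE TableA ZORDER BY (column1, column2);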

Related

using row_number over (partition by ...) in a view runs endlessly vs run as a query

I have been tasked to enhance a SQL view for performance improvement; its pseudo-code is shown below. It has ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...), which seems to be causing this view to run indefinitely until I kill the query.
I.e. when I run select * from view_name where Date = '2015-01-31', it runs forever. But it runs fine if I run the whole view as a query (e.g. remove the ALTER VIEW statement on top and pass the WHERE clause at the end of the code).
I am using SQL 2005. It may be that the SQL 2005 engine generates execution plans differently for views vs normal queries, because, as I mentioned, the entire code in the view, when executed as a query, runs fine. How can I make the view itself run faster so it can return the results? One of the tables that my view queries (table1 in this pseudo-code) is very large and is partitioned by date, where each month's data is one partition.
PSEUDO-CODE:
CREATE VIEW Sample
AS
WITH Dataset1
AS (
SELECT table1.DATE
,column1
,column2
,column3
,column4
FROM table1
INNER JOIN table2 ON table1.DATE = table2.DATE
)
,Dataset2
AS (
SELECT Dataset1.DATE
,column1
,column2
,column3
,column4
FROM table3
INNER JOIN Dataset1 ON table3.column1 = Dataset1.column1
)
SELECT ROW_NUMBER() OVER (
PARTITION BY column1 ORDER BY column1 ASC
) AS RowNumber
,*
FROM Dataset2
GO
My first steps towards improving this query would be:
Reducing the code complexity: why are you using two CTEs? It appears from the example code that this could be rewritten as a single query joining table 1 to 2, then 2 to 3, with the ROW_NUMBER() directly in the SELECT clause (a sketch of this rewrite follows the list). This may not affect the performance directly, but it is much easier to analyse a simple query than a complex one.
Reconsidering the intended behaviour of the ROW_NUMBER(): you are partitioning and ordering by the same column. This means that for each distinct value in column1, SQL Server will attempt to order the rows based on the values in column1; the values are all the same within that partition so the ordering is essentially "random" and any processing time devoted to this is wasted. (Depending hugely on other factors e.g. any clustered indexes on these tables.)
Retrieving the execution plan for this query and examining it for further ideas. The execution plan may include tips for indexes that can be applied - which you should consider, but don't take SQL Server's word as gospel.
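A rough sketch of that single-query rewrite (the column sources, join keys and ORDER BY column are assumptions based on the pseudo-code, so treat it as illustrative rather than a drop-in replacement):
CREATE VIEW Sample
AS
SELECT ROW_NUMBER() OVER (
           -- order by something other than the partition column (see the second point);
           -- DATE is only an assumed choice here
           PARTITION BY t3.column1 ORDER BY t1.DATE ASC
       ) AS RowNumber
      ,t1.DATE
      ,t3.column1
      ,t3.column2
      ,t3.column3
      ,t3.column4
FROM table1 t1
INNER JOIN table2 t2 ON t1.DATE = t2.DATE
INNER JOIN table3 t3 ON t3.column1 = t1.column1
GO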
I might have further suggestions if I could see an execution plan, have a bit more insight into the structures of these tables (including indexes and cardinality of the relationships), and know how big "very large" means to you :)

Optimize Oracle SELECT on large dataset

I am new to Oracle (working on 11gR2). I have a table TABLE with something like ~10 million records in it, and this pretty simple query:
SELECT t.col1, t.col2, t.col3, t.col4, t.col5, t.col6, t.col7, t.col8, t.col9, t.col10
FROM TABLE t
WHERE t.col1 = val1
AND t.col11 = val2
AND t.col12 = val3
AND t.col13 = val4
The query is currently taking about 30s/1min.
My question is: how can I improve performance? After a lot of research, I am aware of the most common ways to improve performance, but I have some problems with each:
Partitioning: not really an option, the table is used in another project and it would be too impactful. Plus it only delays the problem, given the number of rows inserted in the table every day.
Add an index: the thing is, the columns used in the WHERE clause are not the ones returned by the query (except for one). Thus, I have not been able to find an appropriate index yet. As far as I know, setting an index on 12~13 columns does not make a lot of sense (or does it?).
Materialized views: I must say I never used them, but I understood the maintenance cost is pretty high and my table is updated quite often.
I think the best way to do this would be to add an appropriate index, but I can't find the right columns on which it should be created.
An index makes sense provided that your query results in a small percentage of all rows. You would create one index on all four columns used in the WHERE clause.
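A minimal sketch of that composite index (the index name is made up; TABLE stands for the question's placeholder table name):
-- Covers all four columns used in the WHERE clause
CREATE INDEX tab_where_cols_ix ON TABLE (col1, col11, col12, col13);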
If too many records match, then a full table scan will be done. You may be able to speed this up by having this done in parallel threads using the PARALLEL hint:
SELECT /*+parallel(t,4)*/
t.col1, t.col2, t.col3, t.col4, t.col5, t.col6, t.col7, t.col8, t.col9, t.col10
FROM TABLE t
WHERE t.col1 = val1 AND t.col11 = val2 AND t.col12 = val3 AND t.col13 = val4;
A table with 10 million records is quite a small table. You just need to create an appropriate index. Which columns to pick for the index depends on their content. For example, if you have a column that contains only "1" and "0", or "yes" and "no", you shouldn't index it. The more distinct values a column contains, the more effective an index on it will be. You can also make an index on two or three (or more) columns, or a function-based index (in this case the index contains the results of your SQL function, not the column values). You can also create more than one index on a table.
And in any case, if your query selects more than 20-30% of all table records, an index will not help.
Also, you said that the table is used by many people. In this case, you need to cooperate with them to avoid duplicating indexes.
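For example, a hedged sketch of a function-based index (the index name is made up and UPPER(col1) is only an illustrative expression):
-- Useful when queries filter on an expression rather than on the raw column value
CREATE INDEX tab_upper_col1_ix ON TABLE (UPPER(col1));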
Indexes on each of the columns referenced in the WHERE clause will help performance of a query against a table with a large number of rows, where you are seeking a small subset, even if the columns in the WHERE clause are not returned in the SELECT column list.
The downside of course is that indexes impede insert/update performance. So when loading the table with large numbers of records, you might need to disable/drop the indexes prior to loading and then re-create/enable them again afterwards.
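For example, a hedged sketch in Oracle (the index name is assumed):
-- Make the index unusable before a large bulk load ...
ALTER INDEX tab_where_cols_ix UNUSABLE;
-- ... load the data, then rebuild the index afterwards
ALTER INDEX tab_where_cols_ix REBUILD;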

SQL WHERE ID IN (id1, id2, ..., idn)

I need to write a query to retrieve a big list of ids.
We do support many backends (MySQL, Firebird, SQLServer, Oracle, PostgreSQL ...) so I need to write a standard SQL.
The size of the id set could be big, the query would be generated programmatically. So, what is the best approach?
1) Writing a query using IN
SELECT * FROM TABLE WHERE ID IN (id1, id2, ..., idn)
My question here is. What happens if n is very big? Also, what about performance?
2) Writing a query using OR
SELECT * FROM TABLE WHERE ID = id1 OR ID = id2 OR ... OR ID = idn
I think that this approach does not have n limit, but what about performance if n is very big?
3) Writing a programmatic solution:
foreach (var id in myIdList)
{
var item = GetItemByQuery("SELECT * FROM TABLE WHERE ID = " + id);
myObjectList.Add(item);
}
We experienced some problems with this approach when the database server is queried over the network. Normally it is better to do one query that retrieves all results rather than making a lot of small queries. Maybe I'm wrong.
What would be a correct solution for this problem?
Option 1 is the only good solution.
Why?
Option 2 does the same, but you repeat the column name lots of times; additionally, the SQL engine doesn't immediately know that you want to check whether the value is one of the values in a fixed list. However, a good SQL engine could optimize it to perform the same as IN. There's still the readability issue though...
Option 3 is simply horrible performance-wise. It sends a query on every loop iteration and hammers the database with small queries. It also prevents the database from using any optimizations for "value is one of those in a given list".
An alternative approach might be to use another table to contain id values. This other table can then be inner joined on your TABLE to constrain returned rows. This will have the major advantage that you won't need dynamic SQL (problematic at the best of times), and you won't have an infinitely long IN clause.
You would truncate this other table, insert your large number of rows, then perhaps create an index to aid the join performance. It would also let you detach the accumulation of these rows from the retrieval of data, perhaps giving you more options to tune performance.
Update: Although you could use a temporary table, I did not mean to imply that you must or even should. A permanent table used for temporary data is a common solution with merits beyond that described here.
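A hedged sketch of that approach in portable SQL (the helper table and column names are made up; TABLE and ID are the question's placeholders):
-- A helper table holding the ids to filter by
CREATE TABLE id_filter (id INTEGER PRIMARY KEY);
-- bulk insert the ids here (batched INSERTs or the backend's bulk-load facility), then:
SELECT t.*
FROM TABLE t
INNER JOIN id_filter f ON f.id = t.ID;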
What Ed Guiness suggested is really a performance booster. I had a query like this:
select * from table where id in (id1,id2.........long list)
What I did:
DECLARE @temp TABLE (
ID int
)
INSERT INTO @temp
SELECT * FROM dbo.fnSplitter('#idlist#')
Then inner joined the temp table with the main table:
select * from table inner join @temp t on t.ID = table.id
And performance improved drastically.
First option is definitely the best option.
SELECT * FROM TABLE WHERE ID IN (id1, id2, ..., idn)
However, considering that the list of ids could be very large, say millions, you should consider chunking it as below:
Divide your list of ids into chunks of a fixed size, say 100
The chunk size should be decided based upon the memory size of your server
Suppose you have 10000 ids, you will have 10000/100 = 100 chunks
Process one chunk at a time, resulting in 100 database calls for the select
Why should you divide into chunks?
You will never get a memory overflow exception, which is very common in scenarios like yours.
You will have an optimized number of database calls, resulting in better performance.
It has always worked like a charm for me. Hope it works for my fellow developers as well :)
Doing the SELECT * FROM MyTable where id in () command on an Azure SQL table with 500 million records resulted in a wait time of > 7min!
Doing this instead returned results immediately:
select b.id, a.* from MyTable a
join (values (250000), (2500001), (2600000)) as b(id)
ON a.id = b.id
Use a join.
In most database systems, IN (val1, val2, …) and a series of OR are optimized to the same plan.
The third way would be to import the list of values into a temporary table and join against it, which is more efficient in most systems if there are lots of values.
You may want to read this article:
Passing parameters in MySQL: IN list vs. temporary table
I think you mean SQL Server, but note that on Oracle there is a hard limit on how many IN elements you can specify: 1000.
Sample 3 would be the worst performer out of them all because you are hitting up the database countless times for no apparent reason.
Loading the data into a temp table and then joining on that would be by far the fastest. After that the IN should work slightly faster than the group of ORs.
For the 1st option:
Add the IDs into a temp table and inner join it with the main table.
CREATE TABLE #temp (id int)
INSERT INTO #temp (id)
SELECT t.column1 FROM (VALUES (1),(2),(3),...(10000)) AS t(column1)
SELECT * FROM TABLE t INNER JOIN #temp ON #temp.id = t.ID
Try this
SELECT Position_ID , Position_Name
FROM
position
WHERE Position_ID IN (6 ,7 ,8)
ORDER BY Position_Name

Indexing Null Values in PostgreSQL

I have a query of the form:
select m.id from mytable m
left outer join othertable o on o.m_id = m.id
and o.col1 is not null and o.col2 is not null and o.col3 is not null
where o.id is null
The query returns a few hundred records, although the tables have millions of rows, and it takes forever to run (around an hour).
When I check my index statistics using:
select * from pg_stat_all_indexes
where schemaname <> 'pg_catalog' and (indexrelname like 'othertable_%' or indexrelname like 'mytable_%')
I see that only the index for othertable.m_id is being used, and that the indexes for col1..3 are not being used at all. Why is this?
I've read in a few places that PG has traditionally not been able to index NULL values. However, I've read this has supposedly changed since PG 8.3? I'm currently using PostgreSQL 8.4 on Ubuntu 10.04. Do I need to make a "partial" or "functional" index specifically to speed up IS NOT NULL queries, or is it already indexing NULLs and I'm just misunderstanding the problem?
You could try a partial index:
CREATE INDEX idx_partial ON othertable (m_id)
WHERE (col1 is not null and col2 is not null and col3 is not null);
From the docs: http://www.postgresql.org/docs/current/interactive/indexes-partial.html
Partial indexes aren't going to help you here as they'll only find the records you don't want. You want to create an index that contains the records you do want.
CREATE INDEX findDaNulls ON othertable ((COALESCE(col1,col2,col3,'Empty')))
WHERE col1 IS NULL AND col2 IS NULL AND col3 IS NULL;
SELECT *
FROM mytable m
JOIN othertable o ON m.id = o.m_id
WHERE COALESCE(col1,col2,col3,'Empty') = 'Empty';
BTW, searching for NULLs via a LEFT JOIN generally isn't as fast as using EXISTS or NOT EXISTS in Postgres.
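As a hedged sketch, the question's query rewritten with NOT EXISTS (assuming o.id is never NULL for matched rows, i.e. it is the primary key):
SELECT m.id
FROM mytable m
WHERE NOT EXISTS (
    SELECT 1
    FROM othertable o
    WHERE o.m_id = m.id
      AND o.col1 IS NOT NULL
      AND o.col2 IS NOT NULL
      AND o.col3 IS NOT NULL
);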
A single index on m_id, col1, col2 and col3 would be my first thought for this query.
And use EXPLAIN on this query to see how it is executed and what takes so much time. You could show us the results so we can help you out.
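A hedged sketch of that combined index (the index name is made up):
CREATE INDEX othertable_m_id_cols_idx ON othertable (m_id, col1, col2, col3);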
A partial index seems the right way here:
"If you have a table that contains both billed and unbilled orders, where the unbilled orders take up a small fraction of the total table and yet those are the most-accessed rows, you can improve performance by creating an index on just the unbilled rows."
Perhaps those nullable columns (col1, col2, col3) act in your scenario as some kind of flag to distinguish a subclass of records in your table (for example, some sort of "logical deletion")? In that case, besides the partial index solution, you might prefer to rethink your design and put them in different physical tables (perhaps using inheritance): one for the "live records", another for the "historical records", and access the full set (only when needed) through a view.
Did you try to create a combined index on othertable(m_id, col1, col2, col3)?
You should also check the execution plan (using EXPLAIN) rather than checking the system tables for the index usage.
PostgreSQL 9.0 (currently in beta) will be able to use an index for an IS NULL condition. That feature got postponed.

Performance of updating one big table using values from one small table

First, I know that the sql statement to update table_a using values from table_b is in the form of:
Oracle:
UPDATE table_a
SET (col1, col2) = (SELECT cola, colb
FROM table_b
WHERE table_a.key = table_b.key)
WHERE EXISTS (SELECT *
FROM table_b
WHERE table_a.key = table_b.key)
MySQL:
UPDATE table_a
INNER JOIN table_b ON table_a.key = table_b.key
SET table_a.col1 = table_b.cola,
table_a.col2 = table_b.colb
What I understand is the database engine will go through records in table_a and update them with values from matching records in table_b.
So, if I have 10 millions records in table_a and only 10 records in table_b:
Does that mean that the engine will do 10 millions iterations through table_a just to update 10 records? Are Oracle/MySQL/etc smart enough to do only 10 iterations through table_b?
Is there a way to force the engine to actually iterate through records in table_b instead of table_a to do the update? Is there an alternative syntax for the sql statement?
Assume that table_a.key and table_b.key are indexed.
Either engine should be smart enough to optimize the query based on the fact that there are only ten rows in table_b. How the engine determines what to do is based on factors like indexes and statistics.
If the "key" column is the primary key and/or is indexed, the engine will have to do very little work to run this query. It will basically already sort of "know" where the matching rows are, and look them up very quickly. It won't have to "iterate" at all.
If there is no index on the key column, the engine will have to do a "table scan" (roughly the equivalent of "iterate") to find the right values and match them up. This means it will have to scan through 10 million rows.
Do a little reading on what's called an Execution Plan. This is basically an explanation of what work the engine had to do in order to run your query (some databases show it in text only, some have the option of seeing it graphically). Learning how to interpret an Execution Plan will give you great insight into adding indexes to your tables and optimizing your queries.
Look these up if they don't work (it's been a while), but it's something like the following (a quick sketch follows the list):
In MySQL, put the word "EXPLAIN" in front of your SELECT statement
In Oracle, run "SET AUTOTRACE ON" before you run your SELECT statement
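A hedged sketch against the MySQL form of the update from the question (EXPLAIN on an UPDATE requires a reasonably recent MySQL; on older versions, rewrite it as the equivalent SELECT first):
EXPLAIN
UPDATE table_a
INNER JOIN table_b ON table_a.key = table_b.key
SET table_a.col1 = table_b.cola,
    table_a.col2 = table_b.colb;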
I think the first (Oracle) query would be better written with a JOIN instead of a WHERE EXISTS. The engine may be smart enough to optimize it properly either way. Once you get the hang of interpreting an execution plan, you can run it both ways and see for yourself. :)
Okay, I know answering your own question is usually frowned upon, but I already accepted another answer and won't unaccept it, so, meh, here it is...
I've discovered a much better alternative that I'd like to share with anyone who encounters the same scenario: the MERGE statement.
Apparently, newer Oracle versions introduced this MERGE statement, which simply blows the UPDATE approach away! Not only is the performance so much better in most cases, the syntax is so simple and makes so much sense that I feel stupid for using the UPDATE statement! Here it comes...
MERGE INTO table_a
USING table_b
ON (table_a.key = table_b.key)
WHEN MATCHED THEN UPDATE SET
table_a.col1 = table_b.cola,
table_a.col2 = table_b.colb;
And what's more, I can also extend the statement to include an INSERT action when table_a does not have matching records for some records in table_b:
MERGE INTO table_a
USING table_b
ON (table_a.key = table_b.key)
WHEN MATCHED THEN UPDATE SET
table_a.col1 = table_b.cola,
table_a.col2 = table_b.colb
WHEN NOT MATCHED THEN INSERT
(key, col1, col2)
VALUES (table_b.key, table_b.cola, table_b.colb);
This new statement type made my day the day I discovered it :)