How to improve SQL query in Spark when updating table? ('NOT IN' in subquery) - sql

I have a DataFrame in Spark which is registered as a table called A and has 1 billion records and 10 columns. The first column (ID) is the primary key.
I also have another DataFrame, registered as a table called B, with 10,000 records and 10 columns (the same columns as table A; again the first column, ID, is the primary key).
The records in table B are 'update records', so I need to update the corresponding 10,000 records in table A with the records from table B.
I tried first with this SQL query:
select * from A where ID not in (select ID from B), and then union that with table B. The approach is OK logically, but the first query (select * from A where ID not in (select ID from B)) is extremely slow (hours on a moderate cluster).
Then I tried to speed up first query with LEFT JOIN:
select A.* from A left join B on (A.ID = B.ID ) where B.ID is null
That approach seems fine logically, but it takes way too much memory for the Spark containers (Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead).
What would be a better/faster/less memory-hungry approach?

I would also go with a left join rather than NOT IN.
A couple of pieces of advice to reduce the memory requirement and improve performance:
Check that the large table is uniformly distributed by the join key (ID). If not, some tasks will be heavily loaded while others are almost idle, which causes serious slowness. Do a groupBy on ID with a count to measure this.
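A minimal sketch of that skew check in Spark SQL, assuming A is registered as a temporary view as described in the question:
-- The heaviest join keys float to the top; roughly flat counts suggest no skew
SELECT ID, COUNT(*) AS cnt
FROM A
GROUP BY ID
ORDER BY cnt DESC
LIMIT 20;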
If the join key is naturally skewed, then add more columns to the join condition while keeping the result the same. More columns may increase the chance of shuffling the data uniformly. This is a little hard to achieve.
Memory demand depends on the number of parallel tasks running and the volume of data processed per task in an executor. Reducing either or both will reduce memory pressure; the job will obviously run slower, but that is better than crashing. I would reduce the volume of data per task by creating more partitions on the data. Say you have 10 partitions for 1B rows; make it 200 to reduce the volume per task. Use repartition on table A. Don't create too many partitions, because that causes inefficiency; 10K partitions may be a bad idea.
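In plain Spark SQL the per-task volume can be reduced by raising the shuffle parallelism (the DataFrame-API equivalent would be a repartition on A); the value below is illustrative, not a recommendation:
-- More, smaller shuffle partitions mean less data per task
SET spark.sql.shuffle.partitions=200;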
There are some parameters to tweak, which are explained here.
The small table with 10K rows should be broadcast automatically because it is small. If it is not, you can increase the broadcast limit and apply a broadcast hint.
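Putting these pieces together, one possible sketch of the whole update in Spark SQL: a broadcast hint on B, an anti join to drop the IDs being replaced, and a union with the update records. This assumes A and B really do expose the same 10 columns in the same order, as the question states:
-- Optionally raise the auto-broadcast threshold (bytes; 100 MB here is illustrative)
SET spark.sql.autoBroadcastJoinThreshold=104857600;

-- Keep rows of A whose ID does not appear in B, then append the update records
SELECT /*+ BROADCAST(B) */ A.*
FROM A LEFT ANTI JOIN B ON A.ID = B.ID
UNION ALL
SELECT * FROM B;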

Related

Creating a view from JOIN two massive tables

Context
I have a big table, say table_A, with roughly 20 billion rows and 600 columns. I don't own this table but I can read from it.
For a fraction of these columns I produce a few extra columns (50) which I store in a separate table, say table_B, which is therefore roughly 20 bn X 50 large.
Now I have the need to expose the join of table table_A and table_B to users, which I tried as
CREATE VIEW table_AB
AS
SELECT *
FROM table_A AS ta
LEFT JOIN table_B AS tb ON (ta.tec_key = tb.tec_key)
The problem is that even a simple query like SELECT * FROM table_AB LIMIT 2 fails because of memory issues: apparently Impala attempts to do the full join in memory first, which would result in a table of about 0.5 petabytes. Hence the failure.
Question
What is the best way to create such a view?
How can one instruct SQL that, e.g., filtering operations on table_AB are to be executed before the join?
Creating a new table is also suboptimal because it would mean duplicating the data in table_AB, using up hundreds of Terabytes.
I have also tried [...] SELECT STRAIGHT_JOIN * [...] but it did not help.
What is the best way to create such a view?
Since both tables are huge, there will be memory problems. Here are some points I would recommend:
Assuming tables A and B have the same set of tec_key values, do an inner join.
Keep the (smaller) table B as the driver: create vw as select ... from b join a on .... Impala keeps the driver table in memory, so this will require less memory (see the sketch after this list).
Select only the columns that are required; do not select everything.
Put filters in the view.
Partition table B if you can, on a date/year/region/anything that distributes the data evenly.
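A hedged sketch of such a view; the selected columns and the filter column are hypothetical placeholders, not names from the real tables:
-- Smaller table_B drives the join, only needed columns are exposed, and a filter is baked in
CREATE VIEW table_AB AS
SELECT tb.tec_key, tb.derived_col1, ta.col1, ta.col2
FROM table_B AS tb
JOIN table_A AS ta ON (tb.tec_key = ta.tec_key)
WHERE ta.region = 'ASIA';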
How can one instruct SQL that, e.g., filtering operations on table_AB are to be executed before the join?
You cannot ensure whether a filter is applied before or after the join. The only way to ensure a filter will improve performance is to have a partition on the filter column. Otherwise, you can try filtering first and then joining, to see if it improves performance, like this:
select ... from b
join ( select ... from a where region='Asia') a on ... -- won't improve much
Creating a new table is also suboptimal because it would mean duplicating the data in table_AB, using up hundreds of Terabytes.
Completely agree on this. Multiple smaller tables are much better than one giant table with 600 columns. So, create a few staging tables with only the required fields and then enrich that data. It's a difficult data set, but no one changes 20 bn rows every day, so some sort of incremental load is also possible to implement.
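A rough sketch of what such an incremental staging table could look like in Impala; the column names and the load_date filter column are hypothetical:
-- Initial load: only the columns users actually query
CREATE TABLE stg_ab STORED AS PARQUET AS
SELECT ta.tec_key, ta.col1, ta.col2, tb.derived_col1
FROM table_A ta
JOIN table_B tb ON (ta.tec_key = tb.tec_key)
WHERE ta.load_date = '2023-01-01';

-- Subsequent loads append only the new slice
INSERT INTO stg_ab
SELECT ta.tec_key, ta.col1, ta.col2, tb.derived_col1
FROM table_A ta
JOIN table_B tb ON (ta.tec_key = tb.tec_key)
WHERE ta.load_date = '2023-01-02';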

Hive join query optimisation

Table A
---------
col1, col2, Adate, qty
Table B
-------
col2, cost, Bdate
The table sizes are as follows:
A: 1 million
B: 700k
Consider this query:
SELECT
A.col1,
A.col2,
B.Bdate bdate,
SUM(qty)*COLLECT_LIST(cost)[0] price
FROM A
JOIN B
ON (A.col2 = B.col2 AND A.Adate <= B.Bdate)
GROUP BY
A.col1,
A.col2,
B.bdate;
The above Hive query takes more than 3 hours on a cluster of 4 slaves (8 GB memory, 100 GB disk) and 1 master (16 GB memory, 100 GB disk).
Can this query be optimized? If yes, where can the optimization be possible?
Use Tez and mapjoin.
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --adjust for your smaller table to fit in memory
set hive.execution.engine=tez;
Also this computation is not memory-efficient:
SUM(qty)*COLLECT_LIST(cost)[0] price
COLLECT_LIST will collect all cost values in the group into a non-unique (it contains values from ALL rows in the group) and unordered (yes, unordered, because you have no distribute + sort before collect_list) array. This array can be quite big (the number of elements equals the number of rows in the group), depending on your data, and then you take the [0] element, which means you are picking just some arbitrary cost from the group. Does it make sense to collect a whole array only to take an arbitrary element from it? If it does not matter which cost is taken, then min(cost), max(cost) or some other scalar function will consume less memory. You can also use the first_value analytic function (it may require a sub-query, but it will be memory-efficient as well).
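If any cost from the group is acceptable (which is effectively what COLLECT_LIST(cost)[0] gives you), a sketch of the rewrite along the lines suggested above could look like this:
SELECT
A.col1,
A.col2,
B.Bdate bdate,
SUM(qty) * MAX(cost) price  -- any scalar aggregate avoids materializing the array
FROM A
JOIN B
ON (A.col2 = B.col2 AND A.Adate <= B.Bdate)
GROUP BY
A.col1,
A.col2,
B.Bdate;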
I will try to give you some advice to improve query performance in Hive.
Check the execution engine you are using
set hive.execution.engine;
If your execution engine is mr, then rather than MapReduce you may be able to use Apache Spark or Apache Tez, both of which are faster than MapReduce.
set hive.execution.engine=tez;
Join queries are computationally expensive and can be slow, especially when you’re joining three or more tables, or if you’re working with very large data.
One strategy that can be used to remedy this problem is to join the data in advance and store the pre-joined result in a separate table, which you can then query.
This is one way of denormalizing a normalized database to make it easier to run analytic queries.
This approach of pre-joining tables has some costs, but it can make analytic queries easier to write and faster to run.
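A minimal sketch of such a pre-joined table in Hive, reusing the columns from the question (the table name and storage format are illustrative):
-- Materialize the join once; analytic queries then read this table instead of joining
CREATE TABLE ab_prejoined STORED AS ORC AS
SELECT A.col1, A.col2, A.Adate, A.qty, B.Bdate, B.cost
FROM A
JOIN B ON (A.col2 = B.col2 AND A.Adate <= B.Bdate);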
There are some other techniques for improving Hive query performance:
Join table ordering (Largest table last)
As with any type of tuning, it is important to understand the internal workings of the system. When Hive executes a join,
it needs to decide which table is streamed and which table is cached.
Hive takes the last table in the JOIN statement for streaming, so we need to ensure that this streaming table is the largest of the two.
A: 1 million B: 700k
Hence, when these two tables are joined it is important that the larger table comes last in the query.
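If you prefer to make the streaming choice explicit rather than relying on table order, Hive also supports the STREAMTABLE hint; a minimal sketch based on the tables above:
-- A (1 million rows) is streamed; B (700k rows) is buffered in memory
SELECT /*+ STREAMTABLE(A) */ A.col1, A.col2, B.Bdate
FROM B
JOIN A ON (A.col2 = B.col2 AND A.Adate <= B.Bdate);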
Bucketing stores data in separate files, not separate subdirectories like partitioning.
It divides the data in an effectively random way, not in a predictable way like partitioning.
When records are inserted into a bucketed table, Hive computes hash codes of the values in the specified bucketing column and uses these hash codes to divide the records into buckets.
For this reason, bucketing is sometimes called hash partitioning.
The goal of bucketing is to distribute records evenly across a predefined number of buckets.
Bucketing can improve the performance of joins if all the joined tables are bucketed on the join key column.
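A hedged sketch of bucketed versions of the two tables; the column types and bucket count are assumptions, not values from the question:
-- Both tables bucketed on the join key (col2); bucket counts should match or be multiples
CREATE TABLE A_bucketed (col1 STRING, col2 STRING, Adate DATE, qty DOUBLE)
CLUSTERED BY (col2) INTO 32 BUCKETS
STORED AS ORC;

CREATE TABLE B_bucketed (col2 STRING, cost DOUBLE, Bdate DATE)
CLUSTERED BY (col2) INTO 32 BUCKETS
STORED AS ORC;

SET hive.optimize.bucketmapjoin=true;  -- allow bucket map joins on these tables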
For more on bucketing, see the page of the Hive Language Manual describing bucketed tables,
BucketedTables
bucketing-in-hive
Partitioning
Partitioning is a way of dividing a table into related parts based on the values of particular columns like date, city, and department.
Each table in Hive can have one or more partition keys to identify a particular partition.
Using partitions, it is easy to run queries on slices of the data.
apache-hive-partitions
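A small illustrative sketch of a partitioned table in Hive; the partition column load_year is hypothetical:
-- Each load_year value becomes its own directory of files
CREATE TABLE B_partitioned (col2 STRING, cost DOUBLE, Bdate DATE)
PARTITIONED BY (load_year INT)
STORED AS ORC;

-- Only the files of the matching partition are read
SELECT * FROM B_partitioned WHERE load_year = 2023;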

Two very alike select statements different performance

I've just come across some weird performance differences.
I have two selects:
SELECT s.dwh_end_date,
t.*,
'-1' as PROMOTION_DROP_EMP_CODE,
trunc(sysdate +1) as PROMOTION_END_DATE,
'K01' as PROMOTION_DROP_REASON,
-1 as PROMOTION_DROP_WO_NUMBER
FROM STG_PROMO_EXPIRE_DATE t
INNER JOIN fct_customer_services s
ON(t.dwh_product_key = s.dwh_product_key)
Which takes approximately 20 seconds.
And this one:
SELECT s.dwh_end_date,
s.dwh_product_key,
s.promotion_expire_date,
s.PROMOTION_DROP_EMP_CODE,
s.PROMOTION_END_DATE,
s.PROMOTION_DROP_REASON,
s.PROMOTION_DROP_WO_NUMBER
FROM STG_PROMO_EXPIRE_DATE t
INNER JOIN fct_customer_services s
ON(t.dwh_product_key = s.dwh_product_key)
That takes approximately 400 seconds
They are basically the same - they are just there to make sure I've updated my data correctly (the first select is used to update the FCT tables, the second select is to check that everything was updated correctly).
The only difference between these two selects is the columns I select. (The STG table has two columns - dwh_p_key and prom_expire_date.)
First select explain plan
Second select explain plan
What can cause this weird behaviour?..
The FCT table has a UNIQUE index on (dwh_product_key, dwh_end_date) and is partitioned by dwh_end_date (250 million records); the STG table doesn't have any indexes (and it's only 15k records).
Thanks in advance.
The plans are not exactly the same. The first query uses a fast full scan of the index on fct_customer_services and doesn't need to access any blocks from the actual table, since you only refer to the two indexed columns.
The second query does have to access the table blocks to get the other unindexed column values. It's doing a full table scan - slower and more expensive than a full index scan. The optimiser doesn't see any improvement from using the index and then accessing specific table rows, presumably because the cardinality is too high - it would need to access too many table rows to save any effort by hitting the index first. Doing so would be even slower.
So the second query is slower because it has to read the whole table from disk/cache rather than just the whole index, and the table is much larger than the index. You can look at the segments assigned to both objects (index and table) to see the ratio of their sizes.
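One way to compare those sizes in Oracle; the index name here is a hypothetical placeholder:
-- Sum over the segments belonging to the table and its unique index
SELECT segment_name, segment_type, ROUND(SUM(bytes)/1024/1024) AS size_mb
FROM user_segments
WHERE segment_name IN ('FCT_CUSTOMER_SERVICES', 'FCT_CUSTOMER_SERVICES_UX')
GROUP BY segment_name, segment_type;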

cardinality estimation when the foreign key is limited to a small subset

Recently I have been optimizing the performance of a large-scale ERP package.
One of the performance issues I haven't been able to solve involves bad cardinality estimation for a foreign key which is limited to a very small subset of records from a large table.
Table A holds 3 mil records and has a type field
Table B holds 7 mil records and holds a foreign key FK to Table A
The foreign key is only ever filled with primary keys from table A of a certain type; only 36 of the 3 mil records in table A have this type.
B JOIN A ON B.FK = A.PK AND A.TYPE = X AND A.Name = Y
Now, with correct statistics, SQL Server knows table A will only return 1 row.
But SQL Server estimates that only 2 records will be returned from table B (my guess: 7 mil / 3 mil), while actually 930,000 records are returned.
This results in a slow query plan being used.
The real query is more complex but the cause of the bad query plan is because of this simplified statement.
Our DB does have accurate statistics for the FK (histogram shows EQ_Rows for every distinct value of this FK) and filtering on a fixed FK value does result in accurate estimations.
Is there any way to show SQL Server that this FK can only hold a small number of distinct values, or to help it with the estimates in any other way?
If we had a chance we would split up the table and put these 36 records in a separate table but unfortunately this is how the ERP system works.
Extra info:
We are using SQL Server 2014.
The ERP system is Dynamics AX 2012 R3.
Using trace flag 9481 does help (not perfect, but a lot better), but unfortunately we cannot set trace flags for individual queries with Dynamics AX.
I've encountered these kinds of problems before, and have often found that I can dramatically reduce total run time for a stored proc or script by pulling those 'very few relevant rows' from a large table into a small temp table and then joining that temp table into the main query later. Or using CTE queries to isolate the few needed rows. A little experimentation should quickly tell you if this has potential in your case.
Look at the query plan.
Clearly you want it to filter on TYPE early.
It is probably doing a loop join:
FROM B
JOIN A
ON B.FK = A.PK
AND A.TYPE = X
AND A.Name = Y
Try the various join hints.
Next would be to create a #temp table and join to it.
Declare a PK on your temp table.
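A minimal sketch of that temp-table approach; X and Y are the placeholders from the question, and the temp table name is made up:
-- Pull the handful of matching rows from A into a temp table first
SELECT A.PK
INTO #a_subset
FROM A
WHERE A.TYPE = X AND A.Name = Y;

-- A primary key gives the optimizer an exact row count and a seekable key
-- (works here because A.PK is A's primary key and therefore NOT NULL)
ALTER TABLE #a_subset ADD PRIMARY KEY (PK);

SELECT B.*
FROM B
JOIN #a_subset s ON B.FK = s.PK;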

Speeding up aggregations for a large table in Oracle

I am trying to see how to improve performance for aggregation queries in an Oracle database. The system is used to run financial series simulations.
Here is the simplified set-up:
The first table table1 has the following columns
date | id | value
It is read-only, has about 100 million rows and is indexed on id, date
The second table table2 is generated by the application according to user input, is relatively small (300K rows) and has this layout:
id | start_date | end_date | factor
After the second table is generated, I need to compute totals as follows:
select date, sum(value * nvl(factor,1)) as total
from table1
left join table2 on table1.id = table2.id
and table1.date between table2.start_date and table2.end_date
group by date
My issue is that this is slow, taking up to 20-30 minutes if the second table is particularly large. Is there a generic way to speed this up, perhaps trading storage space for execution time, ideally achieving something that runs in under a minute?
I am not a database expert and have been reading the Oracle performance tuning docs, but was not able to find anything appropriate for this. The most promising idea I found was OLAP cubes, but I understand this would only help if my second table were fixed and I simply needed to apply different filters to the data.
First, to provide any real insight, you'd need to determine the execution plan that Oracle is producing for the slow query.
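One common way to capture that plan, reusing the simplified query from the question (note that a column literally named date would need to be quoted in Oracle, since DATE is a reserved word):
EXPLAIN PLAN FOR
select date, sum(value * nvl(factor,1)) as total
from table1
left join table2 on table1.id = table2.id
and table1.date between table2.start_date and table2.end_date
group by date;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);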
You say the second table is ~300K rows - yes that's small compared to 100M but since you have a range condition in the join between the two tables, it's hard to say how many rows from table1 are likely to be accessed in any given execution of the query. If a large proportion of the table is accessed, but the query optimizer doesn't recognize that, the index may actually be hurting instead of helping.
You might benefit from re-organizing table1 as an index-organized table, since you already have an index that covers most of the columns. But all I can say from the information so far is that it might help, but it might not.
Apart from indexes, also try the points below. My two cents!
Try running this query with the PARALLEL option, employing multiple processors: /*+ PARALLEL(table1,4) */ (see the sketch after this list).
NVL is evaluated for millions of rows, and this will have some impact; is there any way the data can be organised to avoid it?
When you know the dates in advance, you could split this query into two chunks: first fetch the ids from TABLE2 using the start date and end date, then join that to TABLE1 using a view or temp table. That way the index (with id as the leading edge) is used optimally.
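A sketch of the first point, with the hint applied to the simplified query from the question (a parallel degree of 4 is purely illustrative):
select /*+ PARALLEL(table1,4) */ date, sum(value * nvl(factor,1)) as total
from table1
left join table2 on table1.id = table2.id
and table1.date between table2.start_date and table2.end_date
group by date;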
Thanks!