How to improve select performance from huge tables - sql

I currently have a process that is taking too long (roughly an hour).
The process basically does this:
First, it left joins one table to a VIEW -
SELECT * FROM
STG_CRM, V_CRM
WHERE
STG_CRM.CRM_CASE_ID=V_CRM.CASE_ID(+)
The view DDL:
create or replace view stg_admin.v_crm as
select t.case_id
from crm_case t, dim_crm x
where t.case_id=x.crm_case_id;
STG_CRM - 200k records - no indexes.
DIM_CRM - 90MIL records - indexed (crm_case_id - unique).
CRM_CASE - 200k records - no indexes.
Up to this point nothing is too heavy (about 2-3 minutes). Then there is a left join to another VIEW, which is currently the heaviest select (just SELECT * from the view takes 10 minutes).
View DDL - I'm currently weighing two different queries:
select t.crm_case_id,s.customer_key
from stg_crm t, stg_scd s
where t.account_number=s.account_number
and t.case_create_date between s.start_date and s.end_date;
Or:
select t.crm_case_id,
(select min(s.customer_key) keep (dense_rank first order by s.end_date asc)
from stg_scd s
where t.account_number = s.account_number and
t.case_create_date <= s.end_date
) as customer_key
from stg_crm t
Table stg_scd - 500M records, indexed UNIQUE on (customer_key, start_date, end_date), partitioned by end_date daily.
Currently both of these queries take a very long time, the second a bit longer. My guess is that the index is not being used, since start_date is not used as a filter, but I have no idea how to add it.
My question is: how can I make this faster? If I add an index on STG_CRM on create_date, will it help, given that this is the small table? (I don't even know if the DBA will allow it.)
LIMITATIONS :
I can't change indexes on the big table (STG_SCD).
I may be able to add an index on the other tables, but only with a good reason, because it can hurt performance for other processes that use these tables.
The implicit join syntax is generated by my program, so no need to comment on that.
Thanks a lot in advance!
P.S. The first select left joined to the second select takes about 30-60 minutes.

Since you are aggregating data from a very large table (as you are doing in your second query option), you might gain some performance benefit from using query rewrite to calculate these results in advance. Here's a paper that covers the use of this feature:
http://gerardnico.com/wiki/database/oracle/query_rewriting
You can also find many other examples of best practices for this feature; it is highly valuable for tuning data warehouse queries.
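For illustration only, query rewrite is typically driven by a materialized view; a rough sketch along these lines (the name, grouping and refresh options are my assumptions, not your exact logic, and the QUERY REWRITE privilege must be granted):
-- Hypothetical sketch: pre-aggregate the earliest customer_key per account and end_date
-- so the optimizer can rewrite matching aggregations against the stored results.
CREATE MATERIALIZED VIEW mv_scd_min_key
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT account_number,
       end_date,
       MIN(customer_key) AS min_customer_key
FROM   stg_scd
GROUP  BY account_number, end_date;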
Good luck!

I've managed to solve this problem!
I was missing two things:
1) Statistics on stg_scd, which we had disabled after adding partitions to the table; we forgot to gather statistics again after each new partition was added.
2) An index on stg_crm on (account_number, case_create_date) - see the sketch below.
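For reference, the fix boils down to something like this (the index name is my own invention; the stats call assumes the stg_admin schema from the view DDL):
-- Gather statistics on the partitioned table (repeat after adding partitions).
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'STG_ADMIN',
    tabname => 'STG_SCD',
    cascade => TRUE);  -- also gathers index statistics
END;
/
-- Index the small staging table on the join/filter columns.
CREATE INDEX stg_crm_acct_dt_ix ON stg_crm (account_number, case_create_date);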
Thanks for all your attempts :)

Related

Extract data from view ORACLE performance

Hello, I created a view for a subquery (a select from two tables).
This is the SQL statement:
CREATE OR REPLACE VIEW EMPLOYEER_VIEW
AS
SELECT A.ID,A.FIRST_NAME||' '||A.LAST_NAME AS NAME,B.COMPANY_NAME
FROM EMPLOY A, COMPANY B
WHERE A.COMPANY_ID=B.COMPANY_ID
AND A.DEPARTEMENT !='DEP_004'
ORDER BY A.ID;
If I select data from EMPLOYEER_VIEW, the average execution time is 135,953 s.
Table EMPLOY contains 124,600,329 rows.
Table COMPANY contains 609 rows.
My question is :
How can I make the execution faster?
I created two indexes:
emply_index (ID,COMPANY_ID,DEPARTEMENT)
and company_index(COMPANY_ID)
Can you help me make these selects run faster (by creating another index or changing the join)?
PS: I Can't create a materialized view in this database.
Thanks in advance for your help.
You have a lot of things to do.
If you must work with a view and cannot create a scheduled job to insert the data into a table, I will remove my answer.
Views are not really suited to hundreds of millions of rows; they are fine for a few million.
Indexes must be maintained while data is being inserted; if you insert data with the indexes in place, the process can be 100 times slower. (You can drop and recreate them, or rebuild them afterwards.)
In the company table, create partitions.
If you have a lot of IDs, use RANGE partitioning.
If you have around 100 IDs, use LIST partitioning.
You do not need an index here, because the join clause will not be optimized by it; indexes pay off for selective WHERE clauses.
We had a project with 433,000,000 rows, and the only way to make it work was playing with partitions.
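Purely as an illustration of that last point (column types and range boundaries are my assumptions), range partitioning the large table by ID would look roughly like this:
-- Hypothetical sketch: a range-partitioned copy of the large table, keyed on ID.
CREATE TABLE employ_part (
  id           NUMBER,
  first_name   VARCHAR2(100),
  last_name    VARCHAR2(100),
  company_id   NUMBER,
  departement  VARCHAR2(20)
)
PARTITION BY RANGE (id) (
  PARTITION p_low   VALUES LESS THAN (50000000),
  PARTITION p_mid   VALUES LESS THAN (100000000),
  PARTITION p_high  VALUES LESS THAN (MAXVALUE)
);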

Partitioning table in postgres

I have a simple table structure in Postgres: a site table and a site_pages table in a one-to-many relationship. The tables join on site.id = site_pages.site_id.
These tables still perform quickly, but they are growing fast and I'm aware that might not last much longer, so I just want to be prepared.
I had two ideas:
Partition on site.id and site_pages.site_id, grouping by 1M rows, but queries will then select from multiple partitions.
Partition by active (true/false), but that will probably only be a short-term fix.
Is there a better approach I'm missing?
Table Structure
site ~ 7 million rows
id
url
active
site_pages ~ 60 million rows
id
site_id
page_url
active
I don't think that partitioning in the classical sense will help you there. If you end up having to select from all partitions, you won't end up faster.
If most of the queries access only active data and you want to optimize for that case, you could introduce an old_site and an old_site_pages table and move the data there when it becomes inactive. Queries accessing all data will have to use a UNION of the current and the old data and might become slower, but queries accessing only active data can become fast.
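A minimal sketch of that idea, assuming an archive table named old_site_pages with the same columns (the view name is made up):
-- Queries that need both active and archived pages can go through a UNION ALL view.
CREATE VIEW all_site_pages AS
SELECT id, site_id, page_url, active FROM site_pages
UNION ALL
SELECT id, site_id, page_url, active FROM old_site_pages;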
Tables with just a few columns should perform acceptably up to some hundreds of millions of rows, so I think you can skip the site table for now.
As for site_pages, partitioning will help you if you use the partitioning criteria in your SELECTs. This means if you partition by site_id (grouped by some millions of rows) and have CHECK criteria set properly for each table (CHECK site_id >= 1000000 AND site_id < 2000000) then your SELECT ... WHERE site_id = 1536987 will not use UNION. It will only read partitions that match your criteria, thus going through only one table. You can see it from EXPLAIN.
And finally, you could move NOT active sites and site_pages into different tables - some archive.
P.S.: I assume you know how to set up partitioning in Postgres (subtables should INHERIT the parent table, add CHECK constraints, index each subtable, etc.).
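For completeness, a minimal sketch of that inheritance-based setup for site_pages (the child table name, range and trigger are illustrative only):
-- Child table for one site_id range; the CHECK constraint enables constraint exclusion.
CREATE TABLE site_pages_p01 (
  CHECK (site_id >= 1 AND site_id < 1000000)
) INHERITS (site_pages);

CREATE INDEX ON site_pages_p01 (site_id);

-- Route inserts on the parent into the matching child table.
CREATE OR REPLACE FUNCTION site_pages_insert_trigger() RETURNS trigger AS $$
BEGIN
  IF NEW.site_id >= 1 AND NEW.site_id < 1000000 THEN
    INSERT INTO site_pages_p01 VALUES (NEW.*);
  ELSE
    RAISE EXCEPTION 'site_id % has no matching partition', NEW.site_id;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER site_pages_partition_insert
  BEFORE INSERT ON site_pages
  FOR EACH ROW EXECUTE PROCEDURE site_pages_insert_trigger();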

Two very alike select statements different performance

I've just come across some weird performance differences.
I have two selects:
SELECT s.dwh_end_date,
t.*,
'-1' as PROMOTION_DROP_EMP_CODE,
trunc(sysdate +1) as PROMOTION_END_DATE,
'K01' as PROMOTION_DROP_REASON,
-1 as PROMOTION_DROP_WO_NUMBER
FROM STG_PROMO_EXPIRE_DATE t
INNER JOIN fct_customer_services s
ON(t.dwh_product_key = s.dwh_product_key)
Which takes approximately 20 seconds.
And this one:
SELECT s.dwh_end_date,
s.dwh_product_key,
s.promotion_expire_date,
s.PROMOTION_DROP_EMP_CODE,
s.PROMOTION_END_DATE,
s.PROMOTION_DROP_REASON,
s.PROMOTION_DROP_WO_NUMBER
FROM STG_PROMO_EXPIRE_DATE t
INNER JOIN fct_customer_services s
ON(t.dwh_product_key = s.dwh_product_key)
That takes approximately 400 seconds
They are basically the same - the point is just to verify that I've updated my data correctly (the first select is used to update the FCT table, the second is to make sure everything was updated correctly).
The only difference between these two selects is the columns I select. (The STG table has two columns - dwh_product_key and promotion_expire_date.)
First select explain plan
Second select explain plan
What can cause this weird behaviour?
The FCT table has a UNIQUE index on (dwh_product_key, dwh_end_date) and is partitioned by dwh_end_date (250 million records); the STG table doesn't have any indexes (and it's only 15k records).
Thanks in advance.
The plans are not exactly the same. The first query uses a fast full scan of the index on fct_customer_services and doesn't need to access any blocks from the actual table, since you only refer to the two indexed columns.
The second query does have to access the table blocks to get the other, unindexed column values. It's doing a full table scan - slower and more expensive than a full index scan. The optimiser doesn't see any benefit in using the index and then accessing specific table rows, presumably because the cardinality is too high - it would need to access too many table rows to save any effort by hitting the index first, so doing that would be even slower.
So the second query is slower because it has to read the whole table from disk/cache rather than just the whole index, and the table is much larger than the index. You can look at the segments assigned to both objects (index and table) to see the ratio of their sizes.
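For example, something along these lines (this assumes the standard USER_SEGMENTS and USER_INDEXES dictionary views; the index is looked up rather than named):
-- Compare the total segment size of the table against its indexes.
SELECT segment_name, ROUND(SUM(bytes) / 1024 / 1024) AS size_mb
FROM   user_segments
WHERE  segment_name = 'FCT_CUSTOMER_SERVICES'
   OR  segment_name IN (SELECT index_name
                        FROM   user_indexes
                        WHERE  table_name = 'FCT_CUSTOMER_SERVICES')
GROUP  BY segment_name;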

5+ Intermediate SQL Tables to Arrive at Desired Table, Postgres

I am generating reports on electoral data that group voters into their age groups, and then assign those age groups a quartile, before finally returning the table of age groups and quartiles.
By the time I arrive at the table with the schema and data that I want, I have created 7 intermediate tables that might as well be deleted at this point.
My question is: is it plausible that so many intermediate tables are necessary? Or is this a sign that I am "doing it wrong"?
Technical Specifics:
Postgres 9.4
I am chaining tables, starting with the raw database tables and successively transforming the table closer to what I want. For instance, I do something like:
CREATE TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
And then I do
CREATE TABLE gm.race_code_and_percent_of_total_turnout AS
SELECT race_code, count, round((count::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.race_code_and_turnout_count
And that first table goes off in a second branch:
CREATE TABLE gm.race_code_and_turnout_percentage AS
SELECT t1.race_code, round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM gm.race_code_and_turnout_count AS t1
JOIN gm.race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
So each table is building on the one before it.
While temporary tables are used a lot in SQL Server (mainly to overcome its peculiar locking behaviour), they are far less common in Postgres (and your example uses regular tables, not temporary tables).
Usually the overhead of creating a new table is higher than letting the system store intermediate results on disk.
From my experience, creating intermediate tables usually only helps if:
you have a lot of data that is aggregated and can't be aggregated in memory
the aggregation drastically reduces the data volume to be processed so that the next step (or one of the next steps) can handle the data in memory
you can efficiently index the intermediate tables so that the next step can make use of those indexes to improve performance.
you re-use a pre-computed result several times in different steps
The above list is not complete, and using this approach can also be beneficial if only some of these conditions are true.
If you keep creating those tables, create them at least as temporary or unlogged tables to minimize the I/O overhead that comes with writing the data, and thus keep as much of it in memory as possible.
However, I would always start with a single query instead of maintaining many different tables (which all need to be changed if you have to change the structure of the report).
For example, the first two queries from your question can easily be combined into a single query with no performance loss:
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;
This is going to be faster than writing the data twice to disk (including all transactional overhead).
If you stack your queries using common table expressions, Postgres will automatically spill the data to disk if it gets too big; otherwise it will process it in memory. When you manually create the tables, you force Postgres to write everything to disk.
So you might want to try something like this:
with race_code_and_turnout_count as (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
), race_code_and_total_count as (
select ....
from ....
), race_code_and_turnout_percentage as (
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM race_code_and_turnout_count AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
)
select *
from ....;
and see how that performs.
If you don't re-use the intermediate steps more than once, writing them as a derived table instead of a CTE might be faster in Postgres due to the way the optimizer works, e.g.:
SELECT t1.race_code,
round((t1.cnt::numeric / t2.count)*100,2) as turnout_percentage
FROM (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
) AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
If it performs well and results in the right output, I see nothing wrong with it. I do, however, suggest using (local) temporary tables if you need intermediate tables.
Your series of queries can always be optimized to use fewer intermediate steps. Do that if you feel your reports start performing poorly.
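As a sketch of that suggestion, here is the first intermediate step from the question as a temporary (or unlogged) table - the names are reused from the question:
-- Session-local temporary table: not WAL-logged and dropped automatically at session end.
CREATE TEMPORARY TABLE race_code_and_turnout_count AS
SELECT race_code, count(*) AS cnt
FROM   gm.active_dem_voters_34th_house_in_2012_primary
GROUP  BY race_code;

-- Or, if it must be visible to other sessions but crash-safety is not required:
CREATE UNLOGGED TABLE gm.race_code_and_turnout_count_u AS
SELECT race_code, count(*) AS cnt
FROM   gm.active_dem_voters_34th_house_in_2012_primary
GROUP  BY race_code;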

Speeding up aggregations for a large table in Oracle

I am trying to see how to improve performance for aggregation queries in an Oracle database. The system is used to run financial series simulations.
Here is the simplified set-up:
The first table table1 has the following columns
date | id | value
It is read-only, has about 100 million rows and is indexed on id, date
The second table table2 is generated by the application according to user input, is relatively small (300K rows) and has this layout:
id | start_date | end_date | factor
After the second table is generated, I need to compute totals as follows:
select date, sum(value * nvl(factor,1)) as total
from table1
left join table2 on table1.id = table2.id
and table1.date between table2.start_date and table2.end_date
group by date
My issue is that this is slow, taking up to 20-30 minutes if the second table is particularly large. Is there a generic way to speed this up, perhaps trading storage space for execution time, ideally to get it running in under a minute?
I am not a database expert and have been reading the Oracle performance tuning docs, but was not able to find anything appropriate for this. The most promising idea I found was OLAP cubes, but I understand this would help only if my second table were fixed and I simply needed to apply different filters to the data.
First, to provide any real insight, you'd need to determine the execution plan that Oracle is producing for the slow query.
You say the second table is ~300K rows - yes that's small compared to 100M but since you have a range condition in the join between the two tables, it's hard to say how many rows from table1 are likely to be accessed in any given execution of the query. If a large proportion of the table is accessed, but the query optimizer doesn't recognize that, the index may actually be hurting instead of helping.
You might benefit from re-organizing table1 as an index-organized table, since you already have an index that covers most of the columns. But all I can say from the information so far is that it might help, but it might not.
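For illustration only (the column list is guessed from the simplified layout, and the_date stands in for the reserved word DATE), an index-organized version of table1 would be created roughly like this and then reloaded:
-- Index-organized table: the rows live inside the primary-key index itself,
-- so lookups by (id, date) never need a separate table access.
CREATE TABLE table1_iot (
  id        NUMBER NOT NULL,
  the_date  DATE   NOT NULL,
  value     NUMBER,
  CONSTRAINT table1_iot_pk PRIMARY KEY (id, the_date)
)
ORGANIZATION INDEX;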
Apart from indexes, also try the ideas below. My two cents!
1) Try running this query with the PARALLEL option, employing multiple processes: /*+ PARALLEL(table1,4) */ (see the sketch after this list).
2) NVL is evaluated for millions of rows, and this will have some impact; is there any way the data can be organised to avoid it?
3) When you know the dates in advance, you could split this query into two chunks: fetch the ids from TABLE2 using the start date and end date, and then join that result to TABLE1 via a view or temp table. That way the index (with id as its leading edge) is used optimally.
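A hedged sketch of the first suggestion (a degree of 4 is arbitrary, the_date stands in for the reserved word DATE, and parallel execution must be allowed in your environment):
-- The original aggregation with a parallel hint on the large table.
SELECT /*+ PARALLEL(t1, 4) */
       t1.the_date,
       SUM(t1.value * NVL(t2.factor, 1)) AS total
FROM   table1 t1
LEFT JOIN table2 t2
       ON  t1.id = t2.id
       AND t1.the_date BETWEEN t2.start_date AND t2.end_date
GROUP  BY t1.the_date;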
Thanks!