Which field should I use with Oracle Partition By clause to improve performance - sql

I have an update statement that works fine but takes a very long time to complete.
I'm updating roughly 150 rows in one table with some tens of thousands of rows exposed through a view. It's been suggested that I use the Partition By clause to speed up the process.
I'm not too familiar with the Partition By clause, but I've been looking around and I think maybe I need to use a field that has a numeric value that can be compared against.
Is this correct? Or can I partition the larger table with something else?
If that is the case, I'm struggling with what in the larger table can be used. The table is composed as follows:
ID has a type of NUMBER and creates the unique id for a particular item.
Start_Date has a date type and indicates the start when the ID is valid.
End_Date has a date type and indicates the end time when the ID ceases to be valid.
ID_Type1 is NVARCHAR2(30) and indicates what type of Identifier we are using.
ID_Type2 is NVARCHAR2(30) and indicates what sub-type of Identifier we are using.
Identifier is NVARCHAR2(30) and any one ID can be mapped to one or more Identifiers.
So for example - View_ID
ID | Start_Date | End_Date | ID_Type1 | ID_Type2 | Identifier
1 | 2012-01-01 | NULL | Primary | Tertiary | xyz1
1 | 2012-01-01 | NULL | Second | Alpha | abc2
2 | 2012-01-01 | 2012-01-31 | Primary | Tertiary | ghv2
2 | 2012-02-01 | NULL | Second | Alpha | mno4
Would it be possible to Partition By the ID field of this view as long as there is a clause that the id is valid by date?
The update statement is quite basic, although it selects against one of several possible identifiers and ID_Type1 values.
UPDATE Temp_Table t set ID =
(SELECT DISTINCT ID FROM View_ID v
WHERE inDate BETWEEN Start_Date and End_Date
AND v.Identifier = (NVL(t.ID1, NVL(t.ID2, t.ID3)))
AND v.ID_Type1 in ('Primary','Secondary'));
Thanks in advance for any advice on any aspect of my question.
Additional Info ***
After investigating and following Gordon's advice, I changed the update into three updates. This reduced the overall update time by about 75%, going from just over a minute to just over 20 seconds. That's a big improvement, but I'd like to reduce the process even more if possible.
Does anyone think the Partition By clause would help even further? If so, what would be the correct method for putting this clause into an update statement? I'm honestly not sure I understand how this clause operates.
If the UPDATE using a SELECT statement only allows for 1 value to be selected does this exclude something like the following from working?
UPDATE Temp_Table t SET t.ID =
(SELECT DISTINCT ID,
ROW_NUMBER() OVER (PARTITION BY ID_Type1) AS PT1
FROM View_ID v
WHERE inDate BETWEEN v.Start_Date and v.End_Date
AND v.Identifier = t.ID1
AND PT1.Row_Number = 1 )
Solution ***
I combined advice from both responders below to dramatically improve performance. From Gordon, I removed the NVL from my UPDATE and changed it into three separate updates. (I'd prefer to combine them into a CASE, but my trials were still slow.)
From Eggi, I looked into working with some kind of materialized view that I could actually index myself, and settled on a WITH clause.
UPDATE Temp_Table t set ID =
(WITH IDs AS (SELECT /*+ materialize */ DISTINCT ID, Identifier FROM View_ID v
WHERE inDate BETWEEN Start_Date and End_Date
AND v.Identifier = ID1)
SELECT g.ID FROM IDs g
WHERE g.Identifier = t.ID1);
Thanks again.

It is very hard to imagine how window/analytic functions would help with this update. I do highly recommend that you learn them, but not for this purpose.
Perhaps the suggestion was for partitioning the tablespace used for the table. Note that this is very different from the "partition by" clause, which usually refers to window/analytic functions. Tablespace partitioning might help performance. However, here is something else you can try.
I think your problem is the join between the temp table and the view. Presumably, you are creating the temporary table. You should add a new column, say UsedId, with the definition:
coalesce(t.ID1, t.ID2, t.ID3) as UsedId
The "WHERE" clause in the update would then be:
WHERE inDate BETWEEN Start_Date and End_Date AND
v.Identifier = t.UsedId AND
v.ID_Type1 in ('Primary', 'Secondary')
I suspect that the performance problem is the use of NVL in the join, which interferes with optimization strategies.
In response to your comment . . . your original query would have the same problem as this version. Perhaps the logic you want is:
WHERE inDate BETWEEN Start_Date and End_Date AND
v.Identifier in (t.ID1, t.ID2, t.ID3) AND
v.ID_Type1 in ('Primary', 'Secondary')
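Putting that together, the full statement might look something like the sketch below. This assumes inDate is a column on Temp_Table and that the subquery still returns a single ID per row; it is only an illustration of the suggested WHERE clause, not a tested rewrite.
UPDATE Temp_Table t
SET t.ID = (SELECT DISTINCT v.ID
            FROM View_ID v
            WHERE t.inDate BETWEEN v.Start_Date AND v.End_Date
              AND v.Identifier IN (t.ID1, t.ID2, t.ID3)
              AND v.ID_Type1 IN ('Primary', 'Secondary'));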

The best option for partitioning seems to be the start date, because it always seems to have a value and you also get it as an input parameter in your query.
If you have not already done that, I would add a bitmap index on ID_Type1.
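As a sketch only, the suggestions above might translate into DDL along these lines. The name id_map is made up here; the base table behind View_ID is not named in the question, and the index and partitioning would have to go on that underlying table, not on the view.
-- Hypothetical base table name (id_map); column layout taken from the question.
CREATE BITMAP INDEX id_map_type1_bix ON id_map (id_type1);

-- Range partitioning on the start date, as suggested above:
CREATE TABLE id_map_part (
  id         NUMBER,
  start_date DATE,
  end_date   DATE,
  id_type1   NVARCHAR2(30),
  id_type2   NVARCHAR2(30),
  identifier NVARCHAR2(30)
)
PARTITION BY RANGE (start_date)
(
  PARTITION p2011 VALUES LESS THAN (DATE '2012-01-01'),
  PARTITION p2012 VALUES LESS THAN (DATE '2013-01-01'),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);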

Related

SQL Query with part of the key possibly being NULL

I've been working on a SQL query which needs to pull a value with a two-column key, where one of the columns may be null. And if it's null, I want to pick that value only if there is no row with the specific key.
So.
CUSTOM_____PLAN_____COST
VENDCO_____LMNK_____50
VENDCO_____null_____25
BALLCO_____null_____10
I'm trying to run a query that will pull this into one field, i.e., the value of VENDCO at 50 and the value of BALLCO at 10, ignoring the VENDCO row with 25. This would be part of a joined subquery, so I can't use the actual keys of VENDCO/BALLCO etc. Essentially, pick the cost value with the plan if it exists, but the one where it's null if the plan is not there.
It might also be worthwhile to point out that if I "select * from table where PLAN is null" I don't get results -- I have to select where PLAN=''. I'm not sure if that indicates anything weird about the data.
Hope I'm making myself clear.
I think that not exists should do what you want:
select t.*
from mytable t
where
plan is not null
or not exists (
select 1 from mytable t1 where t1.custom = t.custom and t1.plan is not null
)
Basically this gives priority to rows where plan is not null in groups sharing the same custom.
Demo on DB Fiddle:
CUSTOM | PLAN | COST
------ | ---- | ----
VENDCO | LMNK | 50
BALLCO | null | 10
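One caveat: the question notes that missing plans only match PLAN = '' rather than PLAN IS NULL, which suggests empty strings are stored instead of NULLs. Under that assumption, a variant that treats '' as NULL might be needed; this is only a sketch:
select t.*
from mytable t
where nullif(t.plan, '') is not null
   or not exists (
       select 1
       from mytable t1
       where t1.custom = t.custom
         and nullif(t1.plan, '') is not null
   )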

Improve join query in Oracle

I have a query which takes 17 seconds to execute. I have applied indexes on FIPS, STR_DT, and END_DT, but it's still taking time. Any suggestions on how I can improve the performance?
My query:
SELECT /*+ALL_ROWS*/ K_LF_SVA_VA.NEXTVAL VAL_REC_ID, a.REC_ID,
b.VID,
1 VA_SEQ,
51 VA_VALUE_DATATYPE,
b.VALUE VAL_NUM,
SYSDATE CREATED_DATE,
SYSDATE UPDATED_DATE
FROM CTY_REC a JOIN FIPS_CONS b
ON a.FIPS=b.FIPS AND a.STR_DT=b.STR_DT AND a.END_DT=b.END_DT;
DESC CTY_REC;
Name Null Type
------------------- ---- -------------
REC_ID NUMBER(38)
DATA_SOURCE_DATE DATE
STR_DT DATE
END_DT DATE
VID_RECSET_ID NUMBER
VID_VALSET_ID NUMBER
FIPS VARCHAR2(255)
DESC FIPS_CONS;
Name Null Type
------------- -------- -------------
STR_DT DATE
END_DT DATE
FIPS VARCHAR2(255)
VARIABLE VARCHAR2(515)
VALUE NUMBER
VID NOT NULL NUMBER
Explain Plan:
Plan hash value: 919279614
--------------------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SEQUENCE | K_VAL |
| 2 | HASH JOIN | |
| 3 | TABLE ACCESS FULL| CTY_REC |
| 4 | TABLE ACCESS FULL| FIPS_CONS |
--------------------------------------------------------------
I have added description of tables and explain plan for my query.
On the face of it, and without information on the configuration of the sequence you're using, the number of rows in each table, and the total number of rows projected from the query, it's possible that the execution plan you have is the most efficient one for returning all rows.
The optimiser clearly thinks that the indexes will not benefit performance, and this is often more likely when you optimise for all rows, not first rows. Index-based access is single block and one row at a time, so can be inherently slower than multiblock full scans on a per-block basis.
The hash join that Oracle is using is an extremely efficient way of joining data sets. Unless the hashed table is so large that it spills to disk, the total cost is only slightly more than full scans of the two tables. We need more detailed statistics on the execution to be able to tell if the hashed table is spilling to disk, and if it is the solution may just be modified memory management, not indexes.
What might also hold up your SQL execution is calling that sequence, if the sequence's cache value is very low and the number of records is high. More info required on that -- if you need to generate a sequential identifier for each row then you could use ROWNUM.
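If the sequence does turn out to be the bottleneck, two options along the lines above are sketched below: raise the sequence cache (the name K_LF_SVA_VA is taken from the query; 1000 is an arbitrary value), or skip the sequence and use ROWNUM when any per-row sequential identifier will do.
ALTER SEQUENCE K_LF_SVA_VA CACHE 1000;

-- Or, using ROWNUM instead of NEXTVAL:
SELECT ROWNUM VAL_REC_ID,
       a.REC_ID,
       b.VID,
       1 VA_SEQ,
       51 VA_VALUE_DATATYPE,
       b.VALUE VAL_NUM,
       SYSDATE CREATED_DATE,
       SYSDATE UPDATED_DATE
FROM CTY_REC a
JOIN FIPS_CONS b
  ON a.FIPS = b.FIPS AND a.STR_DT = b.STR_DT AND a.END_DT = b.END_DT;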
This is basically your query:
SELECT . . .
FROM CTY_REC a JOIN
FIPS_CONS b
ON a.FIPS = b.FIPS AND a.STR_DT = b.STR_DT AND a.END_DT = b.END_DT;
You want a composite index on (FIPS, STR_DT, END_DT), perhaps on both tables:
create index idx_cty_rec_3 on cty_rec(FIPS, STR_DT, END_DT);
create index idx_fips_cons_3 on fips_cons(FIPS, STR_DT, END_DT);
Actually, only one is probably necessary but having both gives the optimizer more choices for improving the query.
You should have at least these two indexes on the table:
CTY_REC(FIPS, STR_DT, END_DT)
FIPS_CONS(FIPS, STR_DT, END_DT)
which can still be sped up with covering indexes instead:
CTY_REC(FIPS, STR_DT, END_DT, REC_ID)
FIPS_CONS(FIPS, STR_DT, END_DT, VALUE, VID)
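For reference, the covering indexes listed above might be created along these lines (the index names here are made up):
CREATE INDEX idx_cty_rec_cov   ON CTY_REC   (FIPS, STR_DT, END_DT, REC_ID);
CREATE INDEX idx_fips_cons_cov ON FIPS_CONS (FIPS, STR_DT, END_DT, VALUE, VID);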
If you wish to drive the optimizer to use the indexes, replace /*+ ALL_ROWS */ with /*+ FIRST_ROWS */.

How can I fetch the last N rows, WITHOUT ordering the table

I have tables with multiple millions of rows and need to fetch the last rows for specific IDs,
for example the last row which has device_id = 123 AND the last row which has device_id = 1234.
Because the tables are so huge and ordering takes so much time, is it possible to select the last 200 without ordering the table, and then just order those 200 and fetch the rows I need?
How would I do that?
Thank you in advance for your help!
UPDATE
My PostgreSQL version is 9.2.1
sample data:
time device_id data data ....
"2013-03-23 03:58:00-04" | "001EC60018E36" | 66819.59 | 4.203
"2013-03-23 03:59:00-04" | "001EC60018E37" | 64277.22 | 4.234
"2013-03-23 03:59:00-04" | "001EC60018E23" | 46841.75 | 2.141
"2013-03-23 04:00:00-04" | "001EC60018E21" | 69697.38 | 4.906
"2013-03-23 04:00:00-04" | "001EC600192524"| 69452.69 | 2.844
"2013-03-23 04:01:00-04" | "001EC60018E21" | 69697.47 | 5.156
....
See SQLFiddle of this data
So if device_id = 001EC60018E21
I would want the most recent row with that device_id.
It is guaranteed that the last row with that device_id is the row I want, but it may or may not be the last row of the table.
Personally I'd create a composite index on device_id and descending time:
CREATE INDEX table1_deviceid_time ON table1("device_id","time" DESC);
then I'd use a subquery to find the highest time for each device_id and join the subquery results against the main table on device_id and time to find the relevant data, eg:
SELECT t1."device_id", t1."time", t1."data", t1."data1"
FROM Table1 t1
INNER JOIN (
SELECT t1b."device_id", max(t1b."time") FROM Table1 t1b GROUP BY t1b."device_id"
) last_ids("device_id","time")
ON (t1."device_id" = last_ids."device_id"
AND t1."time" = last_ids."time");
See this SQLFiddle.
It might be helpful to maintain a trigger-based materialized view of the highest timestamp for each device ID. However, this will cause concurrency issues if more than one connection can insert data for a given device ID, due to the connections fighting for update locks. It's also a pain if you don't know when new device IDs will appear, as you have to do an upsert - something that's very inefficient and clumsy. Additionally, the extra write load and autovacuum work created by the summary table may not be worth it; it might be better to just pay the price of the more expensive query.
BTW, time is a terrible name for a column because it's a built-in data type name. Use something more appropriate if you can.
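If you do go down the trigger-maintained summary route, a minimal sketch might look like the following. The table name device_last_seen and the column types are assumptions, and the insert path is a naive upsert that is not concurrency-safe on 9.2 (there is no ON CONFLICT before 9.5), which is part of why it may not be worth it.
-- Hypothetical summary table; names and types are assumptions.
CREATE TABLE device_last_seen (
    device_id varchar PRIMARY KEY,
    last_time timestamptz NOT NULL
);

CREATE OR REPLACE FUNCTION update_device_last_seen() RETURNS trigger AS $$
BEGIN
    -- Bump the existing summary row if the new reading is newer.
    UPDATE device_last_seen
       SET last_time = NEW."time"
     WHERE device_id = NEW."device_id"
       AND last_time < NEW."time";
    -- If there is no summary row yet, create one (naive, not concurrency-safe).
    IF NOT FOUND AND NOT EXISTS (
        SELECT 1 FROM device_last_seen WHERE device_id = NEW."device_id"
    ) THEN
        INSERT INTO device_last_seen (device_id, last_time)
        VALUES (NEW."device_id", NEW."time");
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER table1_last_seen
AFTER INSERT ON table1
FOR EACH ROW EXECUTE PROCEDURE update_device_last_seen();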
The general way to get the "last" row for each device_id looks like this.
select *
from Table1
inner join (select device_id, max(time) max_time
from Table1
group by device_id) T2
on Table1.device_id = T2.device_id
and Table1.time = T2.max_time;
Getting the "last" 200 device_id numbers without using an ORDER BY isn't really practical, but it's not clear why you might want to do that in the first place. If 200 is an arbitrary number, then you can get better performance by taking a subset of the table that's based on an arbitrary time instead.
select *
from Table1
inner join (select device_id, max(time) max_time
from Table1
where time > '2013-03-23 12:03'
group by device_id) T2
on Table1.device_id = T2.device_id
and Table1.time = T2.max_time;

Maintaining logical consistency with a soft delete, whilst retaining the original information

I have a very simple table students, structure as below, where the primary key is id. This table is a stand-in for about 20 multi-million row tables that get joined together a lot.
+----+----------+------------+
| id | name | dob |
+----+----------+------------+
| 1 | Alice | 01/12/1989 |
| 2 | Bob | 04/06/1990 |
| 3 | Cuthbert | 23/01/1988 |
+----+----------+------------+
If Bob wants to change his date of birth, then I have a few options:
Update students with the new date of birth.
Positives: 1 DML operation; the table can always be accessed by a single primary key lookup.
Negatives: I lose the fact that Bob ever thought he was born on 04/06/1990
Add a column, created date default sysdate, to the table and change the primary key to id, created. Every update becomes:
insert into students(id, name, dob) values (:id, :name, :new_dob)
Then, whenever I want the most recent information do the following (Oracle but the question stands for every RDBMS):
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by created desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: All queries over the entire database take that little bit longer. If the table was the size indicated this doesn't matter but once you're on your 5th left outer join using range scans rather than unique scans begins to have an effect.
Add a different column, deleted date default to_date('2100/01/01','yyyy/mm/dd'), or whatever overly early, or futuristic, date takes my fancy. Change the primary key to id, deleted then every update becomes:
update students x
set deleted = sysdate
where id = :id
and deleted = ( select max(deleted) from students where id = x.id );
insert into students(id, name, dob) values ( :id, :name, :new_dob );
and the query to get out the current information becomes:
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by deleted desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: Two DML operations; I still have to use ranked queries with the additional cost of a range scan rather than a unique index scan in every query.
Create a second table, say student_archive and change every update into:
insert into student_archive select * from students where id = :id;
update students set dob = :newdob where id = :id;
Positives: Never lose any information.
Negatives: 2 DML operations; if you ever want to get all the information, you have to use a union or an extra left outer join.
For completeness, have a horribly de-normalised data-structure: id, name1, dob, name2, dob2... etc.
Number 1 is not an option if I never want to lose any information and always do a soft delete. Number 5 can be safely discarded as causing more trouble than it's worth.
I'm left with options 2, 3 and 4 with their attendant negative aspects. I usually end up using option 2 and the horrific 150 line (nicely-spaced) multiple sub-select joins that go along with it.
tl;dr I realise I'm skating close to the line on a "not constructive" vote here but:
What is the optimal (singular!) method of maintaining logical consistency while never deleting any data?
Is there a more efficient way than those I have documented? In this context I'll define efficient as "less DML operations" and / or "being able to remove the sub-queries". If you can think of a better definition when (if) answering please feel free.
I'd stick to #4 with some modifications. There is no need to delete data from the original table; it's enough to copy the old values to the archive table before updating (or deleting) the original record. That can easily be done with a row-level trigger. Retrieving all the information is, in my opinion, not a frequent operation, and I don't see anything wrong with the extra join/union. Also, you can define a view so that all queries are straightforward from the end user's perspective.
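As a rough sketch of that row-level trigger, assuming the students/student_archive layout from option 4 (the archived_on column is my own addition, not part of the original design):
CREATE OR REPLACE TRIGGER students_archive_trg
BEFORE UPDATE OR DELETE ON students
FOR EACH ROW
BEGIN
  -- Copy the old values into the archive before they are changed or removed.
  INSERT INTO student_archive (id, name, dob, archived_on)
  VALUES (:OLD.id, :OLD.name, :OLD.dob, SYSDATE);
END;
/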

Tricky SQL statement over 3 tables

I have 3 different transaction tables, which look very similar, but have slight differences. This comes from the fact that there are 3 different transaction types; depending on the transaction types the columns change, so to get them in 3NF I need to have them in separate tables (right?).
As an example:
t1:
date,user,amount
t2:
date,user,who,amount
t3:
date,user,what,amount
Now I need a query that is going to get me all transactions in each table for the same user, something like
select * from t1,t2,t3 where user='me';
(which of course doesn't work).
I am studying JOIN statements but haven't got around the right way to do this. Thanks.
EDIT: Actually, I then need all of the columns from every table, not just the ones that are the same.
EDIT #2: Yeah, having transaction_type doesn't break 3NF, of course - so maybe my design is utterly wrong. Here is what really happens (it's an alternative currency system):
- Transactions are between users, like mutual credit. So units get swapped between users.
- Inventarizations are physical stuff brought into the system; a user gets units for this.
- Consumations are physical stuff consumed; a user has to pay units for this.
|--------------------------------------------------------------------------|
| type | transactions | inventarizations | consumations |
|--------------------------------------------------------------------------|
| columns | date | date | date |
| | creditor(FK user) | creditor(FK user) | |
| | debitor(FK user) | | debitor(FK user) |
| | service(FK service)| | |
| | | asset(FK asset) | asset(FK asset) |
| | amount | amount | amount |
| | | | price |
|--------------------------------------------------------------------------|
(Note that 'amount' is in different units; these are the entries, and calculations are made on those amounts. It's outside the scope here to explain why, but these are the fields.) So the question changes to "Can/should this be in one table or multiple tables (as I have it for now)?"
I need the previously described SQL statement to display running balances.
(Should this now become a new question altogether or is that OK to EDIT?).
EDIT #3: As EDIT #2 actually transforms this to a new question, I also decided to post a new question. (I hope this is ok?).
You can supply defaults as constants in the select statements for columns where you have no data;
so
SELECT Date, User, Amount, 'NotApplicable' as Who, 'NotApplicable' as What from t1 where user = 'me'
UNION
SELECT Date, User, Amount, Who, 'NotApplicable' from t2 where user = 'me'
UNION
SELECT Date, User, Amount, 'NotApplicable', What from t3 where user = 'me'
which assumes that Who and What are string-type columns. You could use NULL as well, but some kind of placeholder is needed.
I think that placing your additional information in a separate table and keeping all transactions in a single table will work better for you though, unless there is some other detail I've missed.
I think the meat of your question is here:
depending on the transaction types the columns change, so to get them in 3NF I need to have them in separate tables (right?).
I'm no 3NF expert, but I would approach your schema a little differently (which might clear up your SQL a bit).
It looks like your data elements are as follows: date, user, amount, who, and what. With that in mind, a more normalized schema might look something like this:
User
----
id, user info (username, etc)
Who
---
id, who info
What
----
id, what info
Transaction
-----------
id, date, amount, user_id, who_id, what_id
Your foreign key constraint verbiage will vary based on database implementation, but this is a little clearer (and extendable).
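A sketch of that schema in generic SQL, with the foreign keys spelled out (the column names and types here are assumptions for illustration, and the transaction table is renamed Transactions to dodge the reserved word):
Create Table Users
(
  Id Integer Primary Key Not Null,
  UserName Varchar(100)
);

Create Table Who
(
  Id Integer Primary Key Not Null,
  WhoInfo Varchar(100)
);

Create Table What
(
  Id Integer Primary Key Not Null,
  WhatInfo Varchar(100)
);

Create Table Transactions
(
  Id Integer Primary Key Not Null,
  TransactionDate Date Not Null,
  Amount Decimal(12,2) Not Null,
  UserId Integer Not Null References Users(Id),
  WhoId Integer References Who(Id),   -- nullable: only some transactions have a "who"
  WhatId Integer References What(Id)  -- nullable: only some transactions have a "what"
);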
You should consider an STI "architecture" (single table inheritance), i.e. put all the different columns into one table, and put them all under one index.
In addition, you may want to add indexes to the other columns you select on.
What is the result schema going to look like? - If you only want the minimal columns that are in all 3 tables, then it's easy, you would just UNION the results:
SELECT Date, User, Amount from t1 where user = 'me'
UNION
SELECT Date, User, Amount from t2 where user = 'me'
UNION
SELECT Date, User, Amount from t3 where user = 'me'
Or you could 'SubClass' them
Create Table Transaction
(
TransactionId Integer Primary Key Not Null,
TransactionDateTime dateTime Not Null,
TransactionType Integer Not Null,
-- Other columns all transactions share
)
Create Table Type1Transactions
(
TransactionId Integer Primary Key Not Null,
-- Type 1 specific columns
)
ALTER TABLE Type1Transactions WITH CHECK ADD CONSTRAINT
[FK_Type1Transaction_Transaction] FOREIGN KEY([TransactionId])
REFERENCES [Transaction] ([TransactionId])
Repeat for other types of transactions...
What about simply leaving the unnecessary columns null and adding a TransactionType column? This would result in a simple SELECT statement.
select *
from (
select user from t1
union
select user from t2
union
select user from t3
) u
left outer join t1 on u.user=t1.user
left outer join t2 on u.user=t2.user
left outer join t3 on u.user=t3.user
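And for comparison, the single-table variant proposed in the last two answers, sketched against the column list from EDIT #2 (the names and types here are guesses, not a definitive design):
Create Table Transactions
(
  Id Integer Primary Key Not Null,
  TransactionType Varchar(20) Not Null,  -- 'transaction' | 'inventarization' | 'consumation'
  TransactionDate Date Not Null,
  CreditorId Integer,           -- FK to user; null for consumations
  DebitorId Integer,            -- FK to user; null for inventarizations
  ServiceId Integer,            -- FK to service; transactions only
  AssetId Integer,              -- FK to asset; inventarizations and consumations
  Amount Decimal(12,2) Not Null,
  Price Decimal(12,2)           -- consumations only
);

-- Running balances for one user then come from a single table:
Select TransactionDate, TransactionType, Amount
From Transactions
Where CreditorId = :me Or DebitorId = :me
Order By TransactionDate;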