Is there an efficient way to update rows of a table that has no indexes and no partitions (and ~50millions rows)?
I have a date field LOAD_DTTM and values of this field for rows that require update (around 2000 distinct dates).
WIll update be faster if i specify a date in a WHERE clause along with the UNIQUE_ID of a row?
If you want to update all, or a large number, of the rows then the quickest way is:
create table my_table_copy as
select ... -- all the columns, updating values as required
from my_table;
drop table my_table;
rename my_table_copy to my_table;
If your table had any indexes, constraints or triggers you would now need to re-add them - but it seems you don't have that issue!
You could create a PL/SQL procedure that loops and update and commit the table every n row count -- Say every 20.000 rows. I do not advise to update the full table as it will create a lock for a looong time and expose you to data loss in case of external factors.
The answer is NO.
Even if you specify both conditions in your WHERE clause as you stated, it won't help you to avoid a full scan of your table.
Even if one of your criteria will uniquely identify the row, it still won't help.
There is a real example tested on Oracle 12C ver.2 similar to your case. No indexes, no partitions, nothing. Just plain table with 4 columns
I have a table with 18mn records.
I also have CUSTOMER_ID which is a UNIQUE identifier for a row.
I also have ORDER_DATE column there.
Even if I do the query that you mentioned
update hit set status = 1 where customer_id = 408518625844 and order_date = '09-DEC-19';
it won't help me to avoid a full table scan. See below Execution Plan. Therefore under conditions, you've specified, you will be always getting the slowest execution time possible. Full Table Scan on 50mn rows is actually the worst-case scenario.
And pay attention to that Cost, it is 26539 on 18mn rows.
So if you have 50mn rows we can easily expect much more Cost for your query
Related
I'm currently using Oracle 11g and let's say I have a table with the following columns (more or less)
Table1
ID varchar(64)
Status int(1)
Transaction_date date
tons of other columns
And this table has about 1 Billion rows. I would want to update the status column with a specific where clause, let's say
where transaction_date = somedatehere
What other alternatives can I use rather than just the normal UPDATE statement?
Currently what I'm trying to do is using CTAS or Insert into select to get the rows that I want to update and put on another table while using AS COLUMN_NAME so the values are already updated on the new/temporary table, which looks something like this:
INSERT INTO TABLE1_TEMPORARY (
ID,
STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS)
SELECT
ID
3 AS STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS
FROM TABLE1
WHERE
TRANSACTION_DATE = SOMEDATE
So far everything seems to work faster than the normal update statement. The problem now is I would want to get the remaining data from the original table which I do not need to update but I do need to be included on my updated table/list.
What I tried to do at first was use DELETE on the same original table using the same where clause so that in theory, everything that should be left on that table should be all the data that i do not need to update, leaving me now with the two tables:
TABLE1 --which now contains the rows that i did not need to update
TABLE1_TEMPORARY --which contains the data I updated
But the delete statement in itself is also too slow or as slow as the orginal UPDATE statement so without the delete statement brings me to this point.
TABLE1 --which contains BOTH the data that I want to update and do not want to update
TABLE1_TEMPORARY --which contains the data I updated
What other alternatives can I use in order to get the data that's the opposite of my WHERE clause (take note that the where clause in this example has been simplified so I'm not looking for an answer of NOT EXISTS/NOT IN/NOT EQUALS plus those clauses are slower too compared to positive clauses)
I have ruled out deletion by partition since the data I need to update and not update can exist in different partitions, as well as TRUNCATE since I'm not updating all of the data, just part of it.
Is there some kind of JOIN statement I use with my TABLE1 and TABLE1_TEMPORARY in order to filter out the data that does not need to be updated?
I would also like to achieve this using as less REDO/UNDO/LOGGING as possible.
Thanks in advance.
I'm assuming this is not a one-time operation, but you are trying to design for a repeatable procedure.
Partition/subpartition the table in a way so the rows touched are not totally spread over all partitions but confined to a few partitions.
Ensure your transactions wouldn't use these partitions for now.
Per each partition/subpartition you would normally UPDATE, perform CTAS of all the rows (I mean even the rows which stay the same go to TABLE1_TEMPORARY). Then EXCHANGE PARTITION and rebuild index partitions.
At the end rebuild global indexes.
If you don't have Oracle Enterprise Edition, you would need to either CTAS entire billion of rows (followed by ALTER TABLE RENAME instead of ALTER TABLE EXCHANGE PARTITION) or to prepare some kind of "poor man's partitioning" using a view (SELECT UNION ALL SELECT UNION ALL SELECT etc) and a bunch of tables.
There is some chance that this mess would actually be faster than UPDATE.
I'm not saying that this is elegant or optimal, I'm saying that this is the canonical way of speeding up large UPDATE operations in Oracle.
How about keeping in the UPDATE in the same table, but breaking it into multiple small chunks?
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 0000000 and 0999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 1000000 and 1999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 2000000 and 2999999
COMMIT
This could help if the total workload is potentially manageable, but doing it all in one chunk is the problem. This approach breaks it into modest-sized pieces.
Doing it this way could, for example, enable other apps to keep running & give other workloads a look in; and would avoid needing a single humungous transaction in the logfile.
I have an SQL query as simple as:
select * from recent_cases where user_id=1000000 and case_id=10095;
It takes up to 0.4 seconds to execute it in Oracle. And when I do 20 requests in a row, it takes > 10s.
The table 'recent_cases' has 4 columns: ID, USER_ID, CASE_ID and VISITED_DATE. Currently there are only 38 records in this table.
Also, there are 3 indexes on this table: on ID column, on USER_ID column, and on (USER_ID, CASE_ID) columns pair.
Any ideas?
One theory -- the table has a very large data segment and high water mark near the end, but the statistics are not prompting the optimiser to use an index. Therefore you're getting a slow full table scan. You could ALTER TABLE ... MOVE and rebuild the indexes to fix such a problem, or COALESCE it.
Oracle Databases have a function called "analyze table". This function can speed up select statements a lot, even if there are just a few rows in the table.
Here are some links which might help you:
http://www.dba-oracle.com/t_oracle_analyze_table.htm
http://docs.oracle.com/cd/B28359_01/server.111/b28310/general002.htm
I have a following delete query in oracle. There will be about 1000 records to be deleted from the database at a time.
I have used "in" the query. Is there any better way to write this query?
DELETE FROM BI_EMPLOYEE_ACTIVITY
WHERE EMPLOYEE_ID in (
SELECT
EMP_ID
FROM
BI_EMPLOYEE
WHERE
PRODUCT_ID = IN_PRODUCT_ID
);
It is not really possible to answer this question as we're missing a description of the data distribution: How many rows are in each table? What's the relationship between the tables? How many rows are affected by the delete?
I'll be assuming that both tables are large (since this is an optimization question) and that BI_EMPLOYEE and BI_EMPLOYEE_ACTIVITY have a parent-child 1..N relationship.
If there are few rows affected by the delete, this means that not many employees have the same PRODUCT_ID and each employee has few activities. In this case it would make sense to index both BI_EMPLOYEE (product_id) and BI_EMPLOYEE_ACTIVITY (employee_id).
This is probably not the case though, the delete probably affects lots of rows. In that case the indexes could be a hindrance. If the delete affects lots of rows, the fastest access path probably is FULL TABLE SCAN + HASH JOIN.
We need some metrics here: how many rows are deleted? How long does it take? This is because large DML will always take time, especially DELETE since they produce the largest amount of undo.
There are alternatives to a large DELETE, as explained in "Deleting many rows from a big table" from asktom:
recreate the table without the deleted rows
partition the data, do a parallel delete
partition the data so that the delete is done by dropping a partition
Putting index on EMP_ID may help, I dont believe if any other optimization is possible, query is quite simple and straight forward
Create an index on PRODUCT_ID column. This would speed up the search. If the column is of varchar type, make use to function index if you are converting values to uppercase or lowercase
Maybe you can try EXIST instead of IN:
DELETE FROM BI_EMPLOYEE_ACTIVITY
WHERE EXISTS (
SELECT
EMP_ID
FROM
BI_EMPLOYEE
WHERE
PRODUCT_ID = IN_PRODUCT_ID
AND
EMP_ID = EMPLOYEE_ID
);
Create an index on BI_EMPLOYEE table for PRODUCT_ID, EMP_ID columns in this order (product_id on the first place).
And create an index on the BI_EMPLOYEE_ACTIVITY table for the column EMPLOYEE_ID
I'll just add that other than creating an index for the query, you need to take a look at the locking issue when your table grows really big, try to lock the table in exclusive mode (if possible) as this will only take a lock from the db, and if it's not possible try to commit the delete over each 2500 records so if you're stuck with row locking you don't endup starving the database of locks.
I've always just used "SELECT COUNT(1) FROM X" but perhaps this is not the most efficient. Any thoughts? Other options include SELECT COUNT(*) or perhaps getting the last inserted id if it is auto-incremented (and never deleted).
How about if I just want to know if there is anything in the table at all? (e.g., count > 0?)
The best way is to make sure that you run SELECT COUNT on a single column (SELECT COUNT(*) is slower) - but SELECT COUNT will always be the fastest way to get a count of things (the database optimizes the query internally).
If you check out the comments below, you can see arguments for why SELECT COUNT(1) is probably your best option.
To follow up on girasquid's answer, as a data point, I have a sqlite table with 2.3 million rows. Using select count(*) from table, it took over 3 seconds to count the rows. I also tried using SELECT rowid FROM table, (thinking that rowid is a default primary indexed key) but that was no faster. Then I made an index on one of the fields in the database (just an arbitrary field, but I chose an integer field because I knew from past experience that indexes on short fields can be very fast, I think because the index is stored a copy of the value in the index itself). SELECT my_short_field FROM table brought the time down to less than a second.
If you are sure (really sure) that you've never deleted any row from that table and your table has not been defined with the WITHOUT ROWID optimization you can have the number of rows by calling:
select max(RowId) from table;
Or if your table is a circular queue you could use something like
select MaxRowId - MinRowId + 1 from
(select max(RowId) as MaxRowId from table) JOIN
(select min(RowId) as MinRowId from table);
This is really really fast (milliseconds), but you must pay attention because sqlite says that row id is unique among all rows in the same table. SQLite does not declare that the row ids are and will be always consecutive numbers.
The fastest way to get row counts is directly from the table metadata, if any. Unfortunately, I can't find a reference for this kind of data being available in SQLite.
Failing that, any query of the type
SELECT COUNT(non-NULL constant value) FROM table
should optimize to avoid the need for a table, or even an index, scan. Ideally the engine will simply return the current number of rows known to be in the table from internal metadata. Failing that, it simply needs to know the number of entries in the index of any non-NULL column (the primary key index being the first place to look).
As soon as you introduce a column into the SELECT COUNT you are asking the engine to perform at least an index scan and possibly a table scan, and that will be slower.
I do not believe you will find a special method for this. However, you could do your select count on the primary key to be a little bit faster.
sp_spaceused 'table_name' (exclude single quote)
this will return the number of rows in the above table, this is the most efficient way i have come across yet.
it's more efficient than select Count(1) from 'table_name' (exclude single quote)
sp_spaceused can be used for any table, it's very helpful when the table is exceptionally big (hundreds of millions of rows), returns number of rows right a way, whereas 'select Count(1)' might take more than 10 seconds. Moreover, it does not need any column names/key field to consider.
I have a table with millions of rows, and I need to do LOTS of queries which look something like:
select max(date_field)
where varchar_field1 = 'something'
group by varchar_field2;
My questions are:
Is there a way to create an index to help with this query?
What (other) options do I have to enhance performance of this query?
An index on (varchar_field1, varchar_field2, date_field) would be of most use. The database can use the first index field for the where clause, the second for the group by, and the third to calculate the maximum date. It can complete the entire query just using that index, without looking up rows in the table.
Obviously, an index on varchar_field1 will help a lot.
You can create yourself an extra table with the columns
varchar_field1 (unique index)
max_date_field
You can set up triggers on inserts, updates, and deletes on the table you're searching that will maintain this little table -- whenever a row is added or changed, set a row in this table.
We've had good success with performance improvement using this refactoring technique. In our case it was made simpler because we never delete rows from the table until they're so old that nobody ever looks up the max field. This is an especially helpful technique if you can add max_date_field to some other table rather than create a new one.