I have a table MyTable with multiple int columns with date and one column containing a date. The date column has an index created like follows
CREATE INDEX some_index_name ON MyTable(my_date_column)
because the table will often be queried for its contents between a user-specified date range. The table has no foreign keys pointing to it, nor have any other indexes other than the primary key which is an auto-incrementing index filled by a sequence/trigger.
Now, the issue I have is that the data on this table is often replaced for a given time period because it was out of date. So they way it is updated is by deleting all the entries within a given time period and inserting the new ones. The delete is performed using
DELETE FROM MyTable
WHERE my_date_column >= initialDate
AND my_date_column < endDate
However, because the number of rows deleted is massive (from 5 million to 12 million rows) the program pretty much blocks during the delete.
Is there something I can disable to make the operation faster? Or maybe specify an option in the index to make it faster? I read something about redo space having to do with this but I don't know how to disable it during an operation.
EDIT: The process runs every day and it deletes the last 5 days of data, then it brings the data for those 5 days (which may have changed in the external source) and reinserts the data.
The amount of data deleted is a tiny fraction compared to the whole amount of data in the table ( < 1%). So copying the data I want to keep into another table and dropping-recreating the table may not be the best solution.
I can only think of two ways to speed up this.
if you do this on a regular basis, you should consider partitioning your table by month. Then you just drop the partition of the month you want to delete. That is basically as fast as dropping a table. Partitioning requires an enterprise license if I'm not mistaken
create a new table with the data you want to keep (using create table new_table as select ...), drop the old table and rename the interims table. This will be much faster, but has the drawback that you need to re-create all indexes and (primary, foreign key) constraints on the new table.
Related
I have 500M rows with 30 columns table (with bigint ID column), lets call it big_one.
Also, I have another one table extra_one with the same number of rows and the same ID column, but two new columns with extra data that I'd like to include in the first table.
I added two extra columns into the first table and want to update the data based on join.
Query is quite easy:
update big_one set
col1=extra_one.col1,
col2=extra_one.col2
from extra_one
where big_one.id=extra_one.id;
But during execution the disk space usage dramatically increased up to 100%. Before the start I had 23.41% of free space on 4 nodes (160GB each, 640GB total). The big_one table initially used about 18% of space. This 23.41% indicates that I had about 490GB of free disk space to perform updates smoothly. But Redhisft thinks differently.
Two new columns are md5 hashes (so they're 32 chars length) (ideally it should take up to 16GB of space).
Recap:
I have a wide table big_one.
Have another table extra_one (with 3 columns total), with same IDs and number of records.
I added two new columns to big_one.
I want to enrich big_one with data from extra_one. (into that 2 new columns)
Q1: Any advice on how to perform such big updates?
Q2: If I will create the VIEW where will join two tables and then use it, will it save me from such space drain situations? How does Redshift work with VIEWs (not materialized) in such cases.
Do not use UPDATE on a large number of rows.
When a row is modified in Amazon Redshift, the existing row is marked as Deleted and a new row is appended to the table. This will effectively double the size of the table and wastes a lot of disk space until the table is Vacuumed. It is also very slow!
Instead:
Create a query that JOINs the two tables
Use the query to populate a new table (see below)
Delete the old table and rename the new table so that it replaces the original table (or, truncate the original table and copy the data back into it)
You can use CREATE TABLE LIKE to create a new, empty table based on an existing table.
From CREATE TABLE - Amazon Redshift:
LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ]
A clause that specifies an existing table from which the new table automatically copies column names, data types, and NOT NULL constraints. The new table and the parent table are decoupled, and any changes made to the parent table aren't applied to the new table. Default expressions for the copied column definitions are copied only if INCLUDING DEFAULTS is specified. The default behavior is to exclude default expressions, so that all columns of the new table have null defaults.
Tables created with the LIKE option don't inherit primary and foreign key constraints. Distribution style, sort keys,BACKUP, and NULL properties are inherited by LIKE tables, but you can't explicitly set them in the CREATE TABLE ... LIKE statement.
I have a table which has an id and a date. (id, date) make up the composite key for the table.
What I am trying to do is delete all entries older than a specific date.
delete from my_table where date < '2018-12-12'
The query plan explains that it will do a sequential scan for the date column.
I somehow want to make use of the index present since the number of distinct ids are very very small compared to total rows in the table.
How do I do it ? I have tried searching for it but to no avail
In case your use-case involves data-archival on monthly basis or some time period, you can think of updating your DataBase table to use partitions.
Let's say you collect data on monthly basis and want to keep data for the last 5 months. It would be really efficient to create partition over the table based on month of the year.
This will,
optimise your READ queries (table scans will reduce to partition scans)
optimise your DELETE requests (just delete the complete partition)
You need an index on date for this query:
create index idx_mytable_date on mytable(date);
Alternatively, you can drop your existing index and add a new one with (date, id). date needs to be the first key for this query.
I have been asked to look into a manual process that one of my colleagues is completing every now and again.
He sometimes needs to add a new column onto a large table (200 million rows), it is taking him more than 1 hour to do this. Before you ask, yes, the columns are nullable but sometimes the new column will have 90% data in it.
Instead of adding a new column to the existing table, he...
Creates a new table
Select (*) from old table (inserts into new)
Adds the new column as part of his script
Then he deletes the old table and renames the new table back to the original, adds index and then compresses. He says it much quicker like that.
If this is the best way then I will try and write SSIS package to try and make the process more seamless
Any advice is welcome!
Thanks
creating a new table structure and moving all the data to that table and delete the prior table is a good way just for a few data,you can do it by wizard in SQL Server. but it is the worst way for solving this problem(millions of data).
for large amount of data (millions of records) you should use "Alter Table".
Alter Table MyTable
ADD NewColumn nvarchar(10) null
the new column will add to the table as the last column.
if you use this script it takes less that one second because all data will not moving,you just add a new column in to the table.
but if you use the wizard method as you mentioned with millions of data records it takes hours.
as Ali says
alter Table MyTable
ADD NewColumn nvarchar(10) null
but then to fill in 90% of data. As he has a table already with it in and the key he's joining on in the copy so this is all he needs:
UPDATE MyTable
SET [NewColumn] = b.[NewColumn]
FROM MyTable a INNER JOIN NewColumnTable b ON a.[KeyField]= b.[KeyField]
would be a lot quicker. You could do it in SSIS but if this happens a lot then not really worth it for a few lines of SQL.
I am planning for an incremental load into warehouse (especially for updates of source tables in RDBMS).
Capturing the updated rows in staging tables from RDBMS based the updates datetime. But how do I determine which column of a particular row needs to be updated in the target warehouse tables?
Or do I just delete a particular row in the warehouse table (based on the primary key of the row in staging table) and insert the new updated row?
Which is the best way to implement the incremental load between the RDBMS and Warehouse using PL/SQL and SQL coding?
In my opinion, the easiest way to accomplish this is as follows:
Create a stage table identical to your host table. When you do your incremental/net-change load, load all changed records into this table (based on whatever your "last updated" field is)
Delete the records from your actual table based on the primary key. For example, if your primary key is customer, part, the query might look like this:
delete from main_table m
where exists (
select null
from stage_table s
where
m.customer = s.customer and
m.part = s.part
);
Insert the records from the stage to the main table.
You could also do an update existing records / insert new records, but either way that's two steps. The advantage of the method I listed is that it will work even if your tables have partitions and the newly updated data violates one of the original partition rules, whereas an update would not accomplish that. Also, the syntax is much simpler as your update would have to list every single field, whereas the delete from / insert into allows you list only the primary key fields.
Oracle also has a merge clause that will update if it exists or insert if it does not. I honestly don't know how that would be impacted if you had partitions.
One major caveat. If your updates include deletes -- records that need to be deleted from the main table, none of these will resolve that and you will need some other way to handle that. It may not be necessary, depending on your circumstances, but it's something to consider.
I have a table ( A ) in a database that doesn't have PK's it has about 300 k records.
I have a subset copy ( B ) of that table in other database, this has only 50k and contains a backup for a given time range ( july data ).
I want to copy from the table B the missing records into table A without duplicating existing records of course. ( I can create a database link to make things easier )
What strategy can I follow to succesfully insert into A the missing rows from B.
These are the table columns:
IDLETIME NUMBER
ACTIVITY NUMBER
ROLE NUMBER
DURATION NUMBER
FINISHDATE DATE
USERID NUMBER
.. 40 extra varchar columns here ...
My biggest concern is the lack of PK. Can I create something like a hash or a PK using all the columns?
What could be a possible way to proceed in this case?
I'm using Oracle 9i in table A and Oracle XE ( 10 ) in B
The approximate number of elements to copy is 20,000
Thanks in advance.
If the data volumes are small enough, I'd go with the following
CREATE DATABASE LINK A CONNECT TO ... IDENTIFIED BY ... USING ....;
INSERT INTO COPY
SELECT * FROM table#A
MINUS
SELECT * FROM COPY;
You say there are about 20,000 to copy, but not how many in the entire dataset.
The other option is to delete the current contents of the copy and insert the entire contents of the original table.
If the full datasets are large, you could go with a hash, but I suspect that it would still try to drag the entire dataset across the DB link to apply the hash in the local database.
As long as no duplicate rows should exist in the table, you could apply a Unique or Primary key to all columns. If the overhead of a key/index would be to much to maintain, you could also query the database in your application to see whether it exists, and only perform the insert if it is absent