I have a table with 500M rows and 30 columns (with a bigint ID column); let's call it big_one.
I also have another table, extra_one, with the same number of rows and the same ID column, plus two new columns with extra data that I'd like to include in the first table.
I added the two extra columns to the first table and want to update the data based on a join.
The query is quite simple:
update big_one set
col1=extra_one.col1,
col2=extra_one.col2
from extra_one
where big_one.id=extra_one.id;
But during execution the disk space usage dramatically increased, all the way up to 100%. Before the start, disk usage was at 23.41% across 4 nodes (160GB each, 640GB total), and the big_one table itself accounted for about 18%. That 23.41% usage means I had about 490GB of free disk space, which should have been enough to perform the update smoothly. But Redshift thinks differently.
The two new columns are MD5 hashes (32 characters each), so ideally they should take up to about 16GB of space.
Recap:
I have a wide table big_one.
I have another table, extra_one (3 columns total), with the same IDs and number of records.
I added two new columns to big_one.
I want to enrich big_one with data from extra_one (into those two new columns).
Q1: Any advice on how to perform such big updates?
Q2: If I create a VIEW that joins the two tables and use it instead, will that save me from such space-drain situations? How does Redshift handle (non-materialized) VIEWs in such cases?
Do not use UPDATE on a large number of rows.
When a row is modified in Amazon Redshift, the existing row is marked as deleted and a new row is appended to the table. This effectively doubles the size of the table and wastes a lot of disk space until the table is vacuumed. It is also very slow!
Instead:
Create a query that JOINs the two tables
Use the query to populate a new table (see below)
Delete the old table and rename the new table so that it replaces the original table (or, truncate the original table and copy the data back into it); see the sketch below
You can use CREATE TABLE LIKE to create a new, empty table based on an existing table.
From CREATE TABLE - Amazon Redshift:
LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ]
A clause that specifies an existing table from which the new table automatically copies column names, data types, and NOT NULL constraints. The new table and the parent table are decoupled, and any changes made to the parent table aren't applied to the new table. Default expressions for the copied column definitions are copied only if INCLUDING DEFAULTS is specified. The default behavior is to exclude default expressions, so that all columns of the new table have null defaults.
Tables created with the LIKE option don't inherit primary and foreign key constraints. Distribution style, sort keys, BACKUP, and NULL properties are inherited by LIKE tables, but you can't explicitly set them in the CREATE TABLE ... LIKE statement.
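Putting those steps together, a minimal sketch might look like the following. It assumes big_one already contains the two new (still empty) columns, so CREATE TABLE LIKE produces a target with the right shape; the column names in the SELECT list are hypothetical and you would spell out all of big_one's original columns explicitly.
CREATE TABLE big_one_new (LIKE big_one);

INSERT INTO big_one_new
SELECT b.id, b.col_a, b.col_b, /* ...the remaining original columns... */ e.col1, e.col2
FROM big_one b
JOIN extra_one e ON b.id = e.id;

-- verify row counts, then swap the tables
DROP TABLE big_one;
ALTER TABLE big_one_new RENAME TO big_one;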
Related
I have a table which contains more than 30M records, and I need to add two new columns to it. The problem is that I need these columns to be NOT NULL, and without a default value. I thought that I would just add these columns without the NOT NULL constraint, fill them with data, then add the constraint, but Redshift doesn't support that. I have another solution in mind (below), but I wonder if there is a simpler one?
Create the two new columns with NOT NULL and a DEFAULT
Fill the columns with data
Create an empty table with the same columns as the target table (of course, the two new columns would be just NOT NULL)
Insert everything from the target table into the new table
Drop the target table
Rename the new table to the target table's name
I would suggest:
Existing Table-A
Create a new Table-B that contains the new columns, plus an identity column (eg customer_id) that matches Table-A.
Insert data into Table-B (2 columns + identity column)
Use CREATE TABLE AS to simultaneously create a new Table-C (specifying DISTKEY and SORTKEY) while querying Table-A and Table-B via a JOIN on the identity column (see the sketch after this list)
Verify contents of Table-C
VACUUM Table-C (shouldn't be necessary, but just in case, and it should be quick)
Delete Table-A and Table-B
Rename Table-C to desired table name (which was probably the same as Table-A)
In Summary: Existing columns in Table-A + Extra columns in Table-B ➞ Table-C
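A minimal sketch of the CREATE TABLE AS step, assuming customer_id is the identity/join column and that the extra columns in Table-B are called col1 and col2 (hypothetical names):
CREATE TABLE table_c
DISTKEY (customer_id)
SORTKEY (customer_id)
AS
SELECT a.*, b.col1, b.col2
FROM table_a a
JOIN table_b b ON a.customer_id = b.customer_id;

-- after verifying table_c:
-- DROP TABLE table_a;
-- DROP TABLE table_b;
-- ALTER TABLE table_c RENAME TO table_a;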
Reasoning:
UPDATE statements do not run very well in Redshift. An UPDATE requires marking the existing data rows for each column as 'deleted', then appending new rows to the end of each column. Doing lots of UPDATEs will blow out the size of a table and it will become unsorted. It's also relatively slow. You would need to Deep Copy or VACUUM the table afterwards to fix things.
Using CREATE TABLE AS with a JOIN will generate all "final state" data in one query, and the resulting table will be sorted and in a 'clean' state.
The process gives you a chance to verify the content of Table-C before committing to the switchover. Very handy for debugging the process!
See also: Performing a Deep Copy - Amazon Redshift
I have a table and need to create its duplicate in Oracle (including indexes and sequences) as a history of the first one. To create the table with its data I can do this:
create table new_table as select * from original_table;
Of course, this will not create any index, sequence or trigger which the original table has. I can create all of those in a few ways. My question is not how to create them; my question is the following:
Since the newly created table is a copy of the original (the columns are the same), can I use the same index which exists on the original table for the new one too?
An index entry points to the location of a record's data in a particular block. The data in your history table cannot exist in the same location as the data in the original table, so the locations being pointed to will never match. If you research how an index actually works in Oracle, you will see why this is not possible.
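In other words, the history table needs index objects of its own. A minimal sketch, with hypothetical table and column names:
create table original_table_hist as select * from original_table;

-- the original's index cannot be reused; create an equivalent index on the copy
create index original_table_hist_idx on original_table_hist (some_indexed_column);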
I have a table with 32 Million rows and 31 columns in PostgreSQL 9.2.10. I am altering the table by adding columns with updated values.
For example, if the initial table is:
id initial_color
-- -------------
1 blue
2 red
3 yellow
I am modifying the table so that the result is:
id initial_color modified_color
-- ------------- --------------
1 blue blue_green
2 red red_orange
3 yellow yellow_brown
I have code that will read the initial_color column and update the value.
Given that my table has 32 million rows and that I have to apply this procedure on five of the 31 columns, what is the most efficient way to do this? My present choices are:
Copy the column and update the rows in the new column
Create an empty column and insert new values
I could do either option with one column at a time or with all five at once. The column types are either character varying or character.
The column types are either character varying or character.
Don't use character; that's a misunderstanding. varchar is OK, but I would suggest just text for arbitrary character data.
Any downsides of using data type "text" for storing strings?
Given that my table has 32 million rows and that I have to apply this procedure on five of the 31 columns, what is the most efficient way to do this?
If you don't have objects (views, foreign keys, functions) depending on the existing table, the most efficient way is to create a new table. Something like this (details depend on your installation):
BEGIN;
LOCK TABLE tbl_org IN SHARE MODE; -- to prevent concurrent writes
CREATE TABLE tbl_new (LIKE tbl_org INCLUDING STORAGE INCLUDING COMMENTS);
ALTER TABLE tbl_new ADD COLUMN modified_color text
, ADD COLUMN modified_something text;
-- , etc
INSERT INTO tbl_new (<all columns in order here>)
SELECT <all columns in order here>
, myfunction(initial_color) AS modified_color -- etc
FROM tbl_org;
-- ORDER BY tbl_id; -- optionally order rows while being at it.
-- Add constraints and indexes like in the original table here
DROP TABLE tbl_org;
ALTER TABLE tbl_new RENAME TO tbl_org;
COMMIT;
If you have depending objects, you need to do more.
Either way, be sure to add all five columns at once. If you update each in a separate query, you write another row version each time due to the MVCC model of Postgres.
Related cases with more details, links and explanation:
Updating database rows without locking the table in PostgreSQL 9.2
Best way to populate a new column in a large table?
Optimizing bulk update performance in PostgreSQL
While creating a new table you might also order columns in an optimized fashion:
Calculating and saving space in PostgreSQL
Maybe I'm misreading the question, but as far as I know, you have 2 possibilities for creating a table with the extra columns:
CREATE TABLE
This creates a new table; it can be filled either with CREATE TABLE ... AS SELECT ... at creation time, or with a separate INSERT ... SELECT ... afterwards.
Neither variant seems to be what you want, since you described a solution without listing all the fields.
Also, this would require all the data (plus the new fields) to be copied.
ALTER TABLE...ADD ...
This creates the new columns. As I'm not aware of any possibility to reference existing column values during the ALTER, you will need an additional UPDATE ... SET ... to fill in the values.
So I don't see any way to realize a procedure that follows your choice 1.
Nevertheless, copying the (column) data just to overwrite it in a second step would be suboptimal in any case. Altering a table to add new columns does minimal I/O. So even if your choice 1 were possible, following choice 2 promises better performance by a wide margin.
Thus, two statements, one ALTER TABLE adding all your new columns in one go and then one UPDATE providing the new values for those columns, will achieve what you want.
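A minimal sketch of that two-statement approach, assuming a hypothetical table tbl and a hypothetical function derive_color() that computes the new value from the old one:
ALTER TABLE tbl
  ADD COLUMN modified_color text,
  ADD COLUMN modified_shade text;

-- one UPDATE filling all new columns at once
UPDATE tbl
SET modified_color = derive_color(initial_color),
    modified_shade = derive_color(initial_shade);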
Create the new column (modified_color); it will have a value of NULL on all records.
Run an update statement, assuming your table is named 'table':
update table
set modified_color = 'blue_green'
where initial_color = 'blue'
If I am correct, this can also work like this:
update table set modified_color = 'blue_green' where initial_color = 'blue';
update table set modified_color = 'red_orange' where initial_color = 'red';
update table set modified_color = 'yellow_brown' where initial_color = 'yellow';
Once you have done this, you can do another update (assuming you have another column that I will call modified_color1):
update table set modified_color1 = modified_color;
I have a table MyTable with multiple int columns and one column containing a date. The date column has an index created as follows:
CREATE INDEX some_index_name ON MyTable(my_date_column)
because the table will often be queried for its contents within a user-specified date range. The table has no foreign keys pointing to it, nor any indexes other than the primary key, which is an auto-incrementing ID populated by a sequence/trigger.
Now, the issue I have is that the data in this table is often replaced for a given time period because it is out of date. The way it is updated is by deleting all the entries within a given time period and inserting the new ones. The delete is performed using:
DELETE FROM MyTable
WHERE my_date_column >= initialDate
AND my_date_column < endDate
However, because the number of rows deleted is massive (from 5 million to 12 million rows) the program pretty much blocks during the delete.
Is there something I can disable to make the operation faster? Or maybe specify an option in the index to make it faster? I read something about redo space having to do with this but I don't know how to disable it during an operation.
EDIT: The process runs every day and it deletes the last 5 days of data, then it brings the data for those 5 days (which may have changed in the external source) and reinserts the data.
The amount of data deleted is a tiny fraction of the whole amount of data in the table (< 1%). So copying the data I want to keep into another table and dropping and recreating the original may not be the best solution.
I can only think of two ways to speed this up.
If you do this on a regular basis, you should consider partitioning your table by month. Then you just drop the partition for the month you want to delete, which is basically as fast as dropping a table (see the sketch after this list). Partitioning requires an Enterprise Edition license, if I'm not mistaken.
Create a new table with the data you want to keep (using create table new_table as select ...), drop the old table and rename the interim table. This will be much faster, but has the drawback that you need to re-create all indexes and (primary, foreign key) constraints on the new table.
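A minimal sketch of the partitioning option, assuming Oracle 11g or later (for interval partitioning and the DROP PARTITION FOR syntax) and hypothetical table/column names; on older versions you would declare the monthly range partitions explicitly:
CREATE TABLE MyTable_part (
  id             NUMBER,
  my_date_column DATE
  -- ...plus the other int columns...
)
PARTITION BY RANGE (my_date_column)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p_initial VALUES LESS THAN (DATE '2015-01-01'));

-- dropping a whole month of data is then a quick metadata operation:
ALTER TABLE MyTable_part DROP PARTITION FOR (DATE '2015-03-01');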
I have a table (A) in a database that doesn't have a PK; it has about 300k records.
I have a subset copy (B) of that table in another database; it has only 50k records and contains a backup for a given time range (July data).
I want to copy the missing records from table B into table A without duplicating existing records, of course. (I can create a database link to make things easier.)
What strategy can I follow to successfully insert into A the missing rows from B?
These are the table columns:
IDLETIME NUMBER
ACTIVITY NUMBER
ROLE NUMBER
DURATION NUMBER
FINISHDATE DATE
USERID NUMBER
.. 40 extra varchar columns here ...
My biggest concern is the lack of a PK. Can I create something like a hash or a PK using all the columns?
What could be a possible way to proceed in this case?
I'm using Oracle 9i for table A and Oracle XE (10g) for B.
The approximate number of elements to copy is 20,000
Thanks in advance.
If the data volumes are small enough, I'd go with the following:
CREATE DATABASE LINK A CONNECT TO ... IDENTIFIED BY ... USING ....;
INSERT INTO COPY
SELECT * FROM table@A
MINUS
SELECT * FROM COPY;
You say there are about 20,000 to copy, but not how many in the entire dataset.
The other option is to delete the current contents of the copy and insert the entire contents of the original table.
If the full datasets are large, you could go with a hash, but I suspect that it would still try to drag the entire dataset across the DB link to apply the hash in the local database.
As long as no duplicate rows should exist in the table, you could apply a unique or primary key constraint across all columns. If the overhead of such a key/index would be too much to maintain, you could also query the database from your application to see whether the row exists, and only perform the insert if it is absent.
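A minimal sketch of that existence check done in SQL rather than in the application, with hypothetical names (it assumes the database link b_link points at the database holding table B, and spells out only a few of the ~46 columns); note that NOT EXISTS treats NULLs differently from MINUS, so nullable columns would need extra handling:
INSERT INTO table_a
SELECT *
FROM table_b@b_link b
WHERE NOT EXISTS (
  SELECT 1
  FROM table_a a
  WHERE a.userid = b.userid
  AND a.finishdate = b.finishdate
  AND a.activity = b.activity
  -- ...and so on for the remaining columns...
);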