I have a table that contains more than 30M records, and I need to add two new columns to it. The problem is that I need these columns to be NOT NULL, without a default value. I thought I would just add the columns without the NOT NULL constraint, fill them with data, and then add the constraint, but Redshift doesn't support that. I have another solution in mind, but I wonder if there is any simpler solution than this?
Creating the two new columns with NOT NULL and a DEFAULT
Filling the columns with data
Creating an empty table with the same columns as the target table (of course, the two new columns would be just NOT NULL)
Inserting everything from the target table into the new table
Dropping the target table
Renaming the new table to the target name
I would suggest:
Existing Table-A
Create a new Table-B that contains the new columns, plus an identity column (eg customer_id) that matches Table-A.
Insert data into Table-B (2 columns + identity column)
Use CREATE TABLE AS to simultaneously create a new Table-C (specifying DISTKEY and SORTKEY) while querying Table-A and Table-B via a JOIN on the identity column
Verify contents of Table-C
VACUUM Table-C (shouldn't be necessary, but just in case, and it should be quick)
Delete Table-A and Table-B
Rename Table-C to desired table name (which was probably the same as Table-A)
In Summary: Existing columns in Table-A + Extra columns in Table-B ➞ Table-C
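A minimal sketch of those steps, assuming hypothetical names (table_a, table_b, table_c, customer_id, new_col1, new_col2) and a matching DISTKEY/SORTKEY; adjust the types and keys to your schema:

-- Table-B: the identity column plus the two new NOT NULL columns
CREATE TABLE table_b (
    customer_id BIGINT      NOT NULL,
    new_col1    VARCHAR(32) NOT NULL,
    new_col2    VARCHAR(32) NOT NULL
);

-- load table_b here (COPY or INSERT ... SELECT)

-- Build Table-C in one pass, already distributed and sorted
CREATE TABLE table_c
DISTKEY (customer_id)
SORTKEY (customer_id)
AS
SELECT a.*, b.new_col1, b.new_col2
FROM table_a a
JOIN table_b b ON a.customer_id = b.customer_id;

-- After verifying table_c:
DROP TABLE table_a;
DROP TABLE table_b;
ALTER TABLE table_c RENAME TO table_a;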
Reasoning:
UPDATE statements do not run very well in Redshift. They require marking the existing data rows for each column as 'deleted' and then appending new rows to the end of each column. Doing lots of UPDATEs will blow out the size of a table and leave it unsorted. It is also relatively slow. You would need to deep copy or VACUUM the table afterwards to fix things.
Using CREATE TABLE AS with a JOIN will generate all "final state" data in one query, and the resulting table will be sorted and in a 'clean' state.
The process gives you a chance to verify the content of Table-C before committing to the switchover. Very handy for debugging the process!
See also: Performing a Deep Copy - Amazon Redshift
Related
I have a table with 500M rows and 30 columns (including a bigint ID column); let's call it big_one.
I also have another table, extra_one, with the same number of rows and the same ID column, but two new columns holding extra data that I'd like to include in the first table.
I added the two extra columns to the first table and want to populate them based on a join.
The query is quite simple:
update big_one set
    col1 = extra_one.col1,
    col2 = extra_one.col2
from extra_one
where big_one.id = extra_one.id;
But during execution, disk space usage increased dramatically, up to 100%. Before the start, disk usage was at 23.41% across the 4 nodes (160GB each, 640GB total), and the big_one table initially used about 18% of the space. That 23.41% usage means I had about 490GB of free disk space, which should be enough to perform the updates smoothly. But Redshift thinks differently.
The two new columns are MD5 hashes (so they're 32 characters long); ideally they should take up to about 16GB of space.
Recap:
I have a wide table big_one.
I have another table, extra_one (3 columns total), with the same IDs and number of records.
I added two new columns to big_one.
I want to enrich big_one with data from extra_one (into those 2 new columns).
Q1: Any advice on how to perform such big updates?
Q2: If I create a VIEW that joins the two tables and then use it, will that save me from such space-drain situations? How does Redshift handle (non-materialized) VIEWs in such cases?
Do not use UPDATE on a large number of rows.
When a row is modified in Amazon Redshift, the existing row is marked as Deleted and a new row is appended to the table. This effectively doubles the size of the table and wastes a lot of disk space until the table is Vacuumed. It is also very slow!
Instead:
Create a query that JOINs the two tables
Use the query to populate a new table (see below)
Delete the old table and rename the new table so that it replaces the original table (or, truncate the original table and copy the data back into it)
You can use CREATE TABLE LIKE to create a new, empty table based on an existing table.
From CREATE TABLE - Amazon Redshift:
LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ]
A clause that specifies an existing table from which the new table automatically copies column names, data types, and NOT NULL constraints. The new table and the parent table are decoupled, and any changes made to the parent table aren't applied to the new table. Default expressions for the copied column definitions are copied only if INCLUDING DEFAULTS is specified. The default behavior is to exclude default expressions, so that all columns of the new table have null defaults.
Tables created with the LIKE option don't inherit primary and foreign key constraints. Distribution style, sort keys, BACKUP, and NULL properties are inherited by LIKE tables, but you can't explicitly set them in the CREATE TABLE ... LIKE statement.
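A hedged sketch of that approach for the big_one/extra_one case above, assuming (for illustration only) that big_one has columns (id, payload, col1, col2), where col1 and col2 are the two new columns; in practice you would list all of big_one's columns:

-- New, empty table with the same structure and NOT NULL constraints
CREATE TABLE big_one_new (LIKE big_one);

-- Populate it in one pass, taking the new columns from extra_one
INSERT INTO big_one_new (id, payload, col1, col2)
SELECT b.id, b.payload, e.col1, e.col2
FROM big_one b
JOIN extra_one e ON b.id = e.id;

-- After verifying row counts and contents:
DROP TABLE big_one;
ALTER TABLE big_one_new RENAME TO big_one;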
What is the statement to alter a table holding about 10 million rows, adding a GUID column that will hold a unique identifier for each row (without being part of the PK)?
What datatype should the globally unique identifier column be?
Is there a procedure that creates it?
How will it be auto-incremented or generated every time a new record is inserted?
Break it down into separate stages.
First, we need a new column:
alter table MyTable
add guid_column raw(32) default sys_guid();
Then update the existing rows:
update MyTable
set guid_column = sys_guid();
Use the identity columns feature of Oracle 12c to add a column to the table that auto-increments as new rows are added.
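A minimal sketch of that, assuming the table is called MyTable and naming the new column id_column (hypothetical); note that Oracle populates the existing rows when the identity column is added, which can take a while on a large table:

-- Oracle 12c+: add an auto-incrementing identity column to an existing table
alter table MyTable
  add id_column number generated always as identity;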
An ideal way to handle this task is to:
a) CREATE a "new" table with a structure similar to the source table using CREATE TABLE AS (a CTAS statement) with a new "identity column", instead of adding the identity column with an ALTER statement on the existing table (see the sketch after this list).
b) CTAS works faster than running ALTER on the existing table.
c) After confirming that the "new" table has all the data from the source table, along with a column containing unique values, and that all the indexes and constraints are in place, you can drop the original table.
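A hedged sketch of option (a), adapted to the GUID case from the question (the unique value is generated in the SELECT itself); MyTable_new is a placeholder name:

create table MyTable_new as
select sys_guid() as guid_column, t.*
from MyTable t;

-- Then recreate the indexes and constraints on MyTable_new, verify its contents,
-- drop MyTable, and rename MyTable_new to MyTable.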
Another way, which avoids having to recreate the original table's constraints and indexes on the new table afterwards, is to create an empty table with all the constraints, indexes, and the identity column up front, and have the DBA extract the data from the original table and import it into the "new" table.
Benefits:
This approach ensures that none of the objects dependent on the source table become INVALID, which would otherwise hamper some features of the application(s).
Here is the scenario: I have two databases (A & B) with the same schema but different records. I'd like to transfer B's data into the corresponding tables in DB A.
Let's say we have tables named Question and Answer in both databases. DB A contains 10 records in the Question table and 30 in the Answer table. Both tables have an identity column Id starting at 1 (auto-incrementing), and there is a one-to-many relation between Question and Answer.
In DB B, we have 5 entries in the Question table and 20 in Answer.
My requirement is to copy the data of both tables from the source DB B into the destination DB A without any conflict in the identity column, while maintaining the relation between the two tables during the transfer.
Any solution or potential workaround would be highly appreciated.
I will not write out the full SQL here, but here is what I think can be done (a rough sketch follows the steps below). Make sure to use SET IDENTITY_INSERT ON and OFF.
Take the max IDs of both tables from DB A, e.g. A_maxidquestion and A_maxidanswer.
Select from B's Question table. In the select list, add a derived column QuestionID + A_maxidquestion. This will be your new ID.
Select from B's Answer table. In the select list, add a derived column AnswerID + A_maxidanswer, and for the FK use QuestionID + A_maxidquestion.
Note: make sure the destination table is not being used by any other process for inserting values while you are inserting.
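A hedged sketch of those steps in T-SQL, assuming the databases are named A and B and the tables are Question(Id, QuestionText) and Answer(Id, QuestionId, AnswerText); the non-key column names are placeholders:

DECLARE @A_maxidquestion INT, @A_maxidanswer INT;
SELECT @A_maxidquestion = MAX(Id) FROM A.dbo.Question;
SELECT @A_maxidanswer   = MAX(Id) FROM A.dbo.Answer;

-- Parents: shift B's IDs past A's current maximum
SET IDENTITY_INSERT A.dbo.Question ON;
INSERT INTO A.dbo.Question (Id, QuestionText)
SELECT Id + @A_maxidquestion, QuestionText
FROM B.dbo.Question;
SET IDENTITY_INSERT A.dbo.Question OFF;

-- Children: shift their own IDs and the FK by the same offsets
SET IDENTITY_INSERT A.dbo.Answer ON;
INSERT INTO A.dbo.Answer (Id, QuestionId, AnswerText)
SELECT Id + @A_maxidanswer, QuestionId + @A_maxidquestion, AnswerText
FROM B.dbo.Answer;
SET IDENTITY_INSERT A.dbo.Answer OFF;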
One of the best approaches to something like this is to use the OUTPUT clause: https://learn.microsoft.com/en-us/sql/t-sql/queries/output-clause-transact-sql?view=sql-server-2017. You can insert the new parent rows and capture the newly inserted identity values, which you can then use to insert the children.
You can do this set-based if you also include a temp table that holds the original identity value and the new identity value.
Without details of the tables, that is the best I can do.
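A hedged sketch of that set-based pattern, with the same hypothetical names as above; MERGE is used because INSERT ... OUTPUT cannot reference source columns, whereas MERGE's OUTPUT clause can, which lets us capture the old-to-new ID mapping:

DECLARE @IdMap TABLE (OldQuestionId INT, NewQuestionId INT);

-- Insert the parents and capture old -> new identity values
MERGE A.dbo.Question AS t
USING B.dbo.Question AS s
ON 1 = 0   -- never matches, so every source row is inserted
WHEN NOT MATCHED THEN
    INSERT (QuestionText) VALUES (s.QuestionText)
OUTPUT s.Id, inserted.Id INTO @IdMap (OldQuestionId, NewQuestionId);

-- Insert the children using the captured mapping
INSERT INTO A.dbo.Answer (QuestionId, AnswerText)
SELECT m.NewQuestionId, b.AnswerText
FROM B.dbo.Answer AS b
JOIN @IdMap AS m ON m.OldQuestionId = b.QuestionId;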
I have been asked to look into a manual process that one of my colleagues is completing every now and again.
He sometimes needs to add a new column to a large table (200 million rows), and it is taking him more than an hour to do this. Before you ask: yes, the columns are nullable, but sometimes the new column will end up with 90% of its rows populated.
Instead of adding a new column to the existing table, he...
Creates a new table
Selects * from the old table (inserting into the new one)
Adds the new column as part of his script
Then he deletes the old table, renames the new table back to the original name, adds the indexes, and then compresses. He says it's much quicker that way.
If this is the best way, then I will try to write an SSIS package to make the process more seamless.
Any advice is welcome!
Thanks
Creating a new table, moving all the data to it, and deleting the old table is fine for a small amount of data, and you can do it with the wizard in SQL Server. But it is the worst way to solve this problem for millions of rows.
For a large amount of data (millions of records) you should use ALTER TABLE:
Alter Table MyTable
ADD NewColumn nvarchar(10) null
The new column will be added to the table as the last column.
If you use this script it takes less than a second, because no data is moved; you just add a new column to the table.
But if you use the wizard method you mentioned with millions of records, it takes hours.
As Ali says:
alter Table MyTable
ADD NewColumn nvarchar(10) null
But then you need to fill in the 90% of data. Since he already has a table containing the values and the key he joins on in the copy, this is all he needs:
UPDATE MyTable
SET [NewColumn] = b.[NewColumn]
FROM MyTable a INNER JOIN NewColumnTable b ON a.[KeyField] = b.[KeyField]
This would be a lot quicker. You could do it in SSIS, but even if this happens a lot, it's not really worth it for a few lines of SQL.
I am planning an incremental load into a warehouse (especially for updates to source tables in an RDBMS).
I am capturing the updated rows in staging tables from the RDBMS based on the update datetime. But how do I determine which columns of a particular row need to be updated in the target warehouse tables?
Or do I just delete the particular row in the warehouse table (based on the primary key of the row in the staging table) and insert the new, updated row?
What is the best way to implement the incremental load between the RDBMS and the warehouse using PL/SQL and SQL?
In my opinion, the easiest way to accomplish this is as follows:
Create a stage table identical to your host table. When you do your incremental/net-change load, load all changed records into this table (based on whatever your "last updated" field is).
Delete the records from your actual table based on the primary key. For example, if your primary key is (customer, part), the query might look like this:
delete from main_table m
where exists (
select null
from stage_table s
where
m.customer = s.customer and
m.part = s.part
);
Insert the records from the stage table into the main table.
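The insert step is then just (assuming the stage and main tables have identical column lists):

insert into main_table
select * from stage_table;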
You could also do an update of existing records / insert of new records, but either way that's two steps. The advantage of the method I listed is that it will work even if your tables have partitions and the newly updated data violates one of the original partition rules, whereas an update would not handle that. Also, the syntax is much simpler, as your update would have to list every single field, whereas the delete from / insert into approach lets you list only the primary key fields.
Oracle also has a MERGE statement that will update a row if it exists or insert it if it does not. I honestly don't know how that would be affected if you had partitions.
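A hedged sketch of that MERGE, reusing the (customer, part) key from above; col1 and col2 stand in for the non-key columns:

merge into main_table m
using stage_table s
on (m.customer = s.customer and m.part = s.part)
when matched then
    update set m.col1 = s.col1, m.col2 = s.col2
when not matched then
    insert (customer, part, col1, col2)
    values (s.customer, s.part, s.col1, s.col2);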
One major caveat: if your updates include deletes, i.e. records that need to be removed from the main table, none of these approaches will handle that, and you will need some other way to deal with it. It may not be necessary, depending on your circumstances, but it's something to consider.