Column Copy and Update vs. Column Create and Insert - sql

I have a table with 32 million rows and 31 columns in PostgreSQL 9.2.10. I am altering the table by adding columns with updated values.
For example, if the initial table is:
id  initial_color
--  -------------
1   blue
2   red
3   yellow
I am modifying the table so that the result is:
id  initial_color  modified_color
--  -------------  --------------
1   blue           blue_green
2   red            red_orange
3   yellow         yellow_brown
I have code that will read the initial_color column and update the value.
Given that my table has 32 million rows and that I have to apply this procedure on five of the 31 columns, what is the most efficient way to do this? My present choices are:
Copy the column and update the rows in the new column
Create an empty column and insert new values
I could do either option with one column at a time or with all five at once. The column types are either character varying or character.

The column types are either character varying or character.
Don't use character; that's almost always a misunderstanding. varchar is OK, but I would suggest just text for arbitrary character data. See:
Any downsides of using data type "text" for storing strings?
Given that my table has 32 million rows and that I have to apply this
procedure on five of the 31 columns, what is the most efficient way to do this?
If you don't have objects (views, foreign keys, functions) depending on the existing table, the most efficient way is to create a new table. Something like this (details depend on your installation):
BEGIN;
LOCK TABLE tbl_org IN SHARE MODE;  -- to prevent concurrent writes

CREATE TABLE tbl_new (LIKE tbl_org INCLUDING STORAGE INCLUDING COMMENTS);
ALTER TABLE tbl_new
    ADD COLUMN modified_color text
  , ADD COLUMN modified_something text;
 -- , etc.

INSERT INTO tbl_new (<all columns in order here>)
SELECT <all columns in order here>
     , myfunction(initial_color) AS modified_color  -- etc.
FROM   tbl_org;
-- ORDER BY tbl_id;  -- optionally order rows while being at it

-- Add constraints and indexes like in the original table here

DROP TABLE tbl_org;
ALTER TABLE tbl_new RENAME TO tbl_org;
COMMIT;
If you have depending objects, you need to do more.
Either way, be sure to add all five at once. If you update each in a separate query, you write another row version each time due to the MVCC model of Postgres.
Related cases with more details, links and explanation:
Updating database rows without locking the table in PostgreSQL 9.2
Best way to populate a new column in a large table?
Optimizing bulk update performance in PostgreSQL
While creating a new table you might also order columns in an optimized fashion:
Calculating and saving space in PostgreSQL

Maybe I'm misreading the question, but as far as I know, you have two possibilities for creating a table with the extra columns:
CREATE TABLE
This would create a new table, which could be filled either at creation time using CREATE TABLE ... AS SELECT ..., or later on using a separate INSERT ... SELECT ....
Neither variant seems to be what you want, as you asked for a solution that does not list all the fields.
Also, this would require all the data (plus the new fields) to be copied.
ALTER TABLE...ADD ...
This creates the new columns. As I'm not aware of any possibility to reference existing column values here, you will need an additional UPDATE ... SET ... to fill in the values.
So I'm not seeing any way to realize a procedure that follows your choice 1.
Nevertheless, copying the (column) data just to overwrite it in a second step would be suboptimal in any case. Adding new columns with ALTER TABLE does minimal I/O. So even if your choice 1 were possible, choice 2 promises better performance by factors.
Thus, two statements, one ALTER TABLE adding all your new columns in one go and then one UPDATE providing the new values for those columns, will achieve what you want.
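A minimal sketch of those two statements, assuming the myfunction() transformation from the answer above and illustrative column names:
ALTER TABLE tbl_org
    ADD COLUMN modified_color text,
    ADD COLUMN modified_size  text;  -- ...add all five columns in this one statement

UPDATE tbl_org
SET    modified_color = myfunction(initial_color),
       modified_size  = myfunction(initial_size);  -- one pass writes all new values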

Create the new column (modified_color); it will have a value of NULL (or blank) on all records.
Then run an update statement, assuming your table name is 'table':
update table
set modified_color = 'blue_green'
where initial_color = 'blue';
If I am correct, this can also work like this:
update table set modified_color = 'blue_green' where initial_color = 'blue';
update table set modified_color = 'red_orange' where initial_color = 'red';
update table set modified_color = 'yellow_brown' where initial_color = 'yellow';
Once you have done this you can do another update (assuming you have another column that I will call modified_color1):
update table set modified_color1 = modified_color;
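As a hedged variant (same illustrative table name as above), the three separate updates can also be collapsed into a single pass with CASE:
update table
set modified_color = case initial_color
                         when 'blue'   then 'blue_green'
                         when 'red'    then 'red_orange'
                         when 'yellow' then 'yellow_brown'
                     end
where initial_color in ('blue', 'red', 'yellow');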

Related

Querying a SQL table and only transferring updated rows to a different database

I have a database table which constantly gets updated. I am looking to query only the changes/additions that have been made to rows with a specific attribute in a column, e.g. get the rows which have been changed/added whose 'description' column is "xyz". My end goal is to copy these rows to another table in another database. Is this even possible? The reason for not just querying and overwriting the rows in the other database is to avoid inefficiency.
What have I tried so far?
I am able to run a select query on the table to get the rows, but it gives me all the rows, not just the ones that have been changed or recently added. If I add these rows to the table in the other database, the only option I have is to overwrite the rows.
The log table logs the changes in a table, but I can't add additional filters in SQL to tell me which of these changes are associated with a 'description' column value of 'xyz'.
Write your update statements to make use of OUTPUT to capture the before and after values and log them to a table of your choice.
Here is a really simple update example that uses OUTPUT to store the RowID and the before and after values of the ActivityType column:
DECLARE @MyTableVar table (
    SummaryBefore nvarchar(max),
    SummaryAfter  nvarchar(max),
    RowID         int
);

UPDATE DBA.dbo.dtest
SET    ActivityType = 3
OUTPUT deleted.ActivityType,
       inserted.ActivityType,
       inserted.RowID
INTO   @MyTableVar;

SELECT * FROM @MyTableVar;
You can do it in two ways:
Add date fields/columns like update_time and/or create_time (they can be defaulted if needed). These fields indicate the status of the record. You need to save your previous_run_time; your select query then looks for records with update_time/create_time greater than previous_run_time, and you can move those records to the new DB (see the sketch after this list).
Turn on CDC (Change Data Capture) on the source table, which is available by default in SQL Server, and then move only those records that have been impacted.
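A hedged sketch of the first approach; the table and column names here are illustrative, and @previous_run_time is assumed to be persisted between transfer runs:
DECLARE @previous_run_time datetime = '2024-01-01';  -- load from wherever you persist it

SELECT *
FROM   dbo.SourceTable
WHERE  [description] = 'xyz'
  AND  (update_time > @previous_run_time
        OR create_time > @previous_run_time);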

How to update numerical column of one table based on matching string column from another table in SQL

I want to update numerical columns of one table based on matching string columns from another table, i.e.:
I have a table (let's say table1) with 100 records containing 5 string (or text) columns and 10 numerical columns. Now I have another table (table2) with the same structure (columns) and 20 records. In it, a few records contain updated data of table1, i.e. the numerical column values are updated for those records, and the rest are new (both text and numerical columns).
I want to update the numerical columns for records with the same text columns (in table1), and insert the new data from table2 into table1 where the text columns are also new.
I thought of taking an intersect of these two tables and then updating, but couldn't figure out the logic of how to update the numerical columns.
Note: I don't have any primary or unique key columns.
Please help here.
Thanks in advance.
The simplest solution would be to use two separate queries, such as:
UPDATE b
SET    b.[NumericColumn] = a.[NumericColumn],
       etc...
FROM   [dbo].[SourceTable] a
JOIN   [dbo].[DestinationTable] b
  ON   a.[StringColumn1] = b.[StringColumn1]
 AND   a.[StringColumn2] = b.[StringColumn2] etc...

INSERT INTO [dbo].[DestinationTable] (
    [NumericColumn],
    [StringColumn1],
    [StringColumn2],
    etc...
)
SELECT a.[NumericColumn],
       a.[StringColumn1],
       a.[StringColumn2],
       etc...
FROM   [dbo].[SourceTable] a
LEFT JOIN [dbo].[DestinationTable] b
  ON   a.[StringColumn1] = b.[StringColumn1]
 AND   a.[StringColumn2] = b.[StringColumn2] etc...
WHERE  b.[NumericColumn] IS NULL
-- assumes that [NumericColumn] is non-nullable.
-- If there are no non-nullable columns then you
-- will have to structure your query differently
This will be effective if you are working with a small dataset that does not change very frequently and you are not worried about high contention.
There are still a number of issues with this approach, most notably what happens if either the source or destination table is accessed and/or modified while the update statement is running. Some of these issues can be worked around in other ways, but so much depends on the context of how the tables are used that it is difficult to provide a more effective, generically applicable solution.
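Under the same assumptions (the string columns form the match key, placeholder names as above), the two statements could also be combined into a single MERGE on SQL Server; note this is only a sketch, and MERGE has concurrency caveats of its own:
MERGE [dbo].[DestinationTable] AS b
USING [dbo].[SourceTable] AS a
   ON a.[StringColumn1] = b.[StringColumn1]
  AND a.[StringColumn2] = b.[StringColumn2]
WHEN MATCHED THEN
    UPDATE SET b.[NumericColumn] = a.[NumericColumn]
WHEN NOT MATCHED THEN
    INSERT ([NumericColumn], [StringColumn1], [StringColumn2])
    VALUES (a.[NumericColumn], a.[StringColumn1], a.[StringColumn2]);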

Update column in Amazon Redshift with join for big tables

I have a 500M-row, 30-column table (with a bigint ID column); let's call it big_one.
Also, I have another table, extra_one, with the same number of rows and the same ID column, but two new columns with extra data that I'd like to include in the first table.
I added two extra columns into the first table and want to update the data based on join.
Query is quite easy:
update big_one
set    col1 = extra_one.col1,
       col2 = extra_one.col2
from   extra_one
where  big_one.id = extra_one.id;
But during execution the disk space usage dramatically increased, up to 100%. Before the start I had 23.41% of free space on 4 nodes (160GB each, 640GB total). The big_one table initially used about 18% of the space. This 23.41% indicates that I had about 490GB of free disk space to perform updates smoothly. But Redshift thinks differently.
The two new columns are MD5 hashes (so they're 32 characters long); ideally they should take up to 16GB of space.
Recap:
I have a wide table big_one.
Have another table, extra_one (3 columns total), with the same IDs and number of records.
I added two new columns to big_one.
I want to enrich big_one with data from extra_one (into those 2 new columns).
Q1: Any advice on how to perform such big updates?
Q2: If I create a VIEW that joins the two tables and then use it, will that save me from such space-drain situations? How does Redshift handle (non-materialized) VIEWs in such cases?
Do not use UPDATE on a large number of rows.
When a row is modified in Amazon Redshift, the existing row is marked as deleted and a new row is appended to the table. This effectively doubles the size of the table and wastes a lot of disk space until the table is vacuumed. It is also very slow!
Instead:
Create a query that JOINs the two tables
Use the query to populate a new table (see below)
Delete the old table and rename the new table so that it replaces the original table (or, truncate the original table and copy the data back into it)
You can use CREATE TABLE LIKE to create a new, empty table based on an existing table.
From CREATE TABLE - Amazon Redshift:
LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ]
A clause that specifies an existing table from which the new table automatically copies column names, data types, and NOT NULL constraints. The new table and the parent table are decoupled, and any changes made to the parent table aren't applied to the new table. Default expressions for the copied column definitions are copied only if INCLUDING DEFAULTS is specified. The default behavior is to exclude default expressions, so that all columns of the new table have null defaults.
Tables created with the LIKE option don't inherit primary and foreign key constraints. Distribution style, sort keys, BACKUP, and NULL properties are inherited by LIKE tables, but you can't explicitly set them in the CREATE TABLE ... LIKE statement.
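A minimal sketch of those steps, using the table names from the question (the full column list of big_one is elided in comments):
BEGIN;

CREATE TABLE big_one_new (LIKE big_one);  -- empty copy: column names, types, NOT NULL

INSERT INTO big_one_new (id, /* ...the other columns..., */ col1, col2)
SELECT b.id, /* ...the other columns..., */ e.col1, e.col2
FROM   big_one b
JOIN   extra_one e ON b.id = e.id;

DROP TABLE big_one;
ALTER TABLE big_one_new RENAME TO big_one;

COMMIT;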

Swapping columns in a table to match formatting of another table prior to row insertion

I want to swap columns within Visual FoxPro 9 in table_1 before inserting its rows into table_2, so as to avoid data losses caused by datatype variations. I tried these two options based on other solutions on Stack Overflow, but I get syntax error messages for both command inputs. The name field is of datatype character(5) and it needs to be after the subdir field.
ALTER table "f:\csp" modify COLUMN name character(5) after subdir
ALTER table "f:\csp" change COLUMN name name character(5) after subdir
I attempted these commands based on solutions here:
How to move columns in a MySQL table?
You never need to change the column order, and you should never rely on column order to do something.
For inserting into another table from this one, you could simply select the columns in the order you desire (the column names do not even need to be the same in the case of INSERT ... SELECT ...), i.e.:
insert into table_2 (subdir, name) ;
select subdir, name from table_1
Another way is to use xBase commands like:
select table_2
append from table_1
In the latter case, VFP matches on column names.
All in all, relying on column ordering is dangerous. If you really want to do it, you still can, in a number of ways. One of them is to select all the data into a temp table, recreate the table in the order you want, and fill it back from the temp table (which might not be as easy as it sounds if there are existing dependencies such as referential integrity; you also need to recreate the indexes).

Updating rows in order with SQL

I have a table with 4 columns. The first column is unique for each row, but it's a string (URL format).
I want to update my table, but instead of using "WHERE", I want to update the rows in order.
The first query will update the first row, the second query updates the second row and so on.
What's the SQL code for that? I'm using SQLite.
Edit: My table schema
CREATE TABLE mytable (
    url    varchar(150),
    views  int(5),
    clicks int(5)
);
Edit 2: What I'm doing right now is a loop of SQL queries:
update mytable set views = 5, clicks = 10 where url = 'http://someurl.com';
There are around 4 million records in the database. It takes around 16 seconds on my server to run the updates. Since the loop updates the rows in order (the first query updates the first row), I'm wondering whether updating the rows in order could be faster than using the WHERE clause, which needs to scan the 4 million rows.
You can't do what you want without using WHERE, as this is the only way to select rows from a table for reading, updating or deleting. So you will want to use:
UPDATE table SET url = ... WHERE url = '<whatever>'
HOWEVER... SQLite has an extra feature: the autogenerated ROWID column. You can use this column in queries. You don't see this data by default, so if you want the data within it you need to explicitly request it, e.g.:
SELECT ROWID, * FROM table
What this means is that you may be able to do what you want referencing this column directly:
UPDATE table SET url = ... WHERE ROWID = 1
You still need to use the WHERE clause, but this allows you to access the rows in insert order without doing anything else.
CAVEAT
ROWID effectively stores the INSERT order of the rows. If you delete rows from the table, the ROWIDs of the remaining rows will NOT change, so it is possible to have gaps in the ROWID sequence. This is by design, and there is no workaround short of re-creating the table and re-populating the data.
PORTABILITY
Note that this only applies to SQLite; you may not be able to do the same thing with other SQL engines should you ever need to port this. It would be MUCH better to add an EXPLICIT auto-number column (aka an IDENTITY field) that you can use and manage, as sketched below.
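A hedged sketch of such an explicit column, using the schema from the question (in SQLite, an INTEGER PRIMARY KEY column is an alias for ROWID):
CREATE TABLE mytable (
    id     INTEGER PRIMARY KEY,  -- explicit, stable row identifier (aliases ROWID)
    url    varchar(150),
    views  int(5),
    clicks int(5)
);

UPDATE mytable SET views = 5, clicks = 10 WHERE id = 1;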