Adding new columns to a column database having billions of rows - sql

I want to add new columns to a table which already contains billions of rows. The new columns are derived from existing columns.
For example,
new_col1 = old_col1 + old_col2
new_col2 = old_col1 / old_col2
I am trying to do this in the following way -
Add new columns
ALTER TABLE table_name
ADD ( column_1 column_definition,
column_2 column_definition,
...
column_n column_definition )
Read rows one by one from the table and fill in the values for the new columns.
There is no primary key in the database, so I cannot refer to an individual row. To read rows one by one, I have to do a SELECT *, which would give a huge result set (considering billions of records).
Is there any better way to do this?

Different DBMSs have different SQL dialects, so it is useful to specify which one you are using in the question.
In SQL Server you could use a computed column, but that would calculate the result every time you select the data. You could flag it as PERSISTED, although it may take a while to make that change on a table this size. However, you can't use a computed column at all if you are going to remove the old columns.
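For illustration, a minimal sketch of persisted computed columns, reusing the hypothetical column names from the question (the NULLIF guard against division by zero is an extra assumption, not part of the question):

ALTER TABLE table_name
    ADD new_col1 AS (old_col1 + old_col2) PERSISTED,
        new_col2 AS (old_col1 / NULLIF(old_col2, 0)) PERSISTED;  -- NULLIF avoids a divide-by-zero error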
Alternatively create the new column allowing nulls and then update it in batches
UPDATE TOP (1000) table_name SET new_col1 = old_col1 + old_col2 WHERE new_col1 IS NULL
Again, the query is for SQL Server, but there will be alternatives for your DBMS.
Also read Mr Hooper's comment about adding an index to the new column to make sure that the performance of the UPDATE doesn't get worse as more data is added. The update is a read and write operation; the index will speed up the reads and slightly delay the writes (maintaining the index), but it should be worthwhile.

I think Mr Diver's method would be fine if you also added an index on one of your new columns; otherwise, as the job progresses, it will have to do more and more scanning to find the rows it hasn't already updated. Adding an index means it doesn't have to do that. A possible drawback is that the index's selectivity will be dreadful when the column is first created (every value is NULL), but I don't think that would be a problem as you only care about NULL or NOT NULL. You can drop the index when the update is complete.
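A sketch of what that could look like in SQL Server, using the hypothetical names from the question (the index name is made up):

CREATE INDEX IX_table_name_new_col1 ON table_name (new_col1);
-- ... run the batched UPDATEs until no rows remain with new_col1 IS NULL ...
DROP INDEX IX_table_name_new_col1 ON table_name;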

Use a stored procedure that updates rows in batches of, say, 100, and schedule it as a job that runs every 30 seconds or so.
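A minimal sketch of such a procedure, assuming SQL Server and the hypothetical names used above (the procedure name is made up; the job itself would be scheduled via SQL Server Agent):

CREATE PROCEDURE dbo.Fill_New_Columns_Batch
AS
BEGIN
    SET NOCOUNT ON;
    -- update one small batch per call; rows already filled in are skipped
    UPDATE TOP (100) table_name
    SET new_col1 = old_col1 + old_col2,
        new_col2 = old_col1 / old_col2   -- consider NULLIF(old_col2, 0) if zeros are possible
    WHERE new_col1 IS NULL;
END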

Related

Updating a very large Oracle table

I have a very large table, people, with 60M rows indexed on id. I wish to populate a field newid for every record based on a lookup table, id_conversion (1M rows), which contains id and newid and is indexed on id.
When I run
update people p set p.newid=(select l.newid from id_conversion l where l.id=p.id)
it runs for an hour or so and then I get an archiver error (ORA-00257).
Any suggestions for either running the update in sections or a better SQL command?
If your update statement hits every single row of the table, then to avoid writing to Oracle's undo log you are likely better off running a CREATE TABLE AS SELECT query, which bypasses undo logging; that logging of the impact across 60 million rows is likely the issue you're running into. You can then drop the old table and rename the new table to the old table's name.
Something like:
create table new_people as
select l.newid,
p.col2,
p.col3,
p.col4,
p.col5
from people p
join id_conversion l
on p.id = l.id;
drop table people;
-- rebuild any constraints and indexes
-- from old people table to new people table
alter table new_people rename to people;
For reference, read some of the tips here: http://www.dba-oracle.com/t_efficient_update_sql_dml_tips.htm
If you are basically creating a new table rather than just updating some of the rows of a table, this will likely prove to be the faster method.
I doubt you will be able to get this to run in seconds. Your query, as written, needs to update all 60 million rows.
My first advice is to add an index on id_conversion(id, newid), to make the subquery more efficient. If that doesn't help, then doing the update in batches might be the best way to go.
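For example (the index name is hypothetical):

create index id_conversion_id_newid on id_conversion (id, newid);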
I should add: because you are updating all the rows, it might be faster to take the following approach:
Copy the data into a new table with the new values.
Truncate the original table.
Insert the new data into the old table.
Inserts are faster than updates.
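A rough sketch of that approach in Oracle, reusing the hypothetical column names (col2 through col5) from the earlier answer:

create table people_copy as
select p.id, l.newid, p.col2, p.col3, p.col4, p.col5
  from people p
  join id_conversion l on l.id = p.id;

truncate table people;

insert /*+ append */ into people (id, newid, col2, col3, col4, col5)
select id, newid, col2, col3, col4, col5
  from people_copy;

drop table people_copy;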
In addition to the answers above, which will probably work better in this case, you should know about the MERGE statement:
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_9016.htm
It is used for updating one table according to another table and is far faster than an update based on a SELECT statement.
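A minimal example of MERGE for this case:

merge into people p
using id_conversion l
   on (p.id = l.id)
 when matched then
      update set p.newid = l.newid;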

Copying timestamp columns within a Postgres table

I have a table with about 30 million rows in a Postgres 9.4 database. This table has 6 columns: the primary key id, two text columns, one boolean, and two timestamps. There are indexes on one of the text columns and, obviously, on the primary key.
I want to copy the values in the first timestamp column, call it timestamp_a into the second timestamp column, call it timestamp_b. To do this, I ran the following query:
UPDATE my_table SET timestamp_b = timestamp_a;
This worked, but it took an hour and 15 minutes to complete, which seems like a really long time to me considering that, as far as I know, it's just copying values from one column to another.
I ran EXPLAIN on the query and nothing seemed particularly inefficient. I then used pgtune to modify my config file, most notably it increased the shared_buffers, work_mem, and maintenance_work_mem.
I re-ran the query and it took essentially the same amount of time, actually slightly longer (an hour and 20 mins).
What else can I do to improve the speed of this update? Is this just how long it takes to write 30 million timestamps into Postgres? For context, I'm running this on a MacBook Pro (OS X, quad-core, 16 GB of RAM).
The reason this is slow is that internally PostgreSQL doesn't update the field. It actually writes new rows with the new values. This usually takes a similar time to inserting that many values.
If there are indexes on any columns, this can slow the update down further, even if they're not on the columns being updated, because PostgreSQL has to write a new row and new index entries pointing to that row. HOT updates can help and will happen automatically where possible, but they generally only help if the table is subject to lots of small updates, and they're disabled if any of the fields being updated are indexed.
Since you're basically rewriting the table, if you don't mind locking out all concurrent users while you do it you can do it faster with:
BEGIN
DROP all indexes
UPDATE the table
CREATE all indexes again
COMMIT
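A concrete sketch of those steps, assuming a single hypothetical index my_table_text_idx on one of the text columns (text_col is a placeholder):

BEGIN;
DROP INDEX my_table_text_idx;                           -- hypothetical index name
UPDATE my_table SET timestamp_b = timestamp_a;
CREATE INDEX my_table_text_idx ON my_table (text_col);  -- recreate afterwards
COMMIT;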
PostgreSQL also has an optimisation for writes to tables that've just been TRUNCATEd, but to benefit from that you'd have to copy the data to a temp table, then TRUNCATE and copy it back. So there's no benefit.
@Craig mentioned an optimization for COPY after TRUNCATE: Postgres can skip WAL entries because, if the transaction fails, nobody will ever have seen the new table anyway.
The same optimization is true for tables created with CREATE TABLE AS:
What causes large INSERT to slow down and disk usage to explode?
Details are missing in your description, but if you can afford to write a new table (no concurrent transactions get in the way, no dependencies), then the fastest way might be the following (except if you have big TOAST table entries, basically big columns):
BEGIN;
LOCK TABLE my_table IN SHARE MODE; -- only for concurrent access
SET LOCAL work_mem = '???? MB'; -- just for this transaction
CREATE TABLE my_table2 AS
SELECT ..., timestamp_a, timestamp_a AS timestamp_b
-- columns in order, timestamp_a overwrites timestamp_b
FROM my_table
ORDER BY ??; -- optionally cluster table while being at it.
DROP TABLE my_table;
ALTER TABLE my_table2 RENAME TO my_table;
ALTER TABLE my_table
  ADD CONSTRAINT my_table_id_pk PRIMARY KEY (id);
-- more constraints, indices, triggers?
-- recreate views etc. if any
COMMIT;
The additional benefit: a pristine (optionally clustered) table without bloat. Related:
Best way to populate a new column in a large table?

Overwrite ONE column in select statement with a fixed value without listing all columns

I have a number of tables with a large number of columns (> 100) in a SQL Server database. In some cases when selecting (using views) I need to replace exactly ONE of the columns with a fixed result value instead of the data from the row(s).
Is there a way to use something like
select table.*, 'value' as Column1 from table
if Column1 is a column name within the table?
Of course I can list all the columns which are expected in the result in the SELECT statement, replacing the one with a value.
However, this is very inconvenient, and with 3 or 4 such views I have to maintain them all whenever columns are added to or removed from the table.
Nope, you have to specify columns in this case.
And you have much more serious problems if your tables are being changed often; this may be a signal of large architectural defects.
Anyway, listing all columns instead of using * is good practice, because if the number of columns changes, * may cause cascading errors.
As other responses have noted, this can't be done in a single statement. There is a workaround, however, which is not perfect but does circumvent the need to list columns manually: save your initial, unmodified query to a temp table, update the column(s) you need to overwrite, then select the results:
--we're going to use a temp table; make sure it doesn't already exist
if (object_id('tempdb..#tmpTbl') is not null)
drop table #tmpTbl
--initial query to retrieve all the columns
select *
into #tmpTbl
from TblWithManyColumns
--update column(s) from another table or query
update #tmpTbl
set ColToBeReplaced = trv.ColWithReplacementValue
from #tmpTbl t
join TableWithReplacementValue trv
on trv.KeyCol = t.KeyCol
--where trv.FilterCol = #FilterVal -- if needed
--this select contains the final output data
select * from #tmpTbl
drop table #tmpTbl
This has plenty of drawbacks. Complexity, performance, etc. But it is very flexible and solves the major problem of preventing changes to the main table (TblWithManyColumns) from breaking the query or requiring manual changes. This is particularly important if you're trying to generate SQL.

SQL - renumbering a sequential column to be sequential again after deletion

I've researched and realize I have a unique situation.
First off, I am not allowed to post images to the board yet since I'm a new user, so see the appropriate links below.
I have multiple tables where a column (not always the identifier column) is sequentially numbered and shouldn't have any breaks in the numbering. My goal is to make sure this stays true.
Down and Dirty
We have an 'Event' table from which we randomly select a percentage of the rows and insert them into a 'Results' table. The "ID" column from 'Results' is passed to a bunch of delete queries.
This more or less ensures that there are missing rows in several tables.
My problem:
Figuring out an SQL query that will renumber the column I specify. I would prefer not to drop the column.
Example delete query:
delete ItemVoid
from ItemTicket
join ItemVoid
  on ItemTicket.item_ticket_id = ItemVoid.item_ticket_id
where ItemTicket.ID in (select ID from results)
Example tables before and after the deletes were provided as images (not reproduced here).
As you can see, two rows were deleted from both tables based on the ID column. So now I have to figure out how to renumber the item_ticket_id and item_void_id columns so that the higher numbers decrease to fill in the missing values, the next highest decreases, and so on. Problem #2: if an item_ticket_id changes in order to keep ItemTicket sequential, that change also has to be applied to ItemVoid's item_ticket_id.
I appreciate any advice you can give on this.
(Answering an old question, as it's the first search result when I was looking this up.)
(MS T-SQL)
Resequencing an ID column (not an IDENTITY one) that has gaps can be done with a simple CTE that uses ROW_NUMBER() to generate a new sequence.
The UPDATE works through the CTE 'virtual table' without any extra problems, actually updating the underlying original table.
Don't worry about the ID values clashing during the update; if you wonder what happens when IDs are set to values that already exist, it doesn't suffer from that problem: the original sequence is changed to the new sequence in one go.
WITH NewSequence AS
(
SELECT
ID,
ROW_NUMBER() OVER (ORDER BY ID) as ID_New
FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;
Since you are looking for advice on this, my advice is that you need to redesign this, as I see a big flaw in your design.
Instead of deleting the records and then going through the hassle of renumbering the remaining records, use a bit flag that marks the records as inactive. Then, when you query the records, just include a WHERE clause to return only the records that are active:
SELECT *
FROM yourTable
WHERE Inactive = 0
Then you never have to worry about re-numbering the records. This also gives you the ability to go back and see the records that would have been deleted and you do not lose the history.
If you really want to delete the records and renumber them, then you can perform the task the following way (see the sketch after these steps):
Create a new table
Insert your original data into the new table using the new numbers
Drop your old table
Rename the new table with the corrected numbers
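A rough sketch of those steps for the ItemTicket table (the remaining column names are hypothetical):

SELECT ROW_NUMBER() OVER (ORDER BY item_ticket_id) AS item_ticket_id,
       other_col1, other_col2               -- hypothetical remaining columns
INTO   ItemTicket_renumbered
FROM   ItemTicket;

DROP TABLE ItemTicket;

EXEC sp_rename 'ItemTicket_renumbered', 'ItemTicket';

Remember that ItemVoid.item_ticket_id would also have to be remapped to the new numbers, which is one more reason the bit flag is simpler.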
As you can see there would be a lot of steps involved in re-numbering the records. You are creating much more work this way when you could just perform an UPDATE of the bit flag.
You would change your DELETE query to something similar to this:
UPDATE ItemVoid
SET InActive = 1
FROM ItemVoid
JOIN ItemTicket
on ItemVoid.item_ticket_id = ItemTicket.item_ticket_id
WHERE ItemTicket.ID IN (select ID from results)
The bit flag is much easier and that would be the method that I would recommend.
The function that you are looking for is a window function. In standard SQL (supported by SQL Server, and by MySQL since 8.0), the function is row_number(). You use it as follows:
select row_number() over (partition by <col> order by <col2>)
from <table>
In order to use this in your case, you would delete the rows from the table, then use a WITH statement to recalculate the row numbers, and then assign them using an UPDATE. For transactional integrity, you might wrap the delete and the update in a single transaction.
Oracle supports similar functionality, but the syntax is a bit different. Oracle calls these functions analytic functions and they support a richer set of operations on them.
I would strongly caution you against using cursors, since they have lousy performance. Of course, this will not work on an identity column, since such a column cannot be modified.

How to speed up SQL query with group by statement + max function?

I have a table with millions of rows, and I need to do LOTS of queries which look something like:
select max(date_field)
from <table>
where varchar_field1 = 'something'
group by varchar_field2;
My questions are:
Is there a way to create an index to help with this query?
What (other) options do I have to enhance performance of this query?
An index on (varchar_field1, varchar_field2, date_field) would be of most use. The database can use the first index field for the where clause, the second for the group by, and the third to calculate the maximum date. It can complete the entire query just using that index, without looking up rows in the table.
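For example, assuming the table is called my_big_table (a hypothetical name) and using the column names from the question:

create index ix_my_big_table_v1_v2_date
    on my_big_table (varchar_field1, varchar_field2, date_field);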
Obviously, an index on varchar_field1 will help a lot.
You can create yourself an extra table with the columns
varchar_field1 (unique index)
max_date_field
You can set up triggers on inserts, updates, and deletes on the table you're searching that will maintain this little table: whenever a row is added or changed, update the corresponding row in this table.
We've had good success with performance improvement using this refactoring technique. In our case it was made simpler because we never delete rows from the table until they're so old that nobody ever looks up the max field. This is an especially helpful technique if you can add max_date_field to some other table rather than create a new one.
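A T-SQL-flavored sketch of that summary table and a trigger to maintain it on inserts and updates; all object names and data types here are hypothetical, and deletes would still need separate handling (or can be ignored if, as above, old rows are never looked up):

CREATE TABLE max_date_summary (
    varchar_field1 varchar(100) NOT NULL PRIMARY KEY,  -- unique, per the suggestion above
    max_date_field datetime     NOT NULL
);
GO

CREATE TRIGGER trg_maintain_max_date ON my_big_table
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- upsert the maximum date per varchar_field1 from the affected rows
    MERGE max_date_summary AS s
    USING (SELECT varchar_field1, MAX(date_field) AS max_date_field
           FROM inserted
           GROUP BY varchar_field1) AS i
        ON s.varchar_field1 = i.varchar_field1
    WHEN MATCHED AND i.max_date_field > s.max_date_field THEN
        UPDATE SET max_date_field = i.max_date_field
    WHEN NOT MATCHED THEN
        INSERT (varchar_field1, max_date_field)
        VALUES (i.varchar_field1, i.max_date_field);
END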