I have table with millions of record and I added two new columns
alter table hr.employees add (ind char(1Byte), remove Char(1 Byte)); commit;
I have another view hr.department which has more data than this and it has these two columns .
So if I write and update for these records it takes so long.
update hr.employees a
set (ind, remove) =(select ind, remove
from hr.department b
where a.dept_id = b.dept_id ) ;
It's been an hour it still goes with the update. Can some one help in this ?
If your table has millions of rows, using an UPDATE will very likely take too much time.
I would rename the old table, create a new table with the new columns already filled, then add indexes, constraints, comments and gather statistics.
RENAME employees to employees_old;
CREATE TABLE employees AS
SELECT a.col1, a.col2, ... a.coln, b.ind, b.remove
FROM employees_old a
LEFT JOIN department b
ON a.dept_id = b.dept_id;
The simplest way to handle this update will likely be the following:
Find a time when no one else will be critically using the system.
Make a backup.
If they don't already exist, copy DDL for creating all indexes/constraints on that table.
Drop all indexes/constraints on that table.
Update your two columns.
Recreate indexes/constraints (this may take some time, but usually orders of magnitude less than updating every row in a big table with multiple indexes).
As usual, Burleson Consulting is a good resource: http://www.dba-oracle.com/t_efficient_update_sql_dml_tips.htm
Related
So I have a main table in Hive, it will store all my data.
I want to be able to load a incremental data update about every month
with a large amount of data couple billion rows. There will be new data
as well as updated entries.
What is the best way to approach this, I know Hive recently upgrade and supports update/insert/delete.
What I've been thinking is to somehow find the entries that will be updated and remove them from the main table and then just insert the new incremental update. However after trying this, the inserts are very fast, but the deletes are very slow.
The other way is to do something using the update statement to match the key values from the main table and the incremental update and update their fields. I haven't tried this yet. This also sounds painfully slow since Hive would have to update each entry 1 by 1.
Anyone got any ideas as to how to do this most efficiently and effectively ??
I'm pretty new to Hive and databases in general.
If merge in ACID mode is not applicable, then it's possible to update using FULL OUTER JOIN or using UNION ALL + row_number.
To find all entries that will be updated you can join increment data with old data:
insert overwrite target_data [partition() if applicable]
SELECT
--select new if exists, old if not exists
case when i.PK is not null then i.PK else t.PK end as PK,
case when i.PK is not null then i.COL1 else t.COL1 end as COL1,
...
case when i.PK is not null then i.COL_n else t.COL_n end as COL_n
FROM
target_data t --restrict partitions if applicable
FULL JOIN increment_data i on (t.PK=i.PK);
It's possible to optimize this by restricting partitions in target_data that will be overwritten and joined using WHERE partition_col in (select distinct partition_col from increment_data) or pass partition list if possible as a parameter and use in the where clause, it will work even faster.
Also if you want to update all columns with new data, you can apply this solution with UNION ALL+row_number(), it works faster than full join: https://stackoverflow.com/a/44755825/2700344
Here is my solution/work around if you are using old hive version. This works better when you have large data in target table which we can't drop and recreate with full data every time.
create one more table say delete_keys table. This will hold all the key from main table which are deleted along with its surrogate key.
While loading incremental data into main table, do a left join with main table. For all the matching records, we ideally should update the main table. But instead, we take keys (along with surrogate key) from main table for all matching records and insert that to delete_keys table. Now we can insert all delta records into main table as it irrespective of whether they are to be updated or inserted.
Create view on main table using delete-keys table so that matching keys with delete-keys table are not fetched. So, this view will be final target table. This view will not show records from main table which are updated with latest records.
I have this table, where every column is a VARCHAR (or equivalent):
field001 field002 field003 field004 field005 .... field500
500 VARCHAR columns. No primary keys. And no column is guaranteed to be unique. So the only way to know for sure if two rows are the same is to compare the values of all columns.
(Yes, this should be in TheDailyWTF. No, it's not my fault. Bear with me here).
I inserted a duplicate set of rows by mistake, and I need to find them and remove them.
There's 12 million rows on this table, so I'd rather not recreate it.
However, I do know what rows were mistakenly inserted (I have the .sql file).
So I figured I'd create another table and load it with those. And then I'd do some sort of join that would compare all columns on both tables and then delete the rows that are equal from the first table. I tried a NATURAL JOIN as that looked promising, but nothing was returned.
What are my options?
I'm using Amazon Redshift (so PostgreSQL 8.4 if I recall), but I think this is a general SQL question.
You can treat the whole row as a single record in Postgres (and thus I think in Redshift).
The following works in Postgres, and will keep one of the duplicates
delete from the_table
where ctid not in (select min(ctid)
from the_table
group by the_table); --<< Yes, the group by is correct!
This is going to be slow!
Grouping over so many columns and then deleting with a NOT IN will take quite some time. Especially if a lot of rows are going to be deleted.
If you want to delete all duplicate rows (not keeping any of them), you can use the following:
delete from the_table
where the_table in (select the_table
from the_table
group by the_table
having count(*) > 1);
You should be able to identify all the mistakenly inserted rows using CREATEXID.If you group by CREATEXID on your table as below and get the count you should be able to understand how many rows were inserted in your transaction and remove them using DELETE command.
SELECT CREATEXID,COUNT(1)
FROM yourtable
GROUP BY 1;
One simplistic solution is to recreate the table, e.g.
CREATE TABLE my_temp_table (
-- add column definitions here, just like the original table
);
INSERT INTO my_temp_table SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
or even
CREATE TABLE my_temp_table AS SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
It is a trick but probably it helps.
Each row in the table containing the transaction ID in which it row was inserted/updated: System Columns. It is xmin column. So using it you can to find the transaction ID in which you inserted the wrong data. Then just delete the rows using
delete from my_table where xmin = <the_wrong_transaction_id>;
PS: Be careful and try it on the some test table first.
I have a process that runs every 60 minutes. On one table I need to remove all data then insert records from a different table. The problem is it takes a long time to delete and reinsert the data. When the table has no data I am afraid the users will see this. Is there a way to refresh the data without users seeing this?
If you want to remove all data from the table then use the TRUNCATE
TABLE instead of delete - It'll do it faster.
As for the insert it is a bit hard to say because you did not give any details but what you can try is:
Option 1 - Using temp table
create table table_temp as select * from original_table where rownum < 1;
//insert into table_temp
drop table original_table;
Exec sp_rename 'table_temp' , 'original_table'
Option 2 - Use 2 tables "Active-Passive" -
Have 2 tables for the data and a view to select over them. The view will join with a third table that will specify from which of the tables to select. kind of an "active-passive" concept.
To demonstrate concept:
with active_table as ( select 'table1_active' active_table )
select 1 data
where 'table1_active' in (select * from active_table)
union all
select 2
where 'table2_active' in (select * from active_table)
//This returns only one record with the "1"
Are you truncating instead of deleting? A truncate (while logged) is much, much, faster then a delete.
If you cannot truncate try deleting 1000-10000 rows at a time (smaller log buildup and on deleting large amounts of rows great increase in speed.)
If you really want fast performance you can create a second table, fill it with data, and then drop the first table and rename the second table as the first table. You will lose all the permissions on the table when you do this so be sure to reapply the permissions to the renamed table.
If you are deleting all rows in a table, you can consider using a TRUNCATE statement against the table instead of a DELETE. It will speed up part of your process. Keep in mind that this will reset any identity seeds you may have on the table.
As suggested, you can wrap this process in a transaction and depending on how you set your transaction isolation level, you can control what your users will see if they query the data during the transaction.
Make it sequence based, your copied in records all have have a series number (all the same for all copied in records) and another file holds which sequence is active, and you always select on a join to this table - when you copy in new records they have a new sequence that is not yet active, when they are all copied in, then the sequence table is updated to the new sequence - the redundant sequence records are deleted at your leisure.
Example
Let's suppose your table has field SeriesNo added and table ActiveSeries has field SeriesNo.
All queries of your table:
SELECT *
FROM YourTable Y
JOIN ActiveSeries A
ON A.SeriesNo = Y.SeriesNo
then updating SeriesNo in ActiveSeries makes new series of records available instantly.
I would follow below approach. While I troubleshoot why the delete and reinsert is taking time.
Create a new table ( t1 ) which has same data as oldtable ( maintable )
Now do your stuff on t1.
When your stuff is done, rename t1 to maintable.
We have two large tables (Clients and Contacts) which undergo an ETL process every night, being inserted into a single "People" table in the data warehouse. This table is used in many places and cannot be significantly altered without a lot of work.
The source tables are populated by third party software; we used to assume that we could identify the rows that had been updated since last night by using the "UpdateDate" column in each, but more recently identified some rows that were not touched by the ETL, as the "UpdateDate" column was not behaving as we had thought; the software company do not see this as a bug, so we have to live with this fact.
As a result, we now take all source rows, transformed into a temp staging table and then Merge that into the data warehouse, using the Merge to identify any changed values. We have noticed that this process is taking too long on some days and would like to limit the number of rows that the ETL process looks at, as we believe that the reason for the hold-up is the principally the sheer volume of data that is examined and stored on the temp database. We can see no way to look purely at the source data and identify when each row last changed.
Here is a simplified pseudocode of the ETL stored procedure, although what the procedure actually does is not really relevant to the question (included just in case you disagree with me!)
CREATE #TempTable (ClientOrContact BIT NOT NULL, Id INT NOT NULL, [Some_Other_Columns])
INSERT #TempTable
SELECT 1 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ClientsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
INSERT #TempTable
SELECT 0 AS ClientOrContact, C.Id, [SomeColumns] FROM
(SELECT [SomeColumns]
FROM Source_ContactsTable C
JOIN FieldsTable F JOIN [SomeOtherTables])
PIVOT (MAX(F.FieldValue) FOR F.FieldName IN ([SomeFieldNames]));
ALTER #TempTable ADD PRIMARY KEY (ClientOrContact, Id);
MERGE Target_PeopleTable AS Tgt
USING (SELECT [SomeColumns] FROM #TempTable JOIN [SomeOtherTables]) AS Src
ON Tgt.ClientOrContact = Src.ClientOrContact AND Tgt.Id = Src.Id
WHEN MATCHED AND NOT EXISTS (SELECT Tgt.* INTERSECT SELECT Src.*)
THEN UPDATE SET ([All_NonKeyTargetColumns] = [All_NonKeySourceColumns])
WHEN NOT MATCHED BY Target THEN INSERT [All_TargetColumns] VALUES [All_SourceColumns]
OUTPUT $Action INTO #Changes;
RETURN COUNT(*) FROM #Changes;
GO
The source tables have about 1.5M rows each, but each day only a relatively small number of rows are inserted or updated (never deleted). There are about 50 columns in each table, of those, about 40 columns can have changed values each night. Most columns are VARCHAR and each table contains an independent incremental primary key column. We can add indexes to the source tables, but not alter them in any other way (They have already been indexed by a predecessor) The source tables and target table are on the same server, but different databases. Edit: The Target Table has a composite primary key on the ClientOrContact and Id columns, matching that shown on the temp table in the script above.
So, my question is this - please could you suggest any general possible strategies that might be useful to limit the number of rows we look at or copy across each night? If we only touched the rows that we needed to each night, we would be touching less than 1% of the data we do at the moment...
Before you try the following suggestion, just one thing to check is that the Target_PeopleTable has an index or primary key on the id column. It probably does but without schema information to verify I am making no assumptions and this might speed up the merge stage.
As you've identified if you could somehow limit the records in TempTable to just the changed rows then this could offer a performance win for the actual MERGE statement (depending on how expensive determining just the changed rows is).
As a general strategy I would consider some kind of checksum to try and identify the changed records only. The T-SQL Checksum function could be used to calculate a check sum across the required columns by specifying the columns as a comma separated list to that function or there are actual column types available for this such as Binary_Checksum.
Since you cannot change the source schema you would have to maintain a list of record ids and associated checksums in your target database so that you can readily compare the source checksums to the target checksums from the last run in order to identify a difference.
You can then only insert into the Temp table where there is a checksum difference between the target and source or the id does not exist in the target db.
This might just be moving the performance problem to the temp insert part but I think it's worth a try.
Have you considered triggers? I avoid them like the plague, but they really are the solution to some problems.
Put an INSERT/UPDATE [/DELETE?] trigger on your two source tables. Program it such that when rows are added or updated, the trigger will log the IDs of these rows in a (you'll have to create this) audit table, where that table would contain the ID, the type of change (update or insert – and delete, if you have to worry about those) and when the change was made. When you run ETL, join this list of “to be merged” items with the source tables. When you’re done, delete the table and it’s reset for the next run. (Use the “added on” datetime column to make sure you don’t delete rows that may have been added while you were running ETL.)
There’s lots of details behind proper use and implementation, but overall this idea should do what you need.
I have a table people with less than 100,000 records and I have taken a backup of this table using the following:
create table people_backup as select * from people
I add some new records to my people table over time, but eventually I want to merge the records from my backup table into people. Unfortunately I cannot simply DROP my table as my new records will be lost!
So I want to update the records in my people table using the records from people_backup, based on their primary key id and I have found 2 ways to do this:
MERGE the tables together
use some sort of fancy correlated update
Great! However, both of these methods use SET and make me specify what columns I want to update. Unfortunately I am lazy and the structure of people may change over time and while my CTAS statement doesn't need to be updated, my update/merge script will need changes, which feels like unnecessary work for me.
Is there a way merge entire rows without having to specify columns? I see here that not specifying columns during an INSERT will direct SQL to insert values by order, can the same methodology be applied here, is this safe?
NB: The structure of the table will not change between backups
Given that your table is small, you could simply
DELETE FROM table t
WHERE EXISTS( SELECT 1
FROM backup b
WHERE t.key = b.key );
INSERT INTO table
SELECT *
FROM backup;
That is slow and not particularly elegant (particularly if most of the data from the backup hasn't changed) but assuming the columns in the two tables match, it does allow you to not list out the columns. Personally, I'd much prefer writing out the column names (presumably those don't change all that often) so that I could do an update.