Bulk updating existing rows in Redshift - sql

This seems like it should be easy, but isn't. I'm migrating a query from MySQL to Redshift of the form:
INSERT INTO table
(...)
VALUES
(...)
ON DUPLICATE KEY UPDATE
value = LEAST(value, VALUES(value))
Rows whose primary keys aren't already in the table are simply inserted. For primary keys that already exist, we update the row's values based on a condition that depends on both the existing and the new values in the row.
http://docs.aws.amazon.com/redshift/latest/dg/merge-replacing-existing-rows.html does not work, because filter_expression in my case depends on the current entries in the table. I'm currently creating a staging table, inserting into it with a COPY statement and am trying to figure out the best way to merge the staging and real tables.

I'm having to do exactly this for a project right now. The method I'm using involves 3 steps:
1.
Run an update that addresses changed fields (I'm updating whether or not the fields have changed, but you can certainly qualify that; see the sketch after step 3):
update table1
set col1 = s.col1, col2 = s.col2, ...
from stagetable s
where table1.primkey = s.primkey;
2.
Run an insert that addresses new records:
insert into table1
select s.*
from stagetable s
left outer join table1 t on s.primkey=t.primkey
where t.primkey is null;
3.
Mark rows no longer in the source as inactive (our reporting tool uses views that filter inactive records):
update table1
set is_active_flag = 'N', last_updated = sysdate
where not exists (
  select 1
  from stagetable s
  where s.primkey = table1.primkey
);
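To match the asker's ON DUPLICATE KEY UPDATE ... LEAST(...) semantics, step 1 can be qualified so a matched row is only touched when the incoming value is smaller. A minimal sketch, assuming the column is named value as in the question:
update table1
set value = s.value
from stagetable s
where table1.primkey = s.primkey
  and s.value < table1.value;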

It is possible to create a temp table. In Redshift it is better to delete and re-insert the records.
Check this doc:
http://docs.aws.amazon.com/redshift/latest/dg/merge-replacing-existing-rows.html
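A minimal sketch of that delete-and-insert pattern, reusing the table1/stagetable names from the answer above (note that this simply replaces matched rows; it does not by itself implement the asker's LEAST() condition):
begin transaction;
delete from table1
using stagetable s
where table1.primkey = s.primkey;
insert into table1
select * from stagetable;
end transaction;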

Here is the fully working approach for Redshift.
Assumptions:
A. Data is available in S3 in gzip format with '|'-separated columns; it may contain some garbage data (see maxerror).
B. A sales fact table with two dimension tables, TIME and SKU (SKU may have many groups and categories), to keep it simple.
C. You have a sales table like this:
CREATE TABLE sales (
sku_id int encode zstd,
time_id int encode zstd,
quantity numeric(10,2) encode delta32k
);
1) Create a staging table that should resemble the online table used by your app(s).
CREATE TABLE stg_sales_onetime (
sku_number varchar(255) encode zstd,
time varchar(255) encode zstd,
qty_str varchar(20) encode zstd,
quantity numeric(10,2) encode delta32k,
sku_id int encode zstd,
time_id int encode zstd
);
2) Copy the data from S3 (this could also be done via SSH).
copy stg_sales_onetime (sku_number,time,qty_str) from
's3://<bucket_name>/<full_file_path>' CREDENTIALS 'aws_access_key_id=<your_key>;aws_secret_access_key=<your_secret>' delimiter '|' ignoreheader 1 maxerror as 1000 gzip;
3) This step is optional. If the data is not well formatted, this is your transformation step (e.g. converting the string '12.555654' to the number 12.56):
update stg_sales_onetime set quantity=convert(decimal(10,2),qty_str);
4) Populate the correct IDs from the dimension tables.
update stg_sales_onetime set sku_id=<your_sku_dimension_table>.sku_id from <your_sku_dimension_table> where stg_sales_onetime.sku_number=<your_sku_dimension_table>.sku_number;
update stg_sales_onetime set time_id=<your_time_dimension_table>.time_id from <your_time_dimension_table> where stg_sales_onetime.time=<your_time_dimension_table>.time;
5) Finally, the data is ready to go from the staging table to the online sales table.
insert into sales(sku_id,time_id,quantity) select sku_id,time_id,quantity from stg_sales_onetime;
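If the staging data might overlap rows already in sales and you only want to add the new ones, a guarded insert is one option (a sketch, not part of the original steps):
insert into sales (sku_id, time_id, quantity)
select s.sku_id, s.time_id, s.quantity
from stg_sales_onetime s
left join sales t
  on t.sku_id = s.sku_id
 and t.time_id = s.time_id
where t.sku_id is null;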

Related

Merge update records in a final table

I have a user table in Hive of the form:
User:
Id String,
Name String,
Col1 String,
UpdateTimestamp Timestamp
I'm inserting data in this table from a file which has the following format:
I/U,Timestamp when record was written to file, Id, Name, Col1, UpdateTimestamp
e.g. for inserting a user with Id 1:
I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456
and updating col1 for the same user with Id 1:
U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457
The columns which are not updated are returned as null.
Now, simple insertion is easy in Hive using LOAD DATA INPATH into a staging table and then ignoring the first two fields from the staging table.
However, how would I go about the update statements? So that my final row in hive looks like below:
1,Bob,updatedstuff,123457
I was thinking to insert all rows in a staging table and then perform some sort of merge query. Any ideas?
Typically with a merge statement your "file" would still be unique on ID and the merge statement would determine whether it needs to insert this as a new record, or update values from that record.
However, if the file is non-negotiable and will always have the I/U format, you could break the process up into two steps, the insert, then the updates, as you suggested.
In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key, and the transactional table property:
create table test.orc_acid_example_users
(
id int
,name string
,col1 string
,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');
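For the insert half of the two-step approach, a minimal sketch, assuming the file has been loaded into a staging table test.orc_acid_example_staging whose columns include type, id, name, col1 and updatetimestamp (the exact staging layout is an assumption):
insert into table test.orc_acid_example_users
select s.id, s.name, s.col1, s.updatetimestamp
from test.orc_acid_example_staging s
where s.type = 'I';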
After your insert statements, your Bob record would say "stuff" in col1.
As far as the updates - you could tackle these with an update or merge statement. I think the key here is the null values. It's important to keep the original name, or col1, or whatever, if the staging table from the file has a null value. Here's a merge example which coalesces the staging table's fields. Basically, if there is a value in the staging table, take that, or else fall back to the original value.
merge into test.orc_acid_example_users as t
using test.orc_acid_example_staging as s
on t.id = s.id
and s.type = 'U'
when matched
then update set name = coalesce(s.name,t.name), col1 = coalesce(s.col1, t.col1)
Now Bob will show "updatedstuff"
Quick disclaimer - if you have more than one update for Bob in the staging table, things will get messy. You will need to have a pre-processing step to get the latest non-null values of all the updates prior to doing the update/merge. Hive isn't really a complete transactional DB - it would be preferred for the source to send full user records any time there's an update, instead of just the changed fields only.
You can reconstruct each record in the table using last_value() with the ignore-nulls option:
select h.id,
       coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
       coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
       h.update_timestamp
from history h;
You can use row_number() and a subquery if you want the most recent record.
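A sketch of that row_number() variant, combined with the reconstruction above (same assumed history layout):
select id, name, col1, update_timestamp
from (
  select h.id,
         coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
         coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
         h.update_timestamp,
         row_number() over (partition by h.id order by h.timestamp desc) as rn
  from history h
) t
where rn = 1;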

Add timestamp to existing table

I have a SQL Server table with just 3 columns, one of which is of type varbinary. The data in this column is actually a Json document which among other properties contains information about when the data was last modified. Unfortunately the SQL table itself does not contain information about when its rows were modified.
Now, when sorting and filtering the data, I of course don't want to fetch all rows in order to find e.g. the latest 100 entries.
So my question is: does SQL Server somehow remember when a row was added/modified? I have tried adding a timestamp column, and it gets applied to all existing rows, but the values seem to be assigned randomly, because the sorting doesn't work. I don't need a datetime or anything, I just want to be able to sort the records based on when they were last modified.
Thanks
For those looking to add a timestamp column of type DATETIME to an existing DB table, you can do it like so:
ALTER TABLE TestTable
ADD DateInserted DATETIME NOT NULL DEFAULT (GETDATE());
The existing records will automatically get a value equal to the date/time at the moment the column is added.
New records will get the current date/time upon insertion.
SQL Server will not track historically when a row was inserted or modified so you need to rely on the JSON data to figure that out yourself. You are going to need a new column to make this efficient to query. Once you have your new column you have some options:
Loop through all your records populating the new column with the relevant value from the JSON data.
If your version of SQL Server is recent enough, you can query the JSON data directly. Populate this column using a query like this:
UPDATE MyTable
SET MyNewColumn = JSON_VALUE(JsonDataColumn, '$.Customer.DateCreated')
The downside of this method is that you need to maintain this column yourself whenever the JSON data changes.
Make SQL Server compute the value from the JSON automatically, for example:
ALTER TABLE MyTable
ADD MyNewColumn AS JSON_VALUE(JsonDataColumn, '$.Customer.DateCreated')
And, create an index to make it efficient:
CREATE INDEX IX_MyTable_MyNewColumn
ON MyTable(MyNewColumn)
Use a new column, CreatedDate, and store the datetime every time you make an insert.
You could use GETDATE() to populate it.
An UpdatedDate column can be used for updates.
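A minimal sketch of that idea (the table and column names are placeholders); UpdatedDate has to be set by whatever code performs the UPDATE, or by a trigger:
ALTER TABLE MyTable
ADD CreatedDate DATETIME NOT NULL DEFAULT GETDATE(),
    UpdatedDate DATETIME NULL;
-- then, in your application's update statements:
-- UPDATE MyTable SET ..., UpdatedDate = GETDATE() WHERE ...;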
in order to find e.g. the latest 100 entries.
Timestamp is indeed what you need.
It's an ever-increasing value and it's updated automatically, so you are always able to find the last modified/inserted rows.
Here is an example:
create table dbo.test1 (id int);
insert into dbo.test1 values(1), (2), (3);
alter table dbo.test1 add ts timestamp;
update dbo.test1
set id = 10
where id = 2;
select top 1 *
from dbo.test1
order by ts desc;
--id ts
--10 0x000000001FCFABD2
insert into dbo.test1 (id)
values (100);
select top 1 *
from dbo.test1
order by ts desc;
--id ts
--100 0x000000001FCFABD3
As you see, you always get the last modified/inserted row.
For your purpose just use
select top 100 *
...
order by ts desc;
Thanks. Apparently I didn't look hard enough before I posted this question. The question has been asked a couple of times before and the answer is: Nope! There is no easy solution to this.
SQL Server does not keep track of when a record was created or modified, which is what I was looking for. So I will go for the next best solution, which is probably to create a datetime column, retrieve the modified date from the Json document and then update the record. Or rather, the 1.4 million records :-(
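Since the column here is varbinary, the JSON has to be cast back to text before JSON_VALUE can read it. A rough sketch, assuming SQL Server 2016+ and placeholder names (MyTable, JsonDataColumn and the '$.lastModified' path are all assumptions):
ALTER TABLE MyTable ADD ModifiedDate DATETIME2 NULL;
-- run the ALTER in its own batch, then:
UPDATE MyTable
SET ModifiedDate = TRY_CONVERT(datetime2,
        JSON_VALUE(CAST(JsonDataColumn AS NVARCHAR(MAX)), '$.lastModified'),
        127); -- style 127 = ISO 8601; use VARCHAR(MAX) instead if the JSON bytes are UTF-8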

Oracle: self-copying data from Oracle tables

The application has different versions. Each version has its own set of values in each table. I need to provide functionality to copy data from one version to another. Problem:
When inserting the data, I am trying to insert IDs that are already in use in this table. So I need to change the IDs of the components I want to insert, but I must preserve the relationships between those components. How can I do that?
Create a master table which has a surrogate key as your primary key. A numeric value of type NUMBER(9) works well. You can create a sequence and trigger to automatically insert this.
The rest of the table is the column of your current table plus a column to indicate which version the row is for.
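A rough sketch of that sequence-and-trigger setup (the names master_table, master_id and the trigger are placeholders/assumptions; the direct nextval assignment needs 11g or later):
create sequence seq_master_table;

create or replace trigger trg_master_table_bi
before insert on master_table
for each row
begin
  if :new.master_id is null then
    :new.master_id := seq_master_table.nextval; -- pre-11g: select seq_master_table.nextval into :new.master_id from dual;
  end if;
end;
/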
For simplicity you may wish to create views on top of the table along the lines of
select * from master_table where version_id = ####;
To copy the data from one version to another this will work:
insert into master_table
select seq_master_table.nextval, <new_version_id>, .....
from master_table
where version_id = ####;

Create a unique primary key (hash) from database columns

I have this table which doesn't have a primary key.
I'm going to insert some records into a new table to analyze them, and I'm thinking of creating a new primary key from the values of all the available columns.
If this were a programming language like Java I would:
int hash = column1 * 31 + column2 * 31 + column3*31
Or something like that. But this is SQL.
How can I create a primary key from the values of the available columns? It won't work for me to simply mark all the columns as the PK, because what I need to do is compare them with data from another DB table.
My table has 3 numbers and a date.
EDIT What my problem is
I think a bit more of background is needed. I'm sorry for not providing it before.
I have a database (dm) that is being updated every day from another db (originalsource). It has records from the past two years.
Last month (July) the update process broke, and for a month no data was being loaded into dm.
I manually created a table with the same structure in my Oracle XE and copied the records from the original source into my db (myxe). I copied only the July records, to create a report needed by the end of the month.
Finally, on Aug 8 the update process got fixed, and the records which had been waiting to be migrated by this automatic process got copied into the database (from originalsource to dm).
This process cleans the data out of the original source once it has been copied (into dm).
Everything looked fine, but we have just realized that some of the records got lost (about 25% of July).
So, what I want to do is use my backup (myxe) and insert into the database (dm) all those missing records.
The problems here are:
They don't have a well-defined PK.
They are in separate databases.
So I thought that if I could create a unique PK from the same columns in both tables, giving the same value for the same row, I could tell which rows were missing and insert them.
EDIT 2
So I did the following in my local environment:
select a.* from the_table@PRODUCTION a , the_table b where
a.idle = b.idle and
a.activity = b.activity and
a.finishdate = b.finishdate
This returns all the rows that are present in both databases (the intersection?). I've got 2,000 records.
What I'm going to do next is delete them all from the target db and then just insert all the rows from my db into the target table.
I hope I don't get into something worse :-S
The danger of creating a hash value by combining the 3 numbers and the date is that it might not be unique and hence cannot be used safely as a primary key.
Instead I'd recommend using an autoincrementing ID for your primary key.
Just create a surrogate key:
ALTER TABLE mytable ADD pk_col INT
UPDATE mytable
SET pk_col = rownum
ALTER TABLE mytable MODIFY pk_col INT NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
or this:
ALTER TABLE mytable ADD pk_col RAW(16)
UPDATE mytable
SET pk_col = SYS_GUID()
ALTER TABLE mytable MODIFY pk_col RAW(16) NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
The latter uses GUIDs, which are unique across databases, but they consume more space and are much slower to generate (your INSERTs will be slow).
Update:
If you need to create same PRIMARY KEYs on two tables with identical data, use this:
MERGE
INTO mytable v
USING (
SELECT rowid AS rid,
       row_number() OVER (ORDER BY col1, col2, col3) AS rn
FROM mytable
) o
ON (v.rowid = o.rid)
WHEN MATCHED THEN
UPDATE
SET pk_col = o.rn
Note that the tables should be identical row for row (i.e. have the same number of rows with the same data in them).
Update 2:
For your very problem, you don't need a PK at all.
If you just want to select the records missing in dm, use this one (on dm side)
SELECT *
FROM mytable@myxe
MINUS
SELECT *
FROM mytable
This will return all records that exist in mytable@myxe but not in the local mytable on dm.
Note that it will collapse any duplicates, if there are any.
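And to actually put the missing rows back into dm, a sketch under the same assumption of identical column structure:
insert into mytable
select *
from mytable@myxe
minus
select *
from mytable;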
Assuming that you have ensured uniqueness...you can do almost the same thing in SQL. The only problem will be the conversion of the date to a numeric value so that you can hash it.
Select Table2.SomeFields
FROM Table1 LEFT OUTER JOIN Table2 ON
(Table1.col1 * 31) + (Table1.col2 * 31) + (Table1.col3 * 31) +
((DatePart(year,Table1.date) + DatePart(month,Table1.date) + DatePart(day,Table1.date) )* 31) = Table2.hashedPk
The above query would work for SQL Server, the only difference for Oracle would be in terms of how you handle the date conversion. Moreover, there are other functions for converting dates in SQL Server as well, so this is by no means the only solution.
And, you can combine this with Quassnoi's SET statement to populate the new field as well. Just use the left side of the Join condition logic for the value.
If you're loading your new table with values from the old table, and you then need to join the two tables, you can only "properly" do this if you can uniquely identify each row in the original table. Quassnoi's solution will allow you to do this, IF you can first alter the old table by adding a new column.
If you cannot alter the original table, generating some form of hash code based on the columns of the old table would work -- but, again, only if the hash codes uniquely identify each row. (Oracle has checksum functions, right? If so, use them.)
If hash code uniqueness cannot be guaranteed, you may have to settle for a primary key composed of as many columns are required to ensure uniqueness (e.g. the natural key). If there is no natural key, well, I heard once that Oracle provides a rownum for each row of data, could you use that?
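For the checksum idea, Oracle's ORA_HASH can be applied to a concatenation of the columns. It is only a heuristic, since hash collisions are possible, so it should not be treated as a guaranteed primary key. A sketch with placeholder column names:
select ora_hash(col1 || '|' || col2 || '|' || col3 || '|' || to_char(datecol, 'YYYY-MM-DD')) as row_hash,
       t.*
from mytable t;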

Change each record in a table with no primary key?

I have a table in a database that represents dates textually (i.e. "2008-11-09") and I would like to replace them with the UNIX timestamp. However, I don't think that MySQL is capable of doing the conversion on its own, so I'd like to write a little script to do the conversion. The way I can think to do it involves getting all the records in the table, iterating through them, and updating the database records. However, with no primary key, I can't easily get the exact record I need to update.
Is there a way to get MySQL to assign temporary IDs to records during a SELECT so that I refer back to them when doing UPDATEs?
Does this not do it?
UPDATE
MyTable
SET
MyTimeStamp = UNIX_TIMESTAMP(MyDateTime);
If for some reason you do have to iterate (the other answers cover the situation where you don't), I can think of two ways to do it (these aren't MySQL-specific):
Add a column to the table that's an auto-assigned number. Use that as the PK for your updates, then drop the column afterwards (or just keep it around for future use).
In a table with no defined PK, as long as there are no exact duplicate rows, you can use the entire row as a composite PK; just use every column in the row as your distinguishing characteristic. i.e., if the table has 3 columns, "name", "address", and "updated", do the following:
UPDATE mytable SET updated = [new timestamp value] WHERE name = [name] AND address = [address] AND updated = [old timestamp]
Many data access frameworks use this exact strategy to implement optimistic concurrency.
No, you should be able to do this with a single update statement. If all of the dates are yyyy-mm-dd and they are just stored in some sort of text column instead of DATETIME, you can just move the data over. SQL would be like:
ALTER TABLE t ADD COLUMN dates DATETIME;
UPDATE t set t.dates=t.olddate;
This shouldn't be dependent on a PK because MySQL can scan through each row in the table. The only time PK's become an issue is if you need to update a single row, but the row may not be unique.
You can generate values during a SELECT using the MySQL user variables feature, but those values exist only in that result set; they don't identify the underlying rows, so you can't use them to refer back to specific rows in a later UPDATE.
SET @v := 0;
SELECT @v := @v + 1 AS row_num, m.* FROM mytable m;
Here's how I'd solve the problem. You're going to have to create another column for your UNIX timestamps anyway, so you can add it first. Then convert the values in the old datetime column to the UNIX timestamp and place it in the new column. Then drop the old textual datetime column.
ALTER TABLE mytable ADD COLUMN unix_timestamp INT UNSIGNED NOT NULL DEFAULT 0;
UPDATE mytable
SET unix_timestamp = UNIX_TIMESTAMP( STR_TO_DATE( text_timestamp, '%Y-%m-%d' ) );
ALTER TABLE mytable DROP COLUMN text_timestamp;
Of course you should confirm that the conversion has been done correctly before you drop the old column!
See UNIX_TIMESTAMP() and STR_TO_DATE()