VERTICA insert multiple rows in one statement with named columns - sql

I want to insert multiple rows efficiently into VERTICA. In PostgreSQL (and probably other SQL implementations) it is possible to INSERT multiple rows in one statement, which is a lot faster than doing single inserts (especially in autocommit mode).
A minimal self-contained example to load two rows in a newly created table could look like this (a):
CREATE TABLE my_schema.my_table (
    row_count int,
    some_float float,
    some_string varchar(8));
INSERT INTO my_schema.my_table (row_count, some_float, some_string)
VALUES (1,1.0,'foo'),(2,2.0,'bar');
But the beauty of this is that the order in which the columns and values are listed can be changed, as in (b):
INSERT INTO my_schema.my_table (some_float, some_string, row_count)
VALUES (1.0,'foo',1),(2.0,'bar',2);
Furthermore, this syntax allows leaving out columns, which are then filled with default values (such as auto-incrementing integers).
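For illustration, a minimal sketch of that (PostgreSQL syntax; the table name and the serial column are made up for this example):
CREATE TABLE my_schema.my_table_auto (
    row_count serial,
    some_float float,
    some_string varchar(8));
INSERT INTO my_schema.my_table_auto (some_float, some_string)
VALUES (1.0,'foo'),(2.0,'bar');
-- row_count is omitted and filled automatically (1 and 2)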
However, VERTICA does not seem to support a multi-row insert with the same fine-tuning. In fact, the only way to emulate similar behaviour seems to be to UNION several SELECTs together, as in (c):
INSERT INTO my_schema.my_table SELECT 1,1.0,'foo' UNION SELECT 2,2.0,'bar';
as in this answer: Vertica SQL insert multiple rows in one statement.
However, this seems to work only when the order of the inserted columns matches the order of their initial definition. My question is: is it possible to craft a single insert like (c) but with the possibility of changing the column order as in (b)? Or am I tackling the problem completely wrong? If so, what alternative is there to a multi-row insert? Should I try COPY LOCAL?

Just list the columns in the insert:
INSERT INTO my_schema.my_table (row_count, some_float, some_string)
SELECT 1,1.0,'foo'
UNION ALL
SELECT 2,2.0,'bar';
Note the use of UNION ALL instead of UNION. UNION incurs overhead for removing duplicates, which is not needed.
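The listed columns can also be reordered, just as in (b); a sketch:
INSERT INTO my_schema.my_table (some_float, some_string, row_count)
SELECT 1.0,'foo',1
UNION ALL
SELECT 2.0,'bar',2;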

Related

What is the most efficient way to implement a Redshift merge/upsert operation

I am in the process of writing a custom upsert function for a specific use case for a Redshift table. In their docs, AWS suggests two methods that I'm drawing inspiration from. Here is what I want to accomplish:
Insert any new rows to an existing table, but only if they don't already exist.
There is never a need to delete or modify an existing row (for my use case).
I have so far come up with two separate ways to do this, but I'm wondering what the tradeoffs of each could be.
Using an EXCEPT query to insert only the new rows from a temp table:
insert into persisted_table (
select *
from temp_table
except
select *
from persisted_table
);
Store the results of a UNION query on the temp table with the persisted table, and use that as the persisted table:
insert into new_table (
select *
from temp_table
union
select *
from persisted_table
);
alter table persisted_table rename to old_persisted_table_marked_for_deletion;
alter table new_table rename to persisted_table;
I'm aware that UNION (as opposed to UNION ALL) is slow because it has to deduplicate, and it is generally not recommended for bulk/large-scale operations. Apart from that, though, are there any arguments that could influence this decision?
The first advice I'd give is to remember that Redshift is a cluster. Whatever process you select, if the data is large, you will want the comparison that determines whether a row already exists to stay "on node". You will want the tables in question to be distributed by the same key.
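For example, a sketch of co-locating both tables on a shared (hypothetical) key column:
create table persisted_table (
    id bigint,
    payload varchar(256)
)
diststyle key
distkey (id);
-- the temp table inherits the same distribution, so key comparisons stay on node
create temp table temp_table (like persisted_table);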
Next, I would think about what the keys into the data are. The processes you laid out compare all columns. Is that needed? If a subset of columns can serve as the key, this can make things more efficient:
insert into persisted_table (
select a.*
from temp_table a
left join persisted_table b on {keys}
where b.{key} is null );
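As a concrete sketch, assuming the key is the single (hypothetical) column id:
insert into persisted_table
select a.*
from temp_table a
left join persisted_table b on a.id = b.id
where b.id is null;  -- keep only rows with no match in persisted_table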
Hopefully these aspects will help your decision process.

Bulk Insert with Table Valued Parameter with duplicate rows

I need to insert multiple records into a SQL table. If there are duplicates (already-inserted records), I want to ignore them.
To send multiple records from my code to SQL, I am using a table-valued parameter.
I was looking at two options.
Option 1: Make a get call to the SQL table, check for duplicates, and return the duplicate row keys. Then perform the insert with the table-valued parameter only for the row keys that do not yet exist in the SQL table.
Option 2: Use the table-valued parameter and call a bulk insert. In SQL, do the duplicate detection and ignore the duplicate rows.
The SQL that was implemented is as follows:
Here, #tvpNewFMdata is the table-valued parameter.
INSERT INTO
[dbo].[FMData]
(
[Id],
[Name],
[Path],
[CreatedDate],
[ModifiedDate]
)
SELECT
fm.Id, fm.Name, fm.Path, GETUTCDATE(), GETUTCDATE()
FROM
#tvpNewFMdata AS fm
WHERE
fm.Id NOT IN
(
SELECT
[Id]
FROM
[dbo].[FMData]
)
In the SQL approach, I first do a select to check whether the row exists, and only if it does not exist do I insert.
I want to get a better perspective on which approach is optimized performance-wise. I also want to understand whether the above query is optimized.
Your code looks fine, although I might make some suggestions.
First, use default values for CreatedDate and ModifiedDate. That way, you don't need to set the values every time a row is inserted.
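For example (a sketch; the constraint names here are made up):
ALTER TABLE [dbo].[FMData]
    ADD CONSTRAINT [DF_FMData_CreatedDate] DEFAULT (GETUTCDATE()) FOR [CreatedDate];
ALTER TABLE [dbo].[FMData]
    ADD CONSTRAINT [DF_FMData_ModifiedDate] DEFAULT (GETUTCDATE()) FOR [ModifiedDate];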
Second, I'm not a fan of NOT IN, preferring NOT EXISTS instead. I prefer NOT EXISTS because it works more intuitively when the subquery returns NULL values. However, I am guessing that Id is a primary key in FMData, so it could never be NULL.
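Rewritten with NOT EXISTS, the insert would look something like this:
INSERT INTO [dbo].[FMData] ([Id], [Name], [Path], [CreatedDate], [ModifiedDate])
SELECT fm.Id, fm.Name, fm.Path, GETUTCDATE(), GETUTCDATE()
FROM #tvpNewFMdata AS fm
WHERE NOT EXISTS
(
    SELECT 1
    FROM [dbo].[FMData] AS d
    WHERE d.Id = fm.Id
);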
Third, Id should have an index, which it would have as a primary key.
Fourth, the code is not thread safe, meaning that running the same code twice at the same time could generate an error. I'm guessing this is not a problem for this code, but if so, you can investigate table locking hints.
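One common pattern for that (a sketch, only needed if concurrent loads are a real possibility) is to take key-range locks during the existence check:
INSERT INTO [dbo].[FMData] ([Id], [Name], [Path], [CreatedDate], [ModifiedDate])
SELECT fm.Id, fm.Name, fm.Path, GETUTCDATE(), GETUTCDATE()
FROM #tvpNewFMdata AS fm
WHERE NOT EXISTS
(
    -- UPDLOCK + HOLDLOCK serialize concurrent existence checks on the same keys
    SELECT 1
    FROM [dbo].[FMData] AS d WITH (UPDLOCK, HOLDLOCK)
    WHERE d.Id = fm.Id
);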
Except for the presence of an index on Id, none of these comments address performance. Your code should be fine from a performance perspective.

Oracle Insert Select with order by

I am working on a PL/SQL procedure where I am using an INSERT ... SELECT statement.
I need to insert into the table in an ordered manner, but the ORDER BY I used in the SELECT is not working.
Is there any specific way in Oracle to insert rows in an orderly fashion?
The use of an ORDER BY within an INSERT ... SELECT is not pointless as long as it can change the content of the inserted data, e.g. when a sequence NEXTVAL is included in the SELECT clause. This holds even though the inserted rows won't come back sorted when fetched; sorting fetched rows is the job of the ORDER BY clause of the query that reads them.
For such a goal, you can use a workaround: place your ORDER BY clause in a subquery, and it works:
INSERT INTO myTargetTable
(
    SELECT mySequence.nextval, sq.*
    FROM
    (
        SELECT f1, f2, f3, ...fx
        FROM mySourceTable
        WHERE myCondition
        ORDER BY mySortClause
    ) sq
)
The typical use case for an ordered insert is to co-locate particular values in the same blocks (effectively reducing the clustering factor on indexes on the columns by which you have ordered the data).
This generally requires a direct path insert ...
insert /*+ append */ into ...
select ...
from ...
order by ...
There's nothing invalid about this as long as you accept that it's only worthwhile for bulk data, that the data will load above the high water mark only, and that there are locking issues involved.
Another approach, which achieves mostly the same effect but is arguably more suitable for OLTP systems, is to create the table in a cluster.
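A minimal sketch of an index cluster (all names here are hypothetical):
CREATE CLUSTER order_cluster (customer_id NUMBER) SIZE 512;
CREATE INDEX idx_order_cluster ON CLUSTER order_cluster;
-- rows sharing a customer_id are stored in the same blocks
CREATE TABLE orders (
    order_id    NUMBER,
    customer_id NUMBER,
    amount      NUMBER
) CLUSTER order_cluster (customer_id);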
The standard Oracle table is a heap-organized table. A heap-organized table is a table with rows stored in no particular order.
Sorting has no meaning while inserting rows and is completely pointless there. You need an ORDER BY only while projecting/selecting the rows.
That is how the Oracle RDBMS is designed.
I'm pretty sure that Oracle does not guarantee to store rows in a table in any specific order (even if the rows were inserted in that order).
Performance and storage considerations far outweigh ordering considerations (as every user might have a different preferred order).
Why not just use an "ORDER BY" clause in your SELECT statement?
Or better yet, create a VIEW that already has the ORDER BY clause in it?
CREATE VIEW your_table_ordered AS
SELECT *
FROM your_table
ORDER BY your_column;

Insert into combined with select where

Let's say we have a query like this (my actual query is similar to this but pretty long):
insert into t1(id1,c1,c2)
select id1,c1,c2 from t2
where not exists(select * from t1 where t1.id1=t2.id1-1)
Does this query select first and insert all, or insert each selected item one by one?
It matters because I'm trying to insert a record depending on the previously inserted records, and it doesn't seem to work.
First the select query is run, so it will select all the rows that match your filter. After that the insert is performed. There is no row-by-row insertion when you use a single statement.
Still, if you want to do something recursive that will check after each insert, you can use CTEs (Common Table Expressions): http://msdn.microsoft.com/en-us/library/ms190766(v=sql.105).aspx
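A minimal sketch of the recursive shape in T-SQL (hypothetical values; it assumes c1 and c2 are nullable or have defaults):
WITH seq (id1) AS
(
    SELECT 1
    UNION ALL
    SELECT id1 + 1 FROM seq WHERE id1 < 10  -- each step can reference the previous one
)
INSERT INTO t1 (id1)
SELECT id1 FROM seq;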
This runs a select statement one time and then inserts based on that. It is much more efficient that way.
Since you already know what you will be inserting, you should be able to handle this in your select query rather than looking at what you have already inserted.

Subtracting minimum value from all values in a column

Is there another way to subtract the smallest value from all the values of a column, effectively offsetting the values?
The only way I have found becomes horribly complicated for more complex queries.
CREATE TABLE offsettest(value NUMBER);
INSERT INTO offsettest VALUES(100);
INSERT INTO offsettest VALUES(200);
INSERT INTO offsettest VALUES(300);
INSERT INTO offsettest VALUES(400);
SELECT value - (SELECT MIN(value) FROM offsettest) FROM offsettest;
DROP TABLE offsettest;
I'd like to limit it to a single query (no stored procedures, variables, etc) if possible and standard SQL is preferred (although I am using Oracle).
I believe this works as of ANSI 1999.
SELECT value - MIN(value) OVER() FROM offsettest;
It would have helped to see your actual query, though: depending on whether you need to manipulate more than one column this way, and whether the various minimums come from different rows, there may be more efficient ways to do it. If the OVER() works for you, then fine.
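If several columns do need the same treatment, the window-function approach extends naturally (a sketch, assuming a second hypothetical column other_value):
SELECT value       - MIN(value)       OVER () AS value_offset,
       other_value - MIN(other_value) OVER () AS other_offset
FROM offsettest;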