How to insert rows of data in order using MERGE Statement? - sql

I would like to have the inserted rows in the same order as in the
source select statement - i.e. ORDER BY TMP.DEF_DATA_SK. But they are
inserted somewhat randomly.
With Simple Insert into Select Statement it can be Done But i Want it to be done using MERGE.
SQL is as follows
MERGE
INTO
HCI_STD_STAGING.STAGE.DEF_DATA
TRG
USING
( SELECT TMP.DEF_DATA_SK,
TMP.VAL,
TMP.CD,
TMP.DESCR,
TMP.DEF_TP_SK TYPE_SK ,--
TMP.PRN_SK PARENT, --
PRN.DEF_DATA_SK PRN_SK, PRN.VAL PRN_VAL,
TYP.DEF_TP_SK
,PRN_PRN.VAL DB
,ROW_NUMBER() OVER (ORDER BY TMP.DEF_DATA_SK) AS RowNum
FROM
HCI_STD_STAGING.STAGE._DEF_DATA_TMP TMP
LEFT JOIN HCI_STD_STAGING.STAGE.DEF_TP TYP
ON TMP.DEF_TP_SK = TYP.CD --TYPE
LEFT JOIN HCI_STD_STAGING.STAGE.DEF_DATA PRN
ON TMP.PRN_SK = PRN.VAL -- SCH
INNER JOIN HCI_STD_STAGING.STAGE.DEF_DATA PRN_PRN
ON PRN.PRN_SK = PRN_PRN.DEF_DATA_SK AND TMP.DB = PRN_PRN.VAL --AND
TMP.SCH = PRN.VAL
WHERE TMP.DEF_TP_SK = 'Table Object'
GROUP BY
TMP.DEF_DATA_SK,
TMP.VAL,
TMP.CD,
TMP.DESCR,
TMP.DEF_TP_SK ,
TMP.PRN_SK ,
PRN.DEF_DATA_SK , PRN.VAL ,
TYP.DEF_TP_SK
,PRN_PRN.VAL
--order by TMP.DEF_DATA_SK
) SRC
ON SRC.VAL = TRG.VAL
AND SRC.PRN_SK = TRG.PRN_SK
AND SRC.DEF_TP_SK = TRG.DEF_TP_SK
WHEN NOT MATCHED
THEN
INSERT
(
VAL,CD, DESCR, DEF_TP_SK, PRN_SK
)
VALUES ( SRC.VAL, SRC.CD,SRC.DESCR,SRC.DEF_TP_SK,SRC.PRN_SK );

There is no guarantee for the order of insertion of the rows into the table.
However, if your table has primary key, the records will be ordered by the primary key because upon the creation of primary key, it will also create a clustered index on the table based on that primary key. If the primary key defined as identity, the only guarantee is that the identity values will be generated based on the ORDER BY clause.
But if you don't want to put primary key on the table, you can just create clustered index on the table based on a column. There are some other concerns about clustered index to be considered based on your needs. Please take a look at clustered index as follows
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-2017
so you could get the rows stored as you expected.

Related

Copying the rows within the same table

I need to move what's been appended at the end of my table to its very beginning, however
the same record is being copied into destination.
In other words, between the ids 1 and 3567 I only have the record from the id 3567 repeated until the end. I believe that my outer and even inner sub-query lacks something ?
Thanks for the hint
Query:
UPDATE dbo.TABLE
SET Xwgs = dt.Xwgs, Ywgs = dt.Ywgs
FROM
(
SELECT
Xwgs,
Ywgs
FROM dbo.TABLE
WHERE
Id BETWEEN 3567 AND 7243
) dt
WHERE
Id BETWEEN 1 AND 3566
Is this what you want?
update t
set xwgs = dt.xwgs, ywgs = dt.ywgs
from mytable t
inner join (
select xwgs, ywgs
from mytable
where id between 3567 and 7243
) dt
on t.id = dt.id - 3566
The main difference with your query is that it properly correlates the target table and the derived table.
Note that this does not actually move the rows; all it does is copy the values from the upper bucket to the corresponding value in the lower bucket.
You know that You can always sort Your table with ORDER BY id DESC right?
Sometimes its needed do something strange. I do it like that:
Copy the whole table into a temp table (it may be #temporary table)
Drop or Truncate or Delete records from that table
Insert those records again from my temp table
Drop temp table
But an UPDATE is also a solution.
Tip: You can allow inserting values into identity (autoincreament) id column with SET IDENTITY_INSERT
SELECT *
INTO tmp__MyTable -- this will create a new table
FROM MyTable
ORDER BY id
DELETE FROM dbo.MyTable -- will throw an error on foreign keys conflicts
INSERT INTO MyTable (col,col2) -- column list here
SELECT col,col2
FROM tmp__MyTable
ORDER BY id DESC
-- or something like that:
-- ORDER BY CASE WHEN id <= 3566 THEN -id ELSE id END
-- DROP TABLE tmp__MyTable

How to insert bulk data without changing order of item into table using merge statement

I wrote a stored procedure that can insert bulk data into table using the merge statement.
Problem is that when I insert itemid 1024,1000,1012,1025 in this order, then SQL Server automatically changes order of itemid 1000,1012,1024,1025.
I want to insert data that I actually pass.
Here is sample code. This will parse XML string into table object:
DECLARE #tblPurchase TABLE
(
Purchase_Detail_ID INT ,
Purchase_ID INT ,
Head_ID INT ,
Item_ID INT
);
INSERT INTO #tblPurchase (Purchase_Detail_ID, Purchase_ID, Head_ID, Item_ID)
SELECT
Tbl.Col.value('Purchase_Detail_ID[1]', 'INT') AS Purchase_Detail_ID,
Tbl.Col.value('Purchase_ID[1]', 'INT') AS Purchase_ID,
Tbl.Col.value('Head_ID[1]', 'INT') AS Head_ID,
Tbl.Col.value('Item_ID[1]', 'INT') AS Item_ID
FROM
#PurchaseDetailsXML.nodes('/documentelement/TRN_Purchase_Details') Tbl(Col)
This will insert bulk data into the TRN_Purchase_Details table:
MERGE TRN_Purchase_Details MTD
USING (SELECT
Purchase_Detail_ID,
Id AS Purchase_ID,
Head_ID, Item_ID
FROM
#tblPurchase
LEFT JOIN
#ChangeResult ON 1 = 1) AS TMTD ON MTD.Purchase_Detail_ID = TMTD.Purchase_Detail_ID
AND MTD.Purchase_ID = TMTD.Purchase_ID
WHEN MATCHED THEN
UPDATE SET MTD.Head_ID = TMTD.Head_ID,
MTD.Item_ID = TMTD.Item_ID
WHEN NOT MATCHED BY TARGET THEN
INSERT (Purchase_ID, Head_ID, Item_ID)
VALUES (Purchase_ID, Head_ID, Item_ID)
WHEN NOT MATCHED BY SOURCE AND
MTD.Purchase_ID = (SELECT TOP 1 Id
FROM #ChangeResult
WHERE Id > 0) THEN
DELETE;
Rows in a SQL table don't have any order. They come back in indeterminate order unless you specify an order by.
Try adding an identity column to your temporary table?
DECLARE #tblPurchase TABLE
(
ID int identity,
Purchase_Detail_ID INT ,
The identity column might capture the order of the XML elements.
If that doesn't work, you can calculate the position of the elements in the XML and store that position in the temporary table.
As mentioned elsewhere, data in a table is stored as an unordered set. If you need to be able to go back to your table after data is inserted and determine the order that it was inserted, you'll have to add a column to the table schema to record that information.
It could be something as simple as adding an IDENTITY column, which will increment on each row addition, or perhaps a column with a DATETIME data type and a GETDATE() default value so you not only know the order rows were added, but exactly when that happened.

Performance difference between Select count(ID) and Select count(*)

There are two queries below which return count of ID column excluding NULL values
and second query will return the count of all the rows from the table including NULL rows.
select COUNT(ID) from TableName
select COUNT(*) from TableName
My Confusion :
Is there any performance difference ?
TL/DR: Plans might not be the same, you should test on appropriate
data and make sure you have the correct indexes and then choose the best solution based on your investigations.
The query plans might not be the same depending on the indexing and the nullability of the column which is used in the COUNT function.
In the following example I create a table and fill it with one million rows.
All the columns have been indexed except column 'b'.
The conclusion is that some of these queries do result in the same execution plan but most of them are different.
This was tested on SQL Server 2014, I do not have access to an instance of 2012 at this moment. You should test this yourself to figure out the best solution.
create table t1(id bigint identity,
dt datetime2(7) not null default(sysdatetime()),
a char(800) null,
b char(800) null,
c char(800) null);
-- We will use these 4 indexes. Only column 'b' does not have any supporting index on it.
alter table t1 add constraint [pk_t1] primary key NONCLUSTERED (id);
create clustered index cix_dt on t1(dt);
create nonclustered index ix_a on t1(a);
create nonclustered index ix_c on t1(c);
insert into T1 (a, b, c)
select top 1000000
a = case when low = 1 then null else left(REPLICATE(newid(), low), 800) end,
b = case when low between 1 and 10 then null else left(REPLICATE(newid(), 800-low), 800) end,
c = case when low between 1 and 192 then null else left(REPLICATE(newid(), 800-low), 800) end
from master..spt_values
cross join (select 1 from master..spt_values) m(ock)
where type = 'p';
checkpoint;
-- All rows, no matter if any columns are null or not
-- Uses primary key index
select count(*) from t1;
-- All not null,
-- Uses primary key index
select count(id) from t1;
-- Some values of 'a' are null
-- Uses the index on 'a'
select count(a) from t1;
-- Some values of b are null
-- Uses the clustered index
select count(b) from t1;
-- No values of dt are null and the table have a clustered index on 'dt'
-- Uses primary key index and not the clustered index as one could expect.
select count(dt) from t1;
-- Most values of c are null
-- Uses the index on c
select count(c) from t1;
Now what would happen if we were more explicit in what we wanted our count to do? If we tell the query planner that we want to get only rows which have not null, will that change anything?
-- Homework!
-- What happens if we explicitly count only rows where the column is not null? What if we add a filtered index to support this query?
-- Hint: It will once again be different than the other queries.
create index ix_c2 on t1(c) where c is not null;
select count(*) from t1 where c is not null;

Remove duplicates from table based on multiple criteria and persist to other table

I have a taccounts table with columns like account_id(PK), login_name, password, last_login. Now I have to remove some duplicate entries according to a new business logic.
So, duplicate accounts will be with either same email or same (login_name & password). The account with the latest login must be preserved.
Here are my attempts (some email values are null and blank)
DELETE
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0 and last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0
GROUP BY lower(trim(both ' ' from email)))
Similarly for login_name and password
DELETE
FROM taccounts
WHERE last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
GROUP BY login_name, password)
Is there any better way or any way to combine these two separate queries?
Also some other table have account_id as foreign key. How to update this change for those tables?`
I am using PostgreSQL 9.2.1
EDIT: Some of the email values are null and some of them are blank(''). So, If two accounts have different login_name & password and their emails are null or blank, then they must be considered as two different accounts.
If most of the rows are deleted (mostly dupes) and the table fits into RAM, consider this route:
SELECT surviving rows into a temporary table.
Reroute FK references to survivors
DELETE all rows from the base table.
Re-INSERT survivors.
1a. Distill surviving rows
CREATE TEMP TABLE tmp AS
SELECT DISTINCT ON (login_name, password) *
FROM (
SELECT DISTINCT ON (email) *
FROM taccounts
ORDER BY email, last_login DESC
) sub
ORDER BY login_name, password, last_login DESC;
About DISTINCT ON:
Select first row in each GROUP BY group?
To identify duplicates for two different criteria, use a subquery to apply the two rules one after the other. The first step preserves the account with the latest last_login, so this is "serializable".
Inspect results and test for plausibility.
SELECT * FROM tmp;
Temporary tables are dropped automatically at the end of a session. In pgAdmin (which you seem to be using) the session lives as long as the editor window is open.
1b. Alternative query for updated definition of "duplicates"
SELECT *
FROM taccounts t
WHERE NOT EXISTS (
SELECT FROM taccounts t1
WHERE ( NULLIF(t1.email, '') = t.email
OR (NULLIF(t1.login_name, ''), NULLIF(t1.password, '')) = (t.login_name, t.password))
AND (t1.last_login, t1.account_id) > (t.last_login, t.account_id)
);
This doesn't treat NULL or empty string ('') as identical in any of the "duplicate" columns.
The row expression (t1.last_login, t1.account_id) takes care of the possibility that two dupes could share the same last_login. The one with the bigger account_id is chosen in this case - which is unique, since it is the PK.
2a. How to identify all incoming FKs
SELECT c.confrelid::regclass::text AS referenced_table
, c.conname AS fk_name
, pg_get_constraintdef(c.oid) AS fk_definition
FROM pg_attribute a
JOIN pg_constraint c ON (c.conrelid, c.conkey[1]) = (a.attrelid, a.attnum)
WHERE c.confrelid = 'taccounts'::regclass -- (schema-qualified) table name
AND c.contype = 'f'
ORDER BY 1, contype DESC;
Only building on the first column of the foreign key. More about that:
Find the referenced table name using table, field and schema name
Or inspect the Dependents rider in the right hand window of the object browser of pgAdmin after selecting the table taccounts.
2b. Reroute to new primary
If you have tables referencing taccounts (incoming foreign keys to taccounts) you will want to update all those fields, before you delete the dupes.
Reroute all of them to the new primary row:
UPDATE referencing_tbl r
SET referencing_column = tmp.reference_column
FROM tmp
JOIN taccounts t1 USING (email)
WHERE r.referencing_column = t1.referencing_column
AND referencing_column IS DISTINCT FROM tmp.reference_column;
UPDATE referencing_tbl r
SET referencing_column = tmp.reference_column
FROM tmp
JOIN taccounts t2 USING (login_name, password)
WHERE r.referencing_column = t1.referencing_column
AND referencing_column IS DISTINCT FROM tmp.reference_column;
3. & 4. Go in for the kill
Now, dupes are not referenced any more. Go in for the kill.
ALTER TABLE taccounts DISABLE TRIGGER ALL;
DELETE FROM taccounts;
VACUUM taccounts;
INSERT INTO taccounts
SELECT * FROM tmp;
ALTER TABLE taccounts ENABLE TRIGGER ALL;
Disable all triggers for the duration of the operation. This avoids checking for referential integrity during the operation. Everything should be fine once you re-activate triggers. We took care of all incoming FKs above. Outgoing FKs are guaranteed to be sound, since you have no concurrent write access and all values have been there before.
In addition to Erwin's excellent answer, it can often be useful to create in intermediate link-table that relates the old keys with the new ones.
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE taccounts
( account_id SERIAL PRIMARY KEY
, login_name varchar
, email varchar
, last_login TIMESTAMP
);
-- create some fake data
INSERT INTO taccounts(last_login)
SELECT gs FROM generate_series('2013-03-30 14:00:00' ,'2013-03-30 15:00:00' , '1min'::interval) gs
;
UPDATE taccounts
SET login_name = 'User_' || (account_id %10)::text
, email = 'Joe' || (account_id %9)::text || '#somedomain.tld'
;
SELECT * FROM taccounts;
--
-- Create (temp) table linking old id <--> new id
-- After inspection this table can be used as a source for the FK updates
-- and for the final delete.
--
CREATE TABLE update_ids AS
WITH pairs AS (
SELECT one.account_id AS old_id
, two.account_id AS new_id
FROM taccounts one
JOIN taccounts two ON two.last_login > one.last_login
AND ( two.email = one.email OR two.login_name = one.login_name)
)
SELECT old_id,new_id
FROM pairs pp
WHERE NOT EXISTS (
SELECT * FROM pairs nx
WHERE nx.old_id = pp.old_id
AND nx.new_id > pp.new_id
)
;
SELECT * FROM update_ids
;
UPDATE other_table_with_fk_to_taccounts dst
SET account_id. = ids.new_id
FROM update_ids ids
WHERE account_id. = ids.old_id
;
DELETE FROM taccounts del
WHERE EXISTS (
SELECT * FROM update_ids ex
WHERE ex.old_id = del.account_id
);
SELECT * FROM taccounts;
Yet another way to accomplish the same is to add a column with a pointer to the preferred key to the table itself and use that for your updates and deletes.
ALTER TABLE taccounts
ADD COLUMN better_id INTEGER REFERENCES taccounts(account_id)
;
-- find the *better* records for each record.
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.last_login > dst.last_login
AND src.email IS NOT NULL
AND NOT EXISTS (
SELECT * FROM taccounts nx
WHERE nx.login_name = dst.login_name
AND nx.email IS NOT NULL
AND nx.last_login > src.last_login
);
-- Find records that *do* have an email address
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.email IS NOT NULL
AND dst.email IS NULL
AND NOT EXISTS (
SELECT * FROM taccounts nx
WHERE nx.login_name = dst.login_name
AND nx.email IS NOT NULL
AND nx.last_login > src.last_login
);
SELECT * FROM taccounts ORDER BY account_id;
UPDATE other_table_with_fk_to_taccounts dst
SET account_id = src.better_id
FROM update_ids src
WHERE dst.account_id = src.account_id
AND src.better_id IS NOT NULL
;
DELETE FROM taccounts del
WHERE EXISTS (
SELECT * FROM taccounts ex
WHERE ex.account_id = del.better_id
);
SELECT * FROM taccounts ORDER BY account_id;

Tricky MS Access SQL query to remove surplus duplicate records

I have an Access table of the form (I'm simplifying it a bit)
ID AutoNumber Primary Key
SchemeName Text (50)
SchemeNumber Text (15)
This contains some data eg...
ID SchemeName SchemeNumber
--------------------------------------------------------------------
714 Malcolm ABC123
80 Malcolm ABC123
96 Malcolms Scheme ABC123
101 Malcolms Scheme ABC123
98 Malcolms Scheme DEF888
654 Another Scheme BAR876
543 Whatever Scheme KJL111
etc...
Now. I want to remove duplicate names under the same SchemeNumber. But I want to leave the record which has the longest SchemeName for that scheme number. If there are duplicate records with the same longest length then I just want to leave only one, say, the lowest ID (but any one will do really). From the above example I would want to delete IDs 714, 80 and 101 (to leave only 96).
I thought this would be relatively easy to achieve but it's turning into a bit of a nightmare! Thanks for any suggestions. I know I could loop it programatically but I'd rather have a single DELETE query.
See if this query returns the rows you want to keep:
SELECT r.SchemeNumber, r.SchemeName, Min(r.ID) AS MinOfID
FROM
(SELECT
SchemeNumber,
SchemeName,
Len(SchemeName) AS name_length,
ID
FROM tblSchemes
) AS r
INNER JOIN
(SELECT
SchemeNumber,
Max(Len(SchemeName)) AS name_length
FROM tblSchemes
GROUP BY SchemeNumber
) AS w
ON
(r.SchemeNumber = w.SchemeNumber)
AND (r.name_length = w.name_length)
GROUP BY r.SchemeNumber, r.SchemeName
ORDER BY r.SchemeName;
If so, save it as qrySchemes2Keep. Then create a DELETE query to discard rows from tblSchemes whose ID value is not found in qrySchemes2Keep.
DELETE
FROM tblSchemes AS s
WHERE Not Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID);
Just beware, if you later use Access' query designer to make changes to that DELETE query, it may "helpfully" convert the SQL to something like this:
DELETE s.*, Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID)
FROM tblSchemes AS s
WHERE (((Exists (SELECT * FROM qrySchemes2Keep WHERE MinOfID = s.ID))=False));
DELETE FROM Table t1
WHERE EXISTS (SELECT 1 from Table t2
WHERE t1.SchemeNumber = t2.SchemeNumber
AND Length(t2.SchemeName) > Length(t1.SchemeName)
)
Depend on your RDBMS you may use function different from Length (Oracle - length, mysql - length, sql server - LEN)
delete ShortScheme
from Scheme ShortScheme
join Scheme LongScheme
on ShortScheme.SchemeNumber = LongScheme.SchemeNumber
and (len(ShortScheme.SchemeName) < len(LongScheme.SchemeName) or (len(ShortScheme.SchemeName) = len(LongScheme.SchemeName) and ShortScheme.ID > LongScheme.ID))
(SQL Server flavored)
Now updated to include the specified tie resolution. Although, you may get better performance doing it in two queries: first deleting the schemes with shorter names as in my original query and then going back and deleting the higher ID where there was a tie in name length.
I'd do this in multiple steps. Large delete operations done in a single step make me too nervous -- what if you make a mistake? There's no sql 'undo' statement.
-- Setup the data
DROP Table foo;
DROP Table bar;
DROP Table bat;
DROP Table baz;
CREATE TABLE foo (
id int(11) NOT NULL,
SchemeName varchar(50),
SchemeNumber varchar(15),
PRIMARY KEY (id)
);
insert into foo values (714, 'Malcolm', 'ABC123' );
insert into foo values (80, 'Malcolm', 'ABC123' );
insert into foo values (96, 'Malcolms Scheme', 'ABC123' );
insert into foo values (101, 'Malcolms Scheme', 'ABC123' );
insert into foo values (98, 'Malcolms Scheme', 'DEF888' );
insert into foo values (654, 'Another Scheme ', 'BAR876' );
insert into foo values (543, 'Whatever Scheme ', 'KJL111' );
-- Find all the records that have dups, find the longest one
create table bar as
select max(length(SchemeName)) as max_length, SchemeNumber
from foo
group by SchemeNumber
having count(*) > 1;
-- Find the one we want to keep
create table bat as
select min(a.id) as id, a.SchemeNumber
from foo a join bar b on a.SchemeNumber = b.SchemeNumber
and length(a.SchemeName) = b.max_length
group by SchemeNumber;
-- Select into this table all the rows to delete
create table baz as
select a.id from foo a join bat b where a.SchemeNumber = b.SchemeNumber
and a.id != b.id;
This will give you a new table with only records for rows that you want to remove.
Now check these out and make sure that they contain only the rows you want deleted. This way you can make sure that when you do the delete, you know exactly what to expect. It should also be pretty fast.
Then when you're ready, use this command to delete the rows using this command.
delete from foo where id in (select id from baz);
This seems like more work because of the different tables, but it's safer probably just as fast as the other ways. Plus you can stop at any step and make sure the data is what you want before you do any actual deletes.
If your platform supports ranking functions and common table expressions:
with cte as (
select row_number()
over (partition by SchemeNumber order by len(SchemeName) desc) as rn
from Table)
delete from cte where rn > 1;
try this:
Select * From Table t
Where Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber )
And Id >
(Select Min (Id)
From Table
Where SchemeNumber = t.SchemeNumber
And SchemeName = t.SchemeName)
or this:,...
Select * From Table t
Where Id >
(Select Min(Id) From Table
Where SchemeNumber = t.SchemeNumber
And Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber))
if either of these selects the records that should be deleted, just change it to a delete
Delete
From Table t
Where Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber )
And Id >
(Select Min (Id)
From Table
Where SchemeNumber = t.SchemeNumber
And SchemeName = t.SchemeName)
or using the second construction:
Delete From Table t Where Id >
(Select Min(Id) From Table
Where SchemeNumber = t.SchemeNumber
And Len(SchemeName) <
(Select Max(Len(Schemename))
From Table
Where SchemeNumber = t.SchemeNumber))