SQL Delete Rows Based on Multiple Criteria - sql

I am trying to delete rows from a data set based on multiple criteria, but I am receiving a syntax error. Here is the current code:
With cte As (
Select *,
Row_Number() Over(Partition By ID, Numb1 Order by ID) as RowNumb
from DataSet
)
Delete from cte Where RowNumb > 1;
Where DataSet looks like this:
I want to delete all records in which the ID and the Numb1 are the same. So I would expect the code to delete all rows except:

I am not very experienced with Vertica but it seems like it is not very flexible about delete statements.
One way to do it would be to use a temporary table to store the rows that you want to keep, then truncate the original table, and insert back into it from the temp table:
create temporary table MyTempTable as
select id, numb1, state_coding
from (select t.*, count(*) over(partition by id, numb1) cnt from DataSet) as t
where cnt = 1;
truncate table DataSet;
insert into DataSet
select id, numb1, state_coding from MyTempTable;
Note that I used a window count instead of row_number. This will remove records for which at least another record exists with the same id and numb1, which is what I understand that you want from your sample data and expected results.
Important: make sure to back up your entire table before you do this!
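The keep/empty/re-insert steps above can be tried end to end with a minimal sketch. It runs against an in-memory SQLite database through Python's sqlite3 module, not Vertica, so it is only an illustration of the logic: SQLite has no TRUNCATE, so a plain DELETE stands in for it, and window functions are assumed to be available (SQLite 3.25 or later). The sample rows are a subset of the data retyped further down the page.

```python
import sqlite3

# A few rows from the sample data; (204281, 146199) appears twice.
rows = [
    (202003, 4718868, "D"), (202003, 35756, "AA"),
    (204281, 146199, "D"), (204281, 146199, "D"),
    (204389, 14642, "DD"), (204389, 96504, "F"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DataSet (id INTEGER, numb1 INTEGER, state_coding TEXT)")
conn.executemany("INSERT INTO DataSet VALUES (?, ?, ?)", rows)

conn.executescript("""
-- Keep only rows whose (id, numb1) pair occurs exactly once.
CREATE TEMPORARY TABLE MyTempTable AS
SELECT id, numb1, state_coding
FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY id, numb1) AS cnt
      FROM DataSet AS t)
WHERE cnt = 1;

-- SQLite has no TRUNCATE; DELETE without WHERE empties the table.
DELETE FROM DataSet;

INSERT INTO DataSet
SELECT id, numb1, state_coding FROM MyTempTable;
""")

remaining = conn.execute(
    "SELECT id, numb1, state_coding FROM DataSet ORDER BY id, numb1"
).fetchall()
print(remaining)
```

Both copies of the duplicated pair are gone afterwards, which is the difference from a row_number() filter that would keep one of them.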

WITH Clauses in Vertica only support SELECT or INSERT, not DELETE/UPDATE.
Vertica Documentation

The CTE is a temporary named result set, not a table you can modify here. You cannot delete from it; it is effectively read-only.
If you are trying to delete duplicates out of the original DataSet table, you have to delete from the DataSet, not from the cte table.
Try this:
with cte as
(
select
ID,
Row_Number() Over(Partition By ID, Numb1 Order by ID) as RowNumb
from
DataSet
)
delete from DataSet where ID in (select ID from cte where RowNumb > 1)

You can't delete from CTEs. Use a plain DELETE statement instead, but run it inside a transaction that you can roll back, or, if you have the permissions, copy the table and test on the copy.

You'd have saved me ~5 min had you pasted the data as text and not as a picture, as I could not copy-paste and had to retype it ...
Having said that:
Rebuild the table here:
DROP TABLE IF EXISTS input;
CREATE TABLE input(id,numb1,state_coding) AS (
SELECT 202003,4718868,'D'
UNION ALL SELECT 202003, 35756,'AA'
UNION ALL SELECT 204281, 146199,'D'
UNION ALL SELECT 204281, 146199,'D'
UNION ALL SELECT 204346, 108094,'D'
UNION ALL SELECT 204346, 108094,'D'
UNION ALL SELECT 204389, 14642,'DD'
UNION ALL SELECT 204389, 96504,'F'
UNION ALL SELECT 204392, 22010,'D'
UNION ALL SELECT 204392, 8051,'G'
UNION ALL SELECT 204400, 74118,'D'
UNION ALL SELECT 204400, 103900,'D'
UNION ALL SELECT 204406,1387304,'D'
UNION ALL SELECT 204406, 0,'HJ'
UNION ALL SELECT 204516, 894,'D'
UNION ALL SELECT 204516, 3927,'D'
UNION ALL SELECT 204586, 234235,'D'
UNION ALL SELECT 204586, 234235,'D'
)
;
And then:
Based on what was said in other responses, and keeping in mind that a mass delete of an important part of the table, not only in Vertica, is best implemented as an INSERT ... SELECT with inverted WHERE condition - here goes:
CREATE TABLE input_help AS
SELECT * FROM input
GROUP BY id,numb1,state_coding
HAVING COUNT(*) = 1;
DROP TABLE input;
ALTER TABLE input_help RENAME TO input;
At least, it works with that simplicity if the whole row is the same - I notice you don't put state_coding into the condition yourself. Otherwise, it gets slightly more complicated.
Or did you want to re-insert one row of the duplicates each afterwards?
Then, just build input_help as SELECT DISTINCT * FROM input; , then drop, then rename.
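The difference between the two variants (drop every duplicated row versus keep one copy of each) can be seen in a small sketch. It uses Python's sqlite3 module with an in-memory database rather than Vertica, and the three rows are taken from the sample data above; the variable names are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input (id INTEGER, numb1 INTEGER, state_coding TEXT)")
conn.executemany(
    "INSERT INTO input VALUES (?, ?, ?)",
    [(204281, 146199, "D"), (204281, 146199, "D"), (204389, 14642, "DD")],
)

# Variant 1: drop every row that has an exact duplicate.
drop_all_dupes = sorted(conn.execute(
    "SELECT id, numb1, state_coding FROM input "
    "GROUP BY id, numb1, state_coding HAVING COUNT(*) = 1"
).fetchall())

# Variant 2: keep one row per group of identical rows.
keep_one_each = sorted(conn.execute(
    "SELECT DISTINCT id, numb1, state_coding FROM input"
).fetchall())

print(drop_all_dupes)
print(keep_one_each)
```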

Related

Inserting Multiple of same records into SQL temp table based on value in column

So I have table with the following records:
I want to create a script to iteratively look at the Cnt_Repeat column and insert that same record in a temp table X times depending on the value in Cnt_Repeat so it would look like the following table:
One method supported by most databases is the use of recursive CTEs. The exact syntax might vary, but the idea is:
with cte as (
select loannum, document, cnt_repeat, 1 as lev
from t
union all
select loannum, document, cnt_repeat, lev + 1
from cte
where lev < cnt_repeat
)
select loannum, document, cnt_repeat
from cte;
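As a runnable illustration of the recursive CTE idea, here is a sketch against SQLite via Python's sqlite3 module. SQLite is only one of the databases where this works, and it is written with WITH RECURSIVE here; the table name t and the two sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (loannum INTEGER, document TEXT, cnt_repeat INTEGER)")
# Hypothetical sample rows: loan 1 should appear 3 times, loan 2 once.
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [(1, "A", 3), (2, "B", 1)])

expanded = conn.execute("""
    WITH RECURSIVE cte AS (
        -- Anchor: every source row once, at level 1.
        SELECT loannum, document, cnt_repeat, 1 AS lev
        FROM t
        UNION ALL
        -- Recursive step: repeat the row until lev reaches cnt_repeat.
        SELECT loannum, document, cnt_repeat, lev + 1
        FROM cte
        WHERE lev < cnt_repeat
    )
    SELECT loannum, document, cnt_repeat
    FROM cte
    ORDER BY loannum
""").fetchall()
print(expanded)
```

Note that a row with cnt_repeat of 0 would still come out once, because the anchor part emits every row before the WHERE check applies.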

Remove duplicates from table in bigquery

I found duplicates in my table by doing below query.
SELECT name, id, count(1) as count
FROM [myproject:dev.sample]
group by name, id
having count(1) > 1
Now I would like to remove these duplicates based on id and name by using a DML statement, but it shows a '0 rows affected' message.
Am I missing something?
DELETE FROM PRD.GPBP WHERE
id not in(select id from [myproject:dev.sample] GROUP BY id) and
name not in (select name from [myproject:dev.sample] GROUP BY name)
I suggest you create a new table without the duplicates, drop your original table, and rename the new table to the original name.
You can build the de-duplicated table like below:
Create table new_table as
Select name, id, ...... -- put your remaining 10 cols here
FROM(
SELECT *,
ROW_NUMBER() OVER(Partition by name , id Order by id) as rnk
FROM [myproject:dev.sample]
)a
WHERE rnk = 1;
Then drop the older table and rename new_table with old table name.
Below query (BigQuery Standard SQL) should be more optimal for de-duping like in your case
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id
If you run it from within the UI, you can just set Write Preference to Overwrite Table and you are done.
Or, if you want, you can use a DML INSERT into a new table and then copy it over the original one.
Meanwhile, the easiest way is as below (using DDL):
#standardSQL
CREATE OR REPLACE TABLE `myproject.dev.sample` AS
SELECT * FROM (
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id
)

Create a UNION query that identifies which table the unique data came from

I have two tables with data. Both tables have a CUSTOMER_ID column (which is numeric). I am trying to get a list of all the unique values for CUSTOMER_ID and know whether or not the CUSTOMER_ID exists in both tables or just one (and which one).
I can easily get a list of the unique CUSTOMER_ID:
SELECT tblOne.CUSTOMER_ID
FROM tblOne
UNION
SELECT tblTwo.CUSTOMER_ID
FROM tblTwo
I can't just add an identifier column to the SELECT statement (like: SELECT tblOne.CUSTOMER_ID, "Table1" AS DataSource) because then the records wouldn't be unique and it would return both sets of data.
I feel I need to add it somewhere else in this query but am not sure how.
Edit for clarity:
For the union query output I need an additional column that can tell me if the unique value I am seeing exists in: (1) both tables, (2) table one, or (3) table two.
If the CUSTOMER_ID appears in both tables then we'll have to arbitrarily pick which table to call the source. The following query uses "tblOne" as the [SourceTable] in that case:
SELECT
CUSTOMER_ID,
MIN(Source) AS SourceTable,
COUNT(*) AS TableCount
FROM
(
SELECT DISTINCT
CUSTOMER_ID,
"tblOne" AS Source
FROM tblOne
UNION ALL
SELECT DISTINCT
CUSTOMER_ID,
"tblTwo" AS Source
FROM tblTwo
)
GROUP BY CUSTOMER_ID
Gord Thompson's answer is correct. But, it is not necessary to do a distinct in the subqueries. And, you can return a single column with the information you are looking for:
select customer_id,
iif(min(which) = max(which), min(which), "both") as DataSource
from (select customer_id, "tblone" as which
from tblOne
UNION ALL
select customer_id, "tbltwo" as which
from tblTwo
) t
group by customer_id
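The min/max trick can be checked with a small sketch. It runs on SQLite through Python's sqlite3 module rather than Access, so CASE replaces iif and string literals use single quotes; the sample IDs are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblOne (customer_id INTEGER)")
conn.execute("CREATE TABLE tblTwo (customer_id INTEGER)")
# Hypothetical data: 1 only in tblOne, 2 in both (twice in tblOne), 3 only in tblTwo.
conn.executemany("INSERT INTO tblOne VALUES (?)", [(1,), (2,), (2,)])
conn.executemany("INSERT INTO tblTwo VALUES (?)", [(2,), (3,)])

sources = conn.execute("""
    SELECT customer_id,
           -- If the smallest and largest labels agree, only one table contributed.
           CASE WHEN MIN(which) = MAX(which) THEN MIN(which) ELSE 'both' END
               AS data_source
    FROM (SELECT customer_id, 'tblOne' AS which FROM tblOne
          UNION ALL
          SELECT customer_id, 'tblTwo' AS which FROM tblTwo) AS t
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(sources)
```

Duplicates inside one table do not matter here, since MIN and MAX only look at which labels occur, not how often.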
We could add an identifier column with the integer data type and then do an outer query:
SELECT
CUSTOMER_ID,
sum([Table]) AS TableSum
FROM
(
SELECT
DISTINCT CUSTOMER_ID,
1 AS [Table]
FROM tblOne
UNION
SELECT
DISTINCT CUSTOMER_ID,
2 AS [Table]
FROM tblTwo
)
GROUP BY CUSTOMER_ID
So if the sum is 1, the ID comes from tblOne; if it is 2, it comes from tblTwo; and if it is 3, it exists in both.
If you want to add a 3rd table in the union, give it a value of 4 so that you have a unique sum for each combination.

SQL Server - Inserting a specific set of rows from one table to another table

I have a table called table_one with about 7 million rows.
I want to insert rows 0 to 1 million into a new table (table_two), and then insert rows 1 million to 2 million into the same table.
SET ROWCOUNT 1000000
How can this be achieved? Is there a way to specify a range of rows to be inserted?
You can use row_number:
;with cte as (
select
*,
row_number() over(order by some_field ) as rn
from table_one
)
insert into table_two ( fields )
select fields from cte
where rn <= 1000000
If you can get the start and end IDs in your old table, you can do something like this:
INSERT INTO NewTable (...)
SELECT ... FROM OldTable
WHERE OldTableID BETWEEN @StartID AND @EndID
If you don't already have a useful ID, use danihp's solution using ROW_NUMBER().
If you don't have a range of ids, you can generate them using row_number():
with toinsert as (
select *, row_number() over (order by <whatever>) as rownum
from OldTable
)
insert into NewTable(...)
select ... from toinsert
where rownum between @StartID and @EndID
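To show how the row-number ranges turn into consecutive batches, here is a sketch using Python's sqlite3 module and an in-memory SQLite database instead of SQL Server. The column names, the 10-row table, and the batch size of 4 are stand-ins chosen for the example, and window functions are assumed to be available (SQLite 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_one (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE table_two (id INTEGER PRIMARY KEY, payload TEXT)")
# Hypothetical source rows with gaps in the IDs (10, 20, ..., 100).
conn.executemany(
    "INSERT INTO table_one VALUES (?, ?)",
    [(i * 10, f"row{i}") for i in range(1, 11)],
)

BATCH = 4  # stand-in for 1,000,000
total = conn.execute("SELECT COUNT(*) FROM table_one").fetchone()[0]

batch_sizes = []
for start in range(1, total + 1, BATCH):
    # Number the rows in a stable order, then copy one slice of row numbers.
    cur = conn.execute(
        """
        INSERT INTO table_two (id, payload)
        SELECT id, payload
        FROM (SELECT id, payload,
                     ROW_NUMBER() OVER (ORDER BY id) AS rn
              FROM table_one)
        WHERE rn BETWEEN ? AND ?
        """,
        (start, start + BATCH - 1),
    )
    batch_sizes.append(cur.rowcount)
conn.commit()

copied_ids = [r[0] for r in conn.execute("SELECT id FROM table_two ORDER BY id")]
print(batch_sizes, copied_ids)
```

The ORDER BY inside ROW_NUMBER() needs to be deterministic (a unique key, for instance); otherwise rows could be numbered differently between batches and end up skipped or copied twice.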
If you are interested in getting exact number of rows, you might employ TOP:
insert into Table2
select top 1000000 *
from Table1
order by ... -- ID? Or newid() if you want random rows.
You might be better off exporting the entire table in bulk import format, splitting it as a text file, then bulk importing the seven or so pieces into the several tables.
Of course there may be keys in the original table that make it possible to do with SQL INSERT operations, but this requires information not provided in the Question posted.

Fastest way to identify differences between two tables?

I have a need to check a live table against a transactional archive table and I'm unsure of the fastest way to do this...
For instance, let's say my live table is made up of these columns:
Term
CRN
Fee
Level Code
My archive table would have the same columns, but also have an archive date so I can see what values the live table had at a given date.
Now... How would I write a query to ensure that the values for the live table are the same as the most recent entries in the archive table?
PS I'd prefer to handle this in SQL, but PL/SQL is also an option if it's faster.
SELECT term, crn, fee, level_code
FROM live_data
MINUS
SELECT term, crn, fee, level_code
FROM historical_data
This returns what's in live but not in historical. You can then union it with the reverse query to get what's in historical but not in live.
Simply:
(SELECT collist
FROM TABLE A
minus
SELECT collist
FROM TABLE B)
UNION ALL
(SELECT collist
FROM TABLE B
minus
SELECT collist
FROM TABLE A);
You didn't mention how rows are uniquely identified, so I've assumed you also have an "id" column:
SELECT *
FROM livetable
WHERE (term, crn, fee, levelcode) NOT IN (
SELECT FIRST_VALUE(term) OVER (ORDER BY archivedate DESC)
,FIRST_VALUE(crn) OVER (ORDER BY archivedate DESC)
,FIRST_VALUE(fee) OVER (ORDER BY archivedate DESC)
,FIRST_VALUE(levelcode) OVER (ORDER BY archivedate DESC)
FROM archivetable
WHERE livetable.id = archivetable.id
);
Note: This query doesn't take NULLS into account - if any of the columns are nullable you can add suitable logic (e.g. NVL each column to some "impossible" value).
unload to table1.unl
select * from table1
order by 1,2,3,4
unload to table2.unl
select * from table2
order by 1,2,3,4
diff table1.unl table2.unl > diff.unl
Could you use a query of the form:
SELECT your columns FROM your live table
EXCEPT
SELECT your columns FROM your archive table WHERE archive date is most recent;
Any results will be rows in your live table that are not in your most recent archive.
If you also need rows in your most recent archive that are not in your live table, simply reverse the order of the selects and repeat, or get them all in the same query by performing a (live UNION archive) EXCEPT (live INTERSECT archive).
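A two-way comparison against the most recent archive entries might look like the sketch below. It uses SQLite through Python's sqlite3 module, so EXCEPT takes the place of Oracle's MINUS, and it assumes that (term, crn) identifies a row, which the question does not state. The table contents and the archive_date column name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE live_data (term INTEGER, crn INTEGER, fee INTEGER, level_code TEXT);
CREATE TABLE historical_data (term INTEGER, crn INTEGER, fee INTEGER,
                              level_code TEXT, archive_date TEXT);
INSERT INTO live_data VALUES (202010, 1001, 50, 'UG'), (202010, 1002, 75, 'GR');
INSERT INTO historical_data VALUES
  (202010, 1001, 40, 'UG', '2020-01-01'),
  (202010, 1001, 50, 'UG', '2020-02-01'),
  (202010, 1002, 70, 'GR', '2020-02-01');
""")

differences = conn.execute("""
    WITH latest AS (
        -- Most recent archive row per (term, crn); the key is an assumption.
        SELECT term, crn, fee, level_code
        FROM historical_data AS h
        WHERE archive_date = (SELECT MAX(archive_date)
                              FROM historical_data AS h2
                              WHERE h2.term = h.term AND h2.crn = h.crn)
    )
    SELECT 'live_only' AS side, d.* FROM (
        SELECT term, crn, fee, level_code FROM live_data
        EXCEPT
        SELECT term, crn, fee, level_code FROM latest
    ) AS d
    UNION ALL
    SELECT 'archive_only' AS side, d.* FROM (
        SELECT term, crn, fee, level_code FROM latest
        EXCEPT
        SELECT term, crn, fee, level_code FROM live_data
    ) AS d
    ORDER BY 1, 2, 3
""").fetchall()
print(differences)
```

An empty result means the live table matches the latest archive rows; each half is wrapped in its own subquery so the two set differences are computed before they are combined.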