Cannot recognize input near 'DELETE' 'FROM' 'CTE' in statement

I want to drop duplicates in mytable if there are identical values in col1.
WITH CTE AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1) AS RN
    FROM mytable
)
DELETE FROM CTE
WHERE RN <> 1
I got this error:
Cannot recognize input near 'DELETE' 'FROM' 'CTE' in statement

I don't think Hive supports that syntax for DELETE. Try this:
DELETE FROM mytable t
WHERE t.id > (SELECT MIN(t2.id) -- some sort of unique id
              FROM mytable t2
              WHERE t2.col1 = t.col1
             );
If you have complete duplicates, then the above won't work. In the most recent versions of Hive you can use MERGE. In older versions:
create table temp_t as
select distinct t.*
from mytable t;
truncate table mytable;
insert into mytable
select * from temp_t;
Of course, backup the table before trying this!
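For the MERGE route mentioned above, here is a minimal sketch, assuming mytable is a transactional (ACID) table and has a unique id column (the question only mentions col1, so the id column is an assumption):
MERGE INTO mytable AS t
USING (
    SELECT col1, MIN(id) AS keep_id  -- assumed unique id; picks one survivor per col1 group
    FROM mytable
    GROUP BY col1
) s
ON t.col1 = s.col1
WHEN MATCHED AND t.id <> s.keep_id THEN DELETE;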

Alternative way, assuming you have a unique ID column:
DELETE FROM mytable WHERE ID IN
    (SELECT ID FROM
        (SELECT *, ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1) AS RN
         FROM mytable) a
     WHERE RN <> 1)


PostgreSQL how to delete duplicated values

I have a table in my Postgres database where I forgot to add a unique index. Because of that missing index, I now have duplicated values. How do I remove the duplicates? I want to add a unique index on the fields translationset_Id and key.
I think you are asking for this:
DELETE FROM tablename
WHERE id IN (SELECT id
             FROM (SELECT id,
                          ROW_NUMBER() OVER (PARTITION BY column1, column2, column3 ORDER BY id) AS rnum
                   FROM tablename) t
             WHERE t.rnum > 1);
It appears that you only want to delete records which are duplicates with regard to the translationset_id and key columns. In this case, we can use Postgres' row number functionality to discern between duplicate rows, and then delete the duplicates.
WITH cte AS
(
    SELECT ctid, ROW_NUMBER() OVER (PARTITION BY translationset_id, key ORDER BY ctid) AS rnum
    FROM yourTable
)
DELETE FROM yourTable
WHERE ctid IN (SELECT ctid FROM cte WHERE rnum > 1)
I think the most efficient way to do this is below.
DELETE FROM table_name a
USING table_name b
WHERE a.id < b.id
  AND a.same_column = b.same_column;
delete from mytable
where exists (select 1
from mytable t2
where t2.name = mytable.name and
t2.address = mytable.address and
t2.zip = mytable.zip and
t2.ctid > mytable.ctid
);
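Once the duplicates are gone, the unique index from the question can be added so the problem cannot recur. A minimal sketch, using the table and column names from the answers above (the index name here is made up):
CREATE UNIQUE INDEX translationset_key_uniq
    ON yourTable (translationset_id, key);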

How to get duplicate text values from SQL query

I have to get a table with only the duplicate text values using an SQL query. I have used HAVING COUNT(columnname) > 1, but instead of getting only the duplicate values I'm getting all values.
Can anyone suggest whether I have to add anything to my query?
Thanks.
Use the below query. Mention the column which is getting duplicated in the PARTITION BY clause.
with CTE_1 AS
(
    SELECT *,
           COUNT(*) OVER (PARTITION BY LTRIM(RTRIM(REPLACE(yourDuplicateColumn, ' ', '')))) AS cnt
    FROM YourTable
)
SELECT *
FROM CTE_1
WHERE cnt > 1
Assuming id is a primary key
select *
from myTable t1
where exists (select 1
              from myTable t2
              where t2.text = t1.text and t2.id != t1.id)
You can use something similar to the following query:
SELECT column1, COUNT(*)
FROM table
GROUP BY column1
HAVING COUNT(*) > 1
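That query returns only the duplicated values themselves. If you need the full rows, one common pattern is to feed the grouped result back through an IN filter; a minimal sketch (table and column names are placeholders):
-- Returns every row whose column1 value occurs more than once
SELECT *
FROM mytable
WHERE column1 IN (SELECT column1
                  FROM mytable
                  GROUP BY column1
                  HAVING COUNT(*) > 1)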

Scalable Solution to get latest row for each ID in BigQuery

I have quite a large table with an ID field and another field called collection_time. I want to select the latest record for each ID. Unfortunately, the combination (ID, collection_time) is not unique in my data, and I want just one of the records with the maximum collection time. I have tried two solutions, but neither has worked for me:
First: using a query
SELECT * FROM
  (SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time DESC) AS rn
   FROM mytable)
WHERE rn = 1
This results in a Resources exceeded error, which I guess is because of the ORDER BY in the query.
Second: using a join between the table and the latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
This solution does not work for me because (ID, collection_time) is not unique, so the JOIN result contains multiple rows for each ID.
I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
SELECT agg.table.*
FROM (
  SELECT
    id,
    ARRAY_AGG(STRUCT(table) ORDER BY collection_time DESC)[SAFE_OFFSET(0)] agg
  FROM `dataset.table` table
  GROUP BY id
)
This will do the job for you, and it is scalable. Since it aggregates the whole row as a STRUCT, you won't have to change the query even when the schema keeps changing.
Short and scalable version:
select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
Quick and dirty option - combine both of your queries into one: first get all records with the latest collection_time (using your second query), and then dedup them using your first query:
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
  FROM (
    SELECT tab1.*
    FROM mytable AS tab1
    INNER JOIN (
      SELECT ID, MAX(collection_time) AS second_time
      FROM mytable GROUP BY ID
    ) AS tab2
    ON tab1.ID = tab2.ID AND tab1.collection_time = tab2.second_time
  )
)
WHERE rn = 1
And with Standard SQL (proposed by S.Mohsen sh)
WITH myTable AS (
  SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
  SELECT ID, MAX(collection_time) AS second_time
  FROM myTable GROUP BY ID
),
tab2 AS (
  SELECT * FROM myTable
),
joint AS (
  SELECT tab2.*
  FROM tab2 INNER JOIN tab1
  ON tab2.ID = tab1.ID AND tab2.collection_time = tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
  FROM joint
)
WHERE rn = 1
If you don't care about writing a piece of code for every column:
SELECT ID,
ARRAY_AGG(col1 ORDER BY collection_time DESC)[OFFSET(0)] AS col1,
ARRAY_AGG(col2 ORDER BY collection_time DESC)[OFFSET(0)] AS col2
FROM myTable
GROUP BY ID
I see no one has mentioned window functions with QUALIFY:
SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp
The window function adds a column max_timestamp that is accessible in the QUALIFY clause to filter on.
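A closely related pattern, sketched here under the same my_table schema, filters on ROW_NUMBER directly in the QUALIFY clause; unlike the MAX version it returns exactly one row per id even when the maximum collection_time is tied:
SELECT *
FROM my_table
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY collection_time DESC) = 1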
As per your comment: assuming you have a table with unique IDs for which you need to find the latest collection_time, here is another way to do it using a correlated subquery. Give it a try.
SELECT id,
       (SELECT Max(collection_time)
        FROM mytable B
        WHERE A.id = B.id) AS Max_collection_time
FROM id_table A
Another solution, which could be more scalable since it avoids multiple scans of the same table (which will happen with both self-join and correlated subquery in above answers). This solution only works with standard SQL (uncheck "Use Legacy SQL" option):
SELECT
  ID,
  (SELECT AS STRUCT srow.*
   FROM UNNEST(t.srows) srow
   ORDER BY srow.collection_time DESC
   LIMIT 1) latest
FROM
  (SELECT ID, ARRAY_AGG(STRUCT(col1, col2, col3, ...)) srows
   FROM id_table
   GROUP BY ID) t

Oracle: Updating a table column using ROWNUM in conjunction with ORDER BY clause

I want to populate a table column with a running integer number, so I'm thinking of using ROWNUM. However, I need to populate it based on the order of other columns, something like ORDER BY column1, column2. That is, unfortunately, not possible since Oracle does not accept the following statement:
UPDATE table_a SET sequence_column = rownum ORDER BY column1, column2;
Nor the following statement (an attempt to use WITH clause):
WITH tmp AS (SELECT * FROM table_a ORDER BY column1, column2)
UPDATE tmp SET sequence_column = rownum;
So how do I do it using an SQL statement and without resorting to cursor iteration method in PL/SQL?
This should work (works for me)
update table_a outer
set sequence_column = (
    select rnum
    from (
        -- evaluate row_number() for all rows ordered by your columns
        -- BEFORE updating those values into table_a
        select id, row_number() over (order by column1, column2) rnum
        from table_a) inner
    -- join on the primary key to be sure you'll only get one value
    -- for rnum
    where inner.id = outer.id);
Or you can use the MERGE statement. Something like this:
merge into table_a u
using (
select id, row_number() over (order by column1, column2) rnum
from table_a
) s
on (u.id = s.id)
when matched then update set u.sequence_column = s.rnum
UPDATE table_a
SET sequence_column = (select rn
from (
select rowid,
row_number() over (order by col1, col2)
from table_a
) x
where x.rowid = table_a.rowid)
But that won't be very fast, and as Damien pointed out, you have to re-run this statement each time you change the data in that table.
First, create a sequence:
CREATE SEQUENCE SEQ_SLNO
START WITH 1
MAXVALUE 999999999999999999999999999
MINVALUE 1
NOCYCLE
NOCACHE
NOORDER;
After that, update the table using the sequence (note that this assigns the numbers in arbitrary scan order, not in the ORDER BY column1, column2 order the question asks for):
UPDATE table_name
SET column_name = SEQ_SLNO.NEXTVAL;
A small correction, just add AS RN:
UPDATE table_a
SET sequence_column = (select rn
from (
select rowid,
row_number() over (order by col1, col2) AS RN
from table_a
) x
where x.rowid = table_a.rowid)
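If table_a has no usable primary key, the MERGE approach above can be keyed on ROWID instead; a sketch under that assumption:
merge into table_a u
using (
    -- number the rows in the desired order, carrying each row's ROWID along
    select rowid as rid, row_number() over (order by column1, column2) rnum
    from table_a
) s
on (u.rowid = s.rid)
when matched then update set u.sequence_column = s.rnum;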