Using PostgreSQL, the table sessions has columns id (PK), user_id and expire. I would like to delete the row with id = 'deleteme', but also expired sessions from the same person, i.e. rows whose user_id matches the deleted row and whose expire < now().
The query that I found to be working is
WITH temp AS (
    DELETE FROM sessions
    WHERE id = 'deleteme'
    RETURNING user_id)
DELETE FROM sessions
WHERE user_id IN (
    SELECT user_id FROM temp)
AND expire < now()
What did not work was
WITH temp AS (
    DELETE FROM sessions
    WHERE id = 'deleteme'
    RETURNING user_id)
DELETE FROM sessions
WHERE user_id = temp.user_id
AND expire < now()
which fails with the error "missing FROM-clause entry for table 'temp'"
Are there simpler queries that achieve the same effect as my first query?
EDIT: if there are ways to do this with joins, please let me know as well, as I am quite new to SQL and eager to learn. I just don't know whether that would delete from the original table in addition to the joined table.
Here the error message is not exactly clear.
To reference the CTE from a DELETE statement, you need to bring it into scope with a USING clause.
So:
WITH temp AS (
    DELETE FROM sessions
    WHERE id = 'deleteme'
    RETURNING user_id)
DELETE FROM sessions
USING temp
WHERE sessions.user_id = temp.user_id
AND expire < now()
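As for something simpler: one single-statement alternative is a sketch like the following, assuming id is the primary key (so the scalar subquery returns at most one row):
DELETE FROM sessions
WHERE id = 'deleteme'
   OR (user_id = (SELECT user_id FROM sessions WHERE id = 'deleteme')
       AND expire < now());
The CTE version arguably states the intent more clearly, though, and avoids repeating the id = 'deleteme' condition.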
I have a table like this
CREATE TABLE userinteractions
(
userid bigint,
dobyr int,
-- lots more fields that are not relevant to the question
);
My problem is that some of the data is polluted with multiple dobyr values for the same user.
The table is used as the basis for further processing by creating a new table, and these polluted cases need to be removed from that pipeline.
I want to create a clean table that contains unique (userid, dobyr) pairs, limited to the cases where there is only one dobyr value for a given userid in userinteractions.
For example I start with data like this:
userid,dobyr
1,1995
1,1995
2,1999
3,1990 # dobyr values not equal
3,1999 # dobyr values not equal
4,1989
4,1989
And I want to select from this to get a table like this:
userid,dobyr
1,1995
2,1999
4,1989
Is there an elegant, efficient way to get this in a single sql query?
I am using postgres.
EDIT: I do not have permissions to modify the userinteractions table, so I need a SELECT solution, not a DELETE solution.
Clarified requirements: your aim is to generate a new, cleaned-up version of an existing table, and the clean-up means:
If there are multiple rows with the same userid and the same dobyr, one of them is kept (it doesn't matter which) and the rest are discarded.
All rows for a given userid are discarded if it occurs with different dobyr values.
create table userinteractions_clean as
select distinct on (userid,dobyr) *
from userinteractions
where userid in (
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1 )
order by userid,dobyr;
This could also be done with NOT IN, NOT EXISTS or EXISTS conditions. Also, you can control which row is kept by adding columns at the end of the order by.
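For example, a sketch of the NOT EXISTS variant (equivalent result, assuming dobyr is not NULL in the affected rows):
create table userinteractions_clean as
select distinct on (userid, dobyr) *
from userinteractions u
where not exists (
    select 1
    from userinteractions d
    where d.userid = u.userid
      and d.dobyr <> u.dobyr )
order by userid, dobyr;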
If you don't need the other columns and only want something to use later as a filter/whitelist, the plain userid values from records whose (userid, dobyr) pairs match your criteria are enough, since they already uniquely identify those records:
create table userinteractions_whitelist as
select userid
from userinteractions
group by userid
having count(distinct dobyr)=1
Just use a HAVING clause to assert that all rows in a group must have the same dobyr. Since all dobyr values in a qualifying group are equal, MAX(dobyr) simply returns that single value.
SELECT
userid,
MAX(dobyr) AS dobyr
FROM
userinteractions
GROUP BY
userid
HAVING
COUNT(DISTINCT dobyr) = 1
I have a Spring batch that will run every day to:
Read CSV files and import them into our database
Aggregate this data and save the aggregated results into another table.
We have a table BATCH_LIST that contains information about all the batches that were already executed.
BATCH_LIST has the following columns:
1. BATCH_ID
2. EXECUTION_DATE
3. STATUS
Among the CSV files that are imported, we have one CSV file to feed an APP_USERS table, and another one to feed the ACCOUNTS table.
APP_USERS has the following columns:
1. USER_ID
2. BATCH_ID
-- more columns
ACCOUNTS has the following columns:
1. ACCOUNT_ID
2. BATCH_ID
-- more columns
In step 2, we aggregate data from ACCOUNTS and APP_USERS to insert rows into a USER_ACCOUNT_RELATION table. This table has exactly two columns: ACCOUNT_ID (referring to ACCOUNTS.ACCOUNT_ID) and USER_ID (referring to APP_USERS.USER_ID).
Now we want to add another step in our Spring batch. We want to delete all the data from the USER_ACCOUNT_RELATION table, but also the APP_USERS and ACCOUNTS rows that are no longer relevant (i.e. data that was imported before sysdate - 2).
What has been done so far:
Get all the BATCH_ID that we want to remove from the database
SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < sysdate - 2
For each BATCH_ID, we are calling the following methods:
public void deleteAppUsersByBatchId(Connection connection, long batchId) throws SQLException {
    // prepared statements to delete the user/account relations and the users
}
And here are the two prepared statements:
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN (
SELECT USER_ID FROM APP_USERS WHERE BATCH_ID = ?
);
DELETE FROM APP_USERS WHERE BATCH_ID = ?
My issue is that it takes too long to delete data for one BATCH_ID (more than 1 hour).
Note: I only mentioned the APP_USERS, ACCOUNTS and USER_ACCOUNT_RELATION tables, but I actually have around 25 tables to delete from.
How can I improve the query time ?
(I've just tried to change the WHERE USER_ID IN (...) into an EXISTS. It is better, but still way too long.)
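For reference, the EXISTS version of the first statement looks roughly like this (a sketch using the table names above):
DELETE FROM USER_ACCOUNT_RELATION r
WHERE EXISTS (
    SELECT 1
    FROM APP_USERS u
    WHERE u.USER_ID = r.USER_ID
      AND u.BATCH_ID = ?
);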
If that will be your regular process, i.e. you only want to keep the last 2 days, you don't need indexes for this, since every time you will delete about a third of all rows.
It's better to use just 3 deletes (one per table) instead of 3 separate deletes per BATCH_ID:
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN
(
SELECT u.ID
FROM {USER} u
join {FILE} f
on u.FILE_ID = f.FILE_ID
WHERE trunc(f.IMPORT_DATE) < (sysdate - 2)
);
DELETE FROM {USER}
WHERE FILE_ID IN (SELECT FILE_ID FROM {FILE} WHERE trunc(IMPORT_DATE) < (sysdate - 2));
DELETE FROM {ACCOUNT}
WHERE FILE_ID IN (SELECT FILE_ID FROM {FILE} WHERE trunc(IMPORT_DATE) < (sysdate - 2));
Just replace {USER}, {FILE}, {ACCOUNT} with your real table names.
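For example, with the table names from the question, the first delete would look roughly like this (a sketch assuming the relevant date is BATCH_LIST.EXECUTION_DATE and the tables join on BATCH_ID):
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN
(
    SELECT u.USER_ID
    FROM APP_USERS u
    JOIN BATCH_LIST b
      ON u.BATCH_ID = b.BATCH_ID
    WHERE trunc(b.EXECUTION_DATE) < (sysdate - 2)
);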
Obviously, with the partitioning option this would be much easier: with daily interval partitioning you could simply drop old partitions.
But even in your case there is another, more involved but really fast, solution: "partition views". For example, for ACCOUNT you can create 3 different tables ACCOUNT_1, ACCOUNT_2 and ACCOUNT_3, then create a partition view:
create view ACCOUNT as
select 1 table_id, a1.* from ACCOUNT_1 a1
union all
select 2 table_id, a2.* from ACCOUNT_2 a2
union all
select 3 table_id, a3.* from ACCOUNT_3 a3;
Then you can use an INSTEAD OF trigger on this view to insert each day's data into its own table: the first day into ACCOUNT_1, the second into ACCOUNT_2, etc., and truncate the oldest table each midnight. You can easily get the current table name using:
select 'ACCOUNT_' || (mod(to_number(to_char(sysdate, 'j')), 3) + 1) tab_name from dual;
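A minimal sketch of such an INSTEAD OF trigger (the column names a_col1 and a_col2 are placeholders for the real ACCOUNT columns):
create or replace trigger account_instead_ins
instead of insert on ACCOUNT
for each row
declare
    -- pick the target table slot from the Julian day, same formula as above
    v_slot number := mod(to_number(to_char(sysdate, 'j')), 3) + 1;
begin
    if v_slot = 1 then
        insert into ACCOUNT_1 (a_col1, a_col2) values (:new.a_col1, :new.a_col2);
    elsif v_slot = 2 then
        insert into ACCOUNT_2 (a_col1, a_col2) values (:new.a_col1, :new.a_col2);
    else
        insert into ACCOUNT_3 (a_col1, a_col2) values (:new.a_col1, :new.a_col2);
    end if;
end;
/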
I'm trying the new query scheduling feature in Google BigQuery, but I can't seem to get it to append new records to my table correctly.
I set Custom schedule to every 15 minutes and the Destination table write preference to Append to table.
SELECT DATETIME_TRUNC(DATETIME(logtime, 'America/Los_Angeles'), MINUTE) log_minute,
COUNT(DISTINCT user_id) users,
COUNT(DISTINCT product_id) unique_products
FROM mytable
WHERE DATE(logtime, 'America/Los_Angeles') >= "2019-05-01"
GROUP BY log_minute
ORDER BY log_minute
I expected to see 1 row per log_minute, but I'm seeing duplicates: 1 row per log_minute for each scheduled run so that after an hour, there are 5 duplicates of each row (1 at the start + 1 for every 15 minutes).
I expected to see 1 row per log_minute, but I'm seeing duplicates: 1 row per log_minute for each scheduled run
Do you want to append new rows? Of course you'll see a new row every time the query runs - because you are appending rows.
If you want to UPDATE the existing ones instead and add new ones, schedule a MERGE.
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_examples
Thanks for the tip, Felipe! For anyone who's trying to do the same thing, I edited the query to the following:
MERGE nextvr.sched_test_15min H
USING
(
SELECT TIMESTAMP(DATETIME_TRUNC(DATETIME(logtime, 'America/Los_Angeles'), MINUTE)) log_minute,
COUNT(DISTINCT user_id) users,
COUNT(DISTINCT product_id) products
FROM mytable
WHERE DATE(logtime, 'America/Los_Angeles') >= "2019-05-01"
GROUP BY log_minute
) N
ON H.log_minute = N.log_minute
WHEN MATCHED THEN
UPDATE
SET users = N.users, products = N.products
WHEN NOT MATCHED THEN
INSERT (log_minute, users, products)
VALUES (log_minute, users, products)
When creating the scheduled query, under Destination for query results section, leave the Table name field blank.
I have a table with some rows in it. Every row has a date field. Right now there may be duplicates of a date. I need to delete all the duplicates and keep only the row with the highest id. How is this possible using a SQL query?
Now:
date id
'07/07' 1
'07/07' 2
'07/07' 3
'07/05' 4
'07/05' 5
What I want:
date id
'07/07' 3
'07/05' 5
DELETE FROM table WHERE id NOT IN
(SELECT MAX(id) FROM table GROUP BY date);
I don't have comment rights, so here's my comment as an answer in case anyone comes across the same problem:
In SQLite3, there is an implicit numerical primary key called "rowid", so the same query would look like this:
DELETE FROM table WHERE rowid NOT IN
(SELECT MAX(rowid) FROM table GROUP BY date);
This will work with any table, even if it does not contain a primary key column called "id".
For MySQL, PostgreSQL and Oracle, a better way is a self join.
Postgresql:
DELETE FROM table t1 USING table t2 WHERE t1.date=t2.date AND t1.id<t2.id;
MySQL
DELETE FROM table
USING table, table as vtable
WHERE (table.id < vtable.id)
AND (table.date=vtable.date)
Aggregate functions (MAX, GROUP BY) can be quite slow on large tables, which is why the self-join form is often faster.
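In PostgreSQL, if the table has no usable id column at all, the system column ctid can play the same role as SQLite's rowid in the self-join form. A sketch (using the same placeholder table name as above; ctid values are only stable within a statement, so treat this as a one-off cleanup that keeps one arbitrary row per date):
DELETE FROM table a
USING table b
WHERE a.date = b.date
  AND a.ctid < b.ctid;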
I have a table, let's call it 'entries' that looks like this (simplified):
id [pk]
user_id [fk]
created [date]
processed [boolean, default false]
and I want to create an UPDATE query which will set the processed flag to true on all entries except for the latest 3 for each user (latest in terms of the created column). So, for the following entries:
1,456,2009-06-01,false
2,456,2009-05-01,false
3,456,2009-04-01,false
4,456,2009-03-01,false
Only entry 4 would have its processed flag changed to true.
Anyone know how I can do this?
I don't know postgres, but this is standard SQL and may work for you.
update entries set
processed = true
where (
select count(*)
from entries as E
where E.user_id = entries.user_id
and E.created > entries.created
) >= 3
In other words, update the processed column to true whenever there are three or more entries for the same user_id on later dates. I'm assuming the created column is unique for a given user_id; if not, you'll need an additional criterion to pin down what you mean by "latest".
In SQL Server you can do this, which is a little easier to follow and will probably be more efficiently executed:
with T(id, user_id, created, processed, rk) as (
select
id, user_id, created, processed,
row_number() over (
partition by user_id
order by created desc, id
)
from entries
)
update T set
processed = true
where rk > 3;
Updating a CTE is a non-standard feature, and not all database systems support row_number.
First, let's start with a query that lists all rows to be updated:
select e.id
from entries as e
where (
select count(*)
from entries as e2
where e2.user_id = e.user_id
and e2.created > e.created
) > 2
This lists the ids of all records that have more than 2 other records with the same user_id but a later created value.
That is, it will list all records except the last 3 per user.
Now, we can:
update entries as e
set processed = true
where (
select count(*)
from entries as e2
where e2.user_id = e.user_id
and e2.created > e.created
) > 2;
One thing though: it can be slow. In that case you might be better off with a custom aggregate, or (if you're on 8.4 or later) window functions.
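For reference, a sketch of that window-function approach in PostgreSQL 8.4+ (it ranks each user's entries by created, newest first, and updates everything past the newest 3):
update entries
set processed = true
where id in (
    select id
    from (
        select id,
               row_number() over (partition by user_id
                                  order by created desc, id desc) as rk
        from entries
    ) ranked
    where rk > 3
);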