Remove duplicate batches of data - sql

Due to a bug in my application, a table that was built to carry daily records of each delivery period was populated many times.
Let's say I have a delivery from the 1st of June to the 5th of June. My table should be populated with 5 records, one for each day. Now I have havoc because I have many "batches" of the same content.
The table layout is as follows:
dummy_id -- identity column
delivery_id -- id of the delivery
on_date -- the day
charge -- the daily cost
Is there an elegant way to keep only the first batch of records and delete the batches that were inserted by mistake for all the deliveries?

To delete all duplicates for each (delivery_id, on_date, charge) combination, keeping the row with the lowest dummy_id:
;WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY delivery_id, on_date, charge
                              ORDER BY dummy_id) AS RN
    FROM YourTable
)
DELETE FROM cte
WHERE RN > 1

You can try the following. This shows which rows you will delete (note that the GROUP BY needs to cover the columns that define a duplicate, here DELIVERY_ID and ON_DATE):
SELECT * FROM YOUR_TABLE WHERE DUMMY_ID NOT IN (
    SELECT MIN(DUMMY_ID) FROM YOUR_TABLE GROUP BY DELIVERY_ID, ON_DATE)
This will delete those rows:
DELETE FROM YOUR_TABLE WHERE DUMMY_ID NOT IN (
    SELECT MIN(DUMMY_ID) FROM YOUR_TABLE GROUP BY DELIVERY_ID, ON_DATE)

Try
DELETE FROM YourTable WHERE dummy_id NOT IN (SELECT MIN(dummy_id) FROM YourTable GROUP BY delivery_id, on_date)

Removing duplicate rows based on one column same values but keep one record (SQL Server version)

Remove all dupe rows (rows 3 thru 18) with service_date = '2018-08-29 13:05:00.000' but keep the oldest row (row 2), and of course keep row 1 since it has a different service_date. Don't mind the create_timestamp or document_file since it's the same customer. Any ideas?
In SQL Server, we can try deleting using a CTE:
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY service_date ORDER BY create_timestamp) rn
FROM yourTable
)
DELETE
FROM cte
WHERE rn > 1;
The strategy here is to assign a row number to each group of records sharing the same service_date, with 1 being assigned to the oldest record in that group. Then, we can phrase the delete by just targeting all records which have a row number greater than 1.
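If you want to preview the rows before deleting anything, a minimal sketch (reusing the same hypothetical yourTable and column names from above) is to run the CTE as a plain SELECT first:
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY service_date ORDER BY create_timestamp) rn
    FROM yourTable
)
SELECT *        -- these are exactly the rows the DELETE above would remove
FROM cte
WHERE rn > 1;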
You don't need to use the PARTITION BY clause; please use the below query for efficient performance. I have tested it and it's working fine. (Note that without PARTITION BY it simply keeps the first two rows by create_timestamp, which matches the sample data in this question.)
with result as
(
    select *, row_number() over (order by create_timestamp) as Row_To_Delete
    from TableName
)
delete from result where result.Row_To_Delete > 2
I think you will want to remove these rows on a per-customer basis.
I mean, if the customers are different you will want to keep the entries even on the same date.
If so, you will need to add the customer column to the PARTITION BY clause used to identify duplicate rows.
By copying and modifying Tim's solution, you can check the following:
;WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY customer, service_date ORDER BY create_timestamp) rn
FROM yourTable
)
DELETE
FROM cte
WHERE rn > 1;

How to retain the last 1000 ids (from 99001 to 100000) in a table and delete the rest of them?

I have a table with 100000 rows. I want to retain only the last 1% of the rows. How can I do that? Also, the ID should start from 1. I am using MS SQL 2012.
I will transfer the last 1% of the rows to a temporary table, truncate the original table and later transfer them back to the original table.
Example:
SELECT TOP 1 PERCENT [Required Columns] INTO #temp FROM Table1 ORDER BY ID DESC
TRUNCATE TABLE Table1
INSERT INTO Table1
SELECT [Required Columns excluding ID] FROM #temp ORDER BY ID
Delete the rows you're no longer interested in:
DELETE FROM <yourTable> WHERE ID <= 99000
and then update the ID:
UPDATE <yourTable> SET ID = ID - 99000
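One caveat, hedged because the question doesn't say how ID is defined: if ID is an IDENTITY column, SQL Server will not let you UPDATE it directly, so the truncate-and-reinsert approach above is the usual route. TRUNCATE TABLE already resets the identity seed; if you ever need to reset it explicitly (for example after a DELETE rather than a TRUNCATE), you can reseed it:
-- Assumes the table is literally named yourTable; existing rows keep their IDs,
-- only the next inserted row will get ID = 1.
DBCC CHECKIDENT ('yourTable', RESEED, 0);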
Here's another option...
WITH
cte_AddRN AS (
SELECT
td.ID,
rn = ROW_NUMBER() OVER (ORDER BY td.id DESC)
FROM
#TestData td
)
DELETE arn
FROM
cte_AddRN arn
WHERE
arn.rn > 1000;

SQL to Generate Periodic Snapshots from Transactions Table

I'm trying to create a periodic snapshot view from a database's transaction table after the fact. The transaction table has the following fields:
account_id (foreign key)
event_id
status_dt
status_cd
Every time an account changes status in the application, a new row is added to the transaction table with the new status. I'd like to produce a view that shows the count of accounts by status on every date; it should have the following fields:
snapshot_dt
status_cd
count_of_accounts
This will get the count for any given day, but not for all days:
SELECT status_cd, COUNT(account_id) AS count_of_accounts
FROM transactions
JOIN (
    SELECT account_id, MAX(event_id) AS event_id
    FROM transactions
    WHERE status_dt <= DATE '2014-12-05'
    GROUP BY account_id) latest
USING (account_id, event_id)
GROUP BY status_cd
Thank you!
Okay, this is going to be hard to explain.
On each date for each status, you should count up two values:
The number of customers who start with that status.
The number of customers who leave with that status.
The first value is easy. It is just the aggregation of the transactions by the date and the status.
The second value is almost as easy. You get the previous status code and count the number of times that that status code "leaves" on that date.
Then, the key is the cumulative sum of the first value minus the cumulative sum of the second value.
I freely admit that the following code is not tested (if you had a SQL Fiddle, I'd be happy to test it). But this is what the resulting query looks like:
select status_dt, status_cd,
       (sum(inc_cnt) over (partition by status_cd order by status_dt) -
        sum(dec_cnt) over (partition by status_cd order by status_dt)
       ) as dateamount
from ((select t.status_dt, t.status_cd, count(*) as inc_cnt, 0 as dec_cnt
       from transactions t
       group by t.status_dt, t.status_cd
      ) union all
      (select t.status_dt, t.prev_status_cd, 0, count(*)
       from (select t.*,
                    lag(t.status_cd) over (partition by t.account_id order by t.status_dt) as prev_status_cd
             from transactions t
            ) t
       where t.prev_status_cd is not null
       group by t.status_dt, t.prev_status_cd
      )
     ) t;
If you have dates where there is no change for one or more statuses and you want to include those in the output, then the above query would need to use cross join to first create the rows in the result set. It is unclear if this is a requirement, so I'm leaving out that complication.
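For completeness, here is an equally untested sketch of that cross join variant, in the same spirit as the query above (the date range and names such as daily_change are illustrative assumptions; Postgres syntax):
with daily_change as (
    select status_dt, status_cd, sum(inc_cnt) as inc_cnt, sum(dec_cnt) as dec_cnt
    from ((select status_dt, status_cd, count(*) as inc_cnt, 0 as dec_cnt
           from transactions
           group by status_dt, status_cd
          ) union all
          (select status_dt, prev_status_cd, 0, count(*)
           from (select status_dt,
                        lag(status_cd) over (partition by account_id order by status_dt) as prev_status_cd
                 from transactions
                ) x
           where prev_status_cd is not null
           group by status_dt, prev_status_cd
          )
         ) u
    group by status_dt, status_cd
)
select d.snapshot_dt::date as snapshot_dt,
       s.status_cd,
       coalesce(sum(c.inc_cnt - c.dec_cnt)
                    over (partition by s.status_cd order by d.snapshot_dt), 0) as count_of_accounts
from generate_series(date '2014-01-01', date '2014-12-31', interval '1 day') as d(snapshot_dt)
cross join (select distinct status_cd from transactions) s  -- full (date, status) grid
left join daily_change c
       on c.status_dt = d.snapshot_dt::date
      and c.status_cd = s.status_cd
order by 1, 2;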

Top 10% of sum() Postgres

I'm looking to pull the top 10% of a summed value on a Postgres server.
So I'm summing a value with sum(transaction.value) and I'd like the top 10% of that value.
From what I gather in your comments, I assume you want to:
Sum transactions per customer to get a total per customer.
List the top 10 % of customers who actually have transactions and spent the most.
WITH cte AS (
SELECT t.customer_id, sum(t.value) AS sum_value
FROM transaction t
GROUP BY 1
)
SELECT *, rank() OVER (ORDER BY sum_value DESC) AS sails_rank
FROM cte
ORDER BY sum_value DESC
LIMIT (SELECT count(*)/10 FROM cte)
Major points
Best to use a CTE here; it makes the count cheaper.
Aggregating transactions in the CTE automatically excludes customers without any transactions. I am assuming relational integrity here (fk constraint on customer_id).
Dividing bigint / int effectively truncates the result (rounds down to the nearest integer), as illustrated just after this list. You may be interested in this related question:
PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?
I added a sails_rank column which you didn't ask for, but seems to fit your requirement.
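To make the integer truncation concrete, here is a tiny illustration; the ceil() variant is only an assumption about what you might want if you'd rather round up:
SELECT 47 / 10          AS truncated_tenth;   -- returns 4: bigint/int division rounds down
SELECT ceil(47 / 10.0)  AS rounded_up_tenth;  -- returns 5: divide by a numeric literal to round up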
As you can see, I didn't even include the table customer in the query. Assuming you have a foreign key constraint on customer_id, that would be redundant (and slower). If you want additional columns from customer in the result, join customer to the result of the above query:
WITH cte AS (
SELECT t.customer_id, sum(t.value) AS sum_value
FROM transaction t
GROUP BY 1
)
SELECT c.customer_id, c.name, sub.sum_value, sub.sails_rank
FROM (
SELECT *, rank() OVER (ORDER BY sum_value DESC) AS sails_rank
FROM cte
ORDER BY sum_value DESC
LIMIT (SELECT count(*)/10 FROM cte)
) sub
JOIN customer c USING (customer_id);

PostgreSQL/SQL query optimization

So, I have a log table with about 8M records. Because of a programming error it happened that there is more than one record per company within the same date. Now, what I need is to delete all records from this log for each company and date except the latest one (which has the max id). The count of records to be deleted is approximately 300K.
The fastest and easiest thing that I tried is this:
delete from indexing_log where id not in (
select max(id)
from indexing_log
group by company_id,
"date"
)
But this query is taking an enormous amount of time (about 3 days) on the production server (which for some reason doesn't have an SSD drive). I have tried all the ways that I know and need some advice. How can it be made faster?
UPDATE
I decided to do it in batches, through a Celery task.
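For reference, a minimal sketch of what one such batch could look like in plain SQL (the batch size of 10000 is an arbitrary assumption; rerun the statement until it affects 0 rows):
delete from indexing_log
where id in (
    select l.id
    from indexing_log l
    where exists (
        select 1
        from indexing_log i
        where i.company_id = l.company_id
          and i."date" = l."date"
          and i.id > l.id        -- a newer record exists, so l is not the latest
    )
    limit 10000                  -- keep each transaction small
);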
You can try:
delete from indexing_log as l
where exists (
    select *
    from indexing_log as i
    where i.id > l.id
      and i.company_id = l.company_id
      and i."date" = l."date"
);
Dump the distinct rows to a temporary table
create temporary table t as
select distinct on (company_id, "date") *
from indexing_log
order by company_id, "date", id desc;
Truncate the original
truncate table indexing_log;
Since the table is now empty use the opportunity to do an instantaneous vacuum:
vacuum full indexing_log;
Move the rows from the temporary to the original
insert into indexing_log
select *
from t;
TRUNCATE TABLE should be much quicker, but there you cannot say "delete everything except...".
If it is possible with your data, you could write a procedure for that: save your max IDs into a temp table, truncate the table and write your temp table back. For PostgreSQL the syntax is slightly different (http://www.postgresql.org/docs/9.1/static/sql-selectinto.html):
SELECT *
INTO #temptable
FROM indexing_log
WHERE id IN (
    SELECT max(id)
    FROM indexing_log
    GROUP BY company_id, "date")
NOT EXISTS is sometimes faster than NOT IN:
delete from indexing_log
where not exists (select 1
                  from (select max(id) as iid
                        from indexing_log
                        group by company_id, "date") mids
                  where indexing_log.id = mids.iid
                 )