SQL to delete excess rows outside valid sub-range

I have a table named "A" which has two columns, "A1" and "A2".
I want each unique value in column "A1" to have at most 2 rows in the table; if a unique value in column "A1" has 5 rows, 3 rows should be deleted.
Which 3 rows to delete is determined by the lowest values in column "A2".
The table consists of 20+ million rows, with 300,000+ unique values in column "A1" and up to 3,000 rows per unique value in column "A1".
I have solved this with the following query:
with excess as
(
    select
        id,
        row_number() over (partition by A1 order by A2 desc) as rownum
    from A
)
delete from excess
where rownum > 2
I'm satisfied with this query, since it took 8 minutes for the initial batch and ~20 seconds on recurring executions.
Is this the most efficient query to achieve the requirements?

Yes, this is the most efficient query without copying the data into another table, because it does the work in a single pass against the table instead of joining it back to itself. I would suggest using DELETE TOP (N) and keeping N under 5,000 if there are any other consumers of the table; this helps prevent lock escalation from escalating to a full table lock. It will also free up the transaction log on the server to be reused in between batches. If you do it all in one go, all of the deleted rows have to be accounted for in the transaction log, and that space can't be reused until the statement is complete. I would also suggest creating a composite index on (A1, A2).
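A minimal sketch of the batched form, assuming SQL Server (the 4,000 batch size and the index name are illustrative, not prescriptive):
create index IX_A_A1_A2 on A (A1, A2);

declare @deleted int = 1;

while @deleted > 0
begin
    -- same CTE as above, but deleting at most 4,000 rows per iteration
    ;with excess as
    (
        select
            id,
            row_number() over (partition by A1 order by A2 desc) as rownum
        from A
    )
    delete top (4000) from excess
    where rownum > 2;

    set @deleted = @@rowcount;  -- loop ends when nothing is left to delete
end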
If the number of rows that need to be deleted is a significant percentage of the table, it would be faster to copy the rows where rownum <= 2 into a new table, then drop the original table and rename the new one back to the original name. If you have other consumers of the table and/or don't want to copy the data, this may not be a viable solution.
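A sketch of that copy-and-swap approach, again assuming SQL Server (the A_keep name is a placeholder; note that indexes, constraints, and permissions are not copied and must be re-created on the new table):
select id, A1, A2
into A_keep
from
(
    select
        id, A1, A2,
        row_number() over (partition by A1 order by A2 desc) as rownum
    from A
) t
where rownum <= 2;

drop table A;
exec sp_rename 'A_keep', 'A';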

Related

COPY FROM / INSERT Millions of Rows Into Same PostgreSQL table

I have a table with hundreds of millions of rows in which I need to essentially create a "duplicate" of each existing row, doubling its row count. I'm currently using an INSERT operation (and unlogging the table prior to inserting), which still takes a long time as one transaction. I'm looking for guidance on whether there is a more efficient way to execute the query below.
INSERT INTO costs (parent_record, is_deleted)
SELECT id, is_deleted
FROM costs;
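One way to avoid a single huge transaction is to insert in id-range batches. A rough sketch, assuming PostgreSQL 11+ (for COMMIT inside a DO block), that id is a sequence-backed key so new rows land above the captured maximum, and an illustrative batch size of 1,000,000:
DO $$
DECLARE
    lo   bigint;
    hi   bigint;
    step bigint := 1000000;  -- illustrative batch size
BEGIN
    -- capture the current id range up front so newly inserted rows
    -- are never re-copied by later batches
    SELECT min(id), max(id) INTO lo, hi FROM costs;
    WHILE lo <= hi LOOP
        INSERT INTO costs (parent_record, is_deleted)
        SELECT id, is_deleted
        FROM costs
        WHERE id >= lo AND id < lo + step AND id <= hi;
        lo := lo + step;
        COMMIT;  -- commit each batch separately (PostgreSQL 11+)
    END LOOP;
END $$;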

How to Duplicate a Small Table To All Amps?

I have a Small Table in a Teradata Database that consists of 30 rows and 9 columns.
How do I duplicate the Small Table across all AMPs?
Note: this is the opposite of what one usually wants to do with a large table, which is to distribute the rows evenly.
You cannot "duplicate" the same table content across all AMPs. What you can do is store all rows from the table on a single AMP by distributing the rows unevenly. So if I understand the request, you want all rows from your small table to be stored on one AMP only.
If so, you can create a column that has the same value for all rows (if you don't already have one). You can make it an INTEGER column in order to use less space. Then make this column the primary index of the table, and keep your actual keys as secondary indexes.
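A rough sketch of such a table definition, with placeholder table and column names (the constant dist_key column hashes every row to the same AMP):
CREATE TABLE databasename.small_table_one_amp
(
    dist_key  INTEGER NOT NULL,  -- same value (e.g. always 1) for every row
    key_col   INTEGER NOT NULL,  -- the table's real key
    payload   VARCHAR(100)
)
PRIMARY INDEX (dist_key);

-- the former key becomes a unique secondary index
CREATE UNIQUE INDEX (key_col) ON databasename.small_table_one_amp;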
You can check how the rows are distributed across the AMPs through the code below.
SELECT
TABLENAME
,VPROC AS NUM_AMP
,CAST(SUM(CURRENTPERM)/(1024*1024*1024) AS DECIMAL(18,5)) AS USEDSPACE_IN_GB
FROM DBC.TABLESIZEV
WHERE UPPER(DATABASENAME) = UPPER('databasename') AND UPPER(TABLENAME) = UPPER('tablename')
GROUP BY 1, 2
ORDER BY 1;
or
SELECT
HASHAMP(HASHBUCKET(HASHROW(primary_index_columns))) AS "AMP"
,COUNT(*) AS CNT
FROM databasename.tablename
GROUP BY 1
ORDER BY 2 DESC;

Remove a particular column data in SQL without deleting entire row

I have a table in SQL Server, where one column contains Excel files.
Now we need to remove those Excel files only, without deleting the entire row. Because the size of this table is increasing day by day, we need to remove old data to decrease the size of this table.
Id      file_name   Code
1001    abc.xlsx    A1
1002    das.xlsx    A2
1003    kap.xlsx    A3
I have done the below:
UPDATE rec_table
SET file_name = NULL
WHERE id = 1001;
Will this help to reduce the size of the table?
Thanks
In SQL Server, the size of a table is calculated by adding up the size of every row in the table. I.e., if a table has 10 rows, then the total size of the table is the sum of the sizes of those 10 rows.
For a row, the total size is calculated by adding up the size of every column.
For example, in your case, the size of the row with Id 1001 will be:
size of the value in column Id + size of the value in column file_name + size of the value in column Code
So if a column holds the value NULL for a particular row, then that column will have a data size of 0.
Updating the values to NULL for a particular column will therefore reduce the size of the column, but how much it gets reduced will depend on the type of the column and the data stored in it.
This means that if your column file_name holds 100 bytes of data for the row with Id 1001, then updating the value to NULL will reduce the table size by 100 bytes.
You may use the following queries to find out the table / row size.
To get the size and details for the whole table:
dbcc showcontig ('Person.Person') with tableresults
To get the data size of a particular column for each row in the table:
SELECT DATALENGTH(FirstName) FROM Person.Person
Will this reduce the size of the table? Probably.
Will this release free space to the database? Probably not -- until you compact the database.
Is this an expensive operation? Very.
Databases store tables on data pages. Each data page contains one or more rows. If you have wide columns, then these might be stored on their own data pages.
The number of rows that fit on a page depends on the size of the rows. A page is about 8k bytes. If a row is 100 bytes, then a table with 1 row occupies the same space as one with 50 rows.
When you remove a column from a table, the entire table needs to be rewritten. This is a very expensive operation. And it might take a long time. Often it is faster to select the columns that you do want and reload the original table.
Removing a column is -- to me -- a very curious way to reduce the size of a table. More typically, older data would be removed. The most efficient method is to partition the table by time, using whatever the appropriate date/time column might be. Then you can quickly recover space by dropping a partition.
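A compressed sketch of the partition-by-time idea, assuming SQL Server 2016+ and hypothetical names throughout (the function, scheme, table, and column names are all placeholders):
CREATE PARTITION FUNCTION pf_by_month (date)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME ps_by_month
AS PARTITION pf_by_month ALL TO ([PRIMARY]);

CREATE TABLE rec_table_partitioned
(
    id         int            NOT NULL,
    file_name  varbinary(max) NULL,   -- assuming the files are stored as binary
    code       char(2)        NULL,
    created    date           NOT NULL
) ON ps_by_month (created);

-- old months can then be reclaimed in one cheap metadata operation
TRUNCATE TABLE rec_table_partitioned WITH (PARTITIONS (1));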

What is the most efficient way to do rows ordering in SQLite table?

I need to keep order of about 4000 rows in an SQLite table. Inserting and deleting rows need to be quick operations.
Current solution: I have an integer column Ord. After every insertion, the new row just gets the next integer, so there is no need to change old rows; but after a deletion I sometimes need to re-populate almost all 4000 rows, and that takes too long (~10 s). For updating this column I use this solution:
Update SQLite table with values from 1 to N
Is there a better way to maintain ordering?
I solved this by moving the Ord column out of my main table MyTable (which has 21 other columns!) into a new Ordering table with just two columns: Ordering.MyTableId and Ordering.Ord.
All operations (including completely clearing and re-inserting all 4000 rows) on this new table execute in 2 seconds!
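A minimal sketch of that layout, assuming MyTable has an integer key Id and some column to order by (Id and SomeSortColumn are placeholder names; ROW_NUMBER requires SQLite 3.25+):
CREATE TABLE Ordering
(
    MyTableId INTEGER PRIMARY KEY REFERENCES MyTable(Id),
    Ord       INTEGER NOT NULL
);

-- full re-population stays fast because the rows are narrow
DELETE FROM Ordering;
INSERT INTO Ordering (MyTableId, Ord)
SELECT Id, ROW_NUMBER() OVER (ORDER BY SomeSortColumn)
FROM MyTable;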

Eliminating Duplicate Records in a DB2 Table

How do I delete duplicate records in a DB2 table? I want to be left with a single record for each group of duplicates.
1. Create another table "no_dups" that has exactly the same columns as the table you want to eliminate the duplicates from. (You may want to add an identity column, just to make it easier to identify individual rows.)
2. Insert into "no_dups" a select distinct column1, column2, ..., columnN from the original table. The select distinct should only bring back one row for every group of duplicates in the original table. If it doesn't, you may have to alter the list of columns or have a closer look at your data; it may look like duplicate data but actually is not.
3. When step 2 is done, you will have your original table, and "no_dups" will have all the rows without duplicates. At this point you can do any number of things: drop and rename the tables, or delete all rows from the original and insert into the original a select * from no_dups.
If you're running into problems identifying duplicates, and you've added an identity column to "no_dups," you should be able to delete rows one by one using the identity column value.
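A sketch of those steps in DB2 SQL, with placeholder names (original_table is hypothetical, and column1 ... columnN stand for the table's full column list):
-- step 1: clone the structure of the original table
CREATE TABLE no_dups AS
    (SELECT * FROM original_table) WITH NO DATA;

-- step 2: keep one row per group of duplicates
INSERT INTO no_dups
SELECT DISTINCT column1, column2, columnN
FROM original_table;

-- step 3 (one option): reload the original from the deduplicated copy
DELETE FROM original_table;
INSERT INTO original_table
SELECT * FROM no_dups;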