Remove duplicates from a table that doesn't have any key column - sql

I have two tables, TabA and TabB. Neither has any key columns.
Both have the same structure, with more than 80 columns.
TabA has 30 million records; TabB has only 2000.
Since there is no key column, I need to compare all the columns between the two tables and remove the duplicate records from TabA.
I would like to find the best approach for comparing the tables without listing all 80 columns in a JOIN or WHERE clause.

If all 80 columns are needed to identify the record, you'd have a hard time not using all of them in your query, one way or another.
You could calculate a hash over all the columns with HASHBYTES() (it takes a single input, so the columns have to be concatenated first) and then compare only the resulting hashes.
There's also the CHECKSUM(*) function, which calculates a hash over all the columns without the need to list them explicitly, but it returns an int, which is too weak if occasional false positives are not acceptable.
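A minimal sketch of the hash comparison, assuming SQL Server 2012 or later (for CONCAT) and assuming that "duplicates" means rows of TabA that also appear in TabB. The names col1 to col3 are hypothetical stand-ins for the full column list, which still has to be written out once on each side. Before SQL Server 2016, HASHBYTES accepts at most 8000 bytes of input, which 80 concatenated columns could exceed.

```sql
-- Sketch only: col1..col3 are hypothetical placeholders for all 80 columns.
-- The '|' separator keeps ('ab', 'c') from hashing the same as ('a', 'bc').
-- CONCAT treats NULL as an empty string, so NULL and '' are not distinguished here.
DELETE a
FROM TabA AS a
WHERE EXISTS (
    SELECT 1
    FROM TabB AS b
    WHERE HASHBYTES('SHA2_256', CONCAT(b.col1, '|', b.col2, '|', b.col3))
        = HASHBYTES('SHA2_256', CONCAT(a.col1, '|', a.col2, '|', a.col3))
);
```

With only 2000 rows in TabB, it may be worth computing its hashes once into a temporary table so they are not recalculated for every row of TabA.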

Related

Selecting a large number of rows by index using SQL

I am trying to select a number of rows by the value of a column called ID. I know you can do this pretty easily by:
SELECT col1, col2, col3 FROM mytable WHERE id IN (1,2,3,4,5...)
However, what if there are a few million IDs I want to select and the IDs don't always follow a pattern (which means I can't use something like BETWEEN x AND y)? Does this select statement still work, or are there better ways of doing so?
The actual application is this. Users specify filters, which are compared against some attributes of the records. From those filters, we create the subset of the data that is of interest to a particular user. There are about 30 million records, each with roughly 3000 attributes (stored in about 30 tables, every one of which has ID as its primary key), so every time someone queries their desired subset of records, we have to join many tables, apply the filters, and figure out what the subset looks like. To avoid joining many tables every time, I thought it might be better to join the tables once and work out the ids of the selected subset, so that each new query only has to select the relevant columns of the rows that match the filtered ids.
This depends on the database and the interface you are using. For a few hundred or thousand values, no problem. But your question specifies millions. And that could start to get into limits on the length of the query -- either specified by the database, the tool you are using, or intermediate libraries.
If you have so many ids, I would strongly recommend that you load them into a table in the database with the id as the primary key. Then use join or exists to identify the rows in your table that match.
Often, such a list would be generated in the database anyway. In that case, you can use a subquery or CTE and just include that code in your final query.
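A rough sketch of that suggestion, in fairly generic SQL. The table selected_ids and the filter column attr1 are hypothetical names introduced for illustration; the exact type names and temporary-table syntax depend on the database.

```sql
-- Hypothetical staging table holding the ids selected for one user.
CREATE TABLE selected_ids (
    id INTEGER NOT NULL PRIMARY KEY
);

-- Populate it inside the database instead of sending a literal list.
-- "attr1 = 'some value'" is a placeholder for the user's real filters.
INSERT INTO selected_ids (id)
SELECT id
FROM mytable
WHERE attr1 = 'some value';

-- Later queries join against the stored ids instead of using a huge IN list.
SELECT t.col1, t.col2, t.col3
FROM mytable t
JOIN selected_ids s ON s.id = t.id;

-- The same result expressed with EXISTS.
SELECT t.col1, t.col2, t.col3
FROM mytable t
WHERE EXISTS (SELECT 1 FROM selected_ids s WHERE s.id = t.id);
```

Because id is the primary key on both sides, the join can use the indexes, and the query text stays the same size however many ids are selected.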

SQL - Delete rows if one of the columns has a null value

Is there any way to delete all the rows of a table if the value of one of the columns is null, without specifying the particular column?
I am dealing with a table that has a lot of columns, and it is not efficient to specify all of them, so I was wondering if something like this is possible?

Compare 2 tables based on range values

We have a big transaction table that holds all the values (including duplicates), and we need to eliminate the duplicates based on the values in another table.
Table A (the transaction table) has Store, Date, Index, etc.
Table B maintains the index ranges; it has Store, Date, Index Begin, Index End, etc.
For each Store and Date, I need to compare the Index from Table A with the ranges in Table B and eliminate from Table A the rows whose Index falls within one of those ranges, so that I avoid duplicate values.
If a given Index is not between Index Begin and Index End, I can keep it. Index ranges start from 1, but I need to keep Index 1 because it's a header record.
So the check has to start from Index 2 onwards. If you could help with the SQL statement, that would be great.
I tried a few statements, but they did not work.
I need to eliminate duplicate records from Table A based on the index ranges in Table B.
To eliminate the duplicates, use the keyword DISTINCT after SELECT, i.e. SELECT DISTINCT. You'll need to write a JOIN that compares the two tables on their common values.
I assume you already have a query, so I won't write one unless you comment that you need help :)
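For what it's worth, here is one possible shape for such a query, written as a sketch under assumptions. The table names table_a and table_b and the column names txn_date, idx, index_begin and index_end are hypothetical (Date and Index are reserved words in most databases). It uses NOT EXISTS rather than a plain JOIN, because "not inside any range" is easier to express as an anti-join, and keeps DISTINCT to collapse exact repeats.

```sql
-- Rows to keep: the header record (idx = 1) plus every row whose idx
-- does not fall inside any range recorded for the same store and date.
SELECT DISTINCT a.store, a.txn_date, a.idx
FROM table_a a
WHERE a.idx = 1
   OR NOT EXISTS (
        SELECT 1
        FROM table_b b
        WHERE b.store = a.store
          AND b.txn_date = a.txn_date
          AND a.idx BETWEEN b.index_begin AND b.index_end
   );
```

BETWEEN is inclusive at both ends, so a row whose idx equals index_begin or index_end counts as inside the range.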

String Grouping from a single column in Oracle database having million rows and removing duplicates

We have a huge table in which one column contains query strings, e.g. in row 1:
1. (((firstname:Adam OR firstname:Neil ) AND lastname:Lee) ) AND category:"Legal" AND type:Individual
and in row 2 of the same column:
2. (((firstname:Adam* OR firstname:Neil ) AND lastname:Lee) ) AND category:"Legal" AND type:Organization
Similarly, there are a few other types of query strings, which are eventually used to query external services.
The issue is that I have to group rows and remove duplicates from this table based on certain criteria.
There are a few rules that determine how the strings in different rows are grouped. One of them is that if the first name and last name are the same, the category and type values are ignored, so the two rows above would be grouped into one. There are around a million rows. Comparing strings and grouping them does not look like an elegant solution. What would be the best possible solution using SQL?

Eliminating Duplicate Records in a DB2 Table

How do I delete duplicate records in a DB2 table? I want to be left with a single record for each group of dupes.
1. Create another table, "no_dups", that has exactly the same columns as the table you want to eliminate the duplicates from. (You may want to add an identity column to make it easier to identify individual rows.)
2. Insert into "no_dups" with select distinct column1, column2...columnN from the original table. The "select distinct" should bring back only one row for each group of duplicates in the original table. If it doesn't, you may have to alter the list of columns or take a closer look at your data; it may look like duplicate data without actually being so.
3. When step 2 is done, you will still have your original table, and "no_dups" will have all the rows without duplicates. At this point you can do any number of things: drop and rename the tables, or delete everything from the original and insert into it with select * from no_dups.
If you're running into problems identifying duplicates, and you've added an identity column to "no_dups", you should be able to delete rows one by one using the identity column value.
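The steps above could be sketched roughly as follows. The table name orders and the three-column list are hypothetical placeholders for the real table and its full column list, and the optional identity column is left out for brevity.

```sql
-- Step 1: create an empty table with the same columns as the original.
CREATE TABLE no_dups LIKE orders;

-- Step 2: copy one row per distinct combination of values.
INSERT INTO no_dups (column1, column2, column3)
SELECT DISTINCT column1, column2, column3
FROM orders;

-- Step 3 (one of the options): empty the original and reload it.
DELETE FROM orders;

INSERT INTO orders (column1, column2, column3)
SELECT column1, column2, column3
FROM no_dups;
```

It is probably wise to compare the row counts of the two tables after step 2, and to run step 3 inside a single transaction so that a failure does not leave the original table empty.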