If I have a query, for example:
SELECT * FROM MY_TABLE WHERE FIRSTNAME = 'HENRY';
that returns, say, twenty identical results for HENRY, is there a way to then query the results of the original query so that only non-duplicates are returned?
This is a trivial example, but basically I have a query where I am trying to perform a SELECT DISTINCT on a large data set. If I don't specify DISTINCT, I get a relatively small and fast result that contains some duplicate data. Is there any logic in SQL I can apply to then perform a SELECT DISTINCT on those results, essentially breaking up the query to reduce response times? Assume everything of value is indexed.
Thanks
To return the first of a group of records you can do something like this:
select *
from
(
    SELECT *, row_number() over (partition by firstname order by id) r
    FROM MY_TABLE
    --WHERE FIRSTNAME = 'HENRY'
) x
where x.r = 1
If the records are exact duplicates, you're not worried about the first since they're all the same, so you just want distinct records:
SELECT distinct *
FROM MY_TABLE
WHERE FIRSTNAME = 'HENRY'
or to see how many duplicates:
SELECT firstname, lastname, count(*)-1 NoOfDuplicates
FROM MY_TABLE
WHERE FIRSTNAME = 'HENRY'
group by firstname, lastname --, ...
Be warned that having the database divide the data set into records which have a duplicate and records which do not is generally no more efficient than performing the actual DISTINCT, unless the number of columns on which duplication occurs is very much smaller than the total number of columns.
In some cases of very wide tables, where duplication exists only on a subset of columns and on a small proportion of the rows, it might be more efficient to do something like:
-- rows whose duplication column value is unique: pass through as-is
select *
from my_table t1
where not exists (
    select null
    from my_table t2
    where t2.duplication_column = t1.duplication_column and
          t2.rowid != t1.rowid)
union all
-- rows whose duplication column value is repeated: de-duplicate only these
select distinct *
from my_table t1
where exists (
    select null
    from my_table t2
    where t2.duplication_column = t1.duplication_column and
          t2.rowid != t1.rowid)
This would generally not be worth doing unless it avoided something very inefficient, like a very large sort spilling to disk.
I have some SQL code that reads like this. It is meant to grab all of the data meeting the two conditions, but not to grab a row if we already have a row with the same ID. Select Distinct (t1.ID) works as intended, but when I add in the additional variables, it no longer filters properly.
Select Distinct (t1.ID),t1.Var2, t1.Var3...
FROM table_location AS t1
WHERE t1.FCT_BILL_CURRENCY_CODE_LCL = 'USD'
AND t1.RQ_GLOBAL_REGION = 'North America'
[screenshot of the query results]
This clearly contains multiple rows with the same ID, contrary to how it should work. How do I fix this?
I'm not sure what DB you're using, but most will have the concept of numbering rows by a partition.
To get a distinct result by a certain value, you make a subquery that selects your data plus a row number partitioned by your distinct property, then have the parent query select only the rows whose row number is 1, giving just the first row of each group.
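A minimal sketch of that pattern (the table and column names are generic placeholders; adapt the partition and order columns to your schema):
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY t.ID ORDER BY t.ID) AS rn -- numbering restarts for each ID
    FROM your_table t
) numbered
WHERE numbered.rn = 1; -- keep one row per ID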
I have added a query based on the sample query you mentioned in the problem. If you add sample data, we will have a better understanding of the problem.
Query
SELECT
ID,
Var2,
Var3
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY t1.ID ORDER BY t1.ID) AS Rnk_ID, -- ROW_NUMBER, not DENSE_RANK: with this partition and order, DENSE_RANK would rank every row 1
t1.ID,
t1.Var2,
t1.Var3
FROM table_location AS t1
WHERE t1.FCT_BILL_CURRENCY_CODE_LCL = 'USD'
AND t1.RQ_GLOBAL_REGION = 'North America'
) qry1
WHERE Rnk_ID = 1
I have a table with the structure given below:
A User_ID has values for its respective items in a specific time interval. An item's value can be text or integer, depending upon the item.
I want to check whether any two or more UserIDs have the same values, meaning their items are the same, with the same values, in the same time interval.
In the table above, UserId 213456 and UserId 213458 have the same records.
I tried using cursors and loops, but it's taking too long. My table has more than 50 million UserIDs. Is there a way to do this efficiently?
I also tried using GROUP BY with subqueries, but all my attempts failed to produce a good query.
I created the following query using How do I find duplicate values in a table in Oracle?
select t1.USERID, count(t1.USERID)
from USERS_ITEM_VAL t1
where exists ( select *
from USERS_ITEM_VAL t2
where t1.rowid <> t2.rowid and
t2.ITEMID = t1.ITEMID and
t2.TEXT_VALUE = t1.TEXT_VALUE and
--t2.INTEGER_VALUE = t1.INTEGER_VALUE and
t2.INIT_DATE = t1.INIT_DATE and
t2.FINAL_DATE = t1.FINAL_DATE )
group by t1.USERID having count(t1.USERID) > 1 order by count(t1.USERID);
But the problem is that it works when I exclude the INTEGER_VALUE column, yet gives me no output when I include INTEGER_VALUE in the join, even though the data in the INTEGER_VALUE column is the same.
Here is the structure of my table:
USERID - NUMBER
ITEMID - NUMBER
TEXT_VALUE - VARCHAR2(500)
INTEGER_VALUE - NUMBER
INIT_DATE - DATE
FINAL_DATE - DATE
One way to approach this uses a self join. The idea is to count the number of items that two users have in common (taking the date columns into account), then compare this to the number of items each user has:
with t as (
    select u.*, count(*) over (partition by userid) as numitems
    from users_item_val u -- alias must differ from the CTE name to avoid a self-reference
)
select t1.userid, t2.userid
from t t1 join
t t2
on t1.userid < t2.userid and
t1.itemid = t2.itemid and
t1.init_date = t2.init_date and
t1.final_date = t2.final_date and
t1.numitems = t2.numitems
group by t1.userid, t2.userid, t1.numitems
having count(*) = t1.numitems;
The reason your query failed is that either text_value or integer_value will be NULL in every row. For this reason, it's not possible to use an equality predicate in the self-join without using NVL functions to plug the NULL values.
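For illustration, here is a sketch of the original self-join with the NULLs plugged via NVL (the sentinel values '##null##' and -999999999 are arbitrary choices of mine and must not occur in your real data):
select t1.USERID, count(t1.USERID)
from USERS_ITEM_VAL t1
where exists ( select *
               from USERS_ITEM_VAL t2
               where t1.rowid <> t2.rowid and
                     t2.ITEMID = t1.ITEMID and
                     -- NVL plugs the NULLs so the equality predicates can match
                     NVL(t2.TEXT_VALUE, '##null##') = NVL(t1.TEXT_VALUE, '##null##') and
                     NVL(t2.INTEGER_VALUE, -999999999) = NVL(t1.INTEGER_VALUE, -999999999) and
                     t2.INIT_DATE = t1.INIT_DATE and
                     t2.FINAL_DATE = t1.FINAL_DATE )
group by t1.USERID having count(t1.USERID) > 1 order by count(t1.USERID);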
However, below is a query that uses an analytic function to accomplish the goal:
Select * From (
    Select t.*, Count(*) Over (Partition By t.itemid,
                                            t.text_value,
                                            t.integer_value,
                                            t.init_date,
                                            t.final_date) as Cnt
    From users_item_val t
)
Where cnt > 1;
The query returns all rows where multiple records have identical values in the five columns of the Partition By clause.
A benefit of this technique over the self-join approach is that the table is scanned only once, whereas it would be scanned twice with a self join. This could result in better performance if the table is large.
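If you want to verify the single scan against your own data, Oracle's EXPLAIN PLAN will show how the table is accessed (a quick sketch, using the table from the question):
EXPLAIN PLAN FOR
SELECT * FROM (
    SELECT t.*,
           COUNT(*) OVER (PARTITION BY t.itemid, t.text_value, t.integer_value,
                          t.init_date, t.final_date) AS cnt
    FROM users_item_val t
) WHERE cnt > 1;

-- display the plan that was just explained
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);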
I have a table in Access that is set up so that there are multiple records with the same ID; they correspond to each other.
I'd like to find certain records that have a specific date value. However, I want all the corresponding information WITH that ID (i.e., all the other records with the same ID). I've tried things like this:
SELECT *
FROM myTable
WHERE LEFT(Field1,7) = '2016-11' IN (SELECT ID
FROM myTable
GROUP BY ID
HAVING COUNT(*)>1)
and
SELECT *
FROM myTable
WHERE ID = (SELECT * FROM myTable WHERE LEFT(Field1,7) = '2016-11'
Neither of these gives me the proper output. I think I may need a For loop of some sort, but I don't have much experience doing this with SQL. That way I could loop through all IDs that are returned with that date part. Any suggestions? I would put the table format in the post, but the table formatting isn't working for me for some reason. The frustration is real!
Haha thanks ahead of time for taking the time to even read my question. Much appreciated.
EDIT
Here is a visual of what my table is like:
[ExampleTable screenshot]
I'd like to choose all the records that occur during November, but also get the corresponding information (i.e. records with same ID number as the November records).
Consider adding the WHERE condition in a subquery:
SELECT *
FROM myTable
WHERE ID IN (SELECT ID FROM myTable
WHERE LEFT(Field1, 7) = '2016-11');
Alternatively, to avoid the IN subquery, try an INNER JOIN on a filtered self join by ID:
SELECT myTable.*
FROM myTable
INNER JOIN
(SELECT ID FROM myTable
WHERE LEFT(Field1, 7) = '2016-11') sub
ON sub.ID = myTable.ID
I have a search screen where the user has 5 filters to search on.
I construct a dynamic query based on these filter values and page 10 results at a time.
This works fine in SQL Server 2012 using OFFSET and FETCH, but I'm using two queries to do it.
I want to show the 10 results and display the total number of rows found by the query (say, 1,000).
Currently I do this by running the query twice: once for the total count, then again to page the 10 rows.
Is there a more efficient way to do this?
You don't have to run the query twice.
SELECT ..., total_count = COUNT(*) OVER()
FROM ...
ORDER BY ...
OFFSET 120 ROWS
FETCH NEXT 10 ROWS ONLY;
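Filled in with concrete names (a hypothetical dbo.Products table, invented for illustration), the pattern looks like this:
SELECT p.ProductID, p.Name, total_count = COUNT(*) OVER()
FROM dbo.Products AS p
WHERE p.Price > 10          -- whatever your dynamic filters produce
ORDER BY p.Name
OFFSET 120 ROWS             -- page 13 at 10 rows per page
FETCH NEXT 10 ROWS ONLY;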
Based on the chat, it seems your problem is a little more complex - you are applying DISTINCT to the result in addition to paging. This can make it complex to determine exactly what the COUNT() should look like and where it should go. Here is one way (I just want to demonstrate this rather than try to incorporate the technique into your much more complex query from chat):
USE tempdb;
GO
CREATE TABLE dbo.PagingSample(id INT,name SYSNAME);
-- insert 20 rows, 10 x 2 duplicates
INSERT dbo.PagingSample SELECT TOP (10) [object_id], name FROM sys.all_columns;
INSERT dbo.PagingSample SELECT TOP (10) [object_id], name FROM sys.all_columns;
SELECT COUNT(*) FROM dbo.PagingSample; -- 20
SELECT COUNT(*) FROM (SELECT DISTINCT id, name FROM dbo.PagingSample) AS x; -- 10
SELECT DISTINCT id, name FROM dbo.PagingSample; -- 10 rows
SELECT DISTINCT id, name, COUNT(*) OVER() -- 20 (DISTINCT is not computed yet)
FROM dbo.PagingSample
ORDER BY id, name
OFFSET (0) ROWS FETCH NEXT (5) ROWS ONLY; -- 5 rows
-- this returns 5 rows but shows the pre- and post-distinct counts:
SELECT PostDistinctCount = COUNT(*) OVER(), -- 10
       PreDistinctCount,                    -- 20
       id, name
FROM
(
SELECT DISTINCT id, name, PreDistinctCount = COUNT(*) OVER()
FROM dbo.PagingSample
-- INNER JOIN ...
) AS x
ORDER BY id, name
OFFSET (0) ROWS FETCH NEXT (5) ROWS ONLY;
Clean up:
DROP TABLE dbo.PagingSample;
GO
My solution is similar to rs.'s answer:
DECLARE @PageNumber AS INT, @RowspPage AS INT
SET @PageNumber = 2
SET @RowspPage = 5
SELECT COUNT(*) OVER() totalrow_count, *
FROM tablename
WHERE columnname LIKE '%abc%'
ORDER BY columnname
OFFSET ((@PageNumber - 1) * @RowspPage) ROWS
FETCH NEXT @RowspPage ROWS ONLY;
The result set will include totalrow_count as its first column.
Can you try something like this?
SELECT TOP 10 * FROM
(
SELECT COUNT(*) OVER() TOTALCNT, T.*
FROM TABLE1 T
WHERE col1 = 'somefilter'
) v
or
SELECT * FROM
(
SELECT COUNT(*) OVER() TOTALCNT, T.*
FROM TABLE1 T
WHERE col1 = 'somefilter'
) v
ORDER BY COL1
OFFSET 0 ROWS FETCH FIRST 10 ROWS ONLY
Now you have the total count in the TOTALCNT column and can use it to show the total number of rows.
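As a rough sketch of consuming that column (the page size of 10 is an assumption, matching the queries above):
SELECT TOP (1)
       TOTALCNT,
       CEILING(TOTALCNT / 10.0) AS TotalPages -- total pages at 10 rows per page
FROM
(
    SELECT COUNT(*) OVER() TOTALCNT, T.*
    FROM TABLE1 T
    WHERE col1 = 'somefilter'
) v;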
In my testing with a complex join and ~6,000 records returned, it's much faster to do two separate queries: milliseconds in total to get the count and separately bring back a subset of 100 records, versus 17 seconds for the combined query. Anyone else see this kind of performance hit? Obviously it could have something to do with the data structure, but this is still a huge difference.
I hope I'm not too late to jump in on this question, but I ran across a very similar problem tonight. I had a paging class that was over-inflating the number of results returned, because the previous developer was dropping the DISTINCT and just doing a SELECT count(*) of the table joins. While this doesn't solve the two-query problem, I ended up using a nested query, so that it looked like this:
Original Query
SELECT DISTINCT
field1, field2
FROM
table1 t1
left join table2 t2 on t2.id = t1.id
Over Inflated Results Query
SELECT
count(*)
FROM
table1 t1
left join table2 t2 on t2.id = t1.id
My Results Query Solution
SELECT
count(*)
FROM
(SELECT DISTINCT
field1, field2
FROM
table1 t1
left join table2 t2 on t2.id = t1.id) as tbl;
Can I select all rows that have the same column value (for example, an SSN field) but display them all separately?
I've searched for an answer, but the ones I found all have a count(*) and GROUP BY section that demands the rows be exactly the same.
Try This:
SELECT A, B FROM MyTable
WHERE A IN
(
SELECT A FROM MyTable GROUP BY A HAVING COUNT(*)>1
)
I did this with SQL Server, but I hope this is what you need.
Here is another approach, which references the table only once, using an analytic function instead of a subquery to get the duplicate counts. It might be faster; it also might not, depending on the particular data.
SELECT * FROM (
    SELECT col1, col2, col3, ssn, COUNT(*) OVER (PARTITION BY ssn) ssn_dup_count
    FROM MyTable
)
WHERE ssn_dup_count > 1
ORDER BY ssn_dup_count DESC
SELECT *
FROM MyTable
WHERE EXISTS
(
    SELECT NULL
    FROM MyTable MT
    WHERE MyTable.SameColumnName = MT.SameColumnName
      AND MyTable.DifferentColumnName <> MT.DifferentColumnName
)
This will fetch the required data and show it in order, so that we can see the grouped data together.
SELECT * FROM TABLENAME
WHERE SSN IN
(
SELECT SSN FROM TABLENAME GROUP BY SSN HAVING COUNT(SSN)>1
)
ORDER BY SSN
Here SSN is the column for which the similar-value check is done.