Pick duplicate record within grouping records? - sql

I have a history table in which duplicate records are entered because of the history_date_time column.
Some process or loading issue causes these duplicates.
I used a query like:
SELECT (col1, col2), COUNT(*)
FROM table_name
GROUP BY (col1, col2)
HAVING COUNT(*) >1;
I grouped the records by col1 and col2, but the problem is that
the other columns may hold different values within a group. I want to pick the unique records within each group by checking all columns.
How can I achieve this with an Oracle SQL query?
Sorry, I don't have a proper table structure right now.

It should be:
SELECT col1, col2, COUNT(*)
FROM table_name
GROUP BY col1, col2
HAVING COUNT(*) > 1;
Alternatively, if it must be a "group" of columns, perhaps you meant something like this?
SELECT col1 ||'-'|| col2 col, COUNT(*)
FROM table_name
GROUP BY col1 ||'-'|| col2
HAVING COUNT(*) >1;
Sample data would help.
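Since no table structure was given, here is only a minimal sketch of one way to list the unique records inside each duplicated (col1, col2) group. It runs the SQL through Python's sqlite3 module as a stand-in for Oracle; the table name history, the extra column col3, and the sample rows are assumptions made up for illustration. The idea is to find the duplicated keys first, then group by every column except history_date_time.

```python
import sqlite3

# Hypothetical table and data; only history_date_time, col1 and col2 come from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (col1 TEXT, col2 TEXT, col3 TEXT, history_date_time TEXT)"
)
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?, ?)",
    [
        ("a", "x", "v1", "2020-01-01 10:00"),
        ("a", "x", "v1", "2020-01-01 10:05"),  # same data, loaded twice
        ("a", "x", "v2", "2020-01-02 09:00"),  # same key, different col3
        ("b", "y", "v3", "2020-01-03 08:00"),  # key occurs only once
    ],
)

query = """
SELECT h.col1, h.col2, h.col3,
       MIN(h.history_date_time) AS first_seen,
       COUNT(*)                 AS copies
FROM history h
JOIN (SELECT col1, col2            -- keys that occur more than once
      FROM history
      GROUP BY col1, col2
      HAVING COUNT(*) > 1) d
  ON d.col1 = h.col1 AND d.col2 = h.col2
GROUP BY h.col1, h.col2, h.col3    -- every column except the timestamp
ORDER BY h.col1, h.col2, h.col3
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With a real table, the GROUP BY list would need to name every column that should take part in the comparison.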

Related

SQL Server Query - How to append row showing total record count?

What is the best approach to append a row to a SQL Server query showing the total count of rows resulting from the query? UNION is one way, but seems very inefficient:
SELECT col1, col2 FROM tbl1
UNION ALL
SELECT STR(COUNT(col1)), NULL FROM tbl1
ROLLUP isn't an option because it requires GROUP BY, which we're not using for the queries in question.
You can use GROUPING SETS for this
SELECT
CASE WHEN GROUPING(col1) = 0 THEN col1 ELSE CAST(COUNT(*) AS varchar(30)) END AS col1,
col2
FROM tbl1
GROUP BY GROUPING SETS (
(col1, col2),
()
);
The GROUPING function will tell you whether the row is the Total row or not.
This does have the effect of grouping the columns which could be a different result and possibly less efficient. But if you include a unique/primary key as the first column in the grouping list then this shouldn't make a difference, and should be almost as performant as the original query.
You can also use a window function, which will return the total on each row as another column
SELECT
col1,
col2,
COUNT(*) OVER ()
FROM tbl1;
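As a rough illustration of the window-function variant next to the appended-row variant, the sketch below runs both against a tiny made-up tbl1. It uses Python's sqlite3 module as a stand-in for SQL Server (so CAST ... AS TEXT replaces STR, and the SQLite build is assumed to be 3.25 or newer for window functions); the sample rows and the is_total helper column are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl1 (col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO tbl1 VALUES (?, ?)", [("a", "x"), ("b", "y"), ("c", "z")]
)

# Total repeated on every row as an extra column.
windowed = conn.execute(
    "SELECT col1, col2, COUNT(*) OVER () AS total_rows FROM tbl1 ORDER BY col1"
).fetchall()

# Total appended as one extra row; is_total only forces it to sort last.
appended = conn.execute(
    """
    SELECT col1, col2
    FROM (SELECT col1, col2, 0 AS is_total FROM tbl1
          UNION ALL
          SELECT CAST(COUNT(*) AS TEXT), NULL, 1 FROM tbl1)
    ORDER BY is_total, col1
    """
).fetchall()

print(windowed)
print(appended)
```

The window version keeps the original column types, while the appended-row version has to cast the count to text so it fits into col1.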

Counting matching rows of two same tables and counting rows of the table

I have the same table structure called "table1" under two different schemas, "schema1" and "schema2". "table1" contains columns "col1, col2, col3". Initially I wanted to see whether there are records having the same entries of col1 and col2 in schema1.table1 and schema2.table1. But I mistyped schema2.table1 as schema1.table1, and now I am confused by the query result.
SELECT COUNT(*) FROM schema1.table1 AS s1t, schema1.table1 AS s2t
WHERE s1t.col1 = s2t.col1 AND s1t.col2 = s2t.col2;
I got
count
-------
530
(1 row)
However, SELECT COUNT(*) FROM schema1.table1; shows that there are 17815 rows.
Why would the first query show there are only 530 satisfied records? Shouldn't it be 17815 as well?
You can try a FULL OUTER JOIN to see the mismatched rows as well, including rows with NULL values in col1 or col2. That way at least 17815 rows are returned:
SELECT COUNT(*)
FROM schema1.table1 AS s1t
FULL OUTER JOIN schema1.table1 AS s2t
ON s1t.col1 = s2t.col1 AND s1t.col2 = s2t.col2
In your query, only the rows that match on col1 and col2 are returned.
You are joining the table to itself. That is really strange.
In any case, your join is going to filter out any rows where col1 or col2 are NULL.
In addition, the self-join might multiply the number of rows if there are duplicates (with respect to the two columns) in the table.
It is really unclear why you would be doing this, but the above explains the results you are seeing.
If you want to compare the results in the two schemas allowing for duplicates and missing values, I recommend union all/group by:
select col1, col2, sum(cnt1) as cnt1, sum(cnt2) as cnt2
from ((select col1, col2, count(*) as cnt1, 0 as cnt2
from schema1.table1
group by col1, col2
) union all
(select col1, col2, 0 as cnt1, count(*) as cnt2
from schema2.table1
group by col1, col2
)
) t12
group by col1, col2
having sum(cnt1) <> sum(cnt2);
This returns pairs where the counts are not the same in the two tables. It even works for NULL values. If you ran this on the same table, no rows would be returned.
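The two effects described above (rows with NULL dropping out of the self-join, and duplicates multiplying) can be reproduced on a small scale. This is only a sketch with invented data, run through Python's sqlite3 module rather than the asker's database; the six sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INTEGER, col2 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        (1, "a"), (1, "a"),         # duplicate pair: 2 x 2 = 4 joined rows
        (2, "b"),                   # unique pair: 1 joined row
        (None, "c"), (None, "d"),   # NULL in col1: never equal, 0 joined rows
        (3, None),                  # NULL in col2: never equal, 0 joined rows
    ],
)

total_rows = conn.execute("SELECT COUNT(*) FROM table1").fetchone()[0]

# Same shape as the query in the question: the table joined to itself.
joined_rows = conn.execute(
    """
    SELECT COUNT(*)
    FROM table1 AS s1t, table1 AS s2t
    WHERE s1t.col1 = s2t.col1 AND s1t.col2 = s2t.col2
    """
).fetchone()[0]

rows_with_null = conn.execute(
    "SELECT COUNT(*) FROM table1 WHERE col1 IS NULL OR col2 IS NULL"
).fetchone()[0]

print(total_rows, joined_rows, rows_with_null)
```

So the self-join count can be lower or higher than the table's row count, depending on how many rows contain NULL and how many pairs are duplicated.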

db2 select distinct rows, but select all columns

Experts, I have a single table with multiple columns: col1, col2, col3, col4, col5, col6.
I need to select distinct (col4), but I also need all the other columns in my output.
If I run select distinct(col4) from table1, I get only col4 in my output.
How can I do this on DB2?
Thank you
You simply do this...
Select * From Table1 Where col4 In (Select Distinct(col4) From Table1)
I'm not sure if you will be able to do this.
You might try to run group by on this column. You will be able to run some aggregate functions on other columns.
select count(col1), col4 from table1 group by (col4);
None of the answers worked for me, so here is one that I got working: use GROUP BY on col4 while taking the MAX of the other columns.
select max(col1) as col1,max(col2) as col2,max(col3) as col3
, col4
from
table1
group by col4
At least in DB2, you can execute
SELECT
DISTINCT *
FROM
<YOUR TABLE>
Which will give you every distinct combination of your (in this case) 6 columns.
Otherwise, you'll have to specify what columns you want to include. If you do that, you can either use select distinct or group by.
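If the goal is one complete row per distinct col4, another common option is ROW_NUMBER() with a filter on the first row of each partition. The sketch below is not from the answers above; it runs through Python's sqlite3 module (SQLite 3.25 or newer assumed) as a stand-in for DB2, trims the table to four columns, and arbitrarily keeps the row with the lowest col1 in each group, which is an assumption since the question does not say which row should win.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INTEGER, col2 TEXT, col3 TEXT, col4 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (1, "a", "p", "k1"),
        (2, "b", "q", "k1"),  # same col4 as the row above
        (3, "c", "r", "k2"),
    ],
)

query = """
SELECT col1, col2, col3, col4
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY col4 ORDER BY col1) AS rn
      FROM table1 t) ranked
WHERE rn = 1          -- keep one whole row per col4 value
ORDER BY col4
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Unlike taking MAX of each column separately, this returns values that all come from the same original row.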

Hive: Select all rows with a range from the max of a column

I am trying to write a query in Hive that will then be automated. I have a table of Requests with a timestamp field called updated, so there are a lot of rows with the date and time at which each Request was made. Regardless of when the query is run, I want to get the Requests from the last 7 days.
I tried:
SELECT col1, col2, col3, count(*) cnt
FROM table
WHERE updated BETWEEN date_sub(SELECT MAX(updated) AS maxdate FROM table, 7)
AND SELECT MAX(updated) AS maxdate FROM table
GROUP BY col1, col2, col3
HAVING cnt > 10
I have looked over this and it seems like it should do what I am looking for, however I get:
ParseException line 4:79 cannot recognize input near 'select' 'max' '(' in function specification
Any help on this error, or a suggested different approach, would be great.
Can you try this query, if the data type of the column "updated" is datetime in all tables:
SELECT col1, col2, col3, count(*) cnt
FROM table
WHERE updated BETWEEN (SELECT MAX(updated)-7 AS maxdate FROM table)
AND (SELECT MAX(updated) AS maxdate FROM table)
GROUP BY col1, col2, col3
HAVING count(*) > 10
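A different way to express the same filter, in case scalar subqueries in the WHERE clause are not accepted, is to compute the maximum once in a one-row derived table and cross join it to the data. This is a sketch only: it runs through Python's sqlite3 module, so the date arithmetic uses SQLite's datetime() rather than Hive's functions such as the date_sub from the question. The table name requests, the sample rows, and the lowered threshold of 1 (instead of 10) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (col1 TEXT, updated TEXT)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [
        ("r1", "2024-03-10 12:00:00"),  # the maximum timestamp
        ("r1", "2024-03-08 09:00:00"),  # inside the 7-day window
        ("r1", "2024-03-01 09:00:00"),  # older than 7 days before the maximum
        ("r2", "2024-03-09 08:00:00"),  # inside the window, but only one row
    ],
)

query = """
SELECT r.col1, COUNT(*) AS cnt
FROM requests r
CROSS JOIN (SELECT MAX(updated) AS maxdate FROM requests) m  -- one-row table
WHERE r.updated BETWEEN datetime(m.maxdate, '-7 days') AND m.maxdate
GROUP BY r.col1
HAVING COUNT(*) > 1
ORDER BY r.col1
"""
rows = conn.execute(query).fetchall()
print(rows)
```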

GROUP BY and ORDER BY on different columns

I want to execute a query on POSTGRESQL server whose structure is as below:
SELECT col1, SUM(col2) GROUP BY col1 ORDER BY colNotInSelect;
I have tried to include the colNotInSelect in the GROUP BY clause but since it is a column with a distinct value, it defeats the purpose of using GROUP BY in the first place.
Any help is appreciated.
You cannot order by that column because it potentially has many values for each value of col1.
However you can apply an aggregate function to the column, and order by that.
For example:
SELECT col1,
SUM(col2)
GROUP BY col1
ORDER BY MIN(colNotInSelect);
Your question does not quite make sense as written, because the rows are grouped by col1, so there is no single colNotInSelect value left in each grouped row. Try aggregating colNotInSelect before ordering, for example:
SELECT col1, SUM(col2), AVG(colNotInSelect) as col3 GROUP BY col1 ORDER BY col3;
If this doesn't fit your need, maybe you should clarify what you're trying to do.
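To make the idea of ordering by an aggregate concrete, here is a small runnable sketch. The queries above omit the FROM clause, so the table name t and the sample rows are assumptions, and the SQL is run through Python's sqlite3 module rather than PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER, colNotInSelect INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("a", 1, 30),
        ("a", 2, 5),    # smallest colNotInSelect for group a
        ("b", 1, 20),
        ("c", 4, 10),
    ],
)

# Groups are sorted by their smallest colNotInSelect: a (5), c (10), b (20).
rows = conn.execute(
    """
    SELECT col1, SUM(col2)
    FROM t
    GROUP BY col1
    ORDER BY MIN(colNotInSelect)
    """
).fetchall()
print(rows)
```

Choosing MIN, MAX, or AVG changes which value represents each group in the sort, so the right aggregate depends on what the ordering is meant to express.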