How to find the most frequently occurring value in a table using SQL? - sql

I have a table A with two columns named B and C, as follows:
('W1','F2')
('W1','F7')
('W2','F1')
('W2','F6')
('W2','F8')
('W4','F7')
('W6','F2')
('W6','F15')
('W7','F1')
('W7','F4')
('W7','F17')
('W8','F13')
How can I find which value in column B appears the most times, using SQL in Oracle? (In this case, it's W2 and W7.) Thank you!

Use a subquery to count the items in ColC for each value of ColB, and rank() the groups by that count. Then, in your main select, return only the ColB values whose rank in the subquery is 1:
SELECT ColB
FROM (
SELECT ColB,
Count(ColC),
rank() over (ORDER BY Count(ColC) DESC) AS rnk
FROM yourTable
GROUP BY ColB)
WHERE rnk = 1
Here's a sql fiddle: http://sqlfiddle.com/#!4/fa6bd/2
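In case the fiddle is unavailable, here is a minimal runnable sketch of the same rank-on-count idea. It uses Python's built-in sqlite3 module as a stand-in for Oracle (an assumption made only so the example runs anywhere; window functions need SQLite 3.25 or later). The function name most_frequent_b is hypothetical, and the count is computed in a CTE before ranking rather than inside the OVER clause.

```python
import sqlite3

# Sample rows from the question: table A with columns B and C.
ROWS = [
    ("W1", "F2"), ("W1", "F7"),
    ("W2", "F1"), ("W2", "F6"), ("W2", "F8"),
    ("W4", "F7"),
    ("W6", "F2"), ("W6", "F15"),
    ("W7", "F1"), ("W7", "F4"), ("W7", "F17"),
    ("W8", "F13"),
]


def most_frequent_b(rows):
    """Return every B value tied for the highest number of rows."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Oracle
    conn.execute("CREATE TABLE A (B TEXT, C TEXT)")
    conn.executemany("INSERT INTO A VALUES (?, ?)", rows)
    result = conn.execute(
        """
        WITH counts AS (
            SELECT B, COUNT(C) AS cnt
            FROM A
            GROUP BY B
        ),
        ranked AS (
            -- RANK() gives tied counts the same rank, so ties are kept.
            SELECT B, RANK() OVER (ORDER BY cnt DESC) AS rnk
            FROM counts
        )
        SELECT B FROM ranked WHERE rnk = 1 ORDER BY B
        """
    ).fetchall()
    conn.close()
    return [b for (b,) in result]


print(most_frequent_b(ROWS))
```

Using RANK() rather than ROW_NUMBER() is what keeps both W2 and W7 in the result when they tie.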

/*
C2 refers to column B
T1 is the name of the common table expression (CTE)
*/
WITH T1 AS
(
SELECT C2,COUNT(*) AS COUNT
FROM YOURTABLE
GROUP BY C2
)
SELECT C2,COUNT FROM T1 WHERE COUNT=(SELECT MAX(COUNT) FROM T1 )
;

Select ColB, Count(*)
FROM yourTable
GROUP BY ColB
ORDER BY count(*) desc

Related

(Impala) Selecting most common value in field results in "Subqueries are not supported in select list"

I am trying to do an aggregation that takes the most common value of the group, like this:
with t1 as (
select
id
, colA
, colB
from some_Table
)
select
id
, count(*) as total
, max(colA) as maxColA
, most_common(colB) -- this is what I'm trying to achieve
from t1
group by id
This is what I have tried to do:
with t1 as (
select
id
, colA
, colB
from some_Table
)
select
id
, count(*) as total
, max(colA) as maxColA
, (select colB, count(colB) as counts from t1 group by colB order by counts desc limit 1) as most_freq_colB_per_id
from t1
group by id
However, it tells me AnalysisException: Subqueries are not supported in the select list. How else can I do this?
Impala does not -- as far as I know -- have a built-in aggregation function to calculate the mode (the statistical name of what you are trying to calculate).
You can use two levels of aggregation. Your CTE isn't doing anything, so you can do:
select id, sum(total) as total, max(maxColA) as maxColA,
max(case when seqnum = 1 then colB end) as mode
from (select id, colB, count(*) as total, max(colA) as maxColA,
row_number() over (partition by id order by count(*) desc) as seqnum
from some_Table
group by id, colb
) t
group by id;
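To see the two levels of aggregation on concrete data, here is a small sketch run against SQLite through Python's sqlite3 module (a stand-in for Impala, so dialect details may differ; window functions need SQLite 3.25 or later). The sample rows, the function name mode_per_id and the alias mode_colB are assumptions made for illustration. The inner count is split into its own CTE before numbering the rows.

```python
import sqlite3

# Hypothetical sample data: (id, colA, colB).
SAMPLE = [
    (1, 10, "x"), (1, 20, "x"), (1, 5, "y"),
    (2, 7, "z"), (2, 3, "z"), (2, 9, "w"),
]


def mode_per_id(rows):
    """Return (id, total, maxColA, most common colB) for each id."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Impala
    conn.execute("CREATE TABLE some_table (id INTEGER, colA INTEGER, colB TEXT)")
    conn.executemany("INSERT INTO some_table VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        WITH per_value AS (
            -- First level: one row per (id, colB) with its count.
            SELECT id, colB, COUNT(*) AS total, MAX(colA) AS maxColA
            FROM some_table
            GROUP BY id, colB
        ),
        numbered AS (
            -- seqnum = 1 marks the most frequent colB within each id.
            SELECT per_value.*,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY total DESC) AS seqnum
            FROM per_value
        )
        -- Second level: collapse back to one row per id.
        SELECT id,
               SUM(total) AS total,
               MAX(maxColA) AS maxColA,
               MAX(CASE WHEN seqnum = 1 THEN colB END) AS mode_colB
        FROM numbered
        GROUP BY id
        ORDER BY id
        """
    ).fetchall()
    conn.close()
    return result


print(mode_per_id(SAMPLE))
```

If two colB values tie for the highest count within an id, ROW_NUMBER() picks one of them arbitrarily; the sample data avoids ties for that reason.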

Scalable Solution to get latest row for each ID in BigQuery

I have a quite large table with a field ID and another field collection_time. I want to select the latest record for each ID. Unfortunately, the combination (ID, collection_time) is not unique in my data. I want just one of the records with the maximum collection time. I have tried two solutions, but neither has worked for me:
First: using query
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time) as rn
FROM mytable) where rn=1
This results in a Resources exceeded error, which I guess is caused by the ORDER BY in the query.
Second
Using join between table and latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
This solution does not work for me because (ID, collection_time) is not unique, so the JOIN result contains multiple rows for each ID.
I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
SELECT
agg.table.*
FROM (
SELECT
id,
ARRAY_AGG(STRUCT(table)
ORDER BY
collection_time DESC)[SAFE_OFFSET(0)] agg
FROM
`dataset.table` table
GROUP BY
id)
This will do the job for you and is scalable: because the whole row is aggregated as a struct, you won't have to change the query when the schema changes.
Short and scalable version:
select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
Quick and dirty option: combine both of your queries into one. First get all records with the latest collection_time (using your second query), and then dedup them using your first query:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn
FROM (
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP BY ID
) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
)
)
WHERE rn = 1
And with Standard SQL (proposed by S.Mohsen sh)
WITH myTable AS (
SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
SELECT ID,
MAX(collection_time) AS second_time
FROM myTable GROUP BY ID
),
tab2 AS (
SELECT * FROM myTable
),
joint AS (
SELECT tab2.*
FROM tab2 INNER JOIN tab1
ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
FROM joint
)
WHERE rn=1
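The join-then-dedup idea can be tried on a tiny data set that contains a tie. The sketch below runs it on SQLite via Python's sqlite3 module, which is only a stand-in for BigQuery (it says nothing about BigQuery's resource limits, and window functions need SQLite 3.25 or later). The payload column, the sample rows and the function name latest_per_id are assumptions for illustration.

```python
import sqlite3

# Hypothetical rows: (id, collection_time, payload). ID 1 has two rows
# sharing the maximum collection_time, as described in the question.
SAMPLE = [
    (1, 100, "a"), (1, 200, "b"), (1, 200, "c"),
    (2, 50, "d"),
]


def latest_per_id(rows):
    """Return exactly one row per id among those with the latest time."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not BigQuery
    conn.execute(
        "CREATE TABLE mytable (id INTEGER, collection_time INTEGER, payload TEXT)"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        WITH latest AS (
            SELECT id, MAX(collection_time) AS max_time
            FROM mytable
            GROUP BY id
        ),
        joined AS (
            -- May still contain several rows per id when times tie.
            SELECT t.id, t.collection_time, t.payload
            FROM mytable AS t
            INNER JOIN latest AS l
              ON t.id = l.id AND t.collection_time = l.max_time
        ),
        numbered AS (
            -- No ORDER BY: any one of the tied rows is acceptable.
            SELECT joined.*, ROW_NUMBER() OVER (PARTITION BY id) AS rn
            FROM joined
        )
        SELECT id, collection_time, payload
        FROM numbered
        WHERE rn = 1
        ORDER BY id
        """
    ).fetchall()
    conn.close()
    return result


print(latest_per_id(SAMPLE))
```

Which of the tied rows survives for ID 1 is not defined, which matches the requirement of wanting "just one of" the latest records.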
If you don't mind writing a piece of code for every column:
SELECT ID,
ARRAY_AGG(col1 ORDER BY collection_time DESC)[OFFSET(0)] AS col1,
ARRAY_AGG(col2 ORDER BY collection_time DESC)[OFFSET(0)] AS col2
FROM myTable
GROUP BY ID
I see no one has mentioned window functions with QUALIFY:
SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp
The window function adds a column max_timestamp that is accessible in the QUALIFY clause to filter on.
As per your comment, considering you have a table of unique IDs for which you need to find the latest collection_time, here is another way to do it using a correlated subquery. Give it a try.
SELECT id,
(SELECT Max(collection_time)
FROM mytable B
WHERE A.id = B.id) AS Max_collection_time
FROM id_table A
Another solution, which could be more scalable since it avoids multiple scans of the same table (which happen with both the self-join and the correlated subquery in the answers above). This solution only works with standard SQL (uncheck the "Use Legacy SQL" option):
SELECT
ID,
(SELECT srow.*
FROM UNNEST(t.srows) srow
WHERE srow.collection_time = MAX(srow.collection_time))
FROM
(SELECT ID, ARRAY_AGG(STRUCT(col1, col2, col3, ...)) srows
FROM id_table
GROUP BY ID) t

remove rows with some duplicate column value

Suppose I have a table with column A like following :
a
--
x
y
m
x
n
y
I want to delete all rows that have a duplicate value in column a, keeping just one row per value.
After this operation, my column would look like the result of:
select distinct a from A;
I know how to select rows with repeated values in column a, but I can't just replace SELECT with DELETE because that would delete the unique values too.
Any help would be greatly appreciated.
In Oracle, you can do this by using the hidden column rowid and a correlated subquery:
delete from a
where rowid > (select min(rowid)
from a a2
where a.a = a2.a
);
Alternatively, you can phrase this as a not in:
delete from a
where rowid not in (select min(rowid)
from a a2
group by a2.a
);
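The "keep the smallest rowid per value" idea can be tried out quickly in SQLite, which also exposes an implicit rowid on ordinary tables. This is only an analogy: SQLite's rowid is an integer key, not Oracle's ROWID pseudocolumn. The table name t and the function name dedupe are hypothetical; the sample values are the ones from the question.

```python
import sqlite3

# Column values from the question, in insertion order.
VALUES = ["x", "y", "m", "x", "n", "y"]


def dedupe(values):
    """Delete duplicate rows, keeping the first-inserted row per value."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Oracle
    conn.execute("CREATE TABLE t (a TEXT)")
    conn.executemany("INSERT INTO t (a) VALUES (?)", [(v,) for v in values])
    # Keep only the row with the smallest rowid in each group of equal a.
    conn.execute(
        """
        DELETE FROM t
        WHERE rowid NOT IN (SELECT MIN(rowid) FROM t GROUP BY a)
        """
    )
    remaining = [a for (a,) in conn.execute("SELECT a FROM t ORDER BY rowid")]
    conn.close()
    return remaining


print(dedupe(VALUES))
```

Values that were already unique (m and n here) survive because their only row is also the minimum-rowid row of its group.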
You can use a combination of a CTE and a ranking function:
;With cte As
(
Select ROW_NUMBER() OVER (PARTITION BY colA ORDER BY colA) as rNum
From yourTable
)
Delete From cte
Where rNum<>1
In SQL Server, you can use a CTE and delete the duplicated rows. See the query below.
WITH CTE AS(
SELECT a,
RN = ROW_NUMBER()OVER(PARTITION BY a ORDER BY a)
FROM A
)
DELETE FROM CTE WHERE RN > 1

How to select all columns for rows where I check if just 1 or 2 columns contain duplicate values

I'm having difficulty with what I figure should be an easy problem. I want to select all the columns in a table for which one particular column has duplicate values.
I've been trying to use aggregate functions, but that's constraining me as I want to just match on one column and display all values. Using aggregates seems to require that I 'group by' all columns I'm going to want to display.
If I understood you correctly, this should do:
SELECT *
FROM YourTable A
WHERE EXISTS(SELECT 1
FROM YourTable
WHERE Col1 = A.Col1
GROUP BY Col1
HAVING COUNT(*) > 1)
You can join on a derived table where you aggregate and determine "col" values which are duplicated:
SELECT a.*
FROM Table1 a
INNER JOIN
(
SELECT col
FROM Table1
GROUP BY col
HAVING COUNT(1) > 1
) b ON a.col = b.col
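Here is a runnable sketch of the derived-table join, using Python's sqlite3 module so that it works without a database server. The columns id and note, the sample rows and the function name rows_with_duplicated_col are assumptions added for illustration; only col comes from the answer above.

```python
import sqlite3

# Hypothetical rows: (id, col, note). 'p' and 'r' are duplicated, 'q' is not.
SAMPLE = [
    (1, "p", "first"),
    (2, "q", "only"),
    (3, "p", "second"),
    (4, "r", "one"),
    (5, "r", "two"),
]


def rows_with_duplicated_col(rows):
    """Return full rows whose col value occurs more than once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Table1 (id INTEGER, col TEXT, note TEXT)")
    conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT a.id, a.col, a.note
        FROM Table1 AS a
        INNER JOIN (
            -- Only col is grouped, so the other columns stay unconstrained.
            SELECT col
            FROM Table1
            GROUP BY col
            HAVING COUNT(*) > 1
        ) AS b ON a.col = b.col
        ORDER BY a.id
        """
    ).fetchall()
    conn.close()
    return result


print(rows_with_duplicated_col(SAMPLE))
```

Because the aggregation happens inside the derived table, the outer query can select every column without listing them in a GROUP BY.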
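This query lets you ORDER BY cola in ascending or descending order within each colb group, which changes which row counts as the first one. Note that it returns only the rows after the first in each group (rn > 1).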
Here's a Demo on SqlFiddle.
with cl
as
(
select *, ROW_NUMBER() OVER(partition by colb order by cola ) as rn
from tbl)
select *
from cl
where rn > 1

How to get Original Rows filtered by a HAVING Condition?

What is the method in T-SQL to select the original rows limited by a HAVING condition? For example, if I have
A|B
10|1
11|2
10|3
How would I get all the values of B (not an average or some other summary statistic), grouped by A, having a count (occurrences of A) greater than or equal to 2?
Actually, you have several options to choose from
1. You could make a subquery out of your original having statement and join it back to your table
SELECT *
FROM YourTable yt
INNER JOIN (
SELECT A
FROM YourTable
GROUP BY
A
HAVING COUNT(*) >= 2
) cnt ON cnt.A = yt.A
2. another equivalent solution would be to use a WITH clause
;WITH cnt AS (
SELECT A
FROM YourTable
GROUP BY
A
HAVING COUNT(*) >= 2
)
SELECT *
FROM YourTable yt
INNER JOIN cnt ON cnt.A = yt.A
3. or you could use an IN statement
SELECT *
FROM YourTable yt
WHERE A IN (SELECT A FROM YourTable GROUP BY A HAVING COUNT(*) >= 2)
A self join will work:
select B
from table
join(
select A
from table
group by 1
having count(1)>1
)s
using(A);
You can use window function (no joins, only one table scan):
select * from (
select *, cnt=count(*) over(partition by A) from table
) as a
where cnt >= 2
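The window-function variant can be checked against the three sample rows from the question. The sketch below uses SQLite through Python's sqlite3 module instead of SQL Server, so the T-SQL alias form cnt=... is written as ... AS cnt; the table name t and the function name rows_in_repeated_groups are hypothetical. Window functions need SQLite 3.25 or later.

```python
import sqlite3

# Rows from the question: (A, B).
SAMPLE = [(10, 1), (11, 2), (10, 3)]


def rows_in_repeated_groups(rows, min_count=2):
    """Return original (A, B) rows whose A occurs at least min_count times."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not SQL Server
    conn.execute("CREATE TABLE t (A INTEGER, B INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT A, B
        FROM (
            -- Attach the group size to every row without collapsing rows.
            SELECT A, B, COUNT(*) OVER (PARTITION BY A) AS cnt
            FROM t
        ) AS counted
        WHERE cnt >= ?
        ORDER BY A, B
        """,
        (min_count,),
    ).fetchall()
    conn.close()
    return result


print(rows_in_repeated_groups(SAMPLE))
```

The filter has to sit in an outer query because a window function's result cannot be referenced in the WHERE clause of the same SELECT.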