Finding the highest COUNT of a group per individual GROUP BY query in Hive - sql

I have a table of customer transactions where an individual_id appears once for every different transaction.
There is a category column called Name_desc which i would like to group by individual and find the most common category of name_desc per individual.
Suppose data is like below
Id Name_desc
---- ------
1 a
2 c
1 b
2 c
1 b
I want below output
Id Name_desc( most occuring category)
------ ------
1 b
2 c
I tried with below query and got an
Error while compiling statement: FAILED: ParseException line 4:19 cannot recognize input near 'select' 'max' '(' in expression specification
error
select name_desc, count(*) as count_e
from db.cust_scan
group by id, name_desc
having count(*)= ( select max(count_e),id
from
(
select id, name_desc, count(*) as count_e
from
db.cust_scan
where
base_div_nbr =1
and
country_code ='US'
and
retail_channel_code=1
and visit_date between '2019-01-01' and '2019-12-31'
GROUP by
individual_id, tt_id_desc
order by individual_id, count_e desc
) as t
group by individual_id )
I would appreciate any suggestions or help with regard to query. If there is an efficient way of getting this job done. Let me know.

This following script written and tested for MSSQL. But as HIVE also support the same Row_Number() ans sub query, this following query should help you getting your required output-
SELECT A.Id, A.Name_desc
FROM
(
SELECT Id,Name_desc,
row_number() over (partition by id order by COUNT(*) desc) AS RN
FROM your_table
GROUP BY Id,Name_desc
) A
WHERE RN = 1

You need subquery in Hive:
SELECT s.Id, s.Name_desc
FROM
(
select s.*, row_number() over (partition by s.id order by s.cnt desc) rn
from
(
SELECT Id, Name_desc, COUNT(*) cnt
FROM your_table
GROUP BY Id, Name_desc
) s
) s
WHERE rn= 1;

Related

Group by column and get max and min id on sql

I got a table with theses Column :
ID_REAL,DATE_REAL,NAME_REAL
I want to make a query to get result like this with a group by on the name
NAME | MAX(DATE_REAL) | ID_REAL of the MAX(DATE_REAL) | MIN(DATE_REAL) | ID_REAL of the MIN(DATE_REAL)
I dont know how to make it for the moment I have
select NAME_REAL,max(DATE_REAL),ID_REAL from MYREALTABLE group by NAME_REAL,ID_REAL
select NAME_REAL,min(DATE_REAL),ID_REAL from MYREALTABLE group by NAME_REAL,ID_REAL
But is not whats I need, and also I need only 1 query
Thanks you
I think the following should work by finding the records which have the minimum and maximum dates per name and joining those two queries.
select
mn.NAME_REAL,
MIN_DATE_REAL,
ID_REAL_OF_MIN_DATE_REAL,
MAX_DATE_REAL,
ID_REAL_OF_MAXDATE_REAL
from
(
select NAME_REAL,
DATE_REAL as MIN_DATE_REAL,
ID_REAL as ID_REAL_OF_MIN_DATE_REAL,
from (
select
NAME_REAL,
ID_REAL,
DATE_REAL,
row_number() over (partition by NAME_REAL order by DATE_REAL asc) as date_order_asc
from MYREALTABLE
)
where date_order_asc = 1
) mn
inner join
(
select NAME_REAL,
DATE_REAL as MAX_DATE_REAL,
ID_REAL as ID_REAL_OF_MAX_DATE_REAL,
from (
select
NAME_REAL,
ID_REAL,
DATE_REAL,
row_number() over (partition by NAME_REAL order by DATE_REAL desc) as date_order_desc
from MYREALTABLE
)
where date_order_desc = 1
) mx
on mn.NAME_REAL = mx.NAME_REAL
You can join the two results into a single query result as follows
select o.NAME_REAL,o.max,o.id_real,t.min,o.id_real from (
select NAME_REAL,max(DATE_REAL) as max,ID_REAL, from MYREALTABLE group by NAME_REAL,ID_REAL)
as o inner join
(select NAME_REAL,min(DATE_REAL),ID_REAL from MYREALTABLE group by NAME_REAL,ID_REAL
) as t on o.NAME_REAL=t.NAME_REAL
Try the below -
select NAME_REAL,ID_REAL,max(DATE_REAL) as max_date, min(DATE_REAL) as min_date
from MYREALTABLE
group by NAME_REAL,ID_REAL

Finding top count of a value in a table using SQL

I'm looking for a way to find the top count value of a column by SQL.
If for example this is my data
id type
----------
1 A
1 B
1 A
2 C
2 D
2 D
I would like the result to be:
1 A
2 D
I'm looking for a way to do it without groping by the column I count (type in the example)
Thanks
Statistically, this is called the "mode". You can calculate it using window functions:
select id, type, cnt
from (select id, type, count(*) as cnt,
row_number() over (partition by id order by count(*) desc) as seqnum
from t
group by id, type
) t
where seqnum = 1;
If there are ties, then an arbitrary value is chosen from among the ties.
You are looking for the statistic mode (the most often ocurring value):
select id, stats_mode(type)
from mytable
group by id
order by id;
Not all DBMS support this however. Check your docs, wheher this function or a similar one is available in your DBMS.
Just GROUP BY id, type and keep the rows with the maximum counter:
select id, type
from tablename
group by id, type
having count(*) = (
select count(*) from tablename group by id, type order by count(*) desc limit 1
)
See the demo
Or
select id, type
from tablename
group by id, type
having count(*) = (
select max(t.counter) from (select count(*) counter from tablename group by id, type) t
)
See the demo

Select most recent status for each ID and department code

I have the following table:
I want to get the most recent status for each dept_code that a CL_ID has. So the desired output would be this:
I have tried the following but this give me just the most recent status for each client and not each of their dept_codes.
SELECT *
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT] C
INNER JOIN
(SELECT CLIENT_NUMBER, MAX(STATUS_DATE) AS SDATE
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT]
GROUP BY CLIENT_NUMBER) X
ON X.CLIENT_NUMBER = C.CLIENT_NUMBER
AND X.SDATE = C.STATUS_DATE
ORDER BY C.CLIENT_NUMBER
Any help would be much appreciated. Thanks.
A convenient method that works in SQL Server is:
select top (1) cl.*
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl
order by row_number() over (partition by cl_id, dept_code order by status_date desc);
A method that is efficient with the right indexes in almost any database is:
select cl.*
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl
where cl.status_date = (select max(cl2.status_date)
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl2
where cl2.cl_id = cl.cl_id and cl2.dept_code = cl.dept_code
);
The right index is on (cl_id, dept_code, status_date).
I would also use ROW_NUMBER, but with a subquery:
SELECT CL_ID, Status_date, Status, Dept_code
FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY CL_ID, Dept_code ORDER BY Status_date DESC) rn
FROM CIMSHR6_MERGED].[dbo].[C3CLSTAT]
) t
WHERE rn = 1;
1) Firstly group everything on Dept_Code,CL_ID and assign rank for each row with in the group in descending order.
2) Select all the rows with rnk=1 which would display your desired result.
SELECT Z.CL_ID,
Z.Status_Date,
Z.Status,
Z.Dept_Code
FROM
(
SELECT *,
RANK() OVER( PARTITION BY Dept_Code,CL_ID, ORDER BY Status_Date DESC ) AS rnk
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT]
) Z
WHERE Z.rnk = 1;
This would work for almost all databases
select * from c3clstat c
where exists
(select 1 from c3clstat c1
where c1.cl_id=c.cl_id
and c1.dept_code=c.dept_code
group by cl_id,dept_code
having c.status_date=max(c1.status_date)
)

How to select maximum no. of similar records from a table

I have two tables in Oracle
Flight :
-------------------------
fl_no | fl_date | fl_time
-------------------------
Passenger :
------------------------------------
ps_id | fl_no | ps_name | ps_address
------------------------------------
Now I need to show the flight that has most passenger. But I'm just not able to figure out the query. Can anyone help me?
You can use analytic functions (dense_rank) to reduce the query to one table scan: (to improve the speed at which the query runs)
select *
from (select fl_no,
num_pass,
dense_rank() over(order by num_pass desc) as rnk
from (select fl_no, count(*) as num_pass
from passenger
group by fl_no))
where rnk = 1
Fiddle: http://sqlfiddle.com/#!4/7a1d1/10/0
(notice how in the fiddle I made a tie between flights 1 and 3; both have exactly 3 passengers, the highest # of passengers in any flight)
As noted already, join into the flight table if you need additional fields returned from that table. For instance if you wanted to add on the date of each such flight with the most passengers you could run:
select x.*, f.fl_date
from (select fl_no,
num_pass,
dense_rank() over(order by num_pass desc) as rnk
from (select fl_no, count(*) as num_pass
from passenger
group by fl_no)) x
join flight f
on x.fl_no = f.fl_no
where rnk = 1
Fiddle: http://sqlfiddle.com/#!4/286bf/1/0
I would do this with order by and rownum:
select *
from (select fl_no, count(*) as cnt
from passenger p
group by fl_no
order by cnt desc
) f
where rownum = 1;
Although window functions can be used, they seem like overkill.
Note that in Oracle 12+, this can be simplified to:
select fl_no, count(*) as cnt
from passenger p
group by fl_no
order by cnt desc
fetch first 1 row only;
EDIT:
If you want all such flights, then analytic functions are the way to go. Here is another way to write such a query (Brian already has one method):
select *
from (select fl_no, count(*) as cnt, max(count(*)) over () as maxcnt
from passenger p
group by fl_no
order by cnt desc
) f
where cnt = maxcnt;

Column is invalid error when using derived table

I'm using ROW_NUMBER() and a derived table to fetch data from the derived table result.
However, I get the error message telling me I don't have the appropriate columns in the GROUP BY clause.
Here's the error:
Column 'tblCompetition.objID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause
What column am I missing? Or am I doing something else wrong? Find below the query that is not working, and the (more simple) query that is working.
SQL Server 2008.
Query that isn't working:
SELECT
objID,
objTypeID,
userID,
datAdded,
count,
sno
FROM
(
SELECT scc.objID,scc.objTypeID,scc.userID,scc.datAdded,
COUNT(sci.favID) as count,
ROW_NUMBER() OVER(PARTITION BY scc.userID ORDER BY scc.unqID DESC) as sno
FROM tblCompetition scc
LEFT JOIN tblFavourites sci
ON sci.favID = scc.objID
AND sci.datTimeStamp BETWEEN #datStart AND #datEnd
) as t
WHERE sno <= 2 AND objTypeID = #objTypeID
AND datAdded BETWEEN #datStart AND #datEnd
GROUP BY objID,objTypeID,userID,datAdded,count,sno
Simple query that is working:
SELECT objId,objTypeID,userId,datAdded FROM
(
SELECT objId,objTypeID,userId,datAdded,
ROW_NUMBER() OVER(PARTITION BY userId ORDER BY unqid DESC) as sno
FROM tblRdbCompetition
) as t
WHERE sno<=2 AND objtypeid=#objTypeID
AND datAdded BETWEEN #datStart AND #datEnd
Thank you!
you need the GROUP BY in your subquery since that's where the aggregate is:
SELECT
objID,
objTypeID,
userID,
datAdded,
count,
sno
FROM
(
SELECT scc.objID,scc.objTypeID,scc.userID,scc.datAdded,
COUNT(sci.favID) as count,
ROW_NUMBER() OVER(PARTITION BY scc.userID ORDER BY scc.unqID DESC) as sno
FROM tblCompetition scc
LEFT JOIN tblFavourites sci
ON sci.favID = scc.objID
AND sci.datTimeStamp BETWEEN #datStart AND #datEnd
GROUP BY scc.objID,scc.objTypeID,scc.userID,scc.datAdded) as t
WHERE sno <= 2 AND objTypeID = #objTypeID
AND datAdded BETWEEN #datStart AND #datEnd
You cannot have count in a group by clause. Infact the count is derived when you have other fields in group by. Remove count from your Group by.
In the innermost query you are using
COUNT(sci.favID) as count,
which is an aggregate, and you select other non-aggregating columns along with it.
I believe you wanted an analytic COUNT instead:
SELECT objID,
objTypeID,
userID,
datAdded,
count,
sno
FROM (
SELECT scc.objID,scc.objTypeID,scc.userID,scc.datAdded,
COUNT(sci.favID) OVER (PARTITION BY scc.userID ) AS count,
ROW_NUMBER() OVER (PARTITION BY scc.userID ORDER BY scc.unqID DESC) as sno
FROM tblCompetition scc
LEFT JOIN
tblFavourites sci
ON sci.favID = scc.objID
AND sci.datTimeStamp BETWEEN #datStart AND #datEnd
) as t
WHERE sno = 1
AND objTypeID = #objTypeID