HiveQL/Hive: Find the most common value in a column - sql

DATE WindDirection
1/1/2000 SW
1/2/2000 SW
1/3/2000 SW
1/4/2000 NW
1/5/2000 NW
The question: every day is unique and wind direction is not, so now we are trying to get the COUNT of the most COMMON wind direction.
My query was
SELECT Wind_Direction,COUNT(Wind_Direction) FROM Weather
GROUP BY DISTINCT(Wind_Direction);
The logic is to find the DISTINCT wind directions (there are about 7), then group by wind direction and apply a count.

Group by direction to count the occurrences of each one, order by the number of occurrences descending, and LIMIT 1 to get the most common one:
select w.wind_direction as most_common_wd
from (
    select wind_direction, count(*) as cnt
    from weather
    group by wind_direction
    order by cnt desc
) w
limit 1;

You could also try to express your logic using Hive analytic functions:
with q1 as (
    select wind_direction,
           count(wind_direction) over (partition by wind_direction) as total_counts
    from weather
)
select distinct wind_direction, total_counts from q1;
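Since the question also asks for the count of the most common direction, here is a minimal sketch (reusing the weather table and wind_direction column assumed in the answers above) that returns both the direction and how often it occurs:
-- most common wind direction together with its count
SELECT wind_direction, COUNT(*) AS cnt
FROM weather
GROUP BY wind_direction
ORDER BY cnt DESC
LIMIT 1;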

Related

Count query with timestamp value

I would like to create a count query (in Postgres) which counts data.data_name depending on data.todb_date.
So what I want is for the query to count all the rows that satisfy the date condition in the WHERE clause. I tried COUNT(data.data_name) and COUNT(*), but they didn't work.
My planned result looks like this:
todb_date    data.data_name  count
2016-01-01   test1           150
2017-01-01   test1           130
This is the query I have tried:
SELECT data.data_name, parentdata.data_id,
data.data_id, parentdata.todb_date,
COUNT (data.data_name)
FROM parentdata, data
WHERE parentdata.data_id = data.data_id
AND parentdata.todb_date > '2016-01-01'
GROUP BY parentdata.data_id, data.data_id, data.data_name, parentdata.todb_date
As @Usagi Miyamoto suggested, you should use the date_trunc() function to group your results into the time buckets you want (here: per year). Your query groups by the raw todb_date, so every distinct timestamp becomes its own group; truncating the date first fixes that:
SELECT d.data_name nam, date_trunc('year',p.todb_date) yr, COUNT(*) cnt
FROM parentdata p
INNER JOIN data d ON p.data_id = d.data_id AND p.todb_date > '2016-01-01'
GROUP BY d.data_name,date_trunc('year',p.todb_date)
ORDER BY nam, yr
If you replace 'year' with 'day' you will get daily counts.
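For example, the daily version would look like this (same tables and join condition as above, only the date_trunc granularity changes):
SELECT d.data_name AS nam, date_trunc('day', p.todb_date) AS dy, COUNT(*) AS cnt
FROM parentdata p
INNER JOIN data d ON p.data_id = d.data_id AND p.todb_date > '2016-01-01'
GROUP BY d.data_name, date_trunc('day', p.todb_date)
ORDER BY nam, dy;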

Oracle SQL query, getting the maximum of a sum

Hey guys, I'm struggling to solve one query and just can't get my head around it.
Basically, I have some tables from a data mart:
DimTheatre(TheatreId(PK), TheatreNo, Name, Address, MainTel);
DimTrow(TrowId(PK), TrowNo, RowName, RowType);
DimProduction(ProductionId(PK), ProductionNo, Title, ProductionDir, PlayAuthor);
DimTime(TimeId(PK), Year, Month, Day, Hour);
TicketPurchaseFact(TheatreId(FK), TimeId(FK), TrowId(FK), PId(FK), TicketAmount);
The thing I'm trying to achieve in Oracle is to retrieve the most popular row type in each theatre by value of ticket sales.
What I'm doing now is:
SELECT dthr.theatreid, dthr.name,
       max(tr.rowtype) keep (dense_rank last order by tpf.ticketamount),
       sum(tpf.ticketamount) TotalSale
FROM TicketPurchaseFact tpf, DimTheatre dthr, DimTrow tr
WHERE dthr.theatreid = tpf.theatreid
GROUP BY dthr.theatreid, dthr.name;
It does give me output, but the TotalSale column is totally out of place: the numbers are much higher than they should be. How could I approach this issue?
I am not sure how MAX() KEEP () would help your case if I understand the problem correctly. Note also that your query never joins DimTrow to the fact table, so every fact row is matched against every row of DimTrow, which inflates TotalSale. The approach below should work:
SELECT x.theatreid, x.name, x.rowtype, x.total_sale
FROM (
    SELECT z.theatreid, z.name, z.rowtype, z.total_sale,
           DENSE_RANK() OVER (PARTITION BY z.theatreid, z.name ORDER BY z.total_sale DESC) AS popular_row_rank
    FROM (
        SELECT dthr.theatreid, dthr.name, tr.rowtype, SUM(tpf.ticketamount) AS total_sale
        FROM TicketPurchaseFact tpf, DimTheatre dthr, DimTrow tr
        WHERE dthr.theatreid = tpf.theatreid AND tr.trowid = tpf.trowid
        GROUP BY dthr.theatreid, dthr.name, tr.rowtype
    ) z
) x
WHERE x.popular_row_rank = 1;
You want the row type per theatre with the highest ticket amount. So join purchases and rows, then aggregate to get the total per row type. Use RANK to rank the row types per theatre and keep only the best-ranked ones. Finally, join with the theatre table to get the theatre name.
select
    theatreid,
    t.name,
    tr.rowtype
from (
    select
        p.theatreid,
        r.rowtype,
        rank() over (partition by p.theatreid order by sum(p.ticketamount) desc) as rn
    from ticketpurchasefact p
    join dimtrow r using (trowid)
    group by p.theatreid, r.rowtype
) tr
join dimtheatre t using (theatreid)
where tr.rn = 1;
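If you prefer to stay with the MAX() ... KEEP (DENSE_RANK ...) style from the original attempt, it can still work once the ticket amounts are summed per row type first. A sketch under the same schema assumptions (note the extra join to DimTrow, which the original query was missing):
SELECT theatreid, name,
       MAX(rowtype) KEEP (DENSE_RANK LAST ORDER BY total_sale) AS most_popular_rowtype,
       MAX(total_sale) AS best_rowtype_sale
FROM (
    -- total sales per theatre and row type
    SELECT dthr.theatreid, dthr.name, tr.rowtype, SUM(tpf.ticketamount) AS total_sale
    FROM TicketPurchaseFact tpf
    JOIN DimTheatre dthr ON dthr.theatreid = tpf.theatreid
    JOIN DimTrow tr ON tr.trowid = tpf.trowid
    GROUP BY dthr.theatreid, dthr.name, tr.rowtype
)
GROUP BY theatreid, name;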

Can we modify the previous row and use it in current row in a SQL query for a list?

I've looked around and found a few posts with LAG() and running-total type queries, but none seem to fit what I'm looking for. Maybe I'm not using the right search terms, or maybe I'm overcomplicating the situation. I hope someone can help me out.
What I'm looking to do is take the previous result and multiply it by the current row's value for a range of dates. The starting point is always some base number; let's use 10 to keep it simple. The real values will be floats, but I kept them as round numbers here to better explain my question.
The first table shows the calculation and the second table below shows what the result should look like in the end.
date      val1  calc_result
20120930  null  10
20121031  2     10*2=20
20121130  3     20*3=60
20121231  1     60*1=60
20130131  2     60*2=120
20130228  1     120*1=120
The query would return
20120930 10
20121031 20
20121130 60
20121231 60
20130131 120
20130228 120
I'm trying to see whether this can be done in a single query, or whether PL/SQL tables/cursors would need to be used.
Any help would be appreciated.
You can do this with a recursive CTE:
with dates as (
    select t.*, row_number() over (order by date) as seqnum
    from t
),
cte as (
    select t.date, t.val1, 10 as calc_result
    from dates t
    where t.seqnum = 1
    union all
    select t.date, t.val1, cte.calc_result * t.val1
    from cte join
         dates t
         on t.seqnum = cte.seqnum + 1
)
select cte.date, cte.calc_result
from cte
order by cte.date;
This is calculating a cumulative product. You can do it with some exponential arithmetic. Replace 10 in the query with the desired start value.
select date, val1,
       case when row_number() over (order by date) = 1 then 10  -- set start value for the first row
            else 10 * exp(sum(ln(val1)) over (order by date))   -- running product via exp of summed logs (requires val1 > 0)
       end as res
from tbl
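A quick way to check this against the sample data from the question is a sketch like the following (the column names dt and val1 are assumptions; dt is used because DATE itself is a reserved word in Oracle):
-- hypothetical test data matching the example in the question
with tbl as (
    select date '2012-09-30' as dt, cast(null as number) as val1 from dual union all
    select date '2012-10-31', 2 from dual union all
    select date '2012-11-30', 3 from dual union all
    select date '2012-12-31', 1 from dual union all
    select date '2013-01-31', 2 from dual union all
    select date '2013-02-28', 1 from dual
)
select dt, val1,
       case when row_number() over (order by dt) = 1 then 10   -- base value for the first row
            else 10 * exp(sum(ln(val1)) over (order by dt))    -- running product of val1
       end as res
from tbl;
-- returns 10, 20, 60, 60, 120, 120, matching the expected output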

How do I get the top 10 results of a query?

I have a postgresql query like this:
with r as (
select
1 as reason_type_id,
rarreason as reason_id,
count(*) over() count_all
from
workorderlines
where
rarreason != 0
and finalinsdate >= '2012-12-01'
)
select
r.reason_id,
rt.desc,
count(r.reason_id) as num,
round((count(r.reason_id)::float / (select count(*) as total from r) * 100.0)::numeric, 2) as pct
from r
left outer join
rtreasons as rt
on
r.reason_id = rt.rtreason
and r.reason_type_id = rt.rtreasontype
group by
r.reason_id,
rt.desc
order by r.reason_id asc
This returns a table of results with 4 columns: the reason id, the description associated with that reason id, the number of entries having that reason id, and the percent of the total that number represents.
What I would like to do is only display the top 10 results based off the total number of entries having a reason id. However, whatever is leftover, I would like to compile into another row with a description called "Other". How would I do this?
with r2 as (
    -- ...everything before the select list...
    dense_rank() over (order by count(r.reason_id) desc) as cause_rank
    -- ...the rest of your query...
)
select * from r2 where cause_rank < 11
union
select
    NULL as reason_id,
    'Other' as "desc",
    sum(r2.num) over () as num,
    sum(r2.pct) over () as pct,
    11 as cause_rank
from r2
where cause_rank >= 11
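Spelled out against the query in the question, that pattern looks roughly like this (a sketch: rt.desc is aliased as reason_desc so it can be referenced from the outer query, and ranking by the count assumes "top 10" means the largest num):
with r as (
    select
        1 as reason_type_id,
        rarreason as reason_id,
        count(*) over () as count_all
    from workorderlines
    where rarreason != 0
      and finalinsdate >= '2012-12-01'
),
ranked as (
    select
        r.reason_id,
        rt.desc as reason_desc,
        count(r.reason_id) as num,
        round((count(r.reason_id)::float / (select count(*) from r) * 100.0)::numeric, 2) as pct,
        dense_rank() over (order by count(r.reason_id) desc) as cause_rank
    from r
    left outer join rtreasons rt
        on r.reason_id = rt.rtreason
       and r.reason_type_id = rt.rtreasontype
    group by r.reason_id, rt.desc
)
select reason_id, reason_desc, num, pct from ranked where cause_rank <= 10
union all
select null, 'Other', sum(num), sum(pct) from ranked where cause_rank > 10;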
As said above, use LIMIT for the top 10, and use OFFSET to skip those rows and get the rest.
Not sure about Postgres, but SELECT TOP 10 ... should do the trick if you sort correctly.
However, about the second part: you could use a RIGHT JOIN for this. Join the top-10 result with the whole table and use only the records not appearing on the left side. If you calculate the sum of those you should get your "sum of the rest" result.
I assume that vw_my_top_10 is the view showing you the top 10 records and vw_all_records shows all records (including the top 10).
Like this:
SELECT SUM(a_field)
FROM vw_my_top_10
RIGHT JOIN vw_all_records
ON (vw_my_top_10.Key = vw_all_records.Key)
WHERE vw_my_top_10.Key IS NULL

SQL query count divided by a distinct count of same query

Having some trouble with some SQL.
Take the following result for instance:
LOC_CODE CHANNEL
------------ --------------------
3ATEST-01 CHAN2
3ATEST-01 CHAN3
3ATEST-02 CHAN4
What I need to do is get a count of the above query, grouped by channel, but I want that count to be divided by the number of times the LOC_CODE appears.
Example of the result I am after is:
CHANNEL COUNT
---------------- ----------
CHAN2 0.5
CHAN3 0.5
CHAN4 1
The explanation of the above is that CHAN2 appears next to "3ATEST-01", but that LOC_CODE ("3ATEST-01") appears twice, so the count should be divided by 2.
I know I can do this by basically duplicating the query with a distinct count, but the underlying query is quite complex and I don't really want to hurt performance.
Please let me know if you would like more information!
Try:
select channel,
count(*) over (partition by channel, loc_code)
/ count(*) over (partition by loc_code) as count_ratio
from my_table
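Note that this returns one row per row of the underlying table; to collapse the result to one row per channel, as in the expected output, you can add DISTINCT (a sketch, assuming each channel maps to a single LOC_CODE as in the sample data):
select distinct channel,
       count(*) over (partition by channel, loc_code)
       / count(*) over (partition by loc_code) as count_ratio  -- in databases with integer division, cast one count to a decimal type
from my_table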
SELECT t.CHANNEL, COUNT(*) / gr.TotalCount
FROM my_table t JOIN (
SELECT LOC_CODE, COUNT(*) TotalCount
FROM my_table
GROUP BY LOC_CODE
) gr USING(LOC_CODE)
GROUP BY t.LOC_CODE, t.CHANNEL
Create an index on (LOC_CODE, CHANNEL).
If there are no duplicate channels, replace COUNT(*) / gr.TotalCount with 1 / gr.TotalCount and remove the GROUP BY clause.
First, find a query that gets you the correct results. Then see if it can be optimised. My guess is that it's hard to optimise, as you require two different groupings: one per CHANNEL and one per LOC_CODE.
I'm not even sure that this fits your description:
SELECT t.CHANNEL
, COUNT(*) / SUM(grp.TotalCount)
FROM my_table t
JOIN
( SELECT LOC_CODE
, COUNT(*) TotalCount --- or is it perhaps?:
--- COUNT(DISTINCT CHANNEL)
FROM my_table
GROUP BY LOC_CODE
) grp
ON grp.LOC_CODE = t.LOC_CODE
GROUP BY t.CHANNEL
Your requirements are still a bit unclear to me when it comes to duplicate CHANNELs, but this should work if you want to group on both CHANNEL and LOC_CODE to sum up later:
SELECT L1.CHANNEL, 1/COUNT(L2.LOC_CODE)
FROM Locations L1
LEFT JOIN Locations L2 ON L1.LOC_CODE = L2.LOC_CODE
GROUP BY L1.CHANNEL, L1.LOC_CODE