Count most repeated value per group in Hive? - SQL

I am using Hive 0.14.0 on a Hortonworks Data Platform, on a big file similar to this input data:
tpep_pickup_datetime     pulocationid
2022-01-28 23:32:52.0    100
2022-02-28 23:02:40.0    202
2022-02-28 17:22:45.0    102
2022-02-28 23:19:37.0    102
2022-03-29 17:32:02.0    102
2022-01-28 23:32:40.0    101
2022-02-28 17:28:09.0    201
2022-03-28 23:59:54.0    100
2022-02-28 21:02:40.0    100
I want to find out the most common hour for each locationid, this being the result:
locationid   hour
100          23
101          17
102          17
201          17
202          23
I was thinking of using a partition command like this:
select * from (
select hour(tpep_pickup_datetime), pulocationid
(max (hour(tpep_pickup_datetime))) over (partition by pulocationid) as max_hour,
row_number() over (partition by pulocationid) as row_no
from yellowtaxi22
) res
where res.row_no = 1;
but it shows me this error:
SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: Invalid function pulocationid
Is there any other way of doing this?

with raw_groups as -- subquery syntax
(
select
struct(
count(time_stamp), -- must be first for max to use it to sort on
location.pulocationid ,
hour(time_stamp) as hour
) as mylocation -- create a struct to make max do the work for us
from
location
group by
location.pulocationid,
hour(time_stamp)
),
grouped_data as -- another subquery syntax based on `with`
(
select
max(mylocation) as location -- will pick max based on count(time_stamp)
from
raw_groups
group by
mylocation.pulocationid
)
select --format data into your requested format
location.pulocationid,
location.hour
from
grouped_data
I don't remember whether Hive 0.14 supports the with clause, but you could easily rewrite the query to not use it (by substituting each select in place of the table name). I just don't find it as readable:
select --format data into your requested format
location.pulocationid,
location.hour
from
(
select
max(mylocation) as location -- will pick max based on count(time_stamp)
from
(
select
struct(
count(time_stamp), -- must be first for max to use it to sort on
location.pulocationid ,
hour(time_stamp) as hour
) as mylocation -- create a struct to make max do the work for us
from
location
group by
location.pulocationid,
hour(time_stamp)
) raw_groups
group by
mylocation.pulocationid
) grouped_data
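To see why the max-on-struct trick works: structs compare field by field, in order, so putting count(time_stamp) first makes max() keep, per group, the struct with the highest count. For pulocationid 100 in the question's sample data, raw_groups would hold roughly these two structs:
struct(2, 100, 23)  -- hour 23 occurred twice
struct(1, 100, 21)  -- hour 21 occurred once
max() returns struct(2, 100, 23), and the final select extracts pulocationid 100 with hour 23.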

You were halfway there!
The idea was in the right direction; however, the syntax is a little bit off:
First, find the count for each hour:
select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) cnt from yellowtaxi22
group by pulocationid, hour(tpep_pickup_datetime)
Then add the row_number, but you need to order it by the total count in descending order:
select pulocationid, hour, cnt, row_number() over (partition by pulocationid order by cnt desc) as row_no from ...
Last but not least, take only the rows with the highest count (this could also be done with the max function rather than row_number, by the way; see the sketch after the full query below).
Or in total :
select pulocationid, hour from (
    select pulocationid, hour, cnt, row_number()
    over (partition by pulocationid order by cnt desc)
    as row_no from (
        select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) cnt from yellowtaxi22
        group by pulocationid, hour(tpep_pickup_datetime)) t1) t2
where row_no = 1
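As mentioned, here is a sketch of the max-based variant (assuming Hive accepts a window function over the grouped count; note it returns all tied hours rather than exactly one row per location):
select pulocationid, hour from (
    select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) cnt,
           max(count(*)) over (partition by pulocationid) as max_cnt
    from yellowtaxi22
    group by pulocationid, hour(tpep_pickup_datetime)) t
where cnt = max_cnt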

Related

Sum iteratively in SQL based on what value the next row has?

I want to aggregate a transaction table in a way that it keeps summing while checking a variable in the next row: it sums if the condition is met, otherwise it breaks and starts summing again, creating a new row. This might be confusing, so I'm adding an example below.
What I have -
ID   date        type       amount
a    1/1/2023    incoming   10
a    2/1/2023    incoming   10
a    3/1/2023    incoming   10
a    4/1/2023    incoming   10
a    5/1/2023    outgoing   20
a    6/1/2023    outgoing   10
a    7/1/2023    incoming   10
a    8/1/2023    incoming   10
a    9/1/2023    outgoing   30
a    10/1/2023   incoming   10
Summary Output I want -
ID   type       min_date    max_date    amount
a    incoming   1/1/2023    4/1/2023    40
a    outgoing   5/1/2023    6/1/2023    30
a    incoming   7/1/2023    8/1/2023    20
a    outgoing   9/1/2023    9/1/2023    30
a    incoming   10/1/2023   10/1/2023   10
Basically, keep summing as long as the next row has the same transaction type (after sorting on date); if it changes, create a new row and repeat the same process.
Thanks!
I tried approaches like window functions (dense_rank) and sum() over (partition by), but I am not getting the output I am looking for.
Using window functions is the correct approach: you need to identify when the type changes (one way is to use lag or lead) and then assign a group number to each set. See if the following gives your expected results:
with d as (
select *,
case when lag(type) over(partition by id order by date) = type then 0 else 1 end diff
from t
), grp as (
select *, Sum(diff) over(partition by id order by date) grp
from d
)
select Id, Type,
Min(date) Min_Date,
Max(Date) Max_Date,
Sum(Amount) Amount
from grp
group by Id, Type, grp
order by Min_Date;
See this example Fiddle
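To see how this works on the sample data: diff is 1 whenever the type differs from the previous row, and grp is the running sum of diff, so every consecutive run of the same type gets its own group number:
date        type       diff   grp
1/1/2023    incoming   1      1
2/1/2023    incoming   0      1
3/1/2023    incoming   0      1
4/1/2023    incoming   0      1
5/1/2023    outgoing   1      2
6/1/2023    outgoing   0      2
7/1/2023    incoming   1      3
8/1/2023    incoming   0      3
9/1/2023    outgoing   1      4
10/1/2023   incoming   1      5
Grouping by id, type, grp then collapses each run into one row of the requested output.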

Exclude group of records—if number ever goes up

I have a road inspection table:
INSPECTION_ID ROAD_ID INSP_DATE CONDITION_RATING
--------------------- ------------- --------- ----------------
506411 3040 01-JAN-81 15
508738 3040 14-APR-85 15
512461 3040 22-MAY-88 14
515077 3040 17-MAY-91 14 -- all ok
505967 3180 01-MAY-81 11
507655 3180 13-APR-85 9
512374 3180 11-MAY-88 17 <-- goes up; NOT ok
515626 3180 25-APR-91 16.5
502798 3260 01-MAY-83 14
508747 3260 13-APR-85 13
511373 3260 11-MAY-88 12
514734 3260 25-APR-91 12 -- all ok
I want to write a query that will exclude the entire road if the road's condition ever goes up over time. For example, exclude road 3180, since the condition goes from 9 to 17 (an anomaly).
Question:
How can I do that using Oracle SQL?
Sample data: db<>fiddle
Here's one option:
find the "next" condition_rating value (within the same road_id - that's the partition by clause - sorted by insp_date)
return any road_id where the "current" condition_rating minus the "next" one is less than zero, i.e. the rating went up; for road 3180, the 13-APR-85 row has condition_rating 9 and next_cr 17, so 9 - 17 < 0 flags the road
SQL> with temp as
2 (select road_id,
3 condition_rating,
4 nvl(lead(condition_rating) over (partition by road_id order by insp_date),
5 condition_rating) next_cr
6 from test
7 )
8 select distinct road_id
9 from temp
10 where condition_rating - next_cr < 0;
ROAD_ID
----------
3180
SQL>
Based on the OP's own answer, which makes the expected outcome clearer.
In my permanent urge to avoid self-joins, I'd go for nested window functions:
SELECT road_id, condition_rating, insp_date
FROM ( SELECT prev.*
, COUNT(CASE WHEN condition_rating < next_cr THEN 1 END) OVER(PARTITION BY road_id) bad
FROM (select t.*
, lead(condition_rating) over (partition by road_id order by insp_date) next_cr
from t
) prev
) tagged
WHERE bad = 0
ORDER BY road_id, insp_date
NOTE
lead() returns null for the last row; the query handles this in the case expression that marks bad rows: if next_cr is null, condition_rating < next_cr is not true, so the case maps that row as "not bad".
The case is just there to mimic the filter clause: https://modern-sql.com/feature/filter
MATCH_RECOGNIZE might be another option for this problem, but due to the lack of '^' and '$' I'm worried that the backtracking might cause more problems than it is worth.
Nested window functions are typically no big performance hit if they use compatible OVER clauses, as in this query.
Here's an answer that's similar to #Littlefoot's answer:
with insp as (
select
road_id,
condition_rating,
insp_date,
case when condition_rating > lag(condition_rating,1) over(partition by road_id order by insp_date) then 'Y' end as condition_goes_up
from
test_data
)
select
insp.*
from
insp
left join
(
select distinct
road_id,
condition_goes_up
from
insp
where
condition_goes_up = 'Y'
) insp_flag
on insp.road_id = insp_flag.road_id
where
insp_flag.condition_goes_up is null
--Note: I removed the ORDER BY, because I think the window function already orders the rows the way I want.
db<>fiddle
Edit:
Here's a version that's similar to what #Markus Winand did:
with insp as (
select
road_id,
condition_rating,
insp_date,
case when condition_rating > lag(condition_rating,1) over(partition by road_id order by insp_date) then 'Y' end as condition_goes_up
from
test_data
)
select
insp_tagged.*
from
(
select
insp.*,
count(condition_goes_up) over(partition by road_id) as condition_goes_up_count
from
insp
) insp_tagged
where
condition_goes_up_count = 0
I ended up going with that option.
db<>fiddle

Calculate the first and second highest value at every row and the average of both in Snowflake SQL

I have a table with the following schema:
uid   visit name   visit date   sales quantity
xyz   visit 1      2020-01-01   29
xyz   visit 2      2020-01-03   250
xyz   visit 3      2020-01-04   20
xyz   visit 4      2020-01-27   21
abc   visit 1      2020-02-01   29
abc   visit 2      2020-03-03   34
abc   visit 3      2020-04-04   35
abc   visit 4      2020-04-27   41
base table sales
Each unique id has a few visits that repeat for every unique id. At every visit, I have to calculate the two highest sales quantities per user across their prior visits (in ascending order), up until but excluding the current row's visit.
The output would be the same table plus these columns:
max sale
2nd max sale
avg of both max sales
output table
I have used window functions for the maximum value, but I am struggling to get the second highest value of sales for every user at every row. Is this doable using SQL? If so, what would the script look like?
Update: I re-wrote my answer, because the previous one ignored certain requirements.
To keep track of the 2 previous top values, you can write a UDTF in JS to hold that ranking:
create or replace function udtf_top2_before(points float)
returns table (output_col array)
language javascript
as $$
{
processRow: function f(row, rowWriter, context){
rowWriter.writeRow({OUTPUT_COL: this.prevmax.slice().reverse()});
this.prevmax.push(row.POINTS);
// silly js sort https://stackoverflow.com/a/21595293/132438
this.prevmax = this.prevmax.sort(function (a, b) {return a - b;}).slice(-2);
}
, initialize: function(argumentInfo, context) {
this.prevmax = [];
}
}
$$;
Then that tabular UDF will give you the numbers as expected:
with data as (
select v:author::string author, v:score::int score, v:subreddit, v:created_utc::timestamp ts
from reddit_comments_sample
where v:subreddit = 'wallstreetbets'
)
select author, score, ts
, output_col[0] prev_max
, output_col[1] prev_max2
, (prev_max+ifnull(prev_max2,prev_max))/2 avg
from (
select author, score, ts, output_col
from data, table(udtf_top2_before(score::float) over(partition by author order by ts))
order by author, ts
limit 100
)
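Mapped onto the question's schema, the call would look roughly like this (a sketch; the table name sales and the columns uid, visit_date, sales_quantity are stand-ins for the real names):
select uid, visit_date, sales_quantity
     , output_col[0] as max_sale
     , output_col[1] as second_max_sale
     , (max_sale + ifnull(second_max_sale, max_sale)) / 2 as avg_of_both
from sales, table(udtf_top2_before(sales_quantity::float)
                  over (partition by uid order by visit_date))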
UDTF based on my previous post:
https://towardsdatascience.com/sql-puzzle-optimization-the-udtf-approach-for-a-decay-function-4b4b3cdc8596
Previously:
You can use row_number() over() to select the top 2, and then pivot with an array_agg():
with data as (
select v:author author, v:score::int score, v:subreddit, v:created_utc::timestamp ts
from reddit_comments_sample
where v:subreddit = 'wallstreetbets'
)
select author, arr[0] max_score, arr[1] max_score_2, (max_score+ifnull(max_score_2,max_score))/2 avg
from (
select author
, array_agg(score) within group (order by score::int desc) arr
from (
select author, score, ts
from data
qualify row_number() over(partition by author order by score desc) <= 2
)
group by 1
)
order by 4 desc

How to Select Last Entry If None On Input Date SQL

I have a set of data that I pull from a table by passing in a user-input date.
What I am trying to do is set up logic as follows:
Here is the original table:
User DATE_TIME Value
HH1 5/20/2018 1:00 50
HH1 5/20/2018 10:00 50
HH1 5/20/2018 18:00 120
HH1 5/25/2018 12:00 10
HH1 5/26/2018 10:00 15
User passes 05/20/2018 into the sql query for DATE_TIME
The output is as follows:
User DATE_TIME Value
HH1 5/20/2018 1:00 50
HH1 5/20/2018 10:00 50
HH1 5/20/2018 18:00 120
Now the user passes 05/21/2018 into DATE_TIME
Result is nothing
What I am trying to accomplish is...if there are no results on the current day, then pull the latest value in the database, in this case:
User DATE_TIME Value
HH1 5/20/2018 18:00 120
I am not sure how to find this most recent value.
Any help would be appreciated, thanks!
In my solution below, I named the base table TBL and I changed the name of the first column to USR since "user" is an Oracle keyword.
Most of the work is done in the subquery. Look at the FROM clause first: I cross-join to a small subquery that creates an actual date from the input (assumed to be given as a string in MM/DD/YYYY format). Depending on your application, you may be able to input a date directly and not have to convert to date in the query. One way or another, you should be able to use the input date.
The WHERE clause limits the rows to dates up to the input date given (with that date included). Then we rank the rows in descending order, but with a modification we make first: If there are any rows on the input date, their time-of-day component is truncated to zero. (The DATE_TIME value is replaced with the input date for those rows only.) So if there are any rows for the input date, all the rows for that date will get rank = 1, and all other rows will get higher ranks. However, if there are no rows for the input date, then the most recent row before that date will get rank = 1 (and only that row, assuming there are no duplicates in the DATE_TIME column).
So then the job of the outer query is easy: keep only the row OR ROWS where rank = 1. This means either ALL the rows for the input date (if there were any), or the single most recent row before that date.
The subquery-outer query structure cannot be avoided, because the WHERE clause is evaluated before the ranks can be calculated in the same query. If we want the WHERE clause to reference ranks, it must be in an outer query (at a higher level than the query that generates the ranks).
The query should be efficient, since the optimizer is smart enough to see we only want rank = 1 in the outer query, so it will not actually compute ALL the ranks (it will not fully order the subquery rows by DATE_TIME). And, of course, if there is an index on DATE_TIME, it will be used.
You didn't say anything about the role played by USR. If it plays no role, then the solution should work as-is. If you also input a USR, then add that filter in the WHERE clause of the subquery. If you need a result for each USR separately, for the same input date, add PARTITION BY USR in the analytic clause of the RANK() function (sketched after the query below).
select usr, date_time, value
from (
select usr, date_time, value,
rank() over (order by case when date_time >= input_date then input_date
else date_time end desc) as rnk
from tbl cross join
(select to_date(:input_date, 'mm/dd/yyyy') as input_date from dual)
where date_time < input_date + 1
)
where rnk = 1
;
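For example, the per-USR variant mentioned above would be the same query with the partition added (a sketch, untested):
select usr, date_time, value
from (
    select usr, date_time, value,
           rank() over (partition by usr
                        order by case when date_time >= input_date then input_date
                                      else date_time end desc) as rnk
    from tbl cross join
         (select to_date(:input_date, 'mm/dd/yyyy') as input_date from dual)
    where date_time < input_date + 1
)
where rnk = 1;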
The query below makes use of the window function ROW_NUMBER with PARTITION BY and ORDER BY. It combines (UNION ALL) the records equal to the input date with the record holding the max date, as a fallback if none match. rownum = 1 and rnk = 1 together keep only the first record: if some rows match the user's date, the max-date row sorts to the bottom and is not selected; if the date is not found, the max-date row is the only record and thus is shown. See the demo at http://sqlfiddle.com/#!4/931c2/1.
SELECT usr as "user", date_time, valu as "value"
FROM (
SELECT usr, date_time, valu,
row_number() over (partition by trunc(date_time)
order by date_time desc) as rnk
FROM TEST
WHERE to_char(DATE_TIME, 'yyyy-mm-dd') = '2018-05-21'
UNION ALL
SELECT usr, date_time, valu,
row_number() over (order by date_time desc) as rnk
FROM TEST) t
WHERE rnk = 1
AND rownum = 1;
Ugly as hell, and who knows how it'll behave on large tables (probably not very well), but it kind of works on a simple case (based on Oracle).
SQL> create table test (usern varchar2(3), date_time date, hour number, value number);
Table created.
SQL> insert into test
2 (select 'HH1', date '2018-05-20', 1, 50 from dual union
3 select 'HH1', date '2018-05-20', 10, 50 from dual union
4 select 'HH1', date '2018-05-20', 18, 120 from dual union
5 select 'HH1', date '2018-05-25', 12, 10 from dual union
6 select 'HH1', date '2018-05-26', 10, 15 from dual
7 );
5 rows created.
SQL>
Testing:
SQL> select *
2 from test t
3 where t.date_time = (select max(t1.date_time) from test t1
4 where t1.usern = t.usern
5 and t1.date_time <= to_date('&&date_you_enter', 'dd.mm.yyyy')
6 )
7 and t.hour = (select max(case when e.it_exists = 'Y' then t.hour
8 else t2.hour
9 end)
10 from test t2 left join
11 (select t3.usern, 'Y' it_exists
12 from test t3
13 where t3.date_time = to_date('&&date_you_enter', 'dd.mm.yyyy')
14 ) e on e.usern = t2.usern
15 )
16 order by t.date_time, t.hour;
Enter value for date_you_enter: 20.05.2018
USE DATE_TIME HOUR VALUE
--- ---------- ---------- ----------
HH1 20.05.2018 1 50
HH1 20.05.2018 10 50
HH1 20.05.2018 18 120
SQL> undefine date_you_enter
SQL> /
Enter value for date_you_enter: 21.05.2018
USE DATE_TIME HOUR VALUE
--- ---------- ---------- ----------
HH1 20.05.2018 18 120
SQL> undefine date_you_enter
SQL> /
Enter value for date_you_enter: 25.05.2018
USE DATE_TIME HOUR VALUE
--- ---------- ---------- ----------
HH1 25.05.2018 12 10
SQL>
It would be helpful if you posted your SQL statement. But try this:
Select
column1, column2
FROM
YourTable
Where Date <= #dateInputByUserGoesHere
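If what's wanted is just the single most recent row on or before that date, a hedged completion of this idea (Oracle 12c+ FETCH FIRST syntax, since the other answers assume Oracle; note it returns one row, not all rows of a matching day):
Select usr, date_time, value
FROM tbl
Where date_time <= :input_date
ORDER BY date_time DESC
FETCH FIRST 1 ROW ONLY;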

Joining next Sequential Row

I am planning an SQL statement right now and would need someone to look over my thoughts.
This is my Table:
id stat period
--- ------- --------
1 10 1/1/2008
2 25 2/1/2008
3 5 3/1/2008
4 15 4/1/2008
5 30 5/1/2008
6 9 6/1/2008
7 22 7/1/2008
8 29 8/1/2008
Create Table
CREATE TABLE tbstats
(
id INT IDENTITY(1, 1) PRIMARY KEY,
stat INT NOT NULL,
period DATETIME NOT NULL
)
go
INSERT INTO tbstats
(stat,period)
SELECT 10,CONVERT(DATETIME, '20080101')
UNION ALL
SELECT 25,CONVERT(DATETIME, '20080102')
UNION ALL
SELECT 5,CONVERT(DATETIME, '20080103')
UNION ALL
SELECT 15,CONVERT(DATETIME, '20080104')
UNION ALL
SELECT 30,CONVERT(DATETIME, '20080105')
UNION ALL
SELECT 9,CONVERT(DATETIME, '20080106')
UNION ALL
SELECT 22,CONVERT(DATETIME, '20080107')
UNION ALL
SELECT 29,CONVERT(DATETIME, '20080108')
go
I want to calculate the difference between each statistic and the next, and then calculate the mean value of the 'gaps.'
Thoughts:
I need to join each record with its subsequent row. I can do that using the ever-flexible joining syntax, thanks to the fact that I know the id field is an integer sequence with no gaps.
By aliasing the table I could incorporate it into the SQL query twice, then join them together in a staggered fashion by adding 1 to the id of the first aliased table. The first record in the table has an id of 1. 1 + 1 = 2 so it should join on the row with id of 2 in the second aliased table. And so on.
Now I would simply subtract one from the other.
Then I would use the ABS function to ensure that I always get positive integers as a result of the subtraction regardless of which side of the expression is the higher figure.
Is there an easier way to achieve what I want?
The lead analytic function should do the trick:
SELECT period, stat, stat - LEAD(stat) OVER (ORDER BY period) AS gap
FROM tbstats
The average value of the gaps can be done by calculating the difference between the first value and the last value and dividing by one less than the number of elements:
select sum(case when seqnum = num then stat else - stat end) / (max(num) - 1)
from (select period, stat, row_number() over (order by period) as seqnum,
             count(*) over () as num
      from tbstats
     ) t
where seqnum = num or seqnum = 1;
Of course, you can also do the calculation using lead(), but this will also work in SQL Server 2005 and 2008.
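For reference, the lead() version might look like this (a sketch for SQL Server 2012+; avg() ignores the null that lead() produces for the last row, and the * 1.0 avoids integer division):
select avg(gap * 1.0) as avg_gap
from (select stat - lead(stat) over (order by period) as gap
      from tbstats
     ) t;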
You can also achieve this by using a join:
SELECT t1.period,
t1.stat,
t1.stat - t2.stat gap
FROM #tbstats t1
LEFT JOIN #tbstats t2
ON t1.id + 1 = t2.id
To calculate the difference between each statistic and the next, LEAD() and LAG() may be the simplest option. You provide an ORDER BY, and LEAD(something) returns the next something and LAG(something) returns the previous something in the given order.
select
x.id thisStatId,
LAG(x.id) OVER (ORDER BY x.id) lastStatId,
x.stat thisStatValue,
LAG(x.stat) OVER (ORDER BY x.id) lastStatValue,
x.stat - LAG(x.stat) OVER (ORDER BY x.id) diff
from tbStats x
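Building on that diff column, the mean of the absolute gaps the question describes could be computed with one more level of nesting (a sketch):
select avg(abs(diff) * 1.0) as mean_gap
from (select x.stat - lag(x.stat) over (order by x.id) as diff
      from tbStats x
     ) d;
For the sample data, the absolute gaps are 15, 20, 10, 15, 21, 13, and 7, so the mean is 101 / 7 ≈ 14.43.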