Spark SQL - Finding the maximum value of a month per year

I have created a data frame that contains Year, Month, and the number of incidents (count).
I want to find, for each year, the month that had the most incidents, using Spark SQL.

You can use window functions:
select *
from (select t.*, rank() over(partition by year order by cnt desc) rn from mytable t) t
where rn = 1
For each year, this gives you the row with the greatest cnt. If there are ties, the query returns all of them.
Note that count is a reserved keyword in SQL (the aggregate function), so it is not a good choice for a column name. I renamed it to cnt in the query.
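As a concrete check of the rank() approach, here is a minimal sketch run against SQLite (3.25+, which supports the same window functions) as a stand-in for Spark SQL; the table name follows the answer, and the sample data is invented for illustration:

```python
# Demonstrate rank() per year: ties for the top count are all returned.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (year INT, month INT, cnt INT);
INSERT INTO mytable VALUES
  (2019, 1, 10), (2019, 2, 25), (2019, 3, 25),  -- tie in 2019
  (2020, 1, 40), (2020, 2, 15);
""")

rows = conn.execute("""
SELECT year, month, cnt
FROM (SELECT t.*, RANK() OVER (PARTITION BY year ORDER BY cnt DESC) AS rn
      FROM mytable t) t
WHERE rn = 1
ORDER BY year, month
""").fetchall()
print(rows)  # [(2019, 2, 25), (2019, 3, 25), (2020, 1, 40)]
```

Both tied months of 2019 appear in the result, as described above.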

You can use window functions, if you want to use SQL:
select t.*
from (select t.*,
row_number() over (partition by year order by count desc) as seqnum
from t
) t
where seqnum = 1;
This returns one row per year, even if there are ties for the maximum count. If you want all such rows in the event of ties, then use rank() instead of row_number().
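The row_number() vs. rank() distinction can be verified with a small sketch, again using SQLite (3.25+) as a stand-in for Spark SQL; the table and data are invented, and the column is named cnt to sidestep the count keyword:

```python
# With a tie for the top count, row_number() keeps one row, rank() keeps all.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (year INT, month INT, cnt INT);
INSERT INTO t VALUES (2021, 5, 30), (2021, 6, 30), (2021, 7, 12);
""")

one_row = conn.execute("""
SELECT year, month
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY year ORDER BY cnt DESC) AS seqnum
      FROM t) t
WHERE seqnum = 1
""").fetchall()

all_ties = conn.execute("""
SELECT year, month
FROM (SELECT t.*,
             RANK() OVER (PARTITION BY year ORDER BY cnt DESC) AS seqnum
      FROM t) t
WHERE seqnum = 1
""").fetchall()

print(len(one_row), len(all_ties))  # 1 2
```

Note that which of the tied rows row_number() picks is not deterministic unless you add a tiebreaker to the order by.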

Related

sql query to obtain most number of names by year

I have a sample dataframe below that is over 500k rows:
|year|name|text|id|
|2001|foog|ltgn|01|
|2001|goof|ltg4|02|
|2002|tggr|ltg5|03|
|2002|wwwe|ltg6|04|
|2004|frgr|ltg7|05|
|2004|ggtg|ltg8|06|
|2003|hhyy|lt9n|07|
|2003|jjuu|l2gn|08|
|2005|fotg|l3gn|09|
I want to use SQL to select the most popular name for each year, i.e. return a dataframe that has only the most popular name per year, for all the years present in the 500k rows.
I can do this via 2 separate statements:
-- sql query that gives me the most common name overall
select count(1), name from table_name group by name order by count(1) desc limit 1;
-- If I add a year parameter, I can get it for a particular year
select count(1), name from table_name where year = '2001' group by name order by count(1) desc limit 1;
However how do I merge the query into 1 sql such that it provides me with the data of just the most popular name for each year?
You can use aggregation and window functions:
select yn.*
from (select yn.*,
row_number() over (partition by year order by cnt desc) as seqnum
from (select year, name, count(*) as cnt
from table_name
group by year, name
) yn
) yn
where seqnum = 1;
The innermost subquery calculates the count for each name in each year. The middle subquery enumerates the names for each year based on the count, with the highest count getting 1. And the outer subquery filters to get only the name (per year) that has the highest count.
In most databases, you can simplify this to:
select yn.*
from (select year, name, count(*) as cnt,
row_number() over (partition by year order by count(*) desc) as seqnum
from table_name
group by year, name
) yn
where seqnum = 1;
I have a vague recollection that Spark SQL doesn't allow this syntax.
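Here is a runnable sketch of the first (nested) form, which avoids the window-function-over-aggregate syntax and so should also work in Spark SQL; SQLite (3.25+) stands in for Spark here, and the sample data is invented so that names repeat within a year:

```python
# Innermost query counts names per year; the window query ranks them; the
# outer filter keeps the top name per year.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_name (year TEXT, name TEXT);
INSERT INTO table_name VALUES
  ('2001','foog'),('2001','foog'),('2001','goof'),
  ('2002','tggr'),('2002','wwwe'),('2002','wwwe');
""")

rows = conn.execute("""
SELECT year, name, cnt
FROM (SELECT yn.*,
             ROW_NUMBER() OVER (PARTITION BY year ORDER BY cnt DESC) AS seqnum
      FROM (SELECT year, name, COUNT(*) AS cnt
            FROM table_name
            GROUP BY year, name) yn
     ) yn
WHERE seqnum = 1
ORDER BY year
""").fetchall()
print(rows)  # [('2001', 'foog', 2), ('2002', 'wwwe', 2)]
```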

Generate custom group ranking in sql

As posted, I am trying to generate a group ranking based on the Is_True_Mod column: a new group should start each time a 1 appears, and the group should run until the next 1. Please see the expected output; rows are grouped on the Is_True_Mod column, with the regular ranking shown for reference (the ordering should follow that ranking).
You can identify the groups using a cumulative sum. Then you can use row_number() to enumerate the rows:
select t.*,
row_number() over (partition by grp order by regularranking) as expected_output
from (select t.*,
sum(is_true_mod) over (order by regularranking) as grp
from t
) t;
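The cumulative-sum trick can be seen in a small sketch, using SQLite (3.25+) with invented data; each 1 in is_true_mod bumps the running sum, so all rows up to the next 1 share a grp value:

```python
# Cumulative sum of the flag column defines the groups; row_number()
# restarts the ranking inside each group.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (regularranking INT, is_true_mod INT);
INSERT INTO t VALUES (1,1),(2,0),(3,0),(4,1),(5,0),(6,1);
""")

rows = conn.execute("""
SELECT t.*,
       ROW_NUMBER() OVER (PARTITION BY grp ORDER BY regularranking) AS expected_output
FROM (SELECT t.*,
             SUM(is_true_mod) OVER (ORDER BY regularranking) AS grp
      FROM t) t
ORDER BY regularranking
""").fetchall()
print([r[3] for r in rows])  # [1, 2, 3, 1, 2, 1]
```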

Finding consecutive patterns (with SQL)

A table consecutive in PostgreSQL:
Each se_id has an idx
from 0 up to 100 - here 0 to 9.
The search pattern:
SELECT *
FROM consecutive
WHERE val_3_bool = 1
AND val_1_dur > 4100 AND val_1_dur < 5900
Now I'm looking for the longest consecutive appearance of this pattern
for each p_id - and the AVG of the counted val_1_dur.
Is it possible to calculate this in pure SQL?
One method is the difference of row numbers approach to get the sequences for each:
select pid, count(*) as in_a_row, sum(val1_dur) as dur
from (select t.*,
row_number() over (partition by pid order by idx) as seqnum,
row_number() over (partition by pid, val3_bool order by idx) as seqnum_d
from consecutive t
) t
group by (seqnum - seqnum_d), pid, val3_bool;
If you are looking specifically for "1" values, then add where val3_bool = 1 to the outer query. To understand why this works, I would suggest that you stare at the results of the subquery, so you can understand why the difference defines the consecutive values.
You can then get the max using distinct on:
select distinct on (pid) t.*
from (select pid, count(*) as in_a_row, sum(val1_dur) as dur
from (select t.*,
row_number() over (partition by pid order by idx) as seqnum,
row_number() over (partition by pid, val3_bool order by idx) as seqnum_d
from consecutive t
) t
group by (seqnum - seqnum_d), pid, val3_bool
) t
order by pid, in_a_row desc;
The distinct on does not strictly require the additional level of subquery, but I think the subquery makes the logic clearer.
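The difference-of-row-numbers (gaps-and-islands) step can be sketched as follows, with SQLite (3.25+) standing in and invented data; since distinct on is PostgreSQL-specific, the "longest run" is taken here by ordering and reading the first row:

```python
# Two row_number() sequences: their difference is constant within each
# consecutive run of the same val3_bool value.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE consecutive (pid INT, idx INT, val3_bool INT, val1_dur INT);
INSERT INTO consecutive VALUES
  (1,0,1,5000),(1,1,1,5000),(1,2,0,100),
  (1,3,1,4500),(1,4,1,4500),(1,5,1,4500),(1,6,0,100);
""")

rows = conn.execute("""
SELECT pid, COUNT(*) AS in_a_row, SUM(val1_dur) AS dur
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY pid ORDER BY idx) AS seqnum,
             ROW_NUMBER() OVER (PARTITION BY pid, val3_bool ORDER BY idx) AS seqnum_d
      FROM consecutive t) t
WHERE val3_bool = 1
GROUP BY (seqnum - seqnum_d), pid, val3_bool
ORDER BY in_a_row DESC
""").fetchall()
print(rows[0])  # (1, 3, 13500) -- the longest run: 3 rows, durations summed
```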
There are window functions that enable you to compare one row with the previous and next ones.
https://community.modeanalytics.com/sql/tutorial/sql-window-functions/
https://www.postgresql.org/docs/current/static/tutorial-window.html
As seen on How to compare the current row with next and previous row in PostgreSQL? and Filtering by window function result in Postgresql

Get aggregate over n last values in vertica

We have table that has the columns dates,sales and item.
An item's price can be different at every sale, and we want to find the price of an item, averaged over its most recent 50 sales.
Is there a way to do this using analytical functions in Vertica?
For a popular item, all these 50 sales could be from this week. For another, we may need to have a 3 month window.
Can we know what these windows are, per item ?
You would use a window-frame clause to get the value on every row:
select t.*,
avg(t.price) over (partition by item
order by t.date desc
rows between 49 preceding and current row
) as avg_price_50
from t;
On re-reading the question, I suspect you want a single row per item. For that, use row_number():
select t.item, avg(t.price)
from (select t.*,
row_number() over (partition by item order by t.date desc) as seqnum
from t
) t
where seqnum <= 50
group by item;
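The one-row-per-item form can be sketched like so, with SQLite (3.25+) standing in for Vertica and invented data; with fewer than 50 sales per item, the average simply covers whatever rows exist:

```python
# row_number() orders each item's sales newest-first; the filter keeps the
# most recent 50, and the aggregation averages their prices per item.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (item TEXT, date TEXT, price REAL);
INSERT INTO t VALUES
  ('a','2024-01-01',10),('a','2024-01-02',20),('a','2024-01-03',30),
  ('b','2024-01-01',100);
""")

rows = conn.execute("""
SELECT item, AVG(price) AS avg_price
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY item ORDER BY date DESC) AS seqnum
      FROM t) t
WHERE seqnum <= 50
GROUP BY item
ORDER BY item
""").fetchall()
print(rows)  # [('a', 20.0), ('b', 100.0)]
```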

Selecting type(s) of account with 2nd maximum number of accounts

Suppose we have an accounts table along with the already given values
I want to find the type of account with the second-highest number of accounts. In this case, the result should be 'FD'. In case there is a tie for the second-highest count, I need all those types in the result.
I have no idea how to do this. I've found numerous posts about finding the second-highest value, say salary, in a table, but not the second-highest COUNT.
This can be done using CTEs. Get the counts for each type as the first step. Then use dense_rank() (which assigns the same rank to rows with the same count, handling ties) to rank the types by count. Finally, select the rows ranked second.
with counts as (
select type, count(*) cnt
from yourtable
group by type)
, ranks as (
select type, dense_rank() over(order by cnt desc) rnk
from counts)
select type
from ranks
where rnk = 2;
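The CTE approach above can be checked with a quick sketch; SQLite (3.25+) stands in, and the account data is invented so that two types tie for second place:

```python
# dense_rank() over the per-type counts: both types tied at rank 2 are kept.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (type TEXT);
INSERT INTO accounts VALUES
  ('SB'),('SB'),('SB'),('FD'),('FD'),('RD'),('RD'),('CUR');
""")

rows = conn.execute("""
WITH counts AS (
  SELECT type, COUNT(*) AS cnt FROM accounts GROUP BY type
), ranks AS (
  SELECT type, DENSE_RANK() OVER (ORDER BY cnt DESC) AS rnk FROM counts
)
SELECT type FROM ranks WHERE rnk = 2 ORDER BY type
""").fetchall()
print(rows)  # [('FD',), ('RD',)]
```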
One option is to use row_number() (or dense_rank(), depending on what "second" means when there are ties):
select a.*
from (select a.type, count(*) as cnt,
row_number() over (order by count(*) desc) as seqnum
from accounts a
group by a.type
) a
where seqnum = 2;
In Oracle 12c+, you can use offset/fetch:
select a.type, count(*) as cnt
from accounts a
group by a.type
order by count(*) desc
offset 1
fetch first 1 row only