sql query to obtain most number of names by year - sql

I have a sample dataframe below that is over 500k rows:
|year|name|text|id|
|2001|foog|ltgn|01|
|2001|goof|ltg4|02|
|2002|tggr|ltg5|03|
|2002|wwwe|ltg6|04|
|2004|frgr|ltg7|05|
|2004|ggtg|ltg8|06|
|2003|hhyy|lt9n|07|
|2003|jjuu|l2gn|08|
|2005|fotg|l3gn|09|
I want to use SQL to select the most popular name for each year, i.e. return a dataframe that contains only the most popular name per year, for every year present in the 500k rows.
I can do this via 2 separate statements:
-- sql query that gives me the most popular name overall
select count(1), name from table_name group by name order by count(1) desc limit 1;
-- if I add a year filter, I get the most popular name for that particular year
select count(1), name from table_name where year = '2001' group by name order by count(1) desc limit 1;
However, how do I merge these into a single SQL query that returns just the most popular name for each year?

You can use aggregation and window functions:
select yn.*
from (select yn.*,
row_number() over (partition by year order by cnt desc) as seqnum
from (select year, name, count(*) as cnt
from table_name
group by year, name
) yn
) yn
where seqnum = 1;
The innermost subquery calculates the count for each name in each year. The middle subquery enumerates the names within each year by count, with the highest count getting 1. The outer query then keeps only the name (per year) that has the highest count.
In most databases, you can simplify this to:
select yn.*
from (select year, name, count(*) as cnt,
row_number() over (partition by year order by count(*) desc) as seqnum
from table_name
group by year, name
) yn
where seqnum = 1;
I have a vague recollection that Spark SQL doesn't allow this syntax.
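To try both forms of the query without a Spark cluster, here is a minimal sketch that runs them against an in-memory SQLite database from Python. The sample rows are made up for illustration (they are not the asker's data), and it assumes SQLite 3.25 or newer, which is when window functions were added. It says nothing about whether Spark SQL accepts the simplified form.

```python
import sqlite3

# Hypothetical sample rows (year, name); not taken from the original table.
rows = [
    (2001, "foog"), (2001, "foog"), (2001, "goof"),
    (2002, "tggr"), (2002, "wwwe"), (2002, "wwwe"), (2002, "wwwe"),
    (2003, "hhyy"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table table_name (year integer, name text)")
conn.executemany("insert into table_name values (?, ?)", rows)

# Three-level form: count per (year, name), number the rows per year, keep row 1.
nested = """
select year, name, cnt
from (select yn.*,
             row_number() over (partition by year order by cnt desc) as seqnum
      from (select year, name, count(*) as cnt
            from table_name
            group by year, name
           ) yn
     ) ranked
where seqnum = 1
order by year
"""

# Two-level form: the window function orders directly by the aggregate.
simplified = """
select year, name, cnt
from (select year, name, count(*) as cnt,
             row_number() over (partition by year order by count(*) desc) as seqnum
      from table_name
      group by year, name
     ) yn
where seqnum = 1
order by year
"""

nested_result = conn.execute(nested).fetchall()
simplified_result = conn.execute(simplified).fetchall()
print(nested_result)  # [(2001, 'foog', 2), (2002, 'wwwe', 3), (2003, 'hhyy', 1)]
```

With this data there are no ties for first place, so both forms return the same single row per year.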

Related

How to get 3 most frequent column counts separated by year in SQL

I have a database (crimes) and I want to get, for each year, the top 3 districts with the highest number of crimes in SQL. I have tried the following code, but this just sums the number of crimes:
SELECT
year,
district,
CrimeID,
COUNT(*) OVER (PARTITION BY year)
FROM Crimes
You could do it like this in Oracle, if that helps (editing to add, it looks like you might be using SQL Server so I have added an alias to the derived table to make it work for that too):
SELECT
v.year,
v.district,
v.count
FROM (
SELECT
year,
district,
COUNT(*) AS count,
ROW_NUMBER() OVER (PARTITION BY year ORDER BY COUNT(*) DESC) AS rono
FROM crimes
GROUP BY year, district
) v
WHERE v.rono <= 3
ORDER BY v.year ASC, v.rono ASC
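The difference between the attempt in the question and the grouped query can be seen on a handful of rows. The sketch below uses Python's sqlite3 module with invented data (the rows and the column types are assumptions, and SQLite 3.25+ is assumed for window functions): the windowed count keeps one row per crime and repeats the yearly total, while GROUP BY collapses to one row per year and district, which is what the ranking step needs.

```python
import sqlite3

# Hypothetical rows (CrimeID, year, district); for illustration only.
rows = [
    (1, 2019, "North"), (2, 2019, "North"), (3, 2019, "South"),
    (4, 2020, "East"),
]
conn = sqlite3.connect(":memory:")
conn.execute("create table crimes (CrimeID integer, year integer, district text)")
conn.executemany("insert into crimes values (?, ?, ?)", rows)

# The attempt from the question: one output row per crime,
# each carrying the total number of crimes for its year.
windowed = conn.execute("""
    select year, district, CrimeID, count(*) over (partition by year) as per_year
    from crimes
    order by CrimeID
""").fetchall()

# Grouping first: one output row per (year, district) with that district's count.
grouped = conn.execute("""
    select year, district, count(*) as per_district
    from crimes
    group by year, district
    order by year, district
""").fetchall()

print(windowed)
print(grouped)
```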

how to make a request?

I have a table Tabl1 with columns id, name, country, year, medal.
How can I find the top 10 countries by number of medals for each year in a single query?
thanks:)
You haven't told us anything about your table schema or the data, so this is a guess!
Going to assume your medal column contains the qty of medals for each Id/name, so you just need to rank by the sum of medals. Something along the lines of:
select [year], country, [Rank] from (
select [year], country, Rank() over(partition by [year] order by Sum(medal) desc ) [Rank]
from Tabl1
group by [year],country
)x
where [Rank]<=10
order by [year], [Rank]
here you can get the top 10 countries in each year:
select * from
(
select country,year,count(*),row_number() over (partition by year order by count(*) desc) as rn
from Tabl1
group by country, year
) tt
where tt.rn < 11
The subquery groups the data by country and year and gives you the count(*) of each group. At the same time, it sorts the groups by count(*) desc within each year and assigns each one a row number (this happens via the row_number() window function), so the country with the most medals in each year comes first and gets row number 1. You need the top 10, so the main query filters on tt.rn < 11.
If you want 10 countries per year:
with data as (
select country, "year" as yr,
rank() over (partition by "year" order by count(*) desc) as rnk
from T
group by country, "year"
)
select yr as "year", country from data
where rnk <= 10
order by yr, rnk;
Note that if ties are possible this could return more than ten rows for any given year.
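The effect of ties on the cutoff is easy to reproduce. Below is a small sketch using Python's sqlite3 module (SQLite 3.25+ assumed); the medal rows are invented, the cutoff is 2 instead of 10 to keep the data short, and `top_n` is a hypothetical helper written for this example.

```python
import sqlite3

# Hypothetical data: one row per medal. B and C tie with two medals each.
rows = [("A", 2000)] * 3 + [("B", 2000)] * 2 + [("C", 2000)] * 2 + [("D", 2000)]
conn = sqlite3.connect(":memory:")
conn.execute('create table T (country text, "year" integer)')
conn.executemany("insert into T values (?, ?)", rows)

def top_n(func, n):
    """Return the countries whose ranking value is <= n (hypothetical helper)."""
    # Only allow known function names, since the name is interpolated into the SQL.
    if func not in ("rank", "row_number"):
        raise ValueError(func)
    sql = f"""
        with data as (
            select country, "year" as yr,
                   {func}() over (partition by "year" order by count(*) desc) as rnk
            from T
            group by country, "year"
        )
        select country from data where rnk <= ? order by rnk, country
    """
    return [r[0] for r in conn.execute(sql, (n,))]

by_rank = top_n("rank", 2)              # A is rank 1; B and C share rank 2
by_row_number = top_n("row_number", 2)  # exactly two rows; B or C is picked arbitrarily

print(by_rank)        # ['A', 'B', 'C']
print(by_row_number)
```

If exactly N rows per year are required, row_number() gives that, at the cost of dropping one of the tied countries.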

Spark SQL - Finding the maximum value of a month per year

I have created a data frame which contains Year, Month, and the occurrence of incidents (count).
I want to find which month of each year had the most incidents, using Spark SQL.
You can use window functions:
select *
from (select t.*, rank() over(partition by year order by cnt desc) rn from mytable t) t
where rn = 1
For each year, this gives you the row that has the greatest cnt. If there are ties, the query returns them.
Note that count is a language keyword in SQL, hence not a good choice for a column name. I renamed it to cnt in the query.
You can use window functions, if you want to use SQL:
select t.*
from (select t.*,
row_number() over (partition by year order by count desc) as seqnum
from t
) t
where seqnum = 1;
This returns one row per year, even if there are ties for the maximum count. If you want all such rows in the event of ties, then use rank() instead of row_number().
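With row_number(), which of the tied rows receives 1 is not determined by the query unless the ORDER BY is made unique. One possible way to make the choice repeatable is to add a tie-breaker column. The sketch below does this with Python's sqlite3 module rather than Spark (SQLite 3.25+ assumed); the rows are invented, the count column is named cnt, and preferring the earliest month is an assumption about what is wanted.

```python
import sqlite3

# Hypothetical (year, month, cnt) rows; 2020 has a tie between months 3 and 7.
rows = [(2020, 3, 9), (2020, 7, 9), (2020, 1, 4), (2021, 5, 6), (2021, 2, 1)]
conn = sqlite3.connect(":memory:")
conn.execute("create table t (year integer, month integer, cnt integer)")
conn.executemany("insert into t values (?, ?, ?)", rows)

# "month" after "cnt desc" breaks ties in favour of the earliest month.
query = """
select year, month, cnt
from (select t.*,
             row_number() over (partition by year order by cnt desc, month) as seqnum
      from t
     ) ranked
where seqnum = 1
order by year
"""
first_per_year = conn.execute(query).fetchall()
print(first_per_year)  # [(2020, 3, 9), (2021, 5, 6)]
```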

How do I create a new SQL table with custom column names and populate these columns

So I currently have an SQL statement that returns the most frequently occurring value and the least frequently occurring value in a table. However, the result has 2 rows, each holding a name and its frequency. I need a result with 2 columns, min and max, and a single row with one value for each. The value for these columns needs to be from the same row.
(SELECT name, COUNT(name) AS frequency
FROM firefighter_certifications
GROUP BY name
ORDER BY frequency DESC limit 1)
UNION
(SELECT name, COUNT(name) AS frequency
FROM firefighter_certifications
GROUP BY name
ORDER BY frequency ASC limit 1);
So for the above query I would need the names of the min and max values in one row. I need to be able to define the name of new columns for the generated SQL query as well.
Min_Name | Max_Name
Certif_1 | Certif_2
I think this query should give you the results you want. It ranks each name according to the number of times it appears in the table, then uses conditional aggregation to select the min and max frequency names in one row:
with cte as (
select name,
row_number() over (order by count(*) desc) as maxr,
row_number() over (order by count(*)) as minr
from firefighter_certifications
group by name
)
select max(case when minr = 1 then name end) as Min_Name,
max(case when maxr = 1 then name end) as Max_Name
from cte
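As a quick check of the conditional-aggregation query above, here is a sketch that runs it through Python's sqlite3 module (SQLite 3.25+ assumed, not Postgres). The certification names are invented, and the counts are chosen so that there are no ties for either the most or the least frequent name.

```python
import sqlite3

# Hypothetical certification rows: CPR x3, Hazmat x2, Rescue x1.
rows = [("CPR",)] * 3 + [("Hazmat",)] * 2 + [("Rescue",)]
conn = sqlite3.connect(":memory:")
conn.execute("create table firefighter_certifications (name text)")
conn.executemany("insert into firefighter_certifications values (?)", rows)

query = """
with cte as (
    select name,
           row_number() over (order by count(*) desc) as maxr,
           row_number() over (order by count(*)) as minr
    from firefighter_certifications
    group by name
)
select max(case when minr = 1 then name end) as Min_Name,
       max(case when maxr = 1 then name end) as Max_Name
from cte
"""
cur = conn.execute(query)
columns = [d[0] for d in cur.description]  # column names come from the aliases
result = cur.fetchall()

print(columns)  # ['Min_Name', 'Max_Name']
print(result)   # [('Rescue', 'CPR')]
```

The CASE expressions yield NULL for every row except the one ranked first, so max() simply picks out that single non-NULL name for each column.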
Postgres doesn't offer "first" and "last" aggregation functions. But there are other, similar methods:
select distinct first_value(name) over (order by cnt desc, name) as name_at_max,
first_value(name) over (order by cnt asc, name) as name_at_min
from (select name, count(*) as cnt
from firefighter_certifications
group by name
) n;
Or without any subquery at all:
select first_value(name) over (order by count(*) desc, name) as name_at_max,
first_value(name) over (order by count(*) asc, name) as name_at_min
from firefighter_certifications
group by name
limit 1;

Find the month in which maximum number of employees hired

I have a situation where I need to find the month in which the maximum number of employees were hired.
Here is my Employee table:
Although I have a solution for this:
select MM
from (
select *, dense_RANK() OVER(order by cnt desc) as rnk
from (
select month(doj) as MM,count(month(doj)) as CNT
from employee
group by month(doj)
)x
)y
where rnk=1
But I am not satisfied with what I have implemented and want the most feasible solution for it.
I think the simplest way is:
select top 1 year(doj), month(doj), count(*)
from employee
group by year(doj), month(doj)
order by count(*) desc;
Notes:
This interprets "month" as being "year/month". If you really do only want the month, then remove year() from both the select and group by.
This returns one row. If you want multiple rows when there are ties, then use select top (1) with ties.
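The first note matters more than it may seem: grouping by year and month can give a different answer than grouping by month alone. The sketch below illustrates this with Python's sqlite3 module. SQLite has no TOP, year() or month(), so LIMIT 1 and strftime() are used here as stand-ins for the SQL Server syntax above. The hire dates are invented, and storing doj as ISO-8601 text is an assumption.

```python
import sqlite3

# Hypothetical hire dates: January has 4 hires spread over two years,
# while June 2020 alone has 3.
rows = [("2020-01-10",), ("2020-01-20",), ("2021-01-05",), ("2021-01-15",),
        ("2020-06-01",), ("2020-06-02",), ("2020-06-03",)]
conn = sqlite3.connect(":memory:")
conn.execute("create table employee (doj text)")
conn.executemany("insert into employee values (?)", rows)

# "Month" interpreted as year/month.
by_year_month = conn.execute("""
    select strftime('%Y', doj) as yy, strftime('%m', doj) as mm, count(*) as cnt
    from employee
    group by strftime('%Y', doj), strftime('%m', doj)
    order by cnt desc
    limit 1
""").fetchone()

# "Month" interpreted as calendar month regardless of year.
by_month = conn.execute("""
    select strftime('%m', doj) as mm, count(*) as cnt
    from employee
    group by strftime('%m', doj)
    order by cnt desc
    limit 1
""").fetchone()

print(by_year_month)  # ('2020', '06', 3)
print(by_month)       # ('01', 4)
```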