select top 10 items per city in SparkSQL - sql

I have the following SQL table (SparkSQL):
user_id, city, timestamp, item_id
I need to find the top 10 items for each city (ranked by the number of times the item_id appeared in that city) on each date.
I then did the following:
SELECT *
FROM (
SELECT *,
row_number() OVER partition BY city AS rn
FROM mytable) AS foo
ORDER BY rn DESC
However, although it sorts by rn, it doesn't give me just the top 10 items for a given date. What would be a proper way to fix this? Thanks!

I'm not sure which function is best for truncating the time from a timestamp in Spark; to_date is used here.
First you need to calculate the count per city, date and item, and then the row_number:
SELECT *
FROM (
SELECT city, item_id, theDATE, cnt,
ROW_NUMBER() OVER (PARTITION BY city, theDATE
ORDER BY cnt DESC) rn -- highest count first
FROM (SELECT city,
item_id,
to_date(timestamp) as theDATE, -- remove time and leave just date.
COUNT(*) AS cnt -- occurrences of the item per city and date
FROM mytable
GROUP BY city, item_id, to_date(timestamp)
) AS foo
) AS boo
WHERE rn <= 10
ORDER BY city, theDATE, rn
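The count-then-rank approach can be tried end to end with a small sketch. It runs against SQLite through Python's sqlite3 module rather than Spark, so date() stands in for to_date(); the sample rows and the TOP_N value of 2 are made up for illustration.

```python
import sqlite3

# Hypothetical sample rows: (user_id, city, timestamp, item_id).
rows = [
    (1, "Paris", "2020-01-01 09:00:00", "a"),
    (2, "Paris", "2020-01-01 10:00:00", "a"),
    (3, "Paris", "2020-01-01 11:00:00", "a"),
    (4, "Paris", "2020-01-01 12:00:00", "b"),
    (5, "Paris", "2020-01-01 13:00:00", "b"),
    (6, "Paris", "2020-01-01 14:00:00", "c"),
    (7, "Paris", "2020-01-02 09:00:00", "c"),
    (8, "Rome", "2020-01-01 09:00:00", "b"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (user_id INT, city TEXT, timestamp TEXT, item_id TEXT)"
)
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", rows)

TOP_N = 2  # the question asks for 10; 2 keeps the sample output short

query = """
SELECT city, theDATE, item_id, cnt, rn
FROM (
    SELECT city, theDATE, item_id, cnt,
           -- item_id is added as a tie-breaker so the result is deterministic
           ROW_NUMBER() OVER (PARTITION BY city, theDATE
                              ORDER BY cnt DESC, item_id) AS rn
    FROM (
        -- step 1: count occurrences per city, date and item
        SELECT city, date(timestamp) AS theDATE, item_id, COUNT(*) AS cnt
        FROM mytable
        GROUP BY city, date(timestamp), item_id
    ) AS counts
) AS ranked
WHERE rn <= ?
ORDER BY city, theDATE, rn
"""
top_items = conn.execute(query, (TOP_N,)).fetchall()
for row in top_items:
    print(row)
```

Item c is dropped for Paris on 2020-01-01 because it ranks third there, while it is kept on 2020-01-02 where it is the only item.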

Related

how to make a request?

I have a table Tabl1: id, name, country, year, medal.
How can I find the top 10 countries by number of medals for each year in one query?
thanks:)
You haven't told us anything about your table schema or the data, so this is a guess!
Going to assume your medal column contains the qty of medals for each Id/name, so you just need to rank by the sum of medals. Something along the lines of:
select [year], country, [Rank] from (
select [year], country, Rank() over(partition by [year] order by Sum(medal) desc ) [Rank]
from Tabl1
group by [year],country
)x
where [Rank]<=10
order by [year], [Rank]
Here is how you can get the top 10 countries in each year:
select * from
(
select country, year, count(*) as cnt,
row_number() over (partition by year order by count(*) desc) as rn
from Tabl1
group by country, year
) tt
where tt.rn < 11
The subquery groups the data by country and year and gives you the count(*) of each group. At the same time, the row_number() window function numbers the groups within each year in descending order of count(*), so the country with the most medals in each year gets row number 1. You need the top 10, so you filter on tt.rn < 11 in the main query.
If you want 10 countries per year:
with data as (
select country, "year" as yr,
rank() over (partition by "year" order by count(*) desc) as rnk
from T
group by country, "year"
)
select yr as "year", country from data
where rnk <= 10
order by yr, rnk;
Note that if ties are possible this could return more than ten rows for any given year.
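The effect of ties can be seen in a small sketch that compares rank() with row_number() on the same counts. It uses SQLite via Python's sqlite3 module; the medal rows (one row per medal) and the cut-off of 2 instead of 10 are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (country TEXT, year INT)")

# Hypothetical data, one row per medal: A has 3, B and C tie on 2, D has 1.
medals = (
    [("A", 2000)] * 3 + [("B", 2000)] * 2 + [("C", 2000)] * 2 + [("D", 2000)]
)
conn.executemany("INSERT INTO T VALUES (?, ?)", medals)

query = """
WITH counts AS (
    SELECT country, year AS yr, COUNT(*) AS medals
    FROM T
    GROUP BY country, year
),
data AS (
    SELECT country, yr, medals,
           RANK() OVER (PARTITION BY yr ORDER BY medals DESC) AS rnk,
           -- country is a tie-breaker so row_number is deterministic
           ROW_NUMBER() OVER (PARTITION BY yr
                              ORDER BY medals DESC, country) AS rn
    FROM counts
)
SELECT country, medals, rnk, rn FROM data ORDER BY yr, rn
"""
ranked = conn.execute(query).fetchall()

# Keep the "top 2" with each function.
by_rank = [country for country, _, rnk, _ in ranked if rnk <= 2]
by_row_number = [country for country, _, _, rn in ranked if rn <= 2]
print(by_rank, by_row_number)
```

With rank(), B and C share rank 2, so the filter returns three countries; row_number() always returns exactly two, at the cost of picking between tied countries by the tie-breaker.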

Sum having a condition

I've a table that has this information:
And need to get the following information:
If the country of the same person name (in this case Artur) is different, then I need to sum the two values of quantity from the max date (in this case 04/10) and return both person (Artur) and the qty (15k)
If the country of the same person name (in this case Joseph) is the same, then I need only the first row of the max date available.
I'm really struggling, as I'm not sure how to implement the logic in my code:
Select
table.person,
table.quantity
From
(
Select
table.date,
table.person,
table.country,
table.quantity,
ROW_NUMBER () over (
PARTITION by table.code, table.person
ORDER by table.date DESC
) AS rn
FROM
table
WHERE table.date >= DATE '{2020-04-10}' -5
) a
WHERE a.RN IN (1,2)
Is it possible to create a rule to sum rows 1 and 2 when country is different (Artur case) and only return row number 1 when the country is the same for a name (Joseph case)?
Use dense_rank() or max() as a window function:
select person, sum(quantity)
from (select t.*,
max(date) over (partition by person) as max_date
from t
) t
where date = max_date
group by person;
EDIT:
Hmmm . . . I think you might want one row per country per person on the max date. If so:
select person, sum(quantity)
from (select t.*,
row_number() over (partition by person, country order by date desc) as seqnum_pc,
rank() over (partition by person order by date desc) as seqnum_p
from t
) t
where seqnum_p = 1 and seqnum_pc = 1
group by person;
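A minimal sketch of the second query, run on SQLite through Python's sqlite3 module. The rows are hypothetical: the original sample data was not preserved, so the country codes and the individual quantities are assumptions chosen so that Artur totals 15k on the max date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (date TEXT, person TEXT, country TEXT, quantity INT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        # Artur: two different countries on the max date, so both are summed.
        ("2020-04-10", "Artur", "PT", 10000),
        ("2020-04-10", "Artur", "ES", 5000),
        ("2020-04-09", "Artur", "PT", 900),
        # Joseph: same country twice on the max date, so only one row counts.
        ("2020-04-10", "Joseph", "FR", 7000),
        ("2020-04-10", "Joseph", "FR", 7000),
        ("2020-04-08", "Joseph", "FR", 100),
    ],
)

query = """
SELECT person, SUM(quantity) AS total
FROM (
    SELECT t.*,
           -- 1 for the latest row of each (person, country) pair
           ROW_NUMBER() OVER (PARTITION BY person, country
                              ORDER BY date DESC) AS seqnum_pc,
           -- 1 for every row on the person's max date
           RANK() OVER (PARTITION BY person ORDER BY date DESC) AS seqnum_p
    FROM t
) AS ranked
WHERE seqnum_p = 1 AND seqnum_pc = 1
GROUP BY person
ORDER BY person
"""
totals = conn.execute(query).fetchall()
print(totals)
```

When a person has several rows for the same country on the max date, row_number() picks one of them arbitrarily; the two Joseph rows carry the same quantity here so the result does not depend on that choice.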

Select the most recent row where 2 columns contain the same value

I have a table of street codes and county codes. I need to only select the most recent row (ordered by created date) of any rows where these 2 columns are the same.
Ex.
Here only the last row should be selected, since it has the newest created date, where the Kommunekode and Vejkode are the same.
How can I filter my select statement to allow this logic? I tried using the distinct keyword, but that does not take the Created date into account.
My current code for the view:
SELECT
Infohub_RowId,
Infohub_CreatedDate,
Id,
Sekvensnummer,
Tidspunkt,
Operation,
Kommunekode,
Vejkode,
Oprettet,
Aendret,
Navn,
Vejnavn,
Navngivenvej_id,
Aendret AS Infohub_ValidityDate
FROM (
SELECT
Infohub_RowId,
Infohub_CreatedDate,
Sekvensnummer,
Tidspunkt,
Operation,
Id,
Kommunekode,
Vejkode,
Oprettet,
Aendret,
Navn,
Vejnavn,
Navngivenvej_id,
ROW_NUMBER() OVER(PARTITION BY Id ORDER BY Aendret DESC) AS RowNum
FROM
Dawa.tDelta_Vejstykke) AS x
WHERE x.RowNum = 1
The view should "clean up" the data, by selecting the newest duplicate records.
Use Infohub_CreatedDate in the ORDER BY, and put the two columns Kommunekode and Vejkode in the PARTITION BY:
SELECT
Infohub_RowId,
Infohub_CreatedDate,
Id,
Sekvensnummer,
Tidspunkt,
Operation,
Kommunekode,
Vejkode,
Oprettet,
Aendret,
Navn,
Vejnavn,
Navngivenvej_id,
Aendret AS Infohub_ValidityDate
FROM (
SELECT
Infohub_RowId,
Infohub_CreatedDate,
Sekvensnummer,
Tidspunkt,
Operation,
Id,
Kommunekode,
Vejkode,
Oprettet,
Aendret,
Navn,
Vejnavn,
Navngivenvej_id,
ROW_NUMBER() OVER(PARTITION BY Kommunekode,
Vejkode ORDER BY Infohub_CreatedDate DESC) AS RowNum
FROM
Dawa.tDelta_Vejstykke) AS x
WHERE x.RowNum = 1
You want row_number(), but Kommunekode and Vejkode should be in the PARTITION BY clause:
SELECT t.*
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY Kommunekode, Vejkode ORDER BY Infohub_CreatedDate DESC) AS Seq
FROM Dawa.tDelta_Vejstykke t
) t
WHERE Seq = 1;
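To check the behaviour on a small scale, the same pattern can be run on SQLite through Python's sqlite3 module. The table below is a cut-down, hypothetical version of Dawa.tDelta_Vejstykke (no schema prefix, only four of the columns plus the created date), and all row values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE tDelta_Vejstykke (
           Infohub_RowId INT,
           Infohub_CreatedDate TEXT,
           Kommunekode TEXT,
           Vejkode TEXT,
           Vejnavn TEXT
       )"""
)
conn.executemany(
    "INSERT INTO tDelta_Vejstykke VALUES (?, ?, ?, ?, ?)",
    [
        # Three versions of the same (Kommunekode, Vejkode) pair.
        (1, "2020-01-01", "0101", "0004", "Old name"),
        (2, "2020-02-01", "0101", "0004", "Newer name"),
        (3, "2020-03-01", "0101", "0004", "Newest name"),
        # A pair that occurs only once.
        (4, "2020-01-15", "0147", "0010", "Other street"),
    ],
)

query = """
SELECT Infohub_RowId, Kommunekode, Vejkode, Vejnavn
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY Kommunekode, Vejkode
                              ORDER BY Infohub_CreatedDate DESC) AS Seq
    FROM tDelta_Vejstykke AS t
) AS x
WHERE Seq = 1
ORDER BY Kommunekode, Vejkode
"""
latest = conn.execute(query).fetchall()
print(latest)
```

Only the row with the newest Infohub_CreatedDate survives for each Kommunekode/Vejkode pair.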

Selecting City from Customer ID in SQL

Customers have ordered from different cities, so we have multiple cities against the same customer_id. For each customer id, I want to display the city that occurs the maximum number of times; where a customer has placed the same number of orders from multiple cities, the city of the most recent order should be selected. I have tried something like:
SELECT customer_id,delivery_city,COUNT(DISTINCT delivery_city)
FROM analytics.f_order
GROUP BY customer_id,delivery_city
HAVING COUNT(DISTINCT delivery_city) > 1
WITH cte as (
SELECT customer_id,
delivery_city,
COUNT(delivery_city) as city_count,
MAX(order_date) as last_order
FROM analytics.f_order
GROUP BY customer_id, delivery_city
), ranking as (
SELECT *, row_number() over (partition by customer_id
order by city_count DESC, last_order DESC) as rn
FROM cte
)
SELECT *
FROM ranking
WHERE rn = 1
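A small sketch of the two-CTE approach, run on SQLite through Python's sqlite3 module. The table is named f_order without the analytics schema, and the customers, cities and dates are invented so that customer 1 has a tie on order count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE f_order (customer_id INT, delivery_city TEXT, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO f_order VALUES (?, ?, ?)",
    [
        # Customer 1: Berlin and Munich tie on 2 orders; Munich is more recent.
        (1, "Berlin", "2020-01-01"),
        (1, "Berlin", "2020-01-05"),
        (1, "Munich", "2020-01-03"),
        (1, "Munich", "2020-02-01"),
        # Customer 2: Hamburg wins on count even though Cologne is the latest.
        (2, "Hamburg", "2020-01-01"),
        (2, "Hamburg", "2020-01-02"),
        (2, "Hamburg", "2020-01-03"),
        (2, "Cologne", "2020-03-01"),
    ],
)

query = """
WITH cte AS (
    SELECT customer_id,
           delivery_city,
           COUNT(*) AS city_count,
           MAX(order_date) AS last_order
    FROM f_order
    GROUP BY customer_id, delivery_city
), ranking AS (
    SELECT *,
           -- most orders first, then the most recent order as tie-breaker
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY city_count DESC, last_order DESC) AS rn
    FROM cte
)
SELECT customer_id, delivery_city, city_count
FROM ranking
WHERE rn = 1
ORDER BY customer_id
"""
top_city = conn.execute(query).fetchall()
print(top_city)
```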
select customer_id,
delivery_city,
amount
from
(
select t.*,
rank() over (partition by customer_id order by amount desc) as rnk -- most frequent city first; tied cities share a rank
from(
SELECT customer_id,
delivery_city,
COUNT(*) as amount -- number of orders per customer and city
FROM analytics.f_order
GROUP BY customer_id,delivery_city
) t
) r
where rnk = 1

Return top 5 from SUM in select statement

I need to return the result of the following statement, but I only want the TOP 5 of each Sale value, not all the records.
Select ID, Code, sum(Sale) as Sale from TableName
Where Code = 11
Group By ID, code
I do not want this!
Select TOP 5 ID, Code, sum(Sale) as Sale from TableName
Where Code = 11
Group By ID, code
With Cte as
( Select ID, Code, sale as Sales ,
row_number() over (partition by ID,code order by sale desc) as row_num
from TableName where code=11
)
Select Id,code,sum(sales) as Sale from cte
WHERE row_num < 6
GROUP BY ID, code
WITH TopSales AS (
SELECT *, RANK() OVER (PARTITION BY ID, Code ORDER BY Sale DESC) saleRank
FROM TableName
)
SELECT ID, Code, SUM(Sale) AS Sale
FROM TopSales
WHERE (Code = 11) AND (saleRank <= 5)
GROUP BY ID, code
select id, code, SUM (sale)
from
(
select id, code, sale,
ROW_NUMBER() over(partition by id, code order by sale desc) rn
from tablename
) v
where rn<=5
group by id, code
Probably you need something like:
;WITH sales AS (
SELECT
id,
code,
sale,
ROW_NUMBER() OVER (PARTITION BY id, code ORDER BY sale DESC) n
FROM
TableName
WHERE
Code = 11
)
SELECT
id, code, sum(sale) sale
FROM
sales
WHERE
n <= 5
GROUP BY
id,
code
ROW_NUMBER() and PARTITION BY help to find the top 5 sales in each group. Then you SUM only those top (highest) 5.
This query returns sum of top 5 sales for each (id, code) group.
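The "sum of the top 5 per group" reading can be checked with a short sketch on SQLite through Python's sqlite3 module. The sale amounts are made up; group (1, 11) has seven sales so that the two smallest are excluded from the sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableName (ID INT, Code INT, Sale INT)")

# Hypothetical sales: seven for (1, 11), two for (2, 11), one with Code 12.
sales = [(1, 11, s) for s in (10, 20, 30, 40, 50, 60, 70)]
sales += [(2, 11, 5), (2, 11, 5), (3, 12, 100)]
conn.executemany("INSERT INTO TableName VALUES (?, ?, ?)", sales)

query = """
SELECT ID, Code, SUM(Sale) AS Sale
FROM (
    SELECT ID, Code, Sale,
           -- number the sales of each (ID, Code) group, highest first
           ROW_NUMBER() OVER (PARTITION BY ID, Code
                              ORDER BY Sale DESC) AS rn
    FROM TableName
    WHERE Code = 11
) AS v
WHERE rn <= 5
GROUP BY ID, Code
ORDER BY ID
"""
top5_sums = conn.execute(query).fetchall()
print(top5_sums)
```

For group (1, 11) the sum covers 70, 60, 50, 40 and 30; a group with fewer than five sales, such as (2, 11), is simply summed in full.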
If you want to return just 5 rows for each group, in no particular order, you could do this:
with cte as
(select ID, Code, Sale,ROW_NUMBER() over(partition by ID,
Code order by (select 0)) rownum
from TableName)
Select ID, Code, sum(Sale) as Sale from cte
Where Code = 11
and rownum<=5
Group By ID, code
If you want to return the top 5 results with the highest Sale for each group you could do this:
with cte as
(select ID, Code, Sale,ROW_NUMBER() over(partition by ID,
Code order by Sale desc) rownum
from TableName)
Select ID, Code, sum(Sale) as Sale from cte
Where Code = 11
and rownum<=5
Group By ID, code
select id, code, sum(sale) as sale from tablename
where code = 11
group by id, code
order by sum(sale) desc
limit 5