Selecting top 5 rows with distinct single column value - hive

I am trying to pull the top five states with the highest measurements related to a specific measureid. My issue is that I am trying to pull DISTINCT states with the highest measurements.
My query:
select distinct measureid, reportyear, statename, max(value)
from air_quality
where measureid = 87
and reportyear >= 2008
and reportyear <= 2013
group by measureid, reportyear, statename, value
limit 5
I expect the output with DISTINCT statenames, i.e., I don't want one repeated. If California has the highest for one year, it will not be repeated again.
Currently it displays as "... California, 22 ; ... California, 22 ; ... California, 19 ; ... Arizona, 18 ; ... California, 18"
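One possible way to get each state at most once is to collapse the rows to one per state before taking the top five. The sketch below is only an illustration under stated assumptions: it runs the grouped query through Python's sqlite3 module rather than Hive, the sample rows are made up, and it assumes "highest" means each state's maximum value over the whole 2008 to 2013 range.

```python
import sqlite3

# Hypothetical sample rows (measureid, reportyear, statename, value); not real data.
ROWS = [
    (87, 2008, "California", 22),
    (87, 2009, "California", 19),
    (87, 2010, "California", 18),
    (87, 2010, "Arizona", 18),
    (87, 2011, "Texas", 15),
    (87, 2012, "Nevada", 12),
    (87, 2013, "Utah", 11),
    (87, 2013, "Oregon", 9),
    (87, 2007, "Ohio", 40),   # outside the year range, must be ignored
    (12, 2010, "Idaho", 99),  # different measureid, must be ignored
]

# Grouping by statename only (not by value) yields exactly one row per state.
TOP_STATES_SQL = """
    SELECT statename, MAX(value) AS max_value
    FROM air_quality
    WHERE measureid = 87
      AND reportyear BETWEEN 2008 AND 2013
    GROUP BY statename
    ORDER BY max_value DESC, statename
    LIMIT 5
"""


def top_states():
    """Return up to five (statename, max_value) pairs, one per state."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE air_quality "
        "(measureid INTEGER, reportyear INTEGER, statename TEXT, value INTEGER)"
    )
    conn.executemany("INSERT INTO air_quality VALUES (?, ?, ?, ?)", ROWS)
    result = conn.execute(TOP_STATES_SQL).fetchall()
    conn.close()
    return result


print(top_states())
```

The key difference from the query in the question is that value is no longer in the GROUP BY list, and an explicit ORDER BY decides which five rows LIMIT keeps.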

Related

Creating rows for 5 higher and lower entries with closest matching values in same table in SQL

I'm very new to SQL and trying to structure a Java database query to pass in a row identifier code, return the values of all columns in that row, and the 5 closest higher and lower rows to a value in one of the original columns. I can find previous questions using a passed in fixed value, but don't know how to approach it when the value exists in the table.
This is my attempt so far:
SELECT * FROM (SELECT code, value FROM table1 t1 WHERE code = x) AS a
UNION ALL
SELECT * FROM (SELECT * from table1 t2 WHERE NOT code = x AND count <= t1.count order by count DESC LIMIT 5) AS b
UNION ALL
SELECT * FROM (SELECT * from table1 t3 WHERE NOT code = x AND count <= t1.count order by count ASC LIMIT 5) AS c
If anyone could point me in the right direction I would really appreciate it. Thanks
Example Table:
Code       Value
---------  -----
Australia  15
Mexico     22
Spain      36
Nigeria    87
Poland     55
Eritrea    17
Vietnam    26
Ireland    107
Sweden     55
Canada     26
For example, if I entered Australia as my code, I want to return that row and the closest 4:
Code       Value
---------  -----
Australia  15
Eritrea    17
Mexico     22
Vietnam    26
Canada     26
If there are no duplicates in the column Value:
SELECT *
FROM tablename
ORDER BY ABS(Value - (SELECT Value FROM tablename WHERE Code = 'Australia'))
LIMIT ?;
If there are duplicates:
SELECT *
FROM tablename
ORDER BY Code = 'Australia' DESC,
ABS(Value - (SELECT Value FROM tablename WHERE Code = 'Australia'))
LIMIT ?;
Change ? to the total number of rows returned (including 'Australia').
See the demo.
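As a rough check of the second query, the sketch below loads the example table into an in-memory SQLite database from Python and runs the same ORDER BY logic. The function name closest_rows and the use of SQLite are assumptions made for illustration; the table name tablename is taken from the answer.

```python
import sqlite3

# Rows copied from the example table in the question.
ROWS = [
    ("Australia", 15), ("Mexico", 22), ("Spain", 36), ("Nigeria", 87),
    ("Poland", 55), ("Eritrea", 17), ("Vietnam", 26), ("Ireland", 107),
    ("Sweden", 55), ("Canada", 26),
]

CLOSEST_SQL = """
    SELECT Code, Value
    FROM tablename
    ORDER BY Code = ? DESC,  -- the requested row always sorts first
             ABS(Value - (SELECT Value FROM tablename WHERE Code = ?))
    LIMIT ?
"""


def closest_rows(code, total):
    """Return the row for `code` followed by the rows with the nearest Value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tablename (Code TEXT PRIMARY KEY, Value INTEGER)")
    conn.executemany("INSERT INTO tablename VALUES (?, ?)", ROWS)
    rows = conn.execute(CLOSEST_SQL, (code, code, total)).fetchall()
    conn.close()
    return rows


for row in closest_rows("Australia", 5):
    print(row)
```

Rows at the same distance (Vietnam and Canada here) can come back in either order unless a further tie-breaker such as Code is added to the ORDER BY.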

SQL Create a new calculated column based on values of multi rows and cols

I have data about airline bookings in an Oracle database. The sample is structured as below:
Recordlocator: the booking code
Sequencenmbr: whenever a booking changes, its new status is recorded with a higher Sequencenmbr, so the highest Sequencenmbr in the database shows the latest/current status of a booking
Sequenceair: the sequence of flights in a booking; a booking may contain one or many flights
DepartAirport: the departure airport
ArrAirport: the arrival airport
So the question is: I would like to create a new Itinerary column that shows the full itinerary of the booking in every row, which is the combination of the DepartAirport of each row (in order of SequenceAir) and the ArrAirport of the last row. Could anyone help me with the SQL statement or give some links to read?
It has to be grouped by Recordlocator and Sequencenmbr and ordered by SequenceAir. It should look like this:
Recordlocator  Sequencenmbr  SequenceAir  DepartureDateTime  DepartAirport  ArrAirport  Itinerary
-------------  ------------  -----------  -----------------  -------------  ----------  -------------------
GQWYGM         32            1            25/11/18 16:40     RGN            SIN         RGN-SIN-JKT-SIN-RGN
GQWYGM         32            2            26/11/18 09:35     SIN            JKT         RGN-SIN-JKT-SIN-RGN
GQWYGM         32            3            29/11/18 06:50     JKT            SIN         RGN-SIN-JKT-SIN-RGN
GQWYGM         32            4            29/11/18 11:00     SIN            RGN         RGN-SIN-JKT-SIN-RGN
GQWYGM         33            1            25/11/18 16:40     RGN            SIN         RGN-SIN-MNL-SIN-RGN
GQWYGM         33            2            26/11/18 09:35     SIN            MNL         RGN-SIN-MNL-SIN-RGN
GQWYGM         33            3            29/11/18 06:50     MNL            SIN         RGN-SIN-MNL-SIN-RGN
GQWYGM         33            4            29/11/18 11:00     SIN            RGN         RGN-SIN-MNL-SIN-RGN
Many thanks
select Recordlocator, Sequencenmbr, Sequenceair, DepartAirport, ArrAirport, departureDateTime
     , LISTAGG(
         case
           when last_arrAirport = DepartAirport then ArrAirport -- removes duplicates when last arrival and current departure are the same
           else DepartAirport||'-'||ArrAirport
         end
       , '-')
       WITHIN GROUP (ORDER BY SequenceAir) -- order by
       OVER (PARTITION BY Recordlocator, Sequencenmbr) list -- group by
from (
  select taskStack.* -- all data from your table
       , lag(ArrAirport) over (PARTITION BY Recordlocator, Sequencenmbr -- group by
                               order by Recordlocator, Sequencenmbr, Sequenceair) last_arrAirport -- arrAirport from previous row
  from taskStack
)
order by Recordlocator, Sequencenmbr, Sequenceair
You should use LISTAGG to do this
LISTAGG(DepartAirport||'-'||ArrAirport,'-') WITHIN GROUP(ORDER BY SequenceAir)
in your select statement.
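To make the lag-plus-LISTAGG idea easier to follow, here is the same logic sketched in plain Python rather than Oracle SQL. The function name itineraries and the tuple layout are assumptions for illustration; the legs are copied from the sample table in the question.

```python
from itertools import groupby

# (Recordlocator, Sequencenmbr, SequenceAir, DepartAirport, ArrAirport)
LEGS = [
    ("GQWYGM", 32, 1, "RGN", "SIN"),
    ("GQWYGM", 32, 2, "SIN", "JKT"),
    ("GQWYGM", 32, 3, "JKT", "SIN"),
    ("GQWYGM", 32, 4, "SIN", "RGN"),
    ("GQWYGM", 33, 1, "RGN", "SIN"),
    ("GQWYGM", 33, 2, "SIN", "MNL"),
    ("GQWYGM", 33, 3, "MNL", "SIN"),
    ("GQWYGM", 33, 4, "SIN", "RGN"),
]


def itineraries(legs):
    """Map (Recordlocator, Sequencenmbr) to its itinerary string."""
    result = {}
    # Sorting plays the role of PARTITION BY ... ORDER BY SequenceAir.
    ordered = sorted(legs, key=lambda leg: (leg[0], leg[1], leg[2]))
    for key, group in groupby(ordered, key=lambda leg: (leg[0], leg[1])):
        parts = []
        last_arr = None  # plays the role of lag(ArrAirport)
        for _, _, _, depart, arr in group:
            if last_arr == depart:
                parts.append(arr)  # connecting leg: departure is already listed
            else:
                parts.append(depart + "-" + arr)
            last_arr = arr
        result[key] = "-".join(parts)
    return result


for booking, itinerary in itineraries(LEGS).items():
    print(booking, itinerary)
```

Concatenating DepartAirport||'-'||ArrAirport for every leg without the lag check would repeat each connecting airport, which is why the case expression is needed.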

GROUP BY one column, then GROUP BY another column

I have a database table t containing sales data:
ID  TYPE  AGE
--  ----  ---
1   B     20
1   BP    20
1   BP    20
1   P     20
2   B     30
2   BP    30
2   BP    30
3   P     40
If a person buys a bundle, the table contains the bundle sale (TYPE B) and the individual bundle products (TYPE BP), all with the same ID. So a bundle with 2 products appears 3 times (1x TYPE B and 2x TYPE BP) under the same ID.
A person can also buy any other product in that single sale (TYPE P), which also has the same ID.
I need to calculate the average/min/max age of the customers, but the multiple entries per sale distort the calculation.
The real average age is
(20 + 30 + 40) / 3 = 30
and not
(20+20+20+20 + 30+30+30 + 40) / 8 = 26.25
But I don't know how I can reduce the sales to a single row entry AND get the 4 needed values?
Do I need to GROUP BY twice (first by ID, then by AGE?) and if yes, how can I do it?
My code so far:
SELECT
AVERAGE(AGE)
, MIN(AGE)
, MAX(AGE)
, MEDIAN(AGE)
FROM t
but that does count every row.
Assuming the age is the same for all rows with the same ID (which in itself indicates a normalisation problem), you can use nested aggregation:
select avg(min(age)) from sales
group by id
AVG(MIN(AGE))
-------------
30
The example in the documentation is very similar; and is explained as:
This calculation evaluates the inner aggregate (MAX(salary)) for each group defined by the GROUP BY clause (department_id), and aggregates the results again.
So for your version:
This calculation evaluates the inner aggregate (MIN(age)) for each group defined by the GROUP BY clause (id), and aggregates the results again.
It doesn't really matter whether the inner aggregate is min or max - again, assuming they are all the same - it's just to get a single value per ID, which can then be averaged.
You can do the same for the other values in your original query:
select
avg(min(age)) as avg_age,
min(min(age)) as min_age,
max(min(age)) as max_age,
median(min(age)) as med_age
from sales
group by id;
AVG_AGE MIN_AGE MAX_AGE MED_AGE
------- ------- ------- -------
30 20 40 30
Or if you prefer, you could get the one-age-per-ID values once in a CTE or subquery and apply the second layer of aggregation to that:
select
avg(age) as avg_age,
min(age) as min_age,
max(age) as max_age,
median(age) as med_age
from (
select min(age) as age
from sales
group by id
);
which gets the same result.
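The subquery form is portable enough to try outside Oracle. The sketch below runs it against an in-memory SQLite database from Python, using the rows from the question. SQLite is an assumption here, chosen only so the example is self-contained; since it does not accept nested aggregates and typically has no MEDIAN function, the median is computed in Python from the per-ID ages.

```python
import sqlite3
from statistics import median

# (id, type, age) rows copied from the question.
ROWS = [
    (1, "B", 20), (1, "BP", 20), (1, "BP", 20), (1, "P", 20),
    (2, "B", 30), (2, "BP", 30), (2, "BP", 30),
    (3, "P", 40),
]


def age_stats():
    """Return (avg, min, max, median) of age, counting each sale ID once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, type TEXT, age INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", ROWS)
    # Inner query: one age per ID. Outer query: aggregate those ages.
    avg_age, min_age, max_age = conn.execute("""
        SELECT AVG(age), MIN(age), MAX(age)
        FROM (SELECT MIN(age) AS age FROM sales GROUP BY id)
    """).fetchone()
    per_id_ages = [row[0] for row in
                   conn.execute("SELECT MIN(age) FROM sales GROUP BY id")]
    conn.close()
    return avg_age, min_age, max_age, median(per_id_ages)


print(age_stats())
```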

Combine GROUP BY and LIKE SQL

My objective is to display states that have 20+ of the value in the 2nd column.
Currently I have been able to display states and the values, but I need to combine states that are similar and their values (e.g. VIC and Vic and vic should equal VIC 68).
I also only want to display states, not their values, but the values keep showing. I'm guessing it's done using LIKE combined with GROUP BY, but I can't figure out how.
My current SQL query:
SELECT DEPARTMENT.STATE, COUNT(ACADEMIC.DEPTNUM) FROM ACADEMIC
JOIN DEPARTMENT
ON DEPARTMENT.DEPTNUM=ACADEMIC.DEPTNUM
GROUP BY DEPARTMENT.STATE;
Output:
STATE COUNT(ACADEMIC.DEPTNUM)
----- -----------------------
NSW 82
7
QLD 21
VIC 14
vic 1
WA 42
Tas 1
SA 40
Qld 55
Vic 53
ACT 35
TAS 8
I have no idea how to do this, can anyone help?
SELECT DEPARTMENT.STATE, COUNT(ACADEMIC.DEPTNUM) FROM ACADEMIC
JOIN DEPARTMENT
ON DEPARTMENT.DEPTNUM=ACADEMIC.DEPTNUM
GROUP BY DEPARTMENT.STATE
HAVING COUNT(ACADEMIC.DEPTNUM) >= 20;
Use HAVING to return only rows where the count is 20+.
To take care of different case, do UPPER on all states:
SELECT UPPER(DEPARTMENT.STATE), COUNT(ACADEMIC.DEPTNUM) FROM ACADEMIC
JOIN DEPARTMENT
ON DEPARTMENT.DEPTNUM=ACADEMIC.DEPTNUM
GROUP BY UPPER(DEPARTMENT.STATE)
HAVING COUNT(ACADEMIC.DEPTNUM) >= 20;
I need to combine states that are similar and their values (e.g VIC and Vic and vic should equal VIC 68)
You need to use SUM and GROUP BY(UPPER/LOWER) on your sub-query or simply use UPPER/LOWER in the GROUP BY expression in your original query.
For example,
with data as (
select 'VIC' state, 14 cnt from dual union all
select 'vic' state, 1 cnt from dual union all
select 'Vic' state, 53 cnt from dual
)
select upper(state), sum(cnt) count
from data
group by upper(state);
UPP COUNT
--- ----------
VIC 68
Since you already have the sub-query that gives you the count, all you need is to use UPPER/LOWER in GROUP BY, so that the count treats differently cased spellings of a state as one:
SELECT UPPER(DEPARTMENT.STATE) AS "STATE"
FROM ACADEMIC
JOIN DEPARTMENT
ON DEPARTMENT.DEPTNUM=ACADEMIC.DEPTNUM
GROUP BY UPPER(DEPARTMENT.STATE)
HAVING COUNT(ACADEMIC.DEPTNUM) >= 20;
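A small way to convince yourself that GROUP BY UPPER(...) with HAVING behaves as described is to try it on a toy table. The sketch below uses SQLite from Python and a single hypothetical department table instead of the ACADEMIC/DEPARTMENT join; the per-spelling counts for VIC and TAS are the ones shown in the question's output.

```python
import sqlite3

# Number of rows per spelling, taken from the question's output.
SPELLING_COUNTS = {"VIC": 14, "vic": 1, "Vic": 53, "TAS": 8, "Tas": 1}


def states_with_at_least(threshold):
    """Return upper-cased states whose combined row count is >= threshold."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical single-table stand-in for the joined tables.
    conn.execute("CREATE TABLE department (state TEXT)")
    for state, count in SPELLING_COUNTS.items():
        conn.executemany("INSERT INTO department VALUES (?)", [(state,)] * count)
    rows = conn.execute("""
        SELECT UPPER(state)
        FROM department
        GROUP BY UPPER(state)      -- VIC, Vic and vic fall into one group
        HAVING COUNT(*) >= ?       -- filter on the combined count
        ORDER BY 1
    """, (threshold,)).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(states_with_at_least(20))
```

Because the count appears only in HAVING and not in the SELECT list, the result contains the state names alone.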

Fill Users table with data using percentages from another table

I have a Table Users (it has millions of rows)
Id Name Country Product
+----+---------------+---------------+--------------+
1 John Canada
2 Kate Argentina
3 Mark China
4 Max Canada
5 Sam Argentina
6 Stacy China
...
1000 Ken Canada
I want to fill the Product column with A, B or C based on percentages.
I have another table called CountriesStats like the following
Id Country A B C
+-----+---------------+--------------+-------------+----------+
1 Canada 60 20 20
2 Argentina 35 45 20
3 China 40 10 50
This table holds the percentage of people with each product. For example in Canada 60% of people have product A, 20% have product B and 20% have product C.
I would like to fill the Users table with data based on the percentages in the second table. So for example, if there are 1 million users in Canada, I would like to fill 600,000 rows of the Product column in the Users table with A, 200,000 with B and 200,000 with C.
Thanks for any help on how to do that. I do not mind doing it in multiple steps; I just need hints on how I can achieve that in SQL.
The logic behind this is not too difficult. Assign a sequential counter to each person in each country, then use that counter to assign the product. For instance, in your example with 1 million users in Canada, when the counter is less than or equal to 600,000, 'A' gets assigned; for 600,001 to 800,000, 'B'; and 'C' for the rest.
The following SQL accomplishes this:
with toupdate as (
select u.*,
row_number() over (partition by country order by newid()) as seqnum,
count(*) over (partition by country) as tot
from users u
)
update u
set product = (case when seqnum <= tot * A / 100 then 'A'
when seqnum <= tot * (A + B) / 100 then 'B'
else 'C'
end)
from toupdate u join
CountriesStats cs
on u.country = cs.country;
The with clause defines an updatable subquery with the sequence number and total for each country, on each row. This is a nice feature of SQL Server, but is not supported in all databases.
The from clause joins back to the CountriesStats table to get the needed values for each country, and the case expression does the necessary logic.
Note that the sequential number is assigned randomly, using newid(), so the products should be assigned randomly through the initial table.
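The thresholds in the case expression can be checked on a small data set. The following is a minimal sketch of the same counter-and-threshold logic in plain Python, not the SQL Server statement itself; the function name assign_products and the data layout are assumptions for illustration, and random.shuffle stands in for ordering by newid().

```python
import random


def assign_products(users, stats, rng=random):
    """Assign 'A', 'B' or 'C' to every user according to country percentages.

    users: list of (user_id, country)
    stats: dict mapping country -> (pct_a, pct_b, pct_c)
    """
    by_country = {}
    for user_id, country in users:
        by_country.setdefault(country, []).append(user_id)

    assigned = {}
    for country, ids in by_country.items():
        pct_a, pct_b, _ = stats[country]
        rng.shuffle(ids)   # random order within the country
        tot = len(ids)     # count(*) over (partition by country)
        for seqnum, user_id in enumerate(ids, start=1):
            # Same comparisons as the case expression, written without division.
            if seqnum * 100 <= tot * pct_a:
                assigned[user_id] = "A"
            elif seqnum * 100 <= tot * (pct_a + pct_b):
                assigned[user_id] = "B"
            else:
                assigned[user_id] = "C"
    return assigned


# Hypothetical users: 10 in Canada (ids 0-9) and 10 in China (ids 100-109).
users = [(i, "Canada") for i in range(10)] + [(100 + i, "China") for i in range(10)]
stats = {"Canada": (60, 20, 20), "China": (40, 10, 50)}
print(assign_products(users, stats))
```

When a country's user count does not divide evenly by the percentages, the leftover users fall into the later products, so the realised split is only approximately the requested one.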