Hive - Is there a way to further optimize a HiveQL query?

I have written a query to find the 10 busiest airports in the USA from March to April. It produces the desired output, but I would like to optimize it further.
Are there any HiveQL-specific optimizations that can be applied to the query? Is GROUPING SETS applicable here? I'm new to Hive, and for now this is the shortest query I've come up with.
SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
SELECT Origin AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
UNION ALL
SELECT Dest AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;
The table columns are as follows:
Airports
|iata|airport|city|state|country|
Flights_stats
|originAirport|destAirport|FlightsNum|Cancelled|Month|

Filter by airport (inner join) and aggregate before the UNION ALL to reduce the dataset passed to the final aggregation reducer. The UNION ALL subqueries with joins should run in parallel and finish faster than a join against the bigger dataset after the UNION ALL.
SELECT f.airport, SUM(cnt) AS Total_Flights
FROM (
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Origin=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
UNION ALL
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Dest=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
) f
GROUP BY f.airport
ORDER BY Total_Flights DESC
LIMIT 10
;
Tune mapjoins and enable parallel execution:
set hive.exec.parallel=true;
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory
Use Tez and vectorization, and tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344
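To check that aggregating before the UNION ALL returns the same counts as the original query, here is a minimal sketch on toy data. It uses Python's sqlite3 module as a stand-in for Hive (an assumption, so it says nothing about Hive performance), and the sample rows are invented for illustration.

```python
import sqlite3

# Toy data, invented for illustration; sqlite3 stands in for Hive here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE airports (iata TEXT, airport TEXT, country TEXT);
CREATE TABLE flights_stats (Origin TEXT, Dest TEXT, FlightsNum INTEGER,
                            Cancelled INTEGER, Month INTEGER);
INSERT INTO airports VALUES ('JFK', 'John F Kennedy Intl', 'USA'),
                            ('LAX', 'Los Angeles Intl', 'USA'),
                            ('YYZ', 'Toronto Pearson', 'Canada');
INSERT INTO flights_stats VALUES ('JFK', 'LAX', 1, 0, 3),
                                 ('LAX', 'JFK', 2, 0, 4),
                                 ('JFK', 'YYZ', 3, 0, 3),
                                 ('LAX', 'JFK', 4, 1, 3),
                                 ('JFK', 'LAX', 5, 0, 5);
""")

# Original shape: UNION ALL first, then join and aggregate once.
original = conn.execute("""
SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
    SELECT Origin AS Airport, FlightsNum FROM flights_stats
    WHERE Cancelled = 0 AND Month IN (3, 4)
    UNION ALL
    SELECT Dest AS Airport, FlightsNum FROM flights_stats
    WHERE Cancelled = 0 AND Month IN (3, 4)
) Flights
INNER JOIN airports
    ON Flights.Airport = airports.iata AND airports.country = 'USA'
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10
""").fetchall()

# Rewritten shape: join and aggregate in each branch, then sum the counts.
rewritten = conn.execute("""
SELECT t.airport, SUM(t.cnt) AS Total_Flights
FROM (
    SELECT a.airport, COUNT(*) AS cnt
    FROM flights_stats f
    INNER JOIN airports a ON f.Origin = a.iata AND a.country = 'USA'
    WHERE f.Cancelled = 0 AND f.Month IN (3, 4)
    GROUP BY a.airport
    UNION ALL
    SELECT a.airport, COUNT(*) AS cnt
    FROM flights_stats f
    INNER JOIN airports a ON f.Dest = a.iata AND a.country = 'USA'
    WHERE f.Cancelled = 0 AND f.Month IN (3, 4)
    GROUP BY a.airport
) t
GROUP BY t.airport
ORDER BY Total_Flights DESC
LIMIT 10
""").fetchall()

print(original)
print(rewritten)
```

Both queries should print the same rows. Whether the rewritten form is actually faster on your cluster still has to be measured in Hive itself.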

It might help if you do the aggregation before the union all:
SELECT a.airport, SUM(cnt) AS Total_Flights
FROM ((SELECT Origin AS Airport, COUNT(*) as cnt
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
GROUP BY Origin
) UNION ALL
(SELECT Dest AS Airport, COUNT(*) as cnt
FROM flights_stats
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY Dest
)
) f INNER JOIN
airports a
ON f.Airport = a.iata AND a.country = 'USA'
GROUP BY a.airport
ORDER BY Total_Flights DESC
LIMIT 10;

I don't think GROUPING SETS are applicable here because you are only grouping by one field.
From Apache Wiki:
"The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set."

You can test the following, but this may be a case where a UNION performs better, so you really need to test it and report back:
SELECT airports.airport,
SUM(
CASE
WHEN T1.FlightsNum IS NOT NULL THEN 1
WHEN T2.FlightsNum IS NOT NULL THEN 1
ELSE 0
END
) AS Total_Flights
FROM airports
LEFT JOIN (SELECT Origin AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))) t1
on t1.Airport = airports.iata
LEFT JOIN (SELECT Dest AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))) t2
on t2.Airport = airports.iata
GROUP BY airports.airport
ORDER BY Total_Flights DESC

Related

SQL - Group values by range

I have following query:
SELECT
polutionmm2 AS metric,
sum(cnt) as value
FROM polutiondistributionstatistic as p inner join crates as c on p.crateid = c.id
WHERE
c.name = '154'
and to_timestamp(startts) >= '2021/01/20 00:00:00' group by polutionmm2
this query returns these values:
"metric","value"
50,580
100,8262
150,1548
200,6358
250,869
300,3780
350,505
400,2248
450,318
500,1674
550,312
600,7420
650,1304
700,2445
750,486
800,985
850,139
900,661
950,99
1000,550
I need to edit the query so that it groups the rows together in ranges of 100, starting from 0. Everything with a metric value between 0 and 99 should become one row whose value is the sum of those rows, like this:
"metric","value"
0,580
100,9810
200,7227
300,4285
400,2566
500,1986
600,8724
700,2931
800,1124
900,760
1000,550
The query will run over about 500,000 rows. Can this be done in a query? Is it efficient?
EDIT:
there can be up to 500 ranges, so an automatic way of grouping them would be great.

You can use generate_series() and a range type to generate the ranges you want, e.g.:
select int4range(x.start, case when x.start = 1000 then null else x.start + 100 end, '[)') as range
from generate_series(0,1000,100) as x(start)
This generates the ranges [0,100), [100,200) and so on up until [1000,).
You can adjust the width and the number of ranges by using different parameters for generate_series() and adjusting the expression that evaluates the last range.
This can be used in an outer join to aggregate the values per range:
with ranges as (
select int4range(x.start, case when x.start = 1000 then null else x.start + 100 end, '[)') as range
from generate_series(0,1000,100) as x(start)
)
select r.range as metric,
sum(t.value)
from ranges r
left join the_table t on r.range @> t.metric
group by range;
The expression r.range @> t.metric tests if the metric value falls into the (generated) range.

You can create a pseudo table with the intervals you like and join against it.
I'll use a recursive CTE for this case.
WITH RECURSIVE cte AS(
select 0 St, 99 Ed
UNION ALL
select St + 100, Ed + 100 from cte where St <= 1000
)
select cte.st as metric,sum(tb.value) as value from cte
inner join [tableName] tb --with OP query result
on tb.metric between cte.St and cte.Ed
group by cte.st
order by st
here is DB<>fiddle with some pseudo data.
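As a runnable check of the recursive CTE approach, here is a minimal sketch using Python's sqlite3 module (an assumption; the question does not say which database is used beyond the functions in its query). The table name the_table is hypothetical and holds the (metric, value) rows from the question's output.

```python
import sqlite3

# (metric, value) rows copied from the query output shown in the question.
rows = [(50, 580), (100, 8262), (150, 1548), (200, 6358), (250, 869),
        (300, 3780), (350, 505), (400, 2248), (450, 318), (500, 1674),
        (550, 312), (600, 7420), (650, 1304), (700, 2445), (750, 486),
        (800, 985), (850, 139), (900, 661), (950, 99), (1000, 550)]

conn = sqlite3.connect(":memory:")
# the_table is a hypothetical name standing in for the original query result.
conn.execute("CREATE TABLE the_table (metric INTEGER, value INTEGER)")
conn.executemany("INSERT INTO the_table VALUES (?, ?)", rows)

# Generate the buckets [0, 99], [100, 199], ..., [1000, 1099] and join on them.
buckets = conn.execute("""
WITH RECURSIVE cte(st, ed) AS (
    SELECT 0, 99
    UNION ALL
    SELECT st + 100, ed + 100 FROM cte WHERE st < 1000
)
SELECT cte.st AS metric, SUM(t.value) AS value
FROM cte
JOIN the_table t ON t.metric BETWEEN cte.st AND cte.ed
GROUP BY cte.st
ORDER BY cte.st
""").fetchall()

for metric, value in buckets:
    print(metric, value)
```

Changing the step (100) and the upper bound (1000) in the CTE adjusts the width and number of buckets without touching the rest of the query.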

use conditional aggregation
SELECT
case when polutionmm2>=0 and polutionmm2<100 then '100'
when polutionmm2>=100 and polutionmm2<200 then '200'
........
when polutionmm2>=900 and polutionmm2<1000 then '1000'
end AS metric,
sum(cnt) as value
FROM polutiondistributionstatistic as p inner join crates as c on p.crateid = c.id
WHERE
c.name = '154'
and to_timestamp(startts) >= '2021/01/20 00:00:00'
group by case when polutionmm2>=0 and polutionmm2<100 then '100'
when polutionmm2>=100 and polutionmm2<200 then '200'
........
when polutionmm2>=900 and polutionmm2<1000 then '1000'
end
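With up to 500 ranges, writing every CASE branch by hand gets unwieldy. A different technique, not the one shown above, is to compute the bucket with integer division. The sketch below assumes the metric is a non-negative integer column and uses Python's sqlite3 module with a hypothetical table the_table and a few rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# the_table is a hypothetical name; the rows are a subset of the question's data.
conn.execute("CREATE TABLE the_table (metric INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?)",
    [(50, 580), (100, 8262), (150, 1548), (1000, 550)],
)

# Integer division drops the remainder, so (metric / 100) * 100 is the
# lower bound of the 100-wide bucket that the metric falls into.
by_division = conn.execute("""
SELECT (metric / 100) * 100 AS bucket, SUM(value) AS value
FROM the_table
GROUP BY bucket
ORDER BY bucket
""").fetchall()

print(by_division)
```

This labels each bucket by its lower bound (0, 100, 200, ...), which matches the output format asked for in the question. If the column is not an integer type, the division has to be floored explicitly.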

Use of MAX function in SQL query to filter data

The code below joins two tables, and I need to extract only the latest date per account, although the tables hold multiple accounts and history records. I wanted to use the MAX function, but I'm not sure how to incorporate it in this case. I am using My SQL server.
Appreciate any help !
select
PROP.FileName,PROP.InsName, PROP.Status,
PROP.FileTime, PROP.SubmissionNo, PROP.PolNo,
PROP.EffDate,PROP.ExpDate, PROP.Region,
PROP.Underwriter, PROP_DATA.Data , PROP_DATA.Label
from
Property.dbo.PROP
inner join
Property.dbo.PROP_DATA on Property.dbo.PROP.FileID = Actuarial.dbo.PROP_DATA.FileID
where
(PROP_DATA.Label in ('Occupancy' , 'OccupancyTIV'))
and (PROP.EffDate >= '42278' and PROP.EffDate <= '42643')
and (PROP.Status = 'Bound')
and (Prop.FileTime = Max(Prop.FileTime))
order by
PROP.EffDate DESC

Assuming your DBMS supports windowing functions and the with clause, a max windowing function would work:
with all_data as (
select
PROP.FileName,PROP.InsName, PROP.Status,
PROP.FileTime, PROP.SubmissionNo, PROP.PolNo,
PROP.EffDate,PROP.ExpDate, PROP.Region,
PROP.Underwriter, PROP_DATA.Data , PROP_DATA.Label,
max (PROP.EffDate) over (partition by PROP.PolNo) as max_date
from Actuarial.dbo.PROP
inner join Actuarial.dbo.PROP_DATA
on Actuarial.dbo.PROP.FileID = Actuarial.dbo.PROP_DATA.FileID
where (PROP_DATA.Label in ('Occupancy' , 'OccupancyTIV'))
and (PROP.EffDate >= '42278' and PROP.EffDate <= '42643')
and (PROP.Status = 'Bound')
)
select
FileName, InsName, Status, FileTime, SubmissionNo,
PolNo, EffDate, ExpDate, Region, UnderWriter, Data, Label
from all_data
where EffDate = max_date
ORDER BY EffDate DESC
This also presupposes that any given account would not have two records on the same EffDate. If that can happen, and there is no other objective means to determine the latest record, you could also use row_number to pick a somewhat arbitrary record in the case of a tie.
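The row_number variant for ties could look like the sketch below. It runs on Python's sqlite3 module, which needs SQLite 3.25 or newer for window functions (an assumption about your environment). The prop table, its rows, and the use of FileTime as a tie-breaker are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for the PROP table.
conn.executescript("""
CREATE TABLE prop (PolNo TEXT, FileID INTEGER, EffDate INTEGER, FileTime TEXT);
INSERT INTO prop VALUES
    ('P1', 1, 42300, '2016-01-01'),
    ('P1', 2, 42400, '2016-02-01'),
    ('P1', 3, 42400, '2016-03-01'),
    ('P2', 4, 42500, '2016-01-15');
""")

# Rank rows per policy: latest EffDate first, and on a tie the latest
# FileTime (an assumed tie-breaker) wins. Keep only rank 1.
latest = conn.execute("""
WITH ranked AS (
    SELECT PolNo, FileID, EffDate,
           ROW_NUMBER() OVER (PARTITION BY PolNo
                              ORDER BY EffDate DESC, FileTime DESC) AS rn
    FROM prop
)
SELECT PolNo, FileID, EffDate
FROM ranked
WHERE rn = 1
ORDER BY PolNo
""").fetchall()

print(latest)
```

Policy P1 has two rows on the same EffDate, and exactly one of them survives because row_number never assigns the same rank twice within a partition.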

Using straight SQL, you can use a self-join in a subquery in your where clause to eliminate values smaller than the max, or smaller than the top n largest, and so on. Just set the number in <= 1 to the number of top values you want per group.
Something like the following might do the trick, for example:
select
p.FileName
, p.InsName
, p.Status
, p.FileTime
, p.SubmissionNo
, p.PolNo
, p.EffDate
, p.ExpDate
, p.Region
, p.Underwriter
, pd.Data
, pd.Label
from Actuarial.dbo.PROP p
inner join Actuarial.dbo.PROP_DATA pd
on p.FileID = pd.FileID
where (
select count(*)
from Actuarial.dbo.PROP p2
where p2.FileID = p.FileID
and p2.EffDate >= p.EffDate
) <= 1
and (
pd.Label in ('Occupancy' , 'OccupancyTIV')
and p.Status = 'Bound'
)
ORDER BY p.EffDate DESC
Have a look at this stackoverflow question for a full working example.

Not tested
with temp1 as
(
select foo
from bar
where xy = (select MAX(xy) from bar)
)
select PROP.FileName,PROP.InsName, PROP.Status,
PROP.FileTime, PROP.SubmissionNo, PROP.PolNo,
PROP.EffDate,PROP.ExpDate, PROP.Region,
PROP.Underwriter, PROP_DATA.Data , PROP_DATA.Label
from Actuarial.dbo.PROP
inner join temp1 t
on Actuarial.dbo.PROP.FileID = t.dbo.PROP_DATA.FileID
ORDER BY PROP.EffDate DESC

ROW_NUMBER() Query Plan SORT Optimization

The query below accesses the Votes table that contains over 30 million rows. The result set is then selected from using WHERE n = 1. In the query plan, the SORT operation in the ROW_NUMBER() windowed function is 95% of the query's cost and it is taking over 6 minutes to complete execution.
I already have an index on same_voter, eid, country include vid, nid, sid, vote, time_stamp, new to cover the where clause.
Is the most efficient way to correct this to add an index on vid, nid, sid, new DESC, time_stamp DESC or is there an alternative to using the ROW_NUMBER() function for this to achieve the same results in a more efficient manner?
SELECT v.vid, v.nid, v.sid, v.vote, v.time_stamp, v.new, v.eid,
ROW_NUMBER() OVER (
PARTITION BY v.vid, v.nid, v.sid ORDER BY v.new DESC, v.time_stamp DESC) AS n
FROM dbo.Votes v
WHERE v.same_voter <> 1
AND v.eid <= @EId
AND v.eid > (@EId - 5)
AND v.country = @Country

One possible alternative to using ROW_NUMBER():
SELECT
V.vid,
V.nid,
V.sid,
V.vote,
V.time_stamp,
V.new,
V.eid
FROM
dbo.Votes V
LEFT OUTER JOIN dbo.Votes V2 ON
V2.vid = V.vid AND
V2.nid = V.nid AND
V2.sid = V.sid AND
V2.same_voter <> 1 AND
V2.eid <= @EId AND
V2.eid > (@EId - 5) AND
V2.country = @Country AND
(V2.new > V.new OR (V2.new = V.new AND V2.time_stamp > V.time_stamp))
WHERE
V.same_voter <> 1 AND
V.eid <= @EId AND
V.eid > (@EId - 5) AND
V.country = @Country AND
V2.vid IS NULL
The query basically says: get all rows matching your criteria, then join to any other rows that match the same criteria but would rank higher within the partition based on the new and time_stamp columns. If no such row is found, V2.vid is NULL, so the current row must be the one you want (it ranks highest). I'm assuming that vid can otherwise never be NULL. If it's a nullable column in your table, you'll need to adjust the last line of the query.
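To see the anti-join pick the top-ranked row per partition, here is a minimal sketch on toy data using Python's sqlite3 module (an assumption; the question targets SQL Server). The rows are invented, the filter on same_voter, eid and country is left out for brevity, and the new column is renamed is_new in this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down Votes table; is_new stands in for the "new" column.
conn.executescript("""
CREATE TABLE votes (vid INTEGER, nid INTEGER, sid INTEGER, vote INTEGER,
                    time_stamp INTEGER, is_new INTEGER);
INSERT INTO votes VALUES
    (1, 1, 1, 5, 100, 0),
    (1, 1, 1, 3, 200, 0),
    (1, 1, 1, 4, 150, 1),
    (2, 1, 1, 2, 300, 0);
""")

# A row is kept only when no other row in the same (vid, nid, sid) group
# ranks higher by (is_new DESC, time_stamp DESC).
top_rows = conn.execute("""
SELECT v.vid, v.nid, v.sid, v.vote, v.time_stamp, v.is_new
FROM votes v
LEFT JOIN votes v2
    ON v2.vid = v.vid AND v2.nid = v.nid AND v2.sid = v.sid
   AND (v2.is_new > v.is_new
        OR (v2.is_new = v.is_new AND v2.time_stamp > v.time_stamp))
WHERE v2.vid IS NULL
ORDER BY v.vid
""").fetchall()

print(top_rows)
```

For vid 1 the row with is_new = 1 wins even though another row has a later time_stamp, which is the same ordering the ROW_NUMBER() version uses. Note that exact ties on both columns would return more than one row per group, unlike ROW_NUMBER().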

Fastest way to check if the most recent result for a patient has a certain value

Mssql < 2005
I have a complex database with lots of tables, but for now only the patient table and the measurements table matter.
What I need is the number of patients where the most recent value of 'code' matches a certain value. Also, datemeasurement has to be after '2012-04-01'. I have solved this in two different ways:
SELECT
COUNT(P.patid)
FROM T_Patients P
WHERE P.patid IN (SELECT patid
FROM T_Measurements M WHERE (M.code ='xxxx' AND result= 'xx')
AND datemeasurement =
(SELECT MAX(datemeasurement) FROM T_Measurements
WHERE datemeasurement > '2012-01-04' AND patid = M.patid
GROUP BY patid)
GROUP by patid)
AND:
SELECT
COUNT(P.patid)
FROM T_Patient P
WHERE 1 = (SELECT TOP 1 case when result = 'xx' then 1 else 0 end
FROM T_Measurements M
WHERE (M.code ='xxxx') AND datemeasurement > '2012-01-04' AND patid = P.patid
ORDER by datemeasurement DESC
)
This works just fine, but it makes the query incredibly slow because it has to join the outer table on the subquery (if you know what I mean). The query takes 10 seconds without the most recent check, and 3 minutes with the most recent check.
I'm pretty sure this can be done a lot more efficiently, so please enlighten me if you will :).
I tried implementing HAVING datemeasurement=MAX(datemeasurement) but that keeps throwing errors at me.

So my approach would be to write a query that gets each patient's last result since 01-04-2012, and then filters that for your codes and results. Something like:
select
count(1)
from
T_Measurements M
inner join (
SELECT PATID, MAX(datemeasurement) as lastMeasuredDate from
T_Measurements M
where datemeasurement > '01-04-2012'
group by patID
) lastMeasurements
on lastMeasurements.lastmeasuredDate = M.datemeasurement
and lastMeasurements.PatID = M.PatID
where
M.Code = 'Xxxx' and M.result = 'XX'
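Here is a small runnable version of that idea, using Python's sqlite3 module in place of SQL Server (an assumption). The measurement rows are invented, and the dates are stored as ISO-8601 text so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented sample rows; dates are ISO-8601 strings.
conn.executescript("""
CREATE TABLE T_Measurements (patid INTEGER, code TEXT, result TEXT,
                             datemeasurement TEXT);
INSERT INTO T_Measurements VALUES
    (1, 'xxxx', 'xx', '2012-05-01'),
    (1, 'xxxx', 'yy', '2012-06-01'),
    (2, 'xxxx', 'yy', '2012-05-01'),
    (2, 'xxxx', 'xx', '2012-07-01'),
    (3, 'xxxx', 'xx', '2011-12-01');
""")

# Step 1 (derived table): latest measurement date per patient after the cutoff.
# Step 2 (outer query): keep only those latest rows that match code and result.
matching_patients = conn.execute("""
SELECT COUNT(1)
FROM T_Measurements M
INNER JOIN (
    SELECT patid, MAX(datemeasurement) AS lastMeasuredDate
    FROM T_Measurements
    WHERE datemeasurement > '2012-04-01'
    GROUP BY patid
) lastMeasurements
    ON lastMeasurements.lastMeasuredDate = M.datemeasurement
   AND lastMeasurements.patid = M.patid
WHERE M.code = 'xxxx' AND M.result = 'xx'
""").fetchone()[0]

print(matching_patients)
```

Only patient 2 is counted: patient 1's latest result is 'yy', and patient 3 has no measurement after the cutoff. The derived table is computed once instead of once per outer row, which is the point of this rewrite.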

The fastest way may be to use row_number():
SELECT COUNT(m.patid)
from (select m.*,
ROW_NUMBER() over (partition by patid order by datemeasurement desc) as seqnum
FROM T_Measurements m
where datemeasurement > '2012-01-04'
) m
where seqnum = 1 and code = 'XXX' and result = 'xx'
Row_number() enumerates the records for each patient, so the most recent gets a value of 1. The result is just a selection.

Select Statement Return 0 if Null

I have the following query
SELECT ProgramDate, [CountVal]= COUNT(ProgramDate)
FROM ProgramsTbl
WHERE (Type = 'Type1' AND ProgramDate = '10/18/11' )
GROUP BY ProgramDate
What happens is that if there is no record that matches the Type and ProgramDate, I do not get any records returned.
What I would like to get in that case is something like the following. Notice that CountVal is 0 even though no records match the condition:
ProgramDate CountVal
10/18/11 0

This is a little more complicated than you might like; however, it is very possible. You will first have to create a temporary table of dates. For example, the query below creates a range of dates from 2011-10-11 to 2011-10-20.
CREATE TEMPORARY TABLE date_stamps AS
SELECT (date '2011-10-10' + new_number) AS date_stamp
FROM generate_series(1, 10) AS new_number;
Using this temporary table, you can select from it and left join your table ProgramsTbl. For example
SELECT date_stamp, COUNT(ProgramDate)
FROM date_stamps
LEFT JOIN ProgramsTbl ON ProgramsTbl.ProgramDate = date_stamps.date_stamp
AND ProgramsTbl.Type = 'Type1' -- filter in ON so unmatched dates are kept
GROUP BY date_stamp;

Select ProgramDate, [CountVal]= SUM(occur)
from
(
SELECT ProgramDate, 1 occur
FROM ProgramsTbl
WHERE (Type = 'Type1' AND ProgramDate = '10/18/11' )
UNION ALL
SELECT '10/18/11', 0
) t
GROUP BY ProgramDate

Because each SELECT statement is really building a table of records, you can use a SELECT query to build a table with both the program count and a default count of zero. This requires two SELECT queries (one to get the actual count, one to get the default count) and a UNION to combine the two results into a single table.
From there you can SELECT from the UNIONed table to sum the CountVals. If the programDate occurs in the ProgramTable, the sum is the CountVal of the first query (>0) plus the CountVal of the second query (=0).
This way, even if there are no records for the desired programDate in ProgramTable, you will get a record back indicating a count of 0.
This would look like:
SELECT ProgramDate, SUM(CountVal)
FROM
(SELECT ProgramDate, COUNT(*) AS CountVal
FROM ProgramsTbl
WHERE (Type = 'Type1' AND ProgramDate = '10/18/11' )
GROUP BY ProgramDate
UNION
SELECT '10/18/11' AS ProgramDate, 0 AS CountVal) T1
GROUP BY ProgramDate

Here's a solution that works on SQL Server; not sure about other db platforms:
DECLARE @Type VARCHAR(5) = 'Type1'
, @ProgramDate DATE = '10/18/2011'
SELECT pt.ProgramDate
, COUNT(pt2.ProgramDate)
FROM ( SELECT @ProgramDate AS ProgramDate
, @Type AS Type
) pt
) pt
LEFT JOIN ProgramsTbl pt2 ON pt.Type = pt2.Type
AND pt.ProgramDate = pt2.ProgramDate
GROUP BY pt.ProgramDate
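The same pattern can be tried outside SQL Server. The sketch below uses Python's sqlite3 module (an assumption), with bound parameters playing the role of the two variables and two invented rows in ProgramsTbl.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented sample rows; dates are stored as ISO-8601 text.
conn.executescript("""
CREATE TABLE ProgramsTbl (Type TEXT, ProgramDate TEXT);
INSERT INTO ProgramsTbl VALUES ('Type1', '2011-10-17'),
                               ('Type2', '2011-10-18');
""")

# The one-row derived table guarantees a result row. COUNT over the
# right-hand column counts only real matches, so it is 0 when none exist.
query = """
SELECT pt.ProgramDate, COUNT(pt2.ProgramDate) AS CountVal
FROM (SELECT ? AS ProgramDate, ? AS Type) pt
LEFT JOIN ProgramsTbl pt2
    ON pt.Type = pt2.Type AND pt.ProgramDate = pt2.ProgramDate
GROUP BY pt.ProgramDate
"""

no_match = conn.execute(query, ("2011-10-18", "Type1")).fetchall()
one_match = conn.execute(query, ("2011-10-17", "Type1")).fetchall()

print(no_match)
print(one_match)
```

The key detail is that both conditions sit in the ON clause. Moving the Type filter into a WHERE clause would discard the NULL-extended row and bring back the original "no rows" behaviour.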

Grunge but simple and efficient
SELECT '10/18/11' as 'Program Date', count(*) as 'count'
FROM ProgramsTbl
WHERE Type = 'Type1' AND ProgramDate = '10/18/11'

Try something along these lines. This will establish a row with a date of 10/18/11 that will definitely return. Then you left join to your actual data to get your desired count (which can now return 0 if there are no corresponding rows).
To do this for more than 1 date, you'd want to build a Date table that holds a list of all dates you want to query (so substitute the "select '10/18/11'" with "select Date from DateTbl").
SELECT ProgDt.ProgDate, [CountVal]= COUNT(ProgramsTbl.ProgramDate)
FROM (SELECT '10/18/11' as 'ProgDate') ProgDt
LEFT JOIN ProgramsTbl
ON ProgDt.ProgDate = ProgramsTbl.ProgramDate
AND ProgramsTbl.Type = 'Type1'
GROUP BY ProgDt.ProgDate
To create a date table that you can use for querying, do this (assumes SQL Server 2005+):
create table Dates (MyDate datetime)
go
insert into Dates
select top 100000 row_number() over (order by s1.name)
from master..spt_values s1, master..spt_values s2
go
