SQL - Group values by range - sql

I have the following query:
SELECT
polutionmm2 AS metric,
sum(cnt) as value
FROM polutiondistributionstatistic as p inner join crates as c on p.crateid = c.id
WHERE
c.name = '154'
and to_timestamp(startts) >= '2021/01/20 00:00:00' group by polutionmm2
This query returns these values:
"metric","value"
50,580
100,8262
150,1548
200,6358
250,869
300,3780
350,505
400,2248
450,318
500,1674
550,312
600,7420
650,1304
700,2445
750,486
800,985
850,139
900,661
950,99
1000,550
I need to change the query so that it groups the rows together in ranges of 100, starting from 0. Everything with a metric value between 0 and 99 should become one row, whose value is the sum of those rows, like this:
"metric","value"
0,580
100,9810
200,7227
300,4285
400,2566
500,1986
600,8724
700,2931
800,1124
900,760
1000,550
The query will run over about 500,000 rows. Can this be done in a query? Is it efficient?
EDIT:
There can be up to 500 ranges, so an automatic way of grouping them would be great.

You can use generate_series() and a range type to generate the ranges you want, e.g.:
select int4range(x.start, case when x.start = 1000 then null else x.start + 100 end, '[)') as range
from generate_series(0,1000,100) as x(start)
This generates the ranges [0,100), [100,200) and so on up until [1000,).
You can adjust the width and the number of ranges by passing different parameters to generate_series() and adjusting the expression that evaluates the last range.
This can be used in an outer join to aggregate the values per range:
with ranges as (
select int4range(x.start, case when x.start = 1000 then null else x.start + 100 end, '[)') as range
from generate_series(0,1000,100) as x(start)
)
select r.range as metric,
sum(t.value)
from ranges r
left join the_table t on r.range @> t.metric
group by r.range;
The expression r.range @> t.metric tests whether the metric value falls into the (generated) range.
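As a quick sanity check of the containment test, here is a minimal PostgreSQL sketch. The column aliases (inside_range, at_upper_bound, open_ended) are made up for illustration; the bounds are the ones used in the answer above.

```sql
-- '[)' means lower bound inclusive, upper bound exclusive
SELECT int4range(0, 100, '[)')     @> 50   AS inside_range,    -- true: 0 <= 50 < 100
       int4range(0, 100, '[)')     @> 100  AS at_upper_bound,  -- false: 100 belongs to [100,200)
       int4range(1000, NULL, '[)') @> 5000 AS open_ended;      -- true: a NULL upper bound means unbounded
```

Because the upper bound is exclusive, a value such as 100 lands in exactly one generated range, so no row is counted twice.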

You can create a pseudo table with the intervals you like and join against it.
I'll use a recursive CTE for this case.
WITH RECURSIVE cte AS(
select 0 St, 99 Ed
UNION ALL
select St + 100, Ed + 100 from cte where St <= 1000
)
select cte.st as metric,sum(tb.value) as value from cte
inner join [tableName] tb --with OP query result
on tb.metric between cte.St and cte.Ed
group by cte.st
order by st

Use a CASE expression to label each range and group by it:
SELECT
case when polutionmm2>=0 and polutionmm2<100 then '100'
when polutionmm2>=100 and polutionmm2<200 then '200'
........
when polutionmm2>=900 and polutionmm2<1000 then '1000'
end AS metric,
sum(cnt) as value
FROM polutiondistributionstatistic as p inner join crates as c on p.crateid = c.id
WHERE
c.name = '154'
and to_timestamp(startts) >= '2021/01/20 00:00:00'
group by case when polutionmm2>=0 and polutionmm2<100 then '100'
when polutionmm2>=100 and polutionmm2<200 then '200'
........
when polutionmm2>=900 and polutionmm2<1000 then '1000'
end
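Since the question mentions up to 500 ranges, writing one CASE branch per range gets long. A different technique, not the one used in the answer above, is to compute the bucket arithmetically. The sketch below is PostgreSQL and assumes the metric is an integer column, so that `/` performs integer division. The CTE name sample and the output names bucket and total are hypothetical; the rows are the first four from the question.

```sql
WITH sample(metric, value) AS (
    VALUES (50, 580), (100, 8262), (150, 1548), (200, 6358)
)
SELECT (metric / 100) * 100 AS bucket,  -- integer division truncates: 150 / 100 = 1, so bucket = 100
       sum(value)           AS total
FROM sample
GROUP BY 1      -- group by the first output column (the bucket)
ORDER BY 1;
-- bucket 0   -> 580
-- bucket 100 -> 9810  (8262 + 1548)
-- bucket 200 -> 6358
```

Applied to the original query, the same expression would replace polutionmm2 in both the select list and the GROUP BY. For a non-integer column, floor(polutionmm2 / 100.0) * 100 would be needed instead.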

Related

Single-column row-set exists in another table or a function returns positive value

I have following table structure: http://sqlfiddle.com/#!4/952e7/1
Now I am looking for a solution for the following problem:
Given an input set of date-times (see below), the SQL statement should return all business IDs with a given business name for which every date-time in the input set is either present in the ORDERS table or satisfies an additional function (both conditions are checked separately for each input date-time).
An example how the input date-time dataset looks like:
WITH DATES_TO_CHECK(DATETIME) AS(SELECT DATE '2021-01-03' FROM DUAL UNION ALL SELECT DATE '2020-04-08' FROM DUAL UNION ALL SELECT DATE '2020-05-07' FROM DUAL)
To keep it simple, the "additional function" is a random-number check: true if the number is greater than 0.5, otherwise false, so the check is dbms_random.value > 0.5.
For one given date time it would look like:
SELECT BN.NAME, BD.ID
FROM BUSINESS_DATA BD, BUSINESS_NAME BN
WHERE BD.NAME_ID=BN.ID AND
BN.NAME='B1' AND
(TO_DATE('2021-01-03', 'YYYY-MM-DD') IN (SELECT OD.ORDERDATE FROM ORDERS OD WHERE OD.BUSINESS_ID=BD.ID)
OR dbms_random.value > 0.5)
ORDER BY BD.ID
Please help me apply this solution to the input date-time rowset above and the specified name.
I don't see any difference from the question you just deleted.
This is the list of businesses named B1 for which either the number of order dates matching the input dates equals the number of input dates, or dbms_random.value > 0.5.
WITH DATES_TO_CHECK(DATETIME) AS(
SELECT DATE '2021-01-03' FROM DUAL
UNION ALL SELECT DATE '2020-04-08' FROM DUAL
UNION ALL SELECT DATE '2020-05-07' FROM DUAL
),
businesses_that_match as (
select
od.BUSINESS_ID, count(distinct OD.ORDERDATE)
from DATES_TO_CHECK dtc
left join ORDERS od on OD.ORDERDATE = dtc.datetime
group by od.BUSINESS_ID
having count(distinct OD.ORDERDATE) = (select count(distinct DATETIME) from DATES_TO_CHECK)
)
SELECT
BN.NAME, BD.ID
FROM BUSINESS_DATA BD
inner join BUSINESS_NAME BN on BD.NAME_ID=BN.ID
left join businesses_that_match btm on btm.BUSINESS_ID = bd.id
where bn.name = 'B1'
AND (btm.BUSINESS_ID is not null
OR dbms_random.value > 0.5
)
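The query above applies the random check once per business. If the check has to be made separately for each input date, as the question describes, one possible formulation is a double NOT EXISTS: keep a business only when no input date fails both conditions. This is a sketch against the tables and columns named in the question (BUSINESS_DATA, BUSINESS_NAME, ORDERS), not a tested solution.

```sql
WITH DATES_TO_CHECK(DATETIME) AS (
    SELECT DATE '2021-01-03' FROM DUAL
    UNION ALL SELECT DATE '2020-04-08' FROM DUAL
    UNION ALL SELECT DATE '2020-05-07' FROM DUAL
)
SELECT BN.NAME, BD.ID
FROM BUSINESS_DATA BD
INNER JOIN BUSINESS_NAME BN ON BD.NAME_ID = BN.ID
WHERE BN.NAME = 'B1'
  AND NOT EXISTS (
        -- an input date that is neither ordered on nor passed by the function
        SELECT 1
        FROM DATES_TO_CHECK dtc
        WHERE NOT EXISTS (SELECT 1
                          FROM ORDERS od
                          WHERE od.BUSINESS_ID = BD.ID
                            AND od.ORDERDATE = dtc.DATETIME)
          AND NOT (dbms_random.value > 0.5)
      )
ORDER BY BD.ID;
```

Because dbms_random.value is non-deterministic, the result can differ from run to run; with a real function in its place the outcome would be stable.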

Use sql variables in query results

I have some of the following code:
Select p.CLIENT_NO,
s.CLIENT_NAME,
s.CLIENT_TYPE,
p.GL_CODE,
p.BATCH_KEY
From RU_POST p,
RU_ACCT a,
Ru_Ru s
Where
a.INTERNAL_KEY(+) = p.INTERNAL_KEY
And p.Batch_Key in
(Select Distinct (p1.BATCH_KEY)
From RU_POST p1
Where Abs(p1.AMOUNT) <> 0
And p1.POST_DATE Between To_Date('01-01-2015', 'dd-mm-yyyy') And
To_Date('01-01-2015', 'dd-mm-yyyy')
And p1.INTERNAL_KEY In ('367', '356'))
Now I want the values listed for p1.INTERNAL_KEY to appear in the query results, as if I had written SELECT p1.INTERNAL_KEY.
However, I understand this won't work as written. The result would show '367' for about 100 rows and '356' for another 100.
How can I put this condition value into my result?
Like this:
CLIENT_NO CLIENT_SHORT CLIENT_NAME GL_CODE INTERNAL_KEY
399999000 399999 A 4568 367
599999000 599999 B 4879 356
You can try changing the IN subquery to a join, like this:
select distinct
p.client_no
, s.client_name
, s.client_type
, p.gl_code
, p1.internal_key
from ru_post p
join ru_post p1 on p1.batch_key = p.batch_key
left join ru_acct a on a.internal_key = p.internal_key
cross join ru_ru s
where abs(p1.amount) <> 0
and p1.post_date between date '2015-01-01' and date '2015-01-01'
and p1.internal_key in ('367', '356');
(Edited to match the updated question: ru_post is now left-joined to ru_acct.)

Hive - Is there a way to further optimize a HiveQL query?

I have written a query to find the 10 busiest airports in the USA from March to April. It produces the desired output; however, I want to optimize it further.
Are there any HiveQL-specific optimizations that can be applied to the query? Is GROUPING SETS applicable here? I'm new to Hive, and for now this is the shortest query that I've come up with.
SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
SELECT Origin AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
UNION ALL
SELECT Dest AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;
The table columns are as follows:
Airports
|iata|airport|city|state|country|
Flights_stats
|originAirport|destAirport|FlightsNum|Cancelled|Month|
Filter by airport (inner join) and aggregate before the UNION ALL to reduce the dataset passed to the final aggregation reducer. The UNION ALL subqueries with joins should run in parallel and be faster than joining a bigger dataset after the UNION ALL.
SELECT f.airport, SUM(cnt) AS Total_Flights
FROM (
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Origin=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
UNION ALL
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Dest=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
) f
GROUP BY f.airport
ORDER BY Total_Flights DESC
LIMIT 10
;
Tune mapjoins and enable parallel execution:
set hive.exec.parallel=true;
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory
Use Tez and vectorizing, tune mappers and reducers parallelism: https://stackoverflow.com/a/48487306/2700344
It might help if you do the aggregation before the union all:
SELECT a.airport, SUM(cnt) AS Total_Flights
FROM ((SELECT Origin AS Airport, COUNT(*) as cnt
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
GROUP BY Origin
) UNION ALL
(SELECT Dest AS Airport, COUNT(*) as cnt
FROM flights_stats
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY Dest
)
) f INNER JOIN
airports a
ON f.Airport = a.iata AND a.country = 'USA'
GROUP BY a.airport
ORDER BY Total_Flights DESC
LIMIT 10;
I don't think GROUPING SETS are applicable here because you are only grouping by one field.
From Apache Wiki:
"The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set."
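You can test the variant below, which joins each side separately instead of using UNION ALL. Note that joining both subqueries to the same airport row multiplies origin rows by destination rows, so compare its counts against the UNION ALL version before relying on it; UNION ALL may well be the better fit here: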
SELECT airports.airport,
SUM(
CASE
WHEN T1.FlightsNum IS NOT NULL THEN 1
WHEN T2.FlightsNum IS NOT NULL THEN 1
ELSE 0
END
) AS Total_Flights
FROM airports
LEFT JOIN (SELECT Origin AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))) t1
on t1.Airport = airports.iata
LEFT JOIN (SELECT Dest AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))) t2
on t2.Airport = airports.iata
GROUP BY airports.airport
ORDER BY Total_Flights DESC
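Another option to measure, not taken from the answers above, is to read flights_stats only once and turn each flight into two rows with LATERAL VIEW explode. The sketch below assumes the Origin and Dest column names used in the question's query. The aliases e, x and code are hypothetical, and whether this beats the UNION ALL versions depends on the data and the execution engine.

```sql
SELECT a.airport, COUNT(*) AS Total_Flights
FROM (
    -- one scan of flights_stats; each flight yields two rows (origin and destination)
    SELECT x.code AS iata
    FROM flights_stats f
    LATERAL VIEW explode(array(f.Origin, f.Dest)) x AS code
    WHERE f.Cancelled = 0 AND f.Month IN (3, 4)
) e
INNER JOIN airports a ON e.iata = a.iata AND a.country = 'USA'
GROUP BY a.airport
ORDER BY Total_Flights DESC
LIMIT 10;
```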

PostgreSQL use case when result in where clause

I use a complex CASE WHEN expression to select values. I would like to use its result in the WHERE clause, but Postgres says column 'd' does not exist.
SELECT id, name, case when complex_with_subqueries_and_multiple_when END AS d
FROM table t WHERE d IS NOT NULL
LIMIT 100 OFFSET 100;
Then I thought I can use it like this:
select * from (
SELECT id, name, case when complex_with_subqueries_and_multiple_when END AS d
FROM table t
LIMIT 100 OFFSET 100) t
WHERE d IS NOT NULL;
But now I am not getting 100 rows as a result. I could probably (I am not sure) apply LIMIT and OFFSET outside the subquery containing the CASE expression (where the WHERE clause is), but I suspect (I am not sure why) this would be a performance hit.
The CASE returns an array or null. What is the best/fastest way to exclude rows where the result of the CASE expression is null? I need 100 rows (or fewer if not that many exist, of course). I am using Postgres 9.4.
Edited:
SELECT count(*) OVER() AS count, t.id, t.size, t.price, t.location, t.user_id, p.city, t.price_type, ht.value as houses_type_value, ST_X(t.coordinates) as x, ST_Y(t.coordinates) AS y,
CASE WHEN t.classification='public' THEN
ARRAY[(SELECT i.filename FROM table_images i WHERE i.table_id=t.id ORDER BY i.weight ASC LIMIT 1), t.description]
WHEN t.classification='protected' THEN
ARRAY[(SELECT i.filename FROM table_images i WHERE i.table_id=t.id ORDER BY i.weight ASC LIMIT 1), t.description]
WHEN t.id IN (SELECT rl.table_id FROM table_private_list rl WHERE rl.owner_id=t.user_id AND rl.user_id=41026) THEN
ARRAY[(SELECT i.filename FROM table_images i WHERE i.table_id=t.id ORDER BY i.weight ASC LIMIT 1), t.description]
ELSE null
END AS main_image_description
FROM table t LEFT JOIN table_modes m ON m.id = t.mode_id
LEFT JOIN table_types y ON y.id = t.type_id
LEFT JOIN post_codes p ON p.id = t.post_code_id
LEFT JOIN table_houses_types ht on ht.id = t.houses_type_id
WHERE datetime_sold IS NULL AND datetime_deleted IS NULL AND t.published=true AND coordinates IS NOT NULL AND coordinates && ST_MakeEnvelope(17.831490030182, 44.404640972306, 12.151558389557, 47.837396630872) AND main_image_description IS NOT NULL
GROUP BY t.id, m.value, y.value, p.city, ht.value ORDER BY t.id LIMIT 100 OFFSET 0
To use the CASE WHEN result in the WHERE clause you need to wrap it up in a subquery like you did, or in a view.
SELECT * FROM (
SELECT id, name, CASE
WHEN name = 'foo' THEN true
WHEN name = 'bar' THEN false
ELSE NULL
END AS c
FROM case_in_where
) t WHERE c IS NOT NULL
With a table containing 1, 'foo', 2, 'bar', 3, 'baz' this will return records 1 & 2. I don't know how long this SQL Fiddle will persist, but here is an example: http://sqlfiddle.com/#!15/1d3b4/3 . Also see https://stackoverflow.com/a/7950920/101151
Your query returns fewer than 100 rows when the 100 rows starting at offset 100 include records for which d evaluates to NULL. I don't know of a way to limit inside the subselect without rewriting your limiting logic (your CASE conditions) so that it works inside the WHERE clause:
WHERE ... AND (
t.classification='public' OR t.classification='protected'
OR t.id IN (SELECT rl.table_id ... rl.user_id=41026))
The result is written differently, and it may be annoying to keep the CASE logic in sync with the WHERE conditions, but it lets your LIMIT apply only to matching data.
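If duplicating the conditions is undesirable, another option is the one the question itself raises: keep the CASE in the subquery and apply LIMIT and OFFSET in the outer query, after the filter. The sketch below reuses the case_in_where example with its three rows inlined as a CTE so it runs on its own. Whether it is slower in practice is best checked with EXPLAIN ANALYZE on the real query.

```sql
WITH case_in_where(id, name) AS (
    VALUES (1, 'foo'), (2, 'bar'), (3, 'baz')
)
SELECT *
FROM (
    SELECT id, name,
           CASE WHEN name = 'foo' THEN true
                WHEN name = 'bar' THEN false
                ELSE NULL
           END AS c
    FROM case_in_where
) t
WHERE c IS NOT NULL   -- filter first ...
ORDER BY id
LIMIT 100 OFFSET 0;   -- ... then page over the surviving rows
-- returns the rows with id 1 and 2
```

With the limit outside, a page only comes up short when fewer matching rows remain.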

Sql query to multiply two column value to third column

I want to multiply the values of two columns and show the product in a third column. Here is my query:
select distinct pr.PSProjectId,sfa.CodePattern, case when sfqd.NCR IS null then 'blank' else sfqd.NCR end as NCR
,
case when sfqd.NCR !='blank' then
(Select DATEDIFF(minute,starttime,EndTime) from ShopFloorStatusDetail where ShopFloorActivityId=sfa.ShopFloorActivityId
and StatusId=8
)
else
DATEDIFF(MINUTE,sfs.ShiftStarTime,sfs.shiftendtime)
end as timediff,
(select COUNT(1) from ShopFloorEmployeeTime where ShopFloorShiftId=sfs.ShopFloorShiftId) as totalemployee
from ShopFloor sf
inner join Project pr on pr.ProjectId=sf.ProjectId
inner join ShopFloorActivity sfa on sf.ShopFloorId=sfa.ShopFloorId
inner join ShopFloorShift sfs on sfs.ShopFloorActivityId=sfa.ShopFloorActivityId
left join ShopFloorStatusDetail sfsd on sfsd.ShopFloorActivityId=sfs.ShopFloorActivityId
left join ShopFloorQCDetail sfqd on sfqd.ShopFloorStatusDetailId=sfsd.ShopFloorStatusDetailId
and sfqd.NCR is not null
where CAST(sfs.ShiftStarTime as DATE) between '2014/01/06' and '2014/01/07'
The output from this query is:
PSProjectId CodePattern NCR timediff totalemployee
0000129495 3TMEU blank 8 1
0000130583 3UA1P blank 1 1
0000130583 3UA1P blank 2090 2
Now I want to multiply the timediff and totalemployee columns and show the result in a new column.
How do I do this?
Just add a new column, multiplying the existing expressions:
case when sfqd.NCR !='blank'
then (Select DATEDIFF(minute,starttime,EndTime)
from ShopFloorStatusDetail
where ShopFloorActivityId=sfa.ShopFloorActivityId
and StatusId=8
)
else DATEDIFF(MINUTE,sfs.ShiftStarTime,sfs.shiftendtime)
end
*
(select COUNT(1)
from ShopFloorEmployeeTime
where ShopFloorShiftId=sfs.ShopFloorShiftId)
Alternatively, wrap the whole existing query in another query and multiply the calculated columns:
select
PSProjectId,
CodePattern,
NCR,
timediff,
totalemployee,
timediff * totalemployee
from
( ...original query here... ) AS q -- SQL Server requires an alias on a derived table
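To see the wrapping approach in isolation, here is a minimal sketch that stands in for the original query with the three output rows shown in the question. The alias q and the column name total_minutes are made up for illustration.

```sql
SELECT q.timediff,
       q.totalemployee,
       q.timediff * q.totalemployee AS total_minutes  -- product of the two computed columns
FROM (VALUES (8, 1), (1, 1), (2090, 2)) AS q(timediff, totalemployee);
-- total_minutes: 8, 1, 4180
```

The outer query can refer to timediff and totalemployee by name because they are ordinary columns of the derived table at that point, which is why the expressions do not have to be repeated.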