DB2 SQL - median with GROUP BY

First of all, I am running on DB2 for i5/OS V5R4. I have ROW_NUMBER(), RANK() and common table expressions. I do not have TOP n PERCENT or LIMIT OFFSET.
The actual data set I'm working with is hard to explain, so let's just say I have a weather history table where the columns are (city, temperature, timestamp). I want to compare medians to averages for each group (city).
This was the cleanest way I found to get a median for a whole-table aggregation. I adapted it from an IBM Redbook:
WITH base_t AS
( SELECT temperature, row_number() over (order by temperature) AS rownum FROM t ),
count_t AS
( SELECT COUNT(temperature) + 1 AS base_count FROM base_t ),
median_t AS
( SELECT temperature FROM base_t, count_t
WHERE rownum in (FLOOR(base_count/2e0), CEILING(base_count/2e0)) )
SELECT DECIMAL(AVG(temperature),10,2) AS median FROM median_t
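To see why the FLOOR/CEILING pair handles both parities (a worked example of my own, not from the Redbook): with 4 rows, base_count = 5, so FLOOR(5/2e0) = 2 and CEILING(5/2e0) = 3, and the final AVG blends the two middle rows; with 5 rows, base_count = 6, both expressions evaluate to 3, and the AVG collapses to the single middle row.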
That works well for getting a single row back, but it seems to fall apart for grouping. Conceptually, this is what I want:
SELECT city, AVG(temperature), MEDIAN(temperature) FROM ...
city          | mean_temp | median_temp
========================================
'Minneapolis' | 60        | 64
'Milwaukee'   | 65        | 66
'Muskegon'    | 70        | 61
There could be an answer that makes me look stupid, but I'm having a mental block and this isn't my #1 thing to work on right now. It seems like it should be possible; I just can't use anything extremely complex, since it's a large table and I want to be able to customize which columns are being aggregated.

In SQL Server, aggregate functions like COUNT(*) can be partitioned and calculated without a GROUP BY. I looked quickly through the referenced Redbook, and it looks like DB2 has the same feature. But if not, then this won't work:
create table TemperatureHistory
(City varchar(20)
, Temperature decimal(5, 2)
, DateTaken datetime)
insert into TemperatureHistory values ('Minneapolis', 61, '20090101')
insert into TemperatureHistory values ('Minneapolis', 59, '20090102')
insert into TemperatureHistory values ('Milwaukee', 65, '20090101')
insert into TemperatureHistory values ('Milwaukee', 65, '20090102')
insert into TemperatureHistory values ('Milwaukee', 100, '20090103')
insert into TemperatureHistory values ('Muskegon', 80, '20090101')
insert into TemperatureHistory values ('Muskegon', 70, '20090102')
insert into TemperatureHistory values ('Muskegon', 70, '20090103')
insert into TemperatureHistory values ('Muskegon', 20, '20090104')
; with base_t as
(select city
, Temperature
, row_number() over (partition by city order by temperature) as RowNum
, (count(*) over (partition by city)) + 1 as CountPlusOne
from TemperatureHistory)
select City
, avg(Temperature) as MeanTemp
, avg(case
when RowNum in (FLOOR(CountPlusOne/2.0), CEILING(CountPlusOne/2.0))
then Temperature
else null end) as MedianTemp
from base_t
group by City
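If the windowed COUNT(*) is available, running this against the sample rows above should return (values rounded):

City        | MeanTemp | MedianTemp
Minneapolis | 60.00    | 60.00
Milwaukee   | 76.67    | 65.00
Muskegon    | 60.00    | 70.00

Milwaukee shows the point of the exercise: one 100-degree outlier drags the mean well above the median.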

Related

Subquery using Oracle SQL

I am trying to write a query which excludes meter_nos with three of the same date values, i.e. a 1-3 relationship.
The query must only show meter_nos which have different dates, i.e. a 1-1 relationship.
Can anyone help? I am stuck.
Here's a sample below:
...and a.mtr_id in (select b.mtr_id
from ci_mtr_config b
where a.mtr_id=b.mtr_id
group by b.mtr_id
having count(b.mtr_id)=3)
and a.mtr_id not in (select f.eff_dttm
from ci_mtr_config f
where a.mtr_id=f.mtr_id
group by f.eff_dttm
having count(f.eff_dttm)=3)
This does not work.
Try using COUNT(*) OVER(PARTITION BY ....) to count the number of rows sharing the same meter and date. Then filter by that calculation.
CREATE TABLE CI_MTR_CONFIG
(MTR_ID INT, EFF_DTTM DATE)
;
-- Oracle (before 23c) rejects multi-row VALUES lists, so one INSERT per row:
INSERT INTO CI_MTR_CONFIG (MTR_ID, EFF_DTTM) VALUES (303, to_date('2017-01-01','yyyy-mm-dd'));
INSERT INTO CI_MTR_CONFIG (MTR_ID, EFF_DTTM) VALUES (303, to_date('2017-01-01','yyyy-mm-dd'));
INSERT INTO CI_MTR_CONFIG (MTR_ID, EFF_DTTM) VALUES (303, to_date('2017-01-01','yyyy-mm-dd'));
INSERT INTO CI_MTR_CONFIG (MTR_ID, EFF_DTTM) VALUES (202, to_date('2017-01-01','yyyy-mm-dd'));
INSERT INTO CI_MTR_CONFIG (MTR_ID, EFF_DTTM) VALUES (202, to_date('2017-01-01','yyyy-mm-dd'));
INSERT INTO CI_MTR_CONFIG (MTR_ID, EFF_DTTM) VALUES (101, to_date('2017-01-01','yyyy-mm-dd'));
select *
from (
  -- Oracle requires a table alias to combine * with extra select items
  select c.*, count(*) over (partition by MTR_ID, EFF_DTTM) as count_of
  from CI_MTR_CONFIG c
) d
where count_of = 1
Only meter 101 would be returned from the sample data above.
Note: if EFF_DTTM is more granular than a day (i.e. carries a time component), use TRUNC() so rows from the same day count together:
count(*) over(partition by MTR_ID, TRUNC(EFF_DTTM)) as count_of

Crosstab transpose query request

Using Postgres 9.3.4, I've got this table:
create table tbl1(country_code text, metric1 int, metric2 int, metric3 int);
insert into tbl1 values('us', 10, 20, 30);
insert into tbl1 values('uk', 11, 21, 31);
insert into tbl1 values('fr', 12, 22, 32);
I need a crosstab query to convert it to this:
create table tbl1(metric text, us int, uk int, fr int);
insert into tbl1 values('metric1', 10, 11, 12);
insert into tbl1 values('metric2', 20, 21, 22);
insert into tbl1 values('metric3', 30, 31, 32);
As an added bonus, I'd love a rollup:
create table tbl1(metric text, total int, us int, uk int, fr int);
insert into tbl1 values('metric1', 33, 10, 11, 12);
insert into tbl1 values('metric2', 63, 20, 21, 22);
insert into tbl1 values('metric3', 93, 30, 31, 32);
I'm done staring at the crosstab spec. I have this written with CASE statements, but it's mad unruly and long, so can someone who's fluent in crosstab please whip up a quick query so I can move on?
The special difficulty is that your data is not ready for cross tabulation. You need data in the form row_name, category, value. You can get that with a UNION query:
SELECT 'metric1' AS metric, country_code, metric1 FROM tbl1
UNION ALL
SELECT 'metric2' AS metric, country_code, metric2 FROM tbl1
UNION ALL
SELECT 'metric3' AS metric, country_code, metric3 FROM tbl1
ORDER BY 1, 2 DESC;
But a smart LATERAL query only needs a single table scan and will be faster:
SELECT x.metric, t.country_code, x.val
FROM tbl1 t
, LATERAL (VALUES
('metric1', metric1)
, ('metric2', metric2)
, ('metric3', metric3)
) x(metric, val)
ORDER BY 1, 2 DESC;
Related:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
SELECT DISTINCT on multiple columns
Using the simple form of crosstab() with 1 parameter (the function ships with the additional module tablefunc, which must be installed once per database) with this query as input:
SELECT * FROM crosstab(
$$
SELECT x.metric, t.country_code, x.val
FROM tbl1 t
, LATERAL (
VALUES
('metric1', metric1)
, ('metric2', metric2)
, ('metric3', metric3)
) x(metric, val)
ORDER BY 1, 2 DESC
$$
) AS ct (metric text, us int, uk int, fr int);
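Against the sample data, this should produce exactly the requested shape:

metric  | us | uk | fr
metric1 | 10 | 11 | 12
metric2 | 20 | 21 | 22
metric3 | 30 | 31 | 32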
The column definition list names countries in descending alphabetical order (like in your demo).
This also assumes all metrics are defined NOT NULL.
If one or both are not the case, use the 2-parameter form instead:
PostgreSQL Crosstab Query
Add "rollup"
I.e. totals per metric:
SELECT * FROM crosstab(
$$
SELECT x.metric, t.country_code, x.val
FROM (
TABLE tbl1
UNION ALL
SELECT 'zzz_total', sum(metric1)::int, sum(metric2)::int, sum(metric3)::int -- etc.
FROM tbl1
) t
, LATERAL (
VALUES
('metric1', metric1)
, ('metric2', metric2)
, ('metric3', metric3)
) x(metric, val)
ORDER BY 1, 2 DESC
$$
) AS ct (metric text, total int, us int, uk int, fr int);
'zzz_total' is an arbitrary label that just has to sort last alphabetically (or you need the 2-parameter form of crosstab()).
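With the sample data, the rollup version should come out as:

metric  | total | us | uk | fr
metric1 | 33    | 10 | 11 | 12
metric2 | 63    | 20 | 21 | 22
metric3 | 93    | 30 | 31 | 32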
If you have lots of metrics columns, you might want to build the query string dynamically. Related:
How to perform the same aggregation on every column, without listing the columns?
Executing queries dynamically in PL/pgSQL
Also note that the upcoming Postgres 9.5 (currently beta) introduces a dedicated SQL clause for ROLLUP.
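For comparison, a minimal sketch of that clause (it produces extra total rows in a grouped result, not the crosstab layout built above):

SELECT country_code, sum(metric1) AS metric1_sum
FROM tbl1
GROUP BY ROLLUP (country_code);
-- one row per country plus a grand-total row where country_code IS NULL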
Related:
Spatial query on large table with multiple self joins performing slow

Another approach to percentiles?

I have a dataset which essentially consists of a list of job batches, the number of jobs contained in each batch, and the duration of each job batch. Here is a sample dataset:
CREATE TABLE test_data
(
batch_id NUMBER,
job_count NUMBER,
duration NUMBER
);
INSERT INTO test_data VALUES (1, 37, 9);
INSERT INTO test_data VALUES (2, 47, 4);
INSERT INTO test_data VALUES (3, 66, 6);
INSERT INTO test_data VALUES (4, 46, 6);
INSERT INTO test_data VALUES (5, 54, 1);
INSERT INTO test_data VALUES (6, 35, 1);
INSERT INTO test_data VALUES (7, 55, 9);
INSERT INTO test_data VALUES (8, 82, 7);
INSERT INTO test_data VALUES (9, 12, 9);
INSERT INTO test_data VALUES (10, 52, 4);
INSERT INTO test_data VALUES (11, 3, 9);
INSERT INTO test_data VALUES (12, 90, 2);
Now, I want to calculate some percentiles for the duration field. Typically, this is done with something like the following:
SELECT
PERCENTILE_DISC( 0.75 )
WITHIN GROUP (ORDER BY duration ASC)
AS third_quartile
FROM
test_data;
(Which gives the result of 9)
My problem here is that I don't want percentiles based on batches; I want them based on individual jobs. I can figure this out by hand quite easily by generating a running total of the job_count:
SELECT
batch_id,
job_count,
SUM(
job_count
)
OVER (
ORDER BY duration
ROWS UNBOUNDED PRECEDING
)
AS total_jobs,
duration
FROM
test_data
ORDER BY
duration ASC;
BATCH_ID  JOB_COUNT  TOTAL_JOBS  DURATION
       6         35          35         1
       5         54          89         1
      12         90         179         2
       2         47         226         4
      10         52         278         4
       3         66         344         6
       4         46         390         6
       8         82         472         7
       9         12         484         9
       1         37         521         9
      11          3         524         9
       7         55         579         9
Since I have 579 jobs, the 75th percentile falls at job 434 (0.75 x 579, rounded). Looking at the above result set, that corresponds to a duration of 7, different from what the standard function gives.
Essentially, I want to consider each job in a batch as a separate observation, and determine percentiles based on those, instead on the batches.
Is there a relatively simple way to accomplish this?
I would think of this as "weighted" percentiles. I don't know if there is a built-in analytic function for this in Oracle, but it is easy enough to calculate. And you are on the way there.
The additional idea is to calculate the total number of jobs, and then use arithmetic to select the value you want. For the 75th percentile, the value is the smallest duration such that the cumulative number of jobs is at least 0.75 times the total number of jobs.
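Checked against the sample data: the total is 579 jobs, 0.75 * 579 = 434.25, and the first row whose running total reaches that threshold is batch 8 (cumulative 472), giving a duration of 7 -- the same answer as the hand calculation in the question.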
Here is the example in SQL:
select pcs.percentile, min(case when cumjobs >= totjobs * percentile then duration end) as percentile_value
from (SELECT batch_id, job_count,
SUM(job_count) OVER (ORDER BY duration) as cumjobs,
sum(job_count) over () as totjobs,
duration
FROM test_data
) t cross join
(select 0.25 as percentile from dual union all
select 0.5 from dual union all
select 0.75 from dual
) pcs
group by pcs.percentile;
This example gives you the percentile values (and, as an added bonus, for three different percentiles), each on its own row. If you want the values on every row, you need to join back to your original table.
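A minimal sketch of that join-back, for the 75th percentile only (the alias names are mine, not from the query above):

select td.*, q.third_quartile
from test_data td
cross join
 (select min(case when cumjobs >= totjobs * 0.75 then duration end) as third_quartile
  from (select duration,
               sum(job_count) over (order by duration) as cumjobs,
               sum(job_count) over () as totjobs
        from test_data)
 ) q;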
OK, I think I have your answer. The idea is mine; the implementation is borrowed from this Ask Tom article:
SELECT PERCENTILE_DISC( 0.75 )
WITHIN GROUP (ORDER BY duration ASC)
AS third_quartile
FROM(
with data as
(select level l
from dual, (select max(job_count) max_jobs from test_data)
connect by level <= max_jobs
)
select *
from test_data, data
where l <= job_count
--ORDER BY duration, batch_id
) expanded
;
Here is SQL Fiddle.

How to get the complete row from a maximum calculation?

I struggle with GROUP BY -- again. The basics I can handle, but here it is: how do I get at columns I did not name in the GROUP BY, without destroying my grouping? Note that GROUP BY is only my own idea; there may be other approaches that work better. It must work in Oracle, though.
Here is my example:
create table xxgroups (
groupid int not null primary key,
groupname varchar2(10)
);
insert into xxgroups values(100, 'Group 100');
insert into xxgroups values(200, 'Group 200');
drop table xxdata;
create table xxdata (
num1 int,
num2 int,
state_a int,
state_b int,
groupid int,
foreign key (groupid) references xxgroups(groupid)
);
-- "ranks" are 90, 40, null, 70:
insert into xxdata values(10, 10, 1, 4, 100);
insert into xxdata values(10, 10, 0, 4, 200);
insert into xxdata values(11, 11, 0, 3, 100);
insert into xxdata values(20, 22, 5, 7, 200);
The task is to create a result row for each distinct (num1, num2) and print the groupname belonging to the highest calculated "rank" from state_a and state_b.
Note that the first two rows have the same nums and thus only the higher ranking should be selected -- with the groupname being "Group 200".
I got quite far with the basic group by, I think.
SELECT xd.num1||xd.num2 nummer, max(ranking.goodness)
FROM xxdata xd
, xxgroups xg
,( select state_a, state_b, r as goodness
from dual
model return updated rows
dimension by (0 state_a, 0 state_b) measures (0 r)
rules (r[1,4]=90, r[3,7]=80,r[5,7]=70, r[4,7]=60, r[0,7]=50, r[0,4]=40)
order by goodness desc
) ranking
WHERE xd.groupid=xg.groupid
and ranking.state_a (+) = xd.state_a
and ranking.state_b (+) = xd.state_b
GROUP BY xd.num1||xd.num2
ORDER BY nummer
;
The result is 90% of what I need:
nummer  ranking
---------------
1010    90
1111
2022    70
100% perfect would be
nummer  groupname
-----------------
1010    Group 100
1111    Group 100
2022    Group 200
The tricky part is that I want the groupname in the result, and I cannot simply include it in the SELECT, because then I would have to put it into the GROUP BY as well -- which I do not want (I would no longer be selecting the best-ranking entry across all groups).
In my solution I use a MODEL table to calculate the "rank". There are other solutions, I am sure; the point is that it is a non-trivial calculation that I do not want to do twice.
I know from other examples that one could use a second query to get back to the original row for the groupname, but I cannot see how to do that here without duplicating my ranking calculation.
A nice suggestion was to replace the GROUP BY with a LIMIT 1/ORDER BY goodness and use this calculating SELECT as a filtering subselect. But a) there is no LIMIT in Oracle, and I doubt a plain rownum <= 1 would do inside a subselect, and b) I cannot wrap my brain around it anyway. Maybe there is a way?
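For the record, the usual Oracle stand-in for LIMIT does work, but the ordering has to happen inside an inline view before ROWNUM filters it, roughly like this (a sketch, not wired to the ranking query above):

select *
from ( select ... order by goodness desc )
where rownum <= 1;

ROWNUM is assigned before the ORDER BY of the same query block runs, which is why ordering and filtering at the same level picks an arbitrary row.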
You can use the FIRST aggregation modifier to selectively apply your function over a subset of rows of a group -- here a single row (SQLFiddle demo):
SELECT xd.num1||xd.num2 nummer,
MAX(xg.groupname) KEEP (DENSE_RANK FIRST
ORDER BY ranking.goodness DESC) grp,
max(ranking.goodness)
FROM xxdata xd
, xxgroups xg
,( select state_a, state_b, r as goodness
from dual
model return updated rows
dimension by (0 state_a, 0 state_b) measures (0 r)
rules (r[1,4]=90, r[3,7]=80,r[5,7]=70, r[4,7]=60, r[0,7]=50, r[0,4]=40)
order by goodness desc
) ranking
WHERE xd.groupid=xg.groupid
and ranking.state_a (+) = xd.state_a
and ranking.state_b (+) = xd.state_b
GROUP BY xd.num1||xd.num2
ORDER BY nummer;
Your method with analytics works as well but since we already use aggregations here, we may as well use the FIRST modifier to get all columns in one go.
Wow, I did search before, but only now did I find this answer, which I could adapt to my question. The Oracle solution here is OVER (PARTITION BY ... ORDER BY ...) with row_number():
select *
from ( select data.*, row_number()
over (partition by num1, nummer order by goodness desc) as seqnum
from (
SELECT xd.num1, xd.num2 nummer, xg.groupname, ranking.goodness
FROM xxdata xd
, xxgroups xg
,( select state_a, state_b, r as goodness
from dual
model return updated rows
dimension by (0 state_a, 0 state_b) measures (0 r)
rules (r[1,4]=90, r[3,7]=80,r[5,7]=70, r[4,7]=60, r[0,7]=50, r[0,4]=40)
) ranking
WHERE xd.groupid=xg.groupid
and ranking.state_a (+) = xd.state_a
and ranking.state_b (+) = xd.state_b
ORDER BY nummer
) data )
where seqnum = 1
;
The result is
10  10  Group 100  90  1
11  11  Group 100      1
20  22  Group 200  70  1
which is beautiful.
Now I have to try to understand what OVER in the SELECT exactly does....
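In short: OVER makes row_number() an analytic function, computed per row without collapsing the result the way GROUP BY would; PARTITION BY restarts the numbering for each group, and the ORDER BY inside OVER decides the numbering order. A tiny illustration (table and columns invented for the example):

select t.groupkey, t.score,
       row_number() over (partition by t.groupkey order by t.score desc) as rn
from scores t;
-- rn = 1 marks the best score within each groupkey; every source row is kept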

SQL Group By Problem

I have a table that has 3 columns, namely points, project_id and creation_date. Every time points are assigned, a new record is made, for example:
points = 20, project_id = 441, creation_date = 04/02/2011 -> one record
points = 10, project_id = 600, creation_date = 04/02/2011 -> another record
points = 5, project_id = 441, creation_date = 06/02/2011 -> final record
(creation_date is the date on which the record is entered; it is achieved by setting the column default to GETDATE().)
Now the problem: I want to get the MAX points grouped by project_id, but I also want creation_date to appear alongside so I can use it for another purpose (it is fine if a creation_date repeats). I cannot group by creation_date as well, because then the points of project 600 get skipped, which is wrong: 600 is a different project whose max points are only 10, and it should still be listed. That only works if I group by project_id alone -- but then how do I also list creation_date?
So far I am using this query to get MAX points of each project
SELECT MAX(points) AS points, project_id
FROM LogiCpsLogs AS LCL
WHERE (writer_id = #writer_id) AND (DATENAME(mm, GETDATE()) = DATENAME(mm, creation_date)) AND (points <> 0)
GROUP BY project_id
writer_id is the ID of the writer whose points I want to see, e.g. writer_id = 1, 2 or 3.
This query returns results for the current month only, but I would also like to list creation_date. Please help.
The subquery way
SELECT P.Project_ID, P.Creation_Date, T.Max_Points
FROM Projects P INNER JOIN
(
SELECT Project_ID, MAX(Points) AS Max_Points
FROM Projects
GROUP BY Project_ID
) T
ON P.Project_ID = T.Project_ID
AND P.Points = T.Max_Points
Please see the comment: this will give you ALL days on which the max points were achieved. If you want just one, the query will be more complex.
Edits:
Misread requirements. Added additional constraint.
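If a single row per project is wanted, one option is a ROW_NUMBER() tie-breaker (a sketch; ties broken here by earliest date, adjust as needed):

SELECT Project_ID, Creation_Date, Points
FROM (
    SELECT Project_ID, Creation_Date, Points,
           ROW_NUMBER() OVER (PARTITION BY Project_ID
                              ORDER BY Points DESC, Creation_Date ASC) AS rn
    FROM Projects
) T
WHERE rn = 1;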
I'll give you a sample:
SELECT MAX(POINTS),
PROJECT_ID,
CREATION_DATE
FROM yourtable
GROUP by CREATION_DATE,PROJECT_ID;
This should be what you want; you don't even need a GROUP BY or aggregate functions:
SELECT points, project_id, created_date
FROM #T AS LCL
WHERE writer_id = #writer_id AND points <> 0
AND NOT EXISTS (
SELECT TOP 1 1
FROM #T AS T2
WHERE T2.writer_id = #writer_id
AND T2.project_id = LCL.project_id
AND T2.points > LCL.points)
Where #T is your table. Also, if you want to show only the records that hold the overall maximum (not just the maximum for the given #writer_id), remove the restriction T2.writer_id = #writer_id from the inner query.
And my code that I used to test:
DECLARE #T TABLE
(
writer_id int,
points int,
project_id int,
created_date datetime
)
INSERT INTO #T VALUES(1, 20, 441, CAST('20110204' AS DATETIME))
INSERT INTO #T VALUES(1, 10, 600, CAST('20110204' AS DATETIME))
INSERT INTO #T VALUES(1, 5, 441, CAST('20110202' AS DATETIME))
INSERT INTO #T VALUES(1, 15, 241, GETDATE())
INSERT INTO #T VALUES(1, 12, 241, GETDATE())
INSERT INTO #T VALUES(2, 12, 241, GETDATE())
SELECT * FROM #T
DECLARE #writer_id int = 1
My results:
Result Set (3 items)
points | project_id | created_date
20     | 441        | 04/02/2011 00:00:00
10     | 600        | 04/02/2011 00:00:00
15     | 241        | 21/09/2011 18:59:31
My solution uses CROSS APPLY subqueries.
For optimal performance I created an index on the project_id (ASC) and points (DESC) columns.
If you want to see all creation_date values that share the maximum points, you can use WITH TIES:
CREATE TABLE dbo.Project
(
project_id INT PRIMARY KEY
,name NVARCHAR(100) NOT NULL
);
CREATE TABLE dbo.ProjectActivity
(
project_activity INT IDENTITY(1,1) PRIMARY KEY
,project_id INT NOT NULL REFERENCES dbo.Project(project_id)
,points INT NOT NULL
,creation_date DATE NOT NULL
);
CREATE INDEX IX_ProjectActivity_project_id_points_creation_date
ON dbo.ProjectActivity(project_id ASC, points DESC)
INCLUDE (creation_date);
GO
INSERT dbo.Project
VALUES (1, 'A'), (2, 'BB'), (3, 'CCC');
INSERT dbo.ProjectActivity (project_id, points, creation_date)
VALUES (1,100,'2011-01-01'), (1,110,'2011-02-02'), (1, 111, '2011-03-03'), (1, 111, '2011-04-04')
,(2, 20, '2011-02-02'), (2, 22, '2011-03-03')
,(3, 2, '2011-03-03');
SELECT p.*, ca.*
FROM dbo.Project p
CROSS APPLY
(
SELECT TOP(1) WITH TIES
pa.points, pa.creation_date
FROM dbo.ProjectActivity pa
WHERE pa.project_id = p.project_id
ORDER BY pa.points DESC
) ca;
DROP TABLE dbo.ProjectActivity;
DROP TABLE dbo.Project;
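Run against the sample data, the SELECT should return project 1 twice because of the tie at 111 points:

project_id | name | points | creation_date
1          | A    | 111    | 2011-03-03
1          | A    | 111    | 2011-04-04
2          | BB   | 22     | 2011-03-03
3          | CCC  | 2      | 2011-03-03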