Group by aggregate with some arithmetic (merging similar rows) - SQL

I need to combine the following rows:
id num_votes avg_vote
1 2 4
1 3 1
1 0 0
To end up with this:
id num_votes avg_votes
1 2+3+0=5 4*2/5 + 3*1/5 = 2.2
I've tried the following, but nested aggregate functions don't work, of course:
select id
, sum(num_votes) as _num_votes
, sum(num_votes/sum(num_votes)*avg_vote) as _avg_vote
from mytable
GROUP BY id, num_votes, avg_vote;

SELECT id
, sum(num_votes) as _num_votes
, round( sum(num_votes * avg_vote)::numeric
/ sum(num_votes)
, 2) AS avg_votes
FROM mytable
GROUP BY id; -- you cannot GROUP BY aggregated columns, just: id
You don't need window functions for this. Aggregate functions do the job.
The calculation:
4*2/5 + 3*1/5 + 0*0/5
Can be rewritten as:
(4*2 + 3*1 + 0*0)/5
And implemented as:
sum(num_votes * avg_vote) / sum(num_votes)
The rest is casting and rounding to preserve fractional digits. (Integer division would truncate.)
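A minimal self-contained demo (PostgreSQL; the table name and sample data are taken from the question):
CREATE TABLE mytable (id int, num_votes int, avg_vote numeric);
INSERT INTO mytable VALUES (1, 2, 4), (1, 3, 1), (1, 0, 0);

SELECT id
     , sum(num_votes) AS num_votes
     , round(sum(num_votes * avg_vote)::numeric / sum(num_votes), 2) AS avg_votes
FROM mytable
GROUP BY id;

-- id | num_votes | avg_votes
--  1 |         5 |      2.20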

Related

How to pivot a simple table of 2 rows in PostgreSQL?

The expected table is this:
good_days bad_days
6 25
But I have this table:
day_type x
bad_days 25
good_days 6
My code is not working:
select *
from (select * from main_table)t
pivot(count(x) for day_type in ('bad_days', 'good_days') ) as pivot_table
There are multiple ways to do this:
Use the PostgreSQL extension tablefunc, whose crosstab function pivots a query result (see the crosstab sketch below).
You can also write a custom query (this only works if you have few, known day_type values):
WITH cte (day_type, x) AS (
VALUES ('bad_days', 25), ('good_days', 6))
SELECT sum(good_days) AS good_days,
sum(bad_days) AS bad_days
FROM (
(SELECT x AS good_days,
0 AS bad_days
FROM cte
WHERE day_type = 'good_days')
UNION ALL
(SELECT 0 AS good_days,
x AS bad_days
FROM cte
WHERE day_type = 'bad_days')) AS foo
A simple method is conditional aggregation:
select sum(x) filter (where day_type = 'bad_days') as bad_days,
sum(x) filter (where day_type = 'good_days') as good_days
from t;
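Here is a sketch of the crosstab route mentioned earlier, assuming main_table(day_type text, x int) as in the question; the constant 1 serves as the row key and the second argument pins the category order:
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT good_days, bad_days
FROM crosstab(
       $$SELECT 1, day_type, x FROM main_table ORDER BY 1$$,
       $$VALUES ('good_days'), ('bad_days')$$
     ) AS ct(row_id int, good_days int, bad_days int);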

Return five rows of random DNA instead of just one

This is the code I have to create a string of DNA:
prepare dna_length(int) as
with t1 as (
select chr(65) as s
union select chr(67)
union select chr(71)
union select chr(84) )
, t2 as ( select s, row_number() over() as rn from t1)
, t3 as ( select generate_series(1,$1) as i, round(random() * 4 + 0.5) as rn )
, t4 as ( select t2.s from t2 join t3 on (t2.rn=t3.rn))
select array_to_string(array(select s from t4),'') as dna;
execute dna_length(20);
I am trying to figure out how to re-write this to give a table of 5 rows of strings of DNA of length 20 each, instead of just one row. This is for PostgreSQL.
I tried:
CREATE TABLE dna_table(g int, dna text);
INSERT INTO dna_table (1, execute dna_length(20));
But this does not seem to work. I am an absolute beginner. How do I do this properly?
PREPARE creates a prepared statement that can only be used "as is". If your prepared statement returns one string, then you can only get one string. You can't use it inside other statements such as INSERT.
In your case you may create a function:
create or replace function dna_length(int) returns text as
$$
with t1 as (
select chr(65) as s
union
select chr(67)
union
select chr(71)
union
select chr(84))
, t2 as (select s,
row_number() over () as rn
from t1)
, t3 as (select generate_series(1, $1) as i,
round(random() * 4 + 0.5) as rn)
, t4 as (select t2.s
from t2
join t3 on (t2.rn = t3.rn))
select array_to_string(array(select s from t4), '') as dna
$$ language sql;
And use it in a way like this:
insert into dna_table(g, dna) select generate_series(1,5), dna_length(20);
From the official doc:
PREPARE creates a prepared statement. A prepared statement is a server-side object that can be used to optimize performance. When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific parameter values supplied.
See also the official documentation about functions.
This can be much simpler and faster:
SELECT string_agg(CASE ceil(random() * 4)
WHEN 1 THEN 'A'
WHEN 2 THEN 'C'
WHEN 3 THEN 'T'
WHEN 4 THEN 'G'
END, '') AS dna
FROM generate_series(1,100) g -- 100 = 5 rows * 20 nucleotides
GROUP BY g%5;
random() produces a random value in the range 0.0 <= x < 1.0. Multiply by 4 and take the mathematical ceiling with ceil() (cheaper than round()), and you get a uniform random distribution of the numbers 1-4. Convert those to A/C/T/G, and aggregate with GROUP BY g%5 - % being the modulo operator - so the 100 source rows form 5 groups of 20.
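A quick sanity check of that 1-4 distribution (a sketch; the exact counts vary from run to run):
SELECT ceil(random() * 4) AS bucket, count(*) AS hits
FROM generate_series(1, 100000)
GROUP BY bucket
ORDER BY bucket;
-- each bucket should receive roughly 25000 hits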
About string_agg():
Concatenate multiple result rows of one column into one, group by another column
As prepared statement, taking
$1 ... the number of rows
$2 ... the number of nucleotides per row
PREPARE dna_length(int, int) AS
SELECT string_agg(CASE ceil(random() * 4)
WHEN 1 THEN 'A'
WHEN 2 THEN 'C'
WHEN 3 THEN 'T'
WHEN 4 THEN 'G'
END, '') AS dna
FROM generate_series(1, $1 * $2) g
GROUP BY g%$1;
Call:
EXECUTE dna_length(5,20);
Result:
| dna |
| :------------------- |
| ATCTTCGACACGTCGGTACC |
| GTGGCTGCAGATGAACAGAG |
| ACAGCTTAAAACACTAAGCA |
| TCCGGACCTCTCGACCTTGA |
| CGTGCGGAGTACCCTAATTA |
If you need it a lot, consider a function instead. See:
What is the difference between a prepared statement and a SQL or PL/pgSQL function, in terms of their purposes?

Split float between list of numbers

I have a problem with splitting 0.00x float values between numbers.
Here is an example of the input data; row 0 is the sum of the float numbers in rows 1-3.
As a result I want to see rounded numbers without losing the sum of rows 1-3:
IN:
0 313.726
1 216.412
2 48.659
3 48.655
OUT:
0 313.73
1 216.41
2 48.66
3 48.66
How it should work:
The idea is to split the smallest remainder (in our example it's 0.002 from the value 216.412) among the largest ones: 0.001 to 48.659 = 48.66 and 0.001 to 48.655 = 48.656. After this we can round the numbers without losing data.
After sitting on this problem yesterday I found a solution. I think the query should look like this:
select test.*,
       sum(value - trunc(value, 2))
         over (partition by case when id = 0 then 0 else 1 end) part,
       row_number()
         over (partition by case when id = 0 then 0 else 1 end
               order by value - trunc(value, 2) desc) rn,
       case when row_number() over (partition by case when id = 0 then 0 else 1 end
                                    order by value - trunc(value, 2) desc) / 100
                 <= round(sum(value - trunc(value, 2))
                            over (partition by case when id = 0 then 0 else 1 end), 2)
            then trunc(value, 2) + 0.01
            else trunc(value, 2)
       end result
from test;
But it still seems strange to me to add the constant value 0.01 when computing the result.
Any ideas to improve this query?
You could use the round() SQL function when presenting results. round()'s second argument is the number of decimal places you want to round to. Issuing this select on the test table:
select id, round(value, 2) from test;
gives you the following result
0 313.73
1 216.41
2 48.66
3 48.65
Generally, you can use the stored numbers for summations and the round() function only when presenting the results. Here is a way to do the sum at full precision and then round for the final result:
select sum(value) from test where id != 0
gives the result: 313.726
select round(sum(value), 2) from test where id != 0
gives the result: 313.73
By the way, allow me two observations:
1) The rounding you give for id = 3 is confusing to me: 48.654 rounds to 48.65 rather than 48.66 at two decimal places. Am I missing something?
2) Strictly speaking this issue is not a PL/SQL issue as labeled; it is entirely in the realm of SQL. However, there is a round() function in PL/SQL as well, and the same principles apply.
select id, value,
case when id <> max(id) over () then round(value, 2)
else round(value, 2) - sum(round(value, 2)) over () +
round(first_value(value) over (order by id), 2) * 2
end val_rnd
from test
Output:
ID VALUE VAL_RND
------ ---------- ----------
0 313.726 313.73
1 216.413 216.41
2 48.659 48.66
3 48.654 48.66
The above query works, but it moves the entire difference to the last row. This is not "honest" and maybe not what you are after in other scenarios.
The most "dishonest" behavior is observable with a big number of values that all equal 0.005.
To make a full distribution you need to:
sum all original values in the sub-rows and subtract the rounded total value from the row with id 0,
use row_number() to sort the sub-rows by the difference between the rounded value and the original value (possibly descending; it depends on the sign of the difference, so use sign() and abs()),
assign to each row a value increased by 0.01 (or decreased, if the difference < 0) until difference/0.01 rows have been adjusted (use case when),
union the row with id = 0 containing the rounded sum,
optionally sort the results.
It's hard (but achievable) in one query - see the sketch below. An alternative is a PL/SQL procedure or function, which might be more readable.
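A sketch of that single-query approach (Oracle syntax, assuming the test(id, value) table from the question; it only handles a positive difference - the negative case needs the sign()/abs() handling described above):
with parts as (
  select id, value,
         trunc(value, 2) as tv,
         -- rank sub-rows by remainder, largest first
         row_number() over (order by value - trunc(value, 2) desc) as rn,
         -- how much the truncated parts fall short of the rounded total
         round(sum(value) over (), 2) - sum(trunc(value, 2)) over () as diff
  from test
  where id != 0
)
select id,
       tv + case when rn <= round(diff / 0.01) then 0.01 else 0 end as result
from parts
union all
select id, round(value, 2) from test where id = 0
order by id;
-- with the sample data: 0 -> 313.73, 1 -> 216.41, 2 -> 48.66, 3 -> 48.66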
If I understand you correctly, you don't want to use round because the rounded partial numbers don't add up to the rounded total.
In this case a simple trick applies. You use round for all but the last number. The last fraction is calculated as the difference between the rounded sum and the rounded parts so far (all but the last one).
You may express this with analytic functions as follows:
WITH total AS
(SELECT id, value, ROUND(value,2) value_rounded FROM test WHERE id = 0
),
rounded AS
( SELECT id, value, ROUND(value,2) value_rounded FROM test WHERE id != 0
)
SELECT id, value_rounded FROM total
UNION ALL
SELECT id,
CASE
WHEN row_number() over (order by id) != COUNT(*) over ()
THEN
/* not the last row - regular result */
value_rounded
ELSE
/* last row - corrected result */
(select value_rounded from total) - SUM(value_rounded) over (order by id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
END AS value
FROM rounded
ORDER BY id;
Note that this is the test for the last number:
row_number() over (order by id) != COUNT(*) over ()
and this is the sum of all parts from the beginning (UNBOUNDED PRECEDING) up to the one but last (1 PRECEDING):
SUM(value_rounded) over (order by id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
I split your data into two sources: total - the one row with the total - and rounded - the rounded parts.
UPDATE
In some cases the last corrected number shows an "ugly" large difference from the original value,
as the differences in one rounding direction are higher than in the opposite one.
The following select takes this into account and distributes the difference between the parts.
The example below illustrates this with a lot of 0.005s:
WITH nums AS
(SELECT rownum id, 0.005 value FROM dual connect by level <= 5
),
rounded AS
( SELECT id, value, ROUND(value,2) value_rounded FROM nums
),
with_diff as
(SELECT id, value, value_rounded,
-- difference so far - between the exact SUM and SUM of rounded parts
-- cut to two decimal points
floor(100* (
sum(value) over (order by id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -
sum(value_rounded) over (order by id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)))
/ 100 diff_so_far
FROM rounded),
delta_diff as
(select id, value, value_rounded,DIFF_SO_FAR,
DIFF_SO_FAR - LAG(DIFF_SO_FAR,1,0) over (order by ID) as diff_delta
from with_diff)
SELECT id, value,
CASE
WHEN row_number() over (order by id) != COUNT(*) over ()
THEN
/* not the last row - take the rounded value and ... */
value_rounded +
/* ... add or subtract the delta difference */
diff_delta
ELSE
/* last row - corrected result */
round(sum(value) over(),2) - SUM(value_rounded + diff_delta) over (order by id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
END AS value_rounded, diff_delta
FROM delta_diff
ORDER BY id;
ID VALUE VALUE_ROUNDED DIFF_DELTA
---------- ---------- ------------- ----------
1 ,005 0 -0,01
2 ,005 ,01 0
3 ,005 0 -0,01
4 ,005 ,01 0
5 ,005 ,01 -0,01
A pragmatic solution based on the following rules:
1) check the difference between the rounded sum and sum of rounded parts.
select round(sum(value),2) - sum(round(value,2)) from test where id != 0;
2) apply this difference:
e.g. if you get 0.01, this means one rounded part must be increased by 0.01;
if you get -0.02, it means two rounded parts must be decreased by 0.01
The query below simply corrects the last N parts:
with diff as (
select round(sum(value),2) - sum(round(value,2)) diff from test where id != 0
), diff_values as
(select sign(diff)*.01 diff_value, abs(100*diff) corr_cnt
from diff)
select id, round(value,2)
+ case when row_number() over (order by id desc) <= corr_cnt then diff_value else 0 end result
from test, diff_values where id != 0
order by id;
ID RESULT
---------- ----------
1 216,41
2 48,66
3 48,66
If the number of corrected records is much higher than two, check the data and the rounding precision.

AS400 DB2 query math expression in Select

I have not done DB2 queries for a while so I am having issues with a math expression in my Select statement. It does not throw an error but I get the wrong result. Can someone tell me how DB2 evaluates the expression?
Part of my Select is below.
The values are:
t1.Points = 100
t2.Involvepoints = 1
(current date - t1.fromdt) in days is 1268 (so it would be current
date 7/19/2013 - 01/28/2010 in days)
It should read like (100 * 1) * (1 - (.000274 * 1268)) = 65.2568
SELECT Value1,
value2,
(CASE
WHEN (T1.POINTS * T2.INVOLVEPOINTS) * (1 - .000274 * DAYS(CURRENT DATE) - DAYS(T1.FROMDT)) >= 0 THEN (T1.POINTS * T2.INVOLVEPOINTS) * (1 - .000274 * DAYS(CURRENT DATE) - DAYS(T1.FROMDT))
ELSE 0
END) AS POINTSTOTAL
FROM TABLE1;
The parentheses are not enforcing the correct precedence of operations, and the join declaration is missing. In addition, you can use the MAX scalar function instead of the repetitive CASE expression.
Here is a proof using common table expressions to simulate the source data:
with
t1 (value1, points, fromdt)
as (select 1, 100, '2010-01-28' from sysibm.sysdummy1),
t2 (value2, involvepoints)
as (select 2, 1 from sysibm.sysdummy1)
select value1, value2,
max(0, t1.points * t2.involvepoints *
(1 - .000274 * (DAYS('2013-07-19') - DAYS(t1.fromdt)))) as pointstotal
from t1, t2;
The result is:
VALUE1 VALUE2 POINTSTOTAL
------ ------ -----------
1 2 65.256800
Did you mean this?
...
(T1.POINTS * T2.INVOLVEPOINTS) * (1 - .000274 * ( DAYS(CURRENT DATE) - DAYS(T1.FROMDT) ) )
...
Note the extra pair of parentheses around the subtraction of dates. Multiplication takes precedence over addition and subtraction, so in your original query you multiply DAYS(CURRENT DATE) by .000274, subtract that from 1, and then subtract DAYS(T1.FROMDT) from the result.
Curiously, you have those parentheses in your explanation, but not in the actual formula.
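To see the difference numerically, here is a small sketch; the absolute day numbers d1 and d2 are made up, only their difference of 1268 days matters:
with d (d1, d2) as (select 735000, 733732 from sysibm.sysdummy1)
select 100 * 1 * (1 - .000274 * d1 - d2)   as as_written,  -- grouping from the question
       100 * 1 * (1 - .000274 * (d1 - d2)) as intended     -- subtraction grouped first
from d;
-- intended = 65.2568; as_written is a huge negative number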

Is there a PRODUCT function like there is a SUM function in Oracle SQL?

I have a coworker looking for this, and I don't recall ever running into anything like that.
Is there a reasonable technique that would let you simulate it?
SELECT PRODUCT(X)
FROM
(
SELECT 3 X FROM DUAL
UNION ALL
SELECT 5 X FROM DUAL
UNION ALL
SELECT 2 X FROM DUAL
)
would yield 30
select exp(sum(ln(col)))
from table;
Edit: this only works if col is always > 0 (ln is undefined for zero and negative values).
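A quick check against the sample data from the question; the raw result may come back as 29.999... due to floating-point noise, hence the round():
select round(exp(sum(ln(x)))) as product
from (
  select 3 x from dual union all
  select 5 x from dual union all
  select 2 x from dual
);
-- PRODUCT = 30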
In SQL Server you can do it with a variable that is re-assigned for every row:
DECLARE @a int
SET @a = 1
-- re-assign @a for each row in the result
-- as what @a was before * the value in the row
SELECT @a = @a * amount
FROM theTable
There's a way to do string concat that is similar:
DECLARE @b varchar(max)
SET @b = ''
SELECT @b = @b + CustomerName
FROM Customers
Here's another way to do it. This is definitely the longer way to do it, but it was part of a fun project.
You've got to reach back to school for this one, lol. The key to remember here is that LOG is the inverse of the exponential:
LOG10(X*Y) = LOG10(X) + LOG10(Y)
or
ln(X*Y) = ln(X) + ln(Y) (ln = natural log, base e)
Example
If X = 5 and Y = 6:
X * Y = 30
ln(5) + ln(6) ≈ 3.4
ln(30) ≈ 3.4
e^3.4 ≈ 30, just like 5 x 6
EXP(3.4) ≈ 30
So above, if 5 and 6 each occupied a row in the table, we take the natural log of each value, sum up the rows, then take the exponent of the sum to get 30.
Below is the code as a SQL statement for SQL Server. Some editing is likely required to make it run on Oracle; hopefully it's not a big difference, but I suspect at least the CASE expression isn't identical on Oracle. You'll notice some extra logic in there to test whether the sign of a row is negative.
CREATE TABLE DUAL (VAL INT NOT NULL)
INSERT DUAL VALUES (3)
INSERT DUAL VALUES (5)
INSERT DUAL VALUES (2)
SELECT
CASE SUM(CASE WHEN SIGN(VAL) = -1 THEN 1 ELSE 0 END) % 2
WHEN 1 THEN -1
ELSE 1
END
* CASE
WHEN SUM(VAL) = 0 THEN 0
WHEN SUM(VAL) IS NOT NULL THEN EXP(SUM(LOG(ABS(CASE WHEN SIGN(VAL) <> 0 THEN VAL END))))
ELSE NULL
END
* CASE MIN(ABS(VAL)) WHEN 0 THEN 0 ELSE 1 END
AS PRODUCT
FROM DUAL
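Running this against the three inserted rows returns PRODUCT = 30 (possibly with tiny floating-point noise from EXP/LOG): the first CASE flips the sign when an odd number of values are negative, and the final factor forces the result to 0 when any value is 0.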
The accepted answer by tuinstoel is correct, of course:
select exp(sum(ln(col)))
from table;
But notice that if col is of type NUMBER, you will find tremendous performance improvement when using BINARY_DOUBLE instead. Ideally, you would have a BINARY_DOUBLE column in your table, but if that's not possible, you can still cast col to BINARY_DOUBLE. I got a 100x improvement in a simple test that I documented here, for this cast:
select exp(sum(ln(cast(col as binary_double))))
from table;
Is there a reasonable technique that would let you simulate it?
One technique could be using LISTAGG to generate a product_expression string and XMLTABLE + dbms_xmlgen.getXMLType to evaluate it:
WITH cte AS (
SELECT grp, LISTAGG(l, '*') AS product_expression
FROM t
GROUP BY grp
)
SELECT c.*, s.val AS product_value
FROM cte c
CROSS APPLY(
SELECT *
FROM XMLTABLE('/ROWSET/ROW/*'
PASSING dbms_xmlgen.getXMLType('SELECT ' || c.product_expression || ' FROM dual')
COLUMNS val NUMBER PATH '.')
) s;
db<>fiddle demo
Output:
+------+---------------------+---------------+
| GRP | PRODUCT_EXPRESSION | PRODUCT_VALUE |
+------+---------------------+---------------+
| b | 2*6 | 12 |
| a | 3*5*7 | 105 |
+------+---------------------+---------------+
A more robust version that handles a single NULL value in the group:
WITH cte AS (
SELECT grp, LISTAGG(l, '*') AS product_expression
FROM t
GROUP BY grp
)
SELECT c.*, s.val AS product_value
FROM cte c
OUTER APPLY(
SELECT *
FROM XMLTABLE('/ROWSET/ROW/*'
passing dbms_xmlgen.getXMLType('SELECT ' || c.product_expression || ' FROM dual')
COLUMNS val NUMBER PATH '.')
WHERE c.product_expression IS NOT NULL
) s;
db<>fiddle demo
*CROSS/OUTER APPLY (Oracle 12c) is used for convenience and could be replaced with nested subqueries.
This approach could be used to emulate other aggregate functions as well.
There are many different implementations of "SQL". When you say "does SQL have", are you referring to a specific ANSI version of SQL, or to a vendor-specific implementation? DavidB's answer is one that works in a few different environments I have tested, but depending on your environment you could write or find a function exactly like what you are asking for. Say you were using Microsoft SQL Server 2005; then a possible solution would be to write a custom aggregate in .NET code named PRODUCT, which would allow your original query to work exactly as you have written it.
In SQL Server you might have to do:
SELECT EXP(SUM(LOG([col])))
FROM table;