Arithmetic operation on array aggregation in postgres - sql

I have 2 questions regarding array_agg in postgres
1) I have a column which is of type array_agg. I need to divide each of the element of the array_agg by a constant value. Is it possible. I checked http://www.postgresql.org/docs/9.1/static/functions-array.html, but could not find any reference to any arithmetic operations on array_agg.
Edit:
An example of the desired operation.
select array_agg(value)/2 from some_table
Here, I create an array of the column value from the table some_table and I have to divide each of the column by 4
2) Is it possible to use coalesce in array_agg. In my scenario, there may be cases wherein, the array_agg of a column may result in a NULL array. Can we use coalesce for array_agg ?
Example:
select coalsece(array_agg(value1), 0)

Dividing is probably simper than you thought:
SELECT array_agg(value/2)
FROM ...
However, what value/2 does exactly depends on the data type. If value is an integer, fractional digits are truncated. To preserve fractional digits use value/2.0 instead. The fractional digit forces the calculation to be done with numeric values.
COALESCE won't make any difference outside the array. Either there are no rows, then you get no result at all ('no row'), or if there are, you get an array, possibly with NULL elements. But the value of the array itself is never NULL.
To replace individual NULL values with 0:
SELECT array_agg(coalesce(value/2.0, 0))
FROM ...

Related

Store int, float and boolean in same database column

Is there a sane way of storing int, float and boolean values in the same column in Postgres?
If have something like that:
rid
time
value
2d9c5bdc-dfc5-4ce5-888f-59d06b5065d0
2021-01-01 00:00:10.000000 +00:00
true
039264ad-af42-43a0-806b-294c878827fe
2020-01-03 10:00:00.000000 +00:00
2
b3b1f808-d3c3-4b6a-8fe6-c9f5af61d517
2021-01-01 00:00:10.000000 +00:00
43.2
Currently I'm using jsonb to store it, the problem however now is, that I can't filter in the table with for instance the greater operator.
The query
SELECT *
FROM points
WHERE value > 0;
gives back the error:
ERROR: operator does not exist: jsonb > integer: No operator matches the given name and argument types. You might need to add explicit type casts.
For me it's okay to handle boolean as 1 or 0 in case of true or false. Is there any possibility to achieve that with jsonb or is there maybe another super type which lets me use a column that is able to use all three types?
Performance is not so much of a concern here, as I'm going to have very few records inside of that table, max 5k I guess.
If you were just storing integers and floats, normally you'd use a float or numeric column.
But there's that pesky true.
You could cast the JSON...
select *
from test
where value::float > 1;
...but there's that pesky true.
You have to convert the boolean to a number to make it work.
select *
from test
where
(case when value = 'true' then 1.0 when value = 'false' then 0.0 else value::float end) >= 1;
Or ignore it.
This having to work around the type system suggests that value is actually two or even three different fields crammed into one. Consider separating them into multiple columns.
You should skip the rows where value is not number and cast the value to numeric, e.g.:
with points(id, value) as (
values
(1, 'true'::jsonb),
(2, '2'),
(3, '43.2')
)
select *
from points
where jsonb_typeof(value) = 'number'
and value::text::numeric > 0;
id | value
----+-------
2 | 2
3 | 43.2
(2 rows)
I actually found out, regardless of the jsonb fields value, that you can compare it to other jsonb in postgres. That means, I can for instance do the following:
SELECT *
FROM points
WHERE val > '5'
This correctly gives me back only the third row. It just ignores the bool value. To filter for a certain bool I can achieve that with the following query:
SELECT *
FROM points
WHERE val = 'true'
This is good enough for me. I even could hold timestamps in the json column and compare them using this methodology.
Another way of solving the problem after all your comments seem to be to make the column a numeric. This would work as well, but requires more client side conversion, as I would have to have a second type column, remembering what the actual type is. This type should than be used on the client side to convert the value back into its og value. For integers its trivial, for booleans like #schwern suggested, one can use 1 and 0, for dates, one could use the unix timestamp representation.
When I now want to search for a certain value, the type has to be contained in the where clause as well.

SQLite SUM function and strings

I have the following problem.
Let's say that I have a table foo:
CREATE TABLE foo
(
mycolumn INTEGER
);
INSERT INTO foo VALUES (1);
INSERT INTO foo VALUES (2);
INSERT INTO foo VALUES ("");
then, the result of SELECT SUM(mycolumn) FROM foo; is:
SUM(mycolumn)
-------------
3.0
If I updated the empty string to NULL the result would be 3. It is understandable as aggregate functions disregard NULL values. However, I do not understand the floating point result given above (3.0).
I cannot find any information about TEXT to REAL conversion on SQLite Datatypes page. Therefore, I have the following question:
Is there an automatic conversion from TEXT to REAL storage class when an aggregate function is used?
PS. I am using version 3.34.1 of SQLite.
From the SQLite documentation on aggregates:
The result of sum() is an integer value if all non-NULL inputs are integers. If any input to sum() is neither an integer or a NULL then sum() returns a floating point value which might be an approximation to the true sum.
Check the sqlite documentation
sum(X) total(X)
The sum() and total() aggregate functions return sum of all non-NULL
values in the group. If there are no non-NULL input rows then sum()
returns NULL but total() returns 0.0. NULL is not normally a helpful
result for the sum of no rows but the SQL standard requires it and
most other SQL database engines implement sum() that way so SQLite
does it in the same way in order to be compatible. The non-standard
total() function is provided as a convenient way to work around this
design problem in the SQL language.
The result of total() is always a floating point value. The result of
sum() is an integer value if all non-NULL inputs are integers. If any
input to sum() is neither an integer or a NULL then sum() returns a
floating point value which might be an approximation to the true sum.
Sum() will throw an "integer overflow" exception if all inputs are
integers or NULL and an integer overflow occurs at any point during
the computation. Total() never throws an integer overflow.

Get max on comma separated values in column

How to get max on comma separated values in Original_Ids column and get max value in one column and remaining ids in different column.
|Original_Ids | Max_Id| Remaining_Ids |
|123,534,243,345| 534 | 123,234,345 |
Upadte -
If I already have Max_id and just need below equation?
Remaining_Ids = Original_Ids - Max_id
Thanks
Thanks to the excellent possibilities of array manipulation in Postgres, this could be done relatively easy by converting the string to an array and from there to a set.
Then regular queries on that set are possible. With max() the maximum can be selected and with EXCEPT ALL the maximum can be removed from the set.
A set can then be converted to an array and with array_to_string() and the array can be converted to a delimited string again.
SELECT ids original_ids,
(SELECT max(un.id::integer)
FROM unnest(string_to_array(ids,
',')) un(id)) max_id,
array_to_string(ARRAY((SELECT un.id::integer
FROM unnest(string_to_array(ids,
',')) un(id)
EXCEPT ALL
SELECT max(un.id::integer)
FROM unnest(string_to_array(ids,
',')) un(id))),
',') remaining_ids
FROM elbat;
Another option would have been regexp_split_to_table() which directly produces a set (or regexp_split_to_array() but than we'd had the possible regular expression overhead and still had to convert the array to a set).
But nevertheless you just should (almost) never use delimited lists (nor arrays). Use a table, that's (almost) always the best option.
SQL Fiddle
You can use a window function (https://www.postgresql.org/docs/current/static/tutorial-window.html) to get the max element per unnested array. After that you can reaggregate the elements and remove the calculated max value from the array.
Result:
a max_elem remaining
123,534,243,345 534 123,243,345
3,23,1 23 3,17
42 42
56,123,234,345,345 345 56,123,234
This query needs only one split/unnest as well as only one max calculation.
SELECT
a,
max_elem,
array_remove(array_agg(elements), max_elem) as remaining -- C
FROM (
SELECT
*,
MAX(elements) OVER (PARTITION BY a) as max_elem -- B
FROM (
SELECT
a,
unnest((string_to_array(a, ','))::int[]) as elements -- A
FROM arrays
)s
)s
GROUP BY a, max_elem
A: string_to_array converts the string list into an array. Because the arrays are treated as string arrays you need the cast them into integer arrays by adding ::int[]. The unnest() expands all array elements into own rows.
B: window function MAX gives the maximum value of the single arrays as max_elem
C: array_agg reaggregates the elements through the GROUP BY id. After that array_remove removes the max_elem value from the array.
If you do not like to store them as pure arrays but as string list again you could add array_to_string. But I wouldn't recommend this because your data are integer arrays and not strings. For every further calculation you would need this string cast. A even better way (as already stated by #stickybit) is not to store the elements as arrays but as unnested data. As you can see in nearly every operation should would do the unnest before.
Note:
It would be better to use an ID to adress the columns/arrays instead of the origin string as in SQL Fiddle with IDs
If you install the extension intarray this is quite easy.
First you need to create the extension (you have to be superuser to do that):
create extension intarray;
Then you can do the following:
select original_ids,
original_ids[1] as max_id,
sort(original_ids - original_ids[1]) as remaining_ids
from (
select sort_desc(string_to_array(original_ids,',')::int[]) as original_ids
from bad_design
) t
But you shouldn't be storing comma separated values to begin with

How to use regexp_matches() in an UPDATE statement?

I am trying to clean up a table that has a very messy varchar column, with entries of the sorts:
<u><font color="#0000FF">VA Lidar</font></u> OR <u><font color="#0000FF">InPort Metadata</font></u>
I would like to update the column by keeping only the html links, and separating them with a coma if there are more than one. Ideally I would do something like this:
UPDATE mytable
SET column = array_to_string(regexp_matches(column,'(?<=href=").+?(?=\")','g') , ',');
But unfortunately this returns an error in Postgres 10:
ERROR: set-returning functions are not allowed in UPDATE
I assume regexp_matches() is the said set-returning function. Any ideas on how I can achieve this?
Notes
1.
You don't need to base the correlated subquery on a separate instance of the base table (like other answers suggested). That would be doing more work for nothing.
2.
For simple cases an ARRAY constructor is cheaper than array_agg(). See:
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
3.
I use a regular expression without lookahead and lookbehind constraints and parentheses instead: href="([^"]+)
See query 1.
This works because parenthesized subexpressions are captured by regexp_matches() (and several other Postgres regexp functions). So we can replace the more sophisticated constraints with plain parentheses. The manual on regexp_match():
If a match is found, and the pattern contains no parenthesized
subexpressions, then the result is a single-element text array
containing the substring matching the whole pattern. If a match is
found, and the *pattern* contains parenthesized subexpressions, then the
result is a text array whose n'th element is the substring matching
the n'th parenthesized subexpression of the pattern
And for regexp_matches():
This function returns no rows if there is no match, one row if there
is a match and the g flag is not given, or N rows if there are N
matches and the g flag is given. Each returned row is a text array
containing the whole matched substring or the substrings matching
parenthesized subexpressions of the pattern, just as described above
for regexp_match.
4.
regexp_matches() returns a set of arrays (setof text[]) for a reason: not only can a regular expression match several times in a single string (hence the set), it can also produce multiple strings for each single match with multiple capturing parentheses (hence the array). Does not occur with this regexp, every array in the result holds a single element. But future readers shall not be lead into a trap:
When feeding the resulting 1-D arrays to array_agg() (or an ARRAY constructor) that produces a 2-D array - which is only even possible since Postgres 9.5 added a variant of array_agg() accepting array input. See:
Is there something like a zip() function in PostgreSQL that combines two arrays?
However, quoting the manual:
inputs must all have same dimensionality, and cannot be empty or NULL
I think this can never fail as the same regexp always produces the same number of array elements. Ours always produces one element. But that may be different with other regexp. If so, there are various options:
Only take the first element with (regexp_matches(...))[1]. See query 2.
Unnest arrays and use string_agg() on base elements. See query 3.
Each approach works here, too.
Query 1
UPDATE tbl t
SET col = (
SELECT array_to_string(ARRAY(SELECT regexp_matches(col, 'href="([^"]+)', 'g')), ',')
);
Columns with no match are set to '' (empty string).
Query 2
UPDATE tbl
SET col = (
SELECT string_agg(t.arr[1], ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
);
Columns with no match are set to NULL.
Query 3
UPDATE tbl
SET col = (
SELECT string_agg(elem, ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
, unnest(t.arr) elem
);
Columns with no match are set to NULL.
db<>fiddle here (with extended test case)
You could use a correlated subquery to deal with the offending set-returning function (which is regexp_matches). Something like this:
update mytable
set column = (
select array_to_string(array_agg(x), ',')
from (
select regexp_matches(t2.c, '(?<=href=").+?(?=\")', 'g')
from t t2
where t2.id = t.id
) dt(x)
)
You're still stuck with the "CSV in a column" nastiness but that's a separate issue and presumably not a problem for you.
Building on the approach of mu is too short with slightly different regex and a COALESCE function to retain values that do not contain href-links:
UPDATE a
SET bad_data = COALESCE(
(SELECT Array_to_string(Array_agg(x), ',')
FROM (SELECT Regexp_matches(a.bad_data,
'(?<=href=")[^"]+', 'g'
) AS x
FROM a a2
WHERE a2.id = a.id) AS sub), bad_data
);
SQL Fiddle

Determine MAX Decimal Scale Used on a Column

In MS SQL, I need a approach to determine the largest scale being used by the rows for a certain decimal column.
For example Col1 Decimal(19,8) has a scale of 8, but I need to know if all 8 are actually being used, or if only 5, 6, or 7 are being used.
Sample Data:
123.12345000
321.43210000
5255.12340000
5244.12345000
For the data above, I'd need the query to either return 5, or 123.12345000 or 5244.12345000.
I'm not concerned about performance, I'm sure a full table scan will be in order, I just need to run the query once.
Not pretty, but I think it should do the trick:
-- Find the first non-zero character in the reversed string...
-- And then subtract from the scale of the decimal + 1.
SELECT 9 - PATINDEX('%[1-9]%', REVERSE(Col1))
I like #Michael Fredrickson's answer better and am only posting this as an alternative for specific cases where the actual scale is unknown but is certain to be no more than 18:
SELECT LEN(CAST(CAST(REVERSE(Col1) AS float) AS bigint))
Please note that, although there are two explicit CAST calls here, the query actually performs two more implicit conversions:
As the argument of REVERSE, Col1 is converted to a string.
The bigint is cast as a string before being used as the argument of LEN.
SELECT
MAX(CHAR_LENGTH(
SUBSTRING(column_name::text FROM '\.(\d*?)0*$')
)) AS max_scale
FROM table_name;
*? is the non-greedy version of *, so \d*? catches all digits after the decimal point except trailing zeros.
The pattern contains a pair of parentheses, so the portion of the text that matched the first parenthesized subexpression (that is \d*?) is returned.
References:
https://www.postgresql.org/docs/9.6/static/sql-createcast.html
https://www.postgresql.org/docs/9.6/static/functions-matching.html
Note this will scan the entire table:
SELECT TOP 1 [Col1]
FROM [Table]
ORDER BY LEN(PARSENAME(CAST([Col1] AS VARCHAR(40)), 1)) DESC