Find the max value from a column that holds a JSON object of key-value pairs - SQL

I have a table with a column containing a JSON string (key-value pairs) of items, and I want to return only the key-value pair with the largest value.
I can do this by first UNNESTing the JSON object and then taking the largest value with ORDER BY item, value (DESC) and array_agg. The problem is that this means creating multiple intermediate tables and is slow. I am hoping to extract the largest key-value pair in a single operation.
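For reference, the multi-step version looks roughly like this (a sketch only; it assumes items is stored as JSON, uses the column names from the sample below, and mytable stands in for the real table name):
SELECT
    id,
    element_at(array_agg(item ORDER BY value DESC), 1) AS item
FROM (
    -- expand the JSON map into one row per key/value pair
    SELECT id, t.item, t.value
    FROM mytable
    CROSS JOIN UNNEST(CAST(items AS map(varchar, double))) AS t(item, value)
) x
GROUP BY id;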
This:
| id | items |
| -- | ---------------------------------- |
| 1 | {Item1=7.3, Item2=1.3, Item3=9.8} |
| 2 | {Item2=4.4, Item3=5.2, Item1=0.1} |
| 3 | {Item5=6.6, Item2=1.4, Item4=1.5} |
| 4 | {Item6=0.9, Item7=11.2, Item4=8.1} |
Should become:
| id | item | value |
| -- | ----- | ----- |
| 1 | Item3 | 9.8 |
| 2 | Item3 | 5.2 |
| 3 | Item5 | 6.6 |
| 4 | Item7 | 11.2 |
I don't actually need the value, so long as the item is the largest from the JSON object, so the following would be fine as well:
| id | item |
| -- | ----- |
| 1 | Item3 |
| 2 | Item3 |
| 3 | Item5 |
| 4 | Item7 |

Presto's UNNEST performance was improved in Presto 316. However, you don't need UNNEST in this case. You can:
- convert your JSON to an array of key/value pairs using a CAST and map_entries;
- reduce the array to pick the key with the highest value.
Since key/value pairs are represented as anonymous row elements, it's very convenient to use positional access to row elements with the subscript operator (available since Presto 314).
Use a query like:
SELECT
id,
reduce(
-- convert JSON to an array of key/value pairs
map_entries(CAST(data AS map(varchar, double))),
-- initial state for reduce (must be same type as key/value pairs)
(CAST(NULL AS varchar), -1e0), -- assuming your values cannot be negative
-- reduction function
(state, element) -> if(state[2] > element[2], state, element),
-- reduce output function
state -> state[1]
) AS top
FROM (VALUES
(1, JSON '{"Item1":7.3, "Item2":1.3, "Item3":9.8}'),
(4, JSON '{"Item6":0.9, "Item7":11.2, "Item4":8.1}'),
(5, JSON '{}'),
(6, NULL)
) t(id, data);
Output
id | top
----+-------
1 | Item3
4 | Item7
5 | NULL
6 | NULL
(4 rows)
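Applied to the table from the question, the same expression would look like this (a sketch; it assumes the JSON column is called items and the table mytable):
SELECT
    id,
    reduce(
        -- convert JSON to array of key/value pairs
        map_entries(CAST(items AS map(varchar, double))),
        -- initial state (assumes values cannot be negative)
        (CAST(NULL AS varchar), -1e0),
        -- keep whichever pair has the larger value
        (state, element) -> if(state[2] > element[2], state, element),
        -- return only the key
        state -> state[1]
    ) AS item
FROM mytable;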

Store the values one per row in a child table.
CREATE TABLE child (
id INT NOT NULL,
item VARCHAR(6) NOT NULL,
value DECIMAL(9,1),
PRIMARY KEY (id, item)
);
You don't have to do a join to find the largest per group; just use a window function:
WITH cte AS (
SELECT id, item, ROW_NUMBER() OVER (PARTITION BY id ORDER BY value DESC) AS rownum
FROM child
)
SELECT * FROM cte WHERE rownum = 1;
Solving this with JSON is a bad idea. It makes your table denormalized, it makes the queries harder to design, and I predict it will make the query performance worse.

Related

How to merge arrays of different table entries together?

I have a table which keeps track of price changes at different dates:
| ID | price | date |
|----|-----------------|------------|
| 1 | {118, 123, 144} | 2020/12/05 |
| 2 | {222, 333, 231} | 2020/12/06 |
| 3 | {99, 55, 33} | 2020/12/07 |
I would like to retrieve all prices in a single array ordered by their date so I would get the result like this:
| ID | price |
|----|------------------------------------|
| 1 | {118,123,144,222,333,231,99,55,33} |
How can I achieve this? It seems to me that the challenge is creating a new array or appending values to an existing one.
You can unnest and re-aggregate with proper ordering:
select array_agg(x.val order by t.date, x.rn) as all_prices
from mytable t
cross join lateral unnest(t.price) with ordinality as x(val, rn)
One (presumably) important thing is to keep track of the position of each element in the original array, which we can then use as a second-level sorting criterion when aggregating.
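To see what WITH ORDINALITY contributes, here is a small standalone illustration (sample rows made up to mirror the question): each array element comes back paired with its 1-based position (rn), which is what provides the second-level sort key above.
SELECT t.id, t.date, x.val, x.rn
FROM (VALUES
        -- sample rows mirroring the question's data (made up here for illustration)
        (1, ARRAY[118, 123, 144], DATE '2020-12-05'),
        (2, ARRAY[222, 333, 231], DATE '2020-12-06')
     ) AS t(id, price, date)
CROSS JOIN LATERAL unnest(t.price) WITH ORDINALITY AS x(val, rn)
ORDER BY t.date, x.rn;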

How to select properties of arrays of indeterminate size (SQL Server)

I have a table of arrays, and would like to return a list of items.
| table |
|---------------------------------------|
| [{"item": 1},{"item": 2},{"item": 3}] |
| [{"item": 4},{"item": 5},{"item": 6}] |
| [{"item": 7},{"item": 8},{"item": 9}] |
Example Query
The following method will return a wide table of items, but it doesn't scale well.
select
json_value([table], '$[0].item'),
json_value([table], '$[1].item'),
json_value([table], '$[2].item')
from someTable;
How can I select properties of arrays of indeterminate size? (like 100)
Desired Output
| items |
|-------|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
You need to use OPENJSON to achieve this. Also, as noted in the code comments below, you should choose a name other than Table for the column. TABLE is a reserved keyword, and it's confusing because a column is not a table.
WITH VTE AS(
SELECT *
FROM (VALUES(N'[{"item": 1},{"item": 2},{"item": 3}]'),
(N'[{"item": 4},{"item": 5},{"item": 6}]'),
(N'[{"item": 7},{"item": 8},{"item": 9}]'))V([Table])) --Table isn't a good choice of a came for a column.
--TABLE is a reserved keyword.
SELECT T.item
FROM VTE V
CROSS APPLY OPENJSON(V.[Table]) --Table isn't a good choice of a name for a column. TABLE is a reserved keyword.
WITH(item int) T;

Casting string to int i.e. the string "res"

I have a column in a table which is of type array<string>. The table is partitioned daily since 2018-01-01. At some stage, the values in the array go from strings to integers. The data looks like this:
| yyyy_mm_dd | h_id | p_id | con |
|------------|-------|------|---------------|
| 2018-10-01 | 52988 | 1 | ["res", "av"] |
| 2018-10-02 | 52988 | 1 | ["1","2"] |
| 2018-10-03 | 52988 | 1 | ["1","2"] |
There is a mapping between the strings and integers: "res" maps to 1, "av" maps to 2, etc. I've written a query to perform some logic; here is a snippet (subquery) of it:
SELECT
t.yyyy_mm_dd,
t.h_id,
t.p_id,
CAST(e.con AS INT) AS api
FROM
my_table t
LATERAL VIEW EXPLODE(con) e AS con
My problem is that this doesn't work for the earlier dates, when strings were used instead of integers. Is there any way to select con and remap the strings to integers so the data is consistent across all partitions?
Expected output:
| yyyy_mm_dd | h_id | p_id | con |
|------------|-------|------|---------------|
| 2018-10-01 | 52988 | 1 | ["1","2"] |
| 2018-10-02 | 52988 | 1 | ["1","2"] |
| 2018-10-03 | 52988 | 1 | ["1","2"] |
Once the values selected are all integers (within a string array), the CAST(e.con AS INT) will work.
Edit: To clarify, I will put the solution as a subquery before I use lateral view explode. This way I am exploding on a table where all partitions have integers in con. I hope this makes sense.
CAST(e.api as INT) returns NULL if not possible to cast. collect_list will collect an array including duplicates and without NULLs. If you need array without duplicated elements, use collect_set().
SELECT
t.yyyy_mm_dd,
t.h_id,
t.p_id,
collect_list(--array of integers
--cast case as string if you need array of strings
CASE WHEN e.api = 'res' THEN 1
WHEN e.api = 'av' THEN 2
--add more cases
ELSE CAST(e.api as INT)
END
) as con
FROM
my_table t
LATERAL VIEW EXPLODE(con) e AS api
GROUP BY t.yyyy_mm_dd, t.h_id, t.p_id
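If, as the edit in the question describes, the remapping should happen in a subquery before the LATERAL VIEW EXPLODE, the query above can be wrapped along these lines (a sketch; it reuses the same CASE mapping, keeps the cast to string so con stays an array<string>, and then casts to INT in the outer query):
SELECT
    fixed.yyyy_mm_dd,
    fixed.h_id,
    fixed.p_id,
    CAST(e.con AS INT) AS api
FROM (
    SELECT
        t.yyyy_mm_dd,
        t.h_id,
        t.p_id,
        collect_list(
            CAST(CASE WHEN e.api = 'res' THEN 1
                      WHEN e.api = 'av'  THEN 2
                      -- add more mappings here
                      ELSE CAST(e.api AS INT)
                 END AS STRING)
        ) AS con
    FROM my_table t
    LATERAL VIEW EXPLODE(con) e AS api
    GROUP BY t.yyyy_mm_dd, t.h_id, t.p_id
) fixed
LATERAL VIEW EXPLODE(fixed.con) e AS con;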

How to create INSERT query that adds sequence number in one table to another

I have a table sample_1 in a Postgres 10.7 database with some longitudinal research data and an ascending sequence number per key. I need to INSERT data from a staging table (sample_2) maintaining the sequence column accordingly.
Sequence numbers are 0-based. I assume I need a query to find the greatest sequence number per key in sample_1 and add that to each new row's follow-up sequence number. I'm mainly struggling with the sequence number arithmetic at this step. I tried this:
INSERT INTO sample_1 (KEY, SEQUENCE, DATA)
SELECT KEY, sample_2.SEQUENCE + max(sample_1.SEQUENCE), DATA
FROM sample_2;
However, I get errors saying I can't use 'sample_1.SEQUENCE' in line 2 because that's the table being inserted into. I can't figure out how to do the arithmetic with my insert sequence!
Sample data:
sample_1
| KEY | SEQUENCE | DATA |
+-------------+----------+------+
| YMH_0001_XX | 0 | a |
| YMH_0001_XX | 1 | b |
| YMH_0002_YY | 0 | c |
sample_2
| KEY | SEQUENCE | DATA |
+-------------+----------+------+
| YMH_0001_XX | 1 | d |
| YMH_0002_YY | 1 | e |
| YMH_0002_YY | 2 | f |
I want to continue ascending sequence numbers per key for inserted rows.
To be clear, the resultant table in this example would be 3 columns and 6 rows as such:
sample_1
| KEY | SEQUENCE | DATA |
+-------------+----------+------+
| YMH_0001_XX | 0 | a |
| YMH_0001_XX | 1 | b |
| YMH_0001_XX | 2 | d |
| YMH_0002_YY | 0 | c |
| YMH_0002_YY | 1 | e |
| YMH_0002_YY | 2 | f |
That should do what you are after:
INSERT INTO sample_1 (key, sequence, data)
SELECT s2.key
, COALESCE(s1.seq_base, -1)
+ row_number() OVER (PARTITION BY s2.key ORDER BY s2.sequence)
, s2.data
FROM sample_2 s2
LEFT JOIN (
SELECT key, max(sequence) AS seq_base
FROM sample_1
GROUP BY 1
) s1 USING (key);
Notes
You need to build on the existing maximum sequence per key in sample_1. (I named it seq_base.) Compute that in a subquery and join to it.
Add row_number() to it as demonstrated. That preserves the relative order of the input rows while discarding their absolute numbers.
We need the LEFT JOIN to avoid losing rows with new keys from sample_2.
Likewise, we need COALESCE to start a fresh sequence for new keys. Default to -1 to effectively start sequences with 0 after adding the 1-based row number.
This is not safe for concurrent execution, but I don't think that's your use case.

How to find the size of JSON stored in a column in Postgres

We are using Postgres. In one table, we have a column of type JSON.
How do I find the size of the JSON stored for a particular row? And how do I find the row with the largest JSON value in that column?
If you want to know how many bytes it takes to store a column value, then you can use
pg_column_size(any) - Number of bytes used to store a particular value
(possibly compressed)
Example:
SELECT pg_column_size(int2) AS int2, pg_column_size(int4) AS int4,
pg_column_size(int8) AS int8, pg_column_size(text) AS text,
pg_column_size(bpchar) AS bpchar, pg_column_size(char) AS char,
pg_column_size(bool) AS bool, pg_column_size(to_json) AS to_json,
pg_column_size(to_jsonb) AS to_jsonb,
pg_column_size(json_build_object) AS json_build_object,
pg_column_size(jsonb_build_object) AS jsonb_build_object,
octet_length(text) AS o_text, octet_length(bpchar) AS o_bpchar,
octet_length(char) AS o_char, octet_length(to_json::text) AS o_to_json,
octet_length(to_jsonb::text) AS o_to_jsonb,
octet_length(json_build_object::text) AS o_json_build_object,
octet_length(jsonb_build_object::text) AS o_jsonb_build_object
FROM (SELECT 1::int2, 1::int4, 1::int8, 1::text, 1::char, '1'::"char",
1::boolean, to_json(1), to_jsonb(1), json_build_object(1,'test'),
jsonb_build_object(1,'test')
) AS sub
Result:
int2 | int4 | int8 | text | bpchar | char | bool | to_json | to_jsonb | json_build_object | jsonb_build_object | o_text | o_bpchar | o_char | o_to_json | o_to_jsonb | o_json_build_object | o_jsonb_build_object
------+------+------+------+--------+------+------+---------+----------+-------------------+--------------------+--------+----------+--------+-----------+------------+---------------------+----------------------
2 | 4 | 8 | 5 | 5 | 1 | 1 | 5 | 20 | 18 | 21 | 1 | 1 | 1 | 1 | 1 | 14 | 13
Getting the row with the largest JSON value is simply a matter of sorting by pg_column_size(json_column) DESC and taking LIMIT 1.
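For example (a sketch, assuming a table my_table with an id column and a json column named json_col):
SELECT id, pg_column_size(json_col) AS json_bytes
FROM my_table
ORDER BY pg_column_size(json_col) DESC
LIMIT 1;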
I think:
Just to know what the biggest value is:
select max(pg_column_size(json)) from table;
To know the ID of the biggest value:
select id, pg_column_size(json)
from table
group by id
order by max(pg_column_size(json)) desc limit 1;
Seems to work for me, but I'm not much of an expert.
I assume you are looking for octet_length?
https://www.postgresql.org/docs/current/static/functions-string.html
Number of bytes in string
t=# with d(j) as (values('{"a":0}'),('{"a":null}'),('{"abc":0}'))
select j,octet_length(j) from d;
j | octet_length
------------+--------------
{"a":0} | 7
{"a":null} | 10
{"abc":0} | 9
(3 rows)
so max is:
t=# with d(j) as (values('{"a":0}'),('{"a":null}'),('{"abc":0}'))
select j from d order by octet_length(j) desc limit 1;
j
------------
{"a":null}
(1 row)