PostgreSQL: Efficiently split JSON array into rows - sql

I have a table (Table A) that includes a text column that contains JSON encoded data.
The JSON data is always an array containing between one and a few thousand plain objects.
I have another table (Table B) with a few columns, including a column with a datatype of JSON.
I want to select all the rows from Table A, split the JSON array into its elements, and insert each element into Table B.
Bonus objective: each object (almost) always has a key, x. I want to pull the value of x out into its own column, and delete x from the original object (if it exists).
E.g.: Table A
| id | json_array (text) |
+----+--------------------------------+
| 1 | '[{"x": 1}, {"y": 8}]' |
| 2 | '[{"x": 2, "y": 3}, {"x": 1}]' |
| 3 | '[{"x": 8, "z": 2}, {"z": 3}]' |
| 4 | '[{"x": 5, "y": 2, "z": 3}]' |
...would become: Table B
| id | a_id | x | json (json) |
+----+------+------+--------------------+
| 0 | 1 | 1 | '{}' |
| 1 | 1 | NULL | '{"y": 8}' |
| 2 | 2 | 2 | '{"y": 3}' |
| 3 | 2 | 1 | '{}' |
| 4 | 3 | 8 | '{"y": 2}' |
| 5 | 3 | NULL | '{"z": 3}' |
| 6 | 4 | 5 | '{"y": 2, "z": 3}' |
This initially has to work on a few million rows, and would then need to be run at regular intervals, so making it efficient would be a priority.
Is it possible to do this without using a loop and PL/PgSQL? I haven't been making much progress.

The json data type is not particularly suitable (or intended) for modification at the database level. Extracting the "x" keys from the JSON objects (and removing them) is therefore cumbersome, although it can be done.
You should create your table B (with hopefully a more creative column name than "json"; I am using item here) and make the id column a serial that starts at 0. As a point of reference, a minimal definition could look like the sketch below (column types are inferred from the example data; the exact constraints are up to you):
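CREATE TABLE b (
    id    serial PRIMARY KEY,
    a_id  integer NOT NULL,   -- points at a.id; add a foreign key if appropriate
    x     integer,
    item  json
);
-- a serial sequence starts at 1 by default; allow 0 and restart if the ids really must begin at 0
ALTER SEQUENCE b_id_seq MINVALUE 0 START WITH 0 RESTART WITH 0;
With table B in place, a pure json solution then looks like this: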
INSERT INTO b (a_id, x, item)
SELECT sub.a_id, sub.x,
       ('{' ||
        string_agg(
          CASE WHEN i.k IS NULL THEN '' ELSE '"' || i.k || '":' || i.v END,
          ', ') ||
        '}')::json
FROM (
  SELECT a.id AS a_id, (j.items->>'x')::integer AS x, j.items
  FROM a, json_array_elements(json_array::json) j(items)) sub
LEFT JOIN json_each(sub.items) i(k,v) ON i.k <> 'x'
GROUP BY sub.a_id, sub.x
ORDER BY sub.a_id;
In the sub-query this extracts the a_id and x values, as well as the JSON object itself (note that the text column json_array has to be cast to json before it can be unnested). In the outer query the JSON object is broken into its individual key/value pairs and the pairs with key x are thrown out (the LEFT JOIN ON i.k <> 'x'). In the select list the pieces are put back together again with string concatenation and grouped into compound objects.
This necessarily has to be done this way because json has no built-in manipulation functions of any consequence. It works on PG versions 9.3+, i.e. since time immemorial insofar as JSON support is concerned.
If you are using PG 9.5+, the solution is much simpler through a cast to jsonb:
INSERT INTO b (a_id, x, item)
SELECT a.id, (j.items->>'x')::integer, j.items #- '{x}'
FROM a, jsonb_array_elements(json_array::jsonb) j(items);
The #- operator on the jsonb data type does all the dirty work here. Obviously, there is a lot of work going on behind the scenes converting the text to jsonb, so if you find that you need to manipulate your JSON objects more frequently, you are better off using the jsonb type to begin with. In your case I suggest you do some benchmarking with EXPLAIN ANALYZE SELECT ... (you can safely leave out the INSERT while testing) on perhaps 10,000 rows to see which works best for your setup.
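Such a benchmarking run on the jsonb variant could look like the sketch below (the WHERE clause that grabs roughly 10,000 rows is an assumption; adjust it to however your ids are distributed):
EXPLAIN ANALYZE
SELECT a.id, (j.items->>'x')::integer, j.items #- '{x}'
FROM a, jsonb_array_elements(json_array::jsonb) j(items)
WHERE a.id <= 10000;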

Related

Match two jsonb documents by order of elements in array

I have a table of data jsonb documents in Postgres and a second table containing templates for the data.
I need to match each data jsonb row with its template jsonb row purely by the order of the elements in the arrays, in an efficient way.
template jsonb document:
{
  "template": 1,
  "rows": [
    "first row",
    "second row",
    "third row"
  ]
}
data jsonb document:
{
  "template": 1,
  "data": [
    125,
    578,
    445
  ]
}
desired output:
| Desc       | Amount |
| ---------- | ------ |
| first row  | 125    |
| second row | 578    |
| third row  | 445    |
template table:
| id | jsonb |
| -------- | ------------------------------------------------------ |
| 1 | {"template":1,"rows":["first row","second row","third row"]} |
| 2 | {"template":2,"rows":["first row","second row","third row"]} |
| 3 | {"template":3,"rows":["first row","second row","third row"]} |
data table:
| id | jsonb |
| -------- | ------------------------------------------- |
| 1 | {"template":1,"data":[125,578,445]} |
| 2 | {"template":1,"data":[125,578,445]} |
| 3 | {"template":2,"data":[125,578,445]} |
I have millions of data jsonb documents and hundreds of templates.
I would do it just by converting both to tables and then using the row_number window function, but that does not seem like a very efficient approach to me.
Is there a better way of doing this?
You will have to normalize this mess "on-the-fly" to get the output you want.
You need to unnest each array with jsonb_array_elements_text(), using the with ordinality option to get the array index, and you can join the two tables by extracting the value of the template key:
Assuming you want to return this for a specific row from the data table:
select td.val, dt.val
from data
cross join jsonb_array_elements_text(data.jsonb_column -> 'data') with ordinality as dt(val, idx)
left join template tpl
on tpl.jsonb_column ->> 'template' = data.jsonb_column ->> 'template'
left join jsonb_array_elements_text(tpl.jsonb_column -> 'rows') with ordinality as td(val, idx)
on td.idx = dt.idx
where data.id = 1;
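If you need the matching done for all data rows at once rather than for a single id (a guess, given the "millions of documents" remark), the same shape works without the WHERE clause; a sketch, using the same assumed jsonb_column name:
select data.id, td.val as row_desc, dt.val as amount
from data
cross join jsonb_array_elements_text(data.jsonb_column -> 'data') with ordinality as dt(val, idx)
left join template tpl
  on tpl.jsonb_column ->> 'template' = data.jsonb_column ->> 'template'
left join jsonb_array_elements_text(tpl.jsonb_column -> 'rows') with ordinality as td(val, idx)
  on td.idx = dt.idx
order by data.id, dt.idx;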

Sort SQL results and include missing keys

I have a Postgres table like this (greatly simplified):
id | object_id (foreign id) | key (text) | value (text)
1 | 1 | A | 0foo
2 | 1 | B | 1bar
3 | 1 | C | 2baz
4 | 1 | D | 3ham
5 | 2 | C | 4sam
6 | 3 | F | 5pam
…
(billions of rows)
I select object_ids according to some query (not relevant here), and then sort them according to the value of a specified key.
def sort_query_result(query, sort_by, limit, offset):
    return query\
        .with_entities(Table.object_id)\
        .filter(Table.key == sort_by)\
        .order_by(desc(Table.value))\
        .limit(limit).offset(offset).subquery()
For example, assume a query matches object_ids 1 and 2 above. When sort_by=C, I want the result to be returned in the order [2, 1], because 4sam > 2baz.
This works well but there's one big problem:
Object ids that are matched by the query but do not have any row for the sort_by key are not returned at all.
For example, for a query that matches object_ids 1 and 2, sort_query_result(query, sort_by='D') == [1]. The object_id 2 is dropped because it has no D row, which is undesirable.
Instead, I'd like to return all object_ids from the query. Those without the sort key should be sorted at the end, in any order: sort_query_result(query, sort_by='D') == [1, 2].
What's the best way to achieve that?
Note: I do not have the freedom to change the DB schema or business logic. But I can change the query code. I use SQLAlchemy ORM from Python, but could execute raw Postgres commands if necessary. Thank you.
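One possible direction (only a sketch; matched_ids stands in for whatever the original query yields and entries for the table shown above, neither of which is named in the question) is to left-join the matched ids against the rows for the sort key, so that ids without that key survive the join, and then push their NULL sort values to the end:
SELECT m.object_id
FROM matched_ids m
LEFT JOIN entries e
       ON e.object_id = m.object_id
      AND e.key = 'D'              -- the sort_by key
ORDER BY e.value DESC NULLS LAST
LIMIT 10 OFFSET 0;
In SQLAlchemy the same shape can be built with outerjoin() and nullslast(); the exact code depends on how the original query object is constructed.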

Check if one json value is in the json value from another JSON, for each key

With SQL in postgres, I want to know if one JSON is 'IN' another JSON.
For example:
json_1 = {"a": ["123"], "b": ["456", "789"]}
json_2 = {"a": ["123"], "b": ["456"]}
In the above case, json_2["a"] is in json_1["a"] and json_2["b"] is in json_1["b"].
If I knew all the possible keys of the JSON, I could easily write the above per key. However, the problem is that I don't know how many keys there are or which keys appear in the JSON. How can I check, for every key in the JSON, whether json_2 is in json_1?
I'm not sure what output format you want, but this will create one row per key, with a boolean stating whether the json_2 key's array values are contained within the json_1 key's values.
CREATE TABLE t (json_1 JSONB, json_2 JSONB);

INSERT INTO t
VALUES
  ('{"a":["123"],"b":["456","789","aaa"],"c":["999"],"d":[]}',
   '{"a":["123"],"b":["789","456"],"c":["123"],"d":["x"]}');

Query #1
SELECT key, value <@ (json_1->key) AS contained
FROM (
  SELECT (JSONB_EACH(json_2)).*, json_1
  FROM t
) j;
Returns:
| key | contained |
| --- | --------- |
| a | true |
| b | true |
| c | false |
| d | false |
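If instead a single true/false per pair of documents is wanted (saying whether every key of json_2 is contained in json_1; an assumption about the desired output), the per-key booleans can be folded with bool_and(); a sketch:
SELECT bool_and(j.value <@ coalesce(t.json_1 -> j.key, '[]'::jsonb)) AS fully_contained
FROM t
CROSS JOIN LATERAL jsonb_each(t.json_2) AS j(key, value);
-- coalesce treats keys missing from json_1 as empty arrays; with more than one row in t,
-- group by a row identifier to get one result per row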

Filter json values regardless of keys in PostgreSQL

I have a table called diary which includes columns listed below:
| id | user_id | custom_foods |
|----|---------|--------------------|
| 1 | 1 | {"56": 2, "42": 0} |
| 2 | 1 | {"19861": 1} |
| 3 | 2 | {} |
| 4 | 3 | {"331": 0} |
I would like to count, for each user, how many diary rows have at least one custom_foods value larger than 0. I don't care about the keys, since the keys can be any number represented as a string.
The desired output is:
| user_id | count |
|---------|---------|
| 1 | 2 |
| 2 | 0 |
| 3 | 0 |
I started with:
select *
from diary as d
join json_each_text(d.custom_foods) as e
on d.custom_foods != '{}'
where e.value > 0
I don't even know whether the syntax is correct. Now I am getting the error:
ERROR: function json_each_text(text) does not exist
LINE 3: join json_each_text(d.custom_foods) as e
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
The version I am using is: psql (10.5 (Ubuntu 10.5-1.pgdg14.04+1), server 9.4.19). According to the PostgreSQL 9.4.19 documentation, that function should exist. I am so confused that I don't know how to proceed.
Threads that I referred to:
Postgres and jsonb - search value at any key
Query postgres jsonb by value regardless of keys
Your custom_foods column is defined as text, so you should cast it to json before applying json_each_text. And since json_each_text produces no rows for an empty JSON object, you can get a count of 0 for the empty ones from a separate CTE and combine the two results with a UNION ALL:
WITH empty AS (
  SELECT DISTINCT user_id, 0 AS count
  FROM diary
  WHERE custom_foods = '{}'
)
SELECT user_id,
       count(CASE WHEN value::int > 0 THEN 1 END)
FROM diary d,
     json_each_text(d.custom_foods::json)
GROUP BY user_id
UNION ALL
SELECT *
FROM empty
ORDER BY user_id;
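An alternative sketch (same assumed diary table, and it also runs on the 9.4 server): count the diary rows that contain at least one positive value by testing each row with EXISTS in a LATERAL subquery, so every user comes out of a single GROUP BY and no UNION ALL is needed:
SELECT d.user_id,
       sum(CASE WHEN p.has_positive THEN 1 ELSE 0 END) AS count
FROM diary d
CROSS JOIN LATERAL (
  SELECT EXISTS (
    SELECT 1
    FROM json_each_text(d.custom_foods::json) AS e(key, value)
    WHERE e.value::int > 0
  ) AS has_positive
) p
GROUP BY d.user_id
ORDER BY d.user_id;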

Recursive self join over file data

I know there are many questions about recursive self joins, but they're mostly in a hierarchical data structure as follows:
ID | Value | Parent id
-----------------------------
But I was wondering if there was a way to do this in a specific case that I have where I don't necessarily have a parent id. My data will look like this when I initially load the file.
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
Essentially, it's a CSV file where each row in the table is a line in the file. Lines 1 and 5 identify an object header and lines 3, 4, 7, and 8 identify the rows belonging to the object. The object header lines can have only 40 attributes, which is why the object is broken up across multiple sections in the CSV file.
What I'd like to do is take the table, separate out the record # column, and join it with itself multiple times so it achieves something like this:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,5,6,7,8,...
2 | *,record,abc,efg,hij,lmn,opq,rst
3 | ,,1,x,y,z,t,u,v,...
4 | ,,2,q,r,s,l,m,n,...
I know it's probably possible; I'm just not sure where to start. My initial idea was to create a view that separates out the first and second columns, and use that view as a way of joining repeatedly on those two columns. However, I have some problems:
- I don't know how many sections will occur in the file for the same object.
- The file can contain other objects as well, so joining on the first two columns would be problematic if you have something like:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
9 | ,4,Data,1,2,3,4,...
10 | *,record,lmn,opq,rst,...
11 | ,,1,t,u,v,...
In the above case, my plan could join rows from the Data object in row 9 with the first rows of the Formula object by matching the record value of 1.
UPDATE
I know this is somewhat confusing. I tried doing this with C# a while back, but I had to basically write a recursive descent parser to parse the specific file format, and it simply took too long because I had to get it into the database afterwards and it was too much for Entity Framework. It was taking hours just to convert one file since these files are excessively large.
Either way, @Nolan Shang has the closest result to what I want. The only difference is this (sorry for the bad formatting):
+----+------------+--------------------------+----------------------------------+
| ID | header     | x                        | value                            |
+----+------------+--------------------------+----------------------------------+
| 1  | 3,Formula, | ,1,2,3,4,5,6,7,8         | 3,Formula,1,2,3,4,5,6,7,8        |
| 2  | ,,         | ,1,x,y,z,t,u,v           | ,1,x,y,z,t,u,v                   |
| 3  | ,,         | ,2,q,r,s,l,m,n           | ,2,q,r,s,l,m,n                   |
| 4  | *,record,  | ,abc,efg,hij,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst |
| 5  | ,4,        | ,Data,1,2,3,4            | ,4,Data,1,2,3,4                  |
| 6  | *,record,  | ,lmn,opq,rst             | ,lmn,opq,rst                     |
| 7  | ,,         | ,1,t,u,v                 | ,1,t,u,v                         |
+----+------------+--------------------------+----------------------------------+
I agree that it would be better to export this to a scripting language and do it there. This will be a lot of work in TSQL.
You've intimated that there are other possible scenarios you haven't shown, so I obviously can't give a comprehensive solution. I'm guessing this isn't something you need to do quickly on a repeated basis. More of a one-time transformation, so performance isn't an issue.
One approach would be to do a LEFT JOIN to a hard-coded table of the possible identifying sub-strings like:
3,Formula,
*,record,
,,1,
,,2,
,4,Data,
Looks like it pretty much has to be human-selected and hard-coded because I can't find a reliable pattern that can be used to SELECT only these sub-strings.
Then you SELECT from this artificially-created table (or derived table, or CTE) and LEFT JOIN to your actual table with a LIKE to get all the rows that use each of these values as their starting substring, strip out the starting characters to get the rest of the string, and use the STUFF..FOR XML trick to build the desired Line.
How you get the ID column depends on what you want. For instance, in your second example, I don't know what ID you want for the ,4,Data,... line. Do you want 5 because that's the next number in the results, or do you want 9 because that's the ID of the first occurrence of that sub-string? Code accordingly. If you want 5 it's a ROW_NUMBER(). If you want 9, you can add an ID column to the artificial table you created at the start of this approach.
BTW, there's really nothing recursive about what you need done, so if you're still thinking in those terms, now would be a good time to stop. This is more of a "Group Concatenation" problem.
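A rough sketch of that hard-coded-prefix approach (the filedata table name is an assumption; the question never names the loaded table):
-- hand-picked identifying prefixes; extend this list as new objects appear in the file
DECLARE @prefixes TABLE (prefix varchar(100));
INSERT INTO @prefixes
VALUES ('3,Formula,'), ('*,record,'), (',,1,'), (',,2,'), (',4,Data,');

-- for each prefix, collect the remainders of all matching lines in file order
-- and glue them back together with the STUFF..FOR XML PATH('') trick
SELECT p.prefix,
       p.prefix + STUFF((
           SELECT ',' + SUBSTRING(f.Line, LEN(p.prefix) + 1, LEN(f.Line))
           FROM filedata AS f
           WHERE f.Line LIKE p.prefix + '%'
           ORDER BY f.ID
           FOR XML PATH('')
       ), 1, 1, '') AS Line
FROM @prefixes AS p;
If the line data can contain &, < or >, the TYPE/.value() variant of the FOR XML trick would be needed to avoid XML-escaping.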
Here is a sample, but it differs somewhat from what you need.
That is because I use the value up to the second comma as the group header, so ,,1 and ,,2 are treated as the same group; it would be better if you could use a parent id to indicate a group.
DECLARE @testdata TABLE(ID int, Line varchar(8000))
INSERT INTO @testdata
SELECT 1,'3,Formula,1,2,3,4,...' UNION ALL
SELECT 2,'*,record,abc,efg,hij,...' UNION ALL
SELECT 3,',,1,x,y,z,...' UNION ALL
SELECT 4,',,2,q,r,s,...' UNION ALL
SELECT 5,'3,Formula,5,6,7,8,...' UNION ALL
SELECT 6,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 7,',,1,t,u,v,...' UNION ALL
SELECT 8,',,2,l,m,n,...' UNION ALL
SELECT 9,',4,Data,1,2,3,4,...' UNION ALL
SELECT 10,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 11,',,1,t,u,v,...'

;WITH t AS(
    SELECT *, REPLACE(SUBSTRING(t.Line, LEN(c.header)+1, LEN(t.Line)), ',...', '') AS data
    FROM @testdata AS t
    CROSS APPLY(VALUES(LEFT(t.Line, CHARINDEX(',', t.Line, CHARINDEX(',', t.Line)+1)))) c(header)
)
SELECT MIN(ID) AS ID, t.header, c.x, t.header + STUFF(c.x, 1, 1, '') AS value
FROM t
OUTER APPLY(SELECT ','+tb.data FROM t AS tb WHERE tb.header = t.header FOR XML PATH('')) c(x)
GROUP BY t.header, c.x
+----+------------+------------------------------------------+-----------------------------------------------+
| ID | header | x | value |
+----+------------+------------------------------------------+-----------------------------------------------+
| 1 | 3,Formula, | ,1,2,3,4,5,6,7,8 | 3,Formula,1,2,3,4,5,6,7,8 |
| 3 | ,, | ,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v | ,,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v |
| 2 | *,record, | ,abc,efg,hij,lmn,opq,rst,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst,lmn,opq,rst |
| 9 | ,4, | ,Data,1,2,3,4 | ,4,Data,1,2,3,4 |
+----+------------+------------------------------------------+-----------------------------------------------+