I'm trying to implement the EAV pattern using Attribute->Value tables, but unlike the standard way, the values are stored in a jsonb field like {"attrId": [values]}. This makes it easy to write search requests like:
SELECT * FROM products p WHERE p.attributes @> '{"1": [2]}' AND p.attributes @> '{"1": [4]}';
Now I'm wondering whether this will be a good approach, and what an effective way is to calculate the count of available variations, for example:
-p1- {"width":[1]}
-p2- {"width":[2],"height":[3]}
-p3- {"width":[1]}
The output will be:
width: 1 (count 2); 2 (count 1)
height: 3 (count 1)
and when width 2 is selected:
width: 1 (count 0); 2 (count 1)
height: 3 (count 1)
"Flat is better than nested" -- the zen of python
I think you would be better served by simple key/value pairs, and in the rare event you have a complex value, you can make it a list. But I don't see that use case here.
Here is an example which answers your question. It could be modified to use your structure, but let's keep it simple:
First create a table and insert some JSON:
# create table foo (a jsonb);
# insert into foo values ('{"a":"1", "b":"2"}');
# insert into foo values ('{"c":"3", "d":"4"}');
# insert into foo values ('{"e":"5", "a":"6"}');
Here are the records:
# select * from foo;
a
----------------------
{"a": "1", "b": "2"}
{"c": "3", "d": "4"}
{"a": "6", "e": "5"}
(3 rows)
Here is the output of the jsonb_each_text() function from https://www.postgresql.org/docs/9.6/static/functions-json.html
# select jsonb_each_text(a) from foo;
jsonb_each_text
-----------------
(a,1)
(b,2)
(c,3)
(d,4)
(a,6)
(e,5)
(6 rows)
Now we need to put it in a table expression to be able to access the individual fields:
# with t1 as (select jsonb_each_text(a) as rec from foo)
select (rec).key, (rec).value from t1;
key | value
-----+-------
a | 1
b | 2
c | 3
d | 4
a | 6
e | 5
(6 rows)
And lastly, here is a grouping with the SUM function. Notice that the a key, which was in the database twice, has been properly summed.
# with t1 as (select jsonb_each_text(a) as rec from foo)
select (rec).key, sum((rec).value::int) from t1 group by (rec).key;
key | sum
-----+-----
c | 3
b | 2
a | 7
e | 5
d | 4
(5 rows)
As a final note, (rec) has parentheses around it because otherwise it would incorrectly be treated as a table reference and result in this error:
ERROR: missing FROM-clause entry for table "rec"
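Coming back to your structure: the same unnest-and-group technique gives per-attribute value counts. Here is a minimal sketch (my adaptation, untested against your schema), assuming a table products(attributes jsonb) holding the {"attrId": [values]} shape; to recount after a filter is selected, add e.g. AND p.attributes @> '{"width": [2]}' to a WHERE clause:
# select a.key as attribute, v.value as value, count(*) as cnt
  from products p
  cross join lateral jsonb_each(p.attributes) a      -- one row per attribute key
  cross join lateral jsonb_array_elements(a.value) v -- one row per value in its array
  group by a.key, v.value
  order by a.key, v.value;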
I have a table with two columns, R(id int, dat jsonb). The jsonb column dat holds a 2D array [][]. For example:
id| dat
1 | {"name":"a","numbers":[[1,2],[3,4],[5,6],[1,3]]}
2 | {"age":5,"numbers":[[1,1]]}
3 | {"numbers":[[5,6],[6,7]]}
I'm trying to find all the ids that contain a specific number in one of those sub-arrays. I used two solutions, and I want to understand why the first one isn't working:
1)
select * from R
where exists (
select from jsonb_array_elements(R.dat->'numbers')->>0 first,jsonb_array_elements(range.data->'numbers')->>1 second where first::decimal= 1 and second::decimal= 1
);
ERROR: syntax error at or near "->>"
LINE 3: ...t from jsonb_array_elements(R.dat->'numbers')->>0 first,j...
2)
SELECT *
FROM R
WHERE EXISTS (
SELECT FROM jsonb_array_elements(R.dat-> 'numbers') subarray
WHERE (subarray->>0)::decimal = 1 and (subarray->>1)::decimal = 1
);
In addition, I saw that a GIN index doesn't handle this operator, so basically, will any index help here?
Your first query raises an error because you can use only table expressions (not value expressions) in the FROM clause.
You can make the second query a bit simpler:
select *
from r
where exists (
select from jsonb_array_elements(dat->'numbers') subarray
where subarray = '[1,1]'
);
or using the function in a lateral join:
select r.*
from r
cross join jsonb_array_elements(dat->'numbers')
where value = '[1,1]';
There is no index that could support these queries because of the use of jsonb_array_elements().
You may be tempted to use the containment operator @> in a way like this:
select *
from r
where dat->'numbers' @> '[[1,1]]'::jsonb
id | dat
----+------------------------------------------------------------
1 | {"name": "a", "numbers": [[1, 2], [3, 4], [5, 6], [1, 3]]}
2 | {"age": 5, "numbers": [[1, 1]]}
(2 rows)
Unfortunately, as you can see, it does not work as you might expect. The use of the operator on arrays is a bit tricky, as it works this way: array1 @> array2 is true if for each element j of array2, there is an i in array1 such that i @> j. Hence, per the documentation:
the order of array elements is not significant when doing a containment match, and duplicate array elements are effectively considered only once.
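A possible workaround (my sketch, not part of the original answer): let a GIN expression index pre-filter candidate rows with @>, then re-check the exact sub-array with jsonb_array_elements(). The index name is made up for illustration:
create index r_dat_numbers_gin on r using gin ((dat->'numbers'));

select *
from r
where dat->'numbers' @> '[[1,1]]'   -- indexable, but may over-match (see above)
and exists (
   select from jsonb_array_elements(dat->'numbers') subarray
   where subarray = '[1,1]'         -- exact re-check
);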
I'm working on URL extraction on AWS Redshift. The URL column looks like this:
url                       item    origin
http://B123//ajdsb        apple   US
http://BYHG//B123         banana  UK
http://B325//BF89//BY85   candy   CA
I want to extract the series that start with B, and also expand to multiple rows when there are multiple series in a URL:
extracted  item    origin
B123       apple   US
BYHG       banana  UK
B123       banana  UK
B325       candy   CA
BF89       candy   CA
BY85       candy   CA
My current code is:
select REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})') as extracted, item, origin
from data
The regex part works well, but I have problems extracting multiple values and expanding them into new rows. I tried REGEXP_MATCHES(url, '(B[0-9A-Z]{3})', 'g'), but the function regexp_matches does not exist on Redshift...
The solution I use is fairly ugly but achieves the desired results. It involves using REGEXP_COUNT to determine the maximum number of matches in a row, then joining the resulting table of numbers to a query using REGEXP_SUBSTR.
-- Get a table with the count of matches
-- e.g. it returns the distinct match counts found in the data; for the sample data: 1, 2, 3
WITH n_table AS (
SELECT
DISTINCT REGEXP_COUNT(url, '(B[0-9A-Z]{3})') AS n
FROM data
)
-- Join the previous table to the data table and use n in the REGEXP_SUBSTR call to get the nth match
SELECT
REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})', 1, n) AS extracted,
item,
origin
FROM data,
n_table
-- Only keep non-null matches
WHERE n > 0
AND REGEXP_COUNT(url, '(B[0-9A-Z]{3})') >= n
IronFarm's answer inspired me, though I wanted to find a solution that didn't require a cross join. Here's what I came up with:
with
-- raw data
src as (
select
1 as id,
'abc def ghi' as stuff
union all
select
2 as id,
'qwe rty' as stuff
),
-- for each id, get a series of indexes for
-- each match in the string
match_idxs as (
select
id,
generate_series(1, regexp_count(stuff, '[a-z]{3}')) as idx
from
src
)
select
src.id,
match_idxs.idx,
regexp_substr(src.stuff, '[a-z]{3}', 1, match_idxs.idx) as stuff_match
from
src
join match_idxs using (id)
order by
id, idx
;
This yields:
id | idx | stuff_match
----+-----+-------------
1 | 1 | abc
1 | 2 | def
1 | 3 | ghi
2 | 1 | qwe
2 | 2 | rty
(5 rows)
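Adapted to the table from the question, the same pattern would look like this sketch (untested, and it assumes generate_series behaves in this position on your Redshift cluster as it does in the example above; data is the question's table name):
with match_idxs as (
  select url, item, origin,
         generate_series(1, regexp_count(url, 'B[0-9A-Z]{3}')) as idx
  from data
)
select
  regexp_substr(url, 'B[0-9A-Z]{3}', 1, idx) as extracted,
  item,
  origin
from match_idxs
order by item, idx;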
I have a column of type jsonb holding a list of IDs as a plain JSON array in my PostgreSQL 9.6.6 database, and I want to search this field based on any ID in the list. How do I write this query?
'[1,8,3,4,56,6]'
For example, my table is:
CREATE TABLE mytable (
id bigint NOT NULL,
numbers jsonb
);
And it has some values:
id | numbers
-----+-------
1 | "[1,8,3,4,56,6]"
2 | "[1,2,7,4,24,5]"
I want something like this:
SELECT *
FROM mytable
WHERE
id = 1
AND
numbers::json->>VALUE(56)
;
Expected result (only if the JSON array has 56 as an element):
id | numbers
-----+-------
1 | "[1,8,3,4,56,6]"
Step 2 problem:
The result of this command is TRUE:
SELECT '[1,8,3,4,56,6]'::jsonb @> '56';
but when I use
SELECT *
FROM mytable
WHERE numbers::jsonb @> '[56]';
or
SELECT *
FROM mytable
WHERE numbers::jsonb @> '56';
or
SELECT *
FROM mytable
WHERE numbers::jsonb @> '[56]'::jsonb;
the result is nothing:
id | numbers
-----+-------
(0 rows)
instead of being:
id | numbers
-----+-------
1 | "[1,8,3,4,56,6]"
I found why I get (0 rows)! :))
It's because I inserted the jsonb values into mytable with double quotes; in fact, the correct value format is without double quotes:
id | numbers
-----+-------
1 | [1,8,3,4,56,6]
2 | [1,2,7,4,24,5]
Now when I run this command:
SELECT *
FROM mytable
WHERE numbers @> '56';
the result is:
id | numbers
-----+-------
1 | [1,8,3,4,56,6]
Use the jsonb "contains" operator @>:
SELECT *
FROM mytable
WHERE id = 1
AND numbers @> '[56]';
Or
...
AND numbers @> '56';
It works with or without enclosing array brackets in this case.
dbfiddle here
This can be supported with various kinds of indexes for great read performance if your table is big.
Detailed explanation / instructions:
Index for finding an element in a JSON array
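For instance (my sketch, not taken from the linked answer): a plain GIN index with the default jsonb_ops operator class supports @>, and jsonb_path_ops is a smaller alternative:
CREATE INDEX mytable_numbers_gin_idx ON mytable USING gin (numbers);
-- or, if @> is the only operator you need:
CREATE INDEX mytable_numbers_path_idx ON mytable USING gin (numbers jsonb_path_ops);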
Hint (addressing your comment): when testing with string literals, be sure to add an explicit cast:
SELECT '[1,8,3,4,56,6]'::jsonb @> '56';
If you don't, Postgres does not know which data types to assume. There are multiple options:
SELECT '[1,8,3,4,56,6]' @> '56';
ERROR: operator is not unique: unknown @> unknown
Related:
GIN index on smallint[] column not used or error "operator is not unique"
I would like to use multiple arrays within a SELECT clause. The obvious approach didn't work, and PostgreSQL points to ROWS FROM() ...
select * from unnest(array[1,2], array[3,4]) as (a int, b int);
ERROR: UNNEST() with multiple arguments cannot have a column definition list
LINE 1: select * from unnest(array[1,2], array[3,4]) as (a int, b in...
^
HINT: Use separate UNNEST() calls inside ROWS FROM(), and attach a column definition list to each one.
...
select * from rows from (unnest(array[1,2]), unnest(array[3,4])) as (a int, b int);
ERROR: ROWS FROM() with multiple functions cannot have a column definition list
LINE 1: ...from (unnest(array[1,2]), unnest(array[3,4])) as (a int, b i...
^
HINT: Put a separate column definition list for each function inside ROWS FROM().
The manual explains this as well, but how do you define these 'separate column definition lists'?
You can define the column names without their types using just AS t(a, b):
#= SELECT * FROM unnest(array[1,2], array[3,4,5]) AS t(a, b);
a | b
---+---
1 | 3
2 | 4
∅ | 5
To define types, do it on the arrays themselves:
#= SELECT a / 2 AS half_a, b / 2 AS half_b
FROM unnest(array[1,2]::float[], array[3,4,5]::integer[]) AS t(a, b);
half_a | half_b
--------+--------
0.5 | 1
1 | 2
∅ | 2
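For completeness, here is the ROWS FROM() form the hint refers to (a sketch; with typed arrays like these, a plain alias list on the combined result is enough and no per-function column definition list is needed):
#= SELECT * FROM ROWS FROM (unnest(array[1,2]), unnest(array[3,4,5])) AS t(a, b);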
EDIT after #NealB's solution: #NealB's solution is very, very fast compared with any other, and it dispenses with this new question about "adding a constraint to improve performance". #NealB's solution needs no improvement; it has O(n) time and is very simple.
The problem of "labeling transitive groups with SQL" has an elegant solution using recursion and a CTE... but this solution consumes exponential time (!). I need to work with 10000 items: with 1000 items it needs 1 second; with 2000, 1 day...
Constraint: in my case it is possible to break the problem into pieces of ~100 items or less, but only to select one group of ~10 items and discard all the other ~90 labeled items...
Is there a generic algorithm to add and use this kind of "pre-selection" to reduce the quadratic, O(N^2), time? Perhaps, as shown by the comments and #wildplasser, O(N log(N)) time; but with "pre-selection" I expect to reduce it to O(N) time.
(EDIT)
I tried to use an alternative algorithm, but it needs some improvement to be used as a solution here; or, to really increase performance (to O(N) time), it needs to use the "pre-selection".
The "pre-selection" (constraint) is based on a "super-set grouping"... Starting from the t1 table of the original "How to label 'transitive groups' with SQL?" question,
table T1
(original T1 augmented by a "super-set grouping label" ssg, and one more row)
ID1 | ID2 | ssg
1 | 2 | 1
1 | 5 | 1
4 | 7 | 1
7 | 8 | 1
9 | 1 | 1
10 | 11 | 2
So there are three groups,
g1: {1,2,5,9} because "1 t 2", "1 t 5" and "9 t 1"
g2: {4,7,8} because "4 t 7" and "7 t 8"
g3: {10,11} because "10 t 11"
The super-group is only an auxiliary grouping,
ssg1: {g1,g2}
ssg2: {g3}
If we have M super-group items and N total T1 items, the average group length will be less than N/M. We can also suppose (for my typical problem) that the maximum ssg length is ~N/M.
So, the "label algorithm" needs to run only M times, with ~N/M items each, if it uses the ssg constraint.
An SQL-only solution appears to be a bit of a problem here. With the help of some procedural programming on top of SQL, the solution appears to be fairly simple and efficient. Here is a brief outline of a solution as it could be implemented using any procedural language invoking SQL.
Declare table R with primary key ID, where ID corresponds to the same domain as ID1 and ID2 of table T1. Table R contains one other non-key column, a Label number.
Populate table R with the range of values found in T1. Set Label to zero (no label).
Using your example data, the initial setup for R would look like:
Table R
ID Label
== =====
1 0
2 0
4 0
5 0
7 0
8 0
9 0
Using a host-language cursor plus an auxiliary counter, read each row from T1. Look up ID1 and ID2 in R. You will find one of four cases:
Case 1: ID1.Label == 0 and ID2.Label == 0
In this case, neither of these IDs has been "seen" before. Add 1 to the counter, then update both rows of R to the value of the counter: update R set R.Label = :counter where R.ID in (:ID1, :ID2)
Case 2: ID1.Label == 0 and ID2.Label <> 0
In this case, ID1 is new but ID2 has already been assigned a label. ID1 needs to be assigned the same label as ID2: update R set R.Label = :ID2.Label where R.ID = :ID1
Case 3: ID1.Label <> 0 and ID2.Label == 0
In this case, ID2 is new but ID1 has already been assigned a label. ID2 needs to be assigned the same label as ID1: update R set R.Label = :ID1.Label where R.ID = :ID2
Case 4: ID1.Label <> 0 and ID2.Label <> 0
In this case, the row contains redundant information. Both rows of R should contain the same Label value. If not, there is some sort of data integrity problem. Ahhhh... not quite, see the edit...
EDIT I just realized that there are situations where both Label values here could be non-zero and different. If both are non-zero and different, then two Label groups need to be merged at this point. All you need to do is choose one Label and update the other to match, with something like: update R set R.Label = :ID1.Label where R.Label = :ID2.Label. Now both groups have been merged under the same Label value.
Upon completion of the cursor, table R will contain Label values needed to update T2.
Table R
ID Label
== =====
1 1
2 1
4 2
5 1
7 2
8 2
9 1
Process table T2 using something along the lines of: set T2.Label to R.Label where T2.ID1 = R.ID. The end result should be:
table T2
ID1 | ID2 | LABEL
1 | 2 | 1
1 | 5 | 1
4 | 7 | 2
7 | 8 | 2
9 | 1 | 1
This process is purely iterative and should scale to fairly large tables without difficulty.
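A minimal plpgsql sketch of this outline (my illustration, not the answerer's code; it assumes tables t1(id1 int, id2 int) and r(id int primary key, label int) populated as described above):
CREATE OR REPLACE FUNCTION label_groups() RETURNS void AS $$
DECLARE
  rec record;
  l1 int;
  l2 int;
  counter int := 0;
BEGIN
  FOR rec IN SELECT id1, id2 FROM t1 LOOP
    SELECT label INTO l1 FROM r WHERE id = rec.id1;
    SELECT label INTO l2 FROM r WHERE id = rec.id2;
    IF l1 = 0 AND l2 = 0 THEN            -- case 1: neither ID seen before
      counter := counter + 1;
      UPDATE r SET label = counter WHERE id IN (rec.id1, rec.id2);
    ELSIF l1 = 0 THEN                    -- case 2: only ID1 is new
      UPDATE r SET label = l2 WHERE id = rec.id1;
    ELSIF l2 = 0 THEN                    -- case 3: only ID2 is new
      UPDATE r SET label = l1 WHERE id = rec.id2;
    ELSIF l1 <> l2 THEN                  -- case 4 (see edit): merge two groups
      UPDATE r SET label = l1 WHERE label = l2;
    END IF;                              -- if l1 = l2 <> 0: redundant row, skip
  END LOOP;
END;
$$ LANGUAGE plpgsql;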
I suggest you check out the disjoint-set data structure and use some general-purpose language for solving it:
http://en.wikipedia.org/wiki/Disjoint-set_data_structure
Traverse the graph, maybe run DFS or BFS from each node, then use this disjoint-set hint. I think this should work.
The #NealB solution is the fastest(!) See an example of a PostgreSQL implementation here.
Below is an example of another "brute force" algorithm, only for curiosity!
As #peter.petrov and #RBarryYoung suggested, some performance problems can be avoided by abandoning the CTE recursion... I fixed some issues in the basic labeler and, above, I added the constraint for grouping by a super-set label. This new transgroup1_loop() function is working!
PS: this solution still has performance limitations; please post your answer with a better one, or with some adaptation of this one.
-- DROP table transgroup1;
CREATE TABLE transgroup1 (
id serial NOT NULL PRIMARY KEY,
items integer[], -- two or more items in the transitive relationship
ssg_label varchar(12), -- the super-set grouping label
dels integer[] DEFAULT array[]::integer[]
);
INSERT INTO transgroup1(items,ssg_label) values
(array[1, 2],'1'),
(array[1, 5],'1'),
(array[4, 7],'1'),
(array[7, 8],'1'),
(array[9, 1],'1'),
(array[10, 11],'2');
-- or SELECT array[id1, id2],ssg_label FROM t1, with 10000 items
Then, with these two functions, we can solve the problem:
CREATE FUNCTION transgroup1_loop(p_ssg varchar, p_max_i integer DEFAULT 100)
RETURNS integer AS $funcBody$
DECLARE
  cp_dels integer[];
  i integer;
BEGIN
  i := 1;
  LOOP
    UPDATE transgroup1
    SET items = array_uunion(transgroup1.items, t2.items),
        dels  = transgroup1.dels || t2.id
    FROM transgroup1 AS t1, transgroup1 AS t2
    WHERE transgroup1.id = t1.id AND t1.ssg_label = $1
      AND t1.id > t2.id AND t1.items && t2.items;

    cp_dels := array(
      SELECT DISTINCT unnest(dels) FROM transgroup1
    ); -- collects all items to delete

    RAISE NOTICE '-- bug, repeating dels, item-%; % dels! %',
      i, array_length(cp_dels,1), array_to_string(cp_dels,';','*');

    EXIT WHEN i > p_max_i OR array_length(cp_dels,1) = 0;

    DELETE FROM transgroup1
    WHERE ssg_label = $1 AND id IN (SELECT unnest(cp_dels));

    UPDATE transgroup1 SET dels = array[]::integer[];
    i := i + 1;
  END LOOP;

  UPDATE transgroup1 -- only to beautify
  SET items = ARRAY(SELECT unnest(items) ORDER BY 1 DESC);

  RETURN i;
END;
$funcBody$ LANGUAGE plpgsql VOLATILE;
To run it and see the results, you can use:
SELECT transgroup1_loop('1'); -- run with ssg-1 items only
SELECT transgroup1_loop('2'); -- run with ssg-2 items only
-- show all with a sequential group label:
SELECT *, dense_rank() over (ORDER BY id) AS group_label from transgroup1;
results:
id | items | ssg_label | dels | group_label
----+-----------+-----------+------+-------------
4 | {8,7,4} | 1 | {} | 1
5 | {9,5,2,1} | 1 | {} | 2
6 | {11,10} | 2 | {} | 3
PS: the function array_uunion() is the same as in the original,
CREATE FUNCTION array_uunion(anyarray,anyarray) RETURNS anyarray AS $$
-- ensures distinct items of a concatenation
SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
$$ LANGUAGE sql immutable;
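For example, a quick sanity check of the helper (the element order of the UNION result is not guaranteed):
SELECT array_uunion(array[1,2,3], array[2,3,4]);
-- returns the four distinct items, e.g. {1,2,3,4} in some order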