Split multi-bracket data (JSON arrays) into rows preserving associations - sql

I need to split bracketed (JSON array) data into rows.
When a row contains multiple brackets, the query below does not associate the values correctly.
Here's an example:
create table test (
id integer,
questionid character varying (255),
questionanswer character varying (255)
);
INSERT INTO test (id,questionid,questionanswer) values
(1,'[101,102,103]','[["option_11"],["Test message 1"],["option_14"]]'),
(2,'[201]','[["option_3","option_4"]]'),
(3,'[301,302]','[["option_1","option_3"],["option_1"]]'),
(4,'[976,1791,978,1793,980,1795,982,1797]','[["option_2","option_3","option_4","option_5"],["Test message"],["option_4"],["Test message2"],["option_2"],["Test message3"],["option_2","option_3"],["Test message4"]]');
Query:
select t.id, t1.val, v1#>>'{}'
from test t
cross join lateral (
    select row_number() over (order by v.value#>>'{}') r, v.value#>>'{}' val
    from json_array_elements(t.questionid::json) v
) t1
join lateral (
    select row_number() over (order by 1) r, v.value val
    from json_array_elements(t.questionanswer::json) v
) t2 on t1.r = t2.r
cross join lateral json_array_elements(t2.val) v1;
Current query output for id = 4:
 id | val  | ?column?
----+------+---------------
  4 | 1791 | option_2
  4 | 1791 | option_3
  4 | 1791 | option_4
  4 | 1791 | option_5
  4 | 1793 | Test message
  4 | 1795 | option_4
  4 | 1797 | Test message2
  4 | 976  | option_2
  4 | 978  | Test message3
  4 | 980  | option_2
  4 | 980  | option_3
  4 | 982  | Test message4
Associations between questions and answers come out wrong. Output should be:
 id | val  | ?column?
----+------+---------------
  4 | 976  | option_2
  4 | 976  | option_3
  4 | 976  | option_4
  4 | 976  | option_5
  4 | 1791 | Test message
  4 | 978  | option_4
  4 | 1793 | Test message2
  4 | 980  | option_2
  4 | 1795 | Test message3
  4 | 982  | option_2
  4 | 982  | option_3
  4 | 1797 | Test message4

Most importantly, use WITH ORDINALITY instead of row_number() and join unnested questions and answers on their ordinal positions. See:
PostgreSQL unnest() with element number
And use json_array_elements_text(). See:
How to turn JSON array into Postgres array?
SELECT t.id, qa.q_id, json_array_elements_text(qa.answers) AS answer
FROM test t
CROSS JOIN LATERAL (
    SELECT *
    FROM json_array_elements_text(t.questionid::json) WITH ORDINALITY q(q_id, ord)
    JOIN json_array_elements(t.questionanswer::json) WITH ORDINALITY a(answers, ord) USING (ord)
) qa
ORDER BY t.id, qa.ord;
fiddle
Aside: you should probably store JSON values as type json (or jsonb) to begin with.
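If you do switch, a minimal migration sketch could look like this (assuming the example table above and that every stored string is valid JSON; adjust names to your real schema):
-- Convert the text columns to jsonb in place; the USING cast fails if any row holds invalid JSON.
ALTER TABLE test
    ALTER COLUMN questionid TYPE jsonb USING questionid::jsonb,
    ALTER COLUMN questionanswer TYPE jsonb USING questionanswer::jsonb;
The query above would then use jsonb_array_elements_text() / jsonb_array_elements() and drop the ::json casts.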


Outer Join 'fill-in-the-blanks'
I have a pair of master-detail tables in a PostgreSQL database. The master table 'samples' holds samples, each with a timestamp.
The detail table 'sample_values' holds values for some of the parameters at any given sample timestamp.
My Query
SELECT s.sample_id, s.sample_time, v.parameter_id, v.sample_value
FROM samples s LEFT OUTER JOIN sample_values v ON v.sample_id=s.sample_id
ORDER BY s.sample_id, v.parameter_id;
returns (as expected):
 sample_id | sample_time              | parameter_id | sample_value
-----------+--------------------------+--------------+--------------
         1 | 2023-01-13T01:00:00.000Z |            1 |         1.23
         1 | 2023-01-13T01:00:00.000Z |            2 |         4.98
         2 | 2023-01-13T01:01:00.000Z |              |
         3 | 2023-01-13T01:02:00.000Z |              |
         4 | 2023-01-13T01:03:00.000Z |              |
         5 | 2023-01-13T01:04:00.000Z |            2 |         6.08
         6 | 2023-01-13T01:05:00.000Z |              |
         7 | 2023-01-13T01:06:00.000Z |            1 |         1.89
         8 | 2023-01-13T01:07:00.000Z |              |
         9 | 2023-01-13T01:08:00.000Z |              |
        10 | 2023-01-13T01:09:00.000Z |              |
        11 | 2023-01-13T01:10:00.000Z |              |
        12 | 2023-01-13T01:11:00.000Z |              |
        13 | 2023-01-13T01:12:00.000Z |              |
        14 | 2023-01-13T01:13:00.000Z |              |
        15 | 2023-01-13T01:14:00.000Z |            1 |         2.11
        16 | 2023-01-13T01:15:00.000Z |              |
        17 | 2023-01-13T01:16:00.000Z |              |
        18 | 2023-01-13T01:17:00.000Z |              |
        19 | 2023-01-13T01:18:00.000Z |            2 |         3.57
        20 | 2023-01-13T01:19:00.000Z |              |
        21 | 2023-01-13T01:20:00.000Z |              |
        22 | 2023-01-13T01:21:00.000Z |              |
        23 | 2023-01-13T01:22:00.000Z |            1 |         3.21
        23 | 2023-01-13T01:22:00.000Z |            2 |         5.31
How do I write a query that returns one row per timestamp per parameter, where sample_value is the 'latest known' sample_value for that parameter like this:
 sample_id | sample_time              | parameter_id | sample_value
-----------+--------------------------+--------------+--------------
         1 | 2023-01-13T01:00:00.000Z |            1 |         1.23
         1 | 2023-01-13T01:00:00.000Z |            2 |         4.98
         2 | 2023-01-13T01:01:00.000Z |            1 |         1.23
         2 | 2023-01-13T01:01:00.000Z |            2 |         4.98
         3 | 2023-01-13T01:02:00.000Z |            1 |         1.23
         3 | 2023-01-13T01:02:00.000Z |            2 |         4.98
         4 | 2023-01-13T01:03:00.000Z |            1 |         1.23
         4 | 2023-01-13T01:03:00.000Z |            2 |         4.98
         5 | 2023-01-13T01:04:00.000Z |            1 |         1.23
         5 | 2023-01-13T01:04:00.000Z |            2 |         6.08
         6 | 2023-01-13T01:05:00.000Z |            1 |         1.23
         6 | 2023-01-13T01:05:00.000Z |            2 |         6.08
         7 | 2023-01-13T01:06:00.000Z |            1 |         1.89
         7 | 2023-01-13T01:06:00.000Z |            2 |         6.08
         8 | 2023-01-13T01:07:00.000Z |            1 |         1.89
         8 | 2023-01-13T01:07:00.000Z |            2 |         6.08
View on DB Fiddle
I cannot get my head around the LAST_VALUE function (if that is even the right tool for this?):
LAST_VALUE ( expression )
OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
First of all, you need a row for each of your sample ids per parameter (two rows each here). You can achieve that by cross joining your samples with the distinct set of parameter_ids, and enforcing the parameter condition in the LEFT JOIN as well.
...
FROM samples s
CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
LEFT JOIN sample_values v
ON v.sample_id = s.sample_id AND v.parameter_id = p.parameter_id
...
In addition to this, your intuition about the LAST_VALUE window function was correct. The problem is that PostgreSQL, up to its current version, has no IGNORE NULLS option for window functions. The workaround is to build groups per parameter_id using a running count of non-null sample_values: each group then contains one non-null value followed by the null rows that should inherit it, so taking the maximum value within each group fills in the blanks (see the minimal sketch after the query below).
WITH cte AS (
    SELECT s.sample_id, s.sample_time, p.parameter_id, v.sample_value,
           COUNT(v.sample_value) OVER (
               PARTITION BY p.parameter_id
               ORDER BY s.sample_id
           ) AS partitions
    FROM samples s
    CROSS JOIN (SELECT DISTINCT parameter_id FROM sample_values) p
    LEFT JOIN sample_values v
           ON v.sample_id = s.sample_id AND v.parameter_id = p.parameter_id
)
SELECT sample_id, sample_time, parameter_id,
       COALESCE(sample_value,
                MAX(sample_value) OVER (PARTITION BY parameter_id, partitions)
       ) AS sample_value
FROM cte
ORDER BY sample_id, parameter_id;
Check the demo here.
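Below is a minimal standalone sketch of that running-count trick on hypothetical toy values (not the question's tables), just to make the grouping visible: COUNT(val) OVER (ORDER BY id) only increments on non-null rows, so every null row lands in the same group as the last non-null value before it.
WITH data(id, val) AS (
    VALUES (1, 10), (2, NULL), (3, NULL), (4, 20), (5, NULL)
), grouped AS (
    SELECT id, val,
           COUNT(val) OVER (ORDER BY id) AS grp   -- running count of non-null values
    FROM data
)
SELECT id, val, grp,
       COALESCE(val, MAX(val) OVER (PARTITION BY grp)) AS filled
FROM grouped
ORDER BY id;
-- id | val | grp | filled
--  1 |  10 |   1 |     10
--  2 |     |   1 |     10
--  3 |     |   1 |     10
--  4 |  20 |   2 |     20
--  5 |     |   2 |     20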

How can I count only the distinct values in a VARCHAR array?

I have a VARCHAR column with arrays in it (strings formatted as arrays), using the Oracle SQL dialect. How can I count the distinct values in it?
For example, I have the following rows:
ID List
1 ["351","364"]
2 ["364","351"]
3 ["364","951"]
4 ["951"]
I expect the count to be 3.
Assuming you're on a recent version of Oracle, you can use JSON functions to extract the elements of the arrays:
select t.id, j.value
from your_table t
outer apply json_table (t.list, '$[*]' columns (value path '$')) j
 ID | VALUE
----+-------
  1 | 351
  1 | 364
  2 | 364
  2 | 351
  3 | 364
  3 | 951
  4 | 951
And then just count the distinct values:
select count(distinct j.value) as c
from your_table t
outer apply json_table (t.list, '$[*]' columns (value path '$')) j
 C
---
 3

Get count of numbers from all columns in a large table

I have the following table
ID A1 A2 A3 A4 A5 A6
1 324 243 3432 23423 342 342
2 342 242 4345 23423 324 342
How do I write a query that gives me the number of times each value appears in any of the above columns? For example, this is the output I am looking for:
324 2
243 1
3432 1
23423 1
342 3
242 1
4345 1
23423 1
There are a number of ways to do this, but my first thought is to use unnest:
rnubel=# CREATE TABLE mv (a int, b int, c int);
CREATE TABLE
rnubel=# INSERT INTO mv (a, b, c) VALUES (1, 1, 1), (2, 2, 2), (3, 4, 5);
INSERT 0 3
rnubel=# SELECT unnest(array[a, b, c]) as value, COUNT(*) from mv GROUP BY 1;
value | count
-------+-------
5 | 1
4 | 1
2 | 3
1 | 3
3 | 1
(5 rows)
unnest is a handy function that turns an array into a set of rows, so it expands the array of column values into one row per column value. Then you just group and count as usual.
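Applied to the table from the question (assuming it is named t and has the columns shown above), that would be:
SELECT unnest(array[A1, A2, A3, A4, A5, A6]) AS value, COUNT(*)
FROM t
GROUP BY 1
ORDER BY 2 DESC;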
Brute force method:
SELECT Value, COUNT(1) AS ValueCount
FROM (
    SELECT A1 AS Value FROM t
    UNION ALL
    SELECT A2 FROM t
    UNION ALL
    SELECT A3 FROM t
    UNION ALL
    SELECT A4 FROM t
    UNION ALL
    SELECT A5 FROM t
    UNION ALL
    SELECT A6 FROM t
) x
GROUP BY Value
In Postgres, you can use lateral joins to unpivot values. I find this more direct than using an array or union all:
select v.a, count(*)
from t cross join lateral
(values (a1), (a2), (a3), (a4), (a5), (a6)
) v(a)
group by v.a;
Here is a db<>fiddle.

How can this sql grouped query be corrected? (I think this is sql generic, we use postgresql)

Data set example is as follows:
noteid seq notebalance
1 4 125.00
1 3 120.00
2 8 235.00
2 6 235.00
2 5 200.00
3 9 145.00
4 17 550.00
4 16 550.00
4 14 500.00
4 12 450.00
4 10 400.00
...
So we basically have the latest notebalance at the beginning of each noteid group.
What is the proper SQL syntax to obtain the latest balance for each noteid?
as in:
1 4 125.00
2 8 235.00
3 9 145.00
4 17 550.00
A generic (= ANSI SQL) solution would be to use a window function:
select noteid, seq, notebalance
from (
select noteid, seq, notebalance,
row_number() over (partition by noteid order by seq desc) as rn
from the_table
) t
where rn = 1
order by noteid;
When using Postgres, it's usually faster to use the distinct on operator:
select distinct on (noteid) noteid, seq, notebalance
from the_table
order by noteid, seq desc;
SQLFiddle example: http://sqlfiddle.com/#!15/8ca27/2
I think ROW_NUMBER() is what you are looking for. This is similar to this SO question.
"the record with the highest seq" := "there is no record with a higher seq (for this noteid)"
SELECT noteid, seq, notebalance
FROM the_table tt
WHERE NOT EXISTS (
SELECT * FROM the_table nx
WHERE nx.noteid = tt.noteid
AND nx.seq > tt.seq
)
ORDER BY noteid
;

grouping rows by csv column TSQL

I have a table with the following sample data:
CategoriesIDs              RegistrantID
47 1
276|275|278|274|277 4
276|275|278|274|277 16261
NULL 16262
NULL 16264
NULL 16265
NULL 16266
NULL 16267
NULL 16268
NULL 16269
NULL 16270
276|275|278 16276
276|275|278|274|277 16292
276|275|278|274|277 16293
276|275|278|274|277 16294
276|275|278|274|277 16295
276|275|278|274|277 16302
276|275|278|274|277 16303
276|275|278|274|277 16304
276|275|278|274|277 16305
276|275|278|274|277 16306
276|275|278|274|277 16307
I need to know:
1) Which category has how many RegistrantIDs (e.g. how many RegistrantIDs does category 277 have)?
2) How to group the registrants by category, so that I can find which registrants are in category 277, for example?
Do I need to create a function which generates a table from the CSV? I have created a function but am not sure whether it will work in this situation with an IN clause.
Please suggest.
If you are looking for the output below:
Category Reg Count
277 12
274 12
47 1
276 13
278 13
275 13
SQL FIDDLE DEMO
Try this
SELECT Category, COUNT([RegistrantID]) AS [Reg Count]
FROM (
    SELECT
        Split.a.value('.', 'VARCHAR(100)') AS Category,
        [RegistrantID]
    FROM (
        SELECT
            CONVERT(XML, '<C>' + REPLACE([CategoriesIDs], '|', '</C><C>') + '</C>') AS Categories,
            [RegistrantID]
        FROM table1
    ) T
    CROSS APPLY Categories.nodes('/C') AS Split(a)
) T1
GROUP BY Category
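If you are on SQL Server 2016 or later (an assumption; the question does not state a version), STRING_SPLIT is a shorter alternative to the XML trick for the per-category counts; rows with a NULL CategoriesIDs simply drop out of the CROSS APPLY:
SELECT s.value AS Category, COUNT(t.[RegistrantID]) AS [Reg Count]
FROM table1 t
CROSS APPLY STRING_SPLIT(t.[CategoriesIDs], '|') s
GROUP BY s.value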
You should normalise your data.
That said, try this.
;with c as (
select RegistrantID, CategoriesIDs, 0 as start, CHARINDEX('|', CategoriesIDs) as sep
from yourtable
union all
select RegistrantID,CategoriesIDs, sep, CHARINDEX('|', CategoriesIDs, sep+1) from c
where sep>0
)
select *, count(*) over (partition by CategoriesID)
from
(
select convert(int,SUBSTRING(CategoriesIDs,start+1,chars)) as [CategoriesID],
RegistrantID
from
(
select *,
Case sep when 0 then LEN(CategoriesIDs) else sep-start-1 end as chars
from c
) v
) c2
order by CategoriesID
If you have a "Categories" table, you can do this with the following query:
select c.CategoryId, count(*)
from t join
     categories c
     on '|' + t.CategoriesIDs + '|' like '%|' + cast(c.CategoryId as varchar(255)) + '|%'
group by c.CategoryId;
group by c.CategoryId;
This will not be particularly efficient. But neither will breaking the string apart. You should really have an association table with one row per item (in your original table) and category.
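As a sketch of that normalisation (hypothetical table and column names), an association table with one row per registrant/category pair makes both of the original questions trivial:
CREATE TABLE RegistrantCategories (
    RegistrantID int NOT NULL,
    CategoryID   int NOT NULL,
    PRIMARY KEY (RegistrantID, CategoryID)
);

-- 1) How many registrants per category:
SELECT CategoryID, COUNT(*) AS RegCount
FROM RegistrantCategories
GROUP BY CategoryID;

-- 2) Which registrants are in category 277:
SELECT RegistrantID
FROM RegistrantCategories
WHERE CategoryID = 277;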