How to convert fields to JSON in PostgreSQL - sql

I have a table with the following schema (PostgreSQL 14):
message | sentiment | classification
any text | positive | mobile, communication
message is just a string (a phrase).
sentiment is a string, only one word.
classification is a string, but can have one to many comma-separated words.
I would like to create a JSON field with these columns, like this:
{"msg":"any text", "sentiment":"positive", "classification":["mobile","communication"]}
Also, if possible, is there a way to structure the classification this way:
{"msg":"any text", "sentiment":"positive", "classification 1":"mobile", "classification 2":"communication"}

The first part of the question is easy - Postgres provides functions for splitting strings and converting rows to JSON:
with t(message, sentiment, classification) as (values
('any text','positive','mobile, communication')
)
select row_to_json(x.*)
from (
select t.message
, t.sentiment
, array_to_json(string_to_array(t.classification, ', ')) as classification
from t
) x
The second part is harder - you want the JSON to have a variable number of attributes, a mix of grouped and non-grouped data. I suggest unwinding all attributes and then assembling them back (note the numbered CTE is not needed if your real table has an id - I just needed some column to group by):
with t(message, sentiment, classification) as (values
('any text','positive','mobile, communication')
)
, numbered (id, message, sentiment, classification) as (
select row_number() over ()
, t.*
from t
)
, extracted (id,message,sentiment,classification,index) as (
select n.id
, n.message
, n.sentiment
, l.c
, l.i
from numbered n
join lateral unnest(string_to_array(n.classification, ', ')) with ordinality l(c,i) on true
), unioned (id, attribute, value) as (
select id, concat('classification ', index::text), classification
from extracted
union all
select id, 'message', message
from numbered
union all
select id, 'sentiment', sentiment
from numbered
)
select json_object_agg(attribute, value)
from unioned
group by id;
DB fiddle

Use jsonb_build_object and combine the columns you want:
SELECT
jsonb_build_object(
'msg',message,
'sentiment',sentiment,
'classification',
string_to_array(classification, ', '))
FROM mytable;
Demo: db<>fiddle
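For a quick self-contained check, the same query can be run against inline values (a sketch assuming the ', ' separator from the question):
WITH mytable(message, sentiment, classification) AS (
  VALUES ('any text', 'positive', 'mobile, communication')
)
SELECT jsonb_build_object(
  'msg', message,
  'sentiment', sentiment,
  'classification', string_to_array(classification, ', '))
FROM mytable;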
The second output is definitely not trivial. The SQL code would be much larger and harder to maintain - not to mention that parsing such JSON also requires a little more effort.
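That said, one compact shape for it is possible (a sketch, assuming the same inline mytable and ', ' separator as above): build the fixed keys first, then merge in the numbered classification keys.
-- fixed keys, then merge numbered "classification N" keys
SELECT jsonb_build_object('msg', m.message, 'sentiment', m.sentiment)
    || (SELECT jsonb_object_agg('classification ' || u.i, u.c)
        FROM unnest(string_to_array(m.classification, ', ')) WITH ORDINALITY AS u(c, i))
FROM mytable m;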

You can use a CTE to handle the flattening of the classification attributes and then perform the necessary grouping in the main query for each part of the problem:
with cte(r, m, s, k) as (
select row_number() over (partition by t.message, t.sentiment), t.message, t.sentiment, v.* from tbl t
cross join json_array_elements(array_to_json(string_to_array(t.classification, ', '))) v
)
-- first part --
select json_build_object('msg', t1.message, 'sentiment', t1.sentiment, 'classification', string_to_array(t1.classification, ', ')) from tbl t1
-- second part --
select jsonb_build_object('msg', t1.m, 'sentiment', t1.s)||('{'||t1.g||'}')::jsonb
from (select c.m, c.s, array_to_string(array_agg('"classification '||c.r||'":'||c.k), ', ') g
from cte c group by c.m, c.s) t1

Related

How can I split a column with nested delimiters into a list of STRUCTs in BigQuery SQL?

I've got a column that has entries (variable numbers of 2-tuples) like this:
D000001:Term1;D00007:Term19;D00008:Term781 (mesh_terms column below) and I'd like to split those so that I end up with an ARRAY<STRUCT<code STRING, term STRING>> for each row.
The query below works as needed, but I'm curious if anyone has suggestions on improvements in terms of readability, performance (on BigQuery, so not too big a problem), or best practices.
with t1 as (
SELECT
pmid,
split(mesh_terms, ';') as l1
FROM `omicidx_etl.pm1`
),
t2 as (
select
t1.pmid,
x
from t1,
unnest(t1.l1) as x
),
t3 as (
select
pmid,
split(x, ':') as y
from t2
)
select
pmid,
array_agg(STRUCT(t3.y[offset(0)] as code, t3.y[offset(1)] as term)) as mesh_terms
from
t3
group by pmid
Use below
select pmid,
array(
select as struct split(kv, ':')[offset(0)] code, split(kv, ':')[safe_offset(1)] term
from unnest(regexp_extract_all(mesh_terms, r'[^:;]+:[^:;]+')) kv
) mesh_terms
from your_table
if applied to sample data like in your question, it produces the expected ARRAY<STRUCT<code, term>> per pmid
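A self-contained way to try it (a sketch; the inline row stands in for your_table):
with your_table as (
  select 12345 as pmid, 'D000001:Term1;D00007:Term19;D00008:Term781' as mesh_terms
)
select pmid,
  array(
    select as struct split(kv, ':')[offset(0)] code, split(kv, ':')[safe_offset(1)] term
    from unnest(regexp_extract_all(mesh_terms, r'[^:;]+:[^:;]+')) kv
  ) mesh_terms
from your_table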

Remove Substring according to specific pattern

I need to remove a substring from a column in a SQL Server database according to a pattern:
Before: Winter_QZ6P91712017_115BPM
After: Winter_115BPM
Or
Before: cpx_Note In My Calendar_QZ6P91707044
After: cpx_Note In My Calendar
Basically, delete the substring that matches the pattern _ + 12 chars.
I've tried PatIndex('_\S{12}', myCol) to get the index of the substring, but it doesn't match anything.
Assuming you mean an underscore followed by 12 characters that are not underscores, you can use this pattern:
SELECT *,
CASE WHEN PATINDEX('%[_][^_][^_][^_][^_][^_][^_][^_][^_][^_][^_][^_][^_]%', str) > 0
THEN STUFF(str, PATINDEX('%[_][^_][^_][^_][^_][^_][^_][^_][^_][^_][^_][^_][^_]%', str), 13, '')
ELSE str
END
FROM (VALUES
('Winter_QZ6P91712017_115BPM'),
('Winter_115BPM_QZ6P91712017')
) AS tests(str)
Late to the party, but you could also use the newer STRING_SPLIT function to explode the string on underscores and check the length of each segment between underscores. Segments of length >= 12 are then removed from the original string via REPLACE, applied recursively:
drop table if exists Tbl;
drop table if exists #temptable;
create table Tbl (input nvarchar(max));
insert into Tbl VALUES
('Winter_QZ6P91712017_115BPM'),
('cpx_Note In My Calendar_QZ6P91707044'),
('stuff_asdasd_QZ6P91712017'),
('stuff_asdasd_QZ6P91712017_stuff_asdasd_QZ6P91712017'),
('stuff_asdasd_QZ6P917120117_stuff_asdasd_QZ6P91712017');
select
input, value as replacethisstring,
rn = row_number() over (partition by input order by (select 1))
into #temptable
from
(
select
input,value as hyphensplit
from Tbl
cross apply string_split(input,'_')
)T cross apply string_split(hyphensplit,' ')
where len(value)>=12
; with cte as (
select input, inputtrans= replace(input,replacethisstring,''), level=1 from #temptable where rn=1
union all
select T.input,inputtrans=replace(cte.inputtrans,T.replacethisstring,''),level=level+1
from cte inner join #temptable T on T.input=cte.input and rn=level+1
--where level=rn
)
select input, inputtrans
from (
select *, rn=row_number() over (partition by input order by level desc) from cte
) T where rn=1
SQL Server doesn't support regex. Considering, however, that you just want to remove the first '_' and the 12 characters after it, you could use CHARINDEX to find the location of said underscore, and then STUFF to remove the 13 characters:
SELECT V.YourString,
STUFF(V.YourString, CHARINDEX('_',V.YourString),13,'') AS NewString
FROM (VALUES('Winter_QZ6P91712017_115BPM'))V(YourString);
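If some rows might not contain an underscore at all, note that CHARINDEX returns 0 and STUFF then yields NULL; a guarded variant (a sketch) avoids that:
SELECT V.YourString,
       CASE WHEN CHARINDEX('_', V.YourString) > 0
            THEN STUFF(V.YourString, CHARINDEX('_', V.YourString), 13, '')
            ELSE V.YourString
       END AS NewString
FROM (VALUES ('Winter_QZ6P91712017_115BPM'),
             ('NoUnderscoreHere')) V(YourString);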

BigQuery - concatenate ignoring NULL

I'm very new to SQL. I understand in MySQL there's the CONCAT_WS function, but BigQuery doesn't recognise this.
I have a bunch of twenty fields I need to CONCAT into one comma-separated string, but some are NULL, and if one is NULL then the whole result will be NULL. Here's what I have so far:
CONCAT(m.track1, ", ", m.track2) AS Tracks,
I tried this but it returns NULL too:
CONCAT(m.track1, IFNULL(m.track2,CONCAT(", ", m.track2))) As Tracks,
Super grateful for any advice, thank you in advance.
Unfortunately, BigQuery doesn't support concat_ws(). So, one method is string_agg(), which skips NULL values:
select t.*,
(select string_agg(track, ',')
from (select t.track1 as track union all select t.track2) x
) x
from t;
Actually, a simpler method uses arrays:
select t.*,
array_to_string([track1, track2], ',')
from t;
Arrays with NULL values are not supported in result sets, but they can be used for intermediate results.
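A minimal sketch to verify the NULL-skipping behavior (inline data stands in for a real table):
select array_to_string([track1, track2, track3], ', ') as tracks
from (select 'a' as track1, cast(null as string) as track2, 'c' as track3)
-- returns 'a, c': NULL elements are omitted by ARRAY_TO_STRING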
I have a bunch of twenty fields I need to CONCAT into one comma-separated string
Assuming that these are the only fields in the table, you can use the approach below - generic enough to handle any number of columns and their names without explicit enumeration:
select
(select string_agg(col, ', ' order by offset)
from unnest(split(trim(format('%t', (select as struct t.*)), '()'), ', ')) col with offset
where not upper(col) = 'NULL'
) as Tracks
from `project.dataset.table` t
Below is an oversimplified dummy example to try out and test the approach:
#standardSQL
with `project.dataset.table` as (
select 1 track1, 2 track2, 3 track3, 4 track4 union all
select 5, null, 7, 8
)
select
(select string_agg(col, ', ' order by offset)
from unnest(split(trim(format('%t', (select as struct t.*)), '()'), ', ')) col with offset
where not upper(col) = 'NULL'
) as Tracks
from `project.dataset.table` t

How to get yes/no statistics from SQL of how often each string occurs

Is there a way to query a table from the BigQuery project HTTPArchive to check how often certain strings occur per file type?
I was able to write a query for a single check, but how can I perform this query on multiple strings at once, without sending the same query every time with a different string and processing the ~800 GB of table data each time?
Getting the results as an array might work somehow? I want to publish in-depth monthly statistics to the public for free, so sending those queries separately and getting billed roughly $2000/month for querying is not an option for me as a student.
SELECT matched, count(*) AS total, RATIO_TO_REPORT(total) OVER() AS ratio
FROM (
SELECT url, (LOWER(body) CONTAINS 'document.write') AS matched
FROM httparchive.har.2017_09_01_chrome_requests_bodies
WHERE url LIKE "%.js"
)
GROUP BY matched
Please note that this is just one example of many (~50), and the pre-generated stats are not what I am looking for, as they don't contain the needed information.
Below is for BigQuery Standard SQL
#standardSQL
WITH strings AS (
SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str
), files AS (
SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php']) AS ext
)
SELECT
ext, str, COUNT(1) total,
COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches,
ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) / COUNT(1), 3) ratio
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext)
CROSS JOIN strings s
GROUP BY ext, str
-- ORDER BY ext, str
You can test / play with the above using [totally] dummy data as below:
#standardSQL
WITH `httparchive.har.2017_09_01_chrome_requests_bodies` AS (
SELECT '1234.js' AS url, 'abc=1;x=2' AS body UNION ALL
SELECT 'qaz.js', 'y=1;xyz=0' UNION ALL
SELECT 'edc.go', 's=1;xyz=2;abc=3' UNION ALL
SELECT 'edc.go', 's=1;xyz=4;abc=5' UNION ALL
SELECT 'rfv.php', 'd=1' UNION ALL
SELECT 'tgb.txt', '?abc=xyz' UNION ALL
SELECT 'yhn.php', 'like v' UNION ALL
SELECT 'ujm.go', 'lkjsad' UNION ALL
SELECT 'ujm.go', 'yhj' UNION ALL
SELECT 'ujm.go', 'dfgh' UNION ALL
SELECT 'ikl.js', 'werwer'
), strings AS (
SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str
), files AS (
SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php']) AS ext
)
SELECT
ext, str, COUNT(1) total,
COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches,
ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) / COUNT(1), 3) ratio
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext)
CROSS JOIN strings s
GROUP BY ext, str
ORDER BY ext, str
One method is to bring in a table with the different strings. This is the idea:
SELECT str, matched, count(*) AS total, RATIO_TO_REPORT(total) OVER() AS ratio
FROM (SELECT crb.url, s.str, (LOWER(crb.body) CONTAINS s.str) AS matched
FROM httparchive.har.2017_09_01_chrome_requests_bodies crb CROSS JOIN
(SELECT 'document.write' as str UNION ALL
SELECT 'xxx' as str
) s
WHERE url LIKE "%.js"
)
GROUP BY str, matched;
You would just add more strings to s.

SQL Customized search with special characters

I am creating a keywording module where I want to search data using comma-separated words, and the search is driven by two operators: comma , and minus -.
I know a relational database engine is designed on the principle that a cell holds a single value, and obeying this rule can help performance. But in this case the table is already in use with millions of rows, and I can't change the table structure.
Take a look at the example; what I exactly want to do is this.
I have a main table named tbl_main in SQL:
AS_ID | KWD
1 | Man,Businessman,Business,Office,confidence,arms crossed
2 | Man,Businessman,Business,Office,laptop,corridor,waiting
3 | man,business,mobile phone,mobile,phone
4 | Welcome,girl,Greeting,beautiful,bride,celebration,wedding,woman,happiness
5 | beautiful,bride,wedding,woman,girl,happiness,mobile phone,talking
6 | woman,girl,Digital Tablet,working,sitting,online
7 | woman,girl,Digital Tablet,working,smiling,happiness,hand on chin
If the search text is Man,Businessman then the result AS_ID is 1,2
If the search text is Man,-Businessman then the result AS_ID is 3
If the search text is woman,girl,-Working then the result AS_ID is 4,5
If the search text is woman,girl then the result AS_ID is 4,5,6,7
What is the best way to do this? Help is much appreciated. Thanks in advance.
I think you can easily solve this by creating a FULL TEXT INDEX on your KWD column. Then you can use the CONTAINS query to search for phrases. The FULL TEXT index takes care of the punctuation and ignores the commas automatically.
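Setting up the index might look like this (a sketch; it assumes a unique index on tbl_main, here called PK_tbl_main, and the catalog name is illustrative):
-- one-time setup: a default full-text catalog and the index on KWD
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON tbl_main(KWD) KEY INDEX PK_tbl_main;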
-- If search text is = Man,Businessman then the query will be
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"Man" AND "Businessman"')
-- If search text is = Man,-Businessman then the query will be
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"Man" AND NOT "Businessman"')
-- If search text is = woman,girl,-Working the query will be
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"woman" AND "girl" AND NOT "working"')
To search for multi-word terms (like mobile phone in your case), use quoted phrases:
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"woman" AND "mobile phone"')
As commented below, the quoted phrases are important in all searches to avoid bad matches, e.g. when a search term is "tablet working" and the KWD value is woman,girl,Digital Tablet,working,sitting,online.
There is a special case for a lone - search term: NOT cannot be used as the first term in CONTAINS. Therefore, a query like this should be used:
-- If search text is = -Working the query will be
SELECT AS_ID FROM tbl_main
WHERE NOT CONTAINS(KWD, '"working"')
Here is my attempt using Jeff Moden's DelimitedSplit8k to split the comma-separated values.
First, here is the splitter function (check the article for updates of the script):
CREATE FUNCTION [dbo].[DelimitedSplit8K](
@pString VARCHAR(8000), @pDelimiter CHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)
,E2(N) AS (SELECT 1 FROM E1 a, E1 b)
,E4(N) AS (SELECT 1 FROM E2 a, E2 b)
,cteTally(N) AS(
SELECT TOP (ISNULL(DATALENGTH(@pString), 0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
,cteStart(N1) AS(
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString, t.N, 1) = @pDelimiter
),
cteLen(N1, L1) AS(
SELECT
s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1),0) - s.N1, 8000)
FROM cteStart s
)
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
Here is the complete solution:
-- search parameter
DECLARE @search_text VARCHAR(8000) = 'woman,girl,-working'
-- split comma-separated search parameters
-- items starting in '-' will have a value of 1 for exclude
DECLARE @search_values TABLE(ItemNumber INT, Item VARCHAR(8000), Exclude BIT)
INSERT INTO @search_values
SELECT
ItemNumber,
CASE WHEN LTRIM(RTRIM(Item)) LIKE '-%' THEN LTRIM(RTRIM(STUFF(Item, 1, 1 ,''))) ELSE LTRIM(RTRIM(Item)) END,
CASE WHEN LTRIM(RTRIM(Item)) LIKE '-%' THEN 1 ELSE 0 END
FROM dbo.DelimitedSplit8K(@search_text, ',') s
;WITH CteSplitted AS( -- split each KWD to separate rows
SELECT *
FROM tbl_main t
CROSS APPLY(
SELECT
ItemNumber, Item = LTRIM(RTRIM(Item))
FROM dbo.DelimitedSplit8K(t.KWD, ',')
)x
)
SELECT
cs.AS_ID
FROM CteSplitted cs
INNER JOIN @search_values sv
ON sv.Item = cs.Item
GROUP BY cs.AS_ID
HAVING
-- all parameters should be included (Relational Division with no Remainder)
COUNT(DISTINCT cs.Item) = (SELECT COUNT(DISTINCT Item) FROM @search_values WHERE Exclude = 0)
-- no exclude parameters
AND SUM(CASE WHEN sv.Exclude = 1 THEN 1 ELSE 0 END) = 0
SQL Fiddle
This one uses a solution from the Relational Division with no Remainder problem discussed in this article by Dwain Camps.
From what you've described, you want the keywords that are included in the search text to be a match in the KWD column, and those that are prefixed with a - to be excluded.
Despite the data existing in this format, it still makes most sense to normalize it, and then query based on the existence or non-existence of the keywords.
To do this, in very rough terms:-
Create two additional tables - Keyword and tbl_Main_Keyword. Keyword contains a distinct list of each of the possible keywords and tbl_Main_Keyword contains a link between each record in tbl_Main to each Keyword record where there's a match. Ensure to create an index on the text field for the keyword (e.g. the Keyword.KeywordText column, or whatever you call it), as well as the KeywordID field in the tbl_Main_Keyword table. Create Foreign Keys between tables.
Write some DML (or use a separate program, such as a C# program) to iterate through each record, parsing the text, and inserting each distinct keyword encountered into the Keyword table. Create a relationship to the row for each keyword in the tbl_main record.
Now, for searching, parse out the search text into keywords, and compose a query against the tbl_Main_Keyword table containing both a WHERE KeywordID IN and a WHERE KeywordID NOT IN clause, depending on whether the keyword is included or excluded (see the sketch after these steps).
Take note to consider whether the case of each keyword is important to your business case, and consider the collation (case sensitive or insensitive) accordingly.
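In very rough terms, the result might look like this (a sketch; every name except tbl_main is illustrative, and tbl_main.AS_ID is assumed to be its primary key):
-- distinct list of keywords
CREATE TABLE Keyword (
    KeywordID INT IDENTITY PRIMARY KEY,
    KeywordText NVARCHAR(100) NOT NULL UNIQUE
);
-- link table between tbl_main and Keyword
CREATE TABLE tbl_Main_Keyword (
    AS_ID INT NOT NULL REFERENCES tbl_main(AS_ID),
    KeywordID INT NOT NULL REFERENCES Keyword(KeywordID),
    PRIMARY KEY (AS_ID, KeywordID)
);
-- search for woman,girl,-working
SELECT mk.AS_ID
FROM tbl_Main_Keyword mk
INNER JOIN Keyword k ON k.KeywordID = mk.KeywordID
WHERE k.KeywordText IN (N'woman', N'girl')
GROUP BY mk.AS_ID
HAVING COUNT(DISTINCT k.KeywordText) = 2  -- all included keywords present
   AND mk.AS_ID NOT IN (                  -- no excluded keyword present
       SELECT mk2.AS_ID
       FROM tbl_Main_Keyword mk2
       INNER JOIN Keyword k2 ON k2.KeywordID = mk2.KeywordID
       WHERE k2.KeywordText = N'working');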
I would prefer cha's solution, but here's another solution:
declare @QueryParts table (q varchar(1000))
insert into @QueryParts values
('woman'),
('girl'),
('-Working')
select AS_ID
from tbl_main
inner join @QueryParts on
(q not like '-%' and ',' + KWD + ',' like '%,' + q + ',%') or
(q like '-%' and ',' + KWD + ',' not like '%,' + substring(q, 2, 1000) + ',%')
group by AS_ID
having COUNT(*) = (select COUNT(*) from @QueryParts)
With such a design, you would have two tables: one that defines the IDs and a subtable that holds the set of keywords per ID.
Likewise, you would transform the search string into two tables, one for keywords that should match and one for negated keywords. Assuming that you put this in a stored procedure, these tables would be table-valued parameters.
Once you have this set up, the query is simple to write:
SELECT M.AS_ID
FROM tbl_main M
WHERE (SELECT COUNT(*)
FROM tbl_keywords K
WHERE K.AS_ID = M.AS_ID
AND K.KWD IN (SELECT word FROM @searchwords)) =
(SELECT COUNT(*) FROM @searchwords)
AND NOT EXISTS (SELECT *
FROM tbl_keywords K
WHERE K.AS_ID = M.AS_ID
AND K.KWD IN (SELECT word FROM @minuswords))
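For reference, declaring and filling those table-valued parameters might look like this (a sketch; the type name WordList is illustrative):
-- one-time setup of a table type
CREATE TYPE WordList AS TABLE (word NVARCHAR(100) PRIMARY KEY);
GO
-- per-search setup
DECLARE @searchwords WordList, @minuswords WordList;
INSERT INTO @searchwords(word) VALUES (N'woman'), (N'girl');
INSERT INTO @minuswords(word) VALUES (N'working');
-- ...then run the SELECT above.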