Handling null or missing attributes in JSON using PostgreSQL - sql

I'm learning how to handle JSON in PostgreSQL.
I have a table with some columns. One column is a JSON field. The data in that column has at least these three variations:
Case 1: {"createDate": 1448067864151, "name": "world"}
Case 2: {"createDate": "", "name": "hello"}
Case 3: {"name": "sky"}
Later on, I want to select the createDate.
TO_TIMESTAMP((attributes->>'createDate')::bigint * 0.001)
That works fine for Case 1 when the data is present and it is convertible to a bigint. But what about when it isn't? How do I handle this?
I read this article. It explains that we can add check constraints to perform some rudimentary validation. Alternatively, I could do a schema validation before the data is inserts (on the client side). There are pros and cons with both ideas.
Using a Check Constraint
CONSTRAINT validate_createDate CHECK ((attributes->>'createDate')::bigint >= 1)
This forces a non-nullable field (Case 3 fails). But I want the attribute to be optional. Furthermore, if the attribute doesn't convert to a bigint because it is blank (Case 2), this errors out.
Using JSON schema validation on the client side before insert
This works, in part, because the schema validation makes sure that what data comes in conforms to the schema. In my case, I can control which clients access this table, so this is OK. But it doesn't matter for the SQL later on since my validator will let pass all three cases.

Basically, you need to check if createDate attribute is empty:
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT to_timestamp((attributes->>'createDate')::bigint * 0.001) FROM data
WHERE
(attributes->>'createDate') IS NOT NULL
AND
(attributes->>'createDate') != '';
Output:
to_timestamp
----------------------------
2015-11-20 17:04:24.151-08
(1 row)

Building on Dmitry's answer, you can also check the json type with the json_typeof function. Note the json operator: -> to get json instead of the ->> operator which always casts the value to string.
By doing the check in the SELECT with a CASE conditional instead of in the WHERE clause, we also keep the rows not having a createdDate. Depending on your usecase, this might be better.
WITH data(attributes) AS ( VALUES
('{"createDate": 1448067864151, "name": "world"}'::JSON),
('{"createDate": "", "name": "hello"}'::JSON),
('{"name": "sky"}'::JSON)
)
SELECT
CASE WHEN (json_typeof(attributes->'createDate') = 'number')
THEN to_timestamp((attributes->>'createDate')::bigint * 0.001)
END AS created_date
FROM data
;
Output:
created_date
----------------------------
"2015-11-21 02:04:24.151+01"
""
""
(3 rows)

Related

searching for a keyword in JSON having multiple fields

{
"TaskType": "kkkk",
"Status": "SUCCESS",
"jobID": "18056",
"DownloadFilePath": "https://abcd",
"accountId": "1234",
"customerId": "hhff"
}
This is sample data in message in notification table. I need to search for the keyword inside 3 or 4 fields out of 6 fields.
One way is to use multiple OR statements like:
SELECT message AS note
FROM notification
WHERE message::jsonb->'TaskType' LIKE '%1234%'
OR message::jsonb->'jobId' LIKE '%1234%'
OR message::jsonb->'Status' LIKE '%1234%';
The results are satisfactory, but is there a way to optimize the query in case I need to search in more fields.
you could just cast the message column to text and search inside the text:
SELECT message AS note
FROM notification
WHERE message::text like '%1234%'
You need to iterate through all key/value pairs:
select n.*
from notification n
where exists (select *
from jsonb_each_text(n.message) as m(key,value)
where key in ('TaskType', 'jobId', 'Status')
and value like '%1234%');
If you want to search in all values, just leave out the key in (..) part in the sub-select.
Searching in all values can be done using a JSON path operator in Postgres 12 or later:
select *
from notification
where message ## '$.* like_regex "1234"'
This assumes that messages is defined as jsonb - which it should be. If it's not, you need to cast it: message::jsonb

Fetching attribute from JSON string with JSON_VAL cause "<attribute> is invalid in the used context" error

A proprietary third-party application stores JSON strings in it's database like this one:
{"state":"complete","timestamp":1614776473000}
I need the timestamp and found out that
DB2 offers JSON functions. Since it's stored as string in the PROF_VALUE column, I guess that converting with SYSTOOLS.JSON2BSON is required, before I can use JSON_VAL to fetch the timestamp:
SELECT SYSTOOLS.JSON_VAL(SYSTOOLS.JSON2BSON(PROF_VALUE), "timestamp", "f")
FROM EMPINST.PROFILE_EXTENSIONS ext
WHERE PROF_PROPERTY_ID = 'touchpointState'
This causes an error that timestamp is invalid in the used context ( SQLCODE=-206, SQLSTATE=42703, DRIVER=4.26.14). The same error is thown when I remove the JSON2BSON call like this
SELECT SYSTOOLS.JSON_VAL(PROF_VALUE, "timestamp", "f")
Also not working with the same error (different data-types):
SELECT SYSTOOLS.JSON_VAL(SYSTOOLS.JSON2BSON(PROF_VALUE), "state", "s:1000")
SELECT SYSTOOLS.JSON_VAL(PROF_VALUE) "state", "s:1000")
I don't understand this error. My syntax is like the documented JSON_VAL ( json-value , search-string , result-type) and it is the same like in the examples, where they show how to fetch the name field of an object.
I also played around a bit with JSON_TABLE to use raw input data for testing (instead of the database data), but it seems not suiteable for that.
SELECT *
FROM TABLE(SYSTOOLS.JSON_TABLE( SYSTOOLS.JSON2BSON('{"state":"complete","timestamp":1614776473000}'), 'state','s:32')) DATA
This gave me a table with one row: Type = 2 and Value = complete.
I had two problems in my query: First it seems that double quotes " are for object references. I wasn't aware that there is any difference, because in most databases I used yet, both single ' and double quotes " are equal.
The second problem is, that JSON_VAL needs to be called without SYSTOOLS, but the reference is still needed on SYSTOOLS.JSON2BSON(PROF_VALUE).
With those changes, the following query worked:
SELECT JSON_VAL(SYSTOOLS.JSON2BSON(PROF_VALUE), 'timestamp', 'f')
FROM EMPINST.PROFILE_EXTENSIONS ext
WHERE PROF_PROPERTY_ID = 'touchpointState'

Using Athena to get terminatingrule from rulegrouplist in AWS WAF logs

I followed these instructions to get my AWS WAF data into an Athena table.
I would like to query the data to find the latest requests with an action of BLOCK. This query works:
SELECT
from_unixtime(timestamp / 1000e0) AS date,
action,
httprequest.clientip AS ip,
httprequest.uri AS request,
httprequest.country as country,
terminatingruleid,
rulegrouplist
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
LIMIT 100;
My issue is cleanly identifying the "terminatingrule" - the reason the request was blocked. As an example, a result has
terminatingrule = AWS-AWSManagedRulesCommonRuleSet
And
rulegrouplist = [
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesAmazonIpReputationList",
"terminatingrule": "null",
"excludedrules": "null"
},
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesKnownBadInputsRuleSet",
"terminatingrule": "null",
"excludedrules": "null"
},
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesLinuxRuleSet",
"terminatingrule": "null",
"excludedrules": "null"
},
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesCommonRuleSet",
"terminatingrule": {
"rulematchdetails": "null",
"action": "BLOCK",
"ruleid": "NoUserAgent_HEADER"
},
"excludedrules":"null"
}
]
The piece of data I would like separated into a column is rulegrouplist[terminatingrule].ruleid which has a value of NoUserAgent_HEADER
AWS provide useful information on querying nested Athena arrays, but I have been unable to get the result I want.
I have framed this as an AWS question but since Athena uses SQL queries, it's likely that anyone with good SQL skills could work this out.
It's not entirely clear to me exactly what you want, but I'm going to assume you are after the array element where terminatingrule is not "null" (I will also assume that if there are multiple you want the first).
The documentation you link to say that the type of the rulegrouplist column is array<string>. The reason why it is string and not a complex type is because there seems to be multiple different schemas for this column, one example being that the terminatingrule property is either the string "null", or a struct/object – something that can't be described using Athena's type system.
This is not a problem, however. When dealing with JSON there's a whole set of JSON functions that can be used. Here's one way to use json_extract combined with filter and element_at to remove array elements where the terminatingrule property is the string "null" and then pick the first of the remaining elements:
SELECT
element_at(
filter(
rulegrouplist,
rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON)
),
1
) AS first_non_null_terminatingrule
FROM waf_logs
WHERE action = 'BLOCK'
ORDER BY date DESC
You say you want the "latest", which to me is ambiguous and could mean both first non-null and last non-null element. The query above will return the first non-null element, and if you want the last you can change the second argument to element_at to -1 (Athena's array indexing starts from 1, and -1 is counting from the end).
To return the individual ruleid element of the json:
SELECT from_unixtime(timestamp / 1000e0) AS date, action, httprequest.clientip AS ip, httprequest.uri AS request, httprequest.country as country, terminatingruleid, json_extract(element_at(filter(rulegrouplist,rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON) ),1), '$.terminatingrule.ruleid') AS ruleid
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
I had the same issue but the solution posted by Theo didn't work for me, even though the table was created according to the instructions linked to in the original post.
Here is what worked for me, which is basically the same as Theo's solution, but without the json conversion:
SELECT
from_unixtime(timestamp / 1000e0) AS date,
action,
httprequest.clientip AS ip,
httprequest.uri AS request,
httprequest.country as country,
terminatingruleid,
rulegrouplist,
element_at(filter(ruleGroupList, ruleGroup -> ruleGroup.terminatingRule IS NOT NULL),1).terminatingRule.ruleId AS ruleId
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
LIMIT 100;

Select rows from table with jsonb column based on arbitrary jsonb filter expression

Test data
DROP TABLE t;
CREATE TABLE t(_id serial PRIMARY KEY, data jsonb);
INSERT INTO t(data) VALUES
('{"a":1,"b":2, "c":3}')
, ('{"a":11,"b":12, "c":13}')
, ('{"a":21,"b":22, "c":23}')
Problem statement: I want to receive an arbitrary JSONB parameter which acts as a filter on column t.data, such as
{ "b":{ "from":0, "to":20 }, "c":13 }
and use this to select matching rows from my test table t.
In this example, I want rows where b is between 0 and 20 and c = 13.
No error is required if the filter specifies a "column" (or "tag") which does not exist in t.data - it just fails to find a match.
I've used numeric values for simplicity but would like an approach which generalises to text as well.
What I have tried so far. I looked at the containment approach, which works for equality conditions, but am stumped on a generic way of handling range conditions:
select * from t
where t.data#> '{"c":13}'::jsonb;
Background: This problem arose when building a generic table-preview page on a website (for Admin users).
The page displays a filter based on various columns in whichever table is selected for preview.
The filter is then passed to a function in Postgres DB which applies this dynamic filter condition to the table.
It returns a jsonb array of the rows matching the filter specified by the user.
This jsonb array is then used to populate the Preview resultset.
The columns which make up the filter may change.
My Postgres version is 9.6 - thanks.
if you want to parse { "b":{ "from":0, "to":20 }, "c":13 } you need a parser. It is out of scope of json functions, but you can write "generic" query using AND and OR to filter by such json, eg:
https://www.db-fiddle.com/f/jAPBQggG3p7CxqbKLMbPKw/0
with filt(f) as (values('{ "b":{ "from":0, "to":20 }, "c":13 }'::json))
select *
from t
join filt on
(f->'b'->>'from')::int < (data->>'b')::int
and
(f->'b'->>'to')::int > (data->>'b')::int
and
(data->>'c')::int = (f->>'c')::int
;
Thanks for the comments/suggestions.
I will definitely look at GraphQL when I have more time - I'm working under a tight deadline at the moment.
It seems the consensus is that a fully generic solution is not achievable without a parser.
However, I got a workable first draft - it's far from ideal but we can work with it. Any comments/improvements are welcome ...
Test data (expanded to include dates & text fields)
DROP TABLE t;
CREATE TABLE t(_id serial PRIMARY KEY, data jsonb);
INSERT INTO t(data) VALUES
('{"a":1,"b":2, "c":3, "d":"2018-03-10", "e":"2018-03-10", "f":"Blah blah" }')
, ('{"a":11,"b":12, "c":13, "d":"2018-03-14", "e":"2018-03-14", "f":"Howzat!"}')
, ('{"a":21,"b":22, "c":23, "d":"2018-03-14", "e":"2018-03-14", "f":"Blah blah"}')
First draft of code to apply a jsonb filter dynamically, but with restrictions on what syntax is supported.
Also, it just fails silently if the syntax supplied does not match what it expects.
Timestamp handling a bit kludgey, too.
-- Handle timestamp & text types as well as int
-- See is_timestamp(text) function at bottom
with cte as (
select t.data, f.filt, fk.key
from t
, ( values ('{ "a":11, "b":{ "from":0, "to":20 }, "c":13, "d":"2018-03-14", "e":{ "from":"2018-03-11", "to": "2018-03-14" }, "f":"Howzat!" }'::jsonb ) ) as f(filt) -- equiv to cross join
, lateral (select * from jsonb_each(f.filt)) as fk
)
select data, filt --, key, jsonb_typeof(filt->key), jsonb_typeof(filt->key->'from'), is_timestamp((filt->key)::text), is_timestamp((filt->key->'from')::text)
from cte
where
case when (filt->key->>'from') is null then
case jsonb_typeof(filt->key)
when 'number' then (data->>key)::numeric = (filt->>key)::numeric
when 'string' then
case is_timestamp( (filt->key)::text )
when true then (data->>key)::timestamp = (filt->>key)::timestamp
else (data->>key)::text = (filt->>key)::text
end
when 'boolean' then (data->>key)::boolean = (filt->>key)::boolean
else false
end
else
case jsonb_typeof(filt->key->'from')
when 'number' then (data->>key)::numeric between (filt->key->>'from')::numeric and (filt->key->>'to')::numeric
when 'string' then
case is_timestamp( (filt->key->'from')::text )
when true then (data->>key)::timestamp between (filt->key->>'from')::timestamp and (filt->key->>'to')::timestamp
else (data->>key)::text between (filt->key->>'from')::text and (filt->key->>'to')::text
end
when 'boolean' then false
else false
end
end
group by data, filt
having count(*) = ( select count(distinct key) from cte ) -- must match on all filter elements
;
create or replace function is_timestamp(s text) returns boolean as $$
begin
perform s::timestamp;
return true;
exception when others then
return false;
end;
$$ strict language plpgsql immutable;

Using Postgres JSON Functions on table columns

I have searched extensively (in Postgres docs and on Google and SO) to find examples of JSON functions being used on actual JSON columns in a table.
Here's my problem: I am trying to extract key values from an array of JSON objects in a column, using jsonb_to_recordset(), but get syntax errors. When I pass the object literally to the function, it works fine:
Passing JSON literally:
select *
from jsonb_to_recordset('[
{ "id": 0, "name": "400MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"},
{ "id": 0, "name": "1000MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"}
]') as f(name text);`
results in:
400MB-PDF.pdf
1000MB-PDF.pdf
It extracts the value of the key "name".
Here's the JSON in the column, being extracted using:
select journal.data::jsonb#>>'{context,data,files}'
from journal
where id = 'ap32bbofopvo7pjgo07g';
resulting in:
[ { "id": 0, "name": "400MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"},
{ "id": 0, "name": "1000MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"}
]
But when I try to pass jsonb#>>'{context,data,files}' to jsonb_to_recordset() like this:
select id,
journal.data::jsonb#>>::jsonb_to_recordset('{context,data,files}') as f(name text)
from journal
where id = 'ap32bbofopvo7pjgo07g';
I get a syntax error. I have tried different ways but each time it complains about a syntax error:
Version:
PostgreSQL 9.4.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
The expressions after select must evaluate to a single value. Since jsonb_to_recordset returns a set of rows and columns, you can't use it there.
The solution is a cross join lateral, which allows you to expand one row into multiple rows using a function. That gives you single rows that select can act on. For example:
select *
from journal j
cross join lateral
jsonb_to_recordset(j.data#>'{context, data, files}') as d(id int, name text)
where j.id = 'ap32bbofopvo7pjgo07g'
Note that the #>> operator returns type text, and the #> operator returns type jsonb. As jsonb_to_recordset expects jsonb as its first parameter I'm using #>.
See it working at rextester.com
jsonb_to_recordset is a set-valued function and can only be invoked in specific places. The FROM clause is one such place, which is why your first example works, but the SELECT clause is not.
In order to turn your JSON array into a "table" that you can query, you need to use a lateral join. The effect is rather like a foreach loop on the source recordset, and that's where you apply the jsonb_to_recordset function. Here's a sample dataset:
create table jstuff (id int, val jsonb);
insert into jstuff
values
(1, '[{"outer": {"inner": "a"}}, {"outer": {"inner": "b"}}]'),
(2, '[{"outer": {"inner": "c"}}]');
A simple lateral join query:
select id, r.*
from jstuff
join lateral jsonb_to_recordset(val) as r("outer" jsonb) on true;
id | outer
----+----------------
1 | {"inner": "a"}
1 | {"inner": "b"}
2 | {"inner": "c"}
(3 rows)
That's the hard part. Note that you have to define what your new recordset looks like in the AS clause -- since each element in our val array is a JSON object with a single field named "outer", that's what we give it. If your array elements contain multiple fields you're interested in, you declare those in a similar manner. Be aware also that your JSON schema needs to be consistent: if an array element doesn't contain a key named "outer", the resulting value will be null.
From here, you just need to pull the specific value you need out of each JSON object using the traversal operator as you were. If I wanted only the "inner" value from the sample dataset, I would specify select id, r.outer->>'inner'. Since it's already JSONB, it doesn't require casting.