Querying JSONB array value for sub values? - sql

I have a JSONB Object:
{"name": "Foo", "interfaces": [{"name": "Bar", "status": "up"}]}
It is stored in a jsonb column of a table:
create table device (
    device_name character varying not null,
    device_data jsonb not null
);
So I was trying to get a count, by name, of devices which have interfaces that are not 'up'. GROUP BY is used for developing counts by name, but I am having issues querying the JSON list for values.
My first attempt was:
select device_name, count(*) from device where device_data -> 'interfaces' -> 'status' != 'up' group by device_name;
Something in the surrounding data made me think this was going to be difficult:
select count(device_data -> 'interfaces') from device;
which I thought was going to get me a count of all interfaces from all devices, but that is not correct. It seems to count one value per device row rather than the elements inside each array.
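A sketch of a query that does total the elements, by summing jsonb_array_length over each device's array:
select sum(jsonb_array_length(device_data -> 'interfaces')) as total_interfaces
from device;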
I'm thinking I might need to do a subquery or join on the inner content.
I've been thinking it over, and from looking around the PostgreSQL docs it seems like I haven't found a way to query a list type inside a jsonb object. Maybe I'm mistaken. I didn't want to build a business layer on top of this, as I figured the DBMS would be able to handle this heavy lifting.
I saw there is a function jsonb_array_elements_text, and jsonb_array_elements_text(device_data -> 'interfaces')::jsonb -> 'status' would return the information, but I can't do any sort of count on it, as count(jsonb_array_elements_text(device_data -> 'interfaces')::jsonb -> 'status') returns ERROR: set-valued function called in context that cannot accept a set

You need a lateral join to unnest the array and count the elements that are down (or not up):
select d.device_name, t.num_down
from device d
  cross join lateral (
    select count(*) as num_down
    from jsonb_array_elements(d.device_data -> 'interfaces') as x(i)
    where i ->> 'status' = 'down'
  ) t
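Note that count(*) in the lateral subquery always returns exactly one row (with 0 when nothing matches), so every device still shows up in the result even when none of its interfaces are down.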
To count all interfaces and the down interfaces, you can use filtered aggregation:
select d.device_name, t.*
from device d
  cross join lateral (
    select count(*) as all_interfaces,
           count(*) filter (where i ->> 'status' = 'down') as down_interfaces
    from jsonb_array_elements(d.device_data -> 'interfaces') as x(i)
  ) t
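For the sample JSON above (a single interface with status "up"), this yields all_interfaces = 1 and down_interfaces = 0.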

jsonb_array_elements is the right idea. I think you are looking for an EXISTS condition to match your description, "devices which have interfaces that are not 'up'":
SELECT device_name, count(*)
FROM device
WHERE EXISTS (SELECT *
              FROM jsonb_array_elements(device_data -> 'interfaces') interface
              WHERE interface ->> 'status' != 'up')
GROUP BY device_name;
"I would like to know how many interfaces are down"
That's a different problem; for this you could use a subquery in the SELECT clause, and you probably wouldn't need any grouping:
SELECT
  device_name,
  ( SELECT count(*)
    FROM jsonb_array_elements(device_data -> 'interfaces') interface
    WHERE interface ->> 'status' != 'up'
  ) AS down_count
FROM device
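Unlike the EXISTS variant, this returns one row per device, with down_count = 0 for devices whose interfaces are all up.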

Related

How to cast only part of a table using a single SQL command in PostgreSQL

In a PostgreSQL table I have various pieces of information stored as text. What type of information a row holds depends on the context described by a type column. The application is expected to fetch the ids of the matching rows with a single command.
I got into trouble when I tried to compare the information (a bigint stored as a string) with an external value (e.g. '9' > '11'). When I tried to cast the column, the database returned an error, since not all values in the column are castable (e.g. datetime or plain text). Even when I tried to cast only the result of a query command, I got a cast error.
I get the castable rows with this command:
SELECT information.id as id, item.information::bigint as item
FROM information
INNER JOIN item
ON information.id = item.informationid
WHERE information.type = 'task'
The resulting rows show only text that is castable. But when I feed it into another command, it results in an error:
SELECT x.id FROM (
    SELECT information.id AS id, item.information::bigint AS item
    FROM information
    INNER JOIN item
        ON information.id = item.informationid
    WHERE information.type = 'task'
) AS x
WHERE x.item > '0'::bigint
According to the error, the database tried to cast all rows in the table.
Technically, this happens because the optimizer considers WHERE x.item > '0'::bigint a much more efficient filter than information.type = 'task', so it pushes that condition down into the table scan and evaluates it first. The reasoning is not wrong, but it drops you into this seemingly illogical trouble.
Gordon's suggestion to use CASE WHEN inf.type = 'task' THEN i.information::bigint END avoids this, but it may spoil your plan to keep that as a subquery, and it requires the same condition to be written twice.
A funny trick I tried is to use OUTER APPLY:
SELECT x.* FROM (SELECT 1 AS dummy) dummy
OUTER APPLY (
    SELECT information.id AS id, item.information::bigint AS item
    FROM information
    INNER JOIN item
        ON information.id = item.informationid
    WHERE information.type = 'task'
) x
WHERE x.item > '0'::bigint
Sorry that I only verified the SQL Server version of this. I understand PostgreSQL has no OUTER APPLY, but the equivalent should be:
SELECT x.* FROM (SELECT 1 AS dummy) dummy
LEFT JOIN LATERAL (
    SELECT information.id AS id, item.information::bigint AS item
    FROM information
    INNER JOIN item
        ON information.id = item.informationid
    WHERE information.type = 'task'
) x ON true
WHERE x.item > '0'::bigint
Finally, a tidier but less flexible method is to add an optimizer hint that disables this behaviour, forcing the optimizer to run the query as written.
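PostgreSQL has no hints, but a materialized CTE acts as an optimization fence: the outer predicate cannot be pushed into it, so the cast is only applied to rows that survive the type = 'task' filter. A sketch, assuming PostgreSQL 12+ where MATERIALIZED is explicit (before v12, a plain WITH behaves this way by default):
WITH x AS MATERIALIZED (
    SELECT information.id AS id, item.information::bigint AS item
    FROM information
    INNER JOIN item
        ON information.id = item.informationid
    WHERE information.type = 'task'
)
SELECT x.id FROM x
WHERE x.item > '0'::bigint;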
This is unfortunate. Try using a case expression:
SELECT inf.id as id,
(CASE WHEN inf.type = 'task' THEN i.information::bigint END) as item
FROM information inf JOIN
item i
ON inf.id = i.informationid
WHERE inf.type = 'task';
There is no guarantee that the WHERE filter is applied before the SELECT expressions are evaluated. However, CASE does guarantee that its THEN branch is only evaluated when the condition is true, so it is safe.

PostgreSQL ALL(A) <# ANY(B)

The objective is to solve the following use case:
the table contains many numrange[] fields; let A be one of those fields
we need to request rows with a parameter B of type numrange[], according to this rule: ALL(A) <# ANY(B)
A sample request on table dt.t with B = {[1,3],[9,10]} would be:
select * from dt.t where ALL(A) <# ANY(ARRAY[numrange(1,3),numrange(9,10)])
So it seems feasible. But the ALL operator can only be used on the right side of the condition...
I've been turning it around for about a day and haven't found a clue how to solve this use case (without functions, if possible).
The real use case will involve filtering on many fields, so the solution needs to work for multiple fields in the same WHERE clause:
select *
from dt.t
where ALL(A1) <# ANY(ARRAY[numrange(1,3),numrange(9,10)])
and ALL(A2) <# ANY(ARRAY[numrange(10,13),numrange(20,20)])
Found this solution:
select *
from dt.t t1
where (
    select count(1)
    from (
        select unnest(A) a
        from dt.t t2
        where t2.id = t1.id
    ) t
    where t.a <# ANY(ARRAY['[1,3)'::numrange])
) = array_length(A, 1);
The idea is:
select unnest(A) a from dt.t t2 where t2.id = t1.id => yields each element of the array field A
t.a <# ANY(ARRAY['[1,3)'::numrange]) => tests whether this element is contained in the parameter => the <# ANY(B) part
(select count(1) [...]) = array_length(A, 1) => checks that all elements of A pass => the ALL(A) part of the problem
Tried it, it works, and it seems legit. The only really important thing is that B must be the minimal union of itself (there must be no numrange[] equivalent to B with fewer ranges in it).
Apart from that, it seems to work. Thank you all for your help and time.
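For reference, an equivalent formulation can avoid the count comparison by asking for the absence of a counter-example instead; a sketch, assuming the same table and column names (the minimal-union caveat on B applies here too):
select *
from dt.t t
where not exists (
    select 1
    from unnest(t.A) elem
    where not (elem <# ANY(ARRAY['[1,3)'::numrange]))
);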

Squeryl Select Duplicates

I would like to find overlapping data with a Squeryl query. I can do so by using the method found here with normal SQL, but I can't figure out how to do it using Squeryl.
Basically, I need to convert this query that finds non-distinct rows to Squeryl:
SELECT *
FROM myTable L1
JOIN (
    SELECT myField1, myField2
    FROM myTable
    GROUP BY myField1, myField2
    HAVING COUNT(*) >= 2
) L2
    ON L1.myField1 = L2.myField1 AND L1.myField2 = L2.myField2;
EDIT: More importantly, I need to be able to do this dynamically. I have a somewhat complex dynamic query that may rely on different options being passed: if an Option is defined then the clause should apply, otherwise it is inhibited. But groupBy does not support an inhibitWhen method. To see a full explanation of my current method, look here:
def getAllJoined(
    hasFallback: Option[String] = None,
    showDuplicates: Option[String] = None): List[(Type1, Type2)] = transaction {
  join(mainTable,
       table2,
       table3,
       table3,
       table4.leftOuter,
       table4.leftOuter,
       table5,
       table6)((main, attr1, attr2, attr3, attr4, attr5, attr6, attr7) =>
    where(
      main.fallBack.isNotNull.inhibitWhen(!hasFallback.isDefined)
    )
    // What to do here to only find duplicates when showDuplicates.isDefined? AKA non-distinct
    select(main, attr1, attr2, attr3, attr4, attr5, attr6, attr7)
    on(
      (main.attr1Col === attr1.id),
      (main.attr2Col === attr2.id),
      (main.attr3Col === attr3.id),
      (main.attr4Col === attr4.map(_.id)),
      (main.attr5Col === attr5.map(_.id)),
      (main.attr6Col === attr6.id),
      (main.attr7Col === attr7.id)
    )
  ).toList
}
Check out this discussion on Google Groups. It looks like they fixed a bug related to inhibited having clauses in 2011, but I'm not sure why it still persists in your case. They also have an example query using the having clause in the same thread.

Select first or random row in group by

I have this query using PostgreSQL 9.1 (9.2 as soon as our hosting platform upgrades):
SELECT
media_files.album,
media_files.artist,
ARRAY_AGG(media_files.id) AS media_file_ids
FROM
media_files
INNER JOIN playlist_media_files ON media_files.id = playlist_media_files.media_file_id
WHERE
playlist_media_files.playlist_id = 1
GROUP BY
media_files.album,
media_files.artist
ORDER BY
media_files.album ASC
and it's working fine; the goal was to extract album/artist combinations and, in the result set, have an array of media file ids for that particular combo.
The problem is that I have another column in media_files, which is artwork.
artwork is unique for each media file (even within the same album), but in the result set I need to return just the first of the set.
So, for an album that has 10 media files, I also have 10 corresponding artworks, but I would like to return just the first (or a randomly picked one from that collection).
Is that possible to do with only SQL/window functions (first_value over ...)?
Yes, it's possible. First, let's tweak your query by adding aliases and explicit column qualifiers so it's clear what comes from where (assuming I've guessed correctly, since I can't be sure without table definitions):
SELECT
mf.album,
mf.artist,
ARRAY_AGG (mf.id) AS media_file_ids
FROM
"media_files" mf
INNER JOIN "playlist_media_files" pmf ON mf.id = pmf.media_file_id
WHERE
pmf.playlist_id = 1
GROUP BY
mf.album,
mf.artist
ORDER BY
mf.album ASC
Now you can either use a subquery in the SELECT list or maybe use DISTINCT ON, though it looks like any solution based on DISTINCT ON would be so convoluted as not to be worth it.
What you really want is something like a pick_arbitrary_value_agg aggregate that just picks the first value it sees and throws the rest away. There is no such aggregate, and it isn't really worth implementing for this job. You could use min(artwork) or max(artwork), and you may find that this actually performs better than the later solutions.
To use a subquery, leave the ORDER BY as it is and add the following as an extra column in your SELECT list:
(SELECT mf2.artwork
FROM media_files mf2
WHERE mf2.artist = mf.artist
AND mf2.album = mf.album
LIMIT 1) AS picked_artwork
You can, at a performance cost, randomize the selected artwork by adding ORDER BY random() before the LIMIT 1 above.
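That is, the same subquery with the randomizing clause added:
(SELECT mf2.artwork
 FROM media_files mf2
 WHERE mf2.artist = mf.artist
 AND mf2.album = mf.album
 ORDER BY random()
 LIMIT 1) AS picked_artwork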
Alternately, here's a quick and dirty way to implement selection of a random row in-line:
(array_agg(artwork))[width_bucket(random(),0,1,count(artwork)::integer)]
This collects the group's artwork values into an array and indexes it at a random position: width_bucket(random(), 0, 1, n) returns a uniformly distributed integer between 1 and n.
Since there's no sample data I can't test these modifications. Let me know if there's an issue.
"First" pick
Wouldn't it be simpler / cheaper to just use min():
SELECT m.album
,m.artist
,array_agg(m.id) AS media_file_ids
,min(m.artwork) AS artwork
FROM playlist_media_files p
JOIN media_files m ON m.id = p.media_file_id
WHERE p.playlist_id = 1
GROUP BY m.album, m.artist
ORDER BY m.album, m.artist;
Arbitrary / random pick
If you are looking for a random selection, @Craig already provided a solution with truly random picks.
You could also use a CTE to avoid additional scans on the (possibly big) base table and then run two separate (cheap) subqueries on the small result set.
For arbitrary selection (not truly random), where the result depends on the physical order of rows in the table and implementation specifics:
WITH x AS (
    SELECT m.album, m.artist, m.id, m.artwork
    FROM playlist_media_files p
    JOIN media_files m ON m.id = p.media_file_id
    WHERE p.playlist_id = 1
)
SELECT a.album, a.artist, a.media_file_ids, b.artwork
FROM (
    SELECT album, artist, array_agg(id) AS media_file_ids
    FROM x
    GROUP BY album, artist
) a
JOIN (
    SELECT DISTINCT ON (1, 2) album, artist, artwork
    FROM x
) b USING (album, artist);
For truly random results, you can add an ORDER BY .. random() like this to subquery b:
JOIN (
SELECT DISTINCT ON (1, 2) album, artist, artwork
FROM x
ORDER BY 1, 2, random()
) b USING (album, artist);

How to use min() in where/having clause (to avoid subquery) in Hive/SQL

I have a large table of events. Per user, I want to count the occurrence of type A events before the earliest type B event.
I am searching for an elegant query. Hive is used, so I can't do subqueries.
Timestamp   Type   User
...         A      X
...         A      X
...         B      X
...         A      X
...         A      X
...         A      Y
...         A      Y
...         A      Y
...         B      Y
...         A      Y
Wanted Result:
User   Count_Type_A
X      2
Y      3
I can get the "cut-off" timestamp by doing:
SELECT User, min(Timestamp)
WHERE Type = 'B'
GROUP BY User;
But then how can I use that information inside the next query where I want to do something like:
SELECT User, count(Timestamp)
WHERE Type = 'A' AND Timestamp < min(User.Timestamp_Type_B)
GROUP BY User;
My only idea so far is to determine the cut-off timestamps first, then join them with all type A events and select from the resulting table, but that feels wrong and would look ugly.
I'm also considering the possibility that this is the wrong type of problem/analysis for Hive and that I should consider hand-written MapReduce or Pig instead.
Please help me by pointing in the right direction.
First update:
In response to Cilvic's first comment on this answer, I've adjusted my query to the following, based on workarounds suggested in the comments found at https://issues.apache.org/jira/browse/HIVE-556:
SELECT [User], COUNT([Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
CROSS JOIN (SELECT [User], min([Timestamp]) [First_B_TS] FROM [Dataset]
WHERE [Type] = 'B'
GROUP BY [User]) sub
WHERE main.[Type] = 'A'
AND (sub.[User] = main.[User])
AND (main.[Timestamp] < sub.[First_B_TS])
GROUP BY main.[User]
Original:
Give this a shot:
SELECT [User], COUNT([Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
JOIN (SELECT [User], min([Timestamp]) [First_B_TS] FROM [Dataset]
WHERE [Type] = 'B'
GROUP BY [User]) sub
ON (sub.[User] = main.[User]) AND (main.[Timestamp] < sub.[First_B_TS])
WHERE main.[Type] = 'A'
GROUP BY main.[User]
I did my best to follow Hive syntax. Let me know if you have any questions. I would like to know why you wish/need to avoid a subquery.
In general, I +1 coge.soft's solution. Here it is again for your reference:
SELECT [User], COUNT([Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
JOIN (SELECT [User], min([Timestamp]) [First_B_TS] FROM [Dataset]
WHERE [Type] = 'B'
GROUP BY [User]) sub
ON (sub.[User] = main.[User]) AND (main.[Timestamp] < sub.[First_B_TS])
WHERE main.[Type] = 'A'
GROUP BY main.[User]
However, a couple of things to note:
What happens when there are no B events? Assuming you would want to count all the A events per user in that case, the inner join specified in the solution wouldn't work, since there would be no entry for that user in the sub table. You would need to change it to a left outer join; see the sketch after this list.
The solution also does 2 passes over the data: one to populate the sub table, another to join the sub table with the main table. Depending on your notion of performance and efficiency, there is an alternative where you could do this in a single pass of the data. You can distribute the data by user using Hive's DISTRIBUTE BY functionality and write a custom reducer that does your count calculation in your favorite language using Hive's TRANSFORM functionality.
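A sketch of that left outer join variant, keeping the same [Dataset] placeholder names as above; the timestamp condition moves to the WHERE clause with a NULL check so that users without any B event keep all their A events counted (untested):
SELECT main.[User], COUNT(main.[Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
LEFT OUTER JOIN (SELECT [User], min([Timestamp]) [First_B_TS] FROM [Dataset]
                 WHERE [Type] = 'B'
                 GROUP BY [User]) sub
ON (sub.[User] = main.[User])
WHERE main.[Type] = 'A'
AND (sub.[First_B_TS] IS NULL OR main.[Timestamp] < sub.[First_B_TS])
GROUP BY main.[User]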
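And a skeleton of the single-pass DISTRIBUTE BY / TRANSFORM approach; the reducer script name is hypothetical, and it would emit one (user, count) pair after reading each user's rows in timestamp order:
ADD FILE count_before_first_b.py;
FROM (
    SELECT User, Type, Timestamp
    FROM Dataset
    DISTRIBUTE BY User
    SORT BY User, Timestamp
) t
SELECT TRANSFORM (t.User, t.Type, t.Timestamp)
USING 'python count_before_first_b.py'
AS (User, Count_Type_A);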