Get union and intersection of jsonb array in Postgresql - sql

I have a DB of people with a jsonb column interests. In my application a user can search for people by providing hobbies chosen from a set of predefined values. I want to offer the best match, so I would like to score each match as intersection/union of interests. This way the top results won't simply be the people who have plenty of hobbies in my DB.
Example:
DB records:
name interests::jsonb
Mary ["swimming","reading","jogging"]
John ["climbing","reading"]
Ann ["swimming","watching TV","programming"]
Carl ["knitting"]
user input in app:
["reading", "swimming", "knitting", "cars"]
my script should output this:
Mary 0.4
John 0.2
Ann 0.16667
Carl 0.25
Now I'm using
SELECT name
FROM people
WHERE interests @>
ANY (ARRAY ['"reading"', '"swimming"', '"knitting"', '"cars"']::jsonb[])
but this also returns records with many interests and gives me no way to order the results.
Is there any way I can achieve it in a reasonable time - let's say up to 5 seconds in DB with around 400K records?
EDIT:
I added another example to clarify my calculation. The score needs to penalize people with many hobbies, so the match should be calculated as Intersection(input, db_record)/Union(input, db_record).
Example:
input = ["reading"]
DB records:
name interests::jsonb
Mary ["swimming","reading","jogging"]
John ["climbing","reading"]
Ann ["swimming","watching TV","programming"]
Carl ["reading"]
Match for Mary would be calculated as (LENGTH(["reading"]))/(LENGTH(["swimming","reading","jogging"])), which is 0.3333,
and for Carl it would be (LENGTH(["reading"]))/(LENGTH(["reading"])), which is 1.
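The score described here is the Jaccard index of the two sets. As a sanity check of the expected numbers from the first example, here is a minimal Python sketch. It runs on the application side and is not the SQL solution; the function name jaccard and the handling of two empty sets are assumptions made for illustration.

```python
def jaccard(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections of hobbies."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets have no meaningful score; returning 0.0 is an assumption.
    return len(a & b) / len(union) if union else 0.0


people = {
    "Mary": ["swimming", "reading", "jogging"],
    "John": ["climbing", "reading"],
    "Ann": ["swimming", "watching TV", "programming"],
    "Carl": ["knitting"],
}
user_input = ["reading", "swimming", "knitting", "cars"]

# Round to five decimals to match the precision used in the question.
scores = {
    name: round(jaccard(user_input, hobbies), 5)
    for name, hobbies in people.items()
}
print(scores)
```

With the input ["reading"] from the second example the same function gives one third for Mary and 1 for Carl.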
UPDATE: I managed to do it with
SELECT result.id, result.name, result.overlap_count/(jsonb_array_length(persons.interests) + 4 - result.overlap_count)::decimal as score
FROM (SELECT t1.name as name, t1.id, COUNT(t1.name) as overlap_count
FROM (SELECT name, id, jsonb_array_elements(interests)
FROM persons) as t1
JOIN (SELECT unnest(ARRAY ['"reading"', '"swimming"', '"knitting"', '"cars"'])::jsonb as elements) as t2 ON t1.jsonb_array_elements = t2.elements
GROUP BY t1.name, t1.id) as result
JOIN persons ON result.id = persons.id ORDER BY score desc
Here's my fiddle https://dbfiddle.uk/?rdbms=postgres_12&fiddle=b4b1760854b2d77a1c7e6011d074a1a3
However it's not fast enough and I would appreciate any improvements.
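The denominator in the query above, jsonb_array_length(persons.interests) + 4 - overlap_count, is the size of the union obtained by inclusion-exclusion, with 4 being the number of search terms. A small Python sketch of that identity follows; the function name is hypothetical, and the identity only holds if neither list contains duplicate values.

```python
def score_from_counts(record_len, input_len, overlap):
    """Jaccard score computed from list lengths and the overlap count."""
    # |A ∪ B| = |A| + |B| - |A ∩ B|; assumes no duplicates in either list.
    return overlap / (record_len + input_len - overlap)


mary = ["swimming", "reading", "jogging"]
user_input = ["reading", "swimming", "knitting", "cars"]
overlap = len(set(mary) & set(user_input))

from_counts = score_from_counts(len(mary), len(user_input), overlap)
from_sets = overlap / len(set(mary) | set(user_input))
print(from_counts, from_sets)
```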

One option is to unnest the parameter and use the ? operator to check each element against the jsonb array:
select
t.name,
x.match_ratio
from mytable t
cross join lateral (
select avg( (t.interests ? a.val)::int ) match_ratio
from unnest(array['reading', 'swimming', 'knitting', 'cars']) a(val)
) x
It is not very clear what the rules are behind the result that you are showing. This gives you a ratio that represents the percentage of values in the parameter array that can be found in the interests of each person (so Mary gets 0.5 since she has two interests in common with the search parameter, and all other names get 0.25).
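To make the difference from an intersection/union score concrete, here is a rough Python equivalent of the ratio this query computes: the average of a 0/1 flag per search term. The function name match_ratio is only illustrative.

```python
def match_ratio(search_terms, interests):
    """Fraction of the search terms that appear in a person's interests."""
    interests = set(interests)
    # One boolean per search term, mirroring avg((interests ? val)::int).
    hits = [term in interests for term in search_terms]
    return sum(hits) / len(hits)


people = {
    "Mary": ["swimming", "reading", "jogging"],
    "John": ["climbing", "reading"],
    "Ann": ["swimming", "watching TV", "programming"],
    "Carl": ["knitting"],
}
search = ["reading", "swimming", "knitting", "cars"]

ratios = {name: match_ratio(search, hobbies) for name, hobbies in people.items()}
print(ratios)
```

The denominator is always the number of search terms, so a person with many hobbies is not penalized the way the intersection/union score would penalize them.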

One option would be using jsonb_array_elements() to unnest the jsonb column :
SELECT name, count / SUM(count) over () AS ratio
FROM(
SELECT name, COUNT(name) AS count
FROM people
JOIN jsonb_array_elements(interests) AS j(elm) ON TRUE
WHERE interests @>
ANY (ARRAY ['"reading"', '"swimming"', '"knitting"', '"cars"']::jsonb[])
GROUP BY name ) q

Related

PostgreSQL Return Row if Value Exists in One of Several Columns

Ok, I am stuck on this one.
I have a PostgreSQL table customers that looks like this:
id firm1 firm2 firm3 firm4 firm5 lastname firstname
1 13 8 2 0 0 Smith John
2 3 2 0 0 0 Doe Jane
Each row corresponds to a client/customer. Each client/customer can be associated with one or multiple firms; the numeric value under each firm# columns corresponds to the firm id in a different table.
So I am looking for a way of returning all rows of customers that are associated with a specific firm.
For example, SELECT id, lastname, firstname where 8 exists in firm1, firm2, firm3, firm4, firm5 would just return the John Smith row as he is associated with firm 8 under the firm2 column.
Any ideas on how to accomplish that?
You can use the IN operator for that:
SELECT *
FROM customers
where 8 IN (firm1, firm2, firm3, firm4, firm5);
But it would be much better in the long run if you normalized your data model.
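The IN approach can be tried out quickly without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in (an assumption; SQLite is not PostgreSQL, but it accepts the same value IN (column, column, ...) form) and loads the two sample rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, firm1 INTEGER, firm2 INTEGER,"
    " firm3 INTEGER, firm4 INTEGER, firm5 INTEGER, lastname TEXT, firstname TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 13, 8, 2, 0, 0, "Smith", "John"),
        (2, 3, 2, 0, 0, 0, "Doe", "Jane"),
    ],
)

# The searched firm id is bound as a parameter and compared with every firm column.
rows = conn.execute(
    "SELECT id, lastname, firstname FROM customers"
    " WHERE ? IN (firm1, firm2, firm3, firm4, firm5)",
    (8,),
).fetchall()
print(rows)
```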
You should consider normalizing your tables. With the current schema you have to join the firms table once for each firm column in your customers table:
select *
from customers c
left join firms f1
on f1.firm_id = c.firm1
left join firms f2
on f2.firm_id = c.firm2
left join firms f3
on f3.firm_id = c.firm3
left join firms f4
on f4.firm_id = c.firm4
left join firms f5
on f5.firm_id = c.firm5
You can "unpivot" using a combination of array and unnest, as specified in this answer: unpivot and PostgreSQL.
In your case, I think this should work:
select lastname,
firstname,
unnest(array[firm1, firm2, firm3, firm4, firm5]) as firm_id
from customers
Now you can select from this result (using either a WITH statement or an inner query) where firm_id is the value you care about.
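The unpivot idea turns each customer row into one row per firm column, which can then be filtered on a single firm_id value. A minimal Python sketch of the same transformation follows, using the sample rows from the question; the variable names are illustrative only.

```python
# (id, firm1..firm5, lastname, firstname), as in the customers table.
rows = [
    (1, 13, 8, 2, 0, 0, "Smith", "John"),
    (2, 3, 2, 0, 0, 0, "Doe", "Jane"),
]

# One output row per firm column, like unnest(array[firm1, ..., firm5]).
unpivoted = [
    (lastname, firstname, firm_id)
    for (_id, *firms, lastname, firstname) in rows
    for firm_id in firms
]

# Filtering the unpivoted rows is now a plain equality test.
matches = [(last, first) for (last, first, firm_id) in unpivoted if firm_id == 8]
print(matches)
```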

Alternative for GROUP BY and STUFF in SQL

I am writing some SQL queries in AWS Athena. I have 3 tables search, retrieval and intent. In search table I have 2 columns id and term i.e.
id term
1 abc
1 bcd
2 def
1 ghd
What I want is to write a query to get:
id term
1 abc, bcd, ghd
2 def
I know this can be done using STUFF and FOR XML PATH, but Athena does not yet support all SQL features. Is there any other way to achieve this? My current query is:
select search.id , STUFF(
(select ',' + search.term
from search
FOR XML PATH('')),1,1,'')
FROM search
group by search.id
Also, I have one more question. I have retrieval table that consist of 3 columns i.e.:
id time term
1 0 abc
1 20 bcd
1 100 gfh
2 40 hfg
2 60 lkf
What I want is:
id time term
1 100 gfh
2 60 lkf
I want to write a query to get the id and term on the basis of max value of time. Here is my current query:
select retrieval.id, max(retrieval.time), retrieval.term
from retrieval
group by retrieval.id, retrieval.term
order by max(retrieval.time)
I am getting duplicate ids along with the term. I think it is because I am grouping by both id and term, but I am not sure how I can achieve this without using group by.
The XML method is a workaround specific to SQL Server. There is no reason to attempt it in any other database.
One method uses arrays:
select s.id, array_agg(s.term)
from search s
group by s.id;
Because the database supports arrays, you should learn to use them. You can convert the array to a string:
select s.id, array_join(array_agg(s.term), ',') as terms
from search s
group by s.id;
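For readers who want to see what the aggregation produces, here is a small Python sketch that mimics array_agg followed by array_join on the sample search rows. It is only an illustration of the grouping step; unlike this loop, array_agg does not guarantee element order unless an ordering is specified.

```python
from collections import defaultdict

# (id, term) rows from the search table in the question.
rows = [(1, "abc"), (1, "bcd"), (2, "def"), (1, "ghd")]

# Collect the terms per id, like array_agg(term) ... group by id.
grouped = defaultdict(list)
for id_, term in rows:
    grouped[id_].append(term)

# Join each list into one string, like array_join(..., ', ').
joined = {id_: ", ".join(terms) for id_, terms in grouped.items()}
print(joined)
```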
GROUP BY is a grouping operation: rows are collapsed into groups, so every selected column must either be part of the grouping or be wrapped in an aggregate such as min, max or count.
I am answering only one of the questions; the same idea can be used to work out question 1.
For question 2, find the maximum time per id in a subquery and join it back to the table to fetch the matching term:
select r2.id, r2.time, r2.term
from (select id, max(time) as time
from retrieval
group by id
) r1, retrieval as r2
where r1.id = r2.id
and r1.time = r2.time
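The expected result for question 2 is the row with the greatest time per id. A short Python sketch of that selection on the sample retrieval rows follows, as a way to check the expected output. One assumption to note: when two rows of the same id tie on time, this loop keeps only the first one.

```python
# (id, time, term) rows from the retrieval table in the question.
rows = [
    (1, 0, "abc"),
    (1, 20, "bcd"),
    (1, 100, "gfh"),
    (2, 40, "hfg"),
    (2, 60, "lkf"),
]

latest = {}
for id_, time, term in rows:
    # Keep a row only if it beats the best time seen so far for this id.
    if id_ not in latest or time > latest[id_][1]:
        latest[id_] = (id_, time, term)

result = sorted(latest.values())
print(result)
```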

sql statement to select previous rows to a search param

I'm after an SQL statement (if one exists), or a way to combine several SQL statements, to achieve the following.
I have a listbox and a search text box.
In the search box, the user would enter a surname, e.g. smith.
I then want to query the database for the search with something like this:
select * FROM customer where surname LIKE searchparam
This would give me all the results for customers with a surname containing SMITH. Simple, right?
What I need to do is limit the results returned. This statement could give me thousands of rows if the search param was just S.
What I want is the result limited to the first 20 matches AND the 10 rows prior to the first match.
For example, SMI search:
Sives
Skimmings
Skinner
Skipper
Slater
Sloan
Slow
Small
Smallwood
Smetain
Smith ----------- This is the first match of my query. But I want the previous 10 and following 20.
Smith
Smith
Smith
Smith
Smoday
Smyth
Snedden
Snell
Snow
Sohn
Solis
Solomon
Solway
Sommer
Sommers
Soper
Sorace
Spears
Spedding
Is there any way to do this?
As few sql statements as possible.
Reason? I am creating an app for users with slow internet connections.
I am using POSTGRESQL v9
Thanks
Andrew
WITH ranked AS (
SELECT *, ROW_NUMBER() over (ORDER BY surname) AS rowNumber FROM customer
)
SELECT ranked.*
FROM ranked, (SELECT MIN(rowNumber) target FROM ranked WHERE surname LIKE searchparam) found
WHERE ranked.rowNumber BETWEEN found.target - 10 AND found.target + 20
ORDER BY ranked.rowNumber
SQL Fiddle here. Note that the fiddle uses the example data, and I modified the range to 3 entries before and 6 entries past.
I'm assuming that you're looking for a general algorithm ...
It sounds like you're looking for a combination of finding the matches "greater than or equal to smith", and "less than smith".
For the former you'd order by surname and limit the result to 20, and for the latter you'd order by surname descending and limit to 10.
The two result sets can then be added together as arrays and reordered.
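The general algorithm can be sketched on the client side with a binary search over the sorted surnames. This is a minimal Python sketch under stated assumptions: the search is a prefix search, the list is already sorted with the same case-sensitive ordering used for the comparison (database collations may order differently), and the function name window_around is hypothetical.

```python
from bisect import bisect_left

names = [
    "Sives", "Skimmings", "Skinner", "Skipper", "Slater", "Sloan", "Slow",
    "Small", "Smallwood", "Smetain", "Smith", "Smith", "Smith", "Smith",
    "Smith", "Smoday", "Smyth", "Snedden", "Snell", "Snow", "Sohn", "Solis",
    "Solomon", "Solway", "Sommer", "Sommers", "Soper", "Sorace", "Spears",
    "Spedding",
]


def window_around(sorted_names, search, before=10, count=20):
    """Return `before` rows preceding the first name >= search, plus `count` rows from it."""
    first = bisect_left(sorted_names, search)  # index of the first name >= search
    return sorted_names[max(0, first - before):first + count]


full = window_around(names, "Smi")
small = window_around(names, "Smi", before=3, count=6)
print(small)
```

In SQL the two halves correspond to one query ordered ascending with a limit of 20 and one ordered descending with a limit of 10.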
I think you need to use ROW_NUMBER() (see this link).
WITH cust1 AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY surname) as numRow FROM customer
)
SELECT c1.surname, c1.numRow, x.flag
FROM cust1 c1, (SELECT *,
case when numRow = (SELECT MIN(numRow) FROM cust1 WHERE surname='Smith') then 1 else 0 end as flag
FROM cust1) x
WHERE x.flag = 1 and c1.numRow BETWEEN x.numRow - 1 AND x.numRow + 1
ORDER BY c1.numRow
SQLFiddle here.
This works, but the flag isn't actually necessary; without it the query would look like the one PinnyM posted.
A variation on @PinnyM's solution:
WITH ranked AS (
SELECT
*,
ROW_NUMBER() over (ORDER BY surname) AS rowNumber
FROM customer
),
minrank AS (
SELECT
*,
MIN(CASE WHEN surname LIKE searchparam THEN rowNumber END) OVER () AS target
FROM ranked
)
SELECT
surname
FROM minrank
WHERE rowNumber BETWEEN target - 10 AND target + 20
;
Instead of two separate calls to the ranked CTE, one to get the first match's row number and the other to read the results from, another CTE is introduced to serve both purposes. Can't speak for PostgreSQL but in SQL Server this might result in a better execution plan for the query, although in either case the real efficiency would still need to be verified by proper testing.

SQL server - How to find the highest number in '<> ' in a text column?

Let's say I have the following data in the Employee table: (nothing more)
ID FirstName LastName x
-------------------------------------------------------------------
20 John Mackenzie <A>te</A><b>wq</b><a>342</a><d>rt21</d>
21 Ted Green <A>re</A><b>es</b><1>t34w</1><4>65z</4>
22 Marcy Nate <A>ds</A><b>tf</b><3>fv 34</3><6>65aa</6>
I need to search the x column and get the highest number that appears inside <> brackets.
What sort of SELECT statement can get me, for example, the number 6 like in <6>, in the x column?
This type of query relies on a fixed pattern: I assume that the digit in <6> always sits at position LEN(X)-9, that is, 9 characters before the last character of the string.
Please note that if the pattern changes, the query below will not work.
SELECT A.* FROM YOURTABLE A INNER JOIN
(SELECT TOP 1 ID,Firstname,Lastname,SUBSTRING(X,LEN(X)-9,1) AS [ORDER]
FROM YOURTABLE
WHERE ISNUMERIC(SUBSTRING(X,LEN(X)-9,1))=1
ORDER BY SUBSTRING(X,LEN(X)-9,1) DESC)B
ON
A.ID=B.ID AND
A.FIRSTNAME=B.FIRSTNAME AND
A.LASTNAME=B.LASTNAME
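If the position of the tag cannot be relied on, one alternative is to extract the numbers with a regular expression in application code. The following Python sketch is not a SQL Server solution; it assumes the rows have already been fetched and that only opening tags made up entirely of digits, such as <6>, count as numbers.

```python
import re

# id -> x column, copied from the sample Employee rows.
rows = {
    20: "<A>te</A><b>wq</b><a>342</a><d>rt21</d>",
    21: "<A>re</A><b>es</b><1>t34w</1><4>65z</4>",
    22: "<A>ds</A><b>tf</b><3>fv 34</3><6>65aa</6>",
}

# Matches opening tags whose name is all digits; closing tags like </6> are skipped.
TAG_NUMBER = re.compile(r"<(\d+)>")


def highest_tag_number(text):
    """Return the largest numeric tag name in text, or None if there is none."""
    numbers = [int(n) for n in TAG_NUMBER.findall(text)]
    return max(numbers) if numbers else None


per_row = {id_: highest_tag_number(x) for id_, x in rows.items()}
overall = max(n for n in per_row.values() if n is not None)
print(per_row, overall)
```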

Any option except cursor in this kind of group by?

I have a sample data as:
Johnson; Michael, Surendir;Mishra, Mohan; Ram
Johnson; Michael R.
Mohan; Anaha
Jordan; Michael
Maru; Tushar
The output of the query should be:
Johnson; Michael 2
Mohan; Anaha 1
Jordan; Michael 1
Maru; Tushar 1
Surendir;Mishra 1
Mohan; Ram 1
As you can see, it prints the count of each name separated by , but with a twist. We cannot simply group by the full name, because sometimes the name contains a middle initial and sometimes it does not. E.g. Johnson; Michael and Johnson; Michael R. are counted as a single name and hence their count is 2. Further, either Johnson; Michael or Johnson; Michael R. should appear in the result set with a count of 2 (not both, because that would be a repeated record).
The table contains names separated by , and it is not possible to normalize it, as it is LIVE and given to us by someone else.
Is there any way to write a query for this without using a cursor? I have around 3 million records in my DB and I also have to support pagination etc. What do you think would be the best way to achieve this?
This is why your data should be normalised.
;with cte as
(
select 1 as Item, 1 as Start, CHARINDEX(',',People+',' , 1) as Split,
People+',' as People
from YourHorribleTable
union all
select cte.Item+1, cte.Split+1, nullif(CHARINDEX(',',people, cte.Split+1),0), People
from cte
where cte.Split<>0
)
select Person, COUNT(*)
from
(
select case when nullif(charindex (' ', person, 2+nullif(CHARINDEX(';', person),0)),0) is null then person
else substring(person,1,charindex (' ', person, 2+nullif(CHARINDEX(';', person),0)))
end as Person
from
(
select LTRIM(RTRIM( SUBSTRING(people, start,isnull(split,len(People)+1)-start))) as person
from cte
) v
where person<>''
) v
group by Person
order by COUNT(*) desc
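The logic of the query is: split each row on commas, cut every name after the first given name so that a middle initial is dropped, then count. A compact Python sketch of those steps on the sample data follows. It is only meant to make the steps readable; the helper name normalize is hypothetical, and unlike the query it also normalizes the spacing after the semicolon.

```python
from collections import Counter

rows = [
    "Johnson; Michael, Surendir;Mishra, Mohan; Ram",
    "Johnson; Michael R.",
    "Mohan; Anaha",
    "Jordan; Michael",
    "Maru; Tushar",
]


def normalize(person):
    """Keep the surname and the first given name; drop any middle initial."""
    last, _, rest = person.partition(";")
    given = rest.split()
    first = given[0] if given else ""
    return f"{last.strip()}; {first}"


# Split every row on commas, skip empty pieces, and count the normalized names.
counts = Counter(
    normalize(person)
    for row in rows
    for person in row.split(",")
    if person.strip()
)
print(counts.most_common())
```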