My earlier question was resolved. Now I need to develop a related but more complex query.
I have a table like this:
 id  | description | additional_info
-----+-------------+-----------------
 123 | games       | XYD
 124 | Festivals   | sport swim
And I need to count matches to arrays like this:
array_content varchar[] := '{"Festivals,games","sport,swim"}'
If either of the columns description or additional_info contains any of the comma-separated tags, we count that as 1. So each array element (consisting of multiple tags) can contribute at most 1 to the total count.
The result for the above example should be:
 id | RID | Matches
----+-----+---------
  1 | 123 | 1
  2 | 124 | 2
The answer isn't simple, but figuring out what you are asking was harder:
SELECT row_number() OVER (ORDER BY t.id) AS id
     , t.id AS "RID"
     , count(DISTINCT a.ord) AS "Matches"
FROM   tbl t
LEFT   JOIN (
          unnest(array_content) WITH ORDINALITY x(elem, ord)
          CROSS JOIN LATERAL
          unnest(string_to_array(elem, ',')) txt
       ) a ON t.description ~ a.txt
           OR t.additional_info ~ a.txt
GROUP  BY t.id;
Produces your desired result exactly.
array_content is your array of search terms.
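A minimal setup to try this yourself, with column types assumed from the sample data (the question does not show the DDL):
CREATE TABLE tbl (id int, description text, additional_info text);
INSERT INTO tbl VALUES
  (123, 'games',     'XYD')
, (124, 'Festivals', 'sport swim');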
How does this work?
Each array element of the outer array in your search term is a comma-separated list. Decompose the odd construct by unnesting twice (after transforming each element of the outer array into another array). Example:
SELECT *
FROM   unnest('{"Festivals,games","sport,swim"}'::varchar[]) WITH ORDINALITY x(elem, ord)
CROSS  JOIN LATERAL
       unnest(string_to_array(elem, ',')) txt;
Result:
      elem       | ord |    txt
-----------------+-----+-----------
 Festivals,games |   1 | Festivals
 Festivals,games |   1 | games
 sport,swim      |   2 | sport
 sport,swim      |   2 | swim
Since you want to count matches for each outer array element once, we generate a unique number on the fly with WITH ORDINALITY. Details:
PostgreSQL unnest() with element number
Now we can LEFT JOIN to this derived table on the condition of a desired match:
... ON t.description ~ a.txt
OR t.additional_info ~ a.txt
... and get the count with count(DISTINCT a.ord), counting each outer array element only once even if multiple of its search terms match. For instance, for id 124 both 'sport' and 'swim' match, but they share ord = 2, so together they contribute 1; 'Festivals' matches with ord = 1, for a total of 2.
Finally, I added the mysterious id in your result with row_number() OVER (ORDER BY t.id) AS id, assuming it's supposed to be a serial number. Voilà.
The same considerations for regular expression matches (~) as in your previous question apply:
Postgres query to calculate matching strings
How to generate ids and parent_ids from arrays of categories? The number or depth of subcategories can be anywhere between 1 and 10 levels.
Example PostgreSQL column, datatype character varying array:
 data_column
 character varying[]
---------------------------------
 [root_1, child_1, childchild_1]
 [root_1, child_1, childchild_2]
 [root_2, child_2]
I would like to convert the column of arrays into the table shown below, which I believe is called the Adjacency List Model. I know there are also the Nested Set Model and the Materialised Path Model.
Final output table
id | title | parent_id
------------------------------
1 | root_1 | null
2 | root_2 | null
3 | child_1 | 1
4 | child_2 | 2
5 | childchild_1 | 3
6 | childchild_2 | 3
Final output tree hierarchy
root_1
--child_1
----childchild_1
----childchild_2
root_2
--child_2
step-by-step demo: db<>fiddle
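If you want to try it locally, a minimal setup matching the sample data could look like this (the table name t and the single array column data are what the answer below assumes):
CREATE TABLE t (data varchar[]);
INSERT INTO t VALUES
  ('{root_1,child_1,childchild_1}')
, ('{root_1,child_1,childchild_2}')
, ('{root_2,child_2}');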
You can do this with a recursive CTE:
WITH RECURSIVE cte AS (
    SELECT data[1] as title, 2 as idx, null as parent, data FROM t   -- 1
    UNION
    SELECT data[idx], idx + 1, title, data                           -- 2
    FROM cte
    WHERE idx <= cardinality(data)
)
SELECT DISTINCT   -- 3
    title,
    parent
FROM cte
1. The starting query of the recursion: get all root elements and the data you'll need within the recursion.
2. The recursive part: get the element at the new index and increase the index.
3. After the recursion: query the columns you finally need. The DISTINCT removes duplicates (e.g. root_1 occurring twice).
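For the sample data, the recursion produces rows like these (the data column, which drives the recursion, is omitted here; root_1 and child_1 each appear twice because they occur in two different arrays, which is exactly why the final DISTINCT is needed):
 title        | idx | parent
--------------+-----+---------
 root_1       |   2 | null
 root_1       |   2 | null
 root_2       |   2 | null
 child_1      |   3 | root_1
 child_1      |   3 | root_1
 child_2      |   3 | root_2
 childchild_1 |   4 | child_1
 childchild_2 |   4 | child_1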
This creates the hierarchy. Now you need the ids.
You can generate them in many different ways, for example using the row_number() window function:
WITH RECURSIVE cte AS (...)
SELECT
    *,
    row_number() OVER ()
FROM (
    SELECT DISTINCT
        title,
        parent
    FROM cte
) s
Now every row has its own id. You may want a different order criterion, but without further information there is little to base it on; the algorithm stays the same.
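For example, to make the ids deterministic you could give the window an explicit order. This particular criterion (roots first, then alphabetically by title) is just an assumption and may not reproduce the exact ids shown in the question:
row_number() OVER (ORDER BY parent NULLS FIRST, title)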
With the ids in place, we can self join to look up each parent id via the parent title column. Because the self join repeats the select query, it makes sense to encapsulate it in a second CTE to avoid code duplication. The final result is:
WITH RECURSIVE cte AS (
    SELECT data[1] as title, 2 as idx, null as parent, data FROM t
    UNION
    SELECT data[idx], idx + 1, title, data
    FROM cte
    WHERE idx <= cardinality(data)
), numbered AS (
    SELECT
        *,
        row_number() OVER ()
    FROM (
        SELECT DISTINCT
            title,
            parent
        FROM cte
    ) s
)
SELECT
    n1.row_number as id,
    n1.title,
    n2.row_number as parent_id
FROM numbered n1
LEFT JOIN numbered n2 ON n1.parent = n2.title
How to extract substrings from a column, for filtering and GROUP BY, in an AWS Redshift database?
I have a table with records like:
Table_Id | Categories | Value
<ID> | ABC1; ABC1-1; XYZ | 10
<ID> | ABC1; ABC1-2; XYZ | 15
<ID> | XYZ | 5
.....
Now I want to filter records based on individual categories, like 'ABC1' or 'ABC1 and XYZ'.
Expected output from the query would look like:
Table_Id | Categories | Value
<ID> | ABC1 | 25
<ID> | ABC1-1 | 10
<ID> | ABC1-2 | 15
<ID> | XYZ | 30
.....
So I need to group results by individual category.
If you have at most 3 values in any "categories" cell, you can unnest the cells, get the list of unique values, and use that list in a join condition like this:
WITH
values as (
    select distinct category
    from (
        select distinct split_part(categories,';',1) as category from your_table
        union select distinct split_part(categories,';',2) from your_table
        union select distinct split_part(categories,';',3) from your_table
    ) s                      -- the derived table needs an alias
    where nullif(category,'') is not null
)
SELECT
    t2.category
   ,sum(t1.value)
FROM your_table t1
JOIN values t2
  ON split_part(t1.categories,';',1)=t2.category
  OR split_part(t1.categories,';',2)=t2.category
  OR split_part(t1.categories,';',3)=t2.category
GROUP BY t2.category         -- required by the aggregate
If you have more than 3 values, just add another split_part level, both in the WITH part and in the join condition, as sketched below.
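For example, a fourth level would add one line to each part (a sketch against the same your_table):
union select distinct split_part(categories,';',4) from your_table   -- in the WITH part
OR split_part(t1.categories,';',4)=t2.category                       -- in the join condition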
@JonScott, @AlexYes, and other pals who struggle with similar situations:
I found a better approach than the one suggested by @AlexYes.
What I did was flatten the categories column into individual records, which I can then process further.
Query:
select row_number() over(order by 1) as r1,
       to_char(timestamptz 'epoch' + date_time * interval '1 second', 'yyyy-mm-dd') as day,
       split_part(categories, ';', numbers.n) as catg,
       value
from <TABLE>
join numbers
  on numbers.n <= regexp_count(categories, ';') + 1
<OTHER_CONDITIONS>
Explanation:
Two functions are useful here: first, split_part, which takes a string, splits it on the ';' delimiter, and returns the first, second, ..., nth value from the split string; second, regexp_count, which tells us how many times a particular pattern occurs in the string.
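Note that the query above assumes a helper table numbers holding the integers 1..N, which Redshift does not provide by default. A minimal sketch to build one (the source table and the limit of 100 are arbitrary; any table with at least as many rows as your maximum number of categories works):
create table numbers as
select row_number() over (order by categories) as n
from your_table
limit 100;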
To do this fully dynamically, you need to transpose or pivot the values in the "categories" column into separate rows.
Unfortunately, a "fully dynamic" solution (without knowing the different values beforehand) is NOT possible in Redshift.
Your options are as follows:
1. Use the method suggested by AlexYes in another answer. This is semi-dynamic and is probably your best option.
2. Outside of Redshift, run some ETL code to perform the column-to-multiple-rows transformation.
3. Create a hardcoded solution, and perform the pivot something like this:
select table_id, 'ABC1' as category,
       case when concat(categories,';') ilike '%ABC1;%' then value else 0 end as value
from your_table
union all
select table_id, 'ABC1-1' as category,
       case when concat(categories,';') ilike '%ABC1-1;%' then value else 0 end as value
from your_table
union all
etc
Imagine a table with only one column.
+------+
| v |
+------+
|0.1234|
|0.8923|
|0.5221|
+------+
I want to do the following for row K:
Take row K=1 value: 0.1234
Count how many values in the rest of the table are less than or equal to value in row 1.
Iterate through all rows
Output should be:
+------+-------+
| v |output |
+------+-------+
|0.1234| 0 |
|0.8923| 2 |
|0.5221| 1 |
+------+-------+
Quick update: I was using this approach to compute a statistic at every value of v in the above table. The cross join approach was way too slow for the size of data I was dealing with, so instead I computed my statistic for a grid of v values and then matched them to the v's in the original data. v_table is the data table from before and stat_comp is the statistics table.
... AS   -- abridged: the CREATE TABLE line was cut off in the post
SELECT t1.*
     , CASE WHEN v <= 1.000000 THEN pr_1
            WHEN v <= 2.000000 AND v > 1.000000 THEN pr_2
            -- ... more grid buckets ...
       END
FROM v_table AS t1
LEFT OUTER JOIN stat_comp AS t2 ON ...   -- join condition omitted in the post
Window functions were added to ANSI/ISO SQL in 1999 and to Hive in version 0.11, released on 15 May 2013.
What you are looking for is a variation on rank with ties high, which in ANSI/ISO SQL:2011 would look like this:
rank () over (order by v with ties high) - 1
Hive currently does not support with ties ..., but the logic can be implemented using count(*) over (...):
select v
,count(*) over (order by v) - 1 as rank_with_ties_high_implicit
from mytable
;
or
select v
,count(*) over
(
order by v
range between unbounded preceding and current row
) - 1 as rank_with_ties_high_explicit
from mytable
;
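Applied to the sample data, either variant returns the desired output (0.1234 has no other values at or below it, and so on):
 v      | rank_with_ties_high
--------+---------------------
 0.1234 | 0
 0.5221 | 1
 0.8923 | 2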
Generate sample data
select 0.1234 as v into #t
union all
select 0.8923
union all
select 0.5221
This is the query:
;with ct as (
    select ROW_NUMBER() over (order by v) rn
         , v
    from #t ot
)
select distinct v, a.cnt
from ct ot
outer apply (
    select count(*) cnt
    from ct
    where ct.rn <> ot.rn
      and ct.v <= ot.v
) a
After seeing your edits, it really does look like you could use a Cartesian product, i.e. a CROSS JOIN, here. I called your table foo and cross joined it to itself as bar:
SELECT foo.v, COUNT(foo.v) - 1 AS output
FROM foo
CROSS JOIN foo bar
WHERE foo.v >= bar.v
GROUP BY foo.v;
Here's a fiddle.
This query cross joins the table to itself so that every pairing of the column's values is returned (you can see this yourself by removing the COUNT and GROUP BY and adding bar.v to the SELECT). The WHERE clause keeps the pairs where foo.v >= bar.v, the COUNT tallies them per value, and the - 1 removes each row's match with itself, yielding the final result.
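For reference, the intermediate query the paragraph describes, against the same assumed foo table:
SELECT foo.v, bar.v AS bar_v
FROM foo
CROSS JOIN foo bar
WHERE foo.v >= bar.v;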
You can take the full Cartesian product of the table with itself and sum a case statement:
select a.x
, sum(case when b.x < a.x then 1 else 0 end) as count_less_than_x
from (select distinct x from T) a
, T b
group by a.x
This will give you one row per unique value in the table, with the count of all rows (duplicates included) whose value is less than that value.
Notice that there is neither a join condition nor a where clause. In this case, we actually want that: for each row of a we get every row of b, the full copy of the table. We can then check each pair to see whether b.x is less than a.x. If it is, we add 1 to the count; if not, we add 0.
Hi, I am just learning databases and practicing my skills on the table shown below:
id | name | wins | matches
-----+-------------------+------+---------
205 | Twilight Sparkle | 0 | 0
206 | Fluttershy | 0 | 0
207 | Applejack | 0 | 0
208 | Pinkie Pie | 0 | 0
209 | Rarity | 0 | 0
210 | Rainbow Dash | 0 | 0
211 | Princess Celestia | 0 | 0
212 | Princess Luna | 0 | 0
My job here is to return a list of pairs of players for the next round of a match.
Assuming that an even number of players is registered, each player
appears exactly once in the pairings. Each player is paired with another
player with an equal or nearly equal win record, that is, a player adjacent to him or her in the standings.
Returns:
A list of tuples, each of which contains (id1, name1, id2, name2)
id1: the first player's unique id
name1: the first player's name
id2: the second player's unique id
name2: the second player's name
To achieve this, I self joined the table and wrote something like this:
SELECT a.id, a.name, b.id, b.name
FROM results AS a, results AS b
WHERE a.id > b.id and a.wins = b.wins
LIMIT COUNT(a.id)/2;
It doesn't seem to work. Please help me figure out what's wrong.
Thanks.
You can sequence the players by their wins, then join on the sequence, so each pair has the same or next-closest win count:
WITH seq_results AS
(
SELECT
id,
name,
ROW_NUMBER() OVER(ORDER BY wins DESC) AS seq
FROM
results
)
SELECT
r1.id,
r1.name,
r2.id,
r2.name
FROM
seq_results r1
JOIN
seq_results r2
ON (r1.seq = (r2.seq - 1))
AND (r2.seq % 2 = 0);
Per your request, here is some information on how this works. I highly recommend that you visit the PostgreSQL documentation; it really is some of the best documentation out there: http://www.postgresql.org/docs/current/static/
The first part is a common table expression (CTE). It lets me essentially create an in-memory table for use in subsequent queries. You could just as easily create a temp table, but a CTE doesn't have to be dropped, etc.
See: http://www.postgresql.org/docs/current/static/queries-with.html
WITH seq_results AS
(
SELECT
id,
name,
ROW_NUMBER() OVER(ORDER BY wins DESC) AS seq
FROM
results
)
In this CTE, I am numbering each record sequentially using a window function. I will use these numbers later in my join. See: http://www.postgresql.org/docs/current/static/functions-window.html
SELECT
r1.id,
r1.name,
r2.id,
r2.name
FROM
seq_results r1
JOIN
seq_results r2
ON (r1.seq = (r2.seq - 1))
AND (r2.seq % 2 = 0);
Above I am joining the CTE to itself using the sequence. I "offset" the sequence of the second instance of the CTE r2 by -1, essentially joining two sequential records together.
Had I only specified that condition in the join, I would return more than the 4 records expected. I needed to make sure that the ids and names on the "left" are not also on the "right", so I decided to include only the odd-numbered sequenced records on the left and the evens on the right. To do this, I used the modulus operator % to ensure that r2 only returned records where the sequence was even.
Lastly, because the join was an inner join (JOIN is the same as INNER JOIN), any even-numbered sequences in r1 are not returned.
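For illustration, here is one possible result for the sample table. Since every player has 0 wins, the order among the ties (and therefore the exact pairing) is arbitrary; what is guaranteed is that adjacent seeds are paired:
 id1 | name1             | id2 | name2
-----+-------------------+-----+----------------
 205 | Twilight Sparkle  | 206 | Fluttershy
 207 | Applejack         | 208 | Pinkie Pie
 209 | Rarity            | 210 | Rainbow Dash
 211 | Princess Celestia | 212 | Princess Luna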
I have data as follows:
ID Name Data
1 Joe ["Mary","Joe"]
2 Mary ["Sarah","Mary","Mary"]
3 Bill ["Bill","Joe"]
4 James ["James","James","James"]
I want to write a query that selects the LAST element from the array, which does not equal the Name field. For example, I want the query to return the following results:
ID Name Last
1 Joe Mary
2 Mary Sarah
3 Bill Joe
4 James (NULL)
I am getting close - I can select the last element with the following query:
SELECT ID, Name,
(Data::json->(json_array_length(Data::json)-1))::text AS Last
FROM table;
ID Name Last
1 Joe Joe
2 Mary Mary
3 Bill Joe
4 James James
However, I need one more level: evaluate the last element and, if it is the same as the name field, try the next-to-last element, and so on.
Any help or pointers would be most appreciated!
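For reference in the answers below, a minimal setup matching the sample data might be (the json column type is an assumption; the question only shows the values):
CREATE TABLE tbl (id int, name text, data json);
INSERT INTO tbl VALUES
  (1, 'Joe',   '["Mary","Joe"]')
, (2, 'Mary',  '["Sarah","Mary","Mary"]')
, (3, 'Bill',  '["Bill","Joe"]')
, (4, 'James', '["James","James","James"]');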
json in Postgres 9.3
This is hard in pg 9.3, because useful functionality is missing.
Method 1
Unnest in a LEFT JOIN LATERAL (clean and standard-conforming), then trim the double quotes from the json after casting to text. See links below.
SELECT DISTINCT ON (1)
       t.id, t.name, d.last
FROM   tbl t
LEFT   JOIN LATERAL (
          SELECT ('[' || d::text || ']')::json->>0 AS last
          FROM   json_array_elements(t.data) d
       ) d ON d.last <> t.name
ORDER  BY 1, row_number() OVER () DESC;
While this works, and I have never seen it fail, the order of unnested elements depends on undocumented behavior. See links below!
Improved the conversion from json to text with the expression provided by @pozs in the comments. Still hackish, but it should be safe.
Method 2
SELECT DISTINCT ON (1)
       id, name, NULLIF(last, name) AS last
FROM  (
       SELECT t.id, t.name
            , ('[' || json_array_elements(t.data)::text || ']')::json->>0 AS last
            , row_number() OVER () AS rn
       FROM   tbl t
      ) sub
ORDER  BY 1, (last = name), rn DESC;
Unnest in the SELECT list (non-standard).
Attach row number (rn) in parallel (more reliable).
Convert to text like above.
The expression (last = name) in the ORDER BY clause sorts matching names last (but before NULL). So a matching name is only selected if no other name is available. Last link below.
In the SELECT list, NULLIF replaces a matching name with NULL, arriving at the same result as above.
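To illustrate NULLIF (standard SQL behavior, nothing specific to this query):
SELECT NULLIF('James', 'James');  -- NULL: both arguments are equal
SELECT NULLIF('Sarah', 'James');  -- 'Sarah': arguments differ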
SQL Fiddle.
json or jsonb in Postgres 9.4
pg 9.4 ships all the necessary improvements:
SELECT DISTINCT ON (1)
       t.id, t.name, d.last
FROM   tbl t
LEFT   JOIN LATERAL json_array_elements_text(data) WITH ORDINALITY d(last, rn)
       ON d.last <> t.name
ORDER  BY 1, d.rn DESC;
Use jsonb_array_elements_text() for jsonb. All else the same.
json / jsonb functions in the manual
Related answers with more explanation:
How to turn json array into postgres array?
PostgreSQL unnest() with element number
Index for finding an element in a JSON array
Time based priority in Active Record Query
Select first row in each GROUP BY group?