How to find articulation points in a graph using SQL

I'm trying to write a Postgres function that returns every articulation point of an undirected graph, but I can't figure out how to translate the algorithm into relational terms. For example, if the graph is
select * from graph;
 source | target
--------+--------
      1 |      2
      2 |      1
      1 |      3
      3 |      1
      2 |      3
      3 |      2
      2 |      4
      4 |      2
      2 |      5
      5 |      2
      4 |      5
      5 |      4
(12 rows)
then the result should be
select articulation_point();
 articulation_point
--------------------
                  2
(1 row)
But I have no idea how to go about this. I've read some articles on how to do it in a procedural language like Python, but I don't know how to approach it in Postgres.
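One brute-force approach that stays in plain SQL: a vertex is an articulation point exactly when removing it disconnects the remaining vertices, and connectivity can be checked with a recursive CTE, one traversal per candidate vertex. A minimal sketch, assuming the graph table stores both directions of every edge (as in the sample) and integer vertex ids; the function name is my own:
create or replace function articulation_points()
returns setof integer
language sql as $$
with recursive
vertices as (
    -- each edge is stored in both directions, so the distinct
    -- sources already cover the whole vertex set
    select distinct source as v from graph
),
reach as (
    -- for each candidate c, walk the graph from an arbitrary
    -- surviving vertex, never stepping onto c itself
    select c.v as removed,
           (select min(v) from vertices where v <> c.v) as node
    from vertices c
    union
    select r.removed, g.target
    from reach r
    join graph g on g.source = r.node
    where g.target <> r.removed
)
-- c is an articulation point if the walk misses a surviving vertex
select removed
from reach
group by removed
having count(distinct node) < (select count(*) from vertices) - 1
$$;
On the sample data this returns just 2. The cost is one full traversal per vertex, which is fine for small graphs; for large ones, a linear-time DFS (Tarjan's algorithm) written in PL/pgSQL would be the next step.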

Related

Best database design to find relationships between two persons

I want to find relationships between two persons using a database. For example, I have a database like this:
Person:
Id | Name
---+---------
 1 | Edvard
 2 | Ivan
 3 | Molly
 4 | Julian
 5 | Emily
 6 | Katarina
Relationship:
Id | Type
---+-------------------
 1 | Parent
 2 | Husband\Wife
 3 | ex-Husband\ex-Wife
Relationships:
Id | Person_1_Id | Person_2_Id | Relation_Id
---+-------------+-------------+------------
 1 |           1 |           3 |           2
 2 |           3 |           4 |           3
 3 |           3 |           2 |           1
 4 |           4 |           2 |           1
 5 |           1 |           6 |           3
 6 |           1 |           5 |           1
 7 |           6 |           5 |           1
What's the best way to find the relationship between Person 2 and Person 5? This example is small, but what if there were 5 families, or 10,000? I think that with too many families it becomes necessary to introduce the concept of depth. Would it be better to change the database design, perhaps to model it as a tree or graph? Any ideas on how to solve this problem differently?
As soon as you get above a handful of nodes and a few relationships between them, this becomes a very complex problem: there are whole branches of mathematics built around this type of challenge and how long it takes to compute a result.
For any non-trivial set of nodes/relationships you are going to need to look at deploying a graph database, e.g. Neo4j.
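That said, if you stay relational, a recursive CTE can walk the Relationships table between two people, with the depth cap the question anticipates. A minimal sketch in PostgreSQL syntax (MySQL 8+ is similar, minus the array type), assuming the tables above; the start/target ids and the depth limit of 6 are arbitrary examples:
with recursive edges as (
    -- rows are stored one way; make them symmetric
    select person_1_id as a, person_2_id as b from relationships
    union
    select person_2_id, person_1_id from relationships
),
walk as (
    select b as person, array[a, b] as chain, 1 as depth
    from edges
    where a = 2                              -- start: Person 2
    union all
    select e.b, w.chain || e.b, w.depth + 1
    from walk w
    join edges e on e.a = w.person
    where e.b <> all (w.chain)               -- cycle guard: no revisiting
      and w.depth < 6                        -- cap the search depth
)
select chain from walk where person = 5;     -- target: Person 5
On the sample data this finds, among others, the chain {2,3,1,5}. For deep or dense family graphs, though, the answer above stands: a graph database handles this kind of traversal natively.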

influxdb/SQL get field count

I have an influxdb table, let's call it my_table.
my_table is structured like this (simplified):
+------+----+----+
| Time | m1 | m2 |
+======+====+====+
|    1 |  8 |  4 |
+------+----+----+
|    2 |  1 | 12 |
+------+----+----+
|    3 |  6 | 18 |
+------+----+----+
|    4 |  4 |  1 |
+------+----+----+
I was wondering whether it is possible to find out, for each time, how many of the metrics are larger than a certain (dynamic) threshold.
So let's say I want to know how many of the metrics (columns) are higher than 5;
I would want to do something like this:
select fieldcount(/m*/) from my_table where /m*/ > 5
Returning:
1
1
2
0
I am relatively restricted in structuring the database, as I'm using the diamond collector (Python), which takes care of all data collection for me and flushes it to my influxdb without me specifying what the tables should look like.
EDIT
I am aware of a possible solution if I hardcode the threshold and add a third field named mGreaterThan5:
+------+----+----+---------------+
| Time | m1 | m2 | mGreaterThan5 |
+======+====+====+===============+
|    1 |  8 |  4 |             1 |
+------+----+----+---------------+
|    2 |  1 | 12 |             1 |
+------+----+----+---------------+
|    3 |  6 | 18 |             2 |
+------+----+----+---------------+
|    4 |  4 |  1 |             0 |
+------+----+----+---------------+
However, this means that I can't easily change the threshold to 6 or any other number, so I would prefer a better solution if there is one.
EDIT2
Another, similar problem occurs when trying to retrieve the highest x metrics, e.g.:
On Jan 1st, what were the highest 3 values of m? Given the table:
+------+----+----+----+----+----+----+
| Time | m1 | m2 | m3 | m4 | m5 | m6 |
+======+====+====+====+====+====+====+
| 1/1  |  8 |  4 |  1 |  7 |  2 |  0 |
+------+----+----+----+----+----+----+
Am I screwed if I keep the table structured this way?
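Classic InfluxQL has no cross-field expression of this kind, which is the heart of the problem. Purely for contrast, and as an assumption-laden sketch rather than a fix: in a relational store such as PostgreSQL, with a table my_table(time, m1, m2), the per-row count over a dynamic threshold is a one-liner:
select time,
       (m1 > 5)::int + (m2 > 5)::int as fields_over_threshold  -- the threshold is a plain value
from my_table
order by time;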

Most efficient way to query a word & synonym table

I have a WORDTB table with words and their synonyms: ID, WORD1, WORD2, WORD3, WORD4, WORD5. The words in each row are arranged according to their frequency. Given any word, I want to query and retrieve its most frequent synonym, which is the word in the WORD1 column.
This is the query I tried; it works fine, but I think it is inefficient.
SELECT WORD1
FROM WORDTB
WHERE WORD1='xxxx'
OR WORD2='xxxx'
OR WORD3='xxxx'
OR WORD4='xxxx'
OR WORD5='xxxx'
Can anyone suggest a more efficient way of doing this?
A more scalable solution would be to use a single row for each word.
synonym_words(word_id, synonym_id, word, popularity)
Fields:
word_id: The primary key for a word.
synonym_id: The word_id of the first synonym word.
word: The synonym text.
popularity: The sort order for the list of synonyms, 1 being the most popular.
Sample table data:
word_id | synonym_id | word       | popularity
========+============+============+===========
      1 |          1 | start      |          1
      2 |          1 | begin      |          2
      3 |          1 | originate  |          3
      4 |          1 | initiate   |          4
      5 |          1 | commence   |          5
      6 |          1 | create     |          6
      7 |          1 | startle    |          7
      8 |          1 | leave      |          8
      9 |          9 | end        |          1
     10 |          9 | ending     |          2
     11 |          9 | last       |          3
     12 |          9 | goal       |          4
     13 |          9 | death      |          5
     14 |          9 | conclusion |          6
     15 |          9 | close      |          7
     16 |          9 | closing    |          8
Assuming the words themselves will not change but their popularity may over time, the query should not break when the popularity order changes and a different synonym becomes the most popular one for a word. You want the query to return the most popular word (popularity = 1) that shares the same synonym_id as the word used in the search.
SQL query:
SELECT word FROM synonym_words
WHERE synonym_id = (SELECT synonym_id FROM synonym_words WHERE word = 'conclusion')
AND popularity = 1
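A hedged footnote, not part of the original answer: for this to stay fast at scale, both lookups want index support. Something like the following (the index names are my own):
-- speeds up the inner lookup by word text
create index idx_synonym_words_word on synonym_words (word);
-- speeds up the outer lookup by group and popularity
create index idx_synonym_words_rank on synonym_words (synonym_id, popularity);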

How to count columns where values differ?

I have a large table and I need to check for similar rows. I don't need all column values to be the same, just similar. The rows must not be "distant" (determined by a query over another table), no value may be too different (I have already written the queries for these conditions), and most of the other values must be the same. I have to expect some ambiguity, so one or two different values shouldn't break the "similarity". (I could get better performance by accepting only completely equal rows, but that simplification could cause errors; I will offer it as an option.)
The way I was going to solve this is through PL/pgSQL: a FOR loop iterating through the results of the previous queries. For each column there is an IF testing whether it differs; if so, I increment a difference counter and go on. At the end of each loop I compare the counter to a threshold and decide whether to keep the row as "similar" or not.
Such a PL/pgSQL-heavy approach seems slow compared to a pure SQL query, or to an SQL query with some PL/pgSQL functions involved. It would be easy to test for rows with all but X columns equivalent if I knew in advance which columns might differ, but the difference can occur in any of some 40 columns. Is there any way to solve this with a single query? If not, is there any faster way than examining all the rows?
EDIT: I mentioned a table; in fact it is a group of six tables linked by 1:1 relationships. I don't feel like explaining what is what; that's a different question, and extrapolating from one table to my situation is easy for me. So I simplified it (but did not oversimplify it: it should demonstrate all the difficulties I have there) and made an example demonstrating what I need. NULL compared with anything should count as "different". No need to write a script testing it all; I just need to find out whether this can be done in any way more efficient than the one I thought of.
The point is that I don't need to count rows (as usual), but columns.
EDIT2: previous fiddle - it wasn't that short, so I leave it here just for archival reasons.
EDIT3: simplified example here - just NOT NULL integers, preprocessing omitted. Current state of data:
select * from foo;
 id | bar1 | bar2 | bar3 | bar4 | bar5
----+------+------+------+------+------
  1 |    4 |    2 |    3 |    4 |   11
  2 |    4 |    2 |    4 |    3 |   11
  3 |    6 |    3 |    3 |    5 |   13
When I run select similar_records(1);, I should get only row 2 (two columns with different values, which is within the limit), not row 3 (four different values, outside the limit of at most two differences).
To find rows that only differ on a given maximum number of columns:
WITH cte AS (
   SELECT id
        , unnest(ARRAY['bar1', 'bar2', 'bar3', 'bar4', 'bar5']) AS col  -- more
        , unnest(ARRAY[bar1::text, bar2::text, bar3::text
                     , bar4::text, bar5::text]) AS val                  -- more
   FROM   foo
   )
SELECT b.id, count(a.val <> b.val OR NULL) AS cols_different
FROM  (SELECT * FROM cte WHERE id = 1)  a
JOIN  (SELECT * FROM cte WHERE id <> 1) b USING (col)
GROUP  BY b.id
HAVING count(a.val <> b.val OR NULL) < 3  -- max. diffs allowed
ORDER  BY 2;
I ignored all the other distracting details in your question.
Demonstrating with 5 columns. Add more as required.
If columns can be NULL you may want to use IS DISTINCT FROM instead of <>.
This is using the somewhat unorthodox, but handy, parallel unnest(). Both arrays must have the same number of elements for it to work. Details:
Is there something like a zip() function in PostgreSQL that combines two arrays?
SQL Fiddle (building on yours).
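On the sample data this yields exactly what the question asks for: row 2 differs in two columns (bar3 and bar4) and passes, while row 3 differs in four and is filtered out:
 id | cols_different
----+----------------
  2 |              2
(1 row)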
Instead of a loop comparing each row to all the others, do a self join:
select f0.id, f1.id
from foo f0 inner join foo f1 on f0.id < f1.id
where
    f0.bar1 = f1.bar1 and f0.bar2 = f1.bar2
and
    abs(f0.bar3 - f1.bar3) <= 1
and
    f0.bar4 = f1.bar4 and f0.bar5 = f1.bar5
or
    f0.bar4 = f1.bar5 and f0.bar5 = f1.bar4
and
    abs(f0.bar6 - f1.bar6) <= 2
and
    f0.bar7 is not null and f1.bar7 is not null and abs(f0.bar7 - f1.bar7) <= 5
or
    f0.bar7 is null and f1.bar7 <= 3
or
    f1.bar7 is null and f0.bar7 <= 3
and
    f0.bar8 = f1.bar8
and
    abs(f0.bar11 - f1.bar11) <= 5
;
 id | id
----+----
  1 |  4
  1 |  5
  4 |  5
(3 rows)
select * from foo;
 id | bar1 | bar2 | bar3 | bar4 | bar5 | bar6 | bar7 | bar8 | bar9 | bar10 | bar11
----+------+------+------+------+------+------+------+------+------+-------+-------
  1 | abc  |    4 |    2 |    3 |    4 |   11 |    7 | t    | t    | f     |  42.1
  2 | abc  |    5 |    1 |    6 |    2 |    8 |   39 | t    | t    | t     |  19.6
  3 | xyz  |    4 |    2 |    3 |    5 |   14 |   82 | t    | f    |       |    95
  4 | abc  |    4 |    2 |    4 |    3 |   11 |    7 | t    | t    | f     |  42.1
  5 | abc  |    4 |    2 |    3 |    4 |   13 |    6 | t    | t    |       |  37.7
Are you aware that the AND operator has priority over OR? I'm asking because it looks like the WHERE clause in your query is not what you want: as written, it is enough for f0.bar7 is null and f1.bar7 <= 3 to be true for a pair to be included.
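To make that concrete (my own illustration, not part of the thread): AND binds tighter than OR, which a one-line query demonstrates:
select (false and true or true)   as default_precedence,  -- (false and true) or true => true
       (false and (true or true)) as parenthesized;       -- false and (true or true) => false
So in the query above, the three bar7 alternatives would need to be wrapped together in one parenthesized group, and ( ... or ... or ... ), for the remaining conditions to apply to every pair.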

How to get top 3 frequencies in MySQL?

In MySQL I have a table called "meanings" with three columns:
"person" (int),
"word" (byte, 16 possible values),
"meaning" (byte, 26 possible values).
A person assigns one or more meanings to each word:
person  word  meaning
---------------------
     1     1        4
     1     2       19
     1     2        7   <-- Note: second meaning for word 2
     1     3        5
   ...
     1    16        2
Then another person, and so on. There will be thousands of persons.
I need to find for each of the 16 words the top three meanings (with their frequencies). Something like:
+--------+-----------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+-----------------+------------------+-----------------+
| 1 | meaning 5 (35%) | meaning 19 (22%) | meaning 2 (13%) |
| 2 | meaning 8 (57%) | meaning 1 (18%) | meaning 22 (7%) |
+--------+-----------------+------------------+-----------------+
...
Is it possible to solve this with a single MySQL query?
Well, if you group by word and meaning, you can easily get the % of people who use each word/meaning combination out of the dataset.
In order to limit the number of meanings returned for each word, you will need to create some sort of filter per word/meaning combination.
It seems like you just want the answer to your homework, so I won't post more than this, but it should be enough to get you on the right track.
Of course you can do
SELECT * FROM words WHERE word = 2 ORDER BY meaning DESC LIMIT 3
But this is cheating, since you need to create a loop.
I'm working on a better solution.
I believe the problem I had a while ago looks similar. I ended up with the @counter thing.
Note about the problem
Let's suppose there is only one person, who says:
+--------+------+---------+
| Person | Word | Meaning |
+--------+------+---------+
|      1 |    1 |       7 |
|      1 |    1 |       3 |
|      1 |    2 |       8 |
+--------+------+---------+
The report should read:
+--------+------------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+------------------+------------------+-----------------+
| 1 | meaning 7 (100%) | meaning 3 (100%) | NULL |
| 2 | meaning 8 (100%) | NULL | NULL |
+--------+------------------+------------------+-----------------+
The following is not OK (50% frequency is absurd in a population of one person):
+--------+------------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+------------------+------------------+-----------------+
| 1 | meaning 7 (50%) | meaning 3 (50%) | NULL |
| 2 | meaning 8 (100%) | NULL | NULL |
+--------+------------------+------------------+-----------------+
The intended meaning of the frequencies is: "How many people think this meaning corresponds to that word?"
So it's not merely about counting "cases", but about counting persons in the table.
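A hedged sketch of a single query for MySQL 8+ (window functions; older MySQL would need the user-variable counter trick mentioned above). It assumes the meanings(person, word, meaning) table and counts distinct persons, so one person assigning two meanings to a word yields 100% for each, as the note requires:
SELECT word, meaning, pct
FROM (
    SELECT g.word,
           g.meaning,
           -- share of all persons who assigned this meaning to this word
           ROUND(100 * g.n / t.total, 1) AS pct,
           ROW_NUMBER() OVER (PARTITION BY g.word ORDER BY g.n DESC) AS rnk
    FROM (SELECT word, meaning, COUNT(DISTINCT person) AS n
          FROM meanings
          GROUP BY word, meaning) AS g
    CROSS JOIN (SELECT COUNT(DISTINCT person) AS total FROM meanings) AS t
) AS ranked
WHERE rnk <= 3           -- keep the top three meanings per word
ORDER BY word, rnk;
Pivoting the three rows per word into 1st/2nd/3rd columns can then be done with conditional aggregation, but the ranking above is the essential part.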