Most efficient way to query a word & synonym table - sql

I have a WORDTB table with words and their synonyms: ID, WORD1, WORD2, WORD3, WORD4, WORD5. Within each row the words are ordered by frequency, so WORD1 holds the most frequent one. Given any word, I want to retrieve the most frequent synonym of that word, i.e. the value in the WORD1 column of its row.
This is the query I tried and it works fine, but I think this is inefficient.
SELECT WORD1
FROM WORDTB
WHERE WORD1='xxxx'
OR WORD2='xxxx'
OR WORD3='xxxx'
OR WORD4='xxxx'
OR WORD5='xxxx'
Can anyone suggest a more efficient way of doing this?

A more scalable solution would be to use a single row for each word.
synonym_words(word_id, synonym_id, word, popularity)
Fields:
word_id: The primary key for a word.
synonym_id: The word_id of the first synonym word.
word: The synonym text.
popularity: The sort order for the list of synonyms, 1 being the most popular.
Sample table data:
word_id | synonym_id | word | popularity
==============================================
1 | 1 | start | 1
2 | 1 | begin | 2
3 | 1 | originate | 3
4 | 1 | initiate | 4
5 | 1 | commence | 5
6 | 1 | create | 6
7 | 1 | startle | 7
8 | 1 | leave | 8
9 | 9 | end | 1
10 | 9 | ending | 2
11 | 9 | last | 3
12 | 9 | goal | 4
13 | 9 | death | 5
14 | 9 | conclusion | 6
15 | 9 | close | 7
16 | 9 | closing | 8
Assuming that the words will not change but their popularity may over time, the query should not break if you were to change the popularity order of the words so that the most popular synonym for a word was changed. You want your query to return the most popular word (popularity = 1) which shares the same synonym_id as the word used in the search.
SQL query:
SELECT word FROM synonym_words
WHERE synonym_id = (SELECT synonym_id FROM synonym_words WHERE word = 'conclusion')
AND popularity = 1
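As a rough check of the single-row-per-word layout, here is a minimal sketch using Python's built-in sqlite3 module. The choice of SQLite, the index names, and the helper most_popular_synonym are assumptions made for illustration; only the table layout and the query come from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE synonym_words ("
    " word_id INTEGER PRIMARY KEY,"
    " synonym_id INTEGER NOT NULL,"
    " word TEXT NOT NULL,"
    " popularity INTEGER NOT NULL)"
)
# Hypothetical indexes: one serves the inner lookup by word,
# the other serves the outer lookup by synonym group and rank.
conn.execute("CREATE INDEX idx_word ON synonym_words (word)")
conn.execute("CREATE INDEX idx_group ON synonym_words (synonym_id, popularity)")

# A subset of the sample rows from the answer.
rows = [
    (1, 1, "start", 1), (2, 1, "begin", 2), (3, 1, "originate", 3),
    (9, 9, "end", 1), (10, 9, "ending", 2), (14, 9, "conclusion", 6),
]
conn.executemany("INSERT INTO synonym_words VALUES (?, ?, ?, ?)", rows)


def most_popular_synonym(conn, word):
    """Return the popularity-1 word of the same synonym group, or None."""
    row = conn.execute(
        "SELECT word FROM synonym_words"
        " WHERE synonym_id = (SELECT synonym_id FROM synonym_words WHERE word = ?)"
        " AND popularity = 1",
        (word,),
    ).fetchone()
    return row[0] if row else None


print(most_popular_synonym(conn, "conclusion"))  # prints end
```

Passing the word as a bound parameter rather than pasting it into the SQL text keeps the lookup safe for arbitrary input.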


Filtering records not containing numbers

I have a table that stores numbers in string format. Ideally each value should be a 10-digit number stored as a string, but the table has many junk values. I want to select the records that do not fit this format.
Below is the sample table that I have:
+---------------+--------+----------------------------------+
| ID_UID | Length | ##Comment |
+---------------+--------+----------------------------------+
| +112323456705 | 13 | Contains special character |
| 4323456432 | 11 | Contains blank |
| 3423122334 | 10 | As expected, 10 character number |
| 6758439239 | 10 | As expected, 10 character number |
| 58_4323129 | 10 | Contains special character |
| 4567$%6790 | 10 | Contains special character |
| 45684938901 | 11 | Is 11 characters |
| 4568 38901 | 10 | Contains blank |
+---------------+--------+----------------------------------+
Expected Output:
+---------------+--------+----------------------------+
| ID_UID | Length | ##Comment |
+---------------+--------+----------------------------+
| +112323456705 | 13 | Contains special character |
| 4323456432 | 11 | Contains blank |
| 58_4323129 | 10 | Contains special character |
| 4567$%6790 | 10 | Contains special character |
| 45684938901 | 11 | Is 11 characters |
| 4568 38901 | 10 | Contains blank |
+---------------+--------+----------------------------+
Basically I want all the records whose ID_UID is not a 10-digit number.
I have tried out below query:
SELECT *
FROM t1
WHERE ID_UID LIKE '%[^0-9]%'
But this does not return any records.
I have created a fiddle for the same.
P.S. The columns length and ##Comment are illustrative in nature.
You want RLIKE not LIKE:
SELECT *
FROM t1
WHERE ID_UID RLIKE '[^0-9]'
Note that % is a LIKE wildcard, not a regular expression wildcard. Also, regular expressions match the pattern anywhere it occurs, so no wildcards are needed for the beginning and end of the string.
If you want to find values that are not ten digits, then be explicit:
SELECT *
FROM t1
WHERE ID_UID NOT RLIKE '^[0-9]{10}$'
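The difference between the two patterns can be sketched outside the database with Python's re module. This only mirrors the regular-expression logic, not MySQL itself, and the position of the blank in the second sample value is an assumption, since the table above does not show it.

```python
import re

values = [
    "+112323456705",
    " 4323456432",   # assumed leading blank
    "3423122334",
    "58_4323129",
    "45684938901",
    "4568 38901",
]

# Counterpart of RLIKE '[^0-9]': a non-digit occurs somewhere in the value.
has_non_digit = [v for v in values if re.search(r"[^0-9]", v)]

# Counterpart of NOT RLIKE '^[0-9]{10}$': the value is not exactly ten digits.
not_ten_digits = [v for v in values if not re.fullmatch(r"[0-9]{10}", v)]

print(has_non_digit)
print(not_ten_digits)
```

The 11-digit value 45684938901 contains no non-digit character, so only the second, explicit pattern reports it.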

Specifying condition operator (AND/OR) for a column based on another column value in SQL

I have a recipe table with a many-to-many to a recipe_filter table. Here's some sample data:
recipe:
id | name
----+-----------
1 | test 2019
12 | slug-14
8 | dfadsfd
6 | test 4
4 | test 2
11 | slug-11
10 | Testology
13 | slug-15
5 | test 3
14 | slug-16
(10 rows)
recipe_filter_join:
recipeId | recipeFilterId
----------+----------------
1 | 1
2 | 2
3 | 3
4 | 1
6 | 5
7 | 6
8 | 4
9 | 7
6 | 8
14 | 9
14 | 4
5 | 9
5 | 38
filter:
id | slug | name | label
----+----------------------+-------------+----------------
2 | fdsfa | fdsfa | Category
3 | dsfds | dsfds | Category
6 | fdsaf | fdsaf | Category
7 | dfad | dfad | Category
8 | product-spice-2 | Spice #2 | Product
9 | product-spice-3 | Spice #3 | Product
5 | product-spice-4 | Spice #4 | Product
4 | product-spice-5 | Spice #5 | Product
1 | product-spice-6 | Spice #6 | Product
10 | product-spice-1 | Spice #1 | Product
40 | diet-halal | Halal | Diet
38 | diet-keto | Keto | Diet
41 | diet-gluten-free | Gluten free | Diet
37 | diet-vegan | Vegan | Diet
39 | diet-diabetic | Diabetic | Diet
42 | cooking-method-bake | Bake | Cooking method
43 | cooking-method-fry | Fry | Cooking method
44 | cooking-method-steam | Steam | Cooking method
45 | cooking-method-roast | Roast | Cooking method
(19 rows)
The input to my query is a list of filters.slugs for example product-spice-1, product-spice-5, cooking-method-fry, cooking-method-steam.
For the above example, I want to write a query that gets all recipes where the filter slug is (product-spice-1 or product-spice-5) and (cooking-method-fry or cooking-method-steam).
How do I create a generic query from the example above?
Update: In case it's not clear, for the list of filters given, I want to group them based on label and apply an OR between group members and an AND condition for other groups, if that makes any sense.
You want to INTERSECT two queries
SELECT
rfj."recipeId"
FROM recipe_filter_join rfj
JOIN filter ON filter.id = rfj."recipeFilterId"
WHERE filter.slug IN ('product-spice-1','product-spice-5')
INTERSECT
SELECT
rfj."recipeId"
FROM recipe_filter_join rfj
JOIN filter ON filter.id = rfj."recipeFilterId"
WHERE filter.slug IN ('cooking-method-fry', 'cooking-method-steam')
And this is quite generalizable. As you can see, the only difference between the two parts is in the WHERE clause. If you have other conditions on Diet or Category, you could generate the appropriate query string with the variation on filter and join the parts with INTERSECT as the separator in your programming language of choice.
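Generating the query string could look roughly like the following Python sketch. The function name build_intersect_query and the use of ? placeholders are assumptions (PostgreSQL drivers such as psycopg use %s instead); the SELECT text is taken from the answer.

```python
def build_intersect_query(slug_groups):
    """Build one SELECT per group of slugs and join them with INTERSECT.

    Returns (sql, params). The slugs travel as bound parameters,
    so they never become part of the SQL text.
    """
    parts, params = [], []
    for slugs in slug_groups:
        placeholders = ", ".join("?" for _ in slugs)
        parts.append(
            'SELECT rfj."recipeId" FROM recipe_filter_join rfj'
            ' JOIN filter ON filter.id = rfj."recipeFilterId"'
            f" WHERE filter.slug IN ({placeholders})"
        )
        params.extend(slugs)
    return "\nINTERSECT\n".join(parts), params


sql, params = build_intersect_query([
    ["product-spice-1", "product-spice-5"],
    ["cooking-method-fry", "cooking-method-steam"],
])
print(sql)
print(params)
```

Each inner list is one label group: slugs inside a group are combined with OR through IN, and the groups are combined with AND through INTERSECT.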
I want to group them based on label and apply an OR between group members and an AND condition for other groups.
If you would prefer to have your application code call the query with just a list of slugs, then the following solution is more general.
If we restate the problem description as:
We want to search for recipes which have ingredients intersecting with the provided ingredient list, and the distinct labels for the recipes equals the distinct labels derived from the ingredient list (this last part is handled by the having clause)
We can write
WITH distinct_labels AS (
SELECT
ARRAY_AGG(DISTINCT label ORDER BY label) distinct_labels_filtered
FROM filter
WHERE slug IN ('product-spice-1','product-spice-5','cooking-method-fry', 'cooking-method-steam')
)
SELECT
rfj."recipeId"
FROM filter
JOIN recipe_filter_join rfj
ON filter.id = rfj."recipeFilterId"
WHERE slug IN ('product-spice-1','product-spice-5','cooking-method-fry', 'cooking-method-steam')
GROUP BY 1
HAVING ARRAY_AGG(DISTINCT label ORDER BY label) = (SELECT distinct_labels_filtered FROM distinct_labels)
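The rule that the HAVING clause enforces can be sketched in plain Python, which may make it easier to check by hand. This is not the SQL above, only the same comparison (labels matched by a recipe versus labels spanned by the requested slugs) applied to a subset of the sample data; the function name matching_recipes is made up for the example.

```python
from collections import defaultdict

# Subset of the sample filter table: filter id -> (slug, label).
filters = {
    4: ("product-spice-5", "Product"),
    9: ("product-spice-3", "Product"),
    38: ("diet-keto", "Diet"),
}
# (recipeId, recipeFilterId) pairs taken from recipe_filter_join.
joins = [(8, 4), (14, 9), (14, 4), (5, 9), (5, 38)]


def matching_recipes(slugs):
    wanted = set(slugs)
    # Labels spanned by the requested slugs; a recipe must match every one.
    required_labels = {label for slug, label in filters.values() if slug in wanted}
    matched = defaultdict(set)  # recipeId -> labels with at least one matching slug
    for recipe_id, filter_id in joins:
        slug, label = filters.get(filter_id, (None, None))
        if slug in wanted:
            matched[recipe_id].add(label)
    return sorted(r for r, labels in matched.items() if labels == required_labels)


print(matching_recipes(["product-spice-3", "product-spice-5", "diet-keto"]))  # prints [5]
```

Recipe 14 matches two Product slugs but no Diet slug, so it is excluded, while recipe 5 matches one slug in each label group.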

How to count the unique rows after aggregating to array

Trying to solve the problem in a read-only manner.
My table (answers) looks like the one below:
| user_id | value |
+----------------+-------------+
| 6 | pizza |
| 6 | tosti |
| 9 | fries |
| 9 | tosti |
| 10 | pizza |
| 10 | tosti |
| 12 | pizza |
| 12 | tosti |
| 13 | sushi | -> did not finish the quiz.
NOTE: the actual table has 15+ different possible values. (Answers to questions).
I've been able to create the table below:
| value arr | count | user_id |
+----------------+--------------+-----------+
| pizza, tosti | 2 | 6 |
| fries, tosti | 2 | 9 |
| pizza, tosti | 2 | 10 |*
| pizza, tosti | 2 | 12 |*
| sushi | 1 | 13 |
I'm not sure if the * rows show up in my current query (DB has 30k rows and 15+ value options). The problem here is that "count" is counting the number of answers and not the number of unique outcomes.
Current query looks a bit like:
select string_agg(DISTINCT value, ',' order by value) AS value, user_id,
COUNT(DISTINCT value)
FROM answers
GROUP BY user_id;
Looking for the unique answer combinations like the table shown below:
| value arr | count unique |
+----------------+--------------+
| pizza, tosti | 3 |
| fries, tosti | 1 |
| sushi | 1 | --> Hidden in perfect situation.
Tried a bunch of queries, both written and generated by tools. From super simplified to quite complex, I keep ending up with a count of the answers instead of a count of the unique combinations across users.
If this is a duplicate question, please re-direct me to it. Learned a lot these last few days, but haven't been able to find the answer yet.
Any help would be highly appreciated.
Here's what you need. You're almost there.
select t1.value, count(1) From (
select string_agg(DISTINCT value, ',' order by value) AS value, user_id
FROM answers
GROUP BY user_id) t1
group by t1.value;
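You can try (this is for SQL Server). Aggregate per user first, then group by the aggregated string; SQL Server does not allow an aggregate in GROUP BY, and string literals take single quotes:
select t.value_arr, count(*) as count_unique
from (
select user_id, string_agg(value, ',') within group (order by value) as value_arr
from answers
group by user_id
) t
group by t.value_arr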
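The two-step idea (build one combination per user, then count identical combinations) can be checked quickly in Python against the sample rows from the question. This is a sketch of the logic only, not a replacement for the SQL.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (user_id, value).
answers = [
    (6, "pizza"), (6, "tosti"),
    (9, "fries"), (9, "tosti"),
    (10, "pizza"), (10, "tosti"),
    (12, "pizza"), (12, "tosti"),
    (13, "sushi"),
]

# Step 1: collect the distinct values per user (the inner GROUP BY user_id).
per_user = defaultdict(set)
for user_id, value in answers:
    per_user[user_id].add(value)

# Step 2: count how many users share each sorted combination (the outer GROUP BY).
combos = Counter(",".join(sorted(values)) for values in per_user.values())

for combo, count in combos.most_common():
    print(combo, count)
```

Sorting the values before joining them plays the same role as the order by inside string_agg: it makes two users with the same answers produce the same key.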

influxdb/SQL get field count

I have an influxdb table, let's call it my_table.
my_table is structured like this (simplified):
+------+----+----+
| Time | m1 | m2 |
+======+====+====+
| 1    | 8  | 4  |
+------+----+----+
| 2    | 1  | 12 |
+------+----+----+
| 3    | 6  | 18 |
+------+----+----+
| 4    | 4  | 1  |
+------+----+----+
I was wondering if it is possible to find out, for each time, how many of the metrics are larger than a certain (dynamic) threshold.
So let's say I want to know how many of the metrics (columns) are higher than 5.
I would want to do something like this:
select fieldcount(/m*/) from my_table where /m*/ > 5
Returning:
1
1
2
0
I am relatively restricted in structuring the database as I'm using diamond collector (python) which takes care of all datacollection for me and flushes it to my influxdb without me telling what the tables should look like.
EDIT
I am aware of a possible solution if I hardcode the threshold and add a third metric named mGreaterThan5:
+------+----+----+---------------+
| Time | m1 | m2 | mGreaterThan5 |
+======+====+====+===============+
| 1    | 8  | 4  | 1             |
+------+----+----+---------------+
| 2    | 1  | 12 | 1             |
+------+----+----+---------------+
| 3    | 6  | 18 | 2             |
+------+----+----+---------------+
| 4    | 4  | 1  | 0             |
+------+----+----+---------------+
However, this means that I can't easily change this threshold to 6 or any other number, so that's why I would prefer a better solution if there is one.
EDIT2
Another similar problem occurs with trying to retrieve the highest x amount of metrics. Eg:
On Jan 1st what were the highest 3 values of m? Given table:
+-----+-----+----+-----+----+-----+----+
| Time| m1 | m2 | m3 | m4 | m5 | m6 |
+=====+=====+====+=====+====+=====+====+
| 1/1 | 8 | 4 | 1 | 7 | 2 | 0 |
+-----+-----+----+-----+----+-----+----+
Am I screwed if I keep the table structured this way?
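For reference, both computations described above (fields above a threshold, and the highest x fields) are easy to express once the rows have been fetched by a client. The Python sketch below works on plain dictionaries and does not use any InfluxDB API; the assumption that every metric field name starts with "m", and that no other field does, is taken from the simplified tables in the question.

```python
import heapq

# The first table from the question, one dict per point.
rows = [
    {"time": 1, "m1": 8, "m2": 4},
    {"time": 2, "m1": 1, "m2": 12},
    {"time": 3, "m1": 6, "m2": 18},
    {"time": 4, "m1": 4, "m2": 1},
]


def count_above(row, threshold, prefix="m"):
    """Count the metric fields of one row whose value exceeds the threshold."""
    return sum(1 for k, v in row.items() if k.startswith(prefix) and v > threshold)


def top_n(row, n, prefix="m"):
    """Return the n (field, value) pairs with the highest values in one row."""
    metrics = ((k, v) for k, v in row.items() if k.startswith(prefix))
    return heapq.nlargest(n, metrics, key=lambda kv: kv[1])


print([count_above(r, 5) for r in rows])  # prints [1, 1, 2, 0]

jan1 = {"time": "1/1", "m1": 8, "m2": 4, "m3": 1, "m4": 7, "m5": 2, "m6": 0}
print(top_n(jan1, 3))  # prints [('m1', 8), ('m4', 7), ('m2', 4)]
```

Because the threshold is an ordinary argument, changing it from 5 to 6 needs no change to the stored data.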

How to get top 3 frequencies in MySQL?

In MySQL I have a table called "meanings" with three columns:
"person" (int),
"word" (byte, 16 possible values)
"meaning" (byte, 26 possible values).
A person assigns one or more meanings to each word:
person word meaning
-------------------
1 1 4
1 2 19
1 2 7 <-- Note: second meaning for word 2
1 3 5
...
1 16 2
Then another person, and so on. There will be thousands of persons.
I need to find for each of the 16 words the top three meanings (with their frequencies). Something like:
+--------+-----------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+-----------------+------------------+-----------------+
| 1 | meaning 5 (35%) | meaning 19 (22%) | meaning 2 (13%) |
| 2 | meaning 8 (57%) | meaning 1 (18%) | meaning 22 (7%) |
+--------+-----------------+------------------+-----------------+
...
Is it possible to solve this with a single MySQL query?
Well, if you group by word and meaning, you can easily get the % of people who use each word/meaning combination out of the dataset.
In order to limit the number of meanings returned for each word, you will need to create some sort of filter per word/meaning combination.
Seems like you just want the answer to your homework, so I won't post more than this, but this should be enough to get you on the right track.
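Of course you can do this for a single word:
SELECT meaning, COUNT(DISTINCT person) AS persons FROM meanings WHERE word = 2 GROUP BY meaning ORDER BY persons DESC LIMIT 3
But this is cheating, since you need a loop to run it once per word.
I'm working on a better solution.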
I believe the problem I had a while ago looks similar. I ended up with the #counter thing.
Note about the problem
Let's suppose there is only one person, who says:
+--------+------+---------+
| Person | Word | Meaning |
+--------+------+---------+
| 1      | 1    | 7       |
| 1      | 1    | 3       |
| 1      | 2    | 8       |
+--------+------+---------+
The report should read:
+--------+------------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+------------------+------------------+-----------------+
| 1 | meaning 7 (100%) | meaning 3 (100%) | NULL |
| 2 | meaning 8 (100%) | NULL | NULL |
+--------+------------------+------------------+-----------------+
The following is not OK (50% frequency is absurd in a population of one person):
+--------+------------------+------------------+-----------------+
| Word | 1st Most Ranked | 2nd Most Ranked | 3rd Most Ranked |
+--------+------------------+------------------+-----------------+
| 1 | meaning 7 (50%) | meaning 3 (50%) | NULL |
| 2 | meaning 8 (100%) | NULL | NULL |
+--------+------------------+------------------+-----------------+
The intended meaning of the frequencies is "How many people think this meaning corresponds to that word?"
So it's not merely about counting "cases", but about counting persons in the table.
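The person-based counting described in this note can be sketched as follows, using Python's sqlite3 module instead of MySQL so that the example is self-contained. Two assumptions are made here: the denominator is the number of distinct persons in the whole table (each person is said to rate every word), and ties are broken by the lower meaning number. The function name top_meanings is hypothetical.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meanings (person INTEGER, word INTEGER, meaning INTEGER)")
# The one-person example from the note above.
conn.executemany(
    "INSERT INTO meanings VALUES (?, ?, ?)",
    [(1, 1, 7), (1, 1, 3), (1, 2, 8)],
)


def top_meanings(conn, limit=3):
    """Map each word to its top meanings as (meaning, percent of persons)."""
    total = conn.execute("SELECT COUNT(DISTINCT person) FROM meanings").fetchone()[0]
    rows = conn.execute(
        "SELECT word, meaning, COUNT(DISTINCT person) AS persons"
        " FROM meanings GROUP BY word, meaning"
        " ORDER BY word, persons DESC, meaning"
    )
    top = defaultdict(list)
    for word, meaning, persons in rows:
        # Rows arrive sorted by frequency, so keep only the first `limit` per word.
        if len(top[word]) < limit:
            top[word].append((meaning, 100.0 * persons / total))
    return dict(top)


print(top_meanings(conn))
```

With a single person, both meanings of word 1 come out at 100%, which is the result the note asks for, because the count is over persons rather than over rows.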