How to count occurrences of multiple words in a SQL query

I have two tables. One has a column containing words and their weightage, for example:
word     weight
First    9
Second   4
third    6
fourth   7
and another table that has sentences, like:
Sentence                Random column
this is first           ..
Second sentence         ..
another first           ..
first and third word    ..
Now what I want to do is select words with a weight of 5 or more, and see how many sentences contain the keyword at least once, so the query result will be:
(In the above example I tried to cover scenarios such as only one keyword in a sentence, multiple keywords in a single sentence, and a keyword that is in no sentence.) Also, if a keyword occurs multiple times in a sentence, the sentence should be counted only once. I am trying to count the sentences that contain the keyword, not the number of keyword occurrences.
word     count
First    3
third    1
fourth   0
How can I calculate this while keeping the query as simple as possible (if possible)?

If you want to count the number of sentences that contain the words, you can use:
select w.*,
       (select count(*)
        from sentences s
        where s.sentence like concat('%', w.word, '%')
       ) as cnt
from words w
where weightage >= 5;
This makes some assumptions about what you really mean.
"how many times each word is used in the sentence table" means "how many sentences contain the word at least once".
"used" can be handled by looking for the word anywhere in the sentence, regardless of surrounding characters.

Related

Select entries with substring in a specific position of the word

I am trying to write an SQL query where my goal is to select all the entries that contain a substring in a specific position of a word (the entries are phrases with multiple words).
To make it clearer, suppose that I have these entries (values inside a column, call it phrase):
1 - "Jamming in New York"
2 - "bosbessenjam 30g"
3 - "30g bosbessenjam"
4 - "Tranches de jambon fumé"
I want to select all the rows that contain a word that ends with "jam", so I want the second and third rows.
I tried using LIKE '%jam%', but it checks the overall string and not the individual words. LIKE '%jam' returns the third row but not the second.
Any idea on how to do this?
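No answer is included here, but a rough sketch of the usual trick, assuming a dialect with the || string-concatenation operator and space-separated words (the table name phrases is hypothetical; the column phrase is from the question), is to append a trailing space so "a word ends with jam" becomes "jam followed by a space":
select *
from   phrases
-- the appended space makes a word at the very end of the phrase match too
where  phrase || ' ' like '%jam %';
With the sample data this matches "bosbessenjam 30g" and "30g bosbessenjam" but not "Jamming in New York" or "Tranches de jambon fumé".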

LIKE [0-9] much slower than LIKE %

I have a somewhat complex view that selects a bunch of items and the amount of stock they have remaining. I want to narrow down the result set to item codes that conform to a pattern so that unwanted items are filtered out.
The item codes that I want to include are formatted as a 4-digit number, followed by a hyphen, followed by another 4-digit number:
1234-0001
I also want to include item numbers that are formatted as a 13-digit number (an ISBN which starts with 978), followed by a hyphen, followed by another 2-digit number:
9781234567890-01
Originally I planned to use the query below, which would match all item codes that contain a hyphen.
SELECT *
FROM Vw_Stock
WHERE ItemCode LIKE '%-%'
Unfortunately, not all of our item codes are that uniform, and people have sporadically used hyphens in items that do not conform to the two accepted formats above, so I switched to the following.
SELECT *
FROM Vw_Stock
WHERE ItemCode LIKE '[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
OR ItemCode LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]'
The query originally took 1 second and now it takes roughly 3 minutes 50 seconds. Why does it take so much longer? And is there a more efficient way for me to query against the item formats identified?
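No answer is included here, but one thing commonly tried, sketched below on the assumption that this is SQL Server (LEN and [0-9] character classes in LIKE), is to add cheaper filters such as an exact-length check and the literal 978 prefix so the expensive character-class comparison has less work to do; whether it actually helps depends on the execution plan and needs to be measured:
SELECT *
FROM Vw_Stock
-- the LEN() checks and the literal '978' prefix are cheap filters that may let
-- most rows be discarded without the per-character [0-9] comparisons
WHERE (LEN(ItemCode) = 9
       AND ItemCode LIKE '[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]')
   OR (LEN(ItemCode) = 16
       AND ItemCode LIKE '978[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]')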

Selecting only the first 3 words that start with the letter "i" in sqlite3

I have a table with two attributes, word and frequency. Now I want to select the words which start with "i", but only 3 of them.
For example, I have 180 words starting with an i:
I
Is
It
...
Now I want to select the first three words that start with 'i'.
Thank you
You need LIKE, LIMIT, and optionally ORDER BY (if you need to retrieve the top values from an ordered list; otherwise comment it out), as below.
select words
from table1
where words like 'i%'
order by words asc
limit 3;

How do you count the frequency of a word in a column?

I have a table with 20000 rows and a column called XXX. XXX holds strings (VARCHAR2; names with more than one word), and I want to find the first word in each name and display it with a query if it occurs more than 30 times.
For example, if the first word is foo and it occurs 30 times, or boo occurs 40 times, then:
Word Count
foo 30
boo 40
The word can be anything; the only condition is the frequency. I tried to solve it with INSTR, but I couldn't get it to work.
Thanks a lot for your help
If you have your column values separated by a delimiter like ',' or '.' or a space, you can use a GROUP BY query like the one below:
select count(*), substr(col, 1, instr(col, '.') - 1)
from tab
group by substr(col, 1, instr(col, '.') - 1)
order by 2;
This might also help you: REGEXP.
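Since the question also asks to show a word only when it is frequent enough, a fuller sketch, assuming Oracle (VARCHAR2, SUBSTR, INSTR), space-separated names, and a hypothetical table name your_table, would group on the first word and filter with HAVING:
select substr(XXX, 1, instr(XXX || ' ', ' ') - 1) as first_word,
       count(*) as cnt
from   your_table
-- appending a space handles single-word names, where INSTR would otherwise return 0
group by substr(XXX, 1, instr(XXX || ' ', ' ') - 1)
having count(*) >= 30   -- the example counts a word occurring exactly 30 times
order by cnt desc;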

Finding what words a set of letters can create?

I am trying to write some SQL that will accept a set of letters and return all of the possible words it can make. My first thought was to create a basic three table database like so:
Words -- contains 200k words in real life
------
1 | act
2 | cat
Letters -- contains the whole alphabet in real life
--------
1 | a
3 | c
20 | t
WordLetters --First column is the WordId and the second column is the LetterId
------------
1 | 1
1 | 3
1 | 20
2 | 3
2 | 1
2 | 20
But I'm a bit stuck on how I would write a query that returns words that have an entry in WordLetters for every letter passed in. It also needs to account for words that have two of the same letter. I started with this query, but it obviously does not work:
SELECT DISTINCT w.Word
FROM Words w
INNER JOIN WordLetters wl
ON wl.LetterId = 20 AND wl.LetterId = 3 AND wl.LetterId = 1
How would I write a query to return only words that contain all of the letters passed in and accounting for duplicate letters?
Other info:
My Word table contains close to 200,000 words which is why I am trying to do this on the database side rather than in code. I am using the enable1 word list if anyone cares.
Ignoring, for the moment, the SQL part of the problem, the algorithm I'd use is fairly simple: start by taking each word in your dictionary, and producing a version of it with the letters in sorted order, along with a pointer back to the original version of that word.
This would give a table with entries like:
sorted_text word_id
act 123 /* we'll assume `act` was word number 123 in the original list */
act 321 /* we'll assume 'cat' was word number 321 in the original list */
Then, when we receive an input (say, "tac"), we sort its letters, look it up in our table of sorted letters joined to the table of original words, and that gives us a list of the words that can be created from that input.
If I were doing this, I'd have the tables for that in a SQL database, but probably use something else to pre-process the word list into the sorted form. Likewise, I'd probably leave sorting the letters of the user's input to whatever I was using to create the front-end, so SQL would be left to do what it's good at: relational database management.
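As a concrete sketch of that lookup step, assuming the pre-processing has already produced a sorted_words(sorted_text, word_id) table alongside a words(id, word) table (both names are illustrative, not from the question), the query reduces to a join on the pre-sorted input:
-- the front end has already sorted the user's input 'tac' into 'act'
select w.word
from   sorted_words sw
join   words w
  on   w.id = sw.word_id
where  sw.sorted_text = 'act';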
If you use the solution you provide, you'll need to add an order column to the WordLetters table. Without that, there's no guarantee that the rows you retrieve will be in the same order you inserted them.
However, I think I have a better solution. Based on your question, it appears that you want to find all words with the same component letters, independent of order or number of occurrences. This means that you have a limited number of possibilities. If you translate each letter of the alphabet into a different power of two, you can create a unique value for each combination of letters (aka a bitmask). You can then simply add together the values for each letter found in a word. This will make matching the words trivial, as all words with the same letters will map to the same value. Here's an example:
WITH letters AS (
    SELECT Cast('a' AS VARCHAR) AS Letter,
           1 AS LetterValue,
           1 AS LetterNumber
    UNION ALL
    SELECT Cast(Char(97 + LetterNumber) AS VARCHAR),
           Power(2, LetterNumber),
           LetterNumber + 1
    FROM letters
    WHERE LetterNumber < 26
),
words AS (
    SELECT 1 AS wordid, 'act' AS word
    UNION ALL SELECT 2, 'cat'
    UNION ALL SELECT 3, 'tom'
    UNION ALL SELECT 4, 'moot'
    UNION ALL SELECT 5, 'mote'
)
SELECT wordid,
       word,
       Sum(DISTINCT LetterValue) AS WordValue
FROM letters
JOIN words
  ON word LIKE '%' + letter + '%'
GROUP BY wordid, word
As you'll see if you run this query, "act" and "cat" have the same WordValue, as do "tom" and "moot", despite the difference in number of characters.
What makes this better than your solution? You don't have to build a lot of non-words to weed them out. This will constitute a massive savings of both storage and processing needed to perform the task.
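To turn the WordValue column into an actual lookup, one sketch (assuming the results above have been stored in a hypothetical word_values(wordid, word, wordvalue) table, and reusing the same letters CTE to compute the bitmask of the user's input) would be:
WITH letters AS (
    SELECT Cast('a' AS VARCHAR) AS Letter,
           1 AS LetterValue,
           1 AS LetterNumber
    UNION ALL
    SELECT Cast(Char(97 + LetterNumber) AS VARCHAR),
           Power(2, LetterNumber),
           LetterNumber + 1
    FROM letters
    WHERE LetterNumber < 26
),
input AS (
    SELECT 'tca' AS letters   -- the set of letters supplied by the user
)
SELECT wv.wordid, wv.word
FROM word_values wv
JOIN (SELECT Sum(DISTINCT LetterValue) AS InputValue
      FROM letters
      JOIN input
        ON input.letters LIKE '%' + Letter + '%') i
  ON wv.wordvalue = i.InputValue;
Because the bitmask ignores how many times a letter occurs, 'act', 'cat', and 'tca' all map to the same value, which matches the behaviour of the query above.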
There is a solution to this in SQL. It involves using a trick to count the number of times that each letter appears in a word. The following expression counts the number of times that 'a' appears:
select len(word) - len(replace(word, 'a', ''))
The idea is to count the total of all the letters in the word and see if that matches the overall length:
select wls.word, (LEN(wls.word) - SUM(wls.LettersInWord))
from
(
    select w.word,
           (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) as LettersInWord
    from word w
    cross join wordletters wl
) wls
group by wls.word
having LEN(wls.word) = SUM(wls.LettersInWord)
This particular solution allows multiple occurrences of a letter. I'm not sure if this was desired in the original question or not. If we want up to a certain number of occurrences, then we might do the following:
select wls.word, (LEN(wls.word) - SUM(wls.LettersInWord))
from
(
    select w.word,
           (case when (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) <= maxcount
                 then (LEN(w.word) - LEN(replace(w.word, wl.letter, '')))
                 else maxcount
            end) as LettersInWord
    from word w
    cross join
    (
        select letter, count(*) as maxcount
        from wordletters wl
        group by letter
    ) wl
) wls
group by wls.word
having LEN(wls.word) = SUM(wls.LettersInWord)
If you want an exact match to the letters, then the case statement should use " = maxcount" instead of " <= maxcount".
In my experience, I have actually seen decent performance with small cross joins. This might actually work server-side. There are two big advantages to doing this work on the server. First, it takes advantage of the parallelism on the box. Second, a much smaller set of data needs to be transferred across the network.