Selecting Strings With Alphabetized Characters - In SQL Server 2008 R2 - sql

This is a recreational pursuit, and is not homework. If you value academic challenges, please read on.
A radio quiz show had a segment requesting listeners to call in with words that have their characters in alphabetical order, e.g. "aim", "abbot", "celt", "deft", etc. I got these few examples by a quick Notepad++ (NPP) inspection of a Scrabble dictionary word list.
I'm looking for an elegant way in T-SQL to determine if a word qualifies for the list, i.e. all its letters are in alphabetical order, case insensitive.
It seemed to me that there should be some kind of T-SQL algorithm that will do a SELECT on a table of English words and return the complete list of all words in the Scrabble dictionary that meet the spec. I've spent considerable time looking at regex strings, but haven't hit on anything that comes even remotely close. I've thought about the obvious looping scenario, but abandoned it for now as "inelegant". I'm looking for your ideas that will obtain the qualifying word list,
preferably using
- a REGEX expression
- a tally-table-based approach
- a scalar UDF that returns 1 if the input word meets the requirement, else 0.
- Other, only limited by your creativity.
But preferably NOT using
- a looping structure
- a recursive solution
- a CLR solution
Assumptions/observations:
1. A "word" is defined here as two or more characters. My dictionary shows 55 2-character words, of which only 28 qualify.
2. No word will have more than two consecutive characters that are identical. (If you find one, please point it out.)
3. At 21 characters, "electroencephalograms" is the longest word in my Scrabble dictionary
(though why that word is in the Scrabble dictionary escapes me--the board is only a 15-by-15 grid.)
Consider 21 as the upper limit on word length.
4. All words LIKE 'Z%' can be dismissed because all you can create is {'Z','ZZ', ... , 'ZZZ...Z'}.
5. As the initial character of the dictionary's words proceeds through the alphabet, fewer words will qualify.
6. As the word lengths get longer, fewer words will qualify.
7. I suspect that there will be less than 0.2% of my dictionary's 60,387 words that will qualify.
For example, I've tried NPP regex searches like "^a[a-z][b-z][b-z][c-z][c-z][d-z][d-z][e-z]" for 9-letter words starting with "a", but the character-by-character alphabetic enforcement is not handled properly. This search will return "abilities" which fails the test with the "i" that follows the "l".
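A fixed character class per position cannot express the rule, but a pattern that allows zero or more of each letter in turn can: `^a*b*c*...z*$`. The following is only a minimal sketch of that idea in Python rather than T-SQL, since SQL Server 2008 R2 has no native regex matching; the helper name `is_alphabetical` is made up for illustration.

```python
import re
import string

# One "zero or more" group per letter, in alphabetical order: a*b*c*...z*
ALPHA_ORDER = re.compile(
    "".join(ch + "*" for ch in string.ascii_lowercase),
    re.IGNORECASE,
)


def is_alphabetical(word: str) -> bool:
    """True if the word has two or more letters and they never go backwards."""
    return len(word) >= 2 and ALPHA_ORDER.fullmatch(word) is not None
```

With this pattern "abbot" matches, while "abilities" fails at the "i" that follows the "l".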
There are several free Scrabble word lists available on the web, but Phil Factor gives a really interesting treatment of T-SQL/Scrabble considerations at https://www.simple-talk.com/sql/t-sql-programming/the-sql-of-scrabble-and-rapping/ which is where I got my word list.
Care to give it a shot?

Split the word into individual characters using a numbers table. Use the numbers as one set of indices. Use ROW_NUMBER, ordered by letter and then by position, to create another set. Compare the two sets of indices to see if they match for every character. If they do, the letters in the word are in alphabetical order.
DECLARE @Word varchar(100) = 'abbot';
WITH indexed AS (
  SELECT
    Index1 = n.Number,
    Index2 = ROW_NUMBER() OVER (ORDER BY x.Letter, n.Number),
    x.Letter
  FROM
    dbo.Numbers AS n
  CROSS APPLY
    (SELECT SUBSTRING(@Word, n.Number, 1)) AS x (Letter)
  WHERE
    n.Number BETWEEN 1 AND LEN(@Word)
)
SELECT
  Conclusion = CASE COUNT(NULLIF(Index1, Index2))
                 WHEN 0 THEN 'Alphabetical'
                 ELSE 'Not alphabetical'
               END
FROM
  indexed
;
The NULLIF(Index1, Index2) expression does the comparison: it returns NULL if the arguments are equal, otherwise it returns the value of Index1. If all indices match, all the results will be NULL and COUNT will return 0, which means the letters in the word were in alphabetical order.
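The same comparison of two index sets can be sketched outside the database. This is a rough Python illustration only, not the T-SQL above; the function name is hypothetical, and lower-casing stands in for a case-insensitive collation.

```python
def is_alphabetical_by_rank(word: str) -> bool:
    """Compare each letter's position with its rank after sorting."""
    letters = word.lower()
    # Positions ordered by (letter, position): the analogue of
    # ROW_NUMBER() OVER (ORDER BY Letter, Number).
    sorted_positions = sorted(range(len(letters)), key=lambda i: (letters[i], i))
    # The word is in order when sorting leaves every position where it was.
    return all(rank == pos for rank, pos in enumerate(sorted_positions))
```

Including the position as a tie-breaker matters for words such as "abbot": without it, the two "b" characters could swap ranks and produce a false mismatch.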

I did something similar to Andriy. I created a numbers table with values 1-21. I use it to create one set of data with the individual letters ordered by index and a second set ordered alphabetically. I join the sets to each other on the letter and the row number, then count the rows that have no match. Anything over 0 means the word is not in order.
DECLARE @word VARCHAR(21)
SET @word = 'abbot'
SELECT Count(1)
FROM (SELECT Substring(@word, number, 1) AS Letter,
             Row_number() OVER (ORDER BY number) AS letterNum
      FROM numbers
      WHERE number <= CONVERT(INT, Len(@word))) a
LEFT OUTER JOIN (SELECT Substring(@word, number, 1) AS letter,
                        Row_number() OVER (ORDER BY Substring(@word, number, 1)) AS letterNum
                 FROM numbers
                 WHERE number <= CONVERT(INT, Len(@word))) b
  ON a.letternum = b.letternum
 AND a.letter = b.letter
WHERE b.letter IS NULL

Interesting idea...
Here's my take on it. This returns a list of words that are in order, but you could easily return 1 instead.
DECLARE @WORDS TABLE (VAL VARCHAR(MAX))
INSERT INTO @WORDS (VAL)
VALUES ('AIM'), ('ABBOT'), ('CELT'), ('DAVID')
;WITH CHARS
AS
(
    -- Anchor: first letter of each word. REMAINS is upper-cased so the ASCII comparison is case insensitive.
    SELECT VAL AS SOURCEWORD, UPPER(VAL) AS EVALWORD, ASCII(LEFT(UPPER(VAL),1)) AS ASCIICODE, RIGHT(UPPER(VAL),LEN(VAL)-1) AS REMAINS, 1 AS ROWID, 1 AS INORDER, LEN(VAL) AS WORDLENGTH
    FROM @WORDS
    UNION ALL
    -- Recursive step: peel off the next letter and count it if it does not go backwards.
    SELECT SOURCEWORD, REMAINS, ASCII(LEFT(REMAINS,1)), RIGHT(REMAINS,LEN(REMAINS)-1), ROWID+1, INORDER+CASE WHEN ASCII(LEFT(REMAINS,1)) >= ASCIICODE THEN 1 ELSE 0 END AS INORDER, WORDLENGTH
    FROM CHARS
    WHERE LEN(REMAINS)>=1
),
ONLYINORDER
AS
(
    SELECT *
    FROM CHARS
    WHERE ROWID=WORDLENGTH AND INORDER=WORDLENGTH
)
SELECT SOURCEWORD
FROM ONLYINORDER
Here it is as a UDF:
CREATE FUNCTION dbo.AlphabetSoup (@Word VARCHAR(MAX))
RETURNS BIT
AS
BEGIN
    SET @Word = UPPER(@Word)
    DECLARE @Result INT
    ;WITH CHARS
    AS
    (
        SELECT @Word AS SOURCEWORD,
               @Word AS EVALWORD,
               ASCII(LEFT(@Word,1)) AS ASCIICODE,
               RIGHT(@Word,LEN(@Word)-1) AS REMAINS,
               1 AS ROWID,
               1 AS INORDER,
               LEN(@Word) AS WORDLENGTH
        UNION ALL
        SELECT SOURCEWORD,
               REMAINS,
               ASCII(LEFT(REMAINS,1)),
               RIGHT(REMAINS,LEN(REMAINS)-1),
               ROWID+1,
               INORDER+CASE WHEN ASCII(LEFT(REMAINS,1)) >= ASCIICODE THEN 1 ELSE 0 END AS INORDER,
               WORDLENGTH
        FROM CHARS
        WHERE LEN(REMAINS)>=1
    ),
    ONLYINORDER
    AS
    (
        SELECT 1 AS RESULT
        FROM CHARS
        WHERE ROWID=WORDLENGTH AND INORDER=WORDLENGTH
        UNION
        SELECT 0
        FROM CHARS
        WHERE NOT (ROWID=WORDLENGTH AND INORDER=WORDLENGTH)
    )
    -- ONLYINORDER can hold both 0 and 1 (intermediate rows yield 0), so take the maximum
    -- instead of relying on which row happens to be assigned last.
    SELECT @Result = MAX(RESULT) FROM ONLYINORDER
    RETURN @Result
END

Related

Generate random numbers, letters or characters within a range

I'm in the middle of a data anonymization for SQL Server.
I have these 3 formulas that help me create what I want:
SELECT CHAR(cast((90 - 65) * rand() + 65 AS INTEGER)) -- only letters
SELECT CAST((128 - 48) * RAND() + 48 AS INTEGER) -- only numbers
SELECT CHAR(CAST((128 - 48) * RAND() + 48 AS INTEGER)) -- letters, numbers, symbols
However, this only can create 1 number or 1 letter or 1 symbol.
I want to have the freedom that allows me to create a random string or number of the length I want. Like 3 or 5 numbers, 3 or 5 letters, 3 or 5 between numbers, letters or symbols.
I also have found something very close to what I want:
SELECT LEFT(CAST(NEWID() AS VARCHAR(100)), 3) -- letters and numbers
This is a very smart idea because it uses NEWID(), which lets me create a random sequence of numbers and letters of the length I want (3 in this case). But symbols are missing.
I need 3 different SELECT:
One for numbers only
One for letters only
One for numbers, letters and symbols
With the freedom of choice about the length of the data.
Some work required for a complete solution but here's the workings of an idea you might want to experiment with further, if you still need it:
declare @type varchar(10)='letters', @length tinyint=5;
with chars as (
    select top(59) 31 + Row_Number() over (order by (select 1)) n from master.dbo.spt_values
), s as (
    select top (@length) Char(n.n) c
    from chars n
    where @type='all'
    or (@type='symbols' and n between 33 and 47)
    or (@type='numbers' and n between 48 and 57)
    or (@type='letters' and n between 65 and 90)
    order by newid()
)
select String_Agg(s.c,'')
from s
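To make the selection logic easier to experiment with outside the database, here is a rough Python sketch of the same idea. The names `POOLS` and `random_string` are invented for illustration; `random.sample` is used because, like `TOP (n) ... ORDER BY NEWID()`, it picks without repeating a character.

```python
import random
import string

# Character pools keyed by the same type names as the query above (ASCII ranges).
POOLS = {
    "symbols": [chr(n) for n in range(33, 48)],   # codes 33..47
    "numbers": list(string.digits),               # codes 48..57
    "letters": list(string.ascii_uppercase),      # codes 65..90
    "all": [chr(n) for n in range(32, 91)],       # codes 32..90
}


def random_string(kind: str, length: int) -> str:
    """Pick `length` distinct characters from the chosen pool, in random order."""
    pool = POOLS[kind]
    # Sampling without replacement caps the result at the pool size.
    return "".join(random.sample(pool, min(length, len(pool))))
```

One consequence of this design is that a request for more characters than the pool holds (for example 12 digits) returns a shorter string, which is also how the TOP-based query behaves.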
Recursive query might work with rand() function:
declare @desiredlength tinyint=5;
With builder As (
    -- randstr is cast to the same type in both parts; a recursive CTE requires matching column types
    Select *
    From (Values (0, cast('' as varchar(255)), 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')) initial (length, randstr, pool)
    Union All
    Select length+1, cast(randstr + substring(pool,cast(rand()*len(pool)+1 AS int),1) as varchar(255)), pool
    From builder
    Where length<@desiredlength
)
Select randstr From builder
Where length=@desiredlength
rand() in a single select returns the same random number in each row of a select. But here in a recursive select you're in a grey area where each recursion might be treated like a separate query.
Obviously you can tailor the pool definition to be any character set you want and the rest of the code will choose from whatever's there.
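As a point of comparison, the pool-based approach is easy to sketch in Python. This is only an illustration of the idea, with hypothetical names; unlike a TOP/NEWID() selection, it draws with replacement, so characters may repeat.

```python
import random

# Same pool as in the recursive query; swap in any character set you like.
POOL = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def random_from_pool(length: int, pool: str = POOL) -> str:
    """Build a string by drawing one random character from the pool per step."""
    return "".join(random.choice(pool) for _ in range(length))
```

Passing a different `pool` (digits only, letters only, or a mix with symbols) covers the three cases from the question without changing the function.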

TSQL - Remove list of parameters from Text

I have a field in a SQL table which has text looks something like this:
'The Employee <PARAM1> was replaced with <PARAM2> and was given the new IP address <PARAM3> with limited access <PARAM4>. <PARAM2> loves the new role'
I want to remove all the <PARAMs> from the text and just show as below using TSQL
'The Employee was replaced with and was given the new IP address with limited access. loves the new role'
What is the best way to do that?
One way to solve this is to generate a common table expression that will hold the start position and length of each <PARAMn> in the string. To do that, you can use a single common table expression but I've done it with three so that the process is easy to understand.
Please note that I'm assuming the string only contains < and > as params separators - so there is no < or > chars in the content. If that's not the case, it's still solvable but the solution would need some changes.
You start with a numbers (tally) cte that starts with 1 and ends with the length of your string.
Then another cte to get the start position of each <PARAMn>.
A third cte is used to get the length of each <PARAMn> (since I'm assuming you are not limited to 10 parameters in a string, so you can have <PARAM12> or even <PARAM105>).
Then, create a query that will update the original string and remove the <PARAMn> one by one.
-- Test data:
DECLARE @Str nvarchar(1000) = 'The Employee <PARAM1> was replaced with <PARAM2> and was given the new IP address <PARAM3> with limited access <PARAM4>. <PARAM2> loves the new role';
-- The numbers (Tally) cte:
WITH Tally AS
(
    SELECT TOP(LEN(@Str)) ROW_NUMBER() OVER(ORDER BY @@SPID) As N
    FROM sys.objects A CROSS JOIN sys.objects B
), -- The StartPosition cte contains the position of each < char in the string
StartPosition AS
(
    SELECT DISTINCT
        CHARINDEX('<', @Str, N) As Start
    FROM Tally
    WHERE CHARINDEX('<', @Str, N) > 0
), -- the Length cte contains both the start position and the length of each <PARAMn> in the string
Length As
(
    SELECT Start,
           CHARINDEX('>', @Str, Start) - Start + 1 As Length
    FROM StartPosition
)
-- Use STUFF to remove `<PARAMn>`. Note the order by is critical to remove from end to start.
SELECT @Str = STUFF(@Str, Start, Length, '')
FROM Length
ORDER BY Start DESC
-- verify the results:
SELECT @Str
Result:
The Employee was replaced with and was given the new IP address with limited access . loves the new role
You might note that the results have places with double spaces where the <PARAMn> used to be - that can be solved by using the technique Gordon Linoff shows in this answer.
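The reason for removing from end to start is easier to see in a small sketch. The following Python version is an illustration only, with a made-up function name, and it relies on the same assumption as above: `<` and `>` appear only as parameter delimiters.

```python
def strip_params(text: str) -> str:
    """Remove every <...> span, working from the last one back to the first."""
    # Start positions of every '<', as in the StartPosition cte.
    starts = [i for i, ch in enumerate(text) if ch == "<"]
    # Going right to left keeps the earlier start positions valid
    # while later parts of the string shrink.
    for start in sorted(starts, reverse=True):
        end = text.find(">", start)
        if end == -1:
            continue  # unmatched '<': leave it alone
        text = text[:start] + text[end + 1:]
    return text
```

Removing left to right would shift every later position after each deletion, so the precomputed starts would point at the wrong characters.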

How to remove non alphanumeric characters in SQL without creating a function?

I'm trying to remove non-alphanumeric characters from multiple columns in a table, and I have no permission to create functions or temporary functions. I wonder whether anyone here has experience removing non-alphanumeric characters without creating any functions at all? Thanks. I'm using MS SQL Server Management Studio v17.9.1
If you have to use a single SELECT query like @Forty3 mentioned, then multiple REPLACEs as @Gordon-Linoff suggested are probably best (but definitely not ideal).
If you can update the data or use T-SQL, then you could do something like this from https://searchsqlserver.techtarget.com/tip/Replacing-non-alphanumeric-characters-in-strings-using-T-SQL:
while @@rowcount > 0
    update user_list_original
    set fname = replace(fname, substring(fname, patindex('%[^a-zA-Z ]%', fname), 1), '')
    where patindex('%[^a-zA-Z ]%', fname) <> 0
Here is a starting point - you will need to adjust it to accommodate all of the columns which require cleansing:
;WITH allcharcte ( id, textcol1, textcol2, textcol1where, textcol2where )
AS (SELECT id,
CAST(textcol1 AS NVARCHAR(255)),
CAST(textcol2 AS NVARCHAR(255)),
-- Start the process of looking for non-alphanumeric chars in each
-- of the text columns. The returned value from PATINDEX is the position
-- of the non-alphanumeric char and is stored in the *where columns
-- of the CTE.
PATINDEX(N'%[^0-9A-Z]%', textcol1),
PATINDEX(N'%[^0-9A-Z]%', textcol2)
FROM #temp
UNION ALL
-- This is the recursive part. It works through the rows which have been
-- returned thus far processing them for use in the next iteration
SELECT prev.id,
-- If the *where column relevant for each of the columns is NOT NULL
-- and NOT ZERO, then use the STUFF command to replace the char
-- at that location with an empty string
CASE ISNULL(prev.textcol1where, 0)
WHEN 0 THEN CAST(prev.textcol1 AS NVARCHAR(255))
ELSE CAST(STUFF(prev.textcol1, prev.textcol1where, 1, N'') AS NVARCHAR(255))
END,
CASE ISNULL(prev.textcol2where, 0)
WHEN 0 THEN CAST(prev.textcol2 AS NVARCHAR(255))
ELSE CAST(STUFF(prev.textcol2, prev.textcol2where, 1, N'') AS NVARCHAR(255))
END,
-- We now check for the existence of the next non-alphanumeric
-- character AFTER we replace the most recent finding
ISNULL(PATINDEX(N'%[^0-9A-Z]%', STUFF(prev.textcol1, prev.textcol1where, 1, N'')), 0),
ISNULL(PATINDEX(N'%[^0-9A-Z]%', STUFF(prev.textcol2, prev.textcol2where, 1, N'')), 0)
FROM allcharcte prev
WHERE ISNULL(prev.textcol1where, 0) > 0
OR ISNULL(prev.textcol2where, 0) > 0)
SELECT *
FROM allcharcte
WHERE textcol1where = 0
AND textcol2where = 0
Essentially, it is a recursive CTE which repeatedly replaces the first non-alphanumeric character it finds (via PATINDEX(N'%[^0-9A-Z]%', <column>)) with an empty string (via STUFF(<column>, <where>, 1, N'')). By replicating the blocks, you should be able to adapt it to any number of columns.
EDIT: If you anticipate having more than 100 instances of non-alphanumeric characters to strip out of any one column, you will need to adjust the MAXRECURSION property ahead of the call.
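The find-one, remove-one, repeat cycle that the recursive CTE performs can be sketched in a few lines of Python. This is only an illustration with hypothetical names; the character class includes lower-case letters on the assumption that the SQL pattern runs under a case-insensitive collation.

```python
import re

# First character that is not a digit or a letter.
NON_ALNUM = re.compile(r"[^0-9A-Za-z]")


def strip_non_alphanumeric(value: str) -> str:
    """Remove one offending character per pass until none is left."""
    match = NON_ALNUM.search(value)
    while match:
        # Equivalent of STUFF(value, position, 1, ''): cut out a single character.
        value = value[:match.start()] + value[match.end():]
        match = NON_ALNUM.search(value)
    return value
```

Each pass corresponds to one level of recursion in the CTE, which is why a value with more than 100 offending characters runs into the default recursion limit there.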

Implement find/find next algorithm

I have a database table (mysql/pgsql) with the following format:
id|text
1| the cat is black
2| a cat is a cat
3| a dog
I need to select the line that contains nth match of a word:
eg: "Select the 3rd match for the word cat, that is the number 2 entry."
Results: the 2nd row from the result where the 3rd word is cat
The only solution I could find is to search for all entries that have the text cat, load them in memory and find the match by counting them. But this is not efficient for a big number of matches(>1 million).
How would you handle this in an efficient way? Is there anything you can do directly in the database? Maybe using other technologies like lucene?
Update: having 1 million strings in memory might not be a big issue but the expectation of the application is to have between 1k-50k active users that might do this operation concurrently.
Consider creating another table with the below structure
Table : index_table
columns :
index_id , word, occurrence, id(foreign key to your original table)
Do one time indexing process as below:
Iterate over each entry in your original table in id order and split the text into words. For each word, look it up in the new table: if it is not present, insert a new entry with occurrence set to 1; if it is, insert a new entry with occurrence = highest existing occurrence for that word + 1.
Once you have done this one off indexing your selects become pretty simple.
For example for cat with 3rd match will be
SELECT *
FROM original_table o, index_table idx
WHERE idx.word = 'cat'
AND idx.occurrence = 3
AND o.id = idx.id
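The one-off indexing pass described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not a definitive implementation: rows are `(id, text)` pairs, words are separated by whitespace, and the function name is made up.

```python
def build_occurrence_index(rows):
    """Map (word, occurrence number) to the id of the row holding that occurrence."""
    index = {}
    seen = {}  # word -> number of occurrences counted so far
    for row_id, text in sorted(rows):  # process rows in id order
        for word in text.split():
            seen[word] = seen.get(word, 0) + 1
            index[(word, seen[word])] = row_id
    return index


rows = [(1, "the cat is black"), (2, "a cat is a cat"), (3, "a dog")]
index = build_occurrence_index(rows)
print(index[("cat", 3)])
```

With the sample data from the question, the third occurrence of "cat" falls in the row with id 2, which is what the lookup on `(word, occurrence)` returns.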
You do not need Lucene for this job. Furthermore, if you have a large number of positive matches, the effort to pump all required data out of your DB will well exceed the computational cost.
Here's a simple solution:
Index: we require two properties:
efficiently access the words for each id
efficiently access all IDs in ascending order
as follows:
create index i_words on example_data (id, string_to_array(txt, ' '));
Query: find the ID associated with the nth match with the following query:
select id
from (
select id, unnest(string_to_array(txt, ' ')) as word
from example_data
) words
where word = :w -- :w = 'cat'
offset :n - 1 -- :n = 3
limit 1;
Executes in 2ms on 1 million rows.
Here's the full PostgreSQL setup if you'd rather try for yourself than take my word for it:
drop table if exists example_data;
create table example_data (
id integer primary key,
txt text not null
);
insert into example_data
(select generate_series(1, 1000000, 3) as id, 'the cat is black' as txt
union all
select generate_series(2, 1000000, 3), 'a cat is a cat'
union all
select generate_series(3, 1000000, 3), 'a dog'
order by id);
commit;
drop index if exists i_words;
create index i_words on example_data (id, string_to_array(txt, ' '));
select id
from (
select id, unnest(string_to_array(txt, ' ')) as word
from example_data
) words
where word = 'cat'
offset 3 - 1
limit 1;
select
id, word
from (
select id, unnest(string_to_array(txt, ' ')) as word
from example_data
) words
where word = 'cat'
offset 3 - 1
limit 1;
Note that I'm still unsure what exactly "Select the 3rd match for the word cat, that is the number 2 entry" is supposed to mean.
Possible meanings:
1. the 2nd row from the result where the 3rd word is cat
2. the 3rd row where the 2nd word is "cat"
3. from all rows where "cat" appears at least 3 times, take the second row
4. from all rows where "cat" appears at least 2 times, take the third row
If it's 1 or 2, I think this could be done at an acceptable speed by using a trigram index to reduce the possible number of matching lines. A trigram index (supplied by the module pg_trgm) will allow Postgres to make use of an index when evaluating a condition such as like '%cat%'.
Assuming that only a small number of rows will satisfy that condition, the resulting lines can then be split into arrays and checked for the nth word.
Something like this:
with matching_rows as (
select id, line,
row_number() over (order by id) as rn
from the_table
where line like '%cat%' -- this hopefully reduces the result to only very few rows
)
select *
from matching_rows
where rn = 3 --<< "the third match for the word cat"
and (string_to_array(line, ' '))[2] = 'cat' -- "the second word is "cat"
Note that a trigram index does have disadvantages as well. Maintaining such an index is much more expensive (=slower) than maintaining a regular b-tree index. So if your table is heavily updated, this might not be a good solution - but you need to test that for yourself.
Also, if the condition like '%cat%' doesn't really reduce the number of rows substantially, this is probably not going to perform well either.
Some more information on trigram indexes:
http://www.depesz.com/index.php/2011/02/19/waiting-for-9-1-faster-likeilike/
http://www.postgresonline.com/journal/archives/212-PostgreSQL-9.1-Trigrams-teaching-LIKE-and-ILIKE-new-tricks.html
Another option would be to filter out the "relevant" rows using Postgres' full text search instead of a plain LIKE condition.
Whatever algorithm you come up with for the database as-it-is is likely to be slow for this kind of data. You do need an efficient text-based search; Lucene-based solutions like Solr or Elasticsearch will do nicely here. That would be the best option, though finding a match against the 3rd token in a string is not something I know how to build without further googling.
You can also write a job in your db which will let you build a reverse map, string->id. like this:
rownum, id, text
1 1 the cat is black
2 3 nice cat
to
key, rownum, id
1_the 1 1
2_cat 1 1
3_is 1 1
4_black 1 1
1_nice 2 3
2_cat 2 3
If you can order by ID you don't need rownum. You should also call the column something other than rownum; I left it like that for clarity.
Now you can find the 1st ID where the word cat is the 2nd word by searching like this:
SELECT ID WHERE ROWNUM=1 AND key='2_cat'
Provided you created an (id, key) or (key, id) index, your searches should be pretty quick.
If you can fit all that data into memory, then you can use a simple Map<MyKey, Long> to do your search. MyKey would be, more or less Pair<Long,String> with proper equals and hashCode (and/or Comparable, if you use TreeMap) implementations.
(Thanks to Daniel Grosskopf for pointing out that I initially misinterpreted the question.)
This query will give you what you want with just SQL. It gets a running total of the counts of the occurrences of a word (e.g. 'cat') within the text, and then it returns the first row that hits the threshold that you want (e.g. 3).
SELECT id, text
FROM (SELECT entries.*,
             SUM((SELECT COUNT(*)
                  FROM regexp_split_to_table(text, E'\\s+') AS words(word)
                  WHERE word = 'cat')) OVER (ORDER BY id) AS running_count
      FROM entries) AS entries_with_running_count
WHERE running_count >= 3
ORDER BY id -- make "first row that hits the threshold" deterministic
LIMIT 1
See it in action in SQL Fiddle
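For readers who want to check the running-total logic by hand, here is a rough equivalent in Python. It is a sketch only: rows are assumed to be `(id, text)` pairs, words are split on whitespace, and the function name is hypothetical.

```python
from itertools import accumulate


def row_of_nth_match(rows, word, n):
    """Return the id of the first row whose running match count reaches n."""
    ordered = sorted(rows)  # same ordering as OVER (ORDER BY id)
    counts = [text.split().count(word) for _, text in ordered]
    for (row_id, _), running in zip(ordered, accumulate(counts)):
        if running >= n:
            return row_id
    return None  # the word occurs fewer than n times
```

For the sample table, the running counts for "cat" are 1, 3, 3, so the third match is found in the row with id 2.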
How would you handle this in an efficient way? Is there any trick you
can do directly in the database?
You are not specifying what other restrictions/requirements you may have or what is your definition of
a big number of matches.
As a general answer I would say that doing string manipulation in the database is not an efficient approach.
It is too slow and imposes much work on your DB which is usually a shared resource.
IMO you should do this programmatically.
A way to do this could be to keep metadata in another table i.e. indexes of rows that contain the text cat and where in the sentence.
You can query this meta-table in order to figure the rows to query from your main table.
This extra table is more efficient than searching your defined table because queries with LIKE on suffixes can not use an index and you will end up with serial scans which would result in very slow performance
Solution for the Postgres database:
Add a new column to your table:
alter table my_table add text_as_array text[];
This column will contain the sentence spliced into words:
"the cat is black" -> ["the","cat","is","black"]
Populate this column with values from current records:
update my_table set text_as_array = string_to_array(text,' ');
(and don't forget to set its value to string_to_array(text,' ') when inserting new records)
Create a gin index on it:
create index my_table_text_as_array_index on my_table using gin (text_as_array);
analyze my_table;
Then all you need is run a fast query as simple as this:
select *
from my_table
where text_as_array @> ARRAY['cat']
and text_as_array[3] = 'cat' -- third word in sentence
order by id
limit 1
offset 1 -- second matching row (skip the first one)
It took 11ms to search over ~2,400,000 records in tests I did in my machine.
Explain:
Limit (cost=11252.08..11252.08 rows=1 width=104)
-> Sort (cost=11252.07..11252.12 rows=19 width=104)
Sort Key: id
-> Bitmap Heap Scan on my_table (cost=48.21..11251.83 rows=19 width=104)
Recheck Cond: (text_as_array @> '{cat}'::text[])
Filter: (text_as_array[3] = 'cat'::text)
-> Bitmap Index Scan on my_table_text_as_array_index (cost=0.00..48.20 rows=3761 width=0)
Index Cond: (text_as_array @> '{cat}'::text[])
A "directly in the database" solution seems preferable from an efficiency standpoint as most types of abstraction layer or loading/processing elsewhere are likely to incur additional overheads.
If the source text can be massaged such that only spaces separate the words (as mentioned in the comments - perhaps by pre-processing to suitably replace all non-alphabetical characters?), the following (My)SQL-only solution will work:
#############################################################
SET @searchWord = 'cat', # Search word: Must be lower case  #
    @n = 1,              # n where nth match is to be found #
#############################################################
    @matches = 0;        # Initialise local variable
SELECT s.*
FROM sentence s
WHERE id =
  (SELECT subq.id
   FROM
     (SELECT *,
             @matches AS prevMatches,
             (@matches := @matches + LENGTH(`text`) - LENGTH(
                  REPLACE(LOWER(`text`),
                          CONCAT(' ', @searchWord, ' '),
                          CONCAT(@searchWord, ' ')))
              # Use the search word's length rather than a hard-coded 4, so words other than 'cat' work
              + CASE WHEN LEFT(LOWER(`text`), CHAR_LENGTH(@searchWord) + 1) = CONCAT(@searchWord, ' ') THEN 1 ELSE 0 END
              + CASE WHEN RIGHT(LOWER(`text`), CHAR_LENGTH(@searchWord) + 1) = CONCAT(' ', @searchWord) THEN 1 ELSE 0 END)
             AS matches
      FROM sentence) AS subq
   WHERE subq.prevMatches < @n AND @n <= subq.matches);
Explanation
All instances of ' cat ' on each line are replaced with a string that is one character shorter. The difference in length is then calculated to find out the number of instances. Finally, the single possibilities of 'cat ' and ' cat' appearing at the start and end of the line respectively are catered for. Having done this, a cumulative total of matches is maintained for each line. This is bundled up into a subquery from which the nth match can be picked by finding the row where the cumulative number of matches is at least n but the previous total is less than n.
Further potential improvements
The above could of course be slightly simplified by making the source text lower case (which seems sensible if it is being pre-processed) and removing all calls to LOWER().
The subquery calculates a cumulative total number of matches. If it is likely that the same search terms will be reused, it might conceivably be possible to cache these results in another table and use triggers to maintain this whenever records are updated, inserted or deleted - however this would greatly add to the complexity and data storage requirements.
I would search for all rows with "cat" but limit the rows by n. This should give you a reasonably sized subset of your data that is guaranteed to contain the row you are looking for. The SQL would look similar to this:
select id, text
from your_table
where text ~* 'cat'
order by id
limit 3 --nth time cat appears
I would then implement your solution as a pl/pgsql function to get the id that contains the nth occurrence of your word:
CREATE OR REPLACE FUNCTION your_schema.row_with_nth_occurrence(arg_search_word character varying, arg_occurrence integer)
RETURNS integer AS
$BODY$
DECLARE
    v_count integer;
    v_count_total integer := 0;
    v_record your_table%ROWTYPE;
BEGIN
    -- Only the first n rows that mention the word are scanned.
    FOR v_record IN
        SELECT *
        FROM your_table
        WHERE text ~* arg_search_word
        ORDER BY id
        LIMIT arg_occurrence
    LOOP
        -- Count whole-word matches in this row.
        SELECT count(*) INTO v_count
        FROM regexp_split_to_table(v_record.text, E'\\s+') AS a(word)
        WHERE a.word = arg_search_word;
        v_count_total := v_count_total + v_count;
        IF v_count_total >= arg_occurrence THEN
            RETURN v_record.id;
        END IF;
    END LOOP;
    RAISE EXCEPTION '% does not occur % times in the database.', arg_search_word, arg_occurrence;
END;
$BODY$
LANGUAGE plpgsql;
All this function does is loop through the subset of rows potentially containing the desired word, count the number of times it occurs in each row, and return the id when it finds the row with the nth occurrence of the word.
Solution one:
Keep the rows in memory but centralized. All clients loop over the same list. Probably fast enough and reasonably memory friendly.
Solution two:
Use the streaming ResultSet technique from the JDBC driver; e.g.
Statement select = connection.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
select.setFetchSize(Integer.MIN_VALUE);
ResultSet result = select.executeQuery(sql);
As explained in http://dev.mysql.com/doc/connector-j/en/connector-j-reference-implementation-notes.html, scroll down to Resultset. This should be memory friendly.
Now simply count on the result rows until satisfied and close the result.
I am having trouble understanding your statement:
eg: "Select the 3rd match for the word cat, that is the number 2
entry." Results: the 2nd row from the result where the 3rd word is cat
I will assume that you mean you want to search for entries where the 3rd word of the text is "cat", and from those entries you want the second entry.
Since you mentioned that your problem lies with the concurrent access and the speed, you will need to somehow build an index which is optimized for your query. You could use anything for this, database, lucene, etc. My suggestion would be to build the index in-memory. Just think of it as a warm up for your service before it could start serving request.
In your case, you would want some kind of map with the word and word position as the key. This key will then map to a list of row numbers which is matching the key. So in the end, you will just have to do a lookup twice, first is to get a list of row numbers where it matches, then the row number which you want. So the performance you will need in the end will be a simple map lookup + array list lookup (constant).
I've provided a very simple example below. It's untested code, but it should roughly give you an idea.
You could also save the index into a file after it's been built if you want. Once you have built the index and loaded it into memory, lookups will be very fast.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// text entry from the DB
class TextEntry {
    private final int rowNb;
    private final String text;

    TextEntry(int rowNb, String text) {
        this.rowNb = rowNb;
        this.text = text;
    }

    int getRowNb() { return rowNb; }
    String getText() { return text; }
}

// your index class
class Index {
    private final Map<Key, List<Integer>> indexMap = new HashMap<>();

    Map<Key, List<Integer>> getIndexMap() { return indexMap; }

    // a word at a given 1-based position; equals/hashCode make it usable as a map key
    static final class Key {
        private final int wordPosition;
        private final String word;

        Key(int wordPosition, String word) {
            this.wordPosition = wordPosition;
            this.word = word;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key other = (Key) o;
            return wordPosition == other.wordPosition && word.equals(other.word);
        }

        @Override
        public int hashCode() { return Objects.hash(wordPosition, word); }
    }
}

// your searcher class
public class Searcher {
    private static Index index = null;
    private static List<TextEntry> allTextEntries = null;

    // build the index once, before serving requests
    public static synchronized void init(List<TextEntry> entries) {
        allTextEntries = entries;
        index = new Index();
        for (int listPos = 0; listPos < entries.size(); listPos++) {
            // split the words, and build the index based on the word position and the word
            String[] words = entries.get(listPos).getText().trim().split("\\s+");
            for (int i = 0; i < words.length; i++) {
                Index.Key key = new Index.Key(i + 1, words[i]);
                // store the list position so search() can fetch the entry directly
                index.getIndexMap().computeIfAbsent(key, k -> new ArrayList<>()).add(listPos);
            }
        }
    }

    public static TextEntry search(String word, int wordPosition, int resultNb) {
        // call init(...) first; returns null when there are fewer than resultNb matches
        List<Integer> matches = index.getIndexMap().get(new Index.Key(wordPosition, word));
        if (matches == null || resultNb < 1 || resultNb > matches.size()) {
            return null;
        }
        return allTextEntries.get(matches.get(resultNb - 1));
    }
}
In mysql
We need a function that counts the number of occurrences of a given substring in a field.
Create the function (it counts the occurrences of a substring in the given column):
CREATE FUNCTION substrCount(
x varchar(255), delim varchar(12)) returns int
return (length(x)-length(REPLACE(x,delim, '')))/length(delim);
This function should be able to find how many times 'cat' was present in text.
Please bear with me for syntax of code as it may not be fully functional(correct as required).
I will break this problem into 3 parts and we can do with the help of stored procedure.
Select all the rows containing the string 'cat' (or any other input). This should select a maximum of n rows (n = number of occurrences), so we will use LIMIT in our query.
With a cursor, iterate over the matched rows in a loop.
Add each row's occurrence count to a count variable and exit once the required number of matches is found. (The match should be found within 1 to n iterations.)
create stored procedure.
Assuming proper index ,this should be fast.
DELIMITER $$
CREATE PROCEDURE find_match(INOUT string_to_match varchar(100),
    INOUT occurence_count INTEGER, OUT match_field varchar(100))
BEGIN
    DECLARE v_finished INTEGER DEFAULT 0;
    DECLARE v_count INTEGER DEFAULT 0;
    DECLARE v_text varchar(100) DEFAULT "";
    -- declare cursor and select by the order you want.
    DECLARE matcher_cursor CURSOR FOR
        SELECT textField FROM myTable
        WHERE textField LIKE CONCAT('%', string_to_match, '%')
        ORDER BY id
        LIMIT 0, occurence_count;
    -- declare NOT FOUND handler
    DECLARE CONTINUE HANDLER
        FOR NOT FOUND SET v_finished = 1;
    OPEN matcher_cursor;
    get_matching_occurence: LOOP
        FETCH matcher_cursor INTO v_text;
        -- stop when the cursor has run out of rows
        IF v_finished = 1 THEN
            LEAVE get_matching_occurence;
        END IF;
        -- use substring count function
        SET v_count = v_count + substrCount(v_text, string_to_match);
        -- if the count has reached the requested occurrence, the matching row is found.
        IF (v_count >= occurence_count) THEN
            SET match_field = v_text;
            LEAVE get_matching_occurence;
        END IF;
    END LOOP get_matching_occurence;
    CLOSE matcher_cursor;
END$$
DELIMITER ;
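The control flow of the procedure (walk the rows in order, accumulate the per-row count, stop at the first row that reaches the target) can be sketched in Python as follows. The function name `find_match` and the in-memory list of rows are assumptions for illustration only; `str.count` stands in for substrCount.

```python
from typing import Iterable, Optional

def find_match(rows: Iterable[str], needle: str, occurrence: int) -> Optional[str]:
    """Return the row in which the occurrence-th match of needle falls, else None."""
    seen = 0
    for text in rows:                # rows are assumed to arrive in id order
        seen += text.count(needle)   # plays the role of substrCount in the procedure
        if seen >= occurrence:
            return text              # stop as soon as the target is reached
    return None
```

For example, with the rows `["a cat", "no match", "cat cat", "cat"]`, the third occurrence of "cat" falls in "cat cat".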
I tested this on a table with 1.2 million rows and it returns data in less than a second. I am using a split function (a modified form of Jeff Moden's splitter function) from here: http://sqlperformance.com/2012/08/t-sql-queries/splitting-strings-follow-up
-- Step 1. Create table
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Sentence](
[id] [int] IDENTITY(1,1) NOT NULL,
[Text][varchar](250) NULL,
CONSTRAINT [PK_Sentence] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
Step 2. Create a split function
CREATE FUNCTION [dbo].[SplitSentence]
(
    @CSVString NVARCHAR(MAX),
    @Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1),
E2(N) AS (SELECT 1 FROM E1 a, E1 b),
cteTally(N) AS (SELECT 0
                UNION ALL
                SELECT TOP (DATALENGTH(ISNULL(@CSVString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E2),
cteStart(N1) AS (SELECT t.N+1
                 FROM cteTally t
                 WHERE (SUBSTRING(@CSVString,t.N,1) = @Delimiter OR t.N = 0))
SELECT Word = SUBSTRING(@CSVString, s.N1, ISNULL(NULLIF(CHARINDEX(@Delimiter,@CSVString,s.N1),0)-s.N1,50))
FROM cteStart s;
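The idea behind the split function (position 0 plus every delimiter position marks the start of a word, and each word runs to the next delimiter) can be sketched in Python. This is only an illustration under the assumption of a single-character delimiter; the name `split_sentence` is made up for the example, and the 50-character cap on the last word is not reproduced.

```python
def split_sentence(text: str, delimiter: str = " ") -> list:
    """Split text the way the tally-table function does, one word per start position."""
    # 1-based positions N where the character equals the delimiter, plus N = 0.
    starts = [0] + [n for n, ch in enumerate(text, start=1) if ch == delimiter]
    words = []
    for n in starts:
        begin = n                          # 0-based index of the character at 1-based position N + 1
        end = text.find(delimiter, begin)  # next delimiter, like CHARINDEX
        words.append(text[begin:] if end == -1 else text[begin:end])
    return words
```

Consecutive delimiters produce empty strings, just as they produce empty words in a position-based split.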
Step 3. Create a sql script to return the required data
DECLARE @n int = 3
DECLARE @Word varchar(50) = 'cat'
;WITH myData AS
    (SELECT TOP (@n)
        id
        ,[Text]
        ,sp.word
        ,ROW_NUMBER() OVER (ORDER BY Id) RowNo
    FROM
        Sentence
        CROSS APPLY (SELECT * FROM SplitSentence(Sentence.[Text],' ')) sp
    WHERE Word = @Word)
SELECT
    *
FROM
    myData
WHERE
    RowNo = @n
Assumptions:
1. The sentence has a max length of 250 characters. If needed this can be modified in the create table statement.
2. The sentence will not have more than 100 words. If more than 100 words are needed, the split function will have to be modified.
3. Any word in the sentence has a max length of 50 characters.
SQL Fiddle demo here: http://sqlfiddle.com/#!3/0a1d0/1
Notes:
I am aware that the original requirement is for MySQL/pgsql,
but I have limited knowledge of these and therefore my solution has been created/tested in MSSQL.
I would simply count the number of words in each line and then do a cumulative sum. I'm not sure what the most efficient way is to count words, but a difference of lengths might win:
select t.*
from (select t.*, sum(cnt) over (order by id) as cumecnt
      from (select t.*,
                   (length(' ' || str || ' ') - length(replace(' ' || str || ' ', ' cat ', ''))) / length(' cat ') as cnt
            from t
           ) t
      where cnt > 0
     ) t
where cumecnt >= 3 and cumecnt - cnt < 3;
You would simply replace "3" and "cat" with the appropriate strings.
This method requires scanning the strings a handful of times in each row (once for each of the lengths and once for the replace). My guess is that this is faster than various array operations, regular expressions, or text. If you have a more complicated definition of what a word is, then you probably need to use a regular expression replace instead.
Doing the work in the database is usually a big win. However, if you are looking for the 6th match out of one million rows, it might be faster to read back the values from the subquery and do the accumulation in the application. I don't think there is a way to short-circuit the database calculation to stop just on the "6th" row.
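To make the cumulative-sum idea concrete, here is a small Python sketch of the same calculation done in the application. The function names and the list-of-strings input are assumptions for illustration. It keeps the row whose running total first reaches n, using a strict comparison on the total before the row so that exactly one row qualifies.

```python
from itertools import accumulate
from typing import List

def word_count(text: str, word: str) -> int:
    """Count space-delimited occurrences of word using the length-difference trick."""
    padded, target = f" {text} ", f" {word} "
    return (len(padded) - len(padded.replace(target, ""))) // len(target)

def nth_match_rows(rows: List[str], word: str, n: int) -> List[str]:
    """Return the row(s) in which the n-th occurrence of word falls."""
    counts = [word_count(r, word) for r in rows]
    running = list(accumulate(counts))  # cumulative sum, like sum(cnt) over (order by id)
    return [r for r, c, total in zip(rows, counts, running)
            if c > 0 and total >= n and total - c < n]
```

One caveat of the padding trick, visible in this sketch as well: a word repeated back to back ("cat cat") is counted once, because the two matches share the space between them.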

SQL query--String Permutations

I am trying to create a query using a db on OpenOffice where a string is entered in the query, and all permutations of the string are searched in the database and the matches are displayed. My database has fields for a word and its definition, so if I am looking for GOOD I will get its definition as well as the definition for DOG.
You'll need a third column as well. In this column you'll have the word - but with the letters sorted in alphabetical order. For example, you'll have the word APPLE and in the next column the word AELPP.
You would sort the word you're looking for and run some SQL code like
WHERE sorted_words = 'my_sorted_word'
For the word APPLE, you would get something like this:
sorted unsorted
AELPP APPLE
AELPP PEPLA
AELPP APPEL
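The sorted-letters column can be sketched in Python as a dictionary keyed by each word's sorted letters. The names `signature` and `build_anagram_index` are invented for this example; in the database the same role is played by the extra column.

```python
from collections import defaultdict

def signature(word: str) -> str:
    """Return the word's letters in alphabetical order, e.g. APPLE -> AELPP."""
    return "".join(sorted(word.upper()))

def build_anagram_index(words):
    """Group words by signature so that all anagrams share one key."""
    index = defaultdict(list)
    for w in words:
        index[signature(w)].append(w)
    return index
```

Looking up a search word is then `index[signature(search_word)]`, the counterpart of the WHERE clause on the sorted column.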
Now, you also wanted (correct me if I'm wrong) all the words that can be made with **any combination** of the letters, meaning APPLE also returns words like LEAP and PEA.
To do this, you would have to use some programming language. You would have to write a function that performs the above recursively. For example, for the word AELPP you have
ELPP
ALPP
AEPP
and so forth (each time removing one letter in every possible combination, then two letters in every possible combination, etc.).
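One way to generate those shorter letter combinations is sketched below in Python, using `itertools.combinations` rather than explicit recursion. The function name `sub_signatures` and the `min_len` parameter are assumptions for this example. Each result is already in sorted order, so it can be compared directly against a sorted-letters column.

```python
from itertools import combinations

def sub_signatures(word: str, min_len: int = 2) -> set:
    """Return every sorted-letter combination of the word with at least min_len letters."""
    letters = sorted(word.upper())
    found = set()
    for size in range(min_len, len(letters) + 1):
        # combinations() keeps input order, so each combo is itself sorted.
        for combo in combinations(letters, size):
            found.add("".join(combo))  # the set removes duplicates caused by repeated letters
    return found
```

For APPLE this yields keys such as AELP (LEAP) and AEP (PEA) alongside the full AELPP.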
Basically, you can't easily do permutations in a single SQL statement. You can easily do them in another language though, for example here's how to do it in C#: http://msdn.microsoft.com/en-us/magazine/cc163513.aspx
Ok, corrected version that I think handles all situations. This will work in MS SQL Server, so you may need to adjust it for your RDBMS as far as using the local table and the REPLICATE function. It assumes a passed parameter called @search_string. Also, since it's using VARCHAR instead of NVARCHAR, if you're using extended characters be sure to change that.
One last point that I'm just thinking of now... it will allow duplication of letters. For example, "GOOD" would find "DODO" even though there is only one "D" in "GOOD". It will NOT find words of greater length than your original word though. In other words, while it would find "DODO", it wouldn't find "DODODO". Maybe this will give you a starting point to work from though depending on your exact requirements.
DECLARE @search_table TABLE (search_string VARCHAR(4000))
DECLARE @i INT
SET @i = 1
WHILE (@i <= LEN(@search_string))
BEGIN
    INSERT INTO @search_table (search_string)
    VALUES (REPLICATE('[' + @search_string + ']', @i))
    SET @i = @i + 1
END
SELECT
    word,
    definition
FROM
    My_Words W
INNER JOIN @search_table ST ON W.word LIKE ST.search_string
The original query before my edit, just to have it here:
SELECT
word,
definition
FROM
My_Words
WHERE
word LIKE REPLICATE('[' + @search_string + ']', LEN(@search_string))
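The behaviour of the repeated character-class pattern, including the duplicate-letter caveat, can be checked with a regular expression in Python. This is a sketch only: the function name is invented, and the case-insensitive flag is an assumption, since case handling of LIKE in SQL Server depends on the collation.

```python
import re

def like_pattern_matches(search: str, candidate: str) -> bool:
    """True if candidate uses only letters of search and is no longer than search."""
    if not search:
        return False
    # One character class per position, for 1 up to len(search) positions.
    pattern = "[" + re.escape(search) + "]{1," + str(len(search)) + "}"
    return re.fullmatch(pattern, candidate, flags=re.IGNORECASE) is not None
```

With the search string GOOD, both DOG and DODO match, while DODODO is rejected for being too long.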
maybe this can help:
Suppose you have an auxiliary Numbers table with integer numbers.
DECLARE @s VARCHAR(5);
SET @s = 'ABCDE';
WITH Subsets AS (
    SELECT CAST(SUBSTRING(@s, Number, 1) AS VARCHAR(5)) AS Token,
           CAST('.'+CAST(Number AS CHAR(1))+'.' AS VARCHAR(11)) AS Permutation,
           CAST(1 AS INT) AS Iteration
    FROM dbo.Numbers WHERE Number BETWEEN 1 AND 5
    UNION ALL
    SELECT CAST(Token+SUBSTRING(@s, Number, 1) AS VARCHAR(5)) AS Token,
           CAST(Permutation+CAST(Number AS CHAR(1))+'.' AS VARCHAR(11)) AS Permutation,
           s.Iteration + 1 AS Iteration
    FROM Subsets s JOIN dbo.Numbers n ON s.Permutation NOT LIKE '%.'+CAST(Number AS CHAR(1))+'.%'
         AND s.Iteration < 5 AND Number BETWEEN 1 AND 5
    --AND s.Iteration = (SELECT MAX(Iteration) FROM Subsets)
)
SELECT * FROM Subsets
WHERE Iteration = 5
ORDER BY Permutation
Token Permutation Iteration
----- ----------- -----------
ABCDE .1.2.3.4.5. 5
ABCED .1.2.3.5.4. 5
ABDCE .1.2.4.3.5. 5
(snip)
EDBCA .5.4.2.3.1. 5
EDCAB .5.4.3.1.2. 5
EDCBA .5.4.3.2.1. 5
(120 row(s) affected)
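For comparison, the same set of permutations can be produced outside the database. A minimal Python sketch, with a function name chosen for this example:

```python
from itertools import permutations

def all_permutations(s: str) -> list:
    """Return the distinct permutations of s in sorted order."""
    # The set collapses duplicates that arise when s has repeated letters.
    return sorted({"".join(p) for p in permutations(s)})
```

For 'ABCDE' this gives the same 120 tokens as the query above. For a word with repeated letters, such as GOOD, the set leaves 12 distinct strings.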