How can I add a column to an R data set containing the number of matches of a regex?

To be able to execute regular expressions in SQL Server without using the CLR, I'm looking into using the R language. I have a set of texts, and I want to count the number of matches of a regex on each row.
Note: the R part is in the @script parameter in the code block below. That is where the issue lies.
For the sake of example, I have a table InputData:
CREATE TABLE InputData (ID INT IDENTITY(1, 1), Text NVARCHAR(MAX))
This table contains 3 rows:
INSERT INTO InputData(Text)
VALUES ('This is the first row'),
('This is the second row'),
('This is the third row')
I'd like to run a query that returns the number of times the letter i is found in each row, using a regex (since the actual search I need is a bit more complicated). After some googling, I came up with the following:
EXEC SP_EXECUTE_EXTERNAL_SCRIPT
@language = N'R',
@script = N'pattern = ".*i.*"
outData <- inData;
outData$MatchCount <- length(gregexpr(pattern, outData$Text));'
, @input_data_1 = N'select ID, Text, 0 MatchCount from InputData'
, @input_data_1_name = N'inData'
, @output_data_1_name = N'outData'
with result sets ((ID INT, Text NVARCHAR(MAX), MatchCount INT));
Now the above doesn't work, because gregexpr runs the expression over the entire set and length() then always returns 3, since the set contains 3 rows. I would like it to return the number of matches per row, and put the result in the correct row.
So the result should be:
1, 'This is the first row', 3
2, 'This is the second row', 2
3, 'This is the third row', 3
Ideally, I wouldn't even need to return the text in the resulting set. Only the ID and the match count are enough.
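The per-row behaviour being asked for can be sketched outside SQL Server as follows. This is only a Python illustration of the logic, not the R fix itself; the sample rows mirror InputData and the helper name count_matches is made up for the example. Note that the pattern here is just i: a pattern such as .*i.* matches the whole line once, so it cannot count individual occurrences.

```python
import re

# Sample rows mirroring the InputData table (ID, Text).
rows = [
    (1, "This is the first row"),
    (2, "This is the second row"),
    (3, "This is the third row"),
]


def count_matches(pattern, text):
    """Return the number of non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text))


# One count per row, paired with the row ID, instead of one count for the whole set.
match_counts = [(row_id, count_matches(r"i", text)) for row_id, text in rows]
print(match_counts)
```

Because only the ID and the count are kept, the text column does not have to be returned at all.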

Related

How can we read a varchar column, take the integer part out and add new column incrementing that integer part using script

I need to write a script for the scenario below:
We have a column X whose row values are X01, X02, X03, X04, ...
The problem I am stuck on is that I need to add another row to this table based on the value of the last row, which is X04. I have worked out the logic I need, given below:
I need to read value X04
Take the integer part 04
Increment by 1 => 05
Save column value as X05
I am able to get past the 1st step, which is not very hard. The problem I am facing is the next steps. I have researched and tried quite a lot of commands, but none worked.
Any help is highly appreciated. Thanks.
You seem to be describing:
select concat(left(max(x), 1),
right(concat('00', try_convert(int, right(max(x), 2)) + 1), 2))
from t;
This is doing the following:
Taking the left most character.
Converting the two right characters to a number and adding one.
Converting that back to a zero-padded string.
Here is a db<>fiddle.
Now: That you want to increment a string value seems broken. You should just use an identity column or sequence to assign a number. You can format the value as a string when you query the table -- or use a computed column to store that.
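For illustration only, the same read / split / increment / re-pad steps can be sketched in Python. The function name next_code and the assumption that a code is letters followed by digits are mine, not part of the question.

```python
import re


def next_code(last_code):
    """Split a code such as 'X04' into prefix and number, add one, and re-pad."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", last_code)
    if match is None:
        raise ValueError(f"unexpected code format: {last_code!r}")
    prefix, digits = match.groups()
    # Keep the original width so that 'X04' becomes 'X05' rather than 'X5'.
    return f"{prefix}{int(digits) + 1:0{len(digits)}d}"


print(next_code("X04"))
```

When the number outgrows its width (for example after X99), this sketch simply lets the code get longer, which is one more reason to store a real number and format it on output.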
Try the script below:
CREATE TABLE #table (x varchar(20))
INSERT INTO #table VALUES('X01'),('X02'),('X03'),('X04')
DECLARE @maxno NVARCHAR(20)
DECLARE @maxstring NVARCHAR(20)
DECLARE @finalno NVARCHAR(20)
DECLARE @loopminno INT =1 -- you can change based on the requirement
DECLARE @loopmaxno INT =10 -- how many number we want to increment
WHILE @loopminno < @loopmaxno
BEGIN
select @maxno = MAX(CAST(SUBSTRING(x, PATINDEX('%[0-9]%', x), 100) as INT))
, @maxstring = MAX(SUBSTRING(x, 1, PATINDEX('%[0-9]%',x)-1))
from #table
where PATINDEX('%[1-9]%',x)>0
SELECT @finalno = @maxstring + CASE WHEN CAST(@maxno AS INT)<9 THEN '0' ELSE '' END + CAST(@maxno+1 AS VARCHAR(20))
INSERT INTO #table
SELECT @finalno
SET @loopminno = @loopminno+1
END

SQL Server function to parse docket numbers

On our database in a Cases table, the Docket field stores the docket number(s) for each case. Each docket number takes the form such as
AB19-1-000
CD19-1043-000
EF18-24-001
These are comprised of "root" dockets and "sub" dockets. The roots here are
AB19-1
CD19-1043
EF18-24
The root dockets are comprised of the two alpha character docket prefix, which indicates the case type. Followed by a two numerical character code indicating the fiscal year the case was filed. Then a hyphen. Then a "sequence" number (with no fixed # of digits, although none have ever had more than 4 digits) indicating the sequence that the case was filed (relative to other cases of the same type that were filed in that fiscal year).
The final three digits (after the final hyphen) of the overall docket number represent the "subdocket," and are to allow for multiple filings within the docket. The initial filing is always docketed with a 000 subdocket. Subsequent filings within that root docket are subdocketed as 001, 002, 003, etc.
To make things more complicated, there can be multiple docket numbers listed within the Docket field, and (in the horrid legacy database design we have) multiple docket numbers are always separated with exactly one space. (I know. Don't get me started.)
I want to create a tool that will help us generate the docket number for new cases easier/quicker than our current approach (which uses a VBA loop and is very slow). Specifically, I want to write a function for use on SQL Server that will spit out the next sequence number for a new filing of a given type and fiscal year.
The steps would be roughly:
Accept as an argument a two character case type and a two digit fiscal year.
Identify all docket numbers entered where the alpha prefix is the specified case type and the next two characters are the specified fiscal year (ignoring subdocket, which we don't care about here).
Identify the highest existing sequence number for that case type and docket year.
Add one to the identified number, and return that number.
I'm a decent programmer, but my SQL is pretty limited to fairly normal queries. I have very limited experience creating functions. So any help (even a general outline of what this kind of function might look like and how to create it) is much appreciated.
Here's some code to generate some simple test data.
CREATE TABLE MyCases (
CaseId INTEGER PRIMARY KEY,
Docket VARCHAR(50) not null
);
INSERT INTO MyCases
VALUES
(1, 'XL14-204-001 TS14-1-000 PI14-1-000'),
(2, 'PI14-2-000'),
(3, 'PI14-3-000'),
(4, 'PI14-4-001 XL14-22-000'),
(5, 'PI14-6-000'),
(6, 'PI14-7-000 XL14-382-000'),
(7, 'PI15-1-000 XL15-23-000'),
(8, 'PI15-2-000 TS15-23-000'),
(9, 'PI15-3-000'),
(10, 'PI15-4-000 TS15-2-000')
;
And with the desired function, if the user entered MyFunction('PI',14), the result would be 8, because the highest existing sequential number for all PI14 docket numbers is PI14-7, and adding one to 7 gives 8. Similarly, the result for MyFunction('PI',15) would be 5.
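As a language-neutral reference for the expected results, the steps above can be sketched in Python. This is an illustration under my own assumptions (the function name next_sequence, and returning 1 when no docket matches), not the SQL Server function itself.

```python
import re

# Test data mirroring the MyCases table (CaseId, Docket).
cases = [
    (1, "XL14-204-001 TS14-1-000 PI14-1-000"),
    (2, "PI14-2-000"),
    (3, "PI14-3-000"),
    (4, "PI14-4-001 XL14-22-000"),
    (5, "PI14-6-000"),
    (6, "PI14-7-000 XL14-382-000"),
    (7, "PI15-1-000 XL15-23-000"),
    (8, "PI15-2-000 TS15-23-000"),
    (9, "PI15-3-000"),
    (10, "PI15-4-000 TS15-2-000"),
]

# prefix (2 letters), fiscal year (2 digits), sequence, subdocket (3 digits)
DOCKET = re.compile(r"([A-Z]{2})(\d{2})-(\d+)-(\d{3})")


def next_sequence(case_type, fiscal_year, cases):
    """Return the highest existing sequence number for the type and year, plus one."""
    highest = 0
    for _case_id, docket_field in cases:
        # Multiple docket numbers are separated by exactly one space.
        for docket in docket_field.split(" "):
            m = DOCKET.fullmatch(docket)
            if m and m.group(1) == case_type and int(m.group(2)) == fiscal_year:
                highest = max(highest, int(m.group(3)))
    return highest + 1


print(next_sequence("PI", 14, cases), next_sequence("PI", 15, cases))
```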
Something like this:
create or alter function MyFunction(@RootDocket char(2), @FiscalYear smallint)
returns int as
begin
declare @NextSequenceNumber int;
with q as
(
select
c.CaseId,
cd.Docket,
RootDocket = left(cd.Docket,2),
FiscalYear = cast(right(left(cd.docket,4),2) as tinyint),
SequenceNumber = cast(substring(cd.Docket,6, charindex('-',cd.Docket,7)-6) as smallint),
SubDocket = cast(right(cd.Docket,3) as smallint)
from dbo.mycases c
cross apply (select value Docket from string_split(Docket,' ') ) cd(Docket)
)
select @NextSequenceNumber = max(SequenceNumber) + 1
from q
where RootDocket = @RootDocket
and FiscalYear = @FiscalYear
return @NextSequenceNumber;
end
go
select dbo.MyFunction('PI',15);
select dbo.MyFunction('PI',14);
outputs
-----------
5
-----------
8
Here is an approach that uses a loop to sequentially check each possible sequence number for the given case type and year. As soon as an available sequence number is found, it is returned.
This might be a little more optimized than what you requested, in the sense that it will fill the gaps in the sequence, if there are any. This might or might not be what you need.
Code:
CREATE FUNCTION GetNextAvailableSequence (
@case_type VARCHAR(2),
@fiscal_year INT
)
RETURNS INT
AS
BEGIN
DECLARE @seq INT;
DECLARE @done INT;
SET @done = 0;
SET @seq = 0;
WHILE @done = 0
BEGIN
SET @seq = @seq + 1;
IF (
SELECT COUNT(*)
FROM MyCases
WHERE ' ' + docket LIKE
'% '
+ @case_type
+ CAST(@fiscal_year as VARCHAR(2))
+ '-'
+ CAST(@seq as VARCHAR(4)) -- sequence numbers can have up to 4 digits
+ '-%' -- trailing hyphen so that sequence 1 does not also match 10, 11, ...
) = 0
BEGIN
SET @done = 1;
END;
END;
RETURN @seq;
END;
Demo on DB Fiddle:
SELECT dbo.GetNextAvailableSequence('PI', 14);
| (No column name) |
| ---------------: |
| 5 |
This fills the first gap for PI-14.
select dbo.GetNextAvailableSequence('PI', 15);
| (No column name) |
| ---------------: |
| 5 |
There are no gaps for PI-15, this is the first available sequence.
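The difference between the gap-filling approach and "highest plus one" can be seen in a small Python sketch. The lists below are the PI14 and PI15 sequence numbers from the test data; the function name first_free_sequence is hypothetical.

```python
def first_free_sequence(used_sequences):
    """Return the smallest positive integer not present in used_sequences."""
    used = set(used_sequences)
    seq = 1
    while seq in used:
        seq += 1
    return seq


pi14 = [1, 2, 3, 4, 6, 7]  # sequence 5 is missing
pi15 = [1, 2, 3, 4]        # no gaps

# Gap-filling result next to the "highest + 1" result for each series.
print(first_free_sequence(pi14), max(pi14) + 1)
print(first_free_sequence(pi15), max(pi15) + 1)
```

For PI14 the two strategies disagree (5 versus 8), so which one is right depends on whether gaps in the docket sequence may be reused.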

Needing to parse out data

I am trying to parse out certain data from a string and I am having issues.
Here is the string:
1=BETA.1.0^2=175^3=812^4=R^5=N^9=1^12=1^13=00032^14=REP NOT FOUND ON REP TABLE, CANNOT INSERT TO REPRGR.^10=107~117~265~1114~3143~3505~3506~3513~5717^11=SA16~1~WY~WY~A~S~20100210~001~SE62^-omitted due to existing Rep Not Found
I need to return this "REP NOT FOUND ON REP TABLE, CANNOT INSERT TO REPRGR."
Here is my query: SELECT CONVERT(VARCHAR(5000), CHARINDEX('14=', Column)) FROM Table
If you're parsing, can we assume that you don't know what might come after the '^14=', but you need to capture whatever does? So searching for a particular string won't work because anything could come after '^14='. The best approach is to identify the longest reliable specific string that gives you a "foothold" to find the data you're looking for. What you don't want to do is accidentally capture the wrong data if the '^14=' appears more than once in your string. It looks like the '^' is your delimiter, since I don't see one at the start of the string. So you were actually on the right track, you just need to use SUBSTRING as a commenter mentioned. You also need to identify a marker for the end of the error message, which looks like it might be the next occurring '^', correct? Check several samples to be sure of this, and make sure the end marker doesn't at any point exist before your start marker or you'll get an error.
SELECT CAST((SUBSTRING(Column,CHARINDEX('14=',Column,0),CHARINDEX('^',Column,CHARINDEX('14=',Column,0) + 1) - CHARINDEX('14=',Column,0))) AS VARCHAR(5000)) FROM Table
You may need to increment or decrement the start position and end position by doing a +1 or -1 to fully capture your error message. But this should dynamically grab any length error message provided you are positive of your starting and ending markers.
I also have here a table-valued parsing function, where you would pass it the string and the '^' and it will return a table of data with not only the 14=, but everything.
CREATE function [dbo].[fn_SplitStringByDelimeter]
(
@list nvarchar(8000)
,@splitOn char(1)
)
returns @rtnTable table
(
id int identity(1,1)
,value nvarchar(100)
)
as
begin
declare @index int
declare @string nvarchar(4000)
select @index = 1
if len(@list) < 1 or @list is null return
--
while @index!= 0
begin
set @index = charindex(@splitOn,@list)
if @index!=0
set @string = left(@list,@index - 1)
else
set @string = @list
if(len(@string)>0)
insert into @rtnTable(value) values(@string)
--
set @list = right(@list,len(@list) - @index)
if len(#list) = 0 break
end
return
end
It sounds like you're trying to get the value of argument 14. This should do it:
select substring(
someData
, charindex('^14=',someData) + 4
, charindex('^',someData, charindex('^14=',someData) + 4) - charindex('^14=',someData) - 4
) errorMessage
from myData
where charindex('^14=',someData) > 0
and charindex('^',someData, charindex('^14=',someData) + 4) > 0
Try it here: http://sqlfiddle.com/#!18/22f23/2
This gets a substring of the given input.
The substring starts at the first character after the string ^14=; i.e. we get the index of ^14= in the string, then add 4 to it to skip over the matched characters themselves.
The substring ends at the first ^ character after the one in ^14=. We get the index of that character, then subtract the starting position from it to get the length of the desired output.
Caveats: If there is no parameter (^) after ^14= this will not work. Equally if there is no ^14= (even if the string starts 14=) this will not work. From the information available that's OK; but if this is a concern please say and we can provide something to handle that more complex scenario.
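One way to handle both caveats is to split on the delimiter first and then look up the key, rather than searching for ^14= directly. The following Python sketch only illustrates that idea; the function name field_value and the shortened sample record are assumptions made for the example.

```python
def field_value(record, key, delimiter="^"):
    """Return the value of the 'key=value' field in a delimited record, or None."""
    for field in record.split(delimiter):
        name, sep, value = field.partition("=")
        # sep is empty for fields without '=', such as the trailing free text.
        if sep and name == key:
            return value
    return None


# Shortened version of the sample string from the question.
record = (
    "1=BETA.1.0^2=175^13=00032"
    "^14=REP NOT FOUND ON REP TABLE, CANNOT INSERT TO REPRGR."
    "^10=107~117^-omitted due to existing Rep Not Found"
)
print(field_value(record, "14"))
```

Because every field is examined on its own, it does not matter whether field 14 is the first field, the last field, or missing.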
Code to create table & populate demo data
create table myData (someData nvarchar(256))
insert myData (someData)
values ('1=BETA.1.0^2=175^3=812^4=R^5=N^9=1^12=1^13=00032^14=REP NOT FOUND ON REP TABLE, CANNOT INSERT TO REPRGR.^10=107~117~265~1114~3143~3505~3506~3513~5717^11=SA16~1~WY~WY~A~S~20100210~001~SE62^-omitted due to existing Rep Not Found')
, ('1xx^14=something else.^10=xx')
You could try to use a Case When statement with wildcards to find the value that you want.
Example:
SELECT
CASE
WHEN x LIKE '%REP Not Found%'
THEN 'REP NOT FOUND ON REP TABLE, CANNOT INSERT TO REPRGR'
ELSE
''
END AS x
FROM
#T1
You could use this query (assuming MySQL database):
-- item is the column that contains the string
select SUBSTR(item, LOCATE('REP',item), LOCATE('REPRGR.',item) + LENGTH('REPRGR.') - LOCATE('REP', item)) info_msg from Table;
Illustration:
create table parsetest (item varchar(5000));
insert into parsetest values('1=BETA.1.0^2=175^3=812^4=R^5=N^9=1^12=1^13=00032^14=REP NOT FOUND ON REP TABLE, CANNOT INSERT TO REPRGR.^10=107~117~265~1114~3143~3505~3506~3513~5717^11=SA16~1~WY~WY~A~S~20100210~001~SE62^-omitted due to existing Rep Not Found');
select * from parsetest;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| item |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1=BETA.1.0^2=175^3=812^4=R^5=N^9=1^12=1^13=00032^14=REP NOT FOUND ON REP TABLE, CANNOT INSERT TO REPRGR.^10=107~117~265~1114~3143~3505~3506~3513~5717^11=SA16~1~WY~WY~A~S~20100210~001~SE62^-omitted due to existing Rep Not Found |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
select SUBSTR(item, LOCATE('REP',item), LOCATE('REPRGR.',item) + LENGTH('REPRGR.') - LOCATE('REP', item)) info_msg from parsetest;
+------------------------------------------------------+
| info_msg |
+------------------------------------------------------+
| REP NOT FOUND ON REP TABLE, CANNOT INSERT TO REPRGR. |
+------------------------------------------------------+

Extract number between two substrings in sql

I had a previous question and it got me started but now I'm needing help completing this. Previous question = How to search a string and return only numeric value?
Basically I have a table with one of the columns containing a very long XML string. There's a number I want to extract near the end. A sample of the number would be this...
<SendDocument DocumentID="1234567">true</SendDocument>
So I want to use substrings to find the first part (<SendDocument DocumentID=") and the last part (">true</SendDocument>) and strip both, so that I'm only left with the number.
What I've tried so far is this:
SELECT SUBSTRING(xml_column, CHARINDEX('>true</SendDocument>', xml_column) - CHARINDEX('<SendDocument',xml_column) +10087,9)
The above gives me results, but it's far from being correct. My concern is: what if the number grows from 7 digits to 8 digits, or 9 or 10?
In the previous question I was helped with this:
SELECT SUBSTRING(cip_msg, CHARINDEX('<SendDocument',cip_msg)+26,7)
and that's how I got started, but I wanted to alter it so that I could subtract the last portion and just be left with the numbers.
So again, first part of the string that contains the digits, find the two substrings around the digits and remove them and retrieve just the digits no matter the length.
Thank you all
You should be able to set up your SUBSTRING() so that both the starting position and the length are variable. That way the length of the number itself doesn't matter.
From the sound of it, the starting position you want is right after <SendDocument DocumentID=" and the end is right before ">true</SendDocument>.
The starting position would be:
CHARINDEX('<SendDocument DocumentID="', xml_column) + 26
(adding 26 because CHARINDEX gives you the position at the beginning of the string you are searching for, and the search string including the opening quote is 26 characters long)
Length would be:
CHARINDEX('">true</SendDocument>', xml_column) - (CHARINDEX('<SendDocument DocumentID="', xml_column) + 26)
(position of the ending text minus the starting position)
So, how about something along the lines of:
SELECT SUBSTRING(xml_column, CHARINDEX('<SendDocument DocumentID="', xml_column) + 26, CHARINDEX('">true</SendDocument>', xml_column) - (CHARINDEX('<SendDocument DocumentID="', xml_column) + 26))
Have you tried working directly with the xml type? Like below:
DECLARE @TempXmlTable TABLE
(XmlElement xml )
INSERT INTO @TempXmlTable
select Convert(xml,'<SendDocument DocumentID="1234567">true</SendDocument>')
SELECT
element.value('./@DocumentID', 'varchar(50)') as DocumentID
FROM
@TempXmlTable CROSS APPLY
XmlElement.nodes('//.') AS DocumentID(element)
WHERE element.value('./@DocumentID', 'varchar(50)') is not null
If you just want to work with this as a string you can do the following:
DECLARE @SearchString varchar(max) = '<SendDocument DocumentID="1234567">true</SendDocument>'
DECLARE @Start int = (select CHARINDEX('DocumentID="',@SearchString)) + 12 -- 12 Character search pattern
DECLARE @End int = (select CHARINDEX('">', @SearchString)) - @Start --Find End Characters and subtract start position
SELECT SUBSTRING(@SearchString,@Start,@End)
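The start-marker / end-marker idea is independent of the number's length, which a short Python sketch makes easy to check. The helper name between is hypothetical; unlike the T-SQL above, it looks for the end marker only after the start marker, which is an extra safeguard I added.

```python
def between(text, start_marker, end_marker):
    """Return the text between start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)  # skip over the marker itself
    end = text.find(end_marker, start)  # search only after the start marker
    if end == -1:
        return None
    return text[start:end]


xml = '<SendDocument DocumentID="1234567">true</SendDocument>'
print(between(xml, 'DocumentID="', '">'))
```

The same call keeps working when the ID grows to 8, 9 or 10 digits, because no length is hard-coded.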
Below is the extended version of parsing an XML document string. In the example below, I create a copy of a PLSQL function called INSTR, the MS SQL database does not have this by default. The function will allow me to search strings at a designated starting position. In addition, I'm parsing a sample XML string into a variable temp table into lines and only looking at lines that match my search criteria. This is because there may be many elements with the words DocumentID and I'll want to find all of them. See below:
IF EXISTS (select * from sys.objects where name = 'INSTR' and type = 'FN')
DROP FUNCTION [dbo].[INSTR]
GO
CREATE FUNCTION [dbo].[INSTR] (@String VARCHAR(8000), @SearchStr VARCHAR(255), @Start INT, @Occurrence INT)
RETURNS INT
AS
BEGIN
DECLARE @Found INT = @Occurrence,
@Position INT = @Start;
WHILE 1=1
BEGIN
-- Find the next occurrence
SET @Position = CHARINDEX(@SearchStr, @String, @Position);
-- Nothing found
IF @Position IS NULL OR @Position = 0
RETURN @Position;
-- The required occurrence found
IF @Found = 1
BREAK;
-- Prepare to find another one occurrence
SET @Found = @Found - 1;
SET @Position = @Position + 1;
END
RETURN @Position;
END
GO
--Assuming well formated xml
DECLARE @XmlStringDocument varchar(max) = '<SomeTag Attrib1="5">
<SendDocument DocumentID="1234567">true</SendDocument>
<SendDocument DocumentID="1234568">true</SendDocument>
</SomeTag>'
--Split Lines on this element tag
DECLARE @SplitOn nvarchar(25) = '</SendDocument>'
--Let's hold all lines in Temp variable table
DECLARE @XmlStringLines TABLE
(
Value nvarchar(100)
)
While (Charindex(@SplitOn,@XmlStringDocument)>0)
Begin
Insert Into @XmlStringLines (value)
Select
Value = ltrim(rtrim(Substring(@XmlStringDocument,1,Charindex(@SplitOn,@XmlStringDocument)-1)))
Set @XmlStringDocument = Substring(@XmlStringDocument,Charindex(@SplitOn,@XmlStringDocument)+len(@SplitOn),len(@XmlStringDocument))
End
Insert Into @XmlStringLines (Value)
Select Value = ltrim(rtrim(@XmlStringDocument))
--Now we have a table with multiple lines find all Document IDs
SELECT
StartPosition = CHARINDEX('DocumentID="',Value) + 12,
--Now lets use the INSTR function to find the first instance of '">' after our search string
EndPosition = dbo.INSTR(Value,'">',( CHARINDEX('DocumentID="',Value)) + 12,1),
--Now that we know the start and end lets use substring
Value = SUBSTRING(value,(
-- Start Position
CHARINDEX('DocumentID="',Value)) + 12,
--End Position Minus Start Position
dbo.INSTR(Value,'">',( CHARINDEX('DocumentID="',Value)) + 12,1) - (CHARINDEX('DocumentID="',Value) + 12))
FROM
@XmlStringLines
WHERE Value like '%DocumentID%' --Only care about lines with a document id

Implement find/find next algorithm

I have a database table (mysql/pgsql) with the following format:
id|text
1| the cat is black
2| a cat is a cat
3| a dog
I need to select the line that contains nth match of a word:
eg: "Select the 3rd match for the word cat, that is the number 2 entry."
Results: the 2nd row from the result where the 3rd word is cat
The only solution I could find is to search for all entries that have the text cat, load them in memory and find the match by counting them. But this is not efficient for a big number of matches(>1 million).
How would you handle this in an efficient way? Is there anything you can do directly in the database? Maybe using other technologies like lucene?
Update: having 1 million strings in memory might not be a big issue but the expectation of the application is to have between 1k-50k active users that might do this operation concurrently.
Consider creating another table with the below structure
Table : index_table
columns :
index_id , word, occurrence, id(foreign key to your original table)
Do one time indexing process as below:
Iterate over each entry in your original table, split the text into words, and look up each word in the new table. If the word is not present, insert a new entry with occurrence set to 1. If it exists, insert a new entry with occurrence = highest existing occurrence + 1.
Once you have done this one off indexing your selects become pretty simple.
For example for cat with 3rd match will be
SELECT *
FROM original_table o, index_table idx
WHERE idx.word = 'cat'
AND idx.occurrence = 3
AND o.id = idx.id
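The one-off indexing pass can be sketched in a few lines of Python, with a dictionary standing in for index_table. The names build_index and rows are made up for this illustration, and rows are assumed to be processed in id order.

```python
from collections import defaultdict

# Sample rows from the question (id, text).
rows = [(1, "the cat is black"), (2, "a cat is a cat"), (3, "a dog")]


def build_index(rows):
    """Map (word, occurrence number) to the id of the row containing it."""
    counts = defaultdict(int)  # running occurrence counter per word
    index = {}
    for row_id, text in sorted(rows):
        for word in text.split():
            counts[word] += 1
            index[(word, counts[word])] = row_id
    return index


index = build_index(rows)
print(index[("cat", 3)])
```

Looking up the nth match is then a single key lookup, which is the point of paying the indexing cost up front.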
You do not need Lucene for this job. Furthermore, if you have a large number of positive matches, the effort to pump all required data out of your DB will well exceed the computational cost.
Here's a simple solution:
Index: we require two properties:
efficiently access the words for each id
efficiently access all IDs in ascending order
as follows:
create index i_words on example_data (id, string_to_array(txt, ' '));
Query: find the ID associated with the nth match with the following query:
select id
from (
select id, unnest(string_to_array(txt, ' ')) as word
from example_data
) words
where word = :w -- :w = 'cat'
offset :n - 1 -- :n = 3
limit 1;
Executes in 2ms on 1 million rows.
Here's the full PostgreSQL setup if you'd rather try for yourself than take my word for it:
drop table if exists example_data;
create table example_data (
id integer primary key,
txt text not null
);
insert into example_data
(select generate_series(1, 1000000, 3) as id, 'the cat is black' as txt
union all
select generate_series(2, 1000000, 3), 'a cat is a cat'
union all
select generate_series(3, 1000000, 3), 'a dog'
order by id);
commit;
drop index if exists i_words;
create index i_words on example_data (id, string_to_array(txt, ' '));
select id
from (
select id, unnest(string_to_array(txt, ' ')) as word
from example_data
) words
where word = 'cat'
offset 3 - 1
limit 1;
select
id, word
from (
select id, unnest(string_to_array(txt, ' ')) as word
from example_data
) words
where word = 'cat'
offset 3 - 1
limit 1;
Note that I'm still unsure what exactly "Select the 3rd match for the word cat, that is the number 2 entry" is supposed to mean.
Possible meanings:
the 2nd row from the result where the 3rd word is cat
the 3rd row where the 2nd word is "cat"
from all rows where "cat" appears at least 3 times, take the second row
from all rows where "cat" appears at least 2 times, take the third row
If it's 1 or 2, I think this could be done in an acceptable speed by using a trigram index to reduce the possible number of matching lines. A trigram index (supplied by the module pg_trgm) will allow Postgres to make use of an index when doing a e.g. like '%cat%'.
Assuming that only a small number of rows will satisfy that condition, the resulting lines can then be split into arrays and checked for the nth word.
Something like this:
with matching_rows as (
select id, line,
row_number() over (order by id) as rn
from the_table
where line like '%cat%' -- this hopefully reduces the result to only very few rows
)
select *
from matching_rows
where rn = 3 --<< "the third match for the word cat"
and (string_to_array(line, ' '))[2] = 'cat' -- "the second word is "cat"
Note that a trigram index does have disadvantages as well. Maintaining such an index is much more expensive (=slower) than maintaining a regular b-tree index. So if your table is heavily updated, this might not be a good solution - but you need to test that for yourself.
Also, if the condition like '%cat%' doesn't really reduce the number of rows substantially, this is probably not going to perform well either.
Some more information on trigram indexes:
http://www.depesz.com/index.php/2011/02/19/waiting-for-9-1-faster-likeilike/
http://www.postgresonline.com/journal/archives/212-PostgreSQL-9.1-Trigrams-teaching-LIKE-and-ILIKE-new-tricks.html
Another option would be to filter out the "relevant" rows using Postgres' full text search instead of a plain LIKE condition.
Whatever algorithm you come up with for the database as-it-is is likely to be slow for this kind of data. You do need an efficient text-based search, lucene-based solutions like solr or elasticsearch will do nicely here. It would be the best option here, though finding a match against a 3rd token in a string is not something I know how to build without further googling.
You can also write a job in your db which will let you build a reverse map, string->id. like this:
rownum, id, text
1 1 the cat is black
2 3 nice cat
to
key, rownum, id
1_the 1 1
2_cat 1 1
3_is 1 1
4_black 1 1
1_nice 2 3
2_cat 2 3
If you can order by ID you don't need rownum. You should also call the column something else instead of rownum, I leave it like that for clarity
Now you can search for the 1st ID where the word cat is the 2nd word like this:
SELECT ID WHERE ROWNUM=1 AND key='2_cat'
Provided you created an (id, key) or (key, id) index, your searches should be pretty quick.
If you can fit all that data into memory, then you can use a simple Map<MyKey, Long> to do your search. MyKey would be, more or less Pair<Long,String> with proper equals and hashCode (and/or Comparable, if you use TreeMap) implementations.
(Thanks to Daniel Grosskopf for pointing out that I initially misinterpreted the question.)
This query will give you what you want with just SQL. It gets a running total of the counts of the occurrences of a word (e.g. 'cat') within the text, and then it returns the first row that hits the threshold that you want (e.g. 3).
SELECT id, text
FROM (SELECT entries.*,
SUM((SELECT COUNT(*)
FROM regexp_split_to_table(text, E'\\s+') AS words(word)
WHERE word = 'cat')) OVER (ORDER BY id) AS running_count
FROM entries) AS entries_with_running_count
WHERE running_count >= 3
LIMIT 1
See it in action in SQL Fiddle
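The running-total idea can be checked by hand with a small Python sketch. It is only an illustration of the logic, not a replacement for the query; the function name row_of_nth_match is hypothetical and words are assumed to be separated by whitespace.

```python
from itertools import accumulate

# Sample rows from the question (id, text).
rows = [(1, "the cat is black"), (2, "a cat is a cat"), (3, "a dog")]


def row_of_nth_match(rows, word, n):
    """Return the id of the row in which the running count of word first reaches n."""
    ordered = sorted(rows)
    per_row = [text.split().count(word) for _row_id, text in ordered]
    for (row_id, _text), running in zip(ordered, accumulate(per_row)):
        if running >= n:
            return row_id
    return None  # fewer than n occurrences overall


print(row_of_nth_match(rows, "cat", 3))
```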
How would you handle this in an efficient way? Is there any trick you
can do directly in the database?
You are not specifying what other restrictions/requirements you may have or what is your definition of
a big number of matches.
As a general answer I would say that doing string manipulation in the database is not an efficient approach.
It is too slow and imposes much work on your DB which is usually a shared resource.
IMO you should do this programmatically.
A way to do this could be to keep metadata in another table i.e. indexes of rows that contain the text cat and where in the sentence.
You can query this meta-table in order to figure the rows to query from your main table.
This extra table is more efficient than searching your defined table because queries with LIKE on suffixes can not use an index and you will end up with serial scans which would result in very slow performance
Solution for the Postgres database:
Add a new column to your table:
alter table my_table add text_as_array text[];
This column will contain the sentence spliced into words:
"the cat is black" -> ["the","cat","is","black"]
Populate this column with values from current records:
update my_table set text_as_array = string_to_array(text,' ');
(and don't forget to set its value to string_to_array(text,' ') when inserting new records)
Create a gin index on it:
create index my_table_text_as_array_index on my_table using gin (text_as_array);
analyze my_table;
Then all you need is run a fast query as simple as this:
select *
from my_table
where text_as_array @> ARRAY['cat']
and text_as_array[3] = 'cat' -- third word in sentence
order by id
limit 1
offset 2 -- second occurrence
It took 11ms to search over ~2,400,000 records in tests I did in my machine.
Explain:
Limit (cost=11252.08..11252.08 rows=1 width=104)
-> Sort (cost=11252.07..11252.12 rows=19 width=104)
Sort Key: id
-> Bitmap Heap Scan on my_table (cost=48.21..11251.83 rows=19 width=104)
Recheck Cond: (text_as_array @> '{cat}'::text[])
Filter: (text_as_array[3] = 'cat'::text)
-> Bitmap Index Scan on my_table_text_as_array_index (cost=0.00..48.20 rows=3761 width=0)
Index Cond: (text_as_array @> '{cat}'::text[])
A "directly in the database" solution seems preferable from an efficiency standpoint as most types of abstraction layer or loading/processing elsewhere are likely to incur additional overheads.
If the source text can be massaged such that only spaces separate the words (as mentioned in the comments - perhaps by pre-processing to suitably replace all non-alphabetical characters?), the following (My)SQL-only solution will work:
#############################################################
SET @searchWord = 'cat', # Search word: Must be lower case #
@n = 1, # n where nth match is to be found #
#############################################################
@matches = 0; # Initialise local variable
SELECT s.*
FROM sentence s
WHERE id =
(SELECT subq.id
FROM
(SELECT *,
@matches AS prevMatches,
(@matches := @matches + LENGTH(`text`) - LENGTH(
REPLACE(LOWER(`text`),
CONCAT(' ', @searchWord, ' '),
CONCAT(@searchWord, ' ')))
+ CASE WHEN LEFT(LOWER(`text`), 4) = CONCAT(@searchWord, ' ') THEN 1 ELSE 0 END
+ CASE WHEN RIGHT(LOWER(`text`), 4) = CONCAT(' ', @searchWord) THEN 1 ELSE 0 END)
AS matches
FROM sentence) AS subq
WHERE subq.prevMatches < @n AND @n <= subq.matches);
Explanation
All instances of ' cat ' on each line are replaced with a word that is one letter shorter. The difference in length is then calculated to find out the number of instances. Finally, the single possibilities of 'cat ' and ' cat' appearing at the start and end of the line are respectively catered for. Having done this, a cumulative total of matches is maintained for each line. This is bundled up into a subquery from which the nth match can be picked by finding the row where the cumulative number of matches is at least n but the previous total is less than n.
Further potential improvements
The above could of course be slightly simplified by making the source text lower case (which seems sensible if it is being pre-processed) and removing all calls to LOWER().
The subquery calculates a cumulative total number of matches. If it is likely that the same search terms will be reused, it might conceivably be possible to cache these results in another table and use triggers to maintain this whenever records are updated, inserted or deleted - however this would greatly add to the complexity and data storage requirements.
I would search for all rows with "cat" but limit the rows by n. This should give you a reasonably sized subset of your data that is guaranteed to contain the row you are looking for. The SQL would look similar to this:
select id, text
from your_table
where text ~* 'cat'
order by id
limit 3 --nth time cat appears
I would then implement your solution as a pl/pgsql function to get the id that contains the nth occurrence of your word:
CREATE OR REPLACE FUNCTION your_schema.row_with_nth_occurrence(character varying, integer)
RETURNS integer AS
$BODY$
DECLARE
    arg_search_word ALIAS FOR $1;
    arg_occurrence ALIAS FOR $2;
    v_count integer;
    v_count_total integer := 0;
    v_record your_table%ROWTYPE;
BEGIN
    -- Loop over the first n rows that contain the search word.
    FOR v_record IN
        SELECT *
        FROM your_table
        WHERE text ~* arg_search_word
        ORDER BY id
        LIMIT arg_occurrence
    LOOP
        -- Count the whole-word matches in this row (case-insensitive, like ~*).
        SELECT count(*) INTO v_count
        FROM regexp_split_to_table(v_record.text, E'\\s+') a
        WHERE lower(a) = lower(arg_search_word);
        v_count_total := v_count_total + v_count;
        IF v_count_total >= arg_occurrence THEN
            RETURN v_record.id;
        END IF;
    END LOOP;
    RAISE EXCEPTION '% does not occur % times in the database.', arg_search_word, arg_occurrence;
END;
$BODY$ LANGUAGE plpgsql;
All this function does is loop through the subset of rows potentially containing the desired word, count the number of times it occurs in each row, and return the id when it finds the row with the nth occurrence of the word.
Solution one:
Keep the rows in memory but centralized. All clients loop over the same list. Probably fast enough and reasonably memory friendly.
Solution two:
Use the streaming ResultSet technique from the JDBC driver; e.g.
Statement select = connection.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
select.setFetchSize(Integer.MIN_VALUE);
ResultSet result = select.executeQuery(sql);
As explained in http://dev.mysql.com/doc/connector-j/en/connector-j-reference-implementation-notes.html (scroll down to ResultSet), this should be memory friendly.
Now simply count over the result rows until satisfied, then close the result.
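The counting step could look roughly like the sketch below. It assumes the query returns the id in column 1 and the text in column 2, and it counts plain substring occurrences (so "concat" also counts for "cat"); `StreamingMatchFinder` and its method names are made up for illustration.

```java
import java.sql.ResultSet;
import java.sql.SQLException;

class StreamingMatchFinder {
    // Counts non-overlapping occurrences of word in text.
    static int countOccurrences(String text, String word) {
        if (word.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int at = text.indexOf(word); at >= 0; at = text.indexOf(word, at + word.length())) {
            count++;
        }
        return count;
    }

    // Reads rows (assumed layout: id in column 1, text in column 2) until the
    // nth occurrence has been seen. Returns that row's id, or -1 if the
    // result ends before n occurrences were found.
    static int findRowOfNthOccurrence(ResultSet result, String word, int n) throws SQLException {
        int seen = 0;
        while (result.next()) {
            String text = result.getString(2);
            if (text == null) {
                continue;
            }
            seen += countOccurrences(text, word);
            if (seen >= n) {
                return result.getInt(1);
            }
        }
        return -1;
    }
}
```

The caller would pass in the streaming `ResultSet` created above and close it as soon as the method returns, so the remaining rows are never processed by the application.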
I am having trouble understanding your statement:
eg: "Select the 3rd match for the word cat, that is the number 2
entry." Results: the 2nd row from the result where the 3rd word is cat
I will assume that you mean you want to search for entries where the 3rd word of the text is "cat", and from those entries you want the second entry.
Since you mentioned that your problem lies with concurrent access and speed, you will need to build an index which is optimized for your query. You could use anything for this: a database, Lucene, etc. My suggestion would be to build the index in memory. Just think of it as a warm-up for your service before it can start serving requests.
In your case, you would want some kind of map with the word and word position as the key. This key then maps to a list of the row numbers that match the key. So in the end you just have to do two lookups: first get the list of row numbers that match, then get the row number you want from that list. The performance you need in the end is a simple map lookup plus an array list lookup (constant time).
I've provided a very simple example below. It's untested code, but it should roughly give you an idea.
You could also save the index to a file after it has been built if you want. Once the index has been built and loaded into memory, lookups will be very fast.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// text entry from the DB
public class TextEntry {
    private int rowNb;
    private String text;
    // getters & setters
}

// your index class
public class Index {
    private Map<Key, List<Integer>> indexMap = new HashMap<>();
    // getters and setters

    public static class Key {
        private final int wordPosition;
        private final String word;

        public Key(int wordPosition, String word) {
            this.wordPosition = wordPosition;
            this.word = word;
        }

        // equals and hashCode are needed so that equal keys find the same map entry
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key other = (Key) o;
            return wordPosition == other.wordPosition && word.equals(other.word);
        }

        @Override
        public int hashCode() {
            return Objects.hash(wordPosition, word);
        }
    }
}

// your searcher class
public class Searcher {
    private static Index index = null;
    private static List<TextEntry> allTextEntries = null;

    public static void init() {
        // init all data with some synchronization check
        // synchronization check whether index has been built
        Map<Index.Key, List<Integer>> indexMap = index.getIndexMap();
        allTextEntries.forEach(entry -> {
            // split the words, and build the index based on the word position and the word
            String[] words = entry.getText().split(" ");
            for (int i = 0; i < words.length; i++) {
                Index.Key key = new Index.Key(i + 1, words[i]);
                int rowNumber = entry.getRowNb();
                // if the key is already there, just add the row number if it's not the last one
                if (indexMap.containsKey(key)) {
                    List<Integer> entryMatch = indexMap.get(key);
                    if (entryMatch.get(entryMatch.size() - 1) != rowNumber) {
                        entryMatch.add(rowNumber);
                    }
                } else {
                    // if key is not there, add a new one
                    List<Integer> entryMatch = new ArrayList<>();
                    entryMatch.add(rowNumber);
                    indexMap.put(key, entryMatch);
                }
            }
        });
    }

    public static TextEntry search(String word, int wordPosition, int resultNb) {
        // call init if not yet called, do some check
        int rowNb = index.getIndexMap().get(new Index.Key(wordPosition, word)).get(resultNb - 1);
        // assumes rowNb is the zero-based position of the entry in allTextEntries
        return allTextEntries.get(rowNb);
    }
}
In MySQL
We need a function that counts the number of occurrences of a given substring in a field.
Create the function (it counts the occurrences of a substring in the given column):
CREATE FUNCTION substrCount(
x varchar(255), delim varchar(12)) returns int
return (length(x)-length(REPLACE(x,delim, '')))/length(delim);
This function should be able to find how many times 'cat' is present in the text.
Please bear with me on the syntax of the code, as it may not be fully functional.
I will break this problem into 3 parts, which we can implement with the help of a stored procedure:
1. Select all the rows containing the string 'cat' (or any other input). This should select at most n rows (n = number of occurrences), so we will use LIMIT in our query.
2. With a cursor, iterate over the matched rows in a loop.
3. Add the matches per row to a count variable and exit once the required number of matches is found (this should happen within 1 to n iterations).
Create the stored procedure.
Assuming a proper index, this should be fast.
DELIMITER $$
CREATE PROCEDURE find_match(IN string_to_match varchar(100),
    IN occurence_count INTEGER, OUT match_field varchar(100))
BEGIN
    DECLARE v_count INTEGER DEFAULT 0;
    DECLARE v_finished INTEGER DEFAULT 0;
    DECLARE v_text varchar(100) DEFAULT "";
    -- declare cursor and select by the order you want.
    DECLARE matcher_cursor CURSOR FOR
        SELECT textField FROM myTable
        WHERE textField LIKE CONCAT('%', string_to_match, '%')
        ORDER BY id
        LIMIT 0, occurence_count;
    -- declare NOT FOUND handler
    DECLARE CONTINUE HANDLER
        FOR NOT FOUND SET v_finished = 1;
    OPEN matcher_cursor;
    get_matching_occurence: LOOP
        FETCH matcher_cursor INTO v_text;
        -- stop when the cursor has no more rows
        IF v_finished = 1 THEN
            LEAVE get_matching_occurence;
        END IF;
        -- use substring count function
        SET v_count = v_count + substrCount(v_text, string_to_match);
        -- if the count has reached the requested occurrence, the matching row is found.
        IF (v_count >= occurence_count) THEN
            SET match_field = v_text;
            LEAVE get_matching_occurence;
        END IF;
    END LOOP get_matching_occurence;
    CLOSE matcher_cursor;
END$$
DELIMITER ;
I tested this on a table with 1.2 million rows and it returns data in less than a second. I am using a split function (a modified form of Jeff Moden's splitter function) from here: http://sqlperformance.com/2012/08/t-sql-queries/splitting-strings-follow-up
Step 1. Create table
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Sentence](
[id] [int] IDENTITY(1,1) NOT NULL,
[Text][varchar](250) NULL,
CONSTRAINT [PK_Sentence] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
Step 2. Create a split function
CREATE FUNCTION [dbo].[SplitSentence]
(
@CSVString NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1),
E2(N) AS (SELECT 1 FROM E1 a, E1 b),
cteTally(N) AS (SELECT 0
UNION ALL
SELECT TOP (DATALENGTH(ISNULL(@CSVString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E2),
cteStart(N1) AS (SELECT t.N+1
FROM cteTally t
WHERE (SUBSTRING(@CSVString,t.N,1) = @Delimiter OR t.N = 0))
SELECT Word = SUBSTRING(@CSVString, s.N1, ISNULL(NULLIF(CHARINDEX(@Delimiter,@CSVString,s.N1),0)-s.N1,50))
FROM cteStart s;
Step 3. Create a sql script to return the required data
DECLARE @n int = 3
DECLARE @Word varchar(50) = 'cat'
;WITH myData AS
(SELECT TOP (@n)
id
,[Text]
,sp.word
,ROW_NUMBER() OVER (ORDER BY Id) RowNo
FROM
Sentence
CROSS APPLY (SELECT * FROM SplitSentence(Sentence.[Text],' ')) sp
WHERE Word = @Word
ORDER BY id)
SELECT
*
FROM
myData
WHERE
RowNo = @n
Assumptions:
1. The sentence has a max length of 250 characters. If needed this can be modified in the create table statement.
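2. The tally in the split function yields only 100 numbers (E2 is 10 x 10 rows), so only words that start within roughly the first 100 characters of the sentence are found. For longer sentences, the split function will have to be modified.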
3. Any word in the sentence has a max length of 50 characters.
SQL Fiddle demo here: http://sqlfiddle.com/#!3/0a1d0/1
Notes:
I am aware that the original requirement is for MySQL/pgsql,
but I have limited knowledge of these and therefore my solution has been created/tested in MSSQL.
I would simply count the number of matching words in each line and then do a cumulative sum. I'm not sure what the most efficient way to count words is, but a difference of lengths might win:
select t.*
from (select t.*, sum(cnt) over (order by id) as cumecnt
      from (select t.*,
                   (length(' ' || str || ' ') - length(replace(' ' || str || ' ', ' cat ', ''))) / length(' cat ') as cnt
            from t
           ) t
      where cnt > 0
     ) t
where cumecnt >= 3 and cumecnt - cnt < 3;
You would simply replace "3" and "cat" with the appropriate values.
This method requires scanning the strings a handful of times in each row (once for each of the lengths and once for the replace). My guess is that this is faster than various array operations, regular expressions, or text. If you have more complicated definitions of what a word is, then you probably need to use a regular expression replace.
Doing the work in the database is usually a big win. However, if you are looking for the 6th match out of one million rows, it might be faster to read back the values from the subquery and do the accumulation in the application. I don't think there is a way to short-circuit the database calculation to stop just on the "6th" row.