lucene, or sql fulltext?

I want to create a search website to search documents (all kinds of formats, including PDF), images, videos, and audio. I also want to be able to filter my search results by criteria such as author name, date, etc.
I'm doing this in .NET, so what's the easiest way to get up and running? SQL full-text search seems tempting because I'm familiar with SQL and, since I want to filter my search results, it will be easy to store the filter fields for each item.

If your primary concern is getting it up and running quickly and easily, then SQL fulltext search is definitely the way to go.
Lucene.NET has its advantages, but it is by no means a walk in the park to set up correctly. The documentation is a bit lacking and there is only a very limited number of examples on the web.

Stored procedure for snippets:
CREATE PROCEDURE SimpleCommentar
@SearchTerm nvarchar(100),
@Style nvarchar(200)
AS
BEGIN
CREATE TABLE #match_docs
(
doc_id bigint NOT NULL PRIMA
);
INSERT INTO #match_docs
(
doc_id
)
SELECT DISTINCT
Commentary_ID
FROM Commentary
WHERE FREETEXT
(
Commentary,
@SearchTerm,
LANGUAGE N'English'
);
DECLARE @db_id int = DB_ID(),
@table_id int = OBJECT_ID(N'
@column_id int =
(
SELECT
column_id
FROM sys.columns
WHERE object_id = OBJECT_I
AND name = N'Commentary'
);
SELECT
s.Commentary_ID,
t.Title,
MIN
(
N'...' + SUBSTRING
(
REPLACE
(
c.Commentary,
s.Display_Term,
N'<span style="' + @Style + '">' + s.Display_Term + '</span>'
),
s.Pos - 512,
s.Length + 1024
) + N'...'
) AS Snippet
FROM
(
SELECT DISTINCT
c.Commentary_ID,
w.Display_Term,
PATINDEX
(
N'%[^a-z]' + w.Display_Term + N'[^a-z]%',
c.Commentary
) AS Pos,
LEN(w.Display_Term) AS Length
FROM sys.dm_fts_index_keywords_by_document
(
@db_id,
@table_id
) w
INNER JOIN dbo.Commentary c
ON w.document_id = c.Commentary_ID
WHERE w.column_id = @column_id
AND EXISTS
(
SELECT 1
FROM #match_docs m
WHERE m.doc_id = w.document_id
)
AND EXISTS
(
SELECT 1
FROM sys.dm_fts_parser
(
N'FORMSOF(FREETEXT, "' + @SearchTerm + N'")',
1033,
0,
1
) p
WHERE p.Display_Term = w.Display_Term
)
) s
INNER JOIN dbo.Commentary c
ON s.Commentary_ID = c.Commentary_ID
INNER JOIN dbo.Book_Commentary bc
ON c.Commentary_ID = bc.Commentary_ID
INNER JOIN dbo.Book_Title bt
ON bc.Book_ID = bt.Book_ID
INNER JOIN dbo.Title t
ON bt.Title_ID = t.Title_ID
WHERE t.Is_Primary_Title = 1
GROUP BY
s.Commentary_ID,
t.Title;
DROP TABLE #match_docs;
END;

Related

Recursive query from the same table

I have different product serial numbers in one table, ProdHistory, which contains, as the table name suggests, production history.
For example, I have product serial SER0001, which uses parts that have their own serial numbers.
We also produce these parts, so the same table ProdHistory is used to track their subparts.
The same goes for the subparts if they have sub-subparts.
Sample Table
IF OBJECT_ID('tempDB.dbo.#SAMPLETable') IS NOT NULL DROP TABLE #SAMPLETable
CREATE TABLE #SAMPLETable
(
ITEMSEQ INT IDENTITY(1,1),
SERIAL NVARCHAR(10) COLLATE SQL_Latin1_General_CP850_CI_AS,
ITEMID NVARCHAR(10) COLLATE SQL_Latin1_General_CP850_CI_AS,
PARTSERIAL NVARCHAR(10) COLLATE SQL_Latin1_General_CP850_CI_AS,
PARTID NVARCHAR(10) COLLATE SQL_Latin1_General_CP850_CI_AS,
CREATEDDATETIME DATETIME
)
INSERT INTO
#SAMPLETable (SERIAL,ITEMID,PARTSERIAL,PARTID,CREATEDDATETIME)
VALUES ('SER0001','ASY-1342','ITM0001','PRT-0808','2017-01-17'),
('SER0001','ASY-1342','ITM0002','PRT-0809','2017-01-17'),
('SER0001','ASY-1342','ITM0003','PRT-0810','2017-01-17'),
('SER0001','ASY-1342','ITM0004','PRT-0811','2017-01-17'),
('ITM0001','PRT-0808','UNT0001','PRT-2020','2017-01-16'),
('ITM0002','PRT-0809','UNT0002','PRT-2021','2017-01-16'),
('ITM0002','PRT-0809','UNT0003','PRT-2022','2017-01-16'),
('ITM0003','PRT-0810','UNT0004','PRT-2023','2017-01-16'),
('UNT0002','PRT-2021','DTA0000','PRT-1919','2017-01-15'),
('UNT0003','PRT-2022','DTA0001','PRT-1818','2017-01-15'),
('DTA0001','PRT-1818','LST0001','PRT-1717','2017-01-14')
The question is: if I'm given just the main serial number, how can I return all the part and subpart serials associated with it?
Sample Result:
MainSerial  SubSerial1  SubSerial2  SubSerial3  SubSerial4
----------------------------------------------------------
SER0001     ITM0001     UNT0001
SER0001     ITM0002     UNT0002     DTA0000
SER0001     ITM0002     UNT0003     DTA0001     LST0001
SER0001     ITM0003     UNT0004
SER0001     ITM0004
In the above, the number of parts and subparts for a serial number is not fixed.
I did not post my code because what I'm doing right now is querying them one by one.
If I had a known number of subpart levels, I could use nested joins, but I don't.
Another question: if I'm given just any one of the subparts above, is it possible to return the same result?

I think one way is to use dynamic SQL like this:
-- Variables to generate the SQL query string dynamically
declare @cols nvarchar(max) = '', @joins nvarchar(max) = '', @sql nvarchar(max) = '';
-- Use a recursive CTE to iterate over parent-child records
with cte(i, cols, joins, itemId, serial, partId, partSerial) as (
select
1, -- Level (depth) in the hierarchy tree
N's1.serial MainSerial, s1.partSerial SubSerial'+cast(1 as varchar(max)),
N'yourTable s'+cast(1 as varchar(max)),
s.itemId, s.serial, s.partId, s.partSerial
from yourTable s
-- Root parents are the items that never appear as a part
where s.itemId not in (select si.partId from yourTable si)
union all
select
i+1,
cols + N', s'+cast(i+1 as varchar(max))+N'.partSerial SubSerial'+cast(i+1 as varchar(max)),
joins + N' left join yourTable s'+cast(i+1 as varchar(max))+N' on s'+cast(i as varchar(max))+N'.partId = s'+cast(i+1 as varchar(max))+N'.itemId',
st.itemId, st.serial, st.partId, st.partSerial
from cte
join yourTable st on cte.partId = st.itemId
)
-- We only need the strings from the deepest level
select top(1)
@cols = cols, @joins = joins
from cte
order by i desc;
-- Build and execute the final query string
set @sql = N'select ' + @cols + N' from ' + @joins + N' where s1.itemId not in (select s.partId from yourTable s)';
exec(@sql);
Additional Note: Generated query is:
select s1.serial MainSerial
, s1.partSerial SubSerial1
, s2.partSerial SubSerial2
, s3.partSerial SubSerial3
, s4.partSerial SubSerial4
--, ...
from yourTable s1
left join yourTable s2 on s1.partId = s2.itemId
left join yourTable s3 on s2.partId = s3.itemId
left join yourTable s4 on s3.partId = s4.itemId
--left join ...
where s1.itemId not in (select s.partId from yourTable s);
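For readers who would rather assemble the rows outside the database, the parent-to-child walk that the generated joins perform can be sketched in Python. This is only an illustration under stated assumptions, not part of the answer above: the edge list is copied from the question's sample data, the names EDGES, build_children, paths_from and find_root are hypothetical, and find_root assumes every part has a single parent. It covers the follow-up question by climbing from any subpart to its main serial before walking down.

```python
from collections import defaultdict

# (SERIAL, PARTSERIAL) pairs copied from the question's sample data.
EDGES = [
    ("SER0001", "ITM0001"), ("SER0001", "ITM0002"),
    ("SER0001", "ITM0003"), ("SER0001", "ITM0004"),
    ("ITM0001", "UNT0001"), ("ITM0002", "UNT0002"),
    ("ITM0002", "UNT0003"), ("ITM0003", "UNT0004"),
    ("UNT0002", "DTA0000"), ("UNT0003", "DTA0001"),
    ("DTA0001", "LST0001"),
]

def build_children(edges):
    """Map each serial to the list of part serials built into it."""
    children = defaultdict(list)
    for serial, part in edges:
        children[serial].append(part)
    return children

def paths_from(serial, children):
    """Return every root-to-leaf chain of serials starting at `serial`."""
    parts = children.get(serial, [])
    if not parts:
        return [[serial]]
    return [[serial] + tail for part in parts for tail in paths_from(part, children)]

def find_root(serial, edges):
    """Climb from any subpart to the main serial (assumes one parent per part)."""
    parent = {part: owner for owner, part in edges}
    while serial in parent:
        serial = parent[serial]
    return serial

children = build_children(EDGES)
for path in paths_from(find_root("DTA0001", EDGES), children):
    print("  ".join(path))
```

Each printed line corresponds to one row of the sample result, with as many columns as that chain has levels.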

Operand type clash: varchar is incompatible with User-Defined Table Type

I created a User-Defined Table Type:
CREATE TYPE dbo.ListTableType AS TABLE(
ITEM varchar(500) NULL
)
I leverage it in a function:
CREATE FUNCTION dbo.fn_list_to_string
(
@LIST dbo.ListTableType READONLY
)
RETURNS varchar(max)
AS
BEGIN
DECLARE @RESULT varchar(max)
SET @RESULT = ''
DECLARE @NL AS CHAR(2) = CHAR(13) + CHAR(10)
SELECT @RESULT = @RESULT + ITEM + @NL FROM @LIST
SET @RESULT = SUBSTRING(@RESULT, 1, LEN(@RESULT) - 1)
RETURN @RESULT
END
Finally, I try to use this function in a simple select:
SELECT
P.PROGRAM_ID,
PROGRAM_NAME,
PROGRAM_DESC,
P.STATUS_ID,
STATUS_DESC,
P.CONTACT_SID,
I.FIRST_NAME + ' ' + I.LAST_NAME as CONTACT_NAME,
P.CLARITY_ID,
dbo.fn_list_to_string(
( SELECT CONVERT(varchar,CLARITY_ID) as ITEM
FROM dbo.MUSEUM_PROGRAM_PROJECTS as A
JOIN dbo.MUSEUM_PROJECTS as B on B.PROJECT_ID = A.PROJECT_ID
WHERE PROGRAM_ID = P.PROGRAM_ID )
) as PROJECT_CLARITY_IDS
FROM dbo.MUSEUM_PROGRAMS as P
LEFT JOIN dbo.MUSEUM_PROGRAM_STATUS_TYPES as S on S.STATUS_ID = P.STATUS_ID
LEFT JOIN dbo.v_IDVAULT_ENRICHED_CURRENT_EMPLOYEES as I on I.[SID] = P.CONTACT_SID
But I get this error:
Operand type clash: varchar is incompatible with ListTableType
Any idea why? Also if there's another [more elegant] way to achieve what I'm trying to do I'm open to suggestions as well! Thanks in advance!

Here is a simple demonstration of the FOR XML PATH technique which does all of this with a very simple subquery and no table types or extremely inefficient multi-statement table-valued functions etc.
USE tempdb;
GO
CREATE TABLE dbo.P(Program_ID INT);
CREATE TABLE dbo.M(Clarity_ID INT, Program_ID INT);
INSERT dbo.P VALUES(1),(2),(3),(4);
INSERT dbo.M VALUES(1,1),(1,2),(2,3),(3,2),(1,4),(4,1);
SELECT
P.PROGRAM_ID,
PROJECT_CLARITY_IDS = STUFF((
SELECT CHAR(13)+CHAR(10)+CONVERT(VARCHAR(12),Clarity_ID)
FROM dbo.M WHERE Program_ID = p.Program_ID
FOR XML PATH(''), TYPE).value('.[1]','nvarchar(max)'),1,2,'')
FROM dbo.P AS p;
SQLfiddle demo
The output doesn't look right in SQLfiddle or in results to grid in Management Studio, because they strip out carriage returns/line feeds for display purposes, but you can replace CHAR(13)+CHAR(10) with two commas or semi-colons or something to verify that there are two characters there.

Using the STUFF ... FOR XML PATH construct for concatenation in combination with a CTE will get the results you'd like. Something like this:
WITH CTE_PROJECT_CLARITIES AS
(
SELECT DISTINCT PROGRAM_ID
, STUFF((
SELECT CHAR(13) + CHAR(10) + CONVERT(varchar(11),CLARITY_ID)
FROM dbo.MUSEUM_PROGRAM_PROJECTS as A
JOIN dbo.MUSEUM_PROJECTS as B on B.PROJECT_ID = A.PROJECT_ID
WHERE A.PROGRAM_ID = X.PROGRAM_ID
FOR XML PATH ('')),1,2,'') AS PROJECT_CLARITY_IDS
FROM MUSEUM_PROGRAM_PROJECTS X
)
SELECT
P.PROGRAM_ID,
PROGRAM_NAME,
PROGRAM_DESC,
P.STATUS_ID,
STATUS_DESC,
P.CONTACT_SID,
I.FIRST_NAME + ' ' + I.LAST_NAME as CONTACT_NAME,
P.CLARITY_ID,
X.PROJECT_CLARITY_IDS
FROM dbo.MUSEUM_PROGRAMS as P
LEFT JOIN dbo.MUSEUM_PROGRAM_STATUS_TYPES as S on S.STATUS_ID = P.STATUS_ID
LEFT JOIN dbo.v_IDVAULT_ENRICHED_CURRENT_EMPLOYEES as I on I.[SID] = P.CONTACT_SID
LEFT JOIN CTE_PROJECT_CLARITIES X ON X.PROGRAM_ID = p.PROGRAM_ID
SQLFiddle DEMO (not sure if I got the columns right, but you'll get the idea)

replace value in varchar(max) field with join

I have a table that contains text field with placeholders. Something like this:
Row Notes
1. This is some notes ##placeholder130## this ##myPlaceholder##, #oneMore#. End.
2. Second row...just a ##test#.
(This table contains about 1-5k rows on average. Average number of placeholders in one row is 5-15).
Now, I have a lookup table that looks like this:
Name Value
placeholder130 Dog
myPlaceholder Cat
oneMore Cow
test Horse
(Lookup table will contain anywhere from 10k to 100k records)
I need to find the fastest way to join those placeholders from strings to a lookup table and replace with value. So, my result should look like this (1st row):
This is some notes Dog this Cat, Cow. End.
What I came up with was to split each row into multiple for each placeholder and then join it to lookup table and then concat records back to original row with new values, but it takes around 10-30 seconds on average.

You could try to split the string using a numbers table and rebuild it with for xml path.
select (
select coalesce(L.Value, T.Value)
from Numbers as N
cross apply (select substring(Notes.notes, N.Number, charindex('##', Notes.notes + '##', N.Number) - N.Number)) as T(Value)
left outer join Lookup as L
on L.Name = T.Value
where N.Number <= len(notes) and
substring('##' + notes, Number, 2) = '##'
order by N.Number
for xml path(''), type
).value('text()[1]', 'varchar(max)')
from Notes
SQL Fiddle
I borrowed the string splitting from this blog post by Aaron Bertrand

SQL Server is not very fast with string manipulation, so this is probably best done client-side. Have the client load the entire lookup table and replace the placeholders in the notes as they arrive.
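A minimal sketch of that client-side approach, written in Python purely for illustration (the question targets .NET, where a dictionary plus a regular expression would work the same way). It assumes every token has the form ##name## and that the lookup table has already been loaded into a dictionary; LOOKUP, TOKEN and fill_placeholders are hypothetical names.

```python
import re

# Hypothetical in-memory copy of the lookup table (Name -> Value).
LOOKUP = {
    "placeholder130": "Dog",
    "myPlaceholder": "Cat",
    "oneMore": "Cow",
    "test": "Horse",
}

# Matches ##name## and captures the name between the markers.
TOKEN = re.compile(r"##(\w+)##")

def fill_placeholders(note, lookup):
    """Replace each ##name## with lookup[name]; unknown tokens stay unchanged."""
    return TOKEN.sub(lambda m: lookup.get(m.group(1), m.group(0)), note)

note = "This is some notes ##placeholder130## this ##myPlaceholder##, ##oneMore##. End."
print(fill_placeholders(note, LOOKUP))
# prints: This is some notes Dog this Cat, Cow. End.
```

Each note is scanned once and every token costs one dictionary lookup, so the size of the lookup table barely affects the time per note.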
Having said that, it can of course be done in SQL. Here's a solution with a recursive CTE. It performs one lookup per recursion step:
; with Repl as
(
select row_number() over (order by l.name) rn
, Name
, Value
from Lookup l
)
, Recurse as
(
select Notes
, 0 as rn
from Notes
union all
select replace(Notes, '##' + l.name + '##', l.value)
, r.rn + 1
from Recurse r
join Repl l
on l.rn = r.rn + 1
)
select *
from Recurse
where rn =
(
select count(*)
from Lookup
)
option (maxrecursion 0)
Example at SQL Fiddle.
Another option is a while loop to keep replacing lookups until no more are found:
declare @notes table (notes varchar(max))
insert @notes
select Notes
from Notes
while 1=1
begin
update n
set Notes = replace(n.Notes, '##' + l.name + '##', l.value)
from @notes n
outer apply
(
select top 1 Name
, Value
from Lookup l
where n.Notes like '%##' + l.name + '##%'
) l
where l.name is not null
if @@rowcount = 0
break
end
select *
from @notes
Example at SQL Fiddle.

I second the comment that T-SQL is just not suited for this operation, but if you must do it in the database, here is an example using a function to manage the multiple replace statements.
Since you have a relatively small number of tokens in each note (5-15) and a very large number of lookup entries (10k-100k), my function first extracts the potential tokens from the input and uses that set to join to your lookup table (dbo.[Lookup] below). Looking for an occurrence of every lookup entry in each note would be far too much work.
I did a bit of perf testing using 50k tokens and 5k notes and this function runs really well, completing in <2 seconds (on my laptop). Please report back how this strategy performs for you.
Note: In your example data the token format was not consistent (##_#, ##_##, #_#). I am guessing this was simply a typo and assume all tokens take the form ##TokenName##.
--setup
if object_id('dbo.[Lookup]') is not null
drop table dbo.[Lookup];
go
if object_id('dbo.fn_ReplaceLookups') is not null
drop function dbo.fn_ReplaceLookups;
go
create table dbo.[Lookup] (LookupName varchar(100) primary key, LookupValue varchar(100));
insert into dbo.[Lookup]
select '##placeholder130##','Dog' union all
select '##myPlaceholder##','Cat' union all
select '##oneMore##','Cow' union all
select '##test##','Horse';
go
create function [dbo].[fn_ReplaceLookups](@input varchar(max))
returns varchar(max)
as
begin
declare @xml xml;
select @xml = cast(('<r><i>'+replace(@input,'##' ,'</i><i>')+'</i></r>') as xml);
--extract the potential tokens
declare @LookupsInString table (LookupName varchar(100) primary key);
insert into @LookupsInString
select distinct '##'+v+'##'
from ( select [v] = r.n.value('(./text())[1]', 'varchar(100)'),
[r] = row_number() over (order by n)
from @xml.nodes('r/i') r(n)
)d(v,r)
where r%2=0;
--tokenize the input
select @input = replace(@input, l.LookupName, l.LookupValue)
from dbo.[Lookup] l
join @LookupsInString lis on
l.LookupName = lis.LookupName;
return @input;
end
go
return
--usage
declare @Notes table ([Id] int primary key, notes varchar(100));
insert into @Notes
select 1, 'This is some notes ##placeholder130## this ##myPlaceholder##, ##oneMore##. End.' union all
select 2, 'Second row...just a ##test##.';
select *,
dbo.fn_ReplaceLookups(notes)
from @Notes;
Returns:
Tokenized
--------------------------------------------------------
This is some notes Dog this Cat, Cow. End.
Second row...just a Horse.

Try this
;WITH CTE (org, calc, [Notes], [level]) AS
(
SELECT [Notes], [Notes], CONVERT(varchar(MAX),[Notes]), 0 FROM PlaceholderTable
UNION ALL
SELECT CTE.org, CTE.[Notes],
CONVERT(varchar(MAX), REPLACE(CTE.[Notes],'##' + T.[Name] + '##', T.[Value])), CTE.[level] + 1
FROM CTE
INNER JOIN LookupTable T ON CTE.[Notes] LIKE '%##' + T.[Name] + '##%'
)
SELECT DISTINCT org, [Notes], level FROM CTE
WHERE [level] = (SELECT MAX(level) FROM CTE c WHERE CTE.org = c.org)
SQL FIDDLE DEMO
Check the below devioblog post for reference
devioblog post

To get speed, you can preprocess the note templates into a more efficient form. This will be a sequence of fragments, with each ending in a substitution. The substitution might be NULL for the last fragment.
Notes
Id  FragSeq  Text                    SubsId
1   1        'This is some notes '   1
1   2        ' this '                2
1   3        ', '                    3
1   4        '. End.'                null
2   1        'Second row...just a '  4
2   2        '.'                     null
Subs
Id  Name              Value
1   'placeholder130'  'Dog'
2   'myPlaceholder'   'Cat'
3   'oneMore'         'Cow'
4   'test'            'Horse'
Now we can do the substitutions with a simple join.
SELECT Notes.Text + COALESCE(Subs.Value, '')
FROM Notes LEFT JOIN Subs
ON SubsId = Subs.Id WHERE Notes.Id = ?
ORDER BY FragSeq
This produces a list of fragments with substitutions complete. I am not an MSSQL user, but in most dialects of SQL you can concatenate these fragments in a variable quite easily:
DECLARE @Note VARCHAR(8000)
SELECT @Note = COALESCE(@Note, '') + Notes.Text + COALESCE(Subs.Value, '')
FROM Notes LEFT JOIN Subs
ON SubsId = Subs.Id WHERE Notes.Id = ?
ORDER BY FragSeq
Pre-processing a note template into fragments will be straightforward using the string splitting techniques of other posts.
Unfortunately I'm not at a location where I can test this, but it ought to work fine.
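The pre-processing step can be sketched as follows. This is an illustrative Python version that assumes every token has the form ##name##; to_fragments and render are hypothetical names, and the substitutions are keyed by name here rather than by the numeric SubsId used in the tables above.

```python
import re

def to_fragments(note):
    """Split a note template into (FragSeq, Text, Name) rows.

    Each fragment ends in a substitution name; the last one ends in None.
    """
    # re.split with a capturing group alternates: text, name, text, ..., text
    parts = re.split(r"##(\w+)##", note)
    texts = parts[0::2]
    names = parts[1::2] + [None]
    return [(seq, text, name)
            for seq, (text, name) in enumerate(zip(texts, names), start=1)]

def render(fragments, subs):
    """Rebuild the note: fragment text followed by its substitution, in order."""
    return "".join(text + (subs.get(name, "") if name else "")
                   for _, text, name in sorted(fragments))

template = "This is some notes ##placeholder130## this ##myPlaceholder##, ##oneMore##. End."
subs = {"placeholder130": "Dog", "myPlaceholder": "Cat", "oneMore": "Cow"}

fragments = to_fragments(template)
for row in fragments:
    print(row)
print(render(fragments, subs))
```

The splitting cost is paid once per template, after which each render is a plain ordered join, which is the point of the fragment table.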

I really don't know how it will perform with 10k+ lookups.
How does good old dynamic SQL perform?
DECLARE @sqlCommand NVARCHAR(MAX)
SELECT @sqlCommand = N'PlaceholderTable.[Notes]'
SELECT @sqlCommand = 'REPLACE( ' + @sqlCommand +
', ''##' + LookupTable.[Name] + '##'', ''' +
LookupTable.[Value] + ''')'
FROM LookupTable
SELECT @sqlCommand = 'SELECT *, ' + @sqlCommand + ' FROM PlaceholderTable'
EXECUTE sp_executesql @sqlCommand
Fiddle demo

And now for some recursive CTE.
If your indexes are correctly set up, this one should be very fast or very slow. SQL Server always surprises me with performance extremes when it comes to the r-CTE...
;WITH T AS (
SELECT
Row,
StartIdx = 1, -- 1 as first starting index
EndIdx = CAST(patindex('%##%', Notes) as int), -- first ending index
Result = substring(Notes, 1, patindex('%##%', Notes) - 1)
-- (first) temp result bounded by indexes
FROM PlaceholderTable -- **this is your source table**
UNION ALL
SELECT
pt.Row,
StartIdx = newstartidx, -- starting index (calculated in calc1)
EndIdx = EndIdx + CAST(newendidx as int) + 1, -- ending index (calculated in calc4 + total offset)
Result = Result + CAST(ISNULL(newtokensub, newtoken) as nvarchar(max))
-- temp result taken from subquery or original
FROM
T
JOIN PlaceholderTable pt -- **this is your source table**
ON pt.Row = T.Row
CROSS APPLY(
SELECT newstartidx = EndIdx + 2 -- new starting index moved by 2 from last end ('##')
) calc1
CROSS APPLY(
SELECT newtxt = substring(pt.Notes, newstartidx, len(pt.Notes))
-- current piece of txt we work on
) calc2
CROSS APPLY(
SELECT patidx = patindex('%##%', newtxt) -- current index of '##'
) calc3
CROSS APPLY(
SELECT newendidx = CASE
WHEN patidx = 0 THEN len(newtxt) + 1
ELSE patidx END -- if last piece of txt, end with its length
) calc4
CROSS APPLY(
SELECT newtoken = substring(pt.Notes, newstartidx, newendidx - 1)
-- get the new token
) calc5
OUTER APPLY(
SELECT newtokensub = Value
FROM LookupTable
WHERE Name = newtoken -- substitute the token if you can find it in **your lookup table**
) calc6
WHERE newstartidx + len(newtxt) - 1 <= len(pt.Notes)
-- do this while {new starting index} + {length of txt we work on} does not exceed the total length
)
,lastProcessed AS (
SELECT
Row,
Result,
rn = row_number() over(partition by Row order by StartIdx desc)
FROM T
) -- enumerate all (including intermediate) results
SELECT *
FROM lastProcessed
WHERE rn = 1 -- filter out intermediate results (display only last ones)

Call SQL Functions after PIVOT

I have a stored procedure that's taking a very long time because 2 function calls are made before a PIVOT, which means the functions are called 5 times for each record rather than once. How can I rewrite my query so that the 2 function calls right at the end of the query run after the PIVOT rather than before?
Here's the query
CREATE TABLE #Temp
(
ServiceRecordID INT,
LocationStd VARCHAR(1000),
AreaServedStd VARCHAR(1000),
RegionalLimited BIT,
Region VARCHAR(255),
Visible BIT
)
DECLARE @RegionCount INT
SELECT @RegionCount = COUNT(RegionID) FROM Regions WHERE SiteID = @SiteID AND RegionID % 100 != 0
INSERT INTO #Temp
SELECT TOP (@RegionCount * 100) SR.ServiceRecordID, SR.LocationStd, SR.AreaServedStd, SR.RegionalLimited, R.Region,
CASE WHEN (ISNULL(R_SR.RegionID,0) = 0 AND ISNULL(R_SR_Serv.RegionID,0) = 0) THEN 0 ELSE 1 END AS Visible
FROM ServiceRecord SR
INNER JOIN Sites S ON SR.SiteID = S.SiteID
INNER JOIN Regions R ON R.SiteID = S.SiteID
LEFT OUTER JOIN lkup_Region_ServiceRecord R_SR ON R_SR.RegionID = R.RegionID AND R_SR.ServiceRecordID = SR.ServiceRecordID
LEFT OUTER JOIN lkup_Region_ServiceRecord_Serv R_SR_Serv ON R_SR_Serv.RegionID = R.RegionID AND R_SR_Serv.ServiceRecordID = SR.ServiceRecordID AND SR.RegionalLimited = 0
WHERE SR.SiteID = @SiteID
AND R.RegionID % 100 != 0
ORDER BY SR.ServiceRecordID
DECLARE @RegionList varchar(2000),@SQL varchar(max)
SELECT @RegionList = STUFF((SELECT DISTINCT ',[' + Region + ']' FROM #Temp ORDER BY ',[' + Region + ']' FOR XML PATH('')),1,1,'')
SET @SQL='SELECT * FROM
(SELECT ServiceRecordID,
dbo.fn_ServiceRecordGetServiceName(ServiceRecordID,'''') AS ServiceName,
LocationStd,
AreaServedStd,
RegionalLimited,
Region As Region,
dbo.fn_GetOtherRegionalSitesForServiceRecord(ServiceRecordID) AS OtherSites,
CAST(Visible AS INT) AS Visible FROM #Temp) B PIVOT(MAX(Visible) FOR Region IN (' + @RegionList + ')) A'
EXEC(@SQL)

Move the function calls after the PIVOT:
SET @SQL='
SELECT
A.*,
N.ServiceName,
S.OtherSites
FROM
(
SELECT
ServiceRecordID,
LocationStd,
AreaServedStd,
RegionalLimited,
Region,
CAST(Visible AS INT) AS Visible
FROM #Temp
) B
PIVOT(MAX(Visible) FOR Region IN (' + @RegionList + ')) A
OUTER APPLY (
SELECT dbo.fn_ServiceRecordGetServiceName(A.ServiceRecordID,'''')
) N (ServiceName)
OUTER APPLY (
SELECT dbo.fn_GetOtherRegionalSitesForServiceRecord(A.ServiceRecordID)
) S (OtherSites);
';
Or just put them in the outer SELECT:
SET @SQL='
SELECT
A.*,
ServiceName = dbo.fn_ServiceRecordGetServiceName(A.ServiceRecordID,''''),
OtherSites = dbo.fn_GetOtherRegionalSitesForServiceRecord(A.ServiceRecordID)
FROM
(
SELECT
ServiceRecordID,
LocationStd,
AreaServedStd,
RegionalLimited,
Region,
CAST(Visible AS INT) AS Visible
FROM #Temp
) B
PIVOT(MAX(Visible) FOR Region IN (' + @RegionList + ')) A
';
If you can convert those functions to inline table-valued functions, each consisting of a single SELECT statement, you may get a huge performance improvement as well.
CREATE FUNCTION dbo.fn_ServiceRecordGetServiceName2(
@ServiceRecordID int
)
RETURNS TABLE
AS
RETURN ( -- single select statement
SELECT ServiceName = Blah
FROM dbo.Gorp
WHERE Gunk = 'Ralph'
);
Then
OUTER APPLY dbo.fn_ServiceRecordGetServiceName2(ServiceRecordID) N
And N.ServiceName will return the value(s).
Also, it is not correct to tack on square brackets to convert data values to valid sysnames. You should use the QuoteName function. This will ensure your system doesn't break no matter WHAT crazy value is entered 13 years from now (think 'Taiwan [North]'):
STUFF((SELECT DISTINCT ',' + QuoteName(Region) FROM #Temp ...
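To see why hand-added brackets are fragile, here is a small Python sketch that mimics the default bracket quoting of QUOTENAME: wrap the value in square brackets and double any closing bracket inside it. The function names are hypothetical, and the code only models that escaping rule; it does not cover QUOTENAME's other delimiter options or its input length limit.

```python
def naive_brackets(name):
    """What ',[' + Region + ']' effectively does: no escaping at all."""
    return "[" + name + "]"

def quote_name(name):
    """Model of QUOTENAME's default behaviour: double every ']' inside the name."""
    return "[" + name.replace("]", "]]") + "]"

region = "Taiwan [North]"
print(naive_brackets(region))  # [Taiwan [North]]   (']]' reads as an escaped ']', so the identifier never closes)
print(quote_name(region))      # [Taiwan [North]]]  (escaped ']' followed by the real closing bracket)
```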
Note:
Since you said that this is for display in a web page, you don't even need to do the pivoting on the server. Instead, return 2 rowsets to the client, one with the Site data and one with the column data for the Regions. You would need an additional pass through every row in the Region rowset to find out all the regions needed, but this can be done very quickly. Finally, adjust your program code to step through the Region rows as needed for each matching Site, and create your output.
One reason this is worth the investment is that if your application grows in size, you can always throw another web server at the problem, but it's a lot harder to throw another database at it. A new web server will cost less than continually beefing up your SQL Server.
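A rough sketch of that client-side pivot, in Python for illustration only. The shape of the two rowsets is an assumption (site rows as dictionaries, region rows as (ServiceRecordID, Region, Visible) tuples), the sample values are made up, and pivot_regions is a hypothetical name. Cells with no region row default to 0 here, which is a design choice of the sketch rather than something PIVOT does.

```python
def pivot_regions(site_rows, region_rows):
    """Pivot (ServiceRecordID, Region, Visible) rows into one dict per site row."""
    # One pass over the region rows finds every column and the MAX(Visible) per cell.
    regions = set()
    visible = {}
    for record_id, region, flag in region_rows:
        regions.add(region)
        key = (record_id, region)
        visible[key] = max(visible.get(key, 0), flag)
    columns = sorted(regions)
    # Then step through the sites and attach one value per region column.
    table = []
    for site in site_rows:
        row = dict(site)
        for region in columns:
            row[region] = visible.get((site["ServiceRecordID"], region), 0)
        table.append(row)
    return columns, table

# Made-up sample rowsets, for illustration only.
sites = [
    {"ServiceRecordID": 1, "LocationStd": "Main St"},
    {"ServiceRecordID": 2, "LocationStd": "High St"},
]
region_flags = [(1, "North", 1), (1, "South", 0), (2, "South", 1)]

columns, table = pivot_regions(sites, region_flags)
print(columns)
for row in table:
    print(row)
```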
P.S. Even dynamic SQL is easier to deal with when you format it well. :)

Join [one word per row] to rows of phrases with [multiple words per row]

Please excuse the length of the question. I included a test script to demo the situation and my best attempt at a solution.
There are two tables:
test_WORDS = Words extracted in order from several sources. The OBJ_FK column is the ID of the source. WORD_ID is an identifier for the word itself that is unique within the source. Each row contains one word.
test_PHRASE = a list of phrases to be searched for in test_WORDS. The PHRASE_TEXT column is a space separated phrase like 'foo bar' (see below) so that each row contains multiple words.
Requirement:
Return the first word from test_WORDS that is the start of a phrase matching one from test_PHRASE.
I would prefer something set-based to avoid the RBAR approach below. Also, my solution is limited to 5-word phrases, and I need to support phrases of up to 20 words. Is it possible to match the words from a row in test_PHRASE to contiguous rows in test_WORDS without cursors?
After breaking the phrase words out into a temporary table, the problem boils down to matching portions of two sets together in row order.
-- Create test data
CREATE TABLE [dbo].[test_WORDS](
[OBJ_FK] [bigint] NOT NULL, --FK to the source object
[WORD_ID] [int] NOT NULL, --The word order in the source object
[WORD_TEXT] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_test_WORDS] PRIMARY KEY CLUSTERED
(
[OBJ_FK] ASC,
[WORD_ID] ASC
)
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[test_PHRASE](
[ID] [int], --PHRASE ID
[PHRASE_TEXT] [nvarchar](150) NOT NULL --Space-separated phrase
CONSTRAINT [PK_test_PHRASE] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
)
GO
INSERT INTO dbo.test_WORDS
SELECT 1,1,'aaa' UNION ALL
SELECT 1,2,'bbb' UNION ALL
SELECT 1,3,'ccc' UNION ALL
SELECT 1,4,'ddd' UNION ALL
SELECT 1,5,'eee' UNION ALL
SELECT 1,6,'fff' UNION ALL
SELECT 1,7,'ggg' UNION ALL
SELECT 1,8,'hhh' UNION ALL
SELECT 2,1,'zzz' UNION ALL
SELECT 2,2,'yyy' UNION ALL
SELECT 2,3,'xxx' UNION ALL
SELECT 2,4,'www'
INSERT INTO dbo.test_PHRASE
SELECT 1, 'bbb ccc ddd' UNION ALL --should match
SELECT 2, 'ddd eee fff' UNION ALL --should match
SELECT 3, 'xxx xxx xxx' UNION ALL --should NOT match
SELECT 4, 'zzz yyy xxx' UNION ALL --should match
SELECT 5, 'xxx www ppp' UNION ALL --should NOT match
SELECT 6, 'zzz yyy xxx www' --should match
-- Create variables
DECLARE @maxRow AS INTEGER
DECLARE @currentRow AS INTEGER
DECLARE @phraseSubsetTable AS TABLE(
[ROW] int IDENTITY(1,1) NOT NULL,
[ID] int NOT NULL, --PHRASE ID
[PHRASE_TEXT] nvarchar(150) NOT NULL
)
--used to split the phrase into words
--note: No permissions to sys.dm_fts_parser
DECLARE @WordList table
(
ID int,
WORD nvarchar(50)
)
--Records to be returned to caller
DECLARE @returnTable AS TABLE(
OBJECT_FK INT NOT NULL,
WORD_ID INT NOT NULL,
PHRASE_ID INT NOT NULL
)
DECLARE @phrase AS NVARCHAR(150)
DECLARE @phraseID AS INTEGER
-- Get subset of phrases to simulate a join that would occur in production
INSERT INTO @phraseSubsetTable
SELECT ID, PHRASE_TEXT
FROM dbo.test_PHRASE
--represent subset of phrases caused by join in production
WHERE ID IN (2,3,4)
-- Loop each phrase in the subset, split into rows of words and return matches to the test_WORDS table
SET @maxRow = @@ROWCOUNT
SET @currentRow = 1
WHILE @currentRow <= @maxRow
BEGIN
SELECT @phrase=PHRASE_TEXT, @phraseID=ID FROM @phraseSubsetTable WHERE row = @currentRow
--clear previous phrase that was split into rows
DELETE FROM @WordList
--Recursive Function with CTE to create recordset of words, one per row
;WITH Pieces(pn, start, stop) AS (
SELECT 1, 1, CHARINDEX(' ', @phrase)
UNION ALL
SELECT pn + 1, stop + 1, CHARINDEX(' ', @phrase, stop + 1)
FROM Pieces
WHERE stop > 0)
--Create the List of words with the CTE above
insert into @WordList
SELECT pn,
SUBSTRING(@phrase, start, CASE WHEN stop > 0 THEN stop-start ELSE 1056 END) AS WORD
FROM Pieces
DECLARE @wordCt as int
select @wordCt=count(ID) from @WordList;
-- Do the actual query using a CTE with a rownumber that repeats for every SOURCE OBJECT
;WITH WordOrder_CTE AS (
SELECT OBJ_FK, WORD_ID, WORD_TEXT,
ROW_NUMBER() OVER (Partition BY OBJ_FK ORDER BY WORD_ID) AS rownum
FROM test_WORDS)
--CREATE a flattened record of the first word in the phrase and join it to the rest of the words.
INSERT INTO @returnTable
SELECT r1.OBJ_FK, r1.WORD_ID, @phraseID AS PHRASE_ID
FROM WordOrder_CTE r1
INNER JOIN @WordList w1 ON r1.WORD_TEXT = w1.WORD and w1.ID=1
LEFT JOIN WordOrder_CTE r2
ON r1.rownum = r2.rownum - 1 and r1.OBJ_FK = r2.OBJ_FK
LEFT JOIN @WordList w2 ON r2.WORD_TEXT = w2.WORD and w2.ID=2
LEFT JOIN WordOrder_CTE r3
ON r1.rownum = r3.rownum - 2 and r1.OBJ_FK = r3.OBJ_FK
LEFT JOIN @WordList w3 ON r3.WORD_TEXT = w3.WORD and w3.ID=3
LEFT JOIN WordOrder_CTE r4
ON r1.rownum = r4.rownum - 3 and r1.OBJ_FK = r4.OBJ_FK
LEFT JOIN @WordList w4 ON r4.WORD_TEXT = w4.WORD and w4.ID=4
LEFT JOIN WordOrder_CTE r5
ON r1.rownum = r5.rownum - 4 and r1.OBJ_FK = r5.OBJ_FK
LEFT JOIN @WordList w5 ON r5.WORD_TEXT = w5.WORD and w5.ID=5
WHERE (@wordCt < 2 OR w2.ID is not null) and
(@wordCt < 3 OR w3.ID is not null) and
(@wordCt < 4 OR w4.ID is not null) and
(@wordCt < 5 OR w5.ID is not null)
--loop
SET @currentRow = @currentRow+1
END
--Return the first words of each matching phrase
SELECT OBJECT_FK, WORD_ID, PHRASE_ID FROM @returnTable
GO
--Clean up
DROP TABLE [dbo].[test_WORDS]
DROP TABLE [dbo].[test_PHRASE]
Edited solution:
This is an edit of the correct solution provided below to account for non-contiguous word IDs. Hope this helps someone as much as it did me.
;WITH
numberedwords AS (
SELECT
OBJ_FK,
WORD_ID,
WORD_TEXT,
rowcnt = ROW_NUMBER() OVER
(PARTITION BY OBJ_FK ORDER BY WORD_ID),
totalInSrc = COUNT(WORD_ID) OVER (PARTITION BY OBJ_FK)
FROM dbo.test_WORDS
),
phrasedwords AS (
SELECT
nw1.OBJ_FK,
nw1.WORD_ID,
nw1.WORD_TEXT,
PHRASE_TEXT = RTRIM((
SELECT [text()] = nw2.WORD_TEXT + ' '
FROM numberedwords nw2
WHERE nw1.OBJ_FK = nw2.OBJ_FK
AND nw2.rowcnt BETWEEN nw1.rowcnt AND nw1.totalInSrc
ORDER BY nw2.OBJ_FK, nw2.WORD_ID
FOR XML PATH ('')
))
FROM numberedwords nw1
GROUP BY nw1.OBJ_FK, nw1.WORD_ID, nw1.WORD_TEXT, nw1.rowcnt, nw1.totalInSrc
)
SELECT *
FROM phrasedwords pw
INNER JOIN test_PHRASE tp
ON LEFT(pw.PHRASE_TEXT, LEN(tp.PHRASE_TEXT)) = tp.PHRASE_TEXT
ORDER BY pw.OBJ_FK, pw.WORD_ID
Note: The final query I used in production uses indexed temp tables instead of CTEs. I also limited the length of the PHRASE_TEXT column to my needs. With these improvements, I was able to reduce my query time from over 3 minutes to 3 seconds!

Here's a solution that uses a different approach: instead of splitting the phrases into words it combines the words into phrases.
Edited: changed the rowcnt expression to using COUNT(*) OVER …, as suggested by @ErikE in the comments.
;WITH
numberedwords AS (
SELECT
OBJ_FK,
WORD_ID,
WORD_TEXT,
rowcnt = COUNT(*) OVER (PARTITION BY OBJ_FK)
FROM dbo.test_WORDS
),
phrasedwords AS (
SELECT
nw1.OBJ_FK,
nw1.WORD_ID,
nw1.WORD_TEXT,
PHRASE_TEXT = RTRIM((
SELECT [text()] = nw2.WORD_TEXT + ' '
FROM numberedwords nw2
WHERE nw1.OBJ_FK = nw2.OBJ_FK
AND nw2.WORD_ID BETWEEN nw1.WORD_ID AND nw1.rowcnt
ORDER BY nw2.OBJ_FK, nw2.WORD_ID
FOR XML PATH ('')
))
FROM numberedwords nw1
GROUP BY nw1.OBJ_FK, nw1.WORD_ID, nw1.WORD_TEXT, nw1.rowcnt
)
SELECT *
FROM phrasedwords pw
INNER JOIN test_PHRASE tp
ON LEFT(pw.PHRASE_TEXT, LEN(tp.PHRASE_TEXT)) = tp.PHRASE_TEXT
ORDER BY pw.OBJ_FK, pw.WORD_ID
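The idea behind this query, comparing each phrase against the run of words that starts at every position, can be sketched in Python. This is an illustration rather than a translation: it compares word lists instead of concatenated strings, orders words by WORD_ID without requiring contiguous IDs, and match_phrases is a hypothetical name. The sample rows are copied from the question's test data.

```python
def match_phrases(words, phrases):
    """Return (OBJ_FK, WORD_ID, PHRASE_ID) for every word that starts a phrase.

    words:   iterable of (OBJ_FK, WORD_ID, WORD_TEXT)
    phrases: iterable of (ID, PHRASE_TEXT), space-separated
    """
    # Group the words per source object, ordered by WORD_ID.
    by_source = {}
    for obj_fk, word_id, text in sorted(words):
        by_source.setdefault(obj_fk, []).append((word_id, text))
    matches = []
    for obj_fk, seq in by_source.items():
        texts = [text for _, text in seq]
        for i, (word_id, _) in enumerate(seq):
            for phrase_id, phrase in phrases:
                wanted = phrase.split()
                # The phrase matches if the next len(wanted) words are identical.
                if texts[i:i + len(wanted)] == wanted:
                    matches.append((obj_fk, word_id, phrase_id))
    return matches

WORDS = [(1, i + 1, w) for i, w in enumerate("aaa bbb ccc ddd eee fff ggg hhh".split())]
WORDS += [(2, i + 1, w) for i, w in enumerate("zzz yyy xxx www".split())]
PHRASES = [(1, "bbb ccc ddd"), (2, "ddd eee fff"), (3, "xxx xxx xxx"),
           (4, "zzz yyy xxx"), (5, "xxx www ppp"), (6, "zzz yyy xxx www")]

print(match_phrases(WORDS, PHRASES))
```

Comparing whole words also sidesteps a prefix comparison matching on a partial final word, at the cost of splitting each phrase on every check.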

Using a Split function should work.
Split Function
CREATE FUNCTION dbo.Split
(
@RowData nvarchar(2000),
@SplitOn nvarchar(5)
)
RETURNS @RtnValue table
(
Id int identity(1,1),
Data nvarchar(100)
)
AS
BEGIN
Declare @Cnt int
Set @Cnt = 1
While (Charindex(@SplitOn,@RowData)>0)
Begin
Insert Into @RtnValue (data)
Select
Data = ltrim(rtrim(Substring(@RowData,1,Charindex(@SplitOn,@RowData)-1)))
Set @RowData = Substring(@RowData,Charindex(@SplitOn,@RowData)+1,len(@RowData))
Set @Cnt = @Cnt + 1
End
Insert Into @RtnValue (data)
Select Data = ltrim(rtrim(@RowData))
Return
END
SQL Statement
SELECT DISTINCT p.*
FROM dbo.test_PHRASE p
LEFT OUTER JOIN (
SELECT p.ID
FROM dbo.test_PHRASE p
CROSS APPLY dbo.Split(p.PHRASE_TEXT, ' ') sp
LEFT OUTER JOIN dbo.test_WORDS w ON w.WORD_TEXT = sp.Data
WHERE w.OBJ_FK IS NULL
) ignore ON ignore.ID = p.ID
WHERE ignore.ID IS NULL

This performs a little better than the other solutions given. If you don't need WORD_ID, just WORD_TEXT, you can remove a whole column. I know this was over a year ago, but I wonder if you can get 3 seconds down to 30 ms? :)
If this query seems good, then my biggest speed advice is to put the entire phrases into a separate table (using your example data, it would have only 2 rows with phrases of length 8 words and 4 words).
SELECT
W.OBJ_FK,
X.Phrase,
P.*,
Left(P.PHRASE_TEXT,
IsNull(NullIf(CharIndex(' ', P.PHRASE_TEXT), 0) - 1, 2147483647)
) WORD_TEXT,
Len(Left(X.Phrase, PatIndex('%' + P.PHRASE_TEXT + '%', ' ' + X.Phrase) - 1))
- Len(Replace(
Left(X.Phrase, PatIndex('%' + P.PHRASE_TEXT + '%', X.Phrase) - 1), ' ', '')
)
WORD_ID
FROM
(SELECT DISTINCT OBJ_FK FROM dbo.test_WORDS) W
CROSS APPLY (
SELECT RTrim((SELECT WORD_TEXT + ' '
FROM dbo.test_WORDS W2
WHERE W.OBJ_FK = W2.OBJ_FK
ORDER BY W2.WORD_ID
FOR XML PATH (''))) Phrase
) X
INNER JOIN dbo.test_PHRASE P
ON X.Phrase LIKE '%' + P.PHRASE_TEXT + '%';
Here's another version for curiosity's sake. It doesn't perform quite as well.
WITH Calc AS (
SELECT
P.ID,
P.PHRASE_TEXT,
W.OBJ_FK,
W.WORD_ID StartID,
W.WORD_TEXT StartText,
W.WORD_ID,
Len(W.WORD_TEXT) + 2 NextPos,
Convert(varchar(150), W.WORD_TEXT) MatchingPhrase
FROM
dbo.test_PHRASE P
INNER JOIN dbo.test_WORDS W
ON P.PHRASE_TEXT + ' ' LIKE W.WORD_TEXT + ' %'
UNION ALL
SELECT
C.ID,
C.PHRASE_TEXT,
C.OBJ_FK,
C.StartID,
C.StartText,
W.WORD_ID,
C.NextPos + Len(W.WORD_TEXT) + 1,
Convert(varchar(150), C.MatchingPhrase + Coalesce(' ' + W.WORD_TEXT, ''))
FROM
Calc C
INNER JOIN dbo.test_WORDS W
ON C.OBJ_FK = W.OBJ_FK
AND C.WORD_ID + 1 = W.WORD_ID
AND Substring(C.PHRASE_TEXT, C.NextPos, 2147483647) + ' ' LIKE W.WORD_TEXT + ' %'
)
SELECT C.OBJ_FK, C.PHRASE_TEXT, C.StartID, C.StartText, C.ID
FROM Calc C
WHERE C.PHRASE_TEXT = C.MatchingPhrase;
