How to implement Full-Text search in multilingual content in SQL Server - sql

We have a site that supports several languages. We have millions of rows, so we would like to implement search with SQL Server Full-Text Search.
Our current table structure is as follows:
CREATE TABLE Product
(
ID INT IDENTITY(1,1),
Code VARCHAR(50),
........
........
)
CREATE TABLE ProductLanguage
(
ID INT,
LanguageID INT,
Name NVARCHAR(200),
........
........
)
We would like to implement Full-Text search on the "Name" column, so we created a Full-Text index on it. However, when creating a Full-Text index we can select only one language per column. If we select "English" or "Neutral", it does not return the expected data in other languages such as Japanese, Chinese, or French.
So what is the best way to implement Full-Text search in SQL Server for multilingual content?
Do we need to create a different table? If yes, what would the table structure be (keeping in mind that the languages are not fixed; different languages can be added later), and what would the search query look like?
We are using SQL Server 2008 R2.

Certain content (document) types support language settings - e.g. Microsoft Office Documents, PDF, [X]HTML, or XML.
If you change the type of your Name column to XML, you can determine the language of each value (i.e. per row). For instance:
Instead of storing values as strings
name 1
name 2
name 3
...you could store them as XML documents with the appropriate language declarations:
<content xml:lang="en-US">name 1</content>
<content xml:lang="fr-FR">name 2</content>
<content xml:lang="en-GB">name 3</content>
During Full-text index population the correct word breaker/stemmer will be used, based on the language settings of each value (XML document): US English for name 1, French for name 2, and UK English for name 3.
Of course, this would require a significant change in the way your data is managed and consumed.
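A minimal sketch of this approach, for illustration only. The table name ProductLanguageXml, the RowID key, the catalog name ProductCatalog, and the LanguageID values are all hypothetical; the LCIDs used are 1033 (US English) and 1036 (French).

```sql
-- Hypothetical variant of ProductLanguage that stores the name as XML.
CREATE TABLE dbo.ProductLanguageXml
(
    RowID      INT IDENTITY(1,1) NOT NULL
               CONSTRAINT PK_ProductLanguageXml PRIMARY KEY, -- full-text key
    ID         INT NOT NULL,   -- product ID
    LanguageID INT NOT NULL,   -- assumed: 1 = English, 2 = French
    NameXml    XML NOT NULL    -- value wrapped with an xml:lang declaration
);
GO

INSERT INTO dbo.ProductLanguageXml (ID, LanguageID, NameXml) VALUES
    (1, 1, N'<content xml:lang="en-US">name 1</content>'),
    (1, 2, N'<content xml:lang="fr-FR">name 2</content>');
GO

CREATE FULLTEXT CATALOG ProductCatalog;
GO

-- LANGUAGE 0 (neutral) is only the column default here;
-- the xml:lang value of each document takes precedence at indexing time.
CREATE FULLTEXT INDEX ON dbo.ProductLanguageXml (NameXml LANGUAGE 0)
    KEY INDEX PK_ProductLanguageXml
    ON ProductCatalog;
GO

-- Query the French rows using the French word breaker for the search term.
SELECT ID, LanguageID
FROM dbo.ProductLanguageXml
WHERE LanguageID = 2
  AND CONTAINS(NameXml, N'name', LANGUAGE 1036);
```

The query-time LANGUAGE term should match the language the row was indexed with, which is why the sketch also filters on LanguageID.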
ML

I'd be concerned about the performance of using XML instead of NVARCHAR(n) - though I have no hard proof for it.
One alternative could be to use dynamic SQL (generating the language-specific code on the fly), combined with language-specific indexed views on the Product table. The drawback of this is the lack of execution plan caching, i.e. again: performance.
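A rough sketch of what the dynamic SQL could look like, assuming one indexed view per language named after its LCID (for example dbo.ProductLanguage_1036 for French). The view naming scheme and the variable names are assumptions, not something from the question.

```sql
DECLARE @lcid INT = 1036;              -- LCID of the requested language (French)
DECLARE @term NVARCHAR(200) = N'chat'; -- search term supplied by the user

-- Only the integer LCID is concatenated into the statement text;
-- the search term is passed as a parameter to avoid SQL injection.
DECLARE @sql NVARCHAR(MAX) =
      N'SELECT ID, Name FROM dbo.ProductLanguage_' + CAST(@lcid AS NVARCHAR(10))
    + N' WHERE CONTAINS(Name, @term, LANGUAGE ' + CAST(@lcid AS NVARCHAR(10)) + N');';

EXEC sp_executesql @sql, N'@term NVARCHAR(200)', @term = @term;
```

Because the statement text differs only by language, passing the term through sp_executesql lets each language's statement be parameterized, which may soften the plan caching concern somewhat.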

Same idea as Matija Lah's answer, but this is the suggested solution outlined in the MS whitepaper.
When the indexed content is of binary type (such as a Microsoft Word
document), the iFilter responsible for processing the text content
before sending it to the word breaker might honor specific language
tags in the binary file. When this is the case, at indexing time the
iFilter invokes the correct word breaker for a specific document or
section of a document specified in a particular language. All you need
to do in this case is to verify after indexing that the multilanguage
content was indexed correctly. Filters for Word, HTML, and XML
documents honor language specification attributes in document content:
Word – language settings
HTML – <meta name="MS.locale"…>
XML – xml:lang attribute
When your content is plain text, you
can convert it to the XML data type and add specific language tags to
indicate the language corresponding to that specific document or
document section. Note that for this to work, before you index you
must know the language that will be used.
https://technet.microsoft.com/en-us/library/cc721269%28v=sql.100%29.aspx

Related

How to create full text index for a multi language column?

I have a table called TestTable with the below schema.
---------------------------------------------------------
ID(Integer) | Text(nvarchar(450)) | LanguageCode(Integer)
---------------------------------------------------------
Where ID is a primary key and Text column contains text strings in multiple languages.
I would like to create a full text index on the above table.
CREATE FULLTEXT INDEX ON TestTable
(
Text
Language <Should get language code from Language column>,
)
KEY INDEX PK_TestTableID;
GO
How can I achieve this?
Please help.
It's impossible to do it on a single table at the moment. With SQL Server you can only index a column with a single language.
What Microsoft suggests is to use a neutral word breaker (http://technet.microsoft.com/en-us/library/ms142507.aspx).
Your query would be as follows:
SELECT Description
FROM Asset
WHERE FREETEXT(Description, 'Cats', LANGUAGE 1031); -- 1031 is the LCID for German
The problem with this is that you don't get the obvious benefits of word breaking in the language of that text.
What you could do is create a view of the table for each culture and index it with that culture's specific word breaker:
e.g. TestTableGermanView
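A minimal sketch of the per-culture view idea, assuming that LanguageCode stores Windows LCIDs (1031 for German) and that the session uses the SET options required for indexed views. The index name and the catalog name TestCatalog are hypothetical.

```sql
CREATE FULLTEXT CATALOG TestCatalog;
GO

-- The view must be schema-bound so that it can be indexed.
CREATE VIEW dbo.TestTableGermanView
WITH SCHEMABINDING
AS
SELECT ID, [Text]
FROM dbo.TestTable
WHERE LanguageCode = 1031;  -- German rows only
GO

-- A unique clustered index materializes the view and serves as the full-text key.
CREATE UNIQUE CLUSTERED INDEX IX_TestTableGermanView_ID
    ON dbo.TestTableGermanView (ID);
GO

-- Index the view's Text column with the German word breaker.
CREATE FULLTEXT INDEX ON dbo.TestTableGermanView ([Text] LANGUAGE 1031)
    KEY INDEX IX_TestTableGermanView_ID
    ON TestCatalog;
GO

-- Search German content only.
SELECT ID, [Text]
FROM dbo.TestTableGermanView
WHERE FREETEXT([Text], N'Katzen');
```

Each additional language needs its own view and full-text index, so this scales with the number of languages rather than with a single column setting.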
I stumbled upon this in the SQL docs. I haven't dug much deeper, but it looks interesting.
"For plain text content - When your content is plain text, you can convert it to the xml data type and add language tags that indicate the language corresponding to each specific document or document section. For this to work, however, you need to know the language before full-text indexing."
https://technet.microsoft.com/en-us/library/ms142507.aspx

Using Full Text Search on file names

I have a table that stores a tree-like structure of file names. There are currently 8 million records in this table. I am working on a way to quickly find a list of files that have a specific serial number embedded in the name.
FS_NODES
-----------------------------------
NODE_ID bigint PK
ROOT_ID bigint
PARENT_ID bigint
NODE_TYPE tinyint
NODE_NAME nvarchar(250)
REC_MODIFIED_UTC datetime
REC_DELETION_BIT bit
Example file name (as stored in the node_name):
scriptname_SomeSerialNumber_201205240730.xml
As expected, the LIKE statement to find the files takes several minutes to scan the entire table, and I would like to improve this. There are no consistent patterns for the names, as each developer likes to create their own naming convention.
I tried using Full Text Search and really love the idea, but I am not able to get it to find files based on keywords in the name. I believe the problem is due to the underscores.
Any suggestions on how I can get this to work? I am using a neutral language for the catalog.
@@VERSION
Microsoft SQL Server 2005 - 9.00.4035.00 (Intel X86)
Nov 24 2008 13:01:59
Copyright (c) 1988-2005 Microsoft Corporation
Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)
Is there a way to alter the catalog and split the keywords out manually?
Thank you!
Full-text search is not the answer. It is used for words, not partial string matching. What you should do is, when inserting or updating data in this table, extract the parts of the filename that are relevant for future searching into their own column(s) which you can index. After all, they are separate pieces of data the way you are using them. You could also consider enforcing a more predictable naming convention instead of just letting the developers do whatever they want.
EDIT per user request:
Add a computed column defined as REPLACE(filename, '_', ' '). Alternatively, instead of a computed column, add a regular column that you populate manually for existing data, and change your insert procedure to maintain it going forward. You could even break the parts out into separate rows in a related table.
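A sketch of the manually populated column variant, assuming the table already has a full-text index (as the question implies). The column name NODE_NAME_WORDS is hypothetical.

```sql
-- Extra column holding the file name with underscores turned into spaces.
ALTER TABLE dbo.FS_NODES ADD NODE_NAME_WORDS NVARCHAR(250) NULL;
GO

-- Back-fill existing rows (on 8 million rows, consider doing this in batches).
UPDATE dbo.FS_NODES
SET NODE_NAME_WORDS = REPLACE(NODE_NAME, N'_', N' ');
GO

-- Add the new column to the existing full-text index (0 = neutral language).
ALTER FULLTEXT INDEX ON dbo.FS_NODES ADD (NODE_NAME_WORDS LANGUAGE 0);
GO

-- Search for a serial number as a separate word.
SELECT NODE_ID, NODE_NAME
FROM dbo.FS_NODES
WHERE CONTAINS(NODE_NAME_WORDS, N'"SomeSerialNumber"');
```

The insert and update paths would also need to keep NODE_NAME_WORDS in sync with NODE_NAME, for example in the procedure that writes the rows.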

Indexing multilingual words in lucene

I am trying to index in Lucene a field that could have RDF literal in different languages.
Most of the approaches I have seen so far are:
Use a single index, where each document has a field per each language it uses, or
Use M indexes, M being the number of languages in the corpus.
Lucene 2.9+ has a feature called Payload that allows attaching attributes to terms. Is anyone using this mechanism to store language information (or other attributes such as datatypes)? How is the performance compared to the two other approaches? Any pointer to source code showing how it is done would help. Thanks.
It depends.
Do you want to allow something like: "Search all english text for 'foo'"? If so, then you will need one field per language.
Or do you want "Search all text for 'foo' and present the user with which language the match was found in?" If this is what you want, then either payloads or separate fields will work.
An alternative way to do it is to index all your text in one field, then have another field saying the language of the document. (Assuming each document is in a single language.) Then your search would be something like +text:foo +language:english.
In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).
So basically Lucene is a ranking algorithm: it just looks at strings and compares them to other strings. They can be encoded in different character encodings, but their similarity is the same nonetheless. Just make sure you load the SnowballAnalyzer with the stemmer for the language in question and you should get results, for, say, Spanish or Chinese.

SQL Full Text search on HTML/XML data

I have a sql full text catalog on a cms database (SQL 2005). The database holds the CMS page content within a ntext column which is part of the full text catalog. As expected the searching takes into account the xml tags within the page content so searching for "H1" returns all the pages with H1 tags.
Is it possible to apply filters within the full text search so that only the data within the xml tags is indexed?
I can see it is possible for SQL full text search to index/search .html binary types or xml columns. However as you can see the setup is slightly different to this.
Many Thanks,
Adam
Unfortunately, you can't change away from the default "text" iFilter on a text/varchar or ntext/nvarchar column.
If you can't change the data type of the column to varbinary, your next-best bet might be to add the HTML tag names as stop words, so they get ignored during indexing and searching.
I should add that ntext has been deprecated, so you will need to move away from it eventually anyway.

Postgres XML datatype

What are the benefits of using the "xml" datatype versus storing the xml content inside a "text" datatype?
Am I able to query by some specific xml attribute or element?
What about indexing and query performance?
Besides the postgresql manual what other online sources can you point me to?
Right now the biggest thing you get from XML fields over raw text is XPath. So if you had something similar to
CREATE TABLE pages (id int, html xml);
you could get the title of page 4 by
SELECT xpath('/html/head/title/text()', html) FROM pages WHERE id = 4;
Right now XML support is fairly limited, but it got a lot better in 8.3; see the current PostgreSQL documentation.
Generally speaking, the benefits are the same ones as for any other data type and why you have data types other than text at all:
Data integrity: You can only store valid (well, well-formed) XML values in columns of type xml.
Type safety: You can only perform operations on XML values that make sense for XML.
One example is the xpath() function (XML Path Language), which only operates on values of type xml, not text.
Indexing and query performance characteristics are, at the moment, no better or worse than for, say, the text type.