How to create a full-text index for a multi-language column? - sql

I have a table called TestTable with the below schema.
---------------------------------------------------------
ID(Integer) | Text(nvarchar(450)) | LanguageCode(Integer)
---------------------------------------------------------
Where ID is a primary key and Text column contains text strings in multiple languages.
I would like to create a full text index on the above table.
CREATE FULLTEXT INDEX ON TestTable
(
    [Text] LANGUAGE <should come from the LanguageCode column>
)
KEY INDEX PK_TestTableID;
GO
How can I achieve this?
Please help.

It's impossible to do this on a single table at the moment: with SQL Server you can only index a column with a single language.
What Microsoft suggests is to use the neutral word breaker (http://technet.microsoft.com/en-us/library/ms142507.aspx).
Your query would then be as follows:
SELECT Description
FROM Asset
WHERE FREETEXT(Description, 'Cats', LANGUAGE 'German');  -- language_term is an alias from sys.syslanguages or an LCID
The problem with this is that you don't get the obvious benefit of breaking the text in its own language.
What you could do is create a view per culture over the table and index each view with that culture's specific word breaker,
e.g. TestTableGermanView.
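A minimal sketch of that per-culture approach (hedged: the view and index names are invented, LanguageCode = 1 is assumed to mean German purely for illustration, and a default full-text catalog is assumed to exist). A full-text index on a view requires an indexed, schema-bound view with a unique clustered index:
CREATE VIEW dbo.TestTableGermanView
WITH SCHEMABINDING
AS
    -- Only the rows whose LanguageCode marks them as German (1 is an assumed code).
    SELECT ID, [Text]
    FROM dbo.TestTable
    WHERE LanguageCode = 1;
GO
-- The unique clustered index makes the view indexable and serves as the full-text key.
CREATE UNIQUE CLUSTERED INDEX IX_TestTableGermanView_ID ON dbo.TestTableGermanView (ID);
GO
-- Index the view with the German word breaker.
CREATE FULLTEXT INDEX ON dbo.TestTableGermanView ([Text] LANGUAGE 'German')
    KEY INDEX IX_TestTableGermanView_ID;
GO
You would repeat the same pattern per culture (TestTableEnglishView, and so on) and query the view that matches the language of the search term.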

I stumbled upon this in the SQL docs. I haven't dug much deeper, but it looks interesting.
"For plain text content - When your content is plain text, you can convert it to the xml data type and add language tags that indicate the language corresponding to each specific document or document section. For this to work, however, you need to know the language before full-text indexing."
https://technet.microsoft.com/en-us/library/ms142507.aspx

Related

MS SQL fulltext search with ignoring special characters

I have the following (simplified) database structure:
[places]
name NVARCHAR(255)
description TEXT (usually quite a lot of text)
region_id INT FK
[regions]
id INT PK
name NVARCHAR(255)
[regions_translations]
lang_code NVARCHAR(5) FK
label NVARCHAR(255)
region_id INT FK
In the real db I have a few more fields in the [places] table to search in, and a [countries] table with a structure similar to [regions].
My requirements are:
Search using name, description and region label, with the same behaviour as name LIKE '%text%' OR description LIKE '%text%' OR regions_translations.label LIKE '%text%'.
Ignore all special characters like Ą, Ć, Ó, Š, Ö, Ü, etc., so that, for example, when someone searches for
PO ZVAIGZDEM I return the place named PO ŽVAIGŽDĖM - and of course also return this record when the user types the proper accented characters.
Quite fast. ;)
I have tried a few approaches to solve this issue:
Create a new column searchable_content, normalize the text (replace Ą -> A, Ö -> O and so on) and just do a simple SELECT ... FROM places WHERE searchable_content LIKE '%text%' - but it was slow.
Add a full-text index to the places and regions_translations tables - this was faster, but I could not find a way to ignore special characters (the characters come from various languages, so specifying an index language will not work).
Create a new column as in the first attempt, and add a full-text index only on that column - this was faster than attempt 1 (probably because I do not need to join tables) and I could normalize the content manually, but it feels like a poor solution.
So the question is - what is the best approach here?
My top priority is to ignore special characters.
EDIT:
ALTER FULLTEXT CATALOG [catalog_name] REBUILD WITH ACCENT_SENSITIVITY = OFF
is probably the solution to my issue with special characters (I need to test it a bit more) - I queried too soon, before the index had finished rebuilding, which is why I did not get any records.
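For what it's worth, a quick check with FULLTEXTCATALOGPROPERTY (substitute your actual catalog name for catalog_name) confirms whether the rebuild has finished and whether the catalog is now accent-insensitive - a small sketch:
-- PopulateStatus 0 = idle (population finished); non-zero means a population is still running.
-- AccentSensitivity 0 = accent-insensitive, 1 = accent-sensitive.
SELECT FULLTEXTCATALOGPROPERTY('catalog_name', 'PopulateStatus')    AS PopulateStatus,
       FULLTEXTCATALOGPROPERTY('catalog_name', 'AccentSensitivity') AS AccentSensitivity;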
You can use the COLLATE clause on a column to specify a sql collation that will treat these special characters as their non-accented counterparts. Think of it as essentially casting one data type as another, except you're casting é as e (for example). You can use the same tool to return case sensitive or case insensitive results.
The documentation talks a little more about it, and you can do a search to find exactly which collation works best for you.
https://learn.microsoft.com/en-us/sql/t-sql/statements/collations?view=sql-server-ver16
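For instance, a minimal sketch of the COLLATE approach against the [places] and [regions_translations] tables from the question; Latin1_General_CI_AI is only an illustrative accent- and case-insensitive collation, not a recommendation for this specific data:
-- CI = case-insensitive, AI = accent-insensitive, so 'PO ZVAIGZDEM' matches 'PO ŽVAIGŽDĖM'.
SELECT p.name, rt.label
FROM places AS p
JOIN regions_translations AS rt ON rt.region_id = p.region_id
WHERE p.name   COLLATE Latin1_General_CI_AI LIKE '%ZVAIGZDEM%'
   OR rt.label COLLATE Latin1_General_CI_AI LIKE '%ZVAIGZDEM%';
Note that a leading-wildcard LIKE still cannot use a regular index, so this addresses the accent problem but not the speed requirement on its own.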

SQL Full-Text Search can't find numeric value in nvarchar field

I have a stored procedure which uses full-text search on my nvarchar fields. I got stuck when I realized that full-text search can't find a row if I type only the numeric part of the field's value.
For example, I have a field Name in my table with the value 'Request_121'.
If I type Запрос_120 or Request - it's okay.
If I type 120 - nothing is found.
What is going on?
Screenshots:
No results found: https://gyazo.com/9e9e061ce68432c368db7e9162909771
Results found: https://gyazo.com/e4cb9a06da5bf8b9f4d702c55e7f181e
You cannot find the 121 word part in your full-text indexed column because SQL Server treats Request_121 as a single term. You can verify this by running the FTS parser manually:
select * from sys.dm_fts_parser('"Request_121"', 1033, 0, 0)
returns Request_121 as a single term, while running:
select * from sys.dm_fts_parser('"Request 121"', 1033, 0, 0)
returns separate terms - note that in the second example 121 was picked up as a separate search term.
What you could do is try using wildcards in your FTS query, like:
SELECT * FROM dbo.CardSearchIndexes idx WHERE CONTAINS(idx.Name, '"121*"');
However, I doubt it will pick up 121 while it sits inside a non-breakable word part; it will only match where 121 appears as a standalone word. Play with sys.dm_fts_parser to see how the SQL FTS engine breaks up your input and adjust your query accordingly.
UPDATE: I've noticed that you use Cyrillic search terms together with English ones. Note that when running FTS queries it's also important to know which LANGUAGE was specified when the full-text index was created for the Name column. If the FTS language locale is Cyrillic, it will not find the English term Request in the Name column.
Note that in my dm_fts_parser examples above I used the 1033 (English) language id. Examine the LANGUAGE language_term clause in your CREATE FULLTEXT INDEX statement to check which language was used for the FTS index.
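If the original CREATE FULLTEXT INDEX statement is no longer at hand, a query over the full-text catalog views (sketched below; dbo.CardSearchIndexes is taken from the query above) shows which language each indexed column uses:
-- Which word-breaker language is configured for each full-text indexed column?
SELECT c.name AS column_name,
       fl.name AS fulltext_language,
       fl.lcid
FROM sys.fulltext_index_columns AS fic
JOIN sys.columns AS c ON c.object_id = fic.object_id AND c.column_id = fic.column_id
JOIN sys.fulltext_languages AS fl ON fl.lcid = fic.language_id
WHERE fic.object_id = OBJECT_ID('dbo.CardSearchIndexes');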
I have field Name in my table with value 'Request_121'
Your query is wrong; you have a typo - write 121 instead of 120:
SELECT * FROM dbo.CardSearchIndexes idx WHERE CONTAINS(idx.Name, '121');

How to implement Full-Text search in multilingual content in SQL Server

We have a site which supports different languages. We have millions of rows, so for search we would like to use SQL Server Full-Text Search.
The table structure we currently have is shown below.
CREATE TABLE Product
(
ID INT IDENTITY(1,1),
Code VARCHAR(50),
........
........
)
CREATE TABLE ProductLanguage
(
ID INT,
LanguageID INT,
Name NVARCHAR(200),
........
........
)
We would like to implement full-text search on the Name column, so we have created a full-text index on it. But while creating a full-text index we can select only one language per column. If we select "English" or "Neutral", it does not return the expected data for other languages like Japanese, Chinese, French, etc.
So what is the best way to implement full-text search in SQL Server for multilingual content?
Do we need to create a different table? If yes, what should the table structure be (keep in mind that the languages are not fixed; other languages can be added later), and what would the search query look like?
We are using SQL Server 2008 R2.
Certain content (document) types support language settings - e.g. Microsoft Office Documents, PDF, [X]HTML, or XML.
If you change the type of your Name column to XML, you can determine the language of each value (i.e. per row). For instance:
Instead of storing values as strings
name 1
name 2
name 3
...you could store them as XML documents with the appropriate language declarations:
<content xml:lang="en-US">name 1</content>
<content xml:lang="fr-FR">name 2</content>
<content xml:lang="en-UK">name 3</content>
During full-text index population the correct word breaker/stemmer will be used, based on the language setting of each value (XML document): US English for name 1, French for name 2, and UK English for name 3.
Of course, this would require a significant change in the way your data is managed and consumed.
ML
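For illustration, a rough sketch of what that conversion could look like for the ProductLanguage table from the question; the NameXml column, the PK_ProductLanguage key index name and the LanguageID-to-culture mapping are all assumptions, not something prescribed above:
-- Add an xml column and wrap each name in a <content> element tagged with its language.
ALTER TABLE ProductLanguage ADD NameXml XML NULL;
GO
UPDATE pl
SET NameXml = CAST(
        N'<content xml:lang="'
        + CASE pl.LanguageID            -- invented mapping; derive this from your language table
              WHEN 1 THEN N'en-US'
              WHEN 2 THEN N'fr-FR'
              WHEN 3 THEN N'ja-JP'
              ELSE N'en-US'
          END
        + N'">' + pl.Name + N'</content>' AS XML)  -- real code must escape &, < and > in Name
FROM ProductLanguage AS pl;
GO
-- The full-text index then goes on the xml column; the XML filter honours the xml:lang
-- attribute and picks the word breaker per row. Assumes a unique key index PK_ProductLanguage.
CREATE FULLTEXT INDEX ON ProductLanguage (NameXml)
    KEY INDEX PK_ProductLanguage;
GO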
I'd be concerned about the performance of using XML instead of NVARCHAR(n) - though I have no hard proof of that.
One alternative could be to use dynamic SQL (generating the language-specific code on the fly), combined with language-specific indexed views on the Product table. The drawback of this is the lack of execution plan caching, i.e. again: performance.
Same idea as Matija Lah's answer, but this is the suggested solution outlined in the MS whitepaper.
When the indexed content is of a binary type (such as a Microsoft Word document), the iFilter responsible for processing the text content before sending it to the word breaker might honor specific language tags in the binary file. When this is the case, at indexing time the iFilter invokes the correct word breaker for a specific document or section of a document specified in a particular language. All you need to do in this case is to verify after indexing that the multilanguage content was indexed correctly. Filters for Word, HTML, and XML documents honor language specification attributes in document content:
Word - language settings
HTML - <meta name="MS.locale" ...>
XML - xml:lang attribute
When your content is plain text, you can convert it to the XML data type and add specific language tags to indicate the language corresponding to that specific document or document section. Note that for this to work, before you index you must know the language that will be used.
https://technet.microsoft.com/en-us/library/cc721269%28v=sql.100%29.aspx

Fulltext search (sql server 2005) works only on some fields

OK, this is the situation.
I am enabling full-text search on a table, but it only works on some fields.
CREATE FULLTEXT CATALOG [defaultcatalog]
CREATE UNIQUE INDEX ui_staticid ON static(id)
CREATE FULLTEXT INDEX ON static(title_gr LANGUAGE 19, title_en, description_gr LANGUAGE 19, description_en) KEY INDEX ui_staticid ON [defaultcatalog] WITH CHANGE_TRACKING AUTO
Now, why does the following bring results:
Select * from static where freetext(description_en, N'str')
while this does not (both fields contain text with str in it):
Select * from static where freetext(description_gr, N'str')
(I have also tried it without the language specification - Greek in this case.)
(The collation of the database is Greek_CI_AS.)
By the way,
Select * from static where description_gr like N'%str%'
works just fine.
All fields are of nvarchar type, and the _gr fields hold English and Greek text (that should not matter).
All help will be greatly appreciated.
Just trying to figure out what's going on: what do you get with this query here?
SELECT * FROM static WHERE FREETEXT(*, N'str')
If you're not explicitly specifying any column to search in - does it give you the expected results?
Another point: I think you have a wrong language ID in your statement. According to SQL Server Books Online:
When specified as a string, language_term corresponds to the alias column value in the syslanguages system table. The string must be enclosed in single quotation marks, as in 'language_term'. When specified as an integer, language_term is the actual LCID that identifies the language.
and from what I found searching around on the internet, the LCID for Greek is 1032 - not 19. Can you try 1032 instead of 19? Does that make a difference?
Marc
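For completeness, a sketch of the original index statement with the Greek LCID swapped in (assuming the key index is the ui_staticid one created above and that [defaultcatalog] exists):
-- Drop the existing full-text index on static first; a table can have only one.
-- Same full-text index as in the question, but with LCID 1032 (Greek) instead of 19.
CREATE FULLTEXT INDEX ON static
(
    title_gr       LANGUAGE 1032,
    title_en,
    description_gr LANGUAGE 1032,
    description_en
)
KEY INDEX ui_staticid ON [defaultcatalog]
WITH CHANGE_TRACKING AUTO;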

Postgres XML datatype

What are the benefits of using the xml data type versus storing the XML content in a text column?
Am I able to query by a specific XML attribute or element?
What about indexing and query performance?
Besides the PostgreSQL manual, what other online sources can you point me to?
Right now the biggest thing you get from xml fields over raw text is XPath. So if you had something like
CREATE TABLE pages (id int, html xml);
you could get the title of page 4 with
SELECT xpath('/html/head/title/text()', html) FROM pages WHERE id = 4;
Right now XML support is fairly limited, but it got a lot better in 8.3; see the current docs for details.
Generally speaking, the benefits are the same as for any other data type, and the same reasons you have data types other than text at all:
Data integrity: you can only store valid (well, well-formed) XML values in columns of type xml.
Type safety: you can only perform operations on XML values that make sense for XML. One example is the xpath() function (XML Path Language), which only operates on values of type xml, not text.
Indexing and query performance characteristics are not better or worse than for, say, the text type at the moment.
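If you do need to look rows up by a specific element or attribute, though, one common workaround is an ordinary B-tree expression index over a value extracted with xpath(). A sketch, assuming the pages table from above; page_title is a made-up helper name:
-- Helper that extracts the page title as plain text. xpath() returns xml[], so take
-- the first element and cast it. Declared IMMUTABLE so it can be used in an index.
CREATE OR REPLACE FUNCTION page_title(page_html xml) RETURNS text AS $$
    SELECT ((xpath('/html/head/title/text()', page_html))[1])::text;
$$ LANGUAGE sql IMMUTABLE;
-- B-tree index over the extracted title.
CREATE INDEX pages_title_idx ON pages (page_title(html));
-- Queries must use the same expression to benefit from the index.
SELECT id FROM pages WHERE page_title(html) = 'Home';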