SQL Full Text search on HTML/XML data - sql

I have a SQL full-text catalog on a CMS database (SQL Server 2005). The database holds the CMS page content in an ntext column, which is part of the full-text catalog. As expected, the search takes the XML tags within the page content into account, so searching for "H1" returns all the pages with H1 tags.
Is it possible to apply filters within the full-text search so that only the data within the XML tags is indexed?
I can see it is possible for SQL full text search to index/search .html binary types or xml columns. However as you can see the setup is slightly different to this.
Many Thanks,
Adam

Unfortunately, you can't switch away from the default "text" iFilter on a text/varchar or ntext/nvarchar column.
If you can't change the data type of the column to varbinary, your next-best bet might be to add the HTML tag names as stop words, so they get ignored during indexing and searching.
I should add that ntext has been deprecated, so you will need to move away from it eventually anyway.
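To make the stop-word idea a bit more concrete, here is a minimal sketch (in Python, not T-SQL) that collects the distinct tag names appearing in a sample of page content, as a starting point for a stop-word list. The names TagNameCollector, candidate_stop_words and sample_pages are made up for illustration; how the resulting words get registered with SQL Server is not shown.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Records the name of every start tag seen in the fed markup."""

    def __init__(self):
        super().__init__()
        self.tag_names = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        self.tag_names.add(tag)


def candidate_stop_words(pages):
    """Return the sorted, distinct tag names found across all page bodies."""
    names = set()
    for body in pages:
        collector = TagNameCollector()
        collector.feed(body)
        collector.close()
        names |= collector.tag_names
    return sorted(names)


# Hypothetical sample of CMS page content.
sample_pages = [
    "<h1>Welcome</h1><p>Intro text</p>",
    "<div><H1>News</H1><p>More <b>text</b></p></div>",
]
print(candidate_stop_words(sample_pages))
```

Keep in mind that stop words apply to ordinary text as well, so a tag name such as "b" or "title" added to the list would also be ignored when it occurs in real page content.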

Related

SOLR: Get Full Text Content for document that matches query

I have a SOLR instance and wish to extract out the full text content that was indexed within the instance. Is this possible?
If there is a query type I can use to fetch the full text content from the instance, I'd be grateful if someone could point me to it!
Well, it turns out that the SOLR schema must specify stored="true" on the field that holds the full text in order for a query to fetch all of the content in that field.
My schema specified the opposite, which means that this text is not retrievable (though it is searchable, as the schema specifies indexed="true").
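Once the field is stored, the content can be requested by listing the field in the fl (field list) parameter of a select query. A minimal sketch of building such a request URL, assuming a local instance, a core called "mycore" and a field called "content" (all three are placeholders, not taken from the question):

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, fields):
    """Build a Solr select URL that asks for the given stored fields."""
    params = urlencode({
        "q": query,               # the search query
        "fl": ",".join(fields),   # only stored fields can be returned here
        "wt": "json",             # response format
    })
    return f"{base_url}/solr/{core}/select?{params}"


# Placeholder host, core and field names.
url = build_select_url("http://localhost:8983", "mycore", "id:42", ["id", "content"])
print(url)
```

If the field is indexed but not stored, the same request still matches documents, but the field is simply absent from the returned documents.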

Efficient way to store HTML in database

I have a text editor on a web page. It offers functions like bold, italics and highlight, so a text may contain any of these. It may even contain numbered or unnumbered lists.
The text editor generates HTML for the formatted text.
Because of this, the formatted text data (HTML) is at least 60% larger than the unformatted text would have been.
This consumes a lot of space (in terms of characters), which leads to a space-hungry database.
Is there a way to compress this, or some other way to store it efficiently?
There is no built-in compression function in Db2, but you can write your own external functions (using Java or C/C++) to implement such functionality. I can provide a Java example (using the java.util.zip package) of such an implementation, if you are interested.
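As a rough illustration of the compress/decompress logic such a function would wrap, here is a sketch using Python's zlib module rather than java.util.zip (both are DEFLATE-based). This only shows the round trip on the application side; it is not a Db2 external function, and the function names are made up.

```python
import zlib


def compress_html(html):
    """Compress HTML text to bytes suitable for a binary column."""
    return zlib.compress(html.encode("utf-8"), 9)


def decompress_html(blob):
    """Restore the original HTML text from the compressed bytes."""
    return zlib.decompress(blob).decode("utf-8")


# Repetitive markup, as a rich-text editor tends to produce.
sample = "<ul>" + "<li><b>item</b> <i>text</i></li>" * 50 + "</ul>"
blob = compress_html(sample)
print(len(sample.encode("utf-8")), "bytes ->", len(blob), "bytes")
assert decompress_html(blob) == sample
```

The trade-off is that the compressed value is opaque to the database: it can no longer be searched with LIKE or indexed as text, and very short strings may not shrink at all.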
Another way is to use Db2 row compression. Db2 can compress any non-LOB columns and so-called "inlined" LOBs (see "Storing LOBs inline in table rows" in the Db2 documentation).
If you store your data as XML in a Db2 XML data-type column, it will be stored in a more efficient form than raw text.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.db2.luw.xml.doc/doc/c0022770.html

How to implement Full-Text search in multilingual content in SQL Server

We have a site which supports different languages. We have millions of rows, so for search we would like to use SQL Server Full-Text Search.
The table structure we have currently like below.
CREATE TABLE Product
(
ID INT IDENTITY(1,1),
Code VARCHAR(50),
........
........
)
CREATE TABLE ProductLanguage
(
ID INT,
LanguageID INT,
Name NVARCHAR(200),
........
........
)
We would like to implement Full-Text search on the "Name" column, so we have created a Full-Text index on it. But while creating a Full-Text index we can select only one language per column. If we select "English" or "Neutral", it does not return the expected data in other languages like Japanese, Chinese, French etc.
So what is the best way to implement Full-Text search in SQL Server for multilingual content?
Do we need to create a different table? If yes, what would the table structure be (keeping in mind that the languages are not fixed and different languages can be added later), and what would the search query look like?
We are using SQL Server 2008 R2.
Certain content (document) types support language settings - e.g. Microsoft Office Documents, PDF, [X]HTML, or XML.
If you change the type of your Name column to XML, you can determine the language of each value (i.e. per row). For instance:
Instead of storing values as strings
name 1
name 2
name 3
...you could store them as XML documents with the appropriate language declarations:
<content xml:lang="en-US">name 1</content>
<content xml:lang="fr-FR">name 2</content>
<content xml:lang="en-GB">name 3</content>
During Full-text index population the correct word breaker/stemmer will be used, based on the language settings of each value (XML document): US English for name 1, French for name 2, and UK English for name 3.
Of course, this would require a significant change in the way your data is managed and consumed.
ML
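For what it's worth, the wrapping step described in the answer above could be sketched like this on the application side, before the value is written to the XML column. This is only an illustration in Python; the function name wrap_name is made up, and the language tag has to be known for each row.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_name(name, lang_tag):
    """Wrap a plain-text name in a <content> element carrying xml:lang."""
    # escape() handles &, < and > in the text; quoteattr() quotes the attribute.
    return f"<content xml:lang={quoteattr(lang_tag)}>{escape(name)}</content>"


print(wrap_name("name 1", "en-US"))
print(wrap_name("name 2", "fr-FR"))
```

Escaping matters here because product names containing characters such as & or < would otherwise produce XML that is not well-formed.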
I'd be concerned about the performance of using XML instead of NVARCHAR(n) - though I have no hard proof for it.
One alternative could be to use dynamic SQL (generating the language-specific code on the fly), combined with language-specific indexed views on the Product table. The drawback of this is the lack of execution plan caching, i.e. again: performance.
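A rough sketch of what "generating the language-specific code on the fly" might look like, assuming one indexed view per language, each with its own full-text index. The view names and the language-code keys are invented for the example; the numbers are the SQL Server LCIDs for English (1033), French (1036) and Japanese (1041). The sketch only builds the statement text and does not execute anything.

```python
# Hypothetical mapping: language code -> (per-language indexed view, LCID).
LANGUAGE_VIEWS = {
    "en": ("dbo.vProductName_en", 1033),
    "fr": ("dbo.vProductName_fr", 1036),
    "ja": ("dbo.vProductName_ja", 1041),
}


def build_search_sql(language_code):
    """Return a CONTAINS query against the view for the given language."""
    try:
        view_name, lcid = LANGUAGE_VIEWS[language_code]
    except KeyError:
        raise ValueError(f"unsupported language: {language_code!r}") from None
    # Only whitelisted names are interpolated; the search term stays a parameter.
    return (
        f"SELECT ID, Name FROM {view_name} "
        f"WHERE CONTAINS(Name, @term, LANGUAGE {lcid})"
    )


print(build_search_sql("fr"))
```

Looking the view name up in a fixed mapping, rather than concatenating user input, keeps the dynamic part of the statement safe; adding a language later means adding a view and one entry in the mapping.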
Same idea as Matija Lah's answer, but this is the suggested solution outlined in the MS whitepaper.
When the indexed content is of binary type (such as a Microsoft Word
document), the iFilter responsible for processing the text content
before sending it to the word breaker might honor specific language
tags in the binary file. When this is the case, at indexing time the
iFilter invokes the correct word breaker for a specific document or
section of a document specified in a particular language. All you need
to do in this case is to verify after indexing that the multilanguage
content was indexed correctly. Filters for Word, HTML, and XML
documents honor language specification attributes in document content:
Word - language settings
HTML - <meta name="MS.locale"…>
XML - xml:lang attribute
When your content is plain text, you can convert it to the XML data type and add specific language tags to indicate the language corresponding to that specific document or document section. Note that for this to work, before you index you must know the language that will be used.
https://technet.microsoft.com/en-us/library/cc721269%28v=sql.100%29.aspx

What data type to use for variable length data (for performance)?

What data type should I use for data that can be very short, e.g. an HTML link (think Twitter), or very long, e.g. an HTML blog post (think WordPress)?
I am thinking that if I use varchar(4000), it may be too short for an HTML-formatted blog entry, but if I use text, will it take up more space and be less efficient?
[update]
I am still considering using MySQL (if PHP 5.3/Zend Framework) or MSSQL (if ASP.NET MVC 2).
MySQL also has a Text data type for storing an arbitrarily large amount of text. You can find more here: The BLOB and TEXT Types
If you are using Microsoft SQL Server 2008 you can use varchar(max).
Edit:
Text is also available, but it isn't searchable without full-text indexing.

Postgres XML datatype

What are the benefits of using the "xml" datatype versus storing the xml content inside a "text" datatype?
Am I able to query by some specific xml attribute or element?
What about indexing and query performance?
Besides the postgresql manual what other online sources can you point me to?
Right now the biggest thing you get from XML fields over raw text is XPath. So if you had something similar to
CREATE TABLE pages (id int, html xml);
you could get the title of page 4 by
SELECT xpath('/html/head/title/text()', html) FROM pages WHERE id = 4;
Right now XML support is fairly limited, but it got a lot better in 8.3; see the current PostgreSQL documentation for details.
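If it helps to see what that path expression selects without a database at hand, here is a small sketch using Python's standard library instead of PostgreSQL. The sample page is made up, and ElementTree only supports a subset of XPath, so this is an analogy rather than an equivalent of the xpath() function.

```python
import xml.etree.ElementTree as ET

# Made-up page content standing in for the "html" column of row 4.
page = "<html><head><title>Hello</title></head><body><p>Hi</p></body></html>"

root = ET.fromstring(page)           # root is the <html> element
title = root.findtext("head/title")  # path is relative to the root element
print(title)
```

One difference worth knowing: in PostgreSQL, xpath() returns an array of xml values (one entry per matching node), not a single string.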
Generally speaking, the benefits are the same ones as for any other data type and why you have data types other than text at all:
Data integrity: You can only store valid (well, well-formed) XML values in columns of type xml.
Type safety: You can only perform operations on XML values that make sense for XML.
One example is the xpath() function (XML Path Language), which only operates on values of type xml, not text.
Indexing and query performance characteristics are no better or worse than for, say, the text type at the moment.
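To illustrate the data integrity point, this is roughly the kind of check a text column does not give you and an xml column does. The sketch uses Python's standard XML parser, and is_well_formed is a made-up helper name; it is an approximation, since PostgreSQL's xml type can, depending on settings, also accept content fragments that this check would reject.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<page><title>ok</title></page>"))   # properly nested
print(is_well_formed("<page><title>broken</page>"))       # mismatched tags
```

With a text column, the second value would be stored without complaint and the problem would only surface later, when something tries to parse it.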