What are the benefits of using the "xml" datatype versus storing the xml content inside a "text" datatype?
Am I able to query by some specific xml attribute or element?
What about indexing and query performance?
Besides the postgresql manual what other online sources can you point me to?
Right now the biggest thing you get from XML fields over raw text is XPath. So if you had something similar to
CREATE TABLE pages (id int, html xml);
you could get the title of page 4 by
SELECT xpath('/html/head/title/text()', html) FROM pages WHERE id = 4;
Right now XML support is fairly limited, but it got a lot better in 8.3; the current documentation covers the available XML functions.
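To the question about querying by a specific attribute or element: xpath() covers that as well. A minimal sketch, reusing the pages table above and assuming a lang attribute on the <html> element (the attribute is made up for illustration):
-- xpath() returns an array of xml values; take the first match and cast it to text
SELECT id
FROM pages
WHERE (xpath('/html/@lang', html))[1]::text = 'en';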
Generally speaking, the benefits are the same as for any other data type, and the same reasons you have data types other than text at all:
Data integrity: You can only store valid (well, well-formed) XML values in columns of type xml (see the short sketch below).
Type safety: You can only perform operations on XML values that make sense for XML. One example is the xpath() function (XML Path Language), which only operates on values of type xml, not text.
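A minimal sketch of the data-integrity point (the table and values are made up): the xml type rejects input that is not well-formed, where a plain text column would accept it without complaint.
CREATE TABLE docs (id int, body xml);
INSERT INTO docs VALUES (1, '<note>ok</note>');        -- accepted: well-formed XML
INSERT INTO docs VALUES (2, '<note>missing close');    -- rejected: not well-formed XML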
Indexing and query performance characteristics are currently no better or worse than for, say, the text type.
We have a site which supports different languages. We have millions of rows of data, so for searching we would like to implement SQL Server Full-Text Search.
The table structure we currently have is shown below.
CREATE TABLE Product
(
ID INT IDENTITY(1,1),
Code VARCHAR(50),
........
........
)
CREATE TABLE ProductLanguage
(
ID INT,
LanguageID INT,
Name NVARCHAR(200),
........
........
)
We would like to implement Full-Text search on the "Name" column, so we have created a Full-Text index on it. But while creating the Full-Text index we can select only one language per column. If we select "English" or "Neutral", it does not return the expected data in other languages such as Japanese, Chinese, French, etc.
So what is the best way to implement Full-Text search in SQL Server for multilingual content?
Do we need to create a different table? If yes, what should the table structure be (keeping in mind that the languages are not fixed; more languages can be added later), and what would the search query look like?
We are using SQL Server 2008 R2.
Certain content (document) types support language settings - e.g. Microsoft Office Documents, PDF, [X]HTML, or XML.
If you change the type of your Name column to XML, you can determine the language of each value (i.e. per row). For instance:
Instead of storing values as strings
name 1
name 2
name 3
...you could store them as XML documents with the appropriate language declarations:
<content xml:lang="en-US">name 1</content>
<content xml:lang="fr-FR">name 2</content>
<content xml:lang="en-UK">name 3</content>
During full-text index population the correct word breaker/stemmer will be used, based on the language settings of each value (XML document): US English for name 1, French for name 2, and UK English for name 3.
Of course, this would require a significant change in the way your data is managed and consumed.
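As a rough illustration of what that change could involve, a sketch along these lines might work. It is only a sketch: the NameXml column, the Language table and its Culture column are assumptions, and it also assumes the names contain no characters that would need XML escaping (&, <, >).
-- Add an XML column and wrap each name in a <content> element with its language tag
ALTER TABLE ProductLanguage ADD NameXml XML;

UPDATE pl
SET NameXml = CAST(N'<content xml:lang="' + l.Culture + N'">' + pl.Name + N'</content>' AS XML)
FROM ProductLanguage pl
JOIN [Language] l ON l.ID = pl.LanguageID;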
ML
I'd be concerned about the performance of using XML instead of NVARCHAR(n) - though I have no hard proof for it.
One alternative could be to use dynamic SQL (generating the language-specific code on the fly), combined with language-specific indexed views on the Product table. The drawback of this is the lack of execution plan caching, i.e. again: performance.
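A rough sketch of what one such language-specific indexed view might look like; the view name, the LanguageID value, the French LCID, the assumption that ID is unique per language, and the existence of a default full-text catalog are all illustrative.
-- One view per language, each with its own full-text index using that language's word breaker
CREATE VIEW dbo.ProductName_FR
WITH SCHEMABINDING
AS
SELECT ID, Name
FROM dbo.ProductLanguage
WHERE LanguageID = 2;            -- assuming 2 = French
GO
CREATE UNIQUE CLUSTERED INDEX IX_ProductName_FR ON dbo.ProductName_FR (ID);
GO
CREATE FULLTEXT INDEX ON dbo.ProductName_FR (Name LANGUAGE 1036)   -- 1036 = French LCID
KEY INDEX IX_ProductName_FR;
GO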
Same idea as Matija Lah's answer, but this is the suggested solution outlined in the MS whitepaper.
When the indexed content is of binary type (such as a Microsoft Word document), the iFilter responsible for processing the text content before sending it to the word breaker might honor specific language tags in the binary file. When this is the case, at indexing time the iFilter invokes the correct word breaker for a specific document or section of a document specified in a particular language. All you need to do in this case is to verify after indexing that the multilanguage content was indexed correctly. Filters for Word, HTML, and XML documents honor language specification attributes in document content:
Word – language settings
HTML – <meta name="MS.locale" …>
XML – xml:lang attribute
When your content is plain text, you can convert it to the XML data type and add specific language tags to indicate the language corresponding to that specific document or document section. Note that for this to work, before you index you must know the language that will be used.
https://technet.microsoft.com/en-us/library/cc721269%28v=sql.100%29.aspx
How long does an nvarchar field need to be before it is better to use a text field in SQL Server? What are the general indications for using one or the other for textual content that may or may not be queried?
From what I understand, the TEXT datatype should never be used in SQL 2005+. You should start using VARCHAR(MAX) instead.
See this question about VARCHAR(MAX) vs. TEXT.
UPDATE (per comment):
This blog does a good job at explaining the advantages. Taken from it:
But the pain from using the type text comes in when trying to query against it. For example grouping by a text type is not possible.
Another downside to using text types is increased disk IO due to the fact each record now points to a blob (or file).
So basically, VARCHAR(MAX) keeps the data with the record, and gives you the ability to treat it like other VARCHAR types, like using GROUP BY and string functions (LEN, CHARINDEX, etc.).
For TEXT, you almost always have to convert it to VARCHAR to use functions against it.
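A small T-SQL sketch of that difference; the table and column names are made up.
CREATE TABLE notes_old (body TEXT);
CREATE TABLE notes_new (body VARCHAR(MAX));

-- Works: VARCHAR(MAX) supports GROUP BY and the usual string functions
SELECT body, LEN(body) FROM notes_new GROUP BY body;

-- Fails: TEXT cannot be grouped on or passed to LEN() directly;
-- it has to be converted first, e.g. LEN(CAST(body AS VARCHAR(MAX)))
-- SELECT body, LEN(body) FROM notes_old GROUP BY body;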
But back to the root of your question regarding efficiency, I don't think it's ever more efficient to use TEXT vs. VARCHAR(MAX). Looking at this MSDN article (search for "data types"), TEXT is deprecated, and should be replaced with VARCHAR(MAX).
First of all don't use text at all. MSDN says:
ntext, text, and image data types will be removed in a future version of Microsoft SQL Server. Avoid using these data types in new development work, and plan to modify applications that currently use them. Use nvarchar(max), varchar(max), and varbinary(max) instead.
varchar(max) is what you might need.
If you compare varchar(n) vs varchar(max), these are technically two different datatypes (stored differently):
A varchar(n) value is always stored inside the row, which means it cannot be greater than the maximum row size, and a row cannot be greater than the page size, which is 8K.
varchar(max) is stored outside the row: the row holds a pointer to a separate BLOB page. However, under certain conditions varchar(max) can store data in the row itself, provided the value fits within the row size.
So if your row is potentially greater than 8K, you have to use varchar(max). If not, using varchar(n) will likely be preferable, as it is faster to retrieve in-row data than data from a separate page.
MSDN says:
Use varchar(max) when the sizes of the column data entries vary considerably, and the size might exceed 8,000 bytes.
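A minimal illustration of that rule of thumb (the table, column names, and sizes are made up):
CREATE TABLE Article
(
    Slug VARCHAR(200),   -- small and bounded: stored in-row, as described above
    Body VARCHAR(MAX)    -- may exceed 8,000 bytes: can be pushed to separate BLOB pages
);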
The main advantage of VARCHAR over TEXT is that you can run string manipulations and string functions on it. With VARCHAR(MAX), you basically have a very large (effectively unrestricted) variable that you can manipulate however you want.
I have an XML file and would like to run a search on the nodes for text that matches user input. My options are:
Convert the XML file to a SQL table and run the search against the table records.
Search the XML nodes themselves.
The problem is that I cannot find an open-source conversion utility, nor can I figure out how to search the XML nodes.
I can use PHP, Ruby, or Python for the search code.
Any pointers on how can I do 1 or 2?
Thanks
For #2, define an XPath expression that corresponds to the search to perform, then use one of the many XML bindings to apply it to the XML document.
What data type should I use for data that can be very short, e.g. an HTML link (think Twitter), or very long, e.g. an HTML blog post (think WordPress)?
I am thinking that if I use varchar(4000), it may be too short for an HTML-formatted blog entry, but if I use text, will it take up more space and be less efficient?
[update]
I am still considering using MySQL (if PHP 5.3/Zend Framework) or MSSQL (if ASP.NET MVC 2).
MySQL also has a Text data type for storing an arbitrarily large amount of text. You can find more here: The BLOB and TEXT Types
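For reference, a minimal MySQL sketch (the table and column names are made up); the TEXT family members (TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT) differ only in their maximum length.
CREATE TABLE post
(
    id   INT PRIMARY KEY,
    url  VARCHAR(2048),  -- short content such as a link
    body MEDIUMTEXT      -- long content such as a full blog entry (up to 16 MB)
);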
If you are using Microsoft SQL Server 2008 you can use varchar(max).
Edit:
Text is also available, but it isn't searchable without full-text indexing.
I have a SQL full-text catalog on a CMS database (SQL 2005). The database holds the CMS page content within an ntext column which is part of the full-text catalog. As expected, the searching takes into account the XML tags within the page content, so searching for "H1" returns all the pages with H1 tags.
Is it possible to apply filters within the full-text search so that only the data within the XML tags is indexed?
I can see it is possible for SQL full-text search to index/search .html binary types or xml columns. However, as you can see, the setup here is slightly different.
Many Thanks,
Adam
Unfortunately, you can't change away from the default "text" iFilter on a text/varchar ntext/nvarchar column.
If you can't change the data type of the column to varbinary, your next-best bet might be to add the HTML tag names as stop words, so they get ignored during indexing and searching.
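On SQL Server 2008 or later, stop words are managed as stoplists in T-SQL, so a sketch of that idea could look like the one below; the stoplist name, the chosen tag words, and the dbo.PageContent table are just examples (on SQL Server 2005 you would edit the noise word files instead).
-- Start from the system stoplist and add the HTML tag names to ignore
CREATE FULLTEXT STOPLIST HtmlTagStoplist FROM SYSTEM STOPLIST;
ALTER FULLTEXT STOPLIST HtmlTagStoplist ADD 'h1' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST HtmlTagStoplist ADD 'div' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST HtmlTagStoplist ADD 'span' LANGUAGE 'English';
-- Attach the stoplist to the full-text index on the content table
ALTER FULLTEXT INDEX ON dbo.PageContent SET STOPLIST HtmlTagStoplist;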
I should add that ntext has been deprecated, so you will need to move away from it eventually anyway.