Enclosing all special characters in XML in CDATA - cdata

I am trying to parse an XML file but I am getting exceptions because the file contains special characters. To make the file parse successfully I have to enclose the values in CDATA. Since there are many special characters at different places in my XML files, is there any way, such as a declaration, to have these fields wrapped in CDATA automatically, which would reduce the manual work every time?

Well, either your parser is broken, or your XML is.
XML has 5 predefined entities, which your parser should support, and whatever generates your XML should use them when appropriate:
&lt; <
&gt; >
&amp; &
&apos; '
&quot; "

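As a rough sketch of what "use the predefined entities" looks like on the generating side, here is a small Python example (Python is only an illustration; the question does not name a language). It uses xml.sax.saxutils from the standard library, and the sample string is made up. The point is that once the generator escapes the text, no CDATA wrapping is needed.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical sample text containing all five special characters.
raw = 'Fish & Chips <"tasty"> isn\'t it'

# escape() handles &, < and > by default; quotes are passed as extra entities.
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(escaped)

# unescape() is the reverse operation, applied when reading the value back.
restored = unescape(escaped, {"&quot;": '"', "&apos;": "'"})
print(restored == raw)
```

A real XML writer or serializer normally does this for you, so calling escape() by hand is only needed when the XML is built through string concatenation.
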
Related

Why does Replace '&amp;' with '&' not work for XML data?

I need to download an XML file whose data is retrieved from a stored procedure.
My problem is that if the data contains any '&' symbol, it shows up in the XML file as
'&amp;'
I have used the REPLACE function in my procedure as shown below but...
SELECT @V_NAME = REPLACE(@V_NAME, ' &amp; ', ' & ');
UPDATE #TMP_RS_XML
SET OBJECT_ID=@V_ID,
FNAME=@V_FILE,
DOCUMENT=(SELECT @V_NAME as 'Description',
...
Now, the output is:
&amp;
This is not the way this is supposed to work...
XML is not just some text with fancy extras; it has very strict rules. As with any text-based container, you need either magic words or special characters to tell the consumer what is content and what is markup.
The most important markup characters in XML are < and >, of course. If you want these characters to be part of your content, you'll have to replace them. That is done with XML entities.
Within the content, any XML entity starts with an ampersand (&lt; comes out as <), therefore the ampersand is the third most important special character. If you want an ampersand within the content you must use an entity (&amp;) as a code for "in this place we want an ampersand".
You must distinguish between the text you see, when you look at the XML and the actual content taken out of the XML.
Try this:
DECLARE @SomeStringWithSpecialCharacters NVARCHAR(200)=N'This & that -> let''s see, why how some foreign characters behave: அரிச். And what about a line break?' + CHAR(13) + CHAR(10) + 'Here is the second line. And an unprintable?' + CHAR(2);
--Here we use FOR XML, all the escaping is done implicitly
SELECT @SomeStringWithSpecialCharacters AS TestIt FOR XML PATH('test');
The result
<test>
<TestIt>This &amp; that -&gt; let's see, why how some foreign characters behave: அரிச். And what about a line break?
Here is the second line. And an unprintable?&#x02;</TestIt>
</test>
Now I take the XML as it came out of the first part and place it into an XML-typed variable.
Attention: I had to remove the &#x02; entity, check it out...
DECLARE @SomeXML XML=
N'<test>
<TestIt>This &amp; that -&gt; let''s see, why how some foreign characters behave: அரிச். And what about a line break?
Here is the second line. And an unprintable?</TestIt>
</test>';
--Now we do the magic using .value() against a native XML:
SELECT @SomeXML.value('(/test/TestIt/text())[1]','nvarchar(max)');
The result comes out with all entities un-escaped again:
This & that -> let's see, why how some foreign characters behave: அரிச். And what about a line break?
Here is the second line. And an unprintable?
The general hint is: Never do the replacements yourself. Pushing content into the XML will need escaping and reading content out of XML will need the opposite. All this is done for you implicitly, when you use the proper tools.
The '&' you see is a special character that is rendered from the entity '&amp;'.
The best practice here would be to decode the XML; see the reference below:
https://learn.microsoft.com/en-us/dotnet/api/system.web.httputility.htmldecode?redirectedfrom=MSDN&view=netframework-4.8#overloads

Getting Unescaped JSON from SQL

I've created a stored procedure to pull data as a JSON object from my SQL Server database. All my data is relational and I'm trying to get it out as a JSON string.
Currently, I am able to get a JSON string out of SQL Server just fine; however, this object ALWAYS includes escape characters (e.g. "{\"field\":\"value\"}"). I'd like to pull the same JSON but without escaped characters. To test this I'm using some simple queries and getting them into .NET with a SqlDataAdapter using my stored procedure.
The thing that puzzles me is that when I run my query within SSMS, I never see any escape characters, but as soon as it's pulled into a .NET application, the escape characters appear. I'd like to prevent this from happening and have my applications get only the unescaped JSON string.
I've tried several suggestions I've found during my research but nothing has produced my desired results. The changes I've seen (documented in MSDN and in other SO posts) have dealt with getting unescaped results, but only within SSMS and not within other applications.
What I've tried:
Simple Json query set to param and then using JSON_QUERY to select the param:
DECLARE @JSON varchar(max)
SET @JSON = (SELECT '{"Field":"Value"}' AS myJson FOR JSON PATH)
SELECT JSON_QUERY(@JSON) AS 'JsonResponse' FOR JSON PATH
This produces the following in a .NET application:
"[{\"JsonResponse\":{\"Field\":\"Value\"}}]"
This produces the following in SSMS:
[{"JsonResponse":[{"myJson":"{\"Field\":\"Value\"}"}]}]
Simple Json query without param using JSON_QUERY:
SELECT JSON_QUERY('{"Field":"Value"}') AS 'JsonResponse' FOR JSON PATH
This produces the following in a .NET application
"[{\"JsonResponse\":{\"Field\":\"Value\"}}]"
This produces the following in SSMS
[{"JsonResponse":{"Field":"Value"}}]
Simple Json query with temp tables using JSON_QUERY:
CREATE TABLE #temp(
jsoncol varchar(255)
)
INSERT INTO #temp VALUES ('{"Field":"Value"}')
SELECT JSON_QUERY(jsoncol) AS 'JsonResponse' FROM #temp FOR JSON PATH
DROP TABLE #temp
This produces the following in a .NET application:
"[{\"JsonResponse\":{\"Field\":\"Value\"}}]"
This produces the following in SSMS:
[{"JsonResponse":{"Field":"Value"}}]
I'm led to believe that there is no way to get a JSON string out of SQL Server without the escaped characters. In case the examples above weren't enough, I've included my stored procedure here. Hopefully someone can point me in the right direction.
This depends where you look at the string...
In SSMS a string is marked with single quotes. The double quote can exist within a string without problems:
DECLARE @SomeString VARCHAR(100) = 'This can include "double quotes" but you have to double ''single quote''';
In a C# application the double quote is the string marker. So the above example would look like this:
string SomeString = "This must escape \"double quotes\" but you can use 'single quote' without problems";
Within your IDE (is it VS?) you can look at the string either as is or as you would need to write it in code. Your example shows a " at the beginning and at the end of the string. That is a clear hint that you are looking at the as-in-code representation. You could take this string and paste it into your code. The real string, which is used and processed, will not contain escape characters.
Hint: Escape characters are only needed in human-readable formats, where there are characters with special meaning (a ; in a CSV, a < in HTML and so on)...
UPDATE Some more explanation
Escape characters are needed to place a string within a string. Somehow you have to mark the beginning and the end of the string, but there is nothing else you can use other than some magic characters.
In order to use these characters within the embedded string you have to go one of the following ways:
escaping (e.g. XML will replace & with &amp; and JSON will replace a " with \" as JSON uses the " to mark its labels) or
Magic borders (e.g. a CDATA-section in XML, which allows to place unescaped characters as is: <![CDATA[forbidden characters &<> allowed here]]>)
Whatever you do, you must distinguish between the visible string in an editor or in a text-based container like XML or JSON and the value the application will pick out of this.
An example:
<root><a>this &amp; that</a></root>
visible string: "this &amp; that"
real value: "this & that"
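The same distinction can be reproduced for JSON. The following Python sketch is only an analogy to what the debugger in the question displays, not the .NET code itself; the object and variable names are made up. It prints the real string and then the same string written as a quoted literal, which is where the backslashes come from.

```python
import json

value = {"Field": "Value"}

# The real string, as an application would process it: no backslashes.
json_text = json.dumps(value)
print(json_text)

# The same string shown as a quoted literal, the way a debugger or
# source code has to write it: every inner " is escaped as \".
as_literal = json.dumps(json_text)
print(as_literal)

# Reading the literal back yields the original string unchanged.
print(json.loads(as_literal) == json_text)
```

If the application showed backslashes in the processed value itself, that would point to the string having been encoded twice, which is a different problem from the display issue described here.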

Illegal characters in path (Chinese characters)

Getting an "Illegal characters in path" with Directory.GetFiles:
files = Directory.GetFiles(folderName & invoiceFile & "*.pdf")
Given the actual values, the filenames would be like so:
x:\folder1\請 010203.pdf
y:\foldera\folderb\請 040506.pdf
z:\xyz\abc\請 119906.pdf
Hence the * wildcard. Can I use Chinese characters with Directory.GetFiles? I think I can since I was able to use it on a separate VBA project before using ChrW(35531) so I think it shouldn't be a problem with .NET. Anyone know a fix for this?
You need to use Directory.GetFiles Method (String, String), like this:
files = Directory.GetFiles(folderName, invoiceFile & "*.pdf")
Note that the folder name and the filter are separate parameters.

Delete the first characters before parse an XML (SAX)

I have two apparently identical XML files named wrong.xml and good.xml.
The content is as follows:
<?xml version="1.0" encoding="utf-16"?>
<tag>
</tag>
The problem is that the XMLReader class (org.xml.sax.XMLReader) reports the following error when parsing wrong.xml:
Content is not allowed in prolog
The reason is that there are hidden characters before the prolog.
I only managed to see these characters using a basic Java file reader, which shows that the first and second characters are -1 and -2.
'-1''-2'<?xml version>......
Notepad, Ultraedit32, Wordpad, Notepad++, etc. cannot show them either.
My real problem is that I need to read the XML from an FTP server automatically, so I need a way to delete these characters before parsing with XMLReader, without reading the whole document first, because some documents are very big.
How do I delete the first characters of a file?
You'll have to remove those characters before the parser sees them, but you don't need to read the whole file and write it back out again with those characters removed.
A sax parser can read from an InputSource based on a Reader. There are many implementations of this Reader interface for reading from a file, url or other data source, but you can also wrap whatever your primary Reader is in a FilterReader extension that you code to perform changes needed to the data before it goes on.
It isn't difficult to code an extension of FilterReader that drops the first two characters but passes on everything else, and that will do just what you need. If the need to drop those characters isn't known until runtime, but can be detected then in a sensible way, the filter can be written to drop them only when needed. It might make sense to drop all characters before the first '<'.
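To illustrate the filtering idea, here is a minimal sketch in Python rather than Java (so it is an analogy to a FilterReader subclass, not the actual class). The name LeadingJunkFilter is hypothetical. It wraps a text stream and drops everything before the first '<' without reading the whole document up front. As a side note, the values -1 and -2 correspond to the bytes 0xFF 0xFE, which is the UTF-16 little-endian byte order mark, so opening the stream with the right encoding may also be worth a try.

```python
import io

class LeadingJunkFilter:
    """Wraps a text stream and drops every character before the first '<'."""

    def __init__(self, reader):
        self._reader = reader
        self._started = False

    def read(self, size=-1):
        if size == 0:
            return ""
        if self._started:
            return self._reader.read(size)
        # Consume one character at a time until the first '<' or end of input.
        while True:
            ch = self._reader.read(1)
            if ch == "":
                self._started = True
                return ""
            if ch == "<":
                self._started = True
                break
        if size is None or size < 0:
            return ch + self._reader.read()
        return ch + self._reader.read(size - 1) if size > 1 else ch

# Usage with a made-up in-memory document that has two junk characters in front.
source = io.StringIO('\ufeff\u0000<?xml version="1.0"?><tag></tag>')
filtered = LeadingJunkFilter(source)
head = filtered.read(5)
rest = filtered.read()
print(head + rest)
```

Only the junk prefix is inspected character by character; after the first '<' every read is passed straight through, so large documents are still streamed.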

Trying to parse non well-formed XML using NSXMLParser

I am parsing XML data using NSXMLParser and I have now noticed that the elements can contain ALL characters, including for example a &. Since the parser gives an error when it comes across this character, I replaced every occurrence of it.
Now I want to make sure I handle every character that may cause errors.
What are they and how do you think I should handle these characters best?
Thanks in advance!
To answer half your question, XML has 5 special characters that you may want to escape:
< -- replace with &lt;
> -- replace with &gt;
& -- replace with &amp;
' -- replace with &apos;
and
" -- replace with &quot;
Now, for the other half--how to find and replace these without also replacing all the tags, etc... Not easy, but I'd look into regular expressions and NSRegularExpression: http://developer.apple.com/library/ios/#documentation/Foundation/Reference/NSRegularExpression_Class/Reference/Reference.html
Remember, depending on your use case, to escape the values of the parameters on tags, too; <tag parameter="with &quot;quotes&quot;" />
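As a sketch of the regular expression approach for the ampersand case only, here is a Python illustration (not NSRegularExpression code; the pattern should carry over, but that is an assumption). The function name and the sample document are made up. It escapes an & only when it does not already start one of the five predefined entities or a numeric character reference.

```python
import re
import xml.etree.ElementTree as ET

# An '&' that is NOT followed by a predefined entity name or a numeric reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|apos|quot|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(xml_text):
    """Replace stray '&' characters with '&amp;' and leave existing entities alone."""
    return BARE_AMPERSAND.sub("&amp;", xml_text)

broken = "<shop><name>Fish & Chips &amp; more</name></shop>"
fixed = escape_bare_ampersands(broken)
print(fixed)

# After the repair a strict parser accepts the document.
name = ET.fromstring(fixed).findtext("name")
print(name)
```

A stray < or > inside content is much harder to repair this way, because a regular expression cannot reliably tell it apart from real markup; fixing the producer of the XML remains the better option where that is possible.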
You should encode these characters, for instance & becomes &amp; or " becomes &quot;.
When it goes through the parser it should come out ok. Your other option is to use a different XML parser like TBXML which doesn't do format checking.