Beautiful Soup XML parsing safety - beautifulsoup

How dangerous is it to parse untrusted XML with Beautiful Soup?
defusedxml seems to be the go-to module for parsing untrusted XML; however, it chokes on malformed XML. I want the leniency of Beautiful Soup, but is it as trustworthy as defusedxml?
List of XML vulnerabilities, for reference:
https://docs.python.org/3.8/library/xml.html#xml-vulnerabilities

Related

What's the meaning of 'soup' in jsoup and Beautiful Soup?

What's the meaning of "soup" in jsoup and Beautiful Soup, and why is it called "soup"?
It's BeautifulSoup, and is named after so-called 'tag soup', which refers to
"syntactically or structurally incorrect HTML written for a web page", from the Wikipedia definition.
jsoup is the Java version of Beautiful Soup.
According to Wikipedia, "Beautiful Soup is a Python library for parsing HTML documents (including having malformed markup, i.e. non-closed tags, so named after Tag soup)."
Both were named after tag soup.
Reference : http://en.wikipedia.org/wiki/Beautiful_Soup
Beautiful Soup is used for web scraping and is a great tool for extracting information from large amounts of unstructured data. As a Python library for pulling data out of HTML, XML, and other markup files, Beautiful Soup can extract articles and other content and turn them into a Python list or dictionary.

How to parse xml file for cocos2D

I am trying to parse an XML document in the resource folder of my project. The project is a cocos2D project, and I am trying to write a method that parses the XML file to load levels for my game.
What would be the best way to parse the document? An example would be great.
You can use any Objective-C XML parser in cocos2d.
I suggest you use TBXML.
But instead of keeping the data in XML, you should save it in a plist and use it directly.
If
1) the speed of parsing is important,
2) the convenience of retrieving data from the XML file is important, and
3) you do not mind a C++ interface,
then the best parser for you is the open-source pugixml:
http://code.google.com/p/pugixml/
pugixml is a light-weight C++ XML processing library. It features:
• DOM-like interface with rich traversal/modification capabilities
• Extremely fast non-validating XML parser which constructs the DOM tree from an XML file/buffer
• XPath 1.0 implementation for complex data-driven tree queries
• Full Unicode support with Unicode interface variants and automatic encoding conversions
The library is extremely portable and easy to integrate and use.
I have tested both its speed and memory usage against other parsers and have used it for the last 5 years.
Documentation is here: http://pugixml.googlecode.com/svn/tags/latest/docs/quickstart.html
Even though the post is old, I am writing the solution here to help other people like me in the future. The solution is simple:
- Step 1: Read the file content using FileUtils
- Step 2: Parse file content to tinyxml2 (supported
I have tried this and successfully loaded and read an XML file in cocos2d 3.8:
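#include <string>
#include "cocos2d.h"
#include "tinyxml2/tinyxml2.h"

// Read the whole file into a string through FileUtils.
std::string s = cocos2d::FileUtils::getInstance()->getStringFromFile("data.xml");
// A stack-allocated document is released automatically when it goes out of scope.
tinyxml2::XMLDocument doc;
tinyxml2::XMLError errorId = doc.Parse(s.c_str(), s.size());
if (errorId == tinyxml2::XML_SUCCESS) {
    // Write your code here
}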
data.xml is placed in Resource folder.
Please read http://www.cocos2d-x.org/docs/manual/framework/native/v3/xml-parse/zhparse/zh to learn how to iterate over a tinyxml2 XMLDocument.
Note that you cannot use the tutorial at this link to load from a file. It always returns ERROR_FILE_NOT_FOUND.

Difference between TouchXML and GDataXML parser

I have two options in front of me for parsing a really fat XML file:
TouchXML
GDataXML
It's a lot of work because the XML file is huge, so I thought I would ask people who have already worked with these parsers.
Which one is better for fat XML files?
I found a blog post which says that TouchXML does not edit/save XML files whereas GDataXML has that feature. What exactly do they mean by edit/save XML file feature?
Let's see if I can answer your questions:
Which one is better for fat XML files? The answer is neither. Both are DOM parsers, which actually load the entire document into memory to make queries faster. If you're parsing a large file, you're better off going with a SAX parser, such as the built-in NSXMLParser, or even the SAX-based version of libxml2.
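To illustrate the SAX idea, here is a minimal sketch in Python rather than Objective-C, using the standard library's xml.sax module. The handler gets a callback per element, much like NSXMLParser's delegate methods, and no document tree is ever built. The ItemCounter class, the count_items helper, and the <item> element name are made up for this example.

```python
import io
import xml.sax


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements as they stream past, without building a tree."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for every opening tag the parser encounters.
        if name == "item":
            self.count += 1


def count_items(stream):
    """Parse a file-like object incrementally and return the <item> count."""
    handler = ItemCounter()
    xml.sax.parse(stream, handler)
    return handler.count


# Hypothetical sample input; a real "fat" file would be opened from disk instead.
sample = b"<items><item id='1'/><item id='2'/><other/></items>"
print(count_items(io.BytesIO(sample)))
```

Because the handler only keeps a counter, memory use stays roughly constant no matter how large the input is, which is the trade-off against a DOM parser's convenient in-memory tree.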
What exactly do they mean by edit/save XML file feature? Well, suppose you have an XML file that has your app's settings in it. If you open up that file and make changes, you're going to want to save them, right? That's where the writing comes in. The parsers that allow writing let you save the in-memory representation of the XML file back into an actual file on disk.
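As a rough sketch of what "edit and save" means, the Python example below (standard library xml.etree.ElementTree, not TouchXML or GDataXML) loads a settings document, changes one value in memory, and serializes the tree back to text. The set_volume helper and the <settings>/<volume> element names are invented for illustration.

```python
import xml.etree.ElementTree as ET


def set_volume(xml_text, new_value):
    """Load the XML, change <volume> in memory, and serialize it again."""
    root = ET.fromstring(xml_text)
    node = root.find("volume")
    if node is None:
        raise ValueError("no <volume> element found")
    node.text = str(new_value)
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


# Hypothetical settings document.
settings = "<settings><volume>3</volume></settings>"
print(set_volume(settings, 7))
```

To persist the result, the returned string can be written to a file, or the tree can be saved directly with ET.ElementTree(root).write(path). A read-only parser gives you the first two steps but not the serialization step.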

Where can I find the DTD or XML Schema of surefire generated XML (TEST-<testname>.xml) file?

Where can I find the DTD or XML Schema of the surefire-generated XML (TEST-<testname>.xml) file?
I think that Maven is using the XML result format "owned" by Ant and I am not sure there is an official DTD or Schema. From JUnit 4 XML schematized? on the JUnit-user list:
There's a pretty standard format for JUnit XML output. You've
probably seen it: there's a <testsuite> root element containing
zero-or-more <testcase> elements, each of which may contain a
<failure> or <error> element. (There's also a <properties> element
with zero-or-more <property> elements.) A lot of tools know how to
read this format and report on it, including at least Ant, Maven,
Cruise Control, Hudson, Bamboo, Eclipse and IntelliJ IDEA.
Is this XML standardized anywhere in a DTD or XML Schema or something?
If there isn't a standard, could we go about making a standard and
blessing it? (Perhaps JUnit 4.5 could include an XMLReporter that
could be a reference implementation.)
In particular, I'm curious to know how one would represent that a test
has been ignored in "standard" JUnit XML.
I never found such a "standard". And
imo it is a good idea as you may
enhance the report by using your own
information and still have the tools
working. But if you go with a
DTD/schema and validation then this
will stop working.
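The structure described in the quoted message can be read with a few lines of Python. This is only a sketch under assumptions: the sample report below is hand-written, and the attribute names in it (name, classname, tests, message) are common in such reports but are not taken from any schema. The summarize helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample following the <testsuite>/<testcase> layout described above.
REPORT = """\
<testsuite name="ExampleTest" tests="3">
  <properties>
    <property name="java.version" value="1.8"/>
  </properties>
  <testcase classname="ExampleTest" name="passes"/>
  <testcase classname="ExampleTest" name="fails">
    <failure message="expected 1 but was 2"/>
  </testcase>
  <testcase classname="ExampleTest" name="breaks">
    <error message="NullPointerException"/>
  </testcase>
</testsuite>
"""


def summarize(report_xml):
    """Count test cases by outcome: passed, failure, or error."""
    root = ET.fromstring(report_xml)
    summary = {"passed": 0, "failure": 0, "error": 0}
    for case in root.iter("testcase"):
        if case.find("failure") is not None:
            summary["failure"] += 1
        elif case.find("error") is not None:
            summary["error"] += 1
        else:
            summary["passed"] += 1
    return summary


print(summarize(REPORT))
```

A reader like this ignores elements it does not know about, which is why tools keep working when a report carries extra information, as the reply above points out.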
See also
JUnit 4 XML schematized? on the JUnit-user list
schema for junit xml output on the ant-dev list
Related questions
Does anyone know where to get the XSD file describing the junitReport.xml file format expected by Hudson?
Spec. for JUnit XML Output
I know this question is 6 years old, but just for future readers...
Actually there is an official schema for surefire generated XML - it can be found here: surefire-test-report.xsd.

Objective-C libraries for XML Parsing

I would like to know some libraries in Objective-C for XML parsing. I think it is a very common need, but I found limited resources for handling this task:
Google Code projects: TouchCode (TouchXML)
NSXMLParser
What is your best solution for working with XML in Objective-C? Please advise.
What is the solution that you have used for your product?
NSXMLParser is a stream-oriented class; you set it up and get delegate callbacks when it detects something. Usually this is not what you want to do, but it can be much faster and use less memory.
TouchXML will parse the XML itself using libxml, and create an object tree for the entire XML structure. This allows you to easily access the contents of the XML tree, using manual traversal methods or basic XPaths (more sophisticated XPath support is planned).
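The tree-plus-XPath style of access looks roughly like this in Python's standard library, shown here only as an analogy since it is not TouchXML's API. ElementTree builds the whole object tree and supports a limited XPath subset. The document content and the titles_from helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for illustration.
DOC = """\
<library>
  <book year="2001"><title>First</title></book>
  <book year="2010"><title>Second</title></book>
</library>
"""


def titles_from(xml_text, year):
    """Return the titles of all <book> elements with the given year attribute."""
    root = ET.fromstring(xml_text)  # parses everything into an in-memory tree
    # Attribute predicates such as [@year='2010'] are part of the supported subset.
    path = "./book[@year='{}']/title".format(year)
    return [title.text for title in root.findall(path)]


print(titles_from(DOC, 2010))
```

Manual traversal is also possible by iterating over root and its children, at the cost of holding the entire document in memory.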
It serves a narrow purpose, but if your goal is to parse untidy HTML, you might want to try a static library I started called TagScraper. It doesn't handle many/most XML/HTML entities correctly, but it could be pretty easily patched to. URL: http://github.com/searls/TagScraper
Its value is that it provides a simple XPath mechanism that hides the tidying/querying/assembling for you, and then it provides the parsed elements & attributes in a tree-like data structure of Tag.h nodes.