scrapy can't handle "<" character

I'm trying to extract text containing "<" (the less-than character). On my localhost everything works fine; on the server, however, the text from "<" onward gets truncated.
1) hipoksemia tętnicza (PaO<sub>2</sub>/FiO<sub>2</sub> < 300 )
so I receive:
1) hipoksemia t\u0119tnicza (PaO<sub>2</sub>/FiO<sub>2</sub>
There is no problem with scraping > character. Thank you for your help.

A bare < is invalid HTML. It should be written as &lt;.
Scrapy uses Parsel to parse XML/HTML responses. Parsel uses lxml to parse XML/HTML documents. lxml does not handle broken HTML as well as web browsers and other parsers do.
There is an open issue for Parsel to handle these scenarios. It will probably require supporting an alternative to lxml in Parsel, which is not trivial to implement, so it may take a while before that issue is solved.
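Until then, one possible workaround is to escape stray "<" characters in the response body before it reaches the parser. The sketch below uses only the Python standard library. The helper name escape_bare_lt and its heuristic (a "<" that is not followed by a letter, "/", "!" or "?" is treated as text) are assumptions for illustration; it would miss cases such as "a<b".

```python
import re

# Hypothetical helper: treat "<" as literal text when it cannot start a tag,
# comment, doctype, or processing instruction.
_BARE_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_bare_lt(html_text: str) -> str:
    """Replace stray "<" characters with the &lt; entity."""
    return _BARE_LT.sub("&lt;", html_text)


sample = "1) hipoksemia tętnicza (PaO<sub>2</sub>/FiO<sub>2</sub> < 300 )"
fixed = escape_bare_lt(sample)
print(fixed)
```

In a spider callback, the cleaned text could then be parsed with scrapy.Selector(text=escape_bare_lt(response.text)) instead of using response.xpath or response.css directly.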

Related

Content Security Policy hash not recognized by Safari 11.0.3

I have a meta tag with the following directive inside of it:
<meta http-equiv="Content-Security-Policy" content="base-uri 'self'; script-src 'self' 'sha256-s5EeESrvuQPpk2bpz5I3zn/R8Au2DYB1Z+YUH9p0fUE=' 'sha256-PYYfGnkbZ44B9ZBpgv8NbP3MXT560LMfrDSas2BveJo=';">
I then have 2 inline scripts further down the page, each of which should match one of the generated hashes in the policy.
In Chrome and Firefox, I get no complaints and my scripts run as expected.
In Safari Version 11.0.3 (13604.5.6), I get the following error:
Refused to execute a script because its hash, its nonce, or 'unsafe-inline' does not appear in the script-src directive of the Content Security Policy
and I am confused as to why!
Unfortunately, I am unable to produce a minimum reproducible repo with the issue inside of it - smaller examples work in Safari for me, so it leads me to believe it's to do with something specific in my app, possibly related to the second thing I have tried below.
Any help would be much appreciated!
Things I have tried:
Are hashes supported?
According to this Stack Overflow post and the Safari release notes, CSP 2.0, which supports hashes, was implemented in Safari 10.
Correct charset?
Previously, I was seeing issues because I was calculating the hashes based on a UTF-8 charset, but was outputting the JS to the browser without a charset meta tag in place. Special characters in my JS were being mangled and were causing differences in the shas when the browser tried calculating them.
I don't believe this is affecting me now since Chrome and Firefox see no issues, but maybe I'm wrong here?
unsafe-inline for Safari, and then allow hashes to override that in Chrome and Firefox?
According to the CSP spec, unsafe-inline is ignored if a hash or nonce is present. Safari 11 also adheres to this, so adding the unsafe-inline keyword has no effect

Turns out this was a charset issue.
I managed to get a minimal reproducible issue (after some trial and error, and a lot of luck!) and found that one of my characters had a different sha before and after it was rendered in Safari.
Before it was rendered in Safari, the character was the following:
After Safari had rendered the character, it was the following (even in the source of the code):
Strangely, neither Chrome nor Firefox has this issue, so either Safari normalizes the character after rendering it, or the browsers differ in when they calculate the sha256 hashes.
The solution was to turn off character compression in UglifyJS so that the character stays as \uF900 instead of being compressed to the single character in the picture above.
I achieved this with the following option in my webpack.config.js file:
new UglifyJsPlugin({
  uglifyOptions: {
    output: {
      // necessary to stop the minification of escaped unicode sequences into their actual chars.
      // some unicode breaks CSP checks in safari
      ascii_only: true,
    },
  },
}),
I have reported this to Apple to see if they will consider fixing this.

Another solution would be to make sure the strings in your script are normalized. This can be achieved in JavaScript using String.prototype.normalize().
Normalization ensures each Unicode character is represented in its canonical (NFC) form, which seems to make the CSP hash comparison in Safari work. So if you're dealing with any kind of text which could be using accents, non-Latin scripts etc., it's a good idea to normalize the strings.
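To illustrate why normalization changes the hash, here is a small sketch, written in Python rather than JavaScript purely for illustration. It assumes Safari applies something like NFC normalization to the script text before hashing, which the answers above suggest but do not confirm. csp_hash is a hypothetical helper, not a library function.

```python
import base64
import hashlib
import unicodedata


def csp_hash(script_text: str) -> str:
    """Return a CSP-style source expression: 'sha256-<base64 digest>'."""
    digest = hashlib.sha256(script_text.encode("utf-8")).digest()
    return "'sha256-" + base64.b64encode(digest).decode("ascii") + "'"


# U+F900 is a CJK compatibility ideograph with a canonical equivalent.
original = 'var s = "\uf900";'
normalized = unicodedata.normalize("NFC", original)

# The two strings look alike on screen but hash differently.
hashes_differ = csp_hash(original) != csp_hash(normalized)
print(csp_hash(original))
print(csp_hash(normalized))
```

If the hash in the policy is computed from the un-normalized source while the browser hashes a normalized copy, the two values cannot match, which is consistent with the error in the question.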

Gradle build failing with many "unmappable character for encoding UTF-8" errors

I'm seeing many "unmappable character for encoding UTF-8" error messages when I run a Gradle build, like this:
C:\5.0-Maint-New-Techs\src\com\avada\jms\base\JmsUtils.java:136: error: unmappable character for encoding UTF-8
ï¿½ You can use arrays as well as primitive types for the values.
When I go to the lines that are flagged this way, they look perfectly fine in my editor. I'm not sure whether this is a Gradle or an IntelliJ IDEA issue. Any ideas on how to get around these errors?

As a workaround, replace the non-ASCII characters with double quotes (") or any other ASCII character. That makes the errors go away. I think the issue is caused by IntelliJ.
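Before replacing characters by hand, it can help to find exactly which lines contain bytes that are not valid UTF-8. The compiler message suggests the file was saved in another encoding, such as windows-1252, though that is a guess. A minimal sketch follows; find_non_utf8_lines is a hypothetical helper, not part of Gradle or IntelliJ.

```python
from typing import List


def find_non_utf8_lines(data: bytes) -> List[int]:
    """Return 1-based numbers of lines whose bytes are not valid UTF-8."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Example: line 2 contains a windows-1252 bullet (byte 0x95), invalid in UTF-8.
source = b"/**\n * \x95 You can use arrays\n */\n"
bad_lines = find_non_utf8_lines(source)
print(bad_lines)
```

For a real file, the bytes could be read with pathlib.Path("JmsUtils.java").read_bytes(). If the sources are intentionally kept in another encoding, the alternative is to tell the compiler about it, for example by setting options.encoding on Gradle's JavaCompile tasks.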

Disable URL encoding of the query string in qtwebkit

I'm using qtwebkit to build a DOM-XSS scanner. By default qtwebkit is automatically URL encoding/escaping the query part of the URL. Javascript gets the URL encoded.
For example, when you visit the URL
http://test.com/?param=value<b>value</b>&a=b
location.href will contain the value
http://test.com/?param=value%3Cb%3Evalue%3C/b%3E&a=b
This is a big problem for me in detecting DOM-XSS vulnerabilities because I don't know if the browser did the encoding or the webpage did it. I'm trying to disable this functionality but I'm lost in the qtwebkit source code.
Can anybody tell me where exactly in the code (in which file) the URL encoding takes place, so I can modify the source code and recompile it?
I've been browsing the source code for 3 days now and I didn't make any progress.
Thank you very much in advance for any help.

I have also encountered this problem on Qt5, but Qt4 doesn't have it.
I modified the static function recode in the Qt source file qurlrecode.cpp.
This solved my problem. I think it would be better to modify KURL in the WebKit source code, but I failed to build WebKit on my machine after wasting a whole day on it.

QUrl::setEncodedUrl(const QByteArray &encodedUrl) can solve it.
See http://doc.qt.io/qt-4.8/qurl.html#setEncodedUrl
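For comparison when reading scan results, the percent-encoding from the question can be reproduced outside the browser. This sketch uses Python's standard urllib.parse and assumes only "<" and ">" get escaped in this query value, as the example URLs show. It is not how qtwebkit implements the encoding.

```python
from urllib.parse import quote, unquote

raw_value = "value<b>value</b>"

# Keep "/" unescaped, matching the location.href value in the example.
encoded = quote(raw_value, safe="/")

# Decoding recovers the original payload for comparison.
decoded = unquote(encoded)
print(encoded)
print(decoded)
```

Comparing the decoded value against the payload that was injected gives one way to tell whether a second layer of encoding was applied by the page itself.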

Same script, different behavior

I just stumbled upon an interesting bug... Still trying to figure out what is exactly happening. Maybe you can help.
First, the context. I'm currently building yet another man-to-HTML converter (for reasons I won't go into here, but I need it).
So, have a look at the screenshot below (see the link), more precisely at the outlined spots. See? In the upper shell, I have &lt; and &gt;, that is, escaped HTML.
In the lower shell I have < and > directly.
But as you can see (or do I seriously need glasses?), the command man 2 semget | webmanner is the same on both sides, as is the output of which webmanner. The two are executed at roughly the same moment, with no modification made to the script in between.
[Oops, cannot post pictures just yet... Here comes the link]
http://aspyct.org/media/webmanner-bug.png
But the lower shell is older (opened about an hour ago). Newer shells all print &lt;. So my first guess was that it somehow had a cached reference to the old inode of the file, or old blocks or whatever.
So I modified parts of the script, at the start and then at the end, to print different messages. And, surprise, the messages showed up on both terminals. But still, the same difference between &lt; and <.
I'm confused... How can this behavior be explained? I'm working on OS X 10.8 (Mountain Lion).
EDIT: OK, there is one big difference: the lower shell uses Ruby 1.9.3, while the upper one uses 1.8.7. Is there any known difference in string handling between the two versions?

Are you using the htmlentities library? If so, this bug fix is probably what you are seeing:
Ruby 1.9.3 has slightly different behaviour to 1.9.2: the result of
encode was not ASCII even when it only contained ASCII characters.
This may not be important, but this change makes both versions produce
the same result.
https://github.com/threedaymonk/htmlentities/commit/46dafc959de03a02d0c1705bef7f1b157b350025

Executing ebook-convert from rails non-ascii characters display ?? in meta

I want to convert html files to epub with ebook-convert. I provide book metadata through parameters, but non-ascii characters display ?? in the book meta. I use system:
res = system cmd_line
where cmd_line is a string to be executed
When I execute the same command from command line, it works perfectly.
I use Ruby 1.8.7 and Rails 3.0.13

I'm late to the party here but came looking for a solution to a closely related problem yesterday and so have just upvoted the question.
A possible explanation can be found at the not-obviously-applicable:
Executing ebook-convert from rails non-ascii characters display ?? in meta
where I've added my own take on a potential work-around.
