I have a Scrapy spider that fetches a web page (from ESPN, as it happens) and then extracts a JSON string from the page using Selector.re():
from scrapy.selector import Selector
sel = Selector(response)
j = sel.re(r'window\.espn\.scoreboardData[ \t]*=[ \t]*{.*?};')
I then parse the JSON like so:
import json
start = j[0].index('{')
json_data = json.loads(j[0][start:-1])
This works fine until the JSON string contains a string with an embedded quote character. If I look at the web page with the browser's View Source, I see something like this as part of the JSON string:
"shortLinkText":"Georgia Tech and &quot;Fab Three&quot; beat Jackson State"
The embedded quote character has been encoded as the HTML equivalent. However, in the string I get back within Scrapy, that has been decoded into a quote character:
"shortLinkText":"Georgia Tech and "Fab Three" beat Jackson State"
This causes the JSON parser to fail for obvious reasons.
Is there any workaround for this situation? A way to force Scrapy not to decode the HTML character?
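One possible workaround, sketched under the assumption that the raw page source is available as a string (in a spider that would be response.text), is to run the regex with Python's own re module instead of Selector.re(), so no entity decoding happens before the JSON is parsed. The sample page_source below is a hypothetical stand-in for the real ESPN page. Recent parsel versions also appear to accept replace_entities=False on .re(), which may be simpler if your version supports it.

```python
import html
import json
import re

# Hypothetical stand-in for response.text: the entity is still encoded here.
page_source = (
    '<script>window.espn.scoreboardData = '
    '{"shortLinkText":"Georgia Tech and &quot;Fab Three&quot; beat Jackson State"};'
    '</script>'
)

# Same pattern as in the question, with the JSON object captured as group 1.
PATTERN = re.compile(
    r'window\.espn\.scoreboardData[ \t]*=[ \t]*(\{.*?\});', re.DOTALL
)


def extract_scoreboard(text):
    """Return the parsed scoreboard JSON, or None if the pattern is absent."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_scoreboard(page_source)
# Decode HTML entities only after the JSON has been parsed successfully.
headline = html.unescape(data["shortLinkText"])
print(headline)
```

Decoding the entities per field after json.loads() keeps the quote characters out of the JSON syntax while still giving readable text.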
So I have a set of strings, with some "custom markdown" that I have created. My intention is to render these strings as HTML in the frontend. Let's say, I have this string:
This is a string <color>that I need</color> to\nrender <caution>safely in the browser</caution>. This is some trailing text
I would be expecting to get something like:
This is a string <span class="primaryColor">that I need</span> to<br>render <div class="caution">safely in the browser</div>. This is some trailing text
And the way I do it right now is with some basic Regex:
const toHtml = text
  .replace(/<color>(.*)<\/color>/gim, "<span class='primaryColor'>$1</span>")
  .replace(/\\n/g, "<br>")
  .replace(/<caution>(.*)<\/caution>/gims, "<div class='caution'>$1</div>");
This works fine and returns the correct string. And then for printing, in the template I just:
<div id="container" v-html="result"></div>
My problem is that at some point I expect users to be able to enter these strings themselves, and the strings would be displayed to other users too. So I am surely going to be vulnerable to XSS attacks.
Is there any alternative I can use to avoid this? I have been looking at https://github.com/Vannsl/vue-3-sanitize which looks like a good way of just allowing the div, span and br tags that I am using, and set the allowed attributes to be only class for all the tags. Would this be safe enough? Is there something else I should do?
In that case, I believe it will not be necessary to sanitize it in the backend too, right? Meaning, there will be no way for the web browser to execute malicious code, even if the string on the server contains <script>malicious code</script>, right?
My problem is that at some point I expect users to be able to enter these strings themselves
So, do you have a form input where users enter the string mentioned in the post? If so, my suggestion is to sanitize the user input up front, before passing it to the backend, so that no malicious code is stored in the backend in the first place.
You can do this with the string.replace() method: first strip the malicious tags (for example <script>, <a, etc.) from the input string, and then store the result in the database.
Steps you can follow:
Create a blacklist variable containing the regex of disallowed characters/strings.
Using string.replace(), replace every occurrence in the string that matches the blacklist regex with the empty string.
Store the sanitized string in the database.
That way, you will not have to worry about the string coming from the backend, and you can bind it via v-html without any harm.
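Blacklists are easy to get around with unusual casing, attributes, or nesting, so a complementary approach is worth sketching: escape everything first, then re-introduce only the known custom tags. This is a different technique from the blacklist steps above, and it is written in Python purely for illustration (the same idea carries over to JavaScript). The function name to_safe_html is hypothetical; the class names come from the question.

```python
import html
import re


def to_safe_html(text):
    """Escape all markup, then convert only the known custom tags."""
    # After this call every '<', '>', '&' and quote is an HTML entity,
    # so no user-supplied tag or attribute can survive as markup.
    escaped = html.escape(text, quote=True)

    # The custom tags now appear in escaped form, e.g. &lt;color&gt;.
    escaped = re.sub(
        r'&lt;color&gt;(.*?)&lt;/color&gt;',
        r'<span class="primaryColor">\1</span>',
        escaped,
        flags=re.IGNORECASE | re.DOTALL,
    )
    escaped = re.sub(
        r'&lt;caution&gt;(.*?)&lt;/caution&gt;',
        r'<div class="caution">\1</div>',
        escaped,
        flags=re.IGNORECASE | re.DOTALL,
    )
    # The question's strings use a literal backslash followed by 'n'.
    return escaped.replace('\\n', '<br>')


print(to_safe_html('Hi <color>there</color>\\n<script>alert(1)</script>'))
```

Because only exact, attribute-free forms of the custom tags are converted, anything else the user types stays inert text. A server-side check is still a reasonable second layer, since the stored string may be rendered by other clients.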
The URL below opens a new Google Mail compose window. The problem I have is that Google replaces all the plus (+) signs in the email body with blank spaces. It looks like it only happens with the + sign. How can I remedy this? (I am working on an ASP.NET web page.)
https://mail.google.com/mail?view=cm&tf=0&to=someemail@somedomain.com&su=some subject&body=Hi there+Hello there
(In the body email, "Hi there+Hello there" will show up as "Hi there Hello there")
The + character has a special meaning in [the query segment of] a URL: it means a space. If you want to use a literal + sign there, you need to URL-encode it as %2b:
body=Hi+there%2bHello+there
Here's an example of how you could properly generate URLs in .NET:
using System;
using System.Web; // HttpUtility lives here

var uriBuilder = new UriBuilder("https://mail.google.com/mail");
// The collection returned here URL-encodes each value in ToString()
var values = HttpUtility.ParseQueryString(string.Empty);
values["view"] = "cm";
values["tf"] = "0";
values["to"] = "someemail@somedomain.com";
values["su"] = "some subject";
values["body"] = "Hi there+Hello there";
uriBuilder.Query = values.ToString();
Console.WriteLine(uriBuilder.ToString());
The result:
https://mail.google.com:443/mail?view=cm&tf=0&to=someemail%40somedomain.com&su=some+subject&body=Hi+there%2bHello+there
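For comparison, the same query-string encoding can be sketched with Python's standard library. This is only an illustration of the encoding rule, not part of the .NET solution; the address and subject are the placeholder values from the question.

```python
from urllib.parse import parse_qs, urlencode

# Placeholder values taken from the question.
params = {
    "view": "cm",
    "tf": "0",
    "to": "someemail@somedomain.com",
    "su": "some subject",
    "body": "Hi there+Hello there",
}

# urlencode() applies form encoding: spaces become '+', a literal '+' becomes %2B.
query = urlencode(params)
url = "https://mail.google.com/mail?" + query
print(url)

# Decoding the query string recovers the original body, plus sign included.
decoded_body = parse_qs(query)["body"][0]
print(decoded_body)
```

The round trip through parse_qs() shows why the literal plus has to be sent as %2B: an unencoded + would come back as a space.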
If you want a literal plus (+) symbol in the body, you have to encode it as %2B.
In order to encode a + value using JavaScript, you can use the encodeURIComponent function.
Example:
var url = "+11";
var encoded_url = encodeURIComponent(url);
console.log(encoded_url); // %2B11
It's safer to always percent-encode all characters except those defined as "unreserved" in RFC 3986:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
So, percent-encode the plus character and other special characters.
The problem you are having with pluses arises because, according to RFC 1866 (the HTML 2.0 specification), section 8.2.1, item 1: "The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped". This way of encoding form data also appears in later HTML specifications; look for the relevant paragraphs about application/x-www-form-urlencoded.
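The difference between plain percent-encoding and form encoding can be seen side by side in a small Python sketch. It uses only the standard urllib.parse functions; the sample text is the body string from the question.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "Hi there+Hello there"

# Percent-encode everything except the RFC 3986 unreserved characters.
percent_encoded = quote(text, safe="")

# Form encoding (application/x-www-form-urlencoded): space becomes '+'.
form_encoded = quote_plus(text)

print(percent_encoded)
print(form_encoded)

# A form decoder turns '+' back into a space; a plain percent-decoder does not.
as_form = unquote_plus("Hi+there")
as_plain = unquote("Hi+there")
print(repr(as_form), repr(as_plain))
```

Both encodings turn the literal plus into %2B, which is why escaping it is safe regardless of which decoder the receiving side uses.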
Just to add this to the list:
Uri.EscapeUriString("Hi there+Hello there") // Hi%20there+Hello%20there
Uri.EscapeDataString("Hi there+Hello there") // Hi%20there%2BHello%20there
See https://stackoverflow.com/a/34189188/98491
Usually you want to use EscapeDataString which does it right.
Generally, if you use the .NET APIs, new Uri("someproto:with+plus").LocalPath or AbsolutePath will keep the plus character in the URL (for the same "someproto:with+plus" string),
but Uri.EscapeDataString("with+plus") will escape the plus character and produce "with%2Bplus".
Just to be consistent, I would recommend always escaping the plus character as "%2B" and using that everywhere; then there is no need to guess how each component interprets your plus character.
I'm not sure why decoding a '+' character would produce a space character ' ', but apparently that is what happens in some components.
I am trying to take a string that I have marked up through VB.NET code and cross-check it with the text file it originally came from. This is for proofreading the HTML output.
To do this, I need to parse an HTML snippet that does not come from a URL.
The examples of HTMLAgilityPack I have seen get their input from a URL. Is there a way to parse a string of marked-up text that does not include a header or similar parts of a well-formed webpage?
Thanks
To parse a string containing an HTML snippet rather than a file or URL, you can use the HtmlDocument as @Oded suggested, but instead of using doc.Load(), use doc.LoadHtml().
using HtmlAgilityPack;
string htmlSnippet = "<p>Example <strong>Html</strong> snippet</p>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlSnippet); // parses the in-memory string, no file or URL needed
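For the proofreading step itself, the idea of parsing a bare snippet and comparing its visible text against the source can be sketched in Python. Note that this uses Python's standard html.parser rather than HtmlAgilityPack, and the names TextExtractor and visible_text are hypothetical; it only illustrates that a fragment without a header or body parses fine.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(snippet):
    """Return the text content of an HTML snippet held in a string."""
    parser = TextExtractor()
    parser.feed(snippet)
    parser.close()
    return "".join(parser.parts)


snippet = "<p>Example <strong>Html</strong> snippet</p>"
original_text = "Example Html snippet"

extracted = visible_text(snippet)
print(extracted == original_text)
```

With HtmlAgilityPack the equivalent comparison would read the text of the document node after LoadHtml().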
Instead of the WebDocument use HtmlDocument:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
It is the first thing on the HAP examples page.
I am developing with the Facebook API in ASP.NET. I have to send a query string, but this query string may include special characters such as ı, ç, ö, ş, ğ. When I send a query string with special characters, Facebook returns an error:
The URL http://apps.facebook.com/sportsfanarena/Results.aspx?s=13&co=3&ci=Bal%c4%b1kesir&g=0 is not valid.
The "ci" variable's value is "Balıkesir".
Is there any solution to handle it?
I believe you need to use URL encoding to send characters like that, though I may be mistaken.
Here is an online utility which will take text and encode/decode it in URL encoding.
Try encoding the word you want to send using this utility, and then try your API request with the encoded text.
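What such a utility does can be sketched with Python's standard library, assuming UTF-8 as the byte encoding (which matches the %c4%b1 sequence in the URL from the question):

```python
from urllib.parse import quote, unquote

city = "Balıkesir"

# 'ı' (U+0131) is the two bytes C4 B1 in UTF-8, each written as %XX.
encoded = quote(city)
print(encoded)

# Decoding is case-insensitive with respect to the hex digits.
decoded = unquote("Bal%c4%b1kesir")
print(decoded)
```

This shows that the ci value in the rejected URL is already a valid UTF-8 percent-encoding of "Balıkesir", so the encoding step itself may not be what the receiving side objects to.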
Hey guys, lately I have been using a combination of the Struts and Velocity frameworks to build a website. The problem is that when I input a UTF-8 Japanese character, say I put the value "索" into a name field, and then click submit ( using ), the data is passed to an AddForm, which has a String name field to hold the name. The received string is some strange characters rather than the expected "索". I have set the whole workspace to UTF-8, including velocity.properties (input.encoding/output.encoding = UTF-8) and content-type/charset = UTF-8, but it always returns a strange string. If I set the name field directly with: public void setName(String name) { this.name = "索"; } then the confirm-add step works fine, but not when the value is entered normally into the name field on the AddForm. Could someone point out what is wrong? Thanks for your patient reading :D.
If I understand your problem correctly, it is as follows:
you can send and display “索” correctly in client browsers,
but when the form is sent back to the server, the data is corrupted.
This is caused by a mismatch between:
the encoding in which the request is encoded (UTF-8, as you said), and
the encoding with which the server decodes it (ISO-8859-1 by default).
It can be solved by explicitly specifying the server-side encoding (the second item above), for example with the CharacterEncodingFilter of the Spring Framework.
(Note: Japanese frameworks such as Seasar and TERASOLUNA have similar filters and articles on the problem.)
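The mismatch described above can be reproduced in a few lines of Python. This is only an illustration of the byte-level effect, not of the Struts or Spring configuration: the browser sends UTF-8 bytes and the server decodes them as ISO-8859-1.

```python
original = "索"

# What the browser sends: the UTF-8 bytes of the character (E7 B4 A2).
request_bytes = original.encode("utf-8")

# What a server defaulting to ISO-8859-1 makes of those bytes.
garbled = request_bytes.decode("iso-8859-1")
print(garbled)

# Decoding the same bytes as UTF-8 gives the expected character.
correct = request_bytes.decode("utf-8")
print(correct)

# The damage is reversible as long as the garbled string is intact.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered == original)
```

One character turning into three unrelated ones is the typical symptom of this mismatch, which is why setting the request encoding on the server before parameters are read resolves it.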