Normalize string from HtmlAgilityPack document - vb.net

I'm trying to get a web page using vb.net and HtmlAgilityPack with this code:
Dim mWPage As New HtmlAgilityPack.HtmlDocument
Dim wC As New WebClient()
mWPage.Load(wC.OpenRead(mUrl))
My problem is getting text from a table: when I extract InnerText, I get something like this:
Modificat<!--span-->i dati
instead of this (note that I typed the same string here and it is displayed correctly below):
Modificati dati
I've tried to use the answer here, but it doesn't work in this case (or I wasn't able to make it work).
I noticed that the content changes when I change the "User-Agent", so I tried various "User-Agent" values, but I never got a perfect text.
So my questions are:
Can I use the code indicated in that answer to solve the problem?
If not, can I get a perfect text by using the right "User-Agent"?
If so, how can I find the right "User-Agent"?
If not, how can I fix the received string?

The response from the server based on a new User-Agent is fully dependent on the server so we will not be able to predict which one will yield the response you're looking for.
However, you can use the HttpUtility.HtmlDecode method to get rid of the encoded HTML and turn it into the string you're looking for.
To filter out the HTML comment you may need to change the XPath you're using. If you append //text(), you should get only the text elements that match the rest of your expression.

Related

Display markdown safely as HTML in Vue3

So I have a set of strings, with some "custom markdown" that I have created. My intention is to render these strings as HTML in the frontend. Let's say, I have this string:
This is a string <color>that I need</color> to\nrender <caution>safely in the browser</caution>. This is some trailing text
I would be expecting to get something like:
This is a string <span class="primaryColor">that I need</span> to<br>render <div class="caution">safely in the browser</div>. This is some trailing text
And the way I do it right now is with some basic Regex:
toHtml = text
.replace(/<color>(.*)<\/color>/gim, "<span class='primaryColor'>$1</span>")
.replace(/\\n/g, "<br>")
.replace(/<caution>(.*)<\/caution>/gims, "<div class='caution'>$1</div>")
This works fine and returns the correct string. And then for printing, in the template I just:
<div id="container" v-html="result"></div>
My problem is that at some point I expect users to be able to enter these strings themselves, and they would be displayed to other users too. So I will certainly be vulnerable to XSS attacks.
Is there any alternative I can use to avoid this? I have been looking at https://github.com/Vannsl/vue-3-sanitize which looks like a good way of just allowing the div, span and br tags that I am using, and set the allowed attributes to be only class for all the tags. Would this be safe enough? Is there something else I should do?
In that case, I believe it will not be necessary to sanitize it in the backend too, right? Meaning, there will be no way for the web browser to execute malicious code, even if the string on the server contains <script>malicious code</script>, right?
My problem is that at some point I expect users to be able to enter these strings themselves
So, is there a form input where users enter the string you mentioned in the post? If yes, my suggestion is to sanitize the user input first, before passing it to the backend, so that no malicious code is stored in the backend in the first place.
You can do this with the string.replace() method: first strip the unwanted tags (for example <script>, <a, etc.) from the input string, and then store the result in the database.
Steps you can follow:
Create a blacklist variable containing a regex of the disallowed characters/strings.
Using string.replace(), replace every occurrence that matches the blacklist regex with an empty string.
Store the sanitized string in the database.
That way you don't need to worry about the string coming from the backend, and you can bind it via v-html without any harm.
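As a variation on the steps above, here is a minimal sketch of an escape-first approach, since a blacklist can easily miss a tag or attribute. It assumes the only markup you want to allow is the custom <color> and <caution> tags and the literal \n sequence from the question; escapeHtml and toSafeHtml are hypothetical helper names, not part of Vue or any library. Everything the user typed is escaped first, and only then are the known custom tags turned back into real HTML.

```javascript
// Escape every HTML-significant character so raw user input cannot open a tag.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical helper: convert the custom markup AFTER escaping, so the only
// real tags in the output are the three fixed ones written here.
function toSafeHtml(text) {
  return escapeHtml(text)
    .replace(/&lt;color&gt;(.*?)&lt;\/color&gt;/gis, '<span class="primaryColor">$1</span>')
    .replace(/&lt;caution&gt;(.*?)&lt;\/caution&gt;/gis, '<div class="caution">$1</div>')
    .replace(/\\n/g, "<br>"); // matches the literal two-character sequence \n
}
```

The result of toSafeHtml(userInput) can then be bound with v-html. Running the same conversion (or a sanitizer library) on the server as well is still a reasonable precaution, because a client-side check can be bypassed by anyone who posts to the backend directly.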

How to get query string on search page in shopify

I have a Shopify store. I am passing new parameters to the search page using the query string. Can anyone tell me how I can get this new query string on the search page?
You can't get new parameters outside the default ones using Liquid. Liquid is not aware of additional query parameters.
If you really, really have to read them in Liquid, there is a hacky option: capture the content_for_header object and extract the parameters from it with a few splits (it contains a URL that includes the query params). You will need to look for the pageurl string there. But like I said, this is a hacky way that should be used only as a last resort.

Openrefine not working as expected

I'm very new to OpenRefine, so please bear with me if I have made a simple mistake.
I'm parsing an HTML website to gather some data.
Everything went fine with fetching the individual pages, but now the parsing of the HTML fails.
I'm creating a new column, based on the one holding all the page's HTML. I'm trying to get to the data in a specific DIV[20].
In the "Create column based on this column" window, the preview for value.parseHtml().select("DIV")[20] shows exactly what I need, but executing it gives me nothing but blank cells.
It even tells me that it is "filling 0 rows with grel:value.parseHtml().select("DIV")[20]".
Any clue what I'm doing wrong here?
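You just need to finish the expression with .toString() so that the element object returned by select() (a jsoup element, which a cell cannot store directly) is output as a string: value.parseHtml().select("DIV")[20].toString()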
This is explained on our wiki here: https://github.com/OpenRefine/OpenRefine/wiki/StrippingHTML#extract-html-attributes-text-links-with-integrated-grel-commands
I also updated the select() function with that example: https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions#selectelement-e-string-s

Ways to reliably split a URL string and extract required part

I am working on a server-side application of FB login.
Having converted the example here:
https://developers.facebook.com/docs/authentication/server-side/
to VB, and using System.Net.WebRequest.Create to retrieve the responses, I am now able to get a text string containing the access_token and the expiry time in the following format:
access_token=ACCESS&expires=2577
Obviously I can split this into an array and then split each part to get the access_token.
But, on the FB Developers example above, they do it with PHP like so:
$params['access_token'];
Is there a VB.net way of doing this? It seems more reliable to me than the aforementioned splitting idea, e.g. if FB changes the output format.
You can use HttpUtility.ParseQueryString to parse a query string. If you're not using ASP.NET, you should add a reference to System.Web.dll. I also don't believe this will work in .NET 4 client profile.
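Imports System.Web
Imports System.Collections.Specialized
' Parse the response body into name/value pairs (sample string taken from the question)
Dim qString As NameValueCollection = HttpUtility.ParseQueryString("access_token=ACCESS&expires=2577")
' Look values up by key, so the order of the pairs no longer matters
Dim accessToken As String = qString("access_token")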

Preventing YQL from URL encoding a key

I am wondering if it is possible to prevent YQL from URL encoding a key for a datatable?
Example:
The current guardian API works with IDs like this:
item_id = "environment/2010/oct/29/biodiversity-talks-ministers-nagoya-strategy"
The problem with these IDs is that they contain slashes (/) and these characters should not be URL encoded in the API call but instead stay as they are.
So if I now have this query
SELECT * FROM guardian.content.item WHERE item_id='environment/2010/oct/29/biodiversity-talks-ministers-nagoya-strategy'
while using the following url definition in my datatable
<url>http://content.guardianapis.com/{item_id}</url>
then this results in this API call
http://content.guardianapis.com/environment%2F2010%2Foct%2F29%2Fbiodiversity-talks-ministers-nagoya-strategy?format=xml&order-by=newest&show-fields=all
Instead the guardian API expects the call to look like this:
http://content.guardianapis.com/environment/2010/oct/29/biodiversity-talks-ministers-nagoya-strategy?format=xml&order-by=newest&show-fields=all
So the problem is really just that the / characters get encoded as %2F, which I don't want to happen in this case.
Any ideas on how this can be achieved?
You can also check the full datatable I am using:
http://github.com/spier/yql-tables/blob/master/guardian/guardian.content.item.xml
The URI-template expansions in YQL (e.g. {item_id}) only follow the version 3 spec. With version 4 it would be possible to simply (only slightly) change the expansion to do what you want, but alas not currently with YQL.
So, a solution. You could bring a very, very basic <execute> block into play: one which adds the item_id value to the path as needed.
<execute><![CDATA[
response.object = request.path(item_id).get().response;
]]></execute>
Finally, see the diff against your table (with a few other, minor tweaks to allow the above to work).
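For illustration only, the encoding difference behind this problem can be reproduced in plain JavaScript outside YQL. This is a sketch under assumptions: encodeURIComponent stands in for what the {item_id} template expansion does to the value, and encodePath is a hypothetical helper, not a YQL function. It shows that encoding the ID as a single component escapes the slashes, while encoding it segment by segment leaves the path structure intact.

```javascript
const itemId = "environment/2010/oct/29/biodiversity-talks-ministers-nagoya-strategy";

// Encoding the whole value as one component turns every "/" into "%2F",
// which is the form of URL the question shows being sent to the API.
const asComponent = encodeURIComponent(itemId);

// Hypothetical helper: encode each path segment on its own and re-join with "/",
// so the slashes survive while anything unsafe inside a segment is still escaped.
function encodePath(path) {
  return path
    .split("/")
    .map((segment) => encodeURIComponent(segment))
    .join("/");
}

const asPath = encodePath(itemId);
```

Here asComponent contains %2F in place of each slash, whereas asPath keeps the slashes, which is the shape of URL the Guardian API expects according to the question.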