HtmlAgilityPack clean inner text from html - vb.net

I have this HTML, and I'm trying to get its InnerText without any tags in it:
<h1>my h1 content</h1>
<div class="thisclass">
<p> some text</p>
<p> some text</p>
<div style="some_style">
some text
<script type="text/javascript">
<!-- some script -->
</script>
<script type='text/javascript' src='some_script.js'></script>
</div>
<p> some text<em>some text</em>some text.<em> <br /><br /></em><strong><em>some text</em></strong></p>
<p> </p>
</div>
What I'm trying to do is get the text as the user would see it from the class thisclass.
I want to strip every script tag, and all other tags, and just get plain text.
This is what I'm using:
Dim Tags As HtmlNodeCollection = root.SelectNodes("//div[@class='thisclass'] | //h1")
Does anyone have any ideas?
Thanks.

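Try this (warning: C# code ahead):
// SelectNodes returns null when nothing matches, so guard before looping.
var scripts = root.SelectNodes("//script");
if (scripts != null)
{
    foreach (var script in scripts)
    {
        script.ParentNode.RemoveChild(script);
    }
}
Console.WriteLine(root.InnerText);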
This gave me the following output:
my h1 content some text some textsome text some textsome textsome text. some text
Hope this helps.
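For readers without HtmlAgilityPack at hand, the same idea (drop script content, then read the remaining text) can be sketched with Python's standard library. This is only an illustration of the technique, not the answerer's code: the class name VisibleText, the helper visible_text, and the shortened sample markup are assumptions made up for this sketch.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only while we are outside every skipped element.
        if not self.skip_depth:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the result reads like rendered text.
    return " ".join("".join(parser.parts).split())


# Hypothetical, shortened version of the markup from the question.
sample = """<h1>my h1 content</h1>
<div class="thisclass">
<p> some text</p>
<script type="text/javascript">
<!-- some script -->
</script>
<p>some <em>more</em> text</p>
</div>"""

print(visible_text(sample))
```

Like InnerText, this joins text across inline tags without inserting spaces, so adjacent fragments such as "text<em>more</em>" run together unless the markup itself contains whitespace.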

Related

How to open a url link from qweb?

I have a field that stores a website link. I need to print that field in QWeb and have the URL open in a new tab when it is clicked.
How can I do it?
Code:
<xpath expr="//div[@class='footer']" position="replace">
<div class="footer">
<hr style="width:100%;border:1px solid black;"/>
<div style="border:1px solid black;width:100%;">
<img t-att-src="'/module/static/description/image.png'" />
</div>
<div style="border:1px solid black;width:100%;float:center;text-align:center;font-size:50px;">
<a t-attf-href="#{doc.company_id.website}">example.com</a>
</div><br/>

Try this
<a t-attf-href="#{doc.company_id.website}"><span t-field="doc.company_id.website"/></a>
or
${doc.company_id.website}

That's just pure HTML:
example.com
If you have a field value and your record in the context of rendering is o and the field name url, you can also use QWeb:
<a t-att-href="o.url" t-esc="o.url" />

Extracting text and links from the html is not working with bs4

I am struggling to get the link wikipedia.com and the name "John Martin" from the HTML below via bs4. I am new to bs4.
<div class="section" qualifer="allnames">
<div class="container container-2">
<div class="title">
<h1 class="title1">
This is a test
</h1>
</div>
<div class="tile3">
<a class="title4" href="wikipedia.com" title="John Martin">
I tried this
link = soup.find('div', class_='title4')
link = link.a.text()
print(link)
Can someone help? How do I get the links and the names from the above code please?

You're almost there. Try:
link = soup.find_all('a', class_='title4')
for l in link:
    print(l['title'])
    print(l['href'])
Output:
John Martin
wikipedia.com
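The attempt in the question fails for two reasons: the class title4 sits on the a tag rather than on a div, so find('div', class_='title4') returns None, and text is a property, not a method. If bs4 is not installed, the same attribute lookup can be approximated with the standard library. The sketch below is an assumption-laden illustration: LinkCollector is a made-up name, and the snippet closes the a tag that is cut off in the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Record (title, href) for every <a> tag carrying a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attributes.get("class") or "").split()
        if self.wanted_class in classes:
            self.links.append((attributes.get("title"), attributes.get("href")))


# Hypothetical completion of the truncated markup from the question.
snippet = '''<div class="tile3">
<a class="title4" href="wikipedia.com" title="John Martin">link</a>
</div>'''

collector = LinkCollector("title4")
collector.feed(snippet)
for title, href in collector.links:
    print(title)
    print(href)
```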

Not able to identify check box based on the name given beside it using xpath

I want to identify a checkbox by the text shown beside it using XPath, but I have not been able to match that text uniquely.
The HTML is below. How can I build a dynamic XPath for 'text1' or 'text2' in it?
<html>
<body>
<div class="section-content">
<div>
<input class="cls" type="checkbox"/>
text1
</div>
</div>
<div class="section-content">
<div>
<input class="cls" type="checkbox"/>
text2
</div>
</div>
<div class="section-content">
<div>
<input class="cls" type="checkbox"/>
text3
</div>
</div>
<div class="section-content">
<div>
<input class="cls" type="checkbox"/>
text4
</div>
</div>
</body>
</html>

The below code worked for me
driver.findElement(By.xpath("//div[contains(.,'text4')]/input")).click();

To identify the node with text1 as its text, you can use the following line of code:
WebElement element_text1 = driver.findElement(By.xpath("//div[@class='section-content']/div[contains(normalize-space(), 'text1')]"));
To identify the checkbox associated with the node whose text is text1, you can use the following line of code:
driver.findElement(By.xpath("//div[@class='section-content']/div[contains(normalize-space(), 'text1')]/input"));

Try this:
WebElement body = dr.findElement(By.xpath("/html/body"));
List<WebElement> div = dr.findElements(By.className("section-content"));
// XPath positions are 1-based, so loop from 1 through div.size() inclusive.
for (int i = 1; i <= div.size(); i++)
{
    String checkBoxText = dr.findElement(By.xpath("/html/body/div[" + i + "]/div")).getText();
    if (checkBoxText.equals("text3"))
    {
        dr.findElement(By.xpath("/html/body/div[" + i + "]/div/input")).click();
        break;
    }
}
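Positional XPath predicates such as div[1] count from 1, so a loop over N sections has to run from 1 through N inclusive. The sketch below illustrates that with Python's xml.etree.ElementTree, which supports only a limited XPath subset and is not Selenium. The function name find_checkbox_position is made up for this example, and the markup is a compacted copy of the question's HTML.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring("""<html><body>
<div class="section-content"><div><input class="cls" type="checkbox"/>text1</div></div>
<div class="section-content"><div><input class="cls" type="checkbox"/>text2</div></div>
<div class="section-content"><div><input class="cls" type="checkbox"/>text3</div></div>
<div class="section-content"><div><input class="cls" type="checkbox"/>text4</div></div>
</body></html>""")

sections = page.findall("./body/div[@class='section-content']")


def find_checkbox_position(label):
    """Return the 1-based XPath position of the section whose label matches."""
    # XPath positions start at 1, so the range must include len(sections).
    for position in range(1, len(sections) + 1):
        inner = page.find(f"./body/div[{position}]/div")
        # The label is the text that trails the <input/> inside the inner div.
        if "".join(inner.itertext()).strip() == label:
            return position
    return None


print(find_checkbox_position("text4"))
```

A loop that stops at N - 1 would never inspect the last section, so a lookup for text4 would come back empty.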

Where can I find the code for 'salespricehtml' in NetSuite?

Where can I find the code for 'salespricehtml' or 'addtocarthtml' in NetSuite?
I am trying to add Schema.org Microdata on my website's product pages, especially for the price, but the quantity amounts and prices are being displayed on the website using the tag/name 'salespricehtml' like so <%=getCurrentAttribute('item','salespricehtml')%>. If you can help, it would be greatly appreciated! :)

Here is my code:
<div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<h3>
<span itemprop="priceCurrency" content="AUD">$</span><span itemprop="price" content="<%=getCurrentAttribute("item","salesPrice")%>">
<script language="javascript">
salesPriceRaw('<%=getCurrentAttribute("item","salesPrice")%>');
</script>
</span>
<small>
<script language="javascript">
gstHtml('<%=getCurrentAttribute("item","salestaxcode")%>');
</script>
</small>
<span itemprop="availability" content="In stock"></span>
</h3>
</div>
With the following script in the head to format tax info and comply with microdata standards as closely as possible:
<script type="text/javascript">
function salesPriceRaw (salesPrice) {
var price = salesPrice.replace('$', '');
document.write(price);
}
function gstHtml (gst) {
if (gst == 'GST:TS-AU') {
document.write(' + GST ');
} else {
document.write('(GST Exempt) ');
}
}
</script>
It's a hack really, but it works. In essence you should remove the html part from your getCurrentAttribute tag. Hope this helps!

vb.net split help required

I'm currently trying to grab an avatar from an HTML page source. The problem is that there are several img sources and containers that have the same name. Here's the part I need:
</div>
<div class="content no_margin">
<img src="http://www.gravatar.com/avatar/4787d9302360d807f3e6f94125f7754c?&d=mm&r=g&s=250" /><br />
<br />
<a class="link" href="http://sharefa.st/user/donkey">Uploads</a><br />
<a class="link" href="http://sharefa.st/user/donkey/favorites">Favorites</a><br />
</div>
</div>
<div id="content" class="left">
<div class="header">
Uploads
</div>
<div class="content no_margin">
<div class="profile_box">
<div class="profile_info">
The part I need to grab is this image:
<img src="http://www.gravatar.com/avatar/4787d9302360d807f3e6f94125f7754c?&d=mm&r=g&s=250" /><br />
Any help would be appreciated!

Try:
Dim wb As New WebBrowser
wb.Navigate("")
Do While wb.ReadyState <> WebBrowserReadyState.Complete
    Application.DoEvents()
Loop
wb.DocumentText = HtmlString 'Your Html
For Each img As HtmlElement In wb.Document.GetElementsByTagName("img")
    ' Contains returns a Boolean, so this also compiles with Option Strict On.
    If img.GetAttribute("src").Contains("avatar") Then
        MsgBox(img.GetAttribute("src"))
    End If
Next

You appear to be attempting to parse HTML 'by hand'. Please don't.
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
See this question for some alternatives: How do you parse an HTML in vb.net
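As a rough illustration of the parser-based approach, here is a sketch in Python using only the standard library. It walks the img tags and keeps those whose src contains a marker string. AvatarFinder and its marker parameter are names invented for this example, and the markup is a trimmed copy of the question's HTML.

```python
from html.parser import HTMLParser


class AvatarFinder(HTMLParser):
    """Collect the src of every <img> whose URL contains a marker string."""

    def __init__(self, marker="avatar"):
        super().__init__()
        self.marker = marker
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are also routed here.
        if tag == "img":
            src = dict(attrs).get("src") or ""
            if self.marker in src:
                self.sources.append(src)


html_source = '''<div class="content no_margin">
<img src="http://www.gravatar.com/avatar/4787d9302360d807f3e6f94125f7754c?&d=mm&r=g&s=250" /><br />
<a class="link" href="http://sharefa.st/user/donkey">Uploads</a><br />
</div>'''

finder = AvatarFinder()
finder.feed(html_source)
print(finder.sources)
```

Because the match is on the tag and attribute rather than on surrounding container names, it does not matter that several div elements share the same class.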

Use a regular expression to find what you're looking for:
http://msdn.microsoft.com/en-us/library/twcw2f1c.aspx
The example code demonstrates pretty much your scenario.
