lxml cuts text at the first nested tag

Please have a look at this code:
# -*- coding: utf-8 -*-
from lxml import etree
html_fragment = "<body><p>This is html, you can <a href='wikpedia'>learn more</a> on the wikipedia page</p></body>"
tree = etree.fromstring(html_fragment, etree.HTMLParser())
for x in tree.findall(".//p"):
    print(x.text)
This prints:
This is html, you can
It cuts off the text before the a tag. How can I get all the text of the p tag?

Found the solution: use .text_content() instead of .text (see the official lxml documentation).
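For reference, a minimal sketch of the fix. Note that .text_content() is available on elements parsed via lxml.html; on plain etree elements, ''.join(x.itertext()) gives the same result:

# -*- coding: utf-8 -*-
from lxml import html

html_fragment = "<body><p>This is html, you can <a href='wikpedia'>learn more</a> on the wikipedia page</p></body>"
tree = html.fromstring(html_fragment)
for x in tree.findall(".//p"):
    # .text stops at the first nested tag; .text_content() gathers the
    # text of this element and all of its descendants
    print(x.text_content())

This prints:
This is html, you can learn more on the wikipedia page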

Related

issue with parsing wiki.js webpage's HTML content using beautifulsoup

I am using the beautifulsoup Python module to parse the HTML content of a webpage built with wiki.js. However, I am having trouble extracting the text of the header and paragraph tags.
I have tried the .getText() method and the .text property, but wasn't able to extract the text from the header/paragraph tags.
Below is the code snippet for reference:
import requests
from bs4 import BeautifulSoup
# a random webpage built using wiki.js
url = "https://brgswiki.org/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
heading_tags = ["h1","h2"]
for tags in soup.find_all(heading_tags):
    print("=============================================")
    print(f"complete Header Tag with the text:\n{tags}")
    print("=============================================")
    print("just header tag_name and header text_content")
    print(tags.name + ' -> ' + tags.text.strip())
And here's the output:
=============================================
complete Header Tag with the text:
<h2 class="toc-header" id="subscribe-to-our-new-newsletter"><a class="toc-anchor" href="#subscribe-to-our-new-newsletter">¶</a> <em>Subscribe to our new newsletter!</em></h2>
=============================================
just header tag_name and header text_content
h2 ->
As you can see in the output, the h2 tag's text ("Subscribe to our new newsletter!") is not being extracted.
I only see this issue with webpages built on wiki.js; other webpages work just fine.
Any suggestion/guidance on how to get around this issue is appreciated.
Thank you.
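One thing worth trying (a hedged sketch, not verified against the live site): Python's built-in "html.parser" can mis-nest slightly malformed markup, which in turn can leave .text empty, so switching to a more lenient parser such as "html5lib" or "lxml" and using get_text() may recover the header text:

import requests
from bs4 import BeautifulSoup

url = "https://brgswiki.org/"
page = requests.get(url)
# "html5lib" (pip install html5lib) rebuilds the tree the way a browser
# would; "lxml" is a faster alternative
soup = BeautifulSoup(page.content, "html5lib")

for tag in soup.find_all(["h1", "h2"]):
    # get_text() walks all descendants, including the <em> wrapper
    print(tag.name, '->', tag.get_text(" ", strip=True))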

extract text from html string with Scrapy

Here is the html string in question.
<div class="def ddef_d db">a <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/book" title="book">book</a> of grammar <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/rule" title="rules">rules</a>: </div>
With BeautifulSoup, this code
from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltxt, 'lxml')
soup.text
gets me
a book of grammar rules:
which is exactly what I want.
With scrapy, how do I get the same result?
from scrapy import Selector
sel = Selector(text=htmltxt)
sel.css('.ddef_d::text').getall()
this code gets me
['a ', ' of grammar ', ': ']
How should I fix it?
You can use this code to get all the text inside the div and its children:
text = ''.join(sel.css('.ddef_d ::text').getall())
print(text)
Your selector returns only the text nodes directly inside the div, but part of the text is located inside child elements (the a tags); that's why you have to add a space before ::text to include the children's text in the result.
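Put together as a runnable snippet (using the HTML string from the question):

from scrapy import Selector

htmltxt = '<div class="def ddef_d db">a <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/book" title="book">book</a> of grammar <a class="query" href="https://dictionary.cambridge.org/us/dictionary/english/rule" title="rules">rules</a>: </div>'
sel = Selector(text=htmltxt)

# '.ddef_d::text'  -> text nodes directly under the div only
# '.ddef_d ::text' -> text nodes of the div and all of its descendants
text = ''.join(sel.css('.ddef_d ::text').getall())
print(text)  # a book of grammar rules: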

How do I avoid the 'NavigableString' error with BeautifulSoup and get to the text of href?

This is what I have:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
for i in soup.find("div", {"class": "navigation"}):
    print(i)
Currently the print output of "i" is:
<a class="btn btn-primary" href="index.php?page=2">Zur nächsten Seite!</a>
I want to print out the href link "index.php?page=2".
When I try to use BeautifulSoup's "find", "select", or "attrs" methods on "i", I get an error. For instance, with
print(i.attrs["href"])
I get:
AttributeError: 'NavigableString' object has no attribute 'attrs'
How do I avoid the 'NavigableString' error with BeautifulSoup and get the text of href?
The issue is the loop "for i in soup.find(...)": soup.find returns a single Tag, and iterating over a Tag yields its children, which include bare NavigableString nodes (such as the whitespace between elements) that have no attrs. If you're looking for only one element, there's no need to iterate at all, and if you're looking for multiple elements, find_all instead of find would probably match the intent.
More concretely, here are the two approaches. Beyond what's been mentioned above, note that the matched div contains the desired a as a child, so we need an extra step to reach it (this could be more direct with an XPath).
import requests
from bs4 import BeautifulSoup
url = "http://python.beispiel.programmierenlernen.io/index.php"
doc = requests.get(url).content
soup = BeautifulSoup(doc, "html.parser")
for i in soup.find_all("div", {"class": "navigation"}):
    print(i.find("a", href=True)["href"])

print(soup.find("div", {"class": "navigation"})
      .find("a", href=True)["href"])
Output:
index.php?page=2
index.php?page=2

How to get the "none display" html from selenium

I'm trying to scrape some content using selenium, but I can't get the content of the "display: none" part. I tried get_attribute('innerHTML') but it still doesn't work as expected.
Hope if you could share some knowledge.
Here is the HTML: https://i.stack.imgur.com/LdDL4.png
# -*- coding: utf-8 -*-
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import re
from pyvirtualdisplay import Display
from lxml import etree
driver = webdriver.PhantomJS()
driver.get('http://flights.ctrip.com/')
driver.maximize_window()
time.sleep(1)
element_time = driver.find_element_by_id('DepartDate1TextBox')
element_time.clear()
element_time.send_keys(u'2017-10-22')
element_arr = driver.find_element_by_id('ArriveCity1TextBox')
element_arr.clear()
element_arr.send_keys(u'北京')
element_depart = driver.find_element_by_id('DepartCity1TextBox')
element_depart.clear()
element_depart.send_keys(u'南京')
driver.find_element_by_id('search_btn').click()
time.sleep(1)
print(driver.current_url)
driver.find_element_by_id('btnReSearch').click()
print(driver.current_url)
overlay = driver.find_element_by_id("mask_loading")
print(driver.execute_script("return arguments[0].getAttribute('style')", overlay))
driver.quit()
To retrieve the value of the "display" CSS property you can use the following lines of code (note that display lives in CSS, not in an HTML attribute, so getCssValue is the call to use):
String my_display = driver.findElement(By.id("mask_loading")).getCssValue("display");
System.out.println("Display is set to : " + my_display);
If an element's style attribute has the value display:none, then it is a hidden element. Selenium basically doesn't interact with hidden elements; you have to go through Selenium's JavaScript executor to interact with them. You can get the style value as shown below.
WebElement overlay = driver.findElement(By.id("mask_loading"));
JavascriptExecutor je = (JavascriptExecutor) driver;
String style = (String) je.executeScript("return arguments[0].getAttribute('style');", overlay);
System.out.println("style value of the element is " + style);
It prints the value "z-index: 12;display: none;"
Or, if you want to get the innerHTML:
String innerHTML = (String) je.executeScript("return arguments[0].innerHTML;", overlay);
In Python:
overlay = driver.find_element_by_id("mask_loading")
style = driver.execute_script("return arguments[0].getAttribute('style');", overlay)
or
innerHTML = driver.execute_script("return arguments[0].innerHTML;", overlay)
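Putting the Python version together as a self-contained sketch (assuming a local Chrome driver and the element id from the question; PhantomJS has since been deprecated, so any driver will do):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://flights.ctrip.com/')

overlay = driver.find_element_by_id("mask_loading")

# Selenium's .text skips hidden elements, but JavaScript can still read them
style = driver.execute_script("return arguments[0].getAttribute('style');", overlay)
inner_html = driver.execute_script("return arguments[0].innerHTML;", overlay)
text = driver.execute_script("return arguments[0].textContent;", overlay)

print(style)  # e.g. "z-index: 12;display: none;"
print(inner_html)
driver.quit()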

Why is this BeautifulSoup result []?

I want to get the text in the span. I have checked it, but I don't see the problem
from bs4 import BeautifulSoup
import urllib.request
import socket
searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
f = urllib.request.urlopen(searchurl)
html = f.read()
soup = BeautifulSoup(html)
print(soup.findAll('span',attrs={'class': 'b'}))
The result was [], why?
Looking at the site in question, your search turns up an empty list because there are no spans with a class value of b. BeautifulSoup does not propagate down the CSS the way a browser would. In addition, your urllib request looks incorrect. Looking at the site, I think you want to grab all the spans with a class of label, though it's hard to tell when the site isn't in my native language. Here's how you would go about it:
from bs4 import BeautifulSoup
import urllib2 # Note urllib2
searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
f = urllib2.urlopen(searchurl) # Note no need for request
html = f.read()
soup = BeautifulSoup(html)
for s in soup.findAll('span', attrs={"class": "label"}):
    print s.text
This gives for the url listed:
Farbe:
Kraftstoffverbr. komb.:
Kraftstoffverbr. innerorts:
Kraftstoffverbr. außerorts:
CO²-Emissionen komb.:
Zugr.-lgd. Treibstoffart:
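For completeness, since the question itself is written for Python 3, the same approach with urllib.request would look like this (a sketch; the "label" class is taken from the answer above and may have changed on the live site):

from bs4 import BeautifulSoup
import urllib.request

searchurl = "http://suchen.mobile.de/auto/search.html?scopeId=C&isSearchRequest=true&sortOption.sortBy=price.consumerGrossEuro"
html = urllib.request.urlopen(searchurl).read()
soup = BeautifulSoup(html, "html.parser")

for s in soup.find_all('span', attrs={"class": "label"}):
    print(s.text)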