beautifulsoup: get text (including html tags) between two different tags (</h3> and <h2>) - beautifulsoup

I am trying to scrape an HTML file structured as follows using BeautifulSoup. Basically, each unit consists of:
one <h2></h2>
one <h3></h3>
one or more <p></p>
Something like the following:
<h2>January, 2020</h2>
<h3>facility</h3>
<p>text1-1</p>
<p>text1-2</p>
<h2>April, 2020</h2>
<h3>scientists</h3>
<p>text2-1</p>
<p>text2-2</p>
<h2>June, 2020</h2>
<h3>lawyers</h3>
<p>text3-1</p>
<h2>.....
I want to get text including the <p> tags between </h3> and the next <h2>. The result should be:
for row #1:
<p>text1-1</p>
<p>text1-2</p>
for row #2:
<p>text2-1</p>
<p>text2-2</p>
for row #3:
<p>text3-1</p>
Here is what I tried so far:
num_h2 = len(soup.find_all('h2'))
for i in range(0, num_h2):
    print('---------')
    print(i)
    p_string = ''
    sibling = soup.find_all('h3')[i].find_next_sibling('p').getText()
    if sibling:
        p_string += sibling
    else:
        break
    print(p_string)
The problem with this solution is that it only shows the content of the first <p> under each unit. I do not know how to find out how many <p> elements there are so that I can loop over them. Also, is there a better way to do this than using find_next_sibling()?

Maybe CSS selectors can help:
for s in soup.select('h3'):
    # Walk the siblings that follow each <h3> and stop at the next <h2>
    for ns in s.find_next_siblings():
        if ns.name == "h2":
            break
        if ns.name == "p":
            print(ns)
Output:
<p>text1-1</p>
<p>text1-2</p>
<p>text2-1</p>
<p>text2-2</p>
<p>text3-1</p>
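The loop above prints every paragraph in one flat stream. To keep the paragraphs grouped per unit, as the question asks, the same sibling walk can collect one list per <h3>. This is a minimal sketch, assuming the headings and paragraphs are all siblings as in the sample; the function name paragraphs_by_unit and the variable rows are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<h2>January, 2020</h2>
<h3>facility</h3>
<p>text1-1</p>
<p>text1-2</p>
<h2>April, 2020</h2>
<h3>scientists</h3>
<p>text2-1</p>
<p>text2-2</p>
<h2>June, 2020</h2>
<h3>lawyers</h3>
<p>text3-1</p>
"""


def paragraphs_by_unit(soup):
    """Return one list of <p> markup strings per <h3>, stopping at the next <h2>."""
    rows = []
    for h3 in soup.find_all("h3"):
        row = []
        # find_next_siblings() with no arguments yields only tags, in document order
        for sibling in h3.find_next_siblings():
            if sibling.name == "h2":
                break  # the next unit starts here
            if sibling.name == "p":
                row.append(str(sibling))  # str() keeps the <p> tags
        rows.append(row)
    return rows


soup = BeautifulSoup(html, "html.parser")
rows = paragraphs_by_unit(soup)
for number, row in enumerate(rows, start=1):
    print(f"row #{number}:")
    print("\n".join(row))
```

Because each row is a separate list, there is no need to know in advance how many <p> elements a unit contains.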

Related

Razor-Pages - HTML markup

Following the answer on this post I was using @:<div class="col-md-6">
Problems
The formatting gets all messed up:
The IntelliCode gets all messed up (I only deleted the paragraph on the second @:)
Also, my if (i == (Model.CellsNotStarted.Count - 1) && i % 2 != 0) won't work when using the @:; it always goes to the else branch. If I don't use the @: markup and instead put all my HTML inside an if/else, rather than only the <div> tag, it works with no problems.
My question is:
Is the @: markup viable? If so, what am I doing wrong? If not, what is the alternative (one that does not require putting my entire code in an if/else)?
PS: I know I should not post images of code, but I wanted to show the format and syntax errors.
There are neater ways to manage conditions within your Razor code. It's difficult to provide a full example because you don't provide the full code, but you can do this, for example:
<div class="col-md-@(i + 1 == Model.CellsNotStarted.Count && i % 2 != 0 ? "12" : "6")">
Whenever you are tempted to use a remainder operator (%) in Razor, you should also consider whether a GroupBy would be better. It most often is, as I was shown in this question: Building tables with WebMatrix
You can try to use JavaScript to set the HTML. Here is a demo:
<div id="myDiv"></div>
<script>
    $(function () {
        var html = "";
        if ("@(Model.CellsNotStarted != null)" == "True")
        {
            for (var i = 0; i < @Model.CellsNotStarted.Count; i++)
            {
                if (i == (@Model.CellsNotStarted.Count - 1) && i % 2 != 0)
                {
                    html += '<div class="col-md-12">';
                }
                else
                {
                    html += '<div class="col-md-6">';
                }
                html += i + '</div>';
            }
        }
        $("#myDiv").html(html);
    })
</script>

beautifulsoup: find elements after certain element, not necessarily siblings or children

Example html:
<div>
<p>p1</p>
<p>p2</p>
<p>p3<span id="target">starting from here</span></p>
<p>p4</p>
</div>
<div>
<p>p5</p>
<p>p6</p>
</div>
<p>p7</p>
I want to find <p> elements, but only those positioned after span#target.
It should return p4, p5, p6 and p7 in the above example.
I tried to get all <p> elements first and then filter them, but I don't know how to judge whether an element comes after span#target.
You can do this by using the find_all_next function in BeautifulSoup.
from bs4 import BeautifulSoup
doc = ...  # placeholder: read the HTML string here
# Parse the HTML
soup = BeautifulSoup(doc, 'html.parser')
# Select the first element you want to use as the reference
span = soup.select("span#target")[0]
# Find all <p> elements that come after the `span` element in the document
print(span.find_all_next("p"))
The above snippet will result in
[<p>p4</p>, <p>p5</p>, <p>p6</p>, <p>p7</p>]
Edit: as per the OP's request below to compare positions:
If you want to compare the positions of two elements, you'll have to rely on the sourceline and sourcepos attributes provided by the html.parser and html5lib parsers.
First, store the sourceline and/or sourcepos of your reference element in a variable.
span_srcline = span.sourceline
span_srcpos = span.sourcepos
(You don't actually have to store them, though; you can use span.sourcepos directly as long as you have the span stored.)
Now iterate through the result of find_all_next and compare the values:
for tag in span.find_all_next("p"):
    print(f'line diff: {tag.sourceline - span_srcline}, pos diff: {tag.sourcepos - span_srcpos}, tag: {tag}')
You're most likely interested in line numbers, though, as sourcepos denotes the position within a line.
However, sourceline and sourcepos mean slightly different things for each parser. Check the docs for details.
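To answer the "is this element after that one" question directly, the two attributes can be compared as a (line, column) pair. This is a rough sketch, assuming html.parser and a Beautiful Soup version recent enough to record source positions (4.8.1 or later, to my knowledge); the helper names source_position and starts_after are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
<p>p1</p>
<p>p2</p>
<p>p3<span id="target">starting from here</span></p>
<p>p4</p>
</div>
<div>
<p>p5</p>
<p>p6</p>
</div>
<p>p7</p>
"""


def source_position(tag):
    """Return the (line, column) pair the parser recorded for the tag."""
    return (tag.sourceline, tag.sourcepos)


def starts_after(tag, reference):
    """True if `tag` opens later in the source than `reference`."""
    # Tuples compare line first, then column, so tags on the same line work too
    return source_position(tag) > source_position(reference)


soup = BeautifulSoup(html_doc, "html.parser")
target = soup.find(id="target")
later_paragraphs = [p for p in soup.find_all("p") if starts_after(p, target)]
print([p.get_text() for p in later_paragraphs])
```

Comparing the pair rather than the line alone matters for p3, which opens on the same line as the span but before it, so it is filtered out.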
Try this
html_doc = """
<div>
<p>p1</p>
<p>p2</p>
<p>p3<span id="target">starting from here</span></p>
<p>p4</p>
</div>
<div>
<p>p5</p>
<p>p6</p>
</div>
<p>p7</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find(id="target").findNext('p').contents[0])
Result
p4
try
span = soup.select("span > #target > p")

How to access the details (i.e sub fields) of second/third elements that has the same class name in selenium using python?

I have 3 elements with the same class (e.g. there are 3 <div class="sc-1xo2hia-0 TegxE"> under each <div direction="vertical" class="sc-1fp9csv-0 iFnncD"> on the website https://www.blockchain.com/btc/block/00000000000000000004b91bad9ecfa8c0e57c256d0007cca6f0a2a9e54a2ccc ; click 'Inspect element' on the first transaction to view the specific DOM tree).
Now I want to access some elements from the 2nd and 3rd instances of the first tag (sc-1xo2hia-0 TegxE).
How do I do this efficiently?
PS: This code:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.blockchain.com/btc/block/00000000000000000004b91bad9ecfa8c0e57c256d0007cca6f0a2a9e54a2ccc')
Txn_elements = driver.find_elements_by_xpath('//div[@class="sc-1fp9csv-0 iFnncD"]')
length = len(Txn_elements)
for i in range(0, length):
    element = Txn_elements[i]
    data = element.find_elements_by_xpath(".//div[@class='sc-1xo2hia-0 TegxE'][1]")
    print(data[0].text)
still prints details of the 0th <div class="sc-1xo2hia-0 TegxE"> only
i.e. it still prints:
Hash
fc1630ec40d95da3fcca40d499c4be616ea6591dda6f0d3d85a678d47c91ae62
2019-11-06 8:37 PM
whereas it should have printed:
17A16QmavnUfCW11DAApiJxp7ARnxN5pGX
2.62352930 BTC
xpath = (.//div[@class='ge5wha-0 bLrlXr']/a)[1] // to get 17A16QmavnUfCW11DAApiJxp7ARnxN5pGX
xpath = (.//div[@class='ge5wha-1 bWdiuU']/span)[1] // to get 2.62352930 BTC
Try these XPaths.
Please check the solution below. It works, but I am not sure why you are using a for loop if you just want to print two elements in the //div[@class="sc-1fp9csv-0 iFnncD"].
If you want to print only one, remove the for loop and run your code.
driver.get('https://www.blockchain.com/btc/block/00000000000000000004b91bad9ecfa8c0e57c256d0007cca6f0a2a9e54a2ccc')
Txn_elements = driver.find_elements_by_xpath('//div[@class="sc-1fp9csv-0 iFnncD"]')
length = len(Txn_elements)
for i in range(0, length):
    element = Txn_elements[i]
    data = element.find_elements_by_xpath("//body/div[@id='__next']/div[@class='sc-1myx216-0 iygrgv']/div[@class='p5q4id-0 fasJHc sc-5vnaz6-1 doVOgS']/div[@class='fieq4h-0 klQmUt']/div[@class='xoxfsb-0 bmukdK']/div[3]/div[2]/div[1]/div[2]/div[1]/div[1]/div[1]/a[1]")
    print(data[0].text)
    data1 = element.find_elements_by_xpath("//body/div[@id='__next']/div[@class='sc-1myx216-0 iygrgv']/div[@class='p5q4id-0 fasJHc sc-5vnaz6-1 doVOgS']/div[@class='fieq4h-0 klQmUt']/div[@class='xoxfsb-0 bmukdK']/div[3]/div[2]/div[1]/div[2]/div[1]/div[1]/div[1]/div[1]/span")
    print(data1[0].text)
Try the solution below for the transaction ID:
driver.get('https://www.blockchain.com/btc/block/00000000000000000004b91bad9ecfa8c0e57c256d0007cca6f0a2a9e54a2ccc')
List1 = driver.find_elements_by_xpath("//body/div[@id='__next']/div[@class='sc-1myx216-0 iygrgv']/div[@class='p5q4id-0 fasJHc sc-5vnaz6-1 doVOgS']/div[@class='fieq4h-0 klQmUt']/div[@class='xoxfsb-0 bmukdK']/div[*]/div[2]/div[1]/div[2]/div/div/div/a")
for items in List1:
    print(items.text)
List2 = driver.find_elements_by_xpath("//body/div[@id='__next']/div[@class='sc-1myx216-0 iygrgv']/div[@class='p5q4id-0 fasJHc sc-5vnaz6-1 doVOgS']/div[@class='fieq4h-0 klQmUt']/div[@class='xoxfsb-0 bmukdK']/div[*]/div[2]/div[1]/div[2]/div/div/div/div/span")
for items in List2:
    print(items.text)

Extract text from p only if preceding header exists using Beautifulsoup

I want to extract the text of paragraph elements using BeautifulSoup.
The HTML looks something like this:
<span class="span_class">
<h1>heading1</h1>
<p>para1</p>
<h1>heading 2</h1>
<p>para2</p>
</span>
I want to extract the text from the first p only if an h1 precedes it, and so on for the others.
So far I have tried:
x=soup.findAll('span',{'class':'span_class'})
y=x.findAll('p')[0].text
But it does not work.
You can use the CSS adjacent sibling selector here:
paragraphs = x.select('h1 + p')
# `paragraphs` now contains two elements: <p>para1</p> and <p>para2</p>
This will select only those P elements that have immediate H1 siblings before them.
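As a self-contained illustration of the adjacent sibling selector, here is a minimal sketch that runs the selector on the whole soup instead of on x (in the question, x is the list returned by findAll, which has no select method). The extra <p>para3</p> is an assumption added to the sample to show that a paragraph not directly preceded by an <h1> is skipped.

```python
from bs4 import BeautifulSoup

html = """
<span class="span_class">
<h1>heading1</h1>
<p>para1</p>
<h1>heading 2</h1>
<p>para2</p>
<p>para3</p>
</span>
"""

soup = BeautifulSoup(html, "html.parser")
# `h1 + p` matches a <p> only when the element right before it is an <h1>;
# the `span.span_class` prefix limits the search to the span from the question
paragraphs = soup.select("span.span_class h1 + p")
texts = [p.get_text() for p in paragraphs]
print(texts)
```

Whitespace between the tags does not get in the way, because the selector only considers elements, not text nodes.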
If you want to do some more logic based on H1 content, you can do this:
for p in x.select('h1:first-child + p'):
    # `p` is the element that has an `H1` right before it.
    # `p.find_previous_sibling('h1')` returns that `H1`, skipping whitespace text nodes.
    h1 = p.find_previous_sibling('h1')
    if h1.text == 'heading1':
        # We got the `P` that has `H1` with content `"heading1"` before it.
        print(p, h1)
from bs4 import BeautifulSoup as bs
html = '''<html>
<body>
<span class='span_class'>
<h1>heading1</h1>
<p>content1</p>
<p>content2</p>
<h1>heading2</h1>
<p>content3</p>
</span>
</body>
</html>'''
soup = bs(html, 'lxml')
x = soup.find_all('span', {'class': 'span_class'})  # find every matching span
for y in x:
    heading = y.find_all('h1')  # find each h1 inside the span
    for something in heading:  # runs only if an h1 exists
        if something.text == 'heading1':
            print(something.text)  # print the h1 text
            p = something.find_next('p')  # next <p> after the h1, or None
            if p is not None:
                print(p)
Is this what you are looking for? It looks for your <span> and then for every <h1> inside it. For an <h1> whose text is 'heading1', it looks for the next <p>.

httpagility pack scraping between broken tag

I need to scrape a p tag that is followed by an h3 tag but has no closing p tag. It looks like this:
<script ad>asdasdasd</script>
<p>Translation companies are
-----------------------
-----------------------
<h3 class="this_class">mind blown site</h3>
There is no </p> tag, so I cannot parse it completely. Now I have a few questions:
1) Can this be parsed using HtmlAgilityPack XPath?
2) I have a function to find text between two strings (getbetween). But I have a doubt: if I use "asdasdasd" and " is it 100% certain that VB.NET will use the script tag just above the h3, given that there are 2-3 identical "asdasdasd" lines?
3) Is there any other method you are aware of?
(I had to write this as code so the HTML does not get messed up.)
Regards,
It might be a good idea to post some more "real" html to really help you, at least the tags between the h3 and the p.
Anyway, this should get you the p-Tag from the h3-Tag.
HtmlDocument doc = new HtmlDocument();
doc.Load(... //Load the Html...
//Either of these lines will do
HtmlNode pNode = doc.DocumentNode.SelectSingleNode("//h3[@class='this_class']/preceding-sibling::p");
//HtmlNode pNode = doc.DocumentNode.SelectSingleNode("//h3[contains(text(),'mind blown site')]/preceding-sibling::p");
string pInnerHtml = pNode.NextSibling.InnerHtml; //Has the text "Translation companies are...."
So in general, to get all the nodes from the opening p tag to the start of a tag you don't want, you could do this:
var p = doc.DocumentNode.SelectSingleNode("//p");
var h3 = p.SelectSingleNode("following-sibling::h3[@class='this_class']");
var following = new List<string>();
for (var current = p.NextSibling; current != h3; current = current.NextSibling)
{
    following.Add(current.InnerText);
}
var innerText = String.Concat(following);